Understanding Outputs

How to read and interpret KAST results.


Output Folder Structure

All results are saved inside your active workspace folder (workspaces/<your_workspace>/):

workspaces/<your_workspace>/
├── prepared_data/
│   └── invalid_smiles_report.txt        # Audit report of rejected SMILES
├── 01_train_set.csv                     # Training data
├── 01_test_set.csv                      # Test data
├── trained_model/                       # Trained neural network files
├── 4_0_evaluation_report.txt           # Main metrics
├── 4_1_cross_validation_results.txt    # Cross-val scores
├── 4_2_enrichment_factor_results.txt   # Enrichment analysis
├── 4_3_tanimoto_similarity_results.txt # Similarity analysis
├── 4_4_learning_curve_results.txt      # Learning curve data
├── <custom_filename>.csv     # Prediction results (Default: 05_new_molecule_predictions.csv)
├── 4_0_roc_curve.png
├── 4_2_enrichment_curve.png
├── 4_3_tanimoto_similarity_histogram.png
├── 4_4_learning_curve.png
└── logs/
    └── kast_20251028.log               # Detailed execution log

CSV Files

01_train_set.csv & 01_test_set.csv

Columns:

  • SMILES — Molecular structure

  • Label — 1 (active) or 0 (inactive)

  • Name — Compound name (if provided)

Example:

SMILES,Label,Name
CC(C)Cc1ccc(cc1)C(C)C(O)=O,1,ibuprofen
CN1C=NC2=C1C(=O)N(C(=O)N2C)C,1,caffeine
CCCCCCCCCCCCCCCC,0,hexadecane

<custom_filename>.csv (Default: 05_new_molecule_predictions.csv)

Columns:

  • SMILES — Molecular structure

  • K-Score — Prediction score (0.0-1.0)

  • Predicted_Class — “Active” or “Inactive”

Example:

SMILES,K-Score,Predicted_Class
CCc1ccccc1O,0.94,Active
Cc1ccccc1C,0.92,Active
CCCc1ccccc1O,0.87,Active
Cc1ccc(cc1)C,0.45,Inactive
CCc1ccccc1,0.22,Inactive

Interpretation:

  • K-Score 0.9-1.0 → Very likely active

  • K-Score 0.7-0.9 → Likely active

  • K-Score 0.5-0.7 → Uncertain

  • K-Score 0.0-0.5 → Likely inactive

The K-Score is the predicted probability of the active class, P(active). In virtual screening workflows, such scores are used mainly to rank and prioritize compounds rather than as absolute probability estimates, because discriminative power generally matters more than probability calibration for hit selection (Truchon & Bayly, 2007).
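
Because the scores are mainly used for ranking, a typical first step is to sort the prediction file by K-Score and attach the interpretation bands listed above. The following is a minimal sketch using only the Python standard library. The function names `kscore_band` and `rank_predictions` are hypothetical, and the handling of boundary values (for example, 0.9 counted as "Very likely active") is an assumption, not documented KAST behavior.

```python
import csv
import io

# Example rows in the prediction file format shown above (order shuffled).
EXAMPLE_CSV = """SMILES,K-Score,Predicted_Class
CCc1ccccc1,0.22,Inactive
CCc1ccccc1O,0.94,Active
Cc1ccc(cc1)C,0.45,Inactive
CCCc1ccccc1O,0.87,Active
Cc1ccccc1C,0.92,Active
"""


def kscore_band(score):
    """Map a K-Score to an interpretation band (boundary handling is assumed)."""
    if score >= 0.9:
        return "Very likely active"
    if score >= 0.7:
        return "Likely active"
    if score >= 0.5:
        return "Uncertain"
    return "Likely inactive"


def rank_predictions(csv_text):
    """Return (SMILES, score, band) tuples sorted by descending K-Score."""
    rows = csv.DictReader(io.StringIO(csv_text))
    scored = [(row["SMILES"], float(row["K-Score"])) for row in rows]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(smiles, score, kscore_band(score)) for smiles, score in scored]


ranked = rank_predictions(EXAMPLE_CSV)
for smiles, score, band in ranked:
    print(f"{smiles:<15} {score:.2f}  {band}")
```

To apply this to a real run, read the text of your prediction file and pass it to `rank_predictions` in place of `EXAMPLE_CSV`.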


Metrics Files

4_0_evaluation_report.txt

Main evaluation metrics on the test set:

ROC-AUC Score: 0.87
Accuracy: 0.85
Sensitivity (Recall): 0.82
Specificity: 0.88
Precision: 0.86
F1-Score: 0.84

Interpretation:

Metric      | What It Means                                    | Good Value
ROC-AUC     | Overall ranking performance (0-1, 0.5 = random)  | > 0.8
Accuracy    | % of correct predictions                         | > 80%
Sensitivity | % of actives found                               | > 80%
Specificity | % of inactives correctly rejected                | > 80%
Precision   | % of predicted actives that are truly active     | > 80%
F1-Score    | Harmonic mean of precision and recall            | > 0.8

The ROC-AUC is the recommended primary metric for evaluating binary classifiers in bioactivity prediction, as it is threshold-independent and robust to class imbalance (Hanley & McNeil, 1982). For imbalanced chemical datasets, F1-Score and Sensitivity are particularly important as complementary metrics (Jiang et al., 2025).
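
The threshold-dependent metrics in the report all follow from the four confusion-matrix counts. The sketch below applies the standard definitions. The function name is hypothetical and the counts are invented for illustration; they are not KAST output.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion counts.

    tp: actives predicted active      fp: inactives predicted active
    tn: inactives predicted inactive  fn: actives predicted inactive
    """
    sensitivity = tp / (tp + fn)   # share of true actives that were found
    specificity = tn / (tn + fp)   # share of true inactives that were rejected
    precision = tp / (tp + fp)     # share of predicted actives that are active
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Illustrative counts only: 100 actives and 100 inactives in the test set.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that precision depends on the class ratio of the test set, so the same model can show a much lower precision on a library where actives are rare.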


4_1_cross_validation_results.txt

5-fold cross-validation scores:

Fold 1: AUC=0.85, Accuracy=0.84
Fold 2: AUC=0.86, Accuracy=0.85
Fold 3: AUC=0.87, Accuracy=0.86
Fold 4: AUC=0.88, Accuracy=0.87
Fold 5: AUC=0.86, Accuracy=0.85

Mean AUC: 0.864 ± 0.011
Mean Accuracy: 0.854 ± 0.011

Interpretation:

  • Low variation (±0.01) → Model is stable

  • High variation (±0.1) → Model is unstable

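  • If CV score << test score → The single test split may be unrepresentative (an easy split or data leakage); rely on the CV estimate

  • If training score >> CV score → Overfitting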

Cross-validation provides a less biased estimate of generalization performance than a single train/test split. In QSAR modeling, k-fold cross-validation is considered essential to assess model robustness and detect overfitting (Tropsha, 2010).
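
The summary lines can be reproduced from the per-fold scores. The sketch below uses the fold values from the example above. The ± values shown there are consistent with the sample standard deviation (n - 1 in the denominator); that KAST computes them this way is an assumption.

```python
import statistics

# Per-fold scores copied from the example report above.
fold_auc = [0.85, 0.86, 0.87, 0.88, 0.86]
fold_acc = [0.84, 0.85, 0.86, 0.87, 0.85]


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


auc_mean, auc_std = summarize(fold_auc)
acc_mean, acc_std = summarize(fold_acc)

print(f"Mean AUC: {auc_mean:.3f} ± {auc_std:.3f}")       # Mean AUC: 0.864 ± 0.011
print(f"Mean Accuracy: {acc_mean:.3f} ± {acc_std:.3f}")  # Mean Accuracy: 0.854 ± 0.011
```

With only five folds the standard deviation is itself a rough estimate, so small differences in the ± value between runs should not be over-interpreted.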


4_2_enrichment_factor_results.txt

How much better than random screening:

Enrichment Factor at 10%: 3.2x
Enrichment Factor at 20%: 2.1x
Enrichment Factor at 50%: 1.5x

Interpretation:

  • EF at 10% = 3.2x → The top-ranked 10% of the library contains 3.2 times as many actives as a random 10% sample would

  • Higher EF = better virtual screening tool

The Enrichment Factor (EF) at a given percentage quantifies the ability of a model to concentrate actives in the top-ranked fraction of a screened library relative to random selection. It is one of the most widely used metrics to evaluate practical virtual screening performance (Truchon & Bayly, 2007).
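
Concretely, EF at x% is the hit rate among the top-ranked x% divided by the hit rate of the whole library. The following sketch shows that calculation on a toy ranked list. The function name, the toy labels, and the rounding of the top-fraction size are assumptions made for illustration.

```python
def enrichment_factor(labels_ranked, fraction):
    """EF = hit rate in the top fraction / hit rate in the whole library.

    labels_ranked: 1 (active) or 0 (inactive), ordered by descending score.
    fraction: screened fraction of the library, e.g. 0.10 for the top 10%.
    """
    n_total = len(labels_ranked)
    n_top = max(1, round(n_total * fraction))  # rounding rule is an assumption
    hits_top = sum(labels_ranked[:n_top])
    hits_total = sum(labels_ranked)
    if hits_total == 0:
        raise ValueError("EF is undefined when the library has no actives")
    return (hits_top / n_top) / (hits_total / n_total)


# Toy ranked library: 20 compounds, 5 actives, most of them ranked early.
ranked_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0,
                 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

ef10 = enrichment_factor(ranked_labels, 0.10)  # top 2 compounds, both active
ef50 = enrichment_factor(ranked_labels, 0.50)  # top 10 compounds, 4 active
print(f"EF at 10%: {ef10:.1f}x")
print(f"EF at 50%: {ef50:.1f}x")
```

The maximum possible EF at a given fraction is bounded by 1 / (overall active rate), which is why EF values from datasets with different active ratios are not directly comparable.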


4_3_tanimoto_similarity_results.txt

Molecular diversity metrics:

=== DESCRIPTIVE STATISTICS (TEST SET) ===
Mean similarity: 0.45
Standard deviation: 0.12

=== DESCRIPTIVE STATISTICS (TRAIN SET INTERNAL) ===
Mean similarity: 0.65
Standard deviation: 0.08

Interpretation:

  • Test Mean ≈ Train Mean → Test set is well within the chemical space of the training set.

  • Test Mean << Train Mean → Test set is exploring new chemical space.

Molecular similarity is computed using the Tanimoto coefficient over binary molecular fingerprints. Comparing the internal similarity of the training set with the test-to-training similarity is a common way to assess whether both populations occupy the same chemical space.
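
For two binary fingerprints, the Tanimoto coefficient is the number of bits set in both divided by the number of bits set in either. The sketch below represents each fingerprint as a Python set of on-bit indices. The toy fingerprints are invented, and summarizing test-to-training similarity as the mean nearest-neighbor similarity is an assumption; the aggregation KAST uses is not specified here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(bits_a & bits_b) / union


def mean_nearest_neighbor_similarity(queries, references):
    """Average, over queries, of the highest similarity to any reference."""
    best = [max(tanimoto(q, r) for r in references) for q in queries]
    return sum(best) / len(best)


# Toy fingerprints (sets of on-bit indices), for illustration only.
train_fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 3, 5}]
test_fps = [{1, 2, 3, 4}, {7, 8, 9}]

pair_similarity = tanimoto(train_fps[0], train_fps[1])  # 3 shared / 5 total bits
test_to_train = mean_nearest_neighbor_similarity(test_fps, train_fps)
print(f"Pair similarity: {pair_similarity:.2f}")
print(f"Test-to-train similarity: {test_to_train:.2f}")
```

In practice the bit sets would come from a cheminformatics fingerprinting toolkit rather than being written by hand.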


4_4_learning_curve_results.txt

Model performance as the training set grows, estimated with 5-fold cross-validation:

Training Size | Train AUC         | Val AUC
50           | 0.7011 ± 0.0520   | 0.6800 ± 0.0610
100          | 0.7832 ± 0.0310   | 0.7610 ± 0.0401
200          | 0.8214 ± 0.0210   | 0.8123 ± 0.0305
400          | 0.8500 ± 0.0150   | 0.8410 ± 0.0200
798          | 0.8805 ± 0.0050   | 0.8715 ± 0.0080

Interpretation:

  • Val AUC increasing → Model improves with more data

  • Val AUC plateauing → More data won’t help much

  • High Variance (±) → Model is unstable at that dataset size

Learning curves are a standard diagnostic tool in machine learning to evaluate whether a model would benefit from additional training data or requires architectural changes (Ramsundar et al., 2019).
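
Two quantities make the interpretation above concrete: the validation-AUC gain between consecutive training sizes (a shrinking gain indicates a plateau) and the gap between training and validation AUC. The sketch below computes both from the example table; the function names are hypothetical.

```python
# (training size, train AUC, validation AUC) copied from the example table above.
curve = [
    (50, 0.7011, 0.6800),
    (100, 0.7832, 0.7610),
    (200, 0.8214, 0.8123),
    (400, 0.8500, 0.8410),
    (798, 0.8805, 0.8715),
]


def validation_gains(points):
    """Validation-AUC gain between each training size and the previous one."""
    return [
        (later[0], round(later[2] - earlier[2], 4))
        for earlier, later in zip(points, points[1:])
    ]


def train_val_gaps(points):
    """Difference between training and validation AUC at each size."""
    return [(size, round(train - val, 4)) for size, train, val in points]


gains = validation_gains(curve)
gaps = train_val_gaps(curve)

for size, gain in gains:
    print(f"up to {size:>3} molecules: validation AUC gain {gain:+.4f}")
for size, gap in gaps:
    print(f"{size:>3} molecules: train/validation gap {gap:.4f}")
```

In this example the gain shrinks from 0.081 to about 0.03 per step but is still positive at the largest size, and the gap stays below 0.01 from 200 molecules onward, which suggests the model may still benefit modestly from more data.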


Plots

4_0_roc_curve.png

ROC (Receiver Operating Characteristic) curve showing model discrimination ability.

Interpretation:

  • Curve closer to top-left → Better model

  • Diagonal line → Random classifier

  • Area under curve (AUC) > 0.8 → Good
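
The area under the ROC curve equals the probability that a randomly chosen active receives a higher score than a randomly chosen inactive (Hanley & McNeil, 1982). The sketch below computes AUC directly from that definition by counting correctly ordered pairs. The labels and scores are invented for illustration, and this pairwise loop is a teaching example rather than the routine KAST uses.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (active, inactive) pairs ranked correctly.

    A tie between an active and an inactive counts as half a correct pair.
    """
    actives = [s for y, s in zip(labels, scores) if y == 1]
    inactives = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not inactives:
        raise ValueError("AUC needs at least one active and one inactive")
    correct = 0.0
    for active_score in actives:
        for inactive_score in inactives:
            if active_score > inactive_score:
                correct += 1.0
            elif active_score == inactive_score:
                correct += 0.5
    return correct / (len(actives) * len(inactives))


# Toy data: 3 actives and 3 inactives; one inactive outranks two actives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.94, 0.87, 0.90, 0.60, 0.45, 0.22]

auc = roc_auc(labels, scores)  # 7 of 9 pairs are ordered correctly
print(f"ROC-AUC: {auc:.3f}")
```

The pairwise loop is quadratic in the number of molecules, so it is suited to small illustrations only.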

4_4_learning_curve.png

How training and validation AUC change as the training set grows, with shaded bands showing the standard deviation across the 5 cross-validation folds.

Interpretation:

  • Curves converging → Model has learned most patterns

  • Persistent gap between the curves → More data would likely help

  • Narrow shaded bands → High statistical confidence/stability

4_2_enrichment_curve.png

Virtual screening performance across different screening percentages.

Interpretation:

  • Steep initial slope → Model ranks actives early, which is what virtual screening needs

4_3_tanimoto_similarity_histogram.png

Overlapping histogram of Tanimoto Similarities.

Interpretation:

  • Blue Distribution (Train): Internal similarity of the training set.

  • Red Distribution (Test): Similarity of the test set to the training set.

  • High Overlap: Test set molecules are highly similar to training molecules.

  • Shift to the Left: Test set molecules are structurally novel compared to the training set.

Log File

logs/kast_YYYYMMDD.log

Detailed execution log with timestamps and debug info.

Check the log if:

  • Something fails

  • You need execution details

  • You are debugging an issue


Quality Assessment

Good Results

  • AUC > 0.85

  • Accuracy > 85%

  • CV stability ± < 0.05

  • Learning curve converges

  • Clear enrichment factor (> 2x at 10%)

Acceptable Results

  • AUC 0.75-0.85

  • Accuracy 75-85%

  • CV stability ± 0.05-0.10

  • Model still learning with more data

Poor Results

  • AUC < 0.70

  • Accuracy < 70%

  • High CV variation (± > 0.15)

  • Enrichment factor < 1.5x

  • If you see these, check: data quality, class balance, duplicate molecules


Exporting for Publication

CSV Export

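# Prediction results are already in CSV format.
# Open them in Excel or Python:
import pandas as pd

results = pd.read_csv('workspaces/<your_workspace>/<custom_filename>.csv')
# Sort by K-Score so the first rows are the highest-scoring predictions
top_100 = results.sort_values('K-Score', ascending=False).head(100)
# index=False keeps the pandas row index out of the exported file
top_100.to_csv('top_100_predicted_actives.csv', index=False)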

Plot Export

Plots are saved automatically as PNG files (high resolution, suitable for publications).

Report Generation

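# Combine all text reports into one file
cat workspaces/<your_workspace>/4_0_evaluation_report.txt \
    workspaces/<your_workspace>/4_1_cross_validation_results.txt \
    workspaces/<your_workspace>/4_2_enrichment_factor_results.txt \
    workspaces/<your_workspace>/4_3_tanimoto_similarity_results.txt \
    workspaces/<your_workspace>/4_4_learning_curve_results.txt \
    > publication_report.txt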

Further Reading & Foundations

  • ROC-AUC: Hanley, J.A., & McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36. doi:10.1148/radiology.143.1.7063747

  • Enrichment Factor: Truchon, J.F., & Bayly, C.I. (2007). Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem. Journal of Chemical Information and Modeling, 47(2), 488-508. doi:10.1021/ci600426e

  • Tanimoto Similarity: Willett, P., Barnard, J.M., & Downs, G.M. (1998). Chemical Similarity Searching. Journal of Chemical Information and Computer Sciences, 38(6), 983-996.

  • Cross-Validation in QSAR: Tropsha, A. (2010). Best Practices for QSAR Model Development, Validation, and Exploitation. Molecular Informatics, 29(6-7), 476-488.

  • Imbalanced Learning: Jiang, J., et al. (2025). A review of machine learning methods for imbalanced data challenges in chemistry. Chemical Science, 16, 7637-7658. doi:10.1039/D5SC00270B

  • Deep Learning Pipeline: Ramsundar, B., et al. (2019). Deep Learning for the Life Sciences. O’Reilly Media.