Understanding Outputs
How to read and interpret KAST results.
Output Folder Structure
All results are saved inside your active workspace folder (workspaces/<your_workspace>/):
workspaces/<your_workspace>/
├── prepared_data/
│   └── invalid_smiles_report.txt # Audit report of rejected SMILES
├── 01_train_set.csv # Training data
├── 01_test_set.csv # Test data
├── trained_model/ # Trained neural network files
├── 4_0_evaluation_report.txt # Main metrics
├── 4_1_cross_validation_results.txt # Cross-val scores
├── 4_2_enrichment_factor_results.txt # Enrichment analysis
├── 4_3_tanimoto_similarity_results.txt # Similarity analysis
├── 4_4_learning_curve_results.txt # Learning curve data
├── <custom_filename>.csv # Prediction results (Default: 05_new_molecule_predictions.csv)
├── 4_0_roc_curve.png
├── 4_4_learning_curve.png
├── 4_2_enrichment_curve.png
├── 4_3_tanimoto_similarity_histogram.png
└── logs/
    └── kast_20251028.log # Detailed execution log
CSV Files
01_train_set.csv & 01_test_set.csv
Columns:
SMILES: Molecular structure
Label: 1 (active) or 0 (inactive)
Name: Compound name (if provided)
Example:
SMILES,Label,Name
CC(C)Cc1ccc(cc1)C(C)C(O)=O,1,ibuprofen
CN1C=NC2=C1C(=O)N(C(=O)N2C)C,1,caffeine
CCCCCCCCCCCCCCCC,0,hexadecane
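Before training, it can be useful to check how many actives and inactives a split contains. The sketch below is illustrative only and is not part of KAST: it parses an inline copy of the example rows above with the standard `csv` module, and the helper name `label_counts` is hypothetical.

```python
import csv
import io
from collections import Counter

# Inline copy of the example rows above. In practice, pass an open file handle
# for 01_train_set.csv or 01_test_set.csv from your workspace instead.
EXAMPLE_CSV = """SMILES,Label,Name
CC(C)Cc1ccc(cc1)C(C)C(O)=O,1,ibuprofen
CN1C=NC2=C1C(=O)N(C(=O)N2C)C,1,caffeine
CCCCCCCCCCCCCCCC,0,hexadecane
"""


def label_counts(handle):
    """Count rows per Label value ("1" = active, "0" = inactive)."""
    reader = csv.DictReader(handle)
    return Counter(row["Label"] for row in reader)


counts = label_counts(io.StringIO(EXAMPLE_CSV))
total = sum(counts.values())
print(f"actives={counts['1']}, inactives={counts['0']}, "
      f"active fraction={counts['1'] / total:.2f}")
```

A strongly skewed active fraction is a hint to pay closer attention to Sensitivity and F1-Score than to Accuracy when reading the evaluation report.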
<custom_filename>.csv (Default: 05_new_molecule_predictions.csv)
Columns:
SMILES: Molecular structure
K-Score: Prediction score (0.0-1.0)
Predicted_Class: “Active” or “Inactive”
Example:
SMILES,K-Score,Predicted_Class
CCc1ccccc1O,0.94,Active
Cc1ccccc1C,0.92,Active
CCCc1ccccc1O,0.87,Active
Cc1ccc(cc1)C,0.45,Inactive
CCc1ccccc1,0.22,Inactive
Interpretation:
K-Score 0.9-1.0 → Very likely active
K-Score 0.7-0.9 → Likely active
K-Score 0.5-0.7 → Uncertain
K-Score 0.0-0.5 → Likely inactive
The K-Score (K-Prediction Score) is the predicted probability of the active class, P(active). In virtual screening workflows, probability-based scores are used mainly for ranking and prioritization rather than as absolute estimates, because discriminative power generally matters more than probability calibration for hit selection (Truchon & Bayly, 2007).
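The interpretation bands above can be expressed as a small lookup. This is a minimal sketch, not KAST code: the function name `k_score_band` is hypothetical, and because the listed ranges share their endpoints, assigning a boundary score (0.5, 0.7, 0.9) to the higher band is an assumption.

```python
def k_score_band(k_score: float) -> str:
    """Map a K-Score in [0, 1] to the interpretation bands listed above.

    Assumption: a score exactly on a boundary (0.5, 0.7, 0.9) goes to the
    higher band; the documentation does not specify this.
    """
    if not 0.0 <= k_score <= 1.0:
        raise ValueError(f"K-Score must be between 0 and 1, got {k_score}")
    if k_score >= 0.9:
        return "Very likely active"
    if k_score >= 0.7:
        return "Likely active"
    if k_score >= 0.5:
        return "Uncertain"
    return "Likely inactive"


# Scores taken from the example predictions file above, in arbitrary order.
example_scores = [0.45, 0.94, 0.22, 0.87, 0.92]

# Rank first (highest score first), then label each compound with its band.
for score in sorted(example_scores, reverse=True):
    print(f"{score:.2f}  {k_score_band(score)}")
```

Ranking before labeling reflects the point made above: the ordering of compounds is usually more informative than the absolute score.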
Metrics Files
4_0_evaluation_report.txt
Main evaluation metrics on test set:
ROC-AUC Score: 0.87
Accuracy: 0.85
Sensitivity (Recall): 0.82
Specificity: 0.88
Precision: 0.86
F1-Score: 0.84
Interpretation:
| Metric | What It Means | Good Value |
|---|---|---|
| ROC-AUC | Overall model performance (0-1) | > 0.8 |
| Accuracy | % of correct predictions | > 80% |
| Sensitivity | % of actives found | > 80% |
| Specificity | % of inactives correctly rejected | > 80% |
| Precision | % of predicted actives that are truly active | > 80% |
| F1-Score | Harmonic mean of precision and recall (sensitivity) | > 0.8 |
The ROC-AUC is the recommended primary metric for evaluating binary classifiers in bioactivity prediction, as it is threshold-independent and robust to class imbalance (Hanley & McNeil, 1982). For imbalanced chemical datasets, F1-Score and Sensitivity are particularly important as complementary metrics (Jiang et al., 2025).
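The threshold-dependent metrics in the report all derive from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name and the counts are hypothetical, are not taken from a KAST run, and are not meant to reproduce the example report above.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: actives predicted Active/Inactive
    tn/fp: inactives predicted Inactive/Active
    Assumes at least one active, one inactive, and one predicted active,
    so that no denominator is zero.
    """
    sensitivity = tp / (tp + fn)            # fraction of actives found
    specificity = tn / (tn + fp)            # fraction of inactives rejected
    precision = tp / (tp + fp)              # fraction of predicted actives that are real
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical test set: 100 actives and 100 inactives.
metrics = classification_metrics(tp=82, fp=12, tn=88, fn=18)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```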
4_1_cross_validation_results.txt
5-fold cross-validation scores:
Fold 1: AUC=0.85, Accuracy=0.84
Fold 2: AUC=0.86, Accuracy=0.85
Fold 3: AUC=0.87, Accuracy=0.86
Fold 4: AUC=0.88, Accuracy=0.87
Fold 5: AUC=0.86, Accuracy=0.85
Mean AUC: 0.864 ± 0.011
Mean Accuracy: 0.854 ± 0.011
Interpretation:
Low variation (±0.01) → Model is stable
High variation (±0.1) → Model is unstable
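CV score << test score → Test split may be unrepresentatively easy or share near-duplicates with the training set (overfitting shows up instead as training score >> CV score)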
Cross-validation provides a less biased estimate of generalization performance than a single train/test split. In QSAR modeling, k-fold cross-validation is considered essential to assess model robustness and detect overfitting (Tropsha, 2010).
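The summary lines can be recomputed from the per-fold scores. In the sketch below, the sample standard deviation (n - 1 denominator) reproduces the ± 0.011 shown in the example; that KAST itself uses the sample rather than the population formula is an inference from these numbers, not something the documentation states.

```python
from statistics import mean, stdev

# Per-fold scores copied from the example report above.
fold_auc = [0.85, 0.86, 0.87, 0.88, 0.86]
fold_accuracy = [0.84, 0.85, 0.86, 0.87, 0.85]


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(scores), stdev(scores)


for name, scores in (("AUC", fold_auc), ("Accuracy", fold_accuracy)):
    avg, spread = summarize(scores)
    print(f"Mean {name}: {avg:.3f} ± {spread:.3f}")
```

With only five folds the spread is itself a rough estimate, so small differences in the ± value between runs should not be over-interpreted.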
4_2_enrichment_factor_results.txt
How much better than random screening:
Enrichment Factor at 10%: 3.2x
Enrichment Factor at 20%: 2.1x
Enrichment Factor at 50%: 1.5x
Interpretation:
EF = 3.2x → The top-ranked 10% contains 3.2x as many actives as a random 10% of the library
Higher EF = better virtual screening tool
The Enrichment Factor (EF) at a given percentage quantifies the ability of a model to concentrate actives in the top-ranked fraction of a screened library relative to random selection. It is one of the most widely used metrics to evaluate practical virtual screening performance (Truchon & Bayly, 2007).
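As a worked illustration of the definition above, EF at a fraction is the hit rate in the top-ranked fraction divided by the hit rate in the whole library. The function below is a sketch under that standard definition; its name, the rounding of the cutoff, and the toy ranking are assumptions and may differ from how KAST computes the reported values.

```python
def enrichment_factor(labels_ranked, fraction):
    """Enrichment factor for the top `fraction` of a ranked list.

    labels_ranked: 1/0 activity labels, ordered from highest to lowest score.
    fraction: screened fraction of the library, e.g. 0.10 for the top 10%.
    """
    n_total = len(labels_ranked)
    hits_total = sum(labels_ranked)
    if n_total == 0 or hits_total == 0:
        raise ValueError("need a non-empty list with at least one active")
    # Assumption: the cutoff is rounded to the nearest whole compound (min. 1).
    n_top = max(1, round(n_total * fraction))
    hits_top = sum(labels_ranked[:n_top])
    return (hits_top / n_top) / (hits_total / n_total)


# Toy library of 20 compounds with 4 actives, already sorted by score.
ranked = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

for frac in (0.10, 0.20, 0.50):
    print(f"EF at {frac:.0%}: {enrichment_factor(ranked, frac):.1f}x")
```

Note that EF has a ceiling of 1 / fraction (10x at 10%, 2x at 50%), which is why values naturally shrink as the screened percentage grows.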
4_3_tanimoto_similarity_results.txt
Molecular diversity metrics:
=== DESCRIPTIVE STATISTICS (TEST SET) ===
Mean similarity: 0.45
Standard deviation: 0.12
=== DESCRIPTIVE STATISTICS (TRAIN SET INTERNAL) ===
Mean similarity: 0.65
Standard deviation: 0.08
Interpretation:
Test Mean ≈ Train Mean → Test set is well within the chemical space of the training set.
Test Mean << Train Mean → Test set is exploring new chemical space.
Molecular similarity is computed using the Tanimoto coefficient over binary molecular fingerprints. Comparing the internal training similarity to the test-to-training similarity is the standard method to assess if both populations occupy the same chemical space.
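For binary fingerprints, the Tanimoto coefficient is the number of shared on-bits divided by the number of on-bits in either fingerprint. The sketch below represents each fingerprint as a Python set of on-bit indices. The toy fingerprints are hypothetical, and averaging over all test-train pairs is only one possible aggregation; the documentation does not say which fingerprint type or aggregation KAST uses.

```python
from itertools import product


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


def mean_cross_similarity(group_x, group_y) -> float:
    """Mean Tanimoto similarity over all (x, y) pairs of two groups."""
    pairs = list(product(group_x, group_y))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy fingerprints (sets of on-bit indices).
train = [{1, 2, 3, 4}, {2, 3, 4, 5}]
test = [{3, 4, 5, 6}]

print(f"test-to-train mean similarity: {mean_cross_similarity(test, train):.2f}")
```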
4_4_learning_curve_results.txt
Model improvement with more data, using 5-fold cross-validation:
Training Size | Train AUC | Val AUC
50 | 0.7011 ± 0.0520 | 0.6800 ± 0.0610
100 | 0.7832 ± 0.0310 | 0.7610 ± 0.0401
200 | 0.8214 ± 0.0210 | 0.8123 ± 0.0305
400 | 0.8500 ± 0.0150 | 0.8410 ± 0.0200
798 | 0.8805 ± 0.0050 | 0.8715 ± 0.0080
Interpretation:
Val AUC increasing → Model improves with more data
Val AUC plateauing → More data won’t help much
High Variance (±) → Model is unstable at that dataset size
Learning curves are a standard diagnostic tool in machine learning to evaluate whether a model would benefit from additional training data or requires architectural changes (Ramsundar et al., 2019).
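One simple way to read the table is to look at how much validation AUC is gained at each step up in training size. The sketch below uses the example values above; the plateau threshold `min_gain=0.01` and the function name are arbitrary illustrative choices, not KAST settings.

```python
# Values copied from the example learning-curve table above.
sizes = [50, 100, 200, 400, 798]
val_auc = [0.6800, 0.7610, 0.8123, 0.8410, 0.8715]


def auc_gains(aucs):
    """Gain in validation AUC between consecutive training sizes."""
    return [round(later - earlier, 4) for earlier, later in zip(aucs, aucs[1:])]


def has_plateaued(aucs, min_gain=0.01):
    """True if the most recent step added less than `min_gain` AUC.

    min_gain is a hypothetical threshold chosen for illustration.
    """
    return auc_gains(aucs)[-1] < min_gain


for size, gain in zip(sizes[1:], auc_gains(val_auc)):
    print(f"up to {size} molecules: +{gain:.4f} Val AUC")
print("plateaued:", has_plateaued(val_auc))
```

In this example each step roughly doubles the data and the last step still adds about 0.03 AUC, so by this rough criterion the model has not yet plateaued.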
Plots
4_0_roc_curve.png
ROC (Receiver Operating Characteristic) curve showing model discrimination ability.
Interpretation:
Curve closer to top-left → Better model
Diagonal line → Random classifier
Area under curve (AUC) > 0.8 → Good
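The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen active receives a higher score than a randomly chosen inactive (Hanley & McNeil, 1982). The sketch below computes AUC that way by comparing every active-inactive pair. The labels and scores are hypothetical, and the pairwise loop is meant for illustration rather than for large datasets.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties between an active and an inactive count as half a correct pair.
    """
    actives = [s for label, s in zip(labels, scores) if label == 1]
    inactives = [s for label, s in zip(labels, scores) if label == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    correct = sum(
        1.0 if a > i else 0.5 if a == i else 0.0
        for a in actives
        for i in inactives
    )
    return correct / (len(actives) * len(inactives))


# Hypothetical test set: one inactive (0.8) outranks one active (0.7).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]

print(f"AUC = {roc_auc(labels, scores):.3f}")
```

Here five of the six active-inactive pairs are ordered correctly, so the AUC is 5/6; a random classifier would get about half of the pairs right, which corresponds to the diagonal line.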
4_4_learning_curve.png
How model performance (AUC) changes as the training set grows, with shaded bands showing the standard deviation across the 5 cross-validation folds.
Interpretation:
Curves converging → Model has learned most patterns
Curves not yet converged (a gap remains between training and validation) → More data would help
Narrow shaded bands → High statistical confidence/stability
4_2_enrichment_curve.png
Virtual screening performance across different screening percentages.
Interpretation:
Steep initial slope → Model finds actives early
Steep = good for virtual screening
4_3_tanimoto_similarity_histogram.png
Overlaid histograms of Tanimoto similarities.
Interpretation:
Blue Distribution (Train): Internal similarity of the training set.
Red Distribution (Test): Similarity of the test set to the training set.
High Overlap: Test set molecules are highly similar to training molecules.
Shift to the Left: Test set molecules are structurally novel compared to the training set.
Log File
logs/kast_YYYYMMDD.log
Detailed execution log with timestamps and debug info.
Check the log if:
Something fails
You need execution details
You are debugging an issue
Quality Assessment
Good Results
AUC > 0.85
Accuracy > 85%
CV stability ± < 0.05
Learning curve converges
Clear enrichment factor (> 2x at 10%)
Acceptable Results
AUC 0.75-0.85
Accuracy 75-85%
CV stability ± 0.05-0.10
Model still learning with more data
Poor Results
AUC < 0.70
Accuracy < 70%
High CV variation (± > 0.15)
Enrichment factor < 1.5x
Check: data quality, class balance, duplicate molecules
Exporting for Publication
CSV Export
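# Prediction results are already in CSV format
# Open in Excel or Python:
import pandas as pd

results = pd.read_csv('workspaces/<your_workspace>/<custom_filename>.csv')
# Sort explicitly in case the file is not already ordered by score
ranked = results.sort_values('K-Score', ascending=False)
top_100 = ranked.head(100)
# index=False keeps the pandas row index out of the exported file
top_100.to_csv('top_100_predicted_actives.csv', index=False)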
Plot Export
Plots are saved automatically as high-resolution PNG files suitable for publication.
Report Generation
# Combine all results
cat workspaces/<your_workspace>/4_0_evaluation_report.txt \
workspaces/<your_workspace>/4_1_cross_validation_results.txt \
> publication_report.txt
Further Reading & Foundations
ROC-AUC: Hanley, J.A., & McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36. doi:10.1148/radiology.143.1.7063747
Enrichment Factor: Truchon, J.F., & Bayly, C.I. (2007). Evaluating virtual screening methods: good and bad metrics for the “early recognition” problem. Journal of Chemical Information and Modeling, 47(2), 488-508. doi:10.1021/ci600426e
Tanimoto Similarity: Willett, P., Barnard, J.M., & Downs, G.M. (1998). Chemical Similarity Searching. Journal of Chemical Information and Computer Sciences, 38(6), 983-996.
Cross-Validation in QSAR: Tropsha, A. (2010). Best Practices for QSAR Model Development, Validation, and Exploitation. Molecular Informatics, 29(6-7), 476-488.
Imbalanced Learning: Jiang, J., et al. (2025). A review of machine learning methods for imbalanced data challenges in chemistry. Chemical Science, 16, 7637-7658. doi:10.1039/D5SC00270B
Deep Learning Pipeline: Ramsundar, B., et al. (2019). Deep Learning for the Life Sciences. O’Reilly Media.