# Understanding Outputs

How to read and interpret KAST results.

---

## Output Folder Structure

All results are saved inside your active workspace folder (`workspaces//`):

```
workspaces//
├── prepared_data/
│   └── invalid_smiles_report.txt        # Audit report of rejected SMILES
├── 01_train_set.csv                     # Training data
├── 01_test_set.csv                      # Test data
├── trained_model/                       # Trained neural network files
├── 4_0_evaluation_report.txt            # Main metrics
├── 4_1_cross_validation_results.txt     # Cross-val scores
├── 4_2_enrichment_factor_results.txt    # Enrichment analysis
├── 4_3_tanimoto_similarity_results.txt  # Similarity analysis
├── 4_4_learning_curve_results.txt       # Learning curve data
├── .csv                                 # Prediction results (Default: 05_new_molecule_predictions.csv)
├── 4_0_roc_curve.png
├── 4_4_learning_curve.png
├── 4_2_enrichment_curve.png
├── 4_3_tanimoto_similarity_histogram.png
└── logs/
    └── kast_20251028.log                # Detailed execution log
```

---

## CSV Files

### `01_train_set.csv` & `01_test_set.csv`

**Columns:**
- `SMILES`: Molecular structure
- `Label`: 1 (active) or 0 (inactive)
- `Name`: Compound name (if provided)

**Example:**
```
SMILES,Label,Name
CC(C)Cc1ccc(cc1)C(C)C(O)=O,1,ibuprofen
CN1C=NC2=C1C(=O)N(C(=O)N2C)C,1,caffeine
CCCCCCCCCCCCCCCC,0,hexadecane
```

### `.csv` (Default: `05_new_molecule_predictions.csv`)

**Columns:**
- `SMILES`: Molecular structure
- `K-Score`: Prediction score (0.0-1.0)
- `Predicted_Class`: "Active" or "Inactive"

**Example:**
```
SMILES,K-Score,Predicted_Class
CCc1ccccc1O,0.94,Active
Cc1ccccc1C,0.92,Active
CCCc1ccccc1O,0.87,Active
Cc1ccc(cc1)C,0.45,Inactive
CCc1ccccc1,0.22,Inactive
```

**Interpretation:**
- **K-Score 0.9-1.0** → Very likely active
- **K-Score 0.7-0.9** → Likely active
- **K-Score 0.5-0.7** → Uncertain
- **K-Score 0.0-0.5** → Likely inactive

The K-Prediction Score represents the predicted probability of the active class, P(active).
In virtual screening workflows, probability-based scores are primarily used for ranking and prioritization rather than as absolute estimates, as discriminative power is generally more relevant than probability calibration for hit selection (Truchon & Bayly, 2007).

---

## Metrics Files

### `4_0_evaluation_report.txt`

Main evaluation metrics on the test set:

```
ROC-AUC Score: 0.87
Accuracy: 0.85
Sensitivity (Recall): 0.82
Specificity: 0.88
Precision: 0.86
F1-Score: 0.84
```

**Interpretation:**

| Metric | What It Means | Good Value |
|--------|---------------|------------|
| **ROC-AUC** | Overall ranking performance (0-1) | > 0.8 |
| **Accuracy** | % of all predictions that are correct | > 80% |
| **Sensitivity** | % of true actives that are found | > 80% |
| **Specificity** | % of true inactives correctly rejected | > 80% |
| **Precision** | % of predicted actives that are truly active | > 80% |
| **F1-Score** | Balance between precision & recall | > 0.8 |

The ROC-AUC is the recommended primary metric for evaluating binary classifiers in bioactivity prediction, as it is threshold-independent and less sensitive to class imbalance than accuracy (Hanley & McNeil, 1982). For imbalanced chemical datasets, F1-Score and Sensitivity are particularly important as complementary metrics (Jiang et al., 2025).

---

### `4_1_cross_validation_results.txt`

5-fold cross-validation scores:

```
Fold 1: AUC=0.85, Accuracy=0.84
Fold 2: AUC=0.86, Accuracy=0.85
Fold 3: AUC=0.87, Accuracy=0.86
Fold 4: AUC=0.88, Accuracy=0.87
Fold 5: AUC=0.86, Accuracy=0.85

Mean AUC: 0.864 ± 0.011
Mean Accuracy: 0.854 ± 0.011
```

**Interpretation:**
- Low variation (±0.01) → Model is stable
- High variation (±0.1) → Model is unstable
- CV score << test score → The test-set result is probably optimistic (lucky split or data leakage); overfitting shows up instead as training score >> CV score

Cross-validation provides a less biased estimate of generalization performance than a single train/test split. In QSAR modeling, k-fold cross-validation is considered essential to assess model robustness and detect overfitting (Tropsha, 2010).

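
The summary lines in the example report above can be reproduced from the per-fold scores. The sketch below assumes the ± value is the sample standard deviation across folds; the example numbers are consistent with that, but the source does not state how KAST computes it. `summarize` is a hypothetical helper for illustration, not part of KAST.

```python
from statistics import mean, stdev

# Per-fold scores copied from the example report above.
fold_auc = [0.85, 0.86, 0.87, 0.88, 0.86]
fold_acc = [0.84, 0.85, 0.86, 0.87, 0.85]


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    # statistics.stdev divides by n - 1 (sample standard deviation).
    return mean(scores), stdev(scores)


for name, scores in (("AUC", fold_auc), ("Accuracy", fold_acc)):
    m, s = summarize(scores)
    print(f"Mean {name}: {m:.3f} ± {s:.3f}")
# Prints:
# Mean AUC: 0.864 ± 0.011
# Mean Accuracy: 0.854 ± 0.011
```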

---

### `4_2_enrichment_factor_results.txt`

How much better the model performs than random screening:

```
Enrichment Factor at 10%: 3.2x
Enrichment Factor at 20%: 2.1x
Enrichment Factor at 50%: 1.5x
```

**Interpretation:**
- **EF = 3.2x** → Screening the top-ranked 10% finds 3.2x as many actives as a random 10% selection would
- Higher EF = better virtual screening tool

The Enrichment Factor (EF) at a given percentage quantifies the ability of a model to concentrate actives in the top-ranked fraction of a screened library relative to random selection. It is one of the most widely used metrics to evaluate practical virtual screening performance (Truchon & Bayly, 2007).

---

### `4_3_tanimoto_similarity_results.txt`

Molecular diversity metrics:

```
=== DESCRIPTIVE STATISTICS (TEST SET) ===
Mean similarity: 0.45
Standard deviation: 0.12

=== DESCRIPTIVE STATISTICS (TRAIN SET INTERNAL) ===
Mean similarity: 0.65
Standard deviation: 0.08
```

**Interpretation:**
- **Test Mean ≈ Train Mean** → Test set is well within the chemical space of the training set.
- **Test Mean << Train Mean** → Test set is exploring new chemical space.

Molecular similarity is computed using the Tanimoto coefficient over binary molecular fingerprints. Comparing the internal training similarity to the test-to-training similarity is a common way to assess whether both populations occupy the same chemical space.

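
As a minimal sketch of the comparison described above, the Tanimoto coefficient of two binary fingerprints is the number of shared on-bits divided by the number of on-bits set in either. The example below represents each fingerprint as a Python set of on-bit indices and averages over all pairs. The toy fingerprints, the helper names, and the all-pairs averaging are assumptions made for illustration; the fingerprint type and aggregation that KAST actually uses are not specified here.

```python
from itertools import combinations, product


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_internal_similarity(group):
    """Mean Tanimoto over every unordered pair within one group (no self-pairs)."""
    sims = [tanimoto(a, b) for a, b in combinations(group, 2)]
    return sum(sims) / len(sims)


def mean_cross_similarity(group_a, group_b):
    """Mean Tanimoto over every pair drawn from two different groups."""
    sims = [tanimoto(a, b) for a, b in product(group_a, group_b)]
    return sum(sims) / len(sims)


# Toy fingerprints (hypothetical bit indices, not real molecules).
train = [{1, 2, 3, 4}, {1, 2, 3, 5}, {2, 3, 4, 5}]
test = [{1, 2, 8, 9}, {7, 8, 9, 10}]

train_internal = mean_internal_similarity(train)
test_to_train = mean_cross_similarity(test, train)

print(f"Train internal mean similarity: {train_internal:.2f}")
print(f"Test-to-train mean similarity:  {test_to_train:.2f}")
# Here the test mean is well below the train mean, which corresponds to the
# "exploring new chemical space" case in the interpretation above.
```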

---

### `4_4_learning_curve_results.txt`

Model improvement with more data, using 5-fold cross-validation:

```
Training Size | Train AUC       | Val AUC
50            | 0.7011 ± 0.0520 | 0.6800 ± 0.0610
100           | 0.7832 ± 0.0310 | 0.7610 ± 0.0401
200           | 0.8214 ± 0.0210 | 0.8123 ± 0.0305
400           | 0.8500 ± 0.0150 | 0.8410 ± 0.0200
798           | 0.8805 ± 0.0050 | 0.8715 ± 0.0080
```

**Interpretation:**
- **Val AUC increasing** → Model improves with more data
- **Val AUC plateauing** → More data won't help much
- **High Variance (±)** → Model is unstable at that dataset size

Learning curves are a standard diagnostic tool in machine learning to evaluate whether a model would benefit from additional training data or requires architectural changes (Ramsundar et al., 2019).

---

## Plots

### `4_0_roc_curve.png`

ROC (Receiver Operating Characteristic) curve showing model discrimination ability.

**Interpretation:**
- Curve closer to top-left → Better model
- Diagonal line → Random classifier
- Area under curve (AUC) > 0.8 → Good

### `4_4_learning_curve.png`

How training and validation AUC change as the training set grows, with shaded bands showing the standard deviation across the 5 cross-validation folds.

**Interpretation:**
- **Curves converging** → Model has learned most patterns
- **Persistent gap between curves** → More data would help
- **Narrow shaded bands** → High statistical confidence/stability

### `4_2_enrichment_curve.png`

Virtual screening performance across different screening percentages.

**Interpretation:**
- Steep initial slope → Model finds actives early
- Steep = good for virtual screening

### `4_3_tanimoto_similarity_histogram.png`

Overlapping histograms of Tanimoto similarities.

**Interpretation:**
- **Blue Distribution (Train)**: Internal similarity of the training set.
- **Red Distribution (Test)**: Similarity of the test set to the training set.
- **High Overlap**: Test set molecules are highly similar to training molecules.
- **Shift to the Left**: Test set molecules are structurally novel compared to the training set.

## Log File

### `logs/kast_YYYYMMDD.log`

Detailed execution log with timestamps and debug info.

**Check the log if:**
- Something fails
- You need execution details
- You are debugging an issue

---

## Quality Assessment

### Good Results
- AUC > 0.85
- Accuracy > 85%
- CV variation (±) < 0.05
- Learning curve converges
- Clear enrichment factor (> 2x at 10%)

### Acceptable Results
- AUC 0.75-0.85
- Accuracy 75-85%
- CV variation (±) 0.05-0.10
- Model still learning with more data

### Poor Results
- AUC < 0.70
- Accuracy < 70%
- High CV variation (± > 0.15)
- Enrichment factor < 1.5x

If results are poor, check data quality, class balance, and duplicate molecules.

---

## Exporting for Publication

### CSV Export

```python
# Prediction results are already in CSV format.
# Open them in Excel, or load them with pandas:
import pandas as pd

results = pd.read_csv('workspaces//.csv')
# Rank by K-Score so that head(100) returns the top-scoring molecules
top_100 = results.sort_values('K-Score', ascending=False).head(100)
top_100.to_csv('top_100_predicted_actives.csv', index=False)
```

### Plot Export

Plots are saved automatically as PNG files (high resolution for publications).

### Report Generation

```bash
# Combine all results
cat workspaces//4_0_evaluation_report.txt \
    workspaces//4_1_cross_validation_results.txt \
    > publication_report.txt
```

---

## Further Reading & Foundations

- **ROC-AUC:** Hanley, J.A., & McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. *Radiology*, 143(1), 29-36. [doi:10.1148/radiology.143.1.7063747](https://doi.org/10.1148/radiology.143.1.7063747)
- **Enrichment Factor:** Truchon, J.F., & Bayly, C.I. (2007). Evaluating virtual screening methods: good and bad metrics for the "early recognition" problem. *Journal of Chemical Information and Modeling*, 47(2), 488-508. [doi:10.1021/ci600426e](https://doi.org/10.1021/ci600426e)
- **Tanimoto Similarity:** Willett, P., Barnard, J.M., & Downs, G.M. (1998). Chemical Similarity Searching. *Journal of Chemical Information and Computer Sciences*, 38(6), 983-996.
- **Cross-Validation in QSAR:** Tropsha, A. (2010). Best Practices for QSAR Model Development, Validation, and Exploitation. *Molecular Informatics*, 29(6-7), 476-488.
- **Imbalanced Learning:** Jiang, J., et al. (2025). A review of machine learning methods for imbalanced data challenges in chemistry. *Chemical Science*, 16, 7637-7658. [doi:10.1039/D5SC00270B](https://doi.org/10.1039/D5SC00270B)
- **Deep Learning Pipeline:** Ramsundar, B., et al. (2019). *Deep Learning for the Life Sciences*. O'Reilly Media.