# Understanding Outputs

How to read and interpret KAST results.

---

## Output Folder Structure

All results are saved inside your active workspace folder (`workspaces/<your_workspace>/`):

```
workspaces/<your_workspace>/
├── prepared_data/
│   └── invalid_smiles_report.txt        # Audit report of rejected SMILES
├── 01_train_set.csv                     # Training data
├── 01_test_set.csv                      # Test data
├── trained_model/                       # Trained neural network files
├── 4_0_evaluation_report.txt           # Main metrics
├── 4_1_cross_validation_results.txt    # Cross-val scores
├── 4_2_enrichment_factor_results.txt   # Enrichment analysis
├── 4_3_tanimoto_similarity_results.txt # Similarity analysis
├── 4_4_learning_curve_results.txt      # Learning curve data
├── <custom_filename>.csv     # Prediction results (Default: 05_new_molecule_predictions.csv)
├── 4_0_roc_curve.png
├── 4_4_learning_curve.png
├── 4_2_enrichment_curve.png
├── 4_3_tanimoto_similarity_histogram.png
└── logs/
    └── kast_20251028.log               # Detailed execution log
```

---

## CSV Files

### `01_train_set.csv` & `01_test_set.csv`

**Columns:**
- `SMILES` — Molecular structure
- `Label` — 1 (active) or 0 (inactive)
- `Name` — Compound name (if provided)

**Example:**
```
SMILES,Label,Name
CC(C)Cc1ccc(cc1)C(C)C(O)=O,1,ibuprofen
CN1C=NC2=C1C(=O)N(C(=O)N2C)C,1,caffeine
CCCCCCCCCCCCCCCC,0,hexadecane
```

### `<custom_filename>.csv` (Default: `05_new_molecule_predictions.csv`)

**Columns:**
- `SMILES` — Molecular structure
- `K-Score` — Prediction score (0.0-1.0)
- `Predicted_Class` — "Active" or "Inactive"

**Example:**
```
SMILES,K-Score,Predicted_Class
CCc1ccccc1O,0.94,Active
Cc1ccccc1C,0.92,Active
CCCc1ccccc1O,0.87,Active
Cc1ccc(cc1)C,0.45,Inactive
CCc1ccccc1,0.22,Inactive
```

**Interpretation:**
- **K-Score 0.9-1.0** → Very likely active 
- **K-Score 0.7-0.9** → Likely active 
- **K-Score 0.5-0.7** → Uncertain 
- **K-Score 0.0-0.5** → Likely inactive

The K-Score is the predicted probability of the active class, P(active).
In virtual screening workflows, probability-based scores are primarily used for ranking
and prioritization rather than as absolute estimates, as discriminative power is generally
more relevant than probability calibration for hit selection (Truchon & Bayly, 2007).
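As a minimal sketch of ranking and banding, the snippet below sorts predictions by K-Score and attaches the interpretation bands listed above. It uses an inline copy of the example rows rather than a real workspace file. The helper names `band` and `rank_predictions` are hypothetical, and assigning boundary values (0.5, 0.7, 0.9) to the higher band is an assumption, since the bands above share their endpoints.

```python
import csv
import io

# Inline sample mirroring the example predictions file (rows shuffled on purpose).
SAMPLE = """SMILES,K-Score,Predicted_Class
CCc1ccccc1,0.22,Inactive
CCc1ccccc1O,0.94,Active
Cc1ccc(cc1)C,0.45,Inactive
CCCc1ccccc1O,0.87,Active
Cc1ccccc1C,0.92,Active
"""


def band(score):
    """Map a K-Score to an interpretation band.

    Assumption: boundary values fall into the higher band.
    """
    if score >= 0.9:
        return "Very likely active"
    if score >= 0.7:
        return "Likely active"
    if score >= 0.5:
        return "Uncertain"
    return "Likely inactive"


def rank_predictions(handle):
    """Read prediction rows and return them sorted by K-Score, highest first."""
    rows = list(csv.DictReader(handle))
    for row in rows:
        row["K-Score"] = float(row["K-Score"])
        row["Band"] = band(row["K-Score"])
    return sorted(rows, key=lambda r: r["K-Score"], reverse=True)


ranked = rank_predictions(io.StringIO(SAMPLE))
for row in ranked:
    print(f'{row["K-Score"]:.2f}  {row["Band"]:<18}  {row["SMILES"]}')
```

To apply this to a real run, pass an open file handle for the predictions CSV in place of the `io.StringIO` sample.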

---

## Metrics Files

### `4_0_evaluation_report.txt`

Main evaluation metrics on the test set:

```
ROC-AUC Score: 0.87
Accuracy: 0.85
Sensitivity (Recall): 0.82
Specificity: 0.88
Precision: 0.86
F1-Score: 0.84
```

**Interpretation:**
| Metric | What It Means | Good Value |
|--------|-------------|-----------|
| **ROC-AUC** | Overall model performance (0-1) | > 0.8 |
| **Accuracy** | % correct predictions | > 80% |
| **Sensitivity** | % of actives found | > 80% |
| **Specificity** | % of inactives correctly rejected | > 80% |
| **Precision** | % of predicted actives that are truly active | > 80% |
| **F1-Score** | Balance between precision & recall | > 0.8 |

The ROC-AUC is the recommended primary metric for evaluating binary classifiers in
bioactivity prediction, as it is threshold-independent and robust to class imbalance
(Hanley & McNeil, 1982). For imbalanced chemical datasets, F1-Score and Sensitivity
are particularly important as complementary metrics (Jiang et al., 2025).
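To make the threshold-based metrics concrete, here is a short sketch that derives them from confusion-matrix counts using the standard definitions. The function name `classification_metrics` and the counts are hypothetical; they are not the counts behind the example report above, and this is not necessarily how KAST computes the values internally.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one active, one
    inactive, and one molecule predicted active).
    """
    sensitivity = tp / (tp + fn)  # share of true actives that were found
    specificity = tn / (tn + fp)  # share of true inactives correctly rejected
    precision = tp / (tp + fp)    # share of predicted actives that are truly active
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts for a balanced test set of 100 actives and 100 inactives.
metrics = classification_metrics(tp=82, fp=12, tn=88, fn=18)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that accuracy alone can look good on an imbalanced set even when few actives are found, which is why sensitivity and F1 are reported alongside it.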

---

### `4_1_cross_validation_results.txt`

5-fold cross-validation scores:

```
Fold 1: AUC=0.85, Accuracy=0.84
Fold 2: AUC=0.86, Accuracy=0.85
Fold 3: AUC=0.87, Accuracy=0.86
Fold 4: AUC=0.88, Accuracy=0.87
Fold 5: AUC=0.86, Accuracy=0.85

Mean AUC: 0.864 ± 0.011
Mean Accuracy: 0.854 ± 0.011
```

**Interpretation:**
- Low variation (±0.01) → Model is stable
- High variation (±0.1) → Model is unstable
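- If CV score << test score → The single test split may be unrepresentatively easy, so rely on the CV estimate (overfitting itself shows up as training score >> CV score)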

Cross-validation provides a less biased estimate of generalization performance than
a single train/test split. In QSAR modeling, k-fold cross-validation is considered
essential to assess model robustness and detect overfitting (Tropsha, 2010).
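The summary line can be reproduced from the per-fold scores. The sketch below uses the example AUC values above with the sample standard deviation (n - 1 denominator). Whether KAST reports the sample or the population standard deviation is not stated here, so treat that choice as an assumption; the helper name `summarize` is hypothetical.

```python
from statistics import mean, stdev

# Per-fold AUC values copied from the example output above.
fold_auc = [0.85, 0.86, 0.87, 0.88, 0.86]


def summarize(scores):
    """Return (mean, sample standard deviation) of the fold scores."""
    return mean(scores), stdev(scores)


auc_mean, auc_sd = summarize(fold_auc)
print(f"Mean AUC: {auc_mean:.3f} ± {auc_sd:.3f}")
```

With only five folds, the standard deviation is itself a rough estimate, so small differences in the ± value between runs are expected.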

---

### `4_2_enrichment_factor_results.txt`

How much better than random screening:

```
Enrichment Factor at 10%: 3.2x
Enrichment Factor at 20%: 2.1x
Enrichment Factor at 50%: 1.5x
```

**Interpretation:**
- **EF = 3.2x** → Screening the top 10% of the ranked list finds 3.2x more actives than a random 10% sample
- Higher EF = better virtual screening tool

The Enrichment Factor (EF) at a given percentage quantifies the ability of a model
to concentrate actives in the top-ranked fraction of a screened library relative to
random selection. It is one of the most widely used metrics to evaluate practical
virtual screening performance (Truchon & Bayly, 2007).
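The definition above can be sketched in a few lines: rank by score, take the top fraction, and divide the active rate in that slice by the overall active rate. The function name `enrichment_factor`, the toy data, and the rounding of the slice size (rounded up, at least one molecule) are assumptions for illustration; KAST may handle slice size and ties differently.

```python
import math


def enrichment_factor(scores, labels, fraction):
    """EF at `fraction`: active rate in the top-ranked slice / overall active rate.

    `labels` holds 1 for active and 0 for inactive.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total_actives = sum(labels)
    if total_actives == 0:
        raise ValueError("EF is undefined without any actives")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, math.ceil(fraction * len(ranked)))  # assumption: round up
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (total_actives / len(ranked))


# Toy library: 10 molecules, 4 of them active.
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

for pct in (0.2, 0.5, 1.0):
    print(f"EF at {pct:.0%}: {enrichment_factor(toy_scores, toy_labels, pct):.2f}x")
```

Screening the whole library always gives EF = 1, and the maximum possible EF at a given fraction is bounded by the overall active rate, so EF values are only comparable between datasets with similar class ratios.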


---

### `4_3_tanimoto_similarity_results.txt`

Molecular diversity metrics:

```
=== DESCRIPTIVE STATISTICS (TEST SET) ===
Mean similarity: 0.45
Standard deviation: 0.12

=== DESCRIPTIVE STATISTICS (TRAIN SET INTERNAL) ===
Mean similarity: 0.65
Standard deviation: 0.08
```

**Interpretation:**
- **Test Mean ≈ Train Mean** → Test set is well within the chemical space of the training set.
- **Test Mean << Train Mean** → Test set is exploring new chemical space.

Molecular similarity is computed using the Tanimoto coefficient over binary molecular
fingerprints. Comparing the internal training similarity to the test-to-training similarity is
a common way to assess whether both populations occupy the same chemical space.
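For reference, the Tanimoto coefficient between two binary fingerprints is the number of shared on-bits divided by the number of on-bits set in either. The sketch below represents each fingerprint as a set of on-bit positions. The bit positions are made up, the function name `tanimoto` is hypothetical, and the fingerprint type KAST uses is not specified here; in practice the bits would come from a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Made-up on-bit positions for two molecules.
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 7, 12, 20}

print(f"Tanimoto similarity: {tanimoto(fp_a, fp_b):.2f}")
```

A value of 1.0 means identical fingerprints and 0.0 means no shared bits.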

---

### `4_4_learning_curve_results.txt`

Model improvement with more data, using 5-fold cross-validation:

```
Training Size | Train AUC         | Val AUC
50           | 0.7011 ± 0.0520   | 0.6800 ± 0.0610
100          | 0.7832 ± 0.0310   | 0.7610 ± 0.0401
200          | 0.8214 ± 0.0210   | 0.8123 ± 0.0305
400          | 0.8500 ± 0.0150   | 0.8410 ± 0.0200
798          | 0.8805 ± 0.0050   | 0.8715 ± 0.0080
```

**Interpretation:**
- **Val AUC increasing** → Model improves with more data 
- **Val AUC plateauing** → More data won't help much
- **High Variance (±)** → Model is unstable at that dataset size

Learning curves are a standard diagnostic tool in machine learning to evaluate
whether a model would benefit from additional training data or requires architectural
changes (Ramsundar et al., 2019).
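One rough way to read the table is to look at the gain in validation AUC between consecutive training sizes. The sketch below uses the Val AUC means from the example above. The helper names and the `min_gain=0.01` plateau threshold are hypothetical choices for illustration, not KAST settings.

```python
# (training size, mean validation AUC) pairs from the example table above.
curve = [(50, 0.6800), (100, 0.7610), (200, 0.8123), (400, 0.8410), (798, 0.8715)]


def validation_gains(points):
    """Return (from_size, to_size, gain) for each consecutive pair of sizes."""
    return [(a[0], b[0], b[1] - a[1]) for a, b in zip(points, points[1:])]


def has_plateaued(points, min_gain=0.01):
    """True if the last step improved validation AUC by less than `min_gain`.

    `min_gain` is an arbitrary illustrative threshold.
    """
    return validation_gains(points)[-1][2] < min_gain


gains = validation_gains(curve)
for start, end, gain in gains:
    print(f"{start:>4} -> {end:<4} gain in Val AUC: {gain:+.4f}")
print("Plateaued:", has_plateaued(curve))
```

When judging a plateau, compare each gain against the ± values in the table: a gain smaller than the fold-to-fold spread is not clear evidence of improvement.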

---

## Plots

### `4_0_roc_curve.png`
ROC (Receiver Operating Characteristic) curve showing model discrimination ability.

**Interpretation:**
- Curve closer to top-left → Better model
- Diagonal line → Random classifier
- Area under curve (AUC) > 0.8 → Good
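The AUC has a direct ranking interpretation (Hanley & McNeil, 1982): it is the probability that a randomly chosen active scores higher than a randomly chosen inactive, with ties counted as one half. The sketch below computes it by comparing every active/inactive pair. The function name `roc_auc` and the toy data are hypothetical, and this quadratic-time version is meant for illustration rather than large datasets.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    `labels` holds 1 for active and 0 for inactive; ties count as 0.5.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("AUC needs at least one active and one inactive")
    wins = sum(
        1.0 if a > i else 0.5 if a == i else 0.0
        for a in actives
        for i in inactives
    )
    return wins / (len(actives) * len(inactives))


# Toy example: 3 actives, 2 inactives.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3]
toy_labels = [1, 0, 1, 1, 0]
print(f"AUC: {roc_auc(toy_scores, toy_labels):.3f}")
```

A perfect ranking gives 1.0 and identical scores for every molecule give 0.5, which matches the diagonal line of a random classifier.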

### `4_4_learning_curve.png`
How training and validation AUC change as the training set grows, with shaded bands showing the standard deviation across 5 cross-validation folds.

**Interpretation:**
- **Curves converging** → Model has learned most patterns
- **Curves still far apart** → More data would help
- **Narrow shaded bands** → High statistical confidence/stability

### `4_2_enrichment_curve.png`
Virtual screening performance across different screening percentages.

**Interpretation:**
- Steep initial slope → Model finds actives early, which is what virtual screening needs
- Curve close to the diagonal → Little better than random selection

### `4_3_tanimoto_similarity_histogram.png`
Overlapping histogram of Tanimoto Similarities.

**Interpretation:**
- **Blue Distribution (Train)**: Internal similarity of the training set.
- **Red Distribution (Test)**: Similarity of the test set to the training set.
- **High Overlap**: Test set molecules are highly similar to training molecules.
- **Shift to the Left**: Test set molecules are structurally novel compared to the training set.


---

## Log File

### `logs/kast_YYYYMMDD.log`

Detailed execution log with timestamps and debug info.

**Check the log if:**
- Something fails
- You need execution details
- You are debugging an issue

---

## Quality Assessment

### Good Results
- AUC > 0.85
- Accuracy > 85%
- CV variation (±) < 0.05
- Learning curve converges
- Clear enrichment factor (> 2x at 10%)

### Acceptable Results
- AUC 0.75-0.85
- Accuracy 75-85%
- CV variation (±) 0.05-0.10
- Model still learning with more data

### Poor Results
- AUC < 0.70
- Accuracy < 70%
- High CV variation (± > 0.15)
- Enrichment factor < 1.5x
- If you see these, check data quality, class balance, and duplicate molecules

---

## Exporting for Publication

### CSV Export
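```python
# The prediction results are already in CSV format.
# Open them in Excel, or load them in Python:
import pandas as pd

results = pd.read_csv('workspaces/<your_workspace>/<custom_filename>.csv')
# Sort by K-Score so the first rows are the highest-ranked predictions
top_100 = results.sort_values('K-Score', ascending=False).head(100)
top_100.to_csv('top_100_predicted_actives.csv', index=False)
```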

### Plot Export
Plots are saved automatically as high-resolution PNG files suitable for publication.

### Report Generation
```bash
# Combine all results
cat workspaces/<your_workspace>/4_0_evaluation_report.txt \
    workspaces/<your_workspace>/4_1_cross_validation_results.txt \
    > publication_report.txt
```

---

## Further Reading & Foundations

- **ROC-AUC:** Hanley, J.A., & McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. *Radiology*, 143(1), 29-36. [doi:10.1148/radiology.143.1.7063747](https://doi.org/10.1148/radiology.143.1.7063747)
- **Enrichment Factor:** Truchon, J.F., & Bayly, C.I. (2007). Evaluating virtual screening methods: good and bad metrics for the "early recognition" problem. *Journal of Chemical Information and Modeling*, 47(2), 488-508. [doi:10.1021/ci600426e](https://doi.org/10.1021/ci600426e)
- **Tanimoto Similarity:** Willett, P., Barnard, J.M., & Downs, G.M. (1998). Chemical Similarity Searching. *Journal of Chemical Information and Computer Sciences*, 38(6), 983-996.
- **Cross-Validation in QSAR:** Tropsha, A. (2010). Best Practices for QSAR Model Development, Validation, and Exploitation. *Molecular Informatics*, 29(6-7), 476-488.
- **Imbalanced Learning:** Jiang, J., et al. (2025). A review of machine learning methods for imbalanced data challenges in chemistry. *Chemical Science*, 16, 7637-7658. [doi:10.1039/D5SC00270B](https://doi.org/10.1039/D5SC00270B)
- **Deep Learning Pipeline:** Ramsundar, B., et al. (2019). *Deep Learning for the Life Sciences*. O'Reilly Media.