Step-by-Step Guide
Detailed walkthrough of each pipeline step.
Before You Start
✅ Have you completed Installation?
✅ Do you have data/actives.smi and data/inactives.smi ready?
✅ Have you checked Data Preparation for the required file format?
Workspace Management
Menu Option: [W]
Before starting any tasks, KAST prompts you to create or select a Workspace.
Workspaces act as isolated project folders, meaning you can train a model for one protein target (e.g., AChE) and another for a different target (e.g., Kinase) without mixing up data, models, or predictions.
All outputs from Steps 1-6 are saved inside workspaces/<your_workspace>/.
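The isolation described above comes down to giving every workspace its own folder. A minimal sketch of that idea, assuming a hypothetical helper named get_workspace_dir (not part of KAST) and the workspaces/<your_workspace>/ layout mentioned in this section:

```python
from pathlib import Path


def get_workspace_dir(name: str, root: Path = Path("workspaces")) -> Path:
    """Create (if needed) and return the folder for one workspace.

    Hypothetical helper for illustration; KAST's own logic may differ.
    """
    # Reject empty names and names that would escape the root folder,
    # such as "../other" or "a/b".
    if not name or name == ".." or Path(name).name != name:
        raise ValueError(f"invalid workspace name: {name!r}")
    workspace = root / name
    workspace.mkdir(parents=True, exist_ok=True)
    return workspace
```

Calling get_workspace_dir("AChE") and get_workspace_dir("Kinase") would yield two separate folders, so files written under one are never visible under the other.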
Step 1: Data Preparation
Menu Option: [1]
Script: bin/1_preparation.py
What Happens
Reads active and inactive molecule files
Validates SMILES structures
Removes invalid/malformed entries
Splits into train/test sets
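The read, validate, and remove steps above can be sketched as follows. This is an illustration under assumptions, not the code in bin/1_preparation.py: it assumes each line holds a SMILES string first with an optional name after whitespace, and the function names are hypothetical. The RDKit check relies on Chem.MolFromSmiles returning None for a string it cannot parse.

```python
from typing import Callable, Iterable, List, Tuple

Rejected = Tuple[int, str, str]  # (line number, SMILES, reason)


def rdkit_is_valid(smiles: str) -> bool:
    """Validity check via RDKit; MolFromSmiles returns None on parse failure."""
    from rdkit import Chem  # lazy import so the rest runs without RDKit

    return Chem.MolFromSmiles(smiles) is not None


def filter_smiles(
    lines: Iterable[str], is_valid: Callable[[str], bool]
) -> Tuple[List[str], List[Rejected]]:
    """Split input lines into kept SMILES and rejected entries with reasons."""
    kept: List[str] = []
    rejected: List[Rejected] = []
    for line_no, raw in enumerate(lines, start=1):
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        smiles = fields[0]  # assumed layout: SMILES first, optional name after
        if is_valid(smiles):
            kept.append(smiles)
        else:
            rejected.append((line_no, smiles, "could not be parsed"))
    return kept, rejected
```

With RDKit installed, filter_smiles(open("data/actives.smi"), rdkit_is_valid) would return the usable molecules plus a list suitable for an audit report.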
Input
data/actives.smi
data/inactives.smi
Splitting Methods
KAST supports two splitting strategies to divide molecules into training and test sets:
Scaffold Split (Default): Groups molecules by their Bemis-Murcko core structures. This ensures that the model is evaluated on its ability to generalize to chemically distinct molecular classes (out-of-distribution structures).
Stratified Random Split: Divides molecules randomly while preserving the active/inactive class ratio in both subsets. Recommended for imbalanced datasets, where a small test set could otherwise contain too few molecules of one class for metrics such as ROC-AUC to be computed reliably.
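Both strategies can be sketched in a few lines of plain Python. The function names, the 20% test fraction, and the seed are assumptions for illustration. The scaffold version takes a scaffold_of callable instead of computing Bemis-Murcko scaffolds itself (RDKit's MurckoScaffold.MurckoScaffoldSmiles could fill that role), and it assigns the largest scaffold groups to training first, which is a common convention rather than something this guide confirms for KAST.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Split = Tuple[List[int], List[int]]  # (train indices, test indices)


def stratified_split(
    labels: Sequence[int], test_fraction: float = 0.2, seed: int = 42
) -> Split:
    """Sample each class separately so both subsets keep the class ratio."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def scaffold_split(
    smiles: Sequence[str],
    scaffold_of: Callable[[str], str],
    test_fraction: float = 0.2,
) -> Split:
    """Keep every scaffold group entirely in train or entirely in test."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, smi in enumerate(smiles):
        groups[scaffold_of(smi)].append(idx)
    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n_train_target = len(smiles) - round(len(smiles) * test_fraction)
    train: List[int] = []
    test: List[int] = []
    for group in ordered:
        if len(train) + len(group) <= n_train_target:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)
```

Because whole groups move together, the scaffold split can miss the requested test fraction slightly, while the stratified split matches the class ratio up to rounding.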
Output Files
workspaces/<your_workspace>/
├── prepared_data/
│ ├── invalid_smiles_report.txt # Audit report of rejected SMILES and reasons
│ └── preparation_checkpoint.json # Metadata checkpoint of selected files & method
├── 01_train_set.csv # 291 molecules for training
└── 01_test_set.csv # 126 molecules for testing
Step 2: Featurization
Menu Option: [2]
Script: bin/2_featurization.py
What Happens
Converts SMILES into Morgan fingerprints
Generates 2048-dimensional binary vectors
Creates ML-ready feature representations
Supports parallel processing (roughly 5-10x faster, depending on available CPU cores)
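As a rough illustration of the conversion above, the sketch below turns one SMILES string into a 2048-element 0/1 vector. It assumes RDKit's AllChem.GetMorganFingerprintAsBitVect with a radius of 2; the radius and both function names are assumptions, since this guide only states the fingerprint type and the vector length.

```python
from typing import Iterable, List, Optional

N_BITS = 2048  # vector length stated in this guide


def on_bits_to_vector(on_bits: Iterable[int], n_bits: int = N_BITS) -> List[int]:
    """Expand a list of set-bit positions into a dense 0/1 vector."""
    vector = [0] * n_bits
    for bit in on_bits:
        vector[bit] = 1
    return vector


def morgan_on_bits(
    smiles: str, radius: int = 2, n_bits: int = N_BITS
) -> Optional[List[int]]:
    """Return the set-bit positions of a Morgan fingerprint, or None if invalid.

    radius=2 is an assumed value, not taken from KAST.
    """
    from rdkit import Chem  # lazy imports: only needed when this is called
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fingerprint = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return list(fingerprint.GetOnBits())
```

With RDKit available, on_bits_to_vector(morgan_on_bits("CCO")) would give one row of the feature matrix.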
Menu Interaction
[2] 🧬 Featurize Molecules (Generate Fingerprints)
Featurizing training set...
Processing 798 molecules...
Parallel Processing ENABLED (6 workers)
Progress: [████████████████] 100% 2m 15s
Featurizing test set...
Processing 200 molecules...
Progress: [████████████████] 100% 0m 45s
✅ Complete! Features saved.
Output
Fingerprint data is saved internally as featurized NumPy arrays.
Pro Tips
Enable Parallel Processing for datasets > 10K molecules
This step is usually the bottleneck for large datasets
Step 3: Model Training
Menu Option: [3]
Script: bin/3_create_training.py
What Happens
Builds a neural network using DeepChem + TensorFlow
Trains on featurized training data
Learns patterns in active/inactive molecules
Ensures reproducibility with fixed seeds
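Fixing seeds usually means seeding every random number generator the training code touches. A minimal sketch, assuming a hypothetical helper set_global_seeds and an arbitrary seed of 42; NumPy and TensorFlow are seeded only when they are installed:

```python
import os
import random


def set_global_seeds(seed: int = 42) -> None:
    """Seed Python, NumPy, and TensorFlow generators where available."""
    # PYTHONHASHSEED only affects interpreters started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import tensorflow as tf

        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow not installed; nothing to seed
```

Seeding makes runs repeatable on the same machine and library versions; some GPU operations can still introduce small run-to-run differences.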
Menu Interaction
[3] 🤖 Train Model
Building neural network...
- Input: 2048 features (Morgan fingerprints)
- Hidden layers: [512, 256, 128]
- Output: Binary (active/inactive)
Training reproducibility configured (seeds: Python, NumPy, TensorFlow)
Starting training...
Epoch 1/50: loss=0.65, val_loss=0.62
Epoch 10/50: loss=0.42, val_loss=0.40
Epoch 25/50: loss=0.28, val_loss=0.31
Epoch 50/50: loss=0.18, val_loss=0.22
✅ Model trained! Saved to workspaces/<your_workspace>/trained_model/checkpoint1.pt
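To get a feel for the size of the network in the log above, the following sketch counts trainable parameters. It assumes plain fully connected layers with biases and a single output unit; the actual KAST model may differ, for example by using two output logits.

```python
from typing import List


def dense_parameter_count(n_inputs: int, hidden: List[int], n_outputs: int = 1) -> int:
    """Count weights and biases of a stack of fully connected layers."""
    sizes = [n_inputs, *hidden, n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


total = dense_parameter_count(2048, [512, 256, 128])
print(total)  # 1213441
```

Most of these parameters sit in the first layer (2048 x 512 weights), which is typical when the input is a long fingerprint vector.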
Output Files
workspaces/<your_workspace>/
├── trained_model/checkpoint1.pt # Trained neural network
└── training_metrics.txt # Training history
Step 4: Model Evaluation
Menu Option: [4] → Choose evaluation type
4.1: Main Evaluation
Script: 4_0_evaluation_main.py
Evaluating on test set (200 molecules)...
Results:
ROC-AUC Score: 0.87
Accuracy: 0.85
Sensitivity (TPR): 0.82
Specificity (TNR): 0.88
Precision: 0.86
Recall: 0.82
F1-Score: 0.84
✅ Report saved: 4_0_evaluation_report.txt
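Apart from ROC-AUC, which is computed from the ranked scores, the metrics in this report all derive from the four confusion-matrix counts. The sketch below uses the standard definitions with a hypothetical function name and made-up counts; it assumes every denominator is non-zero.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate; identical to recall
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": sensitivity,
        "f1": f1,
    }


# Illustrative counts only, not KAST output.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Sensitivity and recall are the same quantity, which is why the report shows identical values for both.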
4.2: Cross-Validation
Script: 4_1_cross_validation.py
Running 5-fold cross-validation...
Fold 1/5: AUC=0.85, Accuracy=0.84
Fold 2/5: AUC=0.86, Accuracy=0.85
Fold 3/5: AUC=0.87, Accuracy=0.86
Fold 4/5: AUC=0.88, Accuracy=0.87
Fold 5/5: AUC=0.86, Accuracy=0.85
Mean AUC: 0.864 ± 0.011
Mean Accuracy: 0.854 ± 0.011
✅ Results saved
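The summary lines can be reproduced from the per-fold values. The sketch below builds simple interleaved folds (no shuffling, which is an assumption) and summarizes the AUC values from the log with the sample standard deviation; the sample formula is the one that matches the reported ±0.011, though the guide does not state which formula KAST uses.

```python
from statistics import mean, stdev
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, held_out) index pairs using interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        held_out = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, held_out))
    return splits


fold_auc = [0.85, 0.86, 0.87, 0.88, 0.86]  # values from the log above
print(f"Mean AUC: {mean(fold_auc):.3f} ± {stdev(fold_auc):.3f}")
# Mean AUC: 0.864 ± 0.011
```

Every molecule is held out exactly once, so the mean reflects performance across the whole dataset rather than a single test split.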
4.3: Enrichment Factor
Script: 4_2_enrichment_factor.py
Calculating enrichment factor...
- 10% screened: 3.2x better than random
- 20% screened: 2.1x better than random
- 50% screened: 1.5x better than random
✅ Plot saved: enrichment_curve.png
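An enrichment factor compares the hit rate among the top-ranked molecules with the hit rate of the whole set. A minimal sketch using the usual definition, with a hypothetical function name and toy data; labels are 1 for active and 0 for inactive:

```python
from typing import Sequence


def enrichment_factor(
    scores: Sequence[float], labels: Sequence[int], fraction: float
) -> float:
    """Hit rate in the top `fraction` of ranked molecules over the base rate."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one molecule and one active")
    n_top = max(1, int(n_total * fraction))
    # Rank by predicted score, highest first.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    return (hits / n_top) / (n_actives / n_total)


# Toy data: both actives are ranked at the top.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef_20 = enrichment_factor(toy_scores, toy_labels, 0.2)
```

A value of 1.0 means the ranking is no better than random selection, and screening 100% of the set always gives exactly 1.0.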
4.4: Tanimoto Similarity
Script: 4_3_tanimoto_similarity.py
Computing molecular similarity...
Calculating fingerprints for 798 training molecules...
Calculating fingerprints for 200 test molecules...
Parallel Processing ENABLED (6 workers)
---------------------------------------------------------
TANIMOTO SIMILARITY ANALYSIS SUMMARY
---------------------------------------------------------
Train internal mean sim : 0.6500
Test mean similarity : 0.4500
Test minimum similarity : 0.1200
Test maximum similarity : 0.8900
---------------------------------------------------------
✅ Report saved: 4_3_tanimoto_similarity_results.txt
✅ Histogram saved: 4_3_tanimoto_similarity_histogram.png
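For binary fingerprints, Tanimoto similarity is the number of shared set bits divided by the number of bits set in either fingerprint. The sketch below represents each fingerprint as a Python set of bit positions. Summarizing each test molecule by its most similar training molecule is an assumption; the report does not say whether it uses nearest-neighbour or all-pairs similarity.

```python
from typing import List, Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """|A ∩ B| / |A ∪ B| for two sets of on-bit positions."""
    union = len(a | b)
    # Two empty fingerprints are treated as similarity 0.0 by convention here.
    return len(a & b) / union if union else 0.0


def nearest_train_similarity(
    test_fps: Sequence[Set[int]], train_fps: Sequence[Set[int]]
) -> List[float]:
    """For each test fingerprint, the highest similarity to any training one."""
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]
```

Low values for test molecules indicate structures that are far from anything seen in training, which is what a scaffold split is designed to produce.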
4.5: Learning Curve
Script: 4_4_learning_curve.py
Generating learning curves...
• Training pool: 798 samples
• Validation set: 200 samples
• Computing 10 curve points with 5 splits each...
Progress: 100%|██████████████████| 10/10 [02:15<00:00]
📊 Results:
Point 1: 79 samples → Train: 0.701±0.052 | Val: 0.680±0.061
Point 5: 399 samples → Train: 0.850±0.015 | Val: 0.841±0.020
Point 10: 798 samples → Train: 0.880±0.005 | Val: 0.871±0.008
------------------------------------------------------------
LEARNING CURVE SUMMARY
------------------------------------------------------------
Final Training AUC : 0.8805
Final Validation AUC : 0.8715
Gap (Train - Validation) : 0.0090
------------------------------------------------------------
Plot saved: 4_4_learning_curve.png
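The sample sizes and the gap in the output above follow from simple arithmetic. This sketch assumes ten evenly spaced fractions of the training pool with floor division, which reproduces the 79, 399, and 798 samples shown; the function name is hypothetical.

```python
from typing import List


def learning_curve_sizes(pool_size: int, n_points: int = 10) -> List[int]:
    """Evenly spaced training-subset sizes, rounded down."""
    return [pool_size * i // n_points for i in range(1, n_points + 1)]


sizes = learning_curve_sizes(798)
print(sizes[0], sizes[4], sizes[9])  # 79 399 798

# Gap between the final training and validation AUC from the summary.
gap = 0.8805 - 0.8715
print(f"{gap:.4f}")  # 0.0090
```

A small gap that shrinks as samples are added suggests limited overfitting, while a validation curve that is still rising suggests more data would help.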
Step 5: Featurize New Molecules
Menu Option: [5] → [1]
Script: bin/5_0_featurize_for_prediction.py
What Happens
Converts your screening library into fingerprints
Uses same parameters as training
Prepares data for prediction
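One way to guarantee that a screening library is featurized with the same parameters as the training set is to store those parameters at training time and reload them before prediction. The sketch below illustrates the idea only; the file name, the parameter keys, and both functions are hypothetical and not taken from KAST.

```python
import json
from pathlib import Path


def save_featurizer_params(path: Path, params: dict) -> None:
    """Write the featurization settings used for training to a JSON file."""
    path.write_text(json.dumps(params, indent=2, sort_keys=True))


def load_featurizer_params(path: Path, expected_n_bits: int) -> dict:
    """Reload the settings and check they match the model's input size."""
    params = json.loads(path.read_text())
    if params.get("n_bits") != expected_n_bits:
        raise ValueError(
            f"fingerprint length {params.get('n_bits')} does not match "
            f"the model input size {expected_n_bits}"
        )
    return params
```

A mismatch, such as 1024-bit fingerprints fed to a model trained on 2048 bits, is then caught before any prediction is made.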
Menu Interaction
[1] 🔮 Featurize New Molecules
Select library file to featurize
Available files in data/:
1. my_library.smi (1,500 molecules)
2. zinc_subset.smi (5,000 molecules)
Enter choice: 1
Processing 1,500 molecules...
Parallel Processing ENABLED (6 workers)
Progress: [████████████████] 100% 1m 30s
✅ Featurized! Ready for prediction.
Output
Featurized library saved internally
Step 5.1: Run Predictions
Menu Option: [5] → [2]
Script: bin/5_1_run_prediction.py
What Happens
Runs trained model on new molecules
Generates K-Prediction Score (0-1)
Ranks results by predicted activity
Exports CSV
Menu Interaction
[2] 🎯 Run Predictions
Select featurized library: my_library_featurized
Running predictions on 1,500 molecules...
Progress: [████████████████] 100% 0m 45s
Prediction Results:
Total: 1,500 molecules
Predicted Active (score > 0.5): 437 molecules
Predicted Inactive (score ≤ 0.5): 1,063 molecules
Ranking by K-Prediction Score...
Export options:
[1] Top 100 predictions
[2] All predictions
[3] Custom threshold
Enter choice: 1
Exporting top 100 to CSV...
✅ Saved: workspaces/<your_workspace>/<custom_filename>.csv
Output Files
workspaces/<your_workspace>/
└── <custom_filename>.csv # Default: 05_new_molecule_predictions.csv
Contents:
SMILES K-Score Predicted_Class
CCc1ccc(cc1)O 0.94 Active
Cc1ccccc1C 0.92 Active
CCCc1ccccc1O 0.87 Active
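The ranking, thresholding, and export steps can be sketched with the standard library. The column names and the score > 0.5 rule follow the output shown above; the function names and the two-decimal formatting are assumptions.

```python
import csv
from typing import Iterable, List, Optional, TextIO, Tuple

Row = Tuple[str, float, str]  # (SMILES, score, predicted class)


def rank_predictions(
    scored: Iterable[Tuple[str, float]], threshold: float = 0.5
) -> List[Row]:
    """Sort by score (highest first) and label each molecule."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [
        (smiles, score, "Active" if score > threshold else "Inactive")
        for smiles, score in ranked
    ]


def export_predictions(
    rows: List[Row], handle: TextIO, top_n: Optional[int] = None
) -> int:
    """Write all rows, or only the first top_n, as CSV; return rows written."""
    writer = csv.writer(handle)
    writer.writerow(["SMILES", "K-Score", "Predicted_Class"])
    selected = rows if top_n is None else rows[:top_n]
    for smiles, score, label in selected:
        writer.writerow([smiles, f"{score:.2f}", label])
    return len(selected)
```

Passing top_n=100 corresponds to export option [1], and top_n=None to option [2]; a molecule scoring exactly 0.5 is labelled Inactive because the comparison is strict.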
Step 8: Advanced Options (Configuration)
Menu Option: [8]
KAST allows power users to configure the internal mechanics without editing Python files:
[1] Environment Check
Runs a diagnostic script to verify that Conda, RDKit, DeepChem, and GPU (CUDA) support are properly installed.
[3] Configure Parallel Processing
Adjust the number of CPU cores allocated for heavy tasks (featurization, predictions).
[4] Model Hyperparameters Configuration
An interactive menu where you can dynamically modify:
Neural Network Layers (e.g., [1000, 500])
Dropout Rate
Training Epochs
Learning Rate
Note: Changes to model hyperparameters take effect on the next training run (Step 3).
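The four settings above could be held in a small validated structure like the one sketched here. The class and field names are hypothetical. The layer sizes and epoch count mirror the training log in Step 3, while the dropout and learning-rate defaults are placeholders, not KAST's actual values.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hyperparameters:
    """Illustrative container for the settings in the configuration menu."""

    layer_sizes: List[int] = field(default_factory=lambda: [512, 256, 128])
    dropout: float = 0.25  # placeholder default, not from KAST
    epochs: int = 50
    learning_rate: float = 1e-3  # placeholder default, not from KAST

    def validate(self) -> None:
        """Raise ValueError if any setting is outside a sensible range."""
        if not self.layer_sizes or any(n <= 0 for n in self.layer_sizes):
            raise ValueError("layer sizes must be positive integers")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")
        if self.epochs < 1:
            raise ValueError("epochs must be at least 1")
        if self.learning_rate <= 0:
            raise ValueError("learning rate must be positive")
```

For example, Hyperparameters(layer_sizes=[1000, 500]).validate() passes, while a dropout of 1.5 is rejected before any training starts.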