# Step-by-Step Guide

Detailed walkthrough of each pipeline step.

## Before You Start

- ✅ Have you completed [Installation](../getting-started/installation.md)?
- ✅ Do you have `data/actives.smi` and `data/inactives.smi` ready?
- ✅ Check [Data Preparation](data-preparation.md) for the file format

---

## Workspace Management

**Menu Option:** `[W]`

Before starting any task, KAST prompts you to create or select a **Workspace**. Workspaces are isolated project folders, so you can train a model for one protein target (e.g., `AChE`) and another for a different target (e.g., `Kinase`) without mixing up data, models, or predictions. All outputs from Steps 1-6 are saved inside `workspaces//`.

---

## Step 1: Data Preparation

**Menu Option:** `[1]`
**Script:** `bin/1_preparation.py`

### What Happens

- Reads active and inactive molecule files
- Validates SMILES structures
- Removes invalid/malformed entries
- Splits into train/test sets

### Input

```
data/actives.smi
data/inactives.smi
```

### Menu Interaction

```text
======================================================================
  K-talysticFlow | Step 1: Preparing and Splitting Data
======================================================================

Select the file with ACTIVE molecules (.smi):
  1. actives.smi
Select the file number: 1

Select the file with INACTIVE molecules (.smi):
  1. inactives.smi
Select the file number: 1

Loading actives...
✅ Loaded 131 valid active molecules
Loading inactives...
✅ Loaded 286 valid inactive molecules

======================================================================
📋 DATA SPLIT METHOD SELECTION
======================================================================
Current default (settings.py): scaffold

Available split methods:
  1. Scaffold Split (Default) - Groups by core molecular structure.
  2. Stratified Random Split - Preserves active/inactive class ratio.

Select option (1-2, or Enter for default): 2

======================================================================
📊 TRAIN/TEST SPLIT CONFIGURATION
======================================================================
Current default (settings.py): 70% train / 30% test

Select option (1-5): 4
✅ Using split: 70.0% train / 30.0% test

Splitting data using Stratified Random Split...
Split completed:
  Training set size: 291
  Test set size: 126
Class distribution:
  Train - Inactives: 200, Actives: 91
  Test  - Inactives: 86, Actives: 40
```

### Splitting Methods

KAST supports two strategies for dividing molecules into training and test sets:

- **Scaffold Split (Default)**: Groups molecules by their Bemis-Murcko core structures, so that molecules sharing a scaffold stay in the same subset. The model is then evaluated on its ability to generalize to chemically distinct molecular classes (out-of-distribution structures).
- **Stratified Random Split**: Divides molecules randomly while keeping the active/inactive class ratio the same in both subsets, as closely as the counts allow. Recommended for imbalanced datasets, where a small test set could otherwise contain too few actives for metrics such as ROC-AUC to be computed reliably.

### Output Files

```text
workspaces//
├── prepared_data/
│   ├── invalid_smiles_report.txt    # Audit report of rejected SMILES and reasons
│   └── preparation_checkpoint.json  # Metadata checkpoint of selected files & method
├── 01_train_set.csv                 # 291 molecules for training
└── 01_test_set.csv                  # 126 molecules for testing
```

---

## Step 2: Featurization

**Menu Option:** `[2]`
**Script:** `bin/2_featurization.py`

### What Happens

- Converts SMILES into Morgan fingerprints
- Generates 2048-dimensional binary vectors
- Creates ML-ready feature representations
- **Supports parallel processing** (5-10x faster!)

### Menu Interaction

```
[2] 🧬 Featurize Molecules (Generate Fingerprints)

Featurizing training set...
Processing 798 molecules...
Parallel Processing ENABLED (6 workers)
Progress: [████████████████] 100% 2m 15s

Featurizing test set...
Processing 200 molecules...
Progress: [████████████████] 100% 0m 45s

✅ Complete! Features saved.
```

### Output

Fingerprint data is saved internally as featurized NumPy arrays.

### Pro Tips

- Enable [Parallel Processing](parallel-processing.md) for datasets larger than 10K molecules
- This step is usually the bottleneck for large datasets

---

## Step 3: Model Training

**Menu Option:** `[3]`
**Script:** `bin/3_create_training.py`

### What Happens

- Builds a neural network using DeepChem + TensorFlow
- Trains on featurized training data
- Learns patterns that distinguish active from inactive molecules
- Ensures reproducibility with fixed seeds

### Menu Interaction

```
[3] 🤖 Train Model

Building neural network...
  - Input: 2048 features (Morgan fingerprints)
  - Hidden layers: [512, 256, 128]
  - Output: Binary (active/inactive)

Training reproducibility configured (seeds: Python, NumPy, TensorFlow)

Starting training...
Epoch 1/50: loss=0.65, val_loss=0.62
Epoch 10/50: loss=0.42, val_loss=0.40
Epoch 25/50: loss=0.28, val_loss=0.31
Epoch 50/50: loss=0.18, val_loss=0.22

✅ Model trained! Saved to workspaces//trained_model/checkpoint1.pt
```

### Output Files

```
workspaces//
├── trained_model/checkpoint1.pt    # Trained neural network
└── training_metrics.txt            # Training history
```

---

## Step 4: Model Evaluation

**Menu Option:** `[4]` → Choose evaluation type

### 4.1: Main Evaluation

**Script:** `4_0_evaluation_main.py`

```
Evaluating on test set (200 molecules)...

Results:
  ROC-AUC Score:      0.87
  Accuracy:           0.85
  Sensitivity (TPR):  0.82
  Specificity (TNR):  0.88
  Precision:          0.86
  Recall:             0.82
  F1-Score:           0.84

✅ Report saved: 4_0_evaluation_report.txt
```

### 4.2: Cross-Validation

**Script:** `4_1_cross_validation.py`

```
Running 5-fold cross-validation...

Fold 1/5: AUC=0.85, Accuracy=0.84
Fold 2/5: AUC=0.86, Accuracy=0.85
Fold 3/5: AUC=0.87, Accuracy=0.86
Fold 4/5: AUC=0.88, Accuracy=0.87
Fold 5/5: AUC=0.86, Accuracy=0.85

Mean AUC: 0.864 ± 0.011
Mean Accuracy: 0.854 ± 0.011

✅ Results saved
```

### 4.3: Enrichment Factor

**Script:** `4_2_enrichment_factor.py`

```
Calculating enrichment factor...
  - 10% screened: 3.2x better than random
  - 20% screened: 2.1x better than random
  - 50% screened: 1.5x better than random

✅ Plot saved: enrichment_curve.png
```

### 4.4: Tanimoto Similarity

**Script:** `4_3_tanimoto_similarity.py`

```
Computing molecular similarity...
Calculating fingerprints for 798 training molecules...
Calculating fingerprints for 200 test molecules...
Parallel Processing ENABLED (6 workers)

---------------------------------------------------------
TANIMOTO SIMILARITY ANALYSIS SUMMARY
---------------------------------------------------------
Train internal mean sim : 0.6500
Test mean similarity    : 0.4500
Test minimum similarity : 0.1200
Test maximum similarity : 0.8900
---------------------------------------------------------

✅ Report saved: 4_3_tanimoto_similarity_results.txt
✅ Histogram saved: 4_3_tanimoto_similarity_histogram.png
```

### 4.5: Learning Curve

**Script:** `4_4_learning_curve.py`

```
Generating learning curves...
  • Training pool: 798 samples
  • Validation set: 200 samples
  • Computing 10 curve points with 5 splits each...

Progress: 100%|██████████████████| 10/10 [02:15<00:00]

📊 Results:
  Point 1: 79 samples → Train: 0.701±0.052 | Val: 0.680±0.061
  Point 5: 399 samples → Train: 0.850±0.015 | Val: 0.841±0.020
  Point 10: 798 samples → Train: 0.880±0.005 | Val: 0.871±0.008

------------------------------------------------------------
LEARNING CURVE SUMMARY
------------------------------------------------------------
Final Training AUC       : 0.8805
Final Validation AUC     : 0.8715
Gap (Train - Validation) : 0.0090
------------------------------------------------------------

Plot saved: 4_4_learning_curve.png
```

---

## Step 5: Featurize New Molecules

**Menu Option:** `[5]` → `[1]`
**Script:** `bin/5_0_featurize_for_prediction.py`

### What Happens

- Converts your screening library into fingerprints
- Uses the same parameters as training
- Prepares data for prediction

### Menu Interaction

```
[1] 🔮 Featurize New Molecules

Select library file to featurize
Available files in data/:
  1. my_library.smi (1,500 molecules)
  2. zinc_subset.smi (5,000 molecules)
Enter choice: 1

Processing 1,500 molecules...
Parallel Processing ENABLED (6 workers)
Progress: [████████████████] 100% 1m 30s

✅ Featurized! Ready for prediction.
```

### Output

The featurized library is saved internally.

---

## Step 5.1: Run Predictions

**Menu Option:** `[5]` → `[2]`
**Script:** `bin/5_1_run_prediction.py`

### What Happens

- Runs the trained model on new molecules
- Generates a K-Prediction Score (0-1) for each molecule
- Ranks results by predicted activity
- Exports the results to CSV

### Menu Interaction

```
[2] 🎯 Run Predictions

Select featurized library: my_library_featurized

Running predictions on 1,500 molecules...
Progress: [████████████████] 100% 0m 45s

Prediction Results:
  Total: 1,500 molecules
  Predicted Active (score > 0.5): 437 molecules
  Predicted Inactive (score ≤ 0.5): 1,063 molecules

Ranking by K-Prediction Score...

Export options:
  [1] Top 100 predictions
  [2] All predictions
  [3] Custom threshold
Enter choice: 1

Exporting top 100 to CSV...
✅ Saved: workspaces//.csv
```

### Output Files

```
workspaces//
└── .csv    # Default: 05_new_molecule_predictions.csv

Contents:
SMILES          K-Score   Predicted_Class
CCc1ccc(cc1)O   0.94      Active
Cc1ccccc1C      0.92      Active
CCCc1ccccc1O    0.87      Active
```

---

## Step 8: Advanced Options (Configuration)

**Menu Option:** `[8]`

KAST lets power users configure its internal mechanics without editing Python files:

### [1] Environment Check

Runs a diagnostic script to verify that Conda, RDKit, DeepChem, and GPU (CUDA) support are properly installed.

### [3] Configure Parallel Processing

Adjusts the number of CPU cores allocated to heavy tasks (featurization, predictions).

### [4] Model Hyperparameters Configuration

An interactive menu where you can modify:

- **Neural Network Layers** (e.g., `[1000, 500]`)
- **Dropout Rate**
- **Training Epochs**
- **Learning Rate**

*Note: Changes to model hyperparameters take effect on the next training run (Step 3).*
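
If the menu is unavailable (for example, when debugging a fresh install), part of the `[1]` Environment Check can be approximated by hand. The sketch below is not KAST's diagnostic script. It only tests whether the Python packages named in this guide can be located, using the standard library. The function name `check_environment` and the package list are illustrative assumptions, and the sketch does not probe Conda or CUDA.

```python
import importlib.util

# Import names of the packages mentioned in this guide.
# Assumed list for illustration; adjust it to your own setup.
REQUIRED_PACKAGES = ("rdkit", "deepchem", "tensorflow")


def check_environment(packages=REQUIRED_PACKAGES):
    """Return a dict mapping each package name to True if it can be located."""
    status = {}
    for name in packages:
        # find_spec locates a top-level module without importing it,
        # so the check stays fast even for heavy libraries.
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:<12} {'OK' if found else 'MISSING'}")
```

A package reported as `MISSING` points back to the Installation page. A package reported as `OK` is only known to be locatable, so the built-in check remains the better test of a working GPU setup.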