# Step-by-Step Guide

Detailed walkthrough of each pipeline step.


## Before You Start

✅ Have you completed [Installation](../getting-started/installation.md)?  
✅ Do you have `data/actives.smi` and `data/inactives.smi` ready?  
✅ Have you checked [Data Preparation](data-preparation.md) for the expected file format?

---

## Workspace Management

**Menu Option:** `[W]`

Before you start any task, KAST prompts you to create or select a **Workspace**.
A workspace is an isolated project folder, so you can train a model for one protein target (e.g., `AChE`) and another for a different target (e.g., `Kinase`) without mixing up data, models, or predictions.
All outputs from Steps 1-6 are saved inside `workspaces/<your_workspace>/`.
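
As a rough illustration of that layout, the sketch below builds output paths inside a workspace folder. The `workspace_path` helper is hypothetical and not part of KAST; only the `workspaces/<your_workspace>/` convention and the file name are taken from this guide.

```python
from pathlib import Path

WORKSPACES_ROOT = Path("workspaces")  # root folder named in this guide


def workspace_path(workspace: str, *parts: str) -> Path:
    """Return a path inside workspaces/<workspace>/ (hypothetical helper)."""
    bad_name = (
        not workspace
        or "/" in workspace
        or "\\" in workspace
        or workspace in {".", ".."}
    )
    if bad_name:
        raise ValueError(f"invalid workspace name: {workspace!r}")
    return WORKSPACES_ROOT.joinpath(workspace, *parts)


train_csv = workspace_path("AChE", "01_train_set.csv")
print(train_csv.as_posix())  # workspaces/AChE/01_train_set.csv
```

Because every path is derived from the workspace name, two targets such as `AChE` and `Kinase` never write to the same files.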

---

## Step 1: Data Preparation

**Menu Option:** `[1]`  
**Script:** `bin/1_preparation.py`

### What Happens
- Reads active and inactive molecule files
- Validates SMILES structures
- Removes invalid/malformed entries
- Splits into train/test sets
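
The validation step can be pictured with the minimal sketch below. It assumes RDKit is used for parsing and that each `.smi` line starts with a SMILES string, optionally followed by a name. The function name and the rejection reason text are illustrative, not KAST's actual code.

```python
try:
    from rdkit import Chem  # RDKit is required for real validation
except ImportError:  # keeps the module importable on machines without RDKit
    Chem = None


def validate_smiles(lines):
    """Split raw .smi lines into (valid, rejected) lists.

    `rejected` holds (smiles, reason) pairs, similar in spirit to an
    invalid-SMILES audit report. Requires RDKit.
    """
    if Chem is None:
        raise RuntimeError("RDKit is not installed")
    valid, rejected = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        smiles = line.split()[0]  # first token is the SMILES string
        mol = Chem.MolFromSmiles(smiles)  # returns None when parsing fails
        if mol is None:
            rejected.append((smiles, "RDKit could not parse SMILES"))
        else:
            valid.append(smiles)
    return valid, rejected
```

For example, `validate_smiles(["CCO", "C1CC"])` keeps ethanol and rejects `C1CC`, whose ring is never closed.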

### Input
```
data/actives.smi
data/inactives.smi
```

### Menu Interaction
```text
======================================================================
        K-talysticFlow | Step 1: Preparing and Splitting Data         
======================================================================

Select the file with ACTIVE molecules (.smi):
1. actives.smi
Select the file number: 1
Select the file with INACTIVE molecules (.smi):
1. inactives.smi
Select the file number: 1

Loading actives... ✅ Loaded 131 valid active molecules
Loading inactives... ✅ Loaded 286 valid inactive molecules

======================================================================
📋 DATA SPLIT METHOD SELECTION
======================================================================
Current default (settings.py): scaffold

Available split methods:
  1. Scaffold Split (Default) - Groups by core molecular structure.
  2. Stratified Random Split - Preserves active/inactive class ratio.

Select option (1-2, or Enter for default): 2

======================================================================
📊 TRAIN/TEST SPLIT CONFIGURATION
======================================================================
Current default (settings.py): 70% train / 30% test

Select option (1-5): 4
✅ Using split: 70.0% train / 30.0% test

Splitting data using Stratified Random Split...
Split completed:
  Training set size: 291
  Test set size: 126

Class distribution:
  Train - Inactives: 200, Actives: 91
  Test  - Inactives: 86, Actives: 40
```

### Splitting Methods

KAST supports two strategies for dividing molecules into training and test sets:
- **Scaffold Split (Default)**: Groups molecules by their Bemis-Murcko scaffolds, so molecules that share a core structure end up in the same subset. The model is then evaluated on scaffolds it has not seen during training, which tests generalization to chemically distinct (out-of-distribution) structures.
- **Stratified Random Split**: Assigns molecules at random while keeping the active/inactive ratio approximately equal in both subsets. Recommended for imbalanced datasets, where a small test set could otherwise contain too few actives for metrics such as ROC-AUC to be meaningful.
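
A stratified random split can be sketched in plain Python as follows. This is an illustration of the idea, not KAST's implementation: the function name, the seed, and the choice to round each class's training count down are assumptions.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Rounding down is an assumption made for this sketch
        cut = int(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 131 + [0] * 286  # class counts from the sample run above
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 291 126
```

With these class counts and a 70/30 split, the sketch yields 91 actives and 200 inactives for training, the same distribution as in the sample output.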

### Output Files
```text
workspaces/<your_workspace>/
├── prepared_data/
│   ├── invalid_smiles_report.txt  # Audit report of rejected SMILES and reasons
│   └── preparation_checkpoint.json # Metadata checkpoint of selected files & method
├── 01_train_set.csv               # 291 molecules for training
└── 01_test_set.csv                # 126 molecules for testing
```

---

## Step 2: Featurization

**Menu Option:** `[2]`  
**Script:** `bin/2_featurization.py`

### What Happens
- Converts SMILES into Morgan fingerprints
- Generates 2048-dimensional binary vectors
- Creates ML-ready feature representations
- **Supports parallel processing** (up to roughly 5-10x faster, depending on the number of CPU cores)
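
To make the fingerprint step concrete, here is a minimal sketch that turns one SMILES string into a 2048-bit Morgan fingerprint with RDKit's fingerprint generator. The radius of 2 is an assumption (a common choice); this guide only states the vector length, and KAST's own featurizer may be configured differently.

```python
try:
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator
except ImportError:  # keeps the module importable on machines without RDKit
    Chem = None

N_BITS = 2048  # vector length stated in this guide
RADIUS = 2     # assumption: common Morgan radius, not confirmed for KAST


def morgan_bits(smiles):
    """Return a list of 2048 ints (0 or 1) for one SMILES string. Requires RDKit."""
    if Chem is None:
        raise RuntimeError("RDKit is not installed")
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"invalid SMILES: {smiles!r}")
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=RADIUS, fpSize=N_BITS)
    fingerprint = generator.GetFingerprint(mol)  # RDKit ExplicitBitVect
    return list(fingerprint)
```

Each molecule is processed independently, which is why this step parallelizes well across CPU cores.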

### Menu Interaction
```
[2] 🧬 Featurize Molecules (Generate Fingerprints)

Featurizing training set...
Processing 798 molecules...
Parallel Processing ENABLED (6 workers)
Progress: [████████████████] 100% 2m 15s

Featurizing test set...
Processing 200 molecules...
Progress: [████████████████] 100% 0m 45s

✅ Complete! Features saved.
```

### Output
Fingerprint data is saved internally as featurized NumPy arrays.

### Pro Tips
- Enable [Parallel Processing](parallel-processing.md) for datasets larger than 10,000 molecules
- This step is usually the bottleneck for large datasets

---

## Step 3: Model Training

**Menu Option:** `[3]`  
**Script:** `bin/3_create_training.py`

### What Happens
- Builds a neural network using DeepChem + TensorFlow
- Trains on featurized training data
- Learns patterns in active/inactive molecules
- Ensures reproducibility with fixed seeds
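
Fixing seeds typically looks like the sketch below, which covers the three sources named in the training log (Python, NumPy, TensorFlow). The function name and default seed are hypothetical. NumPy and TensorFlow are seeded only if they are installed.

```python
import random


def set_global_seed(seed: int = 42) -> None:
    """Seed Python's RNG, plus NumPy and TensorFlow when available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass  # TensorFlow not installed; nothing to seed


set_global_seed(42)
```

Seeding makes runs repeatable on the same machine and library versions; some GPU operations can still introduce small run-to-run differences.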

### Menu Interaction
```
[3] 🤖 Train Model

Building neural network...
- Input: 2048 features (Morgan fingerprints)
- Hidden layers: [512, 256, 128]
- Output: Binary (active/inactive)

Training reproducibility configured (seeds: Python, NumPy, TensorFlow)

Starting training...
Epoch 1/50: loss=0.65, val_loss=0.62
Epoch 10/50: loss=0.42, val_loss=0.40
Epoch 25/50: loss=0.28, val_loss=0.31
Epoch 50/50: loss=0.18, val_loss=0.22

✅ Model trained! Saved to workspaces/<your_workspace>/trained_model/checkpoint1.pt
```

### Output Files
```
workspaces/<your_workspace>/
├── trained_model/checkpoint1.pt  # Trained neural network
└── training_metrics.txt  # Training history
```

---

## Step 4: Model Evaluation

**Menu Option:** `[4]` → Choose evaluation type

### 4.1: Main Evaluation
**Script:** `4_0_evaluation_main.py`

```
Evaluating on test set (200 molecules)...

Results:
  ROC-AUC Score: 0.87
  Accuracy: 0.85
  Sensitivity (TPR): 0.82
  Specificity (TNR): 0.88
  Precision: 0.86
  Recall: 0.82
  F1-Score: 0.84

✅ Report saved: 4_0_evaluation_report.txt
```
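
The reported metrics all derive from the confusion matrix. The sketch below shows the standard formulas. The counts in the example are hypothetical and are not taken from a KAST run. Note that sensitivity and recall are the same quantity, which is why they show the same value in the report.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one active, one inactive,
    and one positive prediction).
    """
    sensitivity = tp / (tp + fn)  # true positive rate, identical to recall
    specificity = tn / (tn + fp)  # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for a 200-molecule test set
metrics = classification_metrics(tp=82, fp=12, tn=88, fn=18)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

ROC-AUC is not included here because it is computed from the ranked prediction scores rather than from a single confusion matrix.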

### 4.2: Cross-Validation
**Script:** `4_1_cross_validation.py`

```
Running 5-fold cross-validation...

Fold 1/5: AUC=0.85, Accuracy=0.84
Fold 2/5: AUC=0.86, Accuracy=0.85
Fold 3/5: AUC=0.87, Accuracy=0.86
Fold 4/5: AUC=0.88, Accuracy=0.87
Fold 5/5: AUC=0.86, Accuracy=0.85

Mean AUC: 0.864 ± 0.011
Mean Accuracy: 0.854 ± 0.011

✅ Results saved
```
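
The summary line can be reproduced from the per-fold values. The sketch below uses the fold AUCs from the sample output. The reported spread of 0.011 matches the sample standard deviation (n - 1 in the denominator); whether KAST uses that exact estimator is an assumption.

```python
from statistics import mean, stdev

fold_auc = [0.85, 0.86, 0.87, 0.88, 0.86]  # per-fold AUC values from the sample output

# stdev() is the sample standard deviation (divides by n - 1)
print(f"Mean AUC: {mean(fold_auc):.3f} ± {stdev(fold_auc):.3f}")  # Mean AUC: 0.864 ± 0.011
```

A small spread across folds suggests that performance does not depend strongly on which molecules land in the validation fold.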

### 4.3: Enrichment Factor
**Script:** `4_2_enrichment_factor.py`

```
Calculating enrichment factor...
- 10% screened: 3.2x better than random
- 20% screened: 2.1x better than random
- 50% screened: 1.5x better than random

✅ Plot saved: enrichment_curve.png
```
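
The enrichment factor at a given screening fraction compares the hit rate among the top-ranked molecules with the hit rate of the whole set. The sketch below uses the standard definition; the function name, the rounding of the top-set size, and the toy data are assumptions.

```python
def enrichment_factor(scores, labels, fraction):
    """EF = (actives in top fraction / size of top fraction) / (all actives / all molecules)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one molecule and one active")
    n_top = max(1, round(n_total * fraction))  # number of molecules screened
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    return (hits / n_top) / (n_actives / n_total)


# Toy data: 10 molecules, 2 actives ranked 1st and 4th by score
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(f"{enrichment_factor(scores, labels, 0.2):.1f}")  # 2.5
print(f"{enrichment_factor(scores, labels, 0.5):.1f}")  # 2.0
```

A value of 1.0 means the ranking is no better than random selection, and screening the full set always gives exactly 1.0.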

### 4.4: Tanimoto Similarity
**Script:** `4_3_tanimoto_similarity.py`

```
Computing molecular similarity...
Calculating fingerprints for 798 training molecules...
Calculating fingerprints for 200 test molecules...
Parallel Processing ENABLED (6 workers)

---------------------------------------------------------
           TANIMOTO SIMILARITY ANALYSIS SUMMARY          
---------------------------------------------------------
  Train internal mean sim : 0.6500
  Test mean similarity    : 0.4500
  Test minimum similarity : 0.1200
  Test maximum similarity : 0.8900
---------------------------------------------------------

✅ Report saved: 4_3_tanimoto_similarity_results.txt
✅ Histogram saved: 4_3_tanimoto_similarity_histogram.png
```
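
The Tanimoto coefficient between two binary fingerprints is the number of shared on-bits divided by the number of bits set in either one. The sketch below represents each fingerprint as a set of on-bit indices. Taking the highest similarity to any training molecule is one common way to summarize a test molecule, and it is an assumption that the report uses that definition.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_train_similarity(test_fp, train_fps):
    """Highest similarity between one test fingerprint and any training fingerprint."""
    return max(tanimoto(test_fp, fp) for fp in train_fps)


train = [{1, 5, 9}, {2, 5, 30}]  # toy training fingerprints
test = {5, 9, 12, 20}            # toy test fingerprint
print(tanimoto(test, train[0]))               # 0.4
print(nearest_train_similarity(test, train))  # 0.4
```

Low similarity between test and training molecules indicates that the test set probes structures the model has not seen closely before.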

### 4.5: Learning Curve
**Script:** `4_4_learning_curve.py`

```
Generating learning curves...
  • Training pool: 798 samples
  • Validation set: 200 samples
  • Computing 10 curve points with 5 splits each...

  Progress: 100%|██████████████████| 10/10 [02:15<00:00]

  📊 Results:
    Point  1:  79 samples → Train: 0.701±0.052 | Val: 0.680±0.061
    Point  5: 399 samples → Train: 0.850±0.015 | Val: 0.841±0.020
    Point 10: 798 samples → Train: 0.880±0.005 | Val: 0.871±0.008

------------------------------------------------------------
               LEARNING CURVE SUMMARY               
------------------------------------------------------------
  Final Training AUC         : 0.8805
  Final Validation AUC       : 0.8715
  Gap (Train - Validation)   : 0.0090
------------------------------------------------------------

Plot saved: 4_4_learning_curve.png
```
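
The sample sizes and the final gap in this output can be reproduced with a few lines. The sketch below assumes ten evenly spaced fractions of the 798-sample training pool, rounded down. That assumption matches points 1, 5, and 10 of the sample output but is not confirmed as KAST's exact rule.

```python
def subset_sizes(pool_size, n_points=10):
    """Evenly spaced training-subset sizes from pool_size/n_points up to pool_size."""
    return [int(pool_size * k / n_points) for k in range(1, n_points + 1)]


sizes = subset_sizes(798)
print(sizes[0], sizes[4], sizes[-1])  # 79 399 798

# Gap between final training and validation AUC, as in the summary table
gap = 0.8805 - 0.8715
print(f"{gap:.4f}")  # 0.0090
```

A gap that shrinks as the subset grows, as it does here, indicates that adding data reduces overfitting.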

---

## Step 5: Featurize New Molecules

**Menu Option:** `[5]` → `[1]`  
**Script:** `bin/5_0_featurize_for_prediction.py`

### What Happens
- Converts your screening library into fingerprints
- Uses same parameters as training
- Prepares data for prediction

### Menu Interaction
```
[1] 🔮 Featurize New Molecules

Select library file to featurize
Available files in data/:
  1. my_library.smi (1,500 molecules)
  2. zinc_subset.smi (5,000 molecules)

Enter choice: 1

Processing 1,500 molecules...
Parallel Processing ENABLED (6 workers)
Progress: [████████████████] 100% 1m 30s

✅ Featurized! Ready for prediction.
```

### Output
The featurized library is saved internally.

---

## Step 5.1: Run Predictions

**Menu Option:** `[5]` → `[2]`  
**Script:** `bin/5_1_run_prediction.py`

### What Happens
- Runs trained model on new molecules
- Generates K-Prediction Score (0-1)
- Ranks results by predicted activity
- Exports CSV

### Menu Interaction
```
[2] 🎯 Run Predictions

Select featurized library: my_library_featurized
Running predictions on 1,500 molecules...
Progress: [████████████████] 100% 0m 45s

Prediction Results:
  Total: 1,500 molecules
  Predicted Active (score > 0.5): 437 molecules
  Predicted Inactive (score ≤ 0.5): 1,063 molecules

Ranking by K-Prediction Score...

Export options:
  [1] Top 100 predictions
  [2] All predictions
  [3] Custom threshold

Enter choice: 1

Exporting top 100 to CSV...
✅ Saved: workspaces/<your_workspace>/<custom_filename>.csv
```

### Output Files
```
workspaces/<your_workspace>/
└── <custom_filename>.csv             # Default: 05_new_molecule_predictions.csv

Contents:
SMILES                              K-Score  Predicted_Class
CCc1ccc(cc1)O                       0.94     Active
Cc1ccccc1C                          0.92     Active
CCCc1ccccc1O                        0.87     Active
```
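
The ranking and export logic can be sketched as follows. The column names and the 0.5 threshold come from this guide; the function names, the comma-separated layout, and the fourth example molecule with its score are assumptions for illustration.

```python
import csv
import io

FIELDS = ["SMILES", "K-Score", "Predicted_Class"]


def rank_predictions(scored, threshold=0.5, top_n=None):
    """Label each (smiles, score) pair and sort by score, highest first."""
    rows = [
        {
            "SMILES": smiles,
            "K-Score": score,
            "Predicted_Class": "Active" if score > threshold else "Inactive",
        }
        for smiles, score in scored
    ]
    rows.sort(key=lambda row: row["K-Score"], reverse=True)
    return rows if top_n is None else rows[:top_n]


def to_csv(rows):
    """Serialize ranked rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


scored = [
    ("CCCc1ccccc1O", 0.87),
    ("CCO", 0.31),  # hypothetical low-scoring molecule
    ("CCc1ccc(cc1)O", 0.94),
    ("Cc1ccccc1C", 0.92),
]
print(to_csv(rank_predictions(scored, top_n=3)))
```

Passing `top_n=None` corresponds to exporting all predictions, and a different `threshold` corresponds to the custom threshold option.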

---

## Step 8: Advanced Options (Configuration)

**Menu Option:** `[8]`

KAST lets advanced users adjust internal settings without editing Python files:

### [1] Environment Check
Runs a diagnostic script to verify that Conda, RDKit, DeepChem, and GPU (CUDA) support are properly installed.

### [3] Configure Parallel Processing
Adjust the number of CPU cores allocated for heavy tasks (featurization, predictions).

### [4] Model Hyperparameters Configuration
Opens an interactive menu where you can modify:
- **Neural Network Layers** (e.g., `[1000, 500]`)
- **Dropout Rate**
- **Training Epochs**
- **Learning Rate**

*Note: Changes to model hyperparameters take effect on the next training run (Step 3).*
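
Conceptually, these settings form a small configuration object that is validated before the next training run. The sketch below is hypothetical: the class name, field names, and the dropout and learning-rate defaults are assumptions, while the `[1000, 500]` layer example and the 50 epochs are taken from this guide.

```python
from dataclasses import dataclass, field


@dataclass
class ModelHyperparameters:
    """Hypothetical container for the settings listed above."""
    layer_sizes: list = field(default_factory=lambda: [1000, 500])
    dropout_rate: float = 0.25    # assumed default
    epochs: int = 50              # value shown in the Step 3 sample log
    learning_rate: float = 0.001  # assumed default

    def validate(self) -> None:
        """Raise ValueError if any setting is out of range."""
        if not self.layer_sizes or any(size <= 0 for size in self.layer_sizes):
            raise ValueError("layer_sizes must contain positive integers")
        if not 0.0 <= self.dropout_rate < 1.0:
            raise ValueError("dropout_rate must be in [0, 1)")
        if self.epochs <= 0:
            raise ValueError("epochs must be positive")
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")


params = ModelHyperparameters(layer_sizes=[512, 256, 128], epochs=100)
params.validate()
print(params)
```

Validating at configuration time surfaces mistakes such as a dropout rate above 1 before a long training run starts.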