Pipeline Overview
K-talysticFlow is a 5-step automated pipeline for molecular screening. This page describes what each step does.
The 5 Steps
Step 1: Data Preparation
File: bin/1_preparation.py
What it does:
Reads active and inactive molecule files (SMILES format)
Validates SMILES structures using RDKit
Removes invalid/duplicate molecules and logs them in an audit report (invalid_smiles_report.txt)
Splits into train/test sets (e.g., 80/20)
Input: data/actives.smi + data/inactives.smi
Output: train_set.csv + test_set.csv + invalid_smiles_report.txt
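The deduplicate-and-split part of this step can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code in bin/1_preparation.py: the function name `dedupe_and_split`, the seed value, and the per-class split are assumptions, and SMILES validation with RDKit is presumed to have happened already. Duplicates are detected here by exact string match only.

```python
import random


def dedupe_and_split(actives, inactives, test_fraction=0.2, seed=42):
    """Drop duplicate SMILES strings, then split each class into train/test.

    Hypothetical helper: assumes the inputs were already validated.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label, smiles_list in ((1, actives), (0, inactives)):
        # dict.fromkeys removes exact duplicates and keeps first-seen order
        unique = list(dict.fromkeys(smiles_list))
        rng.shuffle(unique)
        n_test = round(len(unique) * test_fraction)
        test += [(smi, label) for smi in unique[:n_test]]
        train += [(smi, label) for smi in unique[n_test:]]
    return train, test


actives = ["CCO", "CCN", "CCO", "c1ccccc1", "CCC", "CCCl"]
inactives = ["C", "CC", "CO", "CN", "CF", "CC", "CBr", "CI", "CS", "CP", "C=O"]
train, test = dedupe_and_split(actives, inactives)
print(len(train), len(test))
```

Splitting each class separately keeps the active/inactive ratio roughly the same in both sets, which is one reasonable design choice among several.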
Step 2: Featurization
File: bin/2_featurization.py
What it does:
Converts SMILES into machine learning features (fingerprints)
Uses Morgan fingerprints (ECFP6, radius=3)
Generates numerical vectors from molecular structures
Supports parallel processing (5-10x faster on large datasets)
Input: Train/test CSV files
Output: Featurized numpy arrays
Step 3: Model Training
File: bin/3_create_training.py
What it does:
Builds a neural network using DeepChem + TensorFlow
Trains on fingerprint features
Learns to predict molecular activity
Ensures reproducibility with fixed random seeds
Input: Featurized training data
Output: Trained model file + training metrics
Step 4: Model Evaluation
Menu Option: [4], then choose the evaluation type
Available evaluations:
| Evaluation | File | Purpose |
|---|---|---|
| Main Evaluation | | ROC/AUC, accuracy, precision, recall, F1 |
| Cross-Validation | | Verify model generalization (5-fold CV) |
| Enrichment Factor | | How much better than random screening |
| Tanimoto Similarity | | Chemical space comparison (Train internal vs. Test) |
| Learning Curve | | Model performance vs. training size |
Input: Trained model + test data
Output: Plots + detailed metrics
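Two of the evaluations listed above reduce to short formulas. The sketch below shows the usual definitions of the enrichment factor (hit rate among the top-scored fraction divided by the hit rate in the whole set) and of Tanimoto similarity (shared "on" bits divided by the union of "on" bits). The function names and the `top_fraction` parameter are illustrative assumptions, not the pipeline's API.

```python
def enrichment_factor(scores, labels, top_fraction=0.1):
    """Hit rate in the top-scored fraction divided by the overall hit rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    hits_top = sum(label for _, label in ranked[:n_top])
    total_hits = sum(labels)
    if total_hits == 0:
        raise ValueError("enrichment factor is undefined without any actives")
    return (hits_top / n_top) / (total_hits / len(labels))


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two collections of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef = enrichment_factor(scores, labels, top_fraction=0.2)
sim = tanimoto([1, 2, 3], [2, 3, 4])
print(ef, sim)
```

An enrichment factor of 1 means the ranking is no better than random selection; larger values mean actives are concentrated near the top of the list.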
Step 5: Prediction
Files: bin/5_0_featurize_for_prediction.py + bin/5_1_run_prediction.py
Step 5.0: Featurize for Prediction
New molecules are converted to the same 2048-bit sparse binary vector representation
used during training (ECFP/Morgan fingerprints, radius=3). The resulting feature matrix
is stored in HDF5 format (.h5) as a CSR (Compressed Sparse Row) sparse matrix to
minimize memory footprint, since fingerprint vectors are typically over 95% sparse
(Rogers & Hahn, 2010).
Each molecule is represented as an input vector \(\mathbf{x} \in \{0, 1\}^{2048}\), where each bit encodes the presence or absence of a circular structural fragment.
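To make the CSR idea concrete, the sketch below packs a few binary fingerprints into the two index arrays that CSR relies on and restores one row to a dense 2048-bit vector. It is a pure-Python illustration of the layout only; the pipeline itself presumably uses a sparse-matrix library, and the helper names `to_csr` and `row_to_dense` are hypothetical. Because every stored value is 1, the usual CSR data array is left implicit.

```python
def to_csr(rows):
    """Pack lists of 'on' bit positions into CSR index arrays.

    indptr[i]:indptr[i + 1] delimits the column indices of row i.
    """
    indptr, indices = [0], []
    for on_bits in rows:
        indices.extend(sorted(on_bits))
        indptr.append(len(indices))
    return indptr, indices


def row_to_dense(indptr, indices, row, n_bits=2048):
    """Rebuild the dense 0/1 vector for one molecule."""
    dense = [0] * n_bits
    for column in indices[indptr[row]:indptr[row + 1]]:
        dense[column] = 1
    return dense


fingerprints = [[5, 17, 2047], [17, 300]]  # toy 'on' bits for two molecules
indptr, indices = to_csr(fingerprints)
x0 = row_to_dense(indptr, indices, row=0)
print(indptr, indices, sum(x0))
```

Here five stored indices stand in for 2 x 2048 dense entries, which is where the memory saving for highly sparse fingerprints comes from.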
Step 5.1: Inference via the Trained MLP
KAST is not a fine-tuning or transfer learning workflow. The MultitaskClassifier
loaded here is the exact model trained from scratch in Step 3, restored from its saved
checkpoint via DeepChem's TensorFlow backend (Ramsundar et al., 2019).
How the MLP works, in plain terms:
Think of the network as a series of filters. Each hidden layer receives the output of the previous one, applies a set of learned weights and biases, and passes only the "activated" signals forward. This allows the model to learn increasingly abstract patterns from the molecular fingerprint (Goodfellow, Bengio & Courville, 2016).
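Hidden layer computation:
For each hidden layer \(l\), with \(\mathbf{h}^{(l-1)}\) denoting the output of the previous layer and \(\mathbf{h}^{(0)} = \mathbf{x}\):
\[\mathbf{h}^{(l)} = \text{ReLU}\left(\mathbf{W}^{(l)}\,\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right)\]
where: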
\(\mathbf{W}^{(l)}\): weight matrix learned during training
\(\mathbf{b}^{(l)}\): bias vector (shifts the activation threshold)
\(\text{ReLU}(z) = \max(0,\, z)\): keeps only positive signals, introduces non-linearity
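A single hidden layer of this kind (multiply the incoming vector by the weight matrix, add the bias, apply ReLU) can be written out by hand as below. The weights, the tiny layer sizes, and the function names are made up for illustration; the trained model has its own learned parameters and runs on TensorFlow rather than plain Python lists.

```python
def relu(z):
    """Element-wise max(0, z)."""
    return [max(0.0, value) for value in z]


def dense_layer(W, b, h_prev):
    """Compute ReLU(W @ h_prev + b), with W given as a list of rows."""
    z = [
        sum(weight * h for weight, h in zip(row, h_prev)) + bias
        for row, bias in zip(W, b)
    ]
    return relu(z)


x = [1, 0, 1]                    # toy 3-bit "fingerprint"
W1 = [[0.5, -1.0, 0.25],         # illustrative weights, 2 hidden units
      [-1.0, 2.0, -0.5]]
b1 = [0.1, 0.2]
h1 = dense_layer(W1, b1, x)
print(h1)
```

The second unit receives a negative pre-activation (-1.3), so ReLU zeroes it out; only the first unit passes a signal forward.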
Output layer (Softmax):
The final layer converts the raw scores \(z_0\) and \(z_1\) into probabilities:
\[\hat{y}_k = \frac{e^{z_k}}{e^{z_0} + e^{z_1}}, \qquad k \in \{0, 1\}\]
This ensures that \(\hat{y}_0 + \hat{y}_1 = 1\), producing a valid probability distribution over both classes (Bishop, 2006).
K-Prediction Score:
\[\text{K-Prediction Score} = \hat{y}_1\]
The score is the probability assigned to the active class (index 1 of the Softmax output), and all molecules are ranked in descending order of this value.
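The scoring and ranking step can be sketched as follows: apply a two-class Softmax to each molecule's raw output, keep the probability at index 1 as the score, and sort in descending order. The logits and the helper name `rank_by_k_score` are invented for the example; in the pipeline the raw outputs come from the trained network.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(z)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def rank_by_k_score(smiles, logits):
    """Pair each molecule with P(active) and sort from highest to lowest."""
    scored = [(smi, softmax(z)[1]) for smi, z in zip(smiles, logits)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_k_score(
    ["CCO", "CCN", "CCC"],
    [[0.0, 0.0], [-1.0, 2.0], [1.5, -0.5]],  # made-up raw outputs
)
for smi, score in ranked:
    print(f"{smi}\t{score:.3f}")
```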
Predictions are computed in batches of 512 molecules, with optional parallel dispatch
across CPU cores via joblib. Each worker loads its own model instance to avoid
TensorFlow serialization issues.
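Batching itself is simple to illustrate. The generator below cuts a library into chunks of at most 512 items; it is a sketch with a hypothetical name (`iter_batches`), and the joblib dispatch and per-worker model loading are deliberately left out.

```python
def iter_batches(items, batch_size=512):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


library = list(range(1300))  # stand-in for 1300 featurized molecules
sizes = [len(batch) for batch in iter_batches(library)]
print(sizes)
```

Only the last batch is smaller than 512, and concatenating the batches restores the original order, so the ranking step sees every molecule exactly once.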
Output: User-Controlled Export
After all predictions are computed, the full results DataFrame is generated and the user interactively selects how to export:
| Export Option | Description |
|---|---|
| All molecules | Complete ranked CSV |
| High activity only (score ≥ 0.7) | Confident hits only |
| Medium to high (score ≥ 0.5) | Broader candidate set |
| Custom score cutoff | User-defined threshold |
| Top N molecules | Fixed number of top candidates |
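The five export options amount to simple filters over the ranked results. The sketch below assumes the results are (SMILES, score) pairs already sorted by descending score; the mode names ("all", "high", "medium", "custom", "top_n") and the function itself are hypothetical and only mirror the thresholds listed in the table.

```python
def select_for_export(ranked, mode="all", cutoff=None, top_n=None):
    """Filter ranked (smiles, score) pairs according to an export option."""
    if mode == "all":
        return list(ranked)
    if mode == "high":
        return [row for row in ranked if row[1] >= 0.7]
    if mode == "medium":
        return [row for row in ranked if row[1] >= 0.5]
    if mode == "custom":
        if cutoff is None:
            raise ValueError("custom mode needs a cutoff")
        return [row for row in ranked if row[1] >= cutoff]
    if mode == "top_n":
        if top_n is None:
            raise ValueError("top_n mode needs top_n")
        return list(ranked[:top_n])
    raise ValueError(f"unknown export mode: {mode}")


ranked = [("CCN", 0.95), ("CCO", 0.72), ("CCCl", 0.55), ("CCC", 0.12)]
print(select_for_export(ranked, "high"))
print(select_for_export(ranked, "custom", cutoff=0.6))
print(select_for_export(ranked, "top_n", top_n=3))
```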
Output files are saved atomically (via temporary files + safe move) inside your active workspace:
| File | Contents |
|---|---|
| | Full ranked predictions (SMILES, K-Score, Predicted_Class) |
| | Score distribution summary + top 20 candidates |
Input: data/my_library.smi
Output: user-named files in workspaces/<your_workspace>/
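An atomic save of the kind mentioned above (write to a temporary file, then move it into place) can be sketched with the standard library. This is one common way to do it, not necessarily how the pipeline does: the function name `atomic_write_csv` and the file name predictions.csv are assumptions, while the column names come from the table above.

```python
import csv
import os
import tempfile


def atomic_write_csv(path, header, rows):
    """Write a CSV to a temp file in the target directory, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        # os.replace is atomic when source and target share a filesystem
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # leave no partial file behind
        raise


out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "predictions.csv")  # hypothetical file name
atomic_write_csv(
    out_path,
    ["SMILES", "K-Score", "Predicted_Class"],
    [("CCN", 0.95, 1), ("CCC", 0.12, 0)],
)
print(os.listdir(out_dir))
```

Creating the temporary file in the same directory as the target keeps the final move on one filesystem, so a reader never sees a half-written results file.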
Data Flow
Active & Inactive
↓
[1] Data Prep
↓
Train Set | Test Set
↓
[2] Featurize
↓
Feature Vectors
↓
[3] Train Model
↓
Neural Network
↓
[4] Evaluate
↓
Metrics & Plots
↓
← New Molecules
↓
[5] Featurize New
↓
[5.1] Predict
↓
Ranked Results (CSV)
Further Reading & Foundations
Rogers, D. & Hahn, M. (2010). Extended-connectivity fingerprints. J. Chem. Inf. Model., 50(5), 742–754. https://doi.org/10.1021/ci100050t
Ramsundar, B., Eastman, P., Walters, P., Pande, V., Leswing, K. & Wu, Z. (2019). Deep Learning for the Life Sciences. O'Reilly Media. Covers the DeepChem framework & MultitaskClassifier.
Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K. & Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2), 513–530. https://doi.org/10.1039/C7SC02664A
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.