Pipeline Overview

K-talysticFlow is a 5-step automated pipeline for molecular screening. This page describes what each step does, what it reads, and what it writes.


The 5 Steps

Step 1: 📋 Data Preparation

File: bin/1_preparation.py

What it does:

  • Reads active and inactive molecule files (SMILES format)

  • Validates SMILES structures using RDKit

  • Removes invalid/duplicate molecules and logs them in an audit report (invalid_smiles_report.txt)

  • Splits into train/test sets (e.g., 80/20)

Input: data/actives.smi + data/inactives.smi
Output: train_set.csv + test_set.csv + invalid_smiles_report.txt
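
The validate, deduplicate, and split logic described above can be sketched in a few lines. This is a minimal illustration and not the code in bin/1_preparation.py: the function name `prepare_dataset`, the `is_valid` callback, and the seed value are assumptions made for the example. In the real pipeline the validity check is done with RDKit (for instance, `Chem.MolFromSmiles` returns `None` for a SMILES string it cannot parse), and duplicates may be detected on canonical SMILES instead of raw strings.

```python
import random


def prepare_dataset(records, is_valid, test_fraction=0.2, seed=42):
    """Validate, deduplicate, and split (smiles, label) records.

    `is_valid` is a caller-supplied check (hypothetical hook); with RDKit
    it could be `lambda s: Chem.MolFromSmiles(s) is not None`.
    Returns (train, test, rejected).
    """
    seen = set()
    kept, rejected = [], []
    for smiles, label in records:
        if not is_valid(smiles):
            rejected.append((smiles, "invalid"))
        elif smiles in seen:
            # Simplification: duplicates are matched on the raw string.
            rejected.append((smiles, "duplicate"))
        else:
            seen.add(smiles)
            kept.append((smiles, label))

    rng = random.Random(seed)  # fixed seed gives a reproducible split
    rng.shuffle(kept)
    n_test = round(len(kept) * test_fraction)
    return kept[n_test:], kept[:n_test], rejected
```

The `rejected` list plays the role of the audit report: every dropped molecule is kept together with the reason it was removed.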


Step 2: 🧬 Featurization

File: bin/2_featurization.py

What it does:

  • Converts SMILES into machine learning features (fingerprints)

  • Uses Morgan fingerprints (ECFP6, radius=3)

  • Generates numerical vectors from molecular structures

  • Supports parallel processing (5-10x faster on large datasets)

Input: Train/test CSV files
Output: Featurized numpy arrays


Step 3: 🤖 Model Training

File: bin/3_create_training.py

What it does:

  • Builds a neural network using DeepChem + TensorFlow

  • Trains on fingerprint features

  • Learns to predict molecular activity

  • Ensures reproducibility with fixed random seeds

Input: Featurized training data
Output: Trained model file + training metrics


Step 4: 📊 Model Evaluation

Menu Option: [4] → choose evaluation type

Available evaluations:

| Evaluation          | File                       | Purpose                                              |
|---------------------|----------------------------|------------------------------------------------------|
| Main Evaluation     | 4_0_evaluation_main.py     | ROC/AUC, accuracy, precision, recall, F1             |
| Cross-Validation    | 4_1_cross_validation.py    | Verify model generalization (5-fold CV)              |
| Enrichment Factor   | 4_2_enrichment_factor.py   | How much better than random screening                |
| Tanimoto Similarity | 4_3_tanimoto_similarity.py | Chemical space comparison (Train internal vs. Test)  |
| Learning Curve      | 4_4_learning_curve.py      | Model performance vs. training size                  |

Input: Trained model + test data
Output: Plots + detailed metrics
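
As an illustration of the "how much better than random screening" idea, the sketch below computes an enrichment factor using the common textbook definition: the hit rate among the top-scored fraction divided by the hit rate in the whole set. The function name and the default `top_fraction` are assumptions for this example; the exact definition and cutoffs used by 4_2_enrichment_factor.py may differ.

```python
def enrichment_factor(scores, labels, top_fraction=0.01):
    """Enrichment factor at a given top fraction of the ranked list.

    scores: predicted activity scores (higher = more likely active)
    labels: 1 for active, 0 for inactive, aligned with `scores`
    """
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one molecule and one active")

    # Always inspect at least one molecule, even for tiny sets.
    n_top = max(1, int(n_total * top_fraction))

    # Rank by score, best first, and count actives in the top slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])

    hit_rate_top = hits_top / n_top
    hit_rate_all = n_actives / n_total
    return hit_rate_top / hit_rate_all
```

A value of 1.0 means the ranking is no better than random selection; a value of 5.0 means the top slice contains five times as many actives as a random slice of the same size would be expected to.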


Step 5: 🔮 Prediction

Files: bin/5_0_featurize_for_prediction.py + bin/5_1_run_prediction.py


Step 5.0: Featurize for Prediction

New molecules are converted to the same 2048-bit sparse binary vector representation used during training (ECFP/Morgan fingerprints, radius=3). The resulting feature matrix is stored in HDF5 format (.h5) as a CSR (Compressed Sparse Row) sparse matrix to minimize memory footprint, since fingerprint vectors are typically over 95% sparse (Rogers & Hahn, 2010).

Each molecule is represented as an input vector \(\mathbf{x} \in \{0, 1\}^{2048}\), where each bit encodes the presence or absence of a circular structural fragment.
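
To see why CSR storage saves memory for such vectors, the sketch below converts dense binary rows into the two CSR index arrays. It is a plain-Python illustration with toy 8-bit rows, not the pipeline's storage code; the helper names are hypothetical. Libraries such as SciPy store a third `data` array as well, but for a binary matrix every stored value is 1, so it is omitted here.

```python
def to_csr(dense_rows):
    """Convert binary rows to CSR index arrays (indptr, indices).

    indices: column positions of the on-bits, row after row
    indptr:  indptr[r]..indptr[r+1] delimits row r inside `indices`
    """
    indptr, indices = [0], []
    for row in dense_rows:
        indices.extend(col for col, bit in enumerate(row) if bit)
        indptr.append(len(indices))
    return indptr, indices


def row_on_bits(indptr, indices, r):
    """Return the on-bit column positions of row r."""
    return indices[indptr[r]:indptr[r + 1]]


def sparsity(indptr, n_cols):
    """Fraction of zero entries in the matrix."""
    n_rows = len(indptr) - 1
    return 1.0 - indptr[-1] / (n_rows * n_cols)
```

Only the positions of the on-bits are stored, so a 2048-bit fingerprint with a few dozen bits set needs a few dozen integers instead of 2048 values.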


Step 5.1: Inference via the Trained MLP

KAST is not a fine-tuning or transfer learning workflow. The MultitaskClassifier loaded here is the exact model trained from scratch in Step 3, restored from its saved checkpoint via DeepChem's TensorFlow backend (Ramsundar et al., 2019).

How the MLP works, in plain terms:

Think of the network as a series of filters. Each hidden layer receives the output of the previous one, applies a set of learned weights and biases, and passes only the "activated" signals forward. This allows the model to learn increasingly abstract patterns from the molecular fingerprint (Goodfellow, Bengio & Courville, 2016).

Hidden layer computation:

For each hidden layer \(l\), with the fingerprint as the first input (\(\mathbf{h}^{(0)} = \mathbf{x}\)):

\[ \mathbf{h}^{(l)} = \text{ReLU}\!\left(\mathbf{W}^{(l)}\,\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right) \]

where:

  • \(\mathbf{W}^{(l)}\): weight matrix learned during training

  • \(\mathbf{b}^{(l)}\): bias vector (shifts the activation threshold)

  • \(\text{ReLU}(z) = \max(0,\, z)\): keeps only positive signals and introduces non-linearity
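
The hidden-layer formula above can be traced by hand with a tiny example. The sketch below implements one layer in plain Python; the helper names and the 2-unit weights used in the usage note are made up for illustration and have nothing to do with the trained model's actual layer sizes or parameters.

```python
def relu(z):
    """Element-wise ReLU: negative values become 0, positives pass through."""
    return [max(0.0, value) for value in z]


def affine(W, h, b):
    """Compute W h + b for a weight matrix given as a list of rows."""
    return [
        sum(w * x for w, x in zip(row, h)) + bias
        for row, bias in zip(W, b)
    ]


def hidden_layer(W, h_prev, b):
    """One hidden layer: h = ReLU(W h_prev + b)."""
    return relu(affine(W, h_prev, b))
```

For example, with `W = [[1.0, -1.0], [0.5, 0.5]]`, `h_prev = [1.0, 2.0]` and `b = [0.0, -1.0]`, the pre-activations are `[-1.0, 0.5]`; ReLU zeroes the first unit and passes the second through unchanged.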

Output layer โ€” Softmax:

The final layer converts raw scores into probabilities:

\[ \hat{\mathbf{y}} = \text{Softmax}\!\left(\mathbf{W}^{(L)}\,\mathbf{h}^{(L-1)} + \mathbf{b}^{(L)}\right) \]
\[ \text{Softmax}(z_i) = \frac{e^{z_i}}{\displaystyle\sum_{j} e^{z_j}}, \quad j \in \{\text{inactive},\, \text{active}\} \]

This ensures that \(\hat{y}_0 + \hat{y}_1 = 1\), producing a valid probability distribution over both classes (Bishop, 2006).

K-Prediction Score:

\[ \text{K-Score} = \hat{y}_1 = P(\text{active} \mid \mathbf{x}) \]

The score is the probability assigned to the active class (index 1 of the Softmax output), and all molecules are ranked in descending order of this value.
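
The Softmax output and the ranking by K-Score can be reproduced with a short sketch. This is an illustration of the formulas above, not the pipeline's inference code: the function names are hypothetical, and the raw two-class scores (logits) are assumed to be already available as `[inactive, active]` pairs. Subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(z)
    exps = [math.exp(value - shift) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def k_score(logits):
    """Probability of the active class (index 1 of the softmax output)."""
    return softmax(logits)[1]


def rank_by_k_score(smiles_list, logits_list):
    """Return (smiles, score) pairs sorted by descending K-Score."""
    scored = [(s, k_score(z)) for s, z in zip(smiles_list, logits_list)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Equal raw scores for both classes give a K-Score of 0.5; the more the active-class score exceeds the inactive one, the closer the K-Score gets to 1.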

Predictions are computed in batches of 512 molecules, with optional parallel dispatch across CPU cores via joblib. Each worker loads its own model instance to avoid TensorFlow serialization issues.
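
The batching itself is simple to picture. The sketch below slices a list of inputs into chunks of 512 and feeds each chunk to a prediction function; `predict_fn` is a placeholder for whatever call runs the model on one batch, and the joblib-based parallel dispatch is deliberately left out.

```python
def iter_batches(items, batch_size=512):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(items, predict_fn, batch_size=512):
    """Apply `predict_fn` (placeholder for the model call) batch by batch.

    `predict_fn` must return one result per input element.
    """
    results = []
    for batch in iter_batches(items, batch_size):
        results.extend(predict_fn(batch))
    return results
```

With 1,300 molecules this yields batches of 512, 512, and 276, so the last batch is simply shorter than the rest.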

Output: User-Controlled Export

After all predictions are computed, the full results DataFrame is generated and the user interactively selects how to export:

| Export Option                    | Description                    |
|----------------------------------|--------------------------------|
| All molecules                    | Complete ranked CSV            |
| High activity only (score ≥ 0.7) | Confident hits only            |
| Medium to high (score ≥ 0.5)     | Broader candidate set          |
| Custom score cutoff              | User-defined threshold         |
| Top N molecules                  | Fixed number of top candidates |
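
The five export options amount to simple filters over the ranked list. The sketch below shows one possible way to express them; the function name and the mode strings (`"all"`, `"high"`, `"medium_high"`, `"custom"`, `"top_n"`) are invented for this example and are not the labels used by the interactive menu. Only the 0.7 and 0.5 thresholds come from the table above.

```python
def select_for_export(ranked, mode="all", cutoff=None, top_n=None):
    """Filter (smiles, score) pairs already sorted by descending score."""
    if mode == "all":
        return list(ranked)
    if mode == "high":
        return [pair for pair in ranked if pair[1] >= 0.7]
    if mode == "medium_high":
        return [pair for pair in ranked if pair[1] >= 0.5]
    if mode == "custom":
        if cutoff is None:
            raise ValueError("custom mode needs a cutoff")
        return [pair for pair in ranked if pair[1] >= cutoff]
    if mode == "top_n":
        if top_n is None or top_n < 0:
            raise ValueError("top_n mode needs a non-negative top_n")
        return list(ranked[:top_n])
    raise ValueError(f"unknown export mode: {mode!r}")
```

Because the input is already ranked, every option preserves the descending score order in the exported file.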

Output files are saved atomically (via temporary files + safe move) inside your active workspace:

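| File              | Contents                                                   |
|-------------------|------------------------------------------------------------|
| <name>.csv        | Full ranked predictions (SMILES, K-Score, Predicted_Class) |
| <name>_report.txt | Score distribution summary + top 20 candidates             |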

Input: data/my_library.smi
Output: user-named files in workspaces/<your_workspace>/
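
The "temporary file + safe move" pattern mentioned above can be sketched as follows. This is one common way to do it with the standard library and is an assumption about the approach, not a copy of the pipeline's code; the function name `atomic_write_csv` is hypothetical. The temporary file is created in the destination directory so that the final `os.replace` stays on one filesystem.

```python
import csv
import os
import tempfile


def atomic_write_csv(path, header, rows):
    """Write a CSV so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(rows)
        # Replace the target in one step; an existing file is overwritten.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is interrupted while writing, the previous version of the output file (if any) is left untouched and only the temporary file is lost.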


Data Flow

Active & Inactive
    ↓
[1] Data Prep
    ↓
Train Set | Test Set
    ↓
[2] Featurize
    ↓
Feature Vectors
    ↓
[3] Train Model
    ↓
Neural Network
    ↓
[4] Evaluate
    ↓
Metrics & Plots
    ↓
        ↓ New Molecules
        ↓
    [5] Featurize New
        ↓
    [5.1] Predict
        ↓
    Ranked Results (CSV)

Further Reading & Foundations

  • Rogers, D. & Hahn, M. (2010). Extended-connectivity fingerprints. J. Chem. Inf. Model., 50(5), 742–754. https://doi.org/10.1021/ci100050t

  • Ramsundar, B., Eastman, P., Walters, P., Pande, V., Leswing, K. & Wu, Z. (2019). Deep Learning for the Life Sciences. O'Reilly Media. (DeepChem framework and MultitaskClassifier.)

  • Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K. & Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chemical Science, 9(2), 513–530. https://doi.org/10.1039/C7SC02664A

  • Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.