Overview

K-talysticFlow (KAST) is an automated pipeline for training, evaluating, and applying deep learning models to predict molecular bioactivity. It is designed for virtual screening, helping identify promising compounds from large chemical libraries in a fast, reproducible, and user-friendly workflow.

What Does KAST Do?

KAST automates a complete end-to-end workflow for molecular activity prediction:

Prepare Data — Clean, validate, and organize user-provided molecules in SMILES format, separating active and inactive compounds into training and test sets. Invalid molecules are logged in an audit report.
Generate Features — Convert molecules into machine-learning-ready ECFP/Morgan fingerprints.
Train Model — Build and train a deep learning model from scratch using the user’s featurized training data. KAST uses a DeepChem-based neural network implemented with MultitaskClassifier, with configurable hidden layers, dropout, and learning rate.
Evaluate — Assess model quality using metrics and analyses such as ROC-AUC, cross-validation, enrichment factor, Tanimoto similarity, and learning curves.
Predict — Screen new molecules and rank them according to their predicted probability of activity and the K-Prediction Score.
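Two of the evaluation quantities named above, Tanimoto similarity and enrichment factor, can be illustrated in a few lines. This is a minimal sketch using the standard textbook definitions, not KAST's own code: the function names, the representation of a fingerprint as a set of on-bit indices, and the `fraction` parameter are assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: treated as dissimilar by convention here
    return len(a & b) / union


def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate in the top-scored fraction divided by the overall hit rate.

    scores: predicted activity scores (higher means more likely active)
    labels: 1 for active, 0 for inactive, in the same order as scores
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equally long")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("enrichment factor is undefined without any actives")
    n_top = max(1, round(n_total * fraction))
    # Rank molecules from highest to lowest score and count actives in the top slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)
```

An enrichment factor above 1 means the top of the ranked list contains more actives than random selection would give.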

How the Learning Works

KAST does not rely on a fixed pretrained predictor. Instead, it creates and trains a new model based on the user’s own dataset, using labeled active/inactive compounds as supervision. This makes the workflow suitable for target-specific virtual screening, where model performance depends on the quality and representativeness of the user’s dataset.
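The supervision described above amounts to attaching a label to every molecule and holding part of each class back for testing. The sketch below shows one common way to do this, a stratified random split. The split ratio, the seed, and the function name are assumptions; the source does not state which split strategy KAST actually uses.

```python
import random


def label_and_split(actives, inactives, test_fraction=0.2, seed=42):
    """Label actives as 1 and inactives as 0, then split each class separately.

    Splitting per class keeps the active/inactive ratio similar in both sets.
    Returns (train, test), each a list of (smiles, label) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for smiles_list, label in ((actives, 1), (inactives, 0)):
        shuffled = list(smiles_list)
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test.extend((smiles, label) for smiles in shuffled[:n_test])
        train.extend((smiles, label) for smiles in shuffled[n_test:])
    return train, test
```

Because the split is done per class, a dataset with few actives still ends up with actives in both the training and the test set, provided the class is large enough for the chosen fraction.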

How It Is Scored

The K-Prediction Score is derived from the output of the trained MultitaskClassifier. During prediction, the network takes each molecule's fingerprint and outputs a probability for each class. The K-Prediction Score is the predicted probability of the active class, P(active), a value between 0 and 1. The scores are used mainly for ranking and prioritization in virtual screening: the relative ordering of candidates guides which compounds to send for experimental validation, even though the absolute probabilities may not be well calibrated, since calibration depends on the distribution of the training dataset.
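As an illustration of the scoring and ranking idea, the sketch below assumes the class probabilities have already been extracted from the model as `[P(inactive), P(active)]` pairs, one per molecule. That layout, the `active_index` parameter, and both function names are assumptions for this example and are not taken from KAST's code.

```python
def k_prediction_scores(class_probs, active_index=1):
    """Pick P(active) out of each per-molecule class-probability row."""
    return [row[active_index] for row in class_probs]


def rank_by_score(names, scores):
    """Return (name, score) pairs sorted from highest to lowest score."""
    if len(names) != len(scores):
        raise ValueError("names and scores must have the same length")
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical probabilities for three molecules.
probs = [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]
ranking = rank_by_score(["mol_a", "mol_b", "mol_c"], k_prediction_scores(probs))
```

Only the ordering in `ranking` matters for prioritization; the numeric values should not be read as calibrated probabilities.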

How It Works (Simple Version)

Workspace Management
       ↓
Your Data (SMILES)
       ↓
   [STEP 1] Prepare & Split
       ↓
   [STEP 2] Featurize
       ↓
   [STEP 3] Train Model
       ↓
   [STEP 4] Evaluate
       ↓
   [STEP 5] Predict New
       ↓
   Results (CSV + Plots + Reports)

Each step is interactive — you choose options at each stage. No coding needed!

Key Capabilities

Feature	Benefit
Interactive Menu	Click through the workflow step-by-step
Workspace Management	Isolate projects without mixing data or models
Parallel Processing	5-10x faster on large datasets (100K+ molecules)
Full Validation	ROC/AUC, Cross-Validation, Enrichment Factor, Similarity
K-Prediction Score	KAST's ranking score for molecules (0-1), equal to the predicted probability of activity
Configurable Hyperparameters	Easily change Layers, Dropout, and Epochs from the UI
One-Click Setup	`setup.exe` (Windows) or `setup.sh` (Linux) handles everything
Automatic Shortcuts	Desktop shortcut to launch KAST without terminal

Who Should Use KAST?

✅ Researchers in drug discovery
✅ Computational chemists
✅ Anyone screening chemical libraries
✅ Students learning ML + chemistry

What You’ll Need

✅ SMILES file with active molecules (e.g., actives.smi)
✅ SMILES file with inactive molecules (e.g., inactives.smi)
✅ (Optional) New molecules to predict (e.g., library.smi)

Format example:

CC(C)Cc1ccc(cc1)C(C)C(O)=O  ibuprofen
CN1C=NC2=C1C(=O)N(C(=O)N2C)C  caffeine
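Each line in this format is a SMILES string followed by whitespace and an optional name. A minimal sketch of reading such lines is shown here; the function name and the fallback name for unnamed molecules are assumptions, and the code only splits text. It does not check that a SMILES string is chemically valid, which requires a cheminformatics toolkit.

```python
def parse_smiles_lines(lines):
    """Split each non-empty line into (smiles, name).

    The first whitespace-separated token is the SMILES string; the rest of
    the line, if any, is the molecule name.
    """
    records = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(None, 1)
        smiles = parts[0]
        # Assumed fallback: name unnamed molecules after their line number.
        name = parts[1].strip() if len(parts) > 1 else "mol_%d" % line_number
        records.append((smiles, name))
    return records


example = [
    "CC(C)Cc1ccc(cc1)C(C)C(O)=O  ibuprofen",
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C  caffeine",
]
records = parse_smiles_lines(example)
```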

Tested Platforms

✅ Windows 11 with Anaconda
✅ Ubuntu 20.04 LTS with Conda

On both platforms, the setup script creates a desktop shortcut and handles all conda setup automatically.