# Overview

**K-talysticFlow** (KAST) is an automated pipeline for training, evaluating, and applying deep learning models to predict molecular bioactivity. It is designed for **virtual screening**, helping identify promising compounds from large chemical libraries in a fast, reproducible, and user-friendly workflow.

---

## What Does KAST Do?

KAST automates a complete end-to-end workflow for molecular activity prediction:

1. **Prepare Data**: Clean, validate, and organize user-provided molecules in **SMILES** format, separating active and inactive compounds into training and test sets. Invalid molecules are logged in an audit report.
2. **Generate Features**: Convert molecules into machine-learning-ready **ECFP/Morgan fingerprints**.
3. **Train Model**: Build and train a deep learning model from scratch using the user’s featurized training data. KAST uses a **DeepChem-based neural network implemented with `MultitaskClassifier`**, with configurable hidden layers, dropout, and learning rate.
4. **Evaluate**: Assess model quality using metrics and analyses such as **ROC-AUC, cross-validation, enrichment factor, Tanimoto similarity, and learning curves**.
5. **Predict**: Screen new molecules and rank them according to their predicted probability of activity and the **K-Prediction Score**.

---

## How the Learning Works

KAST does not rely on a fixed pretrained predictor. Instead, it **creates and trains a new model based on the user’s own dataset**, using labeled active/inactive compounds as supervision.

This makes the workflow suitable for target-specific virtual screening, where model performance depends on the quality and representativeness of the user’s dataset.

## How It Is Scored

The **K-Prediction Score** is derived from the output of the trained `MultitaskClassifier`. During the prediction phase, the network processes molecular fingerprints and generates a probability distribution across classes.
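
The class-probability output described above can be turned into a ranked list by sorting molecules on the probability of the active class. The following is a minimal sketch, not KAST's actual code: it assumes each molecule's output has already been reduced to an `(inactive, active)` probability pair with the active class at index 1. The function name `rank_by_active_probability` and the sample values are hypothetical.

```python
def rank_by_active_probability(smiles, class_probs, active_index=1):
    """Pair each SMILES with its active-class probability, highest first.

    Assumption (not confirmed by KAST): class_probs holds one
    (p_inactive, p_active) pair per molecule, active class at index 1.
    """
    if len(smiles) != len(class_probs):
        raise ValueError("smiles and class_probs must have the same length")
    scored = [
        (smi, float(probs[active_index]))
        for smi, probs in zip(smiles, class_probs)
    ]
    # Sort by score, descending; ties keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative values only, not real model output.
example_smiles = ["CCO", "c1ccccc1O", "CC(=O)O"]
example_probs = [(0.90, 0.10), (0.25, 0.75), (0.40, 0.60)]

ranked = rank_by_active_probability(example_smiles, example_probs)
for smi, score in ranked:
    print(f"{score:.2f}\t{smi}")
```

Only the ordering is used here, so the sketch works the same whether or not the probabilities are well calibrated.
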
The K-Prediction Score is defined as the scalar **predicted probability of the active class** (\(P(\text{active})\)). These scores are used primarily for **ranking and prioritization** in virtual screening workflows: the relative ordering of candidates is a useful guide for choosing compounds for experimental validation, even though the absolute probabilities are not necessarily calibrated and depend on the distribution of the training dataset.

## How It Works (Simple Version)

```
Workspace Management
        ↓
Your Data (SMILES)
        ↓
[STEP 1] Prepare & Split
        ↓
[STEP 2] Featurize
        ↓
[STEP 3] Train Model
        ↓
[STEP 4] Evaluate
        ↓
[STEP 5] Predict New
        ↓
Results (CSV + Plots + Reports)
```

Each step is interactive: you choose options at each stage. No coding needed!

---

## Key Capabilities

| Feature | Benefit |
|---------|---------|
| **Interactive Menu** | Click through the workflow step-by-step |
| **Workspace Management** | Isolate projects without mixing data or models |
| **Parallel Processing** | 5-10x faster on large datasets (100K+ molecules) |
| **Full Validation** | ROC/AUC, Cross-Validation, Enrichment Factor, Similarity |
| **K-Prediction Score** | Proprietary ranking score for molecules (0-1) |
| **Configurable Hyperparameters** | Easily change Layers, Dropout, and Epochs from the UI |
| **One-Click Setup** | `setup.exe` (Windows) or `setup.sh` (Linux) handles everything |
| **Automatic Shortcuts** | Desktop shortcut to launch KAST without terminal |

---

## Who Should Use KAST?

- ✅ **Researchers in drug discovery**
- ✅ **Computational chemists**
- ✅ **Anyone screening chemical libraries**
- ✅ **Students learning ML + chemistry**

---

## What You'll Need

- ✅ SMILES file with active molecules (e.g., `actives.smi`)
- ✅ SMILES file with inactive molecules (e.g., `inactives.smi`)
- ✅ (Optional) New molecules to predict (e.g., `library.smi`)

**Format example:**

```
CC(C)Cc1ccc(cc1)C(C)C(O)=O ibuprofen
CN1C=NC2=C1C(=O)N(C(=O)N2C)C caffeine
```

---

## Tested Platforms

- ✅ **Windows 11** with Anaconda
- ✅ **Ubuntu 20.04 LTS** with Conda

Both platforms create desktop shortcuts and handle all conda setup automatically.

---

### Further Reading & Foundations

KAST is built upon foundational principles of chemoinformatics and machine learning. For deeper technical insights into the methodologies used:

- **Molecular Fingerprints (ECFP):** Rogers, D., & Hahn, M. (2010). Extended-Connectivity Fingerprints. *Journal of Chemical Information and Modeling*, 50(5), 742–754. [doi:10.1021/ci100050t](https://doi.org/10.1021/ci100050t)
- **Deep Learning in Drug Discovery:** Ramsundar, B., et al. (2019). *Deep Learning for the Life Sciences: Applying Deep Learning to Genomics, Microscopy, Drug Discovery, and More*. O'Reilly Media.
- **Imbalanced Learning in Chemistry:** Jiang, J., et al. (2025). A review of machine learning methods for imbalanced data challenges in chemistry. *Chemical Science*, 16, 7637–7658. [doi:10.1039/D5SC00270B](https://doi.org/10.1039/D5SC00270B)

---