Data Preparation
How to format and prepare your molecular data for KAST.
File Format: SMILES
SMILES (Simplified Molecular Input Line Entry System) is a widely used line notation for representing molecules as text.
Format:
SMILES [space/tab] optional_name
Example:
CC(C)Cc1ccc(cc1)C(C)C(O)=O ibuprofen
CN1C=NC2=C1C(=O)N(C(=O)N2C)C caffeine
CC(C)CC1=CC(=C(C=C1)C(C)C)O another_compound
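A minimal sketch of how one line in this format can be split into its two fields, using only the Python standard library. The helper name parse_smi_line is hypothetical and is not part of KAST; it only illustrates the layout described above, and it does not check that the SMILES is chemically valid.

```python
from typing import Optional, Tuple


def parse_smi_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one .smi line into (smiles, name); return None for blank lines."""
    # Split on the first run of spaces/tabs only, so the name may contain spaces.
    parts = line.strip().split(None, 1)
    if not parts:
        return None
    smiles = parts[0]
    name = parts[1].strip() if len(parts) > 1 else ""  # the name is optional
    return smiles, name


print(parse_smi_line("CC(C)Cc1ccc(cc1)C(C)C(O)=O ibuprofen"))
print(parse_smi_line("CCc1ccccc1"))
```

A line without a name yields an empty string as the second field, which matches the "optional_name" part of the format.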
File Organization
Place SMILES files in the data/ folder:
KAST/
├── data/
│ ├── actives.smi ← Active compounds (required)
│ ├── inactives.smi ← Inactive compounds (required)
│ └── my_library.smi ← (Optional) New molecules to predict
├── workspaces/ ← Outputs will go here
├── main.py
└── ...
Required Files
data/actives.smi
Active (bioactive) compounds that show the desired property.
Minimum: 50 molecules (use with caution; see Data Quality Guidelines)
Recommended: 500+ molecules for Deep Learning accuracy
Format: SMILES string, optionally followed by a space or tab and the molecule name.
CC(C)Cc1ccc(cc1)C(C)C(O)=O ibuprofen
CN1C=NC2=C1C(=O)N(C(=O)N2C)C caffeine
CC(C)CC1=CC(=C(C=C1)C(C)C)O compound3
data/inactives.smi
Inactive (non-bioactive) compounds that do not show the desired property.
Minimum: 50 molecules (use with caution; see Data Quality Guidelines)
Recommended: 500+ molecules for Deep Learning accuracy
Format: SMILES string, optionally followed by a space or tab and the molecule name.
CCCCCCCCCCCCCCCC hexadecane
CC(=O)OC1=CC=CC=C1C(=O)O aspirin
CCC(C)C(O)=O carboxylic_acid
Optional File
data/my_library.smi
New molecules to screen (predict activity).
The file can have any name; it is used in Steps 5-6.
Example:
CCc1ccccc1 ethylbenzene
Cc1ccccc1 toluene
CCC1=CC=CC=C1O ethylphenol
Data Quality Guidelines
Good Practices
Use canonical SMILES (standardized form)
Remove duplicates before starting
Validate SMILES structures with RDKit
Balance active/inactive ratio (aim for 1:1 to 1:10)
Aim for 500+ molecules per class for reliable Deep Learning performance
Maximum 100,000 molecules per file
ML/DL Warning (Overfitting Risk)
Minimum Threshold (50 molecules): Technically accepted by the pipeline, but strongly discouraged for Deep Learning. With small datasets, models will likely overfit, memorizing the training data rather than learning chemical features.
Data Scaling: If you have < 200 molecules per class, consider using simpler machine learning models (e.g., Random Forest or SVM) instead of deep neural networks to maintain predictive stability.
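The thresholds in this guide (50, 200, and 500 molecules per class) can be summarized as a small lookup. This is only a sketch: the function name size_advice and the wording of the returned labels are assumptions made for illustration, and KAST itself may report dataset size differently.

```python
MIN_PER_CLASS = 50           # minimum accepted by the pipeline (from this guide)
SIMPLE_MODEL_BELOW = 200     # below this, simpler models are suggested
RECOMMENDED_PER_CLASS = 500  # recommended size for Deep Learning


def size_advice(n_per_class: int) -> str:
    """Map a per-class molecule count to the guidance given in this guide."""
    if n_per_class < MIN_PER_CLASS:
        return "below minimum"
    if n_per_class < SIMPLE_MODEL_BELOW:
        return "prefer simpler models (e.g. Random Forest or SVM)"
    if n_per_class < RECOMMENDED_PER_CLASS:
        # Label chosen for this sketch: usable, but under the 500+ recommendation.
        return "deep learning with caution"
    return "recommended size"


for n in (30, 120, 350, 800):
    print(n, "->", size_advice(n))
```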
Avoid
Invalid SMILES (won’t parse)
Salts/mixtures (unless intended)
Very small molecules (< 5 heavy atoms)
Very large molecules (> 200 heavy atoms)
Incomplete or corrupted files
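In SMILES, a period separates disconnected components, so salts and mixtures can be spotted with a plain text check before any toolkit is involved. The sketch below uses a hypothetical helper name, find_multi_component; it does not replace RDKit validation and does not check heavy-atom counts.

```python
from typing import Iterable, List


def find_multi_component(lines: Iterable[str]) -> List[str]:
    """Return the SMILES strings that contain '.', i.e. more than one component."""
    flagged = []
    for line in lines:
        parts = line.split()
        # Only the first field is the SMILES; a dot in the name is ignored.
        if parts and "." in parts[0]:
            flagged.append(parts[0])
    return flagged


sample = [
    "CC(C)Cc1ccc(cc1)C(C)C(O)=O ibuprofen",
    "CC(=O)[O-].[Na+] sodium_acetate",
    "[Na+].[Cl-] sodium_chloride",
]
print(find_multi_component(sample))
```

Flagged entries can then be reviewed by hand, since some multi-component records may be intended.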
Quality Check
Before running KAST, validate your SMILES:
# Option 1: Use KAST's built-in check
python main.py
# When prompted, choose: [1] Data Preparation
# KAST automatically validates during preparation and generates an audit report:
# workspaces/<your_workspace>/prepared_data/invalid_smiles_report.txt
# This file will list every discarded SMILES and the reason (e.g., duplicate, invalid_rdkit).
# Option 2: Manual check with RDKit
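python -c "
from rdkit import Chem

with open('data/actives.smi') as f:
    for line in f:
        parts = line.split()
        if not parts:  # skip blank lines
            continue
        smiles = parts[0]
        # MolFromSmiles returns None when the string cannot be parsed
        if Chem.MolFromSmiles(smiles) is None:
            print(f'Invalid SMILES: {smiles}')
"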
Balance Active/Inactive
Recommended ratios:
1:1 (50 active, 50 inactive) — Balanced (optimal for training)
1:5 (100 active, 500 inactive) — Realistic for many screening libraries
1:10 (100 active, 1000 inactive) — Highly imbalanced (requires more attention to metrics)
Handling Imbalance: KAST incorporates strategies to manage imbalanced datasets within the pipeline. While the system can process significantly imbalanced data, maintaining a ratio between 1:1 and 1:10 is recommended for better model generalization and to avoid bias toward the majority class (Banerjee et al., 2018; Jiang et al., 2025).
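A quick way to see where a dataset falls relative to the 1:1 to 1:10 range is to count the non-blank lines in each file and divide. The sketch below is an illustration only: the helper names are hypothetical, and it assumes the ratio is read as inactives per active, which is one interpretation of the recommendation above.

```python
from typing import Iterable


def count_molecules(lines: Iterable[str]) -> int:
    """Count non-blank lines, assuming one molecule per line."""
    return sum(1 for line in lines if line.strip())


def inactives_per_active(n_actives: int, n_inactives: int) -> float:
    """Return how many inactive molecules exist per active molecule."""
    if n_actives <= 0:
        raise ValueError("need at least one active molecule")
    return n_inactives / n_actives


def within_recommended_ratio(n_actives: int, n_inactives: int) -> bool:
    """True when the active:inactive ratio lies between 1:1 and 1:10."""
    return 1.0 <= inactives_per_active(n_actives, n_inactives) <= 10.0


print(inactives_per_active(100, 500), within_recommended_ratio(100, 500))
print(inactives_per_active(100, 2000), within_recommended_ratio(100, 2000))
```

With real files, count_molecules can be passed an open file handle, for example the handles for data/actives.smi and data/inactives.smi.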
Large Datasets (100K+ molecules)
For very large libraries:
Enable parallel processing (see Parallel Processing)
Increase PARALLEL_BATCH_SIZE if you have 16GB+ RAM
Use filtering to reduce library size first
Monitor RAM during featurization
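When pre-filtering a very large library outside KAST, reading the file in fixed-size batches keeps memory use roughly constant. The following is a generic sketch, not KAST's implementation: iter_batches and its batch_size argument are hypothetical and are unrelated to how PARALLEL_BATCH_SIZE is handled internally.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to batch_size non-blank, stripped lines."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # A generator expression keeps only one batch in memory at a time.
    cleaned = (line.strip() for line in lines if line.strip())
    while True:
        batch = list(islice(cleaned, batch_size))
        if not batch:
            return
        yield batch


demo = ["CCc1ccccc1 ethylbenzene\n", "Cc1ccccc1 toluene\n", "\n", "CCO ethanol\n"]
for batch in iter_batches(demo, batch_size=2):
    print(len(batch), batch)
```

Because an open file is itself an iterable of lines, the same function can be applied directly to a file handle for a library such as data/my_library.smi.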
Canonicalization
Note: KAST automatically canonicalizes all your SMILES during the 1_preparation.py step. You do not need to canonicalize your files manually for the pipeline to work.
If you prefer to pre-process your files externally to ensure consistency before importing them into KAST, you can use the following RDKit snippet:
# Optional: Manual canonicalization using RDKit
python -c "
import sys
from rdkit import Chem

# With python -c, sys.argv[0] is '-c'; the input and output paths follow it.
infile, outfile = sys.argv[1], sys.argv[2]
with open(infile) as f, open(outfile, 'w') as out:
    for line in f:
        parts = line.strip().split()
        if not parts:  # skip blank lines
            continue
        smiles = parts[0]
        name = parts[1] if len(parts) > 1 else ''
        mol = Chem.MolFromSmiles(smiles)
        if mol is not None:  # unparseable SMILES are skipped
            canonical = Chem.MolToSmiles(mol)
            out.write(f'{canonical} {name}\n' if name else f'{canonical}\n')
" data/actives.smi data/actives_canonical.smi
References
Banerjee et al. (2018). Prediction Is a Balancing Act…. Frontiers in Chemistry.
Jiang et al. (2025). A review of machine learning methods for imbalanced data challenges in chemistry. Chemical Science.