Configuration

Advanced settings explained.


Where to Configure

Edit settings.py in the main KAST folder:

# settings.py

# Section 1: Reproducibility
RANDOM_STATE = 42

# Section 2: Data Processing
TRAIN_SET_FRACTION = 0.70        # 70% train, 30% test
DEFAULT_SPLIT_METHOD = 'scaffold' # 'scaffold' or 'stratified'

# ... (more sections)

# Section 12: PARALLEL PROCESSING
ENABLE_PARALLEL_PROCESSING = True
N_WORKERS = None
PARALLEL_BATCH_SIZE = 100000
PARALLEL_MIN_THRESHOLD = 10000

All Settings Explained

Reproducibility

RANDOM_STATE = 42

What it does: Ensures identical results on every run

Change if: You want different random initialization (not recommended for reproducibility)


Data Processing

TRAIN_SET_FRACTION = 0.70            # 70% train, 30% test
DEFAULT_SPLIT_METHOD = 'scaffold'   # 'scaffold' or 'stratified'

What it does: Controls how data is divided:

  • TRAIN_SET_FRACTION: The percentage of data used for training.

  • DEFAULT_SPLIT_METHOD: The default technique for train/test splits.

Molecular Splitting Methods

  1. Scaffold Split (Bemis-Murcko Split)

    • How it works: Groups molecules based on their core carbon skeleton (Bemis-Murcko scaffold). All molecules sharing the same scaffold are assigned to the same subset (either training or test, but never split across both).

    • Why use it: This is a rigorous method in cheminformatics to test the model’s ability to generalize to novel chemical spaces. It simulates real-world drug discovery scenarios where the model must predict the activity of compounds belonging to new, unseen chemical classes.

    • Use Case: Best for validating model generalization to novel scaffolds.

  2. Stratified Random Split

    • How it works: Splits the data randomly but enforces that the proportion of active to inactive molecules (class distribution) remains identical between the training and test sets.

    • Why use it: Extremely useful when dealing with imbalanced datasets (e.g., low active hit rates). Standard random splitting can result in a test set with very few or no actives, leading to unstable metric calculations. Stratification ensures stable, representative training and validation sets.

    • Use Case: Best when preserving the active/inactive ratio is critical for model training and metric stability.


Model Architecture

FP_RADIUS = 3                      # Morgan fingerprint radius
FP_SIZE = 2048                     # Feature vector size

What it does: Molecular fingerprint generation

Notes:

  • Radius 3 = standard (ECFP6)

  • 2048 bits = standard feature size

  • Rarely need to change these


Neural Network

# In settings.py
MODEL_PARAMS = {
    'n_tasks': 1,
    'layer_sizes': [1000, 500],
    'dropouts': 0.25,
    'learning_rate': 0.001,
    'mode': 'classification',
    'nb_epoch': 50
}

CLASSIFICATION_THRESHOLD = 0.5

What each parameter does:

layer_sizes: [1000, 500]

Defines the neural network architecture — number of neurons in each hidden layer.

Selection guideline (based on dataset size):

  • Small dataset (<1K molecules): [256, 128] — fewer parameters, less overfitting risk

  • Medium dataset (1K-10K molecules): [1000, 500]recommended

  • Large dataset (>10K molecules): [2048, 1024, 512] — can learn more complex patterns

Decision rule: Use a pyramid shape (each layer smaller than previous) to force the network to compress information into abstract representations.

Adjust if:

  • Underfitting (low AUC on both train & test): increase layer sizes

  • Overfitting (high train AUC, low test AUC): decrease layer sizes

dropouts: 0.25

Random neuron deactivation during training (25% of neurons turn off each step). Prevents overfitting by forcing the network to learn redundant representations.

Recommended values:

  • 0.1 → minimal regularization (large datasets >50K molecules)

  • 0.25 → moderate ← recommended

  • 0.5 → strong regularization (severe overfitting observed)

Adjust if: AUC_train - AUC_test > 0.15 (overfitting) → increase to 0.5

learning_rate: 0.001

Controls the “step size” the Adam optimizer takes when adjusting weights. Too high = overshoots. Too low = learns slowly.

Recommended values:

  • 0.01 → large steps, fast but unstable

  • 0.001 → default Adam ← recommended

  • 0.0001 → small steps, stable but slow

Adjust if:

  • Loss oscillates wildly: decrease to 0.0001

  • Loss plateau too early: increase to 0.01 (with caution)

nb_epoch (different for different uses)

Number of complete passes over the dataset during training.

Why different values?

  • nb_epoch: 50: Final model needs full training

  • Cross-validation runs automatically adapt epochs if needed

  • Learning curve runs many times, so faster iterations

Decide by: Run learning curve (option 4→5 in main menu)

  • If loss still decreasing at epoch 50 → increase to 100

  • If loss plateaued by epoch 30 → reduce to 35

CLASSIFICATION_THRESHOLD: 0.5

Decision boundary for binary classification. Prediction probability ≥ threshold = “active”.

Recommended values by use case:

  • 0.5 → balanced (equal importance to false positives/negatives)

  • 0.3sensitive mode (capture more actives, accept more false positives)

    • Use when: “Don’t miss any active compound”

    • Context: Initial virtual screening, discovery phase

  • 0.7specific mode (only predict when very confident)

    • Use when: “Can’t afford false positives”

    • Context: Final selection for synthesis/testing


Quick Decision Table:

Observable symptom

Adjustment needed

High train AUC, low test AUC (overfitting)

Increase dropout, decrease layer_sizes or epochs

Low train AUC, low test AUC (underfitting)

Increase layer_sizes or epochs, decrease dropout

Loss doesn’t converge smoothly

Decrease learning_rate to 0.0001

Loss converges before epoch 50

Reduce nb_epoch to avoid wasting time

Many false negatives (missing actives)

Decrease CLASSIFICATION_THRESHOLD to 0.3

Many false positives (wrong predictions)

Increase CLASSIFICATION_THRESHOLD to 0.7


Safe defaults:

  • ✅ Start with values above — they’re optimized for 1K-10K molecular datasets

  • ✅ Only adjust if model performance is poor

  • ✅ Always validate changes with learning curve analysis


Parallel Processing

ENABLE_PARALLEL_PROCESSING = True
N_WORKERS = None                    # Auto-detect
PARALLEL_BATCH_SIZE = 100000        # Molecules per batch
PARALLEL_MIN_THRESHOLD = 10000      # Min size to activate

What it does: Multi-core processing configuration

See Parallel Processing guide for detailed info

Quick settings:

RAM

N_WORKERS

BATCH_SIZE

4GB

2

25,000

8GB

4

50,000

16GB

6

100,000

32GB+

-1

200,000


Output

OUTPUT_FOLDER = 'results'
PLOT_DPI = 300                      # Plot resolution
SAVE_PLOTS = True
SAVE_CSV = True
VERBOSE_LOGGING = True

What it does: Output file organization and detail level

Common adjustments:

  • PLOT_DPI = 300 → Publication quality

  • PLOT_DPI = 72 → Web quality (smaller files)

  • VERBOSE_LOGGING = False → Less log file output


Configure Interactively

Without editing settings.py, use the menu:

python main.py
→ [8] Advanced Options

Interactive options:

  • [1] Check Environment (Validates Conda, dependencies, and GPU)

  • [3] Configure CPU Cores (Adjust Parallel Processing)

  • [4] Model Hyperparameters Configuration (Change Hyperparameters settings)

Note: Changes to Model hyperparameters apply on the next training run.


Reset to Defaults

If you mess up settings, revert:

# Backup your changes
cp settings.py settings.py.backup

# Download original from GitHub or reinstall
conda env create -f environment.yml -y --force-reinstall

Environment Variables

Some settings can be overridden via environment:

# Linux/Mac
export KAST_N_WORKERS=4
export KAST_BATCH_SIZE=50000
python bin/2_featurization.py

# Windows PowerShell
$env:KAST_N_WORKERS=4
$env:KAST_BATCH_SIZE=50000
python bin\2_featurization.py

# Windows Command Prompt
set KAST_N_WORKERS=4
set KAST_BATCH_SIZE=50000
python bin\2_featurization.py

Common Configurations

Scenario 1: Small Dataset (< 1K molecules)

TRAIN_SET_FRACTION = 0.8        # 80/20 split
MODEL_PARAMS = {
    'layer_sizes': [256, 128],
    'dropouts': 0.3,
    'learning_rate': 0.001,
    'nb_epoch': 50
}
CLASSIFICATION_THRESHOLD = 0.4  # More sensitive

Scenario 3: Large Dataset (> 10K molecules)

TRAIN_SET_FRACTION = 0.8        # 80/20 split
MODEL_PARAMS = {
    'layer_sizes': [2048, 1024, 512],
    'dropouts': 0.2,
    'learning_rate': 0.001,
    'nb_epoch': 50
}
CLASSIFICATION_THRESHOLD = 0.5

Scenario 4: Overfitting Detected

# Observable: High train AUC, low test AUC
MODEL_PARAMS = {
    'layer_sizes': [512, 256],      # Reduce complexity
    'dropouts': 0.5,                # Increase regularization
    'learning_rate': 0.001,
    'nb_epoch': 30                  # Shorter training
}
TRAIN_SET_FRACTION = 0.9            # More test data

Scenario 5: Underfitting Detected

# Observable: Low AUC on both train and test
MODEL_PARAMS = {
    'layer_sizes': [2048, 1024],    # More capacity
    'dropouts': 0.15,               # Less regularization
    'learning_rate': 0.001,
    'nb_epoch': 100                 # Longer training
}
TRAIN_SET_FRACTION = 0.8            # Less test data

How to Choose Hyperparameters

Step 1: Know Your Dataset Size

Molecules count: ___ (from [1] Prepare Data output)

IF < 1K        → Use Scenario 1 (small dataset)
IF 1K - 10K    → Use Scenario 2 (medium) ← most common
IF > 10K       → Use Scenario 3 (large)

Step 2: Train and Evaluate

python main.py
→ [1] Prepare Data
→ [2] Generate Fingerprints   [3] Create and Train Model
→ [4] Model Evaluation  [1] Run ALL evaluations

Step 3: Review Learning Curve

python main.py  [4]  [1] (Main Report)

Look at plots in workspaces/<your_workspace>/:

  • 4_4_learning_curve.png — shows if you’re overfitting or underfitting

  • 4_0_roc_curve.png — shows model performance

Step 4: Adjust Based on Signals

IF you see overfitting (train AUC high, test AUC low):

  • Increase dropouts from 0.25 → 0.5

  • Decrease layer_sizes from [1000, 500] → [512, 256]

  • Reduce nb_epoch from 50 → 35

IF you see underfitting (both AUC low):

  • Increase layer_sizes from [1000, 500] → [2048, 1024]

  • Decrease dropouts from 0.25 → 0.1

  • Increase nb_epoch from 50 → 100

IF you have many false negatives (missing actives):

  • Decrease CLASSIFICATION_THRESHOLD from 0.5 → 0.3

IF you have many false positives (wrong predictions):

  • Increase CLASSIFICATION_THRESHOLD from 0.5 → 0.7

Step 5: Repeat

Make one change at a time, retrain, and compare. Document what worked!


Parameter Interdependencies

⚠️ These interact:

If you…

Then typically…

Increase layer_sizes

…may need lower dropout

Increase dropout

…model may underfit, need more epochs

Decrease learning_rate

…training slower, may need more epochs

Use smaller TRAIN_SET_FRACTION

…model sees less training data, may underfit

Pro tip: Change one parameter at a time and measure impact!


Why These Defaults?

The default parameters (layer_sizes=[1000, 500], dropouts=0.25, learning_rate=0.001) are:

Empirically proven for molecular ML on 1K-10K compounds ✅ Balanced between underfitting and overfitting
Efficient (converge in ~50 epochs without wasting time) ✅ Reproducible (documented in literature)

For publication: You can justify using defaults by citing:

“Hyperparameters were set to established defaults for medium-sized molecular datasets: two hidden layers (1000, 500 neurons), dropout rate of 0.25, Adam optimizer with learning rate 0.001, trained for 50 epochs. Configuration was validated via learning curve analysis to confirm convergence without overfitting.”

This shows you’re informed, not just guessing!


Best Practices

✅ Do

  • Keep RANDOM_SEED fixed for reproducibility

  • Use N_WORKERS = None (auto-detect) unless you know better

  • Increase PARALLEL_BATCH_SIZE only if you have plenty of RAM

  • Start with defaults and adjust only if needed

  • Change one parameter at a time and measure impact

  • Document what changes you make and why

❌ Don’t

  • Change TENSORFLOW_SEED unless you know why

  • Set N_WORKERS higher than your CPU core count

  • Make BATCH_SIZE too large (> 64 usually unnecessary)

  • Modify neural network layers without testing

  • Change multiple hyperparameters simultaneously (can’t tell what worked)

  • Use different hyperparameters on train vs test sets


Verify Configuration

Check current settings:

python -c "import settings as cfg; print(cfg.__dict__)"

Or create test script:

# settings_check.py
import settings as cfg

print("Parallel Processing:")
print(f"  Enabled: {cfg.ENABLE_PARALLEL_PROCESSING}")
print(f"  Workers: {cfg.N_WORKERS}")
print(f"  Batch Size: {cfg.PARALLEL_BATCH_SIZE}")

print("\nModel:")
print(f"  Hidden Layers: {cfg.MODEL_PARAMS.get('layer_sizes')}")
print(f"  Epochs: {cfg.MODEL_PARAMS.get('nb_epoch')}")
print(f"  Learning Rate: {cfg.MODEL_PARAMS.get('learning_rate')}")

Further Reading & Foundations

For deeper understanding of hyperparameter tuning, neural network architecture, and molecular machine learning:

Molecular Machine Learning

  1. Ramsundar, B. et al. Massively Multitask Networks for Drug Discovery.

    • arXiv:1502.02072 (2015)

    • Link: https://arxiv.org/abs/1502.02072

  2. Ma, J. et al. Deep Neural Nets as a Method for Quantitative Structure-Activity Relationships.

    • J. Chem. Inf. Model. 2015, 55, 263–274

    • DOI: 10.1021/ci500747n

  3. Wu, Z. et al. MoleculeNet: A Benchmark for Molecular Machine Learning.

    • Chem. Sci. 2018, 9, 513–530

    • DOI: 10.1039/c7sc02664a

    • PubMed: 29629118

Deep Learning Theory

  1. Goodfellow, I., Bengio, Y., Courville, A. Deep Learning.

    • MIT Press, 2016

    • Chapter 5.2: “Capacity, Overfitting and Underfitting”

    • Link: https://www.deeplearningbook.org/

  2. Bengio, Y. et al. Representation Learning: A Review and New Perspectives.

    • IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828

    • DOI: 10.1109/TPAMI.2013.50


Key Concepts Summary

Concept

Where it appears

Relevant reading

Overfitting prevention via dropout

dropouts parameter

Goodfellow et al., Ch. 5.2

Model capacity trade-off

layer_sizes tuning

Goodfellow et al., Ch. 5.2

Molecular fingerprints

FINGERPRINT_RADIUS, FINGERPRINT_LENGTH

Wu et al., MoleculeNet

Multitask learning

Model architecture

Ramsundar et al.

QSAR via deep learning

Classification threshold, evaluation metrics

Ma et al.