Configuration

Advanced settings explained.

Where to Configure

Edit settings.py in the main KAST folder:

# settings.py

# Section 1: Reproducibility
RANDOM_STATE = 42

# Section 2: Data Processing
TRAIN_SET_FRACTION = 0.70        # 70% train, 30% test
DEFAULT_SPLIT_METHOD = 'scaffold' # 'scaffold' or 'stratified'

# ... (more sections)

# Section 12: PARALLEL PROCESSING
ENABLE_PARALLEL_PROCESSING = True
N_WORKERS = None
PARALLEL_BATCH_SIZE = 100000
PARALLEL_MIN_THRESHOLD = 10000

All Settings Explained

Reproducibility

RANDOM_STATE = 42

What it does: Seeds the random number generators so that data splits and model initialization are repeatable from run to run

Change if: You want different random initialization (not recommended for reproducibility)

Data Processing

TRAIN_SET_FRACTION = 0.70            # 70% train, 30% test
DEFAULT_SPLIT_METHOD = 'scaffold'   # 'scaffold' or 'stratified'

What it does: Controls how data is divided:

TRAIN_SET_FRACTION: The percentage of data used for training.
DEFAULT_SPLIT_METHOD: The default technique for train/test splits.

Molecular Splitting Methods

Scaffold Split (Bemis-Murcko Split)
- How it works: Groups molecules by their Bemis-Murcko scaffold, the core ring systems and linkers that remain after side chains are removed. All molecules sharing the same scaffold are assigned to the same subset (either training or test, but never split across both).
- Why use it: This is a rigorous method in cheminformatics to test the model’s ability to generalize to novel chemical spaces. It simulates real-world drug discovery scenarios where the model must predict the activity of compounds belonging to new, unseen chemical classes.
- Use Case: Best for validating model generalization to novel scaffolds.

Stratified Random Split
- How it works: Splits the data randomly while keeping the proportion of active to inactive molecules (class distribution) approximately the same in the training and test sets.
- Why use it: Extremely useful when dealing with imbalanced datasets (e.g., low active hit rates). Standard random splitting can result in a test set with very few or no actives, leading to unstable metric calculations. Stratification ensures stable, representative training and validation sets.
- Use Case: Best when preserving the active/inactive ratio is critical for model training and metric stability.
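
The two strategies above can be sketched in plain Python. This is a minimal illustration, not KAST's actual implementation: the function names `scaffold_split` and `stratified_split` are hypothetical, the scaffold strings are assumed to be precomputed (for example with RDKit's `MurckoScaffold.MurckoScaffoldSmiles`), and filling the training set with the largest scaffold groups first is one common convention assumed here.

```python
import random
from collections import defaultdict


def scaffold_split(scaffolds, train_fraction=0.70):
    """Split molecule indices so that no scaffold appears in both subsets.

    `scaffolds` holds one precomputed scaffold string per molecule.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first, so rare scaffolds tend to land in the test set.
    ordered = sorted(groups.values(), key=len, reverse=True)
    train_cutoff = train_fraction * len(scaffolds)
    train, test = [], []
    for group in ordered:
        # A whole group goes to one side; it is never divided.
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


def stratified_split(labels, train_fraction=0.70, seed=42):
    """Randomly split indices while keeping the class ratio in both subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


# Toy data: three scaffolds of sizes 6, 2 and 2 (illustrative strings only).
scaffolds = ["c1ccccc1"] * 6 + ["c1ccncc1"] * 2 + ["C1CCCCC1"] * 2
train_idx, test_idx = scaffold_split(scaffolds)

# Toy imbalanced labels: 10 actives, 90 inactives.
labels = [1] * 10 + [0] * 90
strat_train, strat_test = stratified_split(labels)
```

In the scaffold case the realized split can deviate from `TRAIN_SET_FRACTION` because whole groups are assigned at once; in the stratified case both subsets keep roughly the original 10% active rate.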

Model Architecture

FP_RADIUS = 3                      # Morgan fingerprint radius
FP_SIZE = 2048                     # Feature vector size

What it does: Molecular fingerprint generation

Notes:

Radius 3 = ECFP6 equivalent (ECFP diameter = 2 × radius)
2048 bits = standard feature size
You rarely need to change these

Neural Network

# In settings.py
MODEL_PARAMS = {
    'n_tasks': 1,
    'layer_sizes': [1000, 500],
    'dropouts': 0.25,
    'learning_rate': 0.001,
    'mode': 'classification',
    'nb_epoch': 50
}

CLASSIFICATION_THRESHOLD = 0.5

What each parameter does:

`layer_sizes: [1000, 500]`

Defines the neural network architecture — number of neurons in each hidden layer.

Selection guideline (based on dataset size):

Small dataset (<1K molecules): [256, 128] — fewer parameters, less overfitting risk
Medium dataset (1K-10K molecules): [1000, 500] ← recommended
Large dataset (>10K molecules): [2048, 1024, 512] — can learn more complex patterns

Decision rule: Use a pyramid shape (each layer smaller than previous) to force the network to compress information into abstract representations.

Adjust if:

Underfitting (low AUC on both train & test): increase layer sizes
Overfitting (high train AUC, low test AUC): decrease layer sizes
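
To make "fewer parameters" concrete, the sketch below counts the trainable weights and biases of a fully connected network for each suggested architecture. It assumes the input is the 2048-bit fingerprint (`FP_SIZE`) and a single output unit; the helper name `count_parameters` is hypothetical, and the real model may use a different output head.

```python
def count_parameters(input_size, layer_sizes, n_outputs=1):
    """Count weights and biases in a fully connected network."""
    total = 0
    previous = input_size
    for size in list(layer_sizes) + [n_outputs]:
        total += previous * size + size  # weight matrix + bias vector
        previous = size
    return total


FP_SIZE = 2048  # input width, taken from the fingerprint settings
small = count_parameters(FP_SIZE, [256, 128])
medium = count_parameters(FP_SIZE, [1000, 500])
large = count_parameters(FP_SIZE, [2048, 1024, 512])

for name, n_params in (("small", small), ("medium", medium), ("large", large)):
    print(f"{name}: {n_params:,} parameters")
```

Under these assumptions the three architectures have about 0.56 million, 2.55 million and 6.82 million parameters, which is why the small configuration is less prone to overfitting on a few hundred molecules.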

`dropouts: 0.25`

Random neuron deactivation during training (25% of neurons turn off each step). Prevents overfitting by forcing the network to learn redundant representations.

Recommended values:

0.1 → minimal regularization (large datasets >50K molecules)
0.25 → moderate ← recommended
0.5 → strong regularization (severe overfitting observed)

Adjust if: AUC_train - AUC_test > 0.15 (overfitting) → increase to 0.5
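
As an illustration of what "25% of neurons turn off each step" means, here is a minimal sketch of the common inverted-dropout formulation in plain Python. It is not taken from KAST or its underlying library; `apply_dropout` is a hypothetical helper, and the rescaling by `1 / (1 - rate)` is the standard convention assumed here.

```python
import random


def apply_dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(42)
activations = [1.0] * 10_000
dropped = apply_dropout(activations, rate=0.25, rng=rng)

# About a quarter of the values are zeroed, and the mean stays close to 1.0
# because the surviving values are scaled up.
zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(f"zeroed: {zero_fraction:.1%}, mean activation: {mean_activation:.3f}")
```

A different random subset is dropped at every training step, so no single neuron can be relied on, which is the source of the regularizing effect.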

`learning_rate: 0.001`

Controls the “step size” the Adam optimizer takes when adjusting weights. Too high = overshoots. Too low = learns slowly.

Recommended values:

0.01 → large steps, fast but unstable
0.001 → default Adam ← recommended
0.0001 → small steps, stable but slow

Adjust if:

Loss oscillates wildly: decrease to 0.0001
Loss plateaus too early: increase to 0.01 (with caution)
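
The overshoot versus slow-learning trade-off can be seen on a toy problem. The sketch below uses plain gradient descent on f(x) = x² rather than Adam, so the specific learning rates are illustrative only and do not correspond to the recommended values above.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(x) = x**2 with plain gradient descent; return the final |x|."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # the gradient of x**2 is 2x
    return abs(x)


too_high = gradient_descent(1.1)    # each step overshoots the minimum and diverges
reasonable = gradient_descent(0.1)  # converges quickly towards 0
too_low = gradient_descent(0.001)   # barely moves in 20 steps

print(f"too high: {too_high:.2f}, reasonable: {reasonable:.4f}, too low: {too_low:.4f}")
```

The same pattern shows up in a training loss curve: wild oscillation suggests the step is too large, while a nearly flat curve from the start suggests it is too small.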

`nb_epoch` (different for different uses)

Number of complete passes over the dataset during training.

Why the epoch count differs by task:

Final model: uses the full nb_epoch (50) because it needs complete training
Cross-validation: adapts the number of epochs automatically if needed
Learning curve: trains the model many times, so each run is kept short for faster iterations

Decide by: Run learning curve (option 4→5 in main menu)

If loss still decreasing at epoch 50 → increase to 100
If loss plateaued by epoch 30 → reduce to 35
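
If the per-epoch loss values are available, the plateau check can be automated. The following is a sketch under assumptions: `plateau_epoch`, `min_delta` and `patience` are hypothetical names and defaults, and the loss histories are synthetic, not KAST output.

```python
def plateau_epoch(losses, min_delta=0.001, patience=5):
    """Return the 1-based epoch after which the loss stopped improving, or None.

    An epoch counts as an improvement when it lowers the best loss so far
    by more than `min_delta`.
    """
    best = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > min_delta:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch
    return None


# Synthetic history: steady improvement for 30 epochs, then flat.
flat_after_30 = [1.0 - 0.03 * e for e in range(30)] + [0.13] * 20
# Synthetic history: still improving at epoch 50.
still_improving = [1.0 - 0.01 * e for e in range(50)]

print(plateau_epoch(flat_after_30))    # epoch where improvement stopped
print(plateau_epoch(still_improving))  # None: consider raising nb_epoch
```

A returned epoch well below `nb_epoch` suggests training can be shortened; `None` suggests the loss was still decreasing when training ended.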

`CLASSIFICATION_THRESHOLD: 0.5`

Decision boundary for binary classification. Prediction probability ≥ threshold = “active”.

Recommended values by use case:

0.5 → balanced (equal importance to false positives/negatives)
0.3 → sensitive mode (capture more actives, accept more false positives)
- Use when: “Don’t miss any active compound”
- Context: Initial virtual screening, discovery phase
0.7 → specific mode (only predict when very confident)
- Use when: “Can’t afford false positives”
- Context: Final selection for synthesis/testing
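
The effect of the three thresholds can be checked on a handful of predictions. The probabilities and labels below are made up for illustration, and `confusion_counts` is a hypothetical helper, not part of KAST.

```python
def confusion_counts(probabilities, labels, threshold=0.5):
    """Count outcomes when a probability >= threshold is predicted as active."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for prob, label in zip(probabilities, labels):
        predicted_active = prob >= threshold
        if predicted_active and label == 1:
            counts["tp"] += 1
        elif predicted_active and label == 0:
            counts["fp"] += 1
        elif label == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Illustrative data: four true actives followed by four true inactives.
probs = [0.92, 0.75, 0.55, 0.35, 0.65, 0.45, 0.32, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

sensitive = confusion_counts(probs, labels, threshold=0.3)
balanced = confusion_counts(probs, labels, threshold=0.5)
specific = confusion_counts(probs, labels, threshold=0.7)
for name, result in (("0.3", sensitive), ("0.5", balanced), ("0.7", specific)):
    print(name, result)
```

On this toy data, lowering the threshold to 0.3 recovers every active at the cost of three false positives, while raising it to 0.7 removes all false positives but misses two actives.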

Quick Decision Table:

Observable symptom	Adjustment needed
High train AUC, low test AUC (overfitting)	Increase dropout, decrease layer_sizes or epochs
Low train AUC, low test AUC (underfitting)	Increase layer_sizes or epochs, decrease dropout
Loss doesn’t converge smoothly	Decrease learning_rate to 0.0001
Loss converges before epoch 50	Reduce nb_epoch to avoid wasting time
Many false negatives (missing actives)	Decrease CLASSIFICATION_THRESHOLD to 0.3
Many false positives (wrong predictions)	Increase CLASSIFICATION_THRESHOLD to 0.7
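
The first two rows of the table can be turned into a small check. This is a sketch only: the 0.15 gap comes from the dropout guidance above, but the 0.70 cutoff for a "low" AUC is an assumption chosen for illustration, and `diagnose` is a hypothetical helper.

```python
def diagnose(train_auc, test_auc, gap_limit=0.15, low_auc=0.70):
    """Map train/test AUC to a suggested adjustment.

    gap_limit: train-test AUC gap treated as overfitting (0.15, from the text).
    low_auc:   AUC below which a score counts as low (assumed value).
    """
    if train_auc - test_auc > gap_limit:
        return "overfitting: increase dropouts, decrease layer_sizes or nb_epoch"
    if train_auc < low_auc and test_auc < low_auc:
        return "underfitting: increase layer_sizes or nb_epoch, decrease dropouts"
    return "no adjustment suggested"


print(diagnose(0.98, 0.72))
print(diagnose(0.62, 0.60))
print(diagnose(0.90, 0.85))
```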

Safe defaults:

✅ Start with values above — they’re optimized for 1K-10K molecular datasets
✅ Only adjust if model performance is poor
✅ Always validate changes with learning curve analysis

Parallel Processing

ENABLE_PARALLEL_PROCESSING = True
N_WORKERS = None                    # Auto-detect
PARALLEL_BATCH_SIZE = 100000        # Molecules per batch
PARALLEL_MIN_THRESHOLD = 10000      # Min size to activate

What it does: Multi-core processing configuration

See the Parallel Processing guide for details

Quick settings:

RAM	N_WORKERS	PARALLEL_BATCH_SIZE
4GB	2	25,000
8GB	4	50,000
16GB	6	100,000
32GB+	-1	200,000
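
The table can also be expressed as a lookup. In this sketch, `quick_parallel_settings` is a hypothetical helper; rounding intermediate RAM sizes down to the nearest row and using the 4GB row for anything smaller are assumptions, not documented behavior.

```python
def quick_parallel_settings(ram_gb):
    """Return the table row matching the available RAM, rounding down."""
    if ram_gb >= 32:
        return {"N_WORKERS": -1, "PARALLEL_BATCH_SIZE": 200_000}
    if ram_gb >= 16:
        return {"N_WORKERS": 6, "PARALLEL_BATCH_SIZE": 100_000}
    if ram_gb >= 8:
        return {"N_WORKERS": 4, "PARALLEL_BATCH_SIZE": 50_000}
    # 4GB row, also used as the fallback for smaller machines (assumption).
    return {"N_WORKERS": 2, "PARALLEL_BATCH_SIZE": 25_000}


for ram in (4, 12, 16, 64):
    print(f"{ram} GB -> {quick_parallel_settings(ram)}")
```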

Output

OUTPUT_FOLDER = 'results'
PLOT_DPI = 300                      # Plot resolution
SAVE_PLOTS = True
SAVE_CSV = True
VERBOSE_LOGGING = True

What it does: Output file organization and detail level

Common adjustments:

PLOT_DPI = 300 → Publication quality
PLOT_DPI = 72 → Web quality (smaller files)
VERBOSE_LOGGING = False → Less log file output

Configure Interactively

To change settings without editing settings.py, use the menu:

python main.py
→ [8] Advanced Options

Interactive options:

[1] Check Environment (Validates Conda, dependencies, and GPU)
[3] Configure CPU Cores (Adjust Parallel Processing)
[4] Model Hyperparameters Configuration (Change Hyperparameters settings)

Note: Changes to Model hyperparameters apply on the next training run.

Reset to Defaults

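If you break your settings, restore the original file:

# Back up your changes first
cp settings.py settings.py.backup

# Restore the original settings.py. If KAST was cloned with git:
git checkout -- settings.py
# Otherwise, download the original settings.py from GitHub.
# Recreating the conda environment does not reset settings.py.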

Environment Variables

Some settings can be overridden via environment:

# Linux/Mac
export KAST_N_WORKERS=4
export KAST_BATCH_SIZE=50000
python bin/2_featurization.py

# Windows PowerShell
$env:KAST_N_WORKERS=4
$env:KAST_BATCH_SIZE=50000
python bin\2_featurization.py

# Windows Command Prompt
set KAST_N_WORKERS=4
set KAST_BATCH_SIZE=50000
python bin\2_featurization.py
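
How KAST reads these variables internally is not shown here. The sketch below illustrates one way such an override could be resolved in Python; `env_int` is a hypothetical helper, and mapping `KAST_BATCH_SIZE` onto `PARALLEL_BATCH_SIZE` is an assumption based on the names.

```python
import os


def env_int(name, default):
    """Return the integer stored in environment variable `name`.

    Falls back to `default` when the variable is unset or empty.
    """
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else default


# The environment value wins; otherwise the settings.py default is used.
N_WORKERS = env_int("KAST_N_WORKERS", None)
PARALLEL_BATCH_SIZE = env_int("KAST_BATCH_SIZE", 100000)

print(f"N_WORKERS={N_WORKERS}, PARALLEL_BATCH_SIZE={PARALLEL_BATCH_SIZE}")
```

Environment variables are always strings, so the explicit `int` conversion matters; a non-numeric value raises `ValueError` in this sketch rather than being silently ignored.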

Common Configurations

Scenario 1: Small Dataset (< 1K molecules)

TRAIN_SET_FRACTION = 0.8        # 80/20 split
MODEL_PARAMS = {
    'layer_sizes': [256, 128],
    'dropouts': 0.3,
    'learning_rate': 0.001,
    'nb_epoch': 50
}
CLASSIFICATION_THRESHOLD = 0.4  # More sensitive

Scenario 2: Standard/Medium Dataset (1K-10K molecules) ← RECOMMENDED

TRAIN_SET_FRACTION = 0.7        # 70/30 split
MODEL_PARAMS = {
    'layer_sizes': [1000, 500],
    'dropouts': 0.25,
    'learning_rate': 0.001,
    'nb_epoch': 50
}
CLASSIFICATION_THRESHOLD = 0.5  # Balanced

Scenario 3: Large Dataset (> 10K molecules)

TRAIN_SET_FRACTION = 0.8        # 80/20 split
MODEL_PARAMS = {
    'layer_sizes': [2048, 1024, 512],
    'dropouts': 0.2,
    'learning_rate': 0.001,
    'nb_epoch': 50
}
CLASSIFICATION_THRESHOLD = 0.5

Scenario 4: Overfitting Detected

# Observable: High train AUC, low test AUC
MODEL_PARAMS = {
    'layer_sizes': [512, 256],      # Reduce complexity
    'dropouts': 0.5,                # Increase regularization
    'learning_rate': 0.001,
    'nb_epoch': 30                  # Shorter training
}
TRAIN_SET_FRACTION = 0.9            # More training data (90/10 split)

Scenario 5: Underfitting Detected

# Observable: Low AUC on both train and test
MODEL_PARAMS = {
    'layer_sizes': [2048, 1024],    # More capacity
    'dropouts': 0.15,               # Less regularization
    'learning_rate': 0.001,
    'nb_epoch': 100                 # Longer training
}
TRAIN_SET_FRACTION = 0.8            # Less test data

How to Choose Hyperparameters

Step 1: Know Your Dataset Size

Molecules count: ___ (from [1] Prepare Data output)

IF < 1K        → Use Scenario 1 (small dataset)
IF 1K - 10K    → Use Scenario 2 (medium) ← most common
IF > 10K       → Use Scenario 3 (large)
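
The same rule as a function, returning the starting values from Scenarios 1 to 3. `pick_scenario` and the `SCENARIOS` dictionary are hypothetical names; assigning exactly 1,000 and exactly 10,000 molecules to the medium scenario is an assumption about the boundaries.

```python
# Starting values copied from Scenarios 1-3 above.
SCENARIOS = {
    "small": {"TRAIN_SET_FRACTION": 0.8, "layer_sizes": [256, 128], "dropouts": 0.3},
    "medium": {"TRAIN_SET_FRACTION": 0.7, "layer_sizes": [1000, 500], "dropouts": 0.25},
    "large": {"TRAIN_SET_FRACTION": 0.8, "layer_sizes": [2048, 1024, 512], "dropouts": 0.2},
}


def pick_scenario(n_molecules):
    """Choose a scenario name from the molecule count."""
    if n_molecules < 1_000:
        return "small"
    if n_molecules <= 10_000:
        return "medium"
    return "large"


n_molecules = 4_200  # example count, as reported by [1] Prepare Data
name = pick_scenario(n_molecules)
print(name, SCENARIOS[name])
```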

Step 2: Train and Evaluate

python main.py
→ [1] Prepare Data
→ [2] Generate Fingerprints  
→ [3] Create and Train Model
→ [4] Model Evaluation → [1] Run ALL evaluations

Step 3: Review Learning Curve

python main.py → [4] → [1] (Main Report)

Look at plots in workspaces/<your_workspace>/:

4_4_learning_curve.png — shows if you’re overfitting or underfitting
4_0_roc_curve.png — shows model performance

Step 4: Adjust Based on Signals

IF you see overfitting (train AUC high, test AUC low):

Increase dropouts from 0.25 → 0.5
Decrease layer_sizes from [1000, 500] → [512, 256]
Reduce nb_epoch from 50 → 35

IF you see underfitting (both AUC low):

Increase layer_sizes from [1000, 500] → [2048, 1024]
Decrease dropouts from 0.25 → 0.1
Increase nb_epoch from 50 → 100

IF you have many false negatives (missing actives):

Decrease CLASSIFICATION_THRESHOLD from 0.5 → 0.3

IF you have many false positives (wrong predictions):

Increase CLASSIFICATION_THRESHOLD from 0.5 → 0.7

Step 5: Repeat

Make one change at a time, retrain, and compare. Document what worked!

Parameter Interdependencies

⚠️ These interact:

If you…	Then typically…
Increase layer_sizes	…may need lower dropout
Increase dropout	…model may underfit, need more epochs
Decrease learning_rate	…training slower, may need more epochs
Use smaller TRAIN_SET_FRACTION	…model sees less training data, may underfit

Pro tip: Change one parameter at a time and measure impact!

Why These Defaults?

The default parameters (layer_sizes=[1000, 500], dropouts=0.25, learning_rate=0.001) are:

✅ Empirically proven for molecular ML on 1K-10K compounds
✅ Balanced between underfitting and overfitting
✅ Efficient (converge in ~50 epochs without wasting time)
✅ Reproducible (documented in literature)

For publication: You can justify using the defaults with a statement such as:

“Hyperparameters were set to established defaults for medium-sized molecular datasets: two hidden layers (1000, 500 neurons), dropout rate of 0.25, Adam optimizer with learning rate 0.001, trained for 50 epochs. Configuration was validated via learning curve analysis to confirm convergence without overfitting.”

This shows you’re informed, not just guessing!

Best Practices

✅ Do

Keep RANDOM_STATE fixed for reproducibility
Use N_WORKERS = None (auto-detect) unless you know better
Increase PARALLEL_BATCH_SIZE only if you have plenty of RAM
Start with defaults and adjust only if needed
Change one parameter at a time and measure impact
Document what changes you make and why

❌ Don’t

Change TENSORFLOW_SEED unless you know why
Set N_WORKERS higher than your CPU core count
Make the training BATCH_SIZE too large (> 64 is usually unnecessary); this is not PARALLEL_BATCH_SIZE, which counts molecules per parallel batch
Modify neural network layers without testing
Change multiple hyperparameters simultaneously (can’t tell what worked)
Use different hyperparameters on train vs test sets

Verify Configuration

Check current settings:

python -c "import settings as cfg; print({k: v for k, v in vars(cfg).items() if k.isupper()})"

Or create a test script:

# settings_check.py
import settings as cfg

print("Parallel Processing:")
print(f"  Enabled: {cfg.ENABLE_PARALLEL_PROCESSING}")
print(f"  Workers: {cfg.N_WORKERS}")
print(f"  Batch Size: {cfg.PARALLEL_BATCH_SIZE}")

print("\nModel:")
print(f"  Hidden Layers: {cfg.MODEL_PARAMS.get('layer_sizes')}")
print(f"  Epochs: {cfg.MODEL_PARAMS.get('nb_epoch')}")
print(f"  Learning Rate: {cfg.MODEL_PARAMS.get('learning_rate')}")
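
Beyond printing values, a few sanity checks can catch typos before a long training run. The sketch below is hypothetical: `validate_settings` and its rules are assumptions derived from the value ranges described in this guide, and it works on a plain dictionary so it can run without the KAST code base.

```python
def validate_settings(cfg):
    """Return a list of human-readable problems found in a settings mapping."""
    problems = []
    if not 0.0 < cfg["TRAIN_SET_FRACTION"] < 1.0:
        problems.append("TRAIN_SET_FRACTION must be between 0 and 1")
    if cfg["DEFAULT_SPLIT_METHOD"] not in ("scaffold", "stratified"):
        problems.append("DEFAULT_SPLIT_METHOD must be 'scaffold' or 'stratified'")
    if not 0.0 < cfg["CLASSIFICATION_THRESHOLD"] < 1.0:
        problems.append("CLASSIFICATION_THRESHOLD must be between 0 and 1")

    params = cfg["MODEL_PARAMS"]
    if not 0.0 <= params["dropouts"] < 1.0:
        problems.append("dropouts must be in [0, 1)")
    if params["learning_rate"] <= 0:
        problems.append("learning_rate must be positive")
    if not all(isinstance(n, int) and n > 0 for n in params["layer_sizes"]):
        problems.append("layer_sizes must contain positive integers")
    return problems


# Values copied from the defaults shown in this guide.
defaults = {
    "TRAIN_SET_FRACTION": 0.70,
    "DEFAULT_SPLIT_METHOD": "scaffold",
    "CLASSIFICATION_THRESHOLD": 0.5,
    "MODEL_PARAMS": {"layer_sizes": [1000, 500], "dropouts": 0.25, "learning_rate": 0.001},
}
print(validate_settings(defaults) or "settings look consistent")
```

To check the real configuration, the module's attributes can be passed as a dictionary with `validate_settings(vars(settings))` after `import settings`.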

Key Concepts Summary

Concept	Where it appears	Relevant reading
Overfitting prevention via dropout	`dropouts` parameter	Goodfellow et al., Ch. 5.2
Model capacity trade-off	`layer_sizes` tuning	Goodfellow et al., Ch. 5.2
Molecular fingerprints	`FP_RADIUS`, `FP_SIZE`	Wu et al., MoleculeNet
Multitask learning	Model architecture	Ramsundar et al.
QSAR via deep learning	Classification threshold, evaluation metrics	Ma et al.
