AutoDist Configuration Reference

This document provides a comprehensive guide to all configuration options available in AutoDist’s AutoDistConfig class.

Overview

AutoDistConfig is the central configuration class for AutoDist, allowing you to control various aspects of automatic parallelization including memory optimization, pipeline parallelism, tensor parallelism, and recomputation strategies.

Basic Usage

from nnscaler.autodist.autodist_config import AutoDistConfig

# Basic configuration
config = AutoDistConfig(
    task_name='my_experiment',
    memory_constraint=32,  # 32GB memory limit
    recompute_modules='transformer.layer'  # Recompute transformer layers
)

Configuration Parameters

Task Configuration

task_name (str, optional, default: 'default')

The name of the current task to distinguish different runs. Used for naming saved plans and logs.

config = AutoDistConfig(task_name='bert_large_training')

Memory Management

consider_mem (bool, optional, default: True)

Whether to consider memory constraints when searching for parallelization plans.

memory_constraint (float, optional, default: 32)

The memory constraint for each device in GB. AutoDist will ensure that the parallelization plan fits within this memory limit.

config = AutoDistConfig(memory_constraint=80)  # 80GB A100

memory_granularity (int, optional, default: 1)

The memory granularity in bytes. Used for memory profiling and estimation.

transient_mem_coef (float, optional, default: 2)

Coefficient for estimating transient memory size. Formula: transient_mem_size = transient_mem_coef * (1st_largest_infer_mem + 2nd_largest_infer_mem).

Reduce this value if operators consume or generate very large tensors (≥4GB), because the default estimate can then be stricter than necessary.
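The heuristic above can be sketched in plain Python. This is only an illustration of the formula, not AutoDist's implementation: the helper name estimate_transient_mem and the per-operator memory list are hypothetical, and all sizes are in bytes.

```python
def estimate_transient_mem(infer_mems, transient_mem_coef=2.0):
    """Hypothetical helper: coef * (largest + second-largest inference memory)."""
    top_two = sorted(infer_mems, reverse=True)[:2]
    return transient_mem_coef * sum(top_two)


GiB = 1024 ** 3

# Illustrative per-operator inference memory, in bytes (made-up values).
op_mems = [1 * GiB, 5 * GiB, 3 * GiB]

# Default coefficient: 2 * (5 GiB + 3 GiB)
default_estimate = estimate_transient_mem(op_mems)

# Relaxed coefficient for workloads with very large tensors: 1 * (5 GiB + 3 GiB)
relaxed_estimate = estimate_transient_mem(op_mems, transient_mem_coef=1.0)
```

Lowering the coefficient shrinks the reserved transient memory proportionally, which is why it loosens the memory constraint seen by the search.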

Optimizer Configuration

opt_resident_coef (int, optional, default: 2)

Size of the optimizer's resident state, expressed as a multiple of the model weight size.

Common cases:

FP32 training with Adam: 2 (FP32 momentum1 + FP32 momentum2)
FP16/BF16 training with Adam: 6 (FP32 momentum1 + FP32 momentum2 + FP32 weight)
FP16/BF16 training with memory-efficient Adam: 4 (FP32 momentum1 + FP32 momentum2)

opt_transient_coef (int, optional, default: 0)

Size of the optimizer's transient state, expressed as a multiple of the model weight size.

Common cases:

FP32 training with Adam: 0
FP16/BF16 training with Adam without internal cast: 2 (FP32 gradient)
FP16/BF16 training with memory-efficient Adam without internal cast: 4 (FP32 weight + FP32 gradient)
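As a rough sketch of how the two optimizer coefficients translate into bytes, the snippet below multiplies the weight size by each coefficient. The helper optimizer_memory and the 1B-parameter model are hypothetical and serve only to illustrate the arithmetic behind the common cases listed above.

```python
def optimizer_memory(weight_bytes, opt_resident_coef=2, opt_transient_coef=0):
    """Hypothetical helper: optimizer memory as multiples of the weight size."""
    resident = opt_resident_coef * weight_bytes
    transient = opt_transient_coef * weight_bytes
    return resident, transient


# Illustrative model: 1B parameters stored in FP16/BF16 (2 bytes each).
weight_bytes = 1_000_000_000 * 2

# FP16/BF16 training with Adam, without internal cast:
# resident = FP32 momentum1 + FP32 momentum2 + FP32 weight -> 6x the FP16 weights
# transient = FP32 gradient -> 2x the FP16 weights
resident, transient = optimizer_memory(
    weight_bytes, opt_resident_coef=6, opt_transient_coef=2
)
```

Each FP32 copy is twice the size of the FP16/BF16 weights, which is where the factors of 2, 4, and 6 in the lists above come from.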

Recomputation

recompute_modules (str, optional, default: '')

Module names to recompute, separated by commas. Recomputation trades computation for memory by not storing intermediate activations during forward pass and recomputing them during backward pass. Note that recomputation still requires storing some tensors for gradient computation, so the memory savings depend on the specific model structure and recomputation granularity.

Examples:

# Recompute specific modules
config = AutoDistConfig(recompute_modules='transformer.layer,attention')

# Recompute entire model
config = AutoDistConfig(recompute_modules='ROOT')

# Recompute multiple specific modules
config = AutoDistConfig(recompute_modules='encoder.layer,decoder.layer')

Note: Module names can be any suffix of the full module name. For example, layer will match transformer.layer, encoder.layer, etc. ROOT recomputes the entire model but may not always provide maximum memory savings due to the need to store intermediate tensors for backward pass.
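The suffix rule in the note can be pictured with a small sketch. It assumes a plain string-suffix comparison and treats ROOT as matching everything; the helper is_recomputed and the module names are hypothetical, and AutoDist's actual matching logic may differ in detail.

```python
def is_recomputed(full_name, recompute_modules):
    """Hypothetical check: does full_name match any comma-separated entry?"""
    targets = [t.strip() for t in recompute_modules.split(',') if t.strip()]
    if 'ROOT' in targets:
        # ROOT selects the entire model.
        return True
    # Assumption: an entry matches when it is a suffix of the full name.
    return any(full_name.endswith(t) for t in targets)


matches_transformer = is_recomputed('transformer.layer', 'layer')
matches_encoder = is_recomputed('encoder.layer', 'layer')
matches_mlp = is_recomputed('encoder.mlp', 'layer')
```

Under this assumption a short entry such as layer selects every module whose name ends in layer, so a longer suffix is the way to narrow the selection.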

ZeRO Optimization

zero_stage (int, optional, default: 0)

ZeRO optimization stage (see ZeRO paper).

0: No ZeRO optimization
1: Optimizer state partitioning

zero_ngroups (int, optional, default: 1)

Number of ZeRO groups to balance memory usage and communication cost. Larger values use more memory but reduce communication overhead.
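To see why more groups cost more memory, consider the following sketch. It assumes that devices are split evenly into zero_ngroups groups and that the optimizer state is sharded across the devices within each group; both the helper opt_state_per_device and this sharding model are assumptions made for illustration, not a description of AutoDist internals.

```python
def opt_state_per_device(opt_state_bytes, num_devices, zero_ngroups=1):
    """Hypothetical estimate of the per-device optimizer state under ZeRO stage 1."""
    if num_devices % zero_ngroups != 0:
        raise ValueError('num_devices must be divisible by zero_ngroups')
    # Assumption: each group holds a full copy, sharded over its members.
    devices_per_group = num_devices // zero_ngroups
    return opt_state_bytes / devices_per_group


GiB = 1024 ** 3
total_opt_state = 16 * GiB  # illustrative value

one_group = opt_state_per_device(total_opt_state, num_devices=8, zero_ngroups=1)
four_groups = opt_state_per_device(total_opt_state, num_devices=8, zero_ngroups=4)
```

With one group the state is spread over all eight devices; with four groups each shard is four times larger, while communication stays within smaller groups.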

Pipeline Parallelism

pipeline_pivots (str, optional, default: '')

Module names that serve as pipeline stage boundaries, separated by commas.

config = AutoDistConfig(pipeline_pivots='encoder,decoder')

pipeline_nstages (int or 'auto', optional, default: 'auto')

Number of pipeline stages. Set to 1 to disable pipeline parallelism.

'auto': Automatically determine optimal number of stages
int: Fixed number of stages

pipeline_scheduler (str, optional, default: '1f1b')

Pipeline scheduling strategy. Currently only supports '1f1b' (1-forward-1-backward).

max_pipeline_bubble_ratio (float, optional, default: 0.2)

Maximum allowed bubble ratio in pipeline parallelism. Higher values allow more pipeline bubbles and enlarge the search space.

max_pipeline_unbalance_ratio (float, optional, default: 0.5)

Threshold on the balance between pipeline stages, where the unbalance ratio is defined as min_stage_time / max_stage_time. Higher values require better-balanced stages and shrink the search space.
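The two pipeline ratios can be illustrated with a short sketch. The unbalance ratio follows the definition given above; the bubble ratio uses the textbook estimate for a 1F1B schedule with equal stage times, (stages - 1) / (micro_batches + stages - 1), which is an assumption here and not necessarily the formula AutoDist evaluates. All function names and timings are hypothetical.

```python
def unbalance_ratio(stage_times):
    """min_stage_time / max_stage_time; 1.0 means perfectly balanced stages."""
    return min(stage_times) / max(stage_times)


def ideal_1f1b_bubble_ratio(num_stages, num_micro_batches):
    """Textbook bubble fraction for 1F1B with equal stage times (assumption)."""
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


# Illustrative per-stage execution times (arbitrary units).
stage_times = [10.0, 8.0, 9.0, 6.0]

balance = unbalance_ratio(stage_times)   # 6 / 10
bubble = ideal_1f1b_bubble_ratio(4, 16)  # 3 / 19, roughly 0.158

# Checks against the documented defaults (0.5 and 0.2).
balanced_enough = balance >= 0.5
within_bubble_budget = bubble <= 0.2
```

In this reading, a candidate plan is kept only if its stages are at least as balanced as max_pipeline_unbalance_ratio demands and its bubble fraction stays under max_pipeline_bubble_ratio.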

Mesh and Parallelism

mesh_row (int, optional, default: 1): Number of available nodes in the device mesh.
mesh_col (int, optional, default: 1): Number of available devices per node in the device mesh.
world_size (int, optional, default: 1): Total number of devices (mesh_row × mesh_col × scale_factor).
micro_batch_size (int, optional, default: 1): Micro batch size for gradient accumulation.
update_freq (int, optional, default: 1): Update frequency. The effective batch size is micro_batch_size × update_freq.
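The relationships among these parameters reduce to two products, sketched below. The helper names are hypothetical, and scale_factor is treated as a plain integer multiplier because this reference does not define it further.

```python
def expected_world_size(mesh_row, mesh_col, scale_factor=1):
    """Total devices: nodes * devices per node * scale_factor."""
    return mesh_row * mesh_col * scale_factor


def effective_batch_size(micro_batch_size, update_freq):
    """Samples contributing to one optimizer update."""
    return micro_batch_size * update_freq


# 2 nodes with 4 devices each, no additional scaling.
devices = expected_world_size(mesh_row=2, mesh_col=4)

# 2 samples per micro batch, accumulated over 4 micro batches.
batch = effective_batch_size(micro_batch_size=2, update_freq=4)
```

A quick check like this helps catch a world_size that disagrees with the mesh dimensions before a plan search starts.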

Profiling and Search

profile_dir (str, optional, default: ~/.cache/nnscaler/autodist/1.0/get_node_arch(), where get_node_arch() stands for the value that function returns)

Directory to store profiling results for computation cost estimation.

parallel_profile (bool, optional, default: True)

Whether to profile on multiple devices in parallel. Set to False for sequential profiling on a single device.

re_profile (bool, optional, default: False)

Whether to override existing profiling results and re-profile operations.

topk (int, optional, default: 20)

Number of parallelization plans to generate for robustness. Higher values provide more options but increase search time.

solver (str, optional, default: 'dp')

Solver algorithm for SPMD parallelism:

'dp': Dynamic programming
'ilp': Integer linear programming

nproc (int, optional, default: 1)

Number of processes for pipeline parallelism search.

Plan Management

load_plan_path (str, optional, default: ''): Path to load an existing parallelization plan. When specified, skips plan searching and uses the loaded plan.
save_plan_path (str, optional, default: ''): Path to save the generated parallelization plan for reuse.
partition_constraints_path (str, optional, default: ''): Path to partition constraints file. See solver_interface/partition_constraints for details.
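One way to combine load_plan_path and save_plan_path is to reuse a plan when its file already exists and save a new one otherwise. The sketch below builds the corresponding keyword arguments; the helper plan_kwargs and the example path are hypothetical, and the AutoDistConfig call is left as a comment so the snippet runs without nnscaler installed.

```python
import os


def plan_kwargs(plan_path):
    """Hypothetical helper: load the plan if it exists, otherwise save a new one."""
    if os.path.exists(plan_path):
        # A saved plan is available, so plan searching can be skipped.
        return {'load_plan_path': plan_path}
    # No plan yet: search as usual and store the result for later runs.
    return {'save_plan_path': plan_path}


# Usage (requires nnscaler; the path is illustrative):
# config = AutoDistConfig(task_name='my_experiment',
#                         **plan_kwargs('plans/my_experiment.plan'))
```

Passing only one of the two paths per run keeps the intent explicit: the first run searches and saves, and later runs load the stored plan.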

Training Configuration

is_train (bool, optional, default: True): Whether the model is for training or inference. Affects memory estimation and operator selection.

Debug and Optimization

verbose (bool, optional, default: False): Whether to print verbose information during plan generation.
ignore_small_tensor_threshold (int, optional, default: 1): Tensor size threshold (in elements) to ignore during analysis. Small tensors below this threshold are not considered for partitioning.

Example Configurations

High Memory Training

# Configuration for large model training with high memory
config = AutoDistConfig(
    task_name='large_model_training',
    memory_constraint=80,  # 80GB A100
    recompute_modules='transformer.layer',  # Selective recomputation
    zero_stage=1,  # Enable ZeRO stage 1
    zero_ngroups=4,  # Use 4 ZeRO groups
    opt_resident_coef=6,  # FP16 training with Adam
    opt_transient_coef=2,
    topk=50  # More plan options
)

Pipeline Parallelism

# Configuration for pipeline parallelism
config = AutoDistConfig(
    task_name='pipeline_training',
    pipeline_pivots='encoder,decoder',
    pipeline_nstages=4,
    pipeline_scheduler='1f1b',
    max_pipeline_bubble_ratio=0.1,  # Strict bubble control
    mesh_row=2,  # 2 nodes
    mesh_col=4,  # 4 GPUs per node
    micro_batch_size=2,
    update_freq=4  # Effective batch size = 2 * 4 = 8
)

Memory-Efficient Training

# Configuration for memory-efficient training
config = AutoDistConfig(
    task_name='efficient_training',
    is_train=True,
    consider_mem=True,
    memory_constraint=24,  # 24GB RTX 4090
    recompute_modules='attention,mlp',  # Selective recomputation
    solver='ilp',  # More precise optimization
    topk=10
)

Best Practices

Start Simple: Begin with default settings and gradually tune parameters based on your needs.
Memory Tuning:
- Consider recompute_modules for memory savings, but note that more aggressive recomputation (such as 'ROOT') doesn’t always provide maximum memory savings
- Adjust memory_constraint based on your hardware
- Fine-tune the optimizer coefficients (opt_resident_coef, opt_transient_coef) based on your training setup
- Experiment with different recomputation granularities to find the best memory-computation trade-off
Pipeline Parallelism:
- Choose pipeline_pivots at natural module boundaries
- Start with pipeline_nstages='auto' to find the optimal number of stages
- Monitor the bubble ratio and adjust max_pipeline_bubble_ratio
Profiling:
- Enable parallel_profile for faster profiling
- Set re_profile=True when changing hardware or model architecture
- Use a separate profile_dir for each experiment setup
Plan Management:
- Save successful plans with save_plan_path for reuse
- Use a descriptive task_name for better organization

Troubleshooting

Out of Memory Errors

Reduce memory_constraint so that the selected plan leaves more headroom below the device's physical memory
Experiment with different recompute_modules strategies (selective recomputation may be more effective than 'ROOT')
Enable ZeRO (zero_stage=1) or, if it is already enabled, decrease zero_ngroups, since larger values use more memory
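Revisit transient_mem_coef: a larger value reserves more room for transient memory, so reduce it only when the estimate is too strict for operators with very large tensors (≥4GB)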

Slow Plan Generation

Reduce topk for faster search
Use 'dp' solver instead of 'ilp'
Keep parallel_profile=True (the default) so that profiling runs on multiple devices in parallel
Increase ignore_small_tensor_threshold

Poor Performance

Check max_pipeline_bubble_ratio if using pipeline parallelism
Verify mesh_row and mesh_col match your hardware
Tune micro_batch_size and update_freq
Consider different recompute_modules strategies