##########################################
LongRoPE2 context length extension Example
##########################################

************
Introduction
************

LongRoPE2 is an advanced version of LongRoPE that significantly improves long-context extension for RoPE-based LLMs. It has been adopted in Phi4-mini and Phi4-multimodal.

This example covers the training part of LongRoPE2. Before training, please use the LongRoPE repo to search for the RoPE extension scaling factors for your model. This example provides the scaling factors for llama3-8b-base as a reference, so if you want to try llama3-8b-base you can run this example directly.

***********
Preparation
***********

If this is your first time using nnScaler, it is better to start with ``examples/llama`` for more usage details, although you can also follow this example directly and run it end to end.

The following packages are assumed to be installed in the environment. ::

    nnscaler
    zstandard
    transformers>=4.48
    datasets
    tensorboard
    apex
    flash-attn

A new model config that includes the longrope ``rope_scaling`` field and ``original_max_position_embeddings`` is needed; see ``examples/longrope2/llama3_8b_longrope2_config.json`` for reference.

****************
Data Preparation
****************

We use ``HuggingFaceFW/fineweb-edu`` for short context window training and ``togethercomputer/RedPajama-Data-1T`` for long context window training.

.. code-block:: bash

    export PYTHONPATH=$PYTHONPATH:/home/USER_NAME/MagicCube:/home/USER_NAME/MagicCube/examples

    # download data to MagicCube/examples/longrope2/data, will take around 100GB of disk space.
    python data/download.py

    # process the data into the mixed context window length format for long context training, will take around 900GB of disk space.
    python data/process.py --tokenizer_name_or_path "meta-llama/Meta-Llama-3-8B"

If you don't have that much disk space, i.e., about 1 TB free, you can use a sub-dataset by modifying the code.
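Because the download and processing steps above need roughly 100 GB and 900 GB, a quick free-space check before starting can avoid a failed run partway through. The following is only a minimal sketch and is not part of the example scripts: the helper name ``has_free_space`` and its 1 TB default are assumptions based on the sizes quoted above.

```python
import shutil


def has_free_space(path: str, required_gb: float = 1000.0) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1e9


if __name__ == "__main__":
    # Hypothetical target: the directory that will hold the downloaded data.
    target = "."
    if has_free_space(target):
        print("Enough free space for the full dataset.")
    else:
        print("Less than about 1 TB free; consider using a sub-dataset.")
```

Run it from the directory that will hold ``data/`` so that the check applies to the right filesystem.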
********
Training
********

The main difference from the common long-context training example ``examples/llama`` is that ``--model_config`` must be passed so that the RoPE extension scaling factors reach the model.

.. code-block:: bash

    # compile the distributed code for llama3 model with dp2, tp4 on 8 gpus
    python train.py --run_mode compile --model_id "meta-llama/Meta-Llama-3-8B" \
        --model_config llama3_8b_longrope2_config.json \
        --dataset_path data/mix-context-win-short-8192-long-131072 \
        --plan_ngpus=4 --runtime_ngpus=8 \
        --recompute_modules LlamaDecoderLayer --gpu_mem_constraint 64 \
        --enable-chunk-loss --grad_accumulation_steps 16 \
        --max_train_steps 2250 2>&1 | tee compile.log

    # run the training job
    torchrun --nproc_per_node=8 train.py --model_id "meta-llama/Meta-Llama-3-8B" \
        --model_config llama3_8b_longrope2_config.json \
        --dataset_path data/mix-context-win-short-8192-long-131072 \
        --plan_ngpus=4 --runtime_ngpus=8 \
        --recompute_modules LlamaDecoderLayer --gpu_mem_constraint 64 \
        --enable-chunk-loss --grad_accumulation_steps 16 \
        --max_train_steps 2250 2>&1 | tee run.log

**********
Additional
**********

For more details on how to change the distributed plan or merge checkpoints, please refer to ``examples/llama/README.rst``.
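Before launching the compile step, it can help to confirm that the file passed via ``--model_config`` actually carries the two fields named in the Preparation section. The sketch below is an illustration, not part of ``train.py``: the helper names are hypothetical, and it assumes both fields appear as top-level keys of the JSON config.

```python
import json

# The two fields this README says the model config must contain.
# Assumption: both are top-level keys in the JSON file.
REQUIRED_FIELDS = ("rope_scaling", "original_max_position_embeddings")


def missing_longrope_fields(config: dict) -> list:
    """Return the required field names that are absent from `config`."""
    return [name for name in REQUIRED_FIELDS if name not in config]


def check_config_file(path: str) -> None:
    """Load a JSON model config and raise if a required field is missing."""
    with open(path) as f:
        config = json.load(f)
    missing = missing_longrope_fields(config)
    if missing:
        raise ValueError(f"{path} is missing fields: {missing}")
```

For example, calling ``check_config_file("llama3_8b_longrope2_config.json")`` from ``examples/longrope2`` returns silently when both fields are present and raises ``ValueError`` otherwise. It does not validate the scaling factor values themselves.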