LongRoPE2 Context Length Extension Example

Introduction

LongRoPE2 is an advanced version of LongRoPE that significantly improves long-context extension for RoPE-based LLMs. It has been adopted in Phi4-mini and Phi4-multimodal.

This example covers the training part of LongRoPE2. Before training, use the LongRoPE repo <https://github.com/microsoft/LongRoPE> to search for the RoPE extension scaling factor for your model. As a reference, this example provides the extension scaling factor for llama3-8b-base, so if you want to try llama3-8b-base you can run this example directly.

Preparation

If this is your first time using nnScaler, it is better to start with examples/llama, which explains usage in more detail. It is also fine to follow this example directly to get a run working.

The following packages are assumed to be installed in the environment:

nnscaler
zstandard
transformers>=4.48
datasets
tensorboard
apex
flash-attn
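
A quick way to check the list above before starting a run is sketched below. It only reports which distributions are missing and does not verify the transformers>=4.48 constraint. The helper name find_missing is hypothetical, and the distribution names are assumed to match the package names listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the package list above.
REQUIRED = [
    "nnscaler",
    "zstandard",
    "transformers",
    "datasets",
    "tensorboard",
    "apex",
    "flash-attn",
]


def find_missing(distributions):
    """Return the distribution names that are not installed in this environment."""
    missing = []
    for name in distributions:
        try:
            version(name)  # raises PackageNotFoundError when the package is absent
        except PackageNotFoundError:
            missing.append(name)
    return missing


missing = find_missing(REQUIRED)
print("missing:", ", ".join(missing) if missing else "none")
```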

A new model config that includes the longrope rope_scaling field and original_max_position_embeddings is needed; see examples/longrope2/llama3_8b_longrope2_config.json for reference.
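
As a rough illustration of the two fields mentioned above, the sketch below builds the relevant part of such a config. It assumes the layout that Hugging Face transformers uses for the "longrope" rope type (a short_factor and a long_factor list with one entry per pair of head dimensions); the referenced JSON file remains the authoritative source. The helper name build_longrope_overrides, the 128k target window, and the factor values are placeholders for illustration, not the searched factors shipped with this example.

```python
import json

HEAD_DIM = 128            # Llama-3-8B: hidden_size 4096 / 32 attention heads
ORIGINAL_MAX_POS = 8192   # pretrained context window of Llama-3-8B
TARGET_MAX_POS = 131072   # assumed target window (128k), for illustration only


def build_longrope_overrides(short_factor, long_factor):
    """Return the config fields that carry the searched rescale factors."""
    expected = HEAD_DIM // 2  # RoPE rotates dimension pairs: one factor per pair
    for name, factors in (("short_factor", short_factor), ("long_factor", long_factor)):
        if len(factors) != expected:
            raise ValueError(f"{name} needs {expected} entries, got {len(factors)}")
    return {
        "max_position_embeddings": TARGET_MAX_POS,
        "original_max_position_embeddings": ORIGINAL_MAX_POS,
        "rope_scaling": {
            "rope_type": "longrope",
            "short_factor": list(short_factor),
            "long_factor": list(long_factor),
        },
    }


# Placeholder values only; real factors come from the LongRoPE search.
placeholder = [1.0] * (HEAD_DIM // 2)
overrides = build_longrope_overrides(placeholder, placeholder)
print(json.dumps(overrides, indent=2)[:200])
```

Checking the factor lengths up front catches a mismatch between the searched factors and the model's head dimension before a long training job is launched.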

Data Preparation

We use HuggingFaceFW/fineweb-edu for short context window training and togethercomputer/RedPajama-Data-1T for long context window training.

If you do not have enough free disk space (about 1 TB), you can take a subset of the datasets by modifying the code.
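
One possible way to take such a subset is sketched below. It is not the data preparation code of this example: take_first and stream_subset are hypothetical helpers, and stream_subset assumes the datasets library's streaming mode, which reads samples lazily instead of downloading the full dataset to disk. Some datasets may need extra arguments (for example a configuration name) that are not shown here.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_first(samples: Iterable[dict], limit: int) -> Iterator[dict]:
    """Yield at most `limit` samples without materializing the rest."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    return islice(samples, limit)


def stream_subset(dataset_name: str, limit: int, split: str = "train") -> Iterator[dict]:
    """Stream the first `limit` samples of a Hugging Face dataset (needs network access)."""
    from datasets import load_dataset  # imported lazily so take_first works without it

    stream = load_dataset(dataset_name, split=split, streaming=True)
    return take_first(stream, limit)


# Local demonstration with an in-memory generator instead of a real dataset.
demo = list(take_first(({"id": i} for i in range(10)), 3))
print(demo)
```

For instance, stream_subset("HuggingFaceFW/fineweb-edu", 100000) would iterate over the first 100000 training samples; the sample count here is arbitrary.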

Training

The main difference from the common long-context training example in examples/llama is that --model_config must be passed so that the RoPE extension scaling factor reaches the model.
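
To make the role of this flag concrete, the sketch below shows how a script could accept --model_config and read the two LongRoPE-related fields from the JSON file. It is a hypothetical illustration using only the standard library, not the training script of this example; only the flag name --model_config comes from the text above.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the command line of a hypothetical training entry point."""
    parser = argparse.ArgumentParser(description="Illustrative sketch only")
    parser.add_argument("--model_config", required=True,
                        help="path to a JSON model config with LongRoPE fields")
    return parser.parse_args(argv)


def load_rope_fields(config_path):
    """Read rope_scaling and original_max_position_embeddings from a config file."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    for key in ("rope_scaling", "original_max_position_embeddings"):
        if key not in config:
            raise KeyError(f"{config_path} is missing required field: {key}")
    return config["rope_scaling"], config["original_max_position_embeddings"]
```

A caller would combine the two, for example load_rope_fields(parse_args().model_config), and fail early if the config lacks the extension fields.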

Additional

For more details on how to change the distributed plan or merge checkpoints, see examples/llama/README.rst.