Llama 3 Demo

This is an example demostrating how to train Llama 3 8B with nnScaler’s trainer.

The example contains one single script, train.py.

Get Started

Installation

  1. Get your Hugging Face token to access Llama 3 model

    export HF_TOKEN=...
    
  2. Clone nnScaler repo

    git clone --recursive https://github.com/microsoft/nnscaler
    
  3. Install dependencies (including Llama 3 dependencies) and nnScaler from source

    cd nnscaler
    pip install -r requirements.txt
    pip install -e .
    
  4. Find the Llama 3 example

    cd nnscaler/examples/llama3_demo
    
  5. Prepare dataset

    # To run Llama 3 8B:
    python train.py --prepare_data
    
    # Or to run a shrinked Llama for debug:
    python train.py --prepare_data --mini
    

Train a Mini-model

This examples requires 8 x 80GB GPU memory to train a full 8B model. If your have qualified GPUs, you can go to the next section.

Alternatively, you may start from a smaller model for verification:

python train.py --prepare_data --mini
torchrun --nproc_per_node=2 train.py --mini

This will resize Llama 3 into a model with 4 hidden layers and max-sequence-length reduced to 4K (4096). We have tested it with 2 x 48GB GPUs.

You may further shrink it if the model is still too large:

python train.py --prepare_data --max_seq_len=1024
torchrun --nproc_per_node=2 train.py --max_seq_len=1024 --num_hidden_layers=2 --from_scratch

Here is the training loss with the default mini config (4 layers, 4K sequence length):

../_images/llama3-curves-mini.png

Finetune Llama 3 8B

Use the following commands to finetune Meta-Llama-3-8B-Instruct:

python train.py --prepare_data
torchrun --nproc_per_node=8 train.py
../_images/llama3-curves-8b.png

Resuming

The example will save checkpoint files after finishing 1000 steps then exit. To continue training from the saved checkpoint:

torchrun --nproc_per_node=8 train.py --resume_from=last --max_train_steps=2000

Please note that the checkpoint is sharded as multiple files. If you want to resume a checkpoint in a different environment, you need to merge it into an single checkpoint file first:

python train.py --merge_checkpoint=./checkpoints/last
torchrun --nproc_per_node=8 train.py --resume_from=./checkpoints/merged.ckpt --max_train_steps=3000