Llama 3 Demo

This example demonstrates how to train Llama 3 8B with nnScaler’s trainer.

The example consists of a single script, train.py.

Get Started

Installation

Get a Hugging Face token that has access to the Llama 3 model and export it:
```
export HF_TOKEN=...
```
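
Because access to the Llama 3 weights depends on this variable, it can help to fail early when it is missing. The following is a minimal sketch, not part of train.py; the helper name `require_hf_token` is hypothetical, and only the `HF_TOKEN` variable comes from the steps above.

```python
import os


def require_hf_token(env=None):
    """Return the Hugging Face token from HF_TOKEN, or raise if it is missing.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    This helper is hypothetical and is not provided by nnScaler.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running train.py")
    return token
```

Calling `require_hf_token()` at the top of a launcher script would stop the run before any download is attempted.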

Clone the nnScaler repo
```
git clone --recursive https://github.com/microsoft/nnscaler
```

Install dependencies (including Llama 3 dependencies) and nnScaler from source
```
cd nnscaler
pip install -r requirements.txt
pip install -e .
```
Find the Llama 3 example
```
cd nnscaler/examples/llama3_demo
```

Prepare the dataset
```
# To run Llama 3 8B:
python train.py --prepare_data

# Or to run a shrunken Llama for debugging:
python train.py --prepare_data --mini
```

Train a Mini-model

This example requires 8 GPUs with 80GB of memory each to train the full 8B model. If you have suitable GPUs, you can skip to the next section.
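
A back-of-the-envelope estimate shows why a single 80GB GPU is not enough for the full model. The sketch below assumes mixed-precision training with Adam (16 bytes of model state per parameter: 2 for half-precision weights, 2 for gradients, 12 for the fp32 master weights and two optimizer moments). That setup is an assumption, not something this example states, and the estimate ignores activations, so actual usage under nnScaler will differ.

```python
# Assumed bytes of model state per parameter for mixed-precision Adam:
# 2 (fp16/bf16 weights) + 2 (gradients) + 12 (fp32 weights, momentum, variance).
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Estimate weight + gradient + optimizer memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_gb = model_state_gb(8e9)   # nominal 8B parameters
per_gpu_gb = total_gb / 8        # if the state were split evenly over 8 GPUs

print(f"model state: {total_gb:.0f} GB total, {per_gpu_gb:.0f} GB per GPU")
```

Under these assumptions the model state alone exceeds one 80GB device, and the remaining per-GPU memory is what is left for activations and communication buffers.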

Alternatively, you may start from a smaller model for verification:

```
python train.py --prepare_data --mini
torchrun --nproc_per_node=2 train.py --mini
```

This shrinks Llama 3 to a model with 4 hidden layers and reduces the maximum sequence length to 4K (4096 tokens). We have tested it with 2 x 48GB GPUs.
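
To get a feel for how much smaller the 4-layer model is, the parameter count can be estimated from the model shape. This sketch uses the published Llama 3 8B dimensions and assumes the mini config changes only the number of hidden layers (the sequence length does not affect the parameter count). The `LlamaShape` class and `count_params` function are illustrative names, not part of train.py or nnScaler.

```python
from dataclasses import dataclass


@dataclass
class LlamaShape:
    """Shape of a Llama-style decoder; defaults follow Llama 3 8B."""
    vocab_size: int = 128_256
    hidden_size: int = 4096
    intermediate_size: int = 14_336
    num_attention_heads: int = 32
    num_key_value_heads: int = 8
    num_hidden_layers: int = 32


def count_params(s):
    """Count parameters, assuming untied input and output embeddings."""
    head_dim = s.hidden_size // s.num_attention_heads
    kv_dim = head_dim * s.num_key_value_heads
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are hidden x kv_dim.
    attn = 2 * s.hidden_size * s.hidden_size + 2 * s.hidden_size * kv_dim
    # gate_proj, up_proj, and down_proj each hold hidden x intermediate weights.
    mlp = 3 * s.hidden_size * s.intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * s.hidden_size
    per_layer = attn + mlp + norms
    embeddings = 2 * s.vocab_size * s.hidden_size  # token embedding + lm_head
    final_norm = s.hidden_size
    return s.num_hidden_layers * per_layer + embeddings + final_norm


full = count_params(LlamaShape())
mini = count_params(LlamaShape(num_hidden_layers=4))
print(f"full: {full / 1e9:.2f}B parameters, mini: {mini / 1e9:.2f}B parameters")
```

Under these assumptions the full model has about 8.03 billion parameters and the 4-layer variant about 1.92 billion, with more than half of the latter sitting in the embedding and output layers.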

You may further shrink it if the model is still too large:

```
python train.py --prepare_data --max_seq_len=1024
torchrun --nproc_per_node=2 train.py --max_seq_len=1024 --num_hidden_layers=2 --from_scratch
```

[Figure: training loss with the default mini config (4 layers, 4K sequence length)]

Finetune Llama 3 8B

Use the following commands to finetune Meta-Llama-3-8B-Instruct:

```
python train.py --prepare_data
torchrun --nproc_per_node=8 train.py
```

Resuming

The example saves checkpoint files after 1000 steps and then exits. To continue training from the saved checkpoint:

```
torchrun --nproc_per_node=8 train.py --resume_from=last --max_train_steps=2000
```
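
If you launch runs from a Python script, for example to chain several resume stages, the command line can be assembled programmatically. This is a minimal sketch: the helper `build_train_command` is hypothetical, and it uses only the flags shown in this document.

```python
import shlex


def build_train_command(nproc_per_node, resume_from=None, max_train_steps=None, mini=False):
    """Build the torchrun argument list for train.py (hypothetical helper)."""
    cmd = ["torchrun", f"--nproc_per_node={nproc_per_node}", "train.py"]
    if mini:
        cmd.append("--mini")
    if resume_from is not None:
        cmd.append(f"--resume_from={resume_from}")
    if max_train_steps is not None:
        cmd.append(f"--max_train_steps={max_train_steps}")
    return cmd


# Reproduce the resume command: continue from the last checkpoint up to step 2000.
resume_cmd = build_train_command(8, resume_from="last", max_train_steps=2000)
print(shlex.join(resume_cmd))
```

The returned list can be passed to `subprocess.run(resume_cmd, check=True)` from the example directory, which avoids shell quoting issues.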

Please note that the checkpoint is sharded across multiple files. If you want to resume from a checkpoint in a different environment, you need to merge it into a single checkpoint file first:

```
python train.py --merge_checkpoint=./checkpoints/last
torchrun --nproc_per_node=8 train.py --resume_from=./checkpoints/merged.ckpt --max_train_steps=3000
```
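
The need for merging can be illustrated with a toy model of sharding. The sketch below is purely conceptual: it splits a flat list of values across ranks and is not nnScaler's checkpoint format or merge logic, which depend on the parallelization plan. The function names `shard` and `merge` are hypothetical.

```python
def shard(values, num_ranks):
    """Split a flat list into num_ranks contiguous chunks (toy sharding)."""
    chunk = -(-len(values) // num_ranks)  # ceiling division
    return [values[i * chunk:(i + 1) * chunk] for i in range(num_ranks)]


def merge(shards):
    """Concatenate per-rank chunks back into one flat list (toy merge)."""
    return [value for part in shards for value in part]


weights = list(range(8))

# Saved from a 4-rank run: each file holds only a slice of the state.
saved = shard(weights, 4)

# A 2-rank environment cannot load the 4 slices directly, but it can
# re-shard the merged state for its own layout.
restored = shard(merge(saved), 2)
print(saved, restored)
```

A merged checkpoint plays the role of `merge(saved)` here: a layout-independent copy that a new environment can partition as it sees fit.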