############
Llama 3 Demo
############

This is an example demonstrating how to train Llama 3 8B with nnScaler's :doc:`trainer <../trainer>`.

The example consists of a single script, ``train.py``.

***********
Get Started
***********

Installation
============

0. Get your `Hugging Face token `_ to access the Llama 3 model ::

    export HF_TOKEN=...

1. Clone the nnScaler repo ::

    git clone --recursive https://github.com/microsoft/nnscaler

2. Install dependencies (including Llama 3 dependencies) and :doc:`nnScaler from source <../install_from_source>` ::

    cd nnscaler
    pip install -r requirements.txt
    pip install -e .

3. Find the Llama 3 example ::

    cd nnscaler/examples/llama3_demo

4. Prepare the dataset ::

    # To run Llama 3 8B:
    python train.py --prepare_data
    # Or to run a shrunken Llama for debugging:
    python train.py --prepare_data --mini

Train a Mini-model
==================

This example requires 8 GPUs with 80GB of memory each to train the full 8B model.
If you have suitable GPUs, you can go to :ref:`the next section <finetune>`.

Alternatively, you may start from a smaller model for verification: ::

    python train.py --prepare_data --mini
    torchrun --nproc_per_node=2 train.py --mini

This will resize Llama 3 into a model with 4 hidden layers and a maximum sequence length reduced to 4K (4096).
We have tested it with 2 x 48GB GPUs.

You may shrink it further if the model is still too large: ::

    python train.py --prepare_data --max_seq_len=1024
    torchrun --nproc_per_node=2 train.py --max_seq_len=1024 --num_hidden_layers=2 --from_scratch

Here is the training loss with the default mini config (4 layers, 4K sequence length):

.. image:: ../images/llama3-curves-mini.png

.. _finetune:

Finetune Llama 3 8B
===================

Use the following commands to finetune `Meta-Llama-3-8B-Instruct `_: ::

    python train.py --prepare_data
    torchrun --nproc_per_node=8 train.py

.. image:: ../images/llama3-curves-8b.png

********
Resuming
********

The example saves checkpoint files after finishing 1000 steps and then exits.
To continue training from the saved checkpoint: ::

    torchrun --nproc_per_node=8 train.py --resume_from=last --max_train_steps=2000

Please note that the checkpoint is sharded into multiple files.
If you want to resume from a checkpoint in a different environment, you need to merge it into a single checkpoint file first: ::

    python train.py --merge_checkpoint=./checkpoints/last
    torchrun --nproc_per_node=8 train.py --resume_from=./checkpoints/merged.ckpt --max_train_steps=3000
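
The two resume paths above (resuming directly from the sharded checkpoint, or merging first and resuming from the merged file) can be scripted.
The following is only a minimal sketch: the helper name ``resume_commands`` and its parameters are hypothetical and not part of ``train.py``; only the command-line flags and paths are taken from the commands shown above. ::

    import shlex


    def resume_commands(nproc_per_node=8, max_train_steps=2000, merge_first=False):
        """Build the resume commands as argument lists (hypothetical helper)."""
        commands = []
        resume_from = "last"
        if merge_first:
            # Merge the sharded checkpoint into a single file before resuming,
            # as needed when moving to a different environment.
            commands.append(
                ["python", "train.py", "--merge_checkpoint=./checkpoints/last"]
            )
            resume_from = "./checkpoints/merged.ckpt"
        commands.append(
            [
                "torchrun",
                f"--nproc_per_node={nproc_per_node}",
                "train.py",
                f"--resume_from={resume_from}",
                f"--max_train_steps={max_train_steps}",
            ]
        )
        return commands


    if __name__ == "__main__":
        # Print the commands for the merge-then-resume path without running them.
        for command in resume_commands(max_train_steps=3000, merge_first=True):
            print(shlex.join(command))

The helper only assembles and prints the commands, so they can be reviewed before being run by hand or passed to ``subprocess.run``.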