ViT Example

Introduction

This example demonstrates how to use nnscaler to fine-tuning a transformer model. Here we use ViT as an example.

Requirements

To run this example, you need to install the packages listed in the requirements.txt file. You can install them by running the following command:

pip install -r requirements.txt

nnScaler is a framework for distributed training by automatically partitioning the model. Apart from the core nnScaler library, it also includes a mini-trainer for modern model training. You can find related documents and examples at nnScaler

transformers and datasets are required to prepare the data and loading the model.

The implementation is inspired by here. Many thanks to the author.

Run

First go to examples/vit directory, You can use the following command to run the example:

  1. Use transformer.train() to train the model

    • python examples/vit/vit_cli.py: will use DataParallel to train the model. It will utilize all your GPUs in current node. You can specify the GPUs with CUDA_VISIBLE_DEVICES env variable.

    • torchrun --nproc_per_node=<gpus> --nnodes=<nodes> examples/vit/vit_cli.py: will use DistributedDataParallel to train the model.

  2. Use nnscaler to train the model torchrun --nproc_per_node=<gpus> --nnodes=<nodes> $(which nnscaler-train) -f train_cli_args.yaml

In order to be consistent with transformers.train(), we use dataloader/scheduler from transformers. See accelerator_dataloader_fn and scheduler_fn functions in the code for details. If you don’t need to be consistent with transformers, you can just use your own dataloader/scheduler.

Please note nnscaler only supports 1 optimizer parameter group for now. So we also disable multiple parameter groups in the code when you use transformers.train(). See SingleParamGroupTrainer in the code for details.

The loss of nnscaler will be exactly the same as transformers given gnorm clipping is disabled. When gnorm clipping is enabled, the loss will be slightly different due to the difference in gnorm calculation.