# ViT Example ## Introduction This example demonstrates how to use nnscaler to fine-tuning a transformer model. Here we use ViT as an example. ## Requirements To run this example, you need to install the packages listed in the `requirements.txt` file. You can install them by running the following command: ```bash pip install -r requirements.txt ``` *nnScaler* is a framework for distributed training by automatically partitioning the model. Apart from the core nnScaler library, it also includes a mini-trainer for modern model training. You can find related documents and examples at [nnScaler](https://github.com/microsoft/nnscaler) *transformers* and *datasets* are required to prepare the data and loading the model. The implementation is inspired by [here](https://medium.com/@supersjgk/fine-tuning-vision-transformer-with-hugging-face-and-pytorch-df19839d5396). Many thanks to the author. ### Run First go to `examples/vit` directory, You can use the following command to run the example: 1. Use transformer.train() to train the model - `python examples/vit/vit_cli.py`: will use `DataParallel` to train the model. It will utilize all your GPUs in current node. You can specify the GPUs with `CUDA_VISIBLE_DEVICES` env variable. - `torchrun --nproc_per_node= --nnodes= examples/vit/vit_cli.py`: will use `DistributedDataParallel` to train the model. 2. Use nnscaler to train the model `torchrun --nproc_per_node= --nnodes= $(which nnscaler-train) -f train_cli_args.yaml` In order to be consistent with `transformers.train()`, we use dataloader/scheduler from `transformers`. See `accelerator_dataloader_fn` and `scheduler_fn` functions in the code for details. If you don't need to be consistent with `transformers`, you can just use your own dataloader/scheduler. Please note `nnscaler` only supports 1 optimizer parameter group for now. So we also disable multiple parameter groups in the code when you use `transformers.train()`. See `SingleParamGroupTrainer` in the code for details. The loss of `nnscaler` will be exactly the same as `transformers` given gnorm clipping is disabled. When gnorm clipping is enabled, the loss will be slightly different due to the difference in gnorm calculation.