###############
Troubleshooting
###############

Reuse Cache
===========

I have modified the model but the result does not change
--------------------------------------------------------

Remove the ``.nnscaler`` directory in the working path and try again.

nnScaler first compiles the model and then runs the compiled (generated) model.
After modifying the original model, you need to tell nnScaler to re-compile it.

This can be achieved in two ways:

1. Remove the compiled model (located in the ``.nnscaler`` directory);
2. Set ``TrainerArgs.gen_reuse`` to ``"override"``.

We recommend setting ``gen_reuse="override"`` while debugging the model,
and changing it to ``gen_reuse="auto"`` for deployment.

.. code-block:: python

    trainer_args = TrainerArgs(
        gen_reuse='override',
        ...
    )
    trainer = Trainer(trainer_args=trainer_args)
    trainer.run()

Note that setting ``gen_reuse="match"`` will NOT solve this problem,
since it only checks ``compute_config``, not the model.

"RuntimeError: Output directory ... is not empty. And the existing files do not match..." after modifying models
----------------------------------------------------------------------------------------------------------------

As the error message says, remove the ``.nnscaler`` directory.

To prevent this kind of error permanently, you can set ``gen_reuse`` to ``"override"``,
at the expense of extra compile time.

Example stacktrace: ::

    Traceback (most recent call last):
      File "train.py", line 244, in <module>
        main()
      File "train.py", line 240, in main
        trainer.run()
      File ".../nnscaler/cli/trainer.py", line 95, in run
        self._setup()
      File ".../nnscaler/cli/trainer.py", line 206, in _setup
        pmodel_class = nnscaler.parallelize(
      File ".../nnscaler/parallel.py", line 983, in parallelize
        outdir, reusable = _prepare_and_check_reusable(gen_savedir, module_class, compute_config, instance_name, reuse)
      File ".../nnscaler/parallel.py", line 547, in _prepare_and_check_reusable
        raise RuntimeError(f'Output directory {outdir} is not empty. '
    RuntimeError: Output directory .../.nnscaler/_parallel_modules/__main__/WrapperModel/_ is not empty. And the existing files do not match with current config. You can remove the directory and try again, or set reuse to ReuseType.NONE/ReuseType.OVERRIDE to regenerate the code.

Known Issues
============

"KeyError: '__mro__'" and errors mentioning "_dynamo"
-----------------------------------------------------

Add ``import torch._dynamo`` to the beginning of your main script.

Due to a limitation in nnScaler, the dynamic import of ``torch._dynamo`` cannot be correctly traced.
This can be worked around by importing it before tracing.

Example stacktrace: ::

    Traceback (most recent call last):
      File "train.py", line 286, in <module>
        trainer.run()
      File ".../nnscaler/cli/trainer.py", line 95, in run
        self._setup()
      File ".../nnscaler/cli/trainer.py", line 206, in _setup
        pmodel_class = nnscaler.parallelize(
      File ".../nnscaler/parallel.py", line 993, in parallelize
        regen_status = _gencode(
      ......
      File ".../site-packages/transformers/models/llama/modeling_llama.py", line 1041, in _update_causal_mask
        if AttentionMaskConverter._ignore_causal_mask_sdpa(
      File ".../nnscaler/graph/parser/fx/concrete_trace_utils/operator_patcher.py", line 354, in patch_run
        return new_func(*args, **kwargs)
      File ".../site-packages/transformers/modeling_attn_mask_utils.py", line 259, in _ignore_causal_mask_sdpa
        or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
      File ".../nnscaler/graph/parser/fx/concrete_trace_utils/operator_patcher.py", line 354, in patch_run
        return new_func(*args, **kwargs)
      File ".../site-packages/torch/__init__.py", line 2003, in __getattr__
        return importlib.import_module(f".{name}", __name__)
      File ".../importlib/__init__.py", line 126, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      ......
      File ".../site-packages/torch/_dynamo/utils.py", line 567, in unwrap_with_attr_name_if_wrapper
        elif is_function(fn) and inspect.getattr_static(fn, "_torchdynamo_inline", False):
      File ".../inspect.py", line 1738, in getattr_static
        if not _is_type(obj):
      File ".../inspect.py", line 1707, in _is_type
        _static_getmro(obj)
      File ".../inspect.py", line 1685, in _static_getmro
        return type.__dict__['__mro__'].__get__(klass)
    KeyError: '__mro__'

"ModuleNotFoundError: No module named 'nnscaler.autodist.dp_solver'" when using editable install
------------------------------------------------------------------------------------------------

Run the following command: ::

    python -c 'import os,sys,nnscaler,cppimport.import_hook ; sys.path.append(os.path.dirname(nnscaler.__path__[0])) ; import nnscaler.autodist.dp_solver'

If it complains that ``GLIBCXX_x.y.z`` is not found, check the next issue.

Example stacktrace: ::

    Traceback (most recent call last):
      File "model.py", line 48, in <module>
        trainer.run()
      File ".../nnscaler/cli/trainer.py", line 95, in run
        self._setup()
      File ".../nnscaler/cli/trainer.py", line 206, in _setup
        pmodel_class = nnscaler.parallelize(
      File ".../nnscaler/parallel.py", line 988, in parallelize
        regen_status = _gencode(
      File ".../nnscaler/parallel.py", line 753, in _gencode
        graph = pas_policy(graph, compute_config)
      File ".../nnscaler/policies.py", line 303, in pas_autodist
        return parallelize_graph(graph, autodist_cfg)
      File ".../nnscaler/autodist/apis.py", line 117, in parallelize_graph
        search_out = calc_parallel_plan(graph, autodist_config)
      File ".../nnscaler/autodist/apis.py", line 98, in calc_parallel_plan
        pp_out = calc_optimal_spmd_plan(autodist_graph, autodist_config)
      File ".../nnscaler/autodist/spmd_solver.py", line 1503, in calc_optimal_spmd_plan
        spmd_outs = spmd_solver.solve([(0, model_graph.op_num - 1)], 1)[0]
      File ".../nnscaler/autodist/spmd_solver.py", line 1374, in solve
        return self.do_dp(intervals, topk)
      File ".../nnscaler/autodist/spmd_solver.py", line 1183, in do_dp
        import nnscaler.autodist.dp_solver as dp_solver
    ModuleNotFoundError: No module named 'nnscaler.autodist.dp_solver'

"ImportError: ...... libstdc++.so.6: version \`GLIBCXX_x.y.z' not found"
-------------------------------------------------------------------------

This is caused by a mismatch between the gcc and libstdc++ versions.
Typically it means the extension was built with the system gcc but is loaded with conda's libstdc++.
You can remove conda's libstdc++ to force it to use the system one: ::

    rm <path to conda env>/lib/libstdc++.so.6

Replace the placeholder with the actual path, which is shown in the error message.

Example stacktrace: ::

    $ python -c 'import nnscaler,cppimport.import_hook ; import nnscaler.autodist.dp_solver'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ImportError: /home/user/miniconda3/envs/user/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by .../nnscaler/autodist/dp_solver.cpython-310-x86_64-linux-gnu.so)

Incorrect Usages
================

"RuntimeError: Loss can only be scalar tensor ..." when forward returns dict
----------------------------------------------------------------------------

When using nnScaler's Trainer, the return value of the top-level ``forward()`` must not be a dict.
It can be either:

1. A loss tensor;
2. A tuple whose first element is a loss tensor.

Detailed explanation: :ref:`end2end model `.

How to fix:

.. code-block:: diff

     def forward(self, data):
         ...
    -    return {'loss': loss, 'ntokens': ntokens}
    +    return loss, ntokens

Example stacktrace: ::

    Traceback (most recent call last):
      File "example.py", line 27, in <module>
        trainer.run()
      File ".../nnscaler/cli/trainer.py", line 95, in run
        self._setup()
      File ".../nnscaler/cli/trainer.py", line 206, in _setup
        pmodel_class = nnscaler.parallelize(
      File ".../nnscaler/parallel.py", line 988, in parallelize
        regen_status = _gencode(
      File ".../nnscaler/parallel.py", line 737, in _gencode
        graph, forward_args = _gen_graph(
      File ".../nnscaler/parallel.py", line 656, in _gen_graph
        raise RuntimeError(f"Loss can only be scalar tensor but got {ir_loss.shape if isinstance(ir_loss, IRTensor) else ir_loss}")
    RuntimeError: Loss can only be scalar tensor but got {'loss': t1596(p920,(1,),d(),v(0/1)), 'ntokens': t1597(p922,(1,),d(),v(0/1))}

"TypeError: ... 'device_type' must be str, not ConcreteAttrProxy" when using torch>=2.4
---------------------------------------------------------------------------------------

nnScaler does not support torch 2.4 yet.
Downgrading to torch 2.3.* fixes the issue: ::

    pip install "torch<2.4"

Example stacktrace: ::

    Traceback (most recent call last):
      File "model.py", line 43, in <module>
        trainer.run()
      File ".../nnscaler/cli/trainer.py", line 95, in run
        self._setup()
      File ".../nnscaler/cli/trainer.py", line 206, in _setup
        pmodel_class = nnscaler.parallelize(
      File ".../nnscaler/parallel.py", line 988, in parallelize
        regen_status = _gencode(
      ......
      File ".../nnscaler/graph/parser/fx/concrete_trace_utils/operator_patcher.py", line 354, in patch_run
        return new_func(*args, **kwargs)
      File ".../torch/amp/autocast_mode.py", line 237, in __init__
        if not is_autocast_available(self.device):
      File ".../torch/amp/autocast_mode.py", line 36, in is_autocast_available
        return torch._C._is_autocast_available(device_type)
    TypeError: _is_autocast_available(): argument 'device_type' (position 1) must be str, not ConcreteAttrProxy

"RuntimeError: Broadcast generated files failed" when using ``run_mode='compile'``
----------------------------------------------------------------------------------

When using ``Trainer``'s ``run_mode='compile'`` option, ``broadcast_strategy`` must be set to ``'none'``.

How to fix:

.. code-block:: diff

     trainer_args = TrainerArgs(
         run_mode='compile',
         ...
    +    broadcast_strategy=('none' if run_mode=='compile' else 'all'),
     )

Example stacktrace: ::

    Traceback (most recent call last):
      File "model.py", line 63, in <module>
        trainer.run()
      File ".../nnscaler/cli/trainer.py", line 102, in run
        self._setup()
      File ".../nnscaler/cli/trainer.py", line 148, in _setup
        pmodel = parallelize_model(self.train_args, self.dummy_input, load_module=not compile_only)
      File ".../nnscaler/cli/mixed_module.py", line 281, in parallelize_model
        return _new_adapter().parallelize(dummy_input, load_module=load_module)
      File ".../nnscaler/cli/mixed_module.py", line 188, in parallelize
        pmodel_class = nnscaler.parallelize(
      File ".../nnscaler/parallel.py", line 1081, in parallelize
        raise RuntimeError("Broadcast generated files failed: torch.distributed is not initialized.")
    RuntimeError: Broadcast generated files failed: torch.distributed is not initialized.

Flash Attention Problems
========================

"NameError: name 'flash_attn' is not defined"
---------------------------------------------

When using flash attention, it must be registered with the ``register_op`` API.
Check :doc:`the llama 3 example ` for its usage.

Example stacktrace: ::

    Traceback (most recent call last):
      File "train.py", line 247, in <module>
        trainer.run()
      File ".../nnscaler/cli/trainer.py", line 98, in run
        self._train()
      File ".../nnscaler/cli/trainer.py", line 558, in _train
        self._train_epoch(epoch)
      File ".../nnscaler/cli/trainer.py", line 698, in _train_epoch
        losses = self.model.train_step(batches, is_dummy_batch)
      File ".../nnscaler/runtime/module.py", line 967, in train_step
        output = self._train_step(dataloader)
      File ".nnscaler/_parallel_modules/__main__/WrapperModel/_/gencode0.py", line 1228, in _train_step
        cross_entropy_1433, getitem_62_1431 = nnscaler.runtime.executor.fexecute('segment1977', model.segment1977, *(data_1780, ), requires_grad=True)
      File ".../nnscaler/runtime/executor.py", line 105, in fexecute
        outputs = subgraph(*input_dtensors)
      File ".nnscaler/_parallel_modules/__main__/WrapperModel/_/gencode0.py", line 452, in segment1977
        add_7_2220, add_7_2221 = ckpt.checkpoint(recompute, unsqueeze_1439, embedding_2130, embedding_2131, use_reentrant=False)
      File ".../site-packages/torch/_compile.py", line 24, in inner
        return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
      File ".../site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
        return fn(*args, **kwargs)
      File ".../site-packages/torch/_dynamo/external_utils.py", line 36, in inner
        return fn(*args, **kwargs)
      File ".../site-packages/torch/utils/checkpoint.py", line 494, in checkpoint
        ret = function(*args, **kwargs)
      File ".nnscaler/_parallel_modules/__main__/WrapperModel/_/gencode0.py", line 386, in recompute
        apply_1495 = flash_attn.flash_attn_interface.FlashAttnFunc.apply(transpose_4_1492, transpose_5_1493, transpose_6_1494, ifexpr_930, None, True, (-1, -1), 0.0, None, False, False)
    NameError: name 'flash_attn' is not defined

"ImportError" when using flash attention
----------------------------------------

This is likely an error in flash attention itself.
Please try the related import command outside nnScaler.
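
One way to run that check is a small standalone script. The following is a minimal sketch, not part of nnScaler: the helper name ``try_import`` is hypothetical, and the script assumes that the failing module is ``flash_attn``.

.. code-block:: python

    import importlib


    def try_import(module_name):
        """Import a module by name and return (ok, error_text)."""
        try:
            importlib.import_module(module_name)
        except Exception as exc:
            # Broken binary wheels usually raise ImportError, but catch
            # everything so the message is always reported.
            return False, f"{type(exc).__name__}: {exc}"
        return True, ""


    if __name__ == "__main__":
        # Assumption: flash_attn is the module that fails to import.
        ok, error = try_import("flash_attn")
        if ok:
            print("flash_attn import OK")
        else:
            print(f"flash_attn import failed: {error}")

Run it with the same Python interpreter and environment that you use for training, so that it picks up the same packages.
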
If it still fails, refer to `flash attention `_'s docs.

If your ``flash-attn`` package is installed from pip,
you can try a wheel from its `release page `_ that matches your environment more closely.

Example stacktrace: ::

    Traceback (most recent call last):
      File "train.py", line 9, in <module>
        from modeling_modifier import nnscaler_llama_init
      File "modeling_modifier.py", line 14, in <module>
        from transformers.models.llama.modeling_llama import LlamaAttention, LLAMA_ATTENTION_CLASSES, apply_rotary_pos_emb, LlamaRMSNorm
      File ".../site-packages/transformers/models/llama/modeling_llama.py", line 53, in <module>
        from flash_attn import flash_attn_func, flash_attn_varlen_func
      File ".../site-packages/flash_attn/__init__.py", line 3, in <module>
        from flash_attn.flash_attn_interface import (
      File ".../site-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
        import flash_attn_2_cuda as flash_attn_cuda
    ImportError: .../site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNK3c105Error4whatEv

Hugging Face Access
===================

"Access to model meta-llama/Meta-Llama-3-8B-Instruct is restricted. ... Please log in."
---------------------------------------------------------------------------------------

You need to request `Llama 3 access `_ on Hugging Face first.
Once you get access, generate your `Hugging Face token `_ and export it: ::

    export HF_TOKEN=hf_...

.. (FIXME: check it)

Alternatively, you can try replacing ``meta-llama/Meta-Llama-3-8B-Instruct`` with ``microsoft/Phi-3-mini-4k-instruct``.
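
Before starting a long run, it can help to confirm that the exported token is actually visible to Python. The following is a minimal sketch: the helper name ``hf_token_status`` is hypothetical, it assumes that tokens start with ``hf_`` as in the export line above, and it only inspects the environment without contacting Hugging Face.

.. code-block:: python

    import os


    def hf_token_status(env=None):
        """Describe the HF_TOKEN variable: 'missing', 'unexpected format' or 'set'."""
        env = os.environ if env is None else env
        token = env.get("HF_TOKEN", "")
        if not token:
            return "missing"
        # Assumption: user access tokens start with the "hf_" prefix.
        if not token.startswith("hf_"):
            return "unexpected format"
        return "set"


    if __name__ == "__main__":
        print(f"HF_TOKEN is {hf_token_status()}")

A result of ``missing`` usually means the variable was exported in a different shell than the one launching the training script.
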