
AssertionError for Phi-3.5-mini-instruct and Qwen2.5-7B-Instruct with NeMo + ThunderFX #1476

Open
mpatel31415 opened this issue Nov 26, 2024 · 6 comments · May be fixed by #1480
Labels
- mixology: Issues that the mixology team has surfaced
- nemo: Issues needed to support NVIDIA NeMo models
- thunderfx: for things that could be applicable to the dynamo+thunder frontend

Comments

@mpatel31415 (Contributor) commented Nov 26, 2024

🐛 Bug

When running Phi-3.5-mini-instruct and Qwen2.5-7B-Instruct with NeMo + ThunderFX we get the following error:

0:   File "/usr/lib/python3.10/copy.py", line 153, in deepcopy
0:     y = copier(memo)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 793, in __deepcopy__
0:     fake_mod = _CodeOnlyModule(copy.deepcopy(self.__dict__, memo))
0:   File "/usr/lib/python3.10/copy.py", line 146, in deepcopy
0:     y = copier(x, memo)
0:   File "/usr/lib/python3.10/copy.py", line 231, in _deepcopy_dict
0:     y[deepcopy(key, memo)] = deepcopy(value, memo)
0:   File "/usr/lib/python3.10/copy.py", line 153, in deepcopy
0:     y = copier(memo)
0:   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph.py", line 940, in __deepcopy__
0:     assert isinstance(output_vals, tuple)
0: torch._dynamo.exc.BackendCompilerFailed: backend='<thunder.dynamo.compiler.ThunderCompiler object at 0x7ffd30242dd0>' raised:
0: AssertionError:

(I'll attach a file with the full traceback.)

To Reproduce

The error is present on 1xH100.

Dockerfile used (I built it yesterday, and I'm not sure yet how nemo:dev images are versioned, so I can't provide an exact version):

FROM nvcr.io/nvidia/nemo:dev
ARG NVFUSER_REPO=git+https://github.com/NVIDIA/Fuser.git
ARG THUNDER_REPO=git+https://github.com/Lightning-AI/lightning-thunder.git

# Add cloned NeMo latest code
RUN git clone --recursive https://github.com/NVIDIA/NeMo.git /NeMo_cloned
RUN (cd /NeMo_cloned && python -m pip install .)


# Install requirements needed for NeMo, Thunder and nvFuser.
# We must install them in this convoluted way because otherwise Thunder is not
# updated and we would not be able to use the latest version.
RUN python -m pip install -r /NeMo_cloned/requirements/requirements_lightning.txt && \
    python -m pip install --upgrade ${NVFUSER_REPO}  && \
    python -m pip install --upgrade ${THUNDER_REPO} && \
    python -m pip install --upgrade --no-deps --force-reinstall ${NVFUSER_REPO} && \
    python -m pip install --upgrade --no-deps --force-reinstall ${THUNDER_REPO}
 
# Install Mixology requirements (this can be skipped, so I'm commenting it out)
# COPY requirements/mixology.txt mixology_requirements.txt
# RUN pip install --upgrade -r mixology_requirements.txt

Inside docker container please run:

model=microsoft/Phi-3.5-mini-instruct
# Download the model (you might need to set HF_TOKEN and agree to the model's terms of use on the website)
huggingface-cli download $model --local-dir checkpoints/$model --cache-dir checkpoints/$model 
# Run benchmark
python bench_targets/llm_peft/_nemo.py --model checkpoints/$model --mbs 1 --seq-length 2048 --jit-backend thunder

The script bench_targets/llm_peft/_nemo.py can be obtained from the internal GitLab repo akoumparouli/nemo_bench. You can contact me or @tfogal if you have any questions.

You can check that the same benchmark works with the eager backend:

python bench_targets/llm_peft/_nemo.py --model checkpoints/$model --mbs 1 --seq-length 2048 --jit-backend eager

Expected behavior

No error when running with the Thunder backend.

Environment

cc @tfogal

@mpatel31415 (Contributor, Author) commented:

Here is a text file with the full traceback: full_traceback.txt

@kiya00 (Collaborator) commented Nov 26, 2024

I think the cause is PR #1437: it relies on PyTorch's fix for pytorch/pytorch#139275, which is probably only in the PyTorch nightly builds.

@IvanYashchuk IvanYashchuk added nemo Issues needed to support NVIDIA NeMo models. mixology Issues that the mixology team has surfaced labels Nov 26, 2024
@IvanYashchuk (Collaborator) commented:

The error is fixed only with the latest PyTorch (Nov 1st+, pytorch/pytorch@0cf4cc3). What's the PyTorch version used in nvcr.io/nvidia/nemo:dev?

@tfogal tfogal added the thunderfx for things that could be applicable to the dynamo+thunder frontend label Nov 26, 2024
@tfogal (Collaborator) commented Nov 26, 2024

> I think the reason is this PR (#1437), it relies on PyTorch's bug fix pytorch/pytorch#139275, probably only in Torch nightly

The functionality added in #1437 is not (yet) a blocker for our Q4 goals. I recommend a workaround that simply disables the functionality when/if PyTorch is too old.

> The error is fixed only with the latest PyTorch (Nov 1st+, pytorch/pytorch@0cf4cc3). What's the PyTorch version used in nvcr.io/nvidia/nemo:dev?

It is old: 2.4.0a0+3bcc3cddb5.nv24.07.

@IvanYashchuk (Collaborator) commented:

> I recommend a workaround that simply disables the functionality when/if PyTorch is too old.

Sure, if we need to make it work for the older PyTorch we can do that.
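As a hedged sketch of such gating (the helper name and the version threshold are assumptions for illustration, not Thunder's actual code), the split_module-dependent path could be disabled whenever the installed PyTorch predates the fix:

```python
import re

# Assumed threshold: the fix for pytorch/pytorch#139275 landed in
# nightlies after Nov 1st 2024, i.e. in 2.6 development builds.
_MIN_FIXED = (2, 6, 0)

def has_fx_split_fix(torch_version: str) -> bool:
    """Return True if `torch_version` is new enough to include the fix.

    Handles vendor/local suffixes like '2.4.0a0+3bcc3cddb5.nv24.07' by
    comparing only the leading major.minor.patch triple.
    """
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", torch_version)
    if m is None:
        return False  # unrecognized version string: be conservative
    return tuple(int(x) for x in m.groups()) >= _MIN_FIXED
```

At startup one would call `has_fx_split_fix(torch.__version__)` and fall back to the pre-#1437 behavior when it returns False.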

A workaround could be to iterate over all submodules returned by split_module, used here

# `split_module` iterates over nodes and determines the partition to place them in based on the callback.
original_split_gm: torch.fx.GraphModule = split_module(
    gm, root_m=None, split_callback=callback, keep_original_order=True, keep_original_node_name=True
)

and add an output node to all submodules that are missing one. @kshitij12345, does this sound like a correct workaround?
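A minimal sketch of that idea (the function name and structure are illustrative, not the actual fix that landed in #1480): after splitting, append an empty-tuple output to any fx submodule that lacks an `output` node, so that `GraphModule.__deepcopy__`'s tuple assertion holds on older PyTorch.

```python
import torch
import torch.fx

def ensure_output_nodes(split_gm: torch.nn.Module) -> None:
    """Add a trivial output node to every fx submodule missing one."""
    for submodule in split_gm.children():
        if not isinstance(submodule, torch.fx.GraphModule):
            continue
        if any(node.op == "output" for node in submodule.graph.nodes):
            continue  # already has an output node, nothing to do
        # No output node: emit `return ()` so deepcopy of the graph sees
        # a tuple of output values, satisfying the assertion in
        # torch/fx/graph.py.
        submodule.graph.output(())
        submodule.recompile()
```

This would be called on the result of `split_module` before the graph module is deep-copied.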

@kshitij12345 (Collaborator) commented:

Yes, I think that should work.
