Add convert_module to FSDP #20323

Open: wants to merge 3 commits into base: master

tshu-w (Contributor) commented Oct 6, 2024

What does this PR do?

Adds convert_module for FSDP, as is already done for DeepSpeed.

Fixes #19721 (comment)
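
For readers unfamiliar with the hook, here is a minimal sketch of what a convert_module step could look like. It is not the code in this PR: the class name FSDPPrecisionSketch, the _DTYPES mapping, and the rule that only "true" precision settings cast the weights are assumptions made for illustration.

```python
import torch
from torch import nn


class FSDPPrecisionSketch:
    """Hypothetical stand-in for an FSDP precision plugin (not Lightning's class)."""

    # Assumed mapping from precision strings to dtypes, for this sketch only.
    _DTYPES = {
        "32-true": torch.float32,
        "16-true": torch.float16,
        "bf16-true": torch.bfloat16,
    }

    def __init__(self, precision: str) -> None:
        self.precision = precision

    def convert_module(self, module: nn.Module) -> nn.Module:
        # Assumption: only "true" (non-mixed) precision casts the weights;
        # mixed-precision settings leave the parameters untouched.
        if "true" in self.precision:
            return module.to(dtype=self._DTYPES[self.precision])
        return module


# Usage: a "true" setting casts the layer, a "mixed" setting does not.
layer = FSDPPrecisionSketch("bf16-true").convert_module(nn.Linear(8, 8))
mixed = FSDPPrecisionSketch("bf16-mixed").convert_module(nn.Linear(8, 8))
```

With a cast like this, a model that was instantiated in float32 ends up holding its parameters in the requested dtype before FSDP wraps it, which is the effect the linked comment is about.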

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20323.org.readthedocs.build/en/20323/

github-actions bot added the fabric (lightning.fabric.Fabric) and pl (Generic label for PyTorch Lightning package) labels Oct 6, 2024

codecov bot commented Oct 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88%. Comparing base (87565cb) to head (51f2b86).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #20323   +/-   ##
=======================================
  Coverage      88%      88%           
=======================================
  Files         267      267           
  Lines       23274    23285   +11     
=======================================
+ Hits        20381    20392   +11     
  Misses       2893     2893           

tshu-w force-pushed the FSDP branch 2 times, most recently from 7a0c355 to baeb535 on October 7, 2024 08:31
lantiga (Collaborator) commented Oct 7, 2024

Thank you @tshu-w!
Looks good in general. FSDP relies on contexts, but these may not apply cleanly when recomputations are involved.

As a sanity check, can you verify that the issues in #19721 are resolved? (i.e. memory goes back to what PyTorch uses, and no inconsistency errors are produced - these may be good tests to add btw, or at least a scaled-down version thereof).

I'll be happy to run things on my end and dig deeper in parallel.
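
A scaled-down version of such a check might look like the sketch below. It only counts parameter bytes on CPU, so it is not a substitute for comparing peak GPU memory against plain PyTorch FSDP; the helper name parameter_bytes and the toy model are assumptions, and the plain .to(dtype=...) call stands in for whatever conversion the strategy performs.

```python
import torch
from torch import nn


def parameter_bytes(module: nn.Module) -> int:
    """Total number of bytes occupied by the module's parameters."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


# Toy model, created in the default float32 dtype.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 16))
fp32_bytes = parameter_bytes(model)

# Stand-in for the conversion step: cast every parameter to bfloat16.
model = model.to(dtype=torch.bfloat16)
bf16_bytes = parameter_bytes(model)

print(f"float32: {fp32_bytes} bytes, bfloat16: {bf16_bytes} bytes")
```

A real regression test would additionally run a few training steps under FSDP and compare the measured peak memory with an equivalent pure PyTorch setup.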

tshu-w (Contributor, Author) commented Oct 7, 2024

I did notice a decrease in VRAM usage (which I will confirm again in the coming week), even when I initialize the LLM in configure_model as follows. However, I cannot guarantee that this PR resolves the original issue, because the author of that issue manually set the LLM's torch_dtype to torch.bfloat16. Nevertheless, I believe it may solve part of the problem.

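# Imports this method relies on (it is a method of a LightningModule subclass).
import torch
from peft import get_peft_config, get_peft_model
from transformers import AutoModelForCausalLM

def configure_model(self):
    # configure_model may be called more than once; build the model only once.
    if self.model is not None:
        return

    # No torch_dtype is passed here, so the dtype is left to the library default.
    self.model = AutoModelForCausalLM.from_pretrained(self.model_name_or_path)
    # suppress the open-end generation warning
    self.model.generation_config.pad_token_id = (
        self.model.generation_config.pad_token_id
        or self.model.generation_config.eos_token_id
    )

    # Optionally wrap the base model with PEFT adapters.
    if self.hparams.peft_config:
        peft_config = get_peft_config(self.hparams.peft_config)
        self.model = get_peft_model(self.model, peft_config)

    # Fall back to one of the module's own chat templates if the tokenizer has none.
    if self.tokenizer.chat_template is None:
        self.tokenizer.chat_template = (
            self.chatml_template
            if self.hparams.use_chatml_template
            else self.base_template
        )
        if self.hparams.use_chatml_template:
            self.tokenizer.add_tokens(
                ["<|im_start|>", "<|im_end|>"], special_tokens=True
            )
            self.model.resize_token_embeddings(len(self.tokenizer))

    # Optionally restore weights from a Lightning checkpoint.
    if self.hparams.ckpt_path:
        checkpoint = torch.load(self.hparams.ckpt_path, weights_only=True)
        self.load_state_dict(checkpoint["state_dict"])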

lantiga (Collaborator) commented Nov 12, 2024

hey @tshu-w did you end up digging further?

lantiga added the waiting on author (Waiting on user action, correction, or update) label Nov 12, 2024
lantiga (Collaborator) commented Nov 26, 2024

Checking memory gains on my end

Labels
fabric (lightning.fabric.Fabric), pl (Generic label for PyTorch Lightning package), waiting on author (Waiting on user action, correction, or update)
Development

Successfully merging this pull request may close these issues.

PyTorch Lightning FSDP takes more memory than PyTorch FSDP
2 participants