
add support for CPU and MPS #188

Open
cdoern wants to merge 1 commit into main from cpu-mps
Conversation

@cdoern (Contributor) commented Aug 28, 2024

Do not use distributed training when it is not available; instead, use CPU or MPS.

This entails a few changes:

  • --device is now a valid flag to the library, since `ilab` can pass CPU, MPS, or default to CUDA
  • when using CPU or MPS, do not initialize DeepSpeed; instead, put the model on the device and initialize the `Adafactor` optimizer, which is more resource-efficient than an Adam-based one (see the sketch after this list)
  • inside of `train`, add logic keyed on `torch.cuda.is_available()` and `torch.distributed.is_initialized()`; we don't use distributed torch on consumer systems
  • the train loop needs some custom step and loss logic for a `LlamaForCausalLM` model; add that in
  • when using CPU or MPS, we are always `world_size == 1` and `local_rank == 0`
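
For reference, a rough sketch of the non-distributed path described above; the function names and the exact Adafactor arguments are illustrative assumptions, not the library's actual API:

```python
import torch
import torch.distributed
from transformers import AutoModelForCausalLM
from transformers.optimization import Adafactor


def use_distributed() -> bool:
    # Only take the distributed path when CUDA is present and the process
    # group has been initialized; on CPU/MPS we effectively run with
    # world_size == 1 and local_rank == 0.
    return (
        torch.cuda.is_available()
        and torch.distributed.is_available()
        and torch.distributed.is_initialized()
    )


def setup_single_device(model_path: str, device_str: str = "cpu"):
    # Single-process setup: no DeepSpeed, model moved to the requested
    # device (e.g. "cpu" or "mps"), Adafactor instead of an Adam-based
    # optimizer to keep memory usage down.
    device = torch.device(device_str)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    model = model.to(device)
    optimizer = Adafactor(
        model.parameters(),
        lr=1e-5,
        scale_parameter=False,
        relative_step=False,
        warmup_init=False,
    )
    return model, optimizer, device
```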

@RobotSail (Member)

Thank you for this PR; we'll just need to test it to make sure it works on all systems.

@cdoern cdoern force-pushed the cpu-mps branch 3 times, most recently from 2cc75cf to 1f76014 on August 29, 2024
@cdoern cdoern force-pushed the cpu-mps branch 2 times, most recently from dbfb5ea to c17c028 on August 29, 2024
@cdoern cdoern force-pushed the cpu-mps branch 2 times, most recently from f7d33d3 to 4b1a528 on August 29, 2024
@cdoern cdoern force-pushed the cpu-mps branch 2 times, most recently from b1ba779 to c7f7682 on August 29, 2024
@mergify mergify bot repeatedly added and removed the ci-failure label as CI ran, Aug 28–29, 2024
@nathan-weinberg nathan-weinberg linked an issue Aug 30, 2024 that may be closed by this pull request
mergify bot commented Aug 30, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@cdoern (Contributor, Author) commented Sep 3, 2024

I have tested this on

  • a Linux CPU-only laptop
  • an M3 iMac

MPS is slower than CPU at training, but this is a known issue that requires some custom classes for loss.backward() and optimizer.step(). I think this is a good first implementation that can at least get CPU and MPS support on the books.

@mergify mergify bot added the ci-failure label Sep 3, 2024
@cdoern (Contributor, Author) commented Sep 4, 2024

e2e is broken because `ilab` doesn't have the changes needed to use this new CPU-capable training method.

However, there's no way for me to fix this upstream first, because then the `ilab` CI will break: it won't recognize the training changes, since they don't exist yet.

This seems like a pretty circular dependency; I will try to find a way around it.

cdoern added a commit: "do not use distributed when not available, instead use CPU or MPS."

Signed-off-by: Charlie Doern <[email protected]>
@JamesKunstle (Contributor)

This PR is very valuable. It proves that the existing training loop is a minor modification away from being accelerator-agnostic. However, there are some things that make merging it challenging:

  1. The CUDA and distributed cases are woven together. The vector of acceptable target hardware is:
    [CPU, MPS, Habana, ROCm, CUDA, Habana+Distributed, ROCm+Distributed, CUDA+Distributed].
  2. MPS doesn't support quantized training down this route because those layers don't exist in native PyTorch. That makes this a feature regression.
  3. Each of the setup components and the training loop becomes "braided" because we have to handle the two addressed hardware classes separately.

These criticisms shouldn't be taken as weighing more heavily than the praise for getting this working. Ultimately, I don't think we should merge this now. We're going to be working toward support for two new classes of accelerator, using a new distributed training framework. I think that work will be easier to accomplish if we don't have the outlier CPU and MPS hardware to worry about.

That being said, I don't want to discard this work at all. I'd advocate for it to be "frozen" for a bit, until we have more control over the training strategy for the "harder" cases, before we adopt these simpler cases.

from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
import numpy as np
import torch
import torch.distributed
A Contributor commented on this diff:

I'd recommend dropping this refactor; `torch.distributed as dist` is common convention. I would keep `dist.get_rank` and `dist.is_initialized` though; that's clearer.
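
As a minimal illustration of the convention being suggested (not code from this PR):

```python
import torch.distributed as dist

# Conventional alias; initialization and rank checks stay short and readable.
if dist.is_available() and dist.is_initialized():
    rank = dist.get_rank()
else:
    rank = 0
```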

# also, initialize Adafactor, a Transformers Optimizer designed to use less resources.
# if we use AdamW here most people will always run out of RAM
model = model.to(device)
optimizer = Adafactor(
A Contributor commented on this diff:

We probably shouldn't switch optimizers blindly. AdamW is popular because it's more effective, with its own trade-offs. Switching to Adafactor would require some performance experiments.
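
One way to keep both options available for such experiments, as a hedged sketch (the `use_adafactor` flag and the learning rate here are hypothetical, not part of this PR):

```python
import torch
from transformers.optimization import Adafactor


def build_optimizer(model, use_adafactor: bool, lr: float = 1e-5):
    # Keep AdamW as the default and make Adafactor opt-in, so the two can
    # be benchmarked against each other before any switch is committed to.
    if use_adafactor:
        return Adafactor(
            model.parameters(),
            lr=lr,
            scale_parameter=False,
            relative_step=False,
            warmup_init=False,
        )
    return torch.optim.AdamW(model.parameters(), lr=lr)
```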

mergify bot commented Sep 26, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @cdoern please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 26, 2024
@JamesKunstle (Contributor)

@cdoern since you've merged CPU + MPS training in instructlab/instructlab, should this stay open?

Successfully merging this pull request may close these issues.

Add CPU-Only Support to the training library
3 participants