add support for CPU and MPS #188
base: main
Conversation
Thank you for this PR, we'll just need to test this to make sure it works with all systems.
This pull request has merge conflicts that must be resolved before it can be merged.
I have tested this on
MPS is slower than CPU at training, but this is a known issue that requires some custom classes for
e2e is broken. However, there's no way for me to fix this upstream first, because then the ilab CI will break due to not recognizing the training changes, since they don't exist yet. This seems like a pretty circular dependency; I will try to find a way around it.
do not use distributed when not available, instead use CPU or MPS. This entails a few changes:

- `--device` is now a valid flag to the library, since `ilab` can pass CPU, MPS, or default to cuda
- when using CPU or MPS, do not initialize DS; instead put the model on the device and initialize the `Adafactor` optimizer, which is more memory-efficient than an Adam-based one, inside of `train`
- add logic so that if `torch.cuda.is_available()` and `torch.distributed.is_initialized()` are not both true, we don't use distributed torch on consumer systems
- the train loop needs some custom step and loss logic for a `LlamaForCausalLM` model; add that in
- when using CPU or MPS we are always `world_size == 1` and `local_rank == 0`

Signed-off-by: Charlie Doern <[email protected]>
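A rough sketch of what that fallback could look like, assuming nothing about the PR's actual diff: the helper name `setup_model_and_optimizer`, the learning rate, and the exact `--device` handling here are illustrative, not the library's API.

```python
import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM
from transformers.optimization import Adafactor


def setup_model_and_optimizer(model_name: str, device_arg: str = "cuda"):
    # Resolve the requested device, falling back to CPU when CUDA/MPS are unavailable.
    if device_arg == "cuda" and torch.cuda.is_available():
        device = torch.device("cuda")
    elif device_arg == "mps" and torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    model = AutoModelForCausalLM.from_pretrained(model_name)

    if torch.cuda.is_available() and dist.is_initialized():
        # Distributed path: ranks come from the process group; the distributed
        # engine would own the optimizer in this branch.
        world_size, local_rank = dist.get_world_size(), dist.get_rank()
        optimizer = None
    else:
        # Consumer systems: single process, so world_size == 1 and local_rank == 0.
        # Put the model on the device directly and use Adafactor to keep RAM usage down.
        world_size, local_rank = 1, 0
        model = model.to(device)
        optimizer = Adafactor(
            model.parameters(), lr=1e-5, scale_parameter=False, relative_step=False
        )

    return model, optimizer, device, world_size, local_rank
```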
This PR is very valuable. It proves that the existing training loop is only minor modifications away from being accelerator-agnostic. However, there are some things that make merging it challenging:
These criticisms shouldn't be taken as heavier-weight than the praise for getting this working. Ultimately, I don't think that we should merge this now. We're going to be working toward support for two new classes of accelerator, using a new distributed training framework. I think that work will be easier to accomplish if we don't have the outlier CPU and MPS hardware to worry about. That being said, I don't want to discard this work at all. I'd advocate for it to be "frozen" for a bit until we have more control over the training strategy for the "harder" cases before we adopt these simplest cases.
```python
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
import numpy as np
import torch
import torch.distributed
```
I'd recommend dropping this refactor; `import torch.distributed as dist` is the common convention. I would keep `dist.get_rank` and `dist.is_initialized` though, that's clearer.
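For reference, a minimal example of the aliased-import convention being suggested, with rank lookups guarded so the same code also works on the single-process CPU/MPS path; the helper `get_rank_and_world_size` is hypothetical, not part of the PR.

```python
import torch.distributed as dist  # conventional alias


def get_rank_and_world_size() -> tuple[int, int]:
    # Fall back to rank 0 / world size 1 when no process group was initialized,
    # which is the case for the non-distributed CPU and MPS runs.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank(), dist.get_world_size()
    return 0, 1
```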
```python
# also, initialize Adafactor, a Transformers optimizer designed to use fewer resources.
# if we use AdamW here most people will always run out of RAM
model = model.to(device)
optimizer = Adafactor(
```
We probably shouldn't switch optimizers blindly. AdamW is popular because it's generally more effective, at the cost of more optimizer-state memory. Switching to Adafactor would require some performance experiments.
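One way such an experiment could be scoped is to keep AdamW on CUDA and only fall back to Adafactor on memory-constrained devices. This is a sketch only; `make_optimizer` and the default learning rate are illustrative assumptions, not the PR's code.

```python
import torch
from transformers.optimization import Adafactor


def make_optimizer(model: torch.nn.Module, device: torch.device, lr: float = 1e-5):
    if device.type in ("cpu", "mps"):
        # Adafactor keeps factored second-moment estimates, so its optimizer state
        # is much smaller than AdamW's two full-sized moment tensors per parameter.
        return Adafactor(
            model.parameters(), lr=lr, scale_parameter=False, relative_step=False
        )
    # Keep the better-studied AdamW default where memory is less of a constraint.
    return torch.optim.AdamW(model.parameters(), lr=lr)
```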
This pull request has merge conflicts that must be resolved before it can be merged.
@cdoern since you've merged CPU + MPS training in