add support for CPU and MPS #188
base: main
Conversation
Thank you for this PR, we'll just need to test this to make sure it works with all systems.
This pull request has merge conflicts that must be resolved before it can be merged.
I have tested this on
MPS is slower than CPU at training, but this is a known issue that requires some custom classes for
e2e is broken. However, there's no way for me to fix this upstream first, because then the ilab CI will break due to not recognizing the training changes, since they don't exist yet. This seems like a pretty circular dependency; I will try to find a way around it.
do not use distributed when not available, instead use CPU or MPS. This entails a few changes:

- `--device` is now a valid flag to the library, since `ilab` can pass CPU, MPS, or default to cuda
- when using CPU or MPS, do not initialize DS; instead put the model on the device and initialize the `Adafactor` optimizer, which is more memory-efficient than an Adam-based one, inside of `train`
- add logic so that if `torch.cuda.is_available()` and `torch.distributed.is_initialized()` are not both true, we don't use distributed torch on consumer systems
- the train loop needs some custom step and loss logic for a `LlamaForCausalLM` model; add that in
- when using CPU or MPS we are always `world_size == 1` and `local_rank == 0`

Signed-off-by: Charlie Doern <[email protected]>
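A rough sketch of what that fallback could look like, assuming nothing about the PR's actual diff: the helper name `setup_model_and_optimizer`, the learning rate, and the exact `--device` handling here are illustrative, not the library's API.

```python
import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM
from transformers.optimization import Adafactor


def setup_model_and_optimizer(model_name: str, device_arg: str = "cuda"):
    # Resolve the requested device, falling back to CPU when CUDA/MPS are unavailable.
    if device_arg == "cuda" and torch.cuda.is_available():
        device = torch.device("cuda")
    elif device_arg == "mps" and torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    model = AutoModelForCausalLM.from_pretrained(model_name)

    if torch.cuda.is_available() and dist.is_initialized():
        # Distributed path: ranks come from the process group; the distributed
        # engine would own the optimizer in this branch.
        world_size, local_rank = dist.get_world_size(), dist.get_rank()
        optimizer = None
    else:
        # Consumer systems: single process, so world_size == 1 and local_rank == 0.
        # Put the model on the device directly and use Adafactor to keep RAM usage down.
        world_size, local_rank = 1, 0
        model = model.to(device)
        optimizer = Adafactor(
            model.parameters(), lr=1e-5, scale_parameter=False, relative_step=False
        )

    return model, optimizer, device, world_size, local_rank
```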
This PR is very valuable. It proves that the existing training loop is only minor modifications away from being accelerator-agnostic. However, there are some things that make merging it challenging:
These criticisms shouldn't be taken as heavier-weight than the praise for getting this working. Ultimately, I don't think that we should merge this now. We're going to be working toward support for two new classes of accelerator, using a new distributed training framework. I think that work will be easier to accomplish if we don't have the outlier CPU and MPS hardware to worry about. That being said, I don't want to discard this work at all. I'd advocate for it to be "frozen" for a bit until we have more control over the training strategy for the "harder" cases before we adopt these simplest cases.
```python
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
import numpy as np
import torch
import torch.distributed
```
I'd recommend dropping this refactor; `import torch.distributed as dist` is the common convention. I would keep `dist.get_rank` and `dist.is_initialized` though, that's clearer.
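For reference, a minimal example of the aliased-import convention being suggested, with rank lookups guarded so the same code also works on the single-process CPU/MPS path; the helper `get_rank_and_world_size` is hypothetical, not part of the PR.

```python
import torch.distributed as dist  # conventional alias


def get_rank_and_world_size() -> tuple[int, int]:
    # Fall back to rank 0 / world size 1 when no process group was initialized,
    # which is the case for the non-distributed CPU and MPS runs.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank(), dist.get_world_size()
    return 0, 1
```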
```python
# also, initialize Adafactor, a Transformers optimizer designed to use fewer resources.
# if we use AdamW here most people will always run out of RAM
model = model.to(device)
optimizer = Adafactor(
```
We probably shouldn't switch optimizers blindly. AdamW is popular because it's generally more effective, at the cost of more optimizer-state memory. Switching to Adafactor would require some performance experiments.
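One way such an experiment could be scoped is to keep AdamW on CUDA and only fall back to Adafactor on memory-constrained devices. This is a sketch only; `make_optimizer` and the default learning rate are illustrative assumptions, not the PR's code.

```python
import torch
from transformers.optimization import Adafactor


def make_optimizer(model: torch.nn.Module, device: torch.device, lr: float = 1e-5):
    if device.type in ("cpu", "mps"):
        # Adafactor keeps factored second-moment estimates, so its optimizer state
        # is much smaller than AdamW's two full-sized moment tensors per parameter.
        return Adafactor(
            model.parameters(), lr=lr, scale_parameter=False, relative_step=False
        )
    # Keep the better-studied AdamW default where memory is less of a constraint.
    return torch.optim.AdamW(model.parameters(), lr=lr)
```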
This pull request has merge conflicts that must be resolved before it can be merged.
@cdoern since you've merged CPU + MPS training in