
Support for multi-gpu private fine-tuning #32

Open
Pier297 opened this issue Jan 10, 2023 · 2 comments

Comments

Pier297 commented Jan 10, 2023

Hi all,

I wanted to try adding support for multi-GPU training to enable fine-tuning of LLMs. I opened an issue about this a few weeks ago; thanks a lot for the fast response :)

I was trying to understand how we could use DeepSpeed (specifically ZeRO stage 3), and I saw that in your library the gradient is computed here by calling backward.

In DeepSpeed the backward pass is handled by the DeepSpeedEngine here, but if I'm not mistaken it's no different from calling backward as you do; what changes is the model parameter update done by the step function.
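
For context, the standard DeepSpeed training loop (from its getting-started docs) looks roughly like the sketch below; ds_config and the forward/loss computation are placeholders:

    import deepspeed

    # deepspeed.initialize wraps the model in a DeepSpeedEngine that owns
    # backward() and step(); ds_config is a placeholder config dict or file path.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model, model_parameters=model.parameters(), config=ds_config
    )

    loss = model_engine(batch)     # forward pass (depends on the model's signature)
    model_engine.backward(loss)    # backward handled by the engine
    model_engine.step()            # parameter update (this is where ZeRO differs)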

More or less, my idea would be the following (a rough code sketch follows the list):

  1. As in data parallelism, each GPU computes the loss on its own micro-batch (would this also be combined with model parallelism by DeepSpeed ZeRO 3? I haven't fully understood this part).
  2. Each GPU then calls privacy_engine.virtual_step(micro_batch_loss), which in turn calls _accumulate_summed_grad to compute the summed gradient for that micro-batch.
  3. We then synchronize the gradients by summing them across the GPUs (note: this has to sum param.summed_grad).
  4. We can now call privacy_engine._create_noisy_clipped_gradient() to privatize the gradient.
  5. Perform optimizer.step as usual.
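
In (very rough) code, steps 2-5 might look like this; the privacy_engine calls follow the names above, but I'm not sure about the exact signatures, so treat it purely as a sketch:

    import torch.distributed as dist

    # Steps 1-2: each rank computes the loss on its own micro-batch and
    # accumulates the summed gradient locally (exact call signature may differ).
    loss = compute_loss(model, micro_batch)   # placeholder loss computation
    privacy_engine.virtual_step(loss)

    # Step 3: synchronize by summing param.summed_grad across all ranks.
    for param in model.parameters():
        if getattr(param, "summed_grad", None) is not None:
            dist.all_reduce(param.summed_grad, op=dist.ReduceOp.SUM)

    # Steps 4-5: privatize the summed gradient and take an optimizer step.
    # One subtlety: the noise must end up identical on every rank (e.g. add it
    # on rank 0 and broadcast, or seed the noise RNG identically), otherwise
    # the model replicas drift apart.
    privacy_engine._create_noisy_clipped_gradient()
    optimizer.step()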

I'm not really an expert with DeepSpeed, so I don't know whether this would be the correct solution; any suggestions you could give would be much appreciated :)

If you prefer, you can contact me via email at: [email protected]
Thanks a lot!

lxuechen (Owner) commented Jan 10, 2023

Hi,

Thanks for following up. Yeah, the engineering specifics are perhaps somewhat hairy, so I'll mostly comment on the high-level ideas for now.

If your goal is to fine-tune a large model that's still small enough to fit on a single GPU, I think plain data parallelism is sufficient. FWIW, it's also simpler and doesn't require dealing with the complexity of model/pipeline/fully sharded parallelism.
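
For what it's worth, a minimal plain data-parallel setup in PyTorch looks roughly like the sketch below (one process per GPU, launched e.g. with torchrun; build_model is a placeholder, and how the privacy engine interacts with the DDP wrapper is exactly what would still need to be worked out):

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # One process per GPU; torchrun sets LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().to(local_rank)         # build_model is a placeholder
    model = DDP(model, device_ids=[local_rank])  # plain data parallelism
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Note: DDP all-reduces param.grad automatically during backward; whether
    # that plays well with the engine's per-parameter bookkeeping (e.g.
    # param.summed_grad) is part of the open question in this thread.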

If your model really can't fit on a single GPU, then you'd strictly need some of the features in DeepSpeed. But as an alternative to DeepSpeed, I think FSDP might be a better option. I've personally given it a try with DP, and it seems workable.

FSDP essentially enables optimizer and weight sharding, so you'll be able to optimize models that can't fit on a single accelerator. The central ideas of FSDP and DeepSpeed are pretty much the same.

Pier297 (Author) commented Jan 11, 2023

Hi and thank you again for your help!

I tried FSDP, since my model doesn't fit on a single GPU, but I'm not sure how to proceed: when I call privacy_engine.attach(optimizer) I get the following error:

ValueError: Model type <class 'torch.distributed.fsdp.fully_sharded_data_parallel.FullyShardedDataParallel'> is not supported

which is more or less the same error I got when I tried Opacus with DeepSpeed.

If it helps, the model I'm testing with is GPT-2, and my code is more or less this:

    import functools
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload
    from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

    my_auto_wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=20000)
    model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy, cpu_offload=CPUOffload(offload_params=True))
    privacy_engine = PrivacyEngine(model, ...)
    privacy_engine.attach(optimizer)

Sorry for taking up your time, and thanks a lot for any help you can give :)
