
Enable support for Intel XPU devices (AKA Intel GPUs) #19443

Draft · wants to merge 40 commits into base: master
Conversation

@coreyjadams commented Feb 9, 2024

What does this PR do?

This PR extends pytorch_lightning with support for Intel GPUs, as enabled by intel_extension_for_pytorch. With Intel's module, pytorch gains the torch.xpu module, which is equivalent to torch.cuda.

Throughout the pytorch_lightning repository, in places where cuda is mentioned explicitly, I tried to include equivalent functionality for xpu. In some cases I declined to extend support to xpu where I was not sure it would work or be worth it: for example, BitsAndBytes, which I know very little about, so I did not add xpu support there. The main enablements are the XPUAccelerator and the logic to manage xpus in pytorch DDP.
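
To give a sense of the accelerator side, here is a rough sketch of what an XPU accelerator can look like. It follows the shape of the existing CUDAAccelerator interface; it is not the code in this PR, the torch.xpu calls assume intel_extension_for_pytorch has been imported, and the "xpu" registration name below is an assumption.

```python
from typing import Any, Dict, List, Union

import torch
from lightning.pytorch.accelerators.accelerator import Accelerator


class XPUAcceleratorSketch(Accelerator):
    """Illustrative only; method names mirror the CUDA accelerator."""

    def setup_device(self, device: torch.device) -> None:
        torch.xpu.set_device(device)

    def teardown(self) -> None:
        torch.xpu.empty_cache()

    def get_device_stats(self, device: Union[torch.device, str, int]) -> Dict[str, Any]:
        # ipex mirrors torch.cuda.memory_stats under torch.xpu
        return torch.xpu.memory_stats(device)

    @staticmethod
    def parse_devices(devices: Union[int, str, List[int]]) -> Any:
        # Naive for the sketch; a real accelerator would validate device indices.
        return devices

    @staticmethod
    def get_parallel_devices(devices: List[int]) -> List[torch.device]:
        return [torch.device("xpu", i) for i in devices]

    @staticmethod
    def auto_device_count() -> int:
        return torch.xpu.device_count()

    @staticmethod
    def is_available() -> bool:
        return hasattr(torch, "xpu") and torch.xpu.is_available()
```

With such an accelerator registered under the name "xpu" (again, an assumption about the final registration), a user would select it with Trainer(accelerator="xpu", devices=2).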

In the distributed case, Intel provides the ccl backend for collective communications instead of nccl. There is a known bug I encountered when testing: calling torch.distributed.broadcast with a list of strings induces a hang. I currently wrap that call with an explicit check against this case, which isn't ideal, but it does enable DDP on XPUs.
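
For reference, here is a minimal sketch of standing up a process group on the ccl backend outside of Lightning. This is not code from this PR; it assumes intel_extension_for_pytorch and oneccl_bindings_for_pytorch are installed, and the master address/port values are placeholders.

```python
import os

import torch
import torch.distributed as dist

# torch.xpu only appears after this import.
import intel_extension_for_pytorch  # noqa: F401
# The "ccl" backend is registered with torch.distributed only after this import.
import oneccl_bindings_for_pytorch  # noqa: F401


def init_ddp_on_xpu(rank: int, world_size: int) -> torch.device:
    """Minimal process-group setup for Intel GPUs via oneCCL (illustrative only)."""
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)
    device = torch.device("xpu", rank)
    torch.xpu.set_device(device)
    return device
```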

Both xpu and ccl are currently extensions to pytorch and must be loaded dynamically: torch.xpu becomes available after import intel_extension_for_pytorch, and the ccl backend for torch.distributed becomes available after import oneccl_bindings_for_pytorch. Because of this, I have in many places done one of the following:

  • In locations where I'm mostly sure xpu is initialized, I use it freely.
  • When initializing torch.distributed, the target backend is known at that point, so I intercept the call and ensure the oneccl bindings are loaded.
  • If I want to use torch.xpu and can't be sure it's available, I have included logic analogous to cuda: instead of if torch.cuda.is_available(): ... I do if hasattr(torch, "xpu") and torch.xpu.is_available(): ... (see the sketch right after this list).
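
For example, that last availability check looks roughly like this (a sketch of the pattern, not the exact code in the PR):

```python
import torch


def _xpu_available() -> bool:
    # torch.xpu only exists once intel_extension_for_pytorch has been imported,
    # so guard the attribute before calling into it.
    return hasattr(torch, "xpu") and torch.xpu.is_available()


# Analogous to the existing CUDA pattern:
if torch.cuda.is_available():
    device = torch.device("cuda", torch.cuda.current_device())
elif _xpu_available():
    device = torch.device("xpu", torch.xpu.current_device())
else:
    device = torch.device("cpu")
```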

This PR was not intended to introduce any breaking changes.

I think this PR needs some discussion before we even ask "should it be merged":

  • I don't have any XPU tests included. I don't know whether you have hardware available to test on, and while I'm happy to run tests case by case, I can't offer access to XPU hardware myself.
  • I am not sure what tests DO run. I'm expecting this PR to trigger your automatic test suite and I'll find out what, if anything, I've broken :).
  • I haven't updated anything in the CHANGELOG. I'd like to understand where the tests stand before doing so.

📚 Documentation preview 📚: https://pytorch-lightning--19443.org.readthedocs.build/en/19443/

@github-actions bot added the labels fabric (lightning.fabric.Fabric), pl (Generic label for PyTorch Lightning package), and data (external) (litdata package) on Feb 9, 2024
@coreyjadams marked this pull request as draft on February 9, 2024 22:35
@abhilash1910

Hi @coreyjadams, there is a long-standing PR for XPU support from us, #17700, which we are planning to integrate soon. We are already in discussions regarding this and would appreciate it if you could use that branch for the time being until it gets merged. Please also feel free to set up an offline discussion with us (I work with Venkat/Sam and others on LLMs from Intel).

@hipudding mentioned this pull request on Feb 20, 2024
@ronghongbo commented Oct 8, 2024

Hello, could you provide at least one simple example with distributed training on Intel GPUs? I have such hardware and would like to try this PR. Thanks!

@uniartisan

> Hello, could you provide at least one simple example with distributed training on Intel GPUs? I have such hardware and would like to try this PR. Thanks!

You can follow the conversation in #20349, which includes an RWKV example.
