Enable support for Intel XPU devices (AKA Intel GPUs) #19443
base: master
Conversation
…ed a bit, mpi environment seems to be broken
…broadcasting strings isn't working. This commit includes a workaround for that case.
Synchronize xpu devices
Add xpu warning
Include XPU in on-gpu check.
Include XPU in map location
…rride decorator in line with other accelerators.
Hi @coreyjadams, there is a long-standing PR for XPU support from us, #17700, which we are planning to integrate soon. We are already in discussions regarding this and would appreciate you using that branch for the time being until it gets merged. Please also feel free to set up an offline discussion with us (I work with Venkat/Sam and others regarding LLMs from Intel).
Hello, could you provide at least one simple example with distributed training on Intel GPUs? I have such hardware and would like to try this PR. Thanks!
You can follow the conversation here; there is an RWKV example.
What does this PR do?
This PR extends pytorch_lightning with support for Intel GPUs, as enabled with `intel_extension_for_pytorch`. With Intel's module, pytorch gains the `torch.xpu` module, which is equivalent to `torch.cuda`. Throughout the pytorch_lightning repository, in places where `cuda` is explicitly mentioned I tried to include equivalent functionality for `xpu`. In some cases I declined to extend support to `xpu` where I was not sure it would work or be worth it: for example, there is `BitsAndBytes`, which I know very little about, and I decided not to add `xpu` there. The main enablements are the `XPUAccelerator` and the logic to manage `xpu`s in pytorch DDP.
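As a quick illustration of the `torch.xpu` / `torch.cuda` equivalence described above, here is a minimal sketch (not taken from this PR) assuming `intel_extension_for_pytorch` is installed:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  -- registers the torch.xpu module

# After the import, torch.xpu mirrors the torch.cuda API.
device = torch.device("xpu" if torch.xpu.is_available() else "cpu")

model = torch.nn.Linear(16, 4).to(device)
x = torch.randn(8, 16, device=device)
y = model(x)

if device.type == "xpu":
    torch.xpu.synchronize()  # analogous to torch.cuda.synchronize()
    print("Running on:", torch.xpu.get_device_name(0))
```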
In the distributed case, instead of `nccl`, Intel provides the `ccl` backend for collective communications. There is a known bug that I encountered when testing: if one calls `torch.distributed.broadcast` with a list of strings, it will induce a hang. I currently wrapped that call with an explicit check against this case, which isn't ideal, but it does enable DDP on XPUs.
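For context, a minimal sketch of what DDP setup with the `ccl` backend looks like; this is illustrative only (not copied from the PR) and assumes `oneccl_bindings_for_pytorch` is installed and the usual `torch.distributed` launcher environment variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`) are set:

```python
import os

import torch
import torch.distributed as dist
import intel_extension_for_pytorch   # noqa: F401  -- provides torch.xpu
import oneccl_bindings_for_pytorch   # noqa: F401  -- registers the "ccl" backend


def init_xpu_ddp() -> torch.device:
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    # "ccl" is only visible to torch.distributed because of the import above.
    dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)

    device = torch.device(f"xpu:{local_rank}")
    torch.xpu.set_device(device)
    return device


device = init_xpu_ddp()
model = torch.nn.Linear(32, 32).to(device)
ddp_model = torch.nn.parallel.DistributedDataParallel(model)
```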
Both `xpu` and `ccl` are currently extensions to pytorch and must be loaded dynamically. `torch.xpu` is available with `import intel_extension_for_pytorch`, and the `ccl` backend to `torch.distributed` becomes available when one does `import oneccl_bindings_for_pytorch`. Because of this, I have in many cases done one of these:
- Where `xpu` is initialized, I use it freely.
- At `torch.distributed.initialize`, since the target backend is available, I intercept and ensure the oneccl bindings are loaded.
- Where code touches `torch.xpu` and can't be sure it's available, I have included logic analogous to cuda: instead of `if torch.cuda.is_available(): ...` I do `if hasattr(torch, "xpu") and torch.xpu.is_available(): ...` (see the sketch below).
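A minimal sketch of that guarded check (the helper name here is illustrative, not something defined in this PR):

```python
import torch


def xpu_available() -> bool:
    # torch.xpu only exists after intel_extension_for_pytorch has been imported,
    # so guard the attribute lookup before calling is_available().
    return hasattr(torch, "xpu") and torch.xpu.is_available()


if torch.cuda.is_available():
    device = torch.device("cuda")
elif xpu_available():
    device = torch.device("xpu")
else:
    device = torch.device("cpu")
```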
This PR was not intended to introduce any breaking changes.
I think this PR needs some discussion before we even ask "should it be merged":
📚 Documentation preview 📚: https://pytorch-lightning--19443.org.readthedocs.build/en/19443/