Replies: 12 comments 13 replies
-
If the application knows this, it makes sense to pass that data through. But does middleware know what call was used for the allocation? Can it get this through the buffer attributes?
-
Both CUDA and L0 have calls to find this out, so it's possible for the middleware to do this discovery if it doesn't already have the information to pass through. The same call that the hooking provider needs to make to register the memory also returns the type of memory to use in the registration.
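For instance, a sketch of the kind of query involved (the helper name is made up; this assumes both runtimes are initialized and a Level Zero context handle is available):

```c
#include <stdint.h>
#include <cuda.h>                 /* CUDA driver API */
#include <level_zero/ze_api.h>    /* Level Zero (L0) API */

/* Returns 1 if 'buf' is device-resident memory according to either runtime. */
static int is_device_mem(const void *buf, ze_context_handle_t ze_ctx)
{
	CUmemorytype cu_type = 0;
	ze_memory_allocation_properties_t props = {
		.stype = ZE_STRUCTURE_TYPE_MEMORY_ALLOCATION_PROPERTIES,
	};
	ze_device_handle_t dev;

	/* CUDA: a single attribute query returns HOST, DEVICE, or UNIFIED. */
	if (cuPointerGetAttribute(&cu_type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
				  (CUdeviceptr)(uintptr_t)buf) == CUDA_SUCCESS)
		return cu_type == CU_MEMORYTYPE_DEVICE;

	/* Level Zero: the same single call also returns the owning device. */
	if (zeMemGetAllocProperties(ze_ctx, buf, &props, &dev) == ZE_RESULT_SUCCESS)
		return props.type == ZE_MEMORY_TYPE_DEVICE;

	return 0;
}
```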
-
Does this data come 'for free' as part of querying if a buffer is a GPU or host buffer? Or is it an extra call?
-
For free!
-
An 'easy' solution is to extend enum fi_hmem_iface with device allocation details. For example:
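(A rough illustration; the added enumerators below are made up for discussion, not actual libfabric definitions.)

```c
enum fi_hmem_iface {
	FI_HMEM_SYSTEM = 0,
	FI_HMEM_CUDA,
	FI_HMEM_ROCR,
	FI_HMEM_ZE,

	/* hypothetical additions carrying allocation detail */
	FI_HMEM_CUDA_DEVICE,	/* cuMemAlloc / cudaMalloc */
	FI_HMEM_CUDA_HOST,	/* cuMemAllocHost / cudaMallocHost */
	FI_HMEM_CUDA_MANAGED,	/* cuMemAllocManaged / cudaMallocManaged */
	FI_HMEM_ZE_DEVICE,	/* zeMemAllocDevice */
	FI_HMEM_ZE_HOST,	/* zeMemAllocHost */
	FI_HMEM_ZE_SHARED,	/* zeMemAllocShared */
};
```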
This may add a bunch of enum values. However, I'm not sure whether additional details would be necessary, based on which call was used and how it was invoked. As an alternative, rather than indicate which call was used, we could pass in the necessary information more directly. Using a flag, rather than extending the enum, seems to make more sense here:
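(Again just a sketch; flag names and bit values are illustrative, not actual libfabric definitions.)

```c
/* Hypothetical mr registration flags. */
#define FI_HMEM_IPC_ENABLED	(1ULL << 60)	/* caller knows IPC is supported   */
#define FI_HMEM_IPC_DISABLED	(1ULL << 61)	/* caller knows IPC is unsupported */
/* Neither flag set: the caller does not know; the provider must discover it. */
```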
These flags would be passed into fi_mr_regattr. I don't like having 2 flags, but I think we need both in order to indicate when IPC support is unknown to the caller.
-
In addition to whether a buffer supports IPC, I think we may need to distinguish between managed (cudaMallocManaged) and non-managed (cudaMalloc). With MOFED, RDMA to CUDA managed buffers cannot be supported due to the following: "CUDA Unified Memory is not explicitly supported in combination with GPUDirect RDMA. While the page table returned by nvidia_p2p_get_pages() is valid for managed memory buffers and provides a mapping of GPU memory at any given moment in time, the GPU device copy of that memory may be incoherent with the writable copy of the page which is not on the GPU. Using the page table in this circumstance may result in accessing stale data, or data loss, because of a DMA write access to device memory that is subsequently overwritten by the Unified Memory run-time. cuPointerGetAttribute() may be used to determine if an address is being managed by the Unified Memory runtime."
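For example, a minimal check (assuming the CUDA driver API is initialized):

```c
#include <stdint.h>
#include <cuda.h>

/* Returns 1 if 'ptr' is managed by the Unified Memory runtime
 * (i.e., allocated with cudaMallocManaged/cuMemAllocManaged). */
static int cuda_is_managed(const void *ptr)
{
	unsigned int is_managed = 0;

	if (cuPointerGetAttribute(&is_managed, CU_POINTER_ATTRIBUTE_IS_MANAGED,
				  (CUdeviceptr)(uintptr_t)ptr) != CUDA_SUCCESS)
		return 0;

	return is_managed != 0;
}
```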
-
@iziemba's comment suggests that the enum extension proposed above would work for both the IPC and GPUDirect RDMA cases, or else we need to define more flags. The enum is looking like a better approach, and might be easier for the user.
-
As we add interfaces and types of memory, there are going to be a lot of combinations, and I think it might make it hard to code for specific protocols. For example, in shm, to figure out if I can do IPC I have to check interface-specific combinations (see the sketch below), rather than a single indication that can apply the same to all interfaces.
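Something like this per-interface branching (a sketch, not actual shm provider code; the allocation-type enum is hypothetical):

```c
#include <stdbool.h>
#include <rdma/fi_domain.h>	/* enum fi_hmem_iface */

/* Hypothetical allocation-type tag the provider would have to track. */
enum alloc_type { ALLOC_HOST, ALLOC_DEVICE, ALLOC_MANAGED };

/* Whether shm could use device IPC for this buffer; every new interface
 * and allocation type grows this switch. */
static bool can_use_ipc(enum fi_hmem_iface iface, enum alloc_type type)
{
	switch (iface) {
	case FI_HMEM_CUDA:
		/* cudaIpcGetMemHandle only works on non-managed device memory */
		return type == ALLOC_DEVICE;
	case FI_HMEM_ZE:
		/* zeMemGetIpcHandle only works on device allocations */
		return type == ALLOC_DEVICE;
	default:
		/* each new interface adds another case here */
		return false;
	}
}
```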
-
Why would anyone use a device API to allocate host memory? I don't understand why those calls exist at all. Looking at the CUDA documentation, there's a ridiculous number of allocation calls. Sigh... What does 'managed' memory actually mean? That the allocation uses a shared virtual address space? And are we getting into an area where a restriction today may not be a restriction tomorrow?
-
Thanks. This is helpful. With the kernel hmem driver, malloc supports page migration between the host and GPU. If we can get this down to 3 flags, that's not too bad. But the flags appear to be exclusive. Is it enough to indicate that the memory has FI_HMEM_HOST_ACCESS and/or FI_HMEM_DEVICE_ACCESS? The absence of any flag would indicate the caller has no idea, I guess.
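A sketch of how such a pair could map onto the cases above (flag names from the question; values and the mapping are illustrative only):

```c
/* Hypothetical access flags. */
#define FI_HMEM_HOST_ACCESS	(1ULL << 62)
#define FI_HMEM_DEVICE_ACCESS	(1ULL << 63)

/*
 * DEVICE_ACCESS only          : device-only memory (e.g., cudaMalloc)
 * HOST_ACCESS only            : host memory (e.g., malloc)
 * HOST_ACCESS | DEVICE_ACCESS : migratable/managed memory (e.g., cudaMallocManaged)
 * neither                     : caller has no idea; provider must query
 */
```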
-
Trying to decipher back through the thread... IPC is supported on device-only memory, correct? And GPUDirect is also only supported on device-only memory, correct? Host-memory buffers may or may not have the ability to migrate to/from the device. This can occur without the application even being aware that migration is possible. DMAbuf support enables peer-to-peer transfers, similar to GPUDirect, but using an upstream mechanism. DMAbuf requires on-demand paging support from the NIC, and would work with page migration. Even if MOFED today has a limitation on peer-to-peer, I don't think we can assume that restriction will hold. And I'm not confident that the restrictions in place on CUDA, Ze, RoCR will hold either.
-
Based on OFIWG discussions, the FI_HMEM_DEVICE_ONLY flag passed into the memory registration call should work. When given, the provider can optimize transfers. The shm provider can use IPC for transfers, and verbs can setup P2P RDMA. If not given, the provider would need to query the hmem APIs to discover this data.
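For example, a minimal sketch of such a registration (the helper name is made up; error handling omitted):

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Register a cudaMalloc'ed buffer and tell the provider it is device-only,
 * so shm can use IPC and verbs can set up P2P RDMA without extra queries. */
static int register_device_buf(struct fid_domain *domain, void *buf, size_t len,
			       int cuda_device, struct fid_mr **mr)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct fi_mr_attr attr = {
		.mr_iov = &iov,
		.iov_count = 1,
		.access = FI_SEND | FI_RECV | FI_READ | FI_WRITE,
		.iface = FI_HMEM_CUDA,
		.device.cuda = cuda_device,
	};

	return fi_mr_regattr(domain, &attr, FI_HMEM_DEVICE_ONLY, mr);
}
```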
-
To support FI_MR_HMEM, fi_mr_attr was extended to include the interface used to allocate and manage the memory (i.e., FI_HMEM_CUDA, FI_HMEM_ZE), as well as interface-specific fields.
Each of these interface types is defined to include memory allocated in a variety of different ways. For example, for CUDA:
"Uses Nvidia CUDA interfaces such as cuMemAlloc, cuMemAllocHost, cuMemAllocManaged, cuMemFree, cudaMalloc, cudaFree."
However, for both CUDA and L0 (maybe others), IPC is not supported for all types of allocations; see the CUDA and ZE documentation on IPC restrictions.
fi_mr_attr needs to include what type of memory/allocation the buffer is as well, so that the provider can know which protocols, like IPC, can or cannot be used.
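For example (a minimal illustration of the ambiguity; error handling omitted):

```c
#include <cuda_runtime.h>

/* Both buffers register identically today (iface = FI_HMEM_CUDA), yet only
 * the first can be exported with cudaIpcGetMemHandle for IPC transfers. */
static void alloc_two_kinds(size_t len)
{
	void *dev_buf, *managed_buf;

	cudaMalloc(&dev_buf, len);                                 /* device-only: IPC works   */
	cudaMallocManaged(&managed_buf, len, cudaMemAttachGlobal); /* managed: IPC not allowed */

	cudaFree(dev_buf);
	cudaFree(managed_buf);
}
```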
Possible solution - add allocation types to fi_hmem_iface