Replies: 12 comments 13 replies
-
If the application knows this, it makes sense to pass that data through. But does middleware know what call was used for the allocation? Can it get this through the buffer attributes?
-
Both CUDA and L0 have calls to find this out, so it's possible for the middleware to do this discovery if it doesn't already have the information to pass through. The same call that the hooking provider needs to make to register the memory also returns the type of memory to use in the registration.
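For instance, a sketch of the kind of query involved (the helper name is made up; this assumes both runtimes are initialized and a Level Zero context handle is available):

```c
#include <stdint.h>
#include <cuda.h>                 /* CUDA driver API */
#include <level_zero/ze_api.h>    /* Level Zero (L0) API */

/* Returns 1 if 'buf' is device-resident memory according to either runtime. */
static int is_device_mem(const void *buf, ze_context_handle_t ze_ctx)
{
	CUmemorytype cu_type = 0;
	ze_memory_allocation_properties_t props = {
		.stype = ZE_STRUCTURE_TYPE_MEMORY_ALLOCATION_PROPERTIES,
	};
	ze_device_handle_t dev;

	/* CUDA: a single attribute query returns HOST, DEVICE, or UNIFIED. */
	if (cuPointerGetAttribute(&cu_type, CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
				  (CUdeviceptr)(uintptr_t)buf) == CUDA_SUCCESS)
		return cu_type == CU_MEMORYTYPE_DEVICE;

	/* Level Zero: the same single call also returns the owning device. */
	if (zeMemGetAllocProperties(ze_ctx, buf, &props, &dev) == ZE_RESULT_SUCCESS)
		return props.type == ZE_MEMORY_TYPE_DEVICE;

	return 0;
}
```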
-
Does this data come 'for free' as part of querying if a buffer is a GPU or host buffer? Or is it an extra call?
-
For free!
-
An 'easy' solution is to extend enum fi_hmem_iface with device allocation details. For example:
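(A rough illustration; the added enumerators below are made up for discussion, not actual libfabric definitions.)

```c
enum fi_hmem_iface {
	FI_HMEM_SYSTEM = 0,
	FI_HMEM_CUDA,
	FI_HMEM_ROCR,
	FI_HMEM_ZE,

	/* hypothetical additions carrying allocation detail */
	FI_HMEM_CUDA_DEVICE,	/* cuMemAlloc / cudaMalloc */
	FI_HMEM_CUDA_HOST,	/* cuMemAllocHost / cudaMallocHost */
	FI_HMEM_CUDA_MANAGED,	/* cuMemAllocManaged / cudaMallocManaged */
	FI_HMEM_ZE_DEVICE,	/* zeMemAllocDevice */
	FI_HMEM_ZE_HOST,	/* zeMemAllocHost */
	FI_HMEM_ZE_SHARED,	/* zeMemAllocShared */
};
```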
This may add a bunch of enum values. However, I'm not sure whether additional details would be necessary, based on which call was used and how it was invoked. As an alternative, rather than indicate which call was used, we could pass in the necessary information more directly. Using a flag, rather than extending the enum, seems to make more sense here:
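(Again just a sketch; flag names and bit values are illustrative, not actual libfabric definitions.)

```c
/* Hypothetical mr registration flags. */
#define FI_HMEM_IPC_ENABLED	(1ULL << 60)	/* caller knows IPC is supported   */
#define FI_HMEM_IPC_DISABLED	(1ULL << 61)	/* caller knows IPC is unsupported */
/* Neither flag set: the caller does not know; the provider must discover it. */
```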
These flags would be passed into fi_mr_regattr. I don't like having 2 flags, but I think we need both in order to indicate when IPC support is unknown to the caller.
-
In addition to whether a buffer supports IPC, I think we may need to distinguish between managed (cudaMallocManaged) and non-managed (cudaMalloc). With MOFED, RDMA to CUDA managed buffers cannot be supported due to the following: "CUDA Unified Memory is not explicitly supported in combination with GPUDirect RDMA. While the page table returned by nvidia_p2p_get_pages() is valid for managed memory buffers and provides a mapping of GPU memory at any given moment in time, the GPU device copy of that memory may be incoherent with the writable copy of the page which is not on the GPU. Using the page table in this circumstance may result in accessing stale data, or data loss, because of a DMA write access to device memory that is subsequently overwritten by the Unified Memory run-time. cuPointerGetAttribute() may be used to determine if an address is being managed by the Unified Memory runtime."
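For example, a minimal check (assuming the CUDA driver API is initialized):

```c
#include <stdint.h>
#include <cuda.h>

/* Returns 1 if 'ptr' is managed by the Unified Memory runtime
 * (i.e., allocated with cudaMallocManaged/cuMemAllocManaged). */
static int cuda_is_managed(const void *ptr)
{
	unsigned int is_managed = 0;

	if (cuPointerGetAttribute(&is_managed, CU_POINTER_ATTRIBUTE_IS_MANAGED,
				  (CUdeviceptr)(uintptr_t)ptr) != CUDA_SUCCESS)
		return 0;

	return is_managed != 0;
}
```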
-
@iziemba's comment suggests that the enum extension proposed above would work for both the IPC and GPUDirect RDMA cases, or else we need to define more flags. The enum is looking like a better approach, and might be easier for the user.
-
As we add interfaces and types of memory, there are going to be a lot of combinations, and I think it might make it hard to code for specific protocols. For example, in shm, to figure out if I can do IPC I have to check interface-specific combinations (see the sketch below), rather than a single indication that can apply the same to all interfaces.
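Something like this per-interface branching (a sketch, not actual shm provider code; the allocation-type enum is hypothetical):

```c
#include <stdbool.h>
#include <rdma/fi_domain.h>	/* enum fi_hmem_iface */

/* Hypothetical allocation-type tag the provider would have to track. */
enum alloc_type { ALLOC_HOST, ALLOC_DEVICE, ALLOC_MANAGED };

/* Whether shm could use device IPC for this buffer; every new interface
 * and allocation type grows this switch. */
static bool can_use_ipc(enum fi_hmem_iface iface, enum alloc_type type)
{
	switch (iface) {
	case FI_HMEM_CUDA:
		/* cudaIpcGetMemHandle only works on non-managed device memory */
		return type == ALLOC_DEVICE;
	case FI_HMEM_ZE:
		/* zeMemGetIpcHandle only works on device allocations */
		return type == ALLOC_DEVICE;
	default:
		/* each new interface adds another case here */
		return false;
	}
}
```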
-
Why would anyone use a device API to allocate host memory? I don't understand why those calls exist at all. Looking at the CUDA documentation, there's a ridiculous number of allocation calls. Sigh... What does 'managed' memory actually mean? That the allocation uses a shared virtual address space? And are we getting into an area where a restriction today may not be a restriction tomorrow?
-
Thanks. This is helpful. With the kernel hmem driver, malloc supports page migration between the host and GPU. If we can get this down to 3 flags, that's not too bad. But the flags appear to be exclusive. Is it enough to indicate that the memory has FI_HMEM_HOST_ACCESS and/or FI_HMEM_DEVICE_ACCESS? The absence of any flag would indicate the caller has no idea, I guess.
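A sketch of how such a pair could map onto the cases above (flag names from the question; values and the mapping are illustrative only):

```c
/* Hypothetical access flags. */
#define FI_HMEM_HOST_ACCESS	(1ULL << 62)
#define FI_HMEM_DEVICE_ACCESS	(1ULL << 63)

/*
 * DEVICE_ACCESS only          : device-only memory (e.g., cudaMalloc)
 * HOST_ACCESS only            : host memory (e.g., malloc)
 * HOST_ACCESS | DEVICE_ACCESS : migratable/managed memory (e.g., cudaMallocManaged)
 * neither                     : caller has no idea; provider must query
 */
```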
-
Trying to decipher back through the thread... IPC is supported on device-only memory, correct? And GPUDirect is also only supported on device-only memory, correct? Host-memory buffers may or may not have the ability to migrate to/from the device. This can occur without the application even being aware that migration is possible. DMAbuf support enables peer-to-peer transfers, similar to GPUDirect, but using an upstream mechanism. DMAbuf requires on-demand paging support from the NIC, and would work with page migration. Even if MOFED today has a limitation on peer-to-peer, I don't think we can assume that restriction will hold. And I'm not confident that the restrictions in place on CUDA, Ze, RoCR will hold either.
-
Based on OFIWG discussions, the FI_HMEM_DEVICE_ONLY flag passed into the memory registration call should work. When given, the provider can optimize transfers. The shm provider can use IPC for transfers, and verbs can setup P2P RDMA. If not given, the provider would need to query the hmem APIs to discover this data.
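For example, a minimal sketch of such a registration (the helper name is made up; error handling omitted):

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Register a cudaMalloc'ed buffer and tell the provider it is device-only,
 * so shm can use IPC and verbs can set up P2P RDMA without extra queries. */
static int register_device_buf(struct fid_domain *domain, void *buf, size_t len,
			       int cuda_device, struct fid_mr **mr)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct fi_mr_attr attr = {
		.mr_iov = &iov,
		.iov_count = 1,
		.access = FI_SEND | FI_RECV | FI_READ | FI_WRITE,
		.iface = FI_HMEM_CUDA,
		.device.cuda = cuda_device,
	};

	return fi_mr_regattr(domain, &attr, FI_HMEM_DEVICE_ONLY, mr);
}
```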
-
To support FI_MR_HMEM, fi_mr_attr was extended to include the interface used to allocate and manage the memory (i.e., FI_HMEM_CUDA, FI_HMEM_ZE), as well as interface-specific fields.
Each of these interface types is defined to include memory allocated in a variety of different ways. For example, for CUDA:
"Uses Nvidia CUDA interfaces such as cuMemAlloc, cuMemAllocHost, cuMemAllocManaged, cuMemFree, cudaMalloc, cudaFree."
However, for both CUDA and L0 (maybe others), IPC is not supported for all types of allocations; see the CUDA and ZE documentation on IPC restrictions.
fi_mr_attr needs to include what type of memory/allocation the buffer is as well, so that the provider can know which protocols, like IPC, can or cannot be used.
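For example (a minimal illustration of the ambiguity; error handling omitted):

```c
#include <cuda_runtime.h>

/* Both buffers register identically today (iface = FI_HMEM_CUDA), yet only
 * the first can be exported with cudaIpcGetMemHandle for IPC transfers. */
static void alloc_two_kinds(size_t len)
{
	void *dev_buf, *managed_buf;

	cudaMalloc(&dev_buf, len);                                 /* device-only: IPC works   */
	cudaMallocManaged(&managed_buf, len, cudaMemAttachGlobal); /* managed: IPC not allowed */

	cudaFree(dev_buf);
	cudaFree(managed_buf);
}
```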
Possible solution - add allocation types to fi_hmem_iface