Expose new CUDA APIs for sharing the same memory between processes #1235
Conversation
Sorry for switching the draft status back and forth; I'm still getting used to this.
Thanks for this PR @Deneas.
I've made some comments inline already.
In addition, this PR handles consuming an IPC buffer from another process, but not the "Get IPC Mem Handle" functionality. That would mean ILGPU could not share its own buffer with another process. Is that correct?
With regards to unit testing, I'm not sure that is even possible. Perhaps add two sample projects, one for the Producer and one for the Consumer?
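For reference, these are the two CUDA driver-API entry points involved. The C# declarations below are illustrative only and are not the bindings used in this PR (which uses `LibraryImport` and its own handle struct): `cuIpcGetMemHandle` is the producer-side call in question, while the consumer side wraps `cuIpcOpenMemHandle`.

```csharp
using System;
using System.Runtime.InteropServices;

// Sketch only: minimal DllImport declarations for the driver API
// (the library name is "nvcuda" on Windows; it differs on Linux).
internal static class CudaIpcNative
{
    // cudaIpcMemHandle_t / CUipcMemHandle is an opaque 64-byte blob.
    [StructLayout(LayoutKind.Sequential, Size = 64)]
    internal struct CuIpcMemHandle { }

    // Producer: export a handle for memory it allocated (the missing piece).
    [DllImport("nvcuda", EntryPoint = "cuIpcGetMemHandle")]
    internal static extern int cuIpcGetMemHandle(out CuIpcMemHandle handle, IntPtr devicePtr);

    // Consumer: map the exported memory into this process (what the PR wraps).
    [DllImport("nvcuda", EntryPoint = "cuIpcOpenMemHandle")]
    internal static extern int cuIpcOpenMemHandle(out IntPtr devicePtr, CuIpcMemHandle handle, uint flags);
}
```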
/// <remarks>This represents the struct
/// <a href="https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaIpcMemHandle__t.html#structcudaIpcMemHandle__t">cudaIpcMemHandle__t</a>.
/// </remarks>
public struct CudaIpcMemHandle
QUESTION: Is there a reason for having a struct wrapping another struct (i.e. `Handle` within `CudaIpcMemHandle`)? Why not just declare a single struct?
Originally it was that way to mirror `cudaIpcMemHandle__t` having the array as a field. Having `[InlineArray(64)]` directly on `CudaIpcMemHandle` means it straight up *is* the array.
Also, honestly speaking, I didn't think to test it that way, as getting arrays to work with `LibraryImport` instead of `DllImport` was already a bit of a hassle.
That being said, I tested it that way and it actually worked. Since we don't have any other fields in the struct, we should be fine this way.
To ease usage, since inline arrays can't be used directly as arrays, I added explicit conversions to and from `byte[]`. I thought about making them implicit, but decided against it because implicit conversions should not throw exceptions and generally carry the assumption of being allocation free.
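To illustrate the approach described here, a minimal sketch of what such a struct could look like (assuming C# 12 / .NET 8 for `InlineArray`; the field name, validation, and operator bodies are my assumptions and are not the PR's actual code):

```csharp
using System;
using System.Runtime.CompilerServices;

// Sketch: the 64-byte cudaIpcMemHandle__t modeled as an inline array with
// explicit (allocating, potentially throwing) conversions to and from byte[].
[InlineArray(64)]
public struct CudaIpcMemHandle
{
    private byte element0;

    public static explicit operator byte[](CudaIpcMemHandle handle)
    {
        var result = new byte[64];
        ReadOnlySpan<byte> source = handle;   // inline arrays convert to spans
        source.CopyTo(result);
        return result;
    }

    public static explicit operator CudaIpcMemHandle(byte[] bytes)
    {
        if (bytes is null || bytes.Length != 64)
            throw new ArgumentException("Expected a 64-byte IPC handle.", nameof(bytes));
        var handle = new CudaIpcMemHandle();
        Span<byte> target = handle;
        bytes.CopyTo(target);
        return handle;
    }
}
```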
CudaIpcMemHandle ipcMemHandle,
long length,
int elementSize,
CudaIpcMemFlags flags = CudaIpcMemFlags.None)
I would suggest making the `flags` parameter mandatory, so that the caller has to provide it. It will primarily be used internally anyway.
Agreed, I made it mandatory. What I'm not sure about is whether we should even add them to the "convenience" methods. The only current flag, `LazyEnablePeerAccess`, enables peer access if the underlying memory is allocated on a different device than the one used by the current context.
Obviously I'm quite new to this library, but my gut says this is a rather niche case, not worth complicating the API for.
Well, I spoke too soon. Without `LazyEnablePeerAccess`, IPC won't work for me even with the same device. I strongly suspect this relates to my SLI setup with two GPUs, but I was unable to find a definitive answer.
Unsurprisingly, I'm now strongly in favor of exposing the flags. I also added my findings to the remarks.
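For concreteness, here is roughly what the difference looks like with the constructor signature quoted above (sketch only; `cudaAccelerator`, the length, and the element size are assumed caller-side values, and the API shape may still change):

```csharp
// Worked on the two-GPU (SLI) machine described above:
var shared = new CudaIpcMemoryBuffer(
    cudaAccelerator, ipcMemHandle, length: 1024, elementSize: sizeof(float),
    flags: CudaIpcMemFlags.LazyEnablePeerAccess);

// Failed there, even with producer and consumer on the same device:
// var shared = new CudaIpcMemoryBuffer(
//     cudaAccelerator, ipcMemHandle, length: 1024, elementSize: sizeof(float),
//     flags: CudaIpcMemFlags.None);
```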
@@ -0,0 +1,173 @@
// ---------------------------------------------------------------------------------------
// ILGPU
// Copyright (c) 2017-2023 ILGPU Project
ISSUE: Need to fix up the copyright header. Should be 2024 only.
Done; I also assumed you probably wanted the same for all other files I touched.
namespace ILGPU.Runtime.Cuda
{
    /// <summary>
    /// Represents an unmanaged Cuda buffer.
ISSUE: Please tidy up the comments in this file. e.g. "This represents a Cuda IPC buffer"
I made some changes; how does it look now? I left the comments on code that is identical to `CudaMemoryBuffer` alone, since it was the original template for the class.
/// <param name="length">The number of elements to allocate.</param> | ||
/// <param name="elementSize">The size of a single element in bytes.</param> | ||
/// <returns>A mapped buffer of shared memory on this accelerator.</returns> | ||
public CudaIpcMemoryBuffer MapRaw(CudaIpcMemHandle ipcMemHandle, long length, int elementSize) |
I'm thinking this method signature should be something like:
`public MemoryBuffer MapFromIpcMemHandle(CudaIpcMemHandle ipcMemHandle, long length, int elementSize)`.
We do not need to expose the underlying `CudaIpcMemoryBuffer`.
And we should also have another method like:
`public MemoryBuffer1D<T, Stride1D.Dense> MapFromIpcMemHandle1D<T>(CudaIpcMemHandle ipcMemHandle, long length)`.
This is the convenience method that will make it easier to use. Not sure if we need the 2D or 3D variants.
I'm also wondering if we should be using the `Map` prefix, or if it should be an `Allocate` prefix, so that it is discoverable alongside the rest of the memory allocation functions when using auto-complete, e.g. `AllocateFromIpcMemHandle`.
Agree on returning `MemoryBuffer` and naming the method `MapFromIpcMemHandle`. My original idea with `MapRaw` was to mirror `Accelerator.AllocateRaw`, but `MapFrom` works as well.
I also thought about using `Allocate`, but decided against it, because allocating is exactly what we're not doing: the memory was already allocated by a different process, and what we're creating is more like a `View`. Instead I went with `Map`, because that is also what the CUDA documentation uses ("Maps memory exported from another process").
As for having an API similar to `Allocate1D()`, e.g. `Map1D`, that is something I could do. As far as I understand the code, I can most likely just replace `AllocateRaw` with its IPC equivalent and pass the `CudaIpcMemHandle` through the methods. The methods should then live on `CudaAccelerator` rather than `Accelerator`.
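As a sketch of how the consumer side might then look with such a convenience method (everything beyond the standard ILGPU setup calls is proposed, not yet merged, API; the final method name is still under discussion, I'm assuming the mandatory flags parameter from the earlier thread, and the handle file is just a placeholder for whatever transport exchanges the handle bytes):

```csharp
using System.IO;
using ILGPU;
using ILGPU.Runtime;
using ILGPU.Runtime.Cuda;

using var context = Context.CreateDefault();
using var accelerator = context.CreateCudaAccelerator(0);

// 64 handle bytes received from the producer process (file path is illustrative).
byte[] receivedHandleBytes = File.ReadAllBytes("ipc_handle.bin");
var ipcMemHandle = (CudaIpcMemHandle)receivedHandleBytes;

// Map the memory exported by the other process as a dense 1D buffer.
using var mapped = accelerator.MapFromIpcMemHandle1D<float>(
    ipcMemHandle, length: 1024, flags: CudaIpcMemFlags.LazyEnablePeerAccess);

// From here on it behaves like any other 1D buffer on the accelerator.
float[] hostCopy = mapped.GetAsArray1D();
```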
/// Enables peer access lazily.
/// </summary>
/// <remarks>
/// From <a href="https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g60f28a5142ee7ae0336dfa83fd54e006">
ISSUE: I'm guessing the GUID will change with each Cuda release, so I would prefer not to include the link.
Understandable. I removed all of the remarks as well, since they seemed too vague without the links.
throw new ArgumentOutOfRangeException(nameof(elementSize));

Bind();
return new CudaIpcMemoryBuffer(this, ipcMemHandle, length, elementSize);
QUESTION: Should there be a `flags` parameter in the method signature, which then gets passed here?
Sorry, I went from top to bottom and only saw this later. Feel free to reply in either thread.

> What I'm not sure about is whether we should even add them to the "convenience" methods. The only current flag, `LazyEnablePeerAccess`, enables peer access if the underlying memory is allocated on a different device than the one used by the current context.
> Obviously I'm quite new to this library, but my gut says this is a rather niche case, not worth complicating the API for.
No, I just missed creating a convenient method for doing so in
Yeah, while I think it would be doable as a test with a new project, adding a new Sample seems the more idiomatic solution for ILGPU.
Alright, with the added sample and changes, this PR is feature-complete. All that remains is to settle naming and the ergonomics of the helper methods. A quick summary of my new work:
As before, feel free to ask questions and give feedback.
Instead of allocating and freeing memory it uses OpenIpcMemHandle and CloseIpcMemHandle.
This mimics AllocateRaw for IPC shared memory with MapRaw. The newly added constructor for CudaIpcMemHandle allows users to avoid dealing with the quirks of inline array.
…f the allocations are less than 1 MB
- Use the newest version of cuIpcOpenMemHandle.
- Apply InlineArray to CudaIpcMemHandle directly, as we don't have any other fields or properties.
- Make the flags parameter mandatory for CudaIpcMemoryBuffer.
- Adjust the header format of all touched files to use the current year only.
- Adjust the doc comments for CudaIpcMemoryBuffer where behavior differs from CudaMemoryBuffer.
InlineArrays complicated array handling and forced users on newer C# versions. Now we use regular arrays and pointers for interop.
Amazing work, thank you very much for your efforts. I'll ping @MoFtZ to take a look and get this over the line.
Thank you for this PR, looks mostly good to me! I was wondering about the class `CudaIpcMemoryBuffer`, which does not inherit from `CudaMemoryBuffer`. This can lead to confusion and potential issues with people testing for Cuda-enabled buffers via `is`.
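Purely as an illustration of that concern (hypothetical calling code, not from this PR):

```csharp
// If CudaIpcMemoryBuffer does not derive from CudaMemoryBuffer, a check like
// this silently skips IPC-mapped buffers even though they live in Cuda memory.
if (buffer is CudaMemoryBuffer cudaBuffer)
{
    // Cuda-specific handling, e.g. handing cudaBuffer.NativePtr to native code.
}
```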
This allows introspection, e.g. checking whether a memory handle was already used.
Yeah, I agree it could be a bit confusing, but I felt it would be wrong for
Hello
This pull request is the result of my experiments with CUDA IPC and exposes the CUDA IPC API to allow sharing the same memory between different processes. Another part would be the Event API, but as far as I can tell, you already have that one exposed.
The biggest currently open task is memory management of the buffer allocated via IPC. My current plan is to create a class `CudaIpcMemoryBuffer` mirroring `CudaMemoryBuffer`. The main difference I have found so far is in disposing, as the IPC buffer should use the new method `CloseIpcMemHandle`.
Also, the way I see it, the actual exchange of the IPC handle between different processes should be out of scope for this project.
Feel free to give feedback and to ask questions.
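For orientation, a hedged sketch of the overall flow this enables. The `GetIpcMemHandle` and `MapFromIpcMemHandle1D` names are placeholders based on this discussion, not merged API, and the transport of the handle bytes between processes is deliberately left out, as stated above:

```csharp
using ILGPU;
using ILGPU.Runtime;
using ILGPU.Runtime.Cuda;

// Producer process: allocate normally with ILGPU, then export the handle.
using var context = Context.CreateDefault();
using var accelerator = context.CreateCudaAccelerator(0);
using var buffer = accelerator.Allocate1D<int>(1024);
// Hypothetical producer-side call wrapping cuIpcGetMemHandle:
// CudaIpcMemHandle handle = accelerator.GetIpcMemHandle(buffer);
// SendToConsumer((byte[])handle);   // e.g. via a named pipe (out of scope)

// Consumer process: reconstruct the handle and map the same device memory.
// var handle = (CudaIpcMemHandle)ReceiveFromProducer();
// using var mapped = accelerator.MapFromIpcMemHandle1D<int>(handle, 1024);
```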