Dear ILGPU Devs & Community,
From my understanding, when allocating memory the hardware/driver returns a pointer aligned to a specific number of bytes for technical and performance reasons. According to this, that number is 256 for CUDA GPUs. I imagine other types of hardware adhere to different alignments.
Does ILGPU provide a hardware-independent way to query this information?
The reason I am asking is that I'm trying to reduce the number of memory allocations I make (in order to avoid accumulating API overhead). Instead of multiple buffers, I allocate a single large one and split it into multiple sections for arrays and variables (sketched below). According to the guide I referenced, these sections should be aligned the same way as the original allocation; otherwise things may break (at least on CUDA devices).
Quote:
Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.
Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words), so special care must be taken to maintain alignment of the starting address of any value or array of values of these types. A typical case where this might be easily overlooked is when using some custom global memory allocation scheme, whereby the allocations of multiple arrays (with multiple calls to cudaMalloc() or cuMemAlloc()) is replaced by the allocation of a single large block of memory partitioned into multiple arrays, in which case the starting address of each array is offset from the block’s starting address.
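To make the scheme concrete, here is a minimal sketch of what I mean. The 256-byte constant, the AlignUp helper, and the element counts are assumptions made up for this example; the typed sections are carved out of one byte buffer with SubView and Cast:

```csharp
using ILGPU;
using ILGPU.Runtime;

static class PackedAllocationSketch
{
    // Assumed minimum alignment, taken from the CUDA guide quoted above.
    const long Alignment = 256;

    // Rounds a byte offset up to the next multiple of Alignment.
    static long AlignUp(long offsetInBytes) =>
        (offsetInBytes + Alignment - 1) / Alignment * Alignment;

    static void Run(Accelerator accelerator)
    {
        // Element counts of the logical arrays packed into one allocation (made up).
        const long floatCount = 1000;
        const long intCount = 500;

        // Place each section on an assumed 256-byte boundary relative to the block start.
        long floatOffset = 0;
        long intOffset = AlignUp(floatOffset + floatCount * sizeof(float));
        long totalBytes = AlignUp(intOffset + intCount * sizeof(int));

        // One allocation instead of many; the driver aligns the block itself.
        using var block = accelerator.Allocate1D<byte>(totalBytes);
        ArrayView<byte> raw = block.View;

        // Carve the block into typed sections.
        var floatView = raw.SubView(floatOffset, floatCount * sizeof(float)).Cast<float>();
        var intView = raw.SubView(intOffset, intCount * sizeof(int)).Cast<int>();

        // floatView and intView are then passed to kernels in place of separately
        // allocated buffers.
    }
}
```

Besides the 256-byte question, the quote above also implies that each section's byte offset must be a multiple of its element size (8 or 16 bytes for the wide types), which the rounding here takes care of as a side effect.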
Hi @TriceHelix, thanks for reaching out to us regarding this topic. Currently, ILGPU is internally aware of the global pointer alignment (e.g., here) and uses this information to automatically vectorize IO operations, for instance. Depending on the backend, the accelerator (... and your kernel, of course), ILGPU will do "everything for you" when using the O2 optimization pipeline. However, you can also control the alignment of views inside kernels via view.AlignToXYZ() and view.AlignTo(xyz) (where xyz evaluates to a constant). Moreover, you can "tell" ILGPU that a particular view has a certain alignment via view.AsAlignedXYZ() and view.AsAligned(xyz) (where xyz evaluates to a constant at kernel compile time).
To answer your actual question: there is no such API to query the automatic alignment rules at the moment. However, I think it makes absolute sense to add one. @MoFtZ, what do you think?
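To make the kernel-side part concrete, here is a minimal sketch of how AsAligned might be used, assuming it is callable on an ArrayView<float> and returns a view of the same element type; the 16-byte constant and all names here are arbitrary choices for this example:

```csharp
using ILGPU;
using ILGPU.Runtime;

static class KernelAlignmentSketch
{
    // Declares to ILGPU that 'data' starts on a 16-byte boundary so the compiler
    // may vectorize loads and stores. The constant must be known at kernel
    // compile time; 16 is an arbitrary value chosen for this sketch.
    static void ScaleKernel(Index1D i, ArrayView<float> data, float factor)
    {
        var aligned = data.AsAligned(16);
        aligned[i] = aligned[i] * factor;
    }

    static void Run(Accelerator accelerator)
    {
        using var buffer = accelerator.Allocate1D<float>(1 << 20);
        var kernel = accelerator.LoadAutoGroupedStreamKernel<
            Index1D, ArrayView<float>, float>(ScaleKernel);

        kernel((int)buffer.Length, buffer.View, 2.0f);
        accelerator.Synchronize();
    }
}
```

Under the O2 pipeline mentioned above, ILGPU applies such vectorization automatically where it can prove the alignment; the explicit calls are for cases where the programmer knows more than the compiler can prove.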
Turns out most of the issues I was facing were caused by the Cast method interfering with aligned memory, not by the alignments themselves. I also realized that ArrayViews are not equivalent to raw pointers; under the hood they all refer to a buffer at a certain offset. As you've said, the alignment of the buffer itself is handled automatically. However, I'd still say that an API making the alignment amounts transparent would be neat. For now, consider the other issue the actual problem I've encountered.