[QUESTION]: is it possible to detect errors in kernel operations? #1305
Comments
Hi @El-Gor-do. When your kernel causes a failure (e.g. divide by zero), the kernel will stop executing. The error will be reported on the next CUDA API call (e.g. Synchronize). This is the standard CUDA error reporting mechanism. ILGPU will take that CUDA error code and raise a CudaException. #1268 is saying that, in case of an error, you should assume your CUDA context is no longer usable and start again. You could potentially start a new context in the same process.
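For reference, here is a minimal sketch of that pattern: synchronize right after the launch so a pending error surfaces at a known point, and catch the exception ILGPU raises. `KernelErrorCheck`, `TryRun` and `FillKernel` are hypothetical names made up for illustration. It assumes the failure really does arrive as a catchable `CudaException`; #1268 suggests that may not hold for every failure mode, so treat this as something to experiment with rather than a guarantee.

```csharp
using System;
using ILGPU;
using ILGPU.Runtime;
using ILGPU.Runtime.Cuda;

public static class KernelErrorCheck
{
    // Hypothetical helper: runs the launch delegate, then synchronizes so that
    // any pending CUDA error is reported here instead of at a later API call.
    // Returns false if ILGPU raised a CudaException.
    public static bool TryRun(Accelerator accelerator, Action launch)
    {
        try
        {
            launch();
            accelerator.Synchronize();
            return true;
        }
        catch (CudaException ex)
        {
            Console.Error.WriteLine($"CUDA error after kernel launch: {ex.Message}");
            return false;
        }
    }

    // Trivial kernel used only to have something to launch.
    static void FillKernel(Index1D index, ArrayView<int> data)
    {
        data[index] = index;
    }

    public static void Main()
    {
        using var context = Context.CreateDefault();
        using var accelerator = context
            .GetPreferredDevice(preferCPU: false)
            .CreateAccelerator(context);

        var kernel = accelerator
            .LoadAutoGroupedStreamKernel<Index1D, ArrayView<int>>(FillKernel);
        using var buffer = accelerator.Allocate1D<int>(16);

        bool ok = TryRun(accelerator, () => kernel((int)buffer.Length, buffer.View));

        // Per the comment above: after a failure, assume this accelerator's
        // CUDA context is unusable and create a fresh one before retrying.
        Console.WriteLine(ok ? "Kernel completed." : "Kernel failed; context should be recreated.");
    }
}
```

On a machine without a CUDA device this falls back to the CPU accelerator, where no `CudaException` will ever be raised, so the catch branch only matters on real GPU hardware.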
I don't seem to be cleaning up correctly after a
Lastly, after running the test app multiple times when using
@El-Gor-do OK, I did some more investigation, and it looks like CUDA will not throw an exception on divide-by-zero. This is consistent with the exception handling rules of the IEEE 754 standard. It looks like it will return +infinity or -infinity. Other CUDA libraries have written their own code to detect divide-by-zero and abort the kernel.
If IEEE 754 says divide by zero should be +/- infinity, is there a bug in ILGPU? Divide by zero is returning -1, which is why I've been asking how I can detect that an error has occurred.
I have checked the behavior of divide by zero, and it is giving me infinity. Looking back through your example, you are using integers, not floating-point values. Since IEEE 754 only applies to floating-point operations, the +/- infinity result does not apply. I checked the behavior of divide by zero on integers, and yes, I am getting -1 too (CUDA 12 SDK, 1070 GPU). My guess is that you cannot rely on this -1 result, since divide by zero on integers is undefined, and CUDA can change its behavior at any time. It just so happens that .NET throws an exception. So in summary, no, there is no way to automatically check for divide-by-zero errors. You would need to write code to check the denominator yourself.
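A rough sketch of what checking the denominator yourself could look like: the kernel writes a per-element error flag next to each result, so the host can tell a genuine -1 from a failed division. `CheckedDivideExample`, `CheckedDivideKernel`, `Run` and the flag buffer layout are all hypothetical, made up for this illustration. This is one possible approach, not something ILGPU does for you.

```csharp
using System;
using ILGPU;
using ILGPU.Runtime;

public static class CheckedDivideExample
{
    // Kernel: guards the integer division and records a flag (1 = error)
    // instead of relying on whatever the GPU returns for x / 0.
    static void CheckedDivideKernel(
        Index1D index,
        ArrayView<int> numerators,
        ArrayView<int> denominators,
        ArrayView<int> results,
        ArrayView<int> errorFlags)
    {
        int denominator = denominators[index];
        if (denominator == 0)
        {
            errorFlags[index] = 1;
            results[index] = 0; // placeholder; the flag marks it as invalid
            return;
        }
        errorFlags[index] = 0;
        results[index] = numerators[index] / denominator;
    }

    // Runs the kernel on sample data and copies results and flags to the host.
    public static (int[] Results, int[] Errors) Run()
    {
        using var context = Context.CreateDefault();
        using var accelerator = context
            .GetPreferredDevice(preferCPU: false)
            .CreateAccelerator(context);

        var kernel = accelerator.LoadAutoGroupedStreamKernel<
            Index1D, ArrayView<int>, ArrayView<int>, ArrayView<int>, ArrayView<int>>(
            CheckedDivideKernel);

        // Element 1 yields a legitimate -1; element 2 divides by zero.
        using var numerators = accelerator.Allocate1D(new[] { 10, -1, 7 });
        using var denominators = accelerator.Allocate1D(new[] { 2, 1, 0 });
        using var results = accelerator.Allocate1D<int>(3);
        using var errorFlags = accelerator.Allocate1D<int>(3);

        kernel((int)results.Length, numerators.View, denominators.View,
            results.View, errorFlags.View);
        accelerator.Synchronize();

        return (results.GetAsArray1D(), errorFlags.GetAsArray1D());
    }

    public static void Main()
    {
        var (values, errors) = Run();
        for (int i = 0; i < values.Length; i++)
        {
            Console.WriteLine(errors[i] != 0
                ? $"index {i}: division by zero detected"
                : $"index {i}: {values[i]}");
        }
    }
}
```

A separate flag buffer costs extra memory, but it keeps every integer value available as a valid result, which a sentinel value like -1 cannot do.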
Question
If a kernel operation fails but doesn't kill my application, is it possible to detect that an error has occurred? In my sample code below, I have 2 test kernels - DivideByZeroKernel and IndexOutOfRangeKernel.
The output when running DivideByZeroKernel is below. Even though I divide by zero, the operation returned -1 instead of infinity, NaN or crashing. Is there a way to detect that -1 is an error value in the first case but a valid result in the second case?
The output when running IndexOutOfRangeKernel is below. Based on #1268 it would appear that my application will crash even if acc.Synchronize() is wrapped in a try/catch. I could use a sidecar app to monitor if my GPU app has crashed - are there any other ways to detect that a GPU kernel has crashed my app?
Environment
Additional context
No response