-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
resources: Fix hipDeviceSynchronize calls #19
base: stable
Are you sure you want to change the base?
Conversation
Corrected hipDeviceSynchronize calls to prevent kernels from exiting prematurely Change-Id: I068b6324f33d0de905477d03efa195e3068dc7c5
@powerjg this would require creating a new binary. Are you ok with merging in anyways? |
We should probably make new binaries before it's merged. Otherwise, we could easily forget. @Harshil2107 this is going to be on you :) |
Should these binaries be built on top of the changes in #22? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I haven't reviewed yet...
What does "exiting prematurely" mean? Is gem5 exiting before the kernels finish? Is the goal here to intentionally serialize the kernels?
@jebraun3 can chime in with more details, but my recollection is that we were seeing gem5 essentially treat the kernel launches as blocking, causing multiple kernels from the same stream to overlap and thus get incorrect data. So we added the hipDeviceSynch's to prevent this. |
Is this PR ready to be merged? |
@jebraun3 reminder to reply here to @abmerop , before this can/should be merged in |
@jebraun3, could you please let us know what is the status of this PR? Thank you. |
@abmerop do you want more details from @jebraun3 or is my explanation above sufficient? |
My general concern is all of these syncs are going to tank the performance so I'm not sure how much I would trust this benchmark for any performance related data after this change. But it seems to fix some problem I am not seeing, so I won't block it (if that is even possible on github) |
Why would it affect performance? I guess for the CPU part? |
The sync is essentially a barrier which will prevent the new kernel dispatch packet from being placed in the AQL queue and therefore you would be paying an extra latency for sending the packet, ringing the doorbell, etc. which would otherwise be done while the previous kernel was running |
Right, so shaderActiveTicks wouldn't be affected (I think) but the sim_ticks would? Regardless, @jebraun3 can you please add the details about what overlap you were seeing? Ultimately it likely indicates something is wrong with the barrier bit, which in theory makes the device synch's not needed ... |
I ran into this first with the Floyd-Warshall benchmark where running FW in gem5 would produce different results from running it on the actual hardware. From what I remember the result array that was produced left the vast majority of the array untouched, so most of its values were still set to their initialized values. Adding the synchs to the benchmark got gem5 to produce the same result array as the actual hardware. |
Should this be merged or not? If I don't hear back from you until the end of the day today, I will merge it. |
I now think it's OK to merge this to get the correct result. If I understand it correctly, the original code works gets correct result on hardware, new code gets correct result on hardware. But in gem5 the old code gets wrong result but new code gets correct result? If that is the case we should circle back and figure out why the original code is different. |
I believe this is correct -- on the real GPU the code works and gets the correct answer, in gem5 it gets the incorrect answer without this fix. @jebraun3 please confirm. I guess the question is: should we take this as a sign we need to investigate the barrier bit code again and try to fix it? Or should we patch this even with the shortcomings you highlighted and investigate later? |
A barrier problem was my first thought. Specifically of the AND variety. And yes I think we should look into that. |
Right, the question is: should we merge this and then fix (and revert later), or just not merge this and try to fix? My concern without merging is if people are using it now, they are not getting the correct answer. And it's unknown how long a fix would take (maybe long, maybe short). |
I did not see any problem when running in FS mode, so maybe this is only happening in SE mode? @jebraun3 what does the error look like? The application doesn't seem to print anything |
What @mattsinc said is correct, on the real GPU the code works and gets the correct answer but in gem5 it gets the incorrect answer. An error is only printed when FW is run with error checking, otherwise there is no indication of the error. When printed, the error should be a list of the locations of the mismatches followed by "WARNING: Produced incorrect results!". |
@jebraun3 It looks like FW's change is not in this commit, is that intentional? I was looking because I believe FW requires the user to manually change the output to check it, right? |
Hi, what is the status of this PR? |
This is fairly ancient at this point, but looking into it, it seems the HIP runtime (well rocCLR) is smart enough to know if Therefore my previous comment no longer applies and I am fine with this PR. |
When this is confirmed "good to go" can you please ping @Harshil2107 to rebuild these binaries and upload them? I want to make sure the source and the binaries are in-sync here. |
@jebraun3 is away on internship this semester and I'm guessing will not be able to reply too much. With the context @abmerop provided (thanks!) I think this patch is ok as is to go now, @BobbyRBruce , unless we want to switch it to be on stable instead of develop? @jebraun3 it looks like we'll need a similar patch with the FW updates, but since you are away I'd rather we integrate this then wait more time for that one benchmark. |
Anything on the stable branch needs to be compatible with the current version of gem5 (that is, gem5 v24.0). develop is used to submit changes that should only be merged into stable upon the next release of gem5 (i.e., it requires some feature in the upcoming release and would fail on 24.0). In short, go for stable if the resource works with 24.0. |
This works with both stable and develop. |
If this change works with v24.0 then I'd recommend putting it on stable but if you keep on the develop branch the only downside is you'll have to wait month until the next release when the gem5-resources develop branch is merged into the stable branch. It's not worth worrying about either way. |
Since @jebraun3 is away on internship and thus I don't think he'll be able to move it soon, unless there is an easy way to switch branches on Github here, we should just merge on develop. Is there such an easy way? |
Corrected hipDeviceSynchronize calls to prevent kernels from exiting prematurely
Change-Id: I068b6324f33d0de905477d03efa195e3068dc7c5