Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resources: Fix hipDeviceSynchronize calls #19

Open
wants to merge 1 commit into
base: stable
Choose a base branch
from

Conversation

jebraun3
Copy link
Contributor

Corrected hipDeviceSynchronize calls to prevent kernels from exiting prematurely

Change-Id: I068b6324f33d0de905477d03efa195e3068dc7c5

Corrected hipDeviceSynchronize calls to prevent kernels from exiting prematurely

Change-Id: I068b6324f33d0de905477d03efa195e3068dc7c5
@mattsinc mattsinc added the bug Something isn't working label Dec 18, 2023
mattsinc
mattsinc previously approved these changes Dec 18, 2023
@mattsinc
Copy link
Contributor

@powerjg this would require creating a new binary. Are you ok with merging in anyways?

@powerjg
Copy link
Contributor

powerjg commented Feb 27, 2024

We should probably make new binaries before it's merged. Otherwise, we could easily forget.

@Harshil2107 this is going to be on you :)

@Harshil2107
Copy link
Contributor

Should these binaries be built on top of the changes in #22?

@mattsinc
Copy link
Contributor

mattsinc commented Mar 1, 2024

Should these binaries be built on top of the changes in #22?

Should these binaries be built on top of the changes in #22?

This is separate from #22 but if it's easier for you to update binaries once for both, that's fine with me

@abmerop
Copy link
Member

abmerop commented Mar 1, 2024

Should these binaries be built on top of the changes in #22?

Just to clarify, #22 is only adding links to the prebuilts so as long as they are placed in the same link locations then no changes are required in #22.

Copy link
Member

@abmerop abmerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I haven't reviewed yet...

What does "exiting prematurely" mean? Is gem5 exiting before the kernels finish? Is the goal here to intentionally serialize the kernels?

@mattsinc
Copy link
Contributor

mattsinc commented Mar 1, 2024

I guess I haven't reviewed yet...

What does "exiting prematurely" mean? Is gem5 exiting before the kernels finish? Is the goal here to intentionally serialize the kernels?

@jebraun3 can chime in with more details, but my recollection is that we were seeing gem5 essentially treat the kernel launches as blocking, causing multiple kernels from the same stream to overlap and thus get incorrect data. So we added the hipDeviceSynch's to prevent this.

@Harshil2107
Copy link
Contributor

Is this PR ready to be merged?

@mattsinc
Copy link
Contributor

mattsinc commented Mar 8, 2024

I guess I haven't reviewed yet...
What does "exiting prematurely" mean? Is gem5 exiting before the kernels finish? Is the goal here to intentionally serialize the kernels?

@jebraun3 can chime in with more details, but my recollection is that we were seeing gem5 essentially treat the kernel launches as blocking, causing multiple kernels from the same stream to overlap and thus get incorrect data. So we added the hipDeviceSynch's to prevent this.

@jebraun3 reminder to reply here to @abmerop , before this can/should be merged in

@ivanaamit
Copy link
Contributor

@jebraun3, could you please let us know what is the status of this PR? Thank you.

@mattsinc
Copy link
Contributor

I guess I haven't reviewed yet...
What does "exiting prematurely" mean? Is gem5 exiting before the kernels finish? Is the goal here to intentionally serialize the kernels?

@jebraun3 can chime in with more details, but my recollection is that we were seeing gem5 essentially treat the kernel launches as blocking, causing multiple kernels from the same stream to overlap and thus get incorrect data. So we added the hipDeviceSynch's to prevent this.

@jebraun3 reminder to reply here to @abmerop , before this can/should be merged in

@abmerop do you want more details from @jebraun3 or is my explanation above sufficient?

@abmerop
Copy link
Member

abmerop commented Mar 22, 2024

My general concern is all of these syncs are going to tank the performance so I'm not sure how much I would trust this benchmark for any performance related data after this change. But it seems to fix some problem I am not seeing, so I won't block it (if that is even possible on github)

@mattsinc
Copy link
Contributor

My general concern is all of these syncs are going to tank the performance so I'm not sure how much I would trust this benchmark for any performance related data after this change. But it seems to fix some problem I am not seeing, so I won't block it (if that is even possible on github)

Why would it affect performance? I guess for the CPU part?

@abmerop
Copy link
Member

abmerop commented Mar 22, 2024

My general concern is all of these syncs are going to tank the performance so I'm not sure how much I would trust this benchmark for any performance related data after this change. But it seems to fix some problem I am not seeing, so I won't block it (if that is even possible on github)

Why would it affect performance? I guess for the CPU part?

The sync is essentially a barrier which will prevent the new kernel dispatch packet from being placed in the AQL queue and therefore you would be paying an extra latency for sending the packet, ringing the doorbell, etc. which would otherwise be done while the previous kernel was running

@mattsinc
Copy link
Contributor

My general concern is all of these syncs are going to tank the performance so I'm not sure how much I would trust this benchmark for any performance related data after this change. But it seems to fix some problem I am not seeing, so I won't block it (if that is even possible on github)

Why would it affect performance? I guess for the CPU part?

The sync is essentially a barrier which will prevent the new kernel dispatch packet from being placed in the AQL queue and therefore you would be paying an extra latency for sending the packet, ringing the doorbell, etc. which would otherwise be done while the previous kernel was running

Right, so shaderActiveTicks wouldn't be affected (I think) but the sim_ticks would?

Regardless, @jebraun3 can you please add the details about what overlap you were seeing? Ultimately it likely indicates something is wrong with the barrier bit, which in theory makes the device synch's not needed ...

@jebraun3
Copy link
Contributor Author

I ran into this first with the Floyd-Warshall benchmark where running FW in gem5 would produce different results from running it on the actual hardware. From what I remember the result array that was produced left the vast majority of the array untouched, so most of its values were still set to their initialized values. Adding the synchs to the benchmark got gem5 to produce the same result array as the actual hardware.

@ivanaamit
Copy link
Contributor

Should this be merged or not? If I don't hear back from you until the end of the day today, I will merge it.

@mattsinc
Copy link
Contributor

mattsinc commented Mar 28, 2024

Should this be merged or not? If I don't hear back from you until the end of the day today, I will merge it.

No, please do not merge until @jebraun3 , @abmerop , and myself come to a conclusion

@abmerop
Copy link
Member

abmerop commented Mar 29, 2024

I now think it's OK to merge this to get the correct result.

If I understand it correctly, the original code works gets correct result on hardware, new code gets correct result on hardware. But in gem5 the old code gets wrong result but new code gets correct result?

If that is the case we should circle back and figure out why the original code is different.

@mattsinc
Copy link
Contributor

I now think it's OK to merge this to get the correct result.

If I understand it correctly, the original code works gets correct result on hardware, new code gets correct result on hardware. But in gem5 the old code gets wrong result but new code gets correct result?

If that is the case we should circle back and figure out why the original code is different.

I believe this is correct -- on the real GPU the code works and gets the correct answer, in gem5 it gets the incorrect answer without this fix. @jebraun3 please confirm.

I guess the question is: should we take this as a sign we need to investigate the barrier bit code again and try to fix it? Or should we patch this even with the shortcomings you highlighted and investigate later?

@abmerop
Copy link
Member

abmerop commented Mar 30, 2024

I now think it's OK to merge this to get the correct result.
If I understand it correctly, the original code works gets correct result on hardware, new code gets correct result on hardware. But in gem5 the old code gets wrong result but new code gets correct result?
If that is the case we should circle back and figure out why the original code is different.

I believe this is correct -- on the real GPU the code works and gets the correct answer, in gem5 it gets the incorrect answer without this fix. @jebraun3 please confirm.

I guess the question is: should we take this as a sign we need to investigate the barrier bit code again and try to fix it? Or should we patch this even with the shortcomings you highlighted and investigate later?

A barrier problem was my first thought. Specifically of the AND variety. And yes I think we should look into that.

@mattsinc
Copy link
Contributor

I now think it's OK to merge this to get the correct result.
If I understand it correctly, the original code works gets correct result on hardware, new code gets correct result on hardware. But in gem5 the old code gets wrong result but new code gets correct result?
If that is the case we should circle back and figure out why the original code is different.

I believe this is correct -- on the real GPU the code works and gets the correct answer, in gem5 it gets the incorrect answer without this fix. @jebraun3 please confirm.
I guess the question is: should we take this as a sign we need to investigate the barrier bit code again and try to fix it? Or should we patch this even with the shortcomings you highlighted and investigate later?

A barrier problem was my first thought. Specifically of the AND variety. And yes I think we should look into that.

Right, the question is: should we merge this and then fix (and revert later), or just not merge this and try to fix? My concern without merging is if people are using it now, they are not getting the correct answer. And it's unknown how long a fix would take (maybe long, maybe short).

@abmerop
Copy link
Member

abmerop commented Apr 1, 2024

I did not see any problem when running in FS mode, so maybe this is only happening in SE mode? @jebraun3 what does the error look like? The application doesn't seem to print anything

@jebraun3
Copy link
Contributor Author

What @mattsinc said is correct, on the real GPU the code works and gets the correct answer but in gem5 it gets the incorrect answer. An error is only printed when FW is run with error checking, otherwise there is no indication of the error. When printed, the error should be a list of the locations of the mismatches followed by "WARNING: Produced incorrect results!".

@mattsinc
Copy link
Contributor

What @mattsinc said is correct, on the real GPU the code works and gets the correct answer but in gem5 it gets the incorrect answer. An error is only printed when FW is run with error checking, otherwise there is no indication of the error. When printed, the error should be a list of the locations of the mismatches followed by "WARNING: Produced incorrect results!".

@jebraun3 It looks like FW's change is not in this commit, is that intentional? I was looking because I believe FW requires the user to manually change the output to check it, right?

@ivanaamit
Copy link
Contributor

Hi, what is the status of this PR?

@mattsinc
Copy link
Contributor

Hi, what is the status of this PR?

Please just leave it open, I need to talk with @jebraun3 and @abmerop about it

@abmerop
Copy link
Member

abmerop commented Oct 20, 2024

This is fairly ancient at this point, but looking into it, it seems the HIP runtime (well rocCLR) is smart enough to know if hipDeviceSynchronize is necessary or not for a given device.

Therefore my previous comment no longer applies and I am fine with this PR.

@BobbyRBruce
Copy link
Member

When this is confirmed "good to go" can you please ping @Harshil2107 to rebuild these binaries and upload them? I want to make sure the source and the binaries are in-sync here.

@mattsinc
Copy link
Contributor

When this is confirmed "good to go" can you please ping @Harshil2107 to rebuild these binaries and upload them? I want to make sure the source and the binaries are in-sync here.

@jebraun3 is away on internship this semester and I'm guessing will not be able to reply too much. With the context @abmerop provided (thanks!) I think this patch is ok as is to go now, @BobbyRBruce , unless we want to switch it to be on stable instead of develop?

@jebraun3 it looks like we'll need a similar patch with the FW updates, but since you are away I'd rather we integrate this then wait more time for that one benchmark.

@BobbyRBruce
Copy link
Member

Anything on the stable branch needs to be compatible with the current version of gem5 (that is, gem5 v24.0). develop is used to submit changes that should only be merged into stable upon the next release of gem5 (i.e., it requires some feature in the upcoming release and would fail on 24.0).

In short, go for stable if the resource works with 24.0.

@mattsinc
Copy link
Contributor

Anything on the stable branch needs to be compatible with the current version of gem5 (that is, gem5 v24.0). develop is used to submit changes that should only be merged into stable upon the next release of gem5 (i.e., it requires some feature in the upcoming release and would fail on 24.0).

In short, go for stable if the resource works with 24.0.

This works with both stable and develop.

@BobbyRBruce
Copy link
Member

If this change works with v24.0 then I'd recommend putting it on stable but if you keep on the develop branch the only downside is you'll have to wait month until the next release when the gem5-resources develop branch is merged into the stable branch.

It's not worth worrying about either way.

@mattsinc
Copy link
Contributor

If this change works with v24.0 then I'd recommend putting it on stable but if you keep on the develop branch the only downside is you'll have to wait month until the next release when the gem5-resources develop branch is merged into the stable branch.

It's not worth worrying about either way.

Since @jebraun3 is away on internship and thus I don't think he'll be able to move it soon, unless there is an easy way to switch branches on Github here, we should just merge on develop. Is there such an easy way?

@jebraun3 jebraun3 changed the base branch from develop to stable October 31, 2024 00:21
@jebraun3 jebraun3 dismissed mattsinc’s stale review October 31, 2024 00:21

The base branch was changed.

@jebraun3
Copy link
Contributor Author

jebraun3 commented Oct 31, 2024

I updated the PR to merge into stable.

@jebraun3 it looks like we'll need a similar patch with the FW updates, but since you are away I'd rather we integrate this then wait more time for that one benchmark.

It looks like FW already has this update from #5.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants