Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add skew-symmetric BLAS operations #805

Open
wants to merge 11 commits into
base: master
Choose a base branch
from
Open

Add skew-symmetric BLAS operations #805

wants to merge 11 commits into from

Conversation

devinamatthews
Copy link
Member

@devinamatthews devinamatthews commented Apr 26, 2024

This PR adds a number of level-2 and level-3 skew-symmetric (and skew-hermitian) BLAS operations, defining the essential operations of a "Skew-BLAS" interface. These operations have been added as full 1st-class citizens of the BLIS API complete with testsuite and mixed-precision/mixed-domain support (level-3 only).

This operation negates a scalar (both real and imaginary parts).
- Add `bli_?negs/bli_?negris` to negate a scalar.
- Add `bli_?setr0s` to zero out only the real part of a complex scalar.
- Add (void) to silence unused variable warnings in several level-0 macros.
Add `mkskewsymm` and `mkskewherm` operations to explicit skew-symmetrize or skew-hermitize a matrix. For a skew-symmetric matrix, the diagonal is explicitly set to zero, while for a skew-hermitian matrix the real part of the diagonal is set to zero.
Add `BLIS_SKEW_SYMMETRIC` and `BLIS_SKEW_HERMITIAN` matrix structures along with associated help functions and macros. Note that this requires increasing the number of bits used to represent a `struc_t` in the `obj_t::info` member. A compile-time check has also been added to prevent against accidental bit overflow in the future.
This operation sets only the real part of a matrix diagonal to the given value.
The function signature for dotaxpyf has been changed to allow different `alpha` values for the dot and axpy sub-problems. This is needed to support skew-symmetric operations which differ in more than just conjugation of A and A^T.
Add `skmv` (skew-symmetric matrix times vector), `shmv` (skew-hermitian matrix times vector), `skr2` (skew-symmetric rank-2 update), and `shr2` (skew-hermitian rank-2 update) operations. Note that a rank-1 skew-symmetric update is not possible, and a rank-1 skew-hermitian update is not particularly useful.
The reference packing kernels have been updated to support skew-symmetric and skew-hermitian matrix structures. No updates to the dense reference packing kernel (`bli_?packm_ckx_<arch>_ref`) or to any optimized packing kernels, since `bli_?packm_struc_cxk` handles the negation of the unstored region by modifying `kappa`.
Add `skmm` (skew-symmetric matrix times dense matrix), `shmm` (skew-hermitian matrix times dense matrix), `skr2k` (skew-symmetric rank-2k update), and `shr2k` (skew-hermitian rank-2k update) operations. Note that a rank-k skew-symmetric update is not possible, and a rank-k skew-hermitian update is not particularly useful.
@devinamatthews
Copy link
Member Author

@myeh01 @nick-knight @Aaron-Hutchinson can the SiFive team please review commit b986782? I had to delve into the RISC-V assembly there and I'm only ~80% sure I did it right.

@devinamatthews
Copy link
Member Author

@fgvanzee again the Travis CI build failed to trigger...

@fgvanzee
Copy link
Member

@fgvanzee again the Travis CI build failed to trigger...

I don't remember if there was anything we could do to fix it on our end. Might be that we just have to wait and then make a dummy commit to try to trigger again?

@devinamatthews
Copy link
Member Author

I don't remember either but it is annoying

@devinamatthews
Copy link
Member Author

There we go

@nick-knight
Copy link

@devinamatthews Confirming I got your message. It looks like the register allocation in dotxaxpyf changed: thanks for taking a stab at it. I think @myeh01 wrote this one, I'll ask him to review it. (Michael, it looks like what's changed is the microkernel now has separate alpha values for the "dot" and "axpy" parts.)

@devinamatthews
Copy link
Member Author

Travis CI failed for x280 so I guess I did do something wrong.

@myeh01
Copy link

myeh01 commented Apr 27, 2024

Running the testsuite, it looks like s and d are passing while c and z are failing. For c and z, when I change the register allocation from fa4, fa5 to fa0, fa1 for alphaw, it passes the dotxaxpyf tests locally, so I think Devin modified the code correctly. My guess is the inline assembly I wrote is not generating the code I want it to (maybe the compiler is overwriting some floating-point registers in between asm blocks). I'll need some time to study the objdump to see what's going on.

@myeh01
Copy link

myeh01 commented Apr 29, 2024

After looking at the objdump, it looks like the compiler is using fa4 and fa5 in some branches involving floating point comparisons. To prevent this issue from reappearing, I would like to go back through all the inline assembly files I wrote and replace any manually allocated floating point (and integer) registers with C variables. I should be able to get it done this week. What would be the best way to get these changes merged into the repo? Should I submit a PR to this branch or master?

@nick-knight
Copy link

nick-knight commented Apr 29, 2024

@myeh01 A quicker fix might be to add clobbers. We should be using these anyway whenever we use inline asm with explicit register allocation, whether X-, F-, or V-registers.

Going the other direction, I think we might be better off using generic C for all the scalar stuff, and only using inline asm for the vector stuff (when intrinsics don't suffice). I don't think we'll lose much in performance, and it would make the code much more maintainable and retargetable.

@myeh01
Copy link

myeh01 commented Apr 29, 2024

@nick-knight I originally tried just adding the output register to the clobber list of any floating-point load, e.g.

__asm__(FLT_LOAD "fa5, %1(%0)" : : "r"(alphaw), "I"(FLT_SIZE) : "fa5");

But the compiler still uses fa5 for floating-point comparisons. Does the clobber only apply to that block of inline asm? Or maybe I did this incorrectly and should add clobbers in other places as well. There's also this, but then we would have to use C variables anyways.

Edit: Yeah, I think replacing the scalar stuff with generic C would be more robust. I started doing this for some parts of cdotxaxpyf and it wasn't too hard.

Edit 2: After looking through some of the code again, I think I would also like to replace some inline asm code with intrinsics.

@nick-knight
Copy link

Does the clobber only apply to that block of inline asm?

Correct. If you don't want the compiler to overwrite fa5 between this asm statement and a subsequent one, you'll have to merge them into the same statement. This tends to snowball, necessitating rewriting conditionals and loops in assembly, etc.

There's also this, but then we would have to use C variables anyways.

Correct.

I think replacing the scalar stuff with generic C would be more robust.

I agree.

Our code still has lurking risks related to our explicit allocation of V-registers: we are trusting the compiler not to generate any vector code between each pair of asm statements with a RAW dependence through a V-register. Hardening this will probably snowball in the manner described above, so I don't propose we attempt to tackle it here.

@myeh01
Copy link

myeh01 commented Apr 29, 2024

@devinamatthews How would you like to proceed? There are a few short-term solutions we discussed above. Longer-term, I'd like to rewrite the inline assembly files to be more robust (probably using intrinsics where it won't significantly impact performance). Please let me know what I can do to help.

@nick-knight
Copy link

nick-knight commented Apr 29, 2024

One perspective is that this ukernel interface change exposed a bug in SiFive's implementation of the legacy ukernel interface. To proceed, the defective ukernel implementation could be deleted in this PR --- reverting to a generic implementation --- and then reintroduced, upgraded and corrected, in a subsequent PR. This might be the cleanest way forward.

@devinamatthews
Copy link
Member Author

I'm not in a huge rush to merge. If it takes say a month or less to fix it properly then I can wait. Otherwise yes we could revert to generic and fix later. This wouldn't require deleting anything, just commenting out the kernel registration.

@myeh01
Copy link

myeh01 commented Apr 30, 2024

Mind if we sync up in a week or two? I'll start working on it this week and hopefully by then I'll have a sense of how much more time it would take.

@devinamatthews
Copy link
Member Author

Sure.

@myeh01
Copy link

myeh01 commented May 10, 2024

@devinamatthews I'm steadily working through cleaning up all the kernels, but I don't think I'll be able to finish it in the next two weeks. I'm also trying to balance this work with other projects I need to work on, so it may take a few more weeks. It may be best to follow Nick's suggestion and temporarily disable the sifive_x280 config failing microkernel (sorry, misread Nick's comment form earlier) until the clean up is complete.

@nick-knight
Copy link

It may be best to follow Nick's suggestion and temporarily disable the sifive_x280 config until the clean up is complete.

I don't propose disabling the whole configuration, just removing the one ukernel that's causing issues. IIUC, this will cause BLIS to default to a generic/reference implementation.

@devinamatthews
Copy link
Member Author

@myeh01 We've decided not to include this PR in the next release so there's not much time pressure.

@myeh01
Copy link

myeh01 commented Oct 2, 2024

@devinamatthews I opened a PR (#822) converting all our inline assembly to intrinsics, not sure if you've seen it yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants