Test cota/hardfloat-v5 #20

JayFoxRox · 2018-09-11T12:42:16Z

I've tested hardfloat-v5 on Halo, and I think it's.. not beneficial (for us) 😢

It looks like a great feature, but it doesn't seem to affect us yet, because it's primarily turned on for 64-bit math, which we have almost none of. We can force it to be more active by setting QEMU_HARDFLOAT_2F32_USE_FP to 1 (I assume at the cost of dangerous precision loss), but even then I don't see much difference on my machine.

I'll still wait for this feature and monitor it further. I'm sure it will cause great speedups for us eventually - I really hope it gets integrated into unicorn too.

Used this to rebase (I think):

git checkout cota/hardfloat-v5
git rebase --onto test-hardfloat-v5 qemu/master
git checkout test-hardfloat-v5
git merge <resulting commit from rebase --onto>

These are BSD-licensed so we can add them as submodules.

Signed-off-by: Emilio G. Cota <[email protected]>

By leveraging berkeley's softfloat and testfloat.

fp-test.c is derived from testfloat's testsoftfloat.c. To ease the tracking of upstream changes to the latter file, fp-test.c keeps the original camel-case variable naming, and includes most new code via wrap.inc.c.

Most changes to the original code are simple style changes, although a couple of not-so-subtle modifications have been made (noted with XXX in the code), namely:

- We do not test ROUND_ODD, since not all of our primitives support it (e.g. fp16)
- Do not test !exact in round-to-integer, since it is not implemented in QEMU (this flag was added to softfloat v3).

Signed-off-by: Emilio G. Cota <[email protected]>

This paves the way for upcoming work.

Cc: Bastian Koppelmann <[email protected]>
Reviewed-by: Bastian Koppelmann <[email protected]>
Reviewed-by: Alex Bennée <[email protected]>
Signed-off-by: Emilio G. Cota <[email protected]>

Cc: Bastian Koppelmann <[email protected]>
Reviewed-by: Bastian Koppelmann <[email protected]>
Signed-off-by: Emilio G. Cota <[email protected]>

…nchmarks

This will allow us to measure the performance impact of FP emulation optimizations. Note that we can measure both the direct impact on the softfloat functions (with "-t soft") and the impact on an emulated workload (call with "-t host" and run under qemu user-mode).

Signed-off-by: Emilio G. Cota <[email protected]>
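For readers unfamiliar with how an MFlops figure like the ones quoted in this thread can be obtained, here is a minimal sketch of a timing loop. It is not fp-bench's code: the names `sk_mflops`, `sk_now_seconds` and `sk_bench_add` are hypothetical, and the loop simply times host float additions with a POSIX monotonic clock.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime */
#include <assert.h>
#include <time.h>

/* Convert an operation count and an elapsed time into MFlops. */
static double sk_mflops(unsigned long long ops, double seconds)
{
    assert(seconds > 0.0);
    return (double)ops / seconds / 1e6;
}

/* Monotonic wall-clock time in seconds. */
static double sk_now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Time n_ops dependent float additions and report MFlops. */
static double sk_bench_add(unsigned long long n_ops)
{
    /* volatile keeps the compiler from deleting the loop */
    volatile float acc = 0.0f;
    double t0 = sk_now_seconds();
    for (unsigned long long i = 0; i < n_ops; i++) {
        acc = acc + 1.0f;
    }
    double elapsed = sk_now_seconds() - t0;
    if (elapsed <= 0.0) {
        return 0.0;  /* clock too coarse for this run length */
    }
    return sk_mflops(n_ops, elapsed);
}
```

Running a loop like this natively versus under qemu user-mode would roughly correspond to the "-t host" use case described above; measuring the softfloat functions directly ("-t soft") would instead call them inside the loop.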

glibc >= 2.25 defines canonicalize in commit eaf5ad0 (Add canonicalize, canonicalizef, canonicalizel., 2016-10-26).

Given that we'll be including <math.h> soon, prepare for this by prefixing our canonicalize() with sf_ to avoid clashing with the libc's canonicalize().

Cc: Bastian Koppelmann <[email protected]>
Reported-by: Bastian Koppelmann <[email protected]>
Tested-by: Bastian Koppelmann <[email protected]>
Signed-off-by: Emilio G. Cota <[email protected]>

These will gain some users very soon.

Signed-off-by: Emilio G. Cota <[email protected]>
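The series adds classification helpers named float{32,64}_is_{de,}normal and float{32,64}_is_zero_or_normal. As a rough illustration of what such predicates check, here is a sketch for the 32-bit case working on raw IEEE 754 binary32 bit patterns. The `sk_` names are hypothetical and this is not the code from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* binary32 layout: 1 sign bit, 8 exponent bits, 23 fraction bits. */
static uint32_t sk_f32_exp(uint32_t bits)  { return (bits >> 23) & 0xffu; }
static uint32_t sk_f32_frac(uint32_t bits) { return bits & 0x007fffffu; }

/* +0 and -0: everything but the sign bit is clear. */
static bool sk_f32_is_zero(uint32_t bits)
{
    return (bits & 0x7fffffffu) == 0;
}

/* Denormal (subnormal): zero exponent, nonzero fraction. */
static bool sk_f32_is_denormal(uint32_t bits)
{
    return sk_f32_exp(bits) == 0 && sk_f32_frac(bits) != 0;
}

/* Normal: exponent is neither all zeros nor all ones. */
static bool sk_f32_is_normal(uint32_t bits)
{
    uint32_t e = sk_f32_exp(bits);
    return e != 0 && e != 0xffu;
}

static bool sk_f32_is_zero_or_normal(uint32_t bits)
{
    return sk_f32_is_normal(bits) || sk_f32_is_zero(bits);
}

/* Small self-check over well-known bit patterns. */
static void sk_f32_classify_selfcheck(void)
{
    assert(sk_f32_is_normal(0x3f800000u));          /* 1.0f */
    assert(sk_f32_is_denormal(0x00000001u));        /* smallest denormal */
    assert(sk_f32_is_zero_or_normal(0x80000000u));  /* -0.0f */
    assert(!sk_f32_is_zero_or_normal(0x7f800000u)); /* +inf */
}
```

A "zero or normal" test is a cheap way to rule out denormals, infinities and NaNs in one go, which is presumably why it is useful as a precondition before handing operands to the host FPU.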

This patch paves the way for leveraging the host FPU for a subset of guest FP operations. For most guest workloads (e.g. FP flags aren't ever cleared, inexact occurs often and rounding is set to the default [to nearest]) this will yield sizable performance speedups.

The approach followed here avoids checking the FP exception flags register. See the added comment for details.

This assumes that QEMU is running on an IEEE754-compliant FPU and that the rounding is set to the default (to nearest). The implementation-dependent specifics of the FPU should not matter; things like tininess detection and snan representation are still dealt with in soft-fp. However, this approach will break on most hosts if we compile QEMU with flags such as -ffast-math. We control the flags so this should be easy to enforce though.

This patch just adds common code. Some operations will be migrated to hardfloat in subsequent patches to ease bisection.

Note: some architectures (at least PPC, there might be others) clear the status flags passed to softfloat before most FP operations. This precludes the use of hardfloat, so to avoid introducing a performance regression for those targets, we add a flag to disable hardfloat. In the long run though it would be good to fix the targets so that at least the inexact flag passed to softfloat is indeed sticky.

Signed-off-by: Emilio G. Cota <[email protected]>

Performance results (single and double precision) for fp-bench:

1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
  add-single: 135.07 MFlops
  add-double: 131.60 MFlops
  sub-single: 130.04 MFlops
  sub-double: 133.01 MFlops
- after:
  add-single: 443.04 MFlops
  add-double: 301.95 MFlops
  sub-single: 411.36 MFlops
  sub-double: 293.15 MFlops

2. ARM Aarch64 A57 @ 2.4GHz
- before:
  add-single: 44.79 MFlops
  add-double: 49.20 MFlops
  sub-single: 44.55 MFlops
  sub-double: 49.06 MFlops
- after:
  add-single: 93.28 MFlops
  add-double: 88.27 MFlops
  sub-single: 91.47 MFlops
  sub-double: 88.27 MFlops

3. IBM POWER8E @ 2.1 GHz
- before:
  add-single: 72.59 MFlops
  add-double: 72.27 MFlops
  sub-single: 75.33 MFlops
  sub-double: 70.54 MFlops
- after:
  add-single: 112.95 MFlops
  add-double: 201.11 MFlops
  sub-single: 116.80 MFlops
  sub-double: 188.72 MFlops

Note that the IBM and ARM machines benefit from having HARDFLOAT_2F{32,64}_USE_FP set to 0. Otherwise their performance can suffer significantly:
- IBM Power8:
  add-single: [1] 54.94 vs [0] 116.37 MFlops
  add-double: [1] 58.92 vs [0] 201.44 MFlops
- Aarch64 A57:
  add-single: [1] 80.72 vs [0] 93.24 MFlops
  add-double: [1] 82.10 vs [0] 88.18 MFlops

On the Intel machine, having 2F64 set to 1 pays off, but it doesn't for 2F32:
- Intel i7-6700K:
  add-single: [1] 285.79 vs [0] 426.70 MFlops
  add-double: [1] 302.15 vs [0] 278.82 MFlops

Signed-off-by: Emilio G. Cota <[email protected]>

Performance results for fp-bench:

1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
  mul-single: 126.91 MFlops
  mul-double: 118.28 MFlops
- after:
  mul-single: 258.02 MFlops
  mul-double: 197.96 MFlops

2. ARM Aarch64 A57 @ 2.4GHz
- before:
  mul-single: 37.42 MFlops
  mul-double: 38.77 MFlops
- after:
  mul-single: 73.41 MFlops
  mul-double: 76.93 MFlops

3. IBM POWER8E @ 2.1 GHz
- before:
  mul-single: 58.40 MFlops
  mul-double: 59.33 MFlops
- after:
  mul-single: 60.25 MFlops
  mul-double: 94.79 MFlops

Signed-off-by: Emilio G. Cota <[email protected]>

Performance results for fp-bench:

1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
  div-single: 34.84 MFlops
  div-double: 34.04 MFlops
- after:
  div-single: 275.23 MFlops
  div-double: 216.38 MFlops

2. ARM Aarch64 A57 @ 2.4GHz
- before:
  div-single: 9.33 MFlops
  div-double: 9.30 MFlops
- after:
  div-single: 51.55 MFlops
  div-double: 15.09 MFlops

3. IBM POWER8E @ 2.1 GHz
- before:
  div-single: 25.65 MFlops
  div-double: 24.91 MFlops
- after:
  div-single: 96.83 MFlops
  div-double: 31.01 MFlops

Here setting 2F64_USE_FP to 1 pays off for x86_64:
  [1] 215.97 vs [0] 62.15 MFlops

Signed-off-by: Emilio G. Cota <[email protected]>

Performance results for fp-bench:

1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
  fma-single: 74.73 MFlops
  fma-double: 74.54 MFlops
- after:
  fma-single: 203.37 MFlops
  fma-double: 169.37 MFlops

2. ARM Aarch64 A57 @ 2.4GHz
- before:
  fma-single: 23.24 MFlops
  fma-double: 23.70 MFlops
- after:
  fma-single: 66.14 MFlops
  fma-double: 63.10 MFlops

3. IBM POWER8E @ 2.1 GHz
- before:
  fma-single: 37.26 MFlops
  fma-double: 37.29 MFlops
- after:
  fma-single: 48.90 MFlops
  fma-double: 59.51 MFlops

Here having 3F64_USE_FP set to 1 pays off for x86_64:
  [1] 170.15 vs [0] 153.12 MFlops

Signed-off-by: Emilio G. Cota <[email protected]>

Performance results for fp-bench:

1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
  sqrt-single: 43.27 MFlops
  sqrt-double: 24.81 MFlops
- after:
  sqrt-single: 297.94 MFlops
  sqrt-double: 210.46 MFlops

2. ARM Aarch64 A57 @ 2.4GHz
- before:
  sqrt-single: 12.41 MFlops
  sqrt-double: 6.22 MFlops
- after:
  sqrt-single: 55.58 MFlops
  sqrt-double: 40.63 MFlops

3. IBM POWER8E @ 2.1 GHz
- before:
  sqrt-single: 17.01 MFlops
  sqrt-double: 9.61 MFlops
- after:
  sqrt-single: 104.17 MFlops
  sqrt-double: 133.32 MFlops

Here none of the machines got faster from enabling USE_FP. For instance, on x86_64 sqrt is 23% slower for single precision with it enabled, and 17% slower for double precision.

Signed-off-by: Emilio G. Cota <[email protected]>


Performance results for fp-bench:

1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- before:
  cmp-single: 113.01 MFlops
  cmp-double: 115.54 MFlops
- after:
  cmp-single: 527.83 MFlops
  cmp-double: 457.21 MFlops

2. ARM Aarch64 A57 @ 2.4GHz
- before:
  cmp-single: 39.32 MFlops
  cmp-double: 39.80 MFlops
- after:
  cmp-single: 162.74 MFlops
  cmp-double: 167.08 MFlops

3. IBM POWER8E @ 2.1 GHz
- before:
  cmp-single: 60.81 MFlops
  cmp-double: 62.76 MFlops
- after:
  cmp-single: 235.39 MFlops
  cmp-double: 283.44 MFlops

Here using float{32,64}_is_any_nan is faster than using isnan for all machines. On x86_64 the perf difference is just a few percentage points, but on aarch64 we go from 117/119 to 164/169 MFlops for single/double precision, respectively.

Aggregate performance improvement for the last few patches:
[ all charts in png: https://imgur.com/a/4yV8p ]

1. Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

[ASCII chart: qemu-aarch64 NBench score; higher is better. Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz. x axis: FOURIER, NEURAL NET, LU DECOMPOSITION, gmean.]

[ASCII chart: qemu-aarch64 SPEC06fp (test set) speedup over QEMU 4c2c101. Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz. Error bars: 95% confidence interval. x axis: SPEC06fp benchmarks and their geomean.]

2. Host: ARM Aarch64 A57 @ 2.4GHz

[ASCII chart: qemu-aarch64 NBench score; higher is better. Host: Applied Micro X-Gene, Aarch64 A57 @ 2.4 GHz. x axis: FOURIER, NEURAL NET, LU DECOMPOSITION, gmean.]

Note that by not inlining the soft-fp primitives we end up with a smaller softfloat.o; in particular, see the difference for the softfloat.o built for fp-bench:

- before this series:
     text    data     bss     dec     hex filename
   103235       0       0  103235   19343 softfloat.o
- after:
     text    data     bss     dec     hex filename
    93369       0       0   93369   16cb9 softfloat.o

Signed-off-by: Emilio G. Cota <[email protected]>
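The cmp commit message prefers float{32,64}_is_any_nan over the libc isnan. As an illustration of why a bit-pattern test can be cheap, here is a sketch of NaN detection on raw IEEE 754 encodings, assuming the host `float` is binary32. The `sk_` names are hypothetical; this is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a host float as its 32-bit encoding. */
static uint32_t sk_f32_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* NaN: exponent all ones and nonzero fraction, i.e. the magnitude
 * bits compare greater than the encoding of infinity. */
static bool sk_f32_is_any_nan(uint32_t bits)
{
    return (bits & 0x7fffffffu) > 0x7f800000u;
}

static bool sk_f64_is_any_nan(uint64_t bits)
{
    return (bits & 0x7fffffffffffffffull) > 0x7ff0000000000000ull;
}

/* Less-than that reports unordered operands instead of consulting
 * the host FP flags. */
static bool sk_f32_lt(float a, float b, bool *unordered)
{
    assert(unordered != NULL);
    *unordered = sk_f32_is_any_nan(sk_f32_bits(a)) ||
                 sk_f32_is_any_nan(sk_f32_bits(b));
    return !*unordered && a < b;
}
```

The check is a mask and an integer compare, with no FP instruction involved, which may be part of why it compared favourably on the machines measured above.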

This allows us to test code paths that depend on certain FP flags being set.

Note: we're pulling in a commit that is not in upstream testfloat.

Signed-off-by: Emilio G. Cota <[email protected]>

JayFoxRox · 2018-12-26T09:23:53Z

Looks like it reached upstream: qemu/qemu@ec3c927

It didn't make QEMU 3.1 in time; the next release will probably be 4.0, whose development started on 12 December 2018.

cota added 15 commits September 11, 2018 14:19

gitmodules: add berkeley's softfloat + testfloat version 3

a34fbf3


softfloat: add float{32,64}_is_{de,}normal

94efa2a


target/tricore: use float32_is_denormal

c5cc941


softfloat: add float{32,64}_is_zero_or_normal

2e8e207


fp-test: add -flags to set initial flags

41cdcd2


JayFoxRox mentioned this pull request Feb 8, 2019

Priority list (JayFoxRox) #8

Open

JayFoxRox added the experiments label Feb 12, 2019
