Optimize BOLT with PGO and/or BOLT itself #63245

zamazan4ik · 2023-06-10T23:42:38Z

Hi!

Did anyone try to optimize BOLT's performance with LLVM PGO and/or BOLT itself? Is it worth it? I guess it could bring some performance improvement to BOLT as well, since BOLT takes quite a long time on large profiles. If it brings benefits, it would be great to see such an optimization in the existing LLVM CMake scripts infrastructure.

I have carefully read the LLVM docs about PGO and am not sure that BOLT is covered right now. If it is, it would be great to mention it explicitly. If you could share PGO results for other LLVM-based tooling like LLD, that would be awesome!

llvmbot · 2023-06-11T02:35:14Z

@llvm/issue-subscribers-bolt

ms178 · 2023-06-11T13:26:17Z

First, please keep in mind that BOLT is a post-link optimizer by design: it optimizes already-linked binaries. Apart from some projects that have incorporated BOLT into their build process, I know of various build scripts that build LLVM with a PGO + BOLT workflow. You might want to study the build scripts below, as the LLVM docs are pretty slim on the details.

https://github.com/ClangBuiltLinux/tc-build
https://github.com/ptr1337/llvm-bolt-scripts (or my modifications of these in my own repo)

Both make use of LTO+PGO; the PGO training is done either by building the Linux kernel or by compiling LLVM itself. With the second one, there is a stage 3 script where a profile is created specifically for later BOLT consumption, either by instrumentation or by sampling. Also note that these scripts focus on BOLTing the Clang binary only.

The optimizing phase of BOLT itself is usually very fast in comparison to all the other steps before it. The more time-consuming part is getting a good profile sample in stage 3 and converting it via perf2bolt, which can take several minutes (that's with an 11 GB profile; you quickly run into BOLT errors with bigger profiles).
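For readers who have not seen the sampling path, the three steps (sample, convert with perf2bolt, rewrite with llvm-bolt) can be sketched as below. This is a minimal sketch, not what either of the linked script collections actually runs: the function name, the `bolt-work` directory, and the example workload are hypothetical, and the flags are the ones I recall from the BOLT README, so check them against your installed version. The code only builds the command lines; it does not execute anything.

```python
import shlex
from pathlib import Path


def bolt_sampling_commands(binary, workload, workdir="bolt-work"):
    """Return the command lines of a sampling-based BOLT run (hypothetical helper)."""
    work = Path(workdir)
    perf_data = work / "perf.data"
    fdata = work / "perf.fdata"
    optimized = work / (Path(binary).name + ".bolt")
    return [
        # 1. Sample the training workload, recording taken branches.
        ["perf", "record", "-e", "cycles:u", "-j", "any,u",
         "-o", str(perf_data), "--", *workload],
        # 2. Convert the perf samples into BOLT's profile format.
        #    This is the step reported above as taking minutes on large profiles.
        ["perf2bolt", "-p", str(perf_data), "-o", str(fdata), binary],
        # 3. Rewrite the binary using the converted profile.
        ["llvm-bolt", binary, "-o", str(optimized), f"-data={fdata}",
         "-reorder-blocks=ext-tsp", "-reorder-functions=hfsort",
         "-split-functions", "-split-all-cold", "-dyno-stats"],
    ]


if __name__ == "__main__":
    # Assumed training workload: compile one file with the binary being optimized.
    for argv in bolt_sampling_commands("./clang", ["./clang", "-O2", "-c", "test.c"]):
        print(shlex.join(argv))
```

Returning argument lists rather than shell strings keeps quoting out of the picture; each list can be passed to `subprocess.run(argv, check=True)` once the paths are real.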

zamazan4ik · 2023-06-11T14:09:10Z

Are there plans to integrate a BOLT phase into the LLVM build scripts?

ms178 · 2023-06-11T21:18:05Z

While I have seen some work on better CMake integration, I'd leave this for more knowledgeable people to answer. Currently there are some caveats for using a BOLTed LLVM/Clang that prevent it from being usable as a system compiler replacement.

boomanaiden154 · 2023-06-12T18:36:20Z

There's the AdvancedBuilds page which details building a Clang optimized with BOLT. You can look at the CMake caches under clang/cmake/caches to see the specific flags that get set/the infrastructure available for building LLVM tools with BOLT.

zamazan4ik · 2023-06-13T15:44:32Z

> Currently there are some caveats for using a BOLTed LLVM/Clang that prevent it from being usable as a system compiler replacement.

@ms178 Could you please share more details about the current problems with BOLTing LLVM/Clang? Really interesting to read about them.

ms178 · 2023-06-13T16:16:58Z

@zamazan4ik Some shortcomings are mentioned at this page. From my (limited) understanding, it is also currently not that well suited to cases where shared libraries and dynamic objects are used, e.g. in Mesa (see also: facebookarchive/BOLT#175). Also, to BOLT a binary, it needs to be built with -Wl,--emit-relocs, which increases binary size. BOLT also doesn't work well with stripped binaries, which are common on Linux distributions to save size and for security reasons. I'd recommend looking at the various other BOLT issues in the LLVM or the BOLT incubator bug tracker. I am sorry that I cannot be more specific on the technical details, as I am just a power user who learned to use BOLT for various projects and did some research on it. I hope a "state of the union"-style in-depth video presentation about the present and future of BOLT is not too distant (in the meantime, there are various older presentations covering BOLT on YouTube that give a basic understanding).
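To check the -Wl,--emit-relocs point on a concrete binary, one option is to look for a `.rela.text` section, which the linker keeps in the output when relocations are emitted. The following is a minimal sketch that assumes a little-endian ELF64 image (e.g. x86-64 Linux); the function names are hypothetical, and this is not a description of how BOLT itself detects relocations.

```python
import struct


def elf64_section_names(data):
    """Return the section names of a little-endian ELF64 image held in bytes."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("expected a little-endian ELF64 image")
    # ELF64 header fields: e_shoff at 0x28; e_shentsize, e_shnum, e_shstrndx at 0x3A.
    (shoff,) = struct.unpack_from("<Q", data, 0x28)
    shentsize, shnum, shstrndx = struct.unpack_from("<HHH", data, 0x3A)
    if shoff == 0 or shnum == 0:
        # No section headers (extended section numbering is not handled here).
        return []
    # The section-name string table is itself a section; sh_offset sits at +0x18.
    (strtab_off,) = struct.unpack_from("<Q", data, shoff + shstrndx * shentsize + 0x18)
    names = []
    for i in range(shnum):
        # sh_name is the first field of each section header.
        (name_off,) = struct.unpack_from("<I", data, shoff + i * shentsize)
        start = strtab_off + name_off
        end = data.index(b"\0", start)
        names.append(data[start:end].decode())
    return names


def has_text_relocations(data):
    """True if the image still carries relocations for .text."""
    return ".rela.text" in elf64_section_names(data)
```

Typical use would be `has_text_relocations(open(path, "rb").read())`. A `False` result on a distro binary is consistent with the point about stripped binaries, but it only says that this one section is missing.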

zamazan4ik · 2023-06-13T16:18:59Z

Thanks a lot for the links!

zamazan4ik · 2023-06-15T15:05:30Z

@boomanaiden154 maybe a bit off-topic, but... As far as I can see, clang is the only project in LLVM that is PGO/BOLT optimized (at least according to the CMake file in the repo). Did anyone try to apply PGO to other LLVM binaries like clangd, clang-tidy, lld, lldb, flang, and others? Are there any available results for that?

aaupov · 2023-06-15T15:50:08Z

@zamazan4ik

> Hi!
>
> Did anyone try to optimize BOLT's performance with LLVM PGO and/or BOLT itself? Is it worth it? I guess it could bring some performance improvement to BOLT as well, since BOLT takes quite a long time on large profiles. If it brings benefits, it would be great to see such an optimization in the existing LLVM CMake scripts infrastructure.

BOLT primarily benefits workloads with a large code footprint that are CPU front-end (FE) bound, that is, workloads experiencing high icache/iTLB misses and branch mispredictions. And in general, it only makes sense to apply PGO/BOLT to workloads that are CPU-bound: those that would benefit from a faster CPU or from extra CPUs.

Clang building a large code base is a good example of such a workload. BOLT itself is probably not – BOLT processing is very quick compared to compilation and linking time, even when optimizing large binaries. I don't see a use case of BOLT running for a long time or in batch mode.
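The argument here is essentially Amdahl's law: speeding up a stage only helps in proportion to that stage's share of the total time. A minimal sketch with made-up stage times (the numbers and the function name are hypothetical, not measurements from this thread):

```python
def pipeline_speedup(stage_seconds, stage, stage_speedup):
    """Overall speedup of a serial pipeline when one stage gets faster."""
    total_before = sum(stage_seconds.values())
    total_after = (total_before - stage_seconds[stage]
                   + stage_seconds[stage] / stage_speedup)
    return total_before / total_after


if __name__ == "__main__":
    # Hypothetical build: compilation dominates, BOLT is a small tail.
    build = {"compile": 1000.0, "link": 100.0, "bolt": 50.0}
    for stage in build:
        gain = pipeline_speedup(build, stage, 1.2)
        print(f"{stage} made 1.2x faster: whole pipeline {gain:.3f}x faster")
```

With these assumed numbers, a 1.2x faster BOLT improves the whole pipeline by less than 1%, while the same gain on compilation is worth roughly 17%. The picture changes if BOLT's share of the total is large, as in a CI job that mostly runs BOLT.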

@ms178

> While I have seen some work on better CMake integration, I'd leave this for more knowledgeable people to answer. Currently there are some caveats for using a BOLTed LLVM/Clang that prevent it from being usable as a system compiler replacement.

Please flag any such caveats here or in a separate issue. We're pretty confident that BOLT-optimized Clang is functionally equivalent to non-optimized Clang, based on the amount of testing we run and that it's used to build large services. I'm not aware of any distros making use of BOLT-optimized Clang as a system compiler. Gentoo has a chance to be the first – track https://bugs.gentoo.org/907931 for Clang-BOLT.

> @zamazan4ik Some shortcomings are mentioned at this page.

These are prospective ideas rather than current shortcomings.

> From my (limited) understanding, it is also currently not that well suited to cases where shared libraries and dynamic objects are used, e.g. in Mesa (see also: facebookincubator/BOLT#175).

BOLT generally supports optimizing shared libraries. We do have extensive testing with them. Please report issues if you experience any.

> Also, to BOLT a binary, it needs to be built with -Wl,--emit-relocs, which increases binary size. BOLT also doesn't work well with stripped binaries, which are common on Linux distributions to save size and for security reasons.

These points are fair. However, these are the trade-offs of achieving peak performance. It is worth mentioning that distro-built binaries are generally not built for speed: they usually target the lowest-common-denominator architecture and are rarely optimized even with PGO/LTO.

> I'd recommend looking at the various other BOLT issues in the LLVM or the BOLT incubator bug tracker. I am sorry that I cannot be more specific on the technical details, as I am just a power user who learned to use BOLT for various projects and did some research on it. I hope a "state of the union"-style in-depth video presentation about the present and future of BOLT is not too distant (in the meantime, there are various older presentations covering BOLT on YouTube that give a basic understanding).

I gave two presentations focused on BOLT users, specifically targeting Clang, at the LLVM distributors conference and the US LLVM Devmtg 2022. Maksim gave an overview presentation earlier, with videos published on YouTube.

> @boomanaiden154 maybe a bit off-topic, but... As far as I can see, clang is the only project in LLVM that is PGO/BOLT optimized (at least according to the CMake file in the repo). Did anyone try to apply PGO to other LLVM binaries like clangd, clang-tidy, lld, lldb, flang, and others? Are there any available results for that?

See the answer above. lld might be the #2 candidate for optimization, as a non-trivial part of build time might be spent in linking. From what I know, the Android project attempted optimizing lld with BOLT but saw much smaller build-time wins than from optimizing Clang.

zamazan4ik · 2023-06-15T16:02:14Z

@aaupov thank you for the brief answer!

> BOLT itself is probably not – BOLT processing is very quick compared to compilation and linking time, even when optimizing large binaries

Even if the BOLTing phase is much shorter than compiling/linking, that's not an argument against optimizing BOLT itself if it runs continuously, multiple times, on CI. If BOLT benefits from applying PGO/BOLT to BOLT itself, it's worth trying. At least from my experience with BOLTing quite large binaries (like ClickHouse), it runs long enough to be worth optimizing. The same logic applies to other binaries like lld: even if the build-time benefit is smaller than for clang, it's worth trying to optimize them. Regarding compilers: I see that flang is not PGO/BOLT-optimized by LLVM itself (flang has no corresponding CMake scripts), so the speed-up in Flang could be estimated as roughly equivalent to Clang's.

By the way, do you have a link to performance results for PGOed lld in the Android project?

aaupov · 2023-06-15T16:21:00Z

> @aaupov thank you for the brief answer!
>
> > BOLT itself is probably not – BOLT processing is very quick compared to compilation and linking time, even when optimizing large binaries
>
> Even if the BOLTing phase is much shorter than compiling/linking, that's not an argument against optimizing BOLT itself if it runs continuously, multiple times, on CI. If BOLT benefits from applying PGO/BOLT to BOLT itself, it's worth trying. At least from my experience with BOLTing quite large binaries (like ClickHouse), it runs long enough to be worth optimizing. The same logic applies to other binaries like lld: even if the build-time benefit is smaller than for clang, it's worth trying to optimize them. Regarding compilers: I see that flang is not PGO/BOLT-optimized by LLVM itself (flang has no corresponding CMake scripts), so the speed-up in Flang could be estimated as roughly equivalent to Clang's.

If running BOLT in CI makes sense in your case, then yes, you can try to optimize it. I'll see if we can leverage Clang's PGO perf-training infrastructure to produce profiles for other tools (lld or BOLT itself).

> By the way, do you have a link to performance results for PGOed lld in the Android project?

I don't know about PGO for lld in Android. There's this commit enabling BOLT for Clang: https://android-review.linaro.org/plugins/gitiles/toolchain/llvm_android/+/f36c64eeddf531b7b1a144c40f61d6c9a78eee7a

This produces around an 8.88% speedup in Clang compile performance.

I can't find anything for LLD with BOLT right now. You can try searching for it in Android's mailing lists.

zamazan4ik · 2023-06-15T23:00:30Z

By the way, found some results about optimizing clangd with PGO: https://blog.jetbrains.com/clion/2022/05/testing-3-approaches-performance-cpp_apps/

aaupov · 2023-09-22T05:57:22Z

Closing as wontfix. BOLT is not a major application/service by global cycles, and there are lower-hanging opportunities in BOLT to go after.

github-actions bot added the new issue label Jun 10, 2023

EugeneZelenko added BOLT and removed new issue labels Jun 11, 2023

aaupov closed this as completed Sep 22, 2023
