Optimize BOLT with PGO and/or BOLT itself #63245
@llvm/issue-subscribers-bolt |
First, please consider that BOLT is a post-link optimizer by design: it optimizes binaries that have already been built. Apart from some projects that have incorporated BOLT into their build process, I know of various build scripts that build LLVM with a PGO + BOLT workflow. You might want to study the build scripts below for more detail, as the LLVM docs are pretty slim on the specifics.
Both make use of LTO+PGO; the PGO training is done either by building the Linux kernel or by compiling LLVM itself. With 2), there is a stage 3 script where a profile is created specifically for later BOLT consumption, either by instrumentation or by sampling. Also note that these scripts focus on BOLTing the Clang binary only. The optimizing phase of BOLT itself is usually very fast compared to all the steps before it; the more time-consuming part is collecting a good profile sample in stage 3 and converting it via perf2bolt, which can take several minutes (that's with an 11 GB profile; you quickly run into BOLT errors with bigger profiles). |
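For readers who have not seen the sampling-based flow mentioned above (profile collection, conversion via perf2bolt, then the BOLT optimization step), here is a minimal sketch that only assembles the three command lines. The flag set follows the example in the BOLT README as I understand it; the file names, the training workload, and the helper function names are hypothetical, and the linked build scripts may use different options.

```python
import shlex
from typing import List


def perf_record_cmd(workload: List[str], perf_data: str = "perf.data") -> List[str]:
    """Build a `perf record` command that samples branches (LBR) in user space."""
    return ["perf", "record", "-e", "cycles:u", "-j", "any,u",
            "-o", perf_data, "--", *workload]


def perf2bolt_cmd(binary: str, perf_data: str = "perf.data",
                  fdata: str = "perf.fdata") -> List[str]:
    """Build the command that converts perf samples into BOLT's profile format."""
    return ["perf2bolt", "-p", perf_data, "-o", fdata, binary]


def llvm_bolt_cmd(binary: str, output: str, fdata: str = "perf.fdata") -> List[str]:
    """Build the command for the (comparatively quick) optimization step."""
    return ["llvm-bolt", binary, "-o", output, f"-data={fdata}",
            "-reorder-blocks=ext-tsp", "-reorder-functions=hfsort",
            "-split-functions", "-split-all-cold", "-split-eh", "-dyno-stats"]


if __name__ == "__main__":
    # Hypothetical training workload: compile one file with the Clang being optimized.
    workload = ["./clang", "-O2", "-c", "training.cpp"]
    for cmd in (perf_record_cmd(workload),
                perf2bolt_cmd("./clang"),
                llvm_bolt_cmd("./clang", "./clang.bolt")):
        print(shlex.join(cmd))
```

The sketch prints the commands rather than running them, so each stage can be inspected or pasted into a build script separately.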
Are there plans to integrate a BOLT phase into the LLVM build scripts? |
While I have seen some work on better CMake integration, I'd leave this for more knowledgeable people to answer. Currently there are some caveats to using a BOLTed LLVM/Clang that prevent it from being usable as a system compiler replacement. |
There's the AdvancedBuilds page which details building a Clang optimized with BOLT. You can look at the CMake caches under |
@ms178 Could you please share more details about the current problems with BOLTing LLVM/Clang? Really interesting to read about them. |
@zamazan4ik Some shortcomings are mentioned on this page. From my (limited) understanding, it is also currently not that well suited to cases where shared libraries and dynamic objects are used, e.g. in Mesa (see also: facebookarchive/BOLT#175). Also, to BOLT a binary, it needs to be built with -Wl,--emit-relocs, which increases binary size. BOLT also doesn't work well with stripped binaries, which are common on Linux distributions to save size and for security reasons. I'd recommend looking at the various other BOLT issues in the LLVM or the BOLT incubator bug tracker. I am sorry that I cannot be too specific on the technical details, as I am just a power user who learned to use BOLT for various projects and did some research on it. I hope an in-depth, "state of the union"-style video presentation about the present and future of BOLT is not too distant (in the meantime, there are various older presentations on YouTube that cover BOLT and give a basic understanding). |
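To make the -Wl,--emit-relocs requirement concrete, here is a small sketch with two helpers: one appends the flag to an existing linker-flag string, and one scans a section listing (for example, text captured from `readelf -S`) for a retained text relocation section. The section names `.rela.text` / `.rel.text` are an assumption about what a binary linked this way typically contains on ELF targets, and the helper names and sample listing lines are hypothetical.

```python
import re

EMIT_RELOCS_FLAG = "-Wl,--emit-relocs"


def add_emit_relocs(ldflags: str) -> str:
    """Return ldflags with the emit-relocs flag appended if it is not already present."""
    flags = ldflags.split()
    if EMIT_RELOCS_FLAG not in flags:
        flags.append(EMIT_RELOCS_FLAG)
    return " ".join(flags)


def has_text_relocs(section_listing: str) -> bool:
    """Heuristic: does the listing mention a relocation section for .text?

    Assumption: a binary linked with --emit-relocs keeps a section named
    .rela.text (or .rel.text), while a plain or stripped binary does not.
    """
    return re.search(r"\.rela?\.text\b", section_listing) is not None


if __name__ == "__main__":
    print(add_emit_relocs("-fuse-ld=lld"))
    # Hypothetical one-line excerpts of a section listing.
    print(has_text_relocs("[15] .rela.text        RELA"))
    print(has_text_relocs("[14] .text             PROGBITS"))
```

A check like this could run before the BOLT step in a build script to fail early on binaries that were linked without relocations.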
Thanks a lot for the links! |
@boomanaiden154 maybe a bit off-topic, but... As far as I can see, clang is the only project in LLVM that is PGO/BOLT optimized (at least according to the CMake file in the repo). Did anyone try to apply PGO on other LLVM binaries like |
BOLT primarily benefits large code footprint workloads that are CPU FE-bound, that is, workloads experiencing high icache/itlb misses and branch mispredictions. And in general, it only makes sense to apply PGO/BOLT to workloads that are CPU-bound, those that would benefit from making CPU faster or from extra CPUs. Clang building a large code base is a good example of such a workload. BOLT itself is probably not – BOLT processing is very quick compared to compilation and linking time, even when optimizing large binaries. I don't see a use case of BOLT running for a long time or in batch mode.
Please flag any such caveats here or in a separate issue. We're pretty confident that BOLT-optimized Clang is functionally equivalent to non-optimized Clang, based on the amount of testing we run and that it's used to build large services. I'm not aware of any distros making use of BOLT-optimized Clang as a system compiler. Gentoo has a chance to be the first – track https://bugs.gentoo.org/907931 for Clang-BOLT.
These are prospective ideas rather than current shortcomings.
BOLT generally supports optimizing shared libraries. We do have extensive testing with them. Please report issues if you experience any.
These points are fair. However, these are the trade-offs of achieving peak performance. It is worth mentioning that distro-built binaries are generally not built for speed: they usually target the lowest-common-denominator architecture and are rarely optimized even with PGO/LTO.
I gave two presentations focused on BOLT users, specifically targeting Clang, at the LLVM distributors conference and the US LLVM Devmtg 2022. Maksim gave an overview presentation earlier, with videos published on YouTube.
See answer above. |
@aaupov thank you for the brief answer!
Even if the BOLTing phase is much shorter than compiling/linking, that's not an argument against optimizing BOLT itself if it runs repeatedly in CI. If BOLT benefits from applying PGO/BOLT to BOLT itself, it's worth trying. At least in my experience with BOLTing quite large binaries (like ClickHouse), it runs long enough to be worth optimizing. The same logic to other binaries like By the way, do you have a link with a performance improvement with PGOed |
If running BOLT in CI makes sense in your case, then yes, you can try to optimize it. I'll see if we can leverage Clang's PGO perf-training infrastructure to produce profiles for other tools (lld or BOLT itself).
I don't know about PGO for lld in Android. There's this commit enabling BOLT for Clang: https://android-review.linaro.org/plugins/gitiles/toolchain/llvm_android/+/f36c64eeddf531b7b1a144c40f61d6c9a78eee7a
Now I can't find anything for LLD with BOLT. You can try searching for it in Android's mailing lists. |
By the way, found some results about optimizing |
Closing as wontfix. BOLT is not a major application/service by global cycles, and there are lower-hanging opportunities in BOLT to go after. |
Hi!
Did anyone try to optimize BOLT's performance with LLVM PGO and/or BOLT itself? Is it worth it? I guess it could bring some performance improvement to BOLT as well, since BOLT runs for quite a long time on large profiles. If it brings benefits, it would be great to see such an optimization integrated into the existing LLVM CMake scripts infrastructure.
I have carefully read the LLVM docs about PGO and am not sure that BOLT is covered right now. If it is, it would be great to mention it explicitly. If you could share PGO results for other LLVM-based tooling like LLD, that would be awesome!