Alexandre Ganea via llvm-dev
2020-Jul-02 04:20 UTC
[llvm-dev] RFC: Replacing the default CRT allocator on Windows
Hello,

I was wondering how folks feel about replacing the default Windows CRT allocator in Clang, LLD and possibly other LLVM tools.

The CRT heap allocator on Windows doesn't scale well on machines with a large core count. Any multi-threaded workload in LLVM that allocates often is impacted by this. As a result, link times with ThinLTO are extremely slow on Windows. We're observing performance inversely proportional to the number of cores: the more cores the machine has, the slower ThinLTO linking gets.

We've replaced the CRT heap allocator with modern lock-free thread-cache allocators such as rpmalloc (Unlicense), mimalloc (MIT licence) or snmalloc (MIT licence). The runtime performance is an order of magnitude better.

Time to link clang.exe with LLD and -flto on a 36-core machine:

  Windows CRT heap allocator: 38 min 47 sec
  mimalloc: 2 min 22 sec
  rpmalloc: 2 min 15 sec
  snmalloc: 2 min 19 sec

We have been running a downstream fork of LLVM + rpmalloc in production for more than a year. However, when cross-compiling for some specific game platforms we're using other downstream forks of LLVM that we can't change.

Two questions arise:

1. The licencing. Should we embed one of these allocators into the LLVM tree, or keep them separate, out of tree?
2. If the answer to the above question is "yes", given the tremendous performance speedup, should we embed one of these allocators into Clang/LLD builds by default (on Windows only)? Consider that Windows doesn't have an LD_PRELOAD mechanism.

Please see the demo patch here: reviews.llvm.org/D71786

Thank you in advance for the feedback!
Alex.
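For anyone who wants a feel for the scaling problem described above without running a full ThinLTO link, here is a minimal sketch of an allocation-churn microbenchmark. It is a toy under stated assumptions, not the measurement behind the numbers in this message: `churn` and `time_alloc_workload` are hypothetical names, and the block sizes and counts are arbitrary. It goes through whatever malloc/free the binary is linked against, so the same source can be timed once against the CRT heap and once against a replacement allocator.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper: one thread repeatedly frees and reallocates small
// blocks through the process-wide malloc/free.
static void churn(std::size_t rounds) {
  std::vector<void *> blocks(64, nullptr);
  for (std::size_t i = 0; i < rounds; ++i) {
    std::size_t slot = i % blocks.size();
    std::free(blocks[slot]);                     // free(nullptr) is a no-op
    blocks[slot] = std::malloc(16 + (i % 240));  // sizes from 16 to 255 bytes
    if (blocks[slot])
      *static_cast<volatile char *>(blocks[slot]) = 1;  // touch the block
  }
  for (void *p : blocks)
    std::free(p);
}

// Returns wall-clock seconds for `threads` threads, each doing `rounds`
// allocate/free steps. Every thread does the same fixed amount of work.
double time_alloc_workload(unsigned threads, std::size_t rounds) {
  assert(threads > 0);
  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < threads; ++t)
    pool.emplace_back(churn, rounds);
  for (auto &th : pool)
    th.join();
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

Calling `time_alloc_workload(n, 1000000)` for n = 1, 2, 4, ... up to the hardware thread count gives a rough picture. Since each thread does a fixed amount of work, an allocator that scales should keep the time roughly flat as n grows, while a heap behind a single lock should get slower.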
Michael Kruse via llvm-dev
2020-Jul-02 04:53 UTC
[llvm-dev] [cfe-dev] RFC: Replacing the default CRT allocator on Windows
I'd appreciate the speed-up from including an alternative malloc, and the ease of using it if its source were included in the LLVM repository. We already have multiple sources from external projects included in the repository (among them gtest, gmock, ConvertUTF, google benchmark, ISL, ...) and I don't see a reason why it should be different for a malloc implementation.

AFAIK replacing malloc is quite common for Windows projects. The Windows default implementation (called "low fragmentation heap") has different optimization goals.

I'd start with including it in the repository and providing the option to enable it, with the possibility of making it the default after some experience has been collected, maybe even with multiple malloc implementations.

Michael
Neil Henning via llvm-dev
2020-Jul-02 07:58 UTC
[llvm-dev] [cfe-dev] RFC: Replacing the default CRT allocator on Windows
Not against this for the executables, but I just wanted to 100% check that it is possible to override malloc/free for clang and not for the libclang.dll? I think it's fine to make the built executables use a different allocator, but it'd be a bigger pain if we force an allocator on users that link against the LLVM libraries or shared libraries by default.

Cheers,
-Neil.

--
Neil Henning
Senior Software Engineer Compiler
unity.com
David Chisnall via llvm-dev
2020-Jul-02 12:03 UTC
[llvm-dev] RFC: Replacing the default CRT allocator on Windows
On 02/07/2020 05:20, Alexandre Ganea via llvm-dev wrote:
> Time to link clang.exe with LLD and -flto on 36-core:
>
> Windows CRT heap allocator: 38 min 47 sec
> mimalloc: 2 min 22 sec
> rpmalloc: 2 min 15 sec
> snmalloc: 2 min 19 sec

These numbers all seem very close (apart from the baseline). How many runs did you do and what was the jitter?

FWIW, I'm using snmalloc on FreeBSD instead of jemalloc and clang is around 2% faster, so it might be worth considering this as an option for all platforms. It's likely to be a big win on anything where dlmalloc is the default allocator.

Snmalloc currently supports macOS, Windows, Linux, FreeBSD, NetBSD, OpenBSD, Haiku, and OpenEnclave (adding other POSIXy systems is fairly trivial, can be completely trivial if you don't want any non-standard-POSIX behaviour) on x86, ARM, and PowerPC (RISC-V and MIPS under review).

I am obviously biased towards snmalloc, since I'm one of the authors, and happy to help out anyone wanting to integrate it with LLVM. Note that snmalloc requires C++17, so would need to be conditional on LLVM being built with a vaguely modern compiler.

David
Alexandre Ganea via llvm-dev
2020-Jul-02 15:15 UTC
[llvm-dev] RFC: Replacing the default CRT allocator on Windows
Hello David,

Please see below.

-----Original Message-----
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On behalf of David Chisnall via llvm-dev
Sent: July 2, 2020 8:04 AM
To: llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] RFC: Replacing the default CRT allocator on Windows

> These numbers all seem very close (apart from the baseline). How many runs did you do and what was the jitter?

Three runs, and I took the last value. Once the Windows cache is hot, the numbers are very stable. The ThinLTO cache is not enabled, and I used /opt:lldltojobs=all to extend the ThreadPool to all hardware threads.

> FWIW, I'm using snmalloc on FreeBSD instead of jemalloc and clang is around 2% faster, so it might be worth considering this as an option for all platforms. It's likely to be a big win on anything where dlmalloc is the default allocator.

I didn't mention it, but compile times also improved, in the range of 5-10% depending on the file being compiled. When using integrated compilation, i.e. all TUs compiled in the same process, the gain is in the range of 60%. But in that case there are other effects that come into play.

> I am obviously biased towards snmalloc, since I'm one of the authors, and happy to help out anyone wanting to integrate it with LLVM. Note that snmalloc requires C++17, so would need to be conditional on LLVM being built with a vaguely modern compiler.

snmalloc currently compiles as part of the LLVM codebase with a few C++17-related constexpr warnings. However, the contentious issue is the commit size, which //could be// a showstopper for certain users. A runtime flag, -fno-integrated-crt-alloc, could mitigate this issue somewhat. However, the commit size only gets worse on machines with a high core count.

Peak commit when linking clang.exe with LLD and -flto on a 36-core machine:

  Windows CRT heap allocator: 14.9 GB
  mimalloc: 19.8 GB
  rpmalloc: 31.9 GB
  snmalloc: 42 GB
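To put the link times from the original post next to the peak-commit figures above, here is a small arithmetic sketch. The numbers are copied from this thread; the `Run` struct and the function names are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstdio>

// One measurement from the thread: link time and peak commit on 36 cores.
struct Run {
  const char *name;
  int minutes;
  int seconds;
  double commit_gb;
};

constexpr int to_seconds(int minutes, int seconds) {
  return minutes * 60 + seconds;
}

// Figures as quoted in the thread; index 0 is the baseline.
const Run kRuns[] = {
    {"Windows CRT heap", 38, 47, 14.9},
    {"mimalloc", 2, 22, 19.8},
    {"rpmalloc", 2, 15, 31.9},
    {"snmalloc", 2, 19, 42.0},
};

// How many times faster `r` links compared to `base`.
double speedup(const Run &base, const Run &r) {
  assert(to_seconds(r.minutes, r.seconds) > 0);
  return static_cast<double>(to_seconds(base.minutes, base.seconds)) /
         to_seconds(r.minutes, r.seconds);
}

// How much more memory `r` commits at peak compared to `base`.
double commit_ratio(const Run &base, const Run &r) {
  assert(base.commit_gb > 0.0);
  return r.commit_gb / base.commit_gb;
}

// Prints one line per allocator with both ratios against the baseline.
void print_tradeoff() {
  for (const Run &r : kRuns)
    std::printf("%-18s %5.1fx faster, %4.2fx peak commit\n", r.name,
                speedup(kRuns[0], r), commit_ratio(kRuns[0], r));
}
```

With these figures, rpmalloc works out to roughly 17x faster than the CRT heap at about 2.1x the peak commit, and snmalloc to roughly 16.7x faster at about 2.8x the peak commit.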
James Y Knight via llvm-dev
2020-Jul-02 22:08 UTC
[llvm-dev] [cfe-dev] RFC: Replacing the default CRT allocator on Windows
Have you tried Microsoft's new "segment heap" implementation? Only apps that opt in get it at the moment. Reportedly Edge and Chromium are getting large memory savings from switching, but I haven't seen performance comparisons.

If the performance is good, that seems like it might be the simplest choice.

docs.microsoft.com/en-us/windows/win32/sbscs/application-manifests#heaptype
blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals.pdf
Alexandre Ganea via llvm-dev
2020-Jul-03 03:56 UTC
[llvm-dev] [cfe-dev] RFC: Replacing the default CRT allocator on Windows
Thanks for the suggestion, James. It reduces the commit by about 900 MB (14.9 GB -> 14 GB).

Unfortunately, it does not solve the performance problem. The heap is global to the application and thread-safe, so every malloc/free locks it, which evidently doesn't scale. We could manually create thread-local heaps, but I didn't want to go there. Ultimately, allocated blocks need to share ownership between threads, and at that point it's like writing a new allocator.

I suppose most non-Windows platforms already have lock-free thread-local arenas, which probably explains why this issue has gone (mostly) unnoticed.
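The difference between a heap that locks on every call and one with a per-thread front end can be sketched with a toy fixed-size block pool. This is only an illustration of the locking pattern described above, under loud assumptions: `LockedPool`, `ThreadCache`, `kBlockSize` and `kMaxLocal` are hypothetical names, and none of this reflects how the CRT heap, rpmalloc, mimalloc or snmalloc are actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

constexpr std::size_t kBlockSize = 64;  // arbitrary fixed block size

// Shared pool: every alloc and release takes the same lock, so all threads
// serialize on it.
class LockedPool {
  std::mutex mu_;
  std::vector<void *> free_;

public:
  ~LockedPool() {
    for (void *p : free_)
      std::free(p);
  }
  void *alloc() {
    std::lock_guard<std::mutex> guard(mu_);
    if (free_.empty())
      return std::malloc(kBlockSize);
    void *p = free_.back();
    free_.pop_back();
    return p;
  }
  void release(void *p) {
    assert(p != nullptr);
    std::lock_guard<std::mutex> guard(mu_);
    free_.push_back(p);
  }
};

// Per-thread front end: one instance per thread, never shared. The fast path
// touches only thread-private state and takes no lock.
class ThreadCache {
  static constexpr std::size_t kMaxLocal = 32;  // arbitrary cache bound
  LockedPool &shared_;
  std::vector<void *> local_;

public:
  explicit ThreadCache(LockedPool &shared) : shared_(shared) {}
  ~ThreadCache() {
    for (void *p : local_)
      shared_.release(p);  // hand cached blocks back when the thread ends
  }
  void *alloc() {
    if (!local_.empty()) {  // fast path: no lock
      void *p = local_.back();
      local_.pop_back();
      return p;
    }
    return shared_.alloc();  // slow path: falls back to the locked pool
  }
  void release(void *p) {
    assert(p != nullptr);
    if (local_.size() < kMaxLocal) {  // fast path: no lock
      local_.push_back(p);
      return;
    }
    shared_.release(p);  // cache full: overflow goes to the locked pool
  }
  std::size_t cached() const { return local_.size(); }
};
```

In this toy, a block released by a thread other than the one that allocated it simply lands in the releasing thread's cache. That sidesteps the shared-ownership question raised above, which is exactly the part a real allocator has to handle carefully.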