Stefanos Baziotis via llvm-dev
2020-Sep-27 11:52 UTC
[llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC
Hi everyone,

I was watching this video [1]. It includes an example of an initialization loop for which Clang unfortunately generates really bad code [2]. On my machine, the Clang version is 4x slower than the GCC version. I have not tested the MSVC version, but it should be around the same.

In case anyone's interested, in the video [1] Casey explains why this code is bad (around 59:39).

So, I tried to run -print-after-all [3]. There are a lot of passes that interact here, so I was wondering if anyone knows more about that. It seems to me that the problem starts with SROA. Also, I'm not familiar with how these llvm.memcpy / memset are handled down the pipeline. Finally, the regalloc probably did not go very well.

Best,
Stefanos

[1] https://youtu.be/R5tBY9Zyw6o?t=3580
[2] https://godbolt.org/z/9oWhEn
[3] https://godbolt.org/z/xa4jo9
Florian Hahn via llvm-dev
2020-Oct-01 19:45 UTC
[llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC
Hi,

> On Sep 27, 2020, at 12:52, Stefanos Baziotis via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi everyone,
>
> I was watching this video [1]. There's an example of an initialization loop for which
> Clang unfortunately generates really bad code [2]. In my machine, the Clang version
> is 4x slower than the GCC version. I have not tested the MSVC version, but it should
> be around the same.
>
> In case anyone's interested, in the video [1] Casey explains why this code is bad (around 59:39).
>
> So, I tried to run -print-after-all [3]. There are a lot of passes that interact here, so I was
> wondering if anyone knows more about that. It seems to me that the problem starts
> with SROA. Also, I'm not familiar with how these llvm.memcpy / memset are handled down
> the pipeline. Finally, the regalloc probably did not go very well.

I filed https://bugs.llvm.org/show_bug.cgi?id=47705 to keep track of the issue.

While the initialization code is not ideal, the main cause of the slowdown appears to be that GCC interchanges the main loops, but LLVM does not. After interchanging, the memory access patterns are completely different (which probably also slightly defeats the purpose of the benchmark).

There's also an issue with SROA, which splits a nice single consecutive llvm.memcpy into 3 separate ones. With SROA disabled there's another ~2x speedup (on top of manually interchanging the loops, which gives a ~3x speedup).

Cheers,
Florian
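For readers unfamiliar with loop interchange, here is a minimal sketch of the transformation. It is not the benchmark from the godbolt link; the row-major grid and the function names `sum_column_first` and `sum_row_first` are hypothetical and chosen only for illustration. Both functions compute the same result, but the interchanged version walks memory with unit stride, which is the kind of access-pattern change described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row-major grid: element (r, c) lives at index r * cols + c.

// Original loop order: the inner loop steps through memory with a stride
// of `cols` elements, so consecutive iterations touch distant addresses.
std::uint64_t sum_column_first(const std::vector<std::uint32_t>& grid,
                               std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    std::uint64_t total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += grid[r * cols + c];
    return total;
}

// Interchanged loop order: the inner loop now visits consecutive
// addresses (unit stride). The result is identical because integer
// addition does not depend on the order of the terms.
std::uint64_t sum_row_first(const std::vector<std::uint32_t>& grid,
                            std::size_t rows, std::size_t cols) {
    assert(grid.size() == rows * cols);
    std::uint64_t total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += grid[r * cols + c];
    return total;
}
```

Whether a compiler performs this interchange automatically depends on its cost model and on whether it can prove the reordering is legal. Applying it by hand, as in the second function, is the "manual interchange" mentioned above.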
Florian Hahn via llvm-dev
2020-Oct-01 20:59 UTC
[llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC
> On Oct 1, 2020, at 20:45, Florian Hahn via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> I filed https://bugs.llvm.org/show_bug.cgi?id=47705 to keep track of the issue.
>
> While the code for the initialization is not ideal, it appears the main issue causing the slowdown is the fact that GCC interchanges the main loops, but LLVM does not. After interchanging, the memory access patterns are completely different (and it also probably slightly defeats the purpose of the benchmark).
>
> There's also an issue with SROA which splits a nice single consecutive llvm.memcpy into 3 separate ones. With SROA disabled there's another ~2x speedup (on top of manually interchanging the loops, which gives a ~3x speedup).

Alternatively, if we created vector stores instead of the small memcpy calls, we would probably get a better result overall.

Using Clang's Matrix Types extension effectively does so, and with that version (https://godbolt.org/z/nvq86W) I get the same speed as when disabling SROA (although the code is not as nice as it could be right now, as there's no syntax for constant initializers for matrix types yet).
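As a rough illustration of what using the Matrix Types extension looks like, here is a small sketch. It is not the code behind the godbolt link: the names `m4x4`, `make_identity`, and `SKETCH_HAS_MATRIX_TYPES` are hypothetical. The extension is only available in Clang when compiling with `-fenable-matrix`, so the sketch falls back to a plain struct elsewhere. Because there is no constant-initializer syntax, the matrix is filled element by element.

```cpp
#include <cstddef>

// Detect Clang's matrix types extension (enabled with -fenable-matrix).
#if defined(__clang__) && defined(__has_extension)
#  if __has_extension(matrix_types)
#    define SKETCH_HAS_MATRIX_TYPES 1
#  endif
#endif

#ifdef SKETCH_HAS_MATRIX_TYPES
// 4x4 float matrix using the Clang extension.
typedef float m4x4 __attribute__((matrix_type(4, 4)));
#else
// Fallback (an assumption for portability, not part of the extension):
// a plain struct that supports the same m[row][col] access syntax.
struct m4x4 {
    float e[4][4];
    float* operator[](std::size_t row) { return e[row]; }
    const float* operator[](std::size_t row) const { return e[row]; }
};
#endif

// Builds an identity matrix one element at a time, since matrix types
// have no constant-initializer syntax.
m4x4 make_identity() {
    m4x4 m;
    for (unsigned r = 0; r < 4; ++r)
        for (unsigned c = 0; c < 4; ++c)
            m[r][c] = (r == c) ? 1.0f : 0.0f;
    return m;
}
```

Whether the matrix version actually lowers to vector stores depends on the Clang version and optimization flags; comparing the generated assembly for both variants is the only reliable check.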