thr3ads.net - llvm dev - [llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC [Sep 2020]

Stefanos Baziotis via llvm-dev

2020-Sep-27 11:52 UTC

[llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC

Hi everyone,

I was watching this video [1]. There's an example of an initialization loop
for which Clang unfortunately generates really bad code [2]. On my machine,
the Clang version is 4x slower than the GCC version. I have not tested the
MSVC version, but it should be around the same.

In case anyone's interested, in the video [1] Casey explains why this code
is bad (around 59:39).

So, I tried running with -print-after-all [3]. A lot of passes interact here,
so I was wondering if anyone knows more about what is going on. It seems to me
that the problem starts with SROA. Also, I'm not familiar with how these
llvm.memcpy / memset calls are handled further down the pipeline. Finally, the
regalloc probably did not go very well.

Best,
Stefanos

[1] https://youtu.be/R5tBY9Zyw6o?t=3580
[2] https://godbolt.org/z/9oWhEn
[3] https://godbolt.org/z/xa4jo9

Florian Hahn via llvm-dev

2020-Oct-01 19:45 UTC

Hi,
> On Sep 27, 2020, at 12:52, Stefanos Baziotis via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi everyone,
>
> I was watching this video [1]. There's an example of an initialization loop
> for which Clang unfortunately generates really bad code [2]. On my machine,
> the Clang version is 4x slower than the GCC version. I have not tested the
> MSVC version, but it should be around the same.
>
> In case anyone's interested, in the video [1] Casey explains why this code
> is bad (around 59:39).
>
> So, I tried to run -print-after-all [3]. There are a lot of passes that
> interact here, so I was wondering if anyone knows more about that. It seems
> to me that the problem starts with SROA. Also, I'm not familiar with how
> these llvm.memcpy / memset are handled down the pipeline. Finally, the
> regalloc probably did not go very well.

I filed https://bugs.llvm.org/show_bug.cgi?id=47705 to keep track of the issue.

While the code for the initialization is not ideal, it appears the main issue
causing the slowdown is the fact that GCC interchanges the main loops, but LLVM
does not. After interchanging, the memory access patterns are completely
different (and it also probably slightly defeats the purpose of the benchmark).
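
For readers unfamiliar with loop interchange, here is a minimal sketch of the transformation. It is a hypothetical example, not the benchmark code from [2]; the names `fill_column_outer`, `fill_row_outer`, `kRows` and `kCols` are made up for illustration. Both functions perform exactly the same stores, and only the nesting order of the two loops differs, which is what changes the memory access pattern.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical grid size, chosen only for illustration.
constexpr std::size_t kRows = 64;
constexpr std::size_t kCols = 64;

// Column-outer order: consecutive inner iterations touch elements that are
// kCols ints apart in the row-major buffer, so the accesses are strided.
void fill_column_outer(std::vector<int>& grid) {
    for (std::size_t c = 0; c < kCols; ++c)
        for (std::size_t r = 0; r < kRows; ++r)
            grid[r * kCols + c] = static_cast<int>(r + c);
}

// Interchanged order: the same stores, but the inner loop now walks
// adjacent elements of the buffer.
void fill_row_outer(std::vector<int>& grid) {
    for (std::size_t r = 0; r < kRows; ++r)
        for (std::size_t c = 0; c < kCols; ++c)
            grid[r * kCols + c] = static_cast<int>(r + c);
}
```

Because the two versions compute identical results, a compiler is free to pick either order when it can prove the iterations are independent. How much faster the second form runs depends on the element size, the array dimensions and the cache hierarchy.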

There’s also an issue with SROA which splits a nice single consecutive
llvm.memcpy into 3 separate ones. With SROA disabled there’s another ~2x speedup
(on top of manually interchanging the loops, which gives a ~3x speedup).

Cheers,
Florian

Florian Hahn via llvm-dev

2020-Oct-01 20:59 UTC

> On Oct 1, 2020, at 20:45, Florian Hahn via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> There’s also an issue with SROA which splits a nice single consecutive
> llvm.memcpy into 3 separate ones. With SROA disabled there’s another ~2x
> speedup (on top of manually interchanging the loops, which gives a ~3x
> speedup).

Alternatively, if we created vector stores instead of the small memcpy
calls, we would probably get a better result overall. Using Clang's Matrix
Types extension effectively does so, and with that version
https://godbolt.org/z/nvq86W I get the same speed as with SROA disabled
(although the code is not as nice as it could be right now, as there's no
syntax for constant initializers for matrix types yet).
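
A rough sketch of what using the matrix type extension for such an initialization might look like. This is not the code from the godbolt link above; `Mat4`, `identity`, `at`, `init_all` and the `HAVE_MATRIX_TYPES` macro are hypothetical names. The matrix path assumes Clang with `-fenable-matrix`; otherwise the sketch falls back to a plain struct so that it still compiles. Since there is no constant-initializer syntax, the elements are assigned one by one.

```cpp
#include <cstddef>

// Detect Clang's matrix type extension (enabled with -fenable-matrix).
#if defined(__clang__) && defined(__has_extension)
#  if __has_extension(matrix_types)
#    define HAVE_MATRIX_TYPES 1
#  endif
#endif

#ifdef HAVE_MATRIX_TYPES
// 4x4 float matrix as a first-class value type.
typedef float Mat4 __attribute__((matrix_type(4, 4)));

inline Mat4 identity() {
    Mat4 m;
    // No constant initializer syntax, so fill element by element.
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            m[r][c] = (r == c) ? 1.0f : 0.0f;
    return m;
}

inline float at(Mat4 m, std::size_t r, std::size_t c) { return m[r][c]; }
#else
// Fallback for compilers without the extension: a plain aggregate.
struct Mat4 { float e[4][4]; };

inline Mat4 identity() {
    Mat4 m;
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            m.e[r][c] = (r == c) ? 1.0f : 0.0f;
    return m;
}

inline float at(Mat4 m, std::size_t r, std::size_t c) { return m.e[r][c]; }
#endif

// Initialization loop: every element of the array gets the same matrix value.
void init_all(Mat4* out, std::size_t n) {
    const Mat4 id = identity();
    for (std::size_t i = 0; i < n; ++i)
        out[i] = id;
}
```

The sketch only shows the shape of the source. Whether the assignment in `init_all` is lowered to vector stores or to memcpy calls depends on the compiler version and flags, so the generated code should be checked before drawing conclusions.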
