Florian Hahn via llvm-dev
2020-Oct-01 20:59 UTC
[llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC
> On Oct 1, 2020, at 20:45, Florian Hahn via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi,
>
>> On Sep 27, 2020, at 12:52, Stefanos Baziotis via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hi everyone,
>>
>> I was watching this video [1]. There's an example of an initialization loop for which Clang unfortunately generates really bad code [2]. On my machine, the Clang version is 4x slower than the GCC version. I have not tested the MSVC version, but it should be around the same.
>>
>> In case anyone's interested, in the video [1] Casey explains why this code is bad (around 59:39).
>>
>> So, I tried to run -print-after-all [3]. There are a lot of passes that interact here, so I was wondering if anyone knows more about that. It seems to me that the problem starts with SROA. Also, I'm not familiar with how these llvm.memcpy / memset calls are handled down the pipeline. Finally, the regalloc probably did not go very well.
>
> I filed https://bugs.llvm.org/show_bug.cgi?id=47705 to keep track of the issue.
>
> While the code for the initialization is not ideal, the main issue causing the slowdown appears to be that GCC interchanges the main loops, but LLVM does not. After interchanging, the memory access patterns are completely different (and it also probably slightly defeats the purpose of the benchmark).
>
> There's also an issue with SROA, which splits a nice single consecutive llvm.memcpy into 3 separate ones. With SROA disabled there's another ~2x speedup (on top of manually interchanging the loops, which gives a ~3x speedup).

Alternatively, if we created vector stores instead of the small memcpy calls, we would probably get a better result overall. Using Clang's Matrix Types extension effectively does so, and with that version https://godbolt.org/z/nvq86W I get the same speed as with SROA disabled (although the code is not as nice as it could be right now, as there's no syntax for constant initializers for matrix types yet).
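The benchmark itself lives behind the video and godbolt links, but for readers who want a feel for the interchange being discussed, here is a minimal sketch of a plausible shape. The struct name, array size, and helper names are illustrative assumptions, not the actual benchmark code.

```cpp
// Illustrative stand-in for the benchmark's data: an array of 4x4 float
// matrices (the real code is only in the linked video/godbolt).
struct Mat4 { float v[4][4]; };

constexpr int kMatrices = 1024;  // assumed array size
constexpr int kRuns = 25;        // the thread mentions 25 iterations

// Original loop order: the `run` loop is outermost, so every iteration
// streams over the whole array again (poor cache locality, and the
// aggregate copy tends to be lowered to llvm.memcpy).
void init_original(Mat4 *grid, const Mat4 &ident) {
  for (int run = 0; run < kRuns; ++run)
    for (int i = 0; i < kMatrices; ++i)
      grid[i] = ident;
}

// Interchanged order (what GCC does automatically): each matrix is now
// written kRuns times back-to-back, so the inner `run` loop becomes
// redundant and trivially removable.
void init_interchanged(Mat4 *grid, const Mat4 &ident) {
  for (int i = 0; i < kMatrices; ++i)
    for (int run = 0; run < kRuns; ++run)
      grid[i] = ident;
}
```

Both orders compute the same final state; the point of contention in the thread is whether a benchmark of the interchanged form still measures initialization cost.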
Stefanos Baziotis via llvm-dev
2020-Oct-01 22:16 UTC
[llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC
Hi Florian,

Thanks for providing feedback.

> I filed https://bugs.llvm.org/show_bug.cgi?id=47705 to keep track of the issue.

Great, thanks! I wasn't sure whether it counted as a bug / regression / whatever.

> There's also an issue with SROA which splits a nice single consecutive llvm.memcpy into 3 separate ones.

Yes, exactly. As I said in my first message, SROA seems to be a big problem here.

> While the code for the initialization is not ideal, it appears the main issue causing the slowdown is the fact that GCC interchanges the main loops, but LLVM does not. After interchanging, the memory access patterns are completely different

That's definitely one factor, and without interchange cache locality plays a big role, but I think two things should be noted:
1) GCC, even though it interchanges the original code, still runs 25 iterations, so it still more or less performs the benchmark.
2) Most importantly, even if we interchange the code ourselves, Clang's codegen is still very bad, both compared to GCC and in general. Also, GCC does the obvious thing of completely removing the (now inner) loop with IV `run`, while Clang doesn't. Basically, these are problems that I feel loop optimizations should not suffer from at this point :/

> Alternatively, if we would create vector stores instead of the small memcpy calls, we probably would get a better result overall. Using Clang's Matrix Types extensions effectively does so, and with that version https://godbolt.org/z/nvq86W

Oh, that's pretty neat. For this code I don't think we should expect the user to have to write it like that to get good codegen, but it's still cool.
Cheers,
Stefanos

On Thu, Oct 1, 2020 at 11:59 PM, Florian Hahn <florian_hahn at apple.com> wrote:
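To make the point about the redundant loop concrete: after interchanging, the inner `run` loop repeats the same idempotent store, so it can be deleted outright. A hedged sketch (hypothetical code shape, not the actual benchmark) of the simplification GCC performs and Clang currently misses:

```cpp
struct Mat4 { float v[4][4]; };

// After interchange, the inner `run` loop just re-executes the same
// idempotent store, so it can be deleted entirely...
void init_interchanged(Mat4 *grid, int n, int runs, const Mat4 &ident) {
  for (int i = 0; i < n; ++i)
    for (int run = 0; run < runs; ++run)  // GCC removes this loop
      grid[i] = ident;
}

// ...leaving the equivalent single-pass initialization:
void init_simplified(Mat4 *grid, int n, const Mat4 &ident) {
  for (int i = 0; i < n; ++i)
    grid[i] = ident;
}
```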
Stefanos Baziotis via llvm-dev
2020-Oct-01 22:28 UTC
[llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC
Oh, also another thing that's important:

> on top of manually interchanging the loops, which gives a ~3x speedup).

Indeed, on my machine too. Although I understand that this speedup is not the important thing here, it has to be noted what GCC does with the loops interchanged:

Clang:
- Original: 2.04
- Interchanged: 0.60

G++:
- Original: 0.43
- Interchanged: 0.15

So, GCC gets a 3x speedup too, and GCC's interchanged version is still 4x faster than Clang's.

Best,
Stefanos

On Fri, Oct 2, 2020 at 1:16 AM, Stefanos Baziotis <stefanos.baziotis at gmail.com> wrote:
Florian Hahn via llvm-dev
2020-Oct-02 09:27 UTC
[llvm-dev] A 4x slower initialization loop in LLVM vs GCC and MSVC
> On Oct 1, 2020, at 23:16, Stefanos Baziotis <stefanos.baziotis at gmail.com> wrote:
>
>> While the code for the initialization is not ideal, it appears the main issue causing the slowdown is the fact that GCC interchanges the main loops, but LLVM does not. After interchanging, the memory access patterns are completely different
>
> Well, definitely that's one thing and without interchange cache locality plays a big role but I think two things should be noted:
> 1) GCC, even though it interchanges the original code, still runs 25 iterations, so it still kind of does the benchmark.

Agreed, the benchmark does exactly the same computation, just in a different order. I didn't write the benchmark, so I can only assume what it is intended to measure. From the title of the email and the code, it seems the goal is to benchmark the performance of initializing a large array of matrices, and the outer `run` loop has been added to make sure the benchmark runs a sufficiently large number of times to get more stable timings.

With that assumption in mind, I don't think the benchmark really measures the same thing after interchanging. In other words, say interchanging gives a 4x speedup on the benchmark; that is unlikely to translate into practice for loops that just initialize matrices with constants without the artificial 'no-op' outer loop. Granted, it is easy to come up with code where interchanging is highly beneficial for the interesting parts of the computation.

> 2) Most importantly, even if we interchange the code ourselves, Clang's codegen is still very bad both compared to GCC and in general. Also, GCC does the obvious thing of completely removing the (now inner) loop with IV `run`, while Clang doesn't. Basically, these are problems that I feel loop optimizations should not suffer from at this point :/

Personally I try to look at the bright side: more potential for further optimizations!

I also don't think there are any fundamental transformations missing here. The main issue appears to be the llvm.memcpy calls, which force us to go through memory in the main loop instead of hoisting the invariant 'load' part out of the loop (in the original, un-interchanged version).

As for LLVM not removing the redundant loop after interchanging: this is indeed surprising, but again it is likely caused by the memcpys, which loop-invariant code motion currently does not hoist out in that case. Adding support for that should be relatively straightforward, however. If corresponding load/store pairs are used instead of the memcpy, LLVM should eliminate the loop as well. And with loads/stores instead of memcpy, LLVM's loop interchange pass should also be able to interchange the loops (it's disabled by default, though).

> Oh also another thing that's important:
>
>> on top of manually interchanging the loops, which gives a ~3x speedup).
>
> Indeed, on my machine too. Although I understand that this speedup is not the important thing here, it has to be noted what GCC does with the loops interchanged:
>
> Clang:
> - Original: 2.04
> - Interchanged: 0.60
>
> G++:
> - Original: 0.43
> - Interchanged: 0.15
>
> So, GCC gets a 3x speedup too, and GCC's interchanged version is still 4x faster.

Right, but as you indicated earlier, if you manually interchange the loops first, GCC will completely remove the `run` loop while LLVM currently does not, so we are now even further removed from benchmarking the performance of the matrix array initialization code. The example certainly doesn't make LLVM look great compared to GCC. (Also, there appears to be a phase-ordering issue in GCC, because the `run` loop is not removed when GCC interchanges automatically.)
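A sketch of what "corresponding load/store pairs instead of the memcpy" could look like at the source level. This is an illustrative rewrite, not code from the benchmark, and whether Clang actually emits scalar stores rather than a memcpy intrinsic for a given form depends on the optimization pipeline.

```cpp
struct Mat4 { float v[4][4]; };

// Aggregate assignment: typically lowered via llvm.memcpy, which the
// thread identifies as blocking LICM and redundant-loop deletion.
void assign_aggregate(Mat4 &dst, const Mat4 &src) { dst = src; }

// Element-wise assignment: expressed as individual scalar loads and
// stores, which the optimizer can reason about more easily.
void assign_elementwise(Mat4 &dst, const Mat4 &src) {
  for (int r = 0; r < 4; ++r)
    for (int c = 0; c < 4; ++c)
      dst.v[r][c] = src.v[r][c];
}
```

Both functions produce identical results; the difference is only in the IR the frontend and early passes see.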
But going back to my original point: if the goal is to compare the performance of the generated code for initializing a large array of matrices, the most interesting comparison would be GCC vs. Clang with interchanging disabled for GCC.

On my system, the runtime difference between the un-interchanged versions generated by Clang with and without SROA is in the noise. That is not too surprising, I think, because the runtime is dominated by the memory writes, which hides the other, smaller codegen differences. On different processors, e.g. smaller in-order ones, the difference will be more visible.

To summarize, I think the benchmark is very interesting and highlights some important issues with how different optimizations deal with llvm.memcpy and how SROA decides to use memcpy, which would be good to address. Will those changes make loops that initialize large arrays with constant matrices 2x or 4x faster in practice? I doubt it.

Cheers,
Florian