thr3ads.net - llvm dev - [LLVMdev] unaligned AVX store gets split into two instructions [Jul 2013]


Nadav Rotem

2013-Jul-10 05:15 UTC

[LLVMdev] unaligned AVX store gets split into two instructions

Hi, 

Yes. On Sandybridge, 256-bit loads/stores are double pumped.  This means that
they go in one after the other, over two cycles.  On Haswell the memory ports are
wide enough to allow a 256-bit memory operation in one cycle.  So, on Sandybridge
we split unaligned memory operations into two 128-bit parts to allow them to
execute on two separate ports. This is also what GCC and ICC do.
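
As a rough illustration of the split described above, here is a portable C++ sketch that models one 256-bit unaligned load and the equivalent pair of 128-bit loads. It is not code from LLVM or from the attached kernel; the `Vec4f`/`Vec8f` types and the function names are hypothetical, and the instruction names in the comments (`vmovups`, `vinsertf128`) are only the forms the two variants would roughly correspond to on AVX. The sketch shows that both forms yield the same value; which one is faster depends on the microarchitecture and on port pressure, as discussed in this message.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for a 128-bit (xmm) and a 256-bit (ymm) register.
struct Vec4f { float lane[4]; };
struct Vec8f { float lane[8]; };

// One 32-byte access: models a single unaligned 256-bit load
// (roughly a vmovups into a ymm register).
Vec8f load256_single(const float* src) {
    assert(src != nullptr);
    Vec8f v;
    std::memcpy(v.lane, src, sizeof v.lane);
    return v;
}

// Two 16-byte accesses: models the split form, a 128-bit load of the low
// half followed by inserting the high half into the upper lanes
// (roughly a vmovups into an xmm register plus a vinsertf128).
Vec8f load256_split(const float* src) {
    assert(src != nullptr);
    Vec4f lo, hi;
    std::memcpy(lo.lane, src, sizeof lo.lane);      // low 128 bits
    std::memcpy(hi.lane, src + 4, sizeof hi.lane);  // high 128 bits
    Vec8f v;
    std::memcpy(v.lane, lo.lane, sizeof lo.lane);
    std::memcpy(v.lane + 4, hi.lane, sizeof hi.lane);  // the "insert subvector" step
    return v;
}

// True when both forms produced identical lanes.
bool same_lanes(const Vec8f& a, const Vec8f& b) {
    return std::memcmp(a.lane, b.lane, sizeof a.lane) == 0;
}
```

Calling both functions on a pointer that is offset by one float from a 32-byte-aligned buffer and comparing the results with `same_lanes` confirms that the split changes only how the access is issued, not what is loaded.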

It is very possible that the decision to split the wide vectors causes a
regression.  If the memory ports are busy it is better to double-pump them and
save the cost of the insert/extract subvector.  Unfortunately, during ISel we
don’t have a good way to estimate port pressure. In any case, it is a good idea
to revise the heuristics that I put in and to see whether they match the
Sandybridge optimization guide. If I remember correctly, the optimization guide
does not have much information on this, but Elena looked over it and said that
it made sense.

BTW, you can validate that this is the problem using the IACA tool. It performs
static analysis on your binary and tells you where the critical path is. 
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer

Thanks,
Nadav

On Jul 9, 2013, at 10:01 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
> On Tue, Jul 9, 2013 at 9:01 PM, Zach Devito <zdevito at gmail.com> wrote:
>> I'm seeing a difference in how LLVM 3.3 and 3.2 emit unaligned vector loads
>> on AVX.
>> 3.3 is splitting up an unaligned vector load but in 3.2, it was emitted as a
>> single instruction (details below).
>> In a matrix-matrix inner-kernel, I see a ~25% decrease in performance, which
>> seems to be due to this.
>>
>> Any ideas why this changed? Thanks!
>
> This was intentional; apparently doing it with two instructions is
> supposed to be faster.  See r172868/r172894.
>
> Adding Nadav in case he has anything more to say.
>
> -Eli

Zach Devito

2013-Jul-10 06:33 UTC

Thanks for all the info! I'm still in the process of narrowing down the
performance difference in my code. I'm no longer convinced it's related
only to the unaligned loads/stores, since extracting this part of the
kernel makes the performance difference disappear.  I will try to narrow
down what is going on, and if it seems related to LLVM, I will post an example.
Thanks again,

Zach


Demikhovsky, Elena

2013-Jul-10 07:50 UTC

Send me a pointer to the code and I'll check performance on our workloads.

- Elena


Zach Devito

2013-Jul-10 09:12 UTC

I've narrowed this down to a single kernel (kernel.ll), which does a
fixed-size matrix-matrix multiply:

# ~/llvm-32-final/bin/llc kernel.ll -o kernel32.s
# ~/llvm-33-final/bin/llc kernel.ll -o kernel33.s
# ~/llvm-32-final/bin/clang++ harness.cpp kernel32.s -o harness32
# ~/llvm-32-final/bin/clang++ harness.cpp kernel33.s -o harness33
# time ./harness32
real 0m0.584s
user 0m0.581s
sys 0m0.001s
# time ./harness33
real 0m0.730s
user 0m0.725s
sys 0m0.001s
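
Worked out from the `real` times above, the 3.3 build takes about 25% longer than the 3.2 build (0.730 / 0.584 = 1.25), in line with the roughly 25% figure reported at the start of the thread. A minimal C++ sketch of that arithmetic follows; the helper name `relative_slowdown` is hypothetical and not part of the attached harness.

```cpp
#include <cassert>

// Relative slowdown of the slower run versus the faster one;
// a result of 0.25 means the slower run took 25% more time.
double relative_slowdown(double fast_seconds, double slow_seconds) {
    assert(fast_seconds > 0.0);
    return slow_seconds / fast_seconds - 1.0;
}
```

For example, `relative_slowdown(0.584, 0.730)` evaluates to approximately 0.25.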

If you look at kernel33.s, it has a register spill/reload in the inner
loop. This doesn't appear in the LLVM 3.2 version, and it disappears from the
3.3 version if you remove the "align 8"s from kernel.ll, which are what make
the accesses unaligned.  Do the two-instruction unaligned loads increase
register pressure? Or is something else going on?

Zach

Attachments:

harness.cpp (text/x-c++src, 346 bytes)
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130710/86bbc835/attachment.cpp>

kernel.ll (application/octet-stream, 6787 bytes)
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130710/86bbc835/attachment.obj>
