thr3ads.net - llvm dev - [LLVMdev] unaligned AVX store gets split into two instructions [Jul 2013]


Zach Devito

2013-Jul-10 04:01 UTC

[LLVMdev] unaligned AVX store gets split into two instructions

I'm seeing a difference in how LLVM 3.3 and 3.2 emit unaligned vector loads
on AVX.
3.3 is splitting up an unaligned vector load but in 3.2, it was emitted as
a single instruction (details below).
In a matrix-matrix inner-kernel, I see a ~25% decrease in performance,
which seems to be due to this.

Any ideas why this changed? Thanks!

Zach

LLVM Code:
define <4 x double> @vstore(<4 x double>*) {
entry:
  %1 = load <4 x double>* %0, align 8
  ret <4 x double> %1
}
------------------------------------------------------------
Running llvm-32/bin/llc vstore.ll creates:
        .section        __TEXT,__text,regular,pure_instructions
        .globl  _vstore
        .align  4, 0x90
_vstore:                                ## @vstore
        .cfi_startproc
## BB#0:                                ## %entry
        pushq   %rbp
Ltmp2:
        .cfi_def_cfa_offset 16
Ltmp3:
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
Ltmp4:
        .cfi_def_cfa_register %rbp
        vmovups (%rdi), %ymm0
        popq    %rbp
        ret
        .cfi_endproc
----------------------------------------------------------------
Running llvm-33/bin/llc vstore.ll creates:
        .section        __TEXT,__text,regular,pure_instructions
        .globl  _main
        .align  4, 0x90
_main:                                  ## @main
        .cfi_startproc
## BB#0:                                ## %entry
        vmovups (%rdi), %xmm0
        vinsertf128     $1, 16(%rdi), %ymm0, %ymm0
        ret
        .cfi_endproc


.subsections_via_symbols

Tom Stellard

2013-Jul-10 04:57 UTC


On Tue, Jul 09, 2013 at 09:01:48PM -0700, Zach Devito wrote:
> I'm seeing a difference in how LLVM 3.3 and 3.2 emit unaligned vector loads
> on AVX.
> 3.3 is splitting up an unaligned vector load but in 3.2, it was emitted as
> a single instruction (details below).
> In a matrix-matrix inner-kernel, I see a ~25% decrease in performance,
> which seems to be due to this.
> 
> Any ideas why this changed? Thanks!
>
Hi Zach,

I ran into a similar problem with the R600 backend, and I was able to fix it
by implementing TargetLowering::allowsUnalignedMemoryAccesses().
Take a look at r184822.
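
A hook of this kind can be pictured with a toy model. The sketch below is
not LLVM code: ToyTarget, ToyWideTarget, allowsUnalignedAccess, and
lowerUnalignedAccess are hypothetical names, and the real TargetLowering
hook takes a value type rather than a bit width. It only illustrates the
pattern of a lowering step asking the target whether a wide unaligned
access is legal and fast, and splitting it in half otherwise.

```cpp
#include <cassert>
#include <vector>

// Toy model only: these types are hypothetical and are NOT LLVM classes.
struct ToyTarget {
    virtual ~ToyTarget() = default;
    // Returns true if an unaligned access of `bits` width is legal as a
    // single operation; *fast reports whether it is also expected to be fast.
    // The default is conservative: not allowed, not fast.
    virtual bool allowsUnalignedAccess(unsigned bits, bool *fast) const {
        (void)bits;
        if (fast) *fast = false;
        return false;
    }
};

// A hypothetical target that handles unaligned accesses up to 256 bits well.
struct ToyWideTarget : ToyTarget {
    bool allowsUnalignedAccess(unsigned bits, bool *fast) const override {
        bool ok = bits <= 256;
        if (fast) *fast = ok;
        return ok;
    }
};

// Returns the widths (in bits) of the memory operations that this toy
// lowering step would emit for one unaligned access of `bits` width.
std::vector<unsigned> lowerUnalignedAccess(const ToyTarget &target,
                                           unsigned bits) {
    assert(bits % 2 == 0 && "toy model only splits even widths");
    bool fast = false;
    if (target.allowsUnalignedAccess(bits, &fast) && fast)
        return {bits};              // keep the access as one operation
    return {bits / 2, bits / 2};    // split into two half-width operations
}
```

With the default ToyTarget, a 256-bit access is split into two 128-bit
operations; with ToyWideTarget it stays whole. Whether the X86 split
discussed in this thread goes through such a hook is not something this
sketch claims.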

-Tom

Eli Friedman

2013-Jul-10 05:01 UTC


On Tue, Jul 9, 2013 at 9:01 PM, Zach Devito <zdevito at gmail.com> wrote:
> I'm seeing a difference in how LLVM 3.3 and 3.2 emit unaligned vector loads
> on AVX.
> 3.3 is splitting up an unaligned vector load but in 3.2, it was emitted as a
> single instruction (details below).
> In a matrix-matrix inner-kernel, I see a ~25% decrease in performance, which
> seems to be due to this.
>
> Any ideas why this changed? Thanks!

This was intentional; apparently doing it with two instructions is
supposed to be faster.  See r172868/r172894.

Adding Nadav in case he has anything more to say.

-Eli

Nadav Rotem

2013-Jul-10 05:15 UTC


Hi, 

Yes. On Sandy Bridge, 256-bit loads/stores are double-pumped.  This means that
the two halves go in one after the other, taking two cycles.  On Haswell the
memory ports are wide enough to allow a 256-bit memory operation in one cycle.
So, on Sandy Bridge we split unaligned memory operations into two 128-bit parts
to allow them to execute on two separate ports. This is also what GCC and ICC do.
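
To make the two forms concrete, here is a minimal C++ sketch using AVX
intrinsics. It is not taken from LLVM or from the code in this thread; the
function names load_single and load_split are made up for illustration, and
the target("avx") attribute assumes GCC or Clang. The first function asks
for one 256-bit unaligned load, the second builds the same value from two
128-bit unaligned loads plus an insert into the upper half.

```cpp
#include <immintrin.h>

// One 256-bit unaligned load (the vmovups ... %ymm0 form).
// Results are written through `out` to avoid passing __m256d across calls.
__attribute__((target("avx")))
void load_single(const double *p, double *out) {
    __m256d v = _mm256_loadu_pd(p);
    _mm256_storeu_pd(out, v);
}

// The split form: load the low 128 bits, then insert the high 128 bits
// (the vmovups ... %xmm0 followed by vinsertf128 $1 form).
__attribute__((target("avx")))
void load_split(const double *p, double *out) {
    __m128d lo = _mm_loadu_pd(p);       // elements 0..1
    __m128d hi = _mm_loadu_pd(p + 2);   // elements 2..3
    // Widen lo to 256 bits (upper half undefined), then fill the upper half.
    __m256d v = _mm256_insertf128_pd(_mm256_castpd128_pd256(lo), hi, 1);
    _mm256_storeu_pd(out, v);
}
```

Both functions produce the same four doubles; only the memory-access
pattern differs. Note that the compiler is free to merge or split these
loads according to its own tuning, so the emitted instructions should be
checked in the assembly output. GCC, for example, exposes this choice as
-mavx256-split-unaligned-load and -mavx256-split-unaligned-store.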

It is very possible that the decision to split the wide vectors causes a
regression.  If the memory ports are busy, it is better to double-pump them and
save the cost of the insert/extract subvector.  Unfortunately, during ISel we
don’t have a good way to estimate port pressure. In any case, it is a good idea
to revisit the heuristics that I put in and check whether they match the Sandy
Bridge optimization guide. If I remember correctly, the optimization guide does
not have much information on this, but Elena looked it over and said that it
made sense.

BTW, you can validate that this is the problem using the IACA tool. It performs
static analysis on your binary and tells you where the critical path is. 
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer

Thanks,
Nadav


Ondřej Bílka

2013-Jul-10 05:32 UTC


On Tue, Jul 09, 2013 at 09:01:48PM -0700, Zach Devito wrote:
>    I'm seeing a difference in how LLVM 3.3 and 3.2 emit unaligned vector
>    loads on AVX.
>    3.3 is splitting up an unaligned vector load but in 3.2, it was emitted as
>    a single instruction (details below).
>    In a matrix-matrix inner-kernel, I see a ~25% decrease in performance,
>    which seems to be due to this.
>    Any ideas why this changed? Thanks!

What is the code, and what architecture are you running on? In most loops,
splitting makes the code faster when run on Ivy Bridge; you can dig through
the Intel optimization manual for that recommendation. Perhaps this code is a
special case.


-- 

fat electrons in the lines
