thr3ads.net - llvm dev - [LLVMdev] AVX code gen [Dec 2013]

Ken Gahagan

2013-Dec-11 20:59 UTC

[LLVMdev] AVX code gen

Hello -

I found this post on the llvm blog:
http://blog.llvm.org/2012/12/new-loop-vectorizer.html which makes me think that
clang / llvm are capable of generating AVX with packed instructions as well as
utilizing the full width of the YMM registers. I have an environment where icc
generates these instructions (vmulps %ymm1, %ymm3, %ymm2 for example) but I
cannot get clang/llvm to generate such instructions (using the 3.3 release or
either 3.4 rc1 or 3.4 rc2). I am new to clang / llvm so I may not be invoking
the tools correctly, but given that -fvectorize and -fslp-vectorize are on by
default at 3.4 I would have thought that if the code is AVX-able by icc then
clang / llvm would be able to do the same. The code is basic matrix
multiplication written a number of ways (with and without transposition and
such) as a performance measurement exercise.
 
The environments I’ve tried are:
Intel Ivy Bridge-EX (pre-release hardware) running Red Hat Linux 6.5
Generic desktop with Haswell processor running Fedora 18
 
If you have a moment to point me to the appropriate docs, I’m happy to go learn
on my own, but I’ve now googled for the better part of 3 days trying to find
what invocation parameters I should use to get the desired use of packed AVX
instructions and the YMM registers, and I just can’t seem to get it right. I’d
also be grateful if you just sent the correct invocation.

I’ve actually started digging through the code as well - but since I am starting
from zero it could take me a while to find an answer this way - just didn’t want
you to think I’m not willing to try to find the answer on my own :-)

Thank you,
Ken

Arnold Schwaighofer

2013-Dec-12 15:57 UTC

[LLVMdev] AVX code gen

Clang is probably not targeting the right processor architecture: unless told
otherwise, it generates code for a baseline target that does not include AVX.

You could try "clang -mavx" or "clang -march=corei7-avx" for Ivy Bridge and
"clang -march=core-avx2" or "clang -mavx2" for Haswell.
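
A quick way to confirm that one of these flags actually reached the compiler, without reading assembly, is to test the predefined macros. The following is a minimal sketch, not part of the original reply: the function names `compiled_vector_bytes` and `report_vector_target` are hypothetical, and it relies only on the `__AVX__` and `__SSE2__` macros that clang, gcc and icc predefine when those instruction sets are enabled.

```c
#include <stdio.h>

/* Widest vector register size, in bytes, that this translation unit was
 * compiled to use. The macros are predefined by the compiler according to
 * the -m / -march flags it was given. */
int compiled_vector_bytes(void) {
#if defined(__AVX__)
  return 32; /* AVX or newer enabled: YMM registers (at least) are usable */
#elif defined(__SSE2__)
  return 16; /* only XMM registers are usable */
#else
  return 0;  /* no x86 vector extension enabled */
#endif
}

/* Hypothetical helper: print the result, e.g. from a benchmark's startup. */
void report_vector_target(void) {
  printf("compiled for %d-byte vectors\n", compiled_vector_bytes());
}
```

If this reports 16 on a build that was meant to use AVX, the flag most likely never reached the compile step. The same macros can also be listed directly with something like `clang -march=core-avx2 -dM -E - < /dev/null | grep AVX`.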

$ clang -march=core-avx2 -O3 -S -o - test.c

        .section        __TEXT,__text,regular,pure_instructions
        .globl  _f
        .align  4, 0x90
_f:                                     ## @f
        .cfi_startproc
## BB#0:                                ## %entry
        pushq   %rbp
Ltmp2:
        .cfi_def_cfa_offset 16
Ltmp3:
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
Ltmp4:
        .cfi_def_cfa_register %rbp
        xorl    %eax, %eax
        .align  4, 0x90
LBB0_1:                                 ## %vector.body
                                        ## =>This Inner Loop Header: Depth=1
        vmovups (%rdx,%rax,4), %ymm0
        vmulps  (%rsi,%rax,4), %ymm0, %ymm0
        vaddps  (%rdi,%rax,4), %ymm0, %ymm0
        vmovups %ymm0, (%rdi,%rax,4)
        addq    $8, %rax
        cmpq    $256, %rax              ## imm = 0x100
        jne     LBB0_1
## BB#2:                                ## %for.end
        popq    %rbp
        vzeroupper
        ret
        .cfi_endproc
$ cat test.c
void f(float * restrict A, float * restrict B, float * restrict C) {
  for (int i = 0; i < 256; ++i)
    A[i] += C[i] * B[i];
}

$ clang -v
clang version 3.5 (trunk 195376) (llvm/trunk 195372)


Best,
Arnold

Kay Tiong Khoo

2013-Dec-12 16:11 UTC

[LLVMdev] AVX code gen

For the list of CPUs and features available for the current target, you can try:
$ llvm-as < /dev/null | llc -mcpu=help
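
That command lists what the compiler can target. Whether the machine running the benchmark actually supports AVX is a separate question, and it can be checked at run time. The sketch below is an illustration only, not something from this thread: it assumes a reasonably recent gcc or clang on x86, where the `__builtin_cpu_supports` builtin is available, and the function names `host_has_avx` and `host_has_avx2` are hypothetical.

```c
#include <stdio.h>

/* Each function returns 1 if the CPU running the program supports the
 * feature, 0 if it does not, and -1 if the check is unavailable in this
 * build (non-x86 target or a compiler without the builtin). */
int host_has_avx(void) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
  return __builtin_cpu_supports("avx") ? 1 : 0;
#else
  return -1;
#endif
}

int host_has_avx2(void) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
  return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
  return -1;
#endif
}

/* Hypothetical helper: print both results. */
void report_host_features(void) {
  printf("host avx=%d avx2=%d\n", host_has_avx(), host_has_avx2());
}
```

On the two machines described in the first post, one would expect the Ivy Bridge system to report AVX only and the Haswell system to report both, which matches the -march suggestions above.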



