thr3ads.net - llvm dev - [llvm-dev] AVX 512 Assembly Code Generation issues [Jun 2017]

hameeza ahmed via llvm-dev

2017-Jun-21 13:16 UTC

[llvm-dev] AVX 512 Assembly Code Generation issues

When I generate code for a loop with 72 iterations, the compiler uses
AVX-512 zmm operations 4 times (16 x 4 = 64 iterations), and the remaining
8 iterations are handled by ordinary scalar mov operations using the EAX
register. Wouldn't it be better to use ymm for the remaining 8 iterations,
as it does when the iteration count is between 8 and 15? The same goes for
xmm, and so on.
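
To make the arithmetic concrete, here is a minimal sketch in C of the split being asked about. It only counts chunks and says nothing about what LLVM actually emits. The names `split_trip_count` and `struct chunks` are hypothetical, introduced for illustration, and the sketch assumes 32-bit elements (16, 8 and 4 lanes for zmm, ymm and xmm).

```c
/* Hypothetical helper, for illustration only: split a trip count over
   32-bit elements into 16-lane (zmm), 8-lane (ymm) and 4-lane (xmm)
   chunks, plus a scalar tail. */
struct chunks {
    int zmm;     /* number of 16-element chunks */
    int ymm;     /* number of 8-element chunks  */
    int xmm;     /* number of 4-element chunks  */
    int scalar;  /* leftover scalar iterations  */
};

static struct chunks split_trip_count(int n)
{
    struct chunks c;
    c.zmm = n / 16;  n %= 16;   /* widest vectors first */
    c.ymm = n / 8;   n %= 8;    /* then half-width      */
    c.xmm = n / 4;   n %= 4;    /* then quarter-width   */
    c.scalar = n;               /* whatever is left     */
    return c;
}
```

For a trip count of 72 this gives 4 zmm chunks and 1 ymm chunk, with nothing left over for xmm or scalar code.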


Please correct me if I am wrong.


Thank You

On Jun 21, 2017 12:21 AM, "hameeza ahmed" <hahmed2305 at
gmail.com> wrote:
> Hello,
>
> I am using LLVM on my Core i7 laptop, which has no AVX support.
>
> My goal is to generate AVX-512 code (loop vectorization) for Knights
> Landing/Skylake.
>
>
>
> My .c code is:
>
> int a[256], b[256], c[256];
>
> void foo(void) {
>     int i;
>     for (i = 0; i < 256; i++) {
>         a[i] = b[i] + c[i];
>     }
> }
>
> I first generated its .ll file via clang:
>
> clang -S -emit-llvm test.c -o test.ll
>
> Then I optimized it:
>
> opt -S -O3 test.ll -o test_o3.ll
>
> Then I used llc for code generation:
>
> llc -mcpu=skylake-avx512 -mattr=+avx512f test_o3.ll -o test_o3.s
>
> llc -mcpu=knl -mattr=+avx512f test_o3.ll -o test_o3.s
>
>
> Here is my generated code:
>
>
>
> .text
> .file "filer_o3.ll"
> .globl foo
> .p2align 4, 0x90
> .type foo,@function
> foo:                                    # @foo
> .cfi_startproc
> # BB#0:                                 # %min.iters.checked
> pushq %rbp
> .Ltmp0:
> .cfi_def_cfa_offset 16
> .Ltmp1:
> .cfi_offset %rbp, -16
> movq %rsp, %rbp
> .Ltmp2:
> .cfi_def_cfa_register %rbp
> movq $-1024, %rax            # imm = 0xFC00
> .p2align 4, 0x90
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header: Depth=1
> vmovdqa32 c+1024(%rax), %xmm0
> vmovdqa32 c+1040(%rax), %xmm1
> vpaddd b+1024(%rax), %xmm0, %xmm0
> vpaddd b+1040(%rax), %xmm1, %xmm1
> vmovdqa32 %xmm0, a+1024(%rax)
> vmovdqa32 %xmm1, a+1040(%rax)
> vmovdqa32 c+1056(%rax), %xmm0
> vmovdqa32 c+1072(%rax), %xmm1
> vpaddd b+1056(%rax), %xmm0, %xmm0
> vpaddd b+1072(%rax), %xmm1, %xmm1
> vmovdqa32 %xmm0, a+1056(%rax)
> vmovdqa32 %xmm1, a+1072(%rax)
> addq $64, %rax
> jne .LBB0_1
> # BB#2:                                 # %middle.block
> popq %rbp
> retq
> .Lfunc_end0:
> .size foo, .Lfunc_end0-foo
> .cfi_endproc
>
> .type b,@object               # @b
> .comm b,1024,16
> .type c,@object               # @c
> .comm c,1024,16
> .type a,@object               # @a
> .comm a,1024,16
>
> .ident "clang version 3.9.0 (tags/RELEASE_390/final)"
> .section ".note.GNU-stack","",@progbits
>
> The generated code uses vmov... instructions, but only with xmm
> registers; there are no zmm registers.
>
>
> Can you please point out where I am going wrong? I have tried several
> times with different parameters, but I always get xmm registers.
>
>
> Thank You

Friedman, Eli via llvm-dev

2017-Jun-22 17:58 UTC

On 6/21/2017 6:16 AM, hameeza ahmed via llvm-dev wrote:
> when i generate code with 72 loop iterations.
>
> the compiler generates code with using avx512 zmm operations 4 times 
> (16x4=64) and remaining 8 iterations are handled by routine mov 
> operations with EAX register. wouldn't it be better if it uses ymm for 
> remaining 8 iterations as it does when iteration count is between 8 
> and 15. same for xmm and so on.
See http://lists.llvm.org/pipermail/llvm-dev/2017-February/110424.html .

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project

Serge Preis via llvm-dev

2017-Jun-23 04:03 UTC

Thank you for the reference. Very interesting read!

There are a couple of questions, though:
- What is the implementation status of this effort?
- I didn't find anything on masked low-trip/remainder vectorization in what
  I've read. I believe that for AVX512 and other masking-enabled targets
  (e.g. VPU), masking may be the preferred technique for low trip count
  vectorization.

By masked low-trip vectorization I mean something along the lines of the
following transformation:

for (i = 0; i < smallN; ++i) {
    op;
}

Transformed to:

for (i = 0; i < round_up_to_multiple_of_VL(smallN, VL); ++i) {
    if (i < smallN)
        op;
}

And then vectorize by VL with a mask.

Where (assuming VL is a small power of 2):

round_up_to_multiple_of_VL(x, constexpr VL) {
    return (x + VL - 1) & ~(VL - 1);
}

Thank you,
Serge.

23.06.2017, 01:14, "Friedman, Eli via llvm-dev" <llvm-dev@lists.llvm.org>:
> On 6/21/2017 6:16 AM, hameeza ahmed via llvm-dev wrote:
>> when i generate code with 72 loop iterations.
>>
>> the compiler generates code with using avx512 zmm operations 4 times
>> (16x4=64) and remaining 8 iterations are handled by routine mov
>> operations with EAX register. wouldn't it be better if it uses ymm for
>> remaining 8 iterations as it does when iteration count is between 8
>> and 15. same for xmm and so on.
> See http://lists.llvm.org/pipermail/llvm-dev/2017-February/110424.html .
>
> -Eli
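
The masked low-trip transformation described in Serge's message can be modelled in plain scalar C, as a rough sketch rather than anything LLVM generates. The inner loop stands in for a single masked vector operation, and the `if` plays the role of the per-lane mask bit. The names `round_up_to_multiple` and `masked_add`, and the choice of `VL` as 16 (one zmm register of 32-bit elements), are assumptions made for illustration.

```c
#include <stddef.h>

#define VL 16  /* assumed vector length: 16 x 32-bit lanes (one zmm) */

/* Round x up to the next multiple of vl; vl must be a power of two. */
static size_t round_up_to_multiple(size_t x, size_t vl)
{
    return (x + vl - 1) & ~(vl - 1);
}

/* Scalar model of a masked vector add over n elements.
   The outer loop advances one "vector" (VL lanes) at a time; the inner
   loop models the lanes of that vector, and lanes whose index is >= n
   are masked off and leave memory untouched. */
static void masked_add(int *a, const int *b, const int *c, size_t n)
{
    size_t padded = round_up_to_multiple(n, VL);
    for (size_t i = 0; i < padded; i += VL) {
        for (size_t lane = 0; lane < VL; ++lane) {
            if (i + lane < n)  /* mask bit for this lane */
                a[i + lane] = b[i + lane] + c[i + lane];
        }
    }
}
```

With n = 72 the padded trip count is 80, so the model runs five vector steps, and the last one has only its first 8 lanes enabled.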
