hameeza ahmed via llvm-dev
2017-Jun-21 13:16 UTC
[llvm-dev] AVX 512 Assembly Code Generation issues
When I generate code for a loop with 72 iterations, the compiler emits AVX-512 zmm operations 4 times (16 x 4 = 64), and the remaining 8 iterations are handled by ordinary scalar mov operations using the EAX register. Wouldn't it be better to use ymm for the remaining 8 iterations, as it does when the iteration count is between 8 and 15? The same goes for xmm, and so on.

Please correct me if I am wrong.

Thank you

On Jun 21, 2017 12:21 AM, "hameeza ahmed" <hahmed2305 at gmail.com> wrote:
> Hello,
>
> I am using LLVM on my Core i7 laptop, which has no AVX support.
>
> My goal is to generate AVX-512 code (loop vectorization) for Knights
> Landing/Skylake.
>
> My .c code is:
>
> int a[256], b[256], c[256];
> foo () {
>   int i;
>   for (i=0; i<256; i++) {
>     a[i] = b[i] + c[i];
>   }
> }
>
> I first generated its .ll file via clang:
>
> clang -S -emit-llvm test.c -o test.ll
>
> Then I optimized it:
>
> opt -S -O3 test.ll -o test_o3.ll
>
> Then I used llc for code generation:
>
> llc -mcpu=skylake-avx512 -mattr=+avx512f test_o3.ll -o test_o3.s
>
> llc -mcpu=knl -mattr=+avx512f test_o3.ll -o test_o3.s
>
> Here is my generated code:
>
>         .text
>         .file   "filer_o3.ll"
>         .globl  foo
>         .p2align        4, 0x90
>         .type   foo,@function
> foo:                                    # @foo
>         .cfi_startproc
> # BB#0:                                 # %min.iters.checked
>         pushq   %rbp
> .Ltmp0:
>         .cfi_def_cfa_offset 16
> .Ltmp1:
>         .cfi_offset %rbp, -16
>         movq    %rsp, %rbp
> .Ltmp2:
>         .cfi_def_cfa_register %rbp
>         movq    $-1024, %rax            # imm = 0xFC00
>         .p2align        4, 0x90
> .LBB0_1:                                # %vector.body
>                                         # =>This Inner Loop Header: Depth=1
>         vmovdqa32       c+1024(%rax), %xmm0
>         vmovdqa32       c+1040(%rax), %xmm1
>         vpaddd  b+1024(%rax), %xmm0, %xmm0
>         vpaddd  b+1040(%rax), %xmm1, %xmm1
>         vmovdqa32       %xmm0, a+1024(%rax)
>         vmovdqa32       %xmm1, a+1040(%rax)
>         vmovdqa32       c+1056(%rax), %xmm0
>         vmovdqa32       c+1072(%rax), %xmm1
>         vpaddd  b+1056(%rax), %xmm0, %xmm0
>         vpaddd  b+1072(%rax), %xmm1, %xmm1
>         vmovdqa32       %xmm0, a+1056(%rax)
>         vmovdqa32       %xmm1, a+1072(%rax)
>         addq    $64, %rax
>         jne     .LBB0_1
> # BB#2:                                 # %middle.block
>         popq    %rbp
>         retq
> .Lfunc_end0:
>         .size   foo, .Lfunc_end0-foo
>         .cfi_endproc
>
>         .type   b,@object               # @b
>         .comm   b,1024,16
>         .type   c,@object               # @c
>         .comm   c,1024,16
>         .type   a,@object               # @a
>         .comm   a,1024,16
>
>         .ident  "clang version 3.9.0 (tags/RELEASE_390/final)"
>         .section        ".note.GNU-stack","",@progbits
>
> The generated code does use vmov... instructions, but there are no zmm
> registers, only xmm registers.
>
> Can you please point out where I am going wrong? I have tried several
> times with different parameters, but I always get xmm registers.
>
> Thank you
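The arithmetic behind the question (72 = 4 x 16, plus a remainder of 8 that would fit one ymm operation) can be written out as a small C sketch. This only illustrates the decomposition the poster is proposing; it does not describe what LLVM's vectorizer actually does. The names `trip_split` and `split_trip_count` are hypothetical, and the lane counts assume 32-bit `int` elements (16 per zmm, 8 per ymm, 4 per xmm).

```c
#include <assert.h>
#include <stddef.h>

/* Number of 32-bit lanes per register class (assumption: int elements). */
enum { LANES_ZMM = 16, LANES_YMM = 8, LANES_XMM = 4 };

/* Hypothetical helper type: how a trip count would be covered if the
 * remainder were peeled with progressively narrower vector widths. */
struct trip_split {
    size_t zmm_iters;    /* full 512-bit iterations */
    size_t ymm_iters;    /* 0 or 1 */
    size_t xmm_iters;    /* 0 or 1 */
    size_t scalar_iters; /* 0..3 leftover elements */
};

struct trip_split split_trip_count(size_t n) {
    struct trip_split s;
    size_t rest = n;

    s.zmm_iters = rest / LANES_ZMM;  rest %= LANES_ZMM;
    s.ymm_iters = rest / LANES_YMM;  rest %= LANES_YMM;
    s.xmm_iters = rest / LANES_XMM;  rest %= LANES_XMM;
    s.scalar_iters = rest;

    /* Sanity check: the pieces must add back up to the trip count. */
    assert(s.zmm_iters * LANES_ZMM + s.ymm_iters * LANES_YMM +
           s.xmm_iters * LANES_XMM + s.scalar_iters == n);
    return s;
}
```

For the 72-iteration loop in the question, `split_trip_count(72)` gives 4 zmm iterations, 1 ymm iteration, and nothing left for xmm or scalar code.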
Friedman, Eli via llvm-dev
2017-Jun-22 17:58 UTC
[llvm-dev] AVX 512 Assembly Code Generation issues
On 6/21/2017 6:16 AM, hameeza ahmed via llvm-dev wrote:
> when i generate code with 72 loop iterations.
>
> the compiler generates code with using avx512 zmm operations 4 times
> (16x4=64) and remaining 8 iterations are handled by routine mov
> operations with EAX register. wouldn't it be better if it uses ymm for
> remaining 8 iterations as it does when iteration count is between 8
> and 15. same for xmm and so on.

See http://lists.llvm.org/pipermail/llvm-dev/2017-February/110424.html .

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Serge Preis via llvm-dev
2017-Jun-23 04:03 UTC
[llvm-dev] AVX 512 Assembly Code Generation issues
Thank you for the reference. Very interesting read!

There are a couple of questions, though:
- What is the implementation status of this effort?
- I didn't find anything on masked low-trip/remainder vectorization in what I've read. I believe that for AVX512 and other masking-enabled targets (e.g. VPU), masking may be the preferred technique for low-trip-count vectorization.

By masked low-trip vectorization I mean something along the lines of the following transformation:

    for (i = 0; i < smallN; ++i) {
      op;
    }

Transformed to:

    for (i = 0; i < round_up_to_multiple_of_VL(smallN, VL); ++i) {
      if (i < smallN)
        op;
    }

And then vectorize by VL with a mask.

Where (assuming VL is a small power of 2):

    round_up_to_multiple_of_VL(x, constexpr VL) {
      return (x + VL - 1) & ~(VL - 1);
    }

Thank you,
Serge.

23.06.2017, 01:14, "Friedman, Eli via llvm-dev" <llvm-dev@lists.llvm.org>:
> On 6/21/2017 6:16 AM, hameeza ahmed via llvm-dev wrote:
>> when i generate code with 72 loop iterations.
>>
>> the compiler generates code with using avx512 zmm operations 4 times
>> (16x4=64) and remaining 8 iterations are handled by routine mov
>> operations with EAX register. wouldn't it be better if it uses ymm for
>> remaining 8 iterations as it does when iteration count is between 8
>> and 15. same for xmm and so on.
>
> See http://lists.llvm.org/pipermail/llvm-dev/2017-February/110424.html .
>
> -Eli
>
> --
> Employee of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
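The masked low-trip transformation described in this message can be emulated in plain C to see what each vector iteration would compute. This is a minimal sketch under stated assumptions, not LLVM's implementation: `VL` is fixed at 16 (one zmm register of 32-bit ints), the inner lane loop stands in for a single masked vector operation, the `i < n` test plays the role of the lane mask, and `op` is taken to be the `a[i] = b[i] + c[i]` addition from the original question. The names `round_up_to_multiple` and `masked_add` are hypothetical. The round-up helper uses the conventional power-of-two form `(x + vl - 1) & ~(vl - 1)`.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed vector length: 16 lanes of 32-bit int, i.e. one zmm register. */
enum { VL = 16 };

/* Round x up to the next multiple of vl; vl must be a power of two. */
size_t round_up_to_multiple(size_t x, size_t vl) {
    assert(vl != 0 && (vl & (vl - 1)) == 0);
    return (x + vl - 1) & ~(vl - 1);
}

/* Scalar emulation of a masked vector loop over n elements.
 * Each pass of the outer loop corresponds to one masked vector
 * operation; lanes with index >= n are masked off and untouched. */
void masked_add(int *a, const int *b, const int *c, size_t n) {
    size_t padded = round_up_to_multiple(n, VL);

    for (size_t base = 0; base < padded; base += VL) {
        for (size_t lane = 0; lane < VL; ++lane) {
            size_t i = base + lane;
            if (i < n) {            /* lane mask: active only below n */
                a[i] = b[i] + c[i];
            }
        }
    }
}
```

With `n = 8`, the padded trip count is 16, so the whole loop is a single masked iteration in which only the low 8 lanes are active; no separate scalar remainder loop is needed.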