thr3ads.net - llvm dev - [llvm-dev] AVX Scheduling and Parallelism [Jun 2017]

Rackover, Zvi via llvm-dev

2017-Jun-25 12:14 UTC

[llvm-dev] AVX Scheduling and Parallelism

Hi Ahmed,
From what can be seen in the code snippet you provided, the reuse of ZMM0
and ZMM1 across loop-unroll instances does not inhibit instruction-level
parallelism. Modern x86 processors use register renaming, which can eliminate
the false (write-after-write and write-after-read) dependencies in the
instruction stream. In the example you provided, the processor should be
able to identify the 2-vloads + vadd + vstore sequences as independent and
pipeline their execution.

Thanks, Zvi
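
The renaming argument above can be illustrated with a small model. This is only a sketch under stated assumptions: the `rename` helper and the `p0`, `p1`, ... physical register names are hypothetical, memory dependencies are ignored, and the final store is assumed to continue the pattern of eg. 1. It is not a description of how any particular processor implements renaming.

```python
from itertools import count


def rename(instructions):
    """Give every register write a fresh physical register.

    Each instruction is (dest, sources), where dest is an architectural
    register name or None (for a store) and sources is a tuple of
    architectural registers that are read.
    """
    fresh = count()
    current = {}  # architectural register -> physical register with its latest value
    renamed = []
    for dest, sources in instructions:
        # Reads use whichever physical register currently holds the value.
        phys_sources = tuple(current[s] for s in sources)
        phys_dest = None
        if dest is not None:
            # A new physical register per write means a later write to zmm0
            # does not wait for earlier readers or writers of zmm0.
            phys_dest = f"p{next(fresh)}"
            current[dest] = phys_dest
        renamed.append((phys_dest, phys_sources))
    return renamed


# Two unrolled copies of load / add / store, both written with zmm0.
program = [
    ("zmm0", ()),         # vmovdqu32 zmm0, [c]
    ("zmm0", ("zmm0",)),  # vpaddd    zmm0, zmm0, [b]
    (None, ("zmm0",)),    # vmovdqu32 [a], zmm0
    ("zmm0", ()),         # vmovdqu32 zmm0, [c+64]
    ("zmm0", ("zmm0",)),  # vpaddd    zmm0, zmm0, [b+64]
    (None, ("zmm0",)),    # vmovdqu32 [a+64], zmm0 (assumed continuation)
]

renamed = rename(program)
first_copy = {r for d, s in renamed[:3] for r in (d, *s) if r}
second_copy = {r for d, s in renamed[3:] for r in (d, *s) if r}
print(renamed)
print("physical registers shared by the two copies:", first_copy & second_copy)
```

In this model the two copies touch disjoint sets of physical registers, so only the true dependencies inside each load/add/store chain remain.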

From: Hal Finkel [mailto:hfinkel at anl.gov]
Sent: Saturday, June 24, 2017 05:17
To: hameeza ahmed <hahmed2305 at gmail.com>; llvm-dev at lists.llvm.org
Cc: Demikhovsky, Elena <elena.demikhovsky at intel.com>; Rackover, Zvi
<zvi.rackover at intel.com>; Breger, Igor <igor.breger at intel.com>;
craig.topper at gmail.com
Subject: Re: [llvm-dev] AVX Scheduling and Parallelism


It is possible that the scheduling is constrained by pointer-aliasing
assumptions. Could you share the source for the loop in question?

RIP-relative addressing, as I recall, is a feature of position-independent code.
Based on what's below, it might cause problems by making the instruction
encodings large. cc'ing some Intel folks for further comments.

 -Hal
On 06/23/2017 09:02 PM, hameeza ahmed via llvm-dev wrote:
Hello,

After generating AVX code for a large number of iterations, I realized that it
still uses only 2 registers, zmm0 and zmm1, even when the loop unroll factor is
1024.

I wonder whether this register allocation allows operations to run in parallel.

Also, I know that all the elements within a single vector instruction are
computed in parallel, but are the elements of multiple instructions computed in
parallel? For example, are 2 vmov instructions with different registers executed
in parallel? That seems possible because each core has an AVX unit. Does the
compiler exploit it?


Secondly, I am generating assembly for Intel, and some memory operands contain
an offset such as the rip register or a constant added to the memory index. Why
is that so?
eg.1

            vmovdqu32     zmm0, zmmword ptr [rip + c]
            vpaddd            zmm0, zmm0, zmmword ptr [rip + b]
            vmovdqu32     zmmword ptr [rip + a], zmm0
            vmovdqu32     zmm0, zmmword ptr [rip + c+64]
            vpaddd            zmm0, zmm0, zmmword ptr [rip + b+64]


and

eg. 2

            mov     rax, -393216
            .p2align           4, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
            vmovdqu32     zmm1, zmmword ptr [rax + c+401344]         ; load c[401344] in zmm1
            vmovdqu32     zmm0, zmmword ptr [rax + c+401280]         ; load c[401280] in zmm0
            vpaddd        zmm1, zmm1, zmmword ptr [rax + b+401344]   ; zmm1<-zmm1+b[401344]
            vmovdqu32     zmmword ptr [rax + a+401344], zmm1         ; store zmm1 in a[401344]
            vmovdqu32     zmm1, zmmword ptr [rax + c+401216]
            vpaddd        zmm0, zmm0, zmmword ptr [rax + b+401280]   ; zmm0<-zmm0+b[401280]
            vmovdqu32     zmmword ptr [rax + a+401280], zmm0         ; store zmm0 in a[401280]
            vmovdqu32     zmm0, zmmword ptr [rax + c+401152]
........ (the remaining instructions also use only zmm0 and zmm1)
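
On the question of the constants in the memory operands: in a symbol-plus-constant operand the constant is a byte displacement from the symbol, not an element index. The following sketch only does the arithmetic on the numbers shown in eg. 2. The helper name `effective_offset` is made up for illustration, and the value of rax is taken from the single `mov` shown, since the loop increment is not part of the posted listing.

```python
ZMM_BYTES = 64   # a zmm register is 512 bits wide
DWORD_BYTES = 4  # vmovdqu32 / vpaddd operate on 32-bit elements


def effective_offset(index_reg_value, displacement):
    """Byte offset from the array symbol for an operand like [rax + c+disp]."""
    return index_reg_value + displacement


rax_initial = -393216                             # from "mov rax, -393216"
displacements = [401344, 401280, 401216, 401152]  # displacements shown in eg. 2

for disp in displacements:
    off = effective_offset(rax_initial, disp)
    print(disp, "->", off, "bytes from the symbol,",
          "dword element", off // DWORD_BYTES,
          "vector", off // ZMM_BYTES)
```

Consecutive displacements differ by 64, which is the size of one zmm register, so each unrolled copy of the loop body addresses the neighbouring 16-element vector. With the initial value of rax, the first operand `[rax + c+401344]` refers to byte 8128 of `c`.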

As the examples above show, more registers could be used. I also doubt whether
the repeating sets of instructions in eg. 2 are executed in parallel. Why are
zmm0 and zmm1 reused? Couldn't more zmm registers be used, with all the
instructions that have no dependencies running in parallel? For example, in the
example above the blue instructions depend on each other and the red
instructions depend on each other, so instructions of the same color cannot be
executed in parallel, but can the blue and red ones be executed in parallel
with each other?



Please correct me if I am wrong.






_______________________________________________

LLVM Developers mailing list

llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>

http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev



--

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory
---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Hal Finkel via llvm-dev

2017-Jun-25 14:23 UTC


[llvm-dev] AVX Scheduling and Parallelism

Hi, Zvi,

I agree. In the context of targeting the KNL, however, I'm a bit 
concerned about the addressing, and specifically, the size of the 
resulting encoding:
> vmovdqu32     zmm0, zmmword ptr [rax + c+401280]          ; load c[401280] in zmm0
> vpaddd        zmm1, zmm1, zmmword ptr [rax + b+401344]    ; zmm1<-zmm1+b[401344]
The KNL can only deliver 16 bytes per cycle from the icache to the 
decoder. Essentially all of the instructions in the loop, as we seem to 
generate it, have 10-byte encodings:

   10:    62 f1 7e 48 6f 80 00     vmovdqu32 0x0(%rax),%zmm0
   17:    00 00 00
             16: R_X86_64_32S    c+0x61f00

...

   38:    62 f1 7d 48 fe 80 00     vpaddd 0x0(%rax),%zmm0,%zmm0
   3f:    00 00 00
             3e: R_X86_64_32S    b+0x61f00
...

and since this seems like a generic feature of how we generate code, it 
seems like we can end up decoder limited (it might even be decoder 
limited for this loop). We might want to be less aggressive in generating 
complex addressing modes for the KNL. It seems like it would be better 
to materialize the base array addresses into a register to make the 
encodings shorter.

  -Hal
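
The encoding-size concern can be put into rough numbers. The sketch below counts bytes the same way the disassembly above does (4-byte EVEX prefix, one opcode byte, one ModRM byte, then the displacement) and divides the 16 bytes per cycle quoted above by the instruction length. The function names are hypothetical, no SIB byte or immediate is assumed, and the result is only an upper bound from fetch bandwidth, not a performance prediction.

```python
def evex_mem_insn_length(disp_bytes):
    """Length in bytes of an EVEX instruction with a plain [base + disp] operand.

    4-byte EVEX prefix + 1 opcode byte + 1 ModRM byte + displacement bytes.
    Assumes no SIB byte and no immediate operand.
    """
    return 4 + 1 + 1 + disp_bytes


def fetch_limited_rate(insn_length, fetch_bytes_per_cycle=16):
    """Upper bound on instructions per cycle from fetch bandwidth alone."""
    return fetch_bytes_per_cycle / insn_length


cases = [
    ("32-bit displacement (as generated)", 4),
    ("8-bit displacement", 1),
    ("no displacement", 0),
]
for label, disp_bytes in cases:
    length = evex_mem_insn_length(disp_bytes)
    rate = fetch_limited_rate(length)
    print(f"{label}: {length} bytes, at most {rate:.2f} instructions/cycle")
```

With the array address held in a base register, the operand would need at most a short displacement, which is where the shorter encodings would come from. EVEX encodings scale an 8-bit displacement by the memory operand size, so for 64-byte accesses a one-byte displacement should reach multiples of 64 within roughly 8 KB of the base.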


hameeza ahmed via llvm-dev

2017-Aug-26 18:58 UTC


[llvm-dev] AVX Scheduling and Parallelism

Hello,

I have defined 8 registers in the registerinfo.td file, in the following order:
R_0, R_1, R_2, R_3, R_4, R_5, R_6, R_7

But the generated assembly code uses only 2 of them. How can I enable it to
use all 8? Also, can I control the allocation order, for example use R_5 after
R_0, without changing registerinfo.td?

What changes are required, and in which phase: scheduling or register
allocation?



P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a], R_0
P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b+2048]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c+2048]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+2048], R_0
P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b+4096]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c+4096]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+4096], R_0
P_2048B_LOAD_DWORD R_0, Pword ptr [rip + b+6144]
P_2048B_LOAD_DWORD R_1, Pword ptr [rip + c+6144]
P_2048B_VADD R_0, R_1, R_0
P_2048B_STORE_DWORD Pword ptr [rip + a+6144], R_0

Please help. I am stuck here.

Thank You
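
One possible reading of the listing above, offered as an assumption rather than something the thread confirms: in this instruction order at most two values are live at any point, so two registers are enough and an allocator has no need to reach for R_2 to R_7. The sketch below counts peak liveness for the posted order and for a hypothetical order in which all loads are hoisted ahead of the adds. The names `b0`, `c0`, `s0`, ... are illustrative value names, not registers from the listing.

```python
def max_live_values(instructions):
    """Return the peak number of simultaneously live values.

    Each instruction is (defined_value_or_None, tuple_of_used_values).
    A value is live from its definition to its last use.
    """
    live = set()
    peak = 0
    # Walk backwards: a use makes a value live, its definition ends the range.
    for dest, uses in reversed(instructions):
        if dest is not None:
            live.discard(dest)
        live.update(uses)
        peak = max(peak, len(live))
    return peak


def group(i):
    """One load / load / add / store group, with per-group value names."""
    b, c, s = f"b{i}", f"c{i}", f"s{i}"
    return [(b, ()), (c, ()), (s, (c, b)), (None, (s,))]


# Order as posted: each group finishes before the next one starts.
as_posted = [ins for i in range(4) for ins in group(i)]

# Hypothetical schedule: all eight loads first, then the adds and stores.
loads = [ins for i in range(4) for ins in group(i)[:2]]
rest = [ins for i in range(4) for ins in group(i)[2:]]
hoisted = loads + rest

print("as posted:", max_live_values(as_posted))
print("loads hoisted:", max_live_values(hoisted))
```

Under this model the posted order needs 2 registers and the hoisted order needs 8, which suggests the number of registers used depends on the instruction order chosen before register allocation at least as much as on the allocator itself.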

On Mon, Jun 26, 2017 at 2:12 PM, hameeza ahmed <hahmed2305 at gmail.com> wrote:
> Thank You
