hameeza ahmed via llvm-dev
2017-Jun-24 02:02 UTC
[llvm-dev] AVX Scheduling and Parallelism
Hello,

After generating AVX code for a large number of iterations, I realized that it still uses only two registers, zmm0 and zmm1, even when the loop unroll factor is 1024.

I wonder whether this register allocation allows operations to run in parallel?

Also, I know all the elements within a single vector instruction are computed in parallel, but are the elements of multiple instructions computed in parallel as well? For example, are two vmov instructions with different registers executed in parallel? It seems they could be, since each core has an AVX unit. Does the compiler exploit this?

Secondly, I am generating assembly for Intel, and there are offsets such as the rip register or constant additions in the memory index. Why is that?

e.g. 1

        vmovdqu32  zmm0, zmmword ptr [rip + c]
        vpaddd     zmm0, zmm0, zmmword ptr [rip + b]
        vmovdqu32  zmmword ptr [rip + a], zmm0
        vmovdqu32  zmm0, zmmword ptr [rip + c+64]
        vpaddd     zmm0, zmm0, zmmword ptr [rip + b+64]

and

e.g. 2

        mov        rax, -393216
        .p2align   4, 0x90
.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        vmovdqu32  zmm1, zmmword ptr [rax + c+401344]    ; load c+401344 into zmm1
        vmovdqu32  zmm0, zmmword ptr [rax + c+401280]    ; load c+401280 into zmm0
        vpaddd     zmm1, zmm1, zmmword ptr [rax + b+401344]  ; zmm1 <- zmm1 + b+401344
        vmovdqu32  zmmword ptr [rax + a+401344], zmm1    ; store zmm1 to a+401344
        vmovdqu32  zmm1, zmmword ptr [rax + c+401216]    ; load c+401216 into zmm1
        vpaddd     zmm0, zmm0, zmmword ptr [rax + b+401280]  ; zmm0 <- zmm0 + b+401280
        vmovdqu32  zmmword ptr [rax + a+401280], zmm0    ; store zmm0 to a+401280
        vmovdqu32  zmm0, zmmword ptr [rax + c+401152]
        ........

In the remaining instructions, too, only zmm0 and zmm1 are used. As the examples above show, more registers could have been used. I also wonder whether the repeating instructions in e.g. 2 are executed in parallel, and why only zmm0 and zmm1 are reused. Couldn't more zmm registers be used, with all the independent instructions executing in parallel? In the example above, the zmm1 chain has dependencies within itself and the zmm0 chain has dependencies within itself, so the instructions within each chain cannot execute in parallel, but the zmm0 chain and the zmm1 chain could execute in parallel with each other?

Please correct me if I am wrong.
It is possible that the scheduling here is constrained by pointer-aliasing assumptions. Could you share the source for the loop in question?

RIP-relative indexing, as I recall, is a feature of position-independent code. Based on what's below, it might cause problems by making the instruction encodings large. cc'ing some Intel folks for further comments.

 -Hal

On 06/23/2017 09:02 PM, hameeza ahmed via llvm-dev wrote:
> [...]

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
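As an illustration of what "pointer-aliasing assumptions" can mean here, a minimal C sketch follows. The function names add_globals and add_ptrs are invented for the example and are not taken from the original code. With global arrays, as in the posted loop, the compiler can see that a, b and c never overlap; with plain pointer parameters it would have to assume the stores may overlap the loads, which constrains how freely the loads and stores can be reordered and interleaved, unless restrict restores the no-overlap guarantee.

    /* Sketch only: illustrates why aliasing assumptions matter for scheduling. */
    #define N 100351

    int a[N], b[N], c[N];

    /* Globals: distinct arrays cannot overlap, so loads and stores from
     * different unroll instances can be freely interleaved. */
    void add_globals(void) {
      for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
    }

    /* Pointers: without 'restrict' the compiler must allow for overlap
     * between dst and x/y; 'restrict' tells it they do not alias. */
    void add_ptrs(int *restrict dst, const int *restrict x,
                  const int *restrict y, int n) {
      for (int i = 0; i < n; i++)
        dst[i] = x[i] + y[i];
    }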
hameeza ahmed via llvm-dev
2017-Jun-24 02:25 UTC
[llvm-dev] Fwd: AVX Scheduling and Parallelism
---------- Forwarded message ----------
From: hameeza ahmed <hahmed2305 at gmail.com>
Date: Sat, Jun 24, 2017 at 7:21 AM
Subject: Re: [llvm-dev] AVX Scheduling and Parallelism
To: Hal Finkel <hfinkel at anl.gov>

int a[100351], b[100351], c[100351];

void foo() {
  int i;
  for (i = 0; i < 100351; i++) {
    a[i] = b[i] + c[i];
  }
}

On Sat, Jun 24, 2017 at 7:16 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> [...]
Craig Topper via llvm-dev
2017-Jun-24 03:22 UTC
[llvm-dev] Fwd: AVX Scheduling and Parallelism
I'll attempt to answer some of the questions.

Using only a few registers isn't by itself going to hurt performance. Modern processors implement what's known as register renaming internally. Each time a register is written, the processor allocates one of nearly 200 internal registers to hold the result of the computation. Each time a register like xmm0 is read by an instruction, the processor looks up in a table which internal register was last assigned to xmm0, and the instruction reads that internal register to get the value of xmm0. So if there are a lot of independent calculations, it's perfectly fine to use only a few registers, because the processor will remap them to a much larger set internally. Of those roughly 200 internal registers, only a few have a name (xmm0, xmm1, etc.) accessible to software at any point in the code. Having more named registers allows you to have more values "live" at a particular point in a program.

Although each core has an AVX unit, a single "thread" of a program is only able to use one core. Using the other cores requires the developer to write special code to create additional threads and give them their own work to do. The cost of moving data between cores is quite high, so these multithreaded tasks need to be fairly independent.

Within a thread, a processor can theoretically execute up to about 6 or so instructions simultaneously. But those 6 instructions are handled by units of the CPU designed for different tasks. For example, a couple of those units only handle loads and stores, while others only handle arithmetic instructions. So to execute the maximum number of instructions in parallel you need the right mix of operations. And even the units that handle arithmetic instructions aren't created equal. For example, there is only one unit that can handle a divide operation, and a divide keeps that unit busy for tens of cycles.

rip-relative addressing isn't specific to position-independent code. In fact, I believe x86-64 is always position independent because it always uses rip-relative addressing. There are only a few instructions that can take a full 64-bit constant address, so rip-relative addressing is used for things like global variables and constant tables. The assumption is that the code and data are within 2 GB of each other in the address space and always in the same layout, so you can use a fixed offset from the current instruction pointer (the "ip" in rip stands for instruction pointer) to access the data. This results in a smaller encoding than if a full 64-bit offset were needed.

~Craig

On Fri, Jun 23, 2017 at 7:25 PM, hameeza ahmed via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [...]
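To make the point about multiple cores concrete, here is a minimal sketch of spreading the posted loop across cores with OpenMP. This is an assumption about how one might do it, not something from the original thread: the function name foo_parallel is invented, and it presumes an OpenMP-capable compiler (e.g. clang with -fopenmp). Each OpenMP thread then runs on its own core and can vectorize its chunk with that core's AVX unit; a single thread, as noted above, only ever uses one core.

    /* Illustrative only: the pragma is ignored if OpenMP is not enabled,
     * so compile with something like: clang -O3 -fopenmp -march=native */
    #define N 100351
    int a[N], b[N], c[N];

    void foo_parallel(void) {
      /* Split the iteration space across threads; each thread's chunk is
       * still vectorized with AVX on its own core. */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
    }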
Rackover, Zvi via llvm-dev
2017-Jun-25 12:14 UTC
[llvm-dev] AVX Scheduling and Parallelism
Hi Ahmed,

From what can be seen in the code snippet you provided, the reuse of ZMM0 and ZMM1 across loop-unroll instances does not inhibit instruction-level parallelism. Modern x86 processors use register renaming, which can eliminate these dependencies in the instruction stream. In the example you provided, the processor should be able to identify the 2-vload + vadd + vstore sequences as independent and pipeline their execution.

Thanks,
Zvi

From: Hal Finkel [mailto:hfinkel at anl.gov]
Sent: Saturday, June 24, 2017 05:17
To: hameeza ahmed <hahmed2305 at gmail.com>; llvm-dev at lists.llvm.org
Cc: Demikhovsky, Elena <elena.demikhovsky at intel.com>; Rackover, Zvi <zvi.rackover at intel.com>; Breger, Igor <igor.breger at intel.com>; craig.topper at gmail.com
Subject: Re: [llvm-dev] AVX Scheduling and Parallelism

[...]
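For comparison, a small C sketch of the distinction being described here; the function names sum_single_chain and sum_four_chains are invented for the example. What limits instruction-level parallelism is true data dependencies, not how many architectural register names appear in the code. The zmm0/zmm1 reuse in the posted loop behaves like the independent partial sums below, because each reuse begins with a fresh load and renaming gives it a fresh physical register; a genuine accumulation chain, by contrast, stays serial no matter how many registers are available.

    /* Sketch: in sum_single_chain every add depends on the previous one
     * through 'acc', so the adds execute serially regardless of how many
     * physical registers exist. In sum_four_chains the four partial sums
     * are independent, so the out-of-order core can keep several adds in
     * flight at once - the same effect renaming gives the unrolled
     * zmm0/zmm1 sequences in the assembly above. */
    #define N 100352   /* multiple of 4 for simplicity; illustrative only */

    long sum_single_chain(const int *x) {
      long acc = 0;
      for (int i = 0; i < N; i++)
        acc += x[i];               /* one long dependency chain */
      return acc;
    }

    long sum_four_chains(const int *x) {
      long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
      for (int i = 0; i < N; i += 4) {
        s0 += x[i];                /* four independent chains */
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
      }
      return s0 + s1 + s2 + s3;
    }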