Displaying 3 results from an estimated 3 matches for "vmovdqu32".
2017 Jun 25
2
AVX Scheduling and Parallelism
...ike are 2 vmov with different registers executed in parallel? It could be, because each core has an AVX unit. Does the compiler exploit it?
Secondly, I am generating assembly for Intel, and there are some offsets, like the rip register or a constant added to the memory index. Why is that so?
eg. 1
vmovdqu32 zmm0, zmmword ptr [rip + c]
vpaddd zmm0, zmm0, zmmword ptr [rip + b]
vmovdqu32 zmmword ptr [rip + a], zmm0
vmovdqu32 zmm0, zmmword ptr [rip + c+64]
vpaddd zmm0, zmm0, zmmword ptr [rip + b+64]
and
eg. 2
mov rax...
2017 Jun 24
4
AVX Scheduling and Parallelism
...parallel? Like, are 2 vmov with different registers executed in
parallel? It could be, because each core has an AVX unit. Does the compiler
exploit it?
Secondly, I am generating assembly for Intel, and there are some offsets, like
the rip register or a constant added to the memory index. Why is that so?
eg. 1
vmovdqu32 zmm0, zmmword ptr [rip + c]
vpaddd zmm0, zmm0, zmmword ptr [rip + b]
vmovdqu32 zmmword ptr [rip + a], zmm0
vmovdqu32 zmm0, zmmword ptr [rip + c+64]
vpaddd zmm0, zmm0, zmmword ptr [rip + b+64]
and
eg. 2
mov rax, -393216
.p2align 4, 0x90
.LBB0_1: # %vector.body...
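For readers following the question above, here is a minimal sketch of the kind of C source that could produce assembly like eg. 1. The original loop is not shown in the excerpt, so the array length N and the function names add_arrays and vector_byte_offset are hypothetical. The assumption is an element-wise sum over three global int arrays. Globals on x86-64 are commonly addressed relative to the instruction pointer, which would explain [rip + c], and each zmm register holds 64 bytes (16 ints), which would explain the +64 in the second, unrolled copy of the load and add.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical array length; the original post does not give it. */
#define N 1024

/* Global arrays, so a compiler may address them as [rip + symbol]. */
int a[N], b[N], c[N];

/* Element-wise sum. With AVX-512 enabled, a vectorizing compiler can
   process 16 ints (one 64-byte zmm register) per vpaddd. */
void add_arrays(void)
{
    for (size_t i = 0; i < N; ++i)
        a[i] = b[i] + c[i];
}

/* Byte offset of the k-th 16-int vector inside one of the arrays.
   For k = 1 and 4-byte ints this is the "+64" seen in the listing. */
size_t vector_byte_offset(size_t k)
{
    assert(k < N / 16);
    return k * 16 * sizeof(int);
}
```

Compiling something like this with optimization and an AVX-512 target flag (for example -mavx512f) and asking for Intel-syntax assembly output should give a loop of the same general shape, though the exact instructions and unroll factor depend on the compiler and version.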
2017 Jun 25
0
AVX Scheduling and Parallelism
Hi, Zvi,
I agree. In the context of targeting the KNL, however, I'm a bit
concerned about the addressing, and specifically, the size of the
resulting encoding:
> vmovdqu32 zmm0, zmmword ptr [rax + c+401280] ; load b[401280] in zmm0
>
> vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344] ; zmm1<-zmm1+b[401344]
The KNL can only deliver 16 bytes per cycle from the icache to the
decoder. Essentially all of the inst...
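To make the encoding-size concern concrete, here is a rough sketch of the arithmetic. It is not taken from the thread. It assumes an EVEX-encoded instruction with a one-byte opcode (as vmovdqu32 and vpaddd have), a plain [rax + disp] memory operand with no SIB byte, and a 64-byte zmm memory operand, for which EVEX can compress a displacement into one byte when it is a multiple of 64 in the range -8192 to 8128. The function names are hypothetical. A symbolic displacement such as c+401280 is resolved by the linker, so in practice it is emitted as a full 32-bit displacement.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Field sizes in bytes for the assumed instruction shape. */
enum { EVEX_PREFIX = 4, OPCODE = 1, MODRM = 1, DISP8 = 1, DISP32 = 4 };

/* True if disp can use the EVEX compressed disp8*N form with N = 64
   (full 64-byte zmm memory operand). */
bool fits_disp8_x64(long disp)
{
    return disp % 64 == 0 && disp / 64 >= -128 && disp / 64 <= 127;
}

/* Estimated length of an EVEX instruction with a [rax + disp] operand.
   Assumes base register rax, no SIB byte, no extra prefixes. */
int evex_mem_insn_len(long disp)
{
    assert(disp >= INT_MIN && disp <= INT_MAX);
    int len = EVEX_PREFIX + OPCODE + MODRM;   /* 6 bytes with no displacement */
    if (disp == 0)
        return len;
    return len + (fits_disp8_x64(disp) ? DISP8 : DISP32);
}

/* How many whole instructions of a given length fit in one fetch window. */
int insns_per_fetch(int fetch_bytes, int insn_len)
{
    return fetch_bytes / insn_len;
}
```

Under these assumptions, a displacement like 401280 needs the 4-byte form, so each such instruction is about 10 bytes, and only one whole instruction fits in a 16-byte fetch. A small displacement such as 64 would give a 7-byte instruction, and two of those fit.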