Displaying 8 results from an estimated 8 matches for "zmm1".
2017 Jun 24
4
AVX Scheduling and Parallelism
Hello,
After generating AVX code for a large number of iterations, I came to realize that
it still uses only 2 registers, zmm0 and zmm1, when the loop unroll
factor is 1024.
I wonder if this register allocation allows operations to run in parallel?
Also, I know all the elements within a single vector instruction are
computed in parallel, but are the elements of multiple instructions
computed in parallel? Like, are 2 vmov with different regis...
2017 Jun 25
2
AVX Scheduling and Parallelism
...ht cause problems by making the instruction encodings large. cc'ing some Intel folks for further comments.
-Hal
On 06/23/2017 09:02 PM, hameeza ahmed via llvm-dev wrote:
Hello,
After generating AVX code for a large number of iterations, I came to realize that it still uses only 2 registers, zmm0 and zmm1, when the loop unroll factor is 1024.
I wonder if this register allocation allows operations to run in parallel?
Also, I know all the elements within a single vector instruction are computed in parallel, but are the elements of multiple instructions computed in parallel? Like, are 2 vmov with different regis...
2017 Jun 25
0
AVX Scheduling and Parallelism
Hi, Zvi,
I agree. In the context of targeting the KNL, however, I'm a bit
concerned about the addressing, and specifically, the size of the
resulting encoding:
> vmovdqu32 zmm0, zmmword ptr [rax + c+401280] ; load b[401280] in zmm0
>
> vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344] ; zmm1 <- zmm1 + b[401344]
The KNL can only deliver 16 bytes per cycle from the icache to the
decoder. Essentially all of the instructions in the loop, as we seem to
generate it, have 10-byte encodings:
10: 62 f1 7e 48 6f 80 00 vmovdqu...
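The fetch-bandwidth concern above comes down to simple arithmetic, sketched here. The 16 bytes per cycle and the 10-byte encoding are taken from the message; the split of those 10 bytes (4-byte EVEX prefix, opcode, ModRM, 32-bit displacement) is an assumption read off the bytes shown, and `fetch_bound_ipc` is a hypothetical helper name.

```python
from fractions import Fraction

# From the message: bytes the icache can deliver to the decoder per cycle on KNL.
FETCH_BYTES_PER_CYCLE = 16

# Assumed breakdown of one instruction, read off the bytes "62 f1 7e 48 6f 80 ...":
# a 4-byte EVEX prefix, one opcode byte, one ModRM byte, and a 32-bit displacement.
ENCODING = {"evex_prefix": 4, "opcode": 1, "modrm": 1, "disp32": 4}


def fetch_bound_ipc(instr_bytes, fetch_bytes=FETCH_BYTES_PER_CYCLE):
    """Upper bound on instructions per cycle imposed by fetch bandwidth alone."""
    return Fraction(fetch_bytes, instr_bytes)


instr_bytes = sum(ENCODING.values())
ipc = fetch_bound_ipc(instr_bytes)
print(instr_bytes, float(ipc))
```

Under these assumptions the front end can supply at most 1.6 such instructions per cycle, so fetch alone keeps the loop below two instructions per cycle.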
2017 Jul 01
2
KNL Assembly Code for Matrix Multiplication
Thank you,
It means vmovdqa64 zmm22, zmmword ptr [rip + .LCPI0_0] # zmm22 =
[8,9,10,11,12,13,14,15] loads 64-bit constant values into zmm22. These are
the indexes themselves, zmm22 = 8, 9, 10, 11, 12, 13, 14, 15, not the values loaded from
those locations. And zmm2 contains the constant 4000. So
vpmuludq zmm14, zmm10, zmm2 ; will multiply the index values by 4000,
since the stride for array b is 4000.
zmm14 = 32000, 36000, 40000, ..., 60000.
Now, as you said,
vpsrlq zmm15, zmm10, 32 ; will shift each 64-bit element of zmm10 (= zmm22) right by
32 bits, so
zmm15 = ? (Can you compute the value of zmm15 here?)...
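One way to check the lane values discussed above is to model the two instructions on plain integers. This is only a sketch: it assumes zmm10 holds the indexes 8 through 15 (as the message states) and zmm2 holds 4000 in every lane, and the Python functions are illustrative stand-ins for the instructions, not a real API.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def vpmuludq(a, b):
    """Per 64-bit lane: multiply the low 32 bits of each operand as unsigned
    integers and keep the full 64-bit product."""
    return [((x & MASK32) * (y & MASK32)) & MASK64 for x, y in zip(a, b)]


def vpsrlq(a, count):
    """Per 64-bit lane: logical right shift; counts above 63 produce zero."""
    if count > 63:
        return [0] * len(a)
    return [(x & MASK64) >> count for x in a]


zmm10 = list(range(8, 16))  # assumption: the indexes 8..15, one per 64-bit lane
zmm2 = [4000] * 8           # assumption: the stride broadcast to every lane

zmm14 = vpmuludq(zmm10, zmm2)
zmm15 = vpsrlq(zmm10, 32)
print(zmm14)
print(zmm15)
```

With these inputs every index fits in the low 32 bits, so the shift by 32 leaves only the (empty) upper halves and zmm15 comes out as all zeros.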
2017 Jan 24
7
[X86][AVX512] RFC: make i1 illegal in the Codegen
...%r = call <8 x i32> @llvm.masked.gather.v8i32(<8 x i32*> %p, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x i32> undef)
ret <8 x i32> %r
}
Can be lowered to
# BB#0:
kxnorw %k0, %k0, %k1
vpgatherqd (,%zmm1), %ymm0 {%k1}
retq
Legal vectors of i1's require support for BUILD_VECTOR(i1, i1, .., i1), i1 EXTRACT_VEC_ELEMENT (...) and INSERT_VEC_ELEMENT(i1, ...) , so making i1 legal seemed like a sensible decision, and this is the current state in the top of trunk.
However, making i1 legal affe...
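As a rough illustration of the lowering shown above, the sketch below models the all-true mask and the masked gather on plain Python values. The memory contents, addresses, and the `-1` placeholder for the undefined pass-through are made up for the example, and the function names only mirror the instructions.

```python
def kxnorw(a, b):
    """16-bit mask XNOR; XNOR of a register with itself sets every bit."""
    return ~(a ^ b) & 0xFFFF


def masked_gather(memory, pointers, mask, passthru):
    """Load memory[p] for each lane whose mask bit is set; otherwise keep
    the pass-through value for that lane."""
    return [
        memory[p] if (mask >> i) & 1 else passthru[i]
        for i, p in enumerate(pointers)
    ]


# Hypothetical memory: eight 32-bit values at made-up addresses.
memory = {0x1000 + 4 * i: 10 * i for i in range(8)}
pointers = sorted(memory)        # stands in for the eight pointers in zmm1

k1 = kxnorw(0, 0)                # all-ones mask, like the <8 x i1> of trues
result = masked_gather(memory, pointers, k1, [-1] * 8)
print(hex(k1), result)
```

Only the low 8 bits of the mask matter for an 8-lane gather; the sketch keeps all 16 because kxnorw operates on a 16-bit mask.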
2017 Aug 06
2
VBROADCAST Implementation Issues
...>>>>>>>>>>>>>>>>> .long 1045220557 # float 0.200000003
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> vbroadcastss zmm1, dword ptr [rip + .LCPI0_0]
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> vmulps zmm2, zmm2, zmm1
>>>>>>>>>>>>>>>>>>>...
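The constant and the two instructions quoted above can be checked with a small model. The sketch decodes the `.long` value as an IEEE-754 single, broadcasts it to 16 lanes, and multiplies lane by lane with rounding to single precision. The contents of zmm2 are not shown in the excerpt, so the values used here are purely hypothetical.

```python
import struct


def bits_to_float(u32):
    """Reinterpret a 32-bit integer as an IEEE-754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", u32))[0]


def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def vbroadcastss(value, lanes=16):
    """Copy one single-precision value into every lane of a 512-bit register."""
    return [value] * lanes


def vmulps(a, b):
    """Lane-wise single-precision multiply."""
    return [to_f32(x * y) for x, y in zip(a, b)]


scale = bits_to_float(1045220557)          # the .long constant from the listing
zmm1 = vbroadcastss(scale)
zmm2 = [float(i) for i in range(1, 17)]    # hypothetical input values
zmm2 = vmulps(zmm2, zmm1)
print(scale, zmm2[:4])
```

The decoded constant is the single-precision value closest to 0.2, which is why the assembler comment prints it as 0.200000003.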
2017 Aug 07
2
VBROADCAST Implementation Issues
...>>>>>>>>>>>>> .long 1045220557 # float 0.200000003
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> vbroadcastss zmm1, dword ptr [rip + .LCPI0_0]
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> vmulps zmm2, zmm2, zmm1
>>>>>>>>>>>>>>>...
2017 Aug 07
3
VBROADCAST Implementation Issues
...>>>>>>>>>>>>>> 0.200000003
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> vbroadcastss zmm1, dword ptr [rip + .LCPI0_0]
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> vmulps zmm2, zmm2, zmm1
>>>>>>>...