Displaying 8 results from an estimated 8 matches for "zmm1".

2017 Jun 24
4
AVX Scheduling and Parallelism
Hello, After generating AVX code for a large number of iterations, I came to realize that it still uses only 2 registers, zmm0 and zmm1, when the loop unroll factor = 1024. I wonder if this register allocation allows operations in parallel? Also, I know all the elements within a single vector instruction are computed in parallel, but are the elements of multiple instructions computed in parallel? Like, are 2 vmov with different regis...
2017 Jun 25
2
AVX Scheduling and Parallelism
...ht cause problems by making the instruction encodings large. cc'ing some Intel folks for further comments. -Hal On 06/23/2017 09:02 PM, hameeza ahmed via llvm-dev wrote: Hello, After generating AVX code for a large number of iterations, I came to realize that it still uses only 2 registers, zmm0 and zmm1, when the loop unroll factor = 1024. I wonder if this register allocation allows operations in parallel? Also, I know all the elements within a single vector instruction are computed in parallel, but are the elements of multiple instructions computed in parallel? Like, are 2 vmov with different regis...
2017 Jun 25
0
AVX Scheduling and Parallelism
Hi, Zvi, I agree. In the context of targeting the KNL, however, I'm a bit concerned about the addressing, and specifically, the size of the resulting encoding:
> vmovdqu32 zmm0, zmmword ptr [rax + c+401280] ; load b[401280] in zmm0
> vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344] ; zmm1<-zmm1+b[401344]
The KNL can only deliver 16 bytes per cycle from the icache to the decoder. Essentially all of the instructions in the loop, as we seem to generate it, have 10-byte encodings: 10: 62 f1 7e 48 6f 80 00 vmovdqu...
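The fetch concern in this snippet comes down to simple arithmetic. The following is a minimal Python sketch of that bound, using only the two figures quoted above (16 bytes per cycle, 10-byte encodings). The function names are hypothetical, and the model deliberately ignores alignment, decoder width, and every other pipeline effect, so it is an upper bound on fetch throughput rather than a performance prediction.

```python
from fractions import Fraction

# Figures quoted in the snippet above.
FETCH_BYTES_PER_CYCLE = 16   # icache-to-decoder bandwidth
ENCODING_BYTES = 10          # size of each instruction in the loop


def max_instructions_per_cycle(fetch_bytes, encoding_bytes):
    """Fetch-only upper bound on instructions delivered per cycle."""
    return Fraction(fetch_bytes, encoding_bytes)


def min_fetch_cycles(instruction_count, fetch_bytes, encoding_bytes):
    """Fewest cycles needed just to fetch the given number of instructions."""
    total_bytes = instruction_count * encoding_bytes
    return -(-total_bytes // fetch_bytes)  # ceiling division


bound = max_instructions_per_cycle(FETCH_BYTES_PER_CYCLE, ENCODING_BYTES)
cycles_for_1024 = min_fetch_cycles(1024, FETCH_BYTES_PER_CYCLE, ENCODING_BYTES)
```

Under these assumptions the bound is 8/5, that is, fewer than two 10-byte instructions can be fetched per cycle.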
2017 Jul 01
2
KNL Assembly Code for Matrix Multiplication
Thank you. It means vmovdqa64 zmm22, zmmword ptr [rip + .LCPI0_0] # zmm22 = [8,9,10,11,12,13,14,15] zmm22 will contain 64-bit constant values, which are indexes here: zmm22 = 8, 9, 10, 11, 12, 13, 14, 15, not the values loaded from these locations. And zmm2 contains the constant 4000. So vpmuludq zmm14, zmm10, zmm2 will multiply the index values by 4000, since the stride for array b is 4000: zmm14 = 32000, 36000, 40000, ..., 60000. Now, as you said, vpsrlq zmm15, zmm10, 32 will shift each 64-bit element of zmm10 (= zmm22) right by 32 bits, so zmm15 = ? (Can you compute the value of zmm15 here?)...
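The lane arithmetic asked about in this snippet can be checked with a small emulation. This is a minimal Python sketch, not code from the thread: the helpers `vpmuludq` and `vpsrlq` are hypothetical models of the per-lane semantics, and the inputs take the poster's premises at face value (zmm10 holds the indexes 8 through 15, and every 64-bit lane of zmm2 holds 4000).

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def vpmuludq(a, b):
    """Per 64-bit lane: multiply the low unsigned 32 bits of each operand."""
    return [((x & MASK32) * (y & MASK32)) & MASK64 for x, y in zip(a, b)]


def vpsrlq(a, count):
    """Per 64-bit lane: logical right shift; counts above 63 yield zero."""
    if count > 63:
        return [0] * len(a)
    return [(x & MASK64) >> count for x in a]


# Assumed register contents (the poster's premises, not verified here).
zmm10 = list(range(8, 16))   # indexes 8..15, one per 64-bit lane
zmm2 = [4000] * 8            # stride constant in every lane

zmm14 = vpmuludq(zmm10, zmm2)   # index * stride
zmm15 = vpsrlq(zmm10, 32)       # upper 32 bits of each index
```

Under those premises every lane of zmm15 is zero, because each index fits entirely in the low 32 bits of its lane.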
2017 Jan 24
7
[X86][AVX512] RFC: make i1 illegal in the Codegen
...%r = call <8 x i32> @llvm.masked.gather.v8i32(<8 x i32*> %p, i32 4, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x i32> undef) ret <8 x i32> %r } Can be lowered to # BB#0: kxnorw %k0, %k0, %k1 vpgatherqd (,%zmm1), %ymm0 {%k1} retq Legal vectors of i1's require support for BUILD_VECTOR(i1, i1, .., i1), i1 EXTRACT_VEC_ELEMENT (...) and INSERT_VEC_ELEMENT(i1, ...), so making i1 legal seemed like a sensible decision, and this is the current state in the top of trunk. However, making i1 legal affe...
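The `kxnorw %k0, %k0, %k1` in the lowering above builds the all-true mask from the IR: XNOR of a register with itself sets every bit, whatever the register held before. A minimal Python sketch of that identity on a 16-bit mask follows. The `kxnorw` helper is a hypothetical model, and the assumption that an 8-element gather consults only the low 8 mask bits is mine rather than something stated in the snippet.

```python
MASK16 = 0xFFFF


def kxnorw(a, b):
    """Bitwise XNOR of two 16-bit mask values."""
    return ~(a ^ b) & MASK16


# Any starting value of k0 gives the same result when XNORed with itself.
k1 = kxnorw(0x1234, 0x1234)

# Assumed: an 8-element gather looks at the low 8 bits of the mask.
active_lanes = [bool((k1 >> lane) & 1) for lane in range(8)]
```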
2017 Aug 06
2
VBROADCAST Implementation Issues
...
> .long 1045220557 # float 0.200000003
> vbroadcastss zmm1, dword ptr [rip + .LCPI0_0]
> vmulps zmm2, zmm2, zmm1
...
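The constant in this snippet can be decoded by hand. The following minimal Python sketch reinterprets the 32-bit integer 1045220557 as an IEEE 754 single-precision value and then replicates it across 16 lanes, which is what a broadcast into a 512-bit register amounts to. The variable names are illustrative only, and the 16-lane list is a model, not the instruction itself.

```python
import struct

RAW_CONSTANT = 1045220557  # the .long value from the constant pool

# Reinterpret the 32-bit pattern as a single-precision float.
value = struct.unpack("<f", struct.pack("<I", RAW_CONSTANT))[0]

# A 512-bit register holds 16 single-precision lanes; a broadcast
# copies the same scalar into each of them.
zmm1 = [value] * 16
```

The decoded value is the nearest single-precision number to 0.2, which is why the assembler comment prints it as 0.200000003.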
2017 Aug 07
2
VBROADCAST Implementation Issues
...
> .long 1045220557 # float 0.200000003
> vbroadcastss zmm1, dword ptr [rip + .LCPI0_0]
> vmulps zmm2, zmm2, zmm1
...
2017 Aug 07
3
VBROADCAST Implementation Issues
...
> 0.200000003
> vbroadcastss zmm1, dword ptr [rip + .LCPI0_0]
> vmulps zmm2, zmm2, zmm1
...