Displaying 19 results from an estimated 19 matches for "vpaddd".
2017 Jun 25
2
AVX Scheduling and Parallelism
...rallel? It can be, because each core has an AVX unit. Does the compiler exploit it?
Secondly, I am generating assembly for Intel, and there are some offsets like the rip register or a constant addition in the memory index. Why is that so?
eg.1
vmovdqu32 zmm0, zmmword ptr [rip + c]
vpaddd zmm0, zmm0, zmmword ptr [rip + b]
vmovdqu32 zmmword ptr [rip + a], zmm0
vmovdqu32 zmm0, zmmword ptr [rip + c+64]
vpaddd zmm0, zmm0, zmmword ptr [rip + b+64]
and
eg. 2
mov rax, -393216
.p2align 4, 0x90
.L...
2017 Jun 25
0
AVX Scheduling and Parallelism
Hi, Zvi,
I agree. In the context of targeting the KNL, however, I'm a bit
concerned about the addressing, and specifically, the size of the
resulting encoding:
> vmovdqu32 zmm0, zmmword ptr [rax + c+401280] ;load b[401280] in
> zmm0
>
> vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344]
> ; zmm1<-zmm1+b[401344]
The KNL can only deliver 16 bytes per cycle from the icache to the
decoder. Essentially all of the instructions in the loop, as we seem to
generate it, have 10-byte encodings:
10: 62 f1 7e 48 6f 8...
2017 Jul 01
2
KNL Assembly Code for Matrix Multiplication
...gt;> * vpmulld zmm0, zmm0, zmmword ptr [rbx + 4*rax]*
>>>>> vpmulld zmm14, zmm14, zmmword ptr [rbx + 4*rax + 64]
>>>>> vpmulld zmm15, zmm15, zmmword ptr [rbx + 4*rax + 128]
>>>>> vpmulld zmm1, zmm1, zmmword ptr [rbx + 4*rax + 192]
>>>>> vpaddd zmm8, zmm0, zmm8
>>>>> vpaddd zmm11, zmm14, zmm11
>>>>> vpaddd zmm12, zmm15, zmm12
>>>>> vpaddd zmm13, zmm1, zmm13
>>>>> vpaddq zmm9, zmm9, zmm7 #zmm7=64
>>>>> vpaddq zmm10, zmm10, zmm7
>>>>> add rcx, -...
2017 Jun 24
4
AVX Scheduling and Parallelism
...ent registers executed in
parallel? It can be, because each core has an AVX unit. Does the compiler
exploit it?
Secondly, I am generating assembly for Intel, and there are some offsets like
the rip register or a constant addition in the memory index. Why is that so?
eg.1
vmovdqu32 zmm0, zmmword ptr [rip + c]
vpaddd zmm0, zmm0, zmmword ptr [rip + b]
vmovdqu32 zmmword ptr [rip + a], zmm0
vmovdqu32 zmm0, zmmword ptr [rip + c+64]
vpaddd zmm0, zmm0, zmmword ptr [rip + b+64]
and
eg. 2
mov rax, -393216
.p2align 4, 0x90
.LBB0_1: # %vector.body...
2016 May 06
3
Unnecessary spill/fill issue
...the function of all the constant vectors immediately to stack,
then each use references the stack pointer directly:
Lots of these at top of function:
movabsq $.LCPI0_212, %rbx
vmovaps (%rbx), %ymm0
vmovaps %ymm0, 2816(%rsp) # 32-byte Spill
Later on, each use references the stack pointer:
vpaddd 2816(%rsp), %ymm4, %ymm1 # 32-byte Folded Reload
It seems the spill to stack is unnecessary. In one particularly bad kernel,
I have 128 8-wide constant vectors, and so there is 4KB of stack use just
for these constants. I think a better approach could be to load the
constant vector pointers as nee...
2018 Jul 23
3
[LoopVectorizer] Improving the performance of dot product reduction loop
...t products from 16-bit inputs and does a horizontal add of
adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32
result.
In the example code, because we are reducing the number of elements from
8->4 in the vpmaddwd step we are left with a width mismatch between
vpmaddwd and the vpaddd instruction that we use to sum with the results
from the previous loop iterations. We rely on the fact that a 128-bit
vpmaddwd zeros the upper bits of the register so that we can use a 256-bit
vpaddd instruction so that the upper elements can keep going around the
loop without being disturbed in ca...
2018 Jul 23
4
[LoopVectorizer] Improving the performance of dot product reduction loop
...2
> result.
>
>
That godbolt link seems wrong. It wasn't supposed to be clang IR. This
should be right.
>
> In the example code, because we are reducing the number of elements from
> 8->4 in the vpmaddwd step we are left with a width mismatch between
> vpmaddwd and the vpaddd instruction that we use to sum with the results
> from the previous loop iterations. We rely on the fact that a 128-bit
> vpmaddwd zeros the upper bits of the register so that we can use a 256-bit
> vpaddd instruction so that the upper elements can keep going around the
> loop without b...
2018 Jul 24
4
[LoopVectorizer] Improving the performance of dot product reduction loop
...godbolt link seems wrong. It wasn't supposed to be clang IR. This
> should be right.
>
>
>>
>> In the example code, because we are reducing the number of elements from
>> 8->4 in the vpmaddwd step we are left with a width mismatch between
>> vpmaddwd and the vpaddd instruction that we use to sum with the results
>> from the previous loop iterations. We rely on the fact that a 128-bit
>> vpmaddwd zeros the upper bits of the register so that we can use a 256-bit
>> vpaddd instruction so that the upper elements can keep going around the
>>...
2013 Sep 20
0
[LLVMdev] Passing a 256 bit integer vector with XMM registers
...8 x i32> @add(<8 x i32> %a, <8 x i32> %b) {
%add = add <8 x i32> %a, %b
ret <8 x i32> %add
}
With march=X86-64 and mcpu=corei7-avx, llc with the default calling convention generates the following code
vextractf128 $1, %ymm1, %xmm2
vextractf128 $1, %ymm0, %xmm3
vpaddd %xmm2, %xmm3, %xmm2
vpaddd %xmm1, %xmm0, %xmm0
vinsertf128 $1, %xmm2, %ymm0, %ymm0
ret
With this new calling convention, llc would generate slightly different code inside the callee
vpaddd %xmm2, %xmm0, %xmm0
vpaddd %xmm3, %xmm1, %xmm1
ret
I am wondering how...
2018 Jul 23
2
[LoopVectorizer] Improving the performance of dot product reduction loop
...izontal add of adjacent pairs. A vpmaddwd
>> given two v8i16 inputs will produce a v4i32 result.
>>
>> In the example code, because we are reducing the number of elements
>> from 8->4 in the vpmaddwd step we are left with a width mismatch
>> between vpmaddwd and the vpaddd instruction that we use to sum with
>> the results from the previous loop iterations. We rely on the fact
>> that a 128-bit vpmaddwd zeros the upper bits of the register so that
>> we can use a 256-bit vpaddd instruction so that the upper elements
>> can keep going around th...
2017 Jun 21
2
AVX 512 Assembly Code Generation issues
...vq $-1024, %rax # imm = 0xFC00
> .p2align 4, 0x90
> .*LBB0_1: # %vector.body*
> * # =>This Inner Loop Header:
> Depth=1*
> * vmovdqa32 c+1024(%rax), %xmm0*
> * vmovdqa32 c+1040(%rax), %xmm1*
> * vpaddd b+1024(%rax), %xmm0, %xmm0*
> * vpaddd b+1040(%rax), %xmm1, %xmm1*
> * vmovdqa32 %xmm0, a+1024(%rax)*
> * vmovdqa32 %xmm1, a+1040(%rax)*
> * vmovdqa32 c+1056(%rax), %xmm0*
> * vmovdqa32 c+1072(%rax), %xmm1*
> * vpaddd b+1056(%rax), %xmm0, %xmm0*
> * vpaddd b+1072(%rax), %xmm1,...
2017 Feb 12
1
[PATCH] cpu.h: add defines for clang
...3.8.1 from Debian testing.
I forgot that all avx2 functions are inside "#ifdef FLAC__AVX2_SUPPORTED"
conditional, so they simply don't exist if FLAC__AVX2_SUPPORTED is not set.
Anyway, stream_encoder_intrin_avx2-after.txt shows that the code
contains AVX2 instructions such as vpabsd/vpaddd/vphaddd, so
this function was compiled properly.
2014 Oct 13
2
[LLVMdev] Unexpected spilling of vector register during lane extraction on some x86_64 targets
...ict-aliasing -funroll-loops -ffast-math
-march=native -mtune=native -DSPILLING_ENSUES=0 /* no spilling */
$ objdump -dC --no-show-raw-insn ./a.out
...
00000000004004f0 <main>:
4004f0: vmovdqa 0x2004c8(%rip),%xmm0 # 6009c0 <x>
4004f8: vpsrld $0x17,%xmm0,%xmm0
4004fd: vpaddd 0x17b(%rip),%xmm0,%xmm0 # 400680
<__dso_handle+0x8>
400505: vcvtdq2ps %xmm0,%xmm1
400509: vdivps 0x17f(%rip),%xmm1,%xmm1 # 400690
<__dso_handle+0x18>
400511: vcvttps2dq %xmm1,%xmm1
400515: vpmullw 0x183(%rip),%xmm1,%xmm1 # 4006a0
<__dso_handle+0x2...
2018 Jul 24
2
[LoopVectorizer] Improving the performance of dot product reduction loop
...vpmaddwd
> given two v8i16 inputs will produce a v4i32 result.
>
>
>
> In the example code, because we are reducing the number of
> elements from 8->4 in the vpmaddwd step we are left with a
> width mismatch between vpmaddwd and the vpaddd instruction
> that we use to sum with the results from the previous loop
> iterations. We rely on the fact that a 128-bit vpmaddwd zeros
> the upper bits of the register so that we can use a 256-bit
> vpaddd instruction so that the upper elements can keep...
2017 Aug 17
4
unable to emit vectorized code in LLVM IR
I assume the compiler knows that you only have 2 input values that you just
added together 1000 times.
Despite the fact that you stored to a[i] and b[i] here, nothing reads them
other than the addition in the same loop iteration. So the compiler easily
removed the a and b arrays. Same with 'c': it's not read outside the loop,
so it doesn't need to exist. So the compiler turned your
2017 Jan 25
3
[PATCH] cpu.h: add defines for clang
Currently cpu.h lacks FLAC__SSE_TARGET and FLAC__SSEnn_SUPPORTED
macros for clang. I added them, but I cannot properly test them
as I can't get a compiled flac.exe under Windows (I don't know
how to set up clang under MSYS2).
If somebody has working clang, please test this patch.
Does it affect en/decoding speed?
Or at least, does it affect disassembly of functions
such as
2018 Mar 13
32
[PATCH v2 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v2:
- Adapt patch to work post KPTI and compiler changes
- Redo all performance testing with latest configs and compilers
- Simplify mov macro on PIE (MOVABS now)
- Reduce GOT footprint
- patch v1:
- Simplify ftrace implementation.
- Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
- rfc v3:
- Use --emit-relocs instead of -pie to reduce
2018 May 23
33
[PATCH v3 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v3:
- Update on message to describe longer term PIE goal.
- Minor change on ftrace if condition.
- Changed code using xchgq.
- patch v2:
- Adapt patch to work post KPTI and compiler changes
- Redo all performance testing with latest configs and compilers
- Simplify mov macro on PIE (MOVABS now)
- Reduce GOT footprint
- patch v1:
- Simplify ftrace