Displaying 19 results from an estimated 19 matches for "vpaddd".
2017 Jun 25
2
AVX Scheduling and Parallelism
...rallel? It can be, because each core has an AVX unit. Does the compiler exploit it?
Secondly, I am generating assembly for Intel, and there are some offsets like the rip register or a constant addition in the memory index. Why is that so?
eg.1
vmovdqu32 zmm0, zmmword ptr [rip + c]
vpaddd zmm0, zmm0, zmmword ptr [rip + b]
vmovdqu32 zmmword ptr [rip + a], zmm0
vmovdqu32 zmm0, zmmword ptr [rip + c+64]
vpaddd zmm0, zmm0, zmmword ptr [rip + b+64]
and
eg. 2
mov rax, -393216
.p2align 4, 0x90
.L...
2017 Jun 25
0
AVX Scheduling and Parallelism
Hi, Zvi,
I agree. In the context of targeting the KNL, however, I'm a bit
concerned about the addressing, and specifically, the size of the
resulting encoding:
> vmovdqu32 zmm0, zmmword ptr [rax + c+401280] ;load b[401280] in
> zmm0
>
> vpaddd zmm1, zmm1, zmmword ptr [rax + b+401344]
> ; zmm1<-zmm1+b[401344]
The KNL can only deliver 16 bytes per cycle from the icache to the
decoder. Essentially all of the instructions in the loop, as we seem to
generate it, have 10-byte encodings:
10: 62 f1 7e 48 6f 8...
2017 Jul 01
2
KNL Assembly Code for Matrix Multiplication
...gt;> * vpmulld zmm0, zmm0, zmmword ptr [rbx + 4*rax]*
>>>>> vpmulld zmm14, zmm14, zmmword ptr [rbx + 4*rax + 64]
>>>>> vpmulld zmm15, zmm15, zmmword ptr [rbx + 4*rax + 128]
>>>>> vpmulld zmm1, zmm1, zmmword ptr [rbx + 4*rax + 192]
>>>>> vpaddd zmm8, zmm0, zmm8
>>>>> vpaddd zmm11, zmm14, zmm11
>>>>> vpaddd zmm12, zmm15, zmm12
>>>>> vpaddd zmm13, zmm1, zmm13
>>>>> vpaddq zmm9, zmm9, zmm7 #zmm7=64
>>>>> vpaddq zmm10, zmm10, zmm7
>>>>> add rcx, -...
2017 Jun 24
4
AVX Scheduling and Parallelism
...ent registers executed in
parallel? It can be, because each core has an AVX unit. Does the compiler
exploit it?
Secondly, I am generating assembly for Intel, and there are some offsets like
the rip register or a constant addition in the memory index. Why is that so?
eg.1
vmovdqu32 zmm0, zmmword ptr [rip + c]
vpaddd zmm0, zmm0, zmmword ptr [rip + b]
vmovdqu32 zmmword ptr [rip + a], zmm0
vmovdqu32 zmm0, zmmword ptr [rip + c+64]
vpaddd zmm0, zmm0, zmmword ptr [rip + b+64]
and
eg. 2
mov rax, -393216
.p2align 4, 0x90
.LBB0_1: # %vector.body...
2016 May 06
3
Unnecessary spill/fill issue
...the function of all the constant vectors immediately to stack,
then each use references the stack pointer directly:
Lots of these at top of function:
movabsq $.LCPI0_212, %rbx
vmovaps (%rbx), %ymm0
vmovaps %ymm0, 2816(%rsp) # 32-byte Spill
Later on, each use references the stack pointer:
vpaddd 2816(%rsp), %ymm4, %ymm1 # 32-byte Folded Reload
It seems the spill to stack is unnecessary. In one particularly bad kernel,
I have 128 8-wide constant vectors, and so there is 4KB of stack use just
for these constants. I think a better approach could be to load the
constant vector pointers as nee...
2018 Jul 23
3
[LoopVectorizer] Improving the performance of dot product reduction loop
...t products from 16-bit inputs and does a horizontal add of
adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32
result.
In the example code, because we are reducing the number of elements from
8->4 in the vpmaddwd step we are left with a width mismatch between
vpmaddwd and the vpaddd instruction that we use to sum with the results
from the previous loop iterations. We rely on the fact that a 128-bit
vpmaddwd zeros the upper bits of the register so that we can use a 256-bit
vpaddd instruction so that the upper elements can keep going around the
loop without being disturbed in ca...
2018 Jul 23
4
[LoopVectorizer] Improving the performance of dot product reduction loop
...2
> result.
>
>
That godbolt link seems wrong. It wasn't supposed to be clang IR. This
should be right.
>
> In the example code, because we are reducing the number of elements from
> 8->4 in the vpmaddwd step we are left with a width mismatch between
> vpmaddwd and the vpaddd instruction that we use to sum with the results
> from the previous loop iterations. We rely on the fact that a 128-bit
> vpmaddwd zeros the upper bits of the register so that we can use a 256-bit
> vpaddd instruction so that the upper elements can keep going around the
> loop without b...
2018 Jul 24
4
[LoopVectorizer] Improving the performance of dot product reduction loop
...godbolt link seems wrong. It wasn't supposed to be clang IR. This
> should be right.
>
>
>>
>> In the example code, because we are reducing the number of elements from
>> 8->4 in the vpmaddwd step we are left with a width mismatch between
>> vpmaddwd and the vpaddd instruction that we use to sum with the results
>> from the previous loop iterations. We rely on the fact that a 128-bit
>> vpmaddwd zeros the upper bits of the register so that we can use a 256-bit
>> vpaddd instruction so that the upper elements can keep going around the
>>...
2013 Sep 20
0
[LLVMdev] Passing a 256 bit integer vector with XMM registers
...8 x i32> @add(<8 x i32> %a, <8 x i32> %b) {
%add = add <8 x i32> %a, %b
ret <8 x i32> %add
}
With march=X86-64 and mcpu=corei7-avx, llc with the default calling convention generates the following code
vextractf128 $1, %ymm1, %xmm2
vextractf128 $1, %ymm0, %xmm3
vpaddd %xmm2, %xmm3, %xmm2
vpaddd %xmm1, %xmm0, %xmm0
vinsertf128 $1, %xmm2, %ymm0, %ymm0
ret
With this new calling convention, llc would generate slightly different code inside the callee
vpaddd %xmm2, %xmm0, %xmm0
vpaddd %xmm3, %xmm1, %xmm1
ret
I am wondering how...
2018 Jul 23
2
[LoopVectorizer] Improving the performance of dot product reduction loop
...izontal add of adjacent pairs. A vpmaddwd
>> given two v8i16 inputs will produce a v4i32 result.
>>
>> In the example code, because we are reducing the number of elements
>> from 8->4 in the vpmaddwd step we are left with a width mismatch
>> between vpmaddwd and the vpaddd instruction that we use to sum with
>> the results from the previous loop iterations. We rely on the fact
>> that a 128-bit vpmaddwd zeros the upper bits of the register so that
>> we can use a 256-bit vpaddd instruction so that the upper elements
>> can keep going around th...
2017 Jun 21
2
AVX 512 Assembly Code Generation issues
...vq $-1024, %rax # imm = 0xFC00
> .p2align 4, 0x90
> .*LBB0_1: # %vector.body*
> * # =>This Inner Loop Header:
> Depth=1*
> * vmovdqa32 c+1024(%rax), %xmm0*
> * vmovdqa32 c+1040(%rax), %xmm1*
> * vpaddd b+1024(%rax), %xmm0, %xmm0*
> * vpaddd b+1040(%rax), %xmm1, %xmm1*
> * vmovdqa32 %xmm0, a+1024(%rax)*
> * vmovdqa32 %xmm1, a+1040(%rax)*
> * vmovdqa32 c+1056(%rax), %xmm0*
> * vmovdqa32 c+1072(%rax), %xmm1*
> * vpaddd b+1056(%rax), %xmm0, %xmm0*
> * vpaddd b+1072(%rax), %xmm1,...
2017 Feb 12
1
[PATCH] cpu.h: add defines for clang
...3.8.1 from Debian testing.
I forgot that all avx2 functions are inside "#ifdef FLAC__AVX2_SUPPORTED"
conditional, so they simply don't exist if FLAC__AVX2_SUPPORTED is not set.
Anyway, stream_encoder_intrin_avx2-after.txt shows that the code
contains AVX2 instructions such as vpabsd/vpaddd/vphaddd, so
this function was compiled properly.
2014 Oct 13
2
[LLVMdev] Unexpected spilling of vector register during lane extraction on some x86_64 targets
...ict-aliasing -funroll-loops -ffast-math
-march=native -mtune=native -DSPILLING_ENSUES=0 /* no spilling */
$ objdump -dC --no-show-raw-insn ./a.out
...
00000000004004f0 <main>:
4004f0: vmovdqa 0x2004c8(%rip),%xmm0 # 6009c0 <x>
4004f8: vpsrld $0x17,%xmm0,%xmm0
4004fd: vpaddd 0x17b(%rip),%xmm0,%xmm0 # 400680
<__dso_handle+0x8>
400505: vcvtdq2ps %xmm0,%xmm1
400509: vdivps 0x17f(%rip),%xmm1,%xmm1 # 400690
<__dso_handle+0x18>
400511: vcvttps2dq %xmm1,%xmm1
400515: vpmullw 0x183(%rip),%xmm1,%xmm1 # 4006a0
<__dso_handle+0x2...
2018 Jul 24
2
[LoopVectorizer] Improving the performance of dot product reduction loop
...vpmaddwd
> given two v8i16 inputs will produce a v4i32 result.
>
>
>
> In the example code, because we are reducing the number of
> elements from 8->4 in the vpmaddwd step we are left with a
> width mismatch between vpmaddwd and the vpaddd instruction
> that we use to sum with the results from the previous loop
> iterations. We rely on the fact that a 128-bit vpmaddwd zeros
> the upper bits of the register so that we can use a 256-bit
> vpaddd instruction so that the upper elements can keep...
2017 Aug 17
4
unable to emit vectorized code in LLVM IR
I assume the compiler knows that you only have 2 input values that you just
added together 1000 times.
Despite the fact that you stored to a[i] and b[i] here, nothing reads them
other than the addition in the same loop iteration. So the compiler easily
removed the a and b arrays. Same with 'c': it's not read outside the loop,
so it doesn't need to exist. So the compiler turned your
2017 Jan 25
3
[PATCH] cpu.h: add defines for clang
Currently cpu.h lacks FLAC__SSE_TARGET and FLAC__SSEnn_SUPPORTED
macros for clang. I added them, but I cannot properly test them
as I can't get a compiled flac.exe under Windows (I don't know
how to set up clang under MSYS2).
If somebody has working clang, please test this patch.
Does it affect en/decoding speed?
Or at least, does it affect disassembly of functions
such as
2018 Mar 13
32
[PATCH v2 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v2:
- Adapt patch to work post KPTI and compiler changes
- Redo all performance testing with latest configs and compilers
- Simplify mov macro on PIE (MOVABS now)
- Reduce GOT footprint
- patch v1:
- Simplify ftrace implementation.
- Use gcc mstack-protector-guard-reg=%gs with PIE when possible.
- rfc v3:
- Use --emit-relocs instead of -pie to reduce
2018 May 23
33
[PATCH v3 00/27] x86: PIE support and option to extend KASLR randomization
Changes:
- patch v3:
- Update on message to describe longer term PIE goal.
- Minor change on ftrace if condition.
- Changed code using xchgq.
- patch v2:
- Adapt patch to work post KPTI and compiler changes
- Redo all performance testing with latest configs and compilers
- Simplify mov macro on PIE (MOVABS now)
- Reduce GOT footprint
- patch v1:
- Simplify ftrace