thr3ads.net - search: "shufp"

2008 Jul 12

2

[LLVMdev] Shuffle regression

...ue is still present. 2.3 generates the following x86 code:

    03A10010 push ebp
    03A10011 mov ebp,esp
    03A10013 and esp,0FFFFFFF0h
    03A10019 movups xmm0,xmmword ptr ds:[141D280h]
    03A10020 xorps xmm1,xmm1
    03A10023 movaps xmm2,xmm0
    03A10026 shufps xmm2,xmm1,32h
    03A1002A movaps xmm1,xmm0
    03A1002D shufps xmm1,xmm2,84h
    03A10031 shufps xmm0,xmm1,23h
    03A10035 shufps xmm1,xmm1,40h
    03A10039 shufps xmm1,xmm0,2Eh
    03A1003D movups xmmword ptr ds:[14262C0h],xmm1
    03A10044 mov esp,ebp
    03A1...
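To see what the listed `shufps` chain does to the loaded vector, the visible instructions can be replayed on symbolic lanes. This is a minimal sketch, assuming the standard SSE meaning of the `shufps` immediate (four 2-bit lane selectors, low two result lanes taken from the destination, high two from the source). The helper names `shufps` and `trace_listing` are made up for illustration, and only the instructions visible in the truncated snippet are modeled.

```python
def shufps(dst, src, imm):
    """Model `shufps dst, src, imm8` (Intel operand order).

    Lanes are listed low to high. Each 2-bit field of imm8 is a lane
    index: fields 0 and 1 index dst, fields 2 and 3 index src.
    """
    sel = [(imm >> (2 * i)) & 3 for i in range(4)]
    return [dst[sel[0]], dst[sel[1]], src[sel[2]], src[sel[3]]]


def trace_listing(loaded):
    """Replay the visible part of the listing on a 4-lane value."""
    xmm0 = list(loaded)                # movups xmm0, [141D280h]
    xmm1 = [0, 0, 0, 0]                # xorps  xmm1, xmm1
    xmm2 = list(xmm0)                  # movaps xmm2, xmm0
    xmm2 = shufps(xmm2, xmm1, 0x32)    # shufps xmm2, xmm1, 32h
    xmm1 = list(xmm0)                  # movaps xmm1, xmm0
    xmm1 = shufps(xmm1, xmm2, 0x84)    # shufps xmm1, xmm2, 84h
    xmm0 = shufps(xmm0, xmm1, 0x23)    # shufps xmm0, xmm1, 23h
    xmm1 = shufps(xmm1, xmm1, 0x40)    # shufps xmm1, xmm1, 40h
    xmm1 = shufps(xmm1, xmm0, 0x2E)    # shufps xmm1, xmm0, 2Eh
    return xmm1                        # value written by the final movups


if __name__ == "__main__":
    # Symbolic lanes make it easy to see where each element ends up.
    print(trace_listing(["a", "b", "c", "d"]))  # → ['a', 'b', 'c', 'd']
```

Under this model the five shuffles in the visible part of the listing hand the loaded lanes back in their original order. The snippet is truncated, so this says nothing about the source shuffle that produced the sequence.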

[LLVMdev] Shuffle regression

2008 Jul 12

0

...; 2.3 generates the following x86 code:
>
> 03A10010 push ebp
> 03A10011 mov ebp,esp
> 03A10013 and esp,0FFFFFFF0h
> 03A10019 movups xmm0,xmmword ptr ds:[141D280h]
> 03A10020 xorps xmm1,xmm1
> 03A10023 movaps xmm2,xmm0
> 03A10026 shufps xmm2,xmm1,32h
> 03A1002A movaps xmm1,xmm0
> 03A1002D shufps xmm1,xmm2,84h
> 03A10031 shufps xmm0,xmm1,23h
> 03A10035 shufps xmm1,xmm1,40h
> 03A10039 shufps xmm1,xmm0,2Eh
> 03A1003D movups xmmword ptr ds:[14262C0h],xmm1
> 03A10044 mov...

[LLVMdev] x86 Vector Shuffle Patterns

2010 Aug 04

2

...ined as:

    def vperm2f128 : PatFrag<(ops node:$src1, node:$src2),
                               (vector_shuffle node:$src1, node:$src2), [{
      return X86::isVPERM2F128Mask(cast<ShuffleVectorSDNode>(N));
    }], SHUFFLE_get_vperm2f128_imm>;

I don't understand completely how the new system all works. Take a simple SHUFPS match:

    def SHUFPSrri : PSIi8<0xC6, MRMSrcReg,
                          (outs VR128:$dst), (ins VR128:$src1, VR128:$src2, i8imm:$src3),
                          "shufps\t{$src3, $src2, $dst|$dst, $src2, $src3}",
                          [(set VR128:$d...
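The PatFrag above pairs a mask predicate with an immediate transform. As a rough illustration of that pairing for a 4-lane `shufps`, here is a sketch in Python. It is not LLVM's code: `is_shufps_mask` and `shufps_imm` are hypothetical names, the mask is assumed to use indices 0-3 for the first operand and 4-7 for the second, and `None` stands in for an undefined lane.

```python
def is_shufps_mask(mask):
    """Predicate: can a single shufps produce this 4-lane shuffle?

    shufps takes its two low result lanes from the first operand and
    its two high result lanes from the second operand.
    """
    if len(mask) != 4:
        return False
    low_ok = all(m is None or 0 <= m < 4 for m in mask[:2])
    high_ok = all(m is None or 4 <= m < 8 for m in mask[2:])
    return low_ok and high_ok


def shufps_imm(mask):
    """Immediate transform: pack the four lane selectors into imm8."""
    if not is_shufps_mask(mask):
        raise ValueError("mask is not matchable by a single shufps")
    imm = 0
    for i, m in enumerate(mask):
        lane = 0 if m is None else m % 4   # undefined lanes pick lane 0
        imm |= lane << (2 * i)
    return imm


if __name__ == "__main__":
    print(hex(shufps_imm([3, 0, 4, 4])))   # lanes 3,0 of op1, lanes 0,0 of op2
    print(is_shufps_mask([4, 0, 1, 2]))    # low lane comes from op2: rejected
```

Keeping the predicate and the immediate computation as separate functions mirrors the split in the PatFrag: one decides whether the pattern applies, the other produces the operand.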

New routine: FLAC__lpc_compute_autocorrelation_asm_ia32_sse_lag_16

2013 Aug 22

2

...rps xmm5, xmm5
+ xorps xmm6, xmm6
+ movaps [esp], xmm5
+ movaps [esp + 16], xmm6
+
+ mov edx, [ebp + 12] ; edx == data_len
+ mov eax, [ebp + 8] ; eax == &data[sample] <- &data[0]
+
+ movss xmm0, [eax] ; xmm0 = 0,0,0,data[0]
+ add eax, 4
+ movaps xmm1, xmm0 ; xmm1 = 0,0,0,data[0]
+ shufps xmm0, xmm0, 0 ; xmm0 == data[sample],data[sample],data[sample],data[sample] = data[0],data[0],data[0],data[0]
+ xorps xmm2, xmm2 ; xmm2 = 0,0,0,0
+ xorps xmm3, xmm3 ; xmm3 = 0,0,0,0
+ xorps xmm4, xmm4 ; xmm4 = 0,0,0,0
+ movaps xmm7, xmm0
+ mulps xmm7, xmm1
+ addps xmm5, xmm7
+ dec edx
+ jz...
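The `movss` plus `shufps xmm0, xmm0, 0` pair in the patch is the usual way to splat one float across a register. A small model of just those two instructions follows, as a sketch; `movss_load` and `splat_low_lane` are illustrative names, and lanes are listed low to high here, whereas the patch comments list them high to low.

```python
def movss_load(value):
    """Model `movss xmm, [mem]`: low lane loaded, upper lanes zeroed."""
    return [value, 0.0, 0.0, 0.0]


def splat_low_lane(reg):
    """Model `shufps reg, reg, 0`: all four selectors are lane 0."""
    imm = 0
    sel = [(imm >> (2 * i)) & 3 for i in range(4)]
    return [reg[s] for s in sel]


if __name__ == "__main__":
    xmm0 = movss_load(0.25)        # one sample in the low lane, zeros above
    xmm1 = list(xmm0)              # movaps xmm1, xmm0
    xmm0 = splat_low_lane(xmm0)    # every lane now holds the sample
    # mulps xmm7 = xmm0 * xmm1 keeps only the low-lane product nonzero
    print([a * b for a, b in zip(xmm0, xmm1)])
```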

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

2015 Jan 29

2

...s!
>
>
> Another problem I'm seeing is that in some cases we can't fold memory
> anymore:
> vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
> vblendps $0x1, %xmm2, %xmm0, %xmm0
> becomes:
> vmovaps -0xXX(%rdx), %xmm2
> vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]
>
>
> Also, I see differences when some loads are shuffled, that I'm a bit
> conflicted about:
> vmovaps -0xXX(%rbp), %xmm3
> ...
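The `##` comments in the quoted assembly can be recomputed from the immediates, which is a handy cross-check when reading such diffs. A sketch, assuming the usual 2-bit lane selector encoding and that negative immediates are the signed print of an 8-bit value; the function names are made up for illustration.

```python
def lane_selectors(imm):
    """Split an 8-bit shuffle immediate into four 2-bit lane indices."""
    imm &= 0xFF                      # $-0x68 is the signed form of 0x98
    return [(imm >> (2 * i)) & 3 for i in range(4)]


def vshufps_comment(imm, src1, src2):
    """Annotation for AT&T `vshufps $imm, src2, src1, dst`."""
    s = lane_selectors(imm)
    return f"{src1}[{s[0]},{s[1]}],{src2}[{s[2]},{s[3]}]"


def vpermilps_comment(imm, src):
    """Annotation for the immediate form of `vpermilps $imm, src, dst`."""
    s = lane_selectors(imm)
    return f"{src}[{s[0]},{s[1]},{s[2]},{s[3]}]"


if __name__ == "__main__":
    print(vpermilps_comment(-0x6D, "mem"))          # → mem[3,0,1,2]
    print(vshufps_comment(0x3, "xmm2", "xmm0"))     # → xmm2[3,0],xmm0[0,0]
    print(vshufps_comment(-0x68, "xmm3", "xmm0"))   # → xmm3[0,2],xmm0[1,2]
```

All three results agree with the annotations in the quoted output.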

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

2015 Jan 30

4

...ng is that in some cases we can't fold memory
>>> anymore:
>>> vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>>> vblendps $0x1, %xmm2, %xmm0, %xmm0
>>> becomes:
>>> vmovaps -0xXX(%rdx), %xmm2
>>> vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
>>> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 =
>>> xmm3[0,2],xmm0[1,2]
>>>
>>>
>>> Also, I see differences when some loads are shuffled, that I'm a bit
>>> conflicte...

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

2015 Jan 29

0

...her problem I'm seeing is that in some cases we can't fold memory
>> anymore:
>> vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>> vblendps $0x1, %xmm2, %xmm0, %xmm0
>> becomes:
>> vmovaps -0xXX(%rdx), %xmm2
>> vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
>> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 =
>> xmm3[0,2],xmm0[1,2]
>>
>>
>> Also, I see differences when some loads are shuffled, that I'm a bit
>> conflicted about:
>> vm...

[PATCH] Make SSE Run Time option. Add Win32 SSE code

2004 Aug 06

2

...mov ecx, mem
+
+ mov edx, in1
+ movss xmm0, [edx]
+
+ movss xmm1, [ecx]
+ addss xmm1, xmm0
+
+ mov edx, in2
+ movss [edx], xmm1
+
+ shufps xmm0, xmm0, 0x00
+ shufps xmm1, xmm1, 0x00
+
+ movaps xmm2, [eax+4]
+ movaps xmm3, [ebx+4]
+ mulps xmm2, xmm0
+ mulps xmm3, xmm1
+ movaps xmm4, [eax+20]
+...

Notes on 1.1.4 Windows. Testing of SSE Intrinics Code and others

2004 Aug 06

2

..._m128 xx;
257: __m128 yy;
258: /* Compute next filter result */
259: xx = _mm_load_ps1(x+i);
00413483 mov eax,dword ptr [ebp-64h]
00413486 mov ecx,dword ptr [ebx+8]
00413489 lea edx,[ecx+eax*4]
0041348C movss xmm0,dword ptr [edx]
00413490 shufps xmm0,xmm0,0
00413494 movaps xmmword ptr [xx],xmm0
260: yy = _mm_add_ss(xx, mem[0]);
00413498 movaps xmm0,xmmword ptr [ebp-60h]
0041349C movaps xmm1,xmmword ptr [xx]
004134A0 addss xmm1,xmm0
004134A4 movaps xmmword ptr [yy],xmm1
261: _mm_store_...

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

2015 Jan 30

0

...we can't fold memory
>>>> anymore:
>>>> vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>>>> vblendps $0x1, %xmm2, %xmm0, %xmm0
>>>> becomes:
>>>> vmovaps -0xXX(%rdx), %xmm2
>>>> vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 =
>>>> xmm2[3,0],xmm0[0,0]
>>>> vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 =
>>>> xmm3[0,2],xmm0[1,2]
>>>>
>>>>
>>>> Also, I see differences when some loads are shuffled, that...

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

2015 Jan 25

4

I ran the benchmarking subset of test-suite on a btver2 machine and optimizing for btver2 (so enabling AVX codegen). I don't see anything outside of the noise with x86-experimental-vector-shuffle-legality=1.

On Fri, Jan 23, 2015 at 5:19 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> Hi Chandler,
>
> On Fri, Jan 23, 2015 at 8:15 AM, Chandler Carruth

[LLVMdev] X86 LowerVECTOR_SHUFFLE Question

2011 Feb 26

0

...defs for VUNPCKLPSY and VUNPCKLPDY and corresponding patterns. Once I added them everything started working.

I found this all very confusing because it appears there are now two ways to match certain shuffle instructions in .td files: one through the traditional shuffle operators like unpckl and shufp and another through these special X86* operators. This is reflected in X86InstrSSE.td:

"Traditional":

    defm VUNPCKLPS: sse12_unpack_interleave<0x14, unpckl, v4f32, memopv4f32,
                                            VR128, f128mem,
                                            "unpcklps\t{$src2, $src1, $dst|$dst, $src1, $src2}",...

[LLVMdev] X86 LowerVECTOR_SHUFFLE Question

2011 Feb 25

2

In ToT, LowerVECTOR_SHUFFLE for x86 has this code:

    if (X86::isUNPCKLMask(SVOp))
      getTargetShuffleNode(getUNPCKLOpcode(VT), dl, VT, V1, V2, DAG);

why would this not be:

    if (X86::isUNPCKLMask(SVOp))
      return SVOp;

I'm trying to add support for VUNPCKL and am getting into trouble because the existing code ends up creating:

    VUNPCKLPS load load

which is badness come selection

[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

2014 Sep 09

5

...uffle lowering is missing some features; for example, you explicitly said that we currently lack SSE4.1 blend support. Unfortunately, this seems to be one of the main reasons for the slowdown we are seeing.

Here is a list of what we found so far that we think is causing most of the slowdown:

1) shufps is always emitted in cases where we could emit a single blendps; in these cases, blendps is preferable because it has better reciprocal throughput (this is true on all modern Intel and AMD CPUs).

Things get worse when it comes to lowering shuffles where the shuffle mask indices refer to elements...
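Point 1 depends on recognizing when a shuffle is really a blend, meaning every result lane keeps its position and only the source operand varies. A minimal sketch of such a check, not the LLVM lowering code: `blendps_imm` is a hypothetical helper, and the mask is assumed to use indices 0-3 for the first operand and 4-7 for the second.

```python
def blendps_imm(mask):
    """Return the blendps immediate for a 4-lane mask, or None.

    A mask is a blend when lane i reads either element i of the first
    operand (index i) or element i of the second operand (index i + 4).
    Bit i of the immediate is set when lane i comes from the second
    operand.
    """
    imm = 0
    for i, m in enumerate(mask):
        if m == i:
            continue                 # lane stays from the first operand
        if m == i + 4:
            imm |= 1 << i            # lane taken from the second operand
        else:
            return None              # lane moves position: not a blend
    return imm


if __name__ == "__main__":
    # Low half from operand 1, high half from operand 2. A single shufps
    # (imm 0xE4) can also produce this, which is the case described above.
    print(blendps_imm([0, 1, 6, 7]))
    # Lanes change position, so no single blendps applies.
    print(blendps_imm([1, 0, 6, 7]))
```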

[LLVMdev] X86 LowerVECTOR_SHUFFLE Question

2011 Feb 28

2

...ly
> thought about the bigger picture enough yet.
>
>> but IMHO the implementation of x86 shuffle matching is a lot more
>> clear now than it used to be in the past.
>
> There's certainly been improvement on the TableGen side of things. I
> really liked the unpck*, shufp, etc. nodes and the ShuffleVectorSDNode.
> That's a huge help. It's too bad we're getting rid of them. But
> legalization still looks about the same to me.

The idea is to use tablegen again once we have a clean implementation. It would be good to have all tables and per-process...

[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

2014 Sep 10

2

...features; for example, you explicitly said that we currently lack
> SSE4.1 blend support. Unfortunately, this seems to be one of the
> main reasons for the slowdown we are seeing.
>
> Here is a list of what we found so far that we think is causing most
> of the slowdown:
> 1) shufps is always emitted in cases where we could emit a single
> blendps; in these cases, blendps is preferable because it has better
> reciprocal throughput (this is true on all modern Intel and AMD CPUs).
>
>

Yep. I think this is actually super easy. I'll add support for blendps shortl...

[LLVMdev] X86 LowerVECTOR_SHUFFLE Question

2011 Feb 28

0

...ar fashion. I haven't really thought about the bigger picture enough yet.

> but IMHO the implementation of x86 shuffle matching is a lot more
> clear now than it used to be in the past.

There's certainly been improvement on the TableGen side of things. I really liked the unpck*, shufp, etc. nodes and the ShuffleVectorSDNode. That's a huge help. It's too bad we're getting rid of them. But legalization still looks about the same to me.

Thanks for the explanation.

-Dave

[LLVMdev] Seg faulting on vector ops

2007 Jul 20

5

...0000`01b80030 660fc4c903 pinsrw xmm1,ecx,3
00000000`01b80035 660fc4c804 pinsrw xmm1,eax,4
00000000`01b8003a 660fc4c905 pinsrw xmm1,ecx,5
00000000`01b8003f 660fc4c806 pinsrw xmm1,eax,6
00000000`01b80044 660fc4c907 pinsrw xmm1,ecx,7
00000000`01b80049 0fc6c903 shufps xmm1,xmm1,3
00000000`01b8004d f30f110c24 movss dword ptr [esp],xmm1
00000000`01b80052 d90424 fld dword ptr [esp]
00000000`01b80055 83c420 add esp,20h
00000000`01b80058 c3 ret

The code used to generate and run the program was:

#include "...

[LLVMdev] How does SSEDomainFix work?

2010 May 11

0

...uctions moved to the int domain because the add forced them.
> Please tell me if something would be wrong for me.

You should measure if LLVM's code is actually slower than the code you want. If it is, I would like to hear. Our weakness is the shufflevector instruction. It is selected into shufps/pshufd/palign/... only by looking at patterns. The instruction selector does not consider execution domains. This can be a problem because these instructions cannot be freely interchanged by the SSE execution domain pass.

> foo.ll:
> define <4 x i32> @foo(<4 x i32> %x, <4 x...

[LLVMdev] X86 LowerVECTOR_SHUFFLE Question

2011 Feb 28

2

> In the experience I just had, it is quite error-prone to have multiple
> tblgen patterns to match these things. The way things were before,
> there was a clean separation between checking/enforcing node legality
> and doing the final code selection, with isel being automatic through
> tblgen. That was nice. The current setup mixes the two and seems to
> result in more code
