Chandler Carruth
2015-Jan-29  00:47 UTC
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
On Wed, Jan 28, 2015 at 4:05 PM, Ahmed Bougacha <ahmed.bougacha at gmail.com> wrote:

> Hi Chandler,
>
> I've been looking at the regressions Quentin mentioned, and filed a PR
> for the most egregious one: http://llvm.org/bugs/show_bug.cgi?id=22377
>
> As for the others, I'm working on reducing them, but for now, here are
> some raw observations, in case any of it rings a bell:

Very cool, and thanks for the analysis!

> Another problem I'm seeing is that in some cases we can't fold memory
> anymore:
>     vpermilps $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>     vblendps $0x1, %xmm2, %xmm0, %xmm0
> becomes:
>     vmovaps -0xXX(%rdx), %xmm2
>     vshufps $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
>     vshufps $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]
>
> Also, I see differences when some loads are shuffled, that I'm a bit
> conflicted about:
>     vmovaps -0xXX(%rbp), %xmm3
>     ...
>     vinsertps $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 = xmm4[3],xmm3[1,2,3]
> becomes:
>     vpermilps $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
>     ...
>     vinsertps $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 = xmm4[3],xmm2[1,2,3]
>
> Note that the second version does the shuffle in-place, in xmm2.
>
> Some are blends (har har) of those two:
>     vpermilps $-0x6d, %xmm_mem_1, %xmm6 ## xmm6 = xmm_mem_1[3,0,1,2]
>     vpermilps $-0x6d, -0xXX(%rax), %xmm1 ## xmm1 = mem_2[3,0,1,2]
>     vblendps $0x1, %xmm1, %xmm6, %xmm0 ## xmm0 = xmm1[0],xmm6[1,2,3]
> becomes:
>     vmovaps -0xXX(%rax), %xmm0 ## %xmm0 = mem_2[0,1,2,3]
>     vpermilps $-0x6d, %xmm0, %xmm1 ## xmm1 = xmm0[3,0,1,2]
>     vshufps $0x3, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[3,0],xmm_mem_1[0,0]
>     vshufps $-0x68, %xmm_mem_1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm_mem_1[1,2]
>
> I also see a lot of somewhat neutral (focusing on Haswell for now)
> domain changes such as (xmm5 and 0 are initially integers, and are
> dead after the store):
>     vpshufd $-0x5c, %xmm0, %xmm0 ## xmm0 = xmm0[0,1,2,2]
>     vpalignr $0xc, %xmm0, %xmm5, %xmm0 ## xmm0 = xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
>     vmovdqu %xmm0, 0x20(%rax)
> turning into:
>     vshufps $0x2, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[2,0],xmm5[0,0]
>     vshufps $-0x68, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm5[1,2]
>     vmovups %xmm0, 0x20(%rax)

All of these stem from what I think is the same core weakness of the
current algorithm: we prefer the fully general shufps+shufps 4-way
shuffle/blend far too often. Here is how I would more precisely classify
the two things missing here:

- Check if either input is "in place", so that we can do a fast
  single-input shuffle followed by a fixed blend.
- Check if we can form a rotation and use palignr to finish a
  shuffle/blend.

There may be other patterns we're missing, but these two seem to jump out
based on your analysis, and may be fairly easy to tackle.
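The two checks listed above can be sketched on a plain shuffle mask. This is only an illustration in Python, not LLVM's X86 lowering code: the function names (`in_place_inputs`, `shuffle_then_blend`, `match_rotation`) are hypothetical, and the mask convention is assumed to be the usual one for a two-input shuffle of n lanes (0..n-1 selects from V1, n..2n-1 selects from V2, -1 is undefined).

```python
from typing import List, Optional, Tuple


def in_place_inputs(mask: List[int]) -> Tuple[bool, bool]:
    """Report whether every lane taken from V1 (resp. V2) already sits in its own lane."""
    n = len(mask)
    v1 = all(m < 0 or m >= n or m == i for i, m in enumerate(mask))
    v2 = all(m < n or m - n == i for i, m in enumerate(mask))
    return v1, v2


def shuffle_then_blend(mask: List[int]) -> Optional[Tuple[str, List[int], List[bool]]]:
    """If one input is in place, return (input to shuffle, unary mask, blend lanes).

    blend lanes are True where the result takes its lane from V2.
    """
    n = len(mask)
    v1_ok, v2_ok = in_place_inputs(mask)
    take_v2 = [m >= n for m in mask]
    if v1_ok:
        # V1 stays put; only V2 needs a single-input shuffle before the blend.
        return "V2", [m - n if m >= n else -1 for m in mask], take_v2
    if v2_ok:
        # V2 stays put; only V1 needs a single-input shuffle before the blend.
        return "V1", [m if 0 <= m < n else -1 for m in mask], take_v2
    return None


def match_rotation(mask: List[int]) -> Optional[Tuple[str, str, int]]:
    """Match the mask as lanes [r, r+n) of the concatenation lo:hi.

    Returns (lo, hi, r) with r counted in lanes, or None. Only two-input
    rotations are considered in this sketch.
    """
    n = len(mask)
    for lo, hi, base in (("V1", "V2", 0), ("V2", "V1", n)):
        for r in range(1, n):
            if all(m < 0 or m == (i + r + base) % (2 * n) for i, m in enumerate(mask)):
                return lo, hi, r
    return None


# Mask [7, 1, 2, 3]: V1 is in place, so shuffle V2 alone and blend lane 0.
print(shuffle_then_blend([7, 1, 2, 3]))
# Mask [7, 0, 1, 2]: neither input is in place, but it is a rotation.
print(match_rotation([7, 0, 1, 2]))
```

For 4-byte lanes, a rotation of 3 lanes corresponds to a byte shift of 3 * 4 = 12, which is the `$0xc` immediate on the vpalignr in the quoted example.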
Ahmed Bougacha
2015-Jan-29  19:50 UTC
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
On Wed, Jan 28, 2015 at 4:47 PM, Chandler Carruth <chandlerc at gmail.com> wrote:

> On Wed, Jan 28, 2015 at 4:05 PM, Ahmed Bougacha <ahmed.bougacha at gmail.com> wrote:
>
>> [...]
>
> All of these stem from what I think is the same core weakness of the
> current algorithm: we prefer the fully general shufps+shufps 4-way
> shuffle/blend far too often. Here is how I would more precisely classify
> the two things missing here:
>
> - Check if either input is "in place", so that we can do a fast
>   single-input shuffle followed by a fixed blend.

I believe this would be http://llvm.org/bugs/show_bug.cgi?id=22390

> - Check if we can form a rotation and use palignr to finish a
>   shuffle/blend.

.. and this would be http://llvm.org/bugs/show_bug.cgi?id=22391

I think this about covers the Haswell regressions I'm seeing. Now for
some pre-AVX fun!

-Ahmed

> There may be other patterns we're missing, but these two seem to jump out
> based on your analysis, and may be fairly easy to tackle.
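As a sanity check on the instruction sequences quoted in this thread, here is a small lane-level model in Python. It is a sketch under stated assumptions: the helper names are made up for illustration, operands follow Intel source order (the reverse of the AT&T order in the listings), `palignr` is only modelled for byte shifts that are multiples of 4, and nothing here models execution domains, memory folding, or cost, which is what the regressions are actually about.

```python
def decode_imm(imm):
    """Split an 8-bit immediate into four 2-bit lane selectors, lane 0 first."""
    imm &= 0xFF  # the disassembly prints immediates as signed bytes
    return [(imm >> (2 * i)) & 3 for i in range(4)]


def permilps(imm, src):
    """Single-input shuffle: every result lane is selected from src."""
    return [src[s] for s in decode_imm(imm)]


pshufd = permilps  # same lane selection at this level of detail


def shufps(imm, src1, src2):
    """Result lanes 0-1 are selected from src1, lanes 2-3 from src2."""
    s = decode_imm(imm)
    return [src1[s[0]], src1[s[1]], src2[s[2]], src2[s[3]]]


def blendps(imm, src1, src2):
    """Lane i comes from src2 if bit i of imm is set, otherwise from src1."""
    return [src2[i] if (imm >> i) & 1 else src1[i] for i in range(4)]


def palignr(byte_shift, hi, lo):
    """Take 4 lanes of hi:lo starting byte_shift bytes in (4-byte lanes only)."""
    lanes = byte_shift // 4
    return (lo + hi)[lanes:lanes + 4]


a = ["a0", "a1", "a2", "a3"]  # stands in for xmm_mem_1
b = ["b0", "b1", "b2", "b3"]  # stands in for mem_2

# "Blends of those two": vpermilps + vpermilps + vblendps versus vshufps + vshufps.
old_blend = blendps(0x1, permilps(-0x6D, a), permilps(-0x6D, b))
new_blend = shufps(-0x68, shufps(0x3, b, a), a)

x = ["x0", "x1", "x2", "x3"]  # stands in for xmm0
y = ["y0", "y1", "y2", "y3"]  # stands in for xmm5

# Domain change: vpshufd + vpalignr versus vshufps + vshufps.
old_rot = palignr(0xC, y, pshufd(-0x5C, x))
new_rot = shufps(-0x68, shufps(0x2, x, y), y)

print(decode_imm(-0x6D), old_blend, new_blend, old_rot, new_rot)
```

In this model both pairs compute the same lanes, which is consistent with the observation that the differences are in instruction choice rather than in the result.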
Ahmed Bougacha
2015-Jan-30  19:15 UTC
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
I filed a couple more, in case they're actually different issues:
- http://llvm.org/bugs/show_bug.cgi?id=22412
- http://llvm.org/bugs/show_bug.cgi?id=22413

And that's pretty much it for internal changes. I'm fine with flipping
the switch; Quentin, are you?

Also, just to have an idea, do you (or someone else!) plan to tackle
these in the near future?

-Ahmed

On Thu, Jan 29, 2015 at 11:50 AM, Ahmed Bougacha <ahmed.bougacha at gmail.com> wrote:

> On Wed, Jan 28, 2015 at 4:47 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
>> [...]
>>
>> - Check if either input is "in place", so that we can do a fast
>>   single-input shuffle followed by a fixed blend.
>
> I believe this would be http://llvm.org/bugs/show_bug.cgi?id=22390
>
>> - Check if we can form a rotation and use palignr to finish a
>>   shuffle/blend.
>
> .. and this would be http://llvm.org/bugs/show_bug.cgi?id=22391
>
> I think this about covers the Haswell regressions I'm seeing. Now for
> some pre-AVX fun!
>
> -Ahmed
>
>> There may be other patterns we're missing, but these two seem to jump out
>> based on your analysis, and may be fairly easy to tackle.