Ahmed Bougacha
2015-Jan-30  19:15 UTC
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
I filed a couple more, in case they're actually different issues:
- http://llvm.org/bugs/show_bug.cgi?id=22412
- http://llvm.org/bugs/show_bug.cgi?id=22413

And that's pretty much it for internal changes. I'm fine with flipping
the switch; Quentin, are you?
Also, just to have an idea, do you (or someone else!) plan to tackle these
in the near future?

-Ahmed

On Thu, Jan 29, 2015 at 11:50 AM, Ahmed Bougacha <ahmed.bougacha at gmail.com> wrote:
> On Wed, Jan 28, 2015 at 4:47 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>> On Wed, Jan 28, 2015 at 4:05 PM, Ahmed Bougacha <ahmed.bougacha at gmail.com> wrote:
>>> Hi Chandler,
>>>
>>> I've been looking at the regressions Quentin mentioned, and filed a PR
>>> for the most egregious one: http://llvm.org/bugs/show_bug.cgi?id=22377
>>>
>>> As for the others, I'm working on reducing them, but for now, here are
>>> some raw observations, in case any of it rings a bell:
>>
>> Very cool, and thanks for the analysis!
>>
>>> Another problem I'm seeing is that in some cases we can't fold memory
>>> anymore:
>>>     vpermilps $-0x6d, -0xXX(%rdx), %xmm2  ## xmm2 = mem[3,0,1,2]
>>>     vblendps $0x1, %xmm2, %xmm0, %xmm0
>>> becomes:
>>>     vmovaps -0xXX(%rdx), %xmm2
>>>     vshufps $0x3, %xmm0, %xmm2, %xmm3     ## xmm3 = xmm2[3,0],xmm0[0,0]
>>>     vshufps $-0x68, %xmm0, %xmm3, %xmm0   ## xmm0 = xmm3[0,2],xmm0[1,2]
>>>
>>> Also, I see differences when some loads are shuffled, that I'm a bit
>>> conflicted about:
>>>     vmovaps -0xXX(%rbp), %xmm3
>>>     ...
>>>     vinsertps $0xc0, %xmm4, %xmm3, %xmm5  ## xmm5 = xmm4[3],xmm3[1,2,3]
>>> becomes:
>>>     vpermilps $-0x6d, -0xXX(%rbp), %xmm2  ## xmm2 = mem[3,0,1,2]
>>>     ...
>>>     vinsertps $0xc0, %xmm4, %xmm2, %xmm2  ## xmm2 = xmm4[3],xmm2[1,2,3]
>>>
>>> Note that the second version does the shuffle in-place, in xmm2.
>>>
>>> Some are blends (har har) of those two:
>>>     vpermilps $-0x6d, %xmm_mem_1, %xmm6       ## xmm6 = xmm_mem_1[3,0,1,2]
>>>     vpermilps $-0x6d, -0xXX(%rax), %xmm1      ## xmm1 = mem_2[3,0,1,2]
>>>     vblendps $0x1, %xmm1, %xmm6, %xmm0        ## xmm0 = xmm1[0],xmm6[1,2,3]
>>> becomes:
>>>     vmovaps -0xXX(%rax), %xmm0                ## %xmm0 = mem_2[0,1,2,3]
>>>     vpermilps $-0x6d, %xmm0, %xmm1            ## xmm1 = xmm0[3,0,1,2]
>>>     vshufps $0x3, %xmm_mem_1, %xmm0, %xmm0    ## xmm0 = xmm0[3,0],xmm_mem_1[0,0]
>>>     vshufps $-0x68, %xmm_mem_1, %xmm0, %xmm0  ## xmm0 = xmm0[0,2],xmm_mem_1[1,2]
>>>
>>> I also see a lot of somewhat neutral (focusing on Haswell for now)
>>> domain changes such as (xmm5 and xmm0 are initially integers, and are
>>> dead after the store):
>>>     vpshufd $-0x5c, %xmm0, %xmm0         ## xmm0 = xmm0[0,1,2,2]
>>>     vpalignr $0xc, %xmm0, %xmm5, %xmm0   ## xmm0 = xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
>>>     vmovdqu %xmm0, 0x20(%rax)
>>> turning into:
>>>     vshufps $0x2, %xmm5, %xmm0, %xmm0    ## xmm0 = xmm0[2,0],xmm5[0,0]
>>>     vshufps $-0x68, %xmm5, %xmm0, %xmm0  ## xmm0 = xmm0[0,2],xmm5[1,2]
>>>     vmovups %xmm0, 0x20(%rax)
>>
>> All of these stem from what I think is the same core weakness of the
>> current algorithm: we prefer the fully general shufps+shufps 4-way
>> shuffle/blend far too often. Here is how I would more precisely classify
>> the two things missing here:
>>
>> - Check if either input is "in place" and we can do a fast single-input
>>   shuffle with a fixed blend.
>
> I believe this would be http://llvm.org/bugs/show_bug.cgi?id=22390
>
>> - Check if we can form a rotation and use palignr to finish a
>>   shuffle/blend
>
> .. and this would be http://llvm.org/bugs/show_bug.cgi?id=22391
>
> I think this about covers the Haswell regressions I'm seeing. Now for
> some pre-AVX fun!
>
> -Ahmed
>
>> There may be other patterns we're missing, but these two seem to jump out
>> based on your analysis, and may be fairly easy to tackle.
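For readers checking the `##` annotations in the listings above: the 8-bit immediate of vpermilps, vshufps and vpshufd packs four 2-bit lane selectors, lowest lane first. The following is a minimal Python sketch of that decoding, not code from LLVM; the helper names `decode_imm8`, `vpermilps` and `vshufps` are made up for illustration, and registers are modeled as plain 4-element lists.

```python
def decode_imm8(imm):
    """Split an 8-bit shuffle immediate into four 2-bit lane selectors.

    The disassembly in this thread prints the byte as signed (e.g. -0x6d),
    so mask it back to its unsigned value (0x93) first.
    """
    imm &= 0xFF
    return [(imm >> (2 * lane)) & 0x3 for lane in range(4)]


def vpermilps(imm, src):
    """Single-input shuffle: every result lane picks an element of src."""
    return [src[sel] for sel in decode_imm8(imm)]


def vshufps(imm, first, second):
    """Two-input shuffle: the low two result lanes pick from `first`,
    the high two from `second` (here `first` is the register named first
    in the '##' annotations, i.e. the third AT&T operand)."""
    sel = decode_imm8(imm)
    return [first[sel[0]], first[sel[1]], second[sel[2]], second[sel[3]]]


if __name__ == "__main__":
    mem = ["mem[0]", "mem[1]", "mem[2]", "mem[3]"]
    print(vpermilps(-0x6d, mem))   # lanes 3,0,1,2 of mem
    print(decode_imm8(-0x68))      # selectors for the second vshufps
```

Under this model `-0x6d` decodes to `[3, 0, 1, 2]`, `0x3` to `[3, 0, 0, 0]` and `-0x68` to `[0, 2, 1, 2]`, which agrees with the annotations `mem[3,0,1,2]`, `xmm2[3,0],xmm0[0,0]` and `xmm3[0,2],xmm0[1,2]`.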
Chandler Carruth
2015-Jan-30  19:23 UTC
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
I may get one or two in the next month, but not more than that. Focused on
the pass manager for now. If none get there first, I'll eventually circle
back though, so they won't rot forever.

On Jan 30, 2015 11:21 AM, "Ahmed Bougacha" <ahmed.bougacha at gmail.com> wrote:
> I filed a couple more, in case they're actually different issues:
> - http://llvm.org/bugs/show_bug.cgi?id=22412
> - http://llvm.org/bugs/show_bug.cgi?id=22413
>
> And that's pretty much it for internal changes. I'm fine with flipping
> the switch; Quentin, are you?
> Also, just to have an idea, do you (or someone else!) plan to tackle these
> in the near future?
>
> -Ahmed
Ahmed Bougacha
2015-Jan-30  19:25 UTC
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
On Fri, Jan 30, 2015 at 11:23 AM, Chandler Carruth <chandlerc at gmail.com> wrote:
> I may get one or two in the next month, but not more than that. Focused on
> the pass manager for now. If none get there first, I'll eventually circle
> back though, so they won't rot forever.

Alright, I'll give it a try in the next few weeks as well.

-Ahmed
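The first missing check named in this thread (one input already "in place", so a single-input shuffle of the other input plus a fixed blend suffices; PR22390) can be illustrated with a small Python sketch. This is only an assumption about what such a check might look like, not the code in LLVM's X86 backend. The function name `match_in_place_blend` is hypothetical, masks follow the usual shufflevector convention (0..n-1 selects from the first input, n..2n-1 from the second, -1 is undef), and the blend bit convention (bit i set means lane i comes from the shuffled input) is the sketch's own.

```python
def match_in_place_blend(mask):
    """Try to split a two-input shuffle mask into
    (in_place_input, perm_of_other_input, blend_bits).

    Returns None when neither input has all of its used elements
    already sitting in their destination lanes.
    """
    n = len(mask)
    # Try input 0 as the in-place one first, then input 1.
    for in_place, base, other_base in ((0, 0, n), (1, n, 0)):
        perm = [-1] * n   # single-input shuffle applied to the other input
        blend = 0         # bit i set: lane i is taken from the shuffled input
        ok = True
        for lane, m in enumerate(mask):
            if m < 0:
                continue              # undef lane, anything works
            if base <= m < base + n:
                if m != base + lane:  # element of the in-place input moved
                    ok = False
                    break
            else:
                perm[lane] = m - other_base
                blend |= 1 << lane
        if ok:
            return in_place, perm, blend
    return None


if __name__ == "__main__":
    # Lane 0 from element 3 of the second input, lanes 1..3 of the first
    # input untouched: one permute of the second input, then a blend.
    print(match_in_place_blend([7, 1, 2, 3]))
```

For the mask `[7, 1, 2, 3]` the sketch reports the first input as in place, a permute of the second input that moves its element 3 into lane 0, and blend bits `0b0001`, which has the same shape as the vpermilps + `vblendps $0x1` pairs quoted earlier in the thread.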
Chandler Carruth
2015-Feb-03  20:55 UTC
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
Thanks to everyone doing the benchmarking!!! =D

On Fri, Jan 30, 2015 at 11:15 AM, Ahmed Bougacha <ahmed.bougacha at gmail.com> wrote:
> I'm fine with flipping the switch; Quentin, are you?

I checked quickly and Quentin seems happy. Everyone else seems to have
reported back happy. I'm planning to flip the switch and delete the old
shuffle code "soon". No guarantees (lots of other stuff in flight) but
hoping to rip all of this stuff out.

-Chandler
Chandler Carruth
2015-Feb-20  02:05 UTC
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
On Tue, Feb 3, 2015 at 12:55 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
> Thanks to everyone doing the benchmarking!!! =D
>
> On Fri, Jan 30, 2015 at 11:15 AM, Ahmed Bougacha <ahmed.bougacha at gmail.com> wrote:
>> I'm fine with flipping the switch; Quentin, are you?
>
> I checked quickly and Quentin seems happy. Everyone else seems to have
> reported back happy. I'm planning to flip the switch and delete the old
> shuffle code "soon". No guarantees (lots of other stuff in flight) but
> hoping to rip all of this stuff out.

FYI, I've fixed all the regressions filed except for PR22391, along with a
*giant* pile of other improvements to the vector shuffle lowering. I even
have a fix up my sleeve for PR22391, but it needs refactoring in the code
that I don't really want to do while supporting both.

It is time. I'm going to start submitting the patches now to rip out the
flag and all the code supporting it.

-Chandler
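PR22391 is the report filed earlier in this thread for the second missing check, forming a rotation and finishing the shuffle with palignr. As a rough illustration of what "forming a rotation" means, here is a Python sketch under stated assumptions; it is not the fix Chandler refers to, and `match_rotation` and `palignr_byte_imm` are hypothetical names. It only handles the two-input case, where the mask reads consecutive elements out of the concatenation of the two inputs (wrapping around), with -1 meaning undef.

```python
def match_rotation(mask):
    """Return the start offset s such that every defined mask[i] equals
    (s + i) mod 2n, i.e. the result is n consecutive elements of the
    concatenated inputs. Returns None if no such offset exists (or if
    the mask is entirely undef)."""
    n = len(mask)
    start = None
    for lane, m in enumerate(mask):
        if m < 0:
            continue                      # undef lane constrains nothing
        candidate = (m - lane) % (2 * n)
        if start is None:
            start = candidate
        elif candidate != start:
            return None                   # lanes disagree: not a rotation
    return start


def palignr_byte_imm(start, num_elts, elt_bytes=4):
    """Byte shift for a start offset: the rotation amount in elements,
    scaled by the element size (4 bytes for the dword lanes used here)."""
    return (start % num_elts) * elt_bytes


if __name__ == "__main__":
    # Dword view of the vpalignr quoted in the thread:
    # result = xmm0[3], xmm5[0], xmm5[1], xmm5[2]
    s = match_rotation([3, 4, 5, 6])
    print(s, hex(palignr_byte_imm(s, 4)))
```

Read as dwords, the annotated `vpalignr $0xc` from the thread corresponds to the mask `[3, 4, 5, 6]`; the sketch matches it with start offset 3, which scales to a 12-byte (`0xc`) shift. A start offset of n or more would mean the second input supplies the low part of the concatenation.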