Chandler Carruth
2015-Jan-29 00:47 UTC
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
On Wed, Jan 28, 2015 at 4:05 PM, Ahmed Bougacha <ahmed.bougacha at gmail.com> wrote:

> Hi Chandler,
>
> I've been looking at the regressions Quentin mentioned, and filed a PR
> for the most egregious one: http://llvm.org/bugs/show_bug.cgi?id=22377
>
> As for the others, I'm working on reducing them, but for now, here are
> some raw observations, in case any of it rings a bell:

Very cool, and thanks for the analysis!

> Another problem I'm seeing is that in some cases we can't fold memory
> anymore:
>     vpermilps  $-0x6d, -0xXX(%rdx), %xmm2   ## xmm2 = mem[3,0,1,2]
>     vblendps   $0x1, %xmm2, %xmm0, %xmm0
> becomes:
>     vmovaps    -0xXX(%rdx), %xmm2
>     vshufps    $0x3, %xmm0, %xmm2, %xmm3    ## xmm3 = xmm2[3,0],xmm0[0,0]
>     vshufps    $-0x68, %xmm0, %xmm3, %xmm0  ## xmm0 = xmm3[0,2],xmm0[1,2]
>
> Also, I see differences when some loads are shuffled, that I'm a bit
> conflicted about:
>     vmovaps    -0xXX(%rbp), %xmm3
>     ...
>     vinsertps  $0xc0, %xmm4, %xmm3, %xmm5   ## xmm5 = xmm4[3],xmm3[1,2,3]
> becomes:
>     vpermilps  $-0x6d, -0xXX(%rbp), %xmm2   ## xmm2 = mem[3,0,1,2]
>     ...
>     vinsertps  $0xc0, %xmm4, %xmm2, %xmm2   ## xmm2 = xmm4[3],xmm2[1,2,3]
>
> Note that the second version does the shuffle in place, in xmm2.
>
> Some are blends (har har) of those two:
>     vpermilps  $-0x6d, %xmm_mem_1, %xmm6    ## xmm6 = xmm_mem_1[3,0,1,2]
>     vpermilps  $-0x6d, -0xXX(%rax), %xmm1   ## xmm1 = mem_2[3,0,1,2]
>     vblendps   $0x1, %xmm1, %xmm6, %xmm0    ## xmm0 = xmm1[0],xmm6[1,2,3]
> becomes:
>     vmovaps    -0xXX(%rax), %xmm0           ## xmm0 = mem_2[0,1,2,3]
>     vpermilps  $-0x6d, %xmm0, %xmm1         ## xmm1 = xmm0[3,0,1,2]
>     vshufps    $0x3, %xmm_mem_1, %xmm0, %xmm0    ## xmm0 = xmm0[3,0],xmm_mem_1[0,0]
>     vshufps    $-0x68, %xmm_mem_1, %xmm0, %xmm0  ## xmm0 = xmm0[0,2],xmm_mem_1[1,2]
>
> I also see a lot of somewhat neutral (focusing on Haswell for now)
> domain changes such as (xmm5 and xmm0 are initially integers, and are
> dead after the store):
>     vpshufd    $-0x5c, %xmm0, %xmm0         ## xmm0 = xmm0[0,1,2,2]
>     vpalignr   $0xc, %xmm0, %xmm5, %xmm0    ## xmm0 = xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
>     vmovdqu    %xmm0, 0x20(%rax)
> turning into:
>     vshufps    $0x2, %xmm5, %xmm0, %xmm0    ## xmm0 = xmm0[2,0],xmm5[0,0]
>     vshufps    $-0x68, %xmm5, %xmm0, %xmm0  ## xmm0 = xmm0[0,2],xmm5[1,2]
>     vmovups    %xmm0, 0x20(%rax)

All of these stem from what I think is the same core weakness of the
current algorithm: we prefer the fully general shufps+shufps 4-way
shuffle/blend far too often. Here is how I would more precisely classify
the two things missing here:

- Check if either input is "in place" and we can do a fast single-input
  shuffle with a fixed blend.
- Check if we can form a rotation and use palignr to finish a
  shuffle/blend.

There may be other patterns we're missing, but these two seem to jump out
based on your analysis, and may be fairly easy to tackle.
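To make the first of those two checks concrete, here is a small constructed IR example (the function name and mask are illustrative, chosen to mirror the vpermilps+vblendps sequence quoted above, not taken from any filed PR): lanes 1-3 stay "in place" in %a and only lane 0 comes from the other input, so a blend with a single shuffled, memory-foldable operand should be enough instead of two shufps.

  ; Hypothetical test case: lane 0 <- %b[3], lanes 1-3 <- %a "in place".
  ; The two-instruction sequence quoted above (vpermilps $-0x6d + vblendps $0x1)
  ; covers this, and the permute can fold a load of %b.
  define <4 x float> @inplace_blend(<4 x float> %a, <4 x float> %b) {
    %r = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 7, i32 1, i32 2, i32 3>
    ret <4 x float> %r
  }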
Ahmed Bougacha
2015-Jan-29 19:50 UTC
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
On Wed, Jan 28, 2015 at 4:47 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> [...]
>
> All of these stem from what I think is the same core weakness of the
> current algorithm: we prefer the fully general shufps+shufps 4-way
> shuffle/blend far too often. Here is how I would more precisely classify
> the two things missing here:
>
> - Check if either input is "in place" and we can do a fast single-input
>   shuffle with a fixed blend.

I believe this would be http://llvm.org/bugs/show_bug.cgi?id=22390

> - Check if we can form a rotation and use palignr to finish a
>   shuffle/blend.

... and this would be http://llvm.org/bugs/show_bug.cgi?id=22391

I think this about covers the Haswell regressions I'm seeing. Now for
some pre-AVX fun!

-Ahmed

> There may be other patterns we're missing, but these two seem to jump out
> based on your analysis, and may be fairly easy to tackle.
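Similarly, a minimal constructed example for the palignr case (again, the name and mask are illustrative, not from the reduced test cases in PR22391): the mask selects a contiguous window of the concatenated inputs, so the whole shuffle is a rotation that a single palignr can produce instead of two shufps.

  ; Hypothetical test case: the mask <3,4,5,6> picks contiguous elements
  ; 3..6 of the concatenation <%a, %b>, i.e. roughly one palignr with a
  ; 12-byte immediate, given the operands in the right order.
  define <4 x i32> @rotate_window(<4 x i32> %a, <4 x i32> %b) {
    %r = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 3, i32 4, i32 5, i32 6>
    ret <4 x i32> %r
  }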
Ahmed Bougacha
2015-Jan-30 19:15 UTC
[LLVMdev] RFB: Would like to flip the vector shuffle legality flag
I filed a couple more, in case they're actually different issues:
- http://llvm.org/bugs/show_bug.cgi?id=22412
- http://llvm.org/bugs/show_bug.cgi?id=22413

And that's pretty much it for internal changes. I'm fine with flipping
the switch; Quentin, are you?

Also, just to have an idea, do you (or someone else!) plan to tackle
these in the near future?

-Ahmed

On Thu, Jan 29, 2015 at 11:50 AM, Ahmed Bougacha <ahmed.bougacha at gmail.com> wrote:
> [...]