thr3ads.net - llvm dev - [LLVMdev] RFB: Would like to flip the vector shuffle legality flag [Jan 2015]

If this information is useful, please help other people find it:
Share via:

Chandler Carruth

2015-Jan-23 08:15 UTC

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

Greetings LLVM hackers and x86 vector shufflers!

I would like to flip on another chunk of the new vector shuffling,
specifically the logic to mark ~all shuffles as "legal".

This can be tested today with the flag
"-x86-experimental-vector-shuffle-legality". I would essentially like
to
make this the default (by removing the "false" path). Doing this will
allow
me to completely delete the old vector shuffle lowering.

I've got the patches prepped and ready to go, but it will likely have a
significant impact on performance. Notably, a bunch of the remaining domain
crossing bugs I'm seeing are due to this. The key thing to realize is that
vector shuffle combining is *much* more powerful when we think all of these
are legal, and so we combine away bad shuffles that would trigger domain
crosses.

All of my benchmarks have come back performance neutral overall with a few
benchmarks improving. However, there may be some regressions that folks
want to track down first. I'd really like to get those reported and
prioritize among the vector shuffle work so we can nuke several *thousand*
lines of code from X86ISelLowering.cpp. =D

Thanks!
-Chandler


PS: If you're feeling adventurous, the next big mode flip flag I want to
see changed is -x86-experimental-vector-widening-legalization, but this is
a much more deep change to the entire vector legalization strategy, so I
want to do it second and separately.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150123/8fee23af/attachment.html>

Andrea Di Biagio

2015-Jan-23 12:19 UTC

head link

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

Hi Chandler,

On Fri, Jan 23, 2015 at 8:15 AM, Chandler Carruth <chandlerc at gmail.com>
wrote:> Greetings LLVM hackers and x86 vector shufflers!
>
> I would like to flip on another chunk of the new vector shuffling,
> specifically the logic to mark ~all shuffles as "legal".
>
> This can be tested today with the flag
> "-x86-experimental-vector-shuffle-legality". I would essentially
like to
> make this the default (by removing the "false" path). Doing this
will allow
> me to completely delete the old vector shuffle lowering.
>
> I've got the patches prepped and ready to go, but it will likely have a
> significant impact on performance. Notably, a bunch of the remaining domain
> crossing bugs I'm seeing are due to this. The key thing to realize is
that
> vector shuffle combining is *much* more powerful when we think all of these
> are legal, and so we combine away bad shuffles that would trigger domain
> crosses.
That's good news!
Also, I really like your idea of making all shuffles legal by default.
I remember I did some experiments disabling the checks for legal
shuffles in the DAGCombiner to see how well the new shuffle lowering
coped with 'overly' aggressive shuffle combining. I was surprised to
see that from eyeballing the generated code it looked much cleaner
(although I didn't test it extensively). Our target is btver2, so I
also didn't look at what could have been codegen for targets with no
AVX/SSE4.1 where there might be fewer opportunities to match a shuffle
with a single target instruction during legalization.
>
> All of my benchmarks have come back performance neutral overall with a few
> benchmarks improving. However, there may be some regressions that folks
want
> to track down first. I'd really like to get those reported and
prioritize
> among the vector shuffle work so we can nuke several *thousand* lines of
> code from X86ISelLowering.cpp. =D
I'll see if I can get some numbers from our internal codebase and help
with reporting potential regressions.

Thanks,
Andrea
>
> Thanks!
> -Chandler
>
>
> PS: If you're feeling adventurous, the next big mode flip flag I want
to see
> changed is -x86-experimental-vector-widening-legalization, but this is a
> much more deep change to the entire vector legalization strategy, so I want
> to do it second and separately.

Kuperstein, Michael M

2015-Jan-25 08:28 UTC

head link

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

Thanks for the heads up, Chandler!
I'll also try to get some numbers locally.

Michael

-----Original Message-----
From: Andrea Di Biagio [mailto:andrea.dibiagio at gmail.com] 
Sent: Friday, January 23, 2015 14:19
To: Chandler Carruth
Cc: Simon Pilgrim; qcolombet at apple.com; Sanjay Patel; LLVM Developers Mailing
List; Kuperstein, Michael M
Subject: Re: RFB: Would like to flip the vector shuffle legality flag

Hi Chandler,

On Fri, Jan 23, 2015 at 8:15 AM, Chandler Carruth <chandlerc at gmail.com>
wrote:> Greetings LLVM hackers and x86 vector shufflers!
>
> I would like to flip on another chunk of the new vector shuffling, 
> specifically the logic to mark ~all shuffles as "legal".
>
> This can be tested today with the flag 
> "-x86-experimental-vector-shuffle-legality". I would essentially
like
> to make this the default (by removing the "false" path). Doing
this
> will allow me to completely delete the old vector shuffle lowering.
>
> I've got the patches prepped and ready to go, but it will likely have 
> a significant impact on performance. Notably, a bunch of the remaining 
> domain crossing bugs I'm seeing are due to this. The key thing to 
> realize is that vector shuffle combining is *much* more powerful when 
> we think all of these are legal, and so we combine away bad shuffles 
> that would trigger domain crosses.
That's good news!
Also, I really like your idea of making all shuffles legal by default.
I remember I did some experiments disabling the checks for legal shuffles in the
DAGCombiner to see how well the new shuffle lowering coped with 'overly'
aggressive shuffle combining. I was surprised to see that from eyeballing the
generated code it looked much cleaner (although I didn't test it
extensively). Our target is btver2, so I also didn't look at what could have
been codegen for targets with no
AVX/SSE4.1 where there might be fewer opportunities to match a shuffle with a
single target instruction during legalization.
>
> All of my benchmarks have come back performance neutral overall with a 
> few benchmarks improving. However, there may be some regressions that 
> folks want to track down first. I'd really like to get those reported 
> and prioritize among the vector shuffle work so we can nuke several 
> *thousand* lines of code from X86ISelLowering.cpp. =D
I'll see if I can get some numbers from our internal codebase and help with
reporting potential regressions.

Thanks,
Andrea
>
> Thanks!
> -Chandler
>
>
> PS: If you're feeling adventurous, the next big mode flip flag I want 
> to see changed is -x86-experimental-vector-widening-legalization, but 
> this is a much more deep change to the entire vector legalization 
> strategy, so I want to do it second and separately.---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Sanjay Patel

2015-Jan-25 22:15 UTC

head link

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

I ran the benchmarking subset of test-suite on a btver2 machine and
optimizing for btver2 (so enabling AVX codegen).

I don't see anything outside of the noise with
x86-experimental-vector-shuffle-legality=1.

On Fri, Jan 23, 2015 at 5:19 AM, Andrea Di Biagio <andrea.dibiagio at
gmail.com> wrote:
> Hi Chandler,
>
> On Fri, Jan 23, 2015 at 8:15 AM, Chandler Carruth <chandlerc at
gmail.com>
> wrote:
> > Greetings LLVM hackers and x86 vector shufflers!
> >
> > I would like to flip on another chunk of the new vector shuffling,
> > specifically the logic to mark ~all shuffles as "legal".
> >
> > This can be tested today with the flag
> > "-x86-experimental-vector-shuffle-legality". I would
essentially like to
> > make this the default (by removing the "false" path). Doing
this will
> allow
> > me to completely delete the old vector shuffle lowering.
> >
> > I've got the patches prepped and ready to go, but it will likely
have a
> > significant impact on performance. Notably, a bunch of the remaining
> domain
> > crossing bugs I'm seeing are due to this. The key thing to realize
is
> that
> > vector shuffle combining is *much* more powerful when we think all of
> these
> > are legal, and so we combine away bad shuffles that would trigger
domain
> > crosses.
>
> That's good news!
> Also, I really like your idea of making all shuffles legal by default.
> I remember I did some experiments disabling the checks for legal
> shuffles in the DAGCombiner to see how well the new shuffle lowering
> coped with 'overly' aggressive shuffle combining. I was surprised
to
> see that from eyeballing the generated code it looked much cleaner
> (although I didn't test it extensively). Our target is btver2, so I
> also didn't look at what could have been codegen for targets with no
> AVX/SSE4.1 where there might be fewer opportunities to match a shuffle
> with a single target instruction during legalization.
>
> >
> > All of my benchmarks have come back performance neutral overall with a
> few
> > benchmarks improving. However, there may be some regressions that
folks
> want
> > to track down first. I'd really like to get those reported and
prioritize
> > among the vector shuffle work so we can nuke several *thousand* lines
of
> > code from X86ISelLowering.cpp. =D
>
> I'll see if I can get some numbers from our internal codebase and help
> with reporting potential regressions.
>
> Thanks,
> Andrea
>
> >
> > Thanks!
> > -Chandler
> >
> >
> > PS: If you're feeling adventurous, the next big mode flip flag I
want to
> see
> > changed is -x86-experimental-vector-widening-legalization, but this is
a
> > much more deep change to the entire vector legalization strategy, so I
> want
> > to do it second and separately.
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150125/494487d0/attachment.html>

Ahmed Bougacha

2015-Jan-29 00:05 UTC

head link

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

Hi Chandler,

I've been looking at the regressions Quentin mentioned, and filed a PR
for the most egregious one: http://llvm.org/bugs/show_bug.cgi?id=22377

As for the others, I'm working on reducing them, but for now, here are
some raw observations, in case any of it rings a bell:


Another problem I'm seeing is that in some cases we can't fold memory
anymore:
    vpermilps     $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
    vblendps      $0x1, %xmm2, %xmm0, %xmm0
becomes:
    vmovaps       -0xXX(%rdx), %xmm2
    vshufps       $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
    vshufps       $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]


Also, I see differences when some loads are shuffled, that I'm a bit
conflicted about:
    vmovaps       -0xXX(%rbp), %xmm3
    ...
    vinsertps     $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 = xmm4[3],xmm3[1,2,3]
becomes:
    vpermilps     $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
    ...
    vinsertps     $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 = xmm4[3],xmm2[1,2,3]

Note that the second version does the shuffle in-place, in xmm2.


Some are blends (har har) of those two:
    vpermilps     $-0x6d, %xmm_mem_1, %xmm6 ## xmm6 = xmm_mem_1[3,0,1,2]
    vpermilps     $-0x6d, -0xXX(%rax), %xmm1 ## xmm1 = mem_2[3,0,1,2]
    vblendps      $0x1, %xmm1, %xmm6, %xmm0 ## xmm0 = xmm1[0],xmm6[1,2,3]
becomes:
    vmovaps       -0xXX(%rax), %xmm0 ## %xmm0 = mem_2[0,1,2,3]
    vpermilps     $-0x6d, %xmm0, %xmm1 ## xmm1 = xmm0[3,0,1,2]
    vshufps       $0x3, %xmm_mem_1, %xmm0, %xmm0 ## xmm0
= xmm0[3,0],xmm_mem_1[0,0]
    vshufps       $-0x68, %xmm_mem_1, %xmm0, %xmm0 ## xmm0
= xmm0[0,2],xmm_mem_1[1,2]


I also see a lot of somewhat neutral (focusing on Haswell for now)
domain changes such as (xmm5 and 0 are initially integers, and are
dead after the store):
    vpshufd       $-0x5c, %xmm0, %xmm0    ## xmm0 = xmm0[0,1,2,2]
    vpalignr      $0xc, %xmm0, %xmm5, %xmm0 ## xmm0
= xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
    vmovdqu       %xmm0, 0x20(%rax)
turning into:
    vshufps       $0x2, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[2,0],xmm5[0,0]
    vshufps       $-0x68, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm5[1,2]
    vmovups       %xmm0, 0x20(%rax)


-Ahmed


On Fri, Jan 23, 2015 at 12:15 AM, Chandler Carruth <chandlerc at
gmail.com>
wrote:> Greetings LLVM hackers and x86 vector shufflers!
>
> I would like to flip on another chunk of the new vector shuffling,
> specifically the logic to mark ~all shuffles as "legal".
>
> This can be tested today with the flag
> "-x86-experimental-vector-shuffle-legality". I would essentially
like to
> make this the default (by removing the "false" path). Doing this
will
allow> me to completely delete the old vector shuffle lowering.
>
> I've got the patches prepped and ready to go, but it will likely have a
> significant impact on performance. Notably, a bunch of the remaining
domain> crossing bugs I'm seeing are due to this. The key thing to realize is
that
> vector shuffle combining is *much* more powerful when we think all of
these> are legal, and so we combine away bad shuffles that would trigger domain
> crosses.
>
> All of my benchmarks have come back performance neutral overall with a few
> benchmarks improving. However, there may be some regressions that folks
want> to track down first. I'd really like to get those reported and
prioritize
> among the vector shuffle work so we can nuke several *thousand* lines of
> code from X86ISelLowering.cpp. =D
>
> Thanks!
> -Chandler
>
>
> PS: If you're feeling adventurous, the next big mode flip flag I want
to
see> changed is -x86-experimental-vector-widening-legalization, but this is a
> much more deep change to the entire vector legalization strategy, so I
want> to do it second and separately.
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150128/611767ed/attachment.html>

Chandler Carruth

2015-Jan-29 00:47 UTC

head link

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

On Wed, Jan 28, 2015 at 4:05 PM, Ahmed Bougacha <ahmed.bougacha at
gmail.com>
wrote:
> Hi Chandler,
>
> I've been looking at the regressions Quentin mentioned, and filed a PR
> for the most egregious one: http://llvm.org/bugs/show_bug.cgi?id=22377
>
> As for the others, I'm working on reducing them, but for now, here are
> some raw observations, in case any of it rings a bell:
>
Very cool, and thanks for the analysis!

>
>
> Another problem I'm seeing is that in some cases we can't fold
memory
> anymore:
>     vpermilps     $-0x6d, -0xXX(%rdx), %xmm2 ## xmm2 = mem[3,0,1,2]
>     vblendps      $0x1, %xmm2, %xmm0, %xmm0
> becomes:
>     vmovaps       -0xXX(%rdx), %xmm2
>     vshufps       $0x3, %xmm0, %xmm2, %xmm3 ## xmm3 = xmm2[3,0],xmm0[0,0]
>     vshufps       $-0x68, %xmm0, %xmm3, %xmm0 ## xmm0 = xmm3[0,2],xmm0[1,2]
>
>
> Also, I see differences when some loads are shuffled, that I'm a bit
> conflicted about:
>     vmovaps       -0xXX(%rbp), %xmm3
>     ...
>     vinsertps     $0xc0, %xmm4, %xmm3, %xmm5 ## xmm5 = xmm4[3],xmm3[1,2,3]
> becomes:
>     vpermilps     $-0x6d, -0xXX(%rbp), %xmm2 ## xmm2 = mem[3,0,1,2]
>     ...
>     vinsertps     $0xc0, %xmm4, %xmm2, %xmm2 ## xmm2 = xmm4[3],xmm2[1,2,3]
>
> Note that the second version does the shuffle in-place, in xmm2.
>
>
> Some are blends (har har) of those two:
>     vpermilps     $-0x6d, %xmm_mem_1, %xmm6 ## xmm6 = xmm_mem_1[3,0,1,2]
>     vpermilps     $-0x6d, -0xXX(%rax), %xmm1 ## xmm1 = mem_2[3,0,1,2]
>     vblendps      $0x1, %xmm1, %xmm6, %xmm0 ## xmm0 = xmm1[0],xmm6[1,2,3]
> becomes:
>     vmovaps       -0xXX(%rax), %xmm0 ## %xmm0 = mem_2[0,1,2,3]
>     vpermilps     $-0x6d, %xmm0, %xmm1 ## xmm1 = xmm0[3,0,1,2]
>     vshufps       $0x3, %xmm_mem_1, %xmm0, %xmm0 ## xmm0
> = xmm0[3,0],xmm_mem_1[0,0]
>     vshufps       $-0x68, %xmm_mem_1, %xmm0, %xmm0 ## xmm0
> = xmm0[0,2],xmm_mem_1[1,2]
>
>
> I also see a lot of somewhat neutral (focusing on Haswell for now)
> domain changes such as (xmm5 and 0 are initially integers, and are
> dead after the store):
>     vpshufd       $-0x5c, %xmm0, %xmm0    ## xmm0 = xmm0[0,1,2,2]
>     vpalignr      $0xc, %xmm0, %xmm5, %xmm0 ## xmm0
> = xmm0[12,13,14,15],xmm5[0,1,2,3,4,5,6,7,8,9,10,11]
>     vmovdqu       %xmm0, 0x20(%rax)
> turning into:
>     vshufps       $0x2, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[2,0],xmm5[0,0]
>     vshufps       $-0x68, %xmm5, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm5[1,2]
>     vmovups       %xmm0, 0x20(%rax)
>
All of these stem from what I think is the same core weakness of the
current algorithm: we prefer the fully general shufps+shufps 4-way
shuffle/blend far too often. Here is how I would more precisely classify
the two things missing here:

- Check if either inputs are "in place" and we can do a fast
single-input
shuffle with a fixed blend.
- Check if we can form a rotation and use palignr to finish a shuffle/blend

There may be other patterns we're missing, but these two seem to jump out
based on your analysis, and may be fairly easy to tackle.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150128/c312f99e/attachment.html>

Possibly Parallel Threads

Search for more reasonably related threads

llvm dev - Jan 2015 - [LLVMdev] RFB: Would like to flip the vector shuffle legality flag

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

[LLVMdev] RFB: Would like to flip the vector shuffle legality flag

Possibly Parallel Threads