Andrea Di Biagio
2014-Sep-19 20:22 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

I have tested the new shuffle lowering on an AMD Jaguar CPU (which is AVX but not AVX2). On this particular target, there is a delay when output data from an execution unit is used as input to an execution unit of a different cluster. For example, there are 6 execution units, divided into 3 execution clusters: Float (FPM, FPA), Vector Integer (MMXA, MMXB, IMM), and Store (STC). Moving data between clusters costs an additional 1-cycle latency penalty.

Your new shuffle lowering algorithm is very good at keeping the computation inside clusters. This is an improvement with respect to the "old" shuffle lowering algorithm.

I haven't observed any significant regression in our internal codebase. In one particular case I observed a slowdown (around 1%); here is what I found when investigating it.

1. With the new shuffle lowering, there is one case where we end up
   producing the following sequence:

     vmovss .LCPxx(%rip), %xmm1
     vxorps %xmm0, %xmm0, %xmm0
     vblendps $1, %xmm1, %xmm0, %xmm0

   Before, we used to generate the simpler:

     vmovss .LCPxx(%rip), %xmm1

   In this particular case the 'vblendps' is redundant, since the
   vmovss already zeroes the upper bits of %xmm1. I am not sure why we
   get this poor codegen with your new shuffle lowering. I will
   investigate this bug further (maybe we no longer trigger some ISel
   patterns?) and I will try to give you a small reproducer for this
   particular case.

2. There are cases where we no longer fold a vector load into one of
   the operands of a shuffle. This is an example:

     vmovaps 320(%rsp), %xmm0
     vshufps $-27, %xmm0, %xmm0, %xmm0  # %xmm0 = %xmm0[1,1,2,3]

   Before, we used to emit the following sequence:

     # 16-byte Folded reload.
     vpshufd $1, 320(%rsp), %xmm0       # %xmm0 = mem[1,0,0,0]

   Note: the reason why the shuffle masks are different but still
   valid is that the upper bits of %xmm0 are unused. Later on, the
   code uses %xmm0 in a 'vcvtss2sd' instruction, so only the lower
   32 bits of %xmm0 are meaningful in this context. As for 1, I'll try
   to create a small reproducer.

3. When zero-extending 2 packed 32-bit integers, we should try to emit
   a single vpmovzxdq (see the sketch in the P.S. below). Example:

     vmovq 20(%rbx), %xmm0
     vpshufd $80, %xmm0, %xmm0  # %xmm0 = %xmm0[0,0,1,1]

   Before:

     vpmovzxdq 20(%rbx), %xmm0

4. We no longer emit a simple 'vmovq' in the following case:

     vxorpd %xmm4, %xmm4, %xmm4
     vblendpd $2, %xmm4, %xmm2, %xmm4  # %xmm4 = %xmm2[0],%xmm4[1]

   Before, we used to generate:

     vmovq %xmm2, %xmm4

   The vmovq implicitly zero-extended the quadword in %xmm2 to 128
   bits; now we always do this with a vxorpd+vblendpd pair.

As I said, I will try to create a small reproducer for each of the problems I found.

I hope this helps. I will keep testing.

Thanks,
Andrea
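P.S. As a rough sketch, a reproducer for case 3 might look something like the following C (using Clang's vector extensions; the type and function names are illustrative, not from an actual test case):

    typedef unsigned int v2u32 __attribute__((vector_size(8)));
    typedef unsigned long long v2u64 __attribute__((vector_size(16)));

    /* Zero-extend two packed 32-bit integers loaded from memory.
       Ideally this lowers to a single folded-load vpmovzxdq rather
       than a vmovq followed by a vpshufd. */
    v2u64 zext_v2u32(const v2u32 *p) {
        v2u32 v = *p;
        return (v2u64){ v[0], v[1] };  /* each lane zero-extends to 64 bits */
    }

Whether the backend recognizes this particular pattern depends on how the zero-extension is modelled, so treat it only as a starting point.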
Chandler Carruth
2014-Sep-19 20:39 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
FWIW, I agree with Quentin. I'm actually confident I have test cases covering all of these.

On Fri, Sep 19, 2014 at 1:22 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> 3. When zero-extending 2 packed 32-bit integers, we should try to emit
>    a single vpmovzxdq. Example:
>
>      vmovq 20(%rbx), %xmm0
>      vpshufd $80, %xmm0, %xmm0  # %xmm0 = %xmm0[0,0,1,1]
>
>    Before:
>
>      vpmovzxdq 20(%rbx), %xmm0

This one is already fixed.
Andrea Di Biagio
2014-Sep-19 20:44 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Fri, Sep 19, 2014 at 9:39 PM, Chandler Carruth <chandlerc at google.com> wrote:
> FWIW, I agree with Quentin. I'm actually confident I have test cases
> covering all of these.

Ok then. In that case, I will keep investigating to see if I can find something more.

Cheers,
Andrea
Simon Pilgrim
2014-Sep-20 14:12 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On 19 Sep 2014, at 21:22, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> 2. There are cases where we no longer fold a vector load into one of
>    the operands of a shuffle. This is an example:
>
>      vmovaps 320(%rsp), %xmm0
>      vshufps $-27, %xmm0, %xmm0, %xmm0  # %xmm0 = %xmm0[1,1,2,3]
>
>    Before, we used to emit the following sequence:
>
>      # 16-byte Folded reload.
>      vpshufd $1, 320(%rsp), %xmm0       # %xmm0 = mem[1,0,0,0]
>
>    Note: the reason why the shuffle masks are different but still
>    valid is that the upper bits of %xmm0 are unused. Later on, the
>    code uses %xmm0 in a 'vcvtss2sd' instruction, so only the lower
>    32 bits of %xmm0 are meaningful in this context. As for 1, I'll
>    try to create a small reproducer.

Hi Andrea / Chandler / Quentin,

If AVX is available, I would expect the vpermilps/vpermilpd instructions to be used for all float/double single-vector shuffles, especially as they can deal with the folded-load case as well; this would avoid the integer/float execution-domain transfer issue of using vpshufd.

Thanks,
Simon.
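P.S. To illustrate the idea (a hypothetical sketch, not code from this thread): the shuffle from case 2 above can be written with the AVX _mm_permute_ps intrinsic, which maps to vpermilps and can take its source operand straight from memory:

    #include <immintrin.h>

    /* Shuffle mask 0xE5 = [1,1,2,3], the same mask as the
       vshufps $-27 in the example above. vpermilps stays in the
       floating-point domain and lets the load fold into the shuffle. */
    __m128 shuffle_1123(const __m128 *p) {
        return _mm_permute_ps(*p, 0xE5);  /* vpermilps $0xE5, (%rdi), %xmm0 */
    }

Whether the load actually gets folded depends on alignment and on the selection patterns, of course.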
Chandler Carruth
2014-Sep-20 18:44 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Sat, Sep 20, 2014 at 7:12 AM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
> Hi Andrea / Chandler / Quentin,
>
> If AVX is available, I would expect the vpermilps/vpermilpd
> instructions to be used for all float/double single-vector shuffles,
> especially as they can deal with the folded-load case as well; this
> would avoid the integer/float execution-domain transfer issue of
> using vpshufd.

Yes, this is the obvious solution to folding memory loads. It just isn't implemented yet. Well, actually it is, but I haven't finished writing tests for it. =]