Jim Grosbach
2014-Sep-10 19:31 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
> On Sep 9, 2014, at 3:39 PM, Chandler Carruth <chandlerc at google.com> wrote:
> 
> Awesome, thanks for all the information!
> 
> See below:
> 
> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> You have already mentioned how the new shuffle lowering is missing
> some features; for example, you explicitly said that we currently lack
> SSE4.1 blend support. Unfortunately, this seems to be one of the
> main reasons for the slowdown we are seeing.
> 
> Here is a list of what we found so far that we think is causing most
> of the slowdown:
> 1) shufps is always emitted in cases where we could emit a single
> blendps; in these cases, blendps is preferable because it has better
> reciprocal throughput (this is true on all modern Intel and AMD cpus).
> 
> Yep. I think this is actually super easy. I'll add support for blendps
> shortly.
> 
> Things get worse when it comes to lowering shuffles where the shuffle
> mask indices refer to elements from both input vectors in each lane.
> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
> a single blendps; instead it gets lowered into two shufps
> instructions.
> 
> Example:
> ;;;
> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
> 
> llc (-mcpu=corei7-avx):
>   vblendps $10, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]
> 
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
>   vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0[0,2,1,3]
> 
> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
> mask identifies a blend. At the moment the new lowering logic is very
> aggressively emitting insertps instead of cheaper blendps.
> 
> Example:
> ;;;
> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
> 
> llc (-mcpu=corei7-avx):
>   vblendps $11, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
> 
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
> 
> 3) When a shuffle performs an insert at index 0 we always generate an
> insertps, while a movss would do a better job.
> ;;;
> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>   ret <4 x float> %1
> }
> ;;;
> 
> llc (-mcpu=corei7-avx):
>   vmovss %xmm1, %xmm0, %xmm0
> 
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
> 
> So, this is hard. I think we should do this in MC after register
> allocation because movss is the worst instruction ever: it switches from
> blending with the destination to zeroing the destination when the source
> switches from a register to a memory operand. =[ I would like to not emit
> movss in the DAG *ever*, and teach the MC combine pass to run after
> register allocation (and thus spills) have been emitted. This way we can
> match both patterns: when insertps is zeroing the other lanes and the
> operand is from memory, and when insertps is blending into the other
> lanes and the operand is in a register.
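The asymmetry being described corresponds to two different IR shapes that
both want to become a movss. The sketch below, written in the style of the
examples above (it is not taken from the original message, and the function
names are made up), shows the blend-like pattern that only the register form
of movss can implement, and the load-and-zero pattern that only the memory
form can implement:

;;;
; Register form: lane 0 comes from %B, the remaining lanes are preserved
; from %A. A reg-reg (v)movss blends exactly like this.
define <4 x float> @movss_reg(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
  ret <4 x float> %1
}

; Memory form: a scalar load lands in lane 0 and the upper lanes are zeroed.
; A (v)movss with a memory operand has exactly these semantics.
; (This uses the load syntax current at the time of this thread.)
define <4 x float> @movss_mem(float* %p) {
  %x = load float* %p
  %1 = insertelement <4 x float> zeroinitializer, float %x, i32 0
  ret <4 x float> %1
}
;;;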
What MC pass? Are you using the acronym generically rather than referring
specifically to the MC layer? This sort of transform is almost certainly
better done on MachineInstr rather than MCInst.

-Jim

> Does that make sense? If so, would you be up for looking at this side of
> things? It seems nicely separable.
Chandler Carruth
2014-Sep-10 21:10 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Wed, Sep 10, 2014 at 12:31 PM, Jim Grosbach <grosbach at apple.com> wrote:

> What MC pass? Are you using the acronym generically rather than referring
> specifically to the MC layer? This sort of transform is almost certainly
> better done on MachineInstr rather than MCInst.

I fat-fingered it, I was imagining an MI combine pass.
Jim Grosbach
2014-Sep-10 22:11 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
> On Sep 10, 2014, at 2:10 PM, Chandler Carruth <chandlerc at google.com> wrote:
> 
> On Wed, Sep 10, 2014 at 12:31 PM, Jim Grosbach <grosbach at apple.com> wrote:
> What MC pass? Are you using the acronym generically rather than referring
> specifically to the MC layer? This sort of transform is almost certainly
> better done on MachineInstr rather than MCInst.
> 
> I fat-fingered it, I was imagining an MI combine pass.

OK, cool. Panic averted. A post-regalloc peephole or something like that
sounds totally reasonable. Thanks. :)

-Jim