Jim Grosbach
2014-Sep-10 19:31 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
> On Sep 9, 2014, at 3:39 PM, Chandler Carruth <chandlerc at google.com> wrote:
> 
> Awesome, thanks for all the information!
> 
> See below:
> 
> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> You have already mentioned how the new shuffle lowering is missing
> some features; for example, you explicitly said that we currently lack
> SSE4.1 blend support. Unfortunately, this seems to be one of the
> main reasons for the slowdown we are seeing.
> 
> Here is a list of what we found so far that we think is causing most
> of the slowdown:
> 1) shufps is always emitted in cases where we could emit a single
> blendps; in these cases, blendps is preferable because it has better
> reciprocal throughput (this is true on all modern Intel and AMD cpus).
> 
> Yep. I think this is actually super easy. I'll add support for blendps
> shortly.
> 
> Things get worse when it comes to lowering shuffles where the shuffle
> mask indices refer to elements from both input vectors in each lane.
> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
> a single blendps; instead it gets lowered into two shufps
> instructions.
> 
> Example:
> ;;;
> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
> 
> llc (-mcpu=corei7-avx):
>   vblendps $10, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]
> 
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
>   vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0[0,2,1,3]
> 
> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
> mask identifies a blend. At the moment the new lowering logic is very
> aggressively emitting insertps instead of cheaper blendps.
> 
> Example:
> ;;;
> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
> 
> llc (-mcpu=corei7-avx):
>   vblendps $11, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
> 
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
> 
> 3) When a shuffle performs an insert at index 0 we always generate an
> insertps, while a movss would do a better job.
> ;;;
> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>   ret <4 x float> %1
> }
> ;;;
> 
> llc (-mcpu=corei7-avx):
>   vmovss %xmm1, %xmm0, %xmm0
> 
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
> 
> So, this is hard. I think we should do this in MC after register
> allocation because movss is the worst instruction ever: it switches from
> blending with the destination to zeroing the destination when the source
> switches from a register to a memory operand. =[ I would like to not emit
> movss in the DAG *ever*, and teach the MC combine pass to run after
> register allocation (and thus spills) have been emitted. This way we can
> match both patterns: when insertps is zeroing the other lanes and the
> operand is from memory, and when insertps is blending into the other
> lanes and the operand is in a register.
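The asymmetry being described corresponds to two different IR shapes that
both want to become a movss. The sketch below, written in the style of the
examples above (it is not taken from the original message, and the function
names are made up), shows the blend-like pattern that only the register form
of movss can implement, and the load-and-zero pattern that only the memory
form can implement:

;;;
; Register form: lane 0 comes from %B, the remaining lanes are preserved
; from %A. A reg-reg (v)movss blends exactly like this.
define <4 x float> @movss_reg(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
  ret <4 x float> %1
}

; Memory form: a scalar load lands in lane 0 and the upper lanes are zeroed.
; A (v)movss with a memory operand has exactly these semantics.
; (This uses the load syntax current at the time of this thread.)
define <4 x float> @movss_mem(float* %p) {
  %x = load float* %p
  %1 = insertelement <4 x float> zeroinitializer, float %x, i32 0
  ret <4 x float> %1
}
;;;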
What MC pass? Are you using the acronym generically rather than referring
specifically to the MC layer? This sort of transform is almost certainly
better done on MachineInstr rather than MCInst.

-Jim

> Does that make sense? If so, would you be up for looking at this side of
> things? It seems nicely separable.
Chandler Carruth
2014-Sep-10 21:10 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Wed, Sep 10, 2014 at 12:31 PM, Jim Grosbach <grosbach at apple.com> wrote:

> What MC pass? Are you using the acronym generically rather than referring
> specifically to the MC layer? This sort of transform is almost certainly
> better done on MachineInstr rather than MCInst.

I fat-fingered it, I was imagining an MI combine pass.
Jim Grosbach
2014-Sep-10 22:11 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
> On Sep 10, 2014, at 2:10 PM, Chandler Carruth <chandlerc at google.com> wrote:
> 
> On Wed, Sep 10, 2014 at 12:31 PM, Jim Grosbach <grosbach at apple.com> wrote:
> What MC pass? Are you using the acronym generically rather than referring
> specifically to the MC layer? This sort of transform is almost certainly
> better done on MachineInstr rather than MCInst.
> 
> I fat-fingered it, I was imagining an MI combine pass.

OK, cool. Panic averted. A post-regalloc peephole or something like that
sounds totally reasonable. Thanks. :)

-Jim