Andrea Di Biagio
2014-Sep-10 10:36 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <chandlerc at google.com> wrote:
> Awesome, thanks for all the information!
>
> See below:
>
> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>
>> You have already mentioned how the new shuffle lowering is missing
>> some features; for example, you explicitly said that we currently lack
>> SSE4.1 blend support. Unfortunately, this seems to be one of the
>> main reasons for the slowdown we are seeing.
>>
>> Here is a list of what we found so far that we think is causing most
>> of the slowdown:
>> 1) shufps is always emitted in cases where we could emit a single
>> blendps; in these cases, blendps is preferable because it has better
>> reciprocal throughput (this is true on all modern Intel and AMD cpus).
>
> Yep. I think this is actually super easy. I'll add support for blendps
> shortly.

Thanks Chandler!

>> 3) When a shuffle performs an insert at index 0 we always generate an
>> insertps, while a movss would do a better job.
>> ;;;
>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>   ret <4 x float> %1
>> }
>> ;;;
>>
>> llc (-mcpu=corei7-avx):
>>   vmovss %xmm1, %xmm0, %xmm0
>>
>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>
> So, this is hard. I think we should do this in MC after register allocation
> because movss is the worst instruction ever: it switches from blending with
> the destination to zeroing the destination when the source switches from a
> register to a memory operand. =[ I would like to not emit movss in the DAG
> *ever*, and teach the MC combine pass to run after register allocation (and
> thus spills) have been emitted. This way we can match both patterns: when
> insertps is zeroing the other lanes and the operand is from memory, and when
> insertps is blending into the other lanes and the operand is in a register.
>
> Does that make sense? If so, would you be up for looking at this side of
> things? It seems nicely separable.

I think it is a good idea and it makes sense to me.
I will start investigating this and see what can be done.

Cheers,
Andrea
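For reference, a minimal IR sketch of the lane-preserving, two-input shuffle described in point 1: every result element stays in its lane and merely chooses between the two sources, so on SSE4.1 it can be lowered to a single blendps. This is an illustrative example (the function name and mask are chosen here for clarity), not a test case taken from the thread:

;;;
define <4 x float> @blend(<4 x float> %A, <4 x float> %B) {
  ; Elements 1 and 2 come from %B, elements 0 and 3 from %A; no element
  ; changes lane, which is exactly the blendps pattern (imm8 = 0b0110).
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 6, i32 3>
  ret <4 x float> %1
}
;;;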
Chandler Carruth
2014-Sep-15 12:57 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Andrea, Quentin:

Ok, everything for blendps, insertps, movddup, movsldup, movshdup, unpcklps,
and unpckhps is committed and should generally be working. I've not tested
it *super* thoroughly (will do this ASAP), so if you run into something
fishy, don't burn lots of time on it.

I've also fixed a number of issues I found in the nightly test suite and
things like gcc-loops. I think there are still a couple of regressions I
spotted in the nightly test suite, but I haven't gotten to them yet.

I've got very rudimentary support for pblendw finished and committed. There
is a much more fundamental change that is really needed for pblendw support
though -- currently, the blend lowering strategy assumes this instruction
doesn't exist and thus picks a deeply wrong strategy in some cases... Not
sure how much this is even relevant though.

Anyways, it's almost certainly useful to look into any non-test-suite
benchmarks you have, or to run the benchmarks on non-Intel hardware. Let me
know how it goes! So far, with the fixes I've landed recently, I'm seeing
more improvements than regressions on the nightly test suite. =]

-Chandler
Andrea Di Biagio
2014-Sep-15 16:03 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Mon, Sep 15, 2014 at 1:57 PM, Chandler Carruth <chandlerc at google.com> wrote:
> Andrea, Quentin:
>
> Ok, everything for blendps, insertps, movddup, movsldup, movshdup, unpcklps,
> and unpckhps is committed and should generally be working. I've not tested
> it *super* thoroughly (will do this ASAP), so if you run into something
> fishy, don't burn lots of time on it.

Ok.

> I've also fixed a number of issues I found in the nightly test suite and
> things like gcc-loops. I think there are still a couple of regressions I
> spotted in the nightly test suite, but I haven't gotten to them yet.
>
> I've got very rudimentary support for pblendw finished and committed. There
> is a much more fundamental change that is really needed for pblendw support
> though -- currently, the blend lowering strategy assumes this instruction
> doesn't exist and thus picks a deeply wrong strategy in some cases... Not
> sure how much this is even relevant though.
>
> Anyways, it's almost certainly useful to look into any non-test-suite
> benchmarks you have, or to run the benchmarks on non-Intel hardware. Let me
> know how it goes! So far, with the fixes I've landed recently, I'm seeing
> more improvements than regressions on the nightly test suite. =]

Cool!
I'll have a look at it. I will let you know how it goes.
Thanks for working on this :-).

-Andrea
Quentin Colombet
2014-Sep-17 18:10 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

Here is a new test case.
With the new lowering, we fail to fold a load into the shuffle.

To reproduce:
llc -x86-experimental-vector-shuffle-lowering=true missing_folding.ll
llc -x86-experimental-vector-shuffle-lowering=false missing_folding.ll

-Quentin

On Sep 17, 2014, at 10:28 AM, Quentin Colombet <qcolombet at apple.com> wrote:

> Hi Chandler,
>
> I saw regressions in our internal testing. Some of them are avx/avx2 specific.
>
> Should I send reduced test cases for those, or is it something you haven't
> looked at yet and is thus expected?
>
> Anyway, here is the biggest offender. This is avx-specific.
>
> To reproduce:
> llc -x86-experimental-vector-shuffle-lowering=true -mattr=+avx avx_test_case.ll
> llc -x86-experimental-vector-shuffle-lowering=false -mattr=+avx avx_test_case.ll
>
> I'll send more test cases (non-avx specific first) as I reduce the regressions.
>
> Thanks,
> -Quentin
> <avx_test_case.ll>
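The attached missing_folding.ll is not reproduced in the archive. As a purely hypothetical sketch of the kind of IR that exercises load folding into a shuffle (not the actual attachment; the function name and mask are invented, and the typed-pointer load syntax of the time is used):

;;;
define <4 x float> @fold_load(<4 x float>* %p) {
  ; A 16-byte aligned load with a single use, feeding a one-input shuffle.
  ; The hope is that the load is folded into the shuffle's memory operand,
  ; e.g. something like "vpshufd $imm, (%rdi), %xmm0", instead of a separate
  ; vector load followed by a register-to-register shuffle.
  %v = load <4 x float>* %p, align 16
  %s = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 1, i32 1, i32 2, i32 3>
  ret <4 x float> %s
}
;;;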
I'll add support for blendps >>>>> shortly. >>>> >>>> Thanks Chandler! >>>> >>>>> >>>>>> 3) When a shuffle performs an insert at index 0 we always generate an >>>>>> insertps, while a movss would do a better job. >>>>>> ;;; >>>>>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) { >>>>>> %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, >>>>>> i32 1, i32 2, i32 3> >>>>>> ret <4 x float> %1 >>>>>> } >>>>>> ;;; >>>>>> >>>>>> llc (-mcpu=corei7-avx): >>>>>> vmovss %xmm1, %xmm0, %xmm0 >>>>>> >>>>>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx): >>>>>> vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3] >>>>> >>>>> >>>>> So, this is hard. I think we should do this in MC after register >>>>> allocation >>>>> because movss is the worst instruction ever: it switches from blending >>>>> with >>>>> the destination to zeroing the destination when the source switches from >>>>> a >>>>> register to a memory operand. =[ I would like to not emit movss in the >>>>> DAG >>>>> *ever*, and teach the MC combine pass to run after register allocation >>>>> (and >>>>> thus spills) have been emitted. This way we can match both patterns: >>>>> when >>>>> insertps is zeroing the other lanes and the operand is from memory, and >>>>> when >>>>> insertps is blending into the other lanes and the operand is in a >>>>> register. >>>>> >>>>> Does that make sense? If so, would you be up for looking at this side of >>>>> things? It seems nicely separable. >>>> >>>> I think it is a good idea and it makes sense to me. >>>> I will start investigating on this and see what can be done. >>>> >>>> Cheers, >>>> Andrea > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140917/63f774cd/attachment.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: missing_folding.ll Type: application/octet-stream Size: 1132 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140917/63f774cd/attachment.obj> -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140917/63f774cd/attachment-0001.html>
Quentin Colombet
2014-Sep-17 19:51 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

Yet another test case :).
We use two shuffles instead of a single palignr.

To reproduce:
llc -x86-experimental-vector-shuffle-lowering=true missing_palign.ll -mcpu=core2
llc -x86-experimental-vector-shuffle-lowering=false missing_palign.ll -mcpu=core2

You can replace -mcpu=core2 with -mattr=+ssse3.

Q.
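missing_palign.ll itself is not included in the archive. As an illustrative stand-in (not the actual attachment), a shuffle that selects a contiguous window out of the concatenation of its two inputs is the classic palignr pattern:

;;;
define <8 x i16> @align(<8 x i16> %a, <8 x i16> %b) {
  ; Elements 1..8 of the 16-element concatenation of %a and %b form a
  ; contiguous, byte-aligned window, i.e. a single "palignr $2" on SSSE3.
  %r = shufflevector <8 x i16> %a, <8 x i16> %b, <8 x i32> <i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8>
  ret <8 x i16> %r
}
;;;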
Chandler Carruth
2014-Sep-17 20:30 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Wed, Sep 17, 2014 at 12:51 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> We use two shuffles instead of a single palignr.

Doh! I just forgot to teach it about palignr... This one should at least be easy.
Quentin Colombet
2014-Sep-18 00:18 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

Here are a few more test cases. I've ordered them from the hottest to the coldest.

To reproduce:
llc <test case> -x86-experimental-vector-shuffle-lowering=<true | false> [specific feature]

1. avx2_vperm.ll (avx2)
We use a sequence of extract, 2 shuffles, and insert instead of a vperm when avx2 is set.

2. avx_blend.ll (avx)
Instead of using one big blend, we use 2 extracts, one small blend, and one insert.

3. avx2_extract2perm.ll (avx2)
We use a sequence of two instructions, extract and unpck, instead of one perm.

4. pxor.ll (none)
Instead of using pxor to set a register to zero, we use a sequence composed of xorpd and a shuffle.

5. sse4.1_pmovzxwd.ll (sse4.1)
Instead of using a single pmovzxwd, we use a movq followed by an unpck.

If you prefer, I can file PRs.

Cheers,
-Quentin
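As an illustrative sketch of case 5 (hypothetical, not the actual sse4.1_pmovzxwd.ll): zero-extending four 16-bit lanes loaded from memory is the pattern that should select a single pmovzxwd on SSE4.1, whereas the reported codegen materialized the vector with a movq and then interleaved it with zeros via an unpck. The typed-pointer load syntax of the time is used:

;;;
define <4 x i32> @zext_4xi16(<4 x i16>* %p) {
  ; Expected on SSE4.1: pmovzxwd (%rdi), %xmm0
  %v = load <4 x i16>* %p, align 8
  %z = zext <4 x i16> %v to <4 x i32>
  ret <4 x i32> %z
}
;;;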
Quentin Colombet
2014-Sep-19 18:53 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Thanks Chandler!

Here are two more test cases :). I kept the format of my previous email.

1. avx_unpck.ll (avx)
Instead of issuing a single unpck, we use a sequence of extracts, inserts, shuffles, and blends.

2. none_useless_shuffle.ll (none)
Instead of using a single move to materialize a zero-extended constant into a vector register, we explicitly zero a vector register and use a shuffle.

Cheers,
-Quentin

On Sep 18, 2014, at 2:13 AM, Chandler Carruth <chandlerc at google.com> wrote:

> As of r218038 we should get palignr for all integer shuffles. That fixes the
> test case you reduced for me. If you have any other regressions that point to
> palignr, I'd be especially interested to have an actual test case. As I noted
> in my commit log, there seems to be a clear place where using this could be
> faster, but it introduces domain crossing. I don't really have a good model
> for the cost there and so am hesitant to go down that route without good
> evidence of the need.
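A hypothetical sketch of case 1 (not the actual avx_unpck.ll attachment): an in-lane interleave of the low elements of two 256-bit vectors is exactly the vunpcklps pattern, so the whole shuffle should be a single instruction on AVX rather than an extract/insert/blend sequence:

;;;
define <8 x float> @unpck_lo(<8 x float> %a, <8 x float> %b) {
  ; Per 128-bit lane, interleave the low elements of %a and %b;
  ; expected on AVX: a single vunpcklps %ymm1, %ymm0, %ymm0.
  %r = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 4, i32 12, i32 5, i32 13>
  ret <8 x float> %r
}
;;;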
Andrea Di Biagio
2014-Sep-19 20:22 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

I have tested the new shuffle lowering on an AMD Jaguar cpu (which is
AVX but not AVX2).

On this particular target, there is a delay when output data from an
execution unit is used as input to an execution unit of a different
cluster. For example, there are 6 execution units which are divided
into 3 execution clusters: Float (FPM, FPA), Vector Integer (MMXA,
MMXB, IMM), and Store (STC). Moving data between clusters costs an
additional 1-cycle latency penalty.
Your new shuffle lowering algorithm is very good at keeping the
computation inside clusters. This is an improvement with respect to
the "old" shuffle lowering algorithm.

I haven't observed any significant regression in our internal codebase.
In one particular case I observed a slowdown (around 1%); here is what
I found when investigating that slowdown.

1. With the new shuffle lowering, there is one case where we end up
producing the following sequence:
  vmovss .LCPxx(%rip), %xmm1
  vxorps %xmm0, %xmm0, %xmm0
  vblendps $1, %xmm1, %xmm0, %xmm0

Before, we used to generate a simpler:
  vmovss .LCPxx(%rip), %xmm1

In this particular case, the 'vblendps' is redundant since the vmovss
would already zero the upper bits in %xmm1. I am not sure why we get this
poor codegen with your new shuffle lowering. I will investigate this bug
further (maybe we no longer trigger some ISel patterns?) and I will try to
give you a small reproducer for this particular case.

2. There are cases where we no longer fold a vector load into one of
the operands of a shuffle.
This is an example:

  vmovaps 320(%rsp), %xmm0
  vshufps $-27, %xmm0, %xmm0, %xmm0 # %xmm0 = %xmm0[1,1,2,3]

Before, we used to emit the following sequence:
  # 16-byte Folded reload.
  vpshufd $1, 320(%rsp), %xmm0 # %xmm0 = mem[1,0,0,0]

Note: the reason why the shuffle masks are different but still valid
is that the upper bits in %xmm0 are unused. Later on, the code uses
register %xmm0 in a 'vcvtss2sd' instruction; only the lower 32 bits of
%xmm0 have a meaning in this context.
As for 1., I'll try to create a small reproducer.

3. When zero extending 2 packed 32-bit integers, we should try to
emit a vpmovzxdq.
Example:
  vmovq 20(%rbx), %xmm0
  vpshufd $80, %xmm0, %xmm0 # %xmm0 = %xmm0[0,0,1,1]

Before:
  vpmovzxdq 20(%rbx), %xmm0

4. We no longer emit a simpler 'vmovq' in the following case:
  vxorpd %xmm4, %xmm4, %xmm4
  vblendpd $2, %xmm4, %xmm2, %xmm4 # %xmm4 = %xmm2[0],%xmm4[1]

Before, we used to generate:
  vmovq %xmm2, %xmm4

Before, the vmovq implicitly zero-extended the quadword in %xmm2 to 128
bits. Now we always do this with a vxorpd+vblendpd.

As I said, I will try to create a smaller reproducer for each of the
problems I found.
I hope this helps. I will keep testing.

Thanks,
Andrea
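As a purely illustrative reproducer shape for issue 1 (an assumption about the underlying pattern, not Andrea's actual code): inserting a loaded scalar into an otherwise-zero vector is the case where a lone (v)movss from memory already zeroes the upper lanes, making an extra vxorps+vblendps redundant. Typed-pointer load syntax of the era:

;;;
define <4 x float> @scalar_into_zero(float* %p) {
  ; Expected: a single vmovss (%rdi), %xmm0; movss from memory zeroes lanes 1-3.
  %f = load float* %p
  %v = insertelement <4 x float> zeroinitializer, float %f, i32 0
  ret <4 x float> %v
}
;;;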
Quentin Colombet
2014-Sep-19 20:36 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Andrea,

I think most if not all the regressions are covered by the previous test cases
I've provided. Please double check if you want to avoid reducing them :).

On Sep 19, 2014, at 1:22 PM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> I haven't observed any significant regression in our internal codebase.
> In one particular case I observed a slowdown (around 1%); here is what
> I found when investigating that slowdown.
>
> 1. With the new shuffle lowering, there is one case where we end up
> producing the following sequence:
>   vmovss .LCPxx(%rip), %xmm1
>   vxorps %xmm0, %xmm0, %xmm0
>   vblendps $1, %xmm1, %xmm0, %xmm0
>
> Before, we used to generate a simpler:
>   vmovss .LCPxx(%rip), %xmm1
>
> In this particular case, the 'vblendps' is redundant since the vmovss
> would already zero the upper bits in %xmm1. I am not sure why we get this
> poor codegen with your new shuffle lowering. I will investigate this bug
> further (maybe we no longer trigger some ISel patterns?) and I will try to
> give you a small reproducer for this particular case.

I think it should already be covered by one of the test cases I provided:
none_useless_shuffle.ll

> 2. There are cases where we no longer fold a vector load into one of
> the operands of a shuffle.
> This is an example:
>
>   vmovaps 320(%rsp), %xmm0
>   vshufps $-27, %xmm0, %xmm0, %xmm0 # %xmm0 = %xmm0[1,1,2,3]
>
> Before, we used to emit the following sequence:
>   # 16-byte Folded reload.
>   vpshufd $1, 320(%rsp), %xmm0 # %xmm0 = mem[1,0,0,0]
>
> Note: the reason why the shuffle masks are different but still valid
> is that the upper bits in %xmm0 are unused. Later on, the code uses
> register %xmm0 in a 'vcvtss2sd' instruction; only the lower 32 bits of
> %xmm0 have a meaning in this context.
> As for 1., I'll try to create a small reproducer.

Same here, I think this is already covered by:
missing_folding.ll

> 3. When zero extending 2 packed 32-bit integers, we should try to
> emit a vpmovzxdq.
> Example:
>   vmovq 20(%rbx), %xmm0
>   vpshufd $80, %xmm0, %xmm0 # %xmm0 = %xmm0[0,0,1,1]
>
> Before:
>   vpmovzxdq 20(%rbx), %xmm0

Probably the same logic as:
sse4.1_pmovzxwd.ll
But you can double check it.

> 4. We no longer emit a simpler 'vmovq' in the following case:
>   vxorpd %xmm4, %xmm4, %xmm4
>   vblendpd $2, %xmm4, %xmm2, %xmm4 # %xmm4 = %xmm2[0],%xmm4[1]
>
> Before, we used to generate:
>   vmovq %xmm2, %xmm4
>
> Before, the vmovq implicitly zero-extended the quadword in %xmm2 to 128
> bits. Now we always do this with a vxorpd+vblendpd.

Probably the same as:
none_useless_shuffle.ll

Cheers,
Q.
Chandler Carruth
2014-Sep-20 05:15 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
After adding some serious ninja-ry to the new shuffle lowering...

On Fri, Sep 19, 2014 at 11:53 AM, Quentin Colombet <qcolombet at apple.com> wrote:
> 2. none_useless_shuffle.ll (none)
> Instead of using a single move to materialize a zero-extended constant
> into a vector register, we explicitly zero a vector register and use a
> shuffle.

... this test case is fixed, as is your 'pxor.ll' test case from earlier, and
the 'movss' test cases from Andrea earlier in the thread (I suspect). Turns out
that there is a trick that we can use in the existing tables to get most of the
memory-operand-movss optimizations.

I think this is all of the non-avx-specific issues raised thus far... One of
the issues isn't avx-specific but can only be solved with avx.

Anyways, I'll look into some of the AVX issues next.
Quentin Colombet
2014-Sep-29 16:33 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

On Sep 29, 2014, at 6:05 AM, Chandler Carruth <chandlerc at google.com> wrote:

> On Tue, Sep 23, 2014 at 4:28 AM, Chandler Carruth <chandlerc at google.com> wrote:
>> AVX2 is still in-flight.
>
> AVX2 is pretty much done.
>
> All of the AVX and AVX2 lowering has now been heavily fuzz tested (a few
> million test cases and counting). I believe it is correct.
>
> I've added the basic framework for AVX-512. Nothing interesting is
> implemented there, mostly because I think there are still very big unanswered
> questions about how AVX-512 should work. For example, it would be good to
> lower with index-destructive vs. table-destructive shuffles based on the
> number of uses, but that isn't really possible today. Even better would be to
> actually respect any loop structure or other invariant properties.
>
> There are still plenty of performance gains to be had in AVX or AVX2
> (broadcast support, work to combine away intermediate shuffling such as can
> be seen in the v32i8 test cases with interleaved unpacks, etc. etc.).
>
> However, I think essentially all of the test cases (other than the broadcast
> and shift test cases) have been fixed. I'd really like to enable this and let
> folks submit patches for the few remaining cases that impact them
> significantly.

Sounds good to me.
Just keep the old path until we are done with the triage of regressions.
Indeed, with r218454 I am seeing a few regressions (working on test cases),
but I do not think those should hold up your big refactoring.

Thanks for all your work.
-Quentin

> As far as I can tell, the new code paths offer very significant advantages
> for the hardware folks have today, with only a few downsides. While they are
> less implemented for AVX-512 than the current code, I don't really think that
> should be the priority.
>
> Are there any remaining objections?
> -Chandler
Chandler Carruth
2014-Sep-30 05:47 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Awesome. I'll send a new email to the list shortly as a heads-up and then start
switching the default over.

On Mon, Sep 29, 2014 at 9:33 AM, Quentin Colombet <qcolombet at apple.com> wrote:
> Just keep the old path until we are done with the triage of regressions.

Sure, as soon as you're done triaging, let me know. I'm eager to delete several
thousand lines of code to make up for adding so much. =]

> Indeed, with r218454 I am seeing a few regressions (working on test cases),
> but I do not think those should hold up your big refactoring.

Excellent. I would recommend filing PRs with the remaining test cases so that
others can also take a look.

Also, I know you mentioned seeing some AVX failures. There was indeed a bad
patch of mine (r218600) which I've reverted. Hope that fixes things.
Quentin Colombet
2014-Oct-02 20:41 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

I've filed a few PRs regarding the latest regressions I found.
Here are the links if you want the details:
http://llvm.org/bugs/show_bug.cgi?id=21137
http://llvm.org/bugs/show_bug.cgi?id=21138
http://llvm.org/bugs/show_bug.cgi?id=21139
http://llvm.org/bugs/show_bug.cgi?id=21140

I've already reported the first one a while back.
This is just FYI, I do not expect you to handle all the work :).

Cheers,
-Quentin

> On Oct 1, 2014, at 11:24 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>
> Hi Chandler,
>
> Not sure how important this is, but I found a minor regression
> with the new shuffle lowering.
> Here is a reproducible test case:
>
> ;;
> define <4 x i32> @test(<4 x i32> %V) {
>   %1 = shufflevector <4 x i32> %V, <4 x i32> <i32 0, i32 0, i32 0, i32 0>, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
>   ret <4 x i32> %1
> }
> ;;
>
> $ llc -mcpu=corei7-avx -o -
>   vmovq %xmm0, %xmm0
>   retq
>
> $ llc -mcpu=corei7-avx -x86-experimental-vector-shuffle-lowering -o -
>   vpxor %xmm1, %xmm1, %xmm1
>   vpunpcklqdq %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[0]
>   retq
>
> If we know that the upper 64 bits of the destination register are
> zero, we can try to emit a simpler vmovq instead of a vxor+vunpck.
>
> As I said, this is a minor issue.
> I just wanted to post this finding so that we don't forget about it.
>
> Cheers,
> Andrea
>
> On Wed, Oct 1, 2014 at 9:23 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>> On Wed, Oct 1, 2014 at 1:52 AM, Chandler Carruth <chandlerc at google.com> wrote:
>>> This has been added in r218724.
>>
>> Thanks Chandler!
>>
>>> Based on the feedback here and from Quentin, I'm going to email the list
>>> shortly with a heads-up, and then flip the default over to the new shuffle
>>> lowering.
>>
>> Nice.
>> Again, thanks for working on this!
>>
>> -Andrea
>>
>>> On Mon, Sep 29, 2014 at 10:48 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>>
>>>> Wow. Somehow, I forgot about vbroadcast and vpbroadcast. =[ Sorry about
>>>> that. I'll fix those.
>>>>
>>>> On Fri, Sep 26, 2014 at 3:39 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>>>>
>>>>> Hi Chandler,
>>>>>
>>>>> Here is another test.
>>>>>
>>>>> When looking at the AVX codegen, I noticed that, when using the new
>>>>> shuffle lowering, we no longer emit a single vbroadcastss in the case
>>>>> where the shuffle performs a splat of a scalar float loaded from
>>>>> memory.
>>>>>
>>>>> For example:
>>>>> (with -mcpu=corei7-avx -x86-experimental-vector-shuffle-lowering)
>>>>>   vmovss (%rdi), %xmm0
>>>>>   vpermilps $0, %xmm0, %xmm0 # xmm0 = xmm0[0,0,0,0]
>>>>>
>>>>> Instead of:
>>>>> (with -mcpu=corei7-avx)
>>>>>   vbroadcastss (%rdi), %xmm0
>>>>>
>>>>> I have attached a small reproducer for it.
>>>>>
>>>>> Basically, the old shuffle lowering logic calls the function
>>>>> 'NormalizeVectorShuffle' to handle shuffles that perform a splat
>>>>> operation.
>>>>> On AVX, 'NormalizeVectorShuffle' tries to lower a splat where the splat
>>>>> value comes from a load into an X86ISD::VBROADCAST dag node.
>>>>> Later on, during instruction selection, we emit a single avx_broadcast
>>>>> for the load+splat sequence (basically, we end up folding the load into
>>>>> the operand of the vbroadcastss).
>>>>>
>>>>> What happens is that the new shuffle lowering doesn't emit a
>>>>> vbroadcast node in this case, and eventually we end up selecting the
>>>>> sequence vmovss+vpermilps.
>>>>>
>>>>> I hope this helps.
>>>>> Andrea
>>>>>
>>>>> On Tue, Sep 23, 2014 at 10:53 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>>>>
>>>>>> On Tue, Sep 23, 2014 at 2:35 PM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
>>>>>>>
>>>>>>> If you don't want to spend time on this, I'd be happy to create a
>>>>>>> candidate patch for review? I've been unclear if you were taking
>>>>>>> patches for your shuffle work prior to it becoming the default.
>>>>>>
>>>>>> While I'm happy to work on it, I'm even more happy to have patches. =D
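The reproducer Andrea attached is not included in the archive. As a hypothetical stand-in (not the actual attachment), the splat-of-a-loaded-float pattern he describes looks roughly like this in the typed-pointer load syntax of the time; with the old lowering and AVX it becomes a single vbroadcastss with the load folded in:

;;;
define <4 x float> @splat_from_mem(float* %p) {
  ; Load one float and splat it to all four lanes;
  ; expected with -mcpu=corei7-avx: vbroadcastss (%rdi), %xmm0
  %f = load float* %p
  %ins = insertelement <4 x float> undef, float %f, i32 0
  %splat = shufflevector <4 x float> %ins, <4 x float> undef, <4 x i32> zeroinitializer
  ret <4 x float> %splat
}
;;;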