Sean Silva
2014-Sep-09 20:47 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
On Tue, Sep 9, 2014 at 12:53 PM, Quentin Colombet <qcolombet at apple.com> wrote:

> Hi Chandler,
>
> I had observed some improvements and regressions with the new lowering.
>
> Here are the numbers for an Ivy Bridge machine fixed at 2900MHz.
>
> I’ll look into the regressions to provide test cases.
>
> ** Numbers **
>
> Smaller is better. Only reported tests that run for at least one second.
> Reference is the default lowering, Test is the new lowering.
> The Os numbers are overall neutral, but the O3 numbers mainly expose
> regressions.
>
> Note: I can attach the raw numbers if you want.

That would be great. Please do.

-- Sean Silva

> * Os *
> Benchmark_ID                             Reference  Test     Expansion  Percent
> -------------------------------------------------------------------------------
> External/Nurbs/nurbs                     2.3302     2.3122   0.99       -1%
> External/SPEC/CFP2000/183.equake/183.eq  3.2606     3.2419   0.99       -1%
> External/SPEC/CFP2006/447.dealII/447.de  16.4638    16.1313  0.98       -2%
> External/SPEC/CFP2006/470.lbm/470.lbm    2.0159     1.9931   0.99       -1%
> External/SPEC/CINT2000/164.gzip/164.gzi  8.7611     8.6981   0.99       -1%
> External/SPEC/CINT2006/456.hmmer/456.hm  2.5674     2.5819   1.01       +1%
> External/SPEC/CINT2006/462.libquantum/4  1.2924     1.347    1.04       +4%
> MultiSource/Benchmarks/TSVC/CrossingThr  2.4703     2.4852   1.01       +1%
> MultiSource/Benchmarks/TSVC/LoopRerolli  2.6611     2.5668   0.96       -4%
> MultiSource/Benchmarks/mafft/pairlocala  24.676     24.5372  0.99       -1%
> SingleSource/Benchmarks/Adobe-C++/simpl  1.0579     1.1048   1.04       +4%
> SingleSource/Benchmarks/Linpack/linpack  4.2817     4.3298   1.01       +1%
> SingleSource/Benchmarks/Misc-C++/stepan  4.1821     4.226    1.01       +1%
> SingleSource/Benchmarks/Misc/oourafft    3.0305     3.1777   1.05       +5%
> -------------------------------------------------------------------------------
> Min (14)                                 -          -        0.96       -
> -------------------------------------------------------------------------------
> Max (14)                                 -          -        1.05       -
> -------------------------------------------------------------------------------
> Sum (14)                                 79         79       1          +0%
> -------------------------------------------------------------------------------
> A.Mean (14)                              -          -        1.01       +1%
> -------------------------------------------------------------------------------
> G.Mean 2 (14)                            -          -        1.01       +1%
> -------------------------------------------------------------------------------
>
> * O3 *
> Benchmark_ID                             Reference  Test     Expansion  Percent
> -------------------------------------------------------------------------------
> External/Nurbs/nurbs                     2.2322     2.2131   0.99       -1%
> External/Povray/povray                   2.2638     2.2762   1.01       +1%
> External/SPEC/CFP2000/177.mesa/177.mesa  1.6675     1.6828   1.01       +1%
> External/SPEC/CFP2000/188.ammp/188.ammp  10.9309    11.1191  1.02       +2%
> External/SPEC/CFP2006/433.milc/433.milc  6.9214     7.1696   1.04       +4%
> External/SPEC/CINT2000/164.gzip/164.gzi  8.5327     8.8114   1.03       +3%
> External/SPEC/CINT2000/186.crafty/186.c  4.1266     4.16     1.01       +1%
> External/SPEC/CINT2000/253.perlbmk/253.  5.6991     5.7309   1.01       +1%
> External/SPEC/CINT2000/256.bzip2/256.bz  6.7917     6.8763   1.01       +1%
> External/SPEC/CINT2006/400.perlbench/40  6.243      6.1464   0.98       -2%
> External/SPEC/CINT2006/401.bzip2/401.bz  2.095      2.0588   0.98       -2%
> External/SPEC/CINT2006/462.libquantum/4  1.2        1.2108   1.01       +1%
> MultiSource/Applications/SIBsim4/SIBsim  2.4547     2.5129   1.02       +2%
> MultiSource/Benchmarks/Bullet/bullet     4.1687     4.0882   0.98       -2%
> MultiSource/Benchmarks/TSVC/LinearDepen  3.0389     3.0566   1.01       +1%
> MultiSource/Benchmarks/TSVC/LinearDepen  2.1298     2.1997   1.03       +3%
> MultiSource/Benchmarks/TSVC/LoopRerolli  2.6458     2.5552   0.97       -3%
> MultiSource/Benchmarks/TSVC/Symbolics-f  1.6243     1.6612   1.02       +2%
> MultiSource/Benchmarks/mafft/pairlocala  23.8979    24.0547  1.01       +1%
> SingleSource/Benchmarks/Misc/oourafft    3.0374     3.1846   1.05       +5%
> SingleSource/Benchmarks/SmallPT/smallpt  6.5533     6.6683   1.02       +2%
> -------------------------------------------------------------------------------
> Min (21)                                 -          -        0.97       -
> -------------------------------------------------------------------------------
> Max (21)                                 -          -        1.05       -
> -------------------------------------------------------------------------------
> Sum (21)                                 108        109      1.01       -1%
> -------------------------------------------------------------------------------
> A.Mean (21)                              -          -        1.01       +1%
> -------------------------------------------------------------------------------
> G.Mean 2 (21)                            -          -        1.01       +1%
> -------------------------------------------------------------------------------
>
> Thanks,
> -Quentin
>
> On Sep 9, 2014, at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>
> Hi Chandler,
>
> Thanks for fixing the problem with the insertps mask.
>
> Generally the new shuffle lowering looks promising; however, there are
> some cases where the codegen is now worse, causing runtime performance
> regressions in some of our internal codebase.
>
> You have already mentioned how the new shuffle lowering is missing
> some features; for example, you explicitly said that we currently lack
> SSE4.1 blend support. Unfortunately, this seems to be one of the
> main reasons for the slowdown we are seeing.
>
> Here is a list of what we found so far that we think is causing most
> of the slowdown:
> 1) shufps is always emitted in cases where we could emit a single
> blendps; in these cases, blendps is preferable because it has better
> reciprocal throughput (this is true on all modern Intel and AMD cpus).
>
> Things get worse when it comes to lowering shuffles where the shuffle
> mask indices refer to elements from both input vectors in each lane.
> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
> a single blendps; instead it gets lowered into two shufps
> instructions.
>
> Example:
> ;;;
> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>   vblendps $10, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
>   vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0 = xmm0[0,2,1,3]
>
>
> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
> mask identifies a blend. At the moment the new lowering logic is very
> aggressively emitting insertps instead of cheaper blendps.
>
> Example:
> ;;;
> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>   vblendps $11, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>
>
> 3) When a shuffle performs an insert at index 0 we always generate an
> insertps, while a movss would do a better job.
> ;;;
> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>   ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>   vmovss %xmm1, %xmm0, %xmm0
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>
> I hope this is useful. We would be happy to contribute patches to
> improve some of the above cases, but we obviously know that this is
> still a work in progress, so we don't want to introduce conflicts with
> your work. Please let us know what you think.
>
> We will keep looking at this and follow up with any further findings.
>
> Thanks,
> Andrea Di Biagio
> SN Systems - Sony Computer Entertainment Inc.
>
> On Mon, Sep 8, 2014 at 6:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
> Hi Chandler,
>
> Forget about what I said.
> It seems I have some weird dependencies in my build system.
> My binaries are out-of-sync.
>
> Let me sort that out; it is likely the problem is already fixed, and I can
> resume the measurements.
>
> Sorry for the noise.
>
> Q.
>
> On Sep 8, 2014, at 9:32 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>
> On Sep 7, 2014, at 8:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
> Sure,
>
> Here is the command line:
> clang -cc1 -triple x86_64-apple-macosx -S -disable-free
> -disable-llvm-verifier -main-file-name tmp.i -mrelocation-model pic
> -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu
> core-avx-i -O3 -ferror-limit 19 -fmessage-length 114 -stack-protector 1
> -mstackrealign -fblocks -fencode-extended-block-signature
> -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics
> -vectorize-loops -vectorize-slp -mllvm
> -x86-experimental-vector-shuffle-lowering=true -o tmp.s -x cpp-output tmp.i
>
> This was with trunk 215249.
>
> I meant, r217281.
>
> Thanks,
> -Quentin
>
> <tmp.i>
>
> On Sep 6, 2014, at 4:27 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> I've run the SingleSource test suite for core-avx-i and have no failures
> here, so a preprocessed file + command line would be very useful if this
> still reproduces for you.
>
> On Sat, Sep 6, 2014 at 4:07 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> I'm having trouble reproducing this. I'm trying to get LNT to actually
> run, but manually compiling the given source file didn't reproduce it for
> me.
>
> It might have been fixed recently (although I'd be surprised if so), but
> it would help to get the actual command line for which compiling this file
> in the test suite failed.
>
> -Chandler
>
> On Fri, Sep 5, 2014 at 4:36 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
> Hi Chandler,
>
> While doing the performance measurement on an Ivy Bridge, I ran into
> compile time errors.
>
> I saw a bunch of "cannot select" in the LLVM test suite with
> -march=core-avx-i.
> E.g., SingleSource/UnitTests/Vector/SSE/sse.isamax.c is failing at O3
> -march=core-avx-i with:
> fatal error: error in backend: Cannot select: 0x7f91b99a6420: v4i32 = bitcast 0x7f91b99b0e10 [ORD=3] [ID=27]
>   0x7f91b99b0e10: v4i64 = insert_subvector 0x7f91b99a7210, 0x7f91b99a6d68, 0x7f91b99ace70 [ORD=2] [ID=25]
>     0x7f91b99a7210: v4i64 = undef [ID=15]
>     0x7f91b99a6d68: v2i64 = scalar_to_vector 0x7f91b99ab840 [ORD=2] [ID=23]
>       0x7f91b99ab840: i64 = AssertZext 0x7f91b99acc60, 0x7f91b99ac738 [ORD=2] [ID=20]
>         0x7f91b99acc60: i64,ch = CopyFromReg 0x7f91b8d52820, 0x7f91b99a3a10 [ORD=2] [ID=16]
>           0x7f91b99a3a10: i64 = Register %vreg68 [ID=1]
>     0x7f91b99ace70: i64 = Constant<0> [ID=3]
> In function: isamax0
> clang: error: clang frontend command failed with exit code 70 (use -v to
> see invocation)
> clang version 3.6.0 (215249)
> Target: x86_64-apple-darwin14.0.0
>
> For some reason, I cannot reproduce the problem with the test case that
> clang gives me using -emit-llvm. Since the source is public, I guess you
> can try to reproduce on your side.
> Indeed, if you run the test-suite with -march=core-avx-i you’ll likely
> see all those failures.
>
> Let me know if you cannot and I’ll try harder to produce a test case.
>
> Note: This is the same failure all over the place, i.e., cannot select a
> bitcast from various types to v4i32 or v4i64.
>
> Thanks,
> -Quentin
>
> On Sep 5, 2014, at 11:09 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>
> Hi Chandler,
>
> On 5 September 2014 17:38, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> On Fri, Sep 5, 2014 at 9:32 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>
> Unfortunately, another team, while doing internal testing has seen the
> new path generating illegal insertps masks. A sample here:
>
>   vinsertps $256, %xmm0, %xmm13, %xmm4 # xmm4 = xmm0[0],xmm13[1,2,3]
>   vinsertps $256, %xmm1, %xmm0, %xmm6 # xmm6 = xmm1[0],xmm0[1,2,3]
>   vinsertps $256, %xmm13, %xmm1, %xmm7 # xmm7 = xmm13[0],xmm1[1,2,3]
>   vinsertps $416, %xmm1, %xmm4, %xmm14 # xmm14 = xmm4[0,1],xmm1[2],xmm4[3]
>   vinsertps $416, %xmm13, %xmm6, %xmm13 # xmm13 = xmm6[0,1],xmm13[2],xmm6[3]
>   vinsertps $416, %xmm0, %xmm7, %xmm0 # xmm0 = xmm7[0,1],xmm0[2],xmm7[3]
>
> We'll continue to look into this and do additional testing.
>
> Interesting. Let me know if you get a test case. The insertps code path
> was added recently though and has been much less well tested. I'll start
> fuzz testing it and should hopefully uncover the bug.
>
> Here's two small test cases. Hope they are of use.
>
> Thanks,
> Rob.
>
> ------
> define <4 x float> @test(<4 x float> %xyzw, <4 x float> %abcd) {
>   %1 = extractelement <4 x float> %xyzw, i32 0
>   %2 = insertelement <4 x float> undef, float %1, i32 0
>   %3 = insertelement <4 x float> %2, float 0.000000e+00, i32 1
>   %4 = shufflevector <4 x float> %3, <4 x float> %xyzw, <4 x i32> <i32 0, i32 1, i32 6, i32 undef>
>   %5 = shufflevector <4 x float> %4, <4 x float> %abcd, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
>   ret <4 x float> %5
> }
>
> define <4 x float> @test2(<4 x float> %xyzw, <4 x float> %abcd) {
>   %1 = shufflevector <4 x float> %xyzw, <4 x float> %abcd, <4 x i32> <i32 0, i32 undef, i32 2, i32 4>
>   %2 = shufflevector <4 x float> <float undef, float 0.000000e+00, float undef, float undef>, <4 x float> %1, <4 x i32> <i32 4, i32 1, i32 6, i32 7>
>   ret <4 x float> %2
> }
>
>
> llc -march=x86-64 -mattr=+avx test.ll -o -
>
> test: # @test
>   vxorps %xmm2, %xmm2, %xmm2
>   vmovss %xmm0, %xmm2, %xmm2
>   vblendps $4, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>   vinsertps $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   retl
>
> test2: # @test2
>   vinsertps $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   vxorps %xmm1, %xmm1, %xmm1
>   vblendps $13, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>   retl
>
> llc -march=x86-64 -mattr=+avx -x86-experimental-vector-shuffle-lowering test.ll -o -
>
> test: # @test
>   vinsertps $270, %xmm0, %xmm0, %xmm2 # xmm2 = xmm0[0],zero,zero,zero
>   vinsertps $416, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>   vinsertps $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   retl
>
> test2: # @test2
>   vinsertps $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   vxorps %xmm1, %xmm1, %xmm1
>   vinsertps $336, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>   retl
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
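A note on the illegal insertps masks reported in this thread: the INSERTPS immediate is a single byte, with bits 7:6 selecting the source lane, bits 5:4 the destination lane, and bits 3:0 a zero mask. The values 256, 270, 304, 336 and 416 all exceed 255, and their low eight bits decode to exactly the lanes shown in the asm comments, which suggests a stray bit above the 8-bit field rather than a wrong lane choice. The sketch below illustrates the decoding only; the function names are hypothetical, it is not LLVM code, and treating -128..255 as the accepted range is an assumption.

```python
def decode_insertps_imm(imm):
    """Split an INSERTPS immediate into (source lane, destination lane, zero mask).

    Assumption: values in -128..255 are accepted, so that sign-extended
    spellings such as -96 are allowed. Anything else raises ValueError.
    """
    if not -128 <= imm <= 255:
        raise ValueError(f"insertps immediate {imm} does not fit in 8 bits")
    imm &= 0xFF             # fold negative spellings into the raw byte
    src = (imm >> 6) & 0x3  # bits 7:6: lane read from the source register
    dst = (imm >> 4) & 0x3  # bits 5:4: lane written in the destination
    zmask = imm & 0xF       # bits 3:0: result lanes forced to zero
    return src, dst, zmask


def describe(imm, dest="xmm0", src_reg="xmm1"):
    """List the four result lanes, one entry per lane (no [0,1] grouping)."""
    src, dst, zmask = decode_insertps_imm(imm)
    lanes = []
    for i in range(4):
        if zmask & (1 << i):
            lanes.append("zero")               # zero mask wins over the insert
        elif i == dst:
            lanes.append(f"{src_reg}[{src}]")  # the inserted element
        else:
            lanes.append(f"{dest}[{i}]")       # untouched destination lane
    return ",".join(lanes)


if __name__ == "__main__":
    for bad in (256, 270, 304, 336, 416):
        # Show what the low byte of each out-of-range value would mean.
        print(bad, "-> low byte", bad & 0xFF, "->", describe(bad & 0xFF))
```

Decoding only the low byte is a diagnostic aid here; it says nothing about how the bug was actually fixed.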
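On the blendps cases quoted in this message: a two-input shuffle mask is a blend when every lane i takes element i from one of the two inputs, and the immediate then has one bit per lane. The following is a minimal sketch of that check, assuming shufflevector numbering (indices 0..N-1 for the first input, N..2N-1 for the second, None standing in for undef). The function name is hypothetical, and operand order in the emitted instruction is not modeled.

```python
def blend_immediate(mask):
    """Return a blend immediate for a two-input shuffle mask, or None.

    Bit i of the result is set when lane i comes from the second input.
    None is returned when some lane would have to move, i.e. the mask is
    not expressible as a pure blend.
    """
    n = len(mask)
    imm = 0
    for i, m in enumerate(mask):
        if m is None or m == i:
            continue        # undef, or lane i of the first input: bit stays clear
        if m == i + n:
            imm |= 1 << i   # lane i of the second input
        else:
            return None     # element changes lanes, so this is not a blend
    return imm


if __name__ == "__main__":
    print(blend_immediate([0, 5, 2, 7]))  # mask from example 1 in the thread
    print(blend_immediate([4, 1, 2, 3]))  # mask from example 3 in the thread
    print(blend_immediate([0, 2, 1, 3]))  # a permutation, not a blend
```

For the mask <0,5,2,7> this yields 10, the same immediate the default lowering prints for vblendps in example 1.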
Quentin Colombet
2014-Sep-09 20:59 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
> On Sep 9, 2014, at 1:47 PM, Sean Silva <chisophugis at gmail.com> wrote:
>
> On Tue, Sep 9, 2014 at 12:53 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> Hi Chandler,
>
> I had observed some improvements and regressions with the new lowering.
>
> Here are the numbers for an Ivy Bridge machine fixed at 2900MHz.
>
> I’ll look into the regressions to provide test cases.
>
> ** Numbers **
>
> Smaller is better. Only reported tests that run for at least one second.
> Reference is the default lowering, Test is the new lowering.
> The Os numbers are overall neutral, but the O3 numbers mainly expose regressions.
>
> Note: I can attach the raw numbers if you want.
>
> That would be great. Please do.

Alright, here they are :).

base-perf-Ox.txt: runtime for the default lowering.
new-perf-Ox.txt: runtime for the new lowering.

Each line in those files has the following format:
<unit> <benchmark> <perf number>

The units are:
- min: Minimum of the 7 runs.
- max: Maximum of the 7 runs.
- avg: Average of the 7 runs.
- total: Total of the 7 runs.
- med: Median of the 7 runs.
- SD: Standard deviation of the 7 runs.
- SD%: Standard deviation of the 7 runs, as a percentage.

-Quentin

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: base-perf-O3.txt
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140909/268479be/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: base-perf-Os.txt
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140909/268479be/attachment-0001.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: new-perf-O3.txt
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140909/268479be/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: new-perf-Os.txt
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140909/268479be/attachment-0003.txt>
Quentin Colombet
2014-Sep-09 22:01 UTC
[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!
Hi Chandler,

Here is a test case for the biggest offender (oourafft.c).

To reproduce:
  llc -mcpu=core-avx-i -x86-experimental-vector-shuffle-lowering=true repro.ll
  llc -mcpu=core-avx-i -x86-experimental-vector-shuffle-lowering=false repro.ll

The main problem is that we miss:
  vmovsd (%rdi,%rcx,8), %xmm2
  vmovlhps %xmm2, %xmm2, %xmm2 ## xmm2 = xmm2[0,0]
=>
  vmovddup (%rdi,%rcx,8), %xmm2

I do not know how problematic the following is (the previous transformation alone seems to recover the performance), but we also miss:
  vsubpd %xmm1, %xmm0, %xmm2
  vaddpd %xmm1, %xmm0, %xmm0
  vshufpd $2, %xmm0, %xmm2, %xmm0 ## xmm0 = xmm2[0],xmm0[1]
=>
  vaddsubpd %xmm1, %xmm0, %xmm0

I’ll look into the other regressions.

Thanks,
-Quentin

> On Sep 9, 2014, at 1:59 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>
>> On Sep 9, 2014, at 1:47 PM, Sean Silva <chisophugis at gmail.com> wrote:
>>
>>
>>
>> On Tue, Sep 9, 2014 at 12:53 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>> Hi Chandler,
>>
>> I had observed some improvements and regressions with the new lowering.
>>
>> Here are the numbers for an Ivy Bridge machine fixed at 2900MHz.
>>
>> I’ll look into the regressions to provide test cases.
>>
>> ** Numbers **
>>
>> Smaller is better. Only reported tests that run for at least one second.
>> Reference is the default lowering, Test is the new lowering.
>> The Os numbers are overall neutral, but the O3 numbers mainly expose regressions.
>>
>> Note: I can attach the raw numbers if you want.
>>
>> That would be great. Please do.
>
> Alright, here they are :).
>
> base-perf-Ox.txt: runtime for the default lowering.
> new-perf-Ox.txt: runtime for the new lowering.
>
> Each line in those files has the following format:
> <unit> <benchmark> <perf number>
>
> The units are:
> - min: Minimum of the 7 runs.
> - max: Maximum of the 7 runs.
> - avg: Average of the 7 runs.
> - total: Total of the 7 runs.
> - med: Median of the 7 runs.
> - SD: Standard deviation of the 7 runs.
> - SD%: Standard deviation of the 7 runs in percentage.
>
> -Quentin
> <base-perf-O3.txt>
> <base-perf-Os.txt>
> <new-perf-O3.txt>
> <new-perf-Os.txt>
>
>>
>> -- Sean Silva
>>
>>
>> * Os *
>> Benchmark_ID                             Reference  Test      Expansion  Percent
>> -------------------------------------------------------------------------------
>> External/Nurbs/nurbs                     2.3302     2.3122    0.99       -1%
>> External/SPEC/CFP2000/183.equake/183.eq  3.2606     3.2419    0.99       -1%
>> External/SPEC/CFP2006/447.dealII/447.de  16.4638    16.1313   0.98       -2%
>> External/SPEC/CFP2006/470.lbm/470.lbm    2.0159     1.9931    0.99       -1%
>> External/SPEC/CINT2000/164.gzip/164.gzi  8.7611     8.6981    0.99       -1%
>> External/SPEC/CINT2006/456.hmmer/456.hm  2.5674     2.5819    1.01       +1%
>> External/SPEC/CINT2006/462.libquantum/4  1.2924     1.347     1.04       +4%
>> MultiSource/Benchmarks/TSVC/CrossingThr  2.4703     2.4852    1.01       +1%
>> MultiSource/Benchmarks/TSVC/LoopRerolli  2.6611     2.5668    0.96       -4%
>> MultiSource/Benchmarks/mafft/pairlocala  24.676     24.5372   0.99       -1%
>> SingleSource/Benchmarks/Adobe-C++/simpl  1.0579     1.1048    1.04       +4%
>> SingleSource/Benchmarks/Linpack/linpack  4.2817     4.3298    1.01       +1%
>> SingleSource/Benchmarks/Misc-C++/stepan  4.1821     4.226     1.01       +1%
>> SingleSource/Benchmarks/Misc/oourafft    3.0305     3.1777    1.05       +5%
>> -------------------------------------------------------------------------------
>> Min (14)                                 -          -         0.96       -
>> Max (14)                                 -          -         1.05       -
>> Sum (14)                                 79         79        1          +0%
>> A.Mean (14)                              -          -         1.01       +1%
>> G.Mean 2 (14)                            -          -         1.01       +1%
>> -------------------------------------------------------------------------------
>>
>> * O3 *
>> Benchmark_ID                             Reference  Test      Expansion  Percent
>> -------------------------------------------------------------------------------
>> External/Nurbs/nurbs                     2.2322     2.2131    0.99       -1%
>> External/Povray/povray                   2.2638     2.2762    1.01       +1%
>> External/SPEC/CFP2000/177.mesa/177.mesa  1.6675     1.6828    1.01       +1%
>> External/SPEC/CFP2000/188.ammp/188.ammp  10.9309    11.1191   1.02       +2%
>> External/SPEC/CFP2006/433.milc/433.milc  6.9214     7.1696    1.04       +4%
>> External/SPEC/CINT2000/164.gzip/164.gzi  8.5327     8.8114    1.03       +3%
>> External/SPEC/CINT2000/186.crafty/186.c  4.1266     4.16      1.01       +1%
>> External/SPEC/CINT2000/253.perlbmk/253.  5.6991     5.7309    1.01       +1%
>> External/SPEC/CINT2000/256.bzip2/256.bz  6.7917     6.8763    1.01       +1%
>> External/SPEC/CINT2006/400.perlbench/40  6.243      6.1464    0.98       -2%
>> External/SPEC/CINT2006/401.bzip2/401.bz  2.095      2.0588    0.98       -2%
>> External/SPEC/CINT2006/462.libquantum/4  1.2        1.2108    1.01       +1%
>> MultiSource/Applications/SIBsim4/SIBsim  2.4547     2.5129    1.02       +2%
>> MultiSource/Benchmarks/Bullet/bullet     4.1687     4.0882    0.98       -2%
>> MultiSource/Benchmarks/TSVC/LinearDepen  3.0389     3.0566    1.01       +1%
>> MultiSource/Benchmarks/TSVC/LinearDepen  2.1298     2.1997    1.03       +3%
>> MultiSource/Benchmarks/TSVC/LoopRerolli  2.6458     2.5552    0.97       -3%
>> MultiSource/Benchmarks/TSVC/Symbolics-f  1.6243     1.6612    1.02       +2%
>> MultiSource/Benchmarks/mafft/pairlocala  23.8979    24.0547   1.01       +1%
>> SingleSource/Benchmarks/Misc/oourafft    3.0374     3.1846    1.05       +5%
>> SingleSource/Benchmarks/SmallPT/smallpt  6.5533     6.6683    1.02       +2%
>> -------------------------------------------------------------------------------
>> Min (21)                                 -          -         0.97       -
>> Max (21)                                 -          -         1.05       -
>> Sum (21)                                 108        109       1.01       -1%
>> A.Mean (21)                              -          -         1.01       +1%
>> G.Mean 2 (21)                            -          -         1.01       +1%
>> -------------------------------------------------------------------------------
>>
>> Thanks,
>> -Quentin
>>
>>> On Sep 9, 2014, at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>>
>>> Hi Chandler,
>>>
>>> Thanks for fixing the problem with the insertps mask.
>>>
>>> Generally the new shuffle lowering looks promising, however there are
>>> some cases where the codegen is now worse, causing runtime performance
>>> regressions in some of our internal codebases.
>>>
>>> You have already mentioned how the new shuffle lowering is missing
>>> some features; for example, you explicitly said that we currently lack
>>> SSE4.1 blend support. Unfortunately, this seems to be one of the
>>> main reasons for the slowdown we are seeing.
>>>
>>> Here is a list of what we found so far that we think is causing most
>>> of the slowdown:
>>> 1) shufps is always emitted in cases where we could emit a single
>>> blendps; in these cases, blendps is preferable because it has better
>>> reciprocal throughput (this is true on all modern Intel and AMD cpus).
>>>
>>> Things get worse when it comes to lowering shuffles where the shuffle
>>> mask indices refer to elements from both input vectors in each lane.
>>> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
>>> a single blendps; instead it gets lowered into two shufps
>>> instructions.
>>>
>>> Example:
>>> ;;;
>>> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>>>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>>>   ret <4 x float> %1
>>> }
>>> ;;;
>>>
>>> llc (-mcpu=corei7-avx):
>>>   vblendps $10, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]
>>>
>>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>>   vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
>>>   vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0[0,2,1,3]
>>>
>>>
>>> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
>>> mask identifies a blend. At the moment the new lowering logic is very
>>> aggressively emitting insertps instead of cheaper blendps.
>>>
>>> Example:
>>> ;;;
>>> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>>>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 5, i32 2, i32 7>
>>>   ret <4 x float> %1
>>> }
>>> ;;;
>>>
>>> llc (-mcpu=corei7-avx):
>>>   vblendps $11, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>>>
>>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>>   vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>>>
>>>
>>> 3) When a shuffle performs an insert at index 0 we always generate an
>>> insertps, while a movss would do a better job.
>>> ;;;
>>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>>>   %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>>   ret <4 x float> %1
>>> }
>>> ;;;
>>>
>>> llc (-mcpu=corei7-avx):
>>>   vmovss %xmm1, %xmm0, %xmm0
>>>
>>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>>>
>>> I hope this is useful. We would be happy to contribute patches to
>>> improve some of the above cases, but we obviously know that this is
>>> still a work in progress, so we don't want to introduce conflicts with
>>> your work.
>>> Please let us know what you think.
>>>
>>> We will keep looking at this and follow up with any further findings.
>>>
>>> Thanks,
>>> Andrea Di Biagio
>>> SN Systems - Sony Computer Entertainment Inc.
>>>
>>> On Mon, Sep 8, 2014 at 6:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>> Hi Chandler,
>>>>
>>>> Forget about what I said.
>>>> It seems I have some weird dependencies in my build system.
>>>> My binaries are out-of-sync.
>>>>
>>>> Let me sort that out; it is likely the problem is already fixed, and I can
>>>> resume the measurements.
>>>>
>>>> Sorry for the noise.
>>>>
>>>> Q.
>>>>
>>>> On Sep 8, 2014, at 9:32 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>>
>>>>
>>>> On Sep 7, 2014, at 8:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>>
>>>> Sure,
>>>>
>>>> Here is the command line:
>>>> clang -cc1 -triple x86_64-apple-macosx -S -disable-free
>>>> -disable-llvm-verifier -main-file-name tmp.i -mrelocation-model pic
>>>> -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu
>>>> core-avx-i -O3 -ferror-limit 19 -fmessage-length 114 -stack-protector 1
>>>> -mstackrealign -fblocks -fencode-extended-block-signature
>>>> -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics
>>>> -vectorize-loops -vectorize-slp -mllvm
>>>> -x86-experimental-vector-shuffle-lowering=true -o tmp.s -x cpp-output tmp.i
>>>>
>>>> This was with trunk 215249.
>>>>
>>>> I meant, r217281.
>>>>
>>>>
>>>> Thanks,
>>>> -Quentin
>>>>
>>>> <tmp.i>
>>>>
>>>> On Sep 6, 2014, at 4:27 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>>>>
>>>> I've run the SingleSource test suite for core-avx-i and have no failures
>>>> here so a preprocessed file + commandline would be very useful if this
>>>> reproduces for you still.
>>>>
>>>> On Sat, Sep 6, 2014 at 4:07 PM, Chandler Carruth <chandlerc at gmail.com>
>>>> wrote:
>>>>>
>>>>> I'm having trouble reproducing this. I'm trying to get LNT to actually
>>>>> run, but manually compiling the given source file didn't reproduce it for
>>>>> me.
>>>>>
>>>>> It might have been fixed recently (although I'd be surprised if so), but
>>>>> it would help to get the actual command line for which compiling this file
>>>>> in the test suite failed.
>>>>>
>>>>> -Chandler
>>>>>
>>>>> On Fri, Sep 5, 2014 at 4:36 PM, Quentin Colombet <qcolombet at apple.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi Chandler,
>>>>>>
>>>>>> While doing the performance measurement on an Ivy Bridge, I ran into
>>>>>> compile time errors.
>>>>>>
>>>>>> I saw a bunch of “cannot select” in the LLVM test suite with
>>>>>> -march=core-avx-i.
>>>>>> E.g., SingleSource/UnitTests/Vector/SSE/sse.isamax.c is failing at O3
>>>>>> -march=core-avx-i with:
>>>>>> fatal error: error in backend: Cannot select: 0x7f91b99a6420: v4i32 = bitcast 0x7f91b99b0e10 [ORD=3] [ID=27]
>>>>>>   0x7f91b99b0e10: v4i64 = insert_subvector 0x7f91b99a7210, 0x7f91b99a6d68, 0x7f91b99ace70 [ORD=2] [ID=25]
>>>>>>     0x7f91b99a7210: v4i64 = undef [ID=15]
>>>>>>     0x7f91b99a6d68: v2i64 = scalar_to_vector 0x7f91b99ab840 [ORD=2] [ID=23]
>>>>>>       0x7f91b99ab840: i64 = AssertZext 0x7f91b99acc60, 0x7f91b99ac738 [ORD=2] [ID=20]
>>>>>>         0x7f91b99acc60: i64,ch = CopyFromReg 0x7f91b8d52820, 0x7f91b99a3a10 [ORD=2] [ID=16]
>>>>>>           0x7f91b99a3a10: i64 = Register %vreg68 [ID=1]
>>>>>>     0x7f91b99ace70: i64 = Constant<0> [ID=3]
>>>>>> In function: isamax0
>>>>>> clang: error: clang frontend command failed with exit code 70 (use -v to
>>>>>> see invocation)
>>>>>> clang version 3.6.0 (215249)
>>>>>> Target: x86_64-apple-darwin14.0.0
>>>>>>
>>>>>> For some reason, I cannot reproduce the problem with the test case that
>>>>>> clang gives me using -emit-llvm.
>>>>>> Since the source is public, I guess you can
>>>>>> try to reproduce on your side.
>>>>>> Indeed, if you run the test-suite with -march=core-avx-i you’ll likely
>>>>>> see all those failures.
>>>>>>
>>>>>> Let me know if you cannot and I’ll try harder to produce a test case.
>>>>>>
>>>>>> Note: This is the same failure all over the place, i.e., cannot select a
>>>>>> bit cast from various types to v4i32 or v4i64.
>>>>>>
>>>>>> Thanks,
>>>>>> -Quentin
>>>>>>
>>>>>>
>>>>>> On Sep 5, 2014, at 11:09 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>>>>>>
>>>>>> Hi Chandler,
>>>>>>
>>>>>> On 5 September 2014 17:38, Chandler Carruth <chandlerc at gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 5, 2014 at 9:32 AM, Robert Lougher <rob.lougher at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Unfortunately, another team, while doing internal testing has seen the
>>>>>> new path generating illegal insertps masks. A sample here:
>>>>>>
>>>>>>   vinsertps $256, %xmm0, %xmm13, %xmm4 # xmm4 = xmm0[0],xmm13[1,2,3]
>>>>>>   vinsertps $256, %xmm1, %xmm0, %xmm6 # xmm6 = xmm1[0],xmm0[1,2,3]
>>>>>>   vinsertps $256, %xmm13, %xmm1, %xmm7 # xmm7 = xmm13[0],xmm1[1,2,3]
>>>>>>   vinsertps $416, %xmm1, %xmm4, %xmm14 # xmm14 = xmm4[0,1],xmm1[2],xmm4[3]
>>>>>>   vinsertps $416, %xmm13, %xmm6, %xmm13 # xmm13 = xmm6[0,1],xmm13[2],xmm6[3]
>>>>>>   vinsertps $416, %xmm0, %xmm7, %xmm0 # xmm0 = xmm7[0,1],xmm0[2],xmm7[3]
>>>>>>
>>>>>> We'll continue to look into this and do additional testing.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Interesting. Let me know if you get a test case. The insertps code path
>>>>>> was
>>>>>> added recently though and has been much less well tested. I'll start fuzz
>>>>>> testing it and should hopefully uncover the bug.
>>>>>>
>>>>>>
>>>>>> Here's two small test cases. Hope they are of use.
>>>>>>
>>>>>> Thanks,
>>>>>> Rob.
>>>>>>
>>>>>> ------
>>>>>> define <4 x float> @test(<4 x float> %xyzw, <4 x float> %abcd) {
>>>>>>   %1 = extractelement <4 x float> %xyzw, i32 0
>>>>>>   %2 = insertelement <4 x float> undef, float %1, i32 0
>>>>>>   %3 = insertelement <4 x float> %2, float 0.000000e+00, i32 1
>>>>>>   %4 = shufflevector <4 x float> %3, <4 x float> %xyzw, <4 x i32> <i32 0, i32 1, i32 6, i32 undef>
>>>>>>   %5 = shufflevector <4 x float> %4, <4 x float> %abcd, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
>>>>>>   ret <4 x float> %5
>>>>>> }
>>>>>>
>>>>>> define <4 x float> @test2(<4 x float> %xyzw, <4 x float> %abcd) {
>>>>>>   %1 = shufflevector <4 x float> %xyzw, <4 x float> %abcd, <4 x i32> <i32 0, i32 undef, i32 2, i32 4>
>>>>>>   %2 = shufflevector <4 x float> <float undef, float 0.000000e+00, float undef, float undef>, <4 x float> %1, <4 x i32> <i32 4, i32 1, i32 6, i32 7>
>>>>>>   ret <4 x float> %2
>>>>>> }
>>>>>>
>>>>>>
>>>>>> llc -march=x86-64 -mattr=+avx test.ll -o -
>>>>>>
>>>>>> test: # @test
>>>>>>   vxorps %xmm2, %xmm2, %xmm2
>>>>>>   vmovss %xmm0, %xmm2, %xmm2
>>>>>>   vblendps $4, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>>>>>>   vinsertps $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>>>>   retl
>>>>>>
>>>>>> test2: # @test2
>>>>>>   vinsertps $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>>>>   vxorps %xmm1, %xmm1, %xmm1
>>>>>>   vblendps $13, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>>>>>>   retl
>>>>>>
>>>>>> llc -march=x86-64 -mattr=+avx -x86-experimental-vector-shuffle-lowering test.ll -o -
>>>>>>
>>>>>> test: # @test
>>>>>>   vinsertps $270, %xmm0, %xmm0, %xmm2 # xmm2 = xmm0[0],zero,zero,zero
>>>>>>   vinsertps $416, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>>>>>>   vinsertps $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>>>>   retl
>>>>>>
>>>>>> test2: # @test2
>>>>>>   vinsertps $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>>>>   vxorps %xmm1, %xmm1, %xmm1
>>>>>>   vinsertps $336, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>>>>>>   retl

-------------- next part --------------
A non-text attachment was scrubbed...
Name: repro.ll
Type: application/octet-stream
Size: 2265 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140909/565be8f4/attachment.obj>
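The two missed patterns reported at the top of this message (vmovsd + vmovlhps instead of vmovddup, and vsubpd + vaddpd + vshufpd instead of vaddsubpd) can be sanity-checked lane by lane. The following is only a rough sketch in Python, assuming each xmm register is modeled as a list of two doubles with lane 0 first; every function name below is made up for illustration and is not an LLVM or intrinsic API. It only shows that the longer sequences compute the same lanes as the single instructions and says nothing about how the lowering should detect them.

```python
# Lane-level models of the instructions named in the message; lane 0 comes first.
# All function names are illustrative assumptions, not LLVM or intrinsic APIs.

def movsd_load(value):
    # vmovsd (mem), %xmm: scalar load into lane 0, lane 1 is zeroed.
    return [value, 0.0]

def movlhps(src1, src2):
    # vmovlhps: low 64 bits of src1 go to lane 0, low 64 bits of src2 to lane 1.
    return [src1[0], src2[0]]

def movddup_load(value):
    # vmovddup (mem), %xmm: load one double and duplicate it into both lanes.
    return [value, value]

def subpd(a, b):
    return [a[0] - b[0], a[1] - b[1]]

def addpd(a, b):
    return [a[0] + b[0], a[1] + b[1]]

def shufpd(imm, src1, src2):
    # Bit 0 of imm selects the lane of src1 used for result lane 0,
    # bit 1 selects the lane of src2 used for result lane 1.
    return [src1[imm & 1], src2[(imm >> 1) & 1]]

def addsubpd(a, b):
    # vaddsubpd: subtract in lane 0, add in lane 1.
    return [a[0] - b[0], a[1] + b[1]]

def splat_two_instructions(value):
    # vmovsd followed by vmovlhps %xmm2, %xmm2, %xmm2
    x = movsd_load(value)
    return movlhps(x, x)

def addsub_three_instructions(a, b):
    diff = subpd(a, b)             # vsubpd %xmm1, %xmm0, %xmm2
    total = addpd(a, b)            # vaddpd %xmm1, %xmm0, %xmm0
    return shufpd(2, diff, total)  # vshufpd $2, %xmm0, %xmm2, %xmm0

if __name__ == "__main__":
    a, b = [1.0, 2.0], [0.5, 0.25]
    print(splat_two_instructions(3.5) == movddup_load(3.5))
    print(addsub_three_instructions(a, b) == addsubpd(a, b))
```

In this model, immediate 2 for vshufpd takes lane 0 from the difference and lane 1 from the sum, which matches the `xmm0 = xmm2[0],xmm0[1]` comment in the message.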