thr3ads.net - llvm dev - [LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon! [Sep 2014]

Sean Silva

2014-Sep-09 20:47 UTC

[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

On Tue, Sep 9, 2014 at 12:53 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> Hi Chandler,
>
> I had observed some improvements and regressions with the new lowering.
>
> Here are the numbers for an Ivy Bridge machine fixed at 2900MHz.
>
> I’ll look into the regressions to provide test cases.
>
> ** Numbers **
>
> Smaller is better. Only reported tests that run for at least one second.
> Reference is the default lowering, Test is the new lowering.
> The Os numbers are overall neutral, but the O3 numbers mainly expose
> regressions.
>
> Note: I can attach the raw numbers if you want.
>
That would be great. Please do.

-- Sean Silva

>
> * Os *
> Benchmark_ID                             Reference      Test  Expansion  Percent
> --------------------------------------------------------------------------------
> External/Nurbs/nurbs                        2.3302    2.3122       0.99      -1%
> External/SPEC/CFP2000/183.equake/183.eq     3.2606    3.2419       0.99      -1%
> External/SPEC/CFP2006/447.dealII/447.de    16.4638   16.1313       0.98      -2%
> External/SPEC/CFP2006/470.lbm/470.lbm       2.0159    1.9931       0.99      -1%
> External/SPEC/CINT2000/164.gzip/164.gzi     8.7611    8.6981       0.99      -1%
> External/SPEC/CINT2006/456.hmmer/456.hm     2.5674    2.5819       1.01      +1%
> External/SPEC/CINT2006/462.libquantum/4     1.2924     1.347       1.04      +4%
> MultiSource/Benchmarks/TSVC/CrossingThr     2.4703    2.4852       1.01      +1%
> MultiSource/Benchmarks/TSVC/LoopRerolli     2.6611    2.5668       0.96      -4%
> MultiSource/Benchmarks/mafft/pairlocala     24.676   24.5372       0.99      -1%
> SingleSource/Benchmarks/Adobe-C++/simpl     1.0579    1.1048       1.04      +4%
> SingleSource/Benchmarks/Linpack/linpack     4.2817    4.3298       1.01      +1%
> SingleSource/Benchmarks/Misc-C++/stepan     4.1821     4.226       1.01      +1%
> SingleSource/Benchmarks/Misc/oourafft       3.0305    3.1777       1.05      +5%
> --------------------------------------------------------------------------------
> Min (14)                                         -         -       0.96        -
> Max (14)                                         -         -       1.05        -
> Sum (14)                                        79        79          1      +0%
> A.Mean (14)                                      -         -       1.01      +1%
> G.Mean 2 (14)                                    -         -       1.01      +1%
> --------------------------------------------------------------------------------
>
> * O3 *
> Benchmark_ID                             Reference      Test  Expansion  Percent
> --------------------------------------------------------------------------------
> External/Nurbs/nurbs                        2.2322    2.2131       0.99      -1%
> External/Povray/povray                      2.2638    2.2762       1.01      +1%
> External/SPEC/CFP2000/177.mesa/177.mesa     1.6675    1.6828       1.01      +1%
> External/SPEC/CFP2000/188.ammp/188.ammp    10.9309   11.1191       1.02      +2%
> External/SPEC/CFP2006/433.milc/433.milc     6.9214    7.1696       1.04      +4%
> External/SPEC/CINT2000/164.gzip/164.gzi     8.5327    8.8114       1.03      +3%
> External/SPEC/CINT2000/186.crafty/186.c     4.1266      4.16       1.01      +1%
> External/SPEC/CINT2000/253.perlbmk/253.     5.6991    5.7309       1.01      +1%
> External/SPEC/CINT2000/256.bzip2/256.bz     6.7917    6.8763       1.01      +1%
> External/SPEC/CINT2006/400.perlbench/40      6.243    6.1464       0.98      -2%
> External/SPEC/CINT2006/401.bzip2/401.bz      2.095    2.0588       0.98      -2%
> External/SPEC/CINT2006/462.libquantum/4        1.2    1.2108       1.01      +1%
> MultiSource/Applications/SIBsim4/SIBsim     2.4547    2.5129       1.02      +2%
> MultiSource/Benchmarks/Bullet/bullet        4.1687    4.0882       0.98      -2%
> MultiSource/Benchmarks/TSVC/LinearDepen     3.0389    3.0566       1.01      +1%
> MultiSource/Benchmarks/TSVC/LinearDepen     2.1298    2.1997       1.03      +3%
> MultiSource/Benchmarks/TSVC/LoopRerolli     2.6458    2.5552       0.97      -3%
> MultiSource/Benchmarks/TSVC/Symbolics-f     1.6243    1.6612       1.02      +2%
> MultiSource/Benchmarks/mafft/pairlocala    23.8979   24.0547       1.01      +1%
> SingleSource/Benchmarks/Misc/oourafft       3.0374    3.1846       1.05      +5%
> SingleSource/Benchmarks/SmallPT/smallpt     6.5533    6.6683       1.02      +2%
> --------------------------------------------------------------------------------
> Min (21)                                         -         -       0.97        -
> Max (21)                                         -         -       1.05        -
> Sum (21)                                       108       109       1.01      -1%
> A.Mean (21)                                      -         -       1.01      +1%
> G.Mean 2 (21)                                    -         -       1.01      +1%
> --------------------------------------------------------------------------------
>
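A note on reading the two tables: the Expansion column appears to be Test divided by Reference, and Percent the same ratio expressed as a rounded percentage change. That definition is inferred from the numbers and is not stated in the mail. A minimal Python sketch under that assumption (the function names are made up for illustration):

```python
def expansion(reference: float, test: float) -> float:
    """Ratio of the new-lowering runtime to the default-lowering runtime."""
    return test / reference


def percent_change(reference: float, test: float) -> str:
    """Signed, rounded percentage change, formatted like the Percent column."""
    change = (test / reference - 1.0) * 100.0
    return f"{round(change):+d}%"


# Rows copied from the tables: (benchmark, Reference, Test).
rows = [
    ("SingleSource/Benchmarks/Misc/oourafft (Os)", 3.0305, 3.1777),
    ("MultiSource/Benchmarks/TSVC/LoopRerolli (Os)", 2.6611, 2.5668),
    ("External/SPEC/CFP2006/433.milc/433.milc (O3)", 6.9214, 7.1696),
]
for name, ref, test in rows:
    print(f"{name}: {expansion(ref, test):.2f} {percent_change(ref, test)}")
```

Since smaller is better, a ratio above 1 (a positive percentage) is a regression for the new lowering.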
> Thanks,
> -Quentin
>
> On Sep 9, 2014, at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>
> Hi Chandler,
>
> Thanks for fixing the problem with the insertps mask.
>
> Generally the new shuffle lowering looks promising, however there are
> some cases where the codegen is now worse causing runtime performance
> regressions in some of our internal codebase.
>
> You have already mentioned how the new shuffle lowering is missing
> some features; for example, you explicitly said that we currently lack
> SSE4.1 blend support. Unfortunately, this seems to be one of the
> main reasons for the slowdown we are seeing.
>
> Here is a list of what we found so far that we think is causing most
> of the slowdown:
> 1) shufps is always emitted in cases where we could emit a single
> blendps; in these cases, blendps is preferable because it has better
> reciprocal throughput (this is true on all modern Intel and AMD cpus).
>
> Things get worse when it comes to lowering shuffles where the shuffle
> mask indices refer to elements from both input vectors in each lane.
> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
> a single blendps; instead it gets lowered into two shufps
> instructions.
>
> Example:
> ;;;
> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>  ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>  vblendps  $10, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>  vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
>  vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0 = xmm0[0,2,1,3]
>
>
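To make the immediates in this example easier to check, here is a small Python sketch (not LLVM code; the helper names are hypothetical). It computes the blendps immediate for a shuffle mask that is a pure blend, and decodes the four 2-bit selectors of a shufps immediate. It assumes a 4-lane vector and the shufflevector convention that indices 0-3 pick from the first operand and 4-7 from the second.

```python
from typing import List, Optional


def blend_immediate(mask: List[Optional[int]], lanes: int = 4) -> Optional[int]:
    """Return the blend immediate if `mask` keeps every element in its own
    lane, or None if it is not a blend. None entries in `mask` mean undef."""
    imm = 0
    for lane, index in enumerate(mask):
        if index is None:
            continue  # undef lane: either source is acceptable
        if index % lanes != lane:
            return None  # element moves across lanes, so not a blend
        if index >= lanes:
            imm |= 1 << lane  # bit set: take this lane from the second vector
    return imm


def shufps_selectors(imm: int) -> List[int]:
    """Decode an 8-bit shufps immediate into its four 2-bit lane selectors."""
    imm &= 0xFF  # the assembler prints the byte as signed, e.g. -40 == 0xD8
    return [(imm >> (2 * i)) & 0b11 for i in range(4)]


print(blend_immediate([0, 5, 2, 7]))  # immediate for the <0,5,2,7> mask
print(shufps_selectors(-40))          # selectors encoded by $-40
```

For <0,5,2,7> the sketch yields 10, the immediate in the vblendps line, and -40 decodes to selectors 0,2,1,3, which is the pattern shown in the vshufps comments.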
> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
> mask identifies a blend. At the moment the new lowering logic is very
> aggressively emitting insertps instead of cheaper blendps.
>
> Example:
> ;;;
> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 5, i32 2, i32 7>
>  ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>  vblendps  $11, %xmm0, %xmm1, %xmm0   # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>  vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>
>
> 3) When a shuffle performs an insert at index 0 we always generate an
> insertps, while a movss would do a better job.
> ;;;
> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>  ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>  vmovss %xmm1, %xmm0, %xmm0
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>
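The three cases can be read as a preference order for in-place, two-input v4f32 shuffles. The Python sketch below only illustrates that ordering as described in this mail; it is not the logic of the X86 backend, and the function name and return strings are hypothetical. Which instruction is actually cheaper depends on the target CPU.

```python
from typing import List, Optional


def preferred_lowering(mask: List[Optional[int]]) -> str:
    """Suggest an instruction for a 4-lane, two-input shuffle mask.

    Indices 0-3 select from the first vector (A), 4-7 from the second (B);
    None marks an undef lane.
    """
    lanes = 4
    defined = [(i, m) for i, m in enumerate(mask) if m is not None]
    # Anything that moves an element to a different lane is out of scope here.
    if any(m % lanes != i for i, m in defined):
        return "other"
    from_b = [i for i, m in defined if m >= lanes]
    if not from_b or len(from_b) == len(defined):
        return "single input"  # every defined lane comes from one vector
    if from_b == [0]:
        return "movss"         # case 3: only lane 0 is replaced by B[0]
    return "blendps"           # cases 1 and 2: an in-place blend (SSE4.1)


for m in ([4, 1, 2, 3], [0, 5, 2, 7], [4, 5, 2, 7], [0, 1, 2, 4]):
    print(m, preferred_lowering(m))
```

A mask such as <0,1,2,4>, which moves B[0] into lane 3, falls outside the blend cases and is where insertps remains a natural choice.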
> I hope this is useful. We would be happy to contribute patches to
> improve some of the above cases, but we obviously know that this is
> still a work in progress, so we don't want to introduce conflicts with
> your work. Please let us know what you think.
>
> We will keep looking at this and follow up with any further findings.
>
> Thanks,
> Andrea Di Biagio
> SN Systems - Sony Computer Entertainment Inc.
>
> On Mon, Sep 8, 2014 at 6:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
> Hi Chandler,
>
> Forget about what I said.
> It seems I have some weird dependencies in my build system.
> My binaries are out-of-sync.
>
> Let me sort that out. It is likely the problem is already fixed, and I can
> resume the measurements.
>
> Sorry for the noise.
>
> Q.
>
> On Sep 8, 2014, at 9:32 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>
> On Sep 7, 2014, at 8:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
> Sure,
>
> Here is the command line:
> clang -cc1 -triple x86_64-apple-macosx -S -disable-free
> -disable-llvm-verifier -main-file-name tmp.i -mrelocation-model pic
> -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu
> core-avx-i  -O3  -ferror-limit 19 -fmessage-length 114 -stack-protector 1
> -mstackrealign -fblocks  -fencode-extended-block-signature
> -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics
> -vectorize-loops -vectorize-slp -mllvm
> -x86-experimental-vector-shuffle-lowering=true -o tmp.s -x cpp-output tmp.i
>
> This was with trunk 215249.
>
> I meant, r217281.
>
>
> Thanks,
> -Quentin
>
> <tmp.i>
>
> On Sep 6, 2014, at 4:27 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> I've run the SingleSource test suite for core-avx-i and have no failures
> here so a preprocessed file + commandline would be very useful if this
> reproduces for you still.
>
> On Sat, Sep 6, 2014 at 4:07 PM, Chandler Carruth <chandlerc at gmail.com>
> wrote:
>
>
> I'm having trouble reproducing this. I'm trying to get LNT to actually
> run, but manually compiling the given source file didn't reproduce it for
> me.
>
> It might have been fixed recently (although I'd be surprised if so), but
> it would help to get the actual command line for which compiling this file
> in the test suite failed.
>
> -Chandler
>
> On Fri, Sep 5, 2014 at 4:36 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>
> Hi Chandler,
>
> While doing the performance measurement on an Ivy Bridge, I ran into
> compile time errors.
>
> I saw a bunch of "cannot select" in the LLVM test suite with
> -march=core-avx-i.
> E.g., SingleSource/UnitTests/Vector/SSE/sse.isamax.c is failing at O3
> -march=core-avx-i with:
> fatal error: error in backend: Cannot select: 0x7f91b99a6420: v4i32 = bitcast 0x7f91b99b0e10 [ORD=3] [ID=27]
>  0x7f91b99b0e10: v4i64 = insert_subvector 0x7f91b99a7210, 0x7f91b99a6d68, 0x7f91b99ace70 [ORD=2] [ID=25]
>    0x7f91b99a7210: v4i64 = undef [ID=15]
>    0x7f91b99a6d68: v2i64 = scalar_to_vector 0x7f91b99ab840 [ORD=2] [ID=23]
>      0x7f91b99ab840: i64 = AssertZext 0x7f91b99acc60, 0x7f91b99ac738 [ORD=2] [ID=20]
>        0x7f91b99acc60: i64,ch = CopyFromReg 0x7f91b8d52820, 0x7f91b99a3a10 [ORD=2] [ID=16]
>          0x7f91b99a3a10: i64 = Register %vreg68 [ID=1]
>    0x7f91b99ace70: i64 = Constant<0> [ID=3]
> In function: isamax0
> clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
> clang version 3.6.0 (215249)
> Target: x86_64-apple-darwin14.0.0
>
> For some reason, I cannot reproduce the problem with the test case that
> clang gives me using -emit-llvm. Since the source is public, I guess you
> can try to reproduce on your side.
> Indeed, if you run the test-suite with -march=core-avx-i you’ll likely
> see all those failures.
>
> Let me know if you cannot and I’ll try harder to produce a test case.
>
> Note: This is the same failure all over the place, i.e., cannot select a
> bit cast from various types to v4i32 or v4i64.
>
> Thanks,
> -Quentin
>
>
> On Sep 5, 2014, at 11:09 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>
> Hi Chandler,
>
> On 5 September 2014 17:38, Chandler Carruth <chandlerc at gmail.com> wrote:
>
>
> On Fri, Sep 5, 2014 at 9:32 AM, Robert Lougher <rob.lougher at gmail.com>
> wrote:
>
>
> Unfortunately, another team, while doing internal testing has seen the
> new path generating illegal insertps masks.  A sample here:
>
>   vinsertps    $256, %xmm0, %xmm13, %xmm4 # xmm4 = xmm0[0],xmm13[1,2,3]
>   vinsertps    $256, %xmm1, %xmm0, %xmm6 # xmm6 = xmm1[0],xmm0[1,2,3]
>   vinsertps    $256, %xmm13, %xmm1, %xmm7 # xmm7 = xmm13[0],xmm1[1,2,3]
>   vinsertps    $416, %xmm1, %xmm4, %xmm14 # xmm14 = xmm4[0,1],xmm1[2],xmm4[3]
>   vinsertps    $416, %xmm13, %xmm6, %xmm13 # xmm13 = xmm6[0,1],xmm13[2],xmm6[3]
>   vinsertps    $416, %xmm0, %xmm7, %xmm0 # xmm0 = xmm7[0,1],xmm0[2],xmm7[3]
>
> We'll continue to look into this and do additional testing.
>
>
>
> Interesting. Let me know if you get a test case. The insertps code path
> was added recently though and has been much less well tested. I'll start
> fuzz testing it and should hopefully uncover the bug.
>
>
> Here's two small test cases.  Hope they are of use.
>
> Thanks,
> Rob.
>
> ------
> define <4 x float> @test(<4 x float> %xyzw, <4 x float> %abcd) {
>   %1 = extractelement <4 x float> %xyzw, i32 0
>   %2 = insertelement <4 x float> undef, float %1, i32 0
>   %3 = insertelement <4 x float> %2, float 0.000000e+00, i32 1
>   %4 = shufflevector <4 x float> %3, <4 x float> %xyzw, <4 x i32> <i32 0, i32 1, i32 6, i32 undef>
>   %5 = shufflevector <4 x float> %4, <4 x float> %abcd, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
>   ret <4 x float> %5
> }
>
> define <4 x float> @test2(<4 x float> %xyzw, <4 x float> %abcd) {
>   %1 = shufflevector <4 x float> %xyzw, <4 x float> %abcd, <4 x i32> <i32 0, i32 undef, i32 2, i32 4>
>   %2 = shufflevector <4 x float> <float undef, float 0.000000e+00, float undef, float undef>, <4 x float> %1, <4 x i32> <i32 4, i32 1, i32 6, i32 7>
>   ret <4 x float> %2
> }
>
>
> llc -march=x86-64 -mattr=+avx test.ll -o -
>
> test:                                   # @test
>   vxorps    %xmm2, %xmm2, %xmm2
>   vmovss    %xmm0, %xmm2, %xmm2
>   vblendps    $4, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>   vinsertps    $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   retl
>
> test2:                                  # @test2
>   vinsertps    $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   vxorps    %xmm1, %xmm1, %xmm1
>   vblendps    $13, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>   retl
>
> llc -march=x86-64 -mattr=+avx
> -x86-experimental-vector-shuffle-lowering test.ll -o -
>
> test:                                   # @test
>   vinsertps    $270, %xmm0, %xmm0, %xmm2 # xmm2 = xmm0[0],zero,zero,zero
>   vinsertps    $416, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>   vinsertps    $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   retl
>
> test2:                                  # @test2
>   vinsertps    $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   vxorps    %xmm1, %xmm1, %xmm1
>   vinsertps    $336, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>   retl
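The immediates above can be checked against the insertps encoding, in which the 8-bit immediate holds a source lane in bits 7:6, a destination lane in bits 5:4 and a zero mask in bits 3:0. The Python sketch below (helper names are hypothetical) decodes an immediate and rejects values that do not fit in 8 bits. Applied to the experimental output quoted in this thread, each out-of-range immediate turns out to be 256 plus an encoding that matches its printed comment, which suggests, but does not prove, that a stray bit 8 was the problem.

```python
def is_legal_insertps_imm(imm: int) -> bool:
    """insertps takes an unsigned 8-bit immediate."""
    return 0 <= imm <= 0xFF


def describe_insertps(imm: int, dst: str = "dst", src: str = "src") -> str:
    """Render the result lanes selected by a legal insertps immediate."""
    if not is_legal_insertps_imm(imm):
        raise ValueError(f"insertps immediate {imm} does not fit in 8 bits")
    count_s = (imm >> 6) & 0b11   # which source lane is inserted
    count_d = (imm >> 4) & 0b11   # which destination lane receives it
    zmask = imm & 0b1111          # lanes forced to zero afterwards
    lanes = [f"{dst}[{i}]" for i in range(4)]
    lanes[count_d] = f"{src}[{count_s}]"
    for i in range(4):
        if (zmask >> i) & 1:
            lanes[i] = "zero"
    return ",".join(lanes)


# Immediates printed by the experimental lowering in this thread.
for imm in (256, 416, 270, 304, 336):
    print(imm, is_legal_insertps_imm(imm), describe_insertps(imm & 0xFF))
```

The immediate 48 emitted by the default lowering is already legal and decodes to dst[0],dst[1],dst[2],src[0], the same result that the comment on $304 describes.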
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Quentin Colombet

2014-Sep-09 20:59 UTC

[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

> On Sep 9, 2014, at 1:47 PM, Sean Silva <chisophugis at gmail.com> wrote:
> 
> 
> 
> On Tue, Sep 9, 2014 at 12:53 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> Hi Chandler,
> 
> I had observed some improvements and regressions with the new lowering.
> 
> Here are the numbers for an Ivy Bridge machine fixed at 2900MHz.
> 
> I’ll look into the regressions to provide test cases.
> 
> ** Numbers **
> 
> Smaller is better. Only reported tests that run for at least one second.
> Reference is the default lowering, Test is the new lowering.
> The Os numbers are overall neutral, but the O3 numbers mainly expose
> regressions.
> 
> Note: I can attach the raw numbers if you want.
> 
> That would be great. Please do.
Alright, here they are :).

base-perf-Ox.txt: runtime for the default lowering.
new-perf-Ox.txt: runtime for the new lowering.

Each line in those files has the following format:
<unit> <benchmark> <perf number>

The units are:
- min: Minimum of the 7 runs.
- max: Maximum of the 7 runs.
- avg: Average of the 7 runs.
- total: Total of the 7 runs.
- med: Median of the 7 runs.
- SD: Standard deviation of the 7 runs.
- SD%: Standard deviation of the 7 runs, as a percentage.
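
For anyone who wants to compare the two attachments, a minimal Python sketch of a reader for this line format follows. It assumes whitespace-separated fields, that the unit is the first field and the number the last, and that SD% values may carry a trailing percent sign; none of this is confirmed by the mail beyond the format line above. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_perf_lines(lines: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Parse '<unit> <benchmark> <perf number>' lines into
    {benchmark: {unit: value}}."""
    results: Dict[str, Dict[str, float]] = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        unit, value = parts[0], parts[-1]
        benchmark = " ".join(parts[1:-1])
        results[benchmark][unit] = float(value.rstrip("%"))
    return dict(results)


def compare(base, new, unit: str = "med") -> Dict[str, float]:
    """Return new/base for one unit; values above 1.0 mean the new
    lowering is slower."""
    return {
        bench: new[bench][unit] / base[bench][unit]
        for bench in base
        if bench in new and unit in base[bench] and unit in new[bench]
    }
```

A typical use would be to read base-perf-Ox.txt and new-perf-Ox.txt for one optimization level with `parse_perf_lines` and pass both results to `compare`. Comparing medians rather than averages is a choice made here to reduce the influence of outlier runs.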

-Quentin
> 
> -- Sean Silva
>  
> 
> 
> Thanks,
> -Quentin
> 
>> On Sep 9, 2014, at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>> 
>> Hi Chandler,
>> 
>> Thanks for fixing the problem with the insertps mask.
>> 
>> Generally the new shuffle lowering looks promising, however there are
>> some cases where the codegen is now worse causing runtime performance
>> regressions in some of our internal codebase.
>> 
>> You have already mentioned how the new shuffle lowering is missing
>> some features; for example, you explicitly said that we currently lack
>> SSE4.1 blend support. Unfortunately, this seems to be one of the
>> main reasons for the slowdown we are seeing.
>> 
>> Here is a list of what we found so far that we think is causing most
>> of the slowdown:
>> 1) shufps is always emitted in cases where we could emit a single
>> blendps; in these cases, blendps is preferable because it has better
>> reciprocal throughput (this is true on all modern Intel and AMD cpus).
>> 
>> Things get worse when it comes to lowering shuffles where the shuffle
>> mask indices refer to elements from both input vectors in each lane.
>> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
>> a single blendps; instead it gets lowered into two shufps
>> instructions.
>> 
>> Example:
>> ;;;
>> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>>  ret <4 x float> %1
>> }
>> ;;;
>> 
>> llc (-mcpu=corei7-avx):
>>  vblendps  $10, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]
>> 
>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>  vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
>>  vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0 = xmm0[0,2,1,3]
>> 
>> 
>> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
>> mask identifies a blend. At the moment the new lowering logic is very
>> aggressively emitting insertps instead of cheaper blendps.
>> 
>> Example:
>> ;;;
>> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 5, i32 2, i32 7>
>>  ret <4 x float> %1
>> }
>> ;;;
>> 
>> llc (-mcpu=corei7-avx):
>>  vblendps  $11, %xmm0, %xmm1, %xmm0   # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>> 
>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>  vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>> 
>> 
>> 3) When a shuffle performs an insert at index 0 we always generate an
>> insertps, while a movss would do a better job.
>> ;;;
>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>  ret <4 x float> %1
>> }
>> ;;;
>> 
>> llc (-mcpu=corei7-avx):
>>  vmovss %xmm1, %xmm0, %xmm0
>> 
>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>> 
>> I hope this is useful. We would be happy to contribute patches to
>> improve some of the above cases, but we obviously know that this is
>> still a work in progress, so we don't want to introduce conflicts with
>> your work. Please let us know what you think.
>> 
>> We will keep looking at this and follow up with any further findings.
>> 
>> Thanks,
>> Andrea Di Biagio
>> SN Systems - Sony Computer Entertainment Inc.
>> 
>> On Mon, Sep 8, 2014 at 6:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>> Hi Chandler,
>>> 
>>> Forget about what I said.
>>> It seems I have some weird dependencies in my build system.
>>> My binaries are out-of-sync.
>>> 
>>> Let me sort that out. It is likely the problem is already fixed, and I can
>>> resume the measurements.
>>> 
>>> Sorry for the noise.
>>> 
>>> Q.
>>> 
>>> On Sep 8, 2014, at 9:32 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>>> 
>>> 
>>> On Sep 7, 2014, at 8:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>> 
>>> Sure,
>>> 
>>> Here is the command line:
>>> clang -cc1 -triple x86_64-apple-macosx -S -disable-free
>>> -disable-llvm-verifier -main-file-name tmp.i -mrelocation-model pic
>>> -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu
>>> core-avx-i  -O3  -ferror-limit 19 -fmessage-length 114 -stack-protector 1
>>> -mstackrealign -fblocks  -fencode-extended-block-signature
>>> -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics
>>> -vectorize-loops -vectorize-slp -mllvm
>>> -x86-experimental-vector-shuffle-lowering=true -o tmp.s -x cpp-output tmp.i
>>> 
>>> This was with trunk 215249.
>>> 
>>> I meant, r217281.
>>> 
>>> 
>>> Thanks,
>>> -Quentin
>>> 
>>> <tmp.i>
>>> 
>>> On Sep 6, 2014, at 4:27 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>>> 
>>> I've run the SingleSource test suite for core-avx-i and have no failures
>>> here so a preprocessed file + commandline would be very useful if this
>>> reproduces for you still.
>>> 
>>> On Sat, Sep 6, 2014 at 4:07 PM, Chandler Carruth <chandlerc at gmail.com>
>>> wrote:
>>>> 
>>>> I'm having trouble reproducing this. I'm trying to get LNT to actually
>>>> run, but manually compiling the given source file didn't reproduce it for
>>>> me.
>>>> 
>>>> It might have been fixed recently (although I'd be surprised if so), but
>>>> it would help to get the actual command line for which compiling this file
>>>> in the test suite failed.
>>>> 
>>>> -Chandler
>>>> 
>>>> On Fri, Sep 5, 2014 at 4:36 PM, Quentin Colombet <qcolombet at apple.com>
>>>> wrote:
>>>>> 
>>>>> Hi Chandler,
>>>>> 
>>>>> While doing the performance measurement on an Ivy Bridge, I ran into
>>>>> compile time errors.
>>>>> 
>>>>> I saw a bunch of "cannot select" in the LLVM test suite with
>>>>> -march=core-avx-i.
>>>>> E.g., SingleSource/UnitTests/Vector/SSE/sse.isamax.c is failing at O3
>>>>> -march=core-avx-i with:
>>>>> fatal error: error in backend: Cannot select: 0x7f91b99a6420: v4i32 = bitcast 0x7f91b99b0e10 [ORD=3] [ID=27]
>>>>>  0x7f91b99b0e10: v4i64 = insert_subvector 0x7f91b99a7210, 0x7f91b99a6d68, 0x7f91b99ace70 [ORD=2] [ID=25]
>>>>>    0x7f91b99a7210: v4i64 = undef [ID=15]
>>>>>    0x7f91b99a6d68: v2i64 = scalar_to_vector 0x7f91b99ab840 [ORD=2] [ID=23]
>>>>>      0x7f91b99ab840: i64 = AssertZext 0x7f91b99acc60, 0x7f91b99ac738 [ORD=2] [ID=20]
>>>>>        0x7f91b99acc60: i64,ch = CopyFromReg 0x7f91b8d52820, 0x7f91b99a3a10 [ORD=2] [ID=16]
>>>>>          0x7f91b99a3a10: i64 = Register %vreg68 [ID=1]
>>>>>    0x7f91b99ace70: i64 = Constant<0> [ID=3]
>>>>> In function: isamax0
>>>>> clang: error: clang frontend command failed with exit code 70 (use -v to see invocation)
>>>>> clang version 3.6.0 (215249)
>>>>> Target: x86_64-apple-darwin14.0.0
>>>>> 
>>>>> For some reason, I cannot reproduce the problem with the test case that
>>>>> clang gives me using -emit-llvm. Since the source is public, I guess you can
>>>>> try to reproduce on your side.
>>>>> Indeed, if you run the test-suite with -march=core-avx-i you’ll likely
>>>>> see all those failures.
>>>>> 
>>>>> Let me know if you cannot and I’ll try harder to produce a test case.
>>>>> 
>>>>> Note: This is the same failure all over the place, i.e., cannot select a
>>>>> bit cast from various types to v4i32 or v4i64.
>>>>> 
>>>>> Thanks,
>>>>> -Quentin
>>>>> 
>>>>> 
>>>>> On Sep 5, 2014, at 11:09 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>>>>> 
>>>>> Hi Chandler,
>>>>> 
>>>>> On 5 September 2014 17:38, Chandler Carruth <chandlerc at gmail.com> wrote:
>>>>> 
>>>>> 
>>>>> On Fri, Sep 5, 2014 at 9:32 AM, Robert Lougher <rob.lougher at gmail.com>
>>>>> wrote:
>>>>> 
>>>>> 
>>>>> Unfortunately, another team, while doing internal testing
has seen the
>>>>> new path generating illegal insertps masks.  A sample here:
>>>>> 
>>>>>   vinsertps    $256, %xmm0, %xmm13, %xmm4 # xmm4 =
xmm0[0],xmm13[1,2,3]
>>>>>   vinsertps    $256, %xmm1, %xmm0, %xmm6 # xmm6 =
xmm1[0],xmm0[1,2,3]
>>>>>   vinsertps    $256, %xmm13, %xmm1, %xmm7 # xmm7 =
xmm13[0],xmm1[1,2,3]
>>>>>   vinsertps    $416, %xmm1, %xmm4, %xmm14 # xmm14
>>>>> xmm4[0,1],xmm1[2],xmm4[3]
>>>>>   vinsertps    $416, %xmm13, %xmm6, %xmm13 # xmm13
>>>>> xmm6[0,1],xmm13[2],xmm6[3]
>>>>>   vinsertps    $416, %xmm0, %xmm7, %xmm0 # xmm0
>>>>> xmm7[0,1],xmm0[2],xmm7[3]
>>>>> 
>>>>> We'll continue to look into this and do additional
testing.
>>>>> 
>>>>> 
>>>>> 
>>>>> Interesting. Let me know if you get a test case. The
insertps code path
>>>>> was
>>>>> added recently though and has been much less well tested.
I'll start fuzz
>>>>> testing it and should hopefully uncover the bug.
>>>>> 
>>>>> 
>>>>> Here's two small test cases.  Hope they are of use.
>>>>> 
>>>>> Thanks,
>>>>> Rob.
>>>>> 
>>>>> ------
>>>>> define <4 x float> @test(<4 x float> %xyzw, <4 x float> %abcd) {
>>>>> %1 = extractelement <4 x float> %xyzw, i32 0
>>>>> %2 = insertelement <4 x float> undef, float %1, i32 0
>>>>> %3 = insertelement <4 x float> %2, float 0.000000e+00, i32 1
>>>>> %4 = shufflevector <4 x float> %3, <4 x float> %xyzw, <4 x i32> <i32 0, i32 1, i32 6, i32 undef>
>>>>> %5 = shufflevector <4 x float> %4, <4 x float> %abcd, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
>>>>> ret <4 x float> %5
>>>>> }
>>>>> 
>>>>> define <4 x float> @test2(<4 x float> %xyzw, <4 x float> %abcd) {
>>>>> %1 = shufflevector <4 x float> %xyzw, <4 x float> %abcd, <4 x i32> <i32 0, i32 undef, i32 2, i32 4>
>>>>> %2 = shufflevector <4 x float> <float undef, float 0.000000e+00, float undef, float undef>, <4 x float> %1, <4 x i32> <i32 4, i32 1, i32 6, i32 7>
>>>>> ret <4 x float> %2
>>>>> }
>>>>> 
>>>>> 
>>>>> llc -march=x86-64 -mattr=+avx test.ll -o -
>>>>> 
>>>>> test:                                   # @test
>>>>>   vxorps    %xmm2, %xmm2, %xmm2
>>>>>   vmovss    %xmm0, %xmm2, %xmm2
>>>>>   vblendps    $4, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>>>>>   vinsertps    $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>>>   retl
>>>>> 
>>>>> test2:                                  # @test2
>>>>>   vinsertps    $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>>>   vxorps    %xmm1, %xmm1, %xmm1
>>>>>   vblendps    $13, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>>>>>   retl
>>>>> 
>>>>> llc -march=x86-64 -mattr=+avx -x86-experimental-vector-shuffle-lowering test.ll -o -
>>>>> 
>>>>> test:                                   # @test
>>>>>   vinsertps    $270, %xmm0, %xmm0, %xmm2 # xmm2 = xmm0[0],zero,zero,zero
>>>>>   vinsertps    $416, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>>>>>   vinsertps    $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>>>   retl
>>>>> 
>>>>> test2:                                  # @test2
>>>>>   vinsertps    $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>>>   vxorps    %xmm1, %xmm1, %xmm1
>>>>>   vinsertps    $336, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>>>>>   retl
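
For readers following the output above: insertps takes an 8-bit immediate, with bits 7:6 selecting the source element, bits 5:4 the destination slot, and bits 3:0 a zero mask. The following is a minimal Python sketch (the function name is hypothetical and not part of LLVM) that splits the printed immediates into those fields. Every immediate printed by the experimental path is above 255, and the low 8 bits decode to fields that agree with the accompanying comments. This only describes the printed values; it is not a diagnosis of the underlying bug.

```python
def decode_insertps_imm(imm):
    """Split an insertps immediate into its three fields.

    Returns (fits_in_8_bits, source_element, dest_slot, zero_mask).
    The fields are always read from the low 8 bits of the value.
    """
    fits = 0 <= imm <= 0xFF
    low = imm & 0xFF
    source_element = (low >> 6) & 0x3  # bits 7:6
    dest_slot = (low >> 4) & 0x3       # bits 5:4
    zero_mask = low & 0xF              # bits 3:0
    return fits, source_element, dest_slot, zero_mask


# Immediates taken from the assembly quoted in this thread.
for imm in (48, 256, 270, 304, 336, 416):
    fits, src, dst, zmask = decode_insertps_imm(imm)
    print(f"${imm}: fits={fits} src={src} dst={dst} zmask={zmask:04b}")
```

For instance, $304 and $48 share the same low 8 bits (source element 0 into destination slot 3), which is consistent with both being annotated as xmm0[0,1,2],xmm1[0].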
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: base-perf-O3.txt
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140909/268479be/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: base-perf-Os.txt
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140909/268479be/attachment-0001.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: new-perf-O3.txt
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140909/268479be/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: new-perf-Os.txt
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140909/268479be/attachment-0003.txt>

Quentin Colombet

2014-Sep-09 22:01 UTC

[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Hi Chandler,

Here is a test case for the biggest offender (oourafft.c).
To reproduce:
llc -mcpu=core-avx-i -x86-experimental-vector-shuffle-lowering=true repro.ll
llc -mcpu=core-avx-i -x86-experimental-vector-shuffle-lowering=false repro.ll

The main problem is that we miss:
	vmovsd	(%rdi,%rcx,8), %xmm2
	vmovlhps	%xmm2, %xmm2, %xmm2 ## xmm2 = xmm2[0,0]
=>
	vmovddup	(%rdi,%rcx,8), %xmm2

I do not know how problematic that is (it seems we recover the performance
with just the previous transformation), but we also miss:
	vsubpd	%xmm1, %xmm0, %xmm2
	vaddpd	%xmm1, %xmm0, %xmm0
	vshufpd	$2, %xmm0, %xmm2, %xmm0 ## xmm0 = xmm2[0],xmm0[1]
=>
	vaddsubpd	%xmm1, %xmm0, %xmm0
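
As a sanity check on the two rewrites above, here is a small Python model of the instructions involved. It is only a sketch: 128-bit registers are modeled as two-element lists of doubles, the helper names are made up for illustration, and the semantics are written from the instruction definitions as I understand them (the load form of vmovsd zeroes the upper lane; vshufpd picks lane 0 from its first source and lane 1 from its second, one immediate bit each).

```python
def movsd_load(value):
    # Load form of vmovsd: the value lands in lane 0, lane 1 is zeroed.
    return [value, 0.0]


def movlhps(a, b):
    # Lane 0 keeps the low half of a, lane 1 receives the low half of b.
    return [a[0], b[0]]


def movddup_load(value):
    # vmovddup from memory: one double duplicated into both lanes.
    return [value, value]


def subpd(a, b):
    return [a[0] - b[0], a[1] - b[1]]


def addpd(a, b):
    return [a[0] + b[0], a[1] + b[1]]


def shufpd(src1, src2, imm):
    # Bit 0 selects the element of src1 for lane 0,
    # bit 1 selects the element of src2 for lane 1.
    return [src1[imm & 1], src2[(imm >> 1) & 1]]


def addsubpd(a, b):
    # Subtract in lane 0, add in lane 1.
    return [a[0] - b[0], a[1] + b[1]]


x = movsd_load(3.5)
print(movlhps(x, x) == movddup_load(3.5))

a, b = [1.5, 2.0], [0.5, 4.0]
print(shufpd(subpd(a, b), addpd(a, b), 2) == addsubpd(a, b))
```

In the AT&T operand order used above, `vshufpd $2, %xmm0, %xmm2, %xmm0` has the subtraction result (%xmm2) as the first source and the addition result (%xmm0) as the second, which is what the last line models.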

I’ll look into the other regressions.

Thanks,
-Quentin

> On Sep 9, 2014, at 1:59 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> 
> 
>> On Sep 9, 2014, at 1:47 PM, Sean Silva <chisophugis at gmail.com> wrote:
>> 
>> 
>> 
>> On Tue, Sep 9, 2014 at 12:53 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>> Hi Chandler,
>> 
>> I had observed some improvements and regressions with the new lowering.
>> 
>> Here are the numbers for an Ivy Bridge machine fixed at 2900MHz.
>> 
>> I’ll look into the regressions to provide test cases.
>> 
>> ** Numbers **
>> 
>> Smaller is better. Only reported tests that run for at least one second.
>> Reference is the default lowering, Test is the new lowering.
>> The Os numbers are overall neutral, but the O3 numbers mainly expose
>> regressions.
>> 
>> Note: I can attach the raw numbers if you want.
>> 
>> That would be great. Please do.
> 
> Alright, here they are :).
> 
> base-perf-Ox.txt: runtime for the default lowering.
> new-perf-Ox.txt: runtime for the new lowering.
> 
> Each line in those files has the following format:
> <unit> <benchmark> <perf number>
> 
> The units are:
> - min: Minimum of the 7 runs.
> - max: Maximum of the 7 runs.
> - avg: Average of the 7 runs.
> - total: Total of the 7 runs.
> - med: Median of the 7 runs.
> - SD: Standard deviation of the 7 runs.
> - SD%: Standard deviation of the 7 runs, as a percentage.
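
A minimal Python sketch of reading files in the `<unit> <benchmark> <perf number>` format described above and comparing the two lowerings. It assumes fields are whitespace-separated and that benchmark names contain no spaces. Which unit the summary tables were built from is not stated in the thread, so the `unit` parameter below is a guess left to the caller. The function names and sample data are hypothetical.

```python
def parse_perf(text):
    """Parse '<unit> <benchmark> <perf number>' lines.

    Returns {benchmark: {unit: value}}. Lines that do not have exactly
    three whitespace-separated fields are skipped.
    """
    results = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        unit, benchmark, value = parts
        results.setdefault(benchmark, {})[unit] = float(value)
    return results


def expansion_by_benchmark(base, new, unit="min"):
    """Ratio new/base for one unit; values above 1.0 mean the new run is slower."""
    return {
        name: new[name][unit] / base[name][unit]
        for name in base
        if name in new and unit in base[name] and unit in new[name]
    }


# Made-up sample data, only to show the call pattern.
base = parse_perf("min some/bench 2.0\nmed some/bench 2.5\n")
new = parse_perf("min some/bench 3.0\nmed some/bench 2.5\n")
print(expansion_by_benchmark(base, new, unit="min"))
```

With the real attachments, `text` would be the contents of base-perf-O3.txt and new-perf-O3.txt (or the Os pair).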
> 
> -Quentin
> <base-perf-O3.txt>
> <base-perf-Os.txt>
> <new-perf-O3.txt>
> <new-perf-Os.txt>
> 
>> 
>> -- Sean Silva
>>  
>> 
>> * Os *
>> Benchmark_ID                             Reference      Test  Expansion  Percent
>> --------------------------------------------------------------------------------
>> External/Nurbs/nurbs                        2.3302    2.3122       0.99      -1%
>> External/SPEC/CFP2000/183.equake/183.eq     3.2606    3.2419       0.99      -1%
>> External/SPEC/CFP2006/447.dealII/447.de    16.4638   16.1313       0.98      -2%
>> External/SPEC/CFP2006/470.lbm/470.lbm       2.0159    1.9931       0.99      -1%
>> External/SPEC/CINT2000/164.gzip/164.gzi     8.7611    8.6981       0.99      -1%
>> External/SPEC/CINT2006/456.hmmer/456.hm     2.5674    2.5819       1.01      +1%
>> External/SPEC/CINT2006/462.libquantum/4     1.2924     1.347       1.04      +4%
>> MultiSource/Benchmarks/TSVC/CrossingThr     2.4703    2.4852       1.01      +1%
>> MultiSource/Benchmarks/TSVC/LoopRerolli     2.6611    2.5668       0.96      -4%
>> MultiSource/Benchmarks/mafft/pairlocala     24.676   24.5372       0.99      -1%
>> SingleSource/Benchmarks/Adobe-C++/simpl     1.0579    1.1048       1.04      +4%
>> SingleSource/Benchmarks/Linpack/linpack     4.2817    4.3298       1.01      +1%
>> SingleSource/Benchmarks/Misc-C++/stepan     4.1821     4.226       1.01      +1%
>> SingleSource/Benchmarks/Misc/oourafft       3.0305    3.1777       1.05      +5%
>> --------------------------------------------------------------------------------
>> Min (14)                                         -         -       0.96        -
>> Max (14)                                         -         -       1.05        -
>> Sum (14)                                        79        79          1      +0%
>> A.Mean (14)                                      -         -       1.01      +1%
>> G.Mean 2 (14)                                    -         -       1.01      +1%
>> --------------------------------------------------------------------------------
>> 
>> * O3 *
>> Benchmark_ID                             Reference      Test  Expansion  Percent
>> --------------------------------------------------------------------------------
>> External/Nurbs/nurbs                        2.2322    2.2131       0.99      -1%
>> External/Povray/povray                      2.2638    2.2762       1.01      +1%
>> External/SPEC/CFP2000/177.mesa/177.mesa     1.6675    1.6828       1.01      +1%
>> External/SPEC/CFP2000/188.ammp/188.ammp    10.9309   11.1191       1.02      +2%
>> External/SPEC/CFP2006/433.milc/433.milc     6.9214    7.1696       1.04      +4%
>> External/SPEC/CINT2000/164.gzip/164.gzi     8.5327    8.8114       1.03      +3%
>> External/SPEC/CINT2000/186.crafty/186.c     4.1266      4.16       1.01      +1%
>> External/SPEC/CINT2000/253.perlbmk/253.     5.6991    5.7309       1.01      +1%
>> External/SPEC/CINT2000/256.bzip2/256.bz     6.7917    6.8763       1.01      +1%
>> External/SPEC/CINT2006/400.perlbench/40      6.243    6.1464       0.98      -2%
>> External/SPEC/CINT2006/401.bzip2/401.bz      2.095    2.0588       0.98      -2%
>> External/SPEC/CINT2006/462.libquantum/4        1.2    1.2108       1.01      +1%
>> MultiSource/Applications/SIBsim4/SIBsim     2.4547    2.5129       1.02      +2%
>> MultiSource/Benchmarks/Bullet/bullet        4.1687    4.0882       0.98      -2%
>> MultiSource/Benchmarks/TSVC/LinearDepen     3.0389    3.0566       1.01      +1%
>> MultiSource/Benchmarks/TSVC/LinearDepen     2.1298    2.1997       1.03      +3%
>> MultiSource/Benchmarks/TSVC/LoopRerolli     2.6458    2.5552       0.97      -3%
>> MultiSource/Benchmarks/TSVC/Symbolics-f     1.6243    1.6612       1.02      +2%
>> MultiSource/Benchmarks/mafft/pairlocala    23.8979   24.0547       1.01      +1%
>> SingleSource/Benchmarks/Misc/oourafft       3.0374    3.1846       1.05      +5%
>> SingleSource/Benchmarks/SmallPT/smallpt     6.5533    6.6683       1.02      +2%
>> --------------------------------------------------------------------------------
>> Min (21)                                         -         -       0.97        -
>> Max (21)                                         -         -       1.05        -
>> Sum (21)                                       108       109       1.01      -1%
>> A.Mean (21)                                      -         -       1.01      +1%
>> G.Mean 2 (21)                                    -         -       1.01      +1%
>> --------------------------------------------------------------------------------
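
The Expansion and Percent columns appear to be Test divided by Reference and the same ratio expressed as a rounded percentage change. A short Python sketch under that assumption follows. The helper names are illustrative, and how the tool defines "G.Mean 2" is not stated in the thread, so the plain geometric mean below is only one plausible reading.

```python
import math


def expansion(reference, test):
    # Test time divided by Reference time; above 1.0 is a slowdown.
    return test / reference


def percent(reference, test):
    # Relative change as a whole-number percentage.
    return round((test / reference - 1.0) * 100)


def geometric_mean(ratios):
    # Plain geometric mean of a non-empty list of positive ratios.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Two rows from the O3 table above: (Reference, Test).
rows = {
    "SingleSource/Benchmarks/Misc/oourafft": (3.0374, 3.1846),
    "MultiSource/Benchmarks/TSVC/LoopRerolli": (2.6458, 2.5552),
}
for name, (ref, test) in rows.items():
    print(name, round(expansion(ref, test), 2), f"{percent(ref, test):+d}%")
print(round(geometric_mean([expansion(r, t) for r, t in rows.values()]), 2))
```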
>> 
>> Thanks,
>> -Quentin
>> 
>>> On Sep 9, 2014, at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>> 
>>> Hi Chandler,
>>> 
>>> Thanks for fixing the problem with the insertps mask.
>>> 
>>> Generally the new shuffle lowering looks promising; however, there are
>>> some cases where the codegen is now worse, causing runtime performance
>>> regressions in parts of our internal codebase.
>>> 
>>> You have already mentioned how the new shuffle lowering is missing
>>> some features; for example, you explicitly said that we currently lack
>>> SSE4.1 blend support. Unfortunately, this seems to be one of the
>>> main reasons for the slowdown we are seeing.
>>> 
>>> Here is a list of what we found so far that we think is causing most
>>> of the slowdown:
>>> 1) shufps is always emitted in cases where we could emit a single
>>> blendps; in these cases, blendps is preferable because it has better
>>> reciprocal throughput (this is true on all modern Intel and AMD CPUs).
>>> 
>>> Things get worse when it comes to lowering shuffles where the shuffle
>>> mask indices refer to elements from both input vectors in each lane.
>>> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
>>> a single blendps; instead it gets lowered into two shufps
>>> instructions.
>>> 
>>> Example:
>>> ;;;
>>> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>>>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>>>  ret <4 x float> %1
>>> }
>>> ;;;
>>> 
>>> llc (-mcpu=corei7-avx):
>>>  vblendps  $10, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]
>>> 
>>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>>  vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
>>>  vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0 = xmm0[0,2,1,3]
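
To make the "this mask is really a blend" condition from point 1 concrete, here is a hedged Python sketch. It treats a two-input shuffle mask as a blend when every element i comes from position i of either input, and builds the immediate with bit i set when the element comes from the second input. The function name is hypothetical and this is not the code used in the X86 backend.

```python
def blend_immediate(mask):
    """Return the blend immediate for a two-input shuffle mask, or None.

    mask[i] == i      -> element i is taken from the first input
    mask[i] == i + n  -> element i is taken from the second input (bit i set)
    mask[i] is None   -> undefined element, treated as 'first input'
    Any other index means the shuffle moves elements between positions,
    so it cannot be expressed as a single blend.
    """
    n = len(mask)
    imm = 0
    for i, idx in enumerate(mask):
        if idx is None or idx == i:
            continue
        if idx == i + n:
            imm |= 1 << i
        else:
            return None
    return imm


print(blend_immediate([0, 5, 2, 7]))  # mask from @foo above
print(blend_immediate([0, 4, 1, 5]))  # an interleave, not a blend
```

For the <0,5,2,7> mask this yields 10, the same immediate the default lowering prints in its vblendps above.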
>>> 
>>> 
>>> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
>>> mask identifies a blend. At the moment the new lowering logic is very
>>> aggressively emitting insertps instead of cheaper blendps.
>>> 
>>> Example:
>>> ;;;
>>> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>>>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 5, i32 2, i32 7>
>>>  ret <4 x float> %1
>>> }
>>> ;;;
>>> 
>>> llc (-mcpu=corei7-avx):
>>>  vblendps  $11, %xmm0, %xmm1, %xmm0   # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>>> 
>>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>>  vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>>> 
>>> 
>>> 3) When a shuffle performs an insert at index 0 we always generate an
>>> insertps, while a movss would do a better job.
>>> ;;;
>>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>>>  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>>  ret <4 x float> %1
>>> }
>>> ;;;
>>> 
>>> llc (-mcpu=corei7-avx):
>>>  vmovss %xmm1, %xmm0, %xmm0
>>> 
>>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>>  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
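
The preference order suggested by the three points above (movss for an insert at index 0, then blendps for any other blend, and only then something more general) can be sketched as a tiny classifier. This is an illustration of the reasoning in this message, not LLVM's lowering logic. The function name is invented, and a real implementation would also have to consider undef elements, available subtarget features, and zeroing.

```python
def classify_v4f32_shuffle(mask):
    """Pick a candidate instruction for a 4-element, two-input shuffle mask.

    Indices 0..3 refer to the first input, 4..7 to the second.
    """
    mask = list(mask)
    if mask == [4, 1, 2, 3]:
        # Only the lowest element is replaced by element 0 of the second input.
        return "movss"
    if all(idx in (i, i + 4) for i, idx in enumerate(mask)):
        # Every element stays in its own position: a plain blend.
        return "blendps"
    return "other"


for name, mask in (("foo", [0, 5, 2, 7]),
                   ("bar", [4, 5, 2, 7]),
                   ("baz", [4, 1, 2, 3])):
    print(name, classify_v4f32_shuffle(mask))
```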
>>> 
>>> I hope this is useful. We would be happy to contribute patches to
>>> improve some of the above cases, but we obviously know that this is
>>> still a work in progress, so we don't want to introduce conflicts with
>>> your work. Please let us know what you think.
>>> 
>>> We will keep looking at this and follow up with any further findings.
>>> 
>>> Thanks,
>>> Andrea Di Biagio
>>> SN Systems - Sony Computer Entertainment Inc.
>>> 
>>> On Mon, Sep 8, 2014 at 6:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>> Hi Chandler,
>>>> 
>>>> Forget about what I said.
>>>> It seems I have some weird dependencies in my build system.
>>>> My binaries are out-of-sync.
>>>> 
>>>> Let me sort that out; it is likely the problem is already fixed, and I can
>>>> resume the measurements.
>>>> 
>>>> Sorry for the noise.
>>>> 
>>>> Q.
>>>> 
>>>> On Sep 8, 2014, at 9:32 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>> 
>>>> 
>>>> On Sep 7, 2014, at 8:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>> 
>>>> Sure,
>>>> 
>>>> Here is the command line:
>>>> clang -cc1 -triple x86_64-apple-macosx -S -disable-free
>>>> -disable-llvm-verifier -main-file-name tmp.i -mrelocation-model pic
>>>> -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu
>>>> core-avx-i  -O3  -ferror-limit 19 -fmessage-length 114 -stack-protector 1
>>>> -mstackrealign -fblocks  -fencode-extended-block-signature
>>>> -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics
>>>> -vectorize-loops -vectorize-slp -mllvm
>>>> -x86-experimental-vector-shuffle-lowering=true -o tmp.s -x cpp-output tmp.i
>>>> 
>>>> This was with trunk 215249.
>>>> 
>>>> I meant, r217281.
>>>> 
>>>> 
>>>> Thanks,
>>>> -Quentin
>>>> 
>>>> <tmp.i>
>>>> 
>>>> On Sep 6, 2014, at 4:27 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>>>> 
>>>> I've run the SingleSource test suite for core-avx-i and have no failures
>>>> here, so a preprocessed file + command line would be very useful if this
>>>> still reproduces for you.
>>>> 
>>>> On Sat, Sep 6, 2014 at 4:07 PM, Chandler Carruth <chandlerc at gmail.com>
>>>> wrote:
>>>>> 
>>>>> I'm having trouble reproducing this. I'm trying to get LNT to actually
>>>>> run, but manually compiling the given source file didn't reproduce it for
>>>>> me.
>>>>> 
>>>>> It might have been fixed recently (although I'd be surprised if so), but
>>>>> it would help to get the actual command line for which compiling this file
>>>>> in the test suite failed.
>>>>> 
>>>>> -Chandler
>>>>> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: repro.ll
Type: application/octet-stream
Size: 2265 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140909/565be8f4/attachment.obj>
