thr3ads.net - llvm dev - [LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon! [Sep 2014]

Andrea Di Biagio

2014-Sep-09 13:13 UTC

[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Hi Chandler,

Thanks for fixing the problem with the insertps mask.

Generally, the new shuffle lowering looks promising; however, there are
some cases where the codegen is now worse, causing runtime performance
regressions in some of our internal codebases.

You have already mentioned that the new shuffle lowering is missing
some features; for example, you explicitly said that we currently lack
SSE4.1 blend support. Unfortunately, this seems to be one of the
main reasons for the slowdown we are seeing.

Here is a list of what we found so far that we think is causing most
of the slowdown:
1) shufps is always emitted in cases where we could emit a single
blendps; in these cases, blendps is preferable because it has better
reciprocal throughput (this is true on all modern Intel and AMD CPUs).

Things get worse when it comes to lowering shuffles where the shuffle
mask indices refer to elements from both input vectors in each lane.
For example, a shuffle mask of <0,5,2,7> could be easily lowered into
a single blendps; instead it gets lowered into two shufps
instructions.

Example:
;;;
define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
  ret <4 x float> %1
}
;;;

llc (-mcpu=corei7-avx):
  vblendps  $10, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]

llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
  vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
  vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0[0,2,1,3]


2) On SSE4.1, we should try not to emit an insertps if the shuffle
mask identifies a blend. At the moment, the new lowering logic very
aggressively emits insertps instead of the cheaper blendps.

Example:
;;;
define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 5, i32 2, i32 7>
  ret <4 x float> %1
}
;;;

llc (-mcpu=corei7-avx):
  vblendps  $11, %xmm0, %xmm1, %xmm0   # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]

llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
  vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]


3) When a shuffle performs an insert at index 0 we always generate an
insertps, while a movss would do a better job.
;;;
define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
  %1 = shufflevector <4 x float> %A, <4 x float> %B, <4 x i32> <i32 4, i32 1, i32 2, i32 3>
  ret <4 x float> %1
}
;;;

llc (-mcpu=corei7-avx):
  vmovss %xmm1, %xmm0, %xmm0

llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
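
The blend cases in items 1 and 2 above can be recognized directly from the
shuffle mask. The following is a minimal Python sketch, not code from LLVM:
the names blend_immediate and is_movss_pattern are made up for illustration,
and undef mask elements are not handled. It assumes the usual blendps
encoding, where bit i of the immediate selects lane i from the second source.

```python
from typing import Optional, Sequence


def blend_immediate(mask: Sequence[int]) -> Optional[int]:
    """Return the blendps immediate for a 4-element mask, or None.

    Mask indices 0-3 select from the first input (A), 4-7 from the second (B).
    The mask is a blend when every result lane i takes lane i of A or of B.
    """
    if len(mask) != 4:
        raise ValueError("expected a 4-element shuffle mask")
    imm = 0
    for lane, idx in enumerate(mask):
        if idx == lane:
            continue              # lane i comes from A: bit i stays clear
        if idx == lane + 4:
            imm |= 1 << lane      # lane i comes from B: set bit i
        else:
            return None           # element changes lane: not a pure blend
    return imm


def is_movss_pattern(mask: Sequence[int]) -> bool:
    """True when only lane 0 is replaced by lane 0 of B (the item 3 case)."""
    return list(mask) == [4, 1, 2, 3]


if __name__ == "__main__":
    print(blend_immediate([0, 5, 2, 7]))   # 10 (0b1010)
    print(blend_immediate([0, 2, 1, 3]))   # None: lanes are permuted
    print(is_movss_pattern([4, 1, 2, 3]))  # True
```

For the mask from item 1, <0,5,2,7>, this yields 10, the same immediate as in
the vblendps $10 emitted by the existing lowering.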

I hope this is useful. We would be happy to contribute patches to
improve some of the above cases, but we obviously know that this is
still a work in progress, so we don't want to introduce conflicts with
your work. Please let us know what you think.

We will keep looking at this and follow up with any further findings.

Thanks,
Andrea Di Biagio
SN Systems - Sony Computer Entertainment Inc.

On Mon, Sep 8, 2014 at 6:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> Hi Chandler,
>
> Forget about what I said.
> It seems I have some weird dependencies in my build system.
> My binaries are out of sync.
>
> Let me sort that out; the problem is likely already fixed, and I can
> resume the measurements.
>
> Sorry for the noise.
>
> Q.
>
> On Sep 8, 2014, at 9:32 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>
> On Sep 7, 2014, at 8:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
> Sure,
>
> Here is the command line:
> clang -cc1 -triple x86_64-apple-macosx -S -disable-free
> -disable-llvm-verifier -main-file-name tmp.i -mrelocation-model pic
> -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu
> core-avx-i  -O3  -ferror-limit 19 -fmessage-length 114 -stack-protector 1
> -mstackrealign -fblocks  -fencode-extended-block-signature
> -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics
> -vectorize-loops -vectorize-slp -mllvm
> -x86-experimental-vector-shuffle-lowering=true -o tmp.s -x cpp-output tmp.i
>
> This was with trunk 215249.
>
> I meant, r217281.
>
>
> Thanks,
> -Quentin
>
> <tmp.i>
>
> On Sep 6, 2014, at 4:27 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> I've run the SingleSource test suite for core-avx-i and have no failures
> here so a preprocessed file + commandline would be very useful if this
> reproduces for you still.
>
> On Sat, Sep 6, 2014 at 4:07 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>>
>> I'm having trouble reproducing this. I'm trying to get LNT to actually
>> run, but manually compiling the given source file didn't reproduce it for
>> me.
>>
>> It might have been fixed recently (although I'd be surprised if so), but
>> it would help to get the actual command line for which compiling this file
>> in the test suite failed.
>>
>> -Chandler
>>
>> On Fri, Sep 5, 2014 at 4:36 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>
>>> Hi Chandler,
>>>
>>> While doing the performance measurements on an Ivy Bridge, I ran into
>>> compile-time errors.
>>>
>>> I saw a bunch of "cannot select" errors in the LLVM test suite with
>>> -march=core-avx-i.
>>> E.g., SingleSource/UnitTests/Vector/SSE/sse.isamax.c is failing at O3
>>> -march=core-avx-i with:
>>> fatal error: error in backend: Cannot select: 0x7f91b99a6420: v4i32 = bitcast 0x7f91b99b0e10 [ORD=3] [ID=27]
>>>   0x7f91b99b0e10: v4i64 = insert_subvector 0x7f91b99a7210, 0x7f91b99a6d68, 0x7f91b99ace70 [ORD=2] [ID=25]
>>>     0x7f91b99a7210: v4i64 = undef [ID=15]
>>>     0x7f91b99a6d68: v2i64 = scalar_to_vector 0x7f91b99ab840 [ORD=2] [ID=23]
>>>       0x7f91b99ab840: i64 = AssertZext 0x7f91b99acc60, 0x7f91b99ac738 [ORD=2] [ID=20]
>>>         0x7f91b99acc60: i64,ch = CopyFromReg 0x7f91b8d52820, 0x7f91b99a3a10 [ORD=2] [ID=16]
>>>           0x7f91b99a3a10: i64 = Register %vreg68 [ID=1]
>>>     0x7f91b99ace70: i64 = Constant<0> [ID=3]
>>> In function: isamax0
>>> clang: error: clang frontend command failed with exit code 70 (use -v to
>>> see invocation)
>>> clang version 3.6.0 (215249)
>>> Target: x86_64-apple-darwin14.0.0
>>>
>>> For some reason, I cannot reproduce the problem with the test case that
>>> clang gives me using -emit-llvm. Since the source is public, I guess you can
>>> try to reproduce on your side.
>>> Indeed, if you run the test-suite with -march=core-avx-i you’ll likely
>>> see all those failures.
>>>
>>> Let me know if you cannot and I’ll try harder to produce a test case.
>>>
>>> Note: This is the same failure all over the place, i.e., cannot select a
>>> bitcast from various types to v4i32 or v4i64.
>>>
>>> Thanks,
>>> -Quentin
>>>
>>>
>>> On Sep 5, 2014, at 11:09 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>>>
>>> Hi Chandler,
>>>
>>> On 5 September 2014 17:38, Chandler Carruth <chandlerc at gmail.com> wrote:
>>>
>>>
>>> On Fri, Sep 5, 2014 at 9:32 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>>>
>>>
>>> Unfortunately, another team, while doing internal testing has seen the
>>> new path generating illegal insertps masks.  A sample here:
>>>
>>>    vinsertps    $256, %xmm0, %xmm13, %xmm4 # xmm4 = xmm0[0],xmm13[1,2,3]
>>>    vinsertps    $256, %xmm1, %xmm0, %xmm6 # xmm6 = xmm1[0],xmm0[1,2,3]
>>>    vinsertps    $256, %xmm13, %xmm1, %xmm7 # xmm7 = xmm13[0],xmm1[1,2,3]
>>>    vinsertps    $416, %xmm1, %xmm4, %xmm14 # xmm14 = xmm4[0,1],xmm1[2],xmm4[3]
>>>    vinsertps    $416, %xmm13, %xmm6, %xmm13 # xmm13 = xmm6[0,1],xmm13[2],xmm6[3]
>>>    vinsertps    $416, %xmm0, %xmm7, %xmm0 # xmm0 = xmm7[0,1],xmm0[2],xmm7[3]
>>>
>>> We'll continue to look into this and do additional testing.
>>>
>>>
>>>
>>> Interesting. Let me know if you get a test case. The insertps code path was
>>> added recently though and has been much less well tested. I'll start fuzz
>>> testing it and should hopefully uncover the bug.
>>>
>>>
>>> Here's two small test cases.  Hope they are of use.
>>>
>>> Thanks,
>>> Rob.
>>>
>>> ------
>>> define <4 x float> @test(<4 x float> %xyzw, <4 x float> %abcd) {
>>>  %1 = extractelement <4 x float> %xyzw, i32 0
>>>  %2 = insertelement <4 x float> undef, float %1, i32 0
>>>  %3 = insertelement <4 x float> %2, float 0.000000e+00, i32 1
>>>  %4 = shufflevector <4 x float> %3, <4 x float> %xyzw, <4 x i32> <i32 0, i32 1, i32 6, i32 undef>
>>>  %5 = shufflevector <4 x float> %4, <4 x float> %abcd, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
>>>  ret <4 x float> %5
>>> }
>>>
>>> define <4 x float> @test2(<4 x float> %xyzw, <4 x float> %abcd) {
>>>  %1 = shufflevector <4 x float> %xyzw, <4 x float> %abcd, <4 x i32> <i32 0, i32 undef, i32 2, i32 4>
>>>  %2 = shufflevector <4 x float> <float undef, float 0.000000e+00, float undef, float undef>, <4 x float> %1, <4 x i32> <i32 4, i32 1, i32 6, i32 7>
>>>  ret <4 x float> %2
>>> }
>>>
>>>
>>> llc -march=x86-64 -mattr=+avx test.ll -o -
>>>
>>> test:                                   # @test
>>>    vxorps    %xmm2, %xmm2, %xmm2
>>>    vmovss    %xmm0, %xmm2, %xmm2
>>>    vblendps    $4, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>>>    vinsertps    $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>    retl
>>>
>>> test2:                                  # @test2
>>>    vinsertps    $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>    vxorps    %xmm1, %xmm1, %xmm1
>>>    vblendps    $13, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>>>    retl
>>>
>>> llc -march=x86-64 -mattr=+avx
>>> -x86-experimental-vector-shuffle-lowering test.ll -o -
>>>
>>> test:                                   # @test
>>>    vinsertps    $270, %xmm0, %xmm0, %xmm2 # xmm2 = xmm0[0],zero,zero,zero
>>>    vinsertps    $416, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>>>    vinsertps    $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>    retl
>>>
>>> test2:                                  # @test2
>>>    vinsertps    $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>>>    vxorps    %xmm1, %xmm1, %xmm1
>>>    vinsertps    $336, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>>>    retl
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
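
The insertps immediates reported as illegal in the quoted messages ($256,
$416, $270, $304, $336) are all larger than 255, so they cannot be encoded in
the instruction's 8-bit immediate. As a rough sketch (the helper names are
hypothetical, and this is not how llc prints or validates operands), the
immediate can be decoded in Python as follows, assuming the documented
insertps layout: bits 7:6 select the source element, bits 5:4 the destination
lane, and bits 3:0 form the zero mask.

```python
from typing import Tuple


def fits_imm8(value: int) -> bool:
    """True if value is representable as an 8-bit immediate.

    Both the signed (-128..127) and unsigned (0..255) spellings are accepted,
    since the asm in this thread prints some immediates as negative numbers.
    """
    return -128 <= value <= 255


def decode_insertps(imm: int) -> Tuple[int, int, int]:
    """Split an insertps immediate into (source_elt, dest_lane, zero_mask)."""
    if not fits_imm8(imm):
        raise ValueError(f"{imm} does not fit in an 8-bit immediate")
    imm &= 0xFF                      # normalize e.g. -96 to 0xA0
    source_elt = (imm >> 6) & 0x3    # bits 7:6
    dest_lane = (imm >> 4) & 0x3     # bits 5:4
    zero_mask = imm & 0xF            # bits 3:0, one bit per zeroed lane
    return source_elt, dest_lane, zero_mask


if __name__ == "__main__":
    print(decode_insertps(48))    # (0, 3, 0): source elt 0 into lane 3
    print(decode_insertps(-96))   # (2, 2, 0): source elt 2 into lane 2
    for bad in (256, 416, 270, 304, 336):
        print(bad, fits_imm8(bad), decode_insertps(bad & 0xFF))
```

Masking the out-of-range values down to 8 bits gives fields that agree with
the asm comments above; for example, 416 & 0xFF is 0xA0, which decodes to
source element 2 inserted into lane 2.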

Quentin Colombet

2014-Sep-09 19:53 UTC

Hi Chandler,

I have observed some improvements and regressions with the new lowering.

Here are the numbers for an Ivy Bridge machine fixed at 2900MHz.

I’ll look into the regressions to provide test cases.

** Numbers **

Smaller is better. Only tests that run for at least one second are reported.
Reference is the default lowering, Test is the new lowering.
The Os numbers are overall neutral, but the O3 numbers mainly expose
regressions.

Note: I can attach the raw numbers if you want.

* Os *
Benchmark_ID                            Reference      Test  Expansion  Percent
-------------------------------------------------------------------------------
External/Nurbs/nurbs                       2.3302    2.3122       0.99      -1%
External/SPEC/CFP2000/183.equake/183.eq    3.2606    3.2419       0.99      -1%
External/SPEC/CFP2006/447.dealII/447.de   16.4638   16.1313       0.98      -2%
External/SPEC/CFP2006/470.lbm/470.lbm      2.0159    1.9931       0.99      -1%
External/SPEC/CINT2000/164.gzip/164.gzi    8.7611    8.6981       0.99      -1%
External/SPEC/CINT2006/456.hmmer/456.hm    2.5674    2.5819       1.01      +1%
External/SPEC/CINT2006/462.libquantum/4    1.2924     1.347       1.04      +4%
MultiSource/Benchmarks/TSVC/CrossingThr    2.4703    2.4852       1.01      +1%
MultiSource/Benchmarks/TSVC/LoopRerolli    2.6611    2.5668       0.96      -4%
MultiSource/Benchmarks/mafft/pairlocala    24.676   24.5372       0.99      -1%
SingleSource/Benchmarks/Adobe-C++/simpl    1.0579    1.1048       1.04      +4%
SingleSource/Benchmarks/Linpack/linpack    4.2817    4.3298       1.01      +1%
SingleSource/Benchmarks/Misc-C++/stepan    4.1821     4.226       1.01      +1%
SingleSource/Benchmarks/Misc/oourafft      3.0305    3.1777       1.05      +5%
-------------------------------------------------------------------------------
Min (14)                                        -         -       0.96        -
-------------------------------------------------------------------------------
Max (14)                                        -         -       1.05        -
-------------------------------------------------------------------------------
Sum (14)                                       79        79          1      +0%
-------------------------------------------------------------------------------
A.Mean (14)                                     -         -       1.01      +1%
-------------------------------------------------------------------------------
G.Mean 2 (14)                                   -         -       1.01      +1%
-------------------------------------------------------------------------------

* O3 *
Benchmark_ID                            Reference      Test  Expansion  Percent
-------------------------------------------------------------------------------
External/Nurbs/nurbs                       2.2322    2.2131       0.99      -1%
External/Povray/povray                     2.2638    2.2762       1.01      +1%
External/SPEC/CFP2000/177.mesa/177.mesa    1.6675    1.6828       1.01      +1%
External/SPEC/CFP2000/188.ammp/188.ammp   10.9309   11.1191       1.02      +2%
External/SPEC/CFP2006/433.milc/433.milc    6.9214    7.1696       1.04      +4%
External/SPEC/CINT2000/164.gzip/164.gzi    8.5327    8.8114       1.03      +3%
External/SPEC/CINT2000/186.crafty/186.c    4.1266      4.16       1.01      +1%
External/SPEC/CINT2000/253.perlbmk/253.    5.6991    5.7309       1.01      +1%
External/SPEC/CINT2000/256.bzip2/256.bz    6.7917    6.8763       1.01      +1%
External/SPEC/CINT2006/400.perlbench/40     6.243    6.1464       0.98      -2%
External/SPEC/CINT2006/401.bzip2/401.bz     2.095    2.0588       0.98      -2%
External/SPEC/CINT2006/462.libquantum/4       1.2    1.2108       1.01      +1%
MultiSource/Applications/SIBsim4/SIBsim    2.4547    2.5129       1.02      +2%
MultiSource/Benchmarks/Bullet/bullet       4.1687    4.0882       0.98      -2%
MultiSource/Benchmarks/TSVC/LinearDepen    3.0389    3.0566       1.01      +1%
MultiSource/Benchmarks/TSVC/LinearDepen    2.1298    2.1997       1.03      +3%
MultiSource/Benchmarks/TSVC/LoopRerolli    2.6458    2.5552       0.97      -3%
MultiSource/Benchmarks/TSVC/Symbolics-f    1.6243    1.6612       1.02      +2%
MultiSource/Benchmarks/mafft/pairlocala   23.8979   24.0547       1.01      +1%
SingleSource/Benchmarks/Misc/oourafft      3.0374    3.1846       1.05      +5%
SingleSource/Benchmarks/SmallPT/smallpt    6.5533    6.6683       1.02      +2%
-------------------------------------------------------------------------------
Min (21)                                        -         -       0.97        -
-------------------------------------------------------------------------------
Max (21)                                        -         -       1.05        -
-------------------------------------------------------------------------------
Sum (21)                                      108       109       1.01      -1%
-------------------------------------------------------------------------------
A.Mean (21)                                     -         -       1.01      +1%
-------------------------------------------------------------------------------
G.Mean 2 (21)                                   -         -       1.01      +1%
-------------------------------------------------------------------------------
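
For reference, the Expansion and Percent columns appear to be the
Test/Reference ratio and its rounded percentage change. The sketch below
recomputes them in Python for four rows copied from the Os table; the function
names are illustrative, and the geometric mean shown is the standard
definition, which may differ from however the reporting tool computes its
"G.Mean 2" row.

```python
import math

# (reference_seconds, test_seconds), copied from four rows of the Os table.
ROWS = {
    "External/Nurbs/nurbs": (2.3302, 2.3122),
    "External/SPEC/CINT2006/462.libquantum/4": (1.2924, 1.347),
    "MultiSource/Benchmarks/TSVC/LoopRerolli": (2.6611, 2.5668),
    "SingleSource/Benchmarks/Misc/oourafft": (3.0305, 3.1777),
}


def expansion(reference: float, test: float) -> float:
    """Ratio of new to old run time; below 1.0 means the new lowering is faster."""
    return test / reference


def percent_change(reference: float, test: float) -> int:
    """Run-time change in whole percent; positive means a regression."""
    return round((test / reference - 1.0) * 100)


def geometric_mean(values) -> float:
    """Standard geometric mean, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    for name, (ref, test) in ROWS.items():
        print(f"{name:<40}{expansion(ref, test):6.2f}{percent_change(ref, test):+5d}%")
    ratios = [expansion(ref, test) for ref, test in ROWS.values()]
    print(f"geometric mean of the four ratios: {geometric_mean(ratios):.3f}")
```

Because smaller is better, a ratio above 1.0 (a positive percentage) marks a
benchmark that got slower with the new lowering.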

Thanks,
-Quentin

Sean Silva

2014-Sep-09 20:47 UTC

On Tue, Sep 9, 2014 at 12:53 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> Hi Chandler,
>
> I had observed some improvements and regressions with the new lowering.
>
> Here are the numbers for an Ivy Bridge machine fixed at 2900MHz.
>
> I’ll look into the regressions to provide test cases.
>
> ** Numbers **
>
> Smaller is better. Only reported tests that run for at least one second.
> Reference is the default lowering, Test is the new lowering.
> The Os numbers are overall neutral, but the O3 numbers mainly expose
> regressions.
>
> Note: I can attach the raw numbers if you want.
>
That would be great. Please do.

-- Sean Silva

>
> * Os *
> Benchmark_ID                            Reference      Test Expansion Percent
> -----------------------------------------------------------------------------
> External/Nurbs/nurbs                       2.3302    2.3122      0.99     -1%
> External/SPEC/CFP2000/183.equake/183.eq    3.2606    3.2419      0.99     -1%
> External/SPEC/CFP2006/447.dealII/447.de   16.4638   16.1313      0.98     -2%
> External/SPEC/CFP2006/470.lbm/470.lbm      2.0159    1.9931      0.99     -1%
> External/SPEC/CINT2000/164.gzip/164.gzi    8.7611    8.6981      0.99     -1%
> External/SPEC/CINT2006/456.hmmer/456.hm    2.5674    2.5819      1.01     +1%
> External/SPEC/CINT2006/462.libquantum/4    1.2924     1.347      1.04     +4%
> MultiSource/Benchmarks/TSVC/CrossingThr    2.4703    2.4852      1.01     +1%
> MultiSource/Benchmarks/TSVC/LoopRerolli    2.6611    2.5668      0.96     -4%
> MultiSource/Benchmarks/mafft/pairlocala    24.676   24.5372      0.99     -1%
> SingleSource/Benchmarks/Adobe-C++/simpl    1.0579    1.1048      1.04     +4%
> SingleSource/Benchmarks/Linpack/linpack    4.2817    4.3298      1.01     +1%
> SingleSource/Benchmarks/Misc-C++/stepan    4.1821     4.226      1.01     +1%
> SingleSource/Benchmarks/Misc/oourafft      3.0305    3.1777      1.05     +5%
> -----------------------------------------------------------------------------
> Min (14)                                        -         -      0.96       -
> Max (14)                                        -         -      1.05       -
> Sum (14)                                       79        79         1     +0%
> A.Mean (14)                                     -         -      1.01     +1%
> G.Mean 2 (14)                                   -         -      1.01     +1%
> -----------------------------------------------------------------------------
>
> * O3 *
> Benchmark_ID                            Reference      Test Expansion Percent
> -----------------------------------------------------------------------------
> External/Nurbs/nurbs                       2.2322    2.2131      0.99     -1%
> External/Povray/povray                     2.2638    2.2762      1.01     +1%
> External/SPEC/CFP2000/177.mesa/177.mesa    1.6675    1.6828      1.01     +1%
> External/SPEC/CFP2000/188.ammp/188.ammp   10.9309   11.1191      1.02     +2%
> External/SPEC/CFP2006/433.milc/433.milc    6.9214    7.1696      1.04     +4%
> External/SPEC/CINT2000/164.gzip/164.gzi    8.5327    8.8114      1.03     +3%
> External/SPEC/CINT2000/186.crafty/186.c    4.1266      4.16      1.01     +1%
> External/SPEC/CINT2000/253.perlbmk/253.    5.6991    5.7309      1.01     +1%
> External/SPEC/CINT2000/256.bzip2/256.bz    6.7917    6.8763      1.01     +1%
> External/SPEC/CINT2006/400.perlbench/40     6.243    6.1464      0.98     -2%
> External/SPEC/CINT2006/401.bzip2/401.bz     2.095    2.0588      0.98     -2%
> External/SPEC/CINT2006/462.libquantum/4       1.2    1.2108      1.01     +1%
> MultiSource/Applications/SIBsim4/SIBsim    2.4547    2.5129      1.02     +2%
> MultiSource/Benchmarks/Bullet/bullet       4.1687    4.0882      0.98     -2%
> MultiSource/Benchmarks/TSVC/LinearDepen    3.0389    3.0566      1.01     +1%
> MultiSource/Benchmarks/TSVC/LinearDepen    2.1298    2.1997      1.03     +3%
> MultiSource/Benchmarks/TSVC/LoopRerolli    2.6458    2.5552      0.97     -3%
> MultiSource/Benchmarks/TSVC/Symbolics-f    1.6243    1.6612      1.02     +2%
> MultiSource/Benchmarks/mafft/pairlocala   23.8979   24.0547      1.01     +1%
> SingleSource/Benchmarks/Misc/oourafft      3.0374    3.1846      1.05     +5%
> SingleSource/Benchmarks/SmallPT/smallpt    6.5533    6.6683      1.02     +2%
> -----------------------------------------------------------------------------
> Min (21)                                        -         -      0.97       -
> Max (21)                                        -         -      1.05       -
> Sum (21)                                      108       109      1.01     -1%
> A.Mean (21)                                     -         -      1.01     +1%
> G.Mean 2 (21)                                   -         -      1.01     +1%
> -----------------------------------------------------------------------------
>
> Thanks,
> -Quentin
>
> On Sep 9, 2014, at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>
> Hi Chandler,
>
> Thanks for fixing the problem with the insertps mask.
>
> Generally the new shuffle lowering looks promising, however there are
> some cases where the codegen is now worse causing runtime performance
> regressions in some of our internal codebase.
>
> You have already mentioned how the new shuffle lowering is missing
> some features; for example, you explicitly said that we currently lack
> of SSE4.1 blend support. Unfortunately, this seems to be one of the
> main reasons for the slowdown we are seeing.
>
> Here is a list of what we found so far that we think is causing most
> of the slowdown:
> 1) shufps is always emitted in cases where we could emit a single
> blendps; in these cases, blendps is preferable because it has better
> reciprocal throughput (this is true on all modern Intel and AMD cpus).
>
> Things get worse when it comes to lowering shuffles where the shuffle
> mask indices refer to elements from both input vectors in each lane.
> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
> a single blendps; instead it gets lowered into two shufps
> instructions.
>
> Example:
> ;;;
> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>  %1 = shufflevector <4 x float> %A, <4 x float> %B,
>       <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>  ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>  vblendps  $10, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>  vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
>  vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0[0,2,1,3]
>
>
> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
> mask identifies a blend. At the moment the new lowering logic is very
> aggressively emitting insertps instead of cheaper blendps.
>
> Example:
> ;;;
> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>  %1 = shufflevector <4 x float> %A, <4 x float> %B,
>       <4 x i32> <i32 4, i32 5, i32 2, i32 7>
>  ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>  vblendps  $11, %xmm0, %xmm1, %xmm0   # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>  vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>
>
> 3) When a shuffle performs an insert at index 0 we always generate an
> insertps, while a movss would do a better job.
> ;;;
> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>  %1 = shufflevector <4 x float> %A, <4 x float> %B,
>       <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>  ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>  vmovss %xmm1, %xmm0, %xmm0
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>  vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>
> I hope this is useful. We would be happy to contribute patches to
> improve some of the above cases, but we obviously know that this is
> still a work in progress, so we don't want to introduce conflicts with
> your work. Please let us know what you think.
>
> We will keep looking at this and follow up with any further findings.
>
> Thanks,
> Andrea Di Biagio
> SN Systems - Sony Computer Entertainment Inc.
>
> On Mon, Sep 8, 2014 at 6:08 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
> Hi Chandler,
>
> Forget about what I said.
> It seems I have some weird dependencies in my build system.
> My binaries are out-of-sync.
>
> Let me sort that out; it is likely the problem is already fixed, and I can
> resume the measurements.
>
> Sorry for the noise.
>
> Q.
>
> On Sep 8, 2014, at 9:32 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>
> On Sep 7, 2014, at 8:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
> Sure,
>
> Here is the command line:
> clang -cc1 -triple x86_64-apple-macosx -S -disable-free
> -disable-llvm-verifier -main-file-name tmp.i -mrelocation-model pic
> -pic-level 2 -mdisable-fp-elim -masm-verbose -munwind-tables -target-cpu
> core-avx-i  -O3  -ferror-limit 19 -fmessage-length 114 -stack-protector 1
> -mstackrealign -fblocks  -fencode-extended-block-signature
> -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics
> -vectorize-loops -vectorize-slp -mllvm
> -x86-experimental-vector-shuffle-lowering=true -o tmp.s -x cpp-output tmp.i
>
> This was with trunk 215249.
>
> I meant, r217281.
>
>
> Thanks,
> -Quentin
>
> <tmp.i>
>
> On Sep 6, 2014, at 4:27 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> I've run the SingleSource test suite for core-avx-i and have no failures
> here so a preprocessed file + commandline would be very useful if this
> reproduces for you still.
>
> On Sat, Sep 6, 2014 at 4:07 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
>
> I'm having trouble reproducing this. I'm trying to get LNT to actually
> run, but manually compiling the given source file didn't reproduce it for
> me.
>
> It might have been fixed recently (although I'd be surprised if so), but
> it would help to get the actual command line for which compiling this file
> in the test suite failed.
>
> -Chandler
>
> On Fri, Sep 5, 2014 at 4:36 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>
>
> Hi Chandler,
>
> While doing the performance measurements on an Ivy Bridge, I ran into
> compile time errors.
>
> I saw a bunch of "cannot select" errors in the LLVM test suite with
> -march=core-avx-i.
> E.g., SingleSource/UnitTests/Vector/SSE/sse.isamax.c is failing at O3
> -march=core-avx-i with:
> fatal error: error in backend: Cannot select: 0x7f91b99a6420: v4i32 = bitcast 0x7f91b99b0e10 [ORD=3] [ID=27]
>  0x7f91b99b0e10: v4i64 = insert_subvector 0x7f91b99a7210, 0x7f91b99a6d68, 0x7f91b99ace70 [ORD=2] [ID=25]
>    0x7f91b99a7210: v4i64 = undef [ID=15]
>    0x7f91b99a6d68: v2i64 = scalar_to_vector 0x7f91b99ab840 [ORD=2] [ID=23]
>      0x7f91b99ab840: i64 = AssertZext 0x7f91b99acc60, 0x7f91b99ac738 [ORD=2] [ID=20]
>        0x7f91b99acc60: i64,ch = CopyFromReg 0x7f91b8d52820, 0x7f91b99a3a10 [ORD=2] [ID=16]
>          0x7f91b99a3a10: i64 = Register %vreg68 [ID=1]
>    0x7f91b99ace70: i64 = Constant<0> [ID=3]
> In function: isamax0
> clang: error: clang frontend command failed with exit code 70 (use -v to
> see invocation)
> clang version 3.6.0 (215249)
> Target: x86_64-apple-darwin14.0.0
>
> For some reason, I cannot reproduce the problem with the test case that
> clang gives me using -emit-llvm. Since the source is public, I guess you can
> try to reproduce on your side.
> Indeed, if you run the test-suite with -march=core-avx-i you’ll likely
> see all those failures.
>
> Let me know if you cannot and I’ll try harder to produce a test case.
>
> Note: This is the same failure all over the place, i.e., cannot select a
> bit cast from various types to v4i32 or v4i64.
>
> Thanks,
> -Quentin
>
>
> On Sep 5, 2014, at 11:09 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>
> Hi Chandler,
>
> On 5 September 2014 17:38, Chandler Carruth <chandlerc at gmail.com> wrote:
>
>
> On Fri, Sep 5, 2014 at 9:32 AM, Robert Lougher <rob.lougher at gmail.com> wrote:
>
>
> Unfortunately, another team, while doing internal testing has seen the
> new path generating illegal insertps masks.  A sample here:
>
>   vinsertps    $256, %xmm0, %xmm13, %xmm4 # xmm4 = xmm0[0],xmm13[1,2,3]
>   vinsertps    $256, %xmm1, %xmm0, %xmm6 # xmm6 = xmm1[0],xmm0[1,2,3]
>   vinsertps    $256, %xmm13, %xmm1, %xmm7 # xmm7 = xmm13[0],xmm1[1,2,3]
>   vinsertps    $416, %xmm1, %xmm4, %xmm14 # xmm14 = xmm4[0,1],xmm1[2],xmm4[3]
>   vinsertps    $416, %xmm13, %xmm6, %xmm13 # xmm13 = xmm6[0,1],xmm13[2],xmm6[3]
>   vinsertps    $416, %xmm0, %xmm7, %xmm0 # xmm0 = xmm7[0,1],xmm0[2],xmm7[3]
>
> We'll continue to look into this and do additional testing.
>
>
>
> Interesting. Let me know if you get a test case. The insertps code path was
> added recently though and has been much less well tested. I'll start fuzz
> testing it and should hopefully uncover the bug.
>
>
> Here's two small test cases.  Hope they are of use.
>
> Thanks,
> Rob.
>
> ------
> define <4 x float> @test(<4 x float> %xyzw, <4 x float> %abcd) {
>   %1 = extractelement <4 x float> %xyzw, i32 0
>   %2 = insertelement <4 x float> undef, float %1, i32 0
>   %3 = insertelement <4 x float> %2, float 0.000000e+00, i32 1
>   %4 = shufflevector <4 x float> %3, <4 x float> %xyzw,
>        <4 x i32> <i32 0, i32 1, i32 6, i32 undef>
>   %5 = shufflevector <4 x float> %4, <4 x float> %abcd,
>        <4 x i32> <i32 0, i32 1, i32 2, i32 4>
>   ret <4 x float> %5
> }
>
> define <4 x float> @test2(<4 x float> %xyzw, <4 x float> %abcd) {
>   %1 = shufflevector <4 x float> %xyzw, <4 x float> %abcd,
>        <4 x i32> <i32 0, i32 undef, i32 2, i32 4>
>   %2 = shufflevector <4 x float> <float undef, float 0.000000e+00,
>        float undef, float undef>, <4 x float> %1,
>        <4 x i32> <i32 4, i32 1, i32 6, i32 7>
>   ret <4 x float> %2
> }
>
>
> llc -march=x86-64 -mattr=+avx test.ll -o -
>
> test:                                   # @test
>   vxorps    %xmm2, %xmm2, %xmm2
>   vmovss    %xmm0, %xmm2, %xmm2
>   vblendps    $4, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>   vinsertps    $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   retl
>
> test2:                                  # @test2
>   vinsertps    $48, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   vxorps    %xmm1, %xmm1, %xmm1
>   vblendps    $13, %xmm0, %xmm1, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>   retl
>
> llc -march=x86-64 -mattr=+avx
> -x86-experimental-vector-shuffle-lowering test.ll -o -
>
> test:                                   # @test
>   vinsertps    $270, %xmm0, %xmm0, %xmm2 # xmm2 = xmm0[0],zero,zero,zero
>   vinsertps    $416, %xmm0, %xmm2, %xmm0 # xmm0 = xmm2[0,1],xmm0[2],xmm2[3]
>   vinsertps    $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   retl
>
> test2:                                  # @test2
>   vinsertps    $304, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2],xmm1[0]
>   vxorps    %xmm1, %xmm1, %xmm1
>   vinsertps    $336, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
>   retl
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Chandler Carruth

2014-Sep-09 22:39 UTC

[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

Awesome, thanks for all the information!

See below:

On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> You have already mentioned how the new shuffle lowering is missing
> some features; for example, you explicitly said that we currently lack
> of SSE4.1 blend support. Unfortunately, this seems to be one of the
> main reasons for the slowdown we are seeing.
>
> Here is a list of what we found so far that we think is causing most
> of the slowdown:
> 1) shufps is always emitted in cases where we could emit a single
> blendps; in these cases, blendps is preferable because it has better
> reciprocal throughput (this is true on all modern Intel and AMD cpus).
>
Yep. I think this is actually super easy. I'll add support for blendps
shortly.
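
(Illustration only, not LLVM code.) A minimal Python sketch of how a shuffle mask can be recognised as a blend and turned into a blendps-style immediate. It assumes the shufflevector convention used in this thread (indices 0-3 pick from the first input, 4-7 from the second) and that a set immediate bit selects the second input for that lane. The name `blend_immediate` is made up for this example.

```python
def blend_immediate(mask, lanes=4):
    """Return the blend immediate for a two-input shuffle mask, or None.

    Indices 0..lanes-1 select from the first input and lanes..2*lanes-1
    from the second. A None entry models an undef lane.
    """
    imm = 0
    for lane, idx in enumerate(mask):
        if idx is None:          # undef lane: either input is acceptable
            continue
        if idx == lane:          # element stays in place, from the first input
            continue
        if idx == lane + lanes:  # element stays in place, from the second input
            imm |= 1 << lane
            continue
        return None              # element changes lanes: not a pure blend
    return imm


if __name__ == "__main__":
    print(blend_immediate([0, 5, 2, 7]))  # mask from example 1
    print(blend_immediate([0, 2, 1, 3]))  # a lane-crossing shuffle
```

For the mask <0,5,2,7> this gives 10, the same immediate as the `vblendps $10` that the old lowering emits in example 1; a mask that moves elements between lanes is rejected.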

>
> Things get worse when it comes to lowering shuffles where the shuffle
> mask indices refer to elements from both input vectors in each lane.
> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
> a single blendps; instead it gets lowered into two shufps
> instructions.
>
> Example:
> ;;;
> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
>        <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>   vblendps  $10, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
>   vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0[0,2,1,3]
>
>
> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
> mask identifies a blend. At the moment the new lowering logic is very
> aggressively emitting insertps instead of cheaper blendps.
>
> Example:
> ;;;
> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
>        <4 x i32> <i32 4, i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>   vblendps  $11, %xmm0, %xmm1, %xmm0   # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
>
>
> 3) When a shuffle performs an insert at index 0 we always generate an
> insertps, while a movss would do a better job.
> ;;;
> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
>        <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>   ret <4 x float> %1
> }
> ;;;
>
> llc (-mcpu=corei7-avx):
>   vmovss %xmm1, %xmm0, %xmm0
>
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>
So, this is hard. I think we should do this in MC after register allocation
because movss is the worst instruction ever: it switches from blending with
the destination to zeroing the destination when the source switches from a
register to a memory operand. =[ I would like to not emit movss in the DAG
*ever*, and teach the MC combine pass to run after register allocation (and
thus spills) have been emitted. This way we can match both patterns: when
insertps is zeroing the other lanes and the operand is from memory, and
when insertps is blending into the other lanes and the operand is in a
register.

Does that make sense? If so, would you be up for looking at this side of
things? It seems nicely separable.
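
(Illustration only.) The two movss behaviours described above, and the insertps immediates that reproduce each of them, can be modelled in a few lines of Python. This is a toy model of the instruction semantics on 4-element lists, not compiler code; the function names and the `src_is_memory` flag are invented for the sketch.

```python
def insertps(dst, src, imm, src_is_memory=False):
    """Toy model of insertps on 4-lane vectors represented as lists.

    imm[7:6] picks the source lane (ignored for a memory operand, which
    supplies a single scalar), imm[5:4] picks the destination lane, and
    imm[3:0] is a mask of result lanes to zero.
    """
    if not 0 <= imm <= 0xFF:
        raise ValueError("insertps immediate must fit in 8 bits")
    count_s = (imm >> 6) & 3
    count_d = (imm >> 4) & 3
    zmask = imm & 0xF
    value = src[0] if src_is_memory else src[count_s]
    out = list(dst)
    out[count_d] = value
    return [0.0 if (zmask >> lane) & 1 else out[lane] for lane in range(4)]


def movss_reg(dst, src):
    """Register form: low lane comes from src, upper lanes keep dst."""
    return [src[0], dst[1], dst[2], dst[3]]


def movss_load(scalar):
    """Memory form: low lane is the loaded scalar, upper lanes are zeroed."""
    return [scalar, 0.0, 0.0, 0.0]


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [5.0, 6.0, 7.0, 8.0]
    # Blending form: insertps $0 with a register source.
    print(movss_reg(a, b), insertps(a, b, 0x00))
    # Zeroing form: insertps with zero mask 0b1110 and a memory source.
    print(movss_load(9.0), insertps(a, [9.0], 0x0E, src_is_memory=True))
```

In this model, `insertps` with immediate 0 and a register source behaves like the register form of movss, while immediate 0x0E with a memory source behaves like the load form. A negative immediate as printed in the listings (for example `$-96`) has to be reduced with `imm & 0xFF` before being passed in.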

Andrea Di Biagio

2014-Sep-10 10:36 UTC

[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

On Tue, Sep 9, 2014 at 11:39 PM, Chandler Carruth <chandlerc at google.com> wrote:
> Awesome, thanks for all the information!
>
> See below:
>
> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
>>
>> You have already mentioned how the new shuffle lowering is missing
>> some features; for example, you explicitly said that we currently lack
>> of SSE4.1 blend support. Unfortunately, this seems to be one of the
>> main reasons for the slowdown we are seeing.
>>
>> Here is a list of what we found so far that we think is causing most
>> of the slowdown:
>> 1) shufps is always emitted in cases where we could emit a single
>> blendps; in these cases, blendps is preferable because it has better
>> reciprocal throughput (this is true on all modern Intel and AMD cpus).
>
>
> Yep. I think this is actually super easy. I'll add support for blendps
> shortly.

Thanks Chandler!

>> 3) When a shuffle performs an insert at index 0 we always generate an
>> insertps, while a movss would do a better job.
>> ;;;
>> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
>>        <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>>   ret <4 x float> %1
>> }
>> ;;;
>>
>> llc (-mcpu=corei7-avx):
>>   vmovss %xmm1, %xmm0, %xmm0
>>
>> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
>
>
> So, this is hard. I think we should do this in MC after register allocation
> because movss is the worst instruction ever: it switches from blending with
> the destination to zeroing the destination when the source switches from a
> register to a memory operand. =[ I would like to not emit movss in the DAG
> *ever*, and teach the MC combine pass to run after register allocation (and
> thus spills) have been emitted. This way we can match both patterns: when
> insertps is zeroing the other lanes and the operand is from memory, and
> when insertps is blending into the other lanes and the operand is in a
> register.
>
> Does that make sense? If so, would you be up for looking at this side of
> things? It seems nicely separable.

I think it is a good idea and it makes sense to me.
I will start investigating this and see what can be done.

Cheers,
Andrea

Jim Grosbach

2014-Sep-10 19:31 UTC

[LLVMdev] Please benchmark new x86 vector shuffle lowering, planning to make it the default very soon!

> On Sep 9, 2014, at 3:39 PM, Chandler Carruth <chandlerc at google.com> wrote:
> 
> Awesome, thanks for all the information!
> 
> See below:
> 
> On Tue, Sep 9, 2014 at 6:13 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:
> You have already mentioned how the new shuffle lowering is missing
> some features; for example, you explicitly said that we currently lack
> of SSE4.1 blend support. Unfortunately, this seems to be one of the
> main reasons for the slowdown we are seeing.
> 
> Here is a list of what we found so far that we think is causing most
> of the slowdown:
> 1) shufps is always emitted in cases where we could emit a single
> blendps; in these cases, blendps is preferable because it has better
> reciprocal throughput (this is true on all modern Intel and AMD cpus).
> 
> Yep. I think this is actually super easy. I'll add support for blendps
> shortly.
>  
> 
> Things get worse when it comes to lowering shuffles where the shuffle
> mask indices refer to elements from both input vectors in each lane.
> For example, a shuffle mask of <0,5,2,7> could be easily lowered into
> a single blendps; instead it gets lowered into two shufps
> instructions.
> 
> Example:
> ;;;
> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
>        <4 x i32> <i32 0, i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
> 
> llc (-mcpu=corei7-avx):
>   vblendps  $10, %xmm1, %xmm0, %xmm0   # xmm0 = xmm0[0],xmm1[5],xmm0[2],xmm1[7]
> 
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vshufps $-40, %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0,2],xmm0[1,3]
>   vshufps $-40, %xmm0, %xmm0, %xmm0 # xmm0[0,2,1,3]
> 
> 
> 2) On SSE4.1, we should try not to emit an insertps if the shuffle
> mask identifies a blend. At the moment the new lowering logic is very
> aggressively emitting insertps instead of cheaper blendps.
> 
> Example:
> ;;;
> define <4 x float> @bar(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
>        <4 x i32> <i32 4, i32 5, i32 2, i32 7>
>   ret <4 x float> %1
> }
> ;;;
> 
> llc (-mcpu=corei7-avx):
>   vblendps  $11, %xmm0, %xmm1, %xmm0   # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
> 
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $-96, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1],xmm1[2],xmm0[3]
> 
> 
> 3) When a shuffle performs an insert at index 0 we always generate an
> insertps, while a movss would do a better job.
> ;;;
> define <4 x float> @baz(<4 x float> %A, <4 x float> %B) {
>   %1 = shufflevector <4 x float> %A, <4 x float> %B,
>        <4 x i32> <i32 4, i32 1, i32 2, i32 3>
>   ret <4 x float> %1
> }
> ;;;
> 
> llc (-mcpu=corei7-avx):
>   vmovss %xmm1, %xmm0, %xmm0
> 
> llc -x86-experimental-vector-shuffle-lowering (-mcpu=corei7-avx):
>   vinsertps $0, %xmm1, %xmm0, %xmm0 # xmm0 = xmm1[0],xmm0[1,2,3]
> 
> So, this is hard. I think we should do this in MC after register allocation
> because movss is the worst instruction ever: it switches from blending with
> the destination to zeroing the destination when the source switches from a
> register to a memory operand. =[ I would like to not emit movss in the DAG
> *ever*, and teach the MC combine pass to run after register allocation (and
> thus spills) have been emitted. This way we can match both patterns: when
> insertps is zeroing the other lanes and the operand is from memory, and
> when insertps is blending into the other lanes and the operand is in a
> register.

What MC pass? Are you using the acronym generically rather than referring
specifically to the MC layer? This sort of transform is almost certainly better
done on MachineInstr rather than MCInst.

-Jim

> 
> Does that make sense? If so, would you be up for looking at this side of
> things? It seems nicely separable.