thr3ads.net - llvm dev - [LLVMdev] Enabling the SLP vectorizer by default for -O3 [Jul 2013]

Nadav Rotem

2013-Jul-14 06:30 UTC

[LLVMdev] Enabling the SLP vectorizer by default for -O3

Hi, 

LLVM’s SLP-vectorizer is a new pass that combines similar independent
instructions in straight-line code.  It is currently not enabled by default,
and people who want to experiment with it can use the clang command line flag
“-fslp-vectorize”.  I ran LLVM’s test suite with and without the SLP vectorizer
on a Sandy Bridge Mac (using SSE4, without AVX).  Based on my performance
measurements (below), I would like to enable the SLP-vectorizer by default at
-O3.  I would like to hear what others in the community think about this and
to give other people the opportunity to run their own performance measurements.
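
For readers who have not tried the pass, here is a minimal sketch of the kind of straight-line code it targets: four independent additions of the same shape on adjacent elements. The function name add4 is made up for illustration and is not taken from the test suite. Whether the pass actually merges these into vector instructions depends on the target and on the cost model.

```c
/* Four independent, isomorphic scalar operations on adjacent elements.
   There is no loop here, so this is a candidate for straight-line (SLP)
   vectorization rather than loop vectorization. */
void add4(double *restrict out, const double *restrict a,
          const double *restrict b) {
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}
```

Compiling this with and without “-fslp-vectorize” and comparing the generated assembly is one way to see what the pass does on a given target.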

— Performance Gains — 
SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%
MultiSource/Benchmarks/Olden/power/power  -18.55%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%
SingleSource/Benchmarks/Misc/flops-6  -11.02%
SingleSource/Benchmarks/Misc/flops-5  -10.03%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.37%
External/Nurbs/nurbs  -7.98%
SingleSource/Benchmarks/Misc/pi -7.29%
External/SPEC/CINT2000/252_eon/252_eon  -5.78%
External/SPEC/CFP2006/444_namd/444_namd -4.52%
External/SPEC/CFP2000/188_ammp/188_ammp -4.45%
MultiSource/Applications/SIBsim4/SIBsim4  -3.58%
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl -3.52%
SingleSource/Benchmarks/Misc-C++/Large/sphereflake  -2.96%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl -2.75%
MultiSource/Benchmarks/VersaBench/beamformer/beamformer -2.70%
MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl -1.95%
SingleSource/Benchmarks/Misc/flops  -1.89%
SingleSource/Benchmarks/Misc/oourafft -1.71%
MultiSource/Benchmarks/mafft/pairlocalalign -1.16%
External/SPEC/CFP2006/447_dealII/447_dealII -1.06%

— Regressions — 
MultiSource/Benchmarks/Olden/bh/bh  22.47%
MultiSource/Benchmarks/Bullet/bullet  7.31%
SingleSource/Benchmarks/Misc-C++-EH/spirit  5.68%
SingleSource/Benchmarks/SmallPT/smallpt 3.91%
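
The percentages above read most naturally as relative changes in execution time, so that negative values are improvements. That reading is an assumption, since the lists do not define the metric. Under it, a small helper (hypothetical name) converts a delta into a speedup factor:

```c
/* Converts a relative run-time change in percent into a speedup factor.
   Assumes delta_percent = (new_time - old_time) / old_time * 100,
   which is an interpretation of the numbers, not something the lists state. */
double speedup_from_delta(double delta_percent) {
    return 1.0 / (1.0 + delta_percent / 100.0);
}
```

Under this reading, -53.68% for matmul_f64_4x4 corresponds to a speedup of about 2.16x, and +22.47% for bh to about 0.82x.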

Thanks,
Nadav

Chandler Carruth

2013-Jul-14 07:07 UTC

[LLVMdev] Enabling the SLP vectorizer by default for -O3

Cool!

What changes have you seen to generated code size?

I'll take it for a spin on our benchmarks.



Nadav Rotem

2013-Jul-14 07:09 UTC

[LLVMdev] Enabling the SLP vectorizer by default for -O3

> 
> What changes have you seen to generated code size?
> 
I did not measure code size. 

> I'll take it for a spin on our benchmarks.
> 
Thanks!


Anton Korobeynikov

2013-Jul-14 18:24 UTC

[LLVMdev] Enabling the SLP vectorizer by default for -O3

> MultiSource/Benchmarks/Olden/bh/bh  22.47%
> MultiSource/Benchmarks/Bullet/bullet  7.31%

These look like quite big regressions. Any idea why?

--
With best regards, Anton Korobeynikov
Faculty of Mathematics and Mechanics, Saint Petersburg State University

Chris Lattner

2013-Jul-15 04:52 UTC

[LLVMdev] Enabling the SLP vectorizer by default for -O3

On Jul 13, 2013, at 11:30 PM, Nadav Rotem <nrotem at apple.com> wrote:
> Hi,
>
> LLVM’s SLP-vectorizer is a new pass that combines similar independent
> instructions in a straight-line code.  It is currently not enabled by default,
> and people who want to experiment with it can use the clang command line flag
> “-fslp-vectorize”.  I ran LLVM’s test suite with and without the SLP vectorizer
> on a Sandybridge mac (using SSE4, w/o AVX).  Based on my performance
> measurements (below) I would like to enable the SLP-vectorizer by default on
> -O3.  I would like to hear what others in the community think about this and
> give other people the opportunity to perform their own performance measurements.

This looks great, Nadav.  The performance wins are really big.  Have you
investigated the bh and bullet regressions, though?  We should at least understand
what is going wrong there.  bh is pretty tiny, so it should be straightforward.
It would also be really useful to see what the code size and compile-time impact
is.

-Chris

Nadav Rotem

2013-Jul-15 05:55 UTC

[LLVMdev] Enabling the SLP vectorizer by default for -O3

On Jul 14, 2013, at 9:52 PM, Chris Lattner <clattner at apple.com> wrote:
>
> On Jul 13, 2013, at 11:30 PM, Nadav Rotem <nrotem at apple.com> wrote:
>
>> Hi,
>>
>> LLVM’s SLP-vectorizer is a new pass that combines similar independent
>> instructions in a straight-line code.  It is currently not enabled by default,
>> and people who want to experiment with it can use the clang command line flag
>> “-fslp-vectorize”.  I ran LLVM’s test suite with and without the SLP vectorizer
>> on a Sandybridge mac (using SSE4, w/o AVX).  Based on my performance
>> measurements (below) I would like to enable the SLP-vectorizer by default on
>> -O3.  I would like to hear what others in the community think about this and
>> give other people the opportunity to perform their own performance measurements.
>
> This looks great Nadav.  The performance wins are really big.  How you
> investigated the bh and bullet regression though?

Thanks.  Yes, I looked at both.  The hot function in BH is “gravsub”.  The
vectorized IR looks fine and the assembly looks fine, but for some reason
Instruments reports that the first vector-subtract instruction takes 18% of the
time. The regression happens both with the VEX prefix and without. I suspected
that the problem is the movupd's that load xmm0 and xmm1. I started looking
at some performance counters on Friday, but I did not find anything suspicious
yet.

+0x00 movupd              16(%rsi), %xmm0
+0x05 movupd              16(%rsp), %xmm1
+0x0b subpd                %xmm1, %xmm0    <———— 18% of the runtime of bh ?
+0x0f movapd               %xmm0, %xmm2
+0x13 mulsd                %xmm2, %xmm2
+0x17 xorpd                %xmm1, %xmm1
+0x1b addsd                %xmm2, %xmm1 
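
For orientation, the instruction sequence above appears consistent with code that subtracts two position vectors and then accumulates a squared distance. The following is an illustrative sketch of that shape only, not the actual gravsub source; the function and parameter names are hypothetical.

```c
/* Hypothetical stand-in for the hot pattern: compute the difference of two
   3-D positions into dr and return its squared length.  The first two
   subtractions are the pair an SLP vectorizer could merge into one
   two-element vector subtract. */
double squared_distance(const double a[3], const double b[3], double dr[3]) {
    dr[0] = a[0] - b[0];
    dr[1] = a[1] - b[1];
    dr[2] = a[2] - b[2];

    double drsq = 0.0;
    drsq += dr[0] * dr[0];
    drsq += dr[1] * dr[1];
    drsq += dr[2] * dr[2];
    return drsq;
}
```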

I spent less time on Bullet.  Bullet also has one hot function
(“resolveSingleConstraintRowLowerLimit”).  On this code the vectorizer generates
several trees that use the <3 x float> type. This is risky because the
loads/stores are inefficient, but unfortunately triples of RGB and XYZ are very
popular in some domains and we do want to vectorize them.  I skimmed through the
IR and the assembly and I did not see anything too bad. The next step would be
to do a binary search on the places where the vectorizer fires to locate the bad
pattern.
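
The binary search mentioned above can be sketched as follows. This assumes some way to let only the first N vectorization sites fire and then re-run the benchmark; that mechanism is represented here by a hypothetical callback and is not an existing LLVM option. It also assumes a single culprit, so that the regression appears once the bad site is enabled and stays present for larger N.

```c
/* Hypothetical oracle: returns nonzero if the benchmark regresses when only
   the first `limit` vectorization sites are allowed to fire. */
typedef int (*regresses_fn)(int limit);

/* Returns the 0-based index of the first site whose enabling introduces the
   regression, or -1 if enabling all sites does not regress.
   Assumes regresses(0) is false (nothing vectorized is the baseline). */
int first_bad_site(int num_sites, regresses_fn regresses) {
    if (!regresses(num_sites))
        return -1;
    int lo = 0;          /* known good: first `lo` sites enabled */
    int hi = num_sites;  /* known bad: first `hi` sites enabled  */
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        if (regresses(mid))
            hi = mid;
        else
            lo = mid;
    }
    return hi - 1;
}

/* Example oracle for illustration: pretend site index 6 is the bad one. */
int example_oracle(int limit) {
    return limit >= 7;
}
```

With the example oracle and ten sites, the search needs four benchmark runs (one up-front check plus three bisection steps) instead of up to ten.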

On AVX we have another regression that I did not mention: Flops-7.  When we
vectorize we cause more spills because we do a poor job scheduling
non-destructive source instructions (related to PR10928). Hopefully Andy’s
scheduler will fix this regression once it is enabled.

I did not measure code size, but I did measure compile time.  There are 4-5
workloads (not counting workloads that run below 0.5 seconds) where the compile
time increase is more than 5%.  I am aware of a problem in the (quadratic) code
that looks for consecutive stores. This code calls SCEV too many times. I plan
to fix this.
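
To illustrate why a pairwise search for consecutive stores is quadratic, here is a simplified sketch. It is not the vectorizer's code and not necessarily the planned fix: stores are modelled as byte offsets from a common base, and all stores are assumed to have the same size and not to overlap. All names are hypothetical. The first function compares every pair, which in the real pass would mean one address query per pair; the second sorts once and only compares neighbours.

```c
#include <stdlib.h>

/* Quadratic: test every ordered pair (i, j) for "j directly follows i". */
int count_consecutive_quadratic(const long *off, int n, long size) {
    int pairs = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (i != j && off[j] - off[i] == size)
                ++pairs;
    return pairs;
}

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Sort once (in place), then compare neighbours only.  Gives the same count
   as the quadratic version under the non-overlap assumption stated above. */
int count_consecutive_sorted(long *off, int n, long size) {
    int pairs = 0;
    qsort(off, (size_t)n, sizeof off[0], cmp_long);
    for (int i = 0; i + 1 < n; ++i)
        if (off[i + 1] - off[i] == size)
            ++pairs;
    return pairs;
}
```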

Thanks,
Nadav  

> We should at least understand what is going wrong there.  bh is pretty
> tiny, so it should be straight-forward.  It would also be really useful to see
> what the code size and compile time impact is.
> 
> -Chris

Chandler Carruth

2013-Jul-15 11:44 UTC

[LLVMdev] Enabling the SLP vectorizer by default for -O3

On Sun, Jul 14, 2013 at 12:07 AM, Chandler Carruth <chandlerc at google.com> wrote:
> I'll take it for a spin on our benchmarks.

It'll be a bit before I can go in and reduce it, but I thought I would
mention that I've seen just one new crasher, and it's on part of the GLU's
reference implementation libtess in normal.c... No real details, but in
case you're aware or someone else knows how to build that...