Nadav Rotem
2013-Jul-14 06:30 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
Hi,

LLVM’s SLP-vectorizer is a new pass that combines similar independent instructions in straight-line code. It is currently not enabled by default, and people who want to experiment with it can use the clang command line flag “-fslp-vectorize”. I ran LLVM’s test suite with and without the SLP vectorizer on a Sandybridge mac (using SSE4, w/o AVX). Based on my performance measurements (below) I would like to enable the SLP-vectorizer by default on -O3. I would like to hear what others in the community think about this and give other people the opportunity to perform their own performance measurements.

--- Performance Gains ---
SingleSource/Benchmarks/Misc/matmul_f64_4x4                               -53.68%
MultiSource/Benchmarks/Olden/power/power                                  -18.55%
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt           -14.71%
SingleSource/Benchmarks/Misc/flops-6                                      -11.02%
SingleSource/Benchmarks/Misc/flops-5                                      -10.03%
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt      -8.37%
External/Nurbs/nurbs                                                       -7.98%
SingleSource/Benchmarks/Misc/pi                                            -7.29%
External/SPEC/CINT2000/252_eon/252_eon                                     -5.78%
External/SPEC/CFP2006/444_namd/444_namd                                    -4.52%
External/SPEC/CFP2000/188_ammp/188_ammp                                    -4.45%
MultiSource/Applications/SIBsim4/SIBsim4                                   -3.58%
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl            -3.52%
SingleSource/Benchmarks/Misc-C++/Large/sphereflake                         -2.96%
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl      -2.75%
MultiSource/Benchmarks/VersaBench/beamformer/beamformer                    -2.70%
MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl            -1.95%
SingleSource/Benchmarks/Misc/flops                                         -1.89%
SingleSource/Benchmarks/Misc/oourafft                                      -1.71%
MultiSource/Benchmarks/mafft/pairlocalalign                                -1.16%
External/SPEC/CFP2006/447_dealII/447_dealII                                -1.06%

--- Regressions ---
MultiSource/Benchmarks/Olden/bh/bh                                         22.47%
MultiSource/Benchmarks/Bullet/bullet                                        7.31%
SingleSource/Benchmarks/Misc-C++-EH/spirit                                  5.68%
SingleSource/Benchmarks/SmallPT/smallpt                                     3.91%

Thanks,
Nadav
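For readers who have not seen the pass in action, here is a minimal sketch of the kind of code it targets. The function name `add4` and the file name in the command below are made up for illustration; nothing here comes from the test suite. The four statements are independent, perform the same operation, and touch adjacent memory, so they are candidates for being combined into vector adds. Whether that actually happens depends on the target and on the cost model.

```c
/* Hypothetical example, not taken from the LLVM test suite.
   Four independent, isomorphic adds on adjacent elements. There is no
   loop here, so a loop vectorizer has nothing to work with; this is
   the straight-line shape an SLP vectorizer looks for. */
void add4(double *restrict out, const double *restrict a,
          const double *restrict b) {
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}
```

One way to check what the compiler did is to compile with and without the flag, for example `clang -O3 -fslp-vectorize -S add4.c`, and compare the assembly for packed adds (such as addpd on x86) against scalar ones.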
Chandler Carruth
2013-Jul-14 07:07 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
Cool! What changes have you seen to generated code size?

I'll take it for a spin on our benchmarks.
Nadav Rotem
2013-Jul-14 07:09 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
> What changes have you seen to generated code size?

I did not measure code size.

> I'll take it for a spin on our benchmarks.

Thanks!
Anton Korobeynikov
2013-Jul-14 18:24 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
> MultiSource/Benchmarks/Olden/bh/bh 22.47%
> MultiSource/Benchmarks/Bullet/bullet 7.31%

These look like quite big regressions. Any idea why?

--
With best regards,
Anton Korobeynikov
Faculty of Mathematics and Mechanics, Saint Petersburg State University
Chris Lattner
2013-Jul-15 04:52 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
This looks great, Nadav. The performance wins are really big. Have you investigated the bh and bullet regressions, though? We should at least understand what is going wrong there. bh is pretty tiny, so it should be straightforward.

It would also be really useful to see what the code size and compile time impact is.

-Chris
Nadav Rotem
2013-Jul-15 05:55 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
On Jul 14, 2013, at 9:52 PM, Chris Lattner <clattner at apple.com> wrote:
> This looks great Nadav. The performance wins are really big. Have you investigated the bh and bullet regressions though?

Thanks. Yes, I looked at both.

The hot function in BH is “gravsub”. The vectorized IR looks fine and the assembly looks fine, but for some reason Instruments reports that the first vector-subtract instruction takes 18% of the time. The regression happens both with the VEX prefix and without. I suspected that the problem is the movupd's that load xmm0 and xmm1. I started looking at some performance counters on Friday, but I have not found anything suspicious yet.

    +0x00  movupd  16(%rsi), %xmm0
    +0x05  movupd  16(%rsp), %xmm1
    +0x0b  subpd   %xmm1, %xmm0     <-- 18% of the runtime of bh?
    +0x0f  movapd  %xmm0, %xmm2
    +0x13  mulsd   %xmm2, %xmm2
    +0x17  xorpd   %xmm1, %xmm1
    +0x1b  addsd   %xmm2, %xmm1

I spent less time on Bullet. Bullet also has one hot function (“resolveSingleConstraintRowLowerLimit”). On this code the vectorizer generates several trees that use the <3 x float> type. This is risky because the loads/stores are inefficient, but unfortunately triples of RGB and XYZ are very popular in some domains and we do want to vectorize them. I skimmed through the IR and the assembly and I did not see anything too bad. The next step would be to do a binary search on the places where the vectorizer fires to locate the bad pattern.

On AVX we have another regression that I did not mention: Flops-7. When we vectorize we cause more spills because we do a poor job scheduling non-destructive source instructions (related to PR10928). Hopefully Andy’s scheduler will fix this regression once it is enabled.

I did not measure code size, but I did measure compile time. There are 4-5 workloads (not counting workloads that run below 0.5 seconds) where the compile time increase is more than 5%. I am aware of a problem in the (quadratic) code that looks for consecutive stores. This code calls SCEV too many times. I plan to fix this.

Thanks,
Nadav
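To make the gravsub listing easier to follow, here is a hypothetical C sketch of source code that could plausibly lead to that instruction shape. It is a guess for illustration only, not the actual bh source; the names `sub_and_norm2` and `NDIM` are invented. The idea is that subtractions on adjacent elements can be packed into one vector subtract (the subpd in the listing), while the sum of squares that follows stays scalar (the mulsd and addsd).

```c
/* Hypothetical reconstruction for illustration; NOT the bh source. */
enum { NDIM = 3 };

/* Stores the component-wise difference p - p0 in dr and returns its
   squared length. The three subtractions are independent and touch
   adjacent elements, so two of them may be packed into one two-wide
   double subtract; the accumulation into drsq is a serial chain. */
double sub_and_norm2(double dr[NDIM], const double p[NDIM],
                     const double p0[NDIM]) {
    double drsq = 0.0;
    dr[0] = p[0] - p0[0];
    dr[1] = p[1] - p0[1];
    dr[2] = p[2] - p0[2];
    drsq += dr[0] * dr[0];
    drsq += dr[1] * dr[1];
    drsq += dr[2] * dr[2];
    return drsq;
}
```

With three elements per vector, a two-wide pack leaves one element over, which is the same awkwardness described above for the <3 x float> trees in Bullet: the odd width does not map cleanly onto the hardware vector registers.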
Chandler Carruth
2013-Jul-15 11:44 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
On Sun, Jul 14, 2013 at 12:07 AM, Chandler Carruth <chandlerc at google.com> wrote:
> I'll take it for a spin on our benchmarks.

It'll be a bit before I can go in and reduce it, but I thought I would mention that I've seen just one new crasher, and it's in libtess, part of GLU's reference implementation, in normal.c... No real details, but in case you're aware or someone else knows how to build that...