Nadav Rotem
2013-Jul-15 05:55 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
On Jul 14, 2013, at 9:52 PM, Chris Lattner <clattner at apple.com> wrote:

> On Jul 13, 2013, at 11:30 PM, Nadav Rotem <nrotem at apple.com> wrote:
>
>> Hi,
>>
>> LLVM’s SLP vectorizer is a new pass that combines similar independent instructions in straight-line code. It is currently not enabled by default, and people who want to experiment with it can use the clang command-line flag “-fslp-vectorize”. I ran LLVM’s test suite with and without the SLP vectorizer on a Sandybridge mac (using SSE4, w/o AVX). Based on my performance measurements (below) I would like to enable the SLP vectorizer by default at -O3. I would like to hear what others in the community think about this and give other people the opportunity to perform their own performance measurements.
>
> This looks great Nadav. The performance wins are really big. Have you investigated the bh and bullet regressions, though?

Thanks. Yes, I looked at both. The hot function in BH is “gravsub”. The vectorized IR looks fine and the assembly looks fine, but for some reason Instruments reports that the first vector-subtract instruction takes 18% of the time. The regression happens both with the VEX prefix and without. I suspected that the problem is the movupd's that load xmm0 and xmm1. I started looking at some performance counters on Friday, but I have not found anything suspicious yet.

+0x00 movupd 16(%rsi), %xmm0
+0x05 movupd 16(%rsp), %xmm1
+0x0b subpd  %xmm1, %xmm0   <———— 18% of the runtime of bh?
+0x0f movapd %xmm0, %xmm2
+0x13 mulsd  %xmm2, %xmm2
+0x17 xorpd  %xmm1, %xmm1
+0x1b addsd  %xmm2, %xmm1

I spent less time on Bullet. Bullet also has one hot function (“resolveSingleConstraintRowLowerLimit”). On this code the vectorizer generates several trees that use the <3 x float> type. This is risky because the loads/stores are inefficient, but unfortunately triples of RGB and XYZ values are very popular in some domains and we do want to vectorize them.
I skimmed through the IR and the assembly and I did not see anything too bad. The next step would be to do a binary search on the places where the vectorizer fires to locate the bad pattern.

On AVX we have another regression that I did not mention: Flops-7. When we vectorize, we cause more spills because we do a poor job scheduling non-destructive source instructions (related to PR10928). Hopefully Andy’s scheduler will fix this regression once it is enabled.

I did not measure code size, but I did measure compile time. There are 4-5 workloads (not counting workloads that run below 0.5 seconds) where the compile-time increase is more than 5%. I am aware of a problem in the (quadratic) code that looks for consecutive stores. This code calls SCEV too many times. I plan to fix this.

Thanks,
Nadav

> We should at least understand what is going wrong there. bh is pretty tiny, so it should be straightforward. It would also be really useful to see what the code size and compile time impact is.
>
> -Chris
>
>> — Performance Gains —
>> SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%
>> MultiSource/Benchmarks/Olden/power/power -18.55%
>> MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%
>> SingleSource/Benchmarks/Misc/flops-6 -11.02%
>> SingleSource/Benchmarks/Misc/flops-5 -10.03%
>> MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.37%
>> External/Nurbs/nurbs -7.98%
>> SingleSource/Benchmarks/Misc/pi -7.29%
>> External/SPEC/CINT2000/252_eon/252_eon -5.78%
>> External/SPEC/CFP2006/444_namd/444_namd -4.52%
>> External/SPEC/CFP2000/188_ammp/188_ammp -4.45%
>> MultiSource/Applications/SIBsim4/SIBsim4 -3.58%
>> MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl -3.52%
>> SingleSource/Benchmarks/Misc-C++/Large/sphereflake -2.96%
>> MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl -2.75%
>> MultiSource/Benchmarks/VersaBench/beamformer/beamformer -2.70%
>> MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl -1.95%
>> SingleSource/Benchmarks/Misc/flops -1.89%
>> SingleSource/Benchmarks/Misc/oourafft -1.71%
>> MultiSource/Benchmarks/mafft/pairlocalalign -1.16%
>> External/SPEC/CFP2006/447_dealII/447_dealII -1.06%
>>
>> — Regressions —
>> MultiSource/Benchmarks/Olden/bh/bh 22.47%
>> MultiSource/Benchmarks/Bullet/bullet 7.31%
>> SingleSource/Benchmarks/Misc-C++-EH/spirit 5.68%
>> SingleSource/Benchmarks/SmallPT/smallpt 3.91%
>>
>> Thanks,
>> Nadav
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Renato Golin
2013-Jul-15 13:48 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
Hi Nadav,

I think it's a great idea to have the SLP vectorizer enabled, but maybe we should trim the horrible cases first (regressions, +5% compile time, etc.). I don't mind a sub-5% compile-time increase at O3, nor do I mind sub-1% regressions in performance on some benchmarks IFF the majority of the benchmarks improve.

On 15 July 2013 06:55, Nadav Rotem <nrotem at apple.com> wrote:

> I suspected that the problem is the movupd's that load xmm0 and xmm1.

I've seen this before on ARM, and I agree, it looks like the load is constrained by some other condition or pipeline stall before that.

cheers,
--renato
Nadav Rotem
2013-Jul-23 22:33 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
Hi,

Sorry for the delay in response. I measured the code size change and noticed small changes in both directions for individual programs. I found a 30k binary-size growth for the entire test suite + SPEC. I attached an updated performance report that includes both compile-time and performance measurements.

Thanks,
Nadav

On Jul 14, 2013, at 10:55 PM, Nadav Rotem <nrotem at apple.com> wrote:

> Thanks. Yes, I looked at both. The hot function in BH is “gravsub”. [...]
----- Original Message -----

> Hi,
>
> Sorry for the delay in response. I measured the code size change and
> noticed small changes in both directions for individual programs. I
> found a 30k binary size growth for the entire testsuite + SPEC. I
> attached an updated performance report that includes both compile
> time and performance measurements.

I think that these numbers look good. Regarding the performance regressions:

This looks like noise:
MultiSource/Benchmarks/McCat/08-main/main 44.40% 0.0277 0.0400 0.0000

For these two:
MultiSource/Benchmarks/Olden/bh/bh 19.73% 1.1547 1.3825 0.0017
MultiSource/Benchmarks/Bullet/bullet 7.30% 3.6130 3.8767 0.0069

can you run them on a different CPU and see how generic these slowdowns are?

Thanks again,
Hal

> On Jul 14, 2013, at 10:55 PM, Nadav Rotem < nrotem at apple.com > wrote: [...]

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory