Nadav Rotem
2013-Jul-28 06:54 UTC
[LLVMdev] Enabling the SLP-vectorizer by default for -O3
Hi,

Below you can see the updated benchmark results for the new SLP-vectorizer. As you can see, there is a small number of compile time regressions, a single major runtime *regression, and many performance gains. There is a tiny increase in code size: 30k for the whole test-suite. Based on the numbers below I would like to enable the SLP-vectorizer by default for -O3. Please let me know if you have any concerns.

Thanks,
Nadav

* - I now understand the Olden/BH regression better. BH is slower because of a store-buffer stall. This means that the store buffer fills up and the CPU has to wait for some stores to finish. I can think of two reasons that may cause this problem. First, our vectorized stores are followed by a memcpy that's expanded to a list of scalar-read/writes to the same addresses as the vector store. Maybe the processors can’t prune multiple stores to the same address with different sizes (Section 2.2.4 in the optimization guide has some info on this). Another possibility (less likely) is that we increase the critical path by adding a new pshufd instruction before the last vector store and that affects the store-buffer somehow. In any case, there is not much we can do at the IR-level to predict this.
Performance Regressions - Compile Time

Benchmark                                                                      Δ  Previous  Current       σ
MultiSource/Benchmarks/VersaBench/beamformer/beamformer                   18.98%    0.0722   0.0859  0.0003
MultiSource/Benchmarks/FreeBench/pifft/pifft                               5.66%    0.5003   0.5286  0.0015
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt      4.85%    0.4084   0.4282  0.0014
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt    4.36%    0.3856   0.4024  0.0018
MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt                2.62%    0.4424   0.4540  0.0019
External/SPEC/CINT2006/401_bzip2/401_bzip2                                 1.50%    1.0613   1.0772  0.0010
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4                               1.23%   12.1337  12.2831  0.0296
MultiSource/Applications/kimwitu++/kc                                      1.15%    9.3690   9.4769  0.0186
SingleSource/Benchmarks/Misc-C++-EH/spirit                                 1.13%    3.2769   3.3139  0.0079
External/SPEC/CFP2000/188_ammp/188_ammp                                    1.01%    1.8632   1.8820  0.0059

Performance Regressions - Execution Time

Benchmark                                                                      Δ  Previous  Current       σ
MultiSource/Benchmarks/Olden/bh/bh                                        19.24%    1.1551   1.3773  0.0021
SingleSource/Benchmarks/SmallPT/smallpt                                    3.75%    5.8779   6.0983  0.0146
SingleSource/Benchmarks/Misc-C++/Large/ray                                 1.08%    1.8194   1.8390  0.0009

Performance Improvements - Execution Time

Benchmark                                                                      Δ  Previous  Current       σ
SingleSource/Benchmarks/Misc/matmul_f64_4x4                              -53.67%    1.4064   0.6516  0.0007
External/Nurbs/nurbs                                                     -19.47%    2.5389   2.0445  0.0029
MultiSource/Benchmarks/Olden/power/power                                 -18.49%    1.2572   1.0248  0.0004
SingleSource/Benchmarks/Misc/flops-4                                     -15.93%    0.7767   0.6530  0.0348
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt          -14.72%    2.3925   2.0404  0.0013
SingleSource/Benchmarks/Misc/flops-6                                     -11.05%    1.1427   1.0164  0.0009
SingleSource/Benchmarks/Misc/flops-5                                     -10.43%    1.2771   1.1439  0.0015
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt     -8.10%    2.3468   2.1568  0.0195
SingleSource/Benchmarks/Misc/pi                                           -7.18%    0.6042   0.5608  0.0000
External/SPEC/CFP2006/444_namd/444_namd                                   -4.01%    9.6053   9.2200  0.0064
SingleSource/Benchmarks/Linpack/linpack-pc                                -3.85%   95.5313  91.8522  1.1151
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl           -3.52%    3.1962   3.0837  0.0063
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl     -2.93%    2.9336   2.8477  0.0037
MultiSource/Benchmarks/VersaBench/beamformer/beamformer                   -2.79%    0.8845   0.8598  0.0026
SingleSource/Benchmarks/Misc-C++/Large/sphereflake                        -2.79%    1.8517   1.8001  0.0014
External/SPEC/CFP2000/177_mesa/177_mesa                                   -2.15%    1.7214   1.6844  0.0017
SingleSource/Benchmarks/CoyoteBench/fftbench                              -2.05%    0.7280   0.7131  0.0049
MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl           -1.96%    3.1494   3.0878  0.0034
SingleSource/Benchmarks/Misc/oourafft                                     -1.70%    3.4625   3.4035  0.0009
SingleSource/Benchmarks/Misc/flops                                        -1.31%    7.0775   6.9845  0.0014
MultiSource/Applications/JM/lencod/lencod                                 -1.12%    4.5972   4.5455  0.0050
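The Δ values in the results above are consistent with the relative change from Previous to Current, so a positive Δ means the measured time went up. The message does not state the formula, so the following is only a minimal Python sketch under that assumption; the function name `percent_change` is illustrative and does not come from any LLVM tool.

```python
def percent_change(previous: float, current: float) -> float:
    """Relative change from previous to current, in percent.

    Assumption: this is how the Δ column was derived. Positive values
    mean the time increased (a regression), negative values mean it
    decreased (an improvement).
    """
    if previous == 0:
        raise ValueError("previous must be non-zero")
    return (current - previous) / previous * 100.0


# (label, previous, current) values copied from the results above.
rows = [
    ("VersaBench/beamformer (compile time)", 0.0722, 0.0859),
    ("Olden/bh (execution time)", 1.1551, 1.3773),
    ("Misc/matmul_f64_4x4 (execution time)", 1.4064, 0.6516),
]

for label, previous, current in rows:
    print(f"{label}: {percent_change(previous, current):+.2f}%")
```

For these three rows the sketch reproduces the reported 18.98%, 19.24%, and -53.67% to two decimal places.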
Chandler Carruth
2013-Jul-28 07:20 UTC
[LLVMdev] Enabling the SLP-vectorizer by default for -O3
Sorry for not posting sooner. I forgot to send an update with the results.

I also have some benchmark data. It confirms much of what you posted -- binary size increase is essentially 0, performance increases across the board. It looks really good to me.

However, there was one crash that I'd like to check if it still fires. Will update later today (feel free to ping me if you don't hear anything.).

That said, why -O3? I think we should just enable this across the board, as it doesn't seem to cause any size regression under any mode, and the compile time hit is really low.
Nadav Rotem
2013-Jul-29 18:23 UTC
[LLVMdev] Enabling the SLP-vectorizer by default for -O3
On Jul 28, 2013, at 12:20 AM, Chandler Carruth <chandlerc at google.com> wrote:

> That said, why -O3? I think we should just enable this across the board, as it doesn't seem to cause any size regression under any mode, and the compile time hit is really low.

I agree. I think that it would be a good idea to enable it for -Os and -O2, but I’d like to make one step at a time.

Thanks,
Nadav