thr3ads.net - llvm dev - [LLVMdev] Enabling the SLP-vectorizer by default for -O3 [Jul 2013]

Nadav Rotem

2013-Jul-28 06:54 UTC

[LLVMdev] Enabling the SLP-vectorizer by default for -O3

Hi, 

Below are the updated benchmark results for the new SLP-vectorizer. There are
a few compile-time regressions, a single major runtime regression (*), and many
performance gains. Code size grows slightly: 30k for the whole test-suite.
Based on these numbers, I would like to enable the SLP-vectorizer by default
for -O3. Please let me know if you have any concerns.
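
For readers less familiar with the term: SLP (superword-level parallelism)
vectorization packs independent, adjacent scalar operations in straight-line
code into vector instructions. The C sketch below shows the kind of pattern
such a pass may target. The function name add4 is hypothetical and is not
taken from the test-suite, and whether a given compiler actually vectorizes it
depends on the target and on the cost model. In Clang the pass is controlled
by the -fslp-vectorize / -fno-slp-vectorize flags.

```c
#include <stddef.h>

/* Hypothetical example (not from the test-suite): four independent scalar
   additions on adjacent array elements.  There is no loop here, so a loop
   vectorizer has nothing to work with; an SLP vectorizer may instead pack
   the four additions and four stores into vector operations. */
static void add4(double *out, const double *a, const double *b) {
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
}
```

Calling add4 with a = {1, 2, 3, 4} and b = {10, 20, 30, 40} fills out with
{11, 22, 33, 44}, with or without vectorization; only the generated
instructions differ.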

Thanks,
Nadav


* - I now understand the Olden/BH regression better. BH is slower because of a
store-buffer stall: the store buffer fills up and the CPU has to wait for some
stores to finish. I can think of two possible causes. First, our vectorized
stores are followed by a memcpy that is expanded to a list of scalar
reads/writes to the same addresses as the vector store. Maybe the processor
can't prune multiple stores of different sizes to the same address (Section
2.2.4 in the optimization guide has some info on this). The other possibility
(less likely) is that we lengthen the critical path by adding a new pshufd
instruction before the last vector store, and that affects the store buffer
somehow. In any case, there is not much we can do at the IR level to predict
this.
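
The first scenario can be sketched in C as follows. This is only an
illustration under stated assumptions: the Vec3 type and the scale_and_copy
function are hypothetical and are not the actual BH source, and the comments
describe what a compiler may do, not what it is guaranteed to do.

```c
#include <string.h>

/* Hypothetical type; not taken from the Olden/BH benchmark. */
typedef struct { double x, y, z; } Vec3;

/* Three adjacent scalar stores followed by a struct copy.
   - An SLP vectorizer may merge the stores to tmp->x and tmp->y into one
     16-byte vector store.
   - The memcpy may then be expanded into 8-byte scalar loads and stores
     that touch the same bytes the vector store just wrote.
   Mixing access sizes on the same addresses is the pattern suspected of
   causing the store-buffer stall described above. */
static void scale_and_copy(Vec3 *tmp, Vec3 *dst, const Vec3 *src, double k) {
    tmp->x = src->x * k;
    tmp->y = src->y * k;
    tmp->z = src->z * k;
    memcpy(dst, tmp, sizeof *dst);
}
```

The result is the same either way; the concern is purely about how the
hardware handles the differently sized accesses.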



Performance Regressions - Compile Time	Δ	Previous	Current	σ
MultiSource/Benchmarks/VersaBench/beamformer/beamformer	18.98%	0.0722	0.0859	0.0003
MultiSource/Benchmarks/FreeBench/pifft/pifft	5.66%	0.5003	0.5286	0.0015
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt	4.85%	0.4084	0.4282	0.0014
MultiSource/Benchmarks/TSVC/LoopRestructuring-flt/LoopRestructuring-flt	4.36%	0.3856	0.4024	0.0018
MultiSource/Benchmarks/TSVC/ControlFlow-flt/ControlFlow-flt	2.62%	0.4424	0.4540	0.0019
External/SPEC/CINT2006/401_bzip2/401_bzip2	1.50%	1.0613	1.0772	0.0010
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4	1.23%	12.1337	12.2831	0.0296
MultiSource/Applications/kimwitu++/kc	1.15%	9.3690	9.4769	0.0186
SingleSource/Benchmarks/Misc-C++-EH/spirit	1.13%	3.2769	3.3139	0.0079
External/SPEC/CFP2000/188_ammp/188_ammp	1.01%	1.8632	1.8820	0.0059


Performance Regressions - Execution Time	Δ	Previous	Current	σ
MultiSource/Benchmarks/Olden/bh/bh	19.24%	1.1551	1.3773	0.0021
SingleSource/Benchmarks/SmallPT/smallpt	3.75%	5.8779	6.0983	0.0146
SingleSource/Benchmarks/Misc-C++/Large/ray	1.08%	1.8194	1.8390	0.0009


Performance Improvements - Execution Time	Δ	Previous	Current	σ
SingleSource/Benchmarks/Misc/matmul_f64_4x4	-53.67%	1.4064	0.6516	0.0007
External/Nurbs/nurbs	-19.47%	2.5389	2.0445	0.0029
MultiSource/Benchmarks/Olden/power/power	-18.49%	1.2572	1.0248	0.0004
SingleSource/Benchmarks/Misc/flops-4	-15.93%	0.7767	0.6530	0.0348
MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt	-14.72%	2.3925	2.0404	0.0013
SingleSource/Benchmarks/Misc/flops-6	-11.05%	1.1427	1.0164	0.0009
SingleSource/Benchmarks/Misc/flops-5	-10.43%	1.2771	1.1439	0.0015
MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt	-8.10%	2.3468	2.1568	0.0195
SingleSource/Benchmarks/Misc/pi	-7.18%	0.6042	0.5608	0.0000
External/SPEC/CFP2006/444_namd/444_namd	-4.01%	9.6053	9.2200	0.0064
SingleSource/Benchmarks/Linpack/linpack-pc	-3.85%	95.5313	91.8522	1.1151
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl	-3.52%	3.1962	3.0837	0.0063
MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl	-2.93%	2.9336	2.8477	0.0037
MultiSource/Benchmarks/VersaBench/beamformer/beamformer	-2.79%	0.8845	0.8598	0.0026
SingleSource/Benchmarks/Misc-C++/Large/sphereflake	-2.79%	1.8517	1.8001	0.0014
External/SPEC/CFP2000/177_mesa/177_mesa	-2.15%	1.7214	1.6844	0.0017
SingleSource/Benchmarks/CoyoteBench/fftbench	-2.05%	0.7280	0.7131	0.0049
MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl	-1.96%	3.1494	3.0878	0.0034
SingleSource/Benchmarks/Misc/oourafft	-1.70%	3.4625	3.4035	0.0009
SingleSource/Benchmarks/Misc/flops	-1.31%	7.0775	6.9845	0.0014
MultiSource/Applications/JM/lencod/lencod	-1.12%	4.5972	4.5455	0.0050
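
The Δ column in the tables above appears to be the relative change from
Previous to Current; for Olden/bh, (1.3773 - 1.1551) / 1.1551 is about 19.24%.
A small C sketch of that arithmetic follows. The function names are
hypothetical, and the noise check assumes that σ is the standard deviation of
the measurements and that a multiple of σ is a reasonable threshold; neither
assumption is stated in the tables.

```c
/* Relative change in percent.  Positive means Current is larger (slower),
   negative means Current is smaller (faster). */
static double delta_percent(double previous, double current) {
    return (current - previous) / previous * 100.0;
}

/* Hypothetical noise check: returns 1 when the absolute change is larger
   than n_sigma times sigma, 0 otherwise.  Assumes sigma is a standard
   deviation in the same unit as the timings. */
static int exceeds_noise(double previous, double current,
                         double sigma, double n_sigma) {
    double diff = current - previous;
    if (diff < 0.0) {
        diff = -diff;
    }
    return diff > n_sigma * sigma;
}
```

For example, delta_percent(1.4064, 0.6516) reproduces the -53.67% reported
for matmul_f64_4x4 to two decimal places.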

Chandler Carruth

2013-Jul-28 07:20 UTC

Sorry for not posting sooner. I forgot to send an update with the results.

I also have some benchmark data. It confirms much of what you posted --
binary size increase is essentially 0, performance increases across the
board. It looks really good to me.

However, there was one crash, and I'd like to check whether it still fires. I
will update later today (feel free to ping me if you don't hear anything).

That said, why -O3? I think we should just enable this across the board, as
it doesn't seem to cause any size regression under any mode, and the
compile time hit is really low.


Nadav Rotem

2013-Jul-29 18:23 UTC

On Jul 28, 2013, at 12:20 AM, Chandler Carruth <chandlerc at google.com> wrote:
> That said, why -O3? I think we should just enable this across the board, as
> it doesn't seem to cause any size regression under any mode, and the compile
> time hit is really low.

I agree. I think it would be a good idea to enable it for -Os and -O2, but
I'd like to take one step at a time.

Thanks,
Nadav