Nadav Rotem
2013-Jul-15 05:55 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
On Jul 14, 2013, at 9:52 PM, Chris Lattner <clattner at apple.com> wrote:

> On Jul 13, 2013, at 11:30 PM, Nadav Rotem <nrotem at apple.com> wrote:
>
>> Hi,
>>
>> LLVM’s SLP vectorizer is a new pass that combines similar independent instructions in straight-line code. It is currently not enabled by default, and people who want to experiment with it can use the clang command-line flag “-fslp-vectorize”. I ran LLVM’s test suite with and without the SLP vectorizer on a Sandybridge mac (using SSE4, w/o AVX). Based on my performance measurements (below) I would like to enable the SLP vectorizer by default at -O3. I would like to hear what others in the community think about this and give other people the opportunity to perform their own performance measurements.
>
> This looks great Nadav. The performance wins are really big. Have you investigated the bh and bullet regressions, though?

Thanks. Yes, I looked at both. The hot function in BH is “gravsub”. The vectorized IR looks fine and the assembly looks fine, but for some reason Instruments reports that the first vector-subtract instruction takes 18% of the time. The regression happens both with the VEX prefix and without. I suspected that the problem is the movupd's that load xmm0 and xmm1. I started looking at some performance counters on Friday, but I have not found anything suspicious yet.

+0x00 movupd 16(%rsi), %xmm0
+0x05 movupd 16(%rsp), %xmm1
+0x0b subpd  %xmm1, %xmm0   <———— 18% of the runtime of bh?
+0x0f movapd %xmm0, %xmm2
+0x13 mulsd  %xmm2, %xmm2
+0x17 xorpd  %xmm1, %xmm1
+0x1b addsd  %xmm2, %xmm1

I spent less time on Bullet. Bullet also has one hot function (“resolveSingleConstraintRowLowerLimit”). On this code the vectorizer generates several trees that use the <3 x float> type. This is risky because the loads/stores are inefficient, but unfortunately triples of RGB and XYZ values are very popular in some domains and we do want to vectorize them.
I skimmed through the IR and the assembly and I did not see anything too bad. The next step would be to do a binary search on the places where the vectorizer fires to locate the bad pattern.

On AVX we have another regression that I did not mention: Flops-7. When we vectorize, we cause more spills because we do a poor job scheduling non-destructive source instructions (related to PR10928). Hopefully Andy’s scheduler will fix this regression once it is enabled.

I did not measure code size, but I did measure compile time. There are 4-5 workloads (not counting workloads that run below 0.5 seconds) where the compile-time increase is more than 5%. I am aware of a problem in the (quadratic) code that looks for consecutive stores. This code calls SCEV too many times. I plan to fix this.

Thanks,
Nadav

> We should at least understand what is going wrong there. bh is pretty tiny, so it should be straightforward. It would also be really useful to see what the code size and compile time impact is.
>
> -Chris
>
>> — Performance Gains —
>> SingleSource/Benchmarks/Misc/matmul_f64_4x4 -53.68%
>> MultiSource/Benchmarks/Olden/power/power -18.55%
>> MultiSource/Benchmarks/TSVC/LoopRerolling-flt/LoopRerolling-flt -14.71%
>> SingleSource/Benchmarks/Misc/flops-6 -11.02%
>> SingleSource/Benchmarks/Misc/flops-5 -10.03%
>> MultiSource/Benchmarks/TSVC/LinearDependence-flt/LinearDependence-flt -8.37%
>> External/Nurbs/nurbs -7.98%
>> SingleSource/Benchmarks/Misc/pi -7.29%
>> External/SPEC/CINT2000/252_eon/252_eon -5.78%
>> External/SPEC/CFP2006/444_namd/444_namd -4.52%
>> External/SPEC/CFP2000/188_ammp/188_ammp -4.45%
>> MultiSource/Applications/SIBsim4/SIBsim4 -3.58%
>> MultiSource/Benchmarks/TSVC/LoopRerolling-dbl/LoopRerolling-dbl -3.52%
>> SingleSource/Benchmarks/Misc-C++/Large/sphereflake -2.96%
>> MultiSource/Benchmarks/TSVC/LinearDependence-dbl/LinearDependence-dbl -2.75%
>> MultiSource/Benchmarks/VersaBench/beamformer/beamformer -2.70%
>> MultiSource/Benchmarks/TSVC/NodeSplitting-dbl/NodeSplitting-dbl -1.95%
>> SingleSource/Benchmarks/Misc/flops -1.89%
>> SingleSource/Benchmarks/Misc/oourafft -1.71%
>> MultiSource/Benchmarks/mafft/pairlocalalign -1.16%
>> External/SPEC/CFP2006/447_dealII/447_dealII -1.06%
>>
>> — Regressions —
>> MultiSource/Benchmarks/Olden/bh/bh 22.47%
>> MultiSource/Benchmarks/Bullet/bullet 7.31%
>> SingleSource/Benchmarks/Misc-C++-EH/spirit 5.68%
>> SingleSource/Benchmarks/SmallPT/smallpt 3.91%
>>
>> Thanks,
>> Nadav
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Renato Golin
2013-Jul-15 13:48 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
Hi Nadav,

I think it's a great idea to have the SLP vectorizer enabled, but maybe we should trim the horrible cases first (regressions, +5% compile time, etc.). I don't mind a sub-5% compile-time increase at O3, nor do I mind sub-1% regressions in performance on some benchmarks IFF the majority of the benchmarks improve.

On 15 July 2013 06:55, Nadav Rotem <nrotem at apple.com> wrote:

> I suspected that the problem is the movupd's that load xmm0 and xmm1.

I've seen this before on ARM, and I agree, it looks like the load is constrained by some other condition or pipeline stall before that.

cheers,
--renato
Nadav Rotem
2013-Jul-23 22:33 UTC
[LLVMdev] Enabling the SLP vectorizer by default for -O3
Hi,

Sorry for the delay in response. I measured the code size change and noticed small changes in both directions for individual programs. I found a 30k binary-size growth for the entire test suite + SPEC. I attached an updated performance report that includes both compile-time and performance measurements.

Thanks,
Nadav

On Jul 14, 2013, at 10:55 PM, Nadav Rotem <nrotem at apple.com> wrote:

> Thanks. Yes, I looked at both. The hot function in BH is “gravsub”. [...]
----- Original Message -----

> Hi,
>
> Sorry for the delay in response. I measured the code size change and
> noticed small changes in both directions for individual programs. I
> found a 30k binary size growth for the entire testsuite + SPEC. I
> attached an updated performance report that includes both compile
> time and performance measurements.

I think that these numbers look good. Regarding the performance regressions:

This looks like noise:
MultiSource/Benchmarks/McCat/08-main/main 44.40% 0.0277 0.0400 0.0000

For these two:
MultiSource/Benchmarks/Olden/bh/bh 19.73% 1.1547 1.3825 0.0017
MultiSource/Benchmarks/Bullet/bullet 7.30% 3.6130 3.8767 0.0069

can you run them on a different CPU and see how generic these slowdowns are?

Thanks again,
Hal

> On Jul 14, 2013, at 10:55 PM, Nadav Rotem < nrotem at apple.com > wrote: [...]

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory