Hal Finkel
2011-Nov-08 10:45 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
I've attached the latest version of my autovectorization patch.

Working through the test suite has proved to be a productive experience ;) -- And almost all of the bugs that it revealed have now been fixed. There are still two programs that don't compile with vectorization turned on, and I'm working on those now, but in case anyone feels like playing with vectorization, this patch will probably work for you.

The largest three performance speedups are:

SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup

The largest three performance slowdowns are:

MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael - 114% slowdown
MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6% slowdown
SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown

(From these, I've excluded tests that took less than 0.1 seconds to run.)

Largest three compile-time slowdowns:

MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael - 1276% slowdown
SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown

Not everything slows down. MultiSource/Benchmarks/Prolangs-C++/city/city, for example, compiles 10% faster with vectorization enabled; but, for the most part, things certainly take longer to compile with vectorization enabled. The average slowdown over all tests was 29%, the median was 11%. On the other hand, the average speedup over all tests was 5.2%, the median was 1.3%.

Compared to previous patches, which had a minimum required chain length of 3 or 4, I've now made the default 6. While using a chain length of 4 worked well for targeted benchmarks, it caused an overall slowdown on almost all test-suite programs. Using a minimum length of 6 causes, on average, a speedup, so I think that is a better default choice.
-Hal

On Tue, 2011-11-01 at 18:54 -0500, Hal Finkel wrote:
> On Tue, 2011-11-01 at 16:59 -0500, Hal Finkel wrote:
> > On Tue, 2011-11-01 at 19:19 +0000, Tobias Grosser wrote:
> > > On 11/01/2011 06:32 PM, Hal Finkel wrote:
> > > > Any objections to me committing this? [And some relevant docs
> > > > changes] I think that it is ready at this point.
> > >
> > > First of all, I think it is great to see work starting on an
> > > autovectorizer for LLVM. Unfortunately I did not have time to test
> > > your vectorizer pass intensively, but here are my first comments:
> > >
> > > 1. This patch breaks the --enable-shared/BUILD_SHARED_LIBS build.
> > > The following patch fixes this for cmake:
> > > 0001-Add-vectorizer-to-libraries-used-by-Transforms-IPO.patch
> >
> > Thanks!
> >
> > > Can you check the autoconf build with --enable-shared?
> >
> > I will check.
>
> This appears to work as it should.
>
> > > 2. Did you run this pass on the llvm test-suite? Does your
> > > vectorizer introduce any correctness regressions? What are the top
> > > 10 compile-time increases/decreases? How about run time?
> >
> > I'll try to get this set up and post the results.
> >
> > > 3. I did not really test this intensively, but I had the feeling
> > > the compile-time increase for large basic blocks is quite a lot.
> > > I still need to extract a test case. Any comments on the
> > > complexity of your vectorizer?
> >
> > This may very well be true. As is, I would not recommend activating
> > this pass by default (at -O3) because it is fairly slow and the
> > resulting performance increase, while significant in many cases, is
> > not large enough to, IMHO, justify the extra base compile-time
> > increase. Ideally, this kind of vectorization should be the
> > "vectorizer of last resort" -- the pass that tries really hard to
> > squeeze the last little bit of vectorization possible out of the
> > code. At the moment, it is all that we have, but I hope that will
> > change.
> > I've not yet done any real profiling, so I'll hold off on commenting
> > about future performance improvements.
> >
> > Base complexity is a bit difficult: there are certainly a few stages,
> > including the initial one, that are O(n^2), where n is the number of
> > instructions in the block. The "connection-finding" stage should also
> > be O(n^2) in practice, but it is really iterating over
> > instruction-user pairs and so could be worse in pathological cases.
> > Note, however, that in the latter stages, that n^2 is not the number
> > of instructions in the block, but rather the number of (unordered)
> > candidate instruction pairs (which is going to be much less than the
> > n^2 from just the number of instructions in the block). It should be
> > possible to generate a compile-time scaling plot by taking a loop and
> > compiling it with partial unrolling, looking at how the compile time
> > changes with the unrolling limit; I'll try and do that.
>
> So for this test, I ran:
>
>     time opt -S -O3 -unroll-allow-partial -vectorize -o /dev/null q.ll
>
> where q.ll contains the output from clang -O3 of the vbor function from
> the benchmarks I've been posting recently. The first column is the
> value of -unroll-threshold, the second column is the time with
> vectorization, and the third column is the time without vectorization
> (time in seconds for a release build):
>
>     100     0.030    0.000
>     200     0.130    0.030
>     300     0.770    0.030
>     400     1.240    0.040
>     500     1.280    0.050
>     600     9.450    0.060
>     700    29.300    0.060
>
> I am not sure why the 400 and 500 times are so close. Obviously, it is
> not linear ;) I am not sure that enumerating the possible pairings can
> be done in a sub-quadratic way, but I will do some profiling and see if
> I can make things better. To be fair, this test creates a kind of
> worst-case scenario: an increasingly large block of instructions,
> almost all of which are potentially fusable.
> It may also be possible to design additional heuristics to help the
> situation. For example, we might introduce a target chain length such
> that, if the vectorizer finds a chain of that length, it selects it,
> foregoing the remainder of the search for the selected starting
> instruction. This kind of thing will require further research and
> testing.
>
> -Hal
>
> > I'm writing a paper on the vectorizer, so within a few weeks there
> > will be a very good description (complete with diagrams) :)
> >
> > > I plan to look into your vectorizer during the next couple of
> > > days/weeks, but will most probably not have the time to do this
> > > tonight. Sorry. :-(
> >
> > Not a problem; it seems that I have some homework to do first ;)
> >
> > Thanks,
> > Hal
> >
> > > Cheers
> > > Tobi

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm_bb_vectorize-20111107.diff
Type: text/x-patch
Size: 79455 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111108/3b39956c/attachment.bin>
Hal Finkel
2011-Nov-08 14:36 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
> On 11/08/2011 11:45 AM, Hal Finkel wrote:
> > I've attached the latest version of my autovectorization patch.
> >
> > Working through the test suite has proved to be a productive
> > experience ;) -- And almost all of the bugs that it revealed have now
> > been fixed. There are still two programs that don't compile with
> > vectorization turned on, and I'm working on those now, but in case
> > anyone feels like playing with vectorization, this patch will
> > probably work for you.
>
> Hey Hal,
>
> that is great news, especially as the numbers seem to show that
> vectorization has a significant performance impact. What did you
> compare exactly? 'clang -O3' against 'clang -O3 -mllvm -vectorize'?

Yes. [I've tested the current patch directly using opt -vectorize
-unroll-allow-partial; for running the test suite, I recompiled
llvm/clang to hardcode the options as I wanted them.]

> > The largest three performance speedups are:
> > SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
> > SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
> > SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
> >
> > The largest three performance slowdowns are:
> > MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> > 114% slowdown
> > MultiSource/Benchmarks/MiBench/network-patricia/network-patricia -
> > 66.6% slowdown
> > SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown
>
> Interesting. Do you understand what causes these slowdowns? Can your
> heuristic be improved?

I've not specifically looked at these cases. Generally, I've observed
slowdowns from the introduction of too many permutations per chain (the
chain-selection procedure can be changed to help this, and I'll work on
that).
Also, sometimes legalization of non-native vector operations creates
inefficient code; I'll look at those cases in more detail as well.

> > Largest three compile-time slowdowns:
> > MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> > 1276% slowdown
> > SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
> > MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown
>
> Yes, that is a lot. Do you understand if this time is invested well
> (does it give significant speedups)?

No, not always. Actually, the security-rijndael test not only takes the
longest to vectorize, but it also shows the largest runtime slowdown.
This is certainly something that I'll investigate.

> If I understood correctly, it seems your vectorizer has quadratic
> complexity, which may cause large slowdowns. Do you think it may be
> useful/possible to make it linear by introducing a constant upper
> bound somewhere, e.g., limiting it to 10/20/100 steps? Maybe we are
> lucky and most of the vectorization opportunities are close by (in
> some sense), such that we get most of the speedup by looking at a
> subset of the problem.

Yes, I agree; that makes a lot of sense. What would be even better is if
the loop unroller would intermix statements from the loops where
possible, instead of leaving it to the vectorizer to do all of the
grouping after the fact. That, I fear, is a whole other project.

> > Not everything slows down.
> > MultiSource/Benchmarks/Prolangs-C++/city/city, for example, compiles
> > 10% faster with vectorization enabled; but, for the most part,
> > things certainly take longer to compile with vectorization enabled.
> > The average slowdown over all tests was 29%, the median was 11%. On
> > the other hand, the average speedup over all tests was 5.2%, the
> > median was 1.3%.
>
> Nice. I think this is a great start.

Thanks!

> > Compared to previous patches, which had a minimum required chain
> > length of 3 or 4, I've now made the default 6.
> > While using a chain length of 4 worked well for targeted benchmarks,
> > it caused an overall slowdown on almost all test-suite programs.
> > Using a minimum length of 6 causes, on average, a speedup; so I
> > think that is a better default choice.
>
> I am also trying to understand whether it is possible to use your
> vectorizer for Polly. My idea is to do some clever loop unrolling.
>
> Starting from this loop:
>
>     for (int i = 0; i < 4; i++) {
>       A[i] += 1;
>       A[i] = B[i] + 3;
>       C[i] = A[i];
>     }
>
> the classical unroller would create this code:
>
>     A[0] += 1;
>     A[0] = B[0] + 3;
>     C[0] = A[0];
>
>     A[1] += 1;
>     A[1] = B[1] + 3;
>     C[1] = A[1];
>
>     A[2] += 1;
>     A[2] = B[2] + 3;
>     C[2] = A[2];
>
>     A[3] += 1;
>     A[3] = B[3] + 3;
>     C[3] = A[3];
>
> However, in case I can prove this loop is parallel, I want to create
> this code instead:
>
>     A[0] += 1;
>     A[1] += 1;
>     A[2] += 1;
>     A[3] += 1;
>
>     A[0] = B[0] + 3;
>     A[1] = B[1] + 3;
>     A[2] = B[2] + 3;
>     A[3] = B[3] + 3;
>
>     C[0] = A[0];
>     C[1] = A[1];
>     C[2] = A[2];
>     C[3] = A[3];
>
> I assume this will allow the vectorization of test cases that failed
> because of possible aliasing. However, I am more interested in whether
> the change in execution order could also improve the vectorization
> outcome or reduce the compile-time overhead of your vectorizer.

Yes, this would certainly help. By the way, the current implementation
will, by default, create unaligned vector loads and stores, but these
are generally broken up by the legalizer. This behavior can be
suppressed using the -bb-vectorize-aligned-only flag. It would be nice
if the loop unroller chose the unrolling factor to preserve the maximum
available alignment, but I don't think that it currently does so.

One problem with the current implementation is that it relies on
GetPointerBaseWithConstantOffset to determine whether two loads or
stores share the same base address.
This fails with partially unrolled loops because it cannot "undo" all of
the additions to the offset induction variable in order to understand
that some of the loads and stores are really adjacent in memory. This is
something that I think can be improved within the vectorizer itself, and
I'm planning on doing some work on this in the future.

Thanks for your feedback,
Hal

> Thanks for working on the vectorization!
> Cheers
> Tobi

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory