Hal Finkel
2011-Nov-08 14:36 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
> On 11/08/2011 11:45 AM, Hal Finkel wrote:
>> I've attached the latest version of my autovectorization patch.
>>
>> Working through the test suite has proved to be a productive
>> experience ;) -- And almost all of the bugs that it revealed have now
>> been fixed. There are still two programs that don't compile with
>> vectorization turned on, and I'm working on those now, but in case
>> anyone feels like playing with vectorization, this patch will probably
>> work for you.
>
> Hey Hal,
>
> this is great news, especially as the numbers seem to show that
> vectorization has a significant performance impact. What did you compare
> exactly? 'clang -O3' against 'clang -O3 -mllvm -vectorize'?

Yes. [I've tested the current patch directly using opt -vectorize
-unroll-allow-partial; for running the test suite I recompiled
llvm/clang to hardcode the options as I wanted them.]

>> The largest three performance speedups are:
>> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
>> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
>> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
>>
>> The largest three performance slowdowns are:
>> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
>> 114% slowdown
>> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia -
>> 66.6% slowdown
>> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown
>
> Interesting. Do you understand what causes these slowdowns? Can your
> heuristic be improved?

I've not specifically looked at these cases.

Generally, I've observed slowdowns from the introduction of too many
permutations per chain (the chain-selection procedure can be changed to
help with this, and I'll work on that). Also, legalization of non-native
vector operations sometimes creates inefficient code, and I'll look at
those cases in more detail as well.

>> Largest three compile-time slowdowns:
>> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
>> 1276% slowdown
>> SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
>> MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown
>
> Yes, that is a lot. Do you understand if this time is invested well
> (does it give significant speedups)?

No, not always. Actually, the security-rijndael test not only takes the
longest to vectorize, but it also shows the largest slowdown. This is
certainly something that I'll investigate.

> If I understood correctly, it seems your vectorizer has quadratic
> complexity, which may cause large slowdowns. Do you think it may be
> useful/possible to make it linear by introducing a constant upper bound
> somewhere, e.g. limiting it to 10/20/100 steps? Maybe we are lucky and
> most of the vectorization opportunities are close by (in some sense),
> such that we get most of the speedup by looking at a subset of the
> problem.

Yes, I agree. That makes a lot of sense.

What would be even better is if the loop unroller would intermix
statements from the unrolled iterations where possible instead of
leaving it to the vectorizer to do all of the grouping after the fact.
That, I fear, is a whole other project.

>> Not everything slows down; MultiSource/Benchmarks/Prolangs-C++/city/city,
>> for example, compiles 10% faster with vectorization enabled. But, for
>> the most part, things certainly take longer to compile with
>> vectorization enabled. The average compile-time slowdown over all tests
>> was 29%; the median was 11%. On the other hand, the average run-time
>> speedup over all tests was 5.2%; the median was 1.3%.
>
> Nice. I think this is a great start.

Thanks!

>> Compared to previous patches, which had a minimum required chain length
>> of 3 or 4, I've now made the default 6. While using a chain length of 4
>> worked well for targeted benchmarks, it caused an overall slowdown on
>> almost all test-suite programs. Using a minimum length of 6 causes, on
>> average, a speedup; so I think that is a better default choice.
>
> I am also trying to understand whether it is possible to use your
> vectorizer for Polly. My idea is to do some clever loop unrolling.
>
> Starting from this loop:
>
> for (int i = 0; i < 4; i++) {
>   A[i] += 1;
>   A[i] = B[i] + 3;
>   C[i] = A[i];
> }
>
> The classical unroller would create this code:
>
> A[0] += 1;
> A[0] = B[0] + 3;
> C[0] = A[0];
>
> A[1] += 1;
> A[1] = B[1] + 3;
> C[1] = A[1];
>
> A[2] += 1;
> A[2] = B[2] + 3;
> C[2] = A[2];
>
> A[3] += 1;
> A[3] = B[3] + 3;
> C[3] = A[3];
>
> However, in case I can prove this loop is parallel, I want to create
> this code:
>
> A[0] += 1;
> A[1] += 1;
> A[2] += 1;
> A[3] += 1;
>
> A[0] = B[0] + 3;
> A[1] = B[1] + 3;
> A[2] = B[2] + 3;
> A[3] = B[3] + 3;
>
> C[0] = A[0];
> C[1] = A[1];
> C[2] = A[2];
> C[3] = A[3];
>
> I assume this will allow the vectorization of test cases that failed
> because of possible aliasing. However, I am more interested in whether
> the execution-order change could also improve the vectorization outcome
> or reduce the compile-time overhead of your vectorizer.

Yes, this would certainly help.

By the way, by default the current implementation will create unaligned
vector loads and stores, but these are generally broken up by the
legalizer. This behavior can be suppressed using the
-bb-vectorize-aligned-only flag. It would be nice if the loop unroller
chose the unrolling factor to preserve the maximum available alignment,
but I don't think that it currently does so.

One problem with the current implementation is that it relies on
GetPointerBaseWithConstantOffset to determine if two loads or stores
share the same base address. This fails with partially-unrolled loops
because it cannot "undo" all of the additions to the offset induction
variable in order to understand that some of the loads and stores are
really adjacent in memory. This is something that I think can be
improved within the vectorizer itself, and I'm planning on doing some
work on this in the future.

Thanks for your feedback,
Hal

> Thanks for working on the vectorization.
> Cheers
> Tobi

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
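To make the payoff of the grouped form concrete: each block of four
independent scalar statements in Tobias's second listing maps onto a
single vector operation. In C with GCC/Clang vector extensions, the
intended end result corresponds roughly to the following (an
illustration only, not the pass's actual LLVM IR output; it assumes the
arrays are non-aliasing and 16-byte aligned, and the function name is
made up):

    typedef int v4si __attribute__((vector_size(16))); /* four 32-bit ints */

    /* What the grouped loop body should become after vectorization,
       assuming the arrays do not alias and are suitably aligned. */
    void grouped_body_vectorized(v4si *A, const v4si *B, v4si *C) {
      *A += (v4si){1, 1, 1, 1};        /* A[0..3] += 1          */
      *A  = *B + (v4si){3, 3, 3, 3};   /* A[0..3] = B[0..3] + 3 */
      *C  = *A;                        /* C[0..3] = A[0..3]     */
    }

A pairwise vectorizer reaches this shape by repeatedly fusing adjacent
independent statements, which is why the grouped ordering helps: the
fusable statements are already neighbors.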
Tobias Grosser
2011-Nov-08 19:24 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On 11/08/2011 03:36 PM, Hal Finkel wrote:
> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
>> On 11/08/2011 11:45 AM, Hal Finkel wrote:
>>> I've attached the latest version of my autovectorization patch.
>>>
>>> Working through the test suite has proved to be a productive
>>> experience ;) -- And almost all of the bugs that it revealed have now
>>> been fixed. There are still two programs that don't compile with
>>> vectorization turned on, and I'm working on those now, but in case
>>> anyone feels like playing with vectorization, this patch will probably
>>> work for you.
>>
>> Hey Hal,
>>
>> this is great news, especially as the numbers seem to show that
>> vectorization has a significant performance impact. What did you compare
>> exactly? 'clang -O3' against 'clang -O3 -mllvm -vectorize'?
>
> Yes. [I've tested the current patch directly using opt -vectorize
> -unroll-allow-partial; for running the test suite I recompiled
> llvm/clang to hardcode the options as I wanted them.]

You should not need to hack clang. As shown above, you should be able to
pass '-vectorize' to the optimizer by using '-mllvm -vectorize' in your
CFLAGS. Did you run the test suite at -O3?

Also, I think it would be interesting to compare, for the test-suite,
the performance of 'clang -O3 -mllvm -unroll-allow-partial' with
'clang -O3 -mllvm -unroll-allow-partial -mllvm -vectorize'. This will
show how much of the runtime overhead is due to the unrolling (it
produces more code that needs to be optimized) and which part is due to
vectorization. The same goes for the speedup: how much is caused by
unrolling and how much is actually caused by your pass?

>>> The largest three performance speedups are:
>>> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
>>> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
>>> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
>>>
>>> The largest three performance slowdowns are:
>>> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
>>> 114% slowdown
>>> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia -
>>> 66.6% slowdown
>>> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown
>>
>> Interesting. Do you understand what causes these slowdowns? Can your
>> heuristic be improved?
>
> I've not specifically looked at these cases.
>
> Generally, I've observed slowdowns from the introduction of too many
> permutations per chain (the chain-selection procedure can be changed to
> help with this, and I'll work on that). Also, legalization of non-native
> vector operations sometimes creates inefficient code, and I'll look at
> those cases in more detail as well.

Sure. That sounds reasonable. I am confident you can improve this
gradually.

>> If I understood correctly, it seems your vectorizer has quadratic
>> complexity, which may cause large slowdowns. Do you think it may be
>> useful/possible to make it linear by introducing a constant upper bound
>> somewhere, e.g. limiting it to 10/20/100 steps? Maybe we are lucky and
>> most of the vectorization opportunities are close by (in some sense),
>> such that we get most of the speedup by looking at a subset of the
>> problem.
>
> Yes, I agree. That makes a lot of sense.
>
> What would be even better is if the loop unroller would intermix
> statements from the unrolled iterations where possible instead of
> leaving it to the vectorizer to do all of the grouping after the fact.
> That, I fear, is a whole other project.

First, I do not understand the 'even better' here. To me it would be
great if, on one hand, the vectorizer could be constrained such that the
compile time is predictable, and if, on the other hand, we could make
the loop unroller create code in a way such that the constrained
vectorizer still performs the relevant transformations.

Also, what do you mean by 'if the loop unroller would intermix
statements from the unrolled iterations where possible'? Are you
referring to the grouped unrolling shown in my last mail?

>>> Compared to previous patches, which had a minimum required chain length
>>> of 3 or 4, I've now made the default 6. While using a chain length of 4
>>> worked well for targeted benchmarks, it caused an overall slowdown on
>>> almost all test-suite programs. Using a minimum length of 6 causes, on
>>> average, a speedup; so I think that is a better default choice.
>>
>> I am also trying to understand whether it is possible to use your
>> vectorizer for Polly. My idea is to do some clever loop unrolling.
>>
>> Starting from this loop:
>>
>> for (int i = 0; i < 4; i++) {
>>   A[i] += 1;
>>   A[i] = B[i] + 3;
>>   C[i] = A[i];
>> }
>>
>> The classical unroller would create this code:
>>
>> A[0] += 1;
>> A[0] = B[0] + 3;
>> C[0] = A[0];
>>
>> A[1] += 1;
>> A[1] = B[1] + 3;
>> C[1] = A[1];
>>
>> A[2] += 1;
>> A[2] = B[2] + 3;
>> C[2] = A[2];
>>
>> A[3] += 1;
>> A[3] = B[3] + 3;
>> C[3] = A[3];
>>
>> However, in case I can prove this loop is parallel, I want to create
>> this code:
>>
>> A[0] += 1;
>> A[1] += 1;
>> A[2] += 1;
>> A[3] += 1;
>>
>> A[0] = B[0] + 3;
>> A[1] = B[1] + 3;
>> A[2] = B[2] + 3;
>> A[3] = B[3] + 3;
>>
>> C[0] = A[0];
>> C[1] = A[1];
>> C[2] = A[2];
>> C[3] = A[3];
>>
>> I assume this will allow the vectorization of test cases that failed
>> because of possible aliasing. However, I am more interested in whether
>> the execution-order change could also improve the vectorization outcome
>> or reduce the compile-time overhead of your vectorizer.
>
> Yes, this would certainly help.

In which way? How much? Would it be possible to restrict the vectorizer
such that its complexity is linear, while at the same time we can ensure
it will still vectorize code that is 'group unrolled' (which means
unrolled in the way shown above)?

> By the way, by default the current implementation will create unaligned
> vector loads and stores, but these are generally broken up by the
> legalizer. This behavior can be suppressed using the
> -bb-vectorize-aligned-only flag. It would be nice if the loop unroller
> chose the unrolling factor to preserve the maximum available alignment,
> but I don't think that it currently does so.

I don't know, but it sounds like a good thing to do.

> One problem with the current implementation is that it relies on
> GetPointerBaseWithConstantOffset to determine if two loads or stores
> share the same base address. This fails with partially-unrolled loops
> because it cannot "undo" all of the additions to the offset induction
> variable in order to understand that some of the loads and stores are
> really adjacent in memory. This is something that I think can be
> improved within the vectorizer itself, and I'm planning on doing some
> work on this in the future.

Here you may also want to look into ScalarEvolution. Basically, two
loads access adjacent memory if the difference of the scalar evolutions
of the two load addresses is equal to sizeof(element_type).
ScalarEvolution should be a lot more general than
GetPointerBaseWithConstantOffset().

Cheers
Tobi
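To make the ScalarEvolution idea above concrete, a consecutiveness check
of roughly this shape should work (a minimal sketch against
LLVM-3.0-era APIs; loadsAreConsecutive is an illustrative name, not part
of the patch, and real code would also need to check that both loads
have the same type and are non-volatile):

    #include "llvm/Analysis/ScalarEvolution.h"
    #include "llvm/Analysis/ScalarEvolutionExpressions.h"
    #include "llvm/Instructions.h"
    #include "llvm/Target/TargetData.h"
    using namespace llvm;

    // Do L1 and L2 load adjacent elements? Two loads are consecutive
    // when SCEV(L2's address) - SCEV(L1's address) equals the store
    // size of the loaded element type.
    static bool loadsAreConsecutive(LoadInst *L1, LoadInst *L2,
                                    ScalarEvolution &SE, TargetData &TD) {
      const SCEV *A1 = SE.getSCEV(L1->getPointerOperand());
      const SCEV *A2 = SE.getSCEV(L2->getPointerOperand());
      const SCEV *Diff = SE.getMinusSCEV(A2, A1);
      if (const SCEVConstant *C = dyn_cast<SCEVConstant>(Diff))
        return C->getValue()->equalsInt(TD.getTypeStoreSize(L1->getType()));
      return false;
    }

Unlike GetPointerBaseWithConstantOffset, ScalarEvolution folds the
induction-variable arithmetic of a partially-unrolled loop, so the
difference comes out as a constant even when the two addresses are built
from chains of adds.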
Hal Finkel
2011-Nov-08 22:29 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:
> On 11/08/2011 03:36 PM, Hal Finkel wrote:
>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
>>> On 11/08/2011 11:45 AM, Hal Finkel wrote:
>>>> I've attached the latest version of my autovectorization patch.
>>>>
>>>> Working through the test suite has proved to be a productive
>>>> experience ;) -- And almost all of the bugs that it revealed have
>>>> now been fixed. There are still two programs that don't compile
>>>> with vectorization turned on, and I'm working on those now, but in
>>>> case anyone feels like playing with vectorization, this patch will
>>>> probably work for you.
>>>
>>> Hey Hal,
>>>
>>> this is great news, especially as the numbers seem to show that
>>> vectorization has a significant performance impact. What did you
>>> compare exactly? 'clang -O3' against 'clang -O3 -mllvm -vectorize'?
>>
>> Yes. [I've tested the current patch directly using opt -vectorize
>> -unroll-allow-partial; for running the test suite I recompiled
>> llvm/clang to hardcode the options as I wanted them.]
>
> You should not need to hack clang. As shown above, you should be able
> to pass '-vectorize' to the optimizer by using '-mllvm -vectorize' in
> your CFLAGS. Did you run the test suite at -O3?

That's good to know. Yes, it was run at -O3.

> Also, I think it would be interesting to compare, for the test-suite,
> the performance of 'clang -O3 -mllvm -unroll-allow-partial' with
> 'clang -O3 -mllvm -unroll-allow-partial -mllvm -vectorize'. This will
> show how much of the runtime overhead is due to the unrolling (it
> produces more code that needs to be optimized) and which part is due
> to vectorization. The same goes for the speedup: how much is caused by
> unrolling and how much is actually caused by your pass?

Just turning on partial unrolling yields, on average, a 4.7% slowdown
(0% median), and a 28.6% increase in compile time (21.8% median).

>>>> The largest three performance speedups are:
>>>> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup

With just partial unrolling: 64.8% speedup

>>>> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup

With just partial unrolling: 58.2% speedup

>>>> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup

With just partial unrolling: 11.8% speedup

>>>> The largest three performance slowdowns are:
>>>> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
>>>> 114% slowdown

With just partial unrolling: 0% slowdown (47.8% slowdown in compile time)

>>>> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia -
>>>> 66.6% slowdown

With just partial unrolling: 8.3% speedup

>>>> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown

With just partial unrolling: 8.3% speedup

Here are the top-3 lists comparing vectorization with partial unrolling
to just partial unrolling:

Top speedups:
SingleSource/Benchmarks/Misc/flops-7 - 44.1% speedup
SingleSource/Benchmarks/Misc/flops-1 - 42.3% speedup
MultiSource/Applications/sqlite3/sqlite3 - 42.3% speedup

Top slowdowns:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
114% slowdown
MultiSource/Benchmarks/MiBench/network-patricia/network-patricia -
81.8% slowdown
SingleSource/Benchmarks/Misc/flops-8 - 79% slowdown

Top compile-time slowdowns:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
832% slowdown
SingleSource/Benchmarks/Misc/salsa20 - 600% slowdown
MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 263% slowdown

For this comparison, however (unlike comparing to plain -O3), there are
a significant number of compile-time speedups (I guess that this is
because vectorization can reduce the number of instructions processed by
later passes).

Top compile-time speedups:
SingleSource/Benchmarks/Stanford/Oscar - 46% speedup
SingleSource/Regression/C/bigstack - 45% speedup
SingleSource/Benchmarks/BenchmarkGame/fannkuch - 45% speedup

>>> Interesting. Do you understand what causes these slowdowns? Can your
>>> heuristic be improved?
>>
>> I've not specifically looked at these cases.
>>
>> Generally, I've observed slowdowns from the introduction of too many
>> permutations per chain (the chain-selection procedure can be changed
>> to help with this, and I'll work on that). Also, legalization of
>> non-native vector operations sometimes creates inefficient code, and
>> I'll look at those cases in more detail as well.
>
> Sure. That sounds reasonable. I am confident you can improve this
> gradually.
>
>>> If I understood correctly, it seems your vectorizer has quadratic
>>> complexity, which may cause large slowdowns. Do you think it may be
>>> useful/possible to make it linear by introducing a constant upper
>>> bound somewhere, e.g. limiting it to 10/20/100 steps? Maybe we are
>>> lucky and most of the vectorization opportunities are close by (in
>>> some sense), such that we get most of the speedup by looking at a
>>> subset of the problem.
>>
>> Yes, I agree. That makes a lot of sense.
>>
>> What would be even better is if the loop unroller would intermix
>> statements from the unrolled iterations where possible instead of
>> leaving it to the vectorizer to do all of the grouping after the
>> fact. That, I fear, is a whole other project.
>
> First, I do not understand the 'even better' here. To me it would be
> great if, on one hand, the vectorizer could be constrained such that
> the compile time is predictable, and if, on the other hand, we could
> make the loop unroller create code in a way such that the constrained
> vectorizer still performs the relevant transformations.

I was not clear; I meant that imposing a cutoff is likely to work, but
it would work even better if the loop unroller produced code more
conducive to vectorization.

> Also, what do you mean by 'if the loop unroller would intermix
> statements from the unrolled iterations where possible'? Are you
> referring to the grouped unrolling shown in my last mail?

Yes. Also, the code necessary to take advantage of grouped unrolling
already exists in the vectorizer: enabling the flag
-bb-vectorize-fast-dep causes the vectorizer to stop searching for
instruction pairings after the first use of an instruction (a sketch of
this search shape follows this message).

>>>> Compared to previous patches, which had a minimum required chain
>>>> length of 3 or 4, I've now made the default 6. While using a chain
>>>> length of 4 worked well for targeted benchmarks, it caused an
>>>> overall slowdown on almost all test-suite programs. Using a minimum
>>>> length of 6 causes, on average, a speedup; so I think that is a
>>>> better default choice.
>>>
>>> I am also trying to understand whether it is possible to use your
>>> vectorizer for Polly. My idea is to do some clever loop unrolling.
>>>
>>> Starting from this loop:
>>>
>>> for (int i = 0; i < 4; i++) {
>>>   A[i] += 1;
>>>   A[i] = B[i] + 3;
>>>   C[i] = A[i];
>>> }
>>>
>>> The classical unroller would create this code:
>>>
>>> A[0] += 1;
>>> A[0] = B[0] + 3;
>>> C[0] = A[0];
>>>
>>> A[1] += 1;
>>> A[1] = B[1] + 3;
>>> C[1] = A[1];
>>>
>>> A[2] += 1;
>>> A[2] = B[2] + 3;
>>> C[2] = A[2];
>>>
>>> A[3] += 1;
>>> A[3] = B[3] + 3;
>>> C[3] = A[3];
>>>
>>> However, in case I can prove this loop is parallel, I want to create
>>> this code:
>>>
>>> A[0] += 1;
>>> A[1] += 1;
>>> A[2] += 1;
>>> A[3] += 1;
>>>
>>> A[0] = B[0] + 3;
>>> A[1] = B[1] + 3;
>>> A[2] = B[2] + 3;
>>> A[3] = B[3] + 3;
>>>
>>> C[0] = A[0];
>>> C[1] = A[1];
>>> C[2] = A[2];
>>> C[3] = A[3];
>>>
>>> I assume this will allow the vectorization of test cases that failed
>>> because of possible aliasing. However, I am more interested in
>>> whether the execution-order change could also improve the
>>> vectorization outcome or reduce the compile-time overhead of your
>>> vectorizer.
>>
>> Yes, this would certainly help.
>
> In which way? How much? Would it be possible to restrict the
> vectorizer such that its complexity is linear, while at the same time
> we can ensure it will still vectorize code that is 'group unrolled'
> (which means unrolled in the way shown above)?

It would help because we could stop the pairing search after an
instruction's first use; that is significantly faster and generates many
fewer potential pairings that need to be considered by the later stages.
Exactly "how much" it would help I can't yet answer; I'd need to
construct actual benchmarks for that. If you have some group-unrolled
code, then it will be simple to test: just pass -bb-vectorize-fast-dep.

>> By the way, by default the current implementation will create
>> unaligned vector loads and stores, but these are generally broken up
>> by the legalizer. This behavior can be suppressed using the
>> -bb-vectorize-aligned-only flag. It would be nice if the loop
>> unroller chose the unrolling factor to preserve the maximum available
>> alignment, but I don't think that it currently does so.
>
> I don't know, but it sounds like a good thing to do.
>
>> One problem with the current implementation is that it relies on
>> GetPointerBaseWithConstantOffset to determine if two loads or stores
>> share the same base address. This fails with partially-unrolled loops
>> because it cannot "undo" all of the additions to the offset induction
>> variable in order to understand that some of the loads and stores are
>> really adjacent in memory. This is something that I think can be
>> improved within the vectorizer itself, and I'm planning on doing some
>> work on this in the future.
>
> Here you may also want to look into ScalarEvolution. Basically, two
> loads access adjacent memory if the difference of the scalar
> evolutions of the two load addresses is equal to sizeof(element_type).
> ScalarEvolution should be a lot more general than
> GetPointerBaseWithConstantOffset().

Thanks! That sounds great; I'll have to look at that.

-Hal

> Cheers
> Tobi

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
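The search-shape difference that -bb-vectorize-fast-dep makes, as
described above, can be sketched as follows (illustrative structure
only, with LLVM-3.0-era include paths; isPairableWith and
recordCandidatePair are hypothetical stand-ins for the pass's real
legality check and bookkeeping):

    #include "llvm/BasicBlock.h"
    #include "llvm/Instruction.h"
    using namespace llvm;

    // Hypothetical stand-ins for the pass's real legality test and
    // candidate bookkeeping.
    static bool isPairableWith(Instruction *, Instruction *) {
      return false; // placeholder: the real pass checks types, deps, etc.
    }
    static void recordCandidatePair(Instruction *, Instruction *) {}

    // Does J use the value computed by I?
    static bool usesValue(Instruction *J, Instruction *I) {
      for (unsigned k = 0, e = J->getNumOperands(); k != e; ++k)
        if (J->getOperand(k) == I)
          return true;
      return false;
    }

    // Shape of the candidate-pair search: without the cutoff the inner
    // loop scans to the end of the block (quadratic in block size);
    // with FastDep it stops at I's first use. Grouped-unrolled code
    // places the matching instructions before I's first use, so the
    // early break loses nothing while keeping the scan short.
    static void findCandidatePairs(BasicBlock &BB, bool FastDep) {
      for (BasicBlock::iterator I = BB.begin(), E = BB.end(); I != E; ++I) {
        BasicBlock::iterator J = I;
        for (++J; J != E; ++J) {
          if (isPairableWith(&*I, &*J))
            recordCandidatePair(&*I, &*J);
          if (FastDep && usesValue(&*J, &*I))
            break;
        }
      }
    }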
Hal Finkel
2011-Nov-10 21:25 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:
> On 11/08/2011 03:36 PM, Hal Finkel wrote:
>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
>>> On 11/08/2011 11:45 AM, Hal Finkel wrote:
>>>> I've attached the latest version of my autovectorization patch.
>>>>
>>>> Working through the test suite has proved to be a productive
>>>> experience ;) -- And almost all of the bugs that it revealed have
>>>> now been fixed. There are still two programs that don't compile
>>>> with vectorization turned on, and I'm working on those now, but in
>>>> case anyone feels like playing with vectorization, this patch will
>>>> probably work for you.
>>>
>>> Hey Hal,
>>>
>>> this is great news, especially as the numbers seem to show that
>>> vectorization has a significant performance impact. What did you
>>> compare exactly? 'clang -O3' against 'clang -O3 -mllvm -vectorize'?
>>
>> Yes. [I've tested the current patch directly using opt -vectorize
>> -unroll-allow-partial; for running the test suite I recompiled
>> llvm/clang to hardcode the options as I wanted them.]
>
> You should not need to hack clang. As shown above, you should be able
> to pass '-vectorize' to the optimizer by using '-mllvm -vectorize' in
> your CFLAGS.

-mllvm -unroll-allow-partial works because -unroll-allow-partial is an
option to the pass, but I had added -vectorize as an option to opt
itself, and options to opt do not get picked up by clang. Would it be
better to add -vectorize as an option in IPO/PassManagerBuilder? Maybe
I'll try doing it that way instead; a sketch of that approach follows
this message.

-Hal

> Did you run the test suite at -O3?
>
> Also, I think it would be interesting to compare, for the test-suite,
> the performance of 'clang -O3 -mllvm -unroll-allow-partial' with
> 'clang -O3 -mllvm -unroll-allow-partial -mllvm -vectorize'. This will
> show how much of the runtime overhead is due to the unrolling (it
> produces more code that needs to be optimized) and which part is due
> to vectorization. The same goes for the speedup: how much is caused by
> unrolling and how much is actually caused by your pass?
>
>>>> The largest three performance speedups are:
>>>> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
>>>> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
>>>> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
>>>>
>>>> The largest three performance slowdowns are:
>>>> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
>>>> 114% slowdown
>>>> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia -
>>>> 66.6% slowdown
>>>> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown
>>>
>>> Interesting. Do you understand what causes these slowdowns? Can your
>>> heuristic be improved?
>>
>> I've not specifically looked at these cases.
>>
>> Generally, I've observed slowdowns from the introduction of too many
>> permutations per chain (the chain-selection procedure can be changed
>> to help with this, and I'll work on that). Also, legalization of
>> non-native vector operations sometimes creates inefficient code, and
>> I'll look at those cases in more detail as well.
>
> Sure. That sounds reasonable. I am confident you can improve this
> gradually.
>
>>> If I understood correctly, it seems your vectorizer has quadratic
>>> complexity, which may cause large slowdowns. Do you think it may be
>>> useful/possible to make it linear by introducing a constant upper
>>> bound somewhere, e.g. limiting it to 10/20/100 steps? Maybe we are
>>> lucky and most of the vectorization opportunities are close by (in
>>> some sense), such that we get most of the speedup by looking at a
>>> subset of the problem.
>>
>> Yes, I agree. That makes a lot of sense.
>>
>> What would be even better is if the loop unroller would intermix
>> statements from the unrolled iterations where possible instead of
>> leaving it to the vectorizer to do all of the grouping after the
>> fact. That, I fear, is a whole other project.
>
> First, I do not understand the 'even better' here. To me it would be
> great if, on one hand, the vectorizer could be constrained such that
> the compile time is predictable, and if, on the other hand, we could
> make the loop unroller create code in a way such that the constrained
> vectorizer still performs the relevant transformations.
>
> Also, what do you mean by 'if the loop unroller would intermix
> statements from the unrolled iterations where possible'? Are you
> referring to the grouped unrolling shown in my last mail?
>
>>>> Compared to previous patches, which had a minimum required chain
>>>> length of 3 or 4, I've now made the default 6. While using a chain
>>>> length of 4 worked well for targeted benchmarks, it caused an
>>>> overall slowdown on almost all test-suite programs. Using a minimum
>>>> length of 6 causes, on average, a speedup; so I think that is a
>>>> better default choice.
>>>
>>> I am also trying to understand whether it is possible to use your
>>> vectorizer for Polly. My idea is to do some clever loop unrolling.
>>>
>>> Starting from this loop:
>>>
>>> for (int i = 0; i < 4; i++) {
>>>   A[i] += 1;
>>>   A[i] = B[i] + 3;
>>>   C[i] = A[i];
>>> }
>>>
>>> The classical unroller would create this code:
>>>
>>> A[0] += 1;
>>> A[0] = B[0] + 3;
>>> C[0] = A[0];
>>>
>>> A[1] += 1;
>>> A[1] = B[1] + 3;
>>> C[1] = A[1];
>>>
>>> A[2] += 1;
>>> A[2] = B[2] + 3;
>>> C[2] = A[2];
>>>
>>> A[3] += 1;
>>> A[3] = B[3] + 3;
>>> C[3] = A[3];
>>>
>>> However, in case I can prove this loop is parallel, I want to create
>>> this code:
>>>
>>> A[0] += 1;
>>> A[1] += 1;
>>> A[2] += 1;
>>> A[3] += 1;
>>>
>>> A[0] = B[0] + 3;
>>> A[1] = B[1] + 3;
>>> A[2] = B[2] + 3;
>>> A[3] = B[3] + 3;
>>>
>>> C[0] = A[0];
>>> C[1] = A[1];
>>> C[2] = A[2];
>>> C[3] = A[3];
>>>
>>> I assume this will allow the vectorization of test cases that failed
>>> because of possible aliasing. However, I am more interested in
>>> whether the execution-order change could also improve the
>>> vectorization outcome or reduce the compile-time overhead of your
>>> vectorizer.
>>
>> Yes, this would certainly help.
>
> In which way? How much? Would it be possible to restrict the
> vectorizer such that its complexity is linear, while at the same time
> we can ensure it will still vectorize code that is 'group unrolled'
> (which means unrolled in the way shown above)?
>
>> By the way, by default the current implementation will create
>> unaligned vector loads and stores, but these are generally broken up
>> by the legalizer. This behavior can be suppressed using the
>> -bb-vectorize-aligned-only flag. It would be nice if the loop
>> unroller chose the unrolling factor to preserve the maximum available
>> alignment, but I don't think that it currently does so.
>
> I don't know, but it sounds like a good thing to do.
>
>> One problem with the current implementation is that it relies on
>> GetPointerBaseWithConstantOffset to determine if two loads or stores
>> share the same base address. This fails with partially-unrolled loops
>> because it cannot "undo" all of the additions to the offset induction
>> variable in order to understand that some of the loads and stores are
>> really adjacent in memory. This is something that I think can be
>> improved within the vectorizer itself, and I'm planning on doing some
>> work on this in the future.
>
> Here you may also want to look into ScalarEvolution. Basically, two
> loads access adjacent memory if the difference of the scalar
> evolutions of the two load addresses is equal to sizeof(element_type).
> ScalarEvolution should be a lot more general than
> GetPointerBaseWithConstantOffset().
>
> Cheers
> Tobi

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
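For the IPO/PassManagerBuilder route Hal raises above, a minimal sketch
might look like this (hypothetical: the flag name, the
createBBVectorizePass() factory, and the choice of cleanup passes are
illustrative, not the committed patch):

    #include "llvm/PassManager.h"
    #include "llvm/Support/CommandLine.h"
    #include "llvm/Transforms/Scalar.h"
    using namespace llvm;

    FunctionPass *createBBVectorizePass(); // provided by the patch

    static cl::opt<bool>
    RunBBVectorization("vectorize", cl::init(false),
                       cl::desc("Run the basic-block vectorization pass"));

    // Called from PassManagerBuilder::populateModulePassManager(),
    // after the loop passes (and thus after any partial unrolling).
    static void addBBVectorizePasses(PassManagerBase &MPM) {
      if (!RunBBVectorization)
        return;
      MPM.add(createBBVectorizePass());
      // Clean up the newly formed vector code.
      MPM.add(createInstructionCombiningPass());
      MPM.add(createGVNPass());
    }

Because the flag would then be registered as a cl::opt inside the LLVM
libraries rather than as an option to opt, clang's -mllvm forwarding
('-mllvm -vectorize') would pick it up without rebuilding clang, which
is the behavior Tobias expected.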