thr3ads.net - llvm dev - [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass [Nov 2011]

If this information is useful, please help other people find it:
Share via:

Hal Finkel

2011-Nov-08 22:29 UTC

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:> On 11/08/2011 03:36 PM, Hal Finkel wrote:
> > On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
> >> On 11/08/2011 11:45 AM, Hal Finkel wrote:
> >>> I've attached the latest version of my autovectorization
patch.
> >>>
> >>> Working through the test suite has proved to be a productive
> >>> experience ;) -- And almost all of the bugs that it revealed
have now
> >>> been fixed. There are still two programs that don't
compile with
> >>> vectorization turned on, and I'm working on those now, but
in case
> >>> anyone feels like playing with vectorization, this patch will
probably
> >>> work for you.
> >>
> >> Hey Hal,
> >>
> >> those are great news. Especially as the numbers seem to show that
> >> vectorization has a significant performance impact. What did you
compare
> >> exactly. 'clang -O3' against 'clang -O3 -mllvm
-vectorize'?
> >
> > Yes. [I've tested the current patch directly using opt -vectorize
> > -unroll-allow-partial; for running the test suite I recompiled
> > llvm/clang to hardcode the options as I wanted them].
> 
> You should not need to hack clang. As shown above, you should be able to
> pass '-vectorize' to the optimizer by using '-mllvm
-vectorize' in your
> CFLAGS. Did you run the test suite at -O3?
That's good to know. Yes, it was run at -O3.
> 
> Also I think it would be interesting to compare for the test-suite the 
> performance of
> 'clang -O3 -mllvm -unroll-allow-partial' with
> 'clang -O3 -mllvm -unroll-allow-partial -mllvm -vectorize'. It will
show
> how much of the runtime overhead is due to the unrolling (produces more 
> code that needs to be optimized) and which part is due to vectorization. 
> The same counts for the speedup. How much is caused by
> unrolling and how much is actually caused by your pass.
Just turning on partial unrolling yields, on average, a 4.7% slowdown
(0% median), and a 28.6% increase in compile time (21.8% median).
> 
> >>> The largest three performance speedups are:
> >>> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
With just partial unrolling: 64.8% speedup> >>> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
With just partial unrolling: 58.2% speedup> >>> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
With just partial unrolling: 11.8% speedup> >>>
> >>> The largest three performance slowdowns are:
> >>>
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> >>> 114% slowdownWith just partial unrolling: 0% slowdown (47.8% slowdown in compile
time)> >>>
MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
> >>> slowdown
With just partial unrolling: 8.3% speedup> >>> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown
With just partial unrolling: 8.3% speedup> >>>
Here are the top-3 lists comparing vectorization with partial unrolling
to just with partial unrolling:

Top speedups:
SingleSource/Benchmarks/Misc/flops-7 - 44.1% speedup
SingleSource/Benchmarks/Misc/flops-1 - 42.3% speedup
MultiSource/Applications/sqlite3/sqlite3 - 42.3% speedup

Top slowdowns:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
114% slowdown
MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 81.8%
slowdown
SingleSource/Benchmarks/Misc/flops-8 - 79% slowdown

Top compile-time slowdowns:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
832% slowdown
SingleSource/Benchmarks/Misc/salsa20 - 600% slowdown
MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 263% slowdown

For this comparison, however (unlike comparing to plain -O3), there are
a significant number of compile-time speedups (I guess that this is
because vectorization can reduce the number of instructions processed by
later passes). Top compile-time speedups:
SingleSource/Benchmarks/Stanford/Oscar - 46% speedup
SingleSource/Regression/C/bigstack - 45% speedup
SingleSource/Benchmarks/BenchmarkGame/fannkuch - 45% speedup
> >> Interesting. Do you understand what causes these slowdowns? Can
your
> >> heuristic be improved?
> >
> > I've not specifically looked at these cases.
> >
> > Generally, I've observed slowdowns from the introduction of too
many
> > permutations per chain (the chain selection procedure can be changed
to
> > help this, and I'll work on that). Also, sometimes legalization of
> > non-native vector operations creates inefficient code, and I'll
also
> > look at these cases in more detail.
> Sure. That sounds reasonable. I am confident you can improve this
gradually.
> 
> >> If I understood correctly it seems your vectorizer has quadratic
> >> complexity which may cause large slowdowns. Do you think it may be
> >> useful/possible to make it linear by introducing a constant upper
bound
> >> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky
and
> >> most of the vectorization opportunities are close by (in some
sense),
> >> such that we get most of the speedup by locking at a subset of the
problem.
> >
> > Yes, I agree. That makes a lot of sense.
> >
> > What would be even better is if the loop unroller would intermix
> > statements from the loops where possible instead of leaving it to the
> > vectorizer to do all of the grouping after the fact. That, I fear, is
a
> > whole other project.
> 
> First, I do not understand the 'even better' here. To me it would
be
> great if on one hand the vectorizer could be constrained, such that the 
> compile time is predictable. And on the other hand, we could make the 
> loop unroller create code in a way such that the constrained vectorizer 
> still performs the relevant transformations.
I was not clear; I meant that imposing a cut off is likely to work, but
it would work even better if the loop unroller produced code
more-conducive to vectorization.
> 
> Also, what do you mean with 'if the loop unroller would intermix
> statements from the loops where possible'. Are you referring to the
> grouped unrolling as shown in my the last mail?
Yes.

Also, the code necessary to take advantage of grouped unrolling already
exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
causes the vectorizer to stop searching for instruction pairings after
the first use of an instruction.
> 
> 
> >>> Compared to previous patches, which had a minimum required
chain length
> >>> of 3 or 4, I've now made the default 6. While using a
chain length of 4
> >>> worked well for targeted benchmarks, it caused an overall
slowdown on
> >>> almost all test-suite programs. Using a minimum length of 6
causes, on
> >>> average, a speedup; so I think that is a better default
choice.
> >>
> >> I also try to understand if it is possible to use your vectorizer
for
> >> Polly. My idea is to do some clever loop unrolling.
> >>
> >> Starting from this loop.
> >>
> >> for (int i = 0; i<  4; i++)
> >>      A[i] += 1;
> >>      A[i] = B[i] + 3;
> >>      C[i] = A[i];
> >>
> >> The classical unroller would create this code:
> >>
> >>      A[0] += 1;
> >>      A[0] = B[i] + 3;
> >>      C[0] = A[i];
> >>
> >>      A[1] += 1;
> >>      A[1] = B[i] + 3;
> >>      C[1] = A[i];
> >>
> >>      A[2] += 1;
> >>      A[2] = B[i] + 3;
> >>      C[2] = A[i];
> >>
> >>      A[3] += 1;
> >>      A[3] = B[i] + 3;
> >>      C[3] = A[i];
> >>
> >> However, in case I can prove this loop is parallel, I want to
create
> >> this code:
> >>
> >>      A[0] += 1;
> >>      A[1] += 1;
> >>      A[2] += 1;
> >>      A[3] += 1;
> >>
> >>      A[0] = B[i] + 3;
> >>      A[1] = B[i] + 3;
> >>      A[2] = B[i] + 3;
> >>      A[3] = B[i] + 3;
> >>
> >>      C[0] = A[i];
> >>      C[1] = A[i];
> >>      C[2] = A[i];
> >>      C[3] = A[i];
> >>
> >> I assume this will allow the vectorization of test cases, that
failed
> >> because of possible aliasing. However, I am more interested, if
the
> >> execution order change could also improve the vectorization
outcome or
> >> reduce compile time overhead of your vectorizer.
> >
> > Yes, this would certainly help.
> 
> In which way? How much? Would it be possible to restrict the vectorizer 
> such that it's complexity is linear, while at the same time we can 
> ensure it will still vectorize code that is 'group unrolled' (which
> means unrolled in the way shown above?
It would help because we could stop the pairing search after an
instructions first use; and that is significantly faster and generates
many fewer potential pairings that need to be considered by the later
stages. Exactly "how much" it would help I can't yet answer;
I'd need to
construct actual benchmarks for that. If you have some group-unrolled
code, then it will be simple to test: just pass -bb-vectorize-fast-dep.
> 
> > By the way, the current implementation, by default, it will create
> > unaligned vector loads and stores, but these are generally broken up
by
> > the legalizer. This behavior can be suppressed using the
> > -bb-vectorize-aligned-only flag. It would be nice if the loop unroller
> > chose the unrolling factor to preserve the maximum available
alignment,
> > but I don't think that it currently does so.
> I don't know, but it sounds like a good thing to do.
> 
> > One problem with the current implementation is that it relies on
> > GetPointerBaseWithConstantOffset to determine if two loads or stores
> > share the same base address. This fails with partially-unrolled loops
> > because it cannot "undo" all of the additions to the offset
induction
> > variable in order to understand that some of the loads and stores are
> > really adjacent in memory. This is something that I think can be
> > improved within the vectorizer itself, and I'm planning on doing
some
> > work on this in the future.
> Here you may also want to look into ScalarEvolution. Basically two loads
> access adjacent memory if the difference of the scalar evolution of the 
> two load addresses is equal to sizeof(element_type). ScalarEvolution 
> should be a lot more general than GetPointerBaseWithConstantOffset().
Thanks! That sounds great; I'll have to look at that.

 -Hal
> 
> Cheers
> Tobi
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

Tobias Grosser

2011-Nov-10 22:07 UTC

head link

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On 11/08/2011 11:29 PM, Hal Finkel wrote:> On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:
>> On 11/08/2011 03:36 PM, Hal Finkel wrote:
>>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
>>>> On 11/08/2011 11:45 AM, Hal Finkel wrote:
[A lot of performance results skipped]

OK. As expected part of the speedup is because of unrolling, however it 
also shows that vectorization itself improves performance. There is 
still a lot of room for improvement, but it seems to be a good start.
>>>> If I understood correctly it seems your vectorizer has
quadratic
>>>> complexity which may cause large slowdowns. Do you think it may
be
>>>> useful/possible to make it linear by introducing a constant
upper bound
>>>> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are
lucky and
>>>> most of the vectorization opportunities are close by (in some
sense),
>>>> such that we get most of the speedup by locking at a subset of
the problem.
>>>
>>> Yes, I agree. That makes a lot of sense.
>>>
>>> What would be even better is if the loop unroller would intermix
>>> statements from the loops where possible instead of leaving it to
the
>>> vectorizer to do all of the grouping after the fact. That, I fear,
is a
>>> whole other project.
>>
>> First, I do not understand the 'even better' here. To me it
would be
>> great if on one hand the vectorizer could be constrained, such that the
>> compile time is predictable. And on the other hand, we could make the
>> loop unroller create code in a way such that the constrained vectorizer
>> still performs the relevant transformations.
>
> I was not clear; I meant that imposing a cut off is likely to work, but
> it would work even better if the loop unroller produced code
> more-conducive to vectorization.
Is such a cutoff already integrated and can be enabled by default. I 
believe it would be great if we could show that the compile time 
increase is limited by a low constant (maybe 100%), while still showing 
an overall improvement in run time.
>> Also, what do you mean with 'if the loop unroller would intermix
>> statements from the loops where possible'. Are you referring to the
>> grouped unrolling as shown in my the last mail?
>
> Yes.
>
> Also, the code necessary to take advantage of grouped unrolling already
> exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
> causes the vectorizer to stop searching for instruction pairings after
> the first use of an instruction.
Nice. I just tried one of the very basic examples from the Polly test 
suite (simple_vec.ll). Running opt -O3 -vectorize on it, does not create 
any vector instructions. I also played a little with the vectorizer 
options, but was not able to get this example vectorized. Is this 
because the chain is too short?
>>> One problem with the current implementation is that it relies on
>>> GetPointerBaseWithConstantOffset to determine if two loads or
stores
>>> share the same base address. This fails with partially-unrolled
loops
>>> because it cannot "undo" all of the additions to the
offset induction
>>> variable in order to understand that some of the loads and stores
are
>>> really adjacent in memory. This is something that I think can be
>>> improved within the vectorizer itself, and I'm planning on
doing some
>>> work on this in the future.
>> Here you may also want to look into ScalarEvolution. Basically two
loads
>> access adjacent memory if the difference of the scalar evolution of the
>> two load addresses is equal to sizeof(element_type). ScalarEvolution
>> should be a lot more general than GetPointerBaseWithConstantOffset().
>
> Thanks! That sounds great; I'll have to look at that.
Talking about this I looked again into ScalarEvolution.

To analyze a load, you would do:

LoadInst *Load = ...
Value *Pointer = Load->getPointer();
const SCEV *PointerSCEV = SE->getSCEV(Pointer);
const SCEVUnknown *PointerBase 
dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));

if (!PointerBase)
   return 'Analysis failed'

const Value *BaseValue = PointerBase->getValue();

You get the offset between two load addresses with SE->getMinusSCEV(). 
The size of an element is SE->getSizeOfExpr().

Hope that helps
Tobi
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: simple_vec.ll
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111110/15790d03/attachment.ksh>

Hal Finkel

2011-Nov-10 22:24 UTC

head link

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote:> On 11/08/2011 11:29 PM, Hal Finkel wrote:
> > On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:
> >> On 11/08/2011 03:36 PM, Hal Finkel wrote:
> >>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
> >>>> On 11/08/2011 11:45 AM, Hal Finkel wrote:
> 
> [A lot of performance results skipped]
> 
> OK. As expected part of the speedup is because of unrolling, however it 
> also shows that vectorization itself improves performance. There is 
> still a lot of room for improvement, but it seems to be a good start.
> 
> >>>> If I understood correctly it seems your vectorizer has
quadratic
> >>>> complexity which may cause large slowdowns. Do you think
it may be
> >>>> useful/possible to make it linear by introducing a
constant upper bound
> >>>> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we
are lucky and
> >>>> most of the vectorization opportunities are close by (in
some sense),
> >>>> such that we get most of the speedup by locking at a
subset of the problem.
> >>>
> >>> Yes, I agree. That makes a lot of sense.
> >>>
> >>> What would be even better is if the loop unroller would
intermix
> >>> statements from the loops where possible instead of leaving it
to the
> >>> vectorizer to do all of the grouping after the fact. That, I
fear, is a
> >>> whole other project.
> >>
> >> First, I do not understand the 'even better' here. To me
it would be
> >> great if on one hand the vectorizer could be constrained, such
that the
> >> compile time is predictable. And on the other hand, we could make
the
> >> loop unroller create code in a way such that the constrained
vectorizer
> >> still performs the relevant transformations.
> >
> > I was not clear; I meant that imposing a cut off is likely to work,
but
> > it would work even better if the loop unroller produced code
> > more-conducive to vectorization.
> 
> Is such a cutoff already integrated and can be enabled by default.
No, but I'll add one.
> I 
> believe it would be great if we could show that the compile time 
> increase is limited by a low constant (maybe 100%), while still showing 
> an overall improvement in run time.
> 
> >> Also, what do you mean with 'if the loop unroller would
intermix
> >> statements from the loops where possible'. Are you referring
to the
> >> grouped unrolling as shown in my the last mail?
> >
> > Yes.
> >
> > Also, the code necessary to take advantage of grouped unrolling
already
> > exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
> > causes the vectorizer to stop searching for instruction pairings after
> > the first use of an instruction.
> 
> Nice. I just tried one of the very basic examples from the Polly test 
> suite (simple_vec.ll). Running opt -O3 -vectorize on it, does not create 
> any vector instructions. I also played a little with the vectorizer 
> options, but was not able to get this example vectorized. Is this 
> because the chain is too short?
You are correct. It will vectorize with:
opt -O3 -vectorize -bb-vectorize-req-chain-depth=2
> 
> >>> One problem with the current implementation is that it relies
on
> >>> GetPointerBaseWithConstantOffset to determine if two loads or
stores
> >>> share the same base address. This fails with
partially-unrolled loops
> >>> because it cannot "undo" all of the additions to the
offset induction
> >>> variable in order to understand that some of the loads and
stores are
> >>> really adjacent in memory. This is something that I think can
be
> >>> improved within the vectorizer itself, and I'm planning on
doing some
> >>> work on this in the future.
> >> Here you may also want to look into ScalarEvolution. Basically two
loads
> >> access adjacent memory if the difference of the scalar evolution
of the
> >> two load addresses is equal to sizeof(element_type).
ScalarEvolution
> >> should be a lot more general than
GetPointerBaseWithConstantOffset().
> >
> > Thanks! That sounds great; I'll have to look at that.
> 
> Talking about this I looked again into ScalarEvolution.
> 
> To analyze a load, you would do:
> 
> LoadInst *Load = ...
> Value *Pointer = Load->getPointer();
> const SCEV *PointerSCEV = SE->getSCEV(Pointer);
> const SCEVUnknown *PointerBase > 
dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));
> 
> if (!PointerBase)
>    return 'Analysis failed'
> 
> const Value *BaseValue = PointerBase->getValue();
> 
> You get the offset between two load addresses with SE->getMinusSCEV(). 
> The size of an element is SE->getSizeOfExpr().
This is great! Thanks!

 -Hal
> 
> Hope that helps
> Tobi
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

Hal Finkel

2011-Nov-11 22:36 UTC

head link

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote:> On 11/08/2011 11:29 PM, Hal Finkel wrote:
> > On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:
> >> On 11/08/2011 03:36 PM, Hal Finkel wrote:
> >>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
> >>>> On 11/08/2011 11:45 AM, Hal Finkel wrote:
> 
> [A lot of performance results skipped]
> 
> OK. As expected part of the speedup is because of unrolling, however it 
> also shows that vectorization itself improves performance. There is 
> still a lot of room for improvement, but it seems to be a good start.
> 
> >>>> If I understood correctly it seems your vectorizer has
quadratic
> >>>> complexity which may cause large slowdowns. Do you think
it may be
> >>>> useful/possible to make it linear by introducing a
constant upper bound
> >>>> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we
are lucky and
> >>>> most of the vectorization opportunities are close by (in
some sense),
> >>>> such that we get most of the speedup by locking at a
subset of the problem.
> >>>
> >>> Yes, I agree. That makes a lot of sense.
> >>>
> >>> What would be even better is if the loop unroller would
intermix
> >>> statements from the loops where possible instead of leaving it
to the
> >>> vectorizer to do all of the grouping after the fact. That, I
fear, is a
> >>> whole other project.
> >>
> >> First, I do not understand the 'even better' here. To me
it would be
> >> great if on one hand the vectorizer could be constrained, such
that the
> >> compile time is predictable. And on the other hand, we could make
the
> >> loop unroller create code in a way such that the constrained
vectorizer
> >> still performs the relevant transformations.
> >
> > I was not clear; I meant that imposing a cut off is likely to work,
but
> > it would work even better if the loop unroller produced code
> > more-conducive to vectorization.
> 
> Is such a cutoff already integrated and can be enabled by default. I 
> believe it would be great if we could show that the compile time 
> increase is limited by a low constant (maybe 100%), while still showing 
> an overall improvement in run time.
> 
> >> Also, what do you mean with 'if the loop unroller would
intermix
> >> statements from the loops where possible'. Are you referring
to the
> >> grouped unrolling as shown in my the last mail?
> >
> > Yes.
> >
> > Also, the code necessary to take advantage of grouped unrolling
already
> > exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
> > causes the vectorizer to stop searching for instruction pairings after
> > the first use of an instruction.
> 
> Nice. I just tried one of the very basic examples from the Polly test 
> suite (simple_vec.ll). Running opt -O3 -vectorize on it, does not create 
> any vector instructions. I also played a little with the vectorizer 
> options, but was not able to get this example vectorized. Is this 
> because the chain is too short?
> 
> >>> One problem with the current implementation is that it relies
on
> >>> GetPointerBaseWithConstantOffset to determine if two loads or
stores
> >>> share the same base address. This fails with
partially-unrolled loops
> >>> because it cannot "undo" all of the additions to the
offset induction
> >>> variable in order to understand that some of the loads and
stores are
> >>> really adjacent in memory. This is something that I think can
be
> >>> improved within the vectorizer itself, and I'm planning on
doing some
> >>> work on this in the future.
> >> Here you may also want to look into ScalarEvolution. Basically two
loads
> >> access adjacent memory if the difference of the scalar evolution
of the
> >> two load addresses is equal to sizeof(element_type).
ScalarEvolution
> >> should be a lot more general than
GetPointerBaseWithConstantOffset().
> >
> > Thanks! That sounds great; I'll have to look at that.
> 
> Talking about this I looked again into ScalarEvolution.
> 
> To analyze a load, you would do:
> 
> LoadInst *Load = ...
> Value *Pointer = Load->getPointer();
> const SCEV *PointerSCEV = SE->getSCEV(Pointer);
> const SCEVUnknown *PointerBase > 
dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));
> 
> if (!PointerBase)
>    return 'Analysis failed'
> 
> const Value *BaseValue = PointerBase->getValue();
> 
> You get the offset between two load addresses with SE->getMinusSCEV(). 
> The size of an element is SE->getSizeOfExpr().
> 
The AliasAnalysis class has a set of interfaces that can be used to
preserve the analysis even when some things are changed. Does
ScalarEvolution have a similar capability?

Thanks again,
Hal
> Hope that helps
> Tobi
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

Possibly Parallel Threads

Search for more maybe matching threads

llvm dev - Nov 2011 - [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Possibly Parallel Threads