thr3ads.net - llvm dev - [LLVMdev] Vectorization: Next Steps [Feb 2012]


Hal Finkel

2012-Feb-07 20:10 UTC

[LLVMdev] Vectorization: Next Steps

On Mon, 2012-02-06 at 14:26 -0800, Chris Lattner wrote:
> On Feb 2, 2012, at 7:56 PM, Hal Finkel wrote:
> > As some of you may know, I committed my basic-block autovectorization
> > pass a few days ago. I encourage anyone interested to try it out (pass
> > -vectorize to opt or -mllvm -vectorize to clang) and provide feedback.
> > Especially in combination with -unroll-allow-partial, I have observed
> > some significant benchmark speedups, but, I have also observed some
> > significant slowdowns. I would like to share my thoughts, and
> > hopefully get feedback, on next steps.
> 
> Hi Hal,
> 
> I haven't had a chance to look at your pass in detail, but here are
> some opinions: :)
> 
> > 1. "Target Data" for vectorization - I think that in order to improve
> > the vectorization quality, the vectorizer will need more information
> > about the target. This information could be provided in the form of a
> > kind of extended target data. This extended target data might contain:
> > - What basic types can be vectorized, and how many of them will fit
> > into (the largest) vector registers
> > - What classes of operations can be vectorized (division, conversions /
> > sign extension, etc. are not always supported)
> > - What alignment is necessary for loads and stores
> > - Is scalar-to-vector free?
> 
> I think that this will be a really important API, but I strongly advocate
> that you model this after TargetLoweringInfo instead of TargetData. First,
> TargetData isn't actually a target API (it should be fixed, I filed PR11936
> to track this). Second, targets will have to implement imperative code to
> return precise answers to questions. For example, you'll want something
> like "what is the cost of a shuffle with this mask" which will be
> extremely target specific, will depend on what CPU subfeatures are
> enabled, etc.

This makes sense. What do you think will be the best way of
synchronizing things like CPU subfeatures between this API and the
backend target libraries? They could be linked directly, although I
don't know if we want to do that. tablegen could extract a bunch of this
information into separate objects that get linked into opt.

> When you start working on this, I strongly encourage you to propose the API
> you want here. Start small and add features as you go.
> 
> > 2. Feedback between passes - We may need to implement a closer coupling
> > between optimization passes than currently exists. Specifically, I have
> > in mind two things:
> > - The vectorizer should communicate more closely with the loop
> > unroller. First, the loop unroller should try to unroll to preserve
> > maximal load/store alignments. Second, I think it would make a lot of
> > sense to be able to unroll and then keep the unrolled version in
> > preference to the original only if this helps vectorization. With basic
> > block vectorization, it is often necessary to (partially) unroll in
> > order to vectorize. Even when we also have real loop vectorization,
> > however, I still think that it will be important for the loop unroller
> > to communicate with the vectorizer.
> 
> I really disagree with this, see below.
> 
> > 3. Loop vectorization - It would be nice to have, in addition to
> > basic-block vectorization, a more-traditional loop vectorization pass. I
> > think that we'll need a better loop analysis pass in order for this to
> > happen. Some of this was started in LoopDependenceAnalysis, but that
> > pass is not yet finished. We'll need something like this to recognize
> > affine memory references, etc.
> 
> I think that a loop vectorizer and a basic block vectorizer both make
> perfect sense and are important for different classes of code. However, I
> don't think that we should go down the path of trying to use a "basic
> block vectorizer + loop unrolling" to serve the purpose of a loop
> vectorizer. Trying to make a BBVectorizer and a loop unroller play together
> will be really fragile, because they'll both have to duplicate the same
> metrics (otherwise, for example, you'd unroll a loop that isn't
> vectorizable). This will also be a huge hit to compile time.

The only problem with this comes from loops for which unrolling is
necessary to expose vectorization because the memory access pattern is
too complicated to model in more-traditional loop vectorization. This
generally is useful only in cases with a large number of flops per
memory operation (or maybe integer ops too, but I have less experience
with those), so maybe we can design a useful heuristic to handle those
cases. That having been said, unroll+(failed vectorize)+rollback is not
really any more expensive at compile time than unroll+(failed vectorize)
except that the resulting code would run faster (actually it is cheaper
to compile because the optimization/compilation of the unvectorized
unrolled loop code takes longer than the non-unrolled loop). There might
be a clean way of doing this; I'll think about it.

Thanks again,
Hal
> 
> -Chris
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
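
As a rough illustration of the alignment point in item 2 above: if each loop iteration advances a pointer by a fixed number of bytes, the smallest unroll count that returns the pointer to the same alignment at the start of every unrolled body is the required alignment divided by its greatest common divisor with the stride. The function names and the byte-based interface below are assumptions made for this sketch; they are not part of LLVM's unroller.

```cpp
#include <cstdint>

// Greatest common divisor (Euclid's algorithm).
static std::uint64_t gcdU64(std::uint64_t A, std::uint64_t B) {
  while (B != 0) {
    std::uint64_t T = A % B;
    A = B;
    B = T;
  }
  return A;
}

// Hypothetical helper: smallest unroll count N such that N * StrideBytes is a
// multiple of AlignBytes, so a pointer that is aligned on entry to one
// unrolled body is still aligned on entry to the next.
// Returns 0 if either argument is 0 (no meaningful answer).
std::uint64_t alignmentPreservingUnrollCount(std::uint64_t StrideBytes,
                                             std::uint64_t AlignBytes) {
  if (StrideBytes == 0 || AlignBytes == 0)
    return 0;
  return AlignBytes / gcdU64(AlignBytes, StrideBytes);
}
```

For 4-byte elements and a 16-byte alignment requirement this gives an unroll count of 4; for a 12-byte stride it also gives 4, since 48 bytes is the first multiple of 12 that is also a multiple of 16.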

Chris Lattner

2012-Feb-09 01:26 UTC

[LLVMdev] Vectorization: Next Steps

On Feb 7, 2012, at 12:10 PM, Hal Finkel wrote:
> This makes sense. What do you think will be the best way of
> synchronizing things like CPU subfeatures between this API and the
> backend target libraries? They could be linked directly, although I
> don't know if we want to do that. tablegen could extract a bunch of this
> information into separate objects that get linked into opt.

The best model we have at the moment is TargetLoweringInfo, as used by
LoopStrengthReduction. The details of this interface aren't a great example
to follow for a few reasons (e.g. it has SelectionDAG-specific stuff in it,
which is a layering violation) but the idea is sound. This does mean that
running "opt -vectorize foo.bc" would not get the same optimization as
running clang with the target you want enabled though. We already have this
problem with -loop-reduce though.

>> I think that a loop vectorizer and a basic block vectorizer both make
>> perfect sense and are important for different classes of code. However, I
>> don't think that we should go down the path of trying to use a "basic
>> block vectorizer + loop unrolling" to serve the purpose of a loop
>> vectorizer. Trying to make a BBVectorizer and a loop unroller play together
>> will be really fragile, because they'll both have to duplicate the same
>> metrics (otherwise, for example, you'd unroll a loop that isn't
>> vectorizable). This will also be a huge hit to compile time.
> 
> The only problem with this comes from loops for which unrolling is
> necessary to expose vectorization because the memory access pattern is
> too complicated to model in more-traditional loop vectorization. This
> generally is useful only in cases with a large number of flops per
> memory operation (or maybe integer ops too, but I have less experience
> with those), so maybe we can design a useful heuristic to handle those
> cases. That having been said, unroll+(failed vectorize)+rollback is not
> really any more expensive at compile time than unroll+(failed vectorize)
> except that the resulting code would run faster (actually it is cheaper
> to compile because the optimization/compilation of the unvectorized
> unrolled loop code takes longer than the non-unrolled loop). There might
> be a clean way of doing this; I'll think about it.

I don't really understand the issue here, can you elaborate on when this
might be a win? I really don't like "speculatively unroll, try to do
something, then reroll". That is terrible for compile time and just
strikes me as poor design :-)

-Chris
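
To make the kind of interface under discussion concrete, here is a minimal sketch of a target query API that is answered by imperative, per-target code rather than by a static data table. Every name in it (VectorTargetInfo, Example128BitTarget, the enums, the cost values) is invented for illustration and is not an existing LLVM class; the example target and its "fast shuffle" subfeature are likewise made up.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical element types and operation classes a vectorizer might ask about.
enum class ElemType { I8, I16, I32, I64, F32, F64 };
enum class OpClass { Add, Mul, Div, Convert, SignExtend };

// Hypothetical query interface; each target implements the answers in code.
class VectorTargetInfo {
public:
  virtual ~VectorTargetInfo() = default;
  // Elements of type T that fit in the widest vector register (0 if none).
  virtual unsigned maxVectorElements(ElemType T) const = 0;
  // Whether operations of class Op on vectors of T are supported.
  virtual bool canVectorize(OpClass Op, ElemType T) const = 0;
  // Alignment in bytes required for vector loads and stores of T.
  virtual unsigned requiredAlignment(ElemType T) const = 0;
  virtual bool isScalarToVectorFree() const = 0;
  // Abstract cost of a shuffle with the given mask.
  virtual unsigned shuffleCost(const std::vector<int> &Mask) const = 0;
};

// Made-up target with 128-bit vector registers and one CPU subfeature flag.
class Example128BitTarget : public VectorTargetInfo {
  bool FastShuffle;

  static unsigned sizeInBytes(ElemType T) {
    switch (T) {
    case ElemType::I8:  return 1;
    case ElemType::I16: return 2;
    case ElemType::I32:
    case ElemType::F32: return 4;
    case ElemType::I64:
    case ElemType::F64: return 8;
    }
    return 0;
  }

public:
  explicit Example128BitTarget(bool HasFastShuffle)
      : FastShuffle(HasFastShuffle) {}

  unsigned maxVectorElements(ElemType T) const override {
    unsigned Size = sizeInBytes(T);
    return Size ? 16 / Size : 0;
  }
  // Assumption for this example: no vector integer division.
  bool canVectorize(OpClass Op, ElemType T) const override {
    bool IsFloat = T == ElemType::F32 || T == ElemType::F64;
    return Op != OpClass::Div || IsFloat;
  }
  unsigned requiredAlignment(ElemType) const override { return 16; }
  bool isScalarToVectorFree() const override { return false; }
  // Identity masks are free; other masks cost 1 with the subfeature,
  // otherwise one unit per element.
  unsigned shuffleCost(const std::vector<int> &Mask) const override {
    bool IsIdentity = true;
    for (std::size_t I = 0; I < Mask.size(); ++I)
      if (Mask[I] != static_cast<int>(I))
        IsIdentity = false;
    if (IsIdentity)
      return 0;
    return FastShuffle ? 1 : static_cast<unsigned>(Mask.size());
  }
};
```

The shuffle query is the part a data table handles poorly: the answer depends on both the mask and the enabled subfeatures, which is why the sketch computes it rather than looking it up.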

Roel Jordans

2012-Feb-09 10:04 UTC

[LLVMdev] Vectorization: Next Steps

On 02/09/2012 02:26 AM, Chris Lattner wrote:
>>> I think that a loop vectorizer and a basic block vectorizer both make
>>> perfect sense and are important for different classes of code. However, I
>>> don't think that we should go down the path of trying to use a "basic
>>> block vectorizer + loop unrolling" to serve the purpose of a loop
>>> vectorizer. Trying to make a BBVectorizer and a loop unroller play
>>> together will be really fragile, because they'll both have to duplicate
>>> the same metrics (otherwise, for example, you'd unroll a loop that isn't
>>> vectorizable). This will also be a huge hit to compile time.
>>
>> The only problem with this comes from loops for which unrolling is
>> necessary to expose vectorization because the memory access pattern is
>> too complicated to model in more-traditional loop vectorization. This
>> generally is useful only in cases with a large number of flops per
>> memory operation (or maybe integer ops too, but I have less experience
>> with those), so maybe we can design a useful heuristic to handle those
>> cases. That having been said, unroll+(failed vectorize)+rollback is not
>> really any more expensive at compile time than unroll+(failed vectorize)
>> except that the resulting code would run faster (actually it is cheaper
>> to compile because the optimization/compilation of the unvectorized
>> unrolled loop code takes longer than the non-unrolled loop). There might
>> be a clean way of doing this; I'll think about it.
>
> I don't really understand the issue here, can you elaborate on when
> this might be a win? I really don't like "speculatively unroll, try to
> do something, then reroll". That is terrible for compile time and just
> strikes me as poor design :-)

This seems a bit related to Resource-Directed Loop Pipelining [1] to me.
RDLP uses loop unrolling in combination with loop shifting (or peeling)
to map a loop body to a parallel architecture. It was originally focused
on VLIW-like parallelism, but I think that a similar technique may be
useful for vectorization.

Cheers,
Roel

[1] http://comjnl.oxfordjournals.org/content/40/6/311.short
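
The compile-time argument quoted above can be written down as a toy cost model. This is only a sketch under a strong assumption, namely that the cost of all later passes grows linearly with the number of instructions they see; the struct, the function names, and any numbers plugged in are hypothetical and would have to be replaced by measurements.

```cpp
// All quantities are abstract compile-time units (hypothetical).
struct SpeculationCosts {
  unsigned Unroll;           // cost of unrolling the loop
  unsigned VectorizeAttempt; // cost of the (failed) vectorization attempt
  unsigned Rollback;         // cost of restoring the original loop
  unsigned LaterPerInstr;    // cost of all later passes, per instruction
  unsigned BodySize;         // instructions in the original loop body
  unsigned UnrollFactor;     // copies of the body after unrolling
};

// Never speculate: later passes see only the original loop.
unsigned costNoSpeculation(const SpeculationCosts &C) {
  return C.LaterPerInstr * C.BodySize;
}

// Unroll, fail to vectorize, and keep the unrolled loop anyway.
unsigned costKeepUnrolled(const SpeculationCosts &C) {
  return C.Unroll + C.VectorizeAttempt +
         C.LaterPerInstr * C.BodySize * C.UnrollFactor;
}

// Unroll, fail to vectorize, then restore the original loop.
unsigned costRollback(const SpeculationCosts &C) {
  return C.Unroll + C.VectorizeAttempt + C.Rollback +
         C.LaterPerInstr * C.BodySize;
}
```

Under this model, rolling back is cheaper than keeping the unrolled loop whenever Rollback is less than LaterPerInstr * BodySize * (UnrollFactor - 1), which matches Hal's claim. Both speculative paths still cost more than never unrolling, which is the compile-time concern Chris raises.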

Hal Finkel

2012-Feb-09 16:21 UTC

[LLVMdev] Vectorization: Next Steps

On Wed, 2012-02-08 at 17:26 -0800, Chris Lattner wrote:
> On Feb 7, 2012, at 12:10 PM, Hal Finkel wrote:
> >> I think that a loop vectorizer and a basic block vectorizer both make
> >> perfect sense and are important for different classes of code. However,
> >> I don't think that we should go down the path of trying to use a "basic
> >> block vectorizer + loop unrolling" to serve the purpose of a loop
> >> vectorizer. Trying to make a BBVectorizer and a loop unroller play
> >> together will be really fragile, because they'll both have to duplicate
> >> the same metrics (otherwise, for example, you'd unroll a loop that isn't
> >> vectorizable). This will also be a huge hit to compile time.
> > 
> > The only problem with this comes from loops for which unrolling is
> > necessary to expose vectorization because the memory access pattern is
> > too complicated to model in more-traditional loop vectorization. This
> > generally is useful only in cases with a large number of flops per
> > memory operation (or maybe integer ops too, but I have less experience
> > with those), so maybe we can design a useful heuristic to handle those
> > cases. That having been said, unroll+(failed vectorize)+rollback is not
> > really any more expensive at compile time than unroll+(failed vectorize)
> > except that the resulting code would run faster (actually it is cheaper
> > to compile because the optimization/compilation of the unvectorized
> > unrolled loop code takes longer than the non-unrolled loop). There might
> > be a clean way of doing this; I'll think about it.
> 
> I don't really understand the issue here, can you elaborate on when
> this might be a win? I really don't like "speculatively unroll, try to
> do something, then reroll". That is terrible for compile time and just
> strikes me as poor design :-)

From Ayal's e-mail, it seems that the gcc vectorizer contains
specialized unrolling code to handle these kinds of cases. With
appropriate refactoring, perhaps that is the best solution.

 -Hal
> 
> -Chris
> 
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

Hal Finkel

2012-Feb-13 23:30 UTC

[LLVMdev] Vectorization: Next Steps

On Wed, 2012-02-08 at 17:26 -0800, Chris Lattner wrote:
> On Feb 7, 2012, at 12:10 PM, Hal Finkel wrote:
> > This makes sense. What do you think will be the best way of
> > synchronizing things like CPU subfeatures between this API and the
> > backend target libraries? They could be linked directly, although I
> > don't know if we want to do that. tablegen could extract a bunch of this
> > information into separate objects that get linked into opt.
> 
> The best model we have at the moment is TargetLoweringInfo, as used by
> LoopStrengthReduction. The details of this interface aren't a great
> example to follow for a few reasons (e.g. it has SelectionDAG-specific
> stuff in it, which is a layering violation) but the idea is sound. This
> does mean that running "opt -vectorize foo.bc" would not get the same
> optimization as running clang with the target you want enabled though.
> We already have this problem with -loop-reduce though.

LoopStrengthReduction is currently created in
TargetPassConfig::addIRPasses (CodeGen/Passes.cpp). Currently the
vectorization pass is created in
PassManagerBuilder::populateModulePassManager (which is used by opt).
Are you suggesting that I move the vectorization pass creation into
CodeGen? Or are you saying that TLI will sometimes be available to the
pass, as it is now, when called from a full-compilation driver (like
clang)? Or are you suggesting that I propose some object like TLI that
might be available in 'opt' even though TLI itself is not available
there?

Thanks again,
Hal
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
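
One of the options raised in this message is that target information is only sometimes available to the pass. A minimal sketch of that arrangement follows: the pass configuration holds a pointer that may be null and falls back to generic defaults when it is. The names TargetVectorHints and VectorizeConfig, and the 128-bit fallback width, are assumptions made for this example and do not describe the actual pass.

```cpp
// Hypothetical target-provided data; present only when a target is configured.
struct TargetVectorHints {
  unsigned VectorRegisterBits;
};

// Hypothetical pass configuration that tolerates missing target information.
class VectorizeConfig {
  const TargetVectorHints *Hints; // null when run from a target-less tool

public:
  explicit VectorizeConfig(const TargetVectorHints *H = nullptr) : Hints(H) {}

  // Width of the vector registers to assume. The 128-bit default is an
  // assumption chosen for this sketch.
  unsigned vectorBits() const {
    return Hints ? Hints->VectorRegisterBits : 128;
  }

  // Elements of the given bit width that fit in one vector (0 if ElemBits is 0).
  unsigned elementsPerVector(unsigned ElemBits) const {
    return ElemBits ? vectorBits() / ElemBits : 0;
  }
};
```

With no hints, 32-bit elements are packed four to a vector; with hints describing 256-bit registers, the same query returns eight. That difference is the behavior Chris describes, where running the pass from opt does not give the same result as running it from a driver that knows the target.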
