Hal Finkel
2011-Nov-11 22:36 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote:
> On 11/08/2011 11:29 PM, Hal Finkel wrote:
> > On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:
> >> On 11/08/2011 03:36 PM, Hal Finkel wrote:
> >>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote:
> >>>> On 11/08/2011 11:45 AM, Hal Finkel wrote:
>
> [A lot of performance results skipped]
>
> OK. As expected, part of the speedup is because of unrolling; however,
> it also shows that vectorization itself improves performance. There is
> still a lot of room for improvement, but it seems to be a good start.
>
> >>>> If I understood correctly, it seems your vectorizer has quadratic
> >>>> complexity, which may cause large slowdowns. Do you think it may be
> >>>> useful/possible to make it linear by introducing a constant upper
> >>>> bound somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are
> >>>> lucky and most of the vectorization opportunities are close by (in
> >>>> some sense), such that we get most of the speedup by looking at a
> >>>> subset of the problem.
> >>>
> >>> Yes, I agree. That makes a lot of sense.
> >>>
> >>> What would be even better is if the loop unroller would intermix
> >>> statements from the loops where possible instead of leaving it to
> >>> the vectorizer to do all of the grouping after the fact. That, I
> >>> fear, is a whole other project.
> >>
> >> First, I do not understand the 'even better' here. To me it would be
> >> great if, on the one hand, the vectorizer could be constrained such
> >> that the compile time is predictable, and, on the other hand, we
> >> could make the loop unroller create code in a way such that the
> >> constrained vectorizer still performs the relevant transformations.
> >
> > I was not clear; I meant that imposing a cutoff is likely to work,
> > but it would work even better if the loop unroller produced code more
> > conducive to vectorization.
>
> Is such a cutoff already integrated, and can it be enabled by default?
> I believe it would be great if we could show that the compile-time
> increase is limited by a low constant (maybe 100%), while still showing
> an overall improvement in run time.
>
> >> Also, what do you mean by 'if the loop unroller would intermix
> >> statements from the loops where possible'? Are you referring to the
> >> grouped unrolling as shown in my last mail?
> >
> > Yes.
> >
> > Also, the code necessary to take advantage of grouped unrolling
> > already exists in the vectorizer. Enabling the flag
> > -bb-vectorize-fast-dep causes the vectorizer to stop searching for
> > instruction pairings after the first use of an instruction.
>
> Nice. I just tried one of the very basic examples from the Polly test
> suite (simple_vec.ll). Running opt -O3 -vectorize on it does not create
> any vector instructions. I also played a little with the vectorizer
> options, but was not able to get this example vectorized. Is this
> because the chain is too short?
>
> >>> One problem with the current implementation is that it relies on
> >>> GetPointerBaseWithConstantOffset to determine if two loads or
> >>> stores share the same base address. This fails with
> >>> partially-unrolled loops because it cannot "undo" all of the
> >>> additions to the offset induction variable in order to understand
> >>> that some of the loads and stores are really adjacent in memory.
> >>> This is something that I think can be improved within the
> >>> vectorizer itself, and I'm planning on doing some work on this in
> >>> the future.
> >>
> >> Here you may also want to look into ScalarEvolution. Basically, two
> >> loads access adjacent memory if the difference of the scalar
> >> evolution of the two load addresses is equal to
> >> sizeof(element_type). ScalarEvolution should be a lot more general
> >> than GetPointerBaseWithConstantOffset().
> >
> > Thanks! That sounds great; I'll have to look at that.
>
> Talking about this I looked again into ScalarEvolution.
>
> To analyze a load, you would do:
>
> LoadInst *Load = ...
> Value *Pointer = Load->getPointer();
> const SCEV *PointerSCEV = SE->getSCEV(Pointer);
> const SCEVUnknown *PointerBase =
>   dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));
>
> if (!PointerBase)
>   return 'Analysis failed'
>
> const Value *BaseValue = PointerBase->getValue();
>
> You get the offset between two load addresses with SE->getMinusSCEV().
> The size of an element is SE->getSizeOfExpr().
>

The AliasAnalysis class has a set of interfaces that can be used to
preserve the analysis even when some things are changed. Does
ScalarEvolution have a similar capability?

Thanks again,
Hal

> Hope that helps
> Tobi

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
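The base-pointer check Tobias outlines can be written against the
ScalarEvolution API roughly as follows. This is a minimal sketch, not code
from the patch under review: the helper names getSCEVBase and sameSCEVBase
are invented, and LoadInst::getPointerOperand() is used in place of the
shorthand getPointer() in the quoted pseudocode.

#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/Instructions.h"

using namespace llvm;

// Return the underlying base value of a load address, or 0 if
// ScalarEvolution cannot reduce the address to a single SCEVUnknown base.
static const Value *getSCEVBase(ScalarEvolution &SE, LoadInst *Load) {
  const SCEV *PtrSCEV = SE.getSCEV(Load->getPointerOperand());
  const SCEVUnknown *Base =
    dyn_cast<SCEVUnknown>(SE.getPointerBase(PtrSCEV));
  return Base ? Base->getValue() : 0;
}

// Two loads are candidates for the adjacency test only if they are based
// on the same underlying object.
static bool sameSCEVBase(ScalarEvolution &SE, LoadInst *A, LoadInst *B) {
  const Value *BaseA = getSCEVBase(SE, A);
  const Value *BaseB = getSCEVBase(SE, B);
  return BaseA && BaseB && BaseA == BaseB;
}

Pairs that pass this test would then be checked for an address difference
equal to the element size, as discussed in the rest of the thread.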
Tobias Grosser
2011-Nov-11 22:55 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On 11/11/2011 11:36 PM, Hal Finkel wrote:
> On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote:
>> On 11/08/2011 11:29 PM, Hal Finkel wrote:
>> Talking about this I looked again into ScalarEvolution.
>>
>> To analyze a load, you would do:
>>
>> LoadInst *Load = ...
>> Value *Pointer = Load->getPointer();
>> const SCEV *PointerSCEV = SE->getSCEV(Pointer);
>> const SCEVUnknown *PointerBase =
>>   dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));
>>
>> if (!PointerBase)
>>   return 'Analysis failed'
>>
>> const Value *BaseValue = PointerBase->getValue();
>>
>> You get the offset between two load addresses with SE->getMinusSCEV().
>> The size of an element is SE->getSizeOfExpr().
>>
>
> The AliasAnalysis class has a set of interfaces that can be used to
> preserve the analysis even when some things are changed. Does
> ScalarEvolution have a similar capability?

You can state that your pass preserves ScalarEvolution. In this case all
analysis results are by default preserved and it is your job to
invalidate the scalar evolution for the loops/values where it needs to
be recalculated.

The relevant functions are:

ScalarEvolution::forgetValue(Value *)
ScalarEvolution::forgetLoop(Loop *)

Cheers
Tobi
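A hypothetical sketch of what this could look like in a basic-block pass
under the legacy pass manager of that era; the pass name MyVectorize is
invented, and the forgetValue() call site is only indicated by a comment.

#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/BasicBlock.h"
#include "llvm/Pass.h"

using namespace llvm;

namespace {
struct MyVectorize : public BasicBlockPass {
  static char ID;
  MyVectorize() : BasicBlockPass(ID) {}

  virtual void getAnalysisUsage(AnalysisUsage &AU) const {
    AU.addRequired<ScalarEvolution>();
    // Tell the pass manager that the analysis survives this pass; in
    // exchange, the pass must invalidate stale entries itself.
    AU.addPreserved<ScalarEvolution>();
  }

  virtual bool runOnBasicBlock(BasicBlock &BB) {
    ScalarEvolution &SE = getAnalysis<ScalarEvolution>();
    bool Changed = false;
    // When an instruction I is fused or rewritten:
    //   SE.forgetValue(&I);   // drop its cached SCEV
    //   Changed = true;
    (void) SE;
    return Changed;
  }
};
}

char MyVectorize::ID = 0;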
Hal Finkel
2011-Nov-11 23:11 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Fri, 2011-11-11 at 23:55 +0100, Tobias Grosser wrote:
> On 11/11/2011 11:36 PM, Hal Finkel wrote:
> > On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote:
> >> On 11/08/2011 11:29 PM, Hal Finkel wrote:
> >> Talking about this I looked again into ScalarEvolution.
> >>
> >> To analyze a load, you would do:
> >>
> >> LoadInst *Load = ...
> >> Value *Pointer = Load->getPointer();
> >> const SCEV *PointerSCEV = SE->getSCEV(Pointer);
> >> const SCEVUnknown *PointerBase =
> >>   dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));
> >>
> >> if (!PointerBase)
> >>   return 'Analysis failed'
> >>
> >> const Value *BaseValue = PointerBase->getValue();
> >>
> >> You get the offset between two load addresses with SE->getMinusSCEV().
> >> The size of an element is SE->getSizeOfExpr().
> >>
> >
> > The AliasAnalysis class has a set of interfaces that can be used to
> > preserve the analysis even when some things are changed. Does
> > ScalarEvolution have a similar capability?
>
> You can state that your pass preserves ScalarEvolution. In this case all
> analysis results are by default preserved and it is your job to
> invalidate the scalar evolution for the loops/values where it needs to
> be recalculated.
>
> The relevant functions are:
>
> ScalarEvolution::forgetValue(Value *)

Since the vectorization pass is currently just a basic-block pass, I
think that I should need only forgetValue, right? I suppose that I would
call that on all of the values that are fused.

Also, using getPointerBase to get the base pointer seems simple enough,
but how should I use getMinusSCEV to get the offset? Should I call it on
each load pointer and its base pointer, or between the two load pointers
once I know that they share the same base? And once I do that, how do I
get the offset (if known)? I see the get[Uns|S]ignedRange functions, but
if there is a way to directly get a constant value, then that would be
more straightforward.

Thanks again,
Hal

> ScalarEvolution::forgetLoop(Loop *)
>
> Cheers
> Tobi

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
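For the offset question above, one possible approach (a sketch only;
getConstantDistance and OffsetInBytes are invented names) is to subtract
the two load-pointer SCEVs directly, after the shared-base check, and see
whether the difference folds to a SCEVConstant; the get[Uns|S]ignedRange
queries would only be needed when it does not.

#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/Instructions.h"

using namespace llvm;

// Return true and set OffsetInBytes if ScalarEvolution can fold the
// address difference B - A to a compile-time constant.
static bool getConstantDistance(ScalarEvolution &SE, LoadInst *A,
                                LoadInst *B, int64_t &OffsetInBytes) {
  const SCEV *PtrA = SE.getSCEV(A->getPointerOperand());
  const SCEV *PtrB = SE.getSCEV(B->getPointerOperand());
  const SCEV *Diff = SE.getMinusSCEV(PtrB, PtrA);

  if (const SCEVConstant *C = dyn_cast<SCEVConstant>(Diff)) {
    OffsetInBytes = C->getValue()->getSExtValue();
    return true;
  }
  return false; // Not a constant; a range query would be the fallback.
}

Two loads on the same base would then be adjacent when OffsetInBytes
equals the element size.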