Hal Finkel
2011-Nov-11 22:36 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote:> On 11/08/2011 11:29 PM, Hal Finkel wrote: > > On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote: > >> On 11/08/2011 03:36 PM, Hal Finkel wrote: > >>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote: > >>>> On 11/08/2011 11:45 AM, Hal Finkel wrote: > > [A lot of performance results skipped] > > OK. As expected part of the speedup is because of unrolling, however it > also shows that vectorization itself improves performance. There is > still a lot of room for improvement, but it seems to be a good start. > > >>>> If I understood correctly it seems your vectorizer has quadratic > >>>> complexity which may cause large slowdowns. Do you think it may be > >>>> useful/possible to make it linear by introducing a constant upper bound > >>>> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and > >>>> most of the vectorization opportunities are close by (in some sense), > >>>> such that we get most of the speedup by locking at a subset of the problem. > >>> > >>> Yes, I agree. That makes a lot of sense. > >>> > >>> What would be even better is if the loop unroller would intermix > >>> statements from the loops where possible instead of leaving it to the > >>> vectorizer to do all of the grouping after the fact. That, I fear, is a > >>> whole other project. > >> > >> First, I do not understand the 'even better' here. To me it would be > >> great if on one hand the vectorizer could be constrained, such that the > >> compile time is predictable. And on the other hand, we could make the > >> loop unroller create code in a way such that the constrained vectorizer > >> still performs the relevant transformations. > > > > I was not clear; I meant that imposing a cut off is likely to work, but > > it would work even better if the loop unroller produced code > > more-conducive to vectorization. > > Is such a cutoff already integrated and can be enabled by default. I > believe it would be great if we could show that the compile time > increase is limited by a low constant (maybe 100%), while still showing > an overall improvement in run time. > > >> Also, what do you mean with 'if the loop unroller would intermix > >> statements from the loops where possible'. Are you referring to the > >> grouped unrolling as shown in my the last mail? > > > > Yes. > > > > Also, the code necessary to take advantage of grouped unrolling already > > exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep > > causes the vectorizer to stop searching for instruction pairings after > > the first use of an instruction. > > Nice. I just tried one of the very basic examples from the Polly test > suite (simple_vec.ll). Running opt -O3 -vectorize on it, does not create > any vector instructions. I also played a little with the vectorizer > options, but was not able to get this example vectorized. Is this > because the chain is too short? > > >>> One problem with the current implementation is that it relies on > >>> GetPointerBaseWithConstantOffset to determine if two loads or stores > >>> share the same base address. This fails with partially-unrolled loops > >>> because it cannot "undo" all of the additions to the offset induction > >>> variable in order to understand that some of the loads and stores are > >>> really adjacent in memory. This is something that I think can be > >>> improved within the vectorizer itself, and I'm planning on doing some > >>> work on this in the future. > >> Here you may also want to look into ScalarEvolution. Basically two loads > >> access adjacent memory if the difference of the scalar evolution of the > >> two load addresses is equal to sizeof(element_type). ScalarEvolution > >> should be a lot more general than GetPointerBaseWithConstantOffset(). > > > > Thanks! That sounds great; I'll have to look at that. > > Talking about this I looked again into ScalarEvolution. > > To analyze a load, you would do: > > LoadInst *Load = ... > Value *Pointer = Load->getPointer(); > const SCEV *PointerSCEV = SE->getSCEV(Pointer); > const SCEVUnknown *PointerBase > dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV)); > > if (!PointerBase) > return 'Analysis failed' > > const Value *BaseValue = PointerBase->getValue(); > > You get the offset between two load addresses with SE->getMinusSCEV(). > The size of an element is SE->getSizeOfExpr(). >The AliasAnalysis class has a set of interfaces that can be used to preserve the analysis even when some things are changed. Does ScalarEvolution have a similar capability? Thanks again, Hal> Hope that helps > Tobi-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory
Tobias Grosser
2011-Nov-11 22:55 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On 11/11/2011 11:36 PM, Hal Finkel wrote:> On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote: >> On 11/08/2011 11:29 PM, Hal Finkel wrote: >> Talking about this I looked again into ScalarEvolution. >> >> To analyze a load, you would do: >> >> LoadInst *Load = ... >> Value *Pointer = Load->getPointer(); >> const SCEV *PointerSCEV = SE->getSCEV(Pointer); >> const SCEVUnknown *PointerBase >> dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV)); >> >> if (!PointerBase) >> return 'Analysis failed' >> >> const Value *BaseValue = PointerBase->getValue(); >> >> You get the offset between two load addresses with SE->getMinusSCEV(). >> The size of an element is SE->getSizeOfExpr(). >> > > The AliasAnalysis class has a set of interfaces that can be used to > preserve the analysis even when some things are changed. Does > ScalarEvolution have a similar capability?You can state that your pass preserves ScalarEvolution. In this case all analysis results are by default preserved and it is your job to invalidate the scalar evolution for the loops/values where it needs to be recalculated. The relevant functions are ScalarEvolution::forgetValue(Value *) ScalarEvolution::forgetLoop(Loop *) Cheers Tobi
Hal Finkel
2011-Nov-11 23:11 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Fri, 2011-11-11 at 23:55 +0100, Tobias Grosser wrote:> On 11/11/2011 11:36 PM, Hal Finkel wrote: > > On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote: > >> On 11/08/2011 11:29 PM, Hal Finkel wrote: > >> Talking about this I looked again into ScalarEvolution. > >> > >> To analyze a load, you would do: > >> > >> LoadInst *Load = ... > >> Value *Pointer = Load->getPointer(); > >> const SCEV *PointerSCEV = SE->getSCEV(Pointer); > >> const SCEVUnknown *PointerBase > >> dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV)); > >> > >> if (!PointerBase) > >> return 'Analysis failed' > >> > >> const Value *BaseValue = PointerBase->getValue(); > >> > >> You get the offset between two load addresses with SE->getMinusSCEV(). > >> The size of an element is SE->getSizeOfExpr(). > >> > > > > The AliasAnalysis class has a set of interfaces that can be used to > > preserve the analysis even when some things are changed. Does > > ScalarEvolution have a similar capability? > > You can state that your pass preserves ScalarEvolution. In this case all > analysis results are by default preserved and it is your job to > invalidate the scalar evolution for the loops/values where it needs to > be recalculated. > > The relevant functions are > > ScalarEvolution::forgetValue(Value *)Since the vectorization pass is currently just a basic-block pass, I think that I should need only forgetValue, right? I suppose that I would call that on all of the values that are fused. Also, using getPointerBase to get the base pointer seems simple enough, but how should I use getMinusSCEV to get the offset. Should I call it on each load pointer and its base pointer, or between the two load pointers once I know that they share the same base. And once I do that, how do I get the offset (if known). I see the get[Uns|S]ignedRange functions, but if there is a way to directly get a constant value, then that would be more straightforward. Thanks again, Hal> ScalarEvolution::forgetLoop(Loop *) > > Cheers > Tobi-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory
Possibly Parallel Threads
- [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
- [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
- [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
- [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
- [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass