Hal Finkel
2011-Nov-08 22:29 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote:> On 11/08/2011 03:36 PM, Hal Finkel wrote: > > On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote: > >> On 11/08/2011 11:45 AM, Hal Finkel wrote: > >>> I've attached the latest version of my autovectorization patch. > >>> > >>> Working through the test suite has proved to be a productive > >>> experience ;) -- And almost all of the bugs that it revealed have now > >>> been fixed. There are still two programs that don't compile with > >>> vectorization turned on, and I'm working on those now, but in case > >>> anyone feels like playing with vectorization, this patch will probably > >>> work for you. > >> > >> Hey Hal, > >> > >> those are great news. Especially as the numbers seem to show that > >> vectorization has a significant performance impact. What did you compare > >> exactly. 'clang -O3' against 'clang -O3 -mllvm -vectorize'? > > > > Yes. [I've tested the current patch directly using opt -vectorize > > -unroll-allow-partial; for running the test suite I recompiled > > llvm/clang to hardcode the options as I wanted them]. > > You should not need to hack clang. As shown above, you should be able to > pass '-vectorize' to the optimizer by using '-mllvm -vectorize' in your > CFLAGS. Did you run the test suite at -O3?That's good to know. Yes, it was run at -O3.> > Also I think it would be interesting to compare for the test-suite the > performance of > 'clang -O3 -mllvm -unroll-allow-partial' with > 'clang -O3 -mllvm -unroll-allow-partial -mllvm -vectorize'. It will show > how much of the runtime overhead is due to the unrolling (produces more > code that needs to be optimized) and which part is due to vectorization. > The same counts for the speedup. How much is caused by > unrolling and how much is actually caused by your pass.Just turning on partial unrolling yields, on average, a 4.7% slowdown (0% median), and a 28.6% increase in compile time (21.8% median).> > >>> The largest three performance speedups are: > >>> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedupWith just partial unrolling: 64.8% speedup> >>> SingleSource/UnitTests/Vector/multiplies - 57.7% speedupWith just partial unrolling: 58.2% speedup> >>> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedupWith just partial unrolling: 11.8% speedup> >>> > >>> The largest three performance slowdowns are: > >>> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael - > >>> 114% slowdownWith just partial unrolling: 0% slowdown (47.8% slowdown in compile time)> >>> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6% > >>> slowdownWith just partial unrolling: 8.3% speedup> >>> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdownWith just partial unrolling: 8.3% speedup> >>>Here are the top-3 lists comparing vectorization with partial unrolling to just with partial unrolling: Top speedups: SingleSource/Benchmarks/Misc/flops-7 - 44.1% speedup SingleSource/Benchmarks/Misc/flops-1 - 42.3% speedup MultiSource/Applications/sqlite3/sqlite3 - 42.3% speedup Top slowdowns: MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael - 114% slowdown MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 81.8% slowdown SingleSource/Benchmarks/Misc/flops-8 - 79% slowdown Top compile-time slowdowns: MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael - 832% slowdown SingleSource/Benchmarks/Misc/salsa20 - 600% slowdown MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 263% slowdown For this comparison, however (unlike comparing to plain -O3), there are a significant number of compile-time speedups (I guess that this is because vectorization can reduce the number of instructions processed by later passes). Top compile-time speedups: SingleSource/Benchmarks/Stanford/Oscar - 46% speedup SingleSource/Regression/C/bigstack - 45% speedup SingleSource/Benchmarks/BenchmarkGame/fannkuch - 45% speedup> >> Interesting. Do you understand what causes these slowdowns? Can your > >> heuristic be improved? > > > > I've not specifically looked at these cases. > > > > Generally, I've observed slowdowns from the introduction of too many > > permutations per chain (the chain selection procedure can be changed to > > help this, and I'll work on that). Also, sometimes legalization of > > non-native vector operations creates inefficient code, and I'll also > > look at these cases in more detail. > Sure. That sounds reasonable. I am confident you can improve this gradually. > > >> If I understood correctly it seems your vectorizer has quadratic > >> complexity which may cause large slowdowns. Do you think it may be > >> useful/possible to make it linear by introducing a constant upper bound > >> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and > >> most of the vectorization opportunities are close by (in some sense), > >> such that we get most of the speedup by locking at a subset of the problem. > > > > Yes, I agree. That makes a lot of sense. > > > > What would be even better is if the loop unroller would intermix > > statements from the loops where possible instead of leaving it to the > > vectorizer to do all of the grouping after the fact. That, I fear, is a > > whole other project. > > First, I do not understand the 'even better' here. To me it would be > great if on one hand the vectorizer could be constrained, such that the > compile time is predictable. And on the other hand, we could make the > loop unroller create code in a way such that the constrained vectorizer > still performs the relevant transformations.I was not clear; I meant that imposing a cut off is likely to work, but it would work even better if the loop unroller produced code more-conducive to vectorization.> > Also, what do you mean with 'if the loop unroller would intermix > statements from the loops where possible'. Are you referring to the > grouped unrolling as shown in my the last mail?Yes. Also, the code necessary to take advantage of grouped unrolling already exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep causes the vectorizer to stop searching for instruction pairings after the first use of an instruction.> > > >>> Compared to previous patches, which had a minimum required chain length > >>> of 3 or 4, I've now made the default 6. While using a chain length of 4 > >>> worked well for targeted benchmarks, it caused an overall slowdown on > >>> almost all test-suite programs. Using a minimum length of 6 causes, on > >>> average, a speedup; so I think that is a better default choice. > >> > >> I also try to understand if it is possible to use your vectorizer for > >> Polly. My idea is to do some clever loop unrolling. > >> > >> Starting from this loop. > >> > >> for (int i = 0; i< 4; i++) > >> A[i] += 1; > >> A[i] = B[i] + 3; > >> C[i] = A[i]; > >> > >> The classical unroller would create this code: > >> > >> A[0] += 1; > >> A[0] = B[i] + 3; > >> C[0] = A[i]; > >> > >> A[1] += 1; > >> A[1] = B[i] + 3; > >> C[1] = A[i]; > >> > >> A[2] += 1; > >> A[2] = B[i] + 3; > >> C[2] = A[i]; > >> > >> A[3] += 1; > >> A[3] = B[i] + 3; > >> C[3] = A[i]; > >> > >> However, in case I can prove this loop is parallel, I want to create > >> this code: > >> > >> A[0] += 1; > >> A[1] += 1; > >> A[2] += 1; > >> A[3] += 1; > >> > >> A[0] = B[i] + 3; > >> A[1] = B[i] + 3; > >> A[2] = B[i] + 3; > >> A[3] = B[i] + 3; > >> > >> C[0] = A[i]; > >> C[1] = A[i]; > >> C[2] = A[i]; > >> C[3] = A[i]; > >> > >> I assume this will allow the vectorization of test cases, that failed > >> because of possible aliasing. However, I am more interested, if the > >> execution order change could also improve the vectorization outcome or > >> reduce compile time overhead of your vectorizer. > > > > Yes, this would certainly help. > > In which way? How much? Would it be possible to restrict the vectorizer > such that it's complexity is linear, while at the same time we can > ensure it will still vectorize code that is 'group unrolled' (which > means unrolled in the way shown above?It would help because we could stop the pairing search after an instructions first use; and that is significantly faster and generates many fewer potential pairings that need to be considered by the later stages. Exactly "how much" it would help I can't yet answer; I'd need to construct actual benchmarks for that. If you have some group-unrolled code, then it will be simple to test: just pass -bb-vectorize-fast-dep.> > > By the way, the current implementation, by default, it will create > > unaligned vector loads and stores, but these are generally broken up by > > the legalizer. This behavior can be suppressed using the > > -bb-vectorize-aligned-only flag. It would be nice if the loop unroller > > chose the unrolling factor to preserve the maximum available alignment, > > but I don't think that it currently does so. > I don't know, but it sounds like a good thing to do. > > > One problem with the current implementation is that it relies on > > GetPointerBaseWithConstantOffset to determine if two loads or stores > > share the same base address. This fails with partially-unrolled loops > > because it cannot "undo" all of the additions to the offset induction > > variable in order to understand that some of the loads and stores are > > really adjacent in memory. This is something that I think can be > > improved within the vectorizer itself, and I'm planning on doing some > > work on this in the future. > Here you may also want to look into ScalarEvolution. Basically two loads > access adjacent memory if the difference of the scalar evolution of the > two load addresses is equal to sizeof(element_type). ScalarEvolution > should be a lot more general than GetPointerBaseWithConstantOffset().Thanks! That sounds great; I'll have to look at that. -Hal> > Cheers > Tobi-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory
Tobias Grosser
2011-Nov-10 22:07 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On 11/08/2011 11:29 PM, Hal Finkel wrote:> On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote: >> On 11/08/2011 03:36 PM, Hal Finkel wrote: >>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote: >>>> On 11/08/2011 11:45 AM, Hal Finkel wrote:[A lot of performance results skipped] OK. As expected part of the speedup is because of unrolling, however it also shows that vectorization itself improves performance. There is still a lot of room for improvement, but it seems to be a good start.>>>> If I understood correctly it seems your vectorizer has quadratic >>>> complexity which may cause large slowdowns. Do you think it may be >>>> useful/possible to make it linear by introducing a constant upper bound >>>> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and >>>> most of the vectorization opportunities are close by (in some sense), >>>> such that we get most of the speedup by locking at a subset of the problem. >>> >>> Yes, I agree. That makes a lot of sense. >>> >>> What would be even better is if the loop unroller would intermix >>> statements from the loops where possible instead of leaving it to the >>> vectorizer to do all of the grouping after the fact. That, I fear, is a >>> whole other project. >> >> First, I do not understand the 'even better' here. To me it would be >> great if on one hand the vectorizer could be constrained, such that the >> compile time is predictable. And on the other hand, we could make the >> loop unroller create code in a way such that the constrained vectorizer >> still performs the relevant transformations. > > I was not clear; I meant that imposing a cut off is likely to work, but > it would work even better if the loop unroller produced code > more-conducive to vectorization.Is such a cutoff already integrated and can be enabled by default. I believe it would be great if we could show that the compile time increase is limited by a low constant (maybe 100%), while still showing an overall improvement in run time.>> Also, what do you mean with 'if the loop unroller would intermix >> statements from the loops where possible'. Are you referring to the >> grouped unrolling as shown in my the last mail? > > Yes. > > Also, the code necessary to take advantage of grouped unrolling already > exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep > causes the vectorizer to stop searching for instruction pairings after > the first use of an instruction.Nice. I just tried one of the very basic examples from the Polly test suite (simple_vec.ll). Running opt -O3 -vectorize on it, does not create any vector instructions. I also played a little with the vectorizer options, but was not able to get this example vectorized. Is this because the chain is too short?>>> One problem with the current implementation is that it relies on >>> GetPointerBaseWithConstantOffset to determine if two loads or stores >>> share the same base address. This fails with partially-unrolled loops >>> because it cannot "undo" all of the additions to the offset induction >>> variable in order to understand that some of the loads and stores are >>> really adjacent in memory. This is something that I think can be >>> improved within the vectorizer itself, and I'm planning on doing some >>> work on this in the future. >> Here you may also want to look into ScalarEvolution. Basically two loads >> access adjacent memory if the difference of the scalar evolution of the >> two load addresses is equal to sizeof(element_type). ScalarEvolution >> should be a lot more general than GetPointerBaseWithConstantOffset(). > > Thanks! That sounds great; I'll have to look at that.Talking about this I looked again into ScalarEvolution. To analyze a load, you would do: LoadInst *Load = ... Value *Pointer = Load->getPointer(); const SCEV *PointerSCEV = SE->getSCEV(Pointer); const SCEVUnknown *PointerBase dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV)); if (!PointerBase) return 'Analysis failed' const Value *BaseValue = PointerBase->getValue(); You get the offset between two load addresses with SE->getMinusSCEV(). The size of an element is SE->getSizeOfExpr(). Hope that helps Tobi -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: simple_vec.ll URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111110/15790d03/attachment.ksh>
Hal Finkel
2011-Nov-10 22:24 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote:> On 11/08/2011 11:29 PM, Hal Finkel wrote: > > On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote: > >> On 11/08/2011 03:36 PM, Hal Finkel wrote: > >>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote: > >>>> On 11/08/2011 11:45 AM, Hal Finkel wrote: > > [A lot of performance results skipped] > > OK. As expected part of the speedup is because of unrolling, however it > also shows that vectorization itself improves performance. There is > still a lot of room for improvement, but it seems to be a good start. > > >>>> If I understood correctly it seems your vectorizer has quadratic > >>>> complexity which may cause large slowdowns. Do you think it may be > >>>> useful/possible to make it linear by introducing a constant upper bound > >>>> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and > >>>> most of the vectorization opportunities are close by (in some sense), > >>>> such that we get most of the speedup by locking at a subset of the problem. > >>> > >>> Yes, I agree. That makes a lot of sense. > >>> > >>> What would be even better is if the loop unroller would intermix > >>> statements from the loops where possible instead of leaving it to the > >>> vectorizer to do all of the grouping after the fact. That, I fear, is a > >>> whole other project. > >> > >> First, I do not understand the 'even better' here. To me it would be > >> great if on one hand the vectorizer could be constrained, such that the > >> compile time is predictable. And on the other hand, we could make the > >> loop unroller create code in a way such that the constrained vectorizer > >> still performs the relevant transformations. > > > > I was not clear; I meant that imposing a cut off is likely to work, but > > it would work even better if the loop unroller produced code > > more-conducive to vectorization. > > Is such a cutoff already integrated and can be enabled by default.No, but I'll add one.> I > believe it would be great if we could show that the compile time > increase is limited by a low constant (maybe 100%), while still showing > an overall improvement in run time. > > >> Also, what do you mean with 'if the loop unroller would intermix > >> statements from the loops where possible'. Are you referring to the > >> grouped unrolling as shown in my the last mail? > > > > Yes. > > > > Also, the code necessary to take advantage of grouped unrolling already > > exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep > > causes the vectorizer to stop searching for instruction pairings after > > the first use of an instruction. > > Nice. I just tried one of the very basic examples from the Polly test > suite (simple_vec.ll). Running opt -O3 -vectorize on it, does not create > any vector instructions. I also played a little with the vectorizer > options, but was not able to get this example vectorized. Is this > because the chain is too short?You are correct. It will vectorize with: opt -O3 -vectorize -bb-vectorize-req-chain-depth=2> > >>> One problem with the current implementation is that it relies on > >>> GetPointerBaseWithConstantOffset to determine if two loads or stores > >>> share the same base address. This fails with partially-unrolled loops > >>> because it cannot "undo" all of the additions to the offset induction > >>> variable in order to understand that some of the loads and stores are > >>> really adjacent in memory. This is something that I think can be > >>> improved within the vectorizer itself, and I'm planning on doing some > >>> work on this in the future. > >> Here you may also want to look into ScalarEvolution. Basically two loads > >> access adjacent memory if the difference of the scalar evolution of the > >> two load addresses is equal to sizeof(element_type). ScalarEvolution > >> should be a lot more general than GetPointerBaseWithConstantOffset(). > > > > Thanks! That sounds great; I'll have to look at that. > > Talking about this I looked again into ScalarEvolution. > > To analyze a load, you would do: > > LoadInst *Load = ... > Value *Pointer = Load->getPointer(); > const SCEV *PointerSCEV = SE->getSCEV(Pointer); > const SCEVUnknown *PointerBase > dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV)); > > if (!PointerBase) > return 'Analysis failed' > > const Value *BaseValue = PointerBase->getValue(); > > You get the offset between two load addresses with SE->getMinusSCEV(). > The size of an element is SE->getSizeOfExpr().This is great! Thanks! -Hal> > Hope that helps > Tobi-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory
Hal Finkel
2011-Nov-11 22:36 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Thu, 2011-11-10 at 23:07 +0100, Tobias Grosser wrote:> On 11/08/2011 11:29 PM, Hal Finkel wrote: > > On Tue, 2011-11-08 at 20:24 +0100, Tobias Grosser wrote: > >> On 11/08/2011 03:36 PM, Hal Finkel wrote: > >>> On Tue, 2011-11-08 at 12:12 +0100, Tobias Grosser wrote: > >>>> On 11/08/2011 11:45 AM, Hal Finkel wrote: > > [A lot of performance results skipped] > > OK. As expected part of the speedup is because of unrolling, however it > also shows that vectorization itself improves performance. There is > still a lot of room for improvement, but it seems to be a good start. > > >>>> If I understood correctly it seems your vectorizer has quadratic > >>>> complexity which may cause large slowdowns. Do you think it may be > >>>> useful/possible to make it linear by introducing a constant upper bound > >>>> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and > >>>> most of the vectorization opportunities are close by (in some sense), > >>>> such that we get most of the speedup by locking at a subset of the problem. > >>> > >>> Yes, I agree. That makes a lot of sense. > >>> > >>> What would be even better is if the loop unroller would intermix > >>> statements from the loops where possible instead of leaving it to the > >>> vectorizer to do all of the grouping after the fact. That, I fear, is a > >>> whole other project. > >> > >> First, I do not understand the 'even better' here. To me it would be > >> great if on one hand the vectorizer could be constrained, such that the > >> compile time is predictable. And on the other hand, we could make the > >> loop unroller create code in a way such that the constrained vectorizer > >> still performs the relevant transformations. > > > > I was not clear; I meant that imposing a cut off is likely to work, but > > it would work even better if the loop unroller produced code > > more-conducive to vectorization. > > Is such a cutoff already integrated and can be enabled by default. I > believe it would be great if we could show that the compile time > increase is limited by a low constant (maybe 100%), while still showing > an overall improvement in run time. > > >> Also, what do you mean with 'if the loop unroller would intermix > >> statements from the loops where possible'. Are you referring to the > >> grouped unrolling as shown in my the last mail? > > > > Yes. > > > > Also, the code necessary to take advantage of grouped unrolling already > > exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep > > causes the vectorizer to stop searching for instruction pairings after > > the first use of an instruction. > > Nice. I just tried one of the very basic examples from the Polly test > suite (simple_vec.ll). Running opt -O3 -vectorize on it, does not create > any vector instructions. I also played a little with the vectorizer > options, but was not able to get this example vectorized. Is this > because the chain is too short? > > >>> One problem with the current implementation is that it relies on > >>> GetPointerBaseWithConstantOffset to determine if two loads or stores > >>> share the same base address. This fails with partially-unrolled loops > >>> because it cannot "undo" all of the additions to the offset induction > >>> variable in order to understand that some of the loads and stores are > >>> really adjacent in memory. This is something that I think can be > >>> improved within the vectorizer itself, and I'm planning on doing some > >>> work on this in the future. > >> Here you may also want to look into ScalarEvolution. Basically two loads > >> access adjacent memory if the difference of the scalar evolution of the > >> two load addresses is equal to sizeof(element_type). ScalarEvolution > >> should be a lot more general than GetPointerBaseWithConstantOffset(). > > > > Thanks! That sounds great; I'll have to look at that. > > Talking about this I looked again into ScalarEvolution. > > To analyze a load, you would do: > > LoadInst *Load = ... > Value *Pointer = Load->getPointer(); > const SCEV *PointerSCEV = SE->getSCEV(Pointer); > const SCEVUnknown *PointerBase > dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV)); > > if (!PointerBase) > return 'Analysis failed' > > const Value *BaseValue = PointerBase->getValue(); > > You get the offset between two load addresses with SE->getMinusSCEV(). > The size of an element is SE->getSizeOfExpr(). >The AliasAnalysis class has a set of interfaces that can be used to preserve the analysis even when some things are changed. Does ScalarEvolution have a similar capability? Thanks again, Hal> Hope that helps > Tobi-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory
Possibly Parallel Threads
- [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
- [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
- [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
- [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
- [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass