Tobias Grosser
2011-Dec-30 09:09 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On 12/29/2011 06:32 PM, Hal Finkel wrote:> On Thu, 2011-12-29 at 15:00 +0100, Tobias Grosser wrote: >> On 12/14/2011 01:25 AM, Hal Finkel wrote: >> One thing that I would still like to have is a test case where >> bb-vectorize-search-limit is needed to avoid exponential compile time >> growth and another test case that is not optimized, if >> bb-vectorize-search-limit is set to a value less than 4000. I think >> those cases are very valuable to understand the performance behavior of >> this code. > > Good idea, I'll add these test cases. > >> Especially, as I am not yet sure why we need a value as high >> as 4000. > > I am not exactly sure why that turned out to be the best number, but > I'll try this again in combination with my load/store reordering patch > and see if such a large value still seems best.They reason why I am surprised about this value, is that I believe partial loop unrolling would not yield bbs of this size. Code size limits should prevent size. However, loop unrolling seems to be the major reason why two accesses to adjacent memory may be placed far away. Without loop unrolling, at a distance of 4000 the fact that two instructions access adjacent memory locations seems to be completely random and the probability that the following instructions perform the same calculations seems low. Also, I believe at 4000 the compile time should already be significant higher. As it seems my intuition is wrong, I am very eager to see and understand an example where a search limit of 4000 is really needed. Cheers Tobi
Sebastian Pop
2012-Jan-17 19:25 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
Hi, On Fri, Dec 30, 2011 at 3:09 AM, Tobias Grosser <tobias at grosser.es> wrote:> As it seems my intuition is wrong, I am very eager to see and understand > an example where a search limit of 4000 is really needed. >To make the ball roll again, I attached a testcase that can be tuned to understand the impact on compile time for different sizes of a basic block. One can also set the number of iterations in the loop to 1 to test the vectorizer with no loops around. Hal, could you please report the compile times with/without the vectorizer for different basic block sizes? Once this parameter is tuned, could we get this code committed to llvm? Thanks, Sebastian PS: this testcase is also a compile time hog for GCC at -O3 when the loop vectorizer is running. -- Qualcomm Innovation Center, Inc is a member of Code Aurora Forum -------------- next part -------------- A non-text attachment was scrubbed... Name: bbvect.c Type: text/x-csrc Size: 582 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120117/884e9227/attachment.c>
Hal Finkel
2012-Jan-17 20:10 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Tue, 2012-01-17 at 13:25 -0600, Sebastian Pop wrote:> Hi, > > On Fri, Dec 30, 2011 at 3:09 AM, Tobias Grosser <tobias at grosser.es> wrote: > > As it seems my intuition is wrong, I am very eager to see and understand > > an example where a search limit of 4000 is really needed. > > > > To make the ball roll again,I will post an updated patch shortly. I have been "stress-testing" the patch to ensure correctness, and have corrected a few bugs. The most non-trivial issue that I've discovered was the possibility of generating non-trivial (meaning > 2 in length) pairing-induced dependency cycles. To prevent this from happening I've implemented a cycle check, but this increases the algorithmic complexity of the pair-selection process, and makes the vectorizer quite slow on some blocks. I can see two ways to proceed here: 1. Improve the cycle detection algorithm used (for example, I can use the algorithm currently used for ScheduleDAGTopologicalSort::WillCreateCycle, or something similar). 2. Late abort on non-trivial cycles. This will make the fusing process more expensive, but will not increase the algorithmic complexity. It would, however, degrade the quality of the vectorization, because it would mean that pairs otherwise selected for vectorization might, at the very end, not end up fused into vector instructions [this seems relatively rare, so it might not be a big deal]. The best thing may be to implement both and give the user an option of a fast way and a slower way (maybe this can be done post-commit).> I attached a testcase that can be tuned > to understand the impact on compile time for different sizes of a > basic block. One can also set the number of iterations in the loop to > 1 to test the vectorizer with no loops around. >Thanks!> Hal, could you please report the compile times with/without the > vectorizer for different basic block sizes?Absolutely!> > Once this parameter is tuned, could we get this code committed to llvm?Also, I've re-run the test suite, and with my load/store reordering patch also applied, a much smaller look-ahead value is optimal. In the name of solving one issue at a time, I would probably want to commit with the smaller default. Thanks again, Hal> > Thanks, > Sebastian > > PS: this testcase is also a compile time hog for GCC at -O3 when the > loop vectorizer is running. > > -- > Qualcomm Innovation Center, Inc is a member of Code Aurora Forum-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory
Hal Finkel
2012-Jan-24 04:13 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Tue, 2012-01-17 at 13:25 -0600, Sebastian Pop wrote:> Hi, > > On Fri, Dec 30, 2011 at 3:09 AM, Tobias Grosser <tobias at grosser.es> wrote: > > As it seems my intuition is wrong, I am very eager to see and understand > > an example where a search limit of 4000 is really needed. > > > > To make the ball roll again, I attached a testcase that can be tuned > to understand the impact on compile time for different sizes of a > basic block. One can also set the number of iterations in the loop to > 1 to test the vectorizer with no loops around. > > Hal, could you please report the compile times with/without the > vectorizer for different basic block sizes?I've looked at your test case, and I am pleased to report a negligible compile-time increase! Also, there is no vectorization of the main loop :) Here's why: (as you know) the main part of the loop is essentially one long dependency chain, and so there is nothing to vectorize there. The only vectorization opportunities come from unrolling the loop. Using the default thresholds, the loop will not even partially unroll (because the body is too large). As a result, essentially nothing happens. I've prepared a reduced version of your test case (attached). Using -unroll-threshold=300 (along with -unroll-allow-partial), I can make the loop unroll partially (the reduced loop size is 110, so this allows unrolling 2 iterations). Once this is done, the vectorizer finds candidate pairs and vectorizes [as a practical manner, you need -basicaa too]. I think that even this is probably too big for a regression test. I don't think that the basic structure really adds anything over existing tests (although I need to make sure that alias-analysis use is otherwise covered), but I'll copy-and-paste a small portion into a regression test to cover the search limit logic (which is currently uncovered). We should probably discuss different situations that we'd like to see covered in the regression suite (perhaps post-commit). Thanks for working on this! I'll post an updated patch for review shortly. -Hal> > Once this parameter is tuned, could we get this code committed to llvm? > > Thanks, > Sebastian > > PS: this testcase is also a compile time hog for GCC at -O3 when the > loop vectorizer is running. > > -- > Qualcomm Innovation Center, Inc is a member of Code Aurora Forum-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- ; ModuleID = '/homes/hfinkel/tmp/bbvect.c' target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu" ; This test case is based on code by Sebastian Pop. define i32 @main() nounwind uwtable readnone { entry: %A = alloca [100 x i32], align 16 %B = alloca [100 x i32], align 16 %C = alloca [100 x i32], align 16 br label %for.body for.body: ; preds = %for.body, %entry %indvars.iv21247 = phi i64 [ 0, %entry ], [ %indvars.iv.next21248, %for.body ] %arrayidx = getelementptr inbounds [100 x i32]* %A, i64 0, i64 %indvars.iv21247 %0 = trunc i64 %indvars.iv21247 to i32 store i32 %0, i32* %arrayidx, align 4, !tbaa !0 %1 = add nsw i64 %indvars.iv21247, 42 %arrayidx2 = getelementptr inbounds [100 x i32]* %B, i64 0, i64 %indvars.iv21247 %2 = trunc i64 %1 to i32 store i32 %2, i32* %arrayidx2, align 4, !tbaa !0 %3 = shl nsw i64 %indvars.iv21247, 1 %arrayidx4 = getelementptr inbounds [100 x i32]* %C, i64 0, i64 %indvars.iv21247 %4 = trunc i64 %3 to i32 store i32 %4, i32* %arrayidx4, align 4, !tbaa !0 %indvars.iv.next21248 = add i64 %indvars.iv21247, 1 %lftr.wideiv21251 = trunc i64 %indvars.iv.next21248 to i32 %exitcond21252 = icmp eq i32 %lftr.wideiv21251, 100 br i1 %exitcond21252, label %do.body9, label %for.body for.cond21230.preheader: ; preds = %do.body9 %arrayidx21236 = getelementptr inbounds [100 x i32]* %B, i64 0, i64 99 %arrayidx21234 = getelementptr inbounds [100 x i32]* %A, i64 0, i64 99 %arrayidx21239 = getelementptr inbounds [100 x i32]* %C, i64 0, i64 99 %5 = load i32* %arrayidx21236, align 4, !tbaa !0 %6 = load i32* %arrayidx21234, align 4, !tbaa !0 %add21237 = add i32 %5, %6 %7 = load i32* %arrayidx21239, align 4, !tbaa !0 %add21240 = add i32 %add21237, %7 ret i32 %add21240 do.body9: ; preds = %for.body, %do.body9 %indvars.iv = phi i64 [ %indvars.iv.next, %do.body9 ], [ 0, %for.body ] %arrayidx11 = getelementptr inbounds [100 x i32]* %B, i64 0, i64 %indvars.iv %8 = load i32* %arrayidx11, align 4, !tbaa !0 %arrayidx13 = getelementptr inbounds [100 x i32]* %C, i64 0, i64 %indvars.iv %9 = load i32* %arrayidx13, align 4, !tbaa !0 %add14 = add nsw i32 %9, %8 %arrayidx16 = getelementptr inbounds [100 x i32]* %A, i64 0, i64 %indvars.iv %mul21 = mul nsw i32 %add14, %9 %sub = sub nsw i32 %add14, %mul21 %mul41 = mul nsw i32 %add14, %sub %sub48 = sub nsw i32 %add14, %mul41 %mul62 = mul nsw i32 %add14, %sub48 %sub69 = sub nsw i32 %add14, %mul62 %mul83 = mul nsw i32 %add14, %sub69 %sub90 = sub nsw i32 %add14, %mul83 %mul104 = mul nsw i32 %add14, %sub90 %sub111 = sub nsw i32 %add14, %mul104 %mul125 = mul nsw i32 %add14, %sub111 %sub132 = sub nsw i32 %add14, %mul125 %mul146 = mul nsw i32 %add14, %sub132 %sub153 = sub nsw i32 %add14, %mul146 %mul167 = mul nsw i32 %add14, %sub153 %sub174 = sub nsw i32 %add14, %mul167 %mul188 = mul nsw i32 %add14, %sub174 %sub195 = sub nsw i32 %add14, %mul188 %mul209 = mul nsw i32 %add14, %sub195 %sub216 = sub nsw i32 %add14, %mul209 %mul231 = mul nsw i32 %add14, %sub216 %sub238 = sub nsw i32 %add14, %mul231 %mul252 = mul nsw i32 %add14, %sub238 %sub259 = sub nsw i32 %add14, %mul252 %mul273 = mul nsw i32 %add14, %sub259 %sub280 = sub nsw i32 %add14, %mul273 %mul294 = mul nsw i32 %add14, %sub280 %sub301 = sub nsw i32 %add14, %mul294 %mul315 = mul nsw i32 %add14, %sub301 %sub322 = sub nsw i32 %add14, %mul315 %mul336 = mul nsw i32 %add14, %sub322 %sub343 = sub nsw i32 %add14, %mul336 %mul357 = mul nsw i32 %add14, %sub343 %sub364 = sub nsw i32 %add14, %mul357 %mul378 = mul nsw i32 %add14, %sub364 %sub385 = sub nsw i32 %add14, %mul378 %mul399 = mul nsw i32 %add14, %sub385 %sub406 = sub nsw i32 %add14, %mul399 %mul420 = mul nsw i32 %add14, %sub406 %sub427 = sub nsw i32 %add14, %mul420 %mul443 = mul nsw i32 %add14, %sub427 %sub450 = sub nsw i32 %add14, %mul443 %mul464 = mul nsw i32 %add14, %sub450 %sub471 = sub nsw i32 %add14, %mul464 %mul485 = mul nsw i32 %add14, %sub471 %sub492 = sub nsw i32 %add14, %mul485 %mul506 = mul nsw i32 %add14, %sub492 %sub513 = sub nsw i32 %add14, %mul506 %mul527 = mul nsw i32 %add14, %sub513 %sub534 = sub nsw i32 %add14, %mul527 %mul548 = mul nsw i32 %add14, %sub534 %sub555 = sub nsw i32 %add14, %mul548 %mul569 = mul nsw i32 %add14, %sub555 %sub576 = sub nsw i32 %add14, %mul569 %mul590 = mul nsw i32 %add14, %sub576 %sub597 = sub nsw i32 %add14, %mul590 %mul611 = mul nsw i32 %add14, %sub597 %sub618 = sub nsw i32 %add14, %mul611 %mul632 = mul nsw i32 %add14, %sub618 %sub639 = sub nsw i32 %add14, %mul632 %mul655 = mul nsw i32 %add14, %sub639 %sub662 = sub nsw i32 %add14, %mul655 %mul676 = mul nsw i32 %add14, %sub662 %sub683 = sub nsw i32 %add14, %mul676 %mul697 = mul nsw i32 %add14, %sub683 %sub704 = sub nsw i32 %add14, %mul697 %mul718 = mul nsw i32 %add14, %sub704 %sub725 = sub nsw i32 %add14, %mul718 %mul739 = mul nsw i32 %add14, %sub725 %sub746 = sub nsw i32 %add14, %mul739 %mul760 = mul nsw i32 %add14, %sub746 %sub767 = sub nsw i32 %add14, %mul760 %mul781 = mul nsw i32 %add14, %sub767 %sub788 = sub nsw i32 %add14, %mul781 %mul802 = mul nsw i32 %add14, %sub788 %sub809 = sub nsw i32 %add14, %mul802 %mul823 = mul nsw i32 %add14, %sub809 %sub830 = sub nsw i32 %add14, %mul823 %mul844 = mul nsw i32 %add14, %sub830 %sub851 = sub nsw i32 %add14, %mul844 %mul867 = mul nsw i32 %add14, %sub851 %sub874 = sub nsw i32 %add14, %mul867 %mul888 = mul nsw i32 %add14, %sub874 %sub895 = sub nsw i32 %add14, %mul888 %mul909 = mul nsw i32 %add14, %sub895 %sub916 = sub nsw i32 %add14, %mul909 %mul930 = mul nsw i32 %add14, %sub916 %sub937 = sub nsw i32 %add14, %mul930 %mul951 = mul nsw i32 %add14, %sub937 %sub958 = sub nsw i32 %add14, %mul951 %mul972 = mul nsw i32 %add14, %sub958 %sub979 = sub nsw i32 %add14, %mul972 %mul993 = mul nsw i32 %add14, %sub979 %sub21179 = sub nsw i32 %add14, %mul993 %mul21193 = mul nsw i32 %add14, %sub21179 %sub21200 = sub nsw i32 %add14, %mul21193 store i32 %add14, i32* %arrayidx16, align 4, !tbaa !0 %mul21214 = mul nsw i32 %add14, %sub21200 store i32 %mul21214, i32* %arrayidx11, align 4, !tbaa !0 %sub21221 = sub nsw i32 %add14, %mul21214 store i32 %sub21221, i32* %arrayidx13, align 4, !tbaa !0 %indvars.iv.next = add i64 %indvars.iv, 1 %lftr.wideiv = trunc i64 %indvars.iv.next to i32 %exitcond = icmp eq i32 %lftr.wideiv, 100 br i1 %exitcond, label %for.cond21230.preheader, label %do.body9 } !0 = metadata !{metadata !"int", metadata !1} !1 = metadata !{metadata !"omnipotent char", metadata !2} !2 = metadata !{metadata !"Simple C/C++ TBAA", null}