Jingu Kang via llvm-dev
2021-Apr-29 14:58 UTC
[llvm-dev] Enabling IRCE pass or Adding something similar in the pipeline of new pass manager
Hi All, I am trying to vectorize some loops. Let 's see a simple loop. while() { ... if () ... } As you can see, there is if statement inside loop. As you know, LoopVectorizer tries to predicate the block of if statement with its condition and it asks target machines the cost of the instructions. If there are load or store instructions in the block to be predicated and target machine does not support masked load and store, LoopVectorizer assigns big number to the load and store instructions and it means the loop is not vectorized with loop vectorizer. In this case, it is important to remove the if statement inside loop before LoopVectorizer. We need to consider two cases to remove the if statement inside loop as below. 1. if statement's condition with loop invariant variables 2. if statement's condition with induction variables For the first one, UnSwitch/SimpleUnSwitch pass handles the case. The passes hoist the loop invariant condition outside loop and unswitch loop. For the second one, there could be several transformations but I think the IRCE pass is general solution. The pass splits the iteration space following the condition with induction variable. At this moment, the only SimpleUnSwitch pass is in pipeline of new pass manager but IRCE is not. Let's see an IR function to see the impact of IRCE pass. target triple = "aarch64" define void @foo(i64 %a, i64* noalias %src, i64* noalias %dst, i64 %n) { entry: br label %bb bb: %if.cond = icmp slt i64 1, %n br i1 %if.cond, label %if.then, label %exit if.then: %if.cond.2 = icmp slt i64 10, %n br i1 %if.cond.2, label %loop.ph, label %if.then.else if.then.else: br label %loop.ph loop.ph: br label %loop loop: %iv = phi i64 [ %inc, %for.inc ], [ 1, %loop.ph ] %cmp = icmp slt i64 %iv, %a br i1 %cmp, label %if.then.2, label %for.inc if.then.2: %src.arrayidx = getelementptr inbounds i64, i64* %src, i64 %iv %val = load i64, i64* %src.arrayidx %dst.arrayidx = getelementptr inbounds i64, i64* %dst, i64 %iv store i64 %val, i64* %dst.arrayidx br label %for.inc for.inc: %inc = add nuw nsw i64 %iv, 1 %cond = icmp eq i64 %inc, %n br i1 %cond, label %exit, label %loop exit: ret void } If we try to vectorize above code, we can see below debug output. opt -loop-vectorize -S ./test.ll -debug-only=loop-vectorize LV: Scalar loop costs: 5. ... LV: Found an estimated cost of 3000000 for VF 2 For instruction: %val = load i64, i64* %src.arrayidx, align 4 LV: Vector loop of width 2 costs: 1500004. ... LV: Vectorization is possible but not beneficial If we run IRCE pass ahead of loop vectorizer with SCEV changes https://reviews.llvm.org/D101409 and https://reviews.llvm.org/D100566, we can see the loop is vectorized with below debug output. opt -irce -irce-skip-profitability-checks -simplifycfg -instcombine -loop-vectorize -S ./test.ll -debug-only=loop-vectorize LV: Scalar loop costs: 6. LV: Vector loop of width 2 costs: 2. In order to vectorize more loops which has conditional branch inside, we need to enable IRCE pass or add something similar ahead of loop vectorizer in the pipeline of new pass manager. I had a short discussion with Philip and Nikita on https://reviews.llvm.org/D101409 and it looks we need more effort for IRCE pass or something similar features. If you have experience or idea about this, please share it. Thanks JinGu Kang
Philip Reames via llvm-dev
2021-Apr-29 21:00 UTC
[llvm-dev] Enabling IRCE pass or Adding something similar in the pipeline of new pass manager
Let me start by copying my comment for the review. @mkazantsev <https://reviews.llvm.org/p/mkazantsev/> and I have spent a lot of time working on vectorization of similar loops. If you want to talk through this, I'd be happy to jump on a call. You may also find my talk at last years llvm conference on multiple exit loops useful background. A couple of general comments. I really don't think that extending IRCE is the right path forward here. IRCE has some serious design defects, and I'm honestly quite nervous about it's correctness. I think that iteration set splitting (the basic transform IRCE uses) is absolutely something we should implement for the main pipeline, but I'd approach it as building new infrastructure to replace IRCE, not as getting IRCE on by default. In particular, I suspect the value comes primarily from a cost model driven approach to splitting, not IRCE's unconditional one. Second, I advise being very cautious about going directly for the general case here. The general case for this is *really really hard*. If it wasn't, we'd already have robust solutions. If you can describe your motivating examples in a bit more depth (maybe offline), we can see if we can find a specific sub-case which is both tractable and profitable. Thank you for continuing the conversation on llvm-dev, a specific suggestion inline below. I also forgot to mention the ongoing (currently stalled) work on multiple exit vectorization. That might also be of interest. On 4/29/21 7:58 AM, Jingu Kang via llvm-dev wrote:> Hi All, > > I am trying to vectorize some loops. Let 's see a simple loop. > > while() { > ... > if () > ... > } > > As you can see, there is if statement inside loop. As you know, LoopVectorizer tries to predicate the block of if statement with its condition and it asks target machines the cost of the instructions. If there are load or store instructions in the block to be predicated and target machine does not support masked load and store, LoopVectorizer assigns big number to the load and store instructions and it means the loop is not vectorized with loop vectorizer. In this case, it is important to remove the if statement inside loop before LoopVectorizer. > > We need to consider two cases to remove the if statement inside loop as below. > > 1. if statement's condition with loop invariant variables > 2. if statement's condition with induction variables > > For the first one, UnSwitch/SimpleUnSwitch pass handles the case. The passes hoist the loop invariant condition outside loop and unswitch loop. > For the second one, there could be several transformations but I think the IRCE pass is general solution. The pass splits the iteration space following the condition with induction variable. > > At this moment, the only SimpleUnSwitch pass is in pipeline of new pass manager but IRCE is not. > > Let's see an IR function to see the impact of IRCE pass. > > target triple = "aarch64" > > define void @foo(i64 %a, i64* noalias %src, i64* noalias %dst, i64 %n) { > entry: > br label %bb > > bb: > %if.cond = icmp slt i64 1, %n > br i1 %if.cond, label %if.then, label %exit > > if.then: > %if.cond.2 = icmp slt i64 10, %n > br i1 %if.cond.2, label %loop.ph, label %if.then.else > > if.then.else: > br label %loop.ph > > loop.ph: > br label %loop > > loop: > %iv = phi i64 [ %inc, %for.inc ], [ 1, %loop.ph ] > %cmp = icmp slt i64 %iv, %a > br i1 %cmp, label %if.then.2, label %for.inc > > if.then.2: > %src.arrayidx = getelementptr inbounds i64, i64* %src, i64 %iv > %val = load i64, i64* %src.arrayidx > %dst.arrayidx = getelementptr inbounds i64, i64* %dst, i64 %iv > store i64 %val, i64* %dst.arrayidx > br label %for.inc > > for.inc: > %inc = add nuw nsw i64 %iv, 1 > %cond = icmp eq i64 %inc, %n > br i1 %cond, label %exit, label %loop > > exit: > ret void > }In this example, forming the full pre/main/post loop structure of IRCE is overkill. Instead, we could simply restrict the loop bounds in the following manner: loop.ph: ;; Warning: psuedo code, might have edge conditions wrong %c = icmp sgt %iv, %n %min = umax(%n, %a) br i1 %c, label %exit, label %loop.ph loop.ph.split: br label %loop loop: %iv = phi i64 [ %inc, %loop ], [ 1, %loop.ph ] %src.arrayidx = getelementptr inbounds i64, i64* %src, i64 %iv %val = load i64, i64* %src.arrayidx %dst.arrayidx = getelementptr inbounds i64, i64* %dst, i64 %iv store i64 %val, i64* %dst.arrayidx %inc = add nuw nsw i64 %iv, 1 %cond = icmp eq i64 %inc, %min br i1 %cond, label %exit, label %loop exit: ret void } I'm not quite sure what to call this transform, but it's not IRCE. If this example is actually general enough to cover your use cases, it's going to be a lot easier to judge profitability on than the general form of iteration set splitting. Another way to frame this special case might be to recognize the conditional block can be inverted into an early exit. (Reasoning: %iv is strictly increasing, condition is monotonic, path if not taken has no observable effect) Consider: loop.ph: br label %loop loop: %iv = phi i64 [ %inc, %for.inc ], [ 1, %loop.ph ] %cmp = icmp sge i64 %iv, %a br i1 %cmp, label %exit, label %for.inc for.inc: %src.arrayidx = getelementptr inbounds i64, i64* %src, i64 %iv %val = load i64, i64* %src.arrayidx %dst.arrayidx = getelementptr inbounds i64, i64* %dst, i64 %iv store i64 %val, i64* %dst.arrayidx %inc = add nuw nsw i64 %iv, 1 %cond = icmp eq i64 %inc, %n br i1 %cond, label %exit, label %loop exit: ret void } Once that's done, the multiple exit vectorization work should vectorize this loop. Thinking about it, I really like this variant.> > If we try to vectorize above code, we can see below debug output. > > opt -loop-vectorize -S ./test.ll -debug-only=loop-vectorize > > LV: Scalar loop costs: 5. > ... > LV: Found an estimated cost of 3000000 for VF 2 For instruction: %val = load i64, i64* %src.arrayidx, align 4 > LV: Vector loop of width 2 costs: 1500004. > ... > LV: Vectorization is possible but not beneficialThe costing here seems quite off. I have not looked at how the vectorize costs predicated loads on hardware without predication, but needing to scalarize a conditional VF-times and form a vector again does not have a cost of 3 million. This could definitely be improved.> > If we run IRCE pass ahead of loop vectorizer with SCEV changes https://reviews.llvm.org/D101409 and https://reviews.llvm.org/D100566, we can see the loop is vectorized with below debug output. > > opt -irce -irce-skip-profitability-checks -simplifycfg -instcombine -loop-vectorize -S ./test.ll -debug-only=loop-vectorize > > LV: Scalar loop costs: 6. > LV: Vector loop of width 2 costs: 2. > > In order to vectorize more loops which has conditional branch inside, we need to enable IRCE pass or add something similar ahead of loop vectorizer in the pipeline of new pass manager. > > I had a short discussion with Philip and Nikita on https://reviews.llvm.org/D101409 and it looks we need more effort for IRCE pass or something similar features. If you have experience or idea about this, please share it. > > Thanks > JinGu Kang > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210429/4dfa6975/attachment.html>