Chandler Carruth
2012-Apr-17 13:09 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
On Tue, Apr 17, 2012 at 1:36 PM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:

> > I am not sure how best to fix this. If possible, InstCombine's
> > canonicalization shouldn't hide arithmetic progressions behind bit
> > masks. At least, it seems these transformations should be disabled
> > unless (X >> C).hasOneUse(). They aren't exactly optimizations.
> >
> > This:
> >
> >   %div = lshr i32 %a, 2
> >   store i32 %div, i32* %p, align 4, !tbaa !0
> >   %add = shl nuw nsw i32 %div, 1
> >
> > is better than this:
> >
> >   %div = lshr i32 %a, 2
> >   store i32 %div, i32* %p, align 4, !tbaa !0
> >   %0 = lshr i32 %a, 1
> >   %add = and i32 %0, 2147483646
>
> I think we could try your hasOneUse idea. If we are going to keep
> %div, we may as well keep using it and save one instruction in the
> canonical form.

I really dislike hasOneUse-based "solutions" at the IR / InstCombine
layer. They result in strange artifacts during optimization: places
where adding similar code turns off optimizations because we fold the
similar bits together and reuse parts of the computation.

I would much rather see us devise a reasonable set of canonicalization
rules at the IR level that don't block optimization. Tricks like saving
instructions, using bit masks, etc. should all be reserved for the
codegen layer / DAGCombine. There, hasOneUse makes perfect sense,
because you're trying to peephole through architectural gymnastics, not
trying to get systematic scalar optimizations to fire.

Let's first see whether we have any evidence that reserving this
transform for the DAGCombiner hurts anything, how it hurts, and what we
can do about it.
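For concreteness, the two IR snippets quoted above compute the same
value: the transform rewrites (a >> 2) << 1 into (a >> 1) & 2147483646
(0x7FFFFFFE). A minimal C sketch of that identity, with helper names
that are illustrative rather than taken from the thread:

#include <assert.h>
#include <stdint.h>

/* shift_form mirrors "%add = shl nuw nsw i32 %div, 1" applied to
 * "%div = lshr i32 %a, 2"; mask_form mirrors the canonicalized
 * "%add = and i32 %0, 2147483646" applied to "%0 = lshr i32 %a, 1". */
static uint32_t shift_form(uint32_t a) { return (a >> 2) << 1; }
static uint32_t mask_form(uint32_t a)  { return (a >> 1) & 2147483646u; }

int main(void) {
  /* Spot-check the identity on a few values, including edge cases. */
  uint32_t tests[] = { 0u, 1u, 2u, 3u, 4u, 0x7fffffffu, 0xffffffffu };
  for (unsigned i = 0; i != sizeof tests / sizeof tests[0]; ++i)
    assert(shift_form(tests[i]) == mask_form(tests[i]));
  return 0;
}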
Rafael Espíndola
2012-Apr-17 13:35 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
> I really dislike hasOneUse-based "solutions" at the IR / InstCombine
> layer. They result in strange artifacts during optimization: places
> where adding similar code turns off optimizations because we fold the
> similar bits together and reuse parts of the computation.
>
> I would much rather see us devise a reasonable set of canonicalization
> rules at the IR level that don't block optimization. Tricks like
> saving instructions, using bit masks, etc. should all be reserved for
> the codegen layer / DAGCombine. There, hasOneUse makes perfect sense,
> because you're trying to peephole through architectural gymnastics,
> not trying to get systematic scalar optimizations to fire.
>
> Let's first see whether we have any evidence that reserving this
> transform for the DAGCombiner hurts anything, how it hurts, and what
> we can do about it.

<day dreaming>

In general I think I agree. Putting the subexpression in an inline
function, for example, would cause the hasOneUse hack to fail, since
the multiple uses would only show up after inlining. To be fully
general, we would have to be able to reverse any transformation we
refused to do because a value had more than one use.

I also agree that the pipeline should first make the IL canonical and
then decide the best way to compute the canonical form. The one thing I
find unfortunate is that in LLVM the second step lives almost entirely
at the DAG and MI levels. There is no point where we have IL passes
that know more about the target.

</day dreaming>

Looking a bit more at this particular case, I think the problem is not
so much the representation we have chosen (with the masks), or that the
DAG combiner is better at handling it. It is just that the DAG combiner
is a second pass. Running the example through opt again produces

define i32 @f(i32 %a, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %div = lshr i32 %a, 2
  store i32 %div, i32* %p, align 4
  %0 = lshr i32 %a, 1
  %add = and i32 %0, 2147483646
  %arrayidx1 = getelementptr inbounds i32* %p, i64 1
  store i32 %add, i32* %arrayidx1, align 4
  ret i32 0
}

I realize this is orthogonal to whether using the mask is a good idea,
since by some metric the code could be even simpler if it reused %div,
but it does show that the IL optimizers stop while there is more they
could do.

OK. In the end I think I agree: we should try it out and decide on the
canonical representation first.

Cheers,
Rafael
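Rafael's inline-function point can be made concrete with a small
hypothetical sketch (the function names below are mine, not from the
thread). Inside div4, the shift has exactly one use, so a
hasOneUse-gated InstCombine would still apply the mask rewrite there;
the second use only materializes after the inliner and GVN have run:

/* A hasOneUse() guard inspects (a >> 2) inside div4, where it has a
 * single use, so the shift+mask rewrite would still fire. Only after
 * div4 is inlined into f and GVN merges the two copies does the shift
 * acquire a second use -- too late for the guard to help. */
static inline unsigned div4(unsigned a) { return a >> 2; }

unsigned f(unsigned a, unsigned *p) {
  *p = div4(a);        /* first use of a >> 2, in its own body */
  return div4(a) << 1; /* second use emerges only after inlining */
}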
Jakob Stoklund Olesen
2012-Apr-17 17:40 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
On Apr 17, 2012, at 6:09 AM, Chandler Carruth <chandlerc at google.com> wrote:

> I really dislike hasOneUse-based "solutions" at the IR / InstCombine
> layer. They result in strange artifacts during optimization: places
> where adding similar code turns off optimizations because we fold the
> similar bits together and reuse parts of the computation.

In my example, it won't even work. The second use of 'x+x' only emerges
after GVN, after the harm has been done.

However, I am more concerned about hiding things from scalar evolution
than from InstCombine itself. Here is an example I was expecting to
fail:

$ cat small2.c
struct desc {
  unsigned size : 2;
  unsigned skip : 6;
  unsigned value;
};

void f(struct desc d, unsigned *p) {
  for (unsigned i = 0; i != 64; ++i) {
    *p = 0;
    p += d.skip;
    *p = 1;
    p += d.skip;
  }
}

It does the right thing, though:

define void @f(i64 %d.coerce, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %0 = lshr i64 %d.coerce, 2
  %bf.clear = and i64 %0, 63
  %add.ptr.sum = shl nuw nsw i64 %bf.clear, 1
  br label %for.body

for.body:                                 ; preds = %entry, %for.body
  %p.addr.06 = phi i32* [ %p, %entry ], [ %add.ptr3, %for.body ]
  %i.05 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  store i32 0, i32* %p.addr.06, align 4, !tbaa !0
  %add.ptr = getelementptr inbounds i32* %p.addr.06, i64 %bf.clear
  store i32 1, i32* %add.ptr, align 4, !tbaa !0
  %add.ptr3 = getelementptr inbounds i32* %p.addr.06, i64 %add.ptr.sum
  %inc = add i32 %i.05, 1
  %cmp = icmp eq i32 %inc, 64
  br i1 %cmp, label %for.end, label %for.body

for.end:                                  ; preds = %for.body
  ret void
}

Notice how the two gep offsets %bf.clear and %add.ptr.sum have a simple
arithmetic relationship that scalar evolution can figure out.

With a very small change, that breaks:

$ cat small3.c
void f(unsigned long d, unsigned *p) {
  unsigned long skip = d/4;
  for (unsigned i = 0; i != 64; ++i) {
    *p = 0;
    p += skip;
    *p = 1;
    p += skip;
  }
}

define void @f(i64 %d, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %div = lshr i64 %d, 2
  %0 = lshr i64 %d, 1
  %add.ptr.sum = and i64 %0, 9223372036854775806
  br label %for.body

for.body:                                 ; preds = %entry, %for.body
  %i.03 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %p.addr.02 = phi i32* [ %p, %entry ], [ %add.ptr1, %for.body ]
  store i32 0, i32* %p.addr.02, align 4, !tbaa !0
  %add.ptr = getelementptr inbounds i32* %p.addr.02, i64 %div
  store i32 1, i32* %add.ptr, align 4, !tbaa !0
  %add.ptr1 = getelementptr inbounds i32* %p.addr.02, i64 %add.ptr.sum
  %inc = add i32 %i.03, 1
  %cmp = icmp eq i32 %inc, 64
  br i1 %cmp, label %for.end, label %for.body

for.end:                                  ; preds = %for.body
  ret void
}

Now it suddenly becomes very difficult for scalar evolution to figure
out the relationship between %div and %add.ptr.sum.

I think the canonical form should look more like the bit field example -
don't hide the shl behind a bit mask.

I'll try deferring the transforms that hide an shl instruction to
DAGCombine. I'll run some benchmarks.

/jakob
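The relationship that gets obscured in the small3.c output can be
spot-checked directly: the mask constant 9223372036854775806 is
0x7FFFFFFFFFFFFFFE, and (d >> 1) & 0x7FFFFFFFFFFFFFFE is arithmetically
just 2 * (d >> 2). A small C sketch (not from the original mail)
verifying this:

#include <assert.h>
#include <stdint.h>

int main(void) {
  /* %div         = lshr i64 %d, 2                 -> d >> 2
   * %add.ptr.sum = and i64 (lshr i64 %d, 1),
   *                9223372036854775806            -> (d >> 1) masked
   * The masked form is exactly twice %div, but the factor of two is
   * no longer a visible shl. */
  uint64_t tests[] = { 0u, 1u, 2u, 3u, 0x123456789abcdef0ull,
                       UINT64_MAX };
  for (unsigned i = 0; i != sizeof tests / sizeof tests[0]; ++i) {
    uint64_t d = tests[i];
    assert(((d >> 1) & 0x7ffffffffffffffeull) == 2 * (d >> 2));
  }
  return 0;
}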
Jakob Stoklund Olesen
2012-Apr-17 17:57 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
On Apr 17, 2012, at 6:35 AM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:

> I also agree that the pipeline should first make the IL canonical and
> then decide the best way to compute the canonical form. The one thing
> I find unfortunate is that in LLVM the second step lives almost
> entirely at the DAG and MI levels. There is no point where we have IL
> passes that know more about the target.

Note that adding the bit masks does make sense as a canonical form; it
is not just a premature optimization. Expressions like

  x >> 2 << 3 >> 5 << 6

get canonicalized to a single shift plus a mask that way.

I think it would make sense to leave the last shl alone, so the above
expression gets canonicalized to:

  ((x >> 4) & mask) << 6

DAGCombine can reduce it completely if it wants to.

This also matches the code generated for the bit field access in my
small2.c example:

entry:
  %and = lshr i64 %d, 2
  %div = and i64 %and, 63
  %add.ptr.sum = shl nuw nsw i64 %div, 1

/jakob
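The shift chain can be worked through bit by bit: for a 64-bit x, bits
4..61 of x survive the chain and land at positions 6..63, so the whole
expression equals (x << 2) & ~63, and the "leave the last shl alone"
form holds with mask = (1 << 58) - 1. The concrete mask constants below
are derived here, not taken from the mail; a sketch checking all three
forms:

#include <assert.h>
#include <stdint.h>

/* chained:  the original x >> 2 << 3 >> 5 << 6 (net shift left by 2).
 * one_mask: the single-shift-plus-mask canonical form.
 * keep_shl: Jakob's proposed ((x >> 4) & mask) << 6 with the last shl
 *           left visible for scalar evolution. */
static uint64_t chained(uint64_t x)  { return (((x >> 2) << 3) >> 5) << 6; }
static uint64_t one_mask(uint64_t x) { return (x << 2) & ~63ull; }
static uint64_t keep_shl(uint64_t x) {
  return ((x >> 4) & ((1ull << 58) - 1)) << 6;
}

int main(void) {
  uint64_t tests[] = { 0u, 1u, 0xf0u, 0x123456789abcdef0ull,
                       UINT64_MAX };
  for (unsigned i = 0; i != sizeof tests / sizeof tests[0]; ++i) {
    assert(chained(tests[i]) == one_mask(tests[i]));
    assert(chained(tests[i]) == keep_shl(tests[i]));
  }
  return 0;
}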