Jakob Stoklund Olesen
2012-Apr-16 22:23 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
Look at this silly function: $ cat small.c unsigned f(unsigned a, unsigned *p) { unsigned x = a/4; p[0] = x; p[1] = x+x; return p[1] - 2*p[0]; } GCC turns this into straightforward code and figures out the 0 return value: shrl $2, %edi movl %edi, (%rsi) addl %edi, %edi movl %edi, 4(%rsi) movl $0, %eax ret LLVM optimizes the code: $ clang -O -S -o- small.c -emit-llvm define i32 @f(i32 %a, i32* nocapture %p) nounwind uwtable ssp { entry: %div = lshr i32 %a, 2 store i32 %div, i32* %p, align 4, !tbaa !0 %0 = lshr i32 %a, 1 %add = and i32 %0, 2147483646 %arrayidx1 = getelementptr inbounds i32* %p, i64 1 store i32 %add, i32* %arrayidx1, align 4, !tbaa !0 %1 = lshr i32 %a, 1 %mul = and i32 %1, 2147483646 %sub = sub i32 %add, %mul ret i32 %sub } InstCombine has obfuscated 'x+x' so badly that it can't even recognize it itself. DAGCombine will eventually untangle the mess and figure out that the return value is 0, but that is too late. The loop optimization passes and scalar evolution don't get to see the simple 'x' and '2*x' expressions. The problem is these transformations in InstCombineShifts.cpp: // If we have ((X >>? C) << C), turn this into X & (-1 << C). // (X >>? C1) << C2 --> X << (C2-C1) & (-1 << C2) // (X >>? C1) << C2 --> X >>? (C1-C2) & (-1 << C2) The shl instruction is just as much arithmetic as it is a bitwise logical instruction. It is used as the canonical form for x+x, 4*x etc. The transforms above turn an arithmetic expression into a purely logical expression, and it is very hard to recover the original arithmetic expression. Disabling the transformations produces the straightforward: define i32 @f(i32 %a, i32* nocapture %p) nounwind uwtable ssp { entry: %div = lshr i32 %a, 2 store i32 %div, i32* %p, align 4, !tbaa !0 %add = shl nuw nsw i32 %div, 1 %arrayidx1 = getelementptr inbounds i32* %p, i64 1 store i32 %add, i32* %arrayidx1, align 4, !tbaa !0 ret i32 0 } The problem is not so much figuring out the 0 return value. The problem is that the 'canonical form' produced by InstCombine is hiding trivial arithmetic expressions like 'x+x' from scalar evolution. I am not sure how best to fix this. If possible, InstCombine's canonicalization shouldn't hide arithmetic progressions behind bit masks. At least, it seems these transformations should be disabled unless (X >> C).hasOneUse(). They aren't exactly optimizations. This: %div = lshr i32 %a, 2 store i32 %div, i32* %p, align 4, !tbaa !0 %add = shl nuw nsw i32 %div, 1 is better than this: %div = lshr i32 %a, 2 store i32 %div, i32* %p, align 4, !tbaa !0 %0 = lshr i32 %a, 1 %add = and i32 %0, 2147483646 /jakob
Chandler Carruth
2012-Apr-16 22:30 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
On Tue, Apr 17, 2012 at 12:23 AM, Jakob Stoklund Olesen <stoklund at 2pi.dk>wrote:> I am not sure how best to fix this. If possible, InstCombine's > canonicalization shouldn't hide arithmetic progressions behind bit masks.The entire concept of cleverly converting arithmetic to bit masks seems like the perfect domain for DAGCombine instead of InstCombine: 1) We know the architecture, so we can make intelligent decisions about what masks are cheap or expensive. 2) We know the addressing modes so we can fold arithmetic into them 3) There are no more high-level optimizations we're trying to enable: gvn, scev, loop opts, other deep math optimizations have all already had their shot at the code. Does sinking these into the DAGCombine layer help? How much does it break? In the past, LLVM has been incredibly sensitive to canonicalization changes, so I'm worried, but I've also seen a *ton* of code similar to what you posted, and it simply makes no sense to me. I wrote a pile of heroics in the addressing mode matching for x86 to specifically work around these obnoxious representations. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120417/2d224fd3/attachment.html>
Jakob Stoklund Olesen
2012-Apr-16 23:06 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
On Apr 16, 2012, at 3:30 PM, Chandler Carruth <chandlerc at google.com> wrote:> On Tue, Apr 17, 2012 at 12:23 AM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote: > I am not sure how best to fix this. If possible, InstCombine's canonicalization shouldn't hide arithmetic progressions behind bit masks. > > The entire concept of cleverly converting arithmetic to bit masks seems like the perfect domain for DAGCombine instead of InstCombine: > > 1) We know the architecture, so we can make intelligent decisions about what masks are cheap or expensive. > 2) We know the addressing modes so we can fold arithmetic into them > 3) There are no more high-level optimizations we're trying to enable: gvn, scev, loop opts, other deep math optimizations have all already had their shot at the code. > > Does sinking these into the DAGCombine layer help? How much does it break?I don't know what would break, but DAGCombine already has these tricks: $ cat small2.c unsigned f(unsigned x) { return x >> 2 << 3 >> 2 << 5; } With the shift transforms disabled, we get: define i32 @f(i32 %x) nounwind uwtable readnone ssp { entry: %shr = lshr i32 %x, 2 %shl = shl i32 %shr, 3 %shr1 = lshr exact i32 %shl, 2 %shl2 = shl i32 %shr1, 5 ret i32 %shl2 } But DAGCombine goes: shll $4, %edi andl $-64, %edi movl %edi, %eax ret And you are right, we only get the bit masks when it is worthwhile. /jakob -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120416/add24235/attachment.html>
Rafael Espíndola
2012-Apr-17 11:36 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
> I am not sure how best to fix this. If possible, InstCombine's canonicalization shouldn't hide arithmetic progressions behind bit masks. At least, it seems these transformations should be disabled unless (X >> C).hasOneUse(). They aren't exactly optimizations. > > This: > > %div = lshr i32 %a, 2 > store i32 %div, i32* %p, align 4, !tbaa !0 > %add = shl nuw nsw i32 %div, 1 > > is better than this: > > %div = lshr i32 %a, 2 > store i32 %div, i32* %p, align 4, !tbaa !0 > %0 = lshr i32 %a, 1 > %add = and i32 %0, 2147483646I think we could try your hasOneUse idea. If we are going to keep %div, we may as well keep using it and save one instruction in the canonical form.> /jakobCheers, Rafael
Chandler Carruth
2012-Apr-17 13:09 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
On Tue, Apr 17, 2012 at 1:36 PM, Rafael Espíndola < rafael.espindola at gmail.com> wrote:> > I am not sure how best to fix this. If possible, InstCombine's > canonicalization shouldn't hide arithmetic progressions behind bit masks. > At least, it seems these transformations should be disabled unless (X >> > C).hasOneUse(). They aren't exactly optimizations. > > > > This: > > > > %div = lshr i32 %a, 2 > > store i32 %div, i32* %p, align 4, !tbaa !0 > > %add = shl nuw nsw i32 %div, 1 > > > > is better than this: > > > > %div = lshr i32 %a, 2 > > store i32 %div, i32* %p, align 4, !tbaa !0 > > %0 = lshr i32 %a, 1 > > %add = and i32 %0, 2147483646 > > I think we could try your hasOneUse idea. If we are going to keep > %div, we may as well keep using it and save one instruction in the > canonical form.I really dislike hasOneUse-based "solutions" at the IR / InstCombine layer. They result in strange artifacts during optimization: places where adding similar code turns off optimizations because we fold the similar bits together and reuse parts of the computation. I would much rather see us devise a reasonable set of canonicalization rules at the IR level that don't block optimization. Tricks like saving instructions, using bit masks, etc., should all be reserved for the codegen layer / DAGCombine. There, hasOneUse makes perfect sense because you're trying to peephole through architectural gymnastics, not trying to get systematic scalar optimizations to fire. Let's see if we have any evidence that reserving this transform for the DAGCombiner hurts things, how it hurts things, and what we can do about it first? -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120417/0b5e205d/attachment.html>
Jakob Stoklund Olesen
2012-Apr-17 22:22 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
On Apr 16, 2012, at 3:30 PM, Chandler Carruth <chandlerc at google.com> wrote:> Does sinking these into the DAGCombine layer help? How much does it break?I tried disabling just the InstCombine transforms that hide shl instructions behind bitmasks. Even though DAGCombine has the same transforms, it causes some pretty bad regressions: External/SPEC/CINT95/147_vortex/147_vortex 0.294 0.322 +9.6% +40mB MultiSource/Benchmarks/Olden/tsp/tsp 0.680 0.748 +9.9% +41mB SingleSource/Benchmarks/Shootout-C++/except 0.116 0.128 +10.8% +45mB SingleSource/Benchmarks/Shootout/strcat 0.102 0.113 +11.1% +46mB SingleSource/Benchmarks/Shootout-C++/hash 0.455 0.507 +11.4% +47mB External/Povray/povray 2.015 2.246 +11.5% +47mB External/SPEC/CINT2000/255_vortex/255_vortex 1.814 2.044 +12.7% +52mB SingleSource/Benchmarks/Shootout-C++/heapsort 1.871 2.132 +13.9% +57mB SingleSource/Benchmarks/Shootout-C++/ary3 1.087 1.264 +16.3% +65mB MultiSource/Benchmarks/SciMark2-C/scimark2 27.491 23.596 -14.2% -66mB MultiSource/Benchmarks/Olden/bisort/bisort 0.360 0.428 +19.0% +75mB MultiSource/Benchmarks/Olden/bh/bh 1.074 1.287 +19.9% +79mB (Running on Sandy Bridge, x86-64) I'll try to figure out why. /jakob -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20120417/c7c5a56a/attachment.html>
Apparently Analagous Threads
- [LLVMdev] InstCombine adds bit masks, confuses self, others
- [LLVMdev] InstCombine adds bit masks, confuses self, others
- [LLVMdev] InstCombine adds bit masks, confuses self, others
- Compare test-suite benchmarks performance complied without TBAA, with default TBAA and with new TBAA struct path
- [LLVMdev] 2.2 Prerelease available for testing