thr3ads.net - llvm dev - [LLVMdev] InstCombine adds bit masks, confuses self, others [Apr 2012]

If this information is useful, please help other people find it:
Share via:

Jakob Stoklund Olesen

2012-Apr-16 22:23 UTC

[LLVMdev] InstCombine adds bit masks, confuses self, others

Look at this silly function:

$ cat small.c
unsigned f(unsigned a, unsigned *p) {
 unsigned x = a/4;
 p[0] = x;
 p[1] = x+x;
 return p[1] - 2*p[0];
}

GCC turns this into straightforward code and figures out the 0 return value:

	shrl	$2, %edi
	movl	%edi, (%rsi)
	addl	%edi, %edi
	movl	%edi, 4(%rsi)
	movl	$0, %eax
	ret

LLVM optimizes the code:

$ clang -O -S -o- small.c -emit-llvm

define i32 @f(i32 %a, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %div = lshr i32 %a, 2
  store i32 %div, i32* %p, align 4, !tbaa !0
  %0 = lshr i32 %a, 1
  %add = and i32 %0, 2147483646
  %arrayidx1 = getelementptr inbounds i32* %p, i64 1
  store i32 %add, i32* %arrayidx1, align 4, !tbaa !0
  %1 = lshr i32 %a, 1
  %mul = and i32 %1, 2147483646
  %sub = sub i32 %add, %mul
  ret i32 %sub
}

InstCombine has obfuscated 'x+x' so badly that it can't even
recognize it itself. DAGCombine will eventually untangle the mess and figure out
that the return value is 0, but that is too late. The loop optimization passes
and scalar evolution don't get to see the simple 'x' and
'2*x' expressions.

The problem is these transformations in InstCombineShifts.cpp:

      // If we have ((X >>? C) << C), turn this into X & (-1
<< C).
      // (X >>? C1) << C2 --> X << (C2-C1) & (-1
<< C2)
      // (X >>? C1) << C2 --> X >>? (C1-C2) & (-1
<< C2)

The shl instruction is just as much arithmetic as it is a bitwise logical
instruction. It is used as the canonical form for x+x, 4*x etc. The transforms
above turn an arithmetic expression into a purely logical expression, and it is
very hard to recover the original arithmetic expression.

Disabling the transformations produces the straightforward:

define i32 @f(i32 %a, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %div = lshr i32 %a, 2
  store i32 %div, i32* %p, align 4, !tbaa !0
  %add = shl nuw nsw i32 %div, 1
  %arrayidx1 = getelementptr inbounds i32* %p, i64 1
  store i32 %add, i32* %arrayidx1, align 4, !tbaa !0
  ret i32 0
}

The problem is not so much figuring out the 0 return value. The problem is that
the 'canonical form' produced by InstCombine is hiding trivial
arithmetic expressions like 'x+x' from scalar evolution.

I am not sure how best to fix this. If possible, InstCombine's
canonicalization shouldn't hide arithmetic progressions behind bit masks. At
least, it seems these transformations should be disabled unless (X >>
C).hasOneUse(). They aren't exactly optimizations.

This:

  %div = lshr i32 %a, 2
  store i32 %div, i32* %p, align 4, !tbaa !0
  %add = shl nuw nsw i32 %div, 1

is better than this:

  %div = lshr i32 %a, 2
  store i32 %div, i32* %p, align 4, !tbaa !0
  %0 = lshr i32 %a, 1
  %add = and i32 %0, 2147483646

/jakob

Chandler Carruth

2012-Apr-16 22:30 UTC

head link

[LLVMdev] InstCombine adds bit masks, confuses self, others

On Tue, Apr 17, 2012 at 12:23 AM, Jakob Stoklund Olesen <stoklund at
2pi.dk>wrote:
> I am not sure how best to fix this. If possible, InstCombine's
> canonicalization shouldn't hide arithmetic progressions behind bit
masks.

The entire concept of cleverly converting arithmetic to bit masks seems
like the perfect domain for DAGCombine instead of InstCombine:

1) We know the architecture, so we can make intelligent decisions about
what masks are cheap or expensive.
2) We know the addressing modes so we can fold arithmetic into them
3) There are no more high-level optimizations we're trying to enable: gvn,
scev, loop opts, other deep math optimizations have all already had their
shot at the code.

Does sinking these into the DAGCombine layer help? How much does it break?

In the past, LLVM has been incredibly sensitive to canonicalization
changes, so I'm worried, but I've also seen a *ton* of code similar to
what
you posted, and it simply makes no sense to me.

I wrote a pile of heroics in the addressing mode matching for x86 to
specifically work around these obnoxious representations.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120417/2d224fd3/attachment.html>

Jakob Stoklund Olesen

2012-Apr-16 23:06 UTC

head link

[LLVMdev] InstCombine adds bit masks, confuses self, others

On Apr 16, 2012, at 3:30 PM, Chandler Carruth <chandlerc at google.com>
wrote:
> On Tue, Apr 17, 2012 at 12:23 AM, Jakob Stoklund Olesen <stoklund at
2pi.dk> wrote:
> I am not sure how best to fix this. If possible, InstCombine's
canonicalization shouldn't hide arithmetic progressions behind bit masks.
> 
> The entire concept of cleverly converting arithmetic to bit masks seems
like the perfect domain for DAGCombine instead of InstCombine:
> 
> 1) We know the architecture, so we can make intelligent decisions about
what masks are cheap or expensive.
> 2) We know the addressing modes so we can fold arithmetic into them
> 3) There are no more high-level optimizations we're trying to enable:
gvn, scev, loop opts, other deep math optimizations have all already had their
shot at the code.
> 
> Does sinking these into the DAGCombine layer help? How much does it break?
I don't know what would break, but DAGCombine already has these tricks:

$ cat small2.c 
unsigned f(unsigned x) {
  return x >> 2 << 3 >> 2 << 5;
}

With the shift transforms disabled, we get:

define i32 @f(i32 %x) nounwind uwtable readnone ssp {
entry:
  %shr = lshr i32 %x, 2
  %shl = shl i32 %shr, 3
  %shr1 = lshr exact i32 %shl, 2
  %shl2 = shl i32 %shr1, 5
  ret i32 %shl2
}

But DAGCombine goes:

	shll	$4, %edi
	andl	$-64, %edi
	movl	%edi, %eax
	ret

And you are right, we only get the bit masks when it is worthwhile.

/jakob

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120416/add24235/attachment.html>

Rafael Espíndola

2012-Apr-17 11:36 UTC

head link

[LLVMdev] InstCombine adds bit masks, confuses self, others

> I am not sure how best to fix this. If possible, InstCombine's
canonicalization shouldn't hide arithmetic progressions behind bit masks. At
least, it seems these transformations should be disabled unless (X >>
C).hasOneUse(). They aren't exactly optimizations.
>
> This:
>
>  %div = lshr i32 %a, 2
>  store i32 %div, i32* %p, align 4, !tbaa !0
>  %add = shl nuw nsw i32 %div, 1
>
> is better than this:
>
>  %div = lshr i32 %a, 2
>  store i32 %div, i32* %p, align 4, !tbaa !0
>  %0 = lshr i32 %a, 1
>  %add = and i32 %0, 2147483646
I think we could try your hasOneUse idea. If we are going to keep
%div, we may as well keep using it and save one instruction in the
canonical form.
> /jakob
Cheers,
Rafael

Chandler Carruth

2012-Apr-17 13:09 UTC

head link

[LLVMdev] InstCombine adds bit masks, confuses self, others

On Tue, Apr 17, 2012 at 1:36 PM, Rafael Espíndola <
rafael.espindola at gmail.com> wrote:
> > I am not sure how best to fix this. If possible, InstCombine's
> canonicalization shouldn't hide arithmetic progressions behind bit
masks.
> At least, it seems these transformations should be disabled unless (X
>>
> C).hasOneUse(). They aren't exactly optimizations.
> >
> > This:
> >
> >  %div = lshr i32 %a, 2
> >  store i32 %div, i32* %p, align 4, !tbaa !0
> >  %add = shl nuw nsw i32 %div, 1
> >
> > is better than this:
> >
> >  %div = lshr i32 %a, 2
> >  store i32 %div, i32* %p, align 4, !tbaa !0
> >  %0 = lshr i32 %a, 1
> >  %add = and i32 %0, 2147483646
>
> I think we could try your hasOneUse idea. If we are going to keep
> %div, we may as well keep using it and save one instruction in the
> canonical form.

I really dislike hasOneUse-based "solutions" at the IR / InstCombine
layer.
They result in strange artifacts during optimization: places where adding
similar code turns off optimizations because we fold the similar bits
together and reuse parts of the computation.

I would much rather see us devise a reasonable set of canonicalization
rules at the IR level that don't block optimization. Tricks like saving
instructions, using bit masks, etc., should all be reserved for the codegen
layer / DAGCombine. There, hasOneUse makes perfect sense because you're
trying to peephole through architectural gymnastics, not trying to get
systematic scalar optimizations to fire.

Let's see if we have any evidence that reserving this transform for the
DAGCombiner hurts things, how it hurts things, and what we can do about it
first?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120417/0b5e205d/attachment.html>

Jakob Stoklund Olesen

2012-Apr-17 22:22 UTC

head link

[LLVMdev] InstCombine adds bit masks, confuses self, others

On Apr 16, 2012, at 3:30 PM, Chandler Carruth <chandlerc at google.com>
wrote:
> Does sinking these into the DAGCombine layer help? How much does it break?
I tried disabling just the InstCombine transforms that hide shl instructions
behind bitmasks. Even though DAGCombine has the same transforms, it causes some
pretty bad regressions:

External/SPEC/CINT95/147_vortex/147_vortex                                  
0.294   0.322   +9.6%  +40mB
MultiSource/Benchmarks/Olden/tsp/tsp                                        
0.680   0.748   +9.9%  +41mB
SingleSource/Benchmarks/Shootout-C++/except                                 
0.116   0.128  +10.8%  +45mB
SingleSource/Benchmarks/Shootout/strcat                                     
0.102   0.113  +11.1%  +46mB
SingleSource/Benchmarks/Shootout-C++/hash                                   
0.455   0.507  +11.4%  +47mB
External/Povray/povray                                                      
2.015   2.246  +11.5%  +47mB
External/SPEC/CINT2000/255_vortex/255_vortex                                
1.814   2.044  +12.7%  +52mB
SingleSource/Benchmarks/Shootout-C++/heapsort                               
1.871   2.132  +13.9%  +57mB
SingleSource/Benchmarks/Shootout-C++/ary3                                   
1.087   1.264  +16.3%  +65mB
MultiSource/Benchmarks/SciMark2-C/scimark2                                 
27.491  23.596  -14.2%  -66mB
MultiSource/Benchmarks/Olden/bisort/bisort                                  
0.360   0.428  +19.0%  +75mB
MultiSource/Benchmarks/Olden/bh/bh                                          
1.074   1.287  +19.9%  +79mB

(Running on Sandy Bridge, x86-64)

I'll try to figure out why.

/jakob

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20120417/c7c5a56a/attachment.html>

Apparently Analagous Threads

Search for more reasonably related threads

llvm dev - Apr 2012 - [LLVMdev] InstCombine adds bit masks, confuses self, others

[LLVMdev] InstCombine adds bit masks, confuses self, others

[LLVMdev] InstCombine adds bit masks, confuses self, others

[LLVMdev] InstCombine adds bit masks, confuses self, others

[LLVMdev] InstCombine adds bit masks, confuses self, others

[LLVMdev] InstCombine adds bit masks, confuses self, others

[LLVMdev] InstCombine adds bit masks, confuses self, others

Apparently Analagous Threads