Chandler Carruth
2012-Apr-17 13:09 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
On Tue, Apr 17, 2012 at 1:36 PM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:

> > I am not sure how best to fix this. If possible, InstCombine's
> > canonicalization shouldn't hide arithmetic progressions behind bit
> > masks. At least, it seems these transformations should be disabled
> > unless (X >> C).hasOneUse(). They aren't exactly optimizations.
> >
> > This:
> >
> >   %div = lshr i32 %a, 2
> >   store i32 %div, i32* %p, align 4, !tbaa !0
> >   %add = shl nuw nsw i32 %div, 1
> >
> > is better than this:
> >
> >   %div = lshr i32 %a, 2
> >   store i32 %div, i32* %p, align 4, !tbaa !0
> >   %0 = lshr i32 %a, 1
> >   %add = and i32 %0, 2147483646
>
> I think we could try your hasOneUse idea. If we are going to keep
> %div, we may as well keep using it and save one instruction in the
> canonical form.

I really dislike hasOneUse-based "solutions" at the IR / InstCombine
layer. They result in strange artifacts during optimization: places
where adding similar code turns off optimizations because we fold the
similar bits together and reuse parts of the computation.

I would much rather see us devise a reasonable set of canonicalization
rules at the IR level that don't block optimization. Tricks like saving
instructions, using bit masks, etc. should all be reserved for the
codegen layer / DAGCombine. There, hasOneUse makes perfect sense,
because you're trying to peephole through architectural gymnastics, not
trying to get systematic scalar optimizations to fire.

Let's first see whether we have any evidence that reserving this
transform for the DAGCombiner hurts anything, how it hurts, and what we
can do about it.
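For concreteness, the two IR snippets quoted above compute the same
value: the transform rewrites (a >> 2) << 1 into (a >> 1) & 2147483646
(0x7FFFFFFE). A minimal C sketch of that identity, with helper names
that are illustrative rather than taken from the thread:

#include <assert.h>
#include <stdint.h>

/* shift_form mirrors "%add = shl nuw nsw i32 %div, 1" applied to
 * "%div = lshr i32 %a, 2"; mask_form mirrors the canonicalized
 * "%add = and i32 %0, 2147483646" applied to "%0 = lshr i32 %a, 1". */
static uint32_t shift_form(uint32_t a) { return (a >> 2) << 1; }
static uint32_t mask_form(uint32_t a)  { return (a >> 1) & 2147483646u; }

int main(void) {
  /* Spot-check the identity on a few values, including edge cases. */
  uint32_t tests[] = { 0u, 1u, 2u, 3u, 4u, 0x7fffffffu, 0xffffffffu };
  for (unsigned i = 0; i != sizeof tests / sizeof tests[0]; ++i)
    assert(shift_form(tests[i]) == mask_form(tests[i]));
  return 0;
}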
Rafael Espíndola
2012-Apr-17 13:35 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
> I really dislike hasOneUse-based "solutions" at the IR / InstCombine
> layer. They result in strange artifacts during optimization: places
> where adding similar code turns off optimizations because we fold the
> similar bits together and reuse parts of the computation.
>
> I would much rather see us devise a reasonable set of canonicalization
> rules at the IR level that don't block optimization. Tricks like
> saving instructions, using bit masks, etc. should all be reserved for
> the codegen layer / DAGCombine. There, hasOneUse makes perfect sense,
> because you're trying to peephole through architectural gymnastics,
> not trying to get systematic scalar optimizations to fire.
>
> Let's first see whether we have any evidence that reserving this
> transform for the DAGCombiner hurts anything, how it hurts, and what
> we can do about it.

<day dreaming>

In general I think I agree. Putting the subexpression in an inline
function, for example, would cause the hasOneUse hack to fail, since
the multiple uses would only show up after inlining. To be fully
general, we would have to be able to reverse any transformation we
refused to do because a value had more than one use.

I also agree that the pipeline should first make the IL canonical and
then decide the best way to compute the canonical form. The one thing I
find unfortunate is that in LLVM the second step lives almost entirely
at the DAG and MI levels. There is no point where we have IL passes
that know more about the target.

</day dreaming>

Looking a bit more at this particular case, I think the problem is not
so much the representation we have chosen (with the masks), or that the
DAG combiner is better at handling it. It is just that the DAG combiner
is a second pass. Running the example through opt again produces

define i32 @f(i32 %a, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %div = lshr i32 %a, 2
  store i32 %div, i32* %p, align 4
  %0 = lshr i32 %a, 1
  %add = and i32 %0, 2147483646
  %arrayidx1 = getelementptr inbounds i32* %p, i64 1
  store i32 %add, i32* %arrayidx1, align 4
  ret i32 0
}

I realize this is orthogonal to whether using the mask is a good idea,
since by some metric the code could be even simpler if it reused %div,
but it does show that the IL optimizers stop while there is more they
could do.

OK. In the end I think I agree: we should try it out and decide on the
canonical representation first.

Cheers,
Rafael
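Rafael's inline-function point can be made concrete with a small
hypothetical sketch (the function names below are mine, not from the
thread). Inside div4, the shift has exactly one use, so a
hasOneUse-gated InstCombine would still apply the mask rewrite there;
the second use only materializes after the inliner and GVN have run:

/* A hasOneUse() guard inspects (a >> 2) inside div4, where it has a
 * single use, so the shift+mask rewrite would still fire. Only after
 * div4 is inlined into f and GVN merges the two copies does the shift
 * acquire a second use -- too late for the guard to help. */
static inline unsigned div4(unsigned a) { return a >> 2; }

unsigned f(unsigned a, unsigned *p) {
  *p = div4(a);        /* first use of a >> 2, in its own body */
  return div4(a) << 1; /* second use emerges only after inlining */
}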
Jakob Stoklund Olesen
2012-Apr-17 17:40 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
On Apr 17, 2012, at 6:09 AM, Chandler Carruth <chandlerc at google.com> wrote:

> I really dislike hasOneUse-based "solutions" at the IR / InstCombine
> layer. They result in strange artifacts during optimization: places
> where adding similar code turns off optimizations because we fold the
> similar bits together and reuse parts of the computation.

In my example, it won't even work. The second use of 'x+x' only emerges
after GVN, after the harm has been done.

However, I am more concerned about hiding things from scalar evolution
than from InstCombine itself. Here is an example I was expecting to
fail:

$ cat small2.c
struct desc {
  unsigned size : 2;
  unsigned skip : 6;
  unsigned value;
};

void f(struct desc d, unsigned *p) {
  for (unsigned i = 0; i != 64; ++i) {
    *p = 0;
    p += d.skip;
    *p = 1;
    p += d.skip;
  }
}

It does the right thing, though:

define void @f(i64 %d.coerce, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %0 = lshr i64 %d.coerce, 2
  %bf.clear = and i64 %0, 63
  %add.ptr.sum = shl nuw nsw i64 %bf.clear, 1
  br label %for.body

for.body:                                 ; preds = %entry, %for.body
  %p.addr.06 = phi i32* [ %p, %entry ], [ %add.ptr3, %for.body ]
  %i.05 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  store i32 0, i32* %p.addr.06, align 4, !tbaa !0
  %add.ptr = getelementptr inbounds i32* %p.addr.06, i64 %bf.clear
  store i32 1, i32* %add.ptr, align 4, !tbaa !0
  %add.ptr3 = getelementptr inbounds i32* %p.addr.06, i64 %add.ptr.sum
  %inc = add i32 %i.05, 1
  %cmp = icmp eq i32 %inc, 64
  br i1 %cmp, label %for.end, label %for.body

for.end:                                  ; preds = %for.body
  ret void
}

Notice how the two gep offsets %bf.clear and %add.ptr.sum have a simple
arithmetic relationship that scalar evolution can figure out.

With a very small change, that breaks:

$ cat small3.c
void f(unsigned long d, unsigned *p) {
  unsigned long skip = d/4;
  for (unsigned i = 0; i != 64; ++i) {
    *p = 0;
    p += skip;
    *p = 1;
    p += skip;
  }
}

define void @f(i64 %d, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %div = lshr i64 %d, 2
  %0 = lshr i64 %d, 1
  %add.ptr.sum = and i64 %0, 9223372036854775806
  br label %for.body

for.body:                                 ; preds = %entry, %for.body
  %i.03 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %p.addr.02 = phi i32* [ %p, %entry ], [ %add.ptr1, %for.body ]
  store i32 0, i32* %p.addr.02, align 4, !tbaa !0
  %add.ptr = getelementptr inbounds i32* %p.addr.02, i64 %div
  store i32 1, i32* %add.ptr, align 4, !tbaa !0
  %add.ptr1 = getelementptr inbounds i32* %p.addr.02, i64 %add.ptr.sum
  %inc = add i32 %i.03, 1
  %cmp = icmp eq i32 %inc, 64
  br i1 %cmp, label %for.end, label %for.body

for.end:                                  ; preds = %for.body
  ret void
}

Now it suddenly becomes very difficult for scalar evolution to figure
out the relationship between %div and %add.ptr.sum.

I think the canonical form should look more like the bit field example -
don't hide the shl behind a bit mask.

I'll try deferring the transforms that hide an shl instruction to
DAGCombine. I'll run some benchmarks.

/jakob
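The relationship that gets obscured in the small3.c output can be
spot-checked directly: the mask constant 9223372036854775806 is
0x7FFFFFFFFFFFFFFE, and (d >> 1) & 0x7FFFFFFFFFFFFFFE is arithmetically
just 2 * (d >> 2). A small C sketch (not from the original mail)
verifying this:

#include <assert.h>
#include <stdint.h>

int main(void) {
  /* %div         = lshr i64 %d, 2                 -> d >> 2
   * %add.ptr.sum = and i64 (lshr i64 %d, 1),
   *                9223372036854775806            -> (d >> 1) masked
   * The masked form is exactly twice %div, but the factor of two is
   * no longer a visible shl. */
  uint64_t tests[] = { 0u, 1u, 2u, 3u, 0x123456789abcdef0ull,
                       UINT64_MAX };
  for (unsigned i = 0; i != sizeof tests / sizeof tests[0]; ++i) {
    uint64_t d = tests[i];
    assert(((d >> 1) & 0x7ffffffffffffffeull) == 2 * (d >> 2));
  }
  return 0;
}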
Jakob Stoklund Olesen
2012-Apr-17 17:57 UTC
[LLVMdev] InstCombine adds bit masks, confuses self, others
On Apr 17, 2012, at 6:35 AM, Rafael Espíndola <rafael.espindola at gmail.com> wrote:

> I also agree that the pipeline should first make the IL canonical and
> then decide the best way to compute the canonical form. The one thing
> I find unfortunate is that in LLVM the second step lives almost
> entirely at the DAG and MI levels. There is no point where we have IL
> passes that know more about the target.

Note that adding the bit masks does make sense as a canonical form; it
is not just a premature optimization. Expressions like

  x >> 2 << 3 >> 5 << 6

get canonicalized to a single shift plus a mask that way.

I think it would make sense to leave the last shl alone, so the above
expression gets canonicalized to:

  ((x >> 4) & mask) << 6

DAGCombine can reduce it completely if it wants to.

This also matches the code generated for the bit field access in my
small2.c example:

entry:
  %and = lshr i64 %d, 2
  %div = and i64 %and, 63
  %add.ptr.sum = shl nuw nsw i64 %div, 1

/jakob
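The shift chain can be worked through bit by bit: for a 64-bit x, bits
4..61 of x survive the chain and land at positions 6..63, so the whole
expression equals (x << 2) & ~63, and the "leave the last shl alone"
form holds with mask = (1 << 58) - 1. The concrete mask constants below
are derived here, not taken from the mail; a sketch checking all three
forms:

#include <assert.h>
#include <stdint.h>

/* chained:  the original x >> 2 << 3 >> 5 << 6 (net shift left by 2).
 * one_mask: the single-shift-plus-mask canonical form.
 * keep_shl: Jakob's proposed ((x >> 4) & mask) << 6 with the last shl
 *           left visible for scalar evolution. */
static uint64_t chained(uint64_t x)  { return (((x >> 2) << 3) >> 5) << 6; }
static uint64_t one_mask(uint64_t x) { return (x << 2) & ~63ull; }
static uint64_t keep_shl(uint64_t x) {
  return ((x >> 4) & ((1ull << 58) - 1)) << 6;
}

int main(void) {
  uint64_t tests[] = { 0u, 1u, 0xf0u, 0x123456789abcdef0ull,
                       UINT64_MAX };
  for (unsigned i = 0; i != sizeof tests / sizeof tests[0]; ++i) {
    assert(chained(tests[i]) == one_mask(tests[i]));
    assert(chained(tests[i]) == keep_shl(tests[i]));
  }
  return 0;
}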