Hi,

Look at the following C code sequence:

unsigned char mainGtU ( unsigned int i1,
                        unsigned int i2,
                        unsigned char* block )
{
   unsigned char c1, c2;

   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;

   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;

   ..
   ..
   <repeats 12 times>

In LLVM IR it will be the following:

define i8 @mainGtU(i32 %i1, i32 %i2, i8* readonly %block, i16* nocapture readnone %quadrant, i32 %nblock, i32* nocapture readnone %budget) local_unnamed_addr #0 {
entry:
  %idxprom = zext i32 %i1 to i64
  %arrayidx = getelementptr inbounds i8, i8* %block, i64 %idxprom
  %0 = load i8, i8* %arrayidx, align 1
  %idxprom1 = zext i32 %i2 to i64
  %arrayidx2 = getelementptr inbounds i8, i8* %block, i64 %idxprom1
  %1 = load i8, i8* %arrayidx2, align 1
  %cmp = icmp eq i8 %0, %1
  br i1 %cmp, label %if.end, label %if.then

if.then:                                          ; preds = %entry
  %cmp7 = icmp ugt i8 %0, %1
  br label %return

if.end:                                           ; preds = %entry
  %inc = add i32 %i1, 1
  %inc10 = add i32 %i2, 1
  %idxprom11 = zext i32 %inc to i64
  %arrayidx12 = getelementptr inbounds i8, i8* %block, i64 %idxprom11
  %2 = load i8, i8* %arrayidx12, align 1
  %idxprom13 = zext i32 %inc10 to i64
  %arrayidx14 = getelementptr inbounds i8, i8* %block, i64 %idxprom13
  %3 = load i8, i8* %arrayidx14, align 1
  %cmp17 = icmp eq i8 %2, %3
  br i1 %cmp17, label %if.end25, label %if.then19

if.then19:                                        ; preds = %if.end
  %cmp22 = icmp ugt i8 %2, %3
  br label %return

  ..
  ..
  <repeats 12 times>

This code sequence can be collapsed into a call to memcmp, and we can get rid of the basic blocks. I have written a small peephole optimization that identifies this sequence of instructions (branch terminator, compare, load, GEP, etc.) and converts it into a call to memcmp. This small pass gave me an improvement of 67% on SPEC2000 bzip2 on X86.

Is there a better idea, other than a small peephole pass on IR, to optimize this code?

Sirish
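For reference, the collapsed form the pass is aiming for can be sketched in plain C. This is a hypothetical hand rewrite, not the actual pass output: the name mainGtU_memcmp and the fixed length 12 are assumptions taken from the "<repeats 12 times>" note above, and the fall-through return is a placeholder for the truncated tail of the original function.

```c
#include <string.h>

/* Hypothetical collapsed version of the 12 unrolled byte comparisons.
 * memcmp compares bytes as unsigned char, so the sign of its result
 * tells us whether the first differing byte satisfies c1 > c2. */
unsigned char mainGtU_memcmp(unsigned int i1, unsigned int i2,
                             unsigned char *block)
{
    int r = memcmp(block + i1, block + i2, 12);
    if (r != 0)
        return r > 0;   /* same as (c1 > c2) at the first mismatch */
    return 0;           /* all 12 bytes equal; the original source
                           continues past this point (truncated above) */
}
```

The transformation is only valid because both the unrolled code and memcmp stop deciding the result at the first differing byte, and the extra loads memcmp may perform stay within the same 12-byte ranges the original already reads.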
On Fri, May 19, 2017 at 12:46 PM, Sirish Pande via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> This code sequence can be collapsed into a call to memcmp, and we can
> get rid of the basic blocks. [...]
>
> Is there a better idea, other than a small peephole pass on IR, to
> optimize this code?

There is LoopIdiomRecognize, which does transformations like this, but only for loops, not unrolled code like your example.

It would be very cool if we could somehow make that pass also recognize unrolled patterns, both for memcmp and other operations.

I don't have any specific ideas for how to do that, but the improvement you saw suggests it might be very worthwhile :-)
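For concreteness, the rerolled loop shape that a loop-based idiom recognizer could in principle match looks like the sketch below. The function name and the trip count of 12 are assumptions taken from the example above, and the final return stands in for the truncated fall-through path.

```c
/* Loop form of the 12 unrolled compare-and-branch steps.  A pass in
 * the spirit of LoopIdiomRecognize could match this loop shape and
 * replace its body with a single memcmp over 12 bytes. */
unsigned char mainGtU_loop(unsigned int i1, unsigned int i2,
                           unsigned char *block)
{
    for (int k = 0; k < 12; k++) {
        unsigned char c1 = block[i1 + k];
        unsigned char c2 = block[i2 + k];
        if (c1 != c2)
            return c1 > c2;   /* first mismatch decides the result */
    }
    return 0;  /* all 12 bytes equal; the original falls through here */
}
```

The early-exit branch inside the loop is exactly the control flow that makes this harder to match than the store-only idioms (memset/memcpy) the existing pass handles.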
Thanks Hans.

I did look at LoopIdiomRecognize and also loop rerolling - but like you said, they only work on loops, and having control flow makes it a bit complicated.

I have test cases that help on X86 and AArch64. I will put up my patch for review and you can have a look.

Another thing I noticed is that there is no __builtin_memcmp in llvm. Maybe that would be good to have as well.

Sirish

On Fri, May 19, 2017 at 3:06 PM, Hans Wennborg <hans at chromium.org> wrote:
> There is LoopIdiomRecognize which does transformations like this, but
> only for loops, not unrolled code like your example.
>
> It would be very cool if we could somehow make that pass also
> recognize unrolled patterns, both for memcmp, and other operations.
>
> I don't have any specific ideas for how to do that, but the
> improvement you saw suggests it might be very worthwhile :-)
I think you'd have to write some idiom recognizer here. The ones we have won't do it.

I guess my other question would be how commonly this happens. If it's common and matters a lot, awesome. I wouldn't do it just to fix SPEC :P

(People who care about bzip2 speed probably use one of the faster bzip2 impls :P)

On Fri, May 19, 2017 at 12:46 PM, Sirish Pande via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> This code sequence can be collapsed into a call to memcmp, and we can
> get rid of the basic blocks. [...]
>
> Is there a better idea, other than a small peephole pass on IR, to
> optimize this code?
On Fri, May 19, 2017 at 4:32 PM, Daniel Berlin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> I guess my other question would be how commonly this happens. If it's
> common and matters a lot, awesome. I wouldn't do it just to fix SPEC :P
>
> (People who care about bzip2 speed probably use one of the faster
> bzip2 impls :P)

Danny, do you know of forks of bzip2 that are more efficient than bzip.org's? I haven't seen one.

Sirish is going to send a patch to Julian Seward and try to get the change into a new release of bzip2, and from there we may need to ask the AOSP folks to update bzip2. AOSP has the last released bzip2, from Sept 2010.

For SPEC benchmarketing, doing the right thing wouldn't help. Maybe we should put this idiom and others under a -fspec flag ;-)

Sebastian