Hi,

Look at the following C code sequence:

unsigned char mainGtU ( unsigned int i1,
                        unsigned int i2,
                        unsigned char* block )
{
   unsigned char c1, c2;

   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;

   c1 = block[i1]; c2 = block[i2];
   if (c1 != c2) return (c1 > c2);
   i1++; i2++;

   ..
   ..
   <repeats 12 times>

In LLVM IR it will be the following:

define i8 @mainGtU(i32 %i1, i32 %i2, i8* readonly %block, i16* nocapture readnone %quadrant, i32 %nblock, i32* nocapture readnone %budget) local_unnamed_addr #0 {
entry:
  %idxprom = zext i32 %i1 to i64
  %arrayidx = getelementptr inbounds i8, i8* %block, i64 %idxprom
  %0 = load i8, i8* %arrayidx, align 1
  %idxprom1 = zext i32 %i2 to i64
  %arrayidx2 = getelementptr inbounds i8, i8* %block, i64 %idxprom1
  %1 = load i8, i8* %arrayidx2, align 1
  %cmp = icmp eq i8 %0, %1
  br i1 %cmp, label %if.end, label %if.then

if.then:                                          ; preds = %entry
  %cmp7 = icmp ugt i8 %0, %1
  br label %return

if.end:                                           ; preds = %entry
  %inc = add i32 %i1, 1
  %inc10 = add i32 %i2, 1
  %idxprom11 = zext i32 %inc to i64
  %arrayidx12 = getelementptr inbounds i8, i8* %block, i64 %idxprom11
  %2 = load i8, i8* %arrayidx12, align 1
  %idxprom13 = zext i32 %inc10 to i64
  %arrayidx14 = getelementptr inbounds i8, i8* %block, i64 %idxprom13
  %3 = load i8, i8* %arrayidx14, align 1
  %cmp17 = icmp eq i8 %2, %3
  br i1 %cmp17, label %if.end25, label %if.then19

if.then19:                                        ; preds = %if.end
  %cmp22 = icmp ugt i8 %2, %3
  br label %return

  ..
  ..
  <repeats 12 times>

This code sequence can be collapsed into a call to memcmp, and we can get rid of the basic blocks. I have written a small peephole optimization that identifies this sequence of instructions (branch terminator, compare, load, GEP, etc.) and converts it into a call to memcmp. This small pass gave me an improvement of 67% on SPEC2000 bzip2 on X86.

Is there a better idea, other than a small peephole pass on IR, to optimize this code?

Sirish
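For reference, the collapsed form the pass is aiming for can be sketched in plain C. This is a hypothetical hand rewrite, not the actual pass output: the name mainGtU_memcmp and the fixed length 12 are assumptions taken from the "<repeats 12 times>" note above, and the fall-through return is a placeholder for the truncated tail of the original function.

```c
#include <string.h>

/* Hypothetical collapsed version of the 12 unrolled byte comparisons.
 * memcmp compares bytes as unsigned char, so the sign of its result
 * tells us whether the first differing byte satisfies c1 > c2. */
unsigned char mainGtU_memcmp(unsigned int i1, unsigned int i2,
                             unsigned char *block)
{
    int r = memcmp(block + i1, block + i2, 12);
    if (r != 0)
        return r > 0;   /* same as (c1 > c2) at the first mismatch */
    return 0;           /* all 12 bytes equal; the original source
                           continues past this point (truncated above) */
}
```

The transformation is only valid because both the unrolled code and memcmp stop deciding the result at the first differing byte, and the extra loads memcmp may perform stay within the same 12-byte ranges the original already reads.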
On Fri, May 19, 2017 at 12:46 PM, Sirish Pande via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> This code sequence can be collapsed into a call to memcmp, and we can
> get rid of the basic blocks. [...]
>
> Is there a better idea, other than a small peephole pass on IR, to
> optimize this code?

There is LoopIdiomRecognize, which does transformations like this, but only for loops, not unrolled code like your example.

It would be very cool if we could somehow make that pass also recognize unrolled patterns, both for memcmp and other operations.

I don't have any specific ideas for how to do that, but the improvement you saw suggests it might be very worthwhile :-)
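For concreteness, the rerolled loop shape that a loop-based idiom recognizer could in principle match looks like the sketch below. The function name and the trip count of 12 are assumptions taken from the example above, and the final return stands in for the truncated fall-through path.

```c
/* Loop form of the 12 unrolled compare-and-branch steps.  A pass in
 * the spirit of LoopIdiomRecognize could match this loop shape and
 * replace its body with a single memcmp over 12 bytes. */
unsigned char mainGtU_loop(unsigned int i1, unsigned int i2,
                           unsigned char *block)
{
    for (int k = 0; k < 12; k++) {
        unsigned char c1 = block[i1 + k];
        unsigned char c2 = block[i2 + k];
        if (c1 != c2)
            return c1 > c2;   /* first mismatch decides the result */
    }
    return 0;  /* all 12 bytes equal; the original falls through here */
}
```

The early-exit branch inside the loop is exactly the control flow that makes this harder to match than the store-only idioms (memset/memcpy) the existing pass handles.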
Thanks Hans.

I did look at LoopIdiomRecognize and also loop rerolling - but like you said, they only work on loops, and having control flow makes it a bit complicated.

I have test cases that help on X86 and AArch64. I will put up my patch for review and you can have a look.

Another thing I noticed is that there is no __builtin_memcmp in llvm. Maybe that would be good to have as well.

Sirish

On Fri, May 19, 2017 at 3:06 PM, Hans Wennborg <hans at chromium.org> wrote:
> There is LoopIdiomRecognize which does transformations like this, but
> only for loops, not unrolled code like your example.
>
> It would be very cool if we could somehow make that pass also
> recognize unrolled patterns, both for memcmp, and other operations.
>
> I don't have any specific ideas for how to do that, but the
> improvement you saw suggests it might be very worthwhile :-)
I think you'd have to write some idiom recognizer here. The ones we have won't do it.

I guess my other question would be how commonly this happens. If it's common and matters a lot, awesome. I wouldn't do it just to fix SPEC :P

(People who care about bzip2 speed probably use one of the faster bzip2 impls :P)

On Fri, May 19, 2017 at 12:46 PM, Sirish Pande via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> This code sequence can be collapsed into a call to memcmp, and we can
> get rid of the basic blocks. [...]
>
> Is there a better idea, other than a small peephole pass on IR, to
> optimize this code?
On Fri, May 19, 2017 at 4:32 PM, Daniel Berlin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> I guess my other question would be how commonly this happens. If it's
> common and matters a lot, awesome. I wouldn't do it just to fix SPEC :P
>
> (People who care about bzip2 speed probably use one of the faster
> bzip2 impls :P)

Danny, do you know of forks of bzip2 that are more efficient than bzip.org's? I haven't seen one.

Sirish is going to send a patch to Julian Seward and try to get the change into a new release of bzip2, and from there we may need to ask the AOSP folks to update bzip2. AOSP has the last released bzip2, from Sept 2010.

For SPEC benchmarketing, doing the right thing wouldn't help. Maybe we should put this idiom and others under a -fspec flag ;-)

Sebastian