suyog sarda
2014-Aug-08 20:34 UTC
[LLVMdev] Efficient Pattern matching in Instruction Combine
Hi Duncan, David, Sean. Thanks for your reply.

> It'd be interesting if you could find a design that also treated these
> the same:
>
> (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
> (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
> (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
>
> I.e., `^` is also associative.

Agree with Duncan on including associative operations too.

> Can we handle this by just having a canonical ordering? Or is that too
> difficult to maintain through various instcombines?

Yes, it's the easiest way to do that. If I am not wrong, what Sean is
suggesting is that if we get something like

(B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C

and we have written a pass for the pattern

(A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C

we could just swap 'B' and 'A' before matching the pattern, i.e. we do a
canonical re-ordering. (Sean, correct me if I am wrong on my understanding
of what you meant by canonical ordering.)

I have seen in the "instcombine" module that std::swap is often used for
exactly this. But it becomes very repetitive and somewhat irritating to
keep doing it for every pattern. Shouldn't we do this in a more generic
way, where we do not have to worry about swapping operands? (How to do it,
I don't know for now. I looked at the matchers implementation for
modification/improvement, but didn't find any clue in it.)

> It seems to me that we could rejigger Reassociate to reduce the number of
> possibilities that InstCombine could expect.

Agree with David that in fact it should be the reassociate pass which
handles this. But the following is something interesting and rather
worrying that I have found (I have omitted unimportant code and highlighted
the important parts in the example below):

e.g. ((~A & B) | A) --> (A | B)   ; code is implemented for this already

C code:

suyog at suyog-Inspiron-N5010:~$ cat 1.c
#include<stdio.h>
int cal(int a, int b) {
    return ((~a & b) | a);
}

int main(){
    int a, b;
    scanf("%d %d", &a, &b);
    return cal(a,b);
}

LLVM IR at O0:

suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0 -emit-llvm 1.c

; Function Attrs: nounwind
define i32 @cal(i32 %a, i32 %b) #0 {
  %1 = xor i32 %a, -1
  %2 = and i32 %1, %b
  %3 = or i32 %2, %a
  ret i32 %3
}

; Function Attrs: nounwind
define i32 @main() #0 {
..
..
  %4 = call i32 @cal(i32 %2, i32 %3)
  ret i32 %4
}

LLVM IR at instcombine:

suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -instcombine 1.ll

; Function Attrs: nounwind
define i32 @cal(i32 %a, i32 %b) #0 {
  %1 = or i32 %a, %b
  ret i32 %1
}
..
..

which is OK as per the expected transform.

LLVM IR at reassociate and instcombine:

suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -reassociate -instcombine 2.ll

; Function Attrs: nounwind
define i32 @cal(i32 %a, i32 %b) #0 {
  %1 = xor i32 %a, -1
  %2 = and i32 %b, %1
  %3 = or i32 %2, %a
  ret i32 %3
}
..
..

Reassociate converted (~a & b) to (b & ~a) and instruction combine failed
to optimize. As evident, sometimes the reassociate pass actually harms the
optimization opportunity!
Some more interesting findings on how this can affect the final machine
code.

Test case 1.c:

#include<stdio.h>
int cal(int a, int b) {
    return ((~a & b) | a);
}

int main(){
    int a, b;
    scanf("%d %d", &a, &b);
    return cal(a,b);
}

X86 .s file with clang at O2 for the above program:

suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c

main:                                   # @main
# BB#0:
        subl    $28, %esp
        leal    20(%esp), %eax
        movl    %eax, 8(%esp)
        leal    24(%esp), %eax
        movl    %eax, 4(%esp)
        movl    $.L.str, (%esp)
        calll   __isoc99_scanf
        movl    20(%esp), %eax
        orl     24(%esp), %eax
        addl    $28, %esp
        retl

As seen, the optimization happened at the IR level itself and is reflected
in the .s file.

GCC output for the same:

suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c

main:
.LFB23:
        .cfi_startproc
        pushl   %ebp
        .cfi_def_cfa_offset 8
        .cfi_offset 5, -8
        movl    %esp, %ebp
        .cfi_def_cfa_register 5
        andl    $-16, %esp
        subl    $32, %esp
        leal    28(%esp), %eax
        movl    %eax, 8(%esp)
        leal    24(%esp), %eax
        movl    %eax, 4(%esp)
        movl    $.LC0, (%esp)
        call    __isoc99_scanf
        movl    24(%esp), %eax
        orl     28(%esp), %eax
        leave
        .cfi_restore 5
        .cfi_def_cfa 4, 4
        ret
        .cfi_endproc

GCC also did the optimization.

Now we just slightly flip the test case:

1.c test case:

#include<stdio.h>
int cal(int a, int b) {
    return ((b & ~a) | a);
}

int main(){
    int a, b;
    scanf("%d %d", &a, &b);
    return cal(a,b);
}

X86 .s file with clang at O2 for the above program:

suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c

main:                                   # @main
# BB#0:
        subl    $28, %esp
        leal    20(%esp), %eax
        movl    %eax, 8(%esp)
        leal    24(%esp), %eax
        movl    %eax, 4(%esp)
        movl    $.L.str, (%esp)
        calll   __isoc99_scanf
        movl    24(%esp), %ecx
        movl    %ecx, %eax
        notl    %eax
        andl    20(%esp), %eax
        orl     %ecx, %eax
        addl    $28, %esp
        retl

We lost the optimization opportunity here!

GCC output for the same:

suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c

main:
.LFB23:
        .cfi_startproc
        pushl   %ebp
        .cfi_def_cfa_offset 8
        .cfi_offset 5, -8
        movl    %esp, %ebp
        .cfi_def_cfa_register 5
        andl    $-16, %esp
        subl    $32, %esp
        leal    28(%esp), %eax
        movl    %eax, 8(%esp)
        leal    24(%esp), %eax
        movl    %eax, 4(%esp)
        movl    $.LC0, (%esp)
        call    __isoc99_scanf
        movl    24(%esp), %eax
        orl     28(%esp), %eax
        leave
        .cfi_restore 5
        .cfi_def_cfa 4, 4
        ret
        .cfi_endproc

GCC still optimized it!

It is clearly evident that LLVM is losing out because of the naive approach
to pattern matching. How can we improve this so that LLVM also recognizes
commutative/associative patterns efficiently with a single 'matchers' call?
I am not having any idea so far. I will think more on it. Any help would be
really helpful.

Any suggestions/comments/rectifications are most awaited :)

--
With regards,
Suyog Sarda
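A side note on the std::swap repetition discussed above: below is a minimal
sketch of how the operand-order handling could be folded into a single
helper, assuming the matchers from llvm/IR/PatternMatch.h. matchNotAndOr is
a made-up name for illustration, not an existing LLVM API, and the exact
plumbing may differ between LLVM versions.

// Sketch only: binds A and B if V has the form ((~A & B) | A), accepting
// either operand order of the 'and' and of the 'or', so the caller does
// not need a per-pattern std::swap.
#include "llvm/IR/PatternMatch.h"
using namespace llvm;
using namespace llvm::PatternMatch;

static bool matchNotAndOr(Value *V, Value *&A, Value *&B) {
  Value *L, *R;
  if (!match(V, m_Or(m_Value(L), m_Value(R))))
    return false;
  // Try both assignments of the 'or' operands: one side is A, the other
  // must be (~A & B) with the 'and' operands in either order.
  for (int i = 0; i < 2; ++i) {
    A = i ? R : L;
    Value *AndSide = i ? L : R;
    if (match(AndSide, m_And(m_Not(m_Specific(A)), m_Value(B))) ||
        match(AndSide, m_And(m_Value(B), m_Not(m_Specific(A)))))
      return true;
  }
  return false;
}

With something along these lines, both ((~a & b) | a) and ((b & ~a) | a)
would reach the same (A | B) replacement regardless of how reassociate
ordered the operands.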
Sean Silva
2014-Aug-11 18:19 UTC
[LLVMdev] Efficient Pattern matching in Instruction Combine
On Fri, Aug 8, 2014 at 1:34 PM, suyog sarda <sardask01 at gmail.com> wrote:

> [...]
>
> we could just swap 'B' and 'A' before matching the pattern, i.e. we do a
> canonical re-ordering. (Sean, correct me if I am wrong on my
> understanding of what you meant by canonical ordering.)

You understood correctly.

> [...]
>
> Any suggestions/comments/rectifications are most awaited :)

I'm actually sort of wondering if instead of hardcoding all these bitwise
simplifications, we can just use a general boolean minimization algorithm.

Is there a reason we don't? Maybe it's just compile time? But eventually
(maybe already?) adding enough special case combines will be slower than
using the general optimization.

-- Sean
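To make the "general boolean minimization" idea a bit more concrete, here
is a tiny standalone C++ sketch (not instcombine code; the helper name and
its bit encoding are just for illustration) showing that the two source
forms above denote the same boolean function, so any rewrite keyed on the
truth table rather than on the syntactic shape treats them identically.

#include <cstdio>

// Packs f(a, b) over the 4 assignments of (a, b) into a 4-bit truth table.
template <typename F> unsigned truthTable2(F f) {
  unsigned tt = 0;
  for (unsigned a = 0; a < 2; ++a)
    for (unsigned b = 0; b < 2; ++b)
      tt |= (f(a, b) & 1u) << (a * 2 + b);
  return tt;
}

int main() {
  unsigned t1 = truthTable2([](unsigned a, unsigned b) { return (~a & b) | a; });
  unsigned t2 = truthTable2([](unsigned a, unsigned b) { return (b & ~a) | a; });
  unsigned t3 = truthTable2([](unsigned a, unsigned b) { return a | b; });
  // All three values are equal: a lookup keyed on the truth table would
  // map both input forms to the canonical (a | b).
  std::printf("%u %u %u\n", t1, t2, t3);
  return (t1 == t2 && t2 == t3) ? 0 : 1;
}

The hard (and well researched) part is the reverse direction: picking a
cheapest expression for a given truth table, which is what an actual
minimization pass would have to do.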
Sean Silva
2014-Aug-12 21:00 UTC
[LLVMdev] Efficient Pattern matching in Instruction Combine
Re-adding the mailing list (remember to hit "reply all")

On Tue, Aug 12, 2014 at 9:36 AM, suyog sarda <sardask01 at gmail.com> wrote:

> Thanks Sean for the reply.
>
> On Mon, Aug 11, 2014 at 11:49 PM, Sean Silva <chisophugis at gmail.com>
> wrote:
>
>> [...]
>>
>> I'm actually sort of wondering if instead of hardcoding all these
>> bitwise simplifications, we can just use a general boolean minimization
>> algorithm.
>>
>> Is there a reason we don't? Maybe it's just compile time? But eventually
>> (maybe already?) adding enough special case combines will be slower than
>> using the general optimization.
>
> Can you please elaborate more on "general boolean minimization
> algorithm"? Is it a specific algorithm? I couldn't find any proper
> reference to it. If you can provide a reference for it, I will try to
> see if I can implement it.
>
> Can K-Map be an option to generalize the logic?
> (I could think of K-Map for now for simplification from my college
> days :))

Yeah, K-Maps are one way to do it, however K-Maps are really most useful
when your target operations are all NAND/NOR, which isn't the case here
(and you can also do Reed-Muller for XOR/XNOR); instcombine has a very
different set of "primitive operations" and their relative costs (i.e.
XOR, AND, OR, and NOT are usually the base operations and cost the same
amount of resources to execute).

Since optimization of binary expressions is of such enormous importance to
the electronics industry, this topic is extremely well researched, so
there should be a lot to draw from. There may be stuff targeted
specifically at software compilers, but I haven't looked in depth (maybe
try Muchnick section 12.3 for references? I don't have the book handy).

For starters, you may want to check out Knuth volume 4A, section 7.1. I'm
working from memory here, but IIRC there is a whole part about
minimization/optimization of binary expressions. Knuth as usual provides
tons of references to the literature in case the text there is not enough.
You can see preprint copies of volume 4A here:
http://www.cs.utsa.edu/~wagner/knuth/ (although Knuth is the sort of book
that any programmer should have on their bookshelf; in fact, you can
probably justify it as a work expense).

> Also, a few more things I would like to understand. From the example I
> have shown above: with the test case return ((~a & b) | a) we optimize
> the pattern, but with the slightly flipped test case
> return ((b & ~a) | a) we fail to optimize.
>
> Does the 'reassociate' pass always run before 'instcombine'?
> Theoretically, it should in my opinion. But from the output in the first
> case, it seems it doesn't. If 'reassociate' had run before 'instcombine',
> it would have converted (~a & b) to (b & ~a) and the pattern would not
> have been optimized.
>
> Is there any logic which determines when to run reassociate and when not
> to run?

I don't know off the top of my head. Try running opt with
-debug-pass=Arguments to see what the order is (or I think you should be
able to pass -debug-pass=Arguments to clang's -mllvm option).

-- Sean Silva

> Duncan, David, Sean - any comment on this is most welcome :)
>
> Regards,
> Suyog
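On the Reed-Muller remark above, here is a small standalone sketch (plain
C++, not LLVM code; the variable-to-bit encoding is an assumption of the
example) that computes the algebraic normal form, i.e. the XOR-of-ANDs
coefficients, of a 3-variable truth table with the GF(2) Moebius
transform.

#include <array>
#include <cstdio>

// tt[m] holds f(a, b, c) where bit 0 of m is a, bit 1 is b, bit 2 is c.
// After the transform, tt[m] is the ANF coefficient of the monomial whose
// variables are selected by m (m == 0 is the constant term).
static std::array<unsigned, 8> toANF(std::array<unsigned, 8> tt) {
  for (unsigned var = 0; var < 3; ++var)
    for (unsigned m = 0; m < 8; ++m)
      if (m & (1u << var))
        tt[m] ^= tt[m ^ (1u << var)];
  return tt;
}

int main() {
  // Truth table of (b ^ a) | ((a ^ b) ^ c), one of the patterns above.
  std::array<unsigned, 8> tt;
  for (unsigned m = 0; m < 8; ++m) {
    unsigned a = m & 1, b = (m >> 1) & 1, c = (m >> 2) & 1;
    tt[m] = ((b ^ a) | ((a ^ b) ^ c)) & 1u;
  }
  std::array<unsigned, 8> anf = toANF(tt);
  for (unsigned m = 0; m < 8; ++m)
    if (anf[m])
      std::printf("monomial mask %u\n", m);
  return 0;
}

For this input it prints the masks for a, b, c, a&c and b&c, i.e. the
function is a ^ b ^ c ^ (a & c) ^ (b & c) (which is also just
(a ^ b) | c), the kind of canonical algebraic view an XOR-aware minimizer
could start from.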
suyog sarda
2014-Aug-13 16:37 UTC
[LLVMdev] Efficient Pattern matching in Instruction Combine
Thanks Sean for the reference. I will go through it and see if I can
implement it for generic boolean expression minimization.

Regards,
Suyog

On Wed, Aug 13, 2014 at 2:30 AM, Sean Silva <chisophugis at gmail.com> wrote:

> Re-adding the mailing list (remember to hit "reply all")
>
> [...]
>
> For starters, you may want to check out Knuth volume 4A, section 7.1.
> I'm working from memory here, but IIRC there is a whole part about
> minimization/optimization of binary expressions.
>
> [...]
>
> I don't know off the top of my head. Try running opt with
> -debug-pass=Arguments to see what the order is (or I think you should be
> able to pass -debug-pass=Arguments to clang's -mllvm option).
>
> -- Sean Silva

--
With regards,
Suyog Sarda
Daniel Berlin
2014-Aug-13 22:39 UTC
[LLVMdev] Efficient Pattern matching in Instruction Combine
Even if you can't implement such an algorithm sanely, ISTM that
auto-generating this code from a table (or whatever), and choosing
canonical results (to avoid a fixpoint issue), rather than what seems to
be hand-additions of every possible set of minimizations on three
variables, is still a better solution, no? At least then you wouldn't have
human errors, and a growing file that makes any of the non-trivial
transforms easy to miss in the noise.

On Tue, Aug 12, 2014 at 2:00 PM, Sean Silva <chisophugis at gmail.com> wrote:

> Re-adding the mailing list (remember to hit "reply all")
>
> [...]
>
> Since optimization of binary expressions is of such enormous importance
> to the electronics industry, this topic is extremely well researched, so
> there should be a lot to draw from. There may be stuff targeted
> specifically at software compilers, but I haven't looked in depth (maybe
> try Muchnick section 12.3 for references? I don't have the book handy).
>
> For starters, you may want to check out Knuth volume 4A, section 7.1.
>
> [...]
>
> -- Sean Silva
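To sketch the table-driven idea slightly further, here is a rough
standalone example (plain C++, not actual LLVM code; Entry, the expression
strings, and the unit cost model are inventions of the sketch) that
generates, once, a map from each reachable 3-variable truth table to a
cheapest expression found by combining known ones.

#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

struct Entry { std::string expr; unsigned cost; };

int main() {
  // Bit i of a truth table is the value for (a, b, c) = (i&1, (i>>1)&1, (i>>2)&1).
  std::map<uint8_t, Entry> best;
  best[0xAA] = {"a", 0};
  best[0xCC] = {"b", 0};
  best[0xF0] = {"c", 0};

  // A few rounds of combining existing entries with ~, &, |, ^ and keeping
  // the cheapest expression seen so far for each truth table.
  for (int round = 0; round < 3; ++round) {
    std::map<uint8_t, Entry> snapshot = best;
    auto consider = [&best](uint8_t tt, const std::string &e, unsigned c) {
      auto it = best.find(tt);
      if (it == best.end() || c < it->second.cost)
        best[tt] = {e, c};
    };
    for (const auto &x : snapshot) {
      consider(uint8_t(~x.first), "~" + x.second.expr, x.second.cost + 1);
      for (const auto &y : snapshot) {
        unsigned c = x.second.cost + y.second.cost + 1;
        consider(x.first & y.first, "(" + x.second.expr + " & " + y.second.expr + ")", c);
        consider(x.first | y.first, "(" + x.second.expr + " | " + y.second.expr + ")", c);
        consider(x.first ^ y.first, "(" + x.second.expr + " ^ " + y.second.expr + ")", c);
      }
    }
  }

  // ((~a & b) | a) and ((b & ~a) | a) share the truth table of (a | b),
  // so both would look up the same canonical entry.
  std::printf("%s\n", best[uint8_t((~0xAAu & 0xCC) | 0xAA)].expr.c_str());
  return 0;
}

A combine pass could then evaluate a matched three-variable expression
tree on the eight input combinations and, when the stored expression is
cheaper, rewrite to it instead of relying on hand-written minimizations;
the compile-time cost of that evaluation and of rebuilding the IR is the
obvious thing to measure.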