Daniel Berlin
2014-Aug-13 22:39 UTC
[LLVMdev] Efficient Pattern matching in Instruction Combine
Even if you can't implement such an algorithm sanely, ISTM that
auto-generating this code from a table (or whatever), and choosing
canonical results (to avoid a fixpoint issue), rather than what seems to
be hand-additions of every possible set of minimizations on three
variables, is still a better solution, no?

At least then you wouldn't have human errors, and a growing file that
makes any of the non-trivial transforms easy to miss in the noise.

On Tue, Aug 12, 2014 at 2:00 PM, Sean Silva <chisophugis at gmail.com> wrote:
> Re-adding the mailing list (remember to hit "reply all")
>
> On Tue, Aug 12, 2014 at 9:36 AM, suyog sarda <sardask01 at gmail.com> wrote:
>>
>> Thanks Sean for the reply.
>>
>> On Mon, Aug 11, 2014 at 11:49 PM, Sean Silva <chisophugis at gmail.com>
>> wrote:
>>>
>>> On Fri, Aug 8, 2014 at 1:34 PM, suyog sarda <sardask01 at gmail.com> wrote:
>>>>
>>>> Hi Duncan, David, Sean,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> > It'd be interesting if you could find a design that also treated these
>>>> > the same:
>>>> >
>>>> > (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
>>>> > (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>> > (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
>>>> >
>>>> > I.e., `^` is also associative.
>>>>
>>>> Agree with Duncan on including associative operations too.
>>>>
>>>> > Can we handle this by just having a canonical ordering? Or is that too
>>>> > difficult to maintain through various instcombines?
>>>>
>>>> Yes, it's the easiest way to do that. If I am not wrong, what Sean is
>>>> suggesting is that if we get something like
>>>>
>>>> (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>>
>>>> and we have written a pass for the pattern
>>>>
>>>> (A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>>
>>>> we could just swap 'B' and 'A' before matching the pattern, i.e. we do
>>>> canonical re-ordering.
>>>> (Sean, correct me if I am wrong on my understanding of what you meant
>>>> by canonical ordering.)
>>>
>>> You understood correctly.
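[Editor's note: the canonical-ordering idea discussed above can be sketched in a few lines outside LLVM. The following is a hypothetical illustration in plain Python over tuple expression trees, not InstCombine's matcher API: normalize the operand order of commutative operators once, up front, so every commuted spelling collapses to a single form and one hand-written pattern suffices.]

```python
def canonicalize(expr):
    """Return a canonical form of a nested tuple expression.

    Leaves are variable-name strings; interior nodes are tuples like
    ('^', lhs, rhs). For commutative operators the operands are sorted
    by their printed representation, so (B ^ A) and (A ^ B) become the
    same tree.
    """
    if isinstance(expr, str):          # a leaf variable
        return expr
    op, *args = expr
    args = [canonicalize(a) for a in args]
    if op in ('^', '&', '|'):          # commutative operators
        args.sort(key=repr)
    return (op, *args)

# Two commuted spellings of (B ^ A) | ((B ^ C) ^ A) collapse to one
# canonical tree, so a matcher only needs to recognize that one form.
e1 = ('|', ('^', 'B', 'A'), ('^', ('^', 'B', 'C'), 'A'))
e2 = ('|', ('^', 'A', 'B'), ('^', 'A', ('^', 'C', 'B')))
print(canonicalize(e1) == canonicalize(e2))  # prints True
```

As written this only handles commutativity; handling associativity as well would require flattening chains of the same operator into n-ary nodes before sorting, which is roughly what LLVM's Reassociate pass is meant to do.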
>>>>
>>>> I have seen in the "instcombine" module that many times std::swap was
>>>> used for the same.
>>>>
>>>> But it becomes too repetitive and somewhat irritating to keep doing
>>>> this for every pattern. Shouldn't we do this in a more generic way,
>>>> where we do not have to worry about swapping operands?
>>>> (How to do it - I don't know for now. I looked at the matchers'
>>>> implementation for modification/improvement, but didn't find any clue.)
>>>>
>>>> > It seems to me that we could rejigger Reassociate to reduce the number
>>>> > of possibilities that InstCombine could expect.
>>>>
>>>> Agree with David that in fact it should be the reassociate pass that
>>>> handles this.
>>>>
>>>> But the following is something interesting and rather worrying that I
>>>> have found (I have omitted unimportant code in the following example):
>>>>
>>>> ((~A & B) | A) --> (A | B)  ; code is implemented for this already
>>>>
>>>> C code:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ cat 1.c
>>>> #include <stdio.h>
>>>> int cal(int a, int b) {
>>>>     return ((~a & b) | a);
>>>> }
>>>>
>>>> int main() {
>>>>     int a, b;
>>>>     scanf("%d %d", &a, &b);
>>>>     return cal(a, b);
>>>> }
>>>>
>>>> LLVM IR at O0:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0 -emit-llvm 1.c
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>   %1 = xor i32 %a, -1
>>>>   %2 = and i32 %1, %b
>>>>   %3 = or i32 %2, %a
>>>>   ret i32 %3
>>>> }
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @main() #0 {
>>>>   ..
>>>>   %4 = call i32 @cal(i32 %2, i32 %3)
>>>>   ret i32 %4
>>>> }
>>>>
>>>> LLVM IR after instcombine:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -instcombine 1.ll
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>   %1 = or i32 %a, %b
>>>>   ret i32 %1
>>>> }
>>>> ..
>>>>
>>>> which is OK, as per the expected transform.
>>>>
>>>> LLVM IR with reassociate and instcombine:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -reassociate -instcombine 2.ll
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>   %1 = xor i32 %a, -1
>>>>   %2 = and i32 %b, %1
>>>>   %3 = or i32 %2, %a
>>>>   ret i32 %3
>>>> }
>>>> ..
>>>>
>>>> Reassociate converted (~a & b) to (b & ~a), and instruction combine
>>>> failed to optimize. As is evident, the reassociate pass sometimes
>>>> actually harms the optimization opportunity!
>>>>
>>>> Some more interesting findings on how this can affect the final
>>>> machine code.
>>>>
>>>> Test case 1.c:
>>>>
>>>> #include <stdio.h>
>>>> int cal(int a, int b) {
>>>>     return ((~a & b) | a);
>>>> }
>>>>
>>>> int main() {
>>>>     int a, b;
>>>>     scanf("%d %d", &a, &b);
>>>>     return cal(a, b);
>>>> }
>>>>
>>>> x86 .s file with clang at O2 for the above program:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>>>
>>>> main:                          # @main
>>>> # BB#0:
>>>>     subl    $28, %esp
>>>>     leal    20(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.L.str, (%esp)
>>>>     calll   __isoc99_scanf
>>>>     movl    20(%esp), %eax
>>>>     orl     24(%esp), %eax
>>>>     addl    $28, %esp
>>>>     retl
>>>>
>>>> As seen, the optimization happened at the IR level itself and is
>>>> reflected in the .s file.
>>>>
>>>> GCC output for the same:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>>
>>>> main:
>>>> .LFB23:
>>>>     .cfi_startproc
>>>>     pushl   %ebp
>>>>     .cfi_def_cfa_offset 8
>>>>     .cfi_offset 5, -8
>>>>     movl    %esp, %ebp
>>>>     .cfi_def_cfa_register 5
>>>>     andl    $-16, %esp
>>>>     subl    $32, %esp
>>>>     leal    28(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.LC0, (%esp)
>>>>     call    __isoc99_scanf
>>>>     movl    24(%esp), %eax
>>>>     orl     28(%esp), %eax
>>>>     leave
>>>>     .cfi_restore 5
>>>>     .cfi_def_cfa 4, 4
>>>>     ret
>>>>     .cfi_endproc
>>>>
>>>> GCC also did the optimization.
>>>>
>>>> Now we just slightly flip the test case, 1.c:
>>>>
>>>> #include <stdio.h>
>>>> int cal(int a, int b) {
>>>>     return ((b & ~a) | a);
>>>> }
>>>>
>>>> int main() {
>>>>     int a, b;
>>>>     scanf("%d %d", &a, &b);
>>>>     return cal(a, b);
>>>> }
>>>>
>>>> x86 .s file with clang at O2 for the above program:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>>>
>>>> main:                          # @main
>>>> # BB#0:
>>>>     subl    $28, %esp
>>>>     leal    20(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.L.str, (%esp)
>>>>     calll   __isoc99_scanf
>>>>     movl    24(%esp), %ecx
>>>>     movl    %ecx, %eax
>>>>     notl    %eax
>>>>     andl    20(%esp), %eax
>>>>     orl     %ecx, %eax
>>>>     addl    $28, %esp
>>>>     retl
>>>>
>>>> We lost the optimization opportunity here!
>>>>
>>>> GCC output for the same:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>>
>>>> main:
>>>> .LFB23:
>>>>     .cfi_startproc
>>>>     pushl   %ebp
>>>>     .cfi_def_cfa_offset 8
>>>>     .cfi_offset 5, -8
>>>>     movl    %esp, %ebp
>>>>     .cfi_def_cfa_register 5
>>>>     andl    $-16, %esp
>>>>     subl    $32, %esp
>>>>     leal    28(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.LC0, (%esp)
>>>>     call    __isoc99_scanf
>>>>     movl    24(%esp), %eax
>>>>     orl     28(%esp), %eax
>>>>     leave
>>>>     .cfi_restore 5
>>>>     .cfi_def_cfa 4, 4
>>>>     ret
>>>>     .cfi_endproc
>>>>
>>>> GCC still optimized it!
>>>>
>>>> It is clearly evident that LLVM is losing because of its naive
>>>> approach to pattern matching.
>>>>
>>>> How can we improve this so that LLVM also recognizes
>>>> commutative/associative patterns efficiently with a single 'matchers'
>>>> call? I have no idea so far; I will think more on it. Any help would
>>>> be really helpful.
>>>>
>>>> Any suggestions/comments/rectifications are most awaited :)
>>>
>>> I'm actually sort of wondering if, instead of hardcoding all these
>>> bitwise simplifications, we can just use a general boolean minimization
>>> algorithm.
>>>
>>> Is there a reason we don't? Maybe it's just compile time? But eventually
>>> (maybe already?) adding enough special-case combines will be slower than
>>> using the general optimization.
>>
>> Can you please elaborate more on "general boolean minimization
>> algorithm"? Is it a specific algorithm? I couldn't find any proper
>> reference to it. If you can provide a reference, I will try to see if I
>> can implement it.
>>
>> Can a K-map be an option to generalize the logic?
>> (I could think of K-maps for now for simplification, from my college
>> days :))
>
> Yeah, K-maps are one way to do it; however, K-maps are really most useful
> when your target operations are all NAND/NOR, which isn't the case here
> (and you can also do Reed-Muller for XOR/XNOR); instcombine has a very
> different set of "primitive operations" and relative costs (i.e. XOR,
> AND, OR, and NOT are usually the base operations and cost the same
> amount of resources to execute).
>
> Since optimization of binary expressions is of such enormous importance
> to the electronics industry, this topic is extremely well researched, so
> there should be a lot to draw from. There may be stuff targeted
> specifically at software compilers, but I haven't looked in depth (maybe
> try Muchnick section 12.3 for references? I don't have the book handy).
>
> For starters, you may want to check out Knuth volume 4A, section 7.1.
> I'm working from memory here, but IIRC there is a whole part about
> minimization/optimization of binary expressions. Knuth as usual provides
> tons of references to the literature in case the text there is not
> enough. You can see preprint copies of volume 4A here:
> http://www.cs.utsa.edu/~wagner/knuth/ (although Knuth is the sort of
> book that any programmer should have on their bookshelf; in fact, you
> can probably justify it as a work expense).
>
>> Also, a few more things I would like to understand.
>> From the above example I have shown, when we have the test case:
>>
>> #include <stdio.h>
>> int cal(int a, int b) {
>>     return ((~a & b) | a);
>> }
>>
>> int main() {
>>     int a, b;
>>     scanf("%d %d", &a, &b);
>>     return cal(a, b);
>> }
>>
>> x86 .s file with clang at O2 for the above program:
>>
>> main:                          # @main
>> # BB#0:
>>     subl    $28, %esp
>>     leal    20(%esp), %eax
>>     movl    %eax, 8(%esp)
>>     leal    24(%esp), %eax
>>     movl    %eax, 4(%esp)
>>     movl    $.L.str, (%esp)
>>     calll   __isoc99_scanf
>>     movl    20(%esp), %eax
>>     orl     24(%esp), %eax
>>     addl    $28, %esp
>>     retl
>>
>> As seen, we optimize the pattern.
>>
>> But when we have the test case:
>>
>> #include <stdio.h>
>> int cal(int a, int b) {
>>     return ((b & ~a) | a);
>> }
>>
>> int main() {
>>     int a, b;
>>     scanf("%d %d", &a, &b);
>>     return cal(a, b);
>> }
>>
>> x86 .s file with clang at O2 for the above program:
>>
>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>
>> main:                          # @main
>> # BB#0:
>>     subl    $28, %esp
>>     leal    20(%esp), %eax
>>     movl    %eax, 8(%esp)
>>     leal    24(%esp), %eax
>>     movl    %eax, 4(%esp)
>>     movl    $.L.str, (%esp)
>>     calll   __isoc99_scanf
>>     movl    24(%esp), %ecx
>>     movl    %ecx, %eax
>>     notl    %eax
>>     andl    20(%esp), %eax
>>     orl     %ecx, %eax
>>     addl    $28, %esp
>>     retl
>>
>> we fail to optimize.
>>
>> Does the 'reassociate' pass always run before 'instcombine'?
>> Theoretically, it should, in my opinion. But from the output in the
>> first case, it seems it doesn't. If 'reassociate' had run before
>> 'instcombine', it would have converted (~a & b) to (b & ~a), and the
>> pattern would not have been optimized.
>>
>> Is there any logic which determines when to run reassociate and when
>> not to run it?
>
> I don't know off the top of my head. Try running opt with
> -debug-pass=Arguments to see what the order is (or I think you should be
> able to pass -debug-pass=Arguments to clang's -mllvm option).
>
> -- Sean Silva
>
>> Duncan, David, Sean - any comments on this are most welcome :)
>>
>> Regards,
>> Suyog
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
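[Editor's note: a lightweight way to see why the general boolean-minimization approach discussed above subsumes the hand-written patterns: a boolean expression over (a, b, c) is fully characterized by its 8-row truth table, so equivalent spellings share a single 8-bit signature regardless of how they are commuted or reassociated. A hypothetical Python sketch, not LLVM code:]

```python
def truth_table_index(f):
    """Pack f's truth table over 3 one-bit inputs into an 8-bit integer.

    Bit i of the result holds f evaluated at the i-th assignment of
    (a, b, c); two expressions compute the same boolean function iff
    their indices are equal.
    """
    idx = 0
    for i in range(8):
        a, b, c = (i >> 2) & 1, (i >> 1) & 1, i & 1
        if f(a, b, c) & 1:
            idx |= 1 << i
    return idx

# The forms from the thread and the minimized result all share one
# signature, so reassociate's (b & ~a) rewrite would be harmless to a
# signature-based matcher.
forms = [
    lambda a, b, c: (~a & b) | a,
    lambda a, b, c: (b & ~a) | a,
    lambda a, b, c: a | b,
]
print({truth_table_index(f) for f in forms})  # prints {252}
```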
suyog sarda
2014-Aug-15 18:41 UTC
[LLVMdev] Efficient Pattern matching in Instruction Combine
On Thu, Aug 14, 2014 at 4:09 AM, Daniel Berlin <dberlin at dberlin.org> wrote:
> Even if you can't implement such an algorithm sanely, ISTM that
> auto-generating this code from a table (or whatever), and choosing
> canonical results (to avoid a fixpoint issue), rather than what seems
> to be hand-additions of every possible set of minimizations on three
> variables, is still a better solution, no?
>
> At least then you wouldn't have human errors, and a growing file that
> makes any of the non-trivial transforms easy to miss in the noise.

That's exactly what I thought: basically, to auto-generate the minimized
expression instead of hand-writing code for every pattern.

A few links I found:

http://sontrak.com/
http://sourceforge.net/projects/truthtablesolve/

This might take much more time to implement; my immediate goal is to
improve the matching capabilities.

Regards,
Suyog
Sean Silva
2014-Aug-15 19:25 UTC
[LLVMdev] Efficient Pattern matching in Instruction Combine
On Fri, Aug 15, 2014 at 11:41 AM, suyog sarda <sardask01 at gmail.com> wrote:
>
> On Thu, Aug 14, 2014 at 4:09 AM, Daniel Berlin <dberlin at dberlin.org>
> wrote:
>
>> Even if you can't implement such an algorithm sanely, ISTM that
>> auto-generating this code from a table (or whatever), and choosing
>> canonical results (to avoid a fixpoint issue), rather than what seems
>> to be hand-additions of every possible set of minimizations on three
>> variables, is still a better solution, no?
>>
>> At least then you wouldn't have human errors, and a growing file that
>> makes any of the non-trivial transforms easy to miss in the noise.
>
> That's exactly what I thought: basically, to auto-generate the minimized
> expression instead of hand-writing code for every pattern.
>
> A few links I found:
>
> http://sontrak.com/
> http://sourceforge.net/projects/truthtablesolve/
>
> This might take much more time to implement; my immediate goal is to
> improve the matching capabilities.

Daniel's suggestion could actually be done pretty easily. (Assume for now
that we are most interested in boolean expressions over 3 variables.)
There are two parts:

1. Add some code that evaluates a given 3-arg boolean expression at each
   of the 2^3 == 8 values, creating an 8-bit index.

2. Create a table of 2^8 == 256 entries (not *too* much work to do by
   hand) which has the "preferred" representation for each boolean
   function of 3 variables.

Then, you just use the 8-bit index to look in the table for the
"preferred" form of the expression.

This approach could get a lot of "bang for your buck" without much code
complexity or implementation difficulty. The same could be done for
2-variable expressions easily (might be a good starting point). For
4-variable expressions, it might be more difficult, since the table would
need 2^(2^4) == 64k entries.

-- Sean Silva

> Regards,
> Suyog
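[Editor's note: the two-part scheme Sean describes can be prototyped in a few lines. The following is a hypothetical sketch, not LLVM code; the table is only partially filled in for illustration, with the "preferred" forms given as strings rather than rebuilt IR.]

```python
def index3(f):
    """Part 1: evaluate a 3-arg boolean function at all 2^3 == 8
    assignments, packing the results into an 8-bit index."""
    idx = 0
    for i in range(8):
        a, b, c = (i >> 2) & 1, (i >> 1) & 1, i & 1
        if f(a, b, c) & 1:
            idx |= 1 << i
    return idx

# Part 2: a table mapping each index to a preferred form. Only a few
# illustrative entries appear here; a real table would cover all 256
# boolean functions of three variables.
PREFERRED = {
    index3(lambda a, b, c: a | b): "a | b",
    index3(lambda a, b, c: (a ^ b) | c): "(a ^ b) | c",
}

def simplify(f):
    """Look up the preferred representation, if the table knows one."""
    return PREFERRED.get(index3(f))

# Every commuted/reassociated spelling hits the same table slot:
print(simplify(lambda a, b, c: (~a & b) | a))             # prints a | b
print(simplify(lambda a, b, c: (b ^ a) | ((b ^ c) ^ a)))  # prints (a ^ b) | c
```

Note that the index computation is insensitive to how the input expression is spelled, which is exactly what makes this robust against the reassociate-ordering problem shown earlier in the thread.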