thr3ads.net - llvm dev - [LLVMdev] Efficient Pattern matching in Instruction Combine [Aug 2014]

If this information is useful, please help other people find it:
Share via:

suyog sarda

2014-Aug-08 20:34 UTC

[LLVMdev] Efficient Pattern matching in Instruction Combine

Hi Duncan, David, Sean.

Thanks for your reply.
> It'd be interesting if you could find a design that also treated these
> the same:
>
>   (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
>   (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>   (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
>
> I.e., `^` is also associative.
Agree with Duncan on including associative operation too.
> Can we handle this by just having a canonical ordering? Or is that toodifficult to maintain through various instcombines?

Yes, its the easiest way to do that. If i am not wrong, what Sean is
suggesting is that if we get

something like

(B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C

and we have written pass for pattern

(A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C

we could just swap 'B' and 'A' before matching the pattern i.e.
we do
canonical re-ordering
(Sean, Correct me if i am wrong on my understanding of what you meant by
canonical ordering).

I have seen in the "instcombine" module that many times std::swap was
used
for the same.

But it becomes too much repetitive and somewhat irritating to keep doing
this for every pattern.
Shouldn't we do this in more generic way where we do not have to worry
about swapping operands?
(How to do it - i don't know for now. I looked at the matchers
implementation for modification/improvement,
but didn't find any clue on it).
> It seems to me that we could rejigger Reassociate to reduce the number ofpossibilities that InstCombine could expect.
>Agree with David that infact it should be reassociate pass which should
handle this.

But following is something interesting and rather worrying things i have
found :
(I have omitted unimportant code and highlighted important one in following
example)
*e.x. *

((~A & B) | A)  -->  (A | B) ; Code is implemented for this already

*C code :*

suyog at suyog-Inspiron-N5010:~$ cat 1.c
#include<stdio.h>
int cal(int a, int b) {
*return ((~a & b) | a);*
}

int main(){
int a, b;
scanf("%d %d", &a, &b);
return cal(a,b);
}

LLVM IR at O0 :

suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0 -emit-llvm 1.c

; Function Attrs: nounwind




*define i32 @cal(i32 %a, i32 %b) #0 {  %1 = xor i32 %a, -1  %2 = and i32
%1, %b  %3 = or i32 %2, %a  ret i32 %3*
}

; Function Attrs: nounwind
define i32 @main() #0 {
  ..
  ..
  %4 = call i32 @cal(i32 %2, i32 %3)
  ret i32 %4
}

*LLVM IR at instcombine :*

suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -instcombine 1.ll





*; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 {  %1 = or
i32 %a, %b  ret i32 %1}*
..
..
..

which is OK as per expected transform.



*LLVM IR at reassociate and instcombine :*suyog at suyog-Inspiron-N5010:~$
Open/rbuild/bin/opt -S -reassociate -instcombine 2.ll







*; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 {  %1 = xor
i32 %a, -1  %2 = and i32 %b, %1  %3 = or i32 %2, %a  ret i32 %3}*

..
..

Reassociate converted (~a&b) to (b&~a) and instruction combine failed to
optimize.

As evident, sometimes, reassociate pass actually harms the optimization
opportunity !!

Some more interesting findings on how can this affect the final machine
code:

*Test case :*
1.c :

#include<stdio.h>
int cal(int a, int b) {
*return ((~a & b) | a);*
}

int main(){
int a, b;
scanf("%d %d", &a, &b);
return cal(a,b);
}

*X86 .s file with clang at O2 for above program :*

suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c

main:                                   # @main
# BB#0:
    subl    $28, %esp
    leal    20(%esp), %eax
    movl    %eax, 8(%esp)
    leal    24(%esp), %eax
    movl    %eax, 4(%esp)
    movl    $.L.str, (%esp)
    calll    __isoc99_scanf
    movl    20(%esp), %eax
*    orl    24(%esp), %eax*
    addl    $28, %esp
    retl

As seen, optimization happened at IR level itself reflected in .s file.

*GCC output for the same:*

suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c

main:
.LFB23:
    .cfi_startproc
    pushl    %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5
    andl    $-16, %esp
    subl    $32, %esp
    leal    28(%esp), %eax
    movl    %eax, 8(%esp)
    leal    24(%esp), %eax
    movl    %eax, 4(%esp)
    movl    $.LC0, (%esp)
    call    __isoc99_scanf
    movl    24(%esp), %eax
*    orl    28(%esp), %eax*
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc

GCC also did the optimization.

Now we just *slightly flip* the test case :



*1.c Test case:*#include<stdio.h>
int cal(int a, int b) {
*return ((b & ~a) | a);*
}

int main(){
int a, b;
scanf("%d %d", &a, &b);
return cal(a,b);
}



*X86 .s file with clang at O2 for above program :*suyog at
suyog-Inspiron-N5010:~$
Open/rbuild/bin/clang -S -O2 1.c

main:                                   # @main
# BB#0:
    subl    $28, %esp
    leal    20(%esp), %eax
    movl    %eax, 8(%esp)
    leal    24(%esp), %eax
    movl    %eax, 4(%esp)
    movl    $.L.str, (%esp)
    calll    __isoc99_scanf
    movl    24(%esp), %ecx
    movl    %ecx, %eax


*notl    %eax    andl    20(%esp), %eax    orl    %ecx, %eax*
    addl    $28, %esp
    retl

We lost the Optimization opportunity here !!

*GCC output for the same:*

suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c

main:
.LFB23:
    .cfi_startproc
    pushl    %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset 5, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register 5
    andl    $-16, %esp
    subl    $32, %esp
    leal    28(%esp), %eax
    movl    %eax, 8(%esp)
    leal    24(%esp), %eax
    movl    %eax, 4(%esp)
    movl    $.LC0, (%esp)
    call    __isoc99_scanf
    movl    24(%esp), %eax
*    orl    28(%esp), %eax*
    leave
    .cfi_restore 5
    .cfi_def_cfa 4, 4
    ret
    .cfi_endproc

GCC still optimized it !!

Clearly evident that llvm is losing because of naive approach of pattern
matching.

How can we improve this so that LLVM also recognizes
commutative/associative
pattern  efficiently with single 'matchers' call? I am not having any
idea
so far.
I will think more on it. Any help would be really helpful.

Any suggestions/comments/rectification are most awaited :)

-- 
With regards,
Suyog Sarda
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140809/418a1055/attachment.html>

Sean Silva

2014-Aug-11 18:19 UTC

head link

[LLVMdev] Efficient Pattern matching in Instruction Combine

On Fri, Aug 8, 2014 at 1:34 PM, suyog sarda <sardask01 at gmail.com>
wrote:
> Hi Duncan, David, Sean.
>
> Thanks for your reply.
>
>
> > It'd be interesting if you could find a design that also treated
these
> > the same:
> >
> >   (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
>
> >   (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
> >   (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
> >
> > I.e., `^` is also associative.
>
> Agree with Duncan on including associative operation too.
>
> > Can we handle this by just having a canonical ordering? Or is that too
> difficult to maintain through various instcombines?
>
> Yes, its the easiest way to do that. If i am not wrong, what Sean is
> suggesting is that if we get
>
> something like
>
> (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>
> and we have written pass for pattern
>
> (A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C
>
> we could just swap 'B' and 'A' before matching the pattern
i.e. we do
> canonical re-ordering
> (Sean, Correct me if i am wrong on my understanding of what you meant by
> canonical ordering).
>
You understood correctly.

>
> I have seen in the "instcombine" module that many times std::swap
was used
> for the same.
>
> But it becomes too much repetitive and somewhat irritating to keep doing
> this for every pattern.
> Shouldn't we do this in more generic way where we do not have to worry
> about swapping operands?
> (How to do it - i don't know for now. I looked at the matchers
> implementation for modification/improvement,
> but didn't find any clue on it).
>
>
> > It seems to me that we could rejigger Reassociate to reduce the number
> of possibilities that InstCombine could expect.
>
>>
> Agree with David that infact it should be reassociate pass which should
> handle this.
>
> But following is something interesting and rather worrying things i have
> found :
> (I have omitted unimportant code and highlighted important one in
> following example)
> *e.x. *
>
> ((~A & B) | A)  -->  (A | B) ; Code is implemented for this already
>
> *C code :*
>
> suyog at suyog-Inspiron-N5010:~$ cat 1.c
> #include<stdio.h>
> int cal(int a, int b) {
> *return ((~a & b) | a);*
> }
>
> int main(){
> int a, b;
> scanf("%d %d", &a, &b);
> return cal(a,b);
> }
>
> LLVM IR at O0 :
>
> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0 -emit-llvm
1.c
>
> ; Function Attrs: nounwind
>
>
>
>
> *define i32 @cal(i32 %a, i32 %b) #0 {  %1 = xor i32 %a, -1  %2 = and i32
> %1, %b  %3 = or i32 %2, %a  ret i32 %3*
> }
>
> ; Function Attrs: nounwind
> define i32 @main() #0 {
>   ..
>   ..
>   %4 = call i32 @cal(i32 %2, i32 %3)
>   ret i32 %4
> }
>
> *LLVM IR at instcombine :*
>
> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -instcombine 1.ll
>
>
>
>
>
> *; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 {   %1 = or
> i32 %a, %b  ret i32 %1}*
> ..
> ..
> ..
>
> which is OK as per expected transform.
>
>
>
> *LLVM IR at reassociate and instcombine : *suyog at suyog-Inspiron-N5010:~$
> Open/rbuild/bin/opt -S -reassociate -instcombine 2.ll
>
>
>
>
>
>
>
> *; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 {  %1 = xor
> i32 %a, -1  %2 = and i32 %b, %1   %3 = or i32 %2, %a  ret i32 %3}*
>
> ..
> ..
>
> Reassociate converted (~a&b) to (b&~a) and instruction combine
failed to
> optimize.
>
> As evident, sometimes, reassociate pass actually harms the optimization
> opportunity !!
>
> Some more interesting findings on how can this affect the final machine
> code:
>
> *Test case :*
> 1.c :
>
> #include<stdio.h>
> int cal(int a, int b) {
> *return ((~a & b) | a);*
> }
>
> int main(){
> int a, b;
> scanf("%d %d", &a, &b);
> return cal(a,b);
> }
>
> *X86 .s file with clang at O2 for above program :*
>
> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>
> main:                                   # @main
> # BB#0:
>     subl    $28, %esp
>     leal    20(%esp), %eax
>     movl    %eax, 8(%esp)
>     leal    24(%esp), %eax
>     movl    %eax, 4(%esp)
>     movl    $.L.str, (%esp)
>     calll    __isoc99_scanf
>     movl    20(%esp), %eax
> *    orl    24(%esp), %eax*
>     addl    $28, %esp
>     retl
>
> As seen, optimization happened at IR level itself reflected in .s file.
>
> *GCC output for the same:*
>
> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>
> main:
> .LFB23:
>     .cfi_startproc
>     pushl    %ebp
>     .cfi_def_cfa_offset 8
>     .cfi_offset 5, -8
>     movl    %esp, %ebp
>     .cfi_def_cfa_register 5
>     andl    $-16, %esp
>     subl    $32, %esp
>     leal    28(%esp), %eax
>     movl    %eax, 8(%esp)
>     leal    24(%esp), %eax
>     movl    %eax, 4(%esp)
>     movl    $.LC0, (%esp)
>     call    __isoc99_scanf
>     movl    24(%esp), %eax
> *    orl    28(%esp), %eax*
>     leave
>     .cfi_restore 5
>     .cfi_def_cfa 4, 4
>     ret
>     .cfi_endproc
>
> GCC also did the optimization.
>
> Now we just *slightly flip* the test case :
>
>
>
> *1.c Test case:*#include<stdio.h>
> int cal(int a, int b) {
> *return ((b & ~a) | a);*
> }
>
> int main(){
> int a, b;
> scanf("%d %d", &a, &b);
> return cal(a,b);
> }
>
>
>
> *X86 .s file with clang at O2 for above program :*
> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>
> main:                                   # @main
> # BB#0:
>     subl    $28, %esp
>     leal    20(%esp), %eax
>     movl    %eax, 8(%esp)
>     leal    24(%esp), %eax
>     movl    %eax, 4(%esp)
>     movl    $.L.str, (%esp)
>     calll    __isoc99_scanf
>     movl    24(%esp), %ecx
>     movl    %ecx, %eax
>
>
> *notl    %eax    andl    20(%esp), %eax    orl    %ecx, %eax*
>     addl    $28, %esp
>     retl
>
> We lost the Optimization opportunity here !!
>
> *GCC output for the same:*
>
> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>
> main:
> .LFB23:
>     .cfi_startproc
>     pushl    %ebp
>     .cfi_def_cfa_offset 8
>     .cfi_offset 5, -8
>     movl    %esp, %ebp
>     .cfi_def_cfa_register 5
>     andl    $-16, %esp
>     subl    $32, %esp
>     leal    28(%esp), %eax
>     movl    %eax, 8(%esp)
>     leal    24(%esp), %eax
>     movl    %eax, 4(%esp)
>     movl    $.LC0, (%esp)
>     call    __isoc99_scanf
>     movl    24(%esp), %eax
> *    orl    28(%esp), %eax*
>     leave
>     .cfi_restore 5
>     .cfi_def_cfa 4, 4
>     ret
>     .cfi_endproc
>
> GCC still optimized it !!
>
> Clearly evident that llvm is losing because of naive approach of pattern
> matching.
>
> How can we improve this so that LLVM also recognizes
> commutative/associative
> pattern  efficiently with single 'matchers' call? I am not having
any idea
> so far.
> I will think more on it. Any help would be really helpful.
>
> Any suggestions/comments/rectification are most awaited :)
>

I'm actually sort of wondering if instead of hardcoding all these bitwise
simplifications, we can just use a general boolean minimization algorithm.

Is there a reason we don't? Maybe it's just compile time? But eventually
(maybe already?) adding enough special case combines will be slower than
using the general optimization.

-- Sean

>
> --
> With regards,
> Suyog Sarda
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140811/83867903/attachment.html>

Sean Silva

2014-Aug-12 21:00 UTC

head link

[LLVMdev] Efficient Pattern matching in Instruction Combine

Re-adding the mailing list (remember to hit "reply all")


On Tue, Aug 12, 2014 at 9:36 AM, suyog sarda <sardask01 at gmail.com>
wrote:
> Thanks Sean for the reply.
>
>
> On Mon, Aug 11, 2014 at 11:49 PM, Sean Silva <chisophugis at
gmail.com>
> wrote:
>
>>
>>
>>
>> On Fri, Aug 8, 2014 at 1:34 PM, suyog sarda <sardask01 at
gmail.com> wrote:
>>
>>> Hi Duncan, David, Sean.
>>>
>>> Thanks for your reply.
>>>
>>>
>>> > It'd be interesting if you could find a design that also
treated these
>>> > the same:
>>> >
>>> >   (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
>>>
>>> >   (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>> >   (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
>>> >
>>> > I.e., `^` is also associative.
>>>
>>> Agree with Duncan on including associative operation too.
>>>
>>> > Can we handle this by just having a canonical ordering? Or is
that too
>>> difficult to maintain through various instcombines?
>>>
>>> Yes, its the easiest way to do that. If i am not wrong, what Sean
is
>>> suggesting is that if we get
>>>
>>> something like
>>>
>>> (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>
>>> and we have written pass for pattern
>>>
>>> (A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>
>>> we could just swap 'B' and 'A' before matching the
pattern i.e. we do
>>> canonical re-ordering
>>> (Sean, Correct me if i am wrong on my understanding of what you
meant by
>>> canonical ordering).
>>>
>>
>> You understood correctly.
>>
>>
>>>
>>> I have seen in the "instcombine" module that many times
std::swap was
>>> used for the same.
>>>
>>> But it becomes too much repetitive and somewhat irritating to keep
doing
>>> this for every pattern.
>>> Shouldn't we do this in more generic way where we do not have
to worry
>>> about swapping operands?
>>> (How to do it - i don't know for now. I looked at the matchers
>>> implementation for modification/improvement,
>>> but didn't find any clue on it).
>>>
>>>
>>> > It seems to me that we could rejigger Reassociate to reduce
the number
>>> of possibilities that InstCombine could expect.
>>>
>>>>
>>> Agree with David that infact it should be reassociate pass which
should
>>> handle this.
>>>
>>> But following is something interesting and rather worrying things i
have
>>> found :
>>> (I have omitted unimportant code and highlighted important one in
>>> following example)
>>> *e.x. *
>>>
>>> ((~A & B) | A)  -->  (A | B) ; Code is implemented for this
already
>>>
>>> *C code :*
>>>
>>> suyog at suyog-Inspiron-N5010:~$ cat 1.c
>>> #include<stdio.h>
>>> int cal(int a, int b) {
>>> *return ((~a & b) | a);*
>>> }
>>>
>>> int main(){
>>> int a, b;
>>> scanf("%d %d", &a, &b);
>>> return cal(a,b);
>>> }
>>>
>>> LLVM IR at O0 :
>>>
>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0
-emit-llvm
>>> 1.c
>>>
>>> ; Function Attrs: nounwind
>>>
>>>
>>>
>>>
>>> *define i32 @cal(i32 %a, i32 %b) #0 {  %1 = xor i32 %a, -1  %2 =
and i32
>>> %1, %b  %3 = or i32 %2, %a  ret i32 %3*
>>> }
>>>
>>> ; Function Attrs: nounwind
>>> define i32 @main() #0 {
>>>   ..
>>>   ..
>>>   %4 = call i32 @cal(i32 %2, i32 %3)
>>>   ret i32 %4
>>> }
>>>
>>> *LLVM IR at instcombine :*
>>>
>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S
-instcombine 1.ll
>>>
>>>
>>>
>>>
>>>
>>> *; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 {  
%1 >>> or i32 %a, %b  ret i32 %1}*
>>> ..
>>> ..
>>> ..
>>>
>>> which is OK as per expected transform.
>>>
>>>
>>>
>>> *LLVM IR at reassociate and instcombine : *suyog at
suyog-Inspiron-N5010:~$
>>> Open/rbuild/bin/opt -S -reassociate -instcombine 2.ll
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 {  %1
>>> xor i32 %a, -1  %2 = and i32 %b, %1   %3 = or i32 %2, %a  ret i32
%3}*
>>>
>>> ..
>>> ..
>>>
>>> Reassociate converted (~a&b) to (b&~a) and instruction
combine failed to
>>> optimize.
>>>
>>> As evident, sometimes, reassociate pass actually harms the
optimization
>>> opportunity !!
>>>
>>> Some more interesting findings on how can this affect the final
machine
>>> code:
>>>
>>> *Test case :*
>>> 1.c :
>>>
>>> #include<stdio.h>
>>> int cal(int a, int b) {
>>> *return ((~a & b) | a);*
>>> }
>>>
>>> int main(){
>>> int a, b;
>>> scanf("%d %d", &a, &b);
>>> return cal(a,b);
>>> }
>>>
>>> *X86 .s file with clang at O2 for above program :*
>>>
>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>>
>>> main:                                   # @main
>>> # BB#0:
>>>     subl    $28, %esp
>>>     leal    20(%esp), %eax
>>>     movl    %eax, 8(%esp)
>>>     leal    24(%esp), %eax
>>>     movl    %eax, 4(%esp)
>>>     movl    $.L.str, (%esp)
>>>     calll    __isoc99_scanf
>>>     movl    20(%esp), %eax
>>> *    orl    24(%esp), %eax*
>>>     addl    $28, %esp
>>>     retl
>>>
>>> As seen, optimization happened at IR level itself reflected in .s
file.
>>>
>>> *GCC output for the same:*
>>>
>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>
>>> main:
>>> .LFB23:
>>>     .cfi_startproc
>>>     pushl    %ebp
>>>     .cfi_def_cfa_offset 8
>>>     .cfi_offset 5, -8
>>>     movl    %esp, %ebp
>>>     .cfi_def_cfa_register 5
>>>     andl    $-16, %esp
>>>     subl    $32, %esp
>>>     leal    28(%esp), %eax
>>>     movl    %eax, 8(%esp)
>>>     leal    24(%esp), %eax
>>>     movl    %eax, 4(%esp)
>>>     movl    $.LC0, (%esp)
>>>     call    __isoc99_scanf
>>>     movl    24(%esp), %eax
>>> *    orl    28(%esp), %eax*
>>>     leave
>>>     .cfi_restore 5
>>>     .cfi_def_cfa 4, 4
>>>     ret
>>>     .cfi_endproc
>>>
>>> GCC also did the optimization.
>>>
>>> Now we just *slightly flip* the test case :
>>>
>>>
>>>
>>> *1.c Test case:*#include<stdio.h>
>>> int cal(int a, int b) {
>>> *return ((b & ~a) | a);*
>>> }
>>>
>>> int main(){
>>> int a, b;
>>> scanf("%d %d", &a, &b);
>>> return cal(a,b);
>>> }
>>>
>>>
>>>
>>> *X86 .s file with clang at O2 for above program :*
>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>>
>>> main:                                   # @main
>>> # BB#0:
>>>     subl    $28, %esp
>>>     leal    20(%esp), %eax
>>>     movl    %eax, 8(%esp)
>>>     leal    24(%esp), %eax
>>>     movl    %eax, 4(%esp)
>>>     movl    $.L.str, (%esp)
>>>     calll    __isoc99_scanf
>>>     movl    24(%esp), %ecx
>>>     movl    %ecx, %eax
>>>
>>>
>>> *notl    %eax    andl    20(%esp), %eax    orl    %ecx, %eax*
>>>     addl    $28, %esp
>>>     retl
>>>
>>> We lost the Optimization opportunity here !!
>>>
>>> *GCC output for the same:*
>>>
>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>
>>> main:
>>> .LFB23:
>>>     .cfi_startproc
>>>     pushl    %ebp
>>>     .cfi_def_cfa_offset 8
>>>     .cfi_offset 5, -8
>>>     movl    %esp, %ebp
>>>     .cfi_def_cfa_register 5
>>>     andl    $-16, %esp
>>>     subl    $32, %esp
>>>     leal    28(%esp), %eax
>>>     movl    %eax, 8(%esp)
>>>     leal    24(%esp), %eax
>>>     movl    %eax, 4(%esp)
>>>     movl    $.LC0, (%esp)
>>>     call    __isoc99_scanf
>>>     movl    24(%esp), %eax
>>> *    orl    28(%esp), %eax*
>>>     leave
>>>     .cfi_restore 5
>>>     .cfi_def_cfa 4, 4
>>>     ret
>>>     .cfi_endproc
>>>
>>>  GCC still optimized it !!
>>>
>>> Clearly evident that llvm is losing because of naive approach of
pattern
>>> matching.
>>>
>>> How can we improve this so that LLVM also recognizes
>>> commutative/associative
>>> pattern  efficiently with single 'matchers' call? I am not
having any
>>> idea so far.
>>> I will think more on it. Any help would be really helpful.
>>>
>>> Any suggestions/comments/rectification are most awaited :)
>>>
>>
>>
>> I'm actually sort of wondering if instead of hardcoding all these
bitwise
>> simplifications, we can just use a general boolean minimization
algorithm.
>>
>> Is there a reason we don't? Maybe it's just compile time? But
eventually
>> (maybe already?) adding enough special case combines will be slower
than
>> using the general optimization.
>>
>
> Can you please elaborate more on "general boolean minimization
algorithm"
> ? Is it a specific algorithm? I couldn't find any proper
> reference to it. If you can provide a reference for it, i will try to see
> if i can implement it.
>
> Can K-Map be an option to generalize the logic?
>  (I could think of K-Map for now for simplification from my college days
> :))
>
Yeah, K-Maps are one way to do it, however K-Maps are really most useful
for when your target operations are all NAND/NOR, which isn't the case here
(and you can also do Reed-Muller for XOR/XNOR); instcombine has a very
different set of "primitive operations" and their relative costs (i.e.
XOR,
AND, OR, and NOT are usually the base operations and cost the same amount
of resources to execute).

Since optimization of binary expressions is of such enormous importance to
the electronics industry, this topic is extremely well researched, so there
should be a lot to draw from. There may be stuff targeted specifically at
software compilers, but I haven't looked in depth (maybe try Muchnick
section 12.3 for references? I don't have the book handy).

For starters, you may want to check out Knuth volume 4A, section 7.1. I'm
working from memory here, but IIRC there is a whole part about
minimization/optimization of binary expressions. Knuth as usual provides
tons of references to the literature in case the text there is not enough.
You can see preprint copies of volume 4A here:
http://www.cs.utsa.edu/~wagner/knuth/ (although Knuth is the sort of book
that any programmer should have on their bookshelf; in fact, you can
probably justify it as a work expense).


>
> Also few more things i will like to understand.
> From above example i have shown,
>
> when we have test case as:
>
> #include<stdio.h>
> int cal(int a, int b) {
> *return ((~a & b) | a);*
> }
>
> int main(){
> int a, b;
> scanf("%d %d", &a, &b);
> return cal(a,b);
> }
>
> *X86 .s file with clang at O2 for above program :*
>
>  main:                                   # @main
> # BB#0:
>     subl    $28, %esp
>     leal    20(%esp), %eax
>     movl    %eax, 8(%esp)
>     leal    24(%esp), %eax
>     movl    %eax, 4(%esp)
>     movl    $.L.str, (%esp)
>     calll    __isoc99_scanf
>     movl    20(%esp), %eax
> *    orl    24(%esp), %eax*
>     addl    $28, %esp
>     retl
>
> As seen, we optimize the pattern.
>
> but when we have test case as :
>
> #include<stdio.h>
> int cal(int a, int b) {
> *return ((b & ~a) | a);*
> }
>
> int main(){
> int a, b;
> scanf("%d %d", &a, &b);
> return cal(a,b);
> }
>
>
>
> *X86 .s file with clang at O2 for above program :*
> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>
> main:                                   # @main
> # BB#0:
>     subl    $28, %esp
>     leal    20(%esp), %eax
>     movl    %eax, 8(%esp)
>     leal    24(%esp), %eax
>     movl    %eax, 4(%esp)
>     movl    $.L.str, (%esp)
>     calll    __isoc99_scanf
>     movl    24(%esp), %ecx
>     movl    %ecx, %eax
>
>
> *notl    %eax    andl    20(%esp), %eax    orl    %ecx, %eax*
>     addl    $28, %esp
>     retl
>
>  we fail  to optimize.
>
> Does, 'reassociate' pass always run before 'instcombine' ?
> Theoretically, it should in my opinion. But from the output in first case,
> it seems it doesn't.
> If 'reassociate' would have run before 'instcombine', it
would have
> converted (~a&b) to (b&~a)
> and the pattern would have not been optimized.
>
> Is there any logic which determines when to run reassociate and when not
> to run?
>
I don't know off the top of my head. Try running opt with
-debug-pass=Arguments to see what the order is (or I think you should be
able to pass -debug-pass=Arguments to clang's -mllvm option).

-- Sean Silva

>
> Duncan, David, Sean - any comment on this are most welcomed :)
>
> Regards,
> Suyog
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140812/41be749b/attachment.html>

suyog sarda

2014-Aug-13 16:37 UTC

head link

[LLVMdev] Efficient Pattern matching in Instruction Combine

Thanks Sean for the reference.

I will go through it and see if i can implement it for generic boolean
expression minimization.

Regards,
Suyog


On Wed, Aug 13, 2014 at 2:30 AM, Sean Silva <chisophugis at gmail.com>
wrote:
> Re-adding the mailing list (remember to hit "reply all")
>
>
> On Tue, Aug 12, 2014 at 9:36 AM, suyog sarda <sardask01 at gmail.com>
wrote:
>
>> Thanks Sean for the reply.
>>
>>
>> On Mon, Aug 11, 2014 at 11:49 PM, Sean Silva <chisophugis at
gmail.com>
>> wrote:
>>
>>>
>>>
>>>
>>> On Fri, Aug 8, 2014 at 1:34 PM, suyog sarda <sardask01 at
gmail.com> wrote:
>>>
>>>> Hi Duncan, David, Sean.
>>>>
>>>> Thanks for your reply.
>>>>
>>>>
>>>> > It'd be interesting if you could find a design that
also treated these
>>>> > the same:
>>>> >
>>>> >   (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
>>>>
>>>> >   (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>> >   (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
>>>> >
>>>> > I.e., `^` is also associative.
>>>>
>>>> Agree with Duncan on including associative operation too.
>>>>
>>>> > Can we handle this by just having a canonical ordering? Or
is that
>>>> too difficult to maintain through various instcombines?
>>>>
>>>> Yes, its the easiest way to do that. If i am not wrong, what
Sean is
>>>> suggesting is that if we get
>>>>
>>>> something like
>>>>
>>>> (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>>
>>>> and we have written pass for pattern
>>>>
>>>> (A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>>
>>>> we could just swap 'B' and 'A' before matching
the pattern i.e. we do
>>>> canonical re-ordering
>>>> (Sean, Correct me if i am wrong on my understanding of what you
meant
>>>> by canonical ordering).
>>>>
>>>
>>> You understood correctly.
>>>
>>>
>>>>
>>>> I have seen in the "instcombine" module that many
times std::swap was
>>>> used for the same.
>>>>
>>>> But it becomes too much repetitive and somewhat irritating to
keep
>>>> doing this for every pattern.
>>>> Shouldn't we do this in more generic way where we do not
have to worry
>>>> about swapping operands?
>>>> (How to do it - i don't know for now. I looked at the
matchers
>>>> implementation for modification/improvement,
>>>> but didn't find any clue on it).
>>>>
>>>>
>>>> > It seems to me that we could rejigger Reassociate to
reduce the
>>>> number of possibilities that InstCombine could expect.
>>>>
>>>>>
>>>> Agree with David that infact it should be reassociate pass
which should
>>>> handle this.
>>>>
>>>> But following is something interesting and rather worrying
things i
>>>> have found :
>>>> (I have omitted unimportant code and highlighted important one
in
>>>> following example)
>>>> *e.x. *
>>>>
>>>> ((~A & B) | A)  -->  (A | B) ; Code is implemented for
this already
>>>>
>>>> *C code :*
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ cat 1.c
>>>> #include<stdio.h>
>>>> int cal(int a, int b) {
>>>> *return ((~a & b) | a);*
>>>> }
>>>>
>>>> int main(){
>>>> int a, b;
>>>> scanf("%d %d", &a, &b);
>>>> return cal(a,b);
>>>> }
>>>>
>>>> LLVM IR at O0 :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0
-emit-llvm
>>>> 1.c
>>>>
>>>> ; Function Attrs: nounwind
>>>>
>>>>
>>>>
>>>>
>>>> *define i32 @cal(i32 %a, i32 %b) #0 {  %1 = xor i32 %a, -1  %2
= and
>>>> i32 %1, %b  %3 = or i32 %2, %a  ret i32 %3*
>>>> }
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @main() #0 {
>>>>   ..
>>>>   ..
>>>>   %4 = call i32 @cal(i32 %2, i32 %3)
>>>>   ret i32 %4
>>>> }
>>>>
>>>> *LLVM IR at instcombine :*
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S
-instcombine 1.ll
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 {
%1 >>>> or i32 %a, %b  ret i32 %1}*
>>>> ..
>>>> ..
>>>> ..
>>>>
>>>> which is OK as per expected transform.
>>>>
>>>>
>>>>
>>>> *LLVM IR at reassociate and instcombine : *suyog at
suyog-Inspiron-N5010:~$
>>>> Open/rbuild/bin/opt -S -reassociate -instcombine 2.ll
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 {
%1 >>>> xor i32 %a, -1  %2 = and i32 %b, %1   %3 = or i32 %2, %a 
ret i32 %3}*
>>>>
>>>> ..
>>>> ..
>>>>
>>>> Reassociate converted (~a&b) to (b&~a) and instruction
combine failed
>>>> to optimize.
>>>>
>>>> As evident, sometimes, reassociate pass actually harms the
optimization
>>>> opportunity !!
>>>>
>>>> Some more interesting findings on how can this affect the final
machine
>>>> code:
>>>>
>>>> *Test case :*
>>>> 1.c :
>>>>
>>>> #include<stdio.h>
>>>> int cal(int a, int b) {
>>>> *return ((~a & b) | a);*
>>>> }
>>>>
>>>> int main(){
>>>> int a, b;
>>>> scanf("%d %d", &a, &b);
>>>> return cal(a,b);
>>>> }
>>>>
>>>> *X86 .s file with clang at O2 for above program :*
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2
1.c
>>>>
>>>> main:                                   # @main
>>>> # BB#0:
>>>>     subl    $28, %esp
>>>>     leal    20(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.L.str, (%esp)
>>>>     calll    __isoc99_scanf
>>>>     movl    20(%esp), %eax
>>>> *    orl    24(%esp), %eax*
>>>>     addl    $28, %esp
>>>>     retl
>>>>
>>>> As seen, optimization happened at IR level itself reflected in
.s file.
>>>>
>>>> *GCC output for the same:*
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>>
>>>> main:
>>>> .LFB23:
>>>>     .cfi_startproc
>>>>     pushl    %ebp
>>>>     .cfi_def_cfa_offset 8
>>>>     .cfi_offset 5, -8
>>>>     movl    %esp, %ebp
>>>>     .cfi_def_cfa_register 5
>>>>     andl    $-16, %esp
>>>>     subl    $32, %esp
>>>>     leal    28(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.LC0, (%esp)
>>>>     call    __isoc99_scanf
>>>>     movl    24(%esp), %eax
>>>> *    orl    28(%esp), %eax*
>>>>     leave
>>>>     .cfi_restore 5
>>>>     .cfi_def_cfa 4, 4
>>>>     ret
>>>>     .cfi_endproc
>>>>
>>>> GCC also did the optimization.
>>>>
>>>> Now we just *slightly flip* the test case :
>>>>
>>>>
>>>>
>>>> *1.c Test case:*#include<stdio.h>
>>>> int cal(int a, int b) {
>>>> *return ((b & ~a) | a);*
>>>> }
>>>>
>>>> int main(){
>>>> int a, b;
>>>> scanf("%d %d", &a, &b);
>>>> return cal(a,b);
>>>> }
>>>>
>>>>
>>>>
>>>> *X86 .s file with clang at O2 for above program :*
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2
1.c
>>>>
>>>> main:                                   # @main
>>>> # BB#0:
>>>>     subl    $28, %esp
>>>>     leal    20(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.L.str, (%esp)
>>>>     calll    __isoc99_scanf
>>>>     movl    24(%esp), %ecx
>>>>     movl    %ecx, %eax
>>>>
>>>>
>>>> *notl    %eax    andl    20(%esp), %eax    orl    %ecx, %eax*
>>>>     addl    $28, %esp
>>>>     retl
>>>>
>>>> We lost the Optimization opportunity here !!
>>>>
>>>> *GCC output for the same:*
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>>
>>>> main:
>>>> .LFB23:
>>>>     .cfi_startproc
>>>>     pushl    %ebp
>>>>     .cfi_def_cfa_offset 8
>>>>     .cfi_offset 5, -8
>>>>     movl    %esp, %ebp
>>>>     .cfi_def_cfa_register 5
>>>>     andl    $-16, %esp
>>>>     subl    $32, %esp
>>>>     leal    28(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.LC0, (%esp)
>>>>     call    __isoc99_scanf
>>>>     movl    24(%esp), %eax
>>>> *    orl    28(%esp), %eax*
>>>>     leave
>>>>     .cfi_restore 5
>>>>     .cfi_def_cfa 4, 4
>>>>     ret
>>>>     .cfi_endproc
>>>>
>>>>  GCC still optimized it !!
>>>>
>>>> Clearly evident that llvm is losing because of naive approach
of
>>>> pattern matching.
>>>>
>>>> How can we improve this so that LLVM also recognizes
>>>> commutative/associative
>>>> pattern  efficiently with single 'matchers' call? I am
not having any
>>>> idea so far.
>>>> I will think more on it. Any help would be really helpful.
>>>>
>>>> Any suggestions/comments/rectification are most awaited :)
>>>>
>>>
>>>
>>> I'm actually sort of wondering if instead of hardcoding all
these
>>> bitwise simplifications, we can just use a general boolean
minimization
>>> algorithm.
>>>
>>> Is there a reason we don't? Maybe it's just compile time?
But eventually
>>> (maybe already?) adding enough special case combines will be slower
than
>>> using the general optimization.
>>>
>>
>> Can you please elaborate more on "general boolean minimization
algorithm"
>> ? Is it a specific algorithm? I couldn't find any proper
>> reference to it. If you can provide a reference for it, i will try to
see
>> if i can implement it.
>>
>
>> Can K-Map be an option to generalize the logic?
>>  (I could think of K-Map for now for simplification from my college
days
>> :))
>>
>
> Yeah, K-Maps are one way to do it, however K-Maps are really most useful
> for when your target operations are all NAND/NOR, which isn't the case
here
> (and you can also do Reed-Muller for XOR/XNOR); instcombine has a very
> different set of "primitive operations" and their relative costs
(i.e. XOR,
> AND, OR, and NOT are usually the base operations and cost the same amount
> of resources to execute).
>
> Since optimization of binary expressions is of such enormous importance to
> the electronics industry, this topic is extremely well researched, so there
> should be a lot to draw from. There may be stuff targeted specifically at
> software compilers, but I haven't looked in depth (maybe try Muchnick
> section 12.3 for references? I don't have the book handy).
>
> For starters, you may want to check out Knuth volume 4A, section 7.1.
I'm
> working from memory here, but IIRC there is a whole part about
> minimization/optimization of binary expressions. Knuth as usual provides
> tons of references to the literature in case the text there is not enough.
> You can see preprint copies of volume 4A here:
> http://www.cs.utsa.edu/~wagner/knuth/ (although Knuth is the sort of book
> that any programmer should have on their bookshelf; in fact, you can
> probably justify it as a work expense).
>
>
>
>>
>> Also few more things i will like to understand.
>> From above example i have shown,
>>
>> when we have test case as:
>>
>> #include<stdio.h>
>> int cal(int a, int b) {
>> *return ((~a & b) | a);*
>> }
>>
>> int main(){
>> int a, b;
>> scanf("%d %d", &a, &b);
>> return cal(a,b);
>> }
>>
>> *X86 .s file with clang at O2 for above program :*
>>
>>  main:                                   # @main
>> # BB#0:
>>     subl    $28, %esp
>>     leal    20(%esp), %eax
>>     movl    %eax, 8(%esp)
>>     leal    24(%esp), %eax
>>     movl    %eax, 4(%esp)
>>     movl    $.L.str, (%esp)
>>     calll    __isoc99_scanf
>>     movl    20(%esp), %eax
>> *    orl    24(%esp), %eax*
>>     addl    $28, %esp
>>     retl
>>
>> As seen, we optimize the pattern.
>>
>> but when we have test case as :
>>
>> #include<stdio.h>
>> int cal(int a, int b) {
>> *return ((b & ~a) | a);*
>> }
>>
>> int main(){
>> int a, b;
>> scanf("%d %d", &a, &b);
>> return cal(a,b);
>> }
>>
>>
>>
>> *X86 .s file with clang at O2 for above program :*
>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>
>> main:                                   # @main
>> # BB#0:
>>     subl    $28, %esp
>>     leal    20(%esp), %eax
>>     movl    %eax, 8(%esp)
>>     leal    24(%esp), %eax
>>     movl    %eax, 4(%esp)
>>     movl    $.L.str, (%esp)
>>     calll    __isoc99_scanf
>>     movl    24(%esp), %ecx
>>     movl    %ecx, %eax
>>
>>
>> *notl    %eax    andl    20(%esp), %eax    orl    %ecx, %eax*
>>     addl    $28, %esp
>>     retl
>>
>>  we fail  to optimize.
>>
>> Does, 'reassociate' pass always run before
'instcombine' ?
>> Theoretically, it should in my opinion. But from the output in first
>> case, it seems it doesn't.
>> If 'reassociate' would have run before 'instcombine',
it would have
>> converted (~a&b) to (b&~a)
>> and the pattern would have not been optimized.
>>
>> Is there any logic which determines when to run reassociate and when
not
>> to run?
>>
>
> I don't know off the top of my head. Try running opt with
> -debug-pass=Arguments to see what the order is (or I think you should be
> able to pass -debug-pass=Arguments to clang's -mllvm option).
>
> -- Sean Silva
>
>
>>
>> Duncan, David, Sean - any comment on this are most welcomed :)
>>
>> Regards,
>> Suyog
>>
>>
>

-- 
With regards,
Suyog Sarda
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140813/7e1a8734/attachment.html>

Daniel Berlin

2014-Aug-13 22:39 UTC

head link

[LLVMdev] Efficient Pattern matching in Instruction Combine

Even if you can't implement such an algorithm sanely, ISTM that
auto-generating this code from a table (or whatever), and choosing
canonical results (to avoid a fixpoint issue), rather than what seems
to be hand-additions of every possible set of minimizations on three
variables, is still a better solution, no?

At least then you wouldn't have human errors, and a growing file that
makes any of the non-trivial transforms easy to miss in the noise.

On Tue, Aug 12, 2014 at 2:00 PM, Sean Silva <chisophugis at gmail.com>
wrote:> Re-adding the mailing list (remember to hit "reply all")
>
>
> On Tue, Aug 12, 2014 at 9:36 AM, suyog sarda <sardask01 at gmail.com>
wrote:
>>
>> Thanks Sean for the reply.
>>
>>
>> On Mon, Aug 11, 2014 at 11:49 PM, Sean Silva <chisophugis at
gmail.com>
>> wrote:
>>>
>>>
>>>
>>>
>>> On Fri, Aug 8, 2014 at 1:34 PM, suyog sarda <sardask01 at
gmail.com> wrote:
>>>>
>>>> Hi Duncan, David, Sean.
>>>>
>>>> Thanks for your reply.
>>>>
>>>>
>>>> > It'd be interesting if you could find a design that
also treated these
>>>> > the same:
>>>> >
>>>> >   (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
>>>>
>>>> >   (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>> >   (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
>>>> >
>>>> > I.e., `^` is also associative.
>>>>
>>>> Agree with Duncan on including associative operation too.
>>>>
>>>> > Can we handle this by just having a canonical ordering? Or
is that too
>>>> > difficult to maintain through various instcombines?
>>>>
>>>> Yes, its the easiest way to do that. If i am not wrong, what
Sean is
>>>> suggesting is that if we get
>>>>
>>>> something like
>>>>
>>>> (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>>
>>>> and we have written pass for pattern
>>>>
>>>> (A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>>
>>>> we could just swap 'B' and 'A' before matching
the pattern i.e. we do
>>>> canonical re-ordering
>>>> (Sean, Correct me if i am wrong on my understanding of what you
meant by
>>>> canonical ordering).
>>>
>>>
>>> You understood correctly.
>>>
>>>>
>>>>
>>>> I have seen in the "instcombine" module that many
times std::swap was
>>>> used for the same.
>>>>
>>>> But it becomes too much repetitive and somewhat irritating to
keep doing
>>>> this for every pattern.
>>>> Shouldn't we do this in more generic way where we do not
have to worry
>>>> about swapping operands?
>>>> (How to do it - i don't know for now. I looked at the
matchers
>>>> implementation for modification/improvement,
>>>> but didn't find any clue on it).
>>>>
>>>>
>>>> > It seems to me that we could rejigger Reassociate to
reduce the number
>>>> > of possibilities that InstCombine could expect.
>>>>
>>>>
>>>> Agree with David that infact it should be reassociate pass
which should
>>>> handle this.
>>>>
>>>> But following is something interesting and rather worrying
things i have
>>>> found :
>>>> (I have omitted unimportant code and highlighted important one
in
>>>> following example)
>>>> e.x.
>>>>
>>>> ((~A & B) | A)  -->  (A | B) ; Code is implemented for
this already
>>>>
>>>> C code :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ cat 1.c
>>>> #include<stdio.h>
>>>> int cal(int a, int b) {
>>>> return ((~a & b) | a);
>>>> }
>>>>
>>>> int main(){
>>>> int a, b;
>>>> scanf("%d %d", &a, &b);
>>>> return cal(a,b);
>>>> }
>>>>
>>>> LLVM IR at O0 :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0
-emit-llvm
>>>> 1.c
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>   %1 = xor i32 %a, -1
>>>>   %2 = and i32 %1, %b
>>>>   %3 = or i32 %2, %a
>>>>   ret i32 %3
>>>> }
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @main() #0 {
>>>>   ..
>>>>   ..
>>>>   %4 = call i32 @cal(i32 %2, i32 %3)
>>>>   ret i32 %4
>>>> }
>>>>
>>>> LLVM IR at instcombine :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S
-instcombine 1.ll
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>   %1 = or i32 %a, %b
>>>>   ret i32 %1
>>>> }
>>>> ..
>>>> ..
>>>> ..
>>>>
>>>> which is OK as per expected transform.
>>>>
>>>> LLVM IR at reassociate and instcombine :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S
-reassociate
>>>> -instcombine 2.ll
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>   %1 = xor i32 %a, -1
>>>>   %2 = and i32 %b, %1
>>>>   %3 = or i32 %2, %a
>>>>   ret i32 %3
>>>> }
>>>>
>>>> ..
>>>> ..
>>>>
>>>> Reassociate converted (~a&b) to (b&~a) and instruction
combine failed to
>>>> optimize.
>>>>
>>>> As evident, sometimes, reassociate pass actually harms the
optimization
>>>> opportunity !!
>>>>
>>>> Some more interesting findings on how can this affect the final
machine
>>>> code:
>>>>
>>>> Test case :
>>>> 1.c :
>>>>
>>>> #include<stdio.h>
>>>> int cal(int a, int b) {
>>>> return ((~a & b) | a);
>>>> }
>>>>
>>>> int main(){
>>>> int a, b;
>>>> scanf("%d %d", &a, &b);
>>>> return cal(a,b);
>>>> }
>>>>
>>>> X86 .s file with clang at O2 for above program :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2
1.c
>>>>
>>>> main:                                   # @main
>>>> # BB#0:
>>>>     subl    $28, %esp
>>>>     leal    20(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.L.str, (%esp)
>>>>     calll    __isoc99_scanf
>>>>     movl    20(%esp), %eax
>>>>     orl    24(%esp), %eax
>>>>     addl    $28, %esp
>>>>     retl
>>>>
>>>> As seen, optimization happened at IR level itself reflected in
.s file.
>>>>
>>>> GCC output for the same:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>>
>>>> main:
>>>> .LFB23:
>>>>     .cfi_startproc
>>>>     pushl    %ebp
>>>>     .cfi_def_cfa_offset 8
>>>>     .cfi_offset 5, -8
>>>>     movl    %esp, %ebp
>>>>     .cfi_def_cfa_register 5
>>>>     andl    $-16, %esp
>>>>     subl    $32, %esp
>>>>     leal    28(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.LC0, (%esp)
>>>>     call    __isoc99_scanf
>>>>     movl    24(%esp), %eax
>>>>     orl    28(%esp), %eax
>>>>     leave
>>>>     .cfi_restore 5
>>>>     .cfi_def_cfa 4, 4
>>>>     ret
>>>>     .cfi_endproc
>>>>
>>>> GCC also did the optimization.
>>>>
>>>> Now we just slightly flip the test case :
>>>>
>>>> 1.c Test case:
>>>>
>>>> #include<stdio.h>
>>>> int cal(int a, int b) {
>>>> return ((b & ~a) | a);
>>>> }
>>>>
>>>> int main(){
>>>> int a, b;
>>>> scanf("%d %d", &a, &b);
>>>> return cal(a,b);
>>>> }
>>>>
>>>> X86 .s file with clang at O2 for above program :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2
1.c
>>>>
>>>> main:                                   # @main
>>>> # BB#0:
>>>>     subl    $28, %esp
>>>>     leal    20(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.L.str, (%esp)
>>>>     calll    __isoc99_scanf
>>>>     movl    24(%esp), %ecx
>>>>     movl    %ecx, %eax
>>>>     notl    %eax
>>>>     andl    20(%esp), %eax
>>>>     orl    %ecx, %eax
>>>>     addl    $28, %esp
>>>>     retl
>>>>
>>>> We lost the Optimization opportunity here !!
>>>>
>>>> GCC output for the same:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>>
>>>> main:
>>>> .LFB23:
>>>>     .cfi_startproc
>>>>     pushl    %ebp
>>>>     .cfi_def_cfa_offset 8
>>>>     .cfi_offset 5, -8
>>>>     movl    %esp, %ebp
>>>>     .cfi_def_cfa_register 5
>>>>     andl    $-16, %esp
>>>>     subl    $32, %esp
>>>>     leal    28(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.LC0, (%esp)
>>>>     call    __isoc99_scanf
>>>>     movl    24(%esp), %eax
>>>>     orl    28(%esp), %eax
>>>>     leave
>>>>     .cfi_restore 5
>>>>     .cfi_def_cfa 4, 4
>>>>     ret
>>>>     .cfi_endproc
>>>>
>>>> GCC still optimized it !!
>>>>
>>>> Clearly evident that llvm is losing because of naive approach
of pattern
>>>> matching.
>>>>
>>>> How can we improve this so that LLVM also recognizes
>>>> commutative/associative
>>>> pattern  efficiently with single 'matchers' call? I am
not having any
>>>> idea so far.
>>>> I will think more on it. Any help would be really helpful.
>>>>
>>>> Any suggestions/comments/rectification are most awaited :)
>>>
>>>
>>>
>>> I'm actually sort of wondering if instead of hardcoding all
these bitwise
>>> simplifications, we can just use a general boolean minimization
algorithm.
>>>
>>> Is there a reason we don't? Maybe it's just compile time?
But eventually
>>> (maybe already?) adding enough special case combines will be slower
than
>>> using the general optimization.
>>
>>
>> Can you please elaborate more on "general boolean minimization
algorithm"
>> ? Is it a specific algorithm? I couldn't find any proper
>> reference to it. If you can provide a reference for it, i will try to
see
>> if i can implement it.
>>
>>
>> Can K-Map be an option to generalize the logic?
>>  (I could think of K-Map for now for simplification from my college
days
>> :))
>
>
> Yeah, K-Maps are one way to do it, however K-Maps are really most useful
for
> when your target operations are all NAND/NOR, which isn't the case here
(and
> you can also do Reed-Muller for XOR/XNOR); instcombine has a very different
> set of "primitive operations" and their relative costs (i.e. XOR,
AND, OR,
> and NOT are usually the base operations and cost the same amount of
> resources to execute).
>
> Since optimization of binary expressions is of such enormous importance to
> the electronics industry, this topic is extremely well researched, so there
> should be a lot to draw from. There may be stuff targeted specifically at
> software compilers, but I haven't looked in depth (maybe try Muchnick
> section 12.3 for references? I don't have the book handy).
>
> For starters, you may want to check out Knuth volume 4A, section 7.1.
I'm
> working from memory here, but IIRC there is a whole part about
> minimization/optimization of binary expressions. Knuth as usual provides
> tons of references to the literature in case the text there is not enough.
> You can see preprint copies of volume 4A here:
> http://www.cs.utsa.edu/~wagner/knuth/ (although Knuth is the sort of book
> that any programmer should have on their bookshelf; in fact, you can
> probably justify it as a work expense).
>
>
>>
>>
>> Also few more things i will like to understand.
>> From above example i have shown,
>>
>> when we have test case as:
>>
>> #include<stdio.h>
>> int cal(int a, int b) {
>> return ((~a & b) | a);
>> }
>>
>> int main(){
>> int a, b;
>> scanf("%d %d", &a, &b);
>> return cal(a,b);
>> }
>>
>> X86 .s file with clang at O2 for above program :
>>
>> main:                                   # @main
>> # BB#0:
>>     subl    $28, %esp
>>     leal    20(%esp), %eax
>>     movl    %eax, 8(%esp)
>>     leal    24(%esp), %eax
>>     movl    %eax, 4(%esp)
>>     movl    $.L.str, (%esp)
>>     calll    __isoc99_scanf
>>     movl    20(%esp), %eax
>>     orl    24(%esp), %eax
>>     addl    $28, %esp
>>     retl
>>
>> As seen, we optimize the pattern.
>>
>> but when we have test case as :
>>
>> #include<stdio.h>
>> int cal(int a, int b) {
>> return ((b & ~a) | a);
>> }
>>
>> int main(){
>> int a, b;
>> scanf("%d %d", &a, &b);
>> return cal(a,b);
>> }
>>
>> X86 .s file with clang at O2 for above program :
>>
>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>
>> main:                                   # @main
>> # BB#0:
>>     subl    $28, %esp
>>     leal    20(%esp), %eax
>>     movl    %eax, 8(%esp)
>>     leal    24(%esp), %eax
>>     movl    %eax, 4(%esp)
>>     movl    $.L.str, (%esp)
>>     calll    __isoc99_scanf
>>     movl    24(%esp), %ecx
>>     movl    %ecx, %eax
>>     notl    %eax
>>     andl    20(%esp), %eax
>>     orl    %ecx, %eax
>>     addl    $28, %esp
>>     retl
>>
>>  we fail  to optimize.
>>
>> Does, 'reassociate' pass always run before
'instcombine' ?
>> Theoretically, it should in my opinion. But from the output in first
case,
>> it seems it doesn't.
>> If 'reassociate' would have run before 'instcombine',
it would have
>> converted (~a&b) to (b&~a)
>> and the pattern would have not been optimized.
>>
>> Is there any logic which determines when to run reassociate and when
not
>> to run?
>
>
> I don't know off the top of my head. Try running opt with
> -debug-pass=Arguments to see what the order is (or I think you should be
> able to pass -debug-pass=Arguments to clang's -mllvm option).
>
> -- Sean Silva
>
>>
>>
>> Duncan, David, Sean - any comment on this are most welcomed :)
>>
>> Regards,
>> Suyog
>>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

llvm dev - Aug 2014 - [LLVMdev] Efficient Pattern matching in Instruction Combine

[LLVMdev] Efficient Pattern matching in Instruction Combine

[LLVMdev] Efficient Pattern matching in Instruction Combine

[LLVMdev] Efficient Pattern matching in Instruction Combine

[LLVMdev] Efficient Pattern matching in Instruction Combine

[LLVMdev] Efficient Pattern matching in Instruction Combine