thr3ads.net - llvm dev - [LLVMdev] XOR Optimization [Jul 2011]


Daniel Nicácio

2011-Jul-27 00:21 UTC

[LLVMdev] XOR Optimization

After a few more tests, I found out that if we set -unroll-threshold to a
value large enough and run "opt -std-compile-opts" or "opt -O3" three times,
the unroller is able to unroll the original loop 32 times. Once the loop is
unrolled at least 32 times, an optimization is triggered that folds it into a
single "%xor.3.3.1 = xor i32 %tmp6, -1" (I don't know why it does not
transform it into a NOT, though).
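
To illustrate why the fully unrolled loop can collapse to a single XOR with -1, here is a minimal Python sketch. It assumes the loop flips one bit of a 32-bit word per iteration, which is what the quoted IR suggests; the function names are made up for illustration and are not part of LLVM.

```python
MASK32 = 0xFFFFFFFF  # model an i32 as an unsigned 32-bit value


def flip_all_bits_one_by_one(word):
    """Unfolded form: XOR the word with 1 << k for every bit k in 0..31."""
    for k in range(32):
        word ^= 1 << k
    return word & MASK32


def xor_minus_one(word):
    """Folded form: a single XOR with -1 (all ones in 32 bits)."""
    return (word ^ -1) & MASK32


for sample in (0, 1, 0x12345678, 0xFFFFFFFF):
    # Both forms agree, and both equal the bitwise complement of the word.
    assert flip_all_bits_one_by_one(sample) == xor_minus_one(sample)
    assert xor_minus_one(sample) == (~sample) & MASK32
```

The 32 single-bit masks are disjoint and together cover the whole word, so their XOR is the all-ones constant, which is -1 when read as a signed i32.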

Therefore, the optimization that I want is already there somewhere, but is
not triggered when llvm unrolls the loop less than 32 times.

My goal now is to make it work for smaller unrolled loops, like for 4, 8,
and 16.

Does anyone know which function provides the information that makes the
optimization possible for 32 unrolled steps but not for smaller numbers?

I would also like to know why the "XOR  A,  -1" is not turned into a NOT; any
hints on that are welcome.

Thanks,

Daniel Nicacio


2011/7/26 Daniel Nicácio <dnicacios at gmail.com>
> Hi Duncan,
>
> when I run "opt -std-compile-opts" on the original source code, it has the
> same output as -O3.
> When I run "opt -std-compile-opts" on the -O3 optimized code, things get
> even weirder; it outputs the following code:
>
> while.body:                                       ; preds = %while.body, %entry
>   %indvar = phi i32 [ 0, %entry ], [ %indvar.next.3, %while.body ]
>   %tmp = shl i32 %indvar, 2
>   %0 = lshr i32 %indvar, 3
>   %shr = and i32 %0, 134217727
>   %rem = and i32 %tmp, 16
>   %shl = shl i32 1, %rem
>   %arrayidx = getelementptr inbounds i32* %bitmap, i32 %shr
>   %tmp6 = load i32* %arrayidx, align 4
>   %rem.1 = or i32 %rem, 1
>   %shl.1 = shl i32 1, %rem.1
>   %rem.2 = or i32 %rem, 2
>   %shl.2 = shl i32 1, %rem.2
>   %rem.3 = or i32 %rem, 3
>   %shl.3 = shl i32 1, %rem.3
>   %xor = xor i32 %shl, %tmp6
>   %xor.1 = xor i32 %xor, %shl.3
>   %xor.2 = xor i32 %xor.1, %shl.2
>   %xor.3 = xor i32 %xor.2, %shl.1
>   %rem.11 = or i32 %rem, 4
>   %shl.12 = shl i32 1, %rem.11
>   %rem.1.1 = or i32 %rem, 5
>   %shl.1.1 = shl i32 1, %rem.1.1
>   %rem.2.1 = or i32 %rem, 6
>   %shl.2.1 = shl i32 1, %rem.2.1
>   %rem.3.1 = or i32 %rem, 7
>   %shl.3.1 = shl i32 1, %rem.3.1
>   %xor.13 = xor i32 %shl.12, %xor.3
>   %xor.1.1 = xor i32 %xor.13, %shl.3.1
>   %xor.2.1 = xor i32 %xor.1.1, %shl.2.1
>   %xor.3.1 = xor i32 %xor.2.1, %shl.1.1
>   %rem.24 = or i32 %rem, 8
>   %shl.25 = shl i32 1, %rem.24
>   %rem.1.2 = or i32 %rem, 9
>   %shl.1.2 = shl i32 1, %rem.1.2
>   %rem.2.2 = or i32 %rem, 10
>   %shl.2.2 = shl i32 1, %rem.2.2
>   %rem.3.2 = or i32 %rem, 11
>   %shl.3.2 = shl i32 1, %rem.3.2
>   %xor.26 = xor i32 %shl.25, %xor.3.1
>   %xor.1.2 = xor i32 %xor.26, %shl.3.2
>   %xor.2.2 = xor i32 %xor.1.2, %shl.2.2
>   %xor.3.2 = xor i32 %xor.2.2, %shl.1.2
>   %rem.37 = or i32 %rem, 12
>   %shl.38 = shl i32 1, %rem.37
>   %rem.1.3 = or i32 %rem, 13
>   %shl.1.3 = shl i32 1, %rem.1.3
>   %rem.2.3 = or i32 %rem, 14
>   %shl.2.3 = shl i32 1, %rem.2.3
>   %rem.3.3 = or i32 %rem, 15
>   %shl.3.3 = shl i32 1, %rem.3.3
>   %xor.39 = xor i32 %shl.38, %xor.3.2
>   %xor.1.3 = xor i32 %xor.39, %shl.3.3
>   %xor.2.3 = xor i32 %xor.1.3, %shl.2.3
>   %xor.3.3 = xor i32 %xor.2.3, %shl.1.3
>   store i32 %xor.3.3, i32* %arrayidx, align 4
>   %indvar.next.3 = add i32 %indvar, 4
>   %exitcond.3 = icmp eq i32 %indvar.next.3, 32
>   br i1 %exitcond.3, label %while.end, label %while.body
>
> while.end:                                        ; preds = %while.body
>   ret void
>
>
> Thanks for the reply,
>
> Daniel
>
>
> On Tue, Jul 26, 2011 at 12:50 PM, Duncan Sands <baldrick at free.fr> wrote:
>
>> Hi Daniel,
>>
>> > Precisely. The code generated by unrolling can be folded into a single
>> > XOR and SHL. And even if it was not inside a loop, it can still be
>> > optimized. What I want to know is: is there any optimization supposed to
>> > optimize this code, but for some reason it thinks it is not possible, or
>> > is there no optimization for that situation at all?
>>
>> it could be a phase ordering problem.  If you run "opt -std-compile-opts"
>> on the unsatisfactory bitcode, does it clean up all the bit fiddling?
>>
>> Ciao, Duncan.
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>
>

me22

2011-Jul-27 00:47 UTC

[LLVMdev] XOR Optimization

2011/7/26 Daniel Nicácio <dnicacios at gmail.com>:
> I also would like to see why the "XOR  A,  -1" is not turned into a NOT, any
>
Probably because NOT (like NEG) doesn't exist :)

<http://llvm.org/docs/LangRef.html#instref>

I assume the decision was made that it wasn't worth adding the extra
unary instructions when they can easily be handled in codegen by
matching "XOR X, -1" or "SUB 0, X".
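
As a quick sanity check of the two patterns mentioned above, the following Python sketch models 32-bit two's-complement arithmetic and confirms that "XOR X, -1" behaves as NOT and "SUB 0, X" behaves as NEG. The helper names are illustrative only.

```python
MASK32 = 0xFFFFFFFF  # model an i32 as an unsigned 32-bit value


def xor_all_ones(x):
    """XOR X, -1 on a 32-bit value (-1 is all ones)."""
    return (x ^ MASK32) & MASK32


def sub_from_zero(x):
    """SUB 0, X on a 32-bit value, wrapping modulo 2**32."""
    return (0 - x) & MASK32


for x in (0, 1, 0x7FFFFFFF, 0x80000000, 0xDEADBEEF):
    # XOR with all ones is the bitwise complement.
    assert xor_all_ones(x) == (~x) & MASK32
    # Two's-complement negation is the complement plus one.
    assert sub_from_zero(x) == (xor_all_ones(x) + 1) & MASK32
```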

~ Scott

Daniel Nicácio

2011-Jul-28 20:58 UTC

[LLVMdev] XOR Optimization

Hey guys,

I still think there is no optimization doing what I want. When the loop is
unrolled 32 times, llvm is able to identify that the loop is working on a
whole word; it finds some constants and propagates them, resulting in the
folded XOR instruction. However, when the loop operates on only some bits of
the word, llvm is still not able to fold those XORs, even when the bits
operated on do not overlap each other.
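
The partial-word case can be modelled with a short Python sketch. It mirrors the shape of the IR quoted earlier in the thread, where each shift amount is computed as "or i32 %rem, k"; the function names are hypothetical, and the equivalence is only claimed under the stated assumptions (count is a power of two, rem is a multiple of count, and rem + count <= 32).

```python
MASK32 = 0xFFFFFFFF  # model an i32 as an unsigned 32-bit value


def xor_chain(word, rem, count):
    """Unfolded form: one XOR per bit, each shifting 1 by (rem | k)."""
    for k in range(count):
        word ^= (1 << (rem | k)) & MASK32
    return word & MASK32


def folded(word, rem, count):
    """Folded form: one shift of a constant mask followed by one XOR."""
    mask = (1 << count) - 1  # e.g. 0xF for count == 4
    return (word ^ (mask << rem)) & MASK32


word = 0x12345678
for count in (4, 8, 16):
    # rem is a multiple of count, so rem | k == rem + k for k < count.
    for rem in range(0, 32, count):
        assert xor_chain(word, rem, count) == folded(word, rem, count)
```

Because the low bits of rem are zero, the single-bit masks never overlap, so XORing them one at a time is the same as XORing once with their union.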

Therefore, I am implementing the following optimization for folding XOR
instructions working on bits (it still must be extended to OR and AND
instructions). Any comments and criticism are appreciated.

Basically, I try to identify a chain of XOR instructions and fold it. The
image below illustrates this:

[image: XORChain.png]
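
Since the image is not available in the archive, here is a toy Python sketch of the general idea of walking a chain of XORs and folding its constant operands. It is not the attached patch and does not use any LLVM API; the tuple-based expression format and the function name are assumptions made purely for illustration.

```python
def fold_xor_chain(expr):
    """Walk a nested XOR expression and fold its constants.

    expr is either a leaf (a str variable name or an int constant) or a
    tuple ("xor", lhs, rhs). Returns (sorted variable leaves, folded constant).
    """
    variables = []
    constant = 0
    stack = [expr]
    while stack:
        node = stack.pop()
        if isinstance(node, tuple) and node[0] == "xor":
            # Descend into both operands of the XOR.
            stack.append(node[1])
            stack.append(node[2])
        elif isinstance(node, int):
            # XOR is associative and commutative, so constants combine freely.
            constant ^= node
        else:
            variables.append(node)
    return sorted(variables), constant


# ((tmp6 ^ 1) ^ 2) ^ 4 folds to tmp6 ^ 7
chain = ("xor", ("xor", ("xor", "tmp6", 1), 2), 4)
assert fold_xor_chain(chain) == (["tmp6"], 7)
```

A real implementation would also have to reason about non-constant shift amounts, as in the IR above, which this sketch deliberately leaves out.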

In order to do that, I am adding an additional function call to "visitXor()"
in Instruction Combining.

My optimization function is attached as a patch file (the diff was made
against the 2.9 release of llvm).

Thanks

Daniel Nicacio
-------------- next part --------------
A non-text attachment was scrubbed...
Name: XORChain.png
Type: image/png
Size: 59953 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20110728/a5bc09cf/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: InstCombineAndOrXor.diff
Type: application/octet-stream
Size: 635 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20110728/a5bc09cf/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: InstructionCombining.diff
Type: application/octet-stream
Size: 7068 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20110728/a5bc09cf/attachment-0001.obj>
