Hi,
On Tue, Jul 26, 2011 at 11:32 AM, Matt Johnson
<johnso87 at crhc.illinois.edu>wrote:
> Hi Daniel,
>
> > Hi folks,
> >
> > I couldn't find a specific XOR (OR and AND) optimization on llvm,
and
> > therefore I am about to implement it.
> > But first I would like to check with you guys that it really does not
> exist.
> >
> > For a simple loop like this:
> >
> > nbits = 128;
> > bit_addr = 0;
> > while(nbits--)
> > {
> > bindex=bit_addr>>5; /* Index is number /32 */
> > bitnumb=bit_addr % 32; /* Bit number in longword */
> > bitmap[bindex]^=(1L<<bitnumb);
> > bit_addr++;
> > }
> >
> >
> > The -O3 set of optimizations generates a code like this:
> >
> >
> > entry:
> > br label %while.body
> >
> > while.body: ; preds =
%while.body,
> > %entry
> > %0 = phi i32 [ 0, %entry ], [ %inc.3, %while.body ]
> > %shr = lshr i32 %0, 5
> > %rem = and i32 %0, 28
> > %shl = shl i32 1, %rem
> > %arrayidx = getelementptr inbounds i32* %bitmap, i32 %shr
> > %tmp6 = load i32* %arrayidx, align 4
> > %xor = xor i32 %tmp6, %shl
> > %rem.1 = or i32 %rem, 1
> > %shl.1 = shl i32 1, %rem.1
> > %xor.1 = xor i32 %xor, %shl.1
> > %rem.2 = or i32 %rem, 2
> > %shl.2 = shl i32 1, %rem.2
> > %xor.2 = xor i32 %xor.1, %shl.2
> > %rem.3 = or i32 %rem, 3
> > %shl.3 = shl i32 1, %rem.3
> > %xor.3 = xor i32 %xor.2, %shl.3
> > store i32 %xor.3, i32* %arrayidx, align 4
> > %inc.3 = add i32 %0, 4
> > %exitcond.3 = icmp eq i32 %inc.3, 128
> > br i1 %exitcond.3, label %while.end, label %while.body
> >
> > while.end: ; preds =
%while.body
> > ret void
> >
> >
> >
> > It is clear that we are able to fold all XORs into a single XOR, and
the
> > same happens to all SHLs and ORs.
> > I am using -O3, but the code is not optimized, so I am assuming there
is
> no
> > optimization for this case. Am I correct?
>
> The loop is being unrolled by a factor of 4. This breaks the artificial
> dependence between loop iterations, and should yield a substantial
> improvement
> on machines with enough registers and functional units to take advantage of
> it.
> The fact that the loop is unrolled explains why the XORs, SHLs, and ORs are
> not
> folded into 1.
>
I think he is trying to say this expression generated by unrolling by a
factor of 4 can indeed be folded into a single XOR, SHL and OR.
>
> > If yes, I have a few other questions:
> >
> > - Do you know of any other similar optimization that could do
something
> > here but is not being triggered for some reason??
>
> Try compiling with -Os instead of -O3. I'm guessing this will turn off
> loop
> unrolling.
>
> > - Do you know why a OR instruction is used for increments? instead
of
> using
> > a INC or ADD?
>
> OR is a simpler operation than ADD at the hardware level, and
> usually faster and more energy-efficient, so it should be preferred when
> the compiler
> can statically infer that its usage is correct.
>
I haven't seen a machine in which OR is faster than ADD nor more
energy-efficient. They're all done by the same ALU circuitry which delays
the pipeline by its worstcase path timing. So, for modern processor hardware
purposes, OR is exactly equal ADD. Transforming ADD to OR isn't strenght
reduction at all. Maybe this is benefical only if you have a backend
generating circuitry (programming FPGAs).
>
> > - Is there a straight forward way to know if an instruction belongs
to
> a
> > loop? (just curiosity)
>
> I'll defer to others on this one.
>
> >
> > Thanks very much
> >
> > Daniel Nicacio
> Best,
> Matt
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20110726/e2cd2b34/attachment.html>