thr3ads.net - llvm dev - [llvm-dev] [RFC] New pass: LoopExitValues [Sep 2015]


Steve King via llvm-dev

2015-Sep-01 18:06 UTC

[llvm-dev] [RFC] New pass: LoopExitValues

On Mon, Aug 31, 2015 at 5:52 PM, Jake VanAdrighem
<jvanadrighem at gmail.com> wrote:
> Do you have some specific performance measurements?

Averaging 4 runs of 10000 iterations each of Coremark on my X86_64
desktop showed:

-O2 performance: +2.9% faster with the L.E.V. pass
-Os size: 1.5% smaller with the L.E.V. pass

In the case of Coremark, the benefit comes mainly from the matrix
portion of the benchmark, which uses nested loops.  Similarly, I used a
matrix multiplication for the regression test as shown below.  The
L.E.V. pass eliminated 4 instructions.

void matrix_mul(unsigned int Size, unsigned int *Dst, unsigned int *Src,
                unsigned int Val) {
  for (int Outer = 0; Outer < Size; ++Outer)
    for (int Inner = 0; Inner < Size; ++Inner)
      Dst[Outer * Size + Inner] = Src[Outer * Size + Inner] * Val;
}


With LoopExitValues
-------------------------------
matrix_mul:
    testl %edi, %edi
    je .LBB0_5
    xorl %r9d, %r9d
    xorl %r8d, %r8d
.LBB0_2:
    xorl %r11d, %r11d
.LBB0_3:
    movl %r9d, %r10d
    movl (%rdx,%r10,4), %eax
    imull %ecx, %eax
    movl %eax, (%rsi,%r10,4)
    incl %r11d
    incl %r9d
    cmpl %r11d, %edi
    jne .LBB0_3
    incl %r8d
    cmpl %edi, %r8d
    jne .LBB0_2
.LBB0_5:
    retq



Without LoopExitValues:
-----------------------------------
matrix_mul:
    pushq %rbx           # Eliminated by L.E.V. pass
.Ltmp0:
.Ltmp1:
    testl %edi, %edi
    je .LBB0_5
    xorl %r8d, %r8d
    xorl %r9d, %r9d
.LBB0_2:
    xorl %r10d, %r10d
    movl %r8d, %eax              # Eliminated by L.E.V. pass
.LBB0_3:
    movl %eax, %r11d
    movl (%rdx,%r11,4), %ebx
    imull %ecx, %ebx
    movl %ebx, (%rsi,%r11,4)
    incl %r10d
    incl %eax
    cmpl %r10d, %edi
    jne .LBB0_3
    incl %r9d
    addl %edi, %r8d            # Eliminated by L.E.V. pass
    cmpl %edi, %r9d
    jne .LBB0_2
.LBB0_5:
    popq %rbx                    # Eliminated by L.E.V. pass
    retq

James Molloy via llvm-dev

2015-Sep-02 12:36 UTC

Hi,

Coremark really isn't a good enough test - have you run the LLVM test suite
with this patch, and what were the performance differences?

I'm still a bit confused about what pattern exactly this pass is supposed
to trigger on. I understand the mechanics, but I still can't quite see what
patterns it would be useful on. You've mentioned matrix multiply - how does
this pass alter the IR? What value is it avoiding being recomputed? How
does this pass affect register pressure?

Also, your example just removes a mov and an add - the push/pops are just
register allocation (unless your pass in fact *reduces* register pressure?)

A bit more clarification would be great.

Cheers,

James


Steve King via llvm-dev

2015-Sep-03 03:27 UTC

On Wed, Sep 2, 2015 at 5:36 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
> Hi,
>
> Coremark really isn't a good enough test - have you run the LLVM test suite
> with this patch, and what were the performance differences?

For the test suite's single-source benchmarks, 235 tests improved
performance, 2 regressed, and 705 were unchanged.  That seems very
optimistic: comparing consecutive runs with identical settings shows
there is a lot of noise in the performance data.  Tips for stable
results would be appreciated.
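
One simple way to judge whether a measured difference stands out from run-to-run noise is to compare the gap between mean timings against the spread of repeated runs. The sketch below is a hypothetical illustration only, not part of the patch or of the LLVM test suite; the function names and the max-minus-min noise estimate are assumptions chosen for brevity.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration; not part of LLVM or its test suite. */

/* Arithmetic mean of n timing samples (n must be nonzero). */
double lev_mean(const double *t, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += t[i];
    return sum / (double)n;
}

/* Crude noise estimate: max minus min over n samples (n must be nonzero). */
double lev_spread(const double *t, size_t n) {
    double lo = t[0], hi = t[0];
    for (size_t i = 1; i < n; ++i) {
        if (t[i] < lo) lo = t[i];
        if (t[i] > hi) hi = t[i];
    }
    return hi - lo;
}

/* Returns 1 if the gap between the two means is larger than the larger
   of the two spreads, 0 otherwise. */
int lev_exceeds_noise(const double *base, size_t nb,
                      const double *patched, size_t np) {
    double gap = lev_mean(base, nb) - lev_mean(patched, np);
    if (gap < 0.0)
        gap = -gap;
    double sb = lev_spread(base, nb);
    double sp = lev_spread(patched, np);
    double noise = sb > sp ? sb : sp;
    return gap > noise;
}
```

Each array would hold the timings of repeated runs of one benchmark under one configuration; a result of 0 suggests the difference cannot be separated from noise by this crude measure.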

> I'm still a bit confused about what pattern exactly this pass is supposed to
> trigger on. I understand the mechanics, but I still can't quite see what
> patterns it would be useful on. You've mentioned matrix multiply - how does
> this pass alter the IR?

Here's before and after IR for the matrix_mul example.  Notice the two
bitcasts %1 and %2 generated in the for.cond.cleanup.3 block.  The L.E.V.
pass replaces these with scevgep values that already exist.

*** Code after LSR ***

; Function Attrs: nounwind optsize
define void @matrix_mul(i32 %Size, i32* nocapture %Dst, i32* nocapture readonly %Src, i32 %Val) #0 {
entry:
  %cmp.25 = icmp eq i32 %Size, 0
  br i1 %cmp.25, label %for.cond.cleanup, label %for.body.4.lr.ph.preheader

for.body.4.lr.ph.preheader:                       ; preds = %entry
  %0 = shl i32 %Size, 2
  br label %for.body.4.lr.ph

for.body.4.lr.ph:                                 ; preds = %for.body.4.lr.ph.preheader, %for.cond.cleanup.3
  %lsr.iv5 = phi i32* [ %Src, %for.body.4.lr.ph.preheader ], [ %2, %for.cond.cleanup.3 ]
  %lsr.iv1 = phi i32* [ %Dst, %for.body.4.lr.ph.preheader ], [ %1, %for.cond.cleanup.3 ]
  %Outer.026 = phi i32 [ %inc10, %for.cond.cleanup.3 ], [ 0, %for.body.4.lr.ph.preheader ]
  %lsr.iv56 = bitcast i32* %lsr.iv5 to i1*
  %lsr.iv12 = bitcast i32* %lsr.iv1 to i1*
  br label %for.body.4

for.cond.cleanup.loopexit:                        ; preds = %for.cond.cleanup.3
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  ret void

for.cond.cleanup.3:                               ; preds = %for.body.4
  %inc10 = add nuw nsw i32 %Outer.026, 1
  %scevgep = getelementptr i1, i1* %lsr.iv12, i32 %0
  %1 = bitcast i1* %scevgep to i32*
  %scevgep7 = getelementptr i1, i1* %lsr.iv56, i32 %0
  %2 = bitcast i1* %scevgep7 to i32*
  %exitcond27 = icmp eq i32 %inc10, %Size
  br i1 %exitcond27, label %for.cond.cleanup.loopexit, label %for.body.4.lr.ph

for.body.4:                                       ; preds = %for.body.4, %for.body.4.lr.ph
  %lsr.iv8 = phi i32* [ %scevgep9, %for.body.4 ], [ %lsr.iv5, %for.body.4.lr.ph ]
  %lsr.iv3 = phi i32* [ %scevgep4, %for.body.4 ], [ %lsr.iv1, %for.body.4.lr.ph ]
  %lsr.iv = phi i32 [ %lsr.iv.next, %for.body.4 ], [ %Size, %for.body.4.lr.ph ]
  %3 = load i32, i32* %lsr.iv8, align 4, !tbaa !1
  %mul5 = mul i32 %3, %Val
  store i32 %mul5, i32* %lsr.iv3, align 4, !tbaa !1
  %lsr.iv.next = add i32 %lsr.iv, -1
  %scevgep4 = getelementptr i32, i32* %lsr.iv3, i32 1
  %scevgep9 = getelementptr i32, i32* %lsr.iv8, i32 1
  %exitcond = icmp eq i32 %lsr.iv.next, 0
  br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4
}


*** Code after Loop Exit Values Optimization ***

; Function Attrs: nounwind optsize
define void @matrix_mul(i32 %Size, i32* nocapture %Dst, i32* nocapture readonly %Src, i32 %Val) #0 {
entry:
  %cmp.25 = icmp eq i32 %Size, 0
  br i1 %cmp.25, label %for.cond.cleanup, label %for.body.4.lr.ph.preheader

for.body.4.lr.ph.preheader:                       ; preds = %entry
  br label %for.body.4.lr.ph

for.body.4.lr.ph:                                 ; preds = %for.body.4.lr.ph.preheader, %for.cond.cleanup.3
  %lsr.iv5 = phi i32* [ %Src, %for.body.4.lr.ph.preheader ], [ %scevgep9, %for.cond.cleanup.3 ]
  %lsr.iv1 = phi i32* [ %Dst, %for.body.4.lr.ph.preheader ], [ %scevgep4, %for.cond.cleanup.3 ]
  %Outer.026 = phi i32 [ %inc10, %for.cond.cleanup.3 ], [ 0, %for.body.4.lr.ph.preheader ]
  br label %for.body.4

for.cond.cleanup.loopexit:                        ; preds = %for.cond.cleanup.3
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  ret void

for.cond.cleanup.3:                               ; preds = %for.body.4
  %inc10 = add nuw nsw i32 %Outer.026, 1
  %exitcond27 = icmp eq i32 %inc10, %Size
  br i1 %exitcond27, label %for.cond.cleanup.loopexit, label %for.body.4.lr.ph

for.body.4:                                       ; preds = %for.body.4, %for.body.4.lr.ph
  %lsr.iv8 = phi i32* [ %scevgep9, %for.body.4 ], [ %lsr.iv5, %for.body.4.lr.ph ]
  %lsr.iv3 = phi i32* [ %scevgep4, %for.body.4 ], [ %lsr.iv1, %for.body.4.lr.ph ]
  %lsr.iv = phi i32 [ %lsr.iv.next, %for.body.4 ], [ %Size, %for.body.4.lr.ph ]
  %0 = load i32, i32* %lsr.iv8, align 4, !tbaa !1
  %mul5 = mul i32 %0, %Val
  store i32 %mul5, i32* %lsr.iv3, align 4, !tbaa !1
  %lsr.iv.next = add i32 %lsr.iv, -1
  %scevgep4 = getelementptr i32, i32* %lsr.iv3, i32 1
  %scevgep9 = getelementptr i32, i32* %lsr.iv8, i32 1
  %exitcond = icmp eq i32 %lsr.iv.next, 0
  br i1 %exitcond, label %for.cond.cleanup.3, label %for.body.4
}

> What value is it avoiding being recomputed?

I'm not precisely sure, but it's residue from LSR.  The pass checks
all computable SCEV values when a loop exits and in this case found
GEPs with the same value.
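
At the source level, the change visible in the IR above corresponds roughly to the following. This is an illustrative sketch only: the function names are made up, and the pass itself works on SCEV expressions in the IR, not on C. The first variant recomputes each next-row pointer as row start plus Size, like %scevgep and %scevgep7 after LSR; the second reuses the pointers the inner loop has already advanced, like %scevgep4 and %scevgep9.

```c
#include <stddef.h>

/* Hypothetical model of the IR after LSR: the outer loop keeps its own
   row pointers and recomputes them as "row start + Size" on every
   iteration. */
void matrix_mul_recompute(unsigned Size, unsigned *Dst, const unsigned *Src,
                          unsigned Val) {
    unsigned *DstRow = Dst;
    const unsigned *SrcRow = Src;
    for (unsigned Outer = 0; Outer < Size; ++Outer) {
        unsigned *D = DstRow;
        const unsigned *S = SrcRow;
        for (unsigned Left = Size; Left != 0; --Left)
            *D++ = *S++ * Val;
        DstRow += Size; /* same address D already holds at inner-loop exit */
        SrcRow += Size; /* same address S already holds at inner-loop exit */
    }
}

/* Hypothetical model of the IR after the L.E.V. pass: the inner loop's
   exit values are reused as the next row's start, so the extra add and
   the extra live values go away. */
void matrix_mul_reuse(unsigned Size, unsigned *Dst, const unsigned *Src,
                      unsigned Val) {
    unsigned *D = Dst;
    const unsigned *S = Src;
    for (unsigned Outer = 0; Outer < Size; ++Outer)
        for (unsigned Left = Size; Left != 0; --Left)
            *D++ = *S++ * Val;
}
```

Both variants write the same results as the indexed matrix_mul; the second simply carries fewer values across the outer loop.
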
> How does this pass affect register pressure?
> Also, your example just removes a mov and an add - the push/pops are just
> register allocation (unless your pass in fact *reduces* register pressure?)
Right, the computation eliminated is the mov and add.  Register
savings is a byproduct.

Regards,
-steve
