thr3ads.net - llvm dev - [llvm-dev] SCEV and LoopStrengthReduction Formulae [Apr 2018]

If this information is useful, please help other people find it:
Share via:

via llvm-dev

2018-Apr-03 22:00 UTC

[llvm-dev] SCEV and LoopStrengthReduction Formulae

I am attempting to implement a minor loop strength reduction optimization for
targets that support compare and jump fusion, specifically
TTI::canMacroFuseCmp().  My approach might be wrong; however, I am soliciting
the idea for feedback, so that I can implement this correctly.  My plan is to
add a Supplemental LSR formula to LoopStrengthReduce.cpp that optimizes  the
following case, but perhaps this should stand alone as  its own pass:

// Example which can be optimized via cmp/jmp fusion.
// clang -O3 -S test.c
extern void g(int);
void f(int *p, long long n) {
    do {
        g(*p++);
    } while (--n);
}

LLVM currently generates the following sequence for x86_64 targets:
LBB0_1:
	movl	(%r15,%rbx,4), %edi
	callq	g
	addq	$1, %rbx
	cmpq	%rbx, %r14
	jne	.LBB0_1

LLVM can perform compare-jump fusion, it already does in certain cases, but not
in the case above.  We can remove the cmp above if we were to perform
the following transformation:
1.0) Initialize the induction variable, %rbx, to be 'n' instead of zero.
1.1) Negate the induction variable, so that the loop increments from -n to zero.
2.0) Set the base pointer, %r15, to be the last address that will be accessed
       from 'p'
2.1) In the preheader of the loop set %r15 to be: %r15 = <base> + n *
<ele size>
3.0) Remove the cmpq

We can remove the cmp, since we are now counting from -n to zero.  The
termination case for the loop will now be zero.  The add instruction will set
the zero-flag when %rbx reaches zero (target specific of course).

The pointer passed to g() will also be correct, since we access the same items
we normally would have, but the access is with respect to the base
being the last item and subtracting from the last item, via a negative induction
variable.  I imagine the result would look something like the following:
    movq   %rsi,   %rbx                   ; rbx becomes the value n
    leaq     (%rdi, %rbx, 4), %r15  ; <base> = p + (n * 4)
    negq    %rbx                              ; set rbx to be -n, so we count
from -n to 0
LBB0_1:
    movl    (%r15,%rbx,4), %edi
    callq    g
    addq   $1, %rbx
    jne      .LBB0_1

I realize this is a micro-op saving a single cycle.  But this reduces the
instruction count, one less
instr to decode in a potentially hot path. If this all makes sense, and seems
like a reasonable addition
to llvm, would it make sense to implement this as a supplemental LSR formula, or
as a separate pass?

-Matt

Gopalasubramanian, Ganesh via llvm-dev

2018-Apr-04 08:11 UTC

head link

[llvm-dev] SCEV and LoopStrengthReduction Formulae

> 	cmpq	%rbx, %r14
> 	jne	.LBB0_1
>
> LLVM can perform compare-jump fusion, it already does in certain cases, but
> not in the case above.  We can remove the cmp above if we were to perform
> the following transformation:
Do you mean branch-fusion (https://en.wikichip.org/wiki/macro-operation_fusion)?
Is there any more limitation why these two or not fused?
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of via
llvm-
> dev
> Sent: Wednesday, April 4, 2018 3:30 AM
> To: llvm-dev at lists.llvm.org
> Subject: [llvm-dev] SCEV and LoopStrengthReduction Formulae
> 
> I am attempting to implement a minor loop strength reduction optimization
for
> targets that support compare and jump fusion, specifically
> TTI::canMacroFuseCmp().  My approach might be wrong; however, I am
> soliciting the idea for feedback, so that I can implement this correctly. 
My plan
> is to add a Supplemental LSR formula to LoopStrengthReduce.cpp that
optimizes
> the following case, but perhaps this should stand alone as  its own pass:
> 
> // Example which can be optimized via cmp/jmp fusion.
> // clang -O3 -S test.c
> extern void g(int);
> void f(int *p, long long n) {
>     do {
>         g(*p++);
>     } while (--n);
> }
> 
> LLVM currently generates the following sequence for x86_64 targets:
> LBB0_1:
> 	movl	(%r15,%rbx,4), %edi
> 	callq	g
> 	addq	$1, %rbx
> 	cmpq	%rbx, %r14
> 	jne	.LBB0_1
> 
> LLVM can perform compare-jump fusion, it already does in certain cases, but
> not in the case above.  We can remove the cmp above if we were to perform
> the following transformation:
> 1.0) Initialize the induction variable, %rbx, to be 'n' instead of
zero.
> 1.1) Negate the induction variable, so that the loop increments from -n to
zero.
> 2.0) Set the base pointer, %r15, to be the last address that will be
accessed
>        from 'p'
> 2.1) In the preheader of the loop set %r15 to be: %r15 = <base> + n *
<ele size>
> 3.0) Remove the cmpq
> 
> We can remove the cmp, since we are now counting from -n to zero.  The
> termination case for the loop will now be zero.  The add instruction will
set the
> zero-flag when %rbx reaches zero (target specific of course).
> 
> The pointer passed to g() will also be correct, since we access the same
items we
> normally would have, but the access is with respect to the base being the
last
> item and subtracting from the last item, via a negative induction variable.
I
> imagine the result would look something like the following:
>     movq   %rsi,   %rbx                   ; rbx becomes the value n
>     leaq     (%rdi, %rbx, 4), %r15  ; <base> = p + (n * 4)
>     negq    %rbx                              ; set rbx to be -n, so we
count from -n to 0
> LBB0_1:
>     movl    (%r15,%rbx,4), %edi
>     callq    g
>     addq   $1, %rbx
>     jne      .LBB0_1
> 
> I realize this is a micro-op saving a single cycle.  But this reduces the
instruction
> count, one less instr to decode in a potentially hot path. If this all
makes sense,
> and seems like a reasonable addition to llvm, would it make sense to
implement
> this as a supplemental LSR formula, or as a separate pass?
> 
> -Matt
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

via llvm-dev

2018-Apr-04 18:06 UTC

head link

[llvm-dev] SCEV and LoopStrengthReduction Formulae

> From: Gopalasubramanian, Ganesh <Ganesh.Gopalasubramanian at amd.com>
> Sent: Wednesday, April 4, 2018 1:11 AM
> To: Davis, Matthew <Matthew.Davis at sony.com>
> 
> > 	cmpq	%rbx, %r14
> > 	jne	.LBB0_1
> >
> > LLVM can perform compare-jump fusion, it already does in certain
> > cases, but not in the case above.  We can remove the cmp above if we
> > were to perform the following transformation:
> 
> Do you mean branch-fusion (https://en.wikichip.org/wiki/macro-
> operation_fusion)?
Thanks for your reply.  

After looking at your link and the x86 instruction manual, Intel Architecture
Optimization manual (3.4.2.2), I realized that the fusion is performed
automatically by the hardware, on the cmp+jne instructions. What I  am trying to
achieve is the removal of the cmp.  Which would result with one less instruction
in the add/cmp/jmp sequence.  Since 'add' sets the zero flag and the jmp
checks
that flag.  So the change would eliminate the fusible cmp+jmp, resulting
in just add+jmp, in this case. 

On Sandy Bridge architectures, which introduce the capability of fusing ADD (and
a few other instructions), this proposed optimization would remove the cmp from
the add/cmp/jmp sequence, leaving just the add+jmp, which is fusible. This cmp 
removal would eliminate a single fetch-decode, in a potentially hot path (loop).
The benefit we get from the cmp removal is one less instruction to decode, and
less pressure on the decoder if the decoded instruction cache were filled-up or 
flushed.
> Is there any more limitation why these two or not fused?Excuse my confusion here.  The hardware might have been fusing the cmp+jmp
I was hoping to just remove the cmp and avoid the instruction entirely.

-Matt
> 
> > -----Original Message-----
> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf
Of
> > via llvm- dev
> > Sent: Wednesday, April 4, 2018 3:30 AM
> > To: llvm-dev at lists.llvm.org
> > Subject: [llvm-dev] SCEV and LoopStrengthReduction Formulae
> >
> > I am attempting to implement a minor loop strength reduction
> > optimization for targets that support compare and jump fusion,
> > specifically TTI::canMacroFuseCmp().  My approach might be wrong;
> > however, I am soliciting the idea for feedback, so that I can
> > implement this correctly.  My plan is to add a Supplemental LSR
> > formula to LoopStrengthReduce.cpp that optimizes the following case,
but
> perhaps this should stand alone as  its own pass:
> >
> > // Example which can be optimized via cmp/jmp fusion.
> > // clang -O3 -S test.c
> > extern void g(int);
> > void f(int *p, long long n) {
> >     do {
> >         g(*p++);
> >     } while (--n);
> > }
> >
> > LLVM currently generates the following sequence for x86_64 targets:
> > LBB0_1:
> > 	movl	(%r15,%rbx,4), %edi
> > 	callq	g
> > 	addq	$1, %rbx
> > 	cmpq	%rbx, %r14
> > 	jne	.LBB0_1
> >
> > LLVM can perform compare-jump fusion, it already does in certain
> > cases, but not in the case above.  We can remove the cmp above if we
> > were to perform the following transformation:
> > 1.0) Initialize the induction variable, %rbx, to be 'n'
instead of zero.
> > 1.1) Negate the induction variable, so that the loop increments from
-n to zero.
> > 2.0) Set the base pointer, %r15, to be the last address that will be
accessed
> >        from 'p'
> > 2.1) In the preheader of the loop set %r15 to be: %r15 = <base>
+ n *
> > <ele size>
> > 3.0) Remove the cmpq
> >
> > We can remove the cmp, since we are now counting from -n to zero.  The
> > termination case for the loop will now be zero.  The add instruction
> > will set the zero-flag when %rbx reaches zero (target specific of
course).
> >
> > The pointer passed to g() will also be correct, since we access the
> > same items we normally would have, but the access is with respect to
> > the base being the last item and subtracting from the last item, via a
> > negative induction variable.  I imagine the result would look
something like the
> following:
> >     movq   %rsi,   %rbx                   ; rbx becomes the value n
> >     leaq     (%rdi, %rbx, 4), %r15  ; <base> = p + (n * 4)
> >     negq    %rbx                              ; set rbx to be -n, so
we count from -n to 0
> > LBB0_1:
> >     movl    (%r15,%rbx,4), %edi
> >     callq    g
> >     addq   $1, %rbx
> >     jne      .LBB0_1
> >
> > I realize this is a micro-op saving a single cycle.  But this reduces
> > the instruction count, one less instr to decode in a potentially hot
> > path. If this all makes sense, and seems like a reasonable addition to
> > llvm, would it make sense to implement this as a supplemental LSR
formula, or
> as a separate pass?
> >
> > -Matt
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

via llvm-dev

2018-Apr-07 15:22 UTC

head link

[llvm-dev] SCEV and LoopStrengthReduction Formulae

> 
> I realize this is a micro-op saving a single cycle.  But this reduces the
instruction count, one less
> instr to decode in a potentially hot path. If this all makes sense, and
seems like a reasonable addition
> to llvm, would it make sense to implement this as a supplemental LSR
formula, or as a separate pass?
This seems reasonable to me so long as rbx has no other uses that would
complicate the problem; I’m not sure how much this occurs in hot code (a loop
with an induction variable that isn’t used in the loop), but if it does, I don’t
see why not.

As a side note, in a past life, when I used to do x86 SIMD optimization for a
living, I did similar tricks pretty much everywhere in DSP functions. It’d be
pretty nice if the compiler could do it too.

There is one alternate approach that I recall, which looks like this:

Original code (example, pseudocode):

int add_delta_256(uint8 *in1, uint8 *in2) {
  int accum = 0;
  for (int i = 0; i < 16; ++i) {
   uint8x16 a = load16(in1 + i *16); // NOTE: takes an extra addressing op
because x86
   uint8x16 b = load16(in2 + i *16); // NOTE: takes an extra addressing op
because x86
   accum += psadbw(a, b);
  }
  return accum;
}

end of loop:
inc i
cmp i, 16
jl loop

LSR’d code:

int add_delta_256(uint8 *in1, uint8 *in2) {
  int accum = 0;
  for (int i = 0; i < 16; ++i, in1 += 16, in2 += 16) {
   uint8x16 a = load16(in1);
   uint8x16 b = load16(in2);
   accum += psadbw(a, b);
  }
  return accum;
}

end of loop:
add in1, 16
add in2, 16
inc i
cmp i, 16
jl loop

your code:

int add_delta_256(uint8 *in1, uint8 *in2) {
  int accum = 0;
  for (int i = -16; i < 0; ++i, in1 += 16, in2 += 16) {
   uint8x16 a = load16(in1);
   uint8x16 b = load16(in2);
   accum += psadbw(a, b);
  }
  return accum;
}

end of loop:
add in1, 16
add in2, 16
inc i
jl loop

ideal code:

int add_delta_256(uint8 *in1, uint8 *in2) {
  int accum = 0;
  in1 += 256;
  in2 += 256;
  for (int i = -256; i < 0; ++i) {
   uint8x16 a = load16(in1 + i);
   uint8x16 b = load16(in2 + i);
   accum += psadbw(a, b);
  }
  return accum;
}

end of loop:
inc i
jl loop

I don’t know, however, if it’s reasonable to teach the compiler to do the clever
nonsense necessary to do the last one (requires enough understanding of x86
addressing modes, for one).

—escha
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180407/f718f12c/attachment.html>

via llvm-dev

2018-Apr-09 23:45 UTC

head link

[llvm-dev] SCEV and LoopStrengthReduction Formulae

> From: fglaser at apple.com <fglaser at apple.com> On Behalf Of escha
at apple.com
> Sent: Saturday, April 7, 2018 8:22 AM
> 
>> I realize this is a micro-op saving a single cycle.  But this reduces
the instruction count, one less
>> instr to decode in a potentially hot path. If this all makes sense, and
seems like a reasonable addition
>> to llvm, would it make sense to implement this as a supplemental LSR
formula, or as a separate pass?
>
> This seems reasonable to me so long as rbx has no other uses that would
complicate the problem; I’m not sure how much this occurs in hot code (a loop
with an induction variable
> that isn’t used in the loop), but if it does, I don’t see why not.
Thanks for this response and the examples!  I too think it could be a win, and
I've explored both: implementing this as a pass (machine function pass) and
as a SCEV expression for LSR (an LSR formula).  I'm still on the fence about
which approach is the 'best', or more widely accepted.  To me it seems
to be more advantageous to add this as an LSR formula, that pass provides a cost
model and is target independent.  However, it does rely on the fact that X86
sets the zero flag for inc/add, I'm not sure about other architectures. 
Additionally, this change would be fairly similar to what LSR already does.

-Matt

Possibly Parallel Threads

Search for more possibly parallel threads

llvm dev - Apr 2018 - SCEV and LoopStrengthReduction Formulae

[llvm-dev] SCEV and LoopStrengthReduction Formulae

[llvm-dev] SCEV and LoopStrengthReduction Formulae

[llvm-dev] SCEV and LoopStrengthReduction Formulae

[llvm-dev] SCEV and LoopStrengthReduction Formulae

[llvm-dev] SCEV and LoopStrengthReduction Formulae

Possibly Parallel Threads