I am attempting to implement a minor loop strength reduction optimization for
targets that support compare and jump fusion, specifically
TTI::canMacroFuseCmp().  My approach might be wrong; however, I am soliciting
the idea for feedback, so that I can implement this correctly.  My plan is to
add a Supplemental LSR formula to LoopStrengthReduce.cpp that optimizes  the
following case, but perhaps this should stand alone as  its own pass:
// Example which can be optimized via cmp/jmp fusion.
// clang -O3 -S test.c
extern void g(int);
void f(int *p, long long n) {
    do {
        g(*p++);
    } while (--n);
}
LLVM currently generates the following sequence for x86_64 targets:
LBB0_1:
	movl	(%r15,%rbx,4), %edi
	callq	g
	addq	$1, %rbx
	cmpq	%rbx, %r14
	jne	.LBB0_1
LLVM can perform compare-jump fusion, it already does in certain cases, but not
in the case above.  We can remove the cmp above if we were to perform
the following transformation:
1.0) Initialize the induction variable, %rbx, to be 'n' instead of zero.
1.1) Negate the induction variable, so that the loop increments from -n to zero.
2.0) Set the base pointer, %r15, to be the last address that will be accessed
       from 'p'
2.1) In the preheader of the loop set %r15 to be: %r15 = <base> + n *
<ele size>
3.0) Remove the cmpq
We can remove the cmp, since we are now counting from -n to zero.  The
termination case for the loop will now be zero.  The add instruction will set
the zero-flag when %rbx reaches zero (target specific of course).
The pointer passed to g() will also be correct, since we access the same items
we normally would have, but the access is with respect to the base
being the last item and subtracting from the last item, via a negative induction
variable.  I imagine the result would look something like the following:
    movq   %rsi,   %rbx                   ; rbx becomes the value n
    leaq     (%rdi, %rbx, 4), %r15  ; <base> = p + (n * 4)
    negq    %rbx                              ; set rbx to be -n, so we count
from -n to 0
LBB0_1:
    movl    (%r15,%rbx,4), %edi
    callq    g
    addq   $1, %rbx
    jne      .LBB0_1
I realize this is a micro-op saving a single cycle.  But this reduces the
instruction count, one less
instr to decode in a potentially hot path. If this all makes sense, and seems
like a reasonable addition
to llvm, would it make sense to implement this as a supplemental LSR formula, or
as a separate pass?
-Matt
Gopalasubramanian, Ganesh via llvm-dev
2018-Apr-04  08:11 UTC
[llvm-dev] SCEV and LoopStrengthReduction Formulae
> cmpq %rbx, %r14 > jne .LBB0_1 > > LLVM can perform compare-jump fusion, it already does in certain cases, but > not in the case above. We can remove the cmp above if we were to perform > the following transformation:Do you mean branch-fusion (https://en.wikichip.org/wiki/macro-operation_fusion)? Is there any more limitation why these two or not fused?> -----Original Message----- > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of via llvm- > dev > Sent: Wednesday, April 4, 2018 3:30 AM > To: llvm-dev at lists.llvm.org > Subject: [llvm-dev] SCEV and LoopStrengthReduction Formulae > > I am attempting to implement a minor loop strength reduction optimization for > targets that support compare and jump fusion, specifically > TTI::canMacroFuseCmp(). My approach might be wrong; however, I am > soliciting the idea for feedback, so that I can implement this correctly. My plan > is to add a Supplemental LSR formula to LoopStrengthReduce.cpp that optimizes > the following case, but perhaps this should stand alone as its own pass: > > // Example which can be optimized via cmp/jmp fusion. > // clang -O3 -S test.c > extern void g(int); > void f(int *p, long long n) { > do { > g(*p++); > } while (--n); > } > > LLVM currently generates the following sequence for x86_64 targets: > LBB0_1: > movl (%r15,%rbx,4), %edi > callq g > addq $1, %rbx > cmpq %rbx, %r14 > jne .LBB0_1 > > LLVM can perform compare-jump fusion, it already does in certain cases, but > not in the case above. We can remove the cmp above if we were to perform > the following transformation: > 1.0) Initialize the induction variable, %rbx, to be 'n' instead of zero. > 1.1) Negate the induction variable, so that the loop increments from -n to zero. > 2.0) Set the base pointer, %r15, to be the last address that will be accessed > from 'p' > 2.1) In the preheader of the loop set %r15 to be: %r15 = <base> + n * <ele size> > 3.0) Remove the cmpq > > We can remove the cmp, since we are now counting from -n to zero. The > termination case for the loop will now be zero. The add instruction will set the > zero-flag when %rbx reaches zero (target specific of course). > > The pointer passed to g() will also be correct, since we access the same items we > normally would have, but the access is with respect to the base being the last > item and subtracting from the last item, via a negative induction variable. I > imagine the result would look something like the following: > movq %rsi, %rbx ; rbx becomes the value n > leaq (%rdi, %rbx, 4), %r15 ; <base> = p + (n * 4) > negq %rbx ; set rbx to be -n, so we count from -n to 0 > LBB0_1: > movl (%r15,%rbx,4), %edi > callq g > addq $1, %rbx > jne .LBB0_1 > > I realize this is a micro-op saving a single cycle. But this reduces the instruction > count, one less instr to decode in a potentially hot path. If this all makes sense, > and seems like a reasonable addition to llvm, would it make sense to implement > this as a supplemental LSR formula, or as a separate pass? > > -Matt > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> From: Gopalasubramanian, Ganesh <Ganesh.Gopalasubramanian at amd.com> > Sent: Wednesday, April 4, 2018 1:11 AM > To: Davis, Matthew <Matthew.Davis at sony.com> > > > cmpq %rbx, %r14 > > jne .LBB0_1 > > > > LLVM can perform compare-jump fusion, it already does in certain > > cases, but not in the case above. We can remove the cmp above if we > > were to perform the following transformation: > > Do you mean branch-fusion (https://en.wikichip.org/wiki/macro- > operation_fusion)?Thanks for your reply. After looking at your link and the x86 instruction manual, Intel Architecture Optimization manual (3.4.2.2), I realized that the fusion is performed automatically by the hardware, on the cmp+jne instructions. What I am trying to achieve is the removal of the cmp. Which would result with one less instruction in the add/cmp/jmp sequence. Since 'add' sets the zero flag and the jmp checks that flag. So the change would eliminate the fusible cmp+jmp, resulting in just add+jmp, in this case. On Sandy Bridge architectures, which introduce the capability of fusing ADD (and a few other instructions), this proposed optimization would remove the cmp from the add/cmp/jmp sequence, leaving just the add+jmp, which is fusible. This cmp removal would eliminate a single fetch-decode, in a potentially hot path (loop). The benefit we get from the cmp removal is one less instruction to decode, and less pressure on the decoder if the decoded instruction cache were filled-up or flushed.> Is there any more limitation why these two or not fused?Excuse my confusion here. The hardware might have been fusing the cmp+jmp I was hoping to just remove the cmp and avoid the instruction entirely. -Matt> > > -----Original Message----- > > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of > > via llvm- dev > > Sent: Wednesday, April 4, 2018 3:30 AM > > To: llvm-dev at lists.llvm.org > > Subject: [llvm-dev] SCEV and LoopStrengthReduction Formulae > > > > I am attempting to implement a minor loop strength reduction > > optimization for targets that support compare and jump fusion, > > specifically TTI::canMacroFuseCmp(). My approach might be wrong; > > however, I am soliciting the idea for feedback, so that I can > > implement this correctly. My plan is to add a Supplemental LSR > > formula to LoopStrengthReduce.cpp that optimizes the following case, but > perhaps this should stand alone as its own pass: > > > > // Example which can be optimized via cmp/jmp fusion. > > // clang -O3 -S test.c > > extern void g(int); > > void f(int *p, long long n) { > > do { > > g(*p++); > > } while (--n); > > } > > > > LLVM currently generates the following sequence for x86_64 targets: > > LBB0_1: > > movl (%r15,%rbx,4), %edi > > callq g > > addq $1, %rbx > > cmpq %rbx, %r14 > > jne .LBB0_1 > > > > LLVM can perform compare-jump fusion, it already does in certain > > cases, but not in the case above. We can remove the cmp above if we > > were to perform the following transformation: > > 1.0) Initialize the induction variable, %rbx, to be 'n' instead of zero. > > 1.1) Negate the induction variable, so that the loop increments from -n to zero. > > 2.0) Set the base pointer, %r15, to be the last address that will be accessed > > from 'p' > > 2.1) In the preheader of the loop set %r15 to be: %r15 = <base> + n * > > <ele size> > > 3.0) Remove the cmpq > > > > We can remove the cmp, since we are now counting from -n to zero. The > > termination case for the loop will now be zero. The add instruction > > will set the zero-flag when %rbx reaches zero (target specific of course). > > > > The pointer passed to g() will also be correct, since we access the > > same items we normally would have, but the access is with respect to > > the base being the last item and subtracting from the last item, via a > > negative induction variable. I imagine the result would look something like the > following: > > movq %rsi, %rbx ; rbx becomes the value n > > leaq (%rdi, %rbx, 4), %r15 ; <base> = p + (n * 4) > > negq %rbx ; set rbx to be -n, so we count from -n to 0 > > LBB0_1: > > movl (%r15,%rbx,4), %edi > > callq g > > addq $1, %rbx > > jne .LBB0_1 > > > > I realize this is a micro-op saving a single cycle. But this reduces > > the instruction count, one less instr to decode in a potentially hot > > path. If this all makes sense, and seems like a reasonable addition to > > llvm, would it make sense to implement this as a supplemental LSR formula, or > as a separate pass? > > > > -Matt > > _______________________________________________ > > LLVM Developers mailing list > > llvm-dev at lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > I realize this is a micro-op saving a single cycle. But this reduces the instruction count, one less > instr to decode in a potentially hot path. If this all makes sense, and seems like a reasonable addition > to llvm, would it make sense to implement this as a supplemental LSR formula, or as a separate pass?This seems reasonable to me so long as rbx has no other uses that would complicate the problem; I’m not sure how much this occurs in hot code (a loop with an induction variable that isn’t used in the loop), but if it does, I don’t see why not. As a side note, in a past life, when I used to do x86 SIMD optimization for a living, I did similar tricks pretty much everywhere in DSP functions. It’d be pretty nice if the compiler could do it too. There is one alternate approach that I recall, which looks like this: Original code (example, pseudocode): int add_delta_256(uint8 *in1, uint8 *in2) { int accum = 0; for (int i = 0; i < 16; ++i) { uint8x16 a = load16(in1 + i *16); // NOTE: takes an extra addressing op because x86 uint8x16 b = load16(in2 + i *16); // NOTE: takes an extra addressing op because x86 accum += psadbw(a, b); } return accum; } end of loop: inc i cmp i, 16 jl loop LSR’d code: int add_delta_256(uint8 *in1, uint8 *in2) { int accum = 0; for (int i = 0; i < 16; ++i, in1 += 16, in2 += 16) { uint8x16 a = load16(in1); uint8x16 b = load16(in2); accum += psadbw(a, b); } return accum; } end of loop: add in1, 16 add in2, 16 inc i cmp i, 16 jl loop your code: int add_delta_256(uint8 *in1, uint8 *in2) { int accum = 0; for (int i = -16; i < 0; ++i, in1 += 16, in2 += 16) { uint8x16 a = load16(in1); uint8x16 b = load16(in2); accum += psadbw(a, b); } return accum; } end of loop: add in1, 16 add in2, 16 inc i jl loop ideal code: int add_delta_256(uint8 *in1, uint8 *in2) { int accum = 0; in1 += 256; in2 += 256; for (int i = -256; i < 0; ++i) { uint8x16 a = load16(in1 + i); uint8x16 b = load16(in2 + i); accum += psadbw(a, b); } return accum; } end of loop: inc i jl loop I don’t know, however, if it’s reasonable to teach the compiler to do the clever nonsense necessary to do the last one (requires enough understanding of x86 addressing modes, for one). —escha -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180407/f718f12c/attachment.html>
> From: fglaser at apple.com <fglaser at apple.com> On Behalf Of escha at apple.com > Sent: Saturday, April 7, 2018 8:22 AM > >> I realize this is a micro-op saving a single cycle. But this reduces the instruction count, one less >> instr to decode in a potentially hot path. If this all makes sense, and seems like a reasonable addition >> to llvm, would it make sense to implement this as a supplemental LSR formula, or as a separate pass? > > This seems reasonable to me so long as rbx has no other uses that would complicate the problem; I’m not sure how much this occurs in hot code (a loop with an induction variable > that isn’t used in the loop), but if it does, I don’t see why not.Thanks for this response and the examples! I too think it could be a win, and I've explored both: implementing this as a pass (machine function pass) and as a SCEV expression for LSR (an LSR formula). I'm still on the fence about which approach is the 'best', or more widely accepted. To me it seems to be more advantageous to add this as an LSR formula, that pass provides a cost model and is target independent. However, it does rely on the fact that X86 sets the zero flag for inc/add, I'm not sure about other architectures. Additionally, this change would be fairly similar to what LSR already does. -Matt
Reasonably Related Threads
- SCEV and LoopStrengthReduction Formulae
- SCEV and LoopStrengthReduction Formulae
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON
- [PATCH] Optimize silk_warped_autocorrelation_FIX() for ARM NEON