Joan Lluch via llvm-dev
2019-Jun-12 21:58 UTC
[llvm-dev] Is there a reason for swapping SCEVs in the call to getMinusSCEV from CollectFixupsAndInitialFormulae (all targets) ?
Hi all,

In the LSRInstance::CollectFixupsAndInitialFormulae() implementation, the function
getMinusSCEV is called at a particular point as part of the code that replaces/normalises
'x == y' ICmp instructions with 'x - y == 0'. However, what the code actually does is
replace them with 'y - x == 0' instead.

This is the excerpt of the LSRInstance::CollectFixupsAndInitialFormulae function I am
referring to:

      // x == y  -->  x - y == 0
      const SCEV *N = SE.getSCEV(NV);
      if (SE.isLoopInvariant(N, L) && isSafeToExpand(N, SE)) {
        // S is normalized, so normalize N before folding it into S
        // to keep the result normalized.
        N = normalizeForPostIncUse(N, TmpPostIncLoops, SE);
        Kind = LSRUse::ICmpZero;
        S = SE.getMinusSCEV(N, S);   // use instead  S = SE.getMinusSCEV(S, N);
      }

[ the comment "use instead" after the getMinusSCEV call is mine ]

This apparently unimportant difference has a remarkable effect later on, when the optimal
solution for the loop is computed in LSRInstance::SolveRecurse.

In LSRInstance::SolveRecurse, given any two loop variants that are found to have the same
cost, the first one is always chosen. This implies that the initial formula created in
LSRInstance::CollectFixupsAndInitialFormulae() is the one that will prevail by default
(given the same cost). The point I want to highlight is that the current implementation of
CollectFixupsAndInitialFormulae reverses the SCEVs found in the user source code rather
than leaving them as found. This causes the following problem.

Consider the following code (please don't take into account whether it does something
useful or not; it is just to show the point):

    int asr_Lowert( int a, unsigned ammount )
    {
      if ( (ammount -= 8) >= 0 )
        a = a >> 8;

      while ( ammount-- )
        a >>= 1;

      return a;
    }

The interesting part for this discussion is the 'while' loop. It gets compiled by Clang
like this:

    while.cond:                              ; preds = %while.body, %entry
      %a.addr.1 = phi i32 [ %shr, %entry ], [ %shr1, %while.body ]
      %ammount.addr.0 = phi i32 [ %sub, %entry ], [ %dec, %while.body ]
      %tobool = icmp eq i32 %ammount.addr.0, 0
      br i1 %tobool, label %while.end, label %while.body

    while.body:                              ; preds = %while.cond
      %dec = add i32 %ammount.addr.0, -1
      %shr1 = ashr i32 %a.addr.1, 1
      br label %while.cond

Clang mostly follows the source code and produces code that resembles it. It is important
to note that the variable 'ammount' is decremented by 1 on each loop iteration, as found
in the user source code.

I am going to choose the x86 target because I really want this subject to gain some
attention, but please note that this is a general subject that affects ALL targets. So,
for x86, the resulting assembly code is this:

        .macosx_version_min 10, 12
        .globl  _asr_Lowert
    _asr_Lowert:
        .cfi_startproc
        pushl   %ebp
        .cfi_def_cfa_offset 8
        .cfi_offset %ebp, -8
        movl    %esp, %ebp
        .cfi_def_cfa_register %ebp
        movl    8(%ebp), %eax
        sarl    $8, %eax
        pushl   $8
        popl    %ecx
        subl    12(%ebp), %ecx
        jmp     LBB0_1
    LBB0_2:
        sarl    %eax
        incl    %ecx
    LBB0_1:
        testl   %ecx, %ecx
        jne     LBB0_2
        popl    %ebp
        retl
        .cfi_endproc

What I want to show is that now, instead of 'ammount' being decremented, it is
incremented. Actually, the compiler performs the oddity of setting the initial value of
the 'while' variable to '8-ammount' rather than 'ammount-8'. The compiler even needs to
perform a 'push' and a 'pop' as a setup for that. This is all because the SCEVs were
swapped in the LLVM implementation of LSRInstance::CollectFixupsAndInitialFormulae().
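Before moving on, here is roughly what the two operand orders mean in SCEV terms for this
particular loop. This is my informal reading, written in loose {start,+,step} notation and
ignoring the post-increment normalisation step: S is the add recurrence for
%ammount.addr.0, whose start value %sub is 'ammount-8' and whose step is -1, and N is the
SCEV of the other icmp operand, the constant 0.

    S = SE.getMinusSCEV(N, S);  // 0 - {ammount-8,+,-1}  =>  {8-ammount,+,1}
                                // counts up towards zero: the '8-ammount' setup above
    S = SE.getMinusSCEV(S, N);  // {ammount-8,+,-1} - 0  =>  {ammount-8,+,-1}
                                // keeps the user's down-counting recurrence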
Now, let me show what happens if we implement the function as suggested in the comment,
that is, by replacing the current code with S = SE.getMinusSCEV(S, N);

This is what we get from LLVM for the same input:

        .macosx_version_min 10, 12
        .globl  _asr_Lowert
    _asr_Lowert:
        .cfi_startproc
        pushl   %ebp
        .cfi_def_cfa_offset 8
        .cfi_offset %ebp, -8
        movl    %esp, %ebp
        .cfi_def_cfa_register %ebp
        movl    8(%ebp), %eax
        movl    12(%ebp), %ecx
        addl    $-8, %ecx
        sarl    $8, %eax
        jmp     LBB0_1
    LBB0_2:
        decl    %ecx
        sarl    %eax
    LBB0_1:
        testl   %ecx, %ecx
        jne     LBB0_2
        popl    %ebp
        retl
        .cfi_endproc

The code now uses a decrement, and the initial 'while' variable value is 'ammount-8', as
we would expect from the user source code. The whole function is also shorter, and it does
not use potentially costly push/pop instructions.

I also want to point out that this is not an isolated case at all. It happens all the time
for small loops where the costs of several variants are the same. In all such cases,
decrements in the user source code are replaced by positive increments with necessarily
strange setup code (as shown in the first example above). Positive increments are replaced
by decrements, which may also force extra code generation because the loop induction
variable can no longer be used for other useful purposes. The remarkable thing is that by
swapping the SCEV expressions as we currently do, the compiler stops taking the user
source code into account as a hint on what to do, and instead it may produce worse code in
circumstances where the user had already chosen a naturally better implementation, which
the compiler then destroys by artificially replacing it with a different solution of
supposedly identical cost.

Also, although this notably affects small loops with a few low-cost variants, it also
affects more complex loops that may have two or more optimal solutions. Among the optimal
solutions of the same cost, the algorithm will always choose the first one in the list, so
again the one that gets chosen may be a reversed one with respect to the user code, unless
it is a really different solution.

I do not understand the reason behind such reversing of loop expressions, and that's the
question that I post. I understand that we may want to always favour decrements (maybe for
performance reasons on some targets), or that we may want to prioritise increments (maybe
to keep induction variables usable in addressing modes on some targets), or even that we
may want to do different things for different targets. But I do not quite understand the
current implementation. It almost seems to me a typo in the LLVM source code that nobody
has come across before. I have had my local copy of LLVM for some time with the code set
as in the comment, and I have yet to find a case where it produces worse code than the
original.

So, I really am interested in starting some informed discussion about this.

John Lluch
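To make the tie-breaking behaviour mentioned above concrete, here is a minimal,
self-contained sketch (hypothetical names, not the actual LLVM code) of a selection loop
that replaces the incumbent only when a candidate is strictly cheaper. This is effectively
what SolveRecurse does as described above: among equal-cost solutions, the formula
enumerated first (the initial one built in CollectFixupsAndInitialFormulae()) is the one
that survives.

    #include <cstddef>
    #include <vector>

    // Hypothetical stand-in type, for illustration only (not an LLVM type).
    struct Formula { unsigned Cost; };

    // Keep a candidate only when it is strictly cheaper than the incumbent.
    // With a strict comparison an equal-cost candidate never replaces the
    // incumbent, so the first formula enumerated wins among equal costs.
    static std::size_t pickBest(const std::vector<Formula> &Candidates) {
      std::size_t Best = 0;
      for (std::size_t I = 1; I < Candidates.size(); ++I)
        if (Candidates[I].Cost < Candidates[Best].Cost)  // strict '<', not '<='
          Best = I;
      return Best;
    }

    int main() {
      // Two formulae with identical cost: the one listed first (index 0),
      // i.e. the initial formula, is the one that prevails.
      std::vector<Formula> Candidates = {{3}, {3}};
      return static_cast<int>(pickBest(Candidates));  // returns 0
    }

The real algorithm of course enumerates complete per-use solutions recursively rather than
scanning a flat list, but the same "strictly less" tie-break applies.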
Neil Nelson via llvm-dev
2019-Jun-15 14:28 UTC
[llvm-dev] Is there a reason for swapping SCEVs in the call to getMinusSCEV from CollectFixupsAndInitialFormulae (all targets) ?
Too bad this did not go anywhere. I was looking forward to a discovery of this mystery.

On 6/12/19 3:58 PM, Joan Lluch via llvm-dev wrote:
> I do not understand the reason behind such reversing of loop expressions, and that's
> the question that I post.
> [...]
> So, I really am interested in starting some informed discussion about this.