thr3ads.net - llvm dev - [LLVMdev] Scheduling quirks [Jan 2014]

If this information is useful, please help other people find it:
Share via:

Jasper Neumann

2014-Jan-18 13:13 UTC

[LLVMdev] Scheduling quirks

Hello all!

When I compile the following more or less stupid functions with
   clang++ -O3 -S test.cpp
===>
int test_register(int x) {
   x ^= (x >> 2);
   x ^= (x >> 3);
   x = x ^ (x >> 4);
   int y = x;  x >>= 5;  x ^= y;  // almost the same but explicit
   return x;
   }

int test_scheduler(int x) {
   return ((x>>2) & 15) ^ ((x>>3) & 31);
   }
<==...I get the following result:
===>
	.file	"test.cpp"
	.text
	.globl	_Z13test_registeri
	.align	16, 0x90
	.type	_Z13test_registeri, at function
_Z13test_registeri:                     # @_Z13test_registeri
	.cfi_startproc
# BB#0:                                 # %entry
	movl	%edi, %eax
	sarl	$2, %eax
	xorl	%edi, %eax
	movl	%eax, %ecx
	sarl	$3, %ecx
	xorl	%eax, %ecx
	movl	%ecx, %edx
	sarl	$4, %edx
	xorl	%ecx, %edx
	movl	%edx, %eax
	sarl	$5, %eax
	xorl	%edx, %eax
	retq
.Ltmp0:
	.size	_Z13test_registeri, .Ltmp0-_Z13test_registeri
	.cfi_endproc

	.globl	_Z14test_scheduleri
	.align	16, 0x90
	.type	_Z14test_scheduleri, at function
_Z14test_scheduleri:                    # @_Z14test_scheduleri
	.cfi_startproc
# BB#0:                                 # %entry
	movl	%edi, %eax
	shrl	$2, %eax
	andl	$15, %eax
	shrl	$3, %edi
	andl	$31, %edi
	xorl	%eax, %edi
	movl	%edi, %eax
	retq
.Ltmp1:
	.size	_Z14test_scheduleri, .Ltmp1-_Z14test_scheduleri
	.cfi_endproc


	.ident	"clang version 3.5 (trunk 199507)"
	.section	".note.GNU-stack","", at progbits
<==
Now once more in detail.

The lines
   x ^= (x >> 2);
and
   x = x ^ (x >> 4);
and (!)
   int y = x;  x >>= 8;  x ^= y;  // almost the same but explicit
are compiled into code like
	movl	%edi, %eax
	sarl	$2, %eax
	xorl	%edi, %eax
As far as I know optimal for all x86 but the very latest 4th generation 
Intel Core processors the following variant is better (2 instead of 3 
cycles; I proved this for e.g. Intel i7 920) because the first two lines 
can be executed simultaneously:
	movl	%edi, %eax
	sarl	$2, %edi    # modify source instead of copy
	xorl	%edi, %eax
Is there a special reason to do that this way?
Interestingly most compilers including ICC and GCC show this strange 
behavior. I had reported this in an Intel forum as well as for GCC a 
long time ago but there has been no real reaction...
Also, why are 4 registers used whereas 2 are sufficient?


In the second function the line
   return ((x>>2) & 15) ^ ((x>>3) & 31);
is compiled into
	movl	%edi, %eax
	shrl	$2, %eax
	andl	$15, %eax
	shrl	$3, %edi
	andl	$31, %edi
	xorl	%eax, %edi
	movl	%edi, %eax
I would have expected that the scheduler interleaves the subexpressions 
and would be able to get rid of the final move like this:
	movl	%edi, %eax
	shrl	$3, %edi    # modify source instead of copy, see above
	shrl	$2, %eax
	andl	$31, %edi
	andl	$15, %eax
	xorl	%edi, %eax    # we need %eax here


I think this is independent of the used high level language.
Is this known to the LLVM community?
May I help to correct this?

Best regards
Jasper

Andrew Trick

2014-Jan-21 22:48 UTC

head link

[LLVMdev] Scheduling quirks

On Jan 18, 2014, at 5:13 AM, Jasper Neumann <jn at sirrida.de> wrote:
> Hello all!
> 
> When I compile the following more or less stupid functions with
>  clang++ -O3 -S test.cpp
> ===>
> int test_register(int x) {
>  x ^= (x >> 2);
>  x ^= (x >> 3);
>  x = x ^ (x >> 4);
>  int y = x;  x >>= 5;  x ^= y;  // almost the same but explicit
>  return x;
>  }
> 
> int test_scheduler(int x) {
>  return ((x>>2) & 15) ^ ((x>>3) & 31);
>  }
> <==> ...I get the following result:
> ===>
> 	.file	"test.cpp"
> 	.text
> 	.globl	_Z13test_registeri
> 	.align	16, 0x90
> 	.type	_Z13test_registeri, at function
> _Z13test_registeri:                     # @_Z13test_registeri
> 	.cfi_startproc
> # BB#0:                                 # %entry
> 	movl	%edi, %eax
> 	sarl	$2, %eax
> 	xorl	%edi, %eax
> 	movl	%eax, %ecx
> 	sarl	$3, %ecx
> 	xorl	%eax, %ecx
> 	movl	%ecx, %edx
> 	sarl	$4, %edx
> 	xorl	%ecx, %edx
> 	movl	%edx, %eax
> 	sarl	$5, %eax
> 	xorl	%edx, %eax
> 	retq
> .Ltmp0:
> 	.size	_Z13test_registeri, .Ltmp0-_Z13test_registeri
> 	.cfi_endproc
> 
> 	.globl	_Z14test_scheduleri
> 	.align	16, 0x90
> 	.type	_Z14test_scheduleri, at function
> _Z14test_scheduleri:                    # @_Z14test_scheduleri
> 	.cfi_startproc
> # BB#0:                                 # %entry
> 	movl	%edi, %eax
> 	shrl	$2, %eax
> 	andl	$15, %eax
> 	shrl	$3, %edi
> 	andl	$31, %edi
> 	xorl	%eax, %edi
> 	movl	%edi, %eax
> 	retq
> .Ltmp1:
> 	.size	_Z14test_scheduleri, .Ltmp1-_Z14test_scheduleri
> 	.cfi_endproc
> 
> 
> 	.ident	"clang version 3.5 (trunk 199507)"
> 	.section	".note.GNU-stack","", at progbits
> <==> 
> Now once more in detail.
> 
> The lines
>  x ^= (x >> 2);
> and
>  x = x ^ (x >> 4);
> and (!)
>  int y = x;  x >>= 8;  x ^= y;  // almost the same but explicit
> are compiled into code like
> 	movl	%edi, %eax
> 	sarl	$2, %eax
> 	xorl	%edi, %eax
> As far as I know optimal for all x86 but the very latest 4th generation
Intel Core processors the following variant is better (2 instead of 3 cycles; I
proved this for e.g. Intel i7 920) because the first two lines can be executed
simultaneously:
> 	movl	%edi, %eax
> 	sarl	$2, %edi    # modify source instead of copy
> 	xorl	%edi, %eax
> Is there a special reason to do that this way?
> Interestingly most compilers including ICC and GCC show this strange
behavior. I had reported this in an Intel forum as well as for GCC a long time
ago but there has been no real reaction...
> Also, why are 4 registers used whereas 2 are sufficient?
> 
> 
> In the second function the line
>  return ((x>>2) & 15) ^ ((x>>3) & 31);
> is compiled into
> 	movl	%edi, %eax
> 	shrl	$2, %eax
> 	andl	$15, %eax
> 	shrl	$3, %edi
> 	andl	$31, %edi
> 	xorl	%eax, %edi
> 	movl	%edi, %eax
> I would have expected that the scheduler interleaves the subexpressions and
would be able to get rid of the final move like this:
> 	movl	%edi, %eax
> 	shrl	$3, %edi    # modify source instead of copy, see above
> 	shrl	$2, %eax
> 	andl	$31, %edi
> 	andl	$15, %eax
> 	xorl	%edi, %eax    # we need %eax here
> 
> 
> I think this is independent of the used high level language.
> Is this known to the LLVM community?
> May I help to correct this?
In your example, we're copying from/to argument/return registers, hence the
extra copies. Normally, it is the register coalescer's job to remove copies
but
- our register coalescer is not always very smart
- in this case, it intentionally leaves the physreg (arg/return) copies in place
to give the register allocator the most freedom.

The register allocator attempts to remove physreg copies by biasing allocation,
but the decisions are subject to the order of allocation which really amounts to
luck in cases like this.

We tend to not introduce complexity to optimize special cases involving multiple
physreg copies because
- we care more about code within loops
- the performance impact of register copies is very small

In these cases, I think it comes down to luck given the allocation order that we
choose within a block (I’m not sure how scheduling could affect it). Do you
still see the extra copies on LLVM trunk?

-Andy
> 
> Best regards
> Jasper
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Jasper Neumann

2014-Jan-22 18:03 UTC

head link

[LLVMdev] Scheduling quirks

Hello Andrew!

Referring:
int test_scheduler(int x) {
  return ((x>>2) & 15) ^ ((x>>3) & 31);
  }
=>
	movl	%edi, %eax
	shrl	$2, %eax
	andl	$15, %eax
	shrl	$3, %edi
	andl	$31, %edi
	xorl	%eax, %edi
	movl	%edi, %eax
instead of
	movl	%edi, %eax
	shrl	$3, %edi    # modify source instead of copy
	shrl	$2, %eax    # modifications interlaced
	andl	$31, %edi
	andl	$15, %eax
	xorl	%edi, %eax    # we need %eax here


 > In your example, we're copying from/to argument/return registers,
 > hence the extra copies.

This might be the reason for the final move.
However a simple peephole optimizer could catch this case.


 > [...] Do you still see the extra copies on LLVM trunk?

I retested this with a fresh checkout and compile of trunk (svn 199769) 
and got the same result.

BTW: Providing e.g. -march=atom helps a bit by sporting the expected 
interleaving for this example but IMHO this ought to be the case even 
without explicitely specifying a processor type.


The missed optimization regarding the modification of a copy occurs 
quite often and costs a cycle on almost all x86 processors.

Best regards
Jasper

llvm dev - Jan 2014 - [LLVMdev] Scheduling quirks

[LLVMdev] Scheduling quirks

[LLVMdev] Scheduling quirks

[LLVMdev] Scheduling quirks