thr3ads.net - llvm dev - [LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW [Jul 2012]


Anthony Blake

2012-Jul-06 12:25 UTC

[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW

On Fri, Jul 6, 2012 at 6:39 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>
> On Jul 5, 2012, at 9:06 PM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
>
>> I've noticed that LLVM tends to generate suboptimal code and spill an
>> excessive amount of registers in large functions, such as in those
>> that are automatically generated by FFTW.
>
> One problem might be that we're forcing the 16 stores to the out array
> to happen in source order, which constrains the schedule. The stores are
> clearly non-aliasing.
>
>> LLVM generates good code for a function that computes an 8-point
>> complex FFT, but from 16-point upwards, icc or gcc generates much
>> better code. Here is an example of a sequence of instructions from a
>> 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE:
>>
>>        [...]
>>       movaps  32(%rdi), %xmm3
>>       movaps  48(%rdi), %xmm2
>>       movaps  %xmm3, %xmm1     ### <-- xmm3 mov'ed into xmm1
>>       movaps  %xmm3, %xmm4     ### <-- xmm3 mov'ed into xmm4
>>       addps   %xmm0, %xmm1
>>       movaps  %xmm1, -16(%rbp)        ## 16-byte Spill
>>       movaps  144(%rdi), %xmm3   ### <-- new data mov'ed into xmm3
>>        [...]
>>
>> xmm3 loaded, duplicated into 2 registers, and then discarded as other
>> data is loaded into it. Can anyone shed some light on why this might
>> be happening?
>
> I'm not actually seeing this behavior on trunk.
>
I've just tried trunk, and although behavior like above isn't
immediately obvious, trunk generates more instructions and spills more
registers compared to 3.1.

amb
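
To make the quoted remark about the stores concrete, here is a minimal C sketch of the general shape of a generated codelet: loads first, then arithmetic, then a run of stores. The function name `kernel4` and the 4-point add/subtract arithmetic are hypothetical and are not taken from FFTW or from the code discussed in this thread. The `restrict` qualifiers and the `assert` encode the assumption that the input and output arrays do not overlap.

```c
#include <assert.h>

/* Hypothetical 4-point add/subtract kernel in the style of a generated
   codelet: all loads first, then arithmetic, then a run of stores.
   `restrict` states that `in` and `out` do not overlap. */
void kernel4(const float *restrict in, float *restrict out)
{
    assert((const void *)in != (const void *)out); /* assumed precondition */

    float a = in[0], b = in[1], c = in[2], d = in[3];
    float s0 = a + c, d0 = a - c;
    float s1 = b + d, d1 = b - d;

    /* Each store writes a different element of `out`, so no store can
       alias another; the result is the same in any store order. */
    out[0] = s0 + s1;
    out[1] = d0 + d1;
    out[2] = s0 - s1;
    out[3] = d0 - d1;
}
```

Because the four stores hit distinct elements, emitting them in a different order would not change the result. Whether a given compiler version takes advantage of that freedom when scheduling is exactly the question raised in the quoted reply.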

Anthony Blake

2012-Jul-06 12:40 UTC


On Sat, Jul 7, 2012 at 12:25 AM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
> On Fri, Jul 6, 2012 at 6:39 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>> On Jul 5, 2012, at 9:06 PM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
>>>        [...]
>>>       movaps  32(%rdi), %xmm3
>>>       movaps  48(%rdi), %xmm2
>>>       movaps  %xmm3, %xmm1     ### <-- xmm3 mov'ed into xmm1
>>>       movaps  %xmm3, %xmm4     ### <-- xmm3 mov'ed into xmm4
>>>       addps   %xmm0, %xmm1
>>>       movaps  %xmm1, -16(%rbp)        ## 16-byte Spill
>>>       movaps  144(%rdi), %xmm3   ### <-- new data mov'ed into xmm3
>>>        [...]
>>>
>>> xmm3 loaded, duplicated into 2 registers, and then discarded as other
>>> data is loaded into it. Can anyone shed some light on why this might
>>> be happening?
>>
>> I'm not actually seeing this behavior on trunk.
>>
>
> I've just tried trunk, and although behavior like above isn't
> immediately obvious, trunk generates more instructions and spills more
> registers compared to 3.1.
>
Actually, here is an occurrence of that behavior when compiling the
code with trunk:

        [...]
        movaps  %xmm1, %xmm0      ###  xmm1 mov'ed to xmm0
        movaps  %xmm1, %xmm14     ###  xmm1 mov'ed to xmm14
        addps   %xmm7, %xmm0
        movaps  %xmm7, %xmm13
        movaps  %xmm0, %xmm1      ###  and now other data is mov'ed into xmm1,
                                  ###  making one of the above movaps superfluous
        [...]

amb

Anthony Blake

2012-Jul-06 13:00 UTC


On Sat, Jul 7, 2012 at 12:40 AM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
> On Sat, Jul 7, 2012 at 12:25 AM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
>> On Fri, Jul 6, 2012 at 6:39 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>>> On Jul 5, 2012, at 9:06 PM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
>>>>        [...]
>>>>       movaps  32(%rdi), %xmm3
>>>>       movaps  48(%rdi), %xmm2
>>>>       movaps  %xmm3, %xmm1     ### <-- xmm3 mov'ed into xmm1
>>>>       movaps  %xmm3, %xmm4     ### <-- xmm3 mov'ed into xmm4
>>>>       addps   %xmm0, %xmm1
>>>>       movaps  %xmm1, -16(%rbp)        ## 16-byte Spill
>>>>       movaps  144(%rdi), %xmm3   ### <-- new data mov'ed into xmm3
>>>>        [...]
>>>>
>>>> xmm3 loaded, duplicated into 2 registers, and then discarded as other
>>>> data is loaded into it. Can anyone shed some light on why this might
>>>> be happening?
>>>
>>> I'm not actually seeing this behavior on trunk.
>>>
>>
>> I've just tried trunk, and although behavior like above isn't
>> immediately obvious, trunk generates more instructions and spills more
>> registers compared to 3.1.
>>
>
> Actually, here is an occurrence of that behavior when compiling the
> code with trunk:
>
>         [...]
>         movaps  %xmm1, %xmm0      ###  xmm1 mov'ed to xmm0
>         movaps  %xmm1, %xmm14    ###  xmm1 mov'ed to xmm14
>         addps   %xmm7, %xmm0
>         movaps  %xmm7, %xmm13
>         movaps  %xmm0, %xmm1      ###  and now other data is mov'ed into xmm1,
>                                   ###  making one of the above movaps superfluous
>         [...]

As well as many occurrences in the above form, a similar form appears:

        [...]
        movaps  %xmm5, %xmm7
        movaps  %xmm7, %xmm3
        movaps  -96(%rsp), %xmm0        ## 16-byte Reload
        subps   %xmm0, %xmm3
        addps   %xmm0, %xmm7
        movaps  240(%rsp), %xmm0        ## 16-byte Reload
        movaps  -128(%rsp), %xmm1       ## 16-byte Reload
        movlhps %xmm0, %xmm1            ## xmm1 = xmm1[0],xmm0[0]
        movaps  %xmm8, %xmm4
        movaps  160(%rsp), %xmm5        ## 16-byte Reload
        [...]

Here the problem manifests with xmm3, xmm5 and xmm7, but in contrast to
the above case, there is now a data dependence between the first pair of
instructions.

amb
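
For readers following the listings, the `movaps` / `subps` / `addps` group above has the shape of a butterfly (a sum and a difference of the same two operands). A minimal sketch with SSE intrinsics follows; the function name `butterfly_ps` is hypothetical and the code is not taken from FFTW or from the function discussed in this thread. Since `addps` and `subps` overwrite their destination register, producing both results from the same input normally needs one register copy. Copies beyond that one are the kind being questioned here.

```c
#include <xmmintrin.h>  /* SSE intrinsics: _mm_add_ps, _mm_sub_ps */

/* Hypothetical radix-2 butterfly on four packed floats.
   Both results need the original value of *a. addps/subps are
   two-operand instructions that overwrite their destination, so a
   compiler targeting plain SSE typically emits one movaps copy of x. */
static inline void butterfly_ps(__m128 *a, __m128 *b)
{
    __m128 x = *a, y = *b;
    *a = _mm_add_ps(x, y);  /* sum:        x + y */
    *b = _mm_sub_ps(x, y);  /* difference: x - y */
}
```

Compiling a function built from many such butterflies with `-S` and counting `movaps` instructions and the `Spill` / `Reload` comments is one way to compare compiler versions, as done informally in this thread.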
