Anthony Blake
2012-Jul-06 12:25 UTC
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
On Fri, Jul 6, 2012 at 6:39 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>
> On Jul 5, 2012, at 9:06 PM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
>
>> I've noticed that LLVM tends to generate suboptimal code and spill an
>> excessive amount of registers in large functions, such as in those
>> that are automatically generated by FFTW.
>
> One problem might be that we're forcing the 16 stores to the out array to happen in source order, which constrains the schedule. The stores are clearly non-aliasing.
>
>> LLVM generates good code for a function that computes an 8-point
>> complex FFT, but from 16-point upwards, icc or gcc generates much
>> better code. Here is an example of a sequence of instructions from a
>> 32-point FFT, compiled with clang/LLVM 3.1 for x86_64 with SSE:
>>
>> [...]
>> movaps 32(%rdi), %xmm3
>> movaps 48(%rdi), %xmm2
>> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
>> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
>> addps %xmm0, %xmm1
>> movaps %xmm1, -16(%rbp) ## 16-byte Spill
>> movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3
>> [...]
>>
>> xmm3 loaded, duplicated into 2 registers, and then discarded as other
>> data is loaded into it. Can anyone shed some light on why this might
>> be happening?
>
> I'm not actually seeing this behavior on trunk.
>

I've just tried trunk, and although behavior like above isn't immediately obvious, trunk generates more instructions and spills more registers compared to 3.1.

amb
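The jump in spilling between the 8-point and the 16-point or 32-point transforms is consistent with register pressure growing past the 16 XMM registers that x86_64 provides. The following is only a toy model of that idea, not a description of how LLVM's allocator works; `live_range`, `max_pressure` and `min_spills` are hypothetical names made up for this sketch and do not come from the thread or from LLVM.

```c
#include <assert.h>
#include <stddef.h>

enum { XMM_REGS = 16 };  /* x86_64 SSE exposes xmm0..xmm15 */

/* Assumption for this sketch: a value is live from the instruction
 * that defines it up to and including its last use. */
typedef struct {
    int def;
    int last_use;
} live_range;

/* Peak number of simultaneously live values over instructions
 * 0 .. n_instrs-1 of a straight-line sequence. */
int max_pressure(const live_range *r, size_t n, int n_instrs)
{
    int peak = 0;
    for (int t = 0; t < n_instrs; ++t) {
        int live = 0;
        for (size_t i = 0; i < n; ++i) {
            assert(r[i].def <= r[i].last_use);
            if (r[i].def <= t && t <= r[i].last_use)
                ++live;
        }
        if (live > peak)
            peak = live;
    }
    return peak;
}

/* In this toy model, every live value beyond the register count
 * has to sit in memory at the point of peak pressure. */
int min_spills(int pressure)
{
    return pressure > XMM_REGS ? pressure - XMM_REGS : 0;
}
```

A real allocator also has to deal with copies, two-operand constraints and instruction ordering, so the spill count it produces can differ from this estimate. The sketch only shows why a longer straight-line function with more overlapping values is more likely to spill.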
Anthony Blake
2012-Jul-06 12:40 UTC
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
On Sat, Jul 7, 2012 at 12:25 AM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
> On Fri, Jul 6, 2012 at 6:39 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>> On Jul 5, 2012, at 9:06 PM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
>>> [...]
>>> movaps 32(%rdi), %xmm3
>>> movaps 48(%rdi), %xmm2
>>> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
>>> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
>>> addps %xmm0, %xmm1
>>> movaps %xmm1, -16(%rbp) ## 16-byte Spill
>>> movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3
>>> [...]
>>>
>>> xmm3 loaded, duplicated into 2 registers, and then discarded as other
>>> data is loaded into it. Can anyone shed some light on why this might
>>> be happening?
>>
>> I'm not actually seeing this behavior on trunk.
>>
>
> I've just tried trunk, and although behavior like above isn't
> immediately obvious, trunk generates more instructions and spills more
> registers compared to 3.1.
>

Actually, here is an occurrence of that behavior when compiling the code with trunk:

[...]
movaps %xmm1, %xmm0 ### xmm1 mov'ed to xmm0
movaps %xmm1, %xmm14 ### xmm1 mov'ed to xmm14
addps %xmm7, %xmm0
movaps %xmm7, %xmm13
movaps %xmm0, %xmm1 ### and now other data is mov'ed into xmm1, making one of the above movaps superfluous
[...]

amb
Anthony Blake
2012-Jul-06 13:00 UTC
[LLVMdev] Excessive register spilling in large automatically generated functions, such as is found in FFTW
On Sat, Jul 7, 2012 at 12:40 AM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
> On Sat, Jul 7, 2012 at 12:25 AM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
>> On Fri, Jul 6, 2012 at 6:39 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:
>>> On Jul 5, 2012, at 9:06 PM, Anthony Blake <amb33 at cs.waikato.ac.nz> wrote:
>>>> [...]
>>>> movaps 32(%rdi), %xmm3
>>>> movaps 48(%rdi), %xmm2
>>>> movaps %xmm3, %xmm1 ### <-- xmm3 mov'ed into xmm1
>>>> movaps %xmm3, %xmm4 ### <-- xmm3 mov'ed into xmm4
>>>> addps %xmm0, %xmm1
>>>> movaps %xmm1, -16(%rbp) ## 16-byte Spill
>>>> movaps 144(%rdi), %xmm3 ### <-- new data mov'ed into xmm3
>>>> [...]
>>>>
>>>> xmm3 loaded, duplicated into 2 registers, and then discarded as other
>>>> data is loaded into it. Can anyone shed some light on why this might
>>>> be happening?
>>>
>>> I'm not actually seeing this behavior on trunk.
>>>
>>
>> I've just tried trunk, and although behavior like above isn't
>> immediately obvious, trunk generates more instructions and spills more
>> registers compared to 3.1.
>>
>
> Actually, here is an occurrence of that behavior when compiling the
> code with trunk:
>
> [...]
> movaps %xmm1, %xmm0 ### xmm1 mov'ed to xmm0
> movaps %xmm1, %xmm14 ### xmm1 mov'ed to xmm14
> addps %xmm7, %xmm0
> movaps %xmm7, %xmm13
> movaps %xmm0, %xmm1 ### and now other data is mov'ed into xmm1,
> making one of the above movaps superfluous
> [...]

As well as many occurrences in the above form, a similar form appears:

[...]
movaps %xmm5, %xmm7
movaps %xmm7, %xmm3
movaps -96(%rsp), %xmm0 ## 16-byte Reload
subps %xmm0, %xmm3
addps %xmm0, %xmm7
movaps 240(%rsp), %xmm0 ## 16-byte Reload
movaps -128(%rsp), %xmm1 ## 16-byte Reload
movlhps %xmm0, %xmm1 ## xmm1 = xmm1[0],xmm0[0]
movaps %xmm8, %xmm4
movaps 160(%rsp), %xmm5 ## 16-byte Reload
[...]

Here the problem manifests with xmm3, 5 and 7, but in contrast to the above case, there is now data dependence in the first pair of instructions.

amb
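One possible reading of the last listing is that the `subps`/`addps` pair forms a butterfly: a difference and a sum computed from the same two inputs. SSE arithmetic instructions are two-operand and overwrite their destination, so a value that feeds both results needs at least one register copy first. The sketch below models that in plain C under those assumptions. `vec4` is a hypothetical stand-in for an XMM register, and `addps`/`subps` here are ordinary C functions named after the instructions, not intrinsics. Nothing in the thread confirms that this is the source pattern behind the listing.

```c
#include <assert.h>

/* Hypothetical stand-in for one 128-bit XMM register of four floats. */
typedef struct {
    float v[4];
} vec4;

/* Destructive two-operand forms, mirroring SSE: dst = dst + src. */
static void addps(vec4 *dst, const vec4 *src)
{
    for (int i = 0; i < 4; ++i)
        dst->v[i] += src->v[i];
}

/* dst = dst - src */
static void subps(vec4 *dst, const vec4 *src)
{
    for (int i = 0; i < 4; ++i)
        dst->v[i] -= src->v[i];
}

/* Butterfly: sum = a + b, diff = a - b, with a and b left intact. */
void butterfly(const vec4 *a, const vec4 *b, vec4 *sum, vec4 *diff)
{
    assert(sum != diff);
    *diff = *a;      /* copy, comparable to movaps %xmm7, %xmm3 */
    subps(diff, b);
    *sum = *a;       /* second copy, comparable to movaps %xmm5, %xmm7 */
    addps(sum, b);
}
```

In this model the second copy is needed only because `a` must survive the call. If `a` were dead afterwards, `addps` could overwrite it in place and one copy would go away. That may be the kind of saving the listing leaves on the table, since xmm5 is reloaded from the stack a few instructions later.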