Hi,

I am using MCJIT in LLVM 3.6 to JIT kernels to x86 AVX2. I've noticed some inefficient use of the stack around constant vectors. In one example, I have code that computes a series of constant vectors at compile time. Each vector has a single use. In the final asm, all of the constant vectors are spilled to the stack at the top of the function, and each use then reads its operand back from the stack slot.

Lots of these at the top of the function:

    movabsq $.LCPI0_212, %rbx
    vmovaps (%rbx), %ymm0
    vmovaps %ymm0, 2816(%rsp) # 32-byte Spill

Later on, each use references the stack pointer:

    vpaddd 2816(%rsp), %ymm4, %ymm1 # 32-byte Folded Reload

The spill to the stack seems unnecessary. In one particularly bad kernel, I have 128 8-wide constant vectors, so there is 4KB of stack use just for these constants. I think a better approach could be to load the constant vector pointers as needed:

    movabsq $.LCPI0_212, %rbx
    vpaddd (%rbx), %ymm4, %ymm1

Thanks,
Jason
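The 4KB figure above can be checked with a little arithmetic. The sketch below assumes 32-bit lanes (suggested by the use of `vpaddd` and the "32-byte Spill" comments, but not stated outright in the message); the function and constant names are made up for illustration.

```python
# Assumption: each "8-wide" vector holds eight 32-bit lanes, i.e. one
# 256-bit ymm register, which matches the "32-byte Spill" comments.
LANES = 8
LANE_BYTES = 4
NUM_CONSTANTS = 128  # number of constant vectors in the bad kernel


def spill_bytes(num_vectors, lanes=LANES, lane_bytes=LANE_BYTES):
    """Stack bytes needed if every constant vector gets its own spill slot."""
    return num_vectors * lanes * lane_bytes


total = spill_bytes(NUM_CONSTANTS)
print(total, "bytes =", total // 1024, "KiB of stack for constants")
```

With the folded form suggested at the end of the message, none of this stack space would be needed, since each use would read directly from the constant pool.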
Does anyone have any insight into this problem? Is there a way to minimize excessive spill/fill for this kind of scenario?

Thanks,
Jason

On Fri, May 6, 2016 at 10:44 AM, Jason <thesurprises at gmail.com> wrote:
> Hi, I am using mcjit in llvm 3.6 to jit kernels to x86 avx2. I've noticed
> some inefficient use of the stack around constant vectors.
> [...]
It sounds bad, but I can't tell from the posted info how to diagnose it. Can you post a (possibly reduced) example that demonstrates what you're seeing? A bug report would be even better, so we can track whether there are multiple problems:
https://llvm.org/bugs/

On Mon, May 9, 2016 at 3:41 PM, Jason via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Does anyone have any insight into this problem? Is there a way to minimize
> excessive spill/fill for this kind of scenario?
> Thanks,
> Jason
Quentin Colombet via llvm-dev
2016-May-09 22:09 UTC
[llvm-dev] Unnecessary spill/fill issue
Hi Jason,

I am guessing that the problem is that we do not recognize the sequence as rematerializable because we do not load LCPI0_212 directly into a ymm register.

One way to fix that is to use a pseudo instruction that loads the constant directly into a ymm register (while defining a dead GPR register so that the pseudo can be expanded), and then teach the folding code how to deal with it. Another option is to make the rematerialization smarter, but that is more complicated :).

Cheers,
-Quentin

> On May 9, 2016, at 2:41 PM, Jason via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Does anyone have any insight into this problem? Is there a way to minimize excessive spill/fill for this kind of scenario?
> Thanks,
> Jason
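The reasoning in Quentin's reply can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not LLVM's actual rematerialization logic: the `Instr` class, the rule in `is_trivially_rematerializable`, and the pseudo opcode name `LOAD_CONST_YMM_PSEUDO` are all hypothetical. The assumed rule is that a value can be recomputed at its use (instead of being spilled) only if the instruction defining it reads no other registers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    opcode: str
    defs: tuple             # registers written by the instruction
    reg_uses: tuple = ()    # registers read by the instruction
    const_uses: tuple = ()  # immediates / constant-pool symbols read


def is_trivially_rematerializable(instr):
    """Toy rule: recomputable anywhere only if it reads no registers."""
    return len(instr.reg_uses) == 0


def on_register_pressure(instr):
    """What the toy allocator does with the value this instruction defines."""
    return "rematerialize" if is_trivially_rematerializable(instr) else "spill"


# The two-instruction sequence from the thread: the vector load depends on
# %rbx, so in this model it cannot be recomputed in isolation.
load_addr = Instr("movabsq", defs=("rbx",), const_uses=(".LCPI0_212",))
load_vec = Instr("vmovaps", defs=("ymm0",), reg_uses=("rbx",))

# Hypothetical pseudo: takes the constant-pool symbol directly and also
# defines a dead scratch GPR that its later expansion can use for the address.
pseudo = Instr(
    "LOAD_CONST_YMM_PSEUDO",
    defs=("ymm0", "scratch_gpr"),
    const_uses=(".LCPI0_212",),
)

for instr in (load_addr, load_vec, pseudo):
    print(instr.opcode, "->", on_register_pressure(instr))
```

In this model the `vmovaps` result is spilled because it reads `%rbx`, while the single pseudo has no register inputs and can be recomputed at each use. Folding that recomputed load into the consuming instruction's memory operand would be the separate "teach the folding code" step.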