thr3ads.net - llvm dev - [LLVMdev] RegAllocFast uses too much stack [Jul 2011]

If this information is useful, please help other people find it:
Share via:

Nick Lewycky

2011-Jul-11 20:48 UTC

[LLVMdev] RegAllocFast uses too much stack

I discovered recently that RegAllocFast spills all the registers before
every function call. This is the root cause of one of our recursive
functions that takes about 150 bytes of stack when built with gcc (same at
-O0 and -O2, or 120 bytes at llc -O2) taking 960 bytes of stack when built
by llc -O0. That's pretty bad for situations where you have small stacks,
which is not uncommon for threaded software on a 32-bit architecture, or a
signal handler.

I realize that some of this is intentional and we don't want to do
optimization at -O0, but I'm really hoping there's something we could
sensibly do to improve this. Here's an example:

  extern "C" void foo(int);

  void test() {
    foo(0);
    foo(1);
    foo(2);
  }

This doesn't just spill out all the registers to the stack before each call,
we also set up 0, 1 and 2 into regs first, then spill them and don't even
get a chance to reuse stack slots. That's just bad:

        pushq   %rax
        movl    $2, %edi
        movl    $1, %eax
        movl    $0, %ecx
        movl    %edi, 4(%rsp)           # 4-byte Spill
        movl    %ecx, %edi
        movl    %eax, (%rsp)            # 4-byte Spill
        callq   foo
        movl    (%rsp), %edi            # 4-byte Reload
        callq   foo
        movl    4(%rsp), %edi           # 4-byte Reload
        callq   foo
        popq    %rax
        ret

Does anyone have any ideas what we could do that doesn't add to the compile
time?

Nick
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20110711/c59efb24/attachment.html>

Rafael Ávila de Espíndola

2011-Jul-11 21:08 UTC

head link

[LLVMdev] RegAllocFast uses too much stack

> Does anyone have any ideas what we could do that doesn't add to the
> compile time?
There has been some discussion about it on PR10075.
> Nick
Cheers,
Rafael

Jakob Stoklund Olesen

2011-Jul-11 21:17 UTC

head link

[LLVMdev] RegAllocFast uses too much stack

On Jul 11, 2011, at 1:48 PM, Nick Lewycky wrote:
> I discovered recently that RegAllocFast spills all the registers before
every function call. This is the root cause of one of our recursive functions
that takes about 150 bytes of stack when built with gcc (same at -O0 and -O2, or
120 bytes at llc -O2) taking 960 bytes of stack when built by llc -O0.
That's pretty bad for situations where you have small stacks, which is not
uncommon for threaded software on a 32-bit architecture, or a signal handler.
> 
> I realize that some of this is intentional and we don't want to do
optimization at -O0, but I'm really hoping there's something we could
sensibly do to improve this. Here's an example:
> 
>   extern "C" void foo(int);
>   
>   void test() {
>     foo(0);
>     foo(1);
>     foo(2);
>   }
> 
> This doesn't just spill out all the registers to the stack before each
call, we also set up 0, 1 and 2 into regs first, then spill them and don't
even get a chance to reuse stack slots. That's just bad:
> 
>         pushq   %rax
>         movl    $2, %edi
>         movl    $1, %eax
>         movl    $0, %ecx
>         movl    %edi, 4(%rsp)           # 4-byte Spill
>         movl    %ecx, %edi
>         movl    %eax, (%rsp)            # 4-byte Spill
>         callq   foo
>         movl    (%rsp), %edi            # 4-byte Reload
>         callq   foo
>         movl    4(%rsp), %edi           # 4-byte Reload
>         callq   foo
>         popq    %rax
>         ret
> 
> Does anyone have any ideas what we could do that doesn't add to the
compile time?
We probably won't be having virtual registers live across calls. That
requires quite a bit of analysis which would be too expensive compile time wise.

Which means that we should avoid creating live ranges spanning calls. This is
what fast isel does now:

# After Instruction Selection:
# Machine code for function test:

BB#0: derived from LLVM BB %entry
        %vreg0<def> = MOV32ri 2; GR32:%vreg0
        %vreg2<def> = MOV32ri 1; GR32:%vreg2
        %vreg4<def> = MOV32ri 0; GR32:%vreg4
        ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>,
%RSP<imp-use>
        %EDI<def> = COPY %vreg4; GR32:%vreg4
        %AL<def> = MOV8ri 0
        CALL64pcrel32 <ga:@foo>, %AL, %EDI, %RAX<imp-def>,
%RDI<imp-def,dead>, %RSP<imp-use>, ...
        ADJCALLSTACKUP64 0, 0, %RSP<imp-def>, %EFLAGS<imp-def>,
%RSP<imp-use>
        %vreg5<def> = COPY %EAX; GR32:%vreg5
        ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>,
%RSP<imp-use>
        %EDI<def> = COPY %vreg2; GR32:%vreg2
        %AL<def> = MOV8ri 0
        CALL64pcrel32 <ga:@foo>, %AL, %EDI, %RAX<imp-def>,
%RDI<imp-def,dead>, %RSP<imp-use>, ...
        ADJCALLSTACKUP64 0, 0, %RSP<imp-def>, %EFLAGS<imp-def>,
%RSP<imp-use>
        %vreg3<def> = COPY %EAX; GR32:%vreg3
        ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>,
%RSP<imp-use>
        %EDI<def> = COPY %vreg0; GR32:%vreg0
        %AL<def> = MOV8ri 0
        CALL64pcrel32 <ga:@foo>, %AL, %EDI, %RAX<imp-def>,
%RDI<imp-def,dead>, %RSP<imp-use>, ...
        ADJCALLSTACKUP64 0, 0, %RSP<imp-def>, %EFLAGS<imp-def>,
%RSP<imp-use>
        %vreg1<def> = COPY %EAX; GR32:%vreg1
        RET

I don't know why it materializes those constants at the top of the block. I
think they should just be materialized immediately before they are used.

/jakob

Eric Christopher

2011-Jul-11 21:44 UTC

head link

[LLVMdev] RegAllocFast uses too much stack

On Jul 11, 2011, at 1:48 PM, Nick Lewycky wrote:
> I discovered recently that RegAllocFast spills all the registers before
every function call. This is the root cause of one of our recursive functions
that takes about 150 bytes of stack when built with gcc (same at -O0 and -O2, or
120 bytes at llc -O2) taking 960 bytes of stack when built by llc -O0.
That's pretty bad for situations where you have small stacks, which is not
uncommon for threaded software on a 32-bit architecture, or a signal handler.
> 
> I realize that some of this is intentional and we don't want to do
optimization at -O0, but I'm really hoping there's something we could
sensibly do to improve this. Here's an example:
> 
>   extern "C" void foo(int);
>   
>   void test() {
>     foo(0);
>     foo(1);
>     foo(2);
>   }
> 
> This doesn't just spill out all the registers to the stack before each
call, we also set up 0, 1 and 2 into regs first, then spill them and don't
even get a chance to reuse stack slots. That's just bad:
> 
>         pushq   %rax
>         movl    $2, %edi
>         movl    $1, %eax
>         movl    $0, %ecx
>         movl    %edi, 4(%rsp)           # 4-byte Spill
>         movl    %ecx, %edi
>         movl    %eax, (%rsp)            # 4-byte Spill
>         callq   foo
>         movl    (%rsp), %edi            # 4-byte Reload
>         callq   foo
>         movl    4(%rsp), %edi           # 4-byte Reload
>         callq   foo
>         popq    %rax
>         ret
> 
> Does anyone have any ideas what we could do that doesn't add to the
compile time?
This seems odd. I'd think that fast-isel should be able to materialize the
constants when we want them rather than at the beginning of the block.

-eric

Eli Friedman

2011-Jul-11 22:06 UTC

head link

[LLVMdev] RegAllocFast uses too much stack

On Mon, Jul 11, 2011 at 2:44 PM, Eric Christopher <echristo at apple.com>
wrote:>
> On Jul 11, 2011, at 1:48 PM, Nick Lewycky wrote:
>
>> I discovered recently that RegAllocFast spills all the registers before
every function call. This is the root cause of one of our recursive functions
that takes about 150 bytes of stack when built with gcc (same at -O0 and -O2, or
120 bytes at llc -O2) taking 960 bytes of stack when built by llc -O0.
That's pretty bad for situations where you have small stacks, which is not
uncommon for threaded software on a 32-bit architecture, or a signal handler.
>>
>> I realize that some of this is intentional and we don't want to do
optimization at -O0, but I'm really hoping there's something we could
sensibly do to improve this. Here's an example:
>>
>>   extern "C" void foo(int);
>>
>>   void test() {
>>     foo(0);
>>     foo(1);
>>     foo(2);
>>   }
>>
>> This doesn't just spill out all the registers to the stack before
each call, we also set up 0, 1 and 2 into regs first, then spill them and
don't even get a chance to reuse stack slots. That's just bad:
>>
>>         pushq   %rax
>>         movl    $2, %edi
>>         movl    $1, %eax
>>         movl    $0, %ecx
>>         movl    %edi, 4(%rsp)           # 4-byte Spill
>>         movl    %ecx, %edi
>>         movl    %eax, (%rsp)            # 4-byte Spill
>>         callq   foo
>>         movl    (%rsp), %edi            # 4-byte Reload
>>         callq   foo
>>         movl    4(%rsp), %edi           # 4-byte Reload
>>         callq   foo
>>         popq    %rax
>>         ret
>>
>> Does anyone have any ideas what we could do that doesn't add to the
compile time?
>
> This seems odd. I'd think that fast-isel should be able to materialize
the constants when we want them rather than at the beginning of the block.
I'm not entirely sure why, but FastISel does intentionally materialize
constants at the beginning of the block.  See
FastISel::enterLocalValueArea etc.  Maybe Dan knows why?

-Eli

Reasonably Related Threads

Search for more maybe matching threads

llvm dev - Jul 2011 - [LLVMdev] RegAllocFast uses too much stack

[LLVMdev] RegAllocFast uses too much stack

[LLVMdev] RegAllocFast uses too much stack

[LLVMdev] RegAllocFast uses too much stack

[LLVMdev] RegAllocFast uses too much stack

[LLVMdev] RegAllocFast uses too much stack

Reasonably Related Threads