thr3ads.net - llvm dev - [LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences [Dec 2014]


Kuperstein, Michael M

2014-Dec-21 09:27 UTC

[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences

Which performance guidelines are you referring to?

I'm not that familiar with decade-old CPUs, but to the best of my knowledge,
this is not true on current hardware.
There is one specific circumstance where PUSHes should be avoided - for
Atom/Silvermont processors, the memory form of PUSH is inefficient, so the
register-freeing optimization below may not be profitable (see 14.3.3.6 and
15.3.1.2 in the Intel optimization reference manual).

Having said that, one distinct possibility is to have the heuristic make
different decisions depending on the optimization flags; that is, to be much
more aggressive for optsize functions.

From: Herbie Robinson [mailto:HerbieRobinson at verizon.net]
Sent: Sunday, December 21, 2014 10:58
To: Kuperstein, Michael M; LLVMdev at cs.uiuc.edu
Subject: Re: [LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call
sequences

According to the Intel performance guidelines, pushes are significantly slower
than moves to the extent they should be avoided as much as possible.  It's
been a decade since I was dealing with this; so, I don't remember the
numbers, but I'm pretty sure the changes you are proposing will slow the
code down.

People who care about speed more than code size are probably not going to like
this very much...


On 12/21/14 3:17 AM, Kuperstein, Michael M wrote:

Hello all,



In r223757 I've committed a patch that performs, for the 32-bit x86 calling
convention, the transformation of MOV instructions that push function arguments
onto the stack into actual PUSH instructions.



For example, it will transform this:

subl    $16, %esp
movl    $4, 12(%esp)
movl    $3, 8(%esp)
movl    $2, 4(%esp)
movl    $1, (%esp)
calll   _func
addl    $16, %esp



Into this:



pushl   $4
pushl   $3
pushl   $2
pushl   $1
calll   _func
addl    $16, %esp



The main motivation for this is code size (a "pushl $4" is 2 bytes, a
"movl $4, 12(%esp)" is 8 bytes), but there are some other advantages,
as shown below.
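The size argument can be checked with a small model. This is only a sketch: the function and constant names are made up for illustration, and the byte counts assume the usual 32-bit encodings with 8-bit immediates and displacements (by that count the mov with a displacement comes to 8 bytes, and the displacement-free "movl $1, (%esp)" to 7).

```c
#include <assert.h>
#include <stddef.h>

/* Encoded sizes in bytes for 32-bit x86, assuming every immediate and
   displacement that can use an 8-bit form does so. Illustrative names. */
enum {
    SIZE_PUSH_IMM8      = 2, /* 6A ib            pushl $imm8           */
    SIZE_MOV_IMM_ESP    = 7, /* C7 04 24 id      movl $imm, (%esp)     */
    SIZE_MOV_IMM_ESP_D8 = 8, /* C7 44 24 d8 id   movl $imm, d8(%esp)   */
    SIZE_ADJ_ESP_IMM8   = 3, /* 83 EC/C4 ib      subl/addl $imm8, %esp */
    SIZE_CALL_REL32     = 5  /* E8 cd            calll rel32           */
};

/* Bytes for: subl, one movl per argument, calll, addl. */
size_t mov_sequence_bytes(size_t nargs)
{
    assert(nargs <= 31); /* keeps 4*nargs within a signed 8-bit field */
    size_t bytes = SIZE_CALL_REL32;
    if (nargs > 0) {
        bytes += 2 * SIZE_ADJ_ESP_IMM8;             /* subl before, addl after */
        bytes += SIZE_MOV_IMM_ESP;                  /* slot at offset 0 */
        bytes += (nargs - 1) * SIZE_MOV_IMM_ESP_D8; /* slots at 4, 8, ... */
    }
    return bytes;
}

/* Bytes for: one pushl per argument, calll, addl. */
size_t push_sequence_bytes(size_t nargs)
{
    assert(nargs <= 31);
    size_t bytes = SIZE_CALL_REL32;
    if (nargs > 0) {
        bytes += nargs * SIZE_PUSH_IMM8;
        bytes += SIZE_ADJ_ESP_IMM8;                 /* caller clean-up only */
    }
    return bytes;
}
```

For the four-argument example above, this model gives 42 bytes for the mov sequence and 16 bytes for the push sequence.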

The way this works in r223757 is by intercepting call frame simplification in
the Prolog/Epilog Inserter, and replacing the mov sequence with pushes. Right
now it only handles functions which do not have a reserved call frame (a small
minority of cases), and I'd like to extend it to cover other cases where it
is profitable.



The currently implemented approach has a couple of drawbacks:



1) Push vs. having a reserved call frame:
This transformation is always profitable when we do not have a reserved call
frame. When a reserved frame can be used, however, there is a trade-off. For
example, in a function that contains only one call site, and no other stack
allocations, pushes are a clear win, since having a reserved call frame
wouldn't save any instructions. On the other hand, if a function contains 10
call sites, and only one of them can use pushes, then it is most probably a
loss: not reserving a call frame will cost us 10 add instructions, and the
pushes gain very little. I'd like to be able to make the decision on whether we
want to have a reserved frame or pushes by considering the entire function. I
don't think this can be done in the context of PEI.

Note that in theory we could have both a reserved call frame and have some
specific call sites in the function use pushes, but this is fairly tricky
because it requires much more precise tracking of the stack pointer state. That
is something I'm not planning to implement at this point.
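One way to picture the whole-function decision is as a simple size comparison. The sketch below is a hypothetical cost model, not what r223757 or any LLVM pass does: the struct names, the per-argument saving and the per-adjustment cost are all assumptions, and it ignores prologue/epilogue effects and anything speed-related.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one call site in a function. */
struct call_site {
    int      can_use_pushes; /* nonzero if its arguments can be pushed */
    unsigned nargs;          /* number of 4-byte stack arguments */
};

/* Hypothetical size parameters, in bytes. */
struct size_model {
    unsigned bytes_saved_per_arg; /* mov replaced by push */
    unsigned bytes_per_sp_adjust; /* one subl/addl on %esp */
};

/* Returns nonzero if giving up the reserved call frame is estimated to
   shrink the function. Without a reserved frame, a pushing call site
   needs an addl after the call, and a non-pushing one needs both a subl
   before and an addl after. */
int prefer_pushes(const struct call_site *sites, size_t n,
                  const struct size_model *m)
{
    unsigned long saved = 0, cost = 0;
    for (size_t i = 0; i < n; i++) {
        if (sites[i].nargs == 0)
            continue; /* nothing to set up or clean up */
        if (sites[i].can_use_pushes) {
            saved += (unsigned long)sites[i].nargs * m->bytes_saved_per_arg;
            cost  += m->bytes_per_sp_adjust;
        } else {
            cost  += 2ul * m->bytes_per_sp_adjust;
        }
    }
    return saved > cost;
}
```

With a single pushable call site the saving dominates; with ten call sites of which only one can use pushes, the extra stack adjustments outweigh it, matching the two cases described above.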



2) Register allocation inefficiency:
Ideally, pushes can be used to make direct memory-to-memory movs, freeing up
registers, and saving quite a lot of code.

For example, for this (this is obviously a constructed example, but code of this
kind does exist in the wild):



void foo(int a, int b, int c, int d, int e, int f, int g, int h);

struct st { int arr[8]; };

void bar(struct st* p)
{
  foo(p->arr[0], p->arr[1], p->arr[2], p->arr[3], p->arr[4],
      p->arr[5], p->arr[6], p->arr[7]);
}



We currently generate (with -m32 -O2) this:



        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ebx
        pushl   %edi
        pushl   %esi
        subl    $44, %esp
        movl    8(%ebp), %eax
        movl    28(%eax), %ecx
        movl    %ecx, -20(%ebp)         # 4-byte Spill
        movl    24(%eax), %edx
        movl    20(%eax), %esi
        movl    16(%eax), %edi
        movl    12(%eax), %ebx
        movl    8(%eax), %ecx
        movl    %ecx, -24(%ebp)         # 4-byte Spill
        movl    (%eax), %ecx
        movl    %ecx, -16(%ebp)         # 4-byte Spill
        movl    4(%eax), %eax
        movl    -20(%ebp), %ecx         # 4-byte Reload
        movl    %ecx, 28(%esp)
        movl    %edx, 24(%esp)
        movl    %esi, 20(%esp)
        movl    %edi, 16(%esp)
        movl    %ebx, 12(%esp)
        movl    -24(%ebp), %ecx         # 4-byte Reload
        movl    %ecx, 8(%esp)
        movl    %eax, 4(%esp)
        movl    -16(%ebp), %eax         # 4-byte Reload
        movl    %eax, (%esp)
        calll   _foo
        addl    $44, %esp
        popl    %esi
        popl    %edi
        popl    %ebx
        popl    %ebp
        retl



Which is fairly horrible.

Some parameters get mov-ed up to four times: a mov from the struct into a
register, a register spill, a reload, and finally a mov onto the stack.



What we'd like to generate is something like:

        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        pushl   28(%eax)
        pushl   24(%eax)
        pushl   20(%eax)
        pushl   16(%eax)
        pushl   12(%eax)
        pushl   8(%eax)
        pushl   4(%eax)
        pushl   (%eax)
        calll   _foo
        addl    $32, %esp
        popl    %ebp
        retl



To produce the code above, the transformation has to run pre-reg-alloc.
Otherwise, even if we fold loads into the push, it's too late to recover
from spills.



The direction I'd like to take with this is:



1) Add an X86-specific MachineFunctionPass that does the mov -> push
transformation and runs pre-reg-alloc.

It will:

* Make a decision on whether promoting some (or all) of the call sites to use
pushes is worth giving up on the reserved call frame.

* If it is, perform the mov -> push transformation for the selected call
sites.

* Fold loads into the pushes while doing the transformation.

As an alternative, I could try to teach the peephole optimizer to do it (right
now it won't even try to do this folding because PUSHes store to memory),
but getting it right in the general case seems complex.

I think I'd rather do folding of the simple (but common) cases on the fly.
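For a single call site, the core of the mov -> push step can be sketched as follows. This is a simplified model under stated assumptions, not the proposed pass: a real implementation works on MachineInstrs and also has to check that nothing else in the sequence reads or writes the stack pointer. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one "movl <value>, offset(%esp)" argument store. */
struct arg_store {
    unsigned offset; /* byte offset from %esp */
    int      value;  /* stand-in for the stored operand */
};

/* Decides whether n argument stores can become n pushes, and in which
   order. The stores must fill the slots 0, 4, ..., 4*(n-1) exactly once.
   On success returns 0 and sets push_order[i] to the index (into stores)
   of the i-th push; the highest offset is pushed first. Returns -1 if the
   stores cannot be converted. */
int plan_pushes(const struct arg_store *stores, size_t n, size_t *push_order)
{
    for (size_t i = 0; i < n; i++)
        push_order[i] = n; /* sentinel: slot not filled yet */

    for (size_t i = 0; i < n; i++) {
        unsigned off = stores[i].offset;
        if (off % 4 != 0 || off / 4 >= n)
            return -1; /* misaligned, or outside the argument area */
        size_t pos = n - 1 - off / 4;
        if (push_order[pos] != n)
            return -1; /* two stores to the same slot */
        push_order[pos] = i;
    }
    return 0; /* n distinct slots out of n: every slot is filled */
}
```

Because the stores may appear in any order in the original sequence, the plan is indexed by slot rather than by position, so the pushes always come out from the last argument to the first.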



2) Alter the semantics of ADJCALLSTACKDOWN/ADJCALLSTACKUP slightly.

Doing the mov->push transformation before PEI means I'd have to leave the
ADJCALLSTACKDOWN/UP pair unbalanced.



E.g. something like:

ADJCALLSTACKDOWN32 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use>
%vreg9<def,dead> = COPY %ESP; GR32:%vreg9
PUSH32rmm %vreg0, 1, %noreg, 28, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 24, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 20, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 16, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 12, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 8, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
PUSH32rmm %vreg0, 1, %noreg, 4, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
PUSH32rmm %vreg0<kill>, 1, %noreg, 0, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
CALLpcrel32 <ga:@foo>, <regmask>, %ESP<imp-use>, %ESP<imp-def>
ADJCALLSTACKUP32 32, 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use>



This, rightly, gets flagged by the verifier.

My proposal is to add an additional parameter to ADJCALLSTACKDOWN to express the
amount of adjustment the call sequence itself does. This is somewhat similar to
the second parameter of ADJCALLSTACKUP which allows adjustment for callee
stack-clean-up.

So, in this case, we will get an "ADJCALLSTACKDOWN32 32, 32" instead of
the "ADJCALLSTACKDOWN32 0". The verifier will be happy, and PEI will
know it doesn't need to do any stack pointer adjustment.
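The intended bookkeeping can be modelled roughly like this. It is only an illustration of the proposal as described here, under the assumption that the new operand counts the bytes the call sequence pushes itself; the struct, field and function names are hypothetical and do not correspond to actual verifier or PEI code.

```c
#include <assert.h>

/* Hypothetical summary of one call sequence between the two markers. */
struct call_frame_markers {
    unsigned down_amount;      /* first operand of ADJCALLSTACKDOWN */
    unsigned down_in_sequence; /* proposed second operand: bytes the
                                  sequence adjusts %esp by itself */
    unsigned up_amount;        /* first operand of ADJCALLSTACKUP */
    unsigned pushed_bytes;     /* bytes actually pushed in between */
};

/* A sequence is consistent when DOWN and UP agree on the frame size and
   the declared in-sequence adjustment matches the pushes. */
int markers_balanced(const struct call_frame_markers *m)
{
    return m->down_amount == m->up_amount
        && m->down_in_sequence == m->pushed_bytes
        && m->down_in_sequence <= m->down_amount;
}

/* Bytes still to be subtracted from %esp when the DOWN marker is
   lowered: the frame size minus what the pushes already cover. */
unsigned pei_remaining_adjust(const struct call_frame_markers *m)
{
    assert(markers_balanced(m));
    return m->down_amount - m->down_in_sequence;
}
```

In this model, "ADJCALLSTACKDOWN32 32, 32" followed by eight pushes and "ADJCALLSTACKUP32 32, 0" is balanced with nothing left to adjust, a mov-based sequence with "ADJCALLSTACKDOWN32 16, 0" still needs the full 16-byte subtraction, and the unbalanced dump shown above (DOWN 0, UP 32) is rejected.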



Does this sound like the right approach?



Any suggestions, as well as warnings of potential pitfalls, are welcome. :-)



Thanks,

   Michael


---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.




_______________________________________________

LLVM Developers mailing list

LLVMdev at cs.uiuc.edu<mailto:LLVMdev at cs.uiuc.edu>        
http://llvm.cs.uiuc.edu

http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev


Herbie Robinson

2014-Dec-21 22:11 UTC


[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences

On 12/21/14 4:27 AM, Kuperstein, Michael M wrote:
> Which performance guidelines are you referring to?

Table C-21 in "Intel® 64 and IA-32 Architectures Optimization Reference
Manual", September 2014.

It hasn't changed.  It still lists push and pop instructions as 2-3
times more expensive than mov.  And that's not taking into account any
optimizations that might get undone by the stack pointer changing. I'm
just speculating, but I suspect that move being faster has to do with
not having to modify a register every time.

With that as a basis, the fastest entry/exit sequences just use
subl/addl on the stack pointer and don't use push at all.  For most C
functions, you don't even have to materialize a frame pointer (if the
unwind mechanisms are set up to handle that).  Not that I am
recommending changing the x86_32 code generation to do that.


Caldarale, Charles R

2014-Dec-22 02:55 UTC


[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences

> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
> On Behalf Of Herbie Robinson
> Subject: Re: [LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences

> > On 12/21/14 4:27 AM, Kuperstein, Michael M wrote:
> > Which performance guidelines are you referring to?

> Table C-21 in "Intel(r) 64 and IA-32 Architectures Optimization Reference Manual", September 2014.
> It hasn't changed.  It still lists push and pop instructions as 2-3 times more expensive as mov.

And verified by Agner Fog's independent measurements:
http://www.agner.org/optimize/instruction_tables.pdf

The relevant Haswell numbers are on pages 186 - 187.

 -Chuck
