Kuperstein, Michael M
2014-Dec-21 08:17 UTC
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
Hello all, In r223757 I've committed a patch that performs, for the 32-bit x86 calling convention, the transformation of MOV instructions that push function arguments onto the stack into actual PUSH instructions. For example, it will transform this: subl $16, %esp movl $4, 12(%esp) movl $3, 8(%esp) movl $2, 4(%esp) movl $1, (%esp) calll _func addl $16, %esp Into this: pushl $4 pushl $3 pushl $2 pushl $1 calll _func addl $16, %esp The main motivation for this is code size (a "pushl $4" is 2 bytes, a "movl $4, 12(%esp)" is 7 bytes), but there are some other advantages, as shown below. The way this works in r223757 is by intercepting call frame simplification in the Prolog/Epilog Inserter, and replacing the mov sequence with pushes. Right now it only handles functions which do not have a reserved call frame (a small minority of cases), and I'd like to extend it to cover other cases where it is profitable. The currently implemented approach has a couple of drawbacks: 1) Push vs. having a reserved call frame: This transformation is always profitable when we do not have a reserved call frame. When a reserved frame can be used, however, there is a trade-off. For example, in a function that contains only one call site, and no other stack allocations, pushes are a clear win, since having a reserved call frame wouldn't save any instructions. On the other hand, if a function contains 10 call sites, and only one of them can use pushes, then it is most probably a loss - not reserving a call frame will cost us 10 add instructions, and the pushes gain very little. I'd like to be able to make the decision on whether we want to have a reserved frame or pushes by considering the entire function. I don't think this can be done in the context of PEI. Note that in theory we could have both a reserved call frame and have some specific call sites in the function use pushes, but this is fairly tricky because it requires much more precise tracking of the stack pointer state. That is something I'm not planning to implement at this point. 2) Register allocation inefficiency: Ideally, pushes can be used to make direct memory-to-memory movs, freeing up registers, and saving quite a lot of code. For example, for this (this is obviously a constructed example, but code of this kind does exist in the wild): void foo(int a, int b, int c, int d, int e, int f, int g, int h); struct st { int arr[8]; }; void bar(struct st* p) { foo(p->arr[0], p->arr[1], p->arr[2], p->arr[3], p->arr[4], p->arr[5], p->arr[6], p->arr[7]); } We currently generate (with -m32 -O2) this: pushl %ebp movl %esp, %ebp pushl %ebx pushl %edi pushl %esi subl $44, %esp movl 8(%ebp), %eax movl 28(%eax), %ecx movl %ecx, -20(%ebp) # 4-byte Spill movl 24(%eax), %edx movl 20(%eax), %esi movl 16(%eax), %edi movl 12(%eax), %ebx movl 8(%eax), %ecx movl %ecx, -24(%ebp) # 4-byte Spill movl (%eax), %ecx movl %ecx, -16(%ebp) # 4-byte Spill movl 4(%eax), %eax movl -20(%ebp), %ecx # 4-byte Reload movl %ecx, 28(%esp) movl %edx, 24(%esp) movl %esi, 20(%esp) movl %edi, 16(%esp) movl %ebx, 12(%esp) movl -24(%ebp), %ecx # 4-byte Reload movl %ecx, 8(%esp) movl %eax, 4(%esp) movl -16(%ebp), %eax # 4-byte Reload movl %eax, (%esp) calll _foo addl $44, %esp popl %esi popl %edi popl %ebx popl %ebp retl Which is fairly horrible. Some parameters get mov-ed up to four times - a mov from the struct into a register, a register spill, a reload, and finally a mov onto the stack. What we'd like to generate is something like: pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax pushl 28(%eax) pushl 24(%eax) pushl 20(%eax) pushl 16(%eax) pushl 12(%eax) pushl 8(%eax) pushl 4(%eax) pushl (%eax) calll _foo addl $32, %esp popl %ebp retl To produce the code above, the transformation has to run pre-reg-alloc. Otherwise, even if we fold loads into the push, it's too late to recover from spills. The direction I'd like to take with this is: 1) Add an X86-specific MachineFunctionPass that does the mov -> push transformation and runs pre-reg-alloc. It will: * Make a decision on whether promoting some (or all) of the call sites to use pushes is worth giving up on the reserved call frame. * If it is, perform the mov ->push transformation for the selected call sites. * Fold loads into the pushes while doing the transformation. As an alternative, I could try to teach the peephole optimizer to do it (right now it won't even try to do this folding because PUSHes store to memory), but getting it right in the general case seems complex. I think I'd rather do folding of the simple (but common) cases on the fly. 2) Alter the semantics of ADJCALLSTACKDOWN/ADJCALLSTACKUP slightly. Doing the mov->push transformation before PEI means I'd have to leave the ADJCALLSTACKDOWN/UP pair unbalanced. E.g. something like: ADJCALLSTACKDOWN32 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use> %vreg9<def,dead> = COPY %ESP; GR32:%vreg9 PUSH32rmm %vreg0, 1, %noreg, 28, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 24, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 20, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 16, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 12, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 8, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 4, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0<kill>, 1, %noreg, 0, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 CALLpcrel32 <ga:@foo>, <regmask>, %ESP<imp-use>, %ESP<imp-def> ADJCALLSTACKUP32 32, 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use> This, rightly, gets flagged by the verifier. My proposal is to add an additional parameter to ADJCALLSTACKDOWN to express the amount of adjustment the call sequence itself does. This is somewhat similar to the second parameter of ADKCALLSTACKUP which allows adjustment for callee stack-clean-up. So, in this case, we will get a "ADJCALLSTACKDOWN32 32, 32" instead of the "ADJCALLSTACKDOWN32 0". The verifier will be happy, and PEI will know it doesn't need to do any stack pointer adjustment. Does this sound like the right approach? Any suggestions, as well as warnings of potential pitfalls, are welcome. :-) Thanks, Michael --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141221/bab25868/attachment.html>
Herbie Robinson
2014-Dec-21 08:58 UTC
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
According to the Intel performance guidelines, pushes are significantly slower than moves to the extent they should be avoided as much as possible. It's been a decade since I was dealing with this; so, I don't remember the numbers, but I'm pretty sure the changes you are proposing will slow the code down. People who care about speed more than code size are probably not going to like this very much... On 12/21/14 3:17 AM, Kuperstein, Michael M wrote:> > Hello all, > > In r223757 I’ve committed a patch that performs, for the 32-bit x86 > calling convention, the transformation of MOV instructions that push > function arguments onto the stack into actual PUSH instructions. > > For example, it will transform this: > > subl $16, %esp > > movl $4, 12(%esp) > > movl $3, 8(%esp) > > movl $2, 4(%esp) > > movl $1, (%esp) > > calll _func > > addl $16, %esp > > Into this: > > pushl $4 > > pushl $3 > > pushl $2 > > pushl $1 > > calll _func > > addl $16, %esp > > The main motivation for this is code size (a “pushl $4” is 2 bytes, a > “movl $4, 12(%esp)” is 7 bytes), but there are some other advantages, > as shown below. > > The way this works in r223757 is by intercepting call frame > simplification in the Prolog/Epilog Inserter, and replacing the mov > sequence with pushes. Right now it only handles functions which do not > have a reserved call frame (a small minority of cases), and I'd like > to extend it to cover other cases where it is profitable. > > The currently implemented approach has a couple of drawbacks: > > 1) Push vs. having a reserved call frame: > This transformation is always profitable when we do not have a > reserved call frame. When a reserved frame can be used, however, there > is a trade-off. For example, in a function that contains only one call > site, and no other stack allocations, pushes are a clear win, since > having a reserved call frame wouldn't save any instructions. On the > other hand, if a function contains 10 call sites, and only one of them > can use pushes, then it is most probably a loss – not reserving a call > frame will cost us 10 add instructions, and the pushes gain very > little. I’d like to be able to make the decision on whether we want to > have a reserved frame or pushes by considering the entire function. I > don't think this can be done in the context of PEI. > > Note that in theory we could have both a reserved call frame and have > some specific call sites in the function use pushes, but this is > fairly tricky because it requires much more precise tracking of the > stack pointer state. That is something I’m not planning to implement > at this point. > > 2) Register allocation inefficiency: > Ideally, pushes can be used to make direct memory-to-memory movs, > freeing up registers, and saving quite a lot of code. > > For example, for this (this is obviously a constructed example, but > code of this kind does exist in the wild): > > void foo(int a, int b, int c, int d, int e, int f, int g, int h); > > struct st { int arr[8]; }; > > void bar(struct st* p) > > { > > foo(p->arr[0], p->arr[1], p->arr[2], p->arr[3], p->arr[4], > p->arr[5], p->arr[6], p->arr[7]); } > > We currently generate (with -m32 -O2) this: > > pushl %ebp > > movl %esp, %ebp > > pushl %ebx > > pushl %edi > > pushl %esi > > subl $44, %esp > > movl 8(%ebp), %eax > > movl 28(%eax), %ecx > > movl %ecx, -20(%ebp) # 4-byte Spill > > movl 24(%eax), %edx > > movl 20(%eax), %esi > > movl 16(%eax), %edi > > movl 12(%eax), %ebx > > movl 8(%eax), %ecx > > movl %ecx, -24(%ebp) # 4-byte Spill > > movl (%eax), %ecx > > movl %ecx, -16(%ebp) # 4-byte Spill > > movl 4(%eax), %eax > > movl -20(%ebp), %ecx # 4-byte Reload > > movl %ecx, 28(%esp) > > movl %edx, 24(%esp) > > movl %esi, 20(%esp) > > movl %edi, 16(%esp) > > movl %ebx, 12(%esp) > > movl -24(%ebp), %ecx # 4-byte Reload > > movl %ecx, 8(%esp) > > movl %eax, 4(%esp) > > movl -16(%ebp), %eax # 4-byte Reload > > movl %eax, (%esp) > > calll _foo > > addl $44, %esp > > popl %esi > > popl %edi > > popl %ebx > > popl %ebp > > retl > > Which is fairly horrible. > > Some parameters get mov-ed up to four times - a mov from the struct > into a register, a register spill, a reload, and finally a mov onto > the stack. > > What we’d like to generate is something like: > > pushl %ebp > > movl %esp, %ebp > > movl 8(%ebp), %eax > > pushl 28(%eax) > > pushl 24(%eax) > > pushl 20(%eax) > > pushl 16(%eax) > > pushl 12(%eax) > > pushl 8(%eax) > > pushl 4(%eax) > > pushl (%eax) > > calll _foo > > addl $32, %esp > > popl %ebp > > retl > > To produce the code above, the transformation has to run > pre-reg-alloc. Otherwise, even if we fold loads into the push, it's > too late to recover from spills. > > The direction I'd like to take with this is: > > 1) Add an X86-specific MachineFunctionPass that does the mov -> push > transformation and runs pre-reg-alloc. > > It will: > > * Make a decision on whether promoting some (or all) of the call sites > to use pushes is worth giving up on the reserved call frame. > > * If it is, perform the mov ->push transformation for the selected > call sites. > > * Fold loads into the pushes while doing the transformation. > > As an alternative, I could try to teach the peephole optimizer to do > it (right now it won't even try to do this folding because PUSHes > store to memory), but getting it right in the general case seems complex. > > I think I'd rather do folding of the simple (but common) cases on the fly. > > 2) Alter the semantics of ADJCALLSTACKDOWN/ADJCALLSTACKUP slightly. > > Doing the mov->push transformation before PEI means I'd have to leave > the ADJCALLSTACKDOWN/UP pair unbalanced. > > E.g. something like: > > ADJCALLSTACKDOWN32 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use> > > %vreg9<def,dead> = COPY %ESP; GR32:%vreg9 > > PUSH32rmm %vreg0, 1, %noreg, 28, %noreg, %ESP<imp-def>, %ESP<imp-use>; > GR32:%vreg0 > > PUSH32rmm %vreg0, 1, %noreg, 24, %noreg, %ESP<imp-def>, %ESP<imp-use>; > GR32:%vreg0 > > PUSH32rmm %vreg0, 1, %noreg, 20, %noreg, %ESP<imp-def>, %ESP<imp-use>; > GR32:%vreg0 > > PUSH32rmm %vreg0, 1, %noreg, 16, %noreg, %ESP<imp-def>, %ESP<imp-use>; > GR32:%vreg0 > > PUSH32rmm %vreg0, 1, %noreg, 12, %noreg, %ESP<imp-def>, %ESP<imp-use>; > GR32:%vreg0 > > PUSH32rmm %vreg0, 1, %noreg, 8, %noreg, %ESP<imp-def>, %ESP<imp-use>; > GR32:%vreg0 > > PUSH32rmm %vreg0, 1, %noreg, 4, %noreg, %ESP<imp-def>, %ESP<imp-use>; > GR32:%vreg0 > > PUSH32rmm %vreg0<kill>, 1, %noreg, 0, %noreg, %ESP<imp-def>, > %ESP<imp-use>; GR32:%vreg0 > > CALLpcrel32 <ga:@foo>, <regmask>, %ESP<imp-use>, %ESP<imp-def> > > ADJCALLSTACKUP32 32, 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, > %ESP<imp-use> > > This, rightly, gets flagged by the verifier. > > My proposal is to add an additional parameter to ADJCALLSTACKDOWN to > express the amount of adjustment the call sequence itself does. This > is somewhat similar to the second parameter of ADKCALLSTACKUP which > allows adjustment for callee stack-clean-up. > > So, in this case, we will get a "ADJCALLSTACKDOWN32 32, 32" instead of > the “ADJCALLSTACKDOWN32 0”. The verifier will be happy, and PEI will > know it doesn't need to do any stack pointer adjustment. > > Does this sound like the right approach? > > Any suggestions, as well as warnings of potential pitfalls, are > welcome. :-) > > Thanks, > > Michael > > --------------------------------------------------------------------- > Intel Israel (74) Limited > > This e-mail and any attachments may contain confidential material for > the sole use of the intended recipient(s). Any review or distribution > by others is strictly prohibited. If you are not the intended > recipient, please contact the sender and delete all copies. > > > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141221/e88cc307/attachment.html>
David Chisnall
2014-Dec-21 09:05 UTC
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
Hi Michael, You're probably in a better position than anyone to check this: My recollection is that the microcode on some Intel processors used hidden registers that are integrated into the register renaming pipeline for the top few stack slots, but that all of these are pushed into the store buffer and become real writes if the stack pointer is moved or the stack-top cache line is subject to any coherency messages. This means that esp-relative movs are significantly cheaper than pushs and pops. Is this still true with any modern x86 processors? (Or was it never true and I misremember) David On 21 Dec 2014, at 09:17, Kuperstein, Michael M <michael.m.kuperstein at intel.com> wrote:> Hello all, > > In r223757 I’ve committed a patch that performs, for the 32-bit x86 calling convention, the transformation of MOV instructions that push function arguments onto the stack into actual PUSH instructions. > > For example, it will transform this: > subl $16, %esp > movl $4, 12(%esp) > movl $3, 8(%esp) > movl $2, 4(%esp) > movl $1, (%esp) > calll _func > addl $16, %esp > > Into this: > > pushl $4 > pushl $3 > pushl $2 > pushl $1 > calll _func > addl $16, %esp > > The main motivation for this is code size (a “pushl $4” is 2 bytes, a “movl $4, 12(%esp)” is 7 bytes), but there are some other advantages, as shown below. > The way this works in r223757 is by intercepting call frame simplification in the Prolog/Epilog Inserter, and replacing the mov sequence with pushes. Right now it only handles functions which do not have a reserved call frame (a small minority of cases), and I'd like to extend it to cover other cases where it is profitable. > > The currently implemented approach has a couple of drawbacks: > > 1) Push vs. having a reserved call frame: > This transformation is always profitable when we do not have a reserved call frame. When a reserved frame can be used, however, there is a trade-off. For example, in a function that contains only one call site, and no other stack allocations, pushes are a clear win, since having a reserved call frame wouldn't save any instructions. On the other hand, if a function contains 10 call sites, and only one of them can use pushes, then it is most probably a loss – not reserving a call frame will cost us 10 add instructions, and the pushes gain very little. I’d like to be able to make the decision on whether we want to have a reserved frame or pushes by considering the entire function. I don't think this can be done in the context of PEI. > Note that in theory we could have both a reserved call frame and have some specific call sites in the function use pushes, but this is fairly tricky because it requires much more precise tracking of the stack pointer state. That is something I’m not planning to implement at this point. > > 2) Register allocation inefficiency: > Ideally, pushes can be used to make direct memory-to-memory movs, freeing up registers, and saving quite a lot of code. > For example, for this (this is obviously a constructed example, but code of this kind does exist in the wild): > > void foo(int a, int b, int c, int d, int e, int f, int g, int h); > > struct st { int arr[8]; }; > > void bar(struct st* p) > { > foo(p->arr[0], p->arr[1], p->arr[2], p->arr[3], p->arr[4], p->arr[5], p->arr[6], p->arr[7]); } > > We currently generate (with -m32 -O2) this: > > pushl %ebp > movl %esp, %ebp > pushl %ebx > pushl %edi > pushl %esi > subl $44, %esp > movl 8(%ebp), %eax > movl 28(%eax), %ecx > movl %ecx, -20(%ebp) # 4-byte Spill > movl 24(%eax), %edx > movl 20(%eax), %esi > movl 16(%eax), %edi > movl 12(%eax), %ebx > movl 8(%eax), %ecx > movl %ecx, -24(%ebp) # 4-byte Spill > movl (%eax), %ecx > movl %ecx, -16(%ebp) # 4-byte Spill > movl 4(%eax), %eax > movl -20(%ebp), %ecx # 4-byte Reload > movl %ecx, 28(%esp) > movl %edx, 24(%esp) > movl %esi, 20(%esp) > movl %edi, 16(%esp) > movl %ebx, 12(%esp) > movl -24(%ebp), %ecx # 4-byte Reload > movl %ecx, 8(%esp) > movl %eax, 4(%esp) > movl -16(%ebp), %eax # 4-byte Reload > movl %eax, (%esp) > calll _foo > addl $44, %esp > popl %esi > popl %edi > popl %ebx > popl %ebp > retl > > Which is fairly horrible. > Some parameters get mov-ed up to four times - a mov from the struct into a register, a register spill, a reload, and finally a mov onto the stack. > > What we’d like to generate is something like: > pushl %ebp > movl %esp, %ebp > movl 8(%ebp), %eax > pushl 28(%eax) > pushl 24(%eax) > pushl 20(%eax) > pushl 16(%eax) > pushl 12(%eax) > pushl 8(%eax) > pushl 4(%eax) > pushl (%eax) > calll _foo > addl $32, %esp > popl %ebp > retl > > To produce the code above, the transformation has to run pre-reg-alloc. Otherwise, even if we fold loads into the push, it's too late to recover from spills. > > The direction I'd like to take with this is: > > 1) Add an X86-specific MachineFunctionPass that does the mov -> push transformation and runs pre-reg-alloc. > It will: > * Make a decision on whether promoting some (or all) of the call sites to use pushes is worth giving up on the reserved call frame. > * If it is, perform the mov ->push transformation for the selected call sites. > * Fold loads into the pushes while doing the transformation. > As an alternative, I could try to teach the peephole optimizer to do it (right now it won't even try to do this folding because PUSHes store to memory), but getting it right in the general case seems complex. > I think I'd rather do folding of the simple (but common) cases on the fly. > > 2) Alter the semantics of ADJCALLSTACKDOWN/ADJCALLSTACKUP slightly. > Doing the mov->push transformation before PEI means I'd have to leave the ADJCALLSTACKDOWN/UP pair unbalanced. > > E.g. something like: > ADJCALLSTACKDOWN32 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use> > %vreg9<def,dead> = COPY %ESP; GR32:%vreg9 > PUSH32rmm %vreg0, 1, %noreg, 28, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 > PUSH32rmm %vreg0, 1, %noreg, 24, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 > PUSH32rmm %vreg0, 1, %noreg, 20, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 > PUSH32rmm %vreg0, 1, %noreg, 16, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 > PUSH32rmm %vreg0, 1, %noreg, 12, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 > PUSH32rmm %vreg0, 1, %noreg, 8, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 > PUSH32rmm %vreg0, 1, %noreg, 4, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 > PUSH32rmm %vreg0<kill>, 1, %noreg, 0, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 > CALLpcrel32 <ga:@foo>, <regmask>, %ESP<imp-use>, %ESP<imp-def> > ADJCALLSTACKUP32 32, 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use> > > This, rightly, gets flagged by the verifier. > My proposal is to add an additional parameter to ADJCALLSTACKDOWN to express the amount of adjustment the call sequence itself does. This is somewhat similar to the second parameter of ADKCALLSTACKUP which allows adjustment for callee stack-clean-up. > So, in this case, we will get a "ADJCALLSTACKDOWN32 32, 32" instead of the “ADJCALLSTACKDOWN32 0”. The verifier will be happy, and PEI will know it doesn't need to do any stack pointer adjustment. > > Does this sound like the right approach? > > Any suggestions, as well as warnings of potential pitfalls, are welcome. :-) > > Thanks, > Michael > > --------------------------------------------------------------------- > Intel Israel (74) Limited > > This e-mail and any attachments may contain confidential material for > the sole use of the intended recipient(s). Any review or distribution > by others is strictly prohibited. If you are not the intended > recipient, please contact the sender and delete all copies. > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Kuperstein, Michael M
2014-Dec-21 09:27 UTC
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
Which performance guidelines are you referring to? I'm not that familiar with decade-old CPUs, but to the best of my knowledge, this is not true on current hardware. There is one specific circumstance where PUSHes should be avoided - for Atom/Silvermont processors, the memory form of PUSH is inefficient, so the register-freeing optimization below may not be profitable (see 14.3.3.6 and 15.3.1.2 in the Intel optimization reference manual). Having said that, one distinct possibility is to have the heuristic make different decisions depending on the optimization flags, that is be much more aggressive for optsize functions. From: Herbie Robinson [mailto:HerbieRobinson at verizon.net] Sent: Sunday, December 21, 2014 10:58 To: Kuperstein, Michael M; LLVMdev at cs.uiuc.edu Subject: Re: [LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences According to the Intel performance guidelines, pushes are significantly slower than moves to the extent they should be avoided as much as possible. It's been a decade since I was dealing with this; so, I don't remember the numbers, but I'm pretty sure the changes you are proposing will slow the code down. People who care about speed more than code size are probably not going to like this very much... On 12/21/14 3:17 AM, Kuperstein, Michael M wrote: Hello all, In r223757 I've committed a patch that performs, for the 32-bit x86 calling convention, the transformation of MOV instructions that push function arguments onto the stack into actual PUSH instructions. For example, it will transform this: subl $16, %esp movl $4, 12(%esp) movl $3, 8(%esp) movl $2, 4(%esp) movl $1, (%esp) calll _func addl $16, %esp Into this: pushl $4 pushl $3 pushl $2 pushl $1 calll _func addl $16, %esp The main motivation for this is code size (a "pushl $4" is 2 bytes, a "movl $4, 12(%esp)" is 7 bytes), but there are some other advantages, as shown below. The way this works in r223757 is by intercepting call frame simplification in the Prolog/Epilog Inserter, and replacing the mov sequence with pushes. Right now it only handles functions which do not have a reserved call frame (a small minority of cases), and I'd like to extend it to cover other cases where it is profitable. The currently implemented approach has a couple of drawbacks: 1) Push vs. having a reserved call frame: This transformation is always profitable when we do not have a reserved call frame. When a reserved frame can be used, however, there is a trade-off. For example, in a function that contains only one call site, and no other stack allocations, pushes are a clear win, since having a reserved call frame wouldn't save any instructions. On the other hand, if a function contains 10 call sites, and only one of them can use pushes, then it is most probably a loss - not reserving a call frame will cost us 10 add instructions, and the pushes gain very little. I'd like to be able to make the decision on whether we want to have a reserved frame or pushes by considering the entire function. I don't think this can be done in the context of PEI. Note that in theory we could have both a reserved call frame and have some specific call sites in the function use pushes, but this is fairly tricky because it requires much more precise tracking of the stack pointer state. That is something I'm not planning to implement at this point. 2) Register allocation inefficiency: Ideally, pushes can be used to make direct memory-to-memory movs, freeing up registers, and saving quite a lot of code. For example, for this (this is obviously a constructed example, but code of this kind does exist in the wild): void foo(int a, int b, int c, int d, int e, int f, int g, int h); struct st { int arr[8]; }; void bar(struct st* p) { foo(p->arr[0], p->arr[1], p->arr[2], p->arr[3], p->arr[4], p->arr[5], p->arr[6], p->arr[7]); } We currently generate (with -m32 -O2) this: pushl %ebp movl %esp, %ebp pushl %ebx pushl %edi pushl %esi subl $44, %esp movl 8(%ebp), %eax movl 28(%eax), %ecx movl %ecx, -20(%ebp) # 4-byte Spill movl 24(%eax), %edx movl 20(%eax), %esi movl 16(%eax), %edi movl 12(%eax), %ebx movl 8(%eax), %ecx movl %ecx, -24(%ebp) # 4-byte Spill movl (%eax), %ecx movl %ecx, -16(%ebp) # 4-byte Spill movl 4(%eax), %eax movl -20(%ebp), %ecx # 4-byte Reload movl %ecx, 28(%esp) movl %edx, 24(%esp) movl %esi, 20(%esp) movl %edi, 16(%esp) movl %ebx, 12(%esp) movl -24(%ebp), %ecx # 4-byte Reload movl %ecx, 8(%esp) movl %eax, 4(%esp) movl -16(%ebp), %eax # 4-byte Reload movl %eax, (%esp) calll _foo addl $44, %esp popl %esi popl %edi popl %ebx popl %ebp retl Which is fairly horrible. Some parameters get mov-ed up to four times - a mov from the struct into a register, a register spill, a reload, and finally a mov onto the stack. What we'd like to generate is something like: pushl %ebp movl %esp, %ebp movl 8(%ebp), %eax pushl 28(%eax) pushl 24(%eax) pushl 20(%eax) pushl 16(%eax) pushl 12(%eax) pushl 8(%eax) pushl 4(%eax) pushl (%eax) calll _foo addl $32, %esp popl %ebp retl To produce the code above, the transformation has to run pre-reg-alloc. Otherwise, even if we fold loads into the push, it's too late to recover from spills. The direction I'd like to take with this is: 1) Add an X86-specific MachineFunctionPass that does the mov -> push transformation and runs pre-reg-alloc. It will: * Make a decision on whether promoting some (or all) of the call sites to use pushes is worth giving up on the reserved call frame. * If it is, perform the mov ->push transformation for the selected call sites. * Fold loads into the pushes while doing the transformation. As an alternative, I could try to teach the peephole optimizer to do it (right now it won't even try to do this folding because PUSHes store to memory), but getting it right in the general case seems complex. I think I'd rather do folding of the simple (but common) cases on the fly. 2) Alter the semantics of ADJCALLSTACKDOWN/ADJCALLSTACKUP slightly. Doing the mov->push transformation before PEI means I'd have to leave the ADJCALLSTACKDOWN/UP pair unbalanced. E.g. something like: ADJCALLSTACKDOWN32 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use> %vreg9<def,dead> = COPY %ESP; GR32:%vreg9 PUSH32rmm %vreg0, 1, %noreg, 28, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 24, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 20, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 16, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 12, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 8, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 4, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0<kill>, 1, %noreg, 0, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 CALLpcrel32 <ga:@foo>, <regmask>, %ESP<imp-use>, %ESP<imp-def> ADJCALLSTACKUP32 32, 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use> This, rightly, gets flagged by the verifier. My proposal is to add an additional parameter to ADJCALLSTACKDOWN to express the amount of adjustment the call sequence itself does. This is somewhat similar to the second parameter of ADKCALLSTACKUP which allows adjustment for callee stack-clean-up. So, in this case, we will get a "ADJCALLSTACKDOWN32 32, 32" instead of the "ADJCALLSTACKDOWN32 0". The verifier will be happy, and PEI will know it doesn't need to do any stack pointer adjustment. Does this sound like the right approach? Any suggestions, as well as warnings of potential pitfalls, are welcome. :-) Thanks, Michael --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. _______________________________________________ LLVM Developers mailing list LLVMdev at cs.uiuc.edu<mailto:LLVMdev at cs.uiuc.edu> http://llvm.cs.uiuc.edu http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev --------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141221/5920f6f8/attachment.html>
Kuperstein, Michael M
2014-Dec-21 09:53 UTC
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
Hi David, I'm not that familiar with the micro-architecture (and even if I were, I doubt I could comment on this sort of detail :-) ), but I don't believe there is a penalty on modern x86. In any case, if it turns out this will introduce performance regressions, either for this or for another reason, then we can enable it for optsize only. I don't expect that to happen, but it's a possible fallback. I'll do the benchmarking once I have a sane prototype. Michael -----Original Message----- From: Dr D. Chisnall [mailto:dc552 at hermes.cam.ac.uk] On Behalf Of David Chisnall Sent: Sunday, December 21, 2014 11:05 To: Kuperstein, Michael M Cc: LLVMdev at cs.uiuc.edu Subject: Re: [LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences Hi Michael, You're probably in a better position than anyone to check this: My recollection is that the microcode on some Intel processors used hidden registers that are integrated into the register renaming pipeline for the top few stack slots, but that all of these are pushed into the store buffer and become real writes if the stack pointer is moved or the stack-top cache line is subject to any coherency messages. This means that esp-relative movs are significantly cheaper than pushs and pops. Is this still true with any modern x86 processors? (Or was it never true and I misremember) David On 21 Dec 2014, at 09:17, Kuperstein, Michael M <michael.m.kuperstein at intel.com> wrote:> Hello all, > > In r223757 I've committed a patch that performs, for the 32-bit x86 calling convention, the transformation of MOV instructions that push function arguments onto the stack into actual PUSH instructions. > > For example, it will transform this: > subl $16, %esp > movl $4, 12(%esp) > movl $3, 8(%esp) > movl $2, 4(%esp) > movl $1, (%esp) > calll _func > addl $16, %esp > > Into this: > > pushl $4 > pushl $3 > pushl $2 > pushl $1 > calll _func > addl $16, %esp > > The main motivation for this is code size (a "pushl $4" is 2 bytes, a "movl $4, 12(%esp)" is 7 bytes), but there are some other advantages, as shown below. > The way this works in r223757 is by intercepting call frame simplification in the Prolog/Epilog Inserter, and replacing the mov sequence with pushes. Right now it only handles functions which do not have a reserved call frame (a small minority of cases), and I'd like to extend it to cover other cases where it is profitable. > > The currently implemented approach has a couple of drawbacks: > > 1) Push vs. having a reserved call frame: > This transformation is always profitable when we do not have a reserved call frame. When a reserved frame can be used, however, there is a trade-off. For example, in a function that contains only one call site, and no other stack allocations, pushes are a clear win, since having a reserved call frame wouldn't save any instructions. On the other hand, if a function contains 10 call sites, and only one of them can use pushes, then it is most probably a loss - not reserving a call frame will cost us 10 add instructions, and the pushes gain very little. I'd like to be able to make the decision on whether we want to have a reserved frame or pushes by considering the entire function. I don't think this can be done in the context of PEI. > Note that in theory we could have both a reserved call frame and have some specific call sites in the function use pushes, but this is fairly tricky because it requires much more precise tracking of the stack pointer state. That is something I'm not planning to implement at this point. > > 2) Register allocation inefficiency: > Ideally, pushes can be used to make direct memory-to-memory movs, freeing up registers, and saving quite a lot of code. > For example, for this (this is obviously a constructed example, but code of this kind does exist in the wild): > > void foo(int a, int b, int c, int d, int e, int f, int g, int h); > > struct st { int arr[8]; }; > > void bar(struct st* p) > { > foo(p->arr[0], p->arr[1], p->arr[2], p->arr[3], p->arr[4], > p->arr[5], p->arr[6], p->arr[7]); } > > We currently generate (with -m32 -O2) this: > > pushl %ebp > movl %esp, %ebp > pushl %ebx > pushl %edi > pushl %esi > subl $44, %esp > movl 8(%ebp), %eax > movl 28(%eax), %ecx > movl %ecx, -20(%ebp) # 4-byte Spill > movl 24(%eax), %edx > movl 20(%eax), %esi > movl 16(%eax), %edi > movl 12(%eax), %ebx > movl 8(%eax), %ecx > movl %ecx, -24(%ebp) # 4-byte Spill > movl (%eax), %ecx > movl %ecx, -16(%ebp) # 4-byte Spill > movl 4(%eax), %eax > movl -20(%ebp), %ecx # 4-byte Reload > movl %ecx, 28(%esp) > movl %edx, 24(%esp) > movl %esi, 20(%esp) > movl %edi, 16(%esp) > movl %ebx, 12(%esp) > movl -24(%ebp), %ecx # 4-byte Reload > movl %ecx, 8(%esp) > movl %eax, 4(%esp) > movl -16(%ebp), %eax # 4-byte Reload > movl %eax, (%esp) > calll _foo > addl $44, %esp > popl %esi > popl %edi > popl %ebx > popl %ebp > retl > > Which is fairly horrible. > Some parameters get mov-ed up to four times - a mov from the struct into a register, a register spill, a reload, and finally a mov onto the stack. > > What we'd like to generate is something like: > pushl %ebp > movl %esp, %ebp > movl 8(%ebp), %eax > pushl 28(%eax) > pushl 24(%eax) > pushl 20(%eax) > pushl 16(%eax) > pushl 12(%eax) > pushl 8(%eax) > pushl 4(%eax) > pushl (%eax) > calll _foo > addl $32, %esp > popl %ebp > retl > > To produce the code above, the transformation has to run pre-reg-alloc. Otherwise, even if we fold loads into the push, it's too late to recover from spills. > > The direction I'd like to take with this is: > > 1) Add an X86-specific MachineFunctionPass that does the mov -> push transformation and runs pre-reg-alloc. > It will: > * Make a decision on whether promoting some (or all) of the call sites to use pushes is worth giving up on the reserved call frame. > * If it is, perform the mov ->push transformation for the selected call sites. > * Fold loads into the pushes while doing the transformation. > As an alternative, I could try to teach the peephole optimizer to do it (right now it won't even try to do this folding because PUSHes store to memory), but getting it right in the general case seems complex. > I think I'd rather do folding of the simple (but common) cases on the fly. > > 2) Alter the semantics of ADJCALLSTACKDOWN/ADJCALLSTACKUP slightly. > Doing the mov->push transformation before PEI means I'd have to leave the ADJCALLSTACKDOWN/UP pair unbalanced. > > E.g. something like: > ADJCALLSTACKDOWN32 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, > %ESP<imp-use> %vreg9<def,dead> = COPY %ESP; GR32:%vreg9 PUSH32rmm > %vreg0, 1, %noreg, 28, %noreg, %ESP<imp-def>, %ESP<imp-use>; > GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 24, %noreg, %ESP<imp-def>, > %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 20, %noreg, > %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, > 16, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm > %vreg0, 1, %noreg, 12, %noreg, %ESP<imp-def>, %ESP<imp-use>; > GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 8, %noreg, %ESP<imp-def>, > %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0, 1, %noreg, 4, %noreg, > %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 PUSH32rmm %vreg0<kill>, 1, > %noreg, 0, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0 > CALLpcrel32 <ga:@foo>, <regmask>, %ESP<imp-use>, %ESP<imp-def> > ADJCALLSTACKUP32 32, 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, > %ESP<imp-use> > > This, rightly, gets flagged by the verifier. > My proposal is to add an additional parameter to ADJCALLSTACKDOWN to express the amount of adjustment the call sequence itself does. This is somewhat similar to the second parameter of ADKCALLSTACKUP which allows adjustment for callee stack-clean-up. > So, in this case, we will get a "ADJCALLSTACKDOWN32 32, 32" instead of the "ADJCALLSTACKDOWN32 0". The verifier will be happy, and PEI will know it doesn't need to do any stack pointer adjustment. > > Does this sound like the right approach? > > Any suggestions, as well as warnings of potential pitfalls, are > welcome. :-) > > Thanks, > Michael > > --------------------------------------------------------------------- > Intel Israel (74) Limited > > This e-mail and any attachments may contain confidential material for > the sole use of the intended recipient(s). Any review or distribution > by others is strictly prohibited. If you are not the intended > recipient, please contact the sender and delete all copies. > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev--------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
Chris Lattner
2014-Dec-22 07:03 UTC
[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
> On Dec 21, 2014, at 12:58 AM, Herbie Robinson <HerbieRobinson at verizon.net> wrote: > > According to the Intel performance guidelines, pushes are significantly slower than moves to the extent they should be avoided as much as possible. It's been a decade since I was dealing with this; so, I don't remember the numbers, but I'm pretty sure the changes you are proposing will slow the code down. > > People who care about speed more than code size are probably not going to like this very much…Very true (and confirmed downthread), but this code size optimization is still very interesting for functions using the minsize function attribute, which is specifically designed to favor code size over performance: http://llvm.org/docs/LangRef.html#function-attributes <http://llvm.org/docs/LangRef.html#function-attributes> The linux kernel folks really care about the size of the bootloader, for example. -Chris -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141221/18936769/attachment.html>