Michael Stellmann via llvm-dev
2018-Jul-21 19:14 UTC
[llvm-dev] Finding scratch register after function call
For a Z80 backend, "eliminateCallFramePseudoInstr()" shall adjust the
stack pointer in three possible ways, e.g. after a function call,
depending on the amount (= adjustment size) *and some other rules*:
1. via one or more target "pop <reg>" instructions (SP increments +2 per
instruction), using an unused reg (disregarding the contents after the
operation), followed by an optional +1 increment on the SP for odd
amounts (SP can be inc/dec'd by 1 directly).
2. incrementing the SP register directly with a special target operation
(increments +1 per operation), without using any other register.
Requires twice as many instructions as "pop" in sequence, though.
3. via a long sequence of target-specific arithmetic instructions that
involves a scratch reg (which would have to be saved before and restored
after the call). This should only be used for larger call frame sizes.
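As a rough illustration of options 1 and 2 above, here is a minimal Python sketch that spells out the instruction sequence for a given adjustment size. The function names and the default scratch register are made up for illustration; this only models the emitted text and is not LLVM backend code.

```python
def pop_cleanup(adjust, scratch="bc"):
    """Option 1: one 'pop' per 2 bytes, plus one 'inc sp' for an odd byte.

    'scratch' is a hypothetical parameter naming the register whose
    contents may be clobbered.
    """
    return [f"pop {scratch}"] * (adjust // 2) + ["inc sp"] * (adjust % 2)


def inc_cleanup(adjust):
    """Option 2: one 'inc sp' per byte, no scratch register needed."""
    return ["inc sp"] * adjust


if __name__ == "__main__":
    for line in pop_cleanup(5):
        print(line)
```

For even adjustment sizes, option 2 needs exactly twice as many instructions as option 1, which is the trade-off described above.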
Option 1 ("pop"s) is by far preferred for small call frame sizes.
However, this requires finding a suitable register. And this is where it
gets complicated:
"Suitable" for option 1 means that it shall be any of the 4 physical
registers AF, HL, DE, BC. When invoked after a function call to do
caller-cleans-stack, none of the register(s) used for return value of
the function call shall be used. The calling convention uses
- one specific physical 8 bit register (lower 8 bits of AF) for 8 bit
return values
- one specific physical 16 bit register (reg HL) for 16 bit return values
- two 16 bit regs (regs HL + DE) for 32 bit return values
When determining if one of those registers is available for the option 1
way of cleaning up the stack after a function call, reverse-order
priority is preferred: BC, DE, HL, AF. Due to the highly asymmetric
instruction set of the Z80, the "least otherwise usable" register should
be used for that operation, starting with "BC".
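The selection rule just described can be sketched in a few lines of Python. This is only a model of the priority order and the return-value exclusions; the names are hypothetical, the "0 bits" entry for void functions is an assumption, and the `also_live` set stands in for liveness information that a real backend would have to obtain from LLVM.

```python
# Preferred order for the throwaway "pop" register.
PRIORITY = ["BC", "DE", "HL", "AF"]

# Registers holding the return value, keyed by return width in bits.
# The entry for 0 (no return value) is an assumption.
RETURN_REGS = {0: set(), 8: {"AF"}, 16: {"HL"}, 32: {"HL", "DE"}}


def pick_pop_register(return_bits, also_live=frozenset()):
    """Return the first register in priority order that is neither a
    return register nor otherwise live, or None if all four are taken."""
    blocked = RETURN_REGS[return_bits] | set(also_live)
    for reg in PRIORITY:
        if reg not in blocked:
            return reg
    return None


if __name__ == "__main__":
    print(pick_pop_register(16))
```

A `None` result corresponds to the case in question B, where no register is free and a fallback is needed.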
Now my questions are:
A) Is there a way to check if any of those registers are free at that
point (in eliminateCallFramePseudoInstr()) - i.e. not used as return or
to hold other values?
B) If it could be determined that none of the registers are free, option
2 (adjusting the SP by a series of +1) should be used for small amounts
of call frame size, option 1 with the forced register "BC" for "pop" for
mid amounts, and option 3 for larger amounts.
Now if register "BC" is forced to be used to clean the stack up after a
function call, it should be saved (via "push") on the stack before the
function is called, or to be specific, even *before* the first function
parameter for the upcoming function call is pushed to the stack. It
should then be restored after call frame cleanup (after the last
call-frame-elimination "pop") by another "pop", restoring the original
value.
Would "createVirtualRegister" with a register class containing only that
register do exactly that?
Or is there a better way to do this?
Thanks,
Michael
Bruce Hoult via llvm-dev
2018-Jul-22 02:46 UTC
[llvm-dev] Finding scratch register after function call
Seems like an idea for bigger stack frames would be:
    ld ix, 0xNNNN   # 4 bytes, 14 cycles (or iy)
    add ix,sp       # 2 bytes, 15 cycles
    ld sp,ix        # 2 bytes, 10 cycles
You could then pop whatever registers you actually saved at the start of
the function (maybe including IX) at one byte and 10 cycles for each 16
bit register.
Using hl instead of ix/iy would be 3 bytes smaller and 12 cycles faster,
but then you'd need to keep any 16 bit result somewhere else first and
then move it:
    ld hl, 0xNNNN   # 3 bytes, 10 cycles
    add hl,sp       # 1 byte, 11 cycles
    ld sp,hl        # 1 byte, 6 cycles
    ld h,b          # 1 byte, 4 cycles
    ld l,c          # 1 byte, 4 cycles
So in the end you only save 1 byte and 4 cycles.
Annoying that different instructions have different register
restrictions:
load 16 bit constant: BC, DE, HL, SP, IX, IY
destination of 16 bit add: HL, IX, IY
source of 16 bit add: BC, DE, SP, same as dest
destination of 16 bit move: only SP!
source of 16 bit move to SP: HL, IX, IY
16 bit push/pop: AF, BC, DE, HL, IX, IY
So both the add and the move to SP restrict you to HL, IX, IY as the
possibilities. BC, DE, AF aren't even options.
Again, you don't "check if a register is free at that point". You *tell*
llvm that the function return needs IX (or whatever) free, and it makes
sure that happens.
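The byte and cycle totals in Bruce's reply can be checked with a small Python sketch. The per-instruction figures are the ones quoted in the reply; the third instruction of the HL variant is taken to be "ld sp,hl" (an assumption based on its 1 byte / 6 cycle figure), and the list and function names are made up for illustration.

```python
# (mnemonic, bytes, cycles) as quoted in the reply above.
IX_SEQ = [("ld ix,0xNNNN", 4, 14), ("add ix,sp", 2, 15), ("ld sp,ix", 2, 10)]
HL_SEQ = [("ld hl,0xNNNN", 3, 10), ("add hl,sp", 1, 11), ("ld sp,hl", 1, 6)]
# Moving a 16 bit result from BC back into HL afterwards.
HL_RESTORE = [("ld h,b", 1, 4), ("ld l,c", 1, 4)]


def cost(seq):
    """Sum the size in bytes and the cycle count of an instruction list."""
    return (sum(b for _, b, _ in seq), sum(c for _, _, c in seq))


if __name__ == "__main__":
    print("ix variant:        ", cost(IX_SEQ))
    print("hl variant:        ", cost(HL_SEQ))
    print("hl variant + moves:", cost(HL_SEQ + HL_RESTORE))
```

The difference between the IX variant and the HL variant including the two moves works out to the 1 byte and 4 cycles mentioned in the reply.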
Michael Stellmann via llvm-dev
2018-Jul-22 07:26 UTC
[llvm-dev] Finding scratch register after function call
Thanks Bruce,
as elaborate as ever. Again, I'm surprised by your very thorough
Z80 knowledge when you said you only did little on the ZX81 in the
eighties :D
OK, understood. I was first thinking about doing something like this for
small frames:
1. push bc            # 1 byte; 11 cycles - part of call frame cleanup: save scratch register
+-----begin call-related
2. ld <rr>,stack-param
3. push <rr>
... more code to load non-stack params into registers
4. call ...
5. pop bc             # 1 byte; 10 cycles - call frame cleanup: restore stack pointer (value in BC is not used)
+-----end call-related
6. pop bc             # part of call frame cleanup: restore scratch reg's value
The stack cleanup would insert line 5, and have to insert lines 1 and 6
- summing up to 3 bytes of instructions - and maybe the outer two could
be eliminated in a late optimization pass, when register usage is known.
But then again, looking at your math and the *total* mem and cycles,
incl. setup and tear-down, convinced me to drop my complex idea of
saving "BC" and using it for cleanup, or of sacrificing the calling
convention. The gains just don't justify the complexity. Instead, I'm
going for easy-to-implement solutions:
For small call frames with only 1 or 2 params on the stack, two "inc sp"
(1 byte, 6 cycles per inst) per parameter can be used, and your "big
stack frame" suggestion for larger ones.
This also allows keeping a "beneficial" param and return value calling
convention:
I want to assign the first 3 (HL + DE + BC - or at least 2) function
params to registers, so the stack cleanup is only required for functions
with more than 3 parameters at all - or vararg funcs.
And only functions with more than 5 params will need the "big stack
frame" cleanup. Those cases are rare (or at least can be avoided easily
by a developer), or, knowing the mechanics, shouldn't be used for time
critical inner loops anyway.
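To make the resulting scheme concrete, here is a small Python sketch of the decision as described: the first 3 params go in registers, 1 or 2 stack params are cleaned up with "inc sp", and anything beyond that uses the "big stack frame" sequence. It assumes all params are 16 bit and the function is not vararg; the names are hypothetical, and the 8 byte / 39 cycle figure is the sum of the IX sequence quoted earlier in the thread.

```python
REG_PARAMS = 3             # HL, DE, BC carry the first three params
INC_SP_COST = (1, 6)       # bytes, cycles per "inc sp"
BIG_FRAME_COST = (8, 39)   # ld ix,NNNN / add ix,sp / ld sp,ix in total


def cleanup_plan(num_params):
    """Return (strategy, bytes, cycles) for caller stack cleanup.

    Assumes every parameter is 16 bit wide (2 bytes on the stack).
    """
    stack_params = max(0, num_params - REG_PARAMS)
    if stack_params == 0:
        return ("none", 0, 0)
    if stack_params <= 2:
        count = 2 * stack_params  # one "inc sp" per stack byte
        return ("inc sp", count * INC_SP_COST[0], count * INC_SP_COST[1])
    return ("big frame",) + BIG_FRAME_COST


if __name__ == "__main__":
    for n in range(3, 8):
        print(n, cleanup_plan(n))
```

Under these assumptions, functions with up to 3 params need no cleanup at all, and only those with more than 5 params reach the "big stack frame" case.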
Being able to keep HL for the return value allows very efficient nested
function calls in the form "Func1(Func2(nnn));", as register shuffling
can be avoided - the result of Func2 can be passed directly to Func1.
Thanks for pointing me again to the right direction!
Michael
Oh, and BTW, I'm planning to do the backend primarily for the MSX - my
first computer in 1984. Just for the fun of it, I have now started
writing a small game for it after 25+ years of absence, and was wondering
what 30+ years of compiler technology would be able to achieve on such a
simple (but challenging, as in "not-always-straightforward") CPU ;-)