Hello, Arnold. Only quick comments, I'll try to make a full review a little bit later.> 0.)a fast calling convention (maybe use the current > CallingConv::Fast, or create a CallingConv::TailCall) > 1.) lowering of formal arguments > like for example x86_LowerCCCArguments in stdcall mode > we need to make sure that later mentioned CALL_CLOBBERED_REG is > not used (remove it from availableregisters in callingconv > for argument passing?)This can be acceptable, only if function has internal linkage. Otherwise, we cannot change calling convention of the function. So extension if this problem: Let us have the following sequence of calls: a->b->c Call b->c is tail call. We need to formulate conditions of the call to be "suitable for tail call". On x86 it's the following: 1. The calling conventions of b and c are "compatible". There are two scenarios: caller cleans the stack (C calling convention), callee cleans the stack (stdcall calling convention). The call is not suitable for tail call, if b and c cleans the stack differently. For example, let b has C CC and c - stdcall CC. In this case making tail call results to double cleaning (on in a and one in c), which is incorrect. 2. The same situtation is for struct return functions: the caller pushes extra pointer to return value, but callee is required to clean this pointer out of the stack. So, call b->c is "suitable for tail call" iff both b and c are struct return. 3. Various tricky cases with FP. I don't remeber correctly, but there can be some problems with functions returning FP values. 4. PIC. %ebx should be live in order to make a call via PLT. Unfortunately, %ebx is callee-saved, that's why we can assemble tail calls only to functions within the same module (PLT is not required). 5. %ecx is livein for regparm(3) functions. ... (maybe something I forget) -- WBR, Anton Korobeynikov
On 8/9/07, Anton Korobeynikov <asl at math.spbu.ru> wrote:> Hello, Arnold. > > Only quick comments, I'll try to make a full review a little bit later. > > > 0.)a fast calling convention (maybe use the current > > CallingConv::Fast, or create a CallingConv::TailCall) > > 1.) lowering of formal arguments > > like for example x86_LowerCCCArguments in stdcall mode > > we need to make sure that later mentioned CALL_CLOBBERED_REG is > > not used (remove it from availableregisters in callingconv > > for argument passing?) > This can be acceptable, only if function has internal linkage. > Otherwise, we cannot change calling convention of the function. So > extension if this problem:again i did not express myself clearly -sorry - what i meant is not to change the calling convention of the function but only do tail call optimization for functions (caller-callee) that obey a special calling convention (like for example x86_stdcall convention - with the exception that x86_stdcall as implemented in the backend currently allows for params passed in reg eax,ecx and edx, but when used in future with tailcallopt one of those registers must be reserved for the tailcallee address resulting in only 2 free registers for param passing). if both caller and callee obey that "callingconv::fast" - maximum 2args in register and keep one call clobbered free than do tailcallopt. what i also notice (in my first attempt) is that one has do keep the stack aligned (on mac os x, calls to dynamically loaded functions want the stack to be 16 byte aligned). when doing tailcalloptimization the naive way (my 1. try) was to just adjust the stack by the argument difference. but that could result in messing with the stack alignment. so one has to round the argument on stack size to 16n+12 such that the frame pointer difference of two functions is always a multiple of 16 (that is if the stackalignment has to be 16 byte).> Let us have the following sequence of calls: a->b->c > Call b->c is tail call. > > We need to formulate conditions of the call to be "suitable for tail > call". On x86 it's the following: > 1. The calling conventions of b and c are "compatible". There are two > scenarios: caller cleans the stack (C calling convention), callee cleans > the stack (stdcall calling convention). The call is not suitable for > tail call, if b and c cleans the stack differently. For example, let b > has C CC and c - stdcall CC. In this case making tail call results to > double cleaning (on in a and one in c), which is incorrect.i was thinking of just doing tailcallopt with one CC in the beginning (~stdcall) but yes that is true> 2. The same situtation is for struct return functions: the caller pushes > extra pointer to return value, but callee is required to clean this > pointer out of the stack. So, call b->c is "suitable for tail call" iff > both b and c are struct return. > 3. Various tricky cases with FP. I don't remeber correctly, but there > can be some problems with functions returning FP values.yes i will have to look a those> 4. PIC. %ebx should be live in order to make a call via PLT. > Unfortunately, %ebx is callee-saved, that's why we can assemble tail > calls only to functions within the same module (PLT is not required). > 5. %ecx is livein for regparm(3) functions.i guess you mean functions which have their address loaded to a register (call via a function pointer) by regparam functions> ... (maybe something I forget)... even more that i don't know :)> > -- > WBR, Anton Korobeynikov > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >
Okay so i implemented an(other :) initial version for X86-32 backend, this time based on TOT: It is not very generic at the moment. Only functions with callingconv::fastcc and the tail call attribute will be optimized. Maybe the next step should be to integrate the code into the other calling convention lowering. Here is what i have at the moment: If callingconv::fastcc is used the function will be custom lowered. (LowerX86_32FastCCCallTo and LowerX86_32FastCCArguments, the code is based on the std calling convention minus ecx as inreg) The lowering code decides whether the function call really is eligible for tail call optimization tco (caller-callee agree, tail call opt is enabled, tail position, TODO: check for PIC code). If it is okay to do tco it calculates the stack adjustment. Repositions the return address, Puts the arguments on the correct position relative to the (virtual) frame pointer. Moves the function address to register ECX if the tailcallee is in a register. (TODO: set ECX as live out?) TODO: check what to do with pic code Adds register arguments (EAX, EDX) as live in if they are used for arguments. Create a TAILCALL node (a pseudo instruction) holding the Callee (Register or TargetGlobalAddress/ExternalSymbol), the stack adjustment as a constant, and the live-in registers. The TargetLowering class has a IsLastCallTailCall state field which is set to true such that the LowerRET function nows it has to produce different code. The TAILCALL node is then returned. If it is not okay the code produces a function call like the std calling convention (callee pops arguments on return). There is one difference though: since tail call functions might arbitrarily change stack alignment (since they adjust the stack by the difference of the argument stack size) the stack adjustment (by that i mean the number of bytes a function requires as its stack size) i when i calculate that stack size/ stack adjustment i take care that the stack sizes are always multiples of the targets stack alignment. (this code at the moment can be turned off by an option eg.:if it's known that no dynamically linked functions are called). I observed the behavior of dynamically linked functions requiring 16 byte stack alignment on mac os x. This might also render it difficult to do general tail call optimization (for normal std call, c call functions it would only bem possible if the stack size difference is a multiple of the stack alignment). But maybe i am misinterpreting the situation. (TODO: look into that:) The lowering of arguments is pretty similar to std call. The LowerRET function checks for the IsLastCallTailCall field and if its true does the following otherwise it does its usual duty. It looks for the TAILCALL node and uses the operand holding the callee and the stack adjustment to build and return a TC_RETURN node with preceeding operands as its own operands. The TAILCALL node is converted to a noop machine instruction. Ditto the TC_RETURN node. (it declares the iscall=1, isbarrier=1, isret=1) The machine instructions needed to adjust the stack and jump to the callee is added in the epilogue generating function in X86RegisterInfo. It uses the two operands of the TC_RETURN node for the required information. So modifications are: a LowerX86_32FastCCCallTo function a LowerX86_32FastCCArguments function change to LowerRET to generate the TC_RETURN node some code at the end of X86RegisterInfo to do stack adjustment/ jumping to tailcallee (like the EH_RETURN code, thanks Anton for the pointers!) small modifications to X86InstrInfo.td (a yes and at the moment i remove generating TAILCALL nodes in the other functions - LowerCCCall and the like) Another TODO: look a floating point code to see whether there are any kinks. Look at other calling conventions to see what is possible there. (see Anton's remarks) I am away now for 2 weeks (isle of cres - camping). but after that i will be looking into further improving the code (and also look at the other architectures). Or maybe i have answers telling me that i am going into a totally wrong direction , and i should let the experts do this stuff and stop nagging people :). regards arnold> need to formulate conditions of the call to be "suitable for tail > call". On x86 it's the following: > 1. The calling conventions of b and c are "compatible". There are two > scenarios: caller cleans the stack (C calling convention), callee > cleans > the stack (stdcall calling convention). The call is not suitable for > tail call, if b and c cleans the stack differently. For example, let b > has C CC and c - stdcall CC. In this case making tail call results to > double cleaning (on in a and one in c), which is incorrect. > 2. The same situtation is for struct return functions: the caller > pushes > extra pointer to return value, but callee is required to clean this > pointer out of the stack. So, call b->c is "suitable for tail call" > iff > both b and c are struct return. > 3. Various tricky cases with FP. I don't remeber correctly, but there > can be some problems with functions returning FP values. > 4. PIC. %ebx should be live in order to make a call via PLT. > Unfortunately, %ebx is callee-saved, that's why we can assemble tail > calls only to functions within the same module (PLT is not required). > 5. %ecx is livein for regparm(3) functions. > ... (maybe something I forget) > > -- > WBR, Anton Korobeynikov > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Hi Arnold and Anton, Sorry I have been ignoring your emails on this topic. It's an important task and I really need sometime to think about it (and talk to Chris about it!) But this has been an especially hectic week. I am also going to vacation soon so I am not sure when I would get around to it. If Chris has time, I am sure he has lots to say on this topic. :-) Otherwise, please be patient. :-) Evan On Aug 11, 2007, at 1:25 PM, Arnold Schwaighofer wrote:> Okay so i implemented an(other :) initial version for X86-32 backend, > this time based on TOT: > > It is not very generic at the moment. Only functions with > callingconv::fastcc and the tail call attribute will be optimized. > Maybe the next step should be to integrate the code into the other > calling convention lowering. Here is what i have at the moment: > > If callingconv::fastcc is used the function will be custom lowered. > (LowerX86_32FastCCCallTo and LowerX86_32FastCCArguments, the code is > based on the std calling convention minus ecx as inreg) > The lowering code decides whether the function call really is > eligible for tail call optimization tco (caller-callee agree, tail > call opt is enabled, tail position, TODO: check for PIC code). > If it is okay to do tco it calculates the stack adjustment. > Repositions the return address, > Puts the arguments on the correct position relative to the (virtual) > frame pointer. > Moves the function address to register ECX if the tailcallee is in a > register. (TODO: set ECX as live out?) > TODO: check what to do with pic code > Adds register arguments (EAX, EDX) as live in if they are used for > arguments. > Create a TAILCALL node (a pseudo instruction) holding the Callee > (Register or TargetGlobalAddress/ExternalSymbol), the stack > adjustment as a constant, and the live-in registers. The > TargetLowering class has a IsLastCallTailCall state field which is > set to true such that the LowerRET function nows it has to produce > different code. The TAILCALL node is then returned. > > If it is not okay the code produces a function call like the std > calling convention (callee pops arguments on return). > There is one difference though: since tail call functions might > arbitrarily change stack alignment (since they adjust the stack by > the difference of the argument stack size) the stack adjustment (by > that i mean the number of bytes a function requires as its stack > size) i when i calculate that stack size/ stack adjustment i take > care that the stack sizes are always multiples of the targets stack > alignment. (this code at the moment can be turned off by an option > eg.:if it's known that no dynamically linked functions are called). > I observed the behavior of dynamically linked functions requiring 16 > byte stack alignment on mac os x. This might also render it difficult > to do general tail call optimization (for normal std call, c call > functions it would only bem possible if the stack size difference is > a multiple of the stack alignment). But maybe i am misinterpreting > the situation. (TODO: look into that:) > > The lowering of arguments is pretty similar to std call. > > The LowerRET function checks for the IsLastCallTailCall field and if > its true does the following otherwise it does its usual duty. > It looks for the TAILCALL node and uses the operand holding the > callee and the stack adjustment to build and return a TC_RETURN node > with preceeding operands as its own operands. > > The TAILCALL node is converted to a noop machine instruction. > Ditto the TC_RETURN node. (it declares the iscall=1, isbarrier=1, > isret=1) > > The machine instructions needed to adjust the stack and jump to the > callee is added in the epilogue generating function in > X86RegisterInfo. It uses the two operands of the TC_RETURN node for > the required information. > > > So modifications are: > a LowerX86_32FastCCCallTo function > a LowerX86_32FastCCArguments function > change to LowerRET to generate the TC_RETURN node > some code at the end of X86RegisterInfo to do stack adjustment/ > jumping to tailcallee (like the EH_RETURN code, thanks Anton for the > pointers!) > small modifications to X86InstrInfo.td > (a yes and at the moment i remove generating TAILCALL nodes in the > other functions - LowerCCCall and the like) > > Another TODO: look a floating point code to see whether there are any > kinks. Look at other calling conventions to see what is possible > there. (see Anton's remarks) > > I am away now for 2 weeks (isle of cres - camping). but after that i > will be looking into further improving the code (and also look at the > other architectures). > Or maybe i have answers telling me that i am going into a totally > wrong direction , and i should let the experts do this stuff and stop > nagging people :). > > regards arnold > > >> need to formulate conditions of the call to be "suitable for tail >> call". On x86 it's the following: >> 1. The calling conventions of b and c are "compatible". There are two >> scenarios: caller cleans the stack (C calling convention), callee >> cleans >> the stack (stdcall calling convention). The call is not suitable for >> tail call, if b and c cleans the stack differently. For example, >> let b >> has C CC and c - stdcall CC. In this case making tail call results to >> double cleaning (on in a and one in c), which is incorrect. >> 2. The same situtation is for struct return functions: the caller >> pushes >> extra pointer to return value, but callee is required to clean this >> pointer out of the stack. So, call b->c is "suitable for tail call" >> iff >> both b and c are struct return. >> 3. Various tricky cases with FP. I don't remeber correctly, but there >> can be some problems with functions returning FP values. >> 4. PIC. %ebx should be live in order to make a call via PLT. >> Unfortunately, %ebx is callee-saved, that's why we can assemble tail >> calls only to functions within the same module (PLT is not required). >> 5. %ecx is livein for regparm(3) functions. >> ... (maybe something I forget) >> >> -- >> WBR, Anton Korobeynikov >> _______________________________________________ >> LLVM Developers mailing list >> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Hi, I finally had a little time to talk to Chris about tail call. But what I have is nothing concrete and also very vague. We'd like to see tail call optimization to be similar to the target independent lowering of ISD::CALL nodes. These are auto-generated from ???CallingConv.td files. Some target specific details such as function address register (ECX in your example) should be coded in these files. Ditto for stack offsets, etc. These should be part of the fastcc calling convention. The tail call lowering will use these information and target hooks to perform tail call optimization on fastcc calls that are really tail calls. I think it's reasonable to start by doing a (clean) target specific implementation for X86. Just understand we'd like to move to a target independent mechanism so the earlier implementation may not survive. As for what you described here. I am having a hard time following it. :-) Please send patch. Thanks, Evan On Aug 11, 2007, at 1:25 PM, Arnold Schwaighofer wrote:> Okay so i implemented an(other :) initial version for X86-32 backend, > this time based on TOT: > > It is not very generic at the moment. Only functions with > callingconv::fastcc and the tail call attribute will be optimized. > Maybe the next step should be to integrate the code into the other > calling convention lowering. Here is what i have at the moment: > > If callingconv::fastcc is used the function will be custom lowered. > (LowerX86_32FastCCCallTo and LowerX86_32FastCCArguments, the code is > based on the std calling convention minus ecx as inreg) > The lowering code decides whether the function call really is > eligible for tail call optimization tco (caller-callee agree, tail > call opt is enabled, tail position, TODO: check for PIC code). > If it is okay to do tco it calculates the stack adjustment. > Repositions the return address, > Puts the arguments on the correct position relative to the (virtual) > frame pointer. > Moves the function address to register ECX if the tailcallee is in a > register. (TODO: set ECX as live out?) > TODO: check what to do with pic code > Adds register arguments (EAX, EDX) as live in if they are used for > arguments. > Create a TAILCALL node (a pseudo instruction) holding the Callee > (Register or TargetGlobalAddress/ExternalSymbol), the stack > adjustment as a constant, and the live-in registers. The > TargetLowering class has a IsLastCallTailCall state field which is > set to true such that the LowerRET function nows it has to produce > different code. The TAILCALL node is then returned. > > If it is not okay the code produces a function call like the std > calling convention (callee pops arguments on return). > There is one difference though: since tail call functions might > arbitrarily change stack alignment (since they adjust the stack by > the difference of the argument stack size) the stack adjustment (by > that i mean the number of bytes a function requires as its stack > size) i when i calculate that stack size/ stack adjustment i take > care that the stack sizes are always multiples of the targets stack > alignment. (this code at the moment can be turned off by an option > eg.:if it's known that no dynamically linked functions are called). > I observed the behavior of dynamically linked functions requiring 16 > byte stack alignment on mac os x. This might also render it difficult > to do general tail call optimization (for normal std call, c call > functions it would only bem possible if the stack size difference is > a multiple of the stack alignment). But maybe i am misinterpreting > the situation. (TODO: look into that:) > > The lowering of arguments is pretty similar to std call. > > The LowerRET function checks for the IsLastCallTailCall field and if > its true does the following otherwise it does its usual duty. > It looks for the TAILCALL node and uses the operand holding the > callee and the stack adjustment to build and return a TC_RETURN node > with preceeding operands as its own operands. > > The TAILCALL node is converted to a noop machine instruction. > Ditto the TC_RETURN node. (it declares the iscall=1, isbarrier=1, > isret=1) > > The machine instructions needed to adjust the stack and jump to the > callee is added in the epilogue generating function in > X86RegisterInfo. It uses the two operands of the TC_RETURN node for > the required information. > > > So modifications are: > a LowerX86_32FastCCCallTo function > a LowerX86_32FastCCArguments function > change to LowerRET to generate the TC_RETURN node > some code at the end of X86RegisterInfo to do stack adjustment/ > jumping to tailcallee (like the EH_RETURN code, thanks Anton for the > pointers!) > small modifications to X86InstrInfo.td > (a yes and at the moment i remove generating TAILCALL nodes in the > other functions - LowerCCCall and the like) > > Another TODO: look a floating point code to see whether there are any > kinks. Look at other calling conventions to see what is possible > there. (see Anton's remarks) > > I am away now for 2 weeks (isle of cres - camping). but after that i > will be looking into further improving the code (and also look at the > other architectures). > Or maybe i have answers telling me that i am going into a totally > wrong direction , and i should let the experts do this stuff and stop > nagging people :). > > regards arnold > > >> need to formulate conditions of the call to be "suitable for tail >> call". On x86 it's the following: >> 1. The calling conventions of b and c are "compatible". There are two >> scenarios: caller cleans the stack (C calling convention), callee >> cleans >> the stack (stdcall calling convention). The call is not suitable for >> tail call, if b and c cleans the stack differently. For example, >> let b >> has C CC and c - stdcall CC. In this case making tail call results to >> double cleaning (on in a and one in c), which is incorrect. >> 2. The same situtation is for struct return functions: the caller >> pushes >> extra pointer to return value, but callee is required to clean this >> pointer out of the stack. So, call b->c is "suitable for tail call" >> iff >> both b and c are struct return. >> 3. Various tricky cases with FP. I don't remeber correctly, but there >> can be some problems with functions returning FP values. >> 4. PIC. %ebx should be live in order to make a call via PLT. >> Unfortunately, %ebx is callee-saved, that's why we can assemble tail >> calls only to functions within the same module (PLT is not required). >> 5. %ecx is livein for regparm(3) functions. >> ... (maybe something I forget) >> >> -- >> WBR, Anton Korobeynikov >> _______________________________________________ >> LLVM Developers mailing list >> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev