Hello Mentors,

I analyzed the assembly files generated for IPRA + PGO.

(1) I had not considered the scope of the optimization: changing the callee-saved register set of a non-local function is bad, because IPRA cannot pass this information to other modules.
(2) Applying this change to an indirect function also has no effect, because for such a case IPRA is currently not able to propagate the regmask.
(3) The 'main' function is marked as a cold function, and this optimization makes main's prologue/epilogue long, so a slow start-up is observed.

Points (1) and (2) lead to the constraints which we followed for the 'not saving CSR' optimization. Also, with LTO there is no significant improvement (i.e. I tried to address (1) with the help of LTO).

Conclusion:
As of now I am not able to find a good heuristic that restricts the number of functions this optimization applies to while still giving an overall performance improvement at an acceptable code-size change.
I am currently pausing work on this.

Sincerely,
Vivek

On Sat, Aug 6, 2016 at 1:35 PM, vivek pandya <vivekvpandya at gmail.com> wrote:
>
> On Sat, Aug 6, 2016 at 2:00 AM, Matthias Braun <matze at braunis.de> wrote:
>
>> The code in X86TargetLowering::IsEligibleForTailCallOptimization() has this part:
>>
>>   // The callee has to preserve all registers the caller needs to preserve.
>>   const X86RegisterInfo *TRI = Subtarget.getRegisterInfo();
>>   const uint32_t *CallerPreserved = TRI->getCallPreservedMask(MF, CallerCC);
>>   if (!CCMatch) {
>>     const uint32_t *CalleePreserved = TRI->getCallPreservedMask(MF, CalleeCC);
>>     if (!TRI->regmaskSubsetEqual(CallerPreserved, CalleePreserved))
>>       return false;
>>   }
>>
> Thanks MatzeB for pointing this out.
>
>> which usually checks that this is fine. Maybe that code looks at the regmask of the calling convention rather than the new regmask computed by IPRA?
>>
> Yes, I was doing the optimization in RegUsageInfoPropagate, and the above code is executed in the ISel phase; that is why the effect of the new CC is not visible at ISel.
> So I moved this optimization to CodeGenPrepare, which happens before ISel. Now I am not hitting the above bug. But the result of this optimization is not good (I mean no positive change in the generated code); also, due to the condition F.doesNotAccessMemory() && !F.hasLocalLinkage(), very few cold functions are really getting optimized.
>
> One more thing: how do I run the test-suite with PGO and also use the generated profiles in the next run? Is there an easy way? I just want to be sure that no other kind of case is left uncovered.
>
> -Vivek
>
>> - Matthias
>>
>> On Aug 4, 2016, at 9:22 PM, vivek pandya via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hello all,
>>
>> Adding MatzeB; this may be interesting for him.
>>
>> I have tried the following way to save most of the registers for cold functions, in RegisterUsagePropagation.cpp:
>>
>>   if (PSI->isColdFunction(F) && F->doesNotAccessMemory() && !F->hasLocalLinkage()) {
>>     dbgs() << "Cold Function : " << F->getName() << "\n";
>>     F->setCallingConv(CallingConv::CXX_FAST_TLS);
>>   }
>>
>> Previously I was using CallingConv::PreserveMost, but it also saves RAX, and that produced a bug for functions which return the address of a global object or something similar. CXX_FAST_TLS is very similar to PreserveMost, but it excludes RAX and RDI. It also excludes XMM*.
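For illustration, a minimal sketch of one possible candidate filter along the lines of points (1) and (2) in the message at the top of this thread: keep the change module-local so the new register behaviour never has to be communicated across modules. The helper name is hypothetical, and PSI->isColdFunction is simply the interface used in the snippet above.

  #include "llvm/Analysis/ProfileSummaryInfo.h"
  #include "llvm/IR/Function.h"
  using namespace llvm;

  // Sketch only: pick cold, module-local functions whose address does not
  // escape before changing their callee-saved behaviour.
  static bool isColdCSRCandidate(const Function &F, ProfileSummaryInfo *PSI) {
    if (!PSI->isColdFunction(&F))
      return false;
    if (!F.hasLocalLinkage())  // callers in other modules never see the new regmask
      return false;
    if (F.hasAddressTaken())   // indirect calls cannot use the propagated regmask
      return false;
    return true;
  }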
>> There is also one more interesting bug (actually it is very similar to what we faced while optimizing functions for not saving registers).
>>
>> In the SPASS application there is a very simple function:
>>
>>   BOOL opts_IdIsNull(OPTID Id)
>>   {
>>     return opts_IdEqual(opts_IdNull(), Id);
>>   }
>>
>> This is a cold function for a particular input, and that is why it gets selected for the optimization. But with CXX_FAST_TLS it is generated as follows:
>>
>>   _opts_IdIsNull:                ## @opts_IdIsNull
>>           .cfi_startproc
>>   ## BB#0:                       ## %entry
>>           pushq %rbp
>>   Ltmp9:
>>           .cfi_def_cfa_offset 16
>>   Ltmp10:
>>           .cfi_offset %rbp, -16
>>           movq %rsp, %rbp
>>   Ltmp11:
>>           .cfi_def_cfa_register %rbp
>>           pushq %rsi
>>           pushq %rcx
>>   Ltmp12:
>>           .cfi_offset %rcx, -32
>>   Ltmp13:
>>           .cfi_offset %rsi, -24
>>           movl %edi, %eax
>>           movl $-1, %edi
>>           popq %rcx
>>           popq %rsi
>>           popq %rbp
>>           jmp _opts_IdEqual      ## TAILCALL
>>           .cfi_endproc
>>
>> Here, because the old register values are popped back just before the tail call, the generated result is wrong.
>> I am also confused why rcx and rsi are saved/restored. Is this normal? Any suggestion on how to handle such a case?
>>
>> Also, if we do not want to use a CC directly, can you help me find the proper place where we can iterate over the parameter list (and how do we iterate over only the registers used for parameters?), leave those registers out of the regmask, and also not emit save/restore for them? Some hints will speed up my work.
>>
>> Sincerely,
>> Vivek
>>
>> On Fri, Jul 29, 2016 at 11:06 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>> From: "vivek pandya" <vivekvpandya at gmail.com>
>>> To: "Hal Finkel" <hfinkel at anl.gov>
>>> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Quentin Colombet" <qcolombet at apple.com>, "Mehdi Amini" <mehdi.amini at apple.com>
>>> Sent: Friday, July 29, 2016 5:02:44 AM
>>> Subject: Re: A thought to improve IPRA
>>>
>>> On Fri, Jul 29, 2016 at 9:01 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>>
>>>> > From: "vivek pandya" <vivekvpandya at gmail.com>
>>>> > To: "Mehdi Amini" <mehdi.amini at apple.com>
>>>> > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Hal Finkel" <hfinkel at anl.gov>, "Quentin Colombet" <qcolombet at apple.com>
>>>> > Sent: Thursday, July 28, 2016 2:59:02 PM
>>>> > Subject: Re: A thought to improve IPRA
>>>> >
>>>> > I have been working on PGO-driven IPRA and I want to measure whether it helps to reduce execution time. So, as mentioned earlier, the idea is to make a cold function "register usage free", i.e. such a cold function saves and restores every register it uses, so that callers of that function have more free registers. So here I am changing the standard callee-saved register set to a set which is decided dynamically, based on the actual register usage.
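To make "decided dynamically based on the actual register usage" concrete, here is a minimal sketch that derives a per-function save list from the regmask that the existing RegUsageInfoCollector machinery computes (in an LLVM regmask, a set bit means the register is preserved). The helper name is hypothetical, and a real implementation would also skip reserved registers such as the stack pointer.

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/CodeGen/TargetRegisterInfo.h" // llvm/Target/TargetRegisterInfo.h in older trees
  #include <cstdint>
  #include <vector>
  using namespace llvm;

  // Sketch: every register the function clobbers becomes "callee saved",
  // i.e. the cold function saves/restores it itself.
  static std::vector<MCPhysReg>
  computeDynamicCSRs(const TargetRegisterInfo &TRI, ArrayRef<uint32_t> RegMask) {
    std::vector<MCPhysReg> SaveList;
    for (unsigned Reg = 1, E = TRI.getNumRegs(); Reg != E; ++Reg) {
      bool Preserved = RegMask[Reg / 32] & (1u << (Reg % 32));
      if (!Preserved)            // clobbered by the function body,
        SaveList.push_back(Reg); // so save/restore it in the prologue/epilogue
    }
    return SaveList;
  }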
>>>> >
>>>> > I am facing a few problems getting this to work:
>>>> > 1) While generating CFI for such a function we need to map the DWARF register to the LLVM register, and even if we force LLVM to use the DWARF register number for CFI it will still be wrong for some registers for which we currently have no such mapping, for example the R8D register on X86 (when dealing with actual register usage info we may have a case where R8D is being used).
>>>> > To fix this I tried to filter the functions to be optimized by adding a constraint that they must have the NoUnwind attribute, but that does not help. Is it possible to disable CFI generation?
>>>>
>>>> Disabling CFI generation does not seem like the right solution. If the R8D definition, and similar, need DWARF register numbers, then we should fix that (you can try rearranging things and using DwarfRegAlias, or at least for testing, add the same DwarfRegNum as for R8).
>>>>
>>> Adding the same DwarfRegNum as for R8 does not work, because the mapping is currently generated as an array sorted on the first value of the DwarfLLVMRegPair key. So in the build directory, in X86GenRegisterInfo.inc, this adds two different entries for mapping an LLVM register to a DWARF number, i.e. R8 -> 8 and R8D -> 8, but in the array mapping DWARF to LLVM registers there is only one entry, since 8 is added only once.
>>>
>>>> > 2) R8D is a 48-bit register
>>>>
>>>> Why do you say that? For one thing, it is in register class GR32 and holds only 32-bit values.
>>>>
>>> Sorry, my bad: R8D is a 32-bit register. To get this working, changing the CC for cold functions to "preserve_all" seems to be an easy and safe way. Let me know your thoughts about this.
>>>
>>> Sounds like a reasonable thing to try.
>>>
>>> -Hal
>>>
>>> -Vivek
>>>
>>>> -Hal
>>>>
>>>> > but pushing and popping such a register is not allowed, and the current implementation for callee-saved registers also uses either the 64-bit or the 32-bit version of the X86 instruction according to the target. So here I think it may be good to push/pop R8 for R8D (i.e. I don't want to change the current implementation which inserts the MI for CSRs). For that I need to find the biggest register for which the given register is an alias, e.g. R8 has R8D as an alias. How can I find that?
>>>> > I tried to use getMatchingSuperReg(unsigned Reg, unsigned SubIdx, const TargetRegisterClass *RC), but I don't know what SubIdx should be for a given Reg in a given RC.
>>>> >
>>>> > So, for example, if a function selected for the above optimization has the following set of clobbered registers: R8D, R8, ECX, EAX, RAX, ESI, then it should push/pop R8, RCX, RAX, RSI.
>>>> >
>>>> > Please help!
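For the question above about finding the widest register a given register aliases (R8 for R8D), a minimal sketch using MCSuperRegIterator; the helper name is hypothetical. On X86 specifically, something like TRI->getMatchingSuperReg(Reg, X86::sub_32bit, &X86::GR64RegClass) should answer it for 32-bit registers, i.e. sub_32bit would be the SubIdx.

  #include "llvm/CodeGen/TargetRegisterInfo.h"
  #include "llvm/MC/MCRegisterInfo.h"
  using namespace llvm;

  // Sketch: walk all super-registers of Reg and pick one that has no
  // super-register of its own (a root register, e.g. R8 for R8D). Registers
  // with several roots are only handled illustratively here.
  static MCPhysReg getWidestAlias(MCPhysReg Reg, const TargetRegisterInfo *TRI) {
    MCPhysReg Widest = Reg;
    for (MCSuperRegIterator SR(Reg, TRI, /*IncludeSelf=*/true); SR.isValid(); ++SR)
      if (!MCSuperRegIterator(*SR, TRI).isValid()) // *SR has no super-register
        Widest = *SR;
    return Widest;
  }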
>>>> > - Vivek
>>>> >
>>>> > On Sat, Jul 9, 2016 at 12:26 AM, vivek pandya <vivekvpandya at gmail.com> wrote:
>>>> >
>>>> > On Sat, Jul 9, 2016 at 12:18 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>>> >
>>>> > On Jul 8, 2016, at 11:41 AM, vivek pandya <vivekvpandya at gmail.com> wrote:
>>>> >
>>>> > On Fri, Jul 8, 2016 at 11:46 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>>> >
>>>> > On Jul 8, 2016, at 11:12 AM, vivek pandya <vivekvpandya at gmail.com> wrote:
>>>> >
>>>> > Hello LLVM Developers,
>>>> >
>>>> > I have a thought to improve IPRA, and I would like to summarize the discussion on IRC regarding it so that we can develop an idea out of it if it really helps.
>>>> >
>>>> > The idea is to have more callee-saved registers in infrequently called leaf procedures and to try to provide more registers to procedures in the upper region of the call graph. As Quentin pointed out, this optimization may help in the context of "true" IPRA, but in our case we may not require it. Still, I think it can improve performance in the current IPRA. I explain both arguments (Quentin's and mine) with the following example.
>>>> >
>>>> > Consider the call sequence A->B->C, where C is a rarely called leaf procedure, A is called frequently, and B may call C based on some condition. While propagating actual register usage information from C to A we clobber most of the registers, so in this case, per Quentin's point, we do not hurt performance because we fall back to the CC. But I think we can improve performance as follows:
>>>> > mark every register as preserved by C (i.e. accept more spills/reloads at procedure entry and exit) if this can help at A. Suppose A requires more distinct registers than the CC can provide; if they are not provided, it will spill variables to memory. Now, if we can provide more registers at A by having more spills at C, then we save spills at A, which can be beneficial because A is called frequently and C is not, thus reducing the total number of spills/restores in the program execution.
>>>> >
>>>> > However, the effect of this optimization will again be limited by the scope of the current IPRA (i.e. one module only), because we can't really propagate the details about the additional callee-saved registers to a caller defined in another module; but it may still be helpful.
>>>> >
>>>> > Any thoughts on this?
>>>> >
>>>> > I think it is interesting, have you considered:
>>>> >
>>>> > - the code size impact? (C will have a lot of spills)
>>>> > Yes, this needs to be addressed with some heuristics based on the call frequency of C and the number of clobbers it has. Also, can we say that a function which does not have any kind of call instruction in its body will have fewer clobbers?
>>>> >
>>>> > I am not sure what you mean.
>>>> > A function which does a lot of computation but is not required to call any other function may not have too many simultaneous live ranges, and thus can be compiled with very few registers.
>>>> >
>>>> > - what if C is cold but all (most) of its call sites are located in different modules?
>>>> > Can we use Uses to get the number of call sites in the current module and decide whether to optimize based on that? Again, some heuristics.
>>>> >
>>>> > Of course, but what I’m mentioning is exactly what does not work with that.
>>>> >
>>>> > - an alternative approach where we would break the CGSCC ordering to codegen B and A before C, so we would be able to spill minimally when performing the codegen for C?
>>>> > Do you mean here marking everything as preserved for C while doing codegen for B, and then when we come to C (top-down) we may avoid some spills if C can use registers which are not really used by B?
>>>> >
>>>> > Yes, but it may be harder to implement for not much gain after all.
>>>> >
>>>> > Also, this can be applied to a function which is called less frequently and which may not be a leaf function. It may help.
>>>> >
>>>> > Sure, you can just refer to this as “PGO driven IPRA”.
>>>> > Ok, I will look into this.
>>>> >
>>>> > Vivek
>>>> >
>>>> > — Mehdi
>>>>
>>>> --
>>>> Hal Finkel
>>>> Assistant Computational Scientist
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
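Referring back to the A->B->C example quoted above, here is a back-of-the-envelope cost model one could start from when deciding whether the extra saves in a cold C pay for the spills removed from a hot A. All names and the uniform cost per save/restore pair are illustrative assumptions, not something proposed in the thread.

  #include <cstdint>

  struct SpillTradeoff {
    uint64_t CallsToA = 0, CallsToC = 0; // profile counts
    unsigned SpillsAvoidedInA = 0;       // spill/reload pairs A no longer needs
    unsigned ExtraSavesInC = 0;          // extra save/restore pairs added to C
  };

  // Each pair costs roughly one store plus one load every time the function runs;
  // the change is worthwhile only when the added cost in C stays below the cost saved in A.
  static bool isProfitable(const SpillTradeoff &T) {
    uint64_t Saved = 2ull * T.CallsToA * T.SpillsAvoidedInA;
    uint64_t Added = 2ull * T.CallsToC * T.ExtraSavesInC;
    return Added < Saved;
  }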
> On Aug 15, 2016, at 9:01 PM, vivek pandya <vivekvpandya at gmail.com> wrote:
>
> Hello Mentors,
>
> I analyzed the assembly files generated for IPRA + PGO.
>
> (1) I had not considered the scope of the optimization: changing the callee-saved register set of a non-local function is bad, because IPRA cannot pass this information to other modules.

With LTO there shouldn’t be significant uses from “other modules”. Also, applying it to local functions should not suffer from this.
A simple heuristic is to look at the callees, and make sure the direct callees that are visible account for a large part of the actual calls to this function.

> (2) Applying this change to an indirect function also has no effect, because for such a case IPRA is currently not able to propagate the regmask.

There is no such thing as an “indirect function”. There are only “indirect calls”.

> (3) The 'main' function is marked as a cold function, and this optimization makes main's prologue/epilogue long, so a slow start-up is observed.

The above heuristic addresses this.

> Points (1) and (2) lead to the constraints which we followed for the 'not saving CSR' optimization.

I don’t understand this.

> Also, with LTO there is no significant improvement (i.e. I tried to address (1) with the help of LTO).

How is LTO not addressing (1)?

—
Mehdi
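One possible reading of the heuristic above, sketched in present-day LLVM terms (CallBase; at the time of this thread it would have been CallSite): require that every use of F in the module is a direct call to F, so the calls IPRA can see account for all calls this module makes. The helper name is hypothetical; a profile-weighted variant would also compare the call-site counts with F's entry count.

  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstrTypes.h"
  using namespace llvm;

  static bool allCallsVisibleAndDirect(const Function &F) {
    for (const Use &U : F.uses()) {
      const auto *CB = dyn_cast<CallBase>(U.getUser());
      if (!CB || !CB->isCallee(&U))
        return false; // address taken, stored, or passed as an argument somewhere
    }
    return !F.use_empty(); // at least one visible direct call
  }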
On Tue, Aug 16, 2016 at 9:39 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:

>> On Aug 15, 2016, at 9:01 PM, vivek pandya <vivekvpandya at gmail.com> wrote:
>>
>> Hello Mentors,
>>
>> I analyzed the assembly files generated for IPRA + PGO.
>>
>> (1) I had not considered the scope of the optimization: changing the callee-saved register set of a non-local function is bad, because IPRA cannot pass this information to other modules.
>
> With LTO there shouldn’t be significant uses from “other modules”. Also, applying it to local functions should not suffer from this.
> A simple heuristic is to look at the callees, and make sure the direct callees that are visible account for a large part of the actual calls to this function.

Do you mean to use a simple analysis instead of ProfileSummaryInfo?

>> (2) Applying this change to an indirect function also has no effect, because for such a case IPRA is currently not able to propagate the regmask.
>
> There is no such thing as an “indirect function”. There are only “indirect calls”.

Yes, my mistake.

>> (3) The 'main' function is marked as a cold function, and this optimization makes main's prologue/epilogue long, so a slow start-up is observed.
>
> The above heuristic addresses this.

>> Points (1) and (2) lead to the constraints which we followed for the 'not saving CSR' optimization.
>
> I don’t understand this.

This is because we cannot apply this optimization to functions with indirect calls; also, without LTO there is no benefit in applying it to non-local functions.

>> Also, with LTO there is no significant improvement (i.e. I tried to address (1) with the help of LTO).
>
> How is LTO not addressing (1)?

This is just based on some experiments, but yes, more careful observation is required.

This also leads to one more question: for the 'not having CSR' optimization we are not accounting for whether the caller of such a function is hot or not. If, due to that optimization, we force a hot caller to save those now caller-saved registers itself, it may lead to a runtime regression.

-Vivek

> —
> Mehdi
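A sketch of the check raised at the end of the message above: skip the 'not having CSR' optimization when any visible call site of the function is hot, since that hot caller would then have to save the clobbered registers around the call. It assumes ProfileSummaryInfo::isHotCallSite (present in current LLVM, not in the tree this thread was written against) and a hypothetical GetBFI callback for obtaining the caller's BlockFrequencyInfo.

  #include "llvm/ADT/STLExtras.h"
  #include "llvm/Analysis/BlockFrequencyInfo.h"
  #include "llvm/Analysis/ProfileSummaryInfo.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/InstrTypes.h"
  using namespace llvm;

  static bool hasHotCallSite(const Function &F, ProfileSummaryInfo &PSI,
                             function_ref<BlockFrequencyInfo &(Function &)> GetBFI) {
    for (const Use &U : F.uses()) {
      auto *CB = dyn_cast<CallBase>(U.getUser());
      if (!CB || !CB->isCallee(&U))
        continue; // not a direct call to F
      if (PSI.isHotCallSite(*CB, &GetBFI(*CB->getCaller())))
        return true; // a hot caller would pay for F's missing CSR saves
    }
    return false;
  }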