Hal Finkel via llvm-dev
2016-May-11 18:16 UTC
[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback
----- Original Message -----> From: "vivek pandya" <vivekvpandya at gmail.com> > To: "Matthias Braun" <matze at braunis.de> > Cc: "Hal Finkel" <hfinkel at anl.gov>, "llvm-dev" > <llvm-dev at lists.llvm.org> > Sent: Wednesday, May 11, 2016 1:14:07 PM > Subject: Re: [llvm-dev] [GSoC 2016] Interprocedural Register > Allocation - Introduction and Feedback> Vivek Pandya> On Wed, May 11, 2016 at 11:19 PM, Matthias Braun < matze at braunis.de > > wrote:> > Yes there is also MachineRegisterInfo::UsedPhysRegMask which should > > be the union of all regmasks in the function. >> Does this means there is no requirement of a separate pass for > propagating register usage information ? And also only codegen order > needs to be changed to DFS on call graph. And currently no > intra-procedural register is using UsedPhysRegMask to avoid > load/store. Does it mean we need to change that ? Or I am getting it > wrong.No, it just means that the "scanning" procedure is simple, it does not actually need to visit each instruction. -Hal> - Vivek > > > On May 11, 2016, at 10:47 AM, Hal Finkel via llvm-dev < > > > llvm-dev at lists.llvm.org > wrote: > > >> > > > From: "Matthias Braun" < matze at braunis.de > > > > > > >> > > > To: "Hal Finkel" < hfinkel at anl.gov > > > > > > > > > > > Cc: "vivek pandya" < vivekvpandya at gmail.com >, "llvm-dev" < > > > > llvm-dev at lists.llvm.org > > > > > > > > > > > Sent: Wednesday, May 11, 2016 12:46:25 PM > > > > > > > > > > Subject: Re: [llvm-dev] [GSoC 2016] Interprocedural Register > > > > Allocation - Introduction and Feedback > > > > > >> > > > > On May 11, 2016, at 3:31 AM, Hal Finkel via llvm-dev < > > > > > llvm-dev at lists.llvm.org > wrote: > > > > > > > > > >> > > > > > From: "vivek pandya" < vivekvpandya at gmail.com > > > > > > > > > > > > > > > > > > > > > > To: "Mehdi Amini" < mehdi.amini at apple.com > > > > > > > > > > > > > > > > > > > > > > Cc: "Hal Finkel" < hfinkel at anl.gov >, "Quentin Colombet" < > > > > > > qcolombet at apple.com >, "llvm-dev" < llvm-dev at lists.llvm.org > > > > > > >, > > > > > > "Matthias Braun" < matze at braunis.de > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, May 11, 2016 3:15:03 AM > > > > > > > > > > > > > > > > > > > > > Subject: Re: [GSoC 2016] Interprocedural Register > > > > > > Allocation > > > > > > - > > > > > > Introduction and Feedback > > > > > > > > > > > > > > >> > > > > > Vivek Pandya > > > > > > > > > > > > > > >> > > > > > On Wed, May 11, 2016 at 10:02 AM, vivek pandya < > > > > > > vivekvpandya at gmail.com > wrote: > > > > > > > > > > > > > > >> > > > > > > Vivek Pandya > > > > > > > > > > > > > > > > > > > > >> > > > > > > On Wed, May 11, 2016 at 9:43 AM, Mehdi Amini < > > > > > > > mehdi.amini at apple.com > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > On May 10, 2016, at 6:06 PM, Hal Finkel < > > > > > > > > > hfinkel at anl.gov > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > From: "vivek pandya" < vivekvpandya at gmail.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "llvm-dev" < llvm-dev at lists.llvm.org >, "Tim > > > > > > > > > > Amini > > > > > > > > > > Golling" > > > > > > > > > > < > > > > > > > > > > mehdi.amini at apple.com >, "Hal Finkel" < > > > > > > > > > > hfinkel at anl.gov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cc: "Quentin Colombet" < qcolombet at apple.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Tuesday, May 10, 2016 2:59:16 PM > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Subject: [GSoC 2016] Interprocedural Register > > > > > > > > > > Allocation > > > > > > > > > > - > > > > > > > > > > Introduction and Feedback > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > Hello LLVM Community, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > Sorry for delay as I was busy in final exams. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > I am Vivek from India. Thanks for choosing my > > > > > > > > > > proposal > > > > > > > > > > for > > > > > > > > > > Interprocedural Register Allocation (IPRA) in LLVM. > > > > > > > > > > Mehdi > > > > > > > > > > Amini > > > > > > > > > > and > > > > > > > > > > Hal Finkel will be mentoring me for this project. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > IPRA can reduce code size and runtime of programs > > > > > > > > > > by > > > > > > > > > > allocating > > > > > > > > > > register across the module and procedure > > > > > > > > > > boundaries. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > I have identified some old but effective research > > > > > > > > > > work > > > > > > > > > > on > > > > > > > > > > this > > > > > > > > > > area. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I want community's feedback for feasibility of > > > > > > > > > > these > > > > > > > > > > approach > > > > > > > > > > and > > > > > > > > > > I > > > > > > > > > > am targeting to implement two of them during this > > > > > > > > > > project. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > Here is list of the papers, I have read first two > > > > > > > > > > papers > > > > > > > > > > and > > > > > > > > > > I > > > > > > > > > > would > > > > > > > > > > like to discuss those approach first, I will read > > > > > > > > > > other > > > > > > > > > > two > > > > > > > > > > paper > > > > > > > > > > then initiate discussion for them as well. All I > > > > > > > > > > want > > > > > > > > > > is > > > > > > > > > > to > > > > > > > > > > find > > > > > > > > > > out > > > > > > > > > > a concrete implementation plan before 23 May, 2016 > > > > > > > > > > and > > > > > > > > > > for > > > > > > > > > > that > > > > > > > > > > I > > > > > > > > > > need community's help. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > 1) Compile time ----- Minimizing register usage > > > > > > > > > > penalty > > > > > > > > > > at > > > > > > > > > > procedure > > > > > > > > > > calls - http://dl.acm.org/citation.cfm?id=53999 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ====================================================================In > > > > > > > > > > this approach intra-procedural register allocation > > > > > > > > > > is > > > > > > > > > > used > > > > > > > > > > as > > > > > > > > > > base > > > > > > > > > > but machine code generation order is bottom up > > > > > > > > > > traversal > > > > > > > > > > of > > > > > > > > > > call > > > > > > > > > > graph and inter-procedural effect is achieved by > > > > > > > > > > propagating > > > > > > > > > > register usage information of callee function to > > > > > > > > > > caller > > > > > > > > > > (i.e > > > > > > > > > > child > > > > > > > > > > to parent in CallGraph) so that caller can use > > > > > > > > > > different > > > > > > > > > > registers > > > > > > > > > > than callee and can save load store cost at > > > > > > > > > > procedure > > > > > > > > > > call, > > > > > > > > > > this > > > > > > > > > > is > > > > > > > > > > not trivial as it seems due to recursive calls, > > > > > > > > > > library > > > > > > > > > > function > > > > > > > > > > usage etc. Also for upper region of the graph in > > > > > > > > > > this > > > > > > > > > > technique > > > > > > > > > > available number of registers might become zero in > > > > > > > > > > that > > > > > > > > > > case > > > > > > > > > > it > > > > > > > > > > should fall back to normal load store at procedure > > > > > > > > > > call. > > > > > > > > > > Apart > > > > > > > > > > from > > > > > > > > > > these difficulties other difficulties have been > > > > > > > > > > identified > > > > > > > > > > please > > > > > > > > > > follow this mail-chain > > > > > > > > > > https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > My mentor has already provided me a patch that > > > > > > > > > > alters > > > > > > > > > > code > > > > > > > > > > generation > > > > > > > > > > order as per bottom up call graph traversal, I am > > > > > > > > > > working > > > > > > > > > > from > > > > > > > > > > that > > > > > > > > > > point now. Any other help/suggestion is always > > > > > > > > > > welcomed. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > 2) Link time ----- Global register allocation at > > > > > > > > > > link > > > > > > > > > > time > > > > > > > > > > - > > > > > > > > > > http://dl.acm.org/citation.cfm?id=989415 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ====================================================================In > > > > > > > > > > this particular approach (sort of true IPRA) > > > > > > > > > > registers > > > > > > > > > > will > > > > > > > > > > be > > > > > > > > > > reallocated (this optimization will be optional if > > > > > > > > > > turned > > > > > > > > > > off > > > > > > > > > > still > > > > > > > > > > code will be compiled as per intra-procedural > > > > > > > > > > allocation) > > > > > > > > > > at > > > > > > > > > > link > > > > > > > > > > time. Here modules are first complied as per normal > > > > > > > > > > compilation > > > > > > > > > > but > > > > > > > > > > the object code is annotated with details so that > > > > > > > > > > linker > > > > > > > > > > can > > > > > > > > > > build > > > > > > > > > > call graph and also calculate usage information at > > > > > > > > > > link > > > > > > > > > > time. > > > > > > > > > > Compiler also write hints in object code that if > > > > > > > > > > particular > > > > > > > > > > variable > > > > > > > > > > is allocated in some other register ( due to new > > > > > > > > > > allocation) > > > > > > > > > > then > > > > > > > > > > how the code should be changed? Thus linker can use > > > > > > > > > > these > > > > > > > > > > information to decide which variables (global) need > > > > > > > > > > to > > > > > > > > > > be > > > > > > > > > > in > > > > > > > > > > same > > > > > > > > > > register through out the program execution and also > > > > > > > > > > according > > > > > > > > > > to > > > > > > > > > > register usage information in call graph which > > > > > > > > > > procedure > > > > > > > > > > will > > > > > > > > > > not > > > > > > > > > > be > > > > > > > > > > active simultaneously so that locals for that > > > > > > > > > > procedures > > > > > > > > > > can > > > > > > > > > > be > > > > > > > > > > in > > > > > > > > > > same registers with out load store at procedure > > > > > > > > > > calls. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For these particular method help me to analyze > > > > > > > > > > feasibility: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) Can llvm collects following information at > > > > > > > > > > module > > > > > > > > > > level > > > > > > > > > > in > > > > > > > > > > MachineIR? list of procedures in module, list of > > > > > > > > > > locals > > > > > > > > > > in > > > > > > > > > > procedures, list of procedures that a particular > > > > > > > > > > procedure > > > > > > > > > > can > > > > > > > > > > call, > > > > > > > > > > and a list of the variables this procedure > > > > > > > > > > references. > > > > > > > > > > Each > > > > > > > > > > entry > > > > > > > > > > in > > > > > > > > > > the last two lists includes an estimate of the > > > > > > > > > > number > > > > > > > > > > of > > > > > > > > > > times > > > > > > > > > > the > > > > > > > > > > procedure is called or the variable is referenced > > > > > > > > > > in > > > > > > > > > > each > > > > > > > > > > execution > > > > > > > > > > of this procedure > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2) Can llvm write informative commands to object > > > > > > > > > > files? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3) Can LTO is capable of leveraging those commands? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > In terms of scoping the project for the summer, I > > > > > > > > > definitely > > > > > > > > > recommend that you focus on (1) first. If you finish > > > > > > > > > that, > > > > > > > > > we > > > > > > > > > can > > > > > > > > > certainly move on to other things. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'll add +1 here, but I already wrote the same thing on > > > > > > > > IRC > > > > > > > > when > > > > > > > > discussing with Vivek. True IPRA without a proper > > > > > > > > MachineModule > > > > > > > > infrastructure won't be doable in my opinion (even with > > > > > > > > such > > > > > > > > infrastructure, it may not be trivial in LLVM in > > > > > > > > general). > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > Regarding link time, note that any such a design > > > > > > > > > would > > > > > > > > > likely > > > > > > > > > look > > > > > > > > > much different than in David Wall's paper however, > > > > > > > > > because > > > > > > > > > our > > > > > > > > > LTO > > > > > > > > > re-codegens everything anyway. The paper says, > > > > > > > > > "Finally, > > > > > > > > > it > > > > > > > > > keeps > > > > > > > > > us > > > > > > > > > honest as designers of the system; once we postpone > > > > > > > > > anything > > > > > > > > > until > > > > > > > > > link time, the temptation is great to postpone > > > > > > > > > everything, > > > > > > > > > ..." > > > > > > > > > - > > > > > > > > > Well, we've long-since succumb to that temptation > > > > > > > > > when > > > > > > > > > we > > > > > > > > > LTO. > > > > > > > > > C'est > > > > > > > > > la vie. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > +1 as well, our LTO will benefit naturally from the > > > > > > > > leaf-to-root > > > > > > > > information propagation. ThinLTO will be more > > > > > > > > challenging/interesting though! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For the first part a mechanism similar to > > > > > > > > > > MachineModulePass > > > > > > > > > > would > > > > > > > > > > be > > > > > > > > > > desirable but that may not be possible during this > > > > > > > > > > project, > > > > > > > > > > but > > > > > > > > > > if > > > > > > > > > > we can make some sort of smaller version of that to > > > > > > > > > > suit > > > > > > > > > > our > > > > > > > > > > purpose. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't think we need to make any kind of > > > > > > > > > MachineModulePass > > > > > > > > > to > > > > > > > > > make > > > > > > > > > this work. Once we alter the visitation order based > > > > > > > > > on > > > > > > > > > the > > > > > > > > > CGSCC > > > > > > > > > iteration scheme, we can keep state in-between > > > > > > > > > functions > > > > > > > > > in > > > > > > > > > the > > > > > > > > > pre-existing hacky way (using static members of the > > > > > > > > > relevant > > > > > > > > > function passes). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > Sorry my mistake here by first part I mean 1) requirement > > > > > > > in > > > > > > > the > > > > > > > link > > > > > > > time approach. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I also don't see where/why we need a > > > > > > > > MachineModule(Pass) > > > > > > > > for > > > > > > > > the > > > > > > > > CGSCC scheme, that said I'd rather avoid using a > > > > > > > > function > > > > > > > > pass > > > > > > > > with > > > > > > > > static members, if we can have a ModuleAnalysis that is > > > > > > > > bookkeeping > > > > > > > > the results for functions in the module and queries by > > > > > > > > the > > > > > > > > register > > > > > > > > allocator somehow. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Matthias/Quentin may have other inputs on this aspect. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > @Hal do you mean to add a simple MachineFunction pass that > > > > > > will > > > > > > just > > > > > > operate on register allocated function and prepare a > > > > > > BitVector > > > > > > to > > > > > > indicate which register is being used by MachineFunction, > > > > > > and > > > > > > then > > > > > > use this pass as analysis pass (i.e just simply return > > > > > > static > > > > > > BitVector for clobbered register when register allocation > > > > > > for > > > > > > next > > > > > > function begins. This part is not much clear to me) this > > > > > > thing > > > > > > can > > > > > > be done by scheduling a pass post register allocation in > > > > > > lib/CodeGen/Passes.cpp > > > > > > > > > > > > > > >> > > > > > void TargetPassConfig::addMachinePasses() { > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > // Run pre-ra passes. > > > > > > > > > > > > > > > > > > > > > addPreRegAlloc(); > > > > > > > > > > > > > > >> > > > > > // Run register allocation and passes that are tightly > > > > > > coupled > > > > > > with > > > > > > it, > > > > > > > > > > > > > > > > > > > > > // including phi elimination and scheduling. > > > > > > > > > > > > > > > > > > > > > if (getOptimizeRegAlloc()) > > > > > > > > > > > > > > > > > > > > > addOptimizedRegAlloc(createRegAllocPass(true)); > > > > > > > > > > > > > > > > > > > > > else > > > > > > > > > > > > > > > > > > > > > addFastRegAlloc(createRegAllocPass(false)); > > > > > > > > > > > > > > >> > > > > > // Run post-ra passes. > > > > > > > > > > > > > > > > > > > > > addPostRegAlloc(); > > > > > > > > > > > > > > > > > > > > > // Adding a new pass here which keeps register mask > > > > > > information > > > > > > across function calls. > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > >> > > > > > But this also requires current register allocators to use > > > > > > this > > > > > > information in someway because RegMaskBits in > > > > > > LiveIntervalAnalysis.cpp is not static across calls. I mean > > > > > > I > > > > > > am > > > > > > not > > > > > > clear for how to propagate static info to Intra-procedural > > > > > > Register > > > > > > allocators (if possible without disturbing their code ) > > > > > > > > > > > > > > > > > > > > First, my hope is that we won't need to change the register > > > > > allocators, as such, in order to make use of this > > > > > information. > > > > > Instead, we'll simply be able to alter the register masks > > > > > generated > > > > > for the call instructions. These masks will indicate fewer > > > > > clobbers > > > > > than might otherwise be present based on the ABI because of > > > > > information gathered during the codegen of the callee. These > > > > > masks > > > > > are generally constructed by target based on the calling > > > > > convention. > > > > > The PowerPC backend, for example, looks like this: > > > > > > > > > >> > > > > // Add a register mask operand representing the > > > > > call-preserved > > > > > registers. > > > > > > > > > > > > > > > const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo(); > > > > > > > > > > > > > > > const uint32_t *Mask > > > > > > > > > > > > > > > TRI->getCallPreservedMask(DAG.getMachineFunction(), > > > > > CallConv); > > > > > > > > > > > > > > > assert(Mask && "Missing call preserved mask for calling > > > > > convention"); > > > > > > > > > > > > > > > Ops.push_back(DAG.getRegisterMask(Mask)); > > > > > > > > > >> > > > > but it can be more complicated. If you look for uses of > > > > > 'getRegisterMask' in Target/*/*ISelLowering.cpp, you'll see > > > > > what > > > > > I > > > > > mean. Regardless, the code ends up calling some method is the > > > > > targets TargetRegisterInfo subclass. These methods generally > > > > > look > > > > > something like this: > > > > > > > > > >> > > > > const uint32_t * > > > > > > > > > > > > > > > PPCRegisterInfo::getCallPreservedMask(const MachineFunction > > > > > &MF, > > > > > > > > > > > > > > > CallingConv::ID CC) const { > > > > > > > > > > > > > > > const PPCSubtarget &Subtarget > > > > > MF.getSubtarget<PPCSubtarget>(); > > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > > return TM.isPPC64() ? (Subtarget.hasAltivec() ? > > > > > CSR_SVR464_Altivec_RegMask > > > > > > > > > > > > > > > : CSR_SVR464_RegMask) > > > > > > > > > > > > > > > : (Subtarget.hasAltivec() ? CSR_SVR432_Altivec_RegMask > > > > > > > > > > > > > > > : CSR_SVR432_RegMask); > > > > > > > > > > > > > > > } > > > > > > > > > >> > > > > In any case, the fundamental idea here is that, when someone > > > > > calls > > > > > getCallPreservedMask in order to set the regmask on a call, > > > > > we > > > > > might > > > > > not have to use the CC at all. Instead, if we've already > > > > > codegened > > > > > the function, we might use a cache of 'exact' register masks > > > > > computed during codegen of the potential callees instead. > > > > > > > > > >> > > > > In order to do this, I think we'll need to provide a function > > > > > callable from the target's getCallPreservedMask > > > > > implementation, > > > > > which can return such an 'exact' regmask when available. I > > > > > think > > > > > we > > > > > need to do it this way for two reasons: > > > > > > > > > >> > > > > 1. Not all of the target code calls getCallPreservedMask, but > > > > > sometimes calls other similar target-specific functions (e.g. > > > > > getTLSCallPreservedMask). > > > > > > > > > > > > > > > 2. The targets need to opt-in to this behavior because only > > > > > the > > > > > target can know that all register uses are really tagged > > > > > correctly > > > > > post "pre-emit". > > > > > > > > > >> > > > > Because the target is free to introduce uses of registers at > > > > > essentially any time, we need to do the scanning for used > > > > > registers > > > > > after the "pre-emit" passes run. This can be done by > > > > > scheduling > > > > > some > > > > > simple register-use scanning pass after the call to > > > > > addPreEmitPass > > > > > in lib/CodeGen/Passes.cpp. > > > > > > > > > >> > > > MachineRegister maintains linked lists with defs/uses for each > > > > register so you can determine whether a specific register is > > > > used > > > > or > > > > not without scanning. > > > > > > > > > Does this include regmask-clobbered registers? > > >> > > -Hal > > >> > > > - Matthias > > > > > >> > > -- > > >> > > Hal Finkel > > > > > > Assistant Computational Scientist > > > > > > Leadership Computing Facility > > > > > > Argonne National Laboratory > > > > > > _______________________________________________ > > > > > > LLVM Developers mailing list > > > > > > llvm-dev at lists.llvm.org > > > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160511/73037f41/attachment.html>
vivek pandya via llvm-dev
2016-May-11 18:19 UTC
[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback
*Vivek Pandya* On Wed, May 11, 2016 at 11:46 PM, Hal Finkel <hfinkel at anl.gov> wrote:> > > ------------------------------ > > *From: *"vivek pandya" <vivekvpandya at gmail.com> > *To: *"Matthias Braun" <matze at braunis.de> > *Cc: *"Hal Finkel" <hfinkel at anl.gov>, "llvm-dev" <llvm-dev at lists.llvm.org> > *Sent: *Wednesday, May 11, 2016 1:14:07 PM > *Subject: *Re: [llvm-dev] [GSoC 2016] Interprocedural Register Allocation > - Introduction and Feedback > > > > *Vivek Pandya* > > > On Wed, May 11, 2016 at 11:19 PM, Matthias Braun <matze at braunis.de> wrote: > >> Yes there is also MachineRegisterInfo::UsedPhysRegMask which should be >> the union of all regmasks in the function. >> >> Does this means there is no requirement of a separate pass for > propagating register usage information ? And also only codegen order needs > to be changed to DFS on call graph. And currently no intra-procedural > register is using UsedPhysRegMask to avoid load/store. Does it mean we need > to change that ? Or I am getting it wrong. > > > No, it just means that the "scanning" procedure is simple, it does not > actually need to visit each instruction. >ok so still we need a ModulePass that will be executed as pre-emit pass but it will do propagation in O(1) time right ? - Vivek> -Hal > > > > - Vivek > >> On May 11, 2016, at 10:47 AM, Hal Finkel via llvm-dev < >> llvm-dev at lists.llvm.org> wrote: >> >> >> >> ------------------------------ >> >> *From: *"Matthias Braun" <matze at braunis.de> >> >> *To: *"Hal Finkel" <hfinkel at anl.gov> >> *Cc: *"vivek pandya" <vivekvpandya at gmail.com>, "llvm-dev" < >> llvm-dev at lists.llvm.org> >> *Sent: *Wednesday, May 11, 2016 12:46:25 PM >> *Subject: *Re: [llvm-dev] [GSoC 2016] Interprocedural Register >> Allocation - Introduction and Feedback >> >> >> On May 11, 2016, at 3:31 AM, Hal Finkel via llvm-dev < >> llvm-dev at lists.llvm.org> wrote: >> >> >> ------------------------------ >> >> *From: *"vivek pandya" <vivekvpandya at gmail.com> >> *To: *"Mehdi Amini" <mehdi.amini at apple.com> >> *Cc: *"Hal Finkel" <hfinkel at anl.gov>, "Quentin Colombet" < >> qcolombet at apple.com>, "llvm-dev" <llvm-dev at lists.llvm.org>, "Matthias >> Braun" <matze at braunis.de> >> *Sent: *Wednesday, May 11, 2016 3:15:03 AM >> *Subject: *Re: [GSoC 2016] Interprocedural Register Allocation - >> Introduction and Feedback >> >> >> >> *Vivek Pandya* >> >> >> On Wed, May 11, 2016 at 10:02 AM, vivek pandya <vivekvpandya at gmail.com> >> wrote: >> >>> >>> >>> *Vivek Pandya* >>> >>> >>> On Wed, May 11, 2016 at 9:43 AM, Mehdi Amini <mehdi.amini at apple.com> >>> wrote: >>> >>>> >>>> On May 10, 2016, at 6:06 PM, Hal Finkel <hfinkel at anl.gov> wrote: >>>> >>>> >>>> >>>> ------------------------------ >>>> >>>> *From: *"vivek pandya" <vivekvpandya at gmail.com> >>>> *To: *"llvm-dev" <llvm-dev at lists.llvm.org>, "Tim Amini Golling" < >>>> mehdi.amini at apple.com>, "Hal Finkel" <hfinkel at anl.gov> >>>> *Cc: *"Quentin Colombet" <qcolombet at apple.com> >>>> *Sent: *Tuesday, May 10, 2016 2:59:16 PM >>>> *Subject: *[GSoC 2016] Interprocedural Register Allocation - >>>> Introduction and Feedback >>>> >>>> Hello LLVM Community, >>>> >>>> Sorry for delay as I was busy in final exams. >>>> >>>> I am Vivek from India. Thanks for choosing my proposal for >>>> Interprocedural Register Allocation (IPRA) in LLVM. Mehdi Amini and Hal >>>> Finkel will be mentoring me for this project. >>>> >>>> IPRA can reduce code size and runtime of programs by allocating >>>> register across the module and procedure boundaries. >>>> >>>> I have identified some old but effective research work on this area. >>>> I want community's feedback for feasibility of these approach and I am >>>> targeting to implement two of them during this project. >>>> >>>> Here is list of the papers, I have read first two papers and I would >>>> like to discuss those approach first, I will read other two paper then >>>> initiate discussion for them as well. All I want is to find out a concrete >>>> implementation plan before 23 May, 2016 and for that I need community's >>>> help. >>>> >>>> 1) Compile time ----- Minimizing register usage penalty at procedure >>>> calls - http://dl.acm.org/citation.cfm?id=53999 >>>> ====================================================================In >>>> this approach intra-procedural register allocation is used as base but >>>> machine code generation order is bottom up traversal of call graph and >>>> inter-procedural effect is achieved by propagating register usage >>>> information of callee function to caller (i.e child to parent in CallGraph) >>>> so that caller can use different registers than callee and can save load >>>> store cost at procedure call, this is not trivial as it seems due to >>>> recursive calls, library function usage etc. Also for upper region of the >>>> graph in this technique available number of registers might become zero in >>>> that case it should fall back to normal load store at procedure call. Apart >>>> from these difficulties other difficulties have been identified please >>>> follow this mail-chain >>>> https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion >>>> My mentor has already provided me a patch that alters code generation >>>> order as per bottom up call graph traversal, I am working from that point >>>> now. Any other help/suggestion is always welcomed. >>>> >>>> 2) Link time ----- Global register allocation at link time - >>>> http://dl.acm.org/citation.cfm?id=989415 >>>> ====================================================================In >>>> this particular approach (sort of true IPRA) registers will be reallocated >>>> (this optimization will be optional if turned off still code will be >>>> compiled as per intra-procedural allocation) at link time. Here modules are >>>> first complied as per normal compilation but the object code is annotated >>>> with details so that linker can build call graph and also calculate usage >>>> information at link time. Compiler also write hints in object code that if >>>> particular variable is allocated in some other register ( due to new >>>> allocation) then how the code should be changed? Thus linker can use these >>>> information to decide which variables (global) need to be in same register >>>> through out the program execution and also according to register usage >>>> information in call graph which procedure will not be active simultaneously >>>> so that locals for that procedures can be in same registers with out load >>>> store at procedure calls. >>>> For these particular method help me to analyze feasibility: >>>> 1) Can llvm collects following information at module level in >>>> MachineIR? list of procedures in module, list of locals in procedures, list >>>> of procedures that a particular procedure can call, and a list of the >>>> variables this procedure references. Each entry in the last two lists >>>> includes an estimate of the number of times the procedure is called or the >>>> variable is referenced in each execution of this procedure >>>> 2) Can llvm write informative commands to object files? >>>> 3) Can LTO is capable of leveraging those commands? >>>> >>>> In terms of scoping the project for the summer, I definitely recommend >>>> that you focus on (1) first. If you finish that, we can certainly move on >>>> to other things. >>>> >>>> >>>> I'll add +1 here, but I already wrote the same thing on IRC when >>>> discussing with Vivek. True IPRA without a proper MachineModule >>>> infrastructure won't be doable in my opinion (even with such >>>> infrastructure, it may not be trivial in LLVM in general). >>>> >>>> Regarding link time, note that any such a design would likely look much >>>> different than in David Wall's paper however, because our LTO re-codegens >>>> everything anyway. The paper says, "Finally, it keeps us honest as >>>> designers of the system; once we postpone anything until link time, the >>>> temptation is great to postpone everything, ..." - Well, we've long-since >>>> succumb to that temptation when we LTO. C'est la vie. >>>> >>>> >>>> +1 as well, our LTO will benefit naturally from the leaf-to-root >>>> information propagation. ThinLTO will be more challenging/interesting >>>> though! >>>> >>>> For the first part a mechanism similar to MachineModulePass would be >>>> desirable but that may not be possible during this project, but if we can >>>> make some sort of smaller version of that to suit our purpose. >>>> >>>> I don't think we need to make any kind of MachineModulePass to make >>>> this work. Once we alter the visitation order based on the CGSCC iteration >>>> scheme, we can keep state in-between functions in the pre-existing hacky >>>> way (using static members of the relevant function passes). >>>> >>>> Sorry my mistake here by first part I mean 1) requirement in the link >>> time approach. >>> >>>> >>> >>>> I also don't see where/why we need a MachineModule(Pass) for the CGSCC >>>> scheme, that said I'd rather avoid using a function pass with static >>>> members, if we can have a ModuleAnalysis that is bookkeeping the results >>>> for functions in the module and queries by the register allocator somehow. >>>> Matthias/Quentin may have other inputs on this aspect. >>>> >>> >> @Hal do you mean to add a simple MachineFunction pass that will just >> operate on register allocated function and prepare a BitVector to indicate >> which register is being used by MachineFunction, and then use this pass as >> analysis pass (i.e just simply return static BitVector for clobbered >> register when register allocation for next function begins. This part is >> not much clear to me) this thing can be done by scheduling a pass post >> register allocation in lib/CodeGen/Passes.cpp >> >> void TargetPassConfig::addMachinePasses() { >> . >> . >> . >> // Run pre-ra passes. >> addPreRegAlloc(); >> >> // Run register allocation and passes that are tightly coupled with it, >> // including phi elimination and scheduling. >> if (getOptimizeRegAlloc()) >> addOptimizedRegAlloc(createRegAllocPass(true)); >> else >> addFastRegAlloc(createRegAllocPass(false)); >> >> // Run post-ra passes. >> addPostRegAlloc(); >> // Adding a new pass here which keeps register mask information across >> function calls. >> . >> . >> . >> } >> >> But this also requires current register allocators to use this >> information in someway because RegMaskBits in LiveIntervalAnalysis.cpp is >> not static across calls. I mean I am not clear for how to propagate static >> info to Intra-procedural Register allocators (if possible without >> disturbing their code ) >> >> First, my hope is that we won't need to change the register allocators, >> as such, in order to make use of this information. Instead, we'll simply be >> able to alter the register masks generated for the call instructions. These >> masks will indicate fewer clobbers than might otherwise be present based on >> the ABI because of information gathered during the codegen of the callee. >> These masks are generally constructed by target based on the calling >> convention. The PowerPC backend, for example, looks like this: >> >> // Add a register mask operand representing the call-preserved >> registers. >> const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo(); >> const uint32_t *Mask >> TRI->getCallPreservedMask(DAG.getMachineFunction(), CallConv); >> assert(Mask && "Missing call preserved mask for calling convention"); >> Ops.push_back(DAG.getRegisterMask(Mask)); >> >> but it can be more complicated. If you look for uses of 'getRegisterMask' >> in Target/*/*ISelLowering.cpp, you'll see what I mean. Regardless, the code >> ends up calling some method is the targets TargetRegisterInfo subclass. >> These methods generally look something like this: >> >> const uint32_t * >> PPCRegisterInfo::getCallPreservedMask(const MachineFunction &MF, >> CallingConv::ID CC) const { >> const PPCSubtarget &Subtarget = MF.getSubtarget<PPCSubtarget>(); >> ... >> return TM.isPPC64() ? (Subtarget.hasAltivec() ? >> CSR_SVR464_Altivec_RegMask >> : CSR_SVR464_RegMask) >> : (Subtarget.hasAltivec() ? >> CSR_SVR432_Altivec_RegMask >> : CSR_SVR432_RegMask); >> } >> >> In any case, the fundamental idea here is that, when someone calls >> getCallPreservedMask in order to set the regmask on a call, we might not >> have to use the CC at all. Instead, if we've already codegened the >> function, we might use a cache of 'exact' register masks computed during >> codegen of the potential callees instead. >> >> In order to do this, I think we'll need to provide a function callable >> from the target's getCallPreservedMask implementation, which can return >> such an 'exact' regmask when available. I think we need to do it this way >> for two reasons: >> >> 1. Not all of the target code calls getCallPreservedMask, but sometimes >> calls other similar target-specific functions (e.g. >> getTLSCallPreservedMask). >> 2. The targets need to opt-in to this behavior because only the target >> can know that all register uses are really tagged correctly post "pre-emit". >> >> Because the target is free to introduce uses of registers at essentially >> any time, we need to do the scanning for used registers after the >> "pre-emit" passes run. This can be done by scheduling some simple >> register-use scanning pass after the call to addPreEmitPass in >> lib/CodeGen/Passes.cpp. >> >> MachineRegister maintains linked lists with defs/uses for each register >> so you can determine whether a specific register is used or not without >> scanning. >> >> Does this include regmask-clobbered registers? >> >> -Hal >> >> >> - Matthias >> >> >> >> >> -- >> Hal Finkel >> Assistant Computational Scientist >> Leadership Computing Facility >> Argonne National Laboratory >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> >> >> > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160511/56b19eb1/attachment-0001.html>
Hal Finkel via llvm-dev
2016-May-11 18:22 UTC
[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback
----- Original Message -----> From: "vivek pandya" <vivekvpandya at gmail.com> > To: "Hal Finkel" <hfinkel at anl.gov> > Cc: "llvm-dev" <llvm-dev at lists.llvm.org>, "Matthias Braun" > <matze at braunis.de> > Sent: Wednesday, May 11, 2016 1:19:18 PM > Subject: Re: [llvm-dev] [GSoC 2016] Interprocedural Register > Allocation - Introduction and Feedback> Vivek Pandya> On Wed, May 11, 2016 at 11:46 PM, Hal Finkel < hfinkel at anl.gov > > wrote:> > > From: "vivek pandya" < vivekvpandya at gmail.com > > > > > > > To: "Matthias Braun" < matze at braunis.de > > > > > > > Cc: "Hal Finkel" < hfinkel at anl.gov >, "llvm-dev" < > > > llvm-dev at lists.llvm.org > > > > > > > Sent: Wednesday, May 11, 2016 1:14:07 PM > > > > > > Subject: Re: [llvm-dev] [GSoC 2016] Interprocedural Register > > > Allocation - Introduction and Feedback > > >> > > Vivek Pandya > > >> > > On Wed, May 11, 2016 at 11:19 PM, Matthias Braun < > > > matze at braunis.de > > > > > > > wrote: > > >> > > > Yes there is also MachineRegisterInfo::UsedPhysRegMask which > > > > should > > > > be the union of all regmasks in the function. > > > > > >> > > Does this means there is no requirement of a separate pass for > > > propagating register usage information ? And also only codegen > > > order > > > needs to be changed to DFS on call graph. And currently no > > > intra-procedural register is using UsedPhysRegMask to avoid > > > load/store. Does it mean we need to change that ? Or I am getting > > > it > > > wrong. > > > > > No, it just means that the "scanning" procedure is simple, it does > > not actually need to visit each instruction. > > ok so still we need a ModulePass that will be executed as pre-emit > pass but it will do propagation in O(1) time right ?No, we need a function pass that runs after the pre-emit passes. This pass will update some module-level analysis (which will need to be an immutable pass, likely, to fit within the current infrastructure, even though it is not technically immutable - like our AssumptionCache). But, yes, this post-pre-emit function pass will be O(1). -Hal> - Vivek > > -Hal >> > > - Vivek > > > > > > > > On May 11, 2016, at 10:47 AM, Hal Finkel via llvm-dev < > > > > > llvm-dev at lists.llvm.org > wrote: > > > > > > > > > >> > > > > > From: "Matthias Braun" < matze at braunis.de > > > > > > > > > > > > > > > >> > > > > > To: "Hal Finkel" < hfinkel at anl.gov > > > > > > > > > > > > > > > > > > > > > > Cc: "vivek pandya" < vivekvpandya at gmail.com >, "llvm-dev" < > > > > > > llvm-dev at lists.llvm.org > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, May 11, 2016 12:46:25 PM > > > > > > > > > > > > > > > > > > > > > Subject: Re: [llvm-dev] [GSoC 2016] Interprocedural > > > > > > Register > > > > > > Allocation - Introduction and Feedback > > > > > > > > > > > > > > >> > > > > > > On May 11, 2016, at 3:31 AM, Hal Finkel via llvm-dev < > > > > > > > llvm-dev at lists.llvm.org > wrote: > > > > > > > > > > > > > > > > > > > > >> > > > > > > > From: "vivek pandya" < vivekvpandya at gmail.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "Mehdi Amini" < mehdi.amini at apple.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cc: "Hal Finkel" < hfinkel at anl.gov >, "Quentin > > > > > > > > Colombet" > > > > > > > > < > > > > > > > > qcolombet at apple.com >, "llvm-dev" < > > > > > > > > llvm-dev at lists.llvm.org > > > > > > > > >, > > > > > > > > "Matthias Braun" < matze at braunis.de > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Wednesday, May 11, 2016 3:15:03 AM > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Subject: Re: [GSoC 2016] Interprocedural Register > > > > > > > > Allocation > > > > > > > > - > > > > > > > > Introduction and Feedback > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > Vivek Pandya > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > On Wed, May 11, 2016 at 10:02 AM, vivek pandya < > > > > > > > > vivekvpandya at gmail.com > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > Vivek Pandya > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > On Wed, May 11, 2016 at 9:43 AM, Mehdi Amini < > > > > > > > > > mehdi.amini at apple.com > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > On May 10, 2016, at 6:06 PM, Hal Finkel < > > > > > > > > > > > hfinkel at anl.gov > > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > From: "vivek pandya" < vivekvpandya at gmail.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > To: "llvm-dev" < llvm-dev at lists.llvm.org >, > > > > > > > > > > > > "Tim > > > > > > > > > > > > Amini > > > > > > > > > > > > Golling" > > > > > > > > > > > > < > > > > > > > > > > > > mehdi.amini at apple.com >, "Hal Finkel" < > > > > > > > > > > > > hfinkel at anl.gov > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cc: "Quentin Colombet" < qcolombet at apple.com > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Sent: Tuesday, May 10, 2016 2:59:16 PM > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Subject: [GSoC 2016] Interprocedural Register > > > > > > > > > > > > Allocation > > > > > > > > > > > > - > > > > > > > > > > > > Introduction and Feedback > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > Hello LLVM Community, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > Sorry for delay as I was busy in final exams. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > I am Vivek from India. Thanks for choosing my > > > > > > > > > > > > proposal > > > > > > > > > > > > for > > > > > > > > > > > > Interprocedural Register Allocation (IPRA) in > > > > > > > > > > > > LLVM. > > > > > > > > > > > > Mehdi > > > > > > > > > > > > Amini > > > > > > > > > > > > and > > > > > > > > > > > > Hal Finkel will be mentoring me for this > > > > > > > > > > > > project. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > IPRA can reduce code size and runtime of > > > > > > > > > > > > programs > > > > > > > > > > > > by > > > > > > > > > > > > allocating > > > > > > > > > > > > register across the module and procedure > > > > > > > > > > > > boundaries. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > I have identified some old but effective > > > > > > > > > > > > research > > > > > > > > > > > > work > > > > > > > > > > > > on > > > > > > > > > > > > this > > > > > > > > > > > > area. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I want community's feedback for feasibility of > > > > > > > > > > > > these > > > > > > > > > > > > approach > > > > > > > > > > > > and > > > > > > > > > > > > I > > > > > > > > > > > > am targeting to implement two of them during > > > > > > > > > > > > this > > > > > > > > > > > > project. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > Here is list of the papers, I have read first > > > > > > > > > > > > two > > > > > > > > > > > > papers > > > > > > > > > > > > and > > > > > > > > > > > > I > > > > > > > > > > > > would > > > > > > > > > > > > like to discuss those approach first, I will > > > > > > > > > > > > read > > > > > > > > > > > > other > > > > > > > > > > > > two > > > > > > > > > > > > paper > > > > > > > > > > > > then initiate discussion for them as well. All > > > > > > > > > > > > I > > > > > > > > > > > > want > > > > > > > > > > > > is > > > > > > > > > > > > to > > > > > > > > > > > > find > > > > > > > > > > > > out > > > > > > > > > > > > a concrete implementation plan before 23 May, > > > > > > > > > > > > 2016 > > > > > > > > > > > > and > > > > > > > > > > > > for > > > > > > > > > > > > that > > > > > > > > > > > > I > > > > > > > > > > > > need community's help. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > 1) Compile time ----- Minimizing register usage > > > > > > > > > > > > penalty > > > > > > > > > > > > at > > > > > > > > > > > > procedure > > > > > > > > > > > > calls - http://dl.acm.org/citation.cfm?id=53999 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ====================================================================In > > > > > > > > > > > > this approach intra-procedural register > > > > > > > > > > > > allocation > > > > > > > > > > > > is > > > > > > > > > > > > used > > > > > > > > > > > > as > > > > > > > > > > > > base > > > > > > > > > > > > but machine code generation order is bottom up > > > > > > > > > > > > traversal > > > > > > > > > > > > of > > > > > > > > > > > > call > > > > > > > > > > > > graph and inter-procedural effect is achieved > > > > > > > > > > > > by > > > > > > > > > > > > propagating > > > > > > > > > > > > register usage information of callee function > > > > > > > > > > > > to > > > > > > > > > > > > caller > > > > > > > > > > > > (i.e > > > > > > > > > > > > child > > > > > > > > > > > > to parent in CallGraph) so that caller can use > > > > > > > > > > > > different > > > > > > > > > > > > registers > > > > > > > > > > > > than callee and can save load store cost at > > > > > > > > > > > > procedure > > > > > > > > > > > > call, > > > > > > > > > > > > this > > > > > > > > > > > > is > > > > > > > > > > > > not trivial as it seems due to recursive calls, > > > > > > > > > > > > library > > > > > > > > > > > > function > > > > > > > > > > > > usage etc. Also for upper region of the graph > > > > > > > > > > > > in > > > > > > > > > > > > this > > > > > > > > > > > > technique > > > > > > > > > > > > available number of registers might become zero > > > > > > > > > > > > in > > > > > > > > > > > > that > > > > > > > > > > > > case > > > > > > > > > > > > it > > > > > > > > > > > > should fall back to normal load store at > > > > > > > > > > > > procedure > > > > > > > > > > > > call. > > > > > > > > > > > > Apart > > > > > > > > > > > > from > > > > > > > > > > > > these difficulties other difficulties have been > > > > > > > > > > > > identified > > > > > > > > > > > > please > > > > > > > > > > > > follow this mail-chain > > > > > > > > > > > > https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > My mentor has already provided me a patch that > > > > > > > > > > > > alters > > > > > > > > > > > > code > > > > > > > > > > > > generation > > > > > > > > > > > > order as per bottom up call graph traversal, I > > > > > > > > > > > > am > > > > > > > > > > > > working > > > > > > > > > > > > from > > > > > > > > > > > > that > > > > > > > > > > > > point now. Any other help/suggestion is always > > > > > > > > > > > > welcomed. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > 2) Link time ----- Global register allocation > > > > > > > > > > > > at > > > > > > > > > > > > link > > > > > > > > > > > > time > > > > > > > > > > > > - > > > > > > > > > > > > http://dl.acm.org/citation.cfm?id=989415 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ====================================================================In > > > > > > > > > > > > this particular approach (sort of true IPRA) > > > > > > > > > > > > registers > > > > > > > > > > > > will > > > > > > > > > > > > be > > > > > > > > > > > > reallocated (this optimization will be optional > > > > > > > > > > > > if > > > > > > > > > > > > turned > > > > > > > > > > > > off > > > > > > > > > > > > still > > > > > > > > > > > > code will be compiled as per intra-procedural > > > > > > > > > > > > allocation) > > > > > > > > > > > > at > > > > > > > > > > > > link > > > > > > > > > > > > time. Here modules are first complied as per > > > > > > > > > > > > normal > > > > > > > > > > > > compilation > > > > > > > > > > > > but > > > > > > > > > > > > the object code is annotated with details so > > > > > > > > > > > > that > > > > > > > > > > > > linker > > > > > > > > > > > > can > > > > > > > > > > > > build > > > > > > > > > > > > call graph and also calculate usage information > > > > > > > > > > > > at > > > > > > > > > > > > link > > > > > > > > > > > > time. > > > > > > > > > > > > Compiler also write hints in object code that > > > > > > > > > > > > if > > > > > > > > > > > > particular > > > > > > > > > > > > variable > > > > > > > > > > > > is allocated in some other register ( due to > > > > > > > > > > > > new > > > > > > > > > > > > allocation) > > > > > > > > > > > > then > > > > > > > > > > > > how the code should be changed? Thus linker can > > > > > > > > > > > > use > > > > > > > > > > > > these > > > > > > > > > > > > information to decide which variables (global) > > > > > > > > > > > > need > > > > > > > > > > > > to > > > > > > > > > > > > be > > > > > > > > > > > > in > > > > > > > > > > > > same > > > > > > > > > > > > register through out the program execution and > > > > > > > > > > > > also > > > > > > > > > > > > according > > > > > > > > > > > > to > > > > > > > > > > > > register usage information in call graph which > > > > > > > > > > > > procedure > > > > > > > > > > > > will > > > > > > > > > > > > not > > > > > > > > > > > > be > > > > > > > > > > > > active simultaneously so that locals for that > > > > > > > > > > > > procedures > > > > > > > > > > > > can > > > > > > > > > > > > be > > > > > > > > > > > > in > > > > > > > > > > > > same registers with out load store at procedure > > > > > > > > > > > > calls. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For these particular method help me to analyze > > > > > > > > > > > > feasibility: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1) Can llvm collects following information at > > > > > > > > > > > > module > > > > > > > > > > > > level > > > > > > > > > > > > in > > > > > > > > > > > > MachineIR? list of procedures in module, list > > > > > > > > > > > > of > > > > > > > > > > > > locals > > > > > > > > > > > > in > > > > > > > > > > > > procedures, list of procedures that a > > > > > > > > > > > > particular > > > > > > > > > > > > procedure > > > > > > > > > > > > can > > > > > > > > > > > > call, > > > > > > > > > > > > and a list of the variables this procedure > > > > > > > > > > > > references. > > > > > > > > > > > > Each > > > > > > > > > > > > entry > > > > > > > > > > > > in > > > > > > > > > > > > the last two lists includes an estimate of the > > > > > > > > > > > > number > > > > > > > > > > > > of > > > > > > > > > > > > times > > > > > > > > > > > > the > > > > > > > > > > > > procedure is called or the variable is > > > > > > > > > > > > referenced > > > > > > > > > > > > in > > > > > > > > > > > > each > > > > > > > > > > > > execution > > > > > > > > > > > > of this procedure > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2) Can llvm write informative commands to > > > > > > > > > > > > object > > > > > > > > > > > > files? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3) Can LTO is capable of leveraging those > > > > > > > > > > > > commands? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > In terms of scoping the project for the summer, I > > > > > > > > > > > definitely > > > > > > > > > > > recommend that you focus on (1) first. If you > > > > > > > > > > > finish > > > > > > > > > > > that, > > > > > > > > > > > we > > > > > > > > > > > can > > > > > > > > > > > certainly move on to other things. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'll add +1 here, but I already wrote the same > > > > > > > > > > thing > > > > > > > > > > on > > > > > > > > > > IRC > > > > > > > > > > when > > > > > > > > > > discussing with Vivek. True IPRA without a proper > > > > > > > > > > MachineModule > > > > > > > > > > infrastructure won't be doable in my opinion (even > > > > > > > > > > with > > > > > > > > > > such > > > > > > > > > > infrastructure, it may not be trivial in LLVM in > > > > > > > > > > general). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > Regarding link time, note that any such a design > > > > > > > > > > > would > > > > > > > > > > > likely > > > > > > > > > > > look > > > > > > > > > > > much different than in David Wall's paper > > > > > > > > > > > however, > > > > > > > > > > > because > > > > > > > > > > > our > > > > > > > > > > > LTO > > > > > > > > > > > re-codegens everything anyway. The paper says, > > > > > > > > > > > "Finally, > > > > > > > > > > > it > > > > > > > > > > > keeps > > > > > > > > > > > us > > > > > > > > > > > honest as designers of the system; once we > > > > > > > > > > > postpone > > > > > > > > > > > anything > > > > > > > > > > > until > > > > > > > > > > > link time, the temptation is great to postpone > > > > > > > > > > > everything, > > > > > > > > > > > ..." > > > > > > > > > > > - > > > > > > > > > > > Well, we've long-since succumb to that temptation > > > > > > > > > > > when > > > > > > > > > > > we > > > > > > > > > > > LTO. > > > > > > > > > > > C'est > > > > > > > > > > > la vie. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > +1 as well, our LTO will benefit naturally from the > > > > > > > > > > leaf-to-root > > > > > > > > > > information propagation. ThinLTO will be more > > > > > > > > > > challenging/interesting though! > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > For the first part a mechanism similar to > > > > > > > > > > > > MachineModulePass > > > > > > > > > > > > would > > > > > > > > > > > > be > > > > > > > > > > > > desirable but that may not be possible during > > > > > > > > > > > > this > > > > > > > > > > > > project, > > > > > > > > > > > > but > > > > > > > > > > > > if > > > > > > > > > > > > we can make some sort of smaller version of > > > > > > > > > > > > that > > > > > > > > > > > > to > > > > > > > > > > > > suit > > > > > > > > > > > > our > > > > > > > > > > > > purpose. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't think we need to make any kind of > > > > > > > > > > > MachineModulePass > > > > > > > > > > > to > > > > > > > > > > > make > > > > > > > > > > > this work. Once we alter the visitation order > > > > > > > > > > > based > > > > > > > > > > > on > > > > > > > > > > > the > > > > > > > > > > > CGSCC > > > > > > > > > > > iteration scheme, we can keep state in-between > > > > > > > > > > > functions > > > > > > > > > > > in > > > > > > > > > > > the > > > > > > > > > > > pre-existing hacky way (using static members of > > > > > > > > > > > the > > > > > > > > > > > relevant > > > > > > > > > > > function passes). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > Sorry my mistake here by first part I mean 1) > > > > > > > > > requirement > > > > > > > > > in > > > > > > > > > the > > > > > > > > > link > > > > > > > > > time approach. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I also don't see where/why we need a > > > > > > > > > > MachineModule(Pass) > > > > > > > > > > for > > > > > > > > > > the > > > > > > > > > > CGSCC scheme, that said I'd rather avoid using a > > > > > > > > > > function > > > > > > > > > > pass > > > > > > > > > > with > > > > > > > > > > static members, if we can have a ModuleAnalysis > > > > > > > > > > that > > > > > > > > > > is > > > > > > > > > > bookkeeping > > > > > > > > > > the results for functions in the module and queries > > > > > > > > > > by > > > > > > > > > > the > > > > > > > > > > register > > > > > > > > > > allocator somehow. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Matthias/Quentin may have other inputs on this > > > > > > > > > > aspect. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > @Hal do you mean to add a simple MachineFunction pass > > > > > > > > that > > > > > > > > will > > > > > > > > just > > > > > > > > operate on register allocated function and prepare a > > > > > > > > BitVector > > > > > > > > to > > > > > > > > indicate which register is being used by > > > > > > > > MachineFunction, > > > > > > > > and > > > > > > > > then > > > > > > > > use this pass as analysis pass (i.e just simply return > > > > > > > > static > > > > > > > > BitVector for clobbered register when register > > > > > > > > allocation > > > > > > > > for > > > > > > > > next > > > > > > > > function begins. This part is not much clear to me) > > > > > > > > this > > > > > > > > thing > > > > > > > > can > > > > > > > > be done by scheduling a pass post register allocation > > > > > > > > in > > > > > > > > lib/CodeGen/Passes.cpp > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > void TargetPassConfig::addMachinePasses() { > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > // Run pre-ra passes. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > addPreRegAlloc(); > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > // Run register allocation and passes that are tightly > > > > > > > > coupled > > > > > > > > with > > > > > > > > it, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > // including phi elimination and scheduling. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > if (getOptimizeRegAlloc()) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > addOptimizedRegAlloc(createRegAllocPass(true)); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > else > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > addFastRegAlloc(createRegAllocPass(false)); > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > // Run post-ra passes. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > addPostRegAlloc(); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > // Adding a new pass here which keeps register mask > > > > > > > > information > > > > > > > > across function calls. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > . > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > >> > > > > > > > But this also requires current register allocators to > > > > > > > > use > > > > > > > > this > > > > > > > > information in someway because RegMaskBits in > > > > > > > > LiveIntervalAnalysis.cpp is not static across calls. I > > > > > > > > mean > > > > > > > > I > > > > > > > > am > > > > > > > > not > > > > > > > > clear for how to propagate static info to > > > > > > > > Intra-procedural > > > > > > > > Register > > > > > > > > allocators (if possible without disturbing their code ) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > First, my hope is that we won't need to change the > > > > > > > register > > > > > > > allocators, as such, in order to make use of this > > > > > > > information. > > > > > > > Instead, we'll simply be able to alter the register masks > > > > > > > generated > > > > > > > for the call instructions. These masks will indicate > > > > > > > fewer > > > > > > > clobbers > > > > > > > than might otherwise be present based on the ABI because > > > > > > > of > > > > > > > information gathered during the codegen of the callee. > > > > > > > These > > > > > > > masks > > > > > > > are generally constructed by target based on the calling > > > > > > > convention. > > > > > > > The PowerPC backend, for example, looks like this: > > > > > > > > > > > > > > > > > > > > >> > > > > > > // Add a register mask operand representing the > > > > > > > call-preserved > > > > > > > registers. > > > > > > > > > > > > > > > > > > > > > > > > > > > > const TargetRegisterInfo *TRI > > > > > > > Subtarget.getRegisterInfo(); > > > > > > > > > > > > > > > > > > > > > > > > > > > > const uint32_t *Mask > > > > > > > > > > > > > > > > > > > > > > > > > > > > TRI->getCallPreservedMask(DAG.getMachineFunction(), > > > > > > > CallConv); > > > > > > > > > > > > > > > > > > > > > > > > > > > > assert(Mask && "Missing call preserved mask for calling > > > > > > > convention"); > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ops.push_back(DAG.getRegisterMask(Mask)); > > > > > > > > > > > > > > > > > > > > >> > > > > > > but it can be more complicated. If you look for uses of > > > > > > > 'getRegisterMask' in Target/*/*ISelLowering.cpp, you'll > > > > > > > see > > > > > > > what > > > > > > > I > > > > > > > mean. Regardless, the code ends up calling some method is > > > > > > > the > > > > > > > targets TargetRegisterInfo subclass. These methods > > > > > > > generally > > > > > > > look > > > > > > > something like this: > > > > > > > > > > > > > > > > > > > > >> > > > > > > const uint32_t * > > > > > > > > > > > > > > > > > > > > > > > > > > > > PPCRegisterInfo::getCallPreservedMask(const > > > > > > > MachineFunction > > > > > > > &MF, > > > > > > > > > > > > > > > > > > > > > > > > > > > > CallingConv::ID CC) const { > > > > > > > > > > > > > > > > > > > > > > > > > > > > const PPCSubtarget &Subtarget > > > > > > > MF.getSubtarget<PPCSubtarget>(); > > > > > > > > > > > > > > > > > > > > > > > > > > > > ... > > > > > > > > > > > > > > > > > > > > > > > > > > > > return TM.isPPC64() ? (Subtarget.hasAltivec() ? > > > > > > > CSR_SVR464_Altivec_RegMask > > > > > > > > > > > > > > > > > > > > > > > > > > > > : CSR_SVR464_RegMask) > > > > > > > > > > > > > > > > > > > > > > > > > > > > : (Subtarget.hasAltivec() ? CSR_SVR432_Altivec_RegMask > > > > > > > > > > > > > > > > > > > > > > > > > > > > : CSR_SVR432_RegMask); > > > > > > > > > > > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > >> > > > > > > In any case, the fundamental idea here is that, when > > > > > > > someone > > > > > > > calls > > > > > > > getCallPreservedMask in order to set the regmask on a > > > > > > > call, > > > > > > > we > > > > > > > might > > > > > > > not have to use the CC at all. Instead, if we've already > > > > > > > codegened > > > > > > > the function, we might use a cache of 'exact' register > > > > > > > masks > > > > > > > computed during codegen of the potential callees instead. > > > > > > > > > > > > > > > > > > > > >> > > > > > > In order to do this, I think we'll need to provide a > > > > > > > function > > > > > > > callable from the target's getCallPreservedMask > > > > > > > implementation, > > > > > > > which can return such an 'exact' regmask when available. > > > > > > > I > > > > > > > think > > > > > > > we > > > > > > > need to do it this way for two reasons: > > > > > > > > > > > > > > > > > > > > >> > > > > > > 1. Not all of the target code calls getCallPreservedMask, > > > > > > > but > > > > > > > sometimes calls other similar target-specific functions > > > > > > > (e.g. > > > > > > > getTLSCallPreservedMask). > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. The targets need to opt-in to this behavior because > > > > > > > only > > > > > > > the > > > > > > > target can know that all register uses are really tagged > > > > > > > correctly > > > > > > > post "pre-emit". > > > > > > > > > > > > > > > > > > > > >> > > > > > > Because the target is free to introduce uses of registers > > > > > > > at > > > > > > > essentially any time, we need to do the scanning for used > > > > > > > registers > > > > > > > after the "pre-emit" passes run. This can be done by > > > > > > > scheduling > > > > > > > some > > > > > > > simple register-use scanning pass after the call to > > > > > > > addPreEmitPass > > > > > > > in lib/CodeGen/Passes.cpp. > > > > > > > > > > > > > > > > > > > > >> > > > > > MachineRegister maintains linked lists with defs/uses for > > > > > > each > > > > > > register so you can determine whether a specific register > > > > > > is > > > > > > used > > > > > > or > > > > > > not without scanning. > > > > > > > > > > > > > > > > > > > > Does this include regmask-clobbered registers? > > > > > > > > > >> > > > > -Hal > > > > > > > > > >> > > > > > - Matthias > > > > > > > > > > > > > > >> > > > > -- > > > > > > > > > >> > > > > Hal Finkel > > > > > > > > > > > > > > > Assistant Computational Scientist > > > > > > > > > > > > > > > Leadership Computing Facility > > > > > > > > > > > > > > > Argonne National Laboratory > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > > > LLVM Developers mailing list > > > > > > > > > > > > > > > llvm-dev at lists.llvm.org > > > > > > > > > > > > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > > > > > > > > > > > -- >> > Hal Finkel > > > Assistant Computational Scientist > > > Leadership Computing Facility > > > Argonne National Laboratory >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160511/aede503b/attachment.html>