thr3ads.net - llvm dev - [llvm-dev] [GSoC 2016] Interprocedural Register Allocation

If this information is useful, please help other people find it:
Share via:

vivek pandya via llvm-dev

2016-May-25 03:16 UTC

[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback

On Wed, May 25, 2016 at 8:44 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> ------------------------------
>
> *From: *"vivek pandya" <vivekvpandya at gmail.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>,
"Matthias Braun" <
> matze at braunis.de>, "Mehdi Amini" <mehdi.amini at
apple.com>, "Quentin
> Colombet" <qcolombet at apple.com>
> *Sent: *Tuesday, May 24, 2016 9:34:29 PM
>
> *Subject: *Re: [GSoC 2016] Interprocedural Register Allocation -
> Introduction and Feedback
>
>
>
> On Wed, May 25, 2016 at 3:53 AM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>>
>> ------------------------------
>>
>> *From: *"vivek pandya" <vivekvpandya at gmail.com>
>> *To: *"Quentin Colombet" <qcolombet at apple.com>
>> *Cc: *"Hal Finkel" <hfinkel at anl.gov>,
"llvm-dev" <llvm-dev at lists.llvm.org>,
>> "Matthias Braun" <matze at braunis.de>, "Mehdi
Amini" <mehdi.amini at apple.com
>> >
>> *Sent: *Tuesday, May 24, 2016 1:00:58 PM
>> *Subject: *Re: [GSoC 2016] Interprocedural Register Allocation -
>> Introduction and Feedback
>>
>> Hello,
>>
>> I have written following code to check each register if it is used by
>> machineFunction or not :
>>
>> MachineRegisterInfo *MRI = &MF.getRegInfo();
>> TargetRegisterInfo *TRI = (TargetRegisterInfo
>> *)MF.getSubtarget().getRegisterInfo();
>>
>> Some reason you can't use a const pointer here?
>>
> MCRegisterInfo is just used to get conventional name of register for given
> target like AX, BX on X86.
>
>>
>> const TargetMachine &TM = MF.getTarget();
>> const MCRegisterInfo *MCRI = TM.getMCRegisterInfo();
>> DEBUG(dbgs() << "Function Name : " <<
MF.getName() << "\n");
>>
>> for(TargetRegisterInfo::regclass_iterator i = (*TRI).regclass_begin(),
e
>> = (*TRI).regclass_end(); i != e; i++ ) {
>> for(TargetRegisterClass::iterator pregi = (*i)->begin(), prege
>> (*i)->end(); pregi != prege; pregi++ ) {
>> DEBUG( dbgs() << "Physical Register : " <<
MCRI->getName(*pregi) << " is
>> modified "<< MRI->isPhysRegModified(*pregi) <<
" \n");
>>
>> Try isPhysRegUsed.
>>
> ok
>
>>
>> }
>> }
>> DEBUG(dbgs() << "\n");
>>
>> The pass which is executing this code is schedule POST-RA stage but
this
>> gives me true for all registers i.e in each function all registers are
>> being used except EBP and some other similar, Is this a correct way to
get
>> register usage information ? I think I have made some mistake please
help.
>>
>>
>> You might look at the implementation of these functions in
>> lib/CodeGen/MachineRegisterInfo.cpp and figure out if they're
returning
>> true because UsedPhysRegMask.test(PhysReg) is true or because
>> reg_nodbg_empty(*AliasReg) is true.
>>
> Yes that helped now I am getting actual register which have been used by
> given function, but a little problem
> The updated code is as shown below :
> for(TargetRegisterInfo::regclass_iterator i = (*TRI).regclass_begin(), e
> (*TRI).regclass_end(); i != e; i++ ) {
> for(TargetRegisterClass::iterator pregi = (*i)->begin(), prege >
(*i)->end(); pregi != prege; pregi++ ) {
> for (MCRegAliasIterator AliasReg(*pregi, TRI, true); AliasReg.isValid();
> ++AliasReg) {
>    if (!MRI->reg_nodbg_empty(*AliasReg)) {
>     DEBUG( dbgs() << "Physical Register : " <<
MCRI->getName(*pregi) << "
> is used "<< MRI->isPhysRegUsed(*pregi) << "
\n");
>     break; // no need to process more alias
>    }
>   }
> }
> }
> But here some registers are getting processed with in different classes
> (unnecessary processing) Is this only way to iterate through all used
> register (using RegClass iterator) ? Is there any way to avoid duplicate
> regs?
> Of course currently I am just printing but next I am thinking to use a map
> to track usage info , in that only distinct register info will be stored
> but still due to loop structure I need to iterate through a single register
> 3 - 4 times making it time consuming.
>
> Yes, I believe you can just do:
>
>   for (unsigned Reg = 0; Reg < TRI->getNumRegs(); ++Reg) {
>Oh yes thanks I just forgot that PhyReg starts at 0.
>
>
>  -Hal
>
>
> -Vivek
>
>>
>>
>>  -Hal
>>
>>
>>
>> Vivek
>>
>> On Wed, May 18, 2016 at 11:42 PM, Quentin Colombet <qcolombet at
apple.com>
>> wrote:
>>
>>>
>>> On May 18, 2016, at 11:00 AM, vivek pandya <vivekvpandya at
gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> *Vivek Pandya*
>>>
>>>
>>> On Wed, May 18, 2016 at 11:25 PM, Quentin Colombet <qcolombet at
apple.com>
>>>  wrote:
>>>
>>>>
>>>> On May 18, 2016, at 10:46 AM, vivek pandya <vivekvpandya at
gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> *Vivek Pandya*
>>>>
>>>>
>>>> On Wed, May 11, 2016 at 4:01 PM, Hal Finkel <hfinkel at
anl.gov> wrote:
>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> *From: *"vivek pandya" <vivekvpandya at
gmail.com>
>>>>> *To: *"Mehdi Amini" <mehdi.amini at
apple.com>
>>>>> *Cc: *"Hal Finkel" <hfinkel at anl.gov>,
"Quentin Colombet" <
>>>>> qcolombet at apple.com>, "llvm-dev"
<llvm-dev at lists.llvm.org>, "Matthias
>>>>> Braun" <matze at braunis.de>
>>>>> *Sent: *Wednesday, May 11, 2016 3:15:03 AM
>>>>> *Subject: *Re: [GSoC 2016] Interprocedural Register
Allocation -
>>>>> Introduction and Feedback
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> *Vivek Pandya*
>>>>>
>>>>>
>>>>> On Wed, May 11, 2016 at 10:02 AM, vivek pandya
<vivekvpandya at gmail.com
>>>>> > wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> *Vivek Pandya*
>>>>>>
>>>>>>
>>>>>> On Wed, May 11, 2016 at 9:43 AM, Mehdi Amini
<mehdi.amini at apple.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> On May 10, 2016, at 6:06 PM, Hal Finkel <hfinkel
at anl.gov> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ------------------------------
>>>>>>>
>>>>>>> *From: *"vivek pandya" <vivekvpandya
at gmail.com>
>>>>>>> *To: *"llvm-dev" <llvm-dev at
lists.llvm.org>, "Tim Amini Golling" <
>>>>>>> mehdi.amini at apple.com>, "Hal
Finkel" <hfinkel at anl.gov>
>>>>>>> *Cc: *"Quentin Colombet" <qcolombet at
apple.com>
>>>>>>> *Sent: *Tuesday, May 10, 2016 2:59:16 PM
>>>>>>> *Subject: *[GSoC 2016] Interprocedural Register
Allocation -
>>>>>>> Introduction and Feedback
>>>>>>>
>>>>>>> Hello LLVM Community,
>>>>>>>
>>>>>>> Sorry for delay as I was busy in final exams.
>>>>>>>
>>>>>>> I am Vivek from India. Thanks for choosing my
proposal for
>>>>>>> Interprocedural Register Allocation (IPRA) in LLVM.
Mehdi Amini and Hal
>>>>>>> Finkel will be mentoring me for this project.
>>>>>>>
>>>>>>> IPRA can reduce code size and runtime of programs
by allocating
>>>>>>> register across the module and procedure
boundaries.
>>>>>>>
>>>>>>> I have identified some old but effective research
work on this area.
>>>>>>> I want community's feedback for feasibility of
these approach and I
>>>>>>> am targeting to implement two of them during this
project.
>>>>>>>
>>>>>>> Here is list of the papers, I have read first two
papers and I would
>>>>>>> like to discuss those approach first, I will read
other two paper then
>>>>>>> initiate discussion for them as well. All I want is
to find out a concrete
>>>>>>> implementation plan before 23 May, 2016 and for
that I need community's
>>>>>>> help.
>>>>>>>
>>>>>>> 1) Compile time ----- Minimizing register usage
penalty at procedure
>>>>>>> calls - http://dl.acm.org/citation.cfm?id=53999
>>>>>>>
====================================================================In
>>>>>>> this approach intra-procedural register allocation
is used as base but
>>>>>>> machine code generation order is bottom up
traversal of call graph and
>>>>>>> inter-procedural effect is achieved by propagating
register usage
>>>>>>> information of callee function to caller (i.e child
to parent in CallGraph)
>>>>>>> so that caller can use different registers than
callee and can save load
>>>>>>> store cost at procedure call, this is not trivial
as it seems due to
>>>>>>> recursive calls, library function usage etc. Also
for upper region of the
>>>>>>> graph in this technique available number of
registers might become zero in
>>>>>>> that case it should fall back to normal load store
at procedure call. Apart
>>>>>>> from these difficulties other difficulties have
been identified please
>>>>>>> follow this mail-chain
>>>>>>>
https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion
>>>>>>> My mentor has already provided me a patch that
alters code
>>>>>>> generation order as per bottom up call graph
traversal, I am working from
>>>>>>> that point now. Any other help/suggestion is always
welcomed.
>>>>>>>
>>>>>>> 2) Link time ----- Global register allocation at
link time -
>>>>>>> http://dl.acm.org/citation.cfm?id=989415
>>>>>>>
====================================================================In
>>>>>>> this particular approach (sort of true IPRA)
registers will be reallocated
>>>>>>> (this optimization will be optional if turned off
still code will be
>>>>>>> compiled as per intra-procedural allocation) at
link time. Here modules are
>>>>>>> first complied as per normal compilation but the
object code is annotated
>>>>>>> with details so that linker can build call graph
and also calculate usage
>>>>>>> information at link time. Compiler also write hints
in object code that if
>>>>>>> particular variable is allocated in some other
register ( due to new
>>>>>>> allocation) then how the code should be changed?
Thus linker can use these
>>>>>>> information to decide which variables (global) need
to be in same register
>>>>>>> through out the program execution and also
according to register usage
>>>>>>> information in call graph which procedure will not
be active simultaneously
>>>>>>> so that locals for that procedures can be in same
registers with out load
>>>>>>> store at procedure calls.
>>>>>>> For these particular method help me to analyze
feasibility:
>>>>>>> 1) Can llvm collects following information at
module level in
>>>>>>> MachineIR? list of procedures in module, list of
locals in procedures, list
>>>>>>> of procedures that a particular procedure can call,
and a list of the
>>>>>>> variables this procedure references. Each entry in
the last two lists
>>>>>>> includes an estimate of the number of times the
procedure is called or the
>>>>>>> variable is referenced in each execution of this
procedure
>>>>>>> 2) Can llvm write informative commands to object
files?
>>>>>>> 3) Can LTO is capable of leveraging those commands?
>>>>>>>
>>>>>>> In terms of scoping the project for the summer, I
definitely
>>>>>>> recommend that you focus on (1) first. If you
finish that, we can certainly
>>>>>>> move on to other things.
>>>>>>>
>>>>>>>
>>>>>>> I'll add +1 here, but I already wrote the same
thing on IRC when
>>>>>>> discussing with Vivek. True IPRA without a proper
MachineModule
>>>>>>> infrastructure won't be doable in my opinion
(even with such
>>>>>>> infrastructure, it may not be trivial in LLVM in
general).
>>>>>>>
>>>>>>> Regarding link time, note that any such a design
would likely look
>>>>>>> much different than in David Wall's paper
however, because our LTO
>>>>>>> re-codegens everything anyway. The paper says,
"Finally, it keeps us honest
>>>>>>> as designers of the system; once we postpone
anything until link time, the
>>>>>>> temptation is great to postpone everything,
..." - Well, we've long-since
>>>>>>> succumb to that temptation when we LTO. C'est
la vie.
>>>>>>>
>>>>>>>
>>>>>>> +1 as well, our LTO will benefit naturally from the
leaf-to-root
>>>>>>> information propagation. ThinLTO will be more
challenging/interesting
>>>>>>> though!
>>>>>>>
>>>>>>> For the first part a mechanism similar to
MachineModulePass would be
>>>>>>> desirable but that may not be possible during this
project, but if we can
>>>>>>> make some sort of smaller version of that to suit
our purpose.
>>>>>>>
>>>>>>> I don't think we need to make any kind of
MachineModulePass to make
>>>>>>> this work. Once we alter the visitation order based
on the CGSCC iteration
>>>>>>> scheme, we can keep state in-between functions in
the pre-existing hacky
>>>>>>> way (using static members of the relevant function
passes).
>>>>>>>
>>>>>>>  Sorry my mistake here by first part I mean 1)
requirement in the
>>>>>> link time approach.
>>>>>>
>>>>>>>
>>>>>>
>>>>>>> I also don't see where/why we need a
MachineModule(Pass) for the
>>>>>>> CGSCC scheme, that said I'd rather avoid using
a function pass with static
>>>>>>> members, if we can have a ModuleAnalysis that is
bookkeeping the results
>>>>>>> for functions in the module and queries by the
register allocator somehow.
>>>>>>> Matthias/Quentin may have other inputs on this
aspect.
>>>>>>>
>>>>>>
>>>>> @Hal do you mean to add a simple MachineFunction pass that
will just
>>>>> operate on register allocated function and prepare a
BitVector to indicate
>>>>> which register is being used by MachineFunction, and then
use this pass as
>>>>> analysis pass (i.e just simply return static BitVector for
clobbered
>>>>> register when register allocation for next function begins.
This part is
>>>>> not much clear to me) this thing can be done by scheduling
a pass post
>>>>> register allocation in lib/CodeGen/Passes.cpp
>>>>>
>>>>> void TargetPassConfig::addMachinePasses() {
>>>>> .
>>>>> .
>>>>> .
>>>>>   // Run pre-ra passes.
>>>>>   addPreRegAlloc();
>>>>>
>>>>>   // Run register allocation and passes that are tightly
coupled with
>>>>> it,
>>>>>   // including phi elimination and scheduling.
>>>>>   if (getOptimizeRegAlloc())
>>>>>     addOptimizedRegAlloc(createRegAllocPass(true));
>>>>>   else
>>>>>     addFastRegAlloc(createRegAllocPass(false));
>>>>>
>>>>>   // Run post-ra passes.
>>>>>   addPostRegAlloc();
>>>>> // Adding a new pass here which keeps register mask
information across
>>>>> function calls.
>>>>> .
>>>>> .
>>>>> .
>>>>> }
>>>>>
>>>>> But this also requires current register allocators to use
this
>>>>> information in someway because RegMaskBits in
LiveIntervalAnalysis.cpp is
>>>>> not static across calls. I mean I am not clear for how to
propagate static
>>>>> info to Intra-procedural Register allocators (if possible
without
>>>>> disturbing their code )
>>>>>
>>>>> First, my hope is that we won't need to change the
register
>>>>> allocators, as such, in order to make use of this
information. Instead,
>>>>> we'll simply be able to alter the register masks
generated for the call
>>>>> instructions. These masks will indicate fewer clobbers than
might otherwise
>>>>> be present based on the ABI because of information gathered
during the
>>>>> codegen of the callee. These masks are generally
constructed by target
>>>>> based on the calling convention. The PowerPC backend, for
example, looks
>>>>> like this:
>>>>>
>>>>>   // Add a register mask operand representing the
call-preserved
>>>>> registers.
>>>>>   const TargetRegisterInfo *TRI =
Subtarget.getRegisterInfo();
>>>>>   const uint32_t *Mask >>>>>      
TRI->getCallPreservedMask(DAG.getMachineFunction(), CallConv);
>>>>>   assert(Mask && "Missing call preserved mask
for calling
>>>>> convention");
>>>>>   Ops.push_back(DAG.getRegisterMask(Mask));
>>>>>
>>>>> but it can be more complicated. If you look for uses of
>>>>> 'getRegisterMask' in Target/*/*ISelLowering.cpp,
you'll see what I mean.
>>>>> Regardless, the code ends up calling some method is the
targets
>>>>> TargetRegisterInfo subclass. These methods generally look
something like
>>>>> this:
>>>>>
>>>>> const uint32_t *
>>>>> PPCRegisterInfo::getCallPreservedMask(const MachineFunction
&MF,
>>>>>                                       CallingConv::ID CC)
const {
>>>>>   const PPCSubtarget &Subtarget =
MF.getSubtarget<PPCSubtarget>();
>>>>>   ...
>>>>>   return TM.isPPC64() ? (Subtarget.hasAltivec() ?
>>>>> CSR_SVR464_Altivec_RegMask
>>>>>                                                 :
CSR_SVR464_RegMask)
>>>>>                       : (Subtarget.hasAltivec() ?
>>>>> CSR_SVR432_Altivec_RegMask
>>>>>                                                 :
CSR_SVR432_RegMask);
>>>>> }
>>>>>
>>>>> In any case, the fundamental idea here is that, when
someone calls
>>>>> getCallPreservedMask in order to set the regmask on a call,
we might not
>>>>> have to use the CC at all. Instead, if we've already
codegened the
>>>>> function, we might use a cache of 'exact' register
masks computed during
>>>>> codegen of the potential callees instead.
>>>>>
>>>> I am thinking to add a simple Immutable pass
MachineRegisterUsageInfo
>>>> similar to MachineBranchProbabilityInfo that can maintain
>>>> RegisterUsageInformation per function. Can it be simply done by
using
>>>> UsedPhysRegMask from MachineRegisterInfo ??
>>>>
>>>>
>>>> No, like the comment said, UsedPhysRegMask gives only the
registers
>>>> clobbered by calls:
>>>> // This bit vector represents all the registers clobbered by
function
>>>> calls.
>>>>
>>>> You want to build this information yourself on top of
>>>> MachineRegisterInfo::isPhysRegModified
>>>>
>>> Ok but then the time complexity will be O(n) n = number of physical
>>> register on the target. Am I going correct?
>>>
>>>
>>> Yes, this is correct.
>>>
>>>
>>>> Here getCallPreservedMask will call API provided by
>>>> MachineRegisterUsageInfo to avail the exact register mask but
how it can
>>>> know that the function is already codegen or it will query each
time when
>>>> getCallPreservedMask is called and of available
MachineRegisterUsageInfo
>>>> will return the details otherwise simply return NULL.
>>>> So changes will be now in TargetRegisterInfo implementation for
each
>>>> target right ??
>>>>
>>>>
>>>>> In order to do this, I think we'll need to provide a
function callable
>>>>> from the target's getCallPreservedMask implementation,
which can return
>>>>> such an 'exact' regmask when available. I think we
need to do it this way
>>>>> for two reasons:
>>>>>
>>>>>  1. Not all of the target code calls getCallPreservedMask,
but
>>>>> sometimes calls other similar target-specific functions
(e.g.
>>>>> getTLSCallPreservedMask).
>>>>>  2. The targets need to opt-in to this behavior because
only the
>>>>> target can know that all register uses are really tagged
correctly post
>>>>> "pre-emit".
>>>>>
>>>>> Because the target is free to introduce uses of registers
at
>>>>> essentially any time, we need to do the scanning for used
registers after
>>>>> the "pre-emit" passes run. This can be done by
scheduling some simple
>>>>> register-use scanning pass after the call to addPreEmitPass
in
>>>>> lib/CodeGen/Passes.cpp.
>>>>>
>>>>>
>>>>> I think this also applies in someway to Mehdi Amini's
idea to keep a
>>>>> ModulePass for bookkeeping but then existing register
allocators will be
>>>>> required to change so that the code can query the
ModulePass for
>>>>> RegMaskBits for particular function.
>>>>>
>>>>> I think that the simplest way to do this is to create an
immutable
>>>>> analysis pass (e.g. BasicAA) that keeps the cache of the
computed register
>>>>> masks. This is somewhat similar in spirit to how the
'AssumptionCache'
>>>>> analysis works at the IR level. This analysis can then be
created by
>>>>> lib/CodeGen/Passes.cpp early, and then queried and passed
around later by
>>>>> the CodeGen/Target code. Because it is an immutable
analysis, it won't get
>>>>> destroyed until the very end, which is also important
because, I imagine,
>>>>> it will need to own the memory associated with the
generated register masks.
>>>>>
>>>>>  -Hal
>>>>>
>>>>>
>>>>> Vivek
>>>>>
>>>>>
>>>>>>> Yes for propagating register usage approach we
don't need
>>>>>> MachineModulePass
>>>>>>
>>>>>> Vivek
>>>>>>
>>>>>>> --
>>>>>>> Mehdi
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Assistant Computational Scientist
>>>>> Leadership Computing Facility
>>>>> Argonne National Laboratory
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>
>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160525/0182e15d/attachment-0001.html>

vivek pandya via llvm-dev

2016-May-25 04:17 UTC

head link

[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback

Dear Mentors,

Please help me to understand our plan to implement Interprocedural Register
allocator by propogating register usage info. While writing this mail I am
considering all previous discussion over llvm-dev and IRC.

1) A MachineFunction pass to be executed POST-RA to collect the information
about the used Registers.
2) An Immutable pass which will store reg usage info collected by previous
pass and return it whenever queried.
3) A Target specific MachineFucntion pass that will use the register usage
info for available for call instrction to achive IPRA. This pass should run
at PRE-RA.

Relation among above passes:

1) pass will store info to 2) pass as well use info for call instruction
found while processing.

3) pass only requires to query information from 2) pass.


Questions
========
 Which pass is responsible for load/store of callee saved register, at the
begining of each function call? And how does it uses RegMask of call
instruction to generate load/store. I think Intra-procedural register
allocator is not responsible to generate load/store around the call site.

/- - -> (A) - - -> (D)
/
(K)- - ->(T)- - -> (B) - - -> (E)
\
\- - -> (C) - - -> (F)

So as per our discussion we would require following passes:

Suppose in given example call graph , register allocation for D is
completed now we have that information available So 3) pass while
processing A , it would collect reg usage info for all callees and OR them
and then it should update A's regmask by going to parant procedure that
actually calls A ?? How reg mask details of call D would be used by
Register allocator while allocating register for A and also not generating
load/store for register being used by A in body of D as we have callee
saved convention.

How the pass responsible for generating load/store will optimize for the
child node of call graph where it does not require to load/store because
caller will not use register used by callee ? I mean how our IPRA will take
care of this?

In short I am not much clear with the method for using information to get
effect of IPRA without modifying Register allocator them self(i.e by
updating regMask of call instructions).

Also 1) pass and 3) pass are seem to intersecting for their work, for
example consider while scanning register usage info for T function the
final register usage info should be <all regs used by T> OR < reg usage
info A > OR <reg usage info B > OR < reg usage info C > because K
should
not use any register which is used by T, A ,D, B, C, E, F with out
load/store the relevant paper also discuss this situation and suggest to
fall back to load/store approach. So as we move to upper region of the call
graph it is very likely that enough regiseters are not there to allocate.

Please bear with my silly questions.

Sincerely,
Vivek


On Wed, May 25, 2016 at 8:46 AM, vivek pandya <vivekvpandya at gmail.com>
wrote:
>
>
> On Wed, May 25, 2016 at 8:44 AM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>>
>> ------------------------------
>>
>> *From: *"vivek pandya" <vivekvpandya at gmail.com>
>> *To: *"Hal Finkel" <hfinkel at anl.gov>
>> *Cc: *"llvm-dev" <llvm-dev at lists.llvm.org>,
"Matthias Braun" <
>> matze at braunis.de>, "Mehdi Amini" <mehdi.amini at
apple.com>, "Quentin
>> Colombet" <qcolombet at apple.com>
>> *Sent: *Tuesday, May 24, 2016 9:34:29 PM
>>
>> *Subject: *Re: [GSoC 2016] Interprocedural Register Allocation -
>> Introduction and Feedback
>>
>>
>>
>> On Wed, May 25, 2016 at 3:53 AM, Hal Finkel <hfinkel at anl.gov>
wrote:
>>
>>>
>>> ------------------------------
>>>
>>> *From: *"vivek pandya" <vivekvpandya at gmail.com>
>>> *To: *"Quentin Colombet" <qcolombet at apple.com>
>>> *Cc: *"Hal Finkel" <hfinkel at anl.gov>,
"llvm-dev" <
>>> llvm-dev at lists.llvm.org>, "Matthias Braun"
<matze at braunis.de>, "Mehdi
>>> Amini" <mehdi.amini at apple.com>
>>> *Sent: *Tuesday, May 24, 2016 1:00:58 PM
>>> *Subject: *Re: [GSoC 2016] Interprocedural Register Allocation -
>>> Introduction and Feedback
>>>
>>> Hello,
>>>
>>> I have written following code to check each register if it is used
by
>>> machineFunction or not :
>>>
>>> MachineRegisterInfo *MRI = &MF.getRegInfo();
>>> TargetRegisterInfo *TRI = (TargetRegisterInfo
>>> *)MF.getSubtarget().getRegisterInfo();
>>>
>>> Some reason you can't use a const pointer here?
>>>
>> MCRegisterInfo is just used to get conventional name of register for
>> given target like AX, BX on X86.
>>
>>>
>>> const TargetMachine &TM = MF.getTarget();
>>> const MCRegisterInfo *MCRI = TM.getMCRegisterInfo();
>>> DEBUG(dbgs() << "Function Name : " <<
MF.getName() << "\n");
>>>
>>> for(TargetRegisterInfo::regclass_iterator i =
(*TRI).regclass_begin(), e
>>> = (*TRI).regclass_end(); i != e; i++ ) {
>>> for(TargetRegisterClass::iterator pregi = (*i)->begin(), prege
>>> (*i)->end(); pregi != prege; pregi++ ) {
>>> DEBUG( dbgs() << "Physical Register : " <<
MCRI->getName(*pregi) << " is
>>> modified "<< MRI->isPhysRegModified(*pregi) <<
" \n");
>>>
>>> Try isPhysRegUsed.
>>>
>> ok
>>
>>>
>>> }
>>> }
>>> DEBUG(dbgs() << "\n");
>>>
>>> The pass which is executing this code is schedule POST-RA stage but
this
>>> gives me true for all registers i.e in each function all registers
are
>>> being used except EBP and some other similar, Is this a correct way
to get
>>> register usage information ? I think I have made some mistake
please help.
>>>
>>>
>>> You might look at the implementation of these functions in
>>> lib/CodeGen/MachineRegisterInfo.cpp and figure out if they're
returning
>>> true because UsedPhysRegMask.test(PhysReg) is true or because
>>> reg_nodbg_empty(*AliasReg) is true.
>>>
>> Yes that helped now I am getting actual register which have been used
by
>> given function, but a little problem
>> The updated code is as shown below :
>> for(TargetRegisterInfo::regclass_iterator i = (*TRI).regclass_begin(),
e
>> = (*TRI).regclass_end(); i != e; i++ ) {
>> for(TargetRegisterClass::iterator pregi = (*i)->begin(), prege
>> (*i)->end(); pregi != prege; pregi++ ) {
>> for (MCRegAliasIterator AliasReg(*pregi, TRI, true);
AliasReg.isValid();
>> ++AliasReg) {
>>    if (!MRI->reg_nodbg_empty(*AliasReg)) {
>>     DEBUG( dbgs() << "Physical Register : " <<
MCRI->getName(*pregi) <<
>> " is used "<< MRI->isPhysRegUsed(*pregi) <<
" \n");
>>     break; // no need to process more alias
>>    }
>>   }
>> }
>> }
>> But here some registers are getting processed with in different classes
>> (unnecessary processing) Is this only way to iterate through all used
>> register (using RegClass iterator) ? Is there any way to avoid
duplicate
>> regs?
>> Of course currently I am just printing but next I am thinking to use a
>> map to track usage info , in that only distinct register info will be
>> stored but still due to loop structure I need to iterate through a
single
>> register 3 - 4 times making it time consuming.
>>
>> Yes, I believe you can just do:
>>
>>   for (unsigned Reg = 0; Reg < TRI->getNumRegs(); ++Reg) {
>>
> Oh yes thanks I just forgot that PhyReg starts at 0.
>
>>
>>
>>  -Hal
>>
>>
>> -Vivek
>>
>>>
>>>
>>>  -Hal
>>>
>>>
>>>
>>> Vivek
>>>
>>> On Wed, May 18, 2016 at 11:42 PM, Quentin Colombet <qcolombet at
apple.com>
>>> wrote:
>>>
>>>>
>>>> On May 18, 2016, at 11:00 AM, vivek pandya <vivekvpandya at
gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> *Vivek Pandya*
>>>>
>>>>
>>>> On Wed, May 18, 2016 at 11:25 PM, Quentin Colombet
<qcolombet at apple.com
>>>> > wrote:
>>>>
>>>>>
>>>>> On May 18, 2016, at 10:46 AM, vivek pandya <vivekvpandya
at gmail.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> *Vivek Pandya*
>>>>>
>>>>>
>>>>> On Wed, May 11, 2016 at 4:01 PM, Hal Finkel <hfinkel at
anl.gov> wrote:
>>>>>
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> *From: *"vivek pandya" <vivekvpandya at
gmail.com>
>>>>>> *To: *"Mehdi Amini" <mehdi.amini at
apple.com>
>>>>>> *Cc: *"Hal Finkel" <hfinkel at
anl.gov>, "Quentin Colombet" <
>>>>>> qcolombet at apple.com>, "llvm-dev"
<llvm-dev at lists.llvm.org>,
>>>>>> "Matthias Braun" <matze at braunis.de>
>>>>>> *Sent: *Wednesday, May 11, 2016 3:15:03 AM
>>>>>> *Subject: *Re: [GSoC 2016] Interprocedural Register
Allocation -
>>>>>> Introduction and Feedback
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Vivek Pandya*
>>>>>>
>>>>>>
>>>>>> On Wed, May 11, 2016 at 10:02 AM, vivek pandya <
>>>>>> vivekvpandya at gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Vivek Pandya*
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 11, 2016 at 9:43 AM, Mehdi Amini
<mehdi.amini at apple.com>
>>>>>>>  wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On May 10, 2016, at 6:06 PM, Hal Finkel
<hfinkel at anl.gov> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ------------------------------
>>>>>>>>
>>>>>>>> *From: *"vivek pandya"
<vivekvpandya at gmail.com>
>>>>>>>> *To: *"llvm-dev" <llvm-dev at
lists.llvm.org>, "Tim Amini Golling" <
>>>>>>>> mehdi.amini at apple.com>, "Hal
Finkel" <hfinkel at anl.gov>
>>>>>>>> *Cc: *"Quentin Colombet"
<qcolombet at apple.com>
>>>>>>>> *Sent: *Tuesday, May 10, 2016 2:59:16 PM
>>>>>>>> *Subject: *[GSoC 2016] Interprocedural Register
Allocation -
>>>>>>>> Introduction and Feedback
>>>>>>>>
>>>>>>>> Hello LLVM Community,
>>>>>>>>
>>>>>>>> Sorry for delay as I was busy in final exams.
>>>>>>>>
>>>>>>>> I am Vivek from India. Thanks for choosing my
proposal for
>>>>>>>> Interprocedural Register Allocation (IPRA) in
LLVM. Mehdi Amini and Hal
>>>>>>>> Finkel will be mentoring me for this project.
>>>>>>>>
>>>>>>>> IPRA can reduce code size and runtime of
programs by allocating
>>>>>>>> register across the module and procedure
boundaries.
>>>>>>>>
>>>>>>>> I have identified some old but effective
research work on this area.
>>>>>>>> I want community's feedback for feasibility
of these approach and I
>>>>>>>> am targeting to implement two of them during
this project.
>>>>>>>>
>>>>>>>> Here is list of the papers, I have read first
two papers and I
>>>>>>>> would like to discuss those approach first, I
will read other two paper
>>>>>>>> then initiate discussion for them as well. All
I want is to find out a
>>>>>>>> concrete implementation plan before 23 May,
2016 and for that I need
>>>>>>>> community's help.
>>>>>>>>
>>>>>>>> 1) Compile time ----- Minimizing register usage
penalty at
>>>>>>>> procedure calls -
http://dl.acm.org/citation.cfm?id=53999
>>>>>>>>
====================================================================In
>>>>>>>> this approach intra-procedural register
allocation is used as base but
>>>>>>>> machine code generation order is bottom up
traversal of call graph and
>>>>>>>> inter-procedural effect is achieved by
propagating register usage
>>>>>>>> information of callee function to caller (i.e
child to parent in CallGraph)
>>>>>>>> so that caller can use different registers than
callee and can save load
>>>>>>>> store cost at procedure call, this is not
trivial as it seems due to
>>>>>>>> recursive calls, library function usage etc.
Also for upper region of the
>>>>>>>> graph in this technique available number of
registers might become zero in
>>>>>>>> that case it should fall back to normal load
store at procedure call. Apart
>>>>>>>> from these difficulties other difficulties have
been identified please
>>>>>>>> follow this mail-chain
>>>>>>>>
https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion
>>>>>>>> My mentor has already provided me a patch that
alters code
>>>>>>>> generation order as per bottom up call graph
traversal, I am working from
>>>>>>>> that point now. Any other help/suggestion is
always welcomed.
>>>>>>>>
>>>>>>>> 2) Link time ----- Global register allocation
at link time -
>>>>>>>> http://dl.acm.org/citation.cfm?id=989415
>>>>>>>>
====================================================================In
>>>>>>>> this particular approach (sort of true IPRA)
registers will be reallocated
>>>>>>>> (this optimization will be optional if turned
off still code will be
>>>>>>>> compiled as per intra-procedural allocation) at
link time. Here modules are
>>>>>>>> first complied as per normal compilation but
the object code is annotated
>>>>>>>> with details so that linker can build call
graph and also calculate usage
>>>>>>>> information at link time. Compiler also write
hints in object code that if
>>>>>>>> particular variable is allocated in some other
register ( due to new
>>>>>>>> allocation) then how the code should be
changed? Thus linker can use these
>>>>>>>> information to decide which variables (global)
need to be in same register
>>>>>>>> through out the program execution and also
according to register usage
>>>>>>>> information in call graph which procedure will
not be active simultaneously
>>>>>>>> so that locals for that procedures can be in
same registers with out load
>>>>>>>> store at procedure calls.
>>>>>>>> For these particular method help me to analyze
feasibility:
>>>>>>>> 1) Can llvm collects following information at
module level in
>>>>>>>> MachineIR? list of procedures in module, list
of locals in procedures, list
>>>>>>>> of procedures that a particular procedure can
call, and a list of the
>>>>>>>> variables this procedure references. Each entry
in the last two lists
>>>>>>>> includes an estimate of the number of times the
procedure is called or the
>>>>>>>> variable is referenced in each execution of
this procedure
>>>>>>>> 2) Can llvm write informative commands to
object files?
>>>>>>>> 3) Can LTO is capable of leveraging those
commands?
>>>>>>>>
>>>>>>>> In terms of scoping the project for the summer,
I definitely
>>>>>>>> recommend that you focus on (1) first. If you
finish that, we can certainly
>>>>>>>> move on to other things.
>>>>>>>>
>>>>>>>>
>>>>>>>> I'll add +1 here, but I already wrote the
same thing on IRC when
>>>>>>>> discussing with Vivek. True IPRA without a
proper MachineModule
>>>>>>>> infrastructure won't be doable in my
opinion (even with such
>>>>>>>> infrastructure, it may not be trivial in LLVM
in general).
>>>>>>>>
>>>>>>>> Regarding link time, note that any such a
design would likely look
>>>>>>>> much different than in David Wall's paper
however, because our LTO
>>>>>>>> re-codegens everything anyway. The paper says,
"Finally, it keeps us honest
>>>>>>>> as designers of the system; once we postpone
anything until link time, the
>>>>>>>> temptation is great to postpone everything,
..." - Well, we've long-since
>>>>>>>> succumb to that temptation when we LTO.
C'est la vie.
>>>>>>>>
>>>>>>>>
>>>>>>>> +1 as well, our LTO will benefit naturally from
the leaf-to-root
>>>>>>>> information propagation. ThinLTO will be more
challenging/interesting
>>>>>>>> though!
>>>>>>>>
>>>>>>>> For the first part a mechanism similar to
MachineModulePass would
>>>>>>>> be desirable but that may not be possible
during this project, but if we
>>>>>>>> can make some sort of smaller version of that
to suit our purpose.
>>>>>>>>
>>>>>>>> I don't think we need to make any kind of
MachineModulePass to make
>>>>>>>> this work. Once we alter the visitation order
based on the CGSCC iteration
>>>>>>>> scheme, we can keep state in-between functions
in the pre-existing hacky
>>>>>>>> way (using static members of the relevant
function passes).
>>>>>>>>
>>>>>>>>  Sorry my mistake here by first part I mean 1)
requirement in the
>>>>>>> link time approach.
>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>> I also don't see where/why we need a
MachineModule(Pass) for the
>>>>>>>> CGSCC scheme, that said I'd rather avoid
using a function pass with static
>>>>>>>> members, if we can have a ModuleAnalysis that
is bookkeeping the results
>>>>>>>> for functions in the module and queries by the
register allocator somehow.
>>>>>>>> Matthias/Quentin may have other inputs on this
aspect.
>>>>>>>>
>>>>>>>
>>>>>> @Hal do you mean to add a simple MachineFunction pass
that will just
>>>>>> operate on register allocated function and prepare a
BitVector to indicate
>>>>>> which register is being used by MachineFunction, and
then use this pass as
>>>>>> analysis pass (i.e just simply return static BitVector
for clobbered
>>>>>> register when register allocation for next function
begins. This part is
>>>>>> not much clear to me) this thing can be done by
scheduling a pass post
>>>>>> register allocation in lib/CodeGen/Passes.cpp
>>>>>>
>>>>>> void TargetPassConfig::addMachinePasses() {
>>>>>> .
>>>>>> .
>>>>>> .
>>>>>>   // Run pre-ra passes.
>>>>>>   addPreRegAlloc();
>>>>>>
>>>>>>   // Run register allocation and passes that are
tightly coupled
>>>>>> with it,
>>>>>>   // including phi elimination and scheduling.
>>>>>>   if (getOptimizeRegAlloc())
>>>>>>     addOptimizedRegAlloc(createRegAllocPass(true));
>>>>>>   else
>>>>>>     addFastRegAlloc(createRegAllocPass(false));
>>>>>>
>>>>>>   // Run post-ra passes.
>>>>>>   addPostRegAlloc();
>>>>>> // Adding a new pass here which keeps register mask
information
>>>>>> across function calls.
>>>>>> .
>>>>>> .
>>>>>> .
>>>>>> }
>>>>>>
>>>>>> But this also requires current register allocators to
use this
>>>>>> information in someway because RegMaskBits in
LiveIntervalAnalysis.cpp is
>>>>>> not static across calls. I mean I am not clear for how
to propagate static
>>>>>> info to Intra-procedural Register allocators (if
possible without
>>>>>> disturbing their code )
>>>>>>
>>>>>> First, my hope is that we won't need to change the
register
>>>>>> allocators, as such, in order to make use of this
information. Instead,
>>>>>> we'll simply be able to alter the register masks
generated for the call
>>>>>> instructions. These masks will indicate fewer clobbers
than might otherwise
>>>>>> be present based on the ABI because of information
gathered during the
>>>>>> codegen of the callee. These masks are generally
constructed by target
>>>>>> based on the calling convention. The PowerPC backend,
for example, looks
>>>>>> like this:
>>>>>>
>>>>>>   // Add a register mask operand representing the
call-preserved
>>>>>> registers.
>>>>>>   const TargetRegisterInfo *TRI =
Subtarget.getRegisterInfo();
>>>>>>   const uint32_t *Mask >>>>>>      
TRI->getCallPreservedMask(DAG.getMachineFunction(), CallConv);
>>>>>>   assert(Mask && "Missing call preserved
mask for calling
>>>>>> convention");
>>>>>>   Ops.push_back(DAG.getRegisterMask(Mask));
>>>>>>
>>>>>> but it can be more complicated. If you look for uses of
>>>>>> 'getRegisterMask' in
Target/*/*ISelLowering.cpp, you'll see what I mean.
>>>>>> Regardless, the code ends up calling some method is the
targets
>>>>>> TargetRegisterInfo subclass. These methods generally
look something like
>>>>>> this:
>>>>>>
>>>>>> const uint32_t *
>>>>>> PPCRegisterInfo::getCallPreservedMask(const
MachineFunction &MF,
>>>>>>                                       CallingConv::ID
CC) const {
>>>>>>   const PPCSubtarget &Subtarget =
MF.getSubtarget<PPCSubtarget>();
>>>>>>   ...
>>>>>>   return TM.isPPC64() ? (Subtarget.hasAltivec() ?
>>>>>> CSR_SVR464_Altivec_RegMask
>>>>>>                                                 :
CSR_SVR464_RegMask)
>>>>>>                       : (Subtarget.hasAltivec() ?
>>>>>> CSR_SVR432_Altivec_RegMask
>>>>>>                                                 :
>>>>>> CSR_SVR432_RegMask);
>>>>>> }
>>>>>>
>>>>>> In any case, the fundamental idea here is that, when
someone calls
>>>>>> getCallPreservedMask in order to set the regmask on a
call, we might not
>>>>>> have to use the CC at all. Instead, if we've
already codegened the
>>>>>> function, we might use a cache of 'exact'
register masks computed during
>>>>>> codegen of the potential callees instead.
>>>>>>
>>>>> I am thinking to add a simple Immutable pass
MachineRegisterUsageInfo
>>>>> similar to MachineBranchProbabilityInfo that can maintain
>>>>> RegisterUsageInformation per function. Can it be simply
done by using
>>>>> UsedPhysRegMask from MachineRegisterInfo ??
>>>>>
>>>>>
>>>>> No, like the comment said, UsedPhysRegMask gives only the
registers
>>>>> clobbered by calls:
>>>>> // This bit vector represents all the registers clobbered
by function
>>>>> calls.
>>>>>
>>>>> You want to build this information yourself on top of
>>>>> MachineRegisterInfo::isPhysRegModified
>>>>>
>>>> Ok but then the time complexity will be O(n) n = number of
physical
>>>> register on the target. Am I going correct?
>>>>
>>>>
>>>> Yes, this is correct.
>>>>
>>>>
>>>>> Here getCallPreservedMask will call API provided by
>>>>> MachineRegisterUsageInfo to avail the exact register mask
but how it can
>>>>> know that the function is already codegen or it will query
each time when
>>>>> getCallPreservedMask is called and of available
MachineRegisterUsageInfo
>>>>> will return the details otherwise simply return NULL.
>>>>> So changes will be now in TargetRegisterInfo implementation
for each
>>>>> target right ??
>>>>>
>>>>>
>>>>>> In order to do this, I think we'll need to provide
a function
>>>>>> callable from the target's getCallPreservedMask
implementation, which can
>>>>>> return such an 'exact' regmask when available.
I think we need to do it
>>>>>> this way for two reasons:
>>>>>>
>>>>>>  1. Not all of the target code calls
getCallPreservedMask, but
>>>>>> sometimes calls other similar target-specific functions
(e.g.
>>>>>> getTLSCallPreservedMask).
>>>>>>  2. The targets need to opt-in to this behavior because
only the
>>>>>> target can know that all register uses are really
tagged correctly post
>>>>>> "pre-emit".
>>>>>>
>>>>>> Because the target is free to introduce uses of
registers at
>>>>>> essentially any time, we need to do the scanning for
used registers after
>>>>>> the "pre-emit" passes run. This can be done
by scheduling some simple
>>>>>> register-use scanning pass after the call to
addPreEmitPass in
>>>>>> lib/CodeGen/Passes.cpp.
>>>>>>
>>>>>>
>>>>>> I think this also applies in someway to Mehdi
Amini's idea to keep a
>>>>>> ModulePass for bookkeeping but then existing register
allocators will be
>>>>>> required to change so that the code can query the
ModulePass for
>>>>>> RegMaskBits for particular function.
>>>>>>
>>>>>> I think that the simplest way to do this is to create
an immutable
>>>>>> analysis pass (e.g. BasicAA) that keeps the cache of
the computed register
>>>>>> masks. This is somewhat similar in spirit to how the
'AssumptionCache'
>>>>>> analysis works at the IR level. This analysis can then
be created by
>>>>>> lib/CodeGen/Passes.cpp early, and then queried and
passed around later by
>>>>>> the CodeGen/Target code. Because it is an immutable
analysis, it won't get
>>>>>> destroyed until the very end, which is also important
because, I imagine,
>>>>>> it will need to own the memory associated with the
generated register masks.
>>>>>>
>>>>>>  -Hal
>>>>>>
>>>>>>
>>>>>> Vivek
>>>>>>
>>>>>>
>>>>>>>> Yes for propagating register usage approach we
don't need
>>>>>>> MachineModulePass
>>>>>>>
>>>>>>> Vivek
>>>>>>>
>>>>>>>> --
>>>>>>>> Mehdi
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Hal Finkel
>>>>>> Assistant Computational Scientist
>>>>>> Leadership Computing Facility
>>>>>> Argonne National Laboratory
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>
>>
>>
>>
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160525/771dc669/attachment.html>

Mehdi Amini via llvm-dev

2016-May-25 04:38 UTC

head link

[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback

> On May 24, 2016, at 9:17 PM, vivek pandya <vivekvpandya at gmail.com>
wrote:
> 
> Dear Mentors,
> 
> Please help me to understand our plan to implement Interprocedural Register
allocator by propogating register usage info. While writing this mail I am
considering all previous discussion over llvm-dev and IRC.
> 
> 1) A MachineFunction pass to be executed POST-RA to collect the information
about the used Registers.
> 2) An Immutable pass which will store reg usage info collected by previous
pass and return it whenever queried.
> 3) A Target specific MachineFucntion pass that will use the register usage
info for available for call instrction to 	achive IPRA. This pass should run at
PRE-RA.
> 
> Relation among above passes:
> 
> 1) pass will store info to 2) pass as well use info for call instruction
found while processing.
> 
> 3) pass only requires to query information from 2) pass.
> 
> 
> Questions
> ========> 
>  Which pass is responsible for load/store of callee saved register, at the
begining of each function call? And how does it uses RegMask of call instruction
to generate load/store. I think Intra-procedural register allocator is not
responsible to generate load/store around the call site.
> 
> 				 /- - -> (A) - - -> (D)
> 				/
> 	(K)- - ->(T)- - -> (B) - - -> (E)
> 				\
> 				 \- - -> (C) - - -> (F)
> 
> So as per our discussion we would require following passes:
> 
> Suppose in given example call graph , register allocation for D is
completed now we have that information available So 3) pass while processing A ,
it would collect reg usage info for all callees and OR them and then it should
update A's regmask by going to parant procedure that actually calls A ??
No, Pass 3) is only looking for every call MI in A and updating the associated
regmask by replacing it with the information stored in the immutable pass.

> How reg mask details of call D would be used by Register allocator while
allocating register for A and also not generating load/store for register being
used by A in body of D as we have callee saved convention.
I expect all of that to be handled automatically when updating the regmask.

> 
> How the pass responsible for generating load/store will optimize for the
child node of call graph where it does not require to load/store because caller
will not use register used by callee ? I mean how our IPRA will take care of
this?
I don't understand that.

> 
> In short I am not much clear with the method for using information to get
effect of IPRA without modifying Register allocator them self(i.e by updating
regMask of call instructions).
> 
> Also 1) pass and 3) pass are seem to intersecting for their work, for
example consider while scanning register usage info for T function the final
register usage info should be <all regs used by T> OR < reg usage info
A > OR <reg usage info B > OR < reg usage info C > because K
should not use any register which is used by T, A ,D, B, C, E, F with out
load/store the relevant paper also discuss this situation and suggest to fall
back to load/store approach. So as we move to upper region of the call graph it
is very likely that enough regiseters are not there to allocate.
The calling convention (for anything else than internal function) will always
have some callee-saved registers.
If we have a deep call-graph of internal-only function, then we may generate a
lot of spill at the top. Somehow we may have to think about driving some
heuristic with PGO.
I suspect we can think about that a bit later. Let's focus on the simple for
now.

-- 
Mehdi


> 
> Please bear with my silly questions.
> 
> Sincerely,
> Vivek
> 
> 
> On Wed, May 25, 2016 at 8:46 AM, vivek pandya <vivekvpandya at gmail.com
<mailto:vivekvpandya at gmail.com>> wrote:
> 
> 
> On Wed, May 25, 2016 at 8:44 AM, Hal Finkel <hfinkel at anl.gov
<mailto:hfinkel at anl.gov>> wrote:
> 
> From: "vivek pandya" <vivekvpandya at gmail.com
<mailto:vivekvpandya at gmail.com>>
> To: "Hal Finkel" <hfinkel at anl.gov <mailto:hfinkel at
anl.gov>>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org <mailto:llvm-dev
at lists.llvm.org>>, "Matthias Braun" <matze at braunis.de
<mailto:matze at braunis.de>>, "Mehdi Amini" <mehdi.amini
at apple.com <mailto:mehdi.amini at apple.com>>, "Quentin
Colombet" <qcolombet at apple.com <mailto:qcolombet at
apple.com>>
> Sent: Tuesday, May 24, 2016 9:34:29 PM
> 
> Subject: Re: [GSoC 2016] Interprocedural Register Allocation - Introduction
and Feedback
> 
> 
> 
> On Wed, May 25, 2016 at 3:53 AM, Hal Finkel <hfinkel at anl.gov
<mailto:hfinkel at anl.gov>> wrote:
> 
> From: "vivek pandya" <vivekvpandya at gmail.com
<mailto:vivekvpandya at gmail.com>>
> To: "Quentin Colombet" <qcolombet at apple.com
<mailto:qcolombet at apple.com>>
> Cc: "Hal Finkel" <hfinkel at anl.gov <mailto:hfinkel at
anl.gov>>, "llvm-dev" <llvm-dev at lists.llvm.org
<mailto:llvm-dev at lists.llvm.org>>, "Matthias Braun"
<matze at braunis.de <mailto:matze at braunis.de>>, "Mehdi
Amini" <mehdi.amini at apple.com <mailto:mehdi.amini at
apple.com>>
> Sent: Tuesday, May 24, 2016 1:00:58 PM
> Subject: Re: [GSoC 2016] Interprocedural Register Allocation - Introduction
and Feedback
> 
> Hello,
> 
> I have written following code to check each register if it is used by
machineFunction or not :
> 
> MachineRegisterInfo *MRI = &MF.getRegInfo();
> 	TargetRegisterInfo *TRI = (TargetRegisterInfo
*)MF.getSubtarget().getRegisterInfo();
> Some reason you can't use a const pointer here?
> MCRegisterInfo is just used to get conventional name of register for given
target like AX, BX on X86.
> 
> 	const TargetMachine &TM = MF.getTarget();
> 	const MCRegisterInfo *MCRI = TM.getMCRegisterInfo();
> 	DEBUG(dbgs() << "Function Name : " << MF.getName()
<< "\n");
> 
> 	for(TargetRegisterInfo::regclass_iterator i = (*TRI).regclass_begin(), e =
(*TRI).regclass_end(); i != e; i++ ) {
> 		for(TargetRegisterClass::iterator pregi = (*i)->begin(), prege =
(*i)->end(); pregi != prege; pregi++ ) {
> 			DEBUG( dbgs() << "Physical Register : " <<
MCRI->getName(*pregi) << " is modified "<<
MRI->isPhysRegModified(*pregi) << " \n");
> Try isPhysRegUsed.
> ok 
> 
> 		}
> 	}
> 	DEBUG(dbgs() << "\n");
> 
> The pass which is executing this code is schedule POST-RA stage but this
gives me true for all registers i.e in each function all registers are being
used except EBP and some other similar, Is this a correct way to get register
usage information ? I think I have made some mistake please help.
> 
> You might look at the implementation of these functions in
lib/CodeGen/MachineRegisterInfo.cpp and figure out if they're returning true
because UsedPhysRegMask.test(PhysReg) is true or because
reg_nodbg_empty(*AliasReg) is true.
> Yes that helped now I am getting actual register which have been used by
given function, but a little problem
> The updated code is as shown below : 
> for(TargetRegisterInfo::regclass_iterator i = (*TRI).regclass_begin(), e =
(*TRI).regclass_end(); i != e; i++ ) {
> 		for(TargetRegisterClass::iterator pregi = (*i)->begin(), prege =
(*i)->end(); pregi != prege; pregi++ ) {
> 			for (MCRegAliasIterator AliasReg(*pregi, TRI, true); AliasReg.isValid();
++AliasReg) {
> 			    if (!MRI->reg_nodbg_empty(*AliasReg)) {
> 			    	DEBUG( dbgs() << "Physical Register : " <<
MCRI->getName(*pregi) << " is used "<<
MRI->isPhysRegUsed(*pregi) << " \n");
> 			    	break; // no need to process more alias
> 			    }
>  			 }	
> 		}
> 	} 
> But here some registers are getting processed with in different classes
(unnecessary processing) Is this only way to iterate through all used register
(using RegClass iterator) ? Is there any way to avoid duplicate regs?
> Of course currently I am just printing but next I am thinking to use a map
to track usage info , in that only distinct register info will be stored but
still due to loop structure I need to iterate through a single register 3 - 4
times making it time consuming.
> Yes, I believe you can just do:
> 
>   for (unsigned Reg = 0; Reg < TRI->getNumRegs(); ++Reg) {
> Oh yes thanks I just forgot that PhyReg starts at 0. 
> 
> 
>  -Hal
> 
> 
> -Vivek
> 
> 
>  -Hal
> 
> 
> 
> Vivek
> 
> On Wed, May 18, 2016 at 11:42 PM, Quentin Colombet <qcolombet at
apple.com <mailto:qcolombet at apple.com>> wrote:
> 
> On May 18, 2016, at 11:00 AM, vivek pandya <vivekvpandya at gmail.com
<mailto:vivekvpandya at gmail.com>> wrote:
> 
> 
> 
> Vivek Pandya
> 
> 
> On Wed, May 18, 2016 at 11:25 PM, Quentin Colombet <qcolombet at
apple.com <mailto:qcolombet at apple.com>> wrote:
> 
> On May 18, 2016, at 10:46 AM, vivek pandya <vivekvpandya at gmail.com
<mailto:vivekvpandya at gmail.com>> wrote:
> 
> 
> 
> Vivek Pandya
> 
> 
> On Wed, May 11, 2016 at 4:01 PM, Hal Finkel <hfinkel at anl.gov
<mailto:hfinkel at anl.gov>> wrote:
> 
> From: "vivek pandya" <vivekvpandya at gmail.com
<mailto:vivekvpandya at gmail.com>>
> To: "Mehdi Amini" <mehdi.amini at apple.com
<mailto:mehdi.amini at apple.com>>
> Cc: "Hal Finkel" <hfinkel at anl.gov <mailto:hfinkel at
anl.gov>>, "Quentin Colombet" <qcolombet at apple.com
<mailto:qcolombet at apple.com>>, "llvm-dev" <llvm-dev at
lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>, "Matthias
Braun" <matze at braunis.de <mailto:matze at braunis.de>>
> Sent: Wednesday, May 11, 2016 3:15:03 AM
> Subject: Re: [GSoC 2016] Interprocedural Register Allocation - Introduction
and Feedback
> 
> 
> 
> 
> Vivek Pandya
> 
> 
> On Wed, May 11, 2016 at 10:02 AM, vivek pandya <vivekvpandya at
gmail.com <mailto:vivekvpandya at gmail.com>> wrote:
> 
> 
> Vivek Pandya
> 
> 
> On Wed, May 11, 2016 at 9:43 AM, Mehdi Amini <mehdi.amini at apple.com
<mailto:mehdi.amini at apple.com>> wrote:
> 
> On May 10, 2016, at 6:06 PM, Hal Finkel <hfinkel at anl.gov
<mailto:hfinkel at anl.gov>> wrote:
> 
> 
> 
> From: "vivek pandya" <vivekvpandya at gmail.com
<mailto:vivekvpandya at gmail.com>>
> To: "llvm-dev" <llvm-dev at lists.llvm.org <mailto:llvm-dev
at lists.llvm.org>>, "Tim Amini Golling" <mehdi.amini at
apple.com <mailto:mehdi.amini at apple.com>>, "Hal Finkel"
<hfinkel at anl.gov <mailto:hfinkel at anl.gov>>
> Cc: "Quentin Colombet" <qcolombet at apple.com
<mailto:qcolombet at apple.com>>
> Sent: Tuesday, May 10, 2016 2:59:16 PM
> Subject: [GSoC 2016] Interprocedural Register Allocation - Introduction and
Feedback
> 
> Hello LLVM Community,
> 
> Sorry for delay as I was busy in final exams.
> 
> I am Vivek from India. Thanks for choosing my proposal for Interprocedural
Register Allocation (IPRA) in LLVM. Mehdi Amini and Hal Finkel will be mentoring
me for this project.
> 
> IPRA can reduce code size and runtime of programs by allocating register
across the module and procedure boundaries.
> 
> I have identified some old but effective research work on this area.
> I want community's feedback for feasibility of these approach and I am
targeting to implement two of them during this project.
> 
> Here is list of the papers, I have read first two papers and I would like
to discuss those approach first, I will read other two paper then initiate
discussion for them as well. All I want is to find out a concrete implementation
plan before 23 May, 2016 and for that I need community's help.
> 
> 1) Compile time ----- Minimizing register usage penalty at procedure calls
- http://dl.acm.org/citation.cfm?id=53999
<http://dl.acm.org/citation.cfm?id=53999>
> ====================================================================In this
approach intra-procedural register allocation is used as base but machine code
generation order is bottom up traversal of call graph and inter-procedural
effect is achieved by propagating register usage information of callee function
to caller (i.e child to parent in CallGraph) so that caller can use different
registers than callee and can save load store cost at procedure call, this is
not trivial as it seems due to recursive calls, library function usage etc. Also
for upper region of the graph in this technique available number of registers
might become zero in that case it should fall back to normal load store at
procedure call. Apart from these difficulties other difficulties have been
identified please follow this mail-chain
https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion
<https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion>
> My mentor has already provided me a patch that alters code generation order
as per bottom up call graph traversal, I am working from that point now. Any
other help/suggestion is always welcomed.
> 
> 2) Link time ----- Global register allocation at link time -
http://dl.acm.org/citation.cfm?id=989415
<http://dl.acm.org/citation.cfm?id=989415>
> ====================================================================In this
particular approach (sort of true IPRA) registers will be reallocated (this
optimization will be optional if turned off still code will be compiled as per
intra-procedural allocation) at link time. Here modules are first complied as
per normal compilation but the object code is annotated with details so that
linker can build call graph and also calculate usage information at link time.
Compiler also write hints in object code that if particular variable is
allocated in some other register ( due to new allocation) then how the code
should be changed? Thus linker can use these information to decide which
variables (global) need to be in same register through out the program execution
and also according to register usage information in call graph which procedure
will not be active simultaneously so that locals for that procedures can be in
same registers with out load store at procedure calls.
> For these particular method help me to analyze feasibility: 
> 1) Can llvm collects following information at module level in MachineIR?
list of procedures in module, list of locals in procedures, list of procedures
that a particular procedure can call, and a list of the variables this procedure
references. Each entry in the last two lists includes an estimate of the number
of times the procedure is called or the variable is referenced in each execution
of this procedure
> 2) Can llvm write informative commands to object files?
> 3) Can LTO is capable of leveraging those commands? 
> In terms of scoping the project for the summer, I definitely recommend that
you focus on (1) first. If you finish that, we can certainly move on to other
things.
> 
> I'll add +1 here, but I already wrote the same thing on IRC when
discussing with Vivek. True IPRA without a proper MachineModule infrastructure
won't be doable in my opinion (even with such infrastructure, it may not be
trivial in LLVM in general).
> 
> Regarding link time, note that any such a design would likely look much
different than in David Wall's paper however, because our LTO re-codegens
everything anyway. The paper says, "Finally, it keeps us honest as
designers of the system; once we postpone anything until link time, the
temptation is great to postpone everything, ..." - Well, we've
long-since succumb to that temptation when we LTO. C'est la vie.
> 
> +1 as well, our LTO will benefit naturally from the leaf-to-root
information propagation. ThinLTO will be more challenging/interesting though!
> For the first part a mechanism similar to MachineModulePass would be
desirable but that may not be possible during this project, but if we can make
some sort of smaller version of that to suit our purpose.
> I don't think we need to make any kind of MachineModulePass to make
this work. Once we alter the visitation order based on the CGSCC iteration
scheme, we can keep state in-between functions in the pre-existing hacky way
(using static members of the relevant function passes).
>  Sorry my mistake here by first part I mean 1) requirement in the link time
approach.
>  
> I also don't see where/why we need a MachineModule(Pass) for the CGSCC
scheme, that said I'd rather avoid using a function pass with static
members, if we can have a ModuleAnalysis that is bookkeeping the results for
functions in the module and queries by the register allocator somehow.
> Matthias/Quentin may have other inputs on this aspect.
>  
> @Hal do you mean to add a simple MachineFunction pass that will just
operate on register allocated function and prepare a BitVector to indicate which
register is being used by MachineFunction, and then use this pass as analysis
pass (i.e just simply return static BitVector for clobbered register when
register allocation for next function begins. This part is not much clear to me)
this thing can be done by scheduling a pass post register allocation in
lib/CodeGen/Passes.cpp
> 
> void TargetPassConfig::addMachinePasses() {
> . 
> .
> .
>   // Run pre-ra passes.
>   addPreRegAlloc();
> 
>   // Run register allocation and passes that are tightly coupled with it,
>   // including phi elimination and scheduling.
>   if (getOptimizeRegAlloc())
>     addOptimizedRegAlloc(createRegAllocPass(true));
>   else
>     addFastRegAlloc(createRegAllocPass(false));
> 
>   // Run post-ra passes.
>   addPostRegAlloc();
> // Adding a new pass here which keeps register mask information across
function calls.
> .
> .
> .
> }
> 
> But this also requires current register allocators to use this information
in someway because RegMaskBits in LiveIntervalAnalysis.cpp is not static across
calls. I mean I am not clear for how to propagate static info to
Intra-procedural Register allocators (if possible without disturbing their code
)
> First, my hope is that we won't need to change the register allocators,
as such, in order to make use of this information. Instead, we'll simply be
able to alter the register masks generated for the call instructions. These
masks will indicate fewer clobbers than might otherwise be present based on the
ABI because of information gathered during the codegen of the callee. These
masks are generally constructed by target based on the calling convention. The
PowerPC backend, for example, looks like this:
> 
>   // Add a register mask operand representing the call-preserved registers.
>   const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
>   const uint32_t *Mask >      
TRI->getCallPreservedMask(DAG.getMachineFunction(), CallConv);
>   assert(Mask && "Missing call preserved mask for calling
convention");
>   Ops.push_back(DAG.getRegisterMask(Mask));
> 
> but it can be more complicated. If you look for uses of
'getRegisterMask' in Target/*/*ISelLowering.cpp, you'll see what I
mean. Regardless, the code ends up calling some method is the targets
TargetRegisterInfo subclass. These methods generally look something like this:
> 
> const uint32_t *
> PPCRegisterInfo::getCallPreservedMask(const MachineFunction &MF,
>                                       CallingConv::ID CC) const {
>   const PPCSubtarget &Subtarget =
MF.getSubtarget<PPCSubtarget>();
>   ...
>   return TM.isPPC64() ? (Subtarget.hasAltivec() ?
CSR_SVR464_Altivec_RegMask
>                                                 : CSR_SVR464_RegMask)
>                       : (Subtarget.hasAltivec() ?
CSR_SVR432_Altivec_RegMask
>                                                 : CSR_SVR432_RegMask);
> }
> 
> In any case, the fundamental idea here is that, when someone calls
getCallPreservedMask in order to set the regmask on a call, we might not have to
use the CC at all. Instead, if we've already codegened the function, we
might use a cache of 'exact' register masks computed during codegen of
the potential callees instead.
> I am thinking to add a simple Immutable pass MachineRegisterUsageInfo
similar to MachineBranchProbabilityInfo that can maintain
RegisterUsageInformation per function. Can it be simply done by using
UsedPhysRegMask from MachineRegisterInfo ??
> 
> No, like the comment said, UsedPhysRegMask gives only the registers
clobbered by calls:
> // This bit vector represents all the registers clobbered by function
calls.
> 
> You want to build this information yourself on top of
MachineRegisterInfo::isPhysRegModified
> Ok but then the time complexity will be O(n) n = number of physical
register on the target. Am I going correct?
> 
> Yes, this is correct.
> 
>   
> Here getCallPreservedMask will call API provided by
MachineRegisterUsageInfo to avail the exact register mask but how it can know
that the function is already codegen or it will query each time when
getCallPreservedMask is called and of available MachineRegisterUsageInfo will
return the details otherwise simply return NULL.
> So changes will be now in TargetRegisterInfo implementation for each target
right ??
> 
> 
> In order to do this, I think we'll need to provide a function callable
from the target's getCallPreservedMask implementation, which can return such
an 'exact' regmask when available. I think we need to do it this way for
two reasons:
> 
>  1. Not all of the target code calls getCallPreservedMask, but sometimes
calls other similar target-specific functions (e.g. getTLSCallPreservedMask).
>  2. The targets need to opt-in to this behavior because only the target can
know that all register uses are really tagged correctly post
"pre-emit".
> 
> Because the target is free to introduce uses of registers at essentially
any time, we need to do the scanning for used registers after the
"pre-emit" passes run. This can be done by scheduling some simple
register-use scanning pass after the call to addPreEmitPass in
lib/CodeGen/Passes.cpp.
> 
> 
> I think this also applies in someway to Mehdi Amini's idea to keep a
ModulePass for bookkeeping but then existing register allocators will be
required to change so that the code can query the ModulePass for RegMaskBits for
particular function.
> I think that the simplest way to do this is to create an immutable analysis
pass (e.g. BasicAA) that keeps the cache of the computed register masks. This is
somewhat similar in spirit to how the 'AssumptionCache' analysis works
at the IR level. This analysis can then be created by lib/CodeGen/Passes.cpp
early, and then queried and passed around later by the CodeGen/Target code.
Because it is an immutable analysis, it won't get destroyed until the very
end, which is also important because, I imagine, it will need to own the memory
associated with the generated register masks.
> 
>  -Hal
> 
> Vivek
> 
> 
> Yes for propagating register usage approach we don't need
MachineModulePass
>  
> Vivek
> -- 
> Mehdi
> 
> 
> 
> 
> 
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
> 
> 
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
> 
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160524/01ad9fa2/attachment-0001.html>

Maybe Matching Threads

Search for more possibly parallel threads

llvm dev - May 2016 - [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback

[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback

[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback

[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback

Maybe Matching Threads