Matthias Braun via llvm-dev
2016-May-11 17:49 UTC
[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback
Yes, there is also MachineRegisterInfo::UsedPhysRegMask, which should be the union of all regmasks in the function.

> On May 11, 2016, at 10:47 AM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On May 11, 2016, at 12:46 PM, Matthias Braun <matze at braunis.de> wrote:
>
> On May 11, 2016, at 3:31 AM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On May 11, 2016, at 3:15 AM, vivek pandya <vivekvpandya at gmail.com> wrote:
>
> On Wed, May 11, 2016 at 10:02 AM, vivek pandya <vivekvpandya at gmail.com> wrote:
>
> On Wed, May 11, 2016 at 9:43 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> On May 10, 2016, at 6:06 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> On May 10, 2016, at 2:59 PM, vivek pandya <vivekvpandya at gmail.com> wrote:
>
> Hello LLVM Community,
>
> Sorry for the delay, as I was busy with final exams.
>
> I am Vivek from India. Thanks for choosing my proposal for Interprocedural Register Allocation (IPRA) in LLVM. Mehdi Amini and Hal Finkel will be mentoring me for this project.
>
> IPRA can reduce the code size and runtime of programs by allocating registers across module and procedure boundaries.
>
> I have identified some old but effective research work in this area. I want the community's feedback on the feasibility of these approaches, and I am targeting to implement two of them during this project.
>
> Here is the list of papers. I have read the first two and would like to discuss their approaches first; I will read the other two papers and then start a discussion about them as well. All I want is to work out a concrete implementation plan before 23 May 2016, and for that I need the community's help.
>
> 1) Compile time ----- Minimizing register usage penalty at procedure calls - http://dl.acm.org/citation.cfm?id=53999
> ====================================================================
> In this approach intra-procedural register allocation is used as the base, but machine code is generated in bottom-up call-graph order, and the inter-procedural effect is achieved by propagating the register usage information of a callee to its caller (i.e. from child to parent in the CallGraph), so that the caller can use different registers than the callee and the load/store cost at procedure calls can be avoided. This is not as trivial as it seems, due to recursive calls, library function usage, etc. Also, in the upper region of the graph the number of available registers might drop to zero; in that case the technique should fall back to normal loads/stores at the procedure call. Apart from these, other difficulties have been identified; please follow this mail chain: https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion
> My mentor has already provided me a patch that alters the code generation order to a bottom-up call-graph traversal, and I am working from that point now. Any other help/suggestion is always welcome.
>
> 2) Link time ----- Global register allocation at link time - http://dl.acm.org/citation.cfm?id=989415
> ====================================================================
> In this particular approach (a sort of true IPRA) registers are reallocated at link time (the optimization is optional; if it is turned off, code is still compiled with intra-procedural allocation). Modules are first compiled as in normal compilation, but the object code is annotated with details so that the linker can build a call graph and compute usage information at link time. The compiler also writes hints into the object code describing how the code should be changed if a particular variable ends up allocated to a different register (due to the new allocation). The linker can use this information to decide which (global) variables need to stay in the same register throughout program execution and, from the register usage information in the call graph, which procedures will never be active simultaneously, so that their locals can share registers without loads/stores at procedure calls.
> For this particular method, help me analyze feasibility:
> 1) Can LLVM collect the following information at module level in Machine IR: the list of procedures in the module, the list of locals in each procedure, the list of procedures a particular procedure can call, and the list of variables each procedure references? Each entry in the last two lists includes an estimate of the number of times the procedure is called or the variable is referenced in each execution of this procedure.
> 2) Can LLVM write informative commands to object files?
> 3) Is LTO capable of leveraging those commands?
>
> In terms of scoping the project for the summer, I definitely recommend that you focus on (1) first. If you finish that, we can certainly move on to other things.
>
> I'll add a +1 here, but I already wrote the same thing on IRC when discussing with Vivek. True IPRA without a proper MachineModule infrastructure won't be doable in my opinion (even with such infrastructure, it may not be trivial in LLVM in general).
>
> Regarding link time, note that any such design would likely look much different than in David Wall's paper, however, because our LTO re-codegens everything anyway. The paper says, "Finally, it keeps us honest as designers of the system; once we postpone anything until link time, the temptation is great to postpone everything, ..." - Well, we've long since succumbed to that temptation when we LTO. C'est la vie.
>
> +1 as well, our LTO will benefit naturally from the leaf-to-root information propagation. ThinLTO will be more challenging/interesting though!
>
> For the first part, a mechanism similar to a MachineModulePass would be desirable, but that may not be possible during this project; perhaps we can make some smaller version of it that suits our purpose.
>
> I don't think we need to make any kind of MachineModulePass to make this work. Once we alter the visitation order based on the CGSCC iteration scheme, we can keep state in-between functions in the pre-existing hacky way (using static members of the relevant function passes).
>
> Sorry, my mistake here: by "first part" I mean requirement 1) in the link-time approach.
>
> I also don't see where/why we need a MachineModule(Pass) for the CGSCC scheme; that said, I'd rather avoid using a function pass with static members if we can have a ModuleAnalysis that bookkeeps the results for the functions in the module and is queried by the register allocator somehow. Matthias/Quentin may have other inputs on this aspect.
>
> @Hal, do you mean to add a simple MachineFunction pass that will just operate on a register-allocated function and prepare a BitVector indicating which registers are used by the MachineFunction, and then use this pass as an analysis pass (i.e. simply return the static BitVector of clobbered registers when register allocation for the next function begins; this part is not very clear to me)? That could be done by scheduling a pass post register allocation in lib/CodeGen/Passes.cpp:
>
>   void TargetPassConfig::addMachinePasses() {
>     ...
>     // Run pre-ra passes.
>     addPreRegAlloc();
>
>     // Run register allocation and passes that are tightly coupled with it,
>     // including phi elimination and scheduling.
>     if (getOptimizeRegAlloc())
>       addOptimizedRegAlloc(createRegAllocPass(true));
>     else
>       addFastRegAlloc(createRegAllocPass(false));
>
>     // Run post-ra passes.
>     addPostRegAlloc();
>     // Adding a new pass here which keeps register mask information across function calls.
>     ...
>   }
>
> But this also requires the current register allocators to use this information in some way, because RegMaskBits in LiveIntervalAnalysis.cpp is not static across calls. I mean, I am not clear on how to propagate static info to the intra-procedural register allocators (if possible, without disturbing their code).
>
> First, my hope is that we won't need to change the register allocators as such in order to make use of this information. Instead, we'll simply be able to alter the register masks generated for the call instructions. These masks will indicate fewer clobbers than might otherwise be present based on the ABI, because of information gathered during the codegen of the callee. These masks are generally constructed by the target based on the calling convention. The PowerPC backend, for example, looks like this:
>
>   // Add a register mask operand representing the call-preserved registers.
>   const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
>   const uint32_t *Mask =
>       TRI->getCallPreservedMask(DAG.getMachineFunction(), CallConv);
>   assert(Mask && "Missing call preserved mask for calling convention");
>   Ops.push_back(DAG.getRegisterMask(Mask));
>
> but it can be more complicated. If you look for uses of 'getRegisterMask' in Target/*/*ISelLowering.cpp, you'll see what I mean. Regardless, the code ends up calling some method in the target's TargetRegisterInfo subclass. These methods generally look something like this:
>
>   const uint32_t *
>   PPCRegisterInfo::getCallPreservedMask(const MachineFunction &MF,
>                                         CallingConv::ID CC) const {
>     const PPCSubtarget &Subtarget = MF.getSubtarget<PPCSubtarget>();
>     ...
>     return TM.isPPC64() ? (Subtarget.hasAltivec() ? CSR_SVR464_Altivec_RegMask
>                                                   : CSR_SVR464_RegMask)
>                         : (Subtarget.hasAltivec() ? CSR_SVR432_Altivec_RegMask
>                                                   : CSR_SVR432_RegMask);
>   }
>
> In any case, the fundamental idea here is that, when someone calls getCallPreservedMask in order to set the regmask on a call, we might not have to use the CC at all. Instead, if we've already codegened the function, we might use a cache of 'exact' register masks computed during codegen of the potential callees instead.
>
> In order to do this, I think we'll need to provide a function callable from the target's getCallPreservedMask implementation, which can return such an 'exact' regmask when available. I think we need to do it this way for two reasons:
>
> 1. Not all of the target code calls getCallPreservedMask; some of it calls other similar target-specific functions (e.g. getTLSCallPreservedMask).
> 2. The targets need to opt in to this behavior, because only the target can know that all register uses are really tagged correctly post "pre-emit".
>
> Because the target is free to introduce uses of registers at essentially any time, we need to do the scanning for used registers after the "pre-emit" passes run. This can be done by scheduling some simple register-use scanning pass after the call to addPreEmitPass in lib/CodeGen/Passes.cpp.
>
> MachineRegisterInfo maintains linked lists with the defs/uses of each register, so you can determine whether a specific register is used or not without scanning.
>
> Does this include regmask-clobbered registers?
>
>  -Hal
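For illustration, here is a minimal sketch of the query interface Hal describes: a helper callable from a target's getCallPreservedMask (or getTLSCallPreservedMask) that returns a cached 'exact' regmask when the callee has already been codegen'ed. The names ExactRegMasks and getExactCallPreservedMask are hypothetical, not existing LLVM API, and a real design would more likely live in a module-level analysis than in a static map.

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/Function.h"
#include <cstdint>
#include <vector>

using namespace llvm;

// Maps already-codegen'ed functions to an 'exact' regmask in the usual
// regmask encoding: one bit per physical register, and a set bit means the
// register is preserved across a call to that function.
static DenseMap<const Function *, std::vector<uint32_t>> ExactRegMasks;

// Returns the exact mask if the callee has already been codegen'ed; otherwise
// returns null and the target should fall back to the calling-convention mask
// from getCallPreservedMask(MF, CC).
const uint32_t *getExactCallPreservedMask(const Function *Callee) {
  auto It = ExactRegMasks.find(Callee);
  return It == ExactRegMasks.end() ? nullptr : It->second.data();
}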
vivek pandya via llvm-dev
2016-May-11 18:14 UTC
[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback
On Wed, May 11, 2016 at 11:19 PM, Matthias Braun <matze at braunis.de> wrote:

> Yes, there is also MachineRegisterInfo::UsedPhysRegMask, which should be the union of all regmasks in the function.

Does this mean there is no requirement for a separate pass to propagate register usage information, and that only the codegen order needs to be changed to a DFS on the call graph? Also, currently no intra-procedural register allocator is using UsedPhysRegMask to avoid loads/stores; does that mean we need to change that, or am I getting it wrong?

- Vivek
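On the codegen-order part of the question, here is a hedged sketch (not the mentor's actual patch) of how a bottom-up visitation order can be derived from the CallGraph with LLVM's SCC iterator. Callees then reach the code generator before their callers, and only functions inside a recursive SCC still need the conservative ABI masks for the calls within the cycle.

#include "llvm/ADT/SCCIterator.h"
#include "llvm/Analysis/CallGraph.h"
#include "llvm/IR/Module.h"
#include <vector>

using namespace llvm;

// Returns the module's functions in bottom-up call-graph order: SCCs are
// visited in post-order, so leaves of the call graph come first.
std::vector<Function *> bottomUpCodegenOrder(Module &M) {
  CallGraph CG(M);
  std::vector<Function *> Order;
  for (auto I = scc_begin(&CG); !I.isAtEnd(); ++I)
    for (CallGraphNode *Node : *I)
      if (Function *F = Node->getFunction())
        if (!F->isDeclaration())
          Order.push_back(F);
  return Order;
}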
Hal Finkel via llvm-dev
2016-May-11 18:16 UTC
[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback
----- Original Message -----
> From: "vivek pandya" <vivekvpandya at gmail.com>
> To: "Matthias Braun" <matze at braunis.de>
> Cc: "Hal Finkel" <hfinkel at anl.gov>, "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Wednesday, May 11, 2016 1:14:07 PM
> Subject: Re: [llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback
>
> On Wed, May 11, 2016 at 11:19 PM, Matthias Braun <matze at braunis.de> wrote:
>
> > Yes, there is also MachineRegisterInfo::UsedPhysRegMask, which should be the union of all regmasks in the function.
>
> Does this mean there is no requirement for a separate pass to propagate register usage information, and that only the codegen order needs to be changed to a DFS on the call graph? Also, currently no intra-procedural register allocator is using UsedPhysRegMask to avoid loads/stores; does that mean we need to change that, or am I getting it wrong?

No, it just means that the "scanning" procedure is simple; it does not actually need to visit each instruction.

 -Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
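To illustrate the point that no per-instruction scan is needed, here is a hedged sketch of how the clobber set of a just-codegen'ed function could be derived from MachineRegisterInfo's per-register use lists plus UsedPhysRegMask, which (per Matthias) is the union of all regmasks in the function and therefore covers the regmask-clobbered registers. The function name is hypothetical and the header paths reflect the 2016-era tree.

#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/MC/MCRegisterInfo.h"
#include "llvm/Target/TargetRegisterInfo.h"
#include <vector>

using namespace llvm;

// Builds an 'exact' regmask (bit set = preserved) for MF after register
// allocation and the pre-emit passes have run.
std::vector<uint32_t> computeExactRegMask(const MachineFunction &MF) {
  const MachineRegisterInfo &MRI = MF.getRegInfo();
  const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
  unsigned NumRegs = TRI->getNumRegs();

  // Start with "everything preserved" and clear the bit of each clobbered reg.
  std::vector<uint32_t> Mask((NumRegs + 31) / 32, ~0u);
  for (unsigned Reg = 1; Reg < NumRegs; ++Reg) {
    // Clobbered via some call's regmask inside this function?
    bool Clobbered = MRI.getUsedPhysRegsMask().test(Reg);
    // Or defined/used directly, including through an aliasing register?
    for (MCRegAliasIterator AI(Reg, TRI, /*IncludeSelf=*/true);
         !Clobbered && AI.isValid(); ++AI)
      Clobbered = !MRI.reg_nodbg_empty(*AI);
    if (Clobbered)
      Mask[Reg / 32] &= ~(1u << (Reg % 32));
  }
  return Mask;
}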
vivek pandya via llvm-dev
2016-May-18 17:25 UTC
[llvm-dev] [GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback
On Wed, May 11, 2016 at 11:19 PM, Matthias Braun <matze at braunis.de> wrote:

> Yes, there is also MachineRegisterInfo::UsedPhysRegMask, which should be the union of all regmasks in the function.

./lib/CodeGen/MIRParser/MIRParser.cpp:  RegInfo.setUsedPhysRegMask(CalleeSavedRegisterMask.flip());

Is this line responsible for setting the UsedPhysRegMask after codegen for a function? And this will be changed for each function call, right?

- Vivek
>>> >>> 1) Compile time ----- Minimizing register usage penalty at procedure >>> calls - http://dl.acm.org/citation.cfm?id=53999 >>> ====================================================================In >>> this approach intra-procedural register allocation is used as base but >>> machine code generation order is bottom up traversal of call graph and >>> inter-procedural effect is achieved by propagating register usage >>> information of callee function to caller (i.e child to parent in CallGraph) >>> so that caller can use different registers than callee and can save load >>> store cost at procedure call, this is not trivial as it seems due to >>> recursive calls, library function usage etc. Also for upper region of the >>> graph in this technique available number of registers might become zero in >>> that case it should fall back to normal load store at procedure call. Apart >>> from these difficulties other difficulties have been identified please >>> follow this mail-chain >>> https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion >>> My mentor has already provided me a patch that alters code generation >>> order as per bottom up call graph traversal, I am working from that point >>> now. Any other help/suggestion is always welcomed. >>> >>> 2) Link time ----- Global register allocation at link time - >>> http://dl.acm.org/citation.cfm?id=989415 >>> ====================================================================In >>> this particular approach (sort of true IPRA) registers will be reallocated >>> (this optimization will be optional if turned off still code will be >>> compiled as per intra-procedural allocation) at link time. Here modules are >>> first complied as per normal compilation but the object code is annotated >>> with details so that linker can build call graph and also calculate usage >>> information at link time. Compiler also write hints in object code that if >>> particular variable is allocated in some other register ( due to new >>> allocation) then how the code should be changed? Thus linker can use these >>> information to decide which variables (global) need to be in same register >>> through out the program execution and also according to register usage >>> information in call graph which procedure will not be active simultaneously >>> so that locals for that procedures can be in same registers with out load >>> store at procedure calls. >>> For these particular method help me to analyze feasibility: >>> 1) Can llvm collects following information at module level in MachineIR? >>> list of procedures in module, list of locals in procedures, list of >>> procedures that a particular procedure can call, and a list of the >>> variables this procedure references. Each entry in the last two lists >>> includes an estimate of the number of times the procedure is called or the >>> variable is referenced in each execution of this procedure >>> 2) Can llvm write informative commands to object files? >>> 3) Can LTO is capable of leveraging those commands? >>> >>> In terms of scoping the project for the summer, I definitely recommend >>> that you focus on (1) first. If you finish that, we can certainly move on >>> to other things. >>> >>> >>> I'll add +1 here, but I already wrote the same thing on IRC when >>> discussing with Vivek. True IPRA without a proper MachineModule >>> infrastructure won't be doable in my opinion (even with such >>> infrastructure, it may not be trivial in LLVM in general). 
> On May 11, 2016, at 10:47 AM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>>> Regarding link time, note that any such design would likely look quite
>>> different from the one in David Wall's paper, however, because our LTO
>>> re-codegens everything anyway. The paper says, "Finally, it keeps us
>>> honest as designers of the system; once we postpone anything until link
>>> time, the temptation is great to postpone everything, ..." - Well, we've
>>> long since succumbed to that temptation when we LTO. C'est la vie.
>>>
>>> +1 as well, our LTO will benefit naturally from the leaf-to-root
>>> information propagation. ThinLTO will be more challenging/interesting
>>> though!
>>>
>>> For the first part, a mechanism similar to a MachineModulePass would be
>>> desirable, but that may not be possible during this project; perhaps we
>>> can make some smaller version of it to suit our purpose.
>>>
>>> I don't think we need to make any kind of MachineModulePass to make this
>>> work. Once we alter the visitation order based on the CGSCC iteration
>>> scheme, we can keep state between functions in the pre-existing hacky way
>>> (using static members of the relevant function passes).
>>
>> Sorry, my mistake here: by "first part" I meant requirement 1) in the
>> link-time approach.
>>
>>> I also don't see where/why we need a MachineModule(Pass) for the CGSCC
>>> scheme; that said, I'd rather avoid using a function pass with static
>>> members if we can have a ModuleAnalysis that bookkeeps the results for
>>> the functions in the module and is queried by the register allocator
>>> somehow. Matthias/Quentin may have other inputs on this aspect.
>
> @Hal, do you mean to add a simple MachineFunction pass that just operates on
> an already register-allocated function and prepares a BitVector indicating
> which registers are used by that MachineFunction, and then use this pass as
> an analysis pass (i.e. simply return the static BitVector of clobbered
> registers when register allocation for the next function begins; this part
> is not very clear to me)? This could be done by scheduling a pass after
> register allocation in lib/CodeGen/Passes.cpp:
>
> void TargetPassConfig::addMachinePasses() {
>   ...
>   // Run pre-ra passes.
>   addPreRegAlloc();
>
>   // Run register allocation and passes that are tightly coupled with it,
>   // including phi elimination and scheduling.
>   if (getOptimizeRegAlloc())
>     addOptimizedRegAlloc(createRegAllocPass(true));
>   else
>     addFastRegAlloc(createRegAllocPass(false));
>
>   // Run post-ra passes.
>   addPostRegAlloc();
>   // Add a new pass here which keeps register mask information across
>   // function calls.
>   ...
> }
>
> But this also requires the current register allocators to use this
> information in some way, because RegMaskBits in LiveIntervalAnalysis.cpp is
> not static across calls. I mean, I am not clear on how to propagate this
> information to the intra-procedural register allocators (if possible
> without disturbing their code).
>
> First, my hope is that we won't need to change the register allocators, as
> such, in order to make use of this information. Instead, we'll simply be
> able to alter the register masks generated for the call instructions. These
> masks will indicate fewer clobbers than might otherwise be present based on
> the ABI, because of information gathered during the codegen of the callee.
> These masks are generally constructed by the target based on the calling
> convention. The PowerPC backend, for example, looks like this:
>
>   // Add a register mask operand representing the call-preserved registers.
>   const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
>   const uint32_t *Mask =
>       TRI->getCallPreservedMask(DAG.getMachineFunction(), CallConv);
>   assert(Mask && "Missing call preserved mask for calling convention");
>   Ops.push_back(DAG.getRegisterMask(Mask));
>
> but it can be more complicated. If you look for uses of 'getRegisterMask'
> in Target/*/*ISelLowering.cpp, you'll see what I mean. Regardless, the code
> ends up calling some method in the target's TargetRegisterInfo subclass.
> These methods generally look something like this:
>
>   const uint32_t *
>   PPCRegisterInfo::getCallPreservedMask(const MachineFunction &MF,
>                                         CallingConv::ID CC) const {
>     const PPCSubtarget &Subtarget = MF.getSubtarget<PPCSubtarget>();
>     ...
>     return TM.isPPC64() ? (Subtarget.hasAltivec() ? CSR_SVR464_Altivec_RegMask
>                                                   : CSR_SVR464_RegMask)
>                         : (Subtarget.hasAltivec() ? CSR_SVR432_Altivec_RegMask
>                                                   : CSR_SVR432_RegMask);
>   }
>
> In any case, the fundamental idea here is that, when someone calls
> getCallPreservedMask in order to set the regmask on a call, we might not
> have to use the CC at all. Instead, if we've already codegened the callee,
> we might use a cache of 'exact' register masks computed during codegen of
> the potential callees.
>
> In order to do this, I think we'll need to provide a function callable from
> the target's getCallPreservedMask implementation, which can return such an
> 'exact' regmask when available. I think we need to do it this way for two
> reasons:
>
>  1. Not all of the target code calls getCallPreservedMask; some of it calls
>     other, similar target-specific functions (e.g. getTLSCallPreservedMask).
>  2. The targets need to opt in to this behavior, because only the target
>     can know that all register uses are really tagged correctly post
>     "pre-emit".
>
> Because the target is free to introduce uses of registers at essentially
> any time, we need to do the scanning for used registers after the
> "pre-emit" passes run. This can be done by scheduling some simple
> register-use scanning pass after the call to addPreEmitPass in
> lib/CodeGen/Passes.cpp.
>
> MachineRegisterInfo maintains linked lists of defs/uses for each register,
> so you can determine whether a specific register is used or not without
> scanning.
>
> Does this include regmask-clobbered registers?
>
>  -Hal
>
> - Matthias
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
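To make the scanning idea above more concrete, here is a rough sketch of what
such a register-use scanning pass might look like. This is not existing LLVM
code: the pass name ExactRegMaskCollector and its static ExactMasks map are
hypothetical placeholders, standing in for the "pre-existing hacky way" of
keeping state between functions mentioned earlier (a real implementation
would presumably use a proper analysis, as Mehdi suggests). It walks the
machine code after the pre-emit passes, clears a bit for every physical
register the function can clobber (explicit defs, their aliases, and anything
clobbered by regmasks of calls it makes), and records the resulting
preserved-register mask for later callers.

// Hypothetical sketch of a post-pre-emit register-use scanning pass.
// Pass registration and insertion after addPreEmitPass are omitted.
#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/IR/Function.h"
#include "llvm/MC/MCRegisterInfo.h"
#include "llvm/Target/TargetRegisterInfo.h"
#include <vector>

using namespace llvm;

namespace {
class ExactRegMaskCollector : public MachineFunctionPass {
public:
  static char ID;
  ExactRegMaskCollector() : MachineFunctionPass(ID) {}

  // Hypothetical cache: callee Function -> regmask words (bit set == preserved).
  static DenseMap<const Function *, std::vector<uint32_t>> ExactMasks;

  bool runOnMachineFunction(MachineFunction &MF) override {
    const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
    unsigned NumRegs = TRI->getNumRegs();

    // Start with "everything preserved" and clear bits for clobbered regs.
    std::vector<uint32_t> Mask((NumRegs + 31) / 32, ~0u);
    auto Clobber = [&](unsigned Reg) { Mask[Reg / 32] &= ~(1u << (Reg % 32)); };

    for (const MachineBasicBlock &MBB : MF)
      for (const MachineInstr &MI : MBB)
        for (const MachineOperand &MO : MI.operands()) {
          if (MO.isReg() && MO.isDef() && MO.getReg() &&
              TargetRegisterInfo::isPhysicalRegister(MO.getReg())) {
            // A def of a physical register clobbers it and all its aliases.
            for (MCRegAliasIterator AI(MO.getReg(), TRI, /*IncludeSelf=*/true);
                 AI.isValid(); ++AI)
              Clobber(*AI);
          } else if (MO.isRegMask()) {
            // Calls made by this function clobber whatever their own
            // regmasks do not preserve.
            for (unsigned Reg = 1; Reg < NumRegs; ++Reg)
              if (MachineOperand::clobbersPhysReg(MO.getRegMask(), Reg))
                Clobber(Reg);
          }
        }

    ExactMasks[MF.getFunction()] = std::move(Mask);
    return false; // Analysis only; the function is not modified.
  }
};
} // end anonymous namespace

char ExactRegMaskCollector::ID = 0;
DenseMap<const Function *, std::vector<uint32_t>> ExactRegMaskCollector::ExactMasks;

The lookup side is deliberately left open here: the target's
getCallPreservedMask (or getTLSCallPreservedMask, etc.) would consult such a
cache for the specific callee and fall back to the TableGen-generated CSR_*
mask when no exact mask is available. Since those hooks currently receive
only the calling MachineFunction and the calling convention, plumbing the
callee's identity down to them is something a real implementation would have
to address.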