thr3ads.net - llvm dev - [LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM [Dec 2013]

If this information is useful, please help other people find it:
Share via:

Radovan Obradovic

2013-Dec-18 02:33 UTC

[LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM

Hal,

Thank you for finding interest in our work!
Please find my answers inlined below.
> 
> ________________________________________
> From: Hal Finkel [hfinkel at anl.gov]
> Sent: Monday, December 16, 2013 4:26 PM
> To: Radovan Obradovic
> Cc: llvmdev at cs.uiuc.edu; chandlerc; Andrew Trick
> Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for
Clang/LLVM
> 
> Radovan,
> 
> Thanks for posting this! I would really like to have this kind of
> functionality available. There are a lot of things to think through
> here; comments inline...
> 
> ----- Original Message -----
> > From: "Radovan Obradovic" <Radovan.Obradovic at
imgtec.com>
> > To: llvmdev at cs.uiuc.edu
> > Sent: Monday, December 16, 2013 11:31:21 AM
> > Subject: [LLVMdev] [RFC] Iterrative compilation framework for
Clang/LLVM
> >
> >
> >
> >
> > This is a first step towards to full iterative compilation framework
> > for
> > Clang/LLVM. The attached patch is work in progress and the idea of
> > sending
> > it to the list earlier rather than later is to open up a discussion
> > on
> > iterative compilation in Clang/LLVM. All comments are welcome,
> > especially
> > those related to integration of the framework into Clang/LLVM.
> 
> I'm glad you're doing this, and I hope that you'll be willing
to
> stick though what might be a long design discussion :)
Our belief is that this approach can be beneficial, especially
for non-orthogonal architectures (e.g. microMIPS), thus we plan
to continue our work on this.
> 
> >
> > Current compilers have pipeline structure in which data from one
> > phase is
> > passed to the following one. But these compilation phases can
> > interfere in
> > unpredictable way so compilers use conservative heuristics on which
> > they
> > make decisions during the compilation process. Because of
> > unpredictability
> > of these interferences, compilers sometimes make decisions which can
> > degrade
> > code quality. Iterative compilation could partially resolve these
> > problems
> > by allowing compiler to explore some alternative decisions, generate
> > code
> > for these new decision paths and take the generated code that is the
> > best
> > for a particular criteria eventually.
> >
> > This approach is not new in the literature, we can mention
> > Milepost/CTuning
> > [http://ctuning.org] among other projects. In this approach, in
> > comparison
> > to Milepost/CTuning, the emphasis is on much finer granularity by
> > changing
> > arbitrary compiler decisions, and not only those based on values of
> > command
> > line options, at any point in translation process.
> >
> > This framework implements iterative approach in which in every
> > iteration
> > slightly different code is generated. At the end of compilation
> > process the
> > best emitted code is chosen.
> >
> > All points at which relevant decisions are made should be identified.
> > Access to framework data structure (DecisionTree) should be provided
> > at
> > these points. Based on the data from DecisionTree, obtained by
> > calling
> > getDecision member function, at these points decisions are made
> > whether to
> > execute default actions (as compiler would do without this framework)
> > or to
> > follow alternative paths.
> >
> > Let us consider Inliner example:
> >
> > Original code:
> >
> > if (!IC) {
> > DEBUG(dbgs() << " NOT Inlining: cost=" <<
IC.getCost()
> > << ", thres=" << (IC.getCostDelta() +
IC.getCost())
> > << ", Call: " << *CS.getInstruction() <<
"\n");
> > return false;
> > }
> >
> > Code after instrumentalization:
> >
> > bool doInlinging = (bool)IC;
> > DecisionTreeProxy *decisionTreeProxy;
> > decisionTreeProxy > >
moduleDecisionTreeProxies->getCurrFnDecisionTreeProxy();
> > int t = MaxDtPriority - abs(IC.getCostDelta());
> > bool decision = false;
> > if (t >= 0)
> > decision = decisionTreeProxy->getDecision(call);
> > if (decision)
> > doInlinging = !doInlinging;
> >
> > if (!doInlinging) {
> > DEBUG(dbgs() << " NOT Inlining: cost=" <<
IC.getCost()
> > << ", thres=" << (IC.getCostDelta() +
IC.getCost())
> > << ", Call: " << *CS.getInstruction() <<
"\n");
> > return false;
> > }
> 
> Talking a quick look over the patch, this seems to introduce a lot
> of boilerplate code. Can you think of a design that simplifies this?
It is necessary to access decision tree at every place where you
want to call getDecision function and priority parameter must be
calculated which is specific to every such place. It is only a
few lines of code but we will consider how to improve this.
All suggestions are welcomed.
> 
> >
> > Function getDecision returns true or false. If false is returned than
> > this
> > means to take default path and true means to take alternative one.
> >
> > Compilation process can be represented by sequence of binary values
> > where
> > every value determines decision made at some point. Value FALSE means
> > to
> > follow normal compilation process as original compiler would do
> > without
> > presence of iterative compilation framework. TRUE value means to take
> > alternative path or to make opposite decision at that point. This
> > naturally
> > forms tree like structure which we named Decision Tree.
> > Traversing the tree from the root to the leaf of the tree provides
> > all the
> > decisions made during one iteration in compilation process.
> >
> > At the end of each iteration, quality of generated code is estimated
> > by
> > executing newly introduced target dependent pass. Based on results
> > path for
> > the following iteration is calculated. At the moment, this has been
> > proved
> > for MIPS only and it is based on code size.
> 
> I think that, in general, we ought to be able to get an overall
> latency estimate from the instruction scheduler.
Criteria to optimize for speed should be added as well. Do you
think that latency info form scheduler can be used to estimate
execution speed of generated code? Can we guess somehow the execution
frequency of basic blocks?
> 
> >
> > Initially iterative compilation framework was implemented in CodeGen
> > phase
> > but large number of code transformations was already performed in
> > earlier
> > phases. This reduces possibilities to improve quality of generated
> > code.
> > Thus iterative compilation framework is applied to Clang driver where
> > it can
> > influence every single compiler phase. This relocation resulted in
> > some
> > implementation issues mostly related to the fact that driver and the
> > compiler itself (-cc1) are located in different processes. This is
> > resolved
> > by introducing DecisionTreeProxy class that implements communication
> > between
> > these two processes through temporary files.
> 
> I think that this can be solved by introducing some API for the
> necessary data flow. There is obviously already a binding here.
This probably needs to be fixed. All suggestions are welcomed.
> 
> >
> > At this point mechanism is applied to two optimizations: Inliner and
> > Loop
> > Invariant Code Motion. Supported platform is MIPS32r2.
> >
> > Despite small number of instrumented points we got improvements on
> > some test
> > files from CSiBE benchmark. Here are the promising results:
> >
> > jpeg-6b/jidctflt.c -10.2%
> > OpenTCP-1.0.4/system.c -6.7%
> > libmspack/test/chmd_md5.c -5.8%
> > flex-2.5.31/regex.c -4.8%
> 
> Very interesting. Out of curiosity, any slowdowns?
jidctflt.c is optimized due to the fact that LICM moves some
constant definitions out of the loop and increases the register
pressure so some spills occur. So optimized version, in this
case, should be even faster. But slowdowns are possible and we
use option -Oz so we are telling the compiler that code size
is the priority.
> 
> >
> > We plan to apply this framework to more compiler phases and to some
> > other
> > target architectures. It is expected to give better results on some
> > non
> > orthogonal architectures like microMIPS. Another direction is to
> > apply
> > machine learning based on some input program features (number of
> > basic
> > blocks, register pressure at some point etc.) and it can be used for
> > selecting compilation path for the next iteration or estimating the
> > best
> > path for compiling program without using iterative approach (with
> > only one
> > iteration).
> >
> >
> > o
> > / \
> > / \
> > o \
> > / \
> > / \
> > o o
> > / \ /
> > / \ /
> > L1 o o
> > / /
> > / /
> > L2 L3
> >
> > Figure 1: Decision Tree Example - Symbol 'o' represent
decision
> > points. If
> > getDecision function returns FALSE the left branch is taken (default
> > path),
> > otherwise the right branch is taken.
> > 'L1' compilation iteration can be represented as a string
> > (FALSE,FALSE,FALSE), 'L2' as (FALSE,FALSE,TRUE,FALSE) and
'L3'
> > as (TRUE, FALSE, FALSE).
> >
> >
> > Clearly, we see huge potential in this direction. At the same time,
> > we are
> > aware this would cause compiler to be slower which everyone is
> > against, but
> > this penalty can be taken when we are in need for most optimized
> > code.
> 
> What is the overhead of the current implementation? I suspect that,
> in order to control the large number of decision points from large
> regions, we will want some way to limit fine-grained decisions in
> the tree in large regions (and only including coarse ones, like the
> inlining of relatively large functions). In small blocks, on the
> other hand, we might even include instruction scheduling decisions
> (for in-order cores).
In the current implementation one full compilation per
decision is needed. Plan is that, for CodeGen pass, do once all the
passes of CodeGen per decision (to reuse bitcode). Machine learning should
considerably reduce the need for the large number of iterations.
> 
> I'd been thinking of experimenting with some iterative scheduling
> for my in-order PPC cores (for small blocks in loops), and
> integrating it into something like this might be even better.
> 
> >
> > What does everyone think?
> > Could this approach be considered as a valid mode for Clang/LLVM?
> >
> > Please use this post and the attached patch (it can be applied to 3.3
> > release - clang should be inside llvm/tools folder) as a reference
> > and
> > not the patch for submission.
> 
> I see in the patch some (seemingly small) changes to the pass manager
> setup. Can you describe exactly what you've done. Chandler is
> rewriting the pass infrastructure, and I'd like to understand how
> what you've done will or won't fit with the new setup.
Pointer to decision tree proxy has to be passed to every place
where getDecision is called and to the pass that calculates function
fitness. So pointer to it is added in Pass and PassManagerBase classes.
We will provide more detail explanation later.
> 
> Thanks again,
> Hal
> 
> >
> > Example command line:
> > > clang -target mipsel-unknown-linux -Oz -S -fiterative-comp=3
> > > example.c
> >
> > Thank you. Looking forward to hearing your comments on this.
> >
> >
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> >
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

Hongbin Zheng

2013-Dec-19 03:45 UTC

head link

[LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM

Hi Radovan,

I am also interested in the iterative compilation in LLVM, and had
implemented a simple iterative compilation driver.

I guess you do not need to embedded the pointer
to ModuleDecisionTreeProxies into the core classes of llvm, i.e. the
PassManager class and the Function class.

Instead, you could:
1. Implement a special pass manager that runs a set of passes iteratively.
Implementing the iterative compilation framework based on pass manage
allows the framework to live outside of clang, and we can use it in opt,
llc, etc.


2. Implement an immutable pass to store the global information (i.e.
ModuleDecisionTreeProxies if I do not misunderstand your patch) across
iterations. In your example, once you implemented ModuleDecisionTreeProxies
as an immutable pass, you can get the ModuleDecisionTreeProxies in every
pass you need it by simply call
"getAnalysis<ModuleDecisionTreeProxies>()"

Thanks
Hongbin

PS: I attached our iterative driver, you could have a look if you are
interested.


On Wed, Dec 18, 2013 at 10:33 AM, Radovan Obradovic <
Radovan.Obradovic at imgtec.com> wrote:
> Hal,
>
> Thank you for finding interest in our work!
> Please find my answers inlined below.
>
> >
> > ________________________________________
> > From: Hal Finkel [hfinkel at anl.gov]
> > Sent: Monday, December 16, 2013 4:26 PM
> > To: Radovan Obradovic
> > Cc: llvmdev at cs.uiuc.edu; chandlerc; Andrew Trick
> > Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for
> Clang/LLVM
> >
> > Radovan,
> >
> > Thanks for posting this! I would really like to have this kind of
> > functionality available. There are a lot of things to think through
> > here; comments inline...
> >
> > ----- Original Message -----
> > > From: "Radovan Obradovic" <Radovan.Obradovic at
imgtec.com>
> > > To: llvmdev at cs.uiuc.edu
> > > Sent: Monday, December 16, 2013 11:31:21 AM
> > > Subject: [LLVMdev] [RFC] Iterrative compilation framework for
> Clang/LLVM
> > >
> > >
> > >
> > >
> > > This is a first step towards to full iterative compilation
framework
> > > for
> > > Clang/LLVM. The attached patch is work in progress and the idea
of
> > > sending
> > > it to the list earlier rather than later is to open up a
discussion
> > > on
> > > iterative compilation in Clang/LLVM. All comments are welcome,
> > > especially
> > > those related to integration of the framework into Clang/LLVM.
> >
> > I'm glad you're doing this, and I hope that you'll be
willing to
> > stick though what might be a long design discussion :)
>
> Our belief is that this approach can be beneficial, especially
> for non-orthogonal architectures (e.g. microMIPS), thus we plan
> to continue our work on this.
>
> >
> > >
> > > Current compilers have pipeline structure in which data from one
> > > phase is
> > > passed to the following one. But these compilation phases can
> > > interfere in
> > > unpredictable way so compilers use conservative heuristics on
which
> > > they
> > > make decisions during the compilation process. Because of
> > > unpredictability
> > > of these interferences, compilers sometimes make decisions which
can
> > > degrade
> > > code quality. Iterative compilation could partially resolve these
> > > problems
> > > by allowing compiler to explore some alternative decisions,
generate
> > > code
> > > for these new decision paths and take the generated code that is
the
> > > best
> > > for a particular criteria eventually.
> > >
> > > This approach is not new in the literature, we can mention
> > > Milepost/CTuning
> > > [http://ctuning.org] among other projects. In this approach, in
> > > comparison
> > > to Milepost/CTuning, the emphasis is on much finer granularity by
> > > changing
> > > arbitrary compiler decisions, and not only those based on values
of
> > > command
> > > line options, at any point in translation process.
> > >
> > > This framework implements iterative approach in which in every
> > > iteration
> > > slightly different code is generated. At the end of compilation
> > > process the
> > > best emitted code is chosen.
> > >
> > > All points at which relevant decisions are made should be
identified.
> > > Access to framework data structure (DecisionTree) should be
provided
> > > at
> > > these points. Based on the data from DecisionTree, obtained by
> > > calling
> > > getDecision member function, at these points decisions are made
> > > whether to
> > > execute default actions (as compiler would do without this
framework)
> > > or to
> > > follow alternative paths.
> > >
> > > Let us consider Inliner example:
> > >
> > > Original code:
> > >
> > > if (!IC) {
> > > DEBUG(dbgs() << " NOT Inlining: cost=" <<
IC.getCost()
> > > << ", thres=" << (IC.getCostDelta() +
IC.getCost())
> > > << ", Call: " << *CS.getInstruction()
<< "\n");
> > > return false;
> > > }
> > >
> > > Code after instrumentalization:
> > >
> > > bool doInlinging = (bool)IC;
> > > DecisionTreeProxy *decisionTreeProxy;
> > > decisionTreeProxy > > >
moduleDecisionTreeProxies->getCurrFnDecisionTreeProxy();
> > > int t = MaxDtPriority - abs(IC.getCostDelta());
> > > bool decision = false;
> > > if (t >= 0)
> > > decision = decisionTreeProxy->getDecision(call);
> > > if (decision)
> > > doInlinging = !doInlinging;
> > >
> > > if (!doInlinging) {
> > > DEBUG(dbgs() << " NOT Inlining: cost=" <<
IC.getCost()
> > > << ", thres=" << (IC.getCostDelta() +
IC.getCost())
> > > << ", Call: " << *CS.getInstruction()
<< "\n");
> > > return false;
> > > }
> >
> > Talking a quick look over the patch, this seems to introduce a lot
> > of boilerplate code. Can you think of a design that simplifies this?
>
> It is necessary to access decision tree at every place where you
> want to call getDecision function and priority parameter must be
> calculated which is specific to every such place. It is only a
> few lines of code but we will consider how to improve this.
> All suggestions are welcomed.
>
> >
> > >
> > > Function getDecision returns true or false. If false is returned
than
> > > this
> > > means to take default path and true means to take alternative
one.
> > >
> > > Compilation process can be represented by sequence of binary
values
> > > where
> > > every value determines decision made at some point. Value FALSE
means
> > > to
> > > follow normal compilation process as original compiler would do
> > > without
> > > presence of iterative compilation framework. TRUE value means to
take
> > > alternative path or to make opposite decision at that point. This
> > > naturally
> > > forms tree like structure which we named Decision Tree.
> > > Traversing the tree from the root to the leaf of the tree
provides
> > > all the
> > > decisions made during one iteration in compilation process.
> > >
> > > At the end of each iteration, quality of generated code is
estimated
> > > by
> > > executing newly introduced target dependent pass. Based on
results
> > > path for
> > > the following iteration is calculated. At the moment, this has
been
> > > proved
> > > for MIPS only and it is based on code size.
> >
> > I think that, in general, we ought to be able to get an overall
> > latency estimate from the instruction scheduler.
>
> Criteria to optimize for speed should be added as well. Do you
> think that latency info form scheduler can be used to estimate
> execution speed of generated code? Can we guess somehow the execution
> frequency of basic blocks?
>
> >
> > >
> > > Initially iterative compilation framework was implemented in
CodeGen
> > > phase
> > > but large number of code transformations was already performed in
> > > earlier
> > > phases. This reduces possibilities to improve quality of
generated
> > > code.
> > > Thus iterative compilation framework is applied to Clang driver
where
> > > it can
> > > influence every single compiler phase. This relocation resulted
in
> > > some
> > > implementation issues mostly related to the fact that driver and
the
> > > compiler itself (-cc1) are located in different processes. This
is
> > > resolved
> > > by introducing DecisionTreeProxy class that implements
communication
> > > between
> > > these two processes through temporary files.
> >
> > I think that this can be solved by introducing some API for the
> > necessary data flow. There is obviously already a binding here.
>
> This probably needs to be fixed. All suggestions are welcomed.
>
> >
> > >
> > > At this point mechanism is applied to two optimizations: Inliner
and
> > > Loop
> > > Invariant Code Motion. Supported platform is MIPS32r2.
> > >
> > > Despite small number of instrumented points we got improvements
on
> > > some test
> > > files from CSiBE benchmark. Here are the promising results:
> > >
> > > jpeg-6b/jidctflt.c -10.2%
> > > OpenTCP-1.0.4/system.c -6.7%
> > > libmspack/test/chmd_md5.c -5.8%
> > > flex-2.5.31/regex.c -4.8%
> >
> > Very interesting. Out of curiosity, any slowdowns?
>
> jidctflt.c is optimized due to the fact that LICM moves some
> constant definitions out of the loop and increases the register
> pressure so some spills occur. So optimized version, in this
> case, should be even faster. But slowdowns are possible and we
> use option -Oz so we are telling the compiler that code size
> is the priority.
>
> >
> > >
> > > We plan to apply this framework to more compiler phases and to
some
> > > other
> > > target architectures. It is expected to give better results on
some
> > > non
> > > orthogonal architectures like microMIPS. Another direction is to
> > > apply
> > > machine learning based on some input program features (number of
> > > basic
> > > blocks, register pressure at some point etc.) and it can be used
for
> > > selecting compilation path for the next iteration or estimating
the
> > > best
> > > path for compiling program without using iterative approach (with
> > > only one
> > > iteration).
> > >
> > >
> > > o
> > > / \
> > > / \
> > > o \
> > > / \
> > > / \
> > > o o
> > > / \ /
> > > / \ /
> > > L1 o o
> > > / /
> > > / /
> > > L2 L3
> > >
> > > Figure 1: Decision Tree Example - Symbol 'o' represent
decision
> > > points. If
> > > getDecision function returns FALSE the left branch is taken
(default
> > > path),
> > > otherwise the right branch is taken.
> > > 'L1' compilation iteration can be represented as a string
> > > (FALSE,FALSE,FALSE), 'L2' as (FALSE,FALSE,TRUE,FALSE) and
'L3'
> > > as (TRUE, FALSE, FALSE).
> > >
> > >
> > > Clearly, we see huge potential in this direction. At the same
time,
> > > we are
> > > aware this would cause compiler to be slower which everyone is
> > > against, but
> > > this penalty can be taken when we are in need for most optimized
> > > code.
> >
> > What is the overhead of the current implementation? I suspect that,
> > in order to control the large number of decision points from large
> > regions, we will want some way to limit fine-grained decisions in
> > the tree in large regions (and only including coarse ones, like the
> > inlining of relatively large functions). In small blocks, on the
> > other hand, we might even include instruction scheduling decisions
> > (for in-order cores).
>
> In the current implementation one full compilation per
> decision is needed. Plan is that, for CodeGen pass, do once all the
> passes of CodeGen per decision (to reuse bitcode). Machine learning should
> considerably reduce the need for the large number of iterations.
>
> >
> > I'd been thinking of experimenting with some iterative scheduling
> > for my in-order PPC cores (for small blocks in loops), and
> > integrating it into something like this might be even better.
> >
> > >
> > > What does everyone think?
> > > Could this approach be considered as a valid mode for Clang/LLVM?
> > >
> > > Please use this post and the attached patch (it can be applied to
3.3
> > > release - clang should be inside llvm/tools folder) as a
reference
> > > and
> > > not the patch for submission.
> >
> > I see in the patch some (seemingly small) changes to the pass manager
> > setup. Can you describe exactly what you've done. Chandler is
> > rewriting the pass infrastructure, and I'd like to understand how
> > what you've done will or won't fit with the new setup.
>
> Pointer to decision tree proxy has to be passed to every place
> where getDecision is called and to the pass that calculates function
> fitness. So pointer to it is added in Pass and PassManagerBase classes.
> We will provide more detail explanation later.
>
> >
> > Thanks again,
> > Hal
> >
> > >
> > > Example command line:
> > > > clang -target mipsel-unknown-linux -Oz -S -fiterative-comp=3
> > > > example.c
> > >
> > > Thank you. Looking forward to hearing your comments on this.
> > >
> > >
> > >
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > >
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131219/cc9db40f/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: IterativeSchedulingDriver.cpp
Type: text/x-c++src
Size: 10249 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131219/cc9db40f/attachment.cpp>

Radovan Obradovic

2013-Dec-19 12:21 UTC

head link

[LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM

Hal,

Here is a slightly modified iterative compilation patch that reduces amount
of boilerplate code at the points where the decisions are made.

Radovan
> ________________________________________
> From: Radovan Obradovic
> Sent: Tuesday, December 17, 2013 6:33 PM
> To: Hal Finkel
> Cc: llvmdev at cs.uiuc.edu; chandlerc; Andrew Trick; Zoran Jovanovic; Petar
Jovanovic
> Subject: RE: [LLVMdev] [RFC] Iterrative compilation framework for
Clang/LLVM
> 
> Hal,
> 
> Thank you for finding interest in our work!
> Please find my answers inlined below.
> 
> >
> > ________________________________________
> > From: Hal Finkel [hfinkel at anl.gov]
> > Sent: Monday, December 16, 2013 4:26 PM
> > To: Radovan Obradovic
> > Cc: llvmdev at cs.uiuc.edu; chandlerc; Andrew Trick
> > Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for
Clang/LLVM
> >
> > Radovan,
> >
> > Thanks for posting this! I would really like to have this kind of
> > functionality available. There are a lot of things to think through
> > here; comments inline...
> >
> > ----- Original Message -----
> > > From: "Radovan Obradovic" <Radovan.Obradovic at
imgtec.com>
> > > To: llvmdev at cs.uiuc.edu
> > > Sent: Monday, December 16, 2013 11:31:21 AM
> > > Subject: [LLVMdev] [RFC] Iterrative compilation framework for
Clang/LLVM
> > >
> > >
> > >
> > >
> > > This is a first step towards to full iterative compilation
framework
> > > for
> > > Clang/LLVM. The attached patch is work in progress and the idea
of
> > > sending
> > > it to the list earlier rather than later is to open up a
discussion
> > > on
> > > iterative compilation in Clang/LLVM. All comments are welcome,
> > > especially
> > > those related to integration of the framework into Clang/LLVM.
> >
> > I'm glad you're doing this, and I hope that you'll be
willing to
> > stick though what might be a long design discussion :)
> 
> Our belief is that this approach can be beneficial, especially
> for non-orthogonal architectures (e.g. microMIPS), thus we plan
> to continue our work on this.
> 
> >
> > >
> > > Current compilers have pipeline structure in which data from one
> > > phase is
> > > passed to the following one. But these compilation phases can
> > > interfere in
> > > unpredictable way so compilers use conservative heuristics on
which
> > > they
> > > make decisions during the compilation process. Because of
> > > unpredictability
> > > of these interferences, compilers sometimes make decisions which
can
> > > degrade
> > > code quality. Iterative compilation could partially resolve these
> > > problems
> > > by allowing compiler to explore some alternative decisions,
generate
> > > code
> > > for these new decision paths and take the generated code that is
the
> > > best
> > > for a particular criteria eventually.
> > >
> > > This approach is not new in the literature, we can mention
> > > Milepost/CTuning
> > > [http://ctuning.org] among other projects. In this approach, in
> > > comparison
> > > to Milepost/CTuning, the emphasis is on much finer granularity by
> > > changing
> > > arbitrary compiler decisions, and not only those based on values
of
> > > command
> > > line options, at any point in translation process.
> > >
> > > This framework implements iterative approach in which in every
> > > iteration
> > > slightly different code is generated. At the end of compilation
> > > process the
> > > best emitted code is chosen.
> > >
> > > All points at which relevant decisions are made should be
identified.
> > > Access to framework data structure (DecisionTree) should be
provided
> > > at
> > > these points. Based on the data from DecisionTree, obtained by
> > > calling
> > > getDecision member function, at these points decisions are made
> > > whether to
> > > execute default actions (as compiler would do without this
framework)
> > > or to
> > > follow alternative paths.
> > >
> > > Let us consider Inliner example:
> > >
> > > Original code:
> > >
> > > if (!IC) {
> > > DEBUG(dbgs() << " NOT Inlining: cost=" <<
IC.getCost()
> > > << ", thres=" << (IC.getCostDelta() +
IC.getCost())
> > > << ", Call: " << *CS.getInstruction()
<< "\n");
> > > return false;
> > > }
> > >
> > > Code after instrumentalization:
> > >
> > > bool doInlinging = (bool)IC;
> > > DecisionTreeProxy *decisionTreeProxy;
> > > decisionTreeProxy > > >
moduleDecisionTreeProxies->getCurrFnDecisionTreeProxy();
> > > int t = MaxDtPriority - abs(IC.getCostDelta());
> > > bool decision = false;
> > > if (t >= 0)
> > > decision = decisionTreeProxy->getDecision(call);
> > > if (decision)
> > > doInlinging = !doInlinging;
> > >
> > > if (!doInlinging) {
> > > DEBUG(dbgs() << " NOT Inlining: cost=" <<
IC.getCost()
> > > << ", thres=" << (IC.getCostDelta() +
IC.getCost())
> > > << ", Call: " << *CS.getInstruction()
<< "\n");
> > > return false;
> > > }
> >
> > Talking a quick look over the patch, this seems to introduce a lot
> > of boilerplate code. Can you think of a design that simplifies this?
> 
> It is necessary to access decision tree at every place where you
> want to call getDecision function and priority parameter must be
> calculated which is specific to every such place. It is only a
> few lines of code but we will consider how to improve this.
> All suggestions are welcomed.
> 
> >
> > >
> > > Function getDecision returns true or false. If false is returned
than
> > > this
> > > means to take default path and true means to take alternative
one.
> > >
> > > Compilation process can be represented by sequence of binary
values
> > > where
> > > every value determines decision made at some point. Value FALSE
means
> > > to
> > > follow normal compilation process as original compiler would do
> > > without
> > > presence of iterative compilation framework. TRUE value means to
take
> > > alternative path or to make opposite decision at that point. This
> > > naturally
> > > forms tree like structure which we named Decision Tree.
> > > Traversing the tree from the root to the leaf of the tree
provides
> > > all the
> > > decisions made during one iteration in compilation process.
> > >
> > > At the end of each iteration, quality of generated code is
estimated
> > > by
> > > executing newly introduced target dependent pass. Based on
results
> > > path for
> > > the following iteration is calculated. At the moment, this has
been
> > > proved
> > > for MIPS only and it is based on code size.
> >
> > I think that, in general, we ought to be able to get an overall
> > latency estimate from the instruction scheduler.
> 
> Criteria to optimize for speed should be added as well. Do you
> think that latency info form scheduler can be used to estimate
> execution speed of generated code? Can we guess somehow the execution
> frequency of basic blocks?
> 
> >
> > >
> > > Initially iterative compilation framework was implemented in
CodeGen
> > > phase
> > > but large number of code transformations was already performed in
> > > earlier
> > > phases. This reduces possibilities to improve quality of
generated
> > > code.
> > > Thus iterative compilation framework is applied to Clang driver
where
> > > it can
> > > influence every single compiler phase. This relocation resulted
in
> > > some
> > > implementation issues mostly related to the fact that driver and
the
> > > compiler itself (-cc1) are located in different processes. This
is
> > > resolved
> > > by introducing DecisionTreeProxy class that implements
communication
> > > between
> > > these two processes through temporary files.
> >
> > I think that this can be solved by introducing some API for the
> > necessary data flow. There is obviously already a binding here.
> 
> This probably needs to be fixed. All suggestions are welcomed.
> 
> >
> > >
> > > At this point mechanism is applied to two optimizations: Inliner
and
> > > Loop
> > > Invariant Code Motion. Supported platform is MIPS32r2.
> > >
> > > Despite small number of instrumented points we got improvements
on
> > > some test
> > > files from CSiBE benchmark. Here are the promising results:
> > >
> > > jpeg-6b/jidctflt.c -10.2%
> > > OpenTCP-1.0.4/system.c -6.7%
> > > libmspack/test/chmd_md5.c -5.8%
> > > flex-2.5.31/regex.c -4.8%
> >
> > Very interesting. Out of curiosity, any slowdowns?
> 
> jidctflt.c is optimized due to the fact that LICM moves some
> constant definitions out of the loop and increases the register
> pressure so some spills occur. So optimized version, in this
> case, should be even faster. But slowdowns are possible and we
> use option -Oz so we are telling the compiler that code size
> is the priority.
> 
> >
> > >
> > > We plan to apply this framework to more compiler phases and to
some
> > > other
> > > target architectures. It is expected to give better results on
some
> > > non
> > > orthogonal architectures like microMIPS. Another direction is to
> > > apply
> > > machine learning based on some input program features (number of
> > > basic
> > > blocks, register pressure at some point etc.) and it can be used
for
> > > selecting compilation path for the next iteration or estimating
the
> > > best
> > > path for compiling program without using iterative approach (with
> > > only one
> > > iteration).
> > >
> > >
> > > o
> > > / \
> > > / \
> > > o \
> > > / \
> > > / \
> > > o o
> > > / \ /
> > > / \ /
> > > L1 o o
> > > / /
> > > / /
> > > L2 L3
> > >
> > > Figure 1: Decision Tree Example - Symbol 'o' represent
decision
> > > points. If
> > > getDecision function returns FALSE the left branch is taken
(default
> > > path),
> > > otherwise the right branch is taken.
> > > 'L1' compilation iteration can be represented as a string
> > > (FALSE,FALSE,FALSE), 'L2' as (FALSE,FALSE,TRUE,FALSE) and
'L3'
> > > as (TRUE, FALSE, FALSE).
> > >
> > >
> > > Clearly, we see huge potential in this direction. At the same
time,
> > > we are
> > > aware this would cause compiler to be slower which everyone is
> > > against, but
> > > this penalty can be taken when we are in need for most optimized
> > > code.
> >
> > What is the overhead of the current implementation? I suspect that,
> > in order to control the large number of decision points from large
> > regions, we will want some way to limit fine-grained decisions in
> > the tree in large regions (and only including coarse ones, like the
> > inlining of relatively large functions). In small blocks, on the
> > other hand, we might even include instruction scheduling decisions
> > (for in-order cores).
> 
> In the current implementation one full compilation per
> decision is needed. Plan is that, for CodeGen pass, do once all the
> passes of CodeGen per decision (to reuse bitcode). Machine learning should
> considerably reduce the need for the large number of iterations.
> 
> >
> > I'd been thinking of experimenting with some iterative scheduling
> > for my in-order PPC cores (for small blocks in loops), and
> > integrating it into something like this might be even better.
> >
> > >
> > > What does everyone think?
> > > Could this approach be considered as a valid mode for Clang/LLVM?
> > >
> > > Please use this post and the attached patch (it can be applied to
3.3
> > > release - clang should be inside llvm/tools folder) as a
reference
> > > and
> > > not the patch for submission.
> >
> > I see in the patch some (seemingly small) changes to the pass manager
> > setup. Can you describe exactly what you've done. Chandler is
> > rewriting the pass infrastructure, and I'd like to understand how
> > what you've done will or won't fit with the new setup.
> 
> Pointer to decision tree proxy has to be passed to every place
> where getDecision is called and to the pass that calculates function
> fitness. So pointer to it is added in Pass and PassManagerBase classes.
> We will provide more detail explanation later.
> 
> >
> > Thanks again,
> > Hal
> >
> > >
> > > Example command line:
> > > > clang -target mipsel-unknown-linux -Oz -S -fiterative-comp=3
> > > > example.c
> > >
> > > Thank you. Looking forward to hearing your comments on this.
> > >
> > >
> > >
> > > _______________________________________________
> > > LLVM Developers mailing list
> > > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > >
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory-------------- next part --------------
A non-text attachment was scrubbed...
Name: ic.patch
Type: text/x-patch
Size: 58658 bytes
Desc: ic.patch
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131219/0a1278aa/attachment.bin>

Radovan Obradovic

2013-Dec-19 12:32 UTC

head link

[LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM

Hello Hongbin,

Thank you for your suggestions. It sounds like something we were looking for
in order to better integrate iterative compilation framework into llvm.
We plan to port this implementation to llvm 3.4 release and then we will try to
follow your recommendations.

Radovan
> From: Hongbin Zheng [etherzhhb at gmail.com]
> Sent: Wednesday, December 18, 2013 7:45 PM
> To: Radovan Obradovic
> Cc: Hal Finkel; Zoran Jovanovic; Petar Jovanovic; llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for
Clang/LLVM
>
> Hi Radovan,
> I am also interested in the iterative compilation in LLVM, and had
implemented
> a simple iterative compilation driver. I guess you do not need to embedded
the
> pointer to ModuleDecisionTreeProxies into the core classes of llvm, i.e.
the
> PassManager class and the Function class. Instead, you could:
> 1. Implement a special pass manager that runs a set of passes iteratively.
>    Implementing the iterative compilation framework based on pass manage
>    allows the framework to live outside of clang, and we can use it in opt,
>    llc, etc.
> 2. Implement an immutable pass to store the global information
>    (i.e. ModuleDecisionTreeProxies if I do not misunderstand your patch)
>    across iterations. In your example, once you implemented
>    ModuleDecisionTreeProxies as an immutable pass, you can get
>    the ModuleDecisionTreeProxies in every pass you need it by simply
>    call "getAnalysis<ModuleDecisionTreeProxies>()"
> Thanks
> Hongbin
> PS: I attached our iterative driver, you could have a look if you are
interested.
>
>
> On Wed, Dec 18, 2013 at 10:33 AM, Radovan Obradovic <Radovan.Obradovic
at imgtec.com> wrote:
>
>     Hal,
>
>     Thank you for finding interest in our work!
>     Please find my answers inlined below.
>
>     >
>     > ________________________________________
>     > From: Hal Finkel [hfinkel at anl.gov]
>     > Sent: Monday, December 16, 2013 4:26 PM
>     > To: Radovan Obradovic
>     > Cc: llvmdev at cs.uiuc.edu; chandlerc; Andrew Trick
>     > Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for
Clang/LLVM
>     >
>     > Radovan,
>     >
>     > Thanks for posting this! I would really like to have this kind of
>     > functionality available. There are a lot of things to think
through
>     > here; comments inline...
>     >
>     > ----- Original Message -----
>     > > From: "Radovan Obradovic" <Radovan.Obradovic at
imgtec.com>
>     > > To: llvmdev at cs.uiuc.edu
>     > > Sent: Monday, December 16, 2013 11:31:21 AM
>     > > Subject: [LLVMdev] [RFC] Iterrative compilation framework for
Clang/LLVM
>     > >
>     > >
>     > >
>     > >
>     > > This is a first step towards to full iterative compilation
framework
>     > > for
>     > > Clang/LLVM. The attached patch is work in progress and the
idea of
>     > > sending
>     > > it to the list earlier rather than later is to open up a
discussion
>     > > on
>     > > iterative compilation in Clang/LLVM. All comments are
welcome,
>     > > especially
>     > > those related to integration of the framework into
Clang/LLVM.
>     >
>     > I'm glad you're doing this, and I hope that you'll be
willing to
>     > stick though what might be a long design discussion :)
>
>     Our belief is that this approach can be beneficial, especially
>     for non-orthogonal architectures (e.g. microMIPS), thus we plan
>     to continue our work on this.
>
>     >
>     > >
>     > > Current compilers have pipeline structure in which data from
one
>     > > phase is
>     > > passed to the following one. But these compilation phases can
>     > > interfere in
>     > > unpredictable way so compilers use conservative heuristics on
which
>     > > they
>     > > make decisions during the compilation process. Because of
>     > > unpredictability
>     > > of these interferences, compilers sometimes make decisions
which can
>     > > degrade
>     > > code quality. Iterative compilation could partially resolve
these
>     > > problems
>     > > by allowing compiler to explore some alternative decisions,
generate
>     > > code
>     > > for these new decision paths and take the generated code that
is the
>     > > best
>     > > for a particular criteria eventually.
>     > >
>     > > This approach is not new in the literature, we can mention
>     > > Milepost/CTuning
>     > > [http://ctuning.org] among other projects. In this approach,
in
>     > > comparison
>     > > to Milepost/CTuning, the emphasis is on much finer
granularity by
>     > > changing
>     > > arbitrary compiler decisions, and not only those based on
values of
>     > > command
>     > > line options, at any point in translation process.
>     > >
>     > > This framework implements iterative approach in which in
every
>     > > iteration
>     > > slightly different code is generated. At the end of
compilation
>     > > process the
>     > > best emitted code is chosen.
>     > >
>     > > All points at which relevant decisions are made should be
identified.
>     > > Access to framework data structure (DecisionTree) should be
provided
>     > > at
>     > > these points. Based on the data from DecisionTree, obtained
by
>     > > calling
>     > > getDecision member function, at these points decisions are
made
>     > > whether to
>     > > execute default actions (as compiler would do without this
framework)
>     > > or to
>     > > follow alternative paths.
>     > >
>     > > Let us consider Inliner example:
>     > >
>     > > Original code:
>     > >
>     > > if (!IC) {
>     > > DEBUG(dbgs() << " NOT Inlining: cost="
<< IC.getCost()
>     > > << ", thres=" << (IC.getCostDelta() +
IC.getCost())
>     > > << ", Call: " << *CS.getInstruction()
<< "\n");
>     > > return false;
>     > > }
>     > >
>     > > Code after instrumentalization:
>     > >
>     > > bool doInlinging = (bool)IC;
>     > > DecisionTreeProxy *decisionTreeProxy;
>     > > decisionTreeProxy >     > >
moduleDecisionTreeProxies->getCurrFnDecisionTreeProxy();
>     > > int t = MaxDtPriority - abs(IC.getCostDelta());
>     > > bool decision = false;
>     > > if (t >= 0)
>     > > decision = decisionTreeProxy->getDecision(call);
>     > > if (decision)
>     > > doInlinging = !doInlinging;
>     > >
>     > > if (!doInlinging) {
>     > > DEBUG(dbgs() << " NOT Inlining: cost="
<< IC.getCost()
>     > > << ", thres=" << (IC.getCostDelta() +
IC.getCost())
>     > > << ", Call: " << *CS.getInstruction()
<< "\n");
>     > > return false;
>     > > }
>     >
>     > Talking a quick look over the patch, this seems to introduce a lot
>     > of boilerplate code. Can you think of a design that simplifies
this?
>
>     It is necessary to access decision tree at every place where you
>     want to call getDecision function and priority parameter must be
>     calculated which is specific to every such place. It is only a
>     few lines of code but we will consider how to improve this.
>     All suggestions are welcomed.
>
>     >
>     > >
>     > > Function getDecision returns true or false. If false is
returned than
>     > > this
>     > > means to take default path and true means to take alternative
one.
>     > >
>     > > Compilation process can be represented by sequence of binary
values
>     > > where
>     > > every value determines decision made at some point. Value
FALSE means
>     > > to
>     > > follow normal compilation process as original compiler would
do
>     > > without
>     > > presence of iterative compilation framework. TRUE value means
to take
>     > > alternative path or to make opposite decision at that point.
This
>     > > naturally
>     > > forms tree like structure which we named Decision Tree.
>     > > Traversing the tree from the root to the leaf of the tree
provides
>     > > all the
>     > > decisions made during one iteration in compilation process.
>     > >
>     > > At the end of each iteration, quality of generated code is
estimated
>     > > by
>     > > executing newly introduced target dependent pass. Based on
results
>     > > path for
>     > > the following iteration is calculated. At the moment, this
has been
>     > > proved
>     > > for MIPS only and it is based on code size.
>     >
>     > I think that, in general, we ought to be able to get an overall
>     > latency estimate from the instruction scheduler.
>
>     Criteria to optimize for speed should be added as well. Do you
>     think that latency info form scheduler can be used to estimate
>     execution speed of generated code? Can we guess somehow the execution
>     frequency of basic blocks?
>
>     >
>     > >
>     > > Initially iterative compilation framework was implemented in
CodeGen
>     > > phase
>     > > but large number of code transformations was already
performed in
>     > > earlier
>     > > phases. This reduces possibilities to improve quality of
generated
>     > > code.
>     > > Thus iterative compilation framework is applied to Clang
driver where
>     > > it can
>     > > influence every single compiler phase. This relocation
resulted in
>     > > some
>     > > implementation issues mostly related to the fact that driver
and the
>     > > compiler itself (-cc1) are located in different processes.
This is
>     > > resolved
>     > > by introducing DecisionTreeProxy class that implements
communication
>     > > between
>     > > these two processes through temporary files.
>     >
>     > I think that this can be solved by introducing some API for the
>     > necessary data flow. There is obviously already a binding here.
>
>     This probably needs to be fixed. All suggestions are welcomed.
>
>     >
>     > >
>     > > At this point mechanism is applied to two optimizations:
Inliner and
>     > > Loop
>     > > Invariant Code Motion. Supported platform is MIPS32r2.
>     > >
>     > > Despite small number of instrumented points we got
improvements on
>     > > some test
>     > > files from CSiBE benchmark. Here are the promising results:
>     > >
>     > > jpeg-6b/jidctflt.c -10.2%
>     > > OpenTCP-1.0.4/system.c -6.7%
>     > > libmspack/test/chmd_md5.c -5.8%
>     > > flex-2.5.31/regex.c -4.8%
>     >
>     > Very interesting. Out of curiosity, any slowdowns?
>
>     jidctflt.c is optimized due to the fact that LICM moves some
>     constant definitions out of the loop and increases the register
>     pressure so some spills occur. So optimized version, in this
>     case, should be even faster. But slowdowns are possible and we
>     use option -Oz so we are telling the compiler that code size
>     is the priority.
>
>     >
>     > >
>     > > We plan to apply this framework to more compiler phases and
to some
>     > > other
>     > > target architectures. It is expected to give better results
on some
>     > > non
>     > > orthogonal architectures like microMIPS. Another direction is
to
>     > > apply
>     > > machine learning based on some input program features (number
of
>     > > basic
>     > > blocks, register pressure at some point etc.) and it can be
used for
>     > > selecting compilation path for the next iteration or
estimating the
>     > > best
>     > > path for compiling program without using iterative approach
(with
>     > > only one
>     > > iteration).
>     > >
>     > >
>     > > o
>     > > / \
>     > > / \
>     > > o \
>     > > / \
>     > > / \
>     > > o o
>     > > / \ /
>     > > / \ /
>     > > L1 o o
>     > > / /
>     > > / /
>     > > L2 L3
>     > >
>     > > Figure 1: Decision Tree Example - Symbol 'o'
represent decision
>     > > points. If
>     > > getDecision function returns FALSE the left branch is taken
(default
>     > > path),
>     > > otherwise the right branch is taken.
>     > > 'L1' compilation iteration can be represented as a
string
>     > > (FALSE,FALSE,FALSE), 'L2' as (FALSE,FALSE,TRUE,FALSE)
and 'L3'
>     > > as (TRUE, FALSE, FALSE).
>     > >
>     > >
>     > > Clearly, we see huge potential in this direction. At the same
time,
>     > > we are
>     > > aware this would cause compiler to be slower which everyone
is
>     > > against, but
>     > > this penalty can be taken when we are in need for most
optimized
>     > > code.
>     >
>     > What is the overhead of the current implementation? I suspect
that,
>     > in order to control the large number of decision points from large
>     > regions, we will want some way to limit fine-grained decisions in
>     > the tree in large regions (and only including coarse ones, like
the
>     > inlining of relatively large functions). In small blocks, on the
>     > other hand, we might even include instruction scheduling decisions
>     > (for in-order cores).
>
>     In the current implementation one full compilation per
>     decision is needed. Plan is that, for CodeGen pass, do once all the
>     passes of CodeGen per decision (to reuse bitcode). Machine learning
should
>     considerably reduce the need for the large number of iterations.
>
>     >
>     > I'd been thinking of experimenting with some iterative
scheduling
>     > for my in-order PPC cores (for small blocks in loops), and
>     > integrating it into something like this might be even better.
>     >
>     > >
>     > > What does everyone think?
>     > > Could this approach be considered as a valid mode for
Clang/LLVM?
>     > >
>     > > Please use this post and the attached patch (it can be
applied to 3.3
>     > > release - clang should be inside llvm/tools folder) as a
reference
>     > > and
>     > > not the patch for submission.
>     >
>     > I see in the patch some (seemingly small) changes to the pass
manager
>     > setup. Can you describe exactly what you've done. Chandler is
>     > rewriting the pass infrastructure, and I'd like to understand
how
>     > what you've done will or won't fit with the new setup.
>
>     Pointer to decision tree proxy has to be passed to every place
>     where getDecision is called and to the pass that calculates function
>     fitness. So pointer to it is added in Pass and PassManagerBase classes.
>     We will provide more detail explanation later.
>
>     >
>     > Thanks again,
>     > Hal
>     >
>     > >
>     > > Example command line:
>     > > > clang -target mipsel-unknown-linux -Oz -S
-fiterative-comp=3
>     > > > example.c
>     > >
>     > > Thank you. Looking forward to hearing your comments on this.
>     > >
>     > >
>     > >
>     > > _______________________________________________
>     > > LLVM Developers mailing list
>     > > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>     > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>     > >
>     >
>     > --
>     > Hal Finkel
>     > Assistant Computational Scientist
>     > Leadership Computing Facility
>     > Argonne National Laboratory
>
>
>     _______________________________________________
>     LLVM Developers mailing list
>     LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>     http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131219/29b689ab/attachment.html>

Hal Finkel

2014-Feb-16 00:24 UTC

head link

[LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM

----- Original Message -----> From: "Radovan Obradovic" <Radovan.Obradovic at imgtec.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvmdev at cs.uiuc.edu, "chandlerc" <chandlerc at
google.com>, "Andrew Trick" <atrick at apple.com>,
"Zoran Jovanovic"
> <Zoran.Jovanovic at imgtec.com>, "Petar Jovanovic"
<Petar.Jovanovic at imgtec.com>, "Rich Fuhler"
> <Rich.Fuhler at imgtec.com>, clattner at apple.com
> Sent: Thursday, December 19, 2013 6:21:16 AM
> Subject: RE: [LLVMdev] [RFC] Iterrative compilation framework for
Clang/LLVM
> 
> Hal,
> 
> Here is a slightly modified iterative compilation patch that reduces
> amount
> of boilerplate code at the points where the decisions are made.
Thanks! That looks more manageable. There is going to be *a lot* of work
involved in getting something like this integrated; but if you're up for
that, as a next step, please rebase the patch and upload it to phabricator (see
http://llvm.org/docs/Phabricator.html). That will make it easier to review in
detail.

 -Hal
> 
> Radovan
> 
> > ________________________________________
> > From: Radovan Obradovic
> > Sent: Tuesday, December 17, 2013 6:33 PM
> > To: Hal Finkel
> > Cc: llvmdev at cs.uiuc.edu; chandlerc; Andrew Trick; Zoran Jovanovic;
> > Petar Jovanovic
> > Subject: RE: [LLVMdev] [RFC] Iterrative compilation framework for
> > Clang/LLVM
> > 
> > Hal,
> > 
> > Thank you for finding interest in our work!
> > Please find my answers inlined below.
> > 
> > >
> > > ________________________________________
> > > From: Hal Finkel [hfinkel at anl.gov]
> > > Sent: Monday, December 16, 2013 4:26 PM
> > > To: Radovan Obradovic
> > > Cc: llvmdev at cs.uiuc.edu; chandlerc; Andrew Trick
> > > Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for
> > > Clang/LLVM
> > >
> > > Radovan,
> > >
> > > Thanks for posting this! I would really like to have this kind of
> > > functionality available. There are a lot of things to think
> > > through
> > > here; comments inline...
> > >
> > > ----- Original Message -----
> > > > From: "Radovan Obradovic" <Radovan.Obradovic at
imgtec.com>
> > > > To: llvmdev at cs.uiuc.edu
> > > > Sent: Monday, December 16, 2013 11:31:21 AM
> > > > Subject: [LLVMdev] [RFC] Iterrative compilation framework
for
> > > > Clang/LLVM
> > > >
> > > >
> > > >
> > > >
> > > > This is a first step towards to full iterative compilation
> > > > framework
> > > > for
> > > > Clang/LLVM. The attached patch is work in progress and the
idea
> > > > of
> > > > sending
> > > > it to the list earlier rather than later is to open up a
> > > > discussion
> > > > on
> > > > iterative compilation in Clang/LLVM. All comments are
welcome,
> > > > especially
> > > > those related to integration of the framework into
Clang/LLVM.
> > >
> > > I'm glad you're doing this, and I hope that you'll be
willing to
> > > stick though what might be a long design discussion :)
> > 
> > Our belief is that this approach can be beneficial, especially
> > for non-orthogonal architectures (e.g. microMIPS), thus we plan
> > to continue our work on this.
> > 
> > >
> > > >
> > > > Current compilers have pipeline structure in which data from
> > > > one
> > > > phase is
> > > > passed to the following one. But these compilation phases
can
> > > > interfere in
> > > > unpredictable way so compilers use conservative heuristics
on
> > > > which
> > > > they
> > > > make decisions during the compilation process. Because of
> > > > unpredictability
> > > > of these interferences, compilers sometimes make decisions
> > > > which can
> > > > degrade
> > > > code quality. Iterative compilation could partially resolve
> > > > these
> > > > problems
> > > > by allowing compiler to explore some alternative decisions,
> > > > generate
> > > > code
> > > > for these new decision paths and take the generated code
that
> > > > is the
> > > > best
> > > > for a particular criteria eventually.
> > > >
> > > > This approach is not new in the literature, we can mention
> > > > Milepost/CTuning
> > > > [http://ctuning.org] among other projects. In this approach,
in
> > > > comparison
> > > > to Milepost/CTuning, the emphasis is on much finer
granularity
> > > > by
> > > > changing
> > > > arbitrary compiler decisions, and not only those based on
> > > > values of
> > > > command
> > > > line options, at any point in translation process.
> > > >
> > > > This framework implements iterative approach in which in
every
> > > > iteration
> > > > slightly different code is generated. At the end of
compilation
> > > > process the
> > > > best emitted code is chosen.
> > > >
> > > > All points at which relevant decisions are made should be
> > > > identified.
> > > > Access to framework data structure (DecisionTree) should be
> > > > provided
> > > > at
> > > > these points. Based on the data from DecisionTree, obtained
by
> > > > calling
> > > > getDecision member function, at these points decisions are
made
> > > > whether to
> > > > execute default actions (as compiler would do without this
> > > > framework)
> > > > or to
> > > > follow alternative paths.
> > > >
> > > > Let us consider Inliner example:
> > > >
> > > > Original code:
> > > >
> > > > if (!IC) {
> > > > DEBUG(dbgs() << " NOT Inlining: cost="
<< IC.getCost()
> > > > << ", thres=" << (IC.getCostDelta() +
IC.getCost())
> > > > << ", Call: " << *CS.getInstruction()
<< "\n");
> > > > return false;
> > > > }
> > > >
> > > > Code after instrumentalization:
> > > >
> > > > bool doInlinging = (bool)IC;
> > > > DecisionTreeProxy *decisionTreeProxy;
> > > > decisionTreeProxy > > > >
moduleDecisionTreeProxies->getCurrFnDecisionTreeProxy();
> > > > int t = MaxDtPriority - abs(IC.getCostDelta());
> > > > bool decision = false;
> > > > if (t >= 0)
> > > > decision = decisionTreeProxy->getDecision(call);
> > > > if (decision)
> > > > doInlinging = !doInlinging;
> > > >
> > > > if (!doInlinging) {
> > > > DEBUG(dbgs() << " NOT Inlining: cost="
<< IC.getCost()
> > > > << ", thres=" << (IC.getCostDelta() +
IC.getCost())
> > > > << ", Call: " << *CS.getInstruction()
<< "\n");
> > > > return false;
> > > > }
> > >
> > > Talking a quick look over the patch, this seems to introduce a
> > > lot
> > > of boilerplate code. Can you think of a design that simplifies
> > > this?
> > 
> > It is necessary to access decision tree at every place where you
> > want to call getDecision function and priority parameter must be
> > calculated which is specific to every such place. It is only a
> > few lines of code but we will consider how to improve this.
> > All suggestions are welcomed.
> > 
> > >
> > > >
> > > > Function getDecision returns true or false. If false is
> > > > returned than
> > > > this
> > > > means to take default path and true means to take
alternative
> > > > one.
> > > >
> > > > Compilation process can be represented by sequence of binary
> > > > values
> > > > where
> > > > every value determines decision made at some point. Value
FALSE
> > > > means
> > > > to
> > > > follow normal compilation process as original compiler would
do
> > > > without
> > > > presence of iterative compilation framework. TRUE value
means
> > > > to take
> > > > alternative path or to make opposite decision at that point.
> > > > This
> > > > naturally
> > > > forms tree like structure which we named Decision Tree.
> > > > Traversing the tree from the root to the leaf of the tree
> > > > provides
> > > > all the
> > > > decisions made during one iteration in compilation process.
> > > >
> > > > At the end of each iteration, quality of generated code is
> > > > estimated
> > > > by
> > > > executing newly introduced target dependent pass. Based on
> > > > results
> > > > path for
> > > > the following iteration is calculated. At the moment, this
has
> > > > been
> > > > proved
> > > > for MIPS only and it is based on code size.
> > >
> > > I think that, in general, we ought to be able to get an overall
> > > latency estimate from the instruction scheduler.
> > 
> > Criteria to optimize for speed should be added as well. Do you
> > think that latency info form scheduler can be used to estimate
> > execution speed of generated code? Can we guess somehow the
> > execution
> > frequency of basic blocks?
> > 
> > >
> > > >
> > > > Initially iterative compilation framework was implemented in
> > > > CodeGen
> > > > phase
> > > > but large number of code transformations was already
performed
> > > > in
> > > > earlier
> > > > phases. This reduces possibilities to improve quality of
> > > > generated
> > > > code.
> > > > Thus iterative compilation framework is applied to Clang
driver
> > > > where
> > > > it can
> > > > influence every single compiler phase. This relocation
resulted
> > > > in
> > > > some
> > > > implementation issues mostly related to the fact that driver
> > > > and the
> > > > compiler itself (-cc1) are located in different processes.
This
> > > > is
> > > > resolved
> > > > by introducing DecisionTreeProxy class that implements
> > > > communication
> > > > between
> > > > these two processes through temporary files.
> > >
> > > I think that this can be solved by introducing some API for the
> > > necessary data flow. There is obviously already a binding here.
> > 
> > This probably needs to be fixed. All suggestions are welcomed.
> > 
> > >
> > > >
> > > > At this point mechanism is applied to two optimizations:
> > > > Inliner and
> > > > Loop
> > > > Invariant Code Motion. Supported platform is MIPS32r2.
> > > >
> > > > Despite small number of instrumented points we got
improvements
> > > > on
> > > > some test
> > > > files from CSiBE benchmark. Here are the promising results:
> > > >
> > > > jpeg-6b/jidctflt.c -10.2%
> > > > OpenTCP-1.0.4/system.c -6.7%
> > > > libmspack/test/chmd_md5.c -5.8%
> > > > flex-2.5.31/regex.c -4.8%
> > >
> > > Very interesting. Out of curiosity, any slowdowns?
> > 
> > jidctflt.c is optimized due to the fact that LICM moves some
> > constant definitions out of the loop and increases the register
> > pressure so some spills occur. So optimized version, in this
> > case, should be even faster. But slowdowns are possible and we
> > use option -Oz so we are telling the compiler that code size
> > is the priority.
> > 
> > >
> > > >
> > > > We plan to apply this framework to more compiler phases and
to
> > > > some
> > > > other
> > > > target architectures. It is expected to give better results
on
> > > > some
> > > > non
> > > > orthogonal architectures like microMIPS. Another direction
is
> > > > to
> > > > apply
> > > > machine learning based on some input program features
(number
> > > > of
> > > > basic
> > > > blocks, register pressure at some point etc.) and it can be
> > > > used for
> > > > selecting compilation path for the next iteration or
estimating
> > > > the
> > > > best
> > > > path for compiling program without using iterative approach
> > > > (with
> > > > only one
> > > > iteration).
> > > >
> > > >
> > > > o
> > > > / \
> > > > / \
> > > > o \
> > > > / \
> > > > / \
> > > > o o
> > > > / \ /
> > > > / \ /
> > > > L1 o o
> > > > / /
> > > > / /
> > > > L2 L3
> > > >
> > > > Figure 1: Decision Tree Example - Symbol 'o'
represent decision
> > > > points. If
> > > > getDecision function returns FALSE the left branch is taken
> > > > (default
> > > > path),
> > > > otherwise the right branch is taken.
> > > > 'L1' compilation iteration can be represented as a
string
> > > > (FALSE,FALSE,FALSE), 'L2' as
(FALSE,FALSE,TRUE,FALSE) and 'L3'
> > > > as (TRUE, FALSE, FALSE).
> > > >
> > > >
> > > > Clearly, we see huge potential in this direction. At the
same
> > > > time,
> > > > we are
> > > > aware this would cause compiler to be slower which everyone
is
> > > > against, but
> > > > this penalty can be taken when we are in need for most
> > > > optimized
> > > > code.
> > >
> > > What is the overhead of the current implementation? I suspect
> > > that,
> > > in order to control the large number of decision points from
> > > large
> > > regions, we will want some way to limit fine-grained decisions in
> > > the tree in large regions (and only including coarse ones, like
> > > the
> > > inlining of relatively large functions). In small blocks, on the
> > > other hand, we might even include instruction scheduling
> > > decisions
> > > (for in-order cores).
> > 
> > In the current implementation one full compilation per
> > decision is needed. Plan is that, for CodeGen pass, do once all the
> > passes of CodeGen per decision (to reuse bitcode). Machine learning
> > should
> > considerably reduce the need for the large number of iterations.
> > 
> > >
> > > I'd been thinking of experimenting with some iterative
scheduling
> > > for my in-order PPC cores (for small blocks in loops), and
> > > integrating it into something like this might be even better.
> > >
> > > >
> > > > What does everyone think?
> > > > Could this approach be considered as a valid mode for
> > > > Clang/LLVM?
> > > >
> > > > Please use this post and the attached patch (it can be
applied
> > > > to 3.3
> > > > release - clang should be inside llvm/tools folder) as a
> > > > reference
> > > > and
> > > > not the patch for submission.
> > >
> > > I see in the patch some (seemingly small) changes to the pass
> > > manager
> > > setup. Can you describe exactly what you've done. Chandler is
> > > rewriting the pass infrastructure, and I'd like to understand
how
> > > what you've done will or won't fit with the new setup.
> > 
> > Pointer to decision tree proxy has to be passed to every place
> > where getDecision is called and to the pass that calculates
> > function
> > fitness. So pointer to it is added in Pass and PassManagerBase
> > classes.
> > We will provide more detail explanation later.
> > 
> > >
> > > Thanks again,
> > > Hal
> > >
> > > >
> > > > Example command line:
> > > > > clang -target mipsel-unknown-linux -Oz -S
-fiterative-comp=3
> > > > > example.c
> > > >
> > > > Thank you. Looking forward to hearing your comments on this.
> > > >
> > > >
> > > >
> > > > _______________________________________________
> > > > LLVM Developers mailing list
> > > > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

llvm dev - Dec 2013 - [LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM

[LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM

[LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM

[LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM

[LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM

[LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM