Hal Finkel via llvm-dev
2017-Sep-01 18:47 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi everyone,

As you may know, stock LLVM does not provide the kind of advanced loop transformations necessary to provide good performance on many applications. LLVM's Polly project provides many of the required capabilities, including loop transformations such as fission, fusion, skewing, blocking/tiling, and interchange, all powered by state-of-the-art dependence analysis. Polly also provides automated parallelization and targeting of GPUs and other accelerators.

Over the past year, Polly’s development has focused on robustness, correctness, and closer integration with LLVM. To highlight a few accomplishments:

- Polly now runs, by default, in the conceptually proper place in LLVM’s pass pipeline (just before the loop vectorizer). Importantly, this means that its loop transformations are performed after inlining and other canonicalization, greatly increasing its robustness and enabling its use on C++ code, where [] is often a function call before inlining (see the sketch after this list).

- Polly’s cost-modeling parameters, such as those describing the target’s memory hierarchy, are being integrated with TargetTransformInfo. This allows targets to properly override the modeling parameters and allows reuse of these parameters by other clients.

- Polly’s method of handling signed division/remainder operations, which worked around a lack of support in ScalarEvolution, is being replaced thanks to improvements being contributed to ScalarEvolution itself (see D34598). Polly’s core delinearization routines have long been a part of LLVM itself.

- PolyhedralInfo, which exposes a subset of Polly’s loop analysis for use by other clients, is now available.

- Polly is now part of the LLVM release process and is being included with LLVM by various packagers (e.g., Debian).
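To illustrate the inlining point, here is a minimal example (written here only for illustration, not taken from Polly's test suite):

  #include <cstddef>
  #include <vector>

  // Before inlining, v[i] is an opaque call to
  // std::vector<double>::operator[]; loop analyses cannot model it.
  // After inlining, it reduces to an affine access into the vector's
  // underlying buffer, which polyhedral analysis can reason about.
  double sum(const std::vector<double> &v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
      s += v[i];
    return s;
  }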
I believe that the LLVM community would benefit from beginning the process of integrating Polly with LLVM itself and continuing its development as part of our main code base. This will:

- Allow for wider adoption of LLVM within communities relying on advanced loop transformations.

- Provide for better community feedback on, and testing of, the code developed (although the story in this regard is already fairly solid).

- Better motivate targets to provide accurate, comprehensive modeling parameters for use by advanced loop transformations.

- Perhaps most importantly, allow us to develop and tune the rest of the optimizer assuming that Polly’s capabilities are present (the underlying analysis and, eventually, the transformations themselves).

The largest issue on which community consensus is required, in order to move forward at all, is what to do with isl. isl, the Integer Set Library, provides core functionality on which Polly depends. It is a C library, and while some Polly/LLVM developers are also isl developers, it has a large user community outside of LLVM/Polly. A C++ interface was recently added, and Polly is transitioning to use the C++ interface. Nevertheless, options here include rewriting the needed functionality, forking isl and transitioning our fork toward LLVM coding conventions (and data structures) over time, and incorporating isl more or less as-is to avoid partitioning its development.

That having been said, isl is internally modular, and regardless of the overall integration strategy, the Polly developers anticipate specializing, or even replacing, some of these components with LLVM-specific solutions. This is especially true for anything that touches performance-related heuristics and modeling. LLVM-specific, or even target-specific, loop schedulers may be developed as well.

Even though some developers in the LLVM community already have a background in polyhedral-modeling techniques, the Polly developers have developed, and are still developing, extensive tutorials on this topic: http://pollylabs.org/education.html and especially http://playground.pollylabs.org.

Finally, let me highlight a few ongoing development efforts in Polly that are potentially relevant to this discussion. Polly’s loop analysis is sound and technically superior to what’s in LLVM currently (i.e., in LoopAccessAnalysis and DependenceAnalysis). There are, however, two known reasons why Polly’s transformations could not yet be enabled by default:

- A correctness issue: Currently, Polly assumes that 64 bits is large enough for all new loop-induction variables and index expressions. In rare cases, transformations could be performed where more bits are required. Preconditions need to be generated to prevent this (e.g., D35471).

- A performance issue: Polly currently models temporal locality (i.e., it tries to get better reuse in time), but does not model spatial locality (i.e., it does not model cache-line reuse); see the illustration after this list. As a result, it can sometimes introduce performance regressions. Polly Labs is currently working on integrating spatial-locality modeling into the loop optimization model.
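To illustrate the spatial-locality issue (a hand-written example, not Polly output): both functions below do the same work and have identical temporal reuse (none), but differ greatly in cache-line reuse:

  // With a row-major array, A[i][j] and A[i][j+1] share a cache line.
  // The first nest walks memory contiguously (good spatial locality);
  // the second strides by N doubles per access (poor cache-line
  // reuse). A purely temporal model cannot tell these two apart.
  constexpr int N = 1024;

  void row_major(double A[N][N]) {
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < N; ++j)
        A[i][j] += 1.0;
  }

  void column_major(double A[N][N]) {
    for (int j = 0; j < N; ++j)
      for (int i = 0; i < N; ++i)
        A[i][j] += 1.0;
  }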
Polly can already split apart basic blocks in order to implement loop fusion. Heuristics to choose the granularity are still being implemented (e.g., PR12402).

I believe that we can now develop a concrete plan for moving state-of-the-art loop optimizations, based on the technology in the Polly project, into LLVM. Doing so will enable LLVM to be competitive with proprietary compilers in high-performance computing, machine learning, and other important application domains. I'd like community feedback on what should be part of that plan.

Sincerely,
Hal (on behalf of myself, Tobias Grosser, and Michael Kruse, with feedback from several other active Polly developers)

We thank the numerous people who have contributed to the Polly infrastructure: Alexandre Isoard, Andreas Simbuerger, Andy Gibbs, Annanay Agarwal, Armin Groesslinger, Ajith Pandel, Baranidharan Mohan, Benjamin Kramer, Bill Wendling, Chandler Carruth, Craig Topper, Chris Jenneisch, Christian Bielert, Daniel Dunbar, Daniel Jasper, David Blaikie, David Peixotto, Dmitry N. Mikushin, Duncan P. N. Exon Smith, Eli Friedman, Eugene Zelenko, George Burgess IV, Hans Wennborg, Hongbin Zheng, Huihui Zhang, Jakub Kuderski, Johannes Doerfert, Justin Bogner, Karthik Senthil, Logan Chien, Lawrence Hu, Mandeep Singh Grang, Matt Arsenault, Matthew Simpson, Mehdi Amini, Micah Villmow, Michael Kruse, Matthias Reisinger, Maximilian Falkenstein, Nakamura Takumi, Nandini Singhal, Nicolas Bonfante, Patrik Hägglund, Paul Robinson, Philip Pfaffe, Philipp Schaad, Peter Conn, Pratik Bhatu, Rafael Espindola, Raghesh Aloor, Reid Kleckner, Roal Jordans, Richard Membarth, Roman Gareev, Saleem Abdulrasool, Sameer Sahasrabuddhe, Sanjoy Das, Sameer AbuAsal, Sam Novak, Sebastian Pop, Siddharth Bhat, Singapuram Sanjay Srivallabh, Sumanth Gundapaneni, Sunil Srivastava, Sylvestre Ledru, Star Tan, Tanya Lattner, Tim Shen, Tarun Ranjendran, Theodoros Theodoridis, Utpal Bora, Wei Mi, Weiming Zhao, and Yabin Hu.

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Davide Italiano via llvm-dev
2017-Sep-01 19:18 UTC
[llvm-dev] [RFC] Polly Status and Integration
On Fri, Sep 1, 2017 at 11:47 AM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [Hal's original RFC quoted in full; snipped]
This, at least on paper, sounds great. I think LLVM could greatly benefit from this information for some applications. I have a couple of questions:

1) I'm aware there have been attempts in the past to make polyhedral value information available to LLVM, but they've been unsuccessful. Do you plan to develop new LLVM passes to overcome this issue?

2) As far as I can tell (last I tried, but that was a while ago), Polly had a significant compile-time impact. Do you plan to introduce a new -Opolly pipeline?

On the isl story: I think it would be better to have Polly be self-contained (with LLVM implementing the needed functionality), but I understand that's a major undertaking and it comes at a cost (i.e., LLVM will have to maintain this forked/reimplemented library forever instead of leveraging upstream work). LLVM tries to minimize the number of required dependencies, but it seems it's OK to add new external optional deps (recently, Z3), so that could be an OK path forward. I don't have a strong opinion on what's the best solution here, FWIW.

Thanks,

--
Davide

"There are no solved problems; there are only problems that are more or less solved" -- Henri Poincare
Tobias Grosser via llvm-dev
2017-Sep-01 21:49 UTC
[llvm-dev] [RFC] Polly Status and Integration
On Fri, Sep 1, 2017, at 21:18, Davide Italiano via llvm-dev wrote:
> On Fri, Sep 1, 2017 at 11:47 AM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> [Hal's original RFC quoted in full; snipped]
> This, at least on paper, sounds great. I think LLVM could greatly
> benefit from this information for some applications.
> I have a couple of questions:
> 1) I'm aware there have been attempts in the past to make polyhedral
> value information available to LLVM, but they've been unsuccessful.
> Do you plan to develop new LLVM passes to overcome this issue?

It would be great to have a polyhedral range analysis in LLVM. We have most of the code for this in Polly. Especially with the new isl C++ bindings, its code could also look very nice!
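To sketch what a client of such an analysis might see, here is a purely hypothetical interface (these names exist in neither Polly nor LLVM; they are invented here for illustration):

  // Hypothetical sketch only: a flow-sensitive, polyhedral-backed
  // range query. Unlike Scalar Evolution's flow-insensitive ranges,
  // it could account for dominating conditions at the query point.
  namespace llvm { class Value; class Instruction; }

  struct PolyRange {
    long Min, Max; // conservative bounds on the value
    bool Exact;    // true if the polyhedral model proved them tight
  };

  class PolyhedralRangeAnalysis { // invented name
  public:
    // E.g., for "if (i < n) use(i);" the range of i at the use would
    // be [0, n-1] rather than the full range of its type.
    PolyRange getRangeAt(const llvm::Value *V,
                         const llvm::Instruction *Point);
  };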
> 2) As far as I can tell (last I tried, but that was a while ago),
> Polly had a significant compile-time impact. Do you plan to introduce
> a new -Opolly pipeline?

We do not intend to enable Polly by default just yet, so a new -O or -f option seems appropriate. I suggest we leave the naming debate for later.

We have also worked on performance over the last two years:

- Introduced fast arbitrary-precision integers to back isl (roughly a 7x speedup vs. imath, 2x vs. GMP, and only about 30% slower than native integers).

- Worked with Eli Friedman on compiling all of AOSP. We do this in "-polly-process-unprofitable" mode, where we run on anything we can analyze and make sure timeouts and validity checks are all in place.

- Adjusted our core scheduler's asymptotic complexity so that it is not a function of the number of statements to schedule, but rather of the maximal number of statements that are fused: https://lirias.kuleuven.be/bitstream/123456789/586108/1/CW706.pdf

- Worked on bailing out fast (though this slightly regressed in recent months). For code that is not amenable to polyhedral optimizations, I would like to see almost zero compile-time overhead. We have a pure LLVM-IR analysis before Polly, which just scans the IR and decides whether we should look at it more intensively.

> On the isl story: I think it would be better to have Polly be
> self-contained (with LLVM implementing the needed functionality), but
> I understand that's a major undertaking and it comes at a cost (i.e.,
> LLVM will have to maintain this forked/reimplemented library forever
> instead of leveraging upstream work). LLVM tries to minimize the
> number of required dependencies, but it seems it's OK to add new
> external optional deps (recently, Z3), so that could be an OK path
> forward. I don't have a strong opinion on what's the best solution
> here, FWIW.

We currently ship a snapshot of isl with Polly. This has proven very useful, as we ship isl "trunk", which is updated quickly in case of bug reports. Having a single version of isl shipped in sync with Polly is also desirable when receiving bug reports for Polly/isl.

As Hal pointed out, I expect us to complement at least parts of isl with our own heuristics and analyses. While this is likely the way to go, I believe isl provides solid "defaults" for us to start off with and to identify where LLVM-specific solutions indeed add value.

Best,
Tobias
Adam Nemet via llvm-dev
2017-Sep-11 17:26 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi Hal, Tobias, Michael and others,

> On Sep 1, 2017, at 11:47 AM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> [Hal's original RFC quoted in full; snipped]
One thing that I’d like to see more details on is what this means for the evolution of loop transformations in LLVM.

Our more-or-less established direction so far has been to incrementally improve and generalize the required analyses (e.g., the LoopVectorizer’s dependence analysis and loop-versioning analysis were turned into a stand-alone analysis pass, LoopAccessAnalysis) and then to build new transformations (e.g., LoopDistribution, LoopLoadElimination, LICMLoopVersioning) on top of these (a sketch of what loop distribution does follows below).

The idea was that the infrastructure would be incrementally improved from two directions:

- As new transformations are built, analyses have to be improved (e.g., past improvements to LAA to support the LoopVersioning utility, future improvements for full LoopSROA beyond just store->load forwarding [1], or the improvements to LAA for the LoopFusion proposal [2]).

- As more complex loops have to be analyzed, we either improve LAA or make DependenceAnalysis a drop-in replacement for the memory-analysis part of LAA.

While this model may be slow, it has all the benefits of the incremental development model.
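For readers unfamiliar with these passes, here is a minimal hand-written before/after sketch of the kind of transformation LoopDistribution performs (illustrative only, not actual pass output):

  // Distribution splits one loop into two: the first loop carries a
  // backward dependence through a[] and stays scalar; the second,
  // now independent, can be vectorized.
  void before(double *a, double *b, const double *c, int n) {
    for (int i = 1; i < n; ++i) {
      a[i] = a[i - 1] + c[i]; // loop-carried dependence
      b[i] = c[i] * 2.0;      // independent
    }
  }

  void after(double *a, double *b, const double *c, int n) {
    for (int i = 1; i < n; ++i)
      a[i] = a[i - 1] + c[i];
    for (int i = 1; i < n; ++i) // vectorizable
      b[i] = c[i] * 2.0;
  }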
Then there is the question of use cases. It’s fairly obvious that anybody wanting to optimize a 5-deep, highly regular loop nest operating on arrays should use Polly. On the other hand, it’s far less clear that we should use it for singly or doubly nested not-so-regular loops, which are the norm in non-HPC workloads.

And this brings me to the maintenance question. Is it reasonable to expect people to fix Polly when they have a seemingly unrelated change that happens to break a Polly bot?

As far as I know, there were companies in the past that tried Polly without a whole lot of prior experience. It would be great to hear what the experience was before adopting Polly at a much larger scale.

Adam

[1] http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html
[2] http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html
Hal Finkel via llvm-dev
2017-Sep-12 05:47 UTC
[llvm-dev] [RFC] Polly Status and Integration
On 09/11/2017 12:26 PM, Adam Nemet wrote:
> Hi Hal, Tobias, Michael and others,
>
>> [Hal's original RFC quoted in full; snipped]
>
> One thing that I’d like to see more details on is what this means for
> the evolution of loop transformations in LLVM.
>
> Our more-or-less established direction so far has been to
> incrementally improve and generalize the required analyses (e.g., the
> LoopVectorizer’s dependence analysis and loop-versioning analysis were
> turned into a stand-alone analysis pass, LoopAccessAnalysis) and then
> to build new transformations (e.g., LoopDistribution,
> LoopLoadElimination, LICMLoopVersioning) on top of these.
>
> The idea was that the infrastructure would be incrementally improved
> from two directions:
>
> - As new transformations are built, analyses have to be improved
> (e.g., past improvements to LAA to support the LoopVersioning utility,
> future improvements for full LoopSROA beyond just store->load
> forwarding [1], or the improvements to LAA for the LoopFusion
> proposal [2]).
>
> - As more complex loops have to be analyzed, we either improve LAA or
> make DependenceAnalysis a drop-in replacement for the memory-analysis
> part of LAA.
>
> [1] http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html
> [2] http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html
Or we could use Polly's dependence analysis, which I believe to be more powerful, more robust, and more correct than DependenceAnalysis. I believe that the difficult part here is actually the pairing with predicated SCEV, or whatever mechanism we want to use to generate runtime predicates (this applies to the use of DependenceAnalysis too). A sketch of the kind of runtime predicate in question follows.
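This is a hand-written source-level sketch of what loop versioning with a runtime predicate amounts to (illustrative only; the compiler emits this in IR, not in source):

  #include <cstdint>

  // When the compiler cannot prove that a and b do not overlap, it
  // can emit a runtime no-overlap check and branch to a copy of the
  // loop in which it may assume independence (and, e.g., vectorize),
  // keeping the original loop as a fallback.
  void saxpy(float *a, const float *b, float s, long n) {
    std::uintptr_t A = reinterpret_cast<std::uintptr_t>(a);
    std::uintptr_t B = reinterpret_cast<std::uintptr_t>(b);
    bool NoOverlap = A + n * sizeof(float) <= B ||
                     B + n * sizeof(float) <= A;
    if (NoOverlap) {
      for (long i = 0; i < n; ++i) // compiler may vectorize freely
        a[i] += s * b[i];
    } else {
      for (long i = 0; i < n; ++i) // conservative fallback
        a[i] += s * b[i];
    }
  }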
> While this model may be slow, it has all the benefits of the
> incremental development model.

The current model may have been slow in many areas, but I think that's mostly a question of development effort. My largest concern about the current model is that, to the extent that we're implementing classic loop transformations (e.g., fusion, distribution, interchange, skewing, tiling, and so on), we're repeating a historical design that is known to have several suboptimal properties. Chief among them is the lack of integration: many of these transformations are interconnected, and there's no good pass ordering in which to make independent decisions. Many of these transformations can be captured in a single model, and we can get much better results by integrating them. There's also the matter of whether building these transformations on SCEV (or IR directly) is the best underlying infrastructure, or whether parts of Polly would be better.

That having been said, I think that integrating this technology into LLVM will also mean applying appropriate modularity. I think that we'll almost definitely want to make use of the dependence analysis separately as an analysis. We'll want to decide which of these transformations are considered canonicalization (and run in the iterative pipeline) and which are lowering (and run near the vectorizer). LoopSROA certainly sounds to me like canonicalization, but loop fusion might also fall into that category (i.e., we might want to fuse early to enable optimizations and then split late), as the example below illustrates.
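A small hand-written illustration of fusion as canonicalization (my own example, not pass output): fusing brings a producer and its only consumer into one iteration, after which the temporary array can be scalarized away.

  // Before: two loops communicating through a temporary array t[].
  void unfused(double *out, const double *in, double *t, int n) {
    for (int i = 0; i < n; ++i)
      t[i] = in[i] * 2.0;
    for (int i = 0; i < n; ++i)
      out[i] = t[i] + 1.0;
  }

  // After fusion, the store to t[i] and its only load meet in the
  // same iteration; store-to-load forwarding then eliminates t[]
  // entirely, which neither loop exposes in isolation.
  void fused(double *out, const double *in, int n) {
    for (int i = 0; i < n; ++i) {
      double t = in[i] * 2.0;
      out[i] = t + 1.0;
    }
  }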
> Then there is the question of use cases. It’s fairly obvious that
> anybody wanting to optimize a 5-deep, highly regular loop nest
> operating on arrays should use Polly. On the other hand, it’s far
> less clear that we should use it for singly or doubly nested
> not-so-regular loops, which are the norm in non-HPC workloads.

This is clearly a good question, but thinking about Polly as a set of components, not as a monolithic transformation component, I think that polyhedral analysis and transformations can underlie a lot of the transformations we need for non-HPC code (and, I'll point out, that we need for modern HPC code too). In practice, the loops that we can actually analyze have affine dependencies, and Polly does, or can do, a better job at generating runtime predicates and dealing with piecewise-linear expressions than our current infrastructure.

In short, I look at Polly as two things: First, an infrastructure for dealing with loop analysis and transformation. I view this as being broadly applicable. Second, an application of that infrastructure to apply cost-model-driven classic loop transformations. To some extent this is going to be more useful for HPC codes, but it also applies to machine learning, signal processing, graphics, and other areas.

> And this brings me to the maintenance question. Is it reasonable to
> expect people to fix Polly when they have a seemingly unrelated change
> that happens to break a Polly bot?

The eventual goal here is to have this technology in appropriate parts of the main pipeline, and so the question here is not really about breaking a "Polly bot", but just about a "bot" in general. I've given this question some thought and I think it sits in a reasonable place in the risk-reward space. The answer would be, yes, we'd need to treat this like any other part of the pipeline. However, I believe that Polly has as many, or more, active contributors than essentially any other individual part of the mid-level optimizer or CodeGen. As a result, there will be people around in many time zones to help with problems with Polly-related code.

> As far as I know, there were companies in the past that tried Polly
> without a whole lot of prior experience. It would be great to hear
> what the experience was before adopting Polly at a much larger scale.

I'm also interested, although I'll caution against over-interpreting any evidence here (positive or negative). Before a few weeks ago, Polly didn't effectively run in the pipeline after inlining, and so I doubt it would have been much use outside of embedded environments (and maybe some HPC environments) with straightforwardly presented C code. It's only now that this has been fixed that I find the possibility of integrating this in production interesting.

Thanks again,
Hal

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Johannes Doerfert via llvm-dev
2017-Sep-20 00:33 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi Hal, Tobias, Michael, and others,

I'd like to add my view (and a proposal) to this discussion, and I apologize directly for doing this so late.* I also want to apologize because this email is long and contains various technical details and argumentations that might need more justification. However, I am happy to provide further information (and/or examples) to explain my views if required.

tldr; We should introduce polyhedral analysis and optimization capabilities into LLVM step by step. Along the way we should revisit design decisions made by Polly and adjust them according to the use cases in the rest of LLVM. Finally, we should keep Polly as a stand-alone tool to allow research into, and prototyping of, complex loop transformations.

* There was a paper deadline at the end of last week and I simply had to prioritize.

-------------------------------------------------------------------------------

LLVM performs "simple" loop transformations and targets almost exclusively innermost loops. Polly offers a pipeline to perform "complex" polyhedral-based loop optimizations on a sequence of loop nests. It seems natural to integrate Polly into the LLVM pipeline, as it apparently complements it well. However, I do not believe that it is wise to integrate Polly into LLVM, for a variety of reasons, some of which I will explain in more detail now. Afterwards I will propose a different way to achieve (most of) the things Hal mentioned in his mail in an incremental way. Nevertheless, I want to stress that I am still an advocate of Polly and I honestly believe it to be a valuable part of the LLVM environment. Though I strongly feel it should be continued as a research tool and test bed to allow research into, and prototyping of, (polyhedral-model-based) analyses and optimizations.

---

Polly is a deep pipeline of passes that were developed for the sole purpose of applying scheduling transformations at the very end. This "main purpose" of Polly makes it hard to effectively reuse intermediate parts, e.g., the dependence analysis or the code-generation framework, at least to the degree we might want to. In a different thread [1] Hal asked about the possibility of reusing Polly for tuning via pragmas, e.g., unrolling/interleaving/tiling/interchange. While this is interesting and generally possible, it won't take long before the restrictions Polly places on the input code (to enable scheduling optimizations in the end) become a serious problem. The point here is that Polly has too many features and dependences for such an "easy" use case. (Note that I do not question the legality of a user-requested transformation here.)

If you are not looking for "end-to-end polyhedral loop optimizations", you want a system that answers a very specific question (e.g., isVectorizable) or performs a very specific transformation (e.g., tiling) on a maximal set of programs. In contrast, Polly is (and I strongly believe it should be) developed as a system that performs complex optimizations, based on specialized analysis results, on a constrained class of programs.

One might argue that Polly will become more modular when it is integrated into LLVM and when the intermediate results are used in more and more places. However, I'd say this is the "wrong direction" with regard to: 1) incremental design and development, 2) maintainability of individual parts, and 3) the development of a suitable cost model (which Polly does not ship).
Instead of starting with a full, hard-to-understand scheduling pipeline, we should start with the use cases at hand and design specific analysis (and also transformation) passes "from scratch". The long-term goal should be a full scheduling pipeline, but the parts need to be designed in a modular (LLVM) way from the very beginning. This allows us to:
- provide __immediate benefit__ to other developers,
- allow active participation in (and understanding of) the design of each part, and
- develop the intermediate parts with the requirements of the whole LLVM project in mind.

Let me give two examples to make my point:

-- Example 1 --
In addition to the applicability issues, there are other problems that arise from the full-pipeline design. Let's assume we want to apply a pragma-driven optimization to a loop that can be represented (and optimized) by Polly. Depending on the pragma, we only need to perform a very specific task, e.g., adjust the iteration variable (loop inversion) or introduce new loops and adjust existing iteration variables (tiling). Polly will, however, represent (and actually analyze) the whole loop nest, including the exact control flow and all accesses. In addition it will compute alias checks, constraints under which the representation is not truthful (e.g., an access function might overflow), and a lot of other things that are not needed for the task at hand.

-- Example 2 --
We can also imagine that we want to determine if a loop can be vectorized, i.e., whether there are (short) loop-carried dependences (see the example below). No matter which technique is used, the complexity of a dependence analysis will certainly increase with the number (and complexity) of the analyzed accesses. Since Polly is designed with fine-grained scheduling optimizations in mind, the dependences it computes are disproportionate with regard to the question asked. Consequently, it will either take more time to determine if there are loop-carried dependences, or a time-out in the computation will cause a conservatively negative result.
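A concrete instance of Example 2 (code invented for illustration):

  // The load of a[i - 1] reads the value stored in the previous
  // iteration: a loop-carried dependence of distance 1, which forbids
  // straightforward vectorization. A client asking only "is there a
  // short loop-carried dependence?" needs far less than the exact
  // dependence relations Polly computes for scheduling.
  void smooth(double *a, int n) {
    for (int i = 1; i < n; ++i)
      a[i] = 0.5 * (a[i] + a[i - 1]);
  }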
---

While it is generally possible to disable some "features" of Polly or to "work around" them, this requires new independent code paths in an already highly interleaved project. The explicit (and implicit) interactions between the different parts of Polly are plentiful and subtle. I don't think there are 5 people that know about more than half of them. (I also doubt there is anyone who knows them all.)

---

I want to propose an alternative plan, on which I would appreciate feedback. The feedback is especially useful since I will present this, or more accurately the concrete polyhedral analyses described below, in the student research competition at the LLVM Dev meeting in October.

The idea is simple. We rewrite parts of Polly with two goals in mind: 1) to provide analysis results and transformation options to LLVM, and 2) to allow polyhedral loop transformations. A rewrite allows us to revisit various design decisions that are buried deep in the Polly code and that hinder its application (in any context). The interested reader can find a short list at the end.

I think we should incrementally introduce polyhedral analysis capabilities in order to provide immediate benefits and to allow developers to get involved with the design and usage from an early point in time. Decoupled analyses also allow easier integration, interfaces that come "more naturally" to non-polyhedral people, and better use of the capabilities if polyhedral scheduling is not the goal.

To demonstrate this incremental process, I have started to write a polyhedral value analysis.** Think of it as Scalar Evolution, but with piecewise-defined polyhedral values as the abstract domain (a small illustration follows below). This analysis is demand driven, works at the granularity of an LLVM value, is flow sensitive, and is especially designed to be applicable to partially affine programs. It already supports most of what the polyhedral model allows and includes a lot of things we prototyped in Polly. It shall be used when Scalar Evolution is not able to provide a sufficient result.

On top of it I implemented a polyhedral memory analysis*** that is (currently) capable of summarizing the memory effects of a code region, e.g., a function. The next step is a demand-driven dependence analysis**** that utilizes the (often) very structured nature of a CFG. Afterwards, I would like to see a transformation framework that allows classical but also scheduling-based transformations.

While these analyses still depend on isl as a back end, they never use it directly. Instead there is a C++ interface in between that encapsulates the features needed. Once our use cases are clear, we can re-implement this interface.

** For comparison, the value analysis covers parts of the ScopDetection, all of SCEVValidator and SCEVAffinator, as well as parts of ScopInfo and ScopBuilder in Polly.
*** The memory analysis is concerned with the construction and functionality Polly implements in the MemoryAccess class.
**** The dependence analysis is similar to Polly's DependenceInfo class but is supposed to allow more fine-grained control.
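As a small illustration of a piecewise, flow-sensitive value (my own example; the notation in the comment is standard isl map syntax, not the analysis's actual interface):

  // Scalar Evolution cannot express the clamped index x as one affine
  // function of i, but a piecewise polyhedral value can:
  //   x = { [i] -> [i]   : 0 <= i <= 99;
  //         [i] -> [100] : i >= 100 }
  // A flow-sensitive analysis derives the two pieces directly from
  // the branch condition.
  void clamp_store(double *a, int n) {
    for (int i = 0; i < n; ++i) {
      int x = (i < 100) ? i : 100;
      a[x] += 1.0;
    }
  }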
---

I am aware that not all Polly developers will agree with my views, and that some of these points are simply trade-offs that do not offer a single "correct way". Nevertheless, I hope that the simple examples I used to point out the various problems are reason enough to consider the route I proposed.

Cheers,
  Johannes

[1] https://groups.google.com/d/msg/polly-dev/ACSRlFqtpDE/t5EDlIwtAgAJ
--
Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert
Sjoerd Meijer via llvm-dev
2017-Sep-20 09:33 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi,

I have not been following the polyhedral developments in GCC and LLVM very closely over the last few years, but I was wondering whether there are any lessons (that we should look at, or are actually already aware of) from GCC's implementation and integration of Graphite. For example, do we know if it is still enabled/used/etc.?

And in the context of possibly rewriting bits and pieces of Polly, thus making it modular and making its exact analyses available to "traditional" optimisation passes such as, e.g., the vectoriser: is there data that shows this really improves performance? Is there, e.g., data available that shows the performance uplift of the vectoriser using the exact polyhedral data dependence analysis over the traditional data dependence analysis?

I am a big fan of the polyhedral model and these are not leading questions; I was just wondering if that could help in deciding whether to make it modular and slowly integrate it, or to keep it as a separate pipeline.

Cheers,
Sjoerd.
Johannes Doerfert via llvm-dev
2017-Sep-20 16:04 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi Sjoerd,

On 09/20, Sjoerd Meijer wrote:
> I have not been following the polyhedral developments in GCC and LLVM
> very closely over the last few years, but I was wondering whether
> there are any lessons (that we should look at, or are actually already
> aware of) from GCC's implementation and integration of Graphite. For
> example, do we know if it is still enabled/used/etc.?

I tried to look this up online, but most documents were rather old (<= 2015). The changes to graphite.{h,c} seem to be scarce nowadays too (6 in 2016 + 2017). However, I am no Graphite expert and I hope Sebastian [CC] can help.

> And in the context of possibly rewriting bits and pieces of Polly,
> thus making it modular and making its exact analyses available to
> "traditional" optimisation passes such as, e.g., the vectoriser: is
> there data that shows this really improves performance? Is there,
> e.g., data available that shows the performance uplift of the
> vectoriser using the exact polyhedral data dependence analysis over
> the traditional data dependence analysis?

Afaik, there is only the (rather inconclusive) GSoC project from 2016. I'll try to generate numbers soon, but first for the value analysis, as the separate dependence analysis is not ready.

It is also worth noting that the current LLVM dependence analysis is apparently broken and incomplete. Here is part of the exchange between Gerolf Hoflehner and Sebastian Pop, also in this thread:

On 09/13, Sebastian Pop via llvm-dev wrote:
> On Tue, Sep 12, 2017 at 10:26 PM, Gerolf Hoflehner via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> > Are you saying the LLVM Dependence Analysis is incorrect or do you
> > actually mean less conservative (or "more accurate" or something
> > like that)?
>
> Yes, the LLVM dependence analysis is broken from day one, by design,
> due to a misunderstanding of the meaning of GEPs:
> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130701/179509.html
> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130701/179529.html
> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130701/179570.html
>
> Loop interchange and any other pass that relies on the current llvm
> dependence analysis may generate wrong code.
> See https://reviews.llvm.org/D35430
>
> Another point: the MIV test in the llvm dependence analysis is not
> implemented, and so the scope of the llvm dependence analysis is
> rather narrow: i.e., it would not be able to solve the loop
> interchange in spec2000/swim.
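To illustrate the GEP problem Sebastian describes, here is a small, hand-written example (my own, not taken from the cited patches):

  // DependenceAnalysis tests subscripts dimension by dimension, which
  // is only sound if every subscript stays inside its dimension. GEP
  // address arithmetic is linearized, and nothing in LLVM IR enforces
  // that bound. All accesses below stay inside the 32x32 allocation,
  // yet the per-dimension subscripts do not:
  void f(double A[32][32]) {
    for (int i = 1; i < 32; ++i)
      for (int j = 0; j < 32; ++j)
        // &A[i - 1][j + 31] == &A[i][j - 1], so the load and the store
        // touch the same element in the same iteration. Comparing the
        // subscript pairs (i, j - 1) and (i - 1, j + 31) per dimension
        // finds no in-bounds solution, i.e., "no dependence", and a
        // pass trusting that answer may reorder these accesses.
        A[i][j - 1] = A[i - 1][j + 31];
  }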
> I am a big fan of the polyhedral model and these are not leading
> questions; I was just wondering if that could help in deciding whether
> to make it modular and slowly integrate it, or to keep it as a
> separate pipeline.

I think these questions are perfectly fine and deserve to be answered! Though it is not that simple to generate the numbers you are interested in, so it will consequently take a little more time.

Cheers,
  Johannes
--
Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert
Hal Finkel via llvm-dev
2017-Sep-22 05:09 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi, Johannes,

Thanks for writing this. I certainly think you have the right idea in terms of the desired end state and modular design.

On 09/19/2017 07:33 PM, Johannes Doerfert wrote:
> Hi Hal, Tobias, Michael, and others,
>
> I'd like to add my view (and a proposal) to this discussion, and I
> apologize directly for doing this so late*. I also want to apologize
> because this email is long and contains various technical details and
> argumentations that might need more justification. However, I am happy
> to provide further information (and/or examples) to explain my views
> if required.
>
> tldr;
> We should introduce polyhedral analysis and optimization capabilities
> into LLVM step by step. Along the way we should revisit design
> decisions made by Polly and adjust them according to the use cases in
> the rest of LLVM. Finally, we should keep Polly as a stand-alone tool
> to allow research into, and prototyping of, complex loop
> transformations.
>
> * There was a paper deadline at the end of last week and I simply had
>   to prioritize.
>
> LLVM performs "simple" loop transformations and targets almost
> exclusively inner-most loops. Polly offers a pipeline to perform
> "complex" polyhedral-based loop optimizations on a sequence of loop
> nests. It seems natural to integrate Polly into the LLVM pipeline as
> it apparently complements it well. However, I do not believe that it
> is wise to integrate Polly into LLVM, for a variety of reasons, some
> of which I will explain in more detail now. Afterwards I will propose
> a different way to achieve (most of) the things Hal mentioned in his
> mail in an incremental way. Nevertheless, I want to stress that I am
> still an advocate of Polly and I honestly believe it to be a valuable
> part of the LLVM environment. Though I strongly feel it should be
> continued as a research tool and test bed to allow research into, and
> prototyping of, (polyhedral-model-based) analyses and optimizations.

While this may end up being the conclusion, I'd find that splitting unfortunate. One of the strengths of LLVM is that, within the framework of a production-quality compiler, we enable new compiler research. This strengthens research not only by allowing new ideas to be surrounded by existing ones in a realistic way, allowing the effects of the new ideas to be isolated convincingly, but also by encouraging new ideas to be developed and tested in a way that mirrors how they might be deployed in practice. This also encourages the research community to be part of the LLVM community, and I believe that's a positive for the community as a whole. The more we isolate the "research tool" from the rest of the ecosystem, the less value the research tool has, and the less the community as a whole benefits from interactions with researchers.

> Polly is a deep pipeline of passes that were developed for the sole
> purpose of applying scheduling transformations at the very end. This
> "main purpose" of Polly makes it hard to effectively reuse
> intermediate parts, e.g., the dependence analysis or the code
> generation framework, at least to the degree we might want to. In a
> different thread [1] Hal asked for the possibility to reuse Polly for
> tuning via pragmas, e.g., unrolling/interleaving/tiling/interchange.
While this is interesting and > generally possible it won't take long before the restrictions Polly > places on the input code (to enable scheduling optimizations in the end) > become a serious problem. The point here is that Polly has too many > features and dependences for such an "easy" use case. (Note that I do > not question the legality of a user-requested transformation here.) If > you are not looking for "end-to-end polyhedral loop optimizations" you > want a system that answers a very specific question (e.g., > isVectorizable) or performs a very specific transformation (e.g., tiling) > on a maximal set of programs. In contrast, Polly is (and I strongly > believe it should be) developed as a system that performs complex > optimizations, based on specialized analysis results, on a constrained > class of programs. > > One might argue that Polly will become more modular when it is integrated > into LLVM and when the intermediate results are used in more and more > places. However, I'd say this is the "wrong direction" with regards to: > 1) incremental design and development, > 2) maintainability of individual parts, and > 3) the development of a suitable cost model (which Polly does not ship). > > Instead of starting with a full, hard-to-understand scheduling pipeline > we should start with the use cases at hand and design specific analysis > (and also transformation) passes "from scratch". The long-term goal > should be a full scheduling pipeline, but the parts need to be designed > in a modular way (the LLVM way) from the very beginning. This allows us to: > - provide __immediate benefit__ to other developers, > - allow active participation in (and understanding of) the design of > each part, and > - develop the intermediate parts with the requirements of the whole > LLVM project in mind. > > Let me give two examples to make my point: > > -- Example 1 -- > In addition to the applicability issues there are other problems that > arise from the full-pipeline design. Let's assume we want to apply a > pragma-driven optimization to a loop that can be represented (and > optimized) by Polly. Depending on the pragma we only need to perform a > very specific task, e.g., adjust the iteration variable (loop inversion) > or introduce new loops and adjust existing iteration variables (tiling). > Polly will however represent (and actually analyze) the whole loop nest > including the exact control flow and all accesses. In addition it > will compute alias checks, constraints under which the representation > is not faithful (e.g., an access function might overflow) and a lot of > other things that are not needed for the task at hand.

I understand your point, but I think that you're assuming too much about the semantics of the pragmas. I believe that our current vectorization pragma clauses demonstrate the philosophy we should follow with other loop pragmas in this sense: We have vectorize(enable) and vectorize(assume_safety). The latter does indeed just vectorize the loop without requiring dependence analysis; the vectorize(enable) clause, by contrast, instructs the vectorizer to vectorize the loop (overriding/loosening some of the cost-model constraints) but still uses the dependence analysis and generates runtime checks.
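For illustration, the two clauses look like this in source (a made-up example; declarations elided):

  // vectorize(assume_safety): the user asserts that there are no unsafe
  // loop-carried memory dependences, so the dependence checks are skipped.
  #pragma clang loop vectorize(assume_safety)
  for (int i = 0; i < n; ++i)
    hist[bucket[i]] += w[i];

  // vectorize(enable): vectorization is requested and the cost model is
  // relaxed, but dependence analysis still runs and runtime alias checks
  // are still emitted when safety cannot be proven statically.
  #pragma clang loop vectorize(enable)
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];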
Even if we have pragmas to direct the process, by default, we still likely want the full dependence analysis and the framework for generating safety predicates.

> > -- Example 2 -- > We can also imagine we want to determine if a loop can be vectorized, > thus if there are (short) loop-carried dependences. No matter which > technique is used, the complexity of a dependence analysis will > certainly increase with the number (and complexity) of the analyzed > accesses. Since Polly is designed with fine-grained scheduling > optimizations in mind, the dependences it will compute are > disproportionate with regard to the question asked.

Could you explain this in more detail?

> Consequently, it > will either take more time to determine if there are loop-carried > dependences

I think that this is clearly a question of degree. Almost any non-lazy analysis will over-compute for a particular transformation (e.g. computing a dominator tree), but there's engineering value in consistency. The question is just in the cost vs. the engineering complexity in customizing the analysis.

> or a time-out in the computation will cause a conservatively > negative result. > > --- > > While it is generally possible to disable some "features" of Polly or to > "work around them", it requires new independent code paths in an already > highly interleaved project. The explicit (and implicit) interactions > between the different parts of Polly are plentiful and subtle. I don't > think there are 5 people that know about more than half of them. (I also > doubt there is anyone who knows them all.)

This point is well taken. For better or worse, it reminds me of another component we've discussed a lot recently: InstCombine. Both (excluding isl) are also nearly the same size. In this context, documenting these interactions is important (especially if we hope to rewrite large parts of it).

> > --- > > I want to propose an alternative plan for which I would appreciate > feedback. The feedback is especially useful since I will present this, > or more accurately the concrete polyhedral analyses described below, in > the student research competition at the LLVM Dev Meeting in October. > > The idea is simple. We rewrite parts of Polly with two goals in mind: > 1) to provide analysis results and transformation options to LLVM, and > 2) to allow polyhedral loop transformations. > > A rewrite allows us to revisit various design decisions that are buried > deep in the Polly code and that hinder its application (in any context). > The interested reader can find a short list at the end. > > I think we should incrementally introduce polyhedral analysis > capabilities in order to provide immediate benefits and to allow > developers to get involved with the design and usage from an early point > in time. Decoupled analyses also allow easier integration, interfaces > that come "more naturally" to non-polyhedral people, and better use of the > capabilities if polyhedral scheduling is not the goal. > > To demonstrate this incremental process I started to write a polyhedral > value analysis**. Think of it as Scalar Evolution but with piece-wise > defined polyhedral values as the abstract domain. This analysis is > demand-driven, works on the granularity of an LLVM value, is flow > sensitive, and is especially designed to be applicable to partially affine > programs. It already supports most of what the polyhedral model allows > and includes a lot of things we prototyped in Polly. It shall be > used when Scalar Evolution is not able to provide a sufficient result.
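To give a flavor of what a piece-wise defined value buys over a single affine expression, consider a bound whose definition changes at a break point (an illustration, not taken from the proposed analysis; K, M, N are loop-invariant parameters):

  for (long i = 0; i < T; ++i) {
    long ub = (i < K) ? M : N;   // piece-wise affine in i, roughly:
                                 //   [K,M,N] -> { [i] -> M : i < K;
                                 //                [i] -> N : i >= K }
    for (long j = 0; j < ub; ++j)
      A[i][j] = 0.0;             // the access domain is then a
                                 // piece-wise affine set as well
  }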
> On top I implemented a polyhedral memory analysis*** that is (currently) > capable of summarizing the memory effects of a code region, e.g., a > function. The next step is a demand-driven dependence analysis**** that > utilizes the (often) very structured nature of a CFG. Afterwards, I > would like to see a transformation framework that allows classical but > also scheduling-based transformations. > > While these analyses still depend on isl as a back-end, they never > directly use it. Instead there is a C++ interface in-between that > encapsulates the features needed. Once our use cases are clear we can > re-implement this interface. > > ** For comparison, the value analysis covers parts of the ScopDetection, > all of SCEVValidator and SCEVAffinator as well as parts of ScopInfo and > ScopBuilder in Polly. > *** The memory analysis is concerned with the construction and > functionality Polly implements in the MemoryAccess class. > **** The dependence analysis is similar to Polly's DependenceInfo class > but is supposed to allow more fine-grained control.

Is the code for these available? Are you planning to contribute them? We could certainly consider these part of the plan.

> > --- > > These are a few design decisions that I think should be revisited when > we think about polyhedral-based analysis and optimizations in LLVM: > > - We should not rely on Scalar Evolution. Instead we want to build > polyhedral values straight from the underlying IR. While > Scalar Evolution did certainly simplify Polly's development it > induced various problems: > - The lack of piece-wise affine values (other than min/max) > limits applicability and precision. (Though conditional SCEVs > are a small step in this direction.) > - The flow-insensitive value analysis (Scalar Evolution) causes a > lot of "assumptions", thus versioning, for the otherwise flow-sensitive > Polly. > - There are caching issues when parts of the program are > transformed or when Polly is used for inter-procedural > analysis/optimization. > > - We should not depend on single-entry-single-exit (SESE) regions. > Instead we analyze the code optimistically to the extent possible > (or required by the user). Afterwards we can easily specialize the > results to the code regions we are interested in. While SESE regions > always caused compile-time issues (due to the use of the post-dominance > tree) they also limit the analyses/transformations that > can be performed. As an example, think of accesses that are > non-affine with regard to the smallest surrounding SESE region but > not in a smaller code region (e.g., a conditional). If the goal is > not loop scheduling but, for example, instruction reordering, it > is still valuable to determine the dependences between these > accesses in the smaller code region. > > - We should iteratively and selectively construct polyhedral > representations. The same holds for analysis results, e.g., > dependences. Let's reconsider example 2 from above. We want to visit > one memory access at a time in order to build the polyhedral > representation and compute the new dependences. We also want to > immediately stop if we identify a loop-carried dependence. Finally, > we do not want to compute dependences in outer loop iterations or > dependences that are completely contained in one loop iteration.

If we're only considering inner-loop vectorization, then yes, we don't need outer-loop dependencies. If I understand what you mean, we actually might want intra-iteration dependencies for cases where reordering might allow for vectorization.
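A hand-made illustration of that last point: swapping two statements turns a backward loop-carried dependence into a forward one and makes the loop vectorizable, and proving the swap legal is exactly a question about the intra-iteration dependences between the two statements:

  for (long i = 1; i < n; ++i) {
    b[i] = a[i - 1] + 1.0;  // S1: reads the value S2 wrote last iteration
    a[i] = c[i] * 2.0;      // S2
  }
  // S2(i-1) -> S1(i) is a backward dependence in this statement order,
  // so straightforwardly vectorizing the body is illegal. Since S1 and
  // S2 touch disjoint memory within one iteration, they can be swapped,
  // after which the dependence is forward and the loop vectorizes.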
> > - We should decouple the code generation part completely from the > transformation part.

This separation sounds useful, but not necessarily to enable us to write a bunch of separate loop transformations. We may want to split things into canonicalization vs. lowering transformations, however.

> Additionally we might want to start with > classical loop transformations instead of full polyhedral > scheduling. For a lot of classical loop transformations code > generation only needs to understand/change loop induction variables > and exit conditions. Duplication + rewriting of the loop body is > often not necessary. Pragma-driven optimizations could be easily > done and we could also use heuristics similar to the ones we have > today (e.g., for loop distribution). The transformations we allow > should then grow with the scheduling decisions we allow and these > should be limited by a yet-to-be-determined cost model.

Much of this sounds potentially good, but it's not clear to me that any of this is really an argument against integrating Polly into LLVM right now. The components you propose should be usable for both separate and integrated transformations and analysis, and at least in the short term, we'll end up needing isl regardless. Is your skepticism mainly about whether the current Polly could be transitioned to use the new infrastructure (without essentially rewriting all of it)? If so, I'd like to understand why. Thanks again, Hal

> > --- > > I am aware that not all Polly developers will agree with my views and > that some are simply trade-offs that do not offer a "correct way". > Nevertheless, I hope that the simple examples I used to point out > various problems are enough reason to consider the route I proposed. > > Cheers, > Johannes > > [1] https://groups.google.com/d/msg/polly-dev/ACSRlFqtpDE/t5EDlIwtAgAJ > > On 09/01, Hal Finkel via llvm-dev wrote: >> Hi everyone, As you may know, stock LLVM does not provide the kind of >> ... >> >> -- >> Hal Finkel >> Lead, Compiler Technology and Programming Languages >> Leadership Computing Facility >> Argonne National Laboratory >

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Tobias Grosser via llvm-dev
2017-Sep-26 06:18 UTC
[llvm-dev] [RFC] Polly Status and Integration
On Wed, Sep 20, 2017, at 02:33, Johannes Doerfert wrote:
> Hi Hal, Tobias, Michael, and others,

Hi Johannes, thanks for joining the discussion.

> I'd like to add my view (and a proposal) to this discussion and I > apologize right away for doing this so late*. I also want to apologize > because this email is long, contains various technical details and also > arguments that might need more justification. However, I am happy > to provide further information (and/or examples) to explain my views if > required. > > tldr; > We should introduce polyhedral analysis and optimization capabilities > into LLVM step by step. Along the way we should revisit design decisions > made by Polly and adjust them according to the use cases in the rest of > LLVM. Finally, we should keep Polly as a stand-alone tool to allow > research into, and prototyping of, complex loop transformations.

I very much agree with you that many design decisions will be changed in the future. Many of the targets you mentioned are indeed goals that have been discussed as part of the Polly development agenda and are areas I very much would like to push forward. You also make a very valid point that we should involve the wider LLVM community in these discussions and, at best, in the overall development of polyhedral optimizations as a whole. Hence, it seems clear that we want to keep as much of the polyhedral development as possible in core LLVM. I have always been conservative about pushing prototype stuff into core LLVM, which is why we spent multiple years developing very solid foundations in both Polly and isl. While I agree that many design decisions need to be re-evaluated, as we constantly do in LLVM, and especially in the light of advances we pushed into isl, I think doing this jointly with the full LLVM community is the way to go. I also don't see your ideas as conflicting with the integration of Polly. We can always move Polly now and replace parts of it with new and better stuff as it becomes available. In fact, Polly can serve as a test bed to provide test coverage for such new code, and the other PollyLabs people and I are certainly happy to help.

> * There was a paper deadline end of last week and I simply had to prioritize.

Good luck!

> Polly is a deep pipeline of passes that were developed for the sole > purpose of applying scheduling transformations at the very end. This > "main purpose" of Polly makes it hard to effectively reuse intermediate > parts, e.g. the dependence analysis or the code generation framework, at > least to the degree we might want to. In a different thread [1] Hal > asked for the possibility to reuse Polly for tuning via pragmas e.g., > unrolling/interleaving/tiling/interchange. While this is interesting and > generally possible it won't take long before the restrictions Polly > places on the input code (to enable scheduling optimizations in the end) > become a serious problem. The point here is that Polly has too many > features and dependences for such an "easy" use case. (Note that I do > not question the legality of a user-requested transformation here.) If > you are not looking for "end-to-end polyhedral loop optimizations" you > want a system that answers a very specific question (e.g., > isVectorizable) or performs a very specific transformation (e.g., tiling) > on a maximal set of programs.
In contrast, Polly is (and I strongly > believe it should be) developed as a system that performs complex > optimizations, based on specialized analysis results, on a constrained > class of programs.

While it is right that Polly has been developed as an end-to-end polyhedral optimizer, it certainly has not been developed only with end-to-end loop optimizations in mind. In fact, many of the foundations that are today useful to build a polyhedral-based range analysis on top of isl have been developed together with Polly and with Polly as the use case (e.g., to build the optimistic assumption model we use) to enable a wide range of optimizations and analyses. Seeing more of them popping up now is indeed very nice.

> One might argue that Polly will become more modular when it is integrated > into LLVM and when the intermediate results are used in more and more > places. However, I'd say this is the "wrong direction" with regards to: > 1) incremental design and development, > 2) maintainability of individual parts, and > 3) the development of a suitable cost model (which Polly does not ship).

Polly has been developed from the very first patch in the open, was discussed on the LLVM mailing lists from day one (and started from such a discussion), and quickly became a publicly visible LLVM project. I would almost claim it is the perfect example of incremental development. When we started, Polly was not part of core LLVM due to license reasons and also due to a lack of mature foundations. Over the last years we replaced, or wrote new versions of, all essential components so that a move to core LLVM became possible. Still, I get your point. It certainly makes sense to, e.g., extract components of Polly to build a polyhedral range analysis and provide it to other developers. Thanks for already providing a patch that illustrates your ideas! Siddharth already commented and I would also like to have a look. I also feel that Mehdi would be an excellent person to give you feedback on this, as he has significant expertise with PIPS, which uses abstract interpretation very similar to how one would use it for a more general polyhedral range analysis. However, I would decouple these two goals. We could move Polly and isl in place, and then move parts of Polly to the range analysis you propose -- or start copying and redeveloping parts of Polly onto the new foundations you propose. Later we can replace more parts. In general, there has never been an issue with temporarily having some redundancy in LLVM, and I don't think as part of this discussion we need to lay out the full future of polyhedral analysis and transformations in LLVM. From my perspective, we should mainly answer two questions:

1) "Do we want polyhedral analysis as part of core LLVM?"
2) "Do we want to keep significant parts of the polyhedral development outside of core LLVM?"

My feeling is that there are several groups interested in 1), which gives me confidence that we will -- eventually -- have polyhedral stuff in LLVM. Hal gave pretty good arguments for 2) and I also believe strongly that we should break the boundaries between polyhedral and core LLVM people.

> Let me give two examples to make my point: > > -- Example 1 -- > In addition to the applicability issues there are other problems that > arise from the full-pipeline design. Let's assume we want to apply a > pragma-driven optimization to a loop that can be represented (and > optimized) by Polly.
Depending on the pragma we only need to perform a > very specific task, e.g., adjust the iteration variable (loop inversion) > or introduce new loops and adjust existing iteration variables (tiling). > Polly will however represent (and actually analyze) the whole loop nest > including the exact control flow and all accesses. In addition it > will compute alias checks, constraints under which the representation > is not faithful (e.g., an access function might overflow) and a lot of > other things that are not needed for the task at hand.

Sure. Incrementally building dependences certainly makes sense in some cases!

> -- Example 2 -- > We can also imagine we want to determine if a loop can be vectorized, > thus if there are (short) loop-carried dependences. No matter which > technique is used, the complexity of a dependence analysis will > certainly increase with the number (and complexity) of the analyzed > accesses. Since Polly is designed with fine-grained scheduling > optimizations in mind, the dependences it will compute are > disproportionate with regard to the question asked. Consequently, it > will either take more time to determine if there are loop-carried > dependences or a time-out in the computation will cause a conservatively > negative result.

Good idea!

> While it is generally possible to disable some "features" of Polly or to > "work around them", it requires new independent code paths in an already > highly interleaved project. The explicit (and implicit) interactions > between the different parts of Polly are plentiful and subtle. I don't > think there are 5 people that know about more than half of them. (I also > doubt there is anyone who knows them all.) > > I want to propose an alternative plan for which I would appreciate > feedback. The feedback is especially useful since I will present this, > or more accurately the concrete polyhedral analyses described below, in > the student research competition at the LLVM Dev Meeting in October.

Nice. Looking forward to your talk!

> The idea is simple. We rewrite parts of Polly with two goals in mind: > 1) to provide analysis results and transformation options to LLVM, and > 2) to allow polyhedral loop transformations.

Sure. Improved granularity is certainly a good thing to have. As stated in the initial proposal, I am glad to see Polly being ripped apart. You already started!

> A rewrite allows us to revisit various design decisions that are buried > deep in the Polly code and that hinder its application (in any context). > The interested reader can find a short list at the end. > I think we should incrementally introduce polyhedral analysis > capabilities in order to provide immediate benefits and to allow > developers to get involved with the design and usage from an early point > in time. Decoupled analyses also allow easier integration, interfaces > that come "more naturally" to non-polyhedral people, and better use of the > capabilities if polyhedral scheduling is not the goal.

Many of these ideas sound very nice. I am certain you will provide some good interface ideas.

> To demonstrate this incremental process I started to write a polyhedral > value analysis**. Think of it as Scalar Evolution but with piece-wise > defined polyhedral values as the abstract domain. This analysis is > demand-driven, works on the granularity of an LLVM value, is flow > sensitive, and is especially designed to be applicable to partially affine > programs. It already supports most of what the polyhedral model allows > and includes a lot of things we prototyped in Polly.
It shall be > used when Scalar Evolution is not able to provide a sufficient result.

Nice! Clearly a step in the right direction.

> On top I implemented a polyhedral memory analysis*** that is (currently) > capable of summarizing the memory effects of a code region, e.g., a > function. The next step is a demand-driven dependence analysis**** that > utilizes the (often) very structured nature of a CFG. Afterwards, I > would like to see a transformation framework that allows classical but > also scheduling-based transformations.

> While these analyses still depend on isl as a back-end, they never > directly use it. Instead there is a C++ interface in-between that > encapsulates the features needed. Once our use cases are clear we can > re-implement this interface.

Interesting. This seems to complement our efforts in providing a C++ interface to isl, but with slightly different goals. I still feel we should use isl++ instead of plain isl within your interface.

> These are a few design decisions that I think should be revisited when > we think about polyhedral-based analysis and optimizations in LLVM: > > - We should not rely on Scalar Evolution. Instead we want to build > polyhedral values straight from the underlying IR. While > Scalar Evolution did certainly simplify Polly's development it > induced various problems: > - The lack of piece-wise affine values (other than min/max) > limits applicability and precision. (Though conditional SCEVs > are a small step in this direction.) > - The flow-insensitive value analysis (Scalar Evolution) causes a > lot of "assumptions", thus versioning, for the otherwise flow-sensitive > Polly. > - There are caching issues when parts of the program are > transformed or when Polly is used for inter-procedural > analysis/optimization.

I agree, this is certainly a good direction.

> - We should not depend on single-entry-single-exit (SESE) regions. > Instead we analyze the code optimistically to the extent possible > (or required by the user). Afterwards we can easily specialize the > results to the code regions we are interested in. While SESE regions > always caused compile-time issues (due to the use of the post-dominance > tree) they also limit the analyses/transformations that > can be performed. As an example, think of accesses that are > non-affine with regard to the smallest surrounding SESE region but > not in a smaller code region (e.g., a conditional). If the goal is > not loop scheduling but, for example, instruction reordering, it > is still valuable to determine the dependences between these > accesses in the smaller code region.

SESE regions have (to my observation) never caused compile-time issues. However, several people, including Chandler, have raised concerns several times, and SESE regions indeed do not fit well when, e.g., mixed with the vectorizer. Hence, I agree that our ultimate polyhedral optimizer in LLVM should likely not use SESE regions. In practice, Polly works reasonably well with regions for now.

> - We should iteratively and selectively construct polyhedral > representations. The same holds for analysis results, e.g., > dependences. Let's reconsider example 2 from above. We want to visit > one memory access at a time in order to build the polyhedral > representation and compute the new dependences. We also want to > immediately stop if we identify a loop-carried dependence. Finally, > we do not want to compute dependences in outer loop iterations or > dependences that are completely contained in one loop iteration.

Interesting.
For certain use cases this is clearly the right approach. For full-fledged scheduling this may not work.

> - We should decouple the code generation part completely from the > transformation part. Additionally we might want to start with > classical loop transformations instead of full polyhedral > scheduling. For a lot of classical loop transformations code > generation only needs to understand/change loop induction variables > and exit conditions. Duplication + rewriting of the loop body is > often not necessary. Pragma-driven optimizations could be easily > done and we could also use heuristics similar to the ones we have > today (e.g., for loop distribution). The transformations we allow > should then grow with the scheduling decisions we allow and these > should be limited by a yet-to-be-determined cost model.

Sure, better encapsulation is always good. How to do this exactly will be an interesting discussion!

> I am aware that not all Polly developers will agree with my views and > that some are simply trade-offs that do not offer a "correct way". > Nevertheless, I hope that the simple examples I used to point out > various problems are enough reason to consider the route I proposed.

I briefly looked at your patch for the polyhedral range analysis. While I would have preferred an initial design discussion much earlier, and your patch needs more comments and especially a design description, it certainly is a move in the right direction! We may also split it into smaller pieces to make review easier. In fact, for some reason this makes me think of the RISC-V upstreaming. Not that we should go at its speed, but it has been a perfect example of how to upstream a new backend -- with a lot of feedback from the community. This seems to be very much aligned with your goals. Overall, I clearly agree with your proposal of a polyhedral range analysis and will be glad to provide more review comments. Your development plans for the future are overall very much aligned with what I would like to see for polyhedral transformations in LLVM. We already discussed some of them over the last years, but you clearly also bring some nice "revolutionary" ideas into the game. I am really excited to see how these will evolve! I am also glad to see you more active on the mailing list again!

Best,
Tobias
Michael Kruse via llvm-dev
2017-Oct-11 07:07 UTC
[llvm-dev] [RFC] Polly Status and Integration
Dear LLVM community,

when working on larger projects, probably everybody has come up with ideas on how they would make things different if they started from scratch. For instance, in LLVM, modeling PHI nodes as in Swift's intermediate representation might be such an idea, but it requires huge changes in the code base. As a contributor to Polly, I am no exception. I therefore would like to share my thoughts about what the best possible integration of a polyhedral optimizer/Polly could look like, to discuss which end result we might strive for. I am a proponent of polyhedral compilation and would like to see it used on a daily basis by just choosing an optimization level.

IMHO we should avoid multiple loop optimizations in LLVM. If each of them does its own loop versioning, we will get multiple redundant runtime checks and code duplication. This is already the case with LoopVersioningLICM, LoopDistribution, LoopVectorizer and LoopLoadElimination. I argue for a more integrated approach for loop-based optimizations.

To clarify, what this is not:
- It has not been shown to work in practice.
- While I would like to contribute to implementing such a loop optimizer, this is not something I could do alone.
- This is no direct replacement for Polly or VPlan, but rather a long-term vision which also includes VPlan/Polly.

To have a comparison, I first try to sketch the approaches taken by the LoopVectorizer, VPlan and Polly, to then present the "Loop Optimization Framework". First only the design, and afterwards the rationales for the design decisions. To be able to show some examples, I'll refer to this snippet we intend to optimize:

  for (x = 0; x < sizes->nx; x += 1)
    for (y = 0; y < sizes->ny; y += 1)
      C[x][y] = 0;

  for (x = 0; x < sizes->nx; x += 1) {
    tmp = A[x];            /* hoisted load by LICM or GVN */
    int ny = sizes->ny;    /* dynamic load of size parameter */
    for (y = 0; y < sizes->ny; y += 1) {
      if (x != y)          /* affine conditional */
        C[x][y] = tmp * B[y];
      if (C[x][y] > 0)     /* data-dependent conditional */
        C[x][y] = sqrt(C[x][y]);
      if (verbose)         /* non-performance-critical case */
        printf("Some output");
    }
  }

LoopVectorizer (+VPlan)
-----------------------

This part I know the least; please correct me if I state something wrong. Details on VPlan were taken from [1] and its talk, and the code already committed to LLVM.

A) Preprocessing
LCSSA-form and loop-simplification.

B) Scope of Analysis
Innermost loop, single body BB (or if-convertible to a single BB).
VPlan: Acyclic control flow with SESE (single entry/exit block or edge?). Extensions for outer-loop vectorization (cyclic control flow) to be presented at the dev meeting: [2]

C) Control-Flow Modeling
Not needed; it's always just a single loop.
VPlan: Hierarchical CFG of VPBasicBlocks and VPRegionBlocks.

D) Access Modeling (LoopAccessInfo/LoopAccessAnalysis/AccessAnalysis)
Uses AliasSetTracker and analyzes each set independently. No PHI nodes (apart from the induction variable) allowed (there seems to be some support for reductions).

E) Dependency Analysis (MemoryDepChecker)
Leaks only the DependenceType and the maximum vectorization factor to the outside.

F) Assumption Tracking (PredicatedScalarEvolution/LoopVectorizationRequirements)
Assumptions can be added to PredicatedScalarEvolution to make SCEV modeling possible/simpler/exact. Used for:
- Assume exact BackedgeTakenCount
- Assume no overflow
Aliasing is tracked by AccessAnalysis.

G) Transformation
One explicit transformation: either vectorization is possible, or nothing is done at all.
VPlan: Modifying VPlans (widening, etc.) and picking the best according to the cost model.

H) Code Generation (InnerLoopVectorizer/LoopVersioning)
Create a new vectorized loop (without a prologue, to get aligned accesses) and use the old loop as the epilogue and as the fallback if the assumptions are not met.
VPlan: Generate instructions from the recipes in the VPlan.

I) Pass Manager Integration
Depends on the analyses LoopInfo and LoopAccessAnalysis. Otherwise LoopVectorize is self-contained and instantiates objects itself (e.g., LoopVersioning).

Polly Architecture
------------------

A) Preprocessing
There is no considerable preprocessing (anymore). The only modification done is inserting an entry block containing only allocas, to which Polly's IR generator can add its own allocas. The split-off BB can then be analyzed as well.

B) Scope of Analysis (ScopDetection/ScopInfo)
Search for the largest compatible llvm::Regions, called SCoPs. Compatibility requirements are (among others):
- Single Entry Single Exit (SESE)
- Base pointers invariant in the SCoP
- Affine loop bounds
- Only multi-dimensional affine array accesses (by default)
- No function calls (except side-effect-free functions)
Regions are expanded in order to include multiple top-level loops in one region. In the example, both outermost loops are included in the same SCoP region. LoadInst results that are used for loop bounds and array subscripts are assumed to be invariant during the execution of the SCoP. This allows the use of `sizes->ny` as a loop bound, even though the LoadInst could return a different value at every iteration of the outer loop (invariant load hoisting). Conditions that lead to function calls are assumed to be false. In the example, this makes it possible to remove the call to printf by adding the assumption `!verbose`.

C) Control-Flow Modeling (Schedule Tree)
A schedule tree is basically a tree of band nodes (collections of perfectly nested loops with a defined order) and sequence nodes (child nodes are executed in sequence). Affine conditionals are implemented as filters over the loop-variable values they are executed on. Example (simplified):

  Root
  |
  +- Band(Loops: x, y)
  |  |
  |  +- C[x][y] = 0;
  |
  +- Band(Loops: x)
     |
     +- tmp = A[x];  /* statement */
     |
     +- Band(Loops: y)
        |
        +- Filter(x + -y > 0 || x + -y < 0)  /* affine conditional */
        |  |
        |  +- C[x][y] += tmp * B[y];  /* statement */
        |
        +- if (C[x][y] > 0) C[x][y] = sqrt(C[x][y]);
           /* data-dependent conditional as macro-statement (llvm::Region) */

A statement execution for fixed x and y is called a 'statement instance'. Schedule trees are part of the Integer Set Library (isl) [3], which is also used to represent polyhedra in general.

D) Access Modeling (polly::MemoryAccess)
Every load/store has a base pointer and multi-dimensional indices. Indices are derived from GetElementPtr and/or using ScalarEvolution and then converted to piecewise-affine functions (isl_pw_aff). Because in C multi-dimensional accesses decay to flat subscripts (C[x*sizes->ny + y]), delinearization tries to recover the multi-dimensional subscripts. If this does not succeed, the option -polly-allow-nonaffine interprets them as touching the entire array. Assumptions are added to ensure that the accesses are within the derived dimension sizes. llvm::Instructions are modeled as zero-dimensional arrays (or "scalars") if there is a use outside of the statement instance they are defined in.

E) Dependency Analysis (DependenceInfo)
When passed the accesses for each statement, isl computes the dependencies (flow, anti and output) as a relational polyhedron.
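To make the versioning scheme concrete, this is roughly the shape of what gets emitted for a simple loop; a hand-written illustration, not actual compiler output:

  void axpy(float *x, float *y, long n) {
    long i = 0;
    if (x + n <= y || y + n <= x) {   /* runtime alias check (assumption) */
      for (; i + 4 <= n; i += 4) {    /* vectorized main loop, VF = 4 */
        /* conceptually a single <4 x float> add */
        y[i]   += x[i];   y[i+1] += x[i+1];
        y[i+2] += x[i+2]; y[i+3] += x[i+3];
      }
    }
    for (; i < n; ++i)                /* scalar epilogue; also the   */
      y[i] += x[i];                   /* fallback if the check fails */
  }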
That is, the exact dependencies between all statement instances. Example:

  { [x,y-1] -> [x,y] if x != y; [x,y-1] -> [x,y+1] if x == y }

describes the dependences of the statement guarded by the affine conditional on the next instance of itself.

F) Assumption Tracking (ScopInfo)
Remembers the assumptions mentioned before. There are four kinds of assumptions:
- Valid value ranges, not requiring a runtime check (e.g., from __builtin_assume)
- Value ranges that must be checked at runtime (e.g., the range of nx, ny such that all array subscripts are within bounds)
- Value ranges that are forbidden and must be checked at runtime (e.g., `verbose` must not be true)
- Non-aliasing assumptions, checked at runtime
Some violated assumptions may lead to a bail-out, e.g., if the pointer of an assumed invariant load is written to.

G) Transformations (ForwardOpTree/DeLICM/Simplify/ScheduleOptimizer)
To improve schedulability, ForwardOpTree and DeLICM modify the instruction lists of statements (not the IR itself). For instance, ForwardOpTree propagates the definition of tmp (a LoadInst) to the statements that use it. In the example this makes the x and y loops perfectly nested such that they can be tiled, and it removes the false dependences of tmp on itself (a "scalar dependency"). isl then computes a new schedule tree out of the old schedule tree and the dependence analysis. By default, it uses a modified PLuTo [4] algorithm. Its goals are:
- Separate independent statements into separate bands
- Reduce the distance between dependencies to optimize temporal locality
- Find parallel loops
On the transformed schedule tree, Polly walks the tree to find additional optimization opportunities:
- Tile bands that are 'permutable' and have at least 2 loops
- Optionally vectorize the first innermost 'coincident' loop
- Detect matrix-multiplication access patterns and apply a special optimization [5]

H) Code Generation (IslAstInfo/CodeGeneration/PPCGCodeGeneration)
isl can transform a schedule tree representation into an AST representation, with loops and if-conditions. CodeGeneration versions the original SCoP region using the tracked assumptions and emits the AST representation. For a statement in the AST, it copies the instructions from the instruction list. For parallel loops, it can generate OpenMP. PPCGCodeGeneration generates code for GPU accelerators.

I) Pass Manager Integration
ScopDetection is a function pass; all following passes are ScopPasses (RegionPasses in disguise):
- ScopInfo (analysis)
- JScopImporter (SCoP transformation)
- Simplify (SCoP transformation)
- ForwardOpTree (SCoP transformation)
- DeLICM (SCoP transformation)
- DependenceInfo (SCoP analysis)
- ScheduleOptimizer (SCoP transformation)
- IslAstInfo (SCoP analysis)
- CodeGeneration (IR transformation)

Proposed Loop Optimization Framework
------------------------------------

Only the architecture is outlined here. The rationales for the design decisions can be found in the "rationales" section below.

A) Preprocessing
Make a copy of the entire function so that transformations can be applied merely for the sake of analysis. For instance (and not necessarily in the preprocessing):
- Invariant load hoisting (move the load of sizes->ny before the loop)
- Move operations closer to the instructions that use them (move tmp = A[x] to the instructions that use it)
- Split basic blocks to allow loop distribution
The function copy will be discarded when no loop transformation is applied. This step can be made optional, to be active only at higher optimization settings.
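A minimal sketch of that scratch-copy step, using LLVM's existing cloning utility; the surrounding control flow and the anyLoopTransformed flag are invented for illustration:

  #include "llvm/IR/Function.h"
  #include "llvm/Transforms/Utils/Cloning.h"

  // Clone F, run the analysis-only canonicalizations on the clone, and
  // throw the clone away if no transformation ends up being applied.
  llvm::ValueToValueMapTy VMap;
  llvm::Function *Scratch = llvm::CloneFunction(&F, VMap);
  // ... hoist invariant loads, split blocks, etc. on Scratch ...
  if (!anyLoopTransformed)      // hypothetical outcome flag
    Scratch->eraseFromParent(); // discard the scratch copy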
B) Scope of Analysis
Always entire functions. Some loops/blocks might be marked as non-transformable.

C) Control-Flow Modeling
An AST-like tree based on the LoopInfo tree. Statements are pointers to basic blocks in the original IR. Any conditionals are modeled as data dependencies. Example:

  +- Loop 0 <= x < sizes->nx
  |  |
  |  +- Loop 0 <= y < sizes->ny
  |     |
  |     +- C[x][y] = 0;  [condition: true]
  |
  +- Loop 0 <= x < sizes->nx
     |
     +- Loop 0 <= y < sizes->ny
        |
        +- C[x][y] += A[x] * B[y];   [condition: true]
        |
        +- C[x][y] = sqrt(C[x][y]);  [condition: C[x][y] > 0]
        |
        +- printf("Some output");    [condition: verbose]

The tree is mutable by transformations; IR will be generated from it at the end. Loop bounds and affine conditions can also optionally be expressed as isl_sets. Statements have properties:
- The conditions under which they are executed
- Is it if-convertible?
- Can it be executed even where in the original IR it would not be executed?
- If executed at least once, does executing it multiple times have any additional effect?
- Hotness information from PGO
- #pragma annotations from clang
- Multi-thread effects (e.g. atomics/fences)
- ...

D) Access Modeling
Each load/store/... is associated with:
- The base pointer
- For each of the array's dimensions, the subscript, either as a SCEV or as an affine function (isl_pw_aff)
Frontends might help recover the multi-dimensional structure of an array by using an intrinsic/builtin instead of linearizing array subscripts (e.g., row-major).

E) Dependency Analysis
A collection of dependencies between two statements, say S and T, where S must be executed before T:
- The type of dependency: Register, Control, Flow, Anti, Output
- Which instances depend on each other. This can be:
  - No instance information. All instances of S must be executed before any instance of T.
  - A vector of dependence distances. Basically MemoryDepChecker for each surrounding loop.
  - A polyhedral relational map (isl_map) describing which instances of T depend on which instances of S.
The dependency type "register" means that T uses an llvm::Value defined by S. This can also be modeled using flow and anti-dependencies, but as shown before this can be destructive to transformation opportunities. Transformations need to specially consider register dependencies and avoid requiring expansion. The dependency type "control" models order induced by control conditions. E.g., the data-dependent conditional 'C[x][y] > 0' can only be evaluated after 'C[x][y]' has been written.

F) Assumption Tracking
A public interface that loop analyses and transformations can use to register their assumptions, to be checked at runtime. Assumptions are for specific loops, so they only need to be checked if the loop is actually transformed. Assumptions can also be queried; e.g., knowing that a variable always has a specific value may allow some simplifications. This knowledge could also be fed from __builtin_assume and profiling data.

G) Transformation
In the style of InstCombine, visit the tree and modify it (and only the tree, not the IR). Transformations can include:
- Add assumptions to make parts more transformable (e.g. assume verbose==false to eliminate the call to printf)
- Combine single loops into bands
- Apply #pragmas on the loop/band they are associated with
- Vectorize loops (VPlan)
- Tiling
- Unrolling
- Loop unswitching
- Localize working sets of inner tiles; this can be a generalization of Polly's matrix-multiplication optimization
- (Partial) unrolling of loops
- Specializing loops for specific inputs (loop unswitching)
- Mixed-LP scheduling
- OpenMP-parallelize bands
- Extract accelerator kernels
- Detect loop idioms (memset, memcpy, zgemm, etc.) to call library functions instead
- ...
Transformations are responsible for not violating dependencies. The current Polly would become one of the transformations. It finds the largest loop body where all size parameters are invariant and translates the subtree to an isl schedule tree. isl would transform it and generate an isl_ast, and Polly translates that into a new loop tree. Much of the work done by Polly today would be provided by the framework. Example transformation output:

  +- memset(C, '\0', sizes->nx*sizes->ny*sizeof(C[0][0]));
  |
  +- Loop 0 <= x < sizes->nx  [#pragma omp parallel for] [condition: !verbose]
  |  [assumption: C[0..sizes->nx-1][0..sizes->ny-1], A[0..sizes->nx-1], B[0..sizes->ny-1] not aliasing]
  |  |
  |  +- Loop 0 <= yv < sizes->ny step 4
  |  |  |
  |  |  +- Loop yv <= y < yv+4  [vectorize VF=4]
  |  |     |
  |  |     +- C[x][y] += A[x] * B[y];
  |  |
  |  +- Loop 4*(sizes->ny/4) <= y < sizes->ny
  |  |  |
  |  |  +- C[x][y] += A[x] * B[y];
  |  |
  |  +- Loop 0 <= y < sizes->ny
  |     |
  |     +- C[x][y] = sqrt(C[x][y]);  [condition: C[x][y] > 0]
  |
  +- Loop 0 <= x < sizes->nx  [condition: verbose]
     |
     +- Loop 0 <= y < sizes->ny
        |
        +- C[x][y] += A[x] * B[y];   [condition: true]
        |
        +- C[x][y] = sqrt(C[x][y]);  [condition: C[x][y] > 0]
        |
        +- printf("Some output");    [condition: true]

H) Code Generation
Determine which part of the tree has been modified. Outermost parts that have not been transformed do not need to be regenerated. The transformed part of the tree is generated into a separate function. Runtime checks for the collected assumptions are generated and, depending on their result, either the optimized function or the previously outlined function is called. The optimized function can be further optimized using LLVM's usual passes, and the inliner decides according to its heuristics whether it is worth inlining into the caller.

I) Pass Manager Integration
A monolithic loop transformation pass which creates the loop tree and provides the dependency analysis and the assumption tracker. It then invokes the separate transformations, in the style of InstCombine. The final tree is code-generated before control is passed back to the (old or new) pass manager.

Design Decision Rationales
--------------------------

A) Preprocessing
Polly's policy at the moment is to not modify the IR until code generation (to not interfere with the current pipeline). For invariant load hoisting and avoiding scalar dependencies it only virtually moves/copies instructions in its own data structures. This adds complexity as these representations get out of sync. For instance, the DominatorTree does not reflect the new locations of instructions, the CFG is modified by the CodeGenerator for versioning, PHI-node incoming lists do not correspond to the predecessor blocks anymore, instructions cannot be propagated across loop iterations as we don't know which iteration they were originally in, etc. If possible, I'd like to avoid this complexity. It approaches being an alternative representation of the current IR, so why not use the IR? VPlan has its own IR-like representation consisting of ingredients and recipes. For the vectorizer it might be justifiable because it does not change control flow.
Some transformations cope well with scalar dependencies (e.g. vectorization: there's a copy of each scalar anyway, and thread-parallelization would have them privatized), but in the polyhedral model, analysis takes place before transformation decisions. However, today's polyhedral schedulers (those based on Feautrier's or PLuTo's) do not have special handling for scalar dependencies. I am positive that we can have polyhedral schedulers that handle scalars without getting worse results (by avoiding putting value definitions into different loops than their uses, and by the ability to execute a definition multiple times), but this is research where such a framework could be very useful. Most of today's polyhedral compilers used by researchers are source-to-source, where the problem is much less pronounced. We might alternatively change the meaning of IR canonicalization in LLVM: pre-loop canonicalization would try to create self-contained blocks, where post-loop transformations do hoisting and sinking. To illustrate the issue with scalar dependencies:

  for (int i = 0; i < N; i += 1) {
    double s = ...;
    ... = s;
  }

Since in Polly s is modeled as a zero-dimensional array, there is just one memory location for it. Before processing the next iteration, the current iteration must have completed, before s is overwritten. If virtual registers are not modeled as having a memory location, a polyhedral scheduler might be inclined to do this (because, e.g., splitting those up makes one of the loops vectorizable):

  double s[N];
  for (int i = 0; i < N; i += 1)
    s[i] = ...;
  for (int i = 0; i < N; i += 1)
    ... = s[i];

that is, requiring an unbounded amount of additional memory. That's probably not something we want.

B) Scope of Analysis
Modeling entire functions has the advantage that the analysis does not have to choose a unit of transformation (Polly: SCoPs, LoopVectorizer: a loop); instead, the transformations decide themselves what they are modifying. I'd also like to synchronize the unit of modeling. Polly's SCoPs are based on SESE regions. llvm::Region and llvm::Loop share no hierarchy; regions and loops may partially overlap. It is possible to represent multiple-exit loops in the polyhedral model.

C) Control-Flow Modeling
LoopInfo, Polly's schedule tree, and VPlan's hierarchical CFG all use a tree-like representation of control flow. I hope to find a common one. Using execution conditions 'flattens' the loop body. Conditions can be whether a previous statement has been executed or not (control dependence), so that the original CFG can be reconstructed. Affine conditions can be directly represented in the polyhedral model; for others, 'always executing' is an overapproximation, and it is checked at runtime whether the statement is really executed. This allows more flexibility than Polly's SESE-subregion macro-statement approach. Whether a condition is affine or not also has no influence on other transformations. VPlan also uses SESE in its CFG representation, but I think it can also make use of this model.

D) Access Modeling
There is a common interface for both representations. Modeling as an affine function is required for the polyhedral model and is more powerful, but it assumes arbitrary-precision integers, so additional assumptions might need to be added. The SCEV representation can (when linear/affine) be converted to an isl_pw_aff when needed, or the isl_pw_aff can be computed directly, as Johannes suggests. However, I did not observe that SCEV was the limiting part in Polly, and it could be extended to support conditional operators as well.
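A rough sketch of what such a dual representation could look like; every name below is invented for illustration and none of it is actual Polly or LLVM API:

  #include "llvm/ADT/Optional.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Analysis/ScalarEvolution.h"
  #include "isl-noexceptions.h" // isl's C++ bindings, as shipped with Polly

  // One subscript per array dimension; the affine form is only
  // materialized when the subscript actually is (piece-wise) affine.
  struct Subscript {
    const llvm::SCEV *AsSCEV;             // always available
    llvm::Optional<isl::pw_aff> AsAffine; // built on demand
  };

  struct ArrayAccess {
    llvm::Value *BasePtr;                 // invariant base pointer
    llvm::SmallVector<Subscript, 4> Dims; // outermost dimension first
  };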
The builtin might not be useful for C arrays, where access is defined as row-major and always linearized. But libraries could make use of the builtin, for instance a Matrix class with an overloaded operator:

  double &operator()(int x, int y) {
  #if __has_builtin(__builtin_subscript)
    return __builtin_subscript(data, x, y);
  #else
    return data[x*n + y];
  #endif
  }

A special case is jagged arrays, i.e., pointers to pointers to arrays, which are very common in preexisting code bases. To ensure non-aliasing, we would need a pairwise check for aliasing of every subarray, which has polynomial complexity by itself. A builtin/intrinsic that guarantees the non-aliasing of subarrays would allow us to transform these cases as well. If there is a common interface for both cases, transformations work transparently on either representation.

E) Dependency Analysis
Again, there is a common interface for all three kinds of dependency representations, e.g., "does statement instance S(x,y) depend on instance T(x)?". Instances can, like memory subscripts, be described as SCEVs or as affine functions over the surrounding loop counters. Representations can also be converted to either type, possibly overapproximating the dependencies. The precision can be selected by compiler switches or the optimization level. The goal is to allow any loop transformation (e.g., the vectorizer) to use any dependency precision. For the polyhedral representation, we could introduce an intermediate layer that allows plugging in other integer-LP solvers and returning approximations of the true result. Dependencies can be arbitrarily overapproximated, which inhibits transformations but stays correct. Alternative implementations might just allow rectangles or octahedra [6] instead of polyhedra, which do not have exponential worst-case behavior. Polly at the moment does not model register or control dependencies. I hope that adding these kinds of dependencies would allow developing polyhedral schedulers that keep the definition and the uses of a virtual register in the same loop.

F) Assumption Tracking
The purpose of a shared assumption tracker is that only a single runtime condition check is needed after all transformations have been applied.

G) Transformation
Notice that polyhedral scheduling is just one of the possible transformation passes. Even in Polly, some transformations (tiling, vectorization, matrix-multiplication) are post-optimizations on the schedule tree and can be implemented on other AST representations as well. No knowledge of the polyhedral model is required to implement a new transformation. VPlan allows creating multiple plans at the same time and selecting the best according to a heuristic. However, some transformations might be easier to implement by modifying the underlying IR in the cloned function, e.g., removing conditionals that lead to exceptional cases (the non-performance-relevant printf in the example). Otherwise we could also select the most profitable transformed AST according to a cost function.

H) Code Generation
Generating code into a separate function has some advantages. First, we control the environment: input base pointers can be annotated using the noalias qualifier without metadata. Code generation is easier because there are fewer special cases like multiple exit edges, llvm.lifetime.begin/end intrinsics, etc. And register pressure is separate for the presumably hot code and the rest of the function.
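A minimal sketch of that outlining-plus-versioning shape; all names below are invented for illustration:

  // Outlined, specialized clone: assumptions are encoded directly in the
  // environment, e.g. __restrict for the non-aliasing assumption.
  void nest_optimized(double *__restrict C, const double *__restrict A,
                      const double *__restrict B, long nx, long ny);
  // Unmodified fallback (still handles the 'verbose' path).
  void nest_original(double *C, const double *A, const double *B,
                     long nx, long ny, int verbose);
  // Hypothetical helper standing in for the generated pairwise
  // overlap checks.
  bool disjoint(const double *C, const double *A, const double *B,
                long nx, long ny);

  void nest(double *C, const double *A, const double *B,
            long nx, long ny, int verbose) {
    // One combined runtime check derived from the assumption tracker:
    if (!verbose && disjoint(C, A, B, nx, ny))
      nest_optimized(C, A, B, nx, ny);
    else
      nest_original(C, A, B, nx, ny, verbose);
  }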
I) Pass Manager Integration
The reason not to use LLVM's pass manager for the individual transformations is that the loop tree carries state that is not represented in the LLVM IR. The pass manager would be allowed to invalidate the pass holding the tree and would expect it to be recomputed the same way when instantiated again; this is not the case once the tree has been transformed. Polly does it this way and needs all its transformation passes to preserve all analyses to avoid making the pass manager drop and recompute ScopInfo and ScopDetection.

[1] http://llvm.org/devmtg/2017-03/assets/slides/introducing_vplan_to_the_loop_vectorizer.pdf
[2] http://llvm.org/devmtg/2017-10/#talk17
[3] http://isl.gforge.inria.fr/
[4] http://pluto-compiler.sourceforge.net/
[5] http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[6] https://tel.archives-ouvertes.fr/tel-00818764