thr3ads.net - llvm dev - [llvm-dev] [RFC] Polly Status and Integration [Sep 2017]

If this information is useful, please help other people find it:
Share via:

Hal Finkel via llvm-dev

2017-Sep-12 05:47 UTC

[llvm-dev] [RFC] Polly Status and Integration

On 09/11/2017 12:26 PM, Adam Nemet wrote:> Hi Hal, Tobias, Michael and others,
>
>> On Sep 1, 2017, at 11:47 AM, Hal Finkel via llvm-dev 
>> <llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>> wrote:
>>
>> **
>>
>> *Hi everyone,As you may know, stock LLVM does not provide the kind of 
>> advanced loop transformations necessary to provide good performance 
>> on many applications. LLVM's Polly project provides many of the 
>> required capabilities, including loop transformations such as 
>> fission, fusion, skewing, blocking/tiling, and interchange, all 
>> powered by state-of-the-art dependence analysis. Polly also provides 
>> automated parallelization and targeting of GPUs and
other**accelerators.*
>> *
>> Over the past year, Polly’s development has focused on robustness, 
>> correctness, and closer integration with LLVM. To highlight a few 
>> accomplishments:
>>
>>  *
>>     Polly now runs, by default, in the conceptually-proper place in
>>     LLVM’s pass pipeline (just before the loop vectorizer).
>>     Importantly, this means that its loop transformations are
>>     performed after inlining and other canonicalization, greatly
>>     increasing its robustness, and enabling its use on C++ code
>>     (where [] is often a function call before inlining).
>>  *
>>     Polly’s cost-modeling parameters, such as those describing the
>>     target’s memory hierarchy, are being integrated with
>>     TargetTransformInfo. This allows targets to properly override the
>>     modeling parameters and allows reuse of these parameters by other
>>     clients.
>>  *
>>     Polly’s method of handling signed division/remainder operations,
>>     which worked around lack of support in ScalarEvolution, is being
>>     replaced thanks to improvements being contributed to
>>     ScalarEvolution itself (see D34598). Polly’s core delinearization
>>     routines have long been a part of LLVM itself.
>>  *
>>     PolyhedralInfo, which exposes a subset of Polly’s loop analysis
>>     for use by other clients, is now available.
>>  *
>>     Polly is now part of the LLVM release process and is being
>>     included with LLVM by various packagers (e.g., Debian).
>>
>>
>> I believe that the LLVM community would benefit from beginning the 
>> process of integrating Polly with LLVM itself and continuing its 
>> development as part of our main code base. This will:
>>
>>  *
>>     Allow for wider adoption of LLVM within communities relying on
>>     advanced loop transformations.
>>  *
>>     Provide for better community feedback on, and testing of, the
>>     code developed (although the story in this regard is already
>>     fairly solid).
>>  *
>>     Better motivate targets to provide accurate, comprehensive,
>>     modeling parameters for use by advanced loop transformations.
>>  *
>>     Perhaps most importantly, this will allow us to develop and tune
>>     the rest of the optimizer assuming that Polly’s capabilities are
>>     present (the underlying analysis, and eventually, the
>>     transformations themselves).
>>
>>
>> The largest issue on which community consensus is required, in order 
>> to move forward at all, is what to do with isl. isl, the Integer Set 
>> Library, provides core functionality on which Polly depends. It is a 
>> C library, and while some Polly/LLVM developers are also isl 
>> developers, it has a large user community outside of LLVM/Polly. A 
>> C++ interface was recently added, and Polly is transitioning to use 
>> the C++ interface. Nevertheless, options here include rewriting the 
>> needed functionality, forking isl and transitioning our fork toward 
>> LLVM coding conventions (and data structures) over time, and 
>> incorporating isl more-or-less as-is to avoid partitioning its 
>> development.
>>
>> That having been said, isl is internally modular, and regardless of 
>> the overall integration strategy, the Polly developers anticipate 
>> specializing, or even replacing, some of these components with 
>> LLVM-specific solutions. This is especially true for anything that 
>> touches performance-related heuristics and modeling. LLVM-specific, 
>> or even target-specific, loop schedulers may be developed as well.
>>
>> Even though some developers in the LLVM community already have a 
>> background in polyhedral-modeling techniques, the Polly developers 
>> have developed, and are still developing, extensive tutorials on this 
>> topic http://pollylabs.org/education.htmland especially 
>> http://playground.pollylabs.org
<http://playground.pollylabs.org/>.
>>
>> Finally, let me highlight a few ongoing development efforts in Polly 
>> that are potentially relevant to this discussion. Polly’s loop 
>> analysis is sound and technically superior to what’s in LLVM 
>> currently (i.e. in LoopAccessAnalysis and DependenceAnalysis). There 
>> are, however, two known reasons why Polly’s transformations could not 
>> yet be enabled by default:
>>
>>  *
>>     A correctness issue: Currently, Polly assumes that 64 bits is
>>     large enough for all new loop-induction variables and index
>>     expressions. In rare cases, transformations could be performed
>>     where more bits are required. Preconditions need to be generated
>>     preventing this (e.g., D35471).
>>  *
>>     A performance issue: Polly currently models temporal locality
>>     (i.e., it tries to get better reuse in time), but does not model
>>     spatial locality (i.e., it does not model cache-line reuse). As a
>>     result, it can sometimes introduce performance regressions. Polly
>>     Labs is currently working on integrating spatial locality
>>     modeling into the loop optimization model.
>>
>> Polly can already split apart basic blocks in order to implement loop 
>> fusion. Heuristics to choose at which granularity are still being 
>> implemented (e.g., PR12402).
>> I believe that we can now develop a concrete plan for moving 
>> state-of-the-art loop optimizations, based on the technology in the 
>> Polly project, into LLVM. Doing so will enable LLVM to be competitive 
>> with proprietary compilers in high-performance computing, machine 
>> learning, and other important application domains. I’d like community 
>> feedback on what**should be part of that plan.
>> *
>
> One thing that I’d like to see more details on is what this means for 
> the evolution of loop transformations in LLVM.
>
> Our more-or-less established direction was so far to incrementally 
> improve and generalize the required analyses (e.g. the 
> LoopVectorizer’s dependence analysis + loop versioning analysis into a 
> stand-alone analysis pass (LoopAccessAnalysis)) and then build new 
> transformations (e.g. LoopDistribution, LoopLoadElimination, 
> LICMLoopVersioning) on top of these.
>
> The idea was that infrastructure would be incrementally improved from 
> two directions:
>
> - As new transformations are built analyses have to be improved (e.g. 
> past improvements to LAA to support the LoopVersioning utility, future 
> improvements for full LoopSROA beyond just store->load forwarding [1] 
> or the improvements to LAA for the LoopFusion proposal[2])
>
> - As more complex loops would have to be analyzed we either improve 
> LAA or make DependenceAnalysis a drop-in replacement for the memory 
> analysis part in LAA
Or we could use Polly's dependence analysis, which I believe to be more 
powerful, more robust, and more correct than DependenceAnalysis. I 
believe that the difficult part here is actually the pairing with 
predicated SCEV or whatever mechanism we want to use generate runtime 
predicates (this applies to use of DependenceAnalysis too).
>
> While this model may be slow it has all the benefits of the 
> incremental development model.
The current model may have been slow in many areas, but I think that's 
mostly a question of development effort. My largest concern about the 
current model is that, to the extent that we're implementing classic 
loop transformations (e.g., fusion, distribution, interchange, skewing, 
tiling, and so on), we're repeating a historical design that is known to 
have several suboptimal properties. Chief among them is the lack of 
integration: many of these transformations are interconnected, and 
there's no good pass ordering in which to make independent decisions. 
Many of these transformations can be captured in a single model and we 
can get much better results by integrating them. There's also the matter 
of whether building these transformation on SCEV (or IR directly) is the 
best underlying infrastructure, or whether parts of Polly would be better.

That having been said, I think that integrating this technology into 
LLVM will also mean applying appropriate modularity. I think that we'll 
almost definitely want to make use of the dependence analysis separately 
as an analysis. We'll want to decide which of these transformations will 
be considered canonicalization (and run in the iterative pipeline) and 
which will be lowering (and run near the vectorizer). LoopSROA certainly 
sounds to me like canonicalization, but loop fusion might also fall into 
that category (i.e., we might want to fuse early to enable optimizations 
and then split late).
>
> Then there is the question of use cases.  It’s fairly obvious that 
> anybody wanting to optimize a 5-deep highly regular loop-nest 
> operating on arrays should use Polly.  On the other hand it’s way less 
> clear that we should use it for singly or doubly nested not-so-regular 
> loops which are the norm in non-HPC workloads.
This is clearly a good question, but thinking about Polly as a set of 
components, not as a monolithic transformation component, I think that 
polyhedral analysis and transformations can underlie a lot of the 
transformations we need for non-HPC code (and, which I'll point out, we 
need for modern HPC code too). In practice, the loops that we can 
actually analyze have affine dependencies, and Polly does, or can do, a 
better job at generating runtime predicates and dealing with 
piecewise-linear expressions than our current infrastructure.

In short, I look at Polly as two things: First, an infrastructure for 
dealing with loop analysis and transformation. I view this as being 
broadly applicable. Second, an application of that to apply 
cost-model-driven classic loop transformations. To some extent this is 
going to be more useful for HPC codes, but also applies to machine 
learning, signal processing, graphics, and other areas.
>
> And this brings me to the maintenance question.  Is it reasonable to 
> expect people to fix Polly when they have a seemingly unrelated change 
> that happens to break a Polly bot.
The eventual goal here is to have this technology in appropriate parts 
of the main pipeline, and so the question here is not really about 
breaking a "Polly bot", but just about a "bot" in general.
I've given
this question some thought and I think it sits in a reasonable place in 
the risk-reward space. The answer would be, yes, we'd need to treat this 
like any other part of the pipeline. However, I believe that Polly has 
as many, or more, active contributors than essentially any other 
individual part of the mid-level optimizer or CodeGen. As a result, 
there will be people around in many time zones to help with problems 
with Polly-related code.
>  As far as I know, there were companies in the past that tried Polly 
> without a whole lot of prior experience.  It would be great to hear 
> what the experience was before adopting Polly at a much larger scale.
I'm also interested, although I'll caution against over-interpreting any
evidence here (positive or negative). Before a few weeks ago, Polly 
didn't effectively run in the pipeline after inlining, and so I doubt it 
would have been much use outside of embedded environments (and maybe 
some HPC environments) with straightforwardly-presented C code. It's 
only now that this has been fixed that I find the possibility of 
integrating this in production interesting.

Thanks again,
Hal
>
> Adam
>
> [1] http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html
> [2] http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html
>
>
>> *
>> Sincerely,
>> Hal (on behalf of myself, Tobias Grosser, and Michael Kruse, with 
>> feedback from**several other active Polly developers)
>>
>> We thank the numerous people who have contributed to the Polly 
>> infrastructure:Alexandre Isoard, Andreas Simbuerger, Andy Gibbs, 
>> Annanay Agarwal, ArminGroesslinger, Ajith Pandel, Baranidharan Mohan, 
>> Benjamin Kramer, BillWendling, Chandler Carruth, Craig Topper, Chris 
>> Jenneisch, ChristianBielert, Daniel Dunbar, Daniel Jasper, David 
>> Blaikie, David Peixotto,Dmitry N. Mikushin, Duncan P. N. Exon Smith, 
>> Eli Friedman, EugeneZelenko, George Burgess IV, Hans Wennborg, 
>> Hongbin Zheng, Huihui Zhang,Jakub Kuderski, Johannes Doerfert, Justin 
>> Bogner, Karthik Senthil, LoganChien, Lawrence Hu, Mandeep Singh 
>> Grang, Matt Arsenault, MatthewSimpson, Mehdi Amini, Micah Villmow, 
>> Michael Kruse, Matthias Reisinger,Maximilian Falkenstein, Nakamura 
>> Takumi, Nandini Singhal, NicolasBonfante, Patrik Hägglund, Paul 
>> Robinson, Philip Pfaffe, Philipp Schaad,Peter Conn, Pratik Bhatu, 
>> Rafael Espindola, Raghesh Aloor, ReidKleckner, Roal Jordans, Richard 
>> Membarth, Roman Gareev, SaleemAbdulrasool, Sameer Sahasrabuddhe, 
>> Sanjoy Das, Sameer AbuAsal, SamNovak, Sebastian Pop, Siddharth Bhat, 
>> Singapuram Sanjay Srivallabh,Sumanth Gundapaneni, Sunil Srivastava, 
>> Sylvestre Ledru, Star Tan, TanyaLattner, Tim Shen, Tarun Ranjendran, 
>> Theodoros Theodoridis, Utpal Bora,Wei Mi, Weiming Zhao, and Yabin Hu.*
>>
>> -- 
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170912/5f981c31/attachment-0001.html>

Gerolf Hoflehner via llvm-dev

2017-Sep-13 03:26 UTC

head link

[llvm-dev] [RFC] Polly Status and Integration

> On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> 
> On 09/11/2017 12:26 PM, Adam Nemet wrote:
>> Hi Hal, Tobias, Michael and others,
>> 
>>> On Sep 1, 2017, at 11:47 AM, Hal Finkel via llvm-dev <llvm-dev
at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>>> 
>>> 
>>> 
>>> Hi everyone,
>>> 
>>> As you may know, stock LLVM does not provide the kind of advanced
loop transformations necessary to provide good performance on many applications.
LLVM's Polly project provides many of the required capabilities, including
loop transformations such as fission, fusion, skewing, blocking/tiling, and
interchange, all powered by state-of-the-art dependence analysis. Polly also
provides automated parallelization and targeting of GPUs and other accelerators.
>>> 
>>> Over the past year, Polly’s development has focused on robustness,
correctness, and closer integration with LLVM. To highlight a few
accomplishments:
>>> 
>>> Polly now runs, by default, in the conceptually-proper place in
LLVM’s pass pipeline (just before the loop vectorizer). Importantly, this means
that its loop transformations are performed after inlining and other
canonicalization, greatly increasing its robustness, and enabling its use on C++
code (where [] is often a function call before inlining).
>>> Polly’s cost-modeling parameters, such as those describing the
target’s memory hierarchy, are being integrated with TargetTransformInfo. This
allows targets to properly override the modeling parameters and allows reuse of
these parameters by other clients.
>>> Polly’s method of handling signed division/remainder operations,
which worked around lack of support in ScalarEvolution, is being replaced thanks
to improvements being contributed to ScalarEvolution itself (see D34598).
Polly’s core delinearization routines have long been a part of LLVM itself.
>>> PolyhedralInfo, which exposes a subset of Polly’s loop analysis for
use by other clients, is now available.
>>> Polly is now part of the LLVM release process and is being included
with LLVM by various packagers (e.g., Debian).
>>> 
>>> I believe that the LLVM community would benefit from beginning the
process of integrating Polly with LLVM itself and continuing its development as
part of our main code base. This will:
>>> Allow for wider adoption of LLVM within communities relying on
advanced loop transformations.
>>> Provide for better community feedback on, and testing of, the code
developed (although the story in this regard is already fairly solid).
>>> Better motivate targets to provide accurate, comprehensive,
modeling parameters for use by advanced loop transformations.
>>> Perhaps most importantly, this will allow us to develop and tune
the rest of the optimizer assuming that Polly’s capabilities are present (the
underlying analysis, and eventually, the transformations themselves).
>>> 
>>> The largest issue on which community consensus is required, in
order to move forward at all, is what to do with isl. isl, the Integer Set
Library, provides core functionality on which Polly depends. It is a C library,
and while some Polly/LLVM developers are also isl developers, it has a large
user community outside of LLVM/Polly. A C++ interface was recently added, and
Polly is transitioning to use the C++ interface. Nevertheless, options here
include rewriting the needed functionality, forking isl and transitioning our
fork toward LLVM coding conventions (and data structures) over time, and
incorporating isl more-or-less as-is to avoid partitioning its development.
>>> 
>>> That having been said, isl is internally modular, and regardless of
the overall integration strategy, the Polly developers anticipate specializing,
or even replacing, some of these components with LLVM-specific solutions. This
is especially true for anything that touches performance-related heuristics and
modeling. LLVM-specific, or even target-specific, loop schedulers may be
developed as well.
>>> 
>>> Even though some developers in the LLVM community already have a
background in polyhedral-modeling techniques, the Polly developers have
developed, and are still developing, extensive tutorials on this topic
http://pollylabs.org/education.html <http://pollylabs.org/education.html>
and especially http://playground.pollylabs.org
<http://playground.pollylabs.org/>.
>>> 
>>> Finally, let me highlight a few ongoing development efforts in
Polly that are potentially relevant to this discussion. Polly’s loop analysis is
sound and technically superior to what’s in LLVM currently (i.e. in
LoopAccessAnalysis and DependenceAnalysis). There are, however, two known
reasons why Polly’s transformations could not yet be enabled by default:
>>> A correctness issue: Currently, Polly assumes that 64 bits is large
enough for all new loop-induction variables and index expressions. In rare
cases, transformations could be performed where more bits are required.
Preconditions need to be generated preventing this (e.g., D35471).
>>> A performance issue: Polly currently models temporal locality
(i.e., it tries to get better reuse in time), but does not model spatial
locality (i.e., it does not model cache-line reuse). As a result, it can
sometimes introduce performance regressions. Polly Labs is currently working on
integrating spatial locality modeling into the loop optimization model.
>>> Polly can already split apart basic blocks in order to implement
loop fusion. Heuristics to choose at which granularity are still being
implemented (e.g., PR12402).
>>> 
>>> I believe that we can now develop a concrete plan for moving
state-of-the-art loop optimizations, based on the technology in the Polly
project, into LLVM. Doing so will enable LLVM to be competitive with proprietary
compilers in high-performance computing, machine learning, and other important
application domains. I’d like community feedback on what should be part of that
plan.
>> 
>> One thing that I’d like to see more details on is what this means for
the evolution of loop transformations in LLVM.
>> 
>> Our more-or-less established direction was so far to incrementally
improve and generalize the required analyses (e.g. the LoopVectorizer’s
dependence analysis + loop versioning analysis into a stand-alone analysis pass
(LoopAccessAnalysis)) and then build new transformations (e.g. LoopDistribution,
LoopLoadElimination, LICMLoopVersioning) on top of these.
>> 
>> The idea was that infrastructure would be incrementally improved from
two directions:
>> 
>> - As new transformations are built analyses have to be improved (e.g.
past improvements to LAA to support the LoopVersioning utility, future
improvements for full LoopSROA beyond just store->load forwarding [1] or the
improvements to LAA for the LoopFusion proposal[2])
>> 
>> - As more complex loops would have to be analyzed we either improve LAA
or make DependenceAnalysis a drop-in replacement for the memory analysis part in
LAA
> 
> Or we could use Polly's dependence analysis, which I believe to be more
powerful, more robust, and more correct than DependenceAnalysis. I believe that
the difficult part here is actually the pairing with predicated SCEV or whatever
mechanism we want to use generate runtime predicates (this applies to use of
DependenceAnalysis too).
What is a good way to measure these assertions (More powerful, more robust)? Are
you saying the LLVM Dependence Analysis is incorrect or do you actually mean
less conservative (or "more accurate" or something like
that)?> 
>> 
>> While this model may be slow it has all the benefits of the incremental
development model.
> 
> The current model may have been slow in many areas, but I think that's
mostly a question of development effort. My largest concern about the current
model is that, to the extent that we're implementing classic loop
transformations (e.g., fusion, distribution, interchange, skewing, tiling, and
so on), we're repeating a historical design that is known to have several
suboptimal properties. Chief among them is the lack of integration: many of
these transformations are interconnected, and there's no good pass ordering
in which to make independent decisions. Many of these transformations can be
captured in a single model and we can get much better results by integrating
them. There's also the matter of whether building these transformation on
SCEV (or IR directly) is the best underlying infrastructure, or whether parts of
Polly would be better.
I believe that is true. What I wonder is is there a good method to reason about
it? Perhaps concrete examples or perhaps opt-viewer based comparisons on large
sets of benchmarks? In the big picture you could make such a modeling argument
for all compiler optimizations.> 
> That having been said, I think that integrating this technology into LLVM
will also mean applying appropriate modularity. I think that we'll almost
definitely want to make use of the dependence analysis separately as an
analysis. We'll want to decide which of these transformations will be
considered canonicalization (and run in the iterative pipeline) and which will
be lowering (and run near the vectorizer). LoopSROA certainly sounds to me like
canonicalization, but loop fusion might also fall into that category (i.e., we
might want to fuse early to enable optimizations and then split late).
> 
>> 
>> Then there is the question of use cases.  It’s fairly obvious that
anybody wanting to optimize a 5-deep highly regular loop-nest operating on
arrays should use Polly.  On the other hand it’s way less clear that we should
use it for singly or doubly nested not-so-regular loops which are the norm in
non-HPC workloads.
> 
> This is clearly a good question, but thinking about Polly as a set of
components, not as a monolithic transformation component, I think that
polyhedral analysis and transformations can underlie a lot of the
transformations we need for non-HPC code (and, which I'll point out, we need
for modern HPC code too). In practice, the loops that we can actually analyze
have affine dependencies, and Polly does, or can do, a better job at generating
runtime predicates and dealing with piecewise-linear expressions than our
current infrastructure.
> 
> In short, I look at Polly as two things: First, an infrastructure for
dealing with loop analysis and transformation. I view this as being broadly
applicable. Second, an application of that to apply cost-model-driven classic
loop transformations. To some extent this is going to be more useful for HPC
codes, but also applies to machine learning, signal processing, graphics, and
other areas.I’m wondering if it could be used for pointing out headroom for the existing
LLVM ecosystem (*)
> 
>> 
>> And this brings me to the maintenance question.  Is it reasonable to
expect people to fix Polly when they have a seemingly unrelated change that
happens to break a Polly bot.
> 
> The eventual goal here is to have this technology in appropriate parts of
the main pipeline, and so the question here is not really about breaking a
"Polly bot", but just about a "bot" in general. I've
given this question some thought and I think it sits in a reasonable place in
the risk-reward space. The answer would be, yes, we'd need to treat this
like any other part of the pipeline. However, I believe that Polly has as many,
or more, active contributors than essentially any other individual part of the
mid-level optimizer or CodeGen. As a result, there will be people around in many
time zones to help with problems with Polly-related code.
> 
>>  As far as I know, there were companies in the past that tried Polly
without a whole lot of prior experience.  It would be great to hear what the
experience was before adopting Polly at a much larger scale.
> 
> I'm also interested, although I'll caution against
over-interpreting any evidence here (positive or negative). Before a few weeks
ago, Polly didn't effectively run in the pipeline after inlining, and so I
doubt it would have been much use outside of embedded environments (and maybe
some HPC environments) with straightforwardly-presented C code. It's only
now that this has been fixed that I find the possibility of integrating this in
production interesting.
That is a good point. There are also biases independent of past experiences (for
disclosure mine is (*) above). But I think it is objective to say a Polly
integration is a big piece to swallow.Your pro-Polly argument lists a number of
categories that I think could be reasoned about individually and partly
evaluated with a data-driven approach:
A) Architecture
- support for autoparallelism
- support for accelerators
- isl- rewrite? etc
...
B) Modelling
- polyhedral model
- temporal locality
- spatial locality 
…
C) Analysis/Optimizations
- Dependence Analysis
- Transformation effective/power (loop nests, quality of transformations,
#vectorizable loops etc)

A) is mostly Polly independent (except for the isl question I guess). For B and
C performance/ compile-time /opt-viewer data on a decent/wide range of
benchmarks possibly at different optimization levels (O2, O3, LTO, PGO etc and
combinations) should provide data-driven insight into costs/benefits.

Cheers
Gerolf



> 
> Thanks again,
> Hal 
> 
>> 
>> Adam
>> 
>> [1] http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html
<http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html>
>> [2] http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html
<http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html>
>> 
>> 
>>> 
>>> Sincerely,
>>> Hal (on behalf of myself, Tobias Grosser, and Michael Kruse, with
feedback from several other active Polly developers)
>>> 
>>> We thank the numerous people who have contributed to the Polly
infrastructure:
>>> 
>>> Alexandre Isoard, Andreas Simbuerger, Andy Gibbs, Annanay Agarwal,
Armin
>>> Groesslinger, Ajith Pandel, Baranidharan Mohan, Benjamin Kramer,
Bill
>>> Wendling, Chandler Carruth, Craig Topper, Chris Jenneisch,
Christian
>>> Bielert, Daniel Dunbar, Daniel Jasper, David Blaikie, David
Peixotto,
>>> Dmitry N. Mikushin, Duncan P. N. Exon Smith, Eli Friedman, Eugene
>>> Zelenko, George Burgess IV, Hans Wennborg, Hongbin Zheng, Huihui
Zhang,
>>> Jakub Kuderski, Johannes Doerfert, Justin Bogner, Karthik Senthil,
Logan
>>> Chien, Lawrence Hu, Mandeep Singh Grang, Matt Arsenault, Matthew
>>> Simpson, Mehdi Amini, Micah Villmow, Michael Kruse, Matthias
Reisinger,
>>> Maximilian Falkenstein, Nakamura Takumi, Nandini Singhal, Nicolas
>>> Bonfante, Patrik Hägglund, Paul Robinson, Philip Pfaffe, Philipp
Schaad,
>>> Peter Conn, Pratik Bhatu, Rafael Espindola, Raghesh Aloor, Reid
>>> Kleckner, Roal Jordans, Richard Membarth, Roman Gareev, Saleem
>>> Abdulrasool, Sameer Sahasrabuddhe, Sanjoy Das, Sameer AbuAsal, Sam
>>> Novak, Sebastian Pop, Siddharth Bhat, Singapuram Sanjay Srivallabh,
>>> Sumanth Gundapaneni, Sunil Srivastava, Sylvestre Ledru, Star Tan,
Tanya
>>> Lattner, Tim Shen, Tarun Ranjendran, Theodoros Theodoridis, Utpal
Bora,
>>> Wei Mi, Weiming Zhao, and Yabin Hu.
>>> 
>>> -- 
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>> 
> 
> -- 
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170912/658453a0/attachment-0001.html>

C Bergström via llvm-dev

2017-Sep-13 07:16 UTC

head link

[llvm-dev] [RFC] Polly Status and Integration

A completely non-technical point, but what's the current "polly"
license?
Does integrating that code conflict in any way with the work being done to
relicense llvm?

Does adding polly expose any additional legal risks? Some people from
Reservoir labs have explicitly stated to me that some of their patents
target polyhedral optimizations. You should almost certainly review their
portfolio or contact them.

If at some point someone wants to add real loop optimizations - will there
be a conflict?

What's the DWARF look like after poly transformations?

The talk about performance is pretty light - It would be good to get
something besides just a handful of spotlight known codes. Also code size,
compilation speed. etc
------------
flag bikeshed - If it's not ready for -O3 - create specific flags to
specific poly passes. Creating yet another micro flag like -O3poly just
doesn't make sense to me. (keep it simple.) When it's really really
ready,
enable it with the rest of the loop heavy passes.




On Wed, Sep 13, 2017 at 11:26 AM, Gerolf Hoflehner via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
>
> On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>
> On 09/11/2017 12:26 PM, Adam Nemet wrote:
>
> Hi Hal, Tobias, Michael and others,
>
> On Sep 1, 2017, at 11:47 AM, Hal Finkel via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>
> *Hi everyone, As you may know, stock LLVM does not provide the kind of
> advanced loop transformations necessary to provide good performance on many
> applications. LLVM's Polly project provides many of the required
> capabilities, including loop transformations such as fission, fusion,
> skewing, blocking/tiling, and interchange, all powered by state-of-the-art
> dependence analysis. Polly also provides automated parallelization and
> targeting of GPUs and other accelerators.*
>
>
>
>
>
>
>
>
>
>
>
>
>
> * Over the past year, Polly’s development has focused on robustness,
> correctness, and closer integration with LLVM. To highlight a few
> accomplishments: - Polly now runs, by default, in the conceptually-proper
> place in LLVM’s pass pipeline (just before the loop vectorizer).
> Importantly, this means that its loop transformations are performed after
> inlining and other canonicalization, greatly increasing its robustness, and
> enabling its use on C++ code (where [] is often a function call before
> inlining). - Polly’s cost-modeling parameters, such as those describing the
> target’s memory hierarchy, are being integrated with TargetTransformInfo.
> This allows targets to properly override the modeling parameters and allows
> reuse of these parameters by other clients. - Polly’s method of handling
> signed division/remainder operations, which worked around lack of support
> in ScalarEvolution, is being replaced thanks to improvements being
> contributed to ScalarEvolution itself (see D34598). Polly’s core
> delinearization routines have long been a part of LLVM itself. -
> PolyhedralInfo, which exposes a subset of Polly’s loop analysis for use by
> other clients, is now available. - Polly is now part of the LLVM release
> process and is being included with LLVM by various packagers (e.g.,
> Debian). I believe that the LLVM community would benefit from beginning the
> process of integrating Polly with LLVM itself and continuing its
> development as part of our main code base. This will: - Allow for wider
> adoption of LLVM within communities relying on advanced loop
> transformations. - Provide for better community feedback on, and testing
> of, the code developed (although the story in this regard is already fairly
> solid). - Better motivate targets to provide accurate, comprehensive,
> modeling parameters for use by advanced loop transformations. - Perhaps
> most importantly, this will allow us to develop and tune the rest of the
> optimizer assuming that Polly’s capabilities are present (the underlying
> analysis, and eventually, the transformations themselves). The largest
> issue on which community consensus is required, in order to move forward at
> all, is what to do with isl. isl, the Integer Set Library, provides core
> functionality on which Polly depends. It is a C library, and while some
> Polly/LLVM developers are also isl developers, it has a large user
> community outside of LLVM/Polly. A C++ interface was recently added, and
> Polly is transitioning to use the C++ interface. Nevertheless, options here
> include rewriting the needed functionality, forking isl and transitioning
> our fork toward LLVM coding conventions (and data structures) over time,
> and incorporating isl more-or-less as-is to avoid partitioning its
> development. That having been said, isl is internally modular, and
> regardless of the overall integration strategy, the Polly developers
> anticipate specializing, or even replacing, some of these components with
> LLVM-specific solutions. This is especially true for anything that touches
> performance-related heuristics and modeling. LLVM-specific, or even
> target-specific, loop schedulers may be developed as well. Even though some
> developers in the LLVM community already have a background in
> polyhedral-modeling techniques, the Polly developers have developed, and
> are still developing, extensive tutorials on this topic
> http://pollylabs.org/education.html
<http://pollylabs.org/education.html>
> and especially http://playground.pollylabs.org
> <http://playground.pollylabs.org/>. Finally, let me highlight a few
ongoing
> development efforts in Polly that are potentially relevant to this
> discussion. Polly’s loop analysis is sound and technically superior to
> what’s in LLVM currently (i.e. in LoopAccessAnalysis and
> DependenceAnalysis). There are, however, two known reasons why Polly’s
> transformations could not yet be enabled by default: - A correctness issue:
> Currently, Polly assumes that 64 bits is large enough for all new
> loop-induction variables and index expressions. In rare cases,
> transformations could be performed where more bits are required.
> Preconditions need to be generated preventing this (e.g., D35471). - A
> performance issue: Polly currently models temporal locality (i.e., it tries
> to get better reuse in time), but does not model spatial locality (i.e., it
> does not model cache-line reuse). As a result, it can sometimes introduce
> performance regressions. Polly Labs is currently working on integrating
> spatial locality modeling into the loop optimization model. Polly can
> already split apart basic blocks in order to implement loop fusion.
> Heuristics to choose at which granularity are still being implemented
> (e.g., PR12402). I believe that we can now develop a concrete plan for
> moving state-of-the-art loop optimizations, based on the technology in the
> Polly project, into LLVM. Doing so will enable LLVM to be competitive with
> proprietary compilers in high-performance computing, machine learning, and
> other important application domains. I’d like community feedback on what
> should be part of that plan. *
>
>
> One thing that I’d like to see more details on is what this means for the
> evolution of loop transformations in LLVM.
>
> Our more-or-less established direction was so far to incrementally improve
> and generalize the required analyses (e.g. the LoopVectorizer’s dependence
> analysis + loop versioning analysis into a stand-alone analysis pass
> (LoopAccessAnalysis)) and then build new transformations (e.g.
> LoopDistribution, LoopLoadElimination, LICMLoopVersioning) on top of these.
>
> The idea was that infrastructure would be incrementally improved from two
> directions:
>
> - As new transformations are built analyses have to be improved (e.g. past
> improvements to LAA to support the LoopVersioning utility, future
> improvements for full LoopSROA beyond just store->load forwarding [1] or
> the improvements to LAA for the LoopFusion proposal[2])
>
> - As more complex loops would have to be analyzed we either improve LAA or
> make DependenceAnalysis a drop-in replacement for the memory analysis part
> in LAA
>
>
> Or we could use Polly's dependence analysis, which I believe to be more
> powerful, more robust, and more correct than DependenceAnalysis. I believe
> that the difficult part here is actually the pairing with predicated SCEV
> or whatever mechanism we want to use generate runtime predicates (this
> applies to use of DependenceAnalysis too).
>
>
> What is a good way to measure these assertions (More powerful, more
> robust)? Are you saying the LLVM Dependence Analysis is incorrect or do you
> actually mean less conservative (or "more accurate" or something
like that)?
>
>
>
> While this model may be slow it has all the benefits of the incremental
> development model.
>
>
> The current model may have been slow in many areas, but I think that's
> mostly a question of development effort. My largest concern about the
> current model is that, to the extent that we're implementing classic
loop
> transformations (e.g., fusion, distribution, interchange, skewing, tiling,
> and so on), we're repeating a historical design that is known to have
> several suboptimal properties. Chief among them is the lack of integration:
> many of these transformations are interconnected, and there's no good
pass
> ordering in which to make independent decisions. Many of these
> transformations can be captured in a single model and we can get much
> better results by integrating them. There's also the matter of whether
> building these transformation on SCEV (or IR directly) is the best
> underlying infrastructure, or whether parts of Polly would be better.
>
>
> I believe that is true. What I wonder is is there a good method to reason
> about it? Perhaps concrete examples or perhaps opt-viewer based comparisons
> on large sets of benchmarks? In the big picture you could make such a
> modeling argument for all compiler optimizations.
>
>
> That having been said, I think that integrating this technology into LLVM
> will also mean applying appropriate modularity. I think that we'll
almost
> definitely want to make use of the dependence analysis separately as an
> analysis. We'll want to decide which of these transformations will be
> considered canonicalization (and run in the iterative pipeline) and which
> will be lowering (and run near the vectorizer). LoopSROA certainly sounds
> to me like canonicalization, but loop fusion might also fall into that
> category (i.e., we might want to fuse early to enable optimizations and
> then split late).
>
>
> Then there is the question of use cases.  It’s fairly obvious that anybody
> wanting to optimize a 5-deep highly regular loop-nest operating on arrays
> should use Polly.  On the other hand it’s way less clear that we should use
> it for singly or doubly nested not-so-regular loops which are the norm in
> non-HPC workloads.
>
>
> This is clearly a good question, but thinking about Polly as a set of
> components, not as a monolithic transformation component, I think that
> polyhedral analysis and transformations can underlie a lot of the
> transformations we need for non-HPC code (and, which I'll point out, we
> need for modern HPC code too). In practice, the loops that we can actually
> analyze have affine dependencies, and Polly does, or can do, a better job
> at generating runtime predicates and dealing with piecewise-linear
> expressions than our current infrastructure.
>
> In short, I look at Polly as two things: First, an infrastructure for
> dealing with loop analysis and transformation. I view this as being broadly
> applicable. Second, an application of that to apply cost-model-driven
> classic loop transformations. To some extent this is going to be more
> useful for HPC codes, but also applies to machine learning, signal
> processing, graphics, and other areas.
>
> I’m wondering if it could be used for pointing out headroom for the
> existing LLVM ecosystem (*)
>
>
>
> And this brings me to the maintenance question.  Is it reasonable to
> expect people to fix Polly when they have a seemingly unrelated change that
> happens to break a Polly bot.
>
>
> The eventual goal here is to have this technology in appropriate parts of
> the main pipeline, and so the question here is not really about breaking a
> "Polly bot", but just about a "bot" in general.
I've given this question
> some thought and I think it sits in a reasonable place in the risk-reward
> space. The answer would be, yes, we'd need to treat this like any other
> part of the pipeline. However, I believe that Polly has as many, or more,
> active contributors than essentially any other individual part of the
> mid-level optimizer or CodeGen. As a result, there will be people around in
> many time zones to help with problems with Polly-related code.
>
>  As far as I know, there were companies in the past that tried Polly
> without a whole lot of prior experience.  It would be great to hear what
> the experience was before adopting Polly at a much larger scale.
>
>
> I'm also interested, although I'll caution against
over-interpreting any
> evidence here (positive or negative). Before a few weeks ago, Polly
didn't
> effectively run in the pipeline after inlining, and so I doubt it would
> have been much use outside of embedded environments (and maybe some HPC
> environments) with straightforwardly-presented C code. It's only now
that
> this has been fixed that I find the possibility of integrating this in
> production interesting.
>
>
> That is a good point. There are also biases independent of past
> experiences (for disclosure mine is (*) above). But I think it is objective
> to say a Polly integration is a big piece to swallow.Your pro-Polly
> argument lists a number of categories that I think could be reasoned about
> individually and partly evaluated with a data-driven approach:
> A) Architecture
> - support for autoparallelism
> - support for accelerators
> - isl- rewrite? etc
> ...
> B) Modelling
> - polyhedral model
> - temporal locality
> - spatial locality
> …
> C) Analysis/Optimizations
> - Dependence Analysis
> - Transformation effective/power (loop nests, quality of transformations,
> #vectorizable loops etc)
>
> A) is mostly Polly independent (except for the isl question I guess). For
> B and C performance/ compile-time /opt-viewer data on a decent/wide range
> of benchmarks possibly at different optimization levels (O2, O3, LTO, PGO
> etc and combinations) should provide data-driven insight into
> costs/benefits.
>
> Cheers
> Gerolf
>
>
>
>
>
> Thanks again,
> Hal
>
>
> Adam
>
> [1] http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html
> [2] http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html
>
>
>
>
> * Sincerely, Hal (on behalf of myself, Tobias Grosser, and Michael Kruse,
> with feedback from several other active Polly developers) We thank the
> numerous people who have contributed to the Polly infrastructure: Alexandre
> Isoard, Andreas Simbuerger, Andy Gibbs, Annanay Agarwal, Armin
> Groesslinger, Ajith Pandel, Baranidharan Mohan, Benjamin Kramer, Bill
> Wendling, Chandler Carruth, Craig Topper, Chris Jenneisch, Christian
> Bielert, Daniel Dunbar, Daniel Jasper, David Blaikie, David Peixotto,
> Dmitry N. Mikushin, Duncan P. N. Exon Smith, Eli Friedman, Eugene Zelenko,
> George Burgess IV, Hans Wennborg, Hongbin Zheng, Huihui Zhang, Jakub
> Kuderski, Johannes Doerfert, Justin Bogner, Karthik Senthil, Logan Chien,
> Lawrence Hu, Mandeep Singh Grang, Matt Arsenault, Matthew Simpson, Mehdi
> Amini, Micah Villmow, Michael Kruse, Matthias Reisinger, Maximilian
> Falkenstein, Nakamura Takumi, Nandini Singhal, Nicolas Bonfante, Patrik
> Hägglund, Paul Robinson, Philip Pfaffe, Philipp Schaad, Peter Conn, Pratik
> Bhatu, Rafael Espindola, Raghesh Aloor, Reid Kleckner, Roal Jordans,
> Richard Membarth, Roman Gareev, Saleem Abdulrasool, Sameer Sahasrabuddhe,
> Sanjoy Das, Sameer AbuAsal, Sam Novak, Sebastian Pop, Siddharth Bhat,
> Singapuram Sanjay Srivallabh, Sumanth Gundapaneni, Sunil Srivastava,
> Sylvestre Ledru, Star Tan, Tanya Lattner, Tim Shen, Tarun Ranjendran,
> Theodoros Theodoridis, Utpal Bora, Wei Mi, Weiming Zhao, and Yabin Hu.*
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170913/5aa092a9/attachment.html>

Sebastian Pop via llvm-dev

2017-Sep-13 13:39 UTC

head link

[llvm-dev] [RFC] Polly Status and Integration

Hi Gerolf,

On Tue, Sep 12, 2017 at 10:26 PM, Gerolf Hoflehner via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> Are you saying the LLVM Dependence Analysis is incorrect or do you actually
mean less conservative (or "more accurate" or something like that)?
>
Yes, the LLVM dependence analysis is broken from day one, by design,
due to a misunderstanding of the meaning of GEPs:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130701/179509.html
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130701/179529.html
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130701/179570.html

Loop interchange and any other pass that relies on the current llvm
dependence analysis may generate wrong code.
See https://reviews.llvm.org/D35430

Another point, the MIV test in the llvm depednence analysis is not
implemented, and so the scope of the llvm dependence analysis is
rather narrow: i.e., it would not be able to solve the loop
interchange in spec2000/swim.

Sebastian

Tobias Grosser via llvm-dev

2017-Sep-19 16:21 UTC

head link

[llvm-dev] [RFC] Polly Status and Integration

Hi Adam,

thanks for your input. Hal already answered most questions, but let me
complement his reply with a couple of data points:

On Tue, Sep 12, 2017, at 07:47, Hal Finkel wrote:> 
> On 09/11/2017 12:26 PM, Adam Nemet wrote:
> > Hi Hal, Tobias, Michael and others,
> >
> >> On Sep 1, 2017, at 11:47 AM, Hal Finkel via llvm-dev 
> >> <llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>> wrote:
> >>
> >> **
> >>
> >> *Hi everyone,As you may know, stock LLVM does not provide the kind
of
> >> advanced loop transformations necessary to provide good
performance
> >> on many applications. LLVM's Polly project provides many of
the
> >> required capabilities, including loop transformations such as 
> >> fission, fusion, skewing, blocking/tiling, and interchange, all 
> >> powered by state-of-the-art dependence analysis. Polly also
provides
> >> automated parallelization and targeting of GPUs and
other**accelerators.*
> >> *
> >> Over the past year, Polly’s development has focused on robustness,
> >> correctness, and closer integration with LLVM. To highlight a few 
> >> accomplishments:
> >>
> >>  *
> >>     Polly now runs, by default, in the conceptually-proper place
in
> >>     LLVM’s pass pipeline (just before the loop vectorizer).
> >>     Importantly, this means that its loop transformations are
> >>     performed after inlining and other canonicalization, greatly
> >>     increasing its robustness, and enabling its use on C++ code
> >>     (where [] is often a function call before inlining).
> >>  *
> >>     Polly’s cost-modeling parameters, such as those describing the
> >>     target’s memory hierarchy, are being integrated with
> >>     TargetTransformInfo. This allows targets to properly override
the
> >>     modeling parameters and allows reuse of these parameters by
other
> >>     clients.
> >>  *
> >>     Polly’s method of handling signed division/remainder
operations,
> >>     which worked around lack of support in ScalarEvolution, is
being
> >>     replaced thanks to improvements being contributed to
> >>     ScalarEvolution itself (see D34598). Polly’s core
delinearization
> >>     routines have long been a part of LLVM itself.
> >>  *
> >>     PolyhedralInfo, which exposes a subset of Polly’s loop
analysis
> >>     for use by other clients, is now available.
> >>  *
> >>     Polly is now part of the LLVM release process and is being
> >>     included with LLVM by various packagers (e.g., Debian).
> >>
> >>
> >> I believe that the LLVM community would benefit from beginning the
> >> process of integrating Polly with LLVM itself and continuing its 
> >> development as part of our main code base. This will:
> >>
> >>  *
> >>     Allow for wider adoption of LLVM within communities relying on
> >>     advanced loop transformations.
> >>  *
> >>     Provide for better community feedback on, and testing of, the
> >>     code developed (although the story in this regard is already
> >>     fairly solid).
> >>  *
> >>     Better motivate targets to provide accurate, comprehensive,
> >>     modeling parameters for use by advanced loop transformations.
> >>  *
> >>     Perhaps most importantly, this will allow us to develop and
tune
> >>     the rest of the optimizer assuming that Polly’s capabilities
are
> >>     present (the underlying analysis, and eventually, the
> >>     transformations themselves).
> >>
> >>
> >> The largest issue on which community consensus is required, in
order
> >> to move forward at all, is what to do with isl. isl, the Integer
Set
> >> Library, provides core functionality on which Polly depends. It is
a
> >> C library, and while some Polly/LLVM developers are also isl 
> >> developers, it has a large user community outside of LLVM/Polly. A
> >> C++ interface was recently added, and Polly is transitioning to
use
> >> the C++ interface. Nevertheless, options here include rewriting
the
> >> needed functionality, forking isl and transitioning our fork
toward
> >> LLVM coding conventions (and data structures) over time, and 
> >> incorporating isl more-or-less as-is to avoid partitioning its 
> >> development.
> >>
> >> That having been said, isl is internally modular, and regardless
of
> >> the overall integration strategy, the Polly developers anticipate 
> >> specializing, or even replacing, some of these components with 
> >> LLVM-specific solutions. This is especially true for anything that
> >> touches performance-related heuristics and modeling.
LLVM-specific,
> >> or even target-specific, loop schedulers may be developed as well.
> >>
> >> Even though some developers in the LLVM community already have a 
> >> background in polyhedral-modeling techniques, the Polly developers
> >> have developed, and are still developing, extensive tutorials on
this
> >> topic http://pollylabs.org/education.htmland especially 
> >> http://playground.pollylabs.org
<http://playground.pollylabs.org/>.
> >>
> >> Finally, let me highlight a few ongoing development efforts in
Polly
> >> that are potentially relevant to this discussion. Polly’s loop 
> >> analysis is sound and technically superior to what’s in LLVM 
> >> currently (i.e. in LoopAccessAnalysis and DependenceAnalysis).
There
> >> are, however, two known reasons why Polly’s transformations could
not
> >> yet be enabled by default:
> >>
> >>  *
> >>     A correctness issue: Currently, Polly assumes that 64 bits is
> >>     large enough for all new loop-induction variables and index
> >>     expressions. In rare cases, transformations could be performed
> >>     where more bits are required. Preconditions need to be
generated
> >>     preventing this (e.g., D35471).
> >>  *
> >>     A performance issue: Polly currently models temporal locality
> >>     (i.e., it tries to get better reuse in time), but does not
model
> >>     spatial locality (i.e., it does not model cache-line reuse).
As a
> >>     result, it can sometimes introduce performance regressions.
Polly
> >>     Labs is currently working on integrating spatial locality
> >>     modeling into the loop optimization model.
> >>
> >> Polly can already split apart basic blocks in order to implement
loop
> >> fusion. Heuristics to choose at which granularity are still being 
> >> implemented (e.g., PR12402).
> >> I believe that we can now develop a concrete plan for moving 
> >> state-of-the-art loop optimizations, based on the technology in
the
> >> Polly project, into LLVM. Doing so will enable LLVM to be
competitive
> >> with proprietary compilers in high-performance computing, machine 
> >> learning, and other important application domains. I’d like
community
> >> feedback on what**should be part of that plan.
> >> *
> >
> > One thing that I’d like to see more details on is what this means for 
> > the evolution of loop transformations in LLVM.
> >
> > Our more-or-less established direction was so far to incrementally 
> > improve and generalize the required analyses (e.g. the 
> > LoopVectorizer’s dependence analysis + loop versioning analysis into a
> > stand-alone analysis pass (LoopAccessAnalysis)) and then build new 
> > transformations (e.g. LoopDistribution, LoopLoadElimination, 
> > LICMLoopVersioning) on top of these.
> >
> > The idea was that infrastructure would be incrementally improved from 
> > two directions:
> >
> > - As new transformations are built analyses have to be improved (e.g. 
> > past improvements to LAA to support the LoopVersioning utility, future
> > improvements for full LoopSROA beyond just store->load forwarding
[1]
> > or the improvements to LAA for the LoopFusion proposal[2])
> >
> > - As more complex loops would have to be analyzed we either improve 
> > LAA or make DependenceAnalysis a drop-in replacement for the memory 
> > analysis part in LAA
> 
> Or we could use Polly's dependence analysis, which I believe to be more
> powerful, more robust, and more correct than DependenceAnalysis. I 
> believe that the difficult part here is actually the pairing with 
> predicated SCEV or whatever mechanism we want to use generate runtime 
> predicates (this applies to use of DependenceAnalysis too).
>
> > While this model may be slow it has all the benefits of the 
> > incremental development model.
> 
> The current model may have been slow in many areas, but I think that's 
> mostly a question of development effort. My largest concern about the 
> current model is that, to the extent that we're implementing classic 
> loop transformations (e.g., fusion, distribution, interchange, skewing, 
> tiling, and so on), we're repeating a historical design that is known
to
> have several suboptimal properties. Chief among them is the lack of 
> integration: many of these transformations are interconnected, and 
> there's no good pass ordering in which to make independent decisions. 
> Many of these transformations can be captured in a single model and we 
> can get much better results by integrating them. There's also the
matter
> of whether building these transformation on SCEV (or IR directly) is the 
> best underlying infrastructure, or whether parts of Polly would be
> better.
> 
> That having been said, I think that integrating this technology into 
> LLVM will also mean applying appropriate modularity. I think that we'll
> almost definitely want to make use of the dependence analysis separately 
> as an analysis. We'll want to decide which of these transformations
will
> be considered canonicalization (and run in the iterative pipeline) and 
> which will be lowering (and run near the vectorizer). LoopSROA certainly 
> sounds to me like canonicalization, but loop fusion might also fall into 
> that category (i.e., we might want to fuse early to enable optimizations 
> and then split late).
>
> > Then there is the question of use cases.  It’s fairly obvious that 
> > anybody wanting to optimize a 5-deep highly regular loop-nest 
> > operating on arrays should use Polly.  On the other hand it’s way less
> > clear that we should use it for singly or doubly nested not-so-regular
> > loops which are the norm in non-HPC workloads.
> 
> This is clearly a good question, but thinking about Polly as a set of 
> components, not as a monolithic transformation component, I think that 
> polyhedral analysis and transformations can underlie a lot of the 
> transformations we need for non-HPC code (and, which I'll point out, we
> need for modern HPC code too). In practice, the loops that we can 
> actually analyze have affine dependencies, and Polly does, or can do, a 
> better job at generating runtime predicates and dealing with 
> piecewise-linear expressions than our current infrastructure.
> 
> In short, I look at Polly as two things: First, an infrastructure for 
> dealing with loop analysis and transformation. I view this as being 
> broadly applicable. Second, an application of that to apply 
> cost-model-driven classic loop transformations. To some extent this is 
> going to be more useful for HPC codes, but also applies to machine 
> learning, signal processing, graphics, and other areas.
Hal is very right here. Polly -- by itself -- is an end-to-end
implementation of a polyhedral loop optimization framework, but it
consists of several components -- which I expect will become
increasingly modular and also will be integrated at more and more places
into LLVM. I expect this integration to be gradually and driven by
community discussions and consensus.  For me, the primary point of this
proposal is to enable such an integration and especially to make sure we
have maximal community input at each step -- carefully weighting the
different goals we have in the community.
 > > And this brings me to the maintenance question.  Is it reasonable to 
> > expect people to fix Polly when they have a seemingly unrelated change
> > that happens to break a Polly bot.
> 
> The eventual goal here is to have this technology in appropriate parts 
> of the main pipeline, and so the question here is not really about 
> breaking a "Polly bot", but just about a "bot" in
general. I've given
> this question some thought and I think it sits in a reasonable place in 
> the risk-reward space. The answer would be, yes, we'd need to treat
this
> like any other part of the pipeline. However, I believe that Polly has 
> as many, or more, active contributors than essentially any other 
> individual part of the mid-level optimizer or CodeGen. As a result, 
> there will be people around in many time zones to help with problems 
> with Polly-related code.
I personally would be even more conservative than Hal. While Polly is
meanwhile very robust and has been shipped in production compilers, I
see this proposal more in line with an "experimental backend". Hence,
the maintenance burden will lie on Polly itself. As we did over the last
years, the Polly team will address bugs and regressions in buildbots
without posing any load on regular development. In fact this promise is
easy to make. We have at most 1-2 bugs a months that are caused by LLVM
changes and some of these bugs are mis-optimizations in LLVM which we
have a history of reporting and addressing. 
 > >  As far as I know, there were companies in the past that tried Polly 
> > without a whole lot of prior experience.  It would be great to hear 
> > what the experience was before adopting Polly at a much larger scale.
> 
> I'm also interested, although I'll caution against
over-interpreting any
> evidence here (positive or negative). Before a few weeks ago, Polly 
> didn't effectively run in the pipeline after inlining, and so I doubt
it
> would have been much use outside of embedded environments (and maybe 
> some HPC environments) with straightforwardly-presented C code. It's 
> only now that this has been fixed that I find the possibility of 
> integrating this in production interesting.
Eli Friedman from Qualcomm has been working with us on the robustness of
Polly by regularly testing it on AOSP, Roman Gareev worked with ARM on
linear algebra kernels (Polly now get's within 10-20% of vendor
libraries) (should work out-of-the-box), we are currently working with
MeteoSwiss/CSCS on the COSMO weather and climate mode, and together with
Johannes we have been working on some SPEC benchmarks (results still to
be published). We also see non-trivial speedups (2-4x) on SPEC for
automatic GPGPU code generation:
http://grosser.es/publications/grosser-2016-polly-acc-transparent-compilation-to-heterogeneous-hardware.pdf

However, more important for an initial implementation is compile time
cost. Polyhedral scheduling was the one piece of Polly that had an
inherently exponential cost as code-sizes grew. Since about half a year
we now have a new incremental scheduling approach which (assuming
limited fusion) limits the scheduling cost to the maximal amount of loop
fusion we choose (see section 7.3):
https://www.researchgate.net/publication/317826152_Scheduling_for_PPCG

I feel that at this point broader community input to our development is
very critical for the next steps -- especially as more people (e.g.,
Hal's group) want to apply polyhedral based optimizations in LLVM. To
make sure Polly evolves in a direction  well aligned with LLVM goals,
the input of LLVM loop optimization experts like you (Adam), Gerolf and
the larger LLVM community
will be very helpful. Moving our day-to-day discussions and development
under the full LLVM umbrella will make this possible.

I do not see this proposal to immediately replace all classical loop
transformations, but rather expect Polly to complement existing
optimizations. To which degree will depend on our experiences over the
next year.

Best,
Tobias

Hal Finkel via llvm-dev

2017-Sep-22 03:59 UTC

head link

[llvm-dev] [RFC] Polly Status and Integration

On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:>
>
>> On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev 
>> <llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>> wrote:
>>
>>
>> On 09/11/2017 12:26 PM, Adam Nemet wrote:
>>> Hi Hal, Tobias, Michael and others,
>>> *...*
>>>
>>> One thing that I’d like to see more details on is what this means 
>>> for the evolution of loop transformations in LLVM.
>>>
>>> Our more-or-less established direction was so far to incrementally 
>>> improve and generalize the required analyses (e.g. the 
>>> LoopVectorizer’s dependence analysis + loop versioning analysis
into
>>> a stand-alone analysis pass (LoopAccessAnalysis)) and then build
new
>>> transformations (e.g. LoopDistribution, LoopLoadElimination, 
>>> LICMLoopVersioning) on top of these.
>>>
>>> The idea was that infrastructure would be incrementally improved 
>>> from two directions:
>>>
>>> - As new transformations are built analyses have to be improved 
>>> (e.g. past improvements to LAA to support the LoopVersioning 
>>> utility, future improvements for full LoopSROA beyond just 
>>> store->load forwarding [1] or the improvements to LAA for the 
>>> LoopFusion proposal[2])
>>>
>>> - As more complex loops would have to be analyzed we either improve
>>> LAA or make DependenceAnalysis a drop-in replacement for the memory
>>> analysis part in LAA
>>
>> Or we could use Polly's dependence analysis, which I believe to be 
>> more powerful, more robust, and more correct than DependenceAnalysis. 
>> I believe that the difficult part here is actually the pairing with 
>> predicated SCEV or whatever mechanism we want to use generate runtime 
>> predicates (this applies to use of DependenceAnalysis too).
>
> What is a good way to measure these assertions (More powerful, more 
> robust)? Are you saying the LLVM Dependence Analysis is incorrect or 
> do you actually mean less conservative (or "more accurate" or 
> something like that)?
Sebastian's email covers the issues with the DependenceAnalysis pass 
pretty well.

Regarding what's in LoopAccessAnalysis, I believe it to be correct, but 
more limited. It is not clear to me that LAA is bad at what it does 
based on what the vectorizer can handle. LAA could do better in some 
cases with non-unit-stride loops. Polly also handles piecewise-affine 
functions, which allows the modeling of loops with conditionals. 
Extending LAA to handle loop nests, moreover, seems likely to be 
non-trivial.

Regardless, measuring these differences certainly seems like a good 
idea. I think that we can do this using optimization remarks. LAA 
already emits optimization remarks for loops in which it finds unsafe 
memory dependencies. Polly also emits optimization remarks. We may need 
to iterate some in order to setup a good comparison, but we should be 
able to collect statistics (and other information) by compiling code 
using -fsave-optimization-record (in combination with some other flags), 
and then analyzing the resulting YAML files.
>>
>>>
>>> While this model may be slow it has all the benefits of the 
>>> incremental development model.
>>
>> The current model may have been slow in many areas, but I think 
>> that's mostly a question of development effort. My largest concern 
>> about the current model is that, to the extent that we're 
>> implementing classic loop transformations (e.g., fusion, 
>> distribution, interchange, skewing, tiling, and so on), we're 
>> repeating a historical design that is known to have several 
>> suboptimal properties. Chief among them is the lack of integration: 
>> many of these transformations are interconnected, and there's no
good
>> pass ordering in which to make independent decisions. Many of these 
>> transformations can be captured in a single model and we can get much 
>> better results by integrating them. There's also the matter of 
>> whether building these transformation on SCEV (or IR directly) is the 
>> best underlying infrastructure, or whether parts of Polly would be 
>> better.
>
> I believe that is true. What I wonder is is there a good method to 
> reason about it?
If I understand what you mean, one way to look at it is this: This is 
not a canonicalization problem. Picking an optimal way to interchange 
loops may depend on how the result can be skewed and/or tiled, picking 
an optimal way to distribute loops often depends on what can be done 
afterward in each piece. Optimal here generally involves reasoning about 
the memory hierarchy (e.g., cache properties), available prefetching 
streams, register-file size, and so on.

I know that I've seen some good examples in papers over the years that 
illustrate the phase-ordering challenges. Hopefully, someone will jump 
in here with some good references. One classic one is: William Pugh. 
Uniform Techniques for Loop Optimization. 1991.
> Perhaps concrete examples or perhaps opt-viewer based comparisons on 
> large sets of benchmarks? In the big picture you could make such a 
> modeling argument for all compiler optimizations.
Certainly. However, in this case there's a well-studied unified model 
for this set of optimizations known to reduce phase-ordering effects. 
That's not true in general.
>>
>> That having been said, I think that integrating this technology into 
>> LLVM will also mean applying appropriate modularity. I think that 
>> we'll almost definitely want to make use of the dependence analysis
>> separately as an analysis. We'll want to decide which of these 
>> transformations will be considered canonicalization (and run in the 
>> iterative pipeline) and which will be lowering (and run near the 
>> vectorizer). LoopSROA certainly sounds to me like canonicalization, 
>> but loop fusion might also fall into that category (i.e., we might 
>> want to fuse early to enable optimizations and then split late).
>>
>>>
>>> Then there is the question of use cases.  It’s fairly obvious that 
>>> anybody wanting to optimize a 5-deep highly regular loop-nest 
>>> operating on arrays should use Polly.  On the other hand it’s way 
>>> less clear that we should use it for singly or doubly nested 
>>> not-so-regular loops which are the norm in non-HPC workloads.
>>
>> This is clearly a good question, but thinking about Polly as a set of 
>> components, not as a monolithic transformation component, I think 
>> that polyhedral analysis and transformations can underlie a lot of 
>> the transformations we need for non-HPC code (and, which I'll point
>> out, we need for modern HPC code too). In practice, the loops that we 
>> can actually analyze have affine dependencies, and Polly does, or can 
>> do, a better job at generating runtime predicates and dealing with 
>> piecewise-linear expressions than our current infrastructure.
>>
>> In short, I look at Polly as two things: First, an infrastructure for 
>> dealing with loop analysis and transformation. I view this as being 
>> broadly applicable. Second, an application of that to apply 
>> cost-model-driven classic loop transformations. To some extent this 
>> is going to be more useful for HPC codes, but also applies to machine 
>> learning, signal processing, graphics, and other areas.
> I’m wondering if it could be used for pointing out headroom for the 
> existing LLVM ecosystem (*)
Sure.
>
>>
>>>
>>> And this brings me to the maintenance question.  Is it reasonable
to
>>> expect people to fix Polly when they have a seemingly unrelated 
>>> change that happens to break a Polly bot.
>>
>> The eventual goal here is to have this technology in appropriate 
>> parts of the main pipeline, and so the question here is not really 
>> about breaking a "Polly bot", but just about a
"bot" in general. I've
>> given this question some thought and I think it sits in a reasonable 
>> place in the risk-reward space. The answer would be, yes, we'd need
>> to treat this like any other part of the pipeline. However, I believe 
>> that Polly has as many, or more, active contributors than essentially 
>> any other individual part of the mid-level optimizer or CodeGen. As a 
>> result, there will be people around in many time zones to help with 
>> problems with Polly-related code.
>>
>>>  As far as I know, there were companies in the past that tried
Polly
>>> without a whole lot of prior experience.  It would be great to hear
>>> what the experience was before adopting Polly at a much larger
scale.
>>
>> I'm also interested, although I'll caution against
over-interpreting
>> any evidence here (positive or negative). Before a few weeks ago, 
>> Polly didn't effectively run in the pipeline after inlining, and so
I
>> doubt it would have been much use outside of embedded environments 
>> (and maybe some HPC environments) with straightforwardly-presented C 
>> code. It's only now that this has been fixed that I find the 
>> possibility of integrating this in production interesting.
>
> That is a good point. There are also biases independent of past 
> experiences (for disclosure mine is (*) above). But I think it is 
> objective to say a Polly integration is a big piece to swallow.Your 
> pro-Polly argument lists a number of categories that I think could be 
> reasoned about individually and partly evaluated with a data-driven 
> approach:
> A) Architecture
> - support for autoparallelism
> - support for accelerators
> - isl- rewrite? etc
> ...
> B) Modelling
> - polyhedral model
> - temporal locality
> - spatial locality
> …
> C) Analysis/Optimizations
> - Dependence Analysis
> - Transformation effective/power (loop nests, quality of 
> transformations, #vectorizable loops etc)
>
> A) is mostly Polly independent (except for the isl question I guess). 
> For B and C performance/ compile-time /opt-viewer data on a 
> decent/wide range of benchmarks possibly at different optimization 
> levels (O2, O3, LTO, PGO etc and combinations) should provide 
> data-driven insight into costs/benefits.
I agree. In practice, the first question is: Are will willing to take on 
Polly (+isl), in whole or in part, as a build dependency? If the answer 
is yes, the next question is: what parts should be reused or refactored 
for use in other parts of the pipeline? My argument is that we should 
take on Polly, or most of it, as a build dependency. Work on better 
unifying the developer communities as we start experimenting with other 
kinds of integration. This will, however, allow us to provide to all of 
our users these transformations through pragmas (and other kinds of 
optional enablement). This is an important first step.

I'm not sure exactly how good this is, but polly has LNT-submitting 
bots, so the website can generate a comparison (e.g., 
http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182). Looking 
at this comparison shows a number of potential problems but also cases 
where Polly really helps (and, FWIW, the largest two compile-time 
regressions are also associated with very large execution performance 
improvements). My first focus would certainly be on pragma-driven 
enablement.

Thanks again,
Hal
>
> Cheers
> Gerolf
>
>
>
>
>>
>> Thanks again,
>> Hal
>>
>>>
>>> Adam
>>>
>>> [1]
http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html
>>> [2] http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html
>>>
>>>
>>>> *
>>>> Sincerely,
>>>> Hal (on behalf of myself, Tobias Grosser, and Michael Kruse,
with
>>>> feedback from**several other active Polly developers)
>>>> *...**
>>>>
>>>> -- 
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>
>> -- 
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170921/983d6dc5/attachment.html>

Possibly Parallel Threads

Search for more reasonably related threads

llvm dev - Sep 2017 - [RFC] Polly Status and Integration

[llvm-dev] [RFC] Polly Status and Integration

[llvm-dev] [RFC] Polly Status and Integration

[llvm-dev] [RFC] Polly Status and Integration

[llvm-dev] [RFC] Polly Status and Integration

[llvm-dev] [RFC] Polly Status and Integration

[llvm-dev] [RFC] Polly Status and Integration

Possibly Parallel Threads