thr3ads.net - llvm dev - [llvm-dev] [RFC] Polly Status and Integration [Sep 2017]

If this information is useful, please help other people find it:
Share via:

Hal Finkel via llvm-dev

2017-Sep-22 03:59 UTC

[llvm-dev] [RFC] Polly Status and Integration

On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:>
>
>> On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev 
>> <llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>> wrote:
>>
>>
>> On 09/11/2017 12:26 PM, Adam Nemet wrote:
>>> Hi Hal, Tobias, Michael and others,
>>> *...*
>>>
>>> One thing that I’d like to see more details on is what this means 
>>> for the evolution of loop transformations in LLVM.
>>>
>>> Our more-or-less established direction was so far to incrementally 
>>> improve and generalize the required analyses (e.g. the 
>>> LoopVectorizer’s dependence analysis + loop versioning analysis
into
>>> a stand-alone analysis pass (LoopAccessAnalysis)) and then build
new
>>> transformations (e.g. LoopDistribution, LoopLoadElimination, 
>>> LICMLoopVersioning) on top of these.
>>>
>>> The idea was that infrastructure would be incrementally improved 
>>> from two directions:
>>>
>>> - As new transformations are built analyses have to be improved 
>>> (e.g. past improvements to LAA to support the LoopVersioning 
>>> utility, future improvements for full LoopSROA beyond just 
>>> store->load forwarding [1] or the improvements to LAA for the 
>>> LoopFusion proposal[2])
>>>
>>> - As more complex loops would have to be analyzed we either improve
>>> LAA or make DependenceAnalysis a drop-in replacement for the memory
>>> analysis part in LAA
>>
>> Or we could use Polly's dependence analysis, which I believe to be 
>> more powerful, more robust, and more correct than DependenceAnalysis. 
>> I believe that the difficult part here is actually the pairing with 
>> predicated SCEV or whatever mechanism we want to use generate runtime 
>> predicates (this applies to use of DependenceAnalysis too).
>
> What is a good way to measure these assertions (More powerful, more 
> robust)? Are you saying the LLVM Dependence Analysis is incorrect or 
> do you actually mean less conservative (or "more accurate" or 
> something like that)?
Sebastian's email covers the issues with the DependenceAnalysis pass 
pretty well.

Regarding what's in LoopAccessAnalysis, I believe it to be correct, but 
more limited. It is not clear to me that LAA is bad at what it does 
based on what the vectorizer can handle. LAA could do better in some 
cases with non-unit-stride loops. Polly also handles piecewise-affine 
functions, which allows the modeling of loops with conditionals. 
Extending LAA to handle loop nests, moreover, seems likely to be 
non-trivial.

Regardless, measuring these differences certainly seems like a good 
idea. I think that we can do this using optimization remarks. LAA 
already emits optimization remarks for loops in which it finds unsafe 
memory dependencies. Polly also emits optimization remarks. We may need 
to iterate some in order to setup a good comparison, but we should be 
able to collect statistics (and other information) by compiling code 
using -fsave-optimization-record (in combination with some other flags), 
and then analyzing the resulting YAML files.
>>
>>>
>>> While this model may be slow it has all the benefits of the 
>>> incremental development model.
>>
>> The current model may have been slow in many areas, but I think 
>> that's mostly a question of development effort. My largest concern 
>> about the current model is that, to the extent that we're 
>> implementing classic loop transformations (e.g., fusion, 
>> distribution, interchange, skewing, tiling, and so on), we're 
>> repeating a historical design that is known to have several 
>> suboptimal properties. Chief among them is the lack of integration: 
>> many of these transformations are interconnected, and there's no
good
>> pass ordering in which to make independent decisions. Many of these 
>> transformations can be captured in a single model and we can get much 
>> better results by integrating them. There's also the matter of 
>> whether building these transformation on SCEV (or IR directly) is the 
>> best underlying infrastructure, or whether parts of Polly would be 
>> better.
>
> I believe that is true. What I wonder is is there a good method to 
> reason about it?
If I understand what you mean, one way to look at it is this: This is 
not a canonicalization problem. Picking an optimal way to interchange 
loops may depend on how the result can be skewed and/or tiled, picking 
an optimal way to distribute loops often depends on what can be done 
afterward in each piece. Optimal here generally involves reasoning about 
the memory hierarchy (e.g., cache properties), available prefetching 
streams, register-file size, and so on.

I know that I've seen some good examples in papers over the years that 
illustrate the phase-ordering challenges. Hopefully, someone will jump 
in here with some good references. One classic one is: William Pugh. 
Uniform Techniques for Loop Optimization. 1991.
> Perhaps concrete examples or perhaps opt-viewer based comparisons on 
> large sets of benchmarks? In the big picture you could make such a 
> modeling argument for all compiler optimizations.
Certainly. However, in this case there's a well-studied unified model 
for this set of optimizations known to reduce phase-ordering effects. 
That's not true in general.
>>
>> That having been said, I think that integrating this technology into 
>> LLVM will also mean applying appropriate modularity. I think that 
>> we'll almost definitely want to make use of the dependence analysis
>> separately as an analysis. We'll want to decide which of these 
>> transformations will be considered canonicalization (and run in the 
>> iterative pipeline) and which will be lowering (and run near the 
>> vectorizer). LoopSROA certainly sounds to me like canonicalization, 
>> but loop fusion might also fall into that category (i.e., we might 
>> want to fuse early to enable optimizations and then split late).
>>
>>>
>>> Then there is the question of use cases.  It’s fairly obvious that 
>>> anybody wanting to optimize a 5-deep highly regular loop-nest 
>>> operating on arrays should use Polly.  On the other hand it’s way 
>>> less clear that we should use it for singly or doubly nested 
>>> not-so-regular loops which are the norm in non-HPC workloads.
>>
>> This is clearly a good question, but thinking about Polly as a set of 
>> components, not as a monolithic transformation component, I think 
>> that polyhedral analysis and transformations can underlie a lot of 
>> the transformations we need for non-HPC code (and, which I'll point
>> out, we need for modern HPC code too). In practice, the loops that we 
>> can actually analyze have affine dependencies, and Polly does, or can 
>> do, a better job at generating runtime predicates and dealing with 
>> piecewise-linear expressions than our current infrastructure.
>>
>> In short, I look at Polly as two things: First, an infrastructure for 
>> dealing with loop analysis and transformation. I view this as being 
>> broadly applicable. Second, an application of that to apply 
>> cost-model-driven classic loop transformations. To some extent this 
>> is going to be more useful for HPC codes, but also applies to machine 
>> learning, signal processing, graphics, and other areas.
> I’m wondering if it could be used for pointing out headroom for the 
> existing LLVM ecosystem (*)
Sure.
>
>>
>>>
>>> And this brings me to the maintenance question.  Is it reasonable
to
>>> expect people to fix Polly when they have a seemingly unrelated 
>>> change that happens to break a Polly bot.
>>
>> The eventual goal here is to have this technology in appropriate 
>> parts of the main pipeline, and so the question here is not really 
>> about breaking a "Polly bot", but just about a
"bot" in general. I've
>> given this question some thought and I think it sits in a reasonable 
>> place in the risk-reward space. The answer would be, yes, we'd need
>> to treat this like any other part of the pipeline. However, I believe 
>> that Polly has as many, or more, active contributors than essentially 
>> any other individual part of the mid-level optimizer or CodeGen. As a 
>> result, there will be people around in many time zones to help with 
>> problems with Polly-related code.
>>
>>>  As far as I know, there were companies in the past that tried
Polly
>>> without a whole lot of prior experience.  It would be great to hear
>>> what the experience was before adopting Polly at a much larger
scale.
>>
>> I'm also interested, although I'll caution against
over-interpreting
>> any evidence here (positive or negative). Before a few weeks ago, 
>> Polly didn't effectively run in the pipeline after inlining, and so
I
>> doubt it would have been much use outside of embedded environments 
>> (and maybe some HPC environments) with straightforwardly-presented C 
>> code. It's only now that this has been fixed that I find the 
>> possibility of integrating this in production interesting.
>
> That is a good point. There are also biases independent of past 
> experiences (for disclosure mine is (*) above). But I think it is 
> objective to say a Polly integration is a big piece to swallow.Your 
> pro-Polly argument lists a number of categories that I think could be 
> reasoned about individually and partly evaluated with a data-driven 
> approach:
> A) Architecture
> - support for autoparallelism
> - support for accelerators
> - isl- rewrite? etc
> ...
> B) Modelling
> - polyhedral model
> - temporal locality
> - spatial locality
> …
> C) Analysis/Optimizations
> - Dependence Analysis
> - Transformation effective/power (loop nests, quality of 
> transformations, #vectorizable loops etc)
>
> A) is mostly Polly independent (except for the isl question I guess). 
> For B and C performance/ compile-time /opt-viewer data on a 
> decent/wide range of benchmarks possibly at different optimization 
> levels (O2, O3, LTO, PGO etc and combinations) should provide 
> data-driven insight into costs/benefits.
I agree. In practice, the first question is: Are will willing to take on 
Polly (+isl), in whole or in part, as a build dependency? If the answer 
is yes, the next question is: what parts should be reused or refactored 
for use in other parts of the pipeline? My argument is that we should 
take on Polly, or most of it, as a build dependency. Work on better 
unifying the developer communities as we start experimenting with other 
kinds of integration. This will, however, allow us to provide to all of 
our users these transformations through pragmas (and other kinds of 
optional enablement). This is an important first step.

I'm not sure exactly how good this is, but polly has LNT-submitting 
bots, so the website can generate a comparison (e.g., 
http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182). Looking 
at this comparison shows a number of potential problems but also cases 
where Polly really helps (and, FWIW, the largest two compile-time 
regressions are also associated with very large execution performance 
improvements). My first focus would certainly be on pragma-driven 
enablement.

Thanks again,
Hal
>
> Cheers
> Gerolf
>
>
>
>
>>
>> Thanks again,
>> Hal
>>
>>>
>>> Adam
>>>
>>> [1]
http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html
>>> [2] http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html
>>>
>>>
>>>> *
>>>> Sincerely,
>>>> Hal (on behalf of myself, Tobias Grosser, and Michael Kruse,
with
>>>> feedback from**several other active Polly developers)
>>>> *...**
>>>>
>>>> -- 
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>
>> -- 
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170921/983d6dc5/attachment.html>

Mehdi AMINI via llvm-dev

2017-Sep-22 05:03 UTC

head link

[llvm-dev] [RFC] Polly Status and Integration

Hi Hal,


2017-09-21 20:59 GMT-07:00 Hal Finkel via llvm-dev <llvm-dev at
lists.llvm.org>
:
>
> On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
>
>
>
> On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>
> On 09/11/2017 12:26 PM, Adam Nemet wrote:
>
> Hi Hal, Tobias, Michael and others,
> *...*
>
> One thing that I’d like to see more details on is what this means for the
> evolution of loop transformations in LLVM.
>
> Our more-or-less established direction was so far to incrementally improve
> and generalize the required analyses (e.g. the LoopVectorizer’s dependence
> analysis + loop versioning analysis into a stand-alone analysis pass
> (LoopAccessAnalysis)) and then build new transformations (e.g.
> LoopDistribution, LoopLoadElimination, LICMLoopVersioning) on top of these.
>
> The idea was that infrastructure would be incrementally improved from two
> directions:
>
> - As new transformations are built analyses have to be improved (e.g. past
> improvements to LAA to support the LoopVersioning utility, future
> improvements for full LoopSROA beyond just store->load forwarding [1] or
> the improvements to LAA for the LoopFusion proposal[2])
>
> - As more complex loops would have to be analyzed we either improve LAA or
> make DependenceAnalysis a drop-in replacement for the memory analysis part
> in LAA
>
>
> Or we could use Polly's dependence analysis, which I believe to be more
> powerful, more robust, and more correct than DependenceAnalysis. I believe
> that the difficult part here is actually the pairing with predicated SCEV
> or whatever mechanism we want to use generate runtime predicates (this
> applies to use of DependenceAnalysis too).
>
>
> What is a good way to measure these assertions (More powerful, more
> robust)? Are you saying the LLVM Dependence Analysis is incorrect or do you
> actually mean less conservative (or "more accurate" or something
like that)?
>
>
> Sebastian's email covers the issues with the DependenceAnalysis pass
> pretty well.
>
> Regarding what's in LoopAccessAnalysis, I believe it to be correct, but
> more limited. It is not clear to me that LAA is bad at what it does based
> on what the vectorizer can handle. LAA could do better in some cases with
> non-unit-stride loops. Polly also handles piecewise-affine functions, which
> allows the modeling of loops with conditionals. Extending LAA to handle
> loop nests, moreover, seems likely to be non-trivial.
>
> Regardless, measuring these differences certainly seems like a good idea.
> I think that we can do this using optimization remarks. LAA already emits
> optimization remarks for loops in which it finds unsafe memory
> dependencies. Polly also emits optimization remarks. We may need to iterate
> some in order to setup a good comparison, but we should be able to collect
> statistics (and other information) by compiling code using
> -fsave-optimization-record (in combination with some other flags), and then
> analyzing the resulting YAML files.
>
>
>
> While this model may be slow it has all the benefits of the incremental
> development model.
>
>
> The current model may have been slow in many areas, but I think that's
> mostly a question of development effort. My largest concern about the
> current model is that, to the extent that we're implementing classic
loop
> transformations (e.g., fusion, distribution, interchange, skewing, tiling,
> and so on), we're repeating a historical design that is known to have
> several suboptimal properties. Chief among them is the lack of integration:
> many of these transformations are interconnected, and there's no good
pass
> ordering in which to make independent decisions. Many of these
> transformations can be captured in a single model and we can get much
> better results by integrating them. There's also the matter of whether
> building these transformation on SCEV (or IR directly) is the best
> underlying infrastructure, or whether parts of Polly would be better.
>
>
> I believe that is true. What I wonder is is there a good method to reason
> about it?
>
>
> If I understand what you mean, one way to look at it is this: This is not
> a canonicalization problem. Picking an optimal way to interchange loops may
> depend on how the result can be skewed and/or tiled, picking an optimal way
> to distribute loops often depends on what can be done afterward in each
> piece. Optimal here generally involves reasoning about the memory hierarchy
> (e.g., cache properties), available prefetching streams, register-file
> size, and so on.
>
> I know that I've seen some good examples in papers over the years that
> illustrate the phase-ordering challenges. Hopefully, someone will jump in
> here with some good references. One classic one is: William Pugh. Uniform
> Techniques for Loop Optimization. 1991.
>
> Perhaps concrete examples or perhaps opt-viewer based comparisons on large
> sets of benchmarks? In the big picture you could make such a modeling
> argument for all compiler optimizations.
>
>
> Certainly. However, in this case there's a well-studied unified model
for
> this set of optimizations known to reduce phase-ordering effects.
That's
> not true in general.
>
>
> That having been said, I think that integrating this technology into LLVM
> will also mean applying appropriate modularity. I think that we'll
almost
> definitely want to make use of the dependence analysis separately as an
> analysis. We'll want to decide which of these transformations will be
> considered canonicalization (and run in the iterative pipeline) and which
> will be lowering (and run near the vectorizer). LoopSROA certainly sounds
> to me like canonicalization, but loop fusion might also fall into that
> category (i.e., we might want to fuse early to enable optimizations and
> then split late).
>
>
> Then there is the question of use cases.  It’s fairly obvious that anybody
> wanting to optimize a 5-deep highly regular loop-nest operating on arrays
> should use Polly.  On the other hand it’s way less clear that we should use
> it for singly or doubly nested not-so-regular loops which are the norm in
> non-HPC workloads.
>
>
> This is clearly a good question, but thinking about Polly as a set of
> components, not as a monolithic transformation component, I think that
> polyhedral analysis and transformations can underlie a lot of the
> transformations we need for non-HPC code (and, which I'll point out, we
> need for modern HPC code too). In practice, the loops that we can actually
> analyze have affine dependencies, and Polly does, or can do, a better job
> at generating runtime predicates and dealing with piecewise-linear
> expressions than our current infrastructure.
>
> In short, I look at Polly as two things: First, an infrastructure for
> dealing with loop analysis and transformation. I view this as being broadly
> applicable. Second, an application of that to apply cost-model-driven
> classic loop transformations. To some extent this is going to be more
> useful for HPC codes, but also applies to machine learning, signal
> processing, graphics, and other areas.
>
> I’m wondering if it could be used for pointing out headroom for the
> existing LLVM ecosystem (*)
>
>
> Sure.
>
>
>
>
> And this brings me to the maintenance question.  Is it reasonable to
> expect people to fix Polly when they have a seemingly unrelated change that
> happens to break a Polly bot.
>
>
> The eventual goal here is to have this technology in appropriate parts of
> the main pipeline, and so the question here is not really about breaking a
> "Polly bot", but just about a "bot" in general.
I've given this question
> some thought and I think it sits in a reasonable place in the risk-reward
> space. The answer would be, yes, we'd need to treat this like any other
> part of the pipeline. However, I believe that Polly has as many, or more,
> active contributors than essentially any other individual part of the
> mid-level optimizer or CodeGen. As a result, there will be people around in
> many time zones to help with problems with Polly-related code.
>
>  As far as I know, there were companies in the past that tried Polly
> without a whole lot of prior experience.  It would be great to hear what
> the experience was before adopting Polly at a much larger scale.
>
>
> I'm also interested, although I'll caution against
over-interpreting any
> evidence here (positive or negative). Before a few weeks ago, Polly
didn't
> effectively run in the pipeline after inlining, and so I doubt it would
> have been much use outside of embedded environments (and maybe some HPC
> environments) with straightforwardly-presented C code. It's only now
that
> this has been fixed that I find the possibility of integrating this in
> production interesting.
>
>
> That is a good point. There are also biases independent of past
> experiences (for disclosure mine is (*) above). But I think it is objective
> to say a Polly integration is a big piece to swallow.Your pro-Polly
> argument lists a number of categories that I think could be reasoned about
> individually and partly evaluated with a data-driven approach:
> A) Architecture
> - support for autoparallelism
> - support for accelerators
> - isl- rewrite? etc
> ...
> B) Modelling
> - polyhedral model
> - temporal locality
> - spatial locality
> …
> C) Analysis/Optimizations
> - Dependence Analysis
> - Transformation effective/power (loop nests, quality of transformations,
> #vectorizable loops etc)
>
> A) is mostly Polly independent (except for the isl question I guess). For
> B and C performance/ compile-time /opt-viewer data on a decent/wide range
> of benchmarks possibly at different optimization levels (O2, O3, LTO, PGO
> etc and combinations) should provide data-driven insight into
> costs/benefits.
>
>
> I agree. In practice, the first question is: Are will willing to take on
> Polly (+isl), in whole or in part, as a build dependency? If the answer is
> yes, the next question is: what parts should be reused or refactored for
> use in other parts of the pipeline? My argument is that we should take on
> Polly, or most of it, as a build dependency. Work on better unifying the
> developer communities as we start experimenting with other kinds of
> integration. This will, however, allow us to provide to all of our users
> these transformations through pragmas (and other kinds of optional
> enablement). This is an important first step.
>
When you write that we should take Polly as a build dependency: are you
envisioning this to be tied into LLVM to the point where we can't build
LLVM without Polly?
I thought the approach would rather be that a default build of LLVM would
have Polly available but that there will "forever" be a
`-DENABLE_POLLY=OFF` option?
This seems like an important distinction to me, as making it more
integrated but still optional for the correct behavior of LLVM means that
people can continue to work and maintain "forever" an optimization
pipeline
which operates without Polly, while starting to get more integration in a
pluggable form. I believe this is the direction that was taken with the
GSoC last year, trying to get Polly as a "pluggable" dependency
analysis.

>
> I'm not sure exactly how good this is, but polly has LNT-submitting
bots,
> so the website can generate a comparison (e.g., http://lnt.llvm.org/db_
> default/v4/nts/71208?compare_to=71182). Looking at this comparison shows
> a number of potential problems but also cases where Polly really helps
> (and, FWIW, the largest two compile-time regressions are also associated
> with very large execution performance improvements). My first focus would
> certainly be on pragma-driven enablement.
>
I'll add: PGO-driven enablement: spending compile-time where we know it can
have an impact. What do you think?

Best,

-- 
Mehdi
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170921/61298c6f/attachment.html>

Hal Finkel via llvm-dev

2017-Sep-22 05:22 UTC

head link

[llvm-dev] [RFC] Polly Status and Integration

On 09/22/2017 12:03 AM, Mehdi AMINI wrote:> Hi Hal,
>
>
> 2017-09-21 20:59 GMT-07:00 Hal Finkel via llvm-dev 
> <llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>>:
>
>
>     On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
>>
>>
>>>     On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev
>>>     <llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>> wrote:
>>>
>>>
>>>     On 09/11/2017 12:26 PM, Adam Nemet wrote:
>>>>     Hi Hal, Tobias, Michael and others,
>>>>     *...*
>>>>
>>>>     One thing that I’d like to see more details on is what this
>>>>     means for the evolution of loop transformations in LLVM.
>>>>
>>>>     Our more-or-less established direction was so far to
>>>>     incrementally improve and generalize the required analyses
>>>>     (e.g. the LoopVectorizer’s dependence analysis + loop
>>>>     versioning analysis into a stand-alone analysis pass
>>>>     (LoopAccessAnalysis)) and then build new transformations
(e.g.
>>>>     LoopDistribution, LoopLoadElimination, LICMLoopVersioning)
on
>>>>     top of these.
>>>>
>>>>     The idea was that infrastructure would be incrementally
>>>>     improved from two directions:
>>>>
>>>>     - As new transformations are built analyses have to be
improved
>>>>     (e.g. past improvements to LAA to support the
LoopVersioning
>>>>     utility, future improvements for full LoopSROA beyond just
>>>>     store->load forwarding [1] or the improvements to LAA
for the
>>>>     LoopFusion proposal[2])
>>>>
>>>>     - As more complex loops would have to be analyzed we either
>>>>     improve LAA or make DependenceAnalysis a drop-in
replacement
>>>>     for the memory analysis part in LAA
>>>
>>>     Or we could use Polly's dependence analysis, which I
believe to
>>>     be more powerful, more robust, and more correct than
>>>     DependenceAnalysis. I believe that the difficult part here is
>>>     actually the pairing with predicated SCEV or whatever mechanism
>>>     we want to use generate runtime predicates (this applies to use
>>>     of DependenceAnalysis too).
>>
>>     What is a good way to measure these assertions (More powerful,
>>     more robust)? Are you saying the LLVM Dependence Analysis is
>>     incorrect or do you actually mean less conservative (or "more
>>     accurate" or something like that)?
>
>     Sebastian's email covers the issues with the DependenceAnalysis
>     pass pretty well.
>
>     Regarding what's in LoopAccessAnalysis, I believe it to be
>     correct, but more limited. It is not clear to me that LAA is bad
>     at what it does based on what the vectorizer can handle. LAA could
>     do better in some cases with non-unit-stride loops. Polly also
>     handles piecewise-affine functions, which allows the modeling of
>     loops with conditionals. Extending LAA to handle loop nests,
>     moreover, seems likely to be non-trivial.
>
>     Regardless, measuring these differences certainly seems like a
>     good idea. I think that we can do this using optimization remarks.
>     LAA already emits optimization remarks for loops in which it finds
>     unsafe memory dependencies. Polly also emits optimization remarks.
>     We may need to iterate some in order to setup a good comparison,
>     but we should be able to collect statistics (and other
>     information) by compiling code using -fsave-optimization-record
>     (in combination with some other flags), and then analyzing the
>     resulting YAML files.
>
>>>
>>>>
>>>>     While this model may be slow it has all the benefits of the
>>>>     incremental development model.
>>>
>>>     The current model may have been slow in many areas, but I think
>>>     that's mostly a question of development effort. My largest
>>>     concern about the current model is that, to the extent that
>>>     we're implementing classic loop transformations (e.g.,
fusion,
>>>     distribution, interchange, skewing, tiling, and so on),
we're
>>>     repeating a historical design that is known to have several
>>>     suboptimal properties. Chief among them is the lack of
>>>     integration: many of these transformations are interconnected,
>>>     and there's no good pass ordering in which to make
independent
>>>     decisions. Many of these transformations can be captured in a
>>>     single model and we can get much better results by integrating
>>>     them. There's also the matter of whether building these
>>>     transformation on SCEV (or IR directly) is the best underlying
>>>     infrastructure, or whether parts of Polly would be better.
>>
>>     I believe that is true. What I wonder is is there a good method
>>     to reason about it?
>
>     If I understand what you mean, one way to look at it is this: This
>     is not a canonicalization problem. Picking an optimal way to
>     interchange loops may depend on how the result can be skewed
>     and/or tiled, picking an optimal way to distribute loops often
>     depends on what can be done afterward in each piece. Optimal here
>     generally involves reasoning about the memory hierarchy (e.g.,
>     cache properties), available prefetching streams, register-file
>     size, and so on.
>
>     I know that I've seen some good examples in papers over the years
>     that illustrate the phase-ordering challenges. Hopefully, someone
>     will jump in here with some good references. One classic one is:
>     William Pugh. Uniform Techniques for Loop Optimization. 1991.
>
>>     Perhaps concrete examples or perhaps opt-viewer based comparisons
>>     on large sets of benchmarks? In the big picture you could make
>>     such a modeling argument for all compiler optimizations.
>
>     Certainly. However, in this case there's a well-studied unified
>     model for this set of optimizations known to reduce phase-ordering
>     effects. That's not true in general.
>
>>>
>>>     That having been said, I think that integrating this technology
>>>     into LLVM will also mean applying appropriate modularity. I
>>>     think that we'll almost definitely want to make use of the
>>>     dependence analysis separately as an analysis. We'll want
to
>>>     decide which of these transformations will be considered
>>>     canonicalization (and run in the iterative pipeline) and which
>>>     will be lowering (and run near the vectorizer). LoopSROA
>>>     certainly sounds to me like canonicalization, but loop fusion
>>>     might also fall into that category (i.e., we might want to fuse
>>>     early to enable optimizations and then split late).
>>>
>>>>
>>>>     Then there is the question of use cases.  It’s fairly
obvious
>>>>     that anybody wanting to optimize a 5-deep highly regular
>>>>     loop-nest operating on arrays should use Polly.  On the
other
>>>>     hand it’s way less clear that we should use it for singly
or
>>>>     doubly nested not-so-regular loops which are the norm in
>>>>     non-HPC workloads.
>>>
>>>     This is clearly a good question, but thinking about Polly as a
>>>     set of components, not as a monolithic transformation
component,
>>>     I think that polyhedral analysis and transformations can
>>>     underlie a lot of the transformations we need for non-HPC code
>>>     (and, which I'll point out, we need for modern HPC code
too). In
>>>     practice, the loops that we can actually analyze have affine
>>>     dependencies, and Polly does, or can do, a better job at
>>>     generating runtime predicates and dealing with piecewise-linear
>>>     expressions than our current infrastructure.
>>>
>>>     In short, I look at Polly as two things: First, an
>>>     infrastructure for dealing with loop analysis and
>>>     transformation. I view this as being broadly applicable.
Second,
>>>     an application of that to apply cost-model-driven classic loop
>>>     transformations. To some extent this is going to be more useful
>>>     for HPC codes, but also applies to machine learning, signal
>>>     processing, graphics, and other areas.
>>     I’m wondering if it could be used for pointing out headroom for
>>     the existing LLVM ecosystem (*)
>
>     Sure.
>
>>
>>>
>>>>
>>>>     And this brings me to the maintenance question.  Is it
>>>>     reasonable to expect people to fix Polly when they have a
>>>>     seemingly unrelated change that happens to break a Polly
bot.
>>>
>>>     The eventual goal here is to have this technology in
appropriate
>>>     parts of the main pipeline, and so the question here is not
>>>     really about breaking a "Polly bot", but just about a
"bot" in
>>>     general. I've given this question some thought and I think
it
>>>     sits in a reasonable place in the risk-reward space. The answer
>>>     would be, yes, we'd need to treat this like any other part
of
>>>     the pipeline. However, I believe that Polly has as many, or
>>>     more, active contributors than essentially any other individual
>>>     part of the mid-level optimizer or CodeGen. As a result, there
>>>     will be people around in many time zones to help with problems
>>>     with Polly-related code.
>>>
>>>>      As far as I know, there were companies in the past that
tried
>>>>     Polly without a whole lot of prior experience.  It would be
>>>>     great to hear what the experience was before adopting Polly
at
>>>>     a much larger scale.
>>>
>>>     I'm also interested, although I'll caution against
>>>     over-interpreting any evidence here (positive or negative).
>>>     Before a few weeks ago, Polly didn't effectively run in the
>>>     pipeline after inlining, and so I doubt it would have been much
>>>     use outside of embedded environments (and maybe some HPC
>>>     environments) with straightforwardly-presented C code. It's
only
>>>     now that this has been fixed that I find the possibility of
>>>     integrating this in production interesting.
>>
>>     That is a good point. There are also biases independent of past
>>     experiences (for disclosure mine is (*) above). But I think it is
>>     objective to say a Polly integration is a big piece to
>>     swallow.Your pro-Polly argument lists a number of categories that
>>     I think could be reasoned about individually and partly evaluated
>>     with a data-driven approach:
>>     A) Architecture
>>     - support for autoparallelism
>>     - support for accelerators
>>     - isl- rewrite? etc
>>     ...
>>     B) Modelling
>>     - polyhedral model
>>     - temporal locality
>>     - spatial locality
>>     …
>>     C) Analysis/Optimizations
>>     - Dependence Analysis
>>     - Transformation effective/power (loop nests, quality of
>>     transformations, #vectorizable loops etc)
>>
>>     A) is mostly Polly independent (except for the isl question I
>>     guess). For B and C performance/ compile-time /opt-viewer data on
>>     a decent/wide range of benchmarks possibly at different
>>     optimization levels (O2, O3, LTO, PGO etc and combinations)
>>     should provide data-driven insight into costs/benefits.
>
>     I agree. In practice, the first question is: Are will willing to
>     take on Polly (+isl), in whole or in part, as a build dependency?
>     If the answer is yes, the next question is: what parts should be
>     reused or refactored for use in other parts of the pipeline? My
>     argument is that we should take on Polly, or most of it, as a
>     build dependency. Work on better unifying the developer
>     communities as we start experimenting with other kinds of
>     integration. This will, however, allow us to provide to all of our
>     users these transformations through pragmas (and other kinds of
>     optional enablement). This is an important first step.
>
>
> When you write that we should take Polly as a build dependency: are 
> you envisioning this to be tied into LLVM to the point where we can't 
> build LLVM without Polly?
> I thought the approach would rather be that a default build of LLVM 
> would have Polly available but that there will "forever" be a 
> `-DENABLE_POLLY=OFF` option?
> This seems like an important distinction to me, as making it more 
> integrated but still optional for the correct behavior of LLVM means 
> that people can continue to work and maintain "forever" an 
> optimization pipeline which operates without Polly, while starting to 
> get more integration in a pluggable form. I believe this is the 
> direction that was taken with the GSoC last year, trying to get Polly 
> as a "pluggable" dependency analysis.
I don't want to make statements about "forever": eventually,
I'd like to
have strong loop optimizations as an integrated part of LLVM. At that 
point, "Polly" might be little more than a shorthand for some parts of
LLVM. Now LLVM currently has an issue whereby it's difficult to build 
customized versions with only a subset of needed passes, but that's an 
orthogonal issue. In the short term, it would certainly be separable. I 
believe that we'd certainly keep the `-DENABLE_POLLY=OFF` option as long 
as we felt it made sense.

By build dependency, I mean: The sources are part of the main source 
tree, they're built by default, thus essentially all of the builders 
build them, and they'll end up in many distributions and derived 
products. This will allow us to build on what's there in whatever ways 
make sense.
>
>     I'm not sure exactly how good this is, but polly has
>     LNT-submitting bots, so the website can generate a comparison
>     (e.g.,
>     http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182
>     <http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182>).
>     Looking at this comparison shows a number of potential problems
>     but also cases where Polly really helps (and, FWIW, the largest
>     two compile-time regressions are also associated with very large
>     execution performance improvements). My first focus would
>     certainly be on pragma-driven enablement.
>
>
> I'll add: PGO-driven enablement: spending compile-time where we know 
> it can have an impact. What do you think?
Certainly (although, obviously, we still need reasonable confidence that 
it won't cause performance regressions to do that).

  -Hal
>
> Best,
>
> -- 
> Mehdi
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170922/7a5c62ac/attachment-0001.html>

Johannes Doerfert via llvm-dev

2017-Sep-22 13:47 UTC

head link

[llvm-dev] [RFC] Polly Status and Integration

Hi Hal,

On 09/21, Hal Finkel via llvm-dev wrote:> 
> On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
> >
> >
> >>On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev
> >><llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>> wrote:
> >>
> >>
> >>On 09/11/2017 12:26 PM, Adam Nemet wrote:
> >>>Hi Hal, Tobias, Michael and others,
> >>>*...*
> >>>
> >>>One thing that I’d like to see more details on is what this
means for
> >>>the evolution of loop transformations in LLVM.
> >>>
> >>>Our more-or-less established direction was so far to
incrementally
> >>>improve and generalize the required analyses (e.g. the
> >>>LoopVectorizer’s dependence analysis + loop versioning analysis
into a
> >>>stand-alone analysis pass (LoopAccessAnalysis)) and then build
new
> >>>transformations (e.g. LoopDistribution, LoopLoadElimination,
> >>>LICMLoopVersioning) on top of these.
> >>>
> >>>The idea was that infrastructure would be incrementally
improved from
> >>>two directions:
> >>>
> >>>- As new transformations are built analyses have to be improved
(e.g.
> >>>past improvements to LAA to support the LoopVersioning utility,
future
> >>>improvements for full LoopSROA beyond just store->load
forwarding [1]
> >>>or the improvements to LAA for the LoopFusion proposal[2])
> >>>
> >>>- As more complex loops would have to be analyzed we either
improve
> >>>LAA or make DependenceAnalysis a drop-in replacement for the
memory
> >>>analysis part in LAA
> >>
> >>Or we could use Polly's dependence analysis, which I believe to
be more
> >>powerful, more robust, and more correct than DependenceAnalysis. I
> >>believe that the difficult part here is actually the pairing with
> >>predicated SCEV or whatever mechanism we want to use generate
runtime
> >>predicates (this applies to use of DependenceAnalysis too).
> >
> >What is a good way to measure these assertions (More powerful, more
> >robust)? Are you saying the LLVM Dependence Analysis is incorrect or do
> >you actually mean less conservative (or "more accurate" or
something like
> >that)?
> 
> Sebastian's email covers the issues with the DependenceAnalysis pass
pretty
> well.
> 
> Regarding what's in LoopAccessAnalysis, I believe it to be correct, but
more
> limited. It is not clear to me that LAA is bad at what it does based on
what
> the vectorizer can handle. LAA could do better in some cases with
> non-unit-stride loops. 
> Polly also handles piecewise-affine functions, which allows the
> modeling of loops with conditionals. Extending LAA to handle loop
> nests, moreover, seems likely to be non-trivial.What exactly do you mean here? A loop with a conditional does not
necessarily cause a piecewise-affine (access) function. The ones that
actually do cause piecewise-affine (acces) functions are not handled by
Polly (except min/max). This is not because Polly could not handle them
but because it relies on Scalar Evolution.
> Regardless, measuring these differences certainly seems like a good idea.It does, that's why I try to get some actual performance results soon.
If somebody does the same for Polly, e.g., by extending the GSoC'16
interface that exists, we will have actual data to talk about.
> I think that we can do this using optimization remarks. LAA already
> emits optimization remarks for loops in which it finds unsafe memory
> dependencies.  Polly also emits optimization remarks. We may need to
> iterate some in order to setup a good comparison, but we should be
> able to collect statistics (and other information) by compiling code
> using -fsave-optimization-record (in combination with some other
> flags), and then analyzing the resulting YAML files.
> 
> >>
> >>>
> >>>While this model may be slow it has all the benefits of the
> >>>incremental development model.
> >>
> >>The current model may have been slow in many areas, but I think
that's
> >>mostly a question of development effort. My largest concern about
the
> >>current model is that, to the extent that we're implementing
classic
> >>loop transformations (e.g., fusion, distribution, interchange,
skewing,
> >>tiling, and so on), we're repeating a historical design that is
known to
> >>have several suboptimal properties. Chief among them is the lack of
> >>integration: many of these transformations are interconnected, and
> >>there's no good pass ordering in which to make independent
decisions.
> >>Many of these transformations can be captured in a single model and
we
> >>can get much better results by integrating them. There's also
the matter
> >>of whether building these transformation on SCEV (or IR directly)
is the
> >>best underlying infrastructure, or whether parts of Polly would be
> >>better.
> >
> >I believe that is true. What I wonder is is there a good method to
reason
> >about it?
> 
> If I understand what you mean, one way to look at it is this: This is not a
> canonicalization problem. Picking an optimal way to interchange loops may
> depend on how the result can be skewed and/or tiled, picking an optimal way
> to distribute loops often depends on what can be done afterward in each
> piece. Optimal here generally involves reasoning about the memory hierarchy
> (e.g., cache properties), available prefetching streams, register-file
size,
> and so on.While I totally agree with the idea here I am not so convinced anymore.
The polyhedral scheduling technique that I have seen do not take more
than two of the above properties into account. One problem is simply the
way the scheduler works (per statement, dimension after dimension,
starting with the outermost one) and what that means for the quality of
performance prediction during the scheduling. It is also problematic
that properties might be "non-affine" which doesn't allow us to
encode
them directly. At the moment the only "heuristic" you mentioned that
is
actually used in Polly are cache properties for the hand-tuned
optimization (see below).
> I know that I've seen some good examples in papers over the years that
> illustrate the phase-ordering challenges. Hopefully, someone will jump in
> here with some good references. One classic one is: William Pugh. Uniform
> Techniques for Loop Optimization. 1991.
> 
> >Perhaps concrete examples or perhaps opt-viewer based comparisons on
large
> >sets of benchmarks? In the big picture you could make such a modeling
> >argument for all compiler optimizations.
> 
> Certainly. However, in this case there's a well-studied unified model
for
> this set of optimizations known to reduce phase-ordering effects.
That's not
> true in general.Polyhedral scheduling is not restricted by phase-ordering effects.
Though, it still has to be guided by an objective function that
realistically models "sufficient performance indicators" to actually
improve performance. I think we should talk more about how we could
get such an objective function and what might be intermediate steps as
it is (at least to me) absolutely not clear.
> >>That having been said, I think that integrating this technology
into
> >>LLVM will also mean applying appropriate modularity. I think that
we'll
> >>almost definitely want to make use of the dependence analysis
separately
> >>as an analysis. We'll want to decide which of these
transformations will
> >>be considered canonicalization (and run in the iterative pipeline)
and
> >>which will be lowering (and run near the vectorizer). LoopSROA
certainly
> >>sounds to me like canonicalization, but loop fusion might also fall
into
> >>that category (i.e., we might want to fuse early to enable
optimizations
> >>and then split late).
> >>
> >>>
> >>>Then there is the question of use cases.  It’s fairly obvious
that
> >>>anybody wanting to optimize a 5-deep highly regular loop-nest
> >>>operating on arrays should use Polly.  On the other hand it’s
way less
> >>>clear that we should use it for singly or doubly nested
not-so-regular
> >>>loops which are the norm in non-HPC workloads.
> >>
> >>This is clearly a good question, but thinking about Polly as a set
of
> >>components, not as a monolithic transformation component, I think
that
> >>polyhedral analysis and transformations can underlie a lot of the
> >>transformations we need for non-HPC code (and, which I'll point
out, we
> >>need for modern HPC code too). In practice, the loops that we can
> >>actually analyze have affine dependencies, and Polly does, or can
do, a
> >>better job at generating runtime predicates and dealing with
> >>piecewise-linear expressions than our current infrastructure.
> >>
> >>In short, I look at Polly as two things: First, an infrastructure
for
> >>dealing with loop analysis and transformation. I view this as being
> >>broadly applicable. Second, an application of that to apply
> >>cost-model-driven classic loop transformations. To some extent this
is
> >>going to be more useful for HPC codes, but also applies to machine
> >>learning, signal processing, graphics, and other areas.
> >I’m wondering if it could be used for pointing out headroom for the
> >existing LLVM ecosystem (*)
> 
> Sure.
> 
> >
> >>
> >>>
> >>>And this brings me to the maintenance question.  Is it
reasonable to
> >>>expect people to fix Polly when they have a seemingly unrelated
change
> >>>that happens to break a Polly bot.
> >>
> >>The eventual goal here is to have this technology in appropriate
parts
> >>of the main pipeline, and so the question here is not really about
> >>breaking a "Polly bot", but just about a "bot"
in general. I've given
> >>this question some thought and I think it sits in a reasonable
place in
> >>the risk-reward space. The answer would be, yes, we'd need to
treat this
> >>like any other part of the pipeline. However, I believe that Polly
has
> >>as many, or more, active contributors than essentially any other
> >>individual part of the mid-level optimizer or CodeGen. As a result,
> >>there will be people around in many time zones to help with
problems
> >>with Polly-related code.
> >>
> >>> As far as I know, there were companies in the past that tried
Polly
> >>>without a whole lot of prior experience.  It would be great to
hear
> >>>what the experience was before adopting Polly at a much larger
scale.
> >>
> >>I'm also interested, although I'll caution against
over-interpreting any
> >>evidence here (positive or negative). Before a few weeks ago, Polly
> >>didn't effectively run in the pipeline after inlining, and so I
doubt it
> >>would have been much use outside of embedded environments (and
maybe
> >>some HPC environments) with straightforwardly-presented C code.
It's
> >>only now that this has been fixed that I find the possibility of
> >>integrating this in production interesting.
> >
> >That is a good point. There are also biases independent of past
> >experiences (for disclosure mine is (*) above). But I think it is
> >objective to say a Polly integration is a big piece to swallow.Your
> >pro-Polly argument lists a number of categories that I think could be
> >reasoned about individually and partly evaluated with a data-driven
> >approach:
> >A) Architecture
> >- support for autoparallelism
> >- support for accelerators
> >- isl- rewrite? etc
> >...
> >B) Modelling
> >- polyhedral model
> >- temporal locality
> >- spatial locality
> >…
> >C) Analysis/Optimizations
> >- Dependence Analysis
> >- Transformation effective/power (loop nests, quality of
transformations,
> >#vectorizable loops etc)
> >
> >A) is mostly Polly independent (except for the isl question I guess).
For
> >B and C performance/ compile-time /opt-viewer data on a decent/wide
range
> >of benchmarks possibly at different optimization levels (O2, O3, LTO,
PGO
> >etc and combinations) should provide data-driven insight into
> >costs/benefits.
> 
> I agree. In practice, the first question is: Are will willing to take on
> Polly (+isl), in whole or in part, as a build dependency? If the answer is
> yes, the next question is: what parts should be reused or refactored for
use
> in other parts of the pipeline? My argument is that we should take on
Polly,
> or most of it, as a build dependency. Work on better unifying the developer
> communities as we start experimenting with other kinds of integration. This
> will, however, allow us to provide to all of our users these
transformations
> through pragmas (and other kinds of optional enablement). This is an
> important first step.
> 
> I'm not sure exactly how good this is, but polly has LNT-submitting
bots, so
> the website can generate a comparison (e.g.,
> http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182). Looking at
> this comparison shows a number of potential problems but also cases where
> Polly really helps (and, FWIW, the largest two compile-time regressions are
> also associated with very large execution performance improvements). My
> first focus would certainly be on pragma-driven enablement.To be completely honest, I think this comparisons oversell the positive
runtime effects of Polly quite a bit. (Though I do not think you try to
do that on purpose). Let me explain:

At least the first 5 of the 19 improved benchmarks are matrix-matrix
multiplications. These are not automatically scheduled by Polly but
instead pattern matched and replaced by hand-tuned, target specific
versions (ref. cache properties). Even if you argue this is a valid
"automatic optimization" it still is only one improvement shown five
times. Note also how matrix-matrix multiplications perform when there
are not matched:
  MultiSource/Benchmarks/VersaBench/bmm/bmm
  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm

Last time I checked the impact of Polly on the TSVC benchmarks I
realized that the effect was due to unrelated passes scheduled by Polly.
This might have changed by now but I doubt it. This leaves ten improved
Polybench benchmarks and whetstone. At least my local version of Polly
does not optimize the latter, thus the effect might be the same as for
the three TSVC benchmarks. For the Polybench benchmarks it would be
interesting to know what kind of transformation improved ten of them and
what regressed five others. I am especially interested if there is a
benefit from polyhedral scheduling or if it is only due to the tiling
that is applied afterwards.

--

I don't want to play devil's advocate here all the time but I think we
should be realistic on what we have and can expect. If not, we will
only cause unfulfilled expectations that won't help anybody.

Thanks,
  Johannes
> Thanks again,
> Hal
> 
> >
> >Cheers
> >Gerolf
> >
> >
> >
> >
> >>
> >>Thanks again,
> >>Hal
> >>
> >>>
> >>>Adam
> >>>
> >>>[1]
http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html
> >>>[2]
http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html
> >>>
> >>>
> >>>>*
> >>>>Sincerely,
> >>>>Hal (on behalf of myself, Tobias Grosser, and Michael
Kruse, with
> >>>>feedback from**several other active Polly developers)
> >>>>*...**
> >>>>
> >>>>-- 
> >>>>Hal Finkel
> >>>>Lead, Compiler Technology and Programming Languages
> >>>>Leadership Computing Facility
> >>>>Argonne National Laboratory
> >>>>_______________________________________________
> >>>>LLVM Developers mailing list
> >>>>llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
> >>>>http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >>>
> >>
> >>-- 
> >>Hal Finkel
> >>Lead, Compiler Technology and Programming Languages
> >>Leadership Computing Facility
> >>Argonne National Laboratory
> >>_______________________________________________
> >>LLVM Developers mailing list
> >>llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
> >>http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
> 
> -- 
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-- 

Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 228 bytes
Desc: Digital signature
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170922/cbcfc55f/attachment.sig>

Hal Finkel via llvm-dev

2017-Sep-22 18:21 UTC

head link

[llvm-dev] [RFC] Polly Status and Integration

On 09/22/2017 08:47 AM, Johannes Doerfert wrote:> Hi Hal,
>
> On 09/21, Hal Finkel via llvm-dev wrote:
>> On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
>>>
>>> ...
>> Polly also handles piecewise-affine functions, which allows the
>> modeling of loops with conditionals. Extending LAA to handle loop
>> nests, moreover, seems likely to be non-trivial.
> What exactly do you mean here? A loop with a conditional does not
> necessarily cause a piecewise-affine (access) function. The ones that
> actually do cause piecewise-affine (acces) functions are not handled by
> Polly (except min/max). This is not because Polly could not handle them
> but because it relies on Scalar Evolution.
Thanks for clarifying this. min/max is certainly an important special 
case, although it would certainly be good to handle more-general cases. 
Not to imply that SCEV is an ideal tool for this purpose (the 
replacement you propose, for example, may well be better), but it would 
seem like matching access functions that look like select(cond, AddRec, 
AddRec), and similar, would be useful and practical even within the 
current framework.

I'll immediately point out that, OTOH, while piecewise access functions 
are likely to be converted into selects early in our current 
infrastructure, there's also an ongoing discussion about whether that's 
actually optimal (i.e., forming selects early is likely suboptimal for 
GVN and we might want to delay this conversion until later in the 
pipeline). See PR34603. And so, if we want to think about using this 
kind of analysis during canonicalization as well as during lowering, 
more may be required to handle this consistently.
>
>> Regardless, measuring these differences certainly seems like a good
idea.
> It does, that's why I try to get some actual performance results soon.
> If somebody does the same for Polly, e.g., by extending the GSoC'16
> interface that exists, we will have actual data to talk about.
Given that Polly now runs right before the vectorizer, can we do some 
comparisons immediately just by looking at loops which LAA can't analyze 
but Polly can analyze (or vice versa)?
>
>> I think that we can do this using optimization remarks. LAA already
>> emits optimization remarks for loops in which it finds unsafe memory
>> dependencies.  Polly also emits optimization remarks. We may need to
>> iterate some in order to setup a good comparison, but we should be
>> able to collect statistics (and other information) by compiling code
>> using -fsave-optimization-record (in combination with some other
>> flags), and then analyzing the resulting YAML files.
>>
>>>>> While this model may be slow it has all the benefits of the
>>>>> incremental development model.
>>>> The current model may have been slow in many areas, but I think
that's
>>>> mostly a question of development effort. My largest concern
about the
>>>> current model is that, to the extent that we're
implementing classic
>>>> loop transformations (e.g., fusion, distribution, interchange,
skewing,
>>>> tiling, and so on), we're repeating a historical design
that is known to
>>>> have several suboptimal properties. Chief among them is the
lack of
>>>> integration: many of these transformations are interconnected,
and
>>>> there's no good pass ordering in which to make independent
decisions.
>>>> Many of these transformations can be captured in a single model
and we
>>>> can get much better results by integrating them. There's
also the matter
>>>> of whether building these transformation on SCEV (or IR
directly) is the
>>>> best underlying infrastructure, or whether parts of Polly would
be
>>>> better.
>>> I believe that is true. What I wonder is is there a good method to
reason
>>> about it?
>> If I understand what you mean, one way to look at it is this: This is
not a
>> canonicalization problem. Picking an optimal way to interchange loops
may
>> depend on how the result can be skewed and/or tiled, picking an optimal
way
>> to distribute loops often depends on what can be done afterward in each
>> piece. Optimal here generally involves reasoning about the memory
hierarchy
>> (e.g., cache properties), available prefetching streams, register-file
size,
>> and so on.
> While I totally agree with the idea here I am not so convinced anymore.
> The polyhedral scheduling technique that I have seen do not take more
> than two of the above properties into account. One problem is simply the
> way the scheduler works (per statement, dimension after dimension,
> starting with the outermost one) and what that means for the quality of
> performance prediction during the scheduling.
I agree that we may need to change the way that the scheduler works. 
However, there's value in having a unified transformation framework 
capable of performing the selected schedule (regardless of how we arrive 
at that schedule). Moreover, I suspect that we'll need to have 
conversation about the scheduler design in the context of the cost 
modeling parameters we need to consider, and how those models are 
expressed. As I understand it, the current incremental technique is 
largely motivated by efficiency concerns and/or to keep the problems 
linear? FWIW, while I'm not an expert in this area, a few of my 
colleagues at the lab are experts in mixed-integer optimization, and at 
some point soon I'll chat with them about the approach here.
>   It is also problematic
> that properties might be "non-affine" which doesn't allow us
to encode
> them directly.
But we'll solve this either by over approximating or restricting the 
problem space (e.g., to an inner loop where we can perform the 
analysis). Either way, this seems like an orthogonal concern.
>   At the moment the only "heuristic" you mentioned that is
> actually used in Polly are cache properties for the hand-tuned
> optimization (see below).
Certainly true. I think this will need to change if we want to do 
purely-cost-model-driven loop optimizations.
>
>> I know that I've seen some good examples in papers over the years
that
>> illustrate the phase-ordering challenges. Hopefully, someone will jump
in
>> here with some good references. One classic one is: William Pugh.
Uniform
>> Techniques for Loop Optimization. 1991.
>>
>>> Perhaps concrete examples or perhaps opt-viewer based comparisons
on large
>>> sets of benchmarks? In the big picture you could make such a
modeling
>>> argument for all compiler optimizations.
>> Certainly. However, in this case there's a well-studied unified
model for
>> this set of optimizations known to reduce phase-ordering effects.
That's not
>> true in general.
> Polyhedral scheduling is not restricted by phase-ordering effects.
> Though, it still has to be guided by an objective function that
> realistically models "sufficient performance indicators" to
actually
> improve performance. I think we should talk more about how we could
> get such an objective function and what might be intermediate steps as
> it is (at least to me) absolutely not clear.
The reason that I've made such a point of talking about the cost 
modeling is that:

  1. I think this is a large current weakness of Polly
  2. I think this is a large current deficiency in the literature (i.e., 
I don't know of anyone else who (publicly) discusses how to do this well)
  3. I think that improving this is going to be a lot of work

It's pretty clear that we need to handle a bunch of different effects 
here (depending on what we're doing), not just cache effects, but also 
prefetching streams, ILP, register pressure, and more. An exception to 
this will be if we do some transformations early (i.e., during 
canonicalization), in which case, we might not require as much from the 
cost model. That having been said, some potential canonicalization 
choices (e.g., loop fusion) may be impractical until we have a good 
cost-model-based way to undo them (e.g., loop fission).

Moreover, I think that we'll want to think seriously about how to we 
loop multiversioning in this context. A loop structure that is optimized 
for large working sets might not be optimal for working sets that fit in 
L1, or even L2, cache.
>
>> ...
>> I'm not sure exactly how good this is, but polly has LNT-submitting
bots, so
>> the website can generate a comparison (e.g.,
>> http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182). Looking
at
>> this comparison shows a number of potential problems but also cases
where
>> Polly really helps (and, FWIW, the largest two compile-time regressions
are
>> also associated with very large execution performance improvements). My
>> first focus would certainly be on pragma-driven enablement.
> To be completely honest, I think this comparisons oversell the positive
> runtime effects of Polly quite a bit.
Funny. I wasn't overly impressed by the comparison as it is. Mixed 
results on pollybench and positive results mostly on simple benchmark 
loops. One bright spot here is actually 
SingleSource/Benchmarks/Shootout-C++/ary3, because it shows that Polly 
can successfully transform a loop that uses std::vector (instead of just 
plain C-style arrays). The only DOE workload proxy, out of our current 
ten, that shows up here is 
MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC, and that's a 30% 
performance regression. Clearly I'm not advocating we turn this on by 
default under these conditions.
>   (Though I do not think you try to
> do that on purpose). Let me explain:
>
> At least the first 5 of the 19 improved benchmarks are matrix-matrix
> multiplications. These are not automatically scheduled by Polly but
> instead pattern matched and replaced by hand-tuned, target specific
> versions (ref. cache properties). Even if you argue this is a valid
> "automatic optimization" it still is only one improvement shown
five
> times.
The first three are certainly very similar in this regard. The others 
have outer loops, or other loops, that could be relevant (although I 
imagine that the matrix-multiplication part is very relevant). 
Regardless, this point is well taken.
>   Note also how matrix-matrix multiplications perform when there
> are not matched:
>    MultiSource/Benchmarks/VersaBench/bmm/bmm
>    SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm
For everyone's convenience, in this comparison, both of these show 
significant compile-time and runtime regressions.
>
> Last time I checked the impact of Polly on the TSVC benchmarks I
> realized that the effect was due to unrelated passes scheduled by Polly.
> This might have changed by now but I doubt it.
I was under the impression that this did change when Polly was moved 
later in the pipeline (and it looks like this is true from the code in 
Support/RegisterPasses.cpp, but I certainly could be missing something). 
OTOH, if you're right, and this is due to a late run of some 
canonicalization pass (the things registered by 
polly::registerCanonicalicationPasses), we should definitely understand 
that: some of those performance improvements are very large (~25% for 
/LoopRestructuring-flt).
> This leaves ten improved
> Polybench benchmarks and whetstone. At least my local version of Polly
> does not optimize the latter, thus the effect might be the same as for
> the three TSVC benchmarks. For the Polybench benchmarks it would be
> interesting to know what kind of transformation improved ten of them and
> what regressed five others. I am especially interested if there is a
> benefit from polyhedral scheduling or if it is only due to the tiling
> that is applied afterwards.
Definitely good questions.

Regardless, it's clear that the cost modeling needs work (and there are 
other potential significant improvements). In the short term, I find 
most appealing the framework for enabling complex transformations, which 
we can enable by pragmas, and trying to better integrate the community. 
The point I take is that Polly can do good things, whether by general 
scheduling or by pattern matching (and it's good that it can do both in 
the same infrastructure).
>
> --
>
> I don't want to play devil's advocate here all the time but I think
we
> should be realistic on what we have and can expect. If not, we will
> only cause unfulfilled expectations that won't help anybody.
I find this feedback very valuable (and I'm sure that others do as 
well). Besides, you've made concrete alternative suggestions, so you're 
not just playing devil's advocate.

Thanks again,
Hal
>
> Thanks,
>    Johannes
>
> ...
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

Seemingly Similar Threads

Search for more maybe matching threads

llvm dev - Sep 2017 - [RFC] Polly Status and Integration

[llvm-dev] [RFC] Polly Status and Integration

[llvm-dev] [RFC] Polly Status and Integration

[llvm-dev] [RFC] Polly Status and Integration

[llvm-dev] [RFC] Polly Status and Integration

[llvm-dev] [RFC] Polly Status and Integration

Seemingly Similar Threads