Hal Finkel via llvm-dev
2017-Sep-01 18:47 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi everyone,

As you may know, stock LLVM does not provide the kind of advanced loop transformations necessary to provide good performance on many applications. LLVM's Polly project provides many of the required capabilities, including loop transformations such as fission, fusion, skewing, blocking/tiling, and interchange, all powered by state-of-the-art dependence analysis. Polly also provides automated parallelization and targeting of GPUs and other accelerators.

Over the past year, Polly’s development has focused on robustness, correctness, and closer integration with LLVM. To highlight a few accomplishments:

- Polly now runs, by default, in the conceptually proper place in LLVM’s pass pipeline (just before the loop vectorizer). Importantly, this means that its loop transformations are performed after inlining and other canonicalization, greatly increasing its robustness and enabling its use on C++ code, where [] is often a function call before inlining (see the sketch after this list).

- Polly’s cost-modeling parameters, such as those describing the target’s memory hierarchy, are being integrated with TargetTransformInfo. This allows targets to properly override the modeling parameters and allows reuse of these parameters by other clients.

- Polly’s method of handling signed division/remainder operations, which worked around a lack of support in ScalarEvolution, is being replaced thanks to improvements being contributed to ScalarEvolution itself (see D34598). Polly’s core delinearization routines have long been a part of LLVM itself.

- PolyhedralInfo, which exposes a subset of Polly’s loop analysis for use by other clients, is now available.

- Polly is now part of the LLVM release process and is being included with LLVM by various packagers (e.g., Debian).
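To illustrate the inlining point, here is a minimal example (written here only for illustration, not taken from Polly's test suite):

  #include <cstddef>
  #include <vector>

  // Before inlining, v[i] is an opaque call to
  // std::vector<double>::operator[]; loop analyses cannot model it.
  // After inlining, it reduces to an affine access into the vector's
  // underlying buffer, which polyhedral analysis can reason about.
  double sum(const std::vector<double> &v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
      s += v[i];
    return s;
  }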
I believe that the LLVM community would benefit from beginning the process of integrating Polly with LLVM itself and continuing its development as part of our main code base. This will:

- Allow for wider adoption of LLVM within communities relying on advanced loop transformations.

- Provide for better community feedback on, and testing of, the code developed (although the story in this regard is already fairly solid).

- Better motivate targets to provide accurate, comprehensive modeling parameters for use by advanced loop transformations.

- Perhaps most importantly, allow us to develop and tune the rest of the optimizer assuming that Polly’s capabilities are present (the underlying analysis and, eventually, the transformations themselves).

The largest issue on which community consensus is required, in order to move forward at all, is what to do with isl. isl, the Integer Set Library, provides core functionality on which Polly depends. It is a C library, and while some Polly/LLVM developers are also isl developers, it has a large user community outside of LLVM/Polly. A C++ interface was recently added, and Polly is transitioning to use the C++ interface. Nevertheless, options here include rewriting the needed functionality, forking isl and transitioning our fork toward LLVM coding conventions (and data structures) over time, and incorporating isl more or less as-is to avoid partitioning its development.

That having been said, isl is internally modular, and regardless of the overall integration strategy, the Polly developers anticipate specializing, or even replacing, some of these components with LLVM-specific solutions. This is especially true for anything that touches performance-related heuristics and modeling. LLVM-specific, or even target-specific, loop schedulers may be developed as well.

Even though some developers in the LLVM community already have a background in polyhedral-modeling techniques, the Polly developers have developed, and are still developing, extensive tutorials on this topic: http://pollylabs.org/education.html and especially http://playground.pollylabs.org.

Finally, let me highlight a few ongoing development efforts in Polly that are potentially relevant to this discussion. Polly’s loop analysis is sound and technically superior to what’s in LLVM currently (i.e., in LoopAccessAnalysis and DependenceAnalysis). There are, however, two known reasons why Polly’s transformations could not yet be enabled by default:

- A correctness issue: Currently, Polly assumes that 64 bits is large enough for all new loop-induction variables and index expressions. In rare cases, transformations could be performed where more bits are required. Preconditions need to be generated to prevent this (e.g., D35471).

- A performance issue: Polly currently models temporal locality (i.e., it tries to get better reuse in time), but does not model spatial locality (i.e., it does not model cache-line reuse); see the illustration after this list. As a result, it can sometimes introduce performance regressions. Polly Labs is currently working on integrating spatial-locality modeling into the loop optimization model.
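To illustrate the spatial-locality issue (a hand-written example, not Polly output): both functions below do the same work and have identical temporal reuse (none), but differ greatly in cache-line reuse:

  // With a row-major array, A[i][j] and A[i][j+1] share a cache line.
  // The first nest walks memory contiguously (good spatial locality);
  // the second strides by N doubles per access (poor cache-line
  // reuse). A purely temporal model cannot tell these two apart.
  constexpr int N = 1024;

  void row_major(double A[N][N]) {
    for (int i = 0; i < N; ++i)
      for (int j = 0; j < N; ++j)
        A[i][j] += 1.0;
  }

  void column_major(double A[N][N]) {
    for (int j = 0; j < N; ++j)
      for (int i = 0; i < N; ++i)
        A[i][j] += 1.0;
  }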
Polly can already split apart basic blocks in order to implement loop fusion. Heuristics to choose the granularity are still being implemented (e.g., PR12402).

I believe that we can now develop a concrete plan for moving state-of-the-art loop optimizations, based on the technology in the Polly project, into LLVM. Doing so will enable LLVM to be competitive with proprietary compilers in high-performance computing, machine learning, and other important application domains. I'd like community feedback on what should be part of that plan.

Sincerely,
Hal (on behalf of myself, Tobias Grosser, and Michael Kruse, with feedback from several other active Polly developers)

We thank the numerous people who have contributed to the Polly infrastructure: Alexandre Isoard, Andreas Simbuerger, Andy Gibbs, Annanay Agarwal, Armin Groesslinger, Ajith Pandel, Baranidharan Mohan, Benjamin Kramer, Bill Wendling, Chandler Carruth, Craig Topper, Chris Jenneisch, Christian Bielert, Daniel Dunbar, Daniel Jasper, David Blaikie, David Peixotto, Dmitry N. Mikushin, Duncan P. N. Exon Smith, Eli Friedman, Eugene Zelenko, George Burgess IV, Hans Wennborg, Hongbin Zheng, Huihui Zhang, Jakub Kuderski, Johannes Doerfert, Justin Bogner, Karthik Senthil, Logan Chien, Lawrence Hu, Mandeep Singh Grang, Matt Arsenault, Matthew Simpson, Mehdi Amini, Micah Villmow, Michael Kruse, Matthias Reisinger, Maximilian Falkenstein, Nakamura Takumi, Nandini Singhal, Nicolas Bonfante, Patrik Hägglund, Paul Robinson, Philip Pfaffe, Philipp Schaad, Peter Conn, Pratik Bhatu, Rafael Espindola, Raghesh Aloor, Reid Kleckner, Roal Jordans, Richard Membarth, Roman Gareev, Saleem Abdulrasool, Sameer Sahasrabuddhe, Sanjoy Das, Sameer AbuAsal, Sam Novak, Sebastian Pop, Siddharth Bhat, Singapuram Sanjay Srivallabh, Sumanth Gundapaneni, Sunil Srivastava, Sylvestre Ledru, Star Tan, Tanya Lattner, Tim Shen, Tarun Ranjendran, Theodoros Theodoridis, Utpal Bora, Wei Mi, Weiming Zhao, and Yabin Hu.

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Davide Italiano via llvm-dev
2017-Sep-01 19:18 UTC
[llvm-dev] [RFC] Polly Status and Integration
On Fri, Sep 1, 2017 at 11:47 AM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [Hal's original RFC quoted in full; snipped]
This, at least on paper, sounds great. I think LLVM could greatly benefit from this information for some applications. I have a couple of questions:

1) I'm aware there have been attempts in the past to make polyhedral value information available to LLVM, but they've been unsuccessful. Do you plan to develop new LLVM passes to overcome this issue?

2) As far as I can tell (last I tried, but that was a while ago), Polly had a significant compile-time impact. Do you plan to introduce a new -Opolly pipeline?

On the isl story: I think it would be better to have Polly be self-contained (with LLVM implementing the needed functionality), but I understand that's a major undertaking and it comes at a cost (i.e., LLVM will have to maintain this forked/reimplemented library forever instead of leveraging upstream work). LLVM tries to minimize the number of required dependencies, but it seems it's OK to add new external optional deps (recently, Z3), so that could be an OK path forward. I don't have a strong opinion on what's the best solution here, FWIW.

Thanks,

--
Davide

"There are no solved problems; there are only problems that are more or less solved" -- Henri Poincare
Tobias Grosser via llvm-dev
2017-Sep-01 21:49 UTC
[llvm-dev] [RFC] Polly Status and Integration
On Fri, Sep 1, 2017, at 21:18, Davide Italiano via llvm-dev wrote:
> On Fri, Sep 1, 2017 at 11:47 AM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> [Hal's original RFC quoted in full; snipped]
> This, at least on paper, sounds great. I think LLVM could greatly
> benefit from this information for some applications.
> I have a couple of questions:
> 1) I'm aware there have been attempts in the past to make polyhedral
> value information available to LLVM, but they've been unsuccessful.
> Do you plan to develop new LLVM passes to overcome this issue?

It would be great to have a polyhedral range analysis in LLVM. We have most of the code for this in Polly. Especially with the new isl C++ bindings, its code could also look very nice!
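To sketch what a client of such an analysis might see, here is a purely hypothetical interface (these names exist in neither Polly nor LLVM; they are invented here for illustration):

  // Hypothetical sketch only: a flow-sensitive, polyhedral-backed
  // range query. Unlike Scalar Evolution's flow-insensitive ranges,
  // it could account for dominating conditions at the query point.
  namespace llvm { class Value; class Instruction; }

  struct PolyRange {
    long Min, Max; // conservative bounds on the value
    bool Exact;    // true if the polyhedral model proved them tight
  };

  class PolyhedralRangeAnalysis { // invented name
  public:
    // E.g., for "if (i < n) use(i);" the range of i at the use would
    // be [0, n-1] rather than the full range of its type.
    PolyRange getRangeAt(const llvm::Value *V,
                         const llvm::Instruction *Point);
  };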
> 2) As far as I can tell (last I tried, but that was a while ago),
> Polly had a significant compile-time impact. Do you plan to introduce
> a new -Opolly pipeline?

We do not intend to enable Polly by default just yet, so a new -O or -f option seems appropriate. I suggest we leave the naming debate for later.

We have also worked on performance over the last two years:

- Introduced fast arbitrary-precision integers to back isl (roughly a 7x speedup vs. imath, 2x vs. GMP, and only about 30% slower than native integers).

- Worked with Eli Friedman on compiling all of AOSP. We do this in "-polly-process-unprofitable" mode, where we run on anything we can analyze and make sure timeouts and validity checks are all in place.

- Adjusted our core scheduler's asymptotic complexity so that it is not a function of the number of statements to schedule, but rather of the maximal number of statements that are fused: https://lirias.kuleuven.be/bitstream/123456789/586108/1/CW706.pdf

- Worked on bailing out fast (though this slightly regressed in recent months). For code that is not amenable to polyhedral optimizations, I would like to see almost zero compile-time overhead. We have a pure LLVM-IR analysis before Polly, which just scans the IR and decides whether we should look at it more intensively.

> On the isl story: I think it would be better to have Polly be
> self-contained (with LLVM implementing the needed functionality), but
> I understand that's a major undertaking and it comes at a cost (i.e.,
> LLVM will have to maintain this forked/reimplemented library forever
> instead of leveraging upstream work). LLVM tries to minimize the
> number of required dependencies, but it seems it's OK to add new
> external optional deps (recently, Z3), so that could be an OK path
> forward. I don't have a strong opinion on what's the best solution
> here, FWIW.

We currently ship a snapshot of isl with Polly. This has proven very useful, as we ship isl "trunk", which is updated quickly in case of bug reports. Having a single version of isl shipped in sync with Polly is also desirable when receiving bug reports for Polly/isl.

As Hal pointed out, I expect us to complement at least parts of isl with our own heuristics and analyses. While this is likely the way to go, I believe isl provides solid "defaults" for us to start off with and to identify where LLVM-specific solutions indeed add value.

Best,
Tobias
Adam Nemet via llvm-dev
2017-Sep-11 17:26 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi Hal, Tobias, Michael and others,

> On Sep 1, 2017, at 11:47 AM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> [Hal's original RFC quoted in full; snipped]
One thing that I’d like to see more details on is what this means for the evolution of loop transformations in LLVM.

Our more-or-less established direction so far has been to incrementally improve and generalize the required analyses (e.g., the LoopVectorizer’s dependence analysis and loop-versioning analysis were turned into a stand-alone analysis pass, LoopAccessAnalysis) and then to build new transformations (e.g., LoopDistribution, LoopLoadElimination, LICMLoopVersioning) on top of these (a sketch of what loop distribution does follows below).

The idea was that the infrastructure would be incrementally improved from two directions:

- As new transformations are built, analyses have to be improved (e.g., past improvements to LAA to support the LoopVersioning utility, future improvements for full LoopSROA beyond just store->load forwarding [1], or the improvements to LAA for the LoopFusion proposal [2]).

- As more complex loops have to be analyzed, we either improve LAA or make DependenceAnalysis a drop-in replacement for the memory-analysis part of LAA.

While this model may be slow, it has all the benefits of the incremental development model.
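For readers unfamiliar with these passes, here is a minimal hand-written before/after sketch of the kind of transformation LoopDistribution performs (illustrative only, not actual pass output):

  // Distribution splits one loop into two: the first loop carries a
  // backward dependence through a[] and stays scalar; the second,
  // now independent, can be vectorized.
  void before(double *a, double *b, const double *c, int n) {
    for (int i = 1; i < n; ++i) {
      a[i] = a[i - 1] + c[i]; // loop-carried dependence
      b[i] = c[i] * 2.0;      // independent
    }
  }

  void after(double *a, double *b, const double *c, int n) {
    for (int i = 1; i < n; ++i)
      a[i] = a[i - 1] + c[i];
    for (int i = 1; i < n; ++i) // vectorizable
      b[i] = c[i] * 2.0;
  }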
Then there is the question of use cases. It’s fairly obvious that anybody wanting to optimize a 5-deep, highly regular loop nest operating on arrays should use Polly. On the other hand, it’s far less clear that we should use it for singly or doubly nested not-so-regular loops, which are the norm in non-HPC workloads.

And this brings me to the maintenance question. Is it reasonable to expect people to fix Polly when they have a seemingly unrelated change that happens to break a Polly bot?

As far as I know, there were companies in the past that tried Polly without a whole lot of prior experience. It would be great to hear what the experience was before adopting Polly at a much larger scale.

Adam

[1] http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html
[2] http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html
Hal Finkel via llvm-dev
2017-Sep-12 05:47 UTC
[llvm-dev] [RFC] Polly Status and Integration
On 09/11/2017 12:26 PM, Adam Nemet wrote:
> Hi Hal, Tobias, Michael and others,
>
>> [Hal's original RFC quoted in full; snipped]
>
> One thing that I’d like to see more details on is what this means for
> the evolution of loop transformations in LLVM.
>
> Our more-or-less established direction so far has been to
> incrementally improve and generalize the required analyses (e.g., the
> LoopVectorizer’s dependence analysis and loop-versioning analysis were
> turned into a stand-alone analysis pass, LoopAccessAnalysis) and then
> to build new transformations (e.g., LoopDistribution,
> LoopLoadElimination, LICMLoopVersioning) on top of these.
>
> The idea was that the infrastructure would be incrementally improved
> from two directions:
>
> - As new transformations are built, analyses have to be improved
> (e.g., past improvements to LAA to support the LoopVersioning utility,
> future improvements for full LoopSROA beyond just store->load
> forwarding [1], or the improvements to LAA for the LoopFusion
> proposal [2]).
>
> - As more complex loops have to be analyzed, we either improve LAA or
> make DependenceAnalysis a drop-in replacement for the memory-analysis
> part of LAA.
>
> [1] http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html
> [2] http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html
Or we could use Polly's dependence analysis, which I believe to be more powerful, more robust, and more correct than DependenceAnalysis. I believe that the difficult part here is actually the pairing with predicated SCEV, or whatever mechanism we want to use to generate runtime predicates (this applies to the use of DependenceAnalysis too). A sketch of the kind of runtime predicate in question follows.
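This is a hand-written source-level sketch of what loop versioning with a runtime predicate amounts to (illustrative only; the compiler emits this in IR, not in source):

  #include <cstdint>

  // When the compiler cannot prove that a and b do not overlap, it
  // can emit a runtime no-overlap check and branch to a copy of the
  // loop in which it may assume independence (and, e.g., vectorize),
  // keeping the original loop as a fallback.
  void saxpy(float *a, const float *b, float s, long n) {
    std::uintptr_t A = reinterpret_cast<std::uintptr_t>(a);
    std::uintptr_t B = reinterpret_cast<std::uintptr_t>(b);
    bool NoOverlap = A + n * sizeof(float) <= B ||
                     B + n * sizeof(float) <= A;
    if (NoOverlap) {
      for (long i = 0; i < n; ++i) // compiler may vectorize freely
        a[i] += s * b[i];
    } else {
      for (long i = 0; i < n; ++i) // conservative fallback
        a[i] += s * b[i];
    }
  }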
> While this model may be slow, it has all the benefits of the
> incremental development model.

The current model may have been slow in many areas, but I think that's mostly a question of development effort. My largest concern about the current model is that, to the extent that we're implementing classic loop transformations (e.g., fusion, distribution, interchange, skewing, tiling, and so on), we're repeating a historical design that is known to have several suboptimal properties. Chief among them is the lack of integration: many of these transformations are interconnected, and there's no good pass ordering in which to make independent decisions. Many of these transformations can be captured in a single model, and we can get much better results by integrating them. There's also the matter of whether building these transformations on SCEV (or IR directly) is the best underlying infrastructure, or whether parts of Polly would be better.

That having been said, I think that integrating this technology into LLVM will also mean applying appropriate modularity. I think that we'll almost definitely want to make use of the dependence analysis separately as an analysis. We'll want to decide which of these transformations are considered canonicalization (and run in the iterative pipeline) and which are lowering (and run near the vectorizer). LoopSROA certainly sounds to me like canonicalization, but loop fusion might also fall into that category (i.e., we might want to fuse early to enable optimizations and then split late), as the example below illustrates.
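A small hand-written illustration of fusion as canonicalization (my own example, not pass output): fusing brings a producer and its only consumer into one iteration, after which the temporary array can be scalarized away.

  // Before: two loops communicating through a temporary array t[].
  void unfused(double *out, const double *in, double *t, int n) {
    for (int i = 0; i < n; ++i)
      t[i] = in[i] * 2.0;
    for (int i = 0; i < n; ++i)
      out[i] = t[i] + 1.0;
  }

  // After fusion, the store to t[i] and its only load meet in the
  // same iteration; store-to-load forwarding then eliminates t[]
  // entirely, which neither loop exposes in isolation.
  void fused(double *out, const double *in, int n) {
    for (int i = 0; i < n; ++i) {
      double t = in[i] * 2.0;
      out[i] = t + 1.0;
    }
  }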
> Then there is the question of use cases. It’s fairly obvious that
> anybody wanting to optimize a 5-deep, highly regular loop nest
> operating on arrays should use Polly. On the other hand, it’s far
> less clear that we should use it for singly or doubly nested
> not-so-regular loops, which are the norm in non-HPC workloads.

This is clearly a good question, but thinking about Polly as a set of components, not as a monolithic transformation component, I think that polyhedral analysis and transformations can underlie a lot of the transformations we need for non-HPC code (and, I'll point out, that we need for modern HPC code too). In practice, the loops that we can actually analyze have affine dependencies, and Polly does, or can do, a better job at generating runtime predicates and dealing with piecewise-linear expressions than our current infrastructure.

In short, I look at Polly as two things: First, an infrastructure for dealing with loop analysis and transformation. I view this as being broadly applicable. Second, an application of that infrastructure to apply cost-model-driven classic loop transformations. To some extent this is going to be more useful for HPC codes, but it also applies to machine learning, signal processing, graphics, and other areas.

> And this brings me to the maintenance question. Is it reasonable to
> expect people to fix Polly when they have a seemingly unrelated change
> that happens to break a Polly bot?

The eventual goal here is to have this technology in appropriate parts of the main pipeline, and so the question here is not really about breaking a "Polly bot", but just about a "bot" in general. I've given this question some thought and I think it sits in a reasonable place in the risk-reward space. The answer would be, yes, we'd need to treat this like any other part of the pipeline. However, I believe that Polly has as many, or more, active contributors than essentially any other individual part of the mid-level optimizer or CodeGen. As a result, there will be people around in many time zones to help with problems with Polly-related code.

> As far as I know, there were companies in the past that tried Polly
> without a whole lot of prior experience. It would be great to hear
> what the experience was before adopting Polly at a much larger scale.

I'm also interested, although I'll caution against over-interpreting any evidence here (positive or negative). Before a few weeks ago, Polly didn't effectively run in the pipeline after inlining, and so I doubt it would have been much use outside of embedded environments (and maybe some HPC environments) with straightforwardly presented C code. It's only now that this has been fixed that I find the possibility of integrating this in production interesting.

Thanks again,
Hal

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Johannes Doerfert via llvm-dev
2017-Sep-20 00:33 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi Hal, Tobias, Michael, and others,

I'd like to add my view (and a proposal) to this discussion, and I apologize directly for doing this so late.* I also want to apologize because this email is long and contains various technical details and argumentations that might need more justification. However, I am happy to provide further information (and/or examples) to explain my views if required.

tldr; We should introduce polyhedral analysis and optimization capabilities into LLVM step by step. Along the way we should revisit design decisions made by Polly and adjust them according to the use cases in the rest of LLVM. Finally, we should keep Polly as a stand-alone tool to allow research into, and prototyping of, complex loop transformations.

* There was a paper deadline at the end of last week and I simply had to prioritize.

-------------------------------------------------------------------------------

LLVM performs "simple" loop transformations and targets almost exclusively innermost loops. Polly offers a pipeline to perform "complex" polyhedral-based loop optimizations on a sequence of loop nests. It seems natural to integrate Polly into the LLVM pipeline, as it apparently complements it well. However, I do not believe that it is wise to integrate Polly into LLVM, for a variety of reasons, some of which I will explain in more detail now. Afterwards I will propose a different way to achieve (most of) the things Hal mentioned in his mail in an incremental way. Nevertheless, I want to stress that I am still an advocate of Polly and I honestly believe it to be a valuable part of the LLVM environment. Though I strongly feel it should be continued as a research tool and test bed to allow research into, and prototyping of, (polyhedral-model-based) analyses and optimizations.

---

Polly is a deep pipeline of passes that were developed for the sole purpose of applying scheduling transformations at the very end. This "main purpose" of Polly makes it hard to effectively reuse intermediate parts, e.g., the dependence analysis or the code-generation framework, at least to the degree we might want to. In a different thread [1] Hal asked about the possibility of reusing Polly for tuning via pragmas, e.g., unrolling/interleaving/tiling/interchange. While this is interesting and generally possible, it won't take long before the restrictions Polly places on the input code (to enable scheduling optimizations in the end) become a serious problem. The point here is that Polly has too many features and dependences for such an "easy" use case. (Note that I do not question the legality of a user-requested transformation here.)

If you are not looking for "end-to-end polyhedral loop optimizations", you want a system that answers a very specific question (e.g., isVectorizable) or performs a very specific transformation (e.g., tiling) on a maximal set of programs. In contrast, Polly is (and I strongly believe it should be) developed as a system that performs complex optimizations, based on specialized analysis results, on a constrained class of programs.

One might argue that Polly will become more modular when it is integrated into LLVM and when the intermediate results are used in more and more places. However, I'd say this is the "wrong direction" with regard to: 1) incremental design and development, 2) maintainability of individual parts, and 3) the development of a suitable cost model (which Polly does not ship).
Instead of starting with a full, hard-to-understand scheduling pipeline, we should start with the use cases at hand and design specific analysis (and also transformation) passes "from scratch". The long-term goal should be a full scheduling pipeline, but the parts need to be designed in a modular (LLVM) way from the very beginning. This allows us to:
- provide __immediate benefit__ to other developers,
- allow active participation in (and understanding of) the design of each part, and
- develop the intermediate parts with the requirements of the whole LLVM project in mind.

Let me give two examples to make my point:

-- Example 1 --
In addition to the applicability issues, there are other problems that arise from the full-pipeline design. Let's assume we want to apply a pragma-driven optimization to a loop that can be represented (and optimized) by Polly. Depending on the pragma, we only need to perform a very specific task, e.g., adjust the iteration variable (loop inversion) or introduce new loops and adjust existing iteration variables (tiling). Polly will, however, represent (and actually analyze) the whole loop nest, including the exact control flow and all accesses. In addition it will compute alias checks, constraints under which the representation is not truthful (e.g., an access function might overflow), and a lot of other things that are not needed for the task at hand.

-- Example 2 --
We can also imagine that we want to determine if a loop can be vectorized, i.e., whether there are (short) loop-carried dependences (see the example below). No matter which technique is used, the complexity of a dependence analysis will certainly increase with the number (and complexity) of the analyzed accesses. Since Polly is designed with fine-grained scheduling optimizations in mind, the dependences it computes are disproportionate with regard to the question asked. Consequently, it will either take more time to determine if there are loop-carried dependences, or a time-out in the computation will cause a conservatively negative result.
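A concrete instance of Example 2 (code invented for illustration):

  // The load of a[i - 1] reads the value stored in the previous
  // iteration: a loop-carried dependence of distance 1, which forbids
  // straightforward vectorization. A client asking only "is there a
  // short loop-carried dependence?" needs far less than the exact
  // dependence relations Polly computes for scheduling.
  void smooth(double *a, int n) {
    for (int i = 1; i < n; ++i)
      a[i] = 0.5 * (a[i] + a[i - 1]);
  }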
---

While it is generally possible to disable some "features" of Polly or to "work around" them, this requires new independent code paths in an already highly interleaved project. The explicit (and implicit) interactions between the different parts of Polly are plentiful and subtle. I don't think there are 5 people that know about more than half of them. (I also doubt there is anyone who knows them all.)

---

I want to propose an alternative plan, on which I would appreciate feedback. The feedback is especially useful since I will present this, or more accurately the concrete polyhedral analyses described below, in the student research competition at the LLVM Dev meeting in October.

The idea is simple. We rewrite parts of Polly with two goals in mind: 1) to provide analysis results and transformation options to LLVM, and 2) to allow polyhedral loop transformations. A rewrite allows us to revisit various design decisions that are buried deep in the Polly code and that hinder its application (in any context). The interested reader can find a short list at the end.

I think we should incrementally introduce polyhedral analysis capabilities in order to provide immediate benefits and to allow developers to get involved with the design and usage from an early point in time. Decoupled analyses also allow easier integration, interfaces that come "more naturally" to non-polyhedral people, and better use of the capabilities if polyhedral scheduling is not the goal.

To demonstrate this incremental process, I have started to write a polyhedral value analysis.** Think of it as Scalar Evolution, but with piecewise-defined polyhedral values as the abstract domain (a small illustration follows below). This analysis is demand driven, works at the granularity of an LLVM value, is flow sensitive, and is especially designed to be applicable to partially affine programs. It already supports most of what the polyhedral model allows and includes a lot of things we prototyped in Polly. It shall be used when Scalar Evolution is not able to provide a sufficient result.

On top of it I implemented a polyhedral memory analysis*** that is (currently) capable of summarizing the memory effects of a code region, e.g., a function. The next step is a demand-driven dependence analysis**** that utilizes the (often) very structured nature of a CFG. Afterwards, I would like to see a transformation framework that allows classical but also scheduling-based transformations.

While these analyses still depend on isl as a back end, they never use it directly. Instead there is a C++ interface in between that encapsulates the features needed. Once our use cases are clear, we can re-implement this interface.

** For comparison, the value analysis covers parts of the ScopDetection, all of SCEVValidator and SCEVAffinator, as well as parts of ScopInfo and ScopBuilder in Polly.
*** The memory analysis is concerned with the construction and functionality Polly implements in the MemoryAccess class.
**** The dependence analysis is similar to Polly's DependenceInfo class but is supposed to allow more fine-grained control.
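As a small illustration of a piecewise, flow-sensitive value (my own example; the notation in the comment is standard isl map syntax, not the analysis's actual interface):

  // Scalar Evolution cannot express the clamped index x as one affine
  // function of i, but a piecewise polyhedral value can:
  //   x = { [i] -> [i]   : 0 <= i <= 99;
  //         [i] -> [100] : i >= 100 }
  // A flow-sensitive analysis derives the two pieces directly from
  // the branch condition.
  void clamp_store(double *a, int n) {
    for (int i = 0; i < n; ++i) {
      int x = (i < 100) ? i : 100;
      a[x] += 1.0;
    }
  }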
---

I am aware that not all Polly developers will agree with my views, and that some of these points are simply trade-offs that do not offer a single "correct way". Nevertheless, I hope that the simple examples I used to point out the various problems are reason enough to consider the route I proposed.

Cheers,
  Johannes

[1] https://groups.google.com/d/msg/polly-dev/ACSRlFqtpDE/t5EDlIwtAgAJ
--
Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert
Sjoerd Meijer via llvm-dev
2017-Sep-20 09:33 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi,

I have not been following the polyhedral developments in GCC and LLVM very closely over the last few years, but I was wondering whether there are any lessons (that we should look at, or are actually already aware of) from GCC's implementation and integration of Graphite. For example, do we know if it is still enabled/used/etc.?

And in the context of possibly rewriting bits and pieces of Polly, thus making it modular and making its exact analyses available to "traditional" optimisation passes such as, e.g., the vectoriser: is there data that shows this really improves performance? Is there, e.g., data available that shows the performance uplift of the vectoriser using the exact polyhedral data dependence analysis over the traditional data dependence analysis?

I am a big fan of the polyhedral model and these are not leading questions; I was just wondering if that could help in deciding whether to make it modular and slowly integrate it, or to keep it as a separate pipeline.

Cheers,
Sjoerd.
Johannes Doerfert via llvm-dev
2017-Sep-20 16:04 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi Sjoerd,

On 09/20, Sjoerd Meijer wrote:
> I have not been following the polyhedral developments in GCC and LLVM
> very closely over the last few years, but I was wondering whether
> there are any lessons (that we should look at, or are actually already
> aware of) from GCC's implementation and integration of Graphite. For
> example, do we know if it is still enabled/used/etc.?

I tried to look this up online, but most documents were rather old (<= 2015). The changes to graphite.{h,c} seem to be scarce nowadays too (6 in 2016 + 2017). However, I am no Graphite expert and I hope Sebastian [CC] can help.

> And in the context of possibly rewriting bits and pieces of Polly,
> thus making it modular and making its exact analyses available to
> "traditional" optimisation passes such as, e.g., the vectoriser: is
> there data that shows this really improves performance? Is there,
> e.g., data available that shows the performance uplift of the
> vectoriser using the exact polyhedral data dependence analysis over
> the traditional data dependence analysis?

Afaik, there is only the (rather inconclusive) GSoC project from 2016. I'll try to generate numbers soon, but first for the value analysis, as the separate dependence analysis is not ready.

It is also worth noting that the current LLVM dependence analysis is apparently broken and incomplete. Here is part of the exchange between Gerolf Hoflehner and Sebastian Pop, also in this thread:

On 09/13, Sebastian Pop via llvm-dev wrote:
> On Tue, Sep 12, 2017 at 10:26 PM, Gerolf Hoflehner via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> > Are you saying the LLVM Dependence Analysis is incorrect or do you
> > actually mean less conservative (or "more accurate" or something
> > like that)?
>
> Yes, the LLVM dependence analysis is broken from day one, by design,
> due to a misunderstanding of the meaning of GEPs:
> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130701/179509.html
> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130701/179529.html
> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20130701/179570.html
>
> Loop interchange and any other pass that relies on the current llvm
> dependence analysis may generate wrong code.
> See https://reviews.llvm.org/D35430
>
> Another point: the MIV test in the llvm dependence analysis is not
> implemented, and so the scope of the llvm dependence analysis is
> rather narrow: i.e., it would not be able to solve the loop
> interchange in spec2000/swim.
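To illustrate the GEP problem Sebastian describes, here is a small, hand-written example (my own, not taken from the cited patches):

  // DependenceAnalysis tests subscripts dimension by dimension, which
  // is only sound if every subscript stays inside its dimension. GEP
  // address arithmetic is linearized, and nothing in LLVM IR enforces
  // that bound. All accesses below stay inside the 32x32 allocation,
  // yet the per-dimension subscripts do not:
  void f(double A[32][32]) {
    for (int i = 1; i < 32; ++i)
      for (int j = 0; j < 32; ++j)
        // &A[i - 1][j + 31] == &A[i][j - 1], so the load and the store
        // touch the same element in the same iteration. Comparing the
        // subscript pairs (i, j - 1) and (i - 1, j + 31) per dimension
        // finds no in-bounds solution, i.e., "no dependence", and a
        // pass trusting that answer may reorder these accesses.
        A[i][j - 1] = A[i - 1][j + 31];
  }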
> I am a big fan of the polyhedral model and these are not leading
> questions; I was just wondering if that could help in deciding whether
> to make it modular and slowly integrate it, or to keep it as a
> separate pipeline.

I think these questions are perfectly fine and deserve to be answered! Though it is not that simple to generate the numbers you are interested in, so it will consequently take a little more time.

Cheers,
  Johannes
--
Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : http://www.cdl.uni-saarland.de/people/doerfert
Hal Finkel via llvm-dev
2017-Sep-22 05:09 UTC
[llvm-dev] [RFC] Polly Status and Integration
Hi, Johannes,

Thanks for writing this. I certainly think you have the right idea in terms of the desired end state and modular design.

On 09/19/2017 07:33 PM, Johannes Doerfert wrote:
> Hi Hal, Tobias, Michael, and others,
>
> I'd like to add my view (and a proposal) to this discussion, and I
> apologize directly for doing this so late*. I also want to apologize
> because this email is long and contains various technical details and
> argumentations that might need more justification. However, I am happy
> to provide further information (and/or examples) to explain my views
> if required.
>
> tldr;
> We should introduce polyhedral analysis and optimization capabilities
> into LLVM step by step. Along the way we should revisit design
> decisions made by Polly and adjust them according to the use cases in
> the rest of LLVM. Finally, we should keep Polly as a stand-alone tool
> to allow research into, and prototyping of, complex loop
> transformations.
>
> * There was a paper deadline at the end of last week and I simply had
>   to prioritize.
>
> LLVM performs "simple" loop transformations and targets almost
> exclusively inner-most loops. Polly offers a pipeline to perform
> "complex" polyhedral-based loop optimizations on a sequence of loop
> nests. It seems natural to integrate Polly into the LLVM pipeline as
> it apparently complements it well. However, I do not believe that it
> is wise to integrate Polly into LLVM, for a variety of reasons, some
> of which I will explain in more detail now. Afterwards I will propose
> a different way to achieve (most of) the things Hal mentioned in his
> mail in an incremental way. Nevertheless, I want to stress that I am
> still an advocate of Polly and I honestly believe it to be a valuable
> part of the LLVM environment. Though I strongly feel it should be
> continued as a research tool and test bed to allow research into, and
> prototyping of, (polyhedral-model-based) analyses and optimizations.

While this may end up being the conclusion, I'd find that splitting unfortunate. One of the strengths of LLVM is that, within the framework of a production-quality compiler, we enable new compiler research. This strengthens research not only by allowing new ideas to be surrounded by existing ones in a realistic way, allowing the effects of the new ideas to be isolated convincingly, but also by encouraging new ideas to be developed and tested in a way that mirrors how they might be deployed in practice. This also encourages the research community to be part of the LLVM community, and I believe that's a positive for the community as a whole. The more we isolate the "research tool" from the rest of the ecosystem, the less value the research tool has, and the less the community as a whole benefits from interactions with researchers.

> Polly is a deep pipeline of passes that were developed for the sole
> purpose of applying scheduling transformations at the very end. This
> "main purpose" of Polly makes it hard to effectively reuse
> intermediate parts, e.g., the dependence analysis or the code
> generation framework, at least to the degree we might want to. In a
> different thread [1] Hal asked for the possibility to reuse Polly for
> tuning via pragmas, e.g., unrolling/interleaving/tiling/interchange.
While this is interesting and > generally possible it won't take long before the restrictions Polly > places on the input code (to enable scheduling optimizations in the end) > become a serious problem. The point here is that Polly has too many > features and dependences for such an "easy" use case. (Note that I do > not question the legality of a user-requested transformation here.) If > you are not looking for "end-to-end polyhedral loop optimizations" you > want a system that answers a very specific question (e.g., > isVectorizable) or performs a very specific transformation (e.g., tiling) > on a maximal set of programs. In contrast, Polly is (and I strongly > believe it should be) developed as a system that performs complex > optimizations, based on specialized analysis results, on a constrained > class of programs. > > One might argue that Polly will become more modular when it is integrated > into LLVM and when the intermediate results are used in more and more > places. However, I'd say this is the "wrong direction" with regards to: > 1) incremental design and development, > 2) maintainability of individual parts, and > 3) the development of a suitable cost model (which Polly does not ship). > > Instead of starting with a full, hard-to-understand scheduling pipeline > we should start with the use cases at hand and design specific analysis > (and also transformation) passes "from scratch". The long-term goal > should be a full scheduling pipeline, but the parts need to be designed > in a modular way (the LLVM way) from the very beginning. This allows us to: > - provide __immediate benefit__ to other developers, > - allow active participation in (and understanding of) the design of > each part, and > - develop the intermediate parts with the requirements of the whole > LLVM project in mind. > > Let me give two examples to make my point: > > -- Example 1 -- > In addition to the applicability issues there are other problems that > arise from the full-pipeline design. Let's assume we want to apply a > pragma-driven optimization to a loop that can be represented (and > optimized) by Polly. Depending on the pragma we only need to perform a > very specific task, e.g., adjust the iteration variable (loop inversion) > or introduce new loops and adjust existing iteration variables (tiling). > Polly will however represent (and actually analyze) the whole loop nest > including the exact control flow and all accesses. In addition it > will compute alias checks, constraints under which the representation > is not faithful (e.g., an access function might overflow) and a lot of > other things that are not needed for the task at hand.

I understand your point, but I think that you're assuming too much about the semantics of the pragmas. I believe that our current vectorization pragma clauses demonstrate the philosophy we should follow with other loop pragmas in this sense: We have vectorize(enable) and vectorize(assume_safety). The latter does indeed just vectorize the loop without requiring dependence analysis; the vectorize(enable) clause, by contrast, instructs the vectorizer to vectorize the loop (overriding/loosening some of the cost-model constraints) but still uses the dependence analysis and generates runtime checks.
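For illustration, the two clauses look like this in source (a made-up example; declarations elided):

  // vectorize(assume_safety): the user asserts that there are no unsafe
  // loop-carried memory dependences, so the dependence checks are skipped.
  #pragma clang loop vectorize(assume_safety)
  for (int i = 0; i < n; ++i)
    hist[bucket[i]] += w[i];

  // vectorize(enable): vectorization is requested and the cost model is
  // relaxed, but dependence analysis still runs and runtime alias checks
  // are still emitted when safety cannot be proven statically.
  #pragma clang loop vectorize(enable)
  for (int i = 0; i < n; ++i)
    c[i] = a[i] + b[i];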
Even if we have pragmas to direct the process, by default, we still likely want the full dependence analysis and the framework for generating safety predicates.

> > -- Example 2 -- > We can also imagine we want to determine if a loop can be vectorized, > thus if there are (short) loop-carried dependences. No matter which > technique is used, the complexity of a dependence analysis will > certainly increase with the number (and complexity) of the analyzed > accesses. Since Polly is designed with fine-grained scheduling > optimizations in mind, the dependences it will compute are > disproportionate with regard to the question asked.

Could you explain this in more detail?

> Consequently, it > will either take more time to determine if there are loop-carried > dependences

I think that this is clearly a question of degree. Almost any non-lazy analysis will over-compute for a particular transformation (e.g. computing a dominator tree), but there's engineering value in consistency. The question is just in the cost vs. the engineering complexity in customizing the analysis.

> or a time-out in the computation will cause a conservatively > negative result. > > --- > > While it is generally possible to disable some "features" of Polly or to > "work around them", it requires new independent code paths in an already > highly interleaved project. The explicit (and implicit) interactions > between the different parts of Polly are plentiful and subtle. I don't > think there are 5 people that know about more than half of them. (I also > doubt there is anyone who knows them all.)

This point is well taken. For better or worse, it reminds me of another component we've discussed a lot recently: InstCombine. Both (excluding isl) are also nearly the same size. In this context, documenting these interactions is important (especially if we hope to rewrite large parts of it).

> > --- > > I want to propose an alternative plan for which I would appreciate > feedback. The feedback is especially useful since I will present this, > or more accurately the concrete polyhedral analyses described below, in > the student research competition at the LLVM Dev Meeting in October. > > The idea is simple. We rewrite parts of Polly with two goals in mind: > 1) to provide analysis results and transformation options to LLVM, and > 2) to allow polyhedral loop transformations. > > A rewrite allows us to revisit various design decisions that are buried > deep in the Polly code and that hinder its application (in any context). > The interested reader can find a short list at the end. > > I think we should incrementally introduce polyhedral analysis > capabilities in order to provide immediate benefits and to allow > developers to get involved with the design and usage from an early point > in time. Decoupled analyses also allow easier integration, interfaces > that come "more naturally" to non-polyhedral people, and better use of the > capabilities if polyhedral scheduling is not the goal. > > To demonstrate this incremental process I started to write a polyhedral > value analysis**. Think of it as Scalar Evolution but with piece-wise > defined polyhedral values as the abstract domain. This analysis is > demand-driven, works on the granularity of an LLVM value, is flow > sensitive, and is especially designed to be applicable to partially affine > programs. It already supports most of what the polyhedral model allows > and includes a lot of things we prototyped in Polly. It shall be > used when Scalar Evolution is not able to provide a sufficient result.
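To give a flavor of what a piece-wise defined value buys over a single affine expression, consider a bound whose definition changes at a break point (an illustration, not taken from the proposed analysis; K, M, N are loop-invariant parameters):

  for (long i = 0; i < T; ++i) {
    long ub = (i < K) ? M : N;   // piece-wise affine in i, roughly:
                                 //   [K,M,N] -> { [i] -> M : i < K;
                                 //                [i] -> N : i >= K }
    for (long j = 0; j < ub; ++j)
      A[i][j] = 0.0;             // the access domain is then a
                                 // piece-wise affine set as well
  }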
> On top I implemented a polyhedral memory analysis*** that is (currently) > capable of summarizing the memory effects of a code region, e.g., a > function. The next step is a demand-driven dependence analysis**** that > utilizes the (often) very structured nature of a CFG. Afterwards, I > would like to see a transformation framework that allows classical but > also scheduling-based transformations. > > While these analyses still depend on isl as a back-end, they never > directly use it. Instead there is a C++ interface in-between that > encapsulates the features needed. Once our use cases are clear we can > re-implement this interface. > > ** For comparison, the value analysis covers parts of the ScopDetection, > all of SCEVValidator and SCEVAffinator as well as parts of ScopInfo and > ScopBuilder in Polly. > *** The memory analysis is concerned with the construction and > functionality Polly implements in the MemoryAccess class. > **** The dependence analysis is similar to Polly's DependenceInfo class > but is supposed to allow more fine-grained control.

Is the code for these available? Are you planning to contribute them? We could certainly consider these part of the plan.

> > --- > > These are a few design decisions that I think should be revisited when > we think about polyhedral-based analysis and optimizations in LLVM: > > - We should not rely on Scalar Evolution. Instead we want to build > polyhedral values straight from the underlying IR. While > Scalar Evolution did certainly simplify Polly's development it > induced various problems: > - The lack of piece-wise affine values (other than min/max) > limits applicability and precision. (Though conditional SCEVs > are a small step in this direction.) > - The flow-insensitive value analysis (Scalar Evolution) causes a > lot of "assumptions", thus versioning, for the otherwise flow-sensitive > Polly. > - There are caching issues when parts of the program are > transformed or when Polly is used for inter-procedural > analysis/optimization. > > - We should not depend on single-entry-single-exit (SESE) regions. > Instead we analyze the code optimistically to the extent possible > (or required by the user). Afterwards we can easily specialize the > results to the code regions we are interested in. While SESE regions > always caused compile-time issues (due to the use of the post-dominance > tree) they also limit the analyses/transformations that > can be performed. As an example, think of accesses that are > non-affine with regard to the smallest surrounding SESE region but > not in a smaller code region (e.g., a conditional). If the goal is > not loop scheduling but, for example, instruction reordering, it > is still valuable to determine the dependences between these > accesses in the smaller code region. > > - We should iteratively and selectively construct polyhedral > representations. The same holds for analysis results, e.g., > dependences. Let's reconsider example 2 from above. We want to visit > one memory access at a time in order to build the polyhedral > representation and compute the new dependences. We also want to > immediately stop if we identify a loop-carried dependence. Finally, > we do not want to compute dependences in outer loop iterations or > dependences that are completely contained in one loop iteration.

If we're only considering inner-loop vectorization, then yes, we don't need outer-loop dependencies. If I understand what you mean, we actually might want intra-iteration dependencies for cases where reordering might allow for vectorization.
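A hand-made illustration of that last point: swapping two statements turns a backward loop-carried dependence into a forward one and makes the loop vectorizable, and proving the swap legal is exactly a question about the intra-iteration dependences between the two statements:

  for (long i = 1; i < n; ++i) {
    b[i] = a[i - 1] + 1.0;  // S1: reads the value S2 wrote last iteration
    a[i] = c[i] * 2.0;      // S2
  }
  // S2(i-1) -> S1(i) is a backward dependence in this statement order,
  // so straightforwardly vectorizing the body is illegal. Since S1 and
  // S2 touch disjoint memory within one iteration, they can be swapped,
  // after which the dependence is forward and the loop vectorizes.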
> > - We should decouple the code generation part completely from the > transformation part.

This separation sounds useful, but not necessarily to enable us to write a bunch of separate loop transformations. We may want to split things into canonicalization vs. lowering transformations, however.

> Additionally we might want to start with > classical loop transformations instead of full polyhedral > scheduling. For a lot of classical loop transformations code > generation only needs to understand/change loop induction variables > and exit conditions. Duplication + rewriting of the loop body is > often not necessary. Pragma-driven optimizations could be easily > done and we could also use heuristics similar to the ones we have > today (e.g., for loop distribution). The transformations we allow > should then grow with the scheduling decisions we allow and these > should be limited by a yet-to-be-determined cost model.

Much of this sounds potentially good, but it's not clear to me that any of this is really an argument against integrating Polly into LLVM right now. The components you propose should be usable for both separate and integrated transformations and analysis, and at least in the short term, we'll end up needing isl regardless. Is your skepticism mainly about whether the current Polly could be transitioned to use the new infrastructure (without essentially rewriting all of it)? If so, I'd like to understand why. Thanks again, Hal

> > --- > > I am aware that not all Polly developers will agree with my views and > that some are simply trade-offs that do not offer a "correct way". > Nevertheless, I hope that the simple examples I used to point out > various problems are enough reason to consider the route I proposed. > > Cheers, > Johannes > > [1] https://groups.google.com/d/msg/polly-dev/ACSRlFqtpDE/t5EDlIwtAgAJ > > On 09/01, Hal Finkel via llvm-dev wrote: >> Hi everyone, As you may know, stock LLVM does not provide the kind of >> ... >> >> -- >> Hal Finkel >> Lead, Compiler Technology and Programming Languages >> Leadership Computing Facility >> Argonne National Laboratory >

--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
Tobias Grosser via llvm-dev
2017-Sep-26 06:18 UTC
[llvm-dev] [RFC] Polly Status and Integration
On Wed, Sep 20, 2017, at 02:33, Johannes Doerfert wrote:
> Hi Hal, Tobias, Michael, and others,

Hi Johannes, thanks for joining the discussion.

> I'd like to add my view (and a proposal) to this discussion and I > apologize right away for doing this so late*. I also want to apologize > because this email is long, contains various technical details and also > arguments that might need more justification. However, I am happy > to provide further information (and/or examples) to explain my views if > required. > > tldr; > We should introduce polyhedral analysis and optimization capabilities > into LLVM step by step. Along the way we should revisit design decisions > made by Polly and adjust them according to the use cases in the rest of > LLVM. Finally, we should keep Polly as a stand-alone tool to allow > research into, and prototyping of, complex loop transformations.

I very much agree with you that many design decisions will be changed in the future. Many of the targets you mentioned are indeed goals that have been discussed as part of the Polly development agenda and are areas I very much would like to push forward. You also make a very valid point that we should involve the wider LLVM community in these discussions and, at best, in the overall development of polyhedral optimizations as a whole. Hence, it seems clear that we want to keep as much of the polyhedral development as possible in core LLVM. I have always been conservative about pushing prototype stuff into core LLVM, which is why we spent multiple years developing very solid foundations in both Polly and isl. While I agree that many design decisions need to be re-evaluated, as we constantly do in LLVM, and especially in the light of advances we pushed into isl, I think doing this jointly with the full LLVM community is the way to go. I also don't see your ideas as conflicting with the integration of Polly. We can always move Polly now and replace parts of it with new and better stuff as it becomes available. In fact, Polly can serve as a test bed to provide test coverage for such new code, and the other PollyLabs people and I are certainly happy to help.

> * There was a paper deadline end of last week and I simply had to prioritize.

Good luck!

> Polly is a deep pipeline of passes that were developed for the sole > purpose of applying scheduling transformations at the very end. This > "main purpose" of Polly makes it hard to effectively reuse intermediate > parts, e.g. the dependence analysis or the code generation framework, at > least to the degree we might want to. In a different thread [1] Hal > asked for the possibility to reuse Polly for tuning via pragmas e.g., > unrolling/interleaving/tiling/interchange. While this is interesting and > generally possible it won't take long before the restrictions Polly > places on the input code (to enable scheduling optimizations in the end) > become a serious problem. The point here is that Polly has too many > features and dependences for such an "easy" use case. (Note that I do > not question the legality of a user-requested transformation here.) If > you are not looking for "end-to-end polyhedral loop optimizations" you > want a system that answers a very specific question (e.g., > isVectorizable) or performs a very specific transformation (e.g., tiling) > on a maximal set of programs.
In contrast, Polly is (and I strongly > believe it should be) developed as a system that performs complex > optimizations, based on specialized analysis results, on a constrained > class of programs.

While it is right that Polly has been developed as an end-to-end polyhedral optimizer, it certainly has not been developed only with end-to-end loop optimizations in mind. In fact, many of the foundations that are today useful to build a polyhedral-based range analysis on top of isl have been developed together with Polly and with Polly as the use case (e.g., to build the optimistic assumption model we use) to enable a wide range of optimizations and analyses. Seeing more of them popping up now is indeed very nice.

> One might argue that Polly will become more modular when it is integrated > into LLVM and when the intermediate results are used in more and more > places. However, I'd say this is the "wrong direction" with regards to: > 1) incremental design and development, > 2) maintainability of individual parts, and > 3) the development of a suitable cost model (which Polly does not ship).

Polly has been developed from the very first patch in the open, was discussed on the LLVM mailing lists from day one (and started from such a discussion), and quickly became a publicly visible LLVM project. I would almost claim it is the perfect example of incremental development. When we started, Polly was not part of core LLVM due to license reasons and also due to a lack of mature foundations. Over the last years we replaced, or wrote new versions of, all essential components so that a move to core LLVM became possible. Still, I get your point. It certainly makes sense to, e.g., extract components of Polly to build a polyhedral range analysis and provide it to other developers. Thanks for already providing a patch that illustrates your ideas! Siddharth already commented and I would also like to have a look. I also feel that Mehdi would be an excellent person to give you feedback on this, as he has significant expertise with PIPS, which uses abstract interpretation very similar to how one would use it for a more general polyhedral range analysis. However, I would decouple these two goals. We could move Polly and isl in place, and then move parts of Polly to the range analysis you propose -- or start copying and redeveloping parts of Polly onto the new foundations you propose. Later we can replace more parts. In general, there has never been an issue with temporarily having some redundancy in LLVM, and I don't think as part of this discussion we need to lay out the full future of polyhedral analysis and transformations in LLVM. From my perspective, we should mainly answer two questions:

1) "Do we want polyhedral analysis as part of core LLVM?"
2) "Do we want to keep significant parts of the polyhedral development outside of core LLVM?"

My feeling is that there are several groups interested in 1), which gives me confidence that we will -- eventually -- have polyhedral stuff in LLVM. Hal gave pretty good arguments for 2) and I also believe strongly that we should break the boundaries between polyhedral and core LLVM people.

> Let me give two examples to make my point: > > -- Example 1 -- > In addition to the applicability issues there are other problems that > arise from the full-pipeline design. Let's assume we want to apply a > pragma-driven optimization to a loop that can be represented (and > optimized) by Polly.
Depending on the pragma we only need to perform a > very specific task, e.g., adjust the iteration variable (loop inversion) > or introduce new loops and adjust existing iteration variables (tiling). > Polly will however represent (and actually analyze) the whole loop nest > including the exact control flow and all accesses. In addition it > will compute alias checks, constraints under which the representation > is not faithful (e.g., an access function might overflow) and a lot of > other things that are not needed for the task at hand.

Sure. Incrementally building dependences certainly makes sense in some cases!

> -- Example 2 -- > We can also imagine we want to determine if a loop can be vectorized, > thus if there are (short) loop-carried dependences. No matter which > technique is used, the complexity of a dependence analysis will > certainly increase with the number (and complexity) of the analyzed > accesses. Since Polly is designed with fine-grained scheduling > optimizations in mind, the dependences it will compute are > disproportionate with regard to the question asked. Consequently, it > will either take more time to determine if there are loop-carried > dependences or a time-out in the computation will cause a conservatively > negative result.

Good idea!

> While it is generally possible to disable some "features" of Polly or to > "work around them", it requires new independent code paths in an already > highly interleaved project. The explicit (and implicit) interactions > between the different parts of Polly are plentiful and subtle. I don't > think there are 5 people that know about more than half of them. (I also > doubt there is anyone who knows them all.) > > I want to propose an alternative plan for which I would appreciate > feedback. The feedback is especially useful since I will present this, > or more accurately the concrete polyhedral analyses described below, in > the student research competition at the LLVM Dev Meeting in October.

Nice. Looking forward to your talk!

> The idea is simple. We rewrite parts of Polly with two goals in mind: > 1) to provide analysis results and transformation options to LLVM, and > 2) to allow polyhedral loop transformations.

Sure. Improved granularity is certainly a good thing to have. As stated in the initial proposal, I am glad to see Polly being ripped apart. You already started!

> A rewrite allows us to revisit various design decisions that are buried > deep in the Polly code and that hinder its application (in any context). > The interested reader can find a short list at the end. > I think we should incrementally introduce polyhedral analysis > capabilities in order to provide immediate benefits and to allow > developers to get involved with the design and usage from an early point > in time. Decoupled analyses also allow easier integration, interfaces > that come "more naturally" to non-polyhedral people, and better use of the > capabilities if polyhedral scheduling is not the goal.

Many of these ideas sound very nice. I am certain you will provide some good interface ideas.

> To demonstrate this incremental process I started to write a polyhedral > value analysis**. Think of it as Scalar Evolution but with piece-wise > defined polyhedral values as the abstract domain. This analysis is > demand-driven, works on the granularity of an LLVM value, is flow > sensitive, and is especially designed to be applicable to partially affine > programs. It already supports most of what the polyhedral model allows > and includes a lot of things we prototyped in Polly.
It shall be > used when Scalar Evolution is not able to provide a sufficient result.

Nice! Clearly a step in the right direction.

> On top I implemented a polyhedral memory analysis*** that is (currently) > capable of summarizing the memory effects of a code region, e.g., a > function. The next step is a demand-driven dependence analysis**** that > utilizes the (often) very structured nature of a CFG. Afterwards, I > would like to see a transformation framework that allows classical but > also scheduling-based transformations.

> While these analyses still depend on isl as a back-end, they never > directly use it. Instead there is a C++ interface in-between that > encapsulates the features needed. Once our use cases are clear we can > re-implement this interface.

Interesting. This seems to complement our efforts in providing a C++ interface to isl, but with slightly different goals. I still feel we should use isl++ instead of plain isl within your interface.

> These are a few design decisions that I think should be revisited when > we think about polyhedral-based analysis and optimizations in LLVM: > > - We should not rely on Scalar Evolution. Instead we want to build > polyhedral values straight from the underlying IR. While > Scalar Evolution did certainly simplify Polly's development it > induced various problems: > - The lack of piece-wise affine values (other than min/max) > limits applicability and precision. (Though conditional SCEVs > are a small step in this direction.) > - The flow-insensitive value analysis (Scalar Evolution) causes a > lot of "assumptions", thus versioning, for the otherwise flow-sensitive > Polly. > - There are caching issues when parts of the program are > transformed or when Polly is used for inter-procedural > analysis/optimization.

I agree, this is certainly a good direction.

> - We should not depend on single-entry-single-exit (SESE) regions. > Instead we analyze the code optimistically to the extent possible > (or required by the user). Afterwards we can easily specialize the > results to the code regions we are interested in. While SESE regions > always caused compile-time issues (due to the use of the post-dominance > tree) they also limit the analyses/transformations that > can be performed. As an example, think of accesses that are > non-affine with regard to the smallest surrounding SESE region but > not in a smaller code region (e.g., a conditional). If the goal is > not loop scheduling but, for example, instruction reordering, it > is still valuable to determine the dependences between these > accesses in the smaller code region.

SESE regions have (to my observation) never caused compile-time issues. However, several people, including Chandler, have raised concerns several times, and SESE regions indeed do not fit well when, e.g., mixed with the vectorizer. Hence, I agree that our ultimate polyhedral optimizer in LLVM should likely not use SESE regions. In practice, Polly works reasonably well with regions for now.

> - We should iteratively and selectively construct polyhedral > representations. The same holds for analysis results, e.g., > dependences. Let's reconsider example 2 from above. We want to visit > one memory access at a time in order to build the polyhedral > representation and compute the new dependences. We also want to > immediately stop if we identify a loop-carried dependence. Finally, > we do not want to compute dependences in outer loop iterations or > dependences that are completely contained in one loop iteration.

Interesting.
For certain use cases this is clearly the right approach. For full-fledged scheduling this may not work.

> - We should decouple the code generation part completely from the > transformation part. Additionally we might want to start with > classical loop transformations instead of full polyhedral > scheduling. For a lot of classical loop transformations code > generation only needs to understand/change loop induction variables > and exit conditions. Duplication + rewriting of the loop body is > often not necessary. Pragma-driven optimizations could be easily > done and we could also use heuristics similar to the ones we have > today (e.g., for loop distribution). The transformations we allow > should then grow with the scheduling decisions we allow and these > should be limited by a yet-to-be-determined cost model.

Sure, better encapsulation is always good. How to do this exactly will be an interesting discussion!

> I am aware that not all Polly developers will agree with my views and > that some are simply trade-offs that do not offer a "correct way". > Nevertheless, I hope that the simple examples I used to point out > various problems are enough reason to consider the route I proposed.

I briefly looked at your patch for the polyhedral range analysis. While I would have preferred an initial design discussion much earlier, and your patch needs more comments and especially a design description, it certainly is a move in the right direction! We may also split it into smaller pieces to make review easier. In fact, for some reason this makes me think of the RISC-V upstreaming. Not that we should go at its speed, but it has been a perfect example of how to upstream a new backend -- with a lot of feedback from the community. This seems to be very much aligned with your goals. Overall, I clearly agree with your proposal of a polyhedral range analysis and will be glad to provide more review comments. Your development plans for the future are overall very much aligned with what I would like to see for polyhedral transformations in LLVM. We already discussed some of them over the last years, but you clearly also bring some nice "revolutionary" ideas into the game. I am really excited to see how these will evolve! I am also glad to see you more active on the mailing list again!

Best,
Tobias
Michael Kruse via llvm-dev
2017-Oct-11 07:07 UTC
[llvm-dev] [RFC] Polly Status and Integration
Dear LLVM community,

when working on larger projects, probably everybody has come up with ideas on how they would make things different if they started from scratch. For instance, in LLVM, modeling PHI nodes as in Swift's intermediate representation might be such an idea, but it requires huge changes in the code base. As a contributor to Polly, I am no exception. I therefore would like to share my thoughts about what the best possible integration of a polyhedral optimizer/Polly could look like, to discuss which end result we might strive for. I am a proponent of polyhedral compilation and would like to see it used on a daily basis by just choosing an optimization level.

IMHO we should avoid multiple loop optimizations in LLVM. If each of them does its own loop versioning, we will get multiple redundant runtime checks and code duplication. This is already the case with LoopVersioningLICM, LoopDistribution, LoopVectorizer and LoopLoadElimination. I argue for a more integrated approach for loop-based optimizations.

To clarify, what this is not:
- It has not been shown to work in practice.
- While I would like to contribute to implementing such a loop optimizer, this is not something I could do alone.
- This is no direct replacement for Polly or VPlan, but rather a long-term vision which also includes VPlan/Polly.

To have a comparison, I first try to sketch the approaches taken by the LoopVectorizer, VPlan and Polly, to then present the "Loop Optimization Framework". First only the design, and afterwards the rationales for the design decisions. To be able to show some examples, I'll refer to this snippet we intend to optimize:

  for (x = 0; x < sizes->nx; x += 1)
    for (y = 0; y < sizes->ny; y += 1)
      C[x][y] = 0;

  for (x = 0; x < sizes->nx; x += 1) {
    tmp = A[x];            /* hoisted load by LICM or GVN */
    int ny = sizes->ny;    /* dynamic load of size parameter */
    for (y = 0; y < sizes->ny; y += 1) {
      if (x != y)          /* affine conditional */
        C[x][y] = tmp * B[y];
      if (C[x][y] > 0)     /* data-dependent conditional */
        C[x][y] = sqrt(C[x][y]);
      if (verbose)         /* non-performance-critical case */
        printf("Some output");
    }
  }

LoopVectorizer (+VPlan)
-----------------------

This part I know the least; please correct me if I state something wrong. Details on VPlan were taken from [1] and its talk, and the code already committed to LLVM.

A) Preprocessing
LCSSA-form and loop-simplification.

B) Scope of Analysis
Innermost loop, single body BB (or if-convertible to a single BB).
VPlan: Acyclic control flow with SESE (single entry/exit block or edge?). Extensions for outer-loop vectorization (cyclic control flow) to be presented at the dev meeting: [2]

C) Control-Flow Modeling
Not needed; it's always just a single loop.
VPlan: Hierarchical CFG of VPBasicBlocks and VPRegionBlocks.

D) Access Modeling (LoopAccessInfo/LoopAccessAnalysis/AccessAnalysis)
Uses AliasSetTracker and analyzes each set independently. No PHI nodes (apart from the induction variable) allowed (there seems to be some support for reductions).

E) Dependency Analysis (MemoryDepChecker)
Leaks only the DependenceType and the maximum vectorization factor to the outside.

F) Assumption Tracking (PredicatedScalarEvolution/LoopVectorizationRequirements)
Assumptions can be added to PredicatedScalarEvolution to make SCEV modeling possible/simpler/exact. Used for:
- Assume exact BackedgeTakenCount
- Assume no overflow
Aliasing is tracked by AccessAnalysis.

G) Transformation
One explicit transformation: either vectorization is possible, or nothing is done at all.
VPlan: Modifying VPlans (widening, etc.) and picking the best according to the cost model.

H) Code Generation (InnerLoopVectorizer/LoopVersioning)
Create a new vectorized loop (without a prologue, to get aligned accesses) and use the old loop as the epilogue and as the fallback if the assumptions are not met.
VPlan: Generate instructions from the recipes in the VPlan.

I) Pass Manager Integration
Depends on the analyses LoopInfo and LoopAccessAnalysis. Otherwise LoopVectorize is self-contained and instantiates objects itself (e.g., LoopVersioning).

Polly Architecture
------------------

A) Preprocessing
There is no considerable preprocessing (anymore). The only modification done is inserting an entry block containing only allocas, to which Polly's IR generator can add its own allocas. The split-off BB can then be analyzed as well.

B) Scope of Analysis (ScopDetection/ScopInfo)
Search for the largest compatible llvm::Regions, called SCoPs. Compatibility requirements are (among others):
- Single Entry Single Exit (SESE)
- Base pointers invariant in the SCoP
- Affine loop bounds
- Only multi-dimensional affine array accesses (by default)
- No function calls (except side-effect-free functions)
Regions are expanded in order to include multiple top-level loops in one region. In the example, both outermost loops are included in the same SCoP region. LoadInst results that are used for loop bounds and array subscripts are assumed to be invariant during the execution of the SCoP. This allows the use of `sizes->ny` as a loop bound, even though the LoadInst could return a different value at every iteration of the outer loop (invariant load hoisting). Conditions that lead to function calls are assumed to be false. In the example, this makes it possible to remove the call to printf by adding the assumption `!verbose`.

C) Control-Flow Modeling (Schedule Tree)
A schedule tree is basically a tree of band nodes (collections of perfectly nested loops with a defined order) and sequence nodes (child nodes are executed in sequence). Affine conditionals are implemented as filters over the loop-variable values they are executed on. Example (simplified):

  Root
  |
  +- Band(Loops: x, y)
  |  |
  |  +- C[x][y] = 0;
  |
  +- Band(Loops: x)
     |
     +- tmp = A[x];  /* statement */
     |
     +- Band(Loops: y)
        |
        +- Filter(x + -y > 0 || x + -y < 0)  /* affine conditional */
        |  |
        |  +- C[x][y] += tmp * B[y];  /* statement */
        |
        +- if (C[x][y] > 0) C[x][y] = sqrt(C[x][y]);
           /* data-dependent conditional as macro-statement (llvm::Region) */

A statement execution for fixed x and y is called a 'statement instance'. Schedule trees are part of the Integer Set Library (isl) [3], which is also used to represent polyhedra in general.

D) Access Modeling (polly::MemoryAccess)
Every load/store has a base pointer and multi-dimensional indices. Indices are derived from GetElementPtr and/or using ScalarEvolution and then converted to piecewise-affine functions (isl_pw_aff). Because in C multi-dimensional accesses decay to flat subscripts (C[x*sizes->ny + y]), delinearization tries to recover the multi-dimensional subscripts. If this does not succeed, the option -polly-allow-nonaffine interprets them as touching the entire array. Assumptions are added to ensure that the accesses are within the derived dimension sizes. llvm::Instructions are modeled as zero-dimensional arrays (or "scalars") if there is a use outside of the statement instance they are defined in.

E) Dependency Analysis (DependenceInfo)
When passed the accesses for each statement, isl computes the dependencies (flow, anti and output) as a relational polyhedron.
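To make the versioning scheme concrete, this is roughly the shape of what gets emitted for a simple loop; a hand-written illustration, not actual compiler output:

  void axpy(float *x, float *y, long n) {
    long i = 0;
    if (x + n <= y || y + n <= x) {   /* runtime alias check (assumption) */
      for (; i + 4 <= n; i += 4) {    /* vectorized main loop, VF = 4 */
        /* conceptually a single <4 x float> add */
        y[i]   += x[i];   y[i+1] += x[i+1];
        y[i+2] += x[i+2]; y[i+3] += x[i+3];
      }
    }
    for (; i < n; ++i)                /* scalar epilogue; also the   */
      y[i] += x[i];                   /* fallback if the check fails */
  }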
That is, the exact dependencies between all statement instances. Example:

  { [x,y-1] -> [x,y] if x != y; [x,y-1] -> [x,y+1] if x == y }

describes the dependences of the statement guarded by the affine conditional on the next instance of itself.

F) Assumption Tracking (ScopInfo)
Remembers the assumptions mentioned before. There are four kinds of assumptions:
- Valid value ranges, not requiring a runtime check (e.g., from __builtin_assume)
- Value ranges that must be checked at runtime (e.g., the range of nx, ny such that all array subscripts are within bounds)
- Value ranges that are forbidden and must be checked at runtime (e.g., `verbose` must not be true)
- Non-aliasing assumptions, checked at runtime
Some violated assumptions may lead to a bail-out, e.g., if the pointer of an assumed invariant load is written to.

G) Transformations (ForwardOpTree/DeLICM/Simplify/ScheduleOptimizer)
To improve schedulability, ForwardOpTree and DeLICM modify the instruction lists of statements (not the IR itself). For instance, ForwardOpTree propagates the definition of tmp (a LoadInst) to the statements that use it. In the example this makes the x and y loops perfectly nested such that they can be tiled, and it removes the false dependences of tmp on itself (a "scalar dependency"). isl then computes a new schedule tree out of the old schedule tree and the dependence analysis. By default, it uses a modified PLuTo [4] algorithm. Its goals are:
- Separate independent statements into separate bands
- Reduce the distance between dependencies to optimize temporal locality
- Find parallel loops
On the transformed schedule tree, Polly walks the tree to find additional optimization opportunities:
- Tile bands that are 'permutable' and have at least 2 loops
- Optionally vectorize the first innermost 'coincident' loop
- Detect matrix-multiplication access patterns and apply a special optimization [5]

H) Code Generation (IslAstInfo/CodeGeneration/PPCGCodeGeneration)
isl can transform a schedule tree representation into an AST representation, with loops and if-conditions. CodeGeneration versions the original SCoP region using the tracked assumptions and emits the AST representation. For a statement in the AST, it copies the instructions from the instruction list. For parallel loops, it can generate OpenMP. PPCGCodeGeneration generates code for GPU accelerators.

I) Pass Manager Integration
ScopDetection is a function pass; all following passes are ScopPasses (RegionPasses in disguise):
- ScopInfo (analysis)
- JScopImporter (SCoP transformation)
- Simplify (SCoP transformation)
- ForwardOpTree (SCoP transformation)
- DeLICM (SCoP transformation)
- DependenceInfo (SCoP analysis)
- ScheduleOptimizer (SCoP transformation)
- IslAstInfo (SCoP analysis)
- CodeGeneration (IR transformation)

Proposed Loop Optimization Framework
------------------------------------

Only the architecture is outlined here. The rationales for the design decisions can be found in the "rationales" section below.

A) Preprocessing
Make a copy of the entire function so that transformations can be applied merely for the sake of analysis. For instance (and not necessarily in the preprocessing):
- Invariant load hoisting (move the load of sizes->ny before the loop)
- Move operations closer to the instructions that use them (move tmp = A[x] to the instructions that use it)
- Split basic blocks to allow loop distribution
The function copy will be discarded when no loop transformation is applied. This step can be made optional, to be active only at higher optimization settings.
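A minimal sketch of that scratch-copy step, using LLVM's existing cloning utility; the surrounding control flow and the anyLoopTransformed flag are invented for illustration:

  #include "llvm/IR/Function.h"
  #include "llvm/Transforms/Utils/Cloning.h"

  // Clone F, run the analysis-only canonicalizations on the clone, and
  // throw the clone away if no transformation ends up being applied.
  llvm::ValueToValueMapTy VMap;
  llvm::Function *Scratch = llvm::CloneFunction(&F, VMap);
  // ... hoist invariant loads, split blocks, etc. on Scratch ...
  if (!anyLoopTransformed)      // hypothetical outcome flag
    Scratch->eraseFromParent(); // discard the scratch copy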
B) Scope of Analysis
Always entire functions. Some loops/blocks might be marked as non-transformable.

C) Control-Flow Modeling
An AST-like tree based on the LoopInfo tree. Statements are pointers to basic blocks in the original IR. Any conditionals are modeled as data dependencies. Example:

  +- Loop 0 <= x < sizes->nx
  |  |
  |  +- Loop 0 <= y < sizes->ny
  |     |
  |     +- C[x][y] = 0;  [condition: true]
  |
  +- Loop 0 <= x < sizes->nx
     |
     +- Loop 0 <= y < sizes->ny
        |
        +- C[x][y] += A[x] * B[y];   [condition: true]
        |
        +- C[x][y] = sqrt(C[x][y]);  [condition: C[x][y] > 0]
        |
        +- printf("Some output");    [condition: verbose]

The tree is mutable by transformations; IR will be generated from it at the end. Loop bounds and affine conditions can also optionally be expressed as isl_sets. Statements have properties:
- The conditions under which they are executed
- Is it if-convertible?
- Can it be executed even where in the original IR it would not be executed?
- If executed at least once, does executing it multiple times have any additional effect?
- Hotness information from PGO
- #pragma annotations from clang
- Multi-thread effects (e.g. atomics/fences)
- ...

D) Access Modeling
Each load/store/... is associated with:
- The base pointer
- For each of the array's dimensions, the subscript, either as a SCEV or as an affine function (isl_pw_aff)
Frontends might help recover the multi-dimensional structure of an array by using an intrinsic/builtin instead of linearizing array subscripts (e.g., row-major).

E) Dependency Analysis
A collection of dependencies between two statements, say S and T, where S must be executed before T:
- The type of dependency: Register, Control, Flow, Anti, Output
- Which instances depend on each other. This can be:
  - No instance information. All instances of S must be executed before any instance of T.
  - A vector of dependence distances. Basically MemoryDepChecker for each surrounding loop.
  - A polyhedral relational map (isl_map) describing which instances of T depend on which instances of S.
The dependency type "register" means that T uses an llvm::Value defined by S. This can also be modeled using flow and anti-dependencies, but as shown before this can be destructive to transformation opportunities. Transformations need to specially consider register dependencies and avoid requiring expansion. The dependency type "control" models order induced by control conditions. E.g., the data-dependent conditional 'C[x][y] > 0' can only be evaluated after 'C[x][y]' has been written.

F) Assumption Tracking
A public interface that loop analyses and transformations can use to register their assumptions, to be checked at runtime. Assumptions are for specific loops, so they only need to be checked if the loop is actually transformed. Assumptions can also be queried; e.g., knowing that a variable always has a specific value may allow some simplifications. This knowledge could also be fed from __builtin_assume and profiling data.

G) Transformation
In the style of InstCombine, visit the tree and modify it (and only the tree, not the IR). Transformations can include:
- Add assumptions to make parts more transformable (e.g. assume verbose==false to eliminate the call to printf)
- Combine single loops into bands
- Apply #pragmas on the loop/band they are associated with
- Vectorize loops (VPlan)
- Tiling
- Unrolling
- Loop unswitching
- Localize working sets of inner tiles; this can be a generalization of Polly's matrix-multiplication optimization
- (Partial) unrolling of loops
- Specializing loops for specific inputs (loop unswitching)
- Mixed-LP scheduling
- OpenMP-parallelize bands
- Extract accelerator kernels
- Detect loop idioms (memset, memcpy, zgemm, etc.) to call library functions instead
- ...
Transformations are responsible for not violating dependencies. The current Polly would become one of the transformations. It finds the largest loop body where all size parameters are invariant and translates the subtree to an isl schedule tree. isl would transform it and generate an isl_ast, and Polly translates that into a new loop tree. Much of the work done by Polly today would be provided by the framework. Example transformation output:

  +- memset(C, '\0', sizes->nx*sizes->ny*sizeof(C[0][0]));
  |
  +- Loop 0 <= x < sizes->nx  [#pragma omp parallel for] [condition: !verbose]
  |  [assumption: C[0..sizes->nx-1][0..sizes->ny-1], A[0..sizes->nx-1], B[0..sizes->ny-1] not aliasing]
  |  |
  |  +- Loop 0 <= yv < sizes->ny step 4
  |  |  |
  |  |  +- Loop yv <= y < yv+4  [vectorize VF=4]
  |  |     |
  |  |     +- C[x][y] += A[x] * B[y];
  |  |
  |  +- Loop 4*(sizes->ny/4) <= y < sizes->ny
  |  |  |
  |  |  +- C[x][y] += A[x] * B[y];
  |  |
  |  +- Loop 0 <= y < sizes->ny
  |     |
  |     +- C[x][y] = sqrt(C[x][y]);  [condition: C[x][y] > 0]
  |
  +- Loop 0 <= x < sizes->nx  [condition: verbose]
     |
     +- Loop 0 <= y < sizes->ny
        |
        +- C[x][y] += A[x] * B[y];   [condition: true]
        |
        +- C[x][y] = sqrt(C[x][y]);  [condition: C[x][y] > 0]
        |
        +- printf("Some output");    [condition: true]

H) Code Generation
Determine which part of the tree has been modified. Outermost parts that have not been transformed do not need to be regenerated. The transformed part of the tree is generated into a separate function. Runtime checks for the collected assumptions are generated and, depending on their result, either the optimized function or the previously outlined function is called. The optimized function can be further optimized using LLVM's usual passes, and the inliner decides according to its heuristics whether it is worth inlining into the caller.

I) Pass Manager Integration
A monolithic loop transformation pass which creates the loop tree and provides the dependency analysis and the assumption tracker. It then invokes the separate transformations, in the style of InstCombine. The final tree is code-generated before control is passed back to the (old or new) pass manager.

Design Decision Rationales
--------------------------

A) Preprocessing
Polly's policy at the moment is to not modify the IR until code generation (to not interfere with the current pipeline). For invariant load hoisting and avoiding scalar dependencies it only virtually moves/copies instructions in its own data structures. This adds complexity as these representations get out of sync. For instance, the DominatorTree does not reflect the new locations of instructions, the CFG is modified by the CodeGenerator for versioning, PHI-node incoming lists do not correspond to the predecessor blocks anymore, instructions cannot be propagated across loop iterations as we don't know which iteration they were originally in, etc. If possible, I'd like to avoid this complexity. It approaches being an alternative representation of the current IR, so why not use the IR? VPlan has its own IR-like representation consisting of ingredients and recipes. For the vectorizer it might be justifiable because it does not change control flow.
Some transformations cope well with scalar dependencies (e.g. vectorization: there's a copy of each scalar anyway, and thread-parallelization would have them privatized), but in the polyhedral model, analysis takes place before transformation decisions. However, today's polyhedral schedulers (those based on Feautrier's or PLuTo's) do not have special handling for scalar dependencies. I am positive that we can have polyhedral schedulers that handle scalars without getting worse results (by avoiding putting value definitions into different loops than their uses, and by the ability to execute a definition multiple times), but this is research where such a framework could be very useful. Most of today's polyhedral compilers used by researchers are source-to-source, where the problem is much less pronounced. We might alternatively change the meaning of IR canonicalization in LLVM: pre-loop canonicalization would try to create self-contained blocks, where post-loop transformations do hoisting and sinking. To illustrate the issue with scalar dependencies:

  for (int i = 0; i < N; i += 1) {
    double s = ...;
    ... = s;
  }

Since in Polly s is modeled as a zero-dimensional array, there is just one memory location for it. Before processing the next iteration, the current iteration must have completed, before s is overwritten. If virtual registers are not modeled as having a memory location, a polyhedral scheduler might be inclined to do this (because, e.g., splitting those up makes one of the loops vectorizable):

  double s[N];
  for (int i = 0; i < N; i += 1)
    s[i] = ...;
  for (int i = 0; i < N; i += 1)
    ... = s[i];

that is, requiring an unbounded amount of additional memory. That's probably not something we want.

B) Scope of Analysis
Modeling entire functions has the advantage that the analysis does not have to choose a unit of transformation (Polly: SCoPs, LoopVectorizer: a loop); instead, the transformations decide themselves what they are modifying. I'd also like to synchronize the unit of modeling. Polly's SCoPs are based on SESE regions. llvm::Region and llvm::Loop share no hierarchy; regions and loops may partially overlap. It is possible to represent multiple-exit loops in the polyhedral model.

C) Control-Flow Modeling
LoopInfo, Polly's schedule tree, and VPlan's hierarchical CFG all use a tree-like representation of control flow. I hope to find a common one. Using execution conditions 'flattens' the loop body. Conditions can be whether a previous statement has been executed or not (control dependence), so that the original CFG can be reconstructed. Affine conditions can be directly represented in the polyhedral model; for others, 'always executing' is an overapproximation, and it is checked at runtime whether the statement is really executed. This allows more flexibility than Polly's SESE-subregion macro-statement approach. Whether a condition is affine or not also has no influence on other transformations. VPlan also uses SESE in its CFG representation, but I think it can also make use of this model.

D) Access Modeling
There is a common interface for both representations. Modeling as an affine function is required for the polyhedral model and is more powerful, but it assumes arbitrary-precision integers, so additional assumptions might need to be added. The SCEV representation can (when linear/affine) be converted to an isl_pw_aff when needed, or the isl_pw_aff can be computed directly, as Johannes suggests. However, I did not observe that SCEV was the limiting part in Polly, and it could be extended to support conditional operators as well.
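A rough sketch of what such a dual representation could look like; every name below is invented for illustration and none of it is actual Polly or LLVM API:

  #include "llvm/ADT/Optional.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Analysis/ScalarEvolution.h"
  #include "isl-noexceptions.h" // isl's C++ bindings, as shipped with Polly

  // One subscript per array dimension; the affine form is only
  // materialized when the subscript actually is (piece-wise) affine.
  struct Subscript {
    const llvm::SCEV *AsSCEV;             // always available
    llvm::Optional<isl::pw_aff> AsAffine; // built on demand
  };

  struct ArrayAccess {
    llvm::Value *BasePtr;                 // invariant base pointer
    llvm::SmallVector<Subscript, 4> Dims; // outermost dimension first
  };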
The builtin might not be useful for C arrays, where access is defined as row-major and always linearized. But libraries could make use of the builtin, for instance a Matrix class with an overloaded operator:

  double &operator()(int x, int y) {
  #if __has_builtin(__builtin_subscript)
    return __builtin_subscript(data, x, y);
  #else
    return data[x*n + y];
  #endif
  }

A special case is jagged arrays, i.e., pointers to pointers to arrays, which are very common in preexisting code bases. To ensure non-aliasing, we would need a pairwise check for aliasing of every subarray, which has polynomial complexity by itself. A builtin/intrinsic that guarantees the non-aliasing of subarrays would allow us to transform these cases as well. If there is a common interface for both cases, transformations work transparently on either representation.

E) Dependency Analysis
Again, there is a common interface for all three kinds of dependency representations, e.g., "does statement instance S(x,y) depend on instance T(x)?". Instances can, like memory subscripts, be described as SCEVs or as affine functions over the surrounding loop counters. Representations can also be converted to either type, possibly overapproximating the dependencies. The precision can be selected by compiler switches or the optimization level. The goal is to allow any loop transformation (e.g., the vectorizer) to use any dependency precision. For the polyhedral representation, we could introduce an intermediate layer that allows plugging in other integer-LP solvers and returning approximations of the true result. Dependencies can be arbitrarily overapproximated, which inhibits transformations but stays correct. Alternative implementations might just allow rectangles or octahedra [6] instead of polyhedra, which do not have exponential worst-case behavior. Polly at the moment does not model register or control dependencies. I hope that adding these kinds of dependencies would allow developing polyhedral schedulers that keep the definition and the uses of a virtual register in the same loop.

F) Assumption Tracking
The purpose of a shared assumption tracker is that only a single runtime condition check is needed after all transformations have been applied.

G) Transformation
Notice that polyhedral scheduling is just one of the possible transformation passes. Even in Polly, some transformations (tiling, vectorization, matrix-multiplication) are post-optimizations on the schedule tree and can be implemented on other AST representations as well. No knowledge of the polyhedral model is required to implement a new transformation. VPlan allows creating multiple plans at the same time and selecting the best according to a heuristic. However, some transformations might be easier to implement by modifying the underlying IR in the cloned function, e.g., removing conditionals that lead to exceptional cases (the non-performance-relevant printf in the example). Otherwise we could also select the most profitable transformed AST according to a cost function.

H) Code Generation
Generating code into a separate function has some advantages. First, we control the environment: input base pointers can be annotated using the noalias qualifier without metadata. Code generation is easier because there are fewer special cases like multiple exit edges, llvm.lifetime.begin/end intrinsics, etc. And register pressure is separate for the presumably hot code and the rest of the function.
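A minimal sketch of that outlining-plus-versioning shape; all names below are invented for illustration:

  // Outlined, specialized clone: assumptions are encoded directly in the
  // environment, e.g. __restrict for the non-aliasing assumption.
  void nest_optimized(double *__restrict C, const double *__restrict A,
                      const double *__restrict B, long nx, long ny);
  // Unmodified fallback (still handles the 'verbose' path).
  void nest_original(double *C, const double *A, const double *B,
                     long nx, long ny, int verbose);
  // Hypothetical helper standing in for the generated pairwise
  // overlap checks.
  bool disjoint(const double *C, const double *A, const double *B,
                long nx, long ny);

  void nest(double *C, const double *A, const double *B,
            long nx, long ny, int verbose) {
    // One combined runtime check derived from the assumption tracker:
    if (!verbose && disjoint(C, A, B, nx, ny))
      nest_optimized(C, A, B, nx, ny);
    else
      nest_original(C, A, B, nx, ny, verbose);
  }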
I) Pass Manager Integration
The reason not to use LLVM's pass manager for the individual transformations is that the loop tree carries state that is not represented in the LLVM IR. The pass manager would be allowed to invalidate the pass holding the tree and would expect it to be recomputed the same way when instantiated again; this is not the case once the tree has been transformed. Polly does it this way and needs all its transformation passes to preserve all analyses to avoid making the pass manager drop and recompute ScopInfo and ScopDetection.

[1] http://llvm.org/devmtg/2017-03/assets/slides/introducing_vplan_to_the_loop_vectorizer.pdf
[2] http://llvm.org/devmtg/2017-10/#talk17
[3] http://isl.gforge.inria.fr/
[4] http://pluto-compiler.sourceforge.net/
[5] http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[6] https://tel.archives-ouvertes.fr/tel-00818764