Saito, Hideki via llvm-dev
2017-Oct-14 00:04 UTC
[llvm-dev] [RFC] Polly Status and Integration
Very sorry for the spam.>So, if this part is optional, you may want to state thatThought I need to clarify since Michael actually wrote "optional". The way Michael wrote is "skip these transformations at lower opt level". What I meant was like apply these preprocess transformations by some other means (e.g., part of step-G). -----Original Message----- From: Saito, Hideki Sent: Friday, October 13, 2017 4:30 PM To: 'llvm-dev at lists.llvm.org' <llvm-dev at lists.llvm.org>; 'Hal Finkel' <hfinkel at anl.gov> Subject: Re: [RFC] Polly Status and Integration Michael, [Sorry that I don't have you in To:. I don't have your addr that I can use since I'm replying through digest.] I'm also sorry that I'm not commenting on the main part of your RFC in this reply. I just want to focus on one thing here. Proposed Loop Optimization Framework ------------------------------------ Only the architecture is outlined here. The reasons for them can be found in the "rationales" section below. A) Preprocessing Make a copy of the entire function to allow transformations only for the sake of analysis. Before we started our journey into VPlan-based vectorization approach, we explicitly asked about modifying the IR for the sake of vectorization analysis ----- general consensus in LLVM-DEV was "that's a bad idea". We thought so, too. That's the reason VPlan has its own local data structure that it can play with. Making a copy of the entire function is far worse than copying a loop (nest). Unless the community mindset has dramatically changed since ~2yrs ago, it's unlikely that you'll get overwhelming support on "copy the entire function" before the decision to transform is taken. So, if this part is optional, you may want to state that. Having said that, in https://reviews.llvm.org/D38676, we are introducing concepts like VPValue, VPUser, and VPInstruction, in order to manipulate and interact with things that didn't come from the IR of the original loop (nest). As such, I can see why you think "a playground copy of IR" is beneficial. Even before reading your RFC, I was actually thinking that someone else may also benefit from VPValue/VPUser/VPInstruction-like concept for the analysis part of some heavy-weight transformations. Ideally, what we wanted to do was to use LLVM's Value class hierarchy if we could generate/manipulate "abstract Instructions" w/o making them parts of the IR state, instead of creating our own light-weight VPValue/VPUser/VPInstruction classes. If anyone has better ideas in this area, we'd like to listen to. Please reply either to this thread or through the code review mentioned above. Thanks, Hideki Saito Vectorizer Team, Intel Compilers and Languages ------------------------------ Message: 6 Date: Wed, 11 Oct 2017 09:07:11 +0200 From: Michael Kruse via llvm-dev <llvm-dev at lists.llvm.org> To: Hal Finkel <hfinkel at anl.gov>, LLVM Dev <llvm-dev at lists.llvm.org> Subject: Re: [llvm-dev] [RFC] Polly Status and Integration Message-ID: <CADPO-WCGVPjU8znGBuA7J_ZLqxFOJQ9GqOP5XMUyUCsYmdSsyg at mail.gmail.com> Content-Type: text/plain; charset="UTF-8" Dear LLVM community, when working on some project larger projects, probably everybody has come up with ideas on how they would make things different if they started from scratch. For instance, in LLVM, modeling PHI nodes as in Swift's intermediate representation might be such an idea, but requires huge changes in the code base. As contributor to Polly, I am no exception. I therefore would like to share my thoughts about what the best possible integration of a polyhedral optimizer/Polly could look like, to discuss which end result we might strive for. I am a proponent of polyhedral compilation and would like to see it used on a daily basis by just choosing an optimization level. IMHO we should avoid multiple loop optimizations in LLVM. If each of them does its own loop versioning, we will get multiple redundant runtime checks and code duplication. This is already the case with LoopVersioningLICM, LoopDistribution, LoopVectorizer and LoopLoadElimination. I argue for a more integrated approach for loop-based optimizations. To clarify, what this is not: - It has not been shown to work in practice. - While I would like to contribute to implement such a loop optimizer, this is not something I could do alone. - This is no direct replacement for Polly or VPlan, rather a long-term vision which also includes VPlan/Polly. To have a comparison, I first try to sketch the approaches taken by the LoopVectorizer, VPlan and Polly to then present the "Loop Optimization Framework". First only the design and afterwards the rationales for the design decisions. To be able to show some examples, I'll refer to this snippet we intent to optimize. for (x = 0; x < sizes->nx; x += 1) for (y = 0; y < sizes->ny; y += 1) C[x][y] = 0; for (x = 0; x < sizes->nx; x += 1) { tmp = A[x]; /* hoisted load by LICM or GVN */ int ny = sizes->ny; /* dynamic load of size parameter */ for (y = 0; y < sizes->ny; y += 1) { if (x != y) /* affine conditional */ C[x][y] = tmp * B[y]; if (C[x][y] > 0) /* data-dependent conditional */ C[x][y] = sqrt(C[x][y]); if (verbose) /* non-performance-critical case */ printf("Some output"); } } LoopVectorizer (+VPlan) ------------------------- This part I know the least, please correct me if I state something wrong. Details on VPlan were taken from [1] and its talk, and the code already committed to LLVM. A) Preprocessing LCSSA-form and loop-simplification. B) Scope of Analysis Innermost loop, single body BB (or if-convertible to single BB) VPlan: Acyclic control flow with SESE (single entry/exit block or edge?). Extensions for outer loop vectorization (cyclic control flow) to be presented at the dev meeting: [2] C) Control-Flow Modeling Not needed, it's always just a single loop. VPlan: Hierarchical-CFG of VPBasicBlocks and VPRegionBlocks. D) Access Modeling (LoopAccessInfo/LoopAccessAnalysis/AccessAnalysis) Use AliasSetTracker and look analyze each set independently. No PHI nodes (apart from induction variable) allowed (there seems to be some support for reductions). E) Dependency Analysis (MemoryDepChecker) Leaks only DependenceType and max vectorization factor to the outside. F) Assumption Tracking (PredicatedScalarEvolution/LoopVectorizationRequirements) Assumptions can be added to PredicatedScalarEvolution to make SCEV modeling possible/simpler/exact. Used for: - Assume exact BackedgeTakenCount - Assume no overflow Aliasing tracked by AccessAnalysis G) Transformation One explicit transformation. Either vectorization is possible or don't do anything at all. VPlan: Modifying VPlans (widening, etc) and pick the best according to the cost model. H) Code Generation (InnerLoopVectorizer/LoopVersioning) Create a new vectorized loop (without prologue to get aligned accesses) and use the old loop as epilogue and if the assumptions are not met. VPlan: Generate instructions from the recipes in the VPlan. I) Pass Manager Integration Depends on analyses: LoopInfo and LoopAccessAnalysis Otherwise LoopVectorize is self-contained and instantiates objects itself (eg. LoopVersioning) Polly Architecture ------------------ A) Preprocessing There is no considerable preprocessing (anymore). The only modification done is inserting an entry block containing only alloca's where Polly's IR generator can add its own allocas to. The split-off BB can then be analyzed as well. B) Scope of Analysis (ScopDetection/ScopInfo) Search for the largest compatible llvm::Regions, called SCoPs. Compatibility requirements are (among others): - Single Entry Single Exit: SESE - Base pointers invariant in SCoP - Affine loop-bounds - Only multidimensional affine array accesses (by default) - No function calls (except side-effect free functions) Regions are expanded in order to include multiple top-level loops into on region. In the example both outermost loops are included into the same SCoP region. LoadInst results that are used for loop bounds and array subscripts are assumed to be invariant during the execution of the SCoP. This allows to use `sizes->ny` as a loop bound, even though the LoadInst could return a different value at every iteration of the outer loop. (Invariant load hoisting) Conditions that lead to function calls are assumed to be false. In the example makes it possible to remove the call to printf by adding the assumption `!verbose`. C) Control-Flow Modeling (Schedule Tree) A schedule tree is basically a tree of bands nodes (collection of perfectly nested loops with a defined order) and sequence nodes (child node are execute in sequence order). Affine conditionals are implemented as filters over loop variable values they are executed on. Example (simplified): Root | +- Band(Loops: x, y) | | | +- C[x][y] = 0; | +- Band(Loops: x) | + tmp = A[x]; /* statement */ | +- Band(Loops: y) | +- Filter(x + -y > 0 || x + - y < 0) /* affine conditional */ | | | +- C[x][y] += tmp * B[y]; /* statement */ | +- if(C[x][y] > 0) C[x][y] = sqrt(C[x][y]); /* data-dependent conditional as macro-statement (llvm::Region) */ A statement execution for fixed x and y is called a 'statement instance'. Schedule trees are part of the Integer Set Library (isl) [3], which is also used to represent polyhedral in general. D) Access Modeling (polly::MemoryAccess) Every load/store has a base pointer and multi-dimensional indices. Indices are derived from GetElementPtr and/or using ScalarEvolution and then converted to piecewise-affine functions (isl_pw_aff). Because in C, multi-dimensional accesses decay to flat subscripts (C[x*sizes->nx + y]), delinearization tries to recover the multidimensional subscripts. If this does not succeed, the option -polly-allow-nonaffine interprets them as touching the entire array. Assumptions are added to ensure that the accesses are within the derived dimension sized. llvm::Instructions are modeled as zero-dimensional array (or "scalar") if there is a use outside of the statement instance they are defined in. E) Dependency Analysis (DependencyInfo) When passed the accesses for each statement, isl computes the dependencies (flow, anti and output) as a relational polyhedron. That is, the exact dependencies between all statement instances. Example: { [x,y-1] -> [x,y] if x != y; [x,y-1] -> [x,y+1] if x == y } describes the dependency of the statement with the affine conditional to the next instance of itself. F) Assumption Tracking (ScopInfo) Remembers the assumptions mentioned before. There are four kinds of assumptions - Valid value ranges, not requiring a runtime check (eg from __builtin_assume) - Value ranges that must be checked at runtime (eg. the range of nx,ny such that all array subscripts are within bounds) - Value ranges that are forbidden and must be checked at runtime (eg `verbose` must not be true) - Non-aliasing assumptions, checked as runtime. Some violated assumptions may lead to a bail-out, eg. if the pointer of an assumed invariant load is written to. G) Transformations (ForwardOpTree/DeLICM/Simplify/ScheduleOptimizer) To improve schedulability, ForwardOpTree and DeLICM modify the instruction lists of statements (not the IR itself). For instance, it propagates the definition of tmp (a LoadInst) to the statements that use it. In the example this makes the x and y loops perfectly nested such that it can be tiled, and removes the false dependencies of tmp to itself (a "scalar dependency"). isl then computes a new schedule tree out of the old schedule tree and the dependence analysis. By default, it uses a modified PLuTo [4] algorithm. Its goals are: - Separate independent statements into separate bands - Reduce the distance between dependencies to optimize temporal locality - Find parallel loops On the transformed schedule tree, Polly walks the tree to find additional optimization opportunities: - Tile bands that are 'permutable' and have at least 2 loops - Optionally vectorize the first innermost 'coincident' loop - Detect matrix-multiplication access patterns and apply a special optimization [5] H) Code Generation (IslAstInfo/CodeGeneration/PPCGCodeGeneration) Isl can transform a schedule tree representation into an AST representation, with loops and if-conditions. CodeGeneration versions the original SCoP region using the tracked assumptions and emits the AST representation. For a statement in the AST, it copies the instructions from the instruction list. For parallel loops, it can generate OpenMP. PPCGCodeGeneration generates code for GPU accelerators. I) Pass Manager Integration ScopDetection is a function pass, all following passes are ScopPasses (RegionPasses in disguise): - ScopInfo (analysis) - JScopImporter (SCoP transformation) - Simplify (SCoP transformation) - ForwardOpTree (SCoP transformation) - DeLICM (SCoP transformation) - DependenceInfo (SCoP analysis) - ScheduleOptimizer (SCoP transformation) - IslAstInfo (SCoP analysis) - CodeGeneration (IR transformation) Proposed Loop Optimization Framework ------------------------------------ Only the architecture is outlined here. The reasons for them can be found in the "rationales" section below. A) Preprocessing Make a copy of the entire function to allow transformations only for the sake of analysis. For instance (and not necessarily in the preprocessing): - Invariant load hoisting (Move the load of sizes->ny before the loop) - Move operations closer to the instructions that use them (Move tmp = A[x] to the instructions that use it) - Split basic blocks to allow loop distribution The function will be discarded when no loop transformation is applied. This step can be made optional to be only active at higher optimization settings. B) Scope of Analysis Always entire functions. Some loops/blocks might be marked as non-transformable. C) Control-Flow Modeling An AST-like tree based on the LoopInfo tree. Statements are pointers to Basic Blocks in the original IR. Any conditionals are modeled as data-dependency. Example: +- Loop 0 <= x < sizes->nx | | | +- Loop 0 <= y < sizes->ny | | | +- C[x][y] = 0; [condition: true] | +- Loop 0 <= x < sizes->nx | +- Loop 0 <= y < sizes->ny | +- C[x][y] += A[x] * B[y]; [condition: true] | +- C[x][y] = sqrt(C[x][y]); [condition: C[x][y] > 0] | +- printf("Some output"); [condition: verbose] The tree is mutable by transformations, IR will be generated from it at the end. Loops bounds and affine conditions can also optionally be expressed as isl_sets. Statements have properties: - Conditions under which they are executed - Is it if-convertible? - Can be executed even in the original IR it would not be executed? - If executed at least once, executing multiple times has no effect? - Hotness information from PGO - #pragma annotations from clang - Multi-thread effects (e.g. atomics/fences) - ... D) Access Modeling Each load/store/... is associated with: - The base pointer - For each of the array's dimensions, the subscript - Either as SCEV - Or as affine function (isl_pw_aff) Frontends might help recovering the multidimensional structure of an array by, instead of linearizing array subscripts (eg. row-major), using a intrinsic/builtin. E) Dependency Analysis A collection of dependencies between two statements, say S and T. S must be executed before T. - The type of dependency: Register, Control, Flow, Anti, Output - Which instances depend on each other. This can be: - No instance information. All instances of S must be executed before any instance of T. - A vector of dependence distances. Basically MemoryDepChecker for each surrounding loop. - An polyhedral relational map (isl_map) describing which instances of T depend on which instances of S. The dependency type "register" means that T used an llvm::Value defined by S. This can also be modeled using flow and anti-dependencies but as shown before this can be destructive to transformation opportunitied. Transformations need to specially consider register dependencies and avoid requiring expansion. The dependency type "control" models order induced by control conditions. Eg. the data-dependent conditional 'C[x][y] > 0' can only be evaluated after 'C[x][y]' has been written. F) Assumption Tracking A public interface that loop analysis and transformations can use to register their assumptions to checked at runtime. Assumptions are for specific loops, so they only need to be checked if the loop is actually transformed. Assumptions can also be queried, eg. knowing that a variable always has a specific value may allow some simplifications. This knowledge could also be fed from __builtin_assume and profiling data. G) Transformation In the style of InstCombine, visit the tree and modify it (and only the tree, not the IR). Transformations can include: - Add assumptions to make parts more transformable (e.g. assume verbose==false to eliminate the call to printf) - Combine single loops into bands - Apply #pragmas on the loop/band it is associated with. - Vectorize loops (VPlan) - Tiling - Unrolling - loop-unswitching - Localize working sets of inner tiles. This can be a generalization of Polly's matrix-multiplication optimization. - (Partial) unrolling of loops - Specializing loops for specific inputs (loop-unswitching) - Mixed-LP scheduling - OpenMP-parallelize bands - Extract accelerator kernels - Detect loop idoms (memset, memcopy, zgemm, etc.) to call library functions instead - ... Transformations are responsible to not violate dependencies. The current Polly would become one of the transformations. It finds the largest loop body where all size parameters are invariant and translate the subtree to an isl schedule tree. Isl would transform it, generate an isl_ast and Polly translates it to a new loop tree. Much of the work done by Polly today would be provided by the framework. Example transformation output: +- memset(C, '\0', sizes->nx*sizes->ny*sizeof(C[0][0])); | +- Loop 0 <= x < sizes->nx [#pragma omp parallel for][condition: +!verbose] [assumption: C[0..sizes->nx-1][0..sizes->ny-1], A[0..sizes->nx-1], B[0..sizes->ny-1] not aliasing] | | | +- Loop 0 <= yv < sizes->ny step 4 | | | +- Loop yv <= y < yv+4 [vectorize VF=4] | | | | | +- C[x][y] += A[x] * B[y]; | | | +- Loop 4*(sizes->ny/4) <= y < sizes->ny | | | | | +- C[x][y] += A[x] * B[y]; | | | +- Loop 0 <= y < sizes->ny | | | +- C[x][y] = sqrt(C[x][y]); [condition: C[x][y] > 0] | +- Loop 0 <= x < sizes->nx [condition: verbose] | +- Loop 0 <= y < sizes->ny | +- C[x][y] += A[x] * B[y]; [condition: true] | +- C[x][y] = sqrt(C[x][y]); [condition: C[x][y] > 0] | +- printf("Some output"); [condition: true] H) Code Generation Determine which part of the tree has been modified. Outermost parts that have not been transformed do not need to be regenerated. The transformed part of the tree is generated into a separate function. Runtime checks for the collected assumptions are generated, and depending on its result, either the optimized function or the previously outlined function is called. The optimized function can be further optimized using LLVM's usual passes and the inliner decide according to its heuristics whether it is worth inlining into the caller. I) Pass Manager Integration A monolithic loop transformation pass which creates the loop tree and provides dependency analysis and the assumption track. Then it invokes the separate transformations like InstCombine. The final tree is code-generated before control is passed back to (old or new) pass manager. Design Decision Rationales -------------------------- A) Preprocessing Polly's policy at the moment is not modify the IR until code generations (to not interfere with the current pipeline). For invariant load hoisting and avoiding scalar dependencies it only virtually moves/copies instructions in its own data structures. This adds complexity as these representations get out-of-sync. For instance, DominatorTree does not reflect the new locations of instructions, the CFG is modified by CodeGenerator for versioning, PHI node incoming lists does not correspond to the predecessor blocks anymore, instructions cannot be propagated cross loop iterations as we don't know which iteration it was originally in, etc. If possible, I'd like to avoid this complexity. It approaches being an alternative representation for the current IR, so why not using the IR? VPlan has its own IR-like representation consisting of ingredients and recipes. For the vectorizer it might be justifiable because it does not change control flow. Some transformations cope well with scalar dependencies (eg. vectorization, there's a copy of each scalar anyway and thread-parallelization would have them privatized), but in the polyhedral model, analysis takes place before transformation decisions. However, today's polyhedral schedulers (That are based on Feautrier's or PLuTo's) do not have special handling for scalar dependencies. I am positive that we can have polyhedral schedulers that handle scalars without getting worse results (by avoiding putting value definitions into different loops than its uses; ability to execute a definition multiple times), but this is research were such a framework could be very useful. Most of today's polyhedral compilers used by researchers are source-to-source where the problem is much less pronounced. We might alternatively change the meaning of IR canonicalization in LLVM. Pre-loop canonicalized would try to create self-contained blocks where post-loop transformations do hoisting and sinking. To illustrate the issue with scalar dependencies: for (int i = 0; i < N; i+=1) { double s = ...; ... = s; } Since in Polly s is modeled as a zero-dimensional array, there is just one memory location for it. Before processing the next iterations, the current iteration must have completed before s is overwritten. If virtual registers are not modeled having a memory location, a polyhedral schedule might be included to do this (because, eg. splitting those up makes one of the loops vectorizable): double s[N]; for (int i = 0; i < N; i+=1) s[i] = ...; for (int i = 0; i < N; i+=1) .... = s[i]; that is, requiring an unbounded amount of additional memory. That's probably not something we want. B) Scope of Analysis Modeling entire functions has the advantage that analysis does not have to choose a unit of transformation (Polly: SCoPs, LoopVectorizer: a loop), but the transformations decide themselves what they are modifying. I'd also like to synchronize the unit of modeling. Polly's SCoPs is based on SESE regions. llvm::Region and llvm::Loop have no hierarchy, regions and loops may partially overlap. It is possible to represent multiple exit loops in the polyhedral model. C) Control-Flow Modeling LoopInfo, Polly's schedule tree, VPlan's hierarchical CFG all use a tree-like representation of control flow. I hope to find a common one. Using execution conditions 'flattens' the loop body. Conditions can be whether a previous statement has been executed or not (control dependence) so that the original CFG can be reconstructed. Affine conditions can be directly represented in the polyhedral model, for others 'always executing' is an overapproximation and it is checked at runtime whether it is really executed. This allows more flexibility than Polly's SESE subregion macro-statement approach. Whether a condition is affine or not also has no influence on other transformations. VPlan also uses SESE in its CFG representation, but I think it can also make use of this model. D) Access Modeling There is a common interface for both representations. Modeling as affine function is required for the polyhedral model and is more powerful, but assumes arbitrary precision integers, so additional assumptions might need to be added. The SCEV representation can (when linear/affine) be converted to an isl_pw_aff when needed, or the isl_pw_aff computed directly as by Johannes' suggestion. However, I did not observe that SCEV was the limiting part in Polly, and could be extended to support conditional operators as well. The builtin might not be useful for C arrays, where access is defined as row-major and always linearized. But libraries could make use of the builtin, for instance a Matrix class with overloaded operator: double &operator()(int x, int y) { #if __has_builtin(__builtin_subscript) return __builtin_subscript(data, x, y); #else return data[x*n+y]; #endif } A special case are jagged arrays/pointers to pointers of arrays which are very common in preexisting code bases. To ensure non-aliasing, we would need a pairwise check for aliasing of every subarray, which has polynomial complexity by itself. A builtin/intrinsic that guarantees the non-aliasing of subarrays would allow us to transform these cases as well. If there is a common interface for both cases, transformations work transparently on either representation. E) Dependency Analysis Again there is a common interface for all three kinds of dependency representations, eg. "does statement instance S(x,y) depend on instance T(x)"?. Instances can, as memory subscripts, described as SCEV or affine functions over surrounding loop counters. Representations can also be converted to either type, possibly overapproximating the dependencies. The precision can be selected by compiler switches or optimization level. The goal is to allow any loop transformation (eg. vectorizer) can use any dependency precision. For the polyhedral representation, we could introduce an intermediate layer that allows to plug in other intLP solvers and return approximations of the true result. Dependencies can be arbitrarily overapproximated, which inhibits transformations, but stays correct. Alternative implementations might just allow rectangles or octahedra [6] instead of polyhedra, which do not have exponential worst-case behaviours. Polly at the moment does not know register or control dependencies. I hope that adding such kind of dependency would allow developing polyhedral scheduler that keep definition and uses of a virtual register in the same loop. F) Assumption Tracking The purpose of a shared assumption tracker is to only have the need for a single runtime condition check after all transformations have been applied. G) Transformation Notice that polyhedral scheduling is just one of the possible transformation passes. Even in Polly, some transformations (tiling, vectorization, matrix-multiplication) are post-optimizations on the schedule tree and can be implemented on other AST representations as well. No knowledge of the polyhedral model is required to implement a new transformation. VPlan allows creating multiple plans at the same time and select the best according a heuristic. However, some transformations might be easier to implement by modifying the underlying IR in the cloned function. Eg. removing conditionals that lead to exceptional cases (the non-performance-relevant printf in the example). Otherwise we could also select the most profitable transformed AST according to a cost function. H) Code Generation Generating code it into a separate function has some advantages. First, we control the environment. Input base pointers can be annotated using the noalias qualifier without metadata. Code generation is easier because there are less special cases like multiple exit edges, llvm.lifetime.begin/end intrinsics, etc. Separate register pressure for the presumably hot code and the rest of the function. I) Pass Manager Integration The reason to not use LLVM's pass manager for the individual transformations is because the loop tree carries state that is not represented in the LLVM IR. The pass manager would be allows to invalidate the pass holding the tree and expect it to be recomputable the same way when instantiated again. This is not the case if the tree has been transformed already. Polly does it this way and needs all its transformation passes to preserve all analyses to avoid making the pass manager drop and recompute ScopInfo and ScopDetect. [1] http://llvm.org/devmtg/2017-03/assets/slides/introducing_vplan_to_the_loop_vectorizer.pdf [2] http://llvm.org/devmtg/2017-10/#talk17 [3] http://isl.gforge.inria.fr/ [4] http://pluto-compiler.sourceforge.net/ [5] http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf [6] https://tel.archives-ouvertes.fr/tel-00818764