Hi all,

I'd like to have a first stab at a loop-based auto-vectorization pass
as part of 2009's Google Summer of Code program. As far as I can tell
from searching the mailing list archives, no work on such an
auto-vectorizer seems to be currently in progress.

Whereas auto-vectorization is a well-researched topic, complexity seems
to quickly explode once more advanced loop constructs are to be
supported. Therefore my aim is to start with the most minimal
implementation possible, to explore the difficulties encountered in the
specific context of LLVM and to build a foundation from which future
work can progress.

So, initially, I aim at supporting only the simplest loops, such as:

    int a[256], b[256], c[256];
    for (int i = 0; i < 256; ++i)
        c[i] = a[i] + b[i];

My goal is to implement the necessary analyses and transformations to
turn IR corresponding to such code into IR utilizing vector
instructions; i.e. the core of the desired result would look like:

    %va = load <256 x i32>* %a
    %vb = load <256 x i32>* %b
    %vc = add <256 x i32> %va, %vb
    store <256 x i32> %vc, <256 x i32>* %c

After this transformation, the code could be left as is (which would
generate a fully unrolled vector add, as long as the vector length
doesn't exceed the lowering capabilities of the code generator), or
strip-mined into a loop using fixed-size vectors (preferably
corresponding to the SIMD register width of the target architecture).

Auto-vectorization has numerous dependencies on other
analysis/transformation passes (loop normalization, constant/scalar
expansion, dependence analysis, etc.). A significant part of the
proposed project is to evaluate already-available LLVM passes regarding
their applicability to support auto-vectorization and to
extend/implement what's missing.

As for my personal background, I'm a graduate student at Vienna
University of Technology (TU Wien) with a strong background in compiler
theory, acquired in a wide variety of undergraduate- and graduate-level
courses. I recently implemented two simple compilers using LLVM and
conducted various experiments using LLVM's vector support. My general
interest in data-parallel computation stems from a great fondness for
APL (especially for its modern offspring K). While I'm familiar with
the concepts behind auto-vectorization, I haven't implemented an
auto-vectorizer before. I'm the author of and a contributor to several
smaller open source projects.

I appreciate any suggestions and would be very excited if someone is
interested in mentoring this.

--
Best regards,
Andreas Bolka
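[Editorial note, not part of the original mail: the strip-mined form
described above, assuming a hypothetical 4-wide SIMD target, can be
sketched in C using the GCC/Clang vector extension, which stands in for
the <4 x i32> IR an auto-vectorizer would emit:]

```c
/* Sketch of the strip-mined 256-element add for an assumed 4-wide
 * SIMD target (vectorization factor VF = 4). The v4si type is the
 * GCC/Clang vector-extension analogue of LLVM's <4 x i32>. */
typedef int v4si __attribute__((vector_size(16)));

int a[256] __attribute__((aligned(16)));
int b[256] __attribute__((aligned(16)));
int c[256] __attribute__((aligned(16)));

void vadd(void) {
    /* 256 / 4 = 64 iterations, one <4 x i32> add per iteration. */
    for (int i = 0; i < 256; i += 4) {
        v4si va = *(v4si *)&a[i];   /* load  <4 x i32> */
        v4si vb = *(v4si *)&b[i];   /* load  <4 x i32> */
        *(v4si *)&c[i] = va + vb;   /* add + store <4 x i32> */
    }
}
```

[The aligned(16) attributes sidestep the alignment questions raised
later in the thread; a real pass would have to analyze or enforce
alignment.]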
Andreas Bolka wrote:
> Hi all,
>
> I'd like to have a first stab at a loop-based auto-vectorization pass
> as part of 2009's Google Summer of Code program. As far as I can tell
> from searching the mailing list archives, no work on such an
> auto-vectorizer seems to be currently in progress.

Hi Andreas,

Actually, you'd be the third person to try writing one, with myself
being the second. :) Don't let that discourage you; I don't think
anyone's actively working on it now.

> Whereas auto-vectorization is a well-researched topic, complexity
> seems to quickly explode once more advanced loop constructs are to be
> supported. Therefore my aim is to start with the most minimal
> implementation possible, to explore the difficulties encountered in
> the specific context of LLVM and to build a foundation from which
> future work can progress.

There are two types of auto-vectorization, SLP (superword level
parallelism) and ILP (instruction level parallelism). You can do ILP,
which is loop-ignorant auto-vectorization of straight-line code, which
turns out to be the code that runs inside the loop.

> So, initially, I aim at supporting only the simplest loops, such as:
>
>     int a[256], b[256], c[256];
>     for (int i = 0; i < 256; ++i)
>         c[i] = a[i] + b[i];
>
> My goal is to implement the necessary analyses and transformations to
> turn IR corresponding to such code into IR utilizing vector
> instructions; i.e. the core of the desired result would look like:
>
>     %va = load <256 x i32>* %a
>     %vb = load <256 x i32>* %b
>     %vc = add <256 x i32> %va, %vb
>     store <256 x i32> %vc, <256 x i32>* %c
>
> After this transformation, the code could be left as is (which would
> generate a fully unrolled vector add, as long as the vector length
> doesn't exceed the lowering capabilities of the code generator), or
> strip-mined into a loop using fixed-size vectors (preferably
> corresponding to the SIMD register width of the target architecture).
>
> Auto-vectorization has numerous dependencies on other
> analysis/transformation passes (loop normalization, constant/scalar
> expansion, dependence analysis, etc.). A significant part of the
> proposed project is to evaluate already-available LLVM passes
> regarding their applicability to support auto-vectorization and to
> extend/implement what's missing.

There's no loop dependence analysis in LLVM yet. How are you planning
to implement it? Can you point to a paper, or write up a plain English
description?

> As for my personal background, I'm a graduate student at Vienna
> University of Technology (TU Wien) with a strong background in
> compiler theory, acquired in a wide variety of undergraduate- and
> graduate-level courses. I recently implemented two simple compilers
> using LLVM and conducted various experiments using LLVM's vector
> support. My general interest in data-parallel computation stems from
> a great fondness for APL (especially for its modern offspring K).
> While I'm familiar with the concepts behind auto-vectorization, I
> haven't implemented an auto-vectorizer before. I'm the author of and
> a contributor to several smaller open source projects.
>
> I appreciate any suggestions and would be very excited if someone is
> interested in mentoring this.

I'm interested, but I'd like to see more details about exactly how
(i.e., with what algorithm) you intend to find and transform code
first.

Nick
Nick Lewycky wrote:
> Andreas Bolka wrote:
>> Hi all,
>>
>> I'd like to have a first stab at a loop-based auto-vectorization
>> pass as part of 2009's Google Summer of Code program. As far as I
>> can tell from searching the mailing list archives, no work on such
>> an auto-vectorizer seems to be currently in progress.
>
> Hi Andreas,
>
> Actually, you'd be the third person to try writing one, with myself
> being the second. :) Don't let that discourage you; I don't think
> anyone's actively working on it now.
>
>> Whereas auto-vectorization is a well-researched topic, complexity
>> seems to quickly explode once more advanced loop constructs are to
>> be supported. Therefore my aim is to start with the most minimal
>> implementation possible, to explore the difficulties encountered in
>> the specific context of LLVM and to build a foundation from which
>> future work can progress.
>
> There are two types of auto-vectorization, SLP (superword level
> parallelism) and ILP (instruction level parallelism). You can do ILP,
> which is loop-ignorant auto-vectorization of straight-line code,
> which turns out to be the code that runs inside the loop.

Pardon me. I've been disabused by Owen Anderson for writing the above.
SLP is loop-ignorant. ILP is a rather generic term and probably
shouldn't be described as a type of auto-vectorization. Cross-iteration
and intra-iteration better describe the distinction.

In any event, the paper I was thinking of when I wrote the above is:

    http://www.cag.lcs.mit.edu/slp/SLP-PLDI-2000.pdf

My concern is that writing both loop dependence analysis and loop
auto-vectorization would be difficult in the time span allotted for
GSoC, and you may want to consider just doing SLP.
Nick

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu        http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
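[Editorial note, not part of the original thread: the SLP idea from the
paper Nick links can be made concrete with a small sketch, again using
the GCC/Clang vector extension. SLP packs isomorphic scalar statements
that access adjacent memory into one vector operation, without looking
at any enclosing loop:]

```c
typedef int v4si __attribute__((vector_size(16)));

int a[4] __attribute__((aligned(16)));
int b[4] __attribute__((aligned(16)));
int c[4] __attribute__((aligned(16)));

/* Straight-line code: four isomorphic statements touching adjacent
 * memory locations -- the pattern SLP looks for; no loop required. */
void scalar_adds(void) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}

/* What an SLP pass would turn the above into: one packed add. */
void packed_add(void) {
    v4si va = *(v4si *)a;
    v4si vb = *(v4si *)b;
    *(v4si *)c = va + vb;
}
```

[Such straight-line groups often arise from unrolled loop bodies, which
is why SLP can recover much of what a loop vectorizer would find.]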
On Mar 31, 2009, at 5:27 PM, Andreas Bolka wrote:
> Hi all,
>
> I'd like to have a first stab at a loop-based auto-vectorization pass
> as part of 2009's Google Summer of Code program. As far as I can tell
> from searching the mailing list archives, no work on such an
> auto-vectorizer seems to be currently in progress.

Hi Andreas,

This would be a very interesting project! The most important part of
the proposal is to break it down into small pieces.

> Whereas auto-vectorization is a well-researched topic, complexity
> seems to quickly explode once more advanced loop constructs are to be
> supported. Therefore my aim is to start with the most minimal
> implementation possible, to explore the difficulties encountered in
> the specific context of LLVM and to build a foundation from which
> future work can progress.

This is a great approach. I'm very much in favor of solving small
problems first, instead of trying to solve all the world's issues at
once. :)

> So, initially, I aim at supporting only the simplest loops, such as:
>
>     int a[256], b[256], c[256];
>     for (int i = 0; i < 256; ++i)
>         c[i] = a[i] + b[i];
>
> My goal is to implement the necessary analyses and transformations to
> turn IR corresponding to such code into IR utilizing vector
> instructions;

Sounds great. Some important steps:

1. We need an abstract dependence analysis interface (like we have for
   AA) and one implementation (probably handling trivial loops like
   the one above to start with). Many people have talked about this,
   but no code has gone in yet.

2. We need some interface that can be implemented by a target machine
   to describe what the vector capabilities of the machine are.

3. We need the transformation code. Starting with a simple loop
   vectorizer (as opposed to an SLP system) would make sense to me.
   Once the basics are in place, it can be feature-crept to support
   new idioms (reductions, alignment analysis, etc.). The first cut
   doesn't need to provide an immediate speedup, but should have many
   testcases that demonstrate the loop forms it can handle.

> i.e. the core of the desired result would look like:
>
>     %va = load <256 x i32>* %a
>     %vb = load <256 x i32>* %b
>     %vc = add <256 x i32> %va, %vb
>     store <256 x i32> %vc, <256 x i32>* %c

Actually, IMO this is probably not the direction we want to go.
Vectorizing to extremely wide vectors like this isn't along the lines
of what we want: we should get something like the original loop,
unrolled 4 times, doing operations on <4 x i32>'s (for x86/ppc).

> Auto-vectorization has numerous dependencies on other
> analysis/transformation passes (loop normalization, constant/scalar
> expansion, dependence analysis, etc.). A significant part of the
> proposed project is to evaluate already-available LLVM passes
> regarding their applicability to support auto-vectorization and to
> extend/implement what's missing.

Sounds good. I'm also fine with it only working in carefully
controlled circumstances.

Depending on how ambitious you are, you could also do a GSoC project
just around a) building a dependence analysis interface + some
*non*-trivial implementations, and b) building some dep analysis
clients around it (e.g. loop interchange, etc.). This may be a less
risky and higher-impact project in the short term, if you are
interested in it.

> As for my personal background, I'm a graduate student at Vienna
> University of Technology (TU Wien) with a strong background in
> compiler theory, acquired in a wide variety of undergraduate- and
> graduate-level courses. I recently implemented two simple compilers
> using LLVM and conducted various experiments using LLVM's vector
> support. My general interest in data-parallel computation stems from
> a great fondness for APL (especially for its modern offspring K).
> While I'm familiar with the concepts behind auto-vectorization, I
> haven't implemented an auto-vectorizer before. I'm the author of and
> a contributor to several smaller open source projects.

Great! Either approach (building a very simple auto-vec system
end-to-end, or building some non-trivial dep analysis implementations
+ a client or two) sounds like a very promising GSoC project!

-Chris
Chris Lattner wrote:
> Depending on how ambitious you are, you could also do a GSoC project
> just around a) building a dependence analysis interface + some
> *non*-trivial implementations, and b) building some dep analysis
> clients around it (e.g. loop interchange, etc.). This may be a less
> risky and higher-impact project in the short term, if you are
> interested in it.

This would be in great demand, independent of the auto-vectorization.
Given that the amount of work is fixed, I would rather see all the work
go into a brilliant dependence analysis than into an OK dependence
analysis and an OK vectorization. I encourage you to consider it,
Andreas.

--
Pertti
Hi Andreas,

On 31-Mar-09, at 8:27 PM, Andreas Bolka wrote:
> So, initially, I aim at supporting only the simplest loops, such as:
>
>     int a[256], b[256], c[256];
>     for (int i = 0; i < 256; ++i)
>         c[i] = a[i] + b[i];
>
> My goal is to implement the necessary analyses and transformations to
> turn IR corresponding to such code into IR utilizing vector
> instructions; i.e. the core of the desired result would look like:
>
>     %va = load <256 x i32>* %a
>     %vb = load <256 x i32>* %b
>     %vc = add <256 x i32> %va, %vb
>     store <256 x i32> %vc, <256 x i32>* %c
>
> After this transformation, the code could be left as is (which would
> generate a fully unrolled vector add, as long as the vector length
> doesn't exceed the lowering capabilities of the code generator), or
> strip-mined into a loop using fixed-size vectors (preferably
> corresponding to the SIMD register width of the target architecture).

I think the biggest problem with this approach, apart from the fact
that it doesn't mirror how vectors are typically used today in LLVM,
is that vectors in LLVM are of fixed size. This is going to severely
limit the usefulness of this transformation. I think you may be better
off getting information about vector widths for the architecture (e.g.
from TargetData) and vectorizing directly with a particular width in
mind.

Overall, this would be a great GSoC project, and I guarantee you would
get a lot of interest in this :). Good luck!

Stefanus

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463
Hi Andreas,

On Mar 31, 2009, at 5:27 PM, Andreas Bolka wrote:
> Hi all,
>
> I'd like to have a first stab at a loop-based auto-vectorization pass
> as part of 2009's Google Summer of Code program.

Good idea!

> As far as I can tell from searching the mailing list archives, no
> work on such an auto-vectorizer seems to be currently in progress.

Yes.

> Whereas auto-vectorization is a well-researched topic, complexity
> seems to quickly explode once more advanced loop constructs are to be
> supported. Therefore my aim is to start with the most minimal
> implementation possible, to explore the difficulties encountered in
> the specific context of LLVM and to build a foundation from which
> future work can progress.
>
> So, initially, I aim at supporting only the simplest loops, such as:
>
>     int a[256], b[256], c[256];
>     for (int i = 0; i < 256; ++i)
>         c[i] = a[i] + b[i];
>
> My goal is to implement the necessary analyses and transformations to
> turn IR corresponding to such code into IR utilizing vector
> instructions; i.e. the core of the desired result would look like:
>
>     %va = load <256 x i32>* %a
>     %vb = load <256 x i32>* %b
>     %vc = add <256 x i32> %va, %vb
>     store <256 x i32> %vc, <256 x i32>* %c
>
> After this transformation, the code could be left as is (which would
> generate a fully unrolled vector add, as long as the vector length
> doesn't exceed the lowering capabilities of the code generator), or
> strip-mined into a loop using fixed-size vectors (preferably
> corresponding to the SIMD register width of the target architecture).
>
> Auto-vectorization has numerous dependencies on other
> analysis/transformation passes (loop normalization, constant/scalar
> expansion, dependence analysis, etc.). A significant part of the
> proposed project is to evaluate already-available LLVM passes
> regarding their applicability to support auto-vectorization and to
> extend/implement what's missing.
>
> As for my personal background, I'm a graduate student at Vienna
> University of Technology (TU Wien) with a strong background in
> compiler theory, acquired in a wide variety of undergraduate- and
> graduate-level courses. I recently implemented two simple compilers
> using LLVM and conducted various experiments using LLVM's vector
> support. My general interest in data-parallel computation stems from
> a great fondness for APL (especially for its modern offspring K).
> While I'm familiar with the concepts behind auto-vectorization, I
> haven't implemented an auto-vectorizer before. I'm the author of and
> a contributor to several smaller open source projects.
>
> I appreciate any suggestions and would be very excited if someone is
> interested in mentoring this.

As Chris said, there are many approaches you can take for a GSoC
project whose title includes "auto-vectorization". IMO, building an
initial foundation for data dependence analysis is a good start. I
encourage you to prepare a proposal that includes a list of measurable
and achievable milestones with realistic goals, where each milestone is
a small step in the right direction. I am ready to mentor this project.

- Devang
Hi Stefanus,

On Wed Apr 01 16:08:45 +0200 2009, Stefanus Du Toit wrote:
> On 31-Mar-09, at 8:27 PM, Andreas Bolka wrote:
>> i.e. the core of the desired result would look like:
>>
>>     %va = load <256 x i32>* %a
>>     %vb = load <256 x i32>* %b
>>     %vc = add <256 x i32> %va, %vb
>>     store <256 x i32> %vc, <256 x i32>* %c
>
> I think the biggest problem with this approach, apart from the fact
> that it doesn't mirror how vectors are typically used today in LLVM,
> is that vectors in LLVM are of fixed size. This is going to severely
> limit the usefulness of this transformation. I think you may be
> better off getting information about vector widths for the
> architecture (e.g. from TargetData) and vectorizing directly with a
> particular width in mind.

Thanks for the remark. My initial thinking was that, independent of
auto-vectorization, a general "wide vector strip-mining" transformation
would be worthwhile to have anyway, i.e. a transformation which would
convert wide vector operations such as the above into a loop over
(target-dependent) smaller-width vectors. And as the auto-vectorizer I
am aiming for would initially only be able to vectorize loops with
statically known constant trip counts, I could squash some complexity
by leaving this target-dependent aspect to such a hypothetical
strip-mining pass.

But I understand the limits of this, and based on the feedback I got so
far, I'd rather generate a loop with fixed-size vectors instead. A
command-line option to determine the vectorization factor should do
fine to enable quick bootstrapping, if querying the necessary register
width from TargetData turns out to be problematic.

> Overall, this would be a great GSoC project, and I guarantee you
> would get a lot of interest in this :). Good luck!

Thanks!

--
Andreas
Hi Chris,

On Wed Apr 01 08:18:28 +0200 2009, Chris Lattner wrote:
> On Mar 31, 2009, at 5:27 PM, Andreas Bolka wrote:
>> My goal is to implement the necessary analyses and transformations
>> to turn IR corresponding to such code into IR utilizing vector
>> instructions;
>
> Sounds great. Some important steps:
>
> 1. We need an abstract dependence analysis interface (like we have
>    for AA) and one implementation (probably handling trivial loops
>    like the one above to start with). Many people have talked about
>    this, but no code has gone in yet.

My original plan was to implement a loop dependence analysis only for
non-nested loops, using the "GCD" method of determining interdependent
statements. This limits the array subscript expressions to simple
linear equations of the form x1*i + x0, where i is the induction
variable of the loop. I haven't thought yet about generalising this to
ease future implementation of more complex loop dependence analyses.

> 2. We need some interface that can be implemented by a target machine
>    to describe what the vector capabilities of the machine are.

Initially the vectorization pass would basically only need the SIMD
register width; later, the supported "packing formats" may come in
handy (i.e. the information that 128-bit registers can be used as e.g.
16 x i8, 8 x i16, and 4 x i32).

> 3. We need the transformation code. Starting with a simple loop
>    vectorizer (as opposed to an SLP system) would make sense to me.

I'm also inclined to prefer "classical" Allen & Kennedy-style
auto-vectorization over "vectorization by unrolling"/SLP approaches.
Based on the responses on this list, it seems that this sentiment is
shared by the LLVM community. An SLP pass would certainly be a nice
complement to loop-based auto-vectorization, though.

> Once the basics are in place, it can be feature-crept to support new
> idioms (reductions, alignment analysis, etc.).
> The first cut doesn't need to provide an immediate speedup, but
> should have many testcases that demonstrate the loop forms it can
> handle.

That's exactly my understanding of it. I want to limit this to the most
basic transformation possible, i.e.:

- non-nested loops, having
- a statically known constant trip count, which is
- divisible by the vectorization factor;
- a single basic block as loop body, with
- a single arithmetic instruction, using
- only same-sized vector operands, and
- ignoring alignment issues.

> Vectorizing to extremely wide vectors like this isn't along the lines
> of what we want: we should get something like the original loop,
> unrolled 4 times, doing operations on <4 x i32>'s (for x86/ppc).

Yes, I agree. See also my reply to Stefanus.

>> Auto-vectorization has numerous dependencies on other
>> analysis/transformation passes (loop normalization, constant/scalar
>> expansion, dependence analysis, etc.). A significant part of the
>> proposed project is to evaluate already-available LLVM passes
>> regarding their applicability to support auto-vectorization and to
>> extend/implement what's missing.
>
> Depending on how ambitious you are, you could also do a GSoC project
> just around a) building a dependence analysis interface + some
> *non*-trivial implementations, and b) building some dep analysis
> clients around it (e.g. loop interchange, etc.). This may be a less
> risky and higher-impact project in the short term, if you are
> interested in it.

I understand the importance and necessity of a general loop dependence
interface, but I'm not too familiar with the more complex approaches
(which generally require integer linear programming, if I'm not
mistaken). So I can't really assess whether there's a realistic chance
to build a generally usable non-trivial implementation in the time
available.
But as a simple loop dependence analysis pass will be one of the first
steps in this project anyway, I'll be in a much better position to
re-evaluate that question once this first pass is done. I'd certainly
be willing to modify the project if it turns out that this direction
would be more valuable and/or more realistic to pursue.

> Great! Either approach (building a very simple auto-vec system
> end-to-end, or building some non-trivial dep analysis implementations
> + a client or two) sounds like a very promising GSoC project!

Thanks for the extensive feedback!

--
Andreas
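[Editorial note, not part of the original thread: the GCD dependence
test Andreas proposes for subscripts of the form x1*i + x0 can be
sketched in a few lines of Python; the function name and interface are
illustrative only. For two accesses X[a*i + c1] and X[b*j + c2], the
dependence equation a*i - b*j = c2 - c1 has an integer solution only
if gcd(a, b) divides c2 - c1:]

```python
from math import gcd

def gcd_may_depend(a, c1, b, c2):
    """Conservative GCD test for one-dimensional affine subscripts.

    One access touches X[a*i + c1], the other X[b*j + c2], with i and
    j ranging over loop iterations.  If gcd(a, b) does not divide
    c2 - c1, the equation a*i - b*j = c2 - c1 has no integer solution,
    so the accesses can never overlap and the statement is safe to
    vectorize.  If the gcd does divide, a dependence must be assumed
    conservatively, since loop bounds are ignored.
    """
    g = gcd(a, b)
    if g == 0:               # both strides zero: one fixed element each
        return c1 == c2
    return (c2 - c1) % g == 0

# X[2*i] vs. X[2*i + 1]: even vs. odd elements -- never dependent.
# X[i]   vs. X[i + 1]:   gcd 1 divides everything -- assume dependence.
```

[A vectorizer would run such a test on each write/read pair in the loop
body; for the c[i] = a[i] + b[i] example the three arrays are distinct,
so the loop vectorizes trivially.]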