ether zhhb
2011-Jan-06 08:38 UTC
[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
Hi,

I just had a detailed look at the code of Polly[1]; it seems that Polly is starting to support some basic auto-parallelization. I have some ideas to improve the current auto-vectorization and parallelization approach in Polly.

The main idea is that we separate the transform passes and the codegen passes for auto-parallelization and vectorization (Graphite[2] for gcc seems to take a similar approach for auto-vectorization). That means Polly (or another similar framework) should perform the necessary code transformations, then just generate sequential code and provide the necessary parallelism information (this information could be encoded as metadata, just like TBAA), so that later passes can generate the parallel code from that parallelism information.

The main benefits of separating the transform passes and the codegen passes are:

1. Decouple the autopar framework from the various parallel runtime environments, so we can keep both the autopar framework and the code generation pass for a specific parallel runtime environment simple. And we can support more parallel runtime environments easily.

2. Allow some generic parallelism information to live outside the specific autopar framework, so this information can benefit more passes in LLVM. For example, the X86 and PTX backends could use this information to perform target-specific auto-vectorization.

Implementation consideration:

We may define some kind of generic "parallelism analysis" or a generic version of the "LoopDependenceAnalysis" interface, just like AliasAnalysis, or we can also encode the parallelism information as metadata (see the sketch below). Combining both should be OK, too.

Any comments are appreciated.

best regards
ether

[1] http://wiki.llvm.org/Polly
[2] http://gcc.gnu.org/wiki/Graphite/Parallelization
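As a rough illustration of the metadata idea above: a transform pass could tag a loop it has proven parallel, and a later codegen pass could query the tag instead of redoing the analysis. The sketch below is hypothetical (the metadata kind "polly.parallel.loop" and the helper names are made up for illustration) and assumes a recent LLVM C++ API; include paths and signatures differ in older trees.

  // Sketch only: "polly.parallel.loop" and these helpers are illustrative,
  // not an existing Polly/LLVM interface.
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Metadata.h"

  using namespace llvm;

  // Transform side: after proving the loop's iterations independent, tag the
  // loop's backedge branch so later passes can see the result directly.
  void markLoopParallel(BranchInst *LoopBranch) {
    LLVMContext &Ctx = LoopBranch->getContext();
    Metadata *Str = MDString::get(Ctx, "polly.parallel.loop");
    LoopBranch->setMetadata("polly.parallel.loop", MDNode::get(Ctx, Str));
  }

  // Codegen side: an OpenMP/vector codegen pass only has to query the tag.
  bool isLoopMarkedParallel(const BranchInst *LoopBranch) {
    return LoopBranch->getMetadata("polly.parallel.loop") != nullptr;
  }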
Tobias Grosser
2011-Jan-06 15:16 UTC
[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
On 01/06/2011 03:38 AM, ether zhhb wrote:
> Hi,
>
> I just had a detailed look at the code of Polly[1]; it seems that Polly
> is starting to support some basic auto-parallelization.

This is true. However, it is still work in progress. I hope we can soon show some interesting results.

> I have some ideas to improve the current auto-vectorization
> and parallelization approach in Polly.
>
> The main idea is that we separate the transform passes and codegen passes
> for auto-parallelization and vectorization (Graphite[2] for gcc seems
> to take a similar approach for auto-vectorization).

This is just historical. Sebastian is planning to couple them a lot closer. Graphite and auto parallelization/vectorization are not yet playing well together, even if Sebastian made some huge progress with Graphite<->Autovec interaction.

> That means Polly (or another similar framework) should perform the necessary
> code transformations, then just generate sequential code and provide the
> necessary parallelism information (this information could be encoded
> as metadata, just like TBAA), so that later passes can generate the
> parallel code from that parallelism information.

I also thought about this approach.

> The main benefits of separating the transform passes and the codegen passes are:
> 1. Decouple the autopar framework from the various parallel runtime
> environments, so we can keep both the autopar framework and the code generation
> pass for a specific parallel runtime environment simple. And we can
> support more parallel runtime environments easily.

Yes, as soon as we start to support more targets we should move some of the logic out into separate functions/files/passes. This could be done either inside Polly or as a generic LLVM pass, depending on which is more suitable.

> 2. Allow some generic parallelism information to live outside the specific
> autopar framework, so this information can benefit more passes in
> llvm. For example, the X86 and PTX backends could use this information
> to perform target-specific auto-vectorization.

What other types of parallelism are you expecting? We currently support thread level parallelism (as in OpenMP) and vector level parallelism (as in LLVM-IR vectors). At least for X86 I do not see any reason for target specific auto-vectorization, as LLVM-IR vectors are lowered extremely well to x86 SIMD instructions. I suppose this is the same for all CPU targets. I still need to look into GPU targets.

> Implementation consideration:
>
> We may define some kind of generic "parallelism analysis" or a generic
> version of the "LoopDependenceAnalysis" interface, just like
> AliasAnalysis, or we can also encode the parallelism information as
> metadata. Combining both should be OK, too.

Thanks ether for starting to reason about this. My current plan is as follows:

Create stable support for OpenMP and LLVM-IR vector generation inside Polly. This is not very difficult, as the needed analyses can easily be performed in the polyhedral framework. Code generation is not difficult either. SIMD vector support is almost complete and OpenMP code generation is also half way through. Based on that infrastructure we can take advantage of both thread level and SIMD level parallelism, which covers the two most common targets.

The next step will be to evaluate loop transformations that enable thread and vector level parallelism, as is done e.g. in pluto[1]. I am pretty confident we can show good improvements. Stay tuned. ;-)

Regarding your proposal, I hope to move some of the code into more generic frameworks. I am thinking e.g. about introducing OpenMP intrinsics to support different OpenMP libraries like libgomp and mpc[2]. LLVM-IR vector instructions, however, are generic SIMD instructions, so I do not see any reason to create target specific auto vectorizer passes.

The remaining question is how/where to generate parallel/vector code. The current approach is to do this directly at Polly code generation time; the solution you proposed would be to generate sequential code inside Polly, annotate it with metadata, reparse it and finally create OpenMP/vector code.

The second approach would have the advantage that multiple LLVM passes can use the information and generate code from it, and that several analyses could create the needed metadata. It has however the drawback that instead of just doing code generation once after Polly, we do sequential code generation -> reparsing/analysis -> parallel code generation. Furthermore, the infrastructure needs to pass along all the information needed for efficient parallelisation, which is at least the access strides, the alignment and the privatized variables. Recomputing this information using scalar evolution might be difficult, as Polly may introduce loop ivs using e.g. ceil/floor divisions.

As Polly - based on the current approach - will soon support target agnostic autovectorization and OpenMP code generation, I plan to now focus on polyhedral loop nest optimizations that enable efficient vectorization and thread level parallelism. As these transformations are performed on the abstract polyhedral description, no further changes to the code generation are needed.

In case a larger framework is shown to be useful, I will definitely support this. For the moment however I am fine with the existing lightweight approach.

Cheers
Tobi

[1] http://pluto-compiler.sourceforge.net/
[2] http://mpc.sf.net
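To make the point about generic LLVM-IR vector instructions concrete, here is a minimal sketch of target agnostic vector code generation: it emits a <4 x float> add with IRBuilder and leaves the lowering (e.g. to addps on x86 or to NEON on ARM) to the backend. The function name is made up, a recent LLVM C++ API is assumed, and alignment handling is omitted.

  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/IRBuilder.h"

  using namespace llvm;

  // Emit dst[0..3] = a[0..3] + b[0..3] as one generic vector operation.
  Value *emitVectorAdd(IRBuilder<> &B, Value *APtr, Value *BPtr, Value *DstPtr) {
    Type *VecTy = FixedVectorType::get(B.getFloatTy(), 4);
    Value *VA = B.CreateLoad(VecTy, APtr, "a.vec");
    Value *VB = B.CreateLoad(VecTy, BPtr, "b.vec");
    Value *Sum = B.CreateFAdd(VA, VB, "sum.vec");
    B.CreateStore(Sum, DstPtr);
    return Sum;
  }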
Renato Golin
2011-Jan-06 15:59 UTC
[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
On 6 January 2011 15:16, Tobias Grosser <grosser at fim.uni-passau.de> wrote:
>> The main idea is that we separate the transform passes and codegen passes
>> for auto-parallelization and vectorization (Graphite[2] for gcc seems
>> to take a similar approach for auto-vectorization).

I agree with Ether. A two-stage vectorization would allow you to use the simple loop-unroller already in place to generate vector/mp intrinsics, and, if more parallelism is required, use the expensive Poly framework to skew loops and remove dependencies, so the loop-unroller and other cheap bits can do their job where they couldn't before.

So, in essence, this is a three-stage job: the optional heavy-duty Poly analysis, the cheap loop-optimizer, and the mp/vector transformation pass. The best feature of having the three is being able to choose the level of vectorization you want and to re-use the current loop analysis in the scheme.

> What other types of parallelism are you expecting? We currently support
> thread level parallelism (as in OpenMP) and vector level parallelism (as
> in LLVM-IR vectors). At least for X86 I do not see any reason for
> target specific auto-vectorization as LLVM-IR vectors are lowered
> extremely well to x86 SIMD instructions. I suppose this is the same for
> all CPU targets. I still need to look into GPU targets.

I'd suggest trying to transform sequential instructions into vector instructions (in the third stage) if proven to be correct. So, when Poly skews a loop, and the loop analysis unrolls it to, say, 4 calls to the same instruction, metadata binding them together can hint the third stage to turn that into a vector operation with the same semantics (see the sketch after this message).

> LLVM-IR vector instructions however are generic SIMD
> instructions so I do not see any reason to create target specific
> auto vectorizer passes.

If you're assuming the original code is using intrinsics, that is correct. But if you want to generate the vector code from Poly, then you need to add that support, too.

ARM also has good vector instruction selection (on Cortex-A* with NEON), so you also get that for free. ;)

cheers,
--renato
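A hypothetical sketch of the third stage described above: four scalar fadds produced by unrolling (assumed to have been marked as a group by earlier-stage metadata, which is not shown here) are replaced by one <4 x float> fadd with the same semantics. All names are illustrative, a recent LLVM C++ API is assumed, and the sketch further assumes the scalar results are only used after the last instruction of the group (a real pass would have to schedule the group first).

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Instructions.h"
  #include <array>

  using namespace llvm;

  // Scalars[i] computes lhs[i] + rhs[i]; all four live in the same basic
  // block, listed in program order.
  void widenFAddGroup(std::array<BinaryOperator *, 4> Scalars) {
    IRBuilder<> B(Scalars[3]); // insert just before the last member of the group
    Type *VecTy = FixedVectorType::get(Scalars[0]->getType(), 4);

    // Pack the scalar operands into two vectors.
    Value *LHS = UndefValue::get(VecTy), *RHS = UndefValue::get(VecTy);
    for (unsigned i = 0; i < 4; ++i) {
      LHS = B.CreateInsertElement(LHS, Scalars[i]->getOperand(0), i);
      RHS = B.CreateInsertElement(RHS, Scalars[i]->getOperand(1), i);
    }

    // One vector operation replaces the four scalar ones.
    Value *Sum = B.CreateFAdd(LHS, RHS, "sum.vec");

    // Rewire users of each scalar result to the matching vector lane.
    for (unsigned i = 0; i < 4; ++i) {
      Scalars[i]->replaceAllUsesWith(B.CreateExtractElement(Sum, i));
      Scalars[i]->eraseFromParent();
    }
  }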
ether zhhb
2011-Jan-07 05:36 UTC
[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
Hi Tobi,

>> 2. Allow some generic parallelism information to live outside the specific
>> autopar framework, so this information can benefit more passes in
>> llvm. For example, the X86 and PTX backends could use this information
>> to perform target-specific auto-vectorization.
>
> What other types of parallelism are you expecting? We currently support
> thread level parallelism (as in OpenMP) and vector level parallelism (as in
> LLVM-IR vectors). At least for X86 I do not see any reason for
> target specific auto-vectorization as LLVM-IR vectors are lowered extremely
> well to x86 SIMD instructions. I suppose this is the same for all CPU
> targets. I still need to look into GPU targets.

I just think the vector units in different targets may have different widths, so the best unroll count of a loop for vectorization is not known in the high level optimization passes.

> It has however the drawback that instead of just doing code generation once
> after Polly, we do sequential code generation -> reparsing/analysis ->
> parallel code generation. Furthermore, the infrastructure needs to pass along
> all the information needed for efficient parallelisation, which is at least
> the access strides, the alignment and the privatized variables. Recomputing
> this information using scalar evolution might be difficult, as Polly may
> introduce loop ivs using e.g. ceil/floor divisions.

To overcome this, we can encode this kind of "hard to recover" information as metadata while generating the sequential code, and all the later "Polyhedral Parallelism Analysis" pass needs to do is read this information from the metadata and re-derive the information that is easy to recover. So the process becomes:

sequential code generation and metadata annotation -> read metadata (and perform some cheap reparsing/analysis) -> parallel code generation

The bigger picture is:

1. Define a common interface for "Parallelism Analysis" or "LoopDependenceAnalysis", just like AliasAnalysis.

2. Then we can have different implementations of the Parallelism Analysis. For example, we may have a "SCEVParallelismAnalysis", which computes the parallelism information based on SCEV, and we can also have a "PolyhedralParallelismAnalysis", which reads the "hard to recover" information from metadata and recomputes the cheap information, then provides this information via the common "Parallelism Analysis" interface.

3. The auto-vectorization and parallelization codegen passes can then just ask the common "Parallelism Analysis" interface for the necessary information (a rough sketch of such an interface follows after this message).

The new approach may also make the current work for OpenMP support easier: instead of generating the subfunction directly from the clast and inserting a new function in a region pass (it seems that we can only insert new functions in a ModulePass or CallGraphSCC pass), we can extract the body of the parallel for loop into a new function with the existing CodeExtractor in LLVM.

best regards
ether
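A rough, hypothetical sketch of the common interface described in points 1-3 above; none of these classes exist in LLVM or Polly, and all names are made up for illustration. It only shows how the codegen passes could stay agnostic of whether the facts come from SCEV or from metadata left behind by the polyhedral transform. A recent LLVM source tree layout is assumed for the include path.

  #include "llvm/Analysis/LoopInfo.h"

  using llvm::Loop;

  // Facts a parallel/vector codegen pass needs about one loop.
  struct LoopParallelismInfo {
    bool IterationsIndependent = false; // safe to run iterations concurrently
    int AccessStride = 0;               // 0 = unknown or irregular
    unsigned KnownAlignmentBytes = 0;   // 0 = unknown
  };

  // Common query interface, analogous in spirit to AliasAnalysis.
  class ParallelismAnalysis {
  public:
    virtual ~ParallelismAnalysis() = default;
    virtual LoopParallelismInfo getInfo(const Loop &L) = 0;
  };

  // Implementation 1: recompute everything from scalar evolution.
  class SCEVParallelismAnalysis : public ParallelismAnalysis {
  public:
    LoopParallelismInfo getInfo(const Loop &L) override {
      // A real implementation would query ScalarEvolution here.
      return LoopParallelismInfo();
    }
  };

  // Implementation 2: read the "hard to recover" facts from metadata attached
  // by the polyhedral transform and recompute only the cheap parts.
  class PolyhedralParallelismAnalysis : public ParallelismAnalysis {
  public:
    LoopParallelismInfo getInfo(const Loop &L) override {
      // A real implementation would read the annotation metadata here.
      return LoopParallelismInfo();
    }
  };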