ether zhhb
2011-Jan-06 08:38 UTC
[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
Hi,

I just had a detailed look at the code of Polly[1]; it seems that Polly is starting to support some basic auto-parallelization. I have some ideas to improve the current auto-vectorization and parallelization approach in Polly.

The main idea is that we separate the transform passes and the codegen passes for auto-parallelization and vectorization (Graphite[2] for GCC seems to take a similar approach for auto-vectorization). That means Polly (or another similar framework) should perform the necessary code transformations, then just generate sequential code and provide the necessary parallelism information (this information could be encoded as metadata, just like TBAA); later passes can then generate parallel code from that parallelism information.

The main benefits of separating the transform passes and the codegen passes are:

1. Decouple the autopar framework from the various parallel runtime environments, so we can keep both the autopar framework and the code generation pass for a specific parallel runtime environment simple, and we can support more parallel runtime environments easily.

2. Allow some generic parallelism information to live outside the specific autopar framework, so this information can benefit more passes in LLVM. For example, the X86 and PTX backends could use this information to perform target-specific auto-vectorization.

Implementation considerations:

We may define some kind of generic "parallelism analysis" or a generic version of the "LoopDependenceAnalysis" interface, just like AliasAnalysis, or we can also encode the parallelism information as metadata. Combining the two should be fine, too.

Any comments are appreciated.

best regards
ether

[1] http://wiki.llvm.org/Polly
[2] http://gcc.gnu.org/wiki/Graphite/Parallelization
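To make the metadata idea concrete, here is a minimal C++ sketch against LLVM's IR API. The metadata kind name "polly.parallel" and the helper function are hypothetical, chosen only for illustration; the proposal does not fix any particular encoding.

#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"

using namespace llvm;

// After the transform pass has proven a loop parallel, tag the branch
// forming its backedge so later passes can pick up the fact without
// redoing the dependence analysis.
void markLoopParallel(BranchInst *Backedge) {
  LLVMContext &Ctx = Backedge->getContext();
  Metadata *Ops[] = {MDString::get(Ctx, "polly.parallel")};
  Backedge->setMetadata("polly.parallel", MDNode::get(Ctx, Ops));
}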
Tobias Grosser
2011-Jan-06 15:16 UTC
[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
On 01/06/2011 03:38 AM, ether zhhb wrote:
> Hi,
>
> I just had a detailed look at the code of Polly[1]; it seems that Polly
> is starting to support some basic auto-parallelization.

This is true. However, it is still work in progress. I hope we can soon show some interesting results.

> I have some ideas to improve the current auto-vectorization
> and parallelization approach in Polly.
>
> The main idea is that we separate the transform passes and the codegen passes
> for auto-parallelization and vectorization (Graphite[2] for GCC seems
> to take a similar approach for auto-vectorization).

This is just historical. Sebastian is planning to couple them a lot closer. Graphite and auto-parallelization/vectorization are not yet playing well together, even if Sebastian made some huge progress with the Graphite<->Autovec interaction.

> That means Polly (or another similar framework) should perform the necessary
> code transformations, then just generate sequential code and provide the
> necessary parallelism information (this information could be encoded
> as metadata, just like TBAA); later passes can then generate
> parallel code from that parallelism information.

I also thought about this approach.

> The main benefits of separating the transform passes and the codegen passes are:
> 1. Decouple the autopar framework from the various parallel runtime
> environments, so we can keep both the autopar framework and the code generation
> pass for a specific parallel runtime environment simple, and we can
> support more parallel runtime environments easily.

Yes, as soon as we start to support more targets we should move some of the logic out into separate functions/files/passes. This could be done either inside Polly or as a generic LLVM pass, depending on which is more suitable.

> 2. Allow some generic parallelism information to live outside the specific
> autopar framework, so this information can benefit more passes in
> LLVM. For example, the X86 and PTX backends could use this information
> to perform target-specific auto-vectorization.

What other types of parallelism are you expecting? We currently support thread level parallelism (as in OpenMP) and vector level parallelism (as in LLVM-IR vectors). At least for X86 I do not see any reason for target-specific auto-vectorization, as LLVM-IR vectors are lowered extremely well to x86 SIMD instructions. I suppose this is the same for all CPU targets. I still need to look into GPU targets.

> Implementation considerations:
>
> We may define some kind of generic "parallelism analysis" or a
> generic version of the "LoopDependenceAnalysis" interface, just like
> AliasAnalysis, or we can also encode the parallelism information as
> metadata. Combining the two should be fine, too.

Thanks ether for starting to reason about this. My current plan is as follows:

Create stable support for OpenMP and LLVM-IR vector generation inside Polly. This is not very difficult, as the needed analysis can easily be performed in the polyhedral framework. Code generation is not difficult either. SIMD vector support is almost complete and OpenMP code generation is also half way through. Based on that infrastructure we can take advantage of both thread level and SIMD level parallelism, which covers the two most common targets.

The next step will be to evaluate loop transformations that enable thread and vector level parallelism, as is done e.g. in Pluto[1]. I am pretty confident we can show good improvements. Stay tuned. ;-)

Regarding your proposal, I hope to move some of the code into more generic frameworks. I am thinking e.g. about introducing OpenMP intrinsics to support different OpenMP libraries like libgomp and mpc[2]. LLVM-IR vector instructions, however, are generic SIMD instructions, so I do not see any reason to create target-specific auto-vectorizer passes.

The remaining question is how/where to generate parallel/vector code. The current approach is to do this directly at Polly code generation time; the solution you propose would be to generate sequential code inside Polly, annotate it with metadata, reparse it, and finally create OpenMP/vector code.

The second approach would have the advantage that multiple LLVM passes can use the information and generate code from it, and that several analyses could create the needed metadata. It has, however, the drawback that instead of just doing code generation once after Polly, we do sequential code generation -> reparsing/analysis -> parallel code generation. Furthermore, the infrastructure needs to pass along all the information needed for efficient parallelization, which is at least the access strides, the alignment, and the privatized variables. Recomputing this information using scalar evolution might be difficult, as Polly may introduce loop IVs using e.g. ceil/floor divisions.

As Polly - based on the current approach - will soon support target-agnostic autovectorization and OpenMP code generation, I plan to now focus on polyhedral loop nest optimizations that enable efficient vectorization and thread level parallelism. As these transformations are performed on the abstract polyhedral description, no further changes to the code generation are needed.

In case a larger framework is shown to be useful, I will definitely support this. For the moment, however, I am fine with the existing lightweight approach.

Cheers
Tobi

[1] http://pluto-compiler.sourceforge.net/
[2] http://mpc.sf.net
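For completeness, here is a minimal sketch of the consumer side of the metadata approach discussed above, again assuming the hypothetical "polly.parallel" tag from the earlier sketch: a later pass that wants to emit OpenMP or vector code would first check the loop's backedge for the annotation instead of redoing the dependence analysis.

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Returns true if the loop's backedge branch carries the (hypothetical)
// "polly.parallel" annotation left behind by the transform pass.
bool isMarkedParallel(const Loop *L) {
  BasicBlock *Latch = L->getLoopLatch();
  if (!Latch)
    return false;
  if (const BranchInst *BI = dyn_cast<BranchInst>(Latch->getTerminator()))
    return BI->getMetadata("polly.parallel") != nullptr;
  return false;
}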
Renato Golin
2011-Jan-06 15:59 UTC
[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
On 6 January 2011 15:16, Tobias Grosser <grosser at fim.uni-passau.de> wrote:
>> The main idea is that we separate the transform passes and the codegen passes
>> for auto-parallelization and vectorization (Graphite[2] for GCC seems
>> to take a similar approach for auto-vectorization).

I agree with Ether. A two-stage vectorization would allow you to use the simple loop-unroller already in place to generate vector/MP intrinsics, and, if more parallelism is required, use the expensive Polly framework to skew loops and remove dependencies, so the loop-unroller and other cheap bits can do their job where they couldn't before.

So, in essence, this is a three-stage job: the optional heavy-duty Polly analysis, the cheap loop optimizer, and the MP/vector transformation pass. The best feature of having the three is being able to choose the level of vectorization you want and to reuse the current loop analyses in the scheme.

> What other types of parallelism are you expecting? We currently support
> thread level parallelism (as in OpenMP) and vector level parallelism (as
> in LLVM-IR vectors). At least for X86 I do not see any reason for
> target-specific auto-vectorization, as LLVM-IR vectors are lowered
> extremely well to x86 SIMD instructions. I suppose this is the same for
> all CPU targets. I still need to look into GPU targets.

I'd suggest trying to transform sequential instructions into vector instructions (in the third stage) if proven to be correct. So, when Polly skews a loop and the loop analysis unrolls it to, say, 4 calls to the same instruction, metadata binding them together can hint the third stage to turn that into a vector operation with the same semantics.

> LLVM-IR vector instructions however are generic SIMD
> instructions so I do not see any reason to create target-specific
> auto-vectorizer passes.

If you're assuming the original code is using intrinsics, that is correct. But if you want to generate the vector code from Polly, then you need to add that support, too.

ARM also has good vector instruction selection (on Cortex-A* with NEON), so you also get that for free. ;)

cheers,
--renato
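A minimal sketch of this third-stage idea, under the assumption that an earlier pass has already tagged the unrolled, independent scalar operations and collected them (the helper name and the way they are passed in are illustrative only): the operands of the scalar fadds are gathered into vector values and a single vector fadd is emitted in their place.

#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Combine N scalar float adds (proven independent and tagged by an
// earlier pass) into one vector fadd. The original scalar instructions
// would afterwards be replaced by extractelements from the result.
Value *combineIntoVectorFAdd(ArrayRef<Instruction *> ScalarAdds,
                             IRBuilder<> &B) {
  auto *VecTy = FixedVectorType::get(B.getFloatTy(), ScalarAdds.size());
  Value *LHS = UndefValue::get(VecTy);
  Value *RHS = UndefValue::get(VecTy);
  for (unsigned i = 0, e = ScalarAdds.size(); i != e; ++i) {
    LHS = B.CreateInsertElement(LHS, ScalarAdds[i]->getOperand(0), B.getInt32(i));
    RHS = B.CreateInsertElement(RHS, ScalarAdds[i]->getOperand(1), B.getInt32(i));
  }
  return B.CreateFAdd(LHS, RHS);
}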
ether zhhb
2011-Jan-07 05:36 UTC
[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
Hi Tobi,

>> 2. Allow some generic parallelism information to live outside the specific
>> autopar framework, so this information can benefit more passes in
>> LLVM. For example, the X86 and PTX backends could use this information
>> to perform target-specific auto-vectorization.
>
> What other types of parallelism are you expecting? We currently support
> thread level parallelism (as in OpenMP) and vector level parallelism (as in
> LLVM-IR vectors). At least for X86 I do not see any reason for
> target-specific auto-vectorization, as LLVM-IR vectors are lowered extremely
> well to x86 SIMD instructions. I suppose this is the same for all CPU
> targets. I still need to look into GPU targets.

I just think that the vector units in different targets may have different widths, so the best unroll count of a loop for vectorization is not known in the high level optimization passes.

> It has, however, the drawback that instead of just doing code generation once
> after Polly, we do sequential code generation -> reparsing/analysis ->
> parallel code generation. Furthermore, the infrastructure needs to pass along
> all the information needed for efficient parallelization, which is at least
> the access strides, the alignment, and the privatized variables. Recomputing
> this information using scalar evolution might be difficult, as Polly may
> introduce loop IVs using e.g. ceil/floor divisions.

To overcome this, we can encode this kind of "hard to recover" information as metadata while generating the sequential code. All the later "Polyhedral Parallelism Analysis" pass then needs to do is read this information from the metadata and recompute the other information, which is easy to recover. So the process becomes:

sequential code generation and metadata annotation -> read metadata (and perform some cheap reparsing/analysis) -> parallel code generation

The bigger picture is:

1. Define a common interface for "Parallelism Analysis" or "LoopDependenceAnalysis", just like AliasAnalysis.

2. Then we can have different implementations of the Parallelism Analysis. For example, we may have a "SCEVParallelismAnalysis", which computes the parallelism information based on SCEV, and we can also have a "PolyhedralParallelismAnalysis", which reads the "hard to recover" information from metadata and recomputes the cheap information, then provides all of it via the common "Parallelism Analysis" interface.

3. The auto-vectorization and parallelization codegen passes can just ask the common "Parallelism Analysis" interface for the necessary information.

The new approach may also make the current work for OpenMP support easier. Instead of generating the subfunction directly from the clast and inserting the new function in a region pass (it seems that we can only insert a new function in a ModulePass or CallGraphSCC pass), we could extract the body of the parallel for loop into a new function with the existing CodeExtractor in LLVM.

best regards
ether
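To illustrate the interface ether proposes, here is a minimal C++ sketch. All names are hypothetical (nothing like this exists in LLVM as of this thread); an SCEV-based implementation and a polyhedral one reading the Polly-generated metadata would subclass it, and the codegen passes would query only the interface.

#include "llvm/Analysis/LoopInfo.h"

namespace llvm {

// Hypothetical common interface, modeled loosely on how AliasAnalysis
// hides its different implementations behind one query API.
class ParallelismAnalysis {
public:
  virtual ~ParallelismAnalysis() {}

  // True if the iterations of L can run in any order (no loop-carried
  // dependences and no unprivatized shared state).
  virtual bool isParallel(const Loop *L) = 0;

  // Number of consecutive iterations that may safely be merged into one
  // vector operation; 1 means "do not vectorize".
  virtual unsigned getVectorizableWidth(const Loop *L) = 0;
};

} // end namespace llvm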