thr3ads.net - llvm dev - [LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly [Jan 2011]


Renato Golin

2011-Jan-06 15:59 UTC

[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

On 6 January 2011 15:16, Tobias Grosser <grosser at fim.uni-passau.de> wrote:
>> The main idea is, we separate the transform passes and codegen passes
>> for auto-parallelization and vectorization (Graphite[2] for gcc seems
>> to be taking a similar approach for auto-vectorization).
I agree with Ether.

A two-stage vectorization would allow you to use the simple
loop-unroller already in place to generate vector/mp intrinsics from
them, and if more parallelism is required, use the expensive Poly
framework to skew loops and remove dependencies, so the loop-unroller
and other cheap bits can do their job where they couldn't before.

So, in essence, this is a three-stage job: the optional heavy-duty
Poly analysis, the cheap loop-optimizer and the mp/vector
transformation pass. The best feature of having all three is being
able to choose the level of vectorization you want and to re-use the
current loop analysis in the scheme.
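
To make the loop skewing mentioned above concrete, here is a minimal Python sketch. It is not Polly code; the stencil and the function names are made up for illustration. In the original nest every iteration depends on its upper and left neighbour, so neither loop can run in parallel. After skewing with t = i + j, all points on one wavefront depend only on the previous wavefront, so the inner loop carries no dependence and the cheaper passes could parallelise or vectorise it.

```python
def stencil_sequential(n, m):
    """Original nest: a[i][j] depends on a[i-1][j] and a[i][j-1]."""
    a = [[1] * m for _ in range(n)]
    for i in range(1, n):
        for j in range(1, m):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def stencil_skewed(n, m):
    """Skewed nest: the outer loop walks wavefronts t = i + j."""
    a = [[1] * m for _ in range(n)]
    for t in range(2, n + m - 1):
        lo = max(1, t - (m - 1))   # smallest i with j = t - i <= m - 1
        hi = min(n - 1, t - 1)     # largest i with j = t - i >= 1
        # Compute the whole wavefront from already stored values first and
        # write back afterwards, to mimic all lanes executing at once.
        front = [(i, t - i, a[i - 1][t - i] + a[i][t - i - 1])
                 for i in range(lo, hi + 1)]
        for i, j, value in front:
            a[i][j] = value
    return a
```

Both functions produce the same array; only the iteration order differs.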

> What other types of parallelism are you expecting? We currently support
> thread level parallelism (as in OpenMP) and vector level parallelism (as
> in LLVM-IR vectors). At least for X86 I do not see any reason for
> target specific auto-vectorization as LLVM-IR vectors are lowered
> extremely well to x86 SIMD instructions. I suppose this is the same for
> all CPU targets. I still need to look into GPU targets.
I'd suggest trying to transform sequential instructions into vector
instructions (in the third stage) if that is proven to be correct.

So, when Poly skews a loop, and the loop analysis unrolls it to, say,
4 copies of the same instruction, metadata binding them together can
hint to the third stage to turn them into a vector operation with the
same semantics.
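
A rough Python model of that third stage, under stated assumptions: instructions are plain tuples, the "metadata" is a hypothetical group tag in the last field, and legality is assumed to have been proven by the earlier stages. None of this is an existing LLVM API.

```python
from collections import defaultdict


def merge_unrolled(instrs, width=4):
    """Merge scalar ops that share a group tag into one vector op.

    Each instruction is (opcode, dst, lhs, rhs, group); group is None for
    untagged instructions. Only the opcode and the group size are checked,
    because correctness is assumed to be established beforehand.
    """
    by_group = defaultdict(list)
    for ins in instrs:
        if ins[4] is not None:
            by_group[ins[4]].append(ins)

    result, done = [], set()
    for ins in instrs:
        group = ins[4]
        members = by_group.get(group, [])
        same_op = len({m[0] for m in members}) == 1
        if group is None or len(members) != width or not same_op:
            result.append(ins)                  # stays a scalar instruction
        elif group not in done:
            done.add(group)
            result.append((
                "v" + ins[0],                   # e.g. add -> vadd
                tuple(m[1] for m in members),   # destination lanes
                tuple(m[2] for m in members),   # left operand lanes
                tuple(m[3] for m in members),   # right operand lanes
                group,
            ))
    return result
```

Groups that do not have exactly `width` members, or that mix opcodes, are left untouched as scalar code.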

> LLVM-IR vector instructions however are generic SIMD
> instructions so I do not see any reason to create target specific
> auto vectorizer passes.
If you're assuming the original code is using intrinsics, that is
correct. But if you want to generate the vector code from Poly, then
you need to add that support, too.

ARM also has good vector instruction selection (on Cortex-A* with
NEON), so you also get that for free. ;)

cheers,
--renato

Tobias Grosser

2011-Jan-08 18:27 UTC

On 01/06/2011 10:59 AM, Renato Golin wrote:
> On 6 January 2011 15:16, Tobias Grosser <grosser at fim.uni-passau.de> wrote:
>>> The main idea is, we separate the transform passes and codegen passes
>>> for auto-parallelization and vectorization (Graphite[2] for gcc seems
>>> to be taking a similar approach for auto-vectorization).
>
> I agree with Ether.
>
> A two-stage vectorization would allow you to use the simple
> loop-unroller already in place to generate vector/mp intrinsics from
> them, and if more parallelism is required, use the expensive Poly
> framework to skew loops and remove dependencies, so the loop-unroller
> and other cheap bits can do their job where they couldn't before.
>
> So, in essence, this is a three-stage job: the optional heavy-duty
> Poly analysis, the cheap loop-optimizer and the mp/vector
> transformation pass. The best feature of having all three is being
> able to choose the level of vectorization you want and to re-use the
> current loop analysis in the scheme.
OK. First of all, to agree on a name: we decided to call the polyhedral
analysis we develop PoLLy, as in Polly the parrot. ;-) Maybe it was a
misleading choice?

In general, as I explained, I agree that a three-stage approach is useful
for the reasons you gave. However, it means more overhead (and simply more
implementation work) than the approach we use now. I currently do not have
the time to implement the proposed approach. If anybody is interested in
working on patches, I am happy to support this.
>> What other types of parallelism are you expecting? We currently support
>> thread level parallelism (as in OpenMP) and vector level parallelism (as
>> in LLVM-IR vectors). At least for X86 I do not see any reason for
>> target specific auto-vectorization as LLVM-IR vectors are lowered
>> extremely well to x86 SIMD instructions. I suppose this is the same for
>> all CPU targets. I still need to look into GPU targets.
>
> I'd suggest trying to transform sequential instructions into vector
> instructions (in the third stage) if that is proven to be correct.
>
> So, when Poly skews a loop, and the loop analysis unrolls it to, say,
> 4 copies of the same instruction, metadata binding them together can
> hint to the third stage to turn them into a vector operation with the
> same semantics.
I know, this is the classical approach for vector code generation. The
difference in Polly is that we do not have a loop represented in
LLVM-IR which we would like to vectorize; instead we have a loop body
and its content, which we want to emit as vector code. So instead of
creating the LLVM-IR loop structure, writing metadata, unrolling the loop
and then merging instructions into vector instructions, the only
change in Polly is that it generates either N scalar instructions per
original instruction or one vector instruction (where N is the number of
loop iterations, which is equal to the vector width). So
vectorization in Polly was very easy to implement and already works
reasonably well.
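
The choice described above can be sketched in a few lines of Python. This is an illustration only, not Polly's actual code generator: `emit_statement`, the `i32` element type and the `.0`, `.1`, ... naming of the per-iteration values are assumptions made for the example.

```python
def emit_statement(op, dst, lhs, rhs, trip_count, vector_width):
    """Emit IR-like text for one statement of a loop body.

    If the trip count matches the vector width, a single vector
    instruction covers every iteration; otherwise one scalar
    instruction is emitted per iteration.
    """
    if trip_count == vector_width:
        return ["%{} = {} <{} x i32> %{}, %{}".format(
            dst, op, vector_width, lhs, rhs)]
    return ["%{0}.{4} = {1} i32 %{2}.{4}, %{3}.{4}".format(
        dst, op, lhs, rhs, i) for i in range(trip_count)]
```

For example, `emit_statement("add", "c", "a", "b", 4, 4)` yields one `<4 x i32>` add, while a trip count of 3 with the same width yields three scalar adds.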
>> LLVM-IR vector instructions however are generic SIMD
>> instructions so I do not see any reason to create target specific
>> auto vectorizer passes.
>
> If you're assuming the original code is using intrinsics, that is
> correct. But if you want to generate the vector code from Poly, then
> you need to add that support, too.
Why are target specific vectorization passes needed to generate vector 
instructions from Polly? The only target specific information I 
currently see is the vector width, which a generic vectorization pass 
can obtain from the target data information. Could you explain for which 
features target specific vectorization would be needed?
> ARM also has good vector instruction selection (on Cortex-A* with
> NEON), so you also get that for free. ;)
I have read about these and they look interesting. I suppose they are
created out of the box if a pass generates LLVM-IR vector instructions?
> cheers,
> --renato
Thanks for your comments

Tobi

Renato Golin

2011-Jan-08 23:52 UTC

On 8 January 2011 18:27, Tobias Grosser <grosser at fim.uni-passau.de> wrote:
> OK. First of all, to agree on a name: we decided to call the polyhedral
> analysis we develop PoLLy, as in Polly the parrot. ;-) Maybe it was a
> misleading choice?
I never realised... ;) Polly it is!

> In general, as I explained, I agree that a three-stage approach is useful
> for the reasons you gave. However, it means more overhead (and simply more
> implementation work) than the approach we use now. I currently do not have
> the time to implement the proposed approach. If anybody is interested in
> working on patches, I am happy to support this.
Good. If it's just a matter of time (not design), things can be left
ready for future implementation without breaking the current model. I
thought there was a fundamental flaw with the three-stage design (and
was eager to learn it).

> the only change
> in Polly is that it generates either N scalar instructions per original
> instruction or one vector instruction (where N is the number of loop
> iterations, which is equal to the vector width). So vectorization in
> Polly was very easy to implement and already works reasonably well.
OK, this ties in with another current change in LLVM: OpenCL. Let me explain.

OpenCL has very large (and odd) vector sizes that, if mapped onto
vector units (like SSE or NEON), need to be legalised.

Such a pass should be target specific and Polly could make use of
that. If Polly always generates vector code (instead of reasoning about
whether the number of unrolled operations matches the vector width of
the target being compiled for), the later legalisation pass can deal
with the odd-sized vectors and transform them into multiples of the
legal vector width plus the remainder as normal instructions.
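
A small Python sketch of that splitting step, assuming the legal vector width is simply passed in as a number (in a real compiler it would come from the target description). The helper names are hypothetical, and the list slices only stand in for vector registers.

```python
def legalise(length, legal_width):
    """Split `length` lanes into legal-width chunks plus a scalar tail.

    Returns (chunks, scalars): chunks is a list of (start, stop) lane
    ranges, each exactly legal_width wide; scalars lists the leftover
    lane indices that have to be handled as normal instructions.
    """
    full = length // legal_width
    chunks = [(k * legal_width, (k + 1) * legal_width) for k in range(full)]
    scalars = list(range(full * legal_width, length))
    return chunks, scalars


def add_legalised(a, b, legal_width):
    """Element-wise add of two equal-length lists, done chunk by chunk."""
    chunks, scalars = legalise(len(a), legal_width)
    out = [0] * len(a)
    for start, stop in chunks:
        # Stands in for one legal-width vector add.
        out[start:stop] = [x + y for x, y in zip(a[start:stop], b[start:stop])]
    for i in scalars:
        # Surplus lanes become ordinary scalar adds.
        out[i] = a[i] + b[i]
    return out
```

With a 10-element vector and a legal width of 4, this gives two vector operations and two scalar ones.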

Also, if the target doesn't have vector units, there could be a
generic (or not) transformation to scalar CPU instructions (if there
isn't one already), which makes your Polly pass completely target
agnostic.

> Why are target specific vectorization passes needed to generate vector
> instructions from Polly? The only target specific information I currently
> see is the vector width, which a generic vectorization pass can obtain from
> the target data information. Could you explain for which features target
> specific vectorization would be needed?
Not target specific, generic vectors. See above.

> I have read about these and they look interesting. I suppose they are
> created out of the box if a pass generates LLVM-IR vector instructions?
Yup. It's pretty neat. SSE is probably similar, but with NEON, a
pattern-match is done when the variable type is a vector.

So, a multiplication whose result feeds an addition is transformed
into a single multiply-accumulate NEON instruction.

An example (a simplified IR fragment, just to make a point):

; %a, %b and %c are <4 x i32> values
%mul = mul <4 x i32> %b, %c
%acc = add <4 x i32> %mul, %a

gets transformed into:

VMLA.I32 q0, q1, q2

Multiplying vectors (of the correct size) becomes VMUL, adding becomes
VADD, and so on...
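
A toy Python model of that kind of pattern match. It is only a sketch: real instruction selection works on a DAG rather than on adjacent list entries, the tuple format is invented here, and only `mul` and `add` are handled. The fused form keeps the accumulator as a separate operand, whereas the real VMLA accumulates into its destination register.

```python
def select_instructions(ops):
    """Map vector mul/add ops to NEON-like mnemonics, fusing mul+add.

    ops is a list of (opcode, dst, src1, src2) with opcode "mul" or "add".
    A mul is fused only if the next op is an add that uses its result and
    nothing else uses that result.
    """
    uses = {}
    for _, _, s1, s2 in ops:
        for s in (s1, s2):
            uses[s] = uses.get(s, 0) + 1

    names = {"mul": "vmul", "add": "vadd"}
    out, i = [], 0
    while i < len(ops):
        op, dst, s1, s2 = ops[i]
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        fusable = (op == "mul" and nxt is not None and nxt[0] == "add"
                   and dst in nxt[2:] and uses.get(dst, 0) == 1)
        if fusable:
            acc = nxt[3] if nxt[2] == dst else nxt[2]
            out.append(("vmla", nxt[1], acc, s1, s2))  # acc + s1 * s2
            i += 2
        else:
            out.append((names[op], dst, s1, s2))
            i += 1
    return out
```

Applied to the two-instruction fragment above, the mul and the add collapse into a single `vmla` entry.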

cheers,
--renato
