thr3ads.net - llvm dev - [LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly [Jan 2011]

If this information is useful, please help other people find it:
Share via:

Renato Golin

2011-Jan-08 23:52 UTC

[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

On 8 January 2011 18:27, Tobias Grosser <grosser at fim.uni-passau.de>
wrote:> OK. First of all to agree on a name, we decided to call the Polyhedral
> analysis we develop PoLLy, as in Polly the parrot. ;-) Maybe it was a
> misleading choice?
I never realised... ;) Polly it is!

> In general as I explained I agree that a three stage approach is useful,
> for the reasons you explained, however it is more overhead (and just
> implementation work) than the one we use now. I currently do not have the
> time to implement the proposed approach. In case anybody is interested to
> work on patches, I am happy to support this.
Good. If it's just a matter of time (not design), things can be left
ready for future implementation without breaking the current model. I
thought there was a fundamental flaw with the three-stage design (and
was eager to learn it).

> the only change
> in polly is, that it either generates N scalar instructions per original
> instruction or one vector instruction (if N is the number of loop
iterations
> which is equivalent to the vector width). So vectorization in Polly was
very
> easy to implement and already works reasonable well.
Ok, this comes with another current change in LLVM: OpenCL. I explain.

OpenCL has very large (and odd) vector sizes, that if implemented to
vectorized units (like SSE or NEON), need to be legalised.

Such a pass should be target specific and polly could make use of
that. If polly always generate vector code (instead of reason if the
number of unrolled operations are the same as the current target being
compiled into), the later legalisation pass can deal with the odd
sized vectors and transform into multiples of legal vector + some
surplus of the module as normal instructions.

Also, if the target doesn't have vector units, there could be a
generic (or not) transformation to cpu instructions (if there isn't
one already), so that makes your polly pass completely target
agnostic.

> Why are target specific vectorization passes needed to generate vector
> instructions from Polly? The only target specific information I currently
> see is the vector width, which a generic vectorization pass can obtain from
> the target data information. Could you explain for which features target
> specific vectorization would be needed?
Not target specific, generic vectors. See above.

> I have read this and the look interesting. I suppose they are created out
of
> the box, if a pass generates LLVM-IR vector instructions?
Yup. It's pretty neat. SSE is probably similar, but with NEON, a
pattern-match is done when the variable type is a vector.

So, a multiplication followed by an addition in the right way is
transformed into a multiply-and-add NEON instruction.

An example (in a completely wrong IR, just to make a point):

%a = <4 x i32>
%b = <4 x i32>
%c = <4 x i32>
%mul = mul %b, %c
%acc = add %mul, %a

gets transformed into:

VMLA.I32 q0, q1, q2

Multiplying vectors (of the correct size) gets into VMUL, adding gets
to VADD and so on...

cheers,
--renato

Tobias Grosser

2011-Jan-09 00:07 UTC

head link

[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

On 01/08/2011 06:52 PM, Renato Golin wrote:> On 8 January 2011 18:27, Tobias Grosser<grosser at fim.uni-passau.de>
wrote:
>> OK. First of all to agree on a name, we decided to call the Polyhedral
>> analysis we develop PoLLy, as in Polly the parrot. ;-) Maybe it was a
>> misleading choice?
>
> I never realised... ;) Polly it is!
>
>
>> In general as I explained I agree that a three stage approach is
useful,
>> for the reasons you explained, however it is more overhead (and just
>> implementation work) than the one we use now. I currently do not have
the
>> time to implement the proposed approach. In case anybody is interested
to
>> work on patches, I am happy to support this.
>
> Good. If it's just a matter of time (not design), things can be left
> ready for future implementation without breaking the current model. I
> thought there was a fundamental flaw with the three-stage design (and
> was eager to learn it).Not at all. It was just soo much easier to create vector instructions 
right inside of Polly that I did not start another large project.
>> the only change
>> in polly is, that it either generates N scalar instructions per
original
>> instruction or one vector instruction (if N is the number of loop
iterations
>> which is equivalent to the vector width). So vectorization in Polly was
very
>> easy to implement and already works reasonable well.
>
> Ok, this comes with another current change in LLVM: OpenCL. I explain.
>
> OpenCL has very large (and odd) vector sizes, that if implemented to
> vectorized units (like SSE or NEON), need to be legalised.
>
> Such a pass should be target specific and polly could make use of
> that. If polly always generate vector code (instead of reason if the
> number of unrolled operations are the same as the current target being
> compiled into), the later legalisation pass can deal with the odd
> sized vectors and transform into multiples of legal vector + some
> surplus of the module as normal instructions.
>
> Also, if the target doesn't have vector units, there could be a
> generic (or not) transformation to cpu instructions (if there isn't
> one already), so that makes your polly pass completely target
> agnostic.
I believe LLVM already has lowering of wide LLVM-IR vectors to a set of 
operations of vectors of target specific wide. This is done in the back 
end and works correct for all cases I tested. The problem with this is 
just performance. They fastest loop nest is different on architectures 
with different vector width and size of vector registers. A generic 
polyhedral optimizer will therefore not generate incorrect code, however 
the proposed transformations are wrong. If Polly e.g. always generates 
vector operations to be 32 floats width, this would result in not very 
pleasant code, but might still be lowered to something well performing 
on cpus supporting AVX. Matching the target vector width in our 
heuristics will obviously give the best performance. So to get optimal 
performance Polly needs to take target data into account.

Talking about OpenCL. The lowering you described for the large vector 
instructions sounds reasonable. Optimal code would however probably 
produced by revisiting the whole loop structure and generating one that 
is performance wise optimal for the target architecture.
>> Why are target specific vectorization passes needed to generate vector
>> instructions from Polly? The only target specific information I
currently
>> see is the vector width, which a generic vectorization pass can obtain
from
>> the target data information. Could you explain for which features
target
>> specific vectorization would be needed?
>
> Not target specific, generic vectors. See above.
>
>
>> I have read this and the look interesting. I suppose they are created
out of
>> the box, if a pass generates LLVM-IR vector instructions?
>
> Yup. It's pretty neat. SSE is probably similar, but with NEON, a
> pattern-match is done when the variable type is a vector.
>
> So, a multiplication followed by an addition in the right way is
> transformed into a multiply-and-add NEON instruction.
>
> An example (in a completely wrong IR, just to make a point):
>
> %a =<4 x i32>
> %b =<4 x i32>
> %c =<4 x i32>
> %mul = mul %b, %c
> %acc = add %mul, %a
>
> gets transformed into:
>
> VMLA.I32 q0, q1, q2
>
> Multiplying vectors (of the correct size) gets into VMUL, adding gets
> to VADD and so on...
Yes. I have seen this also for powerpc and hope we can soon use the AVX 
multiply and add instruction too.

Cheers
Tobi

Renato Golin

2011-Jan-09 00:34 UTC

head link

[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

On 9 January 2011 00:07, Tobias Grosser <grosser at fim.uni-passau.de>
wrote:> Matching the target vector width in our heuristics will obviously give the
> best performance. So to get optimal performance Polly needs to take target
> data into account.
Indeed! And even if you lack target information, you won't generate
wrong code. ;)

> Talking about OpenCL. The lowering you described for the large vector
> instructions sounds reasonable. Optimal code would however probably
produced
> by revisiting the whole loop structure and generating one that is
> performance wise optimal for the target architecture.
Yes, and this is an important point in OpenCL for CPUs. If we could
run a sub-pass of Polly (just the vector fiddling) after the
legalization, that would make it much easier for OpenCL
implementations.

However, none of these apply to GPUs, and any pass you run could
completely destroy the semantics for a GPU back-end. The AMD
presentation on the meetings last year expose some of that.

cheers,
--renato

Maybe Matching Threads

Search for more apparently analagous threads

llvm dev - Jan 2011 - [LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

Maybe Matching Threads