Renato Golin
2011-Jan-09 00:34 UTC
[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
On 9 January 2011 00:07, Tobias Grosser <grosser at fim.uni-passau.de> wrote:
> Matching the target vector width in our heuristics will obviously give the
> best performance. So to get optimal performance Polly needs to take target
> data into account.

Indeed! And even if you lack target information, you won't generate
wrong code. ;)

> Talking about OpenCL. The lowering you described for the large vector
> instructions sounds reasonable. Optimal code would however probably be
> produced by revisiting the whole loop structure and generating one that is
> performance-wise optimal for the target architecture.

Yes, and this is an important point in OpenCL for CPUs. If we could
run a sub-pass of Polly (just the vector fiddling) after the
legalization, that would make it much easier for OpenCL
implementations.

However, none of these apply to GPUs, and any pass you run could
completely destroy the semantics for a GPU back-end. The AMD
presentation at the meeting last year exposes some of that.

cheers,
--renato
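As a rough sketch of what that lowering amounts to (assuming a 128-bit SIMD
target; the types and function names here are illustrative, not taken from
Polly or any OpenCL implementation), an over-wide vector operation is split
into native-width pieces:

  /* GCC/Clang vector extensions; float8 is the source-level width, */
  /* float4 the assumed native width of the target.                 */
  typedef float float4 __attribute__((vector_size(16)));
  typedef float float8 __attribute__((vector_size(32)));

  /* Written against the wide type...                               */
  float8 add_wide(float8 a, float8 b) {
      return a + b;
  }

  /* ...legalization effectively turns it into two native-width adds. */
  void add_lowered(const float4 a[2], const float4 b[2], float4 out[2]) {
      out[0] = a[0] + b[0];
      out[1] = a[1] + b[1];
  }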
Tobias Grosser
2011-Jan-09 00:41 UTC
[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
On 01/08/2011 07:34 PM, Renato Golin wrote:
> On 9 January 2011 00:07, Tobias Grosser <grosser at fim.uni-passau.de> wrote:
>> Matching the target vector width in our heuristics will obviously give the
>> best performance. So to get optimal performance Polly needs to take target
>> data into account.
>
> Indeed! And even if you lack target information, you won't generate
> wrong code. ;)
>
>> Talking about OpenCL. The lowering you described for the large vector
>> instructions sounds reasonable. Optimal code would however probably be
>> produced by revisiting the whole loop structure and generating one that is
>> performance-wise optimal for the target architecture.
>
> Yes, and this is an important point in OpenCL for CPUs. If we could
> run a sub-pass of Polly (just the vector fiddling) after the
> legalization, that would make it much easier for OpenCL
> implementations.

I do not get this one. Why would you use just a part of Polly? Was I
wrong in assuming that LLVM will, even today without any special pass,
generate correct code for the wide OpenCL vectors? For me Polly is just
an optimization that could revisit the whole vectorization decision by
looking at the big picture of the whole loop nest and generating a
target-specific loop nest with target-specific vectorization (and
OpenMP parallelisation).

> However, none of these apply to GPUs, and any pass you run could
> completely destroy the semantics for a GPU back-end. The AMD
> presentation at the meeting last year exposes some of that.

I have seen the AMD presentation and believe we can generate efficient
vector code for GPUs. Obviously with some adaptations, but I am
convinced this is doable.

Cheers
Tobi
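A hypothetical sketch of the difference described here (the vector width of 4
and the function name are illustrative assumptions, not Polly output): rather
than legalizing one over-wide operation, the whole loop nest is regenerated
for the target, strip-mined to the native vector width and parallelised with
OpenMP on the outer loop:

  #define VW 4  /* assumed native vector width, in floats */

  /* Original loop:  for (i = 0; i < n; i++) y[i] += alpha * x[i];   */
  void saxpy(int n, float alpha, const float *x, float *y) {
  #pragma omp parallel for
      for (int i = 0; i < n; i += VW) {          /* OpenMP across strips */
          int end = (i + VW < n) ? i + VW : n;
          for (int j = i; j < end; j++)          /* each strip maps to   */
              y[j] += alpha * x[j];              /* one native vector op */
      }
  }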
Renato Golin
2011-Jan-09 14:23 UTC
[LLVMdev] Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly
On 9 January 2011 00:41, Tobias Grosser <grosser at fim.uni-passau.de> wrote:
> I do not get this one. Why would you use just a part of Polly?

Oh, you can. It's just that you may not need to go through all of Polly
if OpenCL already has the vector semantics done in the front-end.

> Was I wrong in assuming that LLVM will, even today without any special
> pass, generate correct code for the wide OpenCL vectors? For me Polly
> is just an optimization that could revisit the whole vectorization
> decision by looking at the big picture of the whole loop nest and
> generating a target-specific loop nest with target-specific
> vectorization (and OpenMP parallelisation).

I'm really not the OpenCL expert, but I hear that it's not as trivial
as one would think. I know from generating NEON code in the front-end
that any fiddling with the semantics of the instructions can make the
pattern-matching algorithm fail, and you fall back to normal
instructions.

I'm just trying to be cautious here and not fall into false hopes, but
someone with more knowledge of OpenCL would know better.

> I have seen the AMD presentation and believe we can generate efficient
> vector code for GPUs. Obviously with some adaptations, but I am
> convinced this is doable.

Great! Even better than I thought! ;)

cheers,
--renato
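A generic illustration of that fragility (not NEON-specific; a sketch of how
pattern-matching vectorizers behave in general, with made-up function names):
two loops doing essentially the same work, where a data-dependent early exit
in the second is typically enough to make the matcher give up and fall back
to scalar code:

  void scale_plain(int n, float s, const float *in, float *out) {
      for (int i = 0; i < n; i++)
          out[i] = s * in[i];              /* regular form, easily matched */
  }

  void scale_with_exit(int n, float s, const float *in, float *out) {
      for (int i = 0; i < n; i++) {
          if (in[i] == 0.0f)               /* data-dependent early exit    */
              return;                      /* usually blocks the matcher   */
          out[i] = s * in[i];
      }
  }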