Pekka Jääskeläinen
2013-Oct-25 14:14 UTC
[LLVMdev] Is there pass to break down <4 x float> to scalars
Hi,

Great to see someone working on this. This will benefit the performance portability goal of pocl's OpenCL kernel compiler. It has been one of the low-hanging fruits in improving the applicability of its implicit work-group (WG) vectorization.

The use case there is that it sometimes makes sense to devectorize OpenCL kernel code that uses explicit vector datatypes, in order to create better opportunities for the "horizontal" vectorization across the work-items inside the work-group. E.g., the last time I checked, the inner-loop vectorizer (which pocl exploits) simply refused to vectorize loops containing vector instructions. It might not be so drastic with the SLP or BB vectorizers, but in general it might make sense to let the vectorizer make the decisions on how best to map the parallel (scalar) operations to the vector hardware, and just help it with the parallelism knowledge propagated from the parallel program. One can then fall back to the original (hand-vectorized) code in case the autovectorization fails, to still get some vector hardware utilization.

On 10/25/2013 04:15 PM, Richard Sandiford wrote:
> To be honest I hadn't really thought about targets with vector units
> at all. :-) I was just assuming that we'd want to keep vector operations
> together if there's native support. E.g. ISTR comments about not wanting
> to rewrite vec_selects because it can be hard to synthesise optimal
> sequences from a single canonical form. But I might have got that wrong.
> Also, llvmpipe uses intrinsics for some things, so it might be strange
> if we decompose IR operations but leave the intrinsics alone.

The issue of intrinsics and vectorization was discussed some time ago. There it might be better to devectorize to a scalar version of the intrinsic (if available), as at least the loop vectorizer can also vectorize a set of selected intrinsics, and the target might have direct machine instructions for those (which could not be exploited easily from "inlined" versions).

-- Pekka
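To make the lane-by-lane decomposition being discussed concrete, a minimal sketch follows, written against the LLVM C++ API of the time. The helper name scalarizeBinOp is invented for illustration and is not the pass Richard describes; a real pass would also scalarize loads, stores, phis and shuffles and clean up the temporary extract/insert scaffolding.

    // Illustrative only: split one vector BinaryOperator such as
    //   %r = fadd <4 x float> %a, %b
    // into per-lane scalar fadds plus insertelements that rebuild %r.
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/Instructions.h"
    using namespace llvm;

    // Hypothetical helper, not the actual API of the proposed pass.
    static void scalarizeBinOp(BinaryOperator *BO) {
      VectorType *VecTy = dyn_cast<VectorType>(BO->getType());
      if (!VecTy)
        return;                      // nothing to do for scalar operations

      IRBuilder<> B(BO);
      Value *Result = UndefValue::get(VecTy);
      unsigned NumElts = VecTy->getNumElements();

      for (unsigned I = 0; I != NumElts; ++I) {
        // Pull out lane I of both operands, redo the operation on scalars,
        // and write the scalar result back into lane I of the new vector.
        Value *LHS = B.CreateExtractElement(BO->getOperand(0), B.getInt32(I));
        Value *RHS = B.CreateExtractElement(BO->getOperand(1), B.getInt32(I));
        Value *Scalar = B.CreateBinOp(BO->getOpcode(), LHS, RHS);
        Result = B.CreateInsertElement(Result, Scalar, B.getInt32(I));
      }

      BO->replaceAllUsesWith(Result);
      BO->eraseFromParent();
    }

The extract/insert bookkeeping is expected to disappear once the surrounding loads, stores and phis are scalarized too, or once a later vectorizer re-vectorizes the scalar operations across work-items.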
Richard Sandiford
2013-Oct-25 14:38 UTC
[LLVMdev] Is there pass to break down <4 x float> to scalars
Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> writes:
> E.g., the last time I checked, the inner-loop vectorizer (which pocl
> exploits) simply refused to vectorize loops containing vector
> instructions. It might not be so drastic with the SLP or BB vectorizers,
> but in general it might make sense to let the vectorizer make the
> decisions on how best to map the parallel (scalar) operations to the
> vector hardware, and just help it with the parallelism knowledge
> propagated from the parallel program. One can then fall back to the
> original (hand-vectorized) code in case the autovectorization fails, to
> still get some vector hardware utilization.

Sounds like a nice compromise if it could be made to work. Would it be LLVM that reverts to the autovectorised version, or pocl?

> On 10/25/2013 04:15 PM, Richard Sandiford wrote:
>> To be honest I hadn't really thought about targets with vector units
>> at all. :-) I was just assuming that we'd want to keep vector operations
>> together if there's native support. E.g. ISTR comments about not wanting
>> to rewrite vec_selects because it can be hard to synthesise optimal
>> sequences from a single canonical form. But I might have got that wrong.
>> Also, llvmpipe uses intrinsics for some things, so it might be strange
>> if we decompose IR operations but leave the intrinsics alone.
>
> The issue of intrinsics and vectorization was discussed some time ago.
> There it might be better to devectorize to a scalar version of the
> intrinsic (if available), as at least the loop vectorizer can also
> vectorize a set of selected intrinsics, and the target might have direct
> machine instructions for those (which could not be exploited easily from
> "inlined" versions).

Yeah, I vaguely remember some objections to handling target-specific intrinsics at the IR level, which I heard is what put others off doing the pass. In my case life is much simpler: there are no intrinsics and there's no native vector support. So in some ways I've only done the easy bit. I'm just hoping it's also the less controversial bit.

Do the OpenCL loops that don't get vectorised (because they already have some vector ops) also contain vector intrinsics, or is it usually generic vector IR? Would a pass that just scalarises the generic operations but keeps intrinsics as-is be any use to you, or would the intrinsics really need to be handled too?

Thanks for the feedback.

Richard
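To make "scalarise the generic operations but keep intrinsics as-is" concrete, here is a hedged sketch of the driver loop such a pass might use. The pass name NaiveScalarizer and its structure are invented for illustration; this is not Richard's actual implementation, and it leans on the scalarizeBinOp helper sketched earlier.

    #include "llvm/IR/Function.h"
    #include "llvm/IR/Instructions.h"
    #include "llvm/Pass.h"
    using namespace llvm;

    // Defined in the earlier sketch: splits one vector BinaryOperator into lanes.
    static void scalarizeBinOp(BinaryOperator *BO);

    namespace {
    // Invented name and structure; purely illustrative.
    struct NaiveScalarizer : public FunctionPass {
      static char ID;
      NaiveScalarizer() : FunctionPass(ID) {}

      virtual bool runOnFunction(Function &F) {
        bool Changed = false;
        for (Function::iterator BB = F.begin(), BE = F.end(); BB != BE; ++BB) {
          for (BasicBlock::iterator I = BB->begin(); I != BB->end(); ) {
            Instruction *Inst = &*I;
            ++I;                            // advance first: Inst may be erased
            if (!Inst->getType()->isVectorTy())
              continue;
            if (isa<CallInst>(Inst))
              continue;                     // keep intrinsics and calls as-is
            if (BinaryOperator *BO = dyn_cast<BinaryOperator>(Inst)) {
              scalarizeBinOp(BO);
              Changed = true;
            }
            // Loads, stores, phis, selects, shuffles, etc. would each need
            // their own handler in a complete pass.
          }
        }
        return Changed;
      }
    };
    }

    char NaiveScalarizer::ID = 0;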
Pekka Jääskeläinen
2013-Oct-25 15:10 UTC
[LLVMdev] Is there pass to break down <4 x float> to scalars
On 10/25/2013 05:38 PM, Richard Sandiford wrote:
> Sounds like a nice compromise if it could be made to work. Would it be
> LLVM that reverts to the autovectorised version, or pocl?

In my opinion LLVM, because this benefits not only the OpenCL WG autovectorization of pocl, but any code that uses explicit vector instructions and might be autovectorized more efficiently if those were devectorized first. E.g. C code that uses vector datatypes via Clang's vector attributes.

> Yeah, I vaguely remember some objections to handling target-specific
> intrinsics at the IR level, which I heard is what put others off doing
> the pass. In my case life is much simpler: there are no intrinsics and
> there's no native vector support. So in some ways I've only done the
> easy bit. I'm just hoping it's also the less controversial bit.

One solution is to try to scalarize the intrinsic calls too (if one knows of the matching scalar ones), and if that fails, keep them intact (which potentially leads to additional unpack/pack overheads if their autovectorization then fails).

> Do the OpenCL loops that don't get vectorised (because they already
> have some vector ops) also contain vector intrinsics, or is it usually
> generic vector IR? Would a pass that just scalarises the generic
> operations but keeps intrinsics as-is be any use to you, or would the
> intrinsics really need to be handled too?

Yes to both. It is useful without the intrinsics support, but the above handling might improve the results for some kernels. The OpenCL builtins (math functions) have vector versions, so they are called if one uses the vector data types; one then sometimes ends up with vector intrinsics in the bitcode.

In case I wasn't clear, there are two dimensions along which one can autovectorize OpenCL C kernels: inside the SPMD kernel description itself (a single work-item), or across the "implicit parallel loops" over all work-items in the work-group. I was referring to the latter, as that is where the massive data parallelism, and thus the more scalable vectorization opportunities, usually are.

-- Pekka
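As a rough illustration of "scalarize the intrinsic calls too, if one knows of the matching ones", here is a hedged sketch for a single case, llvm.sqrt on <4 x float>, written against the LLVM 3.x-era C++ API. The helper name is invented; a real implementation would need a table of which vector intrinsics have lane-wise scalar counterparts and would fall back to leaving the call intact when no mapping is known.

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/IntrinsicInst.h"
    #include "llvm/IR/Module.h"
    using namespace llvm;

    // Invented helper: rewrite
    //   %r = call <4 x float> @llvm.sqrt.v4f32(<4 x float> %v)
    // into four scalar @llvm.sqrt.f32 calls. Returns false for intrinsics
    // it does not know how to map, so the caller can keep them intact.
    static bool scalarizeSqrtIntrinsic(IntrinsicInst *II) {
      if (II->getIntrinsicID() != Intrinsic::sqrt)
        return false;
      VectorType *VecTy = dyn_cast<VectorType>(II->getType());
      if (!VecTy)
        return false;

      IRBuilder<> B(II);
      Module *M = II->getParent()->getParent()->getParent();
      Type *EltTy = VecTy->getElementType();
      // Scalar overload of the same intrinsic (e.g. llvm.sqrt.f32).
      Function *ScalarSqrt = Intrinsic::getDeclaration(M, Intrinsic::sqrt, EltTy);

      Value *Result = UndefValue::get(VecTy);
      for (unsigned I = 0, E = VecTy->getNumElements(); I != E; ++I) {
        Value *Lane = B.CreateExtractElement(II->getArgOperand(0), B.getInt32(I));
        Value *Sqrt = B.CreateCall(ScalarSqrt, Lane);
        Result = B.CreateInsertElement(Result, Sqrt, B.getInt32(I));
      }
      II->replaceAllUsesWith(Result);
      II->eraseFromParent();
      return true;
    }

The same shape would apply to the OpenCL builtin math functions Pekka mentions, once they have been lowered to vector intrinsics in the bitcode.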