thr3ads.net - llvm dev - [LLVMdev] Is there pass to break down <4 x float> to scalars [Oct 2013]


Renato Golin

2013-Oct-25 12:53 UTC

[LLVMdev] Is there pass to break down <4 x float> to scalars

On 25 October 2013 11:06, Richard Sandiford <rsandifo at linux.vnet.ibm.com> wrote:
> I wanted the same thing for SystemZ, which doesn't have vectors,
> in order to improve the llvmpipe code.
>
Hi Richard,

This is a nice patch. I was wondering how hard it'd be to do that, and it
seems that you're catching lots of corner cases.

My interest is also due to converting odd vectors into scalars, but to
convert them again to CPU vectors, say from OpenCL to NEON code.

> It would also need some TargetTransformInfo hooks to decide which
> vectors should be decomposed.
>
If I got it right, this may not be necessary, or it may even be harmful.

Say you decide that <4 x i32> vectors should be left alone, so that your
pass only scalarises the others. But when the vectorizer passes again (to
try and use CPU vector instructions), it might not match the scalarised
version with the vector, and you end up with data movement between scalar
and vector pipelines, which normally slows down CPUs (at least in ARM's
case). Also, problematic cases like <5 x i32> could be better split into
3+2 pairs, rather than 4+1.

If you scalarise everything, then the vectorizers will have a better chance
of spotting patterns and vectorising the whole lot, this time based on
target transform info.
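
As a rough illustration of what "scalarise everything" means for a single operation, the sketch below emits LLVM IR text that unpacks two vector operands, performs the operation lane by lane, and rebuilds the result. It is not taken from the patch under discussion: the helper name `scalarize_binop` and the `.iN` / `.uptoN` value names are assumptions made up for this example.

```python
def scalarize_binop(opcode, elem_ty, lanes, lhs, rhs, result):
    """Return LLVM IR lines computing `result = opcode lhs, rhs` one lane at a time.

    Hypothetical helper for illustration only. It assumes lhs and rhs are
    distinct value names, otherwise the extracted names would collide.
    """
    vec_ty = f"<{lanes} x {elem_ty}>"
    ir = []
    acc = "undef"  # the partially rebuilt result vector
    for i in range(lanes):
        # Pull lane i out of each operand.
        ir.append(f"%{lhs}.i{i} = extractelement {vec_ty} %{lhs}, i32 {i}")
        ir.append(f"%{rhs}.i{i} = extractelement {vec_ty} %{rhs}, i32 {i}")
        # The scalar operation on that lane.
        ir.append(f"%{result}.i{i} = {opcode} {elem_ty} %{lhs}.i{i}, %{rhs}.i{i}")
        # Put the scalar result back; the last insert defines the final name.
        name = result if i == lanes - 1 else f"{result}.upto{i}"
        ir.append(
            f"%{name} = insertelement {vec_ty} {acc}, {elem_ty} %{result}.i{i}, i32 {i}"
        )
        acc = f"%{name}"
    return ir


if __name__ == "__main__":
    print("\n".join(scalarize_binop("fadd", "float", 4, "a", "b", "r")))
```

Once the code is in this form, the scalar `fadd` lines are what a later vectorizer would see and could regroup into whatever width the target prefers.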

Is that what you had in mind?

cheers,
--renato

Richard Sandiford

2013-Oct-25 13:15 UTC


Renato Golin <renato.golin at linaro.org> writes:
> On 25 October 2013 11:06, Richard Sandiford <rsandifo at linux.vnet.ibm.com> wrote:
>> It would also need some TargetTransformInfo hooks to decide which
>> vectors should be decomposed.
>
> If I got it right, this may not be necessary, or it may even be harmful.
>
> Say you decide that <4 x i32> vectors should be left alone, so that your
> pass only scalarises the others. But when the vectorizer passes again (to
> try and use CPU vector instructions), it might not match the scalarised
> version with the vector, and you end up with data movement between scalar
> and vector pipelines, which normally slows down CPUs (at least in ARM's
> case). Also, problematic cases like <5 x i32> could be better split into
> 3+2 pairs, rather than 4+1.
>
> If you scalarise everything, then the vectorizers will have a better chance
> of spotting patterns and vectorising the whole lot, this time based on
> target transform info.
>
> Is that what you had in mind?

To be honest I hadn't really thought about targets with vector units
at all. :-)  I was just assuming that we'd want to keep vector operations
together if there's native support.  E.g. ISTR comments about not wanting
to rewrite vec_selects because it can be hard to synthesise optimal
sequences from a single canonical form.  But I might have got that wrong.
Also, llvmpipe uses intrinsics for some things, so it might be strange
if we decompose IR operations but leave the intrinsics alone.

I'd half wondered whether, as an extension, the pass should split wide
vectors into supported widths.  I hadn't thought about the possibility of
decomposing everything and then reassembling it though.  I can see how
that would cope with more cases, like you say.
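
To make the two splitting strategies mentioned in this thread concrete (4+1 versus 3+2 for a 5-element vector), here is a small sketch. The function names and the idea of passing in a list of legal widths are assumptions for illustration; a real pass would ask the target's cost model, and whether a 3-wide chunk is actually cheap depends on the target.

```python
def split_greedy(lanes, legal_widths):
    """Peel off the widest legal chunk first, e.g. 5 lanes -> [4, 1].

    Width 1 (a scalar) is always allowed so that any lane count can be covered.
    """
    widths = sorted(set(legal_widths) | {1}, reverse=True)
    parts = []
    remaining = lanes
    for w in widths:
        while remaining >= w:
            parts.append(w)
            remaining -= w
    return parts


def split_balanced(lanes, max_width):
    """Use the fewest chunks of at most max_width lanes, sized as evenly as
    possible, e.g. 5 lanes -> [3, 2]. Assumes lanes >= 1 and max_width >= 1.
    """
    count = -(-lanes // max_width)  # ceiling division
    base, extra = divmod(lanes, count)
    return [base + 1] * extra + [base] * (count - extra)


if __name__ == "__main__":
    print(split_greedy(5, [4]))    # widest-first split
    print(split_balanced(5, 4))    # even split
```

The greedy version corresponds to "split into supported widths"; the balanced version corresponds to the 3+2 suggestion, which avoids leaving a lone scalar behind.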

Thanks,
Richard

Renato Golin

2013-Oct-25 13:28 UTC


On 25 October 2013 14:15, Richard Sandiford <rsandifo at linux.vnet.ibm.com> wrote:
> Also, llvmpipe uses intrinsics for some things, so it might be strange
> if we decompose IR operations but leave the intrinsics alone.
>
Yes, this is a problem with OpenCL as well. I wonder if it'd be useful to
have compiler-implemented, inlinable library code for each supported
intrinsic, so that you can also lower them to code.

For targets that don't support the intrinsics, well, the IR wouldn't
compile, so being slow is better than not working at all, but I wonder how
much worse it would be to do that.

cheers,
--renato

Pekka Jääskeläinen

2013-Oct-25 14:14 UTC


Hi,

Great to see someone working on this. This will benefit the performance
portability goal of pocl's OpenCL kernel compiler. It has been one of
the low-hanging fruits in improving the applicability of its implicit WG
vectorization.

The use case there is that sometimes it makes sense to devectorize
the explicitly used vector datatype code of OpenCL kernels in order to
create better opportunities for the "horizontal" vectorization across
work-items inside the work-group.

E.g., the last time I checked, the inner loop vectorizer (which pocl exploits)
just refused to vectorize loops with vector instructions. It might not
be so drastic with the SLP or the BB vectorizer, but in general, it might
make sense to let the vectorizer make the decisions on how to map the
parallel (scalar) operations best to the vector hardware, and just help it
with the parallelism knowledge propagated from the parallel program.
One can then fall back to the original (hand vectorized) code in case
the autovectorization failed, to get some vector hardware utilization
still.
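
A toy model of the regrouping described above may help. It treats each work-item's explicit vector as a plain list, devectorizes it into scalars, and then regroups the scalars "horizontally" so that each new vector holds the same lane from every work-item. The function names and the list-based representation are assumptions for illustration and say nothing about how pocl or the LLVM vectorizers actually implement this.

```python
def devectorize(work_items):
    """Flatten each work-item's explicit vector into (work_item, lane, value) scalars."""
    return [
        (wi, lane, value)
        for wi, vec in enumerate(work_items)
        for lane, value in enumerate(vec)
    ]


def regroup_horizontally(scalars, num_lanes):
    """Build one vector per lane, each spanning all work-items.

    Relies on `scalars` being ordered by work-item, as devectorize() produces.
    """
    groups = [[] for _ in range(num_lanes)]
    for _wi, lane, value in scalars:
        groups[lane].append(value)
    return groups


if __name__ == "__main__":
    # Two work-items, each holding a hand-written 4-element vector.
    work_group = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    print(regroup_horizontally(devectorize(work_group), 4))
```

With a real work-group size the horizontal vectors become as wide as the number of work-items, which leaves the choice of hardware vector width to the vectorizer rather than to the kernel author.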

On 10/25/2013 04:15 PM, Richard Sandiford wrote:
> To be honest I hadn't really thought about targets with vector units
> at all. :-)  I was just assuming that we'd want to keep vector operations
> together if there's native support.  E.g. ISTR comments about not wanting
> to rewrite vec_selects because it can be hard to synthesise optimal
> sequences from a single canonical form.  But I might have got that wrong.
> Also, llvmpipe uses intrinsics for some things, so it might be strange
> if we decompose IR operations but leave the intrinsics alone.

The issue of intrinsics and vectorization was discussed some time ago.
There it might be better to devectorize to a scalar version of the
intrinsics (if available), as at least the loop vectorizer can also
vectorize a selected set of intrinsics, and the target might have direct
machine instructions for those (which could not be exploited easily from
"inlined" versions).

-- 
Pekka
