Renato Golin
2013-Oct-25 12:53 UTC
[LLVMdev] Is there pass to break down <4 x float> to scalars
On 25 October 2013 11:06, Richard Sandiford <rsandifo at linux.vnet.ibm.com> wrote:

> I wanted the same thing for SystemZ, which doesn't have vectors,
> in order to improve the llvmpipe code.

Hi Richard,

This is a nice patch. I was wondering how hard it'd be to do that, and it seems that you're catching lots of corner cases. My interest is also in converting odd vectors into scalars, but in order to convert them back into CPU vectors, say from OpenCL to NEON code.

> It would also need some TargetTransformInfo hooks to decide which
> vectors should be decomposed.

If I got it right, this may not be necessary, or it may even be harmful.

Say you decide that <4 x i32> vectors should be left alone, so that your pass only scalarises the others. But when the vectorizer runs again (to try to use CPU vector instructions), it might not match the scalarised code with the vector code that was left alone, and you end up with data movement between scalar and vector pipelines, which normally slows down CPUs (at least in ARM's case). Also, problematic cases like <5 x i32> could be better split into 3+2 pairs, rather than 4+1.

If you scalarise everything, then the vectorizers will have a better chance of spotting patterns and vectorising the whole lot, guided at that point by target transform info.

Is that what you had in mind?

cheers,
--renato
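The 3+2 versus 4+1 remark can be made concrete with a small sketch. It is only an illustration in Python, not code from the patch under discussion; the function names `greedy_split` and `balanced_split` and the `native` width parameter are hypothetical, and both functions assume at least one lane.

```python
def greedy_split(lanes, native):
    """Peel off full native-width chunks first; the remainder comes last."""
    parts = [native] * (lanes // native)
    if lanes % native:
        parts.append(lanes % native)
    return parts


def balanced_split(lanes, native):
    """Use as many chunks as greedy_split would, but even out their sizes."""
    count = -(-lanes // native)  # ceiling division
    base, extra = divmod(lanes, count)
    # The first `extra` chunks get one additional lane.
    return [base + 1] * extra + [base] * (count - extra)


print(greedy_split(5, 4))    # [4, 1]
print(balanced_split(5, 4))  # [3, 2]
```

Whether the balanced split is actually cheaper depends on the target, which is the kind of decision the message suggests leaving to the vectorizers and target transform info.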
Richard Sandiford
2013-Oct-25 13:15 UTC
[LLVMdev] Is there pass to break down <4 x float> to scalars
Renato Golin <renato.golin at linaro.org> writes:

> On 25 October 2013 11:06, Richard Sandiford <rsandifo at linux.vnet.ibm.com> wrote:
>> It would also need some TargetTransformInfo hooks to decide which
>> vectors should be decomposed.
>
> If I got it right, this may not be necessary, or it may even be harmful.
>
> Say you decide that <4 x i32> vectors should be left alone, so that your
> pass only scalarises the others. But when the vectorizer passes again (to
> try and use CPU vector instructions), it might not match the scalarised
> version with the vector, and you end up with data movement between scalar
> and vector pipelines, which normally slows down CPUs (at least in ARM's
> case). Also, problematic cases like <5 x i32> could be better split into
> 3+2 pairs, rather than 4+1.
>
> If you scalarise everything, then the vectorizers will have a better chance
> of spotting patterns and vectorising the whole lot, then based on target
> transform info.
>
> Is that what you had in mind?

To be honest I hadn't really thought about targets with vector units at all. :-) I was just assuming that we'd want to keep vector operations together if there's native support. E.g. ISTR comments about not wanting to rewrite vec_selects because it can be hard to synthesise optimal sequences from a single canonical form. But I might have got that wrong. Also, llvmpipe uses intrinsics for some things, so it might be strange if we decompose IR operations but leave the intrinsics alone.

I'd half wondered whether, as an extension, the pass should split wide vectors into supported widths. I hadn't thought about the possibility of decomposing everything and then reassembling it though. I can see how that would cope with more cases, like you say.

Thanks,
Richard
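For readers unfamiliar with what "decomposing" a vector operation looks like, here is a minimal Python sketch that prints the lane-by-lane form of one element-wise binary operation as textual LLVM IR. It is not the output of the patch discussed in this thread; the helper `scalarize_binop` and its value-naming scheme are assumptions made for illustration, and only the `extractelement`, `insertelement` and `fadd` instructions themselves are standard IR.

```python
def scalarize_binop(opcode, elem_ty, lanes, lhs, rhs, result):
    """Return IR text computing `result = opcode lhs, rhs` one lane at a time."""
    vec_ty = f"<{lanes} x {elem_ty}>"
    lines = []
    acc = "undef"  # start from an undefined vector and fill in one lane per step
    for i in range(lanes):
        # Pull lane i out of both operands.
        lines.append(f"%{lhs}.{i} = extractelement {vec_ty} %{lhs}, i32 {i}")
        lines.append(f"%{rhs}.{i} = extractelement {vec_ty} %{rhs}, i32 {i}")
        # The scalar operation that a later vectorizer could regroup.
        lines.append(f"%{result}.{i} = {opcode} {elem_ty} %{lhs}.{i}, %{rhs}.{i}")
        # Reassemble; the final insertelement defines the result name itself.
        dest = result if i == lanes - 1 else f"{result}.v{i}"
        lines.append(
            f"%{dest} = insertelement {vec_ty} {acc}, {elem_ty} %{result}.{i}, i32 {i}"
        )
        acc = f"%{dest}"
    return lines


for line in scalarize_binop("fadd", "float", 4, "a", "b", "r"):
    print(line)
```

In a chain of vector operations the intermediate `insertelement`/`extractelement` pairs would cancel out, leaving only scalar arithmetic for a vectorizer to reassemble.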
Renato Golin
2013-Oct-25 13:28 UTC
[LLVMdev] Is there pass to break down <4 x float> to scalars
On 25 October 2013 14:15, Richard Sandiford <rsandifo at linux.vnet.ibm.com> wrote:

> Also, llvmpipe uses intrinsics for some things, so it might be strange
> if we decompose IR operations but leave the intrinsics alone.

Yes, this is a problem with OpenCL as well. I wonder if it'd be useful to have compiler-implemented, inlined library code for each type of supported intrinsic, so that you can lower them to plain code as well.

For targets that don't support the intrinsics, the IR wouldn't compile otherwise, so being slow is better than not working at all, but I wonder how much worse it would be to do that.

cheers,
--renato
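The idea of a library of inlined fallbacks can be sketched roughly as a lookup table from intrinsic to scalar code, applied per lane. This is a Python illustration only: the table `SCALAR_FALLBACKS`, its keys, and `lower_vector_intrinsic` are hypothetical and do not correspond to any existing LLVM facility.

```python
import math

# Hypothetical table: intrinsic base name -> scalar fallback in plain code.
SCALAR_FALLBACKS = {
    "sqrt": math.sqrt,
    "fabs": abs,
    "fma": lambda a, b, c: a * b + c,  # unfused: rounds twice, unlike a true fma
}


def lower_vector_intrinsic(name, *operands):
    """Apply the scalar fallback for `name` to each lane of the vector operands."""
    fallback = SCALAR_FALLBACKS[name]
    return [fallback(*lane) for lane in zip(*operands)]


print(lower_vector_intrinsic("sqrt", [1.0, 4.0, 9.0, 16.0]))  # [1.0, 2.0, 3.0, 4.0]
```

The `fma` entry shows the cost side of the question raised above: a fallback written as ordinary code may be slower and may not match the precision of a dedicated instruction.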
Pekka Jääskeläinen
2013-Oct-25 14:14 UTC
[LLVMdev] Is there pass to break down <4 x float> to scalars
Hi,

Great to see someone working on this. This will benefit the performance portability goal of pocl's OpenCL kernel compiler. It has been one of the low-hanging fruits for improving the applicability of its implicit work-group (WG) vectorization.

The use case there is that sometimes it makes sense to devectorize OpenCL kernel code that uses explicit vector datatypes, in order to create better opportunities for "horizontal" vectorization across work-items inside the work-group. E.g., the last time I checked, the inner loop vectorizer (which pocl exploits) simply refused to vectorize loops containing vector instructions.

It might not be so drastic with the SLP or the BB vectorizer, but in general it might make sense to let the vectorizer make the decisions on how best to map the parallel (scalar) operations to the vector hardware, and just help it with the parallelism knowledge propagated from the parallel program. One can then fall back to the original (hand-vectorized) code if autovectorization fails, to still get some vector hardware utilization.

On 10/25/2013 04:15 PM, Richard Sandiford wrote:

> To be honest I hadn't really thought about targets with vector units
> at all. :-) I was just assuming that we'd want to keep vector operations
> together if there's native support. E.g. ISTR comments about not wanting
> to rewrite vec_selects because it can be hard to synthesise optimal
> sequences from a single canonical form. But I might have got that wrong.
> Also, llvmpipe uses intrinsics for some things, so it might be strange
> if we decompose IR operations but leave the intrinsics alone.

The issue of intrinsics and vectorization was discussed some time ago. There it might be better to devectorize to a scalar version of the intrinsics (if available), since at least the loop vectorizer can also vectorize a selected set of intrinsics, and the target might have direct machine instructions for those (which could not be exploited easily from "inlined" versions).

--
Pekka
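The difference between keeping the kernel's explicit vectors and vectorizing "horizontally" across work-items can be sketched as follows. This is a plain Python model, not pocl code; `kernel_vector`, `run_per_work_item` and `run_horizontal` are hypothetical names, and Python lists stand in for vector registers.

```python
def kernel_vector(a, b):
    """Per-work-item kernel body written with an explicit vector type."""
    return [x * y for x, y in zip(a, b)]


def run_per_work_item(as_, bs):
    """Baseline: each work-item executes its own short vector multiply."""
    return [kernel_vector(a, b) for a, b in zip(as_, bs)]


def run_horizontal(as_, bs):
    """After devectorization each lane is a scalar operation, which is
    then widened across all work-items of the work-group."""
    lanes = len(as_[0])
    group = len(as_)
    out = [[None] * lanes for _ in range(group)]
    for lane in range(lanes):
        # One group-wide operation per (formerly vector) lane.
        wide_a = [as_[wi][lane] for wi in range(group)]
        wide_b = [bs[wi][lane] for wi in range(group)]
        wide_r = [x * y for x, y in zip(wide_a, wide_b)]
        for wi in range(group):
            out[wi][lane] = wide_r[wi]
    return out


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]  # 4 work-items, 3 lanes each
b = [[2, 2, 2], [3, 3, 3], [4, 4, 4], [5, 5, 5]]
assert run_horizontal(a, b) == run_per_work_item(a, b)
```

With three-lane kernel vectors and a work-group of four, the horizontal form issues three operations that are each four wide, so the vector width is set by the work-group rather than by the type the kernel author happened to choose.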