Pekka Jääskeläinen
2013-Oct-25  14:14 UTC
[LLVMdev] Is there pass to break down <4 x float> to scalars
Hi,

Great to see someone working on this. This will benefit the performance portability goal of pocl's OpenCL kernel compiler. It has been one of the low-hanging fruits for improving the applicability of its implicit WG vectorization.

The use case there is that it sometimes makes sense to devectorize OpenCL kernel code that uses explicit vector datatypes, in order to create better opportunities for "horizontal" vectorization across the work-items inside the work-group. E.g., the last time I checked, the inner loop vectorizer (which pocl exploits) simply refused to vectorize loops that contain vector instructions. It might not be so drastic with the SLP or the BB vectorizer, but in general it might make sense to let the vectorizer decide how best to map the parallel (scalar) operations to the vector hardware, and just help it with the parallelism knowledge propagated from the parallel program. One can then fall back to the original (hand-vectorized) code if the autovectorization fails, to still get some vector hardware utilization.

On 10/25/2013 04:15 PM, Richard Sandiford wrote:
> To be honest I hadn't really thought about targets with vector units
> at all. :-) I was just assuming that we'd want to keep vector operations
> together if there's native support. E.g. ISTR comments about not wanting
> to rewrite vec_selects because it can be hard to synthesise optimal
> sequences from a single canonical form. But I might have got that wrong.
> Also, llvmpipe uses intrinsics for some things, so it might be strange
> if we decompose IR operations but leave the intrinsics alone.

The issue of intrinsics and vectorization was discussed some time ago. There it might be better to devectorize to a scalar version of the intrinsics (if available), as at least the loop vectorizer can also vectorize a selected set of intrinsics, and the target might have direct machine instructions for those (which could not easily be exploited from "inlined" versions).

--
Pekka
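As a rough illustration of the devectorization described above, here is a minimal C-level sketch. It is not the pass under discussion, which would operate on LLVM IR; the function names and the use of the GCC/Clang vector_size extension are assumptions made for this example only. The first function is the hand-vectorized form; the second is the same computation with every <4 x float> operation broken into scalar lanes, which leaves a plain loop for an autovectorizer to map onto the hardware as it sees fit.

```c
#include <assert.h>
#include <stddef.h>

/* Generic 4 x float vector type (GCC/Clang vector extension). */
typedef float float4 __attribute__((vector_size(16)));

/* Hand-vectorized form: each iteration performs one <4 x float>
 * multiply and one <4 x float> add. Hypothetical example kernel. */
void scale_add_vec(float4 *out, const float4 *a, const float4 *b,
                   float k, size_t n_vec)
{
    for (size_t i = 0; i < n_vec; ++i) {
        float4 kv = {k, k, k, k};
        out[i] = a[i] * kv + b[i];
    }
}

/* Devectorized form: the same arithmetic, one lane at a time.
 * n counts floats and must cover whole 4-lane vectors. */
void scale_add_scalar(float *out, const float *a, const float *b,
                      float k, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}
```

Both functions compute the same values; only the scalar form leaves the choice of vector width to the compiler.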
Richard Sandiford
2013-Oct-25  14:38 UTC
[LLVMdev] Is there pass to break down <4 x float> to scalars
Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> writes:
> E.g., the last time I checked, the inner loop vectorizer (which pocl exploits)
> just refused to vectorize loops with vector instructions. It might not
> be so drastic with the SLP or the BB vectorizer, but in general, it might
> make sense to let the vectorizer to do the decisions on how to map the
> parallel (scalar) operations best to the vector hardware, and just help it
> with the parallelism knowledge propagated from the parallel program.
> One can then fall back to the original (hand vectorized) code in case
> the autovectorization failed, to get some vector hardware utilization
> still.

Sounds like a nice compromise if it could be made to work. Would it be LLVM that reverts to the autovectorised version, or pocl?

> On 10/25/2013 04:15 PM, Richard Sandiford wrote:
>> To be honest I hadn't really thought about targets with vector units
>> at all. :-) I was just assuming that we'd want to keep vector operations
>> together if there's native support. E.g. ISTR comments about not wanting
>> to rewrite vec_selects because it can be hard to synthesise optimal
>> sequences from a single canonical form. But I might have got that wrong.
>> Also, llvmpipe uses intrinsics for some things, so it might be strange
>> if we decompose IR operations but leave the intrinsics alone.
>
> The issue of intrinsics and vectorization was discussed some time ago.
> There it might be better to devectorize to a scalar version of the
> intrinsics (if available) as at least the loopvectorizer can vectorize
> also a set of selected intrinsics, and the target might have direct
> machine instructions for those (which could not be exploited easily from
> "inlined" versions).

Yeah, I vaguely remember some objections to handling target-specific intrinsics at the IR level, which I heard is what put others off doing the pass. In my case life is much simpler: there are no intrinsics and there's no native vector support. So in some ways I've only done the easy bit. I'm just hoping it's also the less controversial bit.

Do the OpenCL loops that don't get vectorised (because they already have some vector ops) also contain vector intrinsics, or is it usually generic vector IR? Would a pass that just scalarises the generic operations but keeps intrinsics as-is be any use to you, or would the intrinsics really need to be handled too?

Thanks for the feedback.

Richard
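To make the "scalarise the generic operations but keep intrinsics as-is" option concrete, here is a small C sketch. It only models the idea at source level and assumes the GCC/Clang vector_size extension; opaque_vec_clamp01 is a made-up stand-in for a vector intrinsic that a pass would not look inside, not a real LLVM or llvmpipe intrinsic. The mixed version shows the cost of this option: the scalar lanes have to be packed into a vector before the call and unpacked again afterwards.

```c
/* Generic 4 x float vector type (GCC/Clang vector extension). */
typedef float float4 __attribute__((vector_size(16)));

/* Hypothetical stand-in for an opaque vector intrinsic that the
 * scalarizing pass leaves untouched. Clamps each lane to [0, 1]. */
float4 opaque_vec_clamp01(float4 v)
{
    float4 r;
    for (int i = 0; i < 4; ++i)
        r[i] = v[i] < 0.0f ? 0.0f : (v[i] > 1.0f ? 1.0f : v[i]);
    return r;
}

/* Original: generic vector multiply and add around the intrinsic. */
float4 shade_vec(float4 x, float4 y)
{
    return opaque_vec_clamp01(x * y) + y;
}

/* After scalarizing only the generic operations: the multiply and
 * the add work on individual lanes, while the intrinsic call still
 * takes and returns a whole vector. */
void shade_mixed(const float x[4], const float y[4], float out[4])
{
    float p0 = x[0] * y[0];
    float p1 = x[1] * y[1];
    float p2 = x[2] * y[2];
    float p3 = x[3] * y[3];

    float4 packed = {p0, p1, p2, p3};       /* pack scalars into a vector */
    float4 c = opaque_vec_clamp01(packed);  /* intrinsic kept as-is */

    out[0] = c[0] + y[0];                   /* unpack lanes again */
    out[1] = c[1] + y[1];
    out[2] = c[2] + y[2];
    out[3] = c[3] + y[3];
}
```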
Pekka Jääskeläinen
2013-Oct-25  15:10 UTC
[LLVMdev] Is there pass to break down <4 x float> to scalars
On 10/25/2013 05:38 PM, Richard Sandiford wrote:
> Sounds like a nice compromise if it could be made to work. Would it be
> LLVM that reverts to the autovectorised version, or pocl?

In my opinion LLVM, because this benefits not only the OpenCL WG autovectorization of pocl, but any code that uses explicit vector instructions and might be more efficiently autovectorized if those were devectorized first. E.g. C code that uses vector datatypes via Clang's attributes.

> Yeah, I vaguely remember some objections to handling target-specific
> intrinsics at the IR level, which I heard is what put others off doing
> the pass. In my case life is much simpler: there are no intrinsics
> and there's no native vector support. So in some ways I've only done
> the easy bit. I'm just hoping it's also the less controversial bit.

One solution is to try to scalarize the intrinsic calls too (if one knows of matching scalar versions) and, if that fails, keep them intact (which potentially leads to additional unpack/pack etc. overheads if the autovectorization of them fails).

> Do the OpenCL loops that don't get vectorised (because they already
> have some vector ops) also contain vector intrinsics, or is it usually
> generic vector IR? Would a pass that just scalarises the generic
> operations but keeps intrinsics as-is be any use to you, or would the
> intrinsics really need to be handled too?

Yes to both. It is useful without the intrinsics support, but the above handling might improve the results for some kernels. OpenCL builtins (math functions) have vector versions, so those are called if one uses the vector data types. Then one sometimes ends up with vector intrinsics in the bitcode.

In case I wasn't clear, there are two dimensions along which one can autovectorize OpenCL C kernels: inside the SPMD kernel description (single work-item) itself, or across the "implicit parallel loops" over all work-items in the work-group. I was referring to the latter, as that is where the massive data parallelism, and thus the more scalable vectorization opportunities, usually are.

--
Pekka
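The second dimension mentioned above, the implicit parallel loop across work-items, can be pictured with the following C sketch. This is only a simplified model under stated assumptions: work_item and work_group are hypothetical names, and the code is not claimed to match what pocl actually generates. Each work-item processes its own four floats as already devectorized scalars, and the work-group function wraps the kernel body in a loop over local IDs. Because the work-items are independent, that outer loop is the one a loop vectorizer could spread across vector lanes.

```c
#include <stddef.h>

/* Kernel body for a single work-item (the SPMD description),
 * already devectorized: the four lanes of this work-item's
 * float4 element are handled as plain scalars. */
static void work_item(size_t lid, float *out, const float *in, float k)
{
    for (int lane = 0; lane < 4; ++lane)
        out[4 * lid + lane] = in[4 * lid + lane] * k;
}

/* Work-group function: the "implicit parallel loop" over all
 * work-items. Iterations do not depend on each other, so this is
 * where horizontal vectorization across work-items can happen. */
void work_group(float *out, const float *in, float k, size_t local_size)
{
    for (size_t lid = 0; lid < local_size; ++lid)
        work_item(lid, out, in, k);
}
```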