Hi Renato,
On 07/07/2015 10:57 PM, Renato Golin wrote:
> Now, IIRC, OpenCL had a lot of trouble from getting odd-sized vector
> types in IR that the middle end would not understand, especially the
> vectorizers. The solution, at least as of 2 years ago, was to
> serialise everything and let the CL back-end to vectorize it.
Perhaps you are referring to the problem of autovectorizing
work-groups whose kernels use implicit vector datatypes
internally?
Yes, this can be handled with (selective) scalarization or
with a vector-variable-aware vectorizer. AFAIK,
there's already a Scalarizer pass in upstream LLVM for this.
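As an illustration (a hand-written sketch in plain C, not what the
Scalarizer pass literally emits -- it works on LLVM IR), scalarization
rewrites one implicit 4-wide vector operation into four independent
scalar operations that the scalar middle-end passes and the loop
vectorizer understand:

```c
/* Hypothetical stand-in for an OpenCL float4 value. */
typedef struct { float s[4]; } float4;

/* Vector form: a single 4-wide add, as a kernel might write it. */
static float4 add_float4(float4 a, float4 b) {
    float4 r;
    for (int i = 0; i < 4; ++i)
        r.s[i] = a.s[i] + b.s[i];
    return r;
}

/* Scalarized form: what a Scalarizer-style pass conceptually
 * produces -- four independent scalar adds, no vector types left. */
static float4 add_float4_scalarized(float4 a, float4 b) {
    float4 r;
    r.s[0] = a.s[0] + b.s[0];
    r.s[1] = a.s[1] + b.s[1];
    r.s[2] = a.s[2] + b.s[2];
    r.s[3] = a.s[3] + b.s[3];
    return r;
}
```

The point is only that after scalarization nothing odd-sized remains,
so the later passes can revectorize (or not) on their own cost model.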
> Since CL back-ends are normally very different from each others, with
> very different cost models, and some secretive implementation details,
> it's very hard to do that generically in the LLVM middle-end.
Of course it's impossible to always cover everything. What pocl
tries to do is the bare minimum that makes it easier for later passes
to do their job, while reusing the standard cost models (such as
those already in the LLVM vectorizers) whenever possible.
> Also, if you have different domains (like many SIMD cores), sharing
> the operations across cores and across lanes may be completely
> different than, say, pthreads vs. AVX, so the model may not even apply
> here. If you need write back loops, non-trivial synchronization
> barriers between cores and other crazy stuff, adding all that to the
> vectorizer would be bloating code beyond usability. On the other hand,
> maybe not.
Instead of implementing a monolithic SPMD-specific kernel vectorizer
that duplicates much of what the simple loop vectorizers already do,
pocl does quite the opposite. All it does is identify the
parallel regions between barriers, mark them as parallel loops, and
let the other passes do what they like with the loops.
Currently we apply the inner loop vectorizer (hopefully e.g. the loop
interchange and other ongoing work will soon improve it) for CPU+SIMD
targets, schedule the inner loops VLIW-style for static multi-issue
using custom backends, and simply leave the original SPMD
representation alone for GPU-like "SPMD targets" (briefly tested in
ongoing experimental HSA support work).
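To make the region-to-loop idea concrete, here is a hedged sketch in
plain C (the names WG_SIZE and workgroup_fn are made up for
illustration; pocl does this on LLVM IR, not C source). Each region
between barriers becomes its own loop over the work-items, and the
barrier becomes the boundary between consecutive loops:

```c
#define WG_SIZE 8  /* illustrative local work-group size */

/* Conceptually the kernel body was:
 *     tmp[wi] = in[wi] * 2.0f;        // region 1
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     out[wi] = tmp[wi] + tmp[next];  // region 2, reads neighbours
 * The generated work-group function runs each region as a loop over
 * all work-items; later passes may vectorize these parallel loops. */
void workgroup_fn(const float *in, float *out) {
    float tmp[WG_SIZE];
    /* parallel region 1: every work-item finishes before the barrier */
    for (int wi = 0; wi < WG_SIZE; ++wi)
        tmp[wi] = in[wi] * 2.0f;
    /* the barrier falls here: region 1 is complete for all work-items */
    /* parallel region 2: safe to read other work-items' results */
    for (int wi = 0; wi < WG_SIZE; ++wi)
        out[wi] = tmp[wi] + tmp[(wi + 1) % WG_SIZE];
}
```

Note that the cross-work-item read in region 2 is exactly why the
barrier (and thus the region split) is needed; within each region the
iterations are independent, which is what the parallel loop metadata
asserts.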
Adding a mode where some of the parallel loop iterations are
executed in SIMD lanes and some across multiple cores with the
target's supported threading mechanism is something to consider, but
it is not yet done (in pocl). The original question was only about
autovectorization, so I'd rather not go there yet. OpenMP was just a
side note from me; sorry for any confusion.
> I'd be interested in knowing what kind of changes we'd need to get
> the OMP+SIMD model into CL-type code, if that's what you're
> proposing...
I'm not sure what you mean by the "OMP+SIMD" model. I was simply
proposing using the existing parallel loop metadata, like pocl does,
to keep the pass responsibilities organized.
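For reference, by parallel loop metadata I mean the existing
llvm.loop / llvm.mem.parallel_loop_access annotations. A schematic
hand-written IR fragment (not pocl's actual output) looks roughly
like this:

```llvm
for.body:
  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
  %p = getelementptr inbounds float, float* %buf, i64 %i
  %v = load float, float* %p, !llvm.mem.parallel_loop_access !0
  %v2 = fadd float %v, 1.0
  store float %v2, float* %p, !llvm.mem.parallel_loop_access !0
  %i.next = add i64 %i, 1
  %cond = icmp ult i64 %i.next, 128
  br i1 %cond, label %for.body, label %exit, !llvm.loop !0

!0 = distinct !{!0}
```

The loop vectorizer can then treat the iterations as independent
without having to re-prove it, which is the whole division of labor
I'm after.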
What I suggested was to consider upstreaming a part of the pocl
compiler (or preferably an improved implementation of it) that
statically identifies the parallel regions and generates a new
function that wraps them in parallel loops
(which are then vectorized, or whatever is best for the target
at hand, by other passes, to keep the chain modular).
From the IR side, I think it at minimum needs a notion of a "barrier
instruction", from which it can do its CFG analysis to identify the
regions. We simply use a dummy function declaration for this now.
BR,
--
--Pekka