Hi Renato,
On 07/07/2015 10:57 PM, Renato Golin wrote:
> Now, IIRC, OpenCL had a lot of trouble from getting odd-sized vector
> types in IR that the middle end would not understand, especially the
> vectorizers. The solution, at least as of 2 years ago, was to
> serialise everything and let the CL back-end to vectorize it.
Perhaps you are referring to the problem of autovectorizing
work-groups whose kernels use implicit vector datatypes
internally?
Yes, this can be handled with (selective) scalarization or
with a vector-variable-aware vectorizer. AFAIK,
there's already a Scalarizer pass in upstream LLVM for this.
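As an illustration (a hand-written sketch in plain C, not what the
Scalarizer pass literally emits -- it works on LLVM IR), scalarization
rewrites one implicit 4-wide vector operation into four independent
scalar operations that the scalar middle-end passes and the loop
vectorizer understand:

```c
/* Hypothetical stand-in for an OpenCL float4 value. */
typedef struct { float s[4]; } float4;

/* Vector form: a single 4-wide add, as a kernel might write it. */
static float4 add_float4(float4 a, float4 b) {
    float4 r;
    for (int i = 0; i < 4; ++i)
        r.s[i] = a.s[i] + b.s[i];
    return r;
}

/* Scalarized form: what a Scalarizer-style pass conceptually
 * produces -- four independent scalar adds, no vector types left. */
static float4 add_float4_scalarized(float4 a, float4 b) {
    float4 r;
    r.s[0] = a.s[0] + b.s[0];
    r.s[1] = a.s[1] + b.s[1];
    r.s[2] = a.s[2] + b.s[2];
    r.s[3] = a.s[3] + b.s[3];
    return r;
}
```

The point is only that after scalarization nothing odd-sized remains,
so the later passes can revectorize (or not) on their own cost model.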
> Since CL back-ends are normally very different from each others, with
> very different cost models, and some secretive implementation details,
> it's very hard to do that generically in the LLVM middle-end.
Of course it's impossible to always cover everything. What pocl
tries to do is the bare minimum that makes it easier for later passes
to do their job, while reusing the standard cost models (such as
those already in the LLVM vectorizers) whenever possible.
> Also, if you have different domains (like many SIMD cores), sharing
> the operations across cores and across lanes may be completely
> different than, say, pthreads vs. AVX, so the model may not even apply
> here. If you need write back loops, non-trivial synchronization
> barriers between cores and other crazy stuff, adding all that to the
> vectorizer would be bloating code beyond usability. On the other hand,
> maybe not.
Instead of implementing a monolithic SPMD-specific kernel vectorizer
that duplicates much of what the simple loop vectorizers already do,
pocl does quite the opposite. All it does is identify the
parallel regions between barriers, mark them as parallel loops, and
let the other passes do what they like with the loops.
Currently we apply the inner loop vectorizer (hopefully e.g. the loop
interchange and other ongoing work will soon improve it) for CPU+SIMD
targets, schedule the inner loops VLIW-style for static multi-issue
using custom backends, and simply leave the original SPMD
representation alone for GPU-like "SPMD targets" (briefly tested in
ongoing experimental HSA support work).
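To make the region-to-loop idea concrete, here is a hedged sketch in
plain C (the names WG_SIZE and workgroup_fn are made up for
illustration; pocl does this on LLVM IR, not C source). Each region
between barriers becomes its own loop over the work-items, and the
barrier becomes the boundary between consecutive loops:

```c
#define WG_SIZE 8  /* illustrative local work-group size */

/* Conceptually the kernel body was:
 *     tmp[wi] = in[wi] * 2.0f;        // region 1
 *     barrier(CLK_LOCAL_MEM_FENCE);
 *     out[wi] = tmp[wi] + tmp[next];  // region 2, reads neighbours
 * The generated work-group function runs each region as a loop over
 * all work-items; later passes may vectorize these parallel loops. */
void workgroup_fn(const float *in, float *out) {
    float tmp[WG_SIZE];
    /* parallel region 1: every work-item finishes before the barrier */
    for (int wi = 0; wi < WG_SIZE; ++wi)
        tmp[wi] = in[wi] * 2.0f;
    /* the barrier falls here: region 1 is complete for all work-items */
    /* parallel region 2: safe to read other work-items' results */
    for (int wi = 0; wi < WG_SIZE; ++wi)
        out[wi] = tmp[wi] + tmp[(wi + 1) % WG_SIZE];
}
```

Note that the cross-work-item read in region 2 is exactly why the
barrier (and thus the region split) is needed; within each region the
iterations are independent, which is what the parallel loop metadata
asserts.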
Adding a mode where some of the parallel loop iterations are
executed in SIMD lanes and some across multiple cores with the
target's supported threading mechanism is something to consider, but
it is not yet done (in pocl). The original question was only about
autovectorization, so I'd rather not go there yet. OpenMP was just a
side note from me; sorry for any confusion.
> I'd be interested in knowing what kind of changes we'd need to get
> the OMP+SIMD model into CL-type code, if that's what you're
> proposing...
I'm not sure what you mean by the "OMP+SIMD" model. I was simply
proposing using the existing parallel loop metadata, like pocl does,
to keep the pass responsibilities organized.
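For reference, by parallel loop metadata I mean the existing
llvm.loop / llvm.mem.parallel_loop_access annotations. A schematic
hand-written IR fragment (not pocl's actual output) looks roughly
like this:

```llvm
for.body:
  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
  %p = getelementptr inbounds float, float* %buf, i64 %i
  %v = load float, float* %p, !llvm.mem.parallel_loop_access !0
  %v2 = fadd float %v, 1.0
  store float %v2, float* %p, !llvm.mem.parallel_loop_access !0
  %i.next = add i64 %i, 1
  %cond = icmp ult i64 %i.next, 128
  br i1 %cond, label %for.body, label %exit, !llvm.loop !0

!0 = distinct !{!0}
```

The loop vectorizer can then treat the iterations as independent
without having to re-prove it, which is the whole division of labor
I'm after.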
What I suggested was to consider upstreaming a part of the pocl
compiler (or preferably an improved implementation of it) that
statically identifies the parallel regions and generates a new
function that wraps them in parallel loops
(which are then vectorized, or whatever is best for the target
at hand, by other passes, to keep the chain modular).
From the IR side, I think it at minimum needs a notion of a "barrier
instruction", from which it can do its CFG analysis to identify the
regions. We simply use a dummy function declaration for this now.
BR,
--
--Pekka