Keno Fischer via llvm-dev
2017-Mar-15 01:00 UTC
[llvm-dev] Data structure improvement for the SLP vectorizer
There was some discussion of this on the llvm-commits list, but I wanted to raise the topic for discussion here. The background of the -commits discussion was that r296863 added the ability to sort memory accesses when the SLP vectorizer reaches a load (the SLP vectorizer starts at a store or some other sink and tries to go up the tree, vectorizing as it goes along - if the input is in a different order than the output, you need a shufflevector somewhere in the middle). However, that commit had to be reverted because the SLP tree is not actually a tree and the same load bundle could be used by several different users, causing problems with representing the required shuffle on the load bundle node (please add any history I'm missing here).

Now, I'm jumping into this because I want to teach the vectorizer about a similar operation, namely a 2->4 (or 4->8 for AVX512) broadcast. This operation also requires a load followed by a shufflevector, so it would be nice to use the same representation for both of these.

Now, the primary problem is that we don't really have anywhere to put this permutation information. The two solutions seem to be duplicating the nodes (and maybe deduplicating them later again when materializing the tree), or taking the hit and representing the graph more explicitly, such that we can annotate information on the actual edges of the graph.

The latter seems nicer to me: we'd essentially have a DAG of scalar bundles with vector shuffles in between them. Right now the vectorizer can't really handle intermediate vector shuffles at all as far as I can tell (with a few hand-coded exceptions for common patterns). From my perspective that seems a bit odd. AVX512, for example, can represent most (all?) shufflevector operations in one instruction, so failing to vectorize a large tree because it would require one of them seems very problematic.

This is also related to my desire for vector->larger vector broadcasting. Especially when targeting AVX512, there can be whole 4-wide subtrees that the SLP vectorizer isn't currently looking at, because it can't handle the 4->8 expansion. Whatever the new representation, I'd like the SLP vectorizer to be able to handle that.

I hope that summarizes the problem. My understanding of the situation is admittedly incomplete - I came to this because I need the vectorizer to correctly handle my narrow-width subtrees, but I'm hoping we can collectively come up with a design that would cover all the relevant use cases.
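To make the edge-annotation idea a bit more concrete, here is a rough C++ sketch of what such a bundle DAG could look like. This is purely illustrative - BundleNode and OperandEdge are made-up names, not the existing SLPVectorizer data structures:

// Rough illustrative sketch only -- these are not the existing
// SLPVectorizer classes; all names here are made up. The idea: nodes are
// bundles of isomorphic scalars, and each use edge optionally carries a
// shuffle mask, so a shared node (e.g. a load bundle) can feed different
// users through different permutations.
#include <vector>

namespace llvm { class Value; } // forward declaration, pointers only

struct BundleNode;

// Edge from a consumer bundle to one of its operand bundles. An empty
// Mask means identity (lane i feeds lane i); a non-empty mask such as
// {0, 1, 0, 1} asks for a shufflevector (here a 2->4 broadcast) to be
// emitted on this edge when the DAG is materialized.
struct OperandEdge {
  BundleNode *Operand = nullptr;
  std::vector<int> Mask;
};

// A bundle of scalar instructions that becomes one vector instruction.
// Because the same bundle can have several users, this is a DAG node,
// and the per-user permutation lives on the edge, not on the node.
struct BundleNode {
  std::vector<llvm::Value *> Scalars;
  std::vector<OperandEdge> Operands; // one edge per operand position
};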
Shahid, Asghar-ahmad via llvm-dev
2017-Mar-15 04:34 UTC
[llvm-dev] Data structure improvement for the SLP vectorizer
Thanks Keno for your interest. My responses are inlined below.

> Now, I'm jumping into this because I want to teach the vectorizer about a
> similar operation, namely a 2->4 (or 4->8 for AVX512) broadcast. This
> operation also requires a load followed by a shufflevector, so it would be
> nice to use the same representation for both of these.

How do you plan to identify or mark a certain "load" bundle for broadcast, as these "loads" presumably will not be unordered?

> The latter seems nicer to me: we'd essentially have a DAG of scalar bundles
> with vector shuffles in between them.

That's right.

> Right now the vectorizer can't really handle intermediate vector shuffles
> at all as far as I can tell (with a few hand-coded exceptions for common
> patterns). From my perspective that seems a bit odd. AVX512, for example,
> can represent most (all?) shufflevector operations in one instruction, so
> failing to vectorize a large tree because it would require one of them
> seems very problematic.

Do you mean widening the vector shuffles itself?

> I hope that summarizes the problem. My understanding of the situation is
> admittedly incomplete - I came to this because I need the vectorizer to
> correctly handle my narrow-width subtrees, but I'm hoping we can
> collectively come up with a design that would cover all the relevant use
> cases.

I did not get this last part; can you please elaborate?

Regards,
Shahid
Keno Fischer via llvm-dev
2017-Mar-15 13:57 UTC
[llvm-dev] Data structure improvement for the SLP vectorizer
Maybe it would be illustrative to give an IR example of the case I'm interested in. Consider:

define void @"julia_transform_bvn_derivs_hessian!"(double* %data, double* %data2, double *%data3, double *%out) {
  %element11 = getelementptr inbounds double, double* %data, i32 1
  %load10 = load double, double* %data
  %load11 = load double, double* %element11
  %element21 = getelementptr inbounds double, double* %data2, i32 1
  %element22 = getelementptr inbounds double, double* %data2, i32 2
  %element23 = getelementptr inbounds double, double* %data2, i32 3
  %load20 = load double, double* %data2
  %load21 = load double, double* %element21
  %load22 = load double, double* %element22
  %load23 = load double, double* %element23
  %element31 = getelementptr inbounds double, double* %data3, i32 1
  %element32 = getelementptr inbounds double, double* %data3, i32 2
  %element33 = getelementptr inbounds double, double* %data3, i32 3
  %load30 = load double, double* %data3
  %load31 = load double, double* %element31
  %load32 = load double, double* %element32
  %load33 = load double, double* %element33
  %mul1 = fmul fast double %load20, %load10
  %mul2 = fmul fast double %load21, %load11
  %mul3 = fmul fast double %load22, %load10
  %mul4 = fmul fast double %load23, %load11
  %add1 = fadd fast double %load30, %mul1
  %add2 = fadd fast double %load31, %mul2
  %add3 = fadd fast double %load32, %mul3
  %add4 = fadd fast double %load33, %mul4
  %out0 = getelementptr inbounds double, double* %out, i32 0
  %out1 = getelementptr inbounds double, double* %out, i32 1
  %out2 = getelementptr inbounds double, double* %out, i32 2
  %out3 = getelementptr inbounds double, double* %out, i32 3
  store double %add1, double *%out0
  store double %add2, double *%out1
  store double %add3, double *%out2
  store double %add4, double *%out3
  ret void
}

Currently the SLP vectorizer generates (for an AVX2 target):

{
  %element11 = getelementptr inbounds double, double* %data, i64 1
  %load10 = load double, double* %data, align 8
  %load11 = load double, double* %element11, align 8
  %1 = bitcast double* %data2 to <4 x double>*
  %2 = load <4 x double>, <4 x double>* %1, align 8
  %3 = bitcast double* %data3 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = insertelement <4 x double> undef, double %load10, i32 0
  %6 = insertelement <4 x double> %5, double %load11, i32 1
  %7 = insertelement <4 x double> %6, double %load10, i32 2
  %8 = insertelement <4 x double> %7, double %load11, i32 3
  %9 = fmul fast <4 x double> %2, %8
  %10 = fadd fast <4 x double> %4, %9
  %11 = bitcast double* %out to <4 x double>*
  store <4 x double> %10, <4 x double>* %11, align 8
  ret void
}

I want it to generate:

{
  %cast = bitcast double* %data to <2 x double>*
  %load = load <2 x double>, <2 x double>* %cast, align 8
  %shuffle = shufflevector <2 x double> %load, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
  %1 = bitcast double* %data2 to <4 x double>*
  %2 = load <4 x double>, <4 x double>* %1, align 8
  %3 = bitcast double* %data3 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = fmul fast <4 x double> %2, %shuffle
  %6 = fadd fast <4 x double> %4, %5
  %7 = bitcast double* %out to <4 x double>*
  store <4 x double> %6, <4 x double>* %7, align 8
  ret void
}

More generally though, rather than the loads, there could be a whole subtree of vectorizable operations. This becomes an especially large problem for larger vector widths, as the penalty for missing the vectorization opportunity becomes larger.
What I was suggesting is that, to handle this case, the load could be a two-element bundle with a (0, 1, 0, 1) shuffle mask on the edge into the fmul. Hope that gives an illustrative example of what I care about and why I think it can be accomplished with the same solution. Please let me know if that didn't answer your questions.
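In terms of the rough BundleNode/OperandEdge sketch from my first mail (again, purely illustrative names, not the real SLPVectorizer classes), the example above would be captured roughly as follows; strings stand in for the scalar instructions so the snippet is self-contained:

// Toy structs repeated from the earlier sketch, with strings standing in
// for the scalar instructions.
#include <string>
#include <vector>

struct BundleNode;
struct OperandEdge { BundleNode *Operand; std::vector<int> Mask; };
struct BundleNode {
  std::vector<std::string> Scalars;
  std::vector<OperandEdge> Operands;
};

int main() {
  // The shared 2-wide load bundle {%load10, %load11}.
  BundleNode DataLoads;
  DataLoads.Scalars = {"%load10", "%load11"};

  // The 4-wide load bundle {%load20 .. %load23}.
  BundleNode Data2Loads;
  Data2Loads.Scalars = {"%load20", "%load21", "%load22", "%load23"};

  // The 4-wide fmul bundle: an identity edge to the 4-wide loads, and an
  // edge to the 2-wide loads carrying the (0, 1, 0, 1) mask. The 2->4
  // broadcast is a property of the edge, not of the load bundle node, so
  // other users of the same loads could use a different permutation.
  BundleNode Muls;
  Muls.Scalars  = {"%mul1", "%mul2", "%mul3", "%mul4"};
  Muls.Operands = {{&Data2Loads, {}}, {&DataLoads, {0, 1, 0, 1}}};
  return 0;
}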