Keno Fischer via llvm-dev
2017-Mar-15 01:00 UTC
[llvm-dev] Data structure improvement for the SLP vectorizer
There was some discussion of this on the llvm-commits list, but I wanted to raise the topic for discussion here. The background of the -commits discussion was that r296863 added the ability to sort memory accesses when the SLP vectorizer reaches a load (the SLP vectorizer starts at a store or some other sink and tries to go up the tree, vectorizing as it goes along - if the input is in a different order than the output, you need a shufflevector somewhere in the middle). However, that commit had to be reverted because the SLP tree is not actually a tree and the same load bundle could be used by several different users, causing problems with representing the required shuffle on the load bundle node (please add any history I'm missing here).

Now, I'm jumping into this because I want to teach the vectorizer about a similar operation, namely a 2->4 (or 4->8 for AVX512) broadcast. This operation also requires a load followed by a shufflevector, so it would be nice to use the same representation for both of these.

Now, the primary problem is that we don't really have anywhere to put this permutation information. The two solutions seem to be duplicating the nodes (and maybe deduplicating them later again when materializing the tree), or taking the hit and representing the graph more explicitly, such that we can annotate information on the actual edges of the graph.

The latter seems nicer to me: we'd essentially have a DAG of scalar bundles with vector shuffles in between them. Right now the vectorizer can't really handle intermediate vector shuffles at all as far as I can tell (with a few hand-coded exceptions for common patterns). From my perspective that seems a bit odd. AVX512, for example, can represent most (all?) shufflevector operations in one instruction, so failing to vectorize a large tree because it would require one of them seems very problematic.

This is also related to my desire for vector->larger vector broadcasting. Especially when targeting AVX512, there can be whole 4-wide subtrees that the SLP vectorizer isn't currently looking at, because it can't handle the 4->8 expansion. Whatever the new representation, I'd like the SLP vectorizer to be able to handle that.

I hope that summarizes the problem. My understanding of the situation is admittedly incomplete - I came to this because I need the vectorizer to correctly handle my narrow-width subtrees, but I'm hoping we can collectively come up with a design that would cover all the relevant use cases.
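To make the edge-annotation idea a bit more concrete, here is a rough C++ sketch of what such a bundle DAG could look like. This is purely illustrative - BundleNode and OperandEdge are made-up names, not the existing SLPVectorizer data structures:

// Rough illustrative sketch only -- these are not the existing
// SLPVectorizer classes; all names here are made up. The idea: nodes are
// bundles of isomorphic scalars, and each use edge optionally carries a
// shuffle mask, so a shared node (e.g. a load bundle) can feed different
// users through different permutations.
#include <vector>

namespace llvm { class Value; } // forward declaration, pointers only

struct BundleNode;

// Edge from a consumer bundle to one of its operand bundles. An empty
// Mask means identity (lane i feeds lane i); a non-empty mask such as
// {0, 1, 0, 1} asks for a shufflevector (here a 2->4 broadcast) to be
// emitted on this edge when the DAG is materialized.
struct OperandEdge {
  BundleNode *Operand = nullptr;
  std::vector<int> Mask;
};

// A bundle of scalar instructions that becomes one vector instruction.
// Because the same bundle can have several users, this is a DAG node,
// and the per-user permutation lives on the edge, not on the node.
struct BundleNode {
  std::vector<llvm::Value *> Scalars;
  std::vector<OperandEdge> Operands; // one edge per operand position
};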
Shahid, Asghar-ahmad via llvm-dev
2017-Mar-15 04:34 UTC
[llvm-dev] Data structure improvement for the SLP vectorizer
Thanks Keno for your interest. My responses are inlined below.

> Now, I'm jumping into this because I want to teach the vectorizer about a
> similar operation, namely a 2->4 (or 4->8 for AVX512) broadcast. This
> operation also requires a load followed by a shufflevector, so it would be
> nice to use the same representation for both of these.

How do you plan to identify or mark a certain "load" bundle for broadcast, as these "loads" presumably will not be unordered?

> The latter seems nicer to me: we'd essentially have a DAG of scalar bundles
> with vector shuffles in between them.

That's right.

> Right now the vectorizer can't really handle intermediate vector shuffles
> at all as far as I can tell (with a few hand-coded exceptions for common
> patterns). From my perspective that seems a bit odd. AVX512, for example,
> can represent most (all?) shufflevector operations in one instruction, so
> failing to vectorize a large tree because it would require one of them
> seems very problematic.

Do you mean widening the vector shuffles itself?

> I hope that summarizes the problem. My understanding of the situation is
> admittedly incomplete - I came to this because I need the vectorizer to
> correctly handle my narrow-width subtrees, but I'm hoping we can
> collectively come up with a design that would cover all the relevant use
> cases.

I did not get this last part; can you please elaborate?

Regards,
Shahid
Keno Fischer via llvm-dev
2017-Mar-15 13:57 UTC
[llvm-dev] Data structure improvement for the SLP vectorizer
Maybe it would be illustrative to give an IR example of the case I'm interested in. Consider:

define void @"julia_transform_bvn_derivs_hessian!"(double* %data, double* %data2, double *%data3, double *%out) {
  %element11 = getelementptr inbounds double, double* %data, i32 1
  %load10 = load double, double* %data
  %load11 = load double, double* %element11
  %element21 = getelementptr inbounds double, double* %data2, i32 1
  %element22 = getelementptr inbounds double, double* %data2, i32 2
  %element23 = getelementptr inbounds double, double* %data2, i32 3
  %load20 = load double, double* %data2
  %load21 = load double, double* %element21
  %load22 = load double, double* %element22
  %load23 = load double, double* %element23
  %element31 = getelementptr inbounds double, double* %data3, i32 1
  %element32 = getelementptr inbounds double, double* %data3, i32 2
  %element33 = getelementptr inbounds double, double* %data3, i32 3
  %load30 = load double, double* %data3
  %load31 = load double, double* %element31
  %load32 = load double, double* %element32
  %load33 = load double, double* %element33
  %mul1 = fmul fast double %load20, %load10
  %mul2 = fmul fast double %load21, %load11
  %mul3 = fmul fast double %load22, %load10
  %mul4 = fmul fast double %load23, %load11
  %add1 = fadd fast double %load30, %mul1
  %add2 = fadd fast double %load31, %mul2
  %add3 = fadd fast double %load32, %mul3
  %add4 = fadd fast double %load33, %mul4
  %out0 = getelementptr inbounds double, double* %out, i32 0
  %out1 = getelementptr inbounds double, double* %out, i32 1
  %out2 = getelementptr inbounds double, double* %out, i32 2
  %out3 = getelementptr inbounds double, double* %out, i32 3
  store double %add1, double *%out0
  store double %add2, double *%out1
  store double %add3, double *%out2
  store double %add4, double *%out3
  ret void
}

Currently the SLP vectorizer generates (for an AVX2 target):

{
  %element11 = getelementptr inbounds double, double* %data, i64 1
  %load10 = load double, double* %data, align 8
  %load11 = load double, double* %element11, align 8
  %1 = bitcast double* %data2 to <4 x double>*
  %2 = load <4 x double>, <4 x double>* %1, align 8
  %3 = bitcast double* %data3 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = insertelement <4 x double> undef, double %load10, i32 0
  %6 = insertelement <4 x double> %5, double %load11, i32 1
  %7 = insertelement <4 x double> %6, double %load10, i32 2
  %8 = insertelement <4 x double> %7, double %load11, i32 3
  %9 = fmul fast <4 x double> %2, %8
  %10 = fadd fast <4 x double> %4, %9
  %11 = bitcast double* %out to <4 x double>*
  store <4 x double> %10, <4 x double>* %11, align 8
  ret void
}

I want it to generate:

{
  %cast = bitcast double* %data to <2 x double>*
  %load = load <2 x double>, <2 x double>* %cast, align 8
  %shuffle = shufflevector <2 x double> %load, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
  %1 = bitcast double* %data2 to <4 x double>*
  %2 = load <4 x double>, <4 x double>* %1, align 8
  %3 = bitcast double* %data3 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = fmul fast <4 x double> %2, %shuffle
  %6 = fadd fast <4 x double> %4, %5
  %7 = bitcast double* %out to <4 x double>*
  store <4 x double> %6, <4 x double>* %7, align 8
  ret void
}

More generally though, rather than the loads, there could be a whole subtree of vectorizable operations. This becomes an especially large problem for larger vector widths, as the penalty for missing the vectorization opportunity becomes larger.
What I was suggesting is that, to handle this case, the load could be a two-element bundle with a (0, 1, 0, 1) shuffle mask on the edge into the fmul. Hope that gives an illustrative example of what I care about and why I think it can be accomplished with the same solution. Please let me know if that didn't answer your questions.
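In terms of the rough BundleNode/OperandEdge sketch from my first mail (again, purely illustrative names, not the real SLPVectorizer classes), the example above would be captured roughly as follows; strings stand in for the scalar instructions so the snippet is self-contained:

// Toy structs repeated from the earlier sketch, with strings standing in
// for the scalar instructions.
#include <string>
#include <vector>

struct BundleNode;
struct OperandEdge { BundleNode *Operand; std::vector<int> Mask; };
struct BundleNode {
  std::vector<std::string> Scalars;
  std::vector<OperandEdge> Operands;
};

int main() {
  // The shared 2-wide load bundle {%load10, %load11}.
  BundleNode DataLoads;
  DataLoads.Scalars = {"%load10", "%load11"};

  // The 4-wide load bundle {%load20 .. %load23}.
  BundleNode Data2Loads;
  Data2Loads.Scalars = {"%load20", "%load21", "%load22", "%load23"};

  // The 4-wide fmul bundle: an identity edge to the 4-wide loads, and an
  // edge to the 2-wide loads carrying the (0, 1, 0, 1) mask. The 2->4
  // broadcast is a property of the edge, not of the load bundle node, so
  // other users of the same loads could use a different permutation.
  BundleNode Muls;
  Muls.Scalars  = {"%mul1", "%mul2", "%mul3", "%mul4"};
  Muls.Operands = {{&Data2Loads, {}}, {&DataLoads, {0, 1, 0, 1}}};
  return 0;
}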