Keno Fischer via llvm-dev
2017-Mar-15 13:57 UTC
[llvm-dev] Data structure improvement for the SLP vectorizer
Maybe it would be illustrative to give an IR example of the case I'm interested in. Consider:

define void @"julia_transform_bvn_derivs_hessian!"(double* %data, double* %data2,
                                                   double* %data3, double* %out) {
  %element11 = getelementptr inbounds double, double* %data, i32 1
  %load10 = load double, double* %data
  %load11 = load double, double* %element11

  %element21 = getelementptr inbounds double, double* %data2, i32 1
  %element22 = getelementptr inbounds double, double* %data2, i32 2
  %element23 = getelementptr inbounds double, double* %data2, i32 3
  %load20 = load double, double* %data2
  %load21 = load double, double* %element21
  %load22 = load double, double* %element22
  %load23 = load double, double* %element23

  %element31 = getelementptr inbounds double, double* %data3, i32 1
  %element32 = getelementptr inbounds double, double* %data3, i32 2
  %element33 = getelementptr inbounds double, double* %data3, i32 3
  %load30 = load double, double* %data3
  %load31 = load double, double* %element31
  %load32 = load double, double* %element32
  %load33 = load double, double* %element33

  %mul1 = fmul fast double %load20, %load10
  %mul2 = fmul fast double %load21, %load11
  %mul3 = fmul fast double %load22, %load10
  %mul4 = fmul fast double %load23, %load11

  %add1 = fadd fast double %load30, %mul1
  %add2 = fadd fast double %load31, %mul2
  %add3 = fadd fast double %load32, %mul3
  %add4 = fadd fast double %load33, %mul4

  %out0 = getelementptr inbounds double, double* %out, i32 0
  %out1 = getelementptr inbounds double, double* %out, i32 1
  %out2 = getelementptr inbounds double, double* %out, i32 2
  %out3 = getelementptr inbounds double, double* %out, i32 3

  store double %add1, double* %out0
  store double %add2, double* %out1
  store double %add3, double* %out2
  store double %add4, double* %out3

  ret void
}

Currently the SLP vectorizer generates (for an avx2 target):

{
  %element11 = getelementptr inbounds double, double* %data, i64 1
  %load10 = load double, double* %data, align 8
  %load11 = load double, double* %element11, align 8
  %1 = bitcast double* %data2 to <4 x double>*
  %2 = load <4 x double>, <4 x double>* %1, align 8
  %3 = bitcast double* %data3 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = insertelement <4 x double> undef, double %load10, i32 0
  %6 = insertelement <4 x double> %5, double %load11, i32 1
  %7 = insertelement <4 x double> %6, double %load10, i32 2
  %8 = insertelement <4 x double> %7, double %load11, i32 3
  %9 = fmul fast <4 x double> %2, %8
  %10 = fadd fast <4 x double> %4, %9
  %11 = bitcast double* %out to <4 x double>*
  store <4 x double> %10, <4 x double>* %11, align 8
  ret void
}

I want it to generate:

{
  %cast = bitcast double* %data to <2 x double>*
  %load = load <2 x double>, <2 x double>* %cast, align 8
  %shuffle = shufflevector <2 x double> %load, <2 x double> undef,
             <4 x i32> <i32 0, i32 1, i32 0, i32 1>
  %1 = bitcast double* %data2 to <4 x double>*
  %2 = load <4 x double>, <4 x double>* %1, align 8
  %3 = bitcast double* %data3 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = fmul fast <4 x double> %2, %shuffle
  %6 = fadd fast <4 x double> %4, %5
  %7 = bitcast double* %out to <4 x double>*
  store <4 x double> %6, <4 x double>* %7, align 8
  ret void
}

More generally though, rather than the loads, there could be a whole subtree of vectorizable operations. This becomes an especially large problem at larger vector widths, as the penalty for missing the vectorization opportunity grows.
What I was suggesting is that, to handle this case, the load could be a two-element bundle with a (0, 1, 0, 1) shuffle mask on the edge into the fmul. Hope that gives an illustrative example of what I care about and why I think it can be accomplished with the same solution. Please let me know if that didn't answer your questions.
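To make that concrete, here is a minimal C++ sketch of the kind of representation I mean: bundles as nodes in a DAG, with the shuffle mask carried on the use edge rather than baked into the node. All type and field names below are invented for illustration; this is not the SLP vectorizer's actual data structure.

#include <string>
#include <vector>

struct Bundle;

// An operand edge: which bundle produces the operand, plus a lane mask
// describing how the producer's lanes feed the consumer's lanes.
struct UseEdge {
  Bundle *Producer;
  std::vector<int> Mask; // e.g. {0, 1, 0, 1}; empty means identity
};

// A bundle of isomorphic scalar instructions. Scalar names stand in for
// the Instruction* pointers real code would hold.
struct Bundle {
  std::vector<std::string> Scalars;
  std::vector<UseEdge> Operands; // one entry per operand position
};

int main() {
  // The example above: a 2-wide load bundle feeding a 4-wide fmul bundle.
  // The (0, 1, 0, 1) mask on the edge is what would later lower to the
  // shufflevector in the desired IR.
  Bundle Load2{{"%load10", "%load11"}, {}};
  Bundle Mul4{{"%mul1", "%mul2", "%mul3", "%mul4"},
              {{&Load2, {0, 1, 0, 1}}}};
  (void)Mul4;
}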
Shahid, Asghar-ahmad via llvm-dev
2017-Mar-16 05:11 UTC
[llvm-dev] Data structure improvement for the SLP vectorizer
Comments inlined.

> -----Original Message-----
> From: Keno Fischer [mailto:keno at juliacomputing.com]
> Sent: Wednesday, March 15, 2017 7:27 PM
> To: Shahid, Asghar-ahmad <Asghar-ahmad.Shahid at amd.com>
> Cc: llvm-dev at lists.llvm.org; Michael Kuperstein <mkuper at google.com>;
> Matthias Braun <matze at braunis.de>
> Subject: Re: Data structure improvement for the SLP vectorizer
>
> Maybe it would be illustrative to give an IR example of the case I'm
> interested in. [IR example snipped]
>
> Currently the SLP vectorizer generates (for an avx2 target):
> [vectorized IR snipped]

This is because currently the broadcast elements are being handled as scalars to be "gathered".

> I want it to generate:
>
> {
>   %cast = bitcast double* %data to <2 x double>*
>   %load = load <2 x double>, <2 x double>* %cast, align 8
>   %shuffle = shufflevector <2 x double> %load, <2 x double> undef,
>              <4 x i32> <i32 0, i32 1, i32 0, i32 1>
>   [rest snipped]
> }

Here, %load should be a 4-element load instead of 2, and you can still do the required broadcast with the above shuffle. Why this is important is that with our scheme of having a DAG with masks on edges to a single tree node, generating a vector load of lesser length than the chosen vector factor will be problematic. If we want to do so, we would require another tree entry with a lesser number of scalar loads.

> More generally though, rather than the loads, there could be a whole
> subtree of vectorizable operations. This becomes an especially large
> problem at larger vector widths, as the penalty for missing the
> vectorization opportunity grows.
>
> What I was suggesting is that, to handle this case, the load could be a
> two-element bundle with a (0, 1, 0, 1) shuffle mask on the edge into the
> fmul. Hope that gives an illustrative example of what I care about and
> why I think it can be accomplished with the same solution. Please let me
> know if that didn't answer your questions.

Yes, that sums up the problem. IMO, this can be achieved by checking for the broadcast case: this is essentially looking for partial consecutiveness of the memory accesses and having a proper mask for that on the edge from this USE node to the DEF node.

Regards,
Shahid
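As a rough illustration of the broadcast check described above, i.e. looking for partial consecutiveness of the memory accesses and deriving the edge mask from it, here is a hedged C++ sketch. The function name and the offset-based interface are invented for this example and are not SLP vectorizer code.

#include <algorithm>
#include <cstdio>
#include <vector>

// If the offsets a would-be gather reads from a common base pointer all
// fall in a consecutive range [Min, Min+W) that is narrower than the
// group, the group can become one W-wide load plus a shuffle whose mask
// entry for each element is (Offset - Min). Returns the mask, or empty
// if the pattern does not apply.
std::vector<int> partialConsecutiveMask(const std::vector<int> &Offsets) {
  if (Offsets.empty())
    return {};
  auto MM = std::minmax_element(Offsets.begin(), Offsets.end());
  int Min = *MM.first;
  int Width = *MM.second - Min + 1;
  if (Width >= (int)Offsets.size())
    return {}; // no reuse: fully consecutive already, or too sparse
  // Require every lane of the narrow load to actually be read; a
  // conservative profitability condition for this sketch.
  std::vector<bool> Seen(Width, false);
  for (int O : Offsets)
    Seen[O - Min] = true;
  for (bool B : Seen)
    if (!B)
      return {};
  std::vector<int> Mask;
  Mask.reserve(Offsets.size());
  for (int O : Offsets)
    Mask.push_back(O - Min);
  return Mask;
}

int main() {
  // The example from this thread: %load10, %load11, %load10, %load11
  // read offsets 0, 1, 0, 1 from %data, giving a 2-wide load and a
  // (0, 1, 0, 1) mask.
  for (int Lane : partialConsecutiveMask({0, 1, 0, 1}))
    std::printf("%d ", Lane); // prints: 0 1 0 1
  std::printf("\n");
}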
Keno Fischer via llvm-dev
2017-Mar-16 12:41 UTC
[llvm-dev] Data structure improvement for the SLP vectorizer
On Thu, Mar 16, 2017 at 1:11 AM, Shahid, Asghar-ahmad
<Asghar-ahmad.Shahid at amd.com> wrote:
> Here, %load should be a 4-element load instead of 2, and you can still
> do the required broadcast with the above shuffle. Why this is important
> is that with our scheme of having a DAG with masks on edges to a single
> tree node, generating a vector load of lesser length than the chosen
> vector factor will be problematic.

Could you elaborate on why you think this is? There doesn't seem to be a problem to me with having a 2-element bundle and then putting (0, 1, 0, 1) on the edge to a 4-element bundle, but I may be missing something.

Thanks,
Keno
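To spell out why I don't see an inconsistency: the well-formedness condition for such a mixed-width edge seems easy to state. A hypothetical C++ helper, invented for illustration and not LLVM code: the mask needs one entry per consumer lane, and each entry must name a valid producer lane, which the 2-into-4 case satisfies.

#include <cassert>
#include <cstddef>
#include <vector>

// An edge mask is well-formed when it has one entry per consumer lane
// and every entry indexes a valid lane of the (possibly narrower)
// producer bundle.
bool edgeMaskIsWellFormed(size_t ProducerWidth, size_t ConsumerWidth,
                          const std::vector<int> &Mask) {
  if (Mask.size() != ConsumerWidth)
    return false;
  for (int Lane : Mask)
    if (Lane < 0 || (size_t)Lane >= ProducerWidth)
      return false;
  return true;
}

int main() {
  // A 2-wide load bundle feeding a 4-wide fmul bundle via (0, 1, 0, 1)
  // passes the check.
  assert(edgeMaskIsWellFormed(2, 4, {0, 1, 0, 1}));
}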