I was looking at the code generated from the following c code and noticed extra loads in the inner-loop of these nested for-loops: #define DIM 8 #define UNROLL_DIM DIM typedef double InArray[DIM][DIM]; void f1( InArray c, InArray a, InArray b ) { #pragma clang loop unroll_count(UNROLL_DIM) for( int i=0;i<DIM;i++) #pragma clang loop unroll_count(UNROLL_DIM) for( int j=0;j<DIM;j++) #pragma clang loop unroll_count(UNROLL_DIM) for( int k=0;k<DIM;k++) { c[i][k] = c[i][k] + a[i][j]*b[j][k]; } } In the inner-most loop there, the generated code (and in the .ll as well) loads a[i][j] every time. In this case I've unrolled the loops, but it's the same situation if they're not unrolled. Using -O3 to compile, this is the .ll that results (just showing 2 iterations of the unrolled inner loop): define void @f1([8 x double]* nocapture %c, [8 x double]* nocapture readonly %a, [8 x double]* nocapture readonly %b) #0 { entry: %arrayidx8 = getelementptr inbounds [8 x double]* %c, i64 0, i64 0 %arrayidx12 = getelementptr inbounds [8 x double]* %a, i64 0, i64 0 %0 = load double* %arrayidx8, align 8, !tbaa !1 %1 = load double* %arrayidx12, align 8, !tbaa !1 %arrayidx16 = getelementptr inbounds [8 x double]* %b, i64 0, i64 0 %2 = load double* %arrayidx16, align 8, !tbaa !1 %mul = fmul double %1, %2 %add = fadd double %0, %mul store double %add, double* %arrayidx8, align 8, !tbaa !1 %arrayidx8.1 = getelementptr inbounds [8 x double]* %c, i64 0, i64 1 %3 = load double* %arrayidx8.1, align 8, !tbaa !1 %4 = load double* %arrayidx12, align 8, !tbaa !1 #EXTRA LOAD, could reuse %1! %arrayidx16.1 = getelementptr inbounds [8 x double]* %b, i64 0, i64 1 %5 = load double* %arrayidx16.1, align 8, !tbaa !1 %mul.1 = fmul double %4, %5 %add.1 = fadd double %3, %mul.1 store double %add.1, double* %arrayidx8.1, align 8, !tbaa !1 ... Note the line with the comment at the end: #EXTRA LOAD, could reuse %1 This loading from a[i][j] happens again for each iteration and seems quite inefficient. I changed the C code to explicitly do the load of a[i][j] outside of the innermost loop and that (as would be expected) eliminates the extra load: void f1( InArray c, InArray a, InArray b ) { int a_i_j; #pragma clang loop unroll_count(UNROLL_DIM) for(int i=0;i<DIM;i++){ #pragma clang loop unroll_count(UNROLL_DIM) for(int j=0;j<DIM;j++) { a_i_j = a[i][j]; #pragma clang loop unroll_count(UNROLL_DIM) for(int k=0;k<DIM;k++) { c[i][k] = c[i][k] + a_i_j*b[j][k]; } } } } I guess I'm a bit surprised that -O3 wouldn't automatically do what I've done in the second version of the C code when generating code from the first version? Is there a specific optimization that can be called to do this? (we're using LLVM 3.6 - maybe this is something that's done in later versions?) Phil -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160711/dccc984f/attachment.html>
Krzysztof Parzyszek via llvm-dev
2016-Jul-11 21:20 UTC
[llvm-dev] extra loads in nested for-loop
On 7/11/2016 3:27 PM, Phil Tomson via llvm-dev wrote:> > I guess I'm a bit surprised that -O3 wouldn't automatically do what I've > done in the second version of the C code when generating code from the > first version?This is most likely because a, b, and c are assumed to be aliased. -Krzysztof -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
On Mon, Jul 11, 2016 at 2:20 PM, Krzysztof Parzyszek via llvm-dev < llvm-dev at lists.llvm.org> wrote:> On 7/11/2016 3:27 PM, Phil Tomson via llvm-dev wrote: > >> >> I guess I'm a bit surprised that -O3 wouldn't automatically do what I've >> done in the second version of the C code when generating code from the >> first version? >> > > This is most likely because a, b, and c are assumed to be aliased. >In the context of this particular function wouldn't alias analysis be able to figure out that there is no alias to a? Although, I guess considering the possibility of multi-threading it might be difficult to make that determination in all other parts of the program. Phil> > -Krzysztof > > -- > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted > by The Linux Foundation > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160711/d4328410/attachment.html>