thr3ads.net - llvm dev - [llvm-dev] extra loads in nested for-loop [Jul 2016]

If this information is useful, please help other people find it:
Share via:

Phil Tomson via llvm-dev

2016-Jul-11 20:27 UTC

[llvm-dev] extra loads in nested for-loop

I was looking at the code generated from the following c code and noticed
extra loads in the inner-loop of these nested for-loops:

#define DIM 8
#define UNROLL_DIM DIM
typedef double InArray[DIM][DIM];

void f1( InArray c, InArray a, InArray b ) {
#pragma clang loop unroll_count(UNROLL_DIM)
    for( int i=0;i<DIM;i++)
#pragma clang loop unroll_count(UNROLL_DIM)
        for( int j=0;j<DIM;j++)
#pragma clang loop  unroll_count(UNROLL_DIM)
            for( int k=0;k<DIM;k++) {
                c[i][k] = c[i][k] + a[i][j]*b[j][k];
            }
}

In the inner-most loop there, the generated code (and in the .ll as well)
loads a[i][j] every time. In this case I've unrolled the loops, but it's
the same situation if they're not unrolled.

Using -O3 to compile,  this is the .ll that results (just showing 2
iterations of the unrolled inner loop):

define void @f1([8 x double]* nocapture %c, [8 x double]* nocapture
readonly %a, [8 x double]* nocapture readonly %b) #0 {
entry:
  %arrayidx8 = getelementptr inbounds [8 x double]* %c, i64 0, i64 0
  %arrayidx12 = getelementptr inbounds [8 x double]* %a, i64 0, i64 0
  %0 = load double* %arrayidx8, align 8, !tbaa !1
  %1 = load double* %arrayidx12, align 8, !tbaa !1
  %arrayidx16 = getelementptr inbounds [8 x double]* %b, i64 0, i64 0
  %2 = load double* %arrayidx16, align 8, !tbaa !1
  %mul = fmul double %1, %2
  %add = fadd double %0, %mul
  store double %add, double* %arrayidx8, align 8, !tbaa !1
  %arrayidx8.1 = getelementptr inbounds [8 x double]* %c, i64 0, i64 1
  %3 = load double* %arrayidx8.1, align 8, !tbaa !1
  %4 = load double* %arrayidx12, align 8, !tbaa !1        #EXTRA LOAD,
could reuse %1!
  %arrayidx16.1 = getelementptr inbounds [8 x double]* %b, i64 0, i64 1
  %5 = load double* %arrayidx16.1, align 8, !tbaa !1
  %mul.1 = fmul double %4, %5
  %add.1 = fadd double %3, %mul.1
  store double %add.1, double* %arrayidx8.1, align 8, !tbaa !1
...

Note the line with the comment at the end: #EXTRA LOAD, could reuse %1
This loading from a[i][j] happens again for each iteration and seems quite
inefficient.

I changed the C code to explicitly do the load of a[i][j] outside of the
innermost loop and that (as would be expected) eliminates the extra load:

void f1( InArray c, InArray a, InArray b ) {
int a_i_j;
#pragma clang loop unroll_count(UNROLL_DIM)
    for(int i=0;i<DIM;i++){
#pragma clang loop unroll_count(UNROLL_DIM)
        for(int j=0;j<DIM;j++) {
           a_i_j = a[i][j];
#pragma clang loop  unroll_count(UNROLL_DIM)
            for(int k=0;k<DIM;k++) {
                c[i][k] = c[i][k] + a_i_j*b[j][k];
            }
        }
    }
}


I guess I'm a bit surprised that -O3 wouldn't automatically do what
I've
done in the second version of the C code when generating code from the
first version? Is there a specific optimization that can be called to do
this?

(we're using LLVM 3.6 - maybe this is something that's done in later
versions?)

Phil
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160711/dccc984f/attachment.html>

Krzysztof Parzyszek via llvm-dev

2016-Jul-11 21:20 UTC

head link

[llvm-dev] extra loads in nested for-loop

On 7/11/2016 3:27 PM, Phil Tomson via llvm-dev wrote:>
> I guess I'm a bit surprised that -O3 wouldn't automatically do what
I've
> done in the second version of the C code when generating code from the
> first version?
This is most likely because a, b, and c are assumed to be aliased.

-Krzysztof

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
hosted by The Linux Foundation

Phil Tomson via llvm-dev

2016-Jul-11 23:21 UTC

head link

[llvm-dev] extra loads in nested for-loop

On Mon, Jul 11, 2016 at 2:20 PM, Krzysztof Parzyszek via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On 7/11/2016 3:27 PM, Phil Tomson via llvm-dev wrote:
>
>>
>> I guess I'm a bit surprised that -O3 wouldn't automatically do
what I've
>> done in the second version of the C code when generating code from the
>> first version?
>>
>
> This is most likely because a, b, and c are assumed to be aliased.
>
In the context of this particular function wouldn't alias analysis be able
to figure out that there is no alias to a? Although, I guess considering
the possibility of multi-threading it might be difficult to make that
determination in all other parts of the program.

Phil
>
> -Krzysztof
>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted
> by The Linux Foundation
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160711/d4328410/attachment.html>

Apparently Analagous Threads

Search for more possibly parallel threads

llvm dev - Jul 2016 - extra loads in nested for-loop

[llvm-dev] extra loads in nested for-loop

[llvm-dev] extra loads in nested for-loop

[llvm-dev] extra loads in nested for-loop

Apparently Analagous Threads