Timofeev, Alexander via llvm-dev
2016-Nov-16 15:28 UTC
[llvm-dev] Bitwise AND blocks re-association
Hi All,

I'm writing to ask for a suggestion. I'm working on the AMDGPU project. I was
recently analyzing our backend performance looking for optimization
opportunities and found that the LoadStoreVectorizer is unable to handle a
simple case. A deeper look revealed that the reason was an i64-to-i32
truncation. The OpenCL code looks like this:

  __kernel void read_linear(__global float *input, __global float *output)
  {
    float val = 0.0f;
    uint gid = get_global_id(0);

    val = val + input[gid + 0];
    val = val + input[gid + 1];
    val = val + input[gid + 2];
    val = val + input[gid + 3];
    val = val + input[gid + 4];
    *** and so on... ***

    output[gid] = val;
  }

I'd expect the loads to be combined, since they all have the same base and
the offsets are constant and consecutively increasing. That is exactly what
happens if I change 'uint' to 'ulong' in the assignment: uint gid =
get_global_id(0) => ulong gid = get_global_id(0) - the loads are combined as
expected.

The reason is simple: OpenCL get_global_id(uint) returns i64, the result is
assigned to an i32, and Clang expresses the i64-to-i32 truncation as an AND
with 0xFFFFFFFF:

  %call = tail call i64 @_Z13get_global_idj(i32 0) #2
  %idxprom = and i64 %call, 4294967295
  %arrayidx = getelementptr inbounds float, float addrspace(1)* %input, i64 %idxprom
  %0 = load float, float addrspace(1)* %arrayidx, align 4, !tbaa !7

So the LoadStoreVectorizer cannot combine it with the next load:

  %add2 = add i64 %call, 1
  %idxprom3 = and i64 %add2, 4294967295
  %arrayidx4 = getelementptr inbounds float, float addrspace(1)* %input, i64 %idxprom3
  %1 = load float, float addrspace(1)* %arrayidx4, align 4, !tbaa !7

Basically we have the case C1 & (A + C2) => (C1 & A) + (C1 & C2), which is
not always legal. In our case it really is trunc(A + C) => trunc(A) +
trunc(C), which is legal.

NaryReassociate cannot handle this because it only considers ADD, MUL and
GEP. Reassociate cannot handle it either, because it explicitly looks for
"X&~X == 0, X|~X == -1" or "Y ^ X^X -> Y".

We definitely should be able to optimize 0xFFFFFFFF & (A + 1) to
(0xFFFFFFFF & A) + 1, given that a bitwise AND with 0xFFFFFFFF is a special
case. Maybe we also want to enhance the handling of bitwise operations to
cover more general cases.

I'd like to know your opinion on this topic.

Thanks in advance,
Alexander.
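To make the legality distinction above concrete, a minimal standalone C
sketch (the constants are arbitrary illustrative values, not taken from the
mail): the pure i32 truncation identity trunc(A + C) == trunc(A) + trunc(C)
always holds because i32 addition is modular, while distributing the i64 mask
over the add breaks as soon as the low 32 bits of A + C wrap.

  #include <assert.h>
  #include <stdint.h>

  int main(void) {
    uint64_t a = 0x1FFFFFFFFULL;  /* arbitrary value whose low 32 bits are all ones */
    uint64_t c = 1;

    /* trunc(A + C) == trunc(A) + trunc(C): always holds, because
       i32 addition is modular. */
    assert((uint32_t)(a + c) == (uint32_t)((uint32_t)a + (uint32_t)c));

    /* C1 & (A + C2) == (C1 & A) + (C1 & C2): not always legal in i64.
       Here the left side is 0 but the right side is 0x100000000,
       because the low 32 bits of a + c wrapped. */
    uint64_t lhs = 0xFFFFFFFFULL & (a + c);
    uint64_t rhs = (0xFFFFFFFFULL & a) + (0xFFFFFFFFULL & c);
    assert(lhs != rhs);

    return 0;
  }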
Friedman, Eli via llvm-dev
2016-Nov-16 19:07 UTC
[llvm-dev] Bitwise AND blocks re-association
On 11/16/2016 7:28 AM, Timofeev, Alexander via llvm-dev wrote:
> [...]
> We definitely should be able to optimize 0xFFFFFFFF & (A + 1) to
> (0xFFFFFFFF & A) + 1, given that a bitwise AND with 0xFFFFFFFF is a
> special case. Maybe we also want to enhance the handling of bitwise
> operations to cover more general cases.

I'm not sure what you're expecting reassociation to do here. If you can
prove "gid + 1" doesn't wrap in your example (because of some property of
the get_global_id() function), the AND is a no-op, so you can just kill it.
And if you can't prove "gid + 1" doesn't wrap, your proposed transformation
is illegal.

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
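Eli's observation can be illustrated the same way: under the assumption that
gid + 1 cannot wrap past 32 bits (a stand-in here for whatever guarantee
get_global_id() actually provides), the AND after the add is a no-op, so the
two indices really are gid and gid + 1; without that guarantee the mask
changes the result at the wrap point, so it cannot simply be dropped or
distributed. A minimal C sketch:

  #include <assert.h>
  #include <stdint.h>

  /* Index computation as Clang emits it: mask applied after the add. */
  static uint64_t masked_index(uint64_t gid, uint64_t offset) {
    return (gid + offset) & 0xFFFFFFFFULL;
  }

  int main(void) {
    /* Assumption: gid + 1 does not wrap the low 32 bits, i.e.
       gid < 0xFFFFFFFF.  Under that assumption the AND is a no-op
       and the indices are consecutive. */
    for (uint64_t gid = 0; gid < 1000; ++gid) {
      assert(masked_index(gid, 0) == gid);
      assert(masked_index(gid, 1) == gid + 1);
    }

    /* Without the assumption the AND matters: at the wrap point the
       "consecutive" index collapses to 0, so dropping or distributing
       the mask would change behaviour. */
    assert(masked_index(0xFFFFFFFFULL, 1) == 0);

    return 0;
  }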