Timofeev, Alexander via llvm-dev
2016-Nov-16 15:28 UTC
[llvm-dev] Bitwise AND blocks re-association
Hi All,

I'm writing to ask for a suggestion. I'm working on the AMDGPU project. I was
recently analyzing our backend performance looking for optimization
opportunities and found that the LoadStoreVectorizer is unable to handle a
simple case. A deeper look revealed that the reason was an i64-to-i32
truncation. The OpenCL code looks like this:

  __kernel void read_linear(__global float *input, __global float *output)
  {
    float val = 0.0f;
    uint gid = get_global_id(0);

    val = val + input[gid + 0];
    val = val + input[gid + 1];
    val = val + input[gid + 2];
    val = val + input[gid + 3];
    val = val + input[gid + 4];
    *** and so on... ***

    output[gid] = val;
  }

I'd expect the loads to be combined, since they all have the same base and
the offsets are constant and consecutively increasing. That is exactly what
happens if I change 'uint' to 'ulong' in the assignment: uint gid =
get_global_id(0) => ulong gid = get_global_id(0) - the loads are combined as
expected.

The reason is simple: OpenCL get_global_id(uint) returns i64, the result is
assigned to an i32, and Clang expresses the i64-to-i32 truncation as an AND
with 0xFFFFFFFF:

  %call = tail call i64 @_Z13get_global_idj(i32 0) #2
  %idxprom = and i64 %call, 4294967295
  %arrayidx = getelementptr inbounds float, float addrspace(1)* %input, i64 %idxprom
  %0 = load float, float addrspace(1)* %arrayidx, align 4, !tbaa !7

So the LoadStoreVectorizer cannot combine it with the next load:

  %add2 = add i64 %call, 1
  %idxprom3 = and i64 %add2, 4294967295
  %arrayidx4 = getelementptr inbounds float, float addrspace(1)* %input, i64 %idxprom3
  %1 = load float, float addrspace(1)* %arrayidx4, align 4, !tbaa !7

Basically we have the case C1 & (A + C2) => (C1 & A) + (C1 & C2), which is
not always legal. In our case it really is trunc(A + C) => trunc(A) +
trunc(C), which is legal.

NaryReassociate cannot handle this because it only considers ADD, MUL and
GEP. Reassociate cannot handle it either, because it explicitly looks for
"X&~X == 0, X|~X == -1" or "Y ^ X^X -> Y".

We definitely should be able to optimize 0xFFFFFFFF & (A + 1) to
(0xFFFFFFFF & A) + 1, given that a bitwise AND with 0xFFFFFFFF is a special
case. Maybe we also want to enhance the handling of bitwise operations to
cover more general cases.

I'd like to know your opinion on this topic.

Thanks in advance,
Alexander.
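To make the legality distinction above concrete, a minimal standalone C
sketch (the constants are arbitrary illustrative values, not taken from the
mail): the pure i32 truncation identity trunc(A + C) == trunc(A) + trunc(C)
always holds because i32 addition is modular, while distributing the i64 mask
over the add breaks as soon as the low 32 bits of A + C wrap.

  #include <assert.h>
  #include <stdint.h>

  int main(void) {
    uint64_t a = 0x1FFFFFFFFULL;  /* arbitrary value whose low 32 bits are all ones */
    uint64_t c = 1;

    /* trunc(A + C) == trunc(A) + trunc(C): always holds, because
       i32 addition is modular. */
    assert((uint32_t)(a + c) == (uint32_t)((uint32_t)a + (uint32_t)c));

    /* C1 & (A + C2) == (C1 & A) + (C1 & C2): not always legal in i64.
       Here the left side is 0 but the right side is 0x100000000,
       because the low 32 bits of a + c wrapped. */
    uint64_t lhs = 0xFFFFFFFFULL & (a + c);
    uint64_t rhs = (0xFFFFFFFFULL & a) + (0xFFFFFFFFULL & c);
    assert(lhs != rhs);

    return 0;
  }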
Friedman, Eli via llvm-dev
2016-Nov-16 19:07 UTC
[llvm-dev] Bitwise AND blocks re-association
On 11/16/2016 7:28 AM, Timofeev, Alexander via llvm-dev wrote:
> [...]
> We definitely should be able to optimize 0xFFFFFFFF & (A + 1) to
> (0xFFFFFFFF & A) + 1, given that a bitwise AND with 0xFFFFFFFF is a
> special case. Maybe we also want to enhance the handling of bitwise
> operations to cover more general cases.

I'm not sure what you're expecting reassociation to do here. If you can
prove "gid + 1" doesn't wrap in your example (because of some property of
the get_global_id() function), the AND is a no-op, so you can just kill it.
And if you can't prove "gid + 1" doesn't wrap, your proposed transformation
is illegal.

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
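Eli's observation can be illustrated the same way: under the assumption that
gid + 1 cannot wrap past 32 bits (a stand-in here for whatever guarantee
get_global_id() actually provides), the AND after the add is a no-op, so the
two indices really are gid and gid + 1; without that guarantee the mask
changes the result at the wrap point, so it cannot simply be dropped or
distributed. A minimal C sketch:

  #include <assert.h>
  #include <stdint.h>

  /* Index computation as Clang emits it: mask applied after the add. */
  static uint64_t masked_index(uint64_t gid, uint64_t offset) {
    return (gid + offset) & 0xFFFFFFFFULL;
  }

  int main(void) {
    /* Assumption: gid + 1 does not wrap the low 32 bits, i.e.
       gid < 0xFFFFFFFF.  Under that assumption the AND is a no-op
       and the indices are consecutive. */
    for (uint64_t gid = 0; gid < 1000; ++gid) {
      assert(masked_index(gid, 0) == gid);
      assert(masked_index(gid, 1) == gid + 1);
    }

    /* Without the assumption the AND matters: at the wrap point the
       "consecutive" index collapses to 0, so dropping or distributing
       the mask would change behaviour. */
    assert(masked_index(0xFFFFFFFFULL, 1) == 0);

    return 0;
  }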