Matt Arsenault via llvm-dev
2017-Dec-05 19:01 UTC
[llvm-dev] [AMDGPU] Strange results with different address spaces
> On Dec 5, 2017, at 13:53, Matt Arsenault <arsenm2 at gmail.com> wrote:
>
>> On Dec 5, 2017, at 02:51, Haidl, Michael via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hi dev list,
>>
>> I am currently exploring the integration of AMDGPU/ROCm into the PACXX
>> project and am observing some strange behavior of the AMDGPU backend. The
>> following IR is generated for a simple address space test that copies from
>> global to shared memory and back to global after a barrier synchronization.
>>
>> The IR is attached as as1.ll.
>>
>> The output is as follows:
>>
>>   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>   16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
>>   32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32
>>   48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48
>>   64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
>>   80 80 80 80 80 80 80 80 80 80 80 80 80 80 80 80
>>   96 96 96 96 96 96 96 96 96 96 96 96 96 96 96 96
>>   112 112 112 112 112 112 112 112 112 112 112 112 112 112 112 112
>>   128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128
>>   144 144 144 144 144 144 144 144 144 144 144 144 144 144 144 144
>>   160 160 160 160 160 160 160 160 160 160 160 160 160 160 160 160
>>   176 176 176 176 176 176 176 176 176 176 176 176 176 176 176 176
>>   192 192 192 192 192 192 192 192 192 192 192 192 192 192 192 192
>>   208 208 208 208 208 208 208 208 208 208 208 208 208 208 208 208
>>   224 224 224 224 224 224 224 224 224 224 224 224 224 224 224 224
>>   240 240 240 240 240 240 240 240 240 240 240 240 240 240 240 240
>
> It looks like the addressing in as1.ll is incorrectly concluded to be uniform:
>
>   %6 = tail call i32 @llvm.amdgcn.workitem.id.x() #0, !range !11
>   %7 = tail call i32 @llvm.amdgcn.workgroup.id.x() #0
>   %mul.i.i.i.i.i = mul nsw i32 %7, %3
>   %add.i.i.i.i.i = add nsw i32 %mul.i.i.i.i.i, %6
>   %idxprom.i.i.i = sext i32 %add.i.i.i.i.i to i64
>   %8 = getelementptr i32, i32 addrspace(1)* %callable.coerce0, i64 %idxprom.i.i.i, !amdgpu.uniform !12, !amdgpu.noclobber !12
>
> However, since this depends on workitem.id.x, it certainly is not.
>
> -Matt

Actually, you have the amdgpu.uniform annotation already here, and it isn't
added by the backend optimization pass, so there's a bug in however you
produced this. It just happens that the uniform load optimization doesn't
trigger on flat loads.

-Matt
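Since !amdgpu.uniform is ordinary instruction metadata, one way to chase down
where it first enters is to scan the IR dumped at each pipeline stage for it.
The following is only a minimal standalone sketch (the helper tool itself is
hypothetical, not part of LLVM or PACXX), assuming LLVM 5/6-era C++ APIs and
the as1.ll attachment as input:

// scan-uniform.cpp: hypothetical helper, not part of LLVM or PACXX.
// Prints every instruction that already carries !amdgpu.uniform or
// !amdgpu.noclobber metadata. Assumes LLVM 5/6-era C++ APIs.
#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Support/raw_ostream.h"
#include <memory>

using namespace llvm;

int main(int argc, char **argv) {
  LLVMContext Ctx;
  SMDiagnostic Err;
  // Input defaults to the attachment name used in this thread.
  std::unique_ptr<Module> M =
      parseIRFile(argc > 1 ? argv[1] : "as1.ll", Err, Ctx);
  if (!M) {
    Err.print("scan-uniform", errs());
    return 1;
  }
  for (Function &F : *M)
    for (BasicBlock &BB : F)
      for (Instruction &I : BB)
        if (I.getMetadata("amdgpu.uniform") || I.getMetadata("amdgpu.noclobber"))
          errs() << F.getName() << ": " << I << "\n";
  return 0;
}

Running this on the IR straight out of the front-end and again after the
preparation phase and the O3 pipeline should show at which point the metadata
is introduced.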
Haidl, Michael via llvm-dev
2017-Dec-06 07:28 UTC
[llvm-dev] [AMDGPU] Strange results with different address spaces
Hi,

the IR comes from clang in -O0 form, so no optimizations are performed by the
front-end. The IR goes through a backend-agnostic preparation phase that
brings it into SSA form and changes the address space from 0 to 1. After this
phase the IR goes through another pass manager that performs the O3 passes and
the AMDGPU target passes for object file generation.

I looked into the AMDGPU backend, and the only place where this metadata is
added is in AMDGPUAnnotateUniformValues.cpp. The pass queries divergence
analysis for the load and checks whether it is reported as uniform; afterwards
the metadata is added to the GEP.

Removing the O3 passes before code generation solves the problem, as does
separating the O3 passes and the backend passes into separate pass managers. I
assume divergence analysis does not run in the second pass manager, because no
metadata is generated at all.

Could this be a bug in DA, reporting the load falsely as uniform by not taking
the intrinsics into account?

Cheers,
Michael

From: Matt Arsenault [mailto:whatmannerofburgeristhis at gmail.com] On Behalf Of Matt Arsenault
Sent: Tuesday, December 5, 2017 20:01
To: Haidl, Michael <michael.haidl at uni-muenster.de>
Cc: llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] [AMDGPU] Strange results with different address spaces

>> On Dec 5, 2017, at 13:53, Matt Arsenault <arsenm2 at gmail.com> wrote:
>>
>>> On Dec 5, 2017, at 02:51, Haidl, Michael via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>> [original test case, as1.ll reference, and output quoted above]
>>
>> It looks like the addressing in as1.ll is incorrectly concluded to be uniform:
>>
>>   %8 = getelementptr i32, i32 addrspace(1)* %callable.coerce0, i64 %idxprom.i.i.i, !amdgpu.uniform !12, !amdgpu.noclobber !12
>>
>> However, since this depends on workitem.id.x, it certainly is not.
>>
>> -Matt
>
> Actually, you have the amdgpu.uniform annotation already here, and it isn't
> added by the backend optimization pass, so there's a bug in however you
> produced this. It just happens that the uniform load optimization doesn't
> trigger on flat loads.
>
> -Matt
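For context on the pass-manager arrangements being compared here, the two
setups correspond roughly to the sketch below, written against the legacy pass
manager that codegen used in 2017. This is purely illustrative and not taken
from the PACXX sources; it assumes LLVM 5/6-era APIs, an already-constructed
AMDGPU TargetMachine, and a module whose triple and data layout already match
that target machine.

// Illustrative sketch only; assumes LLVM 5/6-era legacy pass manager APIs.
// TM is an already-created AMDGPU TargetMachine, M the prepared module.
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"
#include "llvm/Transforms/IPO/PassManagerBuilder.h"

using namespace llvm;

// Variant A: O3 and codegen share one legacy::PassManager (the setup that
// reportedly produces the stray !amdgpu.uniform annotation).
void emitCombined(TargetMachine &TM, Module &M, raw_pwrite_stream &OS) {
  legacy::PassManager PM;
  // Make the O3 pipeline target-aware (TTI from the AMDGPU TargetMachine).
  PM.add(createTargetTransformInfoWrapperPass(TM.getTargetIRAnalysis()));
  PassManagerBuilder Builder;
  Builder.OptLevel = 3;
  Builder.populateModulePassManager(PM);                          // O3 passes
  TM.addPassesToEmitFile(PM, OS, TargetMachine::CGFT_ObjectFile); // codegen
  PM.run(M);
}

// Variant B: O3 and codegen run in separate pass managers (the setup that
// reportedly avoids the problem).
void emitSplit(TargetMachine &TM, Module &M, raw_pwrite_stream &OS) {
  {
    legacy::PassManager OptPM;
    OptPM.add(createTargetTransformInfoWrapperPass(TM.getTargetIRAnalysis()));
    PassManagerBuilder Builder;
    Builder.OptLevel = 3;
    Builder.populateModulePassManager(OptPM);
    OptPM.run(M);
  }
  legacy::PassManager CGPM;
  TM.addPassesToEmitFile(CGPM, OS, TargetMachine::CGFT_ObjectFile);
  CGPM.run(M);
}

Dumping the module between the two run() calls in the split variant, or
running the combined variant with IR printing enabled, should make it possible
to see whether the O3 pipeline or the AMDGPU annotation pass introduces the
metadata.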
Matt Arsenault via llvm-dev
2017-Dec-06 18:45 UTC
[llvm-dev] [AMDGPU] Strange results with different address spaces
> On Dec 6, 2017, at 02:28, Haidl, Michael <michael.haidl at uni-muenster.de> wrote:
>
> The IR goes through a backend-agnostic preparation phase that brings it into
> SSA form and changes the address space from 0 to 1.

This sounds possibly problematic to me. The IR should be created with the
correct address space to begin with. Changing this in the middle sounds
suspect.

> After this phase the IR goes through another pass manager that performs the
> O3 passes and the AMDGPU target passes for object file generation. I looked
> into the AMDGPU backend, and the only place where this metadata is added is
> in AMDGPUAnnotateUniformValues.cpp. The pass queries divergence analysis for
> the load and checks whether it is reported as uniform; afterwards the
> metadata is added to the GEP.
>
> Removing the O3 passes before code generation solves the problem, as does
> separating the O3 passes and the backend passes into separate pass managers.
> I assume divergence analysis does not run in the second pass manager,
> because no metadata is generated at all.
>
> Could this be a bug in DA, reporting the load falsely as uniform by not
> taking the intrinsics into account?
>
> Cheers,
> Michael

The intrinsics certainly are correctly treated as divergent. Nothing would
work otherwise. If I run the annotate pass or the analysis on the example, it
does the right thing and sees the load as divergent:

  $ opt -S -analyze -divergence -o - as1.ll
  Printing analysis 'Divergence Analysis' for function '_ZN5pacxx2v213genericKernelIZL12test_barrieriPPcE3$_0EEvT_':
  DIVERGENT:  %6 = tail call i32 @llvm.amdgcn.workitem.id.x() #0, !range !11
  DIVERGENT:  %add.i.i.i.i.i = add nsw i32 %mul.i.i.i.i.i, %6
  DIVERGENT:  %idxprom.i.i.i = sext i32 %add.i.i.i.i.i to i64
  DIVERGENT:  %8 = getelementptr i32, i32 addrspace(1)* %callable.coerce0, i64 %idxprom.i.i.i
  DIVERGENT:  %9 = load i32, i32 addrspace(1)* %8, align 4
  DIVERGENT:  %10 = getelementptr [16 x i32], [16 x i32] addrspace(3)* @"_ZN5pacxx2v213genericKernelIZL12test_barrieriPPcE3$_0EEvT__sm0", i32 0, i32 %6
  DIVERGENT:  store i32 %9, i32 addrspace(3)* %10, align 4
  DIVERGENT:  %11 = load i32, i32 addrspace(3)* %10, align 4
  DIVERGENT:  %12 = getelementptr i32, i32 addrspace(1)* %callable.coerce1, i64 %idxprom.i.i.i
  DIVERGENT:  store i32 %11, i32 addrspace(1)* %12, align 4

I'm also questioning how/where you obtained this dump. You have the
declarations for the control flow intrinsics in there, which should only ever
appear when the backend inserts them as part of codegen. There's something
suspicious about your pass setup. What does the IR look like immediately
before AMDGPUAnnotateUniformValues, and immediately out of the frontend?

-Matt
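If it is easier to check this from inside the custom pipeline than through
opt, the same per-load verdict can be printed from a small legacy pass. This
is only a sketch, assuming the LLVM 5/6-era DivergenceAnalysis interface
(later renamed LegacyDivergenceAnalysis); the pass and its registered name are
hypothetical:

// Sketch: a legacy FunctionPass that prints whether each load's address is
// divergent according to DivergenceAnalysis. Assumes LLVM 5/6-era APIs.
#include "llvm/Analysis/DivergenceAnalysis.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Pass.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
struct PrintLoadDivergence : public FunctionPass {
  static char ID;
  PrintLoadDivergence() : FunctionPass(ID) {}

  void getAnalysisUsage(AnalysisUsage &AU) const override {
    AU.addRequired<DivergenceAnalysis>();
    AU.setPreservesAll();
  }

  bool runOnFunction(Function &F) override {
    auto &DA = getAnalysis<DivergenceAnalysis>();
    for (Instruction &I : instructions(F))
      if (auto *LI = dyn_cast<LoadInst>(&I))
        errs() << (DA.isUniform(LI->getPointerOperand()) ? "uniform:   "
                                                         : "divergent: ")
               << *LI << "\n";
    return false; // analysis only, IR is not modified
  }
};
} // end anonymous namespace

char PrintLoadDivergence::ID = 0;
static RegisterPass<PrintLoadDivergence>
    X("print-load-divergence", "Print divergence of load addresses (sketch)");

Loaded into opt with -load, or added to the custom pass manager right before
AMDGPUAnnotateUniformValues, it should report the loads in as1.ll as
divergent, matching the -divergence output above. Note that DivergenceAnalysis
only does real work for targets reporting branch divergence, so the module
needs its amdgcn triple intact when the analysis runs.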