Tom Stellard via llvm-dev
2017-Jun-13 00:23 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On 06/12/2017 08:03 PM, Connor Abbott wrote:
> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com> wrote:
>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>> cc some people who have worked on this.
>>>
>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>> Hi all,
>>>>
>>>> I've been looking into how to implement the more advanced Shader Model
>>>> 6 reduction operations in radv (and most of the work would be useful
>>>> for radeonsi too). They're explained in the spec for GL_AMD_shader_ballot
>>>> at https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>> but I'll summarize them here. There are two types of operations:
>>>> reductions that always return a uniform value, and prefix scan
>>>> operations. The reductions can be implemented in terms of the prefix
>>>> scan (although in practice I don't think we want to implement them in
>>>> exactly the same way), and the concerns are mostly the same, so I'll
>>>> focus on the prefix scan operations for now. Given an operation `op'
>>>> and an input value `a' (really a SIMD array with one value per
>>>> invocation, even though it's a scalar value in LLVM), the prefix scan
>>>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1,
>>>> a[0] `op' a[1] `op' a[2] in invocation 2, and so on. The prefix scan
>>>> also works for non-uniform control flow: it simply skips inactive
>>>> invocations.
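As a reference for the semantics above: a lane-by-lane model of the masked
inclusive scan (a plain-C++ sketch, not code from the thread; a 64-lane
wavefront and integer '+' standing in for `op' are assumptions here).

#include <array>
#include <bitset>
#include <cstdint>

constexpr int kWaveSize = 64; // GCN wavefront width

// Inclusive prefix scan that skips inactive invocations: each active
// lane i receives the reduction over all active lanes j <= i; inactive
// lanes are left untouched.
std::array<int32_t, kWaveSize> inclusiveScanRef(
    const std::array<int32_t, kWaveSize> &input,
    const std::bitset<kWaveSize> &exec) {
  std::array<int32_t, kWaveSize> result = input;
  int32_t acc = 0;
  bool seen = false; // tracks whether any active lane has been visited,
                     // so no identity value for the operation is needed
  for (int lane = 0; lane < kWaveSize; ++lane) {
    if (!exec[lane])
      continue; // inactive lanes contribute nothing and receive nothing
    acc = seen ? acc + input[lane] : input[lane];
    seen = true;
    result[lane] = acc;
  }
  return result;
}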
>>>>
>>>> On the LLVM side, I think that we have most of the AMD-specific
>>>> low-level shuffle intrinsics implemented that you need to do this, but
>>>> I can think of a few concerns/questions. First of all, to implement
>>>> the prefix scan, we'll need a code sequence that looks like this,
>>>> modified from http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
>>>> (replace v_foo_f32 with the appropriate operation):
>>>>
>>>> ; v0 is the input register
>>>> v_mov_b32 v1, v0
>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>
>>>> The problem is that the way these instructions use the DPP word isn't
>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>>> intrinsic, but it isn't enough. For example, take the first
>>>> instruction:
>>>>
>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>
>>>> What it's doing is shifting v0 right by one within each row and
>>>> applying foo to the result and v1. v1 stays the same in the first lane
>>>> of each row, however. With llvm.amdgcn.mov_dpp, we could try to
>>>> express it as something like this, in LLVM-like pseudocode:
>>>>
>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>> %result = foo %tmp, %input
>>>>
>>>> but this is incorrect. If I'm reading the source correctly, this will
>>>> make %tmp garbage in lane 0 (since it just turns into a normal move
>>>> with the dpp modifier, and no restrictions on the destination). We
>>>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>>>> 0 in lane 0, but that won't work with operations whose identity is
>>>> non-0, like min and max. What we need is something like:
>>>>
>>
>> Why is %tmp garbage? I thought the two options were 0 (bound_ctrl = 0)
>> or %input (bound_ctrl = 1)?
>
> Oh, maybe it is... for that to happen, the underlying move would need
> to have the source and destination constrained to be the same. I
> couldn't see that constraint anywhere I looked, but I'm not an expert,
> so I may have overlooked it. In any case, that behavior still isn't
> what we want if we want to implement the prefix scan operations
> efficiently.

Ok, I see what you are saying now. I think the best option here is to
document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is to
copy its src operand to dst when bound_ctrl = 1 and it reads from an
invalid thread, and then, when bound_ctrl = 1, lower the intrinsic to a
special tied version of V_MOV_B32_dpp where the src and dst are the
same register.

-Tom

> Connor
>
>> -Tom
>>
>>>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>>>>
>>>> where llvm.amdgcn.foo_dpp copies the first argument to the result,
>>>> then applies the DPP swizzle to the second argument and applies 'foo'
>>>> to the second and third arguments. It would mean that we'd have a
>>>> separate intrinsic for every operation we care about, but I can't
>>>> think of a better way to express it. Is there a better way that
>>>> doesn't involve creating an intrinsic for each operation?
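To pin down the proposed semantics: a lane-level model of the hypothetical
llvm.amdgcn.foo_dpp for the row_shr:N case (a sketch only; the intrinsic,
its name, and integer '+' standing in for 'foo' are all assumptions taken
from the paragraph above, and exec/row_mask/bank_mask are ignored for
brevity).

#include <array>
#include <cstdint>

constexpr int kWaveSize = 64;
constexpr int kRowSize = 16; // DPP rows are 16 lanes wide

// result[lane] = old[lane]                      if lane - shift falls before
//                                               the start of the lane's row
//              = src[lane - shift] + src2[lane] otherwise
std::array<int32_t, kWaveSize> fooDppRowShr(
    const std::array<int32_t, kWaveSize> &old,
    const std::array<int32_t, kWaveSize> &src,
    const std::array<int32_t, kWaveSize> &src2, int shift) {
  std::array<int32_t, kWaveSize> result = old;
  for (int lane = 0; lane < kWaveSize; ++lane) {
    int rowStart = (lane / kRowSize) * kRowSize;
    int srcLane = lane - shift;
    // row_shr never moves data across a row boundary; an out-of-row read
    // is invalid, so the lane keeps the old value (the first operand).
    if (srcLane >= rowStart)
      result[lane] = src[srcLane] + src2[lane];
  }
  return result;
}

Under this model, the first instruction of the sequence corresponds to
result = fooDppRowShr(v1, v0, v1, 1), which leaves v1 unchanged in the
first lane of each row, exactly as described above.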
>>>>
>>>> Next, there's the fact that this code sequence only works when the
>>>> active lanes are densely packed, but we have to make this work even
>>>> when control flow is non-uniform. Essentially, we need to "skip over"
>>>> the inactive lanes by setting them to the identity, and then we need
>>>> to enable them in the exec mask when doing the reduction to make sure
>>>> they pass along the correct result. That is, to handle non-uniform
>>>> control flow, we need something like:
>>>>
>>>> invert EXEC
>>>> result = identity
>>>> set EXEC to ~0
>>>> <original code sequence>
>>>> restore original EXEC
>>>>
>>>> I imagine we'd need to add some special llvm.amdgcn.set_inactive_lanes
>>>> intrinsic that returns the first argument with inactive lanes set to
>>>> the second argument. We'd also need something like WQM to make all the
>>>> lanes active during the sequence. But that raises some hairy
>>>> requirements for register allocation. For example, in something like:
>>>>
>>>> foo = ...
>>>> if (...) {
>>>>     bar = minInvocationsInclusiveScanAMD(...)
>>>> } else {
>>>>     ... = foo;
>>>> }
>>>>
>>>> we have to make sure that foo isn't allocated to the same register as
>>>> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
>>>> even though they don't interfere. That's because the implementation of
>>>> minInvocationsInclusiveScanAMD() will do funny things with the exec
>>>> mask, possibly overwriting foo, if the condition is non-uniform. Or
>>>> consider the following:
>>>>
>>>> do {
>>>>     bar = minInvocationsInclusiveScanAMD(...);
>>>>     // ...
>>>>     ... = bar; // last use of bar
>>>>     foo = ...;
>>>> } while (...);
>>>>
>>>> ... = foo;
>>>>
>>>> again, foo and the temporaries used to compute bar can't be assigned
>>>> to the same register, even though their live ranges don't intersect,
>>>> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
>>>> in a previous iteration if the loop exit condition isn't uniform. How
>>>> can we express this in the backend? I don't know much about the LLVM
>>>> infrastructure, so I'm not sure if it's relatively easy or really hard.
>>>>
>>>> Thanks,
>>>>
>>>> Connor
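A plain-C++ check of the identity-filling trick in the exec-mask sequence
above (a sketch, not code from the thread; integer '+' with identity 0
stands in for the operation): filling inactive lanes with the identity and
then scanning densely over all 64 lanes gives every active lane the same
result as the masked scan.

#include <array>
#include <bitset>
#include <cassert>
#include <cstdint>

constexpr int kWaveSize = 64;

int main() {
  std::array<int32_t, kWaveSize> input{};
  std::bitset<kWaveSize> exec;
  for (int i = 0; i < kWaveSize; ++i) {
    input[i] = i + 1;
    exec[i] = (i % 3 != 0); // an arbitrary non-uniform active mask
  }

  // "invert EXEC; result = identity": inactive lanes get the identity.
  std::array<int32_t, kWaveSize> tmp = input;
  for (int i = 0; i < kWaveSize; ++i)
    if (!exec[i])
      tmp[i] = 0;

  // "set EXEC to ~0; <original code sequence>": a dense inclusive scan
  // over all 64 lanes, standing in for the DPP instruction sequence.
  std::array<int32_t, kWaveSize> dense = tmp;
  for (int i = 1; i < kWaveSize; ++i)
    dense[i] = dense[i - 1] + dense[i];

  // After "restore original EXEC", only active lanes are read back; they
  // hold exactly the masked-scan result.
  int32_t acc = 0;
  for (int i = 0; i < kWaveSize; ++i)
    if (exec[i]) {
      acc += input[i];
      assert(dense[i] == acc);
    }
  return 0;
}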
Matt Arsenault via llvm-dev
2017-Jun-13 23:33 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com> wrote:
>
> [...]
>
> Ok, I see what you are saying now. I think the best option here is to
> document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is to
> copy its src operand to dst when bound_ctrl = 1 and it reads from an
> invalid thread, and then, when bound_ctrl = 1, lower the intrinsic to a
> special tied version of V_MOV_B32_dpp where the src and dst are the
> same register.
>
> -Tom

This came up before: there's no way to represent the unmodified input
register for the inactive lanes. I think the conclusion was that a new
intrinsic is needed to represent this case, but I don't think there was
a consensus on what it should look like.

-Matt
Tom Stellard via llvm-dev
2017-Jun-14 01:13 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On 06/13/2017 07:33 PM, Matt Arsenault wrote:
>
> [...]
>
> This came up before: there's no way to represent the unmodified input
> register for the inactive lanes. I think the conclusion was that a new
> intrinsic is needed to represent this case, but I don't think there was
> a consensus on what it should look like.
>
> -Matt

I think re-defining the existing intrinsic will be easier than adding a
new one. Also, if we add a new one, llvm.amdgcn.mov.dpp will still be
broken for the bound_ctrl = 1 case, which isn't really ideal.

-Tom
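To make the proposed redefinition concrete: a lane model of
llvm.amdgcn.mov.dpp with row_shr:N under the two bound-control settings,
following the thread's reading (bound_ctrl = 0 writes 0 on an invalid
read; bound_ctrl = 1 keeps the lane's own source value via the tied
src/dst lowering). A sketch only; exec and the DPP row/bank masks are
ignored.

#include <array>
#include <cstdint>

constexpr int kWaveSize = 64;
constexpr int kRowSize = 16; // row_shr shifts within 16-lane rows

std::array<int32_t, kWaveSize> movDppRowShr(
    const std::array<int32_t, kWaveSize> &src, int shift, bool boundCtrl1) {
  std::array<int32_t, kWaveSize> dst{};
  for (int lane = 0; lane < kWaveSize; ++lane) {
    int rowStart = (lane / kRowSize) * kRowSize;
    int srcLane = lane - shift;
    if (srcLane >= rowStart)
      dst[lane] = src[srcLane]; // valid in-row read
    else if (boundCtrl1)
      dst[lane] = src[lane];    // tied operand: dst keeps the src value
    else
      dst[lane] = 0;            // bound_ctrl = 0: invalid reads produce 0
  }
  return dst;
}

With boundCtrl1 = true, %tmp holds %input rather than garbage or 0 in the
first lane of each row, which is well-defined for non-0-identity operations
like min and max, though as Connor notes above a fused foo_dpp-style
intrinsic would still be preferable for an efficient scan.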