Tom Stellard via llvm-dev
2017-Jun-12 23:56 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
> cc some people who have worked on this.
>
> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>> Hi all,
>>
>> I've been looking into how to implement the more advanced Shader Model
>> 6 reduction operations in radv (and obviously most of the work would
>> be useful for radeonsi too). They're explained in the spec for
>> GL_AMD_shader_ballot at
>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>> but I'll summarize them here. There are two types of operations:
>> reductions that always return a uniform value, and prefix scan
>> operations. The reductions can be implemented in terms of the prefix
>> scan (although in practice I don't think we want to implement them in
>> exactly the same way), and the concerns are mostly the same, so I'll
>> focus on the prefix scan operations for now. Given an operation `op'
>> and an input value `a' (that's really a SIMD array with one value per
>> invocation, even though it's a scalar value in LLVM), the prefix scan
>> returns a[0] in invocation 0, a[0] `op' a[1] in invocation 1, a[0]
>> `op' a[1] `op' a[2] in invocation 2, etc. The prefix scan will also
>> work for non-uniform control flow: it simply skips inactive
>> invocations.
>>
>> On the LLVM side, I think that we have most of the AMD-specific
>> low-level shuffle intrinsics implemented that you need to do this, but
>> I can think of a few concerns/questions. First of all, to implement
>> the prefix scan, we'll need a code sequence that looks like this,
>> modified from
>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/ (replace
>> v_foo_f32 with the appropriate operation):
>>
>> ; v0 is the input register
>> v_mov_b32 v1, v0
>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>> v_nop // Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>
>> The problem is that the way these instructions use the DPP word isn't
>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>> intrinsic, but it isn't enough. For example, take the first
>> instruction:
>>
>> v_foo_f32 v1, v0, v1 row_shr:1
>>
>> What it's doing is shifting v0 right by one within each row and
>> combining it with v1. v1 stays the same in the first lane of each row,
>> however. With llvm.amdgcn.mov_dpp, we could try to express it as
>> something like this, in LLVM-like pseudocode:
>>
>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>> %result = foo %tmp, %input
>>
>> but this is incorrect. If I'm reading the source correctly, this will
>> make %tmp garbage in lane 0 (since it just turns into a normal move
>> with the dpp modifier, and no restrictions on the destination). We
>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>> 0 in lane 0, but that won't work with operations whose identity is
>> non-0 like min and max. What we need is something like:
>>

Why is %tmp garbage? I thought the two options were 0 (bound_ctrl = 0)
or %input (bound_ctrl = 1)?

-Tom

>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>>
>> where llvm.amdgcn.foo_dpp copies the first argument to the result,
>> then applies the DPP swizzling to the second argument and does 'foo'
>> to the second and third arguments. It would mean that we'd have a
>> separate intrinsic for every operation we care about, but I can't
>> think of a better way to express it. Is there a better way that
>> doesn't involve creating an intrinsic for each operation?
>>
>> Next, there's the fact that this code sequence only works when the
>> active lanes are densely packed, but we have to make this work even
>> when control flow is non-uniform. Essentially, we need to "skip over"
>> the inactive lanes by setting them to the identity, and then we need
>> to enable them in the exec mask when doing the reduction to make sure
>> they pass along the correct result. That is, to handle non-uniform
>> control flow, we need something like:
>>
>> invert EXEC
>> result = identity
>> set EXEC to ~0
>> <original code sequence>
>> restore original EXEC
>>
>> I imagine we'd need to add some special llvm.amdgcn.set_inactive_lanes
>> intrinsic that returns the first argument with inactive lanes set to
>> the second argument. We'd also need something like WQM to make all the
>> lanes active during the sequence. But that raises some hairy
>> requirements for register allocation. For example, in something like:
>>
>> foo = ...
>> if (...) {
>>     bar = minInvocationsInclusiveScanAMD(...)
>> } else {
>>     ... = foo;
>> }
>>
>> we have to make sure that foo isn't allocated to the same register as
>> one of the temporaries used inside minInvocationsInclusiveScanAMD(),
>> even though they don't interfere. That's because the implementation of
>> minInvocationsInclusiveScanAMD() will do funny things with the exec
>> mask, possibly overwriting foo, if the condition is non-uniform. Or
>> consider the following:
>>
>> do {
>>     bar = minInvocationsInclusiveScanAMD(...);
>>     // ...
>>     ... = bar; // last use of bar
>>     foo = ...;
>> } while (...);
>>
>> ... = foo;
>>
>> again, foo and the temporaries used to compute bar can't be assigned
>> to the same register, even though their live ranges don't intersect,
>> since minInvocationsInclusiveScanAMD() may overwrite the value of foo
>> in a previous iteration if the loop exit condition isn't uniform. How
>> can we express this in the backend? I don't know much about the LLVM
>> infrastructure, so I'm not sure if it's relatively easy or really hard.
>>
>> Thanks,
>>
>> Connor
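For reference, the scan semantics described above can be modeled with a
plain scalar loop. This is only an illustrative sketch: the name
inclusive_scan_model is made up, and the 64-lane wavefront and exec-mask
parameter are assumptions for illustration, not radv or LLVM code.

#include <stdint.h>

#define WAVE_SIZE 64

/* Model of an inclusive prefix scan under non-uniform control flow:
 * each active invocation receives the reduction of all active
 * invocations up to and including itself; inactive invocations
 * contribute nothing and receive nothing, as the extension spec
 * describes. */
static void inclusive_scan_model(const float *in, float *out,
                                 uint64_t exec, float identity,
                                 float (*op)(float, float))
{
    float acc = identity;
    for (int lane = 0; lane < WAVE_SIZE; lane++) {
        if (!((exec >> lane) & 1))
            continue;               /* inactive invocation: skipped */
        acc = op(acc, in[lane]);    /* fold this lane's input in */
        out[lane] = acc;            /* lane sees the running result */
    }
}

With op = fminf and identity = INFINITY this models a min scan in the
style of minInvocationsInclusiveScanAMD; the hardware sequence above
computes the same per-lane results in a fixed number of cross-lane steps
rather than a serial loop.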
Connor Abbott via llvm-dev
2017-Jun-13 00:03 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com> wrote:
> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>> [snip]
>> but this is incorrect. If I'm reading the source correctly, this will
>> make %tmp garbage in lane 0 (since it just turns into a normal move
>> with the dpp modifier, and no restrictions on the destination). We
>> could set bound_ctrl to 0 to work around this, since it will make %tmp
>> 0 in lane 0, but that won't work with operations whose identity is
>> non-0 like min and max.
>
> Why is %tmp garbage? I thought the two options were 0 (bound_ctrl = 0)
> or %input (bound_ctrl = 1)?

Oh, maybe it is... for that to happen, the underlying move would need
to have the source and destination constrained to be the same. I
couldn't see that constraint anywhere I looked, but I'm not an expert,
so I may have overlooked it. In any case, that behavior still isn't
what we want if we want to implement the prefix scan operations
efficiently.

Connor
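To make the two readings concrete, here is a rough per-lane model of
v_mov_b32 with row_shr:1. The 16-lane row size comes from the DPP
description in the GPUOpen article linked above; the function itself is
a hypothetical illustration, not how the backend lowers anything.

#include <stdbool.h>
#include <stdint.h>

#define WAVE_SIZE 64
#define ROW_SIZE  16

/* row_shr:1 makes each lane read from the lane one below it within its
 * 16-lane row, so lane 0 of each row has no valid source. What an
 * invalid read produces is exactly the question in this thread: zero
 * under the zeroing behavior (called bound_ctrl = 0 above), or the
 * prior contents of dst, which is only well-defined if dst is tied to
 * src. */
static void mov_dpp_row_shr1_model(const uint32_t *src, uint32_t *dst,
                                   bool zero_invalid)
{
    for (int lane = 0; lane < WAVE_SIZE; lane++) {
        if (lane % ROW_SIZE != 0)
            dst[lane] = src[lane - 1];  /* valid source in same row */
        else if (zero_invalid)
            dst[lane] = 0;              /* read-as-zero behavior */
        /* else: dst[lane] is left unwritten. A fresh %tmp in LLVM IR
         * has no prior contents, hence "garbage" in lane 0. */
    }
}

The zeroing case shows why bound_ctrl is fine for operations with
identity 0, such as add, but not for min and max, whose identities are
non-zero.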
Tom Stellard via llvm-dev
2017-Jun-13 00:23 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On 06/12/2017 08:03 PM, Connor Abbott wrote:
> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com> wrote:
>> [snip]
>> Why is %tmp garbage? I thought the two options were 0 (bound_ctrl = 0)
>> or %input (bound_ctrl = 1)?
>
> Oh, maybe it is... for that to happen, the underlying move would need
> to have the source and destination constrained to be the same. I
> couldn't see that constraint anywhere I looked, but I'm not an expert,
> so I may have overlooked it. In any case, that behavior still isn't
> what we want if we want to implement the prefix scan operations
> efficiently.

Ok, I see what you are saying now. I think the best option here is to
document that the behavior of the llvm.amdgcn.mov.dpp intrinsic is to
copy its src operand to dst when bound_ctrl = 1 and it reads from an
invalid thread, and then, when bound_ctrl = 1, lower the intrinsic to a
special tied version of V_MOV_B32_dpp where the src and dst are the
same register.

-Tom
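As a sketch of what the proposed tied lowering would mean per lane
(illustrative only: the tied V_MOV_B32_dpp variant is a proposal in this
thread, not an existing instruction definition, and the helper name is
made up):

#include <stdint.h>
#include <string.h>

#define WAVE_SIZE 64
#define ROW_SIZE  16

/* Tied model of v_mov_b32_dpp with row_shr:1 where src and dst are the
 * same register: DPP reads the pre-instruction values of all lanes,
 * and a lane whose source is invalid simply keeps its old value, i.e.
 * its own input. That is the "copy src to dst" behavior described
 * above. */
static void mov_dpp_row_shr1_tied_model(uint32_t *v /* src == dst */)
{
    uint32_t old[WAVE_SIZE];
    memcpy(old, v, sizeof old);      /* snapshot before any writes */
    for (int lane = 0; lane < WAVE_SIZE; lane++) {
        if (lane % ROW_SIZE != 0)
            v[lane] = old[lane - 1]; /* shifted value from the row */
        /* else: v[lane] is unchanged, the tied-operand effect */
    }
}

With this in place, the %tmp in Connor's original pseudocode would hold
%input in the first lane of each row rather than garbage, at the cost of
an extra copy whenever the source value is still live afterwards, since
the tied constraint forces %tmp and %input into the same register.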