Sumner, Brian via llvm-dev
2017-Jun-15 13:56 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
I'm wondering about the focus on bound_ctrl. Any cleared bit in the row_mask
or bank_mask will also disable updating the result.

Brian

-----Original Message-----
From: Connor Abbott [mailto:cwabbott0 at gmail.com]
Sent: Wednesday, June 14, 2017 6:13 PM
To: tstellar at redhat.com
Cc: Matt Arsenault; llvm-dev at lists.llvm.org; Kolton, Sam; Sumner, Brian; Pykhtin, Valery
Subject: Re: [llvm-dev] Implementing cross-thread reduction in the AMDGPU backend

On Wed, Jun 14, 2017 at 5:23 PM, Tom Stellard <tstellar at redhat.com> wrote:
> On 06/14/2017 05:05 PM, Connor Abbott wrote:
>> On Tue, Jun 13, 2017 at 6:13 PM, Tom Stellard <tstellar at redhat.com> wrote:
>>> On 06/13/2017 07:33 PM, Matt Arsenault wrote:
>>>>
>>>>> On Jun 12, 2017, at 17:23, Tom Stellard <tstellar at redhat.com> wrote:
>>>>>
>>>>> On 06/12/2017 08:03 PM, Connor Abbott wrote:
>>>>>> On Mon, Jun 12, 2017 at 4:56 PM, Tom Stellard <tstellar at redhat.com> wrote:
>>>>>>> On 06/12/2017 07:15 PM, Tom Stellard via llvm-dev wrote:
>>>>>>>> cc some people who have worked on this.
>>>>>>>>
>>>>>>>> On 06/12/2017 05:58 PM, Connor Abbott via llvm-dev wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I've been looking into how to implement the more advanced Shader
>>>>>>>>> Model 6 reduction operations in radv (and obviously most of the
>>>>>>>>> work would be useful for radeonsi too). They're explained in the
>>>>>>>>> spec for GL_AMD_shader_ballot at
>>>>>>>>> https://www.khronos.org/registry/OpenGL/extensions/AMD/AMD_shader_ballot.txt,
>>>>>>>>> but I'll summarize them here. There are two types of operations:
>>>>>>>>> reductions that always return a uniform value, and prefix scan
>>>>>>>>> operations. The reductions can be implemented in terms of the
>>>>>>>>> prefix scan (although in practice I don't think we want to
>>>>>>>>> implement them in exactly the same way), and the concerns are
>>>>>>>>> mostly the same, so I'll focus on the prefix scan operations for
>>>>>>>>> now. Given an operation `op' and an input value `a' (that's
>>>>>>>>> really a SIMD array with one value per invocation, even though
>>>>>>>>> it's a scalar value in LLVM), the prefix scan returns a[0] in
>>>>>>>>> invocation 0, a[0] `op' a[1] in invocation 1, a[0] `op' a[1]
>>>>>>>>> `op' a[2] in invocation 2, etc. The prefix scan will also work
>>>>>>>>> for non-uniform control flow: it simply skips inactive
>>>>>>>>> invocations.
>>>>>>>>>
>>>>>>>>> On the LLVM side, I think that we have most of the AMD-specific
>>>>>>>>> low-level shuffle intrinsics implemented that you need to do
>>>>>>>>> this, but I can think of a few concerns/questions.
>>>>>>>>>
>>>>>>>>> First of all, to implement the prefix scan, we'll need to do a
>>>>>>>>> code sequence that looks like this, modified from
>>>>>>>>> http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
>>>>>>>>> (replace v_foo_f32 with the appropriate operation):
>>>>>>>>>
>>>>>>>>> ; v0 is the input register
>>>>>>>>> v_mov_b32 v1, v0
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1 // Instruction 1
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:2 // Instruction 2
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:3 // Instruction 3
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:4 bank_mask:0xe // Instruction 4
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_shr:8 bank_mask:0xc // Instruction 5
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:15 row_mask:0xa // Instruction 6
>>>>>>>>> v_nop // Add two independent instructions to avoid a data hazard
>>>>>>>>> v_nop
>>>>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>>>>>>
>>>>>>>>> The problem is that the way these instructions use the DPP word
>>>>>>>>> isn't currently expressible in LLVM. We have the
>>>>>>>>> llvm.amdgcn.mov_dpp intrinsic, but it isn't enough. For example,
>>>>>>>>> take the first instruction:
>>>>>>>>>
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>>>>
>>>>>>>>> What it's doing is shifting v0 right by one within each row and
>>>>>>>>> adding it to v1. v1 stays the same in the first lane of each
>>>>>>>>> row, however. With llvm.amdgcn.mov_dpp, we could try to express
>>>>>>>>> it as something like this, in LLVM-like pseudocode:
>>>>>>>>>
>>>>>>>>> %tmp = call llvm.amdgcn.mov_dpp %input row_shr:1
>>>>>>>>> %result = foo %tmp, %input
>>>>>>>>>
>>>>>>>>> but this is incorrect. If I'm reading the source correctly, this
>>>>>>>>> will make %tmp garbage in lane 0 (since it just turns into a
>>>>>>>>> normal move with the dpp modifier, and no restrictions on the
>>>>>>>>> destination). We could set bound_ctrl to 0 to work around this,
>>>>>>>>> since it will make %tmp 0 in lane 0, but that won't work with
>>>>>>>>> operations whose identity is non-0, like min and max. What we
>>>>>>>>> need is something like:
>>>>>>>>>
>>>>>>>
>>>>>>> Why is %tmp garbage? I thought the two options were 0
>>>>>>> (bound_ctrl = 0) or %input (bound_ctrl = 1)?
>>>>>>
>>>>>> Oh, maybe it is... for that to happen, the underlying move would
>>>>>> need to have the source and destination constrained to be the same.
>>>>>> I couldn't see that constraint anywhere I looked, but I'm not an
>>>>>> expert, so I may have overlooked it. In any case, that behavior
>>>>>> still isn't what we want if we want to implement the prefix scan
>>>>>> operations efficiently.
>>>>>>
>>>>>
>>>>> Ok, I see what you are saying now. I think the best option here is
>>>>> to document that the behavior of the llvm.amdgcn.mov.dpp intrinsic
>>>>> is to copy its src operand to dst when bound_ctrl = 1 and it reads
>>>>> from an invalid thread, and then, when bound_ctrl = 1, lower the
>>>>> intrinsic to a special tied version of V_MOV_B32_dpp where the src
>>>>> and dst are the same register.
>>>>>
>>>>> -Tom
>>>>
>>>> This came up before: there's no way to represent the unmodified
>>>> input register for the inactive lanes.
>>>> I think the conclusion was that a new intrinsic is needed to
>>>> represent this case, but I don't think there was a consensus on what
>>>> it should look like.
>>>>
>>>
>>> I think re-defining the existing intrinsic will be easier than adding
>>> a new one. Also, if we add a new one, llvm.amdgcn.mov.dpp will still
>>> be broken for the bound_ctrl=1 case, which isn't really ideal.
>>
>> You can't re-define the existing intrinsic. The problem is that a move
>> with DPP doesn't just write to its destination register - it modifies
>> it, i.e. it reads and then writes to it. You can't express that with
>> SSA, since you can only ever write to an SSA value once. I think what
>> you want is an intrinsic, say llvm.amdgcn.update.dpp (dunno what you'd
>> call it?), that takes the value to use if the read is invalid. That
>> is, something like:
>>
>> %new = i32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>>
>
> This is a more general form of what I'm suggesting. If you re-define
> (when I say re-define, I just mean document this behavior)
> llvm.amdgcn.mov.dpp such that when bound_ctrl = 0 the value of %src
> will be written to %dst for invalid lanes, then this will have the same
> effect as llvm.amdgcn.update.dpp when %old = %update.
>
> I can see the usefulness now of having this extra intrinsic, but I
> still think it would be good to fix llvm.amdgcn.mov.dpp. At the
> MachineIR level, I think you could handle the new intrinsic and
> llvm.amdgcn.mov.dpp with bound_ctrl = 0 with the same pseudo
> instructions, so fixing llvm.amdgcn.mov.dpp may not be so hard if you
> are adding the new intrinsic.

Ah, ok. Still, we're going to need the full llvm.amdgcn.update.dpp to
generate the best possible code sequence, and it's a strict superset of
llvm.amdgcn.mov.dpp (both the old and fixed versions), so I'm inclined to
not bother fixing it and just deprecate it instead.

>
>> would get turned into:
>>
>> v_mov_b32 %new, %old
>> v_mov_b32 %new, %update (dpp control)
>>
>> And hopefully coalescing will remove the first copy. Think of it as
>> being like lowering LLVM's three-address form for e.g. ADD to the x86
>> two-address form. If you wanted to get fancy, you could also recognize
>> the case where the old value is zero and turn that into a DPP move
>> with bound_ctrl = 1.
>>
>> Also, remember that the whole point of this was to be able to express
>> stuff like:
>>
>> v_min_f32 v1, v0, v1 (dpp control)
>>
>> where you take the minimum of v1 and the swizzled v0, except where you
>> would've read an invalid lane for v0, you read the old value for v1
>> instead. For operations like add and iadd where the identity is 0, you
>> can set bound_ctrl = 1, and then the optimizer can safely fold the
>> v_mov_b32 into the operation itself. That is, you'd do:
>>
>> %swizzled = i32 llvm.amdgcn.update.dpp i32 0, %update, (dpp control)
>> %new = i32 add %swizzled, %old
>>
>> and after coalescing, register allocation, etc., the backend would
>> turn that into:
>>
>> v_add_i32 v1, v0, v1 (dpp control) bound_ctrl:1
>>
>> which is functionally equivalent to the version without the
>> bound_ctrl.
>>
>> Otherwise, for operations `op' where a `op' a == a, you can do
>> something like:
>>
>> %swizzled = f32 llvm.amdgcn.update.dpp %old, %update, (dpp control)
>> %new = f32 fmin %swizzled, %old
>>
>> and then if the optimizer is clever enough, it can also fold it into
>> one instruction, since the invalid lanes will get their old values
>> back. I think that covers all the cases we care about.
>> The question is whether it's better to take that route, or whether
>> it's better to just add intrinsics like llvm.amdgcn.add_dpp,
>> llvm.amdgcn.fmin_dpp, etc., so that:
>>
>
> I would only go the route of adding a dpp intrinsic for every operation
> if it gives you functionality that you can't get with
> llvm.amdgcn.update.dpp. The main reason for this is that you will lose
> the generic combines that LLVM has for add, min, etc.

Ok, good point. I think this is only going to be used in a few specific
scenarios, but I can see stuff like fusing add + mul into mad being
useful.

>
>
> -Tom
>
>> %new = f32 llvm.amdgcn.fmin_dpp %old, %src0, %src1 (dpp_control)
>>
>> turns into:
>>
>> v_mov_b32 %new, %old
>> v_min_f32 %new, %src0, %src1 (dpp_control)
>>
>> and then we can get what we want directly, without having to do much
>> optimization except for the coalescing that already exists. The
>> downside is that we'd have to add a lot more intrinsics (one each for
>> min, max, fmin, fmax, add, and iadd).
>>
>>>
>>> -Tom
>>>
>>>> -Matt
>>>>
>>>
>
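For reference, the inclusive prefix scan semantics described at the top of
the thread can be modeled in plain C. This is a minimal sketch under stated
assumptions - a 64-lane wavefront, the exec mask passed as a uint64_t, and
every name invented for illustration; it is not part of any proposed API:

    /* Reference model of the inclusive prefix scan: for each active lane
     * i, result[i] = a[j0] `op' a[j1] `op' ... `op' a[i], where
     * j0 < j1 < ... < i are the active lanes at or below i. Inactive
     * lanes are skipped and their result slots are left untouched. */
    #include <stdint.h>

    #define WAVE_SIZE 64

    typedef float (*scan_op)(float, float);

    static void prefix_scan(const float a[WAVE_SIZE],
                            float result[WAVE_SIZE],
                            uint64_t exec, scan_op op)
    {
        int have_acc = 0;
        float acc = 0.0f;
        for (int lane = 0; lane < WAVE_SIZE; ++lane) {
            if (!((exec >> lane) & 1))
                continue;               /* inactive invocation: skip it */
            acc = have_acc ? op(acc, a[lane]) : a[lane];
            have_acc = 1;
            result[lane] = acc;
        }
    }

A uniform reduction is then just the scan value of the last active lane
broadcast to all lanes, which is why the thread treats the two operations
as sharing mostly the same concerns.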
Connor Abbott via llvm-dev
2017-Jun-15 21:56 UTC
[llvm-dev] Implementing cross-thread reduction in the AMDGPU backend
On Thu, Jun 15, 2017 at 6:56 AM, Sumner, Brian <Brian.Sumner at amd.com> wrote:
> I'm wondering about the focus on bound_ctrl. Any cleared bit in the
> row_mask or bank_mask will also disable updating the result.
>
> Brian

True. Although, the same issue (not being able to represent the unmodified
result in the disabled lanes) happens if you clear a bit in row_mask or
bank_mask as well.

> [snip]
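To make the semantics being proposed for llvm.amdgcn.update.dpp concrete:
the intrinsic swizzles %update according to the DPP control, but keeps %old
in any lane whose read would be invalid. Below is a minimal C model of just
the row_shr case, offered as a sketch; the 16-lane DPP row width matches
GCN, but the function and parameter names are invented for illustration:

    /* Model of "%new = llvm.amdgcn.update.dpp %old, %src, row_shr:shift":
     * each lane reads src from `shift' lanes below it within its 16-lane
     * row; a lane whose source falls outside its row (an invalid read)
     * keeps its old value instead. */
    #define WAVE_SIZE 64
    #define ROW_SIZE  16

    static void update_dpp_row_shr(const float old[WAVE_SIZE],
                                   const float src[WAVE_SIZE],
                                   int shift, float out[WAVE_SIZE])
    {
        for (int lane = 0; lane < WAVE_SIZE; ++lane) {
            int row_start = (lane / ROW_SIZE) * ROW_SIZE;
            int src_lane = lane - shift;
            if (src_lane >= row_start)
                out[lane] = src[src_lane];  /* valid read within the row */
            else
                out[lane] = old[lane];      /* invalid read: keep old */
        }
    }

Feeding out[lane] into fmin(out[lane], old[lane]) then leaves the first
`shift' lanes of each row unchanged, since fmin(x, x) == x - exactly the
v_min_f32 v1, v0, v1 row_shr:N behavior discussed above, and the reason the
move can fold into the ALU op even for operations whose identity is non-0.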