Displaying 3 results from an estimated 3 matches for "foo_dpp".
2017 Jun 12
4
Implementing cross-thread reduction in the AMDGPU backend
...l move
with the dpp modifier, and no restrictions on the destination). We
could set bound_ctrl to 0 to work around this, since it will make %tmp
0 in lane 0, but that won't work with operations whose identity is
non-0 like min and max. What we need is something like:
%result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
where llvm.amdgcn.foo_dpp copies the first argument to the result,
then applies the DPP swizzling to the second argument and does 'foo'
to the second and third arguments. It would mean that we'd have a
separate intrinsic for every operation we care ab...
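
The behaviour described in this excerpt can be modeled with a small lane-level simulation. This is only a sketch under stated assumptions, not the actual intrinsic or its lowering: it assumes 16-lane DPP rows, assumes row_shr:1 makes each lane read from the lane one position below it in the same row (so the first lane of each row has no valid source), and uses the hypothetical helper names mov_dpp_zero_fill_then_op and foo_dpp_row_shr1. It contrasts the zero-fill workaround with the proposed "copy the first argument" behaviour for an operation whose identity is not 0 (min) and for one whose identity is 0 (add).

```python
from operator import add
from typing import Callable, List

ROW = 16  # assumed DPP row width in lanes


def mov_dpp_zero_fill_then_op(op: Callable[[int, int], int],
                              src: List[int], acc: List[int]) -> List[int]:
    """Zero-fill workaround: lanes without a valid source read 0, then op is applied."""
    out = []
    for lane in range(len(src)):
        has_source = lane % ROW != 0          # first lane of a row has no lane below it
        tmp = src[lane - 1] if has_source else 0
        out.append(op(tmp, acc[lane]))
    return out


def foo_dpp_row_shr1(op: Callable[[int, int], int], old: List[int],
                     src: List[int], acc: List[int]) -> List[int]:
    """Hypothetical model of the proposed foo_dpp with row_shr:1.

    The result starts as a copy of `old` (the first argument); lanes that
    have a valid source get op(shifted src, acc) instead.
    """
    out = list(old)
    for lane in range(len(src)):
        if lane % ROW != 0:
            out[lane] = op(src[lane - 1], acc[lane])
    return out


values = list(range(5, 21))                   # one 16-lane row: 5, 6, ..., 20

min_zero_fill = mov_dpp_zero_fill_then_op(min, values, values)
min_tied = foo_dpp_row_shr1(min, values, values, values)

add_zero_fill = mov_dpp_zero_fill_then_op(add, values, values)
add_tied = foo_dpp_row_shr1(add, values, values, values)
```

In this model the two variants differ only in the first lane of the row: with min, the zero-fill version replaces the lane's value with 0, while the tied version leaves it unchanged. With add the two agree, because 0 is the identity of add.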
2017 Jun 12
2
Implementing cross-thread reduction in the AMDGPU backend
...p
>> 0 in lane 0, but that won't work with operations whose identity is
>> non-0 like min and max. What we need is something like:
>>
Why is %tmp garbage? I thought the two options were 0 (bound_ctrl = 0)
or %input (bound_ctrl = 1)?
-Tom
>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>>
>> where llvm.amdgcn.foo_dpp copies the first argument to the result,
>> then applies the DPP swizzling to the second argument and does 'foo'
>> to the second and third arguments. It would mean that we'd have a
>> separ...
2017 Jun 13
2
Implementing cross-thread reduction in the AMDGPU backend
...n bound_ctrl = 1 and it reads from an invalid
thread, and then when bound_ctrl=1, lower the intrinsic to a special tied version
of V_MOV_B32_dpp where the src and dst are the same register.
-Tom
> Connor
>>>> %result = call llvm.amdgcn.foo_dpp %result, %input, %result row_shr:1
>>>>
>>>> where llvm.amdgcn.foo_dpp copies the first argument to the result,
>>>> then applies the DPP swizzling to the second argument and does 'foo'
>>>> to the second and third arguments. It would mean tha...