Displaying 7 results from an estimated 7 matches for "mov_dpp".
2017 Jun 14
5
Implementing cross-thread reduction in the AMDGPU backend
...v_nop
>>>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>>>
>>>>>> The problem is that the way these instructions use the DPP word isn't
>>>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>>>>> intrinsic, but it isn't enough. For example, take the first
>>>>>> instruction:
>>>>>>
>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>
>>>>>> What it's doing is shifting v0 right...
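The fuller copies of this message describe `v_foo_f32 v1, v0, v1 row_shr:1` as "shifting v0 right by one within each row and adding it to v1", with v1 unchanged in the first lane of each row. Below is a minimal Python sketch of that behaviour. It assumes a 64-lane wavefront split into 16-lane DPP rows, models the placeholder `foo` as addition, and treats the out-of-range first lane as a disabled write. `row_shr` and `dpp_row_shr_op` are hypothetical helper names, not LLVM or AMDGPU APIs.

```python
from typing import Callable, List, Optional

WAVE_SIZE = 64  # lanes in a wavefront (assumed wave64)
ROW_SIZE = 16   # DPP operates on rows of 16 lanes


def row_shr(src: List[float], shift: int) -> List[Optional[float]]:
    """For each lane, read src from `shift` lanes to the left in the same
    row; return None where that source lane would fall outside the row."""
    out: List[Optional[float]] = []
    for lane in range(len(src)):
        if lane % ROW_SIZE >= shift:
            out.append(src[lane - shift])
        else:
            out.append(None)
    return out


def dpp_row_shr_op(op: Callable[[float, float], float],
                   v0: List[float], v1: List[float],
                   shift: int) -> List[float]:
    """Model `v_foo_f32 v1, v0, v1 row_shr:<shift>`: src0 (v0) goes through
    the row shift; lanes without a valid source are not written, so they
    keep their old v1 value."""
    shifted = row_shr(v0, shift)
    return [v1[i] if s is None else op(s, v1[i])
            for i, s in enumerate(shifted)]


if __name__ == "__main__":
    v0 = [float(i + 1) for i in range(WAVE_SIZE)]
    v1 = [100.0] * WAVE_SIZE
    out = dpp_row_shr_op(lambda a, b: a + b, v0, v1, shift=1)
    print(out[0:3], out[16:18])  # prints [100.0, 101.0, 102.0] [100.0, 117.0]
```

Lanes 0 and 16 (the first lane of rows 0 and 1) keep 100.0, while every other lane adds the v0 value from its left neighbour.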
2017 Jun 13
2
Implementing cross-thread reduction in the AMDGPU backend
...d a data hazard
>>>> v_nop
>>>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>
>>>> The problem is that the way these instructions use the DPP word isn't
>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>>> intrinsic, but it isn't enough. For example, take the first
>>>> instruction:
>>>>
>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>
>>>> What it's doing is shifting v0 right by one within each row and adding
>>>>...
2017 Jun 14
0
Implementing cross-thread reduction in the AMDGPU backend
...> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>>>
>>>>>> The problem is that the way these instructions use the DPP word
>>>>>> isn't currently expressible in LLVM. We have the
>>>>>> llvm.amdgcn.mov_dpp intrinsic, but it isn't enough. For example,
>>>>>> take the first
>>>>>> instruction:
>>>>>>
>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>
>>>>>> What it's doing is shifting v0 right...
2017 Jun 12
4
Implementing cross-thread reduction in the AMDGPU backend
...t:15 row_mask:0xa // Instruction 6
v_nop // Add two independent instructions to avoid a data hazard
v_nop
v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
The problem is that the way these instructions use the DPP word isn't
currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
intrinsic, but it isn't enough. For example, take the first
instruction:
v_foo_f32 v1, v0, v1 row_shr:1
What it's doing is shifting v0 right by one within each row and adding
it to v1. v1 stays the same in the first lane of each row, however.
With llvm.amdgcn.mov_dpp, we could try to expr...
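Instruction 7 in these snippets, `v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc`, can be modelled in the same spirit. This sketch assumes the GCN3 ISA description of `row_bcast:31` (lane 31 is broadcast to rows 2 and 3) and that `row_mask` disables writes to rows whose bit is clear, so `0xc` (binary 1100) lets only rows 2 and 3 update. As before, `foo` is modelled as addition, and `dpp_row_bcast31` is a hypothetical helper.

```python
from typing import Callable, List

WAVE_SIZE = 64   # assumed wave64
ROW_SIZE = 16    # lanes per DPP row
BCAST_LANE = 31  # row_bcast:31 reads from lane 31


def dpp_row_bcast31(op: Callable[[float, float], float],
                    src0: List[float], src1: List[float],
                    old_dst: List[float], row_mask: int) -> List[float]:
    """Model `v_foo_f32 dst, src0, src1 row_bcast:31 row_mask:<mask>`.

    Lanes in rows 2 and 3 take src0 from lane 31. Rows whose row_mask bit
    is clear are not written. Rows 0 and 1 have no broadcast source here
    and are left unchanged (a simplifying assumption). All reads see the
    values from before the instruction runs."""
    result = list(old_dst)
    for lane in range(WAVE_SIZE):
        row = lane // ROW_SIZE
        if not (row_mask >> row) & 1:
            continue  # row disabled by row_mask
        if row < 2:
            continue  # assumption: no broadcast into rows 0 and 1
        result[lane] = op(src0[BCAST_LANE], src1[lane])
    return result


if __name__ == "__main__":
    v1 = [1.0] * WAVE_SIZE
    # Instruction 7: v1 serves as src0, src1 and destination.
    out = dpp_row_bcast31(lambda a, b: a + b, v1, v1, v1, row_mask=0xC)
    print(out[31], out[32], out[63])  # prints 1.0 2.0 2.0
```

Passing the same list as source and destination is safe in this model because the result is built in a fresh copy, mirroring the fact that every lane reads its DPP source before any lane writes.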
2017 Jun 12
2
Implementing cross-thread reduction in the AMDGPU backend
...Add two independent instructions to avoid a data hazard
>> v_nop
>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>
>> The problem is that the way these instructions use the DPP word isn't
>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>> intrinsic, but it isn't enough. For example, take the first
>> instruction:
>>
>> v_foo_f32 v1, v0, v1 row_shr:1
>>
>> What it's doing is shifting v0 right by one within each row and adding
>> it to v1. v1 stays the same in the first lane of each...
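The detail that v1 "stays the same in the first lane of each row" is what makes splitting the instruction into a separate move and add awkward. The following sketch illustrates one way to see this, under stated assumptions. If the shifted copy of v0 is produced by a standalone DPP move, the first lane of each row still needs some value in the temporary. The following add only leaves v1 unchanged there if that value is the identity of the operation (0 for addition). The `fill` parameter is a hypothetical knob for what lands in those lanes. It does not model the actual semantics of `llvm.amdgcn.mov_dpp`.

```python
from typing import List

WAVE_SIZE = 64  # assumed wave64
ROW_SIZE = 16   # lanes per DPP row


def mov_row_shr1(src: List[float], fill: float) -> List[float]:
    """Hypothetical standalone DPP move with row_shr:1: each lane takes the
    value one lane to its left in the same row; the first lane of each
    row, which has no source, receives `fill`."""
    return [fill if lane % ROW_SIZE == 0 else src[lane - 1]
            for lane in range(len(src))]


def mov_then_add(v0: List[float], v1: List[float], fill: float) -> List[float]:
    """Two-step version: temp = shifted copy of v0; v1 = temp + v1."""
    temp = mov_row_shr1(v0, fill)
    return [t + b for t, b in zip(temp, v1)]


def fused(v0: List[float], v1: List[float]) -> List[float]:
    """Reference for `v_foo_f32 v1, v0, v1 row_shr:1` with foo = add:
    the first lane of each row keeps its old v1."""
    return [b if lane % ROW_SIZE == 0 else v0[lane - 1] + b
            for lane, b in enumerate(v1)]


if __name__ == "__main__":
    v0 = [float(i + 1) for i in range(WAVE_SIZE)]
    v1 = [100.0] * WAVE_SIZE
    print(mov_then_add(v0, v1, fill=0.0) == fused(v0, v1))  # prints True
    print(mov_then_add(v0, v1, fill=7.0) == fused(v0, v1))  # prints False
```

With `fill=0.0` the two-step version matches the fused instruction, while any other fill changes the first lane of every row. The required fill also depends on the operator being reduced (negative infinity for a max, for example), so no single fixed value serves every reduction.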
2017 Jun 15
2
Implementing cross-thread reduction in the AMDGPU backend
...>> v_foo_f32 v1, v1, v1 row_bcast:31 row_mask:0xc // Instruction 7
>>>>>>>>
>>>>>>>> The problem is that the way these instructions use the DPP word isn't
>>>>>>>> currently expressible in LLVM. We have the llvm.amdgcn.mov_dpp
>>>>>>>> intrinsic, but it isn't enough. For example, take the first
>>>>>>>> instruction:
>>>>>>>>
>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>>>
>>>>>>>...
2017 Jun 15
1
Implementing cross-thread reduction in the AMDGPU backend
...>>>>> 7
>>>>>>>>>
>>>>>>>>> The problem is that the way these instructions use the DPP
>>>>>>>>> word isn't currently expressible in LLVM. We have the
>>>>>>>>> llvm.amdgcn.mov_dpp intrinsic, but it isn't enough. For
>>>>>>>>> example, take the first
>>>>>>>>> instruction:
>>>>>>>>>
>>>>>>>>> v_foo_f32 v1, v0, v1 row_shr:1
>>>>>>>>>
>...