thr3ads.net - llvm dev - [llvm-dev] Rotates, once again [Jul 2018]


Fabian Giesen via llvm-dev

2018-Jul-02 22:36 UTC

[llvm-dev] Rotates, once again

On 7/2/2018 3:16 PM, Sanjay Patel wrote:
> I also agree that the per-element rotate for vectors is what we want for
> this intrinsic.
> 
> So I have this so far:
> 
> declare i32 @llvm.catshift.i32(i32 %a, i32 %b, i32 %shift_amount)
> declare <2 x i32> @llvm.catshift.v2i32(<2 x i32> %a, <2 x i32> %b, <2 x i32> %shift_amount)
> 
> For scalars, @llvm.catshift concatenates %a and %b, shifts the 
> concatenated value right by the number of bits specified by 
> %shift_amount modulo the bit-width, and truncates to the original 
> bit-width.
> For vectors, that operation occurs for each element of the vector:
>     result[i] = trunc(concat(a[i], b[i]) >> shift_amount[i])
> If %a == %b, this is equivalent to a bitwise rotate right. Rotate left 
> may be implemented by subtracting the shift amount from the bit-width of 
> the scalar type or vector element type.

Or just negating, iff the shift amount is defined to be modulo and the
machine is two's complement.
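
A minimal Python sketch of the semantics under discussion, assuming 32-bit scalars and a shift amount taken modulo the bit-width. The function `catshift` is a hypothetical model of the proposed intrinsic, not an existing API, and it assumes that %a supplies the high half of the concatenation. It illustrates how rotate right falls out when both inputs are equal, and how rotate left can be obtained by negating the amount.

```python
WIDTH = 32


def catshift(a, b, amount, width=WIDTH):
    """Model of trunc(concat(a, b) >> (amount mod width)).

    Assumption: `a` is the high half of the concatenation, `b` the low half.
    """
    mask = (1 << width) - 1
    n = amount % width                 # shift amount is taken modulo the bit-width
    wide = ((a & mask) << width) | (b & mask)
    return (wide >> n) & mask          # truncate back to the original width


def rotr(x, amount, width=WIDTH):
    # With both inputs equal, the funnel shift is a rotate right.
    return catshift(x, x, amount, width)


def rotl(x, amount, width=WIDTH):
    # Negating the amount works because it is reduced modulo the width:
    # -amount mod width is width - amount for 1..width-1, and 0 for 0.
    return catshift(x, x, -amount, width)
```

For example, under these assumptions `rotr(0x80000001, 1)` gives `0xC0000000` and `rotl(0x80000001, 1)` gives `0x00000003`.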

I'm a bit worried that while modulo is the Obviously Right Thing for 
rotates, the situation is less clear for general funnel shifts.

I looked over some of the ISAs I have docs at hand for:

- x86 (32b/64b variants) has SHRD/SHLD, so both right and left variants. 
Count is modulo (mod 32 for 32b instruction variants, mod 64 for 64b 
instruction variants). As of BMI2, we also get RORX (non-flag-setting 
ROR) but no ROLX.

- ARM AArch64 has EXTR, which is a right funnel shift, but shift 
distances must be literal constants. EXTR with both source registers 
equal disassembles as ROR and is often special-cased in implementations. 
(EXTR with source 1 != source 2 often has an extra cycle of latency). 
There is RORV which is right rotate by a variable (register) amount; 
there is no EXTRV.

- NVPTX has SHF
(https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf)
with both left/right shift variants and with both "clamp" (clamps shift
count at 32) and "wrap" (shift count taken mod 32) modes.

- GCN has v_alignbit_b32 which is a right funnel shift, and it seems to 
be defined to take shift distances mod 32.

Based on that sampling, modulo behavior seems like a good choice for a 
generic IR instruction, and if you're going to pick one direction, right 
shifts are the one to use. Not sure about other ISAs.
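
To make the "clamp" versus "wrap" distinction concrete, here is a small Python sketch of a 32-bit right funnel shift with both count modes. The helper name `funnel_shift_right` and its operand order are assumptions for illustration only; it does not claim to reproduce any particular instruction's encoding. The two modes agree for counts below 32 and diverge from 32 upward.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def funnel_shift_right(hi, lo, count, mode="wrap"):
    """Right funnel shift of the 64-bit value hi:lo, returning the low 32 bits.

    `count` is treated as an unsigned (non-negative) value.
    """
    if mode == "wrap":
        n = count % WIDTH          # count taken mod 32
    elif mode == "clamp":
        n = min(count, WIDTH)      # count clamped at 32
    else:
        raise ValueError("mode must be 'wrap' or 'clamp'")
    wide = ((hi & MASK) << WIDTH) | (lo & MASK)
    return (wide >> n) & MASK


hi, lo = 0xAAAAAAAA, 0x55555555
small_wrap = funnel_shift_right(hi, lo, 8, "wrap")
small_clamp = funnel_shift_right(hi, lo, 8, "clamp")
full_wrap = funnel_shift_right(hi, lo, 32, "wrap")    # count wraps to 0
full_clamp = funnel_shift_right(hi, lo, 32, "clamp")  # count stays at 32
```

With a count of 32, the wrapping mode returns the low input unchanged while the clamping mode returns the high input, which is the behavioural difference a generic IR definition has to pick between.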

-Fabian

Simon Pilgrim via llvm-dev

2018-Jul-08 13:23 UTC

[llvm-dev] Rotates, once again

On 02/07/2018 23:36, Fabian Giesen via llvm-dev wrote:
> Based on that sampling, modulo behavior seems like a good choice for a
> generic IR instruction, and if you're going to pick one direction,
> right shifts are the one to use. Not sure about other ISAs.

Sorry for the late reply to this thread. I'd like to mention that the
existing ISD ROTL/ROTR opcodes do not currently assume modulo
behaviour, so that definition would need to be tidied up and made
explicit; the recent legalization code might need fixing as well. Are
you intending to add CONCATSHL/CONCATSRL ISD opcodes as well?

Additionally the custom SSE lowering that I added doesn't assume modulo 
(although I think the vXi8 lowering might work already), and it only 
lowers for ROTL at the moment (mainly due to a legacy of how the XOP 
instructions work), but adding ROTR support shouldn't be difficult.

Sanjay Patel via llvm-dev

2018-Jul-08 15:23 UTC

[llvm-dev] Rotates, once again

Yes, if we're going to define this as the more general 2-operand funnel
shift, then we might as well add the matching DAG defs and adjust the
existing ROTL/ROTR defs to match the modulo.

I hadn't heard the "funnel shift" terminology before, but let's go with
that because it's more descriptive/accurate than concat+shift.

We will need to define both left and right variants. Otherwise, we risk
losing the negation/subtraction of the shift amount via other transforms
and defeating the point of defining the full operation.
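
A short Python sketch of the two directions, assuming 32-bit values and modulo shift amounts; `funnel_left` and `funnel_right` are hypothetical names used only for illustration. It shows that a left funnel shift matches a right funnel shift by the negated amount for every non-zero amount, but not for an amount of zero, where the left form returns the high input and the right form returns the low input. For rotates (both inputs equal) the zero case is harmless, which is one way to see why general funnel shifts are less clear-cut than rotates.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def funnel_right(hi, lo, amount):
    """Shift hi:lo right by amount mod 32 and keep the low 32 bits."""
    n = amount % WIDTH
    wide = ((hi & MASK) << WIDTH) | (lo & MASK)
    return (wide >> n) & MASK


def funnel_left(hi, lo, amount):
    """Shift hi:lo left by amount mod 32 and keep the high 32 bits."""
    n = amount % WIDTH
    wide = ((hi & MASK) << WIDTH) | (lo & MASK)
    return ((wide << n) >> WIDTH) & MASK


hi, lo = 0x12345678, 0x9ABCDEF0

# Non-zero amounts: left by n equals right by -n (that is, 32 - n).
agree_nonzero = all(
    funnel_left(hi, lo, n) == funnel_right(hi, lo, -n) for n in range(1, WIDTH)
)

# Zero amount: the two directions select different inputs.
left_zero = funnel_left(hi, lo, 0)    # the high input
right_zero = funnel_right(hi, lo, 0)  # the low input
```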

A few more examples to add to Fabian's:

- x86 AVX512 added vprol* / vpror* instructions for 32/64-bit element
vector types with constant and variable rotate amounts. The "count operand
modulo the data size (32 or 64) is used".

- PowerPC defined scalar rotates with 'rl*' (everything is based on
rotating left). Similarly, Altivec only has 'vrl*' instructions for vectors,
and all ops rotate modulo the element size. The funnel op is called
"vsldoi". So again, it goes left.



