On 7/2/2018 3:16 PM, Sanjay Patel wrote:
> I also agree that the per-element rotate for vectors is what we want for
> this intrinsic.
>
> So I have this so far:
>
> declare i32 @llvm.catshift.i32(i32 %a, i32 %b, i32 %shift_amount)
> declare <2 x i32> @llvm.catshift.v2i32(<2 x i32> %a, <2 x i32> %b, <2 x i32> %shift_amount)
>
> For scalars, @llvm.catshift concatenates %a and %b, shifts the
> concatenated value right by the number of bits specified by
> %shift_amount modulo the bit-width, and truncates to the original
> bit-width.
> For vectors, that operation occurs for each element of the vector:
> result[i] = trunc(concat(a[i], b[i]) >> c[i])
> If %a == %b, this is equivalent to a bitwise rotate right. Rotate left
> may be implemented by subtracting the shift amount from the bit-width of
> the scalar type or vector element type.

Or just negating, iff the shift amount is defined to be modulo and the
machine is two's complement.

I'm a bit worried that while modulo is the Obviously Right Thing for
rotates, the situation is less clear for general funnel shifts.

I looked over some of the ISAs I have docs at hand for:

- x86 (32b/64b variants) has SHRD/SHLD, so both right and left variants.
  Count is modulo (mod 32 for 32b instruction variants, mod 64 for 64b
  instruction variants). As of BMI2, we also get RORX (non-flag-setting
  ROR) but no ROLX.

- ARM AArch64 has EXTR, which is a right funnel shift, but shift
  distances must be literal constants. EXTR with both source registers
  equal disassembles as ROR and is often special-cased in
  implementations. (EXTR with source 1 != source 2 often has an extra
  cycle of latency). There is RORV, which is right rotate by a variable
  (register) amount; there is no EXTRV.

- NVPTX has SHF
  (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf)
  with both left/right shift variants and with both "clamp" (clamps
  shift count at 32) and "wrap" (shift count taken mod 32) modes.

- GCN has v_alignbit_b32, which is a right funnel shift, and it seems to
  be defined to take shift distances mod 32.

Based on that sampling, modulo behavior seems like a good choice for a
generic IR instruction, and if you're going to pick one direction, right
shifts are the one to use. Not sure about other ISAs.

-Fabian
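For reference, the scalar semantics proposed above can be sketched in plain C for the i32 case. This is only an illustration under stated assumptions, not LLVM code: the function names are hypothetical, and treating %a as the high half of the concatenation is an assumption, since the proposal text does not say which operand is more significant. The last function shows the "just negate" form of rotate left, which relies on the shift amount being taken modulo the bit-width.

```c
#include <stdint.h>

/* Hypothetical model of the proposed i32 operation: concatenate a and b
   (a assumed to be the high half), shift right by (amount mod 32), and
   truncate to 32 bits. */
uint32_t catshift32(uint32_t a, uint32_t b, uint32_t amount)
{
    uint64_t wide = ((uint64_t)a << 32) | b;   /* concat(a, b) */
    return (uint32_t)(wide >> (amount % 32u)); /* shift right, truncate */
}

/* With both inputs equal, the operation is a rotate right. */
uint32_t rotr32(uint32_t x, uint32_t amount)
{
    return catshift32(x, x, amount);
}

/* Rotate left by negating the amount. Unsigned negation wraps mod 2^32,
   and 32 divides 2^32, so (0 - amount) mod 32 == (32 - amount) mod 32. */
uint32_t rotl32(uint32_t x, uint32_t amount)
{
    return catshift32(x, x, 0u - amount);
}
```

Note that the negation trick depends on the bit-width being a power of two; for other widths the explicit subtraction from the bit-width would be needed.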
On 02/07/2018 23:36, Fabian Giesen via llvm-dev wrote:
> Based on that sampling, modulo behavior seems like a good choice for a
> generic IR instruction, and if you're going to pick one direction,
> right shifts are the one to use. Not sure about other ISAs.
>
> -Fabian

Sorry for the late reply to this thread. I'd like to mention that the
existing ISD ROTL/ROTR opcodes currently do not properly assume modulo
behaviour, so that definition would need to be tidied up and made
explicit; the recent legalization code might need fixing as well. Are
you intending to add CONCATSHL/CONCATSRL ISD opcodes as well?

Additionally, the custom SSE lowering that I added doesn't assume modulo
(although I think the vXi8 lowering might work already), and it only
lowers for ROTL at the moment (mainly due to a legacy of how the XOP
instructions work), but adding ROTR support shouldn't be difficult.
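To make the "modulo" point concrete, here is a small C sketch contrasting a rotate that is only valid for in-range amounts with one that is defined for every amount. It is an illustration only, with hypothetical function names, and it says nothing about how the ISD opcodes or the legalization code are actually implemented.

```c
#include <stdint.h>

/* Non-modulo form: only valid for 1 <= n <= 31. In C, n == 0 shifts
   right by 32 and n >= 32 shifts left by 32 or more, both undefined. */
uint32_t rotl32_naive(uint32_t x, uint32_t n)
{
    return (x << n) | (x >> (32u - n));
}

/* Modulo form: the amount is reduced mod 32 first, and the opposite
   shift uses the negated amount masked to 0..31, so every n is defined. */
uint32_t rotl32_mod(uint32_t x, uint32_t n)
{
    n &= 31u;
    return (x << n) | (x >> ((0u - n) & 31u));
}

uint32_t rotr32_mod(uint32_t x, uint32_t n)
{
    n &= 31u;
    return (x >> n) | (x << ((0u - n) & 31u));
}
```

With the modulo forms, a rotate by 32 is the identity and a rotate by 36 is the same as a rotate by 4, which is the behaviour an explicit modulo definition would guarantee.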
Yes, if we're going to define this as the more general 2-operand funnel
shift, then we might as well add the matching DAG defs and adjust the
existing ROTL/ROTR defs to match the modulo behavior.

I hadn't heard the "funnel shift" terminology before, but let's go with
that because it's more descriptive/accurate than concat+shift.

We will need to define both left and right variants. Otherwise, we risk
losing the negation/subtraction of the shift amount via other transforms,
defeating the point of defining the full operation.

A few more examples to add to Fabian's:

- x86 AVX512 added vprol* / vpror* instructions for 32/64-bit element
  vector types with constant and variable rotate amounts. The "count
  operand modulo the data size (32 or 64) is used".

- PowerPC defined scalar rotates with 'rl*' (everything is based on
  rotating left). Similarly, Altivec only has 'vrl*' instructions for
  vectors, and all ops rotate modulo the element size. The funnel op is
  called "vsldoi". So again, it goes left.

On Sun, Jul 8, 2018 at 7:23 AM, Simon Pilgrim <llvm-dev at redking.me.uk> wrote:
> Sorry for the late reply to this thread, I'd like to mention that the
> existing ISD ROTL/ROTR opcodes currently do not properly assume modulo
> behaviour so that definition would need to be tidied up and made explicit;
> the recent legalization code might need fixing as well. Are you intending
> to add CONCATSHL/CONCATSRL ISD opcodes as well?
>
> Additionally the custom SSE lowering that I added doesn't assume modulo
> (although I think the vXi8 lowering might work already), and it only lowers
> for ROTL at the moment (mainly due to a legacy of how the XOP instructions
> work), but adding ROTR support shouldn't be difficult.
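As a rough illustration of the left and right variants, the two operations can be modelled in C for 32-bit values as below. This is a sketch under assumptions, not the eventual intrinsic definition: the names are hypothetical, the first operand is assumed to be the high half of the concatenation, and the amount is taken modulo 32 in both directions.

```c
#include <stdint.h>

/* Funnel shift right: low 32 bits of concat(hi, lo) >> (amount mod 32). */
uint32_t funnel_shr32(uint32_t hi, uint32_t lo, uint32_t amount)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(wide >> (amount % 32u));
}

/* Funnel shift left: high 32 bits of concat(hi, lo) << (amount mod 32). */
uint32_t funnel_shl32(uint32_t hi, uint32_t lo, uint32_t amount)
{
    uint64_t wide = ((uint64_t)hi << 32) | lo;
    return (uint32_t)((wide << (amount % 32u)) >> 32);
}
```

In this model, a left shift by s matches a right shift by 32 - s for s in 1..31, but the two differ at s == 0: the left variant returns hi, while the right variant by 32 (reduced to 0) returns lo. Under these assumptions, a single direction plus subtraction does not cover every case when the two inputs differ, which is one more reason to have both variants.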