1. I'm not sure what you mean by "full vector" here - using the same shift
distance for all lanes (as opposed to per-lane distances), or doing a
treat-the-vector-as-a-bag-of-bits shift that doesn't have any internal lane
boundaries? If the latter, that doesn't really help you much with
implementing a per-lane rotate.

I think the most useful generalization of a vector funnel shift in this
context is the lane-wise

  result[i] = trunc(concat(a[i], b[i]) >> c[i])

(or the equivalent for a left shift); the special case a == b is a rotate.

2. For operand sizes that have native rotate instructions, at least x86,
x86-64, ARM A32/T32 and AArch64 A64 agree that rotate distances are modulo
the operand width. I believe PPC and MIPS do the same but am not sure (it's
been a while); no clue about other architectures. It certainly seems the
most natural way to define it, since rotates are cyclic to begin with.

8- and 16-bit rotates will need to be lowered into multi-instruction
sequences on most targets (x86 and x86-64 can do them directly, but
RISC-lineage archs usually don't have rotates at smaller than word size).
Having explicit modulo semantics might end up forcing an explicit extra AND
there, so that's an extra cost, but it would certainly be nice to have the
rotate definition be total.

-Fabian

On 07/02/2018 09:27 AM, Sanjay Patel wrote:
> I'm guessing nobody has started implementing any of the suggested rotate
> functionality since there are still open questions, but let me know if
> I'm wrong.
>
> We're still getting patches that try to work around the current
> limitations (https://reviews.llvm.org/D48705), so we should move forward
> since we've approximated/justified the cost and benefits.
>
> Let's settle on the intrinsic definition(s).
>
> 1. There was a suggestion to generalize rotate to a "valign" or "double
> shift" (that's what x86 means with its poorly worded "double precision
> shift"). How does that work with vector types? The options are a full
> vector-size shift or a per-element shift. If it's the full vector, do we
> still want/need a specialized rotate intrinsic for per-element? If it's
> per-element, do we still want/need the other form for a full vector?
>
> 2. What is the behavior for a shift/rotate amount that is equal to or
> greater than the bit-width of the operand (or of a vector element type)?
> Can we modulo that operand by the bit width, or does that not map well to
> the hardware semantics?
>
> On Thu, May 17, 2018 at 5:23 PM, John Regehr <regehr at cs.utah.edu> wrote:
>
>     Thanks Sanjay!
>
>     At this point the cost/benefit tradeoff for rotate intrinsics seems
>     pretty good.
>
>     John
>
>     On 05/17/2018 11:14 AM, Sanjay Patel wrote:
>
>         A rotate intrinsic should be relatively close in cost/complexity
>         to the existing bswap.
>
>         A grep of intrinsic::bswap says we'd probably add code in:
>             InstCombine
>             InstructionSimplify
>             ConstantFolding
>             DemandedBits
>             ValueTracking
>             VectorUtils
>             SelectionDAGBuilder
>
>         But I don't think it's fair to view those additions as pure added
>         cost. As an example, consider that we have to add hacks to
>         EarlyCSE to recognize multi-IR-instruction min/max/abs patterns.
>         Intrinsics just work as-is there. So if you search for
>         'matchSelectPattern', you get an idea (I see 32 hits in 10 files)
>         of the cost of *not* having intrinsics for those operations that
>         we've decided are not worthy of intrinsics.
>
>         On Wed, May 16, 2018 at 2:20 PM, John Regehr via llvm-dev
>         <llvm-dev at lists.llvm.org> wrote:
>
>             On 5/16/18 1:58 PM, Sanjay Patel via llvm-dev wrote:
>
>                 An informal metric might be: if the operation is
>                 supported as a primitive op or built-in in source
>                 languages and it is supported as a single target
>                 instruction, can we guarantee that 1-to-1 translation
>                 through optimization?
>
>             It seems perfectly reasonable for LLVM users to expect this
>             to happen reliably.
>
>             I'd like to take a look at the other side of the equation:
>             the cost of adding a new intrinsic in terms of teaching
>             passes to see through it, so we don't lose optimizations
>             that worked before the intrinsic was added.
>
>             For example, clearly ValueTracking needs a few lines added
>             so that computeKnownBits and friends don't get stopped by a
>             rotate. Anyone have a reasonably complete list of files that
>             need similar changes?
>
>             John
>
>             _______________________________________________
>             LLVM Developers mailing list
>             llvm-dev at lists.llvm.org
>             http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
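Fabian's point 2 - that rotate distances are naturally taken modulo the
operand width - can be illustrated with the standard branch-free C idiom.
This is only a sketch; `rotr32` is a hypothetical helper name used here for
illustration, not anything proposed for LLVM:

```c
#include <stdint.h>

/* Rotate right with modulo-32 count semantics, matching the native
 * behavior of x86 ROR and AArch64 RORV. The "& 31" reductions also avoid
 * the undefined behavior of shifting a 32-bit value by 32 in C. */
static uint32_t rotr32(uint32_t x, uint32_t n) {
    n &= 31;
    return (x >> n) | (x << (-n & 31));
}
```

Under the modulo definition the operation is total: rotr32(x, 37) and
rotr32(x, 5) are the same value, and a count of 0 is the identity.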
I also agree that the per-element rotate for vectors is what we want for
this intrinsic.

So I have this so far:

declare i32 @llvm.catshift.i32(i32 %a, i32 %b, i32 %shift_amount)
declare <2 x i32> @llvm.catshift.v2i32(<2 x i32> %a, <2 x i32> %b, <2 x i32> %shift_amount)

For scalars, @llvm.catshift concatenates %a and %b, shifts the concatenated
value right by the number of bits specified by %shift_amount modulo the
bit-width, and truncates to the original bit-width.

For vectors, that operation occurs for each element of the vector:

  result[i] = trunc(concat(a[i], b[i]) >> c[i])

If %a == %b, this is equivalent to a bitwise rotate right. Rotate left may
be implemented by subtracting the shift amount from the bit-width of the
scalar type or vector element type.

On Mon, Jul 2, 2018 at 2:37 PM, Fabian Giesen <fabiang at radgametools.com> wrote:
> [snip]
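The scalar @llvm.catshift semantics described above can be sketched in C as
follows. `catshift32` is an illustrative name for the i32 case only, not an
actual LLVM binding, and the modulo-count behavior is the proposal under
discussion rather than settled:

```c
#include <stdint.h>

/* Sketch of the proposed scalar semantics: concatenate a (high bits) and
 * b (low bits) into a double-width value, shift right by the count taken
 * modulo the bit-width, then truncate back to the original width. */
static uint32_t catshift32(uint32_t a, uint32_t b, uint32_t c) {
    uint64_t cat = ((uint64_t)a << 32) | b;
    return (uint32_t)(cat >> (c & 31));
}
```

With a == b this reduces to a rotate right, as the thread notes: the bits
shifted out of the low half are refilled from an identical high half.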
On 7/2/2018 3:16 PM, Sanjay Patel wrote:
> [snip]
> Rotate left may be implemented by subtracting the shift amount from the
> bit-width of the scalar type or vector element type.

Or just negating, iff the shift amount is defined to be modulo and the
machine is two's complement.

I'm a bit worried that while modulo is the Obviously Right Thing for
rotates, the situation is less clear for general funnel shifts. I looked
over some of the ISAs I have docs at hand for:

- x86 (32b/64b variants) has SHRD/SHLD, so both right and left variants.
  Count is modulo (mod 32 for 32b instruction variants, mod 64 for 64b
  instruction variants). As of BMI2, we also get RORX (non-flag-setting
  ROR) but no ROLX.

- ARM AArch64 has EXTR, which is a right funnel shift, but shift distances
  must be literal constants. EXTR with both source registers equal
  disassembles as ROR and is often special-cased in implementations (EXTR
  with source 1 != source 2 often has an extra cycle of latency). There is
  RORV, which is a right rotate by a variable (register) amount; there is
  no EXTRV.

- NVPTX has SHF
  (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf)
  with both left/right shift variants and with both "clamp" (clamps shift
  count at 32) and "wrap" (shift count taken mod 32) modes.

- GCN has v_alignbit_b32, which is a right funnel shift, and it seems to
  be defined to take shift distances mod 32.

Based on that sampling, modulo behavior seems like a good choice for a
generic IR instruction, and if you're going to pick one direction, right
shifts are the one to use. Not sure about other ISAs.

-Fabian
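The negation point can be spot-checked in C. Again a sketch with
hypothetical helper names, assuming modulo-32 count semantics on a
two's-complement machine:

```c
#include <stdint.h>

/* Rotate right with modulo-32 count (as in the earlier sketch). */
static uint32_t rotr32(uint32_t x, uint32_t n) {
    n &= 31;
    return (x >> n) | (x << (-n & 31));
}

/* Rotate left expressed as rotate right by the negated count:
 * (-n) mod 32 == (32 - n) mod 32, so no explicit subtraction from the
 * bit-width is needed. */
static uint32_t rotl32(uint32_t x, uint32_t n) {
    return rotr32(x, -n);
}
```

On targets whose rotate instructions already reduce the count modulo the
width, the negation typically folds into a single NEG plus the rotate.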