Thanks Sanjay!

At this point the cost/benefit tradeoff for rotate intrinsics seems
pretty good.

John

On 05/17/2018 11:14 AM, Sanjay Patel wrote:
> A rotate intrinsic should be relatively close in cost/complexity to
> the existing bswap.
>
> A grep of Intrinsic::bswap says we'd probably add code in:
>   InstCombine
>   InstructionSimplify
>   ConstantFolding
>   DemandedBits
>   ValueTracking
>   VectorUtils
>   SelectionDAGBuilder
>
> But I don't think it's fair to view those additions as pure added
> cost. As an example, consider that we have to add hacks to EarlyCSE
> to recognize multi-IR-instruction min/max/abs patterns. Intrinsics
> just work as-is there. So if you search for 'matchSelectPattern', you
> get an idea (I see 32 hits in 10 files) of the cost of *not* having
> intrinsics for those operations that we've decided are not worthy of
> intrinsics.
>
> On Wed, May 16, 2018 at 2:20 PM, John Regehr via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>     On 5/16/18 1:58 PM, Sanjay Patel via llvm-dev wrote:
>
>         An informal metric might be: if the operation is supported as
>         a primitive op or built-in in source languages and it is
>         supported as a single target instruction, can we guarantee
>         that 1-to-1 translation through optimization?
>
>     It seems perfectly reasonable for LLVM users to expect this to
>     happen reliably.
>
>     I'd like to take a look at the other side of the equation: the
>     cost of adding a new intrinsic in terms of teaching passes to see
>     through it, so we don't lose optimizations that worked before the
>     intrinsic was added.
>
>     For example, clearly ValueTracking needs a few lines added so that
>     computeKnownBits and friends don't get stopped by a rotate. Anyone
>     have a reasonably complete list of files that need similar
>     changes?
>     John
>
>     _______________________________________________
>     LLVM Developers mailing list
>     llvm-dev at lists.llvm.org
>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
I'm guessing nobody has started implementing any of the suggested
rotate functionality since there are still open questions, but let me
know if I'm wrong.

We're still getting patches that try to work around the current
limitations (https://reviews.llvm.org/D48705), so we should move
forward now that we've estimated and justified the costs and benefits.

Let's settle on the intrinsic definition(s).

1. There was a suggestion to generalize rotate to a "valign" or "double
shift" (that's what x86 means with its poorly worded "double precision
shift"). How does that work with vector types? The options are a full
vector-size shift or a per-element shift. If it's the full vector, do
we still want/need a specialized rotate intrinsic for per-element? If
it's per-element, do we still want/need the other form for a full
vector?

2. What is the behavior for a shift/rotate amount that is equal to or
greater than the bit width of the operand (or the bit width of a vector
element type)? Can we take that operand modulo the bit width, or does
that not map well to the hardware semantics?

On Thu, May 17, 2018 at 5:23 PM, John Regehr <regehr at cs.utah.edu> wrote:
> [snip]
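Sanjay's question 2 can be made concrete with a small sketch. This is a
hypothetical illustration of the "modulo the bit width" choice (the
function name `rotl32` and the C rendering are mine, not a committed
intrinsic definition); note that masking the amount also sidesteps the
C-level undefined behavior of shifting a 32-bit value by 32:

```c
#include <stdint.h>

/* Hypothetical sketch: rotate-left where the amount is taken modulo
   the bit width, matching what x86/AArch64 do for native rotates.
   The mask also avoids the undefined behavior of x >> 32. */
static inline uint32_t rotl32(uint32_t x, uint32_t n) {
    n &= 31;                           /* amount modulo bit width */
    if (n == 0)
        return x;                      /* nothing to do; avoids x >> 32 */
    return (x << n) | (x >> (32 - n));
}
```

With this definition a rotate by 32 (or any multiple of 32) is the
identity, so the operation is total over all amounts.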
On 7/2/2018 11:27 AM, Sanjay Patel via llvm-dev wrote:
> 1. There was a suggestion to generalize rotate to a "valign" or
> "double shift" (that's what x86 means with its poorly worded "double
> precision shift"). How does that work with vector types? The options
> are a full vector-size shift or a per-element shift. If it's the full
> vector, do we still want/need a specialized rotate intrinsic for
> per-element? If it's per-element, do we still want/need the other
> form for a full vector?

The scalar rotate moves bits, and such an operation doesn't make much
sense for moving data across lanes in vectors. I voted for the valign
variant originally, but I think that a per-element rotate would be the
natural vector version of the scalar operation. It could still be doing
the "double shift", since that's more general; it just shouldn't be
called valign. A per-byte cross-lane vector rotate (an actual vector
valign) can be implemented using a shuffle, so I don't think that an
intrinsic for that is necessary.

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
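For what it's worth, the observation above that a cross-lane byte
rotate is just a shuffle can be sketched in scalar code. This is purely
illustrative (the name `valign16` and the 16-byte width are my
assumptions); a real target would emit a single shuffle instruction
whose mask is the rotated identity:

```c
#include <stdint.h>

/* Illustrative only: a cross-lane byte rotate of a 16-byte vector is
   a shuffle with mask out[i] = in[(i + n) mod 16]. The loop is a
   scalar stand-in for the target's shuffle instruction. */
static void valign16(uint8_t out[16], const uint8_t in[16], unsigned n) {
    for (unsigned i = 0; i < 16; i++)
        out[i] = in[(i + n) % 16];
}
```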
1. I'm not sure what you mean by "full vector" here: using the same
shift distance for all lanes (as opposed to per-lane distances), or
doing a treat-the-vector-as-a-bag-of-bits shift that doesn't have any
internal lane boundaries? If the latter, that doesn't really help you
much with implementing a per-lane rotate.

I think the most useful generalization of a vector funnel shift in this
context is the lane-wise

  result[i] = trunc(concat(a[i], b[i]) >> c[i])

(or the equivalent for a left shift); the special case a == b is a
rotate.

2. For operand sizes that have native rotate instructions, at least
x86, x86-64, ARM A32/T32, and AArch64 A64 agree that rotate distances
are modulo the operand width. I believe PPC and MIPS do the same but am
not sure (it's been a while), and I have no clue about other
architectures. It certainly seems the most natural way to define it,
since rotates are cyclic to begin with.

8- and 16-bit rotates will need to be lowered into multi-instruction
sequences on most targets (x86 and x86-64 can do them directly, but
RISC-lineage archs usually don't have rotates at smaller than word
size). Having explicit modulo semantics might end up forcing an
explicit extra AND there, so that's an extra cost, but it would
certainly be nice to have the rotate definition be total.

-Fabian

On 07/02/2018 09:27 AM, Sanjay Patel wrote:
> [snip]
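Fabian's lane-wise definition can be written out as a scalar sketch for
one 32-bit lane. The helper names `fshr32`/`rotr32` are hypothetical;
the sketch applies the modulo-the-width convention from point 2 to the
shift amount and shows the a == b special case collapsing to a rotate:

```c
#include <stdint.h>

/* Sketch of one 32-bit lane of result = trunc(concat(a, b) >> c):
   'a' supplies the high 32 bits, 'b' the low 32 bits, and we keep
   the low 32 bits of the 64-bit shift. Amount taken modulo 32. */
static inline uint32_t fshr32(uint32_t a, uint32_t b, uint32_t c) {
    c &= 31;
    if (c == 0)
        return b;                       /* avoids the UB of a << 32 */
    return (b >> c) | (a << (32 - c));
}

/* The special case a == b degenerates to rotate-right. */
static inline uint32_t rotr32(uint32_t x, uint32_t c) {
    return fshr32(x, x, c);
}
```

Substituting a == b shows why: (x >> c) | (x << (32 - c)) is exactly
the textbook rotate-right.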