thr3ads.net - llvm dev - [llvm-dev] Rotates, once again [May 2018]

If this information is useful, please help other people find it:
Share via:

Fabian Giesen via llvm-dev

2018-May-14 08:53 UTC

[llvm-dev] Rotates, once again

Hi everyone!

I recently ran into some interesting issues with generation of rotate
instructions - the details are in the bug tracker
(https://bugs.llvm.org/show_bug.cgi?id=37387 and related bugs) for those
interested - and it brought up the issue of rotates in the IR again. Now this is
a proposal that has been made (and been rejected) several times, but I've
been told that this time round we ran into issues not previously considered, and
that I should bring the topic up on llvm-dev.

I'll try to summarize the status quo first and then bring up the tricky
cases that might tip the balance in favor of rotate intrinsics.

Rotates
======
Rotates, or circular shifts, are bit shifts where bits "dropped out"
on one side of the operand are shifted back in on the other.

They occur less frequently in real-world code than regular (non-circular) bit
shifts, but are nevertheless directly supported in most popular ISAs. They are
useful primitives for cryptography, non-cryptographic hashing, pseudo-random
number generation, bit I/O, and probably other cases as well.

They do not have a direct representation in C/C++, but several compilers support
them via intrinsics (including MSVC, which Clang-CL tries to be compatible to),
and some languages such as Rust include them explicitly.

The status quo
=============
Rotates by compile-time constant amounts are fairly easy to express in C-like
languages by ORing together two opposite-direction shifts. For example, rotating
a uint32_t  x left by 3 positions in C boils down to

  (x << 3) | (x >> 29)

which is simple enough. The backends with rotate support detect this type of
pattern at the DAG stage and map it to a rotate instruction.

This, in a nutshell, is the argument against a direct rotate in the IR: the
existing primitives are sufficient to model it, and doing so allows LLVM to
leverage transforms and analyses such as SimplifyDemandedBits that know how to
deal with the more common regular shifts to work on bit rotate expressions as
well. Adding a new primitive would mean that a whole bunch of optimization
passes would need to know how to deal with it (or else they might lose on
optimization opportunities).

Variable-distance rotates
========================
Things become murkier when we consider not just rotates by a compile-time
constant, but also rotates where the amount is determined at run time.

For one thing, it's not immediately obvious how to write a well-defined
corresponding C expression: the most "natural" version (for a 32-bit
left rotate) is

  (x << k) | (x >> (32 - k))

but this is undefined for k==0 due to the right-shift of a 32-bit value by 32.
That said, both GCC and Clang will generally compile this particular expression
into a rotate-left as of this writing, but having UB is unfortunate. A trickier
variant that is always well-defined is

  (x << (k & 31)) | (x >> (-k & 31))

but this is a substantially more complicated sub-dag  than the OR of two shifts
you would see for a compile-time constant rotate distance, and there are more
opportunities for things to go wrong along the way. Bug 37387 contains a few
examples, the simplest of which was given by Sanjay:

void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
  for (unsigned i = 0; i < N; ++i)
    x[ (a[i] >> b) | (a[i] << (32 - b)) ] = i; // shift amt is loop
invariant
}

"32 - b" is loop-invariant and gets hoisted, and what's left of
the expression doesn't match a known "rotate" pattern (since the
DAG can't see across BB boundaries at the moment), so instead we get two
shifts and an OR inside the loop.

The counterargument here was that GlobalISel should eventually fix this
particular problem, but nevertheless, as of today, LICM breaks recognition of
the rotate pattern in this (very simple) loop, and there is reason to believe
that other transforms (existing or forthcoming) might well break the pattern as
well; the issues referenced in the linked bug illustrate some of the existing
failure modes.

This situation is frustrating for users of variable-distance rotates; right now,
there simply is no way to write code in a source language that will reliably
turn into rotates; you either have to watch the generated code closely, or just
throw up your hands and use platform-dependent inline assembly. (It is assumed
that the code in question is in a hot spot.)

The proposal
===========
The idea would be to add rotate intrinsics, primarily intended to be used for
the variable-distance case, which is otherwise fragile.

Frontends can emit them directly (for source language-level intrinsics, or when
matched from "known good" rotate patterns) and the backends can match
them and select appropriate instructions.

For the constant-distance case, the preferred form would still be the
OR-of-shifts variant, which plays nice with SimplifyDemandedBits and is known to
existing optimization passes.

SDB does not currently do anything with shifts by non-constant amounts anyway;
and, for the use cases in question, a rotate that is not touched by other
optimization passes is still preferable to a more complex expression that can be
partially optimized, but then ends up in a form that the backend doesn't
know how to turn back into a rotate, which is a net loss.

Thoughts?

-Fabian

David Jones via llvm-dev

2018-May-14 13:31 UTC

head link

[llvm-dev] Rotates, once again

It might be worth generalizing this to an arbitrary bit-select.

My application needs to extract contiguous bits of multi-precision numbers.
e.g. given a 128-bit number stored in an array a[2] and we want to select
64 bits starting from position N:

    result = (a[1] << (64-N)) | (a[0] >> N);

which is basically the same as your rotate if a[1] == a[0].  Some machines
(e.g. x86 and PA-RISC) have this instruction in hardware. On x86 this is a
double-edged sword, as SHLD/SHRD may be microcoded and it may be faster
just to expand to 2 shifts and an OR. Even so, it might help if I could
encode this more directly.


On Mon, May 14, 2018 at 4:53 AM, Fabian Giesen via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hi everyone!
>
> I recently ran into some interesting issues with generation of rotate
> instructions - the details are in the bug tracker (
> https://bugs.llvm.org/show_bug.cgi?id=37387 and related bugs) for those
> interested - and it brought up the issue of rotates in the IR again. Now
> this is a proposal that has been made (and been rejected) several times,
> but I've been told that this time round we ran into issues not
previously
> considered, and that I should bring the topic up on llvm-dev.
>
> I'll try to summarize the status quo first and then bring up the tricky
> cases that might tip the balance in favor of rotate intrinsics.
>
> Rotates
> ======>
> Rotates, or circular shifts, are bit shifts where bits "dropped
out" on
> one side of the operand are shifted back in on the other.
>
> They occur less frequently in real-world code than regular (non-circular)
> bit shifts, but are nevertheless directly supported in most popular ISAs.
> They are useful primitives for cryptography, non-cryptographic hashing,
> pseudo-random number generation, bit I/O, and probably other cases as well.
>
> They do not have a direct representation in C/C++, but several compilers
> support them via intrinsics (including MSVC, which Clang-CL tries to be
> compatible to), and some languages such as Rust include them explicitly.
>
> The status quo
> =============>
> Rotates by compile-time constant amounts are fairly easy to express in
> C-like languages by ORing together two opposite-direction shifts. For
> example, rotating a uint32_t  x left by 3 positions in C boils down to
>
>   (x << 3) | (x >> 29)
>
> which is simple enough. The backends with rotate support detect this type
> of pattern at the DAG stage and map it to a rotate instruction.
>
> This, in a nutshell, is the argument against a direct rotate in the IR:
> the existing primitives are sufficient to model it, and doing so allows
> LLVM to leverage transforms and analyses such as SimplifyDemandedBits that
> know how to deal with the more common regular shifts to work on bit rotate
> expressions as well. Adding a new primitive would mean that a whole bunch
> of optimization passes would need to know how to deal with it (or else they
> might lose on optimization opportunities).
>
> Variable-distance rotates
> ========================>
> Things become murkier when we consider not just rotates by a compile-time
> constant, but also rotates where the amount is determined at run time.
>
> For one thing, it's not immediately obvious how to write a well-defined
> corresponding C expression: the most "natural" version (for a
32-bit left
> rotate) is
>
>   (x << k) | (x >> (32 - k))
>
> but this is undefined for k==0 due to the right-shift of a 32-bit value by
> 32. That said, both GCC and Clang will generally compile this particular
> expression into a rotate-left as of this writing, but having UB is
> unfortunate. A trickier variant that is always well-defined is
>
>   (x << (k & 31)) | (x >> (-k & 31))
>
> but this is a substantially more complicated sub-dag  than the OR of two
> shifts you would see for a compile-time constant rotate distance, and there
> are more opportunities for things to go wrong along the way. Bug 37387
> contains a few examples, the simplest of which was given by Sanjay:
>
> void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
>   for (unsigned i = 0; i < N; ++i)
>     x[ (a[i] >> b) | (a[i] << (32 - b)) ] = i; // shift amt is
loop
> invariant
> }
>
> "32 - b" is loop-invariant and gets hoisted, and what's left
of the
> expression doesn't match a known "rotate" pattern (since the
DAG can't see
> across BB boundaries at the moment), so instead we get two shifts and an OR
> inside the loop.
>
> The counterargument here was that GlobalISel should eventually fix this
> particular problem, but nevertheless, as of today, LICM breaks recognition
> of the rotate pattern in this (very simple) loop, and there is reason to
> believe that other transforms (existing or forthcoming) might well break
> the pattern as well; the issues referenced in the linked bug illustrate
> some of the existing failure modes.
>
> This situation is frustrating for users of variable-distance rotates;
> right now, there simply is no way to write code in a source language that
> will reliably turn into rotates; you either have to watch the generated
> code closely, or just throw up your hands and use platform-dependent inline
> assembly. (It is assumed that the code in question is in a hot spot.)
>
> The proposal
> ===========>
> The idea would be to add rotate intrinsics, primarily intended to be used
> for the variable-distance case, which is otherwise fragile.
>
> Frontends can emit them directly (for source language-level intrinsics, or
> when matched from "known good" rotate patterns) and the backends
can match
> them and select appropriate instructions.
>
> For the constant-distance case, the preferred form would still be the
> OR-of-shifts variant, which plays nice with SimplifyDemandedBits and is
> known to existing optimization passes.
>
> SDB does not currently do anything with shifts by non-constant amounts
> anyway; and, for the use cases in question, a rotate that is not touched by
> other optimization passes is still preferable to a more complex expression
> that can be partially optimized, but then ends up in a form that the
> backend doesn't know how to turn back into a rotate, which is a net
loss.
>
> Thoughts?
>
> -Fabian
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180514/98e876e3/attachment.html>

Sanjay Patel via llvm-dev

2018-May-15 22:34 UTC

head link

[llvm-dev] Rotates, once again

Thanks for writing this up. I'd like to have this intrinsic too.

Another argument for having the intrinsic is shown in PR37426:
https://bugs.llvm.org/show_bug.cgi?id=37426

Vectorization goes overboard because the throughput cost model used by the
vectorizers doesn't match the 6 IR instructions that correspond to 1 x86
rotate instruction. Instead, we have:

$ opt 37426prevectorize.ll -S -cost-model -analyze
...
Cost Model: Found an estimated cost of 1 for instruction:   %and = and i32
%cond, 31
Cost Model: Found an estimated cost of 1 for instruction:   %shl = shl i32
%1, %and
Cost Model: Found an estimated cost of 1 for instruction:   %sub = sub nsw
i32 0, %cond
Cost Model: Found an estimated cost of 1 for instruction:   %and5 = and i32
%sub, 31
Cost Model: Found an estimated cost of 1 for instruction:   %shr = lshr i32
%1, %and5
Cost Model: Found an estimated cost of 1 for instruction:   %or = or i32
%shl, %shr

The broken cost model also affects unrolling and inlining. Size costs are
overestimated for a target that has a rotate instruction.
This cost problem isn't limited to rotate patterns (it's come up in the
context of min/max/abs/fma too). But it would be simpler if we had a rotate
intrinsic, and the 6-to-1 margin is the biggest I've seen.

Another potential benefit of a generic intrinsic is that it can replace
target-specific intrinsics. PowerPC and x86 have those. ARM translates
source-level target-specific intrinsics into regular IR, so that could use
the intrinsic too.

On Mon, May 14, 2018 at 2:53 AM, Fabian Giesen via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hi everyone!
>
> I recently ran into some interesting issues with generation of rotate
> instructions - the details are in the bug tracker (
> https://bugs.llvm.org/show_bug.cgi?id=37387 and related bugs) for those
> interested - and it brought up the issue of rotates in the IR again. Now
> this is a proposal that has been made (and been rejected) several times,
> but I've been told that this time round we ran into issues not
previously
> considered, and that I should bring the topic up on llvm-dev.
>
> I'll try to summarize the status quo first and then bring up the tricky
> cases that might tip the balance in favor of rotate intrinsics.
>
> Rotates
> ======>
> Rotates, or circular shifts, are bit shifts where bits "dropped
out" on
> one side of the operand are shifted back in on the other.
>
> They occur less frequently in real-world code than regular (non-circular)
> bit shifts, but are nevertheless directly supported in most popular ISAs.
> They are useful primitives for cryptography, non-cryptographic hashing,
> pseudo-random number generation, bit I/O, and probably other cases as well.
>
> They do not have a direct representation in C/C++, but several compilers
> support them via intrinsics (including MSVC, which Clang-CL tries to be
> compatible to), and some languages such as Rust include them explicitly.
>
> The status quo
> =============>
> Rotates by compile-time constant amounts are fairly easy to express in
> C-like languages by ORing together two opposite-direction shifts. For
> example, rotating a uint32_t  x left by 3 positions in C boils down to
>
>   (x << 3) | (x >> 29)
>
> which is simple enough. The backends with rotate support detect this type
> of pattern at the DAG stage and map it to a rotate instruction.
>
> This, in a nutshell, is the argument against a direct rotate in the IR:
> the existing primitives are sufficient to model it, and doing so allows
> LLVM to leverage transforms and analyses such as SimplifyDemandedBits that
> know how to deal with the more common regular shifts to work on bit rotate
> expressions as well. Adding a new primitive would mean that a whole bunch
> of optimization passes would need to know how to deal with it (or else they
> might lose on optimization opportunities).
>
> Variable-distance rotates
> ========================>
> Things become murkier when we consider not just rotates by a compile-time
> constant, but also rotates where the amount is determined at run time.
>
> For one thing, it's not immediately obvious how to write a well-defined
> corresponding C expression: the most "natural" version (for a
32-bit left
> rotate) is
>
>   (x << k) | (x >> (32 - k))
>
> but this is undefined for k==0 due to the right-shift of a 32-bit value by
> 32. That said, both GCC and Clang will generally compile this particular
> expression into a rotate-left as of this writing, but having UB is
> unfortunate. A trickier variant that is always well-defined is
>
>   (x << (k & 31)) | (x >> (-k & 31))
>
> but this is a substantially more complicated sub-dag  than the OR of two
> shifts you would see for a compile-time constant rotate distance, and there
> are more opportunities for things to go wrong along the way. Bug 37387
> contains a few examples, the simplest of which was given by Sanjay:
>
> void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
>   for (unsigned i = 0; i < N; ++i)
>     x[ (a[i] >> b) | (a[i] << (32 - b)) ] = i; // shift amt is
loop
> invariant
> }
>
> "32 - b" is loop-invariant and gets hoisted, and what's left
of the
> expression doesn't match a known "rotate" pattern (since the
DAG can't see
> across BB boundaries at the moment), so instead we get two shifts and an OR
> inside the loop.
>
> The counterargument here was that GlobalISel should eventually fix this
> particular problem, but nevertheless, as of today, LICM breaks recognition
> of the rotate pattern in this (very simple) loop, and there is reason to
> believe that other transforms (existing or forthcoming) might well break
> the pattern as well; the issues referenced in the linked bug illustrate
> some of the existing failure modes.
>
> This situation is frustrating for users of variable-distance rotates;
> right now, there simply is no way to write code in a source language that
> will reliably turn into rotates; you either have to watch the generated
> code closely, or just throw up your hands and use platform-dependent inline
> assembly. (It is assumed that the code in question is in a hot spot.)
>
> The proposal
> ===========>
> The idea would be to add rotate intrinsics, primarily intended to be used
> for the variable-distance case, which is otherwise fragile.
>
> Frontends can emit them directly (for source language-level intrinsics, or
> when matched from "known good" rotate patterns) and the backends
can match
> them and select appropriate instructions.
>
> For the constant-distance case, the preferred form would still be the
> OR-of-shifts variant, which plays nice with SimplifyDemandedBits and is
> known to existing optimization passes.
>
> SDB does not currently do anything with shifts by non-constant amounts
> anyway; and, for the use cases in question, a rotate that is not touched by
> other optimization passes is still preferable to a more complex expression
> that can be partially optimized, but then ends up in a form that the
> backend doesn't know how to turn back into a rotate, which is a net
loss.
>
> Thoughts?
>
> -Fabian
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180515/3dc6ca1b/attachment.html>

David Zarzycki via llvm-dev

2018-May-16 14:50 UTC

head link

[llvm-dev] Rotates, once again

More food for thought:

1) Like bitwise rotate, Intel’s bitwise PDEP and PEXT instructions are
expressible via mutliple pure C bitwise operations. Nevertheless, clang/LLVM has
intrinsics for these bitwise instructions.
2) If we imagine an alternate universe where C had rotate operators from the
beginning, then LLVM would probably have rotate intrinsics too, and we’d be
discussing whether the LLVM rotate intrinsics were worth keeping or not given
that clang could just emit a series of simpler bitwise operations. I’d imagine
the examples below would be why LLVM would keep the rotate intrinsics (in this
alternate universe).

Dave

> On May 15, 2018, at 6:34 PM, Sanjay Patel via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Thanks for writing this up. I'd like to have this intrinsic too.
> 
> Another argument for having the intrinsic is shown in PR37426:
> https://bugs.llvm.org/show_bug.cgi?id=37426
<https://bugs.llvm.org/show_bug.cgi?id=37426>
> 
> Vectorization goes overboard because the throughput cost model used by the
vectorizers doesn't match the 6 IR instructions that correspond to 1 x86
rotate instruction. Instead, we have:
> 
> $ opt 37426prevectorize.ll -S -cost-model -analyze
> ...
> Cost Model: Found an estimated cost of 1 for instruction:   %and = and i32
%cond, 31
> Cost Model: Found an estimated cost of 1 for instruction:   %shl = shl i32
%1, %and
> Cost Model: Found an estimated cost of 1 for instruction:   %sub = sub nsw
i32 0, %cond
> Cost Model: Found an estimated cost of 1 for instruction:   %and5 = and i32
%sub, 31
> Cost Model: Found an estimated cost of 1 for instruction:   %shr = lshr i32
%1, %and5
> Cost Model: Found an estimated cost of 1 for instruction:   %or = or i32
%shl, %shr
> 
> The broken cost model also affects unrolling and inlining. Size costs are
overestimated for a target that has a rotate instruction.
> This cost problem isn't limited to rotate patterns (it's come up in
the context of min/max/abs/fma too). But it would be simpler if we had a rotate
intrinsic, and the 6-to-1 margin is the biggest I've seen.
> 
> Another potential benefit of a generic intrinsic is that it can replace
target-specific intrinsics. PowerPC and x86 have those. ARM translates
source-level target-specific intrinsics into regular IR, so that could use the
intrinsic too.
> 
> On Mon, May 14, 2018 at 2:53 AM, Fabian Giesen via llvm-dev <llvm-dev at
lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
> Hi everyone!
> 
> I recently ran into some interesting issues with generation of rotate
instructions - the details are in the bug tracker
(https://bugs.llvm.org/show_bug.cgi?id=37387
<https://bugs.llvm.org/show_bug.cgi?id=37387> and related bugs) for those
interested - and it brought up the issue of rotates in the IR again. Now this is
a proposal that has been made (and been rejected) several times, but I've
been told that this time round we ran into issues not previously considered, and
that I should bring the topic up on llvm-dev.
> 
> I'll try to summarize the status quo first and then bring up the tricky
cases that might tip the balance in favor of rotate intrinsics.
> 
> Rotates
> ======> 
> Rotates, or circular shifts, are bit shifts where bits "dropped
out" on one side of the operand are shifted back in on the other.
> 
> They occur less frequently in real-world code than regular (non-circular)
bit shifts, but are nevertheless directly supported in most popular ISAs. They
are useful primitives for cryptography, non-cryptographic hashing, pseudo-random
number generation, bit I/O, and probably other cases as well.
> 
> They do not have a direct representation in C/C++, but several compilers
support them via intrinsics (including MSVC, which Clang-CL tries to be
compatible to), and some languages such as Rust include them explicitly.
> 
> The status quo
> =============> 
> Rotates by compile-time constant amounts are fairly easy to express in
C-like languages by ORing together two opposite-direction shifts. For example,
rotating a uint32_t  x left by 3 positions in C boils down to
> 
>   (x << 3) | (x >> 29)
> 
> which is simple enough. The backends with rotate support detect this type
of pattern at the DAG stage and map it to a rotate instruction.
> 
> This, in a nutshell, is the argument against a direct rotate in the IR: the
existing primitives are sufficient to model it, and doing so allows LLVM to
leverage transforms and analyses such as SimplifyDemandedBits that know how to
deal with the more common regular shifts to work on bit rotate expressions as
well. Adding a new primitive would mean that a whole bunch of optimization
passes would need to know how to deal with it (or else they might lose on
optimization opportunities).
> 
> Variable-distance rotates
> ========================> 
> Things become murkier when we consider not just rotates by a compile-time
constant, but also rotates where the amount is determined at run time.
> 
> For one thing, it's not immediately obvious how to write a well-defined
corresponding C expression: the most "natural" version (for a 32-bit
left rotate) is
> 
>   (x << k) | (x >> (32 - k))
> 
> but this is undefined for k==0 due to the right-shift of a 32-bit value by
32. That said, both GCC and Clang will generally compile this particular
expression into a rotate-left as of this writing, but having UB is unfortunate.
A trickier variant that is always well-defined is
> 
>   (x << (k & 31)) | (x >> (-k & 31))
> 
> but this is a substantially more complicated sub-dag  than the OR of two
shifts you would see for a compile-time constant rotate distance, and there are
more opportunities for things to go wrong along the way. Bug 37387 contains a
few examples, the simplest of which was given by Sanjay:
> 
> void rotateInLoop(unsigned *x, unsigned N, unsigned *a, int b) {
>   for (unsigned i = 0; i < N; ++i)
>     x[ (a[i] >> b) | (a[i] << (32 - b)) ] = i; // shift amt is
loop invariant
> }
> 
> "32 - b" is loop-invariant and gets hoisted, and what's left
of the expression doesn't match a known "rotate" pattern (since
the DAG can't see across BB boundaries at the moment), so instead we get two
shifts and an OR inside the loop.
> 
> The counterargument here was that GlobalISel should eventually fix this
particular problem, but nevertheless, as of today, LICM breaks recognition of
the rotate pattern in this (very simple) loop, and there is reason to believe
that other transforms (existing or forthcoming) might well break the pattern as
well; the issues referenced in the linked bug illustrate some of the existing
failure modes.
> 
> This situation is frustrating for users of variable-distance rotates; right
now, there simply is no way to write code in a source language that will
reliably turn into rotates; you either have to watch the generated code closely,
or just throw up your hands and use platform-dependent inline assembly. (It is
assumed that the code in question is in a hot spot.)
> 
> The proposal
> ===========> 
> The idea would be to add rotate intrinsics, primarily intended to be used
for the variable-distance case, which is otherwise fragile.
> 
> Frontends can emit them directly (for source language-level intrinsics, or
when matched from "known good" rotate patterns) and the backends can
match them and select appropriate instructions.
> 
> For the constant-distance case, the preferred form would still be the
OR-of-shifts variant, which plays nice with SimplifyDemandedBits and is known to
existing optimization passes.
> 
> SDB does not currently do anything with shifts by non-constant amounts
anyway; and, for the use cases in question, a rotate that is not touched by
other optimization passes is still preferable to a more complex expression that
can be partially optimized, but then ends up in a form that the backend
doesn't know how to turn back into a rotate, which is a net loss.
> 
> Thoughts?
> 
> -Fabian
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20180516/aba1f586/attachment.html>

Krzysztof Parzyszek via llvm-dev

2018-May-16 15:58 UTC

head link

[llvm-dev] Rotates, once again

On 5/15/2018 5:34 PM, Sanjay Patel via llvm-dev wrote:> Thanks for writing this up. I'd like to have this intrinsic too.
I'd vote for the "valign" variant that David proposed. It becomes
a
rotate when both inputs are the same.

<ty> %result = @llvm.valign.<ty>(<ty> %a0, <ty> %a1, i32
s)
result = truncate (concat(a1,a0) >> s)

Where "concat" concatenates the bits of both inputs to form a value of
a
twice as long type, and "truncate" takes the lower half of the bits of
its input.

-Krzysztof

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
hosted by The Linux Foundation

Manuel Jacob via llvm-dev

2018-May-16 17:27 UTC

head link

[llvm-dev] Rotates, once again

On 2018-05-16 00:34, Sanjay Patel via llvm-dev wrote:> Vectorization goes overboard because the throughput cost model used by 
> the
> vectorizers doesn't match the 6 IR instructions that correspond to 1 
> x86
> rotate instruction. Instead, we have:
> 
> [...]
> 
> The broken cost model also affects unrolling and inlining. Size costs 
> are
> overestimated for a target that has a rotate instruction.
> This cost problem isn't limited to rotate patterns (it's come up in
the
> context of min/max/abs/fma too). But it would be simpler if we had a 
> rotate
> intrinsic, and the 6-to-1 margin is the biggest I've seen.
Given that this is a general problem that occurs with other instruction 
sequences as well, wouldn't it make more sense to make the cost model 
handle more than one instruction, as suggested in PR31274 [1]?

[1] https://bugs.llvm.org/show_bug.cgi?id=31274

In all these cases (rotate, min, max, abs, fma, add-with-overflow, and 
probably many more) there's a tradeoff between elaborating them as 
simpler IR instructions and modelling them as its own instruction / 
intrinsic.  A big disadvantage of introducing new instructions / 
intrinsics is that all optimizations have to be told about this, 
increasing the compiler code base and maintainability burden.  On the 
other hand, too few instructions can make optimization difficult as well 
(in theory, one instruction like "subtract and branch if not equal to 
zero" could emulate all the others, but this wouldn't be very helpful 
for optimization).  Since you put a lot of thought into how to 
canonicalize IR, can you eleborate more on this tradeoff?  Can we find a 
set of criteria to determine whether an instruction pattern should get 
an intrinsic or not?

-Manuel

Maybe Matching Threads

Search for more reasonably related threads

llvm dev - May 2018 - Rotates, once again

[llvm-dev] Rotates, once again

[llvm-dev] Rotates, once again

[llvm-dev] Rotates, once again

[llvm-dev] Rotates, once again

[llvm-dev] Rotates, once again

[llvm-dev] Rotates, once again

Maybe Matching Threads