On Wed, May 16, 2018 at 11:27 AM, Manuel Jacob <me at manueljacob.de>
wrote:
> On 2018-05-16 00:34, Sanjay Patel via llvm-dev wrote:
>
>> Vectorization goes overboard because the throughput cost model used by
>> the vectorizers doesn't match the 6 IR instructions that correspond to
>> 1 x86 rotate instruction. Instead, we have:
>>
>> [...]
>>
>> The broken cost model also affects unrolling and inlining. Size costs
>> are overestimated for a target that has a rotate instruction.
>> This cost problem isn't limited to rotate patterns (it's come up in the
>> context of min/max/abs/fma too). But it would be simpler if we had a
>> rotate intrinsic, and the 6-to-1 margin is the biggest I've seen.
>>
>
> Given that this is a general problem that occurs with other instruction
> sequences as well, wouldn't it make more sense to make the cost model
> handle more than one instruction, as suggested in PR31274 [1]?
>
> [1] https://bugs.llvm.org/show_bug.cgi?id=31274
>
>
Yes, I think everyone agrees that we need to improve the cost model
interface. Note that there's a proposal under review for the 1 IR -> many
machine ops variant of this question:
https://reviews.llvm.org/D46276
> In all these cases (rotate, min, max, abs, fma, add-with-overflow, and
> probably many more) there's a tradeoff between elaborating them as
> simpler IR instructions and modelling them as their own instruction /
> intrinsic. A big disadvantage of introducing new instructions /
> intrinsics is that all optimizations have to be told about them,
> increasing the compiler code base and the maintenance burden. On the
> other hand, too few instructions can make optimization difficult as
> well (in theory, one instruction like "subtract and branch if not equal
> to zero" could emulate all the others, but this wouldn't be very
> helpful for optimization). Since you put a lot of thought into how to
> canonicalize IR, can you elaborate more on this tradeoff? Can we find a
> set of criteria to determine whether an instruction pattern should get
> an intrinsic or not?
I don't have formal criteria for this. Someone more knowledgeable may give
us an answer.
An informal metric might be: if the operation is supported as a primitive
op or built-in in source languages, and it is supported as a single target
instruction, can we guarantee a 1-to-1 translation through optimization?
We are failing that test in the loop-invariant C example presented here.
We're also failing that test when operands have different sizes. Rotate
goes in; logic and shifts come out.
We failed that test for an even more basic case before this fix:
https://reviews.llvm.org/D46656
We may still be failing the basic case for other source languages/flavors
because encoding this operation in existing IR ops isn't obvious.
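For reference, this is roughly what "rotate goes in" looks like in C. A
minimal sketch assuming the common UB-free idiom (exact spellings vary by
codebase); without an intrinsic, this is carried through IR as shifts, an
or, and shift-amount masking:

#include <stdint.h>

/* Common UB-free rotate-left idiom: masking both shift amounts keeps the
   shift-by-32 case defined when n is 0 or a multiple of 32. The compiler
   has to re-discover that this handful of IR operations is one rotate. */
uint32_t rotl32(uint32_t x, uint32_t n) {
    return (x << (n & 31)) | (x >> (-n & 31));
}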
Another informal measure is: how do programmers express this operation
without a builtin? If they're resorting to inline asm, that's a strong
sign that the compiler needs an intrinsic.
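To illustrate that escape hatch, here is an x86-only sketch using GCC/Clang
extended asm (not a recommendation): the single instruction is guaranteed,
but the operation becomes opaque to constant folding, vectorization, and
every other pass.

#include <stdint.h>

/* x86-only; the rotate amount must be in %cl. */
static inline uint32_t rotl32_asm(uint32_t x, uint32_t n) {
    __asm__("roll %%cl, %0" : "+r"(x) : "c"(n) : "cc");
    return x;
}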
Another metric could be: what is the probability that the optimizer can
improve the codegen if the operation is broken down into multiple IR
instructions?
As noted here, that's unlikely for rotate. Where it is possible, adding the
corresponding folds to instcombine for the intrinsic isn't hard. Are any
other passes affected?
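To make the instcombine point concrete, here are a few algebraic identities
such folds could rely on. A hypothetical sketch in C (demonstrated with the
idiom above), not actual instcombine code:

#include <assert.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, uint32_t n) {
    return (x << (n & 31)) | (x >> (-n & 31));
}

int main(void) {
    uint32_t x = 0xDEADBEEFu;
    assert(rotl32(x, 0) == x);                        /* rotate by 0 is the identity    */
    assert(rotl32(rotl32(x, 3), 5) == rotl32(x, 8));  /* rotate amounts add (mod width) */
    assert(rotl32(x, 32) == x);                       /* full-width rotate wraps to x   */
    return 0;
}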
For reference, these are the current target-independent bit-manipulation
intrinsics - bswap, bitreverse, ctpop, ctlz, cttz:
http://llvm.org/docs/LangRef.html#bit-manipulation-intrinsics
The LLVM cost for the proposed rotate intrinsic should be in the same range
as those? Note that we would not just be adding code to support an
intrinsic. There are already ~200 lines of DAG matching code for rotate, so
we already pay a cost for trying to get this pattern optimized. We can
remove that code after adding canonicalization to the intrinsic in IR.