Hi, Renato.

> So I vote "it depends". :)

My preference is to let the vectorizer emit one kind of "reduce vector into scalar" instead of letting the vectorizer choose one of many different ways. I'm perfectly fine with @llvm.reduce.op.typein.typeout.ordered?(%vector) being that "one kind of reduce vector into scalar".

I think we are converging enough at the detail level, but have a big difference of opinion at the "vision" level. :)

Thanks,
Hideki

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Wednesday, February 01, 2017 12:52 PM
To: Saito, Hideki <hideki.saito at intel.com>
Cc: llvm-dev at lists.llvm.org; Amara Emerson <amara.emerson at gmail.com>; Chandler Carruth <chandlerc at gmail.com>; Philip Reames <listmail at philipreames.com>; Demikhovsky, Elena <elena.demikhovsky at intel.com>; Simon Pilgrim <llvm-dev at redking.me.uk>; Mehdi Amini <mehdi.amini at apple.com>; Michael Kuperstein <mkuper at google.com>; nd <nd at arm.com>
Subject: Re: RFC: Generic IR reductions

On 1 February 2017 at 20:27, Saito, Hideki <hideki.saito at intel.com> wrote:
> Main problem for SVE: we can't write a straight-line IR instruction sequence for the reduction last-value
> compute without knowing the number of elements in the vector to start from.

Hi Saito,

Indeed, I think there's at least consensus that, for scalable vectors, there's no way around it.

> For non-scalable vectors, a series of "reduce to half" steps (or any other straight-line IR instruction sequence) is
> functionally correct, but certainly not ideal (it's ugly, the optimal sequence is target dependent, and it artificially
> inflates the outgoing IR for any optimizer not interested in the reduction last-value compute, which translates to
> longer compile time, etc.)

I agree it's not pretty, but IR doesn't have to be pretty. IR is a compiler intermediate language, not a user-facing language. So, if the representation is accurate and the compile-time costs are within noise, there is no reason to change.

Now, you touched on one subject that is worth considering: binary reduction is target specific. I don't know AVX, AltiVec and the others too well, but I assumed there are only two reduction strategies: ordered and binary. The first exists because of precision, and the second because it's the most logical generic step: you only need one instruction with multiple register sizes to do any kind of reduction, which keeps the ISA simple.

So if there are non-ordered / non-binary reductions, the representation is indeed target specific, and that's a strong point against it. But this only makes the case for non-ordered / non-binary reductions to gain a builtin, not for moving everything into a builtin.

> So, it's not like we don't have reasons for asking to change the status quo.

I apologise if that's what got across. I'm not trying to shut anything down; I'm just reiterating the reasons I have received (and given) in the past with regard to IR changes. We have done this for NEON, strided access, etc., and it has worked well in general, exposing optimisations that we wouldn't have gotten with a builtin.

> Here's what I'd like to see at the end of this discussion:
> a nice and concise representation for reducing a vector into a scalar value.
> An IR instruction to do so would be ideal, but I understand that the hurdle for that is very high.
> I'm okay with an intrinsic function call, and I heard that's a reasonable step towards an instruction.

Yes.
If we have an intrinsic that can be built in a generic way, that's half-way towards an instruction. That's what I was trying to do in my first reply, when I suggested the @llvm.reduce() builtin. But it seems impractical to have a single one. It will most likely be a family: @llvm.reduce.op.typein.typeout.ordered?(%vector), with "op" being add/sub/max/min/etc., "typein" the vector type, "typeout" if different for widening/narrowing, and "ordered" if FP precision is important. It will be harder to turn this into an instruction (mainly because of the op), but it can be done; it just needs time and community support.

> Let's say someone comes up with a 1024-bit vector working on char data. Nobody is really happy to see
> a sequence of "reduce to half" for 128 elements. Today, with AVX512BW, we already have the problem
> at half that size (only a few instructions less). I don't think anything that is proportional to log(#elems)
> is "nice and concise".

I agree it would be ugly (the sketch after this message shows the shape of that pattern), but I want to make sure we're clear that this is mostly irrelevant unless it impacts compile time or makes pattern matching unreasonably complex. I would expect larger vectors to have longer patterns, but I wouldn't oppose an intrinsic for 1024-bit vector reductions. :)

> Such a representation is also useful inside a vectorized loop if the programmer wants a bitwise-identical
> FP reduction value (obviously at much slower speed), without losing the vectorization of the rest of
> the loop body. So, this isn't just "outside the vectorized loop" stuff.

I'm not sure what you mean by this. Do you mean that @llvm.reduce.add.ordered(...) would still be useful scalarised? We can get the same ordered behaviour by introducing instruction dependencies (%sum = %a + %b -> %sum = %sum + %c, etc.). Also, we'd have to teach the middle end or all back-ends to lower that intrinsic, even if the target has no concept of reduction in its ISA, in case it does appear from some other pass. Wouldn't that be risky?

> Now, can we focus on the value of a "nice and concise representation"? If the community consensus
> is "yes", we can then talk about how to deliver that vision: one IR instruction, one intrinsic call, a small
> number of IR instructions (I don't know how, but someone might have brilliant ideas), etc., and then dig
> deeper (e.g., an IR instruction for each reduction operator? is the operator specified as an operand or as
> part of the intrinsic name? etc.). If the community consensus is "no", we can stop talking about
> details right away.

As I said, I don't see why we need a concise representation at all, unless it impacts compile time or pattern-matching complexity (maintainability), or it's the only way (e.g. SVE).

So I vote "it depends". :)

cheers,
--renato
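For readers skimming the thread, here is roughly what the two representations being compared look like for a fast (unordered) add reduction of an <8 x float> value. This is only a sketch: the shuffle masks follow the usual "reduce to half" pattern, and the intrinsic name and signature are illustrative of the proposed family, not a finalised spelling.

  ; Current "reduce to half" form: log2(8) = 3 shuffle+fadd pairs, then an extract of lane 0.
  %h1  = shufflevector <8 x float> %v, <8 x float> undef,
           <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  %s1  = fadd fast <8 x float> %v, %h1
  %h2  = shufflevector <8 x float> %s1, <8 x float> undef,
           <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %s2  = fadd fast <8 x float> %s1, %h2
  %h3  = shufflevector <8 x float> %s2, <8 x float> undef,
           <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
  %s3  = fadd fast <8 x float> %s2, %h3
  %sum = extractelement <8 x float> %s3, i32 0

  ; Proposed intrinsic form: the same value as a single call (hypothetical name).
  %sum = call float @llvm.reduce.fadd.v8f32.f32(<8 x float> %v)

Both encodings mean a reassociable ("fast") reduction; the ordered variant discussed above would instead chain the adds in source order, element by element.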
On 1 February 2017 at 21:22, Saito, Hideki <hideki.saito at intel.com> wrote:
> I think we are converging enough at the detail level, but having a big
> difference in the opinions at the "vision" level. :)

Vision is always harder than engineering. :)

So, I have copied a lot more people now, to make sure we get a wider audience. I want to give them a chance to chime in early enough. Folks, feel free to copy more people if you think they need to have an opinion.

Here's a summary:

Problem #1: Reductions are common, but:
* the code generated is large (log(lanes) operations plus extracts) and will get "too large" with 1024/2048-bit vectors
* this may cause compile-time slowdowns (we're not sure, we need data) and increases the complexity of the pattern-matching descriptions in the back-end
* the semantics can be target specific (ordered / non-ordered / non-binary reduction), though I don't know of any target that is not ordered/binary
* constant propagation can hide the semantics half-way through, and the back-end loses the ability to match

Problem #2: As we move to scalable vectors (of unknown length, like SVE), we:
* can't generate binary extracts + ops (to mean unordered, aka fast), as we don't know the number of lanes
* will have to represent ordered reduction as a loop based on the (unknown at compile time) "compile-time constant" that represents the vector length

So, Amara's proposal converged on solving all those problems with a new intrinsic similar to:

  %red = type-out @llvm.reduce.op.vec-type.type-out.ordered(<vec-ty> %vector)

With:
* "op" one of { add, sub, mul, max, min, etc. } x { '', 'f' } for integer / float
* vec-ty the input type, which can be a scalable type in the future (<n x N x type>)
* type-out not necessarily equal to the lane type in vec-ty, to allow widening/shortening or FP-to-int conversions, if supported
* "ordered" some flag meaning FP precision is a concern (it could equally be "fast", meaning the opposite; that doesn't matter here)

The solution simplifies the loop vectoriser a lot, as well as the SLP vectoriser, and, as Amara said, could even be used by front-ends or other passes when the semantics are guaranteed (Haskell?).

However, it also incurs the same issue we had with the other vector intrinsics: they stiffen the IR, making other optimisations discard information, avoid valid transformations, or even produce plain bad code-gen.

So far, on the specific topic of reductions, the only case I could think of where we'd miss optimisations is when we have two compatible reductions and we want to merge them. For example, NEON can horizontally reduce one or two vectors at the same time, or we could have three strided reductions and then combine them all into one after inlining. What stops us from doing so with intrinsics is just the knowledge, so we trade complexity in the back-end to match long patterns for complexity in optimisation passes that have to know about the new intrinsics.

The arguments were:

Pro intrinsic:
* Simpler and shorter IR
* Easier to read, easier for the vectorisers to generate
* Easier for the back-end to match
* Easier transition towards scalable vectors in the future

Cons:
* The already-existing patterns work and can expose further opportunities
* We may have to change more middle-end passes / back-ends than would otherwise be affected

I really have no strong opinion either way. Both are challenging in their own right, and this is worth doing anyway.

Did I miss something?

cheers,
--renato
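To make the family above concrete, here are a few hypothetical instantiations. The names, type mangling, and the scalable-vector syntax are all illustrative only, following the <n x N x type> notation used in the summary rather than any finalised spelling.

  %s1 = call i32   @llvm.reduce.add.v4i32.i32(<4 x i32> %v)              ; integer add
  %s2 = call float @llvm.reduce.fmax.v8f32.f32(<8 x float> %v)           ; FP max, unordered
  %s3 = call float @llvm.reduce.fadd.ordered.v8f32.f32(<8 x float> %v)   ; strict FP order
  %s4 = call i64   @llvm.reduce.add.v16i8.i64(<16 x i8> %v)              ; widening: type-out wider than the lane type
  %s5 = call i32   @llvm.reduce.add.nxv4i32.i32(<n x 4 x i32> %v)        ; scalable (SVE) vector, lane count unknown

The last form is the one that cannot be expressed as a straight-line shuffle tree at all, since the lane count is not a compile-time constant.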
Thanks for the summary, some more comments inline.

On 1 February 2017 at 22:02, Renato Golin <renato.golin at linaro.org> wrote:
> On 1 February 2017 at 21:22, Saito, Hideki <hideki.saito at intel.com> wrote:
>> I think we are converging enough at the detail level, but having a big
>> difference in the opinions at the "vision" level. :)
>
> Vision is always harder than engineering. :)
>
> So, I have copied a lot more people now, to make sure we get a wider
> audience. I want to give them a chance to chime in early enough.
> Folks, feel free to copy more people if you think they need to have an
> opinion.
>
> Here's a summary:
>
> Problem #1: Reductions are common, but:
> * the code generated is large (log(lanes) operations plus extracts) and
> will get "too large" with 1024/2048-bit vectors
> * this may cause compile-time slowdowns (we're not sure, we need data)
> and increases the complexity of the pattern-matching descriptions in
> the back-end
> * the semantics can be target specific (ordered / non-ordered /
> non-binary reduction), though I don't know of any target that is not
> ordered/binary
> * constant propagation can hide the semantics half-way through, and
> the back-end loses the ability to match

Unless I'm mistaken, I think the constant-propagation issue was with the proposal to split the reduction into two stages, where the reduce step depends on the instruction used to produce the input value. This is unrelated to the current shuffle representation.

> Problem #2: As we move to scalable vectors (of unknown length, like SVE), we:
> * can't generate binary extracts + ops (to mean unordered, aka fast),
> as we don't know the number of lanes
> * will have to represent ordered reduction as a loop based on the
> (unknown at compile time) "compile-time constant" that represents the
> vector length
>
> So, Amara's proposal converged on solving all those problems with a new
> intrinsic similar to:
>
>   %red = type-out @llvm.reduce.op.vec-type.type-out.ordered(<vec-ty> %vector)
>
> With:
> * "op" one of { add, sub, mul, max, min, etc. } x { '', 'f' } for
> integer / float
> * vec-ty the input type, which can be a scalable type in the future
> (<n x N x type>)
> * type-out not necessarily equal to the lane type in vec-ty, to allow
> widening/shortening or FP-to-int conversions, if supported

I didn't include this in my original proposal, although I'm not particularly opposed to it. In the spirit of keeping the IR as simple as possible, the widening/narrowing behaviour doesn't need to be bundled into the reduction. A promotion can be a separate ext of the input, and I think it should be easy to pattern match.

> * "ordered" some flag meaning FP precision is a concern (it could
> equally be "fast", meaning the opposite; that doesn't matter here)
>
> The solution simplifies the loop vectoriser a lot, as well as the SLP
> vectoriser, and, as Amara said, could even be used by front-ends or
> other passes when the semantics are guaranteed (Haskell?).
>
> However, it also incurs the same issue we had with the other vector
> intrinsics: they stiffen the IR, making other optimisations discard
> information, avoid valid transformations, or even produce plain bad
> code-gen.
>
> So far, on the specific topic of reductions, the only case I could
> think of where we'd miss optimisations is when we have two compatible
> reductions and we want to merge them.
> For example, NEON can
> horizontally reduce one or two vectors at the same time, or we could
> have three strided reductions and then combine them all into one after
> inlining.

My implementation of this for AArch64 NEON (I've not looked at AArch32 yet) converts the generic reduce ISD nodes into AArch64-specific ones followed by an extract. It doesn't try to do anything else, so once it gets past the original tree-shuffle matching code, the DAG should be identical.

> What stops us from doing so with intrinsics is just the knowledge, so
> we trade complexity in the back-end to match long patterns for
> complexity in optimisation passes that have to know about the new
> intrinsics.
>
> The arguments were:
>
> Pro intrinsic:
> * Simpler and shorter IR
> * Easier to read, easier for the vectorisers to generate
> * Easier for the back-end to match
> * Easier transition towards scalable vectors in the future

I also want to reiterate that reductions are useful for testing bits of predicate vectors, which can be applicable to targets other than SVE for things like early-exit vectorization. Simon mentioned this to me off-list. Simon, could you comment here on whether this proposal would work for your needs? (There is a small sketch of both points below.)

> Cons:
> * The already-existing patterns work and can expose further opportunities
> * We may have to change more middle-end passes / back-ends than would
> otherwise be affected
>
> I really have no strong opinion either way. Both are challenging in
> their own right, and this is worth doing anyway.
>
> Did I miss something?
>
> cheers,
> --renato
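Two of the points above lend themselves to small sketches, again with illustrative intrinsic names only: a promotion kept outside the reduction, as suggested above, and an "any lane set" test on a predicate vector expressed as an or-reduction.

  ; Widening kept separate from the reduction: ext first, then reduce.
  %wide = zext <16 x i8> %bytes to <16 x i32>
  %sum  = call i32 @llvm.reduce.add.v16i32.i32(<16 x i32> %wide)

  ; Testing a predicate vector (e.g. for early-exit vectorization):
  ; "did any comparison fire?" as an or-reduction to a single bit.
  %cmp = icmp eq <16 x i32> %data, %needle
  %any = call i1 @llvm.reduce.or.v16i1.i1(<16 x i1> %cmp)
  br i1 %any, label %exit, label %continue

The first pair is the pattern Amara expects the back-end to match instead of a combined widening reduction; the second is the kind of use that is not tied to SVE at all.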