Hi All,

Renato wrote:
> As they say: if it's not broken, don't fix it.
> Let's talk about the reductions that AVX512 and SVE can't handle with IR semantics, but let's not change the current IR semantics for no reason.

The main problem for SVE: we can't write a straight-line IR instruction sequence for the reduction's last-value compute without knowing the number of elements in the vector to start from.

For non-scalable vectors, a series of "reduce to half" steps (or any other straight-line IR instruction sequence) works functionally, but it is certainly not ideal: it's ugly, the optimal sequence is target dependent, and it artificially inflates the outgoing IL for any optimizer not interested in optimizing the reduction's last-value compute, which translates to longer compile times, and so on. So it's not as though we have no reasons for asking to change the status quo.

Here's what I'd like to see at the end of this discussion: a nice and concise representation of reducing a vector into a scalar value. An IR instruction to do so would be ideal, but I understand that the hurdle for that is very high. I'm okay with an intrinsic function call, and I hear that's a reasonable step towards an instruction.

Let's say someone comes up with a 1024-bit vector working on char data. Nobody is really happy to see a sequence of "reduce to half" for 128 elements. Today, with AVX512BW, we already have the problem at half that size (only a few instructions fewer). I don't think anything proportional to LOG(#elems) is "nice and concise".

Such a representation is also useful inside a vectorized loop when the programmer wants a bitwise-identical FP reduction value (obviously at much slower speed) without losing vectorization of the rest of the loop body. So this isn't just "outside the vectorized loop" stuff.

Now, can we focus on the value of a "nice and concise representation"?
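As a concrete illustration of the LOG(#elems) growth described above, here is a small Python model of the "reduce to half" ladder (a sketch, not actual LLVM IR; it assumes a power-of-two element count): each step folds the upper half of the vector onto the lower half, so 128 char elements require seven shuffle-plus-add steps of straight-line code.

```python
def reduce_to_half_add(vec):
    """Model the 'reduce to half' idiom: every step adds the upper half
    of the vector onto the lower half, so reducing n elements takes
    log2(n) straight-line steps (each a shuffle + add in real IR).
    Assumes len(vec) is a power of two."""
    steps = 0
    while len(vec) > 1:
        half = len(vec) // 2
        vec = [vec[i] + vec[i + half] for i in range(half)]
        steps += 1
    return vec[0], steps

# 128 elements, as in the 1024-bit char-vector example:
total, steps = reduce_to_half_add(list(range(128)))  # total == 8128, steps == 7
```

A single reduction intrinsic would replace all seven steps with one call, which is the "nice and concise" representation being argued for.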
If the community consensus is "Yes", we can then talk about how to deliver that vision --- one IR instruction, one intrinsic call, a small number of IR instructions (I don't know how, but someone might have brilliant ideas), etc. --- and then dig deeper into the details (e.g., an IR instruction for each reduction operator? is the operator specified as an operand or as part of the intrinsic name?). If the community consensus is "No", we can stop talking about details right away. Obviously, I'm voting YES on "nice and concise representation".

Thanks,
Hideki Saito
Intel Compilers and Languages

------------------------------

Date: Wed, 1 Feb 2017 15:10:36 +0000
From: Renato Golin via llvm-dev <llvm-dev at lists.llvm.org>
To: Amara Emerson <amara.emerson at gmail.com>
Cc: "llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>, nd <nd at arm.com>
Subject: Re: [llvm-dev] RFC: Generic IR reductions
Message-ID: <CAMSE1keZFe+EGTZ6zu69J+953Y87jnMKKZuHbR-P63sEjE0Ymw at mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

On 1 February 2017 at 14:44, Amara Emerson <amara.emerson at gmail.com> wrote:
> Her point is that the %sum value will no longer be an add node, but simply a constant vector. There is no way to resolve the semantics and have meaningful IR after a simple constant prop. This means that other simple transformations will all have to have special-case logic just to handle reductions, for example LCSSA.

Right.

> Can you give a specific example? Reductions are actually very robust to optimizations breaking the idiom, which is why I was able to replace the reductions with intrinsics in my patches and, with simple matching, generate identical code to before. No other changes were required in the mid-end.

First, this is a demonstration that keeping them as IR is a good thing, not a bad one. We keep the semantics as well as allowing for further introspection in the code block. Examples of introspection are vector widening and instruction fusion after inlining.
Say you have a loop with a function call and a reduction, but that call contains a reduction of its own. If the function gets inlined after you have turned your loop's reduction into a builtin, the optimiser has no visibility into whether the function's reduction pattern can be merged, either by widening the vectors (saving on loads/stores) or by fusing operations (e.g. MLA). We have seen both cases with NEON after using IR instructions for everything but the impossible.

In 2010, I went through a similar discussion with Bob Wilson, who defended the position I'm defending now. And I defend this position today because I have been categorically proven wrong by the results I describe above.

Chandler's arguments are perfectly to the point. Intrinsics are not only necessary when we can't represent things in IR; they're a *good* way of representing odd things. But if things are not odd (i.e. many targets have them), or if we can already represent them in IR, then it stands to reason that adding duplicated stuff only adds complexity. It increases the maintenance cost (more node types to consider), it increases the chance of missing some of them (and either not optimising or generating bad code), it stops optimisers that know nothing about the new node (because it's too new) from doing any inference, and it can actually generate worse code than before (shuffle explosion).

As they say: if it's not broken, don't fix it. Let's talk about the reductions that AVX512 and SVE can't handle with IR semantics, but let's not change the current IR semantics for no reason.

cheers,
--renato
On 1 February 2017 at 20:27, Saito, Hideki <hideki.saito at intel.com> wrote:
> Main problem for SVE: We can't write straight-line IR instruction sequence for reduction last value compute, without knowing #elements in vector to start from.

Hi Saito,

Indeed, I think there's at least consensus that, for scalable vectors, there's no other way around it.

> For non-scalable vector, series of "reduce to half" (or any other straight-line IR instruction sequence) is functionally working, but certainly not ideal (ugly, optimal sequence is target dependent, artificially inflating the outgoing IL for any optimizers not interested in optimizing reduction last value compute, that translates to longer compile time, etc.)

I agree it's not pretty, but IR doesn't have to be pretty. IR is a compiler intermediate language, not a user-facing language. So, if the representation is accurate and the compile-time costs are within noise, there is no reason to change.

Now, you touched on one subject that is worth considering: binary reduction is target specific. I don't know AVX, AltiVec and others too well, but I assumed there are only two types of reduction strategy: ordered and binary. The first exists because of precision; the second because it's the most logical generic step: you only need one instruction with multiple register sizes to do any kind of reduction, and that reduces the complexity of the ISA.

So if there are non-ordered / non-binary reductions, the representation is indeed target specific, and that's a very negative point for it. But this only makes the case for non-ordered / non-binary reductions to gain a builtin, not for moving everything into a builtin.

> So, it's not like we don't have reasons for asking to change the status quo.

I apologise if that's what got across. I'm not trying to shut anything down; I'm just reiterating the reasons that I have received (and given) in the past with regard to IR changes. We have done this for NEON, stride access, etc.,
and it has worked well in general, with exposed optimisations that we wouldn't have gotten from a builtin.

> Here's what I'd like to see at the end of this discussion.
> Nice and concise representation of reducing a vector into a scalar value.
> An IR instruction to do so is ideal, but I understand that the hurdle for that is very high.
> I'm okay with an intrinsic function call, and I heard that's a reasonable step to get to instruction.

Yes. If we have an intrinsic that can be built in a generic way, that's half-way towards an instruction. That's what I was trying to do in my first reply, when I suggested the @llvm.reduce() builtin. But it seems impractical to have a single one. It'll most likely be a family: @llvm.reduce.op.typein.typeout.ordered?(%vector), with "op" being add/sub/max/min/etc., "typein" the vector type, "typeout" if different for widening/narrowing, and "ordered" if FP precision is important. It'll be harder to transform this into an instruction (mainly because of the op), but it can be done; it just needs time and community support.

> Let's say someone comes up with 1024bit vector working on char data. Nobody is really happy to see a sequence of "reduce to half" for 128 elements. Today, with AVX512BW, we already have the problem of half that size (only a few instructions less). I don't think anything that is proportional to "LOG(#elems)" is "nice and concise".

I agree it would be ugly, but I want to make sure we're clear that this is mostly irrelevant, unless it impacts compile time or pushes the complexity of pattern matching beyond what is reasonable. I would expect larger vectors to have longer patterns, but I wouldn't oppose an intrinsic for 1024-bit vector reduction. :)

> Such a representation is also useful inside of vectorized loop if the programmer wants bitwise identical FP reduction value (obviously at the much slower speed), without losing the vectorization of rest of the loop body.
> So, this isn't just "outside the vectorized loop" stuff.

I'm not sure what you mean by this. Do you mean that @llvm.reduce.add.ordered(...) would still be useful scalarised? We can get the same ordered behaviour by introducing instruction dependencies (%sum = %a + %b -> %sum = %sum + %c, etc.). Also, we'd have to teach the middle end or all back-ends to lower that intrinsic, even if the target has no concept of reduction in its ISA, in case it does appear from some other pass. Wouldn't that be risky?

> Now, can we focus on the value of "nice and concise representation"? If the community consensus is "Yes", we can then talk about how to deliver that vision --- one IR instruction, one intrinsic call, a small number of IR instructions (I don't know how, but someone might have brilliant ideas), etc., and further dig deeper (e.g., IR instruction for each reduction operator?, whether the operator specification is part of operand or part of intrinsic name?, etc). If the community consensus is "No", we can stop talking about details right away.

As I said, I don't see why we need a concise representation at all, unless it impacts compile time or pattern-match complexity (maintainability), or it's the only way (e.g. SVE). So I vote "it depends". :)

cheers,
--renato
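The intrinsic family and the ordered-vs-binary distinction discussed in the emails above can be sketched together. The following Python model is purely illustrative (the function name and flag are hypothetical, not actual LLVM intrinsics): a generic reduce parameterised by "op", where ordered=True evaluates strictly left to right (the %sum = %sum + %c dependency chain), while the default may reassociate, modelled here as pairwise halving. For FP inputs the two are not bitwise identical, which is exactly why the "ordered" variant exists.

```python
import operator
from functools import reduce

# Hypothetical model of an @llvm.reduce.<op>[.ordered] family.
# These names do not correspond to actual LLVM intrinsics.
OPS = {"add": operator.add, "mul": operator.mul, "max": max, "min": min}

def reduce_vector(op, vec, ordered=False):
    fn = OPS[op]
    if ordered:
        # Strict left-to-right chain (%sum = %sum + %c), preserving the
        # scalar loop's FP rounding exactly.
        return reduce(fn, vec)
    # Unordered: the target may use any association; model it as
    # pairwise halving (carrying over a leftover odd element).
    while len(vec) > 1:
        half = len(vec) // 2
        tail = [vec[-1]] if len(vec) % 2 else []
        vec = [fn(vec[i], vec[i + half]) for i in range(half)] + tail
    return vec[0]

vals = [1.0, 1e16, 1.0, -1e16]
ordered_result = reduce_vector("add", vals, ordered=True)  # 0.0
pairwise_result = reduce_vector("add", vals)               # 2.0
```

For integer ops the association is irrelevant, which is why an unordered form is always safe there; the ordered flag only matters for FP (and other non-associative) reductions, matching the "bitwise identical FP reduction value" requirement quoted above.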
Hi, Renato.

> So I vote "it depends". :)

My preference is for the vectorizer to emit one kind of "reduce vector into scalar" rather than choosing among many different ways. I'm perfectly fine with @llvm.reduce.op.typein.typeout.ordered?(%vector) being that one kind of "reduce vector into scalar". I think we are converging well enough at the detail level, but have a big difference of opinion at the "vision" level. :)

Thanks,
Hideki

-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org]
Sent: Wednesday, February 01, 2017 12:52 PM
To: Saito, Hideki <hideki.saito at intel.com>
Cc: llvm-dev at lists.llvm.org; Amara Emerson <amara.emerson at gmail.com>; Chandler Carruth <chandlerc at gmail.com>; Philip Reames <listmail at philipreames.com>; Demikhovsky, Elena <elena.demikhovsky at intel.com>; Simon Pilgrim <llvm-dev at redking.me.uk>; Mehdi Amini <mehdi.amini at apple.com>; Michael Kuperstein <mkuper at google.com>; nd <nd at arm.com>
Subject: Re: RFC: Generic IR reductions