Initial response focused on different potential ways of solving this problem.
I'll reply separately on your proposed GVN patch.

I see a number of ways to tackle this:

* The vectorizer already has to do a forward walk through the function to
compute predicate masks. As we do that walk, we could keep track, for each
address, of the union of the predicate masks of its accesses. In this case,
the union would be all ones, meaning we could use an unconditional access. In
general, we can do a single predicated access using the union of the predicate
masks. Whether explicitly forming the union is worthwhile is a non-trivial
costing question, but I suspect we can handle the obvious cases cheaply.

* Still in the vectorizer, we can do a forward walk and track the accesses
encountered along each path. Any address which is accessed along every path to
the latch can be accessed unconditionally. (This is essentially a restricted
subset of the former, without the generalization to predicate masks.)

* We could extend SimplifyCFG. You mention this, but don't get into *why* we
don't handle the last load. We really should in this case. (Though, after CSE,
shouldn't there be only two conditional loads in your inner loop? Maybe I'm
missing something?)

* You don't mention your target processor, but one question to ask is: is the
cost model for a predicated load reasonable? If not, would reducing it to
match the actual target cost fix your problem? In particular, we have
*multiple* memory accesses with the *same* mask here. Does accounting for that
difference in lowering cost sidestep the issue?

Your in-passing mention that -O3 and unrolling breaks vectorization also
concerns me. It really shouldn't. That sounds like probably an issue in the
SLP vectorizer, and maybe a pass-order issue.

All of the above is simply me brainstorming. :)

Philip

On 9/14/21 7:19 AM, Momchil Velikov via llvm-dev wrote:
> Looking at a particularly hot function in SPEC/x264 that LLVM fails to
> vectorise.
> typedef short int16_t;
> typedef unsigned short uint16_t;
>
> int q(int16_t d[], uint16_t m[], uint16_t b[]) {
>   int n = 0;
>   for (int i = 0; i < 16; i++) {
>     if (d[i] > 0)
>       d[i] = (b[i] + d[i]) * m[i] >> 16;
>     else
>       d[i] = -((b[i] - d[i]) * m[i] >> 16);
>     n |= d[i];
>   }
>   return n;
> }
>
> As it turns out, LLVM adds runtime alias checks for this function and then it
> considers it legal to vectorise with if-conversion. However, the
> vectorisation cost model effectively bans that particular if-conversion by
> assigning a ridiculous cost to emulated masked loads and stores.
>
> Originally, each branch of the if statement in the loop body contains three
> identical loads. Manually hoisting these allows the loop to be vectorised at
> `-O2` (at `-O3` the loop is fully unrolled and that breaks vectorisation).
> There's a subroutine in `SimplifyCFG` that does rudimentary hoisting of
> instructions from the two successors of a block; it does indeed hoist two of
> the three loads, but gives up before hoisting the third one.
>
> We'd need a way to make LLVM hoist all three of the loads by itself.
> `GVNHoist` can do that, but that pass is disabled by default and has been
> disabled for a long, long time.
>
> As an alternative, I was thinking of a simpler hoisting transformation that
> just handles moving instructions from two single-predecessor blocks to their
> common predecessor. That could be made reasonably fast by pairing
> instructions by their GVN value number. Limiting hoisting to a predecessor
> block would (for the most part) avoid excessive increase of lifetimes and
> would also simplify correctness checks.
>
> I've written such a transformation as a subroutine of `GVN`; it seemed like a
> good place for it and is in a similar spirit as the various PREs that GVN
> does. The Phabricator review is at https://reviews.llvm.org/D109760.
> Initial benchmarking on Neoverse N1 looks good (speedup, higher is better):
>
>   500.perlbench_r    1.13%
>   502.gcc_r          0.00%
>   505.mcf_r         -1.89%
>   520.omnetpp_r      0.00%
>   523.xalancbmk_r    0.00%
>   525.x264_r         7.67%
>   531.deepsjeng_r    0.60%
>   541.leela_r        0.24%
>   548.exchange2_r    0.00%
>   557.xz_r           0.75%
>
> Comments?
>
> ~chill

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
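[For reference, the manual hoisting described in the quoted RFC might look
like the sketch below: all three loads (`d[i]`, `b[i]`, `m[i]`) are read
unconditionally before the branch, leaving only the arithmetic conditional.
This is an illustrative reconstruction, not code taken from the patch.]

```c
#include <assert.h>

typedef short i16;
typedef unsigned short u16;

/* Original function from the thread, for comparison. */
static int q_orig(i16 d[], u16 m[], u16 b[]) {
    int n = 0;
    for (int i = 0; i < 16; i++) {
        if (d[i] > 0)
            d[i] = (b[i] + d[i]) * m[i] >> 16;
        else
            d[i] = -((b[i] - d[i]) * m[i] >> 16);
        n |= d[i];
    }
    return n;
}

/* Hoisted variant: the three loads happen unconditionally before the
 * branch, so only the arithmetic remains under the condition. */
static int q_hoisted(i16 d[], u16 m[], u16 b[]) {
    int n = 0;
    for (int i = 0; i < 16; i++) {
        int di = d[i], bi = b[i], mi = m[i]; /* hoisted loads */
        d[i] = di > 0 ? (bi + di) * mi >> 16
                      : -((bi - di) * mi >> 16);
        n |= d[i];
    }
    return n;
}
```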
Adding to Philip's list and ideas:

* Fix GVNHoist. I like your patch because it's small and concise, but missing
from the RFC is a discussion of why we don't try to re-enable GVNHoist. I know
it was briefly enabled by default, which was reverted due to correctness
problems (or was it regressions?). But if this belongs in GVNHoist, could it
for example be added to GVNHoist, with only this part enabled? I'm not sure
whether that's possible, as I haven't looked at GVNHoist.

Removing from Philip's list:

* SimplifyCFG, because it's already a monster. 😉 More seriously, I'm not sure
it's the right place; GVNHoist looks better.

________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Philip Reames via llvm-dev <llvm-dev at lists.llvm.org>
Sent: 14 September 2021 19:16
To: Momchil Velikov <Momchil.Velikov at arm.com>; llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] Simple GVN hoist

> [snip]
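[The "simple GVN hoist" idea under discussion — pairing instructions from two
single-predecessor successor blocks by GVN value number and moving matched
pairs into the common predecessor — can be modelled abstractly. The toy below
treats an instruction as a hashable key standing in for its value number; it
is a sketch of the idea only, not the D109760 implementation, and omits the
safety checks (speculation legality, interfering memory writes) a real pass
must perform.]

```python
# Toy model of hoisting by value number. Two instructions with equal value
# numbers compute the same value, so one copy can move to the common
# predecessor and the other becomes redundant. A real pass must additionally
# verify that moving the instruction is safe; that is omitted here.

def simple_gvn_hoist(pred, then_blk, else_blk, value_number):
    """Hoist instructions appearing (by value number) in both successors."""
    else_by_vn = {}
    for inst in else_blk:
        else_by_vn.setdefault(value_number(inst), inst)

    hoisted = []
    for inst in list(then_blk):
        vn = value_number(inst)
        twin = else_by_vn.pop(vn, None)
        if twin is not None:
            pred.append(inst)      # move one copy up to the predecessor
            then_blk.remove(inst)
            else_blk.remove(twin)  # the twin is now redundant
            hoisted.append(vn)
    return hoisted
```

With instructions modelled as tuples and the identity function as the value
numbering, the two loads common to both arms are hoisted while the differing
add/sub (and everything depending on them) stays put.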
On Tue, Sep 14, 2021 at 7:17 PM Philip Reames via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

> You don't mention your target processor, but one question to ask is: is the
> cost model for a predicated load reasonable? If not, would reducing it to
> match the actual target cost fix your problem? In particular, we have
> *multiple* memory accesses with the *same* mask here. Does accounting for
> that difference in lowering cost sidestep the issue?

The target architecture is AArch64, and the cost model is indeed entirely
unreasonable: in fact, it deliberately assigns a high cost to emulated masked
loads/stores so as to disable vectorisation. For the case I have in hand,
though, I set out to find a way to not have to deal with conditional
loads/stores in the first place, instead of handling them better (by fixing
the cost model or otherwise).

> We could extend SimplifyCFG. You mention this, but don't get into *why* we
> don't handle the last load. We really should in this case. (Though, after
> CSE, there should only be two conditional loads in your inner loop?
> Maybe I'm missing something?)

Two of the loads are hoisted by `SimplifyCFGOpt::HoistThenElseCodeToIf`, which
is intentionally limited to hoisting identical instructions appearing in
identical order. In this example, the scan of the two blocks stops at the
first pair of differing instructions, which happens to come before the third
load.

> Your in-passing mention that -O3 and unrolling breaks vectorization also
> concerns me. It really shouldn't. That sounds like probably an issue in the
> SLP vectorizer, and maybe a pass-order issue.

I would (maybe naively) think that the loop vectoriser should be given a
chance before loop unrolling and the SLP vectoriser.
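[The lockstep behaviour described above can be illustrated with a toy model:
scan the two successor blocks in parallel, hoist only while the instruction
pairs are identical, and stop at the first difference. This is a sketch of
the idea behind `HoistThenElseCodeToIf`, not the actual LLVM C++ code, and
the instruction tuples below are made-up stand-ins for the real IR.]

```python
# Toy lockstep hoist: walk both blocks in parallel and hoist instructions
# only while they remain identical in identical order; give up at the first
# differing pair, stranding anything common that comes later.

def lockstep_hoist(then_blk, else_blk):
    hoisted = []
    while then_blk and else_blk and then_blk[0] == else_blk[0]:
        hoisted.append(then_blk.pop(0))
        else_blk.pop(0)
    return hoisted

# Instruction streams loosely resembling the two arms of the if in q():
# the add-vs-sub difference appears before the third common load.
then_blk = [("load", "b_i"), ("load", "d_i"),
            ("add", "b_i", "d_i"),
            ("load", "m_i"), ("mul",), ("shr",), ("store", "d_i")]
else_blk = [("load", "b_i"), ("load", "d_i"),
            ("sub", "b_i", "d_i"),
            ("load", "m_i"), ("mul",), ("shr",), ("store", "d_i")]
```

Running the scan hoists the first two loads and stops at the add/sub pair, so
the third load (`m[i]`) is left conditional in both arms — matching the
behaviour described above.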