thr3ads.net - llvm dev - [llvm-dev] [RFC] Simple GVN hoist [Nov 2021]

If this information is useful, please help other people find it:
Share via:

Alina Sbirlea via llvm-dev

2021-Nov-08 22:37 UTC

[llvm-dev] [RFC] Simple GVN hoist

Apologies for the delayed reply on this issue.

The status of NewGVN is uncertain at this point due to correctness
issues with the pass.
The idea of adding the GVN Hoist functionality makes sense, but I'd like a
discussion on how to plan this longer term. I added a note about this in
https://reviews.llvm.org/D110822.

Alina

On Tue, Sep 14, 2021 at 12:12 PM Philip Reames <listmail at
philipreames.com>
wrote:
> Ok, this response is focused on the proposed transform.  See inline.
> On 9/14/21 7:19 AM, Momchil Velikov via llvm-dev wrote:
>
> Looking at a particularly hot function in the SPEC/x264, that LLVM fails to
> vectorise.
>
> typedef short int16_t;
> typedef unsigned short uint16_t;
>
> int q(int16_t d[], uint16_t m[], uint16_t b[]) {
>   int n = 0;
>   for (int i = 0; i < 16; i++) {
>     if (d[i] > 0)
>       d[i] = (b[i] + d[i]) * m[i] >> 16;
>     else
>       d[i] = -((b[i] - d[i]) * m[i] >> 16);
>     n |= d[i];
>   }
>   return n;
> }
>
> As it turns out, LLVM adds runtime alias checks for this function and then
it
> considers it legal to vectorise with if-conversion.  However, the
vectorisation
> cost model effectively bans that particular if-conversion by assigning a
> ridiculous cost to emulated masked loads and stores.
>
> Originally, each branch of the if statement in the loop body contains three
> identical loads.  Manually hoisting these allows the loop to be vectorised
at
> `-O2` (at `-O3` the loop is fully unrolled and that breaks vectorisation).
> There's a subroutine in `SimplifyCFG` that does a rudimentary hoisting
of
> instructions from the two successors of a block, which subroutine does
indeed
> hoist two of the three loads, but gives up before hoisting the third one.
>
> We'd need a way to make LLVM hoist all three of the loads by itself.
`GVNHoist`
> can do that, but that pass is disabled by default and has been disabled for
a
> long, long time.
>
> I haven't looked at this pass recently, but I would strongly recommend
> distrusting the current implementation.  The software engineering process
> that went into writing it doesn't give me any confidence.  I remember a
> long string of enable-revert-bugfix without obvious justification as to why
> the bugs being found had to be found with such an expensive approach.  If
> you do decide to pursue this, expect the need for a) targeted review and b)
> a bunch of dedicated fuzzing.
>
> Having said that, *conceptually* this pass takes the proper approach by
> using MemorySSA.  Your current implementation has a bunch of complexity to
> basically do a local renumbering of memory instructions.  Some of that can
> be improved (see below), but if you were doing this with MemorySSA, you
> wouldn't need to do anything fancy as GVN would have numbered the loads
> correctly.  NewGVN is a MemorySSA version of GVN which I'd nominally
say is
> the right place for this, but it has never been productionized.  A huge
> blocker was MemorySSA itself which was recently productionized, but I
don't
> know what's left for NewGVN itself.  (Alina might be able to add
context
> here?)
>
> As an alternative, I was thinking of a simpler hoisting transformation,
that
> just handles moving instructions from two single-predecessor blocks to
their
> common predecessor. That could be made reasonably fast, by pairing
instructions
> by their GVN value number. Limiting hoisting to a predecessor block (for
the
> most part) would also avoid excessive increase of lifetimes (for the
majority of
> the case) and would also simplify correctness checks.
>
> I've written such a transformation as a subroutine to `GVN`, it seemed
like a
> good place for it and is an a similar spirit as various PREs the GVN does.
The
> Phabricator review is at https://reviews.llvm.org/D109760.
>
> The current structure of algorithm you proposed can probably be
> simplified.  Here's what I'd suggest:
>
> Start by handling all of the non-memory instructions.  This should be as
> simple as constructing a hash map from value number to instruction for one
> block, and then scanning the other block.  If you find the same value
> number in both blocks, the (by assumption scalar) instruction must exist in
> both blocks and can be hoisted.  This can be reviewed, tested, submitted,
> etc..
>
> Then add a single pre-pass which "renumbers" all of the memory
> instructions.  You can either use a trivial "no clobbers seen"
like
> EarlyCSE does, or only handle loads where MD->getDependency() is
nonlocal.
> (We already do this query in processLoad, so it should be cached.)  The
> same algorithm can be used as above, all that changes is where the
> valuenumber comes from.  (The "pre-pass" can probably be
implemented as a
> helper function which renumbers memory instructions on demand.)
>
> I would specifically *not* add an extra layer of iteration here.  If you
> find you need iteration to handle the cases you care about, I suspect we
> should actually be adjusting the value re-numbering scheme for loads
> described just above.
>
> A couple of assorted notes:
>
>    - MemDep should be responsible for handling all aliasing barriers.  If
>    it returns a non-local dependence, reordering that instruction to the
>    beginning of the block should be legal w.r.t. memory.  Your use of
>    isHoistBarrier is confusing to me.  You may be trying to handle
speculation
>    safety (i.e. avoid introducing faults), but if so, the special casing
>    around volatile and atomics seems odd?  (Also, why allow hosting of the
>    barrier instruction?)  I suspect there's a particular case
you're trying to
>    handle.  I would strongly advice dropping that from the initial patch,
then
>    add back the complexity in a separate change which is well motivated
with
>    tests, etc..
>    - In terms of the actual merge logic, I'd suggest implementing it as
>    hoist-one, then CSE.  That would allow you to reuse the existing
patching
>    logic and avoid reimplementing a bunch of complexity.
>
>
> Initial benchmarking on Neoverse N1 looks good (speedup, higher is better):
>
> 500.perlbench_r    1.13%
> 502.gcc_r          0.00%
> 505.mcf_r         -1.89%
> 520.omnetpp_r      0.00%
> 523.xalancbmk_r    0.00%
> 525.x264_r         7.67%
> 531.deepsjeng_r    0.60%
> 541.leela_r        0.24%
> 548.exchange2_r    0.00%
> 557.xz_r           0.75%
>
> This looks generally positive.  Do you know why MCF regresses?  If not,
> taking a glance to make sure you understand the interaction is probably
> warranted.
>
> Comments?
>
> ~chill
> _______________________________________________
> LLVM Developers mailing listllvm-dev at
lists.llvm.orghttps://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20211108/ecb6dd67/attachment.html>

Momchil Velikov via llvm-dev

2021-Nov-11 17:02 UTC

head link

[llvm-dev] [RFC] Simple GVN hoist

[back to the mailing list, for more(?) exposure]

From: Alina Sbirlea <asbirlea at google.com>> I added a note about this in https://reviews.llvm.org/D110822.
> So, stepping back, I'd like to raise the question of how this change is
going to be used if implemented.
> In the current default pipeline MemorySSA is not available where GVN is
added. This is forcing the
> computation of MemorySSA before GVN.
> Having MemorySSA computed before GVN may be good in the LTO pipeline (GVN
already preserves the
> analysis), as the following passes require it (MemCpyOpt and DSE), but in
the function simplification pipeline
> we'd end up with GVN computing, using and preserving two analysis
(MemorySSA and MemDepAnalysis) with
> overlapping goals.
> The status for switching to NewGVN is uncertain at this point, but porting
GVN to switch over from
> MemDepAnalysis to MemorySSA may be an option, at least in the interim. This
will lift the issue of having two
> analysis and also provide the availability of MemorySSA here.
> Is this something you'd be interested in exploring?
It is possible. We (the Arm team working on these things) would like to have a
reasonable idea[1]
of a couple of things.

First, will this get us anywhere? Does anyone can think of any other objections
(provided, of course, the patch is of sufficient quality) against merging this
and enabling it by default, including
computing MemorySSA?
AFAIK, GVNHoist was disabled due to some benchmark regressions, unfortunately I
wasn't able to find a trace about
this decision, does anyone know anything more about it? Would anyone interested
be able to apply
https://reviews.llvm.org/D111418 (5 patches) and check and see if there are
unacceptable regressions,
that we can try to resolve?

Second, what are we getting into, if we decide to migrate GVN.cpp to MemorySSA?
Would that be
a couple of weeks or a multi-year project? Has anyone already thought about
approaches to doing that?

I spent maybe a total of three days looking at
GVN/MemorySSA/MemoryDependenceAnalysis
and the transition does not look straightforward or obvious, more like
re-implementing (parts of)
MemoryDependenceAnalysis. I mean, it looks doable and not *that* much work, but
more likely than not
I'm overlooking something.

For example, looking at the dependencies of loads (if I'm not mistaken
that's the only kind relevant to GVN)
the MemoryDependencyAnalysis would return other loads as
"dependencies", depending on aliasing,
and also `alloca`s. This does not have a direct equivalent in MemorySSA. I'm
thinking of something along
the lines of: get the clobbering def for a load, get all its uses, from this
list, remove uses, post-dominated
by other uses (as long as the post-dominating one is not a MayAlias), and
somehow do this
in an efficient manner - that'll hopefully get the same set of dependencies
(local or not) as the ones obtained from
the MemoryDependencyAnalysis, continue from there as before.

Any other ideas?

~chill

[1] by no means a guarantee, we're just looking to asses the risks going one
or another way

llvm dev - Nov 2021 - [RFC] Simple GVN hoist

[llvm-dev] [RFC] Simple GVN hoist

[llvm-dev] [RFC] Simple GVN hoist