Ok, this response is focused on the proposed transform. See inline.

On 9/14/21 7:19 AM, Momchil Velikov via llvm-dev wrote:
> Looking at a particularly hot function in SPEC/x264, which LLVM fails to
> vectorise:
>
>   typedef short int16_t;
>   typedef unsigned short uint16_t;
>
>   int q(int16_t d[], uint16_t m[], uint16_t b[]) {
>     int n = 0;
>     for (int i = 0; i < 16; i++) {
>       if (d[i] > 0)
>         d[i] = (b[i] + d[i]) * m[i] >> 16;
>       else
>         d[i] = -((b[i] - d[i]) * m[i] >> 16);
>       n |= d[i];
>     }
>     return n;
>   }
>
> As it turns out, LLVM adds runtime alias checks for this function and then
> considers it legal to vectorise with if-conversion. However, the
> vectorisation cost model effectively bans that particular if-conversion by
> assigning a ridiculous cost to emulated masked loads and stores.
>
> Originally, each branch of the if statement in the loop body contains three
> identical loads. Manually hoisting these allows the loop to be vectorised at
> `-O2` (at `-O3` the loop is fully unrolled and that breaks vectorisation).
> There's a subroutine in `SimplifyCFG` that does a rudimentary hoisting of
> instructions from the two successors of a block; it does indeed hoist two of
> the three loads, but gives up before hoisting the third one.
>
> We'd need a way to make LLVM hoist all three of the loads by itself.
> `GVNHoist` can do that, but that pass is disabled by default and has been
> disabled for a long, long time.

I haven't looked at this pass recently, but I would strongly recommend
distrusting the current implementation. The software engineering process that
went into writing it doesn't give me any confidence. I remember a long string
of enable-revert-bugfix cycles without obvious justification as to why the
bugs being found had to be found with such an expensive approach. If you do
decide to pursue this, expect the need for a) targeted review and b) a bunch
of dedicated fuzzing.

Having said that, *conceptually* this pass takes the proper approach by using
MemorySSA. Your current implementation has a bunch of complexity to basically
do a local renumbering of memory instructions. Some of that can be improved
(see below), but if you were doing this with MemorySSA, you wouldn't need to
do anything fancy, as GVN would have numbered the loads correctly. NewGVN is a
MemorySSA-based version of GVN, which I'd nominally say is the right place for
this, but it has never been productionized. A huge blocker was MemorySSA
itself, which was recently productionized, but I don't know what's left for
NewGVN itself. (Alina might be able to add context here?)

> As an alternative, I was thinking of a simpler hoisting transformation that
> just handles moving instructions from two single-predecessor blocks to their
> common predecessor. That could be made reasonably fast by pairing
> instructions by their GVN value number. Limiting hoisting to a predecessor
> block would also (for the most part) avoid excessive increase of lifetimes
> and would also simplify correctness checks.
>
> I've written such a transformation as a subroutine of `GVN`; it seemed like
> a good place for it and is in a similar spirit to the various PREs that GVN
> does. The Phabricator review is at https://reviews.llvm.org/D109760.

The current structure of the algorithm you proposed can probably be
simplified. Here's what I'd suggest:

Start by handling all of the non-memory instructions. This should be as
simple as constructing a hash map from value number to instruction for one
block and then scanning the other block. If you find the same value number in
both blocks, the (by assumption scalar) instruction must exist in both blocks
and can be hoisted. This can be reviewed, tested, submitted, etc.
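To make that concrete, here is a minimal sketch of the pairing step. It
assumes access to GVN's ValueTable (lookupOrAdd); the helper name and the
conservative speculation filter are illustrative, not existing GVN code:

  #include "llvm/ADT/DenseMap.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Analysis/ValueTracking.h"
  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Instruction.h"
  #include "llvm/Transforms/Scalar/GVN.h"
  using namespace llvm;

  // Pair hoistable scalar instructions from the two successor blocks by their
  // GVN value number.  Memory instructions are skipped for now; a later
  // "renumbering" pre-pass would only change where the number comes from.
  static void collectHoistCandidates(
      BasicBlock *S0, BasicBlock *S1, GVN::ValueTable &VN,
      SmallVectorImpl<std::pair<Instruction *, Instruction *>> &Pairs) {
    auto IsCandidate = [](Instruction &I) {
      return !I.isTerminator() && !I.mayReadOrWriteMemory() &&
             isSafeToSpeculativelyExecute(&I); // deliberately conservative
    };
    DenseMap<uint32_t, Instruction *> Map;
    for (Instruction &I : *S0)
      if (IsCandidate(I))
        Map.try_emplace(VN.lookupOrAdd(&I), &I);
    for (Instruction &I : *S1)
      if (IsCandidate(I))
        if (Instruction *Other = Map.lookup(VN.lookupOrAdd(&I)))
          Pairs.emplace_back(Other, &I);
  }

Hoisting would then walk Pairs, moving one instruction of each pair into the
common predecessor and CSE'ing the other into it.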
Then add a single pre-pass which "renumbers" all of the memory instructions.
You can either use a trivial "no clobbers seen" scan like EarlyCSE does, or
only handle loads where MD->getDependency() is nonlocal. (We already do this
query in processLoad, so it should be cached.) The same algorithm can be used
as above; all that changes is where the value number comes from. (The
"pre-pass" can probably be implemented as a helper function which renumbers
memory instructions on demand.)

I would specifically *not* add an extra layer of iteration here. If you find
you need iteration to handle the cases you care about, I suspect we should
actually be adjusting the value re-numbering scheme for loads described just
above.

A couple of assorted notes:

* MemDep should be responsible for handling all aliasing barriers. If it
  returns a non-local dependence, reordering that instruction to the beginning
  of the block should be legal w.r.t. memory. Your use of isHoistBarrier is
  confusing to me. You may be trying to handle speculation safety (i.e. avoid
  introducing faults), but if so, the special casing around volatile and
  atomics seems odd? (Also, why allow hoisting of the barrier instruction?) I
  suspect there's a particular case you're trying to handle. I would strongly
  advise dropping that from the initial patch, then adding the complexity back
  in a separate change which is well motivated with tests, etc.

* In terms of the actual merge logic, I'd suggest implementing it as
  hoist-one, then CSE. That would allow you to reuse the existing patching
  logic and avoid reimplementing a bunch of complexity.

> Initial benchmarking on Neoverse N1 looks good (speedup, higher is better):
>
>   500.perlbench_r    1.13%
>   502.gcc_r          0.00%
>   505.mcf_r         -1.89%
>   520.omnetpp_r      0.00%
>   523.xalancbmk_r    0.00%
>   525.x264_r         7.67%
>   531.deepsjeng_r    0.60%
>   541.leela_r        0.24%
>   548.exchange2_r    0.00%
>   557.xz_r           0.75%

This looks generally positive. Do you know why MCF regresses? If not, taking
a glance to make sure you understand the interaction is probably warranted.

> Comments?
>
> ~chill
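To illustrate the load "renumbering" idea above, here is a rough sketch (not
existing GVN code): key a hoistable load on the value number of its pointer
operand plus the loaded type, and only give it a key at all when MemDep
reports a non-local dependency. getHoistableLoadKey is an illustrative name;
MD is assumed to be the cached MemoryDependenceResults.

  #include "llvm/ADT/Optional.h"
  #include "llvm/Analysis/MemoryDependenceAnalysis.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/Transforms/Scalar/GVN.h"
  using namespace llvm;

  // A load may be paired for hoisting only if nothing earlier in its own
  // block may clobber it, i.e. MemDep reports a non-local dependency.  Two
  // loads in sibling blocks then pair up iff their keys compare equal.
  static Optional<std::pair<uint32_t, Type *>>
  getHoistableLoadKey(LoadInst *L, GVN::ValueTable &VN,
                      MemoryDependenceResults *MD) {
    if (L->isSimple() && MD->getDependency(L).isNonLocal())
      return std::make_pair(VN.lookupOrAdd(L->getPointerOperand()),
                            L->getType());
    return None; // locally clobbered, volatile or atomic: leave it alone
  }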
Realized that the detailed nature of my response may have missed the
high-order bit. I generally think that doing this transformation in GVN is
entirely reasonable and would be a useful enhancement. That's why I spent the
time to give the detailed feedback on the implementation.

Philip

p.s. It's worth noting that what we're implementing here is a restricted form
of "very busy expressions", which is basically the dual of PRE. There's a
bunch of existing literature on the general problem.

p.p.s. Once we're through with the hoisting variant, there's a restricted
sinking variant too. If we structure the hoisting code right, the sinking
should be straightforward.
On Tue, Sep 14, 2021 at 8:12 PM Philip Reames via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

> Having said that, *conceptually* this pass takes the proper approach by
> using MemorySSA. Your current implementation has a bunch of complexity to
> basically do a local renumbering of memory instructions. Some of that can be
> improved (see below), but if you were doing this with MemorySSA, you
> wouldn't need to do anything fancy, as GVN would have numbered the loads
> correctly.

Do you mean that NewGVN will number loads and stores correctly? The old GVN
will have unique numbers for loads and stores.

> The current structure of the algorithm you proposed can probably be
> simplified. Here's what I'd suggest:
>
> Start by handling all of the non-memory instructions. This should be as
> simple as constructing a hash map from value number to instruction for one
> block and then scanning the other block. If you find the same value number
> in both blocks, the (by assumption scalar) instruction must exist in both
> blocks and can be hoisted. This can be reviewed, tested, submitted, etc.
>
> Then add a single pre-pass which "renumbers" all of the memory instructions.
> You can either use a trivial "no clobbers seen" scan like EarlyCSE does, or
> only handle loads where MD->getDependency() is nonlocal. (We already do this
> query in processLoad, so it should be cached.) The same algorithm can be
> used as above; all that changes is where the value number comes from. (The
> "pre-pass" can probably be implemented as a helper function which renumbers
> memory instructions on demand.)

The issue here is that the uniqueness of load numbers propagates to the users
of those loads, so just after we've hoisted and merged a pair of loads into
one instruction we can potentially get the same VN in that load's users. I
was handling that with a follow-up iteration, but indeed we could recursively
traverse the users, starting from the load, and recompute VNs.
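A rough sketch of that recursive recomputation (illustrative only; it assumes
the value table exposes erase() and lookupOrAdd(), as GVN's in-tree ValueTable
does):

  #include "llvm/ADT/SmallPtrSet.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/Instruction.h"
  #include "llvm/Transforms/Scalar/GVN.h"
  using namespace llvm;

  // After two loads have been merged into 'Root', refresh the value numbers
  // of its transitive users so that later pairing sees matching numbers.
  static void renumberUsers(Instruction *Root, GVN::ValueTable &VN) {
    SmallVector<Instruction *, 8> Worklist;
    SmallPtrSet<Instruction *, 8> Visited;
    Worklist.push_back(Root);
    while (!Worklist.empty()) {
      Instruction *I = Worklist.pop_back_val();
      for (User *U : I->users())
        if (auto *UI = dyn_cast<Instruction>(U))
          if (Visited.insert(UI).second) {
            VN.erase(UI);       // drop the stale number...
            VN.lookupOrAdd(UI); // ...and recompute it from the new operands
            Worklist.push_back(UI);
          }
    }
  }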
> I would specifically *not* add an extra layer of iteration here. If you find
> you need iteration to handle the cases you care about, I suspect we should
> actually be adjusting the value re-numbering scheme for loads described just
> above.
>
> A couple of assorted notes:
>
> MemDep should be responsible for handling all aliasing barriers. If it
> returns a non-local dependence, reordering that instruction to the beginning
> of the block should be legal w.r.t. memory. Your use of isHoistBarrier is
> confusing to me. You may be trying to handle speculation safety (i.e. avoid
> introducing faults), but if so, the special casing around volatile and
> atomics seems odd? (Also, why allow hoisting of the barrier instruction?) I
> suspect there's a particular case you're trying to handle. I would strongly
> advise dropping that from the initial patch, then adding the complexity back
> in a separate change which is well motivated with tests, etc.

With regard to volatile and the atomics, I guess I was being too conservative
in trying to not reorder *any* instructions above them. I can remove those
conditions.

The other condition is there to avoid speculatively executing instructions by
moving them across an instruction that may throw. E.g. with:

  VN 9:     br i1 %tobool, label %if.then, label %if.else
          if.then:
  VN 11:    call void @h()
  VN 7:     %0 = sdiv i32 %a, %b
            ...
          if.else:
  VN 7:     %1 = sdiv i32 %a, %b
            ...

we don't want to end up with:

  VN 7:     %0 = sdiv i32 %a, %b
            br i1 %tobool, label %if.then, label %if.else
          if.then:
  VN 11:    call void @h()
            ...
          if.else:
  VN 7:     %1 = sdiv i32 %a, %b
            ...

but with:

  VN 9:     br i1 %tobool, label %if.then, label %if.else
          if.then:
  VN 11:    call void @h()
  VN 7:     %0 = sdiv i32 %a, %b
            ...
          if.else:
  VN 11:    call void @h()
  VN 7:     %1 = sdiv i32 %a, %b
            ...

(where `h` is `readnone`) we could move both the call and the division, while
preserving their order, which also answers the question "why allow hoisting of
the barrier instruction?".

So, the subsequent iterations of the algorithm were designed to:

a) handle renumbering of loads and their users (which could possibly be dealt
   with as described above);

b) prevent all reordering across volatiles and atomics (I'll drop that and
   prevent reordering only when MemDep says there is a dependency);

c) detect when hoisting an instruction turns its dependency from a local one
   into a non-local one, thus potentially allowing more hoisting; I'll see
   about explicitly checking whether two matching instructions have a local
   dependency on a couple of instructions that are themselves paired for
   hoisting;

d) prevent speculation across throwing instructions; I don't have a nice
   non-iterative solution for that, e.g. in the example above there is no
   dependency between the call and the division.

> In terms of the actual merge logic, I'd suggest implementing it as
> hoist-one, then CSE. That would allow you to reuse the existing patching
> logic and avoid reimplementing a bunch of complexity.

I'm not sure I understand this. Currently I move one instruction, then replace
all uses of the other with the first one. Do you mean that or something else?

Thanks for the comments and the suggestions, I now have a bunch of new ideas
to try!

Thanks and best regards,
~chill
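For reference, a minimal sketch of the hoist-one-then-CSE merge step being
discussed; it assumes metadata is intersected the way GVN's existing
replacement path does it, via patchReplacementInstruction() from
Transforms/Utils/Local.h, and the helper name is illustrative:

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Instruction.h"
  #include "llvm/Transforms/Utils/Local.h"
  using namespace llvm;

  // Hoist one instruction of the pair into the common predecessor, then CSE
  // the other into it.
  static void hoistPair(BasicBlock *Pred, Instruction *I1, Instruction *I2) {
    I1->moveBefore(Pred->getTerminator()); // hoist one copy
    patchReplacementInstruction(I2, I1);   // intersect metadata (!tbaa, !range, ...)
    I2->replaceAllUsesWith(I1);            // ...then CSE the other into it
    I2->eraseFromParent();
  }

Reusing the existing patching logic keeps the metadata handling consistent
with what GVN already does when it replaces one load with another.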
Apologies for the delayed reply on this issue.

The status of NewGVN is uncertain at this point due to correctness issues with
the pass. The idea of adding the GVNHoist functionality makes sense, but I'd
like a discussion on how to plan this longer term. I added a note about this
in https://reviews.llvm.org/D110822.

Alina