Looking at a particularly hot function in the SPEC/x264, that LLVM fails to vectorise. typedef short int16_t; typedef unsigned short uint16_t; int q(int16_t d[], uint16_t m[], uint16_t b[]) { int n = 0; for (int i = 0; i < 16; i++) { if (d[i] > 0) d[i] = (b[i] + d[i]) * m[i] >> 16; else d[i] = -((b[i] - d[i]) * m[i] >> 16); n |= d[i]; } return n; } As it turns out, LLVM adds runtime alias checks for this function and then it considers it legal to vectorise with if-conversion. However, the vectorisation cost model effectively bans that particular if-conversion by assigning a ridiculous cost to emulated masked loads and stores. Originally, each branch of the if statement in the loop body contains three identical loads. Manually hoisting these allows the loop to be vectorised at `-O2` (at `-O3` the loop is fully unrolled and that breaks vectorisation). There's a subroutine in `SimplifyCFG` that does a rudimentary hoisting of instructions from the two successors of a block, which subroutine does indeed hoist two of the three loads, but gives up before hoisting the third one. We'd need a way to make LLVM hoist all three of the loads by itself. `GVNHoist` can do that, but that pass is disabled by default and has been disabled for a long, long time. As an alternative, I was thinking of a simpler hoisting transformation, that just handles moving instructions from two single-predecessor blocks to their common predecessor. That could be made reasonably fast, by pairing instructions by their GVN value number. Limiting hoisting to a predecessor block (for the most part) would also avoid excessive increase of lifetimes (for the majority of the case) and would also simplify correctness checks. I've written such a transformation as a subroutine to `GVN`, it seemed like a good place for it and is an a similar spirit as various PREs the GVN does. The Phabricator review is at https://reviews.llvm.org/D109760. Initial benchmarking on Neoverse N1 looks good (speedup, higher is better): 500.perlbench_r 1.13% 502.gcc_r 0.00% 505.mcf_r -1.89% 520.omnetpp_r 0.00% 523.xalancbmk_r 0.00% 525.x264_r 7.67% 531.deepsjeng_r 0.60% 541.leela_r 0.24% 548.exchange2_r 0.00% 557.xz_r 0.75% Comments? ~chill
Initial response focused on different potential ways of solving this problem. I'll reply separately on your proposed GVN patch. I see a number of ways to tackle this: * The vectorize already has to do a forward walk through the function to compute predicate masks. As we do that walk, we could keep track for each address of the union of predicate masks for each access. In this case, the union would be all ones, meaning we could use a unconditional access. In general, we can do a single predicated access using the union of the predicate masks. This is a non-trivial costing question of whether explicitly forming the union is worthwhile, but I suspect we can handle obvious cases cheaply. * Still in the vectorizer, we can do a forward walk and track accesses encountered along each path. Any address which is accessed along each path to the latch can be done unconditionally. (This is essentially a restricted subset of the former without the generalization to predicate masks.) * We could extend SimplifyCFG. You mention this, but don't get into *why* we don't handle the last load. We really should in this case. (Though, after CSE, there should only be two conditional loads in your inner loop? Maybe I'm missing something?) * You don't mention your target processor, but one question to ask is the cost model for a predicated load reasonable? If not, would reducing it to match the actual target cost fix your problem? In particular, we have *multiple* memory accesses with the *same* mask here. Does accounting for that difference in lowering cost side step the issue? Your in-passing mention that -O3 and unrolling breaks vectorization also concerns me. It really shouldn't. That sounds like a probably issue in the SLP vectorizer, and maybe a pass order issue. All of the above is simply me brainstorming. :) Philip On 9/14/21 7:19 AM, Momchil Velikov via llvm-dev wrote:> Looking at a particularly hot function in the SPEC/x264, that LLVM fails to > vectorise. > > typedef short int16_t; > typedef unsigned short uint16_t; > > int q(int16_t d[], uint16_t m[], uint16_t b[]) { > int n = 0; > for (int i = 0; i < 16; i++) { > if (d[i] > 0) > d[i] = (b[i] + d[i]) * m[i] >> 16; > else > d[i] = -((b[i] - d[i]) * m[i] >> 16); > n |= d[i]; > } > return n; > } > > As it turns out, LLVM adds runtime alias checks for this function and then it > considers it legal to vectorise with if-conversion. However, the vectorisation > cost model effectively bans that particular if-conversion by assigning a > ridiculous cost to emulated masked loads and stores. > > Originally, each branch of the if statement in the loop body contains three > identical loads. Manually hoisting these allows the loop to be vectorised at > `-O2` (at `-O3` the loop is fully unrolled and that breaks vectorisation). > There's a subroutine in `SimplifyCFG` that does a rudimentary hoisting of > instructions from the two successors of a block, which subroutine does indeed > hoist two of the three loads, but gives up before hoisting the third one. > > We'd need a way to make LLVM hoist all three of the loads by itself. `GVNHoist` > can do that, but that pass is disabled by default and has been disabled for a > long, long time. > > As an alternative, I was thinking of a simpler hoisting transformation, that > just handles moving instructions from two single-predecessor blocks to their > common predecessor. That could be made reasonably fast, by pairing instructions > by their GVN value number. Limiting hoisting to a predecessor block (for the > most part) would also avoid excessive increase of lifetimes (for the majority of > the case) and would also simplify correctness checks. > > I've written such a transformation as a subroutine to `GVN`, it seemed like a > good place for it and is an a similar spirit as various PREs the GVN does. The > Phabricator review is at https://reviews.llvm.org/D109760. > > Initial benchmarking on Neoverse N1 looks good (speedup, higher is better): > > 500.perlbench_r 1.13% > 502.gcc_r 0.00% > 505.mcf_r -1.89% > 520.omnetpp_r 0.00% > 523.xalancbmk_r 0.00% > 525.x264_r 7.67% > 531.deepsjeng_r 0.60% > 541.leela_r 0.24% > 548.exchange2_r 0.00% > 557.xz_r 0.75% > > Comments? > > ~chill > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210914/8178c072/attachment.html>
Ok, this response is focused on the proposed transform. See inline. On 9/14/21 7:19 AM, Momchil Velikov via llvm-dev wrote:> Looking at a particularly hot function in the SPEC/x264, that LLVM fails to > vectorise. > > typedef short int16_t; > typedef unsigned short uint16_t; > > int q(int16_t d[], uint16_t m[], uint16_t b[]) { > int n = 0; > for (int i = 0; i < 16; i++) { > if (d[i] > 0) > d[i] = (b[i] + d[i]) * m[i] >> 16; > else > d[i] = -((b[i] - d[i]) * m[i] >> 16); > n |= d[i]; > } > return n; > } > > As it turns out, LLVM adds runtime alias checks for this function and then it > considers it legal to vectorise with if-conversion. However, the vectorisation > cost model effectively bans that particular if-conversion by assigning a > ridiculous cost to emulated masked loads and stores. > > Originally, each branch of the if statement in the loop body contains three > identical loads. Manually hoisting these allows the loop to be vectorised at > `-O2` (at `-O3` the loop is fully unrolled and that breaks vectorisation). > There's a subroutine in `SimplifyCFG` that does a rudimentary hoisting of > instructions from the two successors of a block, which subroutine does indeed > hoist two of the three loads, but gives up before hoisting the third one. > > We'd need a way to make LLVM hoist all three of the loads by itself. `GVNHoist` > can do that, but that pass is disabled by default and has been disabled for a > long, long time.I haven't looked at this pass recently, but I would strongly recommend distrusting the current implementation. The software engineering process that went into writing it doesn't give me any confidence. I remember a long string of enable-revert-bugfix without obvious justification as to why the bugs being found had to be found with such an expensive approach. If you do decide to pursue this, expect the need for a) targeted review and b) a bunch of dedicated fuzzing. Having said that, *conceptually* this pass takes the proper approach by using MemorySSA. Your current implementation has a bunch of complexity to basically do a local renumbering of memory instructions. Some of that can be improved (see below), but if you were doing this with MemorySSA, you wouldn't need to do anything fancy as GVN would have numbered the loads correctly. NewGVN is a MemorySSA version of GVN which I'd nominally say is the right place for this, but it has never been productionized. A huge blocker was MemorySSA itself which was recently productionized, but I don't know what's left for NewGVN itself. (Alina might be able to add context here?)> > As an alternative, I was thinking of a simpler hoisting transformation, that > just handles moving instructions from two single-predecessor blocks to their > common predecessor. That could be made reasonably fast, by pairing instructions > by their GVN value number. Limiting hoisting to a predecessor block (for the > most part) would also avoid excessive increase of lifetimes (for the majority of > the case) and would also simplify correctness checks. > > I've written such a transformation as a subroutine to `GVN`, it seemed like a > good place for it and is an a similar spirit as various PREs the GVN does. The > Phabricator review is at https://reviews.llvm.org/D109760.The current structure of algorithm you proposed can probably be simplified. Here's what I'd suggest: Start by handling all of the non-memory instructions. This should be as simple as constructing a hash map from value number to instruction for one block, and then scanning the other block. If you find the same value number in both blocks, the (by assumption scalar) instruction must exist in both blocks and can be hoisted. This can be reviewed, tested, submitted, etc.. Then add a single pre-pass which "renumbers" all of the memory instructions. You can either use a trivial "no clobbers seen" like EarlyCSE does, or only handle loads where MD->getDependency() is nonlocal. (We already do this query in processLoad, so it should be cached.) The same algorithm can be used as above, all that changes is where the valuenumber comes from. (The "pre-pass" can probably be implemented as a helper function which renumbers memory instructions on demand.) I would specifically *not* add an extra layer of iteration here. If you find you need iteration to handle the cases you care about, I suspect we should actually be adjusting the value re-numbering scheme for loads described just above. A couple of assorted notes: * MemDep should be responsible for handling all aliasing barriers. If it returns a non-local dependence, reordering that instruction to the beginning of the block should be legal w.r.t. memory. Your use of isHoistBarrier is confusing to me. You may be trying to handle speculation safety (i.e. avoid introducing faults), but if so, the special casing around volatile and atomics seems odd? (Also, why allow hosting of the barrier instruction?) I suspect there's a particular case you're trying to handle. I would strongly advice dropping that from the initial patch, then add back the complexity in a separate change which is well motivated with tests, etc.. * In terms of the actual merge logic, I'd suggest implementing it as hoist-one, then CSE. That would allow you to reuse the existing patching logic and avoid reimplementing a bunch of complexity.> > Initial benchmarking on Neoverse N1 looks good (speedup, higher is better): > > 500.perlbench_r 1.13% > 502.gcc_r 0.00% > 505.mcf_r -1.89% > 520.omnetpp_r 0.00% > 523.xalancbmk_r 0.00% > 525.x264_r 7.67% > 531.deepsjeng_r 0.60% > 541.leela_r 0.24% > 548.exchange2_r 0.00% > 557.xz_r 0.75%This looks generally positive. Do you know why MCF regresses? If not, taking a glance to make sure you understand the interaction is probably warranted.> > Comments? > > ~chill > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210914/420407f7/attachment.html>