Jonas Paulsson via llvm-dev
2018-Sep-04 12:14 UTC
[llvm-dev] LoopVectorizer: shufflevectors
Hi,

I have been discussing with Sanjay how to handle the poor sequences of shufflevector instructions produced by the loop vectorizer, and he suggested we bring this up on llvm-dev.

I have run into this in the past also, and it surprised me to again see (on SystemZ) that the vectorized loop did many seemingly unnecessary shuffles. In this case (see https://bugs.llvm.org/show_bug.cgi?id=38792), one group of interleaved loads got shuffled into vector operands, to then be used by an interleaved group of stores, which in turn did its own shuffling. The loop vectorizer did not attempt to combine these shuffles, and unfortunately no later pass did so either.

This seems to be due to keeping instcombine simple and fast, as well as a conservativeness about not producing any new shuffles not already present in the input program (see the comment in InstCombiner::visitShuffleVectorInst); for a reason that is a bit unclear to me, the backend will otherwise get into trouble.

Should improved optimization of shufflevector instructions handle all of them globally, or just the new ones produced by the vectorizers? At least in the code produced by the vectorizers, it seems fair to combine the shuffles to minimize them. If we want to limit this to just auto-vectorized code, then maybe this could be done with some common utility called by a vectorizer on its result. If, on the other hand, we want to optimize everything, a new separate IR pass for vector ops could be made to run just once after the vectorizers. Such a new pass could handle vector instructions in general more extensively than instcombine does. Would it then be possible to avoid the current problems the backend is having? Or does this really have to be done on the DAG by each backend? Or perhaps this is really just a local issue with the loop vectorizer?

Please fill in on how best to proceed with improving the loop vectorizer's code.

Jonas
On Tue, 4 Sep 2018 at 13:14, Jonas Paulsson via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> I have run into this in the past also and it surprised me to again see
> (on SystemZ) that the vectorized loop did many seemingly unnecessary
> shuffles.

Hi Jonas,

This is a known side-effect of vectorisers (not just LLVM's) that have to handle multiple hardware architectures. GCC has its own set of bad patterns, too.

> This seems to be due to keeping instcombine simple and fast, as well
> as a conservativeness about not producing any new shuffles not already
> present in the input program (see the comment in
> InstCombiner::visitShuffleVectorInst); for a reason that is a bit
> unclear to me, the backend will otherwise get into trouble.

Specifically, interleaved generation will invariably lead to *additional* shuffles, because it is trying to create the pattern that will later be selected into one or a few instructions. The middle end relies on the back end knowing how to select the large patterns, as well as on other middle-end passes not destroying them. Cleaning that up may very well lead to poorer code.

> Should improved optimization of shufflevector instructions handle all
> of them globally, or just the new ones produced by the vectorizers?

It's probably a lot simpler to improve the SystemZ model to have the same arch flags / cost-model completeness as the other targets. The transformations done by the vectoriser are target-agnostic, but they still ask the target whether certain patterns are possible and profitable before doing so.

> Or does this really have to be done on the DAG by each backend? Or
> perhaps this is really just a local issue with the loop vectorizer?

It's a dance between what the middle end enquires of the target info and what the back end can actually generate.
You may have to expose some flags (to turn certain behaviour on), then tune the cost model (to make them profitable in most cases), then implement the pattern recognition in DAGISel (and GlobalISel), so that the generated code can be selected optimally.

If LLVM were compiling to a single target, emitting IR that conforms to one specific pattern would be the most logical choice (don't pollute, simplify further passes, reduce pattern complexity), so this may sound a lot simpler for arch-specific compilers. But in a target-agnostic compiler you need to "emulate" that using the three steps above: target info, cost model, ISel patterns.

Hope that helps.

--
cheers,
--renato
Jonas Paulsson via llvm-dev
2018-Sep-04 16:34 UTC
[llvm-dev] LoopVectorizer: shufflevectors
Hi Renato,

> It's probably a lot simpler to improve the SystemZ model to have the
> same arch flags / cost-model completeness as the other targets.

I thought they were - anything particular in mind?

> The transformations done by the vectoriser are target-agnostic, but
> they still ask the target whether certain patterns are possible and
> profitable before doing so.

I have a patch for the LoopVectorizer that makes it recognize this particular case of a load interleave group being used only by a store interleave group. I then pass this flag to TTI.getInterleavedMemoryOpCost(), so that the target can return an accurate cost. During my experiments on SystemZ I added the cost of shuffling the vector(s) only on the load, while the store group did not get that cost at all. This made many more cases of interleaving happen (~450 cases on SPEC, IIRC). The only problem was... the SystemZ backend could not handle those shuffles equally well in all the cases. To me that looked like something to be fixed at the IR level, and after discussions with Sanjay I got the impression that this was the case...

To me, this looks like something the LoopVectorizer is neglecting and should be combining. I suppose that, with my patch for the load -> store groups, I could also add handling of recomputed indices, so that the load group produces a vector that fits the store group directly. But if I understand you correctly, even this is not so wise? If so, then improving the SystemZ DAGCombiner is indeed the only alternative left, I guess...

> But in a target-agnostic compiler you need to "emulate" that using the
> three steps above: target info, cost model, ISel patterns.

But is having the cost functions available not enough to drive a later IR pass to optimize the generated vector code? I mean, if the target indicated which shuffles were expensive, those could then easily be avoided.

Thanks,
Jonas