Danila Malyutin via llvm-dev
2019-Aug-08 17:36 UTC
[llvm-dev] How to best deal with undesirable Induction Variable Simplification?
Hello,

Recently I've come across two instances where Induction Variable Simplification led to noticeable performance regressions.

In one case, the removal of an extra IV made it impossible to reschedule instructions in a tight loop to reduce stalls. There were enough registers to spare, so spending an extra register on an extra induction variable was preferable, since it reduced dependencies in the loop.

In the second case, there was a big nested loop made even bigger after unswitching. However, the inner loop body was rather simple, of the form:

    loop {
      p += n;
      ...
      p += n;
      ...
    }
    use p.

Due to unswitching there were several such loops, each with a different number of p += n ops, so when the IndVars pass rewrote all exit values, it added a lot of slightly different offsets to the main loop header that couldn't fit in the available registers, which led to unnecessary spills/reloads.

I am wondering what the usual strategy is for dealing with such "pessimizations". Is it possible to somehow modify the IndVarSimplify pass to take those issues into account (for example, tell it that adding an offset computation + GEP is potentially more expensive than simply reusing the last variable from the loop), or should it be recovered in some later pass? If so, is there an easy way to revert IV elimination?

Has anyone dealt with similar issues before?

--
Danila
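For illustration, the second case can be sketched in C++ roughly as follows. This is only a sketch under assumptions: the function names, the fixed count of two `p += n` steps per iteration, and the explicit `trips` parameter are hypothetical and do not come from the original code. The first function carries the pointer through the loop and reuses its last value; the second recomputes the exit value from the trip count, which is approximately what exit-value rewriting produces (an offset computation plus a GEP).

```cpp
#include <cassert>
#include <cstddef>

// Form 1: the pointer is carried through the loop and its final value is
// reused afterwards ("use p" in the message).
const int *advance_in_loop(const int *p, std::ptrdiff_t n, std::size_t trips) {
    for (std::size_t i = 0; i < trips; ++i) {
        p += n;  // first increment in the body
        p += n;  // second increment in the body
    }
    return p;
}

// Form 2: the exit value is recomputed from the trip count instead of being
// read from the loop-carried variable (hypothetical hand-written equivalent).
const int *advance_closed_form(const int *p, std::ptrdiff_t n, std::size_t trips) {
    // offset = trip count * (number of increments per iteration) * n
    const std::ptrdiff_t offset = static_cast<std::ptrdiff_t>(trips) * 2 * n;
    return p + offset;
}

// Small self-check: both forms must agree on the final pointer.
void check_exit_values() {
    int buf[64] = {};
    assert(advance_in_loop(buf, 3, 5) == advance_closed_form(buf, 3, 5));
}
```

Each unswitched copy of the loop with a different number of increments needs its own `offset` value in the second form, which is where the extra live values described above would come from.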
Michael Kruse via llvm-dev
2019-Aug-08 23:21 UTC
[llvm-dev] How to best deal with undesirable Induction Variable Simplification?
On Thu, Aug 8, 2019 at 12:37, Danila Malyutin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hello,
> Recently I've come across two instances where Induction Variable Simplification led to noticeable performance regressions.
>
> In one case, the removal of an extra IV made it impossible to reschedule instructions in a tight loop to reduce stalls. There were enough registers to spare, so spending an extra register on an extra induction variable was preferable, since it reduced dependencies in the loop.

Since r139579, IndVarSimplify (the pass) should no longer normalize induction variables without a reason (a reason would be that the loop can be deleted). Could you file a bug report, attach a minimal .ll file and mention what output you would expect?

> Due to unswitching there were several such loops, each with a different number of p += n ops, so when the IndVars pass rewrote all exit values, it added a lot of slightly different offsets to the main loop header that couldn't fit in the available registers, which led to unnecessary spills/reloads.

Since only one of the resulting loops is executed after unswitching, the register usage should be the maximum over those loops, which ideally is at most the register usage of the pre-unswitched loop. In your case, p could be in the same register in all unswitched loops. However, other optimizations might increase register pressure again, and register allocation is not optimal in all cases.

Again, could you file a bug report, and include a minimal reproducer and the output you expect?

> I am wondering what the usual strategy is for dealing with such "pessimizations". Is it possible to somehow modify the IndVarSimplify pass to take those issues into account (for example, tell it that adding an offset computation + GEP is potentially more expensive than simply reusing the last variable from the loop), or should it be recovered in some later pass? If so, is there an easy way to revert IV elimination?
> Has anyone dealt with similar issues before?

Ideally, we prefer that such pessimizations do not occur in the first place, which is what r139579 was about. However, the transformation might also be an IR normalization that enables other transformations. In that case, another pass further down the pipeline would transform the normalized form into an optimized one. For instance, LoopSimplify inserts a loop preheader that SimplifyCFG would remove again. What is considered normalization depends on the case.

If you can show that a change generally improves performance (not just for your code) and has at most minor regressions, then any approach is worth considering.

Michael
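The point about unswitching can be sketched in C++ as below. This is a hand-written illustration, not output of the actual pass; the function names and the `flag` condition are hypothetical. The loop-invariant branch is hoisted out of the loop, producing two loop copies of which only one runs per call, which is why the register usage should ideally be the maximum over the copies rather than their sum.

```cpp
#include <cassert>

// Original loop: the condition on `flag` does not change inside the loop.
int sum_original(const int *a, int len, bool flag) {
    int s = 0;
    for (int i = 0; i < len; ++i) {
        if (flag)
            s += a[i];
        else
            s -= a[i];
    }
    return s;
}

// Unswitched by hand: the invariant test is done once, and each branch gets
// its own specialized copy of the loop. Only one copy executes per call.
int sum_unswitched(const int *a, int len, bool flag) {
    int s = 0;
    if (flag) {
        for (int i = 0; i < len; ++i)
            s += a[i];
    } else {
        for (int i = 0; i < len; ++i)
            s -= a[i];
    }
    return s;
}

// Small self-check: both versions compute the same result.
void check_unswitching() {
    const int a[] = {1, 2, 3};
    assert(sum_original(a, 3, true) == sum_unswitched(a, 3, true));
    assert(sum_original(a, 3, false) == sum_unswitched(a, 3, false));
}
```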
Danila Malyutin via llvm-dev
2019-Aug-09 12:32 UTC
[llvm-dev] How to best deal with undesirable Induction Variable Simplification?
> Since r139579, IndVarSimplify (the pass) should no longer normalize induction variables without a reason (a reason would be that the loop can be deleted). Could you file a bug report, attach a minimal .ll file and mention what output you would expect?

The IV is removed there by replaceCongruentIVs. It is what I'd probably expect when looking at the IR alone, but, as I've mentioned, this prevents latency masking later down the line, since certain ops now use a single common register.

> Since only one of the resulting loops is executed after unswitching, the register usage should be the maximum over those loops, which ideally is at most the register usage of the pre-unswitched loop. In your case, p could be in the same register in all unswitched loops. However, other optimizations might increase register pressure again, and register allocation is not optimal in all cases.

It looks like, for some reason, when IndVars rewrote all loop exit values (which were just pointers incremented in the loop body) from simple single-value phis to GEPs with a recomputed offset (back edge count * increment inside the loop), it expanded this offset computation in the main outermost loop (pre?)header, even when the value was used only in the exit of one of the unswitched loops. Later passes failed to sink these computations as well, for whatever reason, so in the end, instead of max(unswitched loop regs), the register usage became max(unswitched loop regs) + Const * number of loops (for the offsets, even though many were shared).

I'll see if I can come up with a minimal reproducer for some in-tree target.

--
Danila
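To make the first point concrete, here is a small C++ sketch of what "congruent" induction variables look like at the source level. It only illustrates the idea behind replaceCongruentIVs and is not the pass's implementation; the function names are hypothetical. In the first function, `i` and `j` start at the same value and step identically, so they always hold the same value; the second function is the form with the redundant IV removed, where both loads depend on one register.

```cpp
#include <cassert>

// Two IVs that are congruent: same start value, same step on every iteration.
long dot_two_ivs(const int *a, const int *b, int len) {
    long acc = 0;
    int i = 0;
    int j = 0;
    while (i < len) {
        acc += static_cast<long>(a[i]) * b[j];
        ++i;
        ++j;  // always equal to i, so j is redundant from the IR's viewpoint
    }
    return acc;
}

// After merging the congruent IVs: both accesses are indexed by the same i.
long dot_one_iv(const int *a, const int *b, int len) {
    long acc = 0;
    for (int i = 0; i < len; ++i)
        acc += static_cast<long>(a[i]) * b[i];
    return acc;
}

// Small self-check: the two forms are equivalent.
void check_congruent_ivs() {
    const int a[] = {1, 2, 3};
    const int b[] = {4, 5, 6};
    assert(dot_two_ivs(a, b, 3) == dot_one_iv(a, b, 3));
}
```

The results are identical; the difference discussed in this thread is purely in how freely a backend can schedule the address computations once they share one induction variable.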
Philip Reames via llvm-dev
2019-Aug-09 23:00 UTC
[llvm-dev] How to best deal with undesirable Induction Variable Simplification?
On 8/8/19 10:36 AM, Danila Malyutin via llvm-dev wrote:
>
> Hello,
> Recently I've come across two instances where Induction Variable
> Simplification led to noticeable performance regressions.
>
> In one case, the removal of an extra IV made it impossible to
> reschedule instructions in a tight loop to reduce stalls. There were
> enough registers to spare, so spending an extra register on an extra
> induction variable was preferable, since it reduced dependencies in
> the loop.
>

This one I'd phrase as a deficiency in the backend. Arguably it belongs in LSR, but in general our transforms that rewrite code to reduce scheduling pressure have room for improvement. I ran across a case of this with an add reduction recently as well. Removing a redundant IV is clearly the "right answer" in terms of producing simpler, easier to optimize IR.

> In the second case, there was a big nested loop made even bigger after
> unswitching. However, the inner loop body was rather simple, of the form:
>
>     loop {
>       p += n;
>       ...
>       p += n;
>       ...
>     }
>     use p.
>
> Due to unswitching there were several such loops, each with a
> different number of p += n ops, so when the IndVars pass rewrote all
> exit values, it added a lot of slightly different offsets to the main
> loop header that couldn't fit in the available registers, which led to
> unnecessary spills/reloads.
>

I have to ask a further question here. Why are the spills/fills problematic? If they happened *outside* said loops - as you'd expect from the example - at worst there is a code size impact. Is there something more going on? (i.e. are the loops super short running or something?)

> I am wondering what the usual strategy is for dealing with such
> "pessimizations". Is it possible to somehow modify the IndVarSimplify
> pass to take those issues into account (for example, tell it that
> adding an offset computation + GEP is potentially more expensive than
> simply reusing the last variable from the loop), or should it be
> recovered in some later pass? If so, is there an easy way to revert
> IV elimination?
>
> Has anyone dealt with similar issues before?
>

My answer: IndVars did the right thing in both of these cases. The IR is definitely much cleaner, easier to optimize by other transforms, etc. Unfortunately, it's not uncommon for a good transform to produce output which reveals other deficiencies in the optimizer/backend. We can and should fix those where we find them. (There's honest disagreement about the philosophy here, JFYI.)

> --
> Danila
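As a rough illustration of the kind of rewrite that trades registers for a shorter dependency chain, consider an add reduction in C++. This is an assumption about what such a case might look like, not the specific case mentioned above; the function names and the choice of two accumulators are hypothetical. Unsigned arithmetic is used so that reordering the additions cannot change the result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Single accumulator: every add depends on the previous one, forming one
// long dependency chain through `s`.
std::uint64_t sum_one_chain(const std::uint32_t *a, std::size_t len) {
    std::uint64_t s = 0;
    for (std::size_t i = 0; i < len; ++i)
        s += a[i];
    return s;
}

// Two accumulators: uses one more register, but the two chains are
// independent, so their adds can overlap in the pipeline.
std::uint64_t sum_two_chains(const std::uint32_t *a, std::size_t len) {
    std::uint64_t s0 = 0;
    std::uint64_t s1 = 0;
    std::size_t i = 0;
    for (; i + 1 < len; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < len)  // leftover element when len is odd
        s0 += a[i];
    return s0 + s1;
}

// Small self-check: both versions agree, including for an odd length.
void check_reduction() {
    const std::uint32_t a[] = {1, 2, 3, 4, 5};
    assert(sum_one_chain(a, 5) == sum_two_chains(a, 5));
}
```

Whether the extra register pays off depends on the target's latencies and available registers, which is why this kind of decision arguably belongs in the backend rather than in IndVars.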