Florian Hahn via llvm-dev
2021-Mar-08 16:50 UTC
[llvm-dev] Spurious peeling in simple loop unrolling
> On Mar 4, 2021, at 15:53, Thomas Preud'homme <thomasp at graphcore.ai> wrote: > > Hi, > > On 04/03/2021 13:32, Florian Hahn wrote: >> Hi, >> >>> On Mar 2, 2021, at 11:57, Thomas Preud'homme <thomasp at graphcore.ai <mailto:thomasp at graphcore.ai>> wrote: >>> >>> Hi, >>> >>> We've noticed some codegen regressions due to spurious peeling in the simple loop unrolling pass. This happens when the incoming values of the phi nodes causing the peeling are equivalent, since GVN only runs after the loop peeling. Ironically, this prevents vectorization on our target due to the loss of alignment in the loop body caused by the peeling. The spurious peeling seems to affect AArch64 as well, so it's possible other targets may be affected. >>> >> >> Thanks for sharing the example. IIUC the problem here is that there is phi node which becomes invariant after the first iteration and which causes the loop to be peeled (small example here https://godbolt.org/z/Ej4Y19 <https://godbolt.org/z/Ej4Y19>). In your case, the PHI node is actually redundant because the incoming values are actually the same, just different bit casts of the same value, so peeling does not really add any benefits. > > Not different bitcast of the same value. %8 and %13 in the postrotate log are doing the same load. %9 and %14 are same bitcast of those respective loads. They are however indeed redundant and peeling does not add value. > > > >> >>> Attached is an example test case and the corresponding codegen as of 2632ba6a358a62c5cbaddc141de81b756b68698f (with and without 35b3989a30eefa66cd6edca4c6e1ec061c05ad96 reverted). Compiling with -O3 -S -target aarch64-linux-gnu gives the attached 3 logs. Prerotate is just before loop rotation. The post IIUrotate logfile shows the IR after the rotate loop pass and a few more passes have run (the logs after the rotate loop pass only show the loop body). Loop rotate introduces two elidable phis: >>> >>> %33 = phi i8* [ %9, %22 ], [ %14, %32 ] >>> %34 = phi %class.HomemadeVector.0* [ %8, %22 ], [ %13, %32 ] >>> >>> This leads to loop peeling inside simple loop unrolling to peel the first iteration of the loop. Prior to 35b3989a30eefa66cd6edca4c6e1ec061c05ad96 the phi would remain until GVN runs and would remove them since %8 and %13 are the same values. The same would happen to %9 and %14. >>> >>> Running GVN earlier would fix this issue, but I suspect it would have other regressions. Does anyone know how to address this? >> >> Adjusting the position of GVN to address the issue at hand is probably not a good idea. I think there are at least a few small improvements we could make to improve the current situation: >> >> 1. Only peel if the PHI that becomes invariant has any concrete users; if the only users of a phi that becomes invariant are things like llvm.assume, peeling should be very unlikely to be beneficial (note that currently peeling also seems to happily peel for completely unused phis) > > In my example the %34 phi is used in a getelementptr that's used for a load. No assume involved. That phi alone would still cause the issue. > > >Oh right, there are multiple redundant phis! So that does indeed not apply in this case. It might still be beneficial in general.> >> 2. Instcombine before peeling already simplifies the IR so that both incoming values are bit casts of the same value. It probably would be trivial to also have instcombine simplify pointer phis if the incoming values striped by pointer casts are equal. (There might be some other reason why we are not doing this at the moment though). > > As mentioned above, it's not as simple as bitcast of the same pointer so this would not work here. One would have to go look at whether the loads are equivalent which is a more involved check. > > >That’s true, but I think instcombine already CSE’d the loads. So if we simplify such phis, the unnecessary peeling should not happen https://reviews.llvm.org/D98058 <https://reviews.llvm.org/D98058>>> 3. For targets very sensitive to the number of iterations, perhaps it would be worth adding a TTI hook to express that. > > In this case it's not so much the number of iterations (it gets lower which is not a problem) but rather the code bloat and resulting loss of alignment from peeling. > > >Oh right.>> 4. Perhaps peeling should also be a bit more careful when a known trip count isn’t a power-of-2 (or some similar constraint) any more, but was before peeling. > > This would be a good idea in general indeed, but that would not solve the problem of pointless extra code bloat (code bloat itself would not be a problem at O2 if it resulted in better performance of course but here the peeling should not happen). > > >Sure it should not happen, but on most platforms the unnecessary peeling in this case should have a negligible impact.> Naive question: can GVN work in an incremental way (i.e. only process what changed)? If yes it could maybe run twice, where the second time would be more lightweight. > >I don’t think so. Cheers, Florian -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210308/3a7ff925/attachment.html>
Thomas Preud'homme via llvm-dev
2021-Mar-09 13:13 UTC
[llvm-dev] Spurious peeling in simple loop unrolling
On 08/03/2021 16:50, Florian Hahn wrote: On Mar 4, 2021, at 15:53, Thomas Preud'homme <thomasp at graphcore.ai<mailto:thomasp at graphcore.ai>> wrote: Hi, On 04/03/2021 13:32, Florian Hahn wrote: 1. Only peel if the PHI that becomes invariant has any concrete users; if the only users of a phi that becomes invariant are things like llvm.assume, peeling should be very unlikely to be beneficial (note that currently peeling also seems to happily peel for completely unused phis) In my example the %34 phi is used in a getelementptr that's used for a load. No assume involved. That phi alone would still cause the issue. Oh right, there are multiple redundant phis! So that does indeed not apply in this case. It might still be beneficial in general. Certainly 2. Instcombine before peeling already simplifies the IR so that both incoming values are bit casts of the same value. It probably would be trivial to also have instcombine simplify pointer phis if the incoming values striped by pointer casts are equal. (There might be some other reason why we are not doing this at the moment though). As mentioned above, it's not as simple as bitcast of the same pointer so this would not work here. One would have to go look at whether the loads are equivalent which is a more involved check. That’s true, but I think instcombine already CSE’d the loads. So if we simplify such phis, the unnecessary peeling should not happen https://reviews.llvm.org/D98058 I tried the patch (thanks) but that did not remove any of the PHI (the 2 loads are still there and thus the bitcast don't appear to have the same source). I'll try tolook at InstCombine to see why loads are not CSE'd. Thanks for the help. Best regards, Thomas -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210309/34717495/attachment.html>