Hi all,

we have encountered a case of redundant copies still left in the final code and we would like to, at least, mitigate it. The original motivating case comes from a context where we have large vector registers. In that context, copies are expensive and we would like to avoid them as much as possible.

This small testcase in C, similar to the original vector case, exposes the issue but using scalars.

    long a, b;
    long fn1();
    long fn2() {
      long c = a, d = c;
      for (; b;) {
        long e = fn1();
        d = d + e;
      }
      long f = d - c;
      return f;
    }

For instance in RISC-V we emit something like this but other backends like ARM or X86 show the same behaviour.

        add s0, zero, s2   # ← copy
        beqz a0, .LBB0_3
    # %bb.1:               # %for.body.preheader
        add s0, zero, s2   # ← not needed
    .LBB0_2:               # %for.body

Has anyone encountered a similar issue like this in the past?

We are looking into removing these copies with a post RA pass to address the most obvious case: if we see a copy with the same physregs in dest and source to an earlier one and the reaching definition of the dest and source registers is one and the same, then that copy should be redundant.

This might be too specific though, so perhaps there are better approaches?

Thanks!

--
Roger Ferrer Ibáñez
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200312/ecc0e034/attachment.html>
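To make the rule above concrete, here is a minimal sketch in C of the check as I read it. It is a toy model and not the actual pass or any LLVM API: the `Insn` struct, `reaching_def` and `is_redundant_copy` are hypothetical names invented for illustration. It assumes straight-line code (for example an entry block followed by a preheader whose only predecessor is that block) and that every instruction writes at most one register, so calls and other clobbers would have to be modelled as explicit definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not LLVM: each instruction writes at most one register. */
enum { NO_REG = -1, LIVE_IN = -1 };

typedef struct {
    bool is_copy; /* true for a plain "dst = src" copy */
    int dst;      /* register written, or NO_REG */
    int src;      /* register read by a copy, otherwise NO_REG */
} Insn;

/* Index of the last instruction before `at` that writes `reg`,
 * or LIVE_IN if `reg` is not written earlier in the sequence. */
static int reaching_def(const Insn *code, int at, int reg) {
    assert(at >= 0 && reg != NO_REG);
    for (int i = at - 1; i >= 0; --i)
        if (code[i].dst == reg)
            return i;
    return LIVE_IN;
}

/* The copy at index `at` is treated as redundant when:
 *   - it copies a register onto itself, or
 *   - the reaching definition of its dest is an earlier copy from the
 *     same source, and the source has the same reaching definition at
 *     both copies (so it still holds the same value). */
static bool is_redundant_copy(const Insn *code, int at) {
    const Insn *c = &code[at];
    if (!c->is_copy)
        return false;
    if (c->dst == c->src)
        return true;

    int prev = reaching_def(code, at, c->dst);
    if (prev == LIVE_IN)
        return false;

    const Insn *p = &code[prev];
    if (!p->is_copy || p->src != c->src)
        return false;

    return reaching_def(code, prev, c->src) == reaching_def(code, at, c->src);
}
```

Modelling the RISC-V snippet as three instructions (copy into s0 from s2, a branch that writes nothing, the same copy again), `is_redundant_copy` accepts the third instruction and rejects the first. If something wrote s2 between the two copies, the source's reaching definitions would differ and the second copy would be kept.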
+ Sam

Hi Roger,

FWIW: we have observed redundant copies/moves; they have been annoying us for some time now, but we haven't got round to looking at them. I'm not sure if we are looking at exactly the same problem, but I guess so.

Treating the symptoms with post-RA dead code elimination might be very effective, but it might also be worth having a look at the source of the problem (regalloc?) to see if we are not missing something obvious.

Regarding a post-RA pass: you may want to have a look at the ARM hardware-loop pass. In order to make that beneficial, we have to do quite some dead code elimination post RA, both inside loops and in preheaders, see e.g. ARMLowOverheadLoops::IterationCountDCE. This is using ReachingDefAnalysis (RDA), which has been extended by Sam and made more generic to support this, which was also going to be his eurollvm talk: http://llvm.org/devmtg/2020-04/talks.html#LightningTalk_26. End of advertisement. ;-) Basically what I want to say is that this should provide most of the things you'll need.

Cheers,
Sjoerd.

________________________________
From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Roger Ferrer Ibáñez via llvm-dev <llvm-dev at lists.llvm.org>
Sent: 12 March 2020 18:06
To: LLVM-Dev <llvm-dev at lists.llvm.org>
Subject: [llvm-dev] Redundant copies
Hi Sjoerd,

I'm already using RDA in the pass I mentioned and it works great. Thanks Sam!

Regarding the root cause, I didn't see anything obviously suboptimal in either the copy coalescing or the register allocation, at least in my previous example.

Alternatively we might want to improve what we pass on to RA, i.e. remove the redundant copy earlier. At this point, however, it doesn't (obviously) look like one (it is still using different vregs), which suggests it might require a bit more work to discover something that will ultimately lead to a redundant copy. I will investigate this option as well.

I'll take a look at the hardware-loop pass DCE code. Thanks for the pointer!

Kind regards,

Message from Sjoerd Meijer <Sjoerd.Meijer at arm.com> on Thu, 12 March 2020 at 20:50:

--
Roger Ferrer Ibáñez