Nemanja Ivanovic via llvm-dev
2021-Jul-06 23:58 UTC
[llvm-dev] Problems with subreg-liveness and Greedy RA
Hi Quentin, I am really sorry for the delay. I am working on getting a reproducer for this and once I have something manageable, I'll open a bug. Thanks for all your help, Nemanja On Thu, Jul 1, 2021 at 1:55 PM Quentin Colombet <qcolombet at apple.com> wrote:> Hi Nemanja, > > Do you have something I could run on my side? > > I’d like to see how/when we create %296 and the dump, doesn’t have that > information: > ``` > selectOrSplit VSRpRC:%39 [536r,6168B:0) 0 at 536r L0000000000000002 > [536r,6168B:0) 0 at 536r L0000000000000040 [536r,536d:0) 0 at 536r > weight:7.122666e+05 w=7.122666e+05 > RS_Split Cascade 11 > Analyze counted 3 instrs in 2 blocks, through 0 blocks. > Cost of isolating all blocks = 2147483648.8 > $vsrp0 no positive bundles > $vsrp1 no positive bundles > $vsrp2 no positive bundles > $vsrp3 no positive bundles > $vsrp4 no positive bundles > $vsrp5 no positive bundles > $vsrp6 no positive bundles > $vsrp17 no positive bundles > $vsrp18 no positive bundles > $vsrp16 no positive bundles > $vsrp19 no positive bundles > $vsrp20 no positive bundles > $vsrp21 no positive bundles > $vsrp22 no positive bundles > $vsrp23 no positive bundles > $vsrp24 no positive bundles > $vsrp25 no positive bundles > $vsrp15 no positive bundles > $vsrp14 no positive bundles > $vsrp13 no positive bundles > $vsrp12 no positive bundles > $vsrp11 no positive bundles > $vsrp10 no positive bundles > $vsrp9 no positive bundles > $vsrp8 no positive bundles > $vsrp7 no positive bundles > $vsrp31 no positive bundles > $vsrp30 no positive bundles > $vsrp29 no positive bundles > $vsrp28 no positive bundles > $vsrp27 no positive bundles > $vsrp26 no positive bundles > enterIntvBefore 5296r: valno 0 > leaveIntvAfter 5320r: valno 0 > useIntv [5292r;5320r): [5292r;5320r):1 > Multi-mapped complement 0 at 5316r for parent 0 at 536r hoist to %bb.2 5316r > Direct complement def at 536r > Removing 1 back-copies. > Removing 5316r undef %296.sub_64:vsrprc = COPY %39.sub_64:vsrprc. <———— > THIS COPY > blit [536r,6168B:0): [536r;5292r)=0(%296)(recalc) > [5292r;5320r)=1(%297)(recalc) [5320r;6168B)=0(%296)(recalc) > rewr %bb.1 536r:0 %296:vsrprc = LXVP 0, %33:g8rc_and_g8rc_nox0 :: > (load 32 from constant-pool) > rewr %bb.2 5320B:1 %290:vsrc = contract nofpexcept XVMADDADP > %290:vsrc(tied-def 0), %297.sub_vsx0:vsrprc, %31:vsrc, implicit $rm > rewr %bb.2 5296B:1 %289:vsrc = contract nofpexcept XVMADDADP > %289:vsrc(tied-def 0), %297.sub_vsx0:vsrprc, %62:vsrc, implicit $rm > rewr %bb.2 5292B:0 undef %297.sub_64:vsrprc = COPY %296.sub_64:vsrprc > queuing new interval: %296 [536r,6168B:0) 0 at 536r L0000000000000040 > [536r,536d:0) 0 at 536r L0000000000000002 [536r,6168B:1) 0 at x 1 at 536r > weight:3.596946e+05 > queuing new interval: %297 [5292r,5320r:0) 0 at 5292r L0000000000000002 > [5292r,5320r:0) 0 at 5292r weight:1.520298e+07 > ``` > > Ideally, if you can file a bug, that would make the tracking simpler. > > Cheers, > -Quentin > > On Jun 23, 2021, at 12:18 PM, Nemanja Ivanovic <nemanja.i.ibm at gmail.com> > wrote: > > Sorry, it would appear that the dev list stripped my attachment. I have > reduced the file using bugpoint and produced the output from that. > Attaching it here. > > Nemanja > > On Wed, Jun 23, 2021 at 2:23 PM Nemanja Ivanovic <nemanja.i.ibm at gmail.com> > wrote: > >> Thank you so much for taking the time to answer Quentin. >> >> The bad copies are definitely added by live range splitting. The issue >> seems to be the LaneBitmasks for the various subregisters. Honestly, I >> don't really know what the bits of LaneBitmask produced by TblGen are meant >> to mean, but I can't make any sense of them. And those seem to lead the >> register allocator astray. >> Here are the LaneBitmasks from the register include file: >> static const LaneBitmask SubRegIndexLaneMaskTable[] = { >> LaneBitmask::getAll(), >> LaneBitmask(0x0000000000000001), // sub_32 >> LaneBitmask(0x0000000000000002), // sub_64 >> LaneBitmask(0x0000000000000004), // sub_eq >> LaneBitmask(0x0000000000000001), // sub_gp8_x0 >> LaneBitmask(0x0000000000000200), // sub_gp8_x1 >> LaneBitmask(0x0000000000000008), // sub_gt >> LaneBitmask(0x0000000000000010), // sub_lt >> LaneBitmask(0x0000000000000042), // sub_pair0 >> LaneBitmask(0x0000000000000180), // sub_pair1 >> LaneBitmask(0x0000000000000020), // sub_un >> LaneBitmask(0x0000000000000002), // sub_vsx0 >> LaneBitmask(0x0000000000000040), // sub_vsx1 >> LaneBitmask(0x0000000000000040), // sub_vsx1_then_sub_64 >> LaneBitmask(0x0000000000000080), // sub_pair1_then_sub_64 >> LaneBitmask(0x0000000000000080), // sub_pair1_then_sub_vsx0 >> LaneBitmask(0x0000000000000100), // sub_pair1_then_sub_vsx1 >> LaneBitmask(0x0000000000000100), // sub_pair1_then_sub_vsx1_then_sub_64 >> LaneBitmask(0x0000000000000200), // sub_gp8_x1_then_sub_32 >> }; >> >> For example, what does it mean that the mask for sub_64 and sub_vsx0 are >> the same? The two subregisters certainly do not represent the same lanes in >> their respective registers. The sub_vsx0 subregister is the first VSX >> register in a VSX register pair. And each of the two subregisters of a VSX >> register pair (sub_vsx0, sub_vsx1) have their own scalar subregister ( >> sub_64). >> >> I have also attached the output of RA, but it is huge :( >> It is the result of specifying options -debug-only=regalloc >> -print-before=greedy -print-after=greedy on the command line. >> >> On Tue, Jun 22, 2021 at 3:21 PM Quentin Colombet <qcolombet at apple.com> >> wrote: >> >>> >>> >>> On Jun 21, 2021, at 10:05 AM, Nemanja Ivanovic via llvm-dev < >>> llvm-dev at lists.llvm.org> wrote: >>> >>> I am having a really difficult time with subregister related issues when >>> I turn >>> on subregister liveness tracking. >>> >>> Before RA: >>> 79760B %2216:vsrc = LXVDSX %5551:g8rc_and_g8rc_nox0, %2215:g8rc :: >>> (load 8 from %ir.scevgep1857.cast, !alias.scope !92, !noalias !93) >>> 79872B %2225:vsrprc = LXVP 352, %661:g8rc_and_g8rc_nox0 >>> 84328B %5540:vsrc = contract nofpexcept XVMADDADP %5540:vsrc(tied-def >>> 0), %2225.sub_vsx0:vsrprc, %2216:vsrc, implicit $rm >>> >>> After RA (greedy): >>> 79744B %2214:vsrc = LXVDSX %5551:g8rc_and_g8rc_nox0, %6477:g8rc :: >>> (load 8 from %ir.scevgep1860.cast, !alias.scope !92, !noalias !93) >>> 79872B %7503:vsrprc = LXVP 352, %661:g8rc_and_g8rc_nox0 >>> 80248B %7527:vsrprc = COPY %7503:vsrprc >>> 80988B undef %7526.sub_64:vsrprc = COPY %7527.sub_64:vsrprc >>> 84324B undef %7501.sub_64:vsrprc = COPY %7526.sub_64:vsrprc >>> 84328B %5546:vsrc = contract nofpexcept XVMADDADP %5546:vsrc(tied-def >>> 0), %7501.sub_vsx0:vsrprc, %2214:vsrc, implicit $rm >>> >>> Subregister definitions for PPC: >>> def sub_64 : SubRegIndex<64>; >>> def sub_vsx0 : SubRegIndex<128>; >>> def sub_vsx1 : SubRegIndex<128, 128>; >>> def sub_pair0 : SubRegIndex<256>; >>> def sub_pair1 : SubRegIndex<256, 256>; >>> >>> So the instruction at 84328B uses the full register %2216 and the high >>> order >>> 128 bits of (256-bit) register %2225. However, the register allocator >>> splits >>> the live range and introduces a copy of the high order 64 bits of that >>> 256-bit >>> register, then another copy of that copy and rewrites the use in >>> instruction >>> 84328B to that copy. The copy is marked undef so the register allocator >>> assigns just some random register to the use of that copy in 84328B. >>> >>> Or maybe I am completely misinterpreting the meaning of the debug dumps >>> from the register allocator. >>> >>> This appears to be related to lane masks and dead lane detection although >>> I don't see dead lane detection marking anything unexpected as undef >>> (seems >>> to just be INSERT_SUBREG and PHI). >>> >>> >>> Are the copies added by dead lane detection or by live-range splitting? >>> >>> The undef flag on the definition of %7501 is suspicious and depending on >>> how you look at it, so is the one on %7526. Essentially, we are losing the >>> full copy in this chain of copies and I wonder what is at fault here. >>> >>> Could you share the debug output of regalloc? >>> >>> >>> If anyone has suggestions on what might be the issue and/or how to go >>> about figuring this out and fixing it, I would really appreciate it. >>> >>> Nemanja >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> >>> >>> <ra-before-after-debug.txt> > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210706/8720aada/attachment.html>
Diogo Sampaio via llvm-dev
2021-Jul-07 11:43 UTC
[llvm-dev] Problems with subreg-liveness and Greedy RA
Hi Nemanja, I've seen alike issue with our downstream arch as well. I've fixed with this patch, see if solves your problem: diff --git a/llvm/lib/CodeGen/LiveInterval.cpp b/llvm/lib/CodeGen/LiveInterval.cpp --- a/llvm/lib/CodeGen/LiveInterval.cpp +++ b/llvm/lib/CodeGen/LiveInterval.cpp @@ -984,16 +984,18 @@ for (const MachineOperand &MO : MRI.def_operands(reg)) { if (!MO.isUndef()) continue; - unsigned SubReg = MO.getSubReg(); - assert(SubReg != 0 && "Undef should only be set on subreg defs"); - LaneBitmask DefMask = TRI.getSubRegIndexLaneMask(SubReg); - LaneBitmask UndefMask = VRegMask & ~DefMask; - if ((UndefMask & LaneMask).any()) { - const MachineInstr &MI = *MO.getParent(); - bool EarlyClobber = MO.isEarlyClobber(); - SlotIndex Pos = Indexes.getInstructionIndex(MI).getRegSlot(EarlyClobber); - Undefs.push_back(Pos); + // If it is a SubReg, see if they overlap, + // or if it is the entire reg, just add it + if (unsigned SubReg = MO.getSubReg()) { + LaneBitmask DefMask = TRI.getSubRegIndexLaneMask(SubReg); + LaneBitmask UndefMask = VRegMask & ~DefMask; + if (!(UndefMask & LaneMask).any()) + return; } + const MachineInstr &MI = *MO.getParent(); + bool EarlyClobber = MO.isEarlyClobber(); + SlotIndex Pos = Indexes.getInstructionIndex(MI).getRegSlot(EarlyClobber); + Undefs.push_back(Pos); } } Diogo Sampaio Senior Compiler Engineer • Kalray Phone: dsampaio at kalrayinc.com • www.kalrayinc.com Intelligent Data Processing>From Cloud to EdgePlease consider the environment before printing this e-mail. This message contains information that may be privileged or confidential and is the property of Kalray S.A. It is intended only for the person to whom it is addressed. If you are not the intended recipient, you are not authorized to print, retain, copy, disseminate, distribute, or use this message or any part thereof. If you receive this message in error, please notify the sender immediately and delete all copies of this message. -----Original Message----- From: Nemanja <llvm-dev at lists.llvm.org> To: Quentin <qcolombet at apple.com> Cc: llvm-dev <llvm-dev at lists.llvm.org> Date: Wednesday, 7 July 2021 1:58 AM CEST Subject: Re: [llvm-dev] Problems with subreg-liveness and Greedy RA Hi Quentin, I am really sorry for the delay. I am working on getting a reproducer for this and once I have something manageable, I'll open a bug. Thanks for all your help, Nemanja On Thu, Jul 1, 2021 at 1:55 PM Quentin Colombet <qcolombet at apple.com> wrote:> Hi Nemanja, > > Do you have something I could run on my side? > > I’d like to see how/when we create %296 and the dump, doesn’t have that > information: > ``` > selectOrSplit VSRpRC:%39 [536r,6168B:0) 0 at 536r L0000000000000002 > [536r,6168B:0) 0 at 536r L0000000000000040 [536r,536d:0) 0 at 536r > weight:7.122666e+05 w=7.122666e+05 > RS_Split Cascade 11 > Analyze counted 3 instrs in 2 blocks, through 0 blocks. > Cost of isolating all blocks = 2147483648.8 > $vsrp0 no positive bundles > $vsrp1 no positive bundles > $vsrp2 no positive bundles > $vsrp3 no positive bundles > $vsrp4 no positive bundles > $vsrp5 no positive bundles > $vsrp6 no positive bundles > $vsrp17 no positive bundles > $vsrp18 no positive bundles > $vsrp16 no positive bundles > $vsrp19 no positive bundles > $vsrp20 no positive bundles > $vsrp21 no positive bundles > $vsrp22 no positive bundles > $vsrp23 no positive bundles > $vsrp24 no positive bundles > $vsrp25 no positive bundles > $vsrp15 no positive bundles > $vsrp14 no positive bundles > $vsrp13 no positive bundles > $vsrp12 no positive bundles > $vsrp11 no positive bundles > $vsrp10 no positive bundles > $vsrp9 no positive bundles > $vsrp8 no positive bundles > $vsrp7 no positive bundles > $vsrp31 no positive bundles > $vsrp30 no positive bundles > $vsrp29 no positive bundles > $vsrp28 no positive bundles > $vsrp27 no positive bundles > $vsrp26 no positive bundles > enterIntvBefore 5296r: valno 0 > leaveIntvAfter 5320r: valno 0 > useIntv [5292r;5320r): [5292r;5320r):1 > Multi-mapped complement 0 at 5316r for parent 0 at 536r hoist to %bb.2 5316r > Direct complement def at 536r > Removing 1 back-copies. > Removing 5316r undef %296.sub_64:vsrprc = COPY %39.sub_64:vsrprc. <———— > THIS COPY > blit [536r,6168B:0): [536r;5292r)=0(%296)(recalc) > [5292r;5320r)=1(%297)(recalc) [5320r;6168B)=0(%296)(recalc) > rewr %bb.1 536r:0 %296:vsrprc = LXVP 0, %33:g8rc_and_g8rc_nox0 :: > (load 32 from constant-pool) > rewr %bb.2 5320B:1 %290:vsrc = contract nofpexcept XVMADDADP > %290:vsrc(tied-def 0), %297.sub_vsx0:vsrprc, %31:vsrc, implicit $rm > rewr %bb.2 5296B:1 %289:vsrc = contract nofpexcept XVMADDADP > %289:vsrc(tied-def 0), %297.sub_vsx0:vsrprc, %62:vsrc, implicit $rm > rewr %bb.2 5292B:0 undef %297.sub_64:vsrprc = COPY %296.sub_64:vsrprc > queuing new interval: %296 [536r,6168B:0) 0 at 536r L0000000000000040 > [536r,536d:0) 0 at 536r L0000000000000002 [536r,6168B:1) 0 at x 1 at 536r > weight:3.596946e+05 > queuing new interval: %297 [5292r,5320r:0) 0 at 5292r L0000000000000002 > [5292r,5320r:0) 0 at 5292r weight:1.520298e+07 > ``` > > Ideally, if you can file a bug, that would make the tracking simpler. > > Cheers, > -Quentin > > On Jun 23, 2021, at 12:18 PM, Nemanja Ivanovic <nemanja.i.ibm at gmail.com> > wrote: > > Sorry, it would appear that the dev list stripped my attachment. I have > reduced the file using bugpoint and produced the output from that. > Attaching it here. > > Nemanja > > On Wed, Jun 23, 2021 at 2:23 PM Nemanja Ivanovic <nemanja.i.ibm at gmail.com> > wrote: > >> Thank you so much for taking the time to answer Quentin. >> >> The bad copies are definitely added by live range splitting. The issue >> seems to be the LaneBitmasks for the various subregisters. Honestly, I >> don't really know what the bits of LaneBitmask produced by TblGen are meant >> to mean, but I can't make any sense of them. And those seem to lead the >> register allocator astray. >> Here are the LaneBitmasks from the register include file: >> static const LaneBitmask SubRegIndexLaneMaskTable[] = { >> LaneBitmask::getAll(), >> LaneBitmask(0x0000000000000001), // sub_32 >> LaneBitmask(0x0000000000000002), // sub_64 >> LaneBitmask(0x0000000000000004), // sub_eq >> LaneBitmask(0x0000000000000001), // sub_gp8_x0 >> LaneBitmask(0x0000000000000200), // sub_gp8_x1 >> LaneBitmask(0x0000000000000008), // sub_gt >> LaneBitmask(0x0000000000000010), // sub_lt >> LaneBitmask(0x0000000000000042), // sub_pair0 >> LaneBitmask(0x0000000000000180), // sub_pair1 >> LaneBitmask(0x0000000000000020), // sub_un >> LaneBitmask(0x0000000000000002), // sub_vsx0 >> LaneBitmask(0x0000000000000040), // sub_vsx1 >> LaneBitmask(0x0000000000000040), // sub_vsx1_then_sub_64 >> LaneBitmask(0x0000000000000080), // sub_pair1_then_sub_64 >> LaneBitmask(0x0000000000000080), // sub_pair1_then_sub_vsx0 >> LaneBitmask(0x0000000000000100), // sub_pair1_then_sub_vsx1 >> LaneBitmask(0x0000000000000100), // sub_pair1_then_sub_vsx1_then_sub_64 >> LaneBitmask(0x0000000000000200), // sub_gp8_x1_then_sub_32 >> }; >> >> For example, what does it mean that the mask for sub_64 and sub_vsx0 are >> the same? The two subregisters certainly do not represent the same lanes in >> their respective registers. The sub_vsx0 subregister is the first VSX >> register in a VSX register pair. And each of the two subregisters of a VSX >> register pair (sub_vsx0, sub_vsx1) have their own scalar subregister ( >> sub_64). >> >> I have also attached the output of RA, but it is huge :( >> It is the result of specifying options -debug-only=regalloc >> -print-before=greedy -print-after=greedy on the command line. >> >> On Tue, Jun 22, 2021 at 3:21 PM Quentin Colombet <qcolombet at apple.com> >> wrote: >> >>> >>> >>> On Jun 21, 2021, at 10:05 AM, Nemanja Ivanovic via llvm-dev < >>> llvm-dev at lists.llvm.org> wrote: >>> >>> I am having a really difficult time with subregister related issues when >>> I turn >>> on subregister liveness tracking. >>> >>> Before RA: >>> 79760B %2216:vsrc = LXVDSX %5551:g8rc_and_g8rc_nox0, %2215:g8rc :: >>> (load 8 from %ir.scevgep1857.cast, !alias.scope !92, !noalias !93) >>> 79872B %2225:vsrprc = LXVP 352, %661:g8rc_and_g8rc_nox0 >>> 84328B %5540:vsrc = contract nofpexcept XVMADDADP %5540:vsrc(tied-def >>> 0), %2225.sub_vsx0:vsrprc, %2216:vsrc, implicit $rm >>> >>> After RA (greedy): >>> 79744B %2214:vsrc = LXVDSX %5551:g8rc_and_g8rc_nox0, %6477:g8rc :: >>> (load 8 from %ir.scevgep1860.cast, !alias.scope !92, !noalias !93) >>> 79872B %7503:vsrprc = LXVP 352, %661:g8rc_and_g8rc_nox0 >>> 80248B %7527:vsrprc = COPY %7503:vsrprc >>> 80988B undef %7526.sub_64:vsrprc = COPY %7527.sub_64:vsrprc >>> 84324B undef %7501.sub_64:vsrprc = COPY %7526.sub_64:vsrprc >>> 84328B %5546:vsrc = contract nofpexcept XVMADDADP %5546:vsrc(tied-def >>> 0), %7501.sub_vsx0:vsrprc, %2214:vsrc, implicit $rm >>> >>> Subregister definitions for PPC: >>> def sub_64 : SubRegIndex<64>; >>> def sub_vsx0 : SubRegIndex<128>; >>> def sub_vsx1 : SubRegIndex<128, 128>; >>> def sub_pair0 : SubRegIndex<256>; >>> def sub_pair1 : SubRegIndex<256, 256>; >>> >>> So the instruction at 84328B uses the full register %2216 and the high >>> order >>> 128 bits of (256-bit) register %2225. However, the register allocator >>> splits >>> the live range and introduces a copy of the high order 64 bits of that >>> 256-bit >>> register, then another copy of that copy and rewrites the use in >>> instruction >>> 84328B to that copy. The copy is marked undef so the register allocator >>> assigns just some random register to the use of that copy in 84328B. >>> >>> Or maybe I am completely misinterpreting the meaning of the debug dumps >>> from the register allocator. >>> >>> This appears to be related to lane masks and dead lane detection although >>> I don't see dead lane detection marking anything unexpected as undef >>> (seems >>> to just be INSERT_SUBREG and PHI). >>> >>> >>> Are the copies added by dead lane detection or by live-range splitting? >>> >>> The undef flag on the definition of %7501 is suspicious and depending on >>> how you look at it, so is the one on %7526. Essentially, we are losing the >>> full copy in this chain of copies and I wonder what is at fault here. >>> >>> Could you share the debug output of regalloc? >>> >>> >>> If anyone has suggestions on what might be the issue and/or how to go >>> about figuring this out and fixing it, I would really appreciate it. >>> >>> Nemanja >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>> >>> >>> <ra-before-after-debug.txt> > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210707/e5ec5ab6/attachment.html> -------------- next part -------------- _______________________________________________ LLVM Developers mailing list llvm-dev at lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev