John Brawn via llvm-dev
2018-Dec-05 17:50 UTC
[llvm-dev] Strange regalloc behaviour: one more available register causes much worse allocation
enableAdvancedRASplitCost() does the same thing as ConsiderLocalIntervalCost, but as a subtarget option instead of a command-line option, and as I’ve said it doesn’t help because it’s a non-local interval causing the eviction chain (RAGreedy::splitCanCauseEvictionChain only considers the local interval for a single block, and it’s unclear to me how to make it handle a non-local interval). John From: Nirav Davé [mailto:niravd at google.com] Sent: 05 December 2018 17:14 To: John Brawn Cc: llvm-dev; nd Subject: Re: [llvm-dev] Strange regalloc behaviour: one more available register causes much worse allocation This has cropped up before in X86 (https://bugs.llvm.org/show_bug.cgi?id=26810 / https://reviews.llvm.org/rL316295), and there's at least a partial mitigation (I recently ran into an eviction change on X86 when trying variants of a MachineScheduler change, but couldn't find a reproduction post the landed patch). I suggest you try enabling enableAdvancedRASplitCost() for ARM and seeing if that helps. -Nirav On Wed, Dec 5, 2018 at 10:46 AM John Brawn via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote: Preamble -------- While working on an IR-level optimisation completely unrelated to register allocation I happened to trigger some really strange register allocator behaviour causing a large regression in bzip2 in spec2006. I've been trying to fix that regression before getting the optimisation patch committed, because I don't want to regress spec2006, but I'm basically fumbling in the dark because I don't yet know how or why the register allocator is making the decisions it does and I thought I'd send an email to see if anyone has any advice. The problem ----------- Attached are (zipped, as llvm-dev has a 100kb message limit): * bzip2_regression.ll (reduced from bzip2 in spec2006 after being compiled with some patches that I'm working on) which demonstrates the problem. * 0001-AArch64-JumpTableDest-scratch-register-isn-t-earlycl.patch which causes the problem. * without_patch_regalloc.txt, the regalloc debug log for llc -mcpu=cortex-a57 bzip2_regression.ll without the patch applied. * with_patch_regalloc.txt, the same log but with the patch applied. Note that the patch is not correct, but it happens to be a useful way of provoking the problem. Without the patch generating assembly with llc -mcpu=cortex-a57 everything looks fine, but with the patch we get this (which comes from the block bb.17.switchdest13): .LBB0_16: mov x29, x24 mov w24, w20 mov w20, w19 mov w19, w7 mov w7, w6 mov w6, w5 mov w5, w2 mov x2, x18 mov w18, w15 orr w15, wzr, #0x1c str w15, [x8, #8] mov w0, wzr mov w15, w18 mov x18, x2 mov w2, w5 mov w5, w6 mov w6, w7 mov w7, w19 mov w19, w20 mov w20, w24 mov x24, x29 b .LBB0_3 It looks like the orr and str have barged in and said "we're using w15!" and all the rest of the registers have meekly moved out of the way and then moved back again at the and. If the orr and str had used w29 instead then none of this would have happened. What the patch does is make one of the input operands to the JumpTableDest32 pseudo-instruction be not marked as earlyclobber, or in other words it means we have one extra register free compared to without the patch. And you would expect that more free registers = better register allocation, but in this case it appears we don't. Note: this problem can happen without the patch, but the test case is much much larger and manifested itself as -fno-omit-frame-pointer giving a better allocation than -fomit-frame-pointer. This patch was actually my first attempt at fixing this (as I'd noticed that we were unnecessarily keeping an extra register live across the JumpTableDest8). What's going on --------------- What this block looks like after live range splitting has happened is: 7352B bb.17.switchdest13: ; predecessors: %bb.3 successors: %bb.30(0x80000000); %bb.30(100.00%) 7360B %390:gpr32 = COPY $wzr 7364B %434:gpr64 = COPY %432:gpr64 7368B %429:gpr32 = COPY %427:gpr32 7376B %424:gpr32 = COPY %422:gpr32 7384B %419:gpr32 = COPY %417:gpr32 7392B %414:gpr32 = COPY %412:gpr32 7400B %409:gpr32 = COPY %407:gpr32 7408B %404:gpr32 = COPY %402:gpr32 7416B %399:gpr64 = COPY %397:gpr64 7424B %394:gpr32 = COPY %392:gpr32 7528B %253:gpr32 = MOVi32imm 28 7536B STRWui %253:gpr32, %182:gpr64common, 2 :: (store 4 into %ir.106, align 8) 7752B %392:gpr32 = COPY %394:gpr32 7756B %397:gpr64 = COPY %399:gpr64 7764B %402:gpr32 = COPY %404:gpr32 7768B %407:gpr32 = COPY %409:gpr32 7776B %412:gpr32 = COPY %414:gpr32 7780B %417:gpr32 = COPY %419:gpr32 7788B %422:gpr32 = COPY %424:gpr32 7792B %427:gpr32 = COPY %429:gpr32 7800B %432:gpr64 = COPY %434:gpr64 7808B %373:gpr64sp = IMPLICIT_DEF 7816B %374:gpr64sp = IMPLICIT_DEF 8048B B %bb.30 Looking at the debug output of the register allocator, the sequence of events which kicks things off is %223 assigned to w0 %283 evicts %381 from w15 %381 requeued for second round %253 assigned to w15 %381 split for w15 in 4 bundles into %391-%395 %391, %392, %395 are not local intervals %393 is the local interval for bb.11.switchdest09 %394 is the local interval for bb.17.switchdest13 %392 assigned to w15 %391 evicts %376 from w18 %394 assigned to w18 %376 split into %396-%400 and then %396 evicts something which is split into something which evicts something etc. until we're done. Looking at what happens when this patch isn't applied the difference is: %223 cannot be assigned to w0, evicts %381 from w15 %381 requeued for second round %283 assigned to w15 %253 assigned to w15 %381 split for w15 in 1 bundle into %391 and %392 Neither is a local interval %391 evicts %380 from w2 %392 assigned to w2 So it looks like the difference is that with the patch we happen to split %381 in a way that causes the split intervals to be allocated such that we get a pair of copies in bb.17.switchdest13, and this causes a cascade effect where we repeatedly do the same thing with a whole load of other registers. Possible Solutions ------------------ So there's two ways I can think of to fix this: * Make %381 be split in the same way that it is without the patch, which I think means deciding that there's only 1 bundle for w15. Does anyone know where and how exactly these bundles are decided? * Try and change how evicted / split registers are allocated in some way. Things I've tried: * In RAGreedy::enqueue reduce the score of unspillable local intervals, and in RAGreedy::evictInterference put evicted registers into stage RS_Split immediately. This causes %381 to be split immediately instead of being requeued, and then makes %391 have a higher score than %253 causing it to be allocated before it. This works, but ends up causing an extra spill. * In RAGreedy::splitAroundRegion put global intervals into stage RS_Split immediately. This makes the chain of evictions after %396 not happen, but that gives us one extra spill and we still get one pair of copies in bb.17.switchdest13. * In RAGreedy::evictInterference put evicted registers into a new RS_Evicted stage, which is like RS_Assign but can't evict anything. This seemed to give OK results but was a mess and I didn't understand what I was doing, so I threw it away. * Turn on the ConsiderLocalIntervalCost option, as it's supposed to help with eviction chains like this. Unfortunately it doesn't work as it's a non-local interval that's causing the eviction chain. I tried making it also handle non-local intervals, but couldn't figure out how to. * Turn on TRI->reverseLocalAssignment(). This seemed to work, but I'm not sure why and reading the description of that it may not be the correct solution (it's described as being an option to reduce the time the register allocator takes, not to give better allocation). The benchmark results are also overall slightly worse. Any ideas on what the right approach to fixing this is? John _______________________________________________ LLVM Developers mailing list llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20181205/d7d88e46/attachment.html>
Quentin Colombet via llvm-dev
2018-Dec-05 18:21 UTC
[llvm-dev] Strange regalloc behaviour: one more available register causes much worse allocation
+ Marina, maybe she would have ideas on how to improve the eviction chain detection. Hi John, What you're seeing is indeed another instance of eviction chain madness that Marina improved with the splitCanCauseEvictionChain but is still a problem. The problem with what you're suggesting is that it will improve this instance, but I am pretty sure it will regress some others. Unfortunately, this is the case we whatever change we do in that space though.> Does anyone know where and how exactly these bundles are decided?I have to relearn this every time there is something wrong happening there... Needless to say this is complicated! IIRC the bundles are chosen using an Hopfield network, which in short is a kind of neural network that is guarantee to converge to a local minimum. As you could imagine, there is little tweak you can do here and maybe its time to rethink that approach. Cheers, -Quentin Le mer. 5 déc. 2018 à 09:50, John Brawn via llvm-dev <llvm-dev at lists.llvm.org> a écrit :> > enableAdvancedRASplitCost() does the same thing as ConsiderLocalIntervalCost, but as a > > subtarget option instead of a command-line option, and as I’ve said it doesn’t help because > > it’s a non-local interval causing the eviction chain (RAGreedy::splitCanCauseEvictionChain > > only considers the local interval for a single block, and it’s unclear to me how to make it > > handle a non-local interval). > > > > John > > > > From: Nirav Davé [mailto:niravd at google.com] > Sent: 05 December 2018 17:14 > To: John Brawn > Cc: llvm-dev; nd > Subject: Re: [llvm-dev] Strange regalloc behaviour: one more available register causes much worse allocation > > > > This has cropped up before in X86 (https://bugs.llvm.org/show_bug.cgi?id=26810 / https://reviews.llvm.org/rL316295), and there's at least a partial mitigation > > (I recently ran into an eviction change on X86 when trying variants of a MachineScheduler change, but couldn't find a reproduction post the landed patch). > > > > I suggest you try enabling enableAdvancedRASplitCost() for ARM and seeing if that helps. > > > > -Nirav > > > > > > > > On Wed, Dec 5, 2018 at 10:46 AM John Brawn via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Preamble > -------- > > While working on an IR-level optimisation completely unrelated to register > allocation I happened to trigger some really strange register allocator > behaviour causing a large regression in bzip2 in spec2006. I've been trying > to fix that regression before getting the optimisation patch committed, because > I don't want to regress spec2006, but I'm basically fumbling in the dark because > I don't yet know how or why the register allocator is making the decisions it > does and I thought I'd send an email to see if anyone has any advice. > > > The problem > ----------- > > Attached are (zipped, as llvm-dev has a 100kb message limit): > * bzip2_regression.ll (reduced from bzip2 in spec2006 after being compiled with > some patches that I'm working on) which demonstrates the problem. > * 0001-AArch64-JumpTableDest-scratch-register-isn-t-earlycl.patch which causes > the problem. > * without_patch_regalloc.txt, the regalloc debug log for llc -mcpu=cortex-a57 > bzip2_regression.ll without the patch applied. > * with_patch_regalloc.txt, the same log but with the patch applied. > Note that the patch is not correct, but it happens to be a useful way of > provoking the problem. > > Without the patch generating assembly with llc -mcpu=cortex-a57 everything looks > fine, but with the patch we get this (which comes from the block > bb.17.switchdest13): > > .LBB0_16: > mov x29, x24 > mov w24, w20 > mov w20, w19 > mov w19, w7 > mov w7, w6 > mov w6, w5 > mov w5, w2 > mov x2, x18 > mov w18, w15 > orr w15, wzr, #0x1c > str w15, [x8, #8] > mov w0, wzr > mov w15, w18 > mov x18, x2 > mov w2, w5 > mov w5, w6 > mov w6, w7 > mov w7, w19 > mov w19, w20 > mov w20, w24 > mov x24, x29 > b .LBB0_3 > > It looks like the orr and str have barged in and said "we're using w15!" and all > the rest of the registers have meekly moved out of the way and then moved back > again at the and. If the orr and str had used w29 instead then none of this > would have happened. > > What the patch does is make one of the input operands to the JumpTableDest32 > pseudo-instruction be not marked as earlyclobber, or in other words it means we > have one extra register free compared to without the patch. And you would > expect that more free registers = better register allocation, but in this case > it appears we don't. > > Note: this problem can happen without the patch, but the test case is much much > larger and manifested itself as -fno-omit-frame-pointer giving a better > allocation than -fomit-frame-pointer. This patch was actually my first attempt > at fixing this (as I'd noticed that we were unnecessarily keeping an extra > register live across the JumpTableDest8). > > > What's going on > --------------- > > What this block looks like after live range splitting has happened is: > > 7352B bb.17.switchdest13: > ; predecessors: %bb.3 > successors: %bb.30(0x80000000); %bb.30(100.00%) > > 7360B %390:gpr32 = COPY $wzr > 7364B %434:gpr64 = COPY %432:gpr64 > 7368B %429:gpr32 = COPY %427:gpr32 > 7376B %424:gpr32 = COPY %422:gpr32 > 7384B %419:gpr32 = COPY %417:gpr32 > 7392B %414:gpr32 = COPY %412:gpr32 > 7400B %409:gpr32 = COPY %407:gpr32 > 7408B %404:gpr32 = COPY %402:gpr32 > 7416B %399:gpr64 = COPY %397:gpr64 > 7424B %394:gpr32 = COPY %392:gpr32 > 7528B %253:gpr32 = MOVi32imm 28 > 7536B STRWui %253:gpr32, %182:gpr64common, 2 :: (store 4 into %ir.106, align 8) > 7752B %392:gpr32 = COPY %394:gpr32 > 7756B %397:gpr64 = COPY %399:gpr64 > 7764B %402:gpr32 = COPY %404:gpr32 > 7768B %407:gpr32 = COPY %409:gpr32 > 7776B %412:gpr32 = COPY %414:gpr32 > 7780B %417:gpr32 = COPY %419:gpr32 > 7788B %422:gpr32 = COPY %424:gpr32 > 7792B %427:gpr32 = COPY %429:gpr32 > 7800B %432:gpr64 = COPY %434:gpr64 > 7808B %373:gpr64sp = IMPLICIT_DEF > 7816B %374:gpr64sp = IMPLICIT_DEF > 8048B B %bb.30 > > Looking at the debug output of the register allocator, the sequence of events > which kicks things off is > %223 assigned to w0 > %283 evicts %381 from w15 > %381 requeued for second round > %253 assigned to w15 > %381 split for w15 in 4 bundles into %391-%395 > %391, %392, %395 are not local intervals > %393 is the local interval for bb.11.switchdest09 > %394 is the local interval for bb.17.switchdest13 > %392 assigned to w15 > %391 evicts %376 from w18 > %394 assigned to w18 > %376 split into %396-%400 > and then %396 evicts something which is split into something which evicts > something etc. until we're done. > > Looking at what happens when this patch isn't applied the difference is: > %223 cannot be assigned to w0, evicts %381 from w15 > %381 requeued for second round > %283 assigned to w15 > %253 assigned to w15 > %381 split for w15 in 1 bundle into %391 and %392 > Neither is a local interval > %391 evicts %380 from w2 > %392 assigned to w2 > > So it looks like the difference is that with the patch we happen to split %381 > in a way that causes the split intervals to be allocated such that we get a pair > of copies in bb.17.switchdest13, and this causes a cascade effect where we > repeatedly do the same thing with a whole load of other registers. > > > Possible Solutions > ------------------ > > So there's two ways I can think of to fix this: > * Make %381 be split in the same way that it is without the patch, which I > think means deciding that there's only 1 bundle for w15. Does anyone know > where and how exactly these bundles are decided? > * Try and change how evicted / split registers are allocated in some way. > Things I've tried: > * In RAGreedy::enqueue reduce the score of unspillable local intervals, and in > RAGreedy::evictInterference put evicted registers into stage RS_Split > immediately. This causes %381 to be split immediately instead of being > requeued, and then makes %391 have a higher score than %253 causing it to > be allocated before it. This works, but ends up causing an extra spill. > * In RAGreedy::splitAroundRegion put global intervals into stage RS_Split > immediately. This makes the chain of evictions after %396 not happen, but > that gives us one extra spill and we still get one pair of copies in > bb.17.switchdest13. > * In RAGreedy::evictInterference put evicted registers into a new RS_Evicted > stage, which is like RS_Assign but can't evict anything. This seemed to give > OK results but was a mess and I didn't understand what I was doing, so I > threw it away. > * Turn on the ConsiderLocalIntervalCost option, as it's supposed to help with > eviction chains like this. Unfortunately it doesn't work as it's a non-local > interval that's causing the eviction chain. I tried making it also handle > non-local intervals, but couldn't figure out how to. > * Turn on TRI->reverseLocalAssignment(). This seemed to work, but I'm not sure > why and reading the description of that it may not be the correct solution > (it's described as being an option to reduce the time the register allocator > takes, not to give better allocation). The benchmark results are also > overall slightly worse. > > Any ideas on what the right approach to fixing this is? > > John > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
John Brawn via llvm-dev
2018-Dec-14 18:12 UTC
[llvm-dev] Strange regalloc behaviour: one more available register causes much worse allocation
A bit of an update on this: I looked into SpillPlacement.cpp, which is where the choice of the bundles to split the live range at is done, and I think that the choice it's making is correct given the information it has. It's recognising that it's a bad idea to split around bb.17, but it's an even worse idea to not split before bb.30 as it's a very hot block and inserting a spill there would be much worse. Or in other words, if we didn't have an eviction chain then the choice would have been correct, so fixing this looks like it will require some kind of eviction chain detection. I had a go at making RAGreedy::calcGlobalSplitCost (and the functions it calls) consider all intervals, not just local intervals, when looking for eviction chains and it looks like that does work. I'll look at it some more in the new year, as I'm now on holiday, to see if it's generally a good idea. John> -----Original Message----- > From: Quentin Colombet [mailto:quentin.colombet at gmail.com] > Sent: 05 December 2018 18:22 > To: John Brawn; marina.yatsina at intel.com > Cc: niravd at google.com; LLVM Development List; nd > Subject: Re: [llvm-dev] Strange regalloc behaviour: one more available > register causes much worse allocation > > + Marina, maybe she would have ideas on how to improve the eviction > chain detection. > > Hi John, > > What you're seeing is indeed another instance of eviction chain > madness that Marina improved with the splitCanCauseEvictionChain but > is still a problem. > The problem with what you're suggesting is that it will improve this > instance, but I am pretty sure it will regress some others. > Unfortunately, this is the case we whatever change we do in that space > though. > > > Does anyone know where and how exactly these bundles are decided? > > I have to relearn this every time there is something wrong happening > there... Needless to say this is complicated! IIRC the bundles are > chosen using an Hopfield network, which in short is a kind of neural > network that is guarantee to converge to a local minimum. As you could > imagine, there is little tweak you can do here and maybe its time to > rethink that approach. > > Cheers, > -Quentin > Le mer. 5 déc. 2018 à 09:50, John Brawn via llvm-dev > <llvm-dev at lists.llvm.org> a écrit : > > > > enableAdvancedRASplitCost() does the same thing as > ConsiderLocalIntervalCost, but as a > > > > subtarget option instead of a command-line option, and as I’ve said it > doesn’t help because > > > > it’s a non-local interval causing the eviction chain > (RAGreedy::splitCanCauseEvictionChain > > > > only considers the local interval for a single block, and it’s unclear > to me how to make it > > > > handle a non-local interval). > > > > > > > > John > > > > > > > > From: Nirav Davé [mailto:niravd at google.com] > > Sent: 05 December 2018 17:14 > > To: John Brawn > > Cc: llvm-dev; nd > > Subject: Re: [llvm-dev] Strange regalloc behaviour: one more available > register causes much worse allocation > > > > > > > > This has cropped up before in X86 > (https://bugs.llvm.org/show_bug.cgi?id=26810 / > https://reviews.llvm.org/rL316295), and there's at least a partial > mitigation > > > > (I recently ran into an eviction change on X86 when trying variants of > a MachineScheduler change, but couldn't find a reproduction post the > landed patch). > > > > > > > > I suggest you try enabling enableAdvancedRASplitCost() for ARM and > seeing if that helps. > > > > > > > > -Nirav > > > > > > > > > > > > > > > > On Wed, Dec 5, 2018 at 10:46 AM John Brawn via llvm-dev <llvm- > dev at lists.llvm.org> wrote: > > > > Preamble > > -------- > > > > While working on an IR-level optimisation completely unrelated to > register > > allocation I happened to trigger some really strange register > allocator > > behaviour causing a large regression in bzip2 in spec2006. I've been > trying > > to fix that regression before getting the optimisation patch > committed, because > > I don't want to regress spec2006, but I'm basically fumbling in the > dark because > > I don't yet know how or why the register allocator is making the > decisions it > > does and I thought I'd send an email to see if anyone has any advice. > > > > > > The problem > > ----------- > > > > Attached are (zipped, as llvm-dev has a 100kb message limit): > > * bzip2_regression.ll (reduced from bzip2 in spec2006 after being > compiled with > > some patches that I'm working on) which demonstrates the problem. > > * 0001-AArch64-JumpTableDest-scratch-register-isn-t-earlycl.patch > which causes > > the problem. > > * without_patch_regalloc.txt, the regalloc debug log for llc - > mcpu=cortex-a57 > > bzip2_regression.ll without the patch applied. > > * with_patch_regalloc.txt, the same log but with the patch applied. > > Note that the patch is not correct, but it happens to be a useful way > of > > provoking the problem. > > > > Without the patch generating assembly with llc -mcpu=cortex-a57 > everything looks > > fine, but with the patch we get this (which comes from the block > > bb.17.switchdest13): > > > > .LBB0_16: > > mov x29, x24 > > mov w24, w20 > > mov w20, w19 > > mov w19, w7 > > mov w7, w6 > > mov w6, w5 > > mov w5, w2 > > mov x2, x18 > > mov w18, w15 > > orr w15, wzr, #0x1c > > str w15, [x8, #8] > > mov w0, wzr > > mov w15, w18 > > mov x18, x2 > > mov w2, w5 > > mov w5, w6 > > mov w6, w7 > > mov w7, w19 > > mov w19, w20 > > mov w20, w24 > > mov x24, x29 > > b .LBB0_3 > > > > It looks like the orr and str have barged in and said "we're using > w15!" and all > > the rest of the registers have meekly moved out of the way and then > moved back > > again at the and. If the orr and str had used w29 instead then none of > this > > would have happened. > > > > What the patch does is make one of the input operands to the > JumpTableDest32 > > pseudo-instruction be not marked as earlyclobber, or in other words it > means we > > have one extra register free compared to without the patch. And you > would > > expect that more free registers = better register allocation, but in > this case > > it appears we don't. > > > > Note: this problem can happen without the patch, but the test case is > much much > > larger and manifested itself as -fno-omit-frame-pointer giving a > better > > allocation than -fomit-frame-pointer. This patch was actually my first > attempt > > at fixing this (as I'd noticed that we were unnecessarily keeping an > extra > > register live across the JumpTableDest8). > > > > > > What's going on > > --------------- > > > > What this block looks like after live range splitting has happened is: > > > > 7352B bb.17.switchdest13: > > ; predecessors: %bb.3 > > successors: %bb.30(0x80000000); %bb.30(100.00%) > > > > 7360B %390:gpr32 = COPY $wzr > > 7364B %434:gpr64 = COPY %432:gpr64 > > 7368B %429:gpr32 = COPY %427:gpr32 > > 7376B %424:gpr32 = COPY %422:gpr32 > > 7384B %419:gpr32 = COPY %417:gpr32 > > 7392B %414:gpr32 = COPY %412:gpr32 > > 7400B %409:gpr32 = COPY %407:gpr32 > > 7408B %404:gpr32 = COPY %402:gpr32 > > 7416B %399:gpr64 = COPY %397:gpr64 > > 7424B %394:gpr32 = COPY %392:gpr32 > > 7528B %253:gpr32 = MOVi32imm 28 > > 7536B STRWui %253:gpr32, %182:gpr64common, 2 :: (store 4 into > %ir.106, align 8) > > 7752B %392:gpr32 = COPY %394:gpr32 > > 7756B %397:gpr64 = COPY %399:gpr64 > > 7764B %402:gpr32 = COPY %404:gpr32 > > 7768B %407:gpr32 = COPY %409:gpr32 > > 7776B %412:gpr32 = COPY %414:gpr32 > > 7780B %417:gpr32 = COPY %419:gpr32 > > 7788B %422:gpr32 = COPY %424:gpr32 > > 7792B %427:gpr32 = COPY %429:gpr32 > > 7800B %432:gpr64 = COPY %434:gpr64 > > 7808B %373:gpr64sp = IMPLICIT_DEF > > 7816B %374:gpr64sp = IMPLICIT_DEF > > 8048B B %bb.30 > > > > Looking at the debug output of the register allocator, the sequence of > events > > which kicks things off is > > %223 assigned to w0 > > %283 evicts %381 from w15 > > %381 requeued for second round > > %253 assigned to w15 > > %381 split for w15 in 4 bundles into %391-%395 > > %391, %392, %395 are not local intervals > > %393 is the local interval for bb.11.switchdest09 > > %394 is the local interval for bb.17.switchdest13 > > %392 assigned to w15 > > %391 evicts %376 from w18 > > %394 assigned to w18 > > %376 split into %396-%400 > > and then %396 evicts something which is split into something which > evicts > > something etc. until we're done. > > > > Looking at what happens when this patch isn't applied the difference > is: > > %223 cannot be assigned to w0, evicts %381 from w15 > > %381 requeued for second round > > %283 assigned to w15 > > %253 assigned to w15 > > %381 split for w15 in 1 bundle into %391 and %392 > > Neither is a local interval > > %391 evicts %380 from w2 > > %392 assigned to w2 > > > > So it looks like the difference is that with the patch we happen to > split %381 > > in a way that causes the split intervals to be allocated such that we > get a pair > > of copies in bb.17.switchdest13, and this causes a cascade effect > where we > > repeatedly do the same thing with a whole load of other registers. > > > > > > Possible Solutions > > ------------------ > > > > So there's two ways I can think of to fix this: > > * Make %381 be split in the same way that it is without the patch, > which I > > think means deciding that there's only 1 bundle for w15. Does > anyone know > > where and how exactly these bundles are decided? > > * Try and change how evicted / split registers are allocated in some > way. > > Things I've tried: > > * In RAGreedy::enqueue reduce the score of unspillable local > intervals, and in > > RAGreedy::evictInterference put evicted registers into stage > RS_Split > > immediately. This causes %381 to be split immediately instead of > being > > requeued, and then makes %391 have a higher score than %253 > causing it to > > be allocated before it. This works, but ends up causing an extra > spill. > > * In RAGreedy::splitAroundRegion put global intervals into stage > RS_Split > > immediately. This makes the chain of evictions after %396 not > happen, but > > that gives us one extra spill and we still get one pair of copies > in > > bb.17.switchdest13. > > * In RAGreedy::evictInterference put evicted registers into a new > RS_Evicted > > stage, which is like RS_Assign but can't evict anything. This > seemed to give > > OK results but was a mess and I didn't understand what I was > doing, so I > > threw it away. > > * Turn on the ConsiderLocalIntervalCost option, as it's supposed to > help with > > eviction chains like this. Unfortunately it doesn't work as it's a > non-local > > interval that's causing the eviction chain. I tried making it also > handle > > non-local intervals, but couldn't figure out how to. > > * Turn on TRI->reverseLocalAssignment(). This seemed to work, but > I'm not sure > > why and reading the description of that it may not be the correct > solution > > (it's described as being an option to reduce the time the register > allocator > > takes, not to give better allocation). The benchmark results are > also > > overall slightly worse. > > > > Any ideas on what the right approach to fixing this is? > > > > John > > > > _______________________________________________ > > LLVM Developers mailing list > > llvm-dev at lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > > > _______________________________________________ > > LLVM Developers mailing list > > llvm-dev at lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Yatsina, Marina via llvm-dev
2018-Dec-18 06:44 UTC
[llvm-dev] Strange regalloc behaviour: one more available register causes much worse allocation
Hi John, It indeed looks like the eviction chains I encountered. In a later message you mentioned you've experimented in extending calcGlobalSplitCost to consider non local intervals too, which seems like a good idea. I wonder if there are any cases where the following scenario doesn't cause an eviction chain of several movs and is considered as a good split decision: vreg0 evicts vreg1 from physreg0, vreg1 is split into vreg2 and vreg3, one of the vreg2/vreg3 evicts vreg0 from physreg0 again. If so, maybe there is something else we are missing when making the cost decisions. I think the best way to check this is to detect these scenarios having the ability to allow/prevent them (this seems to be what you did) and then check the global spill cost of several workloads, see if there are any workloads for which allowing these scenarios is better than preventing them. Last I checked LLVM had a statistic that counted the number of spills in loops (reportNumberOfSplillsReloads()), but this statistic didn't count the cost of those spills (considering the BB frequency), so this needs to be fixed to get a more accurate evaluation of the global spill cost. Thanks, Marina -----Original Message----- From: Quentin Colombet [mailto:quentin.colombet at gmail.com] Sent: Wednesday, December 05, 2018 20:22 To: John.Brawn at arm.com; Yatsina, Marina <marina.yatsina at intel.com> Cc: niravd at google.com; LLVM Development List <llvm-dev at lists.llvm.org>; nd at arm.com Subject: Re: [llvm-dev] Strange regalloc behaviour: one more available register causes much worse allocation + Marina, maybe she would have ideas on how to improve the eviction chain detection. Hi John, What you're seeing is indeed another instance of eviction chain madness that Marina improved with the splitCanCauseEvictionChain but is still a problem. The problem with what you're suggesting is that it will improve this instance, but I am pretty sure it will regress some others. Unfortunately, this is the case we whatever change we do in that space though.> Does anyone know where and how exactly these bundles are decided?I have to relearn this every time there is something wrong happening there... Needless to say this is complicated! IIRC the bundles are chosen using an Hopfield network, which in short is a kind of neural network that is guarantee to converge to a local minimum. As you could imagine, there is little tweak you can do here and maybe its time to rethink that approach. Cheers, -Quentin Le mer. 5 déc. 2018 à 09:50, John Brawn via llvm-dev <llvm-dev at lists.llvm.org> a écrit :> > enableAdvancedRASplitCost() does the same thing as > ConsiderLocalIntervalCost, but as a > > subtarget option instead of a command-line option, and as I’ve said it > doesn’t help because > > it’s a non-local interval causing the eviction chain > (RAGreedy::splitCanCauseEvictionChain > > only considers the local interval for a single block, and it’s unclear > to me how to make it > > handle a non-local interval). > > > > John > > > > From: Nirav Davé [mailto:niravd at google.com] > Sent: 05 December 2018 17:14 > To: John Brawn > Cc: llvm-dev; nd > Subject: Re: [llvm-dev] Strange regalloc behaviour: one more available > register causes much worse allocation > > > > This has cropped up before in X86 > (https://bugs.llvm.org/show_bug.cgi?id=26810 / > https://reviews.llvm.org/rL316295), and there's at least a partial > mitigation > > (I recently ran into an eviction change on X86 when trying variants of a MachineScheduler change, but couldn't find a reproduction post the landed patch). > > > > I suggest you try enabling enableAdvancedRASplitCost() for ARM and seeing if that helps. > > > > -Nirav > > > > > > > > On Wed, Dec 5, 2018 at 10:46 AM John Brawn via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Preamble > -------- > > While working on an IR-level optimisation completely unrelated to > register allocation I happened to trigger some really strange register > allocator behaviour causing a large regression in bzip2 in spec2006. > I've been trying to fix that regression before getting the > optimisation patch committed, because I don't want to regress > spec2006, but I'm basically fumbling in the dark because I don't yet > know how or why the register allocator is making the decisions it does and I thought I'd send an email to see if anyone has any advice. > > > The problem > ----------- > > Attached are (zipped, as llvm-dev has a 100kb message limit): > * bzip2_regression.ll (reduced from bzip2 in spec2006 after being compiled with > some patches that I'm working on) which demonstrates the problem. > * 0001-AArch64-JumpTableDest-scratch-register-isn-t-earlycl.patch which causes > the problem. > * without_patch_regalloc.txt, the regalloc debug log for llc -mcpu=cortex-a57 > bzip2_regression.ll without the patch applied. > * with_patch_regalloc.txt, the same log but with the patch applied. > Note that the patch is not correct, but it happens to be a useful way > of provoking the problem. > > Without the patch generating assembly with llc -mcpu=cortex-a57 > everything looks fine, but with the patch we get this (which comes > from the block > bb.17.switchdest13): > > .LBB0_16: > mov x29, x24 > mov w24, w20 > mov w20, w19 > mov w19, w7 > mov w7, w6 > mov w6, w5 > mov w5, w2 > mov x2, x18 > mov w18, w15 > orr w15, wzr, #0x1c > str w15, [x8, #8] > mov w0, wzr > mov w15, w18 > mov x18, x2 > mov w2, w5 > mov w5, w6 > mov w6, w7 > mov w7, w19 > mov w19, w20 > mov w20, w24 > mov x24, x29 > b .LBB0_3 > > It looks like the orr and str have barged in and said "we're using > w15!" and all the rest of the registers have meekly moved out of the > way and then moved back again at the and. If the orr and str had used > w29 instead then none of this would have happened. > > What the patch does is make one of the input operands to the > JumpTableDest32 pseudo-instruction be not marked as earlyclobber, or > in other words it means we have one extra register free compared to > without the patch. And you would expect that more free registers = > better register allocation, but in this case it appears we don't. > > Note: this problem can happen without the patch, but the test case is > much much larger and manifested itself as -fno-omit-frame-pointer > giving a better allocation than -fomit-frame-pointer. This patch was > actually my first attempt at fixing this (as I'd noticed that we were > unnecessarily keeping an extra register live across the JumpTableDest8). > > > What's going on > --------------- > > What this block looks like after live range splitting has happened is: > > 7352B bb.17.switchdest13: > ; predecessors: %bb.3 > successors: %bb.30(0x80000000); %bb.30(100.00%) > > 7360B %390:gpr32 = COPY $wzr > 7364B %434:gpr64 = COPY %432:gpr64 > 7368B %429:gpr32 = COPY %427:gpr32 > 7376B %424:gpr32 = COPY %422:gpr32 > 7384B %419:gpr32 = COPY %417:gpr32 > 7392B %414:gpr32 = COPY %412:gpr32 > 7400B %409:gpr32 = COPY %407:gpr32 > 7408B %404:gpr32 = COPY %402:gpr32 > 7416B %399:gpr64 = COPY %397:gpr64 > 7424B %394:gpr32 = COPY %392:gpr32 > 7528B %253:gpr32 = MOVi32imm 28 > 7536B STRWui %253:gpr32, %182:gpr64common, 2 :: (store 4 into %ir.106, align 8) > 7752B %392:gpr32 = COPY %394:gpr32 > 7756B %397:gpr64 = COPY %399:gpr64 > 7764B %402:gpr32 = COPY %404:gpr32 > 7768B %407:gpr32 = COPY %409:gpr32 > 7776B %412:gpr32 = COPY %414:gpr32 > 7780B %417:gpr32 = COPY %419:gpr32 > 7788B %422:gpr32 = COPY %424:gpr32 > 7792B %427:gpr32 = COPY %429:gpr32 > 7800B %432:gpr64 = COPY %434:gpr64 > 7808B %373:gpr64sp = IMPLICIT_DEF > 7816B %374:gpr64sp = IMPLICIT_DEF > 8048B B %bb.30 > > Looking at the debug output of the register allocator, the sequence of > events which kicks things off is > %223 assigned to w0 > %283 evicts %381 from w15 > %381 requeued for second round > %253 assigned to w15 > %381 split for w15 in 4 bundles into %391-%395 > %391, %392, %395 are not local intervals > %393 is the local interval for bb.11.switchdest09 > %394 is the local interval for bb.17.switchdest13 > %392 assigned to w15 > %391 evicts %376 from w18 > %394 assigned to w18 > %376 split into %396-%400 > and then %396 evicts something which is split into something which > evicts something etc. until we're done. > > Looking at what happens when this patch isn't applied the difference is: > %223 cannot be assigned to w0, evicts %381 from w15 > %381 requeued for second round > %283 assigned to w15 > %253 assigned to w15 > %381 split for w15 in 1 bundle into %391 and %392 > Neither is a local interval > %391 evicts %380 from w2 > %392 assigned to w2 > > So it looks like the difference is that with the patch we happen to > split %381 in a way that causes the split intervals to be allocated > such that we get a pair of copies in bb.17.switchdest13, and this > causes a cascade effect where we repeatedly do the same thing with a whole load of other registers. > > > Possible Solutions > ------------------ > > So there's two ways I can think of to fix this: > * Make %381 be split in the same way that it is without the patch, which I > think means deciding that there's only 1 bundle for w15. Does anyone know > where and how exactly these bundles are decided? > * Try and change how evicted / split registers are allocated in some way. > Things I've tried: > * In RAGreedy::enqueue reduce the score of unspillable local intervals, and in > RAGreedy::evictInterference put evicted registers into stage RS_Split > immediately. This causes %381 to be split immediately instead of being > requeued, and then makes %391 have a higher score than %253 causing it to > be allocated before it. This works, but ends up causing an extra spill. > * In RAGreedy::splitAroundRegion put global intervals into stage RS_Split > immediately. This makes the chain of evictions after %396 not happen, but > that gives us one extra spill and we still get one pair of copies in > bb.17.switchdest13. > * In RAGreedy::evictInterference put evicted registers into a new RS_Evicted > stage, which is like RS_Assign but can't evict anything. This seemed to give > OK results but was a mess and I didn't understand what I was doing, so I > threw it away. > * Turn on the ConsiderLocalIntervalCost option, as it's supposed to help with > eviction chains like this. Unfortunately it doesn't work as it's a non-local > interval that's causing the eviction chain. I tried making it also handle > non-local intervals, but couldn't figure out how to. > * Turn on TRI->reverseLocalAssignment(). This seemed to work, but I'm not sure > why and reading the description of that it may not be the correct solution > (it's described as being an option to reduce the time the register allocator > takes, not to give better allocation). The benchmark results are also > overall slightly worse. > > Any ideas on what the right approach to fixing this is? > > John > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev--------------------------------------------------------------------- Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.