Yes, I quite liked the things I've read about the PBQP allocator. Given what the hardware folks have to go through to get 1% improvements in scalar code, spending 20% (or whatever) extra compile time (under control of a flag) seems like nothing. And falling back on "average code" is a little disingenuous. People looking for performance don't care about average code; they care about their own personal code, their important loop, and wonder why there are unnecessary copies right *there*, and wasn't this problem solved ages ago?

Preston

On Tue, Sep 11, 2018 at 11:00 AM, Quentin Colombet <quentin.colombet at gmail.com> wrote:

> On Tue, Sep 11, 2018 at 10:23, Preston Briggs
> <preston.briggs at gmail.com> wrote:
> >
> > Hi,
> >
> > Using Chaitin's approach, removing a copy via coalescing could expose
> > more opportunities for coalescing.
> > So he would iteratively rebuild the interference graph and check for
> > more opportunities.
>
> I am not sure Chaitin's approach would lead to better coalescing
> than what LLVM does.
> I believe most of the copies that we see in the final code come from
> live-range splitting (i.e., to avoid spills), not coalescing, but I am
> nitpicking :).
>
> > Chaitin was also careful to make sure that the source and destination of
> > a copy didn't interfere unnecessarily (because of the copy alone);
> > that is, his approach to interference was very precise. The idea of
> > computing interference from intersecting ranges seems less precise,
> > though I owe Matthias examples demonstrating these supposed weaknesses.
> >
> > Finally, when spills are introduced, more opportunities for coalescing
> > can arise.
> > By separating coalescing and allocation, it seems like you miss an
> > opportunity.
>
> LLVM still eliminates copies at coloring time. The available registers
> are biased toward the colors used by whatever copies are connected to
> the current live-range.
>
> Also, to be fair, LLVM provides a graph coloring regalloc with the
> PBQP allocator.
>
> That said, I agree that we trade compile time for code quality, but
> pragmatically, it is hard to justify spending, say, 20% more compile
> time to achieve maybe a 1% speed-up on average (numbers completely made
> up :)). If you're willing to pay the price, PBQP may be an option.
> ARM gave a talk showing how it performed
> (https://llvm.org/devmtg/2014-10/Slides/PBQP-update-and-in-the-wild.pdf).
>
> What I am saying is, I believe you're right, but there isn't a big
> enough incentive for a company to invest in such a project.
>
> > I admit that the extra time spent rebuilding the interference graph
> > and attempting to coalesce can add up, but I hate that we (apparently)
> > give up on quality results to maybe save some compile time.
> > Our machines are really fast these days;
> > let's spend their speed on something useful!
> >
> > Thanks for the pointer and explanation,
> > Preston
> >
> > On Tue, Sep 11, 2018 at 9:40 AM, Quentin Colombet
> > <quentin.colombet at gmail.com> wrote:
> >>
> >> Hi Preston,
> >>
> >> To summarize what Matthias said, LLVM's approach to regalloc is split into
> >> two main phases:
> >> 1. Aggressive coalescing based on the "interference graph" of the
> >> variables in SSA (we don't actually build the interference graph). At
> >> that stage, we don't care about whether or not the resulting
> >> representation is colorable with K registers.
> >> 2. Priority-based register allocation with live-range splitting.
> >> Basically, when we are about to spill, instead of doing that right
> >> away, we have heuristics to split the live-range and re-enqueue the
> >> results.
> >>
> >> The most high-level description of LLVM's regalloc that I can think of
> >> is available here:
> >> https://llvm.org/devmtg/2011-11/Olesen_RegisterAllocation.pdf
> >>
> >> Cheers,
> >> -Quentin
> >> On Mon, Sep 10,
> >> 2018, at 17:28, Matthias Braun via llvm-dev
> >> <llvm-dev at lists.llvm.org> wrote:
> >> >
> >> > On Sep 10, 2018, at 5:25 PM, Matthias Braun <mbraun at apple.com> wrote:
> >> >
> >> > On Sep 10, 2018, at 5:11 PM, Preston Briggs <preston.briggs at gmail.com> wrote:
> >> >
> >> > The phi instruction is irrelevant; just the way I think about things.
> >> > The question is whether the allocator believes that t0 and t2 interfere.
> >> >
> >> > Perhaps the coalescing example was too simple.
> >> > In the general case, we can't coalesce without a notion of interference.
> >> >
> >> > There is also logic in `LiveRange::overlaps()` that considers live
> >> > ranges as not overlapping in some cases where they are related via copy
> >> > instructions. This will also affect the interference during the main
> >> > allocation algorithm.
> >> >
> >> > My worry is that looking at interference by ranges of instruction numbers
> >> > leads to inaccuracies when a range is introduced by a copy.
> >> >
> >> > I can't think of a case where we wouldn't have the same accuracy as a
> >> > graph coloring allocator.
> >> > If you can construct an example where we don't, I'd love to hear about it.
> >> >
> >> > If you wonder about the liveness information, you can perform
> >> > experiments like this (I'm using a NOOP instruction with an implicit use
> >> > operand to produce some artificial uses).
> >> >
> >> > $ cat test.mir
> >> > name: somefunc
> >> > body: |
> >> >   bb.0:
> >> >     %0:gr32 = MOV32ri 42
> >> >     JB_1 %bb.2, undef implicit $eflags
> >> >     JMP_1 %bb.2
> >> >
> >> >   bb.1:
> >> >     %1:gr32 = MOV32ri 17
> >> >     JMP_1 %bb.3
> >> >
> >> >   bb.2:
> >> >     NOOP implicit %0
> >> >     %1 = COPY %0
> >> >     JMP_1 %bb.3
> >> >
> >> >   bb.3:
> >> >     NOOP implicit %1
> >> >
> >> > $ llc -run-pass=liveintervals -debug-only=regalloc test.mir
> >> > ********** INTERVALS **********
> >> > %0 [16r,64B:0)[112B,144r:0)  0@16r  weight:0.000000e+00
> >> > %1 [80r,112B:1)[144r,176B:0)[176B,192r:2)  0@144r 1@80r 2@176B-phi  weight:0.000000e+00
> >> > RegMasks:
> >> > ********** MACHINEINSTRS **********
> >> > # Machine code for function somefunc: NoPHIs
> >> >
> >> > 0B   bb.0:
> >> >        successors: %bb.2(0x80000000); %bb.2(100.00%)
> >> > 16B    %0:gr32 = MOV32ri 42
> >> > 32B    JB_1 %bb.2, implicit undef $eflags
> >> > 48B    JMP_1 %bb.2
> >> >
> >> > 64B  bb.1:
> >> >        successors: %bb.3(0x80000000); %bb.3(100.00%)
> >> > 80B    %1:gr32 = MOV32ri 17
> >> > 96B    JMP_1 %bb.3
> >> >
> >> > 112B bb.2:
> >> >      ; predecessors: %bb.0
> >> >        successors: %bb.3(0x80000000); %bb.3(100.00%)
> >> > 128B   NOOP implicit %0:gr32
> >> > 144B   %1:gr32 = COPY %0:gr32
> >> > 160B   JMP_1 %bb.3
> >> >
> >> > 176B bb.3:
> >> >      ; predecessors: %bb.1, %bb.2
> >> > 192B   NOOP implicit %1:gr32
> >> >
> >> > # End machine code for function somefunc.
> >> >
> >> > If you look at the "intervals" (the class name is a misnomer, since
> >> > nowadays it contains a list of ranges...) at the beginning, you see that
> >> > %0 and %1 do not overlap anywhere.
> >> >
> >> > - Matthias
> >> >
> >> > But perhaps I should focus on the links and, as you suggested,
> >> > the debugging info.
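The no-overlap conclusion can be replayed directly on the printed ranges. Below is an illustrative sketch, not LLVM's actual `LiveRange::overlaps()` code: it assumes the slot indexes delimit half-open `[start, end)` ranges, and the `overlaps` helper name is invented for the example.

```python
# Sketch (not LLVM code) of an interval-overlap interference test,
# using the slot indexes printed by -run-pass=liveintervals above.

def overlaps(a, b):
    """True if any half-open range [s, e) in a intersects one in b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)

# Ranges for %0 and %1 as printed in the dump:
r0 = [(16, 64), (112, 144)]               # %0 [16r,64B)[112B,144r)
r1 = [(80, 112), (144, 176), (176, 192)]  # %1 [80r,112B)[144r,176B)[176B,192r)

print(overlaps(r0, r1))  # → False: %0 ends at the COPY (144r) that defines %1
```

With half-open ranges, %0 ending exactly where %1 begins (at the COPY) does not count as an overlap, which matches the dump.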
> >> > Thanks,
> >> > Preston
> >> >
> >> > On Mon, Sep 10, 2018 at 5:02 PM, Matthias Braun <mbraun at apple.com> wrote:
> >> >>
> >> >> On Sep 10, 2018, at 4:53 PM, Preston Briggs <preston.briggs at gmail.com> wrote:
> >> >>
> >> >> > The underlying liveness data structure is a list of ranges where each vreg is alive
> >> >> > (ranges in terms of numbered instructions). I remember a couple of later linear scan
> >> >> > papers describing the same thing (Traub et al. being the first, if I remember correctly).
> >> >> > That should be as accurate as you can get in terms of liveness information.
> >> >>
> >> >> It depends on the details.
> >> >> For example, given
> >> >>
> >> >>   t0 = mumble
> >> >>   if (something) {
> >> >>     t2 = 3
> >> >>   }
> >> >>   else {
> >> >>     t3 = t0 + 3
> >> >>     print t0
> >> >>   }
> >> >>   t4 = phi(t2, t3)
> >> >>
> >> >> it's clear that t2 and t0 shouldn't interfere,
> >> >> but some folks might say the ranges overlap.
> >> >>
> >> >> Similarly,
> >> >>
> >> >>   t6 = mumble
> >> >>   t7 = t6
> >> >>   t8 = t6 + 5
> >> >>   t9 = t7 + 10
> >> >>   print t8, t9
> >> >>
> >> >> Chaitin points out that t6 and t7 shouldn't interfere,
> >> >> even though the live ranges overlap.
> >> >>
> >> >> - We go out of SSA form before allocating registers, so you won't see phi instructions.
> >> >> - In the second case, the copy coalescing pass should have coalesced t6 and t7.
> >> >>
> >> >> I would expect both cases to work as you expect.
> >> >>
> >> >> - Matthias
> >> >>
> >> >> Anyway, I'll look at the links.
> >> >>
> >> >> Thanks,
> >> >> Preston
> >> >>
> >> >>> We have a separate aggressive coalescing pass before allocation. The
> >> >>> register allocator will later perform live-range splitting inside the main
> >> >>> allocation loop as it sees fit.
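Chaitin's refinement for the t6/t7 case can be sketched in a few lines: when building interference in a straight-line block, a copy `d = s` does not make `d` and `s` interfere, because they hold the same value. This is an illustration of the idea from the discussion, not code from any real allocator; the `interference` helper and its triple encoding are invented for the example.

```python
# Sketch of Chaitin's copy refinement: a def interferes with every
# variable live across it, EXCEPT the source of the copy defining it.

def interference(block):
    """block: list of (dest, srcs, is_copy); returns interference edges."""
    live = set()
    edges = set()
    for dest, srcs, is_copy in reversed(block):  # backward liveness pass
        live.discard(dest)
        for v in live:
            if not (is_copy and v in srcs):      # Chaitin: skip copy source
                edges.add(frozenset({dest, v}))
        live.update(srcs)
    return edges

# Preston's example: t7 = t6 while t6 stays live for t8's use.
block = [
    ("t6", [], False),            # t6 = mumble
    ("t7", ["t6"], True),         # t7 = t6   (copy)
    ("t8", ["t6"], False),        # t8 = t6 + 5
    ("t9", ["t7"], False),        # t9 = t7 + 10
    ("p",  ["t8", "t9"], False),  # print t8, t9
]
edges = interference(block)
print(frozenset({"t6", "t7"}) in edges)  # → False: copy-related, no edge
```

Without the `is_copy` test, t6 and t7 would get an edge (their ranges overlap), which is exactly the imprecision the thread is debating.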
> >> >>>
> >> >>> I ask these questions because we (guys I work with) see loops
> >> >>> where there's a little register juggling that seems unnecessary.
> >> >>>
> >> >>> Well, your best bet is to learn to read the output of `-mllvm
> >> >>> -print-machineinstrs -mllvm -debug-only=regalloc` (which can take a
> >> >>> while)...
> >> >>>
> >> >>> Is there a paper that describes what y'all do?
> >> >>>
> >> >>> I'm only aware of a blog post:
> >> >>> http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html
> >> >>> and a dev conference talk in 2011:
> >> >>> https://llvm.org/devmtg/2011-11/
> >> >>>
> >> >>> - Matthias
> >> >>>
> >> >>> Thanks,
> >> >>> Preston
> >> >>>
> >> >>> On Mon, Sep 10, 2018 at 9:57 AM, Matthias Braun <mbraun at apple.com> wrote:
> >> >>>>
> >> >>>> I would not describe LLVM's register allocator as linear scan; it's
> >> >>>> closer to graph coloring than linear scan, IMO (though it doesn't really
> >> >>>> match either approach).
> >> >>>>
> >> >>>> RegAllocGreedy assigns the registers in an order based on the
> >> >>>> priority value computed in enqueue(); we are not just scanning from top
> >> >>>> to bottom of the program. We also perform actual interference checks; we
> >> >>>> just don't happen to build up an explicit interference graph up front.
> >> >>>>
> >> >>>> - Matthias
> >> >>>>
> >> >>>> On Sep 10, 2018, at 9:49 AM, Preston Briggs via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> >> >>>>
> >> >>>> Why have we ended up using linear-scan register allocation
> >> >>>> by default (instead of, e.g., coloring)?
> >> >>>>
> >> >>>> Thanks,
> >> >>>> Preston
> >> >>>>
> >> >>>> _______________________________________________
> >> >>>> LLVM Developers mailing list
> >> >>>> llvm-dev at lists.llvm.org
> >> >>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
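The two-phase structure described in the thread (priority-driven assignment, with splitting or spilling as a fallback) can be sketched as a toy. The priority function, names, and the spill-instead-of-split shortcut here are all invented for illustration; RegAllocGreedy's real heuristics are far richer.

```python
# Toy sketch of priority-based allocation over live ranges (NOT
# RegAllocGreedy): pop the hardest range first, try each register,
# and fall back to "spill" where the real allocator would first try
# to split the range and re-enqueue the pieces.
import heapq

def allocate(ranges, num_regs):
    """ranges: dict name -> list of half-open [s, e) intervals."""
    def overlaps(a, b):
        return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)

    # Longer ranges first — a stand-in for enqueue()'s priority value.
    size = lambda iv: sum(e - s for s, e in iv)
    queue = [(-size(iv), name) for name, iv in ranges.items()]
    heapq.heapify(queue)

    assignment = {}
    used = {r: [] for r in range(num_regs)}   # intervals taken per register
    while queue:
        _, name = heapq.heappop(queue)
        for reg in range(num_regs):
            if not overlaps(ranges[name], used[reg]):
                assignment[name] = reg
                used[reg].extend(ranges[name])
                break
        else:
            assignment[name] = "spill"
    return assignment

a = allocate({"x": [(0, 10)], "y": [(5, 15)], "z": [(12, 20)]}, 1)
print(a)  # → {'x': 0, 'y': 'spill', 'z': 0}
```

The point of the sketch is the shape, not the heuristics: allocation order comes from a priority queue rather than program order, and interference is checked against the intervals already assigned to each register, with no explicit interference graph.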
On Tue, Sep 11, 2018 at 11:23, Preston Briggs <preston.briggs at gmail.com> wrote:
>
> Yes, I quite liked the things I've read about the PBQP allocator.
>
> Given what the hardware folks have to go through to get 1% improvements in scalar code,
> spending 20% (or whatever) compile time (under control of a flag) seems like nothing.
> And falling back on "average code" is a little disingenuous.

There are also instances where PBQP actually does worse than Greedy (see Arnaud's presentation from my previous reply), so I don't think it is fair to say that graph coloring should be the default. In my opinion, there isn't a clear winner in terms of the performance of the generated code between PBQP and greedy; thus, LLVM's default is to take the one with the fastest compile time.

If you believe that should be changed, feel free to start an RFC/post a patch changing the default allocator.

> People looking for performance don't care about average code;
> they care about their own personal code, their important loop, and wonder
> why there are unnecessary copies right *there* and wasn't this problem
> solved ages ago?

If it was, I don't think we would have this conversation :). Graph coloring is just another representation that has its own advantages and limitations (like ignoring the structure of the program, and spilling being an afterthought, to name a few). At the end of the day, we are comparing heuristics, so of course we can find examples where one performs better than the other. At this point, developers can try both and pick the best. The current default seems fine, though. But again, that's just my opinion.
> On Sep 11, 2018, at 11:42 AM, Quentin Colombet <quentin.colombet at gmail.com> wrote:
>
> On Tue, Sep 11, 2018 at 11:23, Preston Briggs
> <preston.briggs at gmail.com> wrote:
>>
>> Yes, I quite liked the things I've read about the PBQP allocator.
>>
>> Given what the hardware folks have to go through to get 1% improvements in scalar code,
>> spending 20% (or whatever) compile time (under control of a flag) seems like nothing.
>> And falling back on "average code" is a little disingenuous.
>
> There are also instances where PBQP actually does worse than Greedy
> (see Arnaud's presentation from my previous reply), so I don't think
> it is fair to say that graph coloring should be the default.
> In my opinion, there isn't a clear winner in terms of the performance
> of the generated code between PBQP and greedy; thus, LLVM's
> default is to take the one with the fastest compile time.
>
> If you believe that should be changed, feel free to start an RFC/post
> a patch changing the default allocator.

I don't want to go too deep into the discussion here, but I'd like to point out that in my experience the assignment of registers is less interesting/important than the problem of how to place your spill/reload code, where to perform live-range splitting, how to accommodate your odd machine performance characteristics, ... The greedy allocator does a good job of giving you the flexibility to do what you need to.

>> People looking for performance don't care about average code;
>> they care about their own personal code, their important loop, and wonder
>> why there are unnecessary copies right *there* and wasn't this problem
>> solved ages ago?
>
> If it was, I don't think we would have this conversation :). Graph
> coloring is just another representation that has its own advantages
> and limitations (like ignoring the structure of the program, and spilling
> being an afterthought, to name a few).
> At the end of the day, we are comparing heuristics, so of course we can
> find examples where one performs better than the other. At this point,
> developers can try both and pick the best. The current default
> seems fine, though. But again, that's just my opinion.

I have the suspicion the discussion here is triggered by a mere bug or mistuned case... But we'd need to see details to make that judgement :)

- Matthias
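For readers unfamiliar with the PBQP allocator mentioned throughout the thread: it models allocation as a Partitioned Boolean Quadratic Problem, with a cost vector per virtual register (one entry per register choice, plus spill) and a cost matrix per pair of related registers (interference becomes an infinite cost for conflicting choices). The brute-force sketch below is only an illustration of that cost model; the real LLVM PBQP solver uses graph-reduction rules, not enumeration, and all names here are invented.

```python
# Minimal brute-force PBQP illustration (NOT LLVM's reduction-based
# solver): enumerate all choices and pick the cheapest total cost.
import itertools
import math

def solve_pbqp(node_costs, edge_costs):
    """node_costs: {n: [cost per choice]}; edge_costs: {(n, m): matrix}."""
    nodes = list(node_costs)
    best, best_cost = None, math.inf
    for choice in itertools.product(*(range(len(node_costs[n])) for n in nodes)):
        sel = dict(zip(nodes, choice))
        cost = sum(node_costs[n][sel[n]] for n in nodes)
        cost += sum(m[sel[n]][sel[k]] for (n, k), m in edge_costs.items())
        if cost < best_cost:
            best, best_cost = sel, cost
    return best, best_cost

INF = math.inf
# Two interfering vregs, two registers; both prefer r0 (choice 0),
# but u pays more (cost 2) to move to r1 than v does (cost 1).
nodes = {"u": [0, 2], "v": [0, 1]}
edges = {("u", "v"): [[INF, 0], [0, INF]]}  # same register -> infinite cost
sel, cost = solve_pbqp(nodes, edges)
print(sel, cost)  # → {'u': 0, 'v': 1} 1
```

The appeal of the formulation is that preferences, copy hints, register aliasing, and spill costs all fold into the same cost vectors and matrices; the trade-off, as the thread notes, is solver time.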
>>> >>> That said, I agree that we trade compile time for code quality, but >>> pragmatically, it is hard to justify spending say 20% time in compile >>> time to achieve maybe 1% speed up on average (numbers completely mad >>> up :)). If you're willing to pay the price, the PBQP may be an option. >>> ARM made a talk showing how it performed >>> (https://llvm.org/devmtg/2014-10/Slides/PBQP-update-and-in-the-wild.pdf). >>> >>> What I am saying is, I believe you're right but there isn't a big >>> enough incentive for a company to invest in such project. >>> >>>> I admit that the extra time spent rebuilding the interference graph >>>> and attempting to coalesce can add up, but I hate that we (apparently) give up on quality >>>> results to maybe save some compile time. Our machines are really fast these days; >>>> lets spend their speed on something useful! >>>> >>>> Thanks for the pointer and explanation, >>>> Preston >>>> >>>> >>>> On Tue, Sep 11, 2018 at 9:40 AM, Quentin Colombet <quentin.colombet at gmail.com> wrote: >>>>> >>>>> Hi Preston, >>>>> >>>>> To summarize what Matthias said, LLVM approach to regalloc is split in >>>>> two main phases: >>>>> 1. Aggressive coalescing based on the "interference graph" of the >>>>> variable in SSA (we don't actually build the interference graph). At >>>>> that stage, we don't care about the whether or not the resulting >>>>> representation is colorable with K register. >>>>> 2. Priority based register allocation with live-range splitting. >>>>> Basically when we are about to spill, instead of doing that right >>>>> away, we have heuristics to split the live-range and re-enqueue the >>>>> results. >>>>> >>>>> The most high-level description of LLVM's regalloc that I can think of >>>>> is available here: >>>>> https://llvm.org/devmtg/2011-11/Olesen_RegisterAllocation.pdf >>>>> >>>>> Cheers, >>>>> -Quentin >>>>> Le lun. 10 sept. 
2018 à 17:28, Matthias Braun via llvm-dev >>>>> <llvm-dev at lists.llvm.org> a écrit : >>>>>> >>>>>> >>>>>> >>>>>> On Sep 10, 2018, at 5:25 PM, Matthias Braun <mbraun at apple.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On Sep 10, 2018, at 5:11 PM, Preston Briggs <preston.briggs at gmail.com> wrote: >>>>>> >>>>>> The phi instruction is irrelevant; just the way I think about things. >>>>>> The question is if the allocator believes that t0 and t2 interfere. >>>>>> >>>>>> Perhaps the coalescing example was too simple. >>>>>> In the general case, we can't coalesce without a notion of interference. >>>>>> >>>>>> There is also logic in `LiveRange::overlaps()` that considers live ranges as not overlapping in some cases where they are related via copy instructions. This will also effect the interference during the main allocation algorithm. >>>>>> >>>>>> My worry is that looking at interference by ranges of instruction numbers >>>>>> leads to inaccuracies when a range is introduced by a copy. >>>>>> >>>>>> >>>>>> I can't think of a case where we wouldn't have the same accuracy as a graph coloring allocator. >>>>>> If you can construct an example where we don't I'd love to hear about it. >>>>>> >>>>>> If you wonder about the liveness information, you can perform experiments like this (I'm using a NOOP instruction with an implicit use operand to produce some artificial uses). 
>>>>>> >>>>>> $ cat test.mir >>>>>> name: somefunc >>>>>> body: | >>>>>> bb.0: >>>>>> %0:gr32 = MOV32ri 42 >>>>>> JB_1 %bb.2, undef implicit $eflags >>>>>> JMP_1 %bb.2 >>>>>> >>>>>> bb.1: >>>>>> %1:gr32 = MOV32ri 17 >>>>>> JMP_1 %bb.3 >>>>>> >>>>>> bb.2: >>>>>> NOOP implicit %0 >>>>>> %1 = COPY %0 >>>>>> JMP_1 %bb.3 >>>>>> >>>>>> bb.3: >>>>>> NOOP implicit %1 >>>>>> >>>>>> >>>>>> >>>>>> $ llc -run-pass=liveintervals -debug-only=regalloc test.mir >>>>>> ********** INTERVALS ********** >>>>>> %0 [16r,64B:0)[112B,144r:0) 0 at 16r weight:0.000000e+00 >>>>>> %1 [80r,112B:1)[144r,176B:0)[176B,192r:2) 0 at 144r 1 at 80r 2 at 176B-phi weight:0.000000e+00 >>>>>> RegMasks: >>>>>> ********** MACHINEINSTRS ********** >>>>>> # Machine code for function somefunc: NoPHIs >>>>>> >>>>>> 0B bb.0: >>>>>> successors: %bb.2(0x80000000); %bb.2(100.00%) >>>>>> >>>>>> 16B %0:gr32 = MOV32ri 42 >>>>>> 32B JB_1 %bb.2, implicit undef $eflags >>>>>> 48B JMP_1 %bb.2 >>>>>> >>>>>> 64B bb.1: >>>>>> successors: %bb.3(0x80000000); %bb.3(100.00%) >>>>>> >>>>>> 80B %1:gr32 = MOV32ri 17 >>>>>> 96B JMP_1 %bb.3 >>>>>> >>>>>> 112B bb.2: >>>>>> ; predecessors: %bb.0 >>>>>> successors: %bb.3(0x80000000); %bb.3(100.00%) >>>>>> >>>>>> 128B NOOP implicit %0:gr32 >>>>>> 144B %1:gr32 = COPY %0:gr32 >>>>>> 160B JMP_1 %bb.3 >>>>>> >>>>>> 176B bb.3: >>>>>> ; predecessors: %bb.1, %bb.2 >>>>>> >>>>>> 192B NOOP implicit %1:gr32 >>>>>> >>>>>> # End machine code for function somefunc. >>>>>> >>>>>> >>>>>> If you look at the "intervals" (the class is a misnomer since nowadays it contains a list of ranges...) in the beginning you see that %0 and %1 do not overlap anywhere. >>>>>> >>>>>> - Matthias >>>>>> >>>>>> >>>>>> But perhaps I should focus on the links and, as you suggested, >>>>>> the debugging info. 
>>>>>>
>>>>>> Thanks,
>>>>>> Preston
>>>>>>
>>>>>> On Mon, Sep 10, 2018 at 5:02 PM, Matthias Braun <mbraun at apple.com> wrote:
>>>>>>>
>>>>>>> On Sep 10, 2018, at 4:53 PM, Preston Briggs <preston.briggs at gmail.com> wrote:
>>>>>>>
>>>>>>>> The underlying liveness data structure is a list of ranges where each
>>>>>>>> vreg is alive (ranges in terms of numbered instructions). I remember a
>>>>>>>> couple of later linear scan papers describing the same thing (Traub et
>>>>>>>> al. being the first, if I remember correctly). That should be as
>>>>>>>> accurate as you can get in terms of liveness information.
>>>>>>>
>>>>>>> It depends on the details.
>>>>>>> For example, given
>>>>>>>
>>>>>>>   t0 = mumble
>>>>>>>   if (something) {
>>>>>>>     t2 = 3
>>>>>>>   } else {
>>>>>>>     t3 = t0 + 3
>>>>>>>     print t0
>>>>>>>   }
>>>>>>>   t4 = phi(t2, t3)
>>>>>>>
>>>>>>> it's clear that t2 and t0 shouldn't interfere,
>>>>>>> but some folks might say the ranges overlap.
>>>>>>>
>>>>>>> Similarly,
>>>>>>>
>>>>>>>   t6 = mumble
>>>>>>>   t7 = t6
>>>>>>>   t8 = t6 + 5
>>>>>>>   t9 = t7 + 10
>>>>>>>   print t8, t9
>>>>>>>
>>>>>>> Chaitin points out that t6 and t7 shouldn't interfere,
>>>>>>> even though the live ranges overlap.
>>>>>>>
>>>>>>> - We go out of SSA form before allocating registers, so you won't see phi instructions.
>>>>>>> - In the second case, the copy coalescing pass should have coalesced t6 and t7.
>>>>>>>
>>>>>>> I would expect both cases to work as you expect.
>>>>>>>
>>>>>>> - Matthias
>>>>>>>
>>>>>>> Anyway, I'll look at the links.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Preston
>>>>>>>
>>>>>>>>
>>>>>>>> We have a separate aggressive coalescing pass before allocation. The
>>>>>>>> register allocator will later perform live-range splitting inside the
>>>>>>>> main allocation loop as it sees fit.
>>>>>>>>
>>>>>>>> I ask these questions because we (the guys I work with) see loops
>>>>>>>> where there's a little register juggling that seems unnecessary.
>>>>>>>>
>>>>>>>> Well, your best bet is to learn to read the output of
>>>>>>>> `-mllvm -print-machineinstrs -mllvm -debug-only=regalloc`
>>>>>>>> (which can take a while)...
>>>>>>>>
>>>>>>>> Is there a paper that describes what y'all do?
>>>>>>>>
>>>>>>>> I'm only aware of a blog post:
>>>>>>>> http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html
>>>>>>>> and a dev conference talk in 2011:
>>>>>>>> https://llvm.org/devmtg/2011-11/
>>>>>>>>
>>>>>>>> - Matthias
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Preston
>>>>>>>>
>>>>>>>> On Mon, Sep 10, 2018 at 9:57 AM, Matthias Braun <mbraun at apple.com> wrote:
>>>>>>>>>
>>>>>>>>> I would not describe LLVM's register allocator as linear scan; it's
>>>>>>>>> closer to graph coloring than to linear scan IMO (though it doesn't
>>>>>>>>> really match either approach).
>>>>>>>>>
>>>>>>>>> RegAllocGreedy assigns the registers in an order based on the priority
>>>>>>>>> value computed in enqueue(); we are not just scanning from top to
>>>>>>>>> bottom of the program. We also perform actual interference checks; we
>>>>>>>>> just don't happen to build up an explicit interference graph up front.
>>>>>>>>>
>>>>>>>>> - Matthias
>>>>>>>>>
>>>>>>>>> On Sep 10, 2018, at 9:49 AM, Preston Briggs via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>>
>>>>>>>>> Why have we ended up using linear-scan register allocation
>>>>>>>>> by default (instead of, e.g., coloring)?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Preston
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> LLVM Developers mailing list
>>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev