Jakob Stoklund Olesen
2011-Sep-19 15:44 UTC
[LLVMdev] Greedy Register Allocation in LLVM 3.0
I just uploaded a blog post outlining the new register allocation algorithm in LLVM 3.0. http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html Please direct comments here. /jakob
Carlo Alberto Ferraris
2011-Sep-26 09:12 UTC
[LLVMdev] Greedy Register Allocation in LLVM 3.0
Just a quick question: is greedy still a local allocator (i.e. only takes into consideration the current bb) or a global one (takes into consideration the whole function)? -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20110926/d87aa2bd/attachment.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: cafxx.vcf Type: text/x-vcard Size: 230 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20110926/d87aa2bd/attachment.vcf>
The greedy allocator is global, but so was the old linear scan allocator. Cameron On Sep 26, 2011, at 2:12 AM, Carlo Alberto Ferraris <cafxx at strayorange.com> wrote:> Just a quick question: is greedy still a local allocator (i.e. only takes into consideration the current bb) or a global one (takes into consideration the whole function)? > <cafxx.vcf> > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Hi Jakob, Thanks for a very interesting description of the new register allocation algorithm in LLVM!!! Could you elaborate a bit on the following topics: 1) Do you have any plans to publish something more formal and detailed about this algorithm, e.g. a paper or something similar? It would be nice to better understand how this algorithm relates to well-known algorithms described in the literature, e.g. linear scan, graph coloring, iterated register coalescing by Chaitin-Briggs, register allocation by priority-based coloring by Chow and Hennesey and if it represents an extension of one of those algorithms or if it is a totally new algorithm. And of course it would be nice if the terminology used by this description would follow the one published in the literature, as sometimes use of a different terminology can be misleading. For example, it is rather well known that the Linear Scan register allocation of LLVM was not a real linear scan register allocator, as it allowed for backtracking and re-coloring. It was de-facto more of a special case of graph-coloring register allocator. But due to its name, many published research papers assumed that it is a real linear scan implementation and took it as a representative example for such allocators. 2) A comparison of the quality of generated allocations and allocator performance with well-known register allocation algorithms would be very interesting to better understand pros and cons of this new algorithm. Do you have any numbers at hand? If not, do you plan to produce a more thorough analysis? 3) It is stated in the blog that this allocator is much simpler than the old linear scan implementation of LLVM. Can you provide a bit more information? For example: - How many lines of code it is and how it compares in this regard to the old linear scan? - Are there any wired places like the former SimpleRegisterCoalescer, which was not that simple at all and had an extremely wired logic ;-) - You explained that it produces in most situations a better code than linear scan, which is cool! But it would be also interesting to know in which pathological cases it still looses against linear scan and why - Can you provide a rough estimate of the algorithmic complexity of the new algorithm? Is it O(n), O(n^2) or something else? Thanks, Roman ----- Ursprüngliche Message -----> Von: Jakob Stoklund Olesen <jolesen at apple.com> > An: "llvmdev at cs.uiuc.edu List" <llvmdev at cs.uiuc.edu> > Cc: > Gesendet: 17:44 Montag, 19.September 2011 > Betreff: [LLVMdev] Greedy Register Allocation in LLVM 3.0 > > I just uploaded a blog post outlining the new register allocation algorithm in > LLVM 3.0. > > http://blog.llvm.org/2011/09/greedy-register-allocation-in-llvm-30.html > > Please direct comments here. > > /jakob > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >
Jakob Stoklund Olesen
2011-Sep-26 16:49 UTC
[LLVMdev] Greedy Register Allocation in LLVM 3.0
On Sep 26, 2011, at 4:22 AM, Leo Romanoff wrote:> 1) Do you have any plans to publish something more formal and detailed about this algorithm, e.g. a paper or something similar?I wasn't planning on that, but I might put something together in the long dark California winter nights.> It would be nice to better understand how this algorithm relates to well-known algorithms described in the literature, e.g. linear scan, graph coloring, iterated register coalescing by Chaitin-Briggs, register allocation by priority-based coloring by Chow and Hennesey and if it represents an extension of one of those algorithms or if it is a totally new algorithm. And of course it would be nice if the terminology used by this description would follow the one published in the literature, as sometimes use of a different terminology can be misleading. For example, it is rather well known that the Linear Scan register allocation of LLVM was not a real linear scan register allocator, as it allowed for backtracking and re-coloring. It was de-facto more of a special case of graph-coloring register allocator. But due to its name, many > published research papers assumed that it is a real linear scan implementation and took it as a representative example for such allocators.So much for peer review ;-) Much of the terminology in LLVM's register allocation framework comes from the linear scan literature, as far as I know. Sometimes it doesn't make any sense. For example, our LiveInterval class represents a sorted list of intervals. Each interval is called a LiveRange. I am hoping to clean this stuff up gradually. Most articles on register allocation start with the premise that register allocation is isomorphic to graph coloring. This is quite wrong in the real world where we must deal with overlapping register classes and aliasing registers. The terminology is often based on this false premise, and that makes me reluctant to adopt it. Ideally, I would like to settle on a language that is as close as possible to the established conventions without being misleading. I don't think you can find any production quality compiler using a 'pure' published register allocator. The devil is in the details, and the published algorithms are too simple. Anyway, the 'basic' register allocator is quite similar to Chow and Hennessy's priority-based coloring. They separate the unconstrained live ranges first while basic just drops everything in the priority queue. Unconstrained live ranges only make sense if you assume graph coloring isomorphism. You could compare the greedy allocator to George and Appel's iterated register coalescing running in reverse. Greedy coalesces live ranges aggressively, and splits them on demand.> 2) A comparison of the quality of generated allocations and allocator performance with well-known register allocation algorithms would be very interesting to better understand pros and cons of this new algorithm. Do you have any numbers at hand? If not, do you plan to produce a more thorough analysis?Unfortunately, it would be too much work to make a fair comparison. As I said, the published algorithms are not good enough for a real compiler. Each would require a lot of work before producing decent results. You would need to work around the missing support for overlapping register classes and register aliases, and you would need to handle pre-colored registers and call-clobbered registers somehow. The greedy allocator was designed to handle these real-world problems as part of the core algorithm. It was my judgment that adapting something like George and Appel's iterated register coalescing to the real world would be just as much work as starting from scratch. It was my hope that handling the real-world problems from the start would lead to a cleaner design. GCC's new integrated register allocator was designed by adapting Chaitin-Briggs to the real world, as far as I know. That is the most fair comparison I can think of.> 3) It is stated in the blog that this allocator is much simpler than the old linear scan implementation of LLVM. Can you provide a bit more information? > For example: > - How many lines of code it is and how it compares in this regard to the old linear scan?A lot of source files are involved, and some are shared, so I don't have these numbers. The worst linear scan code is the rewriter which is 2600 lines in VirtRegRewriter.cpp, and parts of LiveIntervalAnalysis.cpp. The hard part of greedy is the live range splitting code in SplitKit.cpp at 1400 lines. The rest is of comparable size, but greedy gets you global live range splitting for the same price.> - Are there any wired places like the former SimpleRegisterCoalescer, which was not that simple at all and had an extremely wired logic ;-)Greedy is still using the same coalescer. It has been renamed RegisterCoalescer for honesty. ;-) Since greedy handles pre-colored registers as part if its algorithm, the coalescer no longer needs to handle physical register coalescing. This code can be removed once linear scan is gone. I suspect that parts of SplitKit.cpp might make people squint. As the author, I find it perfectly clear, of course.> - You explained that it produces in most situations a better code than linear scan, which is cool! But it would be also interesting to know in which pathological cases it still looses against linear scan and whyThe regressions I have seen have mostly been luck. That is, neither heuristic knows what it is doing, and one gets it just right by accident. I have a vague sense that linear scan is sometimes doing better with some high register pressure coloring problems. It is hard to quantify, though.> - Can you provide a rough estimate of the algorithmic complexity of the new algorithm? Is it O(n), O(n^2) or something else?Something like N log (M), where N is the number of live ranges, and M is the number of continuous live range segments. I wouldn't be surprised if the worst-case behavior is much worse. The complexity of global live range splitting depends on the topology of the CFG spanned by the live range, but each split is sub-linear to linear in the number of blocks spanned. Local live range splitting has a known quadratic behavior in large basic blocks that we still need to eliminate. It rarely shows up in practice. /jakob
Jakob Stoklund Olesen
2011-Sep-26 18:27 UTC
[LLVMdev] Greedy Register Allocation in LLVM 3.0
On Sep 26, 2011, at 2:12 AM, Carlo Alberto Ferraris wrote:> Just a quick question: is greedy still a local allocator (i.e. only takes into consideration the current bb) or a global one (takes into consideration the whole function)?RegAllocGreedy, RegAllocBasic, RegAllocLinearScan and RegAllocPBQP are global. RegAllocFast is local. It is only used for -O0. In the future, it may get primitive support for the kind of global live ranges that clang -O0 can produce. /jakob
On Sep 26, 2011, at 4:22 AM, Leo Romanoff wrote:> Hi Jakob, > > Thanks for a very interesting description of the new register allocation algorithm in LLVM!!! > > Could you elaborate a bit on the following topics: > > 1) Do you have any plans to publish something more formal and detailed about this algorithm, e.g. a paper or something similar? It would be nice to better understand how this algorithm relates to well-known algorithms described in the literature, e.g. linear scan, graph coloring, iterated register coalescing by Chaitin-Briggs, register allocation by priority-based coloring by Chow and Hennesey and if it represents an extension of one of those algorithms or if it is a totally new algorithm. And of course it would be nice if the terminology used by this description would follow the one published in the literature, as sometimes use of a different terminology can be misleading. For example, it is rather well known that the Linear Scan register allocation of LLVM was not a real linear scan register allocator, as it allowed for backtracking and re-coloring. It was de-facto more of a special case of graph-coloring register allocator. But due to its name, many > published research papers assumed that it is a real linear scan implementation and took it as a representative example for such allocators. > > 2) A comparison of the quality of generated allocations and allocator performance with well-known register allocation algorithms would be very interesting to better understand pros and cons of this new algorithm. Do you have any numbers at hand? If not, do you plan to produce a more thorough analysis? > > 3) It is stated in the blog that this allocator is much simpler than the old linear scan implementation of LLVM. Can you provide a bit more information? > For example: > - How many lines of code it is and how it compares in this regard to the old linear scan? > - Are there any wired places like the former SimpleRegisterCoalescer, which was not that simple at all and had an extremely wired logic ;-) > - You explained that it produces in most situations a better code than linear scan, which is cool! But it would be also interesting to know in which pathological cases it still looses against linear scan and why > - Can you provide a rough estimate of the algorithmic complexity of the new algorithm? Is it O(n), O(n^2) or something else? > > Thanks, > RomanIt may be more helpful to explain how LLVM's register allocator came into existence before debating the high level algorithm. When I began working on LLVM last October, Jakob was developing an infrastructure for global live range splitting. It was just becoming clear that LLVM's Linear Scan was inadequate for this work, and no one I talked to knew how to fix it. Chris and Evan were very receptive to replacing the register allocator, but the design space was wide open. I knew from experience that we did not want to build a traditional interference graph. Chris, Evan, and Jakob all agreed on this point. I reasoned that an efficient implementation of LiveInterval along with a new "LiveIntervalUnion" data structure and mechanism for caching interference tests would be sufficient, and offered to build a prototype to prove it. This was an important step. My goal was never to develop a theoretically superior register allocator algorithm in its own right. On the contrary, Jakob and I wanted the global splitter to drive allocation, rather than an allocation algorithm that fell back to splitting when it failed. The important stake holders at the time strongly agreed that the problem Jakob was solving, global live range splitting, was the difficult problem that deserved attention, and that the register allocation algorithm was only important in that it did not inhibit splitting or spilling or unduly impact compile time. I believe this goal led to a register allocator that is fundamentally different from anything I'm aware of at the level of software design. To summarize, Jakob and I both strongly agreed on the following design goals, primarily for engineering reasons, not theoretical reasons: - early aggressive coalescing - no predetermined allocation order i.e. no fixed order CFG traversal, splitting changes allocation order - no explicit interference graph - on-the-fly live range splitting at the finest possible granularity When I say this the new allocator is fundamentally different from the other allocators I've seen, this is mainly what I'm referring to. At each decision point made by the allocator, the splitter may rewrite the code, modify live ranges, and incrementally update every bit of state in the allocator. There is no abandoning the current solution and no iterative application of an algorithm. When I implemented the Basic/Greedy prototypes, I was able to replace the code in RegAllocLinearScan.cpp with the code you can still see in RegAllocBasic.cpp. You can see for yourself what Jakob means by the new allocator being much simpler. And as Jakob explained, the real benefit will be in removing VirtRegWriter.cpp which is made possible by his spill code optimizations. Those optimizations are in turn made practical by the new framework for incrementally updating the register allocator's state. The naming decision was literally a five minute brainstorm driven by the need for unique filenames that are easy to type. "Greedy" is no better description of our design than "Linear Scan" is of the LLVM implementation by that name or "Graph Coloring" is of most Chaitin-Briggs implementations. I believe the best name for the new design would be one that conveys that it is actually a live range splitter that happens to perform allocation. But I am not attached to the names, and always figured the final algorithm, if successful, would simply be known as the "LLVM Register Allocator". It's worth mentioning that we considered calling it a "Priority" allocator, but I objected to this because I felt it to be misleading. Register reassignment and eviction were part of my original prototype and I felt those features were more important than allocation order. I used a priority queue for convenience, but believed the choice was somewhat insignificant. In fact, the design was intended to work reasonably well even if priorities were inverted. I initially used spill weight as a place holder for whatever dynamic heuristic Jakob would find works well with the splitter. One thing Jakob discovered is that an "inverted" heuristic actually cooperates better with his splitter. The Greedy allocator now allocates registers to the largest live ranges first. To be fair to Linear Scan, it is also worth noting that the framework for live intervals was directly inspired by the LLVM implementation of the linear scan algorithm. In the end I believe the effectiveness of any allocator reflects the work that goes into modeling the details of the target architecture. This is not easy to achieve in a retargetable code generator. Consequently, that is where much of the effort continues to be directed in LLVM, rather than in attempting to rank algorithms in their most abstract interpretation. I hope this brief chronicle of the algorithm's inception dispels some of its mystery. -Andy