Hi everyone,

Quick summary:

LLVM now has a new register allocator particularly suitable for compiling (very) large, machine-generated functions.

Longer story:

I've recently been using LLVM in an application that involves JITing fairly large functions that have no control flow - they're just flat sequences of instructions, anywhere from 100 to 10000+ in size. (The control flow is all in the host program, which works out which monster function to call and when.)

The default (linearscan) register allocator wasn't doing a good job, as it doesn't (yet) have live range splitting. It would quickly use all available registers and then get stuck, using only a single register (and the stack, a lot!) for hundreds or thousands of instructions at a time, greatly slowing (+bloating) the code. Luckily, I didn't have the presence of mind to actually try using the local register allocator before I ripped its guts out, turning it into RegAllocBigBlock.cpp, which is now in CVS. (Had I done that, I would have been quite happy to find that the local allocator produced much better code, much faster, than linearscan.)

The good news is that the new "BigBlock" allocator turns out to produce even better code than the local allocator when blocks are very large. We're talking a +10~20% speed boost on average. (If your basic blocks are small, or there's not much register pressure, you'll actually get the same code out of both local and BigBlock.)

While BigBlock isn't (and never will be) as fast as the local allocator, it's not much slower, doesn't use much memory, and is certainly faster than linearscan. So if you're compiling very large, (probably) machine-generated blocks of straight-line code, give the Local and BigBlock allocators a try, especially if you're JITing things and compile time is important.

If anyone has any problems with or comments on the BigBlock allocator, please drop me an email!

Enjoy,

Duraid
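[Editor's note: for readers wondering what these "flat sequences of instructions" look like, here is a minimal sketch of a generator for such a test case. This is not Duraid's actual generator or the test cases from the thread; the function name `@bigblock` and the chain-of-adds shape are illustrative assumptions. It emits a textual .ll function consisting of one basic block with N dependent add instructions.]

```cpp
#include <sstream>
#include <string>
#include <cassert>

// Emit a flat, straight-line LLVM IR function: a single entry block
// containing `n` add instructions, each depending on the previous one,
// followed by a ret. (Hypothetical generator for illustration only.)
std::string emitFlatFunction(unsigned n) {
  std::ostringstream out;
  out << "define i32 @bigblock(i32 %x) {\n";
  out << "entry:\n";
  std::string prev = "%x";
  for (unsigned i = 0; i < n; ++i) {
    out << "  %t" << i << " = add i32 " << prev << ", " << (i + 1) << "\n";
    prev = "%t" + std::to_string(i);  // next instruction reads this result
  }
  out << "  ret i32 " << prev << "\n}\n";
  return out.str();
}
```

Calling `emitFlatFunction(10000)` produces a 10000-instruction block of the kind discussed in the thread, which can then be fed to llc or the JIT under each allocator for comparison.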
Hi Duraid,

> Hi everyone,
>
> Quick summary:
>
> LLVM now has a new register allocator particularly suitable for
> compiling (very) large, machine-generated functions.

Congrats! Very good job!

> Longer story:
>
> I've recently been using LLVM in an application that involves JITing
> fairly large functions that have no control flow - they're just flat
> sequences of instructions, anywhere from 100 to 10000+ in size. (The
> control flow is all in the host program, which works out which
> monster function to call and when.)
>
> The default (linearscan) register allocator wasn't doing a good job, as
> it doesn't (yet) have live range splitting. It would quickly use all
> available registers and then get stuck, using only a single register
> (and the stack, a lot!) for hundreds or thousands of instructions at
> a time, greatly slowing (+bloating) the code.

True. I'm working on a version of linear scan based on Wimmer's thesis. It supports live range splitting, and I'd like to compare it with yours. Do you have any good examples of those fairly large functions that are just flat sequences of instructions, anywhere from 100 to 10000+ in size? It would be nice if you could send me those test cases (as C or .ll files). I could then use them as a basis for a comparison and report the results.

> The good news is the new "BigBlock" allocator turns out to
> produce even better code than the local allocator when blocks are very
> large. We're talking a +10~20% speed boost on average. (If your basic
> blocks are small, or there's not much register pressure, you'll
> actually get the same code out of both local and BigBlock.)

Do you have numbers comparing it to the current version of LLVM's linear scan? The win of your allocator over linear scan should be even bigger, I guess.

> While BigBlock isn't (and never will be) as fast as the local
> allocator, it's not much slower, doesn't use much memory, and is
> certainly faster than linearscan.
> So if you're compiling very large,
> (probably) machine-generated blocks of straight-line code, give the
> Local and BigBlock allocators a try, especially if you're JITing
> things and compile time is important.

I looked at your code, and I see some things that could be significantly sped up, e.g.:

- InsnTimes handling. I have the feeling this map can be eliminated completely.

- Use of the VRegReadTable. The vector of read occurrences can be shortened every time you process the corresponding instruction. This makes it shorter and makes searches inside this vector faster, thus making chooseReg much faster. Probably some other optimizations can also be applied to the chooseReg function.

- PhysRegsUseOrder - you remove some elements from the middle of this vector in removePhysReg. This is not a very efficient operation on vectors, since it needs to copy the tail of the vector. I think using a list data structure could be much more efficient for this purpose.

I think these changes may significantly improve the performance of your BigBlock register allocator. I'll try to come up with some more concrete proposals or even patches over the weekend or next week.

-Roman
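[Editor's note: Roman's PhysRegsUseOrder point is a general one about container choice, worth a minimal illustration. Erasing from the middle of a std::vector must shift the entire tail down, while unlinking a node from a std::list is constant time once you hold an iterator. The function and variable names below are generic stand-ins, not the allocator's actual code.]

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

// Removing a register from the middle of a use-order sequence.
// With a vector, erase() copies every later element down one slot: O(n).
void removeFromVector(std::vector<unsigned> &order, unsigned reg) {
  std::vector<unsigned>::iterator it =
      std::find(order.begin(), order.end(), reg);
  if (it != order.end())
    order.erase(it);  // shifts the tail of the vector
}

// With a list, erase() just unlinks one node: O(1) after the find.
void removeFromList(std::list<unsigned> &order, unsigned reg) {
  std::list<unsigned>::iterator it =
      std::find(order.begin(), order.end(), reg);
  if (it != order.end())
    order.erase(it);  // no tail copy
}
```

Note that the linear find is the same cost in both containers; the win is confined to the erase itself, and a list additionally keeps iterators to the remaining elements valid across removals, which can matter if the allocator caches positions into the sequence.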
Hi Roman,

> True. I'm working on a version of linear scan based on Wimmer's
> thesis. It supports live range splitting, and I'd like to compare it
> with yours. Do you have any good examples of those fairly large
> functions that are just flat sequences of instructions, anywhere from
> 100 to 10000+ in size?

You can find a couple attached to bug #1512, but I'm happy to email you more. Just in case people out there think JITing 10000-instruction functions is crazy (I did), you should know that the Core2 processor appears to have no real problem running *very* large functions out of L2$.

> I could then use them as a basis for a
> comparison and report the results.

No problem - I'd be curious to see if your allocator can improve the code enough to justify the extra time spent.

> Do you have numbers comparing it to the current version of LLVM's
> linear scan? The win of your allocator over linear scan should be
> even bigger, I guess.

Both local and BigBlock are way, way ahead of the linearscan allocator for my application (and I imagine others, such as large unrolled linear algebra kernels, FFTs/codecs, etc). On the code I have, performance improves by around 50~100% using the local allocator. Using BigBlock gives another 10~20% on top of that, with the gap more or less proportional to the size of the block. For blocks with <200 instructions, local and BigBlock tend to produce more or less the same code.

> I looked at your code, and I see some things that could be
> significantly sped up, e.g.:
> - InsnTimes handling. I have the feeling this map can be eliminated
> completely.

Absolutely, I agree it should go. I was in a hurry, and my first attempt at eliminating it didn't work, so I just put it back. I'll get around to killing it again at some point.

> - Use of the VRegReadTable. The vector of read occurrences can be
> shortened every time you process the corresponding instruction.
> This
> makes it shorter and makes searches inside this vector faster, thus
> making chooseReg much faster. Probably some other optimizations can
> also be applied to the chooseReg function.

chooseReg is of course where most of the time is spent, but truth be told, 98% of the time my application spends in LLVM is *not* in the register allocator. Or I should say *was* not - a quick patch by Evan Cheng earlier today to some scheduler code more than doubled LLVM's speed on large functions.

However, there are still some very serious LLVM performance issues when codegenning large basic blocks. A couple of years ago, I clocked the LLVM JIT at roughly 10,000 cycles per byte of code generated. Currently, it's *well* over 200,000. (Again, this is on large functions; on small functions, the performance is presumably much better.) Of course, back then LLVM didn't *have* a scheduler, and even things like instruction selection were simpler. But somewhere, some nasty speed issues have crept in when compiling huge functions. I'll do my best to try and sift through these over the next few weeks. (Or at least provide Evan with plenty of test cases and beg him to fix things. ;)

> - PhysRegsUseOrder - you remove some elements from the middle of this
> vector in removePhysReg. This is not a very efficient operation on
> vectors, since it needs to copy the tail of the vector. I think using
> a list data structure could be much more efficient for this purpose.

Actually, PhysRegsUseOrder isn't really needed for BigBlock - it's a leftover from the local allocator. So there are even more brutal performance fixes to be made to BigBlock; however, your suggestion applies to the Local allocator for sure!

> I think these changes may significantly improve the performance of
> your BigBlock register allocator.
> I'll try to come up with some more
> concrete proposals or even patches over the weekend or next week.

Patches would be a dream - while I don't expect BigBlock to ever be more than ~25% of total codegen time, every little bit helps. :)

Thanks for your comments,

Duraid
Thanks Duraid. Nice job. I'd like to point out a couple of things.

1. The local register allocator is currently broken. :-( Actually it has been broken for a while (due to physical sub-register changes in earlier passes). I am currently working on it. You will probably need to merge the fixes back once I am done.

2. Please benchmark against some more "normally sized" programs. Perhaps SPEC? I don't expect a local register allocator to win against linearscan.

Thanks,

Evan

On Jun 22, 2007, at 2:46 AM, Duraid Madina wrote:

> Hi everyone,
>
> Quick summary:
>
> LLVM now has a new register allocator particularly suitable for
> compiling (very) large, machine-generated functions.
>
> Longer story:
>
> I've recently been using LLVM in an application that involves JITing
> fairly large functions that have no control flow - they're just flat
> sequences of instructions, anywhere from 100 to 10000+ in size. (The
> control flow is all in the host program, which works out which monster
> function to call and when.)
>
> The default (linearscan) register allocator wasn't doing a good
> job, as it doesn't (yet) have live range splitting. It would quickly
> use all available registers and then get stuck, using only a single
> register (and the stack, a lot!) for hundreds or thousands of
> instructions at a time, greatly slowing (+bloating) the code. Luckily,
> I didn't have the presence of mind to actually try using the local
> register allocator before I ripped its guts out, turning it into
> RegAllocBigBlock.cpp that's now in CVS. (Had I done that, I would have
> been quite happy to find that the local allocator produced much better
> code, much faster than linearscan.)
>
> The good news is the new "BigBlock" allocator turns out to produce
> even better code than the local allocator when blocks are very large.
> We're talking a +10~20% speed boost on average.
> (If your basic blocks are
> small, or there's not much register pressure, you'll actually get the
> same code out of both local and BigBlock.)
>
> While BigBlock isn't (and never will be) as fast as the local
> allocator, it's not much slower, doesn't use much memory, and is
> certainly faster than linearscan. So if you're compiling very large,
> (probably) machine-generated blocks of straight-line code, give the
> Local and BigBlock allocators a try, especially if you're JITing
> things and compile time is important.
>
> If anyone has any problems with or comments on the BigBlock
> allocator, please drop me an email!
>
> Enjoy,
>
> Duraid
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu  http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev