On May 3, 2011, at 9:19 AM, David A. Greene wrote:

> Jakob Stoklund Olesen <stoklund at 2pi.dk> writes:
>
>> +10.0% SingleSource/Benchmarks/CoyoteBench/huffbench
>> +12.0% SingleSource/Benchmarks/McGill/chomp
>> +18.0% SingleSource/Benchmarks/BenchmarkGame/n-body
>> +45.5% SingleSource/Benchmarks/BenchmarkGame/puzzle
>> +10.0% SingleSource/Benchmarks/Shootout/heapsort
>> +10.5% MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des
>> +10.9% SingleSource/Benchmarks/Shootout-C++/heapsort
>> +11.7% MultiSource/Benchmarks/Ptrdist/bc/bc
>> +12.0% MultiSource/Benchmarks/McCat/17-bintr/bintr
>> +55.2% SingleSource/Benchmarks/Shootout/methcall
>
> Yikes! Do we know why these codes got so much worse? Even 5% is a big
> deal on x86.

On x86-64, n-body and puzzle have the exact same instructions as with linear scan. The only difference is the choice of registers. This causes some loops to be a few bytes longer or shorter which can easily change performance by that much if that small loop is all the benchmark does.

The greedy allocator is trying to pick registers so inner loops are as small as possible, but that is not always the right thing to do. Unfortunately, we don't model the effects of code alignment, so there is a lot of luck involved.

I am working my way through the regressions, looking for things the allocator did wrong. Any help is appreciated, please file bugs if you find examples of stupid register allocation.

/jakob
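
[Editorial sketch, not part of the original message.] To illustrate how register choice alone can change loop size: with legacy (non-VEX) SSE encodings on x86-64, an instruction that names xmm8 through xmm15 needs a REX prefix, which adds one byte. The toy model below counts only that byte. The instruction lists and the helper names `needs_rex` and `extra_bytes` are made up for illustration; it ignores memory operands, VEX encodings, and alignment padding.

```python
# Toy model, not an encoder: it counts only the REX prefix byte that a
# legacy (non-VEX) SSE instruction needs when it names xmm8..xmm15.

def needs_rex(reg: str) -> bool:
    """Return True if reg is one of xmm8..xmm15."""
    if not reg.startswith("xmm"):
        raise ValueError(f"unexpected register name: {reg}")
    return int(reg[3:]) >= 8

def extra_bytes(loop) -> int:
    """Count instructions that need a REX prefix (at most one per instruction)."""
    return sum(1 for _, regs in loop if any(needs_rex(r) for r in regs))

# Same instructions, different register assignment (hypothetical loop bodies).
loop_low = [
    ("mulsd", ("xmm0", "xmm1")),
    ("addsd", ("xmm1", "xmm2")),
    ("subsd", ("xmm2", "xmm0")),
]
loop_high = [
    ("mulsd", ("xmm8", "xmm1")),
    ("addsd", ("xmm1", "xmm2")),
    ("subsd", ("xmm2", "xmm8")),
]

print(extra_bytes(loop_low), extra_bytes(loop_high))
```

In this made-up example the second loop is two bytes longer even though the instruction sequence is identical, which is the kind of difference described above.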
Jakob Stoklund Olesen <stoklund at 2pi.dk> writes:

>> Yikes! Do we know why these codes got so much worse? Even 5% is a big
>> deal on x86.
>
> On x86-64, n-body and puzzle have the exact same instructions as with
> linear scan. The only difference is the choice of registers. This
> causes some loops to be a few bytes longer or shorter which can easily
> change performance by that much if that small loop is all the
> benchmark does.

Ok, I can believe that.

> The greedy allocator is trying to pick registers so inner loops are as
> small as possible, but that is not always the right thing to do.

How does it balance that against spill cost?

> Unfortunately, we don't model the effects of code alignment, so there
> is a lot of luck involved.

As with any allocator. :)

> I am working my way through the regressions, looking for things the
> allocator did wrong. Any help is appreciated, please file bugs if you
> find examples of stupid register allocation.

Certainly. I would ask that we keep linearscan around, if possible, as long as there are significant regressions like this. Our customers tend to really, really care about performance.

-Dave
On May 3, 2011, at 12:03 PM, David A. Greene wrote:

>> The greedy allocator is trying to pick registers so inner loops are as
>> small as possible, but that is not always the right thing to do.
>
> How does it balance that against spill cost?

I added the CostPerUse field to the register descriptions. The allocator will try to minimize the spill weight assigned to registers with a CostPerUse. It does it by swapping physical register assignments, it won't do it if it requires extra spilling.

This is actually the cause of the n-body regression. The benchmark has nested loops:

    %vreg1 = const pool load
  header1:
    ; large blocks with lots of floating point ops
  header2:
    ; small loop using %vreg1
    jnz header2
    ...
    jnz header1

The def of %vreg1 has been hoisted by LICM so it is live across a block with lots of floating point code. The allocator uses the low xmm registers for the large block, and %xmm8 is left for %vreg1 which has a low spill weight.

This significantly improves code size, but the small loop suffers. A low xmm register could be used for %vreg1, but would need to be rematerialized. The allocator won't go that far just to use cheaper registers.

In this case it might have helped to split the live range and rematerialize, but usually that won't be the case.

/jakob
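
[Editorial sketch, not part of the original message.] The CostPerUse trade-off described above can be illustrated with a toy model. This is not LLVM code and not how the greedy allocator is implemented (the message says it swaps physical register assignments). It only shows the objective: keep as little spill weight as possible on registers that have a CostPerUse, without spilling. The weights, the register set, and the function names `assign` and `weighted_cost` are all hypothetical, and the model assumes every virtual register interferes with every other.

```python
# Hypothetical toy model: all names and numbers are illustrative only.
# Registers with a nonzero cost stand in for ones with a CostPerUse.
COST_PER_USE = {"xmm0": 0, "xmm1": 0, "xmm8": 1}

def assign(spill_weights, cost_per_use):
    """Give the heaviest virtual registers the cheapest physical registers.

    Assumes all virtual registers are live at the same time, so each needs
    its own physical register. Refuses rather than spills if there are not
    enough registers, mirroring "won't do it if it requires extra spilling".
    """
    if len(spill_weights) > len(cost_per_use):
        raise ValueError("would require spilling; out of scope for this sketch")
    vregs = sorted(spill_weights, key=spill_weights.get, reverse=True)
    pregs = sorted(cost_per_use, key=cost_per_use.get)
    return dict(zip(vregs, pregs))

def weighted_cost(assignment, spill_weights, cost_per_use):
    """Total spill weight placed on registers, scaled by their cost per use."""
    return sum(spill_weights[v] * cost_per_use[p] for v, p in assignment.items())

# %vreg1 plays the role of the hoisted constant with a low spill weight;
# %vreg2 and %vreg3 stand for values in the large floating point block.
weights = {"%vreg1": 1.0, "%vreg2": 50.0, "%vreg3": 40.0}
result = assign(weights, COST_PER_USE)
print(result, weighted_cost(result, weights, COST_PER_USE))
```

Under these made-up weights, the low-weight %vreg1 ends up in xmm8, matching the outcome described for n-body: the total weighted cost is minimized, but the one value used in the small inner loop gets the expensive register.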
On May 3, 2011, at 12:03 PM, David A. Greene wrote:

>> I am working my way through the regressions, looking for things the
>> allocator did wrong. Any help is appreciated, please file bugs if you
>> find examples of stupid register allocation.
>
> Certainly. I would ask that we keep linearscan around, if possible, as
> long as there are significant regressions like this. Our customers tend
> to really, really care about performance.

That's reasonable, and it is also useful to keep it around as a reference when greedy breaks.

On the other hand, I really want to clean up the code surrounding register allocation, and that is much easier to do after linear scan is gone. There is a good chance it won't make it to the 3.0 release.

/jakob