Dwight Guth via llvm-dev
2021-Mar-01 22:00 UTC
[llvm-dev] Is the fast register allocator O(n^2) in the size of the function?
Apologies if this email is a bit weirdly formatted; I had actually already unsubscribed from the mailing list when I noticed that I had received a response. Here is the stack trace I found for LLVM 11: #0 hasOneNonDBGUse () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/MachineRegisterInfo.cpp:420 #1 0x00007ffff3e8a478 in defineVirtReg () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:787 #2 0x00007ffff3e883d1 in allocateInstruction () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:1188 #3 allocateBasicBlock () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:1277 #4 runOnMachineFunction () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:1316 #5 0x00007ffff3d8a39e in runOnFunction () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/MachineFunctionPass.cpp:73 #6 0x00007ffff3bc7579 in runOnFunction () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:1516 #7 0x00007ffff3bccb23 in runOnModule () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:1552 #8 0x00007ffff3bc7b90 in runOnModule () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:1617 #9 run () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:614 #10 0x000000000040d792 in main () It looks like at least as of LLVM 11, RegAllocFast.cpp was indeed calling hasOneNonDBGUse, which was accounting for some 55% of the self time reported in my profile on the bitcode file I was testing with. I did some digging into the more recent history of LLVM and it seems that this function was refactored sometime between LLVM 11 and LLVM 12 RC1, so I went ahead and built the most recent release candidate from source to test if the problem still occurred or not. I don't remember seeing a release candidate when I originally composed the email or I would have tried it then, although it seems RC1 must have been out by then. I must have overlooked it by accident. With that being said, when I run the same bitcode file with LLVM 12 RC2, I see that register allocation only takes 4% of the CPU time in total. I also see that the time spent in register allocation has reduced to only 5 seconds from 5 minutes, a 60-fold improvement! I do notice that 30% of self time is spent in each of llvm::FastISel::handlePHINodesInSuccessorBlocks and llvm::SelectionDAGBuilder::HandlePHINodesInSuccessorBlocks, for a whopping 60% of self time total, but that is a completely separate issue, and much more acceptable given that wall time has decreased from 10 minutes for the entire process to 2.5. In short, it looks like someone else already beat me to the punch. Sorry for wasting your time. It's good to know that this issue will be addressed for us simply by waiting for LLVM 12 to be released, though! Thanks, Dwight> Hi Dwight, > > Can your share the profile (stack trace for instance) you’ve observed? > > As far as I remember the fast regalloc (the one that runs at O0 by default) shouldn’t call hasOneNonDBGUse and the only potential quadratic behavior in that allocator is when we scan for the live out (RegAllocFast::mayLiveOut) and even that part shouldn’t cause significant problems because we limit the number of checks by an upper bound of 8 per variable (hard coded limit). > > If RegAllocFast::mayLiveOut turns out to be problematic, we could start to cache its results. > > Anyhow, I’ll need more information to be able to help you. > > Cheers, > -Quentin > > > On Feb 22, 2021, at 12:17 PM, David Blaikie via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > > > +Lang who might still have some register allocation knowledge kicking around (but his responses may be a bit delayed) > > > > Generally there's a bunch of algorithms that don't scale really well on especially large function - and I'd encourage users to modify any code generators to chunk up functions, or they'll be chasing down these sort of issues indefinitely, really. > > > > But improvements are certainly appreciated - though I don't have any particular knowledge or pointers regarding how to improve the register allocator. (my gut reaction would be that probably a lot of people have looked and haven't found better tradeoffs - but I'm a bit of a pessimist, fresh eyes can often help :) ) > > > > On Wed, Feb 10, 2021 at 10:48 AM Dwight Guth via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote: > > Hi, > > > > I am the maintainer of an open source compiler that makes use of the > > LLVM toolchain to perform code generation. A user of our compiler, who > > is working on a proprietary project, recently shared with me an > > example where our toolchain was invoking `llc -O0` on a file with a > > rather large function in it, and the register allocation pass was > > taking upwards of 5 minutes. This seemed unusual to me, so I > > downloaded the debug symbols for LLVM (this was with LLVM 11.0.0), and > > I found that slightly over 50% of the runtime of LLC was spent within > > a very small loop within llvm::MachineRegisterInfo::hasOneNonDBGUse. I > > believe this loop to correspond to the while loop in > > llvm::MachineRegisterInfo::defusechain_iterator::advance. > > > > It looks like this function is just scanning through all the defs and > > uses of a register in the function until it finds one that it > > considers satisfactory? That seems like it would be introducing > > behavior that is quadratic in the size of the function to me. Am I > > missing something? Is there some other, more sensible reason why the > > profile seems so dependent on this one loop of code? Did I > > misunderstand something? > > > > My end goal here would be to submit a patch that might optimize this > > case, since it seems to me something that might be able to be computed > > more efficiently. But I don't understand the code or the algorithm > > hugely well, so I was hoping someone could give me some pointers on > > where to get started with something like this. > > > > Does anyone have any suggestions on what I could do to make register > > allocation run faster on quite large functions? > > > > Thanks, > > > > -- > > Dwight Guth > > > > Chief Information Officer > > Runtime Verification, Inc. > > Email: dwight.guth at runtimeverification.com <mailto:dwight.guth at runtimeverification.com> > > _______________________________________________ > > LLVM Developers mailing list > > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> > > _______________________________________________ > > LLVM Developers mailing list > > llvm-dev at lists.llvm.org > > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev-- Dwight Guth Chief Information Officer Runtime Verification, Inc. Email: dwight.guth at runtimeverification.com
Quentin Colombet via llvm-dev
2021-Mar-01 23:27 UTC
[llvm-dev] Is the fast register allocator O(n^2) in the size of the function?
Glad to hear that your problem will be fixed by LLVM 12!> On Mar 1, 2021, at 2:00 PM, Dwight Guth via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > Apologies if this email is a bit weirdly formatted; I had actually > already unsubscribed from the mailing list when I noticed that I had > received a response. > > Here is the stack trace I found for LLVM 11: > > #0 hasOneNonDBGUse () at > /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/MachineRegisterInfo.cpp:420 > #1 0x00007ffff3e8a478 in defineVirtReg () at > /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:787 > #2 0x00007ffff3e883d1 in allocateInstruction () at > /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:1188 > #3 allocateBasicBlock () at > /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:1277 > #4 runOnMachineFunction () at > /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:1316 > #5 0x00007ffff3d8a39e in runOnFunction () at > /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/MachineFunctionPass.cpp:73 > #6 0x00007ffff3bc7579 in runOnFunction () at > /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:1516 > #7 0x00007ffff3bccb23 in runOnModule () at > /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:1552 > #8 0x00007ffff3bc7b90 in runOnModule () at > /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:1617 > #9 run () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:614 > #10 0x000000000040d792 in main () > > It looks like at least as of LLVM 11, RegAllocFast.cpp was indeed > calling hasOneNonDBGUse, which was accounting for some 55% of the self > time reported in my profile on the bitcode file I was testing with. > > I did some digging into the more recent history of LLVM and it seems > that this function was refactored sometime between LLVM 11 and LLVM 12 > RC1, so I went ahead and built the most recent release candidate from > source to test if the problem still occurred or not. I don't remember > seeing a release candidate when I originally composed the email or I > would have tried it then, although it seems RC1 must have been out by > then. I must have overlooked it by accident. > > With that being said, when I run the same bitcode file with LLVM 12 > RC2, I see that register allocation only takes 4% of the CPU time in > total. I also see that the time spent in register allocation has > reduced to only 5 seconds from 5 minutes, a 60-fold improvement! > > I do notice that 30% of self time is spent in each of > llvm::FastISel::handlePHINodesInSuccessorBlocks and > llvm::SelectionDAGBuilder::HandlePHINodesInSuccessorBlocks, for a > whopping 60% of self time total, but that is a completely separate > issue, and much more acceptable given that wall time has decreased > from 10 minutes for the entire process to 2.5. > > In short, it looks like someone else already beat me to the punch. > Sorry for wasting your time. It's good to know that this issue will be > addressed for us simply by waiting for LLVM 12 to be released, though! > > Thanks, > Dwight > >> Hi Dwight, >> >> Can your share the profile (stack trace for instance) you’ve observed? >> >> As far as I remember the fast regalloc (the one that runs at O0 by default) shouldn’t call hasOneNonDBGUse and the only potential quadratic behavior in that allocator is when we scan for the live out (RegAllocFast::mayLiveOut) and even that part shouldn’t cause significant problems because we limit the number of checks by an upper bound of 8 per variable (hard coded limit). >> >> If RegAllocFast::mayLiveOut turns out to be problematic, we could start to cache its results. >> >> Anyhow, I’ll need more information to be able to help you. >> >> Cheers, >> -Quentin >> >>> On Feb 22, 2021, at 12:17 PM, David Blaikie via llvm-dev <llvm-dev at lists.llvm.org <http://lists.llvm.org/>> wrote: >>> >>> +Lang who might still have some register allocation knowledge kicking around (but his responses may be a bit delayed) >>> >>> Generally there's a bunch of algorithms that don't scale really well on especially large function - and I'd encourage users to modify any code generators to chunk up functions, or they'll be chasing down these sort of issues indefinitely, really. >>> >>> But improvements are certainly appreciated - though I don't have any particular knowledge or pointers regarding how to improve the register allocator. (my gut reaction would be that probably a lot of people have looked and haven't found better tradeoffs - but I'm a bit of a pessimist, fresh eyes can often help :) ) >>> >>> On Wed, Feb 10, 2021 at 10:48 AM Dwight Guth via llvm-dev <llvm-dev at lists.llvm.org <http://lists.llvm.org/> <mailto:llvm-dev at lists.llvm.org <http://lists.llvm.org/>>> wrote: >>> Hi, >>> >>> I am the maintainer of an open source compiler that makes use of the >>> LLVM toolchain to perform code generation. A user of our compiler, who >>> is working on a proprietary project, recently shared with me an >>> example where our toolchain was invoking `llc -O0` on a file with a >>> rather large function in it, and the register allocation pass was >>> taking upwards of 5 minutes. This seemed unusual to me, so I >>> downloaded the debug symbols for LLVM (this was with LLVM 11.0.0), and >>> I found that slightly over 50% of the runtime of LLC was spent within >>> a very small loop within llvm::MachineRegisterInfo::hasOneNonDBGUse. I >>> believe this loop to correspond to the while loop in >>> llvm::MachineRegisterInfo::defusechain_iterator::advance. >>> >>> It looks like this function is just scanning through all the defs and >>> uses of a register in the function until it finds one that it >>> considers satisfactory? That seems like it would be introducing >>> behavior that is quadratic in the size of the function to me. Am I >>> missing something? Is there some other, more sensible reason why the >>> profile seems so dependent on this one loop of code? Did I >>> misunderstand something? >>> >>> My end goal here would be to submit a patch that might optimize this >>> case, since it seems to me something that might be able to be computed >>> more efficiently. But I don't understand the code or the algorithm >>> hugely well, so I was hoping someone could give me some pointers on >>> where to get started with something like this. >>> >>> Does anyone have any suggestions on what I could do to make register >>> allocation run faster on quite large functions? >>> >>> Thanks, >>> >>> -- >>> Dwight Guth >>> >>> Chief Information Officer >>> Runtime Verification, Inc. >>> Email: dwight.guth at runtimeverification.com <http://runtimeverification.com/> <mailto:dwight.guth at runtimeverification.com <http://runtimeverification.com/>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org <http://lists.llvm.org/> <mailto:llvm-dev at lists.llvm.org <http://lists.llvm.org/>> >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>> >>> _______________________________________________ >>> LLVM Developers mailing list >>> llvm-dev at lists.llvm.org <http://lists.llvm.org/> >>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev> > > > -- > Dwight Guth > Chief Information Officer > Runtime Verification, Inc. > > Email: dwight.guth at runtimeverification.com <mailto:dwight.guth at runtimeverification.com> > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210301/8f51f2c2/attachment.html>