thr3ads.net - llvm dev - [llvm-dev] Is the fast register allocator O(n^2) in the size of the function? [Mar 2021]


Dwight Guth via llvm-dev

2021-Mar-01 22:00 UTC

[llvm-dev] Is the fast register allocator O(n^2) in the size of the function?

Apologies if this email is a bit weirdly formatted; I had actually
already unsubscribed from the mailing list when I noticed that I had
received a response.

Here is the stack trace I found for LLVM 11:

#0  hasOneNonDBGUse () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/MachineRegisterInfo.cpp:420
#1  0x00007ffff3e8a478 in defineVirtReg () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:787
#2  0x00007ffff3e883d1 in allocateInstruction () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:1188
#3  allocateBasicBlock () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:1277
#4  runOnMachineFunction () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/RegAllocFast.cpp:1316
#5  0x00007ffff3d8a39e in runOnFunction () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/CodeGen/MachineFunctionPass.cpp:73
#6  0x00007ffff3bc7579 in runOnFunction () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:1516
#7  0x00007ffff3bccb23 in runOnModule () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:1552
#8  0x00007ffff3bc7b90 in runOnModule () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:1617
#9  run () at /build/llvm-toolchain-11-9xfOLw/llvm-toolchain-11-11.0.0/llvm/lib/IR/LegacyPassManager.cpp:614
#10 0x000000000040d792 in main ()

It looks like at least as of LLVM 11, RegAllocFast.cpp was indeed
calling hasOneNonDBGUse, which was accounting for some 55% of the self
time reported in my profile on the bitcode file I was testing with.
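
To make the cost concrete, here is a toy model of a "has exactly one
non-debug use" query over a register's def/use chain. This is not LLVM's
code: the names Operand, QueryResult, hasOneNonDbgUseModel and
totalStepsForManyDefs are hypothetical, and the model assumes the chain is a
flat list mixing defs, debug uses and real uses, which the query has to step
over one entry at a time.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for one entry in a register's def/use chain.
// Illustrative only; this is not LLVM's MachineOperand.
struct Operand {
  bool IsDef;
  bool IsDebug;
};

struct QueryResult {
  bool HasOneNonDbgUse; // true if exactly one non-debug use exists
  std::size_t Steps;    // chain entries visited to answer the query
};

// Walk the chain, skipping defs and debug uses, and stop as soon as a
// second non-debug use is seen.
QueryResult hasOneNonDbgUseModel(const std::vector<Operand> &Chain) {
  std::size_t Uses = 0, Steps = 0;
  for (const Operand &Op : Chain) {
    ++Steps;
    if (Op.IsDef || Op.IsDebug)
      continue; // skipped entries still cost a step
    if (++Uses == 2)
      return {false, Steps}; // early exit on the second real use
  }
  return {Uses == 1, Steps};
}

// Assumed worst case: a chain of NumDefs defs followed by one real use,
// queried once per def, as an allocator visiting every def might do.
std::size_t totalStepsForManyDefs(std::size_t NumDefs) {
  std::vector<Operand> Chain(NumDefs, Operand{true, false});
  Chain.push_back(Operand{false, false});
  std::size_t Total = 0;
  for (std::size_t I = 0; I < NumDefs; ++I)
    Total += hasOneNonDbgUseModel(Chain).Steps;
  return Total;
}
```

In this model a register with N defs costs N * (N + 1) steps in total, so
doubling N roughly quadruples the work. Whether the chains in the function I
profiled actually had this shape is an assumption; the profile only shows
that the time went into the chain walk.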

I did some digging into the more recent history of LLVM and it seems
that this function was refactored sometime between LLVM 11 and LLVM 12
RC1, so I went ahead and built the most recent release candidate from
source to test if the problem still occurred or not. I don't remember
seeing a release candidate when I originally composed the email or I
would have tried it then, although it seems RC1 must have been out by
then. I must have overlooked it by accident.

With that being said, when I run the same bitcode file with LLVM 12
RC2, I see that register allocation only takes 4% of the CPU time in
total. I also see that the time spent in register allocation has
reduced to only 5 seconds from 5 minutes, a 60-fold improvement!

I do notice that 30% of self time is spent in each of
llvm::FastISel::handlePHINodesInSuccessorBlocks and
llvm::SelectionDAGBuilder::HandlePHINodesInSuccessorBlocks, for a
whopping 60% of self time total, but that is a completely separate
issue, and much more acceptable given that wall time for the entire
process has decreased from 10 minutes to 2.5 minutes.

In short, it looks like someone else already beat me to the punch.
Sorry for wasting your time. It's good to know that this issue will be
addressed for us simply by waiting for LLVM 12 to be released, though!

Thanks,
Dwight
> Hi Dwight,
>
> Can you share the profile (a stack trace, for instance) you’ve observed?
>
> As far as I remember, the fast regalloc (the one that runs at O0 by default)
> shouldn’t call hasOneNonDBGUse, and the only potential quadratic behavior in
> that allocator is when we scan for live-outs (RegAllocFast::mayLiveOut). Even
> that part shouldn’t cause significant problems, because we limit the number
> of checks to an upper bound of 8 per variable (a hard-coded limit).
>
> If RegAllocFast::mayLiveOut turns out to be problematic, we could start to
> cache its results.
>
> Anyhow, I’ll need more information to be able to help you.
>
> Cheers,
> -Quentin
>
> > On Feb 22, 2021, at 12:17 PM, David Blaikie via llvm-dev
> > <llvm-dev at lists.llvm.org> wrote:
> >
> > +Lang who might still have some register allocation knowledge kicking
> > around (but his responses may be a bit delayed)
> >
> > Generally there's a bunch of algorithms that don't scale really well on
> > especially large functions - and I'd encourage users to modify any code
> > generators to chunk up functions, or they'll be chasing down these sorts
> > of issues indefinitely, really.
> >
> > But improvements are certainly appreciated - though I don't have any
> > particular knowledge or pointers regarding how to improve the register
> > allocator. (My gut reaction would be that probably a lot of people have
> > looked and haven't found better tradeoffs - but I'm a bit of a pessimist;
> > fresh eyes can often help :) )
> >
> > On Wed, Feb 10, 2021 at 10:48 AM Dwight Guth via llvm-dev
> > <llvm-dev at lists.llvm.org> wrote:
> > Hi,
> >
> > I am the maintainer of an open source compiler that makes use of the
> > LLVM toolchain to perform code generation. A user of our compiler, who
> > is working on a proprietary project, recently shared with me an
> > example where our toolchain was invoking `llc -O0` on a file with a
> > rather large function in it, and the register allocation pass was
> > taking upwards of 5 minutes. This seemed unusual to me, so I
> > downloaded the debug symbols for LLVM (this was with LLVM 11.0.0), and
> > I found that slightly over 50% of the runtime of LLC was spent within
> > a very small loop within llvm::MachineRegisterInfo::hasOneNonDBGUse. I
> > believe this loop to correspond to the while loop in
> > llvm::MachineRegisterInfo::defusechain_iterator::advance.
> >
> > It looks like this function is just scanning through all the defs and
> > uses of a register in the function until it finds one that it
> > considers satisfactory? That seems like it would be introducing
> > behavior that is quadratic in the size of the function to me. Am I
> > missing something? Is there some other, more sensible reason why the
> > profile seems so dependent on this one loop of code? Did I
> > misunderstand something?
> >
> > My end goal here would be to submit a patch that might optimize this
> > case, since it seems to me something that might be able to be computed
> > more efficiently. But I don't understand the code or the algorithm
> > hugely well, so I was hoping someone could give me some pointers on
> > where to get started with something like this.
> >
> > Does anyone have any suggestions on what I could do to make register
> > allocation run faster on quite large functions?
> >
> > Thanks,
> >
> > --
> > Dwight Guth
> >
> > Chief Information Officer
> > Runtime Verification, Inc.
> > Email: dwight.guth at runtimeverification.com
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
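
Quentin's quoted reply describes two ideas: capping the live-out scan at 8
checks per variable, and caching the result if the scan still turns out to
be expensive. A minimal sketch of both follows, as a toy model rather than
the real RegAllocFast::mayLiveOut. The names ScanLimit, mayLiveOutModel and
LiveOutCache are hypothetical, and uses are modeled simply as the id of the
block they sit in.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical cap, mirroring the "upper bound of 8" mentioned in the thread.
constexpr std::size_t ScanLimit = 8;

// Toy model: UseBlocks[i] is the id of the block holding the i-th use of a
// register. Returns true if the register may be live out of CurBlock.
// Checks counts how many uses were inspected.
bool mayLiveOutModel(const std::vector<int> &UseBlocks, int CurBlock,
                     std::size_t &Checks) {
  std::size_t Seen = 0;
  for (int Block : UseBlocks) {
    ++Checks;
    if (Block != CurBlock)
      return true; // used in another block, so it may be live out
    if (++Seen >= ScanLimit)
      return true; // cap reached: give up and answer conservatively
  }
  return false; // every use was local and the cap was not reached
}

// Memoizing wrapper keyed by a register id. Assumption: one cache instance
// is used per block, so CurBlock is the same for every query it serves.
class LiveOutCache {
  std::unordered_map<unsigned, bool> Known;

public:
  std::size_t Checks = 0;

  bool query(unsigned Reg, const std::vector<int> &UseBlocks, int CurBlock) {
    auto It = Known.find(Reg);
    if (It != Known.end())
      return It->second; // cached: no further scanning
    bool Result = mayLiveOutModel(UseBlocks, CurBlock, Checks);
    Known.emplace(Reg, Result);
    return Result;
  }
};
```

With the cap in place, a register with 100 local uses costs at most 8 checks
per query in this model, and with the cache a repeated query for the same
register costs none. The price of the cap is a conservative "true" for
registers that have many uses, all of them local.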

-- 
Dwight Guth
Chief Information Officer
Runtime Verification, Inc.

Email: dwight.guth at runtimeverification.com
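
David's advice in the quoted text, chunking very large functions in the code
generator, can be illustrated with a rough cost model. The sketch below
assumes a pass whose cost grows with the square of the function size, which
is an assumption for illustration and not a measured property of any LLVM
pass. The helper names chunkSizes and quadraticCost are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Split N items into consecutive chunks of at most MaxChunk items and
// return the chunk sizes. A MaxChunk of 0 is treated as "do not split".
std::vector<std::size_t> chunkSizes(std::size_t N, std::size_t MaxChunk) {
  if (MaxChunk == 0)
    MaxChunk = N;
  std::vector<std::size_t> Sizes;
  while (N > 0) {
    std::size_t Take = N < MaxChunk ? N : MaxChunk;
    Sizes.push_back(Take);
    N -= Take;
  }
  return Sizes;
}

// Assumed cost model: each chunk costs the square of its size.
std::size_t quadraticCost(const std::vector<std::size_t> &Sizes) {
  std::size_t Cost = 0;
  for (std::size_t S : Sizes)
    Cost += S * S;
  return Cost;
}
```

Under this model, one function of 10000 instructions costs 10000 * 10000 =
100000000 units, while 100 chunks of 100 instructions cost 100 * 100 * 100 =
1000000 units, a 100-fold reduction. The model ignores the overhead of the
extra calls and of passing live values between chunks.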

Quentin Colombet via llvm-dev

2021-Mar-01 23:27 UTC


[llvm-dev] Is the fast register allocator O(n^2) in the size of the function?

Glad to hear that your problem will be fixed by LLVM 12!
