thr3ads.net - llvm dev - [LLVMdev] SROA is slow when compiling a large basic block [May 2014]

If this information is useful, please help other people find it:
Share via:

Akira Hatanaka

2014-May-15 01:02 UTC

[LLVMdev] SROA is slow when compiling a large basic block

I would like to get feedback from the community on how I can speed up the
compilation of a function that has one huge basic block consisting of over
150K instructions. It takes about 10 minutes for clang to produce the
object file and the function that is taking up the majority of the time is
LoadAndStorePromoter::run, which is called when SROA is run:

    // Otherwise, we have mixed loads and stores (or just a bunch of
stores).

    // Since SSAUpdater is purely for cross-block values, we need to
determine

    // the order of these instructions in the block.  If the first use in
the

    // block is a load, then it uses the live in value.  The last store
defines

    // the live out value.  We handle this by doing a linear scan of the
block.

    Value *StoredValue = nullptr;

    for (BasicBlock::iterator II = BB->begin(), E = BB->end(); II != E;
++II) {


If I understand this code correctly. LoadAndStorePromoter::run is called
once per every promotable alloca and iterates over the whole list to
determine the order of loads and stores in the basic block that access the
alloca.


This is the list of ideas I have considered or implemented that can
possibly solve my problem:


1. In SROA::getAnalysisUsage, always require DominatorTreeWrapperPass. This
will enable SROA::promoteAllocas to use mem2reg, which is fast because it
caches the per basic-block ordering of the relevant loads and stores. If
it's important to avoid always computing the dominator tree, computing it
conditionally based on whether there is a huge basic block in the function
is another idea, but I am not sure if that is possible (I don't think this
is currently supported).


This brings down the compilation time (using clang -emit-llvm) from 350s to
30s (it still takes about 23s to do GVN). It also might fix PR17855 (the
program that used to take 65s to compile now takes just 11s):


http://llvm.org/bugs/show_bug.cgi?id=17855


2. Cache the ordering of loads and stores in LoadAndStorePromoter::run. I
don't know how hard it would be to implement this, but I think it would be
as fast as 1 (using mem2reg).


3. Insert a pass that splits huge basic blocks into smaller blocks prior to
SROA and recombine them after the pass is run.


4. Change the definition of class Instruction in IR/Instruction.h so that
determining whether one instruction precedes another becomes more
efficient. My current implementation brings down the compilation time to
just 6s.


I won't go into much detail about the change I made, but it basically
replaces Instruction's pointer to its parent BasicBlock with another data
structure. I haven't done much investigation into how this will affect the
common case.


This will also help speed up the execution of code in other places. For
example, these two functions are called when GVN::processLoad is called and
iterate over a basic block's instruction list to determine the order of two
(load and stores) instructions.:


llvm::isPotentiallyReachable


    // Linear scan, start at 'A', see whether we hit 'B' or the
end first.

    for (BasicBlock::const_iterator I = A, E = BB->end(); I != E; ++I) {

      if (&*I == B)

        return true;

    }

DominatorTree::dominates

  // Loop through the basic block until we find Def or User.

  BasicBlock::const_iterator I = DefBB->begin();

  for (; &*I != Def && &*I != User; ++I)

    /*empty*/;

5. Simply give up doing SROA (and GVN::processLoad) if there is a huge
basic block that can potentially take a long time to compile.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140514/24fe9f79/attachment.html>

Chandler Carruth

2014-May-15 03:15 UTC

head link

[LLVMdev] SROA is slow when compiling a large basic block

On Wed, May 14, 2014 at 7:02 PM, Akira Hatanaka <ahatanak at gmail.com>
wrote:
> If I understand this code correctly. LoadAndStorePromoter::run is called
> once per every promotable alloca and iterates over the whole list to
> determine the order of loads and stores in the basic block that access the
> alloca.
>
Yes, this is a long standing problem of SROA.

>
> This is the list of ideas I have considered or implemented that can
> possibly solve my problem:
>
>
> 1. In SROA::getAnalysisUsage, always require DominatorTreeWrapperPass.
> This will enable SROA::promoteAllocas to use mem2reg, which is fast because
> it caches the per basic-block ordering of the relevant loads and stores. If
> it's important to avoid always computing the dominator tree, computing
it
> conditionally based on whether there is a huge basic block in the function
> is another idea, but I am not sure if that is possible (I don't think
this
> is currently supported).
>
>
> This brings down the compilation time (using clang -emit-llvm) from 350s
> to 30s (it still takes about 23s to do GVN). It also might fix PR17855 (the
> program that used to take 65s to compile now takes just 11s):
>
>
> http://llvm.org/bugs/show_bug.cgi?id=17855
>
This is my plan, but before doing it there are a bunch of *huge*
performance improvements we can make in the more common case so that
mem2reg isn't actually slower. Also, we need to be able to preserve
analyses further which the new pass manager will allow.

Is this a pressing matter for you?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140514/b2bb1072/attachment.html>

Philip Reames

2014-May-15 16:31 UTC

head link

[LLVMdev] SROA is slow when compiling a large basic block

On 05/14/2014 06:02 PM, Akira Hatanaka wrote:> I would like to get feedback from the community on how I can speed up 
> the compilation of a function that has one huge basic block consisting 
> of over 150K instructions. It takes about 10 minutes for clang to 
> produce the object file and the function that is taking up the 
> majority of the time is LoadAndStorePromoter::run, which is called 
> when SROA is run:
>
>     // Otherwise, we have mixed loads and stores (or just a bunch of 
> stores).
>
>     // Since SSAUpdater is purely for cross-block values, we need to 
> determine
>
>     // the order of these instructions in the block.  If the first use 
> in the
>
>     // block is a load, then it uses the live in value.  The last 
> store defines
>
>     // the live out value.  We handle this by doing a linear scan of 
> the block.
>
> Value *StoredValue = nullptr;
>
>     for (BasicBlock::iterator II = BB->begin(), E = BB->end(); II != 
> E; ++II) {
>
>
> If I understand this code correctly. LoadAndStorePromoter::run is 
> called once per every promotable alloca and iterates over the whole 
> list to determine the order of loads and stores in the basic block 
> that access the alloca.
>I'm a bit confused about the cost here.  Based on the code, I see us 
walking each basic block at most once per alloca.  Even at 150K 
instructions, I wouldn't expect the loop to take *that* long.  How many 
allocas are you dealing with?

I do see an outer loop in SROA::performPromotion.  Are you possibly 
triggering a large number of iterations there?  That would make the 
inner loop appear much more expensive.
>
> This is the list of ideas I have considered or implemented that can 
> possibly solve my problem:
>
>
> 1. In SROA::getAnalysisUsage, always require DominatorTreeWrapperPass. 
> This will enable SROA::promoteAllocas to use mem2reg, which is fast 
> because it caches the per basic-block ordering of the relevant loads 
> and stores. If it's important to avoid always computing the dominator 
> tree, computing it conditionally based on whether there is a huge 
> basic block in the function is another idea, but I am not sure if that 
> is possible (I don't think this is currently supported).
>
>
> This brings down the compilation time (using clang -emit-llvm) from 
> 350s to 30s (it still takes about 23s to do GVN). It also might fix 
> PR17855 (the program that used to take 65s to compile now takes just 11s):
>
>
> http://llvm.org/bugs/show_bug.cgi?id=17855
>
Chandler commented here, I'll defer to him.>
>
> 2. Cache the ordering of loads and stores in 
> LoadAndStorePromoter::run. I don't know how hard it would be to 
> implement this, but I think it would be as fast as 1 (using mem2reg).
>Caching the first load and first store for each alloca in a single pass 
doesn't seem too bad.  (Assuming that that doesn't change across 
iterations at least - I'm not particularly familiar with this code.)  
This seems like a fairly isolated improvement.  Assuming Chandler 
doesn't disagree, I'd say this seems like a reasonable first
step.>
>
> 3. Insert a pass that splits huge basic blocks into smaller blocks 
> prior to SROA and recombine them after the pass is run.
>How would this actually help?  If you have uses in each of the split 
blocks, wouldn't you spend the exact same time?>
>
> 4. Change the definition of class Instruction in IR/Instruction.h so 
> that determining whether one instruction precedes another becomes more 
> efficient. My current implementation brings down the compilation time 
> to just 6s.
>
>
> I won't go into much detail about the change I made, but it basically 
> replaces Instruction's pointer to its parent BasicBlock with another 
> data structure. I haven't done much investigation into how this will 
> affect the common case.
>
>
> This will also help speed up the execution of code in other places. 
> For example, these two functions are called when GVN::processLoad is 
> called and iterate over a basic block's instruction list to determine 
> the order of two (load and stores) instructions.:
>
>
> llvm::isPotentiallyReachable
>
>
>     // Linear scan, start at 'A', see whether we hit 'B' or
the end first.
>
>     for (BasicBlock::const_iterator I = A, E = BB->end(); I != E; ++I) {
>
>   if (&*I == B)
>
>   return true;
>
>     }
>
>
> DominatorTree::dominates
>
>   // Loop through the basic block until we find Def or User.
>
>   BasicBlock::const_iterator I = DefBB->begin();
>
>   for (; &*I != Def && &*I != User; ++I)
>
>     /*empty*/;
>I'd be curious to see a proposal here.  I've been bitten by the 
DT::dominates issue myself.>
> 5. Simply give up doing SROA (and GVN::processLoad) if there is a huge 
> basic block that can potentially take a long time to compile.
>Until we have something better, this might be a necessary hack. Whether 
it's actually worth introducing depends on how long Chandler thinks the 
mem2reg version will take to land.

Philip

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140515/99b3fcb4/attachment.html>

Akira Hatanaka

2014-May-15 17:30 UTC

head link

[LLVMdev] SROA is slow when compiling a large basic block

On Wed, May 14, 2014 at 8:15 PM, Chandler Carruth <chandlerc at
google.com>wrote:
>
> On Wed, May 14, 2014 at 7:02 PM, Akira Hatanaka <ahatanak at
gmail.com>wrote:
>
>> If I understand this code correctly. LoadAndStorePromoter::run is
called
>> once per every promotable alloca and iterates over the whole list to
>> determine the order of loads and stores in the basic block that access
the
>> alloca.
>>
>
> Yes, this is a long standing problem of SROA.
>
>
>>
>> This is the list of ideas I have considered or implemented that can
>> possibly solve my problem:
>>
>>
>> 1. In SROA::getAnalysisUsage, always require DominatorTreeWrapperPass.
>> This will enable SROA::promoteAllocas to use mem2reg, which is fast
because
>> it caches the per basic-block ordering of the relevant loads and
stores. If
>> it's important to avoid always computing the dominator tree,
computing it
>> conditionally based on whether there is a huge basic block in the
function
>> is another idea, but I am not sure if that is possible (I don't
think this
>> is currently supported).
>>
>>
>> This brings down the compilation time (using clang -emit-llvm) from
350s
>> to 30s (it still takes about 23s to do GVN). It also might fix PR17855
(the
>> program that used to take 65s to compile now takes just 11s):
>>
>>
>> http://llvm.org/bugs/show_bug.cgi?id=17855
>>
>
> This is my plan, but before doing it there are a bunch of *huge*
> performance improvements we can make in the more common case so that
> mem2reg isn't actually slower. Also, we need to be able to preserve
> analyses further which the new pass manager will allow.
>
> Is this a pressing matter for you?
>
It's not really pressing, but there are users complaining about the long
compilation time, so I would like to see this fixed soon, if that is
possible.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140515/5a3e4ca9/attachment.html>

Akira Hatanaka

2014-May-15 18:12 UTC

head link

[LLVMdev] SROA is slow when compiling a large basic block

On Thu, May 15, 2014 at 9:31 AM, Philip Reames <listmail at
philipreames.com>wrote:
>  On 05/14/2014 06:02 PM, Akira Hatanaka wrote:
>
>  I would like to get feedback from the community on how I can speed up
> the compilation of a function that has one huge basic block consisting of
> over 150K instructions. It takes about 10 minutes for clang to produce the
> object file and the function that is taking up the majority of the time is
> LoadAndStorePromoter::run, which is called when SROA is run:
>
>      // Otherwise, we have mixed loads and stores (or just a bunch of
> stores).
>
>     // Since SSAUpdater is purely for cross-block values, we need to
> determine
>
>     // the order of these instructions in the block.  If the first use in
> the
>
>     // block is a load, then it uses the live in value.  The last store
> defines
>
>     // the live out value.  We handle this by doing a linear scan of the
> block.
>
>     Value *StoredValue = nullptr;
>
>     for (BasicBlock::iterator II = BB->begin(), E = BB->end(); II !=
E;
> ++II) {
>
>
>  If I understand this code correctly. LoadAndStorePromoter::run is called
> once per every promotable alloca and iterates over the whole list to
> determine the order of loads and stores in the basic block that access the
> alloca.
>
> I'm a bit confused about the cost here.  Based on the code, I see us
> walking each basic block at most once per alloca.  Even at 150K
> instructions, I wouldn't expect the loop to take *that* long.  How many
> allocas are you dealing with?
>
> I do see an outer loop in SROA::performPromotion.  Are you possibly
> triggering a large number of iterations there?  That would make the inner
> loop appear much more expensive.
>
>The unoptimized bitcode (compiled with clang -O0) has a large number of
allocas (over 7000), which I think is also contributing to the long
compilation time.

But making LoadAndStorePromoter::run faster is sufficient to reduce the
compilation time (at least in my case).

>
>  This is the list of ideas I have considered or implemented that can
> possibly solve my problem:
>
>
>  1. In SROA::getAnalysisUsage, always require DominatorTreeWrapperPass.
> This will enable SROA::promoteAllocas to use mem2reg, which is fast because
> it caches the per basic-block ordering of the relevant loads and stores. If
> it's important to avoid always computing the dominator tree, computing
it
> conditionally based on whether there is a huge basic block in the function
> is another idea, but I am not sure if that is possible (I don't think
this
> is currently supported).
>
>
>  This brings down the compilation time (using clang -emit-llvm) from 350s
> to 30s (it still takes about 23s to do GVN). It also might fix PR17855 (the
> program that used to take 65s to compile now takes just 11s):
>
>
>  http://llvm.org/bugs/show_bug.cgi?id=17855
>
> Chandler commented here, I'll defer to him.
>
>
>  2. Cache the ordering of loads and stores in LoadAndStorePromoter::run.
> I don't know how hard it would be to implement this, but I think it
would
> be as fast as 1 (using mem2reg).
>
> Caching the first load and first store for each alloca in a single pass
> doesn't seem too bad.  (Assuming that that doesn't change across
iterations
> at least - I'm not particularly familiar with this code.)  This seems
like
> a fairly isolated improvement.  Assuming Chandler doesn't disagree,
I'd say
> this seems like a reasonable first step.
>
>
>  3. Insert a pass that splits huge basic blocks into smaller blocks prior
> to SROA and recombine them after the pass is run.
>
> How would this actually help?  If you have uses in each of the split
> blocks, wouldn't you spend the exact same time?
>
>
> It might not help much. The code in LoadAndStorePromoter::run avoidsscanning all instructions in a basic block if either there is only one user
in a basic block or there are no stores in a basic block, and splitting
blocks might increase the number of such blocks.

4. Change the definition of class Instruction in IR/Instruction.h so
that> determining whether one instruction precedes another becomes more
> efficient. My current implementation brings down the compilation time to
> just 6s.
>
>
>  I won't go into much detail about the change I made, but it basically
> replaces Instruction's pointer to its parent BasicBlock with another
data
> structure. I haven't done much investigation into how this will affect
the
> common case.
>
>
>  This will also help speed up the execution of code in other places. For
> example, these two functions are called when GVN::processLoad is called and
> iterate over a basic block's instruction list to determine the order of
two
> (load and stores) instructions.:
>
>
>  llvm::isPotentiallyReachable
>
>
>      // Linear scan, start at 'A', see whether we hit 'B'
or the end
> first.
>
>     for (BasicBlock::const_iterator I = A, E = BB->end(); I != E; ++I) {
>
>       if (&*I == B)
>
>         return true;
>
>     }
>
>  DominatorTree::dominates
>
>    // Loop through the basic block until we find Def or User.
>
>   BasicBlock::const_iterator I = DefBB->begin();
>
>   for (; &*I != Def && &*I != User; ++I)
>
>     /*empty*/;
>
> I'd be curious to see a proposal here.  I've been bitten by the
> DT::dominates issue myself.
>
>
> I was hesitant to make changes in classes such as BasicBlock andInstruction just to make it faster to compile a basic block with 150K+
instructions (which I think is a pathological case). If there are other use
cases that need discovering orders of instructions in a basic block, it
might be worth considering fixing those classes.

The attached patch is the current implementation I have. Basically I am
trying to make function Instruction::precede fast and use it to speed up
functions that are currently iterating over the whole instruction list to
determine the order of instructions in a basic block.

 5. Simply give up doing SROA (and GVN::processLoad) if there is a
huge> basic block that can potentially take a long time to compile.
>
> Until we have something better, this might be a necessary hack.  Whether
> it's actually worth introducing depends on how long Chandler thinks the
> mem2reg version will take to land.
>
>I tried this idea today, and unfortunately it didn't reduce the compilation
time. Skipping SROA leaves a lot of loads/stores in the code that could
have been promoted, and that places the burden on other passes that are run
later, such as GVN and DSE.


Philip>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140515/fc74f553/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sublist1.patch
Type: application/octet-stream
Size: 24745 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140515/fc74f553/attachment.obj>

Reasonably Related Threads

Search for more reasonably related threads

llvm dev - May 2014 - [LLVMdev] SROA is slow when compiling a large basic block

[LLVMdev] SROA is slow when compiling a large basic block

[LLVMdev] SROA is slow when compiling a large basic block

[LLVMdev] SROA is slow when compiling a large basic block

[LLVMdev] SROA is slow when compiling a large basic block

[LLVMdev] SROA is slow when compiling a large basic block

Reasonably Related Threads