On Oct 4, 2008, at 2:51 PM, Chris Lattner wrote:

>>> I like your approach of using the use lists but I'm not sure the
>>> ordering is guaranteed. If it is, your approach is superior.
>>
>> I got my patch updated to work with TOT. Here it is. Comments
>> welcome.
>
> Hi Dave,
>
> Great. I'd like to get this in, but would really like to do a profile
> first. Could you please include a testcase? Even if it is something
> contrived with llvm-gcc, I would really like to understand what is
> going on.

Looking more closely at the approach, I infer that the (logical in hindsight) issue is cases where mem2reg scans a BB to determine the relative ordering of two instructions. For example, the single store case. I could imagine that if you saw lots of single store allocas in the middle of a large BB, mem2reg would be very, very slow.

As far as the approach goes, here's a counterproposal that is a bit less complex. I suggest adding a single DenseMap<Instruction*, unsigned> to mem2reg. If there is an entry in this map for a load/store instruction, it would store the instruction's index in its basic block. This means that to check dominance of two instructions in a block, you just need to compare their two indices (which is fast). Really small blocks (< 10 instructions?) would never be entered into the map, to keep the size of the map low. The first time a query is made within a large block, it would compute the indices of all loads and stores within the block.

The nice thing about using a single map is that it makes updating the map really trivial: just use TheMap.erase(I) when deleting load and store instructions. The numbering can be purely local to a block, and the relative ordering of memops doesn't get invalidated when a load or store is nuked.

Does this seem like a reasonable approach? If so, please split the various parts of mem2reg out into helper methods as a prepatch. This will make review of the subsequent changes much easier.

-Chris
On Saturday 04 October 2008 17:05, Chris Lattner wrote:

> Looking more closely at the approach, I infer that the (logical in
> hindsight) issue is cases where mem2reg scans a BB to see the
> relative ordering of two instructions. For example, the single store
> case.

Correct.

> I could imagine that if you saw lots of single store allocas in
> the middle of a large BB that mem2reg would be very very slow.

Right.

> As far as the approach goes, here's a counterproposal that is a bit
> less complex. I suggest adding a single DenseMap<Instruction*,
> unsigned> to Mem2reg. If there is an entry in this map for a load/
> store instruction, it would store the instruction's index in its basic
> block. This means that to check dominance of two instrs in a block,
> you just need to compare their two indices (which is fast). Really
> small blocks (< 10 instructions?) would never be entered into the map,
> to keep the size of the map low. The first time a query is made within
> a large block, it would compute the indices of all loads and stores
> within the block.

This is a simpler approach and should have the same effect. I'll code it up.

> The nice thing about using a single map is that it makes updating the
> map really trivial. Just use TheMap.erase(I) when deleting load and
> store instructions. The numbering can be purely local to a block and
> the relative ordering of memops doesn't get invalidated when a load or
> store is nuked.

Good points.

> Does this seem like a reasonable approach? If so, please split the
> various parts of mem2reg out into helper methods as a prepatch. This
> will make review of the subsequent changes much easier.

Which various parts are you referring to? You mean the stuff that populates and updates the map?

-Dave
On Oct 6, 2008, at 9:47 AM, David Greene wrote:

>> As far as the approach goes, here's a counterproposal that is a bit
>> less complex. I suggest adding a single DenseMap<Instruction*,
>> unsigned> to Mem2reg. If there is an entry in this map for a load/
>> store instruction, it would store the instruction's index in its
>> basic block. This means that to check dominance of two instrs in a
>> block, you just need to compare their two indices (which is fast).
>> Really small blocks (< 10 instructions?) would never be entered into
>> the map, to keep the size of the map low. The first time a query is
>> made within a large block, it would compute the indices of all loads
>> and stores within the block.
>
> This is a simpler approach and should have the same effect. I'll
> code it up.

Great! Thanks,

>> Does this seem like a reasonable approach? If so, please split the
>> various parts of mem2reg out into helper methods as a prepatch. This
>> will make review of the subsequent changes much easier.
>
> Which various parts are you referring to? You mean the stuff that
> populates and updates the map?

I'm criticizing the existing code :), which is just very poorly structured. It would be nice to do a refactoring patch first which just splits the various "scan a BB" places out into helper methods, instead of them being inlined into larger pieces of code. This will make the mem2reg algo easier to read and will make the algorithmic change easier to review.

If you want, I can change one of the places that does a scan to show what I mean.

-Chris
Hi Chris,

Sounds like an excellent generic approach. I'd love to lend a hand implementing it, but I'm afraid it involves a little more in-depth LLVM knowledge than what I currently master, so I'll leave it to Dave and you for now...

Talking about performance: for me this optimization was the difference between mem2reg taking 63% of execution time and having it totally 'vanish'. It also marked the transition from sticking with a custom JIT compiler to getting really excited about using LLVM instead. :) Previously I just assumed LLVM to be inherently slow, but now I realize there's still potential for optimization.

Do you consider it useful if I go ahead and identify further bottlenecks and report them, or would you rather concentrate on correctness bugs for now? Is there any list of known optimization opportunities?

Thanks,

Nicolas

-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Chris Lattner
Sent: Sunday, 05 October, 2008 00:05
To: LLVM Developers Mailing List
Subject: Re: [LLVMdev] mem2reg optimization

[original message snipped]

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu
http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
On Oct 6, 2008, at 10:58 PM, Nicolas Capens wrote:

> Sounds like an excellent generic approach. I'd love to lend a hand to
> implement it but I'm afraid it involves a little more in-depth LLVM
> knowledge than what I currently master, so I'll leave it to Dave and
> you for now...

Ok.

> Talking about performance, for me this optimization was the difference
> between mem2reg taking 63% of execution time, and having it totally
> 'vanish'. It also marked the transition between rather using a custom
> JIT compiler and getting really excited to use LLVM instead. :)
> Previously I just assumed LLVM to be inherently slow but now I realize
> there's still potential for optimization.

Hehe, we can't fix problems we don't know about :)

> Do you consider it useful if I go ahead and further identify
> bottlenecks and report them or would you rather concentrate on
> correctness bugs for now? Is there any list of known optimization
> opportunities?

Absolutely. Different people have different focuses, and I tend to consider correctness/miscompilation bugs a higher priority. However, compile-time speed is a very high second priority to me. Bug reports with testcases are extremely useful.

-Chris
On Monday 06 October 2008 11:47, David Greene wrote:

> > As far as the approach goes, here's a counterproposal that is a bit
> > less complex. I suggest adding a single DenseMap<Instruction*,
> > unsigned> to Mem2reg. If there is an entry in this map for a load/
> > store instruction, it would store the instruction's index in its
> > basic block. This means that to check dominance of two instrs in a
> > block, you just need to compare their two indices (which is fast).
> > Really small blocks (< 10 instructions?) would never be entered into
> > the map, to keep the size of the map low. The first time a query is
> > made within a large block, it would compute the indices of all loads
> > and stores within the block.
>
> This is a simpler approach and should have the same effect. I'll code
> it up.

Urk. After looking at doing this today, I realized it won't work. The problem is that RewriteSingleStoreAlloca has this loop:

  for (; &*I != OnlyStore; ++I) { // scan block for store.
    if (isa<LoadInst>(I) && I->getOperand(0) == AI)
      break;
  }

It's looking for a load that matches the address of the single store. This is the loop that's taking all the time. What I did with the original patch was to make lookups of load and store instructions fast. That won't be possible with a map indexing all instructions.

This loop is also a problem:

  // Otherwise, if this is a different block or if all uses happen
  // after the store, do a simple linear scan to replace loads with
  // the stored value.
  for (BasicBlock::iterator I = UseBlock->begin(), E = UseBlock->end();
       I != E; ) {
    if (LoadInst *LI = dyn_cast<LoadInst>(I++)) {
      if (LI->getOperand(0) == AI) {
        LI->replaceAllUsesWith(OnlyStore->getOperand(0));
        if (AST && isa<PointerType>(LI->getType()))
          AST->deleteValue(LI);
        LI->eraseFromParent();
      }
    }
  }

I suppose we could restrict the map to containing only load and store instructions, but if you've got a block with a lot of memory instructions I'm not sure it will be much faster, especially with the second loop. The artificial testcase I've come up with does exactly this: it basically does load-load-add-store-call over and over again.

I've got the original patch ready to go. Should I commit that? I'm thinking if we want to make things simpler later on we can do that, and we'll have the testcase in the repository to catch whether we've simplified too much and lost the optimization.

-Dave