John Brawn via llvm-dev
2018-Nov-15 16:23 UTC
[llvm-dev] [RFC] SimplifyCFG block speculation is a horribly inconsistent mess
Background
==========

This is part of my long-running quest to better optimise

  float min_element_example(float *data, int size) {
    return *std::min_element(data, data+size);
  }

which requires improvements to GVN, which first requires a side-quest to delay
phi-to-select generation until after it (see https://reviews.llvm.org/D50723 and
my earlier RFC "Delaying phi-to-select transformation until later in the pass
pipeline"), which in turn requires side-side-quests to fix everything that gets
worse when we do that.

There's a bad interaction between jump threading and SimplifyCFG's phi-to-select
generation, where jump threading followed by phi-to-select gives worse results
than phi-to-select followed by jump threading, specifically in the function:

define double @perlin_example(i32 %h, double %x, double %y, double %z) {
entry:
  %cmp = icmp slt i32 %h, 8
  br i1 %cmp, label %cond.end, label %cond.false
cond.false:
  br label %cond.end
cond.end:
  %cond = phi double [ %y, %cond.false ], [ %x, %entry ]
  %cmp1 = icmp slt i32 %h, 4
  br i1 %cmp1, label %cond.end9, label %cond.false3
cond.false3:
  %cmp4 = icmp eq i32 %h, 12
  br i1 %cmp4, label %cond.end9, label %cond.false6
cond.false6:
  br label %cond.end9
cond.end9:
  %cond10 = phi double [ %z, %cond.false6 ], [ %y, %cond.end ], [ %x, %cond.false3 ]
  %add = fadd double %cond, %cond10
  ret double %add
}

(heavily reduced from SingleSource/Benchmarks/Misc/perlin.c)

Here opt -jump-threading -simplifycfg gives worse results than
opt -simplifycfg -jump-threading, and we also get these worse results for
opt -O3 when D50723 is applied.
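For illustration, the fully select-converted form that good phi-to-select
conversion would presumably end up with for @perlin_example looks roughly like
this (a hand-written sketch of the desired shape rather than actual pass output;
the function and value names here are mine):

define double @perlin_example_selects(i32 %h, double %x, double %y, double %z) {
entry:
  ; %cond = h < 8 ? x : y
  %cmp = icmp slt i32 %h, 8
  %cond = select i1 %cmp, double %x, double %y
  ; %cond10 = h < 4 ? y : (h == 12 ? x : z)
  %cmp1 = icmp slt i32 %h, 4
  %cmp4 = icmp eq i32 %h, 12
  %sel = select i1 %cmp4, double %x, double %z
  %cond10 = select i1 %cmp1, double %y, double %sel
  %add = fadd double %cond, %cond10
  ret double %add
}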
I think the correct way to fix this is to improve how SimplifyCFG does block
speculation, which is currently a horribly inconsistent mess, and I think I need
to fix the mess before making any other changes. It's that side-side-side-quest
that I'm talking about here.

Overview
========

Some of what SimplifyCFG does is speculate blocks, i.e. move instructions from a
block into one of its predecessors. There are five functions (that I've found)
that do this, with a lot of overlap in what they handle, and also some are
restricted in ways that others aren't, without much consistency:

FoldBranchToCommonDest:
* BB can have any number of predecessors; the first valid predecessor is chosen.
* BB can have any number of successors, but it has to share a successor with one
  of its predecessors.
* The condition must be an instruction in BB.
* BB can contain at most 1+BonusInstThreshold (non-branch) instructions.
* Phis are updated rather than turned into selects, so this transformation
  cannot happen if the phi cannot be updated in this way.
* The branch condition in the predecessor is merged with BB's using logical
  operations.

HoistThenElseCodeToIf:
* BB1 and BB2 must have the same predecessor, and be the only successors of that
  predecessor.
* The successors of BB1 and BB2 do not matter (except see phi handling below).
* No restriction on the condition.
* Only instructions that are common to BB1 and BB2 are speculated.
* If the terminators of BB1 and BB2 are speculated (i.e. BB1 and BB2 have the
  same successors), then phis in BB1 and BB2's successors are turned into
  selects in their predecessor.
* The branch in the predecessor is turned into that of BB1/BB2 if it was hoisted.

SpeculativelyExecuteBB:
* BB must have one predecessor.
* BB must have one successor, which is the other successor of its predecessor.
* The condition being speculated cannot come from an FCmpInst.
* BB can contain at most 1 (non-branch) instruction.
* Phis in BB's successor are turned into selects in BB's predecessor.
* BB's predecessor is changed to branch unconditionally to its other successor.

SimplifyCondBranchToCondBranch:
* BB can have any number of predecessors; the first valid predecessor is chosen.
* BB can have any number of successors, but it has to share a successor with the
  predecessor.
* No restriction on the condition.
* BB must be empty (except for the branch).
* Phis in the shared successor are turned into selects in the predecessor.
* The branch condition in the predecessor is merged with BB's using logical
  operations.

FoldTwoEntryPHINode:
* BB1 and/or BB2 must have the same predecessor, and must form a simple if or
  if/else.
* BB1 and BB2 must have the same successor, or we only have BB1, which has one
  successor.
* No restriction on the condition.
* Instructions with total cost up to PHINodeFoldingThreshold * TCC_Basic can be
  speculated.
* Phis in the successor are turned into selects in the common predecessor.
* The conditional branch in the predecessor is changed to branch unconditionally.

Examples
========

What does all this mean? Let's look at some examples.

Inconsistent fcmp / empty block handling
----------------------------------------

define float @fn1(float %a, float %b, float %c) {
entry:
  %cmp1 = fcmp une float %a, %b
  br i1 %cmp1, label %bb1, label %end
bb1:
  %cmp2 = fcmp une float %a, %c
  br i1 %cmp2, label %bb2, label %end
bb2:
  br label %end
end:
  %phi = phi float [ 0.0, %entry ], [ 1.0, %bb1 ], [ 2.0, %bb2 ]
  ret float %phi
}

define float @fn2(float %a, float %b, float %c) {
entry:
  %cmp1 = fcmp une float %a, %b
  %cmp2 = fcmp une float %a, %c
  br i1 %cmp1, label %bb1, label %end
bb1:
  br i1 %cmp2, label %bb2, label %end
bb2:
  br label %end
end:
  %phi = phi float [ 0.0, %entry ], [ 1.0, %bb1 ], [ 2.0, %bb2 ]
  ret float %phi
}

In @fn1 no speculation happens, because the two functions which could possibly
do it are SpeculativelyExecuteBB, which can't do it because %cmp2 is an fcmp,
and SimplifyCondBranchToCondBranch, which can't do it because bb1 isn't empty.
However in @fn2 we can now speculate bb1, because %cmp2 has been moved so bb1 is
now empty, and SimplifyCondBranchToCondBranch, which doesn't care about fcmp,
can speculate it.
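To make the @fn2 case concrete, after SimplifyCondBranchToCondBranch speculates
bb1 the function ends up looking roughly like this (a hand-written sketch of the
expected shape, not verified pass output; the function and value names are
mine). The branch conditions are merged with a logical operation and the phi
entry for the bb1 path becomes a select in the predecessor, as described above:

define float @fn2_speculated(float %a, float %b, float %c) {
entry:
  %cmp1 = fcmp une float %a, %b
  %cmp2 = fcmp une float %a, %c
  ; merged branch condition for entry and bb1
  %and = and i1 %cmp1, %cmp2
  ; phi entry for the bb1 path, now a select in the predecessor
  %sel = select i1 %cmp1, float 1.0, float 0.0
  br i1 %and, label %bb2, label %end
bb2:
  br label %end
end:
  %phi = phi float [ %sel, %entry ], [ 2.0, %bb2 ]
  ret float %phi
}

(Further simplification may then fold bb2 and turn the remaining phi into
another select.)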
Large impact due to which function speculates what
--------------------------------------------------

define i32 @fn3(i1 %c, i32 %arg) {
entry:
  br i1 %c, label %bb1, label %end
bb1:
  %cmp = icmp eq i32 %arg, 0
  br i1 %cmp, label %bb2, label %end
bb2:
  br label %end
end:
  %phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
  ret i32 %phi
}

define i32 @fn4(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  br i1 %c, label %bb1, label %end
bb1:
  br i1 %cmp, label %bb2, label %end
bb2:
  br label %end
end:
  %phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
  ret i32 %phi
}

define i32 @fn5(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  br i1 %c, label %bb1, label %end
bb1:
  br i1 %cmp, label %bb2, label %end
end:
  %phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
  ret i32 %phi
bb2:
  br label %end
}

In @fn3:
* FoldBranchToCommonDest speculates %bb1 into %entry
* FoldTwoEntryPHINode turns the phi into a select in %entry
* %bb2 is now an empty block which serves no purpose, and is removed

In @fn4:
* SpeculativelyExecuteBB speculates %bb2 into %bb1
* SimplifyCondBranchToTwoReturns makes %bb1 select then return, instead of
  conditionally branching to two returns

In @fn5:
* SpeculativelyExecuteBB speculates %bb2 into %bb1, inserting a select
* FoldTwoEntryPHINode turns the phi in %end into a select in %entry

So what we end up with is:
* @fn3: cmp; and; select
* @fn4: cmp; conditional branch to select
* @fn5: cmp; select; select

of which @fn3 is the best.
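To illustrate the difference, the three final shapes look roughly like this
(hand-written sketches of the shapes described above, not verified pass output;
the function and value names are mine):

; @fn3: cmp; and; select
define i32 @fn3_result(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  %and = and i1 %c, %cmp
  %sel = select i1 %and, i32 927, i32 42
  ret i32 %sel
}

; @fn4: cmp; conditional branch to select
define i32 @fn4_result(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  br i1 %c, label %bb1, label %end
bb1:
  %sel = select i1 %cmp, i32 927, i32 42
  ret i32 %sel
end:
  ret i32 42
}

; @fn5: cmp; select; select
define i32 @fn5_result(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  %sel1 = select i1 %cmp, i32 927, i32 42
  %sel2 = select i1 %c, i32 %sel1, i32 42
  ret i32 %sel2
}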
Inconsistent thresholds
-----------------------

define float @fn6(i32 %a, i32 %b, float %c) {
entry:
  %cmp1 = icmp ne i32 %a, %b
  br i1 %cmp1, label %bb1, label %end
bb1:
  %mul = fmul float %c, 3.0
  %add = fadd float %mul, 3.0
  br label %end
end:
  %phi = phi float [ %add, %bb1 ], [ 0.0, %entry ]
  ret float %phi
}

define float @fn7(i32 %a, i32 %b, float %c) {
entry:
  %cmp1 = icmp ne i32 %a, %b
  br i1 %cmp1, label %bb1, label %end
bb1:
  %div = fdiv float %c, 3.0
  %add = fadd float %div, 3.0
  br label %end
end:
  %phi = phi float [ %add, %bb1 ], [ 0.0, %entry ]
  ret float %phi
}

define void @fn8(i32 %a, i32 %b, float %c, i32* %p1, i32* %p2) {
entry:
  %cmp1 = icmp ne i32 %a, %b
  br i1 %cmp1, label %bb1, label %bb2
bb1:
  %div = fdiv float %c, 3.0
  %cmp2 = fcmp une float %div, 3.0
  br i1 %cmp2, label %bb2, label %end
bb2:
  store i32 0, i32* %p1
  br label %end
end:
  store i32 0, i32* %p2
  ret void
}

FoldTwoEntryPHINode will fold the phi (and so speculate bb1) in @fn6 but not
@fn7, as fmul has cost TCC_Basic, so fmul + fadd is below the threshold, but
fdiv has cost TCC_Expensive, so fdiv + fadd is above the threshold. In @fn8
however FoldBranchToCommonDest just checks the number of instructions, so it has
no problem at all speculating bb1.

Questions
=========

The questions arising from all of this:
* Firstly, is there some alternate way of dealing with @perlin_example above
  (e.g. doing something different in jump threading) that will magically solve
  my problems without having to deal with any of this?
* The inconsistent FCmpInst handling is really odd. Would anyone object if I
  removed that check from SpeculativelyExecuteBB?
* What exactly is the right threshold for doing all of this stuff?
  FoldTwoEntryPHINode takes into account the actual instruction cost, so that
  would seem like the most accurate.
* It'd be really nice to cut down the number of functions doing this. I've got
  an attempt at making FoldBranchToCommonDest do everything that
  SimplifyCondBranchToCondBranch does, so we can get rid of
  SimplifyCondBranchToCondBranch, as those two functions are the ones with the
  most overlap, but there are some major problems:
  * Doing this changes the order in which blocks are speculated, giving worse
    results in some cases (but better in others).
  * SimplifyCondBranchToCondBranch must turn phis into selects,
    FoldBranchToCommonDest cannot, and FoldBranchToCommonDest followed by the
    rest of the functions that can turn phis into selects sometimes can't
    either.
  So: any ideas on how cutting down the number of functions could be done?
* Alternatively, instead of cutting down on the number of functions, we could
  try to common out things into shared functions.
* Any other ideas for how to make this less of a horrible mess?
* Would people mind if I just ignore all of this and make it into even more of a
  mess while trying to fix how we handle @perlin_example? (The correct answer is
  "No, don't do that", but it would certainly make my job easier.) Or,
  alternatively, does anyone else want to try sorting out this mess?

John
Krzysztof Parzyszek via llvm-dev
2018-Nov-15 19:07 UTC
[llvm-dev] [RFC] SimplifyCFG block speculation is a horribly inconsistent mess
On 11/15/2018 10:23 AM, John Brawn via llvm-dev wrote:
>
> * Any other ideas for how to make this less of a horrible mess?

My opinion is that we should rewrite it. And not the speculation part, but the
whole thing.

The first problem is that there is really no clear description of what this pass
does. "Simplify CFG" is a catch-all term that includes every little change that
could be argued to simplify things in certain circumstances, and this has led to
what we have now. We should specify what the callers can expect to happen.
"Expect the CFG to be simplified" is not an adequate specification. This is not
a pass where we should be making frequent changes, since anything that happens
here can have far-reaching consequences.

There are certain types of CFG transformations that are needed more frequently,
like folding a branch to a branch, or eliminating empty blocks (as long as it
doesn't damage loop structure). At the same time, things like generating switch
statements do not need to be attempted every time. Subtypes of CFG optimizations
should be identified (e.g. cleanups, aggressive branch elimination, etc.) and
there should be a parameter by which the caller would indicate which of these
optimization groups should be run.

The recursive algorithm is not the best choice here. For a graph like a CFG, the
entire structure of it is important. With the recursive nature of the code it
can be impossible to predict what it's going to do on any given graph, or to
determine why something predicted didn't actually happen. The algorithm should
have a clear structure: first X happens, then Y, etc.

We should motivate the CFG simplification by what makes subsequent optimizations
easier to implement and what exposes more opportunities to them, instead of
"what helps my benchmark". Maybe targets should have hooks to perform specific
modifications of the graph that may or may not be good for everyone.

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The
Linux Foundation