John Brawn via llvm-dev
2018-Nov-15 16:23 UTC
[llvm-dev] [RFC] SimplifyCFG block speculation is a horribly inconsistent mess
Background
=========
This is part of my long-running quest to better optimise
float min_element_example(float *data, int size) {
return *std::min_element(data, data+size);
}
which requires improvements to GVN, which first requires a side-quest to delay
phi-to-select generation until after it (see https://reviews.llvm.org/D50723 and
my earlier RFC "Delaying phi-to-select transformation until later in the pass
pipeline"), which requires side-side-quests to fix everything that gets worse
when we do that.
There's some bad interaction between jump threading and simplifycfg
phi-to-select generation, where jump threading then phi-to-select gives worse
results than phi-to-select then jump threading, specifically in the function:
define double @perlin_example(i32 %h, double %x, double %y, double %z) {
entry:
%cmp = icmp slt i32 %h, 8
br i1 %cmp, label %cond.end, label %cond.false
cond.false:
br label %cond.end
cond.end:
%cond = phi double [ %y, %cond.false ], [ %x, %entry ]
%cmp1 = icmp slt i32 %h, 4
br i1 %cmp1, label %cond.end9, label %cond.false3
cond.false3:
%cmp4 = icmp eq i32 %h, 12
br i1 %cmp4, label %cond.end9, label %cond.false6
cond.false6:
br label %cond.end9
cond.end9:
%cond10 = phi double [ %z, %cond.false6 ], [ %y, %cond.end ], [ %x, %cond.false3 ]
%add = fadd double %cond, %cond10
ret double %add
}
(heavily reduced from SingleSource/Benchmarks/Misc/perlin.c) where
opt -jump-threading -simplifycfg
gives worse results than
opt -simplifycfg -jump-threading
and we also get these worse results for opt -O3 when D50723 is applied.
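For reference, the better result (simplifycfg before jump threading) is essentially a
handful of selects feeding the fadd. Roughly, and written by hand rather than taken
from actual opt output (so the exact names and operand order will differ), it looks
like:

define double @perlin_example_good(i32 %h, double %x, double %y, double %z) {
entry:
  %cmp = icmp slt i32 %h, 8
  %cond = select i1 %cmp, double %x, double %y
  %cmp1 = icmp slt i32 %h, 4
  %cmp4 = icmp eq i32 %h, 12
  %sel = select i1 %cmp4, double %x, double %z
  %cond10 = select i1 %cmp1, double %y, double %sel
  %add = fadd double %cond, %cond10
  ret double %add
}

i.e. all of the control flow turned into selects.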
I think the correct way to fix this is to improve how SimplifyCFG does block
speculation, which is currently a horribly inconsistent mess. I think I need to
fix that mess before I make any changes, and it's that side-side-side-quest that
I'm talking about here.
Overview
=======
Some of what SimplifyCFG does is speculate blocks, i.e. move instructions from
a block into one of its predecessors. There are five functions (that I've found)
that do this, with a lot of overlap in what they handle, and some restricted in
ways that others aren't, without much consistency. After each description below
I've added a rough hand-written before/after sketch of what it does:
FoldBranchToCommonDest:
* BB can have any number of predecessors; the first valid predecessor is chosen.
* BB can have any number of successors, but it has to share a successor with
one of its predecessors.
* The condition must be an instruction in BB.
* BB can contain at most 1+BonusInstThreshold (non-branch) instructions.
* Phis are updated rather than turned into selects, so this transformation is
skipped if a phi cannot be updated in this way.
* Branch condition in predecessor merged with BB's using logical operations.
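Sketch (hand-written; the function and value names are made up, and %b here is the
single bonus instruction):

define i32 @fbtcd_before(i1 %a, i32 %x) {
pred:
  br i1 %a, label %bb, label %common
bb:
  %b = icmp slt i32 %x, 0
  br i1 %b, label %other, label %common
other:
  ret i32 1
common:
  %phi = phi i32 [ 0, %pred ], [ 0, %bb ]
  ret i32 %phi
}

define i32 @fbtcd_after(i1 %a, i32 %x) {
pred:
  %b = icmp slt i32 %x, 0    ; bonus instruction hoisted into the predecessor
  %and = and i1 %a, %b       ; conditions merged with a logical operation
  br i1 %and, label %other, label %common
other:
  ret i32 1
common:
  ret i32 0                  ; the phi could only be updated because both incoming
                             ; values along the folded edges matched
}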
HoistThenElseCodeToIf:
* BB1 and BB2 must have the same predecessor, and be the only successors of
that predecessor.
* The successors of BB1 and BB2 do not matter (except see phi handling below).
* No restriction on condition.
* Only instructions that are common to BB1 and BB2 are speculated.
* If the terminators of BB1 and BB2 are speculated (i.e. BB1 and BB2 have the
same successors), then phis in BB1 and BB2's successors are turned into selects
in their predecessor.
* Branch in predecessor turned into that of BB1/BB2 if it was hoisted.
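Sketch (hand-written, names made up for illustration); hoisting stops at the first
pair of instructions that differ:

define i32 @hoist_before(i1 %c, i32 %a) {
entry:
  br i1 %c, label %then, label %else
then:
  %x1 = add i32 %a, 1    ; identical in both successors
  %y1 = mul i32 %x1, 3
  br label %end
else:
  %x2 = add i32 %a, 1    ; identical in both successors
  %y2 = mul i32 %x2, 5
  br label %end
end:
  %phi = phi i32 [ %y1, %then ], [ %y2, %else ]
  ret i32 %phi
}

define i32 @hoist_after(i1 %c, i32 %a) {
entry:
  %x = add i32 %a, 1     ; the common instruction, hoisted
  br i1 %c, label %then, label %else
then:
  %y1 = mul i32 %x, 3
  br label %end
else:
  %y2 = mul i32 %x, 5
  br label %end
end:
  %phi = phi i32 [ %y1, %then ], [ %y2, %else ]
  ret i32 %phi
}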
SpeculativelyExecuteBB:
* BB must have one predecessor.
* BB must have one successor, which is the other successor of its predecessor.
* The condition being speculated on cannot be an FCmpInst.
* BB can contain at most 1 (non-branch) instruction.
* Phis in BB's successor are turned into selects in BB's predecessor.
* BB's predecessor changed to branch unconditionally to its other successor.
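Sketch (hand-written, names made up for illustration):

define i32 @spec_before(i1 %c, i32 %a) {
entry:
  br i1 %c, label %bb, label %end
bb:
  %v = add i32 %a, 1                    ; the single speculatable instruction
  br label %end
end:
  %phi = phi i32 [ %v, %bb ], [ %a, %entry ]
  ret i32 %phi
}

define i32 @spec_after(i1 %c, i32 %a) {
entry:
  %v = add i32 %a, 1                    ; speculated into the predecessor
  %sel = select i1 %c, i32 %v, i32 %a   ; phi in the successor becomes a select
  br label %end
end:
  ret i32 %sel
}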
SimplifyCondBranchToCondBranch:
* BB can have any number of predecessors; the first valid predecessor is chosen.
* BB can have any number of successors, but it has to share a successor with
the predecessor.
* No restriction on condition.
* BB must be empty (except for the branch).
* Phis in the shared successor are turned into selects in the predecessor.
* Branch condition in predecessor merged with BB's using logical operations.
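Sketch (hand-written, names made up for illustration); note that because BB is empty
its condition can come from anywhere, including an fcmp:

define float @scbtcb_before(i1 %a, i1 %b) {
pred:
  br i1 %a, label %bb, label %common
bb:                                         ; empty apart from the branch
  br i1 %b, label %other, label %common
other:
  ret float 2.0
common:
  %phi = phi float [ 0.0, %pred ], [ 1.0, %bb ]
  ret float %phi
}

define float @scbtcb_after(i1 %a, i1 %b) {
pred:
  %and = and i1 %a, %b                      ; conditions merged with a logical operation
  %sel = select i1 %a, float 1.0, float 0.0 ; phi in the shared successor becomes a select
  br i1 %and, label %other, label %common
other:
  ret float 2.0
common:
  ret float %sel
}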
FoldTwoEntryPHINode:
* BB1 and BB2 (or just BB1) must have the same predecessor, and must form a
simple if or if/else.
* BB1 and BB2 must have the same successor, or there is only BB1, which has a
single successor.
* No restriction on condition.
* Instructions with a total cost of at most PHINodeFoldingThreshold * TCC_Basic
can be speculated.
* Phis in the successor are turned into selects in the common predecessor.
* Conditional branch in the predecessor changed to an unconditional branch.
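Sketch (hand-written, names made up for illustration), for the if/else form:

define i32 @ftep_before(i1 %c, i32 %a) {
entry:
  br i1 %c, label %then, label %else
then:
  %v1 = add i32 %a, 1
  br label %end
else:
  %v2 = sub i32 %a, 1
  br label %end
end:
  %phi = phi i32 [ %v1, %then ], [ %v2, %else ]
  ret i32 %phi
}

define i32 @ftep_after(i1 %c, i32 %a) {
entry:
  %v1 = add i32 %a, 1                     ; both sides speculated, if cheap enough
  %v2 = sub i32 %a, 1
  %phi = select i1 %c, i32 %v1, i32 %v2   ; the two-entry phi becomes a select
  ret i32 %phi
}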
Examples
=======
What does all this mean? Let's look at some examples:
Inconsistent fcmp / empty block handling
----------------------------------------
define float @fn1(float %a, float %b, float %c) {
entry:
%cmp1 = fcmp une float %a, %b
br i1 %cmp1, label %bb1, label %end
bb1:
%cmp2 = fcmp une float %a, %c
br i1 %cmp2, label %bb2, label %end
bb2:
br label %end
end:
%phi = phi float [ 0.0, %entry ], [ 1.0, %bb1 ], [ 2.0, %bb2 ]
ret float %phi
}
define float @fn2(float %a, float %b, float %c) {
entry:
%cmp1 = fcmp une float %a, %b
%cmp2 = fcmp une float %a, %c
br i1 %cmp1, label %bb1, label %end
bb1:
br i1 %cmp2, label %bb2, label %end
bb2:
br label %end
end:
%phi = phi float [ 0.0, %entry ], [ 1.0, %bb1 ], [ 2.0, %bb2 ]
ret float %phi
}
In @fn1 no speculation happens, because the two functions which could possibly
do it are SpeculativelyExecuteBB, which can't because %cmp2 is an fcmp, and
SimplifyCondBranchToCondBranch, which can't because bb1 isn't empty. However in
@fn2 we can now speculate bb1: %cmp2 has been moved so bb1 is now empty, and
SimplifyCondBranchToCondBranch, which doesn't care about fcmp, can speculate it.
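Roughly, @fn2 then ends up as something like this (a hand-written sketch, not actual
opt output; later simplifycfg iterations may well fold %bb2 away too):

define float @fn2_speculated(float %a, float %b, float %c) {
entry:
  %cmp1 = fcmp une float %a, %b
  %cmp2 = fcmp une float %a, %c
  %and = and i1 %cmp1, %cmp2
  %sel = select i1 %cmp1, float 1.0, float 0.0
  br i1 %and, label %bb2, label %end
bb2:
  br label %end
end:
  %phi = phi float [ %sel, %entry ], [ 2.0, %bb2 ]
  ret float %phi
}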
Large impact due to which function speculating what
---------------------------------------------------
define i32 @fn3(i1 %c, i32 %arg) {
entry:
br i1 %c, label %bb1, label %end
bb1:
%cmp = icmp eq i32 %arg, 0
br i1 %cmp, label %bb2, label %end
bb2:
br label %end
end:
%phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
ret i32 %phi
}
define i32 @fn4(i1 %c, i32 %arg) {
entry:
%cmp = icmp eq i32 %arg, 0
br i1 %c, label %bb1, label %end
bb1:
br i1 %cmp, label %bb2, label %end
bb2:
br label %end
end:
%phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
ret i32 %phi
}
define i32 @fn5(i1 %c, i32 %arg) {
entry:
%cmp = icmp eq i32 %arg, 0
br i1 %c, label %bb1, label %end
bb1:
br i1 %cmp, label %bb2, label %end
end:
%phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
ret i32 %phi
bb2:
br label %end
}
In @fn3:
* FoldBranchToCommonDest speculates %bb1 in %entry
* FoldTwoEntryPHINode turns the phi into a select in %entry
* %bb2 is now an empty block which serves no purpose, and is removed
In @fn4:
* SpeculativelyExecuteBB speculates %bb2 in %bb1
* SimplifyCondBranchToTwoReturns makes %bb1 select then return, instead of
conditionally branching to two returns
In @fn5:
* SpeculativelyExecuteBB speculates %bb2 in %bb1, inserting a select
* FoldTwoEntryPHINode turns the phi in %end into a select in %entry
What all this means is that we end up with:
* @fn3: cmp; and; select
* @fn4: cmp; conditional branch to select
* @fn5: cmp; select; select
of which @fn3 is the best.
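Written out as IR, those three end results look roughly like this (hand-written
sketches of the shape of the output, not actual opt output):

; @fn3: cmp; and; select
define i32 @fn3_result(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  %and = and i1 %c, %cmp
  %sel = select i1 %and, i32 927, i32 42
  ret i32 %sel
}

; @fn4: cmp; conditional branch to select
define i32 @fn4_result(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  br i1 %c, label %bb1, label %end
bb1:
  %sel = select i1 %cmp, i32 927, i32 42
  ret i32 %sel
end:
  ret i32 42
}

; @fn5: cmp; select; select
define i32 @fn5_result(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  %sel1 = select i1 %cmp, i32 927, i32 42
  %sel2 = select i1 %c, i32 %sel1, i32 42
  ret i32 %sel2
}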
Inconsistent thresholds
-----------------------
define float @fn6(i32 %a, i32 %b, float %c) {
entry:
%cmp1 = icmp ne i32 %a, %b
br i1 %cmp1, label %bb1, label %end
bb1:
%mul = fmul float %c, 3.0
%add = fadd float %mul, 3.0
br label %end
end:
%phi = phi float [ %add, %bb1 ], [ 0.0, %entry ]
ret float %phi
}
define float @fn7(i32 %a, i32 %b, float %c) {
entry:
%cmp1 = icmp ne i32 %a, %b
br i1 %cmp1, label %bb1, label %end
bb1:
%div = fdiv float %c, 3.0
%add = fadd float %div, 3.0
br label %end
end:
%phi = phi float [ %add, %bb1 ], [ 0.0, %entry ]
ret float %phi
}
define void @fn8(i32 %a, i32 %b, float %c, i32* %p1, i32* %p2) {
entry:
%cmp1 = icmp ne i32 %a, %b
br i1 %cmp1, label %bb1, label %bb2
bb1:
%div = fdiv float %c, 3.0
%cmp2 = fcmp une float %div, 3.0
br i1 %cmp2, label %bb2, label %end
bb2:
store i32 0, i32* %p1
br label %end
end:
store i32 0, i32* %p2
ret void
}
FoldTwoEntryPHINode will fold the phi (and so speculate bb1) in @fn6 but not in
@fn7, as fmul has cost TCC_Basic, so fmul + fadd is within the threshold, but
fdiv has cost TCC_Expensive, so fdiv + fadd is above it. In @fn8, however,
FoldBranchToCommonDest just checks the number of instructions, so it has no
problem at all speculating bb1.
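For concreteness, assuming the default PHINodeFoldingThreshold of 2 and the usual
TCC_Basic = 1 / TCC_Expensive = 4 values (so a budget of 2):
* @fn6: fmul (1) + fadd (1) = 2, which fits within the budget, so the fold happens.
* @fn7: fdiv (4) + fadd (1) = 5, which doesn't, so it doesn't.
* @fn8: FoldBranchToCommonDest only counts instructions (against
1 + BonusInstThreshold), so the fdiv gets speculated anyway.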
Questions
========
The questions arising from all of this:
* Firstly, is there some alternate way of dealing with @perlin_example above
(e.g. doing something different in jump threading) that will magically solve my
problems without having to deal with any of this?
* The inconsistent FCmpInst handling is really odd. Would anyone object if I
removed that check from SpeculativelyExecuteBB?
* What exactly is the right threshold for doing all of this stuff?
FoldTwoEntryPHINode takes into account the actual instruction cost, so that
would seem like the most accurate.
* It'd be really nice to cut down the number of functions doing this. I've got
an attempt at making FoldBranchToCommonDest do everything that
SimplifyCondBranchToCondBranch does, so we can get rid of
SimplifyCondBranchToCondBranch, as those two functions are the ones with the
most overlap, but there are some major problems:
  * Doing this changes the order in which blocks are speculated, giving worse
  results in some cases (but better in others).
  * SimplifyCondBranchToCondBranch must turn phis into selects,
  FoldBranchToCommonDest cannot, and FoldBranchToCommonDest followed by the rest
  of the functions that can turn phis into selects sometimes can't either.
So: any ideas on how cutting down the number of functions could be done?
* Alternatively, instead of cutting down the number of functions, we could try
to common things out into shared functions.
* Any other ideas for how to make this less of a horrible mess?
* Would people mind if I just ignore all of this and make it into even more of
a mess while trying to fix how we handle @perlin_example? (The correct answer is
"No, don't do that", but it would certainly make my job easier.) Or,
alternatively, does anyone else want to try sorting out this mess?
John
Krzysztof Parzyszek via llvm-dev
2018-Nov-15 19:07 UTC
[llvm-dev] [RFC] SimplifyCFG block speculation is a horribly inconsistent mess
On 11/15/2018 10:23 AM, John Brawn via llvm-dev wrote:
> * Any other ideas for how to make this less of a horrible mess?

My opinion is that we should rewrite it. And not just the speculation part, but the
whole thing.

The first problem is that there is really no clear description of what this pass
does. "Simplify CFG" is a catch-all term that includes every little change that
could be argued to simplify things in certain circumstances, and this has led to
what we have now. We should specify what the callers can expect to happen. "Expect
the CFG to be simplified" is not an adequate specification.

This is not a pass where we should be making frequent changes, since anything that
happens here can have far-reaching consequences. There are certain types of CFG
transformations that are needed more frequently, like folding a branch to a branch,
or eliminating empty blocks (as long as it doesn't damage loop structure). At the
same time, things like generating switch statements are not something that needs to
be attempted every time. Subtypes of CFG optimizations should be identified (e.g.
cleanups, aggressive branch elimination, etc.) and there should be a parameter by
which the caller would indicate which of these optimization groups should be run.

The recursive algorithm is not the best choice here. For a graph like a CFG, the
entire structure of it is important. With the recursive nature of the code it can be
impossible to predict what it's going to do on any given graph, or to determine why
something predicted didn't actually happen. The algorithm should have a clear
structure: first X happens, then Y, etc.

We should motivate the CFG simplification by what makes subsequent optimizations
easier to implement and what exposes more opportunities to them, instead of "what
helps my benchmark". Maybe targets should have hooks to perform specific
modifications of the graph that may or may not be good for everyone.

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The
Linux Foundation