John Brawn via llvm-dev
2018-Nov-15 16:23 UTC
[llvm-dev] [RFC] SimplifyCFG block speculation is a horribly inconsistent mess
Background
==========

This is part of my long-running quest to better optimise

  float min_element_example(float *data, int size) {
    return *std::min_element(data, data+size);
  }

which requires improvements to GVN, which first requires a side-quest to delay
phi-to-select generation until after it (see https://reviews.llvm.org/D50723 and
my earlier RFC "Delaying phi-to-select transformation until later in the pass
pipeline"), which in turn requires side-side-quests to fix everything that gets
worse when we do that.

There's a bad interaction between jump threading and SimplifyCFG's phi-to-select
generation, where jump threading followed by phi-to-select gives worse results
than phi-to-select followed by jump threading, specifically in the function:

define double @perlin_example(i32 %h, double %x, double %y, double %z) {
entry:
  %cmp = icmp slt i32 %h, 8
  br i1 %cmp, label %cond.end, label %cond.false
cond.false:
  br label %cond.end
cond.end:
  %cond = phi double [ %y, %cond.false ], [ %x, %entry ]
  %cmp1 = icmp slt i32 %h, 4
  br i1 %cmp1, label %cond.end9, label %cond.false3
cond.false3:
  %cmp4 = icmp eq i32 %h, 12
  br i1 %cmp4, label %cond.end9, label %cond.false6
cond.false6:
  br label %cond.end9
cond.end9:
  %cond10 = phi double [ %z, %cond.false6 ], [ %y, %cond.end ], [ %x, %cond.false3 ]
  %add = fadd double %cond, %cond10
  ret double %add
}

(heavily reduced from SingleSource/Benchmarks/Misc/perlin.c)

Here opt -jump-threading -simplifycfg gives worse results than
opt -simplifycfg -jump-threading, and we also get these worse results for
opt -O3 when D50723 is applied.
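For illustration, the fully select-converted form that good phi-to-select
conversion would presumably end up with for @perlin_example looks roughly like
this (a hand-written sketch of the desired shape rather than actual pass output;
the function and value names here are mine):

define double @perlin_example_selects(i32 %h, double %x, double %y, double %z) {
entry:
  ; %cond = h < 8 ? x : y
  %cmp = icmp slt i32 %h, 8
  %cond = select i1 %cmp, double %x, double %y
  ; %cond10 = h < 4 ? y : (h == 12 ? x : z)
  %cmp1 = icmp slt i32 %h, 4
  %cmp4 = icmp eq i32 %h, 12
  %sel = select i1 %cmp4, double %x, double %z
  %cond10 = select i1 %cmp1, double %y, double %sel
  %add = fadd double %cond, %cond10
  ret double %add
}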
I think the correct way to fix this is to improve how SimplifyCFG does block
speculation, which is currently a horribly inconsistent mess, and I think I need
to fix the mess before making any other changes. It's that side-side-side-quest
that I'm talking about here.

Overview
========

Some of what SimplifyCFG does is speculate blocks, i.e. move instructions from a
block into one of its predecessors. There are five functions (that I've found)
that do this, with a lot of overlap in what they handle, and also some are
restricted in ways that others aren't, without much consistency:

FoldBranchToCommonDest:
* BB can have any number of predecessors; the first valid predecessor is chosen.
* BB can have any number of successors, but it has to share a successor with one
  of its predecessors.
* The condition must be an instruction in BB.
* BB can contain at most 1+BonusInstThreshold (non-branch) instructions.
* Phis are updated rather than turned into selects, so this transformation
  cannot happen if the phi cannot be updated in this way.
* The branch condition in the predecessor is merged with BB's using logical
  operations.

HoistThenElseCodeToIf:
* BB1 and BB2 must have the same predecessor, and be the only successors of that
  predecessor.
* The successors of BB1 and BB2 do not matter (except see phi handling below).
* No restriction on the condition.
* Only instructions that are common to BB1 and BB2 are speculated.
* If the terminators of BB1 and BB2 are speculated (i.e. BB1 and BB2 have the
  same successors), then phis in BB1 and BB2's successors are turned into
  selects in their predecessor.
* The branch in the predecessor is turned into that of BB1/BB2 if it was hoisted.

SpeculativelyExecuteBB:
* BB must have one predecessor.
* BB must have one successor, which is the other successor of its predecessor.
* The condition being speculated cannot come from an FCmpInst.
* BB can contain at most 1 (non-branch) instruction.
* Phis in BB's successor are turned into selects in BB's predecessor.
* BB's predecessor is changed to branch unconditionally to its other successor.

SimplifyCondBranchToCondBranch:
* BB can have any number of predecessors; the first valid predecessor is chosen.
* BB can have any number of successors, but it has to share a successor with the
  predecessor.
* No restriction on the condition.
* BB must be empty (except for the branch).
* Phis in the shared successor are turned into selects in the predecessor.
* The branch condition in the predecessor is merged with BB's using logical
  operations.

FoldTwoEntryPHINode:
* BB1 and/or BB2 must have the same predecessor, and must form a simple if or
  if/else.
* BB1 and BB2 must have the same successor, or we only have BB1, which has one
  successor.
* No restriction on the condition.
* Instructions with total cost up to PHINodeFoldingThreshold * TCC_Basic can be
  speculated.
* Phis in the successor are turned into selects in the common predecessor.
* The conditional branch in the predecessor is changed to branch unconditionally.

Examples
========

What does all this mean? Let's look at some examples.

Inconsistent fcmp / empty block handling
----------------------------------------

define float @fn1(float %a, float %b, float %c) {
entry:
  %cmp1 = fcmp une float %a, %b
  br i1 %cmp1, label %bb1, label %end
bb1:
  %cmp2 = fcmp une float %a, %c
  br i1 %cmp2, label %bb2, label %end
bb2:
  br label %end
end:
  %phi = phi float [ 0.0, %entry ], [ 1.0, %bb1 ], [ 2.0, %bb2 ]
  ret float %phi
}

define float @fn2(float %a, float %b, float %c) {
entry:
  %cmp1 = fcmp une float %a, %b
  %cmp2 = fcmp une float %a, %c
  br i1 %cmp1, label %bb1, label %end
bb1:
  br i1 %cmp2, label %bb2, label %end
bb2:
  br label %end
end:
  %phi = phi float [ 0.0, %entry ], [ 1.0, %bb1 ], [ 2.0, %bb2 ]
  ret float %phi
}

In @fn1 no speculation happens, because the two functions which could possibly
do it are SpeculativelyExecuteBB, which can't do it because %cmp2 is an fcmp,
and SimplifyCondBranchToCondBranch, which can't do it because bb1 isn't empty.
However in @fn2 we can now speculate bb1, because %cmp2 has been moved so bb1 is
now empty, and SimplifyCondBranchToCondBranch, which doesn't care about fcmp,
can speculate it.
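To make the @fn2 case concrete, after SimplifyCondBranchToCondBranch speculates
bb1 the function ends up looking roughly like this (a hand-written sketch of the
expected shape, not verified pass output; the function and value names are
mine). The branch conditions are merged with a logical operation and the phi
entry for the bb1 path becomes a select in the predecessor, as described above:

define float @fn2_speculated(float %a, float %b, float %c) {
entry:
  %cmp1 = fcmp une float %a, %b
  %cmp2 = fcmp une float %a, %c
  ; merged branch condition for entry and bb1
  %and = and i1 %cmp1, %cmp2
  ; phi entry for the bb1 path, now a select in the predecessor
  %sel = select i1 %cmp1, float 1.0, float 0.0
  br i1 %and, label %bb2, label %end
bb2:
  br label %end
end:
  %phi = phi float [ %sel, %entry ], [ 2.0, %bb2 ]
  ret float %phi
}

(Further simplification may then fold bb2 and turn the remaining phi into
another select.)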
Large impact due to which function speculates what
--------------------------------------------------

define i32 @fn3(i1 %c, i32 %arg) {
entry:
  br i1 %c, label %bb1, label %end
bb1:
  %cmp = icmp eq i32 %arg, 0
  br i1 %cmp, label %bb2, label %end
bb2:
  br label %end
end:
  %phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
  ret i32 %phi
}

define i32 @fn4(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  br i1 %c, label %bb1, label %end
bb1:
  br i1 %cmp, label %bb2, label %end
bb2:
  br label %end
end:
  %phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
  ret i32 %phi
}

define i32 @fn5(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  br i1 %c, label %bb1, label %end
bb1:
  br i1 %cmp, label %bb2, label %end
end:
  %phi = phi i32 [ 927, %bb2 ], [ 42, %bb1 ], [ 42, %entry ]
  ret i32 %phi
bb2:
  br label %end
}

In @fn3:
* FoldBranchToCommonDest speculates %bb1 into %entry
* FoldTwoEntryPHINode turns the phi into a select in %entry
* %bb2 is now an empty block which serves no purpose, and is removed

In @fn4:
* SpeculativelyExecuteBB speculates %bb2 into %bb1
* SimplifyCondBranchToTwoReturns makes %bb1 select then return, instead of
  conditionally branching to two returns

In @fn5:
* SpeculativelyExecuteBB speculates %bb2 into %bb1, inserting a select
* FoldTwoEntryPHINode turns the phi in %end into a select in %entry

So what we end up with is:
* @fn3: cmp; and; select
* @fn4: cmp; conditional branch to select
* @fn5: cmp; select; select

of which @fn3 is the best.
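To illustrate the difference, the three final shapes look roughly like this
(hand-written sketches of the shapes described above, not verified pass output;
the function and value names are mine):

; @fn3: cmp; and; select
define i32 @fn3_result(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  %and = and i1 %c, %cmp
  %sel = select i1 %and, i32 927, i32 42
  ret i32 %sel
}

; @fn4: cmp; conditional branch to select
define i32 @fn4_result(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  br i1 %c, label %bb1, label %end
bb1:
  %sel = select i1 %cmp, i32 927, i32 42
  ret i32 %sel
end:
  ret i32 42
}

; @fn5: cmp; select; select
define i32 @fn5_result(i1 %c, i32 %arg) {
entry:
  %cmp = icmp eq i32 %arg, 0
  %sel1 = select i1 %cmp, i32 927, i32 42
  %sel2 = select i1 %c, i32 %sel1, i32 42
  ret i32 %sel2
}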
Inconsistent thresholds
-----------------------

define float @fn6(i32 %a, i32 %b, float %c) {
entry:
  %cmp1 = icmp ne i32 %a, %b
  br i1 %cmp1, label %bb1, label %end
bb1:
  %mul = fmul float %c, 3.0
  %add = fadd float %mul, 3.0
  br label %end
end:
  %phi = phi float [ %add, %bb1 ], [ 0.0, %entry ]
  ret float %phi
}

define float @fn7(i32 %a, i32 %b, float %c) {
entry:
  %cmp1 = icmp ne i32 %a, %b
  br i1 %cmp1, label %bb1, label %end
bb1:
  %div = fdiv float %c, 3.0
  %add = fadd float %div, 3.0
  br label %end
end:
  %phi = phi float [ %add, %bb1 ], [ 0.0, %entry ]
  ret float %phi
}

define void @fn8(i32 %a, i32 %b, float %c, i32* %p1, i32* %p2) {
entry:
  %cmp1 = icmp ne i32 %a, %b
  br i1 %cmp1, label %bb1, label %bb2
bb1:
  %div = fdiv float %c, 3.0
  %cmp2 = fcmp une float %div, 3.0
  br i1 %cmp2, label %bb2, label %end
bb2:
  store i32 0, i32* %p1
  br label %end
end:
  store i32 0, i32* %p2
  ret void
}

FoldTwoEntryPHINode will fold the phi (and so speculate bb1) in @fn6 but not
@fn7, as fmul has cost TCC_Basic, so fmul + fadd is below the threshold, but
fdiv has cost TCC_Expensive, so fdiv + fadd is above the threshold. In @fn8
however FoldBranchToCommonDest just checks the number of instructions, so it has
no problem at all speculating bb1.

Questions
=========

The questions arising from all of this:
* Firstly, is there some alternate way of dealing with @perlin_example above
  (e.g. doing something different in jump threading) that will magically solve
  my problems without having to deal with any of this?
* The inconsistent FCmpInst handling is really odd. Would anyone object if I
  removed that check from SpeculativelyExecuteBB?
* What exactly is the right threshold for doing all of this stuff?
  FoldTwoEntryPHINode takes into account the actual instruction cost, so that
  would seem like the most accurate.
* It'd be really nice to cut down the number of functions doing this. I've got
  an attempt at making FoldBranchToCommonDest do everything that
  SimplifyCondBranchToCondBranch does, so we can get rid of
  SimplifyCondBranchToCondBranch, as those two functions are the ones with the
  most overlap, but there are some major problems:
  * Doing this changes the order in which blocks are speculated, giving worse
    results in some cases (but better in others).
  * SimplifyCondBranchToCondBranch must turn phis into selects,
    FoldBranchToCommonDest cannot, and FoldBranchToCommonDest followed by the
    rest of the functions that can turn phis into selects sometimes can't
    either.
  So: any ideas on how cutting down the number of functions could be done?
* Alternatively, instead of cutting down on the number of functions, we could
  try to common out things into shared functions.
* Any other ideas for how to make this less of a horrible mess?
* Would people mind if I just ignore all of this and make it into even more of a
  mess while trying to fix how we handle @perlin_example? (The correct answer is
  "No, don't do that", but it would certainly make my job easier.) Or,
  alternatively, does anyone else want to try sorting out this mess?

John
Krzysztof Parzyszek via llvm-dev
2018-Nov-15 19:07 UTC
[llvm-dev] [RFC] SimplifyCFG block speculation is a horribly inconsistent mess
On 11/15/2018 10:23 AM, John Brawn via llvm-dev wrote:
>
> * Any other ideas for how to make this less of a horrible mess?

My opinion is that we should rewrite it. And not the speculation part, but the
whole thing.

The first problem is that there is really no clear description of what this pass
does. "Simplify CFG" is a catch-all term that includes every little change that
could be argued to simplify things in certain circumstances, and this has led to
what we have now. We should specify what the callers can expect to happen.
"Expect the CFG to be simplified" is not an adequate specification. This is not
a pass where we should be making frequent changes, since anything that happens
here can have far-reaching consequences.

There are certain types of CFG transformations that are needed more frequently,
like folding a branch to a branch, or eliminating empty blocks (as long as it
doesn't damage loop structure). At the same time, things like generating switch
statements do not need to be attempted every time. Subtypes of CFG optimizations
should be identified (e.g. cleanups, aggressive branch elimination, etc.) and
there should be a parameter by which the caller would indicate which of these
optimization groups should be run.

The recursive algorithm is not the best choice here. For a graph like a CFG, the
entire structure of it is important. With the recursive nature of the code it
can be impossible to predict what it's going to do on any given graph, or to
determine why something predicted didn't actually happen. The algorithm should
have a clear structure: first X happens, then Y, etc.

We should motivate the CFG simplification by what makes subsequent optimizations
easier to implement and what exposes more opportunities to them, instead of
"what helps my benchmark". Maybe targets should have hooks to perform specific
modifications of the graph that may or may not be good for everyone.

-Krzysztof

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The
Linux Foundation