thr3ads.net - llvm dev - [LLVMdev] Question regarding basic-block placement optimization [Oct 2011]

If this information is useful, please help other people find it:
Share via:

Chandler Carruth

2011-Oct-19 10:24 UTC

[LLVMdev] Question regarding basic-block placement optimization

On Tue, Oct 18, 2011 at 6:58 PM, Jakob Stoklund Olesen <stoklund at
2pi.dk>wrote:
>
> On Oct 18, 2011, at 5:22 PM, Chandler Carruth wrote:
>
> As for why it should be an IR pass, mostly because once the selection dag
>> runs through the code, we can never recover all of the freedom we have
at
>> the IR level. To start with, splicing MBBs around requires known about
the
>> terminators (which we only some of the time do), and it requires
re-writing
>> them a touch to account for the different fall-through pattern. To make
>> matters worse, at this point we don't have the nicely analyzable
'switch'
>> terminator (I think), and so the existing MBB placement code just bails
on
>> non-branch-exit blocks.
>>
>>
>> Those are all the wrong reasons for not doing the right thing.
>>
>
> Sorry, I'm not trying to do the wrong thing because of this...
Currently,
> it feels like a trade-off in terms of cost/benefit. It's not yet clear
to me
> that the benefit of doing this analysis in the CodeGen layer outweighs the
> cost and I was trying to clarify what the costs I perceive are.
>
>
> I think it's mostly about understanding how MBBs work.
>
Indeed, that seems to be the case. =D Thanks for explaining things below, it
helped me a lot.

> Ignoring calls and returns, most machines have three kinds of branches:
>
> 1. Unconditional
> 2. Conditional
> 3. Indirect.
>
> The AnalyzeBranch() function understands the first two kinds, so if that
> function returns false (as in it's false that it didn't succeed)
you can
> move the successors around, and you know that placing a successor
> immediately after the block and calling updateTerminator() will give you a
> fall-through.
>
> If AnalyzeBranch() fails, you can still check if the last instruction in
> the block is an unpredicated barrier. If so, it is still safe to move the
> successors around, but that block will never be a fall-through. The
> canFallThrough() function implements this check.
>
> If the last instruction in the block is predicated or not a barrier, you
> must keep it together with its layout successor. This should only happen in
> rare cases where it is necessary. For example, I am planning to lower
invoke
> instructions into call instructions that are terminators. This is necessary
> to accurately model control flow to landing pads. Such a call instruction
> must fall through to its layout successor.
>
> Some experimental targets don't implement AnalyzeBranch, so everything
> looks like an indirect branch. Those targets get the code placement they
> deserve.
>
> I am not claiming the API is awesome, but the information you need is
> there, and you have the same freedom as for IR.
>
> We explicitly designed the branch weights so switch lowering could annotate
> all the new branches with exact weights. It would be a shame to ignore that
> information.
>
> So the benefits are:
>
> - Profile-driven fall-through layout of lowered switches. That should be a
> pretty big deal.
> - Proper placement of split critical edges.
> - The ability to implement stuff like: "Don't put too many
branches in a
> fetch group, or you'll freak out the branch predictor".
>
These all seem like really good reasons, and your explanation helps me a
lot. I'll take a stab at re-implementing this on MBBs. In the mean time,
I've attached my patch with the IR-level patch as most of the interesting
logic will remain the same. Please be gentle, it's my first proper LLVM
pass, and my knowledge of optimization pass research & papers is limited.
The part I'm least pleased with is the computation of weight for each chain
in an SCC of chains, but so far I've not come up with anything better. =]
I've tested this on several ad-hoc test cases, but nothing really thorough.
Going to try moving it to the codegen layer first.


One question that remains, mostly to ensure I've understood you correctly:
For switches, it is represented as an N-ary conditional terminator, and the
targets of the switch can be freely intermingled with other MBBs?

I'll attach an updated patch to work on MBBs when I have one...
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111019/bbb9144a/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: block_placement_ir.patch
Type: text/x-patch
Size: 24170 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111019/bbb9144a/attachment.bin>

Chandler Carruth

2011-Oct-19 12:50 UTC

head link

[LLVMdev] Question regarding basic-block placement optimization

On Wed, Oct 19, 2011 at 3:24 AM, Chandler Carruth <chandlerc at
google.com>wrote:
> On Tue, Oct 18, 2011 at 6:58 PM, Jakob Stoklund Olesen <stoklund at
2pi.dk>wrote:
>
>>
>> On Oct 18, 2011, at 5:22 PM, Chandler Carruth wrote:
>>
>>  As for why it should be an IR pass, mostly because once the selection
>>> dag runs through the code, we can never recover all of the freedom
we have
>>> at the IR level. To start with, splicing MBBs around requires known
about
>>> the terminators (which we only some of the time do), and it
requires
>>> re-writing them a touch to account for the different fall-through
pattern.
>>> To make matters worse, at this point we don't have the nicely
analyzable
>>> 'switch' terminator (I think), and so the existing MBB
placement code just
>>> bails on non-branch-exit blocks.
>>>
>>>
>>> Those are all the wrong reasons for not doing the right thing.
>>>
>>
>> Sorry, I'm not trying to do the wrong thing because of this...
Currently,
>> it feels like a trade-off in terms of cost/benefit. It's not yet
clear to me
>> that the benefit of doing this analysis in the CodeGen layer outweighs
the
>> cost and I was trying to clarify what the costs I perceive are.
>>
>>
>> I think it's mostly about understanding how MBBs work.
>>
>
> Indeed, that seems to be the case. =D Thanks for explaining things below,
> it helped me a lot.
>
>
>> Ignoring calls and returns, most machines have three kinds of branches:
>>
>> 1. Unconditional
>> 2. Conditional
>> 3. Indirect.
>>
>> The AnalyzeBranch() function understands the first two kinds, so if
that
>> function returns false (as in it's false that it didn't
succeed) you can
>> move the successors around, and you know that placing a successor
>> immediately after the block and calling updateTerminator() will give
you a
>> fall-through.
>>
>> If AnalyzeBranch() fails, you can still check if the last instruction
in
>> the block is an unpredicated barrier. If so, it is still safe to move
the
>> successors around, but that block will never be a fall-through. The
>> canFallThrough() function implements this check.
>>
>> If the last instruction in the block is predicated or not a barrier,
you
>> must keep it together with its layout successor. This should only
happen in
>> rare cases where it is necessary. For example, I am planning to lower
invoke
>> instructions into call instructions that are terminators. This is
necessary
>> to accurately model control flow to landing pads. Such a call
instruction
>> must fall through to its layout successor.
>>
>> Some experimental targets don't implement AnalyzeBranch, so
everything
>> looks like an indirect branch. Those targets get the code placement
they
>> deserve.
>>
>> I am not claiming the API is awesome, but the information you need is
>> there, and you have the same freedom as for IR.
>>
>> We explicitly designed the branch weights so switch lowering could
>> annotate all the new branches with exact weights. It would be a shame
to
>> ignore that information.
>>
>> So the benefits are:
>>
>> - Profile-driven fall-through layout of lowered switches. That should
be a
>> pretty big deal.
>> - Proper placement of split critical edges.
>> - The ability to implement stuff like: "Don't put too many
branches in a
>> fetch group, or you'll freak out the branch predictor".
>>
>
> These all seem like really good reasons, and your explanation helps me a
> lot. I'll take a stab at re-implementing this on MBBs. In the mean
time,
> I've attached my patch with the IR-level patch as most of the
interesting
> logic will remain the same. Please be gentle, it's my first proper LLVM
> pass, and my knowledge of optimization pass research & papers is
limited.
> The part I'm least pleased with is the computation of weight for each
chain
> in an SCC of chains, but so far I've not come up with anything better.
=]
> I've tested this on several ad-hoc test cases, but nothing really
thorough.
> Going to try moving it to the codegen layer first.
>
>
> One question that remains, mostly to ensure I've understood you
correctly:
> For switches, it is represented as an N-ary conditional terminator, and the
> targets of the switch can be freely intermingled with other MBBs?
>
> I'll attach an updated patch to work on MBBs when I have one...
>
Ok, wow that wasn't hard at all. This is still *very* much a rough draft,
but it's probably better to review that the previous patch. One big caveat,
I know I have an iteration bug in here somewhere that is inf-looping. Just
ran out of steam debugging it, will pick it back up again later today to
shake it out.

Still, for the test cases that don't tickle the iteration bug, it generates
beautiful, well laid out code. =]

% cat ifchain.ll
; RUN: opt < %s -analyze -block-freq | FileCheck %s

declare void @error(i32 %i, i32 %a, i32 %b)

define i32 @test(i32 %i, i32* %a, i32 %b) {
entry:
  %gep1 = getelementptr i32* %a, i32 1
  %val1 = load i32* %gep1
  %cond1 = icmp ugt i32 %val1, 1
  br i1 %cond1, label %then1, label %else1, !prof !0

then1:
  call void @error(i32 %i, i32 1, i32 %b)
  br label %else1

else1:
  %gep2 = getelementptr i32* %a, i32 2
  %val2 = load i32* %gep2
  %cond2 = icmp ugt i32 %val2, 2
  br i1 %cond2, label %then2, label %else2, !prof !0

then2:
  call void @error(i32 %i, i32 1, i32 %b)
  br label %else2

else2:
  %gep3 = getelementptr i32* %a, i32 3
  %val3 = load i32* %gep3
  %cond3 = icmp ugt i32 %val3, 3
  br i1 %cond3, label %then3, label %else3, !prof !0

then3:
  call void @error(i32 %i, i32 1, i32 %b)
  br label %else3

else3:
  %gep4 = getelementptr i32* %a, i32 4
  %val4 = load i32* %gep4
  %cond4 = icmp ugt i32 %val4, 4
  br i1 %cond4, label %then4, label %else4, !prof !0

then4:
  call void @error(i32 %i, i32 1, i32 %b)
  br label %else4

else4:
  %gep5 = getelementptr i32* %a, i32 3
  %val5 = load i32* %gep5
  %cond5 = icmp ugt i32 %val5, 3
  br i1 %cond5, label %then5, label %exit, !prof !0

then5:
  call void @error(i32 %i, i32 1, i32 %b)
  br label %exit

exit:
  ret i32 %b
}

!0 = metadata !{metadata !"branch_weights", i32 4, i32 64}

% ./bin/llc -O2 -o - ifchain.ll
        .file   "ifchain.ll"
        .text
        .globl  test
        .align  16, 0x90
        .type   test, at function
test:                                   # @test
.Ltmp4:
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rbp
.Ltmp5:
        .cfi_def_cfa_offset 16
        pushq   %r14
.Ltmp6:
        .cfi_def_cfa_offset 24
        pushq   %rbx
.Ltmp7:
        .cfi_def_cfa_offset 32
.Ltmp8:
        .cfi_offset %rbx, -32
.Ltmp9:
        .cfi_offset %r14, -24
.Ltmp10:
        .cfi_offset %rbp, -16
        movl    %edx, %ebx
        movq    %rsi, %r14
        movl    %edi, %ebp
        cmpl    $2, 4(%r14)
        jb      .LBB0_2
.LBB0_2:                                # %else1
        cmpl    $3, 8(%r14)
        jb      .LBB0_4
.LBB0_4:                                # %else2
        cmpl    $4, 12(%r14)
        jb      .LBB0_6
.LBB0_6:                                # %else3
        cmpl    $5, 16(%r14)
        jb      .LBB0_8
.LBB0_8:                                # %else4
        cmpl    $4, 12(%r14)
        jb      .LBB0_10
.LBB0_10:                               # %exit
        movl    %ebx, %eax
        popq    %rbx
        popq    %r14
        popq    %rbp
        ret
.LBB0_1:                                # %then1
        movl    %ebp, %edi
        movl    $1, %esi
        movl    %ebx, %edx
        callq   error
.LBB0_3:                                # %then2
        movl    %ebp, %edi
        movl    $1, %esi
        movl    %ebx, %edx
        callq   error
.LBB0_5:                                # %then3
        movl    %ebp, %edi
        movl    $1, %esi
        movl    %ebx, %edx
        callq   error
.LBB0_7:                                # %then4
        movl    %ebp, %edi
        movl    $1, %esi
        movl    %ebx, %edx
        callq   error
.LBB0_9:                                # %then5
        movl    %ebp, %edi
        movl    $1, %esi
        movl    %ebx, %edx
        callq   error
.Ltmp11:
        .size   test, .Ltmp11-test
.Ltmp12:
        .cfi_endproc
.Leh_func_end0:


        .section        ".note.GNU-stack","", at progbits
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111019/ca556d5c/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: block_placement_codegen.patch
Type: text/x-patch
Size: 25889 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111019/ca556d5c/attachment.bin>

Chandler Carruth

2011-Oct-19 12:52 UTC

head link

[LLVMdev] Question regarding basic-block placement optimization

On Wed, Oct 19, 2011 at 5:50 AM, Chandler Carruth <chandlerc at
google.com>wrote:
> Ok, wow that wasn't hard at all. This is still *very* much a rough
draft,
> but it's probably better to review that the previous patch. One big
caveat,
> I know I have an iteration bug in here somewhere that is inf-looping. Just
> ran out of steam debugging it, will pick it back up again later today to
> shake it out.
>
Oh, of course, i'm not updating the terminators yet... I'll add that...

Could that cause an infloop when iterating over MBBs in a function?
Specifically what I'm seeing is a single MBB iterator which when incremented
returns itself. Every time. even calling 'dump' on the MachineFunction
inf-loops, so clearly I've corrupted the state of the function somehow.

>
> Still, for the test cases that don't tickle the iteration bug, it
generates
> beautiful, well laid out code. =]
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111019/c7aff050/attachment.html>

Jakob Stoklund Olesen

2011-Oct-19 14:56 UTC

head link

[LLVMdev] Question regarding basic-block placement optimization

On Oct 19, 2011, at 5:50 AM, Chandler Carruth wrote:
> Ok, wow that wasn't hard at all.
Awesome ;-)
> This is still *very* much a rough draft, but it's probably better to
review that the previous patch. One big caveat, I know I have an iteration bug
in here somewhere that is inf-looping. Just ran out of steam debugging it, will
pick it back up again later today to shake it out.
Maybe splicing a block into it current position will create a loop?

Some random notes:

- Please add a description of the algorithm.
- Please add a comment to the BlockChain class.
- Use a separate anonymous namespace per class, and don't indent for the
namespace.

+BlockChain *BlockPlacement2::CreateChain(MachineBasicBlock *BB) {
+  Chains.push_back(BlockChain(BlockToChain, BB));
+  BlockToChain[BB] = &Chains.back();
+  assert(ActiveChains.insert(&Chains.back()));
+  return &Chains.back();
+}

Whoa, you are storing pointers into a growing vector. You should at least
assert(Chains.size() != Chains.capacity()) before pushing.

+      RequiredChain->BBEnd = ++MachineFunction::iterator(From);

Does C++ allow this these days? I would prefer
llvm::next(MachineFunction::iterator(From))

Note that MachineFunction::iterator decays into an MBB pointer, so you can say
FI->canFallThrough() and AnalyzeBranch(*FI...)

+      WeightedEdge WE = { BaseFrequency * MBPI->getEdgeProbability(From,
To),

Did we add API for this computation? We intended to add a function that computes
this and saturates on overflow etc.
> Still, for the test cases that don't tickle the iteration bug, it
generates beautiful, well laid out code. =]
I am sure others have strong opinions about the algorithm ;-)

I would like to see a more explicit handling of loop layout. There is a tradeoff
between entering the loop with a branch, and having the exit block last. It is
not clear to me that this algorithm handles that properly.

Be careful about placing too much trust in the behavior of branch probabilities.
They go out of whack when they saturate.

Second, you should handle blocks that can never fall through (indirect branches
from jump tables). There is no need to try to place their successors.

/jakob

Possibly Parallel Threads

Search for more seemingly similar threads

llvm dev - Oct 2011 - [LLVMdev] Question regarding basic-block placement optimization

[LLVMdev] Question regarding basic-block placement optimization

[LLVMdev] Question regarding basic-block placement optimization

[LLVMdev] Question regarding basic-block placement optimization

[LLVMdev] Question regarding basic-block placement optimization

Possibly Parallel Threads