thr3ads.net - llvm dev - [llvm-dev] RFC: Trace-based layout. [Sep 2017]

If this information is useful, please help other people find it:
Share via:

Kyle Butt via llvm-dev

2017-Sep-15 01:53 UTC

[llvm-dev] RFC: Trace-based layout.

I plan on rewriting the block placement algorithm to proceed by traces.

A trace is a chain of blocks where each block in the chain may fall through
to
the successor in the chain.

The overall algorithm would be to first produce traces for a function, and
then
order those traces to try and get cache locality.

Currently block placement uses a greedy single step approach to layout. It
produces chains working from inner to outer loops. Unlike a trace, a chain
may
contain non-fallthrough edges. This causes problems with loop layout. The
main
problems with loop layout are: loop rotation and cold blocks in a loop.

Overview of proposed solution:

Phase 1:
Greedily produce a set of traces through the function. A trace is a list of
blocks with each block in the list falling through (possibly conditionally)
to
the next block in the list. Loop rotation will occur naturally in this
phase via
the triangle replacement algorithm below. Handling single trace loops
requires a
tweak, see the detailed design.

Phase 2:
After producing what we believe are the best traces, they need to be
ordered.
They will be ordered topologically, except that traces that are cold enough
(As
measured by their warmest block) will be floated later, This may push them
out
of a loop or to the end of the function.

Detailed Design

Note whenever an edge is used as a number, I am referring to the edge
frequency.

Phase 1: Producing traces
Traces are produced according to the following algorithm:
 * Sort the edges according to weight, stable-sorting them according the
incoming
block and edge ordering.
 * Place each block in a trace of length 1.
 * For each edge in order:
    * If the source is at the end of a trace, and the target is at the
beginning
      of a trace, glue those 2 traces into 1 longer trace.
    * If an edge has a target or source in the middle of another trace,
consider
      tail duplication. The benefit calculation is the same as the existing
      code.
    * If an edge has a source or target in the middle, check them to see if
they
      can be replaced as a triangle. (Triangle replacement described below)
      * Compare the benefit of choosing the edge, along with any triangles
        found, with the cost of breaking the existing edges.
        * If it is a net benefit, perform the switch.
 * Triangle checking:
    Consider a trace in 2 parts: A1->A2, and the current edge under
consideration
    is A1->B (the case for C->A2 is mirror, and both may need to be done)
    * First find the best alternative C->B
    * Check for an alternative for A2: D->A2
    * Find D's best Alternative: D->E
    * Compare the frequencies: A1->A2 + C->B + D->E vs A1->B +
D->A2
    * If the 2nd sum is bigger, do the switch.
  * Loop Rotation Tweak:
    If A contains a backedge A2->A1, then when considering A1->B or
C->A2,
we
    can include that backedge in the gain:
    A1->A2 + C->D + E->B vs A1->B + C->A2 + A2->A

Phase 2: Order traces.
First we compute the frequency of a trace by finding the max frequency of
any of
its blocks.
Then we attempt to place the traces topologically. When a trace cannot be
placed
topologically, we prefer warmer traces first.

Questions and comments welcome.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170914/5f74c31e/attachment.html>

Sean Silva via llvm-dev

2017-Sep-15 02:24 UTC

head link

[llvm-dev] RFC: Trace-based layout.

Is this an existing published algorithm? Do you have a link to a paper?

-- Sean Silva

On Thu, Sep 14, 2017 at 6:53 PM, Kyle Butt via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> I plan on rewriting the block placement algorithm to proceed by traces.
>
> A trace is a chain of blocks where each block in the chain may fall
> through to
> the successor in the chain.
>
> The overall algorithm would be to first produce traces for a function, and
> then
> order those traces to try and get cache locality.
>
> Currently block placement uses a greedy single step approach to layout. It
> produces chains working from inner to outer loops. Unlike a trace, a chain
> may
> contain non-fallthrough edges. This causes problems with loop layout. The
> main
> problems with loop layout are: loop rotation and cold blocks in a loop.
>
> Overview of proposed solution:
>
> Phase 1:
> Greedily produce a set of traces through the function. A trace is a list of
> blocks with each block in the list falling through (possibly
> conditionally) to
> the next block in the list. Loop rotation will occur naturally in this
> phase via
> the triangle replacement algorithm below. Handling single trace loops
> requires a
> tweak, see the detailed design.
>
> Phase 2:
> After producing what we believe are the best traces, they need to be
> ordered.
> They will be ordered topologically, except that traces that are cold
> enough (As
> measured by their warmest block) will be floated later, This may push them
> out
> of a loop or to the end of the function.
>
> Detailed Design
>
> Note whenever an edge is used as a number, I am referring to the edge
> frequency.
>
> Phase 1: Producing traces
> Traces are produced according to the following algorithm:
>  * Sort the edges according to weight, stable-sorting them according the
> incoming
> block and edge ordering.
>  * Place each block in a trace of length 1.
>  * For each edge in order:
>     * If the source is at the end of a trace, and the target is at the
> beginning
>       of a trace, glue those 2 traces into 1 longer trace.
>     * If an edge has a target or source in the middle of another trace,
> consider
>       tail duplication. The benefit calculation is the same as the existing
>       code.
>     * If an edge has a source or target in the middle, check them to see
> if they
>       can be replaced as a triangle. (Triangle replacement described below)
>       * Compare the benefit of choosing the edge, along with any triangles
>         found, with the cost of breaking the existing edges.
>         * If it is a net benefit, perform the switch.
>  * Triangle checking:
>     Consider a trace in 2 parts: A1->A2, and the current edge under
> consideration
>     is A1->B (the case for C->A2 is mirror, and both may need to be
done)
>     * First find the best alternative C->B
>     * Check for an alternative for A2: D->A2
>     * Find D's best Alternative: D->E
>     * Compare the frequencies: A1->A2 + C->B + D->E vs A1->B +
D->A2
>     * If the 2nd sum is bigger, do the switch.
>   * Loop Rotation Tweak:
>     If A contains a backedge A2->A1, then when considering A1->B or
C->A2,
> we
>     can include that backedge in the gain:
>     A1->A2 + C->D + E->B vs A1->B + C->A2 + A2->A
>
> Phase 2: Order traces.
> First we compute the frequency of a trace by finding the max frequency of
> any of
> its blocks.
> Then we attempt to place the traces topologically. When a trace cannot be
> placed
> topologically, we prefer warmer traces first.
>
> Questions and comments welcome.
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170914/2d88c21a/attachment.html>

Kyle Butt via llvm-dev

2017-Sep-15 17:00 UTC

head link

[llvm-dev] RFC: Trace-based layout.

It is essentially block layout algorithm 2 here, with limited non-greedy
lookahead. (The triangle detection)
https://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=p16-pettis.pdf

On Thu, Sep 14, 2017 at 7:24 PM, Sean Silva <chisophugis at gmail.com>
wrote:
> Is this an existing published algorithm? Do you have a link to a paper?
>
> -- Sean Silva
>
> On Thu, Sep 14, 2017 at 6:53 PM, Kyle Butt via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> I plan on rewriting the block placement algorithm to proceed by traces.
>>
>> A trace is a chain of blocks where each block in the chain may fall
>> through to
>> the successor in the chain.
>>
>> The overall algorithm would be to first produce traces for a function,
>> and then
>> order those traces to try and get cache locality.
>>
>> Currently block placement uses a greedy single step approach to layout.
It
>> produces chains working from inner to outer loops. Unlike a trace, a
>> chain may
>> contain non-fallthrough edges. This causes problems with loop layout.
The
>> main
>> problems with loop layout are: loop rotation and cold blocks in a loop.
>>
>> Overview of proposed solution:
>>
>> Phase 1:
>> Greedily produce a set of traces through the function. A trace is a
list
>> of
>> blocks with each block in the list falling through (possibly
>> conditionally) to
>> the next block in the list. Loop rotation will occur naturally in this
>> phase via
>> the triangle replacement algorithm below. Handling single trace loops
>> requires a
>> tweak, see the detailed design.
>>
>> Phase 2:
>> After producing what we believe are the best traces, they need to be
>> ordered.
>> They will be ordered topologically, except that traces that are cold
>> enough (As
>> measured by their warmest block) will be floated later, This may push
>> them out
>> of a loop or to the end of the function.
>>
>> Detailed Design
>>
>> Note whenever an edge is used as a number, I am referring to the edge
>> frequency.
>>
>> Phase 1: Producing traces
>> Traces are produced according to the following algorithm:
>>  * Sort the edges according to weight, stable-sorting them according
the
>> incoming
>> block and edge ordering.
>>  * Place each block in a trace of length 1.
>>  * For each edge in order:
>>     * If the source is at the end of a trace, and the target is at the
>> beginning
>>       of a trace, glue those 2 traces into 1 longer trace.
>>     * If an edge has a target or source in the middle of another trace,
>> consider
>>       tail duplication. The benefit calculation is the same as the
>> existing
>>       code.
>>     * If an edge has a source or target in the middle, check them to
see
>> if they
>>       can be replaced as a triangle. (Triangle replacement described
>> below)
>>       * Compare the benefit of choosing the edge, along with any
triangles
>>         found, with the cost of breaking the existing edges.
>>         * If it is a net benefit, perform the switch.
>>  * Triangle checking:
>>     Consider a trace in 2 parts: A1->A2, and the current edge under
>> consideration
>>     is A1->B (the case for C->A2 is mirror, and both may need to
be done)
>>     * First find the best alternative C->B
>>     * Check for an alternative for A2: D->A2
>>     * Find D's best Alternative: D->E
>>     * Compare the frequencies: A1->A2 + C->B + D->E vs
A1->B + D->A2
>>     * If the 2nd sum is bigger, do the switch.
>>   * Loop Rotation Tweak:
>>     If A contains a backedge A2->A1, then when considering A1->B
or
>> C->A2, we
>>     can include that backedge in the gain:
>>     A1->A2 + C->D + E->B vs A1->B + C->A2 + A2->A
>>
>> Phase 2: Order traces.
>> First we compute the frequency of a trace by finding the max frequency
of
>> any of
>> its blocks.
>> Then we attempt to place the traces topologically. When a trace cannot
be
>> placed
>> topologically, we prefer warmer traces first.
>>
>> Questions and comments welcome.
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170915/82dfc991/attachment.html>

Andrew Trick via llvm-dev

2017-Sep-18 20:16 UTC

head link

[llvm-dev] RFC: Trace-based layout.

> On Sep 14, 2017, at 6:53 PM, Kyle Butt via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> I plan on rewriting the block placement algorithm to proceed by traces.
> 
> A trace is a chain of blocks where each block in the chain may fall through
to
> the successor in the chain.
> 
> The overall algorithm would be to first produce traces for a function, and
then
> order those traces to try and get cache locality.
> 
> Currently block placement uses a greedy single step approach to layout. It
> produces chains working from inner to outer loops. Unlike a trace, a chain
may
> contain non-fallthrough edges. This causes problems with loop layout. The
main
> problems with loop layout are: loop rotation and cold blocks in a loop.
> 
> Overview of proposed solution:
> 
> Phase 1:
> Greedily produce a set of traces through the function. A trace is a list of
> blocks with each block in the list falling through (possibly conditionally)
to
> the next block in the list. Loop rotation will occur naturally in this
phase via
> the triangle replacement algorithm below. Handling single trace loops
requires a
> tweak, see the detailed design.
> 
> Phase 2:
> After producing what we believe are the best traces, they need to be
ordered.
> They will be ordered topologically, except that traces that are cold enough
(As
> measured by their warmest block) will be floated later, This may push them
out
> of a loop or to the end of the function.
> 
> Detailed Design
> 
> Note whenever an edge is used as a number, I am referring to the edge
frequency.
> 
> Phase 1: Producing traces
> Traces are produced according to the following algorithm:
>  * Sort the edges according to weight, stable-sorting them according the
incoming
> block and edge ordering.
>  * Place each block in a trace of length 1.
>  * For each edge in order:
>     * If the source is at the end of a trace, and the target is at the
beginning
>       of a trace, glue those 2 traces into 1 longer trace.
>     * If an edge has a target or source in the middle of another trace,
consider
>       tail duplication. The benefit calculation is the same as the existing
>       code.
>     * If an edge has a source or target in the middle, check them to see if
they
>       can be replaced as a triangle. (Triangle replacement described below)
>       * Compare the benefit of choosing the edge, along with any triangles
>         found, with the cost of breaking the existing edges.
>         * If it is a net benefit, perform the switch.
>  * Triangle checking:
>     Consider a trace in 2 parts: A1->A2, and the current edge under
consideration
>     is A1->B (the case for C->A2 is mirror, and both may need to be
done)
>     * First find the best alternative C->B
>     * Check for an alternative for A2: D->A2
>     * Find D's best Alternative: D->E
>     * Compare the frequencies: A1->A2 + C->B + D->E vs A1->B +
D->A2
>     * If the 2nd sum is bigger, do the switch.
>   * Loop Rotation Tweak:
>     If A contains a backedge A2->A1, then when considering A1->B or
C->A2, we
>     can include that backedge in the gain:
>     A1->A2 + C->D + E->B vs A1->B + C->A2 + A2->A
> 
> Phase 2: Order traces.
> First we compute the frequency of a trace by finding the max frequency of
any of
> its blocks.
> Then we attempt to place the traces topologically. When a trace cannot be
placed
> topologically, we prefer warmer traces first.
> 
> Questions and comments welcome.
> _______________________________________________
This algorithm should be easy enough to implement and experiment with out of
tree. A reasonable goal would be to identify cases that the current, more
sophisticated algorithm are handling poorly. At that point, it would be much
more useful to fix the current algorithm’s heuristics rather than introducing
another less sophisticated alternative.

The current algorithm deals well with partial profile information and maintains
a number of layout properties relative to the CFG structure. I’m not the best
person to explain all of this, but you will certainly need to understand the
original goals, show concrete examples of how it “fails” and can’t be easily
fixed, before you propose replacing it.

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170918/bdc373f4/attachment.html>

Kyle Butt via llvm-dev

2017-Sep-19 00:17 UTC

head link

[llvm-dev] RFC: Trace-based layout.

On Mon, Sep 18, 2017 at 1:16 PM, Andrew Trick <atrick at apple.com> wrote:
>
> On Sep 14, 2017, at 6:53 PM, Kyle Butt via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> I plan on rewriting the block placement algorithm to proceed by traces.
>
> A trace is a chain of blocks where each block in the chain may fall
> through to
> the successor in the chain.
>
> The overall algorithm would be to first produce traces for a function, and
> then
> order those traces to try and get cache locality.
>
> Currently block placement uses a greedy single step approach to layout. It
> produces chains working from inner to outer loops. Unlike a trace, a chain
> may
> contain non-fallthrough edges. This causes problems with loop layout. The
> main
> problems with loop layout are: loop rotation and cold blocks in a loop.
>
> Overview of proposed solution:
>
> Phase 1:
> Greedily produce a set of traces through the function. A trace is a list of
> blocks with each block in the list falling through (possibly
> conditionally) to
> the next block in the list. Loop rotation will occur naturally in this
> phase via
> the triangle replacement algorithm below. Handling single trace loops
> requires a
> tweak, see the detailed design.
>
> Phase 2:
> After producing what we believe are the best traces, they need to be
> ordered.
> They will be ordered topologically, except that traces that are cold
> enough (As
> measured by their warmest block) will be floated later, This may push them
> out
> of a loop or to the end of the function.
>
> Detailed Design
>
> Note whenever an edge is used as a number, I am referring to the edge
> frequency.
>
> Phase 1: Producing traces
> Traces are produced according to the following algorithm:
>  * Sort the edges according to weight, stable-sorting them according the
> incoming
> block and edge ordering.
>  * Place each block in a trace of length 1.
>  * For each edge in order:
>     * If the source is at the end of a trace, and the target is at the
> beginning
>       of a trace, glue those 2 traces into 1 longer trace.
>     * If an edge has a target or source in the middle of another trace,
> consider
>       tail duplication. The benefit calculation is the same as the existing
>       code.
>     * If an edge has a source or target in the middle, check them to see
> if they
>       can be replaced as a triangle. (Triangle replacement described below)
>       * Compare the benefit of choosing the edge, along with any triangles
>         found, with the cost of breaking the existing edges.
>         * If it is a net benefit, perform the switch.
>  * Triangle checking:
>     Consider a trace in 2 parts: A1->A2, and the current edge under
> consideration
>     is A1->B (the case for C->A2 is mirror, and both may need to be
done)
>     * First find the best alternative C->B
>     * Check for an alternative for A2: D->A2
>     * Find D's best Alternative: D->E
>     * Compare the frequencies: A1->A2 + C->B + D->E vs A1->B +
D->A2
>     * If the 2nd sum is bigger, do the switch.
>   * Loop Rotation Tweak:
>     If A contains a backedge A2->A1, then when considering A1->B or
C->A2,
> we
>     can include that backedge in the gain:
>     A1->A2 + C->D + E->B vs A1->B + C->A2 + A2->A
>
> Phase 2: Order traces.
> First we compute the frequency of a trace by finding the max frequency of
> any of
> its blocks.
> Then we attempt to place the traces topologically. When a trace cannot be
> placed
> topologically, we prefer warmer traces first.
>
> Questions and comments welcome.
> _______________________________________________
>
>
> This algorithm should be easy enough to implement and experiment with out
> of tree. A reasonable goal would be to identify cases that the current,
> more sophisticated algorithm are handling poorly. At that point, it would
> be much more useful to fix the current algorithm’s heuristics rather than
> introducing another less sophisticated alternative.
>
I wrote many of the current algorithms heuristics, and they feel like hacks
to work around specific problems.

What I'm proposing is more, not less sophisticated. Handling loop rotation
as part of a general lookahead problem is more sophisticated than the
existing method of laying out the loop and praying that one of the n
rotations will be a good one.
Even the tail-duplication support I added is kind of a hack. It would be
better to do block cloning to handle cases where tail-duplication currently
fails

>
> The current algorithm deals well with partial profile information and
> maintains a number of layout properties relative to the CFG structure. I’m
> not the best person to explain all of this, but you will certainly need to
> understand the original goals, show concrete examples of how it “fails” and
> can’t be easily fixed, before you propose replacing it.
>
I'll make sure to include test cases with any diffs I mail out.

>
> -Andy
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170918/86525a71/attachment-0001.html>

Maybe Matching Threads

Search for more reasonably related threads

llvm dev - Sep 2017 - RFC: Trace-based layout.

[llvm-dev] RFC: Trace-based layout.

[llvm-dev] RFC: Trace-based layout.

[llvm-dev] RFC: Trace-based layout.

[llvm-dev] RFC: Trace-based layout.

[llvm-dev] RFC: Trace-based layout.

Maybe Matching Threads