On 01/20/10 18:05, Jan Sjodin wrote:
>>>> bbs, to generate these edges. As we do not have to modify the CFG
>>>> other passes like dominance information are still valid, and we do
>>>> not have to create a lot of auxiliary bbs, to be able to detect all
>>>> regions. This saves memory and runtime. In general it is probably
>>>> not too easy to decide where to insert these bbs either:
>>>
>>> The general rule is to split all blocks with multiple in-edges and
>>> multiple out-edges into blocks with either multiple in-edges or
>>> multiple out-edges, but not both.
>> This is not sufficient, as shown in the example below. It would allow
>> only one region.
>
> Yes, but with the insertion of merge-blocks it will allow more (see below).
Sure. If you insert the right merge blocks you could be optimal in terms
of finding all regions that can be created by inserting merge blocks.
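Just to make sure we talk about the same prepass: I read the splitting
rule you describe roughly as the following sketch (toy CFG types for
illustration, not the actual LLVM data structures):

#include <vector>

struct BasicBlock {
  std::vector<BasicBlock *> Preds, Succs;
};

// Split every block that has both multiple in-edges and multiple
// out-edges, so that afterwards no block has both.
void splitJoinBranchBlocks(std::vector<BasicBlock *> &CFG) {
  std::vector<BasicBlock *> NewBlocks;
  for (BasicBlock *BB : CFG) {
    if (BB->Preds.size() > 1 && BB->Succs.size() > 1) {
      // The new block takes over the out-edges; BB keeps the in-edges
      // and gets a single edge to the new block.
      BasicBlock *Split = new BasicBlock();
      Split->Succs = BB->Succs;
      for (BasicBlock *Succ : Split->Succs)
        for (BasicBlock *&Pred : Succ->Preds)
          if (Pred == BB)
            Pred = Split;
      BB->Succs = {Split};
      Split->Preds = {BB};
      NewBlocks.push_back(Split);
    }
  }
  CFG.insert(CFG.end(), NewBlocks.begin(), NewBlocks.end());
}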
>>> One option is to keep this as an invariant throughout the compiler
>>> and make use of the merge blocks (multiple in-edges) to contain only
>>> PHI-nodes, and all other code in regular basic blocks. There are
>>> different variations on this that may or may not be useful.
>> This might be possible, however it would probably double the number
>> of bbs.
>
> I don't know if it will be double, but there will be more basic blocks
> for sure.
Sure.
>>>> CFG:   0
>>>>        |
>>>>        1
>>>>       / \
>>>>      2   3
>>>>     / \  |
>>>>    4   5 |
>>>>    |   | |
>>>>    6   7 8
>>>>     \  | /
>>>>      \ |/      region A: 1 -> 9 {1,2,3,4,5,6,7,8}
>>>>       9        region B: 2 -> 9 {2,4,5,6,7}
>>>>
>>>> So we need one bb that joins 6 and 7 and one that joins the two
>>>> regions.
>>>>
>>>> CFG:   0
>>>>        |
>>>>        1
>>>>       / \
>>>>      2   3
>>>>     / \  |
>>>>    4   5 |
>>>>    |   | |
>>>>    6   7 8
>>>>     \  | |
>>>>      \ | |     region A: (0,1) -> (9b,9) {1,2,3,4,5,6,7,8,9a,9b}
>>>>       9a |     region B: (1,2) -> (9a,9b) {2,4,5,6,7,9a}
>>>>        \ /
>>>>        9b
>>>>         |
>>>>         9
>>>
>>> It is fairly simple to use the information from the algorithm to
>>> decide where those merges should be inserted to get the expected
>>> regions. This may be needed in the cases where a sub-region is too
>>> complicated to be represented and must be abstracted into a "black
>>> box".
>> From which algorithm? The program structure tree does not give this
>> information, does it?
>
> The algorithm that computes the SESE-regions can be used to determine
> where the merge-nodes should be inserted. There are a couple of ways
> of doing it, but if the bracket sets on two edges can be intersected to
> match a third edge (which dominates the first two), you can insert a
> merge block for the two edges. You don't have to compute the
> dominators, but it helps to explain the problem that way.
Might be possible. I did not reason about this too much after I found a
reasonable algorithm that did not require these blocks.
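If I understand your suggestion correctly, it would be something like the
sketch below (very rough; the bracket sets per edge are assumed to come
from the cycle-equivalence computation, and all names are made up):

#include <algorithm>
#include <iterator>
#include <map>
#include <set>
#include <utility>

struct BasicBlock;
using Edge = std::pair<BasicBlock *, BasicBlock *>;
using BracketSet = std::set<int>; // bracket ids from the cycle-equivalence pass

// True if the bracket sets of E1 and E2 intersect to exactly the bracket
// set of a third edge Dom (which dominates E1 and E2). In that case a
// merge block for E1 and E2 would expose the region belonging to Dom.
bool mergeBlockWouldHelp(const std::map<Edge, BracketSet> &Brackets,
                         Edge E1, Edge E2, Edge Dom) {
  const BracketSet &B1 = Brackets.at(E1);
  const BracketSet &B2 = Brackets.at(E2);

  BracketSet Common;
  std::set_intersection(B1.begin(), B1.end(), B2.begin(), B2.end(),
                        std::inserter(Common, Common.begin()));

  return Common == Brackets.at(Dom);
}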
>>>>> My approach is comparable to this paper:
>>>>> The Refined Process Structure Tree by Jussi Vanhatalo, Hagen Völzer,
>>>>> Jana Koehler
>>>>
>>>> I was looking through some slides that described their algorithm.
>>>> One case that seems to be legal is this:
>>>>
>>>>       |
>>>>     Entry
>>>>     /   \
>>>>    R0   R1
>>>>     \   /
>>>>      Exit
>>>>       |
>>>>
>>>> With two fragments: Entry->R0->Exit and Entry->R1->Exit, which means
>>>> that a fragment cannot be identified using only the entry and exit
>>>> blocks, but the internal blocks or edges will also need to be
>>>> listed. I don't know if this is relevant to your implementation.
>>
>> No. The ideas are comparable, however I believe their implementation is
>> a little complicated. ;-)
>
> Do you have the same definition of a region and entry/exit blocks as
> they do?
No. They define a region based on all the edges in the region. This is a
very verbose definition, as all edges have to be saved. Also it is quite
expensive to compare regions, ...
However their description allows one to talk about all possible regions.
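Just to illustrate what I mean by verbose (simplified sketch, not the
actual data structures of either implementation):

#include <set>
#include <utility>

struct BasicBlock;
using Edge = std::pair<BasicBlock *, BasicBlock *>;

// Edge-based definition (roughly what the paper uses): a fragment is the
// set of all edges it contains, so storing and comparing fragments means
// handling whole edge sets.
struct EdgeFragment {
  std::set<Edge> Edges;
};

// Block-based definition used here: a region is identified by its entry
// and exit block only; comparing two regions is two pointer comparisons.
struct Region {
  BasicBlock *Entry; // contained, dominates all bbs of the region
  BasicBlock *Exit;  // not contained, postdominates all bbs of the region
};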
>> I would mark the regions as:
>>
>> Region A: R0 -> Exit, containing {R0}
>> Region B: R1 -> Exit, containing {R1}
>
> Is the entry always contained and is the exit never contained, or is
> that specified per region? Depending on the restrictions of entry and
> exit blocks a loop with a single basic block cannot be an entry or exit
> by itself. Example:
The entry is always contained and dominates all bbs in the region. The
exit is never contained, but it postdominates all bbs in the region.
>
>   |
>   A
>  /| _
> / |/ \
> B R   |
>  \ |\_/
>   \|
>    C
>    |
> If you only care about R in this case how is the region formed?
This is: R -> C, containing {R}
So R is the entry BB with the entry edge A->R and C is the exit BB with
the exit edge R->C. R is in the SCoP, C is not.
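Expressed in code, the containment condition I described above looks
roughly like this (simplified sketch; DomTree/PostDomTree stand in for
the existing dominance analyses, the names are made up):

struct BasicBlock;

// Stand-ins for the existing dominance/post-dominance analyses.
struct DomTree     { bool dominates(BasicBlock *A, BasicBlock *B) const; };
struct PostDomTree { bool postDominates(BasicBlock *A, BasicBlock *B) const; };

// A block belongs to the region (Entry, Exit) if Entry dominates it and
// Exit post-dominates it; Entry itself is part of the region, Exit is not.
// In the example above: Entry = R, Exit = C, and only R satisfies this.
bool isInRegion(BasicBlock *BB, BasicBlock *Entry, BasicBlock *Exit,
                const DomTree &DT, const PostDomTree &PDT) {
  return BB != Exit && DT.dominates(Entry, BB) && PDT.postDominates(Exit, BB);
}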
>
>
>>>> The implementation however takes advantage of the existence of
>>>> Dominance/PostDominance information. Therefore it is simpler and
>>>> hopefully faster. At the moment run time is comparable to dominance
>>>> tree calculation.
>>>
>>> Both algorithms are linear so there is really no big difference in
>>> time imo.
>> Sure. However in terms of maintainability it is nice to be able to
>> reuse existing analyses instead of writing another triconnected
>> component analysis upfront.
>>
>>> I believe the biggest difference that you mention is that you can
>>> capture more complicated regions without having to modify the CFG
>>> with the current algorithm.
>> Yes.
>>
>>>> If you want, have a look into some results I got with a pass
>>>> extracting maximal non trivial regions that do not contain loops
>>>> from libbz2:
>>>>
>>>> http://tobias.osaft.eu/llvm/region/bz2NoLoops/
>>>>
>>>> Interesting ones:
>>>>
>>>> regions_without_loops.BZ2_bzBuffToBuffCompress.dot.png
>>>> has a lot of exit edges.
>>>
>>> I think this example proves the strengths and weaknesses of both
>>> approaches. Making that region into structured control flow would
>>> add a lot of additional blocks. This will also happen after
>>> generating code from the polyhedral model, so either way the cost is
>>> there if the optimization is successful.
>> Yes, but just in this case and not for regions where the model cannot
>> be applied. If the regions pass is used for analysis purposes, nothing
>> has to be touched.
>>
>>> The second case is where the optimization fails (no profitable
>>> transformation found) and the CFG can remain untouched.
>>>
>>> The third case is if one of those blocks contains something
>>> complicated. I believe the current algorithm simply fails and cannot
>>> detect the region.
>> Which algorithm? The one in Graphite? Or the region detection I wrote
>> here? This is just plain region detection, that does not even look at
>> the content but builds a region tree (program structure tree). It just
>> detects every possible region.
>> The selection would be a later pass.
>
> My assumption was that there is a selection in there somewhere.
> Do you plan to refine the regions in the selection phase in any way?
Sure. The region tree can be processed and several sets of regions can
be filtered out. E.g. the maximal regions that fulfill a certain
restriction, the minimal regions that fulfill a certain restriction, or
some more complicated sets where some problems (like irregular control
flow) can be hidden in SESE subcomponents.
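E.g. selecting the maximal regions that satisfy some predicate is just a
top-down walk over the region tree, roughly like this sketch (RegionNode
and the predicate are placeholders):

#include <functional>
#include <vector>

// Placeholder for a node of the region tree / program structure tree.
struct RegionNode {
  std::vector<RegionNode *> Children;
};

// Collect the maximal regions that satisfy Pred: if a region qualifies,
// take it and do not descend further; otherwise look at its children.
// (The minimal variant would descend first and keep only the innermost
// qualifying regions instead.)
void collectMaximalRegions(RegionNode *R,
                           const std::function<bool(RegionNode *)> &Pred,
                           std::vector<RegionNode *> &Result) {
  if (Pred(R)) {
    Result.push_back(R);
    return;
  }
  for (RegionNode *Child : R->Children)
    collectMaximalRegions(Child, Pred, Result);
}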
>>> If the CFG is modified this would allow an internal SESE-region to
>>> become a black box, and the outer regions could be optimized.
>> This is an optimization, however I think it is orthogonal to the
>> region detection problem. I would say it works with any algorithm.
>
> I believe that creating a black-box will map a lot more cleanly to an
> edge-based region definition, since block-based may include multiple
> entry/exit sub-regions that will not encapsulate control flow in a
> reasonable way.
The block-based definition works as well as any edge-based definition to
define a SESE region, and it should be able to hide stuff in black boxes.
>
>>>> regions_without_loops.bzopen_or_bzdopen.dot.png
>>>> the first region has two entry edges. One is the loop latch.
>>>> (Keep in mind all regions have the same color, so if it seems there
>>>> is an edge into a region, there are just two regions close by)
>>>>
>>>> Without a prepass that exposes the edges almost no region could be
>>>> detected with the "standard" approach.
>>>
>>> Indeed the CFG will have to be modified for these cases. It seems to
>>> me that the trade-off between the two approaches is that the
>>> algorithm that you currently have is cheaper up front, but may be
>>> less capable in some cases, while the "standard" algorithm will be
>>> more expensive, but can handle problematic regions better. Would you
>>> agree?
>>
>> I agree that the algorithm I have is cheaper upfront, but I do not yet
>> see a case where the algorithm is less capable. Would you mind giving
>> an example or highlighting the relevant part of the discussion?
>
> With the insertion of extra merge-blocks the code becomes more
> structured and the PST can be refined further. A more fine-grained PST
> may allow more cases to be handled.
That is actually the difference between my algorithm and the PST bracket
algorithm. Mine should get the same regions that the PST one gets after
inserting the best merge blocks possible, just without requiring CFG
changes.
I am writing some documentation about this that we could discuss later
on.
If you are interested, I ran my analysis on aermod from Polyhedron.
The results can be found here:
http://tobias.osaft.eu/llvm/region/aermod/
There is a .regtree.txt file containing the region tree for every
function of the binary.
Furthermore, there are the results of an example analysis that finds the
biggest regions without loops (the .dot and .svg files).
If you have the impression some regions are not SESE, keep in mind that I
used the same color for two regions, so they may just be next to each
other.
Thanks for discussing this with me. It helped me get a feeling for where
the differences are. I hope we can have these discussions more often.
Tobi