thr3ads.net - llvm dev - [llvm-dev] RFC: Representing unions in TBAA [Aug 2017]

If this information is useful, please help other people find it:
Share via:

Daniel Berlin via llvm-dev

2017-May-14 21:39 UTC

[llvm-dev] RFC: Representing unions in TBAA

On Sun, May 14, 2017 at 11:01 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 05/14/2017 12:49 PM, Daniel Berlin wrote:
>
>
>
> On Sun, May 14, 2017 at 10:20 AM, Hal Finkel <hfinkel at anl.gov>
wrote:
>
>>
>> On 05/14/2017 11:06 AM, Daniel Berlin wrote:
>>
>>
>>
>> On Sun, May 14, 2017 at 8:37 AM, Hal Finkel <hfinkel at anl.gov>
wrote:
>>
>>>
>>> On 03/01/2017 05:30 PM, Daniel Berlin via llvm-dev wrote:
>>>
>>> So, https://bugs.llvm.org/show_bug.cgi?id=32056 is an example
showing
>>> our current TBAA tree for union generation is definitely
irretrievably
>>> broken.
>>> I'll be honest here. I'm pretty sure your proposal
doesn't go far enough.
>>> But truthfully,  I would rather see us come closer to a
representation
>>> we know works, which is GCC's.
>>> Let me try to simplify what you are suggesting, and what we have.
>>> Our current representation is basically inverted from GCC, but we
don't
>>> allow things that would enable it to work.
>>>
>>> Given
>>> union {int a, short b};
>>>
>>> GCC's will be:
>>>
>>>  union
>>>   /   \
>>> short int
>>>
>>>
>>> Everything is implicitly a subset of alias set 0 (which for C/C++
is
>>> char) just to avoid representing it.
>>>
>>> Our metadata has no child links, and only a single parent link.
>>>
>>> You can't represent the above form because you can't make a
single short
>>> node a child  of every union/struct it needs to be (lack of
multiple
>>> parents means you can't just frob them all to offset zero).
>>>
>>> Your suggestion is to invert this as a struct
>>>
>>> short  int
>>>    |    /
>>> union
>>>
>>> We don't allow multiple parents, however.
>>> Because of that, you suggest we follow all nodes for unions,
special
>>> casing union-type nodes somehow
>>>
>>>
>>> Now that I've spent a bunch of time looking at this, I'd
like to voice
>>> support for Steven's original proposal. In the context of what
we have, it
>>> makes sense, seems sound, and should fix the representational
mapping
>>> problem we currently have.
>>>
>>
>> Except you can't easily differentiate it from the current one, and
if we
>> are going to have to upgrade/break compatibility, why not just do it
once
>> right, a way we know works, instead of risk screwing it up again, and
>> playing with a representation we aren't actually sure we can make
efficient
>> for this case?
>>
>>
>> I don't see why need to break backward compatibility. Do you have
>> something specific in mind? Once we extend the TBAA implementation to
treat
>> repeated offsets as disjunction, then we'll extend Clang to emit
metadata
>> for unions in this form. Old IR will work exactly as it does now.
>>
>
> Except the Old IR is irretrievably broken. and in this method, you
can't
> tell whether you are dealing with correct TBAA or not.
>
> Earlier, the discussion was basically "detect old IR, disable TBAA
since
> it's broken, make new IR work".
>
> If we do that, we need a way to distinguish new vs old IR.
>
>
>
> Ah, okay. I don't think that's desirable in general. There are
frontends
> emitting perfectly self-consistent TBAA metadata, and there's no reason
> they should change. Clang's metadata is broken because the mapping
between
> language rules and TBAA is broken. If we'd like, we can blacklist TBAA
> metadata coming from root nodes named "Simple C++ TBAA" and
"Simple C/C++
> TBAA" (or provide an option to do that because I'm not convinced
it is
> worth trying to retroactively "fix" old IR by default given the
associated
> performance penalties). After the fix, Clang can emit root nodes with
> different names (they're arbitrary). Changing the root-node names will
also
> give the conservatively-correct behavior when linking old IR with new IR.
>
>I still continue to believe trying to patch the existing mechanism this way
is not a great plan, and likely to end us in a place where we still have
either correctness or performance issues.
But i'm not doing the work, so if that's what you guys want to do, go
for
it.

--Dan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170514/20219e57/attachment.html>

Hal Finkel via llvm-dev

2017-May-15 01:20 UTC

head link

[llvm-dev] RFC: Representing unions in TBAA

On 05/14/2017 04:39 PM, Daniel Berlin wrote:>
>
> On Sun, May 14, 2017 at 11:01 AM, Hal Finkel <hfinkel at anl.gov 
> <mailto:hfinkel at anl.gov>> wrote:
>
>
>     On 05/14/2017 12:49 PM, Daniel Berlin wrote:
>>
>>
>>     On Sun, May 14, 2017 at 10:20 AM, Hal Finkel <hfinkel at anl.gov
>>     <mailto:hfinkel at anl.gov>> wrote:
>>
>>
>>         On 05/14/2017 11:06 AM, Daniel Berlin wrote:
>>>
>>>
>>>         On Sun, May 14, 2017 at 8:37 AM, Hal Finkel <hfinkel at
anl.gov
>>>         <mailto:hfinkel at anl.gov>> wrote:
>>>
>>>
>>>             On 03/01/2017 05:30 PM, Daniel Berlin via llvm-dev
wrote:
>>>>             So, https://bugs.llvm.org/show_bug.cgi?id=32056
>>>>             <https://bugs.llvm.org/show_bug.cgi?id=32056>
is an
>>>>             example showing our current TBAA tree for union
>>>>             generation is definitely irretrievably broken.
>>>>             I'll be honest here. I'm pretty sure your
proposal
>>>>             doesn't go far enough.
>>>>             But truthfully,  I would rather see us come closer
to a
>>>>             representation we know works, which is GCC's.
>>>>             Let me try to simplify what you are suggesting, and
>>>>             what we have.
>>>>             Our current representation is basically inverted
from
>>>>             GCC, but we don't allow things that would
enable it to
>>>>             work.
>>>>
>>>>             Given
>>>>             union {int a, short b};
>>>>
>>>>             GCC's will be:
>>>>
>>>>              union
>>>>               /   \
>>>>             short int
>>>>
>>>>
>>>>             Everything is implicitly a subset of alias set 0
(which
>>>>             for C/C++ is char) just to avoid representing it.
>>>>
>>>>             Our metadata has no child links, and only a single
>>>>             parent link.
>>>>
>>>>             You can't represent the above form because you
can't
>>>>             make a single short node a child  of every
union/struct
>>>>             it needs to be (lack of multiple parents means you
>>>>             can't just frob them all to offset zero).
>>>>
>>>>             Your suggestion is to invert this as a struct
>>>>
>>>>             short  int
>>>>                |    /
>>>>             union
>>>>
>>>>             We don't allow multiple parents, however.
>>>>             Because of that, you suggest we follow all nodes
for
>>>>             unions, special casing union-type nodes somehow
>>>
>>>             Now that I've spent a bunch of time looking at
this, I'd
>>>             like to voice support for Steven's original
proposal. In
>>>             the context of what we have, it makes sense, seems
>>>             sound, and should fix the representational mapping
>>>             problem we currently have.
>>>
>>>
>>>         Except you can't easily differentiate it from the
current
>>>         one, and if we are going to have to upgrade/break
>>>         compatibility, why not just do it once right, a way we know
>>>         works, instead of risk screwing it up again, and playing
>>>         with a representation we aren't actually sure we can
make
>>>         efficient for this case?
>>
>>         I don't see why need to break backward compatibility. Do
you
>>         have something specific in mind? Once we extend the TBAA
>>         implementation to treat repeated offsets as disjunction, then
>>         we'll extend Clang to emit metadata for unions in this
form.
>>         Old IR will work exactly as it does now.
>>
>>     Except the Old IR is irretrievably broken. and in this method,
>>     you can't tell whether you are dealing with correct TBAA or
not.
>>
>>     Earlier, the discussion was basically "detect old IR, disable
>>     TBAA since it's broken, make new IR work".
>>
>>     If we do that, we need a way to distinguish new vs old IR.
>
>
>     Ah, okay. I don't think that's desirable in general. There are
>     frontends emitting perfectly self-consistent TBAA metadata, and
>     there's no reason they should change. Clang's metadata is
broken
>     because the mapping between language rules and TBAA is broken. If
>     we'd like, we can blacklist TBAA metadata coming from root nodes
>     named "Simple C++ TBAA" and "Simple C/C++ TBAA" (or
provide an
>     option to do that because I'm not convinced it is worth trying to
>     retroactively "fix" old IR by default given the associated
>     performance penalties). After the fix, Clang can emit root nodes
>     with different names (they're arbitrary). Changing the root-node
>     names will also give the conservatively-correct behavior when
>     linking old IR with new IR.
>
>
> I still continue to believe trying to patch the existing mechanism 
> this way is not a great plan, and likely to end us in a place where we 
> still have either correctness or performance issues.
I don't agree, but this is because I fail to see how the two 
representations (the GCC-like one you've outlined and the current one 
with the proposed extension) aren't completely isomorphic. Your proposal is:
> scalar type node -> {name, children nodes}
> struct node -> {name, array of {offset, child node} }
>
> Paths are separate from the tbaa tree itself, and are just:
> path node -> {struct node, offset} or whatever.
>
> unions are just scalar type nodes with multiple children, not struct 
> nodes with special-cased offset zero.
the evolutionary scheme can be described in similar terms:

aggregate-type node -> { name, array of { offset, member-type } }

For a root node, the array is empty (i.e. it is only a name).

For a scalar type, the array only has one entry: the base/parent type.

For struct types, the array has all of the members.

For union types, the array has all of the members with duplicate offsets 
(likely zero, although there may be cases, such as anonymous unions in 
structs, where the duplicate offsets are not zero).

Paths in the current scheme, as in your proposal, are separate from the 
TBAA DAG itself, and are just:
path node -> {access-type node, aggregate-type node, offset}

For scalar accesses, access type == aggregate type and offset == 0. This 
could be more representationally efficient, but it is done this way to 
provide a uniform way to handle both scalars and aggregates and provide 
a uniform way to differentiate these node from the old-style scalar-only 
TBAA (to be fair, however, likely more the latter than the former). In 
any case, aside from the fact that we explicitly include the access 
type, this is the same.

In both cases the edges of the hierarchy run between structs and their 
members, so regardless of whether you draw the graph going up or down 
(i.e. call the edges parents or children) that aspect of the 
representation seems equivalent. The differences seem to be:

  1. Our current scheme represents scalar types like it represents 
structs with a single member (i.e. a single base type at offset 0). In 
your proposed scheme we have separate scalar nodes.

  2. The proposed evolved scheme we represent unions as aggregates with 
duplicated offsets (or, more generally, duplicate offsets within some 
aggregate). In your proposed scheme we represent unions as multiple 
children of scalar nodes. Transforming between the schemes seems only to 
require taking all consecutive members with duplicate offsets in the 
proposed evolved scheme and placing them into a union (scalar) node in 
your scheme.

Given that the union representation seems equivalent by rewriting, the 
question is whether the representation of scalars is equivalent - that 
is, is representing a scalar as a single-member struct (or union, as the 
case may be) equivalent to the representation you've proposed? I don't 
see why this wouldn't be equivalent as well (i.e. when traversing the 
tree in your representation we'd not adjust the offset when visiting 
scalar nodes, and then we'd visit the child, which is what we currently 
do as well).

Thanks again,
Hal
> But i'm not doing the work, so if that's what you guys want to do,
go
> for it.
>
> --Dan
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170514/2e858b67/attachment.html>

Daniel Berlin via llvm-dev

2017-May-15 02:00 UTC

head link

[llvm-dev] RFC: Representing unions in TBAA

>
>
> I don't agree, but this is because I fail to see how the two
> representations (the GCC-like one you've outlined and the current one
with
> the proposed extension) aren't completely isomorphic. Your proposal is:
>
>Lots of data structures are completely isomorphic in the same way,  and in
plenty of those cases, one is completely unusable, and the other functions
quite well.

Your below is basically "one is drawn in one direction, and the other is
drawn the other".
This is entirely true.

You want to argue that this means they will be exactly as efficient or easy
to understand as each other.
This is where you and I disagree.
I disagree because i rarely see parent-linked graph structures (outside of
union-find ) that are implemented or used well.
People often screw up the algorithms, they tend to be harder to reason
about, etc.
This is why, for example, despite having both parent and child links,
roughly all of our dominator tree graph algorithms walk children instead of
parents, even though you could equivalently do them using only the parent
links (even the depth first algorithms).

Past that I'm not sure what you want me to say here.
If y'all come up with an implementation that works as well as the ones
i'm
familiar with, i'll be as happy as anyone else?
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170514/4083c368/attachment.html>

Ivan A. Kosarev via llvm-dev

2017-Aug-14 16:58 UTC

head link

[llvm-dev] RFC: Representing unions in TBAA

Hello Steven, Hal and Daniel,

Thanks a lot for your discussion; it really helps with summarizing 
current TBAA issues and ways to resolve them.

Do you guys know anything of the current status of the proposed change? 
Steven, will you please let us know if the work is in progress and if 
there is any ETA you can share?

I'm asking because we are working on an alternative approach that not 
only supports accesses to union members, bit fields, fields of aggregate 
and union types, but also allows to represent accesses to aggregates and 
unions the same way we do it for scalars so that !tbaa.struct is 
replaced with plain !tbaa, meaning TBAA information can be propagated 
uniformly regardless of types of accessed objects. As a consequence, it 
supports identification of user types defined in different translation 
units, even if some of them are written in C and others are in C++. It 
also defines a set of language-neutral formal rules that LLVM codegen 
follows to determine whether a given pair of accesses are allowed to 
overlap by rules of the input language. As of today, we know this 
implementation covers all currently supported TBAA functionality 
reflected in the test suites and to test the new functionality we have 
SROA improved to preserve TBAA information.

The point is, our approach does not try to describe accesses as (type, 
offset) pairs and instead represents access sequences explicitly 
beginning from the base type followed by field descriptors, which is 
what makes the approach so flexible. TypeBasedAAResult::Aliases() and 
MDNode::getMostGenericTBAA() are a bit more complex than they used to be 
(they actually use the same internal function), but rely exclusively on 
linear scans of access sequences unless we have a situation when have to 
check if one of the accessed types is the type of a member of the other 
one, in which case it seems we just have to traverse through fields 
recursively no matter what.

So, I wonder if this or similar approaches have ever been considered 
before and what are the cons, if there are any sounded. Do you think it 
is worth to consider it now?

Thanks again,

--

llvm dev - Aug 2017 - RFC: Representing unions in TBAA

[llvm-dev] RFC: Representing unions in TBAA

[llvm-dev] RFC: Representing unions in TBAA

[llvm-dev] RFC: Representing unions in TBAA

[llvm-dev] RFC: Representing unions in TBAA