thr3ads.net - llvm dev - [llvm-dev] RFC: Resolving TBAA issues [Aug 2017]

If this information is useful, please help other people find it:
Share via:

Daniel Berlin via llvm-dev

2017-Aug-19 19:05 UTC

[llvm-dev] RFC: Resolving TBAA issues

FWIW: to be completely concrete; from my perspective, the main thing we'd
get out of switching is something that is more complete already.

past that
"I don't want language-specific kinds of nodes in the grammar, and I
believe that it's not necessary under some reasonable variant of this
scheme. Maybe I'm wrong."

If your position is that you'd be fine with a variant of this scheme, than
i don't really think we disagree at all :)


On Sat, Aug 19, 2017 at 12:00 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 08/17/2017 11:25 PM, Daniel Berlin wrote:
>
>
> <just want to focus on these parts for a second. *All* of these
> representations are really access path representations, just encoded
> slightly different ways, and, as a result, with various parts of the rules
> in slightly different places>
>
> Imagine that we took the enhancement we previously discussed, but instead
>> of implementing it directly, we just directly encoded for every access
the
>> path from the access type to the root. I think it would look very much
like
>> this proposal.
>
>
> Something like it, yes.
> Note that this representation also has special vtable and union groups,
> for example, and assumes very c-like rules in several places around the
> sequencing of access types.
> Also note that  in the language of access paths, C allows overlapping that
> does not exist in, for example, Java
>
>
> Is this a statement generally favoring or disfavoring having the frontend
> encode the path explicitly?
>
> (This is true in both the points-to and type-based domains).
> Ada has discriminated unions (and you could not use the "union"
in this
> proposal to represent them)
>
>
> Yes, but discriminated unions also have a dynamic element to them. As a
> result, I suspect we'd need a scheme that captures that (or else
we'd need
> to model them like a struct with a field and something like a C union).
>
> etc.
>
>
>
>>
>> We generally explain our current TBAA rules by saying that they're
>>> generic but motivated by C/C++ rules.
>>>
>>
>> We do say that but that's not really what our implementation does
in any
>> way.
>>
>>
>> Really? I thought it was motivated by C/C++ rules. When you say that
it's
>> "not really what our implementation does", is this because it
drops a lot
>> of potential information in the translation to our generic form?
>>
>
> What we do is completely and totally unrelated to types as they exist in
> the original language.  It is a completely and totally generic
> implementation with no rules related to the original language.  The
> original language, for example, has dynamic typing rules, object lifetime
> rules, etc. We have none of these.
>
> The types do not, in fact,even always have the same relationship we give
> them.
>
> We have a tree and some nodes, and we happen to name them after types.
> Like this proposal, they also represent computed access paths, in a much
> simpler language.
>
> The nodes represent alias set members (either one or many)
> The edges represent either "access at offset" (where the default
is 0 if
> unweighted).
> We have a simple rule that that processes the access paths related to
> ancestry.
>
>
> I certainly understand all of this ;)
>
>
> You can view it as either a grammar parsing or a graph reachability
> problem depending on what works for you (in fact, it's really a
Dyck-CFL
> reachability problem on bidirected trees)
>
> The reachability rule is  usually given as: If node Target is reachable
> from node Source, offset another through either it's children or the
> upwards edges (or was it downwards, i always screw up which direction we go
> in), they may-alias
>
> You could also implement it as a real grammar parsing problem (IE you
> quite literally could generate strings from tag + offset, and parse them
> against the grammar, and see if the terminal nodes contain your target).
> They are equivalent.
>
> This representation would be the same in that regard.
>
> Our current lowering for clang happens to kind of look like c/c++
> structures converted to a tree.
>
> However, this is actually just inefficient, space wise, and done because
> it's simple to lower it this way.
>
>
> I agree (although this wasn't happenstance; the representation was
> designed with an expected lowering scheme in mind, at least for C/C++).
>
>
> Because the accesses are completely unrelated to the original types, and
> require *no* language rules to interpret, you could also just partition the
> things that alias by the language rules in the frontend, and then output a
> tree that represents the possible unique paths.
>
> IE figure out all the answers, then re-encode it as precisely as possible.
>
> This trades time for space.
>
> This representation is *super* generic.  That is, the language being used
> here is super simple, and the reachability rule is super simple.
>
> I could have the frontend generate these trees based on anything i like.
> I could, in fact, encode steensgaard points-to results as this tree without
> any trouble.
>
> The access path language described in this proposal is more complex and
> complete, and directly closer to access paths you find in C/C++.  It has
> bitfields, vtables, and unions.
>
>
> It is more complex, but I think this is partly in the description. AFAIKT,
> only "union" has special properties under the scheme. Bit fields,
vtables,
> etc. are all just particular types with no particular special rules. I
> don't like special rules at all for any named entities, but as
there's only
> one such entity right now ("union"), this is an encoding choice
we could
> bikeshed.
>
> The reachability rules are more complex.
> Is it possible to express other languages in that set of access path rules?
> Sure.  For example, you can, as above, generate the set of answers, and
> then re-encode it into access paths.
>
> Right now, the work it takes *in the front end* is minimal, and has a
> fairly efficient space encoding.
> if i want to say two things alias, i just gotta be able to reach one from
> the other.
> If i want to say two things do not alias, i just gotta be able to not
> reach one from the other only using certain types of edges
> In our current language, all that takes in a frontend is "find longest
> no-aliasing part of tree. Go to parent, add new child".
>
> In the proposed language, the lowering is more complex.  Is it doable?
> Sure, of course, not gonna claim otherwise.  But the more
"features" of
> the access path you add and expect the middle end to handle, instead of the
> front-end expanding them, and the more those feature's reachability
rules
> are related to a specific language the more language-specific it gets.
>
> That's the *only* tradeoff we are making here, representationally.  
How
> much does the frontend have to understand about how the middle end walks
> these things, in order to generate whatever it wants as precisely as
> possible
>
> We can make whichever we want express everything we want in N^3 time :)
> The real question is "do we try to add on to what we have in ways that
> work for multiple languages, and are expressed neutrally in a simple
> reachability language"
> or "do we add language-specific kinds of nodes to the grammar, and
have
> reachability rules that are fairly language specific".
>
>
> I don't want language-specific kinds of nodes in the grammar, and I
> believe that it's not necessary under some reasonable variant of this
> scheme. Maybe I'm wrong.
>
>
> IE do you add, say, discriminated_union nodes to our current
> representation for ada, or "vtable_access" nodes to our current
> representation for C++ vtable accesses
> Or do you instead generate a metadata that has a unidirectional edge
> reachability (IE up only), or whatever it takes to do vtables generically.
>
> Both are completely and totally viable paths, and it's all about which
way
> you want to go.
> But they *definitely* have a difference in terms of language-specificness.
>
>
> To be clear, if the system is not generic, I'm far less interested.
>
> Thanks again,
> Hal
>
>
>
>
>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170819/c58896e8/attachment.html>

Hal Finkel via llvm-dev

2017-Aug-19 19:33 UTC

head link

[llvm-dev] RFC: Resolving TBAA issues

On 08/19/2017 02:05 PM, Daniel Berlin wrote:> FWIW: to be completely concrete; from my perspective, the main thing 
> we'd get out of switching is something that is more complete already.
>
> past that
> "I don't want language-specific kinds of nodes in the grammar, and
I
> believe that it's not necessary under some reasonable variant of this 
> scheme. Maybe I'm wrong."
>
> If your position is that you'd be fine with a variant of this scheme, 
> than i don't really think we disagree at all :)
That's exactly my position. Good :-)

Here's what I find most intriguing about this: In the way we'd discussed
extending the current scheme to handle unions, we extend the current way 
that we traverse the graph such that, if there are multiple fields in 
one of the types with the same offset (i.e. a union), when we need to 
walk up the graph through all fields. While I still believe this is 
unlikely to be problematic in practice, we're now exploring many paths, 
and the asymptotic complexity doesn't thrill me.

If I'm thinking about this correctly, this type of encoding makes 
dealing with unions much more efficient. The frontend knows the access 
path, and can encode that directly. We don't need to explore many paths 
through a graph to find it (i.e. find a potential set of paths through 
the graph that indicate potential aliasing). Instead we can just examine 
the two paths directly encoded. If they both have the same union type at 
the same offset, then they can alias. If they have the same union type 
but incompatible offsets, they can't. No path-finding required. Also, if 
I'm thinking about this correctly, doing it this way is also more 
accurate (because it can distinguish between structurally-identical 
union members, and the proposed extension we'd discussed previously 
cannot (i.e., it would give a conservative answer)).

Thoughts?

Thanks again,
Hal
>
>
> On Sat, Aug 19, 2017 at 12:00 PM, Hal Finkel <hfinkel at anl.gov 
> <mailto:hfinkel at anl.gov>> wrote:
>
>
>     On 08/17/2017 11:25 PM, Daniel Berlin wrote:
>>
>>     <just want to focus on these parts for a second. *All* of these
>>     representations are really access path representations, just
>>     encoded slightly different ways, and, as a result, with various
>>     parts of the rules in slightly different places>
>>
>>         Imagine that we took the enhancement we previously discussed,
>>         but instead of implementing it directly, we just directly
>>         encoded for every access the path from the access type to the
>>         root. I think it would look very much like this proposal.
>>
>>
>>     Something like it, yes.
>>     Note that this representation also has special vtable and union
>>     groups, for example, and assumes very c-like rules in several
>>     places around the sequencing of access types.
>>     Also note that  in the language of access paths, C allows
>>     overlapping that does not exist in, for example, Java
>
>     Is this a statement generally favoring or disfavoring having the
>     frontend encode the path explicitly?
>
>>     (This is true in both the points-to and type-based domains).
>>     Ada has discriminated unions (and you could not use the
"union"
>>     in this proposal to represent them)
>
>     Yes, but discriminated unions also have a dynamic element to them.
>     As a result, I suspect we'd need a scheme that captures that (or
>     else we'd need to model them like a struct with a field and
>     something like a C union).
>
>>     etc.
>>
>>
>>>
>>>
>>>             We generally explain our current TBAA rules by saying
>>>             that they're generic but motivated by C/C++ rules.
>>>
>>>
>>>         We do say that but that's not really what our
implementation
>>>         does in any way.
>>
>>         Really? I thought it was motivated by C/C++ rules. When you
>>         say that it's "not really what our implementation
does", is
>>         this because it drops a lot of potential information in the
>>         translation to our generic form?
>>
>>
>>     What we do is completely and totally unrelated to types as they
>>     exist in the original language.  It is a completely and totally
>>     generic implementation with no rules related to the original
>>     language.  The original language, for example, has dynamic typing
>>     rules, object lifetime rules, etc. We have none of these.
>>
>>     The types do not, in fact,even always have the same relationship
>>     we give them.
>>
>>     We have a tree and some nodes, and we happen to name them after
>>     types.  Like this proposal, they also represent computed access
>>     paths, in a much simpler language.
>>
>>     The nodes represent alias set members (either one or many)
>>     The edges represent either "access at offset" (where the
default
>>     is 0 if unweighted).
>>     We have a simple rule that that processes the access paths
>>     related to ancestry.
>
>     I certainly understand all of this ;)
>
>>
>>     You can view it as either a grammar parsing or a graph
>>     reachability problem depending on what works for you (in fact,
>>     it's really a Dyck-CFL reachability problem on bidirected
trees)
>>
>>     The reachability rule is  usually given as: If node Target is
>>     reachable from node Source, offset another through either it's
>>     children or the upwards edges (or was it downwards, i always
>>     screw up which direction we go in), they may-alias
>>
>>     You could also implement it as a real grammar parsing problem (IE
>>     you quite literally could generate strings from tag + offset, and
>>     parse them against the grammar, and see if the terminal nodes
>>     contain your target).  They are equivalent.
>>
>>     This representation would be the same in that regard.
>>
>>     Our current lowering for clang happens to kind of look like c/c++
>>     structures converted to a tree.
>>
>>     However, this is actually just inefficient, space wise, and done
>>     because it's simple to lower it this way.
>
>     I agree (although this wasn't happenstance; the representation was
>     designed with an expected lowering scheme in mind, at least for
>     C/C++).
>
>>     Because the accesses are completely unrelated to the original
>>     types, and require *no* language rules to interpret, you could
>>     also just partition the things that alias by the language rules
>>     in the frontend, and then output a tree that represents the
>>     possible unique paths.
>>
>>     IE figure out all the answers, then re-encode it as precisely as
>>     possible.
>>
>>     This trades time for space.
>>
>>     This representation is *super* generic. That is, the language
>>     being used here is super simple, and the reachability rule is
>>     super simple.
>>
>>     I could have the frontend generate these trees based on anything
>>     i like.  I could, in fact, encode steensgaard points-to results
>>     as this tree without any trouble.
>>
>>     The access path language described in this proposal is more
>>     complex and complete, and directly closer to access paths you
>>     find in C/C++.  It has bitfields, vtables, and unions.
>
>     It is more complex, but I think this is partly in the description.
>     AFAIKT, only "union" has special properties under the scheme.
Bit
>     fields, vtables, etc. are all just particular types with no
>     particular special rules. I don't like special rules at all for
>     any named entities, but as there's only one such entity right now
>     ("union"), this is an encoding choice we could bikeshed.
>
>>     The reachability rules are more complex.
>>     Is it possible to express other languages in that set of access
>>     path rules?
>>     Sure.  For example, you can, as above, generate the set of
>>     answers, and then re-encode it into access paths.
>>
>>     Right now, the work it takes *in the front end* is minimal, and
>>     has a fairly efficient space encoding.
>>     if i want to say two things alias, i just gotta be able to reach
>>     one from the other.
>>     If i want to say two things do not alias, i just gotta be able to
>>     not reach one from the other only using certain types of edges
>>     In our current language, all that takes in a frontend is "find
>>     longest no-aliasing part of tree. Go to parent, add new
child".
>>
>>     In the proposed language, the lowering is more complex.  Is it
>>     doable?
>>     Sure, of course, not gonna claim otherwise.  But the more
>>     "features" of the access path you add and expect the
middle end
>>     to handle, instead of the front-end expanding them, and the more
>>     those feature's reachability rules are related to a specific
>>     language the more language-specific it gets.
>>
>>     That's the *only* tradeoff we are making here,
>>     representationally.   How much does the frontend have to
>>     understand about how the middle end walks these things, in order
>>     to generate whatever it wants as precisely as possible
>>
>>     We can make whichever we want express everything we want in N^3
>>     time :)
>>     The real question is "do we try to add on to what we have in
ways
>>     that work for multiple languages, and are expressed neutrally in
>>     a simple reachability language"
>>     or "do we add language-specific kinds of nodes to the grammar,
>>     and have reachability rules that are fairly language
specific".
>
>     I don't want language-specific kinds of nodes in the grammar, and
>     I believe that it's not necessary under some reasonable variant of
>     this scheme. Maybe I'm wrong.
>
>>
>>     IE do you add, say, discriminated_union nodes to our current
>>     representation for ada, or "vtable_access" nodes to our
current
>>     representation for C++ vtable accesses
>>     Or do you instead generate a metadata that has a unidirectional
>>     edge reachability (IE up only), or whatever it takes to do
>>     vtables generically.
>>
>>     Both are completely and totally viable paths, and it's all
about
>>     which way you want to go.
>>     But they *definitely* have a difference in terms of
>>     language-specificness.
>>
>
>     To be clear, if the system is not generic, I'm far less interested.
>
>     Thanks again,
>     Hal
>
>>
>>
>>
>>
>
>     -- 
>     Hal Finkel
>     Lead, Compiler Technology and Programming Languages
>     Leadership Computing Facility
>     Argonne National Laboratory
>
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170819/4ce20b3b/attachment.html>

Daniel Berlin via llvm-dev

2017-Aug-19 19:45 UTC

head link

[llvm-dev] RFC: Resolving TBAA issues

On Sat, Aug 19, 2017 at 12:33 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> On 08/19/2017 02:05 PM, Daniel Berlin wrote:
>
> FWIW: to be completely concrete; from my perspective, the main thing
we'd
> get out of switching is something that is more complete already.
>
> past that
> "I don't want language-specific kinds of nodes in the grammar, and
I
> believe that it's not necessary under some reasonable variant of this
> scheme. Maybe I'm wrong."
>
> If your position is that you'd be fine with a variant of this scheme,
than
> i don't really think we disagree at all :)
>
>
> That's exactly my position. Good :-)
>
> Here's what I find most intriguing about this: In the way we'd
discussed
> extending the current scheme to handle unions, we extend the current way
> that we traverse the graph such that, if there are multiple fields in one
> of the types with the same offset (i.e. a union), when we need to walk up
> the graph through all fields. While I still believe this is unlikely to be
> problematic in practice, we're now exploring many paths, and the
asymptotic
> complexity doesn't thrill me.
>
>
> If I'm thinking about this correctly, this type of encoding makes
dealing
> with unions much more efficient. The frontend knows the access path, and
> can encode that directly.
>

The usual problem with this, historically, is, staring at a given statement
(or even function), that the frontend doesn't always know the entire access
path. It can be trivially split over function calls, etc.  This is why the
usual solution was to walk the graph in both directions, with one of those
directions giving you "possible suffixes of this access path", and the
other giving you "possible prefixes of this access path", so you could
discover the alignment of one access path within another.

For unions,  since we already decided we are going to say, in those cases,
that the user needs to be more explicit, it shouldn't be an issue.

For non-unions, whether this is okay or not depends on your interpretation
and execution of the rules.

If we go by the interpretation laid out by Ivan so far, it should be
possible to make this work.

I would suggest one way to make sure we get what we want, if it doesn't
exist, is a fairly comprehensive set of tbaa test cases :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170819/fefd3b76/attachment.html>

llvm dev - Aug 2017 - RFC: Resolving TBAA issues

[llvm-dev] RFC: Resolving TBAA issues

[llvm-dev] RFC: Resolving TBAA issues

[llvm-dev] RFC: Resolving TBAA issues