thr3ads.net - llvm dev - [llvm-dev] RFC: Representing unions in TBAA [Feb 2017]

If this information is useful, please help other people find it:
Share via:

Daniel Berlin via llvm-dev

2017-Feb-14 15:05 UTC

[llvm-dev] RFC: Representing unions in TBAA

On Tue, Feb 14, 2017 at 5:51 AM, Hubert Tong <
hubert.reinterpretcast at gmail.com> wrote:
> On Mon, Feb 13, 2017 at 10:39 PM, Daniel Berlin <dberlin at
dberlin.org>
> wrote:
>
>>
>>
>> On Mon, Feb 13, 2017 at 10:07 AM, Hubert Tong <
>> hubert.reinterpretcast at gmail.com> wrote:
>>
>>> On Mon, Feb 13, 2017 at 2:23 AM, Daniel Berlin via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>>
>>>>> I don't think this fully solves the problem --
you'll also need to fix
>>>>> getMostGenericTBAA.  That is, even if you implement the
above scheme,
>>>>> say you started out with:
>>>>>
>>>>> union U {
>>>>>   int i;
>>>>>   float f;
>>>>> };
>>>>>
>>>>> float f(union U *u, int *ii, float *ff, bool c) {
>>>>>   if (c) {
>>>>>     *ii = 10;
>>>>>     *ff = 10.0;
>>>>>   } else {
>>>>>     u->i = 10;    // S0
>>>>>     u->f = 10.0;  // S1
>>>>>   }
>>>>>   return u->f;
>>>>> }
>>>>>
>>>>> (I presume you're trying to avoid reordering S0 and
S1?)
>>>>>
>>>>> SimplifyCFG or some other such pass may transform f to:
>>>>>
>>>>> float f(union U *u, int *ii, float *ff, bool c) {
>>>>>   int *iptr = c ? ii : &(u->i);
>>>>>   int *fptr = c ? ff : &(u->f);
>>>>>   *iptr = 10;     // S2
>>>>>   *fptr = 10.0;   // S3
>>>>>   return u->f;
>>>>> }
>>>>>
>>>>> then getMostGenericTBAA will infer scalar "int"
TBAA for S2 and scalar
>>>>> "float" TBAA for S3, which will be NoAlias and
allow the reordering
>>>>> you were trying to avoid.
>>>>>
>>>>
>>>> FWIW, i have to read this in detail, but a few things pop out
at me.
>>>>
>>>> 1. We would like to live in a world where we don't depend
on TBAA
>>>> overriding BasicAA to get correct answers.  We do now, but
don't want to.
>>>> Hopefully this proposal does not make that impossible.
>>>>
>>>> 2.  Literally the only way that GCC ends up getting this right
is two
>>>> fold:
>>>> It only guarantees things about direct access through union.
>>>> If you take the address of the union member (like the transform
above),
>>>> it knows it will get a wrong answer.
>>>> So what it does is it finds the type it has to stop at (here,
the
>>>> union) to keep the TBAA set the same, and makes the transform
end there.
>>>> So the above would not occur.
>>>>
>>>>
>>>> 3. A suggestion that TBAA follow all possible paths seems ..
very slow.
>>>>
>>>> 4. "The main motivation for this is functional correctness
of code
>>>> using unions".  I believe you mean "with tbaa and
strict-aliasing on".
>>>> If not,functional correctness for unions should not be in any
way
>>>> related to requiring TBAA.
>>>>
>>>> 5. Unions are among the worst area of the standard in terms of
"nobody
>>>> has really thought super-hard about the interaction of aliasing
and unions
>>>> in a way that is coherent".
>>>> So when you say things like 'necessary for functional
correctness of
>>>> unions', just note that this is pretty much wrong.  You
probably mean
>>>> "necessary for a reasonable interpretation" or
something.
>>>>
>>>> Because we would be *functionally correct* by the standard by
>>>> destroying the program  if you ever read the member you
didn't set :)
>>>>
>>> C11 subclause 6.5.2.3 paragraph 3, has in footnote 95:
>>> If the member used to read the contents of a union object is not
the
>>> same as the member last used to store a value in the object, the
>>> appropriate part of the object representation of the value is
reinterpreted
>>> as an object representation in the new type as described in 6.2.6
(a
>>> process sometimes called "type punning"). This might be a
trap
>>> representation.
>>>
>>> So, the intent is at least that the use of the . operator or the
->
>>> operator to access a member of a union would "safely"
perform type punning.
>>>
>>>
>> Certainly, if you can quote this, you know this is new to C11 (and
newer
>> versions of C++).
>> :)
>>
> Yes, this is new to C11; however, the question for us here is whether, and
> under what options, we would support such intended use of the language.
>
I was on board with it :)
 I remember pointing out most of the cases we have bugs for when
struct-path TBAA was designed (on a whiteboard, in fact) and the consensus
answer i got back was basically "we'll just punt on unions for
now".  I did
say i suspected this would not work out well, soundness wise, but the clear
consensus was against me, so here we are :)
In any case, reading this in detail:
1. "In the struct path TBAA, it is assumed that the length of a member
extends
to the start of the next.  This will not be true for unions."

Staring at the langref, i don't see where it assumes this.
it is just offset, size, tbaa type, which seems correct to me.

If you mean "the implementation assumes", then yeah, the
implementation
should just be fixed.

"In TypeBasedAAResult::alias, we will also have to deal with the fact that
unions
have multiple members all at offset 0.  This means that, unlike structs, the
compiler will not be able to tell which member of the union the memory
reference
actually used.  To deal with this, TypeBasedAAResult::alias (and the
functions
it calls) will have to acts as if all of the unions members could have been
accessed. "

Actually, for correctness, it just has to give a conservative answer of
"may-alias"

This is papering over the fact that if it does that, basicaa overrides it.
That we should just fix that, not rely on making TBAA work harder to be
correct (working harder to optimize is fine).

That is, say "these are may-alias, this is a final answer" instead of
"these are may-alias, maybe you can do better?"

But yes, if you are trying to optimize unions harder (and i don't know why
you would :P), yes, you'd have to follow all intersecting members.
In gcc we just sort the list, and use offset, size info to see what  fields
it overlaps with.

But note: Your correctness still depends on being able to tell these are
union accesses.
That's bad, because as sanjoy says, we do transforms that may cause this to
be non-recoverable.

If you can't have TBAA say "may-alias, final answer", you lose.

(Note, while you are improving this, i'll note the info should also be used
to differentiate structure fields and solve the issue discussed here:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20160523/date.html

The offset, size info is useful and should be used to differentiate struct
accesses *regardless of type* in cases where we can.
It doesn't actually matter if one field is an int and the other a float. If
they are different fields, and we can prove it, they can't alias anyway)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170214/0849d896/attachment-0001.html>

Steven Perron via llvm-dev

2017-Feb-15 04:22 UTC

head link

[llvm-dev] RFC: Representing unions in TBAA

Thanks Daniel and Sanjoy for you comments.  I'll try to list the issues I 
need to respond to, and double check my understand.

1) Sometimes I think of things a certain way, and I assume other people 
think the same way.  When I say that I do not want to rely on basic alias 
analysis to override the TBAA, I do not consider "may-alias" an
answer.  I
want to avoid the situation where TBAA says "no-alias, but that might be 
wrong", at least for legal code.  In the same spirit, I primarily work on 
optimizing code, so, when I say the main point is functional correctness, 
I really mean to make things functionally correct while still being able 
to optimize as much as possible.

2) Any function that deals with the TBAA nodes directly must be updated to 
handle the new form, a struct-path with multiple member at offset 0.  One 
example is getMostGenericTBAA.  I'll look into that function and see what 
we can do about it.  I'll need to look for other.  However, I do not 
consider this something that should stop us from implementing something 
like this.  I believe they can all be handled in a reasonable way. 
"getMostGenericTBAA" could be handled by Making recursive calls with
each
member of the union, or something equivalent. 

3) How should we handle a reference directly through a union, and a 
reference that is not through the union? 

My solution was to look for each member of the union overlaps the given 
offset, and see if any of those members aliased the other reference.  If 
no member aliases the other reference, then the answer is no alias. 
Otherwise the answer is may alias.  The run time for this would be 
proportional to  "distance to the root" * "number of overlapping
members".
 This could be slow if there are unions with many members or many unions 
of unions.

Another option is to say that they do not alias.  This would mean that all 
references to unions must be explicitly through the union.  This would be 
the least restrictive aliasing allowing the most optimization.  The 
implementation would be simple.  I believe we make the parent of the TBAA 
node for the union to be "omnipotent char".  This might be similar to 
treating the union TBAA node more like a scalar node instead of a 
struct-path.  Then the traversal of the TBAA nodes will be quick.  I'll 
have to work this out a bit more, but, if this is good enough to meet the 
requirements of the standard, I can try to think this through a little 
more.  I'll need Hubert and Daniel to comment on that since I am no expert 
on the C and C++ standards.

The third option is to be pessimistic and say "may alias" all of the
time
(conservatively correct), and rely on other alias analysis to improve it. 
This will have good compile time, but could hinder optimization. 
Personally I do not like this option.  Most of the time it will not have a 
negative effect, but there will be a reasonable number of programs where 
this will hurt optimization more that it needs to.

4) Sonjoy asked a question about how I would be using length information:

It was not clear how you'll take the length into consideration.  Are
you talking about inferences like "I know the access was 4 bytes long,
so it can't be accessing the `short` field"?  If so, we'd need to
verify (and maintain) that no passes ask AA about a subsegment of a
memory access (i.e. "does that store alias the first two bytes of this
i32 load"); or at least prevent them from using TBAA when they refine
their queries this way.

The answer is no, I will not be trying to infer length information from 
the access type.  The idea is if I have two references to the same union. 
One if offset o1 and size s1, the other with offset o2 and size s2. 
Suppose without lose of generality that o1 <= o2.  Then if o1+s1 <= o2, 
then the references do not alias.  Otherwise they may alias.  There may be 
a few case where a size is not available.  When dealing with call site for 
example.  In this case, we have to treat the size as "infinite" 
pessimizing the aliasing.

Again thanks for your comments.

Later,
Steven Perron

Daniel Berlin <dberlin at dberlin.org> wrote on 2017/02/14 10:05:29 AM:
> From: Daniel Berlin <dberlin at dberlin.org>
> To: Hubert Tong <hubert.reinterpretcast at gmail.com>
> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>, Steven Perron/
> Toronto/IBM at IBMCA, llvm-dev <llvm-dev at lists.llvm.org>
> Date: 2017/02/14 10:05 AM
> Subject: Re: [llvm-dev] RFC: Representing unions in TBAA
> 
> On Tue, Feb 14, 2017 at 5:51 AM, Hubert Tong 
<hubert.reinterpretcast at gmail.com> > wrote:
> On Mon, Feb 13, 2017 at 10:39 PM, Daniel Berlin <dberlin at
dberlin.org>
wrote:> 
> On Mon, Feb 13, 2017 at 10:07 AM, Hubert Tong <
> hubert.reinterpretcast at gmail.com> wrote:
> On Mon, Feb 13, 2017 at 2:23 AM, Daniel Berlin via llvm-dev <llvm-
> dev at lists.llvm.org> wrote:
> 
> I don't think this fully solves the problem -- you'll also need to
fix
> getMostGenericTBAA.  That is, even if you implement the above scheme,
> say you started out with:
> 
> union U {
>   int i;
>   float f;
> };
> 
> float f(union U *u, int *ii, float *ff, bool c) {
>   if (c) {
>     *ii = 10;
>     *ff = 10.0;
>   } else {
>     u->i = 10;    // S0
>     u->f = 10.0;  // S1
>   }
>   return u->f;
> }
> 
> (I presume you're trying to avoid reordering S0 and S1?)
> 
> SimplifyCFG or some other such pass may transform f to:
> 
> float f(union U *u, int *ii, float *ff, bool c) {
>   int *iptr = c ? ii : &(u->i);
>   int *fptr = c ? ff : &(u->f);
>   *iptr = 10;     // S2
>   *fptr = 10.0;   // S3
>   return u->f;
> }
> 
> then getMostGenericTBAA will infer scalar "int" TBAA for S2 and
scalar
> "float" TBAA for S3, which will be NoAlias and allow the
reordering
> you were trying to avoid.
> 
> FWIW, i have to read this in detail, but a few things pop out at me.
> 
> 1. We would like to live in a world where we don't depend on TBAA 
> overriding BasicAA to get correct answers.  We do now, but don't want 
to.> Hopefully this proposal does not make that impossible.
> 
> 2.  Literally the only way that GCC ends up getting this right is two 
fold:> It only guarantees things about direct access through union.
> If you take the address of the union member (like the transform 
> above), it knows it will get a wrong answer.
> So what it does is it finds the type it has to stop at (here, the 
> union) to keep the TBAA set the same, and makes the transform end there.
> So the above would not occur.
> 
> 3. A suggestion that TBAA follow all possible paths seems .. very slow.
> 
> 4. "The main motivation for this is functional correctness of code 
> using unions".  I believe you mean "with tbaa and strict-aliasing
on".
> If not,functional correctness for unions should not be in any way 
> related to requiring TBAA.
> 
> 5. Unions are among the worst area of the standard in terms of 
> "nobody has really thought super-hard about the interaction of 
> aliasing and unions in a way that is coherent".
> So when you say things like 'necessary for functional correctness of
> unions', just note that this is pretty much wrong.  You probably 
> mean "necessary for a reasonable interpretation" or something.
> 
> Because we would be *functionally correct* by the standard by 
> destroying the program  if you ever read the member you didn't set :)
> C11 subclause 6.5.2.3 paragraph 3, has in footnote 95:
> If the member used to read the contents of a union object is not the
> same as the member last used to store a value in the object, the 
> appropriate part of the object representation of the value is 
> reinterpreted as an object representation in the new type as 
> described in 6.2.6 (a process sometimes called "type punning").
This
> might be a trap representation.
> So, the intent is at least that the use of the . operator or the -> 
> operator to access a member of a union would "safely" perform
type
punning.>  
> Certainly, if you can quote this, you know this is new to C11 (and 
> newer versions of C++).
> :)
> Yes, this is new to C11; however, the question for us here is 
> whether, and under what options, we would support such intended use 
> of the language.
> 
> I was on board with it :)
>  I remember pointing out most of the cases we have bugs for when 
> struct-path TBAA was designed (on a whiteboard, in fact) and the 
> consensus answer i got back was basically "we'll just punt on
unions
> for now".  I did say i suspected this would not work out well, 
> soundness wise, but the clear consensus was against me, so here we are 
:)> In any case, reading this in detail:
> 1. "In the struct path TBAA, it is assumed that the length of a member
extends> to the start of the next.  This will not be true for unions."
> 
> Staring at the langref, i don't see where it assumes this.
> it is just offset, size, tbaa type, which seems correct to me.
> 
> If you mean "the implementation assumes", then yeah, the 
> implementation should just be fixed.
> 
> "In TypeBasedAAResult::alias, we will also have to deal with the 
> fact that unions
> have multiple members all at offset 0.  This means that, unlike structs, 
the> compiler will not be able to tell which member of the union the 
> memory reference
> actually used.  To deal with this, TypeBasedAAResult::alias (and 
thefunctions> it calls) will have to acts as if all of the unions members could have 
been> accessed. "
> 
> Actually, for correctness, it just has to give a conservative answer
> of "may-alias"
> 
> This is papering over the fact that if it does that, basicaa overrides 
it.> That we should just fix that, not rely on making TBAA work harder to
> be correct (working harder to optimize is fine).
> 
> That is, say "these are may-alias, this is a final answer"
instead
> of "these are may-alias, maybe you can do better?"
> 
> But yes, if you are trying to optimize unions harder (and i don't 
> know why you would :P), yes, you'd have to follow all intersecting 
members.> In gcc we just sort the list, and use offset, size info to see what 
>  fields it overlaps with.
> 
> But note: Your correctness still depends on being able to tell these
> are union accesses.
> That's bad, because as sanjoy says, we do transforms that may cause 
> this to be non-recoverable.
> 
> If you can't have TBAA say "may-alias, final answer", you
lose.
> 
> (Note, while you are improving this, i'll note the info should also 
> be used to differentiate structure fields and solve the issue discussed 
here:> 
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20160523/date.html> 
> The offset, size info is useful and should be used to differentiate 
> struct accesses *regardless of type* in cases where we can.
> It doesn't actually matter if one field is an int and the other a 
> float. If they are different fields, and we can prove it, they can't
> alias anyway)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170214/693e4785/attachment.html>

Hubert Tong via llvm-dev

2017-Feb-15 12:44 UTC

head link

[llvm-dev] RFC: Representing unions in TBAA

On Tue, Feb 14, 2017 at 11:22 PM, Steven Perron <perrons at ca.ibm.com>
wrote:
> 3) How should we handle a reference directly through a union, and a
> reference that is not through the union?
>
> My solution was to look for each member of the union overlaps the given
> offset, and see if any of those members aliased the other reference.  If no
> member aliases the other reference, then the answer is no alias.  Otherwise
> the answer is may alias.  The run time for this would be proportional to
>  "distance to the root" * "number of overlapping
members".  This could be
> slow if there are unions with many members or many unions of unions.
>
> Another option is to say that they do not alias.  This would mean that all
> references to unions must be explicitly through the union.
>From what I gather from the thread so far, the access through the union maybe lost because of LLVM transformations. I am not sure how, in the face of
that, TBAA could indicate NoAlias safely (without the risk of
functional-correctness issues in correct programs) between types which
overlap within any union (within some portion of the program).

As for the standards, it is definitely not true that all references to
unions must be explicitly through the union. However, if you are trying to
perform union-based type punning (under C11), then it appears that it is
intended that the read must be through the union.

> This would be the least restrictive aliasing allowing the most
> optimization.  The implementation would be simple.  I believe we make the
> parent of the TBAA node for the union to be "omnipotent char". 
This might
> be similar to treating the union TBAA node more like a scalar node instead
> of a struct-path.  Then the traversal of the TBAA nodes will be quick.
> I'll have to work this out a bit more, but, if this is good enough to
meet
> the requirements of the standard, I can try to think this through a little
> more.  I'll need Hubert and Daniel to comment on that since I am no
expert
> on the C and C++ standards.
>
> The third option is to be pessimistic and say "may alias" all of
the time
> (conservatively correct), and rely on other alias analysis to improve it.
> This will have good compile time, but could hinder optimization.
> Personally I do not like this option.  Most of the time it will not have a
> negative effect, but there will be a reasonable number of programs where
> this will hurt optimization more that it needs to.
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170215/217c416e/attachment.html>

Possibly Parallel Threads

Search for more reasonably related threads

llvm dev - Feb 2017 - RFC: Representing unions in TBAA

[llvm-dev] RFC: Representing unions in TBAA

[llvm-dev] RFC: Representing unions in TBAA

[llvm-dev] RFC: Representing unions in TBAA

Possibly Parallel Threads