John McCall via llvm-dev
2021-Jun-22 17:10 UTC
[llvm-dev] [RFC] Introducing a byte type to LLVM
On 22 Jun 2021, at 12:02, Juneyoung Lee wrote:

> On Tue, Jun 22, 2021 at 4:08 AM John McCall <rjmccall at apple.com> wrote:
>> On 21 Jun 2021, at 2:15, Juneyoung Lee wrote:
>>
>> Hi,
>> Sorry for my late reply, and thank you for sharing great summaries & ideas.
>> I'll leave my thoughts below.
>>
>> On Wed, Jun 16, 2021 at 8:56 AM John McCall <rjmccall at apple.com> wrote:
>>
>> Now, that rule as I’ve stated it would be really bad. Allowing a
>> lucky guess to resolve to absolutely anything would almost
>> completely block the optimizer from optimizing memory. For example,
>> if a local variable came into scope, and then we called a function
>> that returned a different pointer, we’d have to conservatively
>> assume that that pointer might alias the local, even if the address
>> of the local was never even taken, much less escaped:
>>
>> int x = 0;
>> int *p = guess_address_of_x();
>> *p = 15;
>> printf("%d\n", x); // provably 0?
>>
>> So the currently favored proposal adds a really important caveat:
>> this blessing of provenance only works when a pointer with the
>> correct provenance has been “exposed”. There are several ways to
>> expose a pointer, including I/O, but the most important is casting
>> it to an integer.
>>
>> This is a valid point. If one wants to formally show the correctness
>> of this kind of memory optimization, this problem should be tackled.
>> I think N2676's 'Allocation-address nondeterminism' (p. 27) paragraph
>> addresses this issue.
>> The underlying idea is that the address of an allocated object is
>> assumed to be non-deterministically chosen, causing any guessed
>> accesses to raise undefined behavior in at least one execution.
>>
>> Ah, that’s an interesting idea. I must have missed that section; I’m
>> afraid I only skimmed N2676 looking for the key points, because it’s
>> quite a long document.
>>
>> To be clear, under the PNVI-ae family of semantics, that rule would
>> not be necessary to optimize x in my example, because int-to-pointer
>> casts are not allowed to recreate provenance for x when it has not
>> been exposed. That rule does theoretically allow optimization of
>> some related examples where the address of x *has* been exposed,
>> because it lets the compiler try to reason about what happens after
>> exposure; it’s no longer true that exposure implies guesses are okay.
>>
>> Unfortunately, though, I think this non-determinism still doesn’t
>> allow LLVM to be anywhere near as naive about pointer-to-int casts
>> as it is today. The rule is intended to allow the compiler to start
>> doing use-analysis of exposures; let’s assume that this analysis
>> doesn’t see any un-analyzable uses, since of course it would need to
>> conservatively treat them as escapes. But if we can optimize uses of
>> integers as if they didn’t carry pointer data — say, in a function
>> that takes integer parameters — and then we can apply those
>> optimized uses to integers that concretely result from
>> pointer-to-int casts — say, by inlining that function into one of
>> its callers — can’t we end up with a use pattern for one or more of
>> those pointer-to-int casts that no longer reflects the fact that
>> it’s been exposed? It seems to me that either (1) we cannot do those
>> optimizations on opaque integers, or (2) we need to record that we
>> did them in a way that, if it turns out that they were created by
>> pointer-to-int casts, forces other code to treat that pointer as
>> opaquely exposed.
>
> IIUC the corresponding example would be:
>
> int p;
> intptr_t i = (intptr_t)&p;
> f(i);
> // Initially f(i) exposes p by storing i into somewhere, but
> // optimization changed f to not store i anymore (e.g. i was
> // replaced with a constant)
>
> In this case, p was already exposed by '(intptr_t)&p' I guess; the
> body of f(i) won't affect the exposedness of p.

Right, formally I think that’s true.

>> All of these concerns are valid.
>>
>> (I'm not sure whether this is a good place to introduce this, but)
>> we actually have semantics for pointer casts tailored to LLVM (link
>> <https://sf.snu.ac.kr/publications/llvmtwin.pdf>).
>> In this proposal, ptrtoint does not have an escaping side effect;
>> ptrtoint and inttoptr are scalar operations.
>> inttoptr simply returns a pointer which can access any object.
>>
>> Skimming your paper, I can see how this works *except* that I don’t
>> see any way not to treat ptrtoint as an escape. And really I think
>> you’re already partially acknowledging that, because that’s the only
>> real sense of saying that inttoptr(ptrtoint p) can’t be reduced to
>> p. If those are really just scalar operations that don’t expose p in
>> ways that might be disconnected from the uses of the inttoptr, then
>> that reduction ought to be safe.
>>
>> John.
>
> It again relies on nondeterminism of the address of an object.
>
> For example:
>
> p = call malloc(4) ; p's address is assumed to be nondeterministically chosen
> i = ptrtoint p
> ; i is never used
> ret void
>
> Since i is never used, this program cannot safely access p via
> address guessing + inttoptr.
> This allows a (more precise) escape analysis to conclude that a
> malloc with one unused ptrtoint never escapes.

I think this is too local an analysis. Your proposal generally relies
on certain optimizations not applying to pointers because they mess up
provenance as represented in transitive use-dependencies. If those
optimizations can be applied to integers, you can lose use-dependencies
in exactly the same way as you can with pointers.
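[Editor's note: Juneyoung's IR example above corresponds roughly to the C sketch below. The function name and the concrete values are illustrative, not from the thread.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Under the nondeterministic-address model, `i` is computed but never
   used afterwards, so no guessed access to *p can be justified in
   every execution. A (more precise) escape analysis may therefore
   treat p as non-escaping and forward the store to the load. */
int unused_cast_does_not_escape(void) {
    int *p = malloc(sizeof *p);
    if (!p) return -1;
    *p = 42;
    intptr_t i = (intptr_t)p;   /* ptrtoint whose result is never used */
    (void)i;
    int v = *p;                 /* may be folded to 42 by store forwarding */
    free(p);
    return v;
}
```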
Not doing the `inttoptr(ptrtoint p) -> p` reduction doesn’t matter at
all, because in either case that value has a use-dependency on `p`,
whereas the actual problem is that there is no longer a use-dependency
on some other value.

For example, you have compellingly argued that it’s problematic to
reduce `a == b ? a : b` to `b` for pointer types. Suppose I instead do
this optimization on integers, where `a = ptrtoint A`. The result is
now simply `b`. If I `inttoptr` that result and access the memory,
there will be no record that that access may validly be to `A`. It
does not help that the access may be represented as
`inttoptr (ptrtoint B)` for some `B` rather than just directly as `B`,
because there is no use-dependence on `A`. All there is is an
apparently unrelated and unused `ptrtoint A`.

Obviously we can avoid doing this optimization locally when we see
that the inputs result from `ptrtoint`, but that’s no more than
best-effort: we can do this optimization in a function which we later
inline into a caller that performs all the `ptrtoint` and `inttoptr`
casts.

John.
Ralf Jung via llvm-dev
2021-Jun-22 17:21 UTC
[llvm-dev] [RFC] Introducing a byte type to LLVM
Hi,

> Your proposal generally relies on certain optimizations not applying
> to pointers because they mess up provenance as represented in
> transitive use-dependencies. If those optimizations can be applied
> to integers, you can lose use-dependencies in exactly the same way
> as you can with pointers. Not doing the inttoptr(ptrtoint p) -> p
> reduction doesn’t matter at all, because in either case that value
> has a use-dependency on p, whereas the actual problem is that there
> is no longer a use-dependency on some other value.

Note that "provenance" as we use it in this discussion is an
*explicit operational artifact* -- it exists as a concrete piece of
state in the Abstract Machine. That is very different from something
that might just be used internally in some kind of analysis. There is
no problem with "resetting" that provenance on an "inttoptr", and
basically forgetting about where the int comes from. Note that this is
a statement about an operation in the Abstract Machine, not merely a
statement about some analysis: this is not "forgetting" as in "safely
overapproximating the real set of possible behaviors"; it is
"forgetting" by *forcing* the provenance to be some kind of
wildcard/default provenance. All analyses then have to correctly
account for that.

> For example, you have compellingly argued that it’s problematic to
> do the reduction |a == b ? a : b| to |b| for pointer types. Suppose
> I instead do this optimization on integers where |a = ptrtoint A|.
> The result is now simply |b|. If I |inttoptr| that result and access
> the memory, there will be no record that that access may validly
> be to |A|. It does not help that the access may be represented
> as |inttoptr (ptrtoint B)| for some |B| rather than just directly
> to |B|, because there is no use-dependence on |A|. All there is
> is an apparently unrelated and unused |ptrtoint A|.

So that would be "ptrtoint A == ptrtoint B ? ptrtoint A : ptrtoint B"
being replaced by "ptrtoint B"?
I don't see any problem with that. Do you have a concrete example?

> Obviously we can avoid doing this optimization locally when we
> see that the inputs result from |ptrtoint|, but that’s no more
> than best-effort: we can do this optimization in a function which
> we later inline in a caller that performs all the |ptrtoint| and
> |inttoptr| casts.

I agree "avoiding something locally" is way too fragile to be
considered a solution. I do not think it is needed, though.

Kind regards,
Ralf