thr3ads.net - llvm dev - [llvm-dev] [RFC] Introducing a byte type to LLVM [Jun 2021]

If this information is useful, please help other people find it:
Share via:

Juneyoung Lee via llvm-dev

2021-Jun-22 16:02 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

On Tue, Jun 22, 2021 at 4:08 AM John McCall <rjmccall at apple.com> wrote:
> On 21 Jun 2021, at 2:15, Juneyoung Lee wrote:
>
> Hi,
> Sorry for my late reply, and thank you for sharing great summaries &
ideas.
> I'll leave my thoughts below.
>
> On Wed, Jun 16, 2021 at 8:56 AM John McCall <rjmccall at apple.com>
wrote:
>
> Now, that rule as I’ve stated it would be really bad. Allowing a
> lucky guess to resolve to absolutely anything would almost
> completely block the optimizer from optimizing memory. For example,
> if a local variable came into scope, and then we called a function
> that returned a different pointer, we’d have to conservatively
> assume that that pointer might alias the local, even if the address
> of the local was never even taken, much less escaped:
>
> int x = 0;
> int *p = guess_address_of_x();
> *p = 15;
> printf(“%d\n”, x); // provably 0?
>
> So the currently favored proposal adds a really important caveat:
> this blessing of provenance only works when a pointer with the
> correct provenance has been “exposed”. There are several ways to
> expose a pointer, including I/O, but the most important is casting
> it to an integer.
>
> This is a valid point. If one wants to formally show the correctness of
> this kind of memory optimization this problem should be tackled.
> I think n2676's 'Allocation-address nondeterminism' (p. 27)
paragraph
> addresses this issue.
> The underlying idea is that the address of an allocated object is assumed
> to be non-deterministically chosen, causing any guessed accesses to raise
> undefined behavior in at least one execution.
>
> Ah, that’s an interesting idea. I must have missed that section; I’m
> afraid I only skimmed N2676 looking for the key points, because it’s
> quite a long document.
>
> To be clear, under the PVNI-ae family of semantics, that rule would
> not be necessary to optimize x in my example because int-to-pointer
> casts are not allowed to recreate provenance for x because it has
> not been exposed. That rule does theoretically allow optimization of
> some related examples where the address of x *has* been exposed,
> because it lets the compiler try to reason about what happens after
> exposure; it’s no longer true that exposure implies guessing are okay.
>
> Unfortunately, though, I this non-determinism still doesn’t allow LLVM
> to be anywhere near as naive about pointer-to-int casts as it is today.
> The rule is intended to allow the compiler to start doing use-analysis
> of exposures; let’s assume that this analysis doesn’t see any
> un-analyzable uses, since of course it would need to conservatively
> treat them as escapes. But if we can optimize uses of integers as if
> they didn’t carry pointer data — say, in a function that takes integer
> parameters — and then we can apply those optimized uses to integers
> that concretely result from pointer-to-int casts — say, by inlining
> that function into one of its callers — can’t we end up with a use
> pattern for one or more of those pointer-to-int casts that no longer
> reflects the fact that it’s been exposed? It seems to me that either
> (1) we cannot do those optimizations on opaque integers or (2) we
> need to record that we did them in a way that, if it turns out that
> they were created by a pointer-to-int casts, forces other code to
> treat that pointer as opaquely exposed.
>IIUC the corresponding example would be:
int p;
intptr_t i = (intptr_t)&p;
f(i);
// Initially f(i) exposes p by storing i into somewhere, but optimization
changed f to not store i anymore (e.g. i was replaced with a constant)

In this case, p was already exposed by '(intptr_t)p' I guess; the body
of
f(i) won't affect the exposedness of p.
> <snip>
>
> Everything I’ve been talking about so far is a C-level concept:
> an int-to-pointer cast is e.g. (float*) myInt, not inttoptr
> in LLVM IR. But I think people have an expectation of what these
> things mean in LLVM IR, and I haven’t seen it written out explicitly,
> so let’s do that now.
>
> The first assumption here is that int-to-pointer and pointer-to-int
> casts in C will translate to inttoptr and ptrtoint operations
> in IR. Now, this is already problematic, because those operations
> do not currently have the semantics they need to have to make the
> proposed optimization model sound. In particular:
>
> -
>
> ptrtoint does not have side-effects and can be dead-stripped
> when unused, which as discussed above is not okay.
> -
>
> ptrtoint on a constant is folded to a constant expression,
> not an instruction, which is therefore no longer anchored in the
> code and does not reliably record that the global may have escaped.
> (Unused constant expressions do not really exist, and we absolutely
> cannot allow them to affect the semantics of the IR.)
>
> Of course, this is only significant for globals that don’t already
> have to be treated conservatively because they might have other
> uses. That is, it only really matters for globals with, say,
> internal or private linkage.
> -
>
> inttoptr can be reordered with other instructions, which is
> not allowed because different points in a function may have
> different sets of storage with escaped provenance.
> -
>
> inttoptr(ptrtoint) can be peepholed; ignoring the dead-stripping
> aspects of removing the inttoptr, this also potentially
> introduces UB because the original inttoptr “launders” the
> provenance of the pointer to the current provenance of the
> storage, whereas the original pointer may have stale provenance.
>
> All of these concerns are valid.
>
> (I'm not sure whether this is a good place to introduce this, but) we
> actually have semantics for pointer castings tailored to LLVM (link
> <https://sf.snu.ac.kr/publications/llvmtwin.pdf>).
> In this proposal, ptrtoint does not have an escaping side effect; ptrtoint
> and inttoptr are scalar operations.
> inttoptr simply returns a pointer which can access any object.
>
> Skimming your paper, I can see how this works *except* that I don’t
> see any way not to treat ptrtoint as an escape. And really I think
> you’re already partially acknowledging that, because that’s the only
> real sense of saying that inttoptr(ptrtoint p) can’t be reduced to
> p. If those are really just scalar operations that don’t expose
> p in ways that might be disconnected from the uses of the inttoptr
> then that reduction ought to be safe.
>
> John.
>It again relies on nondeterminism of the address of an object.

For example:
p = call malloc(4) ; p's address is assumed to be nondeterministically
chosen
i = ptrtoint p
; i is never used
ret void

Since i is never used, this program cannot safely access p via address
guessing + inttoptr.
This allows (more precise) escape analysis to conclude that malloc with one
unused ptrtoint is never escaped.

Juneyoung
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210623/b19a6051/attachment.html>

John McCall via llvm-dev

2021-Jun-22 17:10 UTC

head link

[llvm-dev] [RFC] Introducing a byte type to LLVM

On 22 Jun 2021, at 12:02, Juneyoung Lee wrote:> On Tue, Jun 22, 2021 at 4:08 AM John McCall <rjmccall at apple.com> 
> wrote:
>> On 21 Jun 2021, at 2:15, Juneyoung Lee wrote:
>>
>> Hi,
>> Sorry for my late reply, and thank you for sharing great summaries
&
>> ideas.
>> I'll leave my thoughts below.
>>
>> On Wed, Jun 16, 2021 at 8:56 AM John McCall <rjmccall at
apple.com>
>> wrote:
>>
>> Now, that rule as I’ve stated it would be really bad. Allowing a
>> lucky guess to resolve to absolutely anything would almost
>> completely block the optimizer from optimizing memory. For example,
>> if a local variable came into scope, and then we called a function
>> that returned a different pointer, we’d have to conservatively
>> assume that that pointer might alias the local, even if the address
>> of the local was never even taken, much less escaped:
>>
>> int x = 0;
>> int *p = guess_address_of_x();
>> *p = 15;
>> printf(“%d\n”, x); // provably 0?
>>
>> So the currently favored proposal adds a really important caveat:
>> this blessing of provenance only works when a pointer with the
>> correct provenance has been “exposed”. There are several ways to
>> expose a pointer, including I/O, but the most important is casting
>> it to an integer.
>>
>> This is a valid point. If one wants to formally show the correctness 
>> of
>> this kind of memory optimization this problem should be tackled.
>> I think n2676's 'Allocation-address nondeterminism' (p. 27)
paragraph
>> addresses this issue.
>> The underlying idea is that the address of an allocated object is 
>> assumed
>> to be non-deterministically chosen, causing any guessed accesses to 
>> raise
>> undefined behavior in at least one execution.
>>
>> Ah, that’s an interesting idea. I must have missed that section; 
>> I’m
>> afraid I only skimmed N2676 looking for the key points, because 
>> it’s
>> quite a long document.
>>
>> To be clear, under the PVNI-ae family of semantics, that rule would
>> not be necessary to optimize x in my example because int-to-pointer
>> casts are not allowed to recreate provenance for x because it has
>> not been exposed. That rule does theoretically allow optimization of
>> some related examples where the address of x *has* been exposed,
>> because it lets the compiler try to reason about what happens after
>> exposure; it’s no longer true that exposure implies guessing are 
>> okay.
>>
>> Unfortunately, though, I this non-determinism still doesn’t allow 
>> LLVM
>> to be anywhere near as naive about pointer-to-int casts as it is 
>> today.
>> The rule is intended to allow the compiler to start doing 
>> use-analysis
>> of exposures; let’s assume that this analysis doesn’t see any
>> un-analyzable uses, since of course it would need to conservatively
>> treat them as escapes. But if we can optimize uses of integers as if
>> they didn’t carry pointer data — say, in a function that takes 
>> integer
>> parameters — and then we can apply those optimized uses to integers
>> that concretely result from pointer-to-int casts — say, by inlining
>> that function into one of its callers — can’t we end up with a 
>> use
>> pattern for one or more of those pointer-to-int casts that no longer
>> reflects the fact that it’s been exposed? It seems to me that 
>> either
>> (1) we cannot do those optimizations on opaque integers or (2) we
>> need to record that we did them in a way that, if it turns out that
>> they were created by a pointer-to-int casts, forces other code to
>> treat that pointer as opaquely exposed.
>>
> IIUC the corresponding example would be:
> int p;
> intptr_t i = (intptr_t)&p;
> f(i);
> // Initially f(i) exposes p by storing i into somewhere, but 
> optimization
> changed f to not store i anymore (e.g. i was replaced with a constant)
>
> In this case, p was already exposed by '(intptr_t)p' I guess; the
body
> of
> f(i) won't affect the exposedness of p.
Right, formally I think that’s true.
>> All of these concerns are valid.
>>
>> (I'm not sure whether this is a good place to introduce this, but)
we
>> actually have semantics for pointer castings tailored to LLVM (link
>> <https://sf.snu.ac.kr/publications/llvmtwin.pdf>).
>> In this proposal, ptrtoint does not have an escaping side effect; 
>> ptrtoint
>> and inttoptr are scalar operations.
>> inttoptr simply returns a pointer which can access any object.
>>
>> Skimming your paper, I can see how this works *except* that I don’t
>> see any way not to treat ptrtoint as an escape. And really I think
>> you’re already partially acknowledging that, because that’s the 
>> only
>> real sense of saying that inttoptr(ptrtoint p) can’t be reduced to
>> p. If those are really just scalar operations that don’t expose
>> p in ways that might be disconnected from the uses of the inttoptr
>> then that reduction ought to be safe.
>>
>> John.
>>
> It again relies on nondeterminism of the address of an object.
>
> For example:
> p = call malloc(4) ; p's address is assumed to be nondeterministically
> chosen
> i = ptrtoint p
> ; i is never used
> ret void
>
> Since i is never used, this program cannot safely access p via address
> guessing + inttoptr.
> This allows (more precise) escape analysis to conclude that malloc 
> with one
> unused ptrtoint is never escaped.
I think this is too local of an analysis.

Your proposal generally relies on certain optimizations not applying
to pointers because they mess up provenance as represented in
transitive use-dependencies.  If those optimizations can be applied
to integers, you can lose use-dependencies in exactly the same way as
you can with pointers.  Not doing the `inttoptr(ptrtoint p)) -> p`
reduction doesn’t matter at all, because in either case that value
has a use-dependency on `p`, whereas the actual problem is that there
is no longer a use-dependency on some other value.

For example, you have compellingly argued that it’s problematic to
do the reduction `a == b ? a : b` to `b` for pointer types.  Suppose
I instead do this optimization on integers where `a = ptrtoint A`.
The result is now simply `b`.  If I `inttoptr` that result and access
the memory, there will be no record that that access may validly
be to `A`.  It does not help that the access may be represented
as `inttoptr (ptrtoint B)` for some `B` rather than just directly
to `B`, because there is no use-dependence on `A`.  All there is
is an apparently unrelated and unused `ptrtoint A`.

Obviously we can avoid doing this optimization locally when we
see that the inputs result from `ptrtoint`, but that’s no more
than best-effort: we can do this optimization in a function which
we later inline in a caller that performs all the `ptrtoint` and
`inttoptr` casts.

John.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210622/38a7ce16/attachment.html>

llvm dev - Jun 2021 - [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM