thr3ads.net - llvm dev - [llvm-dev] [RFC] Introducing a byte type to LLVM [Jun 2021]

If this information is useful, please help other people find it:
Share via:

John McCall via llvm-dev

2021-Jun-21 19:07 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

On 21 Jun 2021, at 2:15, Juneyoung Lee wrote:> Hi,
> Sorry for my late reply, and thank you for sharing great summaries &
ideas.
> I'll leave my thoughts below.
>
> On Wed, Jun 16, 2021 at 8:56 AM John McCall <rjmccall at apple.com>
wrote:
>> Now, that rule as I’ve stated it would be really bad. Allowing a
>> lucky guess to resolve to absolutely anything would almost
>> completely block the optimizer from optimizing memory. For example,
>> if a local variable came into scope, and then we called a function
>> that returned a different pointer, we’d have to conservatively
>> assume that that pointer might alias the local, even if the address
>> of the local was never even taken, much less escaped:
>>
>>   int x = 0;
>>   int *p = guess_address_of_x();
>>   *p = 15;
>>   printf(“%d\n”, x); // provably 0?
>>
>> So the currently favored proposal adds a really important caveat:
>> this blessing of provenance only works when a pointer with the
>> correct provenance has been “exposed”. There are several ways to
>> expose a pointer, including I/O, but the most important is casting
>> it to an integer.
>>
> This is a valid point. If one wants to formally show the correctness of
> this kind of memory optimization this problem should be tackled.
> I think n2676's 'Allocation-address nondeterminism' (p. 27)
paragraph
> addresses this issue.
> The underlying idea is that the address of an allocated object is assumed
> to be non-deterministically chosen, causing any guessed accesses to raise
> undefined behavior in at least one execution.
Ah, that’s an interesting idea.  I must have missed that section; I’m
afraid I only skimmed N2676 looking for the key points, because it’s
quite a long document.

To be clear, under the `PVNI-ae` family of semantics, that rule would
not be necessary to optimize `x` in my example because int-to-pointer
casts are not allowed to recreate provenance for `x` because it has
not been exposed.  That rule does theoretically allow optimization of
some related examples where the address of `x` *has* been exposed,
because it lets the compiler try to reason about what happens after
exposure; it’s no longer true that exposure implies guessing are okay.

Unfortunately, though, I this non-determinism still doesn’t allow LLVM
to be anywhere near as naive about pointer-to-int casts as it is today.
The rule is intended to allow the compiler to start doing use-analysis
of exposures; let’s assume that this analysis doesn’t see any
un-analyzable uses, since of course it would need to conservatively
treat them as escapes.  But if we can optimize uses of integers as if
they didn’t carry pointer data — say, in a function that takes integer
parameters — and then we can apply those optimized uses to integers
that concretely result from pointer-to-int casts — say, by inlining
that function into one of its callers — can’t we end up with a use
pattern for one or more of those pointer-to-int casts that no longer
reflects the fact that it’s been exposed?  It seems to me that either
(1) we cannot do those optimizations on opaque integers or (2) we
need to record that we did them in a way that, if it turns out that
they were created by a pointer-to-int casts, forces other code to
treat that pointer as opaquely exposed.
>> <snip>
>>
>> Everything I’ve been talking about so far is a C-level concept:
>> an int-to-pointer cast is e.g. (float*) myInt, not inttoptr
>> in LLVM IR. But I think people have an expectation of what these
>> things mean in LLVM IR, and I haven’t seen it written out explicitly,
>> so let’s do that now.
>>
>> The first assumption here is that int-to-pointer and pointer-to-int
>> casts in C will translate to inttoptr and ptrtoint operations
>> in IR. Now, this is already problematic, because those operations
>> do not currently have the semantics they need to have to make the
>> proposed optimization model sound. In particular:
>>
>>    -
>>
>>    ptrtoint does not have side-effects and can be dead-stripped
>>    when unused, which as discussed above is not okay.
>>    -
>>
>>    ptrtoint on a constant is folded to a constant expression,
>>    not an instruction, which is therefore no longer anchored in the
>>    code and does not reliably record that the global may have escaped.
>>    (Unused constant expressions do not really exist, and we absolutely
>>    cannot allow them to affect the semantics of the IR.)
>>
>>    Of course, this is only significant for globals that don’t already
>>    have to be treated conservatively because they might have other
>>    uses. That is, it only really matters for globals with, say,
>>    internal or private linkage.
>>    -
>>
>>    inttoptr can be reordered with other instructions, which is
>>    not allowed because different points in a function may have
>>    different sets of storage with escaped provenance.
>>    -
>>
>>    inttoptr(ptrtoint) can be peepholed; ignoring the dead-stripping
>>    aspects of removing the inttoptr, this also potentially
>>    introduces UB because the original inttoptr “launders” the
>>    provenance of the pointer to the current provenance of the
>>    storage, whereas the original pointer may have stale provenance.
>>
>> All of these concerns are valid.
>
> (I'm not sure whether this is a good place to introduce this, but) we
> actually have semantics for pointer castings tailored to LLVM (link
> <https://sf.snu.ac.kr/publications/llvmtwin.pdf>).
> In this proposal, ptrtoint does not have an escaping side effect; ptrtoint
> and inttoptr are scalar operations.
> inttoptr simply returns a pointer which can access any object.
Skimming your paper, I can see how this works *except* that I don’t
see any way not to treat `ptrtoint` as an escape.  And really I think
you’re already partially acknowledging that, because that’s the only
real sense of saying that `inttoptr(ptrtoint p)` can’t be reduced to
`p`.  If those are really just scalar operations that don’t expose
`p` in ways that might be disconnected from the uses of the `inttoptr`
then that reduction ought to be safe.

John.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210621/5ce12b62/attachment.html>

Nicolai Hähnle via llvm-dev

2021-Jun-22 07:14 UTC

head link

[llvm-dev] [RFC] Introducing a byte type to LLVM

On Mon, Jun 21, 2021 at 9:08 PM John McCall via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On 21 Jun 2021, at 2:15, Juneyoung Lee wrote:
>
> actually have semantics for pointer castings tailored to LLVM (link
> <https://sf.snu.ac.kr/publications/llvmtwin.pdf>).
> In this proposal, ptrtoint does not have an escaping side effect; ptrtoint
> and inttoptr are scalar operations.
> inttoptr simply returns a pointer which can access any object.
>
> Skimming your paper, I can see how this works *except* that I don’t
> see any way not to treat ptrtoint as an escape. And really I think
> you’re already partially acknowledging that, because that’s the only
> real sense of saying that inttoptr(ptrtoint p) can’t be reduced to
> p. If those are really just scalar operations that don’t expose
> p in ways that might be disconnected from the uses of the inttoptr
> then that reduction ought to be safe.
>
I haven't read the paper yet, but I think the argument is: Assume p was
obtained via an out-of-bounds GEP that happens to point at valid memory
that belongs to a different object than p's provenance. In that case,
dereferencing inttoptr(ptrtoint(p)) is defined behavior ("inttoptr simply
returns a pointer which can access any object") while dereferencing p is
UB.

Cheers,
Nicolai


> John.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

-- 
Lerne, wie die Welt wirklich ist,
aber vergiss niemals, wie sie sein sollte.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210622/e05188d8/attachment.html>

Ralf Jung via llvm-dev

2021-Jun-22 09:58 UTC

head link

[llvm-dev] [RFC] Introducing a byte type to LLVM

Hi John,
> Unfortunately, though, I this non-determinism still doesn’t allow LLVM
> to be anywhere near as naive about pointer-to-int casts as it is today.
Definitely. There are limits to how naive one can be; beyond those limits, 
miscompilations lurk.
<https://www.ralfj.de/blog/2020/12/14/provenance.html>
explains this by showing such a miscompilation arising from three naive 
optimizations being chained together.
> The rule is intended to allow the compiler to start doing use-analysis
> of exposures; let’s assume that this analysis doesn’t see any
> un-analyzable uses, since of course it would need to conservatively
> treat them as escapes. But if we can optimize uses of integers as if
> they didn’t carry pointer data — say, in a function that takes integer
> parameters — and then we can apply those optimized uses to integers
> that concretely result from pointer-to-int casts — say, by inlining
> that function into one of its callers — can’t we end up with a use
> pattern for one or more of those pointer-to-int casts that no longer
> reflects the fact that it’s been exposed? It seems to me that either
> (1) we cannot do those optimizations on opaque integers or (2) we
> need to record that we did them in a way that, if it turns out that
> they were created by a pointer-to-int casts, forces other code to
> treat that pointer as opaquely exposed.
There is a third option: don't optimize away ptr-int-ptr roundtrips. Then
you
can still do all the same optimizations on integers that LLVM does today, 
completely naively -- the integer world remains "sane". Only the
pointer world
has to be "strange".
(You can also not do things like GVN replacement of *pointer-typed* values, but 
for values of integer types this remains unproblematic.)

I don't think it makes sense for LLVM to adopt an explicit
"exposed" flag in its
semantics. Reasoning based on non-determinism works fine, and has the advantage 
of keeping ptr-to-int casts a pure, side-effect-free operation. This is the 
model we explored in
<https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf>, and
we were able to show quite a few of LLVM's standard optimizations correct 
formally. Some changes are still needed as you noted, but those changes will be 
required anyway even if LLVM were to adopt PNVI-ae:
- No removal of ptr-int-ptr roundtrips. 
(https://bugs.llvm.org/show_bug.cgi?id=34548)
- No GVN replacement of pointer-typed values. 
(https://bugs.llvm.org/show_bug.cgi?id=35229)
>     (I'm not sure whether this is a good place to introduce this, but)
we
>     actually have semantics for pointer castings tailored to LLVM (link
>     <https://sf.snu.ac.kr/publications/llvmtwin.pdf
>     <https://sf.snu.ac.kr/publications/llvmtwin.pdf>>).
>     In this proposal, ptrtoint does not have an escaping side effect;
ptrtoint
>     and inttoptr are scalar operations.
>     inttoptr simply returns a pointer which can access any object.
> 
> Skimming your paper, I can see how this works /except/ that I don’t
> see any way not to treat |ptrtoint| as an escape. And really I think
> you’re already partially acknowledging that, because that’s the only
> real sense of saying that |inttoptr(ptrtoint p)| can’t be reduced to
> |p|. If those are really just scalar operations that don’t expose
> |p| in ways that might be disconnected from the uses of the |inttoptr|
> then that reduction ought to be safe.
They are indeed just scalar operations, but the reduction is not safe.
The reason is that pointer-typed variables have values of the form "(addr, 
provenance)". There is essentially an 'invisible' component in each
pointer
value that tracks some additional information -- the "provenance" of
the
pointer. Casting a ptr to an int removes that provenance. Casting an int to a 
ptr picks a "default" provenance. So the overall effect of
inttoptr(ptrtoint p)
is to turn "(addr, provenance)" into "(addr,
DEFAULT_PROVENANCE)".
Clearly that is *not* a NOP, and hence performing the reduction actually changes
the result of this operation. Before the reduction, the resulting pointer had 
DEFAULT_PROVENANCE; after the reduction, it maintains the original provenance of
"p". This can introduce UB into previously UB-free programs.

Kind regards,
Ralf

Juneyoung Lee via llvm-dev

2021-Jun-22 16:02 UTC

head link

[llvm-dev] [RFC] Introducing a byte type to LLVM

On Tue, Jun 22, 2021 at 4:08 AM John McCall <rjmccall at apple.com> wrote:
> On 21 Jun 2021, at 2:15, Juneyoung Lee wrote:
>
> Hi,
> Sorry for my late reply, and thank you for sharing great summaries &
ideas.
> I'll leave my thoughts below.
>
> On Wed, Jun 16, 2021 at 8:56 AM John McCall <rjmccall at apple.com>
wrote:
>
> Now, that rule as I’ve stated it would be really bad. Allowing a
> lucky guess to resolve to absolutely anything would almost
> completely block the optimizer from optimizing memory. For example,
> if a local variable came into scope, and then we called a function
> that returned a different pointer, we’d have to conservatively
> assume that that pointer might alias the local, even if the address
> of the local was never even taken, much less escaped:
>
> int x = 0;
> int *p = guess_address_of_x();
> *p = 15;
> printf(“%d\n”, x); // provably 0?
>
> So the currently favored proposal adds a really important caveat:
> this blessing of provenance only works when a pointer with the
> correct provenance has been “exposed”. There are several ways to
> expose a pointer, including I/O, but the most important is casting
> it to an integer.
>
> This is a valid point. If one wants to formally show the correctness of
> this kind of memory optimization this problem should be tackled.
> I think n2676's 'Allocation-address nondeterminism' (p. 27)
paragraph
> addresses this issue.
> The underlying idea is that the address of an allocated object is assumed
> to be non-deterministically chosen, causing any guessed accesses to raise
> undefined behavior in at least one execution.
>
> Ah, that’s an interesting idea. I must have missed that section; I’m
> afraid I only skimmed N2676 looking for the key points, because it’s
> quite a long document.
>
> To be clear, under the PVNI-ae family of semantics, that rule would
> not be necessary to optimize x in my example because int-to-pointer
> casts are not allowed to recreate provenance for x because it has
> not been exposed. That rule does theoretically allow optimization of
> some related examples where the address of x *has* been exposed,
> because it lets the compiler try to reason about what happens after
> exposure; it’s no longer true that exposure implies guessing are okay.
>
> Unfortunately, though, I this non-determinism still doesn’t allow LLVM
> to be anywhere near as naive about pointer-to-int casts as it is today.
> The rule is intended to allow the compiler to start doing use-analysis
> of exposures; let’s assume that this analysis doesn’t see any
> un-analyzable uses, since of course it would need to conservatively
> treat them as escapes. But if we can optimize uses of integers as if
> they didn’t carry pointer data — say, in a function that takes integer
> parameters — and then we can apply those optimized uses to integers
> that concretely result from pointer-to-int casts — say, by inlining
> that function into one of its callers — can’t we end up with a use
> pattern for one or more of those pointer-to-int casts that no longer
> reflects the fact that it’s been exposed? It seems to me that either
> (1) we cannot do those optimizations on opaque integers or (2) we
> need to record that we did them in a way that, if it turns out that
> they were created by a pointer-to-int casts, forces other code to
> treat that pointer as opaquely exposed.
>IIUC the corresponding example would be:
int p;
intptr_t i = (intptr_t)&p;
f(i);
// Initially f(i) exposes p by storing i into somewhere, but optimization
changed f to not store i anymore (e.g. i was replaced with a constant)

In this case, p was already exposed by '(intptr_t)p' I guess; the body
of
f(i) won't affect the exposedness of p.
> <snip>
>
> Everything I’ve been talking about so far is a C-level concept:
> an int-to-pointer cast is e.g. (float*) myInt, not inttoptr
> in LLVM IR. But I think people have an expectation of what these
> things mean in LLVM IR, and I haven’t seen it written out explicitly,
> so let’s do that now.
>
> The first assumption here is that int-to-pointer and pointer-to-int
> casts in C will translate to inttoptr and ptrtoint operations
> in IR. Now, this is already problematic, because those operations
> do not currently have the semantics they need to have to make the
> proposed optimization model sound. In particular:
>
> -
>
> ptrtoint does not have side-effects and can be dead-stripped
> when unused, which as discussed above is not okay.
> -
>
> ptrtoint on a constant is folded to a constant expression,
> not an instruction, which is therefore no longer anchored in the
> code and does not reliably record that the global may have escaped.
> (Unused constant expressions do not really exist, and we absolutely
> cannot allow them to affect the semantics of the IR.)
>
> Of course, this is only significant for globals that don’t already
> have to be treated conservatively because they might have other
> uses. That is, it only really matters for globals with, say,
> internal or private linkage.
> -
>
> inttoptr can be reordered with other instructions, which is
> not allowed because different points in a function may have
> different sets of storage with escaped provenance.
> -
>
> inttoptr(ptrtoint) can be peepholed; ignoring the dead-stripping
> aspects of removing the inttoptr, this also potentially
> introduces UB because the original inttoptr “launders” the
> provenance of the pointer to the current provenance of the
> storage, whereas the original pointer may have stale provenance.
>
> All of these concerns are valid.
>
> (I'm not sure whether this is a good place to introduce this, but) we
> actually have semantics for pointer castings tailored to LLVM (link
> <https://sf.snu.ac.kr/publications/llvmtwin.pdf>).
> In this proposal, ptrtoint does not have an escaping side effect; ptrtoint
> and inttoptr are scalar operations.
> inttoptr simply returns a pointer which can access any object.
>
> Skimming your paper, I can see how this works *except* that I don’t
> see any way not to treat ptrtoint as an escape. And really I think
> you’re already partially acknowledging that, because that’s the only
> real sense of saying that inttoptr(ptrtoint p) can’t be reduced to
> p. If those are really just scalar operations that don’t expose
> p in ways that might be disconnected from the uses of the inttoptr
> then that reduction ought to be safe.
>
> John.
>It again relies on nondeterminism of the address of an object.

For example:
p = call malloc(4) ; p's address is assumed to be nondeterministically
chosen
i = ptrtoint p
; i is never used
ret void

Since i is never used, this program cannot safely access p via address
guessing + inttoptr.
This allows (more precise) escape analysis to conclude that malloc with one
unused ptrtoint is never escaped.

Juneyoung
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210623/b19a6051/attachment.html>

llvm dev - Jun 2021 - [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM