thr3ads.net - llvm dev - [llvm-dev] [RFC] Introducing a byte type to LLVM [Jun 2021]

If this information is useful, please help other people find it:
Share via:

John McCall via llvm-dev

2021-Jun-15 23:56 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

On 15 Jun 2021, at 13:29, Juneyoung Lee wrote:> On Tue, Jun 15, 2021 at 4:07 PM John McCall <rjmccall at apple.com> 
> wrote:
>> On 15 Jun 2021, at 1:49, Juneyoung Lee wrote:
>>
>> On Tue, Jun 15, 2021 at 1:08 AM John McCall via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> The semantics you seem to want are that LLVM’s integer types cannot 
>> carry
>> information from pointers. But I can cast a pointer to an integer in 
>> C and
>> vice-versa, and compilers have de facto defined the behavior of 
>> subsequent
>> operations like breaking the integer up (and then putting it back
>> together), adding numbers to it, and so on. So no, as a C compiler 
>> writer,
>> I do not have a choice; I will have to use a type that can validly 
>> carry
>> pointer information for integers in C.
>>
>> int->ptr cast can reconstruct the pointer information, so making 
>> integer
>> types not carry pointer information does not necessarily mean that
>> dereferencing a pointer casted from integer is UB.
>>
>> What exactly is the claimed formal property of byte types, then,
>> that integer types will lack? Because it seems to me that converting
>> from an integer gives us valid provenance in strictly more situations
>> than converting from bytes, since it reconstructs provenance if 
>> there’s
>> any object at that address (under still-debated restrictions),
>> while converting from bytes always preserves the original provenance
>> (if any). I don’t understand how that can possibly give us *more*
>> flexibility to optimize integers.
>>
> When two objects are adjacent, and an integer is exactly pointing to 
> the
> location between them, its provenance cannot be properly recovered.
>
> int x[1], y[1];
> llvm.assume((intptr_t)&x[0] == 0x100 && (intptr_t)&y[0] ==
0x104);
> int *p = (int*)(intptr_t)&x[1];
> // Q: Is p's provenance x or y?
>
> If it is expected that '*(p-1)' is equivalent to *x, p's
provenance
> should
> be x.
> However, based on llvm.assume, optimizations on integers can
> replace (intptr_t)&x[1] with (intptr_t)&y[0] (which is what
happened
> in the
> bug report).
> Then, '*(p-1)' suddenly becomes out-of-bounds access, which is UB.
> So, p's provenance isn't simply x or y; it should be something that
> can
> access both x and y.
This is something that the TR does not really have a firm grasp on yet.
> This implies that, unless there is a guarantee that all allocated 
> objects
> are one or more bytes apart, there is no type that can perfectly store 
> a
> pointer byte.
> memcpy(x, y, 8) isn't equivalent to 'v=load i64 y;store i64 v,
x'
> because v
> already lost the pointer information .
Okay, so let me try to restate and summarize all this.  I’ve CC’ed
a bunch of people back into this part of the thread.

C is moving towards a provenance model; you can find the details in
this committee TR that Joshua Cranmer linked:
   http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf

This TR is very clearly a work in progress and contains many
digressions and several possible sets of rules with different
implications.  I will try to briefly summarize.

Formally, all storage has a unique provenance: a notional unique
identifier for the specific allocation event of the storage, like a
local variable coming into scope, or a specific call to `malloc`.
Pointers formed to the storage formally carry that provenance (or
invalid provenance, or in some corner cases ambiguous provenance).
It is undefined behavior to do certain things with mismatched
provenances.  For example, it is undefined behavior to access
an object through a pointer whose provenance doesn’t match the
current provenance of the storage containing that object.  This is
a new way of formalizing the well-known rule that you can’t access
a local variable after it’s gone out of scope.

In the provenance model that looks most likely to be accepted, an
int-to-pointer cast blesses the resulting pointer with the current
provenance of the storage containing that address.  There does *not*
have to be any sort of data dependence between taking the address
of the storage and the integer that’s used in the cast; it can
simply be a lucky guess.  An integer could reasonably hold the
representation of any address in the program, and so an
int-to-pointer cast could bless its result with valid provenance
for any storage in memory.  But if the pointed-to memory is
subsequently deallocated, the provenance of the pointer will
become out of date.

Now, that rule as I’ve stated it would be really bad.  Allowing a
lucky guess to resolve to absolutely anything would almost
completely block the optimizer from optimizing memory.  For example,
if a local variable came into scope, and then we called a function
that returned a different pointer, we’d have to conservatively
assume that that pointer might alias the local, even if the address
of the local was never even taken, much less escaped:

```
   int x = 0;
   int *p = guess_address_of_x();
   *p = 15;
   printf(“%d\n”, x); // provably 0?
```

So the currently favored proposal adds a really important caveat:
this blessing of provenance only works when a pointer with the
correct provenance has been “exposed”.  There are several ways to
expose a pointer, including I/O, but the most important is casting
it to an integer.

Again, there’s no requirement of a data dependence between the
exposure and the int-to-pointer cast.  If the program casts a
pointer to an integer, and an independently-produced integer
that happens to be the same value is later cast to a pointer,
and the storage hasn’t been reallocated in the meantime, the
resulting pointer will have the right provenance for the memory
and will be valid to use.  This implies that pointer-to-int casts
(and other exposures) are semantically significant events in the
program.  They don’t have side effects in the normal sense, but
they must be treated by the compiler just like things that do have
side effects: e.g. unless I’m missing something in the TR,
eliminating a completely unused pointer-to-int cast may make
later code UB.

And in fact, it turns out that this is crucially important for
optimization.  If the optimizer wants to allow arbitrary
replacement of integers based on conditional equality, like
in GVN, then replacement totally breaks direct data dependence,
and you can easily be left with no remaining uses of a pointer-to-int
cast when the original code would have had a data dependence.  So
you cannot reason about provenance through int-to-pointer casts:
the pointer can alias any storage whose provenance has potentially
been exposed, and the optimizer must be conservative about optimizing
memory that has potentially been exposed.

Of course, a lot of this isn’t new.  If code passes a pointer to an
opaque function, then yeah, the optimizer has to assume that the
function might expose it by performing a pointer-to-int cast on it.
But that’s okay, because the optimizer already has to assume that
the function can escape the pointer in a more standard way.  Exposure
is just another form of escape.

Everything I’ve been talking about so far is a C-level concept:
an int-to-pointer cast is e.g. `(float*) myInt`, not `inttoptr`
in LLVM IR.  But I think people have an expectation of what these
things mean in LLVM IR, and I haven’t seen it written out explicitly,
so let’s do that now.

The first assumption here is that int-to-pointer and pointer-to-int
casts in C will translate to `inttoptr` and `ptrtoint` operations
in IR.  Now, this is already problematic, because those operations
do not currently have the semantics they need to have to make the
proposed optimization model sound.  In particular:

- `ptrtoint` does not have side-effects and can be dead-stripped
   when unused, which as discussed above is not okay.

- `ptrtoint` on a constant is folded to a constant expression,
   not an instruction, which is therefore no longer anchored in the
   code and does not reliably record that the global may have escaped.
   (Unused constant expressions do not really exist, and we absolutely
   cannot allow them to affect the semantics of the IR.)

   Of course, this is only significant for globals that don’t already
   have to be treated conservatively because they might have other
   uses.  That is, it only really matters for globals with, say,
   internal or private linkage.

- `inttoptr` can be reordered with other instructions, which is
   not allowed because different points in a function may have
   different sets of storage with escaped provenance.

- `inttoptr(ptrtoint)` can be peepholed; ignoring the dead-stripping
   aspects of removing the `inttoptr`, this also potentially
   introduces UB because the original `inttoptr` “launders” the
   provenance of the pointer to the current provenance of the
   storage, whereas the original pointer may have stale provenance.

There are probably other problems.  There are two ways that I see
of addressing all of this.

The first way is to take them on directly and change the core
semantic rules of LLVM IR.  I think this is a huge change, but
maybe it’s necessary in order to get the semantics we want.  We
will need to introduce some new IR features to make this possible,
both for performance and for expressivity.  For example, the way
we currently express relative pointers in constant initializers
involves `ptrtoint` and `sub` constant expressions, so if we
remove `ptrtoint` constants, we’ll need something else.  And
frontends will need to be much more cautious about not introducing
unnecessary int<->pointer casts because they’ll be heavily
pessimizing.

The second way is to say that `inttoptr` and `ptrtoint` are just
superficial type conversions and don’t affect provenance, then
add builtins that do affect provenance.  But this leaves us still
tracking provenance through integers, so any optimization that’s
not valid on pointers will also not be valid on integers.

All of this is not even considering the need for byte types yet.
Here’s the thinking behind byte types as I understand it.

The claim is that, to make all of the above work, we need
int<->pointer conversions to be explicit in IR.  If there’s some
other way of doing an int<->pointer conversion, we’re in trouble,
because maybe we won’t realize that the conversion has happened,
and so we won’t be appropriately conservative.  And unfortunately
C provides some other ways to do these conversions, which happen
to all go through memory: copying representations around, doing
aliased loads and stores, whatever.  So we need to have some way
to represent conversions that happen through these mechanisms.
And C says that these mechanisms don’t generally affect provenance,
so inserting `inttoptr`/`ptrtoint` casts is at the very least
imprecise because it launders provenance.  So what we want is a
type that maintains the original provenance, and therefore is
subject to the same optimization restrictions as pointers.

I don’t find either side of this argument very convincing.

First, the compiler already has to be very conservative about
memory.  If a pointer is stored into memory, we generally have
to treat it as having fully escaped: unless we can prove that the
memory isn’t loaded by other code, we have to assume that it is,
and that the loading code will do arbitrary other stuff.  That
could include doing a pointer-to-int cast and sharing the pointer
back to us as an integer.  Therefore, in the general case, our
ability to optimize when a pointer has escaped into memory is at
least as bad as if it had escaped via an int-to-pointer cast.  If
we *can* nonetheless optimize, because we can reason about some of
the uses together or prove that there aren’t any other uses,
then okay, maybe we see that there’s an int<->pointer conversion.
But translating this to `ptrtoint`/`inttoptr` should be, at
worst, overly conservative; it’s not unsound, for reasons I’m
about to get into.

Second, adding casts through an integer type is always valid.
Doing so might block the optimizer, but it doesn’t break semantics.
If I have a program that does e.g `*px = 15`, and I change it to
do `*(int*)(intptr_t)px = 15`, my program has become well-defined
in strictly more situations: in any case, there must be valid
storage at `px` for this not to be UB, but previously `px` might
have had the wrong provenance, and now it never does as long as
the provenance for that storage has previously escaped.

If we find that we’re generating too many unnecessary casts
through integer types and it’s really blocking the optimizer too
much, then we should consider the best solutions to those problems
as they arise.  It may be the case that we need a better solution
for type conversions introduced through manual memcpy-like code
so that we maintain the original provenance instead of adding
explicit int<->pointer conversions that launder provenance.
I don’t know that byte types are the best solution to that, but
we can consider them.

But this whole conversation about byte types seems to be coming
at it the wrong way around.  We need to be centering the first
set of problems around int<->pointer casts.

John.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210615/7f2b1a0b/attachment-0001.html>

Andrew Trick via llvm-dev

2021-Jun-19 05:21 UTC

head link

[llvm-dev] [RFC] Introducing a byte type to LLVM

> On Jun 15, 2021, at 4:56 PM, John McCall via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> If we find that we’re generating too many unnecessary casts
> through integer types and it’s really blocking the optimizer too
> much, then we should consider the best solutions to those problems
> as they arise. It may be the case that we need a better solution
> for type conversions introduced through manual memcpy-like code
> so that we maintain the original provenance instead of adding
> explicit int<->pointer conversions that launder provenance.
> I don’t know that byte types are the best solution to that, but
> we can consider them.
We can't retroactively weaken inttoptr semantics relative to what frontends
rely on--it should continue to have the semantics of a pointer cast in C, which
need to be handled more conservatively in LLVM.

We can easily express the weaker memcpy semantics as "!raw" metadata
on the load and inttoptr instructions. (I apologize if that was already
mentioned in this thread.)

Introducing a new type to communicate pointer semantics seems inconsistent. LLVM
types primarily communicate ABI requirements from the frontend to the target.
The semantics of pointer loads, and integer conversions can be specified on
those operations and generated when the compiler emits byte-wise copies as
opposed to semantic casts. We can expose a clang attribute or builtin for manual
pointer copying.

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210618/da30636d/attachment.html>

Juneyoung Lee via llvm-dev

2021-Jun-21 06:15 UTC

head link

[llvm-dev] [RFC] Introducing a byte type to LLVM

Hi,
Sorry for my late reply, and thank you for sharing great summaries & ideas.
I'll leave my thoughts below.

On Wed, Jun 16, 2021 at 8:56 AM John McCall <rjmccall at apple.com> wrote:
> Okay, so let me try to restate and summarize all this. I’ve CC’ed
>
> a bunch of people back into this part of the thread.
>
> C is moving towards a provenance model; you can find the details in
> this committee TR that Joshua Cranmer linked:
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf
>
> This TR is very clearly a work in progress and contains many
> digressions and several possible sets of rules with different
> implications. I will try to briefly summarize.
>
> <snip>
>
> Now, that rule as I’ve stated it would be really bad. Allowing a
> lucky guess to resolve to absolutely anything would almost
> completely block the optimizer from optimizing memory. For example,
> if a local variable came into scope, and then we called a function
> that returned a different pointer, we’d have to conservatively
> assume that that pointer might alias the local, even if the address
> of the local was never even taken, much less escaped:
>
>   int x = 0;
>   int *p = guess_address_of_x();
>   *p = 15;
>   printf(“%d\n”, x); // provably 0?
>
> So the currently favored proposal adds a really important caveat:
> this blessing of provenance only works when a pointer with the
> correct provenance has been “exposed”. There are several ways to
> expose a pointer, including I/O, but the most important is casting
> it to an integer.
>This is a valid point. If one wants to formally show the correctness of
this kind of memory optimization this problem should be tackled.
I think n2676's 'Allocation-address nondeterminism' (p. 27)
paragraph
addresses this issue.
The underlying idea is that the address of an allocated object is assumed
to be non-deterministically chosen, causing any guessed accesses to raise
undefined behavior in at least one execution.
> Again, there’s no requirement of a data dependence between the
> exposure and the int-to-pointer cast. If the program casts a
> pointer to an integer, and an independently-produced integer
> that happens to be the same value is later cast to a pointer,
> and the storage hasn’t been reallocated in the meantime, the
> resulting pointer will have the right provenance for the memory
> and will be valid to use. This implies that pointer-to-int casts
> (and other exposures) are semantically significant events in the
> program. They don’t have side effects in the normal sense, but
> they must be treated by the compiler just like things that do have
> side effects: e.g. unless I’m missing something in the TR,
> eliminating a completely unused pointer-to-int cast may make
> later code UB.
>
> And in fact, it turns out that this is crucially important for
> optimization. If the optimizer wants to allow arbitrary
> replacement of integers based on conditional equality, like
> in GVN, then replacement totally breaks direct data dependence,
> and you can easily be left with no remaining uses of a pointer-to-int
> cast when the original code would have had a data dependence. So
> you cannot reason about provenance through int-to-pointer casts:
> the pointer can alias any storage whose provenance has potentially
> been exposed, and the optimizer must be conservative about optimizing
> memory that has potentially been exposed.
>+1, due to this reason the casting semantics cannot be directly used for
LLVM's ptrtoint/inttoptr.
> <snip>
>
> Everything I’ve been talking about so far is a C-level concept:
> an int-to-pointer cast is e.g. (float*) myInt, not inttoptr
> in LLVM IR. But I think people have an expectation of what these
> things mean in LLVM IR, and I haven’t seen it written out explicitly,
> so let’s do that now.
>
> The first assumption here is that int-to-pointer and pointer-to-int
> casts in C will translate to inttoptr and ptrtoint operations
> in IR. Now, this is already problematic, because those operations
> do not currently have the semantics they need to have to make the
> proposed optimization model sound. In particular:
>
>    -
>
>    ptrtoint does not have side-effects and can be dead-stripped
>    when unused, which as discussed above is not okay.
>    -
>
>    ptrtoint on a constant is folded to a constant expression,
>    not an instruction, which is therefore no longer anchored in the
>    code and does not reliably record that the global may have escaped.
>    (Unused constant expressions do not really exist, and we absolutely
>    cannot allow them to affect the semantics of the IR.)
>
>    Of course, this is only significant for globals that don’t already
>    have to be treated conservatively because they might have other
>    uses. That is, it only really matters for globals with, say,
>    internal or private linkage.
>    -
>
>    inttoptr can be reordered with other instructions, which is
>    not allowed because different points in a function may have
>    different sets of storage with escaped provenance.
>    -
>
>    inttoptr(ptrtoint) can be peepholed; ignoring the dead-stripping
>    aspects of removing the inttoptr, this also potentially
>    introduces UB because the original inttoptr “launders” the
>    provenance of the pointer to the current provenance of the
>    storage, whereas the original pointer may have stale provenance.
>
> All of these concerns are valid.
(I'm not sure whether this is a good place to introduce this, but) we
actually have semantics for pointer castings tailored to LLVM (link
<https://sf.snu.ac.kr/publications/llvmtwin.pdf>).
In this proposal, ptrtoint does not have an escaping side effect; ptrtoint
and inttoptr are scalar operations.
inttoptr simply returns a pointer which can access any object.
Combined with the address nondeterminism that is described above, unescaped
objects can be effectively left untouched from other memory accesses.

Also, the following optimizations can be explained:
- The aliasing property of 'gep inbounds p' is supported: dereferencing
'gep inbounds p, 1' must raise UB if either p or p+1 isn't in bounds
of p's
object (provenance)
- 'gep (inttoptr i), idx' is equivalent to 'i + idx' (same value
and same
level of undefinedness)
- gep and gep inbounds is a scalar operation (can be freely reordered w.r.t
ptrtoint/inttoptr/lifetime/free/...)
- gep's natural properties are supported: stripping off inbounds tag,
'gep
(gep (inttoptr i), i1), i2' -> 'gep (inttoptr i), i1+i2'

There are a few transformations that become hard to explain, but perhaps
the biggest one is 'inttoptr(ptrtoint p)' -> p.
> <snip>
>
> I don’t find either side of this argument very convincing.
>
> First, the compiler already has to be very conservative about
> memory. If a pointer is stored into memory, we generally have
> to treat it as having fully escaped: unless we can prove that the
> memory isn’t loaded by other code, we have to assume that it is,
> and that the loading code will do arbitrary other stuff. That
> could include doing a pointer-to-int cast and sharing the pointer
> back to us as an integer. Therefore, in the general case, our
> ability to optimize when a pointer has escaped into memory is at
> least as bad as if it had escaped via an int-to-pointer cast. If
> we *can* nonetheless optimize, because we can reason about some of
> the uses together or prove that there aren’t any other uses,
> then okay, maybe we see that there’s an int<->pointer conversion.
> But translating this to ptrtoint/inttoptr should be, at
> worst, overly conservative; it’s not unsound, for reasons I’m
> about to get into.
>
> Second, adding casts through an integer type is always valid.
> Doing so might block the optimizer, but it doesn’t break semantics.
> If I have a program that does e.g *px = 15, and I change it to
> do *(int*)(intptr_t)px = 15, my program has become well-defined
> in strictly more situations: in any case, there must be valid
> storage at px for this not to be UB, but previously px might
> have had the wrong provenance, and now it never does as long as
> the provenance for that storage has previously escaped.
>I agree. Transforming 'p' into 'inttoptr(ptrtoint p)' should not
make the
program undefined, and it can be used to address the correctness issue of
these kinds of problems.
> If we find that we’re generating too many unnecessary casts
> through integer types and it’s really blocking the optimizer too
> much, then we should consider the best solutions to those problems
> as they arise. It may be the case that we need a better solution
> for type conversions introduced through manual memcpy-like code
> so that we maintain the original provenance instead of adding
> explicit int<->pointer conversions that launder provenance.
> I don’t know that byte types are the best solution to that, but
> we can consider them.
>
> But this whole conversation about byte types seems to be coming
> at it the wrong way around. We need to be centering the first
> set of problems around int<->pointer casts.
>
John.>As the first step, I made a patch to LangRef for differentiation of int and
ptr: https://reviews.llvm.org/D104013 . It is currently under review.
Maybe we can pursue two-track:
(1) gradually disabling the 'inttoptr(ptrtoint p) -> p' folding.
- For example, to fix https://bugs.llvm.org/show_bug.cgi?id=34548,
disabling it when p's underlying object is alloca would be enough (I guess).
(2) inspecting how byte types can help revive optimizations.
- For example, loop idiom recognition on memcpy-like loops should be
removed otherwise. Its performance impact should be checked.
>
>
>Thanks,
Juneyoung
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210621/91a8e7be/attachment.html>

llvm dev - Jun 2021 - [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM