thr3ads.net - llvm dev - [llvm-dev] [cfe-dev] Demystifying the byte type [Oct 2021]

George Mitenkov via llvm-dev

2021-Oct-15 18:41 UTC

[llvm-dev] Demystifying the byte type

Hi all,

In May 2021, together with Nuno Lopes and Juneyoung Lee, we proposed to add
a byte type in LLVM to fix load type punning issues. Initial RFC touched
some subtle aspects of LLVM IR and its semantics, and sparked a lot of
questions, concerns, and discussions.

We decided to write a post that would summarise the thread and the
complicated topic:

https://gist.github.com/georgemitenkov/3def898b8845c2cc161bd216cbbdb81f

We hope that our post clarifies initial concerns raised on the mailing
list. As always, any questions, suggestions and advice are welcome!

Thanks,
George
James Dutton via llvm-dev

2021-Oct-16 10:33 UTC

[llvm-dev] [cfe-dev] Demystifying the byte type

Hi,

Thank you for the description of the problem.
I think it would be helpful to also document which CPUs you were
considering in relation to the behaviour.
I can see that the description would hold for x86 (32-bit) and amd64.
However, there are CPUs that have special instructions for pointer
manipulation.
You might know that the CPU type has no bearing on the discussion, in
which case it would be helpful to add a paragraph or two explaining
that.

Kind Regards

James

James Dutton via llvm-dev

2021-Oct-16 11:24 UTC

[llvm-dev] [cfe-dev] Demystifying the byte type

Hi,

The gist post seems to imply that memory needs to be typed.
If what you describe is correct, doesn't that imply that the opaque
pointer work is a fool's errand?
That is, if memory needs to be typed, surely pointers to that memory also
need to be typed?

Kind Regards

James


Johannes Doerfert via llvm-dev

2021-Oct-18 05:31 UTC

[llvm-dev] [cfe-dev] Demystifying the byte type

Hi George,

I only made it through part 1 for now but I figured I might forget if I 
don't reply directly:

 > Under the untyped memory model, we need to accept that every load/store
 > has an implicit ptrtoint/inttoptr attached to it.

This is stated, but I don't see it. Rather, a store of a pointer makes the
pointer potentially escape (also see below). Escaped pointers could show up
as integers (among other things). So escaping a pointer (by any means) does
an implicit ptrtoint/inttoptr, but the store or the load does not
necessarily do so. The transformation shown below that statement doesn't
contradict this view, and SROA is still legal. All that happened is that
SROA first determined, and then made explicit, that the pointer (here %in)
did not escape during the round trip through %mem. If, for example, %mem had
been passed to an unknown function, SROA would not have done this
transformation, because %in could then have escaped through %mem. If %mem
had been cast to int and then loaded, SROA would have made the escaping use
explicit through a ptrtoint: https://godbolt.org/z/PG3fj7qe4
Long story short, if you store a pointer (or cast it to an integer, or
compare it in anything other than a few special ways), it might escape, and
since anything could happen to it, it loses its provenance. If you can show
it doesn't escape, no provenance is lost.
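
As a rough C-level analogue of the two cases above (a sketch only, not the
IR from the gist): the function names and the `unknown_fn` callback type are
hypothetical, and what an optimizer actually does depends on the pass
pipeline. The first function keeps the pointer in a slot nobody else can
see; the second hands the slot to a callee the optimizer cannot inspect.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an "unknown function": the optimizer cannot see
   what the callee does with the slot it receives. */
typedef void (*unknown_fn)(int **slot);

/* Case 1: the pointer only makes a round trip through a local slot.
   The slot's address is never handed to anyone, so `in` cannot escape
   through it and the load can be forwarded from the store. */
int *round_trip_local(int *in) {
    int *mem = in;   /* store the pointer */
    return mem;      /* load it back: provably the same pointer */
}

/* Case 2: the slot is passed to an unknown callee before the load.
   The callee may read the pointer out of the slot or overwrite it, so the
   pointer must be treated as escaped and the load must stay a load. */
int *round_trip_escaping(int *in, unknown_fn sink) {
    assert(sink != NULL);
    int *mem = in;
    sink(&mem);
    return mem;
}

/* Example callee (also hypothetical) that replaces the stored pointer. */
int example_replacement = 7;
void example_redirect(int **slot) { *slot = &example_replacement; }
```

With `example_redirect` as the callee, the second function returns a
different pointer than it was given, which is why the load cannot be
replaced by `in` there.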

 > An alternative is to say that all pointer stores escape, which again 
has severe performance consequences and again do not align with all LLVM 
optimizations.

What optimizations do not treat a pointer stored away as an escaping 
use? That sounds like a problem.
[FWIW, I'm only aware of the Attributor but it ensures that all uses of 
the store are instead visited which makes this sound again (no escape 
can happen through the store).]

~ Johannes

Michael Kruse via llvm-dev

2021-Oct-19 04:59 UTC

[llvm-dev] [cfe-dev] Demystifying the byte type

From the linked document:
> Solution 3: Annotations and tags
> LLVM optimizers work with the assumption that attributes can be discarded
> if the optimizer does not know how to handle them.

I don't think this is necessarily the case. Such attributes can be
designed such that a missing attribute represents the most
conservative assumption, like the `mustprogress` attribute/metadata. That
is, a missing annotation has an implicit provenance of {all}. GVN can fold q
and p after `if (q == p)`, with the new provenance being the union of q's
and p's provenance, like a PHINode. In other models, p and q cannot be
folded or, in the case of the proposed byte type, cannot carry
provenance information.
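
To make the "missing annotation means {all}" idea concrete, here is a toy
model in C. It is only a sketch under my own assumptions: provenance is
encoded as a bitmask over allocation ids, and the names `annotated_ptr`,
`drop_annotation` and `fold_equal` are made up for illustration; none of
this is an LLVM API.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: provenance is a bitmask over allocation ids (bit i set means
   "may be derived from allocation i"). PROV_ALL models a missing annotation. */
typedef uint32_t prov_set;
#define PROV_ALL ((prov_set)0xFFFFFFFFu)

typedef struct {
    uintptr_t addr;  /* numeric address */
    prov_set  prov;  /* annotated provenance; PROV_ALL if it was dropped */
} annotated_ptr;

/* A pass that does not understand the annotation drops it. The result is
   the most conservative provenance, so dropping is always sound. */
annotated_ptr drop_annotation(annotated_ptr p) {
    p.prov = PROV_ALL;
    return p;
}

/* Folding q into p after `if (q == p)`: the addresses are known to be equal,
   and the merged value carries the union of both provenances, like a PHI. */
annotated_ptr fold_equal(annotated_ptr p, annotated_ptr q) {
    assert(p.addr == q.addr);
    annotated_ptr merged = { p.addr, p.prov | q.prov };
    return merged;
}
```

In this model a dropped annotation can only widen the provenance set, never
narrow it, so a pass that ignores the annotation loses precision but not
correctness.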

> High engineering effort to enforce that attributes are preserved in every
> transformation and used by analyses.

IMHO, it is still lower than introducing a new first-class type.


Michael

David Chisnall via llvm-dev

2021-Oct-19 09:56 UTC

[llvm-dev] Demystifying the byte type

Hi George,

Thank you for the writeup.  I think a big part of the problem in 
understanding this comes from the name of the type.  On 
provenance-carrying architectures (such as CHERI systems, including 
Arm's Morello[1]), it is unsound to copy a pointer as bytes.  Pointers 
must be copied by provenance-carrying operations.  The hardware splits 
registers into ones that don't carry provenance (integer, 
floating-point, vector) and ones that do but which can *also* be used to 
copy non-pointer data (capabilities).

On a CHERI system, ptrtoint does not confer provenance, and inttoptr on
the result may yield either an invalid pointer or a pointer with larger
bounds, depending on the environment.  This reflects the machine
semantics: converting a pointer to an integer simply extracts the address
(on Morello, the address is exposed as a subregister of the capability
register).  Converting in the opposite direction inserts the address into
the capability held in the default data capability register (which in the
pure-capability ABI is typically not a valid capability and so yields an
invalid pointer, and in the hybrid ABI refers to the part of the address
space used for legacy code).
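
The conversion semantics described above can be sketched with a toy model
in C. This is an illustration under simplifying assumptions, not the real
CHERI encoding or any CHERI API: the `toy_cap` struct and all `toy_*`
functions are hypothetical, and the default data capability is passed in
explicitly instead of living in a register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy capability: an address plus bounds and a validity tag. */
typedef struct {
    uint64_t addr;
    uint64_t base;
    uint64_t length;
    bool     tag;     /* true only for validly derived capabilities */
} toy_cap;

/* Create a valid capability covering [base, base + length). */
toy_cap toy_make(uint64_t base, uint64_t length) {
    assert(length > 0);
    toy_cap c = { base, base, length, true };
    return c;
}

/* ptrtoint: extract the address; bounds and tag are left behind. */
uint64_t toy_ptrtoint(toy_cap c) { return c.addr; }

/* inttoptr: insert the address into the default data capability (DDC).
   The result inherits the DDC's bounds and tag, not the original pointer's. */
toy_cap toy_inttoptr(uint64_t addr, toy_cap ddc) {
    toy_cap result = ddc;
    result.addr = addr;
    return result;
}

/* An access is permitted only with a valid tag and an in-bounds address. */
bool toy_can_access(toy_cap c) {
    return c.tag && c.addr >= c.base && c.addr - c.base < c.length;
}
```

With an untagged DDC (the pure-capability case) the round trip yields a
capability that cannot be dereferenced; with a tagged DDC spanning a large
region (the hybrid case) it yields one whose bounds are wider than the
original object's.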

I think that all of this is fairly aligned with your byte type.

David

[1] 
https://developer.arm.com/architectures/cpu-architecture/a-profile/morello

Cranmer, Joshua via llvm-dev

2021-Oct-19 20:43 UTC

[llvm-dev] Demystifying the byte type

The way I understand it, the problem that the byte type is meant to solve is
part of a broader-scoped problem, which is the inconsistency of pointer
semantics in LLVM (and other compilers, for that matter). Subtle
disagreements between optimization passes about how pointer semantics work
cause misoptimizations, and identifying which pass is the
culprit is challenging. This is not helped by the LLVM language reference being
outright incorrect here: it describes provenance in terms of data dependence,
even through integers, which is not how any of our analyses actually work,
generally preferring to reason on a more escape-based analysis approach.

However, the byte type proposal feels to me like it is motivated by a minor
portion of the problem, so narrow that it only really solves the “how to
write memcpy in standard C” aspect of it. It doesn’t really address
how the addition of byte types would fix miscompilations, especially anything
beyond memcpy (for example, C code compiled with -fno-strict-aliasing). It
doesn’t suggest any fixes to the current known inconsistencies in the language
specification. And as a result, it’s kind of dismissive as to why isolated fixes
to various optimization passes are insufficient to achieve coherent semantics.
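
For reference, the "memcpy in standard C" case looks roughly like the
sketch below. The function name `bytewise_copy` and the `holder` struct are
hypothetical, chosen only for illustration. The copy goes through
`unsigned char`, which C permits for any object, so each iteration may be
moving one fragment of a pointer.

```c
#include <assert.h>
#include <stddef.h>

/* A user-written memcpy that copies one unsigned char at a time. In C this
   is well defined for any object, including objects that contain pointers. */
void *bytewise_copy(void *dst, const void *src, size_t n) {
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; ++i)
        d[i] = s[i];   /* each byte may be a fragment of a pointer */
    return dst;
}

/* Hypothetical record, used only to show a pointer travelling through
   the copy. */
struct holder { int *target; };
```

Copying a `struct holder` this way and then dereferencing `target` in the
copy has to keep working, which is the constraint any lowering of the loop
body has to respect.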

Stepping back a bit, it’s helpful to understand that, for the purposes of
building an operational semantics, a pointer is not an i64 but a { i64, BOOM
(Bag Of Other Metadata) }, where the BOOM contains sufficient information to
explain when a load or store of a pointer is undefined behavior—including
liveness information, provenance, and noalias rules [1]. Described like this,
three things should be clear. First, the inttoptr instruction has to recreate
the BOOM given no information, which is necessarily a pessimistic assumption (it
may be useful to have intrinsics that provide less pessimistic recreation of the
BOOM). Second, loads and stores of pointers in memory need to preserve the
BOOM, presumably through a generally inaccessible shadow memory feature.
Finally, the interaction of non-pointer types with the representation of the
BOOM in memory needs to be given a definition.
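
A toy model in C may make the three points easier to see. Everything here
is an assumption made for illustration: the BOOM is reduced to a 32-bit
mask, memory is an array of 64-bit slots, and the names (`model_ptr`,
`model_mem`, `store_ptr`, and so on) are invented. In particular, the rule
in `load_ptr` is just one possible answer to the third point, not a
proposal.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS 8
#define BOOM_NONE 0u           /* slot holds no pointer metadata */
#define BOOM_ANY  0xFFFFFFFFu  /* pessimistic: could be any allocation */

/* A pointer is an address plus a Bag Of Other Metadata (a mask here). */
typedef struct { uint64_t addr; uint32_t boom; } model_ptr;

/* Ordinary memory plus a shadow array the program cannot address. */
typedef struct {
    uint64_t data[SLOTS];
    uint32_t shadow[SLOTS];
} model_mem;

/* Point 1: inttoptr recreates the BOOM from nothing, so it is pessimistic. */
model_ptr int_to_ptr(uint64_t v) {
    model_ptr p = { v, BOOM_ANY };
    return p;
}

/* Point 2: a pointer store keeps the BOOM in shadow memory. */
void store_ptr(model_mem *m, int slot, model_ptr p) {
    assert(slot >= 0 && slot < SLOTS);
    m->data[slot] = p.addr;
    m->shadow[slot] = p.boom;
}

/* Integers carry no BOOM, so an integer store clears the shadow slot. */
void store_int(model_mem *m, int slot, uint64_t v) {
    assert(slot >= 0 && slot < SLOTS);
    m->data[slot] = v;
    m->shadow[slot] = BOOM_NONE;
}

/* Point 3 (one possible choice): loading a pointer from a slot written as
   an integer behaves like an implicit inttoptr. */
model_ptr load_ptr(const model_mem *m, int slot) {
    assert(slot >= 0 && slot < SLOTS);
    if (m->shadow[slot] == BOOM_NONE)
        return int_to_ptr(m->data[slot]);
    model_ptr p = { m->data[slot], m->shadow[slot] };
    return p;
}
```

A pointer stored and reloaded as a pointer keeps its precise mask, while
the same address written as an integer comes back with `BOOM_ANY`.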

Fundamentally, then, the problem is inttoptr (and to a lesser degree, ptrtoint,
as it constitutes a vehicle for escaping pointers), and memory is involved only
insofar as it constitutes a ‘hidden’ inttoptr (and ptrtoint). But byte doesn’t
really expose the ‘hidden’ inttoptr, it just hides it in a different place.
Indeed, it still retains the existing ones if you should load a pointer with an
i64. To me, it appears only to be useful in giving a way to canonicalize
@llvm.memcpy into a regular load type, but an entirely new type doesn’t seem
necessary for that—intrinsics that give access to reading and writing shadow
BOOM seem like they would be sufficient. You might argue that such intrinsics
would eliminate the ability of users to write their own copies of memcpy, but
even here, byte is an insufficient proposal—there’s no way to write a word-based
memcpy in C with this proposal (assuming -fno-strict-aliasing, of course).

With that in mind, I’d like to ask a few questions:

Have you been tracking the WG14 study group on provenance?

Have you attempted to put together some form of provenance semantics in a tool
like Alive2 to more comprehensively catalogue miscompilations in existing
optimizations?

[1] My first instinct is to say that the BOOM is the set of allocations the
pointer may point to, but there may be edge cases that I’m not immediately
thinking of. Formal semantics is not my forte, after all.
