Hi all,

In May 2021, together with Nuno Lopes and Juneyoung Lee, we proposed adding a byte type to LLVM to fix load type punning issues. The initial RFC touched some subtle aspects of LLVM IR and its semantics, and sparked a lot of questions, concerns, and discussions.

We decided to write a post that summarises the thread and this complicated topic:

https://gist.github.com/georgemitenkov/3def898b8845c2cc161bd216cbbdb81f

We hope that our post clarifies the initial concerns raised on the mailing list. As always, any questions, suggestions and advice are welcome!

Thanks,
George
James Dutton via llvm-dev
2021-Oct-16 10:33 UTC
[llvm-dev] [cfe-dev] Demystifying the byte type
Hi,

Thank you for the description of the problem. I think it would be helpful to also document which CPUs you were considering in relation to this behaviour. I can see that the description would hold for x86 (32-bit) and amd64, but there are CPUs out there that have special instructions for pointer manipulation. If the CPU type has no bearing on the discussion, it would be helpful to add a paragraph or two explaining that.

Kind Regards
James
James Dutton via llvm-dev
2021-Oct-16 11:24 UTC
[llvm-dev] [cfe-dev] Demystifying the byte type
Hi,

The gist post seems to imply that memory needs to be typed. If what you describe is correct, doesn't that imply that the opaque pointer work is a fool's errand? That is, if memory needs to be typed, surely pointers to that memory also need to be typed?

Kind Regards
James
Johannes Doerfert via llvm-dev
2021-Oct-18 05:31 UTC
[llvm-dev] [cfe-dev] Demystifying the byte type
Hi George,

I only made it through part 1 for now, but I figured I might forget if I don't reply directly:

> Under the untyped memory model, we need to accept that every load/store has an implicit ptrtoint/inttoptr attached to it.

This is stated, but I don't see it. Rather, a store of a pointer makes the pointer potentially escape (also see below). Escaped pointers could show up as integers (among other things). So escaping a pointer (by any means) does an implicit ptrtoint/inttoptr, but the store or the load does not necessarily do so.

The transformation shown below that statement doesn't contradict this view, and SROA is still legal. All that happened is that SROA first determined, and then made explicit, that the pointer (here %in) did not escape during the round trip through %mem. If, for example, %mem had been passed to an unknown function, SROA would not have done this transformation, because %in could now have escaped through %mem. If %mem had been cast to int and then loaded, SROA would have made the escaping use explicit through a ptrtoint: https://godbolt.org/z/PG3fj7qe4

Long story short, if you store a pointer (or cast it to an integer, or compare it in anything other than some special ways), it might escape, and since anything could happen to it, it loses its provenance. If you can show it doesn't escape, no provenance is lost.

> An alternative is to say that all pointer stores escape, which again has severe performance consequences and again do not align with all LLVM optimizations.

Which optimizations do not treat a pointer that is stored away as an escaping use? That sounds like a problem. [FWIW, I'm only aware of the Attributor, but it ensures that all uses of the store are visited instead, which makes this sound again (no escape can happen through the store).]

~ Johannes
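The escape-based view described above can be illustrated with a small, flow-insensitive toy check. This is only a sketch under stated assumptions: the tuple encoding of instructions, the `may_escape` function and the `local_slots` parameter are all hypothetical and have nothing to do with LLVM's actual capture tracking or SROA implementation. It treats a pointer as escaped if it is cast to an integer, passed to an unknown call, stored outside a known local slot, or read back from a slot as an integer.

```python
def may_escape(ptr, instrs, local_slots):
    """Crude escape check over a hypothetical instruction encoding.

    Instructions are tuples:
      ("store", value, slot)
      ("load", dst, slot, "ptr" | "int")
      ("ptrtoint", dst, src)
      ("call", arg0, arg1, ...)   # an unknown function
    """
    tracked = {ptr}   # SSA names that may hold the pointer value
    holding = set()   # local slots that may currently hold it
    changed = True
    while changed:    # iterate to a fixpoint; instruction order is ignored
        changed = False
        for op, *args in instrs:
            if op == "store":
                value, slot = args
                if value in tracked:
                    if slot not in local_slots:
                        return True       # stored to memory we cannot see
                    if slot not in holding:
                        holding.add(slot)
                        changed = True
            elif op == "load":
                dst, slot, ty = args
                if slot in holding:
                    if ty == "int":
                        return True       # pointer bits observed as an integer
                    if dst not in tracked:
                        tracked.add(dst)
                        changed = True
            elif op == "ptrtoint":
                _dst, src = args
                if src in tracked:
                    return True           # explicit cast exposes the address
            elif op == "call":
                if any(a in tracked or a in holding for a in args):
                    return True           # unknown code may do anything
    return False


# Round trip through a local slot: nothing escapes in this model.
round_trip = [("store", "%in", "%mem"), ("load", "%out", "%mem", "ptr")]
print(may_escape("%in", round_trip, {"%mem"}))
```

In this model the plain round trip through %mem keeps the pointer non-escaped, while passing %mem to an unknown call or loading it as an integer makes the check return True, which mirrors the two cases mentioned in the message.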
Michael Kruse via llvm-dev
2021-Oct-19 04:59 UTC
[llvm-dev] [cfe-dev] Demystifying the byte type
From the linked document:

> Solution 3: Annotations and tags
> LLVM optimizers work with the assumption that attributes can be discarded if the optimizer does not know how to handle them.

I don't think this is necessarily the case. Such attributes can be designed so that a missing attribute represents the most conservative assumption, like the `mustprogress` attribute/metadata. That is, a missing annotation has an implicit provenance of {all}. GVN can fold q and p after `if (q == p)`, with the new provenance being the union of q's and p's provenance, like a PHINode. In other models, p and q cannot be folded or, in the case of the proposed byte type, cannot carry provenance information.

> High engineering effort to enforce that attributes are preserved in every transformation and used by analyses.

IMHO, it is still lower than introducing a new first-class type.

Michael
Hi George,

Thank you for the writeup. I think a big part of the problem in understanding this comes from the name of the type.

On provenance-carrying architectures (such as CHERI systems, including Arm's Morello [1]), it is unsound to copy a pointer as bytes. Pointers must be copied by provenance-carrying operations. The hardware splits registers into ones that don't carry provenance (integer, floating-point, vector) and ones that do but which can *also* be used to copy non-pointer data (capabilities).

On a CHERI system, ptrtoint does not confer provenance, and inttoptr on the result may yield either an invalid pointer or a pointer with larger bounds, depending on the environment. This reflects the machine semantics: converting a pointer to an integer simply extracts the address (on Morello, the address is exposed as a subregister of the capability register). Converting in the opposite direction inserts the address into the capability held in the default data capability register. In the pure-capability ABI, that register typically does not hold a valid capability, so the result is an invalid pointer; in the hybrid ABI, it refers to the part of the address space used for legacy code.

I think that all of this is fairly aligned with your byte type.
David [1] https://developer.arm.com/architectures/cpu-architecture/a-profile/morello
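The conversion behaviour described above can be mimicked with a very small toy model. This is a sketch only: the `Capability` class and its fields are hypothetical simplifications that ignore permissions, sealing, bounds compression and everything else a real CHERI capability carries, and `ddc` stands in for the default data capability mentioned in the message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    address: int
    base: int
    length: int
    valid: bool

    def can_access(self, size=1):
        # Usable only if the capability is valid and the access is in bounds.
        in_bounds = (self.base <= self.address
                     and self.address + size <= self.base + self.length)
        return self.valid and in_bounds


def ptrtoint(cap):
    """Extracts the address; bounds and validity are not carried along."""
    return cap.address


def inttoptr(addr, ddc):
    """Inserts the address into the default data capability `ddc`."""
    return Capability(addr, ddc.base, ddc.length, ddc.valid)


obj = Capability(address=0x1000, base=0x1000, length=16, valid=True)
purecap_ddc = Capability(address=0, base=0, length=0, valid=False)
hybrid_ddc = Capability(address=0, base=0, length=1 << 32, valid=True)

print(inttoptr(ptrtoint(obj), purecap_ddc).can_access())
print(inttoptr(ptrtoint(obj), hybrid_ddc).length > obj.length)
```

In the model, a round trip through an integer either yields an unusable pointer (invalid default capability) or one whose bounds are those of the default capability rather than of the original object, which matches the two outcomes described above.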
The way I understand it, the problem that the byte type is meant to solve is part of a broader problem: the inconsistency of pointer semantics in LLVM (and other compilers, for that matter). Subtle misunderstandings between optimization passes about how pointer semantics work cause misoptimizations, and identifying which pass is the culprit is challenging. This is not helped by the LLVM language reference being outright incorrect here: it describes provenance in terms of data dependence, even through integers, which is not how any of our analyses actually work. They generally prefer a more escape-based approach.

However, the byte type proposal feels to me like it is motivated by a minor portion of the problem, so narrow that it only really solves the “how to write memcpy in standard C” aspect of it. It doesn’t really address how the addition of byte types would fix miscompilations, especially anything beyond memcpy (for example, C code compiled with -fno-strict-aliasing). It doesn’t suggest any fixes to the currently known inconsistencies in the language specification. And as a result, it’s kind of dismissive as to why isolated fixes to various optimization passes are insufficient to achieve coherent semantics.

Stepping back a bit, it’s helpful to understand that, for the purposes of building an operational semantics, a pointer is not an i64 but a { i64, BOOM (Bag Of Other Metadata) }, where the BOOM contains sufficient information to explain when a load or store through a pointer is undefined behavior, including liveness information, provenance, and noalias rules [1]. Described like this, three things should be clear. First, the inttoptr instruction has to recreate the BOOM given no information, which is necessarily a pessimistic assumption (it may be useful to have intrinsics that provide a less pessimistic recreation of the BOOM). Second, loads and stores of pointers in memory need to preserve the BOOM, presumably through a generally inaccessible shadow memory feature. Finally, the interaction of non-pointer types with the representation of the BOOM in memory needs to be given a definition.

Fundamentally, then, the problem is inttoptr (and to a lesser degree ptrtoint, as it constitutes a vehicle for escaping pointers), and memory is involved only insofar as it constitutes a ‘hidden’ inttoptr (and ptrtoint). But byte doesn’t really expose the ‘hidden’ inttoptr; it just hides it in a different place. Indeed, it still retains the existing ones if you load a pointer as an i64. To me, it appears to be useful only in giving a way to canonicalize @llvm.memcpy into a regular load type, but an entirely new type doesn’t seem necessary for that: intrinsics that give access to reading and writing the shadow BOOM seem like they would be sufficient. You might argue that such intrinsics would eliminate the ability of users to write their own copies of memcpy, but even here byte is an insufficient proposal, since there’s no way to write a word-based memcpy in C with this proposal (assuming -fno-strict-aliasing, of course).

With that in mind, I’d like to ask a few questions:

Have you been tracking the WG14 study group on provenance?

Have you attempted to put together some form of provenance semantics in a tool like Alive2 to more comprehensively catalogue miscompilations in existing optimizations?

[1] My first instinct is to say that the BOOM is the set of allocations the pointer may point to, but there may be edge cases that I’m not immediately thinking of. Formal semantics is not my forte, after all.
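One way to picture the { i64, BOOM } view is a toy memory with a shadow map next to the data bytes. Everything here is an assumption made for illustration: the `Memory` class, the `ANY` marker for a pessimistically recreated BOOM, the 8-byte little-endian pointer layout, and the rule that a pointer load falls back to a hidden inttoptr when the shadow is missing or inconsistent. It is not a proposed semantics for LLVM.

```python
PTR_SIZE = 8
ANY = "any"   # hypothetical marker: BOOM recreated with no information


def inttoptr(addr):
    """Recreates the BOOM pessimistically from a bare address."""
    return (addr, ANY)


class Memory:
    def __init__(self):
        self.data = {}     # address -> byte value
        self.shadow = {}   # address -> (boom, index of this byte in the pointer)

    def store_int(self, at, value):
        for i, b in enumerate(value.to_bytes(PTR_SIZE, "little")):
            self.data[at + i] = b
            self.shadow.pop(at + i, None)       # integers carry no BOOM

    def store_ptr(self, at, ptr):
        addr, boom = ptr
        for i, b in enumerate(addr.to_bytes(PTR_SIZE, "little")):
            self.data[at + i] = b
            self.shadow[at + i] = (boom, i)     # keep the BOOM per byte

    def load_int(self, at):
        raw = bytes(self.data[at + i] for i in range(PTR_SIZE))
        return int.from_bytes(raw, "little")    # the BOOM is not observable

    def load_ptr(self, at):
        addr = self.load_int(at)
        tags = [self.shadow.get(at + i) for i in range(PTR_SIZE)]
        intact = all(t is not None and t[1] == i for i, t in enumerate(tags))
        if intact and len({t[0] for t in tags}) == 1:
            return (addr, tags[0][0])           # BOOM survived in the shadow
        return inttoptr(addr)                   # the 'hidden' inttoptr

    def copy_bytes(self, dst, src, n):
        """Byte-wise copy that also copies the shadow (memcpy-like)."""
        for i in range(n):
            self.data[dst + i] = self.data[src + i]
            if src + i in self.shadow:
                self.shadow[dst + i] = self.shadow[src + i]
            else:
                self.shadow.pop(dst + i, None)


mem = Memory()
p = (0x1000, frozenset({"alloc_A"}))
mem.store_ptr(0, p)
mem.copy_bytes(16, 0, PTR_SIZE)
print(mem.load_ptr(16) == p)
```

In this sketch a pointer-typed load after a shadow-preserving copy recovers the original BOOM, while going through `load_int` followed by `inttoptr` yields the pessimistic `ANY`. The sketch leaves open the third point raised above, namely how ordinary integer operations on such bytes should interact with the shadow.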