John McCall via llvm-dev
2021-Jun-04 18:25 UTC
[llvm-dev] [RFC] Introducing a byte type to LLVM
On 4 Jun 2021, at 11:24, George Mitenkov wrote:
> Hi all,
>
> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
> type to LLVM to fix miscompilations due to load type punning. Please see
> the proposal below. It would be great to hear the
> feedback/comments/suggestions!
>
> Motivation
> =========
>
> char and unsigned char are considered to be universal holders in C. They
> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
> counterpart but it does not have such semantics, which is also not
> desirable as it would disable many optimizations.

I don’t believe this is correct. LLVM does not have an innate concept of typed memory. The type of a global or local allocation is just a roundabout way of giving it a size and default alignment, and similarly the type of a load or store just determines the width and default alignment of the access. There are no restrictions on what types can be used to load or store from certain objects.

C-style type aliasing restrictions are imposed using `tbaa` metadata, which are unrelated to the IR type of the access.

John.
John McCall via llvm-dev
2021-Jun-04 18:49 UTC
[llvm-dev] [RFC] Introducing a byte type to LLVM
On 4 Jun 2021, at 14:25, John McCall via llvm-dev wrote:
> On 4 Jun 2021, at 11:24, George Mitenkov wrote:
>> Hi all,
>>
>> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
>> type to LLVM to fix miscompilations due to load type punning. Please see
>> the proposal below. It would be great to hear the
>> feedback/comments/suggestions!
>>
>> Motivation
>> =========
>>
>> char and unsigned char are considered to be universal holders in C. They
>> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
>> counterpart but it does not have such semantics, which is also not
>> desirable as it would disable many optimizations.
>
> I don’t believe this is correct. LLVM does not have an innate
> concept of typed memory. The type of a global or local allocation
> is just a roundabout way of giving it a size and default alignment,
> and similarly the type of a load or store just determines the width
> and default alignment of the access. There are no restrictions on
> what types can be used to load or store from certain objects.
>
> C-style type aliasing restrictions are imposed using `tbaa`
> metadata, which are unrelated to the IR type of the access.

If this is all related to https://bugs.llvm.org/show_bug.cgi?id=37469, I don’t think anything about i8 is the ultimate problem there.

John.
Nuno Lopes via llvm-dev
2021-Jun-04 19:06 UTC
[llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM
> On 4 Jun 2021, at 11:24, George Mitenkov wrote:
>> Hi all,
>>
>> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
>> type to LLVM to fix miscompilations due to load type punning. Please see
>> the proposal below. It would be great to hear the
>> feedback/comments/suggestions!
>>
>> Motivation
>> =========
>>
>> char and unsigned char are considered to be universal holders in C. They
>> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
>> counterpart but it does not have such semantics, which is also not
>> desirable as it would disable many optimizations.
>
> I don’t believe this is correct. LLVM does not have an innate
> concept of typed memory. The type of a global or local allocation
> is just a roundabout way of giving it a size and default alignment,
> and similarly the type of a load or store just determines the width
> and default alignment of the access. There are no restrictions on
> what types can be used to load or store from certain objects.
>
> C-style type aliasing restrictions are imposed using tbaa
> metadata, which are unrelated to the IR type of the access.

It’s debatable whether LLVM considers memory to be typed or not. If we don’t consider memory to be typed, then *all* integer load operations have to be considered as potentially escaping pointers. Example:

    store i32* %p, i32** %q
    %q2 = bitcast i32** %q to i64*
    %v = load i64, i64* %q2

This program stores a pointer and then loads it back as an integer. So there’s an implicit pointer-to-integer cast, which escapes the pointer. If we allow this situation to happen, then the alias analysis code is broken, as well as several optimizations. LLVM doesn’t consider loads as potential pointer escape sites. It would probably be a disaster (performance-wise) if it did!

The introduction of the byte type allows us to make all pointer <-> integer casts explicit, so that we don’t have to mark all integer loads as escaping. It also allows us to say that LLVM optimizations are correct, and we “just” need to create a few new optimizations to get rid of the extra bytecast instructions when they are provably not needed.

TBAA is unrelated to the problem we are trying to solve here.

Nuno
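
The escape argument above can be sketched as a toy model. This is only an illustration under stated assumptions: `UntypedMemory`, its slot names, and the `escaped` set are hypothetical and are not LLVM data structures or part of the proposal. The model shows why, if an integer load may read back a stored pointer, an analysis has to treat that load as a point where the pointer escapes.

```python
class UntypedMemory:
    """Toy memory where an integer load may read back a stored pointer.

    Hypothetical model for illustration only, not an LLVM API.
    """

    def __init__(self):
        self.cells = {}       # slot name -> (allocation or None, numeric value)
        self.escaped = set()  # allocations whose address was observed as an integer

    def store_ptr(self, slot, allocation, address):
        # A pointer store records which allocation the pointer refers to.
        self.cells[slot] = (allocation, address)

    def load_int(self, slot):
        allocation, value = self.cells[slot]
        if allocation is not None:
            # Implicit pointer-to-integer cast: the pointer escapes here.
            self.escaped.add(allocation)
        return value


mem = UntypedMemory()
mem.store_ptr("q", allocation="p", address=0x2000)  # store i32* %p, i32** %q
v = mem.load_int("q")                                # %v = load i64, i64* %q2
print(v == 0x2000, mem.escaped)
```

In this sketch the bookkeeping sits inside `load_int`, which mirrors the concern in the message: without explicit casts, every integer load would need this check.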
Joshua Cranmer via llvm-dev
2021-Jun-04 22:27 UTC
[llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM
On 6/4/2021 2:25 PM, John McCall via cfe-dev wrote:
> I don’t believe this is correct. LLVM does not have an innate
> concept of typed memory. The type of a global or local allocation
> is just a roundabout way of giving it a size and default alignment,
> and similarly the type of a load or store just determines the width
> and default alignment of the access. There are no restrictions on
> what types can be used to load or store from certain objects.
>
> C-style type aliasing restrictions are imposed using `tbaa`
> metadata, which are unrelated to the IR type of the access.
>
> John.

I've never been thoroughly involved in any of the actual optimizations here, but it seems to me that there is a soundness hole in the LLVM semantics that we gloss over when we say that LLVM doesn't have typed memory.

Working backwards from what a putative operational semantics of LLVM might look like (and I'm going to ignore poison/undef because it's not relevant), I think there is agreement that integer types in LLVM are purely bitvectors. Any value of i64 5 can be replaced with any other value of i64 5 no matter where it came from. At the same time, when we have pointers involved, this is not true. Two pointers may have the same numerical value (e.g., when cast to integers), but one might not be replaceable with the other because there's other data that might not be the same. So in operational terms, pointers have both a numerical value and a bag of provenance data (probably other stuff, but let's be simple and call it provenance).

Now we have to ask what the semantics of converting between integers and pointers are. Integers, as we've defined, don't have provenance data. So an inttoptr instruction has to synthesize that provenance somehow. Ideally, we'd want to grab that data from the ptrtoint instruction that generated the integer, but the semantics of integers means we can only launder that data globally, so that an inttoptr has the union of all of the provenance data that was ever fed into a ptrtoint (I suspect the actual semantics we use is somewhat more precise than this in that it only considers those pointers that point to still-live data, which doesn't invalidate anything I'm about to talk about).

Okay, what about memory? I believe what most people intend to mean when they say that LLVM's memory is untyped is that a load or store of any type is equivalent to first converting it to an integer and then storing the integer into memory. E.g. these two functions are semantically equivalent:

    define void @foo(ptr %mem, i8* %foo) {
      store i8* %foo, ptr %mem
      ret void
    }

    define void @bar(ptr %mem, i8* %foo) {
      %asint = ptrtoint i8* %foo to i64 ; Or whatever pointer size you have
      store i64 %asint, ptr %mem
      ret void
    }

In other words, we are to accept that every load and store instruction of a pointer has an implicit inttoptr or ptrtoint attached to it. But as I mentioned earlier, pointers have this extra metadata attached to them that is lost when converting to an integer. Under this strict interpretation of memory, we *lose* that metadata every time a pointer is stored in memory, as if we did an inttoptr(ptrtoint x). Thus, the following two functions are *not* semantically equivalent in that model:

    define i8* @basic(i8* %in) {
      ret i8* %in
    }

    define i8* @via_alloc(i8* %in) {
      %mem = alloca i8*
      store i8* %in, i8** %mem
      %out = load i8*, i8** %mem
      ret i8* %out
    }

In order to allow these two functions to be equivalent, we have to let the load of a pointer recover the provenance data stored by the store of the pointer, and nothing more general. If either one of those were instead an integer load or store, then no provenance data can be communicated, so the integer and the pointer loads *must* be nonequivalent (although loading an integer instead of a pointer would presumably be a pessimistic transformation).

In short, pointers have pointery bits that aren't reflected in the bitvector representation that an integer has. LLVM has some optimizations that assume that loads and stores only have bitvector manipulation semantics, while other optimizations (and most of the frontends) expect that loads and stores will preserve the pointery bits. And when these interact with each other, it's undoubtedly possible that the pointery bits get lost along the way.

-- 
Joshua Cranmer
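
The two memory interpretations contrasted above can be sketched as a small executable model. It is a simplification under stated assumptions: `Pointer`, `BitvectorMemory`, and `ProvenanceMemory` are hypothetical names, and `inttoptr` here yields an "unknown" provenance (`None`) rather than the global union described in the message. The point is only that `via_alloc` matches `basic` when memory keeps provenance and fails to match when memory holds bare bitvectors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pointer:
    """A numeric address plus provenance (hypothetical model, not LLVM's)."""
    address: int
    provenance: Optional[str]  # allocation it was derived from, None if unknown


def ptrtoint(p: Pointer) -> int:
    # Integers are pure bitvectors, so the provenance is dropped.
    return p.address


def inttoptr(n: int) -> Pointer:
    # Simplifying assumption: the synthesized provenance is just "unknown".
    return Pointer(n, None)


class BitvectorMemory:
    """Strict untyped memory: a pointer store is a ptrtoint, a load an inttoptr."""

    def __init__(self):
        self.cells = {}

    def store_ptr(self, slot, p):
        self.cells[slot] = ptrtoint(p)

    def load_ptr(self, slot):
        return inttoptr(self.cells[slot])


class ProvenanceMemory:
    """Memory in which a pointer load recovers what the pointer store wrote."""

    def __init__(self):
        self.cells = {}

    def store_ptr(self, slot, p):
        self.cells[slot] = p

    def load_ptr(self, slot):
        return self.cells[slot]


def basic(p):
    return p


def via_alloc(mem, p):
    mem.store_ptr("mem", p)     # store i8* %in, i8** %mem
    return mem.load_ptr("mem")  # %out = load i8*, i8** %mem


p = Pointer(0x1000, "alloc_a")
print(via_alloc(BitvectorMemory(), p) == basic(p))
print(via_alloc(ProvenanceMemory(), p) == basic(p))
```

Equality of `Pointer` values compares both the address and the provenance, so two pointers with the same address can still differ, as in the discussion above.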
Chris Lattner via llvm-dev
2021-Jun-06 04:26 UTC
[llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM
On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> On 4 Jun 2021, at 11:24, George Mitenkov wrote:
>> Hi all,
>>
>> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
>> type to LLVM to fix miscompilations due to load type punning. Please see
>> the proposal below. It would be great to hear the
>> feedback/comments/suggestions!
>>
>> Motivation
>> =========
>>
>> char and unsigned char are considered to be universal holders in C. They
>> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
>> counterpart but it does not have such semantics, which is also not
>> desirable as it would disable many optimizations.
>
> I don’t believe this is correct. LLVM does not have an innate
> concept of typed memory. The type of a global or local allocation
> is just a roundabout way of giving it a size and default alignment,
> and similarly the type of a load or store just determines the width
> and default alignment of the access. There are no restrictions on
> what types can be used to load or store from certain objects.
>
> C-style type aliasing restrictions are imposed using tbaa
> metadata, which are unrelated to the IR type of the access.

I completely agree with John. “i8” in LLVM doesn’t carry any implications about aliasing (in fact, LLVM pointers are going towards being typeless). Any such thing occurs at the accesses, and is part of TBAA.

I’m opposed to adding a byte type to LLVM, as such semantic-carrying types are entirely unprecedented, and would add tremendous complexity to the entire system.

-Chris