Ralf Jung via llvm-dev
2021-Jun-13 15:22 UTC
[llvm-dev] [RFC] Introducing a byte type to LLVM
Hi,

> 1. Forbidding arithmetic and bitwise operations in b<N> seems pointless. Just
> define them as the corresponding i<N> op plus the union of provenance of the
> operands. This allows consistent implementation of char/unsigned char as b8,
> without having to jump back and forth between b8 and i8 all the time.

FWIW, "char" addition happens at "int" type due to integer promotion. So there is no problem with back and forth here.

"Union" of provenance is currently not an operation that is required to model LLVM IR, so your proposal would necessitate adding such a concept. It'll be interesting to figure out how "getelementptr inbounds" behaves on multi-provenance pointers...

> 6. (How) are pointer types fundamentally different from b<N> types of the
> correct size? (By this I mean: is there any interesting difference in the values
> that these types can carry? Ignore surface differences like the fact that GEP
> traditionally goes with pointers while `add` goes with integer types -- we could
> have a GEP instruction on a correctly sized b<N>)

I'm not saying I have the answer here, but one possible difference might arise with "mixing bytes from different pointers". Say we are storing pointer "ptr1" directly followed by "ptr2" on a 64-bit machine, and now we are doing an (unaligned) 8-byte load covering the last 4 bytes of ptr1 and the first 4 bytes of ptr2. This is certainly a valid value for b64. Is it also a valid value at pointer type, and if yes, which provenance does it have?

Kind regards,
Ralf
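[Editorial sketch, not part of the original message.] The "mixing bytes from different pointers" scenario can be written down in plain C roughly as follows. This is only an illustration under stated assumptions: the names `straddling_load` and `straddle_matches` are hypothetical, the pointer size is taken from the host rather than fixed at 8 bytes, and the mixed bytes are only ever inspected as `unsigned char`, never reinterpreted as a pointer, since whether that reinterpretation is meaningful is exactly the open question.

```c
#include <assert.h>
#include <string.h>

/* Number of bytes in a pointer on the host (8 on a typical 64-bit target). */
#define PTR_BYTES sizeof(void *)

/*
 * Copy PTR_BYTES bytes starting halfway through slots[0]. The result holds
 * the second half of slots[0] (in memory order) followed by the first half
 * of slots[1]. The bytes stay raw; no pointer is formed from them.
 */
static void straddling_load(void *const slots[2], unsigned char out[PTR_BYTES])
{
    const unsigned char *raw = (const unsigned char *)slots;

    assert(PTR_BYTES % 2 == 0); /* assumption: pointer size is even */
    memcpy(out, raw + PTR_BYTES / 2, PTR_BYTES);
}

/* Returns 1 if the load is "second half of ptr1, then first half of ptr2". */
static int straddle_matches(void *ptr1, void *ptr2)
{
    void *slots[2] = { ptr1, ptr2 };   /* ptr1 directly followed by ptr2 */
    unsigned char mixed[PTR_BYTES];
    unsigned char b1[PTR_BYTES];
    unsigned char b2[PTR_BYTES];

    straddling_load(slots, mixed);
    memcpy(b1, &ptr1, PTR_BYTES);      /* byte image of ptr1 */
    memcpy(b2, &ptr2, PTR_BYTES);      /* byte image of ptr2 */

    return memcmp(mixed, b1 + PTR_BYTES / 2, PTR_BYTES / 2) == 0 &&
           memcmp(mixed + PTR_BYTES / 2, b2, PTR_BYTES / 2) == 0;
}
```

Calling `straddle_matches(&x, &y)` for two distinct objects `x` and `y` checks that the loaded value really is built from halves of both pointers; the sketch deliberately says nothing about what provenance such a value would carry.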
Nicolai Hähnle via llvm-dev
2021-Jun-14 06:29 UTC
[llvm-dev] [RFC] Introducing a byte type to LLVM
Hi Ralf,

On Sun, Jun 13, 2021 at 5:22 PM Ralf Jung <jung at mpi-sws.org> wrote:

> > 1. Forbidding arithmetic and bitwise operations in b<N> seems pointless. Just
> > define them as the corresponding i<N> op plus the union of provenance of the
> > operands. This allows consistent implementation of char/unsigned char as b8,
> > without having to jump back and forth between b8 and i8 all the time.
>
> FWIW, "char" addition happens at "int" type due to integer promotion. So there
> is no problem with back and forth here.
>
> "Union" of provenance is currently not an operation that is required to model
> LLVM IR, so your proposal would necessitate adding such a concept. It'll be
> interesting to figure out how "getelementptr inbounds" behaves on
> multi-provenance pointers...

True, something needs to be said about that. The main question is whether "jumping" between different objects that are both in the provenance set is poison or not. Ultimately, the goal of provenance is to help alias analysis, so that's what should be driving that choice.

> > 6. (How) are pointer types fundamentally different from b<N> types of the
> > correct size? (By this I mean: is there any interesting difference in the values
> > that these types can carry? Ignore surface differences like the fact that GEP
> > traditionally goes with pointers while `add` goes with integer types -- we could
> > have a GEP instruction on a correctly sized b<N>)
>
> I'm not saying I have the answer here, but one possible difference might arise
> with "mixing bytes from different pointers". Say we are storing pointer "ptr1"
> directly followed by "ptr2" on a 64bit machine, and now we are doing an
> (unaligned) 8-byte load covering the last 4 bytes of ptr1 and the first 4 bytes
> of ptr2. This is certainly a valid value for b64. Is it also a valid value at
> pointer type, and if yes, which provenance does it have?

This kind of example is why I was implicitly assuming that we must have a "provenance union" operation anyway, whether we like it or not. I suppose the alternative is to say that pointers formed in this way, whether directly or indirectly, are poison, but I have my doubts whether this is feasible.

What happens with pointer arithmetic where you start out with two pointers of different provenance, convert to integer in the source language, subtract them, use the result further in some way, and for some reason all steps are performed with "byte" types in LLVM IR?

Cheers,
Nicolai

> Kind regards,
> Ralf

--
Learn how the world really is, but never forget how it should be.
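[Editorial sketch, not part of the original message.] The source-level pattern described in the last question, two pointers of different provenance converted to integers and subtracted, might look like the following in C. The helper names `address_delta` and `rebase` are hypothetical. The arithmetic is done in `uintptr_t`, so it wraps modulo the integer width and the round trip reproduces the target address on mainstream implementations (the integer/pointer conversions themselves are implementation-defined). Whether the resulting pointer may be dereferenced, and with which provenance, is the question under discussion and is not answered by this sketch.

```c
#include <stdint.h>

/* Difference between two addresses, computed purely on integers.
 * Unsigned arithmetic wraps, so a < b is fine. */
static uintptr_t address_delta(const void *a, const void *b)
{
    return (uintptr_t)a - (uintptr_t)b;
}

/* Add an integer delta to the address of `from` and convert the result
 * back to a pointer. In the source, the result is "derived from" `from`,
 * yet its address may lie inside a completely different object. */
static int *rebase(int *from, uintptr_t delta)
{
    return (int *)((uintptr_t)from + delta);
}
```

For two unrelated objects `x` and `y`, `rebase(&x, address_delta(&y, &x))` yields a pointer whose address equals that of `y` even though it was syntactically built from `&x`. A frontend that lowered every step to byte types would have to decide what provenance the intermediate difference and the final pointer carry.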
David Chisnall via llvm-dev
2021-Jun-20 15:55 UTC
[llvm-dev] [RFC] Introducing a byte type to LLVM
On 13 Jun 2021, at 16:22, Ralf Jung via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> "Union" of provenance is currently not an operation that is required to model
> LLVM IR, so your proposal would necessitate adding such a concept. It'll be
> interesting to figure out how "getelementptr inbounds" behaves on
> multi-provenance pointers...

Union provenance is required if you want an XOR linked list to be valid. These are pretty rare, but there are some idioms (including the calculation of per-CPU storage in the FreeBSD kernel) that depend on multi-provenance semantics.

CHERI systems, such as the Morello boards that Arm is shipping early next year, provide hardware-enforced single-provenance semantics, which might provide some inspiration for this discussion:

In 64-bit CHERI implementations, memory capabilities are a 128-bit type protected by a tag bit (in memory and registers) that signifies that the capability has been derived from one of the capabilities provided in a register on hardware reset. Any operation that would violate the monotonicity of rights (e.g. overwriting a single byte in a valid capability in memory) clears the tag bit, destroying its provenance and causing a trap if you try to use it as the base for a load or store instruction.

When compiling from C-family languages, the memory capability is the hardware type to which all pointer types are lowered. Our clang port to target these architectures defines a new built-in type, `__intcap_t`, which is used to represent `intptr_t`. When we emit LLVM IR, we lower this to an LLVM IR pointer type, not an integer type. All C operations on `__intcap_t` are defined to take provenance from the left operand, with a warning if we can statically show that this is probably wrong.

In our model, at the IR level, `ptrtoint` is fine, but the integer does not carry provenance. `inttoptr` is not permitted[1].
If code wants to extract an address from a pointer for comparison or for hashing (for example), that's fine, but it can't turn the integer back into a pointer directly. If pointers flow around as integers in the C sources, the may-be-a-pointer-in-C types are lowered to pointer types in the IR. This would be easier if C arithmetic operations were defined on pointer types. We currently have to use an IR intrinsic to get the address, then do the arithmetic, then reapply the result to the pointer type, and try to fold that again in the back end.

We have found that large codebases require a very small amount of porting to support this mode, but they *do* require some. This is not a 100% compatible mode with existing codebases and so a single-provenance model for LLVM IR (at least, as the only option) would not be acceptable.

David

[1] Well, kind of. It gives you a capability relative to the default data capability, but in the ABI where all pointers are capabilities then the default data capability is likely to be invalid. Optimisations may not introduce `inttoptr`.
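[Editorial sketch, not part of the original message.] For readers unfamiliar with the XOR linked list mentioned at the start of this message, here is a minimal portable C version. It is not taken from FreeBSD or from the CHERI toolchain; the names `xnode`, `xlist_link` and `xlist_sum` are hypothetical, and the code assumes that a null pointer converts to the integer 0 and back, which holds on mainstream platforms but is implementation-defined. Each `link` field stores the XOR of the addresses of both neighbours, so every pointer recovered during traversal is computed from an integer that mixes two different objects.

```c
#include <stddef.h>
#include <stdint.h>

struct xnode {
    int value;
    uintptr_t link; /* address of previous node XOR address of next node */
};

/* Link n separately allocated nodes, given in list order. */
static void xlist_link(struct xnode *const *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const struct xnode *prev = (i > 0) ? nodes[i - 1] : NULL;
        const struct xnode *next = (i + 1 < n) ? nodes[i + 1] : NULL;
        nodes[i]->link = (uintptr_t)prev ^ (uintptr_t)next;
    }
}

/* Walk the list from either end and sum the values. */
static int xlist_sum(struct xnode *end)
{
    struct xnode *prev = NULL;
    struct xnode *cur = end;
    int sum = 0;

    while (cur != NULL) {
        sum += cur->value;
        /* The neighbour's address is recovered from an integer that
         * combines the addresses of two distinct objects. */
        struct xnode *next = (struct xnode *)(cur->link ^ (uintptr_t)prev);
        prev = cur;
        cur = next;
    }
    return sum;
}
```

With three nodes holding 1, 2 and 3, `xlist_sum` returns 6 whether it starts at the first or the last node. The cast in the loop is the kind of integer-to-pointer step that a strict single-provenance model cannot express directly.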