thr3ads.net - llvm dev - [llvm-dev] [RFC] Introducing a byte type to LLVM [Jun 2021]

Ralf Jung via llvm-dev

2021-Jun-13 15:22 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

Hi,

> 1. Forbidding arithmetic and bitwise operations in b<N> seems pointless. Just
> define them as the corresponding i<N> op plus the union of provenance of the
> operands. This allows consistent implementation of char/unsigned char as b8,
> without having to jump back and forth between b8 and i8 all the time.

FWIW, "char" addition happens at "int" type due to integer promotion. So there
is no problem with back and forth here.
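
The integer-promotion point can be checked directly in C. The following is a minimal sketch, not part of the original mail; the function names are made up for illustration, and the narrowed result assumes an 8-bit `unsigned char`.

```c
#include <assert.h>
#include <stddef.h>

/* Size of the type of `a + b` for two unsigned char operands. Both operands
   are promoted before the addition, so the result has the size of int. */
size_t char_sum_size(void) {
    unsigned char a = 200, b = 100;
    return sizeof(a + b);
}

/* The addition itself is carried out at int width, so it does not wrap at
   8 bits. */
int char_sum(unsigned char a, unsigned char b) {
    int sum = a + b;
    assert(sum >= 0); /* two promoted unsigned chars cannot sum below zero */
    return sum;
}

/* Only the conversion back to unsigned char reduces the value modulo
   2^CHAR_BIT. */
unsigned char char_sum_narrowed(unsigned char a, unsigned char b) {
    return (unsigned char)(a + b);
}
```

Calling `char_sum(200, 100)` yields 300 rather than a wrapped 8-bit value, which is why a front end would only need the narrow type at the load and store, not for the arithmetic.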

"Union" of provenance is currently not an operation that is required
to model LLVM IR, so your proposal would necessitate adding such a concept.
It'll be interesting to figure out how "getelementptr inbounds" behaves on
multi-provenance pointers...

> 6. (How) are pointer types fundamentally different from b<N> types of the
> correct size? (By this I mean: is there any interesting difference in the
> values that these types can carry? Ignore surface differences like the fact
> that GEP traditionally goes with pointers while `add` goes with integer
> types -- we could have a GEP instruction on a correctly sized b<N>)

I'm not saying I have the answer here, but one possible difference might arise
with "mixing bytes from different pointers". Say we are storing pointer "ptr1"
directly followed by "ptr2" on a 64-bit machine, and now we are doing an
(unaligned) 8-byte load covering the last 4 bytes of ptr1 and the first 4 bytes
of ptr2. This is certainly a valid value for b64. Is it also a valid value at
pointer type, and if yes, which provenance does it have?
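
The layout described above can be reproduced at the source level. The following C sketch is only an illustration, not part of the original mail: `ptr_pair` and `load_straddling` are hypothetical names, it assumes that pointers and `uintptr_t` have the same size and that the struct has no padding, and it says nothing about how a front end would actually lower the copy (to an integer load today, or possibly to a b64 load under the RFC).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two pointers stored back to back, as in the example. */
struct ptr_pair {
    int *ptr1;
    int *ptr2;
};

/* Copy one pointer's worth of bytes starting halfway through ptr1, so the
   result holds the second half of ptr1's representation followed by the
   first half of ptr2's. */
uintptr_t load_straddling(const struct ptr_pair *p) {
    uintptr_t mixed;
    /* Layout assumptions of this sketch, checked at run time. */
    assert(sizeof(struct ptr_pair) == 2 * sizeof(int *));
    assert(sizeof(uintptr_t) == sizeof(int *));
    memcpy(&mixed, (const unsigned char *)p + sizeof(int *) / 2, sizeof mixed);
    return mixed;
}
```

On a 64-bit target the returned value holds the last 4 bytes of `ptr1` followed by the first 4 bytes of `ptr2`, which is exactly the value whose provenance is in question.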

Kind regards,
Ralf

Nicolai Hähnle via llvm-dev

2021-Jun-14 06:29 UTC

Hi Ralf,

On Sun, Jun 13, 2021 at 5:22 PM Ralf Jung <jung at mpi-sws.org> wrote:
> > 1. Forbidding arithmetic and bitwise operations in b<N> seems pointless.
> > Just define them as the corresponding i<N> op plus the union of provenance
> > of the operands. This allows consistent implementation of char/unsigned
> > char as b8, without having to jump back and forth between b8 and i8 all
> > the time.
>
> FWIW, "char" addition happens at "int" type due to integer promotion. So
> there is no problem with back and forth here.
>
> "Union" of provenance is currently not an operation that is required to
> model LLVM IR, so your proposal would necessitate adding such a concept.
> It'll be interesting to figure out how "getelementptr inbounds" behaves on
> multi-provenance pointers...
>
True, something needs to be said about that. The main question is whether
"jumping" between different objects that are both in the provenance set is
poison or not. Ultimately, the goal of provenance is to help alias
analysis, so that's what should be driving that choice.

> > 6. (How) are pointer types fundamentally different from b<N> types of the
> > correct size? (By this I mean: is there any interesting difference in the
> > values that these types can carry? Ignore surface differences like the
> > fact that GEP traditionally goes with pointers while `add` goes with
> > integer types -- we could have a GEP instruction on a correctly sized
> > b<N>)
>
> I'm not saying I have the answer here, but one possible difference might
> arise with "mixing bytes from different pointers". Say we are storing
> pointer "ptr1" directly followed by "ptr2" on a 64-bit machine, and now we
> are doing an (unaligned) 8-byte load covering the last 4 bytes of ptr1 and
> the first 4 bytes of ptr2. This is certainly a valid value for b64. Is it
> also a valid value at pointer type, and if yes, which provenance does it
> have?
>
This kind of example is why I was implicitly assuming that we must have a
"provenance union" operation anyway, whether we like it or not. I suppose
the alternative is to say that pointers formed in this way, whether
directly or indirectly, are poison, but I have my doubts whether this is
feasible. What happens with pointer arithmetic where you start out with two
pointers of different provenance, convert to integer in the source
language, subtract them, use the result further in some way, and for some
reason all steps are performed with "byte" types in LLVM IR?
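
A source-level version of that scenario might look like the following C sketch. It is not part of the original mail; `address_delta` and `apply_delta` are hypothetical names, and the round trip through `uintptr_t` relies on implementation-defined behaviour that holds on common flat-address targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Distance between two unrelated objects, computed on integer addresses.
   Subtracting the pointers themselves would be undefined behaviour in C. */
uintptr_t address_delta(const void *to, const void *from) {
    return (uintptr_t)to - (uintptr_t)from;
}

/* Apply a previously computed delta to a base pointer. The result has the
   address of another object; which provenance it carries is the open
   question in this thread. */
void *apply_delta(void *base, uintptr_t delta) {
    assert(base != NULL);
    return (void *)((uintptr_t)base + delta);
}
```

If every value in this chain were carried in a byte type rather than an integer type, the delta would be derived from two different provenances, and the pointer produced by `apply_delta` would depend on both.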

Cheers,
Nicolai


> Kind regards,
> Ralf
>

-- 
Learn how the world really is,
but never forget how it ought to be.

David Chisnall via llvm-dev

2021-Jun-20 15:55 UTC

On 13 Jun 2021, at 16:22, Ralf Jung via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> "Union" of provenance is currently not an operation that is required to
> model LLVM IR, so your proposal would necessitate adding such a concept.
> It'll be interesting to figure out how "getelementptr inbounds" behaves on
> multi-provenance pointers...

Union provenance is required if you want an XOR linked list to be valid.  These
are pretty rare, but there are some idioms (including the calculation of per-CPU
storage in the FreeBSD kernel) that depend on multi-provenance semantics.
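
For readers unfamiliar with the idiom, here is a minimal C sketch of an XOR linked list. It is not taken from the original mail or from any particular code base; `xnode`, `xlist_link` and `xlist_sum` are hypothetical names, and the sketch assumes a null pointer converts to the integer 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct xnode {
    int value;
    uintptr_t link; /* address of previous node XOR address of next node */
};

static uintptr_t addr(const struct xnode *n) { return (uintptr_t)n; }

/* Chain the given nodes in order; the ends use NULL as their neighbour. */
void xlist_link(struct xnode **nodes, size_t count) {
    assert(nodes != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        struct xnode *prev = i > 0 ? nodes[i - 1] : NULL;
        struct xnode *next = i + 1 < count ? nodes[i + 1] : NULL;
        nodes[i]->link = addr(prev) ^ addr(next);
    }
}

/* Walk from either end and sum the values. Each step rebuilds the next
   pointer from an integer that mixes the addresses of two other nodes. */
int xlist_sum(struct xnode *end) {
    struct xnode *prev = NULL, *cur = end;
    int sum = 0;
    while (cur != NULL) {
        struct xnode *next = (struct xnode *)(cur->link ^ addr(prev));
        sum += cur->value;
        prev = cur;
        cur = next;
    }
    return sum;
}
```

The `link` field is the interesting part: the pointer recovered in `xlist_sum` comes from a value derived from two distinct objects, so a model in which such a value carries provenance at all has to account for both.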

CHERI systems, such as the Morello boards that Arm is shipping early next year,
provide a hardware-enforced single-provenance semantics, which might provide
some inspiration for this discussion:

In 64-bit CHERI implementations, memory capabilities are a 128-bit type
protected by a tag bit (in memory and registers) that signifies that it has been
derived from one of the capabilities provided in a register on hardware reset. 
Any operation that would violate the monotonicity of rights (e.g. overwriting a
single byte in a valid capability in memory) clears the tag bit, destroying its
provenance and causing a trap if you try to use it as the base for a load or
store instruction.  When compiling from C-family languages, the memory
capability is the hardware type to which all pointer types are lowered.

Our clang port to target these architectures defines a new built-in type,
`__intcap_t`, which is used to represent `intptr_t`.  When we emit LLVM IR, we
lower this to an LLVM IR pointer type, not an integer type.  All C operations on
`__intcap_t` are defined to take provenance from the left operand, with a
warning if we can statically show that this is probably wrong.

In our model, at the IR level, `ptrtoint` is fine, but the integer does not
carry provenance.  `inttoptr` is not permitted[1].  If code wants to extract an
address from a pointer for comparison or for hashing (for example), that’s fine,
but it can’t turn the integer back into a pointer directly.  If pointers flow
around as integers in the C sources, the may-be-a-pointer-in-C types are lowered
to pointer types in the IR.  This would be easier if C arithmetic operations
were defined on pointer types.  We currently have to use an IR intrinsic to get
the address, then do the arithmetic, then reapply the result to the pointer
type, and try to fold that again in the back end.

We have found that large codebases require a very small amount of porting to
support this mode, but they *do* require some.  This is not a 100% compatible
mode with existing codebases and so a single-provenance model for LLVM IR (at
least, as the only option) would not be acceptable.

David

[1] Well, kind of.  It gives you a capability relative to the default data
capability, but in the ABI where all pointers are capabilities then the default
data capability is likely to be invalid.  Optimisations may not introduce
`inttoptr`.
