thr3ads.net - llvm dev - [llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM [Jun 2021]

John McCall via llvm-dev

2021-Jun-04 18:25 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

On 4 Jun 2021, at 11:24, George Mitenkov wrote:
> Hi all,
>
> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
> type to LLVM to fix miscompilations due to load type punning. Please see
> the proposal below. It would be great to hear the
> feedback/comments/suggestions!
>
>
> Motivation
> =========
>
> char and unsigned char are considered to be universal holders in C. They
> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
> counterpart but it does not have such semantics, which is also not
> desirable as it would disable many optimizations.
I don’t believe this is correct.  LLVM does not have an innate
concept of typed memory.  The type of a global or local allocation
is just a roundabout way of giving it a size and default alignment,
and similarly the type of a load or store just determines the width
and default alignment of the access.  There are no restrictions on
what types can be used to load or store from certain objects.

C-style type aliasing restrictions are imposed using `tbaa`
metadata, which are unrelated to the IR type of the access.

John.

John McCall via llvm-dev

2021-Jun-04 18:49 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

On 4 Jun 2021, at 14:25, John McCall via llvm-dev wrote:
> On 4 Jun 2021, at 11:24, George Mitenkov wrote:
>> Hi all,
>>
>> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
>> type to LLVM to fix miscompilations due to load type punning. Please see
>> the proposal below. It would be great to hear the
>> feedback/comments/suggestions!
>>
>>
>> Motivation
>> =========
>>
>> char and unsigned char are considered to be universal holders in C. They
>> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
>> counterpart but it does not have such semantics, which is also not
>> desirable as it would disable many optimizations.
>
> I don’t believe this is correct.  LLVM does not have an innate
> concept of typed memory.  The type of a global or local allocation
> is just a roundabout way of giving it a size and default alignment,
> and similarly the type of a load or store just determines the width
> and default alignment of the access.  There are no restrictions on
> what types can be used to load or store from certain objects.
>
> C-style type aliasing restrictions are imposed using `tbaa`
> metadata, which are unrelated to the IR type of the access.
If this is all related to https://bugs.llvm.org/show_bug.cgi?id=37469,
I don’t think anything about i8 is the ultimate problem there.

John.

Nuno Lopes via llvm-dev

2021-Jun-04 19:06 UTC

[llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM

> On 4 Jun 2021, at 11:24, George Mitenkov wrote:
>> Hi all,
>>
>> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
>> type to LLVM to fix miscompilations due to load type punning. Please see
>> the proposal below. It would be great to hear the
>> feedback/comments/suggestions!
>>
>>
>> Motivation
>> =========
>> char and unsigned char are considered to be universal holders in C. They
>> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
>> counterpart but it does not have such semantics, which is also not
>> desirable as it would disable many optimizations.
>
> I don’t believe this is correct. LLVM does not have an innate
> concept of typed memory. The type of a global or local allocation
> is just a roundabout way of giving it a size and default alignment,
> and similarly the type of a load or store just determines the width
> and default alignment of the access. There are no restrictions on
> what types can be used to load or store from certain objects.
>
> C-style type aliasing restrictions are imposed using `tbaa`
> metadata, which are unrelated to the IR type of the access.

It’s debatable whether LLVM considers memory to be typed or not. If we don’t
consider memory to be typed, then *all* integer load operations have to be
considered as potentially escaping pointers. Example:
store i32* %p, i32** %q
%q2 = bitcast i32** %q to i64*
%v = load i64, i64* %q2

This program stores a pointer and then loads it back as an integer. So there’s
an implicit pointer-to-integer cast, which escapes the pointer. If we allow this
situation to happen, then the alias analysis code is broken, as well as several
optimizations. LLVM doesn’t consider loads as potential pointer escape sites. It
would probably be a disaster (performance wise) if it did!
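
A rough C analogue of the snippet above, offered only as a sketch: the helper name load_slot_as_integer is hypothetical, and the code assumes a mainstream platform where uintptr_t has the same size as a pointer and converting a pointer to uintptr_t yields its raw bits. A pointer is written into a slot, and the bytes of that slot are then read back with an integer type, with no explicit pointer-to-integer cast on the read path.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: pointers and uintptr_t have the same size on this target. */
static_assert(sizeof(uintptr_t) == sizeof(int *),
              "sketch assumes pointer-sized uintptr_t");

/* Hypothetical helper: stores a pointer into a slot, then reads the
   slot's bytes back as an integer (no explicit cast on the read). */
uintptr_t load_slot_as_integer(int *p) {
    int *slot = p;                /* like: store i32* %p, i32** %q   */
    uintptr_t v;
    memcpy(&v, &slot, sizeof v);  /* like: load i64 from the same slot */
    return v;
}
```

On such a platform the returned integer matches (uintptr_t)p, which is why the integer load behaves like an implicit pointer-to-integer cast.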

 

The introduction of the byte type allows us to make all pointer <-> integer
casts explicit, so that we don’t have to treat all integer loads as escaping. It
also allows us to say that LLVM optimizations are correct, and we “just” need to
create a few new optimizations to get rid of the extra bytecast instructions when
they are provably not needed.

TBAA is unrelated to the problem we are trying to solve here.

Nuno

Joshua Cranmer via llvm-dev

2021-Jun-04 22:27 UTC

[llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM

On 6/4/2021 2:25 PM, John McCall via cfe-dev wrote:
> I don’t believe this is correct. LLVM does not have an innate
> concept of typed memory. The type of a global or local allocation
> is just a roundabout way of giving it a size and default alignment,
> and similarly the type of a load or store just determines the width
> and default alignment of the access. There are no restrictions on
> what types can be used to load or store from certain objects.
>
> C-style type aliasing restrictions are imposed using `tbaa`
> metadata, which are unrelated to the IR type of the access.
>
> John.

I've never been thoroughly involved in any of the actual optimizations 
here, but it seems to me that there is a soundness hole in the LLVM 
semantics that we gloss over when we say that LLVM doesn't have typed 
memory.

Working backwards from what a putative operational semantics of LLVM 
might look like (and I'm going to ignore poison/undef because it's not 
relevant), I think there is agreement that integer types in LLVM are 
purely bitvectors. Any value of i64 5 can be replaced with any other 
value of i64 5 no matter where it came from. At the same time, when we 
have pointers involved, this is not true. Two pointers may have the same 
numerical value (e.g., when cast to integers), but one might not be 
replaceable with the other because there's other data that might not be 
the same. So in operational terms, pointers have both a numerical value 
and a bag of provenance data (probably other stuff, but let's be simple 
and call it provenance).

Now we have to ask what the semantics of converting between integers and 
pointers are. Integers, as we've defined, don't have provenance data. So
an inttoptr instruction has to synthesize that provenance somehow. 
Ideally, we'd want to grab that data from the ptrtoint instruction that 
generated the integer, but the semantics of integers means we can only 
launder that data globally, so that an inttoptr has the union of all of 
the provenance data that was ever fed into a ptrtoint (I suspect the 
actual semantics we use is somewhat more precise than this in that it 
only considers those pointers that point to still-live data, which 
doesn't invalidate anything I'm about to talk about).
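
As a toy illustration of this model (not LLVM's actual semantics), a pointer can be sketched in C as an address plus a provenance tag. All names here (ModelPtr, PROV_ANY, model_ptrtoint, model_inttoptr, model_interchangeable) are hypothetical, and the single "any" tag is a crude stand-in for the global union of provenance described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { PROV_ANY = -1 };  /* stand-in for "union of all exposed provenance" */

/* Toy pointer: a numerical value plus a bag of provenance (here one tag). */
typedef struct { uint64_t addr; int prov; } ModelPtr;

/* Integers are pure bitvectors, so the provenance is dropped. */
uint64_t model_ptrtoint(ModelPtr p) { return p.addr; }

/* inttoptr cannot recover the original tag; it gets the coarse one. */
ModelPtr model_inttoptr(uint64_t a) { return (ModelPtr){ a, PROV_ANY }; }

/* Two pointers are interchangeable only if address AND provenance match. */
bool model_interchangeable(ModelPtr a, ModelPtr b) {
    return a.addr == b.addr && a.prov == b.prov;
}

void model_demo(void) {
    ModelPtr a = { 0x1000, 1 }, b = { 0x1000, 2 };
    assert(model_ptrtoint(a) == model_ptrtoint(b));  /* same integer value */
    assert(!model_interchangeable(a, b));            /* different pointers */
    ModelPtr r = model_inttoptr(model_ptrtoint(a));
    assert(r.addr == a.addr && r.prov == PROV_ANY);  /* tag was laundered  */
}
```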

Okay, what about memory? I believe what most people intend to mean when 
they say that LLVM's memory is untyped is that a load or store of any 
type is equivalent to first converting it to an integer and then storing 
the integer into memory. E.g. these two functions are semantically 
equivalent:

define void @foo(ptr %mem, ptr %foo) {
   store ptr %foo, ptr %mem
   ret void
}
define void @bar(ptr %mem, ptr %foo) {
   %asint = ptrtoint ptr %foo to i64 ; Or whatever pointer size you have
   store i64 %asint, ptr %mem
   ret void
}

In other words, we are to accept that every load and store instruction 
of a pointer has an implicit inttoptr or ptrtoint attached to it. But as 
I mentioned earlier, pointers have this extra metadata attached to them 
that is lost when converting to an integer. Under this strict 
interpretation of memory, we *lose* that metadata every time a pointer 
is stored in memory, as if we did an inttoptr(ptrtoint x). Thus, the 
following two functions are *not* semantically equivalent in that model:

define i8* @basic(i8* %in) {
   ret i8* %in
}
define i8* @via_alloc(i8* %in) {
   %mem = alloca i8*
   store i8* %in, i8** %mem
   %out = load i8*, i8** %mem
   ret i8* %out
}

In order to allow these two functions to be equivalent, we have to let 
the load of a pointer recover the provenance data stored by the store of 
the pointer, and nothing more general. If either one of those were 
instead an integer load or store, then no provenance data can be 
communicated, so the integer and the pointer loads *must* be 
nonequivalent (although loading an integer instead of a pointer would 
presumably be a pessimistic transformation).
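
The distinction can be sketched with a toy memory slot in C. This is only an illustrative model under assumed names (ModelPtr, ModelSlot, PROV_ANY, store_ptr, store_int, load_ptr are all hypothetical), not a description of how LLVM represents memory: a pointer store followed by a pointer load round-trips the provenance tag, while routing the value through an integer store does not.

```c
#include <assert.h>
#include <stdint.h>

enum { PROV_ANY = -1 };  /* coarse tag used when provenance is unknown */

typedef struct { uint64_t addr; int prov; } ModelPtr;
/* Toy memory slot: raw bits plus whatever provenance the store supplied. */
typedef struct { uint64_t bits; int prov; } ModelSlot;

/* A pointer store records the provenance alongside the bits. */
void store_ptr(ModelSlot *s, ModelPtr p) { s->bits = p.addr; s->prov = p.prov; }
/* An integer store carries no provenance to record. */
void store_int(ModelSlot *s, uint64_t v) { s->bits = v; s->prov = PROV_ANY; }
/* A pointer load recovers exactly what the last store left behind. */
ModelPtr load_ptr(const ModelSlot *s) { return (ModelPtr){ s->bits, s->prov }; }

void roundtrip_demo(void) {
    ModelPtr in = { 0x2000, 7 };
    ModelSlot slot;

    store_ptr(&slot, in);                 /* like @via_alloc */
    ModelPtr out = load_ptr(&slot);
    assert(out.addr == in.addr && out.prov == in.prov);

    store_int(&slot, in.addr);            /* same bits, stored as an integer */
    ModelPtr laundered = load_ptr(&slot);
    assert(laundered.addr == in.addr && laundered.prov == PROV_ANY);
}
```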

In short, pointers have pointery bits that aren't reflected in the 
bitvector representation an integer has. LLVM has some optimizations 
that assume that loads and stores only have bitvector manipulation 
semantics, while other optimizations (and most of the frontends) expect 
that loads and stores will preserve the pointery bits. And when these 
interact with each other, it's undoubtedly possible that the pointery 
bits get lost along the way.

-- 
Joshua Cranmer

Chris Lattner via llvm-dev

2021-Jun-06 04:26 UTC

[llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM

On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> On 4 Jun 2021, at 11:24, George Mitenkov wrote:
>> Hi all,
>>
>> Together with Nuno Lopes and Juneyoung Lee we propose to add a new byte
>> type to LLVM to fix miscompilations due to load type punning. Please see
>> the proposal below. It would be great to hear the
>> feedback/comments/suggestions!
>>
>>
>> Motivation
>> =========
>>
>> char and unsigned char are considered to be universal holders in C. They
>> can access raw memory and are used to implement memcpy. i8 is the LLVM’s
>> counterpart but it does not have such semantics, which is also not
>> desirable as it would disable many optimizations.
>
> I don’t believe this is correct. LLVM does not have an innate
> concept of typed memory. The type of a global or local allocation
> is just a roundabout way of giving it a size and default alignment,
> and similarly the type of a load or store just determines the width
> and default alignment of the access. There are no restrictions on
> what types can be used to load or store from certain objects.
>
> C-style type aliasing restrictions are imposed using `tbaa`
> metadata, which are unrelated to the IR type of the access.

I completely agree with John.  “i8” in LLVM doesn’t carry any implications about
aliasing (in fact, LLVM pointers are going towards being typeless).  Any such
thing occurs at the accesses, and is part of TBAA.

I’m opposed to adding a byte type to LLVM, as such semantics-carrying types are
entirely unprecedented, and would add tremendous complexity to the entire
system.

-Chris
