thr3ads.net - llvm dev - [llvm-dev] [RFC] Introducing a byte type to LLVM [Jun 2021]

If this information is useful, please help other people find it:
Share via:

Ralf Jung via llvm-dev

2021-Jun-13 15:09 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

Hi all,

to add to what Nuno said, I came up with a series of optimizations which 
demonstrate that dead store elimination, integer "substitution" (GVN
replacing
'a' by 'b' after an 'a == b'), and provenance-based
alias analysis, are in
conflict with each other: the current LLVM semantics are inconsistent and can 
lead to miscompilations.

I posted them here as Rust code: 
<https://github.com/rust-lang/unsafe-code-guidelines/issues/286#issuecomment-860189806>.
A direct translation to C wouldn't be correct due to strict aliasing, but if
you
imagine this being LLVM IR (or C code compiled with -fno-strict-aliasing), then 
I hope this example demonstrates that *something* needs to give. This is not 
just a small bug somewhere, it is a case of the implicit assumptions made by 
different optimization passes / instructions being mutually incompatible.

Kind regards,
Ralf

On 06.06.21 20:26, Nuno Lopes wrote:> Let me try to explain the issue around typed memory. It’s orthogonal to
TBAA.
> Plus I agree the allocation type is irrelevant; it’s the type of the
load/store
> operations that matters.
> 
> Fact 1: LLVM needs an “universal holder” type in its IR.
> 
> Why?
> 
>   * Frontends load data from memory and don’t necessarily know the type
they are
>     loading. When you load a char in C, it can be part of a pointer or an
>     integer, for example. Clang has no way to know.
>   * That’s necessary to express implementations of memcpy in IR, as it
copies
>     raw data without knowing what the type is.
>   * LLVM lowers small memcpys into load/store.
> 
> I hope you agree LLVM needs a type that can contain any other type.
> 
> Fact 2: optimizations for this “universal holder” type need to be correct
for
> any LLVM IR type.
> 
> Since this type can hold anything, optimizations that apply to this type
must be
> correct for integers, pointers, etc. Otherwise, the results would be
incorrect
> if the data was cast to the “wrong” type.
> 
> The corollary of this fact is that GVN is not correct for the “universal
holder”
> type in general. That’s because GVN is not correct for pointers in general 
> (though correct in some cases). GVN doesn’t work for pointers since just
because
> 2 pointers compare equal, it doesn’t mean they have the same provenance. As
> LLVM’s AA uses provenance information, this information must be preserved.
> 
> Hopefully you agree with everything I wrote so far. If not please stop me
and ask.
> 
> Ok, so what’s the solution?
> 
> We have two candidates for this “universal holder” type:
> 
>   * integer (i/<x>/)
>   * byte (b/<x>/), a new dedicated type
> 
> The advantage of going with integers is that they already exist. The 
> disadvantage? To make LLVM correct, we would need to disable GVN in most
cases
> or stop using provenance information in AA (make pointers == integers).
> 
> This sounds pretty bad to me due to the expect performance hit.
> 
> Introducing a new type requires work, yes. But it allows us to keep all
integer
> optimizations as aggressive as we want, and only throttle down when
encountering
> the byte type. Another benefit is that loading a byte from memory doesn’t
escape
> any pointer, while with the first solution you need to consider all stored 
> pointers as escaped when loading an integer.
> 
> Introducing the byte type allow us to fix all the integer/pointer casts
bugs in
> LLVM IR. And it enables more optimizations as we won’t need to be careful 
> because of the LLVM IR bug.
> 
> A 3^rd option is to keep everything as-is. That is, keep wrong
optimizations in
> LLVM and keep adding hacks to limit the miscompilations in real-world
programs.
> 
> That has bitten us multiple times. Last time someone told me my
miscompilation
> example was academic, LLVM ended up miscompiling itself, and then
miscompiled
> Google’s code. Only this year we are planning to fully fix that bug (we
have a
> hack right now).
> 
> Last week Alive2 caught a miscompilation in the Linux kernel, in the
network
> stack. The optimization got the pointer arithmetic wrong. Pretty scary, and
the
> bug may have security implications.
> 
> Anyway, just to argue that leaving things as-is is not an option. We have 
> momentum and 3 GSoC students ready to help to fix the last few design bugs
in
> LLVM IR.
> 
> I know this stuff is complex; please ask questions if something is not
clear or
> if you disagree with anything I wrote above.
> 
> Thanks,
> 
> Nuno
> 
> *From:* Chris Lattner via llvm-dev
> *Sent:* 06 June 2021 05:27
> *To:* John McCall <rjmccall at apple.com>
> *Cc:* llvm-dev at lists.llvm.org; Ralf Jung <jung at mpi-sws.org>;
cfe-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM
> 
> On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <cfe-dev at
lists.llvm.org
> <mailto:cfe-dev at lists.llvm.org>> wrote:On 4 Jun 2021, at 11:24,
George Mitenkov
> wrote:
> 
>         Hi all,
> 
>         Together with Nuno Lopes and Juneyoung Lee we propose to add a new
byte
>         type to LLVM to fix miscompilations due to load type punning.
Please see
>         the proposal below. It would be great to hear the
>         feedback/comments/suggestions!
> 
> 
>         Motivation
>         =========> 
>         char and unsigned char are considered to be universal holders in C.
They
>         can access raw memory and are used to implement memcpy. i8 is the
LLVM’s
>         counterpart but it does not have such semantics, which is also not
>         desirable as it would disable many optimizations.
> 
>     I don’t believe this is correct. LLVM does not have an innate
>     concept of typed memory. The type of a global or local allocation
>     is just a roundabout way of giving it a size and default alignment,
>     and similarly the type of a load or store just determines the width
>     and default alignment of the access. There are no restrictions on
>     what types can be used to load or store from certain objects.
> 
>     C-style type aliasing restrictions are imposed using |tbaa|
>     metadata, which are unrelated to the IR type of the access.
> 
> I completely agree with John.  “i8” in LLVM doesn’t carry any implications
about
> aliasing (in fact, LLVM pointers are going towards being typeless).  Any
such
> thing occurs at the accesses, and are part of TBAA.
> 
> I’m opposed to adding a byte type to LLVM, as such semantic carrying types
are
> entirely unprecedented, and would add tremendous complexity to the entire
system.
> 
> -Chris
> 
-- 
Website: https://people.mpi-sws.org/~jung/

Nicolai Hähnle via llvm-dev

2021-Jun-14 06:17 UTC

head link

[llvm-dev] [RFC] Introducing a byte type to LLVM

Hi Ralf,

On Sun, Jun 13, 2021 at 5:09 PM Ralf Jung via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> to add to what Nuno said, I came up with a series of optimizations which
> demonstrate that dead store elimination, integer "substitution"
(GVN
> replacing
> 'a' by 'b' after an 'a == b'), and provenance-based
alias analysis, are in
> conflict with each other: the current LLVM semantics are inconsistent and
> can
> lead to miscompilations.
>
> I posted them here as Rust code:
> <
>
https://github.com/rust-lang/unsafe-code-guidelines/issues/286#issuecomment-860189806>.
>
> A direct translation to C wouldn't be correct due to strict aliasing,
but
> if you
> imagine this being LLVM IR (or C code compiled with -fno-strict-aliasing),
> then
> I hope this example demonstrates that *something* needs to give. This is
> not
> just a small bug somewhere, it is a case of the implicit assumptions made
> by
> different optimization passes / instructions being mutually incompatible.
>
Agreed. Reading your example as LLVM IR, I think we have a choice to make:
either memory is typed or it isn't. By "typed" I mean that an
integer load
following a pointer store or vice versa returns poison or is immediate UB.

If memory is typed, then your original example program has UB in at least
two places:

1. A pointer stored via storage_ptr is loaded via storage_usize, used in a
comparison, and the comparison result is branched on. This would be UB.
2. An integer is stored via storage_usize and then loaded via storage_ptr
(pointer load) and then dereferenced. This would also be UB.

If memory is untyped and you want the original program to have defined
behavior, my first thought is that the resolution that makes the most sense
is to say that removing the store to *storage_usize is incorrect because
that store is not truly redundant: it changes the pointer provenance that
is associated with the underlying memory. The other optimizations in that
chain of optimizations seem of higher value (keeping in mind that *most*
dead store elimination is still correct, it's just this particular narrow
case which isn't).

Cheers,
Nicolai


> Kind regards,
> Ralf
>
> On 06.06.21 20:26, Nuno Lopes wrote:
> > Let me try to explain the issue around typed memory. It’s orthogonal
to
> TBAA.
> > Plus I agree the allocation type is irrelevant; it’s the type of the
> load/store
> > operations that matters.
> >
> > Fact 1: LLVM needs an “universal holder” type in its IR.
> >
> > Why?
> >
> >   * Frontends load data from memory and don’t necessarily know the
type
> they are
> >     loading. When you load a char in C, it can be part of a pointer or
an
> >     integer, for example. Clang has no way to know.
> >   * That’s necessary to express implementations of memcpy in IR, as it
> copies
> >     raw data without knowing what the type is.
> >   * LLVM lowers small memcpys into load/store.
> >
> > I hope you agree LLVM needs a type that can contain any other type.
> >
> > Fact 2: optimizations for this “universal holder” type need to be
> correct for
> > any LLVM IR type.
> >
> > Since this type can hold anything, optimizations that apply to this
type
> must be
> > correct for integers, pointers, etc. Otherwise, the results would be
> incorrect
> > if the data was cast to the “wrong” type.
> >
> > The corollary of this fact is that GVN is not correct for the
“universal
> holder”
> > type in general. That’s because GVN is not correct for pointers in
> general
> > (though correct in some cases). GVN doesn’t work for pointers since
just
> because
> > 2 pointers compare equal, it doesn’t mean they have the same
provenance.
> As
> > LLVM’s AA uses provenance information, this information must be
> preserved.
> >
> > Hopefully you agree with everything I wrote so far. If not please stop
> me and ask.
> >
> > Ok, so what’s the solution?
> >
> > We have two candidates for this “universal holder” type:
> >
> >   * integer (i/<x>/)
> >   * byte (b/<x>/), a new dedicated type
> >
> > The advantage of going with integers is that they already exist. The
> > disadvantage? To make LLVM correct, we would need to disable GVN in
most
> cases
> > or stop using provenance information in AA (make pointers ==
integers).
> >
> > This sounds pretty bad to me due to the expect performance hit.
> >
> > Introducing a new type requires work, yes. But it allows us to keep
all
> integer
> > optimizations as aggressive as we want, and only throttle down when
> encountering
> > the byte type. Another benefit is that loading a byte from memory
> doesn’t escape
> > any pointer, while with the first solution you need to consider all
> stored
> > pointers as escaped when loading an integer.
> >
> > Introducing the byte type allow us to fix all the integer/pointer
casts
> bugs in
> > LLVM IR. And it enables more optimizations as we won’t need to be
> careful
> > because of the LLVM IR bug.
> >
> > A 3^rd option is to keep everything as-is. That is, keep wrong
> optimizations in
> > LLVM and keep adding hacks to limit the miscompilations in real-world
> programs.
> >
> > That has bitten us multiple times. Last time someone told me my
> miscompilation
> > example was academic, LLVM ended up miscompiling itself, and then
> miscompiled
> > Google’s code. Only this year we are planning to fully fix that bug
(we
> have a
> > hack right now).
> >
> > Last week Alive2 caught a miscompilation in the Linux kernel, in the
> network
> > stack. The optimization got the pointer arithmetic wrong. Pretty
scary,
> and the
> > bug may have security implications.
> >
> > Anyway, just to argue that leaving things as-is is not an option. We
> have
> > momentum and 3 GSoC students ready to help to fix the last few design
> bugs in
> > LLVM IR.
> >
> > I know this stuff is complex; please ask questions if something is not
> clear or
> > if you disagree with anything I wrote above.
> >
> > Thanks,
> >
> > Nuno
> >
> > *From:* Chris Lattner via llvm-dev
> > *Sent:* 06 June 2021 05:27
> > *To:* John McCall <rjmccall at apple.com>
> > *Cc:* llvm-dev at lists.llvm.org; Ralf Jung <jung at
mpi-sws.org>;
> cfe-dev at lists.llvm.org
> > *Subject:* Re: [llvm-dev] [cfe-dev] [RFC] Introducing a byte type to
LLVM
> >
> > On Jun 4, 2021, at 11:25 AM, John McCall via cfe-dev <
> cfe-dev at lists.llvm.org
> > <mailto:cfe-dev at lists.llvm.org>> wrote:On 4 Jun 2021, at
11:24, George
> Mitenkov
> > wrote:
> >
> >         Hi all,
> >
> >         Together with Nuno Lopes and Juneyoung Lee we propose to add a
> new byte
> >         type to LLVM to fix miscompilations due to load type punning.
> Please see
> >         the proposal below. It would be great to hear the
> >         feedback/comments/suggestions!
> >
> >
> >         Motivation
> >         =========> >
> >         char and unsigned char are considered to be universal holders
in
> C. They
> >         can access raw memory and are used to implement memcpy. i8 is
> the LLVM’s
> >         counterpart but it does not have such semantics, which is also
> not
> >         desirable as it would disable many optimizations.
> >
> >     I don’t believe this is correct. LLVM does not have an innate
> >     concept of typed memory. The type of a global or local allocation
> >     is just a roundabout way of giving it a size and default
alignment,
> >     and similarly the type of a load or store just determines the
width
> >     and default alignment of the access. There are no restrictions on
> >     what types can be used to load or store from certain objects.
> >
> >     C-style type aliasing restrictions are imposed using |tbaa|
> >     metadata, which are unrelated to the IR type of the access.
> >
> > I completely agree with John.  “i8” in LLVM doesn’t carry any
> implications about
> > aliasing (in fact, LLVM pointers are going towards being typeless). 
Any
> such
> > thing occurs at the accesses, and are part of TBAA.
> >
> > I’m opposed to adding a byte type to LLVM, as such semantic carrying
> types are
> > entirely unprecedented, and would add tremendous complexity to the
> entire system.
> >
> > -Chris
> >
>
> --
> Website: https://people.mpi-sws.org/~jung/
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

-- 
Lerne, wie die Welt wirklich ist,
aber vergiss niemals, wie sie sein sollte.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210614/c318703c/attachment.html>

llvm dev - Jun 2021 - [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM