thr3ads.net - llvm dev - [llvm-dev] [RFC] Introducing a byte type to LLVM [Jun 2021]

John McCall via llvm-dev

2021-Jun-14 16:08 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

On 14 Jun 2021, at 7:04, Ralf Jung wrote:
> Hi,
>
>>>> I don't dispute that but I am still not understanding the need for
>>>> bytes. None of the examples I have seen so far
>>>> clearly made the point that it is the byte types that provide a
>>>> substantial benefit. The AA example below does neither.
>>>
>>> I hope
>>> <https://lists.llvm.org/pipermail/llvm-dev/2021-June/151110.html>
>>> makes a convincing case that under the current semantics, when one
>>> does an "i64" load of a value that was stored at pointer type, we
>>> have to say that this load returns poison. In particular, saying
>>> that this implicitly performs a "ptrtoint" is inconsistent with
>>> optimizations that are probably too important to be changed to
>>> accommodate this implicit "ptrtoint".
>>
>> I think it is in fact rather obvious that, if this optimization as
>> currently written is indeed in conflict with the current semantics,
>> it is the optimization that will have to give.  If the optimization
>> is too important for performance to give up entirely, we will simply
>> have to find some more restricted pattern that we can still soundly
>> optimize.
>
> That is certainly a reasonable approach.
> However, judging from how reluctant LLVM is to remove optimizations 
> that are much more convincingly wrong [1], my impression was that it 
> is easier to complicate the semantics than to remove an optimization 
> that LLVM already performs.
>
> [1]: https://bugs.llvm.org/show_bug.cgi?id=34548,
>      https://bugs.llvm.org/show_bug.cgi?id=35229;
>      see https://www.ralfj.de/blog/2020/12/14/provenance.html for a
>      more detailed explanation
>
>> Perhaps the clearest reason is that, if we did declare that integer 
>> types cannot carry pointers and so introduced byte types that could, 
>> C frontends would have to switch to byte types for their integer 
>> types, and so we would immediately lose this supposedly important 
>> optimization for C-like languages, and so, since optimizing C is very 
>> important, we would immediately need to find some restricted pattern 
>> under which we could soundly apply this optimization to byte types.  
>> That’s assuming that this optimization is actually significant, of 
>> course.
>
> At least C with strict aliasing enabled (i.e., standard C) only needs
> to use the byte type for "(un)signed char". The other integer types
> remain unaffected. There is no arithmetic on these types ("char +
> char" is subject to integer promotion), so the IR overhead would
> consist in a few "bytecast" instructions next to / replacing the
> existing sign extensions that convert "char" to "int" before
> performing the arithmetic.

The semantics you seem to want are that LLVM’s integer types cannot 
carry information from pointers.  But I can cast a pointer to an integer 
in C and vice-versa, and compilers have de facto defined the behavior of 
subsequent operations like breaking the integer up (and then putting it 
back together), adding numbers to it, and so on.  So no, as a C compiler 
writer, I do not have a choice; I will have to use a type that can 
validly carry pointer information for integers in C.
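
For readers who want the operations in question spelled out, here is a rough C sketch (not part of the original message). The helper names `split_and_rejoin` and `element_via_integer` are hypothetical, and the code assumes the implementation provides `uintptr_t` and round-trips pointers through it, which the C standard leaves implementation-defined but mainstream compilers support in practice.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical helper: converts a pointer to an integer, breaks the integer
 * into two halves, puts the halves back together, and converts the result
 * back to a pointer. The result of the final conversion is
 * implementation-defined in ISO C; mainstream compilers hand back a usable
 * pointer to the original object. */
int *split_and_rejoin(int *p) {
    const unsigned half = (unsigned)(sizeof(uintptr_t) * CHAR_BIT / 2);
    uintptr_t bits = (uintptr_t)p;
    uintptr_t low  = bits & (((uintptr_t)1 << half) - 1);
    uintptr_t high = bits >> half;
    uintptr_t rejoined = (high << half) | low;
    return (int *)rejoined;
}

/* Hypothetical helper: reaches element `index` of an int array by doing
 * integer arithmetic on the address instead of pointer arithmetic. */
int *element_via_integer(int *base, unsigned index) {
    uintptr_t addr = (uintptr_t)base + (uintptr_t)index * sizeof(int);
    return (int *)addr;
}
```

Calling `split_and_rejoin(&x)` and then storing through the result is the kind of code whose behavior compilers have de facto defined, so whatever IR type the frontend picks for `uintptr_t` has to keep it working.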

Since you seem to find this sort of thing compelling, please note that 
even a simple assignment like `char c2 = c1` technically promotes 
through `int` in C, and so `int` must be able to carry pointer 
information if `char` can.

John.

Cranmer, Joshua via llvm-dev

2021-Jun-14 20:29 UTC

In case anyone reading this thread is unaware, there is currently a proposal
before the C standards committee that goes into more detail on a future
provenance model for C:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2676.pdf. It actually details a
few different provenance models, especially several variations on
integers-do-not-have-provenance, mainly around what provenance is reconstructed
on an inttoptr cast. While I’m not on the standards committee, I believe that
something along the lines of this proposal will become the official C provenance
rules.

While LLVM isn’t required to adopt C rules wholesale, the basic form of the
proposal already matches applied LLVM semantics better than our own language
reference (we do not consider integers to carry provenance). There are several
points in the proposal where extra annotations are suggested to effect various
provenance properties (e.g., the ability to write word-level memcpy), and
offhand, it looks like the byte proposal would be a viable vehicle for those
extra annotations, although I’m not entirely persuaded that it’s the best
vehicle.
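
To make the word-level memcpy case concrete, here is a minimal C sketch. It is an illustration only: the `slot` union and the `word_copy` function are hypothetical names, not taken from n2676 or from the byte proposal, and the code assumes that `int *` and `uintptr_t` have the same size, as they do on mainstream targets.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical storage slot that can be viewed either as a pointer or as a
 * pointer-sized integer word. Assumes sizeof(int *) == sizeof(uintptr_t),
 * which holds on mainstream targets but is not guaranteed by the standard. */
typedef union {
    int      *ptr;
    uintptr_t word;
} slot;

/* Word-level copy: the data is moved as integers, never as pointers.
 * Whether the pointers later read back from `dst` may be dereferenced is
 * the question a provenance model has to answer. */
void word_copy(slot *dst, const slot *src, size_t count) {
    for (size_t i = 0; i < count; i++) {
        dst[i].word = src[i].word;
    }
}
```

If integers carry no provenance, the copy loop above loads and stores plain addresses, so some extra rule or annotation is needed for `dst[i].ptr` to remain a valid pointer; a byte-typed load and store would be one candidate way to express that in IR.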


Juneyoung Lee via llvm-dev

2021-Jun-15 05:49 UTC

On Tue, Jun 15, 2021 at 1:08 AM John McCall via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> The semantics you seem to want are that LLVM’s integer types cannot carry
> information from pointers. But I can cast a pointer to an integer in C and
> vice-versa, and compilers have de facto defined the behavior of subsequent
> operations like breaking the integer up (and then putting it back
> together), adding numbers to it, and so on. So no, as a C compiler writer,
> I do not have a choice; I will have to use a type that can validly carry
> pointer information for integers in C.
>

An int->ptr cast can reconstruct the pointer information, so making integer
types not carry pointer information does not necessarily mean that
dereferencing a pointer cast from an integer is UB.

For example, the definition of cast_ival_to_ptrval in the n2676 proposal
shows that a pointer's provenance is reconstructed from an integer.
(Whether n2676's cast_ival_to_ptrval can also be used for LLVM's inttoptr
semantics is a different question, though.)

> Since you seem to find this sort of thing compelling, please note that
> even a simple assignment like char c2 = c1 technically promotes through
> int in C, and so int must be able to carry pointer information if char
> can.
>

IIUC integer promotion is done when a value is used as an operand of an
arithmetic op or as a switch's condition, so I think the assignment operation
is okay.
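
The distinction can be checked by looking at expression types. The sketch below is only an illustration with hypothetical function names; it uses `sizeof`, which reports the size of an expression's type without evaluating the expression. It says nothing about how a frontend should lower either expression to IR.

```c
#include <stddef.h>

/* Both operands of + undergo integer promotion, so the sum has type int. */
size_t size_of_char_sum(void) {
    char c1 = 'a';
    return sizeof(c1 + c1);
}

/* A simple assignment has the type of its left operand, here char; the
 * right operand is converted to that type, not promoted to int. The
 * operand of sizeof is not evaluated, so c2 is never actually written. */
size_t size_of_char_assignment(void) {
    char c1 = 'a';
    char c2 = 0;
    return sizeof(c2 = c1);
}
```

On a typical target `size_of_char_sum()` equals `sizeof(int)` while `size_of_char_assignment()` equals `sizeof(char)`. Some compilers warn that the assignment inside `sizeof` has no effect, which is the intended behavior here.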

Juneyoung


-- 

Juneyoung Lee
Software Foundation Lab, Seoul National University

Ralf Jung via llvm-dev

2021-Jun-15 19:15 UTC

Hi,
> The semantics you seem to want are that LLVM’s integer types cannot carry
> information from pointers. But I can cast a pointer to an integer in C and
> vice-versa, and compilers have de facto defined the behavior of subsequent
> operations like breaking the integer up (and then putting it back together),
> adding numbers to it, and so on. So no, as a C compiler writer, I do not have a
> choice; I will have to use a type that can validly carry pointer information for
> integers in C.

Integers demonstrably do not carry provenance; see 
<https://www.ralfj.de/blog/2020/12/14/provenance.html> for a detailed 
explanation of why.
As a consequence of this, ptr-int-ptr roundtrips are lossy: some of the original
provenance information is lost. This means that optimizing away such roundtrips 
is incorrect, and indeed doing so leads to miscompilations 
(https://bugs.llvm.org/show_bug.cgi?id=34548).

The key difference between int and byte is that ptr-byte-ptr roundtrips are
*lossless*: all the provenance is preserved. This enables some extra optimizations
(such as removing these roundtrips -- which implicitly happens when a
redundant-store-after-load is removed), but also costs some optimizations (most
notably, "x == y" does not mean x and y are equal in all respects; their
provenance might still differ, so it is incorrect for GVN to replace one by the
other).
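
A pointer-level C sketch of "equal address, different provenance" may help here. It is a hedged illustration, not code from the thread or the linked bug reports: the function name is hypothetical, and whether the two arrays end up adjacent in memory is entirely up to the compiler. With a byte type, the same concern would apply to byte values that hold pointers.

```c
#include <stdint.h>

/* Hypothetical demo. `first` and `second` are separate objects; the
 * compiler decides whether `second` sits directly after `first`. */
int store_through_equal_address(void) {
    int first[1]  = {1};
    int second[1] = {2};
    int *past_first   = first + 1;  /* one past the end: may be formed, not dereferenced */
    int *start_second = second;

    if ((uintptr_t)past_first == (uintptr_t)start_second) {
        /* The addresses are equal, but only start_second may be used to
         * access `second`. Rewriting this store to go through past_first,
         * as an equality-based substitution would, turns it into an
         * out-of-bounds access. */
        *start_second = 3;
    }
    return second[0];
}
```

The function returns 2 or 3 depending on the layout the compiler chose; both outcomes are fine for the program as written. The point is that the equality test alone does not license swapping one value for the other.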

It's a classic tradeoff: we can *either* have lossless roundtrips *or* "x == y"
implies full equality of the abstract values. Having both together leads to
contradictions, which manifest as miscompilations. "byte" and "int" represent
the two possible choices here; therefore, by adding "byte", LLVM would close a
gap in the expressive power of its IR.

Kind regards,
Ralf
