thr3ads.net - llvm dev - [llvm-dev] [RFC] Introducing a byte type to LLVM [Jun 2021]

If this information is useful, please help other people find it:
Share via:

Nicolai Hähnle via llvm-dev

2021-Jun-08 23:01 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

Hi,

On Mon, Jun 7, 2021 at 7:40 PM Harald van Dijk via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On 04/06/2021 16:24, George Mitenkov via llvm-dev wrote:
>
>
>    1.
>
>    When copying pointers as bytes implicit pointer-to-integer casts are
>    avoided.The current setting of performing a memcpy using i8s leads to
>    miscompilations (example: bug report 37469
>    <https://bugs.llvm.org/show_bug.cgi?id=37469>) and is bad for
alias
>    analysis.
>
> While trying to understand what exactly is going wrong in this bug I think
> I stumbled upon incomplete documentation. At
> https://llvm.org/docs/LangRef.html#pointeraliasing it is documented that
>
>    - A memory access must be done through a pointer associated with the
>    address range.
>    - A pointer value is associated with the addresses associated with any
>    value it is based on.
>    - A pointer value formed by an instruction is based on another pointer
>    value the instruction is getelementptr, bitcast, or inttoptr.
>
> What is not mentioned is what a pointer value formed by a load instruction
> is based on. If a pointer value formed by a load instruction is not
> associated with any address range, then it cannot be used to load or store
> *any* value. Clearly that is not wanted.
>
> If memory is completely untyped, then a load of a pointer value needs to
> be equivalent to a load of an integer value followed by inttoptr. In that
> model, it is not possible to replace a load of a pointer value by a
> previously stored value, it is only possible to replace it by
> inttoptr(ptrtoint(previously stored value)). In that model, the problem is
> that LLVM does replace a load of a pointer value by a previously stored
> value, and the bug can be fixed (at a cost of reduced optimisations of
> other code) by making sure to insert ptrtoint and inttoptr as needed,
> including for any pointer members of structs etc.
>
> Your proposal is based on an interpretation of the problem using a memory
> model where memory is not completely untyped. In your memory model, what
> address range is a pointer value that came from a load instruction
> associated with?
>
I second this.

I feel like the entire discussion is extremely confused and confusing
because there is lack of clarity about what values are in fact carried by
the types i<N>, ptr (assuming untyped pointers already), and the proposed
b<N>. There are statements about "whether a b<N> contains a
pointer or an
integer" without a clear definition of what any of that really means.

By the way: the discussion of GVN states that i<N> do not carry provenance
information, but LangRef appears to contradict this, because it states (in
the Pointer Aliasing section) that "A pointer value formed by an inttoptr
is *based* on all pointer values that contribute (directly or indirectly)
to the computation of the pointer’s value." The only way to make this
statement operational is for the operand of the inttoptr to carry
provenance information; this definitely needs to be addressed by this
proposal.

My educated guess is that what the proposal really wants to say is that an
i<N> value is poison or an integer without pointer provenance (i.e., an
i<N> value is never "based on" a pointer) while a b<N>
value is poison or
an integer that may have provenance (i.e., it may be "based on" one or
more
pointers). Everything else should follow from this core definition. With
this phrasing, I can see a coherent picture emerging that can be discussed
without a lot of the confusion in this thread (though still _some_
confusion, as the topic is inherently subtle).

@George, Nuno, and Juneyoung: is that a good lens through which to see the
proposal? If no, why not? If yes, I believe there are some immediate
implications and questions, such as:

1. Forbidding arithmetic and bitwise operations in b<N> seems pointless.
Just define them as the corresponding i<N> op plus the union of provenance
of the operands. This allows consistent implementation of char/unsigned
char as b8, without having to jump back and forth between b8 and i8 all the
time.

2. What's the provenance of a b<N> `select`?

3. A b<N>-to-i<N> conversion necessarily loses all provenance
information.

4. What's the provenance of the result of an i<N>-to-b<N>
conversion? The
only possible choices are "empty" or "full" provenance,
where "full"
provenance means that the value is "based on" _all_ pointers.
Depending on
the details of allowed instruction signatures, we may want to allow both
choices. "Full" provenance gives us something that's useful as a
step
towards inttoptr. "Empty" provenance allows us to lift i<N> to
b<N> for
arithmetic operations without being overly conservative (e.g., consider the
somewhat hypothetical problem of lowering a GEP to a sequence of arithmetic
on b<N> types in IR -- we want the result to have the provenance of the
base pointer, which enters the arithmetic as a b<N> with provenance;
adding
the various i<N> offset operands should have no effect on provenance).

5. Memory is untyped but can carry provenance; i.e., it consists of b8.
Load/store of i<N> performs implicit i-to-b/b-to-i conversions,
respectively. Which kind of i-to-b conversion is used during a store? The
"empty" one or the "full" one? There are several reasonable
options for how
to address this, and a choice must be made.

6. (How) are pointer types fundamentally different from b<N> types of the
correct size? (By this I mean: is there any interesting difference in the
values that these types can carry? Ignore surface differences like the fact
that GEP traditionally goes with pointers while `add` goes with integer
types -- we could have a GEP instruction on a correctly sized b<N>)

It seems pretty clear that there is still a lot of work to be done on this
proposal before we can be sufficiently confident that it won't just add new
correctness problems elsewhere. Besides, the cost of adding a new family of
types is very high, and I am not (yet?) convinced that it pays for itself.
But with the above I can at least see better where you're coming from.

Cheers,
Nicolai

Cheers,> Harald van Dijk
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

-- 
Lerne, wie die Welt wirklich ist,
aber vergiss niemals, wie sie sein sollte.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210609/27f378b4/attachment.html>

James Courtier-Dutton via llvm-dev

2021-Jun-08 23:36 UTC

head link

[llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM

On Wed, 9 Jun 2021 at 00:02, Nicolai Hähnle via cfe-dev
<cfe-dev at lists.llvm.org> wrote:> Hi,
> ...
> On Mon, Jun 7, 2021 at 7:40 PM Harald van Dijk via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
>
> I second this.
>
> I feel like the entire discussion is extremely confused and confusing
because there is lack of clarity about what values are in fact carried by the
types i<N>, ptr (assuming untyped pointers already), and the proposed
b<N>. There are statements about "whether a b<N> contains a
pointer or an integer" without a clear definition of what any of that
really means.
>
+1 on the confused and confusing point. :-)

Juneyoung Lee via llvm-dev

2021-Jun-09 03:28 UTC

head link

[llvm-dev] [RFC] Introducing a byte type to LLVM

Hi Nicolai,

> I feel like the entire discussion is extremely confused and confusing
> because there is lack of clarity about what values are in fact carried by
> the types i<N>, ptr (assuming untyped pointers already), and the
proposed
> b<N>. There are statements about "whether a b<N> contains
a pointer or an
> integer" without a clear definition of what any of that really means.
> By the way: the discussion of GVN states that i<N> do not carry
provenance
> information, but LangRef appears to contradict this, because it states (in
> the Pointer Aliasing section) that "A pointer value formed by an
inttoptr
> is *based* on all pointer values that contribute (directly or indirectly)
> to the computation of the pointer’s value." The only way to make this
> statement operational is for the operand of the inttoptr to carry
> provenance information; this definitely needs to be addressed by this
> proposal.
> My educated guess is that what the proposal really wants to say is that an
> i<N> value is poison or an integer without pointer provenance (i.e.,
an
> i<N> value is never "based on" a pointer) while a
b<N> value is poison or
> an integer that may have provenance (i.e., it may be "based on"
one or more
> pointers). Everything else should follow from this core definition. With
> this phrasing, I can see a coherent picture emerging that can be discussed
> without a lot of the confusion in this thread (though still _some_
> confusion, as the topic is inherently subtle).
> @George, Nuno, and Juneyoung: is that a good lens through which to see the
> proposal?

Yes, our understanding is that only values of pointer types carry
provenance whereas non-pointer types (i<N>, float, ...) don't carry
provenance.
We believe that when people were writing optimizations on non-pointer types
(instcombine, reassociate, gvn, etc) they did not assume the existence of
provenance in such types.
As mentioned, it contradicts with LangRef's wording about inttoptr though.

I agree that clarifying these is a good first step to avoid confusion.
Specifically, via pinning down what people agree on in LangRef.
I think it is good to start with removing the inttoptr case from LangRef's
based-on part and adding relevant texts.
IMO, existing alias analysis and optimizations were already avoiding uses
of inttoptr's based-on information (experts' comments needed to double
check).
Also, to provide alternatives for ptrtoint/inttoptr round trip with better
analysis, llvm.ptrmask and clang-tidy's performance-no-int-to-ptr were
added as well.

If people agree, I'll write a draft for this and share it via phabricator.

Thanks,
Juneyoung

On Wed, Jun 9, 2021 at 8:02 AM Nicolai Hähnle via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hi,
>
> On Mon, Jun 7, 2021 at 7:40 PM Harald van Dijk via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> On 04/06/2021 16:24, George Mitenkov via llvm-dev wrote:
>>
>>
>>    1.
>>
>>    When copying pointers as bytes implicit pointer-to-integer casts are
>>    avoided.The current setting of performing a memcpy using i8s leads
to
>>    miscompilations (example: bug report 37469
>>    <https://bugs.llvm.org/show_bug.cgi?id=37469>) and is bad for
alias
>>    analysis.
>>
>> While trying to understand what exactly is going wrong in this bug I
>> think I stumbled upon incomplete documentation. At
>> https://llvm.org/docs/LangRef.html#pointeraliasing it is documented
that
>>
>>    - A memory access must be done through a pointer associated with the
>>    address range.
>>    - A pointer value is associated with the addresses associated with
>>    any value it is based on.
>>    - A pointer value formed by an instruction is based on another
>>    pointer value the instruction is getelementptr, bitcast, or
inttoptr.
>>
>> What is not mentioned is what a pointer value formed by a load
>> instruction is based on. If a pointer value formed by a load
instruction is
>> not associated with any address range, then it cannot be used to load
or
>> store *any* value. Clearly that is not wanted.
>>
>> If memory is completely untyped, then a load of a pointer value needs
to
>> be equivalent to a load of an integer value followed by inttoptr. In
that
>> model, it is not possible to replace a load of a pointer value by a
>> previously stored value, it is only possible to replace it by
>> inttoptr(ptrtoint(previously stored value)). In that model, the problem
is
>> that LLVM does replace a load of a pointer value by a previously stored
>> value, and the bug can be fixed (at a cost of reduced optimisations of
>> other code) by making sure to insert ptrtoint and inttoptr as needed,
>> including for any pointer members of structs etc.
>>
>> Your proposal is based on an interpretation of the problem using a
memory
>> model where memory is not completely untyped. In your memory model,
what
>> address range is a pointer value that came from a load instruction
>> associated with?
>>
>
> I second this.
>
> I feel like the entire discussion is extremely confused and confusing
> because there is lack of clarity about what values are in fact carried by
> the types i<N>, ptr (assuming untyped pointers already), and the
proposed
> b<N>. There are statements about "whether a b<N> contains
a pointer or an
> integer" without a clear definition of what any of that really means.
>
> By the way: the discussion of GVN states that i<N> do not carry
provenance
> information, but LangRef appears to contradict this, because it states (in
> the Pointer Aliasing section) that "A pointer value formed by an
inttoptr
> is *based* on all pointer values that contribute (directly or indirectly)
> to the computation of the pointer’s value." The only way to make this
> statement operational is for the operand of the inttoptr to carry
> provenance information; this definitely needs to be addressed by this
> proposal.
>
> My educated guess is that what the proposal really wants to say is that an
> i<N> value is poison or an integer without pointer provenance (i.e.,
an
> i<N> value is never "based on" a pointer) while a
b<N> value is poison or
> an integer that may have provenance (i.e., it may be "based on"
one or more
> pointers). Everything else should follow from this core definition. With
> this phrasing, I can see a coherent picture emerging that can be discussed
> without a lot of the confusion in this thread (though still _some_
> confusion, as the topic is inherently subtle).
>
> @George, Nuno, and Juneyoung: is that a good lens through which to see the
> proposal? If no, why not? If yes, I believe there are some immediate
> implications and questions, such as:
>
> 1. Forbidding arithmetic and bitwise operations in b<N> seems
pointless.
> Just define them as the corresponding i<N> op plus the union of
provenance
> of the operands. This allows consistent implementation of char/unsigned
> char as b8, without having to jump back and forth between b8 and i8 all the
> time.
>
> 2. What's the provenance of a b<N> `select`?
>
> 3. A b<N>-to-i<N> conversion necessarily loses all provenance
information.
>
> 4. What's the provenance of the result of an i<N>-to-b<N>
conversion? The
> only possible choices are "empty" or "full" provenance,
where "full"
> provenance means that the value is "based on" _all_ pointers.
Depending on
> the details of allowed instruction signatures, we may want to allow both
> choices. "Full" provenance gives us something that's useful
as a step
> towards inttoptr. "Empty" provenance allows us to lift i<N>
to b<N> for
> arithmetic operations without being overly conservative (e.g., consider the
> somewhat hypothetical problem of lowering a GEP to a sequence of arithmetic
> on b<N> types in IR -- we want the result to have the provenance of
the
> base pointer, which enters the arithmetic as a b<N> with provenance;
adding
> the various i<N> offset operands should have no effect on
provenance).
>
> 5. Memory is untyped but can carry provenance; i.e., it consists of b8.
> Load/store of i<N> performs implicit i-to-b/b-to-i conversions,
> respectively. Which kind of i-to-b conversion is used during a store? The
> "empty" one or the "full" one? There are several
reasonable options for how
> to address this, and a choice must be made.
>
> 6. (How) are pointer types fundamentally different from b<N> types of
the
> correct size? (By this I mean: is there any interesting difference in the
> values that these types can carry? Ignore surface differences like the fact
> that GEP traditionally goes with pointers while `add` goes with integer
> types -- we could have a GEP instruction on a correctly sized b<N>)
>
> It seems pretty clear that there is still a lot of work to be done on this
> proposal before we can be sufficiently confident that it won't just add
new
> correctness problems elsewhere. Besides, the cost of adding a new family of
> types is very high, and I am not (yet?) convinced that it pays for itself.
> But with the above I can at least see better where you're coming from.
>
> Cheers,
> Nicolai
>
>
> Cheers,
>> Harald van Dijk
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
>
> --
> Lerne, wie die Welt wirklich ist,
> aber vergiss niemals, wie sie sein sollte.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

-- 

Juneyoung Lee
Software Foundation Lab, Seoul National University
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210609/c5382219/attachment.html>

Ralf Jung via llvm-dev

2021-Jun-13 15:22 UTC

head link

[llvm-dev] [RFC] Introducing a byte type to LLVM

Hi,
> 1. Forbidding arithmetic and bitwise operations in b<N> seems
pointless. Just
> define them as the corresponding i<N> op plus the union of provenance
of the
> operands. This allows consistent implementation of char/unsigned char as
b8,
> without having to jump back and forth between b8 and i8 all the time.
FWIW, "char" addition happens at "int" type due to integer
promotion. So there
is no problem with back and forth here.

"Union" of provenance is currently not an operation that is required
to model
LLVM IR, so your proposal would necessitate adding such a concept. It'll be 
interesting to figure out how "getelementptr inbounds" behaves on 
multi-provenance pointers...
> 6. (How) are pointer types fundamentally different from b<N> types of
the
> correct size? (By this I mean: is there any interesting difference in the
values
> that these types can carry? Ignore surface differences like the fact that
GEP
> traditionally goes with pointers while `add` goes with integer types -- we
could
> have a GEP instruction on a correctly sized b<N>)
I'm not saying I have the answer here, but one possible difference might
arise
with "mixing bytes from different pointers". Say we are storing
pointer "ptr1"
directly followed by "ptr2" on a 64bit machine, and now we are doing
an
(unalinged) 8-byte load covering the last 4 bytes of ptr1 and the first 4 bytes 
of ptr2. This is certainly a valid value for b64. Is it also a valid value at 
pointer type, and if yes, which provenance does it have?

Kind regards,
Ralf

llvm dev - Jun 2021 - [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM

[llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM

[llvm-dev] [RFC] Introducing a byte type to LLVM