thr3ads.net - llvm dev - [llvm-dev] [RFC] Introducing a byte type to LLVM [Jun 2021]

Harald van Dijk via llvm-dev

2021-Jun-07 17:39 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

Hi,

On 04/06/2021 16:24, George Mitenkov via llvm-dev wrote:
> 2.
>
>     When copying pointers as bytes implicit pointer-to-integer casts
>     are avoided. The current setting of performing a memcpy using i8s
>     leads to miscompilations (example: bug report 37469
>     <https://bugs.llvm.org/show_bug.cgi?id=37469>) and is bad for
>     alias analysis.

While trying to understand what exactly is going wrong in this bug I
think I stumbled upon incomplete documentation. At 
https://llvm.org/docs/LangRef.html#pointeraliasing it is documented that

  * A memory access must be done through a pointer associated with the
    address range.
  * A pointer value is associated with the addresses associated with any
    value it is based on.
  * A pointer value formed by an instruction is based on another pointer
    value if the instruction is getelementptr, bitcast, or inttoptr.

What is not mentioned is what a pointer value formed by a load 
instruction is based on. If a pointer value formed by a load instruction 
is not associated with any address range, then it cannot be used to load 
or store /any/ value. Clearly that is not wanted.

If memory is completely untyped, then a load of a pointer value needs to 
be equivalent to a load of an integer value followed by inttoptr. In 
that model, it is not possible to replace a load of a pointer value by a 
previously stored value, it is only possible to replace it by 
inttoptr(ptrtoint(previously stored value)). In that model, the problem 
is that LLVM does replace a load of a pointer value by a previously 
stored value, and the bug can be fixed (at a cost of reduced 
optimisations of other code) by making sure to insert ptrtoint and 
inttoptr as needed, including for any pointer members of structs etc.
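
A minimal Python sketch of the completely untyped model described above, as
an illustration only. The names Ptr, UntypedMemory, and WILDCARD are
hypothetical, and the provenance that inttoptr yields is modelled as a single
placeholder tag, which is an assumption rather than what LangRef specifies.

```python
from dataclasses import dataclass

# Placeholder for whatever provenance inttoptr produces (assumption).
WILDCARD = "any-exposed-object"

@dataclass(frozen=True)
class Ptr:
    addr: int   # numeric address
    prov: str   # object the pointer is "based on" in this toy model

def ptrtoint(p: Ptr) -> int:
    # Only the address survives; the provenance is dropped.
    return p.addr

def inttoptr(n: int) -> Ptr:
    # The result cannot recover the original provenance.
    return Ptr(n, WILDCARD)

class UntypedMemory:
    """Memory that holds plain integers only."""

    def __init__(self) -> None:
        self.cells = {}

    def store_ptr(self, slot: int, p: Ptr) -> None:
        # Storing a pointer is an implicit ptrtoint.
        self.cells[slot] = ptrtoint(p)

    def load_ptr(self, slot: int) -> Ptr:
        # Loading a pointer is a load of an integer followed by inttoptr.
        return inttoptr(self.cells[slot])

mem = UntypedMemory()
p = Ptr(addr=0x1000, prov="alloca %a")
mem.store_ptr(0, p)
loaded = mem.load_ptr(0)
round_trip = inttoptr(ptrtoint(p))
```

In this sketch the loaded pointer has the same address as p but does not
compare equal to it; it equals round_trip instead, which is why forwarding
the stored value directly is not a valid replacement in this model.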

Your proposal is based on an interpretation of the problem using a 
memory model where memory is not completely untyped. In your memory 
model, what address range is a pointer value that came from a load 
instruction associated with?

Cheers,
Harald van Dijk

Nicolai Hähnle via llvm-dev

2021-Jun-08 23:01 UTC

[llvm-dev] [RFC] Introducing a byte type to LLVM

Hi,

On Mon, Jun 7, 2021 at 7:40 PM Harald van Dijk via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On 04/06/2021 16:24, George Mitenkov via llvm-dev wrote:
>
>    2.
>
>    When copying pointers as bytes implicit pointer-to-integer casts are
>    avoided. The current setting of performing a memcpy using i8s leads to
>    miscompilations (example: bug report 37469
>    <https://bugs.llvm.org/show_bug.cgi?id=37469>) and is bad for alias
>    analysis.
>
> While trying to understand what exactly is going wrong in this bug I think
> I stumbled upon incomplete documentation. At
> https://llvm.org/docs/LangRef.html#pointeraliasing it is documented that
>
>    - A memory access must be done through a pointer associated with the
>    address range.
>    - A pointer value is associated with the addresses associated with any
>    value it is based on.
>    - A pointer value formed by an instruction is based on another pointer
>    value if the instruction is getelementptr, bitcast, or inttoptr.
>
> What is not mentioned is what a pointer value formed by a load instruction
> is based on. If a pointer value formed by a load instruction is not
> associated with any address range, then it cannot be used to load or store
> *any* value. Clearly that is not wanted.
>
> If memory is completely untyped, then a load of a pointer value needs to
> be equivalent to a load of an integer value followed by inttoptr. In that
> model, it is not possible to replace a load of a pointer value by a
> previously stored value, it is only possible to replace it by
> inttoptr(ptrtoint(previously stored value)). In that model, the problem is
> that LLVM does replace a load of a pointer value by a previously stored
> value, and the bug can be fixed (at a cost of reduced optimisations of
> other code) by making sure to insert ptrtoint and inttoptr as needed,
> including for any pointer members of structs etc.
>
> Your proposal is based on an interpretation of the problem using a memory
> model where memory is not completely untyped. In your memory model, what
> address range is a pointer value that came from a load instruction
> associated with?
>
I second this.

I feel like the entire discussion is extremely confused and confusing
because there is a lack of clarity about what values are in fact carried by
the types i<N>, ptr (assuming untyped pointers already), and the proposed
b<N>. There are statements about "whether a b<N> contains a pointer or an
integer" without a clear definition of what any of that really means.

By the way: the discussion of GVN states that i<N> do not carry provenance
information, but LangRef appears to contradict this, because it states (in
the Pointer Aliasing section) that "A pointer value formed by an inttoptr
is *based* on all pointer values that contribute (directly or indirectly)
to the computation of the pointer’s value." The only way to make this
statement operational is for the operand of the inttoptr to carry
provenance information; this definitely needs to be addressed by this
proposal.

My educated guess is that what the proposal really wants to say is that an
i<N> value is poison or an integer without pointer provenance (i.e., an
i<N> value is never "based on" a pointer) while a b<N> value is poison or
an integer that may have provenance (i.e., it may be "based on" one or more
pointers). Everything else should follow from this core definition. With
this phrasing, I can see a coherent picture emerging that can be discussed
without a lot of the confusion in this thread (though still _some_
confusion, as the topic is inherently subtle).
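
As a rough illustration of that core definition, here is a small Python
sketch of the two value domains. The class names IntVal and ByteVal, the use
of None for poison, and the choice that poison is based on nothing are all
assumptions made for the sketch, not part of the proposal.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class IntVal:
    """i<N>: poison or a plain integer, never 'based on' a pointer."""
    bits: Optional[int]  # None models poison (assumption)

    def based_on(self, obj: str) -> bool:
        # An i<N> value carries no provenance at all.
        return False

@dataclass(frozen=True)
class ByteVal:
    """b<N>: poison or an integer that may carry provenance."""
    bits: Optional[int]  # None models poison (assumption)
    prov: FrozenSet[str] = frozenset()

    def based_on(self, obj: str) -> bool:
        # Poison is treated as based on nothing (assumption).
        return self.bits is not None and obj in self.prov

i = IntVal(0x1000)
b = ByteVal(0x1000, frozenset({"%a"}))
plain = ByteVal(0x1000)  # a b<N> holding an ordinary integer
```

Here i and b hold the same bits, but only b can answer "yes" to being based
on %a; plain shows that a b<N> value may also have no provenance.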

@George, Nuno, and Juneyoung: is that a good lens through which to see the
proposal? If no, why not? If yes, I believe there are some immediate
implications and questions, such as:

1. Forbidding arithmetic and bitwise operations in b<N> seems pointless.
Just define them as the corresponding i<N> op plus the union of provenance
of the operands. This allows consistent implementation of char/unsigned
char as b8, without having to jump back and forth between b8 and i8 all the
time.

2. What's the provenance of a b<N> `select`?

3. A b<N>-to-i<N> conversion necessarily loses all provenance information.

4. What's the provenance of the result of an i<N>-to-b<N> conversion? The
only possible choices are "empty" or "full" provenance, where "full"
provenance means that the value is "based on" _all_ pointers. Depending on
the details of allowed instruction signatures, we may want to allow both
choices. "Full" provenance gives us something that's useful as a step
towards inttoptr. "Empty" provenance allows us to lift i<N> to b<N> for
arithmetic operations without being overly conservative (e.g., consider the
somewhat hypothetical problem of lowering a GEP to a sequence of arithmetic
on b<N> types in IR -- we want the result to have the provenance of the
base pointer, which enters the arithmetic as a b<N> with provenance; adding
the various i<N> offset operands should have no effect on provenance).

5. Memory is untyped but can carry provenance; i.e., it consists of b8.
Load/store of i<N> performs implicit b-to-i/i-to-b conversions,
respectively. Which kind of i-to-b conversion is used during a store? The
"empty" one or the "full" one? There are several reasonable options for how
to address this, and a choice must be made.

6. (How) are pointer types fundamentally different from b<N> types of the
correct size? (By this I mean: is there any interesting difference in the
values that these types can carry? Ignore surface differences like the fact
that GEP traditionally goes with pointers while `add` goes with integer
types -- we could have a GEP instruction on a correctly sized b<N>.)
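
Points 1, 3, and 4 above can be sketched as a toy Python model. This is only
an illustration under stated assumptions: the names B, b_add, b_to_i, i_to_b,
and lower_gep are hypothetical, poison is left out, and "full" provenance is
represented by a sentinel set rather than a real set of all pointers.

```python
from dataclasses import dataclass
from typing import FrozenSet

EMPTY: FrozenSet[str] = frozenset()
FULL: FrozenSet[str] = frozenset({"<all pointers>"})  # sentinel (assumption)

@dataclass(frozen=True)
class B:
    """A b<N> value: an integer plus the set of objects it is based on."""
    value: int
    prov: FrozenSet[str]

def b_add(x: B, y: B, width: int = 64) -> B:
    # Point 1: the i<N> operation plus the union of operand provenance.
    return B((x.value + y.value) % (1 << width), x.prov | y.prov)

def b_to_i(x: B) -> int:
    # Point 3: converting to i<N> loses all provenance.
    return x.value

def i_to_b(n: int, full: bool = False) -> B:
    # Point 4: the result has either "empty" or "full" provenance.
    return B(n, FULL if full else EMPTY)

def lower_gep(base: B, index: int, elem_size: int) -> B:
    # GEP lowered to b<N> arithmetic; the offset is lifted with "empty"
    # provenance so the result keeps exactly the base's provenance.
    return b_add(base, i_to_b(index * elem_size))

base = B(0x2000, frozenset({"%buf"}))
elem = lower_gep(base, index=3, elem_size=4)
conservative = b_add(base, i_to_b(12, full=True))
```

With the "empty" conversion, elem keeps only the provenance of %buf. Lifting
the same offset with the "full" conversion, as in conservative, widens the
provenance of the result, which is the overly conservative outcome point 4
wants to avoid.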

It seems pretty clear that there is still a lot of work to be done on this
proposal before we can be sufficiently confident that it won't just add new
correctness problems elsewhere. Besides, the cost of adding a new family of
types is very high, and I am not (yet?) convinced that it pays for itself.
But with the above I can at least see better where you're coming from.

Cheers,
Nicolai

-- 
Learn how the world really is,
but never forget how it should be.