Nicolai Hähnle via llvm-dev
2021-Jun-08 23:01 UTC
[llvm-dev] [RFC] Introducing a byte type to LLVM
Hi, On Mon, Jun 7, 2021 at 7:40 PM Harald van Dijk via llvm-dev < llvm-dev at lists.llvm.org> wrote:> On 04/06/2021 16:24, George Mitenkov via llvm-dev wrote: > > > 1. > > When copying pointers as bytes implicit pointer-to-integer casts are > avoided.The current setting of performing a memcpy using i8s leads to > miscompilations (example: bug report 37469 > <https://bugs.llvm.org/show_bug.cgi?id=37469>) and is bad for alias > analysis. > > While trying to understand what exactly is going wrong in this bug I think > I stumbled upon incomplete documentation. At > https://llvm.org/docs/LangRef.html#pointeraliasing it is documented that > > - A memory access must be done through a pointer associated with the > address range. > - A pointer value is associated with the addresses associated with any > value it is based on. > - A pointer value formed by an instruction is based on another pointer > value the instruction is getelementptr, bitcast, or inttoptr. > > What is not mentioned is what a pointer value formed by a load instruction > is based on. If a pointer value formed by a load instruction is not > associated with any address range, then it cannot be used to load or store > *any* value. Clearly that is not wanted. > > If memory is completely untyped, then a load of a pointer value needs to > be equivalent to a load of an integer value followed by inttoptr. In that > model, it is not possible to replace a load of a pointer value by a > previously stored value, it is only possible to replace it by > inttoptr(ptrtoint(previously stored value)). In that model, the problem is > that LLVM does replace a load of a pointer value by a previously stored > value, and the bug can be fixed (at a cost of reduced optimisations of > other code) by making sure to insert ptrtoint and inttoptr as needed, > including for any pointer members of structs etc. > > Your proposal is based on an interpretation of the problem using a memory > model where memory is not completely untyped. In your memory model, what > address range is a pointer value that came from a load instruction > associated with? >I second this. I feel like the entire discussion is extremely confused and confusing because there is lack of clarity about what values are in fact carried by the types i<N>, ptr (assuming untyped pointers already), and the proposed b<N>. There are statements about "whether a b<N> contains a pointer or an integer" without a clear definition of what any of that really means. By the way: the discussion of GVN states that i<N> do not carry provenance information, but LangRef appears to contradict this, because it states (in the Pointer Aliasing section) that "A pointer value formed by an inttoptr is *based* on all pointer values that contribute (directly or indirectly) to the computation of the pointer’s value." The only way to make this statement operational is for the operand of the inttoptr to carry provenance information; this definitely needs to be addressed by this proposal. My educated guess is that what the proposal really wants to say is that an i<N> value is poison or an integer without pointer provenance (i.e., an i<N> value is never "based on" a pointer) while a b<N> value is poison or an integer that may have provenance (i.e., it may be "based on" one or more pointers). Everything else should follow from this core definition. With this phrasing, I can see a coherent picture emerging that can be discussed without a lot of the confusion in this thread (though still _some_ confusion, as the topic is inherently subtle). @George, Nuno, and Juneyoung: is that a good lens through which to see the proposal? If no, why not? If yes, I believe there are some immediate implications and questions, such as: 1. Forbidding arithmetic and bitwise operations in b<N> seems pointless. Just define them as the corresponding i<N> op plus the union of provenance of the operands. This allows consistent implementation of char/unsigned char as b8, without having to jump back and forth between b8 and i8 all the time. 2. What's the provenance of a b<N> `select`? 3. A b<N>-to-i<N> conversion necessarily loses all provenance information. 4. What's the provenance of the result of an i<N>-to-b<N> conversion? The only possible choices are "empty" or "full" provenance, where "full" provenance means that the value is "based on" _all_ pointers. Depending on the details of allowed instruction signatures, we may want to allow both choices. "Full" provenance gives us something that's useful as a step towards inttoptr. "Empty" provenance allows us to lift i<N> to b<N> for arithmetic operations without being overly conservative (e.g., consider the somewhat hypothetical problem of lowering a GEP to a sequence of arithmetic on b<N> types in IR -- we want the result to have the provenance of the base pointer, which enters the arithmetic as a b<N> with provenance; adding the various i<N> offset operands should have no effect on provenance). 5. Memory is untyped but can carry provenance; i.e., it consists of b8. Load/store of i<N> performs implicit i-to-b/b-to-i conversions, respectively. Which kind of i-to-b conversion is used during a store? The "empty" one or the "full" one? There are several reasonable options for how to address this, and a choice must be made. 6. (How) are pointer types fundamentally different from b<N> types of the correct size? (By this I mean: is there any interesting difference in the values that these types can carry? Ignore surface differences like the fact that GEP traditionally goes with pointers while `add` goes with integer types -- we could have a GEP instruction on a correctly sized b<N>) It seems pretty clear that there is still a lot of work to be done on this proposal before we can be sufficiently confident that it won't just add new correctness problems elsewhere. Besides, the cost of adding a new family of types is very high, and I am not (yet?) convinced that it pays for itself. But with the above I can at least see better where you're coming from. Cheers, Nicolai Cheers,> Harald van Dijk > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-- Lerne, wie die Welt wirklich ist, aber vergiss niemals, wie sie sein sollte. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210609/27f378b4/attachment.html>
James Courtier-Dutton via llvm-dev
2021-Jun-08 23:36 UTC
[llvm-dev] [cfe-dev] [RFC] Introducing a byte type to LLVM
On Wed, 9 Jun 2021 at 00:02, Nicolai Hähnle via cfe-dev <cfe-dev at lists.llvm.org> wrote:> Hi, > ... > On Mon, Jun 7, 2021 at 7:40 PM Harald van Dijk via llvm-dev <llvm-dev at lists.llvm.org> wrote: > > I second this. > > I feel like the entire discussion is extremely confused and confusing because there is lack of clarity about what values are in fact carried by the types i<N>, ptr (assuming untyped pointers already), and the proposed b<N>. There are statements about "whether a b<N> contains a pointer or an integer" without a clear definition of what any of that really means. >+1 on the confused and confusing point. :-)
Juneyoung Lee via llvm-dev
2021-Jun-09 03:28 UTC
[llvm-dev] [RFC] Introducing a byte type to LLVM
Hi Nicolai,> I feel like the entire discussion is extremely confused and confusing > because there is lack of clarity about what values are in fact carried by > the types i<N>, ptr (assuming untyped pointers already), and the proposed > b<N>. There are statements about "whether a b<N> contains a pointer or an > integer" without a clear definition of what any of that really means.> By the way: the discussion of GVN states that i<N> do not carry provenance > information, but LangRef appears to contradict this, because it states (in > the Pointer Aliasing section) that "A pointer value formed by an inttoptr > is *based* on all pointer values that contribute (directly or indirectly) > to the computation of the pointer’s value." The only way to make this > statement operational is for the operand of the inttoptr to carry > provenance information; this definitely needs to be addressed by this > proposal.> My educated guess is that what the proposal really wants to say is that an > i<N> value is poison or an integer without pointer provenance (i.e., an > i<N> value is never "based on" a pointer) while a b<N> value is poison or > an integer that may have provenance (i.e., it may be "based on" one or more > pointers). Everything else should follow from this core definition. With > this phrasing, I can see a coherent picture emerging that can be discussed > without a lot of the confusion in this thread (though still _some_ > confusion, as the topic is inherently subtle).> @George, Nuno, and Juneyoung: is that a good lens through which to see the > proposal?Yes, our understanding is that only values of pointer types carry provenance whereas non-pointer types (i<N>, float, ...) don't carry provenance. We believe that when people were writing optimizations on non-pointer types (instcombine, reassociate, gvn, etc) they did not assume the existence of provenance in such types. As mentioned, it contradicts with LangRef's wording about inttoptr though. I agree that clarifying these is a good first step to avoid confusion. Specifically, via pinning down what people agree on in LangRef. I think it is good to start with removing the inttoptr case from LangRef's based-on part and adding relevant texts. IMO, existing alias analysis and optimizations were already avoiding uses of inttoptr's based-on information (experts' comments needed to double check). Also, to provide alternatives for ptrtoint/inttoptr round trip with better analysis, llvm.ptrmask and clang-tidy's performance-no-int-to-ptr were added as well. If people agree, I'll write a draft for this and share it via phabricator. Thanks, Juneyoung On Wed, Jun 9, 2021 at 8:02 AM Nicolai Hähnle via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Hi, > > On Mon, Jun 7, 2021 at 7:40 PM Harald van Dijk via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> On 04/06/2021 16:24, George Mitenkov via llvm-dev wrote: >> >> >> 1. >> >> When copying pointers as bytes implicit pointer-to-integer casts are >> avoided.The current setting of performing a memcpy using i8s leads to >> miscompilations (example: bug report 37469 >> <https://bugs.llvm.org/show_bug.cgi?id=37469>) and is bad for alias >> analysis. >> >> While trying to understand what exactly is going wrong in this bug I >> think I stumbled upon incomplete documentation. At >> https://llvm.org/docs/LangRef.html#pointeraliasing it is documented that >> >> - A memory access must be done through a pointer associated with the >> address range. >> - A pointer value is associated with the addresses associated with >> any value it is based on. >> - A pointer value formed by an instruction is based on another >> pointer value the instruction is getelementptr, bitcast, or inttoptr. >> >> What is not mentioned is what a pointer value formed by a load >> instruction is based on. If a pointer value formed by a load instruction is >> not associated with any address range, then it cannot be used to load or >> store *any* value. Clearly that is not wanted. >> >> If memory is completely untyped, then a load of a pointer value needs to >> be equivalent to a load of an integer value followed by inttoptr. In that >> model, it is not possible to replace a load of a pointer value by a >> previously stored value, it is only possible to replace it by >> inttoptr(ptrtoint(previously stored value)). In that model, the problem is >> that LLVM does replace a load of a pointer value by a previously stored >> value, and the bug can be fixed (at a cost of reduced optimisations of >> other code) by making sure to insert ptrtoint and inttoptr as needed, >> including for any pointer members of structs etc. >> >> Your proposal is based on an interpretation of the problem using a memory >> model where memory is not completely untyped. In your memory model, what >> address range is a pointer value that came from a load instruction >> associated with? >> > > I second this. > > I feel like the entire discussion is extremely confused and confusing > because there is lack of clarity about what values are in fact carried by > the types i<N>, ptr (assuming untyped pointers already), and the proposed > b<N>. There are statements about "whether a b<N> contains a pointer or an > integer" without a clear definition of what any of that really means. > > By the way: the discussion of GVN states that i<N> do not carry provenance > information, but LangRef appears to contradict this, because it states (in > the Pointer Aliasing section) that "A pointer value formed by an inttoptr > is *based* on all pointer values that contribute (directly or indirectly) > to the computation of the pointer’s value." The only way to make this > statement operational is for the operand of the inttoptr to carry > provenance information; this definitely needs to be addressed by this > proposal. > > My educated guess is that what the proposal really wants to say is that an > i<N> value is poison or an integer without pointer provenance (i.e., an > i<N> value is never "based on" a pointer) while a b<N> value is poison or > an integer that may have provenance (i.e., it may be "based on" one or more > pointers). Everything else should follow from this core definition. With > this phrasing, I can see a coherent picture emerging that can be discussed > without a lot of the confusion in this thread (though still _some_ > confusion, as the topic is inherently subtle). > > @George, Nuno, and Juneyoung: is that a good lens through which to see the > proposal? If no, why not? If yes, I believe there are some immediate > implications and questions, such as: > > 1. Forbidding arithmetic and bitwise operations in b<N> seems pointless. > Just define them as the corresponding i<N> op plus the union of provenance > of the operands. This allows consistent implementation of char/unsigned > char as b8, without having to jump back and forth between b8 and i8 all the > time. > > 2. What's the provenance of a b<N> `select`? > > 3. A b<N>-to-i<N> conversion necessarily loses all provenance information. > > 4. What's the provenance of the result of an i<N>-to-b<N> conversion? The > only possible choices are "empty" or "full" provenance, where "full" > provenance means that the value is "based on" _all_ pointers. Depending on > the details of allowed instruction signatures, we may want to allow both > choices. "Full" provenance gives us something that's useful as a step > towards inttoptr. "Empty" provenance allows us to lift i<N> to b<N> for > arithmetic operations without being overly conservative (e.g., consider the > somewhat hypothetical problem of lowering a GEP to a sequence of arithmetic > on b<N> types in IR -- we want the result to have the provenance of the > base pointer, which enters the arithmetic as a b<N> with provenance; adding > the various i<N> offset operands should have no effect on provenance). > > 5. Memory is untyped but can carry provenance; i.e., it consists of b8. > Load/store of i<N> performs implicit i-to-b/b-to-i conversions, > respectively. Which kind of i-to-b conversion is used during a store? The > "empty" one or the "full" one? There are several reasonable options for how > to address this, and a choice must be made. > > 6. (How) are pointer types fundamentally different from b<N> types of the > correct size? (By this I mean: is there any interesting difference in the > values that these types can carry? Ignore surface differences like the fact > that GEP traditionally goes with pointers while `add` goes with integer > types -- we could have a GEP instruction on a correctly sized b<N>) > > It seems pretty clear that there is still a lot of work to be done on this > proposal before we can be sufficiently confident that it won't just add new > correctness problems elsewhere. Besides, the cost of adding a new family of > types is very high, and I am not (yet?) convinced that it pays for itself. > But with the above I can at least see better where you're coming from. > > Cheers, > Nicolai > > > Cheers, >> Harald van Dijk >> _______________________________________________ >> LLVM Developers mailing list >> llvm-dev at lists.llvm.org >> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >> > > > -- > Lerne, wie die Welt wirklich ist, > aber vergiss niemals, wie sie sein sollte. > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-- Juneyoung Lee Software Foundation Lab, Seoul National University -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210609/c5382219/attachment.html>
Ralf Jung via llvm-dev
2021-Jun-13 15:22 UTC
[llvm-dev] [RFC] Introducing a byte type to LLVM
Hi,> 1. Forbidding arithmetic and bitwise operations in b<N> seems pointless. Just > define them as the corresponding i<N> op plus the union of provenance of the > operands. This allows consistent implementation of char/unsigned char as b8, > without having to jump back and forth between b8 and i8 all the time.FWIW, "char" addition happens at "int" type due to integer promotion. So there is no problem with back and forth here. "Union" of provenance is currently not an operation that is required to model LLVM IR, so your proposal would necessitate adding such a concept. It'll be interesting to figure out how "getelementptr inbounds" behaves on multi-provenance pointers...> 6. (How) are pointer types fundamentally different from b<N> types of the > correct size? (By this I mean: is there any interesting difference in the values > that these types can carry? Ignore surface differences like the fact that GEP > traditionally goes with pointers while `add` goes with integer types -- we could > have a GEP instruction on a correctly sized b<N>)I'm not saying I have the answer here, but one possible difference might arise with "mixing bytes from different pointers". Say we are storing pointer "ptr1" directly followed by "ptr2" on a 64bit machine, and now we are doing an (unalinged) 8-byte load covering the last 4 bytes of ptr1 and the first 4 bytes of ptr2. This is certainly a valid value for b64. Is it also a valid value at pointer type, and if yes, which provenance does it have? Kind regards, Ralf