Ralf Jung via llvm-dev
2019-Mar-27 08:09 UTC
[llvm-dev] getelementptr inbounds with offset 0
Hi Johannes,> Now that reasoning works from a conceptual standpoint only for > non-inbounds GEPs, I think. From a practical standpoint my above > description will probably make sure everything works out just fine (see > also my rephrased answer down below!). I say this because I think the > following lang-ref passage makes sure everything, not only memory > accesses, involving a non-pointer-to-object* GEP is poison: > "If the inbounds keyword is present, the result value of the > getelementptr is a poison value if the base pointer is not an in > bounds address of an allocated object" > > * I would argue every object needs to have an extend, hence cannot be > zero-sized.I would find that a rather surprising exception / special case. There's nothing wrong with objects of size 0.>> FWIW, in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf> we anyway had to >> make "getelementptr inbounds" on integer pointers (pointers obtained by casting >> an integer to a pointer) never yield poison directly and instead defer the >> in-bound check to the time when the actual access happens. That nicely >> accommodates all uses of getelementptr that just compute addresses without ever >> using them for a memory access (using them only, e.g. to compute offsets or >> compare pointers). But this is not how the LLVM LangRef is written, unfortunately. > > I see. Is there a quick answer to the questions why you need inbounds > GEPs in that case? Can't you just use non-inbounds GEPs if you know you > might not have a valid base ptr and "optimize" it to inbounds once that > is proven?You mean on the Rust side? We emit GEPi for field accesses and array indexing. We cannot always statically determine if this is happening for a ZST or not. At the same time, given that no memory access ever happens for a ZST, allocating a ZST (Box::new in Rust, think of it like new in C++) does not actually allocate any memory, it just returns an integer (sufficiently aligned) cast to a pointer.>>> Let's start with example2, note that I renamed the values above. >>> >>> %P2 is dangling (and we know it) after the free. %P2 is therefore >>> poison* and so is %G2. >>> >>> * or undef I'm always confused which might be bad in this conversation. >> >> Wait, I know that C has a rule that dangling pointers are "indeterminate" but >> this is the first time I hear that LLVM has it as well. Is that written down >> anywhere? Rust relies heavily in dangling pointers being well-behaved when used >> only on comparisons and casts (no accesses), so this would be a big deal. >> (Also, this rule in C is pretty much impossible to formalize and serves no >> purpose that I know of, but that is a separate discussion.) > > I am not very formal in this thread and I realize that this might be a > problem, sorry. The above quote from the lang-ref [0] is why I think > "dangling" inbounds GEPs are poison, do you concur? > > [0] https://llvm.org/docs/LangRef.html#getelementptr-instructionYou said above that even the *input* (%P2) would be poison. That is the part I am doubting. If the input is not poison (just dangling), then we come back to the original question behind example2 -- and yes, I can see a reading of the GEPi spec that makes this poison. On the other hand, this would make a dangling pointer (formerly pointing to an object) behave different than an integer pointer that never pointed to any object, which seems odd.> I agree with your intent, but: My argument here was not to say we cannot > figure X out today so all is good. What I wanted to say/should have said > is something more along the line of: > Undefined behavior in C/LLVM-IR is often (runtime) value dependent and > therefore statically not decidable. If it is not, the code must be > assumed to have defined (="the normal") behavior statically. This > should be preserved by current and future LLVM passes. Your particular > example (example1) seems to me like such a case in which the semantics > is statically not decidable and therefore I do not see any problem. > > Again, I might just be wrong about. Please don't pin it on me at the end > of the day.Sure, UB is definitely *defined* in a runtime-value dependent way. The problem here is that it is not defined in a precise way -- something where one could write an interpreter that tracks all the extra state that is needed (like poison/undef and where allocations lie) and then says precisely under which conditions we have UB and under which we do not. What I am asking here for is the exact definition of GEPi if, *at run-time*, the offset is 0, and the base pointer is (a) an integer, or (b) dangling.>> But also, your response assumes "dangling pointers are undef/posion", which is >> new to me. I'd be rather shocked if this is something LLVM actually relies on >> anywhere. > > Again, that is how I read the quoted lang-ref wording above for > inbounds GEPs. I agree with you that non-inbounds GEPs have a "normal" > value that can be used for all non-access instructions in the usual way > without producing undef/poison.I must be missing something here. You said above "%P2 is dangling (and we know it) after the free. %P2 is therefore poison" -- at this point, GEPi has not even happened yet! If GEPi does something, it will make the *output* poison (%G2), but you are saying the *input* becomes poison (%P1), and that cannot be a consequence of GEPi at all. Kind regards, Ralf> > Cheers, > Johannes > > >>>>> A side-effect based on the GEP will however __locally__ introduce an >>>>> dereferencability assumption (in my opinion at least). Let's say the >>>>> code looks like this: >>>>> >>>>> >>>>> %G = gep inbounds (int2ptr 4) 0 ; We don't know anything about the >>>>> dereferencability of ; the memory at address 4 here. br %cnd, >>>>> %BB0, %BB1 >>>>> >>>>> BB0: ; We don't know anything about the dereferencability of ; the >>>>> memory at address 4 here. load %G ; We know the memory at address 4 >>>>> is dereferenceable here. ; Though, that is due to the load and not >>>>> the inbounds. ... br %BB1 >>>>> >>>>> BB1: ; We don't know anything about the dereferencability of ; the >>>>> memory at address 4 here. >>>>> >>>>> >>>>> It is a different story if you start to use the GEP in other >>>>> operations, e.g., to alter control flow. Then the (potential) >>>>> undefined value can propagate. >>>>> >>>>> >>>>> Any thought on this? Did I at least get your problem description >>>>> right? >>>>> >>>>> Cheers, Johannes >>>>> >>>>> >>>>> >>>>> P.S. Sorry if this breaks the thread and apologies that I had to >>>>> remove Bruce from the CC. It turns out replying to an email you did >>>>> not receive is complicated and getting on the LLVM-Dev list is >>>>> nowadays as well... >>>>> >>>>> >>>>> On 02/25, Ralf Jung via llvm-dev wrote: >>>>>> Hi Bruce, >>>>>> >>>>>> On 25.02.19 13:10, Bruce Hoult wrote: >>>>>>> LLVM has no idea whether the address computed by GEP is actually >>>>>>> within a legal object. The "inbounds" keyword is just you, the >>>>>>> programmer, promising LLVM that you know it's ok and that you >>>>>>> don't care what happens if it is actually out of bounds. >>>>>>> >>>>>>> https://llvm.org/docs/GetElementPtr.html#what-happens-if-an-array-index-is-out-of-bounds >>>>>> >>>>>> The LangRef says I get a poison value when I am violating the >>>>>> bounds. What I am asking is what exactly this means when the offset >>>>>> is 0 -- what *are* the conditions under which an offset-by-0 is >>>>>> "out of bounds" and hence yields poison? Of course LLVM cannot >>>>>> always statically determine this, but it relies on (dynamically, on >>>>>> the "LLVM abstract machine") such things not happening, and I am >>>>>> asking what exactly these dynamic conditions are. >>>>>> >>>>>> Kind regards, Ralf >>>>>> >>>>>>> >>>>>>> On Sun, Feb 24, 2019 at 9:05 AM Ralf Jung via llvm-dev >>>>>>> <llvm... at lists.llvm.org> wrote: >>>>>>>> >>>>>>>> Hi all, >>>>>>>> >>>>>>>> What exactly are the rules for `getelementptr inbounds` with >>>>>>>> offset 0? >>>>>>>> >>>>>>>> In Rust, we are relying on the fact that if we use, for example, >>>>>>>> `inttoptr` to turn `4` into a pointer, we can then do >>>>>>>> `getelementptr inbounds` with offset 0 on that without LLVM >>>>>>>> deducing that there actually is any dereferencable memory at >>>>>>>> location 4. The argument is that we can think of there being a >>>>>>>> zero-sized allocation. Is that a reasonable assumption? Can >>>>>>>> something like this be documented in the LangRef? >>>>>>>> >>>>>>>> Relatedly, how does the situation change if the pointer is not >>>>>>>> created "out of thin air" from a fixed integer, but is actually a >>>>>>>> dangling pointer obtained previously from `malloc` (or `alloca` >>>>>>>> or whatever)? Is getelementptr inbounds` with offset 0 on such a >>>>>>>> pointer a NOP, or does it result in `poison`? And if that makes >>>>>>>> a difference, how does that square with the fact that, e.g., the >>>>>>>> integer `0x4000` could well be inside such an allocation, but >>>>>>>> doing `getelementptr inbounds` with offset 0 on that would fall >>>>>>>> under the first question above? >>>>>>>> >>>>>>>> Kind regards, Ralf >>>>>>>> _______________________________________________ LLVM Developers >>>>>>>> mailing list llvm... at lists.llvm.org >>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>>>> _______________________________________________ LLVM Developers >>>>>> mailing list llvm... at lists.llvm.org >>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >>>>> >>> >
Doerfert, Johannes via llvm-dev
2019-Mar-27 16:06 UTC
[llvm-dev] getelementptr inbounds with offset 0
On 03/27, Ralf Jung wrote:> > Now that reasoning works from a conceptual standpoint only for > > non-inbounds GEPs, I think. From a practical standpoint my above > > description will probably make sure everything works out just fine (see > > also my rephrased answer down below!). I say this because I think the > > following lang-ref passage makes sure everything, not only memory > > accesses, involving a non-pointer-to-object* GEP is poison: > > "If the inbounds keyword is present, the result value of the > > getelementptr is a poison value if the base pointer is not an in > > bounds address of an allocated object" > > > > * I would argue every object needs to have an extend, hence cannot be > > zero-sized. > > I would find that a rather surprising exception / special case. There's nothing > wrong with objects of size 0.I guess you're right. It will not change my argumentation above though, if it is known that is has extend 0 it falls under the known extend special case.> >> FWIW, in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf> we anyway had to > >> make "getelementptr inbounds" on integer pointers (pointers obtained by casting > >> an integer to a pointer) never yield poison directly and instead defer the > >> in-bound check to the time when the actual access happens. That nicely > >> accommodates all uses of getelementptr that just compute addresses without ever > >> using them for a memory access (using them only, e.g. to compute offsets or > >> compare pointers). But this is not how the LLVM LangRef is written, unfortunately. > > > > I see. Is there a quick answer to the questions why you need inbounds > > GEPs in that case? Can't you just use non-inbounds GEPs if you know you > > might not have a valid base ptr and "optimize" it to inbounds once that > > is proven? > > You mean on the Rust side? We emit GEPi for field accesses and array indexing. > We cannot always statically determine if this is happening for a ZST or not. > At the same time, given that no memory access ever happens for a ZST, allocating > a ZST (Box::new in Rust, think of it like new in C++) does not actually allocate > any memory, it just returns an integer (sufficiently aligned) cast to a pointer.OK, but why not emit non-inbonuds GEPs instead? They do not come with the problems you have now, or maybe I misunderstand.> >>> Let's start with example2, note that I renamed the values above. > >>> > >>> %P2 is dangling (and we know it) after the free. %P2 is therefore > >>> poison* and so is %G2. > >>> > >>> * or undef I'm always confused which might be bad in this conversation. > >> > >> Wait, I know that C has a rule that dangling pointers are "indeterminate" but > >> this is the first time I hear that LLVM has it as well. Is that written down > >> anywhere? Rust relies heavily in dangling pointers being well-behaved when used > >> only on comparisons and casts (no accesses), so this would be a big deal. > >> (Also, this rule in C is pretty much impossible to formalize and serves no > >> purpose that I know of, but that is a separate discussion.) > > > > I am not very formal in this thread and I realize that this might be a > > problem, sorry. The above quote from the lang-ref [0] is why I think > > "dangling" inbounds GEPs are poison, do you concur? > > > > [0] https://llvm.org/docs/LangRef.html#getelementptr-instruction > > You said above that even the *input* (%P2) would be poison. That is the part I > am doubting.You are right again I guess. %P2, the input, is probably not poison after the free but just a dangling pointer.> If the input is not poison (just dangling), then we come back to the original > question behind example2 -- and yes, I can see a reading of the GEPi spec that > makes this poison. On the other hand, this would make a dangling pointer > (formerly pointing to an object) behave different than an integer pointer that > never pointed to any object, which seems odd.I do not know how to read this but one way to make sense of it is to assume the GEPi on an object is fine but its value becomes poison in the moment you deallocate the object. That means, %G2 in the example is non-poison until the free, then it is. That would at least be what I assume to be the semantic here.> > I agree with your intent, but: My argument here was not to say we cannot > > figure X out today so all is good. What I wanted to say/should have said > > is something more along the line of: > > Undefined behavior in C/LLVM-IR is often (runtime) value dependent and > > therefore statically not decidable. If it is not, the code must be > > assumed to have defined (="the normal") behavior statically. This > > should be preserved by current and future LLVM passes. Your particular > > example (example1) seems to me like such a case in which the semantics > > is statically not decidable and therefore I do not see any problem. > > > > Again, I might just be wrong about. Please don't pin it on me at the end > > of the day. > > Sure, UB is definitely *defined* in a runtime-value dependent way. The problem > here is that it is not defined in a precise way -- something where one could > write an interpreter that tracks all the extra state that is needed (like > poison/undef and where allocations lie) and then says precisely under which > conditions we have UB and under which we do not. > What I am asking here for is the exact definition of GEPi if, *at run-time*, the > offset is 0, and the base pointer is (a) an integer, or (b) dangling.That last part is given by the lang-ref (imo): "If the inbounds keyword is present, the result value of the getelementptr is a poison value if the base pointer is not an in bounds address of an allocated object" I read this as: If you have a GEPi, you get poison if the base pointer is not an allocated object. That is a dangling pointer (b) causes the GEPi to be poison and a pointer from integer (a) may, if the address denoted by the integer is not inside, or one past, an allocated object. Now any offset except 0 will add more possible ways to generate a poison value.> >> But also, your response assumes "dangling pointers are undef/posion", which is > >> new to me. I'd be rather shocked if this is something LLVM actually relies on > >> anywhere. > > > > Again, that is how I read the quoted lang-ref wording above for > > inbounds GEPs. I agree with you that non-inbounds GEPs have a "normal" > > value that can be used for all non-access instructions in the usual way > > without producing undef/poison. > > I must be missing something here. You said above "%P2 is dangling (and we know > it) after the free. %P2 is therefore poison" -- at this point, GEPi has not even > happened yet! If GEPi does something, it will make the *output* poison (%G2), > but you are saying the *input* becomes poison (%P1), and that cannot be a > consequence of GEPi at all.True and corrected above. Sorry for the confusion.> > Cheers, > > Johannes > > > > > >>>>> A side-effect based on the GEP will however __locally__ introduce an > >>>>> dereferencability assumption (in my opinion at least). Let's say the > >>>>> code looks like this: > >>>>> > >>>>> > >>>>> %G = gep inbounds (int2ptr 4) 0 ; We don't know anything about the > >>>>> dereferencability of ; the memory at address 4 here. br %cnd, > >>>>> %BB0, %BB1 > >>>>> > >>>>> BB0: ; We don't know anything about the dereferencability of ; the > >>>>> memory at address 4 here. load %G ; We know the memory at address 4 > >>>>> is dereferenceable here. ; Though, that is due to the load and not > >>>>> the inbounds. ... br %BB1 > >>>>> > >>>>> BB1: ; We don't know anything about the dereferencability of ; the > >>>>> memory at address 4 here. > >>>>> > >>>>> > >>>>> It is a different story if you start to use the GEP in other > >>>>> operations, e.g., to alter control flow. Then the (potential) > >>>>> undefined value can propagate. > >>>>> > >>>>> > >>>>> Any thought on this? Did I at least get your problem description > >>>>> right? > >>>>> > >>>>> Cheers, Johannes > >>>>> > >>>>> > >>>>> > >>>>> P.S. Sorry if this breaks the thread and apologies that I had to > >>>>> remove Bruce from the CC. It turns out replying to an email you did > >>>>> not receive is complicated and getting on the LLVM-Dev list is > >>>>> nowadays as well... > >>>>> > >>>>> > >>>>> On 02/25, Ralf Jung via llvm-dev wrote: > >>>>>> Hi Bruce, > >>>>>> > >>>>>> On 25.02.19 13:10, Bruce Hoult wrote: > >>>>>>> LLVM has no idea whether the address computed by GEP is actually > >>>>>>> within a legal object. The "inbounds" keyword is just you, the > >>>>>>> programmer, promising LLVM that you know it's ok and that you > >>>>>>> don't care what happens if it is actually out of bounds. > >>>>>>> > >>>>>>> https://llvm.org/docs/GetElementPtr.html#what-happens-if-an-array-index-is-out-of-bounds > >>>>>> > >>>>>> The LangRef says I get a poison value when I am violating the > >>>>>> bounds. What I am asking is what exactly this means when the offset > >>>>>> is 0 -- what *are* the conditions under which an offset-by-0 is > >>>>>> "out of bounds" and hence yields poison? Of course LLVM cannot > >>>>>> always statically determine this, but it relies on (dynamically, on > >>>>>> the "LLVM abstract machine") such things not happening, and I am > >>>>>> asking what exactly these dynamic conditions are. > >>>>>> > >>>>>> Kind regards, Ralf > >>>>>> > >>>>>>> > >>>>>>> On Sun, Feb 24, 2019 at 9:05 AM Ralf Jung via llvm-dev > >>>>>>> <llvm... at lists.llvm.org> wrote: > >>>>>>>> > >>>>>>>> Hi all, > >>>>>>>> > >>>>>>>> What exactly are the rules for `getelementptr inbounds` with > >>>>>>>> offset 0? > >>>>>>>> > >>>>>>>> In Rust, we are relying on the fact that if we use, for example, > >>>>>>>> `inttoptr` to turn `4` into a pointer, we can then do > >>>>>>>> `getelementptr inbounds` with offset 0 on that without LLVM > >>>>>>>> deducing that there actually is any dereferencable memory at > >>>>>>>> location 4. The argument is that we can think of there being a > >>>>>>>> zero-sized allocation. Is that a reasonable assumption? Can > >>>>>>>> something like this be documented in the LangRef? > >>>>>>>> > >>>>>>>> Relatedly, how does the situation change if the pointer is not > >>>>>>>> created "out of thin air" from a fixed integer, but is actually a > >>>>>>>> dangling pointer obtained previously from `malloc` (or `alloca` > >>>>>>>> or whatever)? Is getelementptr inbounds` with offset 0 on such a > >>>>>>>> pointer a NOP, or does it result in `poison`? And if that makes > >>>>>>>> a difference, how does that square with the fact that, e.g., the > >>>>>>>> integer `0x4000` could well be inside such an allocation, but > >>>>>>>> doing `getelementptr inbounds` with offset 0 on that would fall > >>>>>>>> under the first question above? > >>>>>>>> > >>>>>>>> Kind regards, Ralf > >>>>>>>> _______________________________________________ LLVM Developers > >>>>>>>> mailing list llvm... at lists.llvm.org > >>>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >>>>>> _______________________________________________ LLVM Developers > >>>>>> mailing list llvm... at lists.llvm.org > >>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev > >>>>> > >>> > >-- Johannes Doerfert Researcher Argonne National Laboratory Lemont, IL 60439, USA jdoerfert at anl.gov -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 228 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190327/67e5ccba/attachment.sig>
Ralf Jung via llvm-dev
2019-Apr-10 15:14 UTC
[llvm-dev] getelementptr inbounds with offset 0
Hi,>>> I see. Is there a quick answer to the questions why you need inbounds >>> GEPs in that case? Can't you just use non-inbounds GEPs if you know you >>> might not have a valid base ptr and "optimize" it to inbounds once that >>> is proven? >> >> You mean on the Rust side? We emit GEPi for field accesses and array indexing. >> We cannot always statically determine if this is happening for a ZST or not. >> At the same time, given that no memory access ever happens for a ZST, allocating >> a ZST (Box::new in Rust, think of it like new in C++) does not actually allocate >> any memory, it just returns an integer (sufficiently aligned) cast to a pointer. > > OK, but why not emit non-inbonuds GEPs instead? They do not come with > the problems you have now, or maybe I misunderstand.The problem is statically figuring out whether it should be inbounds or non-inbounds. When we have code like `&x[n]`, this might be an offset-by-0 in an empty slice and hence fall into the scope of my question, or it might be a "normal" array access where we definitely want inbounds.>> Sure, UB is definitely *defined* in a runtime-value dependent way. The problem >> here is that it is not defined in a precise way -- something where one could >> write an interpreter that tracks all the extra state that is needed (like >> poison/undef and where allocations lie) and then says precisely under which >> conditions we have UB and under which we do not. >> What I am asking here for is the exact definition of GEPi if, *at run-time*, the >> offset is 0, and the base pointer is (a) an integer, or (b) dangling. > > That last part is given by the lang-ref (imo): > "If the inbounds keyword is present, the result value of the > getelementptr is a poison value if the base pointer is not an in > bounds address of an allocated object" > > I read this as: If you have a GEPi, you get poison if the base pointer > is not an allocated object. That is a dangling pointer (b) causes the > GEPi to be poison and a pointer from integer (a) may, if the address > denoted by the integer is not inside, or one past, an allocated object. > Now any offset except 0 will add more possible ways to generate a poison > value.Thanks. That makes sense from reading the docs (though I am not convinced that it actually helps with optimizations to be this strict here). For the (a) case, the question about "0-sized objects" remains, but it doesn't seem like the answer could affect what LLVM does. It would be really nice to have a reference interpreter for LLVM IR that can explicitly check for all the UB. Maybe, one day... ;) Kind regards, Ralf