thr3ads.net - llvm dev - [llvm-dev] getelementptr inbounds with offset 0 [Mar 2019]

Ralf Jung via llvm-dev

2019-Mar-15 16:09 UTC

[llvm-dev] getelementptr inbounds with offset 0

Hi Johannes,
> From the Lang-Ref statement
> 
>   "With the inbounds keyword, the result value of the GEP is undefined
>   if the address is outside the actual underlying allocated object and
>   not the address one-past-the-end."
> 
> I'd argue that the actual offset value (here 0) is irrelevant. The GEP
> value is undefined if inbounds is present and the resulting pointer does
> not point into, or one-past-the-end, of an allocated object. This
> object, in my understanding, has to be the same one the base pointer of
> the GEP points into, or one-past-the-end, or you get again an undefined
> result.
Yes, I agree with that reading.

However, the notion of "allocated object" here is not entirely clear.  LLVM has
to operate under the assumption that there are allocations and allocators it
does not know anything about.  Just imagine some embedded project writing to the
well-known address 0xDeadCafe because there is a hardware register there.

So, the thinking here is: LLVM cannot exclude the possibility of an object of
size 0 existing at any given address.  The pointer returned by "GEPi p 0"
(getelementptr inbounds with offset 0) would then be one-past-the-end of such a
0-sized object.  Thus, "GEPi p 0" is the identity function for any p; it will
not return poison.
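
The zero-sized-object argument can be sketched as a toy model in Python. This is only an illustration of the reading discussed here, not LLVM's implementation: `POISON`, `gep_inbounds`, and the list of `(start, size)` allocations are hypothetical names, and the rule encoded is the assumption that both the base and the result must lie within (or one past the end of) the same object.

```python
POISON = "poison"  # stand-in for LLVM's poison value in this toy model


def gep_inbounds(allocations, base, offset):
    """Toy reading of the inbounds rule: the result is usable only if some
    allocated object contains both `base` and `base + offset`, where
    "contains" includes the one-past-the-end address."""
    result = base + offset
    for start, size in allocations:
        end = start + size  # one-past-the-end address of this object
        if start <= base <= end and start <= result <= end:
            return result
    return POISON


# Only a 12-byte object at 0x1000 exists: address 4 is out of bounds.
known = [(0x1000, 12)]
print(gep_inbounds(known, 4, 0))     # poison

# If a 0-sized object may exist at address 4, offset 0 is the identity ...
with_zst = known + [(4, 0)]
print(gep_inbounds(with_zst, 4, 0))  # 4
# ... while any non-zero offset still leaves that object.
print(gep_inbounds(with_zst, 4, 1))  # poison
```

Under this model the question in the thread becomes whether the abstract machine is allowed to assume that the entry `(4, 0)` is absent.
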
> Now if that might cause any problems, e.g., if LLVM is able to act on
> this fact, depends on various factors including what you do with the
> GEP. Your initial problem seemed to be that LLVM "might be able to
> deduce dereferenceable memory at location 4" but that should never be the
> case if you only form the aforementioned GEP, with or without the
> inbounds actually. Forming a pointer that has an undefined value is just
> that, a pointer with an undefined value.
Ah, good point.  First of all I was indeed unclear; the case I am worried about
here is GEPi returning poison.  (These values might be used in further
computations and eventually surface as UB.)
But also, clearly a "GEPi 0" alone cannot introduce any dereferenceability
assumption because of the "one-past-the-end" case. That pointer is inbounds but
cannot be dereferenced.

So, for the sake of a more concrete example (and please excuse me butchering
LLVM syntax, I usually deal with this in terms of C or Rust syntax): Can %G in
the following programs be poison?  If yes, what is the analysis that would be
weakened or the optimization that could no longer happen if "GEPi %P 0" was
instead defined to always return %P?

# example1

%P = inttoptr i64 4 to i8*
%G = getelementptr inbounds i8, i8* %P, i64 0

# example2

%P = call noalias i8* @malloc(i64 12)
call void @free(i8* %P)
%G = getelementptr inbounds i8, i8* %P, i64 0

The first happens in Rust all the time, and we rely on not getting poison.  The
second doesn't occur in Rust (to my knowledge), but it seems somewhat
inconsistent to return poison in one case and not the other.

Kind regards,
Ralf
> A side-effect based on the GEP
> will however __locally__ introduce a dereferenceability assumption (in
> my opinion at least). Let's say the code looks like this:
> 
> 
>   %P = inttoptr i64 4 to i8*
>   %G = getelementptr inbounds i8, i8* %P, i64 0
>   ; We don't know anything about the dereferenceability of
>   ; the memory at address 4 here.
>   br i1 %cnd, label %BB0, label %BB1
> 
> BB0:
>   ; We don't know anything about the dereferenceability of
>   ; the memory at address 4 here.
>   %V = load i8, i8* %G
>   ; We know the memory at address 4 is dereferenceable here.
>   ; Though, that is due to the load and not the inbounds.
>   ...
>   br label %BB1
> 
> BB1:
>   ; We don't know anything about the dereferenceability of
>   ; the memory at address 4 here.
> 
> 
> It is a different story if you start to use the GEP in other operations,
> e.g., to alter control flow. Then the (potential) undefined value can
> propagate.
> 
> 
> Any thought on this? Did I at least get your problem description right?
> 
> Cheers,
>   Johannes
> 
> 
> 
> P.S. Sorry if this breaks the thread and apologies that I had to remove
>      Bruce from the CC. It turns out replying to an email you did not
>      receive is complicated and getting on the LLVM-Dev list is nowadays
>      as well...
> 
> 
> On 02/25, Ralf Jung via llvm-dev wrote:
>> Hi Bruce,
>>
>> On 25.02.19 13:10, Bruce Hoult wrote:
>>> LLVM has no idea whether the address computed by GEP is actually
>>> within a legal object. The "inbounds" keyword is just you, the
>>> programmer, promising LLVM that you know it's ok and that you don't
>>> care what happens if it is actually out of bounds.
>>>
>>> https://llvm.org/docs/GetElementPtr.html#what-happens-if-an-array-index-is-out-of-bounds
>>
>> The LangRef says I get a poison value when I am violating the bounds. What I am
>> asking is what exactly this means when the offset is 0 -- what *are* the
>> conditions under which an offset-by-0 is "out of bounds" and hence yields poison?
>> Of course LLVM cannot always statically determine this, but it relies on
>> (dynamically, on the "LLVM abstract machine") such things not happening, and I
>> am asking what exactly these dynamic conditions are.
>>
>> Kind regards,
>> Ralf
>>
>>>
>>> On Sun, Feb 24, 2019 at 9:05 AM Ralf Jung via llvm-dev
>>> <llvm... at lists.llvm.org> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> What exactly are the rules for `getelementptr inbounds` with offset 0?
>>>>
>>>> In Rust, we are relying on the fact that if we use, for example, `inttoptr` to
>>>> turn `4` into a pointer, we can then do `getelementptr inbounds` with offset 0
>>>> on that without LLVM deducing that there actually is any dereferenceable memory
>>>> at location 4.  The argument is that we can think of there being a zero-sized
>>>> allocation. Is that a reasonable assumption?  Can something like this be
>>>> documented in the LangRef?
>>>>
>>>> Relatedly, how does the situation change if the pointer is not created "out of
>>>> thin air" from a fixed integer, but is actually a dangling pointer obtained
>>>> previously from `malloc` (or `alloca` or whatever)?  Is `getelementptr inbounds`
>>>> with offset 0 on such a pointer a NOP, or does it result in `poison`?  And if
>>>> that makes a difference, how does that square with the fact that, e.g., the
>>>> integer `0x4000` could well be inside such an allocation, but doing
>>>> `getelementptr inbounds` with offset 0 on that would fall under the first
>>>> question above?
>>>>
>>>> Kind regards,
>>>> Ralf
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm... at lists.llvm.org
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Doerfert, Johannes via llvm-dev

2019-Mar-15 16:57 UTC

[llvm-dev] getelementptr inbounds with offset 0

Hi Ralf,

On 03/15, Ralf Jung wrote:
> > From the Lang-Ref statement
> > 
> >   "With the inbounds keyword, the result value of the GEP is undefined
> >   if the address is outside the actual underlying allocated object and
> >   not the address one-past-the-end."
> > 
> > I'd argue that the actual offset value (here 0) is irrelevant. The GEP
> > value is undefined if inbounds is present and the resulting pointer does
> > not point into, or one-past-the-end, of an allocated object. This
> > object, in my understanding, has to be the same one the base pointer of
> > the GEP points into, or one-past-the-end, or you get again an undefined
> > result.
> 
> Yes, I agree with that reading.
That's reassuring for me ;)

> However, the notion of "allocated object" here is not entirely clear.
True.

> LLVM has to operate under the assumption that there are allocations
> and allocators it does not know anything about.  Just imagine some
> embedded project writing to the well-known address 0xDeadCafe because
> there is a hardware register there.
True.

> So, the thinking here is: LLVM cannot exclude the possibility of an
> object of size 0 existing at any given address.  The pointer returned
> by "GEPi p 0" would then be one-past-the-end of such a 0-sized object.
> Thus, "GEPi p 0" is the identity function for any p; it will not
> return poison.
I don't see the problem. The behavior I hope we want and implement is:

Either LLVM knows that %p points to an invalid address (=non-object) or
it doesn't. If it does, %p and all GEPs on it yield poison. If it
doesn't, it has to assume %p points to a valid address and offsets 0, 1,
2, ... might all yield valid pointers. The special case is when we know
%p is valid and has an extent of (at most) S; then all offsets <= S,
including 0, are potentially valid (negative extents are similar).
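
The three cases above can be written down as a small decision function. This is a minimal Python sketch of the rule as stated in this message, not of any actual LLVM analysis; `Knowledge` and `gep_may_be_valid` are hypothetical names, and negative offsets are simply rejected because the "negative extents" case is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Knowledge:
    """What the compiler is assumed to know statically about a base pointer %p."""
    known_invalid: bool = False       # %p is known to point to a non-object
    max_extent: Optional[int] = None  # known upper bound S on the object size, if any


def gep_may_be_valid(k: Knowledge, offset: int) -> bool:
    """True if `gep inbounds %p, offset` must be treated as possibly valid."""
    if k.known_invalid:
        return False                  # %p and every GEP on it yield poison
    if k.max_extent is None:
        return True                   # nothing known: assume any offset may be valid
    return 0 <= offset <= k.max_extent  # known extent S: offsets 0..S, incl. one-past-the-end


print(gep_may_be_valid(Knowledge(), 0))                    # True  (example1: inttoptr 4)
print(gep_may_be_valid(Knowledge(known_invalid=True), 0))  # False (example2 under this rule)
print(gep_may_be_valid(Knowledge(max_extent=12), 12))      # True  (one-past-the-end)
print(gep_may_be_valid(Knowledge(max_extent=12), 13))      # False
```

Note that the offset only matters in the last case; with no knowledge, or with known invalidity, offset 0 behaves like any other offset.
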

> > Now if that might cause any problems, e.g., if LLVM is able to act
> > on this fact, depends on various factors including what you do with
> > the GEP. Your initial problem seemed to be that LLVM "might be able
> > to deduce dereferenceable memory at location 4" but that should never
> > be the case if you only form the aforementioned GEP, with or without
> > the inbounds actually. Forming a pointer that has an undefined value
> > is just that, a pointer with an undefined value.
> 
> Ah, good point.  First of all I was indeed unclear; the case I am
> worried about here is GEPi returning poison.  (These values might be
> used in further computations and eventually surface as UB.) But also,
> clearly a "GEPi 0" alone cannot introduce any dereferenceability
> assumption because of the "one-past-the-end" case. That pointer is
> inbounds but cannot be dereferenced.
> 
> So, for the sake of a more concrete example (and please excuse me
> butchering LLVM syntax, I usually deal with this in terms of C or Rust
> syntax): Can %G in the following programs be poison?  If yes, what is
> the analysis that would be weakened or the optimization that could no
> longer happen if "GEPi %P 0" was instead defined to always return %P?
> 
> # example1
> 
> %P1 = inttoptr i64 4 to i8*
> %G1 = getelementptr inbounds i8, i8* %P1, i64 0
> 
> # example2
> 
> %P2 = call noalias i8* @malloc(i64 12)
> call void @free(i8* %P2)
> %G2 = getelementptr inbounds i8, i8* %P2, i64 0
> 
> The first happens in Rust all the time, and we rely on not getting
> poison.  The second doesn't occur in Rust (to my knowledge), but it
> seems somewhat inconsistent to return poison in one case and not the
> other.
Let's start with example2 (note that I renamed the values above).

%P2 is dangling (and we know it) after the free. %P2 is therefore
poison* and so is %G2.

* Or undef; I always confuse the two, which might be bad in this conversation.



In example1, without further information, I'd say that there is no
poison (statically). Address 4 could be an allocated object until proven
otherwise.


I am still a little confused about the problem you see. If what I wrote
about the implemented behavior holds true (which I am not totally sure
of), you should not have a problem with poison even if you would
sprinkle GEP (inbounds) %p 0 all over the place. Either %p was known to
be invalid and so is the GEP, or %p was not known to be invalid and
neither is the GEP. Am I missing something here?

Cheers,
  Johannes
-- 

Johannes Doerfert
Researcher

Argonne National Laboratory
Lemont, IL 60439, USA

jdoerfert at anl.gov

Ralf Jung via llvm-dev

2019-Mar-26 17:20 UTC

[llvm-dev] getelementptr inbounds with offset 0

Hi Johannes,
>> So, the thinking here is: LLVM cannot exclude the possibility of an
>> object of size 0 existing at any given address.  The pointer returned
>> by "GEPi p 0" would then be one-past-the-end of such a 0-sized object.
>> Thus, "GEPi p 0" is the identity function for any p; it will not
>> return poison.
> 
> I don't see the problem. The behavior I hope we want and implement is:
> 
> Either LLVM knows that %p points to an invalid address (=non-object) or
> it doesn't. If it does, %p and all GEPs on it yield poison. If it
> doesn't, it has to assume %p points to a valid address and offsets 0, 1,
> 2, ... might all yield valid pointers. The special case is when we know
> %p is valid and has an extent of (at most) S; then all offsets <= S,
> including 0, are potentially valid (negative extents are similar).
So you are basically saying that whether the offset is 0 or not does not matter,
but whether the base is an object LLVM can know about or not does?  I see.  That
makes sense.

The reason I restricted myself to offset 0 is that we'd like to do this without
actually having any accessible objects anywhere, which works out if the objects
have size 0.

FWIW, in <https://people.mpi-sws.org/~jung/twinsem/twinsem.pdf> we anyway had to
make "getelementptr inbounds" on integer pointers (pointers obtained by casting
an integer to a pointer) never yield poison directly and instead defer the
in-bounds check to the time when the actual access happens.  That nicely
accommodates all uses of getelementptr that just compute addresses without ever
using them for a memory access (using them only, e.g., to compute offsets or
compare pointers).  But this is not how the LLVM LangRef is written,
unfortunately.
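
The "defer the check to the access" idea can be illustrated with a minimal Python sketch. It encodes only the one sentence above, not the full model from the linked paper; `gep_inbounds_int_ptr`, `access_in_bounds`, and the `(start, size)` allocation list are hypothetical names chosen for this illustration.

```python
def gep_inbounds_int_ptr(addr, offset):
    """For integer-derived pointers: pure address arithmetic, never poison."""
    return addr + offset


def access_in_bounds(allocations, addr, width):
    """The deferred check, performed only when memory is actually accessed."""
    return any(start <= addr and addr + width <= start + size
               for start, size in allocations)


allocations = [(0x1000, 12)]      # one live 12-byte object

g = gep_inbounds_int_ptr(4, 0)    # forming the pointer is always fine
print(g == 4)                                     # True: comparing pointers is fine
print(gep_inbounds_int_ptr(4, 8) - g)             # 8: computing offsets is fine
print(access_in_bounds(allocations, g, 1))        # False: address 4 is not accessible
print(access_in_bounds(allocations, 0x1000, 4))   # True
```

Address-only uses never consult the allocation list; only the access does.
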
>> # example1
>>
>> %P1 = inttoptr i64 4 to i8*
>> %G1 = getelementptr inbounds i8, i8* %P1, i64 0
>>
>> # example2
>>
>> %P2 = call noalias i8* @malloc(i64 12)
>> call void @free(i8* %P2)
>> %G2 = getelementptr inbounds i8, i8* %P2, i64 0
>>
>> The first happens in Rust all the time, and we rely on not getting
>> poison.  The second doesn't occur in Rust (to my knowledge), but it
>> seems somewhat inconsistent to return poison in one case and not the
>> other.
> 
> Let's start with example2 (note that I renamed the values above).
> 
> %P2 is dangling (and we know it) after the free. %P2 is therefore
> poison* and so is %G2.
> 
> * Or undef; I always confuse the two, which might be bad in this conversation.
Wait, I know that C has a rule that dangling pointers are "indeterminate" but
this is the first time I hear that LLVM has it as well.  Is that written down
anywhere?  Rust relies heavily on dangling pointers being well-behaved when used
only in comparisons and casts (no accesses), so this would be a big deal.
(Also, this rule in C is pretty much impossible to formalize and serves no
purpose that I know of, but that is a separate discussion.)
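
Why this would be a big deal can be seen in a toy Python model that contrasts the two possible rules for a pointer whose allocation has been freed. Everything here is hypothetical (`POISON`, `value_after_free`, `ptrtoint`, `icmp_eq`); it only assumes that poison propagates through casts and comparisons, and makes no claim about which rule LLVM actually follows.

```python
POISON = "poison"  # stand-in for a poison value in this toy model


def value_after_free(addr, dangling_is_poison):
    """Value of a pointer once its allocation has been freed, under either rule."""
    return POISON if dangling_is_poison else addr


def ptrtoint(p):
    """Cast to integer; poison simply propagates."""
    return p


def icmp_eq(p, q):
    """Pointer comparison; any poison operand poisons the result."""
    if p is POISON or q is POISON:
        return POISON
    return p == q


old = 0x1000
for rule in (True, False):
    p = value_after_free(old, dangling_is_poison=rule)
    print(rule, ptrtoint(p), icmp_eq(p, old))
# True poison poison
# False 4096 True
```

Under the first rule even access-free uses of a dangling pointer are tainted; under the second they behave like ordinary integer operations.
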
> In example1, without further information, I'd say that there is no
> poison (statically). Address 4 could be an allocated object until proven
> otherwise.
> 
> 
> I am still a little confused about the problem you see. If what I wrote
> about the implemented behavior holds true (which I am not totally sure
> of), you should not have a problem with poison even if you would
> sprinkle GEP (inbounds) %p 0 all over the place. Either %p was known to
> be invalid and so is the GEP, or %p was not known to be invalid and
> neither is the GEP. Am I missing something here?
The thing is, I am not asking about the behavior implemented today but about the
behavior of the "abstract LLVM machine" that is described by the LangRef and
that the optimizer has to justify its transformations against.  Analyses become
smarter every day, so looking at what LLVM deduces from certain instructions is
but a snapshot.

But also, your response assumes "dangling pointers are undef/poison", which is
new to me.  I'd be rather shocked if this is something LLVM actually relies on
anywhere.

Kind regards,
Ralf
