thr3ads.net - llvm dev - [LLVMdev] RFC: GEP as canonical form for pointer addressing [Feb 2014]

David Chisnall

2014-Feb-17 10:31 UTC

[LLVMdev] RFC: GEP as canonical form for pointer addressing

On 15 Feb 2014, at 23:55, Andrew Trick <atrick at apple.com> wrote:
> On Feb 14, 2014, at 5:18 PM, Philip Reames <listmail at
philipreames.com> wrote:
> 
>> RFC: GEP as canonical form for pointer addressing
>> 
>> I would like to propose that we designate GEPs as the canonical form
for pointer addressing in LLVM IR before CodeGenPrepare.
>> 
>> Corollaries
>> 1) It is legal for an optimizer to convert inttoptr+arithmetic+inttoptr
sequences to GEPs, but not vice versa.
>> 2) Input IR which does not contain inttoptr instructions will never
contain inttoptr instructions (before CodeGenPrepare.)
>> 
>> I've spoken with Nick Lewycky & Owen Anderson offline at the
last social.  On first reflection, both were okay with the proposal, but I'd
like broader buy-in and discussion.  Nick & Owen, if I've accidentally
misrepresented our discussion or you've had second thoughts since, please
speak up.
> 
> FWIW, I think it would be nice if standard optimization passes have this
property of being well behaved with respect to pointer types, and I don’t see a
good reason for canonical IR passes to lose pointer types. I also think it’s the
only way to mix the optimization of pointer values with precise GC. It seems
that you just want LLVM developers to generally agree that certain passes will
be well behaved (you can disable any others). It may just be a matter of
documenting those passes. Ideally we could formalize this by declaring a pass as
pointer-safe and verifying. Can we easily verify that no memory access is based
on inttoptr?

Not directly related, but our canonical form for loops involving pointers[1]
turns a loop that contains a GEP with the loop induction variable into a GEP
with the increment inside the loop.  This has two annoying properties for code
generation:

- The GEP with the induction variable as the offset maps cleanly to CPU
addressing modes and so we generate better code if we don't do this
canonicalisation, and therefore end up trying to undo it in the back end (yuck).

- If the source is the start of an object, then this behaviour is GC-hostile
because it means that IR that contains a pointer to an object start now only
contains a pointer to the middle, requiring the GC to deal with inner pointers.

It would be nice if we could have canonical forms such that if the front end
ensures that there are no inner pointers without pointers to the object's
start in the IR, the optimisers don't break this.
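
For illustration, a minimal C sketch of the two loop shapes described above. The function names are hypothetical and do not come from any test case in this thread. Which form the optimiser actually produces depends on the pass pipeline. The first loop addresses each element as base plus induction variable; the second advances the pointer itself, so the object's start is no longer referenced inside the loop:

```c
#include <stddef.h>

/* Indexed form: every access is computed as base + i, so the pointer to
   the start of the object stays live for the whole loop. */
long sum_indexed(const long *base, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += base[i];
    return total;
}

/* Pointer-increment form: the pointer itself is the induction variable.
   Inside the loop only an interior pointer `p` is used; `base` is not. */
long sum_bumped(const long *base, size_t n) {
    long total = 0;
    const long *end = base + n;   /* one-past-the-end, valid to form in C */
    for (const long *p = base; p != end; p++)
        total += *p;
    return total;
}
```

Both functions compute the same result. The difference is only in which pointer values exist at each program point, which is what matters to a collector that cannot handle inner pointers.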

David

[1] Are canonical forms actually documented anywhere, or are they simply
undocumented implicit contracts?

Andrew Trick

2014-Feb-17 22:53 UTC

On Feb 17, 2014, at 2:31 AM, David Chisnall <David.Chisnall at
cl.cam.ac.uk> wrote:
> On 15 Feb 2014, at 23:55, Andrew Trick <atrick at apple.com> wrote:
> 
>> On Feb 14, 2014, at 5:18 PM, Philip Reames <listmail at
philipreames.com> wrote:
>> 
>>> RFC: GEP as canonical form for pointer addressing
>>> 
>>> I would like to propose that we designate GEPs as the canonical
form for pointer addressing in LLVM IR before CodeGenPrepare.
>>> 
>>> Corollaries
>>> 1) It is legal for an optimizer to convert
inttoptr+arithmetic+inttoptr sequences to GEPs, but not vice versa.
>>> 2) Input IR which does not contain inttoptr instructions will never
contain inttoptr instructions (before CodeGenPrepare.)
>>> 
>>> I've spoken with Nick Lewycky & Owen Anderson offline at
the last social.  On first reflection, both were okay with the proposal, but
I'd like broader buy-in and discussion.  Nick & Owen, if I've
accidentally misrepresented our discussion or you've had second thoughts
since, please speak up.
>> 
>> FWIW, I think it would be nice if standard optimization passes have
this property of being well behaved with respect to pointer types, and I don’t
see a good reason for canonical IR passes to lose pointer types. I also think
it’s the only way to mix the optimization of pointer values with precise GC. It
seems that you just want LLVM developers to generally agree that certain passes
will be well behaved (you can disable any others). It may just be a matter of
documenting those passes. Ideally we could formalize this by declaring a pass as
pointer-safe and verifying. Can we easily verify that no memory access is based
on inttoptr?
> 
> Not directly related, but our canonical form for loops involving
pointers[1] turns a loop that contains a GEP with the loop induction variable
into a GEP with the increment inside the loop.  This has two annoying properties
for code generation:
> 
> - The GEP with the induction variable as the offset maps cleanly to CPU
addressing modes and so we generate better code if we don't do this
canonicalisation, and therefore end up trying to undo it in the back end (yuck).
> 
> - If the source is the start of an object, then this behaviour is
GC-hostile because it means that IR that contains a pointer to an object start
now only contains a pointer to the middle, requiring the GC to deal with inner
pointers.
> 
> It would be nice if we could have canonical forms such that if the front
end ensures that there are no inner pointers without pointers to the
object's start in the IR, the optimisers don't break this.
> 
> David
> 
> [1] Are canonical forms actually documented anywhere, or are they simply
undocumented implicit contracts?

I would say whatever form is currently generated by IR passes is defined as
canonical. It’s not easy to specify. At some points in the pipeline (early and
late) it’s fine to permit multiple forms of the same expression as long as it’s
canonical-enough for the downstream analysis.

If some pass is generating a suboptimal form, it’s good to question whether it’s
really necessary for any analysis. If not, we should change it. Without a test
case, I can’t say what issue you’re running into above.

-Andy

Philip Reames

2014-Feb-18 19:29 UTC

On 02/17/2014 02:31 AM, David Chisnall wrote:
> On 15 Feb 2014, at 23:55, Andrew Trick <atrick at apple.com> wrote:
>
>> On Feb 14, 2014, at 5:18 PM, Philip Reames <listmail at
philipreames.com> wrote:
>>
>>> RFC: GEP as canonical form for pointer addressing
>>>
>>> I would like to propose that we designate GEPs as the canonical
form for pointer addressing in LLVM IR before CodeGenPrepare.
>>>
>>> Corollaries
>>> 1) It is legal for an optimizer to convert
inttoptr+arithmetic+inttoptr sequences to GEPs, but not vice versa.
>>> 2) Input IR which does not contain inttoptr instructions will never
contain inttoptr instructions (before CodeGenPrepare.)
>>>
>>> I've spoken with Nick Lewycky & Owen Anderson offline at
the last social.  On first reflection, both were okay with the proposal, but
I'd like broader buy-in and discussion.  Nick & Owen, if I've
accidentally misrepresented our discussion or you've had second thoughts
since, please speak up.
>> FWIW, I think it would be nice if standard optimization passes have
this property of being well behaved with respect to pointer types, and I don’t
see a good reason for canonical IR passes to lose pointer types. I also think
it’s the only way to mix the optimization of pointer values with precise GC. It
seems that you just want LLVM developers to generally agree that certain passes
will be well behaved (you can disable any others). It may just be a matter of
documenting those passes. Ideally we could formalize this by declaring a pass as
pointer-safe and verifying. Can we easily verify that no memory access is based
on inttoptr?
> Not directly related, but our canonical form for loops involving
pointers[1] turns a loop that contains a GEP with the loop induction variable
into a GEP with the increment inside the loop.  This has two annoying properties
for code generation:
>
> - The GEP with the induction variable as the offset maps cleanly to CPU
addressing modes and so we generate better code if we don't do this
canonicalisation, and therefore end up trying to undo it in the back end (yuck).
>
> - If the source is the start of an object, then this behaviour is
GC-hostile because it means that IR that contains a pointer to an object start
now only contains a pointer to the middle, requiring the GC to deal with inner
pointers.
>
> It would be nice if we could have canonical forms such that if the front
end ensures that there are no inner pointers without pointers to the
object's start in the IR, the optimisers don't break this.

While I agree that from a stylistic point of view this would be an
improvement, we don't actually *need* this to support precise GC. It
would definitely result in cleaner code generation than our current
scheme though.

Philip

Philip Reames

2014-Feb-18 19:51 UTC

On 02/17/2014 02:53 PM, Andrew Trick wrote:
> On Feb 17, 2014, at 2:31 AM, David Chisnall <David.Chisnall at
cl.cam.ac.uk> wrote:
>
>> On 15 Feb 2014, at 23:55, Andrew Trick <atrick at apple.com>
wrote:
>>
>>> On Feb 14, 2014, at 5:18 PM, Philip Reames <listmail at
philipreames.com> wrote:
>>>
>>>> RFC: GEP as canonical form for pointer addressing
>>>>
>>>> I would like to propose that we designate GEPs as the canonical
form for pointer addressing in LLVM IR before CodeGenPrepare.
>>>>
>>>> Corollaries
>>>> 1) It is legal for an optimizer to convert
inttoptr+arithmetic+inttoptr sequences to GEPs, but not vice versa.
>>>> 2) Input IR which does not contain inttoptr instructions will
never contain inttoptr instructions (before CodeGenPrepare.)
>>>>
>>>> I've spoken with Nick Lewycky & Owen Anderson offline
at the last social.  On first reflection, both were okay with the proposal, but
I'd like broader buy-in and discussion.  Nick & Owen, if I've
accidentally misrepresented our discussion or you've had second thoughts
since, please speak up.
>>> FWIW, I think it would be nice if standard optimization passes have
this property of being well behaved with respect to pointer types, and I don’t
see a good reason for canonical IR passes to lose pointer types. I also think
it’s the only way to mix the optimization of pointer values with precise GC. It
seems that you just want LLVM developers to generally agree that certain passes
will be well behaved (you can disable any others). It may just be a matter of
documenting those passes. Ideally we could formalize this by declaring a pass as
pointer-safe and verifying. Can we easily verify that no memory access is based
on inttoptr?
>> Not directly related, but our canonical form for loops involving
pointers[1] turns a loop that contains a GEP with the loop induction variable
into a GEP with the increment inside the loop.  This has two annoying properties
for code generation:
>>
>> - The GEP with the induction variable as the offset maps cleanly to CPU
addressing modes and so we generate better code if we don't do this
canonicalisation, and therefore end up trying to undo it in the back end (yuck).
>>
>> - If the source is the start of an object, then this behaviour is
GC-hostile because it means that IR that contains a pointer to an object start
now only contains a pointer to the middle, requiring the GC to deal with inner
pointers.
>>
>> It would be nice if we could have canonical forms such that if the
front end ensures that there are no inner pointers without pointers to the
object's start in the IR, the optimisers don't break this.
>>
>> David
>>
>> [1] Are canonical forms actually documented anywhere, or are they
simply undocumented implicit contracts?
> I would say whatever form is currently generated by IR passes is defined as
canonical. It’s not easy to specify. At some points in the pipeline (early and
late) it’s fine to permit multiple forms of the same expression as long as it’s
canonical-enough for the downstream analysis.
>
> If some pass is generating a suboptimal form, it’s good to question whether
it’s really necessary for any analysis. If not, we should change it. Without a
test case, I can’t say what issue you’re running into above.

David, do you happen to have a test case on hand? I know I've seen this
before, but my attempt to write out a quick example from memory failed.

Philip

Andrew Trick

2014-Feb-18 21:13 UTC

On Feb 18, 2014, at 11:29 AM, Philip Reames <listmail at philipreames.com>
wrote:
> 
> On 02/17/2014 02:31 AM, David Chisnall wrote:
>> On 15 Feb 2014, at 23:55, Andrew Trick <atrick at apple.com>
wrote:
>> 
>>> On Feb 14, 2014, at 5:18 PM, Philip Reames <listmail at
philipreames.com> wrote:
>>> 
>>>> RFC: GEP as canonical form for pointer addressing
>>>> 
>>>> I would like to propose that we designate GEPs as the canonical
form for pointer addressing in LLVM IR before CodeGenPrepare.
>>>> 
>>>> Corollaries
>>>> 1) It is legal for an optimizer to convert
inttoptr+arithmetic+inttoptr sequences to GEPs, but not vice versa.
>>>> 2) Input IR which does not contain inttoptr instructions will
never contain inttoptr instructions (before CodeGenPrepare.)
>>>> 
>>>> I've spoken with Nick Lewycky & Owen Anderson offline
at the last social.  On first reflection, both were okay with the proposal, but
I'd like broader buy-in and discussion.  Nick & Owen, if I've
accidentally misrepresented our discussion or you've had second thoughts
since, please speak up.
>>> FWIW, I think it would be nice if standard optimization passes have
this property of being well behaved with respect to pointer types, and I don’t
see a good reason for canonical IR passes to lose pointer types. I also think
it’s the only way to mix the optimization of pointer values with precise GC. It
seems that you just want LLVM developers to generally agree that certain passes
will be well behaved (you can disable any others). It may just be a matter of
documenting those passes. Ideally we could formalize this by declaring a pass as
pointer-safe and verifying. Can we easily verify that no memory access is based
on inttoptr?
>> Not directly related, but our canonical form for loops involving
pointers[1] turns a loop that contains a GEP with the loop induction variable
into a GEP with the increment inside the loop.  This has two annoying properties
for code generation:
>> 
>> - The GEP with the induction variable as the offset maps cleanly to CPU
addressing modes and so we generate better code if we don't do this
canonicalisation, and therefore end up trying to undo it in the back end (yuck).
>> 
>> - If the source is the start of an object, then this behaviour is
GC-hostile because it means that IR that contains a pointer to an object start
now only contains a pointer to the middle, requiring the GC to deal with inner
pointers.
>> 
>> It would be nice if we could have canonical forms such that if the
front end ensures that there are no inner pointers without pointers to the
object's start in the IR, the optimisers don't break this.
> While I agree that from a stylistic point of view this would be an
improvement, we don't actual *need* this to support precise GC.  It would
definitely result in cleaner code generation than our current scheme though.

I’m not opposed to preserving the GEP’s original base as a matter of convention
when there’s no good reason not to. But in general, passes expect to be able to
break up GEPs into smaller steps. We can’t guarantee that the original base will
be directly referenced at every point of use.

We should certainly avoid generating out-of-bounds GEPs without retaining some
in-bounds pointer, because that would break everyone’s conservative GC as well.
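
As a rough C analogy for breaking an address computation into smaller steps (the struct and function names here are hypothetical, chosen only for illustration), the same element can be reached in one step from the object base or in two steps through an interior pointer. In the split form, the final address is derived from the intermediate pointer rather than from the original base:

```c
#include <stddef.h>

struct obj {
    long header;
    long data[8];
};

/* One-step form: the element address is computed directly from the
   pointer to the start of the object. */
long load_direct(const struct obj *o, size_t i) {
    return o->data[i];
}

/* Split form: an interior pointer to the array field is formed first,
   and the element address is then derived from that interior pointer. */
long load_split(const struct obj *o, size_t i) {
    const long *field = o->data;    /* interior pointer into the object */
    const long *elem = field + i;   /* second step uses `field` as base */
    return *elem;
}
```

Both forms stay within the bounds of the object, so an in-bounds pointer is always available even though the second step no longer names the original base.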

-Andy
