thr3ads.net - llvm dev - [LLVMdev] RFC: GEP as canonical form for pointer addressing [Feb 2014]

Philip Reames

2014-Feb-15 01:18 UTC

[LLVMdev] RFC: GEP as canonical form for pointer addressing

RFC: GEP as canonical form for pointer addressing

I would like to propose that we designate GEPs as the canonical form for 
pointer addressing in LLVM IR before CodeGenPrepare.

Corollaries
1) It is legal for an optimizer to convert inttoptr+arithmetic+inttoptr 
sequences to GEPs, but not vice versa.
2) Input IR which does not contain inttoptr instructions will never 
contain inttoptr instructions (before CodeGenPrepare.)

I've spoken with Nick Lewycky & Owen Anderson offline at the last 
social.  On first reflection, both were okay with the proposal, but I'd 
like broader buy-in and discussion.  Nick & Owen, if I've accidentally 
misrepresented our discussion or you've had second thoughts since, 
please speak up.


Background & Motivation

We want to support precise garbage collection(1) in LLVM.  To do so, we 
have written a pass which inserts safepoints and read and write barriers 
as appropriate.  This pass needs to be able to reliably(2) identify 
pointer vs non-pointer values.  It's advantageous to run this pass as 
late as practical in the optimization pipeline, but we can schedule it 
before lowering begins (i.e. before CodeGenPrepare).

We control the initial IR which is generated and can ensure that it does 
not contain any inttoptr instructions.  We're looking to have a 
guarantee(*) that a random LLVM optimization pass will not decide to 
replace GEPs with a sequence of ptrtoint, int arithmetic, and inttoptr 
which are hard for us to reason about.

* "guarantee" isn't really the right word here.  I'm really just looking 
to make sure that the community is comfortable with GEPs as canonical 
form.  If some pass decides to insert inttoptr instructions into 
otherwise clean IR, I want some assurance a patch fixing that would 
stand a good chance of being accepted.  I'm happy to do any cleanup 
required.

In addition to my own use case, here's a few others which might come up:
- Backends for targets which support different operations on pointers vs 
integers.  Examples would be some of the older mainframe architectures.  
(There'd be a lot more work needed to support this.)
- Various security related applications (e.g. CFI w.r.t. function pointers)

I don't really want to get into these applications in detail, mostly 
because I'm not particularly knowledgeable on those topics.  I'd 
appreciate any other applications anyone wants to throw out, but let's 
try to keep from derailing the discussion.  (As I did to Nick's original 
thread on DataLayout. :))

Notes:
1) We're not using the existing gc.root implementation strategy.  I plan 
on explaining why in a lot more detail once we're closer to having a 
complete implementation that we can upstream.  That should be coming 
relatively shortly.  (i.e. months, not weeks, not years)

2) As Nick pointed out in a separate thread, other types of typecasts 
can obscure pointer vs integer classifications.  (i.e. casting the base 
type of a pointer we then load through could load a field of the "wrong" 
type.)  I plan on responding to his point separately, but let's leave 
that out of this discussion for the moment.  Having GEPs as canonical 
form is a step forward by itself, even if I decide to propose something 
further down the road.

Philip

Hal Finkel

2014-Feb-15 15:22 UTC

[LLVMdev] RFC: GEP as canonical form for pointer addressing

----- Original Message -----
> From: "Philip Reames" <listmail at philipreames.com>
> To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Friday, February 14, 2014 7:18:21 PM
> Subject: [LLVMdev] RFC: GEP as canonical form for pointer addressing
> 
> RFC: GEP as canonical form for pointer addressing
> 
> I would like to propose that we designate GEPs as the canonical form for
> pointer addressing in LLVM IR before CodeGenPrepare.

Is this not already the case? I did not think that any passes introduce
inttoptr+arithmetic+inttoptr prior to CGP. On the other hand, we don't
convert inttoptr+arithmetic+inttoptr into GEP when we can (which is PR14226 --
where Eli (cc'd) said this is unsafe in general).

 -Hal
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Philip Reames

2014-Feb-15 20:27 UTC

[LLVMdev] RFC: GEP as canonical form for pointer addressing

On 02/15/2014 07:22 AM, Hal Finkel wrote:
> ----- Original Message -----
>> From: "Philip Reames" <listmail at philipreames.com>
>> To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
>> Sent: Friday, February 14, 2014 7:18:21 PM
>> Subject: [LLVMdev] RFC: GEP as canonical form for pointer addressing
>>
>> RFC: GEP as canonical form for pointer addressing
>>
>> I would like to propose that we designate GEPs as the canonical form for
>> pointer addressing in LLVM IR before CodeGenPrepare.
> Is this not already the case?

If it is, my proposal is even less controversial than I'd hoped it would 
be.  :)  However, given the number of folks I've talked to about this 
who haven't said so, I'm expecting the answer is no.

> I did not think that any passes introduce inttoptr+arithmetic+inttoptr
> prior to CGP.

To my knowledge, none currently do.  I want to keep it that way.

> On the other hand, we don't convert inttoptr+arithmetic+inttoptr into
> GEP when we can (which is PR14226 -- where Eli (cc'd) said this is unsafe in
> general).

To be clear, this is somewhat separate from my proposal.  I only care 
that inttoptr+arithmetic+inttoptr sequences aren't inserted in the place 
of GEPs.

Having said that..., it's still an interesting point to discuss.

I read through the PR and I have to admit I don't understand why the 
pointer aliasing rules would prevent such a transform.  The final GEP is 
already "based on"* the base of the original GEP.  The transformation 
doesn't affect that at all.  Can you (or Eli) spell this out a bit for 
me?  I'm missing something.

* from "A pointer value formed by an inttoptr is /based/ on all pointer 
values that contribute (directly or indirectly) to the computation of 
the pointer’s value." in the LangRef

Philip




Andrew Trick

2014-Feb-15 23:55 UTC

[LLVMdev] RFC: GEP as canonical form for pointer addressing

On Feb 14, 2014, at 5:18 PM, Philip Reames <listmail at philipreames.com> wrote:

> RFC: GEP as canonical form for pointer addressing
> 
> I would like to propose that we designate GEPs as the canonical form for
> pointer addressing in LLVM IR before CodeGenPrepare.
> 
> Corollaries
> 1) It is legal for an optimizer to convert inttoptr+arithmetic+inttoptr
> sequences to GEPs, but not vice versa.
> 2) Input IR which does not contain inttoptr instructions will never contain
> inttoptr instructions (before CodeGenPrepare.)
> 
> I've spoken with Nick Lewycky & Owen Anderson offline at the last
> social.  On first reflection, both were okay with the proposal, but I'd like
> broader buy-in and discussion.  Nick & Owen, if I've accidentally
> misrepresented our discussion or you've had second thoughts since, please
> speak up.

FWIW, I think it would be nice if standard optimization passes have this
property of being well behaved with respect to pointer types, and I don’t see a
good reason for canonical IR passes to lose pointer types. I also think it’s the
only way to mix the optimization of pointer values with precise GC. It seems
that you just want LLVM developers to generally agree that certain passes will
be well behaved (you can disable any others). It may just be a matter of
documenting those passes. Ideally we could formalize this by declaring a pass as
pointer-safe and verifying. Can we easily verify that no memory access is based
on inttoptr?

-Andy

David Chisnall

2014-Feb-17 10:31 UTC

[LLVMdev] RFC: GEP as canonical form for pointer addressing

On 15 Feb 2014, at 23:55, Andrew Trick <atrick at apple.com> wrote:

> On Feb 14, 2014, at 5:18 PM, Philip Reames <listmail at philipreames.com> wrote:
> 
>> RFC: GEP as canonical form for pointer addressing
>> 
>> I would like to propose that we designate GEPs as the canonical form
>> for pointer addressing in LLVM IR before CodeGenPrepare.
>> 
>> Corollaries
>> 1) It is legal for an optimizer to convert inttoptr+arithmetic+inttoptr
>> sequences to GEPs, but not vice versa.
>> 2) Input IR which does not contain inttoptr instructions will never
>> contain inttoptr instructions (before CodeGenPrepare.)
>> 
>> I've spoken with Nick Lewycky & Owen Anderson offline at the
>> last social.  On first reflection, both were okay with the proposal, but I'd
>> like broader buy-in and discussion.  Nick & Owen, if I've accidentally
>> misrepresented our discussion or you've had second thoughts since, please
>> speak up.
> 
> FWIW, I think it would be nice if standard optimization passes have this
> property of being well behaved with respect to pointer types, and I don’t see a
> good reason for canonical IR passes to lose pointer types. I also think it’s the
> only way to mix the optimization of pointer values with precise GC. It seems
> that you just want LLVM developers to generally agree that certain passes will
> be well behaved (you can disable any others). It may just be a matter of
> documenting those passes. Ideally we could formalize this by declaring a pass as
> pointer-safe and verifying. Can we easily verify that no memory access is based
> on inttoptr?

Not directly related, but our canonical form for loops involving pointers[1]
turns a loop that contains a GEP with the loop induction variable into a GEP
with the increment inside the loop.  This has two annoying properties for code
generation:

- The GEP with the induction variable as the offset maps cleanly to CPU
addressing modes and so we generate better code if we don't do this
canonicalisation, and therefore end up trying to undo it in the back end (yuck).

- If the source is the start of an object, then this behaviour is GC-hostile
because it means that IR that contains a pointer to an object start now only
contains a pointer to the middle, requiring the GC to deal with inner pointers.

It would be nice if we could have canonical forms such that if the front end
ensures that there are no inner pointers without pointers to the object's
start in the IR, the optimisers don't break this.
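
To make the second point concrete, here is a rough sketch in Python that models addresses as plain integers. It is only an illustration of the two loop shapes described above, not LLVM code; `ELEM_SIZE`, `indexed_form`, `incremented_form` and `inner_pointer_steps` are hypothetical names, and "live pointer" is simplified to the single pointer value the loop keeps around.

```python
ELEM_SIZE = 4  # bytes per element; an arbitrary value for this illustration


def indexed_form(base, n):
    """Loop that addresses base + i * size: the base pointer stays live."""
    trace = []
    for i in range(n):
        addr = base + i * ELEM_SIZE  # like a GEP indexed by the induction variable
        trace.append({"live_ptr": base, "access": addr})
    return trace


def incremented_form(base, n):
    """Loop that bumps the pointer itself: the base is not needed after entry."""
    trace = []
    p = base
    for _ in range(n):
        trace.append({"live_ptr": p, "access": p})
        p += ELEM_SIZE  # like a GEP by one element inside the loop
    return trace


def inner_pointer_steps(trace, object_start):
    """Iterations at which the only live pointer is not the object start."""
    return [k for k, step in enumerate(trace) if step["live_ptr"] != object_start]


idx = indexed_form(1000, 3)
inc = incremented_form(1000, 3)
```

Both loops touch the same addresses, but in the incremented form every iteration after the first holds only a pointer into the middle of the object, which is what forces the collector to handle inner pointers.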

David

[1] Are canonical forms actually documented anywhere, or are they simply
undocumented implicit contracts?

Philip Reames

2014-Feb-18 19:21 UTC

[LLVMdev] RFC: GEP as canonical form for pointer addressing

On 02/15/2014 03:55 PM, Andrew Trick wrote:
> On Feb 14, 2014, at 5:18 PM, Philip Reames <listmail at philipreames.com> wrote:
>
>> RFC: GEP as canonical form for pointer addressing
>>
>> I would like to propose that we designate GEPs as the canonical form
>> for pointer addressing in LLVM IR before CodeGenPrepare.
>>
>> Corollaries
>> 1) It is legal for an optimizer to convert inttoptr+arithmetic+inttoptr
>> sequences to GEPs, but not vice versa.
>> 2) Input IR which does not contain inttoptr instructions will never
>> contain inttoptr instructions (before CodeGenPrepare.)
>>
>> I've spoken with Nick Lewycky & Owen Anderson offline at the
>> last social.  On first reflection, both were okay with the proposal, but I'd
>> like broader buy-in and discussion.  Nick & Owen, if I've accidentally
>> misrepresented our discussion or you've had second thoughts since, please
>> speak up.
> FWIW, I think it would be nice if standard optimization passes have this
> property of being well behaved with respect to pointer types, and I don’t see a
> good reason for canonical IR passes to lose pointer types. I also think it’s the
> only way to mix the optimization of pointer values with precise GC. It seems
> that you just want LLVM developers to generally agree that certain passes will
> be well behaved (you can disable any others). It may just be a matter of
> documenting those passes.

You could phrase it this way.  I would push for "everything before 
CodeGenPrepare", but am open to counterarguments on why a smaller set 
should be selected.  :)

> Ideally we could formalize this by declaring a pass as pointer-safe and
> verifying. Can we easily verify that no memory access is based on inttoptr?

Yes.  The only slight complication is dealing with phi nodes and 
selects (which feed into GEPs), but assuming you're willing to accept a 
slightly conservative answer, it's definitely doable.

I have a pass locally which effectively does this.  It's a side effect 
of its primary purpose, but getting that extracted as a distinct pass 
(or part of the verifier) shouldn't be difficult.
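
As a rough illustration of the kind of check discussed above, the following Python sketch walks a toy IR rather than LLVM's real data structures. Everything in it is an assumption made for the example: the `Inst` record, the opcode strings, the operand order (`store` takes value then pointer, `select` takes condition first), and the choice to treat arguments, allocas, globals, call results and loaded pointers as clean roots. It is not the local pass mentioned above. The conservative part is that a phi or select counts as tainted if any incoming pointer is, and any unrecognised producer is treated as tainted.

```python
from dataclasses import dataclass, field


@dataclass
class Inst:
    """Toy instruction: result name, opcode, operand names (all hypothetical)."""
    name: str
    op: str
    operands: list = field(default_factory=list)


# Assumption: pointers produced by these opcodes are treated as clean roots.
CLEAN_ROOTS = {"arg", "alloca", "global", "call", "load"}


def based_on_inttoptr(name, defs, seen=None):
    """Return True if pointer `name` may be derived from an inttoptr."""
    seen = set() if seen is None else seen
    if name in seen:  # already visited (e.g. a loop phi); nothing new on this path
        return False
    seen.add(name)
    inst = defs.get(name)
    if inst is None or inst.op == "inttoptr":
        return True  # unknown definitions are conservatively tainted
    if inst.op in CLEAN_ROOTS:
        return False
    if inst.op in ("gep", "bitcast"):
        sources = inst.operands[:1]  # only the base pointer matters
    elif inst.op == "select":
        sources = inst.operands[1:]  # skip the condition
    elif inst.op == "phi":
        sources = inst.operands
    else:
        return True  # unrecognised producer: stay conservative
    return any(based_on_inttoptr(s, defs, seen) for s in sources)


def offending_accesses(insts):
    """Names of loads/stores whose address may be based on an inttoptr."""
    defs = {i.name: i for i in insts}
    bad = []
    for i in insts:
        if i.op == "load":
            ptr = i.operands[0]
        elif i.op == "store":
            ptr = i.operands[1]
        else:
            continue
        if based_on_inttoptr(ptr, defs):
            bad.append(i.name)
    return bad


clean = [Inst("p", "arg"), Inst("q", "gep", ["p", "i"]), Inst("v", "load", ["q"])]
dirty = [
    Inst("p", "arg"),
    Inst("n", "ptrtoint", ["p"]),
    Inst("m", "add", ["n", "8"]),
    Inst("r", "inttoptr", ["m"]),
    Inst("s", "select", ["c", "p", "r"]),
    Inst("w", "load", ["s"]),
]
loop = [
    Inst("p", "arg"),
    Inst("cur", "phi", ["p", "next"]),
    Inst("next", "gep", ["cur", "1"]),
    Inst("x", "load", ["cur"]),
]
```

With these inputs, `offending_accesses(clean)` and `offending_accesses(loop)` are empty, while `offending_accesses(dirty)` reports the load `w`, because one arm of the select comes from an inttoptr.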

Ivan Godard

2014-Feb-20 06:11 UTC

[LLVMdev] RFC: GEP as canonical form for pointer addressing

Philip Reames <listmail <at> philipreames.com> writes:
> 
> RFC: GEP as canonical form for pointer addressing 

<snip>

> In addition to my own use case, here's a few others which might come up:
> - Backends for targets which support different operations on pointers vs 
> integers.  Examples would be some of the older mainframe architectures.  
> (There'd be a lot more work needed to support this.)
> 
> Philip
> 

It's not just old mainframes, it's some of the newest architectures as well.
The Mill general-purpose architecture (http://ootbcomp.com) has non-integer
pointers and distinct pointer operations too. That LLVM loses pointerhood is
the biggest problem that we have identified while looking into using LLVM as
our supported compiler. It may be a killer, and we may have to fall back to
gcc. That would be a shame, but it does appear that the IR makes rash
assumptions about machine architecture.

"There'd be a lot more work needed to support this" is not encouraging to
see, especially for a startup company with limited resources and little
prior exposure to LLVM internals.

Ivan

David Chisnall

2014-Feb-20 09:02 UTC

[LLVMdev] RFC: GEP as canonical form for pointer addressing

On 20 Feb 2014, at 06:11, Ivan Godard <ivan at ootbcomp.com> wrote:

> It's not just old mainframes, it's some of the newest architecture as well.
> The Mill general-purpose architecture (http://ootbcomp.com) has non-integer
> pointers and distinct pointer operations too. That LLVM loses pointerhood is
> the biggest problem that we have identified while looking into using LLVM as
> our supported compiler. It may be a killer, and we may have to fall back to
> gcc. That would be a shame, but it does appear that the ir makes rash
> assumptions about machine architecture.
> 
> "There'd be a lot more work needed to support this" is not encouraging to
> see, especially for a startup company with limited resources and little
> prior exposure to LLVM internals.

Just to add, I spend a fair bit of my time in the computer architecture research
community, and pointers that are not integers are an increasingly common model. 
They simplify various dependency analysis paths in the pipeline (giving fewer
pipeline flushes) and make certain kinds of security features significantly
easier to implement.

Architectures that separate pointers from integers are not becoming rarer. 
They are increasingly common in application-specific processors and likely to
reappear in mainstream processors over the next 5-10 years.

We have managed to get LLVM working (and building nontrivial amounts of code) on
a MIPS-derived architecture that has non-integer pointers, and the
representation in the IR itself is fine.  We have a few hacks in optimisations
that are far too coarse grained (i.e. don't do this optimisation if
you're dealing with this kind of pointer, even though many of them [SCEV in
particular] should work but the code makes invalid assumptions).  We do end up
having to add more after every merge.

We start to hit problems when we get to SelectionDAG, which makes a lot of
assumptions about the underlying architecture and has an annoying habit of
thinking it knows better than the back end and undoing transformations that the
back end has done.

David

P.S. The Mill is a very interesting architecture, but I'm very glad I'm
not the one responsible for instruction scheduling on it...
