Philip Reames
2014-Feb-15 01:18 UTC
[LLVMdev] RFC: GEP as canonical form for pointer addressing
RFC: GEP as canonical form for pointer addressing

I would like to propose that we designate GEPs as the canonical form for pointer addressing in LLVM IR before CodeGenPrepare.

Corollaries
1) It is legal for an optimizer to convert ptrtoint+arithmetic+inttoptr sequences to GEPs, but not vice versa.
2) Input IR which does not contain inttoptr instructions will never contain inttoptr instructions (before CodeGenPrepare).

I've spoken with Nick Lewycky & Owen Anderson offline at the last social. On first reflection, both were okay with the proposal, but I'd like broader buy-in and discussion. Nick & Owen, if I've accidentally misrepresented our discussion or you've had second thoughts since, please speak up.

Background & Motivation

We want to support precise garbage collection(1) in LLVM. To do so, we have written a pass which inserts safepoints, read barriers, and write barriers as appropriate. This pass needs to be able to reliably(2) identify pointer vs non-pointer values. It's advantageous to run this pass as late as practical in the optimization pipeline, but we can schedule it before lowering begins (i.e. before CodeGenPrepare).

We control the initial IR which is generated and can ensure that it does not contain any inttoptr instructions. We're looking to have a guarantee(*) that a random LLVM optimization pass will not decide to replace GEPs with a sequence of ptrtoint, int arithmetic, and inttoptr, which is hard for us to reason about.

* "guarantee" isn't really the right word here. I'm really just looking to make sure that the community is comfortable with GEPs as canonical form. If some pass decides to insert inttoptr instructions into otherwise clean IR, I want some assurance a patch fixing that would stand a good chance of being accepted. I'm happy to do any cleanup required.

In addition to my own use case, here are a few others which might come up:
- Backends for targets which support different operations on pointers vs integers. Examples would be some of the older mainframe architectures. (There'd be a lot more work needed to support this.)
- Various security related applications (e.g. CFI w.r.t. function pointers)

I don't really want to get into these applications in detail, mostly because I'm not particularly knowledgeable on those topics. I'd appreciate any other applications anyone wants to throw out, but let's try to keep from derailing the discussion. (As I did to Nick's original thread on DataLayout. :))

Notes:
1) We're not using the existing gc.root implementation strategy. I plan on explaining why in a lot more detail once we're closer to having a complete implementation that we can upstream. That should be coming relatively shortly. (i.e. months, not weeks, not years)

2) As Nick pointed out in a separate thread, other types of typecasts can obscure pointer vs integer classifications. (i.e. casting the base type of a pointer we then load through could load a field of the "wrong" type) I plan on responding to his point separately, but let's leave that out of this discussion for the moment. Having GEPs as canonical form is a step forward by itself, even if I decide to propose something further down the road.

Philip
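To make the two addressing forms in the proposal concrete, here is a minimal sketch, not anything from the thread itself. The IR snippets are hypothetical and use present-day textual syntax (the 2014 syntax differed), and the helper names `inttoptr_lines` and `is_clean` are invented for illustration. The check is purely textual and would miss inttoptr constant expressions.

```python
import re

# Hypothetical IR: address the fifth i32 of an array with a GEP.
GEP_FORM = """
define i32 @load_fifth(ptr %base) {
  %addr = getelementptr i32, ptr %base, i64 4
  %val = load i32, ptr %addr
  ret i32 %val
}
"""

# Hypothetical IR: the same address computed through integer arithmetic.
INT_FORM = """
define i32 @load_fifth(ptr %base) {
  %int = ptrtoint ptr %base to i64
  %sum = add i64 %int, 16
  %addr = inttoptr i64 %sum to ptr
  %val = load i32, ptr %addr
  ret i32 %val
}
"""

# Matches inttoptr used as an instruction opcode (after "= ").
INTTOPTR = re.compile(r"=\s*inttoptr\b")


def inttoptr_lines(ir_text):
    """Return the stripped lines that define a value with inttoptr."""
    return [line.strip() for line in ir_text.splitlines() if INTTOPTR.search(line)]


def is_clean(ir_text):
    """True if no instruction in the text is an inttoptr."""
    return not inttoptr_lines(ir_text)


for name, text in (("GEP form", GEP_FORM), ("integer form", INT_FORM)):
    print(name, "clean" if is_clean(text) else inttoptr_lines(text))
```

Under corollary 2, IR in the first form that enters the optimizer would still pass `is_clean` at any point before CodeGenPrepare.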
Hal Finkel
2014-Feb-15 15:22 UTC
[LLVMdev] RFC: GEP as canonical form for pointer addressing
----- Original Message -----
> From: "Philip Reames" <listmail at philipreames.com>
> To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Friday, February 14, 2014 7:18:21 PM
> Subject: [LLVMdev] RFC: GEP as canonical form for pointer addressing
>
> RFC: GEP as canonical form for pointer addressing
>
> I would like to propose that we designate GEPs as the canonical form for
> pointer addressing in LLVM IR before CodeGenPrepare.

Is this not already the case? I did not think that any passes introduce ptrtoint+arithmetic+inttoptr prior to CGP. On the other hand, we don't convert ptrtoint+arithmetic+inttoptr into GEP when we can (which is PR14226 -- where Eli (cc'd) said this is unsafe in general).

 -Hal

> <snip>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Philip Reames
2014-Feb-15 20:27 UTC
[LLVMdev] RFC: GEP as canonical form for pointer addressing
On 02/15/2014 07:22 AM, Hal Finkel wrote:
>> I would like to propose that we designate GEPs as the canonical form for
>> pointer addressing in LLVM IR before CodeGenPrepare.
> Is this not already the case?

If it is, my proposal is even less controversial than I'd hoped it would be. :) However, given the number of folks who I've talked to about this who haven't said so, I expect the answer is no.

> I did not think that any passes introduce ptrtoint+arithmetic+inttoptr prior to CGP.

To my knowledge, none currently do. I want to keep it that way.

> On the other hand, we don't convert ptrtoint+arithmetic+inttoptr into GEP when we can (which is PR14226 -- where Eli (cc'd) said this is unsafe in general).

To be clear, this is somewhat separate from my proposal. I only care that ptrtoint+arithmetic+inttoptr sequences aren't inserted in the place of GEPs.

Having said that, it's still an interesting point to discuss. I read through the PR and I have to admit I don't understand why the pointer aliasing rules would prevent such a transform. The final GEP is already "based on"* the base of the original GEP. The transformation doesn't affect that at all. Can you (or Eli) spell this out a bit for me? I'm missing something.

* from "A pointer value formed by an inttoptr is /based/ on all pointer values that contribute (directly or indirectly) to the computation of the pointer’s value." in the LangRef

Philip
Andrew Trick
2014-Feb-15 23:55 UTC
[LLVMdev] RFC: GEP as canonical form for pointer addressing
On Feb 14, 2014, at 5:18 PM, Philip Reames <listmail at philipreames.com> wrote:
> RFC: GEP as canonical form for pointer addressing
>
> I would like to propose that we designate GEPs as the canonical form for pointer addressing in LLVM IR before CodeGenPrepare.
>
> Corollaries
> 1) It is legal for an optimizer to convert ptrtoint+arithmetic+inttoptr sequences to GEPs, but not vice versa.
> 2) Input IR which does not contain inttoptr instructions will never contain inttoptr instructions (before CodeGenPrepare).

FWIW, I think it would be nice if standard optimization passes have this property of being well behaved with respect to pointer types, and I don’t see a good reason for canonical IR passes to lose pointer types. I also think it’s the only way to mix the optimization of pointer values with precise GC.

It seems that you just want LLVM developers to generally agree that certain passes will be well behaved (you can disable any others). It may just be a matter of documenting those passes. Ideally we could formalize this by declaring a pass as pointer-safe and verifying. Can we easily verify that no memory access is based on inttoptr?

-Andy

> <snip>
David Chisnall
2014-Feb-17 10:31 UTC
[LLVMdev] RFC: GEP as canonical form for pointer addressing
On 15 Feb 2014, at 23:55, Andrew Trick <atrick at apple.com> wrote:
> <snip>
> Ideally we could formalize this by declaring a pass as pointer-safe and verifying. Can we easily verify that no memory access is based on inttoptr?

Not directly related, but our canonical form for loops involving pointers[1] turns a loop that contains a GEP with the loop induction variable into a GEP with the increment inside the loop. This has two annoying properties for code generation:

- The GEP with the induction variable as the offset maps cleanly to CPU addressing modes and so we generate better code if we don't do this canonicalisation, and therefore end up trying to undo it in the back end (yuck).
- If the source is the start of an object, then this behaviour is GC-hostile because it means that IR that contains a pointer to an object start now only contains a pointer to the middle, requiring the GC to deal with inner pointers.

It would be nice if we could have canonical forms such that if the front end ensures that there are no inner pointers without pointers to the object's start in the IR, the optimisers don't break this.

David

[1] Are canonical forms actually documented anywhere, or are they simply undocumented implicit contracts?
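The GC concern with the two loop forms can be illustrated with a toy model. This is only a sketch under stated assumptions: addresses are plain integers, `ELEM_SIZE` and all function names are invented here, and the "live set" is a hand-written approximation of which pointer values a collector could see in each iteration, with the base assumed dead once the running pointer has been initialised.

```python
ELEM_SIZE = 4  # assumed element size in bytes


def indexed_loop(base, count):
    """GEP indexed by the induction variable: the object start stays live."""
    trace = []
    for i in range(count):
        addr = base + i * ELEM_SIZE       # roughly: gep base, i
        trace.append((addr, {base}))      # (address accessed, live pointer values)
    return trace


def incremented_loop(base, count):
    """Pointer bumped inside the loop: only the running pointer stays live."""
    trace = []
    ptr = base
    for _ in range(count):
        trace.append((ptr, {ptr}))
        ptr += ELEM_SIZE                  # roughly: ptr = gep ptr, 1
    return trace


def iterations_without_object_start(trace, base):
    """Iterations in which no live pointer refers to the start of the object."""
    return [n for n, (_, live) in enumerate(trace) if base not in live]


indexed = indexed_loop(1000, 4)
bumped = incremented_loop(1000, 4)
print([addr for addr, _ in indexed] == [addr for addr, _ in bumped])
print(iterations_without_object_start(indexed, 1000))
print(iterations_without_object_start(bumped, 1000))
```

Both forms touch the same addresses, but in this model the incremented form leaves the collector with only an interior pointer after the first iteration.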
Philip Reames
2014-Feb-18 19:21 UTC
[LLVMdev] RFC: GEP as canonical form for pointer addressing
On 02/15/2014 03:55 PM, Andrew Trick wrote:
> FWIW, I think it would be nice if standard optimization passes have this property of being well behaved with respect to pointer types, and I don’t see a good reason for canonical IR passes to lose pointer types. I also think it’s the only way to mix the optimization of pointer values with precise GC. It seems that you just want LLVM developers to generally agree that certain passes will be well behaved (you can disable any others). It may just be a matter of documenting those passes.

You could phrase it this way. I would push for "everything before CodeGenPrepare", but am open to counterarguments on why a smaller set should be selected. :)

> Ideally we could formalize this by declaring a pass as pointer-safe and verifying. Can we easily verify that no memory access is based on inttoptr?

Yes. The only slight complication is dealing with phi nodes and selects (which feed into GEPs), but assuming you're willing to accept a slightly conservative answer, it's definitely doable. I have a pass locally which effectively does this. It's a side effect of its primary purpose, but getting that extracted as a distinct pass (or part of the verifier) shouldn't be difficult.

> -Andy
> <snip>
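One way to picture the conservative check discussed above is a walk over a toy def-use graph. This is a sketch only, not the local pass mentioned in the message. The dictionary-based IR model, the opcode set, and the function name are all hypothetical, and treating arguments, loads, and calls as clean roots is an assumption made here for brevity.

```python
# Toy IR model: value name -> (opcode, list of pointer operand names).
# Hypothetical stand-in for LLVM's def-use graph; not LLVM's API.
POINTER_FORWARDING = {"getelementptr", "bitcast", "phi", "select"}


def may_be_based_on_inttoptr(name, defs, seen=None):
    """Conservatively report whether `name` may derive from an inttoptr.

    A phi or select is flagged if ANY incoming pointer may be tainted,
    which is where the answer becomes conservative.
    """
    if seen is None:
        seen = set()
    if name in seen:                # already explored, or a cycle through a phi
        return False
    seen.add(name)
    opcode, operands = defs[name]
    if opcode == "inttoptr":
        return True
    if opcode in POINTER_FORWARDING:
        return any(may_be_based_on_inttoptr(op, defs, seen) for op in operands)
    return False                    # assumed clean roots: arguments, loads, calls


defs = {
    "base": ("argument", []),
    "field": ("getelementptr", ["base"]),
    "raw": ("inttoptr", []),
    "merged": ("phi", ["field", "raw"]),
    "iv": ("phi", ["base", "iv.next"]),       # loop-carried pointer
    "iv.next": ("getelementptr", ["iv"]),
}
# (memory access, pointer operand) pairs to verify.
accesses = [("load1", "field"), ("load2", "merged"), ("store1", "iv.next")]
flagged = [acc for acc, ptr in accesses if may_be_based_on_inttoptr(ptr, defs)]
print(flagged)
```

The `seen` set keeps the walk from looping forever on loop-carried phis while still finding any inttoptr reachable from the queried pointer.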
Ivan Godard
2014-Feb-20 06:11 UTC
[LLVMdev] RFC: GEP as canonical form for pointer addressing
Philip Reames <listmail <at> philipreames.com> writes:
> RFC: GEP as canonical form for pointer addressing
<snip>
> In addition to my own use case, here's a few others which might come up:
> - Backends for targets which support different operations on pointers vs
> integers. Examples would be some of the older mainframe architectures.
> (There'd be a lot more work needed to support this.)

It's not just old mainframes, it's some of the newest architectures as well. The Mill general-purpose architecture (http://ootbcomp.com) has non-integer pointers and distinct pointer operations too. That LLVM loses pointerhood is the biggest problem that we have identified while looking into using LLVM as our supported compiler. It may be a killer, and we may have to fall back to gcc. That would be a shame, but it does appear that the IR makes rash assumptions about machine architecture.

"There'd be a lot more work needed to support this" is not encouraging to see, especially for a startup company with limited resources and little prior exposure to LLVM internals.

Ivan
David Chisnall
2014-Feb-20 09:02 UTC
[LLVMdev] RFC: GEP as canonical form for pointer addressing
On 20 Feb 2014, at 06:11, Ivan Godard <ivan at ootbcomp.com> wrote:
> It's not just old mainframes, it's some of the newest architecture as well.
> The Mill general-purpose architecture (http://ootbcomp.com) has non-integer
> pointers and distinct pointer operations too. That LLVM loses pointerhood is
> the biggest problem that we have identified while looking into using LLVM as
> our supported compiler. It may be a killer, and we may have to fall back to
> gcc. That would be a shame, but it does appear that the ir makes rash
> assumptions about machine architecture.
>
> "There'd be a lot more work needed to support this" is not encouraging to
> see, especially for a startup company with limited resources and little
> prior exposure to LLVM internals.

Just to add, I spend a fair bit of my time in the computer architecture research community, and pointers that are not integers are an increasingly common model. They simplify various dependency analysis paths in the pipeline (giving fewer pipeline flushes) and make certain kinds of security features significantly easier to implement. Architectures that separate pointers from integers are not becoming rarer. They are increasingly common in application-specific processors and likely to reappear in mainstream processors over the next 5-10 years.

We have managed to get LLVM working (and building nontrivial amounts of code) on a MIPS-derived architecture that has non-integer pointers, and the representation in the IR itself is fine. We have a few hacks in optimisations that are far too coarse-grained (i.e. don't do this optimisation if you're dealing with this kind of pointer, even though many of them [SCEV in particular] should work but the code makes invalid assumptions). We do end up having to add more after every merge.

We start to hit problems when we get to SelectionDAG, which makes a lot of assumptions about the underlying architecture and has an annoying habit of thinking it knows better than the back end and undoing transformations that the back end has done.

David

P.S. The Mill is a very interesting architecture, but I'm very glad I'm not the one responsible for instruction scheduling on it...