Joseph Tremoulet via llvm-dev
2016-Feb-05 20:22 UTC
[llvm-dev] gc relocations on exception path w/RS4GC currently broken
Sorry to reply to myself here, but I had an idea regarding "issue #2" -- possibly what makes the most sense for those clients/targets is to pull the pointer difference computation/reapplication into RS4GC itself. It could have a pass just before or after rematerialization, which runs based on a configuration flag (eventually to be driven by GCStrategy), and which performs rewrites like below to ensure that only base pointers are live across statepoints when it's done (plus a bit of bookkeeping w.r.t. recomputeliveness and/or the SSA update at the end to make sure the rewritten pointers don't get reported and do get uses of the original value replaced with them).

Thanks
-Joseph

-----Original Message-----
Subject: Re: [llvm-dev] gc relocations on exception path w/RS4GC currently broken

Working on this, I've run into a couple of potential issues on which I'd like to solicit feedback.

To give a concrete example, we're talking about having RS4GC see a GC-safepoint call like so:

    %a = _ ; gc pointer
    %b = _ ; gc pointer
    ...
    invoke void @callee()
        to label %cont unwind label %pad
  cont:
    _ = %a
    ...
  pad:
    landingpad _
    _ = %b
    ...

and transform it into:

    %b.gc_spill = alloca <ty>
    ...
    %a = _
    %b = _
    ...
    store <ty> %b, <ty>* %b.gc_spill
    %sp = invoke token @llvm.experimental.gc.statepoint(<arg list that indicates %a is a gc pointer and %b.gc_spill holds a gc pointer>)
        to label %cont unwind label %pad
  cont:
    %a.reloc = call <ty> @llvm.experimental.gc.relocate(token %sp, <index of %a>, <index of %a>)
    _ = %a.reloc
    ...
  pad:
    landingpad _
    %b.gc_reload = load <ty>, <ty>* %b.gc_spill
    _ = %b.gc_reload
    ...

which would then get lowered to a call with a stack map reporting %a (or the slot that lowering spills %a to) and %b.gc_spill as holding live gc pointers.


Issue #1: obscurability of the %b.gc_spill use on the gc.statepoint invoke

Some target runtimes/GCs (CoreCLR included) need to have stack slots reported directly by offset.
If code runs between RS4GC and lowering that somehow rewrites the argument on the statepoint corresponding to b's spill to be anything other than a direct use of the static alloca that RS4GC allocated to hold the spill, the best we could do is have the lowering introduce another layer of indirection.
E.g., continuing the above example, if something after RS4GC obscures %b.gc_spill on the statepoint:

    %b.gc_spill = alloca <ty>
    ...
    %a = _
    %b = _
    ...
    %p = _ ; like maybe a PHI that has %b.gc_spill as an incoming value
    ...
    store <ty> %b, <ty>* %b.gc_spill ; may or may not have been rewritten as store into %p
    %sp = invoke token @llvm.experimental.gc.statepoint(<arg list that indicates %a is a gc pointer and %p holds a gc pointer>)
        to label %cont unwind label %pad
  cont:
    %a.reloc = call <ty> @llvm.experimental.gc.relocate(token %sp, <index of %a>, <index of %a>)
    _ = %a.reloc
    ...
  pad:
    landingpad _
    %b.gc_reload = load <ty>, <ty>* %b.gc_spill
    _ = %b.gc_reload
    ...

then lowering would effectively have to insert another indirection:

    %b.gc_spill = alloca <ty>
    ...
    %a = _
    %b = _
    ...
    %p = _ ; like maybe a PHI that has %b.gc_spill as an incoming value
    ...
    store <ty> %b, <ty>* %b.gc_spill ; may or may not have been rewritten as store into %p
    %p.deref = load <ty>, <ty>* %p
    %sp = invoke token @llvm.experimental.gc.statepoint(<arg list that indicates %a and %p.deref are gc pointers>)
        to label %cont unwind label %pad
  cont:
    store <ty> %p.deref, <ty>* %p
    %a.reloc = call <ty> @llvm.experimental.gc.relocate(token %sp, <index of %a>, <index of %a>)
    _ = %a.reloc
    ...
  pad:
    landingpad _
    store <ty> %p.deref, <ty>* %p
    %b.gc_reload = load <ty>, <ty>* %b.gc_spill
    _ = %b.gc_reload
    ...

(and the code to insert the %p <-> %p.deref loads/stores would have to be something in CodeGenPrep or [ugh] directly in StatepointLowering).
I'm curious to know if others think this is problematic or not.
I know that for LLILC we intend to run RS4GC late in the pass list and could probably just discount the possibility of the spill slot allocas on the gc.statepoint invoke getting obscured (or at least could be ok with having the bail-out lowering/CGP code that patches things up with extra stores/loads, on the assumption that it's rare in practice to hit these cases), but I'm not sure how representative LLILC is of the community in that regard. Similarly, I have the impression that we're moving generally toward wedding RS4GC more with CGP, but I would be interested to know if I'm off the mark there.


Issue #2: Relocating derived pointers by IR injection

I know there's been some discussion about runtimes which require the pointers reported directly to them to be base object pointers. So e.g. with code like this:

    %p = _ ; <some base object pointer>
    %q = <getelementptr getting a pointer at some offset from %p>
    ...
    call @callee()
    _ = %q

then RS4GC will generate

    %p = _ ; <some base object pointer>
    %q = <getelementptr getting a pointer at some offset from %p>
    ...
    %sp = call token @llvm.experimental.gc.statepoint(<args indicating %p and %q are gc pointers, with %p as %q's base>)
    %p.reloc = call @llvm.experimental.gc.relocate(token %sp, <index of p>, <index of p>)
    %q.reloc = call @llvm.experimental.gc.relocate(token %sp, <index of p>, <index of q>)
    _ = %q.reloc

The default lowering of that is to spill %p and %q to the stack just before the call, and lower the gc.relocate calls as loads from those slots after the calls, with the understanding that the stack map will communicate to the GC that q's slot is derived from p's slot and the GC will update both pointers appropriately. However, for targets where the interface with the runtime only allows reporting base pointers, lowering would have to effect something like the following transformation:

    %p = _ ; <some base object pointer>
    %q = <getelementptr getting a pointer at some offset from %p>
    ...
    %d = <compute %q - %p>
    %sp = call token @llvm.experimental.gc.statepoint(<args indicating %p is a gc pointer>)
    %p.reloc = call @llvm.experimental.gc.relocate(token %sp, <index of p>, <index of p>) ; lower to load
    %q.reloc = call @llvm.experimental.gc.relocate(token %sp, <index of p>, <index of q>) ; lower to %p.reloc + %d
    _ = %q.reloc

Now, if we consider the same situation but on an exception path where there are no explicit gc.relocate calls because RS4GC spilled along the exception path:

    %p.gc_spill = alloca _
    %q.gc_spill = alloca _
    %p = _ ; <some base object pointer>
    %q = <getelementptr getting a pointer at some offset from %p>
    ...
    store <ty> %p, <ty>* %p.gc_spill
    store <ty> %q, <ty>* %q.gc_spill
    %sp = invoke token @llvm.experimental.gc.statepoint(<args indicating %p.gc_spill and %q.gc_spill hold gc pointers, with %p.gc_spill as %q.gc_spill's base>)
        to label _, unwind label %pad
  pad:
    landingpad _
    %p.reload = load <ty>, <ty>* %p.gc_spill
    %q.reload = load <ty>, <ty>* %q.gc_spill
    _ = %q.reload

then the best you could do is something like this:

    %p.gc_spill = alloca _
    %q.gc_spill = alloca _
    %p = _ ; <some base object pointer>
    %q = <getelementptr getting a pointer at some offset from %p>
    ...
    store <ty> %p, <ty>* %p.gc_spill
    store <ty> %q, <ty>* %q.gc_spill
    ; compute difference
    %p.to_compute_d = load <ty>, <ty>* %p.gc_spill
    %q.to_compute_d = load <ty>, <ty>* %q.gc_spill
    %d = <compute %q - %p>
    %sp = invoke token @llvm.experimental.gc.statepoint(<args indicating %p.gc_spill and %q.gc_spill hold gc pointers, with %p.gc_spill as %q.gc_spill's base>)
        to label _, unwind label %pad
  pad:
    landingpad _
    %p.to_compute_q = load <ty>, <ty>* %p.gc_spill
    %q.recomputed = <compute %p.to_compute_q + %d>
    store <ty> %q.recomputed, <ty>* %q.gc_spill
    ; continue as normal
    %p.reload = load <ty>, <ty>* %p.gc_spill
    %q.reload = load <ty>, <ty>* %q.gc_spill
    _ = %q.reload

There are a number of redundant loads/stores in that sequence, and I'm not sure it's reasonable to generate them and expect them to get cleaned up later (especially since "later" is post-optimizer).

So it seems like, for that use case, explicit spilling in RS4GC gets in the way more than it helps. But I'm not sure how important that use case is to anybody (Manuel, is this the approach PyPy is taking?), or what would be preferable to people for whom it is important: simply to continue using gc.relocates on the exceptional path and live with non-token linkage between your landingpads and gc.relocates? A different scheme where RS4GC doesn't generate `alloca`s and `store`s and `load`s directly, but rather some family of intrinsics like `gc.spill.alloca`/`gc.spill.store`/`gc.spill.load` where the conceptual slot's type would be token rather than pointer, so as to be unobscurable until lowering? Something else entirely?


I'm interested to hear what others think, about both issues above.
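
For readers who want the arithmetic behind issue #2 spelled out, here is a minimal sketch in Python rather than IR. It only models the idea of a runtime that is told about the base pointer alone: the offset d = q - p is saved before the safepoint and reapplied to the relocated base afterwards. The names `Frame` and `relocate_base_only` are hypothetical and exist only for this illustration; nothing here is RS4GC or LLVM API.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical model of the two values live across one safepoint."""
    base: int     # address of the object itself (%p in the example)
    derived: int  # interior pointer into the same object (%q)


def relocate_base_only(frame: Frame, moved_to: int) -> Frame:
    """Model a safepoint where only the base pointer is reported.

    moved_to is the new address the (assumed) collector picked for the
    object; the derived pointer is rebuilt from the saved offset.
    """
    d = frame.derived - frame.base   # %d = <compute %q - %p>
    new_base = moved_to              # the collector rewrites only %p
    return Frame(base=new_base, derived=new_base + d)  # %p.reloc + %d


if __name__ == "__main__":
    before = Frame(base=0x1000, derived=0x1018)
    print(relocate_base_only(before, moved_to=0x8000))
```

The offset is an ordinary integer across the safepoint, so it needs no entry in the stack map; that is the property the base-pointer-only runtimes rely on.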

Thanks
-Joseph

-----Original Message-----
From: Philip Reames [mailto:listmail at philipreames.com]
Sent: Friday, January 22, 2016 3:36 PM
To: llvm-dev <llvm-dev at lists.llvm.org>; Joseph Tremoulet <jotrem at microsoft.com>; Manuel Jacob <me at manueljacob.de>; chenli at azulsystems.com; Sanjoy Das <sanjoy at playingwithpointers.com>
Subject: FYI: gc relocations on exception path w/RS4GC currently broken

For anyone following along on ToT using the gc.statepoint mechanism, you should know that ToT is currently not able to express arbitrary exceptional control flow and relocations along exceptional edges. This is a direct result of moving the gc.statepoint representation to using a token type landingpad. Essentially, we have a design inconsistency where we expect to be able to "resume" a phi of arbitrary landing pads, but we expect relocations to be tied specifically to a particular invoke.

Chen, Joseph, and I have spent some time talking about how to resolve this. All of the schemes we've come up with representing relocations using gc.relocates on the exceptional path require either a change to how we define an invoke instruction (something we'd really like to avoid) or a new intrinsic with special treatment in the optimizer so that it basically "becomes part of" the landing pad without actually being the landing pad. None of us were particularly thrilled by the changes involved.

Given exceptional paths are nearly by definition cold, we're currently exploring another option. We're considering having RS4GC insert explicit spill slots at the IR level (via allocas) for values live along exceptional paths, and leaving all of the normal path values represented as gc.relocates.
This avoids the need for another IR extension, makes it slightly easier to meet an ABI requirement Joseph has, and provides a better platform for lowering experimentation. Joseph is working on implementing this and will probably have something up for review next week or the week after. Once that's in, we're going to run some performance experiments to see if it's a viable lowering strategy even without Joseph's particular ABI requirement, and if so, make that the standard way of representing relocations on exceptional edges.

Assuming this approach works, we're going to defer solving the problem of how to cleanly represent explicit relocations along the exceptional path until a later point in time. In particular, the value of the explicit relocations comes mainly from being able to lower them efficiently to register uses. Since the work to integrate relocations with the register allocator hasn't happened and doesn't look like it's going to happen in the near term (*), this seems like a reasonable compromise.

Philip

(*) To give some context on this, it turns out one of our initial starting assumptions was wrong in practice. We expected the quality of lowering for the gc arguments at statepoint/safepoint to be very important for overall code quality. While this may some day become true, we've found that whenever we encounter a hot safepoint, the problem is usually that we didn't inline appropriately. As a result, we've ended up fixing (out of tree) inlining or devirtualization bugs rather than working on the lowering itself. For us, a truly hot megamorphic call site has turned out to be a very rare beast. Worth noting is that this is only true because we're a high tier JIT with good profiling information. It's likely that other users who don't have the same design point may find the lowering far more problematic; in fact, we have some evidence this may already be true.

_______________________________________________
LLVM Developers mailing list
Sanjoy Das via llvm-dev
2016-Feb-06 00:04 UTC
[llvm-dev] gc relocations on exception path w/RS4GC currently broken
For #1, perhaps we need a third kind of encoding, which we could call (for the lack of a better name), "VeryIndirect". A VeryIndirect location implies that the heap reference is stored in the location **(Reg + Offset). With that in place, we'll have three different forms of locations:

    Direct == the reference *is* Reg+Offset
    Indirect == the reference is *(Reg+Offset)
    VeryIndirect == the reference is **(Reg+Offset)

(This following bit is re-iterating what Joseph and I talked about on Skype, so that everyone is up to speed)

gc.statepoint would then have two different "argument regions" for reporting heap references, "unspilled" and "spilled". Lowering for the "unspilled" region would be what we currently have for GC references. Lowering for the "spilled" region would be: emit code normally (i.e. what we do today), but if you were going to report the location as Direct, then report it as Indirect (since the spill is already present in the IR), and if you were going to report it as Indirect, then report it as VeryIndirect (since we'll have two spills now). RS4GC would construct the statepoint with normal SSA references in the "unspilled" section, and allocas in the "spilled" section.

For #2, I like your idea of teaching RS4GC to not emit "live derived pointers" at all. It is conceptually the same transform as our "rematerialize simple GEPs" optimization, except that we now need to be able to do this for correctness.

-- Sanjoy
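
The three location forms proposed in this message, and the "one level deeper" rule for the "spilled" region, can be modeled in a few lines. This is a toy sketch in Python, not the stack map format: `LocKind`, `spilled_region_kind`, and `read_reference` are hypothetical names, memory is a plain dict from address to value, and the numeric kind values are just dereference counts chosen for this example.

```python
from enum import Enum


class LocKind(Enum):
    """Number of dereferences needed to reach the heap reference."""
    DIRECT = 0         # the reference *is* Reg+Offset
    INDIRECT = 1       # the reference is *(Reg+Offset)
    VERY_INDIRECT = 2  # the reference is **(Reg+Offset)


def spilled_region_kind(normal_kind: LocKind) -> LocKind:
    """Kind to report for a "spilled"-region argument.

    normal_kind is what lowering would have reported for the same value in
    the "unspilled" region; the IR-level spill adds one level of indirection.
    """
    if normal_kind is LocKind.DIRECT:
        return LocKind.INDIRECT
    if normal_kind is LocKind.INDIRECT:
        return LocKind.VERY_INDIRECT
    raise ValueError("no encoding deeper than VERY_INDIRECT in this model")


def read_reference(memory: dict, reg_value: int, offset: int, kind: LocKind) -> int:
    """Follow a reported location to the heap reference it denotes."""
    value = reg_value + offset
    for _ in range(kind.value):
        value = memory[value]  # one load per level of indirection
    return value


if __name__ == "__main__":
    memory = {100: 200, 200: 300}
    for kind in LocKind:
        print(kind.name, read_reference(memory, 96, 4, kind))
```

A runtime that cannot process the deepest form would need the producer to guarantee that `spilled_region_kind` is never applied to a location that is already `INDIRECT`.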
Joseph Tremoulet via llvm-dev
2016-Feb-06 01:46 UTC
[llvm-dev] gc relocations on exception path w/RS4GC currently broken
Thanks, I think that's a useful way to look at it (though if I wanted to bikeshed I'd suggest the name "DoubleIndirect" as a bit more precise than "VeryIndirect").

An aspect of it that I'm still puzzling over is that my target runtime (at least in its current form) doesn't have a way to represent/process a "VeryIndirect" pointer. So I'd like to be able to guarantee that only "Direct" and (single)"Indirect" slots get reported. And then it's not clear to me what bit of code should be responsible for ensuring that there are no "VeryIndirect" slots at the end of the day. Does statepoint lowering on the DAG need to be able to inject loads/stores to convert a "VeryIndirect" to a (single)"Indirect"? Should CodeGenPrepare be responsible for doing that rewrite at the IR level (and is it reasonable to assume that nothing after CGP would do the inverse)? Is it "good enough" to just know that RS4GC won't directly emit the pattern that lowers to "VeryIndirect", and have clients that care like LLILC run RS4GC "right before" CGP? Or were you suggesting something different, like somehow on the machine code we should insert loads and stores if needed when we see a "VeryIndirect" in the stack map?

It occurs to me that the expansion needed is very similar to the expansion that the lowering currently does to spill gc-pointer SSA values and produce "Indirect" slots, just with a load prepended before the spill and stores appended after the fills at the ends; but the current mechanism for that lowering keys off the gc.relocate calls for generating the fills, and the gc.relocate calls wouldn't be present for the cases that need to be "VeryIndirect"...
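
To make the shape of that expansion concrete, here is a rough Python model (not LLVM code) of turning a would-be "VeryIndirect" report into a single "Indirect" one: load the reference before the safepoint into a slot that lowering owns, report only that slot, and store the possibly updated value back afterwards. Everything here is an assumption for illustration: memory is a dict, `make_moving_gc` is a stand-in collector driven by a forwarding table, and `safepoint_via_temp_slot` is a hypothetical name.

```python
def make_moving_gc(forwarding):
    """Return a toy collector that updates references held in reported slots.

    forwarding maps old object addresses to new ones; a real collector
    decides this itself, so the table is purely an assumption of the model.
    """
    def gc(memory, indirect_slots):
        for slot in indirect_slots:
            old = memory[slot]
            memory[slot] = forwarding.get(old, old)
    return gc


def safepoint_via_temp_slot(memory, p_slot, temp_slot, gc):
    """Run one safepoint while reporting only a single-Indirect slot.

    p_slot holds the address of the spill slot (the obscured %p), and the
    spill slot in turn holds the heap reference.
    """
    spill_slot = memory[p_slot]             # value of %p
    memory[temp_slot] = memory[spill_slot]  # load prepended before the spill
    gc(memory, [temp_slot])                 # stack map reports temp_slot only
    memory[spill_slot] = memory[temp_slot]  # store appended after the fill


if __name__ == "__main__":
    memory = {10: 20, 20: 0x1000, 30: 0}
    safepoint_via_temp_slot(memory, 10, 30, make_moving_gc({0x1000: 0x2000}))
    print(hex(memory[20]))
```

In this model the store back has to happen on every path leaving the safepoint, which is why the lack of gc.relocate calls on the exceptional path matters: there is nothing there for the lowering to key the fill off.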
Thanks -Joseph -----Original Message----- From: Sanjoy Das [mailto:sanjoy at playingwithpointers.com] Sent: Friday, February 5, 2016 7:05 PM To: Joseph Tremoulet <jotrem at microsoft.com> Cc: Philip Reames <listmail at philipreames.com>; Manuel Jacob <me at manueljacob.de>; llvm-dev <llvm-dev at lists.llvm.org> Subject: Re: [llvm-dev] gc relocations on exception path w/RS4GC currently broken For #1, perhaps we need a third kind of encoding, which we could call (for the lack of a better name), "VeryIndirect". A VeryIndirect location implies that the heap reference is stored in the location **(Reg + Offset). With that in place, we'll have three different forms of locations: Direct == the reference *is* Reg+Offset Indirect == the reference is *(Reg+Offset) VeryIndirect == the reference is **(Reg+Offset) (This following bit is re-iterating what Joseph and I talked about on Skype, so that everyone is up to speed) gc.statepoint would then have two different "argument regions" for reporting heap references, "unspilled" and "spilled". Lowering for the "unspilled" region would what we currently have for GC references. Lowering for the "spilled" region would be: emit code normally (i.e. what we do today), but if you were going to report the location as Direct, then report it as Indirect (since the spill is already present in the IR), and if you were going to report it as Indirect, then report it as VeryIndirect (since we'll have two spills now). RS4GC would construct statepoint with normal SSA references in the "unspilled" section, and allocas in the "spilled" section. For #2, I like your idea of teaching RS4GC to not emit "live derived pointers" at all. It is conceptually the same transform as our "rematerialize simple GEPs" optimization, except that we now need to be able to do this for correctness. 
-- Sanjoy Joseph Tremoulet wrote:> Sorry to reply to myself here, but I had an idea regarding "issue #2" > -- possibly what makes the most sense for those clients/targets is to > pull the pointer difference computation/reapplication into RS4GC > itself -- it could have a pass just before or after rematerialization, > which runs based on a configuration flag (eventually to be driven by > GCStrategy), which performs rewrites like below to ensure that only > base pointers are live across statepoints when it's done (plus a bit > of bookkeeping w.r.t. recomputeliveness and/or the ssa update at the > end to make sure the rewritten pointers don't get reported and do get > uses of the original value replaced with them) > > Thanks > -Joseph > . > > -----Original Message----- > Subject: Re: [llvm-dev] gc relocations on exception path w/RS4GC > currently broken > > Working on this, I've run into a couple potential issues regarding which I'd like to solicit feedback. > > To give a concrete example, we're talking about having RS4GC see a GC-safepoint call like so: > > %a = _ ; gc pointer > %b = _ ; gc pointer > ... > invoke void @callee() > to label %cont unwind label %pad > cont: > _ = %a > ... > pad: > landingpad _ > _ = %b > ... > > and transform it into: > > %b.gc_spill = alloca<ty> > ... > %a = _ > %b = _ > ... > store<ty> %b,<ty>* %b.gc_spill > %sp = invoke token @llvm.experimental.gc.statepoint(<arg list that indicates %a is a gc pointer and %b.gc_spill holds a gc pointer>) > to label %cont unwind label %pad > cont: > %a.reloc = call<ty> @llvm.experimental.gc.relocate(token %sp,<index of %a>,<index of %a>) > _ = %a.reloc > ... > pad: > landingpad _ > %b.gc_reload = load<ty>,<ty>* %b.gc_spill > _ = %b.gc_reload > ... > > which would then get lowered to a call with a stack map reporting %a (or the slot that lowering spills %a to) and %b.gc_spill as holding live gc pointers. 
>
>
> Issue #1: obscurability of the %b.gc_spill use on the gc.statepoint invoke
>
> Some target runtimes/GCs (CoreCLR included) need to have stack slots reported directly by offset. If code runs between RS4GC and lowering that somehow rewrites the argument on the statepoint corresponding to b's spill to be anything other than a direct use of the static alloca that RS4GC allocated to hold the spill, the best we could do is have the lowering introduce another layer of indirection.
> E.g., continuing the above example, if something after RS4GC obscures %b.gc_spill on the statepoint:
>
>     %b.gc_spill = alloca <ty>
>     ...
>     %a = _
>     %b = _
>     ...
>     %p = _ ; like maybe a PHI that has %b.gc_spill as an incoming value
>     ...
>     store <ty> %b, <ty>* %b.gc_spill ; may or may not have been rewritten as store into %p
>     %sp = invoke token @llvm.experimental.gc.statepoint(<arg list that indicates %a is a gc pointer and %p holds a gc pointer>)
>         to label %cont unwind label %pad
>   cont:
>     %a.reloc = call <ty> @llvm.experimental.gc.relocate(token %sp, <index of %a>, <index of %a>)
>     _ = %a.reloc
>     ...
>   pad:
>     landingpad _
>     %b.gc_reload = load <ty>, <ty>* %b.gc_spill
>     _ = %b.gc_reload
>     ...
>
> then lowering would effectively have to insert another indirection:
>
>     %b.gc_spill = alloca <ty>
>     ...
>     %a = _
>     %b = _
>     ...
>     %p = _ ; like maybe a PHI that has %b.gc_spill as an incoming value
>     ...
>     store <ty> %b, <ty>* %b.gc_spill ; may or may not have been rewritten as store into %p
>     %p.deref = load <ty>, <ty>* %p
>     %sp = invoke token @llvm.experimental.gc.statepoint(<arg list that indicates %a and %p.deref are gc pointers>)
>         to label %cont unwind label %pad
>   cont:
>     store <ty> %p.deref, <ty>* %p
>     %a.reloc = call <ty> @llvm.experimental.gc.relocate(token %sp, <index of %a>, <index of %a>)
>     _ = %a.reloc
>     ...
>   pad:
>     landingpad _
>     store <ty> %p.deref, <ty>* %p
>     %b.gc_reload = load <ty>, <ty>* %b.gc_spill
>     _ = %b.gc_reload
>     ...
>
> (and the code to insert the %p <-> %p.deref loads/stores would have to be something in CodeGenPrep or [ugh] directly in StatepointLowering).
> I'm curious to know if others think this is problematic or not. I know that for LLILC we intend to run RS4GC late in the pass list and could probably just discount the possibility of the spill slot allocas on the gc.statepoint invoke getting obscured (or at least could be ok with having the bail-out lowering/CGP code that patches things up with extra stores/loads, on the assumption that it's rare in practice to hit these cases), but I'm not sure how representative LLILC is of the community in that regard. Similarly, I have the impression that we're moving generally toward wedding RS4GC more with CGP, but I would be interested to know if I'm off the mark there.
>
>
> Issue #2: Relocating derived pointers by IR injection
>
> I know there's been some discussion about runtimes which require the pointers reported directly to them to be base object pointers. So e.g. with code like this:
>
>     %p = _ ; <some base object pointer>
>     %q = <getelementptr getting a pointer at some offset from %p>
>     ...
>     call @callee()
>     _ = %q
>
> then RS4GC will generate
>
>     %p = _ ; <some base object pointer>
>     %q = <getelementptr getting a pointer at some offset from %p>
>     ...
>     %sp = call token @llvm.experimental.gc.statepoint(<args indicating %p and %q are gc pointers, with %p as %q's base>)
>     %p.reloc = call @llvm.experimental.gc.relocate(token %sp, <index of p>, <index of p>)
>     %q.reloc = call @llvm.experimental.gc.relocate(token %sp, <index of p>, <index of q>)
>     _ = %q.reloc
>
> The default lowering of that is to spill %p and %q to the stack just before the call, and lower the gc.relocate calls as loads from those slots after the calls, with the understanding that the stack map will communicate to the GC that q's slot is derived from p's slot and the GC will update both pointers appropriately.
> However, for targets where the interface with the runtime only allows reporting base pointers, lowering would have to effect something like the following transformation:
>
>     %p = _ ; <some base object pointer>
>     %q = <getelementptr getting a pointer at some offset from %p>
>     ...
>     %d = <compute %q - %p>
>     %sp = call token @llvm.experimental.gc.statepoint(<args indicating %p is a gc pointer>)
>     %p.reloc = call @llvm.experimental.gc.relocate(token %sp, <index of p>, <index of p>) ; lower to load
>     %q.reloc = call @llvm.experimental.gc.relocate(token %sp, <index of p>, <index of q>) ; lower to %p.reloc + %d
>     _ = %q.reloc
>
> Now, if we consider the same situation but on an exception path where there are no explicit gc.relocate calls because RS4GC spilled along the exception path:
>
>     %p.gc_spill = alloca _
>     %q.gc_spill = alloca _
>     %p = _ ; <some base object pointer>
>     %q = <getelementptr getting a pointer at some offset from %p>
>     ...
>     store <ty> %p, <ty>* %p.gc_spill
>     store <ty> %q, <ty>* %q.gc_spill
>     %sp = invoke token @llvm.experimental.gc.statepoint(<args indicating %p.gc_spill and %q.gc_spill hold gc pointers, with %p.gc_spill as %q.gc_spill's base>)
>         to label _, unwind label %pad
>   pad:
>     landingpad _
>     %p.reload = load <ty>, <ty>* %p.gc_spill
>     %q.reload = load <ty>, <ty>* %q.gc_spill
>     _ = %q.reload
>
> then the best you could do is something like this:
>
>     %p.gc_spill = alloca _
>     %q.gc_spill = alloca _
>     %p = _ ; <some base object pointer>
>     %q = <getelementptr getting a pointer at some offset from %p>
>     ...
>     store <ty> %p, <ty>* %p.gc_spill
>     store <ty> %q, <ty>* %q.gc_spill
>     ; compute difference
>     %p.to_compute_d = load <ty>, <ty>* %p.gc_spill
>     %q.to_compute_d = load <ty>, <ty>* %q.gc_spill
>     %d = <compute %q - %p>
>     %sp = invoke token @llvm.experimental.gc.statepoint(<args indicating %p.gc_spill and %q.gc_spill hold gc pointers, with %p.gc_spill as %q.gc_spill's base>)
>         to label _, unwind label %pad
>   pad:
>     landingpad _
>     %p.to_compute_q = load <ty>, <ty>* %p.gc_spill
>     %q.recomputed = <compute %p.to_compute_q + %d>
>     store <ty> %q.recomputed, <ty>* %q.gc_spill
>     ; continue as normal
>     %p.reload = load <ty>, <ty>* %p.gc_spill
>     %q.reload = load <ty>, <ty>* %q.gc_spill
>     _ = %q.reload
>
> There are a number of redundant loads/stores in that sequence, and I'm not sure it's reasonable to generate them and expect them to get cleaned up later (especially since "later" is post-optimizer).
>
> So it seems like, for that use case, explicit spilling in RS4GC gets in the way more than it helps. But I'm not sure how important that use case is to anybody (Manuel, is this the approach PyPy is taking?), or what would be preferable to people for whom it is important: simply to continue using gc.relocates on the exceptional path and live with non-token linkage between your landingpads and gc.relocates? A different scheme where RS4GC doesn't generate `alloca`s and `store`s and `load`s directly, but rather some family of intrinsics like `gc.spill.alloca`/`gc.spill.store`/`gc.spill.load` where the conceptual slot's type would be token rather than pointer, so as to be unobscurable until lowering? Something else entirely?
>
>
> I'm interested to hear what others think, about both issues above.
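
The "compute %q - %p, then reapply" rewrite discussed under issue #2 can be sketched with a small Python simulation. This is only an illustration under assumptions: `run_statepoint`, the pointer names, the addresses, and the `relocate` callback (standing in for a moving GC) are all hypothetical and are not part of RS4GC.

```python
def run_statepoint(bases, derived, relocate):
    """Simulate a statepoint that only reports base pointers.

    bases:    name -> address of a base object pointer
    derived:  name -> (name of its base, address inside that object)
    relocate: function mapping an old base address to its new address
    """
    # Before the statepoint: replace each derived pointer by its offset
    # from its base (the "%d = %q - %p" step).
    offsets = {name: addr - bases[base]
               for name, (base, addr) in derived.items()}

    # At the statepoint: only base pointers are visible to the GC,
    # so only they are relocated.
    new_bases = {name: relocate(addr) for name, addr in bases.items()}

    # After the statepoint: rebuild each derived pointer from the
    # relocated base (the "%p.reloc + %d" step).
    new_derived = {name: (base, new_bases[base] + offsets[name])
                   for name, (base, _) in derived.items()}
    return new_bases, new_derived


# Assumed example: %p is a base pointer, %q points 0x18 bytes into it,
# and the collector moves every object up by 0x4000.
bases = {"p": 0x1000}
derived = {"q": ("p", 0x1018)}
new_bases, new_derived = run_statepoint(bases, derived, lambda a: a + 0x4000)
```

Because the offset is an ordinary integer rather than a gc pointer, it never has to be reported to the runtime, which is what lets only base pointers stay live across the statepoint.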
>
> Thanks
> -Joseph
>
>
> -----Original Message-----
> From: Philip Reames [mailto:listmail at philipreames.com]
> Sent: Friday, January 22, 2016 3:36 PM
> To: llvm-dev <llvm-dev at lists.llvm.org>; Joseph Tremoulet <jotrem at microsoft.com>; Manuel Jacob <me at manueljacob.de>; chenli at azulsystems.com; Sanjoy Das <sanjoy at playingwithpointers.com>
> Subject: FYI: gc relocations on exception path w/RS4GC currently broken
>
> For anyone following along on ToT using the gc.statepoint mechanism, you should know that ToT is currently not able to express arbitrary exceptional control flow and relocations along exceptional edges. This is a direct result of moving the gc.statepoint representation to using a token type landingpad. Essentially, we have a design inconsistency where we expect to be able to "resume" a phi of arbitrary landing pads, but we expect relocations to be tied specifically to a particular invoke.
>
> Chen, Joseph, and I have spent some time talking about how to resolve this. All of the schemes we've come up with representing relocations using gc.relocates on the exceptional path require either a change to how we define an invoke instruction (something we'd really like to avoid) or a new intrinsic with special treatment in the optimizer so that it basically "becomes part of" the landing pad without actually being the landing pad. None of us were particularly thrilled by the changes involved.
>
> Given exceptional paths are nearly by definition cold, we're currently exploring another option. We're considering having RS4GC insert explicit spill slots at the IR level (via allocas) for values live along exceptional paths, and leaving all of the normal path values represented as gc.relocates.
> This avoids the need for another IR extension, makes it slightly easier to meet an ABI requirement Joseph has, and provides a better platform for lowering experimentation. Joseph is working on implementing this and will probably have something up for review next week or the week after. Once that's in, we're going to run some performance experiments to see if it's a viable lowering strategy even without Joseph's particular ABI requirement, and if so, make that the standard way of representing relocations on exceptional edges.
>
> Assuming this approach works, we're going to defer solving the problem of how to cleanly represent explicit relocations along the exceptional path until a later point in time. In particular, the value of the explicit relocations comes mainly from being able to lower them efficiently to register uses. Since the work to integrate relocations with the register allocator hasn't happened and doesn't look like it's going to happen in the near term (*), this seems like a reasonable compromise.
>
> Philip
>
> (*) To give some context on this, it turns out one of our initial starting assumptions was wrong in practice. We expected the quality of lowering for the gc arguments at statepoint/safepoint to be very important for overall code quality. While this may some day become true, we've found that whenever we encounter a hot safepoint, the problem is usually that we didn't inline appropriately. As a result, we've ended up fixing (out of tree) inlining or devirtualization bugs rather than working on the lowering itself. For us, a truly hot megamorphic call site has turned out to be a very rare beast. Worth noting is that this is only true because we're a high tier JIT with good profiling information. It's likely that other users who don't have the same design point may find the lowering far more problematic; in fact, we have some evidence this may already be true.