On Tue, Sep 21, 2010 at 8:20 PM, Talin <viridia at gmail.com> wrote:> On Mon, Sep 20, 2010 at 3:16 PM, Talin <viridia at gmail.com> wrote: >> >> So I've managed to get my stack crawler working and passing its unit tests >> - this is the one I've been working on as an alternative to shadow-stack: it >> uses only static constant data structures (no global variables or >> thread-local data), which means that it's fully compatible with a >> multi-threaded environment. >> One question that has arisen, however, is what to do about function >> parameters that may be stack roots. You can't call llvm.gcroot() on a >> function parameter. The only thing I can think of is to copy any function >> parameter that is a pointer, or which contains a pointer, into a local >> variable. This seems complicated and inefficient to me (for one thing, it >> means that every stack frame just got 8 bytes bigger due to the need to >> store the 'this' pointer locally) - is there a better way?If a function parameter is a stack root, then so is the value passed through that parameter. Since they both point to the same thing, only one stack root is needed to keep it reachable. If you've got a root for it in the caller, you don't need one in the callee for the argument.> > Anyone? > Here's some examples of what I am talking about (I'll use Java for the > examples): > Example 1: > class Foo { > static String st = "Hello"; > void f1() { > f2(st); > } > void f2(String s) { > st = null; // Destroy static root > // Now allocate something, which might trigger > // a garbage collection. > StringBuffer sb = new StringBuffer(); > // Is 's' still live? It's not stored in any > // alloca, or in any global variable. It only > // exists as a function parameter, which cannot > // be declared as a root via llvm.gcroot.s is definitely still live assuming f1 copied st into a root before using it, which it generally should.> sb.append(s); > } > } > Example 2: > class Foo { > static void f1() { > // Create a new 'Foo' but don't store it > // anywhere. > new Foo().f2(); > } > void f2() { > // Possibly trigger a collection > StringBuffer sb = new StringBuffer(); > // Is 'this' still live at this point?Again, yes, if f1 copied the result of "new Foo()" into a root.> sb.append(toString()); > } > } > Now, I know that the LLVM "Accurate GC" document says that any "intermediate > values" must be declared as a GC root. My question is, does merely _loading_ > a global variable - or any mutable, non-strictly-local variable for that > matter - count as an "intermediate" value in this case?Yes, I believe it would. In a multithreaded system, if you load a pointer from a global or the heap, and some other thread changes the pointer you loaded, you'd better have a stack root for the original pointer value you loaded while you're using it. Now if the global is thread-local, you don't need another root for it since no other threads can see it. And if you can prove that a global isn't going to change, you also don't need another root for it.> My problem is that I can't see how to address this without generating > horribly bloated and inefficient code. Pretty much every function argument > that isn't a simple integer or float will have to be copied to a hidden > local variable before every function call - in addition to copying it to the > stack to create the call frame. Even local variables are not immune if you > allow closures in your language. Under this scenario, calling conventions > that pass arguments in registers are utterly futile and save nothing, > because everything has to get copied to the stack anyway.One way or another, stack roots *must* get copied to the stack, so that the GC can find them. Remember the GC is usually running in some other thread and can *only* find pointers that live somewhere in memory (either the stack or the heap).> I really wish I could simply declare function arguments as llvm.gcroots. For > arguments that are not in registers, it would treat it just like an alloca, > since they both represent stack slots. For arguments that are passed in > registers, LLVM should automatically lower it to a non-register argument, > making a copy if needed. (I can't do this myself, since I'm not supposed to > *know* which arguments are passed in registers or not - that depends on the > target and the code generators and such).I *believe* the code generator has some optimizations that eliminate redundant copies within a block. At any rate, if you keep your stack roots in the caller rather than the callee, then there'll be a number of cases where the caller needs to keep that stack root for purposes other than that one call. In those cases, having the stack root doesn't impose a cost you aren't already paying just to allow the GC to work.> That still leaves me with the problem of declaring as roots function > arguments that are the result of complex calculations, but those are far > fewer - declaring temporary vars for those won't bloat the code so badly. > But having to copy every single load of a non-local variable into a > temporary is a nightmare.You only have to copy pointers. If you've got a struct, you only need to keep a root for the pointer member(s) of that struct.
On Wed, Sep 22, 2010 at 5:58 AM, Kenneth Uildriks <kennethuil at gmail.com>wrote:> On Tue, Sep 21, 2010 at 8:20 PM, Talin <viridia at gmail.com> wrote: > > On Mon, Sep 20, 2010 at 3:16 PM, Talin <viridia at gmail.com> wrote: > >> > >> So I've managed to get my stack crawler working and passing its unit > tests > >> - this is the one I've been working on as an alternative to > shadow-stack: it > >> uses only static constant data structures (no global variables or > >> thread-local data), which means that it's fully compatible with a > >> multi-threaded environment. > >> One question that has arisen, however, is what to do about function > >> parameters that may be stack roots. You can't call llvm.gcroot() on a > >> function parameter. The only thing I can think of is to copy any > function > >> parameter that is a pointer, or which contains a pointer, into a local > >> variable. This seems complicated and inefficient to me (for one thing, > it > >> means that every stack frame just got 8 bytes bigger due to the need to > >> store the 'this' pointer locally) - is there a better way? > > If a function parameter is a stack root, then so is the value passed > through that parameter. Since they both point to the same thing, only > one stack root is needed to keep it reachable. If you've got a root > for it in the caller, you don't need one in the callee for the > argument. > > > > > Anyone? > > Here's some examples of what I am talking about (I'll use Java for the > > examples): > > Example 1: > > class Foo { > > static String st = "Hello"; > > void f1() { > > f2(st); > > } > > void f2(String s) { > > st = null; // Destroy static root > > // Now allocate something, which might trigger > > // a garbage collection. > > StringBuffer sb = new StringBuffer(); > > // Is 's' still live? It's not stored in any > > // alloca, or in any global variable. It only > > // exists as a function parameter, which cannot > > // be declared as a root via llvm.gcroot. > > s is definitely still live assuming f1 copied st into a root before > using it, which it generally should. > > > sb.append(s); > > } > > } > > Example 2: > > class Foo { > > static void f1() { > > // Create a new 'Foo' but don't store it > > // anywhere. > > new Foo().f2(); > > } > > void f2() { > > // Possibly trigger a collection > > StringBuffer sb = new StringBuffer(); > > // Is 'this' still live at this point? > > Again, yes, if f1 copied the result of "new Foo()" into a root. > > > sb.append(toString()); > > } > > } > > Now, I know that the LLVM "Accurate GC" document says that any > "intermediate > > values" must be declared as a GC root. My question is, does merely > _loading_ > > a global variable - or any mutable, non-strictly-local variable for that > > matter - count as an "intermediate" value in this case? > > Yes, I believe it would. In a multithreaded system, if you load a > pointer from a global or the heap, and some other thread changes the > pointer you loaded, you'd better have a stack root for the original > pointer value you loaded while you're using it. > > Now if the global is thread-local, you don't need another root for it > since no other threads can see it. And if you can prove that a global > isn't going to change, you also don't need another root for it. > > > > My problem is that I can't see how to address this without generating > > horribly bloated and inefficient code. Pretty much every function > argument > > that isn't a simple integer or float will have to be copied to a hidden > > local variable before every function call - in addition to copying it to > the > > stack to create the call frame. Even local variables are not immune if > you > > allow closures in your language. Under this scenario, calling conventions > > that pass arguments in registers are utterly futile and save nothing, > > because everything has to get copied to the stack anyway. > > One way or another, stack roots *must* get copied to the stack, so > that the GC can find them. Remember the GC is usually running in some > other thread and can *only* find pointers that live somewhere in > memory (either the stack or the heap). > > > I really wish I could simply declare function arguments as llvm.gcroots. > For > > arguments that are not in registers, it would treat it just like an > alloca, > > since they both represent stack slots. For arguments that are passed in > > registers, LLVM should automatically lower it to a non-register argument, > > making a copy if needed. (I can't do this myself, since I'm not supposed > to > > *know* which arguments are passed in registers or not - that depends on > the > > target and the code generators and such). > > I *believe* the code generator has some optimizations that eliminate > redundant copies within a block. At any rate, if you keep your stack > roots in the caller rather than the callee, then there'll be a number > of cases where the caller needs to keep that stack root for purposes > other than that one call. In those cases, having the stack root > doesn't impose a cost you aren't already paying just to allow the GC > to work. > > > That still leaves me with the problem of declaring as roots function > > arguments that are the result of complex calculations, but those are far > > fewer - declaring temporary vars for those won't bloat the code so badly. > > But having to copy every single load of a non-local variable into a > > temporary is a nightmare. > > You only have to copy pointers. If you've got a struct, you only need > to keep a root for the pointer member(s) of that struct. >Here's another way to think about the issue: Compare this whole approach to stack roots to that of a conservative collector. Since we know conservative collectors work, the whole reason for making an accurate collector in the first place is efficiency, right? If we couldn't make an accurate collector that was more efficient than a conservative collector, then no one would bother because it's such a pain. However, looking strictly at stack frames, a conservative collector has a much more compact stack layout. There's no need to copy anything to local variables, because by the time you call into the collector proper, everything - every register variable, every function parameter, and so on - is already on the stack *somewhere*. The stack frame layout dictated by llvm.gcroot, by contrast, is much, much fatter. In a typical deeply-nested call stack, there will be many copies of the same value on the stack. Let's take the 'this' pointer as an example. Say we have some method that calls itself recursively, and we are currently 10 levels deep. That means that we have 20 copies of the 'this' pointer on the stack: Because each invocation of the method has to push the 'this' pointer on the stack as part of the calling protocol (even if it's passed in registers, it still has to save the prior value of the register somewhere), and in addition we also have to save the 'this' pointer as a stack root. A conservative collector, on the other hand, would be able to get by with only 10 copies of the 'this' pointer. As far as the issue of eliminating redundant stores, I'm pretty sure that the optimizers cannot eliminate redundant *roots*. Figuring out the minimal set of temporary roots needed for a function call is a non-trivial problem (you don't want to just blindly make every pointer argument to a function a temporary root, since that just bloats the size of the call frame needlessly.) My entire goal here is to get a GC that has high performance and efficiency. I want my GC to be able to run on everything from a data center to a beagle board, and use the minimum resources possible. -- -- Talin -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20100922/b5103a7f/attachment.html>
snip> >> > My problem is that I can't see how to address this without generating >> > horribly bloated and inefficient code. Pretty much every function >> argument >> > that isn't a simple integer or float will have to be copied to a hidden >> > local variable before every function call - in addition to copying it to >> the >> > stack to create the call frame. Even local variables are not immune if >> you >> > allow closures in your language. Under this scenario, calling >> conventions >> > that pass arguments in registers are utterly futile and save nothing, >> > because everything has to get copied to the stack anyway. >> >> One way or another, stack roots *must* get copied to the stack, so >> that the GC can find them. Remember the GC is usually running in some >> other thread and can *only* find pointers that live somewhere in >> memory (either the stack or the heap). >> >I think the logic is that by putting the responsibility for having stack roots on the caller, the callee's do not need to concern themselves with their arguments. A function using a gc'ed pointer assumes that a caller somewhere up the call-chain has created a gcroot for that pointer that will outlive its invocation therefore it doesn't need to do anything with the garbage collector. In the case of the 'this' pointer, only the function that got the 'this' pointer from somewhere else (global, heap) would be responsible for creating a gcroot, once this is done the 'this' pointer can be passed through any number of calls and no more roots are required as long as the pointer is not stored somewhere that will outlive the invocation. snip -- Talin> > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >- Nathan -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20100922/fbd326fd/attachment.html>
Mark Shannon
2010-Sep-22 18:08 UTC
[LLVMdev] Stack roots and function parameters[MESSAGE NOT SCANNED]
Talin wrote: [snip]> > Here's another way to think about the issue: Compare this whole approach > to stack roots to that of a conservative collector. Since we know > conservative collectors work, the whole reason for making an accurate > collector in the first place is efficiency, right? If we couldn't make > an accurate collector that was more efficient than a conservative > collector, then no one would bother because it's such a pain. >One very important to use precise GC is to have a copying collector. For generational GC, a copying collector is faster and makes the allocator much simpler and faster.> However, looking strictly at stack frames, a conservative collector has > a much more compact stack layout. There's no need to copy anything to > local variables, because by the time you call into the collector proper, > everything - every register variable, every function parameter, and so > on - is already on the stack *somewhere*.The typical stack is *tiny* compared with the typical heap. Its pretty small compared with the nursery, so even for minor collections it won't matter.> > The stack frame layout dictated by llvm.gcroot, by contrast, is much, > much fatter. In a typical deeply-nested call stack, there will be many > copies of the same value on the stack. Let's take the 'this' pointer as > an example. Say we have some method that calls itself recursively, and > we are currently 10 levels deep. That means that we have 20 copies of > the 'this' pointer on the stack: Because each invocation of the method > has to push the 'this' pointer on the stack as part of the calling > protocol (even if it's passed in registers, it still has to save the > prior value of the register somewhere), and in addition we also have to > save the 'this' pointer as a stack root. A conservative collector, on > the other hand, would be able to get by with only 10 copies of the > 'this' pointer.If you're really worried about the extra roots, you could do liveness analysis. If its not live across a GC-safe point, then its not a root. In fact, liveness analysis is important anyway. Retaining a dead object will hurt your performance much more than tracing a live one twice.> > As far as the issue of eliminating redundant stores, I'm pretty sure > that the optimizers cannot eliminate redundant /roots/. Figuring out the > minimal set of temporary roots needed for a function call is a > non-trivial problem (you don't want to just blindly make every pointer > argument to a function a temporary root, since that just bloats the size > of the call frame needlessly.) > > My entire goal here is to get a GC that has high performance and > efficiency. I want my GC to be able to run on everything from a data > center to a beagle board, and use the minimum resources possible. >You might want to have a look at my GC(s), its probably too entangled with the rest of the toolkit to be useful directly, but it might be of interest. http://code.google.com/p/gvmt/> -- > -- Talin >Cheers, Mark
On Wed, Sep 22, 2010 at 11:12 AM, Talin <viridia at gmail.com> wrote:> My entire goal here is to get a GC that has high performance and efficiency. > I want my GC to be able to run on everything from a data center to a beagle > board, and use the minimum resources possible. > > -- > -- Talin >The best way I can think of to streamline garbage collection is to cut down on the amount of garbage it has to collect. I'm working on an optimization pass that transforms GC allocations into region allocations whenever it can... I'll share it when it's done. One of my goals is to handle enough cases that I can put off implementing GC for a while :)
Foo Here's another way to think about the issue: Compare this whole approach to stack roots to that of a conservative collector. Since we know conservative collectors work, the whole reason for making an accurate collector in the first place is efficiency, right? If we couldn't make an accurate collector that was more efficient than a conservative collector, then no one would bother because it's such a pain. However, looking strictly at stack frames, a conservative collector has a much more compact stack layout. There's no need to copy anything to local variables, because by the time you call into the collector proper, everything - every register variable, every function parameter, and so on - is already on the stack *somewhere*. The stack frame layout dictated by llvm.gcroot, by contrast, is much, much fatter. In a typical deeply-nested call stack, there will be many copies of the same value on the stack. Let's take the 'this' pointer as an example. Say we have some method that calls itself recursively, and we are currently 10 levels deep. That means that we have 20 copies of the 'this' pointer on the stack: Because each invocation of the method has to push the 'this' pointer on the stack as part of the calling protocol (even if it's passed in registers, it still has to save the prior value of the register somewhere), and in addition we also have to save the 'this' pointer as a stack root. A conservative collector, on the other hand, would be able to get by with only 10 copies of the 'this' pointer. As far as the issue of eliminating redundant stores, I'm pretty sure that the optimizers cannot eliminate redundant roots. Figuring out the minimal set of temporary roots needed for a function call is a non-trivial problem (you don't want to just blindly make every pointer argument to a function a temporary root, since that just bloats the size of the call frame needlessly.) My entire goal here is to get a GC that has high performance and efficiency. I want my GC to be able to run on everything from a data center to a beagle board, and use the minimum resources possible. -- -- Talin Foo Foo -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20100924/5a398760/attachment.html>
Forgive my top post but I hate Windows. J I am surprised you (Talin) say that "we know conservative collectors work" because my experience has very much been of them not working. Indeed, if you have 400Mb of allocated heap blocks on a 32-bit machine is there not a 10% chance of *each* random 32-bit int "pointing" into your heap, i.e. a false positive? I just did a simple test here (create an array with 1M random 32-bit ints and then allocate blocks n=1..2^16 of size n bytes) and Boehm's conservative GC sometimes takes 13s and other times runs out of memory after several minutes. It even complains about the "Repeated allocation of very large block (appr. size 65536)". Is 64kb really "very large" today? I think not. Performance is certainly an important reason to implement an accurate collector though. However, the performance of a conservative collector is surely dominated by the cost of searching to see if each int "points" to within any allocated block, a completely unnecessary operation that accurate collectors simply do not perform. That overhead will dwarf any doubling up of roots on the stack but the main problem with conservative collectors, as I say, is not that they are slow but that they leak. I agree that eliminating redundant roots is a desirable goal but I did only the most basic things for HLVM and was very happy with the results. So I think it is premature to value such optimizations. I personally would much rather see a mature VM built on top of LLVM. Cheers, Jon. Talin wrote: Here's another way to think about the issue: Compare this whole approach to stack roots to that of a conservative collector. Since we know conservative collectors work, the whole reason for making an accurate collector in the first place is efficiency, right? If we couldn't make an accurate collector that was more efficient than a conservative collector, then no one would bother because it's such a pain. However, looking strictly at stack frames, a conservative collector has a much more compact stack layout. There's no need to copy anything to local variables, because by the time you call into the collector proper, everything - every register variable, every function parameter, and so on - is already on the stack *somewhere*. The stack frame layout dictated by llvm.gcroot, by contrast, is much, much fatter. In a typical deeply-nested call stack, there will be many copies of the same value on the stack. Let's take the 'this' pointer as an example. Say we have some method that calls itself recursively, and we are currently 10 levels deep. That means that we have 20 copies of the 'this' pointer on the stack: Because each invocation of the method has to push the 'this' pointer on the stack as part of the calling protocol (even if it's passed in registers, it still has to save the prior value of the register somewhere), and in addition we also have to save the 'this' pointer as a stack root. A conservative collector, on the other hand, would be able to get by with only 10 copies of the 'this' pointer. As far as the issue of eliminating redundant stores, I'm pretty sure that the optimizers cannot eliminate redundant roots. Figuring out the minimal set of temporary roots needed for a function call is a non-trivial problem (you don't want to just blindly make every pointer argument to a function a temporary root, since that just bloats the size of the call frame needlessly.) My entire goal here is to get a GC that has high performance and efficiency. I want my GC to be able to run on everything from a data center to a beagle board, and use the minimum resources possible. -- -- Talin -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20100925/5c90ae48/attachment.html>
Reasonably Related Threads
- [LLVMdev] Stack roots and function parameters
- [LLVMdev] Stack roots and function parameters
- [LLVMdev] Interfacing llvm with a precise, relocating GC
- [LLVMdev] open source multithreaded garbage collector suitable for LLVM applications?
- [LLVMdev] Garbage Collection Roots