thr3ads.net - llvm dev - [LLVMdev] getting started with IR needing GC [Apr 2008]

If this information is useful, please help other people find it:
Share via:

Terence Parr

2008-Apr-21 23:12 UTC

[LLVMdev] getting started with IR needing GC

On Apr 20, 2008, at 6:52 PM, Gordon Henriksen wrote:
> On 2008-04-20, at 21:05, Terence Parr wrote:
>
>> On Apr 20, 2008, at 5:36 PM, Gordon Henriksen wrote:
>>
>>> Since the semispace heap doesn't actually work (it's an
example,
>>> at best), I suggest you simply copy the stack visitor into your  
>>> project; it's only a dozen lines of code or so.
>>
>>
>> Ok, copying; can't find ShadowStackEntry though. Even make in that
>> dir doesn't work:
>
> Please use the version from subversion; this is broken in 2.2  
> release, unfortunately.
ah! ok, looks better now. :)
>> how does the gc "shadow-stack" gcroot intrinsic work exactly?
I
>> couldn't read the assembly very well.  Seems my example above  
>> wouldn't work would it unless i create/fill in a shadow stack
record?
>>
>
> 'gc "shadow-stack"' in the LLVM IR instructs the code
generator to
> automatically maintain the linked list of stack frames. You don't  
> have to do anything to maintain these shadow stack frames except to  
> keep your variables in the llvm.gcroot'd allocas. Essentially, it  
> does this:
>
>     struct ShadowStackEntry {
>         ShadowStackLink *next;
>         const ShadowStackMetadata *metadata;
>         void *roots[0];
>     };
Ok, bear with me here...

What's the difference between ShadowStackLink and ShadowStackEntry?
>       template <size_t count>
>     struct Roots {
>         ShadowStackLink *next;
>         const ShadowStackMetadata *metadata;
>         void *roots[0];
>     };
>
>     ShadowStackEntry *shadowStackHead;
>
>     // Defined by the code generator.
>     const ShadowStackMetadata f_metadata = ...;
Do you mean generated by my front end that emits IR or do you mean the  
backend? It seems that, since I read the source code and build the  
symbol table, I would need to build this stack frame type information  
for LLVM.
>         void f() {
>         Roots<3> roots;
>         roots.next = shadowStackHead;
>         roots.metadata = f_metadata;
>         roots.roots[0] = NULL;
>         roots.roots[1] = NULL;
>         roots.roots[2] = NULL;
What are the three roots here? Not sure where anything but the next,  
metadata are coming from.  So the gc "shadow-stack" generates that  
preamble code? That would make sense
>         shadowStackHead = (ShadowStackEntry *) &roots;
>
>         ... user code ...
here is where my gcroots go then I guess.
>         shadowStackHead = entry.next; // before any exit
>         return;
>     }
Can you  tell me where to find ShadowStackMetadata?  A search does not  
reveal it:

/usr/local/llvm-2.2 $ find . -name 'ShadowStackMetadata*'
>> Taking a giant step back, I can build something similar to  
>> semispace.c myself so I'm in control of my world, right?  i would  
>> set up the shadow stack using IR instructions and could avoid  
>> gcroot by notifying my collector as I see fit...
>
> That's true; the shadow stack design is explicitly for uncooperative  
> environments, after all.
The compiler plug-in for a GC is like a sophisticated macro that knows  
how to emit preambles and post ambles for each function that says it  
uses that particular GC, right?  Does it do more than an include such  
as figuring out which alloca's I have that are pointers? If so, then  
why do I need to use gcroot instructions to identify roots? Seems like  
it would be much easier to understand to just have my output templates  
emit the preamble and so on.  Oh, maybe the optimizer remove some  
stuff in there for what I think is a root is actually not around  
anymore.
> When you want to eliminate the shadow stack overhead, you will need  
> to (a.) use a conservative GC or (b.) emit stack frame metadata  
> using the LLVM GC support.
Unfortunately, I'm thoroughly confused about who generates what.  Who  
is supposed to generate the meta data types?  If I am, that is fine,  
but I really can't find anything in the documentation that is a simple  
end to end C code -> IR example. Once I get one together, I'll put it  
in the book I'm writing. I've spent many hours reading and playing as  
much as I can, but it is still not clear; 'course I ain't always that  
bright. ;)  Note that the paper by Henderson was extremely clear to  
me, so it's not the contents, it is the details of using LLVM to do GC.
>> Sorry I'm so lost...just trying to figure out what llvm does for me
>> and what I have to do.
>
> No problem!
>
> Generally speaking, LLVM is going to help you find roots on the  
> stack, which is the part that the compiler backend must help with;  
> the rest is your playground.
Is that because only code generation knows what roots exist after  
processing the IR?
> The infrastructure is more suited toward interfacing with an  
> existing GC rather than necessarily making writing a new runtime  
> trivial. (See exception handling for precedent…)
Well, writing a new garbage collector seems really straightforward  
(like to mark and sweep).  LLVM will give me the roots and I am free  
to walk them. The part that I don't understand is who defines what  
metadata types and how exactly I make use of gcroot and LLVM's  
support. The concepts are clear, the details seem miles away ;)

Thanks for all the help...

Has anybody else on the list gotten a trivial GC'd language working I  
could look at? All go back to the scheme translator again to see what  
I can learn.

Thanks,
Ter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20080421/b4f56100/attachment.html>

Gordon Henriksen

2008-Apr-22 01:13 UTC

head link

[LLVMdev] getting started with IR needing GC

Hi Terence,

I think you're getting hung up on the details of the shadow stack  
collector. The shadow stack is a GC that is possible within this  
framework, but of course could be implemented without any special  
support. Its presence is more misleading than anything else. Taking a  
step back, the concepts are:

llvm.gcroot instructs the code generator --> My GC needs to be able to  
find this variable on the stack at runtime.
gc "x" instructs the code generator --> My GC will use the method
"x"
to find variables on the stack at runtime.

Typically, a metadata table (similar to exception handling tables)  
will be emitted. Its structure is fundamentally
map<void*,list<int>>,
where the key is the address of a safe point in code (for instance,  
the address following a 'call' instruction), and the ints indicate the  
offsets of the live roots within the stack frame. LLVM does not define  
such a data structure, but the runtime-specific Collector subclass and  
the GC runtime itself must agree to a common format in order to  
interoperate.

ShadowStackCollector uses a completely alternative methodology. A  
shadow stack could be implemented in user code without any special  
support. (Indeed, it is implemented entirely as an LLVM IR -> LLVM IR  
transformation!)


On Apr 21, 2008, at 19:12, Terence Parr wrote:
> On Apr 20, 2008, at 6:52 PM, Gordon Henriksen wrote:
>
>> On 2008-04-20, at 21:05, Terence Parr wrote:
>>
>>> how does the gc "shadow-stack" gcroot intrinsic work
exactly?  I
>>> couldn't read the assembly very well.  Seems my example above  
>>> wouldn't work would it unless i create/fill in a shadow stack  
>>> record?
>
>>>
>>
>> 'gc "shadow-stack"' in the LLVM IR instructs the code
generator to
>> automatically maintain the linked list of stack frames. You don't  
>> have to do anything to maintain these shadow stack frames except to  
>> keep your variables in the llvm.gcroot'd allocas. Essentially, it  
>> does this:
>>
>>     struct ShadowStackEntry {
>>         ShadowStackLink *next;
>>         const ShadowStackMetadata *metadata;
>>         void *roots[0];
>>     };
>
> Ok, bear with me here...
>
> What's the difference between ShadowStackLink and ShadowStackEntry?
This is an abstract type with a flexible array member.
>>       template <size_t count>
>>     struct Roots {
>>         ShadowStackLink *next;
>>         const ShadowStackMetadata *metadata;
>>         void *roots[0];
>>     };
This should be roots[n], which makes the difference.
>>     ShadowStackEntry *shadowStackHead;
>>
>>     // Defined by the code generator.
>>     const ShadowStackMetadata f_metadata = ...;
>
> Do you mean generated by my front end that emits IR or do you mean  
> the backend? It seems that, since I read the source code and build  
> the symbol table, I would need to build this stack frame type  
> information for LLVM.
No, the code generator injects this constant. The metadata records how  
many roots are in the entry, and also stores the 'metadata' parameter  
to llvm.gcroot if you provide one.
>>         void f() {
>>         Roots<3> roots;
>>         roots.next = shadowStackHead;
>>         roots.metadata = f_metadata;
>>         roots.roots[0] = NULL;
>>         roots.roots[1] = NULL;
>>         roots.roots[2] = NULL;
>
> What are the three roots here? Not sure where anything but the next,  
> metadata are coming from.  So the gc "shadow-stack" generates
that
> preamble code? That would make sense
These would correspond to three gcroot allocas.
>>         shadowStackHead = (ShadowStackEntry *) &roots;
>>
>>         ... user code ...
>
> here is where my gcroots go then I guess.
This is where your uses of them would occur. llvm.gcroot does not emit  
any code except the preamble.
>>         shadowStackHead = entry.next; // before any exit
>>         return;
>>     }
>
> Can you  tell me where to find ShadowStackMetadata?  A search does  
> not reveal it:
>
> /usr/local/llvm-2.2 $ find . -name 'ShadowStackMetadata*'
This was pseudocode. The ShadowStackCollector actually does  
instantiate such llvm StructTypes, however; you can refer to its  
implementation.
>>> Taking a giant step back, I can build something similar to  
>>> semispace.c myself so I'm in control of my world, right?  i
would
>>> set up the shadow stack using IR instructions and could avoid  
>>> gcroot by notifying my collector as I see fit...
>>
>> That's true; the shadow stack design is explicitly for  
>> uncooperative environments, after all.
>
> The compiler plug-in for a GC is like a sophisticated macro that  
> knows how to emit preambles and post ambles for each function that  
> says it uses that particular GC, right?  Does it do more than an  
> include such as figuring out which alloca's I have that are  
> pointers? If so, then why do I need to use gcroot instructions to  
> identify roots? Seems like it would be much easier to understand to  
> just have my output templates emit the preamble and so on.  Oh,  
> maybe the optimizer remove some stuff in there for what I think is a  
> root is actually not around anymore.
The shadow stack is a primitive form of GC with high runtime overhead.  
It's an easy way to bring up a new collector, and it is highly  
portable, but it's slow and not multithread-capable. This table is  
probably the best summary of the benefits of the GC infrastructure:

http://llvm.org/docs/GarbageCollection.html#collector-algos
>> When you want to eliminate the shadow stack overhead, you will need  
>> to (a.) use a conservative GC or (b.) emit stack frame metadata  
>> using the LLVM GC support.
>
> Unfortunately, I'm thoroughly confused about who generates what.   
> Who is supposed to generate the meta data types?
Your code is responsible for defining the object model metadata  
(vtable pointers, class layouts, etc.).

For enumerating stack roots, the Collector plugin and the runtime must  
agree to a common format.
> If I am, that is fine, but I really can't find anything in the  
> documentation that is a simple end to end C code -> IR example. Once  
> I get one together, I'll put it in the book I'm writing. I've
spent
> many hours reading and playing as much as I can, but it is still not  
> clear; 'course I ain't always that bright. ;)  Note that the paper
> by Henderson was extremely clear to me, so it's not the contents, it  
> is the details of using LLVM to do GC.
>
>>> Sorry I'm so lost...just trying to figure out what llvm does
for
>>> me and what I have to do.
>>
>> No problem!
>>
>> Generally speaking, LLVM is going to help you find roots on the  
>> stack, which is the part that the compiler backend must help with;  
>> the rest is your playground.
>
> Is that because only code generation knows what roots exist after  
> processing the IR?
Only the code generator knows where the roots are after code  
generation (stack offsets, registers), or at what points the roots  
must be findable (at safe points).

It is of course possible to define user data structures to do this  
(this is what the shadow stack does), but that incurs significant  
overhead (4 + n stores + 2 loads per function call).
>> The infrastructure is more suited toward interfacing with an  
>> existing GC rather than necessarily making writing a new runtime  
>> trivial. (See exception handling for precedent…)
>
> Well, writing a new garbage collector seems really straightforward  
> (like to mark and sweep).  LLVM will give me the roots and I am free  
> to walk them. The part that I don't understand is who defines what  
> metadata types and how exactly I make use of gcroot and LLVM's  
> support. The concepts are clear, the details seem miles away ;)
>
> Thanks for all the help...
>
> Has anybody else on the list gotten a trivial GC'd language working  
> I could look at? All go back to the scheme translator again to see  
> what I can learn.
PyPy's used this, but LLVM isn't their best code generator. There are  
private projects as well.

— Gordon

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20080421/aa70f7e3/attachment.html>

Lane Schwartz

2008-Apr-28 01:29 UTC

head link

[LLVMdev] getting started with IR needing GC

On Mon, Apr 21, 2008 at 8:13 PM, Gordon Henriksen
<gordonhenriksen at mac.com> wrote:>
> Hi Terence,
>
>
> I think you're getting hung up on the details of the shadow stack
collector.
> The shadow stack is a GC that is possible within this framework, but of
> course could be implemented without any special support. Its presence is
> more misleading than anything else. Taking a step back, the concepts are:
>
>
> llvm.gcroot instructs the code generator --> My GC needs to be able to
find
> this variable on the stack at runtime.
> gc "x" instructs the code generator --> My GC will use the
method "x" to
> find variables on the stack at runtime.
>
> Typically, a metadata table (similar to exception handling tables) will be
> emitted. Its structure is fundamentally map<void*,list<int>>,
where the key
> is the address of a safe point in code (for instance, the address following
> a 'call' instruction), and the ints indicate the offsets of the
live roots
> within the stack frame. LLVM does not define such a data structure, but the
> runtime-specific Collector subclass and the GC runtime itself must agree to
> a common format in order to interoperate.
Hi guys,

I am also trying to understand the garbage collection framework. I've
built up a tiny frontend for a very simple language, and I'd like to
try adding a simple garbage collector. I guess I'll walk through my
understanding of the required steps - I'd appreciate it if you guys
could correct where I misunderstand and help fill in the gaps.

Currently, my frontend creates a normal LLVM IR of a program. Before
adding GC, all variables are allocated by doing an alloca or malloc,
then by doing a store of the alloca to the variable.

Now let's assume I want to add a simple copying garbage collector.

I write an implementation called copycollector.c (based on
semispace.c) that implements the functions declared in GCInterface.h.

When my frontend generates LLVM IR code for a program, it should also
generate a call (early in the IR code) to llvm_gc_initialize. This
function uses the system calloc to allocate two big blocks of memory,
then stores pointers to that memory in static variables.

Since this is a simple copying collector, the functions llvm_gc_read
and llvm_gc_write won't really do much:
void *llvm_gc_read(void *ObjPtr, void **FieldPtr) { return *FieldPtr; }
void llvm_gc_write(void *V, void *ObjPtr, void **FieldPtr) { *FieldPtr = V; }

There is also a function called llvm_gc_allocate. Now, instead of
using alloca or malloc, my frontend generates a call to
llvm_gc_allocate.

Up to this point I feel pretty good about what's going on. But now
things get a bit fuzzy.

I somehow need to inform the garbage collection runtime (my
copycollector.c) about my variables - specifically about gc roots. So,
after I get new memory using llvm_gc_initialize, I think I should
generate an @llvm.gcroot intrinsic. But what do I pass it? The
@llvm.gcroot intrinsic takes an i8**, but llvm_gc_initialize returns a
void *. I don't really understand what happens when I generate an
@llvm.gcroot intrinsic. Conceptually, I'm announcing that I have a
variable that could serve as a GC root. But what exactly is happening?
And do I need to care what's happening?

Now, back to the collector runtime... When llvm_gc_collect is called,
I need to walk through the GC space, starting with the GC roots. How
do I get access to the GC roots?

Then there is the collector plugin. I guess there will be a
MyCustomCollector which inherits from Collector . I really don't
understand what this class does and why I need it. Let's say I have a
simple copying GC, and let's say I have a very simple set of types in
my language. When I store a variable in GC space, I'll store a char at
the beginning of the memory for that variable that simply indicates
the type (and I'll be able to infer the size from the type). Given
that, I assume I won't ever need use the collector metadata (in
@llvm.gcroot, llvm_cg_walk_gcroots, or via CollectorMetadata). Is that
right?

What does it mean for the plugin to compute a stack map?

When are the methods of the plugin called? Who is responsible for calling them?

When is the constructor of the plugin called?

What is the relationship between the plugin and the GC runtime?

In the GC docs, section "Tracing GC pointers from heap objects", there
is the following statement: "In addition, LLVM allows the front-end to
extract GC information in any form from a specific object pointer
(this supports situations #1 and #3)." Is there an example where this
is done? What API can I look at to see how to do this?

Thanks for all the help. I know this is a lot of questions. I'd really
like to get a basic GC up and running.

Thanks,
Lane

Apparently Analagous Threads

Search for more seemingly similar threads

llvm dev - Apr 2008 - [LLVMdev] getting started with IR needing GC

[LLVMdev] getting started with IR needing GC

[LLVMdev] getting started with IR needing GC

[LLVMdev] getting started with IR needing GC

Apparently Analagous Threads