Reid Spencer
2004-Jan-06 05:05 UTC
[LLVMdev] 9 Ideas To Better Support Source Language Developers
A while back I promised to provide some feedback on useful extensions to LLVM to better support source language writers (i.e. those _using_ LLVM, not developing it). Below is a list of the ideas I've come up with so far. As I get more of XPL's compiler done, I'll start diving into each of these areas. I'm posting early in the hopes that discussion will bear some fruit.

In discussing these things, I'm mostly interested in learning whether any of the following ideas should or should not be part of LLVM as opposed to part of XPS.

DISCLAIMER: If any of the following items are already implemented, I missed it! So, please enlighten me!

NOTE: If you respond to this, please respond to each item in a separate message to the list. That way we can keep track of different topics on different discussion threads. I doubt you'll want to do that, however: these are all great ideas and should just be adopted without further discussion! :)))) <kidding!>

The following items are ranked roughly in order of importance to _me_. Feel free to rank them for your needs -- it would be interesting to see what's important to others.

------------------------------------------------------------------
1. Definition Import

Source languages are likely to create lots of named type and value definitions for the memory objects the language manipulates. Redefining these in every module produces byte code bloat. It would be very useful for LLVM to natively support some kind of import capability that would properly declare global types and global values into the module being considered. Even better would be a way to have this capability supported as a first-class citizen with some kind of "Import" class and/or instruction: simply point an Import class to an existing bytecode file and it causes the global declarations from that bytecode file to be imported into the current Module.

------------------------------------------------------------------
2. Memory Management

My programming system (XPS) has some very powerful and efficient memory allocation mechanisms built into it. It would be useful to allow users of LLVM to control how (and more importantly where) memory is allocated by LLVM.

This is another pretty large, sweeping change that will affect every "new" in LLVM. The various Class::get() methods would need to be altered as well as the various createXXX functions. To minimize the impact, we would subclass every root class in the LLVM inheritance hierarchy from some "Allocatee" class that implements operators new and delete. This base class would handle dispatching operator new to a user-installed version, if provided. Otherwise it just defaults to ::new. This trick means we don't have to change _every_ "new" call, just the ones that allocate things outside of the llvm namespace.

------------------------------------------------------------------
3. Code Signing Support

One of the requirements for XPL is that the author and/or distributor of a piece of software be known before execution and that there is a way to validate the integrity of the bytecodes. To that end, I'm planning on providing message digesting and signing on LLVM bytecode files. This is pretty straightforward to implement. The only question is whether it really belongs in LLVM or not. Note that code signing is pretty much a standard part of Java these days.

There's one issue with code signing: it thwarts global optimization because changing the byte code means changing the signature. While the software's author can always do this, a signed bytecode file could not be globally optimized into another program without breaking the signature. It would probably be acceptable to allow LLVM to modify the bytecode in memory at runtime after decryption and verification of the signature.

------------------------------------------------------------------
4. Threading Support

Some low level support for threading is needed. I think there are really just a very few primitives we need from which higher order things can be constructed. One is a memory barrier to ensure cache is flushed, etc. so we can be certain a write to memory has "taken". This goes beyond the current volatile support and will need to access specific machine instructions if a native barrier is supported. Another is a thread forking instruction. I'd like to see TLS supported but that can probably be constructed from lower level primitives. A nice-to-have would be critical section support. This could be done similar to Java's monitorenter and monitorexit instructions. If I recall correctly, I believe this capability is being worked on currently.

------------------------------------------------------------------
5. Fully Developed ByteCode Archives

XPL programs are developed into packages. Packages are the unit of deployment and as such I need a way to (a) archive several bytecode files together, (b) index the globals in them, and (c) compress the whole thing with bzip2. Although LLVM has some support for this today with the llvm-ar program, I don't believe it supports (b) and (c).

Note that bytecode files compress to about 50% with bzip2 which means faster transmission times to their destinations (oh, did I mention that XPL supports distributed programming? :) The resulting archive program would be more similar to jar/tar than to ar.

------------------------------------------------------------------
6. Incremental Code Generation

The conventional wisdom for compilation is to emit object code (or in our case the byte code) from a compiler incrementally on a per-function basis. This is necessary so that one doesn't have to keep the memory for every function around for the entire compilation. This allows much larger programs to be compiled since the memory limit is relative to the size of a single function rather than the size of the whole program. My language, XPL, will result in the compilation of huge programs because it is essentially a language to support program generation.

I'm not sure if LLVM supports this now, but I'd like LLVM to be able to write byte code for an llvm::Function object and then "drop" the function's body and carry on. It isn't obvious from llvm::Function's interface if this is supported or not.

The only drawback to this is the effect on optimization. I would suggest that after bytecode generation, a function's "body" be replaced with some kind of summary (annotation?) of interest to optimization passes. The summary would contain indications of whether the function calls anything else, modifies global memory, etc. That way the relevant information for optimization passes can be retained while all the gory details aren't.

Taking the above suggestion to its logical conclusion, it might be useful to create a general mechanism for passes to leave "tidbits" of information around for other passes. The Annotation mechanism probably could be used for this purpose but something a little more formal would probably be better. It's highly likely there's something like this in place already that I'm not aware of.

------------------------------------------------------------------
7. Idioms Package

As I learned from Stacker (the hard way), there are certain idioms that occur in using LLVM over and over again. These idioms need to be either (a) documented or (b) implemented in a library. I prefer (b) because it implies (a) ;> Such idioms as if-then-else, for (pre; cond; post), while(cond), etc. should be just coded into a framework so that compiler writers have a slightly higher level interface to work with.

Although I like this idea, it's low on my list because I regard LLVM as _already_ incredibly easy to use as a compiler writer's tool. But, hey, why stop at "incredibly easy" when there's "amazingly trivial" waiting in the wings?

------------------------------------------------------------------
8. Create a ConstantString class

Constant strings are very common occurrences in XPL and probably are in other source languages as well. The current implementation of ConstantArray::get(std::string&) is a bit weak. It creates a ConstantSInt for every character. What if the strings are long and the program creates many of them? It seems a little heavyweight to me.

I can't think of a good reason not to support a ConstantString class that retains the string as a std::string and DTRT with it for code generation. I know that every character in the string must be addressable and that coalescing them into a single object thwarts the use-def chains, etc. But, couldn't ConstantString just "fake it" somehow so we don't have so many objects created for a string?

One idea is to just punt. If you use ConstantString then it's treated like a single atomic memory object. Only the address of the first location can be taken. If that doesn't fit the bill, you can always go back to a ConstantArray of ConstantSInt.

------------------------------------------------------------------
9. More Native Platforms Supported

To get the platform coverage that I need, I'm making the XPL compiler use the C back end. It's slower to compile that way but I'll only need it for those programs that want to go fully native. The back end support in LLVM is a bit weak right now in terms of both optimizations available and platforms supported. This isn't a big priority for me as there is a viable alternative to native platform support.

------------------------------------------------------------------

I'll do another one of these postings as I get nearer to the end of the XPL Compiler implementation. There should be lots more ideas by then. Don't hold your breath :)

Best Regards,

Reid.
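Item 2 above describes an "Allocatee" base class that routes operator new and delete to a user-installed allocator and otherwise falls back to the global ones. A minimal sketch of that idea might look like the following. Every name in it (the `sketch` namespace, `installAllocator`, `Allocatee`, `Node`, the counting allocator) is hypothetical and invented for illustration; none of it is an existing LLVM interface, and it ignores array forms, alignment, and thread safety.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical names throughout: nothing here is an existing LLVM interface.
namespace sketch {

using AllocFn = void *(*)(std::size_t);
using FreeFn = void (*)(void *);

// User-installed hooks; null means "fall back to the global operator new/delete".
inline AllocFn &allocHook() { static AllocFn fn = nullptr; return fn; }
inline FreeFn &freeHook() { static FreeFn fn = nullptr; return fn; }

// Assumption: hooks are installed once, before any Allocatee object exists.
// Swapping hooks while objects are live would free memory with the wrong allocator.
inline void installAllocator(AllocFn a, FreeFn f) {
  allocHook() = a;
  freeHook() = f;
}

// Root classes would derive from this so that plain "new T(...)" is redirected.
class Allocatee {
public:
  static void *operator new(std::size_t size) {
    if (AllocFn fn = allocHook()) {
      if (void *p = fn(size)) return p;
      throw std::bad_alloc();
    }
    return ::operator new(size);
  }
  static void operator delete(void *p) noexcept {
    if (!p) return;
    if (FreeFn fn = freeHook()) fn(p);
    else ::operator delete(p);
  }

protected:
  Allocatee() = default;
  ~Allocatee() = default;
};

// Example user allocator: counts live blocks and forwards to malloc/free.
inline int &liveCount() { static int n = 0; return n; }
inline void *countingAlloc(std::size_t size) { ++liveCount(); return std::malloc(size); }
inline void countingFree(void *p) { --liveCount(); std::free(p); }

// Stand-in for some class in the hierarchy.
struct Node : Allocatee {
  explicit Node(int v) : value(v) {}
  int value;
};

} // namespace sketch
```

With the hooks installed, an ordinary `new sketch::Node(7)` goes through `countingAlloc` without the call site changing, which is the point of the proposal: only the root class knows about the redirection.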
Valery A.Khamenya
2004-Jan-07 12:50 UTC
[LLVMdev] 9 Ideas To Better Support Source Language Developers
Hello Reid and LLVMers,

...

10. Basic support for distributed computations.

--
Best regards,
Valery A.Khamenya
mailto:khamenya at mail.ru
Local Time: 19:48
Chris Lattner
2004-Jan-07 12:55 UTC
[LLVMdev] 9 Ideas To Better Support Source Language Developers
On Tue, 6 Jan 2004, Reid Spencer wrote:

> A while back I promised to provide some feedback on useful extensions to
> LLVM to better support source language writers (i.e. those _using_ LLVM,
> not developing it). Below is a list of the ideas I've come up with so
> far.

Cool! Ideas are always welcome!

> If you respond to this, please respond to each item in a separate
> message to the list. That way we can keep track of different topics on
> different discussion threads.

I'll let you split it up as you see fit. :)

> ------------------------------------------------------------------
> 1. Definition Import
> Source languages are likely to create lots of named type and value
> definitions for the memory objects the language manipulates. Redefining
> these in every module produces byte code bloat. It would be very useful
> for LLVM to natively support some kind of import capability that would
> properly declare global types and global values into the module being
> considered.

Unfortunately, this would break the ability to take a random LLVM bytecode file and use it in a self-contained way. In general, the type names and external declarations are actually stored very compactly, and the optimizers remove unused ones. Is this really a problem for you in practice?

> Even better would be a way to have this capability supported
> as a first-class citizen with some kind of "Import" class and/or
> instruction: simply point an Import class to an existing bytecode file
> and it causes the global declarations from that bytecode file to be
> imported into the current Module.

We already have this: the linker. :) Just put whatever you want into an LLVM bytecode file, then use the LinkModules method (from llvm/Transforms/Utils/Linker.h) to "import" it. Alternatively, in your front-end, you could just start with this module instead of an empty one when you compile a file...

> ------------------------------------------------------------------
> 2. Memory Management
>
> My programming system (XPS) has some very powerful and efficient memory
> allocation mechanisms built into it. It would be useful to allow users
> of LLVM to control how (and more importantly where) memory is allocated
> by LLVM.

What exactly would this be used for? Custom allocators for performance? Or something more important? In general, custom allocators for performance are actually a bad idea...

> ------------------------------------------------------------------
> 3. Code Signing Support
>
> One of the requirements for XPL is that the author and/or distributor of
> a piece of software be known before execution and that there is a way to
> validate the integrity of the bytecodes. To that end, I'm planning on
> providing message digesting and signing on LLVM bytecode files. This is
> pretty straightforward to implement. The only question is whether it
> really belongs in LLVM or not.

I don't think that this really belongs in LLVM itself. Better would be to wrap LLVM bytecode files in an application-specific (i.e., XPL-specific) file format that includes the digest of the module, the bytecode itself, and whatever else you wanted to keep with it. That way your tool, when commanded to load a file, would check the digest, and only if it matches call the LLVM bytecode loader.

> There's one issue with code signing: it thwarts global optimization
> because changing the byte code means changing the signature. While the
> software's author can always do this, a signed bytecode file could not
> be globally optimized into another program without breaking the
> signature. It would probably be acceptable to allow LLVM to modify the
> bytecode in memory at runtime after de-encryption and verification of
> the signature.

I'm not sure that there is a wonderful solution to this. You could go the route of having a "trusted" compiler, which has the necessary keys built into it or something, but I don't know very much about this area.

> ------------------------------------------------------------------
> 4. Threading Support
>
> Some low level support for threading is needed. I think there are really
> just a very few primitives we need from which higher order things can be
> constructed. One is a memory barrier to ensure cache is flushed, etc. so
> we can be certain a write to memory has "taken".

Just out of curiosity, what do you need a membar for? The only thing that I'm aware of it being useful for (besides implementing threading packages) is Read-Copy-Update algorithms.

> This goes beyond the current volatile support and will need to access
> specific machine instructions if a native barrier is supported. Another
> is a thread forking instruction. I'd like to see TLS supported but that
> can probably be constructed from lower level primitives. A nice-to-have
> would be critical section support. This could be done similar to java's
> monitorenter and monitorexit instructions. If I recall correctly, I
> believe this capability is being worked on currently.

Yup, Misha is currently working on adding these capabilities to LLVM. In the meantime, calling into a pthreads library directly is the preferred solution. I agree that TLS would be very handy to have.

> ------------------------------------------------------------------
> 5. Fully Developed ByteCode Archives
>
> XPL programs are developed into packages. Packages are the unit of
> deployment and as such I need a way to (a) archive several bytecode
> files together, (b) index the globals in them, and (c) compress the
> whole thing with bzip2. Although LLVM has some support for this today
> with the llvm-ar program, I don't believe it supports (b) and (c).

This makes a lot of sense. The LLVM bytecode reader supports loading a bytecode file from a memory buffer, so I think it would be pretty easy to implement this. Note that llvm-ar is currently a work-in-progress, but it might make sense to implement support for this directly in it. After all, we aren't constrained by what the format of the ".o" files in the .a file look like (as long as gccld and llvm-nm support the format).

> Note that bytecode files compress to about 50% with bzip2 which means
> faster transmission times to their destinations (oh, did I mention that
> XPL supports distributed programming? :) The resulting archive program
> would be more similar to jar/tar than to ar.

Also note that we are always interested in finding ways to shrink the bytecode files. Right now they are basically comparable to native executable sizes, but smaller is always better!

> ------------------------------------------------------------------
> 6. Incremental Code Generation
>
> The conventional wisdom for compilation is to emit object code (or in
> our case the byte code) from a compiler incrementally on a per-function
> basis. This is necessary so that one doesn't have to keep the memory for
> every function around for the entire compilation. This allows much

That makes sense.

> I'm not sure if LLVM supports this now, but I'd like LLVM to be able to
> write byte code for an llvm::Function object and then "drop" the
> function's body and carry on. It isn't obvious from llvm::Function's
> interface if this is supported or not.

This has not yet been implemented, but a very similar thing has: incremental bytecode loading. The basic idea is that you can load a bytecode file without all of the function bodies. As you need the contents of a function body, it is streamed in from the bytecode file. Misha added this for the JIT.

Doing the reverse seems very doable, but no one has tried it. If you're interested, take a look at the llvm::ModuleProvider interface and the implementations of it to get a feeling for how the incremental loader works.

> The only drawback to this is the effect on optimization. I would suggest
> that after bytecode generation, a function's "body" be replaced with
> some kind of summary (annotation?) of interest to optimization passes.

This is very similar to the functionality required for the incremental loader, so when it gets developed for the loader, the writer could use similar kinds of interfaces.

> Taking the above suggestion to its logical conclusion, it might be
> useful to create a general mechanism for passes to leave "tidbits" of
> information around for other passes. The Annotation mechanism probably
> could be used for this purpose but something a little more formal would
> probably be better. It's highly likely there's something like this in
> place already that I'm not aware of.

LLVM already has an llvm::Annotation class that does exactly this :)

> ------------------------------------------------------------------
> 7. Idioms Package
>
> As I learned from Stacker (the hard way), there are certain idioms that
> occur in using LLVM over and over again. These idioms need to be either
> (a) documented or (b) implemented in a library. I prefer (b) because it
> implies (a) ;> Such idioms as if-then-else, for (pre; cond; post),
> while(cond), etc. should be just coded into a framework so that compiler
> writers have a slightly higher level interface to work with.
>
> Although I like this idea, its low on my list because I regard LLVM
> _already_ incredibly easy to use as a compiler writer's tool. But, hey,
> why stop at "incredibly easy" when there's "amazingly trivial" waiting
> in the wings?

Developing a new "front-end helper" library could be interesting! The only challenge would be to make it general purpose enough that it would actually be useful for multiple languages.

> ------------------------------------------------------------------
> 8. Create a ConstantString class
>
> Constant strings are very common occurrences in XPL and probably are in
> other source languages as well. The current implementation of
> ConstantArray::get(std::string&) is a bit weak. It creates a
> ConstantSInt for every character. What if the strings are long and the
> program creates many of them? It seems a little heavy weight to me.

This is something that might make sense to deal with in the future, but it has a lot of implications in the compiler and optimizer. Look at GCC for example: there are many optimizations that work on constant strings but not on arrays of characters or any other type. At this stage in the game, effort is probably best spent elsewhere. :)

On the other hand, adding a hack to the bytecode format to efficiently encode strings is something that I have been considering: there the effect of the change is more contained.

> ------------------------------------------------------------------
> 9. More Native Platforms Supported
>
> To get the platform coverage that I need, I'm making the XPL compiler
> use the C back end. Its slower to compile that way but I'll only need it
> for those programs that want to go fully native. The back end support in
> LLVM is a bit weak right now in terms of both optimizations available
> and platforms supported. This isn't a big priority for me as there is a
> viable alternative to native platform support.

Yup, that makes sense. Supporting the CBE will always be a good idea, but adding new native platforms and improving the ones we do support will be increasingly important over time. :)

> ------------------------------------------------------------------
>
> I'll do another one of these postings as I get nearer to the end of the
> XPL Compiler implementation. There should be lots more ideas by then.
> Don't hold your breath :)

Cool, keep us informed! :)

-Chris

--
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/
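The wrapper format suggested for item 3 (store a digest next to the bytecode, check it, and only then call the bytecode loader) can be sketched roughly as below. This is an illustration under loud assumptions: the container layout (an 8-byte big-endian digest followed by the payload) and the names `wrap` and `unwrap` are invented here, and FNV-1a is used only as a compact stand-in. FNV-1a is not a cryptographic hash, so a real tool would need a proper digest plus a signature over it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container format and names; not part of LLVM.
namespace sketch {

using Bytes = std::vector<std::uint8_t>;

// FNV-1a, 64-bit. Stand-in only: it is NOT cryptographically secure.
inline std::uint64_t fnv1a(const Bytes &data) {
  std::uint64_t h = 0xcbf29ce484222325ULL; // FNV offset basis
  for (std::uint8_t b : data) {
    h ^= b;
    h *= 0x100000001b3ULL; // FNV prime
  }
  return h;
}

// Assumed layout: 8-byte big-endian digest, then the raw bytecode.
inline Bytes wrap(const Bytes &bytecode) {
  Bytes out;
  const std::uint64_t h = fnv1a(bytecode);
  for (int shift = 56; shift >= 0; shift -= 8)
    out.push_back(static_cast<std::uint8_t>(h >> shift));
  out.insert(out.end(), bytecode.begin(), bytecode.end());
  return out;
}

// Returns true and fills `payload` only when the stored digest matches.
// A caller would pass `payload` to the bytecode loader only on success.
inline bool unwrap(const Bytes &container, Bytes &payload) {
  if (container.size() < 8) return false;
  std::uint64_t stored = 0;
  for (std::size_t i = 0; i < 8; ++i)
    stored = (stored << 8) | container[i];
  Bytes body(container.begin() + 8, container.end());
  if (fnv1a(body) != stored) return false;
  payload = body;
  return true;
}

} // namespace sketch
```

Because verification happens before the loader ever sees the bytes, the LLVM bytecode format itself stays unchanged, which is the design point made in the reply to item 3.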
Chris Lattner
2004-Jan-07 12:56 UTC
[LLVMdev] 9 Ideas To Better Support Source Language Developers
On Wed, 7 Jan 2004, Valery A.Khamenya wrote:

> Hello Reid and LLVMers,
> 10. Basic support for distributed computations.

What kind of support? What do you think should be included in LLVM directly, as opposed to being built on top of it?

-Chris

--
http://llvm.cs.uiuc.edu/
http://www.nondot.org/~sabre/Projects/
Reid Spencer
2004-Jan-07 13:35 UTC
[LLVMdev] 9 Ideas To Better Support Source Language Developers
On Wed, 2004-01-07 at 11:12, Chris Lattner wrote:

> > ------------------------------------------------------------------
> > 1. Definition Import
> > Source languages are likely to create lots of named type and value
> > definitions for the memory objects the language manipulates. Redefining
> > these in every module produces byte code bloat. It would be very useful
> > for LLVM to natively support some kind of import capability that would
> > properly declare global types and global values into the module being
> > considered.
>
> Unfortunately, this would break the ability to take a random LLVM bytecode
> file and use it in a self-contained way. In general, the type names and
> external declarations are actually stored very compactly, and the
> optimizers remove unused ones. Is this really a problem for you in
> practice?

I'm trying to get to a "once-and-done" solution on compilation. That is, a given module is compiled exactly once (per version). There's no such thing as "include" in XPL, only "import". The difference is that "import" loads the results of previous compilations (i.e. a bytecode file). I included it in my list because I thought it would be something quite handy for other source languages (Java would need it, for example). The functionality is something like Java's class loader except it's a module loader for LLVM and it doesn't load the function bodies.

> > Even better would be a way to have this capability supported
> > as a first-class citizen with some kind of "Import" class and/or
> > instruction: simply point an Import class to an existing bytecode file
> > and it causes the global declarations from that bytecode file to be
> > imported into the current Module.
>
> We already have this: the linker. :) Just put whatever you want into an
> LLVM bytecode file, then use the LinkModules method (from
> llvm/Transforms/Utils/Linker.h) to "import" it. Alternatively, in your
> front-end, you could just start with this module instead of an empty one
> when you compile a file...

Okay, I'll take a look at this and see if it fits the bill.

> > ------------------------------------------------------------------
> > 2. Memory Management
> >
> > My programming system (XPS) has some very powerful and efficient memory
> > allocation mechanisms built into it. It would be useful to allow users
> > of LLVM to control how (and more importantly where) memory is allocated
> > by LLVM.
>
> What exactly would this be used for? Custom allocators for performance?
> Or something more important? In general, custom allocators for
> performance are actually a bad idea...

My memory system can do seamless persistent memory as well (i.e. it's almost a full scale OO Database). One of my ideas for the "import" functionality was to simply save the LLVM objects for each module persistently. Import then takes no longer than an mmap(2) call to load the LLVM data structures associated with the module into memory. I can't think of a faster way to do it.

The reason this is so important to me is that I expect to be doing lots of on the fly compilation. XPL is highly dynamic. What I'm trying to avoid is the constant recompilation of included things as with C/C++. The time taken to recompile headers is, in my opinion, just wasted time. That's why pre-compiled header support exists in so many compilers.

I have also tuned my allocators so that they can do multiple millions of allocations per second on modest hardware. There's a range of allocators available, each using different algorithms. Each has space/time tradeoffs. The performance of "malloc(3)" sucks on most platforms and sucks on all platforms when there's a lot of memory thrash. None of my allocators suffer these problems.

Curious: why do you think custom allocators for performance are a bad idea?

> > ------------------------------------------------------------------
> > 3. Code Signing Support
> >
>
> I don't think that this really belongs in LLVM itself: Better would be to
> wrap LLVM bytecode files in an application (ie, XPL) specific file format
> that includes the digest of the module, the bytecode itself, and whatever
> else you wanted to keep with it. That way your tool, when commanded to
> load a file, would check the digest, and only if it matches call the LLVM
> bytecode loader.

I'd probably be more inclined to just add an internal global array of bytes to the LLVM bytecode format. Supporting a new file format means that I'd have to re-write all the LLVM tools -- not worth the time. So, I'll implement this myself and not extend LLVM with it.

> > ------------------------------------------------------------------
> > 4. Threading Support
> >
> > Some low level support for threading is needed. I think there are really
> > just a very few primitives we need from which higher order things can be
> > constructed. One is a memory barrier to ensure cache is flushed, etc. so
> > we can be certain a write to memory has "taken".
>
> Just out of curiosity, what do you need a membar for? The only thing
> that I'm aware of it being useful for (besides implementing threading
> packages) are Read-Copy-Update algorithms.

Um, to implement a threading package :) I have assumed that, true to its name, LLVM will only provide the lowest level primitives needed to implement a threading package, not actually provide a threading package. I'm sure you don't want to put all the different kinds of synchronization concepts (mutex, semaphore, barrier, futex, etc.) into LLVM? All of them need the membar. For that matter, you'll probably need an efficient thread barrier as well.

> > ------------------------------------------------------------------
> > 5. Fully Developed ByteCode Archives
> >
>
> This makes a lot of sense. The LLVM bytecode reader supports loading a
> bytecode file from a memory buffer, so I think it would be pretty easy to
> implement this. Note that llvm-ar is currently a work-in-progress, but it
> might make sense to implement support for this directly in it. After all,
> we aren't constrained by what the format of the ".o" files in the .a file
> look like (as long as gccld and llvm-nm support the format).

But if the file gets compressed, it isn't a .a file any more, right? Or, were you suggesting that only the archive members get compressed and the file is otherwise an archive? The problem with that approach is that it limits the compression somewhat. Think about an archive with 1000 bytecode files each using common declarations. Compressed individually, those common declarations are repeated in each file. Compressed en masse, only one copy of the common declarations is stored, achieving close to 1000:1 compression for those declarations.

> Also note that we are always interested in finding ways to shrink the
> bytecode files. Right now they are basically comparable to native
> executable sizes, but smaller is always better!

Unfortunately, the answer to that is to utilize higher level instructions. LLVM is comparable to native because it isn't a whole lot higher level. Compared with Java, whose byte code knows about things like classes, LLVM will always be larger because expression of the higher level concepts in LLVM's relatively low level takes more bytes. That said, we _should_ strive to minimize I haven't really looked into the bytecode format in much detail. Are we doing things like constant string folding? Could the bytecode format be natively compressed (i.e. not with bz2 or zip but simply by not duplicating anything in the output)?

> > ------------------------------------------------------------------
> > 6. Incremental Code Generation
> >
> > The conventional wisdom for compilation is to emit object code (or in
> > our case the byte code) from a compiler incrementally on a per-function
> > basis. This is necessary so that one doesn't have to keep the memory for
> > every function around for the entire compilation. This allows much
>
> That makes sense.
>
> > I'm not sure if LLVM supports this now, but I'd like LLVM to be able to
> > write byte code for an llvm::Function object and then "drop" the
> > function's body and carry on. It isn't obvious from llvm::Function's
> > interface if this is supported or not.
>
> This has not yet been implemented, but a very similar thing has:
> incremental bytecode loading. The basic idea is that you can load a
> bytecode file without all of the function bodies.

That's what I want for importing! .. item (1) above!

> As you need the
> contents of a function body, it is streamed in from the bytecode file.
> Misha added this for the JIT.

Cool.

>
> Doing the reverse seems very doable, but no one has tried it. If you're
> interested, take a look at the llvm::ModuleProvider interface and the
> implementations of it to get a feeling for how the incremental loader
> works.

Okay, I'll see what I can come up with.

> Developing a new "front-end helper" library could be interesting! The
> only challenge would be to make it general purpose enough that it would
> actually be useful for multiple languages.

You're right, it would need to be useful for multiple languages. Here's what I'll do. I'll revisit this when I get closer to done on the XPL compiler. I'm building things now that are somewhat framework oriented. If there are specific patterns that arise and could be useful, I'll submit them back to the list at that time for review.

>
> > ------------------------------------------------------------------
> > 8. Create a ConstantString class
>
> This is something that might make sense to deal with in the future, but it
> has a lot of implications in the compiler and optimizer.

Consider it postponed.

Reid.
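The "write the function, then drop its body" idea from item 6, together with the suggestion to leave a small summary behind for later passes, can be illustrated with a toy model. The sketch below deliberately does not use llvm::Function or any real LLVM type: `ToyFunction`, `Summary`, `emitAndDrop`, and the textual "instructions" are all hypothetical stand-ins, and the summary rules (a leading `call ` or `store @`) are assumptions made only so the example has something to record.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Toy stand-ins; these are NOT llvm::Function or any other LLVM type.
namespace sketch {

// The facts an optimizer might still want after the body is gone.
struct Summary {
  bool callsOthers = false;
  bool writesGlobals = false;
};

struct ToyFunction {
  std::string name;
  std::unique_ptr<std::vector<std::string>> body; // null once dropped
  Summary summary;
};

// Write one function to `out`, record a summary, then release the body.
inline void emitAndDrop(ToyFunction &f, std::ostream &out) {
  if (!f.body) return; // already emitted and dropped
  out << "define " << f.name << " {\n";
  for (const std::string &inst : *f.body) {
    out << "  " << inst << "\n";
    // Assumed toy syntax: "call ..." calls something, "store @..." writes a global.
    if (inst.compare(0, 5, "call ") == 0) f.summary.callsOthers = true;
    if (inst.compare(0, 7, "store @") == 0) f.summary.writesGlobals = true;
  }
  out << "}\n";
  f.body.reset(); // the instructions' memory is freed here
}

} // namespace sketch
```

In this model peak memory is bounded by the largest single function body rather than by the whole program, while `summary` keeps the coarse facts (calls others, writes globals) that the thread suggests retaining.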