Greetings, I am working on a project (unrelated to LLVM) that needed a bytecode-like format. I found Bitcode and it seems to fit the bill really nicely. I am writing an independent implementation of Bitcode in C (I really want to keep my runtime pure C). In the process of doing this, I've discovered a few things I want to ask about. I notice the Bitcode format documentation [0] is somewhat incomplete -- there have been a few questions I had to resolve by looking at LLVM source code. This is totally understandable (I'm actually impressed that it's documented so well for something so new), but I wonder if you would accept a patch that clarified some things. For example, the document doesn't mention endianness at all, but from the source code I discover that bytes are read in order and bits are read least-significant first. I also have a few questions about the format: - it appears that the only magic number in the file is application-specific. This seems unfortunate, because it means that application-neutral tools cannot be built that process bitcode files, since they could not reliably detect that the file is a bitcode file. It might seem like there is little room for application- neutral tools since almost all the data in the file is application-specific, but off the top of my head I can think of a few, like a tool to suggest abbreviations that would give a file better compression. - the LLVM code assumes that several VBR fields can be at most 32 bits (block ids, number of elements in an array, etc). These assumptions seem quite reasonable: can they be considered part of the format and added to the document? Cheers, Josh [0] http://llvm.org/releases/2.0/docs/BitCodeFormat.html
Hi Joshua, On Mon, 2007-09-03 at 21:34 +0000, Joshua Haberman wrote:> Greetings, > > I am working on a project (unrelated to LLVM) that needed a > bytecode-like format. I found Bitcode and it seems to fit the bill > really nicely. > > I am writing an independent implementation of Bitcode in C (I really > want to keep my runtime pure C). In the process of doing this, I've > discovered a few things I want to ask about.Okay, cool :)> > I notice the Bitcode format documentation [0] is somewhat incomplete -- > there have been a few questions I had to resolve by looking at LLVM > source code. This is totally understandable (I'm actually impressed > that it's documented so well for something so new), but I wonder if you > would accept a patch that clarified some things. For example, the > document doesn't mention endianness at all, but from the source code I > discover that bytes are read in order and bits are read > least-significant first.Absolutely, patches are always welcome.> > I also have a few questions about the format: > > - it appears that the only magic number in the file is > application-specific. This seems unfortunate, because it means that > application-neutral tools cannot be built that process bitcode files, > since they could not reliably detect that the file is a bitcode file. > It might seem like there is little room for application- neutral tools > since almost all the data in the file is application-specific, but off > the top of my head I can think of a few, like a tool to suggest > abbreviations that would give a file better compression.Wouldn't those abbreviations imply some application-specific knowledge of the format? I can't think of any situation where a generic tool could do anything other than outline the contents of the file by looking at the blocks, e.g. llvm-bcanalyzer.> > - the LLVM code assumes that several VBR fields can be at most 32 bits > (block ids, number of elements in an array, etc). These assumptions > seem quite reasonable: can they be considered part of the format and > added to the document?I don't see why not. The llvm bitcode documentation is specific to llvm anyway so the limits should be defined. Thanks for your interest, Joshua. Reid.> > Cheers, > Josh > > [0] http://llvm.org/releases/2.0/docs/BitCodeFormat.html > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reid Spencer <rspencer <at> reidspencer.com> writes:> Hi Joshua,Hi, thanks for the reply.> On Mon, 2007-09-03 at 21:34 +0000, Joshua Haberman wrote: > > I also have a few questions about the format: > > > > - it appears that the only magic number in the file is > > application-specific. This seems unfortunate, because it means that > > application-neutral tools cannot be built that process bitcode files, > > since they could not reliably detect that the file is a bitcode file. > > It might seem like there is little room for application- neutral tools > > since almost all the data in the file is application-specific, but off > > the top of my head I can think of a few, like a tool to suggest > > abbreviations that would give a file better compression. > > Wouldn't those abbreviations imply some application-specific knowledge > of the format?No -- unless I am mistaken about bitcode, abbreviations have no semantic meaning. They are nothing but a space optimization, and needn't even be exposed to the application (at least for reading).> I can't think of any situation where a generic tool could > do anything other than outline the contents of the file by looking at > the blocks, e.g. llvm-bcanalyzer.I can think of several tools that could be application-neutral and very useful: - a tool that prints the hierarchy of blocks and their sizes in the file - a tool that does more sophisticated space usage analysis. For example, how much space are the abbreviations saving? Could the file be more efficiently packed with different VBR choices? Or with different abbreviations? It could even spit out a file that is semantically identical but smaller. - a tool that dumps selected blocks from the file (possibly giving you fine-grained control over what blocks, how many records from each block, etc). - a tool for creating arbitrary bitcode files by reading textual input on stdin written in some language that is isomorphic to bitcode. This could be useful for things like constructing bitcode files that don't follow the expectations of your application, so you can test that your application handles such corruption gracefully. I know there must already be bitcode files flying around, and so I can see the difficulty of making any changes to the format. What if we just said that BC (2 bytes) is the magic number for the Bitcode format, and that 0xC0DE (2 bytes) is the application-specific magic number for LLVM?> > - the LLVM code assumes that several VBR fields can be at most 32 bits > > (block ids, number of elements in an array, etc). These assumptions > > seem quite reasonable: can they be considered part of the format and > > added to the document? > > I don't see why not. The llvm bitcode documentation is specific to llvm > anyway so the limits should be defined.Hmm, the Bitcode documentation [0] seems to contain both application-neutral and LLVM-specific parts. Specifically, only section 4 of that document seems specific to LLVM. In any case, when I submit my patch to that document, I'll put any of these assumptions that seem supremely reasonable in there and see what the feedback is. I'm especially interested to hear from Chris, since he wrote that document (and I assume designed the format).> Thanks for your interest, Joshua.Thanks for encouraging my participation. And you can call me Josh. :) Josh [0] http://llvm.org/releases/2.0/docs/BitCodeFormat.html