Dear Chris and Reid: Some other random ideas I've had as I've been sifting through the new bytecode format. Please let me know what you think. 1) ANSI C allows for char to default to unsigned char. This is I guess not how it normally is in GCC. If char defaulted to unsigned char several things would be possible. Single char constants that are defined would be almost always stored in one byte instead of the present usual two. Also this would allow string constants to be stored in the constant table in the regular fashion without wasting bytes. This would prevent the need to do those expensive linear searches through the type slot list in order to match a character constant with its type. 1a) If it's not feasible to make char default to unsigned, perhaps it would be possible to put all string constants at type slot zero to eliminate the linear searching. I realize this would somewhat violate LLVM's strict typing rules, but since these are all constants their strict type could always be derived if needed. Also, it would save space by eliminating the need to create a proliferating number of char array types of various lengths. 1b) Failing this, you should at least store the type with each constant string to avoid the linear searches. This solution would add space, but save processing time, especially with large files with extensive type lists. 2) I think it would be a big file size and processing speed win to have implied pointer types for every literal type. This would save a tremendous amount of space in the global type table and other places where pointer types are constantly being defined. So the primitive types list would change to: 0 void 1 void* (implied) 2 bool 3 bool* (implied) 4 ubyte 5 ubyte* (implied) 6 sbyte 7 sbyte* (implied) 8 ushort 9 ushort* (implied) etc. This approach would have the added advantage of being able to check to see whether anything is a pointer type by checking bit 0 (1 = yes) and deriving its dereferenced type (just subtract 1). 3) Have the value index for labels start at 1, just like nonzero values of everything else does. This just makes the encode/decode algorithm simpler and I doubt it would cost anything in file size. I made this suggestion a few emails back, hopefully in a clearer form here. 4) Can files have multiple 0x01 headers? I've never seen more than one. If not, ditch this four bytes of unnecessary space per file. 5) Don't write the compaction table for a function if there are no entries. All my simple examples have empty compaction tables that use up 8 bytes per function. This would save space. I hope you find these suggestions helpful. I'm committed to making LLVM bytecode as compact and as quick to encode/decode as possible. Regards, -- Robert. Robert Mykland Voice: (831) 462-6725 Founder/CTO Ascenium Corporation
Robert Mykland wrote:> Dear Chris and Reid:Hi Robert.> > Some other random ideas I've had as I've been sifting through the new > bytecode format. Please let me know what you think. > > 1) ANSI C allows for char to default to unsigned char. This is I guess > not how it normally is in GCC. If char defaulted to unsigned char > several things would be possible. Single char constants that are > defined would be almost always stored in one byte instead of the present > usual two.So, if I get you correctly, you're advocating the creation of a Type::CharTyID in the TypeID enumeration that is always written as a single byte? Note that right now all ASCII values ( <128 ) will be written as a single byte for UByteTyID but for SByteTyID (often the default from FE compilers like GCC), you're right, they'll take two bytes if the value > 63. Or are you saying that we should always write UByteTyID and SByteTyID as a single byte? Long term, LLVM's distinction between signed and unsigned will go away. Talk to Chris about that. :) Also this would allow string constants to be stored in the> constant table in the regular fashion without wasting bytes.I don't think that happens today. We handle string constants (as opposed to character constants) separately and they are always written as a length followed by the characters in the string. This would> prevent the need to do those expensive linear searches through the type > slot list in order to match a character constant with its type.hmm .. not clear on what you mean here. What "linear searches" ? THe slot number indexes into the vector of types directly. I believe its O(1), not O(n). Am I missing something here?> > 1a) If it's not feasible to make char default to unsigned, perhaps it > would be possible to put all string constants at type slot zero to > eliminate the linear searching. I realize this would somewhat violate > LLVM's strict typing rules, but since these are all constants their > strict type could always be derived if needed. Also, it would save > space by eliminating the need to create a proliferating number of char > array types of various lengths.Again, I'm not sure what you're getting at here. Have a look at the function outputConstantStrings in lib/Bytecode/Writer/Writer.cpp. Strings currently occupy the "void" type plane, plane 0 (which is a bit of a hack), but we write out the type slot number (index), then the character data, then a terminating null (0) byte. When the reader reads the constant strings, it reads the slot number and then calls getType(slot) which just indexes into the appropriate vector of strings. If you look at getType() in lib/Bytecode/Reader/Reader.cpp I think you'll find there's no O(n) searching there, just O(1) indexing.> > 1b) Failing this, you should at least store the type with each constant > string to avoid the linear searches. This solution would add space, but > save processing time, especially with large files with extensive type > lists.Type type slot number is stored with each constant string, as described above.> > 2) I think it would be a big file size and processing speed win to have > implied pointer types for every literal type. This would save a > tremendous amount of space in the global type table and other places > where pointer types are constantly being defined. So the primitive > types list would change to: > > 0 void > 1 void* (implied) > 2 bool > 3 bool* (implied) > 4 ubyte > 5 ubyte* (implied) > 6 sbyte > 7 sbyte* (implied) > 8 ushort > 9 ushort* (implied) > etc. > > This approach would have the added advantage of being able to check to > see whether anything is a pointer type by checking bit 0 (1 = yes) and > deriving its dereferenced type (just subtract 1).Interesting idea. However, I'm not convinced it saves us much. Most high level languages will have lots of pointers to non-primitive types for which this method won't work. Can you provide me with a bytecode file where you can show that the bytecode is bloated by lots of pointers to primitive types?> 3) Have the value index for labels start at 1, just like nonzero values > of everything else does. This just makes the encode/decode algorithm > simpler and I doubt it would cost anything in file size. I made this > suggestion a few emails back, hopefully in a clearer form here.Like I replied, we don't store labels as values in LLVM. Labels are just the names of basic blocks. Those names are stored in the function level symbol table but there's no value slot indices for labels. Perhaps I'm missing something here? I'm still not exactly sure what you mean by "the value index for labels". Do you mean its slot number? If so, I think you're mistaken on the file format or the doc is misleading you. Please let me know which so I can fix the doc. :)> 4) Can files have multiple 0x01 headers? I've never seen more than > one. If not, ditch this four bytes of unnecessary space per file.I think the original plan was to have multiple modules in them but this seems to have gone by the wayside. The result of linking two (or more) modules is a single module so except in some really bizare corner cases the need for multiple modules would go away. I suppose we could get rid of the block id field for the file. I'll give this some thought and see if Chris has any objections. Long term, I intend to write some kind of bytecode archive utility similar to JAR files that contains multiple bytecode files, an index, and the whole thing is compressed via bzip2. This would be the normal way that libraries compiled with LLVM would be distributed. Note that LLVM bytecode compreses with bzip2 around 50%. We might even add a global symbol index (optionally) so the things can be linked quickly without examining each bytecode file in the archive. This is on my "TODO" list but at a lower priority (for now).> > 5) Don't write the compaction table for a function if there are no > entries. All my simple examples have empty compaction tables that use > up 8 bytes per function. This would save space.Hmm. That's not supposed to happen. Have you got a bytecode and source file that produces this? The algorithm in bcwriter for deciding whether to emit the compaction table is pretty complicated. It should avoid the compaction table any time there's no advantage to it space wise. The compaction table's *sole* purpose is to save bytecode space and its BAD if it gets used incorrectly.> > I hope you find these suggestions helpful.Absolutely! You have some interesting ideas here.> I'm committed to making LLVM > bytecode as compact and as quick to encode/decode as possible.Thanks, we appreciate that a lot. Its high on our agenda too. Everything we can do to make the bytecode faster/smaller has enormous impacts downstream (considering linking 1000s of bc files). Reid.
I mis-spoke when I said ..> I don't think that happens today. We handle string constants (as opposed > to character constants) separately and they are always written as a > length followed by the characters in the string.Actually, they're written as a type slot number (from which the array length is derived), followed by the character data in the string. and another goof ..> Again, I'm not sure what you're getting at here. Have a look at the > function outputConstantStrings in lib/Bytecode/Writer/Writer.cpp. > Strings currently occupy the "void" type plane, plane 0 (which is a bit > of a hack), but we write out the type slot number (index), then the > character data, then a terminating null (0) byte.There's no terminating null byte. The length is inferred from the array type. Sorry for the inaccuracies .. I shouldn't rely on memory so much :) Reid.
On Fri, 20 Aug 2004, Reid Spencer wrote:> > defined would be almost always stored in one byte instead of the present > > usual two. > > So, if I get you correctly, you're advocating the creation of a Type::CharTyID > in the TypeID enumeration that is always written as a single byte? Note that > right now all ASCII values ( <128 ) will be written as a single byte for > UByteTyID but for SByteTyID (often the default from FE compilers like GCC), > you're right, they'll take two bytes if the value > 63. Or are you saying that > we should always write UByteTyID and SByteTyID as a single byte? > > Long term, LLVM's distinction between signed and unsigned will go away. Talk to > Chris about that. :)If you're interested in the plans, they are described in some detail here: http://nondot.org/sabre/LLVMNotes/TypeSystemChanges.txt Note that there is no concrete timeline for this to happen, it basically depends on when someone is ambitious enough to start working on it. In any case, both signed and unsigned 8-bit constants can be written out in a single byte. Again, do you think it's worth special casing this though? Considering that we handle 8-bit strings specially already, there are not a ton of 8-bit constants with value >= 128.> > 2) I think it would be a big file size and processing speed win to have > > implied pointer types for every literal type. This would save a > > tremendous amount of space in the global type table and other places > > where pointer types are constantly being defined. So the primitive > > types list would change to: > > > > 0 void > > 1 void* (implied)This is a very interesting idea, particularly for languages like C++ that have a ton of types. Before making this change, I would want to see some numbers though. In particular, I don't think that types typically take up a large amount of the .bc file size: most of it are instructions. Are you seeing other cases?> > This approach would have the added advantage of being able to check to > > see whether anything is a pointer type by checking bit 0 (1 = yes) and > > deriving its dereferenced type (just subtract 1).I don't think this is a big win, the .bc reader doesn't have to do much of this.> > 3) Have the value index for labels start at 1, just like nonzero values > > of everything else does. This just makes the encode/decode algorithm > > simpler and I doubt it would cost anything in file size. I made this > > suggestion a few emails back, hopefully in a clearer form here. > > Like I replied, we don't store labels as values in LLVM. Labels are just the > names of basic blocks. Those names are stored in the function level symbolI think that Robert's point is that this would remove a special case from the code (which is good). I'm indifferent about the change: if some other changes are made to the .bc file format, this could go in as well.> > 4) Can files have multiple 0x01 headers? I've never seen more than > > one. If not, ditch this four bytes of unnecessary space per file. > > I think the original plan was to have multiple modules in them but this seems > to have gone by the wayside. The result of linking two (or more) modules is a > single module so except in some really bizare corner cases the need for > multiple modules would go away. I suppose we could get rid of the block id > field for the file. I'll give this some thought and see if Chris has any > objections.I don't have any problem with removing it.> Long term, I intend to write some kind of bytecode archive utility similar to > JAR files that contains multiple bytecode files, an index, and the whole thingSounds like a cool thing. If you did this, make sure that llvm-nm could read the files (of course), and, if/when you do this, you could make the interface be llvm-ar (which was never finished).> > I'm committed to making LLVM > > bytecode as compact and as quick to encode/decode as possible. > > Thanks, we appreciate that a lot. Its high on our agenda too.I totally agree as well. :) -Chris -- http://llvm.org/ http://nondot.org/sabre/
At 02:05 PM 8/20/2004, you wrote:>Robert Mykland wrote: >>Dear Chris and Reid: > >Hi Robert. > >>Some other random ideas I've had as I've been sifting through the new >>bytecode format. Please let me know what you think. >>1) ANSI C allows for char to default to unsigned char. This is I guess >>not how it normally is in GCC. If char defaulted to unsigned char >>several things would be possible. Single char constants that are defined >>would be almost always stored in one byte instead of the present usual two. > >So, if I get you correctly, you're advocating the creation of a >Type::CharTyID in the TypeID enumeration that is always written as a >single byte? Note that right now all ASCII values ( <128 ) will be written >as a single byte for UByteTyID but for SByteTyID (often the default from >FE compilers like GCC), you're right, they'll take two bytes if the >value > 63. Or are you saying that we should always write UByteTyID and >SByteTyID as a single byte? > >Long term, LLVM's distinction between signed and unsigned will go away. >Talk to Chris about that. :)No, I didn't make myself clear. When you do a string constant like: printf( "hello, world!" ); ...its is defined as type char*. The ANSI standard for the C programming language allows for compilers to either default char to be the same as unsigned char OR char to be the same as signed char, as they appear to be currently in llvm-gcc. Now this may be very difficult to change in gcc, I don't know, but if it could change, I'm noticing that it would make the bytecode files cleaner. Even though I now understand how to decode these without searching the type table, it would be great if these constants could be treated like any other constants without a size penalty. There would be little size penalty if char defaulted to unsigned char.> Also this would allow string constants to be stored in the >>constant table in the regular fashion without wasting bytes. > >I don't think that happens today. We handle string constants (as opposed >to character constants) separately and they are always written as a length >followed by the characters in the string. > > >This would >>prevent the need to do those expensive linear searches through the type >>slot list in order to match a character constant with its type. > >hmm .. not clear on what you mean here. What "linear searches" ? THe slot >number indexes into the vector of types directly. I believe its O(1), not >O(n). Am I missing something here?D'OH! Never mind. On the example I was looking at, the size of the string happened to match the number of the type slot and I got the two confused.>>1a) If it's not feasible to make char default to unsigned, perhaps it >>would be possible to put all string constants at type slot zero to >>eliminate the linear searching. I realize this would somewhat violate >>LLVM's strict typing rules, but since these are all constants their >>strict type could always be derived if needed. Also, it would save space >>by eliminating the need to create a proliferating number of char array >>types of various lengths. > >Again, I'm not sure what you're getting at here. Have a look at the >function outputConstantStrings in lib/Bytecode/Writer/Writer.cpp. Strings >currently occupy the "void" type plane, plane 0 (which is a bit of a >hack), but we write out the type slot number (index), then the character >data, then a terminating null (0) byte. When the reader reads the constant >strings, it reads the slot number and then calls getType(slot) which just >indexes into the appropriate vector of strings. If you look at getType() >in lib/Bytecode/Reader/Reader.cpp I think you'll find there's no O(n) >searching there, just O(1) indexing.Yep, I understand this now.>>1b) Failing this, you should at least store the type with each constant >>string to avoid the linear searches. This solution would add space, but >>save processing time, especially with large files with extensive type lists. > >Type type slot number is stored with each constant string, as described above.Yep.>>2) I think it would be a big file size and processing speed win to have >>implied pointer types for every literal type. This would save a >>tremendous amount of space in the global type table and other places >>where pointer types are constantly being defined. So the primitive types >>list would change to: >>0 void >>1 void* (implied) >>2 bool >>3 bool* (implied) >>4 ubyte >>5 ubyte* (implied) >>6 sbyte >>7 sbyte* (implied) >>8 ushort >>9 ushort* (implied) >>etc. >>This approach would have the added advantage of being able to check to >>see whether anything is a pointer type by checking bit 0 (1 = yes) and >>deriving its dereferenced type (just subtract 1). > >Interesting idea. However, I'm not convinced it saves us much. Most high >level languages will have lots of pointers to non-primitive types for >which this method won't work. Can you provide me with a bytecode file >where you can show that the bytecode is bloated by lots of pointers to >primitive types?No, you're not getting the point. The bytecode is bloated by pointers to all types. Every time a type is defined pretty much a pointer is defined with it already, and since the pointer type is not implied by the definition of the literal, we waste two or more bytes whenever we define a type. I'm advocating that every time we define a type an implied pointer type to it is also created. We'll save two or more bytes for every type defined, including all function types. That's hundreds of bytes for a big LLVM file.>>3) Have the value index for labels start at 1, just like nonzero values >>of everything else does. This just makes the encode/decode algorithm >>simpler and I doubt it would cost anything in file size. I made this >>suggestion a few emails back, hopefully in a clearer form here. > >Like I replied, we don't store labels as values in LLVM. Labels are just >the names of basic blocks. Those names are stored in the function level >symbol table but there's no value slot indices for labels. Perhaps I'm >missing something here? I'm still not exactly sure what you mean by "the >value index for labels". Do you mean its slot number? If so, I think >you're mistaken on the file format or the doc is misleading you. Please >let me know which so I can fix the doc. :)The symbol table refers to a value index for each label. This value index starts counting at zero, instead of one like for every other data type. So my function that retrieves a thing based on its value index has to have a special case in it for labels. That's what I'm referring to.>>4) Can files have multiple 0x01 headers? I've never seen more than >>one. If not, ditch this four bytes of unnecessary space per file. > >I think the original plan was to have multiple modules in them but this >seems to have gone by the wayside. The result of linking two (or more) >modules is a single module so except in some really bizare corner cases >the need for multiple modules would go away. I suppose we could get rid of >the block id field for the file. I'll give this some thought and see if >Chris has any objections. > >Long term, I intend to write some kind of bytecode archive utility similar >to JAR files that contains multiple bytecode files, an index, and the >whole thing is compressed via bzip2. This would be the normal way that >libraries compiled with LLVM would be distributed. Note that LLVM bytecode >compreses with bzip2 around 50%. We might even add a global symbol index >(optionally) so the things can be linked quickly without examining each >bytecode file in the archive. This is on my "TODO" list but at a lower >priority (for now).I can see how multiple 0x01 modules in a single LLVM file would be handy if you wanted to easily maintain seperate module name spaces in debug versions of libraries. Gosh, but at the very least, since it's a special case anyway, we could shrink this field down to a byte. Come to think of it, you could still have multiple seperate LLVM modules in a file without this marker. If there were still bytes in the file after the last byte specified by the module size, you'd be in another module. Debug libraries is such a rare case that this wouldn't be too big a deal.>>5) Don't write the compaction table for a function if there are no >>entries. All my simple examples have empty compaction tables that use up >>8 bytes per function. This would save space. > >Hmm. That's not supposed to happen. Have you got a bytecode and source >file that produces this? The algorithm in bcwriter for deciding whether to >emit the compaction table is pretty complicated. It should avoid the >compaction table any time there's no advantage to it space wise. The >compaction table's *sole* purpose is to save bytecode space and its BAD if >it gets used incorrectly.It happens all the time! I guess this is a bug. Thanks for everything! -- Robert. Robert Mykland Voice: (831) 462-6725 Founder/CTO Ascenium Corporation