thr3ads.net - llvm dev - [LLVMdev] More Encoding Ideas [Aug 2004]

If this information is useful, please help other people find it:
Share via:

Robert Mykland

2004-Aug-20 19:52 UTC

[LLVMdev] More Encoding Ideas

Dear Chris and Reid:

Some other random ideas I've had as I've been sifting through the new 
bytecode format.  Please let me know what you think.

1) ANSI C allows for char to default to unsigned char.  This is I guess not 
how it normally is in GCC.  If char defaulted to unsigned char several 
things would be possible.  Single char constants that are defined would be 
almost always stored in one byte instead of the present usual two.  Also 
this would allow string constants to be stored in the constant table in the 
regular fashion without wasting bytes.  This would prevent the need to do 
those expensive linear searches through the type slot list in order to 
match a character constant with its type.

1a) If it's not feasible to make char default to unsigned, perhaps it would 
be possible to put all string constants at type slot zero to eliminate the 
linear searching.  I realize this would somewhat violate LLVM's strict 
typing rules, but since these are all constants their strict type could 
always be derived if needed.  Also, it would save space by eliminating the 
need to create a proliferating number of char array types of various lengths.

1b) Failing this, you should at least store the type with each constant 
string to avoid the linear searches.  This solution would add space, but 
save processing time, especially with large files with extensive type lists.

2) I think it would be a big file size and processing speed win to have 
implied pointer types for every literal type.  This would save a tremendous 
amount of space in the global type table and other places where pointer 
types are constantly being defined.  So the primitive types list would 
change to:

0       void
1       void* (implied)
2       bool
3       bool* (implied)
4       ubyte
5       ubyte* (implied)
6       sbyte
7       sbyte* (implied)
8       ushort
9       ushort* (implied)
etc.

This approach would have the added advantage of being able to check to see 
whether anything is a pointer type by checking bit 0 (1 = yes) and deriving 
its dereferenced type (just subtract 1).

3) Have the value index for labels start at 1, just like nonzero values of 
everything else does.  This just makes the encode/decode algorithm simpler 
and I doubt it would cost anything in file size.  I made this suggestion a 
few emails back, hopefully in a clearer form here.

4) Can files have multiple 0x01 headers?  I've never seen more than 
one.  If not, ditch this four bytes of unnecessary space per file.

5) Don't write the compaction table for a function if there are no 
entries.  All my simple examples have empty compaction tables that use up 8 
bytes per function.  This would save space.

I hope you find these suggestions helpful.  I'm committed to making LLVM 
bytecode as compact and as quick to encode/decode as possible.

Regards,

-- Robert.


Robert Mykland               Voice: (831) 462-6725
Founder/CTO                   Ascenium Corporation

Reid Spencer

2004-Aug-20 21:05 UTC

head link

[LLVMdev] More Encoding Ideas

Robert Mykland wrote:> Dear Chris and Reid:
Hi Robert.
> 
> Some other random ideas I've had as I've been sifting through the
new
> bytecode format.  Please let me know what you think.
> 
> 1) ANSI C allows for char to default to unsigned char.  This is I guess 
> not how it normally is in GCC.  If char defaulted to unsigned char 
> several things would be possible.  Single char constants that are 
> defined would be almost always stored in one byte instead of the present 
> usual two.
So, if I get you correctly, you're advocating the creation of a
Type::CharTyID
in the TypeID enumeration that is always written as a single byte? Note that 
right now all ASCII values ( <128 ) will be written as a single byte for 
UByteTyID but for SByteTyID (often the default from FE compilers like GCC), 
you're right, they'll take two bytes if the value > 63.  Or are you
saying that
we should always write UByteTyID and SByteTyID as a single byte?

Long term, LLVM's distinction between signed and unsigned will go away. Talk
to
Chris about that. :)

   Also this would allow string constants to be stored in
the> constant table in the regular fashion without wasting bytes.  
I don't think that happens today. We handle string constants (as opposed to 
character constants) separately and they are always written as a length 
followed by the characters in the string.


This would> prevent the need to do those expensive linear searches through the type 
> slot list in order to match a character constant with its type.
hmm .. not clear on what you mean here. What "linear searches" ? THe
slot
number indexes into the vector of types directly. I believe its O(1), not O(n). 
Am I missing something here?
> 
> 1a) If it's not feasible to make char default to unsigned, perhaps it 
> would be possible to put all string constants at type slot zero to 
> eliminate the linear searching.  I realize this would somewhat violate 
> LLVM's strict typing rules, but since these are all constants their 
> strict type could always be derived if needed.  Also, it would save 
> space by eliminating the need to create a proliferating number of char 
> array types of various lengths.
Again, I'm not sure what you're getting at here. Have a look at the
function
outputConstantStrings in lib/Bytecode/Writer/Writer.cpp. Strings currently 
occupy the "void" type plane, plane 0 (which is a bit of a hack), but
we write
out the type slot number (index), then the character data, then a terminating 
null (0) byte. When the reader reads the constant strings, it reads the slot 
number and then calls getType(slot) which just indexes into the appropriate 
vector of strings. If you look at getType() in lib/Bytecode/Reader/Reader.cpp I 
think you'll find there's no O(n) searching there, just O(1) indexing.
> 
> 1b) Failing this, you should at least store the type with each constant 
> string to avoid the linear searches.  This solution would add space, but 
> save processing time, especially with large files with extensive type 
> lists.
Type type slot number is stored with each constant string, as described above.
> 
> 2) I think it would be a big file size and processing speed win to have 
> implied pointer types for every literal type.  This would save a 
> tremendous amount of space in the global type table and other places 
> where pointer types are constantly being defined.  So the primitive 
> types list would change to:
> 
> 0       void
> 1       void* (implied)
> 2       bool
> 3       bool* (implied)
> 4       ubyte
> 5       ubyte* (implied)
> 6       sbyte
> 7       sbyte* (implied)
> 8       ushort
> 9       ushort* (implied)
> etc.
> 
> This approach would have the added advantage of being able to check to 
> see whether anything is a pointer type by checking bit 0 (1 = yes) and 
> deriving its dereferenced type (just subtract 1).
Interesting idea. However, I'm not convinced it saves us much. Most high
level
languages will have lots of pointers to non-primitive types for which this 
method won't work. Can you provide me with a bytecode file where you can
show
that the bytecode is bloated by lots of pointers to primitive types?
> 3) Have the value index for labels start at 1, just like nonzero values 
> of everything else does.  This just makes the encode/decode algorithm 
> simpler and I doubt it would cost anything in file size.  I made this 
> suggestion a few emails back, hopefully in a clearer form here.
Like I replied, we don't store labels as values in LLVM. Labels are just the
names of basic blocks. Those names are stored in the function level symbol 
table but there's no value slot indices for labels.  Perhaps I'm missing
something here? I'm still not exactly sure what you mean by "the value
index
for labels". Do you mean its slot number? If so, I think you're
mistaken on the
file format or the doc is misleading you. Please let me know which so I can fix 
the doc. :)
> 4) Can files have multiple 0x01 headers?  I've never seen more than 
> one.  If not, ditch this four bytes of unnecessary space per file.
I think the original plan was to have multiple modules in them but this seems 
to have gone by the wayside. The result of linking two (or more) modules is a 
single module so except in some really bizare corner cases the need for 
multiple modules would go away. I suppose we could get rid of the block id 
field for the file. I'll give this some thought and see if Chris has any 
objections.

Long term, I intend to write some kind of bytecode archive utility similar to 
JAR files that contains multiple bytecode files, an index, and the whole thing 
is compressed via bzip2. This would be the normal way that libraries compiled 
with LLVM would be distributed. Note that LLVM bytecode compreses with bzip2 
around 50%.  We might even add a global symbol index (optionally) so the things 
can be linked quickly without examining each bytecode file in the archive. This 
is on my "TODO" list but at a lower priority (for now).
> 
> 5) Don't write the compaction table for a function if there are no 
> entries.  All my simple examples have empty compaction tables that use 
> up 8 bytes per function.  This would save space.
Hmm. That's not supposed to happen. Have you got a bytecode and source file 
that produces this? The algorithm in bcwriter for deciding whether to emit the 
compaction table is pretty complicated. It should avoid the compaction table 
any time there's no advantage to it space wise. The compaction table's
*sole*
purpose is to save bytecode space and its BAD if it gets used incorrectly.
> 
> I hope you find these suggestions helpful.  
Absolutely! You have some interesting ideas here.
> I'm committed to making LLVM 
> bytecode as compact and as quick to encode/decode as possible.
Thanks, we appreciate that a lot. Its high on our agenda too. Everything we can 
do to make the bytecode faster/smaller has enormous impacts downstream 
(considering linking 1000s of bc files).

Reid.

Reid Spencer

2004-Aug-20 21:14 UTC

head link

[LLVMdev] More Encoding Ideas

I mis-spoke when I said ..
> I don't think that happens today. We handle string constants (as
opposed
> to character constants) separately and they are always written as a 
> length followed by the characters in the string.
Actually, they're written as a type slot number (from which the array length
is
derived), followed by the character data in the string.

and another goof ..
> Again, I'm not sure what you're getting at here. Have a look at the
> function outputConstantStrings in lib/Bytecode/Writer/Writer.cpp. 
> Strings currently occupy the "void" type plane, plane 0 (which is
a bit
> of a hack), but we write out the type slot number (index), then the 
> character data, then a terminating null (0) byte. 
There's no terminating null byte. The length is inferred from the array
type.

Sorry for the inaccuracies .. I shouldn't rely on memory so much :)

Reid.

Chris Lattner

2004-Aug-21 00:09 UTC

head link

[LLVMdev] More Encoding Ideas

On Fri, 20 Aug 2004, Reid Spencer wrote:> > defined would be almost always stored in one byte instead of the
present
> > usual two.
>
> So, if I get you correctly, you're advocating the creation of a
Type::CharTyID
> in the TypeID enumeration that is always written as a single byte? Note
that
> right now all ASCII values ( <128 ) will be written as a single byte for
> UByteTyID but for SByteTyID (often the default from FE compilers like GCC),
> you're right, they'll take two bytes if the value > 63.  Or are
you saying that
> we should always write UByteTyID and SByteTyID as a single byte?
>
> Long term, LLVM's distinction between signed and unsigned will go away.
Talk to
> Chris about that. :)
If you're interested in the plans, they are described in some detail here:
http://nondot.org/sabre/LLVMNotes/TypeSystemChanges.txt

Note that there is no concrete timeline for this to happen, it basically
depends on when someone is ambitious enough to start working on it.

In any case, both signed and unsigned 8-bit constants can be written out
in a single byte.  Again, do you think it's worth special casing this
though?  Considering that we handle 8-bit strings specially already, there
are not a ton of 8-bit constants with value >= 128.
> > 2) I think it would be a big file size and processing speed win to
have
> > implied pointer types for every literal type.  This would save a
> > tremendous amount of space in the global type table and other places
> > where pointer types are constantly being defined.  So the primitive
> > types list would change to:
> >
> > 0       void
> > 1       void* (implied)
This is a very interesting idea, particularly for languages like C++ that
have a ton of types.  Before making this change, I would want to see some
numbers though.  In particular, I don't think that types typically take up
a large amount of the .bc file size: most of it are instructions.

Are you seeing other cases?
> > This approach would have the added advantage of being able to check to
> > see whether anything is a pointer type by checking bit 0 (1 = yes) and
> > deriving its dereferenced type (just subtract 1).
I don't think this is a big win, the .bc reader doesn't have to do much
of
this.
> > 3) Have the value index for labels start at 1, just like nonzero
values
> > of everything else does.  This just makes the encode/decode algorithm
> > simpler and I doubt it would cost anything in file size.  I made this
> > suggestion a few emails back, hopefully in a clearer form here.
>
> Like I replied, we don't store labels as values in LLVM. Labels are
just the
> names of basic blocks. Those names are stored in the function level symbol
I think that Robert's point is that this would remove a special case from
the code (which is good).  I'm indifferent about the change: if some other
changes are made to the .bc file format, this could go in as well.
> > 4) Can files have multiple 0x01 headers?  I've never seen more
than
> > one.  If not, ditch this four bytes of unnecessary space per file.
>
> I think the original plan was to have multiple modules in them but this
seems
> to have gone by the wayside. The result of linking two (or more) modules is
a
> single module so except in some really bizare corner cases the need for
> multiple modules would go away. I suppose we could get rid of the block id
> field for the file. I'll give this some thought and see if Chris has
any
> objections.
I don't have any problem with removing it.
> Long term, I intend to write some kind of bytecode archive utility similar
to
> JAR files that contains multiple bytecode files, an index, and the whole
thing
Sounds like a cool thing.  If you did this, make sure that llvm-nm could
read the files (of course), and, if/when you do this, you could make the
interface be llvm-ar (which was never finished).
> > I'm committed to making LLVM
> > bytecode as compact and as quick to encode/decode as possible.
>
> Thanks, we appreciate that a lot. Its high on our agenda too.
I totally agree as well.  :)

-Chris

-- 
http://llvm.org/
http://nondot.org/sabre/

Robert Mykland

2004-Aug-21 00:39 UTC

head link

[LLVMdev] More Encoding Ideas

At 02:05 PM 8/20/2004, you wrote:>Robert Mykland wrote:
>>Dear Chris and Reid:
>
>Hi Robert.
>
>>Some other random ideas I've had as I've been sifting through
the new
>>bytecode format.  Please let me know what you think.
>>1) ANSI C allows for char to default to unsigned char.  This is I guess 
>>not how it normally is in GCC.  If char defaulted to unsigned char 
>>several things would be possible.  Single char constants that are
defined
>>would be almost always stored in one byte instead of the present usual
two.
>
>So, if I get you correctly, you're advocating the creation of a 
>Type::CharTyID in the TypeID enumeration that is always written as a 
>single byte? Note that right now all ASCII values ( <128 ) will be
written
>as a single byte for UByteTyID but for SByteTyID (often the default from 
>FE compilers like GCC), you're right, they'll take two bytes if the 
>value > 63.  Or are you saying that we should always write UByteTyID and 
>SByteTyID as a single byte?
>
>Long term, LLVM's distinction between signed and unsigned will go away. 
>Talk to Chris about that. :)
No, I didn't make myself clear.  When you do a string constant like:

printf( "hello, world!" );

...its is defined as type char*.  The ANSI standard for the C programming 
language allows for compilers to either default char to be the same as 
unsigned char OR char to be the same as signed char, as they appear to be 
currently in llvm-gcc.  Now this may be very difficult to change in gcc, I 
don't know, but if it could change, I'm noticing that it would make the 
bytecode files cleaner.

Even though I now understand how to decode these without searching the type 
table, it would be great if these constants could be treated like any other 
constants without a size penalty.  There would be little size penalty if 
char defaulted to unsigned char.

>   Also this would allow string constants to be stored in the
>>constant table in the regular fashion without wasting bytes.
>
>I don't think that happens today. We handle string constants (as opposed
>to character constants) separately and they are always written as a length 
>followed by the characters in the string.
>
>
>This would
>>prevent the need to do those expensive linear searches through the type 
>>slot list in order to match a character constant with its type.
>
>hmm .. not clear on what you mean here. What "linear searches" ?
THe slot
>number indexes into the vector of types directly. I believe its O(1), not 
>O(n). Am I missing something here?
D'OH!  Never mind.  On the example I was looking at, the size of the string 
happened to match the number of the type slot and I got the two confused.

>>1a) If it's not feasible to make char default to unsigned, perhaps
it
>>would be possible to put all string constants at type slot zero to 
>>eliminate the linear searching.  I realize this would somewhat violate 
>>LLVM's strict typing rules, but since these are all constants their 
>>strict type could always be derived if needed.  Also, it would save
space
>>by eliminating the need to create a proliferating number of char array 
>>types of various lengths.
>
>Again, I'm not sure what you're getting at here. Have a look at the 
>function outputConstantStrings in lib/Bytecode/Writer/Writer.cpp. Strings 
>currently occupy the "void" type plane, plane 0 (which is a bit of
a
>hack), but we write out the type slot number (index), then the character 
>data, then a terminating null (0) byte. When the reader reads the constant 
>strings, it reads the slot number and then calls getType(slot) which just 
>indexes into the appropriate vector of strings. If you look at getType() 
>in lib/Bytecode/Reader/Reader.cpp I think you'll find there's no
O(n)
>searching there, just O(1) indexing.
Yep, I understand this now.

>>1b) Failing this, you should at least store the type with each constant 
>>string to avoid the linear searches.  This solution would add space, but
>>save processing time, especially with large files with extensive type
lists.
>
>Type type slot number is stored with each constant string, as described
above.
Yep.

>>2) I think it would be a big file size and processing speed win to have 
>>implied pointer types for every literal type.  This would save a 
>>tremendous amount of space in the global type table and other places 
>>where pointer types are constantly being defined.  So the primitive
types
>>list would change to:
>>0       void
>>1       void* (implied)
>>2       bool
>>3       bool* (implied)
>>4       ubyte
>>5       ubyte* (implied)
>>6       sbyte
>>7       sbyte* (implied)
>>8       ushort
>>9       ushort* (implied)
>>etc.
>>This approach would have the added advantage of being able to check to 
>>see whether anything is a pointer type by checking bit 0 (1 = yes) and 
>>deriving its dereferenced type (just subtract 1).
>
>Interesting idea. However, I'm not convinced it saves us much. Most high
>level languages will have lots of pointers to non-primitive types for 
>which this method won't work. Can you provide me with a bytecode file 
>where you can show that the bytecode is bloated by lots of pointers to 
>primitive types?
No, you're not getting the point.  The bytecode is bloated by pointers to 
all types.  Every time a type is defined pretty much a pointer is defined 
with it already, and since the pointer type is not implied by the 
definition of the literal, we waste two or more bytes whenever we define a 
type.  I'm advocating that every time we define a type an implied pointer 
type to it is also created.  We'll save two or more bytes for every type 
defined, including all function types.  That's hundreds of bytes for a big 
LLVM file.
>>3) Have the value index for labels start at 1, just like nonzero values 
>>of everything else does.  This just makes the encode/decode algorithm 
>>simpler and I doubt it would cost anything in file size.  I made this 
>>suggestion a few emails back, hopefully in a clearer form here.
>
>Like I replied, we don't store labels as values in LLVM. Labels are just
>the names of basic blocks. Those names are stored in the function level 
>symbol table but there's no value slot indices for labels.  Perhaps
I'm
>missing something here? I'm still not exactly sure what you mean by
"the
>value index for labels". Do you mean its slot number? If so, I think 
>you're mistaken on the file format or the doc is misleading you. Please 
>let me know which so I can fix the doc. :)
The symbol table refers to a value index for each label.  This value index 
starts counting at zero, instead of one like for every other data type.  So 
my function that retrieves a thing based on its value index has to have a 
special case in it for labels.  That's what I'm referring to.

>>4) Can files have multiple 0x01 headers?  I've never seen more than 
>>one.  If not, ditch this four bytes of unnecessary space per file.
>
>I think the original plan was to have multiple modules in them but this 
>seems to have gone by the wayside. The result of linking two (or more) 
>modules is a single module so except in some really bizare corner cases 
>the need for multiple modules would go away. I suppose we could get rid of 
>the block id field for the file. I'll give this some thought and see if 
>Chris has any objections.
>
>Long term, I intend to write some kind of bytecode archive utility similar 
>to JAR files that contains multiple bytecode files, an index, and the 
>whole thing is compressed via bzip2. This would be the normal way that 
>libraries compiled with LLVM would be distributed. Note that LLVM bytecode 
>compreses with bzip2 around 50%.  We might even add a global symbol index 
>(optionally) so the things can be linked quickly without examining each 
>bytecode file in the archive. This is on my "TODO" list but at a
lower
>priority (for now).
I can see how multiple 0x01 modules in a single LLVM file would be handy if 
you wanted to easily maintain seperate module name spaces in debug versions 
of libraries.

Gosh, but at the very least, since it's a special case anyway, we could 
shrink this field down to a byte.

Come to think of it, you could still have multiple seperate LLVM modules in 
a file without this marker.  If there were still bytes in the file after 
the last byte specified by the module size, you'd be in another 
module.  Debug libraries is such a rare case that this wouldn't be too big 
a deal.
>>5) Don't write the compaction table for a function if there are no 
>>entries.  All my simple examples have empty compaction tables that use
up
>>8 bytes per function.  This would save space.
>
>Hmm. That's not supposed to happen. Have you got a bytecode and source 
>file that produces this? The algorithm in bcwriter for deciding whether to 
>emit the compaction table is pretty complicated. It should avoid the 
>compaction table any time there's no advantage to it space wise. The 
>compaction table's *sole* purpose is to save bytecode space and its BAD
if
>it gets used incorrectly.
It happens all the time!  I guess this is a bug.

Thanks for everything!

-- Robert.


Robert Mykland               Voice: (831) 462-6725
Founder/CTO                   Ascenium Corporation

Maybe Matching Threads

Search for more possibly parallel threads

llvm dev - Aug 2004 - [LLVMdev] More Encoding Ideas

[LLVMdev] More Encoding Ideas

[LLVMdev] More Encoding Ideas

[LLVMdev] More Encoding Ideas

[LLVMdev] More Encoding Ideas

[LLVMdev] More Encoding Ideas

Maybe Matching Threads