thr3ads.net - llvm dev - [LLVMdev] More Encoding Ideas [Aug 2004]

Chris Lattner

2004-Aug-21 00:09 UTC

[LLVMdev] More Encoding Ideas

On Fri, 20 Aug 2004, Reid Spencer wrote:
> > defined would be almost always stored in one byte instead of the present
> > usual two.
>
> So, if I get you correctly, you're advocating the creation of a Type::CharTyID
> in the TypeID enumeration that is always written as a single byte? Note that
> right now all ASCII values ( <128 ) will be written as a single byte for
> UByteTyID but for SByteTyID (often the default from FE compilers like GCC),
> you're right, they'll take two bytes if the value > 63.  Or are you saying that
> we should always write UByteTyID and SByteTyID as a single byte?
>
> Long term, LLVM's distinction between signed and unsigned will go away. Talk to
> Chris about that. :)
If you're interested in the plans, they are described in some detail here:
http://nondot.org/sabre/LLVMNotes/TypeSystemChanges.txt

Note that there is no concrete timeline for this to happen, it basically
depends on when someone is ambitious enough to start working on it.

In any case, both signed and unsigned 8-bit constants can be written out
in a single byte.  Again, do you think it's worth special casing this
though?  Considering that we handle 8-bit strings specially already, there
are not a ton of 8-bit constants with value >= 128.
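
For reference, the one-byte versus two-byte thresholds quoted above (128 for unsigned, 63 for signed) fall out of a variable-length integer encoding. The following is only a sketch under assumptions: 7 payload bits per byte with the high bit meaning "more bytes follow", and signed values stored as magnitude shifted left by one with the sign in bit 0. The helper names `vbrSize` and `signedVbrSize` are hypothetical and are not taken from the bytecode writer.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store `v` in 7-bit groups, where the high bit of each
// byte flags that another byte follows (assumed scheme).
inline unsigned vbrSize(std::uint32_t v) {
  unsigned bytes = 1;
  while (v >= 0x80) {
    v >>= 7;
    ++bytes;
  }
  return bytes;
}

// Signed form (assumed): magnitude shifted left one bit, sign kept in bit 0.
// Only 6 bits of magnitude then fit in the first byte.
inline unsigned signedVbrSize(std::int32_t v) {
  assert(v != INT32_MIN && "magnitude would overflow in this sketch");
  const std::uint32_t magnitude = static_cast<std::uint32_t>(v < 0 ? -v : v);
  return vbrSize((magnitude << 1) | (v < 0 ? 1u : 0u));
}
```

Under these assumptions an unsigned value up to 127 fits in one byte, while a signed value needs a second byte as soon as its magnitude reaches 64.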
> > 2) I think it would be a big file size and processing speed win to have
> > implied pointer types for every literal type.  This would save a
> > tremendous amount of space in the global type table and other places
> > where pointer types are constantly being defined.  So the primitive
> > types list would change to:
> >
> > 0       void
> > 1       void* (implied)
This is a very interesting idea, particularly for languages like C++ that
have a ton of types.  Before making this change, I would want to see some
numbers though.  In particular, I don't think that types typically take up
a large amount of the .bc file size: most of it is instructions.

Are you seeing other cases?
> > This approach would have the added advantage of being able to check to
> > see whether anything is a pointer type by checking bit 0 (1 = yes) and
> > deriving its dereferenced type (just subtract 1).
I don't think this is a big win, the .bc reader doesn't have to do much of
this.
> > 3) Have the value index for labels start at 1, just like nonzero values
> > of everything else does.  This just makes the encode/decode algorithm
> > simpler and I doubt it would cost anything in file size.  I made this
> > suggestion a few emails back, hopefully in a clearer form here.
>
> Like I replied, we don't store labels as values in LLVM. Labels are just the
> names of basic blocks. Those names are stored in the function level symbol
I think that Robert's point is that this would remove a special case from
the code (which is good).  I'm indifferent about the change: if some other
changes are made to the .bc file format, this could go in as well.
> > 4) Can files have multiple 0x01 headers?  I've never seen more than
> > one.  If not, ditch this four bytes of unnecessary space per file.
>
> I think the original plan was to have multiple modules in them but this seems
> to have gone by the wayside. The result of linking two (or more) modules is a
> single module so except in some really bizarre corner cases the need for
> multiple modules would go away. I suppose we could get rid of the block id
> field for the file. I'll give this some thought and see if Chris has any
> objections.
I don't have any problem with removing it.
> Long term, I intend to write some kind of bytecode archive utility similar to
> JAR files that contains multiple bytecode files, an index, and the whole thing
Sounds like a cool thing.  If you did this, make sure that llvm-nm could
read the files (of course), and, if/when you do this, you could make the
interface be llvm-ar (which was never finished).
> > I'm committed to making LLVM
> > bytecode as compact and as quick to encode/decode as possible.
>
> Thanks, we appreciate that a lot. It's high on our agenda too.
I totally agree as well.  :)

-Chris

-- 
http://llvm.org/
http://nondot.org/sabre/

Robert Mykland

2004-Aug-21 00:55 UTC

[LLVMdev] More Encoding Ideas

At 05:09 PM 8/20/2004, you wrote:
>On Fri, 20 Aug 2004, Reid Spencer wrote:
> > > defined would be almost always stored in one byte instead of the present
> > > usual two.
> >
> > So, if I get you correctly, you're advocating the creation of a Type::CharTyID
> > in the TypeID enumeration that is always written as a single byte? Note that
> > right now all ASCII values ( <128 ) will be written as a single byte for
> > UByteTyID but for SByteTyID (often the default from FE compilers like GCC),
> > you're right, they'll take two bytes if the value > 63.  Or are you saying that
> > we should always write UByteTyID and SByteTyID as a single byte?
> >
> > Long term, LLVM's distinction between signed and unsigned will go away. Talk to
> > Chris about that. :)
>
>If you're interested in the plans, they are described in some detail here:
>http://nondot.org/sabre/LLVMNotes/TypeSystemChanges.txt
>
>Note that there is no concrete timeline for this to happen, it basically
>depends on when someone is ambitious enough to start working on it.
>
>In any case, both signed and unsigned 8-bit constants can be written out
>in a single byte.  Again, do you think it's worth special casing this
>though?  Considering that we handle 8-bit strings specially already, there
>are not a ton of 8-bit constants with value >= 128.
I'd rather that they not be treated specially.  If char defaulted to 
unsigned char, there would be little reason to create this special case.
> > > 2) I think it would be a big file size and processing speed win to have
> > > implied pointer types for every literal type.  This would save a
> > > tremendous amount of space in the global type table and other places
> > > where pointer types are constantly being defined.  So the primitive
> > > types list would change to:
> > >
> > > 0       void
> > > 1       void* (implied)
>
>This is a very interesting idea, particularly for languages like C++ that
>have a ton of types.  Before making this change, I would want to see some
>numbers though.  In particular, I don't think that types typically take up
>a large amount of the .bc file size: most of it is instructions.
>
>Are you seeing other cases?
No.  This would only save a bit less than two bytes per primitive and 
defined type.  Maybe a few hundred bytes in a large LLVM file.  Not a big 
savings, but a savings.  The thing I like is that along with the size 
savings it appears to make the encode/decode simpler and quicker if 
anything.  So good news all around.
> > > This approach would have the added advantage of being able to check to
> > > see whether anything is a pointer type by checking bit 0 (1 = yes) and
> > > deriving its dereferenced type (just subtract 1).
>
>I don't think this is a big win, the .bc reader doesn't have to do much of
>this.
I know my reader does this.  I'm not really sure how much time it spends 
doing it.  My little code generator spends a lot of time going back and 
forth between pointers and literal values when turning certain kinds of 
memory operations into data movement in the Ascenium array.
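
The bit-0 test and the subtract-1 step from the proposal can be sketched roughly as follows. This assumes the numbering given in the list above (0 = void, 1 = void*, and so on, with every even ID a literal type and the following odd ID its implied pointer). The names `TypeId`, `isPointerId`, `pointerTo` and `pointeeOf` are made up for illustration, and pointer-to-pointer types are outside the sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical numbering from the proposal: even IDs name a literal type,
// and the odd ID right after it names the implied pointer to that type.
using TypeId = std::uint32_t;

// Bit 0 set means "this ID is an implied pointer type".
inline bool isPointerId(TypeId id) { return (id & 1u) != 0; }

// Pointer to a literal type: the next (odd) ID.
inline TypeId pointerTo(TypeId literal) {
  assert(!isPointerId(literal) && "only literal IDs have an implied pointer");
  return literal + 1;
}

// Dereferenced type of an implied pointer: subtract 1.
inline TypeId pointeeOf(TypeId pointer) {
  assert(isPointerId(pointer) && "not a pointer ID");
  return pointer - 1;
}
```

Going back and forth between a literal type and its pointer is then a single add or subtract, with no table lookup.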
> > > 3) Have the value index for labels start at 1, just like nonzero values
> > > of everything else does.  This just makes the encode/decode algorithm
> > > simpler and I doubt it would cost anything in file size.  I made this
> > > suggestion a few emails back, hopefully in a clearer form here.
> >
> > Like I replied, we don't store labels as values in LLVM. Labels are just the
> > names of basic blocks. Those names are stored in the function level symbol
>
>I think that Robert's point is that this would remove a special case from
>the code (which is good).  I'm indifferent about the change: if some other
>changes are made to the .bc file format, this could go in as well.
Cool.
> > > 4) Can files have multiple 0x01 headers?  I've never seen more than
> > > one.  If not, ditch this four bytes of unnecessary space per file.
> >
> > I think the original plan was to have multiple modules in them but this seems
> > to have gone by the wayside. The result of linking two (or more) modules is a
> > single module so except in some really bizarre corner cases the need for
> > multiple modules would go away. I suppose we could get rid of the block id
> > field for the file. I'll give this some thought and see if Chris has any
> > objections.
>
>I don't have any problem with removing it.
Cool. Before you chop remember debug libraries.
> > Long term, I intend to write some kind of bytecode archive utility similar to
> > JAR files that contains multiple bytecode files, an index, and the whole thing
>
>Sounds like a cool thing.  If you did this, make sure that llvm-nm could
>read the files (of course), and, if/when you do this, you could make the
>interface be llvm-ar (which was never finished).
Seconded!

Regards,

-- Robert.


Robert Mykland               Voice: (831) 462-6725
Founder/CTO                   Ascenium Corporation

Reid Spencer

2004-Aug-21 01:14 UTC

[LLVMdev] More Encoding Ideas

On Fri, 2004-08-20 at 17:55, Robert Mykland wrote:
> At 05:09 PM 8/20/2004, Chris Lattner wrote:
> >
> >If you're interested in the plans, they are described in some detail here:
> >http://nondot.org/sabre/LLVMNotes/TypeSystemChanges.txt
> >
> >Note that there is no concrete timeline for this to happen, it basically
> >depends on when someone is ambitious enough to start working on it.
> >
> >In any case, both signed and unsigned 8-bit constants can be written out
> >in a single byte.  Again, do you think it's worth special casing this
> >though?  Considering that we handle 8-bit strings specially already, there
> >are not a ton of 8-bit constants with value >= 128.
> 
> I'd rather that they not be treated specially.  If char defaulted to 
> unsigned char, there would be little reason to create this special case.
Actually, this isn't a very big deal. It's just handled in a switch()
statement now, so I can just add a couple more cases that handle
UByteTyID and SByteTyID separately.

I'll probably include this in 1.4
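
A rough sketch of what such extra switch cases might look like, not the actual writer code: the enum, its values, `writeVbr` and `writeConstant` are placeholders invented for illustration, and the generic path assumes 7 payload bits per byte with the sign of a signed value kept in bit 0.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical subset of a type-ID enumeration; not the real TypeID values.
enum class TypeId { UByte, SByte, Int };

// Generic path (assumed): 7 bits per byte, high bit set when more follow.
inline void writeVbr(std::vector<std::uint8_t> &out, std::uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<std::uint8_t>(0x80 | (v & 0x7F)));
    v >>= 7;
  }
  out.push_back(static_cast<std::uint8_t>(v));
}

// Writes an integer constant; the two 8-bit cases bypass the generic path.
inline void writeConstant(std::vector<std::uint8_t> &out, TypeId ty,
                          std::int32_t value) {
  switch (ty) {
  case TypeId::UByte:
  case TypeId::SByte:
    // Any 8-bit value, signed or unsigned, is emitted as one raw byte.
    assert(value >= -128 && value <= 255 && "not an 8-bit constant");
    out.push_back(static_cast<std::uint8_t>(value));
    break;
  case TypeId::Int: {
    // Sign in bit 0, magnitude above it (assumed layout).
    const std::int64_t wide = value;
    const std::uint64_t magnitude =
        static_cast<std::uint64_t>(wide < 0 ? -wide : wide);
    writeVbr(out, (magnitude << 1) | (wide < 0 ? 1u : 0u));
    break;
  }
  }
}
```

With the special cases in place, an SByte constant of 100 takes one byte, whereas the same value sent through the generic signed path in this sketch takes two.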
> > > > This approach would have the added advantage of being able to check to
> > > > see whether anything is a pointer type by checking bit 0 (1 = yes) and
> > > > deriving its dereferenced type (just subtract 1).
> >
> >I don't think this is a big win, the .bc reader doesn't have to do much of
> >this.
>
> I know my reader does this.  I'm not really sure how much time it spends
> doing it.  My little code generator spends a lot of time going back and 
> forth between pointers and literal values when turning certain kinds of 
> memory operations into data movement in the Ascenium array.
I will probably make this change in 1.4 to eke out a few more bytes of
savings from the file, and since it will help Robert.
> > > > 4) Can files have multiple 0x01 headers?  I've never seen more than
> > > > one.  If not, ditch this four bytes of unnecessary space per file.
> > >
> > > I think the original plan was to have multiple modules in them but this seems
> > > to have gone by the wayside. The result of linking two (or more) modules is a
> > > single module so except in some really bizarre corner cases the need for
> > > multiple modules would go away. I suppose we could get rid of the block id
> > > field for the file. I'll give this some thought and see if Chris has any
> > > objections.
> >
> >I don't have any problem with removing it.
> 
> Cool. Before you chop remember debug libraries.
Sorry, I'm missing the context here. Why would this affect debug
libraries?

Reid

Chris Lattner

2004-Aug-21 01:43 UTC

[LLVMdev] More Encoding Ideas

On Fri, 20 Aug 2004, Robert Mykland wrote:
> >In any case, both signed and unsigned 8-bit constants can be written out
> >in a single byte.  Again, do you think it's worth special casing this
> >though?  Considering that we handle 8-bit strings specially already, there
> >are not a ton of 8-bit constants with value >= 128.
>
> I'd rather that they not be treated specially.  If char defaulted to
> unsigned char, there would be little reason to create this special case.
I don't understand what you're getting at here.  You can change char to
default to unsigned right now with llvm-gcc -funsigned-char.  I don't
understand how that would change anything to be more useful though.
> >This is a very interesting idea, particularly for languages like C++ that
> >have a ton of types.  Before making this change, I would want to see some
> >numbers though.  In particular, I don't think that types typically take up
> >a large amount of the .bc file size: most of it is instructions.
> >
> >Are you seeing other cases?
>
> No.  This would only save a bit less than two bytes per primitive and
> defined type.  Maybe a few hundred bytes in a large LLVM file.  Not a
> big savings, but a savings.  The thing I like is that along with the
> size savings it appears to make the encode/decode simpler and quicker if
> anything.  So good news all around.
Okay, that's fine.  When implementing that, we should take care to create
the pointer types lazily instead of eagerly to avoid creating pointer
types that are not used.
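
To illustrate the lazy-versus-eager point, here is a minimal sketch in which a pointer type only comes into existence the first time someone asks for it. `Type` and `TypeTable` are toy stand-ins invented for this example and are not LLVM's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Toy stand-in for illustration only; not LLVM's Type class.
struct Type {
  std::string name;
};

class TypeTable {
public:
  // Returns the pointer type for `pointee`, creating it on first request.
  const Type *getPointerTo(const Type &pointee) {
    std::unique_ptr<Type> &slot = pointers_[&pointee];
    if (!slot)
      slot = std::make_unique<Type>(Type{pointee.name + "*"});
    assert(slot && "pointer type must exist after creation");
    return slot.get();
  }

  // Number of pointer types that were actually materialized.
  std::size_t pointerCount() const { return pointers_.size(); }

private:
  std::map<const Type *, std::unique_ptr<Type>> pointers_;
};
```

A literal type whose pointer is never requested costs nothing here, and repeated requests for the same pointee return the same object.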
> > > I think the original plan was to have multiple modules in them but this seems
> > > to have gone by the wayside. The result of linking two (or more) modules is a
> > > single module so except in some really bizarre corner cases the need for
> > > multiple modules would go away. I suppose we could get rid of the block id
> > > field for the file. I'll give this some thought and see if Chris has any
> > > objections.
> >
> >I don't have any problem with removing it.
>
> Cool. Before you chop remember debug libraries.
I think that debug libraries should be handled in other ways.  The
original idea was to have .bc files hold lots of other random cruft with
them.  With more experience, this seems like a bad idea.

-Chris

-- 
http://llvm.org/
http://nondot.org/sabre/
