thr3ads.net - llvm dev - [LLVMdev] More Encoding Ideas [Aug 2004]

Reid Spencer

2004-Aug-24 04:37 UTC

[LLVMdev] More Encoding Ideas

On Mon, 2004-08-23 at 19:46, Robert Mykland wrote:
> At 06:43 PM 8/20/2004, Chris Lattner wrote:
> >I don't understand what you're getting at here.  You can change char to
> >default to unsigned right now with llvm-gcc -funsigned-char.  I don't
> >understand how that would change anything to be more useful though.
> 
> Well, in the old days, char strings were handled just like any other kind
> of array of primitive types.
And, they still are :)
> In that world, when char defaulted to signed 
> char, most of the heavily used ASCII symbols took two bytes to 
> encode.  
Um. What? ASCII is a 7-bit encoding. It defines values 0-127, which, even
with a sign bit, fit into one byte. Recall that in the "old days"
computers had a parity bit as the 8th bit because the memory failure
rates were so high (think vacuum tubes).
> Thus, (and I'm guessing here), you guys decided to treat char 
> strings as a special case to save space in the bytecode file.
Actually, LLVM doesn't really treat character strings specially EXCEPT
in the bcwriter and bcreader. There is no notion in LLVM of a "string",
just primitive types and arrays of them. It is up to the front end
compiler to define what it means by a "string". In the bytecode
libraries of LLVM, we chose to interpret "[n x ubyte]" and "[n x sbyte]"
as "strings" for reading and writing efficiency. They are, however,
still just arrays of one of the two primitive single-byte types.
> If all pointer types are implied, not a problem to create them.  However,
> in larger files it may cost a little due to slightly larger type
> numbers.  I'm not sure about the tradeoff here, but I expect that implied
> pointers would still save more just because of pointers to function types.
Pointers are used heavily in almost all languages. I can almost
guarantee that the "tradeoff" would be larger bytecode files. The use of
pointers to function types is not all that frequent so I wouldn't expect
it to save much.  In any event, we're not going to do anything with this
until there are solid numbers. I'm working on improving llvm-bcanalyzer
to provide them.

Reid

Chris Lattner

2004-Aug-24 06:36 UTC

[LLVMdev] More Encoding Ideas

On Mon, 23 Aug 2004, Reid Spencer wrote:
> > If all pointer types are implied, not a problem to create them.  However,
> > in larger files it may cost a little due to slightly larger type
> > numbers.  I'm not sure about the tradeoff here, but I expect that implied
> > pointers would still save more just because of pointers to function types.
>
> Pointers are used heavily in almost all languages. I can almost
> guarantee that the "tradeoff" would be larger bytecode files. The use of
> pointers to function types is not all that frequent so I wouldn't expect

Note that every LLVM function involves creating a pointer to function, so
it might be a good idea to implicitly encode pointers for every function
type (there is no other way to use a function type in any case).  If you
guys are microoptimizing the bc format, this is an idea.  *shrug*

-Chris

-- 
http://llvm.org/
http://nondot.org/sabre/

Robert Mykland

2004-Aug-26 19:32 UTC

[LLVMdev] More Encoding Ideas

Chris & Reid,

In fact, the primitive that means "function" could stand for "function
pointer", and a new number could be added to the end of the list, if needed,
that just means "function".  As far as I can see, the "function" type slots
are never used.

-- Robert.

At 11:36 PM 8/23/2004, you wrote:
>On Mon, 23 Aug 2004, Reid Spencer wrote:
> > > If all pointer types are implied, not a problem to create them.  However,
> > > in larger files it may cost a little due to slightly larger type
> > > numbers.  I'm not sure about the tradeoff here, but I expect that implied
> > > pointers would still save more just because of pointers to function types.
> >
> > Pointers are used heavily in almost all languages. I can almost
> > guarantee that the "tradeoff" would be larger bytecode files. The use of
> > pointers to function types is not all that frequent so I wouldn't expect
>
>Note that every LLVM function involves creating a pointer to function, so
>it might be a good idea to implicitly encode pointers for every function
>type (there is no other way to use a function type in any case).  If you
>guys are microoptimizing the bc format, this is an idea.  *shrug*
>
>-Chris
>
>--
>http://llvm.org/
>http://nondot.org/sabre/
>
Robert Mykland               Voice: (831) 462-6725
Founder/CTO                   Ascenium Corporation

Robert Mykland

2004-Aug-26 19:45 UTC

[LLVMdev] More Encoding Ideas

At 09:37 PM 8/23/2004, you wrote:
>On Mon, 2004-08-23 at 19:46, Robert Mykland wrote:
> > At 06:43 PM 8/20/2004, Chris Lattner wrote:
> > >I don't understand what you're getting at here.  You can change char to
> > >default to unsigned right now with llvm-gcc -funsigned-char.  I don't
> > >understand how that would change anything to be more useful though.
> >
> > Well, in the old days, char strings were handled just like any other kind
> > of array of primitive types.
>
>And, they still are :)
No.  If you define an array of int, the various int values that initialize
the table are defined separately and then their value indexes are used in
the definition of the int array, which appears under its created type in
the constants list.  Character strings are handled differently from this.
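
As a rough illustration of the difference described above, here is a toy Python model. It is only a sketch: the function names are hypothetical, and storing the character array as raw inline bytes is an assumption about what "handled differently" means, not the actual bytecode layout.

```python
def encode_int_array(values):
    """Toy model: each distinct element becomes a pooled constant,
    and the array itself stores only the slot indexes of its elements."""
    pool = []        # element constants, defined separately
    index_of = {}    # value -> slot index in the pool
    refs = []        # the array definition: one slot index per element
    for v in values:
        if v not in index_of:
            index_of[v] = len(pool)
            pool.append(v)
        refs.append(index_of[v])
    return pool, refs


def encode_char_array(text):
    """Toy model (assumed): a byte array is written inline as raw bytes,
    with no per-element constants."""
    return text.encode("ascii")


pool, refs = encode_int_array([7, 7, 9])
print("int array :", pool, refs)
print("char array:", encode_char_array("hi"))
```

In this model the int array costs one pooled constant per distinct value plus one index per element, while the character array costs one byte per element.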

> > In that world, when char defaulted to signed
> > char, most of the heavily used ASCII symbols took two bytes to
> > encode.
>
>Um. What? ASCII is a 7-bit encoding. It defines values 0-127 which, even
>with a sign bit is encoded into one byte. Recall that in the "old days"
>computers had a parity bit as the 8th-bit because the memory failure
>rates were so high (think vacuum tubes).
Actually, by "old days" I meant LLVM version 0.9.  In LLVM 0.9 they were
most often encoded as two bytes because most of the commonly used
ASCII symbols are above 0x30.
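
One way a two-byte figure like this could arise is sketched below in Python, under a loudly stated assumption: signed values are written as a variable bit rate number with the sign folded into the low bit, which leaves only six magnitude bits in the first byte. Neither that scheme nor the function name is stated in this thread. Under this assumption the cutoff falls at 0x40 rather than 0x30.

```python
def signed_vbr_size(value: int) -> int:
    """Bytes used if the sign is folded into bit 0 (ASSUMED scheme)."""
    folded = (abs(value) << 1) | (1 if value < 0 else 0)
    size = 1
    while folded >= 0x80:   # 7 payload bits per byte
        folded >>= 7
        size += 1
    return size


# Printable ASCII runs from 0x20 (space) through 0x7E (tilde).
printable = range(0x20, 0x7F)
two_byte = [c for c in printable if signed_vbr_size(c) == 2]
print(f"{len(two_byte)} of {len(printable)} printable ASCII codes need two bytes")
print(f"first two-byte code: {hex(two_byte[0])}")
```

With an unsigned element type the shift by one disappears, and every ASCII code fits in a single byte.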
> > Thus, (and I'm guessing here), you guys decided to treat char
> > strings as a special case to save space in the bytecode file.
>
>Actually, LLVM doesn't really treat character strings specially EXCEPT
>in the bcwriter and bcreader. There is no notion in LLVM of a "string",
>just primitive types and arrays of them. It is up to the front end
>compiler to define what it means by a "string". In the bytecode
>libraries of LLVM, we chose to interpret "[n x ubyte]" and "[n x sbyte]"
>as "strings" for reading and writing efficiency. They are, however,
>still just arrays of one of the two primitive single-byte types.
Okay, but this discussion is about the physical protocol of the 
bytecode.  That's what I'm referring to.
> > If all pointer types are implied, not a problem to create them.  However,
> > in larger files it may cost a little due to slightly larger type
> > numbers.  I'm not sure about the tradeoff here, but I expect that implied
> > pointers would still save more just because of pointers to function types.
>
>Pointers are used heavily in almost all languages. I can almost
>guarantee that the "tradeoff" would be larger bytecode files. The use of
>pointers to function types is not all that frequent so I wouldn't expect
>it to save much.  In any event, we're not going to do anything with this
>until there are solid numbers. I'm working on improving llvm-bcanalyzer
>to provide them.
Right now I see pointer types being created for practically every literal 
type defined anyway.  I doubt you'd see much file bloat due to pointer 
types being implied for everything.  These pointer types are already being 
defined.

However, I could see how it could conceivably save more file space to only 
define pointer types where absolutely necessary, thus keeping the overall 
number of types to an absolute minimum.  Chris mentioned this philosophy 
and I think it's a good one.  Perhaps we could also find a way to declare 
pointer types without having to declare the literal type if the literal 
type is never used.  Functions are but one example of this.

Regards,

-- Robert.


Robert Mykland               Voice: (831) 462-6725
Founder/CTO                   Ascenium Corporation

Reid Spencer

2004-Aug-26 20:11 UTC

[LLVMdev] More Encoding Ideas

On Thu, 2004-08-26 at 12:45, Robert Mykland wrote:
> At 09:37 PM 8/23/2004, you wrote:
> >On Mon, 2004-08-23 at 19:46, Robert Mykland wrote:
> > > At 06:43 PM 8/20/2004, Chris Lattner wrote:
> > > >I don't understand what you're getting at here.  You can change char to
> > > >default to unsigned right now with llvm-gcc -funsigned-char.  I don't
> > > >understand how that would change anything to be more useful though.
> > >
> > > Well, in the old days, char strings were handled just like any other kind
> > > of array of primitive types.
> >
> >And, they still are :)
> 
> No.  If you define an array of int, the various int values that initialize
> the table are defined separately and then their value indexes are used in
> the definition of the int array, which appears under its created type in
> the constants list.  Character strings are handled differently from this.
Okay, I see what you mean now. Yes, bytecode treats character arrays
(well, to be precise arrays of UByte or SByte) separately from arrays of
other primitive types.
> 
> > > In that world, when char defaulted to signed
> > > char, most of the heavily used ASCII symbols took two bytes to
> > > encode.
> >
> >Um. What? ASCII is a 7-bit encoding. It defines values 0-127 which, even
> >with a sign bit is encoded into one byte. Recall that in the "old days"
> >computers had a parity bit as the 8th-bit because the memory failure
> >rates were so high (think vacuum tubes).
> 
> Actually, by "old days" I meant LLVM version 0.9.  In LLVM 0.9 they were
> most often encoded as two bytes because most of the most commonly used
> ASCII symbols are above 0x30.
Okay. LLVM 0.9 predates my involvement so I wouldn't know :)
> 
> > > Thus, (and I'm guessing here), you guys decided to treat char
> > > strings as a special case to save space in the bytecode file.
> >
> >Actually, LLVM doesn't really treat character strings specially EXCEPT
> >in the bcwriter and bcreader. There is no notion in LLVM of a "string",
> >just primitive types and arrays of them. It is up to the front end
> >compiler to define what it means by a "string". In the bytecode
> >libraries of LLVM, we chose to interpret "[n x ubyte]" and "[n x sbyte]"
> >as "strings" for reading and writing efficiency. They are, however,
> >still just arrays of one of the two primitive single-byte types.
> 
> Okay, but this discussion is about the physical protocol of the
> bytecode.  That's what I'm referring to.
Okay.
> 
> > > If all pointer types are implied, not a problem to create them.  However,
> > > in larger files it may cost a little due to slightly larger type
> > > numbers.  I'm not sure about the tradeoff here, but I expect that implied
> > > pointers would still save more just because of pointers to function types.
> >
> >Pointers are used heavily in almost all languages. I can almost
> >guarantee that the "tradeoff" would be larger bytecode files. The use of
> >pointers to function types is not all that frequent so I wouldn't expect
> >it to save much.  In any event, we're not going to do anything with this
> >until there are solid numbers. I'm working on improving llvm-bcanalyzer
> >to provide them.
> 
> Right now I see pointer types being created for practically every literal
> type defined anyway.  I doubt you'd see much file bloat due to pointer
> types being implied for everything.  These pointer types are already being
> defined.
The bloat doesn't occur because of the extra type definitions. A pointer
type definition is incredibly small (basically a VBR number or 1-2 bytes
to reference the element type). The bloat occurs because of the
automatic doubling of the slot numbers. Say you had 100 non-pointer
types of which 15 needed pointers for a total of 115. If we
automatically made even indices in the type table be non-pointer types
and odd indices be pointer-types, we'd have 200 entries in the table.
Now the entries above 127 all require 2 VBR bytes to reference instead
of just 1 as per the current mechanism. If those higher slot number
types are used frequently, this could add serious bloat to the file.
What we *really* want to do here is (a) only define types that are used
and (b) sort the types by frequency of use so that the most frequently
used types have the smallest slot numbers. This will ensure the most
compact encoding. 
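
The arithmetic above can be checked with a short Python sketch. It assumes a VBR encoding with 7 payload bits per byte, which is consistent with the statement that entries above 127 need two bytes. The function names and the usage counts are made up for illustration; this is not the bytecode writer's code.

```python
def vbr_size(value: int) -> int:
    """Bytes needed for an unsigned value at 7 payload bits per byte."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size


def two_byte_slots(slot_count: int) -> int:
    """How many slots in 0..slot_count-1 need more than one byte to reference."""
    return sum(1 for slot in range(slot_count) if vbr_size(slot) > 1)


def slots_by_frequency(use_counts: dict) -> dict:
    """Give the most frequently referenced types the smallest slot numbers."""
    ranked = sorted(use_counts, key=lambda name: (-use_counts[name], name))
    return {name: slot for slot, name in enumerate(ranked)}


def reference_bytes(use_counts: dict, slots: dict) -> int:
    """Total bytes spent on type references for a given slot assignment."""
    return sum(n * vbr_size(slots[name]) for name, n in use_counts.items())


# The example from the text: 115 explicit entries vs. 200 interleaved entries.
print("two-byte slots:", two_byte_slots(115), "vs", two_byte_slots(200))

# Hypothetical usage counts: one hot type that happens to be declared last.
use_counts = {f"t{i}": 1 for i in range(199)}
use_counts["hot"] = 1000
declared = {name: slot for slot, name in enumerate(use_counts)}
print("declaration order:", reference_bytes(use_counts, declared), "bytes")
print("frequency order  :",
      reference_bytes(use_counts, slots_by_frequency(use_counts)), "bytes")
```

The cost depends on how often the high-numbered slots are referenced, not on how many table entries exist, which is why the frequency ordering helps.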

You should try out llvm-bcanalyzer sometime. It will show you that on
average the size of the type table is about 1% of the file size whereas
the size of the instructions is around 90% (unless there are a lot of
symbols, in which case the symbol table rivals the instruction lists).
> 
> However, I could see how it could conceivably save more file space to only
> define pointer types where absolutely necessary, thus keeping the overall
> number of types to an absolute minimum.  Chris mentioned this philosophy
> and I think it's a good one.  Perhaps we could also find a way to declare
> pointer types without having to declare the literal type if the literal
> type is never used.  Functions are but one example of this.
True 'nuff.

Reid
