thr3ads.net - llvm dev - [llvm-dev] RFC: Adding a string table to the bitcode format [Apr 2017]

If this information is useful, please help other people find it:
Share via:

Peter Collingbourne via llvm-dev

2017-Apr-04 19:12 UTC

[llvm-dev] RFC: Adding a string table to the bitcode format

On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at apple.com>
wrote:
>
> On Apr 3, 2017, at 7:08 PM, Peter Collingbourne <peter at pcc.me.uk>
wrote:
>
> Hi,
>
> As part of PR27551 I want to add a string table to the bitcode format to
> allow global value and comdat names to be shared with the proposed symbol
> table (and, as side effects, allow comdat names to be shared with value
> names, make bitcode files more compressible and make bitcode easier to
> parse). The format of the string table would be a top-level block
> containing a blob containing null-terminated strings [0] similar to the
> string table format used in most object files.
>
>
>
> I’m in favor of this, but note that currently string can be encoded with
> less than 8 bits / char in some cases (there might some size increase
> because of this).
>
Sure, but I think we need to make the right tradeoff between making data
more efficient to read and using fewer bits. In this case I think the right
tradeoff is clearly in favour of being efficient to read, because accessing
it is in the critical path of a consumer (i.e. a linker), and the part that
needs to be efficient to read is a relatively small part of the data in the
bitcode file. The same logic applies to the symbol table (note that we use
support::ulittle32_t instead of a bit encoding).

That said we already paid this with the metadata table in the recent
past> for example.
>
> The format of MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC,COMDAT}
> records would change so that their first operand would specify their names
> with a byte offset into the string table. (To allow for backwards
> compatibility, I would increment the bitcode version.)
>
>
> I assume you mean the EPOCH?
>
No, the MODULE_CODE_VERSION.
http://llvm-cs.pcc.me.uk/lib/Bitcode/Writer/BitcodeWriter.cpp#3822
It isn't clear to me why we have both.

> Here is what it would look like as bcanalyzer output:
>
> <MODULE_BLOCK>
>   <VERSION op0=2>
>   <COMDAT op0=0 ...> ; name = foo
>   <FUNCTION op0=0 ...> ; name = foo
>   <GLOBALVAR op0=4 ...> ; name = bar
>   <ALIAS op0=8 ...> ; name = baz
>  ; function bodies, etc.
> </MODULE_BLOCK>
> <STRTAB_BLOCK>
>   <STRTAB_BLOB blob="foo\0bar\0baz\0">
> </STRTAB_BLOCK>
>
>
> Why is the string table after the module instead of before?
>
For implementation simplicity. The idea is that the BitcodeWriter would
have a member of type StringTableBuilder which would accumulate strings
while writing the bitcode module(s) (and symtab in the future). At the end,
the client would call something like BitcodeWriter::writeStrtab() which
would write out the string table.

>
> Each STRTAB_BLOCK would apply to all preceding MODULE_BLOCKs. This means
> that bitcode files can continue to be concatenated with "llvm-cat
-b".
> (Normally bitcode files would contain a single string table, which in
> multi-module bitcode files would be shared between modules.)
>
> This *almost* allows us to remove the global (top-level) VST entirely, if
> not for the function offset in the FNENTRY record. However, this offset is
> not actually required because we can scan the module's
FUNCTION_BLOCK_IDs
> as we were doing before http://reviews.llvm.org/D12536 (this may have a
> performance impact, so I'll measure it first).
>
> Assuming that performance looks good, does this seem reasonable to folks?
>
>
>
> I rather seek to have a symbol table that entirely replace the VST, kee.
> If there is a perf impact with the FNENTRY offset, why can’t it be
> replicated in the symbol table?
>
Sure, we could in principle store function offsets in the symbol table as
well, if that helps with performance. But I want to measure the impact and
find out whether that is actually the case first.

Thanks,
-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170404/7810ab51/attachment.html>

Duncan P. N. Exon Smith via llvm-dev

2017-Apr-04 19:36 UTC

head link

[llvm-dev] RFC: Adding a string table to the bitcode format

> On 2017-Apr-04, at 12:12, Peter Collingbourne <peter at pcc.me.uk>
wrote:
> 
> On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at apple.com
<mailto:mehdi.amini at apple.com>> wrote:
> 
>> On Apr 3, 2017, at 7:08 PM, Peter Collingbourne <peter at pcc.me.uk
<mailto:peter at pcc.me.uk>> wrote:
>> 
>> Hi,
>> 
>> As part of PR27551 I want to add a string table to the bitcode format
to allow global value and comdat names to be shared with the proposed symbol
table (and, as side effects, allow comdat names to be shared with value names,
make bitcode files more compressible and make bitcode easier to parse). The
format of the string table would be a top-level block containing a blob
containing null-terminated strings [0] similar to the string table format used
in most object files.
> 
> 
> I’m in favor of this, but note that currently string can be encoded with
less than 8 bits / char in some cases (there might some size increase because of
this).
> 
> Sure, but I think we need to make the right tradeoff between making data
more efficient to read and using fewer bits. In this case I think the right
tradeoff is clearly in favour of being efficient to read, because accessing it
is in the critical path of a consumer (i.e. a linker), and the part that needs
to be efficient to read is a relatively small part of the data in the bitcode
file. The same logic applies to the symbol table (note that we use
support::ulittle32_t instead of a bit encoding).
The same logic might suggest storing string sizes (as a prefix) to avoid
scanning for string lengths.
> 
> That said we already paid this with the metadata table in the recent past
for example.
> 
>> The format of MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC,COMDAT}
records would change so that their first operand would specify their names with
a byte offset into the string table. (To allow for backwards compatibility, I
would increment the bitcode version.)
> 
> I assume you mean the EPOCH?
> 
> No, the MODULE_CODE_VERSION.
> http://llvm-cs.pcc.me.uk/lib/Bitcode/Writer/BitcodeWriter.cpp#3822
<http://llvm-cs.pcc.me.uk/lib/Bitcode/Writer/BitcodeWriter.cpp#3822>
> It isn't clear to me why we have both.
> 
> 
>> Here is what it would look like as bcanalyzer output:
>> 
>> <MODULE_BLOCK>
>>   <VERSION op0=2>
>>   <COMDAT op0=0 ...> ; name = foo
>>   <FUNCTION op0=0 ...> ; name = foo
>>   <GLOBALVAR op0=4 ...> ; name = bar
>>   <ALIAS op0=8 ...> ; name = baz
>>  ; function bodies, etc.
>> </MODULE_BLOCK>
>> <STRTAB_BLOCK>
>>   <STRTAB_BLOB blob="foo\0bar\0baz\0">
>> </STRTAB_BLOCK>
> 
> Why is the string table after the module instead of before?
> 
> For implementation simplicity. The idea is that the BitcodeWriter would
have a member of type StringTableBuilder which would accumulate strings while
writing the bitcode module(s) (and symtab in the future). At the end, the client
would call something like BitcodeWriter::writeStrtab() which would write out the
string table.
> 
> 
> 
>> Each STRTAB_BLOCK would apply to all preceding MODULE_BLOCKs. This
means that bitcode files can continue to be concatenated with "llvm-cat
-b". (Normally bitcode files would contain a single string table, which in
multi-module bitcode files would be shared between modules.)
>> 
>> This *almost* allows us to remove the global (top-level) VST entirely,
if not for the function offset in the FNENTRY record. However, this offset is
not actually required because we can scan the module's FUNCTION_BLOCK_IDs as
we were doing before http://reviews.llvm.org/D12536
<http://reviews.llvm.org/D12536> (this may have a performance impact, so
I'll measure it first).
>> 
>> Assuming that performance looks good, does this seem reasonable to
folks?
> 
> 
> I rather seek to have a symbol table that entirely replace the VST, kee. If
there is a perf impact with the FNENTRY offset, why can’t it be replicated in
the symbol table?
> 
> Sure, we could in principle store function offsets in the symbol table as
well, if that helps with performance. But I want to measure the impact and find
out whether that is actually the case first.
> 
> Thanks,
> -- 
> -- 
> Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170404/641eb488/attachment.html>

Peter Collingbourne via llvm-dev

2017-Apr-04 20:11 UTC

head link

[llvm-dev] RFC: Adding a string table to the bitcode format

On Tue, Apr 4, 2017 at 12:36 PM, Duncan P. N. Exon Smith <
dexonsmith at apple.com> wrote:
>
> On 2017-Apr-04, at 12:12, Peter Collingbourne <peter at pcc.me.uk>
wrote:
>
> On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at
apple.com> wrote:
>
>>
>> On Apr 3, 2017, at 7:08 PM, Peter Collingbourne <peter at
pcc.me.uk> wrote:
>>
>> Hi,
>>
>> As part of PR27551 I want to add a string table to the bitcode format
to
>> allow global value and comdat names to be shared with the proposed
symbol
>> table (and, as side effects, allow comdat names to be shared with value
>> names, make bitcode files more compressible and make bitcode easier to
>> parse). The format of the string table would be a top-level block
>> containing a blob containing null-terminated strings [0] similar to the
>> string table format used in most object files.
>>
>>
>>
>> I’m in favor of this, but note that currently string can be encoded
with
>> less than 8 bits / char in some cases (there might some size increase
>> because of this).
>>
>
> Sure, but I think we need to make the right tradeoff between making data
> more efficient to read and using fewer bits. In this case I think the right
> tradeoff is clearly in favour of being efficient to read, because accessing
> it is in the critical path of a consumer (i.e. a linker), and the part that
> needs to be efficient to read is a relatively small part of the data in the
> bitcode file. The same logic applies to the symbol table (note that we use
> support::ulittle32_t instead of a bit encoding).
>
>
> The same logic might suggest storing string sizes (as a prefix) to avoid
> scanning for string lengths.
>
It might do, keeping in mind that reading pretty much every existing object
file format already requires scanning for string lengths. Certainly
something to try and evaluate, at least.

Peter

>
> That said we already paid this with the metadata table in the recent past
>> for example.
>>
>
>> The format of MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC,COMDAT}
>> records would change so that their first operand would specify their
names
>> with a byte offset into the string table. (To allow for backwards
>> compatibility, I would increment the bitcode version.)
>>
>>
>> I assume you mean the EPOCH?
>>
>
> No, the MODULE_CODE_VERSION.
> http://llvm-cs.pcc.me.uk/lib/Bitcode/Writer/BitcodeWriter.cpp#3822
> It isn't clear to me why we have both.
>
>
>> Here is what it would look like as bcanalyzer output:
>>
>> <MODULE_BLOCK>
>>   <VERSION op0=2>
>>   <COMDAT op0=0 ...> ; name = foo
>>   <FUNCTION op0=0 ...> ; name = foo
>>   <GLOBALVAR op0=4 ...> ; name = bar
>>   <ALIAS op0=8 ...> ; name = baz
>>  ; function bodies, etc.
>> </MODULE_BLOCK>
>> <STRTAB_BLOCK>
>>   <STRTAB_BLOB blob="foo\0bar\0baz\0">
>> </STRTAB_BLOCK>
>>
>>
>> Why is the string table after the module instead of before?
>>
>
> For implementation simplicity. The idea is that the BitcodeWriter would
> have a member of type StringTableBuilder which would accumulate strings
> while writing the bitcode module(s) (and symtab in the future). At the end,
> the client would call something like BitcodeWriter::writeStrtab() which
> would write out the string table.
>
>
>>
>> Each STRTAB_BLOCK would apply to all preceding MODULE_BLOCKs. This
means
>> that bitcode files can continue to be concatenated with "llvm-cat
-b".
>> (Normally bitcode files would contain a single string table, which in
>> multi-module bitcode files would be shared between modules.)
>>
>> This *almost* allows us to remove the global (top-level) VST entirely,
if
>> not for the function offset in the FNENTRY record. However, this offset
is
>> not actually required because we can scan the module's
FUNCTION_BLOCK_IDs
>> as we were doing before http://reviews.llvm.org/D12536 (this may have a
>> performance impact, so I'll measure it first).
>>
>> Assuming that performance looks good, does this seem reasonable to
folks?
>>
>>
>>
>> I rather seek to have a symbol table that entirely replace the VST,
kee.
>> If there is a perf impact with the FNENTRY offset, why can’t it be
>> replicated in the symbol table?
>>
>
> Sure, we could in principle store function offsets in the symbol table as
> well, if that helps with performance. But I want to measure the impact and
> find out whether that is actually the case first.
>
> Thanks,
> --
> --
> Peter
>
>
>

-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170404/1a566d64/attachment-0001.html>

Mehdi Amini via llvm-dev

2017-Apr-04 20:25 UTC

head link

[llvm-dev] RFC: Adding a string table to the bitcode format

> On Apr 4, 2017, at 12:12 PM, Peter Collingbourne <peter at pcc.me.uk>
wrote:
> 
> On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at apple.com
<mailto:mehdi.amini at apple.com>> wrote:
> 
>> On Apr 3, 2017, at 7:08 PM, Peter Collingbourne <peter at pcc.me.uk
<mailto:peter at pcc.me.uk>> wrote:
>> 
>> Hi,
>> 
>> As part of PR27551 I want to add a string table to the bitcode format
to allow global value and comdat names to be shared with the proposed symbol
table (and, as side effects, allow comdat names to be shared with value names,
make bitcode files more compressible and make bitcode easier to parse). The
format of the string table would be a top-level block containing a blob
containing null-terminated strings [0] similar to the string table format used
in most object files.
> 
> 
> I’m in favor of this, but note that currently string can be encoded with
less than 8 bits / char in some cases (there might some size increase because of
this).
> 
> Sure, but I think we need to make the right tradeoff between making data
more efficient to read and using fewer bits. In this case I think the right
tradeoff is clearly in favour of being efficient to read, because accessing it
is in the critical path of a consumer (i.e. a linker), and the part that needs
to be efficient to read is a relatively small part of the data in the bitcode
file. The same logic applies to the symbol table (note that we use
support::ulittle32_t instead of a bit encoding).
> 
> That said we already paid this with the metadata table in the recent past
for example.
> 
>> The format of MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC,COMDAT}
records would change so that their first operand would specify their names with
a byte offset into the string table. (To allow for backwards compatibility, I
would increment the bitcode version.)
> 
> I assume you mean the EPOCH?
> 
> No, the MODULE_CODE_VERSION.
> http://llvm-cs.pcc.me.uk/lib/Bitcode/Writer/BitcodeWriter.cpp#3822
<http://llvm-cs.pcc.me.uk/lib/Bitcode/Writer/BitcodeWriter.cpp#3822>
> It isn't clear to me why we have both.
> 
> 
>> Here is what it would look like as bcanalyzer output:
>> 
>> <MODULE_BLOCK>
>>   <VERSION op0=2>
>>   <COMDAT op0=0 ...> ; name = foo
>>   <FUNCTION op0=0 ...> ; name = foo
>>   <GLOBALVAR op0=4 ...> ; name = bar
>>   <ALIAS op0=8 ...> ; name = baz
>>  ; function bodies, etc.
>> </MODULE_BLOCK>
>> <STRTAB_BLOCK>
>>   <STRTAB_BLOB blob="foo\0bar\0baz\0">
>> </STRTAB_BLOCK>
> 
> Why is the string table after the module instead of before?
> 
> For implementation simplicity. The idea is that the BitcodeWriter would
have a member of type StringTableBuilder which would accumulate strings while
writing the bitcode module(s) (and symtab in the future). At the end, the client
would call something like BitcodeWriter::writeStrtab() which would write out the
string table.
There is already a traversal of the module for value numbering, building the
StringTable at the same time seems quite natural to me.

— 
Mehdi

> 
> 
> 
>> Each STRTAB_BLOCK would apply to all preceding MODULE_BLOCKs. This
means that bitcode files can continue to be concatenated with "llvm-cat
-b". (Normally bitcode files would contain a single string table, which in
multi-module bitcode files would be shared between modules.)
>> 
>> This *almost* allows us to remove the global (top-level) VST entirely,
if not for the function offset in the FNENTRY record. However, this offset is
not actually required because we can scan the module's FUNCTION_BLOCK_IDs as
we were doing before http://reviews.llvm.org/D12536
<http://reviews.llvm.org/D12536> (this may have a performance impact, so
I'll measure it first).
>> 
>> Assuming that performance looks good, does this seem reasonable to
folks?
> 
> 
> I rather seek to have a symbol table that entirely replace the VST, kee. If
there is a perf impact with the FNENTRY offset, why can’t it be replicated in
the symbol table?
> 
> Sure, we could in principle store function offsets in the symbol table as
well, if that helps with performance. But I want to measure the impact and find
out whether that is actually the case first.
> 
> Thanks,
> -- 
> -- 
> Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170404/de057aa0/attachment.html>

Peter Collingbourne via llvm-dev

2017-Apr-04 20:40 UTC

head link

[llvm-dev] RFC: Adding a string table to the bitcode format

On Tue, Apr 4, 2017 at 1:25 PM, Mehdi Amini <mehdi.amini at apple.com>
wrote:
>
> On Apr 4, 2017, at 12:12 PM, Peter Collingbourne <peter at pcc.me.uk>
wrote:
>
> On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at
apple.com> wrote:
>
>>
>> On Apr 3, 2017, at 7:08 PM, Peter Collingbourne <peter at
pcc.me.uk> wrote:
>>
>> Hi,
>>
>> As part of PR27551 I want to add a string table to the bitcode format
to
>> allow global value and comdat names to be shared with the proposed
symbol
>> table (and, as side effects, allow comdat names to be shared with value
>> names, make bitcode files more compressible and make bitcode easier to
>> parse). The format of the string table would be a top-level block
>> containing a blob containing null-terminated strings [0] similar to the
>> string table format used in most object files.
>>
>>
>>
>> I’m in favor of this, but note that currently string can be encoded
with
>> less than 8 bits / char in some cases (there might some size increase
>> because of this).
>>
>
> Sure, but I think we need to make the right tradeoff between making data
> more efficient to read and using fewer bits. In this case I think the right
> tradeoff is clearly in favour of being efficient to read, because accessing
> it is in the critical path of a consumer (i.e. a linker), and the part that
> needs to be efficient to read is a relatively small part of the data in the
> bitcode file. The same logic applies to the symbol table (note that we use
> support::ulittle32_t instead of a bit encoding).
>
> That said we already paid this with the metadata table in the recent past
>> for example.
>>
>
>> The format of MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC,COMDAT}
>> records would change so that their first operand would specify their
names
>> with a byte offset into the string table. (To allow for backwards
>> compatibility, I would increment the bitcode version.)
>>
>>
>> I assume you mean the EPOCH?
>>
>
> No, the MODULE_CODE_VERSION.
> http://llvm-cs.pcc.me.uk/lib/Bitcode/Writer/BitcodeWriter.cpp#3822
> It isn't clear to me why we have both.
>
>
>> Here is what it would look like as bcanalyzer output:
>>
>> <MODULE_BLOCK>
>>   <VERSION op0=2>
>>   <COMDAT op0=0 ...> ; name = foo
>>   <FUNCTION op0=0 ...> ; name = foo
>>   <GLOBALVAR op0=4 ...> ; name = bar
>>   <ALIAS op0=8 ...> ; name = baz
>>  ; function bodies, etc.
>> </MODULE_BLOCK>
>> <STRTAB_BLOCK>
>>   <STRTAB_BLOB blob="foo\0bar\0baz\0">
>> </STRTAB_BLOCK>
>>
>>
>> Why is the string table after the module instead of before?
>>
>
> For implementation simplicity. The idea is that the BitcodeWriter would
> have a member of type StringTableBuilder which would accumulate strings
> while writing the bitcode module(s) (and symtab in the future). At the end,
> the client would call something like BitcodeWriter::writeStrtab() which
> would write out the string table.
>
>
> There is already a traversal of the module for value numbering, building
> the StringTable at the same time seems quite natural to me.
>
Other modules in the same bitcode file may need to add names to the string
table, and the symbol table builder may also need to add mangled names.
Trying to impose an ordering on all of those components doesn't seem worth
it in my opinion.

Peter
>
> —
> Mehdi
>
>
>
>
>>
>> Each STRTAB_BLOCK would apply to all preceding MODULE_BLOCKs. This
means
>> that bitcode files can continue to be concatenated with "llvm-cat
-b".
>> (Normally bitcode files would contain a single string table, which in
>> multi-module bitcode files would be shared between modules.)
>>
>> This *almost* allows us to remove the global (top-level) VST entirely,
if
>> not for the function offset in the FNENTRY record. However, this offset
is
>> not actually required because we can scan the module's
FUNCTION_BLOCK_IDs
>> as we were doing before http://reviews.llvm.org/D12536 (this may have a
>> performance impact, so I'll measure it first).
>>
>> Assuming that performance looks good, does this seem reasonable to
folks?
>>
>>
>>
>> I rather seek to have a symbol table that entirely replace the VST,
kee.
>> If there is a perf impact with the FNENTRY offset, why can’t it be
>> replicated in the symbol table?
>>
>
> Sure, we could in principle store function offsets in the symbol table as
> well, if that helps with performance. But I want to measure the impact and
> find out whether that is actually the case first.
>
> Thanks,
> --
> --
> Peter
>
>
>

-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170404/84770ca5/attachment-0001.html>

llvm dev - Apr 2017 - RFC: Adding a string table to the bitcode format

[llvm-dev] RFC: Adding a string table to the bitcode format

[llvm-dev] RFC: Adding a string table to the bitcode format

[llvm-dev] RFC: Adding a string table to the bitcode format

[llvm-dev] RFC: Adding a string table to the bitcode format

[llvm-dev] RFC: Adding a string table to the bitcode format