Teresa Johnson via llvm-dev
2017-Apr-04 14:37 UTC
[llvm-dev] RFC: Adding a string table to the bitcode format
On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:

> On Apr 3, 2017, at 7:08 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>
>> Hi,
>>
>> As part of PR27551 I want to add a string table to the bitcode format to
>> allow global value and comdat names to be shared with the proposed symbol
>> table (and, as side effects, allow comdat names to be shared with value
>> names, make bitcode files more compressible and make bitcode easier to
>> parse). The format of the string table would be a top-level block
>> containing a blob containing null-terminated strings [0] similar to the
>> string table format used in most object files.
>
> I’m in favor of this, but note that currently strings can be encoded with
> less than 8 bits / char in some cases (there might be some size increase
> because of this).
> That said we already paid this with the metadata table in the recent past
> for example.
>
>> The format of MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC,COMDAT}
>> records would change so that their first operand would specify their names
>> with a byte offset into the string table. (To allow for backwards
>> compatibility, I would increment the bitcode version.)
>
> I assume you mean the EPOCH?
>
>> Here is what it would look like as bcanalyzer output:
>>
>> <MODULE_BLOCK>
>>   <VERSION op0=2>
>>   <COMDAT op0=0 ...> ; name = foo
>>   <FUNCTION op0=0 ...> ; name = foo
>>   <GLOBALVAR op0=4 ...> ; name = bar
>>   <ALIAS op0=8 ...> ; name = baz
>>   ; function bodies, etc.
>> </MODULE_BLOCK>
>> <STRTAB_BLOCK>
>>   <STRTAB_BLOB blob="foo\0bar\0baz\0">
>> </STRTAB_BLOCK>
>
> Why is the string table after the module instead of before?
>
>> Each STRTAB_BLOCK would apply to all preceding MODULE_BLOCKs. This means
>> that bitcode files can continue to be concatenated with "llvm-cat -b".

Do you mean "apply to all preceding MODULE_BLOCKs that aren't followed by
an intervening STRTAB_BLOCK"? I.e. when bitcode files are concatenated you
presumably don't want to apply a STRTAB_BLOCK to a MODULE_BLOCK from a
different input bitcode file that has its own STRTAB_BLOCK.

>> (Normally bitcode files would contain a single string table, which in
>> multi-module bitcode files would be shared between modules.)
>>
>> This *almost* allows us to remove the global (top-level) VST entirely, if
>> not for the function offset in the FNENTRY record. However, this offset is
>> not actually required because we can scan the module's FUNCTION_BLOCK_IDs
>> as we were doing before http://reviews.llvm.org/D12536 (this may have a
>> performance impact, so I'll measure it first).
>>
>> Assuming that performance looks good, does this seem reasonable to folks?
>
> I rather seek to have a symbol table that entirely replace the VST, kee.
> If there is a perf impact with the FNENTRY offset, why can’t it be
> replicated in the symbol table?

Won't the new symbol table be added before the top-level VST can be
removed, i.e. you need the linkage types etc., right? In that case, can the
offset just be added to the new symbol table? That would be more analogous
to object file symbol tables, which also have an offset anyway.

Thanks,
Teresa

> Thanks for driving this,
>
> --
> Mehdi

--
Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413
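For readers following the offsets in the bcanalyzer example: a minimal sketch, in Python rather than LLVM's C++, of how a record's first operand could be resolved against the blob. The helper `name_at` and the `records` list are hypothetical names made up for illustration; only the blob contents and the offsets 0, 4 and 8 come from the proposal.

```python
# Hypothetical illustration of a string table of null-terminated names.
# The blob and the offsets mirror the bcanalyzer example in the RFC.
STRTAB = b"foo\0bar\0baz\0"


def name_at(strtab: bytes, offset: int) -> str:
    """Return the null-terminated string that starts at byte `offset`."""
    end = strtab.index(b"\0", offset)  # raises ValueError if unterminated
    return strtab[offset:end].decode("utf-8")


# (record kind, op0) pairs as they appear in the example MODULE_BLOCK.
records = [("COMDAT", 0), ("FUNCTION", 0), ("GLOBALVAR", 4), ("ALIAS", 8)]

for kind, op0 in records:
    print(f"{kind:<10} op0={op0} name={name_at(STRTAB, op0)}")
```

The COMDAT and FUNCTION records both use offset 0, which is the sharing of comdat names with value names that the proposal lists as a side effect.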
Mehdi Amini via llvm-dev
2017-Apr-04 14:41 UTC
[llvm-dev] RFC: Adding a string table to the bitcode format
> On Apr 4, 2017, at 7:37 AM, Teresa Johnson <tejohnson at google.com> wrote:
>
> On Mon, Apr 3, 2017 at 8:13 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>> I rather seek to have a symbol table that entirely replace the VST, kee.
>> If there is a perf impact with the FNENTRY offset, why can’t it be
>> replicated in the symbol table?
>
> Won't the new symbol table be added before the top-level VST can be
> removed, i.e. you need the linkage types etc right? In that case, can the
> offset just be added to the new symbol table? That would be more analogous
> to object file symbol tables which also have an offset anyway.

I’m not sure I read you correctly, isn’t it what I suggested?

--
Mehdi
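The size concern raised earlier in the thread (names can currently be encoded with fewer than 8 bits per character) can be made concrete with a rough estimate. The sketch below is an assumption-laden illustration, not LLVM code: it counts character payload only, ignores record and abbreviation overhead, and ignores any savings from sharing names. `payload_bits` is a hypothetical helper; the char6 alphabet (letters, digits, `.` and `_`) is the one defined by the bitcode format.

```python
import string

# Char6 alphabet of the bitcode format: [a-zA-Z0-9._]
CHAR6 = set((string.ascii_letters + string.digits + "._").encode("ascii"))


def payload_bits(name: bytes):
    """Return (bits with per-character encoding, bits in a byte blob).

    Counts character payload only; all other overhead is ignored.
    """
    if all(b in CHAR6 for b in name):
        width = 6  # every character fits the char6 alphabet
    elif all(b < 128 for b in name):
        width = 7  # plain ASCII
    else:
        width = 8  # arbitrary bytes
    current = width * len(name)
    blob = 8 * (len(name) + 1)  # one byte per character plus the terminator
    return current, blob


for sample in (b"foo", b"_ZN4llvm6ModuleD2Ev", b"a-b"):
    cur, blob = payload_bits(sample)
    print(sample.decode(), cur, blob)
```

Under these assumptions a char6-only name costs about a third more payload in a blob, before any gain from deduplication or better compressibility is counted.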
Teresa Johnson via llvm-dev
2017-Apr-04 14:46 UTC
[llvm-dev] RFC: Adding a string table to the bitcode format
On Tue, Apr 4, 2017 at 7:41 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:

> On Apr 4, 2017, at 7:37 AM, Teresa Johnson <tejohnson at google.com> wrote:
>
>> Won't the new symbol table be added before the top-level VST can be
>> removed, i.e. you need the linkage types etc right? In that case, can the
>> offset just be added to the new symbol table? That would be more analogous
>> to object file symbol tables which also have an offset anyway.
>
> I’m not sure I read you correctly, isn’t it what I suggested?

It is - I'm just wondering why we would consider removing the offset, since
other things have to be moved from the VST to a new symbol table anyway.
I.e., I'm confused by pcc's comment that this "*almost* allows us to remove
the global (top-level) VST entirely, if not for the function offset in the
FNENTRY record" - there are currently other things in the VST that we need
to move to a new symbol table before it can be removed, and I'm not sure why
this is any different.

Teresa

> --
> Mehdi
Peter Collingbourne via llvm-dev
2017-Apr-04 19:21 UTC
[llvm-dev] RFC: Adding a string table to the bitcode format
On Tue, Apr 4, 2017 at 7:37 AM, Teresa Johnson <tejohnson at google.com> wrote:

>> Each STRTAB_BLOCK would apply to all preceding MODULE_BLOCKs. This means
>> that bitcode files can continue to be concatenated with "llvm-cat -b".
>
> Do you mean "apply to all preceding MODULE_BLOCKs that aren't followed by
> an intervening STRTAB_BLOCK"? I.e. when bitcode files are concatenated you
> presumably don't want to apply a STRTAB_BLOCK to a MODULE_BLOCK from a
> different input bitcode file that has its own STRTAB_BLOCK.

Yes, sorry, that is exactly what I meant.

> Won't the new symbol table be added before the top-level VST can be
> removed, i.e. you need the linkage types etc right? In that case, can the
> offset just be added to the new symbol table? That would be more analogous
> to object file symbol tables which also have an offset anyway.

The VST only stores names (and function offsets). The other attributes are
stored on the MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC} records. So once
we move the names elsewhere, the VST isn't really storing much data at all.

As I mentioned to Mehdi, we could indeed store the function offset in the
symbol table. That would be done in a separate step to this change, which is
just about string tables.

Thanks,
--
Peter
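The rule agreed on here (a STRTAB_BLOCK applies to every preceding MODULE_BLOCK that is not already followed by an intervening STRTAB_BLOCK) can be sketched as a single pass over the top-level blocks. This is a hedged illustration in Python, not the reader's actual implementation; `assign_string_tables` and the block labels are hypothetical.

```python
def assign_string_tables(blocks):
    """Map each module to the string table that applies to it.

    `blocks` is the sequence of top-level blocks in file order, given as
    ("MODULE", module_id) or ("STRTAB", strtab_id) pairs. A string table
    claims every module seen since the previous string table.
    """
    pending = []  # modules not yet claimed by a string table
    owner = {}
    for kind, ident in blocks:
        if kind == "MODULE":
            pending.append(ident)
        elif kind == "STRTAB":
            for module in pending:
                owner[module] = ident
            pending.clear()
    return owner


# Two files concatenated: the first holds two modules sharing one table.
concatenated = [
    ("MODULE", "a1"), ("MODULE", "a2"), ("STRTAB", "A"),
    ("MODULE", "b1"), ("STRTAB", "B"),
]
print(assign_string_tables(concatenated))
```

Because each table only claims modules that are still unclaimed, the table from the second input never reaches back into the first input's modules, which is the property that keeps plain concatenation working.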
Teresa Johnson via llvm-dev
2017-Apr-04 20:23 UTC
[llvm-dev] RFC: Adding a string table to the bitcode format
On Tue, Apr 4, 2017 at 12:21 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:

> The VST only stores names (and function offsets). The other attributes are
> stored on the MODULE_CODE_{FUNCTION,GLOBALVAR,ALIAS,IFUNC} records. So
> once we move the names elsewhere, the VST isn't really storing much data at
> all.

Ok right, that's true... We could probably benchmark the removal of the
offsets on a clang ThinLTO bootstrap. As mentioned off-list to pcc, the
theoretical benefit when I added those offsets was largely because we were
planning to do iterative importing in the ThinLTO backends, which of course
we don't do anymore.

Teresa

> As I mentioned to Mehdi, we could indeed store the function offset in the
> symbol table. That would be done in a separate step to this change, which
> is just about string tables.
>
> Thanks,
> --
> Peter