Sean Bartell via llvm-dev
2020-Aug-28 01:57 UTC
[llvm-dev] End-to-end -fembed-bitcode .llvmbc and .llvmcmd
Hi Mircea,

If you use an ordinary linker that concatenates .llvmbc sections, you can use this code to get the size of each bitcode module. As far as I know, there's no clean way to separate the .llvmcmd sections without making assumptions about what options were used.

// Given a bitcode file followed by garbage, get the size of the actual
// bitcode. This only works correctly with some kinds of garbage (in
// particular, it will work if the bitcode file is followed by zeros, or if
// it's followed by another bitcode file).
size_t GetBitcodeSize(MemoryBufferRef Buffer) {
  const unsigned char *BufPtr =
      reinterpret_cast<const unsigned char *>(Buffer.getBufferStart());
  const unsigned char *EndBufPtr =
      reinterpret_cast<const unsigned char *>(Buffer.getBufferEnd());
  if (isBitcodeWrapper(BufPtr, EndBufPtr)) {
    const unsigned char *FixedBufPtr = BufPtr;
    if (SkipBitcodeWrapperHeader(FixedBufPtr, EndBufPtr, true))
      report_fatal_error("Invalid bitcode wrapper");
    return EndBufPtr - BufPtr;
  }

  if (!isRawBitcode(BufPtr, EndBufPtr))
    report_fatal_error("Invalid magic bytes; not a bitcode file?");

  BitstreamCursor Reader(Buffer);
  Reader.Read(32); // skip signature
  while (true) {
    size_t EntryStart = Reader.getCurrentByteNo();
    BitstreamEntry Entry =
        Reader.advance(BitstreamCursor::AF_DontAutoprocessAbbrevs);
    if (Entry.Kind == BitstreamEntry::SubBlock) {
      if (Reader.SkipBlock())
        report_fatal_error("Invalid bitcode file");
    } else {
      // We must have reached the end of the module.
      return EntryStart;
    }
  }
}

Sean

On Thu, Aug 27, 2020, at 13:17, Steven Wu via llvm-dev wrote:
> Hi Mircea
>
> From the RFC you mentioned, that is a Darwin-specific implementation, which was later extended to support other targets. The main use case for the embed bitcode option is to allow the compiler to pass intermediate IR and command flags in the object file it produces for later use. For Darwin, it is used for bitcode recompilation, and some might use it to achieve other goals.
>
> In order to use this information properly, you need to have tools that understand the layout and sections for embedded bitcode. You can't just use an ordinary linker, because like you said, an ELF linker will just append the bitcode. Depending on what you are trying to achieve, you need to implement the downstream tools (linker, binary analysis tools, etc.) to understand this concept.
>
> Steven
>
>> On Aug 24, 2020, at 7:10 PM, Mircea Trofin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hello,
>>
>> I'm trying to understand how .llvmbc and .llvmcmd fit into an end-to-end story. From the RFC <http://lists.llvm.org/pipermail/llvm-dev/2016-February/094851.html>, and reading through the implementation, I'm piecing together that the goal was to enable capturing IR right after clang and before passing it to LLVM's optimization passes, as well as the command line options needed for later compiling that IR to the same native object it was compiled to originally (with the same compiler).
>>
>> Here's what I don't understand: say you have a.o and b.o compiled with -fembed-bitcode=all. They are linked into a binary called my_binary. How do you re-create the corresponding IR for modules a and b (let's call them a.bc and b.bc), and their corresponding command lines? From what I can tell, the linker just concatenates the IR for a and b in my_binary's .llvmbc, and the same for the command line in .llvmcmd. Is there a separator maybe I missed? For .llvmcmd, I could see how *maybe* -cc1 could be that separator; what about the .llvmbc part? The magic number?
>>
>> Thanks!
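A size function like Sean's GetBitcodeSize only reports where the first module ends, so recovering a.bc and b.bc from a concatenated .llvmbc section still needs a loop that calls it repeatedly. The following is a minimal sketch of that driver loop, not code from the thread. To keep it free of LLVM headers, the size function is passed in as a callback; `ModuleRange`, `SizeFn`, and `splitConcatenated` are hypothetical names, and the sketch assumes the linker placed the modules back to back with no padding between them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

// A half-open byte range [Offset, Offset + Size) inside the section.
struct ModuleRange {
  std::size_t Offset;
  std::size_t Size;
};

// Hypothetical stand-in for a function such as GetBitcodeSize: given the bytes
// from the start of one module to the end of the section, return the size of
// that first module.
using SizeFn = std::function<std::size_t(const std::uint8_t *, std::size_t)>;

// Walk a concatenated section and record where each module starts and ends.
// Assumes modules are adjacent, with no linker padding in between.
std::vector<ModuleRange>
splitConcatenated(const std::vector<std::uint8_t> &Section,
                  const SizeFn &GetSize) {
  std::vector<ModuleRange> Ranges;
  std::size_t Offset = 0;
  while (Offset < Section.size()) {
    std::size_t Remaining = Section.size() - Offset;
    std::size_t Size = GetSize(Section.data() + Offset, Remaining);
    // A zero size would loop forever; an oversized one would run off the end.
    if (Size == 0 || Size > Remaining)
      throw std::runtime_error("size function returned an invalid size");
    assert(Offset + Size <= Section.size());
    Ranges.push_back({Offset, Size});
    Offset += Size;
  }
  return Ranges;
}
```

Each returned range could then be written out as its own .bc file. This does nothing for .llvmcmd, where, as Sean notes, there is no comparable way to find the boundaries.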
Mircea Trofin via llvm-dev
2020-Aug-28 05:25 UTC
[llvm-dev] End-to-end -fembed-bitcode .llvmbc and .llvmcmd
Thanks, Sean, Steven,

To explore this a bit further: are there currently users for non-Darwin cases? I wonder whether it would be an issue if we inserted markers in the section (maybe as an opt-in, if there were users), such that, when concatenated, the resulting section would be self-describing (for a specialized reader, of course). Basically, achieve what Sean described, but "by design".

For instance, each .o file could have a size, followed by the payload (maybe include the name of the module in the payload, too; maybe compress it, too). Same for the .llvmcmd case.
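The "size, followed by the payload" idea can be illustrated with a small sketch. This is only one possible encoding, not an existing LLVM format: the 8-byte little-endian size field, the names `appendRecord` and `readRecords`, and the assumption that the linker concatenates contributions without inserting padding are all choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append one record: an 8-byte little-endian payload size, then the payload.
// This is what each object file would contribute to the section.
void appendRecord(Bytes &Section, const std::string &Payload) {
  std::uint64_t Size = Payload.size();
  for (int I = 0; I < 8; ++I)
    Section.push_back(static_cast<std::uint8_t>((Size >> (8 * I)) & 0xFF));
  Section.insert(Section.end(), Payload.begin(), Payload.end());
}

// Recover the payloads from a section made of back-to-back records, as a
// concatenating linker would produce.
std::vector<std::string> readRecords(const Bytes &Section) {
  std::vector<std::string> Payloads;
  std::size_t Pos = 0;
  while (Pos < Section.size()) {
    if (Section.size() - Pos < 8)
      throw std::runtime_error("truncated size field");
    std::uint64_t Size = 0;
    for (int I = 0; I < 8; ++I)
      Size |= static_cast<std::uint64_t>(Section[Pos + I]) << (8 * I);
    Pos += 8;
    if (Size > Section.size() - Pos)
      throw std::runtime_error("payload runs past end of section");
    Payloads.emplace_back(
        reinterpret_cast<const char *>(Section.data() + Pos),
        static_cast<std::size_t>(Size));
    Pos += static_cast<std::size_t>(Size);
  }
  assert(Pos == Section.size());
  return Payloads;
}
```

With this layout the reader never has to parse the payload to find the next record, so the same scheme would work for .llvmcmd, for a payload that also carries the module name, or for a compressed payload.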
David Blaikie via llvm-dev
2020-Aug-28 18:22 UTC
[llvm-dev] End-to-end -fembed-bitcode .llvmbc and .llvmcmd
You should probably pull in some folks who implemented/maintain the feature for Darwin. I guess they aren't linking this info, but only communicating in the object file between tools. Maybe they flag these sections (either in the object, or by the linker) as ignored/dropped during linking. That semantic could be implemented in ELF too by marking the sections SHF_IGNORED or something (same-file split DWARF uses this technique).

So maybe the goal/desire is to have a different semantic, rather than the equivalent semantic being different on ELF compared to MachO.

So if it's a different semantic, then yeah, I'd guess a flag that prefixes the module metadata with a length would make sense; then it can be linked naturally on any platform. (If the "don't link these sections" support on Darwin is done by the linker hardcoding the section name, then maybe this flag would also put the data in a different section that isn't linker-stripped on Darwin, so users interested in getting everything linked together can do so on any platform.)

But if this data is linked, then it'd be hard to know which command line goes with which module, yes? So maybe it'd make sense then to have the command line as a header before the module, in the same section, so they're kept together.
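To make the "command line as a header before the module, in the same section" suggestion concrete, here is a minimal sketch of one possible layout. Nothing here is an existing LLVM format: the two 32-bit little-endian size fields, the `Entry` struct, and the functions `appendEntry` and `readEntries` are all hypothetical, and the sketch assumes the linker concatenates entries without padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// One compilation unit's contribution: its command line and its bitcode.
struct Entry {
  std::string CmdLine;
  std::string Bitcode;
};

static void putU32(std::vector<std::uint8_t> &Out, std::uint32_t V) {
  for (int I = 0; I < 4; ++I)
    Out.push_back(static_cast<std::uint8_t>((V >> (8 * I)) & 0xFF));
}

static std::uint32_t getU32(const std::vector<std::uint8_t> &In,
                            std::size_t Pos) {
  std::uint32_t V = 0;
  for (int I = 0; I < 4; ++I)
    V |= static_cast<std::uint32_t>(In[Pos + I]) << (8 * I);
  return V;
}

// Hypothetical layout: u32 command-line size, u32 bitcode size, then the
// command-line bytes, then the bitcode bytes.
void appendEntry(std::vector<std::uint8_t> &Section, const Entry &E) {
  putU32(Section, static_cast<std::uint32_t>(E.CmdLine.size()));
  putU32(Section, static_cast<std::uint32_t>(E.Bitcode.size()));
  Section.insert(Section.end(), E.CmdLine.begin(), E.CmdLine.end());
  Section.insert(Section.end(), E.Bitcode.begin(), E.Bitcode.end());
}

// Read back every (command line, bitcode) pair from a linked section.
std::vector<Entry> readEntries(const std::vector<std::uint8_t> &Section) {
  std::vector<Entry> Entries;
  std::size_t Pos = 0;
  while (Pos < Section.size()) {
    if (Section.size() - Pos < 8)
      throw std::runtime_error("truncated entry header");
    std::size_t CmdSize = getU32(Section, Pos);
    std::size_t BcSize = getU32(Section, Pos + 4);
    Pos += 8;
    std::size_t Left = Section.size() - Pos;
    if (CmdSize > Left || BcSize > Left - CmdSize)
      throw std::runtime_error("entry runs past end of section");
    const char *Base = reinterpret_cast<const char *>(Section.data() + Pos);
    Entries.push_back(
        {std::string(Base, CmdSize), std::string(Base + CmdSize, BcSize)});
    Pos += CmdSize + BcSize;
  }
  assert(Pos == Section.size());
  return Entries;
}
```

Because each command line travels with its module, the pairing survives linking regardless of the order in which the linker visits the object files. A real design would probably also pad the command line so that the bitcode keeps its 32-bit alignment; the sketch leaves that out.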