thr3ads.net - llvm dev - [llvm-dev] [RFC] LLVM Busybox Proposal [Jun 2021]


Leonard Chan via llvm-dev

2021-Jun-22 23:24 UTC

[llvm-dev] [RFC] LLVM Busybox Proposal

Small update: I have a WIP prototype of the tool at
https://reviews.llvm.org/D104686. The prototype only includes llvm-objcopy
and llvm-objdump packed together, but we're seeing size benefits from
busyboxing those two compared against having two separate tools. (More
details are in the prototype's description.) I don't plan on landing this
as-is anytime soon, and there are still some things I'd like to
improve/change and get feedback on.

To answer some replies:

- Ideally, we could start off with an incremental approach and not package
large tools like clang/lld off the bat. The llvm-* tools seem like a good
place to start since they're generally a bunch of relatively small binaries
that all share a subset of functions in libLLVM, but don't necessarily use
all of libLLVM, so statically linking them together (with --gc-sections)
can help dedup a lot of shared components vs having separate statically
compiled tools. In my measurements, the busybox tool containing
llvm-objcopy+objdump is negligibly larger than llvm-objdump on its own (a
couple KB difference) indicating a lot of shared code between objdump and
objcopy.

- Will Dietz's multiplexing tool looks like a good place to start from. The
only concern I can see is the amount of work needed to update it to LLVM 13.

- We don't have plans for Windows support now, but it's not off the table.
(We've been mostly focusing on *nix for now.) Depending on overall traction
for this idea, we could approach this incrementally and add support for
different platforms over time.

- I'm starting to think the `cl::opt` to `OptTable` issue might be
orthogonal to the busybox implementation. The tool essentially dispatches
to different "main" functions in different tools, but as long as we don't
do anything within busybox after exiting that tool's main, then the global
state issues we weren't sure of with `cl::opt` might not be of any concern
now. It may be an issue down the line if, let's say, the tool flags moved
from being "owned" by the tools themselves to instead being "owned" by
busybox, and then we'd have to merge similarly-named flags together. In
that case, migrating these tools to use `OptTable` may be necessary since
(I think) `OptTable` should handle this. This may be a tedious task, but
this is just to say that busybox won't need to be immediately blocked on it.

- I haven't seen any issues with colliding symbols when linking (although
I've only merged two tools for now). I suspect that with small-ish llvm-*
tools, the bulk of their code is shared from libLLVM, and they have their
own distinct logic built on top of it, which could mean a low chance of
conflicting internal ABIs.
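
As an illustration of the dispatch-and-return idea above, here is a minimal
C++ sketch. It is not the code from D104686: the entry points (`objcopy_main`,
`strip_main`, `objdump_main`), the `Tools` table, and the helpers `stem`,
`looksLike`, and `dispatch` are all hypothetical names, and the name matching
is a simplified, case-sensitive take on the substring-style check described in
the original proposal quoted below. The dispatcher returns the selected tool's
result directly and does no work afterwards.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the real tool entry points; the return values
// only make it visible which tool was selected.
int objcopy_main(int, char **) { return 10; }
int strip_main(int, char **) { return 11; }
int objdump_main(int, char **) { return 12; }

using ToolMain = int (*)(int, char **);
struct Tool {
  const char *Name;
  ToolMain Main;
};

static const Tool Tools[] = {
    {"objcopy", objcopy_main}, {"strip", strip_main}, {"objdump", objdump_main}};

// File name without its directory and without a trailing ".exe".
std::string stem(const std::string &Path) {
  std::size_t Slash = Path.find_last_of("/\\");
  std::string Base = Slash == std::string::npos ? Path : Path.substr(Slash + 1);
  const std::string Exe = ".exe";
  if (Base.size() > Exe.size() &&
      Base.compare(Base.size() - Exe.size(), Exe.size(), Exe) == 0)
    Base.erase(Base.size() - Exe.size());
  return Base;
}

// True if the last occurrence of ToolName in Stem ends the name or is
// followed by a non-alphanumeric character ("llvm-strip-10" matches "strip",
// "stripper" does not).
bool looksLike(const std::string &Stem, const std::string &ToolName) {
  std::size_t I = Stem.rfind(ToolName);
  if (I == std::string::npos)
    return false;
  std::size_t End = I + ToolName.size();
  return End == Stem.size() ||
         !std::isalnum(static_cast<unsigned char>(Stem[End]));
}

int dispatch(int Argc, char **Argv) {
  if (Argc < 1)
    return 1;
  // Symlink case: the invoked name itself looks like a tool.
  std::string Invoked = stem(Argv[0]);
  for (const Tool &T : Tools)
    if (looksLike(Invoked, T.Name))
      return T.Main(Argc, Argv); // nothing runs after the tool returns
  // Subcommand case: `llvm <tool> ...`, matched exactly.
  if (Argc > 1)
    for (const Tool &T : Tools)
      if (std::string(Argv[1]) == T.Name)
        return T.Main(Argc - 1, Argv + 1);
  return 1; // unknown tool
}
```

In this sketch, invoking the binary through a symlink named
`gnu-llvm-strip-10` selects the strip entry point, while `llvm objdump ...`
selects objdump with the subcommand shifted into argv[0].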

On Mon, Jun 21, 2021 at 10:54 AM Leonard Chan <leonardchan at google.com>
wrote:
> Hello all,
>
> When building LLVM tools, including Clang and lld, it's currently possible
> to use either static or shared linking for LLVM libraries. The latter can
> significantly reduce the size of the toolchain since we aren't duplicating
> the same code in every binary, but the dynamic relocations can affect
> performance. The former doesn't affect performance but significantly
> increases the size of our toolchain.
>
> We would like to implement support for a third approach which we call,
> for lack of a better term, the "busybox" feature, where everything is
> compiled into a single binary which then dispatches into an appropriate tool
> depending on the first command. This approach can significantly reduce the
> size by deduplicating all of the shared code without affecting the
> performance.
>
> In terms of implementation, the build would produce a single binary called
> `llvm` and the first command would identify the tool. For example, instead
> of invoking `llvm-nm` you'd invoke `llvm nm`. Ideally we would also support
> creation of an `llvm-nm` symlink which redirects to `llvm` for backwards
> compatibility.
> This functionality would ideally be implemented as an option in the CMake
> build that toolchain vendors can opt into.
>
> The implementation would have to replace the `main` function of each tool
> with a regular entrypoint function which is registered into a tool registry.
> This could be wrapped in a macro for convenience. When the "busybox"
> feature is disabled, the macro would expand to a `main` function as before
> and redirect to the entrypoint function. When the "busybox" feature is
> enabled, it would register the entrypoint function into the registry, which
> would be responsible for the dispatching based on the tool name. Ideally,
> toolchain maintainers would also be able to control which tools they could
> add to the "busybox" binary via CMake build options, so toolchains will
> only include the tools they use.
>
> One implementation detail we think will be an issue is merging arguments
> in individual tools that use `cl::opt`. `cl::opt` works by maintaining a
> global state of flags, but we aren’t confident of what the resulting
> behavior will be when merging them together in the dispatching `main`. What
> we would like to avoid is having flags used by one specific tool available
> on other tools. To address this issue, we would like to migrate all tools
> to use `OptTable`, which doesn't have this issue and has been the general
> direction most tools have already been moving in.
>
> A second issue would be resolving symlinks. For example, llvm-objcopy will
> check argv[0] and behave as llvm-strip (i.e. use the right flags +
> configuration) if it is called via a symlink that “looks like” a strip
> tool, but for all other cases it will run under the default objcopy mode.
> The “looks like” function is usually an `Is` function copied in multiple
> tools that is essentially a substring check: so symlinks like `llvm-strip`,
> `strip.exe`, and `gnu-llvm-strip-10` all result in using the strip “mode”
> while all other names use the objcopy mode. To replicate the same behavior,
> we will need to take great care in making sure symlinks to the busybox tool
> dispatch correctly to the appropriate llvm tool, which might mean exposing
> and merging these `Is` functions.
>
> Some open questions:
> - People's initial thoughts/opinions?
> - Are there existing tools in LLVM that already do this?
> - Other implementation details/global states that we would also need to
> account for?
>
> - Leonard
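
The macro-and-registry scheme described in the proposal could be sketched
roughly as follows. This is only an illustration under stated assumptions:
`SKETCH_BUSYBOX`, `SKETCH_TOOL_ENTRY`, `ToolRegistrar`, `toolRegistry`,
`runTool`, and the two toy entry points are hypothetical names, not existing
LLVM APIs or CMake options. With the switch on, each tool registers its
entrypoint in a shared table; with it off, the same macro expands to a plain
`main` that forwards to the entrypoint.

```cpp
#include <cassert>
#include <map>
#include <string>

using ToolMain = int (*)(int, char **);

// Function-local static so the table exists before any registrar runs.
std::map<std::string, ToolMain> &toolRegistry() {
  static std::map<std::string, ToolMain> Registry;
  return Registry;
}

// Registers an entrypoint during static initialization.
struct ToolRegistrar {
  ToolRegistrar(const char *Name, ToolMain Main) { toolRegistry()[Name] = Main; }
};

// Hypothetical build switch; a real build would set this from the build
// system rather than hard-coding it.
#define SKETCH_BUSYBOX 1

#if SKETCH_BUSYBOX
// Busybox mode: record the entrypoint under the tool name.
#define SKETCH_TOOL_ENTRY(Name, Fn)                                            \
  static ToolRegistrar Fn##_registrar(Name, Fn);
#else
// Standalone mode: each tool lives in its own program, so the macro
// provides the usual `main` and forwards to the entrypoint.
#define SKETCH_TOOL_ENTRY(Name, Fn)                                            \
  int main(int Argc, char **Argv) { return Fn(Argc, Argv); }
#endif

// Toy entrypoints; in the standalone configuration these would sit in
// separate source files, one per tool.
int nm_main(int Argc, char **) { return 100 + Argc; }
SKETCH_TOOL_ENTRY("nm", nm_main)

int size_main(int Argc, char **) { return 200 + Argc; }
SKETCH_TOOL_ENTRY("size", size_main)

// Looks up a tool by name and runs it; returns 1 for an unknown tool.
int runTool(const std::string &Name, int Argc, char **Argv) {
  auto It = toolRegistry().find(Name);
  if (It == toolRegistry().end())
    return 1;
  return It->second(Argc, Argv);
}
```

One design caveat with registration through static objects: if the tools are
built as static libraries, a linker may drop an object file that nothing else
references, and its registrar with it, so a real implementation would likely
need to reference each tool explicitly or generate the table at build time.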

Fangrui Song via llvm-dev

2021-Jun-23 00:00 UTC


[llvm-dev] [RFC] LLVM Busybox Proposal

On 2021-06-22, Leonard Chan via llvm-dev wrote:
>Small update: I have a WIP prototype of the tool at
>https://reviews.llvm.org/D104686. The prototype only includes llvm-objcopy
>and llvm-objdump packed together, but we're seeing size benefits from
>busyboxing those two compared against having two separate tools. (More
>details in the prototype's description.) I don't plan on landing this as-is
>anytime soon and there's still some things I'd like to improve/change and
>get feedback on.
>
>To answer some replies:
>
>- Ideally, we could start off with an incremental approach and not package
>large tools like clang/lld off the bat. The llvm-* tools seem like a good
>place to start since they're generally a bunch of relatively small binaries
>that all share a subset of functions in libLLVM, but don't necessarily use
>all of libLLVM, so statically linking them together (with --gc-sections)
>can help dedup a lot of shared components vs having separate statically
>compiled tools. In my measurements, the busybox tool containing
>llvm-objcopy+objdump is negligibly larger than llvm-objdump on its own (a
>couple KB difference) indicating a lot of shared code between objdump and
>objcopy.
>
>- Will Dietz's multiplexing tool looks like a good place to start from. The
>only concern I can see though is mostly the amount of work needed to update
>it to LLVM 13.
>
>- We don't have plans for windows support now, but it's not off the table.
>(Been mostly focusing on *nix for now). Depending on overall traction for
>this idea, we could approach incrementally and add support for different
>platforms over time.

-DLLVM_LINK_LLVM_DYLIB=on -DCLANG_LINK_CLANG_DYLIB=on -DLLVM_TARGETS_TO_BUILD=X86 (custom1)
vs
-DLLVM_TARGETS_TO_BUILD=X86 (custom2)


# This is the lower bound for any multiplexing approach. clang is the largest executable.
% stat -c %s /tmp/out/custom2/bin/clang-13
102900408

I have built clang, lld and a bunch of ELF binary utilities.

% stat -c %s /tmp/out/custom1/lib/libLLVM-13git.so \
    /tmp/out/custom1/lib/libclang-cpp.so.13git \
    /tmp/out/custom1/bin/{clang-13,lld,llvm-{ar,cov,cxxfilt,nm,objcopy,objdump,readobj,size,strings,symbolizer}} \
    | awk '{s+=$1}END{print s}'
138896544

% stat -c %s \
    /tmp/out/custom2/bin/{clang-13,lld,llvm-{ar,cov,cxxfilt,nm,objcopy,objdump,readobj,size,strings,symbolizer}} \
    | awk '{s+=$1}END{print s}'
209054440


The -DLLVM_LINK_LLVM_DYLIB=on -DCLANG_LINK_CLANG_DYLIB=on build is doing a
really good job.

A multiplexing approach can squeeze some bytes from 138896544 toward 102900408,
but how much can it do?
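
To put a number on that question, the figures above bound the possible
savings. The following C++ sketch just does the subtraction; the constant and
function names are made up for illustration, and the bound assumes the
multiplexed binary contains clang, so it can never be smaller than the static
clang-13 executable.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Sizes in bytes, copied from the stat output above.
constexpr std::int64_t ClangStatic = 102900408; // custom2 clang-13 alone
constexpr std::int64_t DylibTotal = 138896544;  // custom1: dylibs + tools
constexpr std::int64_t StaticTotal = 209054440; // custom2: all static tools

// Upper bound on the bytes a multiplexed binary could save relative to a
// given baseline, since it must be at least as large as static clang.
constexpr std::int64_t maxSavings(std::int64_t BaselineTotal) {
  return BaselineTotal - ClangStatic;
}

// Part expressed as a percentage of Whole.
double percentOf(std::int64_t Part, std::int64_t Whole) {
  return 100.0 * static_cast<double>(Part) / static_cast<double>(Whole);
}

// Prints both bounds with their percentage of the respective baseline.
void printBounds() {
  std::printf("vs dylib build:  at most %lld bytes (%.1f%%)\n",
              static_cast<long long>(maxSavings(DylibTotal)),
              percentOf(maxSavings(DylibTotal), DylibTotal));
  std::printf("vs static build: at most %lld bytes (%.1f%%)\n",
              static_cast<long long>(maxSavings(StaticTotal)),
              percentOf(maxSavings(StaticTotal), StaticTotal));
}
```

So against the dylib build, multiplexing can save at most 35996136 bytes
(roughly 26%), while against the fully static build the ceiling is 106154032
bytes (roughly 51%). How close a real multiplexed binary gets to either bound
is not something these numbers show.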

>- I'm starting to think the `cl::opt` to `OptTable` issue might be
>orthogonal to the busybox implementation. The tool essentially dispatches
>to different "main" functions in different tools, but as long as we don't
>do anything within busybox after exiting that tool's main, then the global
>state issues we weren't sure of with `cl::opt` might not be of any concern
>now. It may be an issue down the line if, let's say, the tool flags moved
>from being "owned" by the tools themselves to instead being "owned" by
>busybox, and then we'd have to merge similarly-named flags together. In
>that case, migrating these tools to use `OptTable` may be necessary since
>(I think) `OptTable` should handle this. This may be a tedious task, but
>this is just to say that busybox won't need to be immediately blocked on it.

Such an improvement (migrating from `cl::opt` to `OptTable`) is useful even
if we don't do multiplexing. I switched llvm-symbolizer; thakis switched
llvm-objdump. I can look at some of the other binary utilities.

>- I haven't seen any issues with colliding symbols when linking (although
>I've only merged two tools for now). I suspect that with small-ish llvm-*
>tools, the bulk of their code is shared from libLLVM, and they have their
>own distinct logic built on top of it, which could mean a low chance of
>conflicting internal ABIs.
