thr3ads.net - llvm dev - [llvm-dev] [RFC] LLVM Busybox Proposal [Jun 2021]


Leonard Chan via llvm-dev

2021-Jun-21 17:54 UTC

[llvm-dev] [RFC] LLVM Busybox Proposal

Hello all,

When building LLVM tools, including Clang and lld, it's currently possible
to use either static or shared linking for LLVM libraries. The latter can
significantly reduce the size of the toolchain since we aren't duplicating
the same code in every binary, but the dynamic relocations can affect
performance. The former doesn't affect performance but significantly
increases the size of our toolchain.

We would like to implement support for a third approach which we call,
for lack of a better term, the "busybox" feature, where everything is
compiled into a single binary which then dispatches to the appropriate tool
depending on the first command. This approach can significantly reduce
size by deduplicating all of the shared code without affecting
performance.

In terms of implementation, the build would produce a single binary called
`llvm`, and the first command would identify the tool. For example, instead
of invoking `llvm-nm` you'd invoke `llvm nm`. Ideally we would also support
creating an `llvm-nm` symlink that redirects to `llvm` for backwards
compatibility. This functionality would ideally be implemented as an option
in the CMake build that toolchain vendors can opt into.
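
To make the two invocation forms concrete, here is a minimal sketch of how a dispatcher might work out which tool was requested. The names `Invocation`, `baseName`, and `resolveInvocation` are hypothetical and not part of LLVM; the sketch assumes every symlink is named `llvm-<tool>` and ignores suffixes such as `.exe`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical result type: which tool to run and the argv it should see.
struct Invocation {
  std::string Tool;              // e.g. "nm"; empty if nothing was requested
  std::vector<std::string> Args; // argv as the selected tool should see it
};

// Strip any directory components from a path.
static std::string baseName(const std::string &Path) {
  size_t Slash = Path.find_last_of("/\\");
  return Slash == std::string::npos ? Path : Path.substr(Slash + 1);
}

Invocation resolveInvocation(const std::vector<std::string> &Argv) {
  assert(!Argv.empty() && "argv[0] is expected to be present");
  const std::string Prefix = "llvm-";
  std::string Name = baseName(Argv[0]);

  // Symlink form: `llvm-nm a.o` selects "nm" and leaves argv untouched.
  if (Name.compare(0, Prefix.size(), Prefix) == 0)
    return {Name.substr(Prefix.size()), Argv};

  // Subcommand form: `llvm nm a.o` selects "nm" and drops the leading
  // `llvm`, so the tool sees {"llvm-nm", "a.o"}.
  if (Argv.size() > 1) {
    std::vector<std::string> Args(Argv.begin() + 1, Argv.end());
    Args[0] = Prefix + Argv[1];
    return {Argv[1], Args};
  }

  // Bare `llvm` with no subcommand: nothing to dispatch to.
  return {"", {}};
}
```

Rewriting argv[0] in the subcommand form is one possible design choice: it lets a tool that inspects its own name behave the same way under both forms.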

The implementation would have to replace the `main` function of each tool
with a regular entrypoint function which is registered in a tool registry.
This could be wrapped in a macro for convenience. When the "busybox"
feature is disabled, the macro would expand to a `main` function as before
and forward to the entrypoint function. When the "busybox" feature is
enabled, it would register the entrypoint function in the registry, which
would be responsible for dispatching based on the tool name. Ideally,
toolchain maintainers would also be able to control which tools are
added to the "busybox" binary via CMake build options, so toolchains
only include the tools they use.
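
As a rough illustration of the macro and registry described above, the following sketch shows one way the two expansions could look. Everything here is hypothetical: `SKETCH_BUSYBOX`, `SKETCH_TOOL_ENTRY`, `toolRegistry`, `dispatchTool`, and `nm_main` are names invented for this example, and `nm_main` is a toy stand-in rather than the real llvm-nm entrypoint.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical build switch; a real build would set it from a CMake option.
#ifndef SKETCH_BUSYBOX
#define SKETCH_BUSYBOX 1
#endif

using ToolMain = int (*)(int, char **);

// Function-local static so the registry exists before any registration runs.
std::map<std::string, ToolMain> &toolRegistry() {
  static std::map<std::string, ToolMain> Registry;
  return Registry;
}

// Constructing one of these at namespace scope registers a tool at startup.
struct ToolRegistration {
  ToolRegistration(const char *Name, ToolMain Fn) {
    assert(Name && Fn && "a tool needs a name and an entrypoint");
    toolRegistry()[Name] = Fn;
  }
};

#if SKETCH_BUSYBOX
// Busybox mode: register the entrypoint; the dispatcher owns `main`.
#define SKETCH_TOOL_ENTRY(NAME, FN)                                            \
  static ToolRegistration FN##_registration(NAME, FN);
#else
// Standalone mode: expand to a `main` that forwards to the entrypoint.
#define SKETCH_TOOL_ENTRY(NAME, FN)                                            \
  int main(int argc, char **argv) { return FN(argc, argv); }
#endif

// Toy stand-in for what used to be a tool's `main`.
int nm_main(int argc, char **argv) {
  (void)argc;
  (void)argv;
  return 0;
}
SKETCH_TOOL_ENTRY("nm", nm_main)

// Look up a tool by name and run it; returns 1 if the tool is not registered.
int dispatchTool(const std::string &Name, int argc, char **argv) {
  auto It = toolRegistry().find(Name);
  if (It == toolRegistry().end())
    return 1;
  return It->second(argc, argv);
}
```

In busybox mode the dispatching `main` would call something like `dispatchTool` with the resolved tool name; compiling with `-DSKETCH_BUSYBOX=0` instead turns the same source file back into a standalone tool.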

One implementation detail we think will be an issue is merging arguments in
individual tools that use `cl::opt`. `cl::opt` works by maintaining a
global state of flags, and we aren’t confident what the resulting
behavior will be when they are merged together in the dispatching `main`. What
we would like to avoid is having flags used by one specific tool available
in other tools. To address this issue, we would like to migrate all tools
to use `OptTable`, which doesn't have this issue and is the general
direction most tools have already been moving in.
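
The concern can be illustrated with a toy model of self-registering flags. This is not LLVM's CommandLine library: `ToyOpt`, `globalFlags`, and `flagVisible` are hypothetical names, and the sketch only assumes that each flag object adds itself to one process-wide table when it is constructed.

```cpp
#include <cassert>
#include <set>
#include <string>

// One process-wide table of flag names, shared by everything linked in.
std::set<std::string> &globalFlags() {
  static std::set<std::string> Flags;
  return Flags;
}

// Toy flag object that registers itself on construction.
struct ToyOpt {
  explicit ToyOpt(const char *Name) {
    assert(Name && "a flag needs a name");
    globalFlags().insert(Name);
  }
};

// Pretend these definitions live in two different tools' source files.
static ToyOpt NmDemangle("demangle");      // "owned" by a toy nm
static ToyOpt StripKeep("keep-symbol");    // "owned" by a toy strip

// Once both tools are linked into one binary, a lookup cannot tell which
// tool a flag belongs to.
bool flagVisible(const std::string &Name) {
  return globalFlags().count(Name) != 0;
}
```

In this model a toy `nm` would see `keep-symbol` as a known flag simply because the toy `strip` is linked into the same binary, which is the kind of leakage the paragraph above wants to avoid.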

A second issue would be resolving symlinks. For example, llvm-objcopy will
check argv[0] and behave as llvm-strip (i.e. use the right flags and
configuration) if it is called via a symlink that “looks like” a strip
tool, but in all other cases it will run in the default objcopy mode.
The “looks like” function is usually an `Is` function copied across multiple
tools that is essentially a substring check, so symlinks like `llvm-strip`,
`strip.exe`, and `gnu-llvm-strip-10` all result in the strip “mode”
while all other names use the objcopy mode. To replicate the same behavior,
we will need to take great care that symlinks to the busybox tool
dispatch correctly to the appropriate llvm tool, which might mean exposing
and merging these `Is` functions.
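
A standalone sketch of such a “looks like” check is below. It is only an approximation written for this example, not the code used in the tools: `looksLikeTool`, `stemOf`, and `lowerCase` are hypothetical names, and the exact matching rule (last case-insensitive occurrence, not followed by a letter or digit) is an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// File name without directories or a trailing extension such as ".exe".
static std::string stemOf(const std::string &Path) {
  size_t Slash = Path.find_last_of("/\\");
  std::string File = Slash == std::string::npos ? Path : Path.substr(Slash + 1);
  size_t Dot = File.rfind('.');
  return (Dot == std::string::npos || Dot == 0) ? File : File.substr(0, Dot);
}

static std::string lowerCase(std::string S) {
  for (char &C : S)
    C = static_cast<char>(std::tolower(static_cast<unsigned char>(C)));
  return S;
}

// True if the name in Argv0 "looks like" Tool: the tool name occurs in the
// stem and is not immediately followed by a letter or digit.
bool looksLikeTool(const std::string &Argv0, const std::string &Tool) {
  assert(!Tool.empty() && "tool name must not be empty");
  std::string Stem = lowerCase(stemOf(Argv0));
  size_t Pos = Stem.rfind(lowerCase(Tool));
  if (Pos == std::string::npos)
    return false;
  size_t End = Pos + Tool.size();
  return End == Stem.size() ||
         !std::isalnum(static_cast<unsigned char>(Stem[End]));
}
```

With this rule `llvm-strip`, `strip.exe`, and `gnu-llvm-strip-10` all select the strip mode, while `llvm-objcopy` does not. A busybox dispatcher would need a single shared helper of this kind so that every tool and the dispatcher agree on what a given symlink name means.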

Some open questions:
- People's initial thoughts/opinions?
- Are there existing tools in LLVM that already do this?
- Other implementation details/global states that we would also need to
account for?

- Leonard

Tom Stellard via llvm-dev

2021-Jun-21 18:04 UTC


[llvm-dev] [RFC] LLVM Busybox Proposal

On 6/21/21 10:54 AM, Leonard Chan via llvm-dev wrote:
> Some open questions:
> - People's initial thoughts/opinions?
I think it's an interesting idea.  My main concern is that adding a new CMake
option for this is going to complicate the build system and make future CMake
improvements more difficult.

Do you have any idea of how much performance / toolchain size gain you will
get from this approach?

-Tom
> - Are there existing tools in LLVM that already do this?
> - Other implementation details/global states that we would also need to
> account for?
> 
> - Leonard
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

Fangrui Song via llvm-dev

2021-Jun-21 18:17 UTC


[llvm-dev] [RFC] LLVM Busybox Proposal

On 2021-06-21, Leonard Chan via llvm-dev wrote:
>Hello all,
>
>When building LLVM tools, including Clang and lld, it's currently possible
>to use either static or shared linking for LLVM libraries. The latter can
>significantly reduce the size of the toolchain since we aren't duplicating
>the same code in every binary, but the dynamic relocations can affect
>performance. The former doesn't affect performance but significantly
>increases the size of our toolchain.
The dynamic relocation claim is not true.

A thin executable linked against a libLLVM-13git.so built with just
-Bsymbolic is almost identical to a mostly statically linked PIE.

I added -Bsymbolic-functions to libLLVM.so and libclang-cpp.so, which
captures most of the -Bsymbolic benefit.

The shared object approach *can be* inferior to static linking plus
-Wl,--gc-sections because with libLLVM.so and libclang-cpp.so we are
making a great many APIs dynamic, and that inhibits the --gc-sections
benefit. However, if clang and lld are shipped together with
llvm-objdump/llvm-readobj/llvm-objcopy/..., I expect the non-GCable
code due to shared objects to be significantly smaller.

I am wary of adding yet another mechanism.
>Some open questions:
>- People's initial thoughts/opinions?
>- Are there existing tools in LLVM that already do this?
>- Other implementation details/global states that we would also need to
>account for?
crunchgen. As you said, the argv[0]-checking code needs to be taken care of.
We should also make sure these executables' main files do not have colliding
symbols. I have cleaned up a lot of files.

John Criswell via llvm-dev

2021-Jun-21 20:00 UTC


[llvm-dev] [RFC] LLVM Busybox Proposal

Dear Leonard et al.,

Will Dietz built a multiplexing tool using LLVM that does just this: it takes
several programs and merges them together into one “busybox-esque” program that
determines which main() function to call based on the argv[0] string.

The relevant paper is here: https://dl.acm.org/doi/abs/10.1145/3276524

Will included the multiplexer code in the ALLVM code base.  You can look at it
here: https://publish.illinois.edu/allvm-project/software/
I believe the GitHub link is https://github.com/allvm/allvm-tools
I’ve been told that the code was built with LLVM 4.0, so it’d need to be
updated to mainline.

I haven’t used it myself, but the idea of having LLVM multiplex itself seems
cool, and it might make sense to give LLVM the ability to multiplex programs
instead of expending effort doing it manually for LLVM and only getting the
benefit in LLVM.

Regards,

John Criswell

--
John Criswell
Associate Professor
University of Rochester
jtcriswel at gmail.com





Ben Craig via llvm-dev

2021-Jun-21 21:18 UTC


[llvm-dev] [RFC] LLVM Busybox Proposal

Do you have a plan for Windows?  Symlinks on Windows are mostly limited to
administrators and developer mode.


Alexandre Ganea via llvm-dev

2021-Jun-22 15:51 UTC


[llvm-dev] [RFC] LLVM Busybox Proposal

Hello Leonard,

That is a very interesting idea! This will particularly benefit Windows, where
the LLVM bin/ folder is huge (3.5 GiB) since we don’t have working symlinks
out of the box. This also goes in the direction we are pursuing: having Clang
and LLD together in one embedded application, as suggested by llvm-buildozer
[1], although we’re also considering the multi-threading aspect. We took a
different route for now, which is loading the existing executables as shared
libraries inside our application, but our concern was less the binary size on
disk and more runtime performance (build time).

Regarding migrating every option to `OptTable`, are you suggesting removing
`cl::opt` and `CommandLineParser` altogether? I count 3,597 instances of
`cl::opt` in the whole monorepo. This can be a tedious task even with
automation, since it would need some level of classification into the
appropriate .td file. What would be the approach for the migration? To
alleviate the issue of `cl::opt`s crossing tool boundaries, could we
temporarily auto-generate a dictionary of the `cl::opt`s available for each
tool? That could be a quick intermediate step while waiting for a complete
migration.

One other issue I can see is symbols clashing at link time. Having everything
in the same executable requires internal ABI compatibility throughout, i.e.
compiling with the same #defines and linking with the same (system) libraries.
I’m wondering whether any analysis has been done in that regard? But maybe
that is not an issue.

Best,
Alex.

[1] https://reviews.llvm.org/D86351


Leonard Chan via llvm-dev

2021-Jun-22 23:24 UTC


[llvm-dev] [RFC] LLVM Busybox Proposal

Small update: I have a WIP prototype of the tool at
https://reviews.llvm.org/D104686. The prototype only includes llvm-objcopy
and llvm-objdump packed together, but we're seeing size benefits from
busyboxing those two compared against having two separate tools. (More
details are in the prototype's description.) I don't plan on landing this
as-is anytime soon, and there are still some things I'd like to
improve/change and get feedback on.

To answer some replies:

- Ideally, we could start off with an incremental approach and not package
large tools like clang/lld off the bat. The llvm-* tools seem like a good
place to start since they're generally a bunch of relatively small binaries
that all share a subset of functions in libLLVM, but don't necessarily use
all of libLLVM, so statically linking them together (with --gc-sections)
can help dedup a lot of shared components vs having separate statically
compiled tools. In my measurements, the busybox tool containing
llvm-objcopy+objdump is negligibly larger than llvm-objdump on its own (a
couple KB difference) indicating a lot of shared code between objdump and
objcopy.

- Will Dietz's multiplexing tool looks like a good place to start from. The
only concern I can see though is mostly the amount of work needed to update
it to LLVM 13.

- We don't have plans for Windows support now, but it's not off the table
(we've mostly been focusing on *nix for now). Depending on overall traction
for this idea, we could approach this incrementally and add support for
different platforms over time.

- I'm starting to think the `cl::opt` to `OptTable` issue might be
orthogonal to the busybox implementation. The tool essentially dispatches
to different "main" functions in different tools, and as long as we don't
do anything within busybox after exiting that tool's main, the global
state issues we weren't sure of with `cl::opt` might not be of any concern
now. It may become an issue down the line if, let's say, the tool flags moved
from being "owned" by the tools themselves to being "owned" by
busybox, and we then had to merge similarly named flags together. In
that case, migrating these tools to use `OptTable` may be necessary since
(I think) `OptTable` should handle this. This may be a tedious task, but
the point is that busybox doesn't need to be immediately blocked on it.

- I haven't seen any issues with colliding symbols when linking (although
I've only merged two tools for now). I suspect that with small-ish llvm-*
tools, the bulk of their code is shared from libLLVM, and they have their
own distinct logic built on top of it, which could mean a low chance of
conflicting internal ABIs.

On Mon, Jun 21, 2021 at 10:54 AM Leonard Chan <leonardchan at google.com>
wrote:
> Hello all,
>
> When building LLVM tools, including Clang and lld, it's currently
possible
> to use either static or shared linking for LLVM libraries. The latter can
> significantly reduce the size of the toolchain since we aren't
duplicating
> the same code in every binary, but the dynamic relocations can affect
> performance. The former doesn't affect performance but significantly
> increases the size of our toolchain.
>
> We would like to implement a support for a third approach which we call,
> for a lack of better term, "busybox" feature, where everything is
compiled
> into a single binary which then dispatches into an appropriate tool
> depending on the first command. This approach can significantly reduce the
> size by deduplicating all of the shared code without affecting the
> performance.
>
> In terms of implementation, the build would produce a single binary called
> `llvm` and the first command would identify the tool. For example, instead
> of invoking `llvm-nm` you'd invoke `llvm nm`. Ideally we would also
support
> creation of `llvm-nm` symlink which redirects to `llvm` for backwards
> compatibility.
> This functionality would ideally be implemented as an option in the CMake
> build that toolchain vendors can opt into.
>
> The implementation would have to replace `main` function of each tool with
> an entrypoint regular function which is registered into a tool registry.
> This could be wrapped in a macro for convenience. When the
"busybox"
> feature is disabled, the macro would expand to a `main` function as before
> and redirect to the entrypoint function. When the "busybox"
feature is
> enabled, it would register the entrypoint function into the registry, which
> would be responsible for the dispatching based on the tool name. Ideally,
> toolchain maintainers would also be able to control which tools they could
> add to the "busybox" binary via CMake build options, so toolchains will
> only include the tools they use.
>
> One implementation detail we think will be an issue is merging arguments
> in individual tools that use `cl::opt`. `cl::opt` works by maintaining a
> global state of flags, but we aren’t confident of what the resulting
> behavior will be when merging them together in the dispatching `main`. What
> we would like to avoid is having flags used by one specific tool available
> on other tools. To address this issue, we would like to migrate all tools
> to use `OptTable` which doesn't have this issue and has been the general
> direction most tools have been already moving into.
>
> A second issue would be resolving symlinks. For example, llvm-objcopy will
> check argv[0] and behave as llvm-strip (ie. use the right flags +
> configuration) if it is called via a symlink that “looks like” a strip
> tool, but for all other cases it will run under the default objcopy mode.
> The “looks like” function is usually an `Is` function copied in multiple
> tools that is essentially a substring check: so symlinks like `llvm-strip`,
> strip.exe, and `gnu-llvm-strip-10` all result in using the strip “mode”
> while all other names use the objcopy mode. To replicate the same behavior,
> we will need to take great care in making sure symlinks to the busybox tool
> dispatch correctly to the appropriate llvm tool, which might mean exposing
> and merging these `Is` functions.
>
> Some open questions:
> - People's initial thoughts/opinions?
> - Are there existing tools in LLVM that already do this?
> - Other implementation details/global states that we would also need to
> account for?
>
> - Leonard
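
The dispatch scheme in the quoted proposal could be sketched roughly as
follows. This is a standalone illustration under stated assumptions, not code
from any patch: `toolRegistry`, `stem` and `selectTool` are hypothetical names,
and the substring match only approximates the `Is`-style helpers mentioned
above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// All names below are hypothetical; none of them exist in LLVM.
using ToolMain = int (*)(int, char **);

// Registry mapping a tool name ("nm", "strip", ...) to its entrypoint.
std::map<std::string, ToolMain> &toolRegistry() {
  static std::map<std::string, ToolMain> Registry;
  return Registry;
}

// Strip leading directories and a trailing ".exe" from argv[0].
std::string stem(const std::string &Path) {
  std::size_t Slash = Path.find_last_of("/\\");
  std::string Name =
      Slash == std::string::npos ? Path : Path.substr(Slash + 1);
  const std::string Exe = ".exe";
  if (Name.size() > Exe.size() &&
      Name.compare(Name.size() - Exe.size(), Exe.size(), Exe) == 0)
    Name.erase(Name.size() - Exe.size());
  return Name;
}

// Pick the registered tool for this invocation, or "" if none matches.
std::string selectTool(const std::vector<std::string> &Argv) {
  if (Argv.empty())
    return "";
  std::string Name = stem(Argv[0]);
  // `llvm nm ...`: the first command names the tool.
  if (Name == "llvm") {
    if (Argv.size() > 1 && toolRegistry().count(Argv[1]))
      return Argv[1];
    return "";
  }
  // Symlink such as `llvm-strip` or `gnu-llvm-strip-10`: substring match,
  // preferring the longest registered name contained in the stem.
  std::string Best;
  for (const auto &Entry : toolRegistry())
    if (Name.find(Entry.first) != std::string::npos &&
        Entry.first.size() > Best.size())
      Best = Entry.first;
  return Best;
}
```

With "nm", "strip" and "objcopy" registered, `llvm nm` and a symlink named
`gnu-llvm-strip-10` would select "nm" and "strip" respectively. A real
implementation would also need a default for unmatched names, as llvm-objcopy
has today.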

Fāng-ruì Sòng via llvm-dev

2021-Jul-02 17:14 UTC

[llvm-dev] Binary utilities: switch command line parsing from llvm::cl to OptTable (byproduct: drop -long-option?)

llvm/tools/ includes some binary utilities used as replacements for GNU
binutils, e.g. llvm-objcopy, llvm-symbolizer, llvm-nm.
In some old threads people discussed some drawbacks of using cl::opt for
user-facing utilities (I cannot find them now).
Switching to OptTable is an appealing solution. I have prepared two patches
for two binary utilities: llvm-nm and llvm-strings.

* llvm-strings https://reviews.llvm.org/D104889
* llvm-nm https://reviews.llvm.org/D105330

llvm-symbolizer was switched last year. llvm-objdump was switched by thakis
earlier this year.

The switch can fix some corner cases that come from
lib/Support/CommandLine.cpp. Here is a summary:

* -t=d is removed (equal sign after a short option). Use -t d instead.
* --demangle=0 (=0 to disable a boolean option) is removed. Omit the option
or use --no-demangle instead.
* To support boolean options (e.g. --demangle --no-demangle), we don't need
to compare their positions (`if (NoDemangle.getPosition() >
Demangle.getPosition())`, see llvm-nm.cpp).
* Grouped short options can be enabled with one line,
`setGroupedShortOptions`, instead of adding cl::Grouping to every short
option.
* We don't need to add cl::cat to every option and call
`HideUnrelatedOptions` to hide unrelated options from --help. The issue
would happen with cl::opt tools if linker garbage collection is disabled or
libLLVM-13git.so is used. (See https://reviews.llvm.org/D104363)
* If we decide to support binary utility multiplexing (
https://reviews.llvm.org/D104686), we will not get conflicting options. An
option may have different meanings in different utilities (especially for
one-letter options).
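
The boolean-pair point can be illustrated with a minimal standalone sketch of
the last-one-wins rule. It does not use LLVM: `resolveFlag` is a hypothetical
helper written only for this example. OptTable-based tools get the same effect
from a single query on the parsed argument list (`ArgList::hasFlag`, as far as
I know), so no position comparison is needed.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not an LLVM API: returns the value of a boolean
// option given as a positive/negative pair. The last occurrence on the
// command line wins; Default is used when neither spelling appears.
bool resolveFlag(const std::vector<std::string> &Args,
                 const std::string &Pos, const std::string &Neg,
                 bool Default) {
  bool Value = Default;
  for (const std::string &A : Args) {
    if (A == Pos)
      Value = true;
    else if (A == Neg)
      Value = false;
  }
  return Value;
}
```

For example, the arguments `--demangle --no-demangle` resolve to false and
`--no-demangle --demangle` resolve to true, regardless of what else appears
between them.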

*I expect that most users will not observe any difference.*

A related topic is whether we should disallow the single-dash
`-long-option` form.
(Discussed in 2019:
https://lists.llvm.org/pipermail/llvm-dev/2019-April/131786.html Accept
--long-option but not -long-option for llvm binary utilities)
*I'd like to disallow -long-option but may want to do this in a separate
change.*
The main points are that (1) grouped short options have a syntax conflict
with one-dash long options, and (2) the GNU getopt_long style two-dash long
option is much more popular.

I can think of potential pushback for some Mach-O specific options, e.g.
nm -arch: http://www.manpagez.com/man/1/nm/osx-10.12.6.php says `-arch` has
one dash. If such options may have problems, we can keep supporting one-dash
forms. With OptTable, allowing one-dash forms for a specific option is easy.
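
To make the syntax conflict concrete, here is a small standalone sketch. It is
not LLVM's parser: `expandGroup` is a hypothetical helper, and the set of
one-letter options passed to it is made up for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration, not LLVM's parser: try to read Arg as a group
// of one-letter options drawn from ShortOpts. Returns the expanded options,
// or an empty vector if Arg is not a single-dash group of known letters.
std::vector<std::string> expandGroup(const std::string &Arg,
                                     const std::string &ShortOpts) {
  std::vector<std::string> Out;
  if (Arg.size() < 2 || Arg[0] != '-' || Arg[1] == '-')
    return Out;
  for (std::size_t I = 1; I < Arg.size(); ++I) {
    if (ShortOpts.find(Arg[I]) == std::string::npos)
      return {};
    Out.push_back(std::string("-") + Arg[I]);
  }
  return Out;
}
```

If a tool happened to accept the one-letter options a, r, c and h, then
`-arch` would also be a valid group `-a -r -c -h`, so the parser has to pick
one reading. `--arch` never has this problem because the helper rejects
anything starting with two dashes.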
