thr3ads.net - llvm dev - [llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries [Feb 2019]

If this information is useful, please help other people find it:
Share via:

Peter Collingbourne via llvm-dev

2019-Feb-27 01:34 UTC

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

Hi folks,

I'd like to propose adding a feature to ELF lld for automatically
partitioning a program into multiple binaries. (This will also involve
adding a feature to clang, so I've cc'd cfe-dev as well.)

== Problem statement =
Embedded devices such as cell phones, especially lower end devices, are
typically highly resource constrained. Users of cell phone applications
must pay a cost (in terms of download size as well as storage space) for
all features that the application implements, even for features that are
only used by a minority of users. Therefore, there is a desire to split
applications into multiple pieces that can be downloaded independently, so
that the majority of users only pay the cost of the commonly used features.
This can technically be achieved using traditional ELF dynamic linking: the
main part of the program can be compiled as an executable or DSO that
exports symbols that are then imported by a separate DSO containing the
part of the program implementing the optional feature. However, this itself
imposes costs:
- Each exported symbol by itself imposes additional binary size costs, as
it requires the name of the symbol and a dynamic symbol table entry to be
stored in both the exporting and importing DSO, and on the importing side a
dynamic relocation, a GOT entry and likely a PLT entry must be present.
These additional costs go some way towards defeating the purpose of
splitting the program into pieces in the first place, and can also impact
program startup and overall performance because of the additional
indirections.
- It can result in more code needing to appear in the main part of the
program than necessary. For example, imagine that both the feature and the
main program make use of a common (statically linked) library, but they
call different subsets of the functions in that library. With traditional
ELF linking we are forced to either link and export the entire library from
the main program (even the functions unused by either part of the program)
or carefully maintain a list of functions that are used by the other parts
of the program.
- Since the linker does not see the whole program at once and links each
piece independently, a number of link-time optimizations and features stop
working, such as LTO across partition boundaries, whole-program
devirtualization and non-cross-DSO control flow integrity (control flow
integrity has a cross-DSO mode, but that also imposes binary size costs
because a significant amount of metadata needs to appear in each DSO).
There are ways around at least the first point. For example, the program
could arrange to use a custom mechanism for binding references between the
main program and the feature code, such as a table of entry points.
However, this can impose maintenance costs (for example, the binding
mechanism can be intrusive in the source code and typically has to be
maintained manually), and it still does not address the last point.

== Proposed solution =
I propose to extend lld so that it can perform the required partitioning
automatically, given a set of entry points for each part of the program.
The end product of linking will be a main program (which can be either an
executable or a DSO) combined with a set of DSOs that must be loaded at
fixed addresses relative to the base address of the main program. These
binaries will all share a virtual address space so that they can refer to
one another directly using PC-relative references or RELATIVE dynamic
relocations as if they were all statically linked together in the first
place, rather than via the GOT (or custom GOT-equivalent).

The way that it will work is that we can extend the graph reachability
algorithm currently implemented by the linker for --gc-sections. The entry
points for each partition are marked up with a string naming the partition,
either at the source level with an attribute on the function or global
variable, or by passing a flag to the compiler (this string becomes the
partition's soname). These symbols will act as the GC roots for the
partition and will be exported from its dynsym. Assuming that there is a
single partition, let's call this set of symbols S2, while all other GC
roots (e.g. non-marked-up exported symbols, sections in .init_array) we
call S1. Any sections reachable from S1 are allocated to the main
partition, while sections reachable only from S2 but not from S1 are
allocated to S2's partition. We can extend this idea to multiple loadable
partitions by defining S3, S4 and so on, but any sections reachable from
multiple loadable partitions are allocated to the main partition even if
they aren’t reachable from the main partition.

When assigning input sections to output sections, we take into account, in
addition to the name of the input section, the partition that the input
section is assigned to. The SHF_ALLOC output sections are first sorted by
partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
sections appear last and are not sorted by partition. In the end we are
left with a collection of output sections that might look like this:

Main partition:
0x0000 ELF header, phdrs
0x1000 .rodata
0x2000 .dynsym
0x3000 .text

Loadable partition 1:
0x4000 ELF header, phdrs
0x5000 .rodata
0x6000 .dynsym
0x7000 .text

Loadable partition 2:
0x8000 ELF header, phdrs
0x9000 .rodata
0xa000 .dynsym
0xb000 .text

Non-SHF_ALLOC sections from all partitions:
.comment
.debug_info
(etc.)

Now linking proceeds mostly as usual, and we’re effectively left with a
single .so that contains all of the partitions concatenated together. This
isn’t very useful on its own and is likely to confuse tools (e.g. due to
the presence of multiple .dynsyms); we can add a feature to llvm-objcopy
that will extract the individual partitions from the output file
essentially by taking a slice of the combined .so file. These slices can
also be fed to tools such as debuggers provided that the non-SHF_ALLOC
sections are left in place.

The envisaged usage of this feature is as follows:
$ clang -ffunction-sections -fdata-sections -c main.c # compile the main
program
$ clang -ffunction-sections -fdata-sections
-fsymbol-partition=libfeature.so -c feature.c # compile the feature
$ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
-Wl,-soname,libmain.so -Wl,--gc-sections
$ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
$ llvm-objcopy libcombined.so libfeature.so
--extract-partition=libfeature.so

On Android, the loadable partitions can be loaded with the
android_dlopen_ext
<https://developer.android.com/ndk/reference/group/libdl>
function passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at
the correct address relative to the main partition. Other platforms that
wish to support this feature will likely either need to add a similar
feature to their dynamic loader or (in order to support loading the
partitions with a regular dlopen) define a custom dynamic tag that will
cause the dynamic loader to first load the main partition and then the
loadable partition at the correct relative address.

== In more detail =
Each loadable partition will require its own sections to support the
dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
.dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
.interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
with the main partition. This means that all addresses associated with
symbols will continue to be fixed.

In order to cause the dynamic loader to reserve address space for the
loadable partitions so that they can be loaded at the correct address
later, a PT_LOAD segment is added to the main partition that allocates a
page of bss at the address one byte past the end of the last address in the
last partition. In the Android dynamic loader at least, this is enough to
cause the required space to be reserved. Other platforms would need to
ensure that their dynamic loader implements similar behaviour.

I haven't thought about how this feature will interact with linker scripts.
At least to start with we will likely need to forbid using this feature
together with the PHDRS or SECTIONS linker script directives.

Some sections will need to be present in each partition (e.g. .interp and
.note sections). Probably the most straightforward way to do this will be
to cause the linker to create a clone of these sections for each partition.

== Other use cases =
An example of another use case for this feature could be an operating
system API which is exposed across multiple DSOs. Typically these DSOs will
be implemented using private APIs that are not exposed to the application.
This feature would allow you to create a common DSO that contains the
shared code implementing the private APIs (i.e. the main partition),
together with individual DSOs (i.e. the loadable partitions) that use the
private APIs and expose the public ones, but without actually exposing the
private APIs in the dynamic symbol table or paying the binary size cost of
doing so.

== Prototype =
A prototype/proof of concept of this feature has been implemented here:
https://github.com/pcc/llvm-project/tree/lld-module-symbols
There is a test app in the test-progs/app directory that demonstrates the
feature on Android with a simple hello world app (based on
https://www.hanshq.net/command-line-android.html ). I have successfully
tested debugging the loadable partition with gdb (e.g. setting breakpoints
and printing globals), but getting unwinding working will need a bit more
work.

Note that the feature as exposed by the prototype is different from what
I'm proposing here, e.g. it uses a linker flag to specify which symbols go
in which partitions. I think the best place to specify this information is
at either the source level or the compiler flag level, so that is what I
intend to implement.

Thanks,
-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190226/85c880b0/attachment.html>

Jake Ehrlich via llvm-dev

2019-Feb-27 02:22 UTC

head link

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

Neat Idea.
>In order to cause the dynamic loader to reserve address space for theloadable partitions so that they can be loaded at the correct address
later, a PT_LOAD segment is added to the main partition that allocates a
page of bss at the address one byte past the end of the last address in the
last partition. In the Android dynamic loader at least, this is enough to
cause the required space to be reserved. Other platforms would need to
ensure that their dynamic loader implements similar behaviour.

Maybe all partitions can contain the same PT_LOAD segments? or the main
partition contains all PT_LOAD segments for the entire address space?

On Tue, Feb 26, 2019 at 5:35 PM Peter Collingbourne via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hi folks,
>
> I'd like to propose adding a feature to ELF lld for automatically
> partitioning a program into multiple binaries. (This will also involve
> adding a feature to clang, so I've cc'd cfe-dev as well.)
>
> == Problem statement =>
> Embedded devices such as cell phones, especially lower end devices, are
> typically highly resource constrained. Users of cell phone applications
> must pay a cost (in terms of download size as well as storage space) for
> all features that the application implements, even for features that are
> only used by a minority of users. Therefore, there is a desire to split
> applications into multiple pieces that can be downloaded independently, so
> that the majority of users only pay the cost of the commonly used features.
> This can technically be achieved using traditional ELF dynamic linking: the
> main part of the program can be compiled as an executable or DSO that
> exports symbols that are then imported by a separate DSO containing the
> part of the program implementing the optional feature. However, this itself
> imposes costs:
> - Each exported symbol by itself imposes additional binary size costs, as
> it requires the name of the symbol and a dynamic symbol table entry to be
> stored in both the exporting and importing DSO, and on the importing side a
> dynamic relocation, a GOT entry and likely a PLT entry must be present.
> These additional costs go some way towards defeating the purpose of
> splitting the program into pieces in the first place, and can also impact
> program startup and overall performance because of the additional
> indirections.
> - It can result in more code needing to appear in the main part of the
> program than necessary. For example, imagine that both the feature and the
> main program make use of a common (statically linked) library, but they
> call different subsets of the functions in that library. With traditional
> ELF linking we are forced to either link and export the entire library from
> the main program (even the functions unused by either part of the program)
> or carefully maintain a list of functions that are used by the other parts
> of the program.
> - Since the linker does not see the whole program at once and links each
> piece independently, a number of link-time optimizations and features stop
> working, such as LTO across partition boundaries, whole-program
> devirtualization and non-cross-DSO control flow integrity (control flow
> integrity has a cross-DSO mode, but that also imposes binary size costs
> because a significant amount of metadata needs to appear in each DSO).
> There are ways around at least the first point. For example, the program
> could arrange to use a custom mechanism for binding references between the
> main program and the feature code, such as a table of entry points.
> However, this can impose maintenance costs (for example, the binding
> mechanism can be intrusive in the source code and typically has to be
> maintained manually), and it still does not address the last point.
>
> == Proposed solution =>
> I propose to extend lld so that it can perform the required partitioning
> automatically, given a set of entry points for each part of the program.
> The end product of linking will be a main program (which can be either an
> executable or a DSO) combined with a set of DSOs that must be loaded at
> fixed addresses relative to the base address of the main program. These
> binaries will all share a virtual address space so that they can refer to
> one another directly using PC-relative references or RELATIVE dynamic
> relocations as if they were all statically linked together in the first
> place, rather than via the GOT (or custom GOT-equivalent).
>
> The way that it will work is that we can extend the graph reachability
> algorithm currently implemented by the linker for --gc-sections. The entry
> points for each partition are marked up with a string naming the partition,
> either at the source level with an attribute on the function or global
> variable, or by passing a flag to the compiler (this string becomes the
> partition's soname). These symbols will act as the GC roots for the
> partition and will be exported from its dynsym. Assuming that there is a
> single partition, let's call this set of symbols S2, while all other GC
> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
> call S1. Any sections reachable from S1 are allocated to the main
> partition, while sections reachable only from S2 but not from S1 are
> allocated to S2's partition. We can extend this idea to multiple
loadable
> partitions by defining S3, S4 and so on, but any sections reachable from
> multiple loadable partitions are allocated to the main partition even if
> they aren’t reachable from the main partition.
>
> When assigning input sections to output sections, we take into account, in
> addition to the name of the input section, the partition that the input
> section is assigned to. The SHF_ALLOC output sections are first sorted by
> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
> sections appear last and are not sorted by partition. In the end we are
> left with a collection of output sections that might look like this:
>
> Main partition:
> 0x0000 ELF header, phdrs
> 0x1000 .rodata
> 0x2000 .dynsym
> 0x3000 .text
>
> Loadable partition 1:
> 0x4000 ELF header, phdrs
> 0x5000 .rodata
> 0x6000 .dynsym
> 0x7000 .text
>
> Loadable partition 2:
> 0x8000 ELF header, phdrs
> 0x9000 .rodata
> 0xa000 .dynsym
> 0xb000 .text
>
> Non-SHF_ALLOC sections from all partitions:
> .comment
> .debug_info
> (etc.)
>
> Now linking proceeds mostly as usual, and we’re effectively left with a
> single .so that contains all of the partitions concatenated together. This
> isn’t very useful on its own and is likely to confuse tools (e.g. due to
> the presence of multiple .dynsyms); we can add a feature to llvm-objcopy
> that will extract the individual partitions from the output file
> essentially by taking a slice of the combined .so file. These slices can
> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
> sections are left in place.
>
> The envisaged usage of this feature is as follows:
> $ clang -ffunction-sections -fdata-sections -c main.c # compile the main
> program
> $ clang -ffunction-sections -fdata-sections
> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
> $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
> -Wl,-soname,libmain.so -Wl,--gc-sections
> $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
> $ llvm-objcopy libcombined.so libfeature.so
> --extract-partition=libfeature.so
>
> On Android, the loadable partitions can be loaded with the
> android_dlopen_ext
> <https://developer.android.com/ndk/reference/group/libdl> function
> passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the
> correct address relative to the main partition. Other platforms that wish
> to support this feature will likely either need to add a similar feature to
> their dynamic loader or (in order to support loading the partitions with a
> regular dlopen) define a custom dynamic tag that will cause the dynamic
> loader to first load the main partition and then the loadable partition at
> the correct relative address.
>
> == In more detail =>
> Each loadable partition will require its own sections to support the
> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
> with the main partition. This means that all addresses associated with
> symbols will continue to be fixed.
>
> In order to cause the dynamic loader to reserve address space for the
> loadable partitions so that they can be loaded at the correct address
> later, a PT_LOAD segment is added to the main partition that allocates a
> page of bss at the address one byte past the end of the last address in the
> last partition. In the Android dynamic loader at least, this is enough to
> cause the required space to be reserved. Other platforms would need to
> ensure that their dynamic loader implements similar behaviour.
>
> I haven't thought about how this feature will interact with linker
> scripts. At least to start with we will likely need to forbid using this
> feature together with the PHDRS or SECTIONS linker script directives.
>
> Some sections will need to be present in each partition (e.g. .interp and
> .note sections). Probably the most straightforward way to do this will be
> to cause the linker to create a clone of these sections for each partition.
>
> == Other use cases =>
> An example of another use case for this feature could be an operating
> system API which is exposed across multiple DSOs. Typically these DSOs will
> be implemented using private APIs that are not exposed to the application.
> This feature would allow you to create a common DSO that contains the
> shared code implementing the private APIs (i.e. the main partition),
> together with individual DSOs (i.e. the loadable partitions) that use the
> private APIs and expose the public ones, but without actually exposing the
> private APIs in the dynamic symbol table or paying the binary size cost of
> doing so.
>
> == Prototype =>
> A prototype/proof of concept of this feature has been implemented here:
> https://github.com/pcc/llvm-project/tree/lld-module-symbols
> There is a test app in the test-progs/app directory that demonstrates the
> feature on Android with a simple hello world app (based on
> https://www.hanshq.net/command-line-android.html ). I have successfully
> tested debugging the loadable partition with gdb (e.g. setting breakpoints
> and printing globals), but getting unwinding working will need a bit more
> work.
>
> Note that the feature as exposed by the prototype is different from what
> I'm proposing here, e.g. it uses a linker flag to specify which symbols
go
> in which partitions. I think the best place to specify this information is
> at either the source level or the compiler flag level, so that is what I
> intend to implement.
>
> Thanks,
> --
> --
> Peter
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190226/e4179bee/attachment-0001.html>

Michael Spencer via llvm-dev

2019-Feb-27 02:32 UTC

head link

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

On Tue, Feb 26, 2019 at 5:35 PM Peter Collingbourne via cfe-dev <
cfe-dev at lists.llvm.org> wrote:
> Hi folks,
>
> I'd like to propose adding a feature to ELF lld for automatically
> partitioning a program into multiple binaries. (This will also involve
> adding a feature to clang, so I've cc'd cfe-dev as well.)
>
> == Problem statement =>
> Embedded devices such as cell phones, especially lower end devices, are
> typically highly resource constrained. Users of cell phone applications
> must pay a cost (in terms of download size as well as storage space) for
> all features that the application implements, even for features that are
> only used by a minority of users. Therefore, there is a desire to split
> applications into multiple pieces that can be downloaded independently, so
> that the majority of users only pay the cost of the commonly used features.
> This can technically be achieved using traditional ELF dynamic linking: the
> main part of the program can be compiled as an executable or DSO that
> exports symbols that are then imported by a separate DSO containing the
> part of the program implementing the optional feature. However, this itself
> imposes costs:
> - Each exported symbol by itself imposes additional binary size costs, as
> it requires the name of the symbol and a dynamic symbol table entry to be
> stored in both the exporting and importing DSO, and on the importing side a
> dynamic relocation, a GOT entry and likely a PLT entry must be present.
> These additional costs go some way towards defeating the purpose of
> splitting the program into pieces in the first place, and can also impact
> program startup and overall performance because of the additional
> indirections.
> - It can result in more code needing to appear in the main part of the
> program than necessary. For example, imagine that both the feature and the
> main program make use of a common (statically linked) library, but they
> call different subsets of the functions in that library. With traditional
> ELF linking we are forced to either link and export the entire library from
> the main program (even the functions unused by either part of the program)
> or carefully maintain a list of functions that are used by the other parts
> of the program.
> - Since the linker does not see the whole program at once and links each
> piece independently, a number of link-time optimizations and features stop
> working, such as LTO across partition boundaries, whole-program
> devirtualization and non-cross-DSO control flow integrity (control flow
> integrity has a cross-DSO mode, but that also imposes binary size costs
> because a significant amount of metadata needs to appear in each DSO).
> There are ways around at least the first point. For example, the program
> could arrange to use a custom mechanism for binding references between the
> main program and the feature code, such as a table of entry points.
> However, this can impose maintenance costs (for example, the binding
> mechanism can be intrusive in the source code and typically has to be
> maintained manually), and it still does not address the last point.
>
> == Proposed solution =>
> I propose to extend lld so that it can perform the required partitioning
> automatically, given a set of entry points for each part of the program.
> The end product of linking will be a main program (which can be either an
> executable or a DSO) combined with a set of DSOs that must be loaded at
> fixed addresses relative to the base address of the main program. These
> binaries will all share a virtual address space so that they can refer to
> one another directly using PC-relative references or RELATIVE dynamic
> relocations as if they were all statically linked together in the first
> place, rather than via the GOT (or custom GOT-equivalent).
>
> The way that it will work is that we can extend the graph reachability
> algorithm currently implemented by the linker for --gc-sections. The entry
> points for each partition are marked up with a string naming the partition,
> either at the source level with an attribute on the function or global
> variable, or by passing a flag to the compiler (this string becomes the
> partition's soname). These symbols will act as the GC roots for the
> partition and will be exported from its dynsym. Assuming that there is a
> single partition, let's call this set of symbols S2, while all other GC
> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
> call S1. Any sections reachable from S1 are allocated to the main
> partition, while sections reachable only from S2 but not from S1 are
> allocated to S2's partition. We can extend this idea to multiple
loadable
> partitions by defining S3, S4 and so on, but any sections reachable from
> multiple loadable partitions are allocated to the main partition even if
> they aren’t reachable from the main partition.
>
> When assigning input sections to output sections, we take into account, in
> addition to the name of the input section, the partition that the input
> section is assigned to. The SHF_ALLOC output sections are first sorted by
> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
> sections appear last and are not sorted by partition. In the end we are
> left with a collection of output sections that might look like this:
>
> Main partition:
> 0x0000 ELF header, phdrs
> 0x1000 .rodata
> 0x2000 .dynsym
> 0x3000 .text
>
> Loadable partition 1:
> 0x4000 ELF header, phdrs
> 0x5000 .rodata
> 0x6000 .dynsym
> 0x7000 .text
>
> Loadable partition 2:
> 0x8000 ELF header, phdrs
> 0x9000 .rodata
> 0xa000 .dynsym
> 0xb000 .text
>
> Non-SHF_ALLOC sections from all partitions:
> .comment
> .debug_info
> (etc.)
>
> Now linking proceeds mostly as usual, and we’re effectively left with a
> single .so that contains all of the partitions concatenated together. This
> isn’t very useful on its own and is likely to confuse tools (e.g. due to
> the presence of multiple .dynsyms); we can add a feature to llvm-objcopy
> that will extract the individual partitions from the output file
> essentially by taking a slice of the combined .so file. These slices can
> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
> sections are left in place.
>
> The envisaged usage of this feature is as follows:
> $ clang -ffunction-sections -fdata-sections -c main.c # compile the main
> program
> $ clang -ffunction-sections -fdata-sections
> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
> $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
> -Wl,-soname,libmain.so -Wl,--gc-sections
> $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
> $ llvm-objcopy libcombined.so libfeature.so
> --extract-partition=libfeature.so
>
> On Android, the loadable partitions can be loaded with the
> android_dlopen_ext
> <https://developer.android.com/ndk/reference/group/libdl> function
> passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the
> correct address relative to the main partition. Other platforms that wish
> to support this feature will likely either need to add a similar feature to
> their dynamic loader or (in order to support loading the partitions with a
> regular dlopen) define a custom dynamic tag that will cause the dynamic
> loader to first load the main partition and then the loadable partition at
> the correct relative address.
>
> == In more detail =>
> Each loadable partition will require its own sections to support the
> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
> with the main partition. This means that all addresses associated with
> symbols will continue to be fixed.
>
> In order to cause the dynamic loader to reserve address space for the
> loadable partitions so that they can be loaded at the correct address
> later, a PT_LOAD segment is added to the main partition that allocates a
> page of bss at the address one byte past the end of the last address in the
> last partition. In the Android dynamic loader at least, this is enough to
> cause the required space to be reserved. Other platforms would need to
> ensure that their dynamic loader implements similar behaviour.
>
> I haven't thought about how this feature will interact with linker
> scripts. At least to start with we will likely need to forbid using this
> feature together with the PHDRS or SECTIONS linker script directives.
>
> Some sections will need to be present in each partition (e.g. .interp and
> .note sections). Probably the most straightforward way to do this will be
> to cause the linker to create a clone of these sections for each partition.
>
> == Other use cases =>
> An example of another use case for this feature could be an operating
> system API which is exposed across multiple DSOs. Typically these DSOs will
> be implemented using private APIs that are not exposed to the application.
> This feature would allow you to create a common DSO that contains the
> shared code implementing the private APIs (i.e. the main partition),
> together with individual DSOs (i.e. the loadable partitions) that use the
> private APIs and expose the public ones, but without actually exposing the
> private APIs in the dynamic symbol table or paying the binary size cost of
> doing so.
>
> == Prototype =>
> A prototype/proof of concept of this feature has been implemented here:
> https://github.com/pcc/llvm-project/tree/lld-module-symbols
> There is a test app in the test-progs/app directory that demonstrates the
> feature on Android with a simple hello world app (based on
> https://www.hanshq.net/command-line-android.html ). I have successfully
> tested debugging the loadable partition with gdb (e.g. setting breakpoints
> and printing globals), but getting unwinding working will need a bit more
> work.
>
> Note that the feature as exposed by the prototype is different from what
> I'm proposing here, e.g. it uses a linker flag to specify which symbols
go
> in which partitions. I think the best place to specify this information is
> at either the source level or the compiler flag level, so that is what I
> intend to implement.
>
> Thanks,
> --
> --
> Peter
>
>This is an interesting idea.  What's the behaivor for calling an unloaded
function?  Would it be hard to make that force a load?  An interesting way
to do it would be to map the other partitions without execute privs and
load the partition on a fault, although this would require restricting the
alignment of partitions to the page size.

- Michael Spencer
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190226/690e628d/attachment.html>

Peter Collingbourne via llvm-dev

2019-Feb-27 02:36 UTC

head link

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

On Tue, Feb 26, 2019 at 6:23 PM Jake Ehrlich <jakehehrlich at google.com>
wrote:
> Neat Idea.
>
> >In order to cause the dynamic loader to reserve address space for the
> loadable partitions so that they can be loaded at the correct address
> later, a PT_LOAD segment is added to the main partition that allocates a
> page of bss at the address one byte past the end of the last address in the
> last partition. In the Android dynamic loader at least, this is enough to
> cause the required space to be reserved. Other platforms would need to
> ensure that their dynamic loader implements similar behaviour.
>
> Maybe all partitions can contain the same PT_LOAD segments? or the main
> partition contains all PT_LOAD segments for the entire address space?
>
I don't think this will work. When the main partition is loaded, the
dynamic loader will try to map all of the PT_LOADs, and with this approach
since they won't exist in the main partition (due to the p_offset being out
of bounds), the dynamic loader will likely raise an error or crash. The
per-partition PT_LOADs also give a convenient way for llvm-objcopy to
figure out which section headers need to be copied into each partition.

Another possibility that I considered for reserving the address space was
to extend the p_memsz of the last segment of the main partition to cover
the address space of the other partitions. But this prevents tools such as
unwinders from easily mapping an address onto a DSO by examining PT_LOAD
segments. For example, this is what libunwind does:
https://github.com/llvm/llvm-project/blob/e739ac0e255597d818c907223034ddf3bc18a593/libunwind/src/AddressSpace.hpp#L523

Peter

>
> On Tue, Feb 26, 2019 at 5:35 PM Peter Collingbourne via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi folks,
>>
>> I'd like to propose adding a feature to ELF lld for automatically
>> partitioning a program into multiple binaries. (This will also involve
>> adding a feature to clang, so I've cc'd cfe-dev as well.)
>>
>> == Problem statement =>>
>> Embedded devices such as cell phones, especially lower end devices, are
>> typically highly resource constrained. Users of cell phone applications
>> must pay a cost (in terms of download size as well as storage space)
for
>> all features that the application implements, even for features that
are
>> only used by a minority of users. Therefore, there is a desire to split
>> applications into multiple pieces that can be downloaded independently,
so
>> that the majority of users only pay the cost of the commonly used
features.
>> This can technically be achieved using traditional ELF dynamic linking:
the
>> main part of the program can be compiled as an executable or DSO that
>> exports symbols that are then imported by a separate DSO containing the
>> part of the program implementing the optional feature. However, this
itself
>> imposes costs:
>> - Each exported symbol by itself imposes additional binary size costs,
as
>> it requires the name of the symbol and a dynamic symbol table entry to
be
>> stored in both the exporting and importing DSO, and on the importing
side a
>> dynamic relocation, a GOT entry and likely a PLT entry must be present.
>> These additional costs go some way towards defeating the purpose of
>> splitting the program into pieces in the first place, and can also
impact
>> program startup and overall performance because of the additional
>> indirections.
>> - It can result in more code needing to appear in the main part of the
>> program than necessary. For example, imagine that both the feature and
the
>> main program make use of a common (statically linked) library, but they
>> call different subsets of the functions in that library. With
traditional
>> ELF linking we are forced to either link and export the entire library
from
>> the main program (even the functions unused by either part of the
program)
>> or carefully maintain a list of functions that are used by the other
parts
>> of the program.
>> - Since the linker does not see the whole program at once and links
each
>> piece independently, a number of link-time optimizations and features
stop
>> working, such as LTO across partition boundaries, whole-program
>> devirtualization and non-cross-DSO control flow integrity (control flow
>> integrity has a cross-DSO mode, but that also imposes binary size costs
>> because a significant amount of metadata needs to appear in each DSO).
>> There are ways around at least the first point. For example, the
program
>> could arrange to use a custom mechanism for binding references between
the
>> main program and the feature code, such as a table of entry points.
>> However, this can impose maintenance costs (for example, the binding
>> mechanism can be intrusive in the source code and typically has to be
>> maintained manually), and it still does not address the last point.
>>
>> == Proposed solution =>>
>> I propose to extend lld so that it can perform the required
partitioning
>> automatically, given a set of entry points for each part of the
program.
>> The end product of linking will be a main program (which can be either
an
>> executable or a DSO) combined with a set of DSOs that must be loaded at
>> fixed addresses relative to the base address of the main program. These
>> binaries will all share a virtual address space so that they can refer
to
>> one another directly using PC-relative references or RELATIVE dynamic
>> relocations as if they were all statically linked together in the first
>> place, rather than via the GOT (or custom GOT-equivalent).
>>
>> The way that it will work is that we can extend the graph reachability
>> algorithm currently implemented by the linker for --gc-sections. The
entry
>> points for each partition are marked up with a string naming the
partition,
>> either at the source level with an attribute on the function or global
>> variable, or by passing a flag to the compiler (this string becomes the
>> partition's soname). These symbols will act as the GC roots for the
>> partition and will be exported from its dynsym. Assuming that there is
a
>> single partition, let's call this set of symbols S2, while all
other GC
>> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
>> call S1. Any sections reachable from S1 are allocated to the main
>> partition, while sections reachable only from S2 but not from S1 are
>> allocated to S2's partition. We can extend this idea to multiple
loadable
>> partitions by defining S3, S4 and so on, but any sections reachable
from
>> multiple loadable partitions are allocated to the main partition even
if
>> they aren’t reachable from the main partition.
>>
>> When assigning input sections to output sections, we take into account,
>> in addition to the name of the input section, the partition that the
input
>> section is assigned to. The SHF_ALLOC output sections are first sorted
by
>> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
>> sections appear last and are not sorted by partition. In the end we are
>> left with a collection of output sections that might look like this:
>>
>> Main partition:
>> 0x0000 ELF header, phdrs
>> 0x1000 .rodata
>> 0x2000 .dynsym
>> 0x3000 .text
>>
>> Loadable partition 1:
>> 0x4000 ELF header, phdrs
>> 0x5000 .rodata
>> 0x6000 .dynsym
>> 0x7000 .text
>>
>> Loadable partition 2:
>> 0x8000 ELF header, phdrs
>> 0x9000 .rodata
>> 0xa000 .dynsym
>> 0xb000 .text
>>
>> Non-SHF_ALLOC sections from all partitions:
>> .comment
>> .debug_info
>> (etc.)
>>
>> Now linking proceeds mostly as usual, and we’re effectively left with a
>> single .so that contains all of the partitions concatenated together.
This
>> isn’t very useful on its own and is likely to confuse tools (e.g. due
to
>> the presence of multiple .dynsyms); we can add a feature to
llvm-objcopy
>> that will extract the individual partitions from the output file
>> essentially by taking a slice of the combined .so file. These slices
can
>> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
>> sections are left in place.
>>
>> The envisaged usage of this feature is as follows:
>> $ clang -ffunction-sections -fdata-sections -c main.c # compile the
main
>> program
>> $ clang -ffunction-sections -fdata-sections
>> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
>> $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
>> -Wl,-soname,libmain.so -Wl,--gc-sections
>> $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
>> $ llvm-objcopy libcombined.so libfeature.so
>> --extract-partition=libfeature.so
>>
>> On Android, the loadable partitions can be loaded with the
>> android_dlopen_ext
>> <https://developer.android.com/ndk/reference/group/libdl>
function
>> passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the
>> correct address relative to the main partition. Other platforms that
wish
>> to support this feature will likely either need to add a similar
feature to
>> their dynamic loader or (in order to support loading the partitions
with a
>> regular dlopen) define a custom dynamic tag that will cause the dynamic
>> loader to first load the main partition and then the loadable partition
at
>> the correct relative address.
>>
>> == In more detail =>>
>> Each loadable partition will require its own sections to support the
>> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
>> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
>> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
>> with the main partition. This means that all addresses associated with
>> symbols will continue to be fixed.
>>
>> In order to cause the dynamic loader to reserve address space for the
>> loadable partitions so that they can be loaded at the correct address
>> later, a PT_LOAD segment is added to the main partition that allocates
a
>> page of bss at the address one byte past the end of the last address in
the
>> last partition. In the Android dynamic loader at least, this is enough
to
>> cause the required space to be reserved. Other platforms would need to
>> ensure that their dynamic loader implements similar behaviour.
>>
>> I haven't thought about how this feature will interact with linker
>> scripts. At least to start with we will likely need to forbid using
this
>> feature together with the PHDRS or SECTIONS linker script directives.
>>
>> Some sections will need to be present in each partition (e.g. .interp
and
>> .note sections). Probably the most straightforward way to do this will
be
>> to cause the linker to create a clone of these sections for each
partition.
>>
>> == Other use cases =>>
>> An example of another use case for this feature could be an operating
>> system API which is exposed across multiple DSOs. Typically these DSOs
will
>> be implemented using private APIs that are not exposed to the
application.
>> This feature would allow you to create a common DSO that contains the
>> shared code implementing the private APIs (i.e. the main partition),
>> together with individual DSOs (i.e. the loadable partitions) that use
the
>> private APIs and expose the public ones, but without actually exposing
the
>> private APIs in the dynamic symbol table or paying the binary size cost
of
>> doing so.
>>
>> == Prototype =>>
>> A prototype/proof of concept of this feature has been implemented here:
>> https://github.com/pcc/llvm-project/tree/lld-module-symbols
>> There is a test app in the test-progs/app directory that demonstrates
the
>> feature on Android with a simple hello world app (based on
>> https://www.hanshq.net/command-line-android.html ). I have successfully
>> tested debugging the loadable partition with gdb (e.g. setting
breakpoints
>> and printing globals), but getting unwinding working will need a bit
more
>> work.
>>
>> Note that the feature as exposed by the prototype is different from
what
>> I'm proposing here, e.g. it uses a linker flag to specify which
symbols go
>> in which partitions. I think the best place to specify this information
is
>> at either the source level or the compiler flag level, so that is what
I
>> intend to implement.
>>
>> Thanks,
>> --
>> --
>> Peter
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190226/0b374509/attachment.html>

Eli Friedman via llvm-dev

2019-Feb-27 02:41 UTC

head link

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

This seems like a very complicated approach… do you have some numbers to give
some idea how much of an improvement we’re talking about here over a more
conventional solution involving shared libraries?  Or have you not gotten that
far?

What’s the tradeoff involved in the specific sections you chose to split?  It
seems like it would be possible to, for example, split the GOT, or avoid
splitting the relocation/EH/etc. sections.  Some variation would require
different runtime support, I guess.

It looks like this doesn’t include a proposal for the corresponding LLVM IR
extension?  I think it might be sort of complicated to define correctly…
specifically, in terms of what it means to “use” a function or global from a
different partition (so the program doesn’t try to speculatively access
something which isn’t loaded).  This could come up even without LTO if you have
C++ inline functions, since all functions with weak linkage have to be in the
first partition.  (At least, I think they do, unless you invent a new kind of
“partition” visibility for this.)

-Eli

From: cfe-dev <cfe-dev-bounces at lists.llvm.org> On Behalf Of Peter
Collingbourne via cfe-dev
Sent: Tuesday, February 26, 2019 5:35 PM
To: llvm-dev <llvm-dev at lists.llvm.org>; cfe-dev <cfe-dev at
lists.llvm.org>
Cc: George Rimar <grimar at accesssoftek.com>
Subject: [EXT] [cfe-dev] RFC: Linker feature for automatically partitioning a
program into multiple binaries

Hi folks,

I'd like to propose adding a feature to ELF lld for automatically
partitioning a program into multiple binaries. (This will also involve adding a
feature to clang, so I've cc'd cfe-dev as well.)

== Problem statement =
Embedded devices such as cell phones, especially lower end devices, are
typically highly resource constrained. Users of cell phone applications must pay
a cost (in terms of download size as well as storage space) for all features
that the application implements, even for features that are only used by a
minority of users. Therefore, there is a desire to split applications into
multiple pieces that can be downloaded independently, so that the majority of
users only pay the cost of the commonly used features. This can technically be
achieved using traditional ELF dynamic linking: the main part of the program can
be compiled as an executable or DSO that exports symbols that are then imported
by a separate DSO containing the part of the program implementing the optional
feature. However, this itself imposes costs:
- Each exported symbol by itself imposes additional binary size costs, as it
requires the name of the symbol and a dynamic symbol table entry to be stored in
both the exporting and importing DSO, and on the importing side a dynamic
relocation, a GOT entry and likely a PLT entry must be present. These additional
costs go some way towards defeating the purpose of splitting the program into
pieces in the first place, and can also impact program startup and overall
performance because of the additional indirections.
- It can result in more code needing to appear in the main part of the program
than necessary. For example, imagine that both the feature and the main program
make use of a common (statically linked) library, but they call different
subsets of the functions in that library. With traditional ELF linking we are
forced to either link and export the entire library from the main program (even
the functions unused by either part of the program) or carefully maintain a list
of functions that are used by the other parts of the program.
- Since the linker does not see the whole program at once and links each piece
independently, a number of link-time optimizations and features stop working,
such as LTO across partition boundaries, whole-program devirtualization and
non-cross-DSO control flow integrity (control flow integrity has a cross-DSO
mode, but that also imposes binary size costs because a significant amount of
metadata needs to appear in each DSO).

There are ways around at least the first point. For example, the program could
arrange to use a custom mechanism for binding references between the main
program and the feature code, such as a table of entry points. However, this can
impose maintenance costs (for example, the binding mechanism can be intrusive in
the source code and typically has to be maintained manually), and it still does
not address the last point.

== Proposed solution =
I propose to extend lld so that it can perform the required partitioning
automatically, given a set of entry points for each part of the program. The end
product of linking will be a main program (which can be either an executable or
a DSO) combined with a set of DSOs that must be loaded at fixed addresses
relative to the base address of the main program. These binaries will all share
a virtual address space so that they can refer to one another directly using
PC-relative references or RELATIVE dynamic relocations as if they were all
statically linked together in the first place, rather than via the GOT (or
custom GOT-equivalent).

The way that it will work is that we can extend the graph reachability algorithm
currently implemented by the linker for --gc-sections. The entry points for each
partition are marked up with a string naming the partition, either at the source
level with an attribute on the function or global variable, or by passing a flag
to the compiler (this string becomes the partition's soname). These symbols
will act as the GC roots for the partition and will be exported from its dynsym.
Assuming that there is a single partition, let's call this set of symbols
S2, while all other GC roots (e.g. non-marked-up exported symbols, sections in
.init_array) we call S1. Any sections reachable from S1 are allocated to the
main partition, while sections reachable only from S2 but not from S1 are
allocated to S2's partition. We can extend this idea to multiple loadable
partitions by defining S3, S4 and so on, but any sections reachable from
multiple loadable partitions are allocated to the main partition even if they
aren’t reachable from the main partition.

When assigning input sections to output sections, we take into account, in
addition to the name of the input section, the partition that the input section
is assigned to. The SHF_ALLOC output sections are first sorted by partition, and
then by the usual sorting rules. As usual, non-SHF_ALLOC sections appear last
and are not sorted by partition. In the end we are left with a collection of
output sections that might look like this:

Main partition:
0x0000 ELF header, phdrs
0x1000 .rodata
0x2000 .dynsym
0x3000 .text

Loadable partition 1:
0x4000 ELF header, phdrs
0x5000 .rodata
0x6000 .dynsym
0x7000 .text

Loadable partition 2:
0x8000 ELF header, phdrs
0x9000 .rodata
0xa000 .dynsym
0xb000 .text

Non-SHF_ALLOC sections from all partitions:
.comment
.debug_info
(etc.)

Now linking proceeds mostly as usual, and we’re effectively left with a single
.so that contains all of the partitions concatenated together. This isn’t very
useful on its own and is likely to confuse tools (e.g. due to the presence of
multiple .dynsyms); we can add a feature to llvm-objcopy that will extract the
individual partitions from the output file essentially by taking a slice of the
combined .so file. These slices can also be fed to tools such as debuggers
provided that the non-SHF_ALLOC sections are left in place.

The envisaged usage of this feature is as follows:
$ clang -ffunction-sections -fdata-sections -c main.c # compile the main program
$ clang -ffunction-sections -fdata-sections -fsymbol-partition=libfeature.so -c
feature.c # compile the feature
$ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
-Wl,-soname,libmain.so -Wl,--gc-sections
$ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
$ llvm-objcopy libcombined.so libfeature.so --extract-partition=libfeature.so

On Android, the loadable partitions can be loaded with the
android_dlopen_ext<https://developer.android.com/ndk/reference/group/libdl>
function passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the
correct address relative to the main partition. Other platforms that wish to
support this feature will likely either need to add a similar feature to their
dynamic loader or (in order to support loading the partitions with a regular
dlopen) define a custom dynamic tag that will cause the dynamic loader to first
load the main partition and then the loadable partition at the correct relative
address.

== In more detail =
Each loadable partition will require its own sections to support the dynamic
loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr, .dynsym,
.eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash, .interp,
.rela.dyn, .relr.dyn), but will be able to share a GOT and PLT with the main
partition. This means that all addresses associated with symbols will continue
to be fixed.

In order to cause the dynamic loader to reserve address space for the loadable
partitions so that they can be loaded at the correct address later, a PT_LOAD
segment is added to the main partition that allocates a page of bss at the
address one byte past the end of the last address in the last partition. In the
Android dynamic loader at least, this is enough to cause the required space to
be reserved. Other platforms would need to ensure that their dynamic loader
implements similar behaviour.

I haven't thought about how this feature will interact with linker scripts.
At least to start with we will likely need to forbid using this feature together
with the PHDRS or SECTIONS linker script directives.

Some sections will need to be present in each partition (e.g. .interp and .note
sections). Probably the most straightforward way to do this will be to cause the
linker to create a clone of these sections for each partition.

== Other use cases =
An example of another use case for this feature could be an operating system API
which is exposed across multiple DSOs. Typically these DSOs will be implemented
using private APIs that are not exposed to the application. This feature would
allow you to create a common DSO that contains the shared code implementing the
private APIs (i.e. the main partition), together with individual DSOs (i.e. the
loadable partitions) that use the private APIs and expose the public ones, but
without actually exposing the private APIs in the dynamic symbol table or paying
the binary size cost of doing so.

== Prototype =
A prototype/proof of concept of this feature has been implemented here:
https://github.com/pcc/llvm-project/tree/lld-module-symbols
There is a test app in the test-progs/app directory that demonstrates the
feature on Android with a simple hello world app (based on
https://www.hanshq.net/command-line-android.html ). I have successfully tested
debugging the loadable partition with gdb (e.g. setting breakpoints and printing
globals), but getting unwinding working will need a bit more work.

Note that the feature as exposed by the prototype is different from what I'm
proposing here, e.g. it uses a linker flag to specify which symbols go in which
partitions. I think the best place to specify this information is at either the
source level or the compiler flag level, so that is what I intend to implement.

Thanks,
--
--
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190227/3b46a392/attachment-0001.html>

Peter Collingbourne via llvm-dev

2019-Feb-27 02:48 UTC

head link

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

On Tue, Feb 26, 2019 at 6:32 PM Michael Spencer <bigcheesegs at gmail.com>
wrote:
> On Tue, Feb 26, 2019 at 5:35 PM Peter Collingbourne via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> Hi folks,
>>
>> I'd like to propose adding a feature to ELF lld for automatically
>> partitioning a program into multiple binaries. (This will also involve
>> adding a feature to clang, so I've cc'd cfe-dev as well.)
>>
>> == Problem statement =>>
>> Embedded devices such as cell phones, especially lower end devices, are
>> typically highly resource constrained. Users of cell phone applications
>> must pay a cost (in terms of download size as well as storage space)
for
>> all features that the application implements, even for features that
are
>> only used by a minority of users. Therefore, there is a desire to split
>> applications into multiple pieces that can be downloaded independently,
so
>> that the majority of users only pay the cost of the commonly used
features.
>> This can technically be achieved using traditional ELF dynamic linking:
the
>> main part of the program can be compiled as an executable or DSO that
>> exports symbols that are then imported by a separate DSO containing the
>> part of the program implementing the optional feature. However, this
itself
>> imposes costs:
>> - Each exported symbol by itself imposes additional binary size costs,
as
>> it requires the name of the symbol and a dynamic symbol table entry to
be
>> stored in both the exporting and importing DSO, and on the importing
side a
>> dynamic relocation, a GOT entry and likely a PLT entry must be present.
>> These additional costs go some way towards defeating the purpose of
>> splitting the program into pieces in the first place, and can also
impact
>> program startup and overall performance because of the additional
>> indirections.
>> - It can result in more code needing to appear in the main part of the
>> program than necessary. For example, imagine that both the feature and
the
>> main program make use of a common (statically linked) library, but they
>> call different subsets of the functions in that library. With
traditional
>> ELF linking we are forced to either link and export the entire library
from
>> the main program (even the functions unused by either part of the
program)
>> or carefully maintain a list of functions that are used by the other
parts
>> of the program.
>> - Since the linker does not see the whole program at once and links
each
>> piece independently, a number of link-time optimizations and features
stop
>> working, such as LTO across partition boundaries, whole-program
>> devirtualization and non-cross-DSO control flow integrity (control flow
>> integrity has a cross-DSO mode, but that also imposes binary size costs
>> because a significant amount of metadata needs to appear in each DSO).
>> There are ways around at least the first point. For example, the
program
>> could arrange to use a custom mechanism for binding references between
the
>> main program and the feature code, such as a table of entry points.
>> However, this can impose maintenance costs (for example, the binding
>> mechanism can be intrusive in the source code and typically has to be
>> maintained manually), and it still does not address the last point.
>>
>> == Proposed solution =>>
>> I propose to extend lld so that it can perform the required
partitioning
>> automatically, given a set of entry points for each part of the
program.
>> The end product of linking will be a main program (which can be either
an
>> executable or a DSO) combined with a set of DSOs that must be loaded at
>> fixed addresses relative to the base address of the main program. These
>> binaries will all share a virtual address space so that they can refer
to
>> one another directly using PC-relative references or RELATIVE dynamic
>> relocations as if they were all statically linked together in the first
>> place, rather than via the GOT (or custom GOT-equivalent).
>>
>> The way that it will work is that we can extend the graph reachability
>> algorithm currently implemented by the linker for --gc-sections. The
entry
>> points for each partition are marked up with a string naming the
partition,
>> either at the source level with an attribute on the function or global
>> variable, or by passing a flag to the compiler (this string becomes the
>> partition's soname). These symbols will act as the GC roots for the
>> partition and will be exported from its dynsym. Assuming that there is
a
>> single partition, let's call this set of symbols S2, while all
other GC
>> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
>> call S1. Any sections reachable from S1 are allocated to the main
>> partition, while sections reachable only from S2 but not from S1 are
>> allocated to S2's partition. We can extend this idea to multiple
loadable
>> partitions by defining S3, S4 and so on, but any sections reachable
from
>> multiple loadable partitions are allocated to the main partition even
if
>> they aren’t reachable from the main partition.
>>
>> When assigning input sections to output sections, we take into account,
>> in addition to the name of the input section, the partition that the
input
>> section is assigned to. The SHF_ALLOC output sections are first sorted
by
>> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
>> sections appear last and are not sorted by partition. In the end we are
>> left with a collection of output sections that might look like this:
>>
>> Main partition:
>> 0x0000 ELF header, phdrs
>> 0x1000 .rodata
>> 0x2000 .dynsym
>> 0x3000 .text
>>
>> Loadable partition 1:
>> 0x4000 ELF header, phdrs
>> 0x5000 .rodata
>> 0x6000 .dynsym
>> 0x7000 .text
>>
>> Loadable partition 2:
>> 0x8000 ELF header, phdrs
>> 0x9000 .rodata
>> 0xa000 .dynsym
>> 0xb000 .text
>>
>> Non-SHF_ALLOC sections from all partitions:
>> .comment
>> .debug_info
>> (etc.)
>>
>> Now linking proceeds mostly as usual, and we’re effectively left with a
>> single .so that contains all of the partitions concatenated together.
This
>> isn’t very useful on its own and is likely to confuse tools (e.g. due
to
>> the presence of multiple .dynsyms); we can add a feature to
llvm-objcopy
>> that will extract the individual partitions from the output file
>> essentially by taking a slice of the combined .so file. These slices
can
>> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
>> sections are left in place.
>>
>> The envisaged usage of this feature is as follows:
>> $ clang -ffunction-sections -fdata-sections -c main.c # compile the
main
>> program
>> $ clang -ffunction-sections -fdata-sections
>> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
>> $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
>> -Wl,-soname,libmain.so -Wl,--gc-sections
>> $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
>> $ llvm-objcopy libcombined.so libfeature.so
>> --extract-partition=libfeature.so
>>
>> On Android, the loadable partitions can be loaded with the
>> android_dlopen_ext
>> <https://developer.android.com/ndk/reference/group/libdl>
function
>> passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the
>> correct address relative to the main partition. Other platforms that
wish
>> to support this feature will likely either need to add a similar
feature to
>> their dynamic loader or (in order to support loading the partitions
with a
>> regular dlopen) define a custom dynamic tag that will cause the dynamic
>> loader to first load the main partition and then the loadable partition
at
>> the correct relative address.
>>
>> == In more detail =>>
>> Each loadable partition will require its own sections to support the
>> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
>> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
>> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
>> with the main partition. This means that all addresses associated with
>> symbols will continue to be fixed.
>>
>> In order to cause the dynamic loader to reserve address space for the
>> loadable partitions so that they can be loaded at the correct address
>> later, a PT_LOAD segment is added to the main partition that allocates
a
>> page of bss at the address one byte past the end of the last address in
the
>> last partition. In the Android dynamic loader at least, this is enough
to
>> cause the required space to be reserved. Other platforms would need to
>> ensure that their dynamic loader implements similar behaviour.
>>
>> I haven't thought about how this feature will interact with linker
>> scripts. At least to start with we will likely need to forbid using
this
>> feature together with the PHDRS or SECTIONS linker script directives.
>>
>> Some sections will need to be present in each partition (e.g. .interp
and
>> .note sections). Probably the most straightforward way to do this will
be
>> to cause the linker to create a clone of these sections for each
partition.
>>
>> == Other use cases =>>
>> An example of another use case for this feature could be an operating
>> system API which is exposed across multiple DSOs. Typically these DSOs
will
>> be implemented using private APIs that are not exposed to the
application.
>> This feature would allow you to create a common DSO that contains the
>> shared code implementing the private APIs (i.e. the main partition),
>> together with individual DSOs (i.e. the loadable partitions) that use
the
>> private APIs and expose the public ones, but without actually exposing
the
>> private APIs in the dynamic symbol table or paying the binary size cost
of
>> doing so.
>>
>> == Prototype =>>
>> A prototype/proof of concept of this feature has been implemented here:
>> https://github.com/pcc/llvm-project/tree/lld-module-symbols
>> There is a test app in the test-progs/app directory that demonstrates
the
>> feature on Android with a simple hello world app (based on
>> https://www.hanshq.net/command-line-android.html ). I have successfully
>> tested debugging the loadable partition with gdb (e.g. setting
breakpoints
>> and printing globals), but getting unwinding working will need a bit
more
>> work.
>>
>> Note that the feature as exposed by the prototype is different from
what
>> I'm proposing here, e.g. it uses a linker flag to specify which
symbols go
>> in which partitions. I think the best place to specify this information
is
>> at either the source level or the compiler flag level, so that is what
I
>> intend to implement.
>>
>> Thanks,
>> --
>> --
>> Peter
>>
>>
> This is an interesting idea.  What's the behaivor for calling an
unloaded
> function?  Would it be hard to make that force a load?  An interesting way
> to do it would be to map the other partitions without execute privs and
> load the partition on a fault, although this would require restricting the
> alignment of partitions to the page size.
>
It may be possible with a sophisticated runtime. Right now the reserved
space is mapped without privs so you would SIGSEGV when calling the
function. But you could imagine a signal handler that responds to SIGSEGV
by mapping the other partition. However, if you're in a position to do
that, I don't see a huge advantage to that over just mapping the other
partition and letting the system page fault it in. The main advantage is
where the machine running the program is storage or bandwidth constrained,
requiring you to do something complicated (such as downloading the other
partition from the internet) in order to start using it, and that likely
couldn't be done easily in the context of a signal handler.

But note that in general, with this scheme it would not be possible to call
an unloaded function directly, because calling the function would create a
reference to the function via a relocation, which would cause it to be
moved into either the same partition as the caller or the main partition,
guaranteeing its availability. The intended usage mode is that the main
partition would access the other partitions exclusively via dlopen/dlsym.

Thanks,
-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190226/3bd077b4/attachment.html>

Peter Collingbourne via llvm-dev

2019-Feb-27 03:47 UTC

head link

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

On Tue, Feb 26, 2019 at 6:41 PM Eli Friedman <efriedma at quicinc.com>
wrote:
> This seems like a very complicated approach… do you have some numbers to
> give some idea how much of an improvement we’re talking about here over a
> more conventional solution involving shared libraries?  Or have you not
> gotten that far?
>
I can talk to my internal customer to see what kind of overhead they were
seeing. But I do know that at the start of the project they did evaluate
using regular dynamic linking for the feature partitions, and that was
quickly rejected in favour of other approaches due to the code size and
maintenance overhead. And with control flow integrity the binary size of
the cross-DSO metadata dwarfed the binary size savings that they were
hoping to gain by splitting their program in two.

Furthermore, there are things that you simply cannot do with a more
conventional approach, such as optimizations relying on whole-program
information (like whole-program devirtualization, which helps significantly
in my customer's program).

What’s the tradeoff involved in the specific sections you chose to
split?> It seems like it would be possible to, for example, split the GOT, or avoid
> splitting the relocation/EH/etc. sections.  Some variation would require
> different runtime support, I guess.
>
We could certainly consider having multiple GOTs which are allocated to
partitions in the same way as sections are. This might be useful if for
example one of the partitions references a DSO that is unused by the main
program and we need to avoid having the main program depend on the DSO. But
I consider this an optimization over the proposed approach and not
something that would be strictly required for correctness. I chose to omit
this for now for the sake of simplicity and because my customer does not
require it for now.

I think we need to split the dynamic relocation section because otherwise
the dynamic loader will try to relocate the unreadable memory of the other
partitions and cause a SIGSEGV. Similarly, we need to split the EH sections
because unwinders will generally expect to be able to find the unwind info
for a function by enumerating PT_LOADs to map an address onto a DSO and
then using that DSO's PT_ARM_EXIDX/PT_GNU_EH_FRAME to find the unwind info.
See for example what libunwind does:
https://github.com/llvm/llvm-project/blob/e739ac0e255597d818c907223034ddf3bc18a593/libunwind/src/AddressSpace.hpp#L523

As you point out, the latter part could vary based on the runtime, but I
don't see a strong reason to do it another way.

> It looks like this doesn’t include a proposal for the corresponding LLVM
> IR extension?  I think it might be sort of complicated to define correctly…
> specifically, in terms of what it means to “use” a function or global from
> a different partition (so the program doesn’t try to speculatively access
> something which isn’t loaded).  This could come up even without LTO if you
> have C++ inline functions, since all functions with weak linkage have to be
> in the first partition.  (At least, I think they do, unless you invent a
> new kind of “partition” visibility for this.)
>
The idea here is that for code to "use" a function or global is
exactly the
same thing as having a relocation pointing to it. This is the same
principle that is used to implement --gc-sections.

So for a program to end up accessing a section (speculatively or otherwise)
there needs to be a chain of relocations referring to it from the entry
points. That would force the section into either the main partition or the
same partition as the referent.

Another way to think about it is: when I load the main partition into
memory, I have loaded all code that is reachable from the main partition's
entry points. Now I dynamically load a feature partition. I've now loaded
all code that is reachable from the combination of the main partition and
the feature partition's entry points. That's pretty much the same thing
as
having first loaded a conventional ELF DSO linked with --gc-sections with
just the main partition's entry points, and then replacing it with a second
DSO linked with --gc-sections with the main partition + feature partition's
entry points, except that none of the addresses in the main partition
happen to have changed. So if --gc-sections works, this should work too.

You might be wondering: what happens if I directly reference one of the
feature partition's entry points from the main partition? Well, something
interesting will happen. The feature partition's dynamic symbol table will
contain an entry for the entry point, but the entry's address will point
inside the main partition. This should work out just fine because the main
partition is guaranteed to be loaded if the feature partition is also
loaded. (Which is the same reason why direct pc-relative references from
the feature partition to the main partition will also work.)

I don't think any significant IR extensions are necessary here, except
perhaps for the part involving attaching the -fsymbol-partition names to
globals, but I think that part is mostly trivial and it would probably end
up looking like the custom section name field.

I'm not sure I understand how weak linkage is impacted here. With this
nothing special happens inside the linker until we start handling
--gc-sections, and by that time weak/strong resolution has already
happened. In ELF, dynamic loaders do not care about symbol bindings (except
for weak undefined symbols), so we get the same result whether the symbols
are weak or not.

Peter
>
>
> -Eli
>
>
>
> *From:* cfe-dev <cfe-dev-bounces at lists.llvm.org> *On Behalf Of
*Peter
> Collingbourne via cfe-dev
> *Sent:* Tuesday, February 26, 2019 5:35 PM
> *To:* llvm-dev <llvm-dev at lists.llvm.org>; cfe-dev <cfe-dev at
lists.llvm.org>
> *Cc:* George Rimar <grimar at accesssoftek.com>
> *Subject:* [EXT] [cfe-dev] RFC: Linker feature for automatically
> partitioning a program into multiple binaries
>
>
>
> Hi folks,
>
>
>
> I'd like to propose adding a feature to ELF lld for automatically
> partitioning a program into multiple binaries. (This will also involve
> adding a feature to clang, so I've cc'd cfe-dev as well.)
>
>
>
> == Problem statement =>
>
>
> Embedded devices such as cell phones, especially lower end devices, are
> typically highly resource constrained. Users of cell phone applications
> must pay a cost (in terms of download size as well as storage space) for
> all features that the application implements, even for features that are
> only used by a minority of users. Therefore, there is a desire to split
> applications into multiple pieces that can be downloaded independently, so
> that the majority of users only pay the cost of the commonly used features.
> This can technically be achieved using traditional ELF dynamic linking: the
> main part of the program can be compiled as an executable or DSO that
> exports symbols that are then imported by a separate DSO containing the
> part of the program implementing the optional feature. However, this itself
> imposes costs:
>
> - Each exported symbol by itself imposes additional binary size costs, as
> it requires the name of the symbol and a dynamic symbol table entry to be
> stored in both the exporting and importing DSO, and on the importing side a
> dynamic relocation, a GOT entry and likely a PLT entry must be present.
> These additional costs go some way towards defeating the purpose of
> splitting the program into pieces in the first place, and can also impact
> program startup and overall performance because of the additional
> indirections.
>
> - It can result in more code needing to appear in the main part of the
> program than necessary. For example, imagine that both the feature and the
> main program make use of a common (statically linked) library, but they
> call different subsets of the functions in that library. With traditional
> ELF linking we are forced to either link and export the entire library from
> the main program (even the functions unused by either part of the program)
> or carefully maintain a list of functions that are used by the other parts
> of the program.
>
> - Since the linker does not see the whole program at once and links each
> piece independently, a number of link-time optimizations and features stop
> working, such as LTO across partition boundaries, whole-program
> devirtualization and non-cross-DSO control flow integrity (control flow
> integrity has a cross-DSO mode, but that also imposes binary size costs
> because a significant amount of metadata needs to appear in each DSO).
>
>
>
> There are ways around at least the first point. For example, the program
> could arrange to use a custom mechanism for binding references between the
> main program and the feature code, such as a table of entry points.
> However, this can impose maintenance costs (for example, the binding
> mechanism can be intrusive in the source code and typically has to be
> maintained manually), and it still does not address the last point.
>
>
>
> == Proposed solution =>
>
>
> I propose to extend lld so that it can perform the required partitioning
> automatically, given a set of entry points for each part of the program.
> The end product of linking will be a main program (which can be either an
> executable or a DSO) combined with a set of DSOs that must be loaded at
> fixed addresses relative to the base address of the main program. These
> binaries will all share a virtual address space so that they can refer to
> one another directly using PC-relative references or RELATIVE dynamic
> relocations as if they were all statically linked together in the first
> place, rather than via the GOT (or custom GOT-equivalent).
>
>
>
> The way that it will work is that we can extend the graph reachability
> algorithm currently implemented by the linker for --gc-sections. The entry
> points for each partition are marked up with a string naming the partition,
> either at the source level with an attribute on the function or global
> variable, or by passing a flag to the compiler (this string becomes the
> partition's soname). These symbols will act as the GC roots for the
> partition and will be exported from its dynsym. Assuming that there is a
> single partition, let's call this set of symbols S2, while all other GC
> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
> call S1. Any sections reachable from S1 are allocated to the main
> partition, while sections reachable only from S2 but not from S1 are
> allocated to S2's partition. We can extend this idea to multiple
loadable
> partitions by defining S3, S4 and so on, but any sections reachable from
> multiple loadable partitions are allocated to the main partition even if
> they aren’t reachable from the main partition.
>
>
>
> When assigning input sections to output sections, we take into account, in
> addition to the name of the input section, the partition that the input
> section is assigned to. The SHF_ALLOC output sections are first sorted by
> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
> sections appear last and are not sorted by partition. In the end we are
> left with a collection of output sections that might look like this:
>
>
>
> Main partition:
>
> 0x0000 ELF header, phdrs
>
> 0x1000 .rodata
>
> 0x2000 .dynsym
>
> 0x3000 .text
>
>
>
> Loadable partition 1:
>
> 0x4000 ELF header, phdrs
>
> 0x5000 .rodata
>
> 0x6000 .dynsym
>
> 0x7000 .text
>
>
>
> Loadable partition 2:
>
> 0x8000 ELF header, phdrs
>
> 0x9000 .rodata
>
> 0xa000 .dynsym
>
> 0xb000 .text
>
>
>
> Non-SHF_ALLOC sections from all partitions:
>
> .comment
>
> .debug_info
>
> (etc.)
>
>
>
> Now linking proceeds mostly as usual, and we’re effectively left with a
> single .so that contains all of the partitions concatenated together. This
> isn’t very useful on its own and is likely to confuse tools (e.g. due to
> the presence of multiple .dynsyms); we can add a feature to llvm-objcopy
> that will extract the individual partitions from the output file
> essentially by taking a slice of the combined .so file. These slices can
> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
> sections are left in place.
>
>
>
> The envisaged usage of this feature is as follows:
>
> $ clang -ffunction-sections -fdata-sections -c main.c # compile the main
> program
>
> $ clang -ffunction-sections -fdata-sections
> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
>
> $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
> -Wl,-soname,libmain.so -Wl,--gc-sections
>
> $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
>
> $ llvm-objcopy libcombined.so libfeature.so
> --extract-partition=libfeature.so
>
>
>
> On Android, the loadable partitions can be loaded with the
> android_dlopen_ext
> <https://developer.android.com/ndk/reference/group/libdl> function
> passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the
> correct address relative to the main partition. Other platforms that wish
> to support this feature will likely either need to add a similar feature to
> their dynamic loader or (in order to support loading the partitions with a
> regular dlopen) define a custom dynamic tag that will cause the dynamic
> loader to first load the main partition and then the loadable partition at
> the correct relative address.
>
>
>
> == In more detail =>
>
>
> Each loadable partition will require its own sections to support the
> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
> with the main partition. This means that all addresses associated with
> symbols will continue to be fixed.
>
>
>
> In order to cause the dynamic loader to reserve address space for the
> loadable partitions so that they can be loaded at the correct address
> later, a PT_LOAD segment is added to the main partition that allocates a
> page of bss at the address one byte past the end of the last address in the
> last partition. In the Android dynamic loader at least, this is enough to
> cause the required space to be reserved. Other platforms would need to
> ensure that their dynamic loader implements similar behaviour.
>
>
>
> I haven't thought about how this feature will interact with linker
> scripts. At least to start with we will likely need to forbid using this
> feature together with the PHDRS or SECTIONS linker script directives.
>
>
>
> Some sections will need to be present in each partition (e.g. .interp and
> .note sections). Probably the most straightforward way to do this will be
> to cause the linker to create a clone of these sections for each partition.
>
>
>
> == Other use cases =>
>
>
> An example of another use case for this feature could be an operating
> system API which is exposed across multiple DSOs. Typically these DSOs will
> be implemented using private APIs that are not exposed to the application.
> This feature would allow you to create a common DSO that contains the
> shared code implementing the private APIs (i.e. the main partition),
> together with individual DSOs (i.e. the loadable partitions) that use the
> private APIs and expose the public ones, but without actually exposing the
> private APIs in the dynamic symbol table or paying the binary size cost of
> doing so.
>
>
>
> == Prototype =>
>
>
> A prototype/proof of concept of this feature has been implemented here:
> https://github.com/pcc/llvm-project/tree/lld-module-symbols
>
> There is a test app in the test-progs/app directory that demonstrates the
> feature on Android with a simple hello world app (based on
> https://www.hanshq.net/command-line-android.html ). I have successfully
> tested debugging the loadable partition with gdb (e.g. setting breakpoints
> and printing globals), but getting unwinding working will need a bit more
> work.
>
>
>
> Note that the feature as exposed by the prototype is different from what
> I'm proposing here, e.g. it uses a linker flag to specify which symbols
go
> in which partitions. I think the best place to specify this information is
> at either the source level or the compiler flag level, so that is what I
> intend to implement.
>
>
>
> Thanks,
>
> --
>
> --
>
> Peter
>

-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190226/1addea2e/attachment.html>

James Y Knight via llvm-dev

2019-Feb-27 04:37 UTC

head link

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

I've long wanted a related-but-different form of this sort of thing.

What I want is to be able to statically link _independent_ binaries (either
executables or shared libraries), which happen to depend on a lot of the
same code. And, then, to save space by reducing the amount of duplicated
code between them.

The way I imagined it is that it if I link two programs, the linker would
emit three resulting files: Program 1, Program 2, and a shared-library
containing the common code.

As input, I'd give the linker, separately, the lists of objects/libraries
to go into Program 1, and those to go into Program 2. It would do its
normal static archive search/object-file-selection process separately for
each list. After collecting the object file lists, any objects that ended
up used both in #1 and #2 would go into the shared-library, while the
others would go into their own separate programs.

(Note that this then causes static initializers to end up being
run/included in the expected programs in the same way as with plain static
linking.)

Finally, gc-sections should do its graph coloring starting only from the
entry-points of the two programs, but also coloring through into the
automatic shared library. In the end, the shared library should only
contain functions that are used by at least one of the programs. (As I
described it, this mechanism would leave functions used only by one program
that came from an object file used by both programs in the shared-lib.
Alternatively, the graph coloring could possibly determine the target
location of the function, too)

Without any new loader hacks, you'd still have some additional overhead,
but since the symbol names in the automatic shared library are irrelevant
(other than needing to be unique), they can be shortened to reduce the
string-table overhead and reduce the risk of external-usage.

In the end, you'd have two completely normal binaries, which use a normal
shared object -- but the shared lib was auto-generated, with meaningless
private symbols, only exporting the minimum set of symbols it needs to in
order to let the dynamic loader load it into the exact programs it was
compiled for.



On Tue, Feb 26, 2019 at 8:35 PM Peter Collingbourne via cfe-dev <
cfe-dev at lists.llvm.org> wrote:
> Hi folks,
>
> I'd like to propose adding a feature to ELF lld for automatically
> partitioning a program into multiple binaries. (This will also involve
> adding a feature to clang, so I've cc'd cfe-dev as well.)
>
> == Problem statement =>
> Embedded devices such as cell phones, especially lower end devices, are
> typically highly resource constrained. Users of cell phone applications
> must pay a cost (in terms of download size as well as storage space) for
> all features that the application implements, even for features that are
> only used by a minority of users. Therefore, there is a desire to split
> applications into multiple pieces that can be downloaded independently, so
> that the majority of users only pay the cost of the commonly used features.
> This can technically be achieved using traditional ELF dynamic linking: the
> main part of the program can be compiled as an executable or DSO that
> exports symbols that are then imported by a separate DSO containing the
> part of the program implementing the optional feature. However, this itself
> imposes costs:
> - Each exported symbol by itself imposes additional binary size costs, as
> it requires the name of the symbol and a dynamic symbol table entry to be
> stored in both the exporting and importing DSO, and on the importing side a
> dynamic relocation, a GOT entry and likely a PLT entry must be present.
> These additional costs go some way towards defeating the purpose of
> splitting the program into pieces in the first place, and can also impact
> program startup and overall performance because of the additional
> indirections.
> - It can result in more code needing to appear in the main part of the
> program than necessary. For example, imagine that both the feature and the
> main program make use of a common (statically linked) library, but they
> call different subsets of the functions in that library. With traditional
> ELF linking we are forced to either link and export the entire library from
> the main program (even the functions unused by either part of the program)
> or carefully maintain a list of functions that are used by the other parts
> of the program.
> - Since the linker does not see the whole program at once and links each
> piece independently, a number of link-time optimizations and features stop
> working, such as LTO across partition boundaries, whole-program
> devirtualization and non-cross-DSO control flow integrity (control flow
> integrity has a cross-DSO mode, but that also imposes binary size costs
> because a significant amount of metadata needs to appear in each DSO).
> There are ways around at least the first point. For example, the program
> could arrange to use a custom mechanism for binding references between the
> main program and the feature code, such as a table of entry points.
> However, this can impose maintenance costs (for example, the binding
> mechanism can be intrusive in the source code and typically has to be
> maintained manually), and it still does not address the last point.
>
> == Proposed solution =>
> I propose to extend lld so that it can perform the required partitioning
> automatically, given a set of entry points for each part of the program.
> The end product of linking will be a main program (which can be either an
> executable or a DSO) combined with a set of DSOs that must be loaded at
> fixed addresses relative to the base address of the main program. These
> binaries will all share a virtual address space so that they can refer to
> one another directly using PC-relative references or RELATIVE dynamic
> relocations as if they were all statically linked together in the first
> place, rather than via the GOT (or custom GOT-equivalent).
>
> The way that it will work is that we can extend the graph reachability
> algorithm currently implemented by the linker for --gc-sections. The entry
> points for each partition are marked up with a string naming the partition,
> either at the source level with an attribute on the function or global
> variable, or by passing a flag to the compiler (this string becomes the
> partition's soname). These symbols will act as the GC roots for the
> partition and will be exported from its dynsym. Assuming that there is a
> single partition, let's call this set of symbols S2, while all other GC
> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
> call S1. Any sections reachable from S1 are allocated to the main
> partition, while sections reachable only from S2 but not from S1 are
> allocated to S2's partition. We can extend this idea to multiple
loadable
> partitions by defining S3, S4 and so on, but any sections reachable from
> multiple loadable partitions are allocated to the main partition even if
> they aren’t reachable from the main partition.
>
> When assigning input sections to output sections, we take into account, in
> addition to the name of the input section, the partition that the input
> section is assigned to. The SHF_ALLOC output sections are first sorted by
> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
> sections appear last and are not sorted by partition. In the end we are
> left with a collection of output sections that might look like this:
>
> Main partition:
> 0x0000 ELF header, phdrs
> 0x1000 .rodata
> 0x2000 .dynsym
> 0x3000 .text
>
> Loadable partition 1:
> 0x4000 ELF header, phdrs
> 0x5000 .rodata
> 0x6000 .dynsym
> 0x7000 .text
>
> Loadable partition 2:
> 0x8000 ELF header, phdrs
> 0x9000 .rodata
> 0xa000 .dynsym
> 0xb000 .text
>
> Non-SHF_ALLOC sections from all partitions:
> .comment
> .debug_info
> (etc.)
>
> Now linking proceeds mostly as usual, and we’re effectively left with a
> single .so that contains all of the partitions concatenated together. This
> isn’t very useful on its own and is likely to confuse tools (e.g. due to
> the presence of multiple .dynsyms); we can add a feature to llvm-objcopy
> that will extract the individual partitions from the output file
> essentially by taking a slice of the combined .so file. These slices can
> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
> sections are left in place.
>
> The envisaged usage of this feature is as follows:
> $ clang -ffunction-sections -fdata-sections -c main.c # compile the main
> program
> $ clang -ffunction-sections -fdata-sections
> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
> $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
> -Wl,-soname,libmain.so -Wl,--gc-sections
> $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
> $ llvm-objcopy libcombined.so libfeature.so
> --extract-partition=libfeature.so
>
> On Android, the loadable partitions can be loaded with the
> android_dlopen_ext
> <https://developer.android.com/ndk/reference/group/libdl> function
> passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the
> correct address relative to the main partition. Other platforms that wish
> to support this feature will likely either need to add a similar feature to
> their dynamic loader or (in order to support loading the partitions with a
> regular dlopen) define a custom dynamic tag that will cause the dynamic
> loader to first load the main partition and then the loadable partition at
> the correct relative address.
>
> == In more detail =>
> Each loadable partition will require its own sections to support the
> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
> with the main partition. This means that all addresses associated with
> symbols will continue to be fixed.
>
> In order to cause the dynamic loader to reserve address space for the
> loadable partitions so that they can be loaded at the correct address
> later, a PT_LOAD segment is added to the main partition that allocates a
> page of bss at the address one byte past the end of the last address in the
> last partition. In the Android dynamic loader at least, this is enough to
> cause the required space to be reserved. Other platforms would need to
> ensure that their dynamic loader implements similar behaviour.
>
> I haven't thought about how this feature will interact with linker
> scripts. At least to start with we will likely need to forbid using this
> feature together with the PHDRS or SECTIONS linker script directives.
>
> Some sections will need to be present in each partition (e.g. .interp and
> .note sections). Probably the most straightforward way to do this will be
> to cause the linker to create a clone of these sections for each partition.
>
> == Other use cases =>
> An example of another use case for this feature could be an operating
> system API which is exposed across multiple DSOs. Typically these DSOs will
> be implemented using private APIs that are not exposed to the application.
> This feature would allow you to create a common DSO that contains the
> shared code implementing the private APIs (i.e. the main partition),
> together with individual DSOs (i.e. the loadable partitions) that use the
> private APIs and expose the public ones, but without actually exposing the
> private APIs in the dynamic symbol table or paying the binary size cost of
> doing so.
>
> == Prototype =>
> A prototype/proof of concept of this feature has been implemented here:
> https://github.com/pcc/llvm-project/tree/lld-module-symbols
> There is a test app in the test-progs/app directory that demonstrates the
> feature on Android with a simple hello world app (based on
> https://www.hanshq.net/command-line-android.html ). I have successfully
> tested debugging the loadable partition with gdb (e.g. setting breakpoints
> and printing globals), but getting unwinding working will need a bit more
> work.
>
> Note that the feature as exposed by the prototype is different from what
> I'm proposing here, e.g. it uses a linker flag to specify which symbols
go
> in which partitions. I think the best place to specify this information is
> at either the source level or the compiler flag level, so that is what I
> intend to implement.
>
> Thanks,
> --
> --
> Peter
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190226/ad8446f9/attachment-0001.html>

Peter Collingbourne via llvm-dev

2019-Feb-27 05:45 UTC

head link

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

Interesting. It sounds like you basically want to be able to do automatic
busyboxing.

Maybe you can achieve something like this with my approach by reversing the
role of the library and the binary. You could have the main program do
something like:

int main(int argc, char **argv) {
  void *p = dlopen("lib" + argv[0] + ".so"); // not quite
right but you get
the idea
  void (*f)(int, char **) = dlsym(p, argv[0] + "_main");
  return f(argc, argv);
}

And then rename the programs' main functions to foo_main, bar_main etc.,
and have them be exported from partitions named libfoo.so, libbar.so etc.
Now all the single-program code would end up in the partitions, the shared
code would end up in the main binary and could be referred to directly from
the partitions. But this would still rely on loader support of course and
all of the static initialization would be moved to the binary (depending on
your program, maybe this isn't a huge deal).

It might be nice to avoid needing the loader change, but that would mean
that you're now dealing with multiple address spaces (and depending on how
the programs look, multiple symbol tables), which makes things
significantly more complicated in the linker. There's also a new constraint
in the --gc-sections graph coloring because without further changes it's
now possible for --gc-sections to split an object file between address
spaces (i.e. between DSOs), and most of the time an object file will assume
that it can use direct relocations to refer to its own sections, which
isn't possible with a split object file. So --gc-sections would probably
need to work on a per object file basis.

Peter

On Tue, Feb 26, 2019 at 8:37 PM James Y Knight <jyknight at google.com>
wrote:
> I've long wanted a related-but-different form of this sort of thing.
>
> What I want is to be able to statically link _independent_ binaries
> (either executables or shared libraries), which happen to depend on a lot
> of the same code. And, then, to save space by reducing the amount of
> duplicated code between them.
>
> The way I imagined it is that it if I link two programs, the linker would
> emit three resulting files: Program 1, Program 2, and a shared-library
> containing the common code.
>
> As input, I'd give the linker, separately, the lists of
objects/libraries
> to go into Program 1, and those to go into Program 2. It would do its
> normal static archive search/object-file-selection process separately for
> each list. After collecting the object file lists, any objects that ended
> up used both in #1 and #2 would go into the shared-library, while the
> others would go into their own separate programs.
>
> (Note that this then causes static initializers to end up being
> run/included in the expected programs in the same way as with plain static
> linking.)
>
> Finally, gc-sections should do its graph coloring starting only from the
> entry-points of the two programs, but also coloring through into the
> automatic shared library. In the end, the shared library should only
> contain functions that are used by at least one of the programs. (As I
> described it, this mechanism would leave functions used only by one program
> that came from an object file used by both programs in the shared-lib.
> Alternatively, the graph coloring could possibly determine the target
> location of the function, too)
>
> Without any new loader hacks, you'd still have some additional
overhead,
> but since the symbol names in the automatic shared library are irrelevant
> (other than needing to be unique), they can be shortened to reduce the
> string-table overhead and reduce the risk of external-usage.
>
> In the end, you'd have two completely normal binaries, which use a
normal
> shared object -- but the shared lib was auto-generated, with meaningless
> private symbols, only exporting the minimum set of symbols it needs to in
> order to let the dynamic loader load it into the exact programs it was
> compiled for.
>
>
>
> On Tue, Feb 26, 2019 at 8:35 PM Peter Collingbourne via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> Hi folks,
>>
>> I'd like to propose adding a feature to ELF lld for automatically
>> partitioning a program into multiple binaries. (This will also involve
>> adding a feature to clang, so I've cc'd cfe-dev as well.)
>>
>> == Problem statement =>>
>> Embedded devices such as cell phones, especially lower end devices, are
>> typically highly resource constrained. Users of cell phone applications
>> must pay a cost (in terms of download size as well as storage space)
for
>> all features that the application implements, even for features that
are
>> only used by a minority of users. Therefore, there is a desire to split
>> applications into multiple pieces that can be downloaded independently,
so
>> that the majority of users only pay the cost of the commonly used
features.
>> This can technically be achieved using traditional ELF dynamic linking:
the
>> main part of the program can be compiled as an executable or DSO that
>> exports symbols that are then imported by a separate DSO containing the
>> part of the program implementing the optional feature. However, this
itself
>> imposes costs:
>> - Each exported symbol by itself imposes additional binary size costs,
as
>> it requires the name of the symbol and a dynamic symbol table entry to
be
>> stored in both the exporting and importing DSO, and on the importing
side a
>> dynamic relocation, a GOT entry and likely a PLT entry must be present.
>> These additional costs go some way towards defeating the purpose of
>> splitting the program into pieces in the first place, and can also
impact
>> program startup and overall performance because of the additional
>> indirections.
>> - It can result in more code needing to appear in the main part of the
>> program than necessary. For example, imagine that both the feature and
the
>> main program make use of a common (statically linked) library, but they
>> call different subsets of the functions in that library. With
traditional
>> ELF linking we are forced to either link and export the entire library
from
>> the main program (even the functions unused by either part of the
program)
>> or carefully maintain a list of functions that are used by the other
parts
>> of the program.
>> - Since the linker does not see the whole program at once and links
each
>> piece independently, a number of link-time optimizations and features
stop
>> working, such as LTO across partition boundaries, whole-program
>> devirtualization and non-cross-DSO control flow integrity (control flow
>> integrity has a cross-DSO mode, but that also imposes binary size costs
>> because a significant amount of metadata needs to appear in each DSO).
>> There are ways around at least the first point. For example, the
program
>> could arrange to use a custom mechanism for binding references between
the
>> main program and the feature code, such as a table of entry points.
>> However, this can impose maintenance costs (for example, the binding
>> mechanism can be intrusive in the source code and typically has to be
>> maintained manually), and it still does not address the last point.
>>
>> == Proposed solution =>>
>> I propose to extend lld so that it can perform the required
partitioning
>> automatically, given a set of entry points for each part of the
program.
>> The end product of linking will be a main program (which can be either
an
>> executable or a DSO) combined with a set of DSOs that must be loaded at
>> fixed addresses relative to the base address of the main program. These
>> binaries will all share a virtual address space so that they can refer
to
>> one another directly using PC-relative references or RELATIVE dynamic
>> relocations as if they were all statically linked together in the first
>> place, rather than via the GOT (or custom GOT-equivalent).
>>
>> The way that it will work is that we can extend the graph reachability
>> algorithm currently implemented by the linker for --gc-sections. The
entry
>> points for each partition are marked up with a string naming the
partition,
>> either at the source level with an attribute on the function or global
>> variable, or by passing a flag to the compiler (this string becomes the
>> partition's soname). These symbols will act as the GC roots for the
>> partition and will be exported from its dynsym. Assuming that there is
a
>> single partition, let's call this set of symbols S2, while all
other GC
>> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
>> call S1. Any sections reachable from S1 are allocated to the main
>> partition, while sections reachable only from S2 but not from S1 are
>> allocated to S2's partition. We can extend this idea to multiple
loadable
>> partitions by defining S3, S4 and so on, but any sections reachable
from
>> multiple loadable partitions are allocated to the main partition even
if
>> they aren’t reachable from the main partition.
>>
>> When assigning input sections to output sections, we take into account,
>> in addition to the name of the input section, the partition that the
input
>> section is assigned to. The SHF_ALLOC output sections are first sorted
by
>> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
>> sections appear last and are not sorted by partition. In the end we are
>> left with a collection of output sections that might look like this:
>>
>> Main partition:
>> 0x0000 ELF header, phdrs
>> 0x1000 .rodata
>> 0x2000 .dynsym
>> 0x3000 .text
>>
>> Loadable partition 1:
>> 0x4000 ELF header, phdrs
>> 0x5000 .rodata
>> 0x6000 .dynsym
>> 0x7000 .text
>>
>> Loadable partition 2:
>> 0x8000 ELF header, phdrs
>> 0x9000 .rodata
>> 0xa000 .dynsym
>> 0xb000 .text
>>
>> Non-SHF_ALLOC sections from all partitions:
>> .comment
>> .debug_info
>> (etc.)
>>
>> Now linking proceeds mostly as usual, and we’re effectively left with a
>> single .so that contains all of the partitions concatenated together.
This
>> isn’t very useful on its own and is likely to confuse tools (e.g. due
to
>> the presence of multiple .dynsyms); we can add a feature to
llvm-objcopy
>> that will extract the individual partitions from the output file
>> essentially by taking a slice of the combined .so file. These slices
can
>> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
>> sections are left in place.
>>
>> The envisaged usage of this feature is as follows:
>> $ clang -ffunction-sections -fdata-sections -c main.c # compile the
main
>> program
>> $ clang -ffunction-sections -fdata-sections
>> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
>> $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
>> -Wl,-soname,libmain.so -Wl,--gc-sections
>> $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
>> $ llvm-objcopy libcombined.so libfeature.so
>> --extract-partition=libfeature.so
>>
>> On Android, the loadable partitions can be loaded with the
>> android_dlopen_ext
>> <https://developer.android.com/ndk/reference/group/libdl>
function
>> passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the
>> correct address relative to the main partition. Other platforms that
wish
>> to support this feature will likely either need to add a similar
feature to
>> their dynamic loader or (in order to support loading the partitions
with a
>> regular dlopen) define a custom dynamic tag that will cause the dynamic
>> loader to first load the main partition and then the loadable partition
at
>> the correct relative address.
>>
>> == In more detail =>>
>> Each loadable partition will require its own sections to support the
>> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
>> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
>> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
>> with the main partition. This means that all addresses associated with
>> symbols will continue to be fixed.
>>
>> In order to cause the dynamic loader to reserve address space for the
>> loadable partitions so that they can be loaded at the correct address
>> later, a PT_LOAD segment is added to the main partition that allocates
a
>> page of bss at the address one byte past the end of the last address in
the
>> last partition. In the Android dynamic loader at least, this is enough
to
>> cause the required space to be reserved. Other platforms would need to
>> ensure that their dynamic loader implements similar behaviour.
>>
>> I haven't thought about how this feature will interact with linker
>> scripts. At least to start with we will likely need to forbid using
this
>> feature together with the PHDRS or SECTIONS linker script directives.
>>
>> Some sections will need to be present in each partition (e.g. .interp
and
>> .note sections). Probably the most straightforward way to do this will
be
>> to cause the linker to create a clone of these sections for each
partition.
>>
>> == Other use cases =>>
>> An example of another use case for this feature could be an operating
>> system API which is exposed across multiple DSOs. Typically these DSOs
will
>> be implemented using private APIs that are not exposed to the
application.
>> This feature would allow you to create a common DSO that contains the
>> shared code implementing the private APIs (i.e. the main partition),
>> together with individual DSOs (i.e. the loadable partitions) that use
the
>> private APIs and expose the public ones, but without actually exposing
the
>> private APIs in the dynamic symbol table or paying the binary size cost
of
>> doing so.
>>
>> == Prototype =>>
>> A prototype/proof of concept of this feature has been implemented here:
>> https://github.com/pcc/llvm-project/tree/lld-module-symbols
>> There is a test app in the test-progs/app directory that demonstrates
the
>> feature on Android with a simple hello world app (based on
>> https://www.hanshq.net/command-line-android.html ). I have successfully
>> tested debugging the loadable partition with gdb (e.g. setting
breakpoints
>> and printing globals), but getting unwinding working will need a bit
more
>> work.
>>
>> Note that the feature as exposed by the prototype is different from
what
>> I'm proposing here, e.g. it uses a linker flag to specify which
symbols go
>> in which partitions. I think the best place to specify this information
is
>> at either the source level or the compiler flag level, so that is what
I
>> intend to implement.
>>
>> Thanks,
>> --
>> --
>> Peter
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190226/848c0240/attachment.html>

Peter Smith via llvm-dev

2019-Feb-27 12:18 UTC

head link

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

Hello,

When I first read this I thought that this is kind of like overlays,
but optimising for flash size rather than memory size and with the
dynamic linker as a simplified non-evicting overlay manager. However I
think it might be more accurate to describe as a way of integrating
features implemented in shared libraries into an application?

Do you have an idea of how the development process would work for such
an application? It sounds to me like this isn't going to work well for
"here is an executable that I've already written wrote go find the
bits can be split off into partitions". I'd expect that this is more
of a, "here is an application that has been implemented with some
independent features implemented in shared libraries (natural but
heavy weight partitions) that I want to be optimised together at the
cost of (realistically) losing the shared part of the libraries.". I'm
thinking that one possible development model for you would be first
implement your independent feature as a shared library, the entry
points for the partition could then be extracted from it, with perhaps
the inputs to the shared library link step being extracted for the
main application (eliminating duplicates between the partitions would
be interesting).

My main concern with this is that this could end being fragile, with
subtle bugs that would be difficult to reason about ahead of time and
without the feature implemented on a desktop OS it could be very
difficult to test that a new feature would work with it, or whether a
change to an existing feature would break it. I think we'd need some
extensive documentation to go with it at the least.

Peter

Some other thoughts below:
> We could certainly consider having multiple GOTs which are allocated to
partitions in the same way as sections are. This might be useful if for example
one of the partitions references a DSO that is unused by the main program and we
need to avoid having the main program depend on the DSO. But I consider >
this an optimization over the proposed approach and not something that would be
strictly required for correctness. I chose to omit this for now for the sake of
simplicity and because my customer does not require it for now.
Introducing multiple GOTs might make it difficult to reason about GOT
relative relocations, is it always unambiguous which GOT a relocation
should refer to? Can we answer that question for all targets?
> I haven't thought about how this feature will interact with linker
scripts. At least to start with we will likely need to forbid using this feature
together with the PHDRS or SECTIONS linker script directives.I'd recommend not going there unless you have a really good model of
how it will work and what type of scripts are compatible with it. It
would be incredibly easy to write something to break the assumptions
that this is relying on.

An ifunc in one of the loadable partitions could be problematic as
there is only one PLT and GOT in the application, with the ifunc
dynamic relocation running eagerly at application launch but no
resolver present in memory at the time. You'll also need to be careful
to prevent any sharing of things like strings between the partitions.

On Wed, 27 Feb 2019 at 01:34, Peter Collingbourne <peter at pcc.me.uk>
wrote:>
> Hi folks,
>
> I'd like to propose adding a feature to ELF lld for automatically
partitioning a program into multiple binaries. (This will also involve adding a
feature to clang, so I've cc'd cfe-dev as well.)
>
> == Problem statement =>
> Embedded devices such as cell phones, especially lower end devices, are
typically highly resource constrained. Users of cell phone applications must pay
a cost (in terms of download size as well as storage space) for all features
that the application implements, even for features that are only used by a
minority of users. Therefore, there is a desire to split applications into
multiple pieces that can be downloaded independently, so that the majority of
users only pay the cost of the commonly used features. This can technically be
achieved using traditional ELF dynamic linking: the main part of the program can
be compiled as an executable or DSO that exports symbols that are then imported
by a separate DSO containing the part of the program implementing the optional
feature. However, this itself imposes costs:
> - Each exported symbol by itself imposes additional binary size costs, as
it requires the name of the symbol and a dynamic symbol table entry to be stored
in both the exporting and importing DSO, and on the importing side a dynamic
relocation, a GOT entry and likely a PLT entry must be present. These additional
costs go some way towards defeating the purpose of splitting the program into
pieces in the first place, and can also impact program startup and overall
performance because of the additional indirections.
> - It can result in more code needing to appear in the main part of the
program than necessary. For example, imagine that both the feature and the main
program make use of a common (statically linked) library, but they call
different subsets of the functions in that library. With traditional ELF linking
we are forced to either link and export the entire library from the main program
(even the functions unused by either part of the program) or carefully maintain
a list of functions that are used by the other parts of the program.
> - Since the linker does not see the whole program at once and links each
piece independently, a number of link-time optimizations and features stop
working, such as LTO across partition boundaries, whole-program devirtualization
and non-cross-DSO control flow integrity (control flow integrity has a cross-DSO
mode, but that also imposes binary size costs because a significant amount of
metadata needs to appear in each DSO).
> There are ways around at least the first point. For example, the program
could arrange to use a custom mechanism for binding references between the main
program and the feature code, such as a table of entry points. However, this can
impose maintenance costs (for example, the binding mechanism can be intrusive in
the source code and typically has to be maintained manually), and it still does
not address the last point.
>
> == Proposed solution =>
> I propose to extend lld so that it can perform the required partitioning
automatically, given a set of entry points for each part of the program. The end
product of linking will be a main program (which can be either an executable or
a DSO) combined with a set of DSOs that must be loaded at fixed addresses
relative to the base address of the main program. These binaries will all share
a virtual address space so that they can refer to one another directly using
PC-relative references or RELATIVE dynamic relocations as if they were all
statically linked together in the first place, rather than via the GOT (or
custom GOT-equivalent).
>
> The way that it will work is that we can extend the graph reachability
algorithm currently implemented by the linker for --gc-sections. The entry
points for each partition are marked up with a string naming the partition,
either at the source level with an attribute on the function or global variable,
or by passing a flag to the compiler (this string becomes the partition's
soname). These symbols will act as the GC roots for the partition and will be
exported from its dynsym. Assuming that there is a single partition, let's
call this set of symbols S2, while all other GC roots (e.g. non-marked-up
exported symbols, sections in .init_array) we call S1. Any sections reachable
from S1 are allocated to the main partition, while sections reachable only from
S2 but not from S1 are allocated to S2's partition. We can extend this idea
to multiple loadable partitions by defining S3, S4 and so on, but any sections
reachable from multiple loadable partitions are allocated to the main partition
even if they aren’t reachable from the main partition.
>
> When assigning input sections to output sections, we take into account, in
addition to the name of the input section, the partition that the input section
is assigned to. The SHF_ALLOC output sections are first sorted by partition, and
then by the usual sorting rules. As usual, non-SHF_ALLOC sections appear last
and are not sorted by partition. In the end we are left with a collection of
output sections that might look like this:
>
> Main partition:
> 0x0000 ELF header, phdrs
> 0x1000 .rodata
> 0x2000 .dynsym
> 0x3000 .text
>
> Loadable partition 1:
> 0x4000 ELF header, phdrs
> 0x5000 .rodata
> 0x6000 .dynsym
> 0x7000 .text
>
> Loadable partition 2:
> 0x8000 ELF header, phdrs
> 0x9000 .rodata
> 0xa000 .dynsym
> 0xb000 .text
>
> Non-SHF_ALLOC sections from all partitions:
> .comment
> .debug_info
> (etc.)
>
> Now linking proceeds mostly as usual, and we’re effectively left with a
single .so that contains all of the partitions concatenated together. This isn’t
very useful on its own and is likely to confuse tools (e.g. due to the presence
of multiple .dynsyms); we can add a feature to llvm-objcopy that will extract
the individual partitions from the output file essentially by taking a slice of
the combined .so file. These slices can also be fed to tools such as debuggers
provided that the non-SHF_ALLOC sections are left in place.
>
> The envisaged usage of this feature is as follows:
> $ clang -ffunction-sections -fdata-sections -c main.c # compile the main
program
> $ clang -ffunction-sections -fdata-sections
-fsymbol-partition=libfeature.so -c feature.c # compile the feature
> $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
-Wl,-soname,libmain.so -Wl,--gc-sections
> $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
> $ llvm-objcopy libcombined.so libfeature.so
--extract-partition=libfeature.so
>
> On Android, the loadable partitions can be loaded with the
android_dlopen_ext function passing ANDROID_DLEXT_RESERVED_ADDRESS to force it
to be loaded at the correct address relative to the main partition. Other
platforms that wish to support this feature will likely either need to add a
similar feature to their dynamic loader or (in order to support loading the
partitions with a regular dlopen) define a custom dynamic tag that will cause
the dynamic loader to first load the main partition and then the loadable
partition at the correct relative address.
>
> == In more detail =>
> Each loadable partition will require its own sections to support the
dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr, .dynsym,
.eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash, .interp,
.rela.dyn, .relr.dyn), but will be able to share a GOT and PLT with the main
partition. This means that all addresses associated with symbols will continue
to be fixed.
>
> In order to cause the dynamic loader to reserve address space for the
loadable partitions so that they can be loaded at the correct address later, a
PT_LOAD segment is added to the main partition that allocates a page of bss at
the address one byte past the end of the last address in the last partition. In
the Android dynamic loader at least, this is enough to cause the required space
to be reserved. Other platforms would need to ensure that their dynamic loader
implements similar behaviour.
>
> I haven't thought about how this feature will interact with linker
scripts. At least to start with we will likely need to forbid using this feature
together with the PHDRS or SECTIONS linker script directives.
>
> Some sections will need to be present in each partition (e.g. .interp and
.note sections). Probably the most straightforward way to do this will be to
cause the linker to create a clone of these sections for each partition.
>
> == Other use cases =>
> An example of another use case for this feature could be an operating
system API which is exposed across multiple DSOs. Typically these DSOs will be
implemented using private APIs that are not exposed to the application. This
feature would allow you to create a common DSO that contains the shared code
implementing the private APIs (i.e. the main partition), together with
individual DSOs (i.e. the loadable partitions) that use the private APIs and
expose the public ones, but without actually exposing the private APIs in the
dynamic symbol table or paying the binary size cost of doing so.
>
> == Prototype =>
> A prototype/proof of concept of this feature has been implemented here:
https://github.com/pcc/llvm-project/tree/lld-module-symbols
> There is a test app in the test-progs/app directory that demonstrates the
feature on Android with a simple hello world app (based on
https://www.hanshq.net/command-line-android.html ). I have successfully tested
debugging the loadable partition with gdb (e.g. setting breakpoints and printing
globals), but getting unwinding working will need a bit more work.
>
> Note that the feature as exposed by the prototype is different from what
I'm proposing here, e.g. it uses a linker flag to specify which symbols go
in which partitions. I think the best place to specify this information is at
either the source level or the compiler flag level, so that is what I intend to
implement.
>
> Thanks,
> --
> --
> Peter

Peter Collingbourne via llvm-dev

2019-Feb-27 21:07 UTC

head link

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

On Wed, Feb 27, 2019 at 4:19 AM Peter Smith <peter.smith at linaro.org>
wrote:
> Hello,
>
> When I first read this I thought that this is kind of like overlays,
> but optimising for flash size rather than memory size and with the
> dynamic linker as a simplified non-evicting overlay manager. However I
> think it might be more accurate to describe as a way of integrating
> features implemented in shared libraries into an application?
>
Yes, this is sort of like overlays, except that they're automatic (unlike
the linker script OVERLAY feature, which relies on manually partitioning
the program between object files) and non-overlapping (so that you can load
multiple of them). The intent is exactly that features would live in the
shared libraries.

Do you have an idea of how the development process would work for
such> an application? It sounds to me like this isn't going to work well for
> "here is an executable that I've already written wrote go find the
> bits can be split off into partitions". I'd expect that this is
more
> of a, "here is an application that has been implemented with some
> independent features implemented in shared libraries (natural but
> heavy weight partitions) that I want to be optimised together at the
> cost of (realistically) losing the shared part of the libraries.".
I'm
> thinking that one possible development model for you would be first
> implement your independent feature as a shared library, the entry
> points for the partition could then be extracted from it, with perhaps
> the inputs to the shared library link step being extracted for the
> main application (eliminating duplicates between the partitions would
> be interesting).
>
Yes, the intent is that a developer adopting this feature could start with
an application that is already split into multiple conventional DSOs, and
then adjust their cflags and ldflags in order to link both the application
and the features in a single step. That by itself could result in code size
savings from any code that was already being statically linked into
multiple binaries. Then they would be in a position to start switching
symbol visibilities over to hidden in order to remove dynsym entries and
cause more code to be moved into the loadable partitions.

My main concern with this is that this could end being fragile,
with> subtle bugs that would be difficult to reason about ahead of time and
> without the feature implemented on a desktop OS it could be very
> difficult to test that a new feature would work with it, or whether a
> change to an existing feature would break it. I think we'd need some
> extensive documentation to go with it at the least.
>
Yes, that's a concern for me as well. The situation is not too different
from the relocation packing features, which AFAIK are only supported by the
Android and ChromeOS dynamic loaders, but admittedly this proposed feature
will probably end up touching more code. I think it is mitigated by the
fact that one of the first intended users of this feature will be part of
Chromium, and Chromium is continuously tested with top-of-tree LLVM on most
platforms including Android, which means that we should find out about any
breakage quickly. I do intend to contribute user documentation since it
will be needed in order to understand how to use the feature and we can't
simply refer to GNU documentation for this.

>
> Peter
>
> Some other thoughts below:
>
> > We could certainly consider having multiple GOTs which are allocated
to
> partitions in the same way as sections are. This might be useful if for
> example one of the partitions references a DSO that is unused by the main
> program and we need to avoid having the main program depend on the DSO. But
> I consider > this an optimization over the proposed approach and not
> something that would be strictly required for correctness. I chose to omit
> this for now for the sake of simplicity and because my customer does not
> require it for now.
>
> Introducing multiple GOTs might make it difficult to reason about GOT
> relative relocations, is it always unambiguous which GOT a relocation
> should refer to? Can we answer that question for all targets?
>
I think it should be unambiguous: a GOT entry would be allocated to the
partition containing all GOT generating relocations to the symbol if there
is such a partition, or the main partition if not. I'm not sure what should
happen on MIPS which has its own weird multi-GOT thing. I'll probably just
make this feature unsupported on MIPS.
> I haven't thought about how this feature will interact with linker
> scripts. At least to start with we will likely need to forbid using this
> feature together with the PHDRS or SECTIONS linker script directives.
> I'd recommend not going there unless you have a really good model of
> how it will work and what type of scripts are compatible with it. It
> would be incredibly easy to write something to break the assumptions
> that this is relying on.
>
Agreed.

> An ifunc in one of the loadable partitions could be problematic as
> there is only one PLT and GOT in the application, with the ifunc
> dynamic relocation running eagerly at application launch but no
> resolver present in memory at the time.

Yes, I probably should have realised that ifuncs could be an issue. They'll
probably all need to be placed in the main partition unless we have
multiple GOTs.

You'll also need to be careful> to prevent any sharing of things like strings between the partitions.
>
In my prototype all mergeable sections will end up in the main partition
but I think we'd eventually want them to also participate in the graph
colouring (a good fraction of the loadable partition is expected to be made
up of strings). Strings would be placed similarly to GOT entries: in the
same partition if all refs come from that partition or in the main
partition if not.

Peter
>
> On Wed, 27 Feb 2019 at 01:34, Peter Collingbourne <peter at
pcc.me.uk> wrote:
> >
> > Hi folks,
> >
> > I'd like to propose adding a feature to ELF lld for automatically
> partitioning a program into multiple binaries. (This will also involve
> adding a feature to clang, so I've cc'd cfe-dev as well.)
> >
> > == Problem statement => >
> > Embedded devices such as cell phones, especially lower end devices,
are
> typically highly resource constrained. Users of cell phone applications
> must pay a cost (in terms of download size as well as storage space) for
> all features that the application implements, even for features that are
> only used by a minority of users. Therefore, there is a desire to split
> applications into multiple pieces that can be downloaded independently, so
> that the majority of users only pay the cost of the commonly used features.
> This can technically be achieved using traditional ELF dynamic linking: the
> main part of the program can be compiled as an executable or DSO that
> exports symbols that are then imported by a separate DSO containing the
> part of the program implementing the optional feature. However, this itself
> imposes costs:
> > - Each exported symbol by itself imposes additional binary size costs,
> as it requires the name of the symbol and a dynamic symbol table entry to
> be stored in both the exporting and importing DSO, and on the importing
> side a dynamic relocation, a GOT entry and likely a PLT entry must be
> present. These additional costs go some way towards defeating the purpose
> of splitting the program into pieces in the first place, and can also
> impact program startup and overall performance because of the additional
> indirections.
> > - It can result in more code needing to appear in the main part of the
> program than necessary. For example, imagine that both the feature and the
> main program make use of a common (statically linked) library, but they
> call different subsets of the functions in that library. With traditional
> ELF linking we are forced to either link and export the entire library from
> the main program (even the functions unused by either part of the program)
> or carefully maintain a list of functions that are used by the other parts
> of the program.
> > - Since the linker does not see the whole program at once and links
each
> piece independently, a number of link-time optimizations and features stop
> working, such as LTO across partition boundaries, whole-program
> devirtualization and non-cross-DSO control flow integrity (control flow
> integrity has a cross-DSO mode, but that also imposes binary size costs
> because a significant amount of metadata needs to appear in each DSO).
> > There are ways around at least the first point. For example, the
program
> could arrange to use a custom mechanism for binding references between the
> main program and the feature code, such as a table of entry points.
> However, this can impose maintenance costs (for example, the binding
> mechanism can be intrusive in the source code and typically has to be
> maintained manually), and it still does not address the last point.
> >
> > == Proposed solution => >
> > I propose to extend lld so that it can perform the required
partitioning
> automatically, given a set of entry points for each part of the program.
> The end product of linking will be a main program (which can be either an
> executable or a DSO) combined with a set of DSOs that must be loaded at
> fixed addresses relative to the base address of the main program. These
> binaries will all share a virtual address space so that they can refer to
> one another directly using PC-relative references or RELATIVE dynamic
> relocations as if they were all statically linked together in the first
> place, rather than via the GOT (or custom GOT-equivalent).
> >
> > The way that it will work is that we can extend the graph reachability
> algorithm currently implemented by the linker for --gc-sections. The entry
> points for each partition are marked up with a string naming the partition,
> either at the source level with an attribute on the function or global
> variable, or by passing a flag to the compiler (this string becomes the
> partition's soname). These symbols will act as the GC roots for the
> partition and will be exported from its dynsym. Assuming that there is a
> single partition, let's call this set of symbols S2, while all other GC
> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
> call S1. Any sections reachable from S1 are allocated to the main
> partition, while sections reachable only from S2 but not from S1 are
> allocated to S2's partition. We can extend this idea to multiple
loadable
> partitions by defining S3, S4 and so on, but any sections reachable from
> multiple loadable partitions are allocated to the main partition even if
> they aren’t reachable from the main partition.
> >
> > When assigning input sections to output sections, we take into
account,
> in addition to the name of the input section, the partition that the input
> section is assigned to. The SHF_ALLOC output sections are first sorted by
> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
> sections appear last and are not sorted by partition. In the end we are
> left with a collection of output sections that might look like this:
> >
> > Main partition:
> > 0x0000 ELF header, phdrs
> > 0x1000 .rodata
> > 0x2000 .dynsym
> > 0x3000 .text
> >
> > Loadable partition 1:
> > 0x4000 ELF header, phdrs
> > 0x5000 .rodata
> > 0x6000 .dynsym
> > 0x7000 .text
> >
> > Loadable partition 2:
> > 0x8000 ELF header, phdrs
> > 0x9000 .rodata
> > 0xa000 .dynsym
> > 0xb000 .text
> >
> > Non-SHF_ALLOC sections from all partitions:
> > .comment
> > .debug_info
> > (etc.)
> >
> > Now linking proceeds mostly as usual, and we’re effectively left with
a
> single .so that contains all of the partitions concatenated together. This
> isn’t very useful on its own and is likely to confuse tools (e.g. due to
> the presence of multiple .dynsyms); we can add a feature to llvm-objcopy
> that will extract the individual partitions from the output file
> essentially by taking a slice of the combined .so file. These slices can
> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
> sections are left in place.
> >
> > The envisaged usage of this feature is as follows:
> > $ clang -ffunction-sections -fdata-sections -c main.c # compile the
main
> program
> > $ clang -ffunction-sections -fdata-sections
> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
> > $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
> -Wl,-soname,libmain.so -Wl,--gc-sections
> > $ llvm-objcopy libcombined.so libmain.so
--extract-partition=libmain.so
> > $ llvm-objcopy libcombined.so libfeature.so
> --extract-partition=libfeature.so
> >
> > On Android, the loadable partitions can be loaded with the
> android_dlopen_ext function passing ANDROID_DLEXT_RESERVED_ADDRESS to force
> it to be loaded at the correct address relative to the main partition.
> Other platforms that wish to support this feature will likely either need
> to add a similar feature to their dynamic loader or (in order to support
> loading the partitions with a regular dlopen) define a custom dynamic tag
> that will cause the dynamic loader to first load the main partition and
> then the loadable partition at the correct relative address.
> >
> > == In more detail => >
> > Each loadable partition will require its own sections to support the
> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
> with the main partition. This means that all addresses associated with
> symbols will continue to be fixed.
> >
> > In order to cause the dynamic loader to reserve address space for the
> loadable partitions so that they can be loaded at the correct address
> later, a PT_LOAD segment is added to the main partition that allocates a
> page of bss at the address one byte past the end of the last address in the
> last partition. In the Android dynamic loader at least, this is enough to
> cause the required space to be reserved. Other platforms would need to
> ensure that their dynamic loader implements similar behaviour.
> >
> > I haven't thought about how this feature will interact with linker
> scripts. At least to start with we will likely need to forbid using this
> feature together with the PHDRS or SECTIONS linker script directives.
> >
> > Some sections will need to be present in each partition (e.g. .interp
> and .note sections). Probably the most straightforward way to do this will
> be to cause the linker to create a clone of these sections for each
> partition.
> >
> > == Other use cases => >
> > An example of another use case for this feature could be an operating
> system API which is exposed across multiple DSOs. Typically these DSOs will
> be implemented using private APIs that are not exposed to the application.
> This feature would allow you to create a common DSO that contains the
> shared code implementing the private APIs (i.e. the main partition),
> together with individual DSOs (i.e. the loadable partitions) that use the
> private APIs and expose the public ones, but without actually exposing the
> private APIs in the dynamic symbol table or paying the binary size cost of
> doing so.
> >
> > == Prototype => >
> > A prototype/proof of concept of this feature has been implemented
here:
> https://github.com/pcc/llvm-project/tree/lld-module-symbols
> > There is a test app in the test-progs/app directory that demonstrates
> the feature on Android with a simple hello world app (based on
> https://www.hanshq.net/command-line-android.html ). I have successfully
> tested debugging the loadable partition with gdb (e.g. setting breakpoints
> and printing globals), but getting unwinding working will need a bit more
> work.
> >
> > Note that the feature as exposed by the prototype is different from
what
> I'm proposing here, e.g. it uses a linker flag to specify which symbols
go
> in which partitions. I think the best place to specify this information is
> at either the source level or the compiler flag level, so that is what I
> intend to implement.
> >
> > Thanks,
> > --
> > --
> > Peter
>

-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190227/b878358e/attachment.html>

Joerg Sonnenberger via llvm-dev

2019-Feb-27 23:52 UTC

head link

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

On Tue, Feb 26, 2019 at 11:37:16PM -0500, James Y Knight via cfe-dev
wrote:> What I want is to be able to statically link _independent_ binaries (either
> executables or shared libraries), which happen to depend on a lot of the
> same code. And, then, to save space by reducing the amount of duplicated
> code between them.
crunchgen in NetBSD does essentially the first half. -ffunction-sections
could do a bit of the second part, but I'm not sure if you actually save
much.

Joerg

Rui Ueyama via llvm-dev

2019-Mar-02 01:38 UTC

head link

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

Somehow I missed this email.

This is an interesting proposal. I guess that Chrome on Android is a
candidate program to apply this. I wonder what are other programs you can
apply this scheme. Looks like as long as you want to split a program into
multiple small pieces sharing the same memory address space, this is
applicable. One thing I can think of as an example of it is the kernel
module. Is there anything other than that?

On Tue, Feb 26, 2019 at 5:34 PM Peter Collingbourne <peter at pcc.me.uk>
wrote:
> Hi folks,
>
> I'd like to propose adding a feature to ELF lld for automatically
> partitioning a program into multiple binaries. (This will also involve
> adding a feature to clang, so I've cc'd cfe-dev as well.)
>
> == Problem statement =>
> Embedded devices such as cell phones, especially lower end devices, are
> typically highly resource constrained. Users of cell phone applications
> must pay a cost (in terms of download size as well as storage space) for
> all features that the application implements, even for features that are
> only used by a minority of users. Therefore, there is a desire to split
> applications into multiple pieces that can be downloaded independently, so
> that the majority of users only pay the cost of the commonly used features.
> This can technically be achieved using traditional ELF dynamic linking: the
> main part of the program can be compiled as an executable or DSO that
> exports symbols that are then imported by a separate DSO containing the
> part of the program implementing the optional feature. However, this itself
> imposes costs:
> - Each exported symbol by itself imposes additional binary size costs, as
> it requires the name of the symbol and a dynamic symbol table entry to be
> stored in both the exporting and importing DSO, and on the importing side a
> dynamic relocation, a GOT entry and likely a PLT entry must be present.
> These additional costs go some way towards defeating the purpose of
> splitting the program into pieces in the first place, and can also impact
> program startup and overall performance because of the additional
> indirections.
> - It can result in more code needing to appear in the main part of the
> program than necessary. For example, imagine that both the feature and the
> main program make use of a common (statically linked) library, but they
> call different subsets of the functions in that library. With traditional
> ELF linking we are forced to either link and export the entire library from
> the main program (even the functions unused by either part of the program)
> or carefully maintain a list of functions that are used by the other parts
> of the program.
> - Since the linker does not see the whole program at once and links each
> piece independently, a number of link-time optimizations and features stop
> working, such as LTO across partition boundaries, whole-program
> devirtualization and non-cross-DSO control flow integrity (control flow
> integrity has a cross-DSO mode, but that also imposes binary size costs
> because a significant amount of metadata needs to appear in each DSO).
> There are ways around at least the first point. For example, the program
> could arrange to use a custom mechanism for binding references between the
> main program and the feature code, such as a table of entry points.
> However, this can impose maintenance costs (for example, the binding
> mechanism can be intrusive in the source code and typically has to be
> maintained manually), and it still does not address the last point.
>
> == Proposed solution =>
> I propose to extend lld so that it can perform the required partitioning
> automatically, given a set of entry points for each part of the program.
> The end product of linking will be a main program (which can be either an
> executable or a DSO) combined with a set of DSOs that must be loaded at
> fixed addresses relative to the base address of the main program. These
> binaries will all share a virtual address space so that they can refer to
> one another directly using PC-relative references or RELATIVE dynamic
> relocations as if they were all statically linked together in the first
> place, rather than via the GOT (or custom GOT-equivalent).
>
> The way that it will work is that we can extend the graph reachability
> algorithm currently implemented by the linker for --gc-sections. The entry
> points for each partition are marked up with a string naming the partition,
> either at the source level with an attribute on the function or global
> variable, or by passing a flag to the compiler (this string becomes the
> partition's soname). These symbols will act as the GC roots for the
> partition and will be exported from its dynsym. Assuming that there is a
> single partition, let's call this set of symbols S2, while all other GC
> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
> call S1. Any sections reachable from S1 are allocated to the main
> partition, while sections reachable only from S2 but not from S1 are
> allocated to S2's partition. We can extend this idea to multiple
loadable
> partitions by defining S3, S4 and so on, but any sections reachable from
> multiple loadable partitions are allocated to the main partition even if
> they aren’t reachable from the main partition.
>
> When assigning input sections to output sections, we take into account, in
> addition to the name of the input section, the partition that the input
> section is assigned to. The SHF_ALLOC output sections are first sorted by
> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
> sections appear last and are not sorted by partition. In the end we are
> left with a collection of output sections that might look like this:
>
> Main partition:
> 0x0000 ELF header, phdrs
> 0x1000 .rodata
> 0x2000 .dynsym
> 0x3000 .text
>
> Loadable partition 1:
> 0x4000 ELF header, phdrs
> 0x5000 .rodata
> 0x6000 .dynsym
> 0x7000 .text
>
> Loadable partition 2:
> 0x8000 ELF header, phdrs
> 0x9000 .rodata
> 0xa000 .dynsym
> 0xb000 .text
>
> Non-SHF_ALLOC sections from all partitions:
> .comment
> .debug_info
> (etc.)
>
> Now linking proceeds mostly as usual, and we’re effectively left with a
> single .so that contains all of the partitions concatenated together. This
> isn’t very useful on its own and is likely to confuse tools (e.g. due to
> the presence of multiple .dynsyms); we can add a feature to llvm-objcopy
> that will extract the individual partitions from the output file
> essentially by taking a slice of the combined .so file. These slices can
> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
> sections are left in place.
>
> The envisaged usage of this feature is as follows:
> $ clang -ffunction-sections -fdata-sections -c main.c # compile the main
> program
> $ clang -ffunction-sections -fdata-sections
> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
> $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
> -Wl,-soname,libmain.so -Wl,--gc-sections
> $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
> $ llvm-objcopy libcombined.so libfeature.so
> --extract-partition=libfeature.so
>
> On Android, the loadable partitions can be loaded with the
> android_dlopen_ext
> <https://developer.android.com/ndk/reference/group/libdl> function
> passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the
> correct address relative to the main partition. Other platforms that wish
> to support this feature will likely either need to add a similar feature to
> their dynamic loader or (in order to support loading the partitions with a
> regular dlopen) define a custom dynamic tag that will cause the dynamic
> loader to first load the main partition and then the loadable partition at
> the correct relative address.
>
> == In more detail =>
> Each loadable partition will require its own sections to support the
> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
> with the main partition. This means that all addresses associated with
> symbols will continue to be fixed.
>
> In order to cause the dynamic loader to reserve address space for the
> loadable partitions so that they can be loaded at the correct address
> later, a PT_LOAD segment is added to the main partition that allocates a
> page of bss at the address one byte past the end of the last address in the
> last partition. In the Android dynamic loader at least, this is enough to
> cause the required space to be reserved. Other platforms would need to
> ensure that their dynamic loader implements similar behaviour.
>
> I haven't thought about how this feature will interact with linker
> scripts. At least to start with we will likely need to forbid using this
> feature together with the PHDRS or SECTIONS linker script directives.
>
> Some sections will need to be present in each partition (e.g. .interp and
> .note sections). Probably the most straightforward way to do this will be
> to cause the linker to create a clone of these sections for each partition.
>
> == Other use cases =>
> An example of another use case for this feature could be an operating
> system API which is exposed across multiple DSOs. Typically these DSOs will
> be implemented using private APIs that are not exposed to the application.
> This feature would allow you to create a common DSO that contains the
> shared code implementing the private APIs (i.e. the main partition),
> together with individual DSOs (i.e. the loadable partitions) that use the
> private APIs and expose the public ones, but without actually exposing the
> private APIs in the dynamic symbol table or paying the binary size cost of
> doing so.
>
> == Prototype =>
> A prototype/proof of concept of this feature has been implemented here:
> https://github.com/pcc/llvm-project/tree/lld-module-symbols
> There is a test app in the test-progs/app directory that demonstrates the
> feature on Android with a simple hello world app (based on
> https://www.hanshq.net/command-line-android.html ). I have successfully
> tested debugging the loadable partition with gdb (e.g. setting breakpoints
> and printing globals), but getting unwinding working will need a bit more
> work.
>
> Note that the feature as exposed by the prototype is different from what
> I'm proposing here, e.g. it uses a linker flag to specify which symbols
go
> in which partitions. I think the best place to specify this information is
> at either the source level or the compiler flag level, so that is what I
> intend to implement.
>
> Thanks,
> --
> --
> Peter
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190301/2699935a/attachment-0001.html>

Rui Ueyama via llvm-dev

2019-Mar-02 01:42 UTC

head link

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

On Tue, Feb 26, 2019 at 8:38 PM James Y Knight via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> I've long wanted a related-but-different form of this sort of thing.
>
> What I want is to be able to statically link _independent_ binaries
> (either executables or shared libraries), which happen to depend on a lot
> of the same code. And, then, to save space by reducing the amount of
> duplicated code between them.
>
> The way I imagined it is that it if I link two programs, the linker would
> emit three resulting files: Program 1, Program 2, and a shared-library
> containing the common code.
>
> As input, I'd give the linker, separately, the lists of
objects/libraries
> to go into Program 1, and those to go into Program 2. It would do its
> normal static archive search/object-file-selection process separately for
> each list. After collecting the object file lists, any objects that ended
> up used both in #1 and #2 would go into the shared-library, while the
> others would go into their own separate programs.
>
> (Note that this then causes static initializers to end up being
> run/included in the expected programs in the same way as with plain static
> linking.)
>
> Finally, gc-sections should do its graph coloring starting only from the
> entry-points of the two programs, but also coloring through into the
> automatic shared library. In the end, the shared library should only
> contain functions that are used by at least one of the programs. (As I
> described it, this mechanism would leave functions used only by one program
> that came from an object file used by both programs in the shared-lib.
> Alternatively, the graph coloring could possibly determine the target
> location of the function, too)
>
> Without any new loader hacks, you'd still have some additional
overhead,
> but since the symbol names in the automatic shared library are irrelevant
> (other than needing to be unique), they can be shortened to reduce the
> string-table overhead and reduce the risk of external-usage.
>
> In the end, you'd have two completely normal binaries, which use a
normal
> shared object -- but the shared lib was auto-generated, with meaningless
> private symbols, only exporting the minimum set of symbols it needs to in
> order to let the dynamic loader load it into the exact programs it was
> compiled for.
>
I wonder what is the difference of having three files (2 executables and 1
DSO) and having a single binary containing all code and data. It is not too
hard to link two programs together (which is what busybox does, IIUC), and
the resulting executable would be as compact as the sum of the three
different files. If you execute a busybox'ed executable, only the needed
part is mapped to memory, so I guess you wouldn't lose anything at runtime.
What am I missing?

>
> On Tue, Feb 26, 2019 at 8:35 PM Peter Collingbourne via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> Hi folks,
>>
>> I'd like to propose adding a feature to ELF lld for automatically
>> partitioning a program into multiple binaries. (This will also involve
>> adding a feature to clang, so I've cc'd cfe-dev as well.)
>>
>> == Problem statement =>>
>> Embedded devices such as cell phones, especially lower end devices, are
>> typically highly resource constrained. Users of cell phone applications
>> must pay a cost (in terms of download size as well as storage space)
for
>> all features that the application implements, even for features that
are
>> only used by a minority of users. Therefore, there is a desire to split
>> applications into multiple pieces that can be downloaded independently,
so
>> that the majority of users only pay the cost of the commonly used
features.
>> This can technically be achieved using traditional ELF dynamic linking:
the
>> main part of the program can be compiled as an executable or DSO that
>> exports symbols that are then imported by a separate DSO containing the
>> part of the program implementing the optional feature. However, this
itself
>> imposes costs:
>> - Each exported symbol by itself imposes additional binary size costs,
as
>> it requires the name of the symbol and a dynamic symbol table entry to
be
>> stored in both the exporting and importing DSO, and on the importing
side a
>> dynamic relocation, a GOT entry and likely a PLT entry must be present.
>> These additional costs go some way towards defeating the purpose of
>> splitting the program into pieces in the first place, and can also
impact
>> program startup and overall performance because of the additional
>> indirections.
>> - It can result in more code needing to appear in the main part of the
>> program than necessary. For example, imagine that both the feature and
the
>> main program make use of a common (statically linked) library, but they
>> call different subsets of the functions in that library. With
traditional
>> ELF linking we are forced to either link and export the entire library
from
>> the main program (even the functions unused by either part of the
program)
>> or carefully maintain a list of functions that are used by the other
parts
>> of the program.
>> - Since the linker does not see the whole program at once and links
each
>> piece independently, a number of link-time optimizations and features
stop
>> working, such as LTO across partition boundaries, whole-program
>> devirtualization and non-cross-DSO control flow integrity (control flow
>> integrity has a cross-DSO mode, but that also imposes binary size costs
>> because a significant amount of metadata needs to appear in each DSO).
>> There are ways around at least the first point. For example, the
program
>> could arrange to use a custom mechanism for binding references between
the
>> main program and the feature code, such as a table of entry points.
>> However, this can impose maintenance costs (for example, the binding
>> mechanism can be intrusive in the source code and typically has to be
>> maintained manually), and it still does not address the last point.
>>
>> == Proposed solution =>>
>> I propose to extend lld so that it can perform the required
partitioning
>> automatically, given a set of entry points for each part of the
program.
>> The end product of linking will be a main program (which can be either
an
>> executable or a DSO) combined with a set of DSOs that must be loaded at
>> fixed addresses relative to the base address of the main program. These
>> binaries will all share a virtual address space so that they can refer
to
>> one another directly using PC-relative references or RELATIVE dynamic
>> relocations as if they were all statically linked together in the first
>> place, rather than via the GOT (or custom GOT-equivalent).
>>
>> The way that it will work is that we can extend the graph reachability
>> algorithm currently implemented by the linker for --gc-sections. The
entry
>> points for each partition are marked up with a string naming the
partition,
>> either at the source level with an attribute on the function or global
>> variable, or by passing a flag to the compiler (this string becomes the
>> partition's soname). These symbols will act as the GC roots for the
>> partition and will be exported from its dynsym. Assuming that there is
a
>> single partition, let's call this set of symbols S2, while all
other GC
>> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
>> call S1. Any sections reachable from S1 are allocated to the main
>> partition, while sections reachable only from S2 but not from S1 are
>> allocated to S2's partition. We can extend this idea to multiple
loadable
>> partitions by defining S3, S4 and so on, but any sections reachable
from
>> multiple loadable partitions are allocated to the main partition even
if
>> they aren’t reachable from the main partition.
>>
>> When assigning input sections to output sections, we take into account,
>> in addition to the name of the input section, the partition that the
input
>> section is assigned to. The SHF_ALLOC output sections are first sorted
by
>> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
>> sections appear last and are not sorted by partition. In the end we are
>> left with a collection of output sections that might look like this:
>>
>> Main partition:
>> 0x0000 ELF header, phdrs
>> 0x1000 .rodata
>> 0x2000 .dynsym
>> 0x3000 .text
>>
>> Loadable partition 1:
>> 0x4000 ELF header, phdrs
>> 0x5000 .rodata
>> 0x6000 .dynsym
>> 0x7000 .text
>>
>> Loadable partition 2:
>> 0x8000 ELF header, phdrs
>> 0x9000 .rodata
>> 0xa000 .dynsym
>> 0xb000 .text
>>
>> Non-SHF_ALLOC sections from all partitions:
>> .comment
>> .debug_info
>> (etc.)
>>
>> Now linking proceeds mostly as usual, and we’re effectively left with a
>> single .so that contains all of the partitions concatenated together.
This
>> isn’t very useful on its own and is likely to confuse tools (e.g. due
to
>> the presence of multiple .dynsyms); we can add a feature to
llvm-objcopy
>> that will extract the individual partitions from the output file
>> essentially by taking a slice of the combined .so file. These slices
can
>> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
>> sections are left in place.
>>
>> The envisaged usage of this feature is as follows:
>> $ clang -ffunction-sections -fdata-sections -c main.c # compile the
main
>> program
>> $ clang -ffunction-sections -fdata-sections
>> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
>> $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
>> -Wl,-soname,libmain.so -Wl,--gc-sections
>> $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
>> $ llvm-objcopy libcombined.so libfeature.so
>> --extract-partition=libfeature.so
>>
>> On Android, the loadable partitions can be loaded with the
>> android_dlopen_ext
>> <https://developer.android.com/ndk/reference/group/libdl>
function
>> passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the
>> correct address relative to the main partition. Other platforms that
wish
>> to support this feature will likely either need to add a similar
feature to
>> their dynamic loader or (in order to support loading the partitions
with a
>> regular dlopen) define a custom dynamic tag that will cause the dynamic
>> loader to first load the main partition and then the loadable partition
at
>> the correct relative address.
>>
>> == In more detail =>>
>> Each loadable partition will require its own sections to support the
>> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
>> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
>> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
>> with the main partition. This means that all addresses associated with
>> symbols will continue to be fixed.
>>
>> In order to cause the dynamic loader to reserve address space for the
>> loadable partitions so that they can be loaded at the correct address
>> later, a PT_LOAD segment is added to the main partition that allocates
a
>> page of bss at the address one byte past the end of the last address in
the
>> last partition. In the Android dynamic loader at least, this is enough
to
>> cause the required space to be reserved. Other platforms would need to
>> ensure that their dynamic loader implements similar behaviour.
>>
>> I haven't thought about how this feature will interact with linker
>> scripts. At least to start with we will likely need to forbid using
this
>> feature together with the PHDRS or SECTIONS linker script directives.
>>
>> Some sections will need to be present in each partition (e.g. .interp
and
>> .note sections). Probably the most straightforward way to do this will
be
>> to cause the linker to create a clone of these sections for each
partition.
>>
>> == Other use cases =>>
>> An example of another use case for this feature could be an operating
>> system API which is exposed across multiple DSOs. Typically these DSOs
will
>> be implemented using private APIs that are not exposed to the
application.
>> This feature would allow you to create a common DSO that contains the
>> shared code implementing the private APIs (i.e. the main partition),
>> together with individual DSOs (i.e. the loadable partitions) that use
the
>> private APIs and expose the public ones, but without actually exposing
the
>> private APIs in the dynamic symbol table or paying the binary size cost
of
>> doing so.
>>
>> == Prototype =>>
>> A prototype/proof of concept of this feature has been implemented here:
>> https://github.com/pcc/llvm-project/tree/lld-module-symbols
>> There is a test app in the test-progs/app directory that demonstrates
the
>> feature on Android with a simple hello world app (based on
>> https://www.hanshq.net/command-line-android.html ). I have successfully
>> tested debugging the loadable partition with gdb (e.g. setting
breakpoints
>> and printing globals), but getting unwinding working will need a bit
more
>> work.
>>
>> Note that the feature as exposed by the prototype is different from
what
>> I'm proposing here, e.g. it uses a linker flag to specify which
symbols go
>> in which partitions. I think the best place to specify this information
is
>> at either the source level or the compiler flag level, so that is what
I
>> intend to implement.
>>
>> Thanks,
>> --
>> --
>> Peter
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190301/638a7543/attachment.html>

Peter Collingbourne via llvm-dev

2019-Mar-02 01:51 UTC

head link

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

I covered the example of the operating system API in the original proposal,
but yes, you might also be able to use this for kernel modules. The
justification isn't as great for them since kernels typically require
linker scripts, and I'm not planning to make this compatible with linker
scripts. But if you did manage to write a kernel that didn't need a linker
script, this seems like it might be useful.

Peter

On Fri, Mar 1, 2019 at 5:39 PM Rui Ueyama <ruiu at google.com> wrote:
> Somehow I missed this email.
>
> This is an interesting proposal. I guess that Chrome on Android is a
> candidate program to apply this. I wonder what are other programs you can
> apply this scheme. Looks like as long as you want to split a program into
> multiple small pieces sharing the same memory address space, this is
> applicable. One thing I can think of as an example of it is the kernel
> module. Is there anything other than that?
>
> On Tue, Feb 26, 2019 at 5:34 PM Peter Collingbourne <peter at
pcc.me.uk>
> wrote:
>
>> Hi folks,
>>
>> I'd like to propose adding a feature to ELF lld for automatically
>> partitioning a program into multiple binaries. (This will also involve
>> adding a feature to clang, so I've cc'd cfe-dev as well.)
>>
>> == Problem statement =>>
>> Embedded devices such as cell phones, especially lower end devices, are
>> typically highly resource constrained. Users of cell phone applications
>> must pay a cost (in terms of download size as well as storage space)
for
>> all features that the application implements, even for features that
are
>> only used by a minority of users. Therefore, there is a desire to split
>> applications into multiple pieces that can be downloaded independently,
so
>> that the majority of users only pay the cost of the commonly used
features.
>> This can technically be achieved using traditional ELF dynamic linking:
the
>> main part of the program can be compiled as an executable or DSO that
>> exports symbols that are then imported by a separate DSO containing the
>> part of the program implementing the optional feature. However, this
itself
>> imposes costs:
>> - Each exported symbol by itself imposes additional binary size costs,
as
>> it requires the name of the symbol and a dynamic symbol table entry to
be
>> stored in both the exporting and importing DSO, and on the importing
side a
>> dynamic relocation, a GOT entry and likely a PLT entry must be present.
>> These additional costs go some way towards defeating the purpose of
>> splitting the program into pieces in the first place, and can also
impact
>> program startup and overall performance because of the additional
>> indirections.
>> - It can result in more code needing to appear in the main part of the
>> program than necessary. For example, imagine that both the feature and
the
>> main program make use of a common (statically linked) library, but they
>> call different subsets of the functions in that library. With
traditional
>> ELF linking we are forced to either link and export the entire library
from
>> the main program (even the functions unused by either part of the
program)
>> or carefully maintain a list of functions that are used by the other
parts
>> of the program.
>> - Since the linker does not see the whole program at once and links
each
>> piece independently, a number of link-time optimizations and features
stop
>> working, such as LTO across partition boundaries, whole-program
>> devirtualization and non-cross-DSO control flow integrity (control flow
>> integrity has a cross-DSO mode, but that also imposes binary size costs
>> because a significant amount of metadata needs to appear in each DSO).
>> There are ways around at least the first point. For example, the
program
>> could arrange to use a custom mechanism for binding references between
the
>> main program and the feature code, such as a table of entry points.
>> However, this can impose maintenance costs (for example, the binding
>> mechanism can be intrusive in the source code and typically has to be
>> maintained manually), and it still does not address the last point.
>>
>> == Proposed solution =>>
>> I propose to extend lld so that it can perform the required
partitioning
>> automatically, given a set of entry points for each part of the
program.
>> The end product of linking will be a main program (which can be either
an
>> executable or a DSO) combined with a set of DSOs that must be loaded at
>> fixed addresses relative to the base address of the main program. These
>> binaries will all share a virtual address space so that they can refer
to
>> one another directly using PC-relative references or RELATIVE dynamic
>> relocations as if they were all statically linked together in the first
>> place, rather than via the GOT (or custom GOT-equivalent).
>>
>> The way that it will work is that we can extend the graph reachability
>> algorithm currently implemented by the linker for --gc-sections. The
entry
>> points for each partition are marked up with a string naming the
partition,
>> either at the source level with an attribute on the function or global
>> variable, or by passing a flag to the compiler (this string becomes the
>> partition's soname). These symbols will act as the GC roots for the
>> partition and will be exported from its dynsym. Assuming that there is
a
>> single partition, let's call this set of symbols S2, while all
other GC
>> roots (e.g. non-marked-up exported symbols, sections in .init_array) we
>> call S1. Any sections reachable from S1 are allocated to the main
>> partition, while sections reachable only from S2 but not from S1 are
>> allocated to S2's partition. We can extend this idea to multiple
loadable
>> partitions by defining S3, S4 and so on, but any sections reachable
from
>> multiple loadable partitions are allocated to the main partition even
if
>> they aren’t reachable from the main partition.
>>
>> When assigning input sections to output sections, we take into account,
>> in addition to the name of the input section, the partition that the
input
>> section is assigned to. The SHF_ALLOC output sections are first sorted
by
>> partition, and then by the usual sorting rules. As usual, non-SHF_ALLOC
>> sections appear last and are not sorted by partition. In the end we are
>> left with a collection of output sections that might look like this:
>>
>> Main partition:
>> 0x0000 ELF header, phdrs
>> 0x1000 .rodata
>> 0x2000 .dynsym
>> 0x3000 .text
>>
>> Loadable partition 1:
>> 0x4000 ELF header, phdrs
>> 0x5000 .rodata
>> 0x6000 .dynsym
>> 0x7000 .text
>>
>> Loadable partition 2:
>> 0x8000 ELF header, phdrs
>> 0x9000 .rodata
>> 0xa000 .dynsym
>> 0xb000 .text
>>
>> Non-SHF_ALLOC sections from all partitions:
>> .comment
>> .debug_info
>> (etc.)
>>
>> Now linking proceeds mostly as usual, and we’re effectively left with a
>> single .so that contains all of the partitions concatenated together.
This
>> isn’t very useful on its own and is likely to confuse tools (e.g. due
to
>> the presence of multiple .dynsyms); we can add a feature to
llvm-objcopy
>> that will extract the individual partitions from the output file
>> essentially by taking a slice of the combined .so file. These slices
can
>> also be fed to tools such as debuggers provided that the non-SHF_ALLOC
>> sections are left in place.
>>
>> The envisaged usage of this feature is as follows:
>> $ clang -ffunction-sections -fdata-sections -c main.c # compile the
main
>> program
>> $ clang -ffunction-sections -fdata-sections
>> -fsymbol-partition=libfeature.so -c feature.c # compile the feature
>> $ clang main.o feature.o -fuse-ld=lld -shared -o libcombined.so
>> -Wl,-soname,libmain.so -Wl,--gc-sections
>> $ llvm-objcopy libcombined.so libmain.so --extract-partition=libmain.so
>> $ llvm-objcopy libcombined.so libfeature.so
>> --extract-partition=libfeature.so
>>
>> On Android, the loadable partitions can be loaded with the
>> android_dlopen_ext
>> <https://developer.android.com/ndk/reference/group/libdl>
function
>> passing ANDROID_DLEXT_RESERVED_ADDRESS to force it to be loaded at the
>> correct address relative to the main partition. Other platforms that
wish
>> to support this feature will likely either need to add a similar
feature to
>> their dynamic loader or (in order to support loading the partitions
with a
>> regular dlopen) define a custom dynamic tag that will cause the dynamic
>> loader to first load the main partition and then the loadable partition
at
>> the correct relative address.
>>
>> == In more detail =>>
>> Each loadable partition will require its own sections to support the
>> dynamic loader and unwinder (namely: .ARM.exidx, .dynamic, .dynstr,
>> .dynsym, .eh_frame_hdr, .gnu.hash, .gnu.version, .gnu.version_r, .hash,
>> .interp, .rela.dyn, .relr.dyn), but will be able to share a GOT and PLT
>> with the main partition. This means that all addresses associated with
>> symbols will continue to be fixed.
>>
>> In order to cause the dynamic loader to reserve address space for the
>> loadable partitions so that they can be loaded at the correct address
>> later, a PT_LOAD segment is added to the main partition that allocates
a
>> page of bss at the address one byte past the end of the last address in
the
>> last partition. In the Android dynamic loader at least, this is enough
to
>> cause the required space to be reserved. Other platforms would need to
>> ensure that their dynamic loader implements similar behaviour.
>>
>> I haven't thought about how this feature will interact with linker
>> scripts. At least to start with we will likely need to forbid using
this
>> feature together with the PHDRS or SECTIONS linker script directives.
>>
>> Some sections will need to be present in each partition (e.g. .interp
and
>> .note sections). Probably the most straightforward way to do this will
be
>> to cause the linker to create a clone of these sections for each
partition.
>>
>> == Other use cases =>>
>> An example of another use case for this feature could be an operating
>> system API which is exposed across multiple DSOs. Typically these DSOs
will
>> be implemented using private APIs that are not exposed to the
application.
>> This feature would allow you to create a common DSO that contains the
>> shared code implementing the private APIs (i.e. the main partition),
>> together with individual DSOs (i.e. the loadable partitions) that use
the
>> private APIs and expose the public ones, but without actually exposing
the
>> private APIs in the dynamic symbol table or paying the binary size cost
of
>> doing so.
>>
>> == Prototype =>>
>> A prototype/proof of concept of this feature has been implemented here:
>> https://github.com/pcc/llvm-project/tree/lld-module-symbols
>> There is a test app in the test-progs/app directory that demonstrates
the
>> feature on Android with a simple hello world app (based on
>> https://www.hanshq.net/command-line-android.html ). I have successfully
>> tested debugging the loadable partition with gdb (e.g. setting
breakpoints
>> and printing globals), but getting unwinding working will need a bit
more
>> work.
>>
>> Note that the feature as exposed by the prototype is different from
what
>> I'm proposing here, e.g. it uses a linker flag to specify which
symbols go
>> in which partitions. I think the best place to specify this information
is
>> at either the source level or the compiler flag level, so that is what
I
>> intend to implement.
>>
>> Thanks,
>> --
>> --
>> Peter
>>
>
-- 
-- 
Peter
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20190301/8d8d5d12/attachment.html>

Maybe Matching Threads

Search for more maybe matching threads

llvm dev - Feb 2019 - [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] [cfe-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

[llvm-dev] RFC: Linker feature for automatically partitioning a program into multiple binaries

Maybe Matching Threads