thr3ads.net - llvm dev - [llvm-dev] Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section [Dec 2017]

Rahul Chaudhry via llvm-dev

2017-Dec-11 23:50 UTC

[llvm-dev] Reducing code size of Position Independent Executables (PIE) by shrinking the size of dynamic relocations section

A simple combination of delta-encoding and run_length-encoding is one of the
first schemes we experimented with (32-bit entries with 24-bit 'delta' and an
8-bit 'count'). This gave really good results, but as Sri mentions, we
observed several cases where the relative relocations were not on consecutive
offsets. There were common cases where the relocations applied to alternate
words, and that totally wrecked the scheme (a bunch of entries with delta==16
and count==1).
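
To make the failure mode concrete, here is a minimal sketch of such a
delta+count encoder. The field order (delta in the high 24 bits, count in the
low 8), the choice to measure delta from the start of the previous entry, and
the names pack_entry and encode_delta_count are assumptions for illustration;
this is not the code used for the measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORD 8u  /* sizeof(Elf64_Addr) on x86_64 */

/* Pack one 32-bit entry: high 24 bits = delta in bytes, low 8 bits = count.
   The field order is an assumption; the thread only gives the widths. */
static uint32_t pack_entry(uint32_t delta, uint32_t count)
{
    assert(delta <= 0xffffffu && count >= 1 && count <= 0xffu);
    return (delta << 8) | count;
}

/* Encode sorted, word-aligned relocation offsets into delta+count entries.
   Returns the number of entries written, or SIZE_MAX if a delta does not
   fit in 24 bits or the output buffer is too small. */
static size_t encode_delta_count(const uint64_t *offs, size_t n,
                                 uint32_t *out, size_t cap)
{
    size_t used = 0, i = 0;
    uint64_t base = 0;  /* start offset of the previous entry (assumed) */

    while (i < n) {
        uint64_t delta = offs[i] - base;
        size_t run = 1;

        /* Extend the run only while offsets are on consecutive words. */
        while (i + run < n && run < 0xffu &&
               offs[i + run] == offs[i] + run * WORD)
            ++run;

        if (delta > 0xffffffu || used == cap)
            return SIZE_MAX;
        out[used++] = pack_entry((uint32_t)delta, (uint32_t)run);
        base = offs[i];
        i += run;
    }
    return used;
}
```

With offsets 0x1000, 0x1008, 0x1010, 0x1018 this produces a single entry.
With 0x1000, 0x1010, 0x1020, 0x1030 (alternate words) it produces four, and
every entry after the first has delta==16 and count==1.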

I dug up some numbers on how that scheme compared with the current proposal on
the three examples I posted before:

delta+run_length encoding is using 32-bit entries (24-bit delta, 8-bit count).
delta+bitmap encoding is using 64-bit entries (8-bit delta, 56-bit bitmap).

1. Chrome browser (x86_64, built as PIE):
   605159 relocation entries (24 bytes each) in '.rela.dyn'
   594542 are R_X86_64_RELATIVE relocations (98.25%)
       14269008 bytes (13.61MB) in use in '.rela.dyn' section
         385420 bytes (0.37MB) using delta+run_length encoding
         109256 bytes (0.10MB) using delta+bitmap encoding


2. Go net/http test binary (x86_64, 'go test -buildmode=pie -c net/http')
   83810 relocation entries (24 bytes each) in '.rela.dyn'
   83804 are R_X86_64_RELATIVE relocations (99.99%)
       2011296 bytes (1.92MB) in use in .rela.dyn section
        204476 bytes (0.20MB) using delta+run_length encoding
         43744 bytes (0.04MB) using delta+bitmap encoding


3. Vim binary in /usr/bin on my workstation (Ubuntu, x86_64)
   6680 relocation entries (24 bytes each) in '.rela.dyn'
   6272 are R_X86_64_RELATIVE relocations (93.89%)
       150528 bytes (0.14MB) in use in .rela.dyn section
        14388 bytes (0.01MB) using delta+run_length encoding
         1992 bytes (0.00MB) using delta+bitmap encoding

Rahul


On Mon, Dec 11, 2017 at 10:41 AM, Sriraman Tallam <tmsriram at google.com> wrote:
> On Sat, Dec 9, 2017 at 3:06 PM, Florian Weimer <fw at deneb.enyo.de> wrote:
>> * Rahul Chaudhry via gnu-gabi:
>>
>>> The encoding used is a simple combination of delta-encoding and a
>>> bitmap of offsets. The section consists of 64-bit entries: higher
>>> 8-bits contain delta since last offset, and lower 56-bits contain a
>>> bitmap for which words to apply the relocation to. This is best
>>> described by showing the code for decoding the section:
>>>
>>> typedef struct
>>> {
>>>   Elf64_Xword  r_data;  /* jump and bitmap for relative relocations */
>>> } Elf64_Relrz;
>>>
>>> #define ELF64_R_JUMP(val)    ((val) >> 56)
>>> #define ELF64_R_BITS(val)    ((val) & 0xffffffffffffff)
>>>
>>> #ifdef DO_RELRZ
>>>   {
>>>     ElfW(Addr) offset = 0;
>>>     for (; relative < end; ++relative)
>>>       {
>>>         ElfW(Addr) jump = ELFW(R_JUMP) (relative->r_data);
>>>         ElfW(Addr) bits = ELFW(R_BITS) (relative->r_data);
>>>         offset += jump * sizeof(ElfW(Addr));
>>>         if (jump == 0)
>>>           {
>>>             ++relative;
>>>             offset = relative->r_data;
>>>           }
>>>         ElfW(Addr) r_offset = offset;
>>>         for (; bits != 0; bits >>= 1)
>>>           {
>>>             if ((bits&1) != 0)
>>>               elf_machine_relrz_relative (l_addr, (void *) (l_addr + r_offset));
>>>             r_offset += sizeof(ElfW(Addr));
>>>           }
>>>       }
>>>   }
>>> #endif
>>
>> That data-dependent “if ((bits&1) != 0)” branch looks a bit nasty.
>>
>> Have you investigated whether some sort of RLE-style encoding would be
>> beneficial? If there are blocks of relative relocations, it might even
>> be possible to use vector instructions to process them (although more
>> than four relocations at a time are probably not achievable in a
>> power-efficient manner on current x86-64).
>
> Yes, we originally investigated RLE style encoding but I guess the key
> insight which led us towards the proposed encoding is the following.
> The offset addresses which contain the relocations are close but not
> necessarily contiguous.  We experimented with an encoding strategy
> where we would store the initial offset and the number of words after
> that which contained dynamic relocations.  This gave us good
> compression numbers but the proposed scheme was way better.  I will
> let Rahul say more as he did quite a bit of experiments with different
> strategies.
>
> Thanks
> Sri

Roland McGrath via llvm-dev

2017-Dec-12 02:14 UTC

On Mon, Dec 11, 2017 at 3:50 PM Rahul Chaudhry via gnu-gabi
<gnu-gabi at sourceware.org> wrote:
> A simple combination of delta-encoding and run_length-encoding is one of the
> first schemes we experimented with (32-bit entries with 24-bit 'delta' and an
> 8-bit 'count'). This gave really good results, but as Sri mentions, we
> observed several cases where the relative relocations were not on
> consecutive offsets. There were common cases where the relocations applied
> to alternate words, and that totally wrecked the scheme (a bunch of entries
> with delta==16 and count==1).
>
For the same issue in a different context, I recently implemented a scheme
using run-length-encoding but using a variable stride.  So for a run of
alternate words, you still get a single entry, but with stride 16 instead
of 8.  In my application, most cases of strides > 8 are a run of only 2 or
3 but there are a few cases of dozens or hundreds with a stride of 16.  My
case is a solution tailored to exactly one application (a kernel), so there
is a closed sample set that's all that matters and the trade-off between
simplicity of the analysis and compactness of the results is different than
the general case you're addressing (my "analysis" consists of a few lines
of AWK).  But I wonder if it might be worthwhile to study the effect that a
variable-stride RLE scheme, or adding the variable-stride ability to your
hybrid scheme, has on your sample applications.

Since we're talking about specifying a new ABI that will be serving us for
many years to come and will be hard to change once deployed, it seems worth
spending quite a bit of effort up front to come to the most compact scheme
that's feasible.
-- 


Thanks,
Roland

Rahul Chaudhry via llvm-dev

2017-Dec-13 00:53 UTC

On Mon, Dec 11, 2017 at 6:14 PM, Roland McGrath <roland at hack.frob.com> wrote:
>
> On Mon, Dec 11, 2017 at 3:50 PM Rahul Chaudhry via gnu-gabi
> <gnu-gabi at sourceware.org> wrote:
>>
>> A simple combination of delta-encoding and run_length-encoding is one of
>> the first schemes we experimented with (32-bit entries with 24-bit 'delta'
>> and an 8-bit 'count'). This gave really good results, but as Sri mentions,
>> we observed several cases where the relative relocations were not on
>> consecutive offsets. There were common cases where the relocations applied
>> to alternate words, and that totally wrecked the scheme (a bunch of entries
>> with delta==16 and count==1).
>
> For the same issue in a different context, I recently implemented a scheme
> using run-length-encoding but using a variable stride.  So for a run of
> alternate words, you still get a single entry, but with stride 16 instead
> of 8.  In my application, most cases of strides > 8 are a run of only 2 or
> 3 but there are a few cases of dozens or hundreds with a stride of 16.  My
> case is a solution tailored to exactly one application (a kernel), so there
> is a closed sample set that's all that matters and the trade-off between
> simplicity of the analysis and compactness of the results is different than
> the general case you're addressing (my "analysis" consists of a few lines
> of AWK).  But I wonder if it might be worthwhile to study the effect a
> variable-stride RLE scheme or adding the variable-stride ability into your
> hybrid scheme has on your sample applications.
>
> Since we're talking about specifying a new ABI that will be serving us for
> many years to come and will be hard to change once deployed, it seems worth
> spending quite a bit of effort up front to come to the most compact scheme
> that's feasible.

I agree. Can you share more details of the encoding scheme that you found
useful (size of each entry, number of bits used for stride/count etc.)?

I just ran some experiments with an encoding with 32-bit entries: 16-bits for
delta, 8-bits for stride, and 8-bits for count. Here are the numbers, inlined
with those from the previous schemes for comparison:

1. Chrome browser (x86_64, built as PIE):
   605159 relocation entries (24 bytes each) in '.rela.dyn'
   594542 are R_X86_64_RELATIVE relocations (98.25%)
       14269008 bytes (13.61MB) in use in '.rela.dyn' section
         385420 bytes (0.37MB) using delta+count encoding
         232540 bytes (0.22MB) using delta+stride+count encoding
         109256 bytes (0.10MB) using jump+bitmap encoding

2. Go net/http test binary (x86_64, 'go test -buildmode=pie -c net/http')
   83810 relocation entries (24 bytes each) in '.rela.dyn'
   83804 are R_X86_64_RELATIVE relocations (99.99%)
       2011296 bytes (1.92MB) in use in .rela.dyn section
        204476 bytes (0.20MB) using delta+count encoding
        132568 bytes (0.13MB) using delta+stride+count encoding
         43744 bytes (0.04MB) using jump+bitmap encoding

3. Vim binary in /usr/bin on my workstation (Ubuntu, x86_64)
   6680 relocation entries (24 bytes each) in '.rela.dyn'
   6272 are R_X86_64_RELATIVE relocations (93.89%)
       150528 bytes (0.14MB) in use in .rela.dyn section
        14388 bytes (0.01MB) using delta+count encoding
         7000 bytes (0.01MB) using delta+stride+count encoding
         1992 bytes (0.00MB) using jump+bitmap encoding

delta+count encoding is using 32-bit entries:
  24-bit delta: number of bytes since last offset.
   8-bit count: number of relocations to apply (consecutive words).

delta+stride+count encoding is using 32-bit entries:
  16-bit delta: number of bytes since last offset.
   8-bit stride: stride (in bytes) for applying 'count' relocations.
   8-bit count: number of relocations to apply (using 'stride').

jump+bitmap encoding is using 64-bit entries:
   8-bit jump: number of words since last offset.
  56-bit bitmap: bitmap for which words to apply relocations to.
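
For reference, the jump+bitmap decoding can be sketched as a standalone
function that collects relocation offsets into an array instead of applying
them. The control flow follows the decoding loop from the original proposal,
including the escape where jump==0 means the next 64-bit word holds an
absolute offset. The names RELRZ_JUMP, RELRZ_BITS and relrz_decode, and the
bounds handling, are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RELRZ_JUMP(v) ((v) >> 56)
#define RELRZ_BITS(v) ((v) & UINT64_C(0x00ffffffffffffff))

/* Decode jump+bitmap entries into relocation offsets.
   Returns the number of offsets stored in out (at most cap). */
static size_t relrz_decode(const uint64_t *entries, size_t n,
                           uint64_t *out, size_t cap)
{
    size_t produced = 0;
    uint64_t offset = 0;

    assert(out != NULL || cap == 0);
    for (size_t i = 0; i < n; ++i) {
        uint64_t jump = RELRZ_JUMP(entries[i]);
        uint64_t bits = RELRZ_BITS(entries[i]);

        /* jump counts words since the previous entry's base offset. */
        offset += jump * sizeof(uint64_t);
        if (jump == 0) {
            /* Escape: the following word is an absolute offset. */
            if (++i == n)
                break;  /* truncated input */
            offset = entries[i];
        }

        /* Bit k set means: relocate the word at offset + k * 8. */
        uint64_t r_offset = offset;
        for (; bits != 0; bits >>= 1) {
            if ((bits & 1) != 0 && produced < cap)
                out[produced++] = r_offset;
            r_offset += sizeof(uint64_t);
        }
    }
    return produced;
}
```

For example, the three words 0x0000000000000005, 0x2000, 0x0300000000000003
decode to the offsets 0x2000, 0x2010, 0x2018 and 0x2020.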

While adding a 'stride' field is definitely an improvement over simple
delta+count encoding, it doesn't compare well against the bitmap based
encoding.

I took a look inside the encoding for the Vim binary. There are some instances
in the bitmap based encoding like
  [0x3855555555555555 0x3855555555555555 0x3855555555555555 ...]
that encode sequences of relocations applying to alternate words. The stride
based encoding works very well on these and turns them into the much more
compact
  [0x0ff010ff 0x0ff010ff 0x0ff010ff ...]
using stride==0x10 and count==0xff.
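
The two words above can be unpacked by hand to check this. The sketch below
assumes the stride entry is laid out as delta (high 16 bits), stride (next 8
bits), count (low 8 bits), which is what the quoted values imply; the helper
names are made up for illustration.

```c
#include <stdint.h>

struct stride_entry {
    uint32_t delta;   /* bytes since the previous entry's base offset */
    uint32_t stride;  /* bytes between consecutive relocations */
    uint32_t count;   /* number of relocations in this entry */
};

/* Split a 32-bit delta+stride+count entry into its fields. */
static struct stride_entry unpack_stride(uint32_t e)
{
    struct stride_entry s = { e >> 16, (e >> 8) & 0xffu, e & 0xffu };
    return s;
}

/* Jump field (in words) of a 64-bit jump+bitmap entry. */
static unsigned bitmap_jump(uint64_t e)
{
    return (unsigned)(e >> 56);
}

/* Number of relocations encoded by one jump+bitmap entry:
   the population count of the low 56 bits. */
static unsigned bitmap_relocs(uint64_t e)
{
    uint64_t bits = e & UINT64_C(0x00ffffffffffffff);
    unsigned n = 0;

    for (; bits != 0; bits &= bits - 1)  /* clear lowest set bit */
        ++n;
    return n;
}
```

0x3855555555555555 has jump 0x38 (56 words) and 28 bits set, so each 8-byte
bitmap entry covers 28 relocations. 0x0ff010ff has delta 0x0ff0, stride 0x10
and count 0xff, so each 4-byte stride entry covers 255 relocations, and its
delta equals stride * count (4080 bytes), the span of the previous run.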

However, for the vast majority of cases, the stride based encoding ends up with
count <= 2, and that kills it in the end.
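
A rough way to see this is with a greedy stride encoder. The sketch below is
an assumption about how such an encoder could work (take the gap to the next
relocation as the stride and extend the run while the gap repeats); it is not
the encoder used for the numbers above, and encode_stride is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Greedy delta+stride+count encoder over sorted relocation offsets.
   Entry layout (assumed): delta << 16 | stride << 8 | count.
   delta is measured from the start of the previous entry.
   Returns the number of entries, or SIZE_MAX on overflow. */
static size_t encode_stride(const uint64_t *offs, size_t n,
                            uint32_t *out, size_t cap)
{
    size_t used = 0, i = 0;
    uint64_t base = 0;

    assert(offs != NULL || n == 0);
    while (i < n) {
        uint64_t delta = offs[i] - base;
        uint64_t stride = 0;
        size_t count = 1;

        /* Use the gap to the next relocation as the stride, if it fits. */
        if (i + 1 < n && offs[i + 1] - offs[i] <= 0xff) {
            stride = offs[i + 1] - offs[i];
            while (i + count < n && count < 0xff &&
                   offs[i + count] - offs[i + count - 1] == stride)
                ++count;
        }

        if (delta > 0xffff || used == cap)
            return SIZE_MAX;
        out[used++] = (uint32_t)(delta << 16 | stride << 8 | count);
        base = offs[i];
        i += count;
    }
    return used;
}
```

On 510 relocations at alternate words starting at offset 0x0ff0, this emits
exactly two entries, both 0x0ff010ff. On an irregular layout such as 0x1000,
0x1008, 0x1020, 0x1028, 0x1050 it needs three entries for five relocations,
none with a count above 2.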

I could try something more complex with 16-bit entries, but that can only give
a 2x improvement at best, so it still won't be better than the bitmap approach.

Thanks,
Rahul

> --
>
>
> Thanks,
> Roland
