thr3ads.net - llvm dev - [llvm-dev] [LLD] Linker Relaxation [Jul 2017]

If this information is useful, please help other people find it:
Share via:

Bruce Hoult via llvm-dev

2017-Jul-11 12:14 UTC

[llvm-dev] [LLD] Linker Relaxation

Here's an example using the gcc toolchain for embedded 32 bit RISC-V (my
HiFive1 board):

#include <stdio.h>

int foo(int i){
    if (i < 100){
        printf("%d\n", i);
    }
    return i;
}

int main(){
    foo(10);
    return 0;
}

After compiling to a .o with -O2 -march=RV32IC we get (just looking at foo)

00000000 <foo>:
   0: 1141                 addi sp,sp,-16
   2: c422                 sw s0,8(sp)
   4: c606                 sw ra,12(sp)
   6: 06300793           li a5,99
   a: 842a                 mv s0,a0
   c: 00a7cb63           blt a5,a0,22 <.L2>
  10: 85aa                 mv a1,a0
  12: 00000537           lui a0,0x0
  16: 00050513           mv a0,a0
  1a: 00000317           auipc t1,0x0
  1e: 000300e7           jalr t1

00000022 <.L2>:
  22: 40b2                 lw ra,12(sp)
  24: 8522                 mv a0,s0
  26: 4422                 lw s0,8(sp)
  28: 0141                 addi sp,sp,16
  2a: 8082                 ret

And after linking:

00010164 <foo>:
   10164:       1141                    addi    sp,sp,-16
   10166:       c422                    sw      s0,8(sp)
   10168:       c606                    sw      ra,12(sp)
   1016a:       06300793                li      a5,99
   1016e:       842a                    mv      s0,a0
   10170:       00a7c863                blt     a5,a0,10180 <foo+0x1c>
   10174:       85aa                    mv      a1,a0
   10176:       0001a537                lui     a0,0x1a
   1017a:       6a050513                addi    a0,a0,1696 # 1a6a0
<__clz_tab+0x100>
   1017e:       2a69                    jal     10318 <printf>
   10180:       40b2                    lw      ra,12(sp)
   10182:       8522                    mv      a0,s0
   10184:       4422                    lw      s0,8(sp)
   10186:       0141                    addi    sp,sp,16
   10188:       8082                    ret

The linker has done quite a lot!

1) the format string address generation has had the LUI (Load Upper
Immediate)
changed from 0x0 to 0x1a (the literal is in flash memory). If it had stayed
at
0x0 it would have been removed by the linker. The mv a0,a0 (which is really
addi a0,a0,#0) has had the real immediate filled in.

2) the call of printf had the general call-anywhere-in-the-address-space
auipc
(Add Upper Immediate to PC); jalr (Jump And Link to address in Register
(plus
offset)) sequence replaced by a simple jal (Jump And Link, with PC +/- 1 MB
range)

3) as the jal offset was in fact less than +/- 2 KB, the 32 bit jal was
replaced by a
16 bit jal instruction.

4) the conditional branch has been shortened from 18 bytes to 12 bytes due
to
the other changes.

On Tue, Jul 11, 2017 at 1:59 PM, Peter Smith via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hello,
>
> To the best of my knowledge I think the closest analogue is something
> like the Synthetic EHFrame and MergeInputSections, not strictly code
> relaxation, but these do involve changes in size of sections.
>
> Can I ask you a quick question?  In many architectures not all
> pc-relative offsets are exposed to the linker as relocations so it
> isn't safe to change the sizes of sections in arbitrary places [*];
> does the RiscV ABI have restrictions on code-generation to allow a
> linker to reduce the code-size of a code-sequence within a Section? If
> there are constraints on the relaxations it might help us make a
> suggestion.
>
> I'm assuming that if you are doing some kind of range based relaxation
> you'll need something like range extension thunks (I'm working on
> these right now) this means you'll probably have to do your
> calculations on what you can relax in finalizeSections() at a similar
> point to createThunks(), or perhaps the mechanisms would need to be
> merged as I think they'll need to converge on a point when no more
> relaxations are possible and no more thunks can be added.
>
> Writing out the relaxed sections will be interesting as you won't want
> all of the InputSectionContents. I suggest looking at EHFrame and
> MergeInputSections for ideas.
>
> Hope that is of some use
>
> Peter
>
> [*] For example in pseudo ARM
>
>     ldr r0, [pc, offset] ; where pc + offset == label
>     ...
>     relaxable sequence such as an indirect jump via a register
>     ...
> label: .word foo
>
> If the compiler/assembler has pre-computed the offset to label then
> changing the size of the relaxable sequence without also updating the
> offset will break the program.
>
>
>
> On 11 July 2017 at 11:09, PkmX via llvm-dev <llvm-dev at
lists.llvm.org>
> wrote:
> > Hi,
> >
> > Does lld support linker relaxation that may shrink code size? As far
> > as I see lld seems to assume that the content of input sections to be
> > fixed other than patching up relocations, but I believe some targets
> > may benefit the extra optimization opportunity with relaxation.
> > Specifically, I'm currently working on adding support for RISC-V
in
> > lld, and RISC-V heavily relies on linker relaxation to remove
> > extraneous code and to handle alignment. Since linker relaxation may
> > be of interest to other targets as well, I'm wondering what would
be a
> > good way to modify lld to support that. Thanks.
> >
> > --
> > Chih-Mao Chen (PkmX)
> > Software R&D, Andes Technology
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170711/24a53171/attachment.html>

Peter Smith via llvm-dev

2017-Jul-11 13:40 UTC

head link

[llvm-dev] [LLD] Linker Relaxation

Thanks for the example, it looks like relaxations can occur at any
point within the section.

With this in mind a couple of implementation approaches come to mind:
1. Add a SyntheticSection, something like RelaxableSection that
supports major rewriting of the InputSection prior to being written to
the output file. The RiscV target will either need to replace
InputSections that have been relaxed with RelaxableSection or create
RelaxableSections directly at InputFile read time on the assumption
that the majority will contain relaxations.
2. Extend InputSection to support relaxation. Currently InputSection
will assume that the whole of contents of the section in the input
file is used, then relocated. On the assumption that most Targets
won't support this form of relaxation I think that this will be a
harder approach to sell than 1.)

I think that regardless of the approaches, you'll need a relaxation
function that is in a similar place or is merged with createThunks().
I don't know enough about RiscV to say if you can get away with doing
all you need to do in one pass, or if you will need multiple passes
with the address of each section recalculated after each relaxation
pass.

It will be worth seeing if the maintainers have any suggestions, and
possibly agreeing a design in principle before going off and trying to
implement something large.

Peter


On 11 July 2017 at 13:14, Bruce Hoult <bruce at hoult.org>
wrote:> Here's an example using the gcc toolchain for embedded 32 bit RISC-V
(my
> HiFive1 board):
>
> #include <stdio.h>
>
> int foo(int i){
>     if (i < 100){
>         printf("%d\n", i);
>     }
>     return i;
> }
>
> int main(){
>     foo(10);
>     return 0;
> }
>
> After compiling to a .o with -O2 -march=RV32IC we get (just looking at foo)
>
> 00000000 <foo>:
>    0: 1141                 addi sp,sp,-16
>    2: c422                 sw s0,8(sp)
>    4: c606                 sw ra,12(sp)
>    6: 06300793           li a5,99
>    a: 842a                 mv s0,a0
>    c: 00a7cb63           blt a5,a0,22 <.L2>
>   10: 85aa                 mv a1,a0
>   12: 00000537           lui a0,0x0
>   16: 00050513           mv a0,a0
>   1a: 00000317           auipc t1,0x0
>   1e: 000300e7           jalr t1
>
> 00000022 <.L2>:
>   22: 40b2                 lw ra,12(sp)
>   24: 8522                 mv a0,s0
>   26: 4422                 lw s0,8(sp)
>   28: 0141                 addi sp,sp,16
>   2a: 8082                 ret
>
> And after linking:
>
> 00010164 <foo>:
>    10164:       1141                    addi    sp,sp,-16
>    10166:       c422                    sw      s0,8(sp)
>    10168:       c606                    sw      ra,12(sp)
>    1016a:       06300793                li      a5,99
>    1016e:       842a                    mv      s0,a0
>    10170:       00a7c863                blt     a5,a0,10180
<foo+0x1c>
>    10174:       85aa                    mv      a1,a0
>    10176:       0001a537                lui     a0,0x1a
>    1017a:       6a050513                addi    a0,a0,1696 # 1a6a0
> <__clz_tab+0x100>
>    1017e:       2a69                    jal     10318 <printf>
>    10180:       40b2                    lw      ra,12(sp)
>    10182:       8522                    mv      a0,s0
>    10184:       4422                    lw      s0,8(sp)
>    10186:       0141                    addi    sp,sp,16
>    10188:       8082                    ret
>
> The linker has done quite a lot!
>
> 1) the format string address generation has had the LUI (Load Upper
> Immediate)
> changed from 0x0 to 0x1a (the literal is in flash memory). If it had stayed
> at
> 0x0 it would have been removed by the linker. The mv a0,a0 (which is really
> addi a0,a0,#0) has had the real immediate filled in.
>
> 2) the call of printf had the general call-anywhere-in-the-address-space
> auipc
> (Add Upper Immediate to PC); jalr (Jump And Link to address in Register
> (plus
> offset)) sequence replaced by a simple jal (Jump And Link, with PC +/- 1 MB
> range)
>
> 3) as the jal offset was in fact less than +/- 2 KB, the 32 bit jal was
> replaced by a
> 16 bit jal instruction.
>
> 4) the conditional branch has been shortened from 18 bytes to 12 bytes due
> to
> the other changes.
>
> On Tue, Jul 11, 2017 at 1:59 PM, Peter Smith via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>>
>> Hello,
>>
>> To the best of my knowledge I think the closest analogue is something
>> like the Synthetic EHFrame and MergeInputSections, not strictly code
>> relaxation, but these do involve changes in size of sections.
>>
>> Can I ask you a quick question?  In many architectures not all
>> pc-relative offsets are exposed to the linker as relocations so it
>> isn't safe to change the sizes of sections in arbitrary places [*];
>> does the RiscV ABI have restrictions on code-generation to allow a
>> linker to reduce the code-size of a code-sequence within a Section? If
>> there are constraints on the relaxations it might help us make a
>> suggestion.
>>
>> I'm assuming that if you are doing some kind of range based
relaxation
>> you'll need something like range extension thunks (I'm working
on
>> these right now) this means you'll probably have to do your
>> calculations on what you can relax in finalizeSections() at a similar
>> point to createThunks(), or perhaps the mechanisms would need to be
>> merged as I think they'll need to converge on a point when no more
>> relaxations are possible and no more thunks can be added.
>>
>> Writing out the relaxed sections will be interesting as you won't
want
>> all of the InputSectionContents. I suggest looking at EHFrame and
>> MergeInputSections for ideas.
>>
>> Hope that is of some use
>>
>> Peter
>>
>> [*] For example in pseudo ARM
>>
>>     ldr r0, [pc, offset] ; where pc + offset == label
>>     ...
>>     relaxable sequence such as an indirect jump via a register
>>     ...
>> label: .word foo
>>
>> If the compiler/assembler has pre-computed the offset to label then
>> changing the size of the relaxable sequence without also updating the
>> offset will break the program.
>>
>>
>>
>> On 11 July 2017 at 11:09, PkmX via llvm-dev <llvm-dev at
lists.llvm.org>
>> wrote:
>> > Hi,
>> >
>> > Does lld support linker relaxation that may shrink code size? As
far
>> > as I see lld seems to assume that the content of input sections to
be
>> > fixed other than patching up relocations, but I believe some
targets
>> > may benefit the extra optimization opportunity with relaxation.
>> > Specifically, I'm currently working on adding support for
RISC-V in
>> > lld, and RISC-V heavily relies on linker relaxation to remove
>> > extraneous code and to handle alignment. Since linker relaxation
may
>> > be of interest to other targets as well, I'm wondering what
would be a
>> > good way to modify lld to support that. Thanks.
>> >
>> > --
>> > Chih-Mao Chen (PkmX)
>> > Software R&D, Andes Technology
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>

Bruce Hoult via llvm-dev

2017-Jul-11 15:02 UTC

head link

[llvm-dev] [LLD] Linker Relaxation

On Tue, Jul 11, 2017 at 4:40 PM, Peter Smith via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Thanks for the example, it looks like relaxations can occur at any
> point within the section.
>
Welcome.

> I think that regardless of the approaches, you'll need a relaxation
> function that is in a similar place or is merged with createThunks().
> I don't know enough about RiscV to say if you can get away with doing
> all you need to do in one pass, or if you will need multiple passes
> with the address of each section recalculated after each relaxation
> pass.
>
Relocation is of course compulsory. Relaxation is optional. You can do zero
passes of it.

So I suppose it comes down to "how likely are you to get significant gains
from a 2nd (or later) relaxation pass"?

There is only one size of conditional branch: +/- 4 KB. In the rare case of
a function being bigger than that the compiler will invert the condition
and branch around an unconditional branch (range: +/- 1 MB) or luipc;jr
pair (range: +/- 2 GB). I don't even know whether the binutils linker will
relax a conditional branch around an unconditional branch. So that's I
suppose the major source of iterated relaxation not being relevent: x86 and
others with +/- 128 byte branches, which really do frequently have
cascading effects.

Relaxed 32 bit instruction function calls have +/- 1 MB range. The chances
of a 2nd round of relaxation tipping more call offsets from just over 1 MB
to just under 1 MB are fairly slim.

I would say that a 2nd pass of relaxation would seldom be profitable enough
to be worthwhile.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170711/96cc08cf/attachment.html>

Rafael Avila de Espindola via llvm-dev

2017-Jul-11 17:41 UTC

head link

[llvm-dev] [LLD] Linker Relaxation

Bruce Hoult via llvm-dev <llvm-dev at lists.llvm.org> writes:
> Here's an example using the gcc toolchain for embedded 32 bit RISC-V
(my
> HiFive1 board):
>
> #include <stdio.h>
>
> int foo(int i){
>     if (i < 100){
>         printf("%d\n", i);
>     }
>     return i;
> }
>
> int main(){
>     foo(10);
>     return 0;
> }
>
> After compiling to a .o with -O2 -march=RV32IC we get (just looking at foo)
>
> 00000000 <foo>:
>    0: 1141                 addi sp,sp,-16
>    2: c422                 sw s0,8(sp)
>    4: c606                 sw ra,12(sp)
>    6: 06300793           li a5,99
>    a: 842a                 mv s0,a0
>    c: 00a7cb63           blt a5,a0,22 <.L2>
>   10: 85aa                 mv a1,a0
>   12: 00000537           lui a0,0x0
>   16: 00050513           mv a0,a0
>   1a: 00000317           auipc t1,0x0
>   1e: 000300e7           jalr t1
>
> 00000022 <.L2>:
This is fascinating.

The code has a fallthrough after the printf call, so .L2 is not an
independent section. But the relocation is still present in the .o:

00000014  00000810 R_RISCV_BRANCH    0000002c   .L2 + 0

At -O0 lld can just create InputSections and everything should work.

But when optimizing lld should create a RelaxableSection as peter
suggested. It can probably piggy back on the thunk creation, but this
goes the other way (things get smaller).

RelaxableSection will need some way of representing the new content
(just read it and update in place?) and a map that given an offset tells
how much space has been saved before that offset. This is similar to
what we do for mapping offsets in merge sections.

Cheers,
Rafael

Rafael Avila de Espindola via llvm-dev

2017-Jul-11 17:57 UTC

head link

[llvm-dev] [LLD] Linker Relaxation

Rafael Avila de Espindola <rafael.espindola at gmail.com> writes:
> This is fascinating.
>
> The code has a fallthrough after the printf call, so .L2 is not an
> independent section. But the relocation is still present in the .o:
>
> 00000014  00000810 R_RISCV_BRANCH    0000002c   .L2 + 0
>
> At -O0 lld can just create InputSections and everything should work.
>
> But when optimizing lld should create a RelaxableSection as peter
> suggested. It can probably piggy back on the thunk creation, but this
> goes the other way (things get smaller).
Assuming that

* The contents are still read lazily.
* The offset mapping is really fast in the case of a section with no
  relaxations.
* We don't go looking for relaxations in sections that don't need it.

We could probably just add the feature to InputSection, but some
benchmarking is required.

Cheers,
Rafael

Rui Ueyama via llvm-dev

2017-Jul-11 18:21 UTC

head link

[llvm-dev] [LLD] Linker Relaxation

On Tue, Jul 11, 2017 at 9:14 PM, Bruce Hoult via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Here's an example using the gcc toolchain for embedded 32 bit RISC-V
(my
> HiFive1 board):
>
> #include <stdio.h>
>
> int foo(int i){
>     if (i < 100){
>         printf("%d\n", i);
>     }
>     return i;
> }
>
> int main(){
>     foo(10);
>     return 0;
> }
>
> After compiling to a .o with -O2 -march=RV32IC we get (just looking at foo)
>
> 00000000 <foo>:
>    0: 1141                 addi sp,sp,-16
>    2: c422                 sw s0,8(sp)
>    4: c606                 sw ra,12(sp)
>    6: 06300793           li a5,99
>    a: 842a                 mv s0,a0
>    c: 00a7cb63           blt a5,a0,22 <.L2>
>   10: 85aa                 mv a1,a0
>   12: 00000537           lui a0,0x0
>   16: 00050513           mv a0,a0
>   1a: 00000317           auipc t1,0x0
>   1e: 000300e7           jalr t1
>
> 00000022 <.L2>:
>   22: 40b2                 lw ra,12(sp)
>   24: 8522                 mv a0,s0
>   26: 4422                 lw s0,8(sp)
>   28: 0141                 addi sp,sp,16
>   2a: 8082                 ret
>
> And after linking:
>
> 00010164 <foo>:
>    10164:       1141                    addi    sp,sp,-16
>    10166:       c422                    sw      s0,8(sp)
>    10168:       c606                    sw      ra,12(sp)
>    1016a:       06300793                li      a5,99
>    1016e:       842a                    mv      s0,a0
>    10170:       00a7c863                blt     a5,a0,10180
<foo+0x1c>
>    10174:       85aa                    mv      a1,a0
>    10176:       0001a537                lui     a0,0x1a
>    1017a:       6a050513                addi    a0,a0,1696 # 1a6a0
> <__clz_tab+0x100>
>    1017e:       2a69                    jal     10318 <printf>
>    10180:       40b2                    lw      ra,12(sp)
>    10182:       8522                    mv      a0,s0
>    10184:       4422                    lw      s0,8(sp)
>    10186:       0141                    addi    sp,sp,16
>    10188:       8082                    ret
>
> The linker has done quite a lot!
>
> 1) the format string address generation has had the LUI (Load Upper
> Immediate)
> changed from 0x0 to 0x1a (the literal is in flash memory). If it had
> stayed at
> 0x0 it would have been removed by the linker. The mv a0,a0 (which is really
> addi a0,a0,#0) has had the real immediate filled in.
>
> 2) the call of printf had the general call-anywhere-in-the-address-space
> auipc
> (Add Upper Immediate to PC); jalr (Jump And Link to address in Register
> (plus
> offset)) sequence replaced by a simple jal (Jump And Link, with PC +/- 1
> MB range)
>
> 3) as the jal offset was in fact less than +/- 2 KB, the 32 bit jal was
> replaced by a
> 16 bit jal instruction.
>
> 4) the conditional branch has been shortened from 18 bytes to 12 bytes due
> to
> the other changes.
>
Thanks, Bruce. This is a very interesting optimization.

lld doesn't currently have code to support that kind of code shrinking
optimization, but we can definitely add it. It seems that essentially we
need to iterate over all relocations while rewriting instructions until a
convergence is obtained. But the point is how to do do it efficiently --
link speed really matters. I can't come up with an algorithm to parallelize
it. Do you have any idea?

In order to shrink instructions, all address references must be explicitly
represented as relocations even if they are in the same section. I think
that means object files for RISC-V have many more relocations than the
other architectures. Is this correct?

>
> On Tue, Jul 11, 2017 at 1:59 PM, Peter Smith via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hello,
>>
>> To the best of my knowledge I think the closest analogue is something
>> like the Synthetic EHFrame and MergeInputSections, not strictly code
>> relaxation, but these do involve changes in size of sections.
>>
>> Can I ask you a quick question?  In many architectures not all
>> pc-relative offsets are exposed to the linker as relocations so it
>> isn't safe to change the sizes of sections in arbitrary places [*];
>> does the RiscV ABI have restrictions on code-generation to allow a
>> linker to reduce the code-size of a code-sequence within a Section? If
>> there are constraints on the relaxations it might help us make a
>> suggestion.
>>
>> I'm assuming that if you are doing some kind of range based
relaxation
>> you'll need something like range extension thunks (I'm working
on
>> these right now) this means you'll probably have to do your
>> calculations on what you can relax in finalizeSections() at a similar
>> point to createThunks(), or perhaps the mechanisms would need to be
>> merged as I think they'll need to converge on a point when no more
>> relaxations are possible and no more thunks can be added.
>>
>> Writing out the relaxed sections will be interesting as you won't
want
>> all of the InputSectionContents. I suggest looking at EHFrame and
>> MergeInputSections for ideas.
>>
>> Hope that is of some use
>>
>> Peter
>>
>> [*] For example in pseudo ARM
>>
>>     ldr r0, [pc, offset] ; where pc + offset == label
>>     ...
>>     relaxable sequence such as an indirect jump via a register
>>     ...
>> label: .word foo
>>
>> If the compiler/assembler has pre-computed the offset to label then
>> changing the size of the relaxable sequence without also updating the
>> offset will break the program.
>>
>>
>>
>> On 11 July 2017 at 11:09, PkmX via llvm-dev <llvm-dev at
lists.llvm.org>
>> wrote:
>> > Hi,
>> >
>> > Does lld support linker relaxation that may shrink code size? As
far
>> > as I see lld seems to assume that the content of input sections to
be
>> > fixed other than patching up relocations, but I believe some
targets
>> > may benefit the extra optimization opportunity with relaxation.
>> > Specifically, I'm currently working on adding support for
RISC-V in
>> > lld, and RISC-V heavily relies on linker relaxation to remove
>> > extraneous code and to handle alignment. Since linker relaxation
may
>> > be of interest to other targets as well, I'm wondering what
would be a
>> > good way to modify lld to support that. Thanks.
>> >
>> > --
>> > Chih-Mao Chen (PkmX)
>> > Software R&D, Andes Technology
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > llvm-dev at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170712/e84bd6c6/attachment.html>

Rui Ueyama via llvm-dev

2017-Jul-11 18:27 UTC

head link

[llvm-dev] [LLD] Linker Relaxation

By the way, since this is an optional code relaxation, we can think about
it later. The first thing I would do is to add RISC-V support to lld
without code shrinking relaxations, which I believe is doable by at most a
few hundreds lines of code.

On Wed, Jul 12, 2017 at 3:21 AM, Rui Ueyama <ruiu at google.com> wrote:
> On Tue, Jul 11, 2017 at 9:14 PM, Bruce Hoult via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Here's an example using the gcc toolchain for embedded 32 bit
RISC-V (my
>> HiFive1 board):
>>
>> #include <stdio.h>
>>
>> int foo(int i){
>>     if (i < 100){
>>         printf("%d\n", i);
>>     }
>>     return i;
>> }
>>
>> int main(){
>>     foo(10);
>>     return 0;
>> }
>>
>> After compiling to a .o with -O2 -march=RV32IC we get (just looking at
>> foo)
>>
>> 00000000 <foo>:
>>    0: 1141                 addi sp,sp,-16
>>    2: c422                 sw s0,8(sp)
>>    4: c606                 sw ra,12(sp)
>>    6: 06300793           li a5,99
>>    a: 842a                 mv s0,a0
>>    c: 00a7cb63           blt a5,a0,22 <.L2>
>>   10: 85aa                 mv a1,a0
>>   12: 00000537           lui a0,0x0
>>   16: 00050513           mv a0,a0
>>   1a: 00000317           auipc t1,0x0
>>   1e: 000300e7           jalr t1
>>
>> 00000022 <.L2>:
>>   22: 40b2                 lw ra,12(sp)
>>   24: 8522                 mv a0,s0
>>   26: 4422                 lw s0,8(sp)
>>   28: 0141                 addi sp,sp,16
>>   2a: 8082                 ret
>>
>> And after linking:
>>
>> 00010164 <foo>:
>>    10164:       1141                    addi    sp,sp,-16
>>    10166:       c422                    sw      s0,8(sp)
>>    10168:       c606                    sw      ra,12(sp)
>>    1016a:       06300793                li      a5,99
>>    1016e:       842a                    mv      s0,a0
>>    10170:       00a7c863                blt     a5,a0,10180
<foo+0x1c>
>>    10174:       85aa                    mv      a1,a0
>>    10176:       0001a537                lui     a0,0x1a
>>    1017a:       6a050513                addi    a0,a0,1696 # 1a6a0
>> <__clz_tab+0x100>
>>    1017e:       2a69                    jal     10318 <printf>
>>    10180:       40b2                    lw      ra,12(sp)
>>    10182:       8522                    mv      a0,s0
>>    10184:       4422                    lw      s0,8(sp)
>>    10186:       0141                    addi    sp,sp,16
>>    10188:       8082                    ret
>>
>> The linker has done quite a lot!
>>
>> 1) the format string address generation has had the LUI (Load Upper
>> Immediate)
>> changed from 0x0 to 0x1a (the literal is in flash memory). If it had
>> stayed at
>> 0x0 it would have been removed by the linker. The mv a0,a0 (which is
>> really
>> addi a0,a0,#0) has had the real immediate filled in.
>>
>> 2) the call of printf had the general
call-anywhere-in-the-address-space
>> auipc
>> (Add Upper Immediate to PC); jalr (Jump And Link to address in Register
>> (plus
>> offset)) sequence replaced by a simple jal (Jump And Link, with PC +/-
1
>> MB range)
>>
>> 3) as the jal offset was in fact less than +/- 2 KB, the 32 bit jal was
>> replaced by a
>> 16 bit jal instruction.
>>
>> 4) the conditional branch has been shortened from 18 bytes to 12 bytes
>> due to
>> the other changes.
>>
>
> Thanks, Bruce. This is a very interesting optimization.
>
> lld doesn't currently have code to support that kind of code shrinking
> optimization, but we can definitely add it. It seems that essentially we
> need to iterate over all relocations while rewriting instructions until a
> convergence is obtained. But the point is how to do do it efficiently --
> link speed really matters. I can't come up with an algorithm to
parallelize
> it. Do you have any idea?
>
> In order to shrink instructions, all address references must be explicitly
> represented as relocations even if they are in the same section. I think
> that means object files for RISC-V have many more relocations than the
> other architectures. Is this correct?
>
>
>>
>
>> On Tue, Jul 11, 2017 at 1:59 PM, Peter Smith via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Hello,
>>>
>>> To the best of my knowledge I think the closest analogue is
something
>>> like the Synthetic EHFrame and MergeInputSections, not strictly
code
>>> relaxation, but these do involve changes in size of sections.
>>>
>>> Can I ask you a quick question?  In many architectures not all
>>> pc-relative offsets are exposed to the linker as relocations so it
>>> isn't safe to change the sizes of sections in arbitrary places
[*];
>>> does the RiscV ABI have restrictions on code-generation to allow a
>>> linker to reduce the code-size of a code-sequence within a Section?
If
>>> there are constraints on the relaxations it might help us make a
>>> suggestion.
>>>
>>> I'm assuming that if you are doing some kind of range based
relaxation
>>> you'll need something like range extension thunks (I'm
working on
>>> these right now) this means you'll probably have to do your
>>> calculations on what you can relax in finalizeSections() at a
similar
>>> point to createThunks(), or perhaps the mechanisms would need to be
>>> merged as I think they'll need to converge on a point when no
more
>>> relaxations are possible and no more thunks can be added.
>>>
>>> Writing out the relaxed sections will be interesting as you
won't want
>>> all of the InputSectionContents. I suggest looking at EHFrame and
>>> MergeInputSections for ideas.
>>>
>>> Hope that is of some use
>>>
>>> Peter
>>>
>>> [*] For example in pseudo ARM
>>>
>>>     ldr r0, [pc, offset] ; where pc + offset == label
>>>     ...
>>>     relaxable sequence such as an indirect jump via a register
>>>     ...
>>> label: .word foo
>>>
>>> If the compiler/assembler has pre-computed the offset to label then
>>> changing the size of the relaxable sequence without also updating
the
>>> offset will break the program.
>>>
>>>
>>>
>>> On 11 July 2017 at 11:09, PkmX via llvm-dev <llvm-dev at
lists.llvm.org>
>>> wrote:
>>> > Hi,
>>> >
>>> > Does lld support linker relaxation that may shrink code size?
As far
>>> > as I see lld seems to assume that the content of input
sections to be
>>> > fixed other than patching up relocations, but I believe some
targets
>>> > may benefit the extra optimization opportunity with
relaxation.
>>> > Specifically, I'm currently working on adding support for
RISC-V in
>>> > lld, and RISC-V heavily relies on linker relaxation to remove
>>> > extraneous code and to handle alignment. Since linker
relaxation may
>>> > be of interest to other targets as well, I'm wondering
what would be a
>>> > good way to modify lld to support that. Thanks.
>>> >
>>> > --
>>> > Chih-Mao Chen (PkmX)
>>> > Software R&D, Andes Technology
>>> > _______________________________________________
>>> > LLVM Developers mailing list
>>> > llvm-dev at lists.llvm.org
>>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170712/8b239801/attachment.html>

Sean Silva via llvm-dev

2017-Jul-12 02:48 UTC

head link

[llvm-dev] [LLD] Linker Relaxation

On Tue, Jul 11, 2017 at 11:21 AM, Rui Ueyama via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On Tue, Jul 11, 2017 at 9:14 PM, Bruce Hoult via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Here's an example using the gcc toolchain for embedded 32 bit
RISC-V (my
>> HiFive1 board):
>>
>> #include <stdio.h>
>>
>> int foo(int i){
>>     if (i < 100){
>>         printf("%d\n", i);
>>     }
>>     return i;
>> }
>>
>> int main(){
>>     foo(10);
>>     return 0;
>> }
>>
>> After compiling to a .o with -O2 -march=RV32IC we get (just looking at
>> foo)
>>
>> 00000000 <foo>:
>>    0: 1141                 addi sp,sp,-16
>>    2: c422                 sw s0,8(sp)
>>    4: c606                 sw ra,12(sp)
>>    6: 06300793           li a5,99
>>    a: 842a                 mv s0,a0
>>    c: 00a7cb63           blt a5,a0,22 <.L2>
>>   10: 85aa                 mv a1,a0
>>   12: 00000537           lui a0,0x0
>>   16: 00050513           mv a0,a0
>>   1a: 00000317           auipc t1,0x0
>>   1e: 000300e7           jalr t1
>>
>> 00000022 <.L2>:
>>   22: 40b2                 lw ra,12(sp)
>>   24: 8522                 mv a0,s0
>>   26: 4422                 lw s0,8(sp)
>>   28: 0141                 addi sp,sp,16
>>   2a: 8082                 ret
>>
>> And after linking:
>>
>> 00010164 <foo>:
>>    10164:       1141                    addi    sp,sp,-16
>>    10166:       c422                    sw      s0,8(sp)
>>    10168:       c606                    sw      ra,12(sp)
>>    1016a:       06300793                li      a5,99
>>    1016e:       842a                    mv      s0,a0
>>    10170:       00a7c863                blt     a5,a0,10180
<foo+0x1c>
>>    10174:       85aa                    mv      a1,a0
>>    10176:       0001a537                lui     a0,0x1a
>>    1017a:       6a050513                addi    a0,a0,1696 # 1a6a0
>> <__clz_tab+0x100>
>>    1017e:       2a69                    jal     10318 <printf>
>>    10180:       40b2                    lw      ra,12(sp)
>>    10182:       8522                    mv      a0,s0
>>    10184:       4422                    lw      s0,8(sp)
>>    10186:       0141                    addi    sp,sp,16
>>    10188:       8082                    ret
>>
>> The linker has done quite a lot!
>>
>> 1) the format string address generation has had the LUI (Load Upper
>> Immediate)
>> changed from 0x0 to 0x1a (the literal is in flash memory). If it had
>> stayed at
>> 0x0 it would have been removed by the linker. The mv a0,a0 (which is
>> really
>> addi a0,a0,#0) has had the real immediate filled in.
>>
>> 2) the call of printf had the general
call-anywhere-in-the-address-space
>> auipc
>> (Add Upper Immediate to PC); jalr (Jump And Link to address in Register
>> (plus
>> offset)) sequence replaced by a simple jal (Jump And Link, with PC +/-
1
>> MB range)
>>
>> 3) as the jal offset was in fact less than +/- 2 KB, the 32 bit jal was
>> replaced by a
>> 16 bit jal instruction.
>>
>> 4) the conditional branch has been shortened from 18 bytes to 12 bytes
>> due to
>> the other changes.
>>
>
> Thanks, Bruce. This is a very interesting optimization.
>
> lld doesn't currently have code to support that kind of code shrinking
> optimization, but we can definitely add it. It seems that essentially we
> need to iterate over all relocations while rewriting instructions until a
> convergence is obtained. But the point is how to do do it efficiently --
> link speed really matters. I can't come up with an algorithm to
parallelize
> it. Do you have any idea?
>

Essentially, you loop:
1. create a prospective address assignment / layout (basically a cumulative
sum operation on the section sizes)
2. visit all relocations in parallel, modifying the relocation / content
for eligible relocations. Practically, input sections would be visited in
parallel (instead of relocations) so that the changes to the input section
content are done serially with a single linear scan that avoids doing
O(input section size) work per modification of the section content.

To terminate the algorithm, you stop after step 1 and just calculate the
exact offsets for the layout.

Note that since offsets can only be decreased by step 2, modifications to
relocations/content always remain valid.


Parallelizing cumulative sum type calculations are well-studied in the
context of GPU's and hardware reduction trees. See e.g.
https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html


Note that we could in theory visit the relocations in parallel in step 2,
but then the changes to section's content would require a more complicated
algorithm, with each relaxation producing a size delta value that is then
accumulated with a cumulative sum type operation to coalesce the input
section and remove dead space. This type of pattern is common on GPU's;
e.g. AMD GPUs have a special instruction ds_ordered_count which is
optimized for precisely this operation; e.g. slide 42 of
https://frostbite-wp-prd.s3.amazonaws.com/wp-content/uploads/2016/03/29204330/GDC_2016_Compute.pdf

-- Sean Silva

>
> In order to shrink instructions, all address references must be explicitly
> represented as relocations even if they are in the same section. I think
> that means object files for RISC-V have many more relocations than the
> other architectures. Is this correct?
>
>
>>
>
>> On Tue, Jul 11, 2017 at 1:59 PM, Peter Smith via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Hello,
>>>
>>> To the best of my knowledge I think the closest analogue is
something
>>> like the Synthetic EHFrame and MergeInputSections, not strictly
code
>>> relaxation, but these do involve changes in size of sections.
>>>
>>> Can I ask you a quick question?  In many architectures not all
>>> pc-relative offsets are exposed to the linker as relocations so it
>>> isn't safe to change the sizes of sections in arbitrary places
[*];
>>> does the RiscV ABI have restrictions on code-generation to allow a
>>> linker to reduce the code-size of a code-sequence within a Section?
If
>>> there are constraints on the relaxations it might help us make a
>>> suggestion.
>>>
>>> I'm assuming that if you are doing some kind of range based
relaxation
>>> you'll need something like range extension thunks (I'm
working on
>>> these right now) this means you'll probably have to do your
>>> calculations on what you can relax in finalizeSections() at a
similar
>>> point to createThunks(), or perhaps the mechanisms would need to be
>>> merged as I think they'll need to converge on a point when no
more
>>> relaxations are possible and no more thunks can be added.
>>>
>>> Writing out the relaxed sections will be interesting as you
won't want
>>> all of the InputSectionContents. I suggest looking at EHFrame and
>>> MergeInputSections for ideas.
>>>
>>> Hope that is of some use
>>>
>>> Peter
>>>
>>> [*] For example in pseudo ARM
>>>
>>>     ldr r0, [pc, offset] ; where pc + offset == label
>>>     ...
>>>     relaxable sequence such as an indirect jump via a register
>>>     ...
>>> label: .word foo
>>>
>>> If the compiler/assembler has pre-computed the offset to label then
>>> changing the size of the relaxable sequence without also updating
the
>>> offset will break the program.
>>>
>>>
>>>
>>> On 11 July 2017 at 11:09, PkmX via llvm-dev <llvm-dev at
lists.llvm.org>
>>> wrote:
>>> > Hi,
>>> >
>>> > Does lld support linker relaxation that may shrink code size?
As far
>>> > as I see lld seems to assume that the content of input
sections to be
>>> > fixed other than patching up relocations, but I believe some
targets
>>> > may benefit the extra optimization opportunity with
relaxation.
>>> > Specifically, I'm currently working on adding support for
RISC-V in
>>> > lld, and RISC-V heavily relies on linker relaxation to remove
>>> > extraneous code and to handle alignment. Since linker
relaxation may
>>> > be of interest to other targets as well, I'm wondering
what would be a
>>> > good way to modify lld to support that. Thanks.
>>> >
>>> > --
>>> > Chih-Mao Chen (PkmX)
>>> > Software R&D, Andes Technology
>>> > _______________________________________________
>>> > LLVM Developers mailing list
>>> > llvm-dev at lists.llvm.org
>>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170711/791d09d2/attachment-0001.html>

PkmX via llvm-dev

2017-Jul-12 07:26 UTC

head link

[llvm-dev] [LLD] Linker Relaxation

Hi,

On Wed, Jul 12, 2017 at 2:21 AM, Rui Ueyama via llvm-dev
<llvm-dev at lists.llvm.org> wrote:>
> Thanks, Bruce. This is a very interesting optimization.
>
> lld doesn't currently have code to support that kind of code shrinking
> optimization, but we can definitely add it. It seems that essentially we
> need to iterate over all relocations while rewriting instructions until a
> convergence is obtained. But the point is how to do do it efficiently --
> link speed really matters. I can't come up with an algorithm to
parallelize
> it. Do you have any idea?
>
> In order to shrink instructions, all address references must be explicitly
> represented as relocations even if they are in the same section. I think
> that means object files for RISC-V have many more relocations than the
other
> architectures. Is this correct?
Indeed. RISC-V would need to emit relocations for PC-relative offsets
otherwise those offsets will become incorrect after relaxation.

On Wed, Jul 12, 2017 at 2:27 AM, Rui Ueyama via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> By the way, since this is an optional code relaxation, we can think about
it
> later. The first thing I would do is to add RISC-V support to lld without
> code shrinking relaxations, which I believe is doable by at most a few
> hundreds lines of code.
Yes, we have a working target for RISC-V in lld now (with relaxation)
and is passing our internal tests. Our iterated relaxation currently
makes a copy of each input section, loops over each of them to process
relaxation and then adjust symbol address and relocation entries
accordingly, just before they are written out. This works but isn't
optimal. Since we intend to contribute this target back to upstream
later on, we'd like to discuss how this should be properly handled.

Note that RISC-V also handles alignment as part of relaxation, so it
isn't really optional. For example:

_start:
    mv      a0, a0
    .p2align 2
    li      a0, 0

The assembler inserts a 3-byte padding (note: this behavior isn't
merged yet, see: https://github.com/riscv/riscv-binutils-gdb/pull/88):

00000000 <_start>:
   0:   852a                    mv      a0,a0
   2:   00 01 00                        # R_RISCV_ALIGN
                        2: R_RISCV_ALIGN        *ABS*+0x3
   5:   4501                    li      a0,0

The linker then remove 1 byte from padding to align to the desired width:

00010054 <_start>:
   10054:       852a                    mv      a0,a0
   10056:       0001                    nop
   10058:       4501                    li      a0,0

This essentially shrinks code size and must be performed as RISC-V
instructions must be 2-byte aligned. Therefore lld must be able to
accommodate changes of content in an input section.

Chih-Mao Chen (PkmX)
Software R&D, Andes Technology

Apparently Analagous Threads

Search for more seemingly similar threads

llvm dev - Jul 2017 - [LLD] Linker Relaxation

[llvm-dev] [LLD] Linker Relaxation

[llvm-dev] [LLD] Linker Relaxation

[llvm-dev] [LLD] Linker Relaxation

[llvm-dev] [LLD] Linker Relaxation

[llvm-dev] [LLD] Linker Relaxation

[llvm-dev] [LLD] Linker Relaxation

[llvm-dev] [LLD] Linker Relaxation

[llvm-dev] [LLD] Linker Relaxation

[llvm-dev] [LLD] Linker Relaxation

Apparently Analagous Threads