Hello Rui, Thanks for the comments - Synthetic sections and rewriting relocations I think that this would definitely be worth trying. It should remove the need for thunks to be represented in the core data structures, and would allow . It would also mean that we wouldn't have to associate symbols with thunks as the relocations would directly target the thunks. ARM interworking makes reusing thunks more difficult as not every thunk is compatible with every caller. For example: ARM B target and Thumb2 B.W target can't reuse the same thunk even if in range as the branch instruction can't change state. I think it is worth an experiment to make the existing implementation of thunks use synthetic sections and rewriting relocations before trying to implement range extension thunks. - Yes the scan is linear it is essentially: do assign addresses to input sections for each relocation if (thunk needed) create thunk or reuse existing one while (no more thunks added) There's quite a lot of complexity that can be added with respect to the placement of thunks within the output section. For example if there is a caller with a low address and a caller with a high address, both might be able to reuse a thunk placed in the middle. I think it is worth starting simple though. Peter On 5 January 2017 at 09:52, Rui Ueyama <ruiu at google.com> wrote:> Hi Peter, > > Here are my comments: > > - I didn't think hard enough, but I believe creating thunks as synthetic > sections instead of attached data for other input sections is towards a > right direction, because synthetic sections are suitable for adding > linker-generated data to output files. > > - As you wrote, we need to iterate relocations at least twice to create > range extension thunks. Each iteration can be a linear scan, correct? I > mean, we can start from the section at the lowest address towards higher > address examining relocations and create thunks if targets are too far. > > - I do not see a reason that we need to associate range extension thunks to > symbols. It seems to me that while scanning relocations, we need to keep > only the last thunk address for each symbol. If we find that some relocation > against symbol S needs a range extension thunk, we first check if the last > thunk for S is within the range and reuse it if it is. In this way, we need > to keep only one thunk for one symbol at any moment. > > - Have you considered rewriting relocations? I think, if we find that > relocation R pointing to symbol S needs a range extension thunk, we should > (1) create a range extension thunk, (2) create a symbol body object S' for > the thunk, (3) and rewrite R to point to S' instead of S. Then later passes > don't have to deal with thunks. > > > On Thu, Jan 5, 2017 at 3:34 AM, Peter Smith <peter.smith at linaro.org> wrote: >> >> I'm about to start working on range extension thunks in lld. This is >> an attempt to summarize the approach I'd like to take and what the >> impact will be on lld outside of thunks. I'm interested if anyone has >> any constraints the approach will break, alternative suggestions, or >> is working on something I'll need to take account of? >> >> I expect range extension thunks to be important for ARM and useful for >> AArch64. In principle any target with a limited range branch immediate >> instruction could benefit from them. >> >> I've put some detail about range extension thunks at the end of this >> message for those not familiar with them. The name range extension >> thunks is by no means universal, for example in ARM's ABI >> documentation they are called veneers, The GNU linker calls them >> stubs. >> >> Summary of existing thunk implementation (ARM interworking and Mips >> PIC to non-PIC calls): >> - A Regular, Shared or Undefined symbol may have a single thunk >> - For each relocation to a symbol S, if we need a thunk we use the >> thunk for S as the target of the relocation rather than S. The thunk >> will transfer control to S. >> - Thunks are assigned to an InputSection, these are written out when >> the InputSection is written. The InputSection with the Thunk contains >> either the caller (ARM) or callee (Mips). >> - For all existing thunks, the decision of whether a thunk is needed >> is not dependent on address. A Thumb branch to ARM will always need a >> thunk, no matter the distance. Thunks can therefore be generated >> relatively early in Writer::run(). >> >> High level impact of range extension thunks: >> >> There may be more than one than one thunk per callee: >> - A range extension thunk must be placed within range of the caller, >> there may be cases where no single thunk for a callee is in range of >> all callers. >> - An ARM Target may need a different Thunk for ARM and Thumb callers. >> >> Address information is needed to determine if a range extension thunk is >> needed: >> - The more precise the address information available the less thunks >> will be generated. the most precise address information is the final >> address of caller and callee is known at thunk creation time, the >> least precise is neither the address of the caller or callee is known. >> >> Range extension thunks can be combined or replace other thunks >> - Thunks may also be used for instruction set interworking (ARM) or >> for calling between position independent and non-position independent >> code (Mips). Either a chain of thunks or a combined thunk that does >> both operations is needed. For ARM all range extension thunks can >> trivially be interworking thunks as well. >> >> Range extension thunk placement can be important >> - Many callers may need a range extension. Placing a range extension >> thunk so that it is in range of the most callers minimizes number of >> thunks needed. >> - Thunks may be better as synthetic sections rather than as additions >> to input sections. >> >> Adding/removing content must not break the range calculations used in >> range extension thunks. >> - If any caller, callee or thunk address is changed after range >> extension thunks are calculated it could invalidate the range >> calculation. >> - Ideally range extension thunks are the last operation the linker >> does prior to resolving relocations. >> >> I think that there are two separate areas to a range extension thunk >> implementation that can be considered separately. >> 1.) Moving thunk generation to a later stage, at a minimum we need an >> estimation of the address of each caller and callee, in an ideal world >> we know the final address of each caller and callee. This could mean >> assigning section addresses multiple times. >> 2.) The alterations to the core data structures to permit more than >> one Thunk per symbol and the logic to select the "right" Thunk for >> each relocation. >> >> The design I'd like to aim at moves thunk creation into >> finalizeSections() at a point where the sizes and addresses of all the >> SyntheticSections are known. This would mean that the final address of >> each caller and callee could be used, and after thunk creation there >> would be no further content changes. This would mean: >> - All code that runs prior to thunk creation may have the offset in >> the OutputSection altered by the addition of thunks. In particular >> scanRelocs() calculates the offset in the OutputSection of some >> relocations. We would need to find alternative ways of handling these >> cases so that they could either survive thunk creation or be patched >> up afterwards. >> - assignAddresses() will need to run at least twice if thunks are >> created. At least once to give the thunk creation the caller and >> callee addresses, and at least once after all thunks have been >> created. >> >> There is an alternative design that only uses estimates of caller and >> callee address to decide if a thunk is needed. In effect we use a >> heuristic to predict how much extra synthetic content, such as plt and >> got size, will be added after Thunk creation when deciding if a Thunk >> is needed. I'm not in favour of this approach as from bitter >> experience it tends to result in hard to debug problems when the >> heuristics break down. Precise addresses would also allow errata >> patching thunks [*] >> >> I've not thought too hard about how to alter the core data structures >> yet. I think this will mostly be implementation detail though. >> >> Next steps: >> I'd like to proceed with the following plan: >> 1.) Move the existing thunk implementation to where it would need to >> be in finalizeSections(). This should flush out all the non-thunk >> related assumptions about addresses without adding any existing >> complexity to the Thunk implementation. >> 2.) Add support for multiple thunks per symbol >> 3.) If it turns out to be a good idea, implement thunks as >> SyntheticSections >> 4.) Add support for range extensions. >> >> I think the first implementation of range extension thunks should be >> simple and not try too hard to minimize the number of thunks needed. >> If there is a need to optimize it can be done later as the changes >> should be within the thunk creation module. >> >> Thanks for reading >> >> Peter >> >> The remainder of the message is a brief explanation of range extension >> and errata patching thunks. >> >> What are range extension thunks? >> Many architectures have branch instructions that have a finite range >> that is insufficient to reach all possible program locations. For >> example the ARM branch immediate instruction has an immediate that >> encodes an offset of +-32Mb from the branch instruction. A range >> extension thunk is a linker generated code sequence, inserted between >> the caller and the callee, that completes the transfer of control to >> the callee when the distance between the caller and callee exceeds the >> range of the branch instruction. A simple example in psuedo assembly >> for a non-position independent ARM function call. >> >> source: >> BL long_range_thunk_to_target >> ... >> long_range_thunk_to_target >> LDR r12, target_address ; r12 is the corruptible interprocedural >> scratch register (ip) >> BX r12 >> target_address: >> .word target ; >> ... >> target: >> ... >> >> What is an errata patching thunk? >> Some CPU errata (hardware bugs) can be fixed at a link time by >> replacing an instruction with a branch to a sequence of equivalent >> instructions that are guaranteed to to not trigger the erratum. In >> some cases the trigger sequence is dependent on precise addresses such >> as immediates crossing page boundaries, for example >> https://sourceware.org/ml/binutils-cvs/2015-04/msg00012.html . Errata >> patching is out of the scope of implementing range extension thunks >> but can be seen as a generalization of it. > >
On Thu, Jan 5, 2017 at 8:15 PM, Peter Smith <peter.smith at linaro.org> wrote:> Hello Rui, > > Thanks for the comments > > - Synthetic sections and rewriting relocations > I think that this would definitely be worth trying. It should remove > the need for thunks to be represented in the core data structures, and > would allow . >Creating symbols for thunks would have another benefit: it makes disassembled output easier to read because thunks have names.> It would also mean that we wouldn't have to associate symbols with > thunks as the relocations would directly target the thunks. ARM > interworking makes reusing thunks more difficult as not every thunk is > compatible with every caller. For example: > ARM B target and Thumb2 B.W target can't reuse the same thunk even if > in range as the branch instruction can't change state. > > I think it is worth an experiment to make the existing implementation > of thunks use synthetic sections and rewriting relocations before > trying to implement range extension thunks. > > - Yes the scan is linear it is essentially: > do > assign addresses to input sections > for each relocation > if (thunk needed) > create thunk or reuse existing one > while (no more thunks added) > > There's quite a lot of complexity that can be added with respect to > the placement of thunks within the output section. For example if > there is a caller with a low address and a caller with a high address, > both might be able to reuse a thunk placed in the middle. I think it > is worth starting simple though.I agree. I believe that computing the best thunk positions is NP-hard, but the best layout and a layout produced by a naive algorithm wouldn't be that different.> Peter > > > On 5 January 2017 at 09:52, Rui Ueyama <ruiu at google.com> wrote: > > Hi Peter, > > > > Here are my comments: > > > > - I didn't think hard enough, but I believe creating thunks as synthetic > > sections instead of attached data for other input sections is towards a > > right direction, because synthetic sections are suitable for adding > > linker-generated data to output files. > > > > - As you wrote, we need to iterate relocations at least twice to create > > range extension thunks. Each iteration can be a linear scan, correct? I > > mean, we can start from the section at the lowest address towards higher > > address examining relocations and create thunks if targets are too far. > > > > - I do not see a reason that we need to associate range extension thunks > to > > symbols. It seems to me that while scanning relocations, we need to keep > > only the last thunk address for each symbol. If we find that some > relocation > > against symbol S needs a range extension thunk, we first check if the > last > > thunk for S is within the range and reuse it if it is. In this way, we > need > > to keep only one thunk for one symbol at any moment. > > > > - Have you considered rewriting relocations? I think, if we find that > > relocation R pointing to symbol S needs a range extension thunk, we > should > > (1) create a range extension thunk, (2) create a symbol body object S' > for > > the thunk, (3) and rewrite R to point to S' instead of S. Then later > passes > > don't have to deal with thunks. > > > > > > On Thu, Jan 5, 2017 at 3:34 AM, Peter Smith <peter.smith at linaro.org> > wrote: > >> > >> I'm about to start working on range extension thunks in lld. This is > >> an attempt to summarize the approach I'd like to take and what the > >> impact will be on lld outside of thunks. I'm interested if anyone has > >> any constraints the approach will break, alternative suggestions, or > >> is working on something I'll need to take account of? > >> > >> I expect range extension thunks to be important for ARM and useful for > >> AArch64. In principle any target with a limited range branch immediate > >> instruction could benefit from them. > >> > >> I've put some detail about range extension thunks at the end of this > >> message for those not familiar with them. The name range extension > >> thunks is by no means universal, for example in ARM's ABI > >> documentation they are called veneers, The GNU linker calls them > >> stubs. > >> > >> Summary of existing thunk implementation (ARM interworking and Mips > >> PIC to non-PIC calls): > >> - A Regular, Shared or Undefined symbol may have a single thunk > >> - For each relocation to a symbol S, if we need a thunk we use the > >> thunk for S as the target of the relocation rather than S. The thunk > >> will transfer control to S. > >> - Thunks are assigned to an InputSection, these are written out when > >> the InputSection is written. The InputSection with the Thunk contains > >> either the caller (ARM) or callee (Mips). > >> - For all existing thunks, the decision of whether a thunk is needed > >> is not dependent on address. A Thumb branch to ARM will always need a > >> thunk, no matter the distance. Thunks can therefore be generated > >> relatively early in Writer::run(). > >> > >> High level impact of range extension thunks: > >> > >> There may be more than one than one thunk per callee: > >> - A range extension thunk must be placed within range of the caller, > >> there may be cases where no single thunk for a callee is in range of > >> all callers. > >> - An ARM Target may need a different Thunk for ARM and Thumb callers. > >> > >> Address information is needed to determine if a range extension thunk is > >> needed: > >> - The more precise the address information available the less thunks > >> will be generated. the most precise address information is the final > >> address of caller and callee is known at thunk creation time, the > >> least precise is neither the address of the caller or callee is known. > >> > >> Range extension thunks can be combined or replace other thunks > >> - Thunks may also be used for instruction set interworking (ARM) or > >> for calling between position independent and non-position independent > >> code (Mips). Either a chain of thunks or a combined thunk that does > >> both operations is needed. For ARM all range extension thunks can > >> trivially be interworking thunks as well. > >> > >> Range extension thunk placement can be important > >> - Many callers may need a range extension. Placing a range extension > >> thunk so that it is in range of the most callers minimizes number of > >> thunks needed. > >> - Thunks may be better as synthetic sections rather than as additions > >> to input sections. > >> > >> Adding/removing content must not break the range calculations used in > >> range extension thunks. > >> - If any caller, callee or thunk address is changed after range > >> extension thunks are calculated it could invalidate the range > >> calculation. > >> - Ideally range extension thunks are the last operation the linker > >> does prior to resolving relocations. > >> > >> I think that there are two separate areas to a range extension thunk > >> implementation that can be considered separately. > >> 1.) Moving thunk generation to a later stage, at a minimum we need an > >> estimation of the address of each caller and callee, in an ideal world > >> we know the final address of each caller and callee. This could mean > >> assigning section addresses multiple times. > >> 2.) The alterations to the core data structures to permit more than > >> one Thunk per symbol and the logic to select the "right" Thunk for > >> each relocation. > >> > >> The design I'd like to aim at moves thunk creation into > >> finalizeSections() at a point where the sizes and addresses of all the > >> SyntheticSections are known. This would mean that the final address of > >> each caller and callee could be used, and after thunk creation there > >> would be no further content changes. This would mean: > >> - All code that runs prior to thunk creation may have the offset in > >> the OutputSection altered by the addition of thunks. In particular > >> scanRelocs() calculates the offset in the OutputSection of some > >> relocations. We would need to find alternative ways of handling these > >> cases so that they could either survive thunk creation or be patched > >> up afterwards. > >> - assignAddresses() will need to run at least twice if thunks are > >> created. At least once to give the thunk creation the caller and > >> callee addresses, and at least once after all thunks have been > >> created. > >> > >> There is an alternative design that only uses estimates of caller and > >> callee address to decide if a thunk is needed. In effect we use a > >> heuristic to predict how much extra synthetic content, such as plt and > >> got size, will be added after Thunk creation when deciding if a Thunk > >> is needed. I'm not in favour of this approach as from bitter > >> experience it tends to result in hard to debug problems when the > >> heuristics break down. Precise addresses would also allow errata > >> patching thunks [*] > >> > >> I've not thought too hard about how to alter the core data structures > >> yet. I think this will mostly be implementation detail though. > >> > >> Next steps: > >> I'd like to proceed with the following plan: > >> 1.) Move the existing thunk implementation to where it would need to > >> be in finalizeSections(). This should flush out all the non-thunk > >> related assumptions about addresses without adding any existing > >> complexity to the Thunk implementation. > >> 2.) Add support for multiple thunks per symbol > >> 3.) If it turns out to be a good idea, implement thunks as > >> SyntheticSections > >> 4.) Add support for range extensions. > >> > >> I think the first implementation of range extension thunks should be > >> simple and not try too hard to minimize the number of thunks needed. > >> If there is a need to optimize it can be done later as the changes > >> should be within the thunk creation module. > >> > >> Thanks for reading > >> > >> Peter > >> > >> The remainder of the message is a brief explanation of range extension > >> and errata patching thunks. > >> > >> What are range extension thunks? > >> Many architectures have branch instructions that have a finite range > >> that is insufficient to reach all possible program locations. For > >> example the ARM branch immediate instruction has an immediate that > >> encodes an offset of +-32Mb from the branch instruction. A range > >> extension thunk is a linker generated code sequence, inserted between > >> the caller and the callee, that completes the transfer of control to > >> the callee when the distance between the caller and callee exceeds the > >> range of the branch instruction. A simple example in psuedo assembly > >> for a non-position independent ARM function call. > >> > >> source: > >> BL long_range_thunk_to_target > >> ... > >> long_range_thunk_to_target > >> LDR r12, target_address ; r12 is the corruptible interprocedural > >> scratch register (ip) > >> BX r12 > >> target_address: > >> .word target ; > >> ... > >> target: > >> ... > >> > >> What is an errata patching thunk? > >> Some CPU errata (hardware bugs) can be fixed at a link time by > >> replacing an instruction with a branch to a sequence of equivalent > >> instructions that are guaranteed to to not trigger the erratum. In > >> some cases the trigger sequence is dependent on precise addresses such > >> as immediates crossing page boundaries, for example > >> https://sourceware.org/ml/binutils-cvs/2015-04/msg00012.html . Errata > >> patching is out of the scope of implementing range extension thunks > >> but can be seen as a generalization of it. > > > > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170106/d58da420/attachment-0001.html>
On Fri, Jan 6, 2017 at 6:21 AM, Rui Ueyama via llvm-dev < llvm-dev at lists.llvm.org> wrote:> On Thu, Jan 5, 2017 at 8:15 PM, Peter Smith <peter.smith at linaro.org> > wrote: > >> Hello Rui, >> >> Thanks for the comments >> >> - Synthetic sections and rewriting relocations >> I think that this would definitely be worth trying. It should remove >> the need for thunks to be represented in the core data structures, and >> would allow . >> > > Creating symbols for thunks would have another benefit: it makes > disassembled output easier to read because thunks have names. > > >> It would also mean that we wouldn't have to associate symbols with >> thunks as the relocations would directly target the thunks. ARM >> interworking makes reusing thunks more difficult as not every thunk is >> compatible with every caller. For example: >> ARM B target and Thumb2 B.W target can't reuse the same thunk even if >> in range as the branch instruction can't change state. >> >> I think it is worth an experiment to make the existing implementation >> of thunks use synthetic sections and rewriting relocations before >> trying to implement range extension thunks. >> >> - Yes the scan is linear it is essentially: >> do >> assign addresses to input sections >> for each relocation >> if (thunk needed) >> create thunk or reuse existing one >> while (no more thunks added) >> >> There's quite a lot of complexity that can be added with respect to >> the placement of thunks within the output section. For example if >> there is a caller with a low address and a caller with a high address, >> both might be able to reuse a thunk placed in the middle. I think it >> is worth starting simple though. > > > I agree. I believe that computing the best thunk positions is NP-hard, but > the best layout and a layout produced by a naive algorithm wouldn't be that > different. >Correct conclusion, but there's no way the problem is NP. Reordering functions (or even individual basic blocks) to minimize the needed thunks is a complex problem. But you're not doing that. Once an ordering is selected a simple greedy algorithm is optimal. There is no cost difference between a thunk that is right next to the short jump and a thunk that is only juuust within range. So you can find the lowest address jump needing a thunk to a particular target and put the thunk the maximum possible distance after it (after the end of a function, or even after any unconditional branch). Find everything else within range of that thunk and fix it up. Repeat. Other algorithms will give smaller average displacements to the thunks, but there is no advantage in that. No other algorithm will generate fewer thunks. That's assuming all short branches have the same code size and displacement distance. If there are multiple branch distances and code sizes (and you have a choice between them at given call sites) then it's still just a simple dynamic programming problem, solvable in linear [1] time by trying each branch size at the first available call site, with a cache of the minimum cost assuming the first 0, 1, 2 .. N call sites have already been covered. [1] or at least nCallSites * nBranchSizes -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170106/e12b0316/attachment.html>