Fangrui Song via llvm-dev
2020-Feb-28 02:34 UTC
[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?
I met with the Propeller team today (we work for the same company, but it was my first time meeting two members of the team :) ).
One thing I have been reassured of:

* There is no general disassembly work. General disassembly work would assuredly frighten off developers. (It is inherently unreliable, memory-heavy, and difficult to reconcile with CFI, debug information, etc.)

A minimal amount of plumbing work (https://reviews.llvm.org/D68065) is acceptable: locating the jump relocation, detecting the jump type, inverting the direction of a jump, and deleting trailing bytes of an input section. The existing linker relaxation schemes already do similar things. Deleting a trailing jump is similar to RISC-V, where sections can shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are what I have in mind). binutils supports deleting bytes for a few other architectures, e.g. msp430, sh, mips, ft32, rl78. With just a minimal amount of disassembly work, conceptually the framework should not be too hard to port to another target.

One thing I was not aware of (perhaps the description did not make it clear) is that Propeller intends to **reorder basic block sections across translation units**. This is something that full LTO can do while ThinLTO cannot. Our internal systems cannot afford doing a full LTO (**Can we fix the bottleneck of full LTO** [1]?) for large executables, and I believe some other users are in the same camp.

Now, with ThinLTO, the post-link optimization scheme will inevitably require help from the linker/compiler. It seems we have two routes:

## Route 1: Current Propeller framework

lld does whole-program reordering of basic block sections. We can extend it in the future to overalign some sections and pad gaps with NOPs. What else can we do? Source code/IR/MCInst is lost at this stage. Without general disassembly work, it may be difficult to do more optimization.

This makes me concerned about another thing: Intel's Jump Conditional Code (JCC) Erratum.

https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf

Put simply, a Jcc instruction whose address ≡ 30 or 31 (mod 32) should be avoided. There are assembler level (MC) mitigations (function sections are overaligned to 32), but because we use basic block sections (sh_addralign<32) and need reordering, we have to redo some work at the linking stage.

After losing the MCInst representation, it is not clear to me how we can insert NOPs/segment override prefixes without doing disassembly work in the linker.

Route 2 does the heavy lifting in the compiler, which can naturally reuse the assembler level mitigation, CFI and debug information generation, and probably other stuff. (How will debug information be bloated?)

## Route 2: Add another link stage, similar to a Thin Link as used by ThinLTO.

Regular ThinLTO with minimized bitcode files:

    all: compile thin_link thinlto_backend final_link

    compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
        $(clang) -O2 -c -flto=thin -fthin-link-bitcode=a.indexing.o a.c
        $(clang) -O2 -c -flto=thin -fthin-link-bitcode=b.indexing.o b.c

    thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp: a.indexing.o b.indexing.o
        $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp -Wl,--thinlto-prefix-replace=';lto' -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o b.indexing.o

    thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
        $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o
        $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o

    final_link exe: lto/a.o lto/b.o a.rsp
        # Propeller does basic block section reordering here.
        $(clang) -fuse-ld=lld @a.rsp -o exe

We need to replace the two stages thinlto_backend and final_link with three.

Propelled ThinLTO with minimized bitcode files:

    propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
        # Propeller emits something similar to a Machine IR file.
        # a.o and b.o are all IR files.
        $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc -fpropeller a.o -o lto/a.mir
        $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc -fpropeller b.o -o lto/b.mir

    propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
        # Propeller collects input Machine IR files and
        # spawns threads to generate object files in parallel.
        $(clang) -fpropeller-backend -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir

    final_link exe: propeller/a.o propeller/b.o
        # GNU ld/gold/lld links object files.
        $(clang) $^ -o exe

A .mir file may be much larger than an object file. So lto/a.mir may actually be an object file annotated with some information, or some lower level representation than Machine IR. There should be a guarantee that the produced object file keeps the basic block structure unchanged; otherwise basic block profiling information will not be very useful.

[1]: **Can we fix the bottleneck of full LTO**?

I wonder whether we have reached a "local maximum" of ThinLTO. If full LTO were nearly as fast as ThinLTO, how would we design a post-link optimization framework? Apparently, if full LTO did not have the scalability problem, we would not do so much work in the linker?
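The simplified erratum condition in the message above (a Jcc whose address ≡ 30 or 31 mod 32) can be checked mechanically. The following is a minimal sketch, not code from LLVM or lld: the helper name `touches_32b_boundary` and its parameters are hypothetical, and it assumes the general form of the condition from Intel's document, namely that a jump is affected when it crosses or ends on a 32-byte boundary.

```python
BOUNDARY = 32  # the erratum concerns 32-byte aligned windows


def touches_32b_boundary(addr: int, size: int) -> bool:
    """Return True if an instruction of `size` bytes placed at `addr`
    crosses or ends on a 32-byte boundary (hypothetical helper)."""
    # The first byte lives in window addr // 32. If the byte right after
    # the instruction lives in a later window, the instruction either
    # crosses the boundary or ends exactly on it.
    return addr // BOUNDARY != (addr + size) // BOUNDARY


# For a 2-byte Jcc (opcode + rel8), list the affected offsets in a window.
affected = [a for a in range(BOUNDARY) if touches_32b_boundary(a, 2)]
print(affected)  # [30, 31]
```

For a 2-byte Jcc this reproduces the "30 or 31 (mod 32)" rule; a longer encoding, such as a 6-byte Jcc with a 32-bit displacement, is affected over a wider range of offsets.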
Rui Ueyama via llvm-dev
2020-Feb-28 05:00 UTC
[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?
On Fri, Feb 28, 2020 at 11:34 AM Fangrui Song <maskray at google.com> wrote:

> This makes me concerned of another thing: Intel's Jump Condition Code Erratum.
>
> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>
> Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31 (mod 32) should be avoided. There are assembler level (MC) mitigations (function sections are overaligned to 32), but because we use basic block sections (sh_addralign<32) and need reordering, we have to redo some work at the linking stage.
>
> After losing the representation of MCInst, it is not clear to me how we can insert NOPs/segment override prefixes without doing disassembly work in the linker.

I'm not sure how the basic-block sections feature makes it hard to implement a mitigation for that specific JCC erratum. I may be missing something, but don't BB sections actually make it easier to implement, since a JCC occurs only at the end of a basic block, and with BB sections we know what the ending instruction of each block is? I mean, when we are reordering sections and find that some BB's JCC is not at a desired address, we can add padding before that BB.
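The padding idea in this reply can be sketched as follows. This is an illustration only, not lld's implementation: `padding_before_block` and its parameters are hypothetical, it assumes the block's only jump is its trailing instruction, and it pads with whole bytes (NOPs) until that jump neither crosses nor ends on a 32-byte boundary.

```python
BOUNDARY = 32


def jump_is_affected(addr: int, size: int) -> bool:
    """True if a jump of `size` bytes at `addr` crosses or ends on a
    32-byte boundary (hypothetical helper)."""
    return addr // BOUNDARY != (addr + size) // BOUNDARY


def padding_before_block(block_addr: int, block_size: int, jump_size: int) -> int:
    """Smallest number of padding bytes to place before a basic block so
    that its trailing jump is not affected (hypothetical helper)."""
    if not 0 < jump_size < BOUNDARY:
        raise ValueError("jump_size must be between 1 and 31 bytes")
    for pad in range(BOUNDARY):
        # The jump occupies the last `jump_size` bytes of the block.
        jump_addr = block_addr + pad + block_size - jump_size
        if not jump_is_affected(jump_addr, jump_size):
            return pad
    raise AssertionError("unreachable: some offset in a window always fits")


# A 16-byte block at 0x10 ends with a 2-byte Jcc at offset 30 of its window.
print(padding_before_block(0x10, 16, 2))  # 2
```

Padding one block shifts every block placed after it, so a real layout pass would have to apply such a check while assigning addresses in order.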
Sriraman Tallam via llvm-dev
2020-Feb-28 16:44 UTC
[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?
Hi Fangrui,

Not sure why you started a new conversation when you could have just replied to the existing thread.

On Thu, Feb 27, 2020 at 6:34 PM Fangrui Song <maskray at google.com> wrote:

> One thing I was not aware of (perhaps the description did not make it clear) is that Propeller intends to **reorder basic block sections across translation units**.

This was the intention of basic block sections from the very beginning.

> ## Route 2: Add another link stage, similar to a Thin Link as used by ThinLTO.
>
> We need to replace the two stages thinlto_backend and final_link with three.

I am not sure I fully follow what you mean here, but it seems to be along the lines of going back to MIR to do the optimizations. We are considering this, and we have even discussed it with Eli in the original thread: http://lists.llvm.org/pipermail/llvm-dev/2019-September/135455.html

For example, we are looking at inserting prefetch instructions at specific points in the binary. We would not be disassembling native code to do that but would be doing it in MIR.

> [1]: **Can we fix the bottleneck of full LTO** [1]?
>
> I wonder whether we have reached a "local maximum" of ThinLTO. If full LTO were nearly as fast as ThinLTO, how would we design a post-link optimization framework? Apparently, if full LTO did not have the scalability problem, we would not do so much work in the linker?

Full LTO has very high overheads for medium to large binaries. As a data point, I ran a full LTO optimization of a binary with 350M of text and had to kill the process after RSS went to 175G. I couldn't get it to run on my beefy machine with 192G of RAM.

Hope this helps address some of your concerns.

Thanks
Sri
Fangrui Song via llvm-dev
2020-Mar-02 19:02 UTC
[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?
On 2020-02-28, Rui Ueyama wrote:
>On Fri, Feb 28, 2020 at 11:34 AM Fangrui Song <maskray at google.com> wrote:
>
>> Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31 (mod 32) should be avoided. There are assembler level (MC) mitigations (function sections are overaligned to 32), but because we use basic block sections (sh_addralign<32) and need reordering, we have to redo some work at the linking stage.
>>
>> After losing the representation of MCInst, it is not clear to me how we can insert NOPs/segment override prefixes without doing disassembly work in the linker.
>
>I'm not sure how the basic-block sections feature makes it hard to implement a mitigation for that specific JCC erratum. I may be missing something, but doesn't the BB sections actually make it easier to implement, as the JCC occurs only at the ending of each basic block, and with the BB sections we know what the ending instruction is for each block?
>I mean, when we are reordering sections, and if we found some BB with JCC is not at a desired address, we can add a padding before that BB.

The loss of MachineInstr/MCInst (what we have are ELF object files), combined with refraining from disassembly, makes it hard to implement the Intel JCC Erratum mitigation.

Inserting padding can increase the distance of JCC_1/JMP_1 instructions. JCC_1/JMP_1 may need to be relaxed to JCC_4/JMP_4.

    jb 1f; nop; 1:                 # 72 01              jb 0x3
    jb 2f; .space 0x7f, 0x90; 2:   # 72 7f              jb 0x84
    jb 3f; .space 0x80, 0x90; 3:   # 0f 82 80 00 00 00  jb 0x10a

Without disassembly, we can only add NOPs, but not the superior segment override prefixes.

Note that x86 has several instructions which are documented ("Table 24-3. Format of Interruptibility State") as enabling interrupts exactly one instruction after the one which changes the SS segment register. Inserting a NOP allows an interrupt to arrive before the execution of the following instruction, which changes semantic behaviour.

    # NOP before jmp can change semantic behavior.
    sti; jmp baz
    movl %esi, %ss; jmp baz
    movw (%rsi), %ss; jmp baz

Well, we may go the MSP430 route: always generate the maximum range branch instructions, and rely on the linker to relax instructions. This is also what RISC-V does. (I mentioned both MSP430 and RISC-V in my previous message :) )

There are many challenges here.

Coming back to the topic. We really have 2 routes.

a) Start from ELF object files, and add sufficient metadata to support post-link optimization.
b) Start from machine level representations (MIR-like), and remove unneeded details to make whole-program post-link optimization affordable.

Our current route is a). To avoid creating JCC_1/JMP_1 instructions which are close to the jump distance limit, we may have to implement something similar to TargetInstrInfo::getInstSizeInBytes for x86 and add BranchRelaxation to addPreEmitPass(), so that JCC_1/JMP_1 can be avoided in the first place. Unfortunately, getInstSizeInBytes is not great when there is inline assembly, because a pseudo instruction or a macro can expand to multiple real instructions. In this regard, LLVM is better than GCC because GCC just counts the statements (https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html#Size-of-an-asm).

Debug information is another hassle. We only have relocation records, and we will use conservative forms to keep debug information from breaking after Propeller optimization. If we start from DwarfDebug, we will be more confident that nothing will break.

I understand that our focus has always been a), and it is a (huge) sunk cost if we change the direction to b). I am just concerned about how many new things we will discover in the future that are mandatory to annotate in ELF object files.

Starting from b) is a challenge, because it will push us to rethink various LLVM infrastructures. The nice thing in return for Propeller is immediate reuse of facilities already provided by the compiler. The nice thing in the long term is flexible frameworks that will benefit the overall LLVM project and other future optimizations.

Additionally, Sri told me that we would also do compiler-inserted prefetching (this topic has been thoroughly studied in prior art, so I don't think exposing the information is sensitive at all). I cannot imagine how we would do it without more machine level information.
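The relaxation effect in this message (padding pushing a short jump past its 8-bit range) can be illustrated with a small encoder. This is a sketch for the `jb` instruction only, using the opcodes visible in the byte dumps in the message (`72` for the rel8 form, `0f 82` for the rel32 form); `encode_jb` is a hypothetical helper, not an LLVM API.

```python
def encode_jb(disp: int) -> bytes:
    """Encode `jb` for a displacement measured from the end of the
    instruction, choosing the shortest form (hypothetical helper)."""
    if -128 <= disp <= 127:
        # Short form (JCC_1): opcode 0x72 followed by a signed 8-bit offset.
        return bytes([0x72]) + disp.to_bytes(1, "little", signed=True)
    # Near form (JCC_4): opcode 0x0f 0x82 followed by a signed 32-bit offset.
    return bytes([0x0F, 0x82]) + disp.to_bytes(4, "little", signed=True)


for disp in (0x01, 0x7F, 0x80):
    print(hex(disp), encode_jb(disp).hex(" "))
```

One extra byte of padding between the jump and its target (0x7f to 0x80) grows the instruction from 2 to 6 bytes, which in turn moves everything after it. This is why padding and branch sizing cannot be decided independently.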
Mehdi AMINI via llvm-dev
2020-Mar-03 07:55 UTC
[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?
On Thu, Feb 27, 2020 at 6:34 PM Fangrui Song via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Our internal systems cannot afford doing a full LTO (**Can we fix the bottleneck of full LTO** [1]?) for large executables and I believe some other users are in the same camp.

Right. Beyond distributed build systems, even on a single machine and for "small" projects like clang, building on a laptop with FullLTO can be challenging in terms of memory consumption, and iterative development is just not practical.

> We need to replace the two stages thinlto_backend and final_link with three.
>
> Propelled ThinLTO with minimized bitcode files:
>
>     propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
>         # Propeller emits something similar to a Machine IR file.
>         # a.o and b.o are all IR files.
>         $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc -fpropeller a.o -o lto/a.mir
>         $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc -fpropeller b.o -o lto/b.mir
>
>     propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
>         # Propeller collects input Machine IR files,
>         # spawn threads to generate object files parallelly.
>         $(clang) -fpropeller-backend -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir
>
>     final_link exe: propeller/a.o propeller/b.o
>         # GNU ld/gold/lld links object files.
>         $(clang) $^ -o exe

There was an interesting talk last week at the LLVM performance workshop: Global Machine Outliner for ThinLTO <https://llvm.org/devmtg/2020-02-23/#kl>, which introduced a similar stage in ThinLTO (for another purpose though). I believe they avoid serializing MIR by running CodeGen twice instead (once to collect the cross-module information, and a second time using that information). CC the author in case the slides are already available online.

> [1]: **Can we fix the bottleneck of full LTO** [1]?
>
> I wonder whether we have reached a "local maximum" of ThinLTO. If full LTO were nearly as fast as ThinLTO, how would we design a post-link optimization framework? Apparently, if full LTO did not have the scalability problem, we would not do so much work in the linker?

A lot of work went into ThinLTO because the scalability issue of LTO was considered inherent to the design. It isn't clear what you're suggesting here, though?

-- 
Mehdi
Rahman Lavaee via llvm-dev
2020-Mar-21 04:55 UTC
[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?
Below we discuss our plan for handling Intel's JCC mitigation.

TL;DR: By computing basic block groupings early, the compiler can form larger clusters of basic blocks (each cluster in a section), which allows Propeller to simply reuse the assembler’s mitigation. Our experiments show that JCC mitigation causes only a 0.2% slowdown for Propeller, compared to the 0.6% slowdown incurred for the vanilla configuration.

A slightly longer summary:

- We evaluated a Propeller prototype that reuses the existing assembler mitigation in llvm, -mbranches-within-32B-boundaries, which currently uses only NOPs for mitigation.
- With some changes, Propeller is able to reuse the existing assembler mitigation. To do this, we form large basic block clusters (sections containing multiple basic blocks) in the compiler by computing the basic block layout earlier.
- The vanilla clang benchmark (no Propeller) regresses by ~0.6% with this flag.
- With Propeller, the exact same flag regresses clang only by ~0.2%, reducing the total speedup from 7.8% to 7.6%.
- For similar problems, the solution is most optimally implemented in the linker. However, for this particular problem, it appears that the assembler's mitigation is good enough when combined with Propeller.

Background

The JCC erratum <https://www.intel.com/content/www/us/en/support/articles/000055650/processors.html> is a CPU bug affecting Skylake processors which results in unpredictable behaviour under complex micro-architectural states involving the Decoded ICache, specifically when executing branches which cross a cache line.

MicroCode Update (MCU) Mitigation

The CPU avoids this bug by bypassing the Decoded ICache for branches crossing 32B boundaries. This sacrifices some performance (0-4%) in return for correctness. The compiler can alleviate this effect by aligning the code such that branches do not cross a 32B boundary. There are two ways that the compiler can do this:

1. Inserting NOP instructions
2. Inserting prefixes for instructions

The current solution shipped with clang-10 (under -mbranches-within-32B-boundaries) aligns every function at 32B and uses NOPs between instructions. Our experiment shows that enabling this option results in a 0.6% performance degradation for Clang. There have been some efforts to improve this using instruction prefixes (https://reviews.llvm.org/D72225, https://reviews.llvm.org/D75268), even though there has been some uncertainty about the available headroom (https://reviews.llvm.org/D72225#1818149).

JCC Mitigation in Propeller

Propeller modifies the code layout by emitting basic blocks into sections and reordering them at link time. This means the assembler’s mitigation could be invalidated by Propeller. There are two ways in which Propeller can solve the problem:

1. Redo the full mitigation in the linker
2. Reuse the mitigation that is being implemented in the assembler

Next we discuss each of the two strategies in more detail.

Full Mitigation in the Linker

The current compiler solution is implemented in the assembler backend and its scope is limited to one function at a time (with -function-sections), which requires excessive alignment of 32B for the function entry. As a post-link optimization infrastructure, Propeller has a global view of all sections at link time and is in a better position for globally optimal JCC mitigation.

The challenge for Propeller is finding the location of affected branch instructions, and inserting paddings or prefixes at the right places (some instructions cannot be prepended with prefixes or NOPs). This is easier for the assembler as it has higher-level information about instructions and can use the MC layer structures (such as MCRelaxableFragment) to emit variable-sized paddings or prefixes.

As we discuss next, our prototype relying on the assembler's mitigation incurs no significant overhead and therefore we do not plan to address this problem in the linker.
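To make the boundary condition concrete, here is a minimal Python sketch of the check a NOP-based mitigation has to perform. It is an illustration only, not the assembler's actual algorithm: the names `touches_boundary` and `nop_padding` are hypothetical, and the rule used (a branch must neither cross nor end on a 32B boundary) is the one implied by the earlier remark in this thread that a 2-byte Jcc at an address ≡ 30 or 31 (mod 32) should be avoided.

```python
BOUNDARY = 32  # the erratum is defined in terms of 32-byte windows


def touches_boundary(addr: int, size: int) -> bool:
    """Return True if an instruction at `addr` of `size` bytes crosses
    or ends on a 32-byte boundary (assumed rule, see lead-in)."""
    return addr // BOUNDARY != (addr + size) // BOUNDARY


def nop_padding(addr: int, size: int) -> int:
    """Number of NOP bytes to place before the instruction so that it
    starts at the next 32-byte boundary; 0 if it is already safe."""
    if not 0 < size < BOUNDARY:
        raise ValueError("sketch only handles instructions shorter than 32 bytes")
    if touches_boundary(addr, size):
        return BOUNDARY - addr % BOUNDARY
    return 0
```

For a 2-byte Jcc, `nop_padding(30, 2)` asks for 2 bytes of padding and `nop_padding(29, 2)` asks for none. The sketch also shows why the section start must be 32B-aligned for the assembler's mitigation to survive: the check depends on the final address modulo 32, which the assembler only knows relative to the section start.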
Relying on the Assembler’s Mitigation

Propeller can use the assembler’s mitigation on every basic block section. However, this means every basic block would be aligned at 32 bytes. The paddings between the basic blocks may be executed as NOPs, which would put significant pressure on the CPU's frontend.

To reduce the NOP paddings, we would need to emit BB sections at a coarser level of granularity, which would mean emitting multiple basic blocks in the same section. However, currently, Propeller delays the basic block layout computation until link time and hence the actual group of basic blocks (cluster) is only available at link time.

To make this work, we implemented a prototype by moving the layout computation before the final round of Propeller compilation. After the layout is computed, basic block partitions of each function are extracted and passed to the compiler.

For example, consider the following BB layout for a program consisting of two functions foo (with 5 basic blocks) and bar (with a single basic block).

    foo  foo.BB.1  foo.BB.2  bar  foo.BB.3  foo.BB.4

The extracted BB partitions are as follows:

    foo: { [foo, foo.BB.1, foo.BB.2], [foo.BB.3, foo.BB.4] }
    bar: { [bar] }

We instruct the compiler to emit foo’s basic blocks in two sections and bar’s single basic block in one section. The assembler applies JCC mitigation on each of the three sections by aligning them at 32 bytes and inserting minimal paddings between instructions within every section. The only change compared to the baseline mitigation with -function-sections is emitting an excessive 32-byte alignment for foo.BB.3. However, the introduced padding is non-executed code (it may put a small amount of pressure on the instruction cache and TLB).

We note that the layout algorithm would scatter a function’s basic blocks across multiple partitions judiciously and only if it is advantageous for performance. For intra-procedural layout, only two clusters are created (hot and cold). Nonetheless, the non-executed paddings for clusters will have minimal impact on performance.

On another note, better code layout could reduce the overhead of JCC mitigation because the hot code would be packed together and the paddings for the cold blocks will not affect the hot code.

Results

We evaluated Clang’s performance under different optimizations with and without JCC mitigation. We used PGO + ThinLTO for all configurations. We tested two Propeller code layouts: inter-procedural and intra-procedural. The intra-procedural layout results in at most two clusters for every function, while the inter-procedural layout could lead to more. To use JCC mitigation, we use "-Wl,-mllvm,--x86-branches-within-32B-boundaries -mbranches-within-32B-boundaries".

We ran the clang bootstrap test 10 times for each configuration and measured the average CPU time (user + sys in seconds). We note that our evaluation is performed on a machine without the microcode update installed.

                                Mitigation Enabled    Mitigation Disabled
    baseline (PGO + ThinLTO)    545.362               542.012
    Propeller intra-procedural  506.828               504.861
    Propeller inter-procedural  503.23                502.136

Clang's CPU time relative to the baseline, for different optimization flavors, with and without JCC mitigation

First, JCC mitigation results in a 0.6% slowdown when applied to the baseline. With Propeller, JCC mitigation incurs a 0.4% slowdown for intra-procedural and 0.2% for inter-procedural layout. The smaller JCC mitigation slowdowns for the Propeller configurations show the impact of better code layout. When hot and cold code are mixed together, the paddings in the cold part could put more pressure on the I-Cache and I-TLB.

Conclusion

Using BB clusters, we can reuse the assembler’s JCC mitigation with no significant impact on performance. In fact, the slowdown caused by JCC mitigation is lower for Propeller because of the better code layout.
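As a quick cross-check, the slowdown percentages quoted in the Results can be recomputed from the CPU times in the table. The sketch below is not part of the original experiments; the dictionary layout and the helper name `slowdown_percent` are illustrative choices.

```python
# Average CPU time in seconds (user + sys), copied from the Results table:
# configuration -> (mitigation enabled, mitigation disabled)
cpu_time = {
    "baseline (PGO + ThinLTO)": (545.362, 542.012),
    "Propeller intra-procedural": (506.828, 504.861),
    "Propeller inter-procedural": (503.23, 502.136),
}


def slowdown_percent(enabled: float, disabled: float) -> float:
    """Relative increase in CPU time caused by enabling the mitigation."""
    return (enabled / disabled - 1.0) * 100.0


slowdowns = {
    name: round(slowdown_percent(enabled, disabled), 1)
    for name, (enabled, disabled) in cpu_time.items()
}

for name, pct in slowdowns.items():
    print(f"{name}: {pct}% slowdown with JCC mitigation")
```

Rounded to one decimal place this gives 0.6% for the baseline, 0.4% for the intra-procedural layout and 0.2% for the inter-procedural layout, matching the figures stated above.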
Finally, we would like to stress once again that Propeller has the potential to do a better job for problems like this JCC mitigation. However, for this particular problem, we have shown that the assembler's mitigation is good enough to be used along with Propeller.
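For illustration, the partition extraction described under "Relying on the Assembler’s Mitigation" can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not Propeller's implementation: the function names `owner` and `extract_partitions` are hypothetical, and it assumes that block symbols follow the `func.BB.N` naming used in the foo/bar example, with the entry block named after the function itself.

```python
from itertools import groupby


def owner(symbol: str) -> str:
    """Function that a block symbol belongs to (assumed `func.BB.N` naming)."""
    return symbol.split(".BB.", 1)[0]


def extract_partitions(layout: list[str]) -> dict[str, list[list[str]]]:
    """Split a global basic block layout into per-function clusters.

    Each maximal run of consecutive blocks from the same function becomes
    one cluster, i.e. one section that the compiler would be asked to emit.
    """
    partitions: dict[str, list[list[str]]] = {}
    for func, run in groupby(layout, key=owner):
        partitions.setdefault(func, []).append(list(run))
    return partitions


# The layout from the example in the message above.
layout = ["foo", "foo.BB.1", "foo.BB.2", "bar", "foo.BB.3", "foo.BB.4"]
partitions = extract_partitions(layout)
```

On the example layout this yields two clusters for foo and one for bar, which corresponds to the three sections on which the assembler then applies its 32B alignment and padding.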