thr3ads.net - llvm dev - [llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)? [Mar 2020]

Fangrui Song via llvm-dev

2020-Feb-28 02:34 UTC

[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?

I met with the Propeller team today (we work for the same company but it
was my first time meeting two members on the team:) ).
One thing I have been reassured of:

* There is no general disassembly work. General
disassembly work would assuredly frighten off developers. (It is inherently
unreliable, memory-heavy, and difficult to reconcile with CFI, debug
information, etc.)

A minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
acceptable: locating the jump relocation, detecting the jump type,
inverting the direction of a jump, and deleting trailing bytes of an
input section. The existing linker relaxation schemes already do similar
things. Deleting a trailing jump is similar to RISC-V, where sections can
shrink (not implemented in lld; I have R_RISCV_ALIGN and R_RISCV_RELAX in
mind). binutils supports deleting bytes for a few other
architectures, e.g. msp430, sh, mips, ft32, rl78. With just a minimal
amount of disassembly work, conceptually the framework should not be too
hard to port to another target.
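
To make the scale of that plumbing concrete, here is a minimal sketch of two of the steps on x86: inverting a conditional jump and dropping trailing bytes from a section. It is not taken from D68065. The function names are hypothetical, and a real linker would find the jump through its relocation rather than by inspecting raw bytes. It relies only on the x86 encoding rule that flipping the low bit of the condition code negates the condition.

```python
def invert_jcc(code: bytes) -> bytes:
    """Return the same x86 conditional jump with its condition inverted.

    Handles Jcc rel8 (70+cc ib) and Jcc rel32 (0F 80+cc id). Flipping the
    low bit of the condition code negates the condition (e.g. jb <-> jae).
    """
    if len(code) == 2 and 0x70 <= code[0] <= 0x7F:
        return bytes([code[0] ^ 1]) + code[1:]
    if len(code) == 6 and code[0] == 0x0F and 0x80 <= code[1] <= 0x8F:
        return bytes([code[0], code[1] ^ 1]) + code[2:]
    raise ValueError("not a recognised Jcc encoding")


def drop_trailing_bytes(section: bytes, n: int) -> bytes:
    """Return the section contents with the last n bytes removed."""
    if not 0 <= n <= len(section):
        raise ValueError("cannot drop more bytes than the section holds")
    return section[: len(section) - n]


# jb +1 (72 01) becomes jae +1 (73 01).
inverted = invert_jcc(bytes([0x72, 0x01]))
# Remove a trailing 2-byte "jmp +0" (eb 00) that falls through anyway.
shrunk = drop_trailing_bytes(bytes([0x90, 0x90, 0xEB, 0x00]), 2)
```

The displacement bytes are left untouched here; in practice the linker would rewrite the relocation target when it inverts the branch.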

One thing I was not aware of (perhaps the description did not make it clear) is
that Propeller intends to **reorder basic block sections across translation
units**. This is something that full LTO can do while ThinLTO cannot.
Our internal systems cannot afford a full LTO for large executables (**Can we
fix the bottleneck of full LTO** [1]?), and I believe some other users are in
the same camp.

Now, with ThinLTO, the post link optimization scheme will inevitably require
help from the linker/compiler. It seems we have two routes:

## Route 1: Current Propeller framework

lld does whole-program reordering of basic block sections. We can extend it in
the future to overalign some sections and pad gaps with NOPs. What else can we
do? Source code/IR/MCInst is lost at this stage. Without general disassembly
work, it may be difficult to do more optimization.

This makes me concerned about another thing: Intel's Jump Conditional Code
Erratum.
https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf

Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
(mod 32) should be avoided. There are assembler level (MC) mitigations
(function sections are overaligned to 32), but because we use basic
block sections (sh_addralign<32) and need reordering, we have to redo
some work at the linking stage.
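
As a rough illustration of the address rule above: Intel's mitigation document describes the affected case as a jump that crosses a 32-byte boundary or ends on one. For a 2-byte Jcc rel8 that reduces to the "30 or 31 (mod 32)" shorthand. The sketch below is only a check of that arithmetic; the helper name is hypothetical, it assumes the instruction is no longer than the boundary, and it ignores macro-fused compare-and-jump pairs.

```python
def hits_boundary(addr: int, size: int, boundary: int = 32) -> bool:
    """True if the instruction occupying [addr, addr + size) crosses a
    `boundary`-byte line or ends exactly on one.

    Assumes 0 < size <= boundary.
    """
    return (addr % boundary) + size >= boundary


# Offsets (mod 32) at which a 2-byte Jcc rel8 would be affected.
bad_rel8 = [off for off in range(32) if hits_boundary(off, 2)]
# Offsets (mod 32) at which a 6-byte Jcc rel32 would be affected.
bad_rel32 = [off for off in range(32) if hits_boundary(off, 6)]
```

With these definitions `bad_rel8` is `[30, 31]`, while the longer rel32 form is affected over a wider window, `[26, 27, 28, 29, 30, 31]`.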

After losing the representation of MCInst, it is not clear to me how we can
insert NOPs/segment override prefixes without doing disassembly work in the
linker.

Route 2 does heavy lifting work in the compiler, which can naturally reuse the
assembler level mitigation, CFI and debug information generating, and probably
other stuff. (How will debug information be bloated?)

## Route 2: Add another link stage, similar to a Thin Link as used by ThinLTO.

Regular ThinLTO with minimized bitcode files:

        all: compile thin_link thinlto_backend final_link

        compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
                $(clang) -O2 -c -flto=thin -fthin-link-bitcode=a.indexing.o a.c
                $(clang) -O2 -c -flto=thin -fthin-link-bitcode=b.indexing.o b.c

        thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp: a.indexing.o b.indexing.o
                $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp -Wl,--thinlto-prefix-replace=';lto' -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o b.indexing.o

        thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
                $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o
                $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o

        final_link exe: lto/a.o lto/b.o a.rsp
                # Propeller does basic block section reordering here.
                $(clang) -fuse-ld=lld @a.rsp -o exe

We need to replace the two stages thinlto_backend and final_link with
three.

Propelled ThinLTO with minimized bitcode files:

        propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
                # Propeller emits something similar to a Machine IR file.
                # a.o and b.o are both IR files.
                $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc -fpropeller a.o -o lto/a.mir
                $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc -fpropeller b.o -o lto/b.mir

        propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
                # Propeller collects the input Machine IR files and
                # spawns threads to generate object files in parallel.
                $(clang) -fpropeller-backend -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir

        final_link exe: propeller/a.o propeller/b.o
                # GNU ld/gold/lld links object files.
                $(clang) $^ -o exe

A .mir may be much larger than an object file. So lto/a.mir may
actually be an object file annotated with some information, or some lower
level representation than Machine IR (there should be a guarantee that
the produced object file will keep the basic block structure unchanged;
otherwise basic block profiling information will not be very useful).

[1]: **Can we fix the bottleneck of full LTO?**

I wonder whether we have reached a "local maximum" of ThinLTO.
If full LTO were nearly as fast as ThinLTO, how would we design a post-link
optimization framework?
Apparently, if full LTO did not have the scalability problem, we would
not do so much work in the linker?

Rui Ueyama via llvm-dev

2020-Feb-28 05:00 UTC

On Fri, Feb 28, 2020 at 11:34 AM Fangrui Song <maskray at google.com>
wrote:
> I met with the Propeller team today (we work for the same company but it
> was my first time meeting two members on the team:) ).
> One thing I have been reassured:
>
> * There is no general disassembly work. General
> disassembly work would assuredly frighten off developers.  (Inherently
> unreliable, memory usage heavy and difficult to deal with CFI, debug
> information, etc)
>
> Minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
> acceptable: locating the jump relocation, detecting the jump type,
> inverting the direction of a jump, and deleting trailing bytes of an
> input section. The existing linker relaxation schemes already do similar
> things. Deleting a trailing jump is similar to RISC-V where sections can
> shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in
> my mind)) (binutils supports deleting bytes for a few other
> architectures, e.g.  msp430, sh, mips, ft32, rl78).  With just minimal
> amount of disassembly work, conceptually the framework should not be too
> hard to be ported to another target.
>
> One thing I was not aware of (perhaps the description did not make it
> clear) is that
> Propeller intends to **reorder basic block sections across translation
> units**.
> This is something that full LTO can do while ThinLTO cannot.
> Our internal systems cannot afford doing a full LTO (**Can we fix the
> bottleneck of full LTO** [1]?)
> for large executables and I believe some other users are in the same camp.
>
> Now, with ThinLTO, the post link optimization scheme will inevitably
> require
> help from the linker/compiler. It seems we have two routes:
>
> ## Route 1: Current Propeller framework
>
> lld does whole-program reordering of basic block sections.  We can extend
> it in
> the future to overalign some sections and pad gaps with NOPs.  What else
> can we
> do? Source code/IR/MCInst is lost at this stage. Without general assembly
> work, it may be difficult to do more optimization.
>
> This makes me concerned of another thing: Intel's Jump Condition Code
> Erratum.
>
> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>
> Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
> (mod 32) should be avoided.  There are assembler level (MC) mitigations
> (function sections are overaligned to 32), but because we use basic
> block sections (sh_addralign<32) and need reordering, we have to redo
> some work at the linking stage.
>
> After losing the representation of MCInst, it is not clear to me how we can
> insert NOPs/segment override prefixes without doing disassembly work in
> the linker.
>
I'm not sure how the basic-block sections feature makes it hard to
implement a mitigation for that specific JCC erratum. I may be missing
something, but don't BB sections actually make it easier to
implement, since a JCC occurs only at the end of a basic block, and
with BB sections we know what the ending instruction of each block is?
I mean, when we are reordering sections, if we find that a BB's JCC
does not land at a desired address, we can add padding before that BB.
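
A minimal sketch of that padding idea, under simplifying assumptions: each block ends in a single jump of known size, section alignment is ignored, and padding is assumed not to push any other branch out of range. The function name and parameters are hypothetical and do not reflect lld's layout code.

```python
def pad_before_block(start: int, block_size: int, jcc_size: int,
                     boundary: int = 32) -> int:
    """Smallest number of padding bytes to place before a block laid out at
    `start` so that its trailing jump neither crosses nor ends on a
    `boundary`-byte line. Assumes 0 < jcc_size < boundary.
    """
    for pad in range(boundary):
        jcc_addr = start + pad + block_size - jcc_size
        if (jcc_addr % boundary) + jcc_size < boundary:
            return pad
    raise ValueError("no padding in range avoids the boundary")


# A 32-byte block at address 0 ending in a 2-byte Jcc: the jump would sit at
# offset 30 and end on the boundary, so two bytes of padding are needed.
pad = pad_before_block(0, 32, 2)
```

Because the padding shifts every later section, a real implementation would have to rerun this check as the layout changes.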

> Route 2 does heavy lifting work in the compiler, which can naturally reuse
> the assembler level mitigation,
> CFI and debug information generating, and probably other stuff.
> (How will debug information be bloated?)
>
> ## Route 2: Add another link stage, similar to a Thin Link as used by
> ThinLTO.
>
> Regular ThinLTO with minimized bitcode files:
>
>         all: compile thin_link thinlto_backend final_link
>
>         compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
>                 $(clang) -O2 -c -flto=thin
> -fthin-link-bitcode=a.indexing.o a.c
>                 $(clang) -O2 -c -flto=thin
> -fthin-link-bitcode=b.indexing.o b.c
>
>         thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp:
> a.indexing.o b.indexing.o
>                 $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp
> -Wl,--thinlto-prefix-replace=';lto'
> -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o
> b.indexing.o
>
>         thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc
> lto/b.o.thinlto.bc
>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o
> lto/a.o
>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o
> lto/b.o
>
>         final_link exe: lto/a.o lto/b.o a.rsp
>                 # Propeller does basic block section reordering here.
>                 $(clang) -fuse-ld=lld @a.rsp -o exe
>
> We need to replace the two stages thinlto_backend and final_link with
> three.
>
> Propelled ThinLTO with minimized bitcode files:
>
>         propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o
> lto/a.o.thinlto.bc lto/b.o.thinlto.bc
>                 # Propeller emits something similar to a Machine IR file.
>                 # a.o and b.o are all IR files.
>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc
> -fpropeller a.o -o lto/a.mir
>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc
> -fpropeller b.o -o lto/b.mir
>
>         propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
>                 # Propeller collects input Machine IR files,
>                 # spawn threads to generate object files parallelly.
>                 $(clang) -fpropeller-backend
> -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir
>
>         final_link exe: propeller/a.o propeller/b.o
>                 # GNU ld/gold/lld links object files.
>                 $(clang) $^ -o exe
>
> A .mir may be much large than an object file. So lto/a.mir may be
> actually an object file annotated with some information, or some lower
> level representation than a Machine IR (there should be a guarantee that
> the produced object file will keep the basic block structure unchanged
> => otherwise basic block profiling information will not be too useful).
>
>
>
> [1]: **Can we fix the bottleneck of full LTO** [1]?
>
> I wonder whether we have reached a "local maximum" of ThinLTO.
> If full LTO were nearly as fast as ThinLTO, how would we design a
> post-link optimization framework?
> Apparently, if full LTO did not have the scalability problem, we would
> not do so much work in the linker?

Sriraman Tallam via llvm-dev

2020-Feb-28 16:44 UTC

Hi Fangrui,

  Not sure why you started a new conversation when you could have just
replied to the existing thread.

On Thu, Feb 27, 2020 at 6:34 PM Fangrui Song <maskray at google.com>
wrote:
> I met with the Propeller team today (we work for the same company but it
> was my first time meeting two members on the team:) ).
> One thing I have been reassured:
>
> * There is no general disassembly work. General
> disassembly work would assuredly frighten off developers.  (Inherently
> unreliable, memory usage heavy and difficult to deal with CFI, debug
> information, etc)
>
> Minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
> acceptable: locating the jump relocation, detecting the jump type,
> inverting the direction of a jump, and deleting trailing bytes of an
> input section. The existing linker relaxation schemes already do similar
> things. Deleting a trailing jump is similar to RISC-V where sections can
> shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in
> my mind)) (binutils supports deleting bytes for a few other
> architectures, e.g.  msp430, sh, mips, ft32, rl78).  With just minimal
> amount of disassembly work, conceptually the framework should not be too
> hard to be ported to another target.
>
> One thing I was not aware of (perhaps the description did not make it
> clear) is that
> Propeller intends to **reorder basic block sections across translation
> units**.
>
This was the intention all along with basic block sections from the very
beginning.

> This is something that full LTO can do while ThinLTO cannot.
> Our internal systems cannot afford doing a full LTO (**Can we fix the
> bottleneck of full LTO** [1]?)
> for large executables and I believe some other users are in the same camp.
>
> Now, with ThinLTO, the post link optimization scheme will inevitably
> require
> help from the linker/compiler. It seems we have two routes:
>
> ## Route 1: Current Propeller framework
>
> lld does whole-program reordering of basic block sections.  We can extend
> it in
> the future to overalign some sections and pad gaps with NOPs.  What else
> can we
> do? Source code/IR/MCInst is lost at this stage. Without general assembly
> work, it may be difficult to do more optimization.
>
> This makes me concerned of another thing: Intel's Jump Condition Code
> Erratum.
>
> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>
> Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
> (mod 32) should be avoided.  There are assembler level (MC) mitigations
> (function sections are overaligned to 32), but because we use basic
> block sections (sh_addralign<32) and need reordering, we have to redo
> some work at the linking stage.
>
> After losing the representation of MCInst, it is not clear to me how we can
> insert NOPs/segment override prefixes without doing disassembly work in
> the linker.
>
> Route 2 does heavy lifting work in the compiler, which can naturally reuse
> the assembler level mitigation,
> CFI and debug information generating, and probably other stuff.
> (How will debug information be bloated?)
>
> ## Route 2: Add another link stage, similar to a Thin Link as used by
> ThinLTO.
>
> Regular ThinLTO with minimized bitcode files:
>
>         all: compile thin_link thinlto_backend final_link
>
>         compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
>                 $(clang) -O2 -c -flto=thin
> -fthin-link-bitcode=a.indexing.o a.c
>                 $(clang) -O2 -c -flto=thin
> -fthin-link-bitcode=b.indexing.o b.c
>
>         thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp:
> a.indexing.o b.indexing.o
>                 $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp
> -Wl,--thinlto-prefix-replace=';lto'
> -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o
> b.indexing.o
>
>         thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc
> lto/b.o.thinlto.bc
>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o
> lto/a.o
>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o
> lto/b.o
>
>         final_link exe: lto/a.o lto/b.o a.rsp
>                 # Propeller does basic block section reordering here.
>                 $(clang) -fuse-ld=lld @a.rsp -o exe
>
> We need to replace the two stages thinlto_backend and final_link with
> three.
>
I am not sure I fully follow what you mean here but it seems to be along
the lines of going back to MIR to do the optimizations.  We are considering
this and we have even discussed this with Eli in the original thread:

http://lists.llvm.org/pipermail/llvm-dev/2019-September/135455.html

For example, we are looking at inserting prefetch instructions at specific
points in the binary.  We would not be disassembling native code to do that
but would be doing it in MIR.

>
> Propelled ThinLTO with minimized bitcode files:
>
>         propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o
> lto/a.o.thinlto.bc lto/b.o.thinlto.bc
>                 # Propeller emits something similar to a Machine IR file.
>                 # a.o and b.o are all IR files.
>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc
> -fpropeller a.o -o lto/a.mir
>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc
> -fpropeller b.o -o lto/b.mir
>
>         propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
>                 # Propeller collects input Machine IR files,
>                 # spawn threads to generate object files parallelly.
>                 $(clang) -fpropeller-backend
> -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir
>
>         final_link exe: propeller/a.o propeller/b.o
>                 # GNU ld/gold/lld links object files.
>                 $(clang) $^ -o exe
>
> A .mir may be much large than an object file. So lto/a.mir may be
> actually an object file annotated with some information, or some lower
> level representation than a Machine IR (there should be a guarantee that
> the produced object file will keep the basic block structure unchanged
> => otherwise basic block profiling information will not be too useful).
>
>
>
> [1]: **Can we fix the bottleneck of full LTO** [1]?
>
> I wonder whether we have reached a "local maximum" of ThinLTO.
> If full LTO were nearly as fast as ThinLTO, how would we design a
> post-link optimization framework?
> Apparently, if full LTO did not have the scalability problem, we would
> not do so much work in the linker?
>
Full LTO has very high overheads for medium to large binaries.  As a data
point, I ran a Full LTO optimization of a binary with 350M of text and I
had to kill the process after RSS went to 175G.  I couldn't get it to run
on my beefy machine with 192G of RAM.

Hope this helps address some of your concerns.

Thanks
Sri

Fangrui Song via llvm-dev

2020-Mar-02 19:02 UTC

On 2020-02-28, Rui Ueyama wrote:
>On Fri, Feb 28, 2020 at 11:34 AM Fangrui Song <maskray at google.com> wrote:
>
>> I met with the Propeller team today (we work for the same company but it
>> was my first time meeting two members on the team:) ).
>> One thing I have been reassured:
>>
>> * There is no general disassembly work. General
>> disassembly work would assuredly frighten off developers.  (Inherently
>> unreliable, memory usage heavy and difficult to deal with CFI, debug
>> information, etc)
>>
>> Minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
>> acceptable: locating the jump relocation, detecting the jump type,
>> inverting the direction of a jump, and deleting trailing bytes of an
>> input section. The existing linker relaxation schemes already do similar
>> things. Deleting a trailing jump is similar to RISC-V where sections can
>> shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in
>> my mind)) (binutils supports deleting bytes for a few other
>> architectures, e.g.  msp430, sh, mips, ft32, rl78).  With just minimal
>> amount of disassembly work, conceptually the framework should not be too
>> hard to be ported to another target.
>>
>> One thing I was not aware of (perhaps the description did not make it
>> clear) is that
>> Propeller intends to **reorder basic block sections across translation
>> units**.
>> This is something that full LTO can do while ThinLTO cannot.
>> Our internal systems cannot afford doing a full LTO (**Can we fix the
>> bottleneck of full LTO** [1]?)
>> for large executables and I believe some other users are in the same camp.
>>
>> Now, with ThinLTO, the post link optimization scheme will inevitably
>> require
>> help from the linker/compiler. It seems we have two routes:
>>
>> ## Route 1: Current Propeller framework
>>
>> lld does whole-program reordering of basic block sections.  We can extend
>> it in
>> the future to overalign some sections and pad gaps with NOPs.  What else
>> can we
>> do? Source code/IR/MCInst is lost at this stage. Without general assembly
>> work, it may be difficult to do more optimization.
>>
>> This makes me concerned of another thing: Intel's Jump Condition Code
>> Erratum.
>>
>> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>>
>> Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
>> (mod 32) should be avoided.  There are assembler level (MC) mitigations
>> (function sections are overaligned to 32), but because we use basic
>> block sections (sh_addralign<32) and need reordering, we have to redo
>> some work at the linking stage.
>>
>> After losing the representation of MCInst, it is not clear to me how we can
>> insert NOPs/segment override prefixes without doing disassembly work in
>> the linker.
>>
>
>I'm not sure how the basic-block sections feature makes it hard to
>implement a mitigation for that specific JCC erratum. I may be missing
>something, but doesn't the BB sections actually make it easier to
>implement, as the JCC occurs only at the ending of each basic block, and
>with the BB sections we know what the ending instruction is for each block?
>I mean, when we are reordering sections, and if we found some BB with JCC
>is not at a desired address, we can add a padding before that BB.

Loss of MachineInstr/MCInst (all we have are ELF object files) and
refraining from disassembly make it hard to implement the Intel JCC
Erratum mitigation.

Inserting padding can increase the distance of JCC_1/JMP_1 instructions.
JCC_1/JMP_1 may need to be relaxed to JCC_4/JMP_4.

jb 1f; nop; 1:                # 72 01             jb 0x3
jb 2f; .space 0x7f, 0x90; 2:  # 72 7f             jb 0x84
jb 3f; .space 0x80, 0x90; 3:  # 0f 82 80 00 00 00 jb 0x10a
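
The size jump in the last line follows from the x86 encodings: a rel8 branch is 2 bytes and reaches displacements from -128 to 127, while the rel32 forms are 6 bytes for Jcc and 5 bytes for JMP. A small sketch of that rule (the helper name is hypothetical; this is not assembler or linker code):

```python
def branch_size(disp: int, conditional: bool = True) -> int:
    """Encoded size in bytes of an x86 near branch whose displacement,
    measured from the end of the instruction, is `disp`.
    """
    if -128 <= disp <= 127:
        return 2  # JCC_1 / JMP_1: opcode + rel8
    # JCC_4 is 0F 8x + rel32 (6 bytes); JMP_4 is E9 + rel32 (5 bytes).
    return 6 if conditional else 5


# The three jb cases above skip 0x01, 0x7f and 0x80 bytes respectively.
sizes = [branch_size(0x01), branch_size(0x7F), branch_size(0x80)]
```

Here `sizes` is `[2, 2, 6]`: one extra byte of padding between a branch and its target is enough to force the longer encoding.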

Without disassembly, we can only add NOPs, not the superior segment
override prefixes. Note that x86 has several instructions (sti and the
instructions that load the SS segment register) which are documented
("Table 24-3. Format of Interruptibility State") as deferring interrupt
delivery until after the following instruction. Inserting a NOP after one
of them allows an interrupt to arrive before the intended following
instruction executes, which changes semantic behaviour.

   # NOP before jmp can change semantic behavior.
   sti; jmp baz
   movl %esi, %ss; jmp baz
   movw (%rsi), %ss; jmp baz

Well, we may go the MSP430 route: always generate the maximum range branch
instructions, and rely on the linker to relax instructions. This is also
what RISC-V does. (I mentioned both MSP430 and RISC-V in my previous
message:) ) There are many challenges here.


Coming back to the topic. We really have 2 routes.

a) Start from ELF object files, add sufficient metadata that supports post-link
optimization
b) Start from machine level representations (MIR-like), remove unneeded details
to make whole-program post-link optimization affordable.

Our current route is a). To avoid creating JCC_1/JMP_1 which are close
to the jump distance limit, we may have to implement something similar
to TargetInstrInfo::getInstSizeInBytes for x86, add BranchRelaxation
to addPreEmitPass(), so that JCC_1/JMP_1 can be avoided in the first place.
Unfortunately, getInstSizeInBytes is not great when there is inline
assembly, because a pseudo instruction or a macro can expand to multiple
real instructions. In this regard, LLVM is better than GCC because GCC
just counts the statements
(https://gcc.gnu.org/onlinedocs/gcc/Size-of-an-asm.html#Size-of-an-asm)

Debug information is another hassle. We only have relocation records,
and we will use conservative forms to make debug information not break
after Propeller optimization. If we start from DwarfDebug, we will be
more confident that nothing will break.

I understand that our focus has always been a), and it is a (huge) sunk
cost if we change the direction to b). I am just concerned about how many
new things we will discover in the future that make it mandatory to
annotate ELF object files.

Starting from b) is a challenge, because it will push us to rethink
various LLVM infrastructures. The nice thing in return for Propeller is
immediate reuse of facility already provided by the compiler. The nice
thing in a long term is flexible frameworks that will benefit the
overall LLVM project and other future optimizations.

Additionally, Sri told me that we would also do compiler-inserted
prefetching (this topic has been thoroughly studied by prior art, so I
don't think exposing the information is sensitive at all). I cannot
imagine how we would do it without more machine level information.

>> Route 2 does heavy lifting work in the compiler, which can naturally reuse
>> the assembler level mitigation,
>> CFI and debug information generating, and probably other stuff.
>> (How will debug information be bloated?)
>>
>> ## Route 2: Add another link stage, similar to a Thin Link as used by
>> ThinLTO.
>>
>> Regular ThinLTO with minimized bitcode files:
>>
>>         all: compile thin_link thinlto_backend final_link
>>
>>         compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
>>                 $(clang) -O2 -c -flto=thin
>> -fthin-link-bitcode=a.indexing.o a.c
>>                 $(clang) -O2 -c -flto=thin
>> -fthin-link-bitcode=b.indexing.o b.c
>>
>>         thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp:
>> a.indexing.o b.indexing.o
>>                 $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp
>> -Wl,--thinlto-prefix-replace=';lto'
>> -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o
>> b.indexing.o
>>
>>         thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc
>> lto/b.o.thinlto.bc
>>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o
>> lto/a.o
>>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o
>> lto/b.o
>>
>>         final_link exe: lto/a.o lto/b.o a.rsp
>>                 # Propeller does basic block section reordering here.
>>                 $(clang) -fuse-ld=lld @a.rsp -o exe
>>
>> We need to replace the two stages thinlto_backend and final_link with
>> three.
>>
>> Propelled ThinLTO with minimized bitcode files:
>>
>>         propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o
>> lto/a.o.thinlto.bc lto/b.o.thinlto.bc
>>                 # Propeller emits something similar to a Machine IR file.
>>                 # a.o and b.o are all IR files.
>>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc
>> -fpropeller a.o -o lto/a.mir
>>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc
>> -fpropeller b.o -o lto/b.mir
>>
>>         propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
>>                 # Propeller collects input Machine IR files,
>>                 # spawn threads to generate object files parallelly.
>>                 $(clang) -fpropeller-backend
>> -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir
>>
>>         final_link exe: propeller/a.o propeller/b.o
>>                 # GNU ld/gold/lld links object files.
>>                 $(clang) $^ -o exe
>>
>> A .mir may be much large than an object file. So lto/a.mir may be
>> actually an object file annotated with some information, or some lower
>> level representation than a Machine IR (there should be a guarantee that
>> the produced object file will keep the basic block structure unchanged
>> => otherwise basic block profiling information will not be too useful).
>>
>>
>>
>> [1]: **Can we fix the bottleneck of full LTO** [1]?
>>
>> I wonder whether we have reached a "local maximum" of ThinLTO.
>> If full LTO were nearly as fast as ThinLTO, how would we design a
>> post-link optimization framework?
>> Apparently, if full LTO did not have the scalability problem, we would
>> not do so much work in the linker?
>>

Mehdi AMINI via llvm-dev

2020-Mar-03 07:55 UTC

On Thu, Feb 27, 2020 at 6:34 PM Fangrui Song via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> I met with the Propeller team today (we work for the same company but it
> was my first time meeting two members on the team:) ).
> One thing I have been reassured:
>
> * There is no general disassembly work. General
> disassembly work would assuredly frighten off developers.  (Inherently
> unreliable, memory usage heavy and difficult to deal with CFI, debug
> information, etc)
>
> Minimal amount of plumbing work (https://reviews.llvm.org/D68065) is
> acceptable: locating the jump relocation, detecting the jump type,
> inverting the direction of a jump, and deleting trailing bytes of an
> input section. The existing linker relaxation schemes already do similar
> things. Deleting a trailing jump is similar to RISC-V where sections can
> shrink (not implemented in lld; R_RISCV_ALIGN and R_RISCV_RELAX are in
> my mind)) (binutils supports deleting bytes for a few other
> architectures, e.g.  msp430, sh, mips, ft32, rl78).  With just minimal
> amount of disassembly work, conceptually the framework should not be too
> hard to be ported to another target.
>
> One thing I was not aware of (perhaps the description did not make it
> clear) is that
> Propeller intends to **reorder basic block sections across translation
> units**.
> This is something that full LTO can do while ThinLTO cannot.
> Our internal systems cannot afford doing a full LTO (**Can we fix the
> bottleneck of full LTO** [1]?)
> for large executables and I believe some other users are in the same camp.
>
Right. Beyond distributed build systems, even on a single machine and for
"small" projects like clang, building on a laptop with full LTO can be
challenging in terms of memory consumption, and iterative development
is just not practical.

>
> Now, with ThinLTO, the post link optimization scheme will inevitably
> require
> help from the linker/compiler. It seems we have two routes:
>
> ## Route 1: Current Propeller framework
>
> lld does whole-program reordering of basic block sections.  We can extend
> it in
> the future to overalign some sections and pad gaps with NOPs.  What else
> can we
> do? Source code/IR/MCInst is lost at this stage. Without general assembly
> work, it may be difficult to do more optimization.
>
> This makes me concerned of another thing: Intel's Jump Condition Code
> Erratum.
>
> https://www.intel.com/content/dam/support/us/en/documents/processors/mitigations-jump-conditional-code-erratum.pdf
>
> Put it in the simplest way, a Jcc instruction whose address ≡ 30 or 31
> (mod 32) should be avoided.  There are assembler level (MC) mitigations
> (function sections are overaligned to 32), but because we use basic
> block sections (sh_addralign<32) and need reordering, we have to redo
> some work at the linking stage.
>
> After losing the representation of MCInst, it is not clear to me how we can
> insert NOPs/segment override prefixes without doing disassembly work in
> the linker.
>
> Route 2 does heavy lifting work in the compiler, which can naturally reuse
> the assembler level mitigation, CFI and debug information generating, and
> probably other stuff. (How will debug information be bloated?)
>
> ## Route 2: Add another link stage, similar to a Thin Link as used by ThinLTO.
>
> Regular ThinLTO with minimized bitcode files:
>
>         all: compile thin_link thinlto_backend final_link
>
>         compile a.o b.o a.indexing.o b.indexing.o: a.c b.c
>                 $(clang) -O2 -c -flto=thin -fthin-link-bitcode=a.indexing.o a.c
>                 $(clang) -O2 -c -flto=thin -fthin-link-bitcode=b.indexing.o b.c
>
>         thin_link lto/a.o.thinlto.bc lto/b.o.thinlto.bc a.rsp: a.indexing.o b.indexing.o
>                 $(clang) -fuse-ld=lld -Wl,--thinlto-index-only=a.rsp -Wl,--thinlto-prefix-replace=';lto' -Wl,--thinlto-object-suffix-replace='.indexing.o;.o' a.indexing.o b.indexing.o
>
>         thinlto_backend lto/a.o lto/b.o: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc a.o -o lto/a.o
>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc b.o -o lto/b.o
>
>         final_link exe: lto/a.o lto/b.o a.rsp
>                 # Propeller does basic block section reordering here.
>                 $(clang) -fuse-ld=lld @a.rsp -o exe
>
> We need to replace the two stages thinlto_backend and final_link with three.
>
> Propelled ThinLTO with minimized bitcode files:
>
>         propelled_thinlto_backend lto/a.mir lto/b.mir: a.o b.o lto/a.o.thinlto.bc lto/b.o.thinlto.bc
>                 # Propeller emits something similar to a Machine IR file.
>                 # a.o and b.o are all IR files.
>                 $(clang) -O2 -c -fthinlto-index=lto/a.o.thinlto.bc -fpropeller a.o -o lto/a.mir
>                 $(clang) -O2 -c -fthinlto-index=lto/b.o.thinlto.bc -fpropeller b.o -o lto/b.mir
>
>         propeller_link propeller/a.o propeller/b.o: lto/a.mir lto/b.mir
>                 # Propeller collects input Machine IR files,
>                 # spawn threads to generate object files parallelly.
>                 $(clang) -fpropeller-backend -fpropeller-prefix-replace='lto;propeller' lto/a.mir lto/b.mir
>
>         final_link exe: propeller/a.o propeller/b.o
>                 # GNU ld/gold/lld links object files.
>                 $(clang) $^ -o exe
>
There was an interesting talk last week at the LLVM performance workshop:
Global Machine Outliner for ThinLTO <https://llvm.org/devmtg/2020-02-23/#kl>,
which introduced a similar stage in ThinLTO (for another purpose though). I
believe they avoid the serialization of MIR by running CodeGen twice
instead (once to collect the cross-module information, and a second time
to use it).
CC the author in case the slides are already available online.


>
> A .mir may be much larger than an object file. So lto/a.mir may
> actually be an object file annotated with some information, or some lower
> level representation than a Machine IR (there should be a guarantee that
> the produced object file will keep the basic block structure unchanged
> => otherwise basic block profiling information will not be too useful).
>
>
>
> [1]: **Can we fix the bottleneck of full LTO** [1]?
>
> I wonder whether we have reached a "local maximum" of ThinLTO.
> If full LTO were nearly as fast as ThinLTO, how would we design a
> post-link optimization framework?
> Apparently, if full LTO did not have the scalability problem, we would
> not do so much work in the linker?
>
A lot of work went into ThinLTO because the scalability issue of LTO was
considered inherent to the design. It isn't clear what you're suggesting
here though?

-- 
Mehdi

Rahman Lavaee via llvm-dev

2020-Mar-21 04:55 UTC

[llvm-dev] A Propeller link (similar to a Thin Link as used by ThinLTO)?

Here is our plan for handling Intel's JCC erratum mitigation.

TL;DR: By computing basic block groupings early, the compiler can form
larger clusters of basic blocks (each cluster in a section), which allows
Propeller to simply reuse the assembler’s mitigation.  Our experiments
show that JCC mitigation causes only a 0.2% slowdown for Propeller,
compared to the 0.6% slowdown incurred for the vanilla configuration.

A slightly longer summary:

   - We evaluated a Propeller prototype that reuses the existing assembler
     mitigation in LLVM, -mbranches-within-32B-boundaries, which currently
     uses only NOPs for mitigation.
   - With some changes, Propeller is able to reuse the existing assembler
     mitigation. To do this, we form large basic block clusters (sections
     containing multiple basic blocks) in the compiler by computing the
     basic block layout earlier.
   - The vanilla clang benchmark (no Propeller) regresses by ~0.6% with this
     flag.
   - With Propeller, the exact same flag regresses clang by only ~0.2%,
     reducing the total speedup from 7.8% to 7.6%.
   - For similar problems, the best place to implement the solution is the
     linker. However, for this particular problem, it appears that the
     assembler's mitigation is good enough when combined with Propeller.

Background

The JCC erratum
<https://www.intel.com/content/www/us/en/support/articles/000055650/processors.html>
is a CPU bug affecting Skylake processors which results in unpredictable
behaviour under complex micro-architectural states involving the Decoded
I-cache, specifically, when executing branches which cross a cache line.

MicroCode Update (MCU) Mitigation

The CPU avoids this bug by bypassing the Decoded ICache for branches
crossing 32B boundaries. This sacrifices some performance (0-4%) in return
for correctness. The compiler can alleviate this effect by aligning the
code such that branches do not cross a 32B boundary. There are two ways
that the compiler can do this:

   1. Inserting NOP instructions
   2. Inserting prefixes for instructions

The current solution shipped with clang-10 (under
-mbranches-within-32B-boundaries) aligns every function at 32B and uses NOPs
between instructions. Our experiment shows that enabling this option results
in a 0.6% performance degradation for Clang. There have been some efforts to
improve this using instruction prefixes (https://reviews.llvm.org/D72225,
https://reviews.llvm.org/D75268), even though there has been some
uncertainty about the available headroom
(https://reviews.llvm.org/D72225#1818149).
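
To make the alignment condition concrete, here is a minimal Python sketch of the boundary check and of the NOP padding needed in front of a single branch. It assumes that a branch is affected when it crosses or ends on a 32-byte boundary, which matches the "address ≡ 30 or 31 (mod 32)" rule of thumb for a 2-byte Jcc quoted earlier in this thread. The function names are hypothetical, and this is not the assembler's actual algorithm, which also has to deal with macro-fused pairs and relaxation.

```python
BOUNDARY = 32  # bytes; the granularity named in the erratum discussion


def touches_boundary(start, size, boundary=BOUNDARY):
    """Return True if the bytes [start, start + size) cross a boundary
    or end exactly on one (assumed definition of an affected branch)."""
    return start // boundary != (start + size) // boundary


def nop_padding(start, size, boundary=BOUNDARY):
    """Return how many padding bytes to place before the branch so that
    it no longer touches a boundary. Pushing the branch to the next
    boundary is always enough when the branch is shorter than the boundary."""
    if size >= boundary:
        raise ValueError("branch longer than the boundary cannot be fixed by padding")
    if not touches_boundary(start, size, boundary):
        return 0
    return boundary - start % boundary


if __name__ == "__main__":
    # A 2-byte Jcc at offsets 28..31 within a 32-byte window.
    for offset in (28, 29, 30, 31):
        print(offset, touches_boundary(offset, 2), nop_padding(offset, 2))
```

Padding up to the next boundary is the simplest possible choice; prefix-based mitigation aims to reach the same alignment without adding separate NOP instructions.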

JCC Mitigation in Propeller

Propeller modifies the code layout by emitting basic blocks into sections
and reordering them at link time. This means the assembler’s mitigation
could be corrupted by Propeller.

There are two ways in which Propeller can solve the problem:

   1. Redo the full mitigation in the linker
   2. Reuse the mitigation that is being implemented in the assembler

Next we discuss each of the two strategies in more detail.

Full Mitigation in the Linker

The current compiler solution is implemented in the assembler backend and
its scope is limited to one function at a time (with -function-sections),
which requires excessive alignment of 32B for the function entry.

As a post-link optimization infrastructure, Propeller has a global view
of all sections at link time and is in a better position to perform a
globally optimal JCC mitigation. The challenge for Propeller is finding the
locations of affected branch instructions and inserting padding or prefixes
in the right places (some instructions cannot be prepended with prefixes or
NOPs). This is easier for the assembler, as it has higher-level information
about instructions and can use the MC layer structures (such as
MCRelaxableFragment) to emit variable-sized padding or prefixes.

As we discuss next, our prototype relying on the assembler's mitigation
incurs no significant overhead and therefore we do not plan to address this
problem in the linker.

Relying on the Assembler’s Mitigation

Propeller can use the assembler’s mitigation on every basic block section.
However, this means every basic block would be aligned at 32 bytes. The
padding between basic blocks may then consist of executed NOPs, which would
put significant pressure on the CPU's frontend.

To reduce the NOP paddings, we would need to emit BB sections at a coarser
level of granularity, which would mean emitting multiple basic blocks in
the same section. However, currently, Propeller delays the basic block
layout computation until link time and hence the actual group of basic
blocks (cluster) is only available at link time.
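
As a rough illustration of why coarser sections help, the following Python sketch compares the alignment padding needed when every basic block is its own 32-byte-aligned section against the padding needed when blocks are grouped into clusters. The block sizes and the cluster split are made-up numbers, the helper names are hypothetical, and the sketch only counts inter-section alignment padding (it ignores the NOPs the assembler inserts inside a section).

```python
def align_up(offset, alignment=32):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def padding_bytes(section_sizes, alignment=32):
    """Total padding inserted when sections are laid out back to back
    and each one must start at an aligned address."""
    offset = 0
    padding = 0
    for size in section_sizes:
        aligned = align_up(offset, alignment)
        padding += aligned - offset
        offset = aligned + size
    return padding


# Hypothetical sizes (in bytes) of five basic blocks of one function.
block_sizes = [10, 7, 20, 5, 12]

# One section per basic block: every block start is aligned.
per_block = padding_bytes(block_sizes)

# Two clusters (first three blocks, last two blocks): only cluster
# starts are aligned.
clusters = [block_sizes[:3], block_sizes[3:]]
per_cluster = padding_bytes([sum(cluster) for cluster in clusters])

print(per_block, per_cluster)  # prints "86 27"
```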

To make this work, we implemented a prototype by moving the layout
computation before the final round of Propeller compilation. After the
layout is computed, basic block partitions of each function are extracted
and passed to the compiler.

For example, consider the following BB layout for a program consisting of
two functions foo (with 5 basic blocks) and bar (with a single basic block).

   foo
   foo.BB.1
   foo.BB.2
   bar
   foo.BB.3
   foo.BB.4

The extracted BB partitions are as follows:

   foo: { [foo, foo.BB.1, foo.BB.2], [foo.BB.3, foo.BB.4] }
   bar: { [bar] }
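
The extraction step can be sketched in a few lines of Python. This is only an illustration of the idea, not Propeller's implementation: it assumes that a block's owning function can be recovered from a "<function>.BB.<n>" naming scheme as in the example above, and the helper names are hypothetical.

```python
from itertools import groupby


def owner(symbol):
    """Function that a block belongs to, assuming '<func>.BB.<n>' names;
    the entry block is named after the function itself."""
    return symbol.split(".BB.")[0]


def extract_partitions(layout):
    """Split a global block layout into per-function clusters.

    Each maximal run of consecutive blocks from the same function
    becomes one cluster (one section)."""
    partitions = {}
    for func, run in groupby(layout, key=owner):
        partitions.setdefault(func, []).append(list(run))
    return partitions


layout = ["foo", "foo.BB.1", "foo.BB.2", "bar", "foo.BB.3", "foo.BB.4"]
for func, clusters in extract_partitions(layout).items():
    print(func, clusters)
```

For the layout above this yields two clusters for foo and one for bar, matching the partitions listed.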

We instruct the compiler to emit foo’s basic blocks in two sections and bar’s
single basic block in one section. The assembler applies JCC mitigation on
each of the three sections by aligning them at 32 bytes and inserting
minimal padding between instructions within every section. The only change
compared to the baseline mitigation with -function-sections is the extra
32-byte alignment emitted for foo.BB.3. However, the introduced padding
is non-executed code (it may put a small amount of pressure on the
instruction cache and TLB).

We note that the layout algorithm would scatter a function’s basic blocks
across multiple partitions judiciously and only if it is advantageous for
the performance. For intra-procedural layout, only two clusters are created
(hot and cold). Nonetheless, the non-executed paddings for clusters will
have minimal impact on performance.

On another note, better code layout could reduce the overhead of JCC
mitigation because the hot code would be packed together and the paddings
for the cold blocks will not affect the hot code.

Results

We evaluated Clang’s performance under different optimizations with and
without JCC mitigation. We used PGO + ThinLTO for all configurations. We
tested two Propeller code layouts: inter-procedural and intra-procedural.
The intra-procedural layout results in at most two clusters for every
function, while the inter-procedural layout could lead to more.

To use JCC mitigation, we use
"-Wl,-mllvm,--x86-branches-within-32B-boundaries
-mbranches-within-32B-boundaries".

We ran the clang bootstrap test 10 times for each configuration and
measured the average cpu time (user + sys in seconds).

We note that our evaluation is performed on a machine without the microcode
update installed.

                              Mitigation Enabled   Mitigation Disabled
baseline (PGO + ThinLTO)                 545.362               542.012
Propeller intra-procedural               506.828               504.861
Propeller inter-procedural               503.23                502.136

Clang's cpu time relative to the baseline, for different optimization
flavors, with and without JCC mitigation

First, JCC mitigation results in a 0.6% slowdown when applied to the
baseline. With Propeller, JCC mitigation incurs a 0.4% slowdown for the
intra-procedural layout and 0.2% for the inter-procedural layout. The
smaller slowdowns for the Propeller configurations show the impact of
better code layout: when hot and cold code are mixed together, the padding
in the cold part can put more pressure on the I-cache and I-TLB.
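
For reference, the slowdown percentages follow directly from the CPU times reported above. The short Python sketch below recomputes them; the dictionary layout and function name are just for illustration.

```python
# Average CPU time in seconds: (mitigation enabled, mitigation disabled),
# copied from the results reported above.
cpu_seconds = {
    "baseline (PGO + ThinLTO)": (545.362, 542.012),
    "Propeller intra-procedural": (506.828, 504.861),
    "Propeller inter-procedural": (503.23, 502.136),
}


def slowdown_percent(enabled, disabled):
    """Relative increase in CPU time caused by enabling the mitigation."""
    return (enabled / disabled - 1.0) * 100.0


for config, (enabled, disabled) in cpu_seconds.items():
    print(f"{config}: {slowdown_percent(enabled, disabled):.1f}%")
# baseline (PGO + ThinLTO): 0.6%
# Propeller intra-procedural: 0.4%
# Propeller inter-procedural: 0.2%
```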

Conclusion

Using BB clusters, we can reuse the assembler’s JCC mitigation with no
significant impact on performance. In fact the slowdown caused by JCC
mitigation is lower for Propeller, because of the better code layout.

Finally, we would like to stress once again that Propeller has the
potential to do a better job for problems like this JCC mitigation.
However, for this particular problem, we have shown that the assembler's
mitigation is good enough to be used along with Propeller.
