thr3ads.net - llvm dev - [llvm-dev] Proposal/patch: simple parallel LTO code generation [Aug 2015]

If this information is useful, please help other people find it:
Share via:

Peter Collingbourne via llvm-dev

2015-Aug-12 08:52 UTC

[llvm-dev] Proposal/patch: simple parallel LTO code generation

Hi all,

The most time consuming part of LTO at opt level 1 is by far the backend code
generator. (As a reminder, LTO opt level 1 runs a minimal set of passes;
it is most useful where the motivation behind the use of LTO is to deploy
a transformation that requires whole program visibility such as control
flow integrity [1], rather than to optimise the program using whole program
visibility). Code generation is in principle embarrassingly parallel, as it
can in principle be partitioned at the function granularity level, however
there are practical issues that need to be solved before we can parallelise
code generation for LTO.

The main issue is that the backend currently makes no effort to be thread safe.
This can be overcome by observing that it is unnecessary for the backend to
be thread safe if we arrange for each instance of the backend to operate in
a different LLVMContext. This is the approach that this patch proposes. The
LTO code generator partitions the combined LTO module into sub-modules, each
with its own LLVMContext, and runs the code generator on the sub-modules
in parallel. (Entities in the combined module are partitioned by taking
the modulus of the hash of the name of the entity, or its comdat if it has
one.) The resulting native object files can be combined by the linker in
the usual way.

This approach is reasonably effective. In one experiment, an LTO link of
Chromium at LTO opt level 1 on an HP Z620 machine took 15m20s without
parallelism, 8m06s with 4 partitions and 7m27s with 8 partitions.

I've attached a patch with an initial implementation of this idea for the
gold plugin. If this idea seems reasonable, I'll proceed to clean up the
patch and send it for review on llvm-commits.

Thanks,
-- 
Peter

[1] http://clang.llvm.org/docs/ControlFlowIntegrity.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: module-splitter.diff
Type: text/x-diff
Size: 14039 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150812/cba67d21/attachment.diff>

Mehdi Amini via llvm-dev

2015-Aug-12 15:31 UTC

head link

[llvm-dev] Proposal/patch: simple parallel LTO code generation

Hi Peter,

> On Aug 12, 2015, at 1:52 AM, Peter Collingbourne via llvm-dev <llvm-dev
at lists.llvm.org> wrote:
> 
> Hi all,
> 
> The most time consuming part of LTO at opt level 1 is by far the backend
code
> generator. (As a reminder, LTO opt level 1 runs a minimal set of passes;
> it is most useful where the motivation behind the use of LTO is to deploy
> a transformation that requires whole program visibility such as control
> flow integrity [1], rather than to optimise the program using whole program
> visibility). Code generation is in principle embarrassingly parallel, as it
> can in principle be partitioned at the function granularity level, however
> there are practical issues that need to be solved before we can parallelise
> code generation for LTO.
That seems definitively something I wanted to explore as I’m sure there are low
hanging fruits to get in this area, I’m glad you gave a try :)
> The main issue is that the backend currently makes no effort to be thread
safe.
> This can be overcome by observing that it is unnecessary for the backend to
> be thread safe if we arrange for each instance of the backend to operate in
> a different LLVMContext.
These two sentences don’t go well together to me: I believe an LLVMContext per
thread is always something that is needed conceptually.
But it won’t “overcome” any part of LLVM that is not thread-safe: the backend
still need to make the effort of not modifying any global state.
I have the same use case of parallel CodeGen internally and I had to fix some
cases of global mutable state here and there recently, I think I still have a
patch on Phabricator about the nulls() stream.
Something that is not completely clear to me either is if the TargetMachine and
cie intended to be used in different threads. I ended up having one
TargetMachine instance per thread (like you do in the gold-plugin).
> This is the approach that this patch proposes. The
> LTO code generator partitions the combined LTO module into sub-modules,
each
> with its own LLVMContext, and runs the code generator on the sub-modules
> in parallel. (Entities in the combined module are partitioned by taking
> the modulus of the hash of the name of the entity, or its comdat if it has
> one.) The resulting native object files can be combined by the linker in
> the usual way.
You ended up with quite a small patch for what it achieves!
It is still a shame we have to duplicate all the module when we partition it. I
wonder what if the memory overhead of your approach?

I have no idea if the way you manipulate the linkage would work in all cases,
I’m eager to hear what other will have to say about it.

For instance I’m not sure why you’re doing this:

   for (Module::const_iterator I = M->begin(), E = M->end(); I != E; ++I)
{
     Function *F = cast<Function>(VMap[I]);
+    if (!CloneDefinition(I)) {
+      F->setLinkage(GlobalValue::ExternalLinkage);


> This approach is reasonably effective. In one experiment, an LTO link of
> Chromium at LTO opt level 1 on an HP Z620 machine took 15m20s without
> parallelism, 8m06s with 4 partitions and 7m27s with 8 partitions.
Is it a machine with 24 cores? The result is already nice, I wonder if is does
not scale better because of Amdahl?

Thanks,

— 
Mehdi

> 
> I've attached a patch with an initial implementation of this idea for
the
> gold plugin. If this idea seems reasonable, I'll proceed to clean up
the
> patch and send it for review on llvm-commits.
> 
> Thanks,
> -- 
> Peter
> 
> [1] http://clang.llvm.org/docs/ControlFlowIntegrity.html
> <module-splitter.diff>_______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org         http://llvm.cs.uiuc.edu
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Peter Collingbourne via llvm-dev

2015-Aug-12 22:07 UTC

head link

[llvm-dev] Proposal/patch: simple parallel LTO code generation

On Wed, Aug 12, 2015 at 08:31:46AM -0700, Mehdi Amini
wrote:> Hi Peter,
> 
> 
> > On Aug 12, 2015, at 1:52 AM, Peter Collingbourne via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> > 
> > Hi all,
> > 
> > The most time consuming part of LTO at opt level 1 is by far the
backend code
> > generator. (As a reminder, LTO opt level 1 runs a minimal set of
passes;
> > it is most useful where the motivation behind the use of LTO is to
deploy
> > a transformation that requires whole program visibility such as
control
> > flow integrity [1], rather than to optimise the program using whole
program
> > visibility). Code generation is in principle embarrassingly parallel,
as it
> > can in principle be partitioned at the function granularity level,
however
> > there are practical issues that need to be solved before we can
parallelise
> > code generation for LTO.
> 
> That seems definitively something I wanted to explore as I’m sure there are
low hanging fruits to get in this area, I’m glad you gave a try :)
> 
> > The main issue is that the backend currently makes no effort to be
thread safe.
> > This can be overcome by observing that it is unnecessary for the
backend to
> > be thread safe if we arrange for each instance of the backend to
operate in
> > a different LLVMContext.
> 
> These two sentences don’t go well together to me: I believe an LLVMContext
per thread is always something that is needed conceptually.
This isn't necessarily true if all threads only read from the LLVMContext,
but this is much easier said than done for the backend.
> But it won’t “overcome” any part of LLVM that is not thread-safe: the
backend still need to make the effort of not modifying any global state.
> I have the same use case of parallel CodeGen internally and I had to fix
some cases of global mutable state here and there recently, I think I still have
a patch on Phabricator about the nulls() stream.
You are quite right, I hadn't realised that we may have global mutable state
in places such as I/O streams. The problem becomes much more tractable than
multiple-backends-per-LLVMContext though, and we can use tools like TSan to
help flush out any issues there.
> Something that is not completely clear to me either is if the TargetMachine
and cie intended to be used in different threads. I ended up having one
TargetMachine instance per thread (like you do in the gold-plugin).
I believe that the current TargetMachine design requires one
TargetMachine per thread (see e.g. the CodeGenInfo field:
http://llvm.org/docs/doxygen/html/Target_2TargetMachine_8h_source.html#l00090).
> 
> > This is the approach that this patch proposes. The
> > LTO code generator partitions the combined LTO module into
sub-modules, each
> > with its own LLVMContext, and runs the code generator on the
sub-modules
> > in parallel. (Entities in the combined module are partitioned by
taking
> > the modulus of the hash of the name of the entity, or its comdat if it
has
> > one.) The resulting native object files can be combined by the linker
in
> > the usual way.
> 
> You ended up with quite a small patch for what it achieves!
> It is still a shame we have to duplicate all the module when we partition
it. I wonder what if the memory overhead of your approach?
When linking Chromium the maximum RSS increases from 9.6GB at j=1 to 15.5GB
at j=4 (and, a little surprisingly, decreases to 13.4GB at j=8; I haven't
investigated why).
> I have no idea if the way you manipulate the linkage would work in all
cases, I’m eager to hear what other will have to say about it.
One issue I am aware of is that if a symbol with internal linkage shares
a name with some other symbol outside of the combined LTO module, those names
will conflict if we externalize the symbol. A probably good-enough workaround
for this would be to give each such symbol a prefix that moves it into the
reserved identifier namespace (e.g. "__llvmlto_").

A related issue that came up in Chromium was that internal symbols defined by
module inline asm are not exported to the other partitions, but this should
be fixable somehow (e.g. we control the assembler, so we can probably teach
it to export any internal symbols with our prefix).
> For instance I’m not sure why you’re doing this:
> 
>    for (Module::const_iterator I = M->begin(), E = M->end(); I != E;
++I) {
>      Function *F = cast<Function>(VMap[I]);
> +    if (!CloneDefinition(I)) {
> +      F->setLinkage(GlobalValue::ExternalLinkage);
It is invalid for an external reference to have a non-external linkage,
so we need to reset the linkage here.
> > This approach is reasonably effective. In one experiment, an LTO link
of
> > Chromium at LTO opt level 1 on an HP Z620 machine took 15m20s without
> > parallelism, 8m06s with 4 partitions and 7m27s with 8 partitions.
> 
> Is it a machine with 24 cores?
40 cores.
> The result is already nice, I wonder if is does not scale better because of
Amdahl?
Yes, there is a significant amount of single-threaded work that needs to be done
up
front by the bitcode reader, the IR linker and by gold itself. For Chromium
I measured about 4-5 minutes of single-threaded work before we spawn the
first backend thread.

Thanks,
-- 
Peter

Apparently Analagous Threads

Search for more apparently analagous threads

llvm dev - Aug 2015 - Proposal/patch: simple parallel LTO code generation

[llvm-dev] Proposal/patch: simple parallel LTO code generation

[llvm-dev] Proposal/patch: simple parallel LTO code generation

[llvm-dev] Proposal/patch: simple parallel LTO code generation

Apparently Analagous Threads