thr3ads.net - llvm dev - [llvm-dev] LLD: time to enable --threads by default [Nov 2016]

Rui Ueyama via llvm-dev

2016-Nov-16 20:44 UTC

[llvm-dev] LLD: time to enable --threads by default

LLD supports multi-threading, and it seems to be working well, as you can
see in a recent result
<http://llvm.org/viewvc/llvm-project?view=revision&revision=287140>.
In short, LLD runs 30% faster with the --threads option and more than 50%
faster if you are using --build-id (your mileage may vary depending on your
computer). However, I don't think most users even know about that, because
--threads is not a default option.

I'm thinking of enabling --threads by default. We now have real users, and
they'll be happy about the performance boost.

Any concerns?

I can't think of any problems with that, but I want to add a few notes:

 - We still need to focus on single-threaded performance rather than
multi-threaded performance, because it is hard to make a slow program
faster just by using more threads.

 - We shouldn't do "too clever" things with threads. Currently, we use
multiple threads in only two places that are highly parallelizable by
nature (namely, copying and applying relocations for each input section,
and computing the build-id hash). We are using parallel_for_each, and that
is very simple and easy to understand. I believe this was the right design
choice, and I don't think we want to have something like the
workqueues/tasks in GNU gold, for example.

 - Run benchmarks with --no-threads if you are not focusing on
multi-threaded performance.
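
As a rough illustration of the "simple parallel loop" style described in the
notes above, here is a minimal Python sketch. It is not LLD's code (LLD is
C++ and uses parallel_for_each); the names `apply_relocations`,
`link_sections` and `sections` are hypothetical stand-ins. The point is only
that each input section is handled independently, so a plain parallel map is
enough and no task graph or work queue is needed.

```python
from concurrent.futures import ThreadPoolExecutor

def apply_relocations(section: bytes) -> bytes:
    # Hypothetical stand-in for "copy a section and patch it":
    # here we just return an upper-cased copy of the bytes.
    return section.upper()

def link_sections(sections, use_threads=True):
    # Every section is independent, so a parallel map is sufficient.
    if not use_threads:
        return [apply_relocations(s) for s in sections]
    with ThreadPoolExecutor() as pool:
        # map() preserves input order, so the result is deterministic.
        return list(pool.map(apply_relocations, sections))

sections = [b"text", b"data", b"rodata"]
threaded = link_sections(sections, use_threads=True)
serial = link_sections(sections, use_threads=False)
```

Both paths should produce the same output; only the scheduling differs.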

Rafael Espíndola via llvm-dev

2016-Nov-16 20:52 UTC

I will do a quick benchmark run.

Other than the observations you made, my only concern is the situation
where many lld invocations run in parallel, as in an LLVM build where
there are many outputs in bin/. Our task system doesn't know about load,
so I worry that it might degrade performance in that case.

Cheers,
Rafael



Renato Golin via llvm-dev

2016-Nov-16 20:55 UTC

On 16 November 2016 at 20:44, Rui Ueyama via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> I'm thinking to enable --threads by default. We now have real users, and
> they'll be happy about the performance boost.
Will it detect single-core computers and disable it? What is the
minimum number of threads that can run in that mode?

Is the penalty on dual core computers less than the gains? If you
could have a VM with only two cores, where the OS is running on one
and LLD threads are running on both, it'd be good to measure the
downgrade.

Rafael's concern is also very real. I/O and memory consumption are
important factors on small footprint systems, though I'd be happy to
have a different default per architecture or even carry the burden of
forcing a --no-threads option every run if the benefits are
substantial.

If those issues are not a concern, then I'm in favour!

>  - We still need to focus on single-thread performance rather than
> multi-threaded one because it is hard to make a slow program faster just by
> using more threads.
Agreed.

>  - We shouldn't do "too clever" things with threads. Currently, we are using
> multi-threads only at two places where they are highly parallelizable by
> nature (namely, copying and applying relocations for each input section, and
> computing build-id hash). We are using parallel_for_each, and that is very
> simple and easy to understand. I believe this was a right design choice, and
> I don't think we want to have something like workqueues/tasks in GNU gold,
> for example.
Strongly agreed.

cheers,
--renato

Rui Ueyama via llvm-dev

2016-Nov-16 21:27 UTC

On Wed, Nov 16, 2016 at 12:55 PM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 16 November 2016 at 20:44, Rui Ueyama via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> > I'm thinking to enable --threads by default. We now have real users, and
> > they'll be happy about the performance boost.
>
> Will it detect single-core computers and disable it? What is the
> minimum number of threads that can run in that mode?
>
> Is the penalty on dual core computers less than the gains? If you
> could have a VM with only two cores, where the OS is running on one
> and LLD threads are running on both, it'd be good to measure the
> downgrade.
>
As a quick test, I ran the benchmark again with "taskset -c 0" to use only
one core. LLD still spawns 40 threads because my machine has 40 cores (20
physical cores), so 40 threads ran on one core.

With --no-threads (one thread on a single core), it took 6.66 seconds to
self-link. With --threads (40 threads on a single core), it took 6.70
seconds. I guess the difference is mostly within the margin of error, so I
think it wouldn't hurt single-core machines.
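
To put the two timings above in relative terms, here is a small worked
calculation in Python. The helper name `relative_slowdown` is made up for
this sketch; the two inputs are the self-link times quoted above.

```python
def relative_slowdown(baseline_s: float, candidate_s: float) -> float:
    # Fraction by which the candidate is slower than the baseline
    # (negative if the candidate is faster).
    return (candidate_s - baseline_s) / baseline_s

no_threads_s = 6.66  # --no-threads, one thread on one core
threads_s = 6.70     # --threads, 40 threads on one core
overhead = relative_slowdown(no_threads_s, threads_s)
print(f"{overhead:.2%}")
```

That works out to roughly 0.6%, which is small enough that a single run
cannot distinguish it from noise.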

Rafael may be running his benchmarks and will bring his results.


Rui Ueyama via llvm-dev

2016-Nov-16 21:29 UTC

On Wed, Nov 16, 2016 at 12:55 PM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 16 November 2016 at 20:44, Rui Ueyama via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> > I'm thinking to enable --threads by default. We now have real users, and
> > they'll be happy about the performance boost.
>
> Will it detect single-core computers and disable it? What is the
> minimum number of threads that can run in that mode?
>
> Is the penalty on dual core computers less than the gains? If you
> could have a VM with only two cores, where the OS is running on one
> and LLD threads are running on both, it'd be good to measure the
> downgrade.
>
> Rafael's concern is also very real. I/O and memory consumption are
> important factors on small footprint systems, though I'd be happy to
> have a different default per architecture or even carry the burden of
> forcing a --no-threads option every run if the benefits are
> substantial.
>
On such a computer, you don't want to enable threads at all, no? If so, you
can build LLVM without LLVM_ENABLE_THREADS.


Rafael Espíndola via llvm-dev

2016-Nov-16 21:46 UTC

On 16 November 2016 at 15:52, Rafael Espíndola
<rafael.espindola at gmail.com> wrote:
> I will do a quick benchmark run.

On a Mac Pro (running Linux), these are the results I got with all cores available:

firefox
  master 7.146418217
  patch  5.304271767 1.34729488437x faster
firefox-gc
  master 7.316743822
  patch  5.46436812 1.33899174824x faster
chromium
  master 4.265597914
  patch  3.972218527 1.07385781648x faster
chromium fast
  master 1.823614026
  patch  1.686059427 1.08158348205x faster
the gold plugin
  master 0.340167513
  patch  0.318601465 1.06768973269x faster
clang
  master 0.579914119
  patch  0.520784947 1.11353855817x faster
llvm-as
  master 0.03323043
  patch  0.041571719 1.251013574x slower
the gold plugin fsds
  master 0.36675887
  patch  0.350970944 1.04498356992x faster
clang fsds
  master 0.656180056
  patch  0.591607603 1.10914743602x faster
llvm-as fsds
  master 0.030324313
  patch  0.040045353 1.32056917497x slower
scylla
  master 3.23378908
  patch  2.019191831 1.60152642773x faster

With only 2 cores:

firefox
  master 7.174839911
  patch  6.319808477 1.13529388384x faster
firefox-gc
  master 7.345525844
  patch  6.493005841 1.13129820362x faster
chromium
  master 4.180752414
  patch  4.129515199 1.01240756179x faster
chromium fast
  master 1.847296843
  patch  1.78837299 1.0329483018x faster
the gold plugin
  master 0.341725451
  patch  0.339943222 1.0052427255x faster
clang
  master 0.581901114
  patch  0.566932481 1.02640284955x faster
llvm-as
  master 0.03381059
  patch  0.036671392 1.08461260215x slower
the gold plugin fsds
  master 0.369184003
  patch  0.368774353 1.00111084189x faster
clang fsds
  master 0.660120583
  patch  0.641040511 1.02976422187x faster
llvm-as fsds
  master 0.031074029
  patch  0.035421531 1.13990789543x slower
scylla
  master 3.243011681
  patch  2.630991522 1.23261958615x faster


With only 1 core:

firefox
  master 7.174323116
  patch  7.301968002 1.01779190649x slower
firefox-gc
  master 7.339104117
  patch  7.466171668 1.01731376868x slower
chromium
  master 4.176958448
  patch  4.188387233 1.00273615003x slower
chromium fast
  master 1.848922713
  patch  1.858714219 1.00529578978x slower
the gold plugin
  master 0.342383846
  patch  0.347106743 1.01379415838x slower
clang
  master 0.582476955
  patch  0.600524655 1.03098440178x slower
llvm-as
  master 0.033248459
  patch  0.035622988 1.07141771593x slower
the gold plugin fsds
  master 0.369510236
  patch  0.376390506 1.01861997133x slower
clang fsds
  master 0.661267753
  patch  0.683417482 1.03349585535x slower
llvm-as fsds
  master 0.030574688
  patch  0.033052779 1.08105041006x slower
scylla
  master 3.236604638
  patch  3.325831407 1.02756801617x slower

Given that we have an improvement even with just two cores available, LGTM.
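
For anyone reading the tables above, the "Nx faster/slower" figures appear to
be simple ratios of the two timings (master/patch when the patch is faster,
patch/master when it is slower). A minimal Python sketch of that convention
follows; the function name `compare` is made up here, and the inputs are two
rows from the all-cores table.

```python
def compare(master_s: float, patch_s: float) -> str:
    # Label a pair of timings in the same style as the tables above.
    if patch_s <= master_s:
        return f"{master_s / patch_s:.4f}x faster"
    return f"{patch_s / master_s:.4f}x slower"

firefox = compare(7.146418217, 5.304271767)  # all cores, firefox
llvm_as = compare(0.03323043, 0.041571719)   # all cores, llvm-as
print(firefox)
print(llvm_as)
```

Note that the llvm-as links take only a few hundredths of a time unit, so
their large relative slowdown corresponds to a very small absolute cost.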

Cheers,
Rafael

Joerg Sonnenberger via llvm-dev

2016-Nov-17 01:15 UTC

On Wed, Nov 16, 2016 at 12:44:46PM -0800, Rui Ueyama via llvm-dev wrote:
> I'm thinking to enable --threads by default. We now have real users, and
> they'll be happy about the performance boost.
> 
> Any concerns?
What is the total time consumed, not just the real time? When building
a large project, linking is often done in parallel with other tasks, so
wasting a lot of CPU to save a bit of real time is not necessarily a net
win.
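
One way to look at both numbers at once is to record wall-clock time and the
CPU time charged to the child process. The following is a minimal Python
sketch under that assumption; `measure` is a hypothetical helper, and the
command shown is only a placeholder for a real link command.

```python
import os
import subprocess
import sys
import time

def measure(cmd):
    # Returns (wall_seconds, cpu_seconds) for one child process.
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # children_user/children_system accumulate the CPU time of children
    # that have been waited for (they are always 0 on Windows).
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system)
    return wall, cpu

# Placeholder command; substitute the link command being measured.
wall, cpu = measure([sys.executable, "-c", "pass"])
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

If the CPU time grows much faster than the wall time shrinks when threads are
enabled, the link is trading throughput of the rest of the build for latency.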

Joerg

Rui Ueyama via llvm-dev

2016-Nov-17 01:26 UTC

Did you see this
http://llvm.org/viewvc/llvm-project?view=revision&revision=287140 ?
Interpreting these numbers may be tricky because of hyper threading, though.


Sean Silva via llvm-dev

2016-Nov-23 07:41 UTC

On Wed, Nov 16, 2016 at 12:44 PM, Rui Ueyama via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>  - We shouldn't do "too clever" things with threads. Currently, we are
> using multi-threads only at two places where they are highly parallelizable
> by nature (namely, copying and applying relocations for each input section,
> and computing build-id hash). We are using parallel_for_each, and that is
> very simple and easy to understand. I believe this was a right design
> choice, and I don't think we want to have something like workqueues/tasks
> in GNU gold, for example.
>
Sorry for the late response.

Copying and applying relocations is actually not as parallelizable as
you would imagine in current LLD. The reason is that there is an implicit
serialization when mutating the kernel's VA map (which happens any time
there is a minor page fault, i.e. the first time you touch a page of an
mmap'd input). Since threads share the same VA, there is an implicit
serialization across them. Separate processes are needed to avoid this
overhead (note that the separate processes would still have the same output
file mapped, so, at least with fixed partitioning, there is no need for
complex IPC).

For `ld.lld -O0` on a Mac host, I measured <1 GB/s copying speed, even
though the machine I was running on had something like 50 GB/s of DRAM
bandwidth. The VA overhead is therefore on the order of a 50x slowdown for
this copying operation in this extreme case, and Amdahl's law indicates
that there will be practically no speedup for this copy operation from
adding more threads. I've also DTrace'd this and saw massive contention on
the VA lock. Linux will be better, but no matter how good it is, it is
still a serialization point, and Amdahl's law will limit your speedup
significantly.
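
For reference, Amdahl's law gives the speedup as 1 / ((1 - p) + p / n),
where p is the fraction of the work that can run in parallel and n is the
number of workers. The Python sketch below just evaluates that formula. The
fractions are illustrative assumptions, not measurements from LLD; 0.02
loosely corresponds to the "50x" case above, where only about 1/50 of the
copy time would be free of the serialization point.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    # Amdahl's law: S = 1 / ((1 - p) + p / n)
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

# Illustrative fractions only; they are not measured values.
for p in (0.02, 0.5, 0.95):
    print(p, round(amdahl_speedup(p, 40), 2))
```

Even with 40 workers, the speedup stays close to 1x whenever the serialized
part dominates.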

-- Sean Silva


Rafael Espíndola via llvm-dev

2016-Nov-23 14:31 UTC

Interesting. It might be worth giving another try to the idea of creating
the output file in anonymous memory and using a write to output it.
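
As a rough sketch of that idea (not LLD code), the Python below assembles
the output image in an anonymous mapping and then emits it with a single
write, instead of mapping the output file itself. The helper name
`write_via_anonymous_buffer` and the (offset, bytes) chunk layout are
assumptions made for this example.

```python
import mmap
import os
import tempfile

def write_via_anonymous_buffer(path, size, chunks):
    # chunks: iterable of (offset, bytes) pairs standing in for sections.
    buf = mmap.mmap(-1, size)  # anonymous mapping, not backed by a file
    try:
        for offset, data in chunks:
            buf[offset:offset + len(data)] = data
        with open(path, "wb") as f:
            f.write(buf)  # one write of the whole image
    finally:
        buf.close()

out_path = os.path.join(tempfile.mkdtemp(), "a.out")
write_via_anonymous_buffer(out_path, 8, [(0, b"ABCD"), (4, b"WXYZ")])
```

Whether this actually avoids the contention described above would have to be
measured; the sketch only shows the shape of the approach.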

Cheers,
Rafael

