Rui Ueyama via llvm-dev
2016-Nov-17 18:00 UTC
[llvm-dev] LLD: time to enable --threads by default
On Thu, Nov 17, 2016 at 9:50 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
> On Nov 17, 2016, at 9:41 AM, Rui Ueyama via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> On Thu, Nov 17, 2016 at 6:12 AM, Teresa Johnson via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>> On Thu, Nov 17, 2016 at 4:11 AM, Rafael Espíndola via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>> > Sounds like threading isn't beneficial much beyond the second CPU...
>>>> > Maybe blindly creating one thread per core isn't the best plan...
>>>>
>>>> parallel.h is pretty simplistic at the moment. Currently it creates
>>>> one thread per SMT. One per core, and being lazy about it, would
>>>> probably be a good thing, but threading is already beneficial, and
>>>> improving parallel.h would be a welcome improvement.
>>>
>>> Instead of using std::thread::hardware_concurrency (which is one per
>>> SMT), you may be interested in using the facility I added for setting
>>> the default ThinLTO backend parallelism so that one thread per physical
>>> core is created: llvm::heavyweight_hardware_concurrency() (see D25585
>>> and r284390). The name is meant to indicate that this is the
>>> concurrency that should be used for heavier-weight tasks (that may use
>>> a lot of memory, for example).
>>
>> Sorry for my ignorance, but what's the point of running the same number
>> of threads as the number of physical cores instead of HT virtual cores?
>> If we can get better throughput by not running more than one thread per
>> physical core, it feels like HT is a useless technology.
>
> It depends on the use case: with ThinLTO we scale linearly with the
> number of physical cores. When you go over the number of physical cores
> you still get some improvement, but it is no longer linear.
> The profitability question is a trade-off: for example, if each of your
> tasks is very memory-intensive, you may not want to overcommit the cores
> or increase the ratio of available memory per physical core.
>
> To take some numbers as an example: if your average user has an 8GB
> machine with 4 cores (8 virtual cores with HT), and you know that each
> of your parallel tasks consumes 1.5GB of memory on average, then having
> 4 parallel worker threads to process your tasks leads to a peak memory
> use of 6GB, while having 8 parallel threads leads to a peak of 12GB and
> the machine starts to swap.
>
> Another consideration is that having the linker spawn threads behind the
> back of the build system isn't great: the build system is supposed to
> exploit the parallelism. If it spawns 10 linker jobs in parallel, how
> many threads are competing for the hardware?
>
> So, HT is not useless, but it is not universally applicable or
> universally efficient in the same way.
>
> Hope it makes sense!

Thank you for the explanation! That makes sense.

Unlike ThinLTO, each thread in LLD consumes a very small amount of memory
(probably just a few megabytes), so that's not a problem for me. At the
final stage of linking, we spawn threads to copy section contents and
apply relocations, and I guess that causes a lot of memory traffic because
it is basically memcpy'ing input files into an output file, so memory
bandwidth could be a limiting factor there. But I do not see a reason to
limit the number of threads to the number of physical cores. For LLD, it
seems like we can just spawn as many threads as HT provides.
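The final-stage pattern Rui describes above - worker threads that memcpy
section contents into the output file and then patch in relocations - can
be illustrated with a minimal sketch. This is not lld's actual code: the
Section struct, the round-robin partitioning, and the applyRelocations()
placeholder are assumptions made only for the example.

    // Minimal sketch of a parallel copy-and-relocate stage (not lld's code).
    #include <cstring>
    #include <thread>
    #include <vector>

    struct Section {
      const char *Data; // section contents mapped from an input file
      char *Out;        // destination inside the mmap'ed output file
      size_t Size;
      // Relocation records omitted for brevity.
    };

    // Hypothetical placeholder for the target-specific relocation logic.
    static void applyRelocations(Section &Sec) { (void)Sec; /* patch Sec.Out */ }

    static void writeSections(std::vector<Section> &Sections, unsigned NumThreads) {
      if (NumThreads == 0)
        NumThreads = 1;
      std::vector<std::thread> Workers;
      for (unsigned T = 0; T < NumThreads; ++T) {
        Workers.emplace_back([&Sections, NumThreads, T] {
          // Static round-robin partitioning: thread T handles sections
          // T, T + NumThreads, T + 2*NumThreads, ...
          for (size_t I = T; I < Sections.size(); I += NumThreads) {
            std::memcpy(Sections[I].Out, Sections[I].Data, Sections[I].Size);
            applyRelocations(Sections[I]);
          }
        });
      }
      for (std::thread &W : Workers)
        W.join();
    }

With a static partitioning like this, the work per section is dominated by
memory traffic rather than computation, which is why memory bandwidth,
rather than core count, tends to become the limiting factor Rui mentions.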
Teresa Johnson via llvm-dev
2016-Nov-17 18:08 UTC
[llvm-dev] LLD: time to enable --threads by default
On Thu, Nov 17, 2016 at 10:00 AM, Rui Ueyama <ruiu at google.com> wrote:
> [...]
>
> Thank you for the explanation! That makes sense.
>
> Unlike ThinLTO, each thread in LLD consumes a very small amount of memory
> (probably just a few megabytes), so that's not a problem for me. At the
> final stage of linking, we spawn threads to copy section contents and
> apply relocations, and I guess that causes a lot of memory traffic because
> it is basically memcpy'ing input files into an output file, so memory
> bandwidth could be a limiting factor there. But I do not see a reason to
> limit the number of threads to the number of physical cores. For LLD, it
> seems like we can just spawn as many threads as HT provides.
Ok, sure - I was just suggesting it based on Rafael's comment above about
lld currently creating one thread per SMT and possibly wanting one per
core instead. It will definitely depend on the characteristics of your
parallel tasks. That is why the name of the interface was changed to
include "heavyweight", i.e. large, memory-intensive work, since the
implementation may return something other than the number of physical
cores on other architectures - right now it is only implemented for x86
and otherwise returns thread::hardware_concurrency().

Teresa

--
Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413
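A rough sketch of the contract Teresa describes - one thread per physical
core where the topology can be detected, falling back to
thread::hardware_concurrency() elsewhere - might look like the following.
This is not LLVM's implementation; the /proc/cpuinfo parsing is a
hypothetical, Linux-only stand-in for the x86 topology detection in the
actual patch.

    // Sketch only: detect physical cores, else fall back to the SMT count.
    #include <fstream>
    #include <set>
    #include <string>
    #include <thread>
    #include <utility>

    // Hypothetical Linux-only detection: count unique (physical id, core id)
    // pairs in /proc/cpuinfo. Returns 0 if the file is missing or unparsable.
    static unsigned getPhysicalCoreCount() {
      std::ifstream CpuInfo("/proc/cpuinfo");
      std::set<std::pair<int, int>> Cores;
      int PhysId = -1;
      std::string Line;
      while (std::getline(CpuInfo, Line)) {
        if (Line.compare(0, 11, "physical id") == 0)
          PhysId = std::stoi(Line.substr(Line.find(':') + 1));
        else if (Line.compare(0, 7, "core id") == 0)
          Cores.insert({PhysId, std::stoi(Line.substr(Line.find(':') + 1))});
      }
      return static_cast<unsigned>(Cores.size());
    }

    unsigned heavyweightConcurrencySketch() {
      if (unsigned Physical = getPhysicalCoreCount())
        return Physical;                            // one per physical core
      return std::thread::hardware_concurrency();   // fallback: one per SMT thread
    }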
Rafael Espíndola via llvm-dev
2016-Nov-17 21:20 UTC
[llvm-dev] LLD: time to enable --threads by default
> Unlike ThinLTO, each thread in LLD consumes a very small amount of memory
> (probably just a few megabytes), so that's not a problem for me. [...]
> But I do not see a reason to limit the number of threads to the number of
> physical cores. For LLD, it seems like we can just spawn as many threads
> as HT provides.

It is quite common for SMT to *not* be profitable. I did notice some small
wins by not using it. On an Intel machine you can do a quick check by
running with half the threads, since they always have 2x SMT.

Cheers,
Rafael
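Rafael's quick check - halving the thread count to approximate one thread
per physical core on 2-way SMT Intel parts - is easy to express as a
sketch; the helper name here is made up for the example.

    #include <algorithm>
    #include <thread>

    unsigned threadsWithoutSMT() {
      unsigned SMTThreads = std::thread::hardware_concurrency(); // one per HT thread
      return std::max(1u, SMTThreads / 2);                       // assume 2-way SMT
    }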
Davide Italiano via llvm-dev
2016-Nov-18 02:30 UTC
[llvm-dev] LLD: time to enable --threads by default
On Thu, Nov 17, 2016 at 1:20 PM, Rafael Espíndola via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> It is quite common for SMT to *not* be profitable. I did notice some small
> wins by not using it. On an Intel machine you can do a quick check by
> running with half the threads, since they always have 2x SMT.

I had the same experience.
Ideally I would like to have a way to override the number of threads used
by the linker. gold has a plethora of options for doing that, i.e.

  --thread-count COUNT          Number of threads to use
  --thread-count-initial COUNT  Number of threads to use in initial pass
  --thread-count-middle COUNT   Number of threads to use in middle pass
  --thread-count-final COUNT    Number of threads to use in final pass

I don't think we need the full generality/flexibility of
initial/middle/final, but --thread-count could be useful (at least for
experimenting). The current interface of `parallel_for_each` doesn't allow
specifying the number of threads to run, so, assuming lld goes that route
(it may not), it should be extended accordingly; a possible shape for such
an extension is sketched below.

--
Davide

"There are no solved problems; there are only problems that are more or
less solved" -- Henri Poincare
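The kind of extension Davide has in mind could look roughly like the
sketch below: a parallel_for_each-style helper that takes an explicit
thread count (for example, fed from a hypothetical --thread-count option)
instead of always using the hardware default. This is not lld's actual
parallel.h interface, only an illustration.

    // Sketch of a parallel_for_each variant with an explicit thread count.
    #include <algorithm>
    #include <iterator>
    #include <thread>
    #include <vector>

    template <class RandomIt, class UnaryFunc>
    void parallel_for_each_n(RandomIt Begin, RandomIt End, unsigned NumThreads,
                             UnaryFunc Fn) {
      size_t Count = std::distance(Begin, End);
      if (NumThreads <= 1 || Count <= 1) {
        for (RandomIt I = Begin; I != End; ++I)
          Fn(*I);                       // serial fallback
        return;
      }
      // Split the range into contiguous chunks, one per worker thread.
      size_t Chunk = (Count + NumThreads - 1) / NumThreads;
      std::vector<std::thread> Workers;
      for (unsigned T = 0; T < NumThreads; ++T) {
        size_t Lo = std::min(Count, T * Chunk);
        size_t Hi = std::min(Count, Lo + Chunk);
        if (Lo >= Hi)
          break;
        Workers.emplace_back([=] {
          for (size_t I = Lo; I < Hi; ++I)
            Fn(*(Begin + I));
        });
      }
      for (std::thread &W : Workers)
        W.join();
    }

A caller could then write, for example,
parallel_for_each_n(Sections.begin(), Sections.end(), ThreadCount,
writeSection), with ThreadCount defaulting to the hardware value when no
--thread-count-style option is given (names hypothetical).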