Rafael,

This is very good information and extremely useful.

On 3/12/2015 11:49 AM, Rafael Espíndola wrote:
> I tried benchmarking it on linux by linking clang Release+asserts (but
> lld itself with no asserts). The first things I noticed were:
>
> missing options:
>
> warning: ignoring unknown argument: --no-add-needed
> warning: ignoring unknown argument: -O3
> warning: ignoring unknown argument: --gc-sections
>
> I just removed them from the command line.
>
> Looks like --hash-style=gnu and --build-id are just ignored, so I
> removed them too.
>
> Looks like --strip-all is ignored, so I removed it and ran strip manually.
>
> Looks like .note.GNU-stack is incorrectly added; neither gold nor
> bfd.ld adds it for clang.
>
> Looks like .gnu.version and .gnu.version_r are not implemented.
>
> Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
> it is not included in .got.

I have a fix for this. Will merge it.

> Gold produces a .data.rel.ro.local. lld produces a .data.rel.local.
> bfd puts everything in .data.rel. I have to research a bit to find out
> what this is. For now I just added the sizes into a single entry.
>
> .eh_frame_hdr is effectively empty on lld. I removed --eh-frame-hdr
> from the command line.
>
> With all that, the sections that increased in size the most when using
> lld were:
>
> .rodata:       9 449 278 bytes bigger
> .eh_frame:       438 376 bytes bigger
> .comment:         77 797 bytes bigger
> .data.rel.ro:     48 056 bytes bigger

Did you try --merge-strings and --gc-sections with lld?

> The comment section is bigger because it has multiple copies of
>
> clang version 3.7.0 (trunk 232021) (llvm/trunk 232027)
>
> The lack of duplicate entry merging would also explain the size
> difference of .rodata and .eh_frame. No idea why .data.rel.ro is
> bigger.
>
> So, with the big warning that both linkers are not doing exactly the
> same thing, the performance numbers I got were:
>
> lld:
>
>     1961.842991 task-clock (msec)        #  0.999 CPUs utilized           ( +- 0.04% )
>           1,152 context-switches         #  0.587 K/sec
>               0 cpu-migrations           #  0.000 K/sec                   ( +-100.00% )
>         199,310 page-faults              #  0.102 M/sec                   ( +- 0.00% )
>   5,893,291,145 cycles                   #  3.004 GHz                     ( +- 0.03% )
>   3,329,741,079 stalled-cycles-frontend  # 56.50% frontend cycles idle    ( +- 0.05% )
> <not supported> stalled-cycles-backend
>   6,255,727,902 instructions             #  1.06 insns per cycle
>                                          #  0.53 stalled cycles per insn  ( +- 0.01% )
>   1,295,893,191 branches                 # 660.549 M/sec                  ( +- 0.01% )
>      26,760,734 branch-misses            #  2.07% of all branches         ( +- 0.01% )
>
>     1.963705923 seconds time elapsed                                      ( +- 0.04% )
>
> gold:
>
>      990.708786 task-clock (msec)        #  0.999 CPUs utilized           ( +- 0.06% )
>               0 context-switches         #  0.000 K/sec
>               0 cpu-migrations           #  0.000 K/sec                   ( +-100.00% )
>          77,840 page-faults              #  0.079 M/sec
>   2,976,552,629 cycles                   #  3.004 GHz                     ( +- 0.02% )
>   1,384,720,988 stalled-cycles-frontend  # 46.52% frontend cycles idle    ( +- 0.04% )
> <not supported> stalled-cycles-backend
>   4,105,948,264 instructions             #  1.38 insns per cycle
>                                          #  0.34 stalled cycles per insn  ( +- 0.00% )
>     868,894,366 branches                 # 877.043 M/sec                  ( +- 0.00% )
>      15,426,051 branch-misses            #  1.78% of all branches         ( +- 0.01% )
>
>     0.991619294 seconds time elapsed                                      ( +- 0.06% )
>
> The biggest difference that shows up is that lld has 1,152 context
> switches, but the cpu utilization is still < 1. Maybe there is just a
> threading bug somewhere?

lld apparently is highly multithreaded, but I see your point. Maybe
trying this exercise on /dev/shm would show more CPU utilization?
Shankar Easwaran

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation
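(Aside for readers of the archive: the "duplicate entry merging" discussed
above is roughly what merging of SHF_MERGE|SHF_STRINGS string sections such
as .comment amounts to. Below is a minimal C++ sketch of the idea, using a
hypothetical MergedStringSection helper rather than lld's real data
structures: every input string is looked up in a map and only unseen strings
are appended to the output, which is why a merged .comment ends up with a
single "clang version ..." line instead of one copy per object file.)

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative only; not lld code.
class MergedStringSection {
public:
  // Returns the output offset for Str, appending it only if unseen.
  size_t add(const std::string &Str) {
    auto It = OffsetOf.find(Str);
    if (It != OffsetOf.end())
      return It->second;                  // Duplicate: reuse existing copy.
    size_t Offset = Contents.size();
    Contents.insert(Contents.end(), Str.begin(), Str.end());
    Contents.push_back('\0');             // Keep strings NUL-terminated.
    OffsetOf.emplace(Str, Offset);
    return Offset;
  }

  const std::vector<char> &data() const { return Contents; }

private:
  std::vector<char> Contents;             // Bytes of the output section.
  std::unordered_map<std::string, size_t> OffsetOf;
};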
>> Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
>> it is not included in .got.
>
> I have a fix for this. Will merge it.

Thanks.

>> .rodata:       9 449 278 bytes bigger
>> .eh_frame:       438 376 bytes bigger
>> .comment:         77 797 bytes bigger
>> .data.rel.ro:     48 056 bytes bigger
>
> Did you try --merge-strings and --gc-sections with lld?

I got

warning: ignoring unknown argument: --gc-sections

I will do a run with --merge-strings. This should probably be the
default to match other ELF linkers.

>> The biggest difference that shows up is that lld has 1,152 context
>> switches, but the cpu utilization is still < 1. Maybe there is just a
>> threading bug somewhere?
>
> lld apparently is highly multithreaded, but I see your point. Maybe
> trying this exercise on /dev/shm would show more CPU utilization?

Yes, the number just under 1 CPU utilized is very suspicious. As Rui
points out, there is probably some issue in the threading
implementation on Linux. One interesting experiment would be timing
gold and lld linking ELF on Windows (but I only have a Windows VM and
no idea what the "perf" equivalent is on Windows).

I forgot to mention, the tests were run on tmpfs already.

Cheers,
Rafael
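(A small aside on the "many context switches but < 1 CPU utilized"
observation: that is the pattern you get when threads exist but end up
serialized, for example on one lock. The toy program below is only an
illustration of that pattern and has nothing to do with lld's actual code:
eight threads do all of their work under a single mutex, so perf stat will
report roughly one CPU utilized together with a pile of context switches.)

#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
  std::mutex M;
  unsigned long Sink = 0;
  // Each worker does all of its "work" inside one global critical
  // section, so only one thread makes progress at any time.
  auto Work = [&] {
    for (int I = 0; I < 500; ++I) {
      std::lock_guard<std::mutex> Lock(M);
      for (int J = 0; J < 100000; ++J)
        Sink += J;
    }
  };
  auto Start = std::chrono::steady_clock::now();
  std::vector<std::thread> Threads;
  for (int T = 0; T < 8; ++T)
    Threads.emplace_back(Work);
  for (auto &Th : Threads)
    Th.join();
  std::chrono::duration<double> Wall =
      std::chrono::steady_clock::now() - Start;
  // Run under `perf stat ./a.out`: task-clock stays close to wall time
  // (~1 CPU utilized) while context switches pile up, despite 8 threads.
  std::printf("wall time %.2fs, sink %lu\n", Wall.count(), Sink);
}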
On Fri, Mar 13, 2015 at 10:15 AM, Rafael Espíndola
<rafael.espindola at gmail.com> wrote:
>>> Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
>>> it is not included in .got.
>>
>> I have a fix for this. Will merge it.
>
> Thanks.
>
>>> .rodata:       9 449 278 bytes bigger
>>> .eh_frame:       438 376 bytes bigger
>>> .comment:         77 797 bytes bigger
>>> .data.rel.ro:     48 056 bytes bigger
>>
>> Did you try --merge-strings and --gc-sections with lld?
>
> I got
>
> warning: ignoring unknown argument: --gc-sections
>
> I will do a run with --merge-strings. This should probably be the
> default to match other ELF linkers.
>

Unfortunately, --gc-sections isn't implemented in the GNU driver. I
tried to enable it, but I hit quite a few issues that I'm slowly
fixing. At the time of writing the Resolver reclaims live atoms.

>>> The biggest difference that shows up is that lld has 1,152 context
>>> switches, but the cpu utilization is still < 1. Maybe there is just a
>>> threading bug somewhere?
>>
>> lld apparently is highly multithreaded, but I see your point. Maybe
>> trying this exercise on /dev/shm would show more CPU utilization?
>
> Yes, the number just under 1 CPU utilized is very suspicious. As Rui
> points out, there is probably some issue in the threading
> implementation on Linux. One interesting experiment would be timing
> gold and lld linking ELF on Windows (but I only have a Windows VM and
> no idea what the "perf" equivalent is on Windows).
>
> I forgot to mention, the tests were run on tmpfs already.
>

I think we can make an effort to reduce the number of context
switches. In particular, we might try to switch to a model where a
task is the basic unit of computation and a pool of worker threads is
responsible for executing these tasks. This way we can tune the number
of threads fighting for the CPU at the same time, with a reasonable
default that can be overridden by the user via command-line options.
That said, since this would require some substantial changes, I
wouldn't go down that path until we have strong evidence that the
change will improve performance significantly. I feel that while
context switches may have some impact on the final numbers, they will
hardly account for a large part of the performance loss.

Another thing that comes to mind is that the relatively high number of
context switches might be an effect of lock contention. If somebody
has access to a VTune license and can run 'lock analysis' on it, that
would be greatly appreciated.

I don't have a Linux laptop/setup, but I'll try to collect some
numbers on FreeBSD and investigate further over the weekend.

Thanks,

--
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
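(To make the task/worker-pool idea above concrete, here is a bare-bones C++
sketch. It is only an illustration of the model being proposed, not lld's
scheduler and not a patch: tasks go into a queue and a fixed, tunable number
of worker threads drain it, so the number of threads competing for the CPU
is bounded by whatever the pool was created with, e.g. a value taken from a
command-line option.)

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative only; not lld code.
class TaskPool {
public:
  explicit TaskPool(unsigned NumWorkers) {
    for (unsigned I = 0; I < NumWorkers; ++I)
      Workers.emplace_back([this] { run(); });
  }

  ~TaskPool() {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      Done = true;
    }
    CV.notify_all();
    for (auto &W : Workers)
      W.join();                     // Remaining tasks are drained first.
  }

  void enqueue(std::function<void()> Task) {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      Tasks.push(std::move(Task));
    }
    CV.notify_one();
  }

private:
  void run() {
    for (;;) {
      std::function<void()> Task;
      {
        std::unique_lock<std::mutex> Lock(Mutex);
        CV.wait(Lock, [this] { return Done || !Tasks.empty(); });
        if (Tasks.empty())
          return;                   // Done and no work left.
        Task = std::move(Tasks.front());
        Tasks.pop();
      }
      Task();                       // Run the task outside the lock.
    }
  }

  std::vector<std::thread> Workers;
  std::queue<std::function<void()>> Tasks;
  std::mutex Mutex;
  std::condition_variable CV;
  bool Done = false;
};

A hypothetical driver would create something like TaskPool Pool(NumThreads)
once, with NumThreads coming from an option or a sane default, and enqueue
one task per independent piece of link work.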
> I will do a run with --merge-strings. This should probably be the
> default to match other ELF linkers.

Trying --merge-strings with today's trunk I got:

* .comment got 77 797 bytes smaller.
* .rodata got 9 394 257 bytes smaller.

Comparing with gold, .comment now has the same size and .rodata is
55 021 bytes bigger.

Amusingly, merging strings seems to make lld a bit faster. With
today's files I got:

lld:
---------------------------------------------------------------------------
    1985.256427 task-clock (msec)        #  0.999 CPUs utilized           ( +- 0.07% )
          1,152 context-switches         #  0.580 K/sec
              0 cpu-migrations           #  0.000 K/sec                   ( +-100.00% )
        199,309 page-faults              #  0.100 M/sec
  5,970,383,833 cycles                   #  3.007 GHz                     ( +- 0.07% )
  3,413,740,580 stalled-cycles-frontend  # 57.18% frontend cycles idle    ( +- 0.12% )
<not supported> stalled-cycles-backend
  6,240,156,987 instructions             #  1.05 insns per cycle
                                         #  0.55 stalled cycles per insn  ( +- 0.01% )
  1,293,186,347 branches                 # 651.395 M/sec                  ( +- 0.01% )
     26,687,288 branch-misses            #  2.06% of all branches         ( +- 0.00% )

    1.987125976 seconds time elapsed                                      ( +- 0.07% )
---------------------------------------------------------------------------

lld --merge-strings:
---------------------------------------------------------------------------
    1912.735291 task-clock (msec)        #  0.999 CPUs utilized           ( +- 0.10% )
          1,152 context-switches         #  0.602 K/sec
              0 cpu-migrations           #  0.000 K/sec                   ( +-100.00% )
        187,916 page-faults              #  0.098 M/sec                   ( +- 0.00% )
  5,749,920,058 cycles                   #  3.006 GHz                     ( +- 0.04% )
  3,250,485,516 stalled-cycles-frontend  # 56.53% frontend cycles idle    ( +- 0.07% )
<not supported> stalled-cycles-backend
  5,987,870,976 instructions             #  1.04 insns per cycle
                                         #  0.54 stalled cycles per insn  ( +- 0.00% )
  1,250,773,036 branches                 # 653.919 M/sec                  ( +- 0.00% )
     27,922,489 branch-misses            #  2.23% of all branches         ( +- 0.00% )

    1.914565005 seconds time elapsed                                      ( +- 0.10% )
---------------------------------------------------------------------------

gold:
---------------------------------------------------------------------------
    1000.132594 task-clock (msec)        #  0.999 CPUs utilized           ( +- 0.01% )
              0 context-switches         #  0.000 K/sec
              0 cpu-migrations           #  0.000 K/sec
         77,836 page-faults              #  0.078 M/sec
  3,002,431,314 cycles                   #  3.002 GHz                     ( +- 0.01% )
  1,404,393,569 stalled-cycles-frontend  # 46.78% frontend cycles idle    ( +- 0.02% )
<not supported> stalled-cycles-backend
  4,110,576,101 instructions             #  1.37 insns per cycle
                                         #  0.34 stalled cycles per insn  ( +- 0.00% )
    869,160,761 branches                 # 869.046 M/sec                  ( +- 0.00% )
     15,691,670 branch-misses            #  1.81% of all branches         ( +- 0.00% )

    1.001044905 seconds time elapsed                                      ( +- 0.01% )
---------------------------------------------------------------------------

I have attached the run.sh script I used to collect the numbers.

Cheers,
Rafael

-------------- next part --------------
A non-text attachment was scrubbed...
Name: run.sh
Type: application/x-sh
Size: 5653 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150313/db3d7301/attachment.sh>
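(Putting the elapsed times above side by side: --merge-strings takes lld
from 1.987 s to 1.915 s, a ratio of about 1.987 / 1.915 ≈ 1.04, while gold's
1.001 s still leaves lld at roughly 1.915 / 1.001 ≈ 1.9x gold's link time.
The .rodata numbers are also consistent with the first message in the
thread: 9 449 278 - 9 394 257 = 55 021 bytes, matching the "55 021 bytes
bigger" figure.)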