Rafael,

This is very good information and extremely useful.

On 3/12/2015 11:49 AM, Rafael Espíndola wrote:
> I tried benchmarking it on linux by linking clang Release+asserts (but
> lld itself with no asserts). The first things I noticed were:
>
> missing options:
>
> warning: ignoring unknown argument: --no-add-needed
> warning: ignoring unknown argument: -O3
> warning: ignoring unknown argument: --gc-sections
>
> I just removed them from the command line.
>
> Looks like --hash-style=gnu and --build-id are just ignored, so I
> removed them too.
>
> Looks like --strip-all is ignored, so I removed it and ran strip manually.
>
> Looks like .note.GNU-stack is incorrectly added; neither gold nor
> bfd.ld adds it for clang.
>
> Looks like .gnu.version and .gnu.version_r are not implemented.
>
> Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
> it is not included in .got.

I have a fix for this. Will merge it.

> Gold produces a .data.rel.ro.local. lld produces a .data.rel.local.
> bfd puts everything in .data.rel. I have to research a bit to find out
> what this is. For now I just added the sizes into a single entry.
>
> .eh_frame_hdr is effectively empty on lld. I removed --eh-frame-hdr
> from the command line.
>
> With all that, the sections that increased in size the most when using
> lld were:
>
> .rodata:       9 449 278 bytes bigger
> .eh_frame:       438 376 bytes bigger
> .comment:         77 797 bytes bigger
> .data.rel.ro:     48 056 bytes bigger

Did you try --merge-strings and --gc-sections with lld?

> The comment section is bigger because it has multiple copies of
>
> clang version 3.7.0 (trunk 232021) (llvm/trunk 232027)
>
> The lack of duplicate entry merging would also explain the size
> difference of .rodata and .eh_frame. No idea why .data.rel.ro is
> bigger.
>
> So, with the big warning that both linkers are not doing exactly the
> same thing, the performance numbers I got were:
>
> lld:
>
>     1961.842991 task-clock (msec)        #  0.999 CPUs utilized           ( +- 0.04% )
>           1,152 context-switches         #  0.587 K/sec
>               0 cpu-migrations           #  0.000 K/sec                   ( +-100.00% )
>         199,310 page-faults              #  0.102 M/sec                   ( +- 0.00% )
>   5,893,291,145 cycles                   #  3.004 GHz                     ( +- 0.03% )
>   3,329,741,079 stalled-cycles-frontend  # 56.50% frontend cycles idle    ( +- 0.05% )
> <not supported> stalled-cycles-backend
>   6,255,727,902 instructions             #  1.06 insns per cycle
>                                          #  0.53 stalled cycles per insn  ( +- 0.01% )
>   1,295,893,191 branches                 # 660.549 M/sec                  ( +- 0.01% )
>      26,760,734 branch-misses            #  2.07% of all branches         ( +- 0.01% )
>
>     1.963705923 seconds time elapsed                                      ( +- 0.04% )
>
> gold:
>
>      990.708786 task-clock (msec)        #  0.999 CPUs utilized           ( +- 0.06% )
>               0 context-switches         #  0.000 K/sec
>               0 cpu-migrations           #  0.000 K/sec                   ( +-100.00% )
>          77,840 page-faults              #  0.079 M/sec
>   2,976,552,629 cycles                   #  3.004 GHz                     ( +- 0.02% )
>   1,384,720,988 stalled-cycles-frontend  # 46.52% frontend cycles idle    ( +- 0.04% )
> <not supported> stalled-cycles-backend
>   4,105,948,264 instructions             #  1.38 insns per cycle
>                                          #  0.34 stalled cycles per insn  ( +- 0.00% )
>     868,894,366 branches                 # 877.043 M/sec                  ( +- 0.00% )
>      15,426,051 branch-misses            #  1.78% of all branches         ( +- 0.01% )
>
>     0.991619294 seconds time elapsed                                      ( +- 0.06% )
>
> The biggest difference that shows up is that lld has 1,152 context
> switches, but the cpu utilization is still < 1. Maybe there is just a
> threading bug somewhere?

lld apparently is highly multithreaded, but I see your point. Maybe
trying this exercise on /dev/shm would show more CPU utilization?
Shankar Easwaran

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by the Linux Foundation
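(Aside for readers of the archive: the "duplicate entry merging" discussed
above is roughly what merging of SHF_MERGE|SHF_STRINGS string sections such
as .comment amounts to. Below is a minimal C++ sketch of the idea, using a
hypothetical MergedStringSection helper rather than lld's real data
structures: every input string is looked up in a map and only unseen strings
are appended to the output, which is why a merged .comment ends up with a
single "clang version ..." line instead of one copy per object file.)

#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative only; not lld code.
class MergedStringSection {
public:
  // Returns the output offset for Str, appending it only if unseen.
  size_t add(const std::string &Str) {
    auto It = OffsetOf.find(Str);
    if (It != OffsetOf.end())
      return It->second;                  // Duplicate: reuse existing copy.
    size_t Offset = Contents.size();
    Contents.insert(Contents.end(), Str.begin(), Str.end());
    Contents.push_back('\0');             // Keep strings NUL-terminated.
    OffsetOf.emplace(Str, Offset);
    return Offset;
  }

  const std::vector<char> &data() const { return Contents; }

private:
  std::vector<char> Contents;             // Bytes of the output section.
  std::unordered_map<std::string, size_t> OffsetOf;
};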
>> Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
>> it is not included in .got.
>
> I have a fix for this. Will merge it.

Thanks.

>> .rodata:       9 449 278 bytes bigger
>> .eh_frame:       438 376 bytes bigger
>> .comment:         77 797 bytes bigger
>> .data.rel.ro:     48 056 bytes bigger
>
> Did you try --merge-strings and --gc-sections with lld?

I got

warning: ignoring unknown argument: --gc-sections

I will do a run with --merge-strings. This should probably be the
default to match other ELF linkers.

>> The biggest difference that shows up is that lld has 1,152 context
>> switches, but the cpu utilization is still < 1. Maybe there is just a
>> threading bug somewhere?
>
> lld apparently is highly multithreaded, but I see your point. Maybe
> trying this exercise on /dev/shm would show more CPU utilization?

Yes, the number just under 1 CPU utilized is very suspicious. As Rui
points out, there is probably some issue in the threading
implementation on Linux. One interesting experiment would be timing
gold and lld linking ELF on Windows (but I only have a Windows VM and
no idea what the "perf" equivalent is on Windows).

I forgot to mention, the tests were run on tmpfs already.

Cheers,
Rafael
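(A small aside on the "many context switches but < 1 CPU utilized"
observation: that is the pattern you get when threads exist but end up
serialized, for example on one lock. The toy program below is only an
illustration of that pattern and has nothing to do with lld's actual code:
eight threads do all of their work under a single mutex, so perf stat will
report roughly one CPU utilized together with a pile of context switches.)

#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
  std::mutex M;
  unsigned long Sink = 0;
  // Each worker does all of its "work" inside one global critical
  // section, so only one thread makes progress at any time.
  auto Work = [&] {
    for (int I = 0; I < 500; ++I) {
      std::lock_guard<std::mutex> Lock(M);
      for (int J = 0; J < 100000; ++J)
        Sink += J;
    }
  };
  auto Start = std::chrono::steady_clock::now();
  std::vector<std::thread> Threads;
  for (int T = 0; T < 8; ++T)
    Threads.emplace_back(Work);
  for (auto &Th : Threads)
    Th.join();
  std::chrono::duration<double> Wall =
      std::chrono::steady_clock::now() - Start;
  // Run under `perf stat ./a.out`: task-clock stays close to wall time
  // (~1 CPU utilized) while context switches pile up, despite 8 threads.
  std::printf("wall time %.2fs, sink %lu\n", Wall.count(), Sink);
}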
On Fri, Mar 13, 2015 at 10:15 AM, Rafael Espíndola
<rafael.espindola at gmail.com> wrote:
>>> Curiously lld produces a tiny got.dyn (0x0000a0 bytes), not sure why
>>> it is not included in .got.
>>
>> I have a fix for this. Will merge it.
>
> Thanks.
>
>>> .rodata:       9 449 278 bytes bigger
>>> .eh_frame:       438 376 bytes bigger
>>> .comment:         77 797 bytes bigger
>>> .data.rel.ro:     48 056 bytes bigger
>>
>> Did you try --merge-strings and --gc-sections with lld?
>
> I got
>
> warning: ignoring unknown argument: --gc-sections
>
> I will do a run with --merge-strings. This should probably be the
> default to match other ELF linkers.
>

Unfortunately, --gc-sections isn't implemented in the GNU driver. I
tried to enable it, but I hit quite a few issues that I'm slowly
fixing. At the time of writing the Resolver reclaims live atoms.

>>> The biggest difference that shows up is that lld has 1,152 context
>>> switches, but the cpu utilization is still < 1. Maybe there is just a
>>> threading bug somewhere?
>>
>> lld apparently is highly multithreaded, but I see your point. Maybe
>> trying this exercise on /dev/shm would show more CPU utilization?
>
> Yes, the number just under 1 CPU utilized is very suspicious. As Rui
> points out, there is probably some issue in the threading
> implementation on Linux. One interesting experiment would be timing
> gold and lld linking ELF on Windows (but I only have a Windows VM and
> no idea what the "perf" equivalent is on Windows).
>
> I forgot to mention, the tests were run on tmpfs already.
>

I think we can make an effort to reduce the number of context
switches. In particular, we might try to switch to a model where a
task is the basic unit of computation and a pool of worker threads is
responsible for executing these tasks. This way we can tune the number
of threads fighting for the CPU at the same time, with a reasonable
default that can be overridden by the user via command-line options.
That said, since this would require some substantial changes, I
wouldn't go down that path until we have strong evidence that the
change will improve performance significantly. I feel that while
context switches may have some impact on the final numbers, they will
hardly account for a large part of the performance loss.

Another thing that comes to mind is that the relatively high number of
context switches might be an effect of lock contention. If somebody
has access to a VTune license and can run 'lock analysis' on it, that
would be greatly appreciated.

I don't have a Linux laptop/setup, but I'll try to collect some
numbers on FreeBSD and investigate further over the weekend.

Thanks,

--
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
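(To make the task/worker-pool idea above concrete, here is a bare-bones C++
sketch. It is only an illustration of the model being proposed, not lld's
scheduler and not a patch: tasks go into a queue and a fixed, tunable number
of worker threads drain it, so the number of threads competing for the CPU
is bounded by whatever the pool was created with, e.g. a value taken from a
command-line option.)

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative only; not lld code.
class TaskPool {
public:
  explicit TaskPool(unsigned NumWorkers) {
    for (unsigned I = 0; I < NumWorkers; ++I)
      Workers.emplace_back([this] { run(); });
  }

  ~TaskPool() {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      Done = true;
    }
    CV.notify_all();
    for (auto &W : Workers)
      W.join();                     // Remaining tasks are drained first.
  }

  void enqueue(std::function<void()> Task) {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      Tasks.push(std::move(Task));
    }
    CV.notify_one();
  }

private:
  void run() {
    for (;;) {
      std::function<void()> Task;
      {
        std::unique_lock<std::mutex> Lock(Mutex);
        CV.wait(Lock, [this] { return Done || !Tasks.empty(); });
        if (Tasks.empty())
          return;                   // Done and no work left.
        Task = std::move(Tasks.front());
        Tasks.pop();
      }
      Task();                       // Run the task outside the lock.
    }
  }

  std::vector<std::thread> Workers;
  std::queue<std::function<void()>> Tasks;
  std::mutex Mutex;
  std::condition_variable CV;
  bool Done = false;
};

A hypothetical driver would create something like TaskPool Pool(NumThreads)
once, with NumThreads coming from an option or a sane default, and enqueue
one task per independent piece of link work.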
> I will do a run with --merge-strings. This should probably be the
> default to match other ELF linkers.

Trying --merge-strings with today's trunk I got:

* .comment got 77 797 bytes smaller.
* .rodata got 9 394 257 bytes smaller.

Comparing with gold, .comment now has the same size and .rodata is
55 021 bytes bigger.

Amusingly, merging strings seems to make lld a bit faster. With
today's files I got:

lld:
---------------------------------------------------------------------------
    1985.256427 task-clock (msec)        #  0.999 CPUs utilized           ( +- 0.07% )
          1,152 context-switches         #  0.580 K/sec
              0 cpu-migrations           #  0.000 K/sec                   ( +-100.00% )
        199,309 page-faults              #  0.100 M/sec
  5,970,383,833 cycles                   #  3.007 GHz                     ( +- 0.07% )
  3,413,740,580 stalled-cycles-frontend  # 57.18% frontend cycles idle    ( +- 0.12% )
<not supported> stalled-cycles-backend
  6,240,156,987 instructions             #  1.05 insns per cycle
                                         #  0.55 stalled cycles per insn  ( +- 0.01% )
  1,293,186,347 branches                 # 651.395 M/sec                  ( +- 0.01% )
     26,687,288 branch-misses            #  2.06% of all branches         ( +- 0.00% )

    1.987125976 seconds time elapsed                                      ( +- 0.07% )
---------------------------------------------------------------------------

lld --merge-strings:
---------------------------------------------------------------------------
    1912.735291 task-clock (msec)        #  0.999 CPUs utilized           ( +- 0.10% )
          1,152 context-switches         #  0.602 K/sec
              0 cpu-migrations           #  0.000 K/sec                   ( +-100.00% )
        187,916 page-faults              #  0.098 M/sec                   ( +- 0.00% )
  5,749,920,058 cycles                   #  3.006 GHz                     ( +- 0.04% )
  3,250,485,516 stalled-cycles-frontend  # 56.53% frontend cycles idle    ( +- 0.07% )
<not supported> stalled-cycles-backend
  5,987,870,976 instructions             #  1.04 insns per cycle
                                         #  0.54 stalled cycles per insn  ( +- 0.00% )
  1,250,773,036 branches                 # 653.919 M/sec                  ( +- 0.00% )
     27,922,489 branch-misses            #  2.23% of all branches         ( +- 0.00% )

    1.914565005 seconds time elapsed                                      ( +- 0.10% )
---------------------------------------------------------------------------

gold:
---------------------------------------------------------------------------
    1000.132594 task-clock (msec)        #  0.999 CPUs utilized           ( +- 0.01% )
              0 context-switches         #  0.000 K/sec
              0 cpu-migrations           #  0.000 K/sec
         77,836 page-faults              #  0.078 M/sec
  3,002,431,314 cycles                   #  3.002 GHz                     ( +- 0.01% )
  1,404,393,569 stalled-cycles-frontend  # 46.78% frontend cycles idle    ( +- 0.02% )
<not supported> stalled-cycles-backend
  4,110,576,101 instructions             #  1.37 insns per cycle
                                         #  0.34 stalled cycles per insn  ( +- 0.00% )
    869,160,761 branches                 # 869.046 M/sec                  ( +- 0.00% )
     15,691,670 branch-misses            #  1.81% of all branches         ( +- 0.00% )

    1.001044905 seconds time elapsed                                      ( +- 0.01% )
---------------------------------------------------------------------------

I have attached the run.sh script I used to collect the numbers.

Cheers,
Rafael

-------------- next part --------------
A non-text attachment was scrubbed...
Name: run.sh
Type: application/x-sh
Size: 5653 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150313/db3d7301/attachment.sh>
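(Putting the elapsed times above side by side: --merge-strings takes lld
from 1.987 s to 1.915 s, a ratio of about 1.987 / 1.915 ≈ 1.04, while gold's
1.001 s still leaves lld at roughly 1.915 / 1.001 ≈ 1.9x gold's link time.
The .rodata numbers are also consistent with the first message in the
thread: 9 449 278 - 9 394 257 = 55 021 bytes, matching the "55 021 bytes
bigger" figure.)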