thr3ads.net - llvm dev - [llvm-dev] lld and thread over-subscription [Oct 2015]

Hal Finkel via llvm-dev

2015-Oct-01 16:35 UTC

[llvm-dev] lld and thread over-subscription

Hi Rui, et al.,

I was experimenting yesterday with building lld on my POWER7 PPC64/Linux
machine, and ran into an unfortunate problem. When running the regression tests
under lit, almost all of the tests fail like this:

terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
...
5  libc.so.6       0x00000080b7847238 abort + 4293480680
6  libstdc++.so.6  0x00000fff94f0f004 __gnu_cxx::__verbose_terminate_handler() + 4294099316
7  libstdc++.so.6  0x00000fff94f0bc84
8  libstdc++.so.6  0x00000fff94f0bccc std::terminate() + 4294087956
9  libstdc++.so.6  0x00000fff94f0c0c4 __cxa_throw + 4294088780
10 libstdc++.so.6  0x00000fff94f816e0 std::__throw_system_error(int) + 4294526808
11 libstdc++.so.6  0x00000fff94f83d30 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) + 4294534936
12 lld             0x000000001002a278
...

which seems to indicate a core problem with how we deal with thread-resource
exhaustion. For almost all tests, running them individually (or using lit -j 1)
works without a problem. We could deal with this by limiting the number of
threads lld uses when running regression tests, or by limiting the number of
threads that lit uses when running lld tests (as we currently do with the OpenMP
runtime tests), but I'm somewhat concerned that users will run into this problem
regardless with heavily-parallelized builds.

We could try to catch exceptions that otherwise come from
ThreadPoolExecutor's constructor, but do we compile with exceptions enabled?
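
As a rough sketch of the catch-and-degrade idea, assuming a build with exceptions enabled (which is exactly the open question here): the helper below is hypothetical and is not lld's ThreadPoolExecutor. It tries to start the requested number of workers and, if thread creation throws std::system_error, carries on with however many did start. The names `Spawner`, `spawnStdThread`, and `startWorkers` are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <system_error>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not lld's ThreadPoolExecutor.
// A Spawner turns a unit of work into a running thread. It is injectable
// so that the failure path can be exercised without exhausting real limits.
using Spawner = std::function<std::thread(std::function<void()>)>;

// Default spawner: std::thread's constructor throws std::system_error
// (typically EAGAIN) when the system refuses to create another thread.
inline std::thread spawnStdThread(std::function<void()> work) {
  return std::thread(std::move(work));
}

// Try to start `requested` workers. On resource exhaustion, stop spawning
// and return the workers that did start instead of terminating.
std::vector<std::thread> startWorkers(std::size_t requested,
                                      const std::function<void()> &work,
                                      const Spawner &spawn = spawnStdThread) {
  assert(requested > 0 && "need at least one worker");
  std::vector<std::thread> workers;
  workers.reserve(requested);
  for (std::size_t i = 0; i < requested; ++i) {
    try {
      workers.push_back(spawn(work));
    } catch (const std::system_error &) {
      break; // thread limit hit: continue with the workers we already have
    }
  }
  return workers;
}
```

A caller would join every returned thread, and would need to run the work on the calling thread if the returned vector is empty.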

Thanks again,
Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Rui Ueyama via llvm-dev

2015-Oct-01 16:46 UTC

On Thu, Oct 1, 2015 at 9:35 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> Hi Rui, et al.,
>
> I was experimenting yesterday with building lld on my POWER7 PPC64/Linux
> machine, and ran into an unfortunate problem. When running the regressions
> tests under lit, almost all of the tests fail like this:
>
> terminate called after throwing an instance of 'std::system_error'
>   what():  Resource temporarily unavailable
> ...
> 5  libc.so.6       0x00000080b7847238 abort + 4293480680
> 6  libstdc++.so.6  0x00000fff94f0f004 __gnu_cxx::__verbose_terminate_handler() + 4294099316
> 7  libstdc++.so.6  0x00000fff94f0bc84
> 8  libstdc++.so.6  0x00000fff94f0bccc std::terminate() + 4294087956
> 9  libstdc++.so.6  0x00000fff94f0c0c4 __cxa_throw + 4294088780
> 10 libstdc++.so.6  0x00000fff94f816e0 std::__throw_system_error(int) + 4294526808
> 11 libstdc++.so.6  0x00000fff94f83d30 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) + 4294534936
> 12 lld             0x000000001002a278
> ...
>
> which seems to indicate a core problem here with dealing with
> thread-resource exhaustion. For almost all tests, running them individually
> (or using lit -j 1) works without a problem. We could deal with this by
> limiting the number of threads lld uses when running regression tests, or
> limit the number of threads that lit uses when running lld tests (as we
> currently do with the OpenMP runtime tests), but I'm somewhat concerned
> that users will run into this program regardless with heavily-parallelized
> builds.
>
> We could try to catch exceptions that otherwise come from
> ThreadPoolExecutor's constructor, but do we compile with exceptions
> enabled?
>
I guess we do not want to enable exceptions to deal with the issue. Are
COFF tests failing, or just ELF tests? If ELF tests for the old LLD are
failing, the best way would be to not use threads in the old LLD. It has
lingering threading issues.

> Thanks again,
> Hal
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

Hal Finkel via llvm-dev

2015-Oct-01 17:26 UTC

----- Original Message -----
> From: "Rui Ueyama" <ruiu at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Developers" <llvm-dev at lists.llvm.org>, "Rafael Espindola" <rafael.espindola at gmail.com>
> Sent: Thursday, October 1, 2015 11:46:05 AM
> Subject: Re: lld and thread over-subscription
> 
> On Thu, Oct 1, 2015 at 9:35 AM, Hal Finkel < hfinkel at anl.gov > wrote:
> 
> Hi Rui, et al.,
> 
> I was experimenting yesterday with building lld on my POWER7
> PPC64/Linux machine, and ran into an unfortunate problem. When
> running the regressions tests under lit, almost all of the tests
> fail like this:
> 
> terminate called after throwing an instance of 'std::system_error'
> what(): Resource temporarily unavailable
> ...
> 5 libc.so.6 0x00000080b7847238 abort + 4293480680
> 6 libstdc++.so.6 0x00000fff94f0f004
> __gnu_cxx::__verbose_terminate_handler() + 4294099316
> 7 libstdc++.so.6 0x00000fff94f0bc84
> 8 libstdc++.so.6 0x00000fff94f0bccc std::terminate() + 4294087956
> 9 libstdc++.so.6 0x00000fff94f0c0c4 __cxa_throw + 4294088780
> 10 libstdc++.so.6 0x00000fff94f816e0 std::__throw_system_error(int) + 4294526808
> 11 libstdc++.so.6 0x00000fff94f83d30 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) + 4294534936
> 12 lld 0x000000001002a278
> ...
> 
> which seems to indicate a core problem here with dealing with
> thread-resource exhaustion. For almost all tests, running them
> individually (or using lit -j 1) works without a problem. We could
> deal with this by limiting the number of threads lld uses when
> running regression tests, or limit the number of threads that lit
> uses when running lld tests (as we currently do with the OpenMP
> runtime tests), but I'm somewhat concerned that users will run into
> this program regardless with heavily-parallelized builds.
> 
> We could try to catch exceptions that otherwise come from
> ThreadPoolExecutor's constructor, but do we compile with exceptions
> enabled?
>  
> I guess we do not want to enable exceptions to deal with the issue.
> Are COFF tests failing, or just ELF tests? If ELF tests for the old
> LLD are failing, the best way would be to not use threads in the old
> LLD. It has lingering threading issues.
> 
To provide a data point, my default environment has this:

$ ulimit -a | grep proc
max user processes              (-u) 1024

This machine has 48 cores, so with lit running 48 tests in parallel, each test
is left with only about 20 available threads, far fewer than the 48 each test
believes it can use.
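
The arithmetic behind that estimate can be sketched as follows. This is only an illustration: `clampThreads` is a hypothetical helper, and the even split across jobs is a simplifying assumption. On Linux the per-user limit counts every thread and process the user owns, so the real headroom is lower than the computed share.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: split a per-user task limit (the `ulimit -u` value)
// evenly across `parallelJobs` concurrently running processes, and clamp
// the number of threads one process wants to that share. Always returns
// at least 1 so the caller can fall back to single-threaded operation.
std::size_t clampThreads(std::size_t wanted, std::size_t userTaskLimit,
                         std::size_t parallelJobs) {
  assert(parallelJobs > 0 && "at least one job must be running");
  std::size_t share = userTaskLimit / parallelJobs; // integer division
  return std::max<std::size_t>(1, std::min(wanted, share));
}
```

With the figures above, `clampThreads(48, 1024, 48)` yields 21, which matches the "about 20" estimate.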

This is somewhat non-deterministic, but I just reran things both ways, and
here's what I see:

During my last run, these tests failed when run under lit with many parallel
tests, but did not fail when run otherwise:

    lld :: elf2/basic.s
    lld :: elf/AArch64/general-dyn-tls-0.test
    lld :: elf/AArch64/initial-exec-tls-0.test
    lld :: elf/AArch64/rel-prel32-overflow.test
    lld :: elf/AArch64/rel-prel64.test
    lld :: elf/AMDGPU/hsa.test
    lld :: elf/ARM/arm-symbols.test
    lld :: elf/ARM/dynamic-symbols.test
    lld :: elf/ARM/entry-point.test
    lld :: elf/ARM/exidx.test
    lld :: elf/ARM/header-flags.test
    lld :: elf/ARM/mapping-code-model.test
    lld :: elf/ARM/mapping-symbols.test
    lld :: elf/ARM/missing-symbol.test
    lld :: elf/ARM/plt-dynamic.test
    lld :: elf/ARM/plt-ifunc-interwork.test
    lld :: elf/ARM/plt-ifunc-mapping.test
    lld :: elf/ARM/rel-arm-call.test
    lld :: elf/ARM/rel-arm-jump24-veneer-b.test
    lld :: elf/ARM/rel-arm-mov.test
    lld :: elf/ARM/rel-arm-prel31.test
    lld :: elf/ARM/rel-arm-target1.test
    lld :: elf/ARM/rel-arm-thm-interwork.test
    lld :: elf/ARM/undef-lazy-symbol.test
    lld :: elf/Hexagon/dynlib-data.test
    lld :: elf/Mips/exe-dynamic.test
    lld :: elf/Mips/exe-dynsym.test
    lld :: elf/Mips/exe-fileheader-64.test
    lld :: elf/Mips/exe-fileheader-micro-64.test
    lld :: elf/Mips/exe-fileheader-n32.test
    lld :: elf/Mips/exe-got-micro.test
    lld :: elf/Mips/exe-got.test
    lld :: elf/Mips/got16-2.test
    lld :: elf/Mips/got16-micro.test
    lld :: elf/Mips/got-page-32-micro.test
    lld :: elf/Mips/got-page-64-micro.test
    lld :: elf/Mips/got-page-64.test
    lld :: elf/X86_64/sectionchoice.test
    lld :: elf/X86_64/sectionmap.test
    lld :: mach-o/arm-interworking.yaml
    lld :: mach-o/arm-shims.yaml
    lld :: mach-o/data-only-dylib.yaml
    lld :: mach-o/executable-exports.yaml
    lld :: mach-o/exe-offsets.yaml
    lld :: mach-o/exported_symbols_list-undef.yaml
    lld :: mach-o/fat-archive.yaml
    lld :: mach-o/flat_namespace_undef_error.yaml
    lld :: mach-o/flat_namespace_undef_suppress.yaml
    lld :: mach-o/force_load-x86_64.yaml
    lld :: mach-o/got-order.yaml
    lld :: mach-o/hello-world-arm64.yaml
    lld :: mach-o/hello-world-armv6.yaml
    lld :: mach-o/hello-world-x86_64.yaml
    lld :: mach-o/hello-world-x86.yaml
    lld :: mach-o/keep_private_externs.yaml
    lld :: mach-o/lazy-bind-x86_64.yaml
    lld :: mach-o/library-rescan.yaml
    lld :: mach-o/mh_bundle_header.yaml
    lld :: mach-o/mh_dylib_header.yaml
    lld :: mach-o/objc_export_list.yaml
    lld :: mach-o/order_file-basic.yaml
    lld :: mach-o/parse-aliases.yaml
    lld :: mach-o/parse-cfstring32.yaml
    lld :: mach-o/parse-cfstring64.yaml
    lld :: mach-o/parse-compact-unwind32.yaml
    lld :: mach-o/parse-compact-unwind64.yaml
    lld :: mach-o/parse-data-in-code-armv7.yaml
    lld :: mach-o/parse-data-in-code-x86.yaml
    lld :: mach-o/parse-data-relocs-arm64.yaml
    lld :: mach-o/parse-data-relocs-x86_64.yaml
    lld :: mach-o/parse-data.yaml
    lld :: mach-o/parse-eh-frame-relocs-x86_64.yaml
    lld :: mach-o/parse-eh-frame-x86-anon.yaml
    lld :: mach-o/parse-eh-frame-x86-labeled.yaml
    lld :: mach-o/parse-eh-frame.yaml
    lld :: mach-o/parse-function.yaml
    lld :: mach-o/parse-initializers32.yaml
    lld :: mach-o/parse-initializers64.yaml
    lld :: mach-o/parse-literals-error.yaml
    lld :: mach-o/parse-literals.yaml
    lld :: mach-o/parse-non-lazy-pointers.yaml
    lld :: mach-o/parse-relocs-x86.yaml
    lld :: mach-o/parse-section-no-symbol.yaml
    lld :: mach-o/parse-tentative-defs.yaml
    lld :: mach-o/parse-text-relocs-x86_64.yaml
    lld :: mach-o/parse-tlv-relocs-x86-64.yaml
    lld :: mach-o/re-exported-dylib-ordinal.yaml
    lld :: mach-o/rpath.yaml
    lld :: mach-o/run-tlv-pass-x86-64.yaml
    lld :: mach-o/sectalign.yaml
    lld :: mach-o/twolevel_namespace_undef_dynamic_lookup.yaml
    lld :: mach-o/usage.yaml
    lld :: mach-o/use-simple-dylib.yaml
    lld :: mach-o/write-final-sections.yaml
    lld :: mach-o/wrong-arch-error.yaml
    lld-Unit :: CoreTests/CoreTests/Range.conversion_to_pointer_range
    lld-Unit :: CoreTests/CoreTests/Range.slice
    lld-Unit :: CoreTests/CoreTests/Range.user1
    lld-Unit :: CoreTests/CoreTests/Range.user2

Of these, the following tests don't fail, but are reported as
'Unresolved' (which does not happen if I run lit -j 1):

    lld :: elf/ARM/mapping-code-model.test
    lld :: elf/ARM/mapping-symbols.test
    lld :: elf/ARM/missing-symbol.test
    lld :: elf/ARM/plt-ifunc-interwork.test
    lld :: elf/ARM/rel-arm-jump24-veneer-b.test
    lld :: elf/Mips/exe-got-micro.test
    lld :: elf/Mips/exe-got.test
    lld :: elf/Mips/got16-micro.test
    lld :: mach-o/parse-cfstring64.yaml
    lld :: mach-o/parse-compact-unwind32.yaml
    lld :: mach-o/parse-compact-unwind64.yaml
    lld :: mach-o/parse-data-in-code-armv7.yaml
    lld :: mach-o/parse-data-in-code-x86.yaml
    lld :: mach-o/parse-data-relocs-arm64.yaml
    lld :: mach-o/parse-data-relocs-x86_64.yaml
    lld :: mach-o/parse-data.yaml
    lld :: mach-o/parse-eh-frame-relocs-x86_64.yaml
    lld :: mach-o/parse-eh-frame-x86-anon.yaml
    lld :: mach-o/parse-eh-frame-x86-labeled.yaml
    lld :: mach-o/parse-eh-frame.yaml
    lld :: mach-o/parse-function.yaml
    lld :: mach-o/parse-initializers32.yaml
    lld :: mach-o/parse-initializers64.yaml
    lld :: mach-o/parse-literals-error.yaml
    lld :: mach-o/parse-literals.yaml
    lld :: mach-o/parse-non-lazy-pointers.yaml
    lld :: mach-o/parse-relocs-x86.yaml
    lld :: mach-o/parse-section-no-symbol.yaml
    lld :: mach-o/parse-tentative-defs.yaml
    lld :: mach-o/parse-text-relocs-arm64.yaml
    lld :: mach-o/parse-text-relocs-x86_64.yaml
    lld :: mach-o/parse-tlv-relocs-x86-64.yaml
    lld :: mach-o/rpath.yaml
    lld :: mach-o/run-tlv-pass-x86-64.yaml
    lld :: mach-o/twolevel_namespace_undef_dynamic_lookup.yaml
    lld :: mach-o/usage.yaml
    lld-Unit :: CoreTests/CoreTests/Range.conversion_to_pointer_range
    lld-Unit :: CoreTests/CoreTests/Range.slice
    lld-Unit :: CoreTests/CoreTests/Range.user1
    lld-Unit :: CoreTests/CoreTests/Range.user2

These are listed as unresolved for the same underlying reason, for example:

********************
UNRESOLVED: lld-Unit :: CoreTests/CoreTests/Range.user1 (25040 of 25181)
******************** TEST 'lld-Unit :: CoreTests/CoreTests/Range.user1' FAILED ********************
Exception during script execution:
Traceback (most recent call last):
  File "/src/llvm/utils/lit/lit/run.py", line 166, in execute_test
    result = test.config.test_format.execute(test, self.lit_config)
  File "/src/llvm/utils/lit/lit/formats/googletest.py", line 113, in execute
    cmd, env=test.config.environment)
  File "/src/llvm/utils/lit/lit/util.py", line 166, in executeCommand
    env=env, close_fds=kUseCloseFDs)
  File "/install/ppc64/Python-2.7/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/install/ppc64/Python-2.7/lib/python2.7/subprocess.py", line 1231, in _execute_child
    self.pid = os.fork()
OSError: [Errno 11] Resource temporarily unavailable

Because the failures are inherently nondeterministic, running again with the
default number of parallel lit tests changes which tests fail (for example, a
second run adds tests under COFF).

And, FWIW, these tests generally fail on my system (for reasons seemingly
unrelated to the thread/process resource issue):

    lld :: Driver/lib-search.test
    lld :: Driver/undef-basic.objtxt
    lld :: elf2/dynamic-reloc.s
    lld :: elf2/shared.s
    lld :: elf2/soname.s
    lld :: elf/librarynotfound.test
    lld :: elf/responsefile.test
    lld :: mach-o/dylib-install-names.yaml
    lld :: mach-o/force_load-dylib.yaml
    lld :: mach-o/lib-search-paths.yaml
    lld :: mach-o/parse-text-relocs-arm64.yaml
    lld :: mach-o/upward-dylib-load-command.yaml
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.AsNeeded
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymAlias
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymDecimal
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymHexadecimal
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymOctal
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Empty
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Entry
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.EntryJoined
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.EntryShort
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.ExportDynamic
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Init
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.InitJoined
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.NoExportDynamic
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.NoinhibitExec
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Output
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.OutputDefault
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Rpath
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.RpathEq
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SOName
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SONameH
    lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SONameSingleDash
    lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Entry
    lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.ExprEval
    lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Group
    lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.IgnoreSearchDirNoStdLib
    lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Input
    lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Output
    lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.SearchDir
    lld-Unit :: DriverTests/DriverTests/UniversalDriver.flavor

(It could be big-endian issues, LLVM bugs, etc.; I've yet to investigate.)

The easiest thing to do is to make lld tests run using lit -j 1, but we may also
want to think about how to handle this situation more gracefully in general,
because it seems like something a user could well hit.

Thanks again,
Hal
> 
> Thanks again,
> Hal
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
