----- Original Message -----
> From: "Rui Ueyama" <ruiu at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Developers" <llvm-dev at lists.llvm.org>, "Rafael Espindola" <rafael.espindola at gmail.com>
> Sent: Thursday, October 1, 2015 11:46:05 AM
> Subject: Re: lld and thread over-subscription
>
> On Thu, Oct 1, 2015 at 9:35 AM, Hal Finkel < hfinkel at anl.gov > wrote:
>
> Hi Rui, et al.,
>
> I was experimenting yesterday with building lld on my POWER7
> PPC64/Linux machine, and ran into an unfortunate problem. When
> running the regression tests under lit, almost all of the tests
> fail like this:
>
> terminate called after throwing an instance of 'std::system_error'
> what(): Resource temporarily unavailable
> ...
> 5 libc.so.6 0x00000080b7847238 abort + 4293480680
> 6 libstdc++.so.6 0x00000fff94f0f004 __gnu_cxx::__verbose_terminate_handler() + 4294099316
> 7 libstdc++.so.6 0x00000fff94f0bc84
> 8 libstdc++.so.6 0x00000fff94f0bccc std::terminate() + 4294087956
> 9 libstdc++.so.6 0x00000fff94f0c0c4 __cxa_throw + 4294088780
> 10 libstdc++.so.6 0x00000fff94f816e0 std::__throw_system_error(int) + 4294526808
> 11 libstdc++.so.6 0x00000fff94f83d30 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) + 4294534936
> 12 lld 0x000000001002a278
> ...
>
> which seems to indicate a core problem here with dealing with
> thread-resource exhaustion. For almost all tests, running them
> individually (or using lit -j 1) works without a problem. We could
> deal with this by limiting the number of threads lld uses when
> running regression tests, or limit the number of threads that lit
> uses when running lld tests (as we currently do with the OpenMP
> runtime tests), but I'm somewhat concerned that users will run into
> this problem regardless with heavily-parallelized builds.
>
> We could try to catch exceptions that otherwise come from
> ThreadPoolExecutor's constructor, but do we compile with exceptions
> enabled?
>
> I guess we do not want to enable exceptions to deal with the issue.
> Are COFF tests failing, or just ELF tests? If ELF tests for the old
> LLD are failing, the best way would be to not use threads in the old
> LLD. It has lingering threading issues.
>

To provide a data point, my default environment has this:

$ ulimit -a | grep proc
max user processes (-u) 1024

This machine has 48 cores, so lit running 48 tests in parallel leaves each test with only about 20 available threads (1024 / 48 is about 21), far fewer than the 48 each test believes it can use.

This is somewhat non-deterministic, but I just reran things both ways, and here's what I see:

During my last run, these tests failed when running under lit with many parallel tests, but did not fail when run otherwise:

lld :: elf2/basic.s
lld :: elf/AArch64/general-dyn-tls-0.test
lld :: elf/AArch64/initial-exec-tls-0.test
lld :: elf/AArch64/rel-prel32-overflow.test
lld :: elf/AArch64/rel-prel64.test
lld :: elf/AMDGPU/hsa.test
lld :: elf/ARM/arm-symbols.test
lld :: elf/ARM/dynamic-symbols.test
lld :: elf/ARM/entry-point.test
lld :: elf/ARM/exidx.test
lld :: elf/ARM/header-flags.test
lld :: elf/ARM/mapping-code-model.test
lld :: elf/ARM/mapping-symbols.test
lld :: elf/ARM/missing-symbol.test
lld :: elf/ARM/plt-dynamic.test
lld :: elf/ARM/plt-ifunc-interwork.test
lld :: elf/ARM/plt-ifunc-mapping.test
lld :: elf/ARM/rel-arm-call.test
lld :: elf/ARM/rel-arm-jump24-veneer-b.test
lld :: elf/ARM/rel-arm-mov.test
lld :: elf/ARM/rel-arm-prel31.test
lld :: elf/ARM/rel-arm-target1.test
lld :: elf/ARM/rel-arm-thm-interwork.test
lld :: elf/ARM/undef-lazy-symbol.test
lld :: elf/Hexagon/dynlib-data.test
lld :: elf/Mips/exe-dynamic.test
lld :: elf/Mips/exe-dynsym.test
lld :: elf/Mips/exe-fileheader-64.test
lld :: elf/Mips/exe-fileheader-micro-64.test
lld :: elf/Mips/exe-fileheader-n32.test
lld :: elf/Mips/exe-got-micro.test
lld :: elf/Mips/exe-got.test
lld :: elf/Mips/got16-2.test
lld :: elf/Mips/got16-micro.test
lld :: elf/Mips/got-page-32-micro.test
lld :: elf/Mips/got-page-64-micro.test
lld :: elf/Mips/got-page-64.test
lld :: elf/X86_64/sectionchoice.test
lld :: elf/X86_64/sectionmap.test
lld :: mach-o/arm-interworking.yaml
lld :: mach-o/arm-shims.yaml
lld :: mach-o/data-only-dylib.yaml
lld :: mach-o/executable-exports.yaml
lld :: mach-o/exe-offsets.yaml
lld :: mach-o/exported_symbols_list-undef.yaml
lld :: mach-o/fat-archive.yaml
lld :: mach-o/flat_namespace_undef_error.yaml
lld :: mach-o/flat_namespace_undef_suppress.yaml
lld :: mach-o/force_load-x86_64.yaml
lld :: mach-o/got-order.yaml
lld :: mach-o/hello-world-arm64.yaml
lld :: mach-o/hello-world-armv6.yaml
lld :: mach-o/hello-world-x86_64.yaml
lld :: mach-o/hello-world-x86.yaml
lld :: mach-o/keep_private_externs.yaml
lld :: mach-o/lazy-bind-x86_64.yaml
lld :: mach-o/library-rescan.yaml
lld :: mach-o/mh_bundle_header.yaml
lld :: mach-o/mh_dylib_header.yaml
lld :: mach-o/objc_export_list.yaml
lld :: mach-o/order_file-basic.yaml
lld :: mach-o/parse-aliases.yaml
lld :: mach-o/parse-cfstring32.yaml
lld :: mach-o/parse-cfstring64.yaml
lld :: mach-o/parse-compact-unwind32.yaml
lld :: mach-o/parse-compact-unwind64.yaml
lld :: mach-o/parse-data-in-code-armv7.yaml
lld :: mach-o/parse-data-in-code-x86.yaml
lld :: mach-o/parse-data-relocs-arm64.yaml
lld :: mach-o/parse-data-relocs-x86_64.yaml
lld :: mach-o/parse-data.yaml
lld :: mach-o/parse-eh-frame-relocs-x86_64.yaml
lld :: mach-o/parse-eh-frame-x86-anon.yaml
lld :: mach-o/parse-eh-frame-x86-labeled.yaml
lld :: mach-o/parse-eh-frame.yaml
lld :: mach-o/parse-function.yaml
lld :: mach-o/parse-initializers32.yaml
lld :: mach-o/parse-initializers64.yaml
lld :: mach-o/parse-literals-error.yaml
lld :: mach-o/parse-literals.yaml
lld :: mach-o/parse-non-lazy-pointers.yaml
lld :: mach-o/parse-relocs-x86.yaml
lld :: mach-o/parse-section-no-symbol.yaml
lld :: mach-o/parse-tentative-defs.yaml
lld :: mach-o/parse-text-relocs-x86_64.yaml
lld :: mach-o/parse-tlv-relocs-x86-64.yaml
lld :: mach-o/re-exported-dylib-ordinal.yaml
lld :: mach-o/rpath.yaml
lld :: mach-o/run-tlv-pass-x86-64.yaml
lld :: mach-o/sectalign.yaml
lld :: mach-o/twolevel_namespace_undef_dynamic_lookup.yaml
lld :: mach-o/usage.yaml
lld :: mach-o/use-simple-dylib.yaml
lld :: mach-o/write-final-sections.yaml
lld :: mach-o/wrong-arch-error.yaml
lld-Unit :: CoreTests/CoreTests/Range.conversion_to_pointer_range
lld-Unit :: CoreTests/CoreTests/Range.slice
lld-Unit :: CoreTests/CoreTests/Range.user1
lld-Unit :: CoreTests/CoreTests/Range.user2

Of these, the following tests don't fail, but are reported as 'Unresolved' (which does not happen if I run lit -j 1):

lld :: elf/ARM/mapping-code-model.test
lld :: elf/ARM/mapping-symbols.test
lld :: elf/ARM/missing-symbol.test
lld :: elf/ARM/plt-ifunc-interwork.test
lld :: elf/ARM/rel-arm-jump24-veneer-b.test
lld :: elf/Mips/exe-got-micro.test
lld :: elf/Mips/exe-got.test
lld :: elf/Mips/got16-micro.test
lld :: mach-o/parse-cfstring64.yaml
lld :: mach-o/parse-compact-unwind32.yaml
lld :: mach-o/parse-compact-unwind64.yaml
lld :: mach-o/parse-data-in-code-armv7.yaml
lld :: mach-o/parse-data-in-code-x86.yaml
lld :: mach-o/parse-data-relocs-arm64.yaml
lld :: mach-o/parse-data-relocs-x86_64.yaml
lld :: mach-o/parse-data.yaml
lld :: mach-o/parse-eh-frame-relocs-x86_64.yaml
lld :: mach-o/parse-eh-frame-x86-anon.yaml
lld :: mach-o/parse-eh-frame-x86-labeled.yaml
lld :: mach-o/parse-eh-frame.yaml
lld :: mach-o/parse-function.yaml
lld :: mach-o/parse-initializers32.yaml
lld :: mach-o/parse-initializers64.yaml
lld :: mach-o/parse-literals-error.yaml
lld :: mach-o/parse-literals.yaml
lld :: mach-o/parse-non-lazy-pointers.yaml
lld :: mach-o/parse-relocs-x86.yaml
lld :: mach-o/parse-section-no-symbol.yaml
lld :: mach-o/parse-tentative-defs.yaml
lld :: mach-o/parse-text-relocs-arm64.yaml
lld :: mach-o/parse-text-relocs-x86_64.yaml
lld :: mach-o/parse-tlv-relocs-x86-64.yaml
lld :: mach-o/rpath.yaml
lld :: mach-o/run-tlv-pass-x86-64.yaml
lld :: mach-o/twolevel_namespace_undef_dynamic_lookup.yaml
lld :: mach-o/usage.yaml
lld-Unit :: CoreTests/CoreTests/Range.conversion_to_pointer_range
lld-Unit :: CoreTests/CoreTests/Range.slice
lld-Unit :: CoreTests/CoreTests/Range.user1
lld-Unit :: CoreTests/CoreTests/Range.user2

These are listed as unresolved for the same underlying reason. For example:

********************
UNRESOLVED: lld-Unit :: CoreTests/CoreTests/Range.user1 (25040 of 25181)
******************** TEST 'lld-Unit :: CoreTests/CoreTests/Range.user1' FAILED ********************
Exception during script execution:
Traceback (most recent call last):
  File "/src/llvm/utils/lit/lit/run.py", line 166, in execute_test
    result = test.config.test_format.execute(test, self.lit_config)
  File "/src/llvm/utils/lit/lit/formats/googletest.py", line 113, in execute
    cmd, env=test.config.environment)
  File "/src/llvm/utils/lit/lit/util.py", line 166, in executeCommand
    env=env, close_fds=kUseCloseFDs)
  File "/install/ppc64/Python-2.7/lib/python2.7/subprocess.py", line 710, in __init__
    errread, errwrite)
  File "/install/ppc64/Python-2.7/lib/python2.7/subprocess.py", line 1231, in _execute_child
    self.pid = os.fork()
OSError: [Errno 11] Resource temporarily unavailable

Because the failures are nondeterministic, running again with the default number of parallel lit tests changes which tests fail (for example, a second run adds tests under COFF).
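
The "about 20 threads per test" estimate earlier in this message is an even split of the process limit across the parallel jobs. A minimal Python sketch of that arithmetic follows; the function name threads_per_job is hypothetical, and the even split is an assumption, since on Linux the limit also counts every other process and thread owned by the same user, so the real budget is somewhat lower.

```python
def threads_per_job(process_limit, parallel_jobs):
    """Return the thread budget per job if the limit is shared evenly."""
    if parallel_jobs < 1:
        raise ValueError("parallel_jobs must be at least 1")
    # Integer division: a fraction of a thread cannot be started.
    return process_limit // parallel_jobs


# Figures from this message: a soft limit of 1024 and 48 parallel lit jobs.
print(threads_per_job(1024, 48))  # 21
```

Each job then wants 48 threads but can get roughly 21, which is consistent with thread creation failing in some tests and fork() failing in lit itself.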
And, FWIW, these tests generally fail on my system (for reasons seemingly unrelated to the thread/process resource issue):

lld :: Driver/lib-search.test
lld :: Driver/undef-basic.objtxt
lld :: elf2/dynamic-reloc.s
lld :: elf2/shared.s
lld :: elf2/soname.s
lld :: elf/librarynotfound.test
lld :: elf/responsefile.test
lld :: mach-o/dylib-install-names.yaml
lld :: mach-o/force_load-dylib.yaml
lld :: mach-o/lib-search-paths.yaml
lld :: mach-o/parse-text-relocs-arm64.yaml
lld :: mach-o/upward-dylib-load-command.yaml
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.AsNeeded
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymAlias
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymDecimal
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymHexadecimal
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymOctal
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Empty
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Entry
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.EntryJoined
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.EntryShort
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.ExportDynamic
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Init
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.InitJoined
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.NoExportDynamic
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.NoinhibitExec
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Output
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.OutputDefault
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Rpath
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.RpathEq
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SOName
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SONameH
lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SONameSingleDash
lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Entry
lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.ExprEval
lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Group
lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.IgnoreSearchDirNoStdLib
lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Input
lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Output
lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.SearchDir
lld-Unit :: DriverTests/DriverTests/UniversalDriver.flavor

(These could be big-endian issues, LLVM bugs, etc.; I have yet to investigate.)

The easiest thing to do is to make the lld tests run using lit -j 1, but we may also want to think about how to handle this situation more gracefully in general, because it seems like something a user is likely to hit.

Thanks again,
Hal

>
> Thanks again,
> Hal
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
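
One way to picture "handling this more gracefully" is a worker pool that keeps whatever threads it managed to start instead of aborting when the OS refuses a new one. The following is only a Python illustration of that idea, not lld's C++ code and not a proposed patch; start_workers, run_with_fallback and the thread_factory hook are hypothetical names. It relies on CPython raising RuntimeError from Thread.start() when a thread cannot be created.

```python
import threading


def start_workers(wanted, target, thread_factory=threading.Thread):
    """Start up to `wanted` threads running `target`; keep those that started.

    `thread_factory` is a hypothetical hook so the failure path can be
    exercised without actually exhausting the system's thread limit.
    """
    started = []
    for _ in range(wanted):
        worker = thread_factory(target=target)
        try:
            worker.start()
        except RuntimeError:
            # CPython raises RuntimeError("can't start new thread") when the
            # OS refuses to create another thread; stop asking for more.
            break
        started.append(worker)
    return started


def run_with_fallback(wanted, target):
    """Run `target` on as many threads as possible, or inline if none start."""
    workers = start_workers(wanted, target)
    if not workers:
        target()  # Degrade to running on the calling thread.
    for worker in workers:
        worker.join()
    return len(workers)
```

In a real pool, `target` would be a loop pulling tasks from a shared queue, so starting fewer workers only slows the work down rather than dropping any of it.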
I honestly think that a ulimit of 1024 max threads is too strict for a 48-core machine. Processes are independent of each other, so it is not strange for each of them to spawn as many threads as there are cores.

What's the reason you cannot increase the limit?

On Thu, Oct 1, 2015 at 10:26 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> [...]
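
On the question of raising the limit: an unprivileged process may raise its own soft limit up to the hard limit, and child processes inherit the result. A minimal Python sketch follows, assuming a Unix system where the standard resource module exposes RLIMIT_NPROC; the helper name raise_soft_nproc is hypothetical, and nothing here is something lit is known to do.

```python
import resource


def raise_soft_nproc(wanted):
    """Raise the soft RLIMIT_NPROC towards `wanted`, capped at the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    if soft == resource.RLIM_INFINITY:
        return soft  # Already unlimited; nothing to do.
    # Never request more than the hard limit; that needs privileges.
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if target > soft:
        resource.setrlimit(resource.RLIMIT_NPROC, (target, hard))
        soft = target
    return soft
```

The shell equivalent is 'ulimit -u N' run before starting the test driver; the sketch never lowers an existing limit.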
----- Original Message -----
> From: "Rui Ueyama" <ruiu at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Developers" <llvm-dev at lists.llvm.org>, "Rafael Espindola" <rafael.espindola at gmail.com>
> Sent: Thursday, October 1, 2015 12:55:20 PM
> Subject: Re: lld and thread over-subscription
>
>
> I honestly think that a ulimit of 1024 max threads is too strict
> for a 48-core machine. Processes are independent of each other, so it
> is not strange for each of them to spawn as many threads as there are
> cores.

It is an understandable misconfiguration, but not something desirable in production.

> What's the reason you cannot increase the limit?
>

It is a soft limit, and I can. Running 'ulimit -u 3072' and then re-running lit causes these failures to go away. My concern is that a soft process limit of 1024 is a common default (at least on any RedHat-derived Linux distribution) regardless of the number of cores on the machine. And, obviously, parallel makes are still very common.

Regardless, do you think it would be reasonable for lit to adjust the soft process limit by default to avoid these kinds of issues, at least when running our regression tests?

Thanks again,
Hal

>
> On Thu, Oct 1, 2015 at 10:26 AM, Hal Finkel < hfinkel at anl.gov >
> wrote:
> [...]
lld :: mach-o/twolevel_namespace_undef_dynamic_lookup.yaml > lld :: mach-o/usage.yaml > lld-Unit :: CoreTests/CoreTests/Range.conversion_to_pointer_range > lld-Unit :: CoreTests/CoreTests/Range.slice > lld-Unit :: CoreTests/CoreTests/Range.user1 > lld-Unit :: CoreTests/CoreTests/Range.user2 > > these are listed as unresolved for the same underlying reason, for > example: > > ******************** > UNRESOLVED: lld-Unit :: CoreTests/CoreTests/Range.user1 (25040 of > 25181) > ******************** TEST 'lld-Unit :: > CoreTests/CoreTests/Range.user1' FAILED ******************** > Exception during script execution: > Traceback (most recent call last): > File "/src/llvm/utils/lit/lit/run.py", line 166, in execute_test > result = test.config.test_format.execute(test, self.lit_config) > File "/src/llvm/utils/lit/lit/formats/googletest.py", line 113, in > execute > cmd, env=test.config.environment) > File "/src/llvm/utils/lit/lit/util.py", line 166, in executeCommand > env=env, close_fds=kUseCloseFDs) > File "/install/ppc64/Python-2.7/lib/python2.7/subprocess.py", line > 710, in __init__ > errread, errwrite) > File "/install/ppc64/Python-2.7/lib/python2.7/subprocess.py", line > 1231, in _execute_child > self.pid = os.fork() > OSError: [Errno 11] Resource temporarily unavailable > > Being naturally nondeterministic, running again with the default > number of parallel lit tests changes which tests fail (for example, > running a second time adds tests under COFF). 
> > And, FWIW, these tests generally fail on my system (for reasons > seemingly unrelated to the thread/process resource issue): > > lld :: Driver/lib-search.test > lld :: Driver/undef-basic.objtxt > lld :: elf2/dynamic-reloc.s > lld :: elf2/shared.s > lld :: elf2/soname.s > lld :: elf/librarynotfound.test > lld :: elf/responsefile.test > lld :: mach-o/dylib-install-names.yaml > lld :: mach-o/force_load-dylib.yaml > lld :: mach-o/lib-search-paths.yaml > lld :: mach-o/parse-text-relocs-arm64.yaml > lld :: mach-o/upward-dylib-load-command.yaml > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.AsNeeded > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymAlias > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymDecimal > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymHexadecimal > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.DefsymOctal > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Empty > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Entry > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.EntryJoined > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.EntryShort > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.ExportDynamic > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Init > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.InitJoined > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.NoExportDynamic > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.NoinhibitExec > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Output > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.OutputDefault > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.Rpath > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.RpathEq > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SOName > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SONameH > lld-Unit :: DriverTests/DriverTests/GnuLdParserTest.SONameSingleDash > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Entry > lld-Unit :: 
DriverTests/DriverTests/LinkerScriptTest.ExprEval > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Group > lld-Unit :: > DriverTests/DriverTests/LinkerScriptTest.IgnoreSearchDirNoStdLib > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Input > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.Output > lld-Unit :: DriverTests/DriverTests/LinkerScriptTest.SearchDir > lld-Unit :: DriverTests/DriverTests/UniversalDriver.flavor > > (it could be big-Endian issues, LLVM bugs, etc. -- I've yet to > investigate). > > The easiest thing to do is to make lld tests run using lit -j 1, but > we may also want to think about how to more-gracefully handle this > situation in general, because it seems like something a user is not > unlikely to hit. > > Thanks again, > Hal > > > > > > > Thanks again, > > Hal > > > > -- > > Hal Finkel > > Assistant Computational Scientist > > Leadership Computing Facility > > Argonne National Laboratory > > > > > > -- > Hal Finkel > Assistant Computational Scientist > Leadership Computing Facility > Argonne National Laboratory > >-- Hal Finkel Assistant Computational Scientist Leadership Computing Facility Argonne National Laboratory
Alex Rosenberg via llvm-dev
2015-Oct-02 08:39 UTC
[llvm-dev] lld and thread over-subscription
> On Oct 1, 2015, at 10:26 AM, Hal Finkel via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> On Thu, Oct 1, 2015 at 9:35 AM, Hal Finkel < hfinkel at anl.gov > wrote:
>>
>> Hi Rui, et al.,
>>
>> I was experimenting yesterday with building lld on my POWER7
>> PPC64/Linux machine, and ran into an unfortunate problem. When
>> running the regressions tests under lit, almost all of the tests
>> fail like this:
>>
>> terminate called after throwing an instance of 'std::system_error'
>> what(): Resource temporarily unavailable
>
> To provide a data point; my default environment has this:
>
> $ ulimit -a | grep proc
> max user processes (-u) 1024
>
> This machine has 48 cores, so with lit running 48 tests leaves each test with only about 20 available threads, much less than the 48 each test believes it can use.

We've seen a similar failure on OS X running tests in another LLVM project besides lld. Filipe may remember what it was.

I think the potential improvement here should be to lit since it can know the limit and schedule work accordingly.

Alex
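To make the arithmetic above concrete, and to illustrate what "lit can know the limit and schedule work accordingly" might look like, here is a minimal Python sketch. It is not lit code: the function names `threads_per_test` and `max_parallel_tests` are hypothetical, and the model assumes each test process starts one thread per core and ignores every other process the user is running, so the result is an upper bound at best.

```python
import os
import resource


def threads_per_test(nproc_limit, parallel_tests):
    """Threads each test could create if the limit were shared evenly."""
    return nproc_limit // parallel_tests


def max_parallel_tests(nproc_limit, threads_per_process, requested_jobs):
    """Cap the job count so jobs * threads_per_process fits under the limit.

    Hypothetical helper, not part of lit. Always allows at least one job.
    """
    if nproc_limit == resource.RLIM_INFINITY:
        return requested_jobs
    return max(1, min(requested_jobs, nproc_limit // threads_per_process))


if __name__ == "__main__":
    # Numbers from the report: a limit of 1024 and 48 cores.
    print(threads_per_test(1024, 48))        # 21, i.e. "about 20"
    print(max_parallel_tests(1024, 48, 48))  # 21 jobs instead of 48

    # The same calculation against this machine's actual soft limit.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    cores = os.cpu_count() or 1
    print(max_parallel_tests(soft, cores, cores))
```

On Linux, RLIMIT_NPROC counts threads as well as processes for the user, which is why thread creation and lit's own fork() can both fail with EAGAIN under the same limit.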
On Thu, Oct 1, 2015 at 10:55 AM, Rui Ueyama via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> I honestly think that the ulimit of 1024 max threads is too strict for 48
> core machine. Processes are independent each other, so it is not strange
> for them to spawn as many threads as the number of cores. What's the reason
> you cannot increase the limit?

Yeah, this is it. We've run into this internally on our Linux bots. Basically, the threading abstractions inside LLD spawn #cores threads for their thread pool as one of the very first things. So if your build is #cores wide, you end up with #cores ^ 2 threads total.

The simplest solution is just upping the ulimit. This may be something we can even do inside lit so users get it automatically.

Beyond that, changes to LLD could ameliorate this; fundamentally though it has to do with thread pools knowing how many threads they need to spin up. A nasty solution could be an environment variable like LLD_NUM_THREADS. We could also have a command line flag, and do something like `%lld` in the tests, as we do for clang with `%clang_cc1`, where some extra stuff is inserted in the expansion telling lld to use a smaller thread count (for the tests, --num-threads=1 would be fine I think).

-- Sean Silva
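For illustration, the ideas in Sean's reply can be sketched in a few lines of Python (the language lit is written in). Everything here is an assumption for the sake of the example: LLD_NUM_THREADS is only the variable name floated in the thread, not an existing lld feature, and `total_threads`, `pool_size` and `raise_soft_limit` are hypothetical helper names, not lit or lld APIs.

```python
import os
import resource


def total_threads(parallel_links, cores):
    """Pool threads alive at once if every link starts `cores` threads."""
    return parallel_links * cores


def pool_size(cores, env=None):
    """Pick a pool size, honoring a hypothetical LLD_NUM_THREADS override."""
    env = os.environ if env is None else env
    value = env.get("LLD_NUM_THREADS")
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        # Unset or malformed: fall back to one thread per core.
        return cores


def raise_soft_limit(which):
    """Raise the soft limit for `which` to its hard limit; return the result."""
    soft, hard = resource.getrlimit(which)
    if soft != hard:
        # An unprivileged process may raise its soft limit up to the hard limit.
        resource.setrlimit(which, (hard, hard))
    return resource.getrlimit(which)[0]


if __name__ == "__main__":
    print(total_threads(48, 48))                    # 2304, above the 1024 limit
    print(pool_size(48, {"LLD_NUM_THREADS": "1"}))  # 1
    print(raise_soft_limit(resource.RLIMIT_NPROC))
```

Raising the soft limit only helps when the hard limit is higher than the soft one; if both are 1024, only an administrator can change it, which is one argument for also limiting the thread count in the tests.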