Hi,

From time to time, I see check-all hang during running of lit tests.
The hang always happens at the > 90% completion stage and I'm pretty
sure all tests have been run and check-all is just waiting for
lit/python to exit.  I see a single python process running, taking
very little CPU time.  An strace of that process shows this:

select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32168}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
futex(0x3bcc8c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bcc8c0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = 0
futex(0x3bcc8c0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x3bcc8c0, FUTEX_WAKE_PRIVATE, 1) = 1
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
futex(0x3bcc8c0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x3bcc8c0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x3bcc8c0, FUTEX_WAKE_PRIVATE, 1) = 1
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout)

It appears that python is waiting for some I/O or something which never
appears.

Has anyone else seen this before?  Any ideas of what is going on or how
to fix it?

                           -David
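A note for reading the trace above: a select() call with no file descriptor sets and only a timeout acts as a sleep, and the 1000, 2000, 4000, ... microsecond progression capped at 50000 resembles the sleep-polling loop that Python 2's threading.Condition.wait runs when given a timeout. Whether that is what lit is doing here is an assumption; the thread does not confirm it. A minimal sketch of such a backoff, using a virtual clock so nothing actually sleeps (`backoff_delays_us` and its parameters are hypothetical names):

```python
def backoff_delays_us(timeout_s, initial_s=0.0005, cap_s=0.05):
    """Simulate a sleep-polling wait; return each sleep length in microseconds.

    Illustrative only: the clock is virtual, so the result is deterministic.
    """
    delays = []
    remaining = timeout_s
    delay = initial_s
    while remaining > 1e-9:
        # Double the delay, but never exceed the cap or the time that is left.
        delay = min(delay * 2, remaining, cap_s)
        delays.append(int(round(delay * 1000000)))
        remaining -= delay
    return delays
```

With a 1-second timeout the simulated delays begin 1000, 2000, 4000, 8000, 16000, 32000 and then settle at 50000, ending with a shorter sleep for whatever time is left, much like the {0, 32168} entry in the trace.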
What you're seeing is just the fact that lit is waiting on subprocesses
(select is waiting on the pipes, I suspect).

Anyways, you'll need to dig into what it is waiting on, and what *that*
process is doing that is stuck, to make progress.

I've not seen anything like this, but I basically never run `check-all`
these days because LLDB and sanitizer tests are too flaky. =[ I've not been
able to interest anyone in fixing this either, sadly.

On Wed, Jan 2, 2019 at 10:09 AM David Greene via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> From time to time, I see check-all hang during running of lit tests.
> [...]
> It appears that python is waiting for some I/O or something which never
> appears.
>
> Has anyone else seen this before?  Any ideas of what is going on or how
> to fix it?
>
> -David
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
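One way to start the digging suggested above is to list the direct children of the idle python process and then inspect whichever child is stuck. The following is a minimal, Linux-only sketch that assumes a /proc filesystem; `children_of` and its `proc_root` parameter are hypothetical names, not part of lit.

```python
import os


def children_of(pid, proc_root="/proc"):
    """Return sorted (child_pid, command_name) pairs for direct children of pid.

    Linux-only sketch: reads <proc_root>/<n>/stat for every numeric entry.
    """
    children = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            # The process exited while we were scanning; skip it.
            continue
        # Format is "pid (comm) state ppid ...". The command name may contain
        # spaces or parentheses, so split at the last ')'.
        open_paren = stat.find("(")
        close_paren = stat.rfind(")")
        if open_paren < 0 or close_paren < 0:
            continue
        comm = stat[open_paren + 1:close_paren]
        fields = stat[close_paren + 1:].split()  # fields[0]=state, fields[1]=ppid
        if len(fields) >= 2 and int(fields[1]) == pid:
            children.append((int(entry), comm))
    return sorted(children)
```

Calling `children_of()` with the PID of the idle python process should name the children it is waiting for. Note that the kernel truncates the command name in the stat file, so a long test binary name will appear shortened.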
Hi David, Chandler,

I see lldb tests hang often, and then I kill the dotest process.

I'd like to stop running check-all too, but I feel it's important when I
modify FileCheck.  The flakiness that Chandler mentioned makes it
time-consuming to verify test results.

Joel

On Wed, Jan 2, 2019 at 4:41 PM Chandler Carruth via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> [...]
> I've not seen anything like this, but I basically never run `check-all`
> these days because LLDB and sanitizer tests are too flaky. =[ I've not been
> able to interest anyone in fixing this either sadly.
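Killing a hung test process by hand, as described above, can in principle be automated with a per-command time limit. The sketch below is a generic illustration using Python 3's standard subprocess module; it is not how lit or dotest handle timeouts, and `run_with_timeout` is a hypothetical helper.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s."""
    try:
        # When the timeout expires, subprocess.run kills the direct child,
        # waits for it, and raises TimeoutExpired. Grandchildren that the
        # child spawned are not killed by this.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    return result.returncode
```

A `None` result marks the command as hung rather than failed, which keeps the two cases apart when tallying flaky tests. lit itself also has a `--timeout` option for per-test limits; whether it suits this setup is not something the thread covers.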
Chandler Carruth via llvm-dev <llvm-dev at lists.llvm.org> writes:

> What you're seeing is just the fact that lit is waiting on
> subprocesses (select is waiting on the pipes i suspect).

Right.  Some digging revealed that it is waiting on
getline_nohang.cc.tmp, a tsan test.  I see that this test has been
disabled for NetBSD, due to it sometimes failing.  I'm seeing the same
on Linux.

How can we stabilize the sanitizer tests so that check-all can work
reliably?  If some sanitizer tests are so flaky, I should think they
should be marked UNSUPPORTED.  Who has the authority to make those
determinations?

                           -David