Jim Klimov <jimklimov+nut at gmail.com> writes:
> you're likely on to something!
>
> While NUT CI farm runs, give or take, 10^3 builds and tests across the
> matrix of platforms, toolkits and dependencies fir each iteration, and most
> of these pass green or catch true coding errors, I did occasionally see
> failed C++ tests (and also NIT where it waits for both OL and OB to be seen
> on a dummy-ups over certain time).
>
> Mostly this correlated with slow-down of build agents (esp. VMs on
> congested hosts), and maybe kernel or its context-switching-under-stress
> tuning (openbsd and macos seen more often than others), but I did not
> succeed pinpointing the problem for the C++ case.
This is running on a real computer, not a VM, and it's a 9th gen i7 with
32GB of RAM which does things speedily in general.
> In that OL-OB test of NIT, had to sort of write it off - if the VM is too
> busy that a 1-second timer flip is not happening/detected over 10 seconds,
> it is a SUT problem more than NUT problem. A real system on battery and
> frantically shutting down (causing stress/slowness) might have power lost
> during that time though.
Yes, if a signal doesn't show up in 10s, that's an issue. But I
don't
think that's what we are seeing here.
> IPC tests are similarly flawed by nature, communicating two processes
> that both have to get a slice of CPU in a given time frame for the test (or
> real-life reaction), but if you can get something to fail reliably in
> reasonable conditions (relevant under normal load) - that's really
> encouraging for the prospect of fixing it.
I would say that if a test fails, we need to be able to say that the SUT
is broken. But I don't see that here.
So:
Is it expected that a failed test will dump core? This is surprising
to me. I'd expect that failures would just be counted and printed
out.
Are you sure the tests reliably use the as-built libs and do not
reach into any previous nut installation on the system?
(I just de-installed 2.8.2 and get the same issue.)
running the unit test by hand (now that the other libs are gone, it
feels safe)
$ ./cppunittest
D: Getting test suite...
D: Preparing test runner...
D: Setting test runner outputter...
D: Launching the test run...
.................................F.terminate called after throwing an instance
of 'std::runtime_error'
what(): Poll on communication pipe read end 5 failed: 4
Abort trap (core dumped)
4 is EINTR, and that means select was interrupted, perhaps a signal.
Which does not seem necessarily buggy. There is a comment in nutipc.hpp
that indicates that recovery probably should be happen but it's not
implemented.
ktrace shows
3507 3507 cppunittest 1733147496.035727910 CALL
_lwp_create(0x7f7fffeaf860,0,0x7c38b65a80a0)
3507 3507 cppunittest 1733147496.035739244 RET _lwp_create 0
3507 3507 cppunittest 1733147496.035761910 CALL
__sigaction_sigtramp(SIGUSR1,0x7f7fffeafc10,0,0x7c38b497f620,2)
3507 3507 cppunittest 1733147496.035773243 RET __sigaction_sigtramp 0
3507 3507 cppunittest 1733147496.035784577 CALL
__sigaction_sigtramp(SIGUSR2,0x7f7fffeafc10,0,0x7c38b497f620,2)
3507 3507 cppunittest 1733147496.035795910 RET __sigaction_sigtramp 0
3507 24482 cppunittest 1733147496.035800910 CALL _lwp_ctl(1,0x7c38b65a8148)
3507 3507 cppunittest 1733147496.035812243 CALL getpid
3507 24482 cppunittest 1733147496.035823577 RET _lwp_ctl 0
3507 3507 cppunittest 1733147496.035834910 RET getpid 3507/0xdb3,
4602/0x11fa
3507 24482 cppunittest 1733147496.035846243 CALL
__select50(0x100,0x7c38b41ffde0,0,0,0)
3507 3507 cppunittest 1733147496.035857535 CALL kill(0xdb3, SIGUSR1)
3507 3507 cppunittest 1733147496.035891535 RET kill 0
3507 3507 cppunittest 1733147496.035914201 CALL kill(0xdb3, SIGUSR2)
3507 3507 cppunittest 1733147496.035948202 RET kill 0
3507 3507 cppunittest 1733147496.035959535 CALL kill(0xdb3, SIGUSR2)
3507 3507 cppunittest 1733147496.035993493 RET kill 0
3507 3507 cppunittest 1733147496.036016201 CALL kill(0xdb3, SIGUSR1)
3507 3507 cppunittest 1733147496.036027493 RET kill 0
3507 24482 cppunittest 1733147496.036004868 RET __select50 -1 errno 4
Interrupted system call
3507 3507 cppunittest 1733147496.036038826 CALL kill(0xdb3, SIGUSR1)
3507 3507 cppunittest 1733147496.036061493 RET kill 0
3507 3507 cppunittest 1733147496.036086159 CALL
__nanosleep50(0x7f7fffeafdf0,0x7f7fffeafe00)
3507 24482 cppunittest 1733147496.036179118 PSIG SIGUSR1 caught
handler=0x48555c mask=(): code=SI_USER sent by pid=3507, uid=10853)
3507 24482 cppunittest 1733147496.036190451 PSIG SIGUSR2 caught
handler=0x48555c mask=(30): code=SI_USER sent by pid=3507, uid=10853)
so not really sure what's going on, but looks like select got
interrupted so the handler could run, and that seems ok. Maybe other
systems auto restart it?