Jim Klimov <jimklimov+nut at gmail.com> writes:
>   you're likely on to something!
>
>   While NUT CI farm runs, give or take, 10^3 builds and tests across the
> matrix of platforms, toolkits and dependencies fir each iteration, and most
> of these pass green or catch true coding errors, I did occasionally see
> failed C++ tests (and also NIT where it waits for both OL and OB to be seen
> on a dummy-ups over certain time).
>
>   Mostly this correlated with slow-down of build agents (esp. VMs on
> congested hosts), and maybe kernel or its context-switching-under-stress
> tuning (openbsd and macos seen more often than others), but I did not
> succeed pinpointing the problem for the C++ case.
This is running on a real computer, not a VM, and it's a 9th gen i7 with
32GB of RAM which does things speedily in general.
>   In that OL-OB test of NIT, had to sort of write it off - if the VM is too
> busy that a 1-second timer flip is not happening/detected over 10 seconds,
> it is a SUT problem more than NUT problem. A real system on battery and
> frantically shutting down (causing stress/slowness) might have power lost
> during that time though.
Yes, if a signal doesn't show up in 10s, that's an issue.  But I
don't
think that's what we are seeing here.
>   IPC tests are similarly flawed by nature, communicating two processes
> that both have to get a slice of CPU in a given time frame for the test (or
> real-life reaction), but if you can get something to fail reliably in
> reasonable conditions (relevant under normal load) - that's really
> encouraging for the prospect of fixing it.
I would say that if a test fails, we need to be able to say that the SUT
is broken.   But I don't see that here.
So:
  Is it expected that a failed test will dump core?  This is surprising
  to me.  I'd expect that failures would just be counted and printed
  out.
  Are you sure the tests reliably use the as-built libs  and do not
  reach into any previous nut installation on the system?
  (I just de-installed 2.8.2 and get the same issue.)
running the unit test by hand (now that the other libs are gone, it
feels safe)
  $ ./cppunittest 
  D: Getting test suite...
  D: Preparing test runner...
  D: Setting test runner outputter...
  D: Launching the test run...
  .................................F.terminate called after throwing an instance
of 'std::runtime_error'
    what():  Poll on communication pipe read end 5 failed: 4
  Abort trap (core dumped)
4 is EINTR, and that means select was interrupted, perhaps a signal.
Which does not seem necessarily buggy.  There is a comment in nutipc.hpp
that indicates that recovery probably should be happen but it's not
implemented.
ktrace shows
  3507   3507 cppunittest 1733147496.035727910 CALL 
_lwp_create(0x7f7fffeaf860,0,0x7c38b65a80a0)
  3507   3507 cppunittest 1733147496.035739244 RET   _lwp_create 0
  3507   3507 cppunittest 1733147496.035761910 CALL 
__sigaction_sigtramp(SIGUSR1,0x7f7fffeafc10,0,0x7c38b497f620,2)
  3507   3507 cppunittest 1733147496.035773243 RET   __sigaction_sigtramp 0
  3507   3507 cppunittest 1733147496.035784577 CALL 
__sigaction_sigtramp(SIGUSR2,0x7f7fffeafc10,0,0x7c38b497f620,2)
  3507   3507 cppunittest 1733147496.035795910 RET   __sigaction_sigtramp 0
  3507  24482 cppunittest 1733147496.035800910 CALL  _lwp_ctl(1,0x7c38b65a8148)
  3507   3507 cppunittest 1733147496.035812243 CALL  getpid
  3507  24482 cppunittest 1733147496.035823577 RET   _lwp_ctl 0
  3507   3507 cppunittest 1733147496.035834910 RET   getpid 3507/0xdb3,
4602/0x11fa
  3507  24482 cppunittest 1733147496.035846243 CALL 
__select50(0x100,0x7c38b41ffde0,0,0,0)
  3507   3507 cppunittest 1733147496.035857535 CALL  kill(0xdb3, SIGUSR1)
  3507   3507 cppunittest 1733147496.035891535 RET   kill 0
  3507   3507 cppunittest 1733147496.035914201 CALL  kill(0xdb3, SIGUSR2)
  3507   3507 cppunittest 1733147496.035948202 RET   kill 0
  3507   3507 cppunittest 1733147496.035959535 CALL  kill(0xdb3, SIGUSR2)
  3507   3507 cppunittest 1733147496.035993493 RET   kill 0
  3507   3507 cppunittest 1733147496.036016201 CALL  kill(0xdb3, SIGUSR1)
  3507   3507 cppunittest 1733147496.036027493 RET   kill 0
  3507  24482 cppunittest 1733147496.036004868 RET   __select50 -1 errno 4
Interrupted system call
  3507   3507 cppunittest 1733147496.036038826 CALL  kill(0xdb3, SIGUSR1)
  3507   3507 cppunittest 1733147496.036061493 RET   kill 0
  3507   3507 cppunittest 1733147496.036086159 CALL 
__nanosleep50(0x7f7fffeafdf0,0x7f7fffeafe00)
  3507  24482 cppunittest 1733147496.036179118 PSIG  SIGUSR1 caught
handler=0x48555c mask=(): code=SI_USER sent by pid=3507, uid=10853)
  3507  24482 cppunittest 1733147496.036190451 PSIG  SIGUSR2 caught
handler=0x48555c mask=(30): code=SI_USER sent by pid=3507, uid=10853)
so not really sure what's going on, but looks like select got
interrupted so the handler could run, and that seems ok.  Maybe other
systems auto restart it?