thr3ads.net - Nut upsdev - [Nut-upsdev] git stable, cppunit? [Dec 2024]

Jim Klimov

2024-Dec-02 08:58 UTC

[Nut-upsdev] git stable, cppunit?

Thanks,

  you're likely on to something!

  While NUT CI farm runs, give or take, 10^3 builds and tests across the
matrix of platforms, toolkits and dependencies for each iteration, and most
of these pass green or catch true coding errors, I did occasionally see
failed C++ tests (and also NIT where it waits for both OL and OB to be seen
on a dummy-ups over certain time).

  Mostly this correlated with slow-down of build agents (esp. VMs on
congested hosts), and maybe kernel or its context-switching-under-stress
tuning (openbsd and macos seen more often than others), but I did not
succeed pinpointing the problem for the C++ case.

  In that OL-OB test of NIT, I had to sort of write it off - if the VM is so
busy that a 1-second timer flip is not happening/detected over 10 seconds,
it is a SUT problem more than NUT problem. A real system on battery and
frantically shutting down (causing stress/slowness) might have power lost
during that time though.

  IPC tests are similarly flawed by nature, communicating two processes
that both have to get a slice of CPU in a given time frame for the test (or
real-life reaction), but if you can get something to fail reliably in
reasonable conditions (relevant under normal load) - that's really
encouraging for the prospect of fixing it.

Jim

On Mon, Dec 2, 2024, 01:36 Greg Troxel via Nut-upsdev <
nut-upsdev at alioth-lists.debian.net> wrote:
> After not paying such close attention for a while, for no particular
> reason, I'm bringing the pkgsrc-wip package (which tracks master) up to
> date.
>
> I am doing a full build of .1412, 3e004e9.
>
> Things are mostly ok, but in tests, cppunittest seems to crash.   I
> wonder: what was expected?
>
> Reading nutipc_ut.cpp, I don't understand the test, and in particular I
> don't understand the assumption that the signal handler will run
> promptly.
>
> Do tests pass for everyone else with git master?
>
> I am using gcc 10.
>
> I got a core file and here's the backtrace.  Seems to be gnu extension
> new_allocator?
>
> (gdb) bt
> #0  0x00007bba8957eeea in _lwp_kill () from /usr/lib/libc.so.12
> #1  0x00007bba895846e0 in abort () from /usr/lib/libc.so.12
> #2  0x00007bba8a2fd165 in ?? () from /usr/lib/libstdc++.so.9
> #3  0x00007bba8a2f2e2d in __cxxabiv1::__terminate(void (*)()) () from
> /usr/lib/libstdc++.so.9
> #4  0x00007bba8a2f2e6f in std::terminate() () from /usr/lib/libstdc++.so.9
> #5  0x00007bba8a2f2dd0 in __cxa_throw () from /usr/lib/libstdc++.so.9
> #6  0x0000000000479485 in
> nut::Signal::HandlerThread<TestSignalHandler>::main
> (comm_pipe_read_end=<optimized out>) at
> /usr/include/g++/ext/new_allocator.h:89
> #7  0x00007bba8a60c89f in ?? () from /usr/lib/libpthread.so.1
> #8  0x00007bba894930e0 in ?? () from /usr/lib/libc.so.12
> #9  0x0000000000400000 in ?? ()
> #10 0x00007bba89200000 in ?? ()
> #11 0x0000001003a0efff in ?? ()
> #12 0x00007bba890000c0 in ?? ()
> #13 0x00000000001fff40 in ?? ()
> #14 0x0000000000000000 in ?? ()
>

Greg Troxel

2024-Dec-02 14:06 UTC

[Nut-upsdev] git stable, cppunit?

Jim Klimov <jimklimov+nut at gmail.com> writes:
>   you're likely on to something!
>
>   While NUT CI farm runs, give or take, 10^3 builds and tests across the
> matrix of platforms, toolkits and dependencies for each iteration, and most
> of these pass green or catch true coding errors, I did occasionally see
> failed C++ tests (and also NIT where it waits for both OL and OB to be seen
> on a dummy-ups over certain time).
>
>   Mostly this correlated with slow-down of build agents (esp. VMs on
> congested hosts), and maybe kernel or its context-switching-under-stress
> tuning (openbsd and macos seen more often than others), but I did not
> succeed pinpointing the problem for the C++ case.
This is running on a real computer, not a VM, and it's a 9th gen i7 with
32GB of RAM which does things speedily in general.
>   In that OL-OB test of NIT, had to sort of write it off - if the VM is too
> busy that a 1-second timer flip is not happening/detected over 10 seconds,
> it is a SUT problem more than NUT problem. A real system on battery and
> frantically shutting down (causing stress/slowness) might have power lost
> during that time though.
Yes, if a signal doesn't show up in 10s, that's an issue.  But I don't
think that's what we are seeing here.
>   IPC tests are similarly flawed by nature, communicating two processes
> that both have to get a slice of CPU in a given time frame for the test (or
> real-life reaction), but if you can get something to fail reliably in
> reasonable conditions (relevant under normal load) - that's really
> encouraging for the prospect of fixing it.
I would say that if a test fails, we need to be able to say that the SUT
is broken.   But I don't see that here.



So:

  Is it expected that a failed test will dump core?  This is surprising
  to me.  I'd expect that failures would just be counted and printed
  out.

  Are you sure the tests reliably use the as-built libs  and do not
  reach into any previous nut installation on the system?
  (I just de-installed 2.8.2 and get the same issue.)


running the unit test by hand (now that the other libs are gone, it
feels safe)

  $ ./cppunittest 
  D: Getting test suite...
  D: Preparing test runner...
  D: Setting test runner outputter...
  D: Launching the test run...
  .................................F.terminate called after throwing an instance of 'std::runtime_error'
    what():  Poll on communication pipe read end 5 failed: 4
  Abort trap (core dumped)
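
For context on the core dump: the backtrace earlier in the thread shows
__cxa_throw going straight to std::terminate() inside the handler thread's
main, which is what happens when an exception escapes a thread with no
handler in place, whatever the test framework would do with an ordinary
failed assertion. Below is a minimal sketch of one common way to turn such
an exception into a reportable failure. It assumes std::thread, which may
not be what nutipc uses, and failing_worker and run_and_capture are
hypothetical names, not NUT code.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical worker body: stands in for a thread main that throws.
void failing_worker() {
    throw std::runtime_error("Poll on communication pipe read end failed");
}

// Runs `body` in a new thread. Returns the what() text of an exception the
// thread threw, or an empty string if the thread finished normally.
template <typename Fn>
std::string run_and_capture(Fn body) {
    std::exception_ptr error;
    std::thread worker([&error, body]() {
        try {
            body();
        } catch (...) {
            // Without this catch the exception would leave the thread
            // and std::terminate() would abort the whole process.
            error = std::current_exception();
        }
    });
    worker.join();
    if (!error) {
        return std::string();
    }
    try {
        std::rethrow_exception(error);
    } catch (const std::exception &e) {
        return e.what();
    } catch (...) {
        return "unknown exception";
    }
}
```

With something along these lines, the test could compare the returned
string against "" and report a plain failure instead of aborting.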


4 is EINTR, which means select was interrupted, perhaps by a signal.
That does not seem necessarily buggy.  There is a comment in nutipc.hpp
that indicates that recovery probably should happen, but it's not
implemented.

ktrace shows

  3507   3507 cppunittest 1733147496.035727910 CALL  _lwp_create(0x7f7fffeaf860,0,0x7c38b65a80a0)
  3507   3507 cppunittest 1733147496.035739244 RET   _lwp_create 0
  3507   3507 cppunittest 1733147496.035761910 CALL  __sigaction_sigtramp(SIGUSR1,0x7f7fffeafc10,0,0x7c38b497f620,2)
  3507   3507 cppunittest 1733147496.035773243 RET   __sigaction_sigtramp 0
  3507   3507 cppunittest 1733147496.035784577 CALL  __sigaction_sigtramp(SIGUSR2,0x7f7fffeafc10,0,0x7c38b497f620,2)
  3507   3507 cppunittest 1733147496.035795910 RET   __sigaction_sigtramp 0
  3507  24482 cppunittest 1733147496.035800910 CALL  _lwp_ctl(1,0x7c38b65a8148)
  3507   3507 cppunittest 1733147496.035812243 CALL  getpid
  3507  24482 cppunittest 1733147496.035823577 RET   _lwp_ctl 0
  3507   3507 cppunittest 1733147496.035834910 RET   getpid 3507/0xdb3, 4602/0x11fa
  3507  24482 cppunittest 1733147496.035846243 CALL  __select50(0x100,0x7c38b41ffde0,0,0,0)
  3507   3507 cppunittest 1733147496.035857535 CALL  kill(0xdb3, SIGUSR1)
  3507   3507 cppunittest 1733147496.035891535 RET   kill 0
  3507   3507 cppunittest 1733147496.035914201 CALL  kill(0xdb3, SIGUSR2)
  3507   3507 cppunittest 1733147496.035948202 RET   kill 0
  3507   3507 cppunittest 1733147496.035959535 CALL  kill(0xdb3, SIGUSR2)
  3507   3507 cppunittest 1733147496.035993493 RET   kill 0
  3507   3507 cppunittest 1733147496.036016201 CALL  kill(0xdb3, SIGUSR1)
  3507   3507 cppunittest 1733147496.036027493 RET   kill 0
  3507  24482 cppunittest 1733147496.036004868 RET   __select50 -1 errno 4 Interrupted system call
  3507   3507 cppunittest 1733147496.036038826 CALL  kill(0xdb3, SIGUSR1)
  3507   3507 cppunittest 1733147496.036061493 RET   kill 0
  3507   3507 cppunittest 1733147496.036086159 CALL  __nanosleep50(0x7f7fffeafdf0,0x7f7fffeafe00)
  3507  24482 cppunittest 1733147496.036179118 PSIG  SIGUSR1 caught handler=0x48555c mask=(): code=SI_USER sent by pid=3507, uid=10853)
  3507  24482 cppunittest 1733147496.036190451 PSIG  SIGUSR2 caught handler=0x48555c mask=(30): code=SI_USER sent by pid=3507, uid=10853)


so not really sure what's going on, but looks like select got
interrupted so the handler could run, and that seems ok.  Maybe other
systems auto restart it?
