Jim Klimov
2023-Jun-13 14:06 UTC
[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12
After launching the command several times, with debug (posted by new code in a new branch for the investigation) confirming that the same daemon handles operations from the new client instances, its strace now has numerous FDs to report after select() - so I guess it is a problem of detecting an exit of the counterpart. 0.000000 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000034 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000041 [D1] parse_at: processing CANCEL-TIMER 0.000054 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000057 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000060 [D1] parse_at: processing START-TIMER 0.000151 [D1] Keeping stderr open due to debug verbosity 8 0.000195 Timer daemon started 0.000204 [D2] Timer daemon waiting for connections on pipefd 10 0.250325 [D3] new connection on fd 7 0.250377 New timer: heartbeat-failure-timer (660 seconds) 0.250423 [D1] Exiting upssched (CLI process) 0.000000 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000039 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000047 [D1] parse_at: processing CANCEL-TIMER 14.745773 [D3] new connection on fd 8 14.745829 Cancelling timer: heartbeat-failure-timer 0.000153 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000159 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000163 [D1] parse_at: processing START-TIMER 14.745890 [D3] new connection on fd 9 14.745969 New timer: heartbeat-failure-timer (660 seconds) 0.000293 [D1] Exiting upssched (CLI process) 0.000001 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000042 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000050 [D1] parse_at: processing CANCEL-TIMER 47.674032 [D3] new connection on fd 11 47.674160 Cancelling timer: heartbeat-failure-timer 0.000266 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000277 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000284 [D1] parse_at: processing START-TIMER 47.674235 [D3] new connection on fd 12 47.674388 New timer: heartbeat-failure-timer (660 seconds) 0.000487 [D1] Exiting upssched (CLI process) Note that `select` only lists remnants of "new connection" but not the FD#10 (pipefd): select(13, [7 8 9 10 11 12], NULL, NULL, {tv_sec=1, tv_usec=0}) = 5 (in [7 8 9 11 12], left {tv_sec=0, tv_usec=999997}) Jim On Tue, Jun 13, 2023 at 3:41?PM Jim Klimov <jimklimov+nut at gmail.com> wrote:> So, got some good news: I hear(*) I managed to reproduce the problem with > current NUT master and an adapted copy of your posted configs and script :D > Experimental debugging now sounds possible. > > (*) PC under the desk wails with all its cooling fans as soon as I started > the client which spawned a daemon and itself had exited: > > $ UPSNAME=heartbeat at localhost NOTIFYTYPE=ONBATT NUT_STATEPATH=/dev/shm > NUT_CONFPATH=`pwd`/tests/NIT/tmp/etc ./clients/upssched -DDDDDDDD > 0.000000 [D2] parse_at: is 'heartbeat at localhost' in AT command the > 'heartbeat at localhost' we were launched to process? > 0.000053 [D1] parse_at: 'heartbeat at localhost' in AT command > matched the 'heartbeat at localhost' UPSNAME we were launched to process > 0.000066 [D1] parse_at: processing CANCEL-TIMER > 0.000089 [D2] parse_at: is 'heartbeat at localhost' in AT command the > 'heartbeat at localhost' we were launched to process? > 0.000096 [D1] parse_at: 'heartbeat at localhost' in AT command > matched the 'heartbeat at localhost' UPSNAME we were launched to process > 0.000104 [D1] parse_at: processing START-TIMER > > $ > > Jim > > > On Tue, Jun 13, 2023 at 1:58?PM Kari Lempi?inen < > kari.lempiainen at summerday.net> wrote: > >> Hi, >> >> Thanks Jim! Here is more system information from the commands you >> mentioned. >> >> Kari >> >> root at fricka:~# lsof -p 1716171 >> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system >> /run/user/1002/gvfs >> Output information may be incomplete. >> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE >> NAME >> upssched 1716171 root cwd DIR 8,2 4096 2 / >> upssched 1716171 root rtd DIR 8,2 4096 2 / >> upssched 1716171 root txt REG 8,2 39240 21762148 >> /usr/sbin/upssched >> upssched 1716171 root mem REG 8,2 1922136 21764230 >> /usr/lib/x86_64-linux-gnu/libc.so.6 >> upssched 1716171 root mem REG 8,2 210968 21764476 >> /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 >> upssched 1716171 root 0u CHR 1,3 0t0 4 >> /dev/null >> upssched 1716171 root 1u CHR 1,3 0t0 4 >> /dev/null >> upssched 1716171 root 2u CHR 1,3 0t0 4 >> /dev/null >> upssched 1716171 root 3u unix 0x00000000eeb52825 0t0 16427010 >> type=DGRAM (CONNECTED) >> upssched 1716171 root 4r REG 8,2 545 39847858 >> /etc/nut/upssched.conf >> upssched 1716171 root 5u unix 0x00000000a035b466 0t0 16426351 >> type=STREAM (UNCONNECTED) >> upssched 1716171 root 6u unix 0x0000000050820b2b 0t0 16426352 >> type=STREAM (UNCONNECTED) >> upssched 1716171 root 7u unix 0x00000000447aac0e 0t0 16424724 >> /var/run/nut/upssched.pipe type=STREAM (CONNECTED) >> upssched 1716171 root 10u unix 0x00000000d7bd8088 0t0 16424722 >> /var/run/nut/upssched.pipe type=STREAM (LISTEN) >> >> >> root at fricka:~# ps -ef|grep ups >> root 1425 1 0 Jun11 ? 00:00:00 sshd: /usr/sbin/sshd >> -D [listener] 0 of 10-100 startups >> root 1156722 1 0 00:00 ? 00:00:00 /usr/sbin/cupsd -l >> root 1156723 1 0 00:00 ? 00:00:00 /usr/sbin/cups-browsed >> lp 1156748 1156722 0 00:00 ? 00:00:00 >> /usr/lib/cups/notifier/dbus dbus:// >> nut 1713117 1 0 14:30 ? 00:00:00 /lib/nut/dummy-ups -a >> heartbeat >> nut 1713154 1 0 14:30 ? 00:00:00 /lib/nut/usbhid-ups >> -a eaton5s >> nut 1713155 1 0 14:30 ? 00:00:00 /lib/nut/upsd -F >> root 1713156 1 0 14:30 ? 00:00:00 /lib/nut/upsmon -F >> root 1713157 1713156 0 14:30 ? 00:00:00 /lib/nut/upsmon -F >> root 1716171 1 99 14:35 ? 00:05:10 /sbin/upssched UPS >> heartbeat at localhost: On battery. >> root 1719510 1550505 0 14:40 pts/2 00:00:00 grep ups >> >> >> >> >> root at fricka:~# fuser /var/run/nut/upssched.pipe >> /run/nut/upssched.pipe: 1716171 >> >> >> root at fricka:/proc/1716171/fd# ls -la >> total 0 >> dr-x------ 2 root root 0 Jun 13 14:36 . >> dr-xr-xr-x 9 root root 0 Jun 13 14:35 .. >> lrwx------ 1 root root 64 Jun 13 14:36 0 -> /dev/null >> lrwx------ 1 root root 64 Jun 13 14:36 1 -> /dev/null >> lrwx------ 1 root root 64 Jun 13 14:36 10 -> 'socket:[16424722]' >> lrwx------ 1 root root 64 Jun 13 14:36 2 -> /dev/null >> lrwx------ 1 root root 64 Jun 13 14:36 3 -> 'socket:[16427010]' >> lr-x------ 1 root root 64 Jun 13 14:36 4 -> /etc/nut/upssched.conf >> lrwx------ 1 root root 64 Jun 13 14:36 5 -> 'socket:[16426351]' >> lrwx------ 1 root root 64 Jun 13 14:36 6 -> 'socket:[16426352]' >> lrwx------ 1 root root 64 Jun 13 14:36 7 -> 'socket:[16424724]' >> >> >> >> ------------------------------ >> *From:* Jim Klimov <jimklimov+nut at gmail.com> >> *Sent:* Tuesday, June 13, 2023 14:17 >> *To:* Kari Lempi?inen <kari.lempiainen at summerday.net> >> *Cc:* Arnaud Quette via Nut-upsuser <nut-upsuser at alioth-lists.debian.net>; >> nut-upsdev <nut-upsdev at alioth-lists.debian.net>; Dimitris Economou < >> dimitris.s.economou at gmail.com> >> *Subject:* Re: [Nut-upsuser] Upssched 100% CPU after updating Debian 12 >> >> FWIW, cross-posted the issue to NUT GitHub tracker: >> https://github.com/networkupstools/nut/issues/1964 >> >> Jim >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://alioth-lists.debian.net/pipermail/nut-upsdev/attachments/20230613/1ed3a16b/attachment-0001.htm>
Jim Klimov
2023-Jun-13 15:35 UTC
[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12
So... determining that FD is to be reaped proved hard. Internet lore suggests fcntl() and poll() on the FD, but it just seems valid to them. The errno is also usually not raised (once I saw a "111: Connection refused" though). So the best dumb idea so far is to bail out if we spent the whole loop (128 attempts) and only got zero-sized read replies and no errors. https://github.com/networkupstools/nut/pull/1965 Jim On Tue, Jun 13, 2023 at 4:06?PM Jim Klimov <jimklimov+nut at gmail.com> wrote:> After launching the command several times, with debug (posted by new code > in a new branch for the investigation) confirming that the same daemon > handles operations from the new client instances, its strace now has > numerous FDs to report after select() - so I guess it is a problem of > detecting an exit of the counterpart. > > > > 0.000000 [D2] parse_at: is 'heartbeat at localhost' in AT command the > 'heartbeat at localhost' we were launched to process? > 0.000034 [D1] parse_at: 'heartbeat at localhost' in AT command > matched the 'heartbeat at localhost' UPSNAME we were launched to process > 0.000041 [D1] parse_at: processing CANCEL-TIMER > 0.000054 [D2] parse_at: is 'heartbeat at localhost' in AT command the > 'heartbeat at localhost' we were launched to process? > 0.000057 [D1] parse_at: 'heartbeat at localhost' in AT command > matched the 'heartbeat at localhost' UPSNAME we were launched to process > 0.000060 [D1] parse_at: processing START-TIMER > 0.000151 [D1] Keeping stderr open due to debug verbosity 8 > 0.000195 Timer daemon started > 0.000204 [D2] Timer daemon waiting for connections on pipefd 10 > 0.250325 [D3] new connection on fd 7 > 0.250377 New timer: heartbeat-failure-timer (660 seconds) > 0.250423 [D1] Exiting upssched (CLI process) > > 0.000000 [D2] parse_at: is 'heartbeat at localhost' in AT command the > 'heartbeat at localhost' we were launched to process? > 0.000039 [D1] parse_at: 'heartbeat at localhost' in AT command > matched the 'heartbeat at localhost' UPSNAME we were launched to process > 0.000047 [D1] parse_at: processing CANCEL-TIMER > 14.745773 [D3] new connection on fd 8 > 14.745829 Cancelling timer: heartbeat-failure-timer > 0.000153 [D2] parse_at: is 'heartbeat at localhost' in AT command the > 'heartbeat at localhost' we were launched to process? > 0.000159 [D1] parse_at: 'heartbeat at localhost' in AT command > matched the 'heartbeat at localhost' UPSNAME we were launched to process > 0.000163 [D1] parse_at: processing START-TIMER > 14.745890 [D3] new connection on fd 9 > 14.745969 New timer: heartbeat-failure-timer (660 seconds) > 0.000293 [D1] Exiting upssched (CLI process) > > 0.000001 [D2] parse_at: is 'heartbeat at localhost' in AT command the > 'heartbeat at localhost' we were launched to process? > 0.000042 [D1] parse_at: 'heartbeat at localhost' in AT command > matched the 'heartbeat at localhost' UPSNAME we were launched to process > 0.000050 [D1] parse_at: processing CANCEL-TIMER > 47.674032 [D3] new connection on fd 11 > 47.674160 Cancelling timer: heartbeat-failure-timer > 0.000266 [D2] parse_at: is 'heartbeat at localhost' in AT command the > 'heartbeat at localhost' we were launched to process? > 0.000277 [D1] parse_at: 'heartbeat at localhost' in AT command > matched the 'heartbeat at localhost' UPSNAME we were launched to process > 0.000284 [D1] parse_at: processing START-TIMER > 47.674235 [D3] new connection on fd 12 > 47.674388 New timer: heartbeat-failure-timer (660 seconds) > 0.000487 [D1] Exiting upssched (CLI process) > > Note that `select` only lists remnants of "new connection" but not the > FD#10 (pipefd): > > select(13, [7 8 9 10 11 12], NULL, NULL, {tv_sec=1, tv_usec=0}) = 5 > (in [7 8 9 11 12], left {tv_sec=0, tv_usec=999997}) > > Jim > > > > On Tue, Jun 13, 2023 at 3:41?PM Jim Klimov <jimklimov+nut at gmail.com> > wrote: > >> So, got some good news: I hear(*) I managed to reproduce the problem with >> current NUT master and an adapted copy of your posted configs and script :D >> Experimental debugging now sounds possible. >> >> (*) PC under the desk wails with all its cooling fans as soon as I >> started the client which spawned a daemon and itself had exited: >> >> $ UPSNAME=heartbeat at localhost NOTIFYTYPE=ONBATT NUT_STATEPATH=/dev/shm >> NUT_CONFPATH=`pwd`/tests/NIT/tmp/etc ./clients/upssched -DDDDDDDD >> 0.000000 [D2] parse_at: is 'heartbeat at localhost' in AT command >> the 'heartbeat at localhost' we were launched to process? >> 0.000053 [D1] parse_at: 'heartbeat at localhost' in AT command >> matched the 'heartbeat at localhost' UPSNAME we were launched to process >> 0.000066 [D1] parse_at: processing CANCEL-TIMER >> 0.000089 [D2] parse_at: is 'heartbeat at localhost' in AT command >> the 'heartbeat at localhost' we were launched to process? >> 0.000096 [D1] parse_at: 'heartbeat at localhost' in AT command >> matched the 'heartbeat at localhost' UPSNAME we were launched to process >> 0.000104 [D1] parse_at: processing START-TIMER >> >> $ >> >> Jim >> >> >> On Tue, Jun 13, 2023 at 1:58?PM Kari Lempi?inen < >> kari.lempiainen at summerday.net> wrote: >> >>> Hi, >>> >>> Thanks Jim! Here is more system information from the commands you >>> mentioned. >>> >>> Kari >>> >>> root at fricka:~# lsof -p 1716171 >>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system >>> /run/user/1002/gvfs >>> Output information may be incomplete. >>> COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE >>> NAME >>> upssched 1716171 root cwd DIR 8,2 4096 2 / >>> upssched 1716171 root rtd DIR 8,2 4096 2 / >>> upssched 1716171 root txt REG 8,2 39240 21762148 >>> /usr/sbin/upssched >>> upssched 1716171 root mem REG 8,2 1922136 21764230 >>> /usr/lib/x86_64-linux-gnu/libc.so.6 >>> upssched 1716171 root mem REG 8,2 210968 21764476 >>> /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 >>> upssched 1716171 root 0u CHR 1,3 0t0 4 >>> /dev/null >>> upssched 1716171 root 1u CHR 1,3 0t0 4 >>> /dev/null >>> upssched 1716171 root 2u CHR 1,3 0t0 4 >>> /dev/null >>> upssched 1716171 root 3u unix 0x00000000eeb52825 0t0 16427010 >>> type=DGRAM (CONNECTED) >>> upssched 1716171 root 4r REG 8,2 545 39847858 >>> /etc/nut/upssched.conf >>> upssched 1716171 root 5u unix 0x00000000a035b466 0t0 16426351 >>> type=STREAM (UNCONNECTED) >>> upssched 1716171 root 6u unix 0x0000000050820b2b 0t0 16426352 >>> type=STREAM (UNCONNECTED) >>> upssched 1716171 root 7u unix 0x00000000447aac0e 0t0 16424724 >>> /var/run/nut/upssched.pipe type=STREAM (CONNECTED) >>> upssched 1716171 root 10u unix 0x00000000d7bd8088 0t0 16424722 >>> /var/run/nut/upssched.pipe type=STREAM (LISTEN) >>> >>> >>> root at fricka:~# ps -ef|grep ups >>> root 1425 1 0 Jun11 ? 00:00:00 sshd: /usr/sbin/sshd >>> -D [listener] 0 of 10-100 startups >>> root 1156722 1 0 00:00 ? 00:00:00 /usr/sbin/cupsd -l >>> root 1156723 1 0 00:00 ? 00:00:00 >>> /usr/sbin/cups-browsed >>> lp 1156748 1156722 0 00:00 ? 00:00:00 >>> /usr/lib/cups/notifier/dbus dbus:// >>> nut 1713117 1 0 14:30 ? 00:00:00 /lib/nut/dummy-ups >>> -a heartbeat >>> nut 1713154 1 0 14:30 ? 00:00:00 /lib/nut/usbhid-ups >>> -a eaton5s >>> nut 1713155 1 0 14:30 ? 00:00:00 /lib/nut/upsd -F >>> root 1713156 1 0 14:30 ? 00:00:00 /lib/nut/upsmon -F >>> root 1713157 1713156 0 14:30 ? 00:00:00 /lib/nut/upsmon -F >>> root 1716171 1 99 14:35 ? 00:05:10 /sbin/upssched UPS >>> heartbeat at localhost: On battery. >>> root 1719510 1550505 0 14:40 pts/2 00:00:00 grep ups >>> >>> >>> >>> >>> root at fricka:~# fuser /var/run/nut/upssched.pipe >>> /run/nut/upssched.pipe: 1716171 >>> >>> >>> root at fricka:/proc/1716171/fd# ls -la >>> total 0 >>> dr-x------ 2 root root 0 Jun 13 14:36 . >>> dr-xr-xr-x 9 root root 0 Jun 13 14:35 .. >>> lrwx------ 1 root root 64 Jun 13 14:36 0 -> /dev/null >>> lrwx------ 1 root root 64 Jun 13 14:36 1 -> /dev/null >>> lrwx------ 1 root root 64 Jun 13 14:36 10 -> 'socket:[16424722]' >>> lrwx------ 1 root root 64 Jun 13 14:36 2 -> /dev/null >>> lrwx------ 1 root root 64 Jun 13 14:36 3 -> 'socket:[16427010]' >>> lr-x------ 1 root root 64 Jun 13 14:36 4 -> /etc/nut/upssched.conf >>> lrwx------ 1 root root 64 Jun 13 14:36 5 -> 'socket:[16426351]' >>> lrwx------ 1 root root 64 Jun 13 14:36 6 -> 'socket:[16426352]' >>> lrwx------ 1 root root 64 Jun 13 14:36 7 -> 'socket:[16424724]' >>> >>> >>> >>> ------------------------------ >>> *From:* Jim Klimov <jimklimov+nut at gmail.com> >>> *Sent:* Tuesday, June 13, 2023 14:17 >>> *To:* Kari Lempi?inen <kari.lempiainen at summerday.net> >>> *Cc:* Arnaud Quette via Nut-upsuser <nut-upsuser at alioth-lists.debian.net>; >>> nut-upsdev <nut-upsdev at alioth-lists.debian.net>; Dimitris Economou < >>> dimitris.s.economou at gmail.com> >>> *Subject:* Re: [Nut-upsuser] Upssched 100% CPU after updating Debian 12 >>> >>> FWIW, cross-posted the issue to NUT GitHub tracker: >>> https://github.com/networkupstools/nut/issues/1964 >>> >>> Jim >>> >>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://alioth-lists.debian.net/pipermail/nut-upsdev/attachments/20230613/f741f2ab/attachment.htm>
Kari LempiƤinen
2023-Jun-13 16:38 UTC
[Nut-upsuser] Upssched 100% CPU after updating Debian 12
Hi, Great work Jim! I?m glad you could reproduce the problem and found a potential culprit. Just for my own interest I restored upsshed from my backups (version 2.7.4-13) and it seems to running ok, so no big runtime changes regarding that with Debian 12. It is not hogging CPU. From the daemon log the heartbeat seems to be working ok. Only difference between the old logs (pre Debian 12 update) is the there is this line (Network UPS Tools upsmon 2.8.0) every five minutes. Jun 13 19:17:08 fricka upssched[1896873]: Timer daemon started Jun 13 19:17:08 fricka upssched[1896873]: New timer: heartbeat-failure-timer (660 seconds) Jun 13 19:17:08 fricka nut-monitor[1896870]: Network UPS Tools upsmon 2.8.0 Jun 13 19:22:08 fricka nut-monitor[1899911]: Network UPS Tools upsmon 2.8.0 Jun 13 19:27:08 fricka upssched[1896873]: Cancelling timer: heartbeat-failure-timer Jun 13 19:27:08 fricka upssched[1896873]: New timer: heartbeat-failure-timer (660 seconds) Jun 13 19:27:08 fricka nut-monitor[1903379]: Network UPS Tools upsmon 2.8.0 Jun 13 19:32:13 fricka nut-monitor[1906677]: Network UPS Tools upsmon 2.8.0 That line wasn?t there previously. I will leave the old version in place until nut gets fixed and it?s been provided by Debian. This actually reminded me of a similar problem I had myself a loooong time ago. I was writing a large software distribution program for a Windows platform. The connection between server and client were done by TCP sockets. In my server code there was place where I read reply from the client and the read (from socket) function didn?t return error code, but the data length was 0. In the documentation, at least at the time, it wasn?t specified directly that it was an error situation. My server code looped in a thread hogging CPU. When I modified the code to treat 0 bytes read as an error, everything worked fine. Best regards, Kari From: Jim Klimov <jimklimov+nut at gmail.com> Date: Tuesday, 13. June 2023 at 18.36 To: Kari Lempi?inen <kari.lempiainen at summerday.net> Cc: Arnaud Quette via Nut-upsuser <nut-upsuser at alioth-lists.debian.net>, nut-upsdev <nut-upsdev at alioth-lists.debian.net>, Dimitris Economou <dimitris.s.economou at gmail.com> Subject: Re: [Nut-upsuser] Upssched 100% CPU after updating Debian 12 So... determining that FD is to be reaped proved hard. Internet lore suggests fcntl() and poll() on the FD, but it just seems valid to them. The errno is also usually not raised (once I saw a "111: Connection refused" though). So the best dumb idea so far is to bail out if we spent the whole loop (128 attempts) and only got zero-sized read replies and no errors. https://github.com/networkupstools/nut/pull/1965 Jim On Tue, Jun 13, 2023 at 4:06?PM Jim Klimov <jimklimov+nut at gmail.com<mailto:jimklimov%2Bnut at gmail.com>> wrote: After launching the command several times, with debug (posted by new code in a new branch for the investigation) confirming that the same daemon handles operations from the new client instances, its strace now has numerous FDs to report after select() - so I guess it is a problem of detecting an exit of the counterpart. 0.000000 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000034 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000041 [D1] parse_at: processing CANCEL-TIMER 0.000054 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000057 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000060 [D1] parse_at: processing START-TIMER 0.000151 [D1] Keeping stderr open due to debug verbosity 8 0.000195 Timer daemon started 0.000204 [D2] Timer daemon waiting for connections on pipefd 10 0.250325 [D3] new connection on fd 7 0.250377 New timer: heartbeat-failure-timer (660 seconds) 0.250423 [D1] Exiting upssched (CLI process) 0.000000 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000039 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000047 [D1] parse_at: processing CANCEL-TIMER 14.745773 [D3] new connection on fd 8 14.745829 Cancelling timer: heartbeat-failure-timer 0.000153 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000159 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000163 [D1] parse_at: processing START-TIMER 14.745890 [D3] new connection on fd 9 14.745969 New timer: heartbeat-failure-timer (660 seconds) 0.000293 [D1] Exiting upssched (CLI process) 0.000001 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000042 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000050 [D1] parse_at: processing CANCEL-TIMER 47.674032 [D3] new connection on fd 11 47.674160 Cancelling timer: heartbeat-failure-timer 0.000266 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000277 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000284 [D1] parse_at: processing START-TIMER 47.674235 [D3] new connection on fd 12 47.674388 New timer: heartbeat-failure-timer (660 seconds) 0.000487 [D1] Exiting upssched (CLI process) Note that `select` only lists remnants of "new connection" but not the FD#10 (pipefd): select(13, [7 8 9 10 11 12], NULL, NULL, {tv_sec=1, tv_usec=0}) = 5 (in [7 8 9 11 12], left {tv_sec=0, tv_usec=999997}) Jim On Tue, Jun 13, 2023 at 3:41?PM Jim Klimov <jimklimov+nut at gmail.com<mailto:jimklimov%2Bnut at gmail.com>> wrote: So, got some good news: I hear(*) I managed to reproduce the problem with current NUT master and an adapted copy of your posted configs and script :D Experimental debugging now sounds possible. (*) PC under the desk wails with all its cooling fans as soon as I started the client which spawned a daemon and itself had exited: $ UPSNAME=heartbeat at localhost NOTIFYTYPE=ONBATT NUT_STATEPATH=/dev/shm NUT_CONFPATH=`pwd`/tests/NIT/tmp/etc ./clients/upssched -DDDDDDDD 0.000000 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000053 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000066 [D1] parse_at: processing CANCEL-TIMER 0.000089 [D2] parse_at: is 'heartbeat at localhost' in AT command the 'heartbeat at localhost' we were launched to process? 0.000096 [D1] parse_at: 'heartbeat at localhost' in AT command matched the 'heartbeat at localhost' UPSNAME we were launched to process 0.000104 [D1] parse_at: processing START-TIMER $ Jim On Tue, Jun 13, 2023 at 1:58?PM Kari Lempi?inen <kari.lempiainen at summerday.net<mailto:kari.lempiainen at summerday.net>> wrote: Hi, Thanks Jim! Here is more system information from the commands you mentioned. Kari root at fricka:~# lsof -p 1716171 lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1002/gvfs Output information may be incomplete. COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME upssched 1716171 root cwd DIR 8,2 4096 2 / upssched 1716171 root rtd DIR 8,2 4096 2 / upssched 1716171 root txt REG 8,2 39240 21762148 /usr/sbin/upssched upssched 1716171 root mem REG 8,2 1922136 21764230 /usr/lib/x86_64-linux-gnu/libc.so.6 upssched 1716171 root mem REG 8,2 210968 21764476 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 upssched 1716171 root 0u CHR 1,3 0t0 4 /dev/null upssched 1716171 root 1u CHR 1,3 0t0 4 /dev/null upssched 1716171 root 2u CHR 1,3 0t0 4 /dev/null upssched 1716171 root 3u unix 0x00000000eeb52825 0t0 16427010 type=DGRAM (CONNECTED) upssched 1716171 root 4r REG 8,2 545 39847858 /etc/nut/upssched.conf upssched 1716171 root 5u unix 0x00000000a035b466 0t0 16426351 type=STREAM (UNCONNECTED) upssched 1716171 root 6u unix 0x0000000050820b2b 0t0 16426352 type=STREAM (UNCONNECTED) upssched 1716171 root 7u unix 0x00000000447aac0e 0t0 16424724 /var/run/nut/upssched.pipe type=STREAM (CONNECTED) upssched 1716171 root 10u unix 0x00000000d7bd8088 0t0 16424722 /var/run/nut/upssched.pipe type=STREAM (LISTEN) root at fricka:~# ps -ef|grep ups root 1425 1 0 Jun11 ? 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups root 1156722 1 0 00:00 ? 00:00:00 /usr/sbin/cupsd -l root 1156723 1 0 00:00 ? 00:00:00 /usr/sbin/cups-browsed lp 1156748 1156722 0 00:00 ? 00:00:00 /usr/lib/cups/notifier/dbus dbus:// nut 1713117 1 0 14:30 ? 00:00:00 /lib/nut/dummy-ups -a heartbeat nut 1713154 1 0 14:30 ? 00:00:00 /lib/nut/usbhid-ups -a eaton5s nut 1713155 1 0 14:30 ? 00:00:00 /lib/nut/upsd -F root 1713156 1 0 14:30 ? 00:00:00 /lib/nut/upsmon -F root 1713157 1713156 0 14:30 ? 00:00:00 /lib/nut/upsmon -F root 1716171 1 99 14:35 ? 00:05:10 /sbin/upssched UPS heartbeat at localhost: On battery. root 1719510 1550505 0 14:40 pts/2 00:00:00 grep ups root at fricka:~# fuser /var/run/nut/upssched.pipe /run/nut/upssched.pipe: 1716171 root at fricka:/proc/1716171/fd# ls -la total 0 dr-x------ 2 root root 0 Jun 13 14:36 . dr-xr-xr-x 9 root root 0 Jun 13 14:35 .. lrwx------ 1 root root 64 Jun 13 14:36 0 -> /dev/null lrwx------ 1 root root 64 Jun 13 14:36 1 -> /dev/null lrwx------ 1 root root 64 Jun 13 14:36 10 -> 'socket:[16424722]' lrwx------ 1 root root 64 Jun 13 14:36 2 -> /dev/null lrwx------ 1 root root 64 Jun 13 14:36 3 -> 'socket:[16427010]' lr-x------ 1 root root 64 Jun 13 14:36 4 -> /etc/nut/upssched.conf lrwx------ 1 root root 64 Jun 13 14:36 5 -> 'socket:[16426351]' lrwx------ 1 root root 64 Jun 13 14:36 6 -> 'socket:[16426352]' lrwx------ 1 root root 64 Jun 13 14:36 7 -> 'socket:[16424724]' ________________________________ From: Jim Klimov <jimklimov+nut at gmail.com<mailto:jimklimov%2Bnut at gmail.com>> Sent: Tuesday, June 13, 2023 14:17 To: Kari Lempi?inen <kari.lempiainen at summerday.net<mailto:kari.lempiainen at summerday.net>> Cc: Arnaud Quette via Nut-upsuser <nut-upsuser at alioth-lists.debian.net<mailto:nut-upsuser at alioth-lists.debian.net>>; nut-upsdev <nut-upsdev at alioth-lists.debian.net<mailto:nut-upsdev at alioth-lists.debian.net>>; Dimitris Economou <dimitris.s.economou at gmail.com<mailto:dimitris.s.economou at gmail.com>> Subject: Re: [Nut-upsuser] Upssched 100% CPU after updating Debian 12 FWIW, cross-posted the issue to NUT GitHub tracker: https://github.com/networkupstools/nut/issues/1964 Jim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://alioth-lists.debian.net/pipermail/nut-upsuser/attachments/20230613/11fc0b57/attachment-0001.htm>
Sam Varshavchik
2023-Jun-14 10:34 UTC
[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12
Jim Klimov via Nut-upsuser writes:> So... determining that FD is to be reaped proved hard. Internet lore suggests > fcntl() and poll() on the FD, but it just seems valid to them. The errno is > also usually not raised (once I saw a "111: Connection refused" though). > So the best dumb idea so far is to bail out if we spent the whole loop (128 > attempts) and only got zero-sized read replies and no errors.It's been my experience, that the path of least resistance is: 1) The sockets are sets to be non-blocking 2) poll() 3) If poll() says the socket is readable, and read() returns <= 0, then the socket is dead. You get no error, and read() returns 0, if the socket had an orderly shutdown. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 833 bytes Desc: not available URL: <http://alioth-lists.debian.net/pipermail/nut-upsdev/attachments/20230614/7eb96efb/attachment.sig>