thr3ads.net - Nut upsdev - [Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12 [Jun 2023]

If this information is useful, please help other people find it:
Share via:

Kari Lempiäinen

2023-Jun-13 06:33 UTC

[Nut-upsuser] Upssched 100% CPU after updating Debian 12

Hi,

I ran the strace command while upssched was 100% CPU hungry. This is what I got:

1686633611.702798 read(7, "", 1)        = 0 <0.000004>
1686633611.702816 read(7, "", 1)        = 0 <0.000004>
1686633611.702834 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL)
= 1 (in [7], left {tv_sec=0, tv_nsec=999998800}) <0.000006>
1686633611.702862 read(7, "", 1)        = 0 <0.000004>
1686633611.702880 read(7, "", 1)        = 0 <0.000004>
.
.
.
1686633611.705206 read(7, "", 1)        = 0 <0.000004>
1686633611.705224 read(7, "", 1)        = 0 <0.000004>
1686633611.705242 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0}, NULL)
= 1 (in [7], left {tv_sec=0, tv_nsec=999999204}) <0.000005>
1686633611.705268 read(7, "", 1)        = 0 <0.000004>
1686633611.705288 read(7, "", 1)        = 0 <0.000302>

there are 128 "read(7" lines between every pselect6 lines (I did count
the read lines for couple of blocks, 128 in each of them). Nothing else in the
strace log, just read and pselect lines for the short time I let strace to run.

Any help from this?

Best regards,
Kari
________________________________
From: Jim Klimov <jimklimov+nut at gmail.com>
Sent: Tuesday, June 13, 2023 00:35
To: Kari Lempi?inen <kari.lempiainen at summerday.net>
Cc: Arnaud Quette via Nut-upsuser <nut-upsuser at alioth-lists.debian.net>
Subject: Re: [Nut-upsuser] Upssched 100% CPU after updating Debian 12

Hello,

  Thank you for the report and I don't recall any similar issues.

  Unfortunately, upssched did not have any "native" enablement of
debug verbosity in 2.8.0 and older releases (some added with
https://github.com/networkupstools/nut/pull/1864 recently), so maybe the best
way to get more info at the moment would be to watch syscalls made when it
begins to hog CPU and make educated guesses from there, e.g.:

    # strace -f -ttt -T -p "`ps -ef | grep -w upssched | grep -v grep | awk
'{print $2}'`"

  I suppose when hogging the CPU it has some reason to quickly interrupt its
processing loop - perhaps a lack of some file (socket?) it waits on, etc.

Hope this helps,
Jim Klimov


On Mon, Jun 12, 2023, 14:59 Kari Lempi?inen <kari.lempiainen at
summerday.net<mailto:kari.lempiainen at summerday.net>> wrote:

Hi,



I have been running nut successfully for a long time with my Debian 11 server. I
upgraded my server to Debian 12 today, which upgraded nut also from 2.7.4-13 to
2.8.0-7. I noticed that after upgrade there was a upssched process running and
taking 100% cpu time.



I checked if there were any changes to configuration file formats with nut
upgrade and only differences I noticed were a terminology change from
master/slave to primary/secondary in upsd.users and upsmon.conf, and dummy mode
change for the dummy-ups. I didn?t expect these being the culprit, but I made
the changes to my configuration. No change, after five minutes of starting nut,
upssched will be running with 100% cpu usage.



Any ideas what could be causing this? Here is my configuration and in the end
are some log snippets:

---------------------------------------------------

heartbeat.conf



# heartbeat.conf -- 10 minute heartbeat

ups.status: OL

TIMER 300

ups.status: OB

TIMER 300

---------------------------------------------------

hosts.conf



MONITOR eaton5s at localhost "Eaton 5S 700"

---------------------------------------------------

nut.conf



MODE=netserver

---------------------------------------------------

ups.conf



maxretry = 3

[eaton5s]

        driver = usbhid-ups

        port = auto

        desc = "Eaton 5S 700i UPS"

        vendorid = "0463"

        productid = "FFFF"

        pollinterval = 15

[heartbeat]

        driver = dummy-ups

        mode=dummy-loop

        port = heartbeat.conf

        desc = "Watch over NUT"

---------------------------------------------------

upsd.conf



LISTEN 0.0.0.0 3493

---------------------------------------------------

upsmon.conf



RUN_AS_USER root

MONITOR eaton5s at localhost 1 upsmaster xxx primary

MONITOR heartbeat at localhost 0 upsmaster xxx primary

MINSUPPLIES 1

NOTIFYCMD /sbin/upssched

SHUTDOWNCMD "/sbin/shutdown -h +0"

POLLFREQ 5

POLLFREQALERT 5

HOSTSYNC 15

DEADTIME 15

POWERDOWNFLAG /etc/killpower

RBWARNTIME 43200

NOCOMMWARNTIME 300

FINALDELAY 5

NOTIFYMSG ONLINE "UPS %s: On line power."

NOTIFYMSG ONBATT "UPS %s: On battery."

NOTIFYMSG LOWBATT "UPS %s: Battery is low."

NOTIFYMSG REPLBATT "UPS %s: Battery needs to be replaced."

NOTIFYMSG FSD "UPS %s: Forced shutdown in progress."

NOTIFYMSG SHUTDOWN "Auto logout and shutdown proceeding."

NOTIFYMSG COMMOK "UPS %s: Communications (re-)established."

NOTIFYMSG COMMBAD "UPS %s: Communications lost."

NOTIFYMSG NOCOMM "UPS %s: Not available."

NOTIFYMSG NOPARENT "upsmon parent dead, shutdown impossible."

NOTIFYFLAG ONLINE EXEC

NOTIFYFLAG ONBATT EXEC

NOTIFYFLAG LOWBATT SYSLOG+WALL

NOTIFYFLAG REPLBATT SYSLOG+WALL

NOTIFYFLAG FSD SYSLOG+WALL

NOTIFYFLAG SHUTDOWN SYSLOG+WALL

NOTIFYFLAG COMMOK SYSLOG+WALL

NOTIFYFLAG COMMBAD SYSLOG+WALL

NOTIFYFLAG NOCOMM SYSLOG+WALL

NOTIFYFLAG NOPARENT SYSLOG+WALL

---------------------------------------------------

upssched.conf



CMDSCRIPT /etc/nut/upssched-cmd

PIPEFN /var/run/nut/upssched.pipe

LOCKFN /var/run/nut/upssched.lock



AT ONBATT eaton5s at localhost EXECUTE onbatt

AT ONBATT eaton5s at localhost START-TIMER beeperoff 15

AT ONLINE eaton5s at localhost EXECUTE online

AT ONLINE eaton5s at localhost CANCEL-TIMER beeperoff

AT ONBATT eaton5s at localhost START-TIMER shutdown-ws2019 300

AT ONLINE eaton5s at localhost CANCEL-TIMER shutdown-ws2019

AT ONBATT heartbeat at localhost CANCEL-TIMER heartbeat-failure-timer

AT ONBATT heartbeat at localhost START-TIMER heartbeat-failure-timer 660

---------------------------------------------------

upsd.users



[upsmaster]

        password = xxx

        upsmon primary

        actions = set

        actions = fsd

        instcmds = all



[upsslave]

        password = xxx

        upsmon secondary

---------------------------------------------------

upssched-cmd



#! /bin/bash



UPS="eaton5s"

UPS_USERNAME="upsmaster"

UPS_PASSWORD="xxx"

EMAIL_TO=xxx at xxx.xxx<mailto:xxx at xxx.xxx>

EMAIL_FROM=xxx at xxx.xxx<mailto:xxx at xxx.xxx>

STATUS=`upsc $UPS ups.status 2>&1 | grep -v '^Init SSL'`

CHARGE=`upsc $UPS battery.charge 2>&1 | grep -v '^Init SSL'`

CHMSG="[$STATUS]:$CHARGE%"

MSG="dummy message - $1"



case $1 in

        online) MSG="$UPS, $CHMSG - On line power - power supply had been
restored." ;;

        onbatt) MSG="$UPS, $CHMSG - On battery - power failure - save your
work!" ;;

        lowbatt) MSG="$UPS, $CHMSG - Battery is low - shutdown now!"
;;

        beeperoff)

                logger -i -t upsshed-cmd "Turning off Beeper"

                STATUS=`upsc ${UPS} ups.beeper.status 2>&1 | grep -v
'^Init SSL'`

                if [[ "${STATUS}" == "enabled" ]] ; then

                        upscmd -u ${UPS_USERNAME} -p ${UPS_PASSWORD} ${UPS}
beeper.toggle

                fi

                ;;

        shutdown-ws2019)

                MSG="$UPS, $CHMSG - Stopping WS2019"

                logger -i -t upsshed-cmd "Stopping WS2019"

                setsid /usr/local/bin/shutdown-ws2019 &

                ;;

        upsgone)

                logger -i -t upssched-cmd "The UPS has been gone for
awhile"

                ;;

        heartbeat-failure-timer)

                MSG="NUT heart beat fails. $CHMSG"

                # Email

                MSG1="Hello, upssched-cmd reports NUT heartbeat has
failed."

                MSG2="Current status: $CHMSG \n\n$0 $1"

                MSG3="\n\n$( ps -elf | grep -E 'ups[dms]|nut'
)"

                echo -e "$MSG1 $MSG2 $MSG3" | /usr/bin/mail -r
"$EMAIL_FROM" \

                -s "NUT heart beat fails. Currently $CHMSG"
"$EMAIL_TO" ;;



        *)

                logger -i -t upssched-cmd "Unrecognized command: $1"

                ;;

esac



logger -i -t upssched-cmd $MSG

wall "$MSG"

---------------------------------------------------





---------------------------------------------------

This is normal ups activity in daemon log from the previous version:



Jun 10 23:10:50 fricka upssched[16007]: Cancelling timer:
heartbeat-failure-timer

Jun 10 23:10:50 fricka upssched[16007]: New timer: heartbeat-failure-timer (660
seconds)

Jun 10 23:20:50 fricka upssched[16007]: Cancelling timer:
heartbeat-failure-timer

Jun 10 23:20:50 fricka upssched[16007]: New timer: heartbeat-failure-timer (660
seconds)

Jun 10 23:30:55 fricka upssched[16007]: Cancelling timer:
heartbeat-failure-timer

Jun 10 23:30:55 fricka upssched[16007]: New timer: heartbeat-failure-timer (660
seconds)

Jun 10 23:40:55 fricka upssched[16007]: Cancelling timer:
heartbeat-failure-timer

Jun 10 23:40:55 fricka upssched[16007]: New timer: heartbeat-failure-timer (660
seconds)

Jun 10 23:50:55 fricka upssched[16007]: Cancelling timer:
heartbeat-failure-timer

Jun 10 23:50:55 fricka upssched[16007]: New timer: heartbeat-failure-timer (660
seconds)

---------------------------------------------------

This is daemon log from the new version, first two lines are the last lines of
the startup messages. One same upssched process (124835) running with 100% cpu
from 21:07:



Jun 11 21:02:40 fricka dummy-ups[121803]: sock_connect: enabling asynchronous
mode (auto)

Jun 11 21:02:41 fricka usbhid-ups[121820]: sock_connect: enabling asynchronous
mode (auto)

Jun 11 21:07:40 fricka upssched[124835]: Timer daemon started

Jun 11 21:07:40 fricka upssched[124835]: New timer: heartbeat-failure-timer (660
seconds)

Jun 11 21:07:40 fricka nut-monitor[124832]: Network UPS Tools upsmon 2.8.0

Jun 11 21:12:40 fricka nut-monitor[128005]: Network UPS Tools upsmon 2.8.0

Jun 11 21:17:40 fricka upssched[124835]: Cancelling timer:
heartbeat-failure-timer

Jun 11 21:17:40 fricka upssched[124835]: New timer: heartbeat-failure-timer (660
seconds)

Jun 11 21:17:40 fricka nut-monitor[131233]: Network UPS Tools upsmon 2.8.0

Jun 11 21:22:45 fricka nut-monitor[134330]: Network UPS Tools upsmon 2.8.0

Jun 11 21:27:45 fricka upssched[124835]: Cancelling timer:
heartbeat-failure-timer

Jun 11 21:27:45 fricka upssched[124835]: New timer: heartbeat-failure-timer (660
seconds)

Jun 11 21:27:45 fricka nut-monitor[137838]: Network UPS Tools upsmon 2.8.0





---------------------------------------------------



_______________________________________________
Nut-upsuser mailing list
Nut-upsuser at alioth-lists.debian.net<mailto:Nut-upsuser at
alioth-lists.debian.net>
https://alioth-lists.debian.net/cgi-bin/mailman/listinfo/nut-upsuser
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://alioth-lists.debian.net/pipermail/nut-upsuser/attachments/20230613/f8bc3f9f/attachment-0001.htm>

Jim Klimov

2023-Jun-13 08:41 UTC

head link

[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12

Thanks for the interesting puzzle to crunch!

Looking at these few bread-crumbs, I wager an educated guess that this
loops in `sendcmd()` (where CLI child processes talk to a daemonized copy
which tracks the timers for events), around here:
https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L583-L626
for the daemon and here for the child:
https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L762-L787

In both cases we run an infinite loop with `select()` to wait up to 1
second and `read()` from the pipe. The common expectation is that if there
is nothing to read, the program blocks in `select()` waiting for a reply
from the kernel, and only returns after the specified timeout - to the
point that some `sleep()` with sub-second precision implementations do just
that on systems that lack `usleep()`.

With `select()` in your capture posting an array of FDs `[7 10]` I'd say it
is the parent loop, and one of those is the `pipefd` and another is the FD
for socket connection to talk to a particular child on the other side of
the pipe.

Here I must guess that either the `pselect6()` implementation behaves
differently which would be surprising after decades worth of expected
behavior, or that it for some reason decides that the FD #7 (whoever that
is) raised the hand but was then silent (0 bytes read).

The next clue is 128 reads, which leads to
https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L485
`sock_read()` method which indeed does up to `US_MAX_READ==128` attempts to
read one byte from a connection (so from the CLI/child counterpart) at a
time - so that is who FD#7 gotta be.

So it is just at this point that I get stuck explaining what happens and
why :) Possibly there was a child or CLI counterpart process which died and
somehow the connection did not get closed (which would return a `read()`
error and evict it from the conn list), e.g. frozen/zombified, or something
else connected to the pipe (antivirus heh?) and loiters there, or something
else I could not think of quickly...

The next clue could be to look at who is connected to the pipe file, maybe
with `fuser`, or look in `ps -ef` if there are other `upssched` processes
present, or look in `/proc/PIDNUM/fd/` of the daemonized and looping
`upssched`...

If anyone decides to track through the code, the potential processing loop
goes
    checkconf -> conf_arg -> parse_at -> sendcmd -> daemon if needed
->
start_daemon -> conn_add(pipefd) or sock_read(conn)
with short-lived child processes (command-line launched from `upsmon`)
parsing the configuration to pick the part relevant to them per envvars,
and talk to the daemon (fork one if needed) to start/cancel the timers.

Jim

On Tue, Jun 13, 2023 at 8:33?AM Kari Lempi?inen <
kari.lempiainen at summerday.net> wrote:
> Hi,
>
> I ran the strace command while upssched was 100% CPU hungry. This is what
> I got:
>
> 1686633611.702798 read(7, "", 1)        = 0 <0.000004>
> 1686633611.702816 read(7, "", 1)        = 0 <0.000004>
> 1686633611.702834 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0},
> NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999998800}) <0.000006>
> 1686633611.702862 read(7, "", 1)        = 0 <0.000004>
> 1686633611.702880 read(7, "", 1)        = 0 <0.000004>
> .
> .
> .
> 1686633611.705206 read(7, "", 1)        = 0 <0.000004>
> 1686633611.705224 read(7, "", 1)        = 0 <0.000004>
> 1686633611.705242 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0},
> NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999999204}) <0.000005>
> 1686633611.705268 read(7, "", 1)        = 0 <0.000004>
> 1686633611.705288 read(7, "", 1)        = 0 <0.000302>
>
> there are 128 "read(7" lines between every pselect6 lines (I did
count the
> read lines for couple of blocks, 128 in each of them). Nothing else in the
> strace log, just read and pselect lines for the short time I let strace to
> run.
>
> Any help from this?
>
> Best regards,
> Kari
> ------------------------------
>
> *<more history snapped>*
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://alioth-lists.debian.net/pipermail/nut-upsdev/attachments/20230613/1c198f3c/attachment.htm>

Apparently Analagous Threads

Search for more possibly parallel threads

Nut upsdev - Jun 2023 - Upssched 100% CPU after updating Debian 12

[Nut-upsuser] Upssched 100% CPU after updating Debian 12

[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12

Apparently Analagous Threads