thr3ads.net - Nut upsuser - [Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12 [Jun 2023]

If this information is useful, please help other people find it:
Share via:

Jim Klimov

2023-Jun-13 08:41 UTC

[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12

Thanks for the interesting puzzle to crunch!

Looking at these few bread-crumbs, I wager an educated guess that this
loops in `sendcmd()` (where CLI child processes talk to a daemonized copy
which tracks the timers for events), around here:
https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L583-L626
for the daemon and here for the child:
https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L762-L787

In both cases we run an infinite loop with `select()` to wait up to 1
second and `read()` from the pipe. The common expectation is that if there
is nothing to read, the program blocks in `select()` waiting for a reply
from the kernel, and only returns after the specified timeout - to the
point that some `sleep()` with sub-second precision implementations do just
that on systems that lack `usleep()`.

With `select()` in your capture posting an array of FDs `[7 10]` I'd say it
is the parent loop, and one of those is the `pipefd` and another is the FD
for socket connection to talk to a particular child on the other side of
the pipe.

Here I must guess that either the `pselect6()` implementation behaves
differently which would be surprising after decades worth of expected
behavior, or that it for some reason decides that the FD #7 (whoever that
is) raised the hand but was then silent (0 bytes read).

The next clue is 128 reads, which leads to
https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L485
`sock_read()` method which indeed does up to `US_MAX_READ==128` attempts to
read one byte from a connection (so from the CLI/child counterpart) at a
time - so that is who FD#7 gotta be.

So it is just at this point that I get stuck explaining what happens and
why :) Possibly there was a child or CLI counterpart process which died and
somehow the connection did not get closed (which would return a `read()`
error and evict it from the conn list), e.g. frozen/zombified, or something
else connected to the pipe (antivirus heh?) and loiters there, or something
else I could not think of quickly...

The next clue could be to look at who is connected to the pipe file, maybe
with `fuser`, or look in `ps -ef` if there are other `upssched` processes
present, or look in `/proc/PIDNUM/fd/` of the daemonized and looping
`upssched`...

If anyone decides to track through the code, the potential processing loop
goes
    checkconf -> conf_arg -> parse_at -> sendcmd -> daemon if needed
->
start_daemon -> conn_add(pipefd) or sock_read(conn)
with short-lived child processes (command-line launched from `upsmon`)
parsing the configuration to pick the part relevant to them per envvars,
and talk to the daemon (fork one if needed) to start/cancel the timers.

Jim

On Tue, Jun 13, 2023 at 8:33?AM Kari Lempi?inen <
kari.lempiainen at summerday.net> wrote:
> Hi,
>
> I ran the strace command while upssched was 100% CPU hungry. This is what
> I got:
>
> 1686633611.702798 read(7, "", 1)        = 0 <0.000004>
> 1686633611.702816 read(7, "", 1)        = 0 <0.000004>
> 1686633611.702834 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0},
> NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999998800}) <0.000006>
> 1686633611.702862 read(7, "", 1)        = 0 <0.000004>
> 1686633611.702880 read(7, "", 1)        = 0 <0.000004>
> .
> .
> .
> 1686633611.705206 read(7, "", 1)        = 0 <0.000004>
> 1686633611.705224 read(7, "", 1)        = 0 <0.000004>
> 1686633611.705242 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0},
> NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999999204}) <0.000005>
> 1686633611.705268 read(7, "", 1)        = 0 <0.000004>
> 1686633611.705288 read(7, "", 1)        = 0 <0.000302>
>
> there are 128 "read(7" lines between every pselect6 lines (I did
count the
> read lines for couple of blocks, 128 in each of them). Nothing else in the
> strace log, just read and pselect lines for the short time I let strace to
> run.
>
> Any help from this?
>
> Best regards,
> Kari
> ------------------------------
>
> *<more history snapped>*
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://alioth-lists.debian.net/pipermail/nut-upsdev/attachments/20230613/1c198f3c/attachment.htm>

Jim Klimov

2023-Jun-13 09:53 UTC

head link

[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12

FWIW, I wonder if this is a fallout of PR #1274 :
https://github.com/networkupstools/nut/pull/1274/files
and specifically
https://github.com/networkupstools/nut/commit/550064930e369fb3a322c3b28ffff8b4acf53532
which added a `sock_read()` loop continuation effectively if `pconf_char()`
returned 0.

Just grasping at straws here (since it is a change a couple of months
before the release was cut), because if ALL of those 128 reads (skipped
above) did return 0, we should not have got to this line.

The comment added here "*O_NDELAY with zero bytes means nothing to read but
since read() follows a successful select() with ready file descriptor, ret
shouldn't be 0*" - that might be misleading too, since per internet:
> The result of O_nonblock and O_ndelay is to make I/O into non-blockingmode (non-blocking), return immediately without reading the data or the
write buffer full, without blocking the wait.>
> The difference is that, in a read operation, if the data is not read,O_ndelay causes the I/O function to return 0 immediately, but this gives
rise to a problem, because the read to the end of the file (EOF) is
returned 0, so it is impossible to distinguish between the cases.
Therefore, O_nonblock is generated, it will return 1 when the data is not
read, and set errno to Eagain.>
> O_ndelay was introduced in earlier versions of System V, and when coding,it is recommended that POSIX-defined o_nonblock,o_nonblock be set at open
and Fcntl.

Jim




On Tue, Jun 13, 2023 at 10:41?AM Jim Klimov <jimklimov+nut at gmail.com>
wrote:
> Thanks for the interesting puzzle to crunch!
>
> Looking at these few bread-crumbs, I wager an educated guess that this
> loops in `sendcmd()` (where CLI child processes talk to a daemonized copy
> which tracks the timers for events), around here:
>
https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L583-L626
> for the daemon and here for the child:
>
https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L762-L787
>
> In both cases we run an infinite loop with `select()` to wait up to 1
> second and `read()` from the pipe. The common expectation is that if there
> is nothing to read, the program blocks in `select()` waiting for a reply
> from the kernel, and only returns after the specified timeout - to the
> point that some `sleep()` with sub-second precision implementations do just
> that on systems that lack `usleep()`.
>
> With `select()` in your capture posting an array of FDs `[7 10]` I'd
say
> it is the parent loop, and one of those is the `pipefd` and another is the
> FD for socket connection to talk to a particular child on the other side of
> the pipe.
>
> Here I must guess that either the `pselect6()` implementation behaves
> differently which would be surprising after decades worth of expected
> behavior, or that it for some reason decides that the FD #7 (whoever that
> is) raised the hand but was then silent (0 bytes read).
>
> The next clue is 128 reads, which leads to
> https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L485
> `sock_read()` method which indeed does up to `US_MAX_READ==128` attempts to
> read one byte from a connection (so from the CLI/child counterpart) at a
> time - so that is who FD#7 gotta be.
>
> So it is just at this point that I get stuck explaining what happens and
> why :) Possibly there was a child or CLI counterpart process which died and
> somehow the connection did not get closed (which would return a `read()`
> error and evict it from the conn list), e.g. frozen/zombified, or something
> else connected to the pipe (antivirus heh?) and loiters there, or something
> else I could not think of quickly...
>
> The next clue could be to look at who is connected to the pipe file, maybe
> with `fuser`, or look in `ps -ef` if there are other `upssched` processes
> present, or look in `/proc/PIDNUM/fd/` of the daemonized and looping
> `upssched`...
>
> If anyone decides to track through the code, the potential processing loop
> goes
>     checkconf -> conf_arg -> parse_at -> sendcmd -> daemon if
needed ->
> start_daemon -> conn_add(pipefd) or sock_read(conn)
> with short-lived child processes (command-line launched from `upsmon`)
> parsing the configuration to pick the part relevant to them per envvars,
> and talk to the daemon (fork one if needed) to start/cancel the timers.
>
> Jim
>
> On Tue, Jun 13, 2023 at 8:33?AM Kari Lempi?inen <
> kari.lempiainen at summerday.net> wrote:
>
>> Hi,
>>
>> I ran the strace command while upssched was 100% CPU hungry. This is
what
>> I got:
>>
>> 1686633611.702798 read(7, "", 1)        = 0 <0.000004>
>> 1686633611.702816 read(7, "", 1)        = 0 <0.000004>
>> 1686633611.702834 pselect6(11, [7 10], NULL, NULL, {tv_sec=1,
tv_nsec=0},
>> NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999998800}) <0.000006>
>> 1686633611.702862 read(7, "", 1)        = 0 <0.000004>
>> 1686633611.702880 read(7, "", 1)        = 0 <0.000004>
>> .
>> .
>> .
>> 1686633611.705206 read(7, "", 1)        = 0 <0.000004>
>> 1686633611.705224 read(7, "", 1)        = 0 <0.000004>
>> 1686633611.705242 pselect6(11, [7 10], NULL, NULL, {tv_sec=1,
tv_nsec=0},
>> NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999999204}) <0.000005>
>> 1686633611.705268 read(7, "", 1)        = 0 <0.000004>
>> 1686633611.705288 read(7, "", 1)        = 0 <0.000302>
>>
>> there are 128 "read(7" lines between every pselect6 lines (I
did count
>> the read lines for couple of blocks, 128 in each of them). Nothing else
in
>> the strace log, just read and pselect lines for the short time I let
strace
>> to run.
>>
>> Any help from this?
>>
>> Best regards,
>> Kari
>> ------------------------------
>>
>> *<more history snapped>*
>>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://alioth-lists.debian.net/pipermail/nut-upsdev/attachments/20230613/77dc9849/attachment-0001.htm>

Possibly Parallel Threads

Search for more apparently analagous threads

Nut upsuser - Jun 2023 - Upssched 100% CPU after updating Debian 12

[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12

[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12

Possibly Parallel Threads