Jim Klimov
2023-Jun-13 08:41 UTC
[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12
Thanks for the interesting puzzle to crunch! Looking at these few bread-crumbs, I wager an educated guess that this loops in `sendcmd()` (where CLI child processes talk to a daemonized copy which tracks the timers for events), around here: https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L583-L626 for the daemon and here for the child: https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L762-L787 In both cases we run an infinite loop with `select()` to wait up to 1 second and `read()` from the pipe. The common expectation is that if there is nothing to read, the program blocks in `select()` waiting for a reply from the kernel, and only returns after the specified timeout - to the point that some `sleep()` with sub-second precision implementations do just that on systems that lack `usleep()`. With `select()` in your capture posting an array of FDs `[7 10]` I'd say it is the parent loop, and one of those is the `pipefd` and another is the FD for socket connection to talk to a particular child on the other side of the pipe. Here I must guess that either the `pselect6()` implementation behaves differently which would be surprising after decades worth of expected behavior, or that it for some reason decides that the FD #7 (whoever that is) raised the hand but was then silent (0 bytes read). The next clue is 128 reads, which leads to https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L485 `sock_read()` method which indeed does up to `US_MAX_READ==128` attempts to read one byte from a connection (so from the CLI/child counterpart) at a time - so that is who FD#7 gotta be. So it is just at this point that I get stuck explaining what happens and why :) Possibly there was a child or CLI counterpart process which died and somehow the connection did not get closed (which would return a `read()` error and evict it from the conn list), e.g. frozen/zombified, or something else connected to the pipe (antivirus heh?) and loiters there, or something else I could not think of quickly... The next clue could be to look at who is connected to the pipe file, maybe with `fuser`, or look in `ps -ef` if there are other `upssched` processes present, or look in `/proc/PIDNUM/fd/` of the daemonized and looping `upssched`... If anyone decides to track through the code, the potential processing loop goes checkconf -> conf_arg -> parse_at -> sendcmd -> daemon if needed -> start_daemon -> conn_add(pipefd) or sock_read(conn) with short-lived child processes (command-line launched from `upsmon`) parsing the configuration to pick the part relevant to them per envvars, and talk to the daemon (fork one if needed) to start/cancel the timers. Jim On Tue, Jun 13, 2023 at 8:33?AM Kari Lempi?inen < kari.lempiainen at summerday.net> wrote:> Hi, > > I ran the strace command while upssched was 100% CPU hungry. This is what > I got: > > 1686633611.702798 read(7, "", 1) = 0 <0.000004> > 1686633611.702816 read(7, "", 1) = 0 <0.000004> > 1686633611.702834 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0}, > NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999998800}) <0.000006> > 1686633611.702862 read(7, "", 1) = 0 <0.000004> > 1686633611.702880 read(7, "", 1) = 0 <0.000004> > . > . > . > 1686633611.705206 read(7, "", 1) = 0 <0.000004> > 1686633611.705224 read(7, "", 1) = 0 <0.000004> > 1686633611.705242 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0}, > NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999999204}) <0.000005> > 1686633611.705268 read(7, "", 1) = 0 <0.000004> > 1686633611.705288 read(7, "", 1) = 0 <0.000302> > > there are 128 "read(7" lines between every pselect6 lines (I did count the > read lines for couple of blocks, 128 in each of them). Nothing else in the > strace log, just read and pselect lines for the short time I let strace to > run. > > Any help from this? > > Best regards, > Kari > ------------------------------ > > *<more history snapped>* >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://alioth-lists.debian.net/pipermail/nut-upsdev/attachments/20230613/1c198f3c/attachment.htm>
Jim Klimov
2023-Jun-13 09:53 UTC
[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12
FWIW, I wonder if this is a fallout of PR #1274 : https://github.com/networkupstools/nut/pull/1274/files and specifically https://github.com/networkupstools/nut/commit/550064930e369fb3a322c3b28ffff8b4acf53532 which added a `sock_read()` loop continuation effectively if `pconf_char()` returned 0. Just grasping at straws here (since it is a change a couple of months before the release was cut), because if ALL of those 128 reads (skipped above) did return 0, we should not have got to this line. The comment added here "*O_NDELAY with zero bytes means nothing to read but since read() follows a successful select() with ready file descriptor, ret shouldn't be 0*" - that might be misleading too, since per internet:> The result of O_nonblock and O_ndelay is to make I/O into non-blockingmode (non-blocking), return immediately without reading the data or the write buffer full, without blocking the wait.> > The difference is that, in a read operation, if the data is not read,O_ndelay causes the I/O function to return 0 immediately, but this gives rise to a problem, because the read to the end of the file (EOF) is returned 0, so it is impossible to distinguish between the cases. Therefore, O_nonblock is generated, it will return 1 when the data is not read, and set errno to Eagain.> > O_ndelay was introduced in earlier versions of System V, and when coding,it is recommended that POSIX-defined o_nonblock,o_nonblock be set at open and Fcntl. Jim On Tue, Jun 13, 2023 at 10:41?AM Jim Klimov <jimklimov+nut at gmail.com> wrote:> Thanks for the interesting puzzle to crunch! > > Looking at these few bread-crumbs, I wager an educated guess that this > loops in `sendcmd()` (where CLI child processes talk to a daemonized copy > which tracks the timers for events), around here: > https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L583-L626 > for the daemon and here for the child: > https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L762-L787 > > In both cases we run an infinite loop with `select()` to wait up to 1 > second and `read()` from the pipe. The common expectation is that if there > is nothing to read, the program blocks in `select()` waiting for a reply > from the kernel, and only returns after the specified timeout - to the > point that some `sleep()` with sub-second precision implementations do just > that on systems that lack `usleep()`. > > With `select()` in your capture posting an array of FDs `[7 10]` I'd say > it is the parent loop, and one of those is the `pipefd` and another is the > FD for socket connection to talk to a particular child on the other side of > the pipe. > > Here I must guess that either the `pselect6()` implementation behaves > differently which would be surprising after decades worth of expected > behavior, or that it for some reason decides that the FD #7 (whoever that > is) raised the hand but was then silent (0 bytes read). > > The next clue is 128 reads, which leads to > https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L485 > `sock_read()` method which indeed does up to `US_MAX_READ==128` attempts to > read one byte from a connection (so from the CLI/child counterpart) at a > time - so that is who FD#7 gotta be. > > So it is just at this point that I get stuck explaining what happens and > why :) Possibly there was a child or CLI counterpart process which died and > somehow the connection did not get closed (which would return a `read()` > error and evict it from the conn list), e.g. frozen/zombified, or something > else connected to the pipe (antivirus heh?) and loiters there, or something > else I could not think of quickly... > > The next clue could be to look at who is connected to the pipe file, maybe > with `fuser`, or look in `ps -ef` if there are other `upssched` processes > present, or look in `/proc/PIDNUM/fd/` of the daemonized and looping > `upssched`... > > If anyone decides to track through the code, the potential processing loop > goes > checkconf -> conf_arg -> parse_at -> sendcmd -> daemon if needed -> > start_daemon -> conn_add(pipefd) or sock_read(conn) > with short-lived child processes (command-line launched from `upsmon`) > parsing the configuration to pick the part relevant to them per envvars, > and talk to the daemon (fork one if needed) to start/cancel the timers. > > Jim > > On Tue, Jun 13, 2023 at 8:33?AM Kari Lempi?inen < > kari.lempiainen at summerday.net> wrote: > >> Hi, >> >> I ran the strace command while upssched was 100% CPU hungry. This is what >> I got: >> >> 1686633611.702798 read(7, "", 1) = 0 <0.000004> >> 1686633611.702816 read(7, "", 1) = 0 <0.000004> >> 1686633611.702834 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0}, >> NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999998800}) <0.000006> >> 1686633611.702862 read(7, "", 1) = 0 <0.000004> >> 1686633611.702880 read(7, "", 1) = 0 <0.000004> >> . >> . >> . >> 1686633611.705206 read(7, "", 1) = 0 <0.000004> >> 1686633611.705224 read(7, "", 1) = 0 <0.000004> >> 1686633611.705242 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0}, >> NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999999204}) <0.000005> >> 1686633611.705268 read(7, "", 1) = 0 <0.000004> >> 1686633611.705288 read(7, "", 1) = 0 <0.000302> >> >> there are 128 "read(7" lines between every pselect6 lines (I did count >> the read lines for couple of blocks, 128 in each of them). Nothing else in >> the strace log, just read and pselect lines for the short time I let strace >> to run. >> >> Any help from this? >> >> Best regards, >> Kari >> ------------------------------ >> >> *<more history snapped>* >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://alioth-lists.debian.net/pipermail/nut-upsdev/attachments/20230613/77dc9849/attachment-0001.htm>