Henning Fehrmann
2016-Feb-08 13:46 UTC
[Nut-upsuser] UPS Socomec is unavailable appears in the logs
Hello, we have two UPSes - one Galaxy and one from Socomec, both monitored by the NUT server. The NUT server runs the version 2.7.2. We have roughly 3500 clients running upsmon, reading the status of both UPSes. They run mainly the version 2.6.4 and some 2.7.2. We observed the following on both client versions, regardless whether we run upsmon in the daemon or debug mode: Running upsc socomec at ip_socomec returns a list of all required values. OTOH, we find on all clients the folloging logs: upsmon[1608]: UPS socomec at ip_socomec is unavailable This appears regularly each little bit more than 300s and corresponds to the NOCOMMWARNTIME in the upsmon.conf. The log entry appears only for the Socomec UPS, not for the Galaxy. The process list shows that /sbin/upsmon is forking occasionally, generating a child in the Z state root 3294 0.0 0.0 10448 704 ? Ss 14:02 0:00 /sbin/upsmon nut 3295 0.0 0.0 10448 580 ? S 14:02 0:00 /sbin/upsmon nut 3497 0.0 0.0 0 0 ? Z 14:02 0:00 [upsmon] <defunct> As far as I can see, strace provides not much. Here is a snipped: 4275 14:07:50.688200 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdc214c79d0) = 4580 4275 14:07:50.688650 select(6, [5], NULL, NULL, {0, 0}) = 0 (Timeout) 4275 14:07:50.688873 wait4(-1, NULL, WNOHANG, NULL) = 4521 4275 14:07:50.688968 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0 4275 14:07:50.689007 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0 4275 14:07:50.689046 rt_sigprocmask(SIG_SETMASK, [], <unfinished ...> 4580 14:07:50.689065 exit_group(0) = ? 4275 14:07:50.689089 <... rt_sigprocmask resumed> NULL, 8) = 0 4275 14:07:50.689109 nanosleep({15, 0}, {14, 999864784}) = ? ERESTART_RESTARTBLOCK (To be restarted) 4275 14:07:50.689334 --- SIGCHLD (Child exited) @ 0 (0) --- 4275 14:07:50.689355 restart_syscall(<... resuming interrupted call ...>) = 0 4275 14:08:05.689412 alarm(10) = 0 4275 14:08:05.689530 select(5, NULL, [4], NULL, {0, 0}) = 1 (out [4], left {0, 0}) 4275 14:08:05.689595 write(4, "GET VAR galaxy ups.status\n", 26) = 26 4275 14:08:05.689721 select(5, [4], NULL, NULL, {5, 0}) = 1 (in [4], left {4, 996929}) 4275 14:08:05.692927 read(4, "VAR galaxy ups.status \"OL\"\n", 64) = 27 4275 14:08:05.693050 rt_sigaction(SIGALRM, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, 8) = 0 4275 14:08:05.693108 alarm(0) = 10 4275 14:08:05.693139 rt_sigaction(SIGALRM, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, 8) = 0 4275 14:08:05.693187 alarm(0) = 0 4275 14:08:05.693228 alarm(10) = 0 4275 14:08:05.693266 select(7, NULL, [6], NULL, {0, 0}) = 1 (out [6], left {0, 0}) 4275 14:08:05.693306 write(6, "GET VAR socomec ups.status\n", 27) = 27 4275 14:08:05.693373 select(7, [6], NULL, NULL, {5, 0}) = 1 (in [6], left {4, 997653}) 4275 14:08:05.695899 read(6, "VAR socomec ups.status \"\"\n", 64) = 26 4275 14:08:05.696006 rt_sigaction(SIGALRM, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, 8) = 0 4275 14:08:05.696062 alarm(0) = 10 4275 14:08:05.696140 rt_sigaction(SIGALRM, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, 8) = 0 4275 14:08:05.696187 alarm(0) = 0 4275 14:08:05.696251 write(2, " 50.070507\t", 12) = 12 4275 14:08:05.696305 write(2, "UPS socomec at ip_socomec is unav"..., 40) = 40 4275 14:08:05.696346 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdc214c79d0) = 4631 4275 14:08:05.696826 select(6, [5], NULL, NULL, {0, 0}) = 0 (Timeout) 4275 14:08:05.697101 wait4(-1, NULL, WNOHANG, NULL) = 4580 where process 4580 is the zombi process. Here is the ups.conf on the NUT server: [galaxy] driver = netxml-ups port = http://ip_galaxy pollinterval = 10 desc = "GALAXY 6000" [socomec] driver = snmp-ups port = ip_socomec community = public snmp_version = v1 mibs = netvision pollinterval = 10 desc = "Socomec" The upsd.users is [monuser] password = XXX On the client side the ups.conf is [galaxy] driver = galaxy port = ip_NUTserver desc = "GALAXY 6000" [socomec] driver = socomec port = ip_NUTserver desc = "SOCOMEC" The upsmon.conf reads: MONITOR galaxy at 172.28.22.40 1 monuser XXX master MONITOR socomec at 172.28.22.40 1 monuser XXX master MINSUPPLIES 1 SHUTDOWNCMD "/sbin/shutdown -h now" NOTIFYCMD "/sbin/upssched" POLLFREQ 15 POLLFREQALERT 15 HOSTSYNC 15 DEADTIME 45 POWERDOWNFLAG /etc/killpower NOTIFYMSG ONBATT "%s runs on battery" NOTIFYMSG FSD "%s forced shutdown" NOTIFYFLAG COMMOK SYSLOG NOTIFYFLAG COMMBAD SYSLOG NOTIFYFLAG NOCOMM SYSLOG NOTIFYFLAG ONLINE SYSLOG+EXEC+WALL NOTIFYFLAG ONBATT SYSLOG+EXEC+WALL NOTIFYFLAG FSD SYSLOG+EXEC+WALL RBWARNTIME 43200 NOCOMMWARNTIME 300 If some other logs or configs are required, please let us know. Is there something, we've missed in the config, which might have caused this problem? Thank you and cheers, Henning
Charles Lepple
2016-Feb-12 13:53 UTC
[Nut-upsuser] UPS Socomec is unavailable appears in the logs
On Feb 8, 2016, at 8:46 AM, Henning Fehrmann <henning.fehrmann at aei.mpg.de> wrote:> > Hello, > > we have two UPSes - one Galaxy and one from Socomec, both monitored by the NUT server. > > The NUT server runs the version 2.7.2. > > We have roughly 3500 clients running upsmon, reading the status of both > UPSes. They run mainly the version 2.6.4 and some 2.7.2. > > We observed the following on both client versions, regardless whether > we run upsmon in the daemon or debug mode: > > Running > upsc socomec at ip_socomec > returns a list of all required values. > > OTOH, we find on all clients the folloging logs: > upsmon[1608]: UPS socomec at ip_socomec is unavailableI think this error message comes from this check: https://github.com/networkupstools/nut/blob/1993c39866b4625c0218fda1dfa9b037934c4c47/clients/upsmon.c#L1458 "empty response is the same as a dead ups" Corresponds to these lines from strace: 4275 14:08:05.693306 write(6, "GET VAR socomec ups.status\n", 27) = 27 4275 14:08:05.693373 select(7, [6], NULL, NULL, {5, 0}) = 1 (in [6], left {4, 997653}) 4275 14:08:05.695899 read(6, "VAR socomec ups.status \"\"\n", 64) = 26 If you run "upsc socomec at ip_socomec ups.status", it should return (at a minimum) "ups.status: OL", "ups.status: OB" or "ups.status: OB LB". The Galaxy seems to be working (from a few lines earlier): 4275 14:08:05.689595 write(4, "GET VAR galaxy ups.status\n", 26) = 26 4275 14:08:05.689721 select(5, [4], NULL, NULL, {5, 0}) = 1 (in [4], left {4, 996929}) 4275 14:08:05.692927 read(4, "VAR galaxy ups.status \"OL\"\n", 64) = 27 There was a change to the Netvision MIB file (included in NUT 2.7.3) that might fix the empty ups.status: https://github.com/networkupstools/nut/commit/4552d14162086a280445efc5331a8e6ad7b1c3a4 It should not be necessary to update the clients to NUT 2.7.3, only the Socomec NUT server.> > This appears regularly each little bit more than 300s and corresponds to > the NOCOMMWARNTIME in the upsmon.conf. > > The log entry appears only for the Socomec UPS, not for the Galaxy. > > The process list shows that /sbin/upsmon is forking occasionally, generating a child in the Z state > > root 3294 0.0 0.0 10448 704 ? Ss 14:02 0:00 /sbin/upsmon > nut 3295 0.0 0.0 10448 580 ? S 14:02 0:00 /sbin/upsmon > nut 3497 0.0 0.0 0 0 ? Z 14:02 0:00 [upsmon] <defunct>It is unclear to me why this process is a zombie (defunct). I would expect upsmon to fork every 300s to run the NOTIFYCMD (upssched, in your case) due to the empty ups.status, but since the wait4 syscall (the waitpid() library function in the code) is being called, it should reap the zombie process. Maybe you are just seeing the window of time between the fork() and the waitpid()?> On the client side the ups.conf is > [galaxy] > driver = galaxy > port = ip_NUTserver > desc = "GALAXY 6000" > > [socomec] > driver = socomec > port = ip_NUTserver > desc = "SOCOMEC"Actually, the clients don't need ups.conf - they get the description from the server's ups.conf over the NUT network protocol.> The upsmon.conf reads: > > MONITOR galaxy at 172.28.22.40 1 monuser XXX master > MONITOR socomec at 172.28.22.40 1 monuser XXX master > > > MINSUPPLIES 1 > SHUTDOWNCMD "/sbin/shutdown -h now" > NOTIFYCMD "/sbin/upssched" >I don't use upssched - maybe someone else can shed some light on this?> _______________________________________________ > Nut-upsuser mailing list > Nut-upsuser at lists.alioth.debian.org > http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/nut-upsuser-- Charles Lepple clepple at gmail