Bob Brown
2015-Aug-14 12:53 UTC
[Nut-upsuser] occassional problem wtih upslog and apc ups units via snmp ... [NA] status
I see this in syslog: Aug 12 03:57:22 orion upsd[18274]: Data for UPS [a102-apc-10] is stale - check driver (one for each UPS). Then I see a bunch of messages like this: Aug 12 03:58:02 orion upsd[18274]: write() failed for ::1: Broken pipe Aug 12 03:58:02 orion snmpd[2872]: Connection from UDP: [127.0.0.1]:50940 And then it returns to normal like this: Aug 12 03:58:04 orion upsd[18274]: UPS [a102-apc-10] data is no longer stale Where else can I look to find out what really happened? -Bob -----Original Message----- From: Charles Lepple [mailto:clepple at gmail.com] Sent: Thursday, August 13, 2015 9:52 PM To: Bob Brown Cc: nut-upsuser at lists.alioth.debian.org Subject: Re: [Nut-upsuser] occassional problem wtih upslog and apc ups units via snmp ... [NA] status On Aug 13, 2015, at 12:45 PM, Bob Brown <bbrown at harpercollege.edu> wrote:> > Os is redhat enterprise, kernel 2.6.18-164.2.1.e15 NUT is apparently > 2.4.1 It was installed by tar. > > Monitoring a lot of apc UPS units?mostly Smart-UPS of various flavors, also a Silcon DP380E. > > Normally works just fine? ?upslog? writes status log every 30 seconds with various parameters.. > > Every so often, one of the UPS units has a log entry of: > > 20150812+035542+NA+NA+NA+[NA]+NA+NA+NA > > When it normally looks like: > > 20150813+114309+100.00+115.00+39.00+[OL]+25.00+60.00+1020.00 > > What does it mean when it puts NA in the log file?Every log interval, each variable is fetched from the server, and if there is an error retrieving that value at that time, "NA" is logged: github.com/networkupstools/nut/blob/master/clients/upslog.c#L203 Since they are all "NA" at the same time, it is likely that there is a communication problem, either between upslog and upsd, or between upsd and the driver. In either case, there is probably a syslog message describing the problem in more detail. -- Charles Lepple clepple at gmail
Charles Lepple
2015-Aug-14 13:27 UTC
[Nut-upsuser] occassional problem wtih upslog and apc ups units via snmp ... [NA] status
On Aug 14, 2015, at 8:53 AM, Bob Brown <bbrown at harpercollege.edu> wrote:> > I see this in syslog: > Aug 12 03:57:22 orion upsd[18274]: Data for UPS [a102-apc-10] is stale - check driverI don't have a SNMP UPS to test with, and this might be different for NUT 2.4.1, but the current defaults seem to make stale data more likely in your case. In networkupstools.org/docs/man/snmp-ups.html#_extra_arguments the default for "pollfreq" is 30 seconds, and in networkupstools.org/docs/man/upsd.conf.html the "MAXAGE" default is 15 seconds. So you might want to either increase MAXAGE, or decrease pollfreq. I would recommend making MAXAGE at least pollfreq (possibly 1.5x larger), because decreasing pollfreq might run up against the amount of time that it takes to poll all of the SNMP variables from the UPS. There are some retries and timeouts in snmp-ups which make this difficult to guess without seeing the debug logs. You might check the list archives to see if anyone has suggestions on what pollfreq and MAXAGE values worked for them.> (one for each UPS). > > Then I see a bunch of messages like this: > > Aug 12 03:58:02 orion upsd[18274]: write() failed for ::1: Broken pipeThis just means that the client (upslog, probably) disconnected before upsd expected the session to end. The upsd-to-driver connection is a Unix-domain socket (and the driver-to-UPS connection is UDP).> Aug 12 03:58:02 orion snmpd[2872]: Connection from UDP: [127.0.0.1]:50940 >Might be unrelated, unless I misunderstood your NUT setup (upslog -> upsd -> snmp-ups -> UPS). The snmpd daemon and the UPS comms card are SNMP servers ("agents" in SNMP parlance, IIRC), and snmp-ups is a SNMP client. -- Charles Lepple clepple at gmail
Charles Lepple
2015-Aug-14 22:50 UTC
[Nut-upsuser] occassional problem wtih upslog and apc ups units via snmp ... [NA] status
[please keep the list CC'd, thanks]> On Aug 14, 2015, at 10:27 AM, Bob Brown <bbrown at harpercollege.edu> wrote: > > Would it put NA in the log files if maxage was too short?I think so. I haven't tried it, but if snmp-ups polls every 30 seconds, and MAXAGE is 15, then with a random polling interval, 50% of the time the data would be more than 15 seconds old. In practice, the driver and upsd were probably started around the same time, and due to differences in timing between upsd and the driver, they occasionally drift out of sync. (Increasing MAXAGE is probably the simplest way around that.) The usbhid-ups USB driver (which tends to get a lot more testing these days) splits polling into high and low priority updates, so that important variables get update frequently (many times in one MAXAGE period), and other variables get updated later. But successful high-priority updates clear out the "data stale" flag. There are comments in the snmp-ups driver that recommend implementing that there as well, but nobody has written that code yet.> Does it sound like a network delay and that's why it put NA and marked the data as stale?Packet loss between the driver and the SNMP card in the UPS is more likely than delay (the MAXAGE and pollfreq arguments are in seconds), but as I mentioned, the current version of NUT retries a few times, so ultimately, packet loss shouldn't be a problem.> -Bob > > -----Original Message----- > From: Charles Lepple [mailto:clepple at gmail.com] > Sent: Friday, August 14, 2015 8:28 AM > To: Bob Brown > Cc: nut-upsuser at lists.alioth.debian.org > Subject: Re: [Nut-upsuser] occassional problem wtih upslog and apc ups units via snmp ... [NA] status > > On Aug 14, 2015, at 8:53 AM, Bob Brown <bbrown at harpercollege.edu> wrote: >> >> I see this in syslog: >> Aug 12 03:57:22 orion upsd[18274]: Data for UPS [a102-apc-10] is stale - check driver > > I don't have a SNMP UPS to test with, and this might be different for NUT 2.4.1, but the current defaults seem to make stale data more likely in your case. > > In networkupstools.org/docs/man/snmp-ups.html#_extra_arguments the default for "pollfreq" is 30 seconds, and in networkupstools.org/docs/man/upsd.conf.html the "MAXAGE" default is 15 seconds. So you might want to either increase MAXAGE, or decrease pollfreq. > > I would recommend making MAXAGE at least pollfreq (possibly 1.5x larger), because decreasing pollfreq might run up against the amount of time that it takes to poll all of the SNMP variables from the UPS. There are some retries and timeouts in snmp-ups which make this difficult to guess without seeing the debug logs. You might check the list archives to see if anyone has suggestions on what pollfreq and MAXAGE values worked for them. > >> (one for each UPS). >> >> Then I see a bunch of messages like this: >> >> Aug 12 03:58:02 orion upsd[18274]: write() failed for ::1: Broken pipe > > This just means that the client (upslog, probably) disconnected before upsd expected the session to end. The upsd-to-driver connection is a Unix-domain socket (and the driver-to-UPS connection is UDP). > >> Aug 12 03:58:02 orion snmpd[2872]: Connection from UDP: [127.0.0.1]:50940 >> > > Might be unrelated, unless I misunderstood your NUT setup (upslog -> upsd -> snmp-ups -> UPS). The snmpd daemon and the UPS comms card are SNMP servers ("agents" in SNMP parlance, IIRC), and snmp-ups is a SNMP client. > > -- > Charles Lepple > clepple at gmail > > >-- Charles Lepple clepple at gmail