George Pontis
2006-Jan-04  19:14 UTC
[Nut-upsuser] Nut 2.0.1, net-snmp-5.1.2p2, APC SUA1000 with AP9606: repeated comm lost/established messages
Hello list members,
I am running NUT 2.0.1 on an OpenBSD 3.8 system, using the OpenBSD port for
net-snmp-5.1.2p2. The UPS is reached over the LAN through an AP9606 Web/SNMP
card with current firmware. All goes well for a long time, usually more
than one month. Then the system starts spitting messages about losing
communications with the UPS. A second message about reconnecting follows
immediately. The messages are sometimes coming at a fast rate, faster than
the 300 seconds specified for NOCOMMWARNTIME in upsmon.conf. A reboot of the
computer running NUT (the only computer running NUT) does not fix it, so I
assume that there is something going on with the UPS or the UPS network
card.
If this is not a known problem, I would be happy to do some guided sleuthing
to track it down. However, I have no clue where to look.
Any pointers would be much appreciated.
George
The system log messages look like this:
Jan  4 10:57:00 z9 upsd[12613]: Data for UPS [ap9606-srv] is stale - check driver
Jan  4 10:57:02 z9 upsd[12613]: UPS [ap9606-srv] data is no longer stale
Jan  4 10:59:02 z9 upsd[12613]: Data for UPS [ap9606-srv] is stale - check driver
Jan  4 10:59:02 z9 upsd[12613]: UPS [ap9606-srv] data is no longer stale
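
To see how often and for how long the data goes stale, the upsd lines above can be paired up with a small script. This is only a sketch: the function name `stale_episodes`, the regular expression, and the assumed year (syslog timestamps carry none) are illustrative and not part of NUT. It relies only on the two message texts shown in the log excerpt.

```python
import re
from datetime import datetime

# Matches classic syslog lines from upsd, e.g.
# "Jan  4 10:57:00 z9 upsd[12613]: Data for UPS [ap9606-srv] is stale - check driver"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+upsd\[\d+\]:\s+(?P<msg>.*)$"
)
UPS_RE = re.compile(r"UPS \[([^\]]+)\]")


def stale_episodes(lines, year=2006):
    """Pair each 'is stale' message with the next 'no longer stale' message.

    Returns a list of (ups_name, start_time, duration_in_seconds).
    The year is an assumption, since syslog timestamps omit it.
    """
    episodes = []
    stale_since = {}  # ups name -> time the data went stale
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not an upsd line (or a wrapped continuation line)
        ups = UPS_RE.search(m["msg"])
        if not ups:
            continue
        name = ups.group(1)
        when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        # Check the recovery message first: it also contains the word "stale".
        if "no longer stale" in m["msg"]:
            start = stale_since.pop(name, None)
            if start is not None:
                episodes.append((name, start, (when - start).total_seconds()))
        elif "is stale" in m["msg"]:
            stale_since.setdefault(name, when)
    return episodes


SAMPLE = [
    "Jan  4 10:57:00 z9 upsd[12613]: Data for UPS [ap9606-srv] is stale - check driver",
    "Jan  4 10:57:02 z9 upsd[12613]: UPS [ap9606-srv] data is no longer stale",
    "Jan  4 10:59:02 z9 upsd[12613]: Data for UPS [ap9606-srv] is stale - check driver",
    "Jan  4 10:59:02 z9 upsd[12613]: UPS [ap9606-srv] data is no longer stale",
]

for name, start, seconds in stale_episodes(SAMPLE):
    print(f"{name}: stale at {start:%H:%M:%S} for {seconds:.0f} s")
```

Feeding it the full system log instead of `SAMPLE` would show whether the outages cluster at particular times or keep a regular spacing, which may help separate a card problem from a network one.
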
The output from upsmon -D (with console messages interspersed) follows:
/usr/local/ups/sbin >> ./upsmon -D 
Network UPS Tools upsmon 2.0.1
UPS: ap9606-srv@localhost (slave) (power value 1)
Using power down flag file /etc/killpower
Trying to connect to UPS [ap9606-srv@localhost]
Logged into UPS ap9606-srv@localhost
polling ups: ap9606-srv@localhost
get_var: ap9606-srv@localhost / status
     status: [OL]
    parsing: [OL]: ups_on_line(ap9606-srv@localhost) (first time)
polling ups: ap9606-srv@localhost
get_var: ap9606-srv@localhost / status
     status: [OL]
    parsing: [OL]: ups_on_line(ap9606-srv@localhost) (no change)
polling ups: ap9606-srv@localhost
get_var: ap9606-srv@localhost / status
     status: [OL]
    parsing: [OL]: ups_on_line(ap9606-srv@localhost) (no change)
polling ups: ap9606-srv@localhost
get_var: ap9606-srv@localhost / status
     status: [OL]
    parsing: [OL]: ups_on_line(ap9606-srv@localhost) (no change)
polling ups: ap9606-srv@localhost
get_var: ap9606-srv@localhost / status
     status: [OL]
    parsing: [OL]: ups_on_line(ap9606-srv@localhost) (no change)
polling ups: ap9606-srv@localhost
get_var: ap9606-srv@localhost / status
Poll UPS [ap9606-srv@localhost] failed - Data stale
do_notify: ntype 0x0005 (COMMBAD)
Communications with UPS ap9606-srv@localhost lost
 
Broadcast Message from root@z9.z9.com
        (/dev/ttyp0) at 10:39 ...
 
Communications with UPS ap9606-srv@localhost lost
 
polling ups: ap9606-srv@localhost
get_var: ap9606-srv@localhost / status
     status: [OL]
do_notify: ntype 0x0004 (COMMOK)
Communications with UPS ap9606-srv@localhost established
    parsing: [OL]: ups_on_line(ap9606-srv@localhost) (no change)
 
Broadcast Message from root@z9.z9.com
        (/dev/ttyp0) at 10:40 ...
 
Communications with UPS ap9606-srv@localhost established
 
polling ups: ap9606-srv@localhost
get_var: ap9606-srv@localhost / status
     status: [OL]
    parsing: [OL]: ups_on_line(ap9606-srv@localhost) (no change)
^CSignal 2: exiting
Arnaud Quette
2006-Jan-05  11:59 UTC
[Nut-upsuser] Nut 2.0.1, net-snmp-5.1.2p2, APC SUA1000 with AP9606: repeated comm lost/established messages
Hi George,

2006/1/4, George Pontis <gpontis@spamcop.net>:
> Hello list members,
>
> I am running NUT 2.0.1 on an OpenBSD 3.8 system, using the OpenBSD port
> for net-snmp-5.1.2p2. The UPS is connected via an AP9606 Web/SNMP card
> with current firmware, and the LAN. All goes well for a long time, usually
> more than one month. Then the system starts spitting messages about losing
> communications with the UPS. A second message about reconnecting follows
> immediately. The messages are sometimes coming at a fast rate, faster than
> the 300 seconds specified for NOCOMMWARNTIME in upsmon.conf. A reboot of
> the computer running NUT (the only computer running NUT) does not fix it, so
> I assume that there is something going on with the UPS or the UPS network
> card.
>
> If this is not a known problem, I would be happy to do some guided
> sleuthing to track it down. However, I have no clue where to look.
>
> Any pointers would be much appreciated.

Does resetting the card itself (there should be an option for this in the
web admin interface, if there is one) resolve the issue? If so, you have
confirmed that the problem comes from the comm card itself. Otherwise, the
problem is more likely in the UPS core.

Another test, on the NUT side, would be to stop snmp-ups, then restart it in
debug mode (using -DDDDD) while the data is going stale, and look at what
happens.

A simpler one, in the same situation, would be to try a ping and an snmpwalk
to see whether the comm card is refusing to answer.

Arnaud
--
Linux / Unix Expert - MGE UPS SYSTEMS - R&D Dpt
Network UPS Tools (NUT) Project Leader - http://www.networkupstools.org/
Debian Developer - http://people.debian.org/~aquette/
OpenSource Developer - http://arnaud.quette.free.fr/
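
The ping and snmpwalk check could be scripted so that it is easy to run the moment the stale messages start. The following is a rough sketch, not a tested procedure for this setup: the helper names, the "public" community string, SNMP v1, and the choice of the standard MIB-2 system subtree (1.3.6.1.2.1.1) are all assumptions to adapt to how the AP9606 is actually configured.

```python
import subprocess
from datetime import datetime


def probe_commands(host, community="public"):
    """Build the two checks: an ICMP ping and an SNMP walk of the system subtree.

    The community string and SNMP version are assumptions; use the values
    configured on the card.
    """
    return [
        ["ping", "-c", "3", host],
        # 1.3.6.1.2.1.1 is the MIB-2 "system" subtree, given numerically so
        # that no MIB files need to be loaded.
        ["snmpwalk", "-v", "1", "-c", community, host, "1.3.6.1.2.1.1"],
    ]


def run_probes(host, community="public", timeout=30):
    """Run each check once and return (timestamp, tool, succeeded) tuples."""
    results = []
    for cmd in probe_commands(host, community):
        started = datetime.now().isoformat(timespec="seconds")
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
            ok = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # Tool not installed, or no answer within the timeout.
            ok = False
        results.append((started, cmd[0], ok))
    return results
```

Calling `run_probes()` with the card's address while upsd is reporting stale data, and again once it has recovered, gives two sets of results to compare. If ping succeeds but snmpwalk fails during the bad periods, that would point at the SNMP agent on the card rather than at the network path.
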