Hi all,
We have recently bought an APC UPS and are in the process of setting up
the NUT software to make use of it. We are experiencing a problem with
the behaviour of the slave systems when the master system goes off line.
Although the failure of our master system will (hopefully) be a rare
event, and we hope not to experience too many power outages, it is
possible (if unlikely) that both circumstances will occur at the same
time. I have searched the list, but not found anyone else with this
problem. We would appreciate some help and advice if possible.
I will first give a very brief overview of our set up, then detail the
problem, and finally provide detailed information on our set up and its
configuration.
++ Brief overview of set up.
Our APC UPS is attached to a PC by a serial cable. This PC acts as the
NUT master system (with NUT server and client software installed) and is
connected to the network. Two other systems act as NUT slave systems
(with NUT client software installed). These are also attached to the
network and monitor the master system over this network connection.
This is a test rig. It has shown the NUT software and UPS to operate
very successfully in many different circumstances. As stated above, the
circumstances that lead to our problem should be rare.
++ Details of the problem.
Problem
_______
We have conducted two tests in which the master PC is unexpectedly shut
down: one with the UPS On Line (OL) and one with it On Battery (OB). Both
tests showed
that the slave systems did not register the loss of the master system
for 15 minutes. This period of time is too great because the fully
charged battery of the UPS will probably not last for 15 minutes, and
there is no guarantee that such a failure will occur with a fully
charged battery.
Our Understanding of the Expected NUT Behaviour
_______________________________________________
It is our understanding that the NUT software process "upsmon" is
responsible for monitoring the "upsd" process on the master system that
provides information about the state of the UPS. Each slave system can
set parameters for the upsmon process (using the NUT configuration file
"upsmon.conf"). One of these parameters is called "DEADTIME".
The man page for upsmon (upsmon.8) states:
DEAD UPSES
In the event that upsmon can't reach upsd(8), it declares that UPS dead
after some interval controlled by DEADTIME in the upsmon.conf(5). If
this happens while that UPS was last known to be on battery, it is
assumed to have gone critical and no longer contributes to the overall
power value.
The parameter DEADTIME has units of seconds. This parameter is set to
"15" by default, indicating that after 15 seconds of being unable to
contact the master's upsd process, the slave upsmon process should make
a decision on whether to shut the system down. (The decision is based on
the last known state of the UPS [OL or OB] and whether the system has an
alternative power source.) Modifications have been made to this
parameter on the slave systems; these changes have not affected the 15
minute delay between the shut down of the master and the registering of
the absence of the master upsd process by the slaves.
We expect that if the UPS is OB and the master system is shut down, the
slaves will begin to shut down after a delay of DEADTIME seconds. It is
clear that something other than the upsmon DEADTIME parameter is
affecting the behaviour of the slaves, but we don't know how to alter this.
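To make our expectation concrete, here is a minimal Python sketch of the
decision as we read it from the man page. It is not taken from the NUT
source: the names UpsView, is_dead and should_shut_down are made up for
illustration, and the comparison against MINSUPPLIES is our simplified
reading of how the power value is used.

```python
from dataclasses import dataclass


@dataclass
class UpsView:
    """What a slave remembers about one monitored UPS (hypothetical model)."""
    last_status: str      # "OL" or "OB": the last status successfully read
    last_contact: float   # time in seconds of the last successful poll
    power_value: int = 1  # power value from the MONITOR line


def is_dead(ups, now, deadtime=15.0):
    """True once upsd has been unreachable for longer than DEADTIME."""
    return (now - ups.last_contact) > deadtime


def should_shut_down(upses, now, deadtime=15.0, minsupplies=1):
    """Shut down when the usable power value drops below MINSUPPLIES."""
    usable = 0
    for ups in upses:
        if is_dead(ups, now, deadtime) and ups.last_status == "OB":
            # Dead while last known to be on battery: assumed critical,
            # so it no longer contributes to the overall power value.
            continue
        usable += ups.power_value
    return usable < minsupplies
```

With DEADTIME 15, a UPS last seen OB at t=100 would be treated as lost
once t passes 115. A check like this can only run once the poll has
returned, which would be consistent with the hang we see during polling.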
A Guess at the Root of this Problem
___________________________________
We have done a little bit of further investigation to try to understand
what is going on and what we are doing wrong.
By running a slave upsmon process with a debugging flag set it can be
seen that the 15 minute delay occurs as a result of the upsmon's poll of
the master's upsd process. Once the master has gone off line, the slave
upsmon reports:
polling ups: apcups@nutMaster.domain.uk
get_var: apcups@nutMaster.domain.uk / status
and then 'hangs'. A 15 minute delay follows before the polling process
returns that the master's upsd process is not reachable.
A brief examination of the NUT source code indicates that a system
"write" call is used to communicate across the network with the upsd
process of the master. We think that this call blocks by default and
that the default blocking behaviour is in use, so upsmon simply waits.
This is only a guess and may be very wide of the mark, but it is the best
we have come up with!
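For illustration, the following Python sketch shows the kind of
bounded-time query we would have expected. It is not how upsmon is
implemented: poll_status is a hypothetical helper, and the
"GET VAR <ups> ups.status" request line is our assumption about the NUT
network protocol. The point is only that a per-operation socket timeout
makes a silent peer show up as a failure within seconds instead of
hanging.

```python
import socket


def poll_status(sock, ups_name, timeout=5.0):
    """Ask an already-connected server for the UPS status.

    Returns the raw reply line, or None if the peer closed the connection,
    an error occurred, or a send/recv took longer than `timeout` seconds.
    """
    # The timeout applies to each send and recv individually, so a peer
    # that has silently disappeared cannot block us indefinitely.
    sock.settimeout(timeout)
    request = f"GET VAR {ups_name} ups.status\n".encode("ascii")
    try:
        sock.sendall(request)
        reply = b""
        while not reply.endswith(b"\n"):
            chunk = sock.recv(1024)
            if not chunk:        # orderly close by the peer
                return None
            reply += chunk
    except OSError:              # covers timeouts and connection errors
        return None
    return reply.decode("ascii", errors="replace").strip()
```

A caller could treat a None result as "no contact on this poll" and
compare the time since the last good reply against DEADTIME.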
We expect that this problem is caused by our set up and
configuration of the NUT software. Has anyone seen similar behaviour?
Does anyone have any suggestions on how to fix this problem?
Any sharing of knowledge or suggestions will be appreciated.
Best wishes,
Jon Clark
++ Details about the set up
In almost all cases, the default configuration settings are in use where
possible.
Master Configuration Files
__________________________
ups.conf
--------
$ grep -v "#" ups.conf
[apcups]
driver = apcsmart
port = /dev/ttyS0
upsd.conf
---------
$ grep -v "#" upsd.conf
ACL all 0.0.0.0/0
ACL localhost 127.0.0.1/32
ACL nutMaster xx.xx.xx.xx1/32
ACL nutSlave1 xx.xx.xx.xx7/32
ACL nutSlave2 xx.xx.xx.xx3/32
ACCEPT localhost nutMaster nutSlave1 nutSlave2
REJECT all
upsd.users
----------
$ grep -v "#" upsd.users
[upsadmin]
password = ****
allowfrom = nutMaster
actions = SET
instcmds = ALL
[monmaster]
password = ****
allowfrom = nutMaster
upsmon master
[monslave-nutSlave1]
password = ****
allowfrom = nutSlave1
upsmon slave
[monslave-nutSlave2]
password = ****
allowfrom = nutSlave2
upsmon slave
upsmon.conf
-----------
$ grep -v "#" upsmon.conf
MONITOR apcups@nutMaster.domain.uk 1 monmaster **** master
MINSUPPLIES 1
SHUTDOWNCMD "/sbin/shutdown -h +0"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
RBWARNTIME 43200
NOCOMMWARNTIME 300
FINALDELAY 5
Slave Configuration Files
_________________________
(Both slaves have similar settings and exhibit similar behaviour.)
upsmon.conf
-----------
$ grep -v "#" upsmon.conf
MONITOR apcups@nutMaster.domain.uk 1 monslave-nutSlave1 **** slave
MINSUPPLIES 1
SHUTDOWNCMD "/sbin/shutdown -h +0"
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower
NOCOMMWARNTIME 300
FINALDELAY 0
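
As a sanity check on the timing values, the Python sketch below reads the
directives from upsmon.conf text and applies the rule of thumb we recall
from the comments in the sample upsmon.conf (DEADTIME at least three
times the larger of POLLFREQ and POLLFREQALERT). The helper names are
hypothetical, and the fallback values of 5 and 15 seconds are our
assumption about the defaults used when a directive is absent, as
POLLFREQ is in the slave file.

```python
def parse_upsmon_conf(text):
    """Return {directive: seconds} for the timing directives of interest."""
    wanted = {"POLLFREQ", "POLLFREQALERT", "DEADTIME"}
    settings = {}
    for line in text.splitlines():
        # Drop comments, then split "DIRECTIVE value" on whitespace.
        parts = line.split("#", 1)[0].split()
        if len(parts) >= 2 and parts[0] in wanted:
            settings[parts[0]] = int(parts[1])
    return settings


def check_deadtime(settings, default_pollfreq=5, default_deadtime=15):
    """True if DEADTIME >= 3 * max(POLLFREQ, POLLFREQALERT)."""
    pollfreq = settings.get("POLLFREQ", default_pollfreq)
    alert = settings.get("POLLFREQALERT", default_pollfreq)
    deadtime = settings.get("DEADTIME", default_deadtime)
    return deadtime >= 3 * max(pollfreq, alert)
```

Applied to the slave settings above, check_deadtime returns True, so the
configured values look consistent with each other.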
Computer Operating Systems
__________________________
nutMaster: Scientific Linux 4.4
nutSlave1: Scientific Linux 4.1
(Scientific Linux is a Red Hat Enterprise Linux recompile.)
NUT Software Versions
_____________________
nutMaster:
- nut-2.2.0-3.3.el4.i386.rpm
- nut-client-2.2.0-3.3.el4.i386.rpm
nutSlave1:
- nut-client-2.2.0-3.3.el4.i386.rpm
UPS Details
___________
Brand: APC
Model: Smart-UPS RT 8000VA RM 230V (XLI)
--
----------------------------
Jon Clark
Scientific Officer
Dept. of Applied Mathematics
University of Sheffield
Sheffield, S3 7RH, UK
----------------------------