This note describes a heartbeat technique for validating the integrity of
a NUT installation.
Introduction
------------
A NUT configuration may run for months with little or no output to assure
the system administrator that the combined processes are running
correctly. The technique described in this note verifies that the UPS
driver, upsd, upsmon, upssched and upssched-cmd components are operational
and that data flows correctly between them. The system
administrator is warned if the overall combined process breaks.
Overview of the technique
-------------------------
An 11 minute upssched timer runs permanently, and when it completes,
upssched-cmd sends a warning message to the sysadmin. During normal
operation the timer is prevented from completing by a timed process with a
shorter 10 minute period running in a dummy UPS known as "heartbeat". The
dummy UPS "heartbeat" cycles through an OL and an OB every 10 minutes, and
the status changes are communicated to upsd and then to upsmon and
upssched. Thus every 10 minutes upssched stops and restarts the 11 minute
timer. During normal operation the 11 minute timer will never complete,
but if the driver -> upsd -> upsmon -> upssched chain is broken, it will
complete and the sysadmin will be advised.
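
The timing relationship can be checked with a few lines of shell. This is
only a sketch: the helper names cycle_seconds and timer_is_safe are
hypothetical, and the input is the heartbeat.dev content and the 660 second
timer value used later in this note.

```shell
#!/bin/sh
# cycle_seconds (hypothetical helper): sum the TIMER values read on stdin,
# giving the length in seconds of one full OL/OB cycle of the dummy UPS.
cycle_seconds() {
    awk '$1 == "TIMER" { total += $2 } END { print total + 0 }'
}

# timer_is_safe (hypothetical helper): succeed if the failure timer ($1,
# seconds) is longer than the heartbeat cycle ($2, seconds).
timer_is_safe() {
    [ "$1" -gt "$2" ]
}

# Feed in the same four lines that this note places in heartbeat.dev.
cycle=$(printf 'ups.status: OL\nTIMER 300\nups.status: OB\nTIMER 300\n' | cycle_seconds)

if timer_is_safe 660 "$cycle"; then
    echo "cycle ${cycle}s, timer 660s: timer never completes while the heartbeat runs"
else
    echo "cycle ${cycle}s, timer 660s: timer completes even when NUT is healthy"
fi
```

With a 600 second cycle and a 660 second timer the first branch is taken.
A timer shorter than the cycle, such as 540 seconds, would take the second
branch.
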
The technique requires a working NUT installation and an understanding of
upssched timers and the upssched-cmd script.
Changes to configuration files
------------------------------
1. In ups.conf, add
[heartbeat]
driver = dummy-ups
port = heartbeat.dev
desc = "Heart beat validation of NUT"
2. Create heartbeat.dev in the same directory as ups.conf with the
contents
ups.status: OL
TIMER 300
ups.status: OB
TIMER 300
Remember that there are no comments in NUT .dev files.
3. In upsmon.conf, add
MONITOR heartbeat@localhost 1 upsmaster s3cr3t master
and make sure that you have specified
NOTIFYCMD /usr/sbin/upssched
NOTIFYFLAG ONBATT SYSLOG+WALL+EXEC
NOTIFYFLAG ONLINE SYSLOG+WALL+EXEC
Your upssched executable may be elsewhere. You may want to remove the
WALL.
4. In upssched.conf, add
# Heart beat validation that NUT is operational.
# Restart timer which completes only if the dummy-ups heart beat has stopped.
# See timer values in heartbeat.dev
AT ONBATT heartbeat@localhost CANCEL-TIMER heartbeat-failure-timer
AT ONBATT heartbeat@localhost START-TIMER heartbeat-failure-timer 660
and make sure that there are no entries such as
AT ONLINE * ...
AT ONBATT * ...
Replace the "*" with the full address of the UPS unit, e.g.
myups@localhost
Make sure that you have specified
CMDSCRIPT /usr/sbin/upssched-cmd
Your upssched-cmd may be elsewhere.
5. In upssched-cmd, test for completion of the heartbeat-failure-timer and,
when it completes, send a warning to the sysadmin by e-mail, SMS, pigeon, ...
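
A minimal sketch of such an upssched-cmd script is shown below. It relies on
upssched running the CMDSCRIPT with the timer name as its first argument.
The helper names message_for_event and notify_admin are hypothetical, the
message wording is an assumption, and the logger call is only a placeholder
for whatever channel reaches your sysadmin.

```shell
#!/bin/sh
# Sketch of an upssched-cmd script. upssched runs CMDSCRIPT with the name
# of the completed timer as its first argument.

# message_for_event (hypothetical helper): map a timer name to warning text.
message_for_event() {
    case "$1" in
        heartbeat-failure-timer)
            echo "NUT heartbeat failure: heartbeat-failure-timer completed"
            ;;
        *)
            echo "upssched-cmd: unrecognized event: $1"
            ;;
    esac
}

# notify_admin (hypothetical helper): logger is a placeholder; substitute
# mail, SMS, or another channel that reaches your sysadmin.
notify_admin() {
    logger -t upssched-cmd "$1"
}

# Only act when called with an argument, as upssched does.
if [ "$#" -gt 0 ]; then
    notify_admin "$(message_for_event "$1")"
fi
```

Keeping the message text separate from the delivery step makes it possible
to check the case statement without sending anything.
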
Testing the heartbeat setup
---------------------------
1. Test that you can send a warning to the sysadmin with the command
upssched-cmd heartbeat-failure-timer
2. When you start NUT, check that "heartbeat" is running. The command
ps aux | grep ups
should show something like
upsd 14785 0.0 0.0 13228 652 ? Ss 22:48 0:00 /usr/lib/ups/driver/usbhid-ups -a myups
upsd 14787 0.0 0.0 19624 704 ? Ss 22:48 0:00 /usr/lib/ups/driver/dummy-ups -a heartbeat
upsd 14791 0.0 0.0 17560 744 ? Ss 22:48 0:00 /usr/sbin/upsd -u upsd
root 14794 0.0 0.0 19432 664 ? Ss 22:48 0:00 /usr/sbin/upsmon
upsd 14795 0.0 0.0 19856 1616 ? S 22:48 0:00 /usr/sbin/upsmon
upsd 14845 0.0 0.0 6408 448 ? S 22:53 0:00 /usr/sbin/upssched UPS heartbeat@localhost: On battery
3. Shorten the heartbeat-failure-timer in upssched.conf to 540 seconds,
and you should receive a warning every 10 minutes.
4. If you leave the WALL in the NOTIFYFLAG ONBATT and NOTIFYFLAG ONLINE
declarations in upsmon.conf you will see the action of the dummy-ups
displayed in an xterm or equivalent console.
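
As a further check, the status of the dummy UPS can be read back through
upsd with the upsc client. The sketch below assumes that upsc is on the PATH
and that upsd is listening on localhost. The helper name heartbeat_state is
hypothetical.

```shell
#!/bin/sh
# heartbeat_state (hypothetical helper): classify a ups.status value
# reported for the dummy UPS.
heartbeat_state() {
    case "$1" in
        OL|OB) echo "heartbeat reachable, status $1" ;;
        *)     echo "unexpected status: ${1:-none}" ;;
    esac
}

# Query upsd only if upsc is installed. The UPS name "heartbeat" is the
# section name added to ups.conf.
if command -v upsc >/dev/null 2>&1; then
    heartbeat_state "$(upsc heartbeat@localhost ups.status 2>/dev/null)"
fi
```

Run a few minutes apart, the reported status should alternate between OL
and OB as the dummy UPS works through heartbeat.dev.
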
I have tested this setup with NUT 2.7.4 on openSUSE 13.2 and 42.2.
Comments and suggestions welcome.
Roger