Roger Price
2016-Jul-11 09:07 UTC
[Nut-upsdev] Proposal for technique to stop a timer at any moment
Here is patch 2 of 2.

Roger

diff -rup -x '*.html' -x '*.8' -x '*.5' nut-2.7.4.orig/docs/man/upsd.txt nut-2.7.4.dev/docs/man/upsd.txt
--- nut-2.7.4.orig/docs/man/upsd.txt 2015-12-29 09:42:34.000000000 +0100
+++ nut-2.7.4.dev/docs/man/upsd.txt 2016-07-07 10:08:51.939354892 +0200
@@ -119,6 +119,16 @@ administrative functions like SET and IN
 controlled in linkman:upsd.users[5].
 UPS definitions are found in linkman:ups.conf[5].

+The file <configpath>/NUT_DEBUG_LEVEL may be used to manage the debugging
+level. The file content is a single integer which represents the number of
+"-D" options which may be specified on the command line. If the file is
+created, modified or removed while the upsd process is running, the effect is
+immediate. It is not necessary to restart upsd. If both a "-D" option and
+the NUT_DEBUG_LEVEL file are present, the higher of the two values applies.
+The principal use of this file is in very specific tracing during a test,
+possibly under the control of a script which creates and removes the
+NUT_DEBUG_LEVEL file.
+
 ENVIRONMENT VARIABLES
 ---------------------

@@ -131,6 +141,35 @@ ENVIRONMENT VARIABLES
 *upsd* uses a built-in default, which is often `/var/state/ups`.
 The *STATEPATH* directive in linkman:upsd.conf[5] overrides this variable.

+SIGNALS
+-------
+
+upsd accepts the following signals:
+
+*SIGINT* *SIGQUIT* *SIGTERM*::
+The upsd process shuts down.
+
+*SIGHUP*::
+The upsd process reloads its configuration files.
+
+*SIGUSR1*::
+upsd remembers the SIGUSR1 signal on behalf of each UPS. When a
+client such as linkman:upsmon[8] polls upsd for UPS status changes, the SIGUSR1
+is reported as a status change. Once reported, upsd discards its record of
+the signal. If a further SIGUSR1 is received while a previous SIGUSR1 is
+waiting to be polled, it is discarded. Any semantics attached to the signal
+are defined by the user, perhaps in a user script called by
+linkman:upssched[8].
+
+*SIGUSR2*::
+upsd remembers the SIGUSR2 signal on behalf of each UPS. When a
+client such as linkman:upsmon[8] polls upsd for UPS status changes, the SIGUSR2
+is reported as a status change. Once reported, upsd discards its record of
+that signal. If a further SIGUSR2 is received while a previous SIGUSR2 is
+waiting to be polled, it is queued. Up to eight SIGUSR2 signals may be
+queued; further SIGUSR2 are discarded. Any semantics attached to the signal are
+defined by the user, perhaps in a user script called by linkman:upssched[8].
+
 SEE ALSO
 --------

diff -rup -x '*.html' -x '*.8' -x '*.5' nut-2.7.4.orig/docs/man/upsmon.conf.txt nut-2.7.4.dev/docs/man/upsmon.conf.txt
--- nut-2.7.4.orig/docs/man/upsmon.conf.txt 2015-12-29 13:08:34.000000000 +0100
+++ nut-2.7.4.dev/docs/man/upsmon.conf.txt 2016-07-06 22:27:57.514216079 +0200
@@ -213,6 +213,10 @@ REPLBATT;; The UPS battery is bad and ne

 NOCOMM;; A UPS is unavailable (can't be contacted for monitoring)

+SIGUSR1;; A UPS has received signal SIGUSR1. What this means is defined locally.
+
+SIGUSR2;; A UPS has received signal SIGUSR2. What this means is defined locally.
+
 *NOTIFYFLAG* 'type' 'flag'[\+'flag'][+'flag']...::

 By default, upsmon sends walls (global messages to all logged in users)
diff -rup -x '*.html' -x '*.8' -x '*.5' nut-2.7.4.orig/docs/man/upsmon.txt nut-2.7.4.dev/docs/man/upsmon.txt
--- nut-2.7.4.orig/docs/man/upsmon.txt 2015-12-29 13:08:34.000000000 +0100
+++ nut-2.7.4.dev/docs/man/upsmon.txt 2016-07-06 22:28:45.882438409 +0200
@@ -145,6 +145,12 @@ The UPS needs to have its battery replac
 *NOCOMM*::
 The UPS can't be contacted for monitoring.

+*SIGUSR1*::
+The UPS has received a SIGUSR1 signal. What this means is defined locally.
+
+*SIGUSR2*::
+The UPS has received a SIGUSR2 signal. What this means is defined locally.
+
 NOTIFY COMMAND
 --------------

diff -rup -x '*.html' -x '*.8' -x '*.5' nut-2.7.4.orig/docs/man/upssched.conf.txt nut-2.7.4.dev/docs/man/upssched.conf.txt
--- nut-2.7.4.orig/docs/man/upssched.conf.txt 2015-12-29 09:42:34.000000000 +0100
+++ nut-2.7.4.dev/docs/man/upssched.conf.txt 2016-07-07 10:28:08.290375907 +0200
@@ -83,6 +83,15 @@ If a specific UPS (+myups@localhost+) co
 stop the timer before it triggers

     AT COMMOK myups@localhost CANCEL-TIMER upsgone
++
+If any UPS received a SIGUSR1 signal, remove the current timer(s)
+before they trigger
+
+    AT SIGUSR1 * CANCEL-TIMER first-warning-timer
+    AT SIGUSR1 * CANCEL-TIMER last-warning-timer
+    AT SIGUSR1 * CANCEL-TIMER shutdown-timer
++
+It is not an error to cancel a timer which is not running.

 *EXECUTE* 'command';;
 Immediately pass 'command' as an argument to CMDSCRIPT.
diff -rup -x '*.html' -x '*.8' -x '*.5' nut-2.7.4.orig/docs/net-protocol.txt nut-2.7.4.dev/docs/net-protocol.txt
--- nut-2.7.4.orig/docs/net-protocol.txt 2016-03-08 16:48:26.000000000 +0100
+++ nut-2.7.4.dev/docs/net-protocol.txt 2016-06-19 16:32:53.000000000 +0200
@@ -44,6 +44,7 @@ NUT network protocol, over the time:
 |1.1 |>= 1.5.0 |Original protocol (without old commands)
 .2+|1.2 .2+|>= 2.6.4 |Add "LIST CLIENTS" and "NETVER" commands
 |Add ranges of values for writable variables
+|1.3 |>= 2.7.4 |Add "GET SIGUSR1" and "GET SIGUSR2"
 |==============================================================================

 NOTE: any new version of the protocol implies an update of NUT_NETVERSION
@@ -78,6 +79,29 @@ still connected when starting the shutdo

 This replaces the old "REQ NUMLOGINS" command.

+SIGUSR1 SIGUSR2
+~~~~~~~~~~~~~~~
+
+Form:
+
+    GET SIGUSR1 <upsname>
+    GET SIGUSR2 <upsname>
+    GET SIGUSR1 cheapo
+
+Response:
+
+    SIGUSR1 <upsname> <value>
+    SIGUSR2 <upsname> <value>
+    SIGUSR1 cheapo 1
+
+'<value>' is 0 or 1 and says whether the server has received a SIGUSR1
+or SIGUSR2 signal for this UPS. Further signals may await in the server.
+The count of signals waiting in the server is reduced by one each time
+the server receives this GET request.
+
+Any semantics attached to the signal are defined by the user.
+
+
 UPSDESC
 ~~~~~~~

diff -rup -x '*.html' -x '*.8' -x '*.5' nut-2.7.4.orig/docs/scheduling.txt nut-2.7.4.dev/docs/scheduling.txt
--- nut-2.7.4.orig/docs/scheduling.txt 2016-03-08 13:01:11.000000000 +0100
+++ nut-2.7.4.dev/docs/scheduling.txt 2016-07-07 16:39:28.578212591 +0200
@@ -112,12 +112,13 @@ be doing all the work for us. So, in th
 Then we want upsmon to actually _use_ it for the notify events, so again in the
 upsmon.conf we set the flags:

-    NOTIFYFLAG ONLINE SYSLOG+EXEC
-    NOTIFYFLAG ONBATT SYSLOG+WALL+EXEC
+    NOTIFYFLAG ONLINE SYSLOG+EXEC
+    NOTIFYFLAG ONBATT SYSLOG+WALL+EXEC
     NOTIFYFLAG LOWBATT SYSLOG+WALL+EXEC
+    NOTIFYFLAG SIGUSR1 SYSLOG+EXEC
     ... and so on.

-For the purposes of this document I will only use those three, but you can set
+For the purposes of this document I will only use those four, but you can set
 the flags for any of the valid notify types.

 Setting up your upssched.conf
@@ -141,17 +142,18 @@ for a temporary file created by upssched
 under some circumstances. Please see the relevant comments in upssched.conf
 for additional information and advice about these variables.

-Now you can tell your CMDSCRIPT what to do when it is called by upsmon.
+Now you can tell your CMDSCRIPT (e.g. upssched-cmd) what to do when it is
+called by upsmon.

 The big picture
 ^^^^^^^^^^^^^^^

 The design in a nutshell is:

-    upsmon ---> calls upssched ---> calls your CMDSCRIPT
+    upsmon ---> calls upssched ---> calls your CMDSCRIPT (e.g. upssched-cmd)

-Ultimately, the CMDSCRIPT does the actual useful work, whether that's
-initiating an early shutdown with 'upsmon -c fsd', sending a page by
+Ultimately, the CMDSCRIPT upssched-cmd does the actual useful work, whether
+that's initiating an early shutdown with 'upsmon -c fsd', sending a page by
 calling sendmail, or opening a subspace channel to V'ger.

 Establishing timers
@@ -176,12 +178,12 @@ Executing commands immediately
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 As an example, consider the scenario where a UPS goes onto battery power.
-However, the users are not informed until 60 seconds later - using a timer as
+However, the users are not informed until 30 seconds later - using a timer as
 described above. Whilst this may let the *logged in* users know that the UPS
 is on battery power, it does not inform any users subsequently logging in. To
 enable this we could, at the same time, create a file which is read and
 displayed to any user trying to login whilst the UPS is on battery power. If
-the UPS comes back onto utility power within 60 seconds, then we can cancel
+the UPS comes back onto utility power within 30 seconds, then we can cancel
 the timer and remove the file, as described above. However, if the UPS comes
 back onto utility power say 5 minutes later then we do not want to use any
 timers but we still want to remove the file. To do this we could use:
@@ -190,32 +192,37 @@ timers but we still want to remove the f

 This means that when upsmon detects that the UPS is back on utility power it
 will signal upssched. Upssched will see the above command and simply pass
-'ups-back-on-power' as an argument directly to CMDSCRIPT. This occurs
-immediately, there are no timers involved.
+'ups-back-on-power' as an argument directly to the CMDSCRIPT
+upssched-cmd. This occurs immediately, there are no timers involved.

 Writing the command script handler
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-OK, now that upssched knows how the timers are supposed to work, let's
-give it something to do when one actually triggers. The name of the
-example timer is onbattwarn, so that's the argument that will be passed
-into your CMDSCRIPT when it triggers. This means we need to do some
-shell script writing to deal with that input.
+OK, now that upssched knows how the timers are supposed to work, let's give it
+something to do when one actually triggers. The name of the example timer is
+onbattwarn, so that's the argument that will be passed into your CMDSCRIPT
+upssched-cmd when it triggers. This means we need to do some shell script
+writing to deal with that input.
+
+Our script is written in Bash since this is widely available and well known to
+system administrators, but it doesn't have to be Bash. Debian for example
+uses Dash by default, and nothing stops you from writing the script in C.

 --------------------------------------------------------------------------------

-    #! /bin/sh
+    #! /bin/bash -u

     case $1 in
         onbattwarn)
             echo "The UPS has been on battery for awhile" \
-            | mail -s"UPS monitor" bofh@pager.example.com
+            | mail -s "UPS monitor" bofh@pager.example.com
             ;;
         ups-back-on-power)
+            # Remove the warning to new logins that system is on battery.
             /bin/rm -f /some/path/ups-on-battery
             ;;
         *)
-            logger -t upssched-cmd "Unrecognized command: $1"
+            logger -i -t upssched-cmd "Unrecognized command: $1"
             ;;
     esac

@@ -226,11 +233,44 @@ the presence of a given trigger. With m
 names, you will need to test for each possibility and handle it according
 to your desires.

+In the preceding example, we sent a message to the sysadmin when the timeout
+occurred. However the situation in the UPS may have changed since the timer
+started, and the sysadmin needs to know about it. It is possible to
+interrogate the UPS from within the script with commands such as
+
+--------------------------------------------------------------------------------
+
+    UPS="dodgy-old-UPS@example.com"
+    STATUS=$( upsc $UPS ups.status )
+    CHARGE=$( upsc $UPS battery.charge )
+    STATE="$STATUS charge=$CHARGE%"
+
+--------------------------------------------------------------------------------
+
+and include that up-to-date state in the message to the sysadmin
+
+--------------------------------------------------------------------------------
+
+        onbattwarn)
+            echo "UPS $UPS has been on battery for awhile, state now $STATE" \
+            | mail -r "$UPS" -s "UPS $UPS: $STATE" bofh@pager.example.com
+            ;;
+
+--------------------------------------------------------------------------------
+
 NOTE: You can invoke just about anything from inside the CMDSCRIPT. It doesn't
 need to be a shell script, either - that's just an example. If you want to
 write a program that will parse argv[1] and deal with the possibilities, that
 will work too.

+TIP: You can debug your upssched-cmd script by calling it directly from the
+command line with any argument, for example
+
+--------------------------------------------------------------------------------
+
+bofh@bigbox:~> upssched-cmd onbattwarn
+
+--------------------------------------------------------------------------------

 Early Shutdowns
 ~~~~~~~~~~~~~~~
@@ -243,7 +283,7 @@ Just be sure to cancel this timer if you
 The best way to do this is to use the upsmon callback feature. You can make
 upsmon set the "forced shutdown" (FSD) flag on the upsd so your slave systems
 shut down early too. Just do something like this in your
-CMDSCRIPT:
+CMDSCRIPT upssched-cmd:

     /usr/local/ups/sbin/upsmon -c fsd

@@ -252,6 +292,70 @@ from the CMDSCRIPT, since there's no syn
 systems hooked to the same UPS. FSD is the master's way of saying "we're
 shutting down *now* like it or not, so you'd better get ready".

+Irregular Hardware
+~~~~~~~~~~~~~~~~~~
+
+The process control industry is full of examples of hardware which works
+perfectly when hardware field engineering come round to test it, but which
+suffers transient errors when left to itself over a holiday weekend. UPS's
+are no different and the careful sysadmin wants a confirmation that the
+reported OB is real and not just a transient.
+
+A CMDSCRIPT such as upssched-cmd offers the possibility of doing this. In
+this example, we check that a reported LB event is indeed due to a low
+battery. First, we need to add two timers to upssched.conf
+
+--------------------------------------------------------------------------------
+
+# Defective UPS - UPS reports LB for no reason!
+# Give warning and turn off box rapidly
+AT LOWBATT * START-TIMER check-low-battery-timer 5
+AT LOWBATT * START-TIMER low-battery-shutdown-timer 65
+
+--------------------------------------------------------------------------------
+
+It is also possible to cancel these timers with declarations such as
+
+--------------------------------------------------------------------------------
+
+# Defective UPS - clean up timers
+AT ONLINE * CANCEL-TIMER check-low-battery-timer
+AT ONLINE * CANCEL-TIMER low-battery-shutdown-timer
+
+--------------------------------------------------------------------------------
+
+but we cannot rely on receiving the status change ONLINE. To get around the
+difficulty we add further declarations
+
+--------------------------------------------------------------------------------
+
+# Defective UPS - clean up timers
+AT SIGUSR1 * CANCEL-TIMER check-low-battery-timer
+AT SIGUSR2 * CANCEL-TIMER low-battery-shutdown-timer
+
+--------------------------------------------------------------------------------
+
+and drive the timer cleanup directly from the CMDSCRIPT upssched-cmd with
+commands such as
+
+--------------------------------------------------------------------------------
+
+        check-low-battery-timer)
+            if [[ $CHARGE -gt 70 && ("$STATUS" = "OL CHRG" || "$STATUS" = "OL CHRG LB") ]]
+            then # False alarm. Cancel all low battery shutdown timers.
+                killall -SIGUSR1 upsd
+                logger -i -t upssched-cmd "LB warning, but state=$STATE. Stopping shutdown-timer."
+            else # OK we believe it, let the low battery shutdown timer run
+                logger -i -t upssched-cmd "LB warning, state=$STATE, allowing shutdown."
+            fi
+            ;;
+
+--------------------------------------------------------------------------------
+
+In general the user signals SIGUSR1 and SIGUSR2 provide a mechanism for
+injecting status change events into NUT from outside. These events may be
+used for notifications and timer management, and may come from the CMDSCRIPT
+upssched-cmd or elsewhere.
 Background
 ~~~~~~~~~~
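
For anyone who wants to try the check-low-battery-timer decision without a UPS attached, here is a rough sketch of the same test as a standalone shell function. It is only an illustration: the function name is_false_low_battery is hypothetical, the threshold of 70 and the two status strings are copied from the example in the patch above, and the readings are canned values rather than live upsc output. Signalling upsd with SIGUSR1 only does anything if this proposal is applied, so the sketch just prints its verdict.

```shell
#!/bin/sh
# Sketch only: is_false_low_battery is a hypothetical helper, not part of NUT.
# It mirrors the test in the patch's check-low-battery-timer example:
# the charge must be above 70 and the status must be "OL CHRG" or "OL CHRG LB".
is_false_low_battery() {
    flb_status=$1
    flb_charge=$2
    # A missing or non-integer charge reading is never treated as a false
    # alarm, so the shutdown timer is left running in case of doubt.
    case $flb_charge in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$flb_charge" -gt 70 ] || return 1
    case $flb_status in
        "OL CHRG"|"OL CHRG LB") return 0 ;;
        *) return 1 ;;
    esac
}

# Dry run with canned "status:charge" readings (assumed values). In a real
# CMDSCRIPT these would come from upsc, as in the patch's example.
for reading in "OL CHRG LB:95" "OB DISCHRG LB:12" "OL CHRG:40"; do
    status=${reading%%:*}
    charge=${reading##*:}
    if is_false_low_battery "$status" "$charge"; then
        echo "state=$status charge=$charge%: false alarm, would signal upsd"
    else
        echo "state=$status charge=$charge%: believed, shutdown timer keeps running"
    fi
done
```

Keeping the decision in a function of its own makes it easy to exercise from the command line, in the same spirit as the TIP about calling upssched-cmd directly.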