Peter Selinger
2007-Jan-04 18:52 UTC
[Nut-upsdev] Re: [nut-Patches][303751] Checking UPS Temperature
One disadvantage of handling it through a script is that it will not be done by default. Most users probably don't know about the problem of burning batteries, as it is not very common.

A potential problem with Eric Wilde's patch is that it is not general enough; some UPS models have a boolean OVERHEAT flag although they don't report the actual temperature. So the UPSOVERTEMP mechanism will not work for such models.

The decision which temperature is "too high" should be made in the driver, not in upsmon, as the normal operating temperature could differ for different devices. For devices that support temperature readings, this could be based on a threshold (which can be made user-settable via a driver configuration variable if necessary). Devices that only have a boolean OVERHEAT flag should report this flag directly.

A few drivers already support the OVERHEAT flag in ups.status. (None seem to support ups.alarm, although this was perhaps originally intended for this purpose.) I wonder if it would make sense to allow upsmon to react to the OVERHEAT flag.

-- Peter

> Comment By: Arjen de Korte (adkorte-guest)
> Date: 2007-01-04 16:35
>
> Message:
> You can handle this through a script that monitors the UPS
> temperature through upsc without any changes to upsmon by
> running
>
>     upsc myups@somewhere ups.temperature
>
> and parse the results. If it determines the temperature is
> too high, it could send off a message to the operator or
> switch it off through sending an instcmd to the UPS to
> shutdown the UPS and keep it off.
>
> I'm not in favor of doing this in upsmon, since only the
> 'ups.status' is guaranteed to be available for each driver.
> If we start adding variables that *might* be supported,
> there is no end to the number of possible variables. Where
> would we stop?
>
> Furthermore, polling for the temperature doesn't need to be
> done as frequently as the line voltage, since it won't
> change that quickly (unless the UPS *is* on fire already).
> You don't need the near instantaneous reaction like we have
> for input/battery state changes.
>
> Adding 'TEMP' to the 'ups.status' might be a good idea, but
> requires changes to the driver. It would be a much better
> option than changing upsmon in the way proposed here though.
> It should be the driver to decide something is not right and
> upsmon then acts upon that notice. I'm against reversing the
> order of events, since if upsmon is somehow not able to talk
> to the driver, nothing is done to resolve the situation.

nut-patches@alioth.debian.org wrote:
> Patches item #303751, was opened at 2006-08-12 00:04
>
> Status: Closed
> Priority: 3
> Submitted By: Eric Wilde (ewilde-guest)
> Assigned to: Nobody (None)
> Summary: Checking UPS Temperature
>
> Resolution: Rejected
> Group: None
> Category: None
>
>
> Initial Comment:
> Last week, one of my UPS burned the batteries up (plates buckled, cases bulging, several of the sealed vent caps opened, plastic welded together). The batteries eventually appear to have shorted and the UPS shut down, without warning, despite being on line power (lucky the equipment it was powering had a sense of humor). From reading the log file posthumously, I see that the internal temps in the UPS reached 81 degrees Celsius, which is pretty hot.
>
> Normal operating temperatures for this UPS are in the 40-50 degree range. It went up into the 75-80 degree range 36 hours before the batteries shorted out so it appears that increased temperature is an excellent predictor of battery failure.
>
> This being the case, I added the following code to upsmon to monitor temperature (changes based on nut-2.0.0).
>
> Eric Wilde
>
>
> --- upsmon.h.orig 2004-03-08 07:09:28.000000000 -0500
> +++ upsmon.h 2006-08-11 13:38:03.000000000 -0400
> @@ -29,4 +29,5 @@
>  /* was ST_FIRST 0x080 */
>  #define ST_CONNECTED 0x100 /* upscli_connect returned OK */
> +#define ST_OVERTEMP 0x200 /* UPS is running overtemp */ //EW
>
>  /* required contents of flag file */
> @@ -72,4 +73,5 @@
>  #define NOTIFY_NOCOMM 8 /* UPS hasn't been contacted in awhile */
>  #define NOTIFY_NOPARENT 9 /* privileged parent process died */
> +#define NOTIFY_OVERTEMP 10 /* UPS went to overtemp */ //EW
>
>  /* notify flag values */
> @@ -101,4 +103,5 @@
>  { NOTIFY_NOCOMM, "NOCOMM", NULL, "UPS %s is unavailable", 0 },
>  { NOTIFY_NOPARENT, "NOPARENT", NULL, "upsmon parent process died - shutdown impossible", 0 },
> + { NOTIFY_OVERTEMP, "OVERTEMP", NULL, "UPS %s is running at an excessive temperature", 0 }, //EW
>  { 0, NULL, NULL, NULL, 0 }
>  };
>
>
> --- upsmon.c.orig 2004-01-31 16:00:02.000000000 -0500
> +++ upsmon.c 2006-08-11 16:11:15.000000000 -0400
> @@ -50,4 +50,7 @@
>  static int rbwarntime = 43200;
>
> + /* default UPS overtemp value (degrees Celsius - 0.0 means ignore) */ //EW
> +static double upsovertemp = 0.0; //EW
> +
>  /* default "all communications down" warning interval (seconds) */
>  static int nocommwarntime = 300;
> @@ -546,4 +549,13 @@
>      }
>
> +//EW >>>>>>
> +    if (!strcmp(var, "temp")) {
> +        query[0] = "VAR";
> +        query[1] = ups->upsname;
> +        query[2] = "ups.temperature";
> +        numq = 3;
> +    }
> +//EW <<<<<<
> +
>      if (numq == 0) {
>          upslogx(LOG_ERR, "get_var: programming error: var=%s", var);
> @@ -770,4 +782,21 @@
>  }
>
> +//EW >>>>>>
> +static void ups_overtemp(utype *ups)
> +{
> +    if (flag_isset(ups->status, ST_OVERTEMP)) { /* no change */
> +        debug("ups_overtemp(%s) (no change)\n", ups->sys);
> +        return;
> +    }
> +
> +    debug("ups_overtemp(%s) (first time)\n", ups->sys);
> +
> +    /* must have changed from !OVERTEMP to OVERTEMP, so notify */
> +
> +    do_notify(ups, NOTIFY_OVERTEMP);
> +    setflag(&ups->status, ST_OVERTEMP);
> +}
> +//EW <<<<<<
> +
>  /* cleanly close the connection to a given UPS */
>  static void drop_connection(utype *ups)
> @@ -1163,4 +1192,12 @@
>      }
>
> +//EW >>>>>>
> +    /* UPSOVERTEMP <num> */
> +    if (!strcmp(arg[0], "UPSOVERTEMP")) {
> +        upsovertemp = atof(arg[1]);
> +        return 1;
> +    }
> +//EW <<<<<<
> +
>      /* NOCOMMWARNTIME <num> */
>      if (!strcmp(arg[0], "NOCOMMWARNTIME")) {
> @@ -1563,4 +1600,31 @@
>  }
>
> +//EW >>>>>>
> +/* deal with the ups.temperature for this ups */
> +static void parse_temperature(utype *ups, char *temperature)
> +{
> +    double temp;
> +
> +    debug(" temperature: [%s]\n", temperature);
> +
> +    /* empty response is ignored -- not all ups return temperatures */
> +    if (!strcmp(temperature, "")) {
> +        clearflag(&ups->status, ST_OVERTEMP);
> +        return;
> +    }
> +
> +    /* get the temperature as a double */
> +    temp = atof(temperature);
> +
> +    /* check the temperature against the overtemp value */
> +    if (temp > upsovertemp)
> +        ups_overtemp(ups);
> +    else
> +        clearflag(&ups->status, ST_OVERTEMP);
> +
> +    debug("\n");
> +}
> +//EW <<<<<<
> +
>  /* see what the status of the UPS is and handle any changes */
>  static void pollups(utype *ups)
> @@ -1578,4 +1642,19 @@
>      debug("polling ups: %s\n", ups->sys);
>
> +//EW >>>>>>
> +    /* if the user wants us to check for overtemp */
> +    if (upsovertemp > 0.0) {
> +        set_alarm();
> +
> +        if (get_var(ups, "temp", status, sizeof(status)) == 0) {
> +            clear_alarm();
> +            parse_temperature(ups, status);
> +        }
> +
> +        /* fallthrough: no communications */
> +        clear_alarm();
> +    }
> +//EW <<<<<<
> +
>      set_alarm();
>
>
>
> +++ upsmon.config (changes somewhere in the config file)
>
> # --------------------------------------------------------------------------
> # UPSOVERTEMP - Temperature (in Celsius) which is too high for operation
> #
> # upsmon will check all UPS that return temperature information against this
> # value. If the UPS temperature exceeds this value, an OVERTEMP notification
> # will be generated.
> #
> # Note that certain UPS are renowned for cooking and even burning up batteries
> # (some reports of spectacular battery fires have been received). From actual
> # observed log data, it appears that prior to burning up the batteries, the
> # UPS internal temperature rises significantly. Hence, monitoring the UPS
> # temperature can be a valuable tool towards detecting battery cooking, before
> # the UPS burns the place down (the UPS is supposed to solve problems, not
> # cause them, isn't it).
> #
> # Once again, typical observed internal temperatures are in the 40 to 50 degree
> # Celsius range. Observed temperatures of 80 degrees Celsius prior to an
> # actual battery failure are indicative of pending failure. Thus, to be safe,
> # the UPSOVERTEMP value should be set in the 60-70 degree range.
>
> UPSOVERTEMP 60.0
>
> # OVERTEMP : The UPSOVERTEMP value has been exceeded (for UPS that return temp)
>
> NOTIFYFLAG OVERTEMP SYSLOG+EXEC
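
For illustration only, the driver-side decision proposed at the top of this message could look roughly like the sketch below. Nothing here is taken from the NUT source tree: the struct, its fields and both function names are hypothetical, and the threshold is assumed to come from a user-settable driver option. It also parses the reading with strtod() rather than atof(), so that a malformed reading is ignored instead of being treated as 0.0.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-device input; none of these names exist in NUT. */
struct overheat_input {
    bool has_flag;        /* device exposes a boolean overheat bit */
    bool flag;            /* value of that bit, valid if has_flag */
    const char *temp_str; /* raw temperature reading, NULL if unsupported */
    double threshold;     /* user-set limit in degrees Celsius; <= 0 disables */
};

/* Parse a reading strictly; returns false for empty or malformed input. */
static bool parse_temperature_strict(const char *s, double *out)
{
    char *end;

    assert(out != NULL);
    if (s == NULL || *s == '\0')
        return false;

    errno = 0;
    *out = strtod(s, &end);

    /* reject range errors, no digits at all, and trailing garbage */
    return errno == 0 && end != s && *end == '\0';
}

/* Decide whether the driver should report OVERHEAT in ups.status. */
static bool driver_should_report_overheat(const struct overheat_input *in)
{
    double temp;

    assert(in != NULL);

    /* a hardware flag, where present, is reported directly */
    if (in->has_flag && in->flag)
        return true;

    /* otherwise fall back to the threshold, if one is configured */
    if (in->threshold > 0.0 && parse_temperature_strict(in->temp_str, &temp))
        return temp > in->threshold;

    return false;
}
```

With a threshold of 60.0, a reading of "81.0" (the value from the original report) would be flagged, while "45.5", an empty string or a non-numeric string would not.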
Arjen de Korte
2007-Jan-04 20:23 UTC
[Nut-upsdev] Re: [nut-Patches][303751] Checking UPS Temperature
Peter Selinger wrote:

> One disadvantage of handling it through a script is that it will not
> be done by default. Most users probably don't know about the problem
> of burning batteries, as it is not very common.

Whatever we do, that isn't going to change (sadly). Since we're adding a new function (with possibly bad side effects), the default should be 'off'. Selecting the maximum temperature should be done in 'ups.conf'. Since it very much depends on the environment in which the UPS is used, there is no sensible default. Set it too low and you risk nuisance tripping (and shutting down a system, which isn't going to boot up automatically). On the other hand, set it too high and you risk burning your UPS without notification (while expecting to be warned about that). Either way, we (as the developers) should *not* take the responsibility. It's probably OK to nag the system administrator if the value is not set, but that's all we can/should do.

> A potential problem with Eric Wilde's patch is that it is not general
> enough; some UPS models have a boolean OVERHEAT flag although they
> don't report the actual temperature.

Where is this flag defined? While looking for it, I found quite a number of other flags that are not documented in 'docs/new-drivers.txt':

    ACFAIL
    AWAITINGPOWER
    BY
    CHRG (mentioned in 'docs/acpi.txt')
    COMMFAULT
    DEPLETED
    DISCHRG (mentioned in 'docs/acpi.txt')
    FAILED
    HB
    NOT_APPLICABLE
    OVERHEAT (mentioned in 'docs/new-names.txt')
    SD
    TEST
    TIP
    UNKNOWN...
    VRANGE

At the very least, flags should be documented in 'docs/new-drivers.txt', in order to maintain consistency throughout the drivers.

[...]

> A few drivers already support the OVERHEAT flag in ups.status. (None
> seem to support ups.alarm, although this was perhaps originally
> intended for this purpose). I wonder if it would make sense to allow
> upsmon to react to the OVERHEAT flag.

Not at the moment, since that would also require a different flag to indicate that the UPS should 'shutdown.stayoff', rather than 'shutdown.return' or 'shutdown.reboot'. Returning or rebooting is not what we want here, because the error condition is not going to improve without manual intervention (taking the UPS offline, replacing the batteries).

Best regards, Arjen
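
To make the last point concrete, here is a rough sketch of how a client could look for OVERHEAT in 'ups.status' and pick a shutdown command accordingly. It assumes 'ups.status' is a space-separated list of flags (for example "OL CHRG"). Both function names are made up for this example and are not part of upsmon, and the choice between 'shutdown.stayoff' and 'shutdown.return' is only the policy discussed above, not existing behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if `flag` occurs as a whole space-separated token in `status`. */
static bool status_has_flag(const char *status, const char *flag)
{
    size_t flen;
    const char *p;

    assert(status != NULL && flag != NULL);

    flen = strlen(flag);
    if (flen == 0)
        return false;

    p = status;
    while (*p != '\0') {
        size_t tlen;

        /* skip separators, then measure the next token */
        while (*p == ' ')
            p++;
        tlen = strcspn(p, " ");

        if (tlen == flen && strncmp(p, flag, flen) == 0)
            return true;

        p += tlen;
    }

    return false;
}

/* Hypothetical policy: an overheated UPS should stay off until serviced. */
static const char *pick_shutdown_command(const char *status)
{
    if (status_has_flag(status, "OVERHEAT"))
        return "shutdown.stayoff";

    return "shutdown.return";
}
```

Matching whole tokens rather than using strstr() avoids false hits if one flag name ever turns out to be a substring of another.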
Arjen de Korte
2007-Jan-05 09:37 UTC
[Nut-upsdev] Re: [nut-Patches][303751] Checking UPS Temperature
> One disadvantage of handling it through a script is that it will not
> be done by default. Most users probably don't know about the problem
> of burning batteries, as it is not very common.

I forgot to add one thing to this. The reason this is not very common is that a UPS should be protected against exactly this kind of situation.

There are two ways to fry a (sealed) lead acid battery: either by passing a very high current through it (short circuiting, for instance) or by overcharging it with a modest current for a long time. The first is obvious and doesn't happen unless the case is opened and the battery terminals are shorted.

The second may happen when the battery is charged to a voltage above 2.47 V/cell (13.65 - 13.80 V for a 6 cell battery). Until that voltage is reached, charging current should be limited to about 1/10th of the rated capacity. If you do that, the cells will prevent overcharging themselves, since the current will drop to essentially zero after a couple of hours. However, when one cell fails (which is guaranteed to happen for every battery pack, unless it is replaced before that happens), the remaining cells can be overcharged if the float voltage is not reduced.

There is a simple way to detect this. Within 8 - 12 hours of charging at 1/10th of the capacity of the battery, the float voltage of 13.65 - 13.80 V must be reached. When that voltage is reached, charging current should go down to essentially zero within another 8 - 12 hours. If either of these doesn't happen, the battery is dead and charging must be stopped.

It surprises me that the UPS in question got a safety approval (but it might not have one). Even if the charge circuit failed, an independent secondary circuit should have prevented the 'meltdown' of the battery pack.

Best regards, Arjen

-- 
Eindhoven - The Netherlands
Key fingerprint - 66 4E 03 2C 9D B5 CB 9B 7A FE 7E C1 EE 88 BC 57