João Luis Meloni Assirati
2020-Sep-29 20:13 UTC
[Nut-upsdev] Question about hardware failures / FSD
Hello,

The UPS I am developing a driver for can report several flags for critical hardware conditions, such as overheat, overload, inverter failure, output short, etc. What is the correct policy when such a condition occurs? I think that a UPS in such a condition is not reliable and therefore a system shutdown should be called. However, the developer's manual and all the other drivers I inspected seem to set the FSD flag only when a shutdown is already in progress (say, a countdown register is active).

So far none of my questions have been answered, but I hope someone will eventually have some time to help me.

Thanks,
João Luis.
Daniel F. Dickinson
2020-Sep-29 21:23 UTC
[Nut-upsdev] Question about hardware failures / FSD
Hello João,

To get more responses you should send plain-text email (standard for open source development mailing lists) instead of HTML mail.

> The UPS I am developing a driver for can report several flags for critical hardware conditions, such as overheat, overload, inverter failure, output short, etc. What is the correct policy when such a condition occurs? I think that a UPS in such a condition is not reliable and therefore a system shutdown should be called. However, the developer's manual and all the other drivers I inspected seem to set the FSD flag only when a shutdown is already in progress (say, a countdown register is active).

The decision about whether to do a forced shutdown is, I believe, considered a local administrator decision. I would suggest that you expose these conditions as 'events', like low battery or communications failure, and let the admin choose what to do about each event.

> So far none of my questions have been answered, but I hope someone will eventually have some time to help me.

Regards,
Daniel
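A minimal sketch of the "local administrator decision" idea, assuming the driver only names a condition and a per-site table decides the response. The condition names, the action names and `choose_action` are hypothetical illustrations, not NUT configuration keywords or API.

```python
from typing import Dict

# Hypothetical fallback when the admin has not configured a condition.
DEFAULT_ACTION = "log"


def choose_action(policy: Dict[str, str], condition: str) -> str:
    """Return the action the local admin configured for a condition."""
    return policy.get(condition, DEFAULT_ACTION)


# Two sites receive the same conditions but make different local decisions.
cautious_site = {"OVERHEAT": "shutdown", "OVERLOAD": "shutdown", "LB": "shutdown"}
tolerant_site = {"OVERHEAT": "notify", "LB": "shutdown"}
```

With these tables, `choose_action(cautious_site, "OVERHEAT")` selects a shutdown while `choose_action(tolerant_site, "OVERHEAT")` only notifies, without any change on the driver side.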
João Luis Meloni Assirati
2020-Sep-29 21:52 UTC
[Nut-upsdev] Question about hardware failures / FSD
Hello Daniel!

On Tue, Sep 29, 2020 at 6:23 PM Daniel F. Dickinson <cshore at thecshore.com> wrote:
>
> Hello João,
>
> To get more responses you should send plain-text email (standard for open source development mailing lists) instead of HTML mail.

I just can't believe that I was sending HTML e-mail. Thank you very much for bringing this to my attention. I hope it is OK now.

> > The UPS I am developing a driver for can report several flags for critical hardware conditions, such as overheat, overload, inverter failure, output short, etc. What is the correct policy when such a condition occurs? I think that a UPS in such a condition is not reliable and therefore a system shutdown should be called. However, the developer's manual and all the other drivers I inspected seem to set the FSD flag only when a shutdown is already in progress (say, a countdown register is active).
>
> The decision about whether to do a forced shutdown is, I believe, considered a local administrator decision. I would suggest that you expose these conditions as 'events', like low battery or communications failure, and let the admin choose what to do about each event.

I see. I think I understand it better now. The driver's job is to report hardware status. The decision about what to do with this status belongs to the clients (upsmon or any other client talking to upsd). This will simplify my code.

However, I think there should be some universal flag telling upsmon that something really bad is going on. Maybe raising alarms would do, but as far as I understand there are no standard alarm messages. Some drivers use tokens like OVERHEAT, others use textual messages like "The UPS is too hot!". Can you please send me some advice on this matter?

> > So far none of my questions have been answered, but I hope someone will eventually have some time to help me.
>
> Regards,
> Daniel

Thank you,
João Luis.
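The driver-side half of that split could look roughly like the sketch below: raw fault bits are translated into short tokens and nothing else is decided. The bit positions, the token spellings (other than OVERHEAT and OVERLOAD, which appear in this thread) and the function names are assumptions for illustration; a real device's protocol documentation defines the actual register layout, and this is not the NUT driver API.

```python
from typing import List

# Hypothetical bit layout for a UPS fault register.
FAULT_BITS = {
    0x01: "OVERHEAT",
    0x02: "OVERLOAD",
    0x04: "INVERTER_FAILURE",
    0x08: "OUTPUT_SHORT",
}


def decode_faults(register: int) -> List[str]:
    """Translate raw fault bits into tokens, lowest bit first."""
    return [name for bit, name in sorted(FAULT_BITS.items()) if register & bit]


def alarm_text(register: int) -> str:
    """Join the active tokens into one space-separated string (assumed format)."""
    return " ".join(decode_faults(register))
```

Using fixed tokens rather than free text such as "The UPS is too hot!" lets a client match on the condition without parsing driver-specific wording.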
On Tue, 29 Sep 2020, João Luis Meloni Assirati wrote:

> The UPS I am developing a driver to is able to report several flags for
> critical hardware conditions, like overheat, overload, inverter failure,
> output short etc. What should be the correct policy of operation when such a
> condition occurs? I think that the an UPS in such a condition is not reliable
> and therefore a system shutdown should be called. However, the developer's
> manual and all other drivers I inspected seem to call the FSD flag only when
> there is a shutdown already in progress (say, a countdown register is active).

I attach a list of the current NUT status changes. Of interest to you will be 16 and 17, since they introduce new status names. They are used by a heartbeat mechanism based on the dummy-ups driver. I suggest you introduce new status names and new status changes. These will be monitored by an upsmon+upssched+upssched-cmd setup extended to cover the new statuses.

I suggest you choose generic names which can be used by others in the future. For example HW1, HW2, ... for the hardware error conditions, with a correspondence table for your UPS in the documentation. Other drivers might have other correspondence tables.

Although maybe OVERHEAT and OVERLOAD are already sufficiently generic. It's up to you.

Roger

EVENTS based on upsd status changes

 1. None->ALARM    ALARM->None
    The UPS has raised/dropped the ALARM signal.
 2. None->BOOST    BOOST->None
    The UPS is now boosting/not boosting the output voltage.
 3. None->BYPASS   BYPASS->None
    The UPS is/is not now bypassing its own batteries.
 4. None->CAL      CAL->None
    The UPS is/is not now in calibration mode.
 5. None->CHRG     CHRG->None
    The UPS is/is not now recharging its batteries.
 6. None->DISCHRG  DISCHRG->None
    The UPS is/is not now discharging its batteries.
 7. None->LB       LB->None
    The driver says the UPS battery charge is now low/no longer low.
 8. None->OFF      OFF->None
    The driver says the UPS is/is not now OFF.
 9. OL->OB         OB->OL
    The UPS is now on battery/no longer on battery.
10. None->OVER     OVER->None
    The UPS is/is not now in status [OVER].
11. None->RB       RB->None
    The UPS needs/no longer needs to have its battery replaced.
12. None->TEST     TEST->None
    The UPS is/is not now performing a test.
13. None->TRIM     TRIM->None
    The UPS is now trimming/not trimming the output voltage.

Other EVENTS monitored by upsmon, upssched, upssched-cmd

14. LIVE->DEAD     DEAD->LIVE
    Communication with the UPS is now lost/restored.
15. None->FSD      FSD->None
    The UPS is/is not now in Forced ShutDown mode.
16. None->TICK     TICK->None
    A heartbeat UPS has/has not generated a [TICK].
17. None->TOCK     TOCK->None
    A heartbeat UPS has/has not generated a [TOCK].
18. TIMEOUT(my-timer)
    Timer “my-timer” has completed.
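The status-change notation in the list above can be derived mechanically by comparing two successive status readings. The sketch below assumes the status is a space-separated string of tokens such as "OL CHRG"; `status_changes` is a hypothetical helper written for illustration, not part of upsmon or upssched. A new token such as OVERHEAT needs no special handling.

```python
from typing import List, Optional, Set

# OL and OB are reported as a direct transition (item 9), not via None.
POWER_SOURCES = ("OL", "OB")


def _power_source(tokens: Set[str]) -> Optional[str]:
    """Return the power-source token present in a status set, if any."""
    for src in POWER_SOURCES:
        if src in tokens:
            return src
    return None


def status_changes(old: str, new: str) -> List[str]:
    """List the transitions between two space-separated status strings."""
    before, after = set(old.split()), set(new.split())
    events = []
    src_before, src_after = _power_source(before), _power_source(after)
    if src_before and src_after and src_before != src_after:
        events.append(f"{src_before}->{src_after}")
    # Every other token is either raised (None->X) or dropped (X->None).
    before -= set(POWER_SOURCES)
    after -= set(POWER_SOURCES)
    events += [f"None->{tok}" for tok in sorted(after - before)]
    events += [f"{tok}->None" for tok in sorted(before - after)]
    return events


print(status_changes("OL CHRG", "OB DISCHRG LB"))
# prints ['OL->OB', 'None->DISCHRG', 'None->LB', 'CHRG->None']
```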
On Wed, 30 Sep 2020, Roger Price wrote:

> 10. None->OVER     OVER->None
>     The UPS is/is not now in status [OVER].

This is sufficiently ambiguous to be of little use. Grepping for OVER in the source code, I see UPS units reporting OVERHEAT, OVERLOAD and OVERVOLTAGE. Maybe there are other cases of "OVER...". It seems to me that introducing OVERHEAT and OVERLOAD is definitely worthwhile.

Roger
João Luis Meloni Assirati
2020-Oct-01 20:35 UTC
[Nut-upsdev] Question about hardware failures / FSD
On Wed, Sep 30, 2020 at 3:37 AM Roger Price <roger at rogerprice.org> wrote:
>
> On Tue, 29 Sep 2020, João Luis Meloni Assirati wrote:
>
> > The UPS I am developing a driver to is able to report several flags for
> > critical hardware conditions, like overheat, overload, inverter failure,
> > output short etc. What should be the correct policy of operation when such a
> > condition occurs? I think that the an UPS in such a condition is not reliable
> > and therefore a system shutdown should be called. However, the developer's
> > manual and all other drivers I inspected seem to call the FSD flag only when
> > there is a shutdown already in progress (say, a countdown register is active).
>
> I attach a list of the current NUT status changes. Of interest to you will be
> 16 and 17, since they introduce new status names. They are used by a heartbeat
> mechanism based on the dummy-ups driver. I suggest you introduce new status
> names and new status changes. These will be monitored by an
> upsmon+upssched+upssched-cmd setup extended to cover the new statuses.
>
> I suggest you choose generic names which can be used by others in the future.
> For example HW1, HW2, ... for the hardware error conditions, with a
> correspondence table for your UPS in the documentation. Other drivers might
> have other correspondence tables.
>
> Although maybe OVERHEAT and OVERLOAD are already sufficiently generic. It's up
> to you.
>
> Roger
>
> EVENTS based on upsd status changes
>
>  1. None->ALARM    ALARM->None
>     The UPS has raised/dropped the ALARM signal.
>  2. None->BOOST    BOOST->None
>     The UPS is now boosting/not boosting the output voltage.
>  3. None->BYPASS   BYPASS->None
>     The UPS is/is not now bypassing its own batteries.
>  4. None->CAL      CAL->None
>     The UPS is/is not now in calibration mode.
>  5. None->CHRG     CHRG->None
>     The UPS is/is not now recharging its batteries.
>  6. None->DISCHRG  DISCHRG->None
>     The UPS is/is not now discharging its batteries.
>  7. None->LB       LB->None
>     The driver says the UPS battery charge is now low/no longer low.
>  8. None->OFF      OFF->None
>     The driver says the UPS is/is not now OFF.
>  9. OL->OB         OB->OL
>     The UPS is now on battery/no longer on battery.
> 10. None->OVER     OVER->None
>     The UPS is/is not now in status [OVER].
> 11. None->RB       RB->None
>     The UPS needs/no longer needs to have its battery replaced.
> 12. None->TEST     TEST->None
>     The UPS is/is not now performing a test.
> 13. None->TRIM     TRIM->None
>     The UPS is now trimming/not trimming the output voltage.
>
> Other EVENTS monitored by upsmon, upssched, upssched-cmd
>
> 14. LIVE->DEAD     DEAD->LIVE
>     Communication with the UPS is now lost/restored.
> 15. None->FSD      FSD->None
>     The UPS is/is not now in Forced ShutDown mode.
> 16. None->TICK     TICK->None
>     A heartbeat UPS has/has not generated a [TICK].
> 17. None->TOCK     TOCK->None
>     A heartbeat UPS has/has not generated a [TOCK].
> 18. TIMEOUT(my-timer)
>     Timer “my-timer” has completed.

Thank you, that was very helpful.

Are there also guidelines for alarms? The Developer Guide says there are no official alarms yet and that I should ask here. I think that alarms should cover not only serious hardware problems, but also important messages like "you should perform a runtime calibration".

Thank you,
João Luis.