>From: Keith Roberts <keith at karsites.net>
>On Wed, 3 Nov 2010, Lamar Owen wrote:
>> Might want to check the power supply as well. Bad/flakey
>> power can indeed cause damage to the drive surface; been
>> there, done that, have two Maxtor 250GB drives with
>> scribbled servo data to prove it.
>OK.
> I'm running the server from an APC UPS Back-UPS 650, so
> there should not be any glitches in the power supply, should
> there?
Probably not on the AC side, although the Back-UPS 650 isn't a full online
UPS but a switching standby unit. Full online units (like the APC Symmetra
16KVA units I have here) rectify to DC, float the batteries at all times, and
run the output from the inverter all of the time (unless they're switched to
bypass). The
SmartUPS 1400RM I had in front of the PC that suffered the glitchy power is,
unless I'm mistaken, also a full online pure sinewave UPS like the Symmetra,
and is still in service (I checked its output on my oscilloscope first, though).
No, I was referring to the output DC voltages (+12V, +5V, +3.3V, -5V, and -12V)
from the power supply inside the system.
In addition to my own personal RAID1 of 250GB drives, I also lost, on a
different occasion, a RAID5 array of 15K 36GB SCSI drives in a Dell 1600SC
server; testing the power supply showed lots of noise and complete dropouts
of a few milliseconds' duration on the drive connectors' 5V supply pins.
That noise completely and thoroughly scrambled the servo data on the Hitachi
drives, meaning they didn't just start showing bad sectors; they started
getting seek errors. The 5V line on the drive connectors was reading 4V AC
RMS superimposed on the +5V, yielding an effective DC voltage of 4V. This
happened over a period of three weeks, during which time I had a number of
mysterious failures (the Hitachi drives were error-correcting so well that by
the time they started reporting errors it was way past too late, and
eventually the drives couldn't even power up). Upon investigation, I found
that the power supply in question provided the motherboard (where the DC
power sensors on that box are) with clean 5V, while the drives were powered
from a separate 5V rail, so the Dell management system wasn't seeing the
power problems.
A simple power supply tester with a built-in meter can be bought for less than
$20; a more thorough power analyzer will run more than that. But even the
simple one caught the failing Dell 1600SC supply. It took an oscilloscope to
test the Antec in my personal box; it turned out to be a cold solder joint in
the Antec. A new power supply is less expensive than the equivalent labor it
took to fix the Antec. I keep a known-good 500W ATX12V server-grade supply
(8-pin 12V plug with adapters, and 24-pin ATX plug with 20-pin adapter)
around for testing; that's one of the very first things I check when a flaky
PC is brought in. (The very first thing is the dust accumulation, and the
second is the heatsink compound.)
One of the first things I do on any CentOS system I put together is install
lm_sensors and gkrellm (gkrellm from a third-party repo). I then enable all the
motherboard sensors that are available in the gkrellm plugins, and run it
(either local GUI or through ssh X forwarding to my central monitoring PC). On
Supermicro boards I install SuperDoctor for Linux, available on the Supermicro
site. The GUI runs well (there are some odd dependencies, however) and will
e-mail you on alarm conditions that you can set. These include fan RPM,
temperatures, and voltages. The CLI program isn't quite so sophisticated,
but it can be run periodically and the result sent by e-mail for health checks.
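If you just want a bare-bones periodic check without a GUI, something like
the following works on a stock CentOS box (a rough sketch; it assumes the
stock lm_sensors package, a working local mail setup for the mail command,
and the recipient address is only a placeholder):

  # install and configure lm_sensors (answer the sensors-detect prompts)
  yum install lm_sensors
  sensors-detect
  service lm_sensors start

  # spot-check voltages, fan RPMs, and temperatures by hand
  sensors

  # cron entry (crontab -e): mail a sensor snapshot once an hour
  0 * * * * /usr/bin/sensors 2>&1 | mail -s "sensor report" root@example.com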
Drives that are having trouble will show up with high iowaits; run iostat (from
the sysstat package) and look at the await result. Long awaits mean the drive
is having trouble (or it has firmware issues like WD's EARS and EADS drives
have in RAID configurations).
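For example (a sketch; the -x flag is what makes iostat print the extended
per-device statistics, including await, in milliseconds):

  # install sysstat, then watch extended per-device stats every 5 seconds
  yum install sysstat
  iostat -x 5

  # await is the average time (in ms) requests spend queued plus being
  # serviced; consistently long awaits, especially under light load, point
  # at a drive that is retrying/error-correcting rather than just busy.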