Don Bowman wrote:
> From: Uwe Doering [mailto:gemini@geminix.org]
> ...
>
>>As far as I understand this family of controllers the OS
>>drivers aren't involved at all in case of a disk drive
>>failure. It's strictly the controller's business to deal
>>with it internally. The OS just sits there and waits until
>>the controller is done with the retries and either drops into
>>degraded mode or recovers from the disk error.
>>
>>That's why I initially speculated that there might be a
>>timeout somewhere in PostgreSQL or FreeBSD that leads to data
>>loss if the controller is busy for too long.
>>
>>A somewhat radical way to at least make these failures as
>>rare an event as possible would be to deliberately fail all
>>remaining old disk drives, one after the other of course, in
>>order to get rid of them. And if you are lucky the problem
>>won't happen with newer drives anyway, in case the root cause
>>is an incompatibility between the controller and the old drives.
>
> Started that yesterday. I've got one 'old' one left.
> Sadly, the one that failed night before last was not one of the
> 'old' ones, so this is no guarantee :)
>
> From the raidutil -e log, I see this type of info. I'm not sure
> what the 'unknown' events are. The 'CRC Failure' is probably the
> problem? There's also Bad SCSI Status, unit attention, etc.
> Perhaps the driver doesn't deal with these properly?

In my opinion what the log shows in this case is internal communication
between the controller and the disk drives. The OS driver is not
involved. In the past I've seen CRC errors like these as a result of
bad cabling or contact problems. You may want to check the SCSI cables.
They have to be properly terminated, and there must not be any sharp
kinks, given the signal frequencies involved these days. Also, pluggable
drive bays can cause this. Every electrical contact is a potential
source of trouble. Finally, faulty or overloaded power supplies can
cause glitches like these. This can be especially hard to debug.

When these hardware issues have been taken care of you may want to start
a RAID verification/correction run. If it shows any inconsistencies
this may be an indication of former hardware glitches. I'm not sure
whether you can trigger that process through 'raidutil'. I've always
used the X11 'dptmgr' program. You can terminate it after having
started the verification. It continues to run in the background (inside
the controller).

Uwe
--
Uwe Doering | EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org | http://www.escapebox.net