thr3ads.net - freebsd stable - Adaptec 3210S, 4.9-STABLE, corruption when disk fails [Apr 2005]


Don Bowman

2005-Apr-01 08:27 UTC

Adaptec 3210S, 4.9-STABLE, corruption when disk fails

From: Uwe Doering [mailto:gemini@geminix.org] 
 ...
> As far as I understand this family of controllers the OS 
> drivers aren't involved at all in case of a disk drive 
> failure.  It's strictly the controller's business to deal 
> with it internally.  The OS just sits there and waits until 
> the controller is done with the retries and either drops into 
> degraded mode or recovers from the disk error.
> 
> That's why I initially speculated that there might be a 
> timeout somewhere in PostgreSQL or FreeBSD that leads to data 
> loss if the controller is busy for too long.
> 
> A somewhat radical way to at least make these failures as 
> rare an event as possible would be to deliberately fail all 
> remaining old disk drives, one after the other of course, in 
> order to get rid of them.  And if you are lucky the problem 
> won't happen with newer drives anyway, in case the root cause 
> is an incompatibility between the controller and the old drives.

Started that yesterday. I've got one 'old' one left.
Sadly, the one that failed night before last was not one of the
'old' ones, so this is no guarantee :)

From the raidutil -e log, I see this type of info. I'm not sure
what the 'unknown' events are. The 'CRC Failure' is probably the
problem? There's also Bad SCSI Status, unit attention, etc.
Perhaps the driver doesn't deal with these properly?

$ raidutil -e d0
03/31/2005  23:37:59   Level 1
Lock for Channel 0 : Started


03/31/2005  23:37:59   Level 1
Lock for Channel 1 : Started


03/31/2005  23:38:09   Level 1
Lock for Channel 0 : Stopped


03/31/2005  23:38:22   Level 1
Lock for Channel 1 : Stopped


03/31/2005  23:38:22   Level 4
HBA=0 BUS=0 ID=0 LUN=0
Status Change
Optimal   => Degraded - Drive Failed


03/31/2005  23:38:22   Level 1
Unknown Event : 56 10 00 08 EE 89 4C 42 00 00 00 00 


03/31/2005  23:38:22   Level 1
CRC Failure
Number of dirty blocks = -1
FFFFFFFF D30A1F2A 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 


03/31/2005  23:38:24   Level 3
HBA=0 BUS=0 ID=0 LUN=0
Bad SCSI Status - Check Condition
28 00 00 00 00 00 00 00 01 00 00 00 


03/31/2005  23:38:24   Level 3
HBA=0 BUS=0 ID=0 LUN=0
Request Sense
70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00 
Unit Attention
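
For readers following the log: the Request Sense bytes above look like
fixed-format SCSI sense data, which can be decoded by hand. Below is a
minimal Python sketch, assuming the 18 logged bytes use the standard
fixed layout (response code 70h). The function name and the small
lookup tables are illustrative only; they are not part of raidutil, and
the ASC/ASCQ table is deliberately incomplete.

```python
# Sketch: decode fixed-format SCSI sense data (response code 0x70/0x71).
# Assumption: the bytes printed under "Request Sense" are raw sense data.

SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
    0x7: "Data Protect",
    0xB: "Aborted Command",
}

# Only the ASC 29h family is listed here; everything else is "unknown".
ASC_ASCQ = {
    (0x29, 0x00): "Power on, reset, or bus device reset occurred",
    (0x29, 0x01): "Power on occurred",
    (0x29, 0x02): "SCSI bus reset occurred",
    (0x29, 0x03): "Bus device reset function occurred",
}


def decode_fixed_sense(hex_bytes):
    """Return sense key, ASC and ASCQ from a hex string such as '70 00 06 ...'."""
    data = bytes.fromhex(hex_bytes.strip())
    if len(data) < 14:
        raise ValueError("need at least 14 bytes to read ASC/ASCQ")
    response_code = data[0] & 0x7F  # top bit of byte 0 is the VALID flag
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = data[2] & 0x0F            # low nibble of byte 2 is the sense key
    asc, ascq = data[12], data[13]  # additional sense code and qualifier
    return {
        "sense_key": SENSE_KEYS.get(key, "unknown"),
        "asc": asc,
        "ascq": ascq,
        "meaning": ASC_ASCQ.get((asc, ascq), "unknown"),
    }


if __name__ == "__main__":
    logged = "70 00 06 00 00 00 00 0A 00 00 00 00 29 02 02 00 00 00"
    print(decode_fixed_sense(logged))
```

For the bytes logged above this yields sense key Unit Attention with
ASC/ASCQ 29h/02h, i.e. the drive reports that a SCSI bus reset occurred.
That says a reset happened on the bus, not what caused it.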

Uwe Doering

2005-Apr-01 12:12 UTC

Adaptec 3210S, 4.9-STABLE, corruption when disk fails

Don Bowman wrote:
> From: Uwe Doering [mailto:gemini@geminix.org] 
>  ...
> 
>>As far as I understand this family of controllers the OS 
>>drivers aren't involved at all in case of a disk drive 
>>failure.  It's strictly the controller's business to deal 
>>with it internally.  The OS just sits there and waits until 
>>the controller is done with the retries and either drops into 
>>degraded mode or recovers from the disk error.
>>
>>That's why I initially speculated that there might be a 
>>timeout somewhere in PostgreSQL or FreeBSD that leads to data 
>>loss if the controller is busy for too long.
>>
>>A somewhat radical way to at least make these failures as 
>>rare an event as possible would be to deliberately fail all 
>>remaining old disk drives, one after the other of course, in 
>>order to get rid of them.  And if you are lucky the problem 
>>won't happen with newer drives anyway, in case the root cause 
>>is an incompatibility between the controller and the old drives.
> 
> Started that yesterday. I've got one 'old' one left.
> Sadly, the one that failed night before last was not one of the
> 'old' ones, so this is no guarantee :)
> 
> From the raidutil -e log, I see this type of info. I'm not sure 
> what the 'unknown' events are. The 'CRC Failure' is probably the
> problem? There's also Bad SCSI Status, unit attention, etc.
> Perhaps the driver doesn't deal with these properly?

In my opinion, what the log shows in this case is internal communication 
between the controller and the disk drives.  The OS driver is not 
involved.  In the past I've seen CRC errors like these as a result of 
bad cabling or contact problems.  You may want to check the SCSI cables. 
They have to be properly terminated, and there must not be any sharp 
kinks, given the signal frequencies involved these days.  Pluggable 
drive bays can cause this as well.  Every electrical contact is a 
potential source of trouble.  Finally, faulty or overloaded power 
supplies can cause glitches like these, which can be especially hard to 
debug.

When these hardware issues have been taken care of, you may want to start 
a RAID verification/correction run.  If it shows any inconsistencies, 
this may be an indication of earlier hardware glitches.  I'm not sure 
whether you can trigger that process through 'raidutil'.  I've always 
used the X11 'dptmgr' program.  You can terminate it after having 
started the verification.  The verification continues to run in the 
background (inside the controller).

    Uwe
-- 
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net
