I have a machine running:

$ uname -a
FreeBSD machine.phaedrus.sandvine.com 4.9-STABLE FreeBSD 4.9-STABLE #0:
Fri Mar 19 10:39:07 EST 2004
user@machine.phaedrus.sandvine.com:/usr/src/sys/compile/LABDB i386

It has an Adaptec 3210S RAID controller running a single RAID-5, and
runs PostgreSQL 7.4.6 as its primary application.

Three times now I have had a drive fail, and have had corrupted files in
the PostgreSQL cluster at the same time.

The timing is too closely correlated to be a coincidence. The filesystem
passes fsck by the time I get to it a couple of hours later, and seems
to be OK (with the failed drive, the RAID is in 'degraded' mode).

It appears that the drive failure and the PostgreSQL corruption occur at
exactly the same time (monitored with Nagios, to within one hour). It
would appear that bad data was returned for some file(s).

Does anyone have any suggestions?

$ raidutil -L all
RAIDUTIL Version: 3.04  Date: 9/27/2000  FreeBSD CLI Configuration Utility
Adaptec ENGINE Version: 3.04  Date: 9/27/2000  Adaptec FreeBSD SCSI Engine

# b0 b1 b2  Controller  Cache  FW    NVRAM     Serial       Status
---------------------------------------------------------------------------
d0 -- --    ADAP3210S   16MB   370F  ADPT 1.0  BF0A21700J7  Optimal

Physical View
Address   Type               Manufacturer/Model  Capacity  Status
---------------------------------------------------------------------------
d0b0t0d0  Disk Drive (DASD)  SEAGATE ST318453LW  17501MB   Optimal
d0b0t1d0  Disk Drive (DASD)  SEAGATE ST318453LW  17501MB   Optimal
d0b0t2d0  Disk Drive (DASD)  IBM DNES-318350W    17501MB   Optimal
d0b1t3d0  Disk Drive (DASD)  IBM DNES-318350W    17501MB   Optimal
d0b1t4d0  Disk Drive (DASD)  SEAGATE ST318452LW  17501MB   Optimal
d0b1t5d0  Disk Drive (DASD)  IBM DNES-318350W    17501MB   Optimal
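[Editor's note: between incidents, a periodic read scrub over the cluster directory is one way to surface unreadable files early, before the next drive failure. A minimal sketch; the function and path names are mine, not from the thread:

```python
import os

def read_scrub(root, block=65536):
    """Read every byte of every file under root; collect paths that
    raise I/O errors, which on failing media typically surface as EIO."""
    bad = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    # Read the file to the end in fixed-size blocks.
                    while f.read(block):
                        pass
            except OSError as err:
                bad.append((path, err.errno))
    return bad
```

A clean scrub only proves the files are readable, not that their contents are sane; reads satisfied from the page cache may also never touch the platters, so it complements rather than replaces an application-level check such as a full dump.]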
Don Bowman wrote:
> It has an adaptec 3210S raid controller running a single raid-5, and
> runs postgresql 7.4.6 as its primary application.
>
> 3 times now I have had a drive fail, and have had corrupted files in
> the postgresql cluster @ the same time.
> [...]
> Does anyone have any suggestions?

In my experience, in a situation like this RAID controllers can block
the system for up to a couple of minutes, trying to revive a failed disk
drive by sending it bus reset commands and the like, until they
eventually give up and drop into degraded mode. With sufficiently
patient applications this is no problem, but if a program runs into
internal timeouts during this period of time bad things can happen.

My point is that while the disk controller may trigger the problem, the
instance that actually corrupts the database might be PostgreSQL itself.
Of course, I'm aware that it's going to be quite hard to tell for sure
who the culprit is.

   Uwe
--
Uwe Doering         |  EscapeBox - Managed On-Demand UNIX Servers
gemini@geminix.org  |  http://www.escapebox.net
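[Editor's note: the failure mode Uwe describes can be demonstrated in isolation. Whether PostgreSQL actually behaves this way is not established by the thread (it uses fsync and write-ahead logging precisely to survive such stalls); the toy below, with names of my own choosing, only shows how an application-internal timeout during a multi-minute device stall leaves a truncated file behind:

```python
import threading
import time

def stalling_write(path, chunks, stall_before, stall_secs):
    """Write chunks to path, pausing before chunk number `stall_before`
    the way a controller stalls while trying to revive a failing drive."""
    with open(path, "wb") as f:
        for i, chunk in enumerate(chunks):
            if i == stall_before:
                time.sleep(stall_secs)  # controller busy with bus resets
            f.write(chunk)
            f.flush()

def impatient_write(path, chunks, timeout, stall_secs):
    """Run the write in a background thread and give up after `timeout`
    seconds, like an application with an internal timeout.
    Returns True only if the write finished in time."""
    t = threading.Thread(target=stalling_write,
                         args=(path, chunks, 1, stall_secs),
                         daemon=True)
    t.start()
    t.join(timeout)
    return not t.is_alive()
```

With no stall the write completes; with a stall longer than the timeout the caller walks away after the first chunk, and whatever was supposed to follow simply never lands on disk.]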
From: Uwe Doering [mailto:gemini@geminix.org]
> Don Bowman wrote:
> > I have a machine running:
> > [...]

I have merged asr.c from RELENG_4 to pick up this fix:

  "Fix a mis-merge in the MFC of rev. 1.64 in rev. 1.3.2.3; the
   following change wasn't included:

   - Set the CAM status to CAM_SCSI_STATUS_ERROR rather than
     CAM_REQ_CMP in case of a CHECK CONDITION."

since I guess it's conceivable this could cause my problem.

--don
From: Uwe Doering [mailto:gemini@geminix.org]
> [...]
> Did you merge 1.3.2.3 as well? This actually should have been one MFC

Yes, merged from RELENG_4.

I will post later if this happens again, but it will be quite a long
time. The machine has 7 drives in it, and only 3 are left that are old
enough that they might fail before I take it out of service. (It
originally had 7 1999-era IBM drives; now it has 4 2004-era Seagate
drives and 3 of the old IBMs. The drives have been in continuous
service, so they've led a pretty good life!)

Thanks for the suggestion on the CAM timeout; I've set that value.

--don
From: owner-freebsd-stable@freebsd.org
> From: Uwe Doering [mailto:gemini@geminix.org]
> > [...]
> > Did you merge 1.3.2.3 as well? This actually should have been one MFC
>
> Yes, merged from RELENG_4.
> [...]
> Thanks for the suggestion on the cam timeout, I've set that value.

Another drive failed and the same thing happened. After the failure the
RAID worked in degraded mode just fine, but many files had been
corrupted during the failure.

So I would suggest that the merge did not help, and the CAM timeout did
not help either.

This is very frustrating; once again I am rebuilding my PostgreSQL
install from backup :(

--don
From: Uwe Doering [mailto:gemini@geminix.org]
> Don Bowman wrote:
> > Another drive failed and the same thing happened. After the failure,
> > the raid worked in degraded mode just fine, but many files had been
> > corrupted during the failure.
> > [...]
>
> This is indeed unfortunate. Maybe the problem is in fact located
> neither in PostgreSQL nor in FreeBSD but in the controller itself.
> Does it have the latest firmware? The necessary files should be
> available on Adaptec's website, and you can use the 'raidutil' program
> under FreeBSD to upload the firmware to the controller. I have to
> concede, however, that I never did this under FreeBSD myself. If I
> recall correctly I did the upload via a DOS diskette the last time.
>
> If this doesn't help either you could ask Adaptec's support for help.
> You need to register the controller first, if memory serves.

The latest firmware & BIOS is in the controller (upgraded the last time
I had problems). I have tried Adaptec support; the controller is
registered.

The problem is definitely not in PostgreSQL.
Files go missing in directories that are having new entries added (e.g.
I lost a 'PG_VERSION' file). Data within the PostgreSQL files becomes
corrupt. Since the only application running is PostgreSQL, and it is the
one that reads/writes/fsyncs the data, it's not unexpected that it's the
one that reaps the 'rewards' of the failure.

I have to believe this is either a bug in the controller, or a problem
in cam or asr.

--don
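[Editor's note: the symptoms above (a vanished PG_VERSION, silently rewritten data files) can be pinned down precisely by keeping a checksum manifest of the cluster directory taken while it is known good, e.g. right after a restore, and diffing against it after an incident. A minimal sketch, with all names mine:

```python
import hashlib
import os

def snapshot(root):
    """Map each file under root (by relative path) to its SHA-256."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(65536), b""):
                    h.update(block)
            manifest[os.path.relpath(path, root)] = h.hexdigest()
    return manifest

def diff(before, after):
    """Report files that vanished or changed between two snapshots."""
    missing = sorted(set(before) - set(after))
    changed = sorted(k for k in before
                     if k in after and before[k] != after[k])
    return missing, changed
```

On a live cluster ordinary writes change checksums too, so the comparison is only meaningful against a quiesced or freshly restored directory; its value here would be distinguishing files the failure event touched from files PostgreSQL legitimately rewrote.]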