thr3ads.net - freebsd stable - [long] ATA timeout problems on -STABLE [Sep 2004]

Stijn Hoop

2004-Sep-20 06:03 UTC

[long] ATA timeout problems on -STABLE

Hi,

Short explanation:

I have a box running 4.10-RELEASE-p2 suffering from severe ATA timeout
problems once every few days, and I cannot determine for the life of me what's
causing it. I'd really like some hints on how to determine the cause of this,
as I think I've ruled out hardware related stuff.

Long story:

This box consists of

- Gigabyte GA-7N400 Pro F8 mobo
- Athlon XP 2500+ CPU
- 1 Gig PC2700 DDR DRAM in 2 modules
- Adaptec 2940 Ultra SCSI adapter
- 1 x COMPAQ WDE4550W SCSI-2 4G drive for the OS
- 2 x Promise Ultra100 IDE controllers (not the TX2)
- 4 x Maxtor 200G 7200 RPM drives, connected to the master/slave positions
  of the onboard IDE controllers (yes I know better setups exist but these
  drives wouldn't play nice with the Promises)
- 4 x Maxtor 120G 7200 RPM drives, all connected to the 4 master positions
  on the two promise controllers

Now this setup, as you can imagine, draws a lot of power. The problems actually
began a few months back, when I replaced 2 x 60 and 2 x 80 gig drives with the
4 x 200 above. The machine just wouldn't boot or randomly 'lost' IDE drives.

A basic working setup I arrived at was to add a second power supply; I was not
overjoyed at this but at the time I thought it was more power that was needed.
If I determined that this helped enough my plan was to go out and buy an
expensive 550W or 600W model.

Unfortunately, while things appeared to work at first, once in a while, one of
the ATA drives would mysteriously 'fallback to PIO mode' or even indicate that
a block could not be read or written. The first few times I took out the
indicated drive, ran it through the Maxtor test program, and every time
the drive would come back as OK, so it's definitely not the drives.

On the ATA drives are 3 vinum RAID-5 setups, and every time vinum would of
course correctly indicate that the affected volume was running in degraded
mode. For my experiences with hot rebuilding, see my post from a few weeks
back (basically: don't try to do that).

In any case, there was no pattern to the failures -- I have seen the exact
same error messages on both the onboard IDE controllers and the promises, and
with both the 120G and the 200G drives. Here's an example:

Sep 17 12:17:20 sandcat /kernel: ad10: DMA problem fallback to PIO mode
Sep 17 12:17:20 sandcat last message repeated 4 times
Sep 17 12:21:41 sandcat /kernel: ad10s1e: hard error reading fsbn 13008689 of
6504313-6504392 (ad10s1 bn 13008689; cn 12905 tn 7 sn 8) status=59 error=40
Sep 17 12:21:41 sandcat /kernel: vinum: local.p0.s3 is crashed by force
Sep 17 12:21:41 sandcat /kernel: vinum: local.p0 is degraded
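
For reference, the status= and error= values in the hard error line can be
decoded into individual ATA register bits. What follows is only a minimal
sketch: it assumes the kernel prints both values in hexadecimal and that the
classic ATA status/error register bit names apply. The helper names
decode_bits and decode_log_line are made up for illustration.

```python
import re

# Classic ATA status and error register bit names. These are an assumption
# taken from the ATA register layout; the kernel only prints the raw values.
STATUS_BITS = {
    0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
    0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR",
}
ERROR_BITS = {
    0x80: "ICRC/BBK", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
    0x08: "MCR", 0x04: "ABRT", 0x02: "TK0NF", 0x01: "AMNF",
}


def decode_bits(value, names):
    """Return the names of all bits that are set in value."""
    return [name for bit, name in names.items() if value & bit]


def decode_log_line(line):
    """Find 'status=XX error=XX' in a log line and decode both values.

    Assumes the two values are hexadecimal. Returns None when the line
    does not contain the pattern.
    """
    match = re.search(r"status=([0-9a-fA-F]+) error=([0-9a-fA-F]+)", line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    return decode_bits(status, STATUS_BITS), decode_bits(error, ERROR_BITS)


line = "(ad10s1 bn 13008689; cn 12905 tn 7 sn 8) status=59 error=40"
print(decode_log_line(line))
```

Under those assumptions, status=59 decodes to DRDY, DSC, DRQ and ERR, and
error=40 decodes to UNC (uncorrectable data), i.e. the drive itself reported
that it could not read the block.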

Still suspecting power, I have in the meantime replaced one of the PSUs with
another one, and even added a third. All the +12V and +5V amp totals that the
PSUs could deliver were triple-checked against the specs of the drives, mobo
and CPU, and should have been more than enough.

I tried to monitor the voltages with sysutils/xmbmon, and got lines like this:

Temp.= 36.0, 49.0, 43.0; Rot.= 3183,    0, 2710
Vcore = 1.65, 2.62; Volt. = 3.34, 4.27, 11.37,  -5.34, -1.95

which initially confirmed my suspicions. However, the box kept crashing.  So,
urged by some friends, today I took up a multimeter and measured the voltages
on the connectors; and this is where I came away totally clueless, because the
multimeter measured 5.07V on the +5V line and 12.01V on the +12V.

Other than greatly decreasing my confidence in sysutils/xmbmon, this also
shattered my PSU theory.
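
To make the comparison concrete, here is a minimal sketch that checks both
sets of readings against a tolerance band. It assumes the +/-5% tolerance that
the ATX specification gives for the +5V and +12V rails, and it assumes that
the second and third 'Volt.' values in the xmbmon output are the +5V and +12V
readings. The function name in_tolerance is made up for illustration.

```python
# Assumed tolerance for the +5V and +12V rails (ATX allows +/-5% on these).
TOLERANCE = 0.05


def in_tolerance(nominal, measured, tolerance=TOLERANCE):
    """True when measured lies within nominal +/- tolerance * nominal."""
    return abs(measured - nominal) <= nominal * tolerance


# Readings quoted in this mail: nominal rail voltage -> measured value.
readings = {
    "xmbmon": {5.0: 4.27, 12.0: 11.37},
    "multimeter": {5.0: 5.07, 12.0: 12.01},
}

for source, rails in readings.items():
    for nominal, measured in rails.items():
        verdict = "ok" if in_tolerance(nominal, measured) else "OUT OF RANGE"
        print(f"{source:10s} +{nominal:g}V rail: {measured:.2f}V {verdict}")
```

With these numbers both xmbmon values fall outside the band (4.27V is far
below the 4.75V lower limit, 11.37V is just below 11.40V), while both
multimeter values are well inside it.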

Other causes that I can think of are of course heat and memory, but there is
no other instability in this box whatsoever. Even when loading all disks at
the same time (dd if=/dev/ad[0-10] of=/dev/null bs=1m) and loading the
processor with a CPU intensive task, nothing crashes.  I would have expected
lots of other symptoms (sig11 etc) in case of overheating or bad memory. I'm
still planning to do a memtest when I can take the box offline, but I'm
skeptical as to the outcome.

Besides that, the temperature readings of xmbmon are within the expected
ranges. Although of course the question remains whether xmbmon spits out the
right values.

Basically my question is open-ended: what would you check when confronted with
such a situation? I'm really baffled by now, and would *greatly* like to keep
this box up for > 1 week...

As posted above, this is on 4.10-RELEASE-p2, dmesg & pciconf -lv (along
with a copy of this email) available at

http://sandcat.nl/~stijn/freebsd/ataproblem/

Thanks for _any_ hints on this...

--Stijn

-- 
"Computer games don't affect kids; I mean if Pac-Man affected us as kids,
we'd all be running around in darkened rooms, munching magic pills and
listening to repetitive electronic music."
		-- Kristian Wilson, Nintendo, Inc., 1989

Stijn Hoop

2004-Sep-21 08:14 UTC

[long] ATA timeout problems on -STABLE

Hi,

thanks for your response.

On Mon, Sep 20, 2004 at 11:35:52AM -0400, Paul Mather wrote:
> FWIW, when I would get those errors on my 4-STABLE system (fallback to
> PIO mode; hard error reading fsbn) it did turn out to be a drive problem
> (and with a Maxtor drive, too).  I was none the wiser until I happened
> to reboot the machine after a security advisory upgrade and was
> surprised to see the boot halted because the S.M.A.R.T. status of the
> drive indicated it was failing!  (Prior to that I'd just been assuming
> it was some kind of OS/peak load problem and had been using atacontrol
> to change the mode back to UDMA100 when it fell back to PIO.)

Interesting. I have also done the same in the few rare cases where the drive
would indeed read/write the block in PIO mode. Most of the time the ata
subsystem would just give up on the drive.

> So, I would suggest running smartctl from the sysutils/smartmontools
> port to see what the SMART status of the drives looks like; in
> particular, whether any of the "worst" values have dropped anywhere
> close to the failure threshold value.  (I have noticed with smartctl
> that some attributes go down and then back up.  I have a system, in
> particular, where the Raw_Read_Error_Rate attribute sometimes drops down
> a few points under heavy disk load [e.g., during the nightly backup or
> cvsup], but increases again after the load has lifted.)
> 
> Unfortunately, you're running 4.x, so you might have to make a 5.x
> FreeSBIE CD with the smartmontools port included because it requires
> ATAng from 5.x to run.

That's a great suggestion that hadn't crossed my mind.

As the box had another error just this morning I took some time when I had to
take it offline to rebuild the RAID array, and put the 4 120G disks (which
definitely generate the most errors) in a 5.x system with the smartmontools
port installed.

Logs of smartctl -a are up at

http://sandcat.nl/~stijn/freebsd/ataproblem/

I don't have a clue how to interpret all these numbers though. A little
googling turns up posts that UNC errors are Bad(TM), however that would
indicate that I have indeed 3(!) failing drives on my hands... Although
certainly possible (they are about 1-2 years old in continuous use), it does
sound improbable.

> You can also use smartctl to run online and offline self-tests.

I didn't have time to run the long tests, but all 4 drives indicated a
'passed' status for the online 'smartctl -t short' test. I take it the
long tests give better results? If so I'll take the time to run them
on the next rebuild downtime.

But anyway if the drives are dying, I'll accept that. I just don't know for
sure how to determine that. Do you have pointers for me to read more about
SMART statistics?
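
Paul's suggestion of comparing the "worst" values against the failure
thresholds could be sketched roughly as follows. This is only an illustration:
it assumes the ten-column attribute table that smartctl -A prints (ID#,
ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED,
RAW_VALUE), the 10-point margin is an arbitrary choice, and the sample rows
are invented for the example; they are not taken from my drives.

```python
def parse_attributes(text):
    """Parse the attribute table from 'smartctl -A' output into dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header line and anything else is skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": " ".join(fields[9:]),
        })
    return rows


def near_threshold(rows, margin=10):
    """Return attributes whose WORST value is within margin of THRESH.

    Attributes with a threshold of 0 are skipped because they can never
    trip. The default margin of 10 is an arbitrary, illustrative choice.
    """
    return [r for r in rows
            if r["thresh"] > 0 and r["worst"] - r["thresh"] <= margin]


# Invented sample input, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000a   253   250   063    Old_age   Always       -       12
  5 Reallocated_Sector_Ct   0x0033   070   068   063    Pre-fail  Always       -       310
194 Temperature_Celsius     0x0032   253   253   000    Old_age   Always       -       38
"""

for attr in near_threshold(parse_attributes(SAMPLE)):
    print(f"{attr['name']}: worst={attr['worst']} thresh={attr['thresh']}")
```

In the invented sample only Reallocated_Sector_Ct is flagged, since its worst
value of 68 is within 10 points of its threshold of 63.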

--Stijn

-- 
An Orb is for life, not just for Christmas.