Hi,
Short explanation:
I have a box running 4.10-RELEASE-p2 suffering from severe ATA timeout
problems once every few days, and I cannot determine for the life of me what's
causing it. I'd really like some hints on how to determine the cause of this,
as I think I've ruled out hardware-related stuff.
Long story:
This box consists of
- Gigabyte GA-7N400 Pro F8 mobo
- Athlon XP 2500+ CPU
- 1 Gig PC2700 DDR DRAM in 2 modules
- Adaptec 2940 Ultra SCSI adapter
- 1 x COMPAQ WDE4550W SCSI-2 4G drive for the OS
- 2 x Promise Ultra100 IDE controllers (not the TX2)
- 4 x Maxtor 200G 7200 RPM drives, connected to the master/slave positions
of the onboard IDE controllers (yes I know better setups exist but these
drives wouldn't play nice with the Promises)
- 4 x Maxtor 120G 7200 RPM drives, all connected to the 4 master positions
on the two Promise controllers
Now this setup, as you can imagine, draws a lot of power. The problems actually
began a few months back, when I replaced 2 x 60 and 2 x 80 gig drives with the
4 x 200 above. The machine just wouldn't boot or randomly 'lost' IDE drives.
A basic working setup I arrived at was to add a second power supply; I was not
overjoyed at this, but at the time I thought it was more power that was needed.
If I determined that this helped enough, my plan was to go out and buy an
expensive 550W or 600W model.
Unfortunately, while things appeared to work at first, once in a while, one of
the ATA drives would mysteriously 'fallback to PIO mode' or even indicate that
a block could not be read or written. The first few times I took out the
indicated drive, ran it through the Maxtor test program, and every time
the drive would come back as OK, so it's definitely not the drives.
On the ATA drives are 3 vinum RAID-5 setups, and every time vinum would of
course correctly indicate that the affected volume was running in degraded
mode. For my experiences with hot rebuilding, see my post from a few weeks
back (basically: don't try to do that).
In any case, there was no pattern to the failures -- I have seen the exact
same error messages on both the onboard IDE controllers and the Promises, and
with both the 120G and the 200G drives. Here's an example:
Sep 17 12:17:20 sandcat /kernel: ad10: DMA problem fallback to PIO mode
Sep 17 12:17:20 sandcat last message repeated 4 times
Sep 17 12:21:41 sandcat /kernel: ad10s1e: hard error reading fsbn 13008689 of
6504313-6504392 (ad10s1 bn 13008689; cn 12905 tn 7 sn 8) status=59 error=40
Sep 17 12:21:41 sandcat /kernel: vinum: local.p0.s3 is crashed by force
Sep 17 12:21:41 sandcat /kernel: vinum: local.p0 is degraded
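
To check the "no pattern" impression against the whole log rather than from
memory, the kernel messages can be tallied per device. What follows is only a
minimal sketch: it assumes every relevant line has the same shape as the two
ATA messages above, and the names tally_ata_errors and SAMPLE are made up for
illustration. The sample lines are copied from the excerpt above (the wrapped
"hard error" line is cut short).

```python
import re
from collections import Counter

# Matches kernel lines such as
#   "... /kernel: ad10: DMA problem fallback to PIO mode"
#   "... /kernel: ad10s1e: hard error reading fsbn ..."
# and captures the bare disk name (ad10) plus the kind of error.
ATA_ERROR = re.compile(r"/kernel: (ad\d+)\w*: (DMA problem|hard error)")


def tally_ata_errors(lines):
    """Count (device, error kind) pairs in an iterable of syslog lines.

    "last message repeated N times" lines are ignored, so the counts are
    a lower bound.
    """
    counts = Counter()
    for line in lines:
        match = ATA_ERROR.search(line)
        if match:
            device, kind = match.groups()
            counts[(device, kind)] += 1
    return counts


# Lines taken from the log excerpt above.
SAMPLE = [
    "Sep 17 12:17:20 sandcat /kernel: ad10: DMA problem fallback to PIO mode",
    "Sep 17 12:17:20 sandcat last message repeated 4 times",
    "Sep 17 12:21:41 sandcat /kernel: ad10s1e: hard error reading fsbn 13008689 of",
    "Sep 17 12:21:41 sandcat /kernel: vinum: local.p0.s3 is crashed by force",
]

for (device, kind), n in sorted(tally_ata_errors(SAMPLE).items()):
    print(device, kind, n)
```

Feeding it the lines of the real log file instead of SAMPLE would show whether
some disks or controllers turn up more often than others.
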
Still suspecting power, I have in the meantime replaced one of the PSUs with
another one, and even added a third. All the +12V and +5V amp totals that the
PSUs could deliver were triple-checked against the specs of the drives, mobo
and CPU, and should have been more than enough.
I tried to monitor the voltages with sysutils/xmbmon, and got lines like this:
Temp.= 36.0, 49.0, 43.0; Rot.= 3183, 0, 2710
Vcore = 1.65, 2.62; Volt. = 3.34, 4.27, 11.37, -5.34, -1.95
which initially confirmed my suspicions. However, the box kept crashing. So,
urged by some friends, today I took up a multimeter and measured the voltages
on the connectors; and this is where I came away totally clueless, because the
multimeter measured 5.07V on the +5V line and 12.01V on the +12V.
Other than greatly decreasing my confidence in sysutils/xmbmon, this also
shattered my PSU theory.
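
To put numbers on the disagreement, here is a small sketch that compares both
sets of readings with the nominal rail voltages. Two things in it are
assumptions: that the second and third "Volt." values printed by xmbmon (4.27
and 11.37) are the +5V and +12V rails, and that the usual ATX tolerance of
+/-5% on those rails is the right yardstick. The function names are made up
for illustration.

```python
# Assumed tolerance: +/-5% on the +5V and +12V rails, as usually quoted for ATX.
ATX_TOLERANCE = 0.05


def deviation(measured, nominal):
    """Relative deviation of a measured voltage from its nominal value."""
    return (measured - nominal) / nominal


def within_tolerance(measured, nominal, tolerance=ATX_TOLERANCE):
    """True if the measured voltage is inside nominal +/- tolerance."""
    return abs(deviation(measured, nominal)) <= tolerance


# (source, nominal volts, measured volts)
# The mapping of xmbmon columns to rails is an assumption.
READINGS = [
    ("xmbmon", 5.0, 4.27),
    ("xmbmon", 12.0, 11.37),
    ("multimeter", 5.0, 5.07),
    ("multimeter", 12.0, 12.01),
]

for source, nominal, measured in READINGS:
    verdict = "ok" if within_tolerance(measured, nominal) else "out of range"
    print(f"{source:10s} +{nominal:g}V rail: {measured:5.2f}V "
          f"({deviation(measured, nominal):+.1%}) {verdict}")
```

With these numbers the xmbmon values fall outside +/-5% (roughly 15% and 5%
low), while both multimeter values are well inside it.
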
Other causes that I can think of are of course heat and memory, but there is
no other instability in this box whatsoever. Even when loading all disks at
the same time (dd if=/dev/ad[0-10] of=/dev/null bs=1m) and loading the
processor with a CPU intensive task, nothing crashes. I would have expected
lots of other symptoms (sig11 etc) in case of overheating or bad memory. I'm
still planning to do a memtest when I can take the box offline, but I'm
skeptical as to the outcome.
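
The dd line above is shorthand: dd takes a single if= operand, so loading all
disks at once means one dd process per disk. Below is a minimal sketch of such
a load test. The helper names are made up for illustration, and the device
list is deliberately left to the caller, since the actual device names on this
box are not listed here.

```python
import subprocess


def dd_command(device, block_size="1m"):
    """Build the argv for reading one whole device and discarding the data."""
    return ["dd", f"if={device}", "of=/dev/null", f"bs={block_size}"]


def read_all(devices):
    """Start one dd per device, wait for all of them, return exit codes."""
    # Start every process first so that all disks are busy at the same time.
    procs = {dev: subprocess.Popen(dd_command(dev)) for dev in devices}
    return {dev: proc.wait() for dev, proc in procs.items()}


# Example call (device names are placeholders, needs root):
#   read_all(["/dev/ad0", "/dev/ad1"])
```

A non-zero exit code for a device would point at a read error on that disk
during the run.
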
Besides that, the temperature readings of xmbmon are within the expected
ranges. Although of course the question remains whether xmbmon spits out the
right values.
Basically my question is open-ended: what would you check when confronted with
such a situation? I'm really baffled by now, and would *greatly* like to keep
this box up for > 1 week...
As posted above, this is on 4.10-RELEASE-p2, dmesg & pciconf -lv (along
with a copy of this email) available at
http://sandcat.nl/~stijn/freebsd/ataproblem/
Thanks for _any_ hints on this...
--Stijn
--
"Computer games don't affect kids; I mean if Pac-Man affected us as
kids,
we'd all be running around in darkened rooms, munching magic pills and
listening to repetitive electronic music."
-- Kristian Wilson, Nintendo, Inc., 1989