I'm wondering if anyone here might be better able to diagnose an issue
we're seeing, or point me to some guidelines for this sort of thing.
There are two machines with identical hardware, both running Red Hat
7.3's stock SMP kernel (required by third-party software). Both have
come down with the same symptoms after running fine for a number of
months. The initial errors were these:
kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error
}
kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=54666719,
high=3, low=4335071, sector=61720
kernel: end_request: I/O error, dev 03:09 (hda), sector 61720
kernel: journal_bmap_Rsmp_41b34d97: journal block not found at offset 7180 on
ide0(3,9)
kernel: Aborting journal on device ide0(3,9).
kernel: ext3_reserve_inode_write: aborting transaction: Journal has aborted in
__ext3_journal_get_write_access<2>EXT3-fs error (device ide0(3,9)) in
ext3_reserve_inode_write: Journal has aborted
kernel: EXT3-fs error (device ide0(3,9)) in ext3_dirty_inode: Journal has
aborted
kernel: ext3_abort called.
kernel: EXT3-fs abort (device ide0(3,9)): ext3_journal_start: Detected aborted
journal
kernel: Remounting filesystem read-only
kernel: EXT3-fs error (device ide0(3,9)) in start_transaction: Journal has
aborted
kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error
}
kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=35484063,
high=2, low=1929631, sector=1843880
kernel: end_request: I/O error, dev 03:06 (hda), sector 1843880
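For anyone cross-checking the register dumps above: the bit names the
driver prints can be reproduced from the raw values, and each LBAsect
figure appears to equal high * 2^24 + low. The following is only a
sketch in Python. The bit-name tables are my reading of how the IDE
driver labels the ATA status and error registers, and decode_bits and
combine_lba are hypothetical helper names, not anything from the kernel.

```python
# Bit names for the ATA status and error registers, most significant bit
# first. These tables are an assumption based on the labels the IDE
# driver prints in the log lines above.
STATUS_NAMES = ["Busy", "DriveReady", "DeviceFault", "SeekComplete",
                "DataRequest", "CorrectedError", "Index", "Error"]
ERROR_NAMES = ["BadSector", "UncorrectableError", "MediaChanged",
               "SectorIdNotFound", "MediaChangeRequested",
               "DriveStatusError", "TrackZeroNotFound", "AddrMarkNotFound"]


def decode_bits(value, names):
    """Return the names of the bits set in an 8-bit register value."""
    return [name for i, name in enumerate(names) if value & (0x80 >> i)]


def combine_lba(high, low):
    """Rebuild a sector number from its upper and lower 24-bit halves.

    The 24-bit split is an assumption; it is consistent with every
    LBAsect/high/low triple in the log.
    """
    return (high << 24) | low
```

Feeding it 0x59 and 0x40 gives back the same names the kernel printed,
and combine_lba(3, 4335071) gives 54666719, so the log lines at least
look internally consistent rather than garbled.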
And later, similar errors on different partitions:
kernel: EXT3-fs error (device ide0(3,6)): ext3_readdir: directory #444380
contains a hole at offset 0
kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error
}
kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=40778463,
high=2, low=7224031, sector=7138288
kernel: end_request: I/O error, dev 03:06 (hda), sector 7138288
kernel: EXT3-fs error (device ide0(3,6)): ext3_readdir: directory #444380
contains a hole at offset 0
kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error
}
kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=126701370,
high=7, low=9260858, sector=72096368
kernel: end_request: I/O error, dev 03:09 (hda), sector 72096368
kernel: hda: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error
}
kernel: hda: read_intr: error=0x40 { UncorrectableError }, LBAsect=126701366,
high=7, low=9260854, sector=72096368
kernel: end_request: I/O error, dev 03:09 (hda), sector 72096368
kernel: EXT3-fs error (device ide0(3,9)): ext3_readdir: directory #1280449
contains a hole at offset 0
kernel: EXT3-fs error (device ide0(3,7)): ext3_readdir: directory #32771
contains a hole at offset 0
kernel: EXT3-fs error (device ide0(3,6)): ext3_readdir: directory #81961
contains a hole at offset 4096
kernel: EXT3-fs error (device ide0(3,3)) in start_transaction: Journal has
aborted
last message repeated 1535 times
last message repeated 1134 times
[etc]
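To see how widely the errors are spread, the ide0(major,minor) tags can
be mapped back to partition names and tallied. This is a rough Python
sketch, assuming the conventional IDE numbering (major 3, minors 0-63
for hda, 64-127 for hdb); partition_name and count_by_partition are
hypothetical helpers written for this post.

```python
import re
from collections import Counter

# Matches the "ide0(major,minor)" device tag used in the EXT3 messages.
DEVICE_TAG = re.compile(r"ide0\((\d+),(\d+)\)")


def partition_name(major, minor):
    """Map an IDE (major, minor) pair on the first channel to an hdX name.

    Assumes the conventional numbering: major 3, minors 0-63 for hda and
    64-127 for hdb, with the low six bits giving the partition number.
    """
    if major != 3 or not 0 <= minor <= 127:
        raise ValueError("only major 3 with minors 0-127 is handled")
    disk = "hda" if minor < 64 else "hdb"
    part = minor % 64
    return disk + (str(part) if part else "")


def count_by_partition(lines):
    """Count device-tag mentions per partition across the given log lines."""
    counts = Counter()
    for line in lines:
        for major, minor in DEVICE_TAG.findall(line):
            counts[partition_name(int(major), int(minor))] += 1
    return counts
```

Under that mapping, the messages above touch hda3, hda6, hda7 and hda9,
i.e. several partitions of the one physical disk.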
The sole drive in each of these machines is a rather run-of-the-mill model:
# grep hda: dmesg
hda: WDC WD1200JB-00EVA0, ATA DISK drive
hda: 234441648 sectors (120034 MB) w/8192KiB Cache, CHS=232581/255/63
hda: hda1 hda2 hda3 hda4 < hda5 hda6 hda7 hda8 hda9 >
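As a sanity check on the reported size, the sector count can be
converted to megabytes by hand. A minimal Python sketch, assuming
512-byte sectors and decimal megabytes (both assumptions on my part;
the helper names are hypothetical):

```python
SECTOR_BYTES = 512  # assumed sector size for this ATA disk


def capacity_mb(sectors, sector_bytes=SECTOR_BYTES):
    """Capacity in decimal megabytes (10**6 bytes), rounded down."""
    return sectors * sector_bytes // 10**6


def chs_sectors(cylinders, heads, sectors_per_track):
    """Total sectors addressable by a CHS geometry."""
    return cylinders * heads * sectors_per_track
```

capacity_mb(234441648) comes out at 120034, matching the dmesg line.
Oddly, the CHS triple as printed does not multiply out to that sector
count, whereas 232581/16/63 does; I can't say whether that is
significant.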
However, I notice some peculiarities with the on-board IDE interface:
[from dmesg]
Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
PCI_IDE: unknown IDE controller on PCI bus 00 device f9, VID=8086, DID=24db
PCI: Device 00:1f.1 not available because of resource collisions
PCI_IDE: (ide_setup_pci_device:) Could not enable device.
[from /proc/pci]
Bus 0, device 31, function 1:
IDE interface: PCI device 8086:24db (Intel Corp.) (rev 2).
IRQ 18.
I/O at 0x0 [0x7].
I/O at 0x0 [0x3].
I/O at 0x0 [0x7].
I/O at 0x0 [0x3].
I/O at 0xffa0 [0xffaf].
Non-prefetchable 32 bit memory at 0x3f000000 [0x3f0003ff].
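The "device f9" in the PCI_IDE line, the 00:1f.1 in the PCI line, and
"device 31, function 1" in /proc/pci all look like the same controller,
if f9 is read as a packed devfn byte (device number in the upper five
bits, function in the lower three). That reading is my assumption; the
Python sketch below just checks the arithmetic, and the helper names
are hypothetical.

```python
def split_devfn(devfn):
    """Split a PCI devfn byte into (device, function): upper 5 bits, lower 3."""
    return devfn >> 3, devfn & 0x07


def pci_address(bus, devfn):
    """Format bus and devfn in the bus:device.function style, e.g. 00:1f.1."""
    device, function = split_devfn(devfn)
    return f"{bus:02x}:{device:02x}.{function}"
```

split_devfn(0xf9) gives (31, 1), and pci_address(0, 0xf9) gives
00:1f.1, so the three messages do seem to refer to the one on-board IDE
interface.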
Unfortunately, I haven't been able to work on either of these machines
myself yet, mainly because of the 2000 miles of reality between us. I'm
writing here as a favor to their local admin.
Does all of this ring any bells amongst the storage hackers here, maybe
a well-known bug that was fixed long ago? Is there any reasonably
up-to-date documentation out there on a standard diagnosis procedure for
this sort of thing?
Thanks,
Jeremy