Yarema
2007-Sep-24 17:38 UTC
FreeBSD PseudoRAID RAID0 array broken on atapci1: <Intel ICH5 SATA150 controller>
Hi, I need some help recovering from this. First some back story. Running 6.2-STABLE i386 from Sep 17, 2007. My /home slice is mounted from /dev/ar0s1e where the relevant kernel messages look like so when all is good:

atapci1: <Intel ICH5 SATA150 controller>
ata2: <ATA channel 0> on atapci1
ata3: <ATA channel 1> on atapci1
ad4: 381554MB <WDC WD4000YR-01PLB0 01.06A01> at ata2-master SATA150
ad6: 381554MB <WDC WD4000YR-01PLB0 01.06A01> at ata3-master SATA150
ar0: 763108MB <FreeBSD PseudoRAID RAID0 (stripe 256 KB)> status: READY
ar0: disk0 READY using ad4 at ata2-master
ar0: disk1 READY using ad6 at ata3-master

Today this server crashed with the following logged:

ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=144888320
ad4: TIMEOUT - READ_DMA retrying (1 retry left) LBA=143390319
ad4: FAILURE - device detached
ar0: FAILURE - RAID0 array broken
subdisk4: detached
ad4: detached
g_vfs_done():ar0s1e[WRITE(offset=146002964480, length=2048)]error = 5
initiate_write_filepage: already started
g_vfs_done():ar0s1e[WRITE(offset=146002964480, length=2048)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=6144000, length=16384)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=6160384, length=16384)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=6176768, length=16384)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=6193152, length=16384)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=6209536, length=2048)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=65536, length=2048)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=147801325568, length=12288)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=147142686720, length=2048)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=65536, length=2048)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=6144000, length=16384)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=6160384, length=16384)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=6176768, length=16384)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=6193152, length=16384)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=6209536, length=2048)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=146831867904, length=16384)]error = 5
g_vfs_done():ar0s1e[WRITE(offset=147024330752, length=16384)]error = 5
initiate_write_filepage: already started
g_vfs_done():ar0s1e[WRITE(offset=146002964480, length=2048)]error = 5
initiate_write_filepage: already started
g_vfs_done():ar0s1e[WRITE(offset=146002964480, length=2048)]error = 5
initiate_write_filepage: already started
g_vfs_done():ar0s1e[WRITE(offset=147801325568, length=12288)]error = 5
initiate_write_filepage: already started
g_vfs_done():ar0s1e[WRITE(offset=147142686720, length=2048)]error = 5

Now the kernel messages read:

ar0: FAILURE - RAID0 array broken
ar0: 763108MB <FreeBSD PseudoRAID RAID0 (stripe 256 KB)> status: BROKEN
ar0: disk0 READY using ad4 at ata2-master
ar0: disk1 DOWN no device found for this subdisk
ar1: 763108MB <FreeBSD PseudoRAID RAID0 (stripe 256 KB)> status: BROKEN
ar1: disk0 DOWN no device found for this subdisk
ar1: disk1 READY using ad6 at ata3-master

For some reason the second disk in the array shows up as ar1 instead of being part of ar0. I suspect there's gotta be some way to force the two drives to show up as part of the same array by perhaps editing the PseudoRAID metadata on disk without putting any of the UFS2 data in "jeopardy". Any pointers on where to start poking around for the relevant metadata structures on disk or what to search for? I figure if I can dd the metadata off the disks, tweak a field or two and then dd the whole mess back I stand a chance of either hosing the array irrevocably or getting it all back. ;) Or maybe atacontrol could be used to re-create the metadata without destroying the UFS2 on the array? I have a coredump of the kernel from this crash if that helps analyze things any.

-- Yarema
Yarema
2007-Sep-25 15:16 UTC
FreeBSD PseudoRAID RAID0 array broken on atapci1: <Intel ICH5 SATA150 controller>
--On Tuesday, September 25, 2007 8:49 AM +0200 Søren Schmidt <sos@deepcore.dk> wrote:

> Yarema wrote:
>> Hi, I need some help recovering from this. First some back story.
>> Running 6.2-STABLE i386 from Sep 17, 2007. My /home slice is mounted
>> from /dev/ar0s1e where the relevant kernel messages look like so when
>> all is good:
>>
>> atapci1: <Intel ICH5 SATA150 controller>
>> ata2: <ATA channel 0> on atapci1
>> ata3: <ATA channel 1> on atapci1
>> ad4: 381554MB <WDC WD4000YR-01PLB0 01.06A01> at ata2-master SATA150
>> ad6: 381554MB <WDC WD4000YR-01PLB0 01.06A01> at ata3-master SATA150
>> ar0: 763108MB <FreeBSD PseudoRAID RAID0 (stripe 256 KB)> status: READY
>> ar0: disk0 READY using ad4 at ata2-master
>> ar0: disk1 READY using ad6 at ata3-master
>>
>> Today this server crashed with the following logged:
>>
>> ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=144888320
>> ad4: TIMEOUT - READ_DMA retrying (1 retry left) LBA=143390319
>> ad4: FAILURE - device detached
>> ar0: FAILURE - RAID0 array broken
>> subdisk4: detached
>> ad4: detached
>> g_vfs_done():ar0s1e[WRITE(offset=146002964480, length=2048)]error = 5
>> initiate_write_filepage: already started
>> g_vfs_done():ar0s1e[WRITE(offset=146002964480, length=2048)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=6144000, length=16384)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=6160384, length=16384)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=6176768, length=16384)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=6193152, length=16384)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=6209536, length=2048)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=65536, length=2048)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=147801325568, length=12288)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=147142686720, length=2048)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=65536, length=2048)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=6144000, length=16384)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=6160384, length=16384)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=6176768, length=16384)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=6193152, length=16384)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=6209536, length=2048)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=146831867904, length=16384)]error = 5
>> g_vfs_done():ar0s1e[WRITE(offset=147024330752, length=16384)]error = 5
>> initiate_write_filepage: already started
>> g_vfs_done():ar0s1e[WRITE(offset=146002964480, length=2048)]error = 5
>> initiate_write_filepage: already started
>> g_vfs_done():ar0s1e[WRITE(offset=146002964480, length=2048)]error = 5
>> initiate_write_filepage: already started
>> g_vfs_done():ar0s1e[WRITE(offset=147801325568, length=12288)]error = 5
>> initiate_write_filepage: already started
>> g_vfs_done():ar0s1e[WRITE(offset=147142686720, length=2048)]error = 5
>>
>> Now the kernel messages read:
>>
>> ar0: FAILURE - RAID0 array broken
>> ar0: 763108MB <FreeBSD PseudoRAID RAID0 (stripe 256 KB)> status: BROKEN
>> ar0: disk0 READY using ad4 at ata2-master
>> ar0: disk1 DOWN no device found for this subdisk
>> ar1: 763108MB <FreeBSD PseudoRAID RAID0 (stripe 256 KB)> status: BROKEN
>> ar1: disk0 DOWN no device found for this subdisk
>> ar1: disk1 READY using ad6 at ata3-master
>>
>> For some reason the second disk in the array shows up as ar1 instead
>> of being part of ar0. I suspect there's gotta be some way to force
>> the two drives to show up as part of the same array by perhaps editing
>> the PseudoRAID metadata on disk without putting any of the UFS2 data
>> in "jeopardy". Any pointers on where to start poking around for the
>> relevant metadata structures on disk or what to search for? I figure
>> if I can dd the metadata off the disks, tweak a field or two and then
>> dd the whole mess back I stand a chance of either hosing the array
>> irrevocably or getting it all back. ;) Or maybe atacontrol could be
>> used to re-create the metadata without destroying the UFS2 on the
>> array? I have a coredump of the kernel from this crash if that helps
>> analyze things any.
>>
>
> The solution to getting the array back is to "atacontrol delete ar0"
> "atacontrol delete ar1" "atacontrol create stripe 512 ad4 ad6" and
> the array is reborn.
> However your filesystems might be just a bunch of bits depending
> on how much of the failed write that made it in there, you get the
> (missing) protection you asked for using RAID0....

Søren,

Thank you for your prompt and helpful reply. I'm running into a new situation with atacontrol:

% atacontrol create RAID0 512 ad4 ad6

ar0: 763108MB <Intel MatrixRAID RAID0 (stripe 128 KB)> status: READY
ar0: disk0 READY using ad4 at ata2-master
ar0: disk1 READY using ad6 at ata3-master

Note that the original RAID0 which broke was

ar0: 763108MB <FreeBSD PseudoRAID RAID0 (stripe 256 KB)> status: READY

Now atacontrol will not create FreeBSD PseudoRAID metadata with a 256KB stripe, but insists on creating Intel MatrixRAID metadata with a 128KB stripe. This is on a non-R version of the ICH5 southbridge, so there's no way to enable/disable the Intel MatrixRAID from the BIOS. Nor is there any way to change the stripe size in the BIOS, since there is no Intel MatrixRAID BIOS on this motherboard. The computer in question is a Dell SC400 with an Intel OEM motherboard which has a very limited BIOS Setup interface typical of Intel/Dell.

Is there any way to force atacontrol to create FreeBSD PseudoRAID metadata? Perhaps using an older FreeSBIE release based on FreeBSD 6.0, since IIRC I created this RAID0 back when 6.0 was CURRENT.

-- Yarema
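
The stripe size matters here because it decides which disk each block of the filesystem lives on. The sketch below is not the ata(4) driver's code; it is generic RAID0 arithmetic under two stated assumptions: 512-byte sectors, and an interleave given in sectors (as atacontrol(8) describes for RAID0). The function names are made up for illustration. Offsets are relative to the start of the array (ar0), not to the ar0s1e partition, so the offsets in the g_vfs_done() messages would first need the partition's starting offset added, which is not shown in this thread.

```python
SECTOR_BYTES = 512  # assumption: 512-byte sectors


def interleave_to_kib(sectors, sector_bytes=SECTOR_BYTES):
    """Convert an interleave given in sectors to a stripe size in KiB."""
    return sectors * sector_bytes // 1024


def raid0_locate(array_offset, stripe_bytes, ndisks):
    """Map a byte offset within a RAID0 array to (disk index, byte offset on that disk).

    Generic round-robin striping: stripe 0 goes to disk 0, stripe 1 to disk 1,
    and so on, wrapping around after ndisks stripes.
    """
    stripe_index, within_stripe = divmod(array_offset, stripe_bytes)
    disk = stripe_index % ndisks
    disk_offset = (stripe_index // ndisks) * stripe_bytes + within_stripe
    return disk, disk_offset


if __name__ == "__main__":
    print(interleave_to_kib(512))  # stripe size in KiB for a 512-sector interleave
    print(interleave_to_kib(256))  # stripe size in KiB for a 256-sector interleave
    # The same array offset lands on different disks under the two stripe sizes.
    offset = 128 * 1024
    print(raid0_locate(offset, 256 * 1024, 2))
    print(raid0_locate(offset, 128 * 1024, 2))
```

Under these assumptions an interleave of 512 sectors corresponds to the 256 KB stripe of the original array, and an array recreated with a 128 KB stripe would read the same on-disk blocks in a different order, so the UFS2 data would not line up.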