We recently installed a 24-disk SATA array with an LSI controller, attached to a box running Solaris 10 x86, Release 4. The drives were set up in one big pool with raidz, and it worked great for about a month. On the 4th the system kernel panicked and crashed, and it's now behaving very badly. Here's the diagnostic data I've been able to collect so far.

In the messages file:

Nov 4 13:24:11 mondo4 savecore: [ID 570001 auth.error] reboot after panic: ZFS: I/O failure (write on <unknown> off 0: zio ffffffff97c86a00 [L0 DMU dnode] 4000L/1000P DVA[0]=<0:d08cf11b800:1800> DVA[1]=<0:1020a711c800:1800> fletcher4 lzjb LE contiguous birth=731555 fill=32
Nov 4 13:24:06 mondo4 savecore: [ID 748169 auth.error] saving system crash dump in /var/crash/mondo4/*.0

And yes, we've got the core files. The box came back up and seemed to run okay for a couple of days, but today things got very odd: a df on the filesystem hung, and ls hung on the local box as well. Looking at the output of dmesg, we see a lot of messages like:

Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] Requested Block: 1450319385 Error Block: 1450319385
Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] Requested Block: 1450487074 Error Block: 1450487074
Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Finally, trying to do a zpool status yields:

root at mondo4:/# zpool status -v
  pool: LogData
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested

At which point the shell hangs and cannot be control-C'd.

Any thoughts on how to proceed?
I'm guessing we have a bad disk, but I'm not sure. Anything you can recommend to diagnose this would be welcome.

--Mike
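[A reasonable first pass at diagnosing a suspect disk here, sketched with stock Solaris 10 commands; the pool name comes from the zpool status output above, and zpool status may of course hang again:]

  iostat -En                  # per-device soft/hard/transport error counters
  fmdump -eV | tail -100      # recent FMA error telemetry (disk and transport ereports)
  fmadm faulty                # anything FMA has already diagnosed as faulted
  zpool status -v LogData     # per-vdev read/write/checksum counts, if it responds

A drive that keeps showing up in the iostat -En and fmdump output is the usual candidate for a zpool replace.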
Michael Stalnaker wrote:
>
> Finally trying to do a zpool status yields:
>
> root at mondo4:/# zpool status -v
>   pool: LogData
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
>
> At which point the shell hangs and cannot be control-C'd.
>
> Any thoughts on how to proceed? I'm guessing we have a bad disk, but I'm not
> sure. Anything you can recommend to diagnose this would be welcome.
>
Are you able to run a zpool scrub?

Ian
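[For reference, kicking off a scrub and checking on it, using the pool name from the status output above, would look something like:]

  zpool scrub LogData
  zpool status -v LogData     # reports scrub progress and per-device error counts

If zpool status hangs again while the scrub runs, that in itself suggests the problem is in the device or controller path rather than just a bad block.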
We weren't able to do anything at all, and finally rebooted the system. When we did, everything came back normally, even with the target that was reporting errors before. We're using an LSI PCI-E controller that's on the supported device list, an LSI 3801-E. Right now, I'm trying to figure out whether there's a different controller we should be using with Solaris 10 Release 4 (x86) that will handle a drive issue more gracefully. I know folks are working on this part of the code, but I need to get as far along as I can right now. :)

On 11/8/07 8:43 PM, "Ian Collins" <ian at ianshome.com> wrote:

> Michael Stalnaker wrote:
>>
>> Finally trying to do a zpool status yields:
>>
>> root at mondo4:/# zpool status -v
>>   pool: LogData
>>  state: ONLINE
>> status: One or more devices has experienced an unrecoverable error.  An
>>         attempt was made to correct the error.  Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>>         using 'zpool clear' or replace the device with 'zpool replace'.
>>    see: http://www.sun.com/msg/ZFS-8000-9P
>>  scrub: none requested
>>
>> At which point the shell hangs and cannot be control-C'd.
>>
>> Any thoughts on how to proceed? I'm guessing we have a bad disk, but I'm not
>> sure. Anything you can recommend to diagnose this would be welcome.
>>
> Are you able to run a zpool scrub?
>
> Ian
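[Before changing controllers, it may be worth pinning down which target is actually resetting; the Unit Attention messages above don't name a device. A rough sketch, assuming the 3801-E attaches through the mpt driver (worth confirming with prtconf -D):]

  grep mpt /var/adm/messages | tail -20   # any controller/driver reset messages
  cfgadm -al                              # confirm every target is still connected and configured
  zpool clear LogData                     # clear the logged pool errors once the device checks out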
Are all 24 disks in one big raidz set with no spares assigned to the pool? If so, the host may be struggling to compute parity across that many drives when the "experienced an unrecoverable error" errors occur. From what I've read, it would be better to create the pool as 3 raidz sets of 7 drives each and use the remaining 3 drives as spares (a sketch of that layout follows the quoted message below), though I imagine that probably isn't an option at this point.

On Nov 9, 2007 1:02 AM, Michael Stalnaker <Michael.Stalnaker at exponential.com> wrote:
> We weren't able to do anything at all, and finally rebooted the system. When
> we did, everything came back normally, even with the target that was
> reporting errors before. We're using an LSI PCI-E controller that's on the
> supported device list, an LSI 3801-E. Right now, I'm trying to figure out
> whether there's a different controller we should be using with Solaris 10
> Release 4 (x86) that will handle a drive issue more gracefully. I know folks
> are working on this part of the code, but I need to get as far along as I
> can right now. :)
>
> On 11/8/07 8:43 PM, "Ian Collins" <ian at ianshome.com> wrote:
>
> > Michael Stalnaker wrote:
> >>
> >> Finally trying to do a zpool status yields:
> >>
> >> root at mondo4:/# zpool status -v
> >>   pool: LogData
> >>  state: ONLINE
> >> status: One or more devices has experienced an unrecoverable error.  An
> >>         attempt was made to correct the error.  Applications are unaffected.
> >> action: Determine if the device needs to be replaced, and clear the errors
> >>         using 'zpool clear' or replace the device with 'zpool replace'.
> >>    see: http://www.sun.com/msg/ZFS-8000-9P
> >>  scrub: none requested
> >>
> >> At which point the shell hangs and cannot be control-C'd.
> >>
> >> Any thoughts on how to proceed? I'm guessing we have a bad disk, but I'm not
> >> sure. Anything you can recommend to diagnose this would be welcome.
> >>
> > Are you able to run a zpool scrub?
> >
> > Ian
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
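[For what it's worth, a layout like that would be created along these lines; the c#t#d# names below are made up, so substitute the real ones reported by format or cfgadm:]

  zpool create LogData \
      raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
      raidz c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0 \
      raidz c2t14d0 c2t15d0 c2t16d0 c2t17d0 c2t18d0 c2t19d0 c2t20d0 \
      spare c2t21d0 c2t22d0 c2t23d0

The idea being that a single failed drive degrades only one 7-disk group, and a hot spare can be pulled in while you source a replacement, instead of the whole 24-disk set losing its redundancy at once.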