We recently installed a 24-disk SATA array with an LSI controller, attached to a box running Solaris 10 x86, Release 4. The drives were set up in one big pool with raidz, and it worked great for about a month. On the 4th the system kernel panicked and crashed, and it's now behaving very badly. Here's the diagnostic data I've been able to collect so far.

In the messages file:

Nov 4 13:24:11 mondo4 savecore: [ID 570001 auth.error] reboot after panic: ZFS: I/O failure (write on <unknown> off 0: zio ffffffff97c86a00 [L0 DMU dnode] 4000L/1000P DVA[0]=<0:d08cf11b800:1800> DVA[1]=<0:1020a711c800:1800> fletcher4 lzjb LE contiguous birth=731555 fill=32
Nov 4 13:24:06 mondo4 savecore: [ID 748169 auth.error] saving system crash dump in /var/crash/mondo4/*.0

And yes, we've got the core files. The box came back up and seemed to run okay for a couple of days, but today things got very odd: a df on the filesystem hung, and ls hung on the local box as well. Looking at the output of dmesg, we see a lot of messages like:

Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] Requested Block: 1450319385 Error Block: 1450319385
Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 8 03:58:22 mondo4 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] Requested Block: 1450487074 Error Block: 1450487074
Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Nov 8 04:13:59 mondo4 scsi: [ID 107833 kern.notice] ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Finally, trying to do a zpool status yields:

root at mondo4:/# zpool status -v
  pool: LogData
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested

At which point the shell hangs and cannot be control-C'd.

Any thoughts on how to proceed?
I'm guessing we have a bad disk, but I'm not sure. Anything you can recommend to diagnose this would be welcome.

--Mike
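[A reasonable first pass at diagnosing a suspect disk here, sketched with stock Solaris 10 commands; the pool name comes from the zpool status output above, and zpool status may of course hang again:]

  iostat -En                  # per-device soft/hard/transport error counters
  fmdump -eV | tail -100      # recent FMA error telemetry (disk and transport ereports)
  fmadm faulty                # anything FMA has already diagnosed as faulted
  zpool status -v LogData     # per-vdev read/write/checksum counts, if it responds

A drive that keeps showing up in the iostat -En and fmdump output is the usual candidate for a zpool replace.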
Michael Stalnaker wrote:
>
> Finally trying to do a zpool status yields:
>
> root at mondo4:/# zpool status -v
>   pool: LogData
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
>
> At which point the shell hangs and cannot be control-C'd.
>
> Any thoughts on how to proceed? I'm guessing we have a bad disk, but I'm not
> sure. Anything you can recommend to diagnose this would be welcome.
>
Are you able to run a zpool scrub?

Ian
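[For reference, kicking off a scrub and checking on it, using the pool name from the status output above, would look something like:]

  zpool scrub LogData
  zpool status -v LogData     # reports scrub progress and per-device error counts

If zpool status hangs again while the scrub runs, that in itself suggests the problem is in the device or controller path rather than just a bad block.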
We weren't able to do anything at all, and finally rebooted the system. When we did, everything came back normally, even with the target that was reporting errors before. We're using an LSI PCI-E controller that's on the supported device list, an LSI 3801-E. Right now, I'm trying to figure out whether there's a different controller we should be using with Solaris 10 Release 4 (x86) that will handle a drive issue more gracefully. I know folks are working on this part of the code, but I need to get as far along as I can right now. :)

On 11/8/07 8:43 PM, "Ian Collins" <ian at ianshome.com> wrote:

> Michael Stalnaker wrote:
>>
>> Finally trying to do a zpool status yields:
>>
>> root at mondo4:/# zpool status -v
>>   pool: LogData
>>  state: ONLINE
>> status: One or more devices has experienced an unrecoverable error.  An
>>         attempt was made to correct the error.  Applications are unaffected.
>> action: Determine if the device needs to be replaced, and clear the errors
>>         using 'zpool clear' or replace the device with 'zpool replace'.
>>    see: http://www.sun.com/msg/ZFS-8000-9P
>>  scrub: none requested
>>
>> At which point the shell hangs and cannot be control-C'd.
>>
>> Any thoughts on how to proceed? I'm guessing we have a bad disk, but I'm not
>> sure. Anything you can recommend to diagnose this would be welcome.
>>
> Are you able to run a zpool scrub?
>
> Ian
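[Before changing controllers, it may be worth pinning down which target is actually resetting; the Unit Attention messages above don't name a device. A rough sketch, assuming the 3801-E attaches through the mpt driver (worth confirming with prtconf -D):]

  grep mpt /var/adm/messages | tail -20   # any controller/driver reset messages
  cfgadm -al                              # confirm every target is still connected and configured
  zpool clear LogData                     # clear the logged pool errors once the device checks out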
Are all 24 disks in one big raidz set with no spares assigned to the pool? If so, the host may be struggling to compute parity across that many drives when the "experienced an unrecoverable error" errors occur. From what I've read, it would be better to create the pool as 3 raidz sets of 7 drives each and use the remaining 3 drives as spares (a sketch of that layout follows the quoted message below), though I imagine that probably isn't an option at this point.

On Nov 9, 2007 1:02 AM, Michael Stalnaker <Michael.Stalnaker at exponential.com> wrote:
> We weren't able to do anything at all, and finally rebooted the system. When
> we did, everything came back normally, even with the target that was
> reporting errors before. We're using an LSI PCI-E controller that's on the
> supported device list, an LSI 3801-E. Right now, I'm trying to figure out
> whether there's a different controller we should be using with Solaris 10
> Release 4 (x86) that will handle a drive issue more gracefully. I know folks
> are working on this part of the code, but I need to get as far along as I
> can right now. :)
>
> On 11/8/07 8:43 PM, "Ian Collins" <ian at ianshome.com> wrote:
>
> > Michael Stalnaker wrote:
> >>
> >> Finally trying to do a zpool status yields:
> >>
> >> root at mondo4:/# zpool status -v
> >>   pool: LogData
> >>  state: ONLINE
> >> status: One or more devices has experienced an unrecoverable error.  An
> >>         attempt was made to correct the error.  Applications are unaffected.
> >> action: Determine if the device needs to be replaced, and clear the errors
> >>         using 'zpool clear' or replace the device with 'zpool replace'.
> >>    see: http://www.sun.com/msg/ZFS-8000-9P
> >>  scrub: none requested
> >>
> >> At which point the shell hangs and cannot be control-C'd.
> >>
> >> Any thoughts on how to proceed? I'm guessing we have a bad disk, but I'm not
> >> sure. Anything you can recommend to diagnose this would be welcome.
> >>
> > Are you able to run a zpool scrub?
> >
> > Ian
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
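[For what it's worth, a layout like that would be created along these lines; the c#t#d# names below are made up, so substitute the real ones reported by format or cfgadm:]

  zpool create LogData \
      raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
      raidz c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0 \
      raidz c2t14d0 c2t15d0 c2t16d0 c2t17d0 c2t18d0 c2t19d0 c2t20d0 \
      spare c2t21d0 c2t22d0 c2t23d0

The idea being that a single failed drive degrades only one 7-disk group, and a hot spare can be pulled in while you source a replacement, instead of the whole 24-disk set losing its redundancy at once.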