Scott Meilicke
2010-Sep-29 22:16 UTC
[zfs-discuss] Resilver making the system unresponsive
This must be resilver day :)

I just had a drive failure. The hot spare kicked in, and access to the pool over NFS was effectively zero for about 45 minutes. Currently the pool is still resilvering, but for some reason I can access the file system now.

Resilver speed has been beaten to death, I know, but is there a way to avoid this? For example, is more enterprisey hardware less susceptible to resilvers? This box is used for development VMs, but there is no way I would consider this for production with this kind of performance hit during a resilver.

My hardware:
Dell 2950
16G RAM
16-disk SAS chassis
LSI 3801 (I think) SAS card (1068e chip)
Intel X25-E SLOG off of the internal PERC 5/i RAID controller
Seagate 750G disks (7200.11)

I am running Nexenta CE 3.0.3 (SunOS rawhide 5.11 NexentaOS_134f i86pc i386 i86pc Solaris)

  pool: data01
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Sep 29 14:03:52 2010
        1.12T scanned out of 5.00T at 311M/s, 3h37m to go
        82.0G resilvered, 22.42% done
config:

        NAME          STATE     READ WRITE CKSUM
        data01        DEGRADED     0     0     0
          raidz2-0    ONLINE       0     0     0
            c1t8d0    ONLINE       0     0     0
            c1t9d0    ONLINE       0     0     0
            c1t10d0   ONLINE       0     0     0
            c1t11d0   ONLINE       0     0     0
            c1t12d0   ONLINE       0     0     0
            c1t13d0   ONLINE       0     0     0
            c1t14d0   ONLINE       0     0     0
          raidz2-1    DEGRADED     0     0     0
            c1t22d0   ONLINE       0     0     0
            c1t15d0   ONLINE       0     0     0
            c1t16d0   ONLINE       0     0     0
            c1t17d0   ONLINE       0     0     0
            c1t23d0   ONLINE       0     0     0
            spare-5   REMOVED      0     0     0
              c1t20d0 REMOVED      0     0     0
              c8t18d0 ONLINE       0     0     0  (resilvering)
            c1t21d0   ONLINE       0     0     0
        logs
          c0t1d0      ONLINE       0     0     0
        spares
          c8t18d0     INUSE     currently in use

errors: No known data errors

Thanks for any insights.

-Scott
--
This message posted from opensolaris.org
Scott Meilicke
2010-Sep-29 22:37 UTC
[zfs-discuss] Resilver making the system unresponsive
I should add I have 477 snapshots across all file systems. Most of them are hourly snaps (225 of them anyway).

On Sep 29, 2010, at 3:16 PM, Scott Meilicke wrote:
> This must be resilver day :)
>
> I just had a drive failure. The hot spare kicked in, and access to the pool over NFS was effectively zero for about 45 minutes. Currently the pool is still resilvering, but for some reason I can access the file system now.
>
> Resilver speed has been beaten to death, I know, but is there a way to avoid this? For example, is more enterprisey hardware less susceptible to resilvers? This box is used for development VMs, but there is no way I would consider this for production with this kind of performance hit during a resilver.
>
> [...hardware list and full zpool status snipped; see the original post above...]
>
> Thanks for any insights.
>
> -Scott
> --
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Scott Meilicke
Yeah, I'm having a combination of this and the "resilver constantly restarting" issue. And nothing to free up space. It was recommended to me to replace any expanders I had between the HBA and the drives with extra HBAs, but my array doesn't have expanders. If yours does, you may want to try that. Otherwise, wait it out :(

On Wed, Sep 29, 2010 at 6:37 PM, Scott Meilicke <scott at kmclan.net> wrote:
> I should add I have 477 snapshots across all file systems. Most of them
> are hourly snaps (225 of them anyway).
>
> [...full quoted message and zpool status snipped...]
>
> Scott Meilicke
Tuomas Leikola
2010-Sep-30 09:32 UTC
[zfs-discuss] Resilver making the system unresponsive
On Thu, Sep 30, 2010 at 1:16 AM, Scott Meilicke <scott.meilicke at craneaerospace.com> wrote:
> Resilver speed has been beaten to death, I know, but is there a way to avoid
> this? For example, is more enterprisey hardware less susceptible to
> resilvers? This box is used for development VMs, but there is no way I would
> consider this for production with this kind of performance hit during a
> resilver.

According to

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473

resilver should in later builds have some option to limit rebuild speed in order to allow for more I/O during reconstruction, but I haven't found any guides on how to actually make use of this feature. Maybe someone can shed some light on this?
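For reference, the throttle that eventually came out of 6494473 is exposed as kernel tunables on builds new enough to carry the fix. A sketch of the /etc/system settings, with the caveat that these symbol names come from later OpenSolaris/illumos sources and may well not exist on an older build such as 134f:

```shell
* /etc/system fragment (takes effect at next boot).
* Assumption: the build includes the 6494473 resilver throttle;
* on builds without it these symbols simply do not exist.
* A larger delay makes resilver I/O yield more to application I/O;
* a smaller min-time spends less of each txg on resilvering.
set zfs:zfs_resilver_delay = 4
set zfs:zfs_resilver_min_time_ms = 1000
```

Treat the values as starting points to experiment with, not recommendations.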
Richard Elling
2010-Sep-30 14:12 UTC
[zfs-discuss] Resilver making the system unresponsive
On Sep 30, 2010, at 2:32 AM, Tuomas Leikola wrote:
> On Thu, Sep 30, 2010 at 1:16 AM, Scott Meilicke <scott.meilicke at craneaerospace.com> wrote:
>> Resilver speed has been beaten to death, I know, but is there a way to avoid this? [...]
>
> According to
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473
>
> resilver should in later builds have some option to limit rebuild speed in order to allow for more I/O during reconstruction, but I haven't found any guides on how to actually make use of this feature. Maybe someone can shed some light on this?

Simple. Resilver activity is throttled using a delay method. Nothing to tune here.

In general, if resilver or scrub make a system seem unresponsive, there is a
root cause that is related to the I/O activity. To diagnose, I usually use "iostat -zxCn 10"
(or similar) and look for an unusual asvc_t from a busy disk. One bad disk can ruin
performance for the whole pool.
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com
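To make that concrete, here is a small sketch of filtering iostat output for service-time outliers. The sample data, the device names, and the 100 ms cutoff are all made up for illustration; on a real system you would pipe `iostat -zxCn 10` through the same awk filter:

```shell
# Hypothetical capture of "iostat -zxCn" extended device statistics.
# Column 8 (asvc_t) is the average service time in milliseconds; a
# single disk far above its peers is the outlier to look for.
cat > /tmp/iostat.sample <<'EOF'
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  120.0   15.0 8000.0  900.0  0.0  1.2    0.1    9.8   0  55 c1t8d0
  118.0   14.0 7900.0  880.0  0.0  1.1    0.1   10.2   0  54 c1t9d0
    2.0    1.0   64.0   32.0  0.0  8.9    0.3 4500.0   0  99 c1t17d0
EOF
# Print any device whose asvc_t exceeds an illustrative 100 ms threshold.
awk 'NR > 1 && $8 > 100 { print $11, $8 }' /tmp/iostat.sample
```

Here the filter flags c1t17d0: the kind of "one bad disk" that can drag down the whole pool.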
If we've found one bad disk, what are our options?

On Thu, Sep 30, 2010 at 10:12 AM, Richard Elling <richard.elling at gmail.com> wrote:
> Simple. Resilver activity is throttled using a delay method. Nothing to
> tune here.
>
> In general, if resilver or scrub make a system seem unresponsive, there is a
> root cause that is related to the I/O activity. To diagnose, I usually use
> "iostat -zxCn 10"
> (or similar) and look for an unusual asvc_t from a busy disk. One bad disk
> can ruin
> performance for the whole pool.
> -- richard
>
> [...sig and quoted text snipped...]
jason matthews
2010-Oct-01 05:30 UTC
[zfs-discuss] Resilver making the system unresponsive
Replace it. Resilvering should not be as painful if all your disks are functioning normally.
--
This message posted from opensolaris.org
On Oct 18, 2011, at 6:35 PM, David Magda wrote:
> If we've found one bad disk, what are our options?

Live with it or replace it :-)
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
Jim Klimov
2011-Oct-19 12:31 UTC
[zfs-discuss] Growing CKSUM errors with no READ/WRITE errors
2011-10-19 16:01, Richard Elling wrote:
> On Oct 18, 2011, at 6:35 PM, David Magda wrote:
>
>> If we've found one bad disk, what are our options?
> Live with it or replace it :-)
> -- richard

Similar question: an HDD went awry last week in an snv_117 box (the controller no longer sees the drive, so I guess there is either a dead drive, or dead power/data ports on the backplane), and a hot spare replaced it okay.

However, there are a number of CKSUM errors on the replacement disk, growing by about 100 daily (according to "zpool status"). I tried scrubbing the pool and zeroing the counter with "zpool clear", but new CKSUM errors keep being found. There are zero READ or WRITE error counts, though.

Should we be worried about replacing the ex-hotspare drive ASAP as well?

There are no errors in dmesg regarding the ex-hotspare drive, only those regarding the dead one, occasionally:

=== dmesg:
Oct 19 16:28:23 thumper scsi: [ID 107833 kern.warning] WARNING: /pci at 1,0/pci1022,7458 at 4/pci11ab,11ab at 1/disk at 6,0 (sd40):
Oct 19 16:28:23 thumper 	Command failed to complete...Device is gone
Oct 19 16:28:23 thumper scsi: [ID 107833 kern.warning] WARNING: /pci at 1,0/pci1022,7458 at 4/pci11ab,11ab at 1/disk at 6,0 (sd40):
Oct 19 16:28:23 thumper 	SYNCHRONIZE CACHE command failed (5)

=== format:
      30. c5t6d0 <drive type unknown>
          /pci at 1,0/pci1022,7458 at 4/pci11ab,11ab at 1/disk at 6,0

--
Jim Klimov, CTO, JSC "COS&HT"
+7-903-7705859 (cellular)   mailto:jimklimov at cos.ru
CC: admin at cos.ru, jimklimov at mail.ru
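A growing counter like this is easy to watch for in a script. The sketch below parses a saved "zpool status" config section and flags any vdev with nonzero CKSUM but zero READ/WRITE; the sample output, pool name, and counts are hypothetical:

```shell
# Hypothetical config section from "zpool status" saved to a file.
# Columns are NAME STATE READ WRITE CKSUM.
cat > /tmp/zpool.sample <<'EOF'
        NAME        STATE     READ WRITE CKSUM
        pond        ONLINE       0     0     0
          c5t6d0    ONLINE       0     0   312
          c5t7d0    ONLINE       0     0     0
EOF
# Print vdevs showing checksum errors without read/write errors --
# the "growing CKSUM, zero READ/WRITE" pattern described above.
awk 'NR > 1 && $3 == 0 && $4 == 0 && $5 > 0 { print $1, $5 }' /tmp/zpool.sample
```

Run from cron against fresh "zpool status" output, a check like this would catch the roughly 100-per-day growth without waiting for ZFS to fault the drive on its own.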
Edward Ned Harvey
2011-Oct-20 11:55 UTC
[zfs-discuss] Growing CKSUM errors with no READ/WRITE errors
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> new CKSUM errors keep being found. There are zero READ or WRITE error
> counts, though.
>
> Should we be worried about replacing the ex-hotspare drive ASAP as well?

You should not be seeing increasing CKSUM errors. There is something wrong. I cannot say it's necessarily the fault of the drive, but probably it is. When some threshold is reached, ZFS should mark the drive as faulted due to too many cksum errors. I don't recommend waiting for it.
Eric Sproul
2011-Oct-21 16:27 UTC
[zfs-discuss] Growing CKSUM errors with no READ/WRITE errors
On Thu, Oct 20, 2011 at 7:55 AM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> new CKSUM errors keep being found. There are zero READ or WRITE error
>> counts, though.
>>
>> Should we be worried about replacing the ex-hotspare drive ASAP as well?
>
> You should not be seeing increasing CKSUM errors. There is something wrong. I cannot say it's necessarily the fault of the drive, but probably it is. When some threshold is reached, ZFS should mark the drive as faulted due to too many cksum errors. I don't recommend waiting for it.

It probably indicates something else faulty in the I/O path, which could include RAM, HBA or integrated controller chip, loose or defective cabling, etc. If the RAM is ECC-capable, it seems unlikely to be the issue. I'd make sure all cables are fully seated and not kinked or otherwise damaged.

Eric