Scott Meilicke
2010-Sep-29 22:16 UTC
[zfs-discuss] Resilver making the system unresponsive
This must be resilver day :)

I just had a drive failure. The hot spare kicked in, and access to the pool over NFS was effectively zero for about 45 minutes. Currently the pool is still resilvering, but for some reason I can access the file system now.

Resilver speed has been beaten to death, I know, but is there a way to avoid this? For example, is more enterprisey hardware less susceptible to resilvers? This box is used for development VMs, but there is no way I would consider this for production with this kind of performance hit during a resilver.

My hardware:
Dell 2950
16G RAM
16-disk SAS chassis
LSI 3801 (I think) SAS card (1068e chip)
Intel X25-E SLOG off of the internal PERC 5/i RAID controller
Seagate 750G disks (7200.11)

I am running Nexenta CE 3.0.3 (SunOS rawhide 5.11 NexentaOS_134f i86pc i386 i86pc Solaris)

  pool: data01
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Sep 29 14:03:52 2010
        1.12T scanned out of 5.00T at 311M/s, 3h37m to go
        82.0G resilvered, 22.42% done
config:

        NAME          STATE     READ WRITE CKSUM
        data01        DEGRADED     0     0     0
          raidz2-0    ONLINE       0     0     0
            c1t8d0    ONLINE       0     0     0
            c1t9d0    ONLINE       0     0     0
            c1t10d0   ONLINE       0     0     0
            c1t11d0   ONLINE       0     0     0
            c1t12d0   ONLINE       0     0     0
            c1t13d0   ONLINE       0     0     0
            c1t14d0   ONLINE       0     0     0
          raidz2-1    DEGRADED     0     0     0
            c1t22d0   ONLINE       0     0     0
            c1t15d0   ONLINE       0     0     0
            c1t16d0   ONLINE       0     0     0
            c1t17d0   ONLINE       0     0     0
            c1t23d0   ONLINE       0     0     0
            spare-5   REMOVED      0     0     0
              c1t20d0 REMOVED      0     0     0
              c8t18d0 ONLINE       0     0     0  (resilvering)
            c1t21d0   ONLINE       0     0     0
        logs
          c0t1d0      ONLINE       0     0     0
        spares
          c8t18d0     INUSE     currently in use

errors: No known data errors

Thanks for any insights.

-Scott
--
This message posted from opensolaris.org
Scott Meilicke
2010-Sep-29 22:37 UTC
[zfs-discuss] Resilver making the system unresponsive
I should add I have 477 snapshots across all file systems. Most of them are hourly snaps (225 of them anyway).

On Sep 29, 2010, at 3:16 PM, Scott Meilicke wrote:
> This must be resilver day :)
>
> I just had a drive failure. The hot spare kicked in, and access to the pool over NFS was effectively zero for about 45 minutes. Currently the pool is still resilvering, but for some reason I can access the file system now.
>
> Resilver speed has been beaten to death, I know, but is there a way to avoid this? For example, is more enterprisey hardware less susceptible to resilvers? This box is used for development VMs, but there is no way I would consider this for production with this kind of performance hit during a resilver.
>
> [...hardware list and full zpool status snipped; see the original post above...]
>
> Thanks for any insights.
>
> -Scott
> --
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Scott Meilicke
Yeah, I'm having a combination of this and the "resilver constantly restarting" issue. And nothing to free up space. It was recommended to me to replace any expanders I had between the HBA and the drives with extra HBAs, but my array doesn't have expanders. If yours does, you may want to try that. Otherwise, wait it out :(

On Wed, Sep 29, 2010 at 6:37 PM, Scott Meilicke <scott at kmclan.net> wrote:
> I should add I have 477 snapshots across all file systems. Most of them
> are hourly snaps (225 of them anyway).
>
> [...full quoted message and zpool status snipped...]
>
> Scott Meilicke
Tuomas Leikola
2010-Sep-30 09:32 UTC
[zfs-discuss] Resilver making the system unresponsive
On Thu, Sep 30, 2010 at 1:16 AM, Scott Meilicke <scott.meilicke at craneaerospace.com> wrote:
> Resilver speed has been beaten to death, I know, but is there a way to avoid
> this? For example, is more enterprisey hardware less susceptible to
> resilvers? This box is used for development VMs, but there is no way I would
> consider this for production with this kind of performance hit during a
> resilver.

According to

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473

resilver should in later builds have some option to limit rebuild speed in order to allow for more I/O during reconstruction, but I haven't found any guides on how to actually make use of this feature. Maybe someone can shed some light on this?
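For reference, the throttle that eventually came out of 6494473 is exposed as kernel tunables on builds new enough to carry the fix. A sketch of the /etc/system settings, with the caveat that these symbol names come from later OpenSolaris/illumos sources and may well not exist on an older build such as 134f:

```shell
* /etc/system fragment (takes effect at next boot).
* Assumption: the build includes the 6494473 resilver throttle;
* on builds without it these symbols simply do not exist.
* A larger delay makes resilver I/O yield more to application I/O;
* a smaller min-time spends less of each txg on resilvering.
set zfs:zfs_resilver_delay = 4
set zfs:zfs_resilver_min_time_ms = 1000
```

Treat the values as starting points to experiment with, not recommendations.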
Richard Elling
2010-Sep-30 14:12 UTC
[zfs-discuss] Resilver making the system unresponsive
On Sep 30, 2010, at 2:32 AM, Tuomas Leikola wrote:
> On Thu, Sep 30, 2010 at 1:16 AM, Scott Meilicke <scott.meilicke at craneaerospace.com> wrote:
>> Resilver speed has been beaten to death, I know, but is there a way to avoid this? [...]
>
> According to
>
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6494473
>
> resilver should in later builds have some option to limit rebuild speed in order to allow for more I/O during reconstruction, but I haven't found any guides on how to actually make use of this feature. Maybe someone can shed some light on this?

Simple. Resilver activity is throttled using a delay method. Nothing to tune here.

In general, if resilver or scrub make a system seem unresponsive, there is a
root cause that is related to the I/O activity. To diagnose, I usually use "iostat -zxCn 10"
(or similar) and look for an unusual asvc_t from a busy disk. One bad disk can ruin
performance for the whole pool.
 -- richard

--
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS and performance consulting
http://www.RichardElling.com
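To make that concrete, here is a small sketch of filtering iostat output for service-time outliers. The sample data, the device names, and the 100 ms cutoff are all made up for illustration; on a real system you would pipe `iostat -zxCn 10` through the same awk filter:

```shell
# Hypothetical capture of "iostat -zxCn" extended device statistics.
# Column 8 (asvc_t) is the average service time in milliseconds; a
# single disk far above its peers is the outlier to look for.
cat > /tmp/iostat.sample <<'EOF'
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  120.0   15.0 8000.0  900.0  0.0  1.2    0.1    9.8   0  55 c1t8d0
  118.0   14.0 7900.0  880.0  0.0  1.1    0.1   10.2   0  54 c1t9d0
    2.0    1.0   64.0   32.0  0.0  8.9    0.3 4500.0   0  99 c1t17d0
EOF
# Print any device whose asvc_t exceeds an illustrative 100 ms threshold.
awk 'NR > 1 && $8 > 100 { print $11, $8 }' /tmp/iostat.sample
```

Here the filter flags c1t17d0: the kind of "one bad disk" that can drag down the whole pool.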
If we've found one bad disk, what are our options?

On Thu, Sep 30, 2010 at 10:12 AM, Richard Elling <richard.elling at gmail.com> wrote:
> Simple. Resilver activity is throttled using a delay method. Nothing to
> tune here.
>
> In general, if resilver or scrub make a system seem unresponsive, there is a
> root cause that is related to the I/O activity. To diagnose, I usually use
> "iostat -zxCn 10"
> (or similar) and look for an unusual asvc_t from a busy disk. One bad disk
> can ruin
> performance for the whole pool.
> -- richard
>
> [...sig and quoted text snipped...]
jason matthews
2010-Oct-01 05:30 UTC
[zfs-discuss] Resilver making the system unresponsive
Replace it. Resilvering should not be as painful if all your disks are functioning normally.
--
This message posted from opensolaris.org
On Oct 18, 2011, at 6:35 PM, David Magda wrote:
> If we've found one bad disk, what are our options?

Live with it or replace it :-)
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
VMworld Copenhagen, October 17-20
OpenStorage Summit, San Jose, CA, October 24-27
LISA '11, Boston, MA, December 4-9
Jim Klimov
2011-Oct-19 12:31 UTC
[zfs-discuss] Growing CKSUM errors with no READ/WRITE errors
2011-10-19 16:01, Richard Elling wrote:
> On Oct 18, 2011, at 6:35 PM, David Magda wrote:
>
>> If we've found one bad disk, what are our options?
> Live with it or replace it :-)
> -- richard

Similar question: an HDD went awry last week in an snv_117 box (the controller no longer sees the drive, so I guess there is either a dead drive, or dead power/data ports on the backplane), and a hot spare replaced it okay.

However, there are a number of CKSUM errors on the replacement disk, growing by about 100 daily (according to "zpool status"). I tried scrubbing the pool and zeroing the counter with "zpool clear", but new CKSUM errors keep being found. There are zero READ or WRITE error counts, though.

Should we be worried about replacing the ex-hotspare drive ASAP as well?

There are no errors in dmesg regarding the ex-hotspare drive, only those regarding the dead one, occasionally:

=== dmesg:
Oct 19 16:28:23 thumper scsi: [ID 107833 kern.warning] WARNING: /pci at 1,0/pci1022,7458 at 4/pci11ab,11ab at 1/disk at 6,0 (sd40):
Oct 19 16:28:23 thumper 	Command failed to complete...Device is gone
Oct 19 16:28:23 thumper scsi: [ID 107833 kern.warning] WARNING: /pci at 1,0/pci1022,7458 at 4/pci11ab,11ab at 1/disk at 6,0 (sd40):
Oct 19 16:28:23 thumper 	SYNCHRONIZE CACHE command failed (5)

=== format:
      30. c5t6d0 <drive type unknown>
          /pci at 1,0/pci1022,7458 at 4/pci11ab,11ab at 1/disk at 6,0

--
Jim Klimov, CTO, JSC "COS&HT"
+7-903-7705859 (cellular)   mailto:jimklimov at cos.ru
CC: admin at cos.ru, jimklimov at mail.ru
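A growing counter like this is easy to watch for in a script. The sketch below parses a saved "zpool status" config section and flags any vdev with nonzero CKSUM but zero READ/WRITE; the sample output, pool name, and counts are hypothetical:

```shell
# Hypothetical config section from "zpool status" saved to a file.
# Columns are NAME STATE READ WRITE CKSUM.
cat > /tmp/zpool.sample <<'EOF'
        NAME        STATE     READ WRITE CKSUM
        pond        ONLINE       0     0     0
          c5t6d0    ONLINE       0     0   312
          c5t7d0    ONLINE       0     0     0
EOF
# Print vdevs showing checksum errors without read/write errors --
# the "growing CKSUM, zero READ/WRITE" pattern described above.
awk 'NR > 1 && $3 == 0 && $4 == 0 && $5 > 0 { print $1, $5 }' /tmp/zpool.sample
```

Run from cron against fresh "zpool status" output, a check like this would catch the roughly 100-per-day growth without waiting for ZFS to fault the drive on its own.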
Edward Ned Harvey
2011-Oct-20 11:55 UTC
[zfs-discuss] Growing CKSUM errors with no READ/WRITE errors
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> new CKSUM errors keep being found. There are zero READ or WRITE error
> counts, though.
>
> Should we be worried about replacing the ex-hotspare drive ASAP as well?

You should not be seeing increasing CKSUM errors. There is something wrong. I cannot say it's necessarily the fault of the drive, but probably it is. When some threshold is reached, ZFS should mark the drive as faulted due to too many cksum errors. I don't recommend waiting for it.
Eric Sproul
2011-Oct-21 16:27 UTC
[zfs-discuss] Growing CKSUM errors with no READ/WRITE errors
On Thu, Oct 20, 2011 at 7:55 AM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> new CKSUM errors keep being found. There are zero READ or WRITE error
>> counts, though.
>>
>> Should we be worried about replacing the ex-hotspare drive ASAP as well?
>
> You should not be seeing increasing CKSUM errors. There is something wrong. I cannot say it's necessarily the fault of the drive, but probably it is. When some threshold is reached, ZFS should mark the drive as faulted due to too many cksum errors. I don't recommend waiting for it.

It probably indicates something else faulty in the I/O path, which could include RAM, HBA or integrated controller chip, loose or defective cabling, etc. If the RAM is ECC-capable, it seems unlikely to be the issue. I'd make sure all cables are fully seated and not kinked or otherwise damaged.

Eric