Hello all,

SHORT VERSION:

What conditions can cause the resilvering process to reset? My lost-and-found disk can't get back into the pool because the resilvers keep restarting...

LONG VERSION:

I maintain a number of older systems, among them a Sun Thumper (X4500) running OpenSolaris SXCE with nowadays-smallish 250GB disks and a ZPOOL v14 (the OS supports ZPOOL v16). This year it has "lost contact" with two different drives, though a poweroff followed by poweron restored the contact both times. The first time, a couple of months ago, the system successfully used the hotspare disk with no issues worth mentioning or remembering.

This week, however, the hotspare took days to get used and ultimately didn't, with pool CKSUM errors growing into the millions on the lost disk and dmesg clobbered with messages about the lost drive. There was not even a message about resilvering in "zpool status", other than the hotspare being marked "IN USE". After a reboot the disk was re-found and the resilver went humming along, with the hotspare still marked as in use. The error count is non-zero but stable overnight, and no data loss was reported:

  pool: pond
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h49m, 9.52% done, 7h46m to go
config:

        NAME          STATE     READ WRITE CKSUM
        pond          ONLINE       0     0     0
...
          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            spare     ONLINE       0     0    46
              c1t2d0  ONLINE       0     0     0  17.3G resilvered
              c5t6d0  ONLINE       0     0     0
            c4t3d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     0
...
        spares
          c5t6d0      INUSE     currently in use

errors: No known data errors

The problem I come here with is that the resilver does not complete. It got reset after about 40, 84 and 80GB of resilvered progress and restarted from scratch each time. I hoped that it would at least progress further on each attempt, but that does not seem to be true as of the last two restarts.

The pool is indeed active, used by zfs-auto-snapshots in particular (I have disabled them for now while asking - they also perform very slowly, with some "zfs destroy" invocations overlapping each other and failing due to missing targets). There are no correlating errors in dmesg that would point me at a problem, and the CKSUM count stays the same. So far I guess that the stale disk may be found referencing some transactions that are gone from the live pool (e.g. via deletion of automatic snapshots), but I didn't think that should cause a restart of the whole resilvering process?..

The system must be quite fragmented, I guess, because the resilver of a 250GB disk is promised to complete in about 10 hours after each restart - even if 80GB are already behind it. "zpool iostat 60" shows low figures, about 1MB/s writes and about 5MB/s reads; "iostat -xnz 1" also shows low utilization on the drives in general (about 5-10% busy, 500-900KB/s reads), but with a bigger peak on the recovering raidz's drives and the hotspare disk.
But even these peak numbers seem to stay well within the drives' IOPS and bandwidth limits:

# iostat -xzn c0t1d0 c1t2d0 c5t6d0 c4t3d0 c6t5d0 c7t6d0 1
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  107.5   21.1   5200.2    93.1  1.3  0.9   10.0    6.9  22  37 c0t1d0
  107.7   21.2   5200.7    93.2  1.2  0.9    9.6    7.0  22  37 c4t3d0
  107.2   21.1   5199.6    93.2  1.3  0.9   10.0    7.1  22  38 c6t5d0
  107.6   21.1   5199.2    93.2  1.2  0.9    9.5    7.3  21  37 c7t6d0
    1.5  200.0     71.3  4574.1  0.5  0.4    2.7    1.9  10  16 c1t2d0
  106.9   21.2   5124.4    93.1  1.1  0.9    8.8    7.1  20  35 c5t6d0
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  177.9    0.0   4251.1     0.0  1.2  0.4    6.9    2.0  28  35 c0t1d0
  156.8    0.0   4074.2     0.0  1.1  0.4    6.8    2.3  25  35 c4t3d0
  156.8    0.0   4048.1     0.0  0.4  1.1    2.6    6.8  12  33 c6t5d0
  156.8    0.0   4188.3     0.0  1.2  0.4    7.6    2.6  30  41 c7t6d0
    0.0  343.8      0.0  2927.3  0.1  0.1    0.4    0.2   5   6 c1t2d0
  167.9    0.0   4017.4     0.0  1.3  0.4    7.7    2.3  29  37 c5t6d0
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  113.0  114.0   2588.9   134.5  1.7  0.3    7.4    1.5  29  33 c0t1d0
   79.0  122.0   2777.0   144.0  1.6  0.4    7.9    1.8  33  37 c4t3d0
   98.0  122.0   2775.0   144.5  0.6  1.1    2.8    4.9  17  35 c6t5d0
  125.0  125.0   2946.0   144.0  1.5  0.3    6.0    1.3  27  33 c7t6d0
    0.0  395.1      0.0  2363.4  0.3  0.1    0.9    0.2   7   9 c1t2d0
  131.0  114.0   3091.5   138.5  1.5  0.4    6.0    1.5  30  36 c5t6d0
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
   82.0   80.0   1715.7    62.0  0.5  0.2    3.4    1.3   9  21 c0t1d0
   75.0   69.0   1673.8    57.5  0.7  0.2    4.8    1.4  11  20 c4t3d0
   68.0   69.0   1607.8    58.5  0.3  0.5    1.9    3.9   8  22 c6t5d0
   71.0   74.0   1539.8    57.0  0.5  0.2    3.1    1.2   8  17 c7t6d0
    0.0  205.0      0.0  1079.3  0.2  0.0    0.9    0.2   4   4 c1t2d0
   72.0   77.0   1679.8    60.0  0.6  0.2    3.9    1.3  12  19 c5t6d0

What conditions can cause the reset of the resilvering process? And how can it be sped up to completion, other than by disabling the zfs auto-snapshots? There are no other heavy users of the system, though several VMs and webapp servers run off the pool's data and cannot be shut down...

The knobs I know from OI did not work here:

# echo zfs_resilver_delay/W0t0 | mdb -kw
mdb: failed to dereference symbol: unknown symbol name
# echo zfs_resilver_min_time_ms/W0t20000 | mdb -kw
mdb: failed to dereference symbol: unknown symbol name

Thanks for any ideas,
//Jim Klimov
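Since the OI-era knob names do not exist on this build, here is a rough sketch of how one might first check what snv_117 actually exposes, and quiet the auto-snapshot services for the duration. The SMF instance names below are the usual ones but are an assumption here; verify with svcs before disabling anything:

# List scrub/resilver-related kernel symbols this build knows about
# (tunable names differ between snv_117 and later OpenIndiana kernels)
echo ::nm | mdb -k | egrep -i 'resilver|scrub'

# Temporarily (until reboot) disable the zfs-auto-snapshot SMF instances
# while the resilver runs; instance names assumed, check with:
#   svcs -a | grep auto-snapshot
for i in frequent hourly daily weekly monthly; do
    svcadm disable -t auto-snapshot:$i
done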
On Fri, 11 May 2012, Jim Klimov wrote:

> Hello all,
>
> SHORT VERSION:
>
> What conditions can cause the reset of the resilvering
> process? My lost-and-found disk can't get back into the
> pool because of resilvers restarting...

I recall that with sufficiently old vintage zfs, resilver would restart if a snapshot was taken. What sort of zfs is being used here?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
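For reference, the version question can be answered directly on the box with something along these lines (pool name as used in this thread):

# Pool on-disk format version
zpool get version pond
# Filesystem version supported by the running kernel, plus any out-of-date datasets
zfs upgrade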
2012-05-11 17:18, Bob Friesenhahn wrote:
> On Fri, 11 May 2012, Jim Klimov wrote:
>
>> Hello all,
>>
>> SHORT VERSION:
>>
>> What conditions can cause the reset of the resilvering
>> process? My lost-and-found disk can't get back into the
>> pool because of resilvers restarting...
>
> I recall that with sufficiently old vintage zfs, resilver would restart
> if a snapshot was taken. What sort of zfs is being used here?

Maybe sufficiently old: SXCE snv_117, with "zpool upgrade" reporting the version as "14" for this pool (as detailed in the long version of my message).

It is also possible that auto-snapshots are being taken and deleted and thus interfering with the resilver; however, the resilver restarts only after a few hours, while the auto-snapshots run on a 5-minute schedule over about a thousand datasets. With the lags in place, though, I am not sure they all succeed at that frequency ;)

Thanks,
//Jim
2012-05-11 17:18, Bob Friesenhahn wrote:
> On Fri, 11 May 2012, Jim Klimov wrote:
>
>> Hello all,
>>
>> SHORT VERSION:
>>
>> What conditions can cause the reset of the resilvering
>> process? My lost-and-found disk can't get back into the
>> pool because of resilvers restarting...
>
> I recall that with sufficiently old vintage zfs, resilver would restart
> if a snapshot was taken. What sort of zfs is being used here?
>
> Bob

Well, for the night I rebooted the machine into single-user mode, to rule out zones, crontabs and networked abusers, but I still get resilvering resets every now and then, about once an hour.

I'm now trying a run with all zfs datasets unmounted, in the hope that helps somewhat... I'm growing puzzled now. I wonder if things would get better or worse if I kicked one of the drives (i.e. the hotspare c5t6d0) out of the equation:

          raidz1      ONLINE       0     0     0
            c0t1d0    ONLINE       0     0     0
            spare     ONLINE       0     0     0
              c1t2d0  ONLINE       0     0     0  6.72G resilvered
              c5t6d0  ONLINE       0     0     0
            c4t3d0    ONLINE       0     0     0
            c6t5d0    ONLINE       0     0     0
            c7t6d0    ONLINE       0     0     0

Open to ideas and explanations,
thanks,
//Jim
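For what it's worth, quiescing the data pool for a test like the one described above amounts to roughly the following sketch (the root pool stays mounted; dataset layout assumed):

# Unmount every ZFS filesystem that is not busy
zfs unmount -a
# Verify what is still mounted from the data pool
zfs list -o name,mounted -r pond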
> -----Original Message-----
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
> Sent: 12 May 2012 01:27
> Cc: zfs-discuss at opensolaris.org
> Subject: Re: [zfs-discuss] Resilver restarting several times
>
> 2012-05-11 17:18, Bob Friesenhahn wrote:
> > On Fri, 11 May 2012, Jim Klimov wrote:
> >
> >> Hello all,
> >>
> >> SHORT VERSION:
> >>
> >> What conditions can cause the reset of the resilvering process? My
> >> lost-and-found disk can't get back into the pool because of resilvers
> >> restarting...
> >
> > I recall that with sufficiently old vintage zfs, resilver would
> > restart if a snapshot was taken. What sort of zfs is being used here?
> >
> > Bob
>
> Well, for the night I rebooted the machine into single-user mode, to rule out
> zones, crontabs and networked abusers, but I still get resilvering resets every
> now and then, about once an hour.
>
> I'm now trying a run with all zfs datasets unmounted, hope that helps
> somewhat... I'm growing puzzled now.

To double check that no snapshots, etc. are being created, run:

  zpool history -il pond

--
Best regards,
 Robert Milkowski
 http://milek.blogspot.com
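A possible way to narrow that history output down to the interesting events and line them up against the resilver restart times (the grep pattern is just a guess at what matters here):

# Snapshot/destroy/scrub events from the internal log, newest last
zpool history -il pond | egrep -i 'scrub|snapshot|destroy' | tail -40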
2012-05-12 5:50, Robert Milkowski wrote:
>>>> What conditions can cause the reset of the resilvering process? My
>>>> lost-and-found disk can't get back into the pool because of resilvers
>>>> restarting...
>
>> Well, for the night I rebooted the machine into single-user mode, to rule out
>> zones, crontabs and networked abusers, but I still get resilvering resets every
>> now and then, about once an hour.
>>
>> I'm now trying a run with all zfs datasets unmounted, hope that helps
>> somewhat... I'm growing puzzled now.
>
> To double check that no snapshots, etc. are being created run: zpool history -il pond

Hmmm, I should have thought of that - but didn't. Anyhow, the pool history reports many "internal scrub" events starting and ending, with times closely matching the resilver restarts I am seeing. Since the reboot into single-user mode there were no other snapshots, but while the system was fully up there were occasional "internal snapshots" as well. I'll post the last several lines, in hopes that someone can make more sense of them.

Overall the practical question is whether the disk will make it back into the live pool (ultimately with no continuous resilvering), and how fast that can be done - I don't want to risk the big pool with non-redundant arrays for too long. It has already taken 2 days of trying to resilver a 250GB disk into the pool, but it never made it past 100GB of progress. :( No errors are reported that I'd see, either... :)

2012-05-12.00:50:36 [internal pool scrub done txg:91070404] complete=0 [user root on thumper]
2012-05-12.00:50:38 [internal pool scrub txg:91070404] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.01:19:00 [internal pool scrub done txg:91070477] complete=0 [user root on thumper]
2012-05-12.01:19:02 [internal pool scrub txg:91070477] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.01:46:28 [internal pool scrub done txg:91070571] complete=0 [user root on thumper]
2012-05-12.01:46:30 [internal pool scrub txg:91070571] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.02:45:52 [internal snapshot txg:91071279] dataset = 58650 [user daemon on thumper]
2012-05-12.02:45:56 [internal snapshot txg:91071280] dataset = 58652 [user jim on thumper]
2012-05-12.02:46:15 [internal snapshot txg:91071283] dataset = 58654 [user daemon on thumper]
2012-05-12.02:53:01 [internal pool scrub done txg:91071298] complete=0 [user root on thumper]
2012-05-12.02:53:03 [internal pool scrub txg:91071298] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]

The system was rebooted into single-user mode around 3:03am; subsequent entries seem to be only about resilvers of different lengths - if that is what the "internal scrubs" are:

2012-05-12.03:06:09 [internal pool scrub done txg:91071322] complete=0 [user root on thumper]
2012-05-12.03:06:11 [internal pool scrub txg:91071322] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.04:09:35 [internal pool scrub done txg:91071448] complete=0 [user root on thumper]
2012-05-12.04:09:37 [internal pool scrub txg:91071448] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.05:42:43 [internal pool scrub done txg:91071631] complete=0 [user root on thumper]
2012-05-12.05:42:45 [internal pool scrub txg:91071631] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12 4:26, Jim Klimov wrote:
> Wonder if things would get better or worse if I kick one of the
> drives (i.e. hotspare c5t6d0) out of the equation:
>
>           raidz1      ONLINE       0     0     0
>             c0t1d0    ONLINE       0     0     0
>             spare     ONLINE       0     0     0
>               c1t2d0  ONLINE       0     0     0  6.72G resilvered
>               c5t6d0  ONLINE       0     0     0
>             c4t3d0    ONLINE       0     0     0
>             c6t5d0    ONLINE       0     0     0
>             c7t6d0    ONLINE       0     0     0

While googling around for similar reports, I found the rumour that in a situation like mine the original disk and the (partially resilvered) hot spare can conflict, causing the restarts I am seeing, and one poster on a FreeBSD list reported that kicking the spare out of the pool had helped.

Can some real zfs gurus please confirm or deny the rumour, before I do a possibly stupid thing by kicking the hotspare drive out of the equation (and the pool)?

Also, would booting from an OI LiveCD help resilver the pool in one go, as it is (without upgrading the pool on-disk), and would the snv_117 server be able to work with it afterwards? ;)

Thanks,
//Jim
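For the record, if the spare does have to be taken out of the picture, the commands involved would presumably be the standard ones sketched below (device names as in this pool; this is a guess at the procedure, not a recommendation):

# Option 1: drop the partially-resilvered spare and let c1t2d0 finish on its own
zpool detach pond c5t6d0

# Option 2: promote the spare to a full replacement for the flaky disk,
# then return the old disk (or a fresh one) as the new hotspare
zpool replace pond c1t2d0 c5t6d0
zpool detach pond c1t2d0
zpool add pond spare c1t2d0
zpool status -v pond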
2012-05-11 14:22, Jim Klimov wrote:
> What conditions can cause the reset of the resilvering
> process? My lost-and-found disk can't get back into the
> pool because of resilvers restarting...

FOLLOW-UP AND NEW QUESTIONS

Here is a new piece of evidence - I've finally got something out of fmdump: a series of several (5) retries ending with a fail, dated 75 seconds before the resilvers restart (more below). Not a squeak in zpool status, nor dmesg, nor /dev/console.

I guess I must assume that the disk is dying indeed, losing its connection or something like that after a random time (my resilvers restart after 15min-5hrs), and at least a run of the SMART long self-test is in order, while the pool should try to rebuild onto another disk (the hotspare) instead of trying to update this one which was in the pool.

Anyhow, the information on the ex-pool disk is likely unused anyway - from iostat I see that it is only written to, with few to zero reads for minutes at a time - so I'd lose nothing by replacing it with a blank drive (that's strange, though)...

I also guess that the disk gets re-found after something like an unlogged bus reset or whatever, and this event causes the resilvering to restart from scratch.

Q: Would this be the same in OI_151a, or would it continue resilvering from where it left off? I think I had the pool exported once during a resilver, and it restarted from the same percentage counter, so it is possible ;)

The best course of action would be to get those people to fully replace the untrustworthy disk... Or at least pull and push it a bit - maybe its contacts just got plain dirty/oxidized and the disk should be re-seated in the enclosure...

I'd like someone to please confirm or deny my hypotheses and guesses :)

DETAILS

According to format, the disk in the tailed fmdump reports below is indeed the one I'm trying to resilver back into the pool:

# format | gegrep -B1 '/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0'
      10. c1t2d0 <ATA-SEAGATE ST32500N-3AZQ-232.88GB>
          /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0

From the pool history we see the resilvering restart timestamps... they closely match the retry-fail cycles, following them by some 75 seconds:

# zpool history -il pond | tail; date
....
2012-05-12.10:43:35 [internal pool scrub done txg:91072311] complete=0 [user root on thumper]
2012-05-12.10:43:36 [internal pool scrub txg:91072311] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
2012-05-12.14:12:44 [internal pool scrub done txg:91072723] complete=0 [user root on thumper]
2012-05-12.14:12:45 [internal pool scrub txg:91072723] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
Sat May 12 15:45:50 MSK 2012

And last but not least - the FMDUMP messages...

# fmdump -eV | tail -150
...
May 12 2012 10:42:19.559305872 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x928e32c9d1700401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = fail
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xd5 0x53 0x0 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae064b 0x21565490

May 12 2012 14:11:27.754940954 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cff7c1a

May 12 2012 14:11:27.754905021 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfeefbd

May 12 2012 14:11:27.754866050 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfe5782

May 12 2012 14:11:27.754793613 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfd3c8d

May 12 2012 14:11:27.754757103 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfcadef

May 12 2012 14:11:27.754721778 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x492896f3a8500401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
        (end detector)

        driver-assessment = fail
        op-code = 0x28
        cdb = 0x28 0x0 0x14 0xc8 0x77 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x1
        pkt-state = 0x37
        pkt-stats = 0x0
        __ttl = 0x1
        __tod = 0x4fae374f 0x2cfc23f2

//Jim
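To keep an eye on further transport errors without rereading the full verbose output, a couple of one-liners like these may help (plain fmdump options; the uniq trick is just a convenience):

# One line per logged error event, newest last
fmdump -e | tail -30
# Count error reports per device path to see which disk is complaining
fmdump -eV | grep device-path | sort | uniq -c | sort -rn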
2012-05-12 15:52, Jim Klimov wrote:
> 2012-05-11 14:22, Jim Klimov wrote:
>> What conditions can cause the reset of the resilvering
>> process? My lost-and-found disk can't get back into the
>> pool because of resilvers restarting...
>
> Guess I must assume that the disk is dying indeed, losing
> connection or something like that after a random time (my
> resilvers restart after 15min-5hrs), and at least a run of
> SMART long diags is in order, while the pool would try to
> rebuild onto another disk (the hotspare) instead of trying
> to update this one which was in the pool.

Well, I decided to bite the bullet and kick the original disk from the pool after replacing it with the spare, and to say the least, the speed prognosis looks a lot better now: about twice as good in zpool's resilver progress/estimate, and up to an order of magnitude better at iostat speeds:

Sat May 12 16:05:44 MSK 2012
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  571.0    5.4  52488.4     5.1  0.5  3.8    0.9    6.6  11  77 c0t1d0
  589.4    5.6  52630.5     5.3  1.6  2.2    2.6    3.7  32  75 c4t3d0
  562.0    6.2  52552.9     5.8  0.4  4.1    0.8    7.1   9  80 c6t5d0
  582.4    6.2  52554.5     5.8  1.5  2.4    2.5    4.0  29  76 c7t6d0
    0.0  591.0      0.0  52081.8  1.9  6.1    3.3   10.3  23  91 c5t6d0

Sat May 12 16:10:09 MSK 2012
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  248.4    0.1  14190.5     0.0  0.0  6.9    0.0   27.9   0  87 c0t1d0
  247.9    0.1  14188.3     0.0  0.0  6.9    0.0   27.7   0  87 c4t3d0
  248.3    0.1  14193.2     0.0  0.0  6.9    0.0   27.7   0  87 c6t5d0
  245.1    0.1  14195.4     0.0  0.0  7.1    0.0   28.9   0  88 c7t6d0
    0.1  214.4      0.6  13698.7  0.0  9.6    0.0   44.6   0  97 c5t6d0

The disk replacement was as follows: I kicked the hotspare from the resilvering pool and reattached it, then I kicked the original disk from the pair and attached it as a new hotspare. An excerpt from the zpool history follows:

2012-05-12.10:13:03 zpool detach pond c5t6d0 [user root on thumper:global]
2012-05-12.15:58:31 [internal vdev attach txg:91072933] spare in vdev=/dev/dsk/c5t6d0s0 for vdev=/dev/dsk/c1t2d0s0 [user root on thumper]
2012-05-12.15:59:08 zpool replace pond c1t2d0 c5t6d0 [user root on thumper:global]
2012-05-12.16:02:06 zpool detach pond c1t2d0 [user root on thumper:global]
2012-05-12.16:04:35 zpool add pond spare c1t2d0 [user root on thumper:global]

Hope this helps us, thanks for listening,
//Jim Klimov
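Once the spare finishes resilvering in as a full member, the usual follow-up would presumably be to clear the stale counters and verify the rebuilt data with a scrub - a sketch:

zpool status -v pond   # wait for the resilver-completed message here first
zpool clear pond       # reset the old CKSUM counters on the raidz
zpool scrub pond       # full read-back to confirm the reconstruction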
On May 12, 2012, at 4:52 AM, Jim Klimov wrote:

> 2012-05-11 14:22, Jim Klimov wrote:
>> What conditions can cause the reset of the resilvering
>> process? My lost-and-found disk can't get back into the
>> pool because of resilvers restarting...
>
> FOLLOW-UP AND NEW QUESTIONS
>
> Here is a new piece of evidence - I've finally got something
> out of fmdump - series of several (5) retries ending with a
> fail, dated 75 seconds before resilvers restart (more below).
> Not a squeak in zpool status nor dmesg nor /dev/console.
>
> Guess I must assume that the disk is dying indeed, losing
> connection or something like that after a random time (my
> resilvers restart after 15min-5hrs), and at least a run of
> SMART long diags is in order, while the pool would try to
> rebuild onto another disk (the hotspare) instead of trying
> to update this one which was in the pool.

Please share if SMART offers anything useful.

> Anyhow, information on the ex-pool disk is likely unused
> anyway - from iostat I see that it is only written to,
> with few to zero reads for minutes - so I'd lose nothing
> by replacing it with a blank drive (that's strange though)...
>
> I also guess that the disk gets found after something like
> an unlogged bus reset or whatever, and this event causes
> the resilvering to restart from scratch.

This makes sense.

> Q: Would this be the same in OI_151a, or would it continue
> resilvering from where it left off? I think I had the pool
> exported once during a resilver, and it restarted from the
> same percentage counter, so it is possible ;)
>
> Best course of action would be to get those people to fully
> replace the untrustworthy disk... Or at least pull and push
> it a bit - maybe it's contacts just got plain dirty/oxidized
> and the disk should be re-seated in the enclosure...

Not likely to help. SATA?
 -- richard

> I'd like someone to please confirm or deny my hypotheses
> and guesses :)
>
> DETAILS
>
> According to format, the disk in tailed fmdump reports below
> is indeed the one I'm trying to resilver into:
>
> # format | gegrep -B1 '/pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0'
>       10. c1t2d0 <ATA-SEAGATE ST32500N-3AZQ-232.88GB>
>           /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0
>
> From the pool history we see the resilvering restart
> timestamps... they closely match the retry-fail cycles,
> following them by some 75 seconds:
>
> # zpool history -il pond | tail; date
> ....
> 2012-05-12.10:43:35 [internal pool scrub done txg:91072311] complete=0 [user root on thumper]
> 2012-05-12.10:43:36 [internal pool scrub txg:91072311] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
> 2012-05-12.14:12:44 [internal pool scrub done txg:91072723] complete=0 [user root on thumper]
> 2012-05-12.14:12:45 [internal pool scrub txg:91072723] func=1 mintxg=41 maxtxg=91051854 [user root on thumper]
> Sat May 12 15:45:50 MSK 2012
>
> And last but not least - the FMDUMP messages...
>
> # fmdump -eV | tail -150
> ...
> [The fmdump -eV transport-error records were quoted here verbatim; see the
> seven ereport.io.scsi.cmd.disk.tran entries in the previous message.]
>
> //Jim

-- 
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Thanks for staying tuned! ;)

2012-05-12 18:34, Richard Elling wrote:
> On May 12, 2012, at 4:52 AM, Jim Klimov wrote:
>
>> 2012-05-11 14:22, Jim Klimov wrote:
>>> What conditions can cause the reset of the resilvering
>>> process? My lost-and-found disk can't get back into the
>>> pool because of resilvers restarting...
>>
>> FOLLOW-UP AND NEW QUESTIONS
>>
>> Here is a new piece of evidence - I've finally got something
>> out of fmdump - series of several (5) retries ending with a
>> fail, dated 75 seconds before resilvers restart (more below).
>> Not a squeak in zpool status nor dmesg nor /dev/console.
>>
>> Guess I must assume that the disk is dying indeed, losing
>> connection or something like that after a random time (my
>> resilvers restart after 15min-5hrs), and at least a run of
>> SMART long diags is in order, while the pool would try to
>> rebuild onto another disk (the hotspare) instead of trying
>> to update this one which was in the pool.
>
> Please share if SMART offers anything useful.

I plan to run the tests after (if, when) the resilver completes, so as not to disturb the system. So far I have smartmontools-5.42 compiled and it sees the disks :) I am not sure it will actually run the self-tests and report back - that seems to be unsupported by the disk (?):

# /usr/local/sbin/smartctl -x /dev/rdsk/c1t2d0 -d scsi
smartctl 5.42 2011-10-20 r3458 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

User Capacity:        250,056,000,000 bytes [250 GB]
Logical block size:   512 bytes
Serial number:        5QE5ADXW
Device type:          disk
Local Time is:        Sat May 12 18:40:53 2012 MSK
Device supports SMART and is Enabled
Temperature Warning Disabled or Not Supported
SMART Health Status: OK

Current Drive Temperature:     26 C

Error Counter logging not supported
No self-tests have been logged
Device does not support Background scan results logging
scsiPrintSasPhy Log Sense Failed [unsupported field in scsi command]

>> I also guess that the disk gets found after something like
>> an unlogged bus reset or whatever, and this event causes
>> the resilvering to restart from scratch.
>
> This makes sense.

Good, thanks for the sanity-check ;)

>> Best course of action would be to get those people to fully
>> replace the untrustworthy disk... Or at least pull and push
>> it a bit - maybe it's contacts just got plain dirty/oxidized
>> and the disk should be re-seated in the enclosure...
>
> Not likely to help. SATA?

Yes, a 250GB Seagate SATA disk on a Thumper X4500, in use since about May 2008 (the last rpool installation date) or maybe before.

      10. c1t2d0 <ATA-SEAGATE ST32500N-3AZQ-232.88GB>
          /pci@0,0/pci1022,7458@2/pci11ab,11ab@1/disk@2,0

Thanks,
//Jim
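Since -d scsi only yields the bare health page here, it might be worth trying smartmontools' SAT pass-through mode for SATA disks sitting behind SCSI-looking HBAs - whether the Marvell controller on the Thumper passes these commands through is an open question:

# ATA SMART attributes and logs via SAT pass-through (may fail on this HBA)
smartctl -d sat,12 -a /dev/rdsk/c1t2d0
# Long self-test, best started only after the resilver is done
smartctl -d sat,12 -t long /dev/rdsk/c1t2d0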
2012-05-12 7:01, Jim Klimov wrote:
> Overall the applied question is whether the disk will
> make it back into the live pool (ultimately with no
> continuous resilvering), and how fast that can be done -
> I don't want to risk the big pool with nonredundant
> arrays for too long.

Here lies another "grumpy gripe", although maybe it pertains to the oldish snv_117 on that box: the system is not making its best possible effort to complete the resilver ASAP :)

According to "iostat 60", disk utilizations of this raidz set vary between 15 and 50% busy, queue lengths stay within 5 outstanding tasks, CPU kernel time is 2-7% with over 90% idle, and over 2GB of RAM remains free... Why won't it go and complete the quest faster? Can some tire be kicked? ;)

Sat May 12 19:06:09 MSK 2012
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  309.6    3.8  14863.0     5.0  0.0  4.7    0.0   15.0   0  65 c0t1d0
  312.5    3.9  14879.7     5.1  0.0  4.6    0.0   14.7   0  64 c4t3d0
  308.5    4.0  14855.0     5.2  0.0  4.7    0.0   15.1   0  66 c6t5d0
  310.7    3.9  14855.7     5.1  0.0  4.6    0.0   14.8   0  65 c7t6d0
    0.0  225.3      0.0  14484.2  0.0  8.1    0.0   36.0   0  83 c5t6d0
Sat May 12 19:07:09 MSK 2012
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  228.0    3.0   6859.7     4.0  0.0  6.9    0.0   29.9   0  81 c0t1d0
  227.7    3.3   6850.0     4.3  0.0  6.9    0.0   30.0   0  81 c4t3d0
  228.1    3.4   6857.9     4.4  0.0  7.0    0.0   30.0   0  81 c6t5d0
  227.6    3.1   6860.4     4.1  0.0  7.1    0.0   30.7   0  82 c7t6d0
    0.0  225.8      0.0   6379.1  0.0  8.1    0.0   35.8   0  85 c5t6d0
...

During some minutes the disks sit there doing almost nothing at all:

Sat May 12 19:01:09 MSK 2012
                    extended device statistics
    r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
   10.7    0.8    665.4     0.7  0.0  0.1    0.0   11.4   0  13 c0t1d0
   10.7    0.9    667.5     0.7  0.0  0.1    0.0   11.6   0  13 c4t3d0
   10.7    0.8    666.4     0.7  0.0  0.1    0.0   11.9   0  13 c6t5d0
   10.7    0.9    668.5     0.7  0.0  0.1    0.0   11.6   0  13 c7t6d0
    0.1   15.5      0.6     20.3  0.0  0.0    0.0    0.2   0   0 c5t6d0

last pid: 18121;  load avg: 0.16, 0.15, 0.12;  up 0+16:03:44  19:06:51
96 processes: 95 sleeping, 1 on cpu
CPU states: 96.6% idle, 0.2% user, 3.2% kernel, 0.0% iowait, 0.0% swap
Memory: 16G phys mem, 2476M free mem, 16G total swap, 16G free swap
...

> It has already taken 2 days to try and resilver a 250Gb
> disk into the pool, but never made it past 100Gb progress. :(
> Reports no errors that I'd see either... :)

Well, that part seems to have been explained in my other mails, and hopefully worked around by the hotspare.

//Jim
Jim,

This makes sense. fmdump -eV reported that your original drive was experiencing repeated read failures (SCSI command opcode 0x28, i.e. READ(10)).

Steve

----- Original Message -----
....
Well, I decided to bite the bullet and kick the original disk from the pool after replacing it with the spare, and to say the least, speed prognosis looks a lot better now, about two times better in zpool's resilver progress/estimate and by up to an order of magnitude better at iostat speeds:
....
comments below...

On May 12, 2012, at 8:10 AM, Jim Klimov wrote:

> 2012-05-12 7:01, Jim Klimov wrote:
>> Overall the applied question is whether the disk will
>> make it back into the live pool (ultimately with no
>> continuous resilvering), and how fast that can be done -
>> I don't want to risk the big pool with nonredundant
>> arrays for too long.
>
> Here lies another "grumpy gripe", although maybe pertaining
> to the oldish snv_117 on that box: the system is not making
> its best possible effort to complete the resilver ASAP :)
>
> According to "iostat 60", disk utilizations of this raidz
> set vary 15-50% busy, queue lengths stay within 5 outstanding
> tasks, the CPU kernel time is 2-7% with over 90% idling,
> over 2GB RAM remains free... Why won't it go and complete
> the quest faster? Can some tire be kicked? ;)
>
> Sat May 12 19:06:09 MSK 2012
>                     extended device statistics
>     r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
>   309.6    3.8  14863.0     5.0  0.0  4.7    0.0   15.0   0  65 c0t1d0
>   312.5    3.9  14879.7     5.1  0.0  4.6    0.0   14.7   0  64 c4t3d0
>   308.5    4.0  14855.0     5.2  0.0  4.7    0.0   15.1   0  66 c6t5d0
>   310.7    3.9  14855.7     5.1  0.0  4.6    0.0   14.8   0  65 c7t6d0
>     0.0  225.3      0.0  14484.2  0.0  8.1    0.0   36.0   0  83 c5t6d0
> Sat May 12 19:07:09 MSK 2012
>                     extended device statistics
>     r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
>   228.0    3.0   6859.7     4.0  0.0  6.9    0.0   29.9   0  81 c0t1d0
>   227.7    3.3   6850.0     4.3  0.0  6.9    0.0   30.0   0  81 c4t3d0
>   228.1    3.4   6857.9     4.4  0.0  7.0    0.0   30.0   0  81 c6t5d0
>   227.6    3.1   6860.4     4.1  0.0  7.1    0.0   30.7   0  82 c7t6d0
>     0.0  225.8      0.0   6379.1  0.0  8.1    0.0   35.8   0  85 c5t6d0

In general, asvc_t of this magnitude along with actv of this size means you might be better off lowering zfs_vdev_max_pending.

> ...
>
> During some minutes the disks sit there doing almost nothing at all:
>
> Sat May 12 19:01:09 MSK 2012
>                     extended device statistics
>     r/s    w/s     kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
>    10.7    0.8    665.4     0.7  0.0  0.1    0.0   11.4   0  13 c0t1d0
>    10.7    0.9    667.5     0.7  0.0  0.1    0.0   11.6   0  13 c4t3d0
>    10.7    0.8    666.4     0.7  0.0  0.1    0.0   11.9   0  13 c6t5d0
>    10.7    0.9    668.5     0.7  0.0  0.1    0.0   11.6   0  13 c7t6d0
>     0.1   15.5      0.6     20.3  0.0  0.0    0.0    0.2   0   0 c5t6d0

This behaviour cannot be debugged with iostat or any of the various CPU-monitoring stat utilities. There is blocking somewhere, and it is likely to be in the data path. You might try iosnoop and look for I/O completion times that are large (> 1 second).
 -- richard

> last pid: 18121;  load avg: 0.16, 0.15, 0.12;  up 0+16:03:44  19:06:51
> 96 processes: 95 sleeping, 1 on cpu
> CPU states: 96.6% idle, 0.2% user, 3.2% kernel, 0.0% iowait, 0.0% swap
> Memory: 16G phys mem, 2476M free mem, 16G total swap, 16G free swap
> ...
>
>> It has already taken 2 days to try and resilver a 250Gb
>> disk into the pool, but never made it past 100Gb progress. :(
>> Reports no errors that I'd see either... :)
>
> Well, that part seems to have been explained in my other
> mails, and hopefully worked around by the hotspare.
>
> //Jim

-- 
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
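For completeness, lowering the queue depth Richard mentions is normally done live through mdb; whether the symbol is present under this exact name on snv_117 should be checked first - a sketch:

# Read the current value (32-bit decimal); symbol name assumed to exist on this build
echo zfs_vdev_max_pending/D | mdb -k
# Lower it from the old default (35, if memory serves) to something smaller
echo zfs_vdev_max_pending/W0t10 | mdb -kw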