Tuomas Leikola
2010-Sep-27 10:13 UTC
[zfs-discuss] Resilver endlessly restarting at completion
Hi!

My home server had some disk outages due to flaky cabling and whatnot, and started resilvering to a spare disk. During this, another disk or two dropped and were reinserted into the array. So no devices were actually lost; they were just intermittently away for a while each.

The situation is currently as follows:

  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 5h33m, 22.47% done, 19h10m to go
config:

        NAME                       STATE     READ WRITE CKSUM
        tank                       ONLINE       0     0     0
          raidz1-0                 ONLINE       0     0     0
            c11t1d0p0              ONLINE       0     0     0
            c11t2d0                ONLINE       0     0     5
            c11t6d0p0              ONLINE       0     0     0
            spare-3                ONLINE       0     0     0
              c11t3d0p0            ONLINE       0     0     0  106M resilvered
              c9d1                 ONLINE       0     0     0  104G resilvered
            c11t4d0p0              ONLINE       0     0     0
            c11t0d0p0              ONLINE       0     0     0
            c11t5d0p0              ONLINE       0     0     0
            c11t7d0p0              ONLINE       0     0     0  93.6G resilvered
          raidz1-2                 ONLINE       0     0     0
            c6t2d0                 ONLINE       0     0     0
            c6t3d0                 ONLINE       0     0     0
            c6t4d0                 ONLINE       0     0     0  2.50K resilvered
            c6t5d0                 ONLINE       0     0     0
            c6t6d0                 ONLINE       0     0     0
            c6t7d0                 ONLINE       0     0     0
            c6t1d0                 ONLINE       0     0     1
        logs
          /dev/zvol/dsk/rpool/log  ONLINE       0     0     0
        cache
          c6t0d0p0                 ONLINE       0     0     0
        spares
          c9d1                     INUSE     currently in use

errors: No known data errors

And this has been going on for a week now, always restarting when it should complete.

The questions in my mind at the moment:

1. How can I determine the cause for each resilver? Is there a log?

2. Why does it resilver the same data over and over, and not just the changed bits?

3. Can I force-remove c9d1, as it is no longer needed and c11t3 can be resilvered instead?

I'm running OpenSolaris build 134, but the event originally happened on build 111b. I upgraded and tried quiescing snapshots and I/O, none of which helped.
I've already ordered some new hardware to recreate this entire array as raidz2, among other things, but there's about a week of time during which I can run debuggers and traces if instructed to.

- Tuomas
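On question 1: there is no cleartext resilver log, but one workaround is to poll `zpool status` and record the `scrub:` line, so a restart shows up as the percent-done figure dropping. The sketch below assumes the pool name `tank` and an arbitrary log path; the `pct_done` helper is an illustrative extraction, not a ZFS facility.

```shell
#!/bin/sh
# Sketch only: watch for resilver restarts by logging progress over time.
# Pool name "tank" and the log path are assumptions for this example.

# Extract the percent-done figure from a "scrub:" status line such as
# "scrub: resilver in progress for 5h33m, 22.47% done, 19h10m to go"
pct_done() {
    grep -o '[0-9][0-9.]*% done' | cut -d'%' -f1
}

# Polling loop (commented out; would run against the live pool):
# while true; do
#     printf '%s ' "$(date '+%F %T')"
#     zpool status tank | grep 'scrub:'
#     sleep 300
# done >> /var/tmp/resilver.log
```

Each restart then appears in the log as a timestamped drop in the percentage, which can be correlated with device events in /var/adm/messages.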
Tuomas Leikola
2010-Sep-29 17:13 UTC
[zfs-discuss] Resilver endlessly restarting at completion
The endless resilver problem still persists on OI b147. It restarts when it should complete.

I see no other solution than to copy the data to safety and recreate the array. Any hints would be appreciated, as that takes days unless I can stop or pause the resilvering.

On Mon, Sep 27, 2010 at 1:13 PM, Tuomas Leikola <tuomas.leikola at gmail.com> wrote:

> Hi!
>
> My home server had some disk outages due to flaky cabling and whatnot,
> and started resilvering to a spare disk. During this another disk or two
> dropped, and were reinserted into the array. So no devices were actually
> lost, they just were intermittently away for a while each.
>
> [zpool status output snipped; see the first message above]
>
> And this has been going on for a week now, always restarting when it
> should complete.
>
> The questions in my mind atm:
>
> 1. How can I determine the cause for each resilver? Is there a log?
>
> 2. Why does it resilver the same data over and over, and not just the
>    changed bits?
>
> 3. Can I force remove c9d1 as it is no longer needed but c11t3 can be
>    resilvered instead?
>
> I'm running opensolaris 134, but the event originally happened on 111b.
> I upgraded and tried quiescing snapshots and IO, none of which helped.
>
> I've already ordered some new hardware to recreate this entire array as
> raidz2 among other things, but there's about a week of time when I can
> run debuggers and traces if instructed to.
>
> - Tuomas
George Wilson
2010-Sep-29 18:01 UTC
[zfs-discuss] Resilver endlessly restarting at completion
Answers below...

Tuomas Leikola wrote:
> The endless resilver problem still persists on OI b147. Restarts when it
> should complete.
>
> I see no other solution than to copy the data to safety and recreate the
> array. Any hints would be appreciated as that takes days unless I can
> stop or pause the resilvering.
>
> [quoted original message and zpool status output snipped]
>
> The questions in my mind atm:
>
> 1. How can I determine the cause for each resilver? Is there a log?

If you're running OI b147 then you should be able to do the following:

    # echo "::zfs_dbgmsg" | mdb -k > /var/tmp/dbg.out

Send me the output.

> 2. Why does it resilver the same data over and over, and not just
>    the changed bits?

If you're having drives fail prior to the initial resilver finishing, then it will restart and do all the work over again. Are drives still failing randomly for you?

> 3. Can I force remove c9d1 as it is no longer needed but c11t3 can
>    be resilvered instead?

You can detach the spare and let the resilver work on only c11t3. Can you send me the output of 'zdb -dddd tank 0'?

Thanks,
George
Tuomas Leikola
2010-Sep-29 19:06 UTC
[zfs-discuss] Resilver endlessly restarting at completion
Thanks for taking an interest. Answers below.

On Wed, Sep 29, 2010 at 9:01 PM, George Wilson <george.r.wilson at oracle.com> wrote:

>> 1. How can I determine the cause for each resilver? Is there a log?
>
> If you're running OI b147 then you should be able to do the following:
>
>     # echo "::zfs_dbgmsg" | mdb -k > /var/tmp/dbg.out
>
> Send me the output.

Sending verbose output in a separate email. I'm not very familiar with this, but it does show some "restarting" lines.

>> 2. Why does it resilver the same data over and over, and not just the
>>    changed bits?
>
> If you're having drives fail prior to the initial resilver finishing then
> it will restart and do all the work over again. Are drives still failing
> randomly for you?

Drives haven't been dropping since the initial incidents. It has run to completion a few times now without (visible) issues with the drives. Then again, I think there is some magic that reinserts a device back into the array after an intermittent SATA disconnection.

>> 3. Can I force remove c9d1 as it is no longer needed but c11t3 can be
>>    resilvered instead?
>
> You can detach the spare and let the resilver work on only c11t3. Can you
> send me the output of 'zdb -dddd tank 0'?

The detach command complains that there are not enough replicas. Of course, I could physically remove the device, at which point a scrub would suffice (the disks must be relatively well up to date by now).

Sending zdb output in a separate mail as soon as it completes.
Tuomas Leikola
2010-Oct-05 09:10 UTC
[zfs-discuss] Resilver endlessly restarting at completion
This seems to have been a false alarm; sorry for that. As soon as I started paying attention (logging zpool status, peeking around with zdb and mdb), the resilver didn't restart unless provoked. A cleartext log would have been nice ("restarted due to c11t7 becoming online").

A slight problem I can see is that a resilver always restarts if a device is added to the array. In my case, devices were absent for a short period (some SATA failure that corrected itself by running cfgadm -c disconnect and connect), and it would have been beneficial to let the resilver run to completion and only restart it after that, to resilver the missing data on the added device. ZFS does have some intelligence for these cases, so that not all data is resilvered, only blocks born after the outage.

Also, as I had a spare in the array, it kicked in, which was probably not what I would have wanted, as it triggered a full resilver rather than a partial one. After the fact, I could not kick the spare out, and could not make the resilvering process forget about doing a full resilver. Plus, now I have to replace it back out and make it a cold spare again. But all's well that ends well... mostly. Devices still seem to be dropping from the SATA bus randomly. Maybe I'll cough together a report and post it to storage-discuss.

On Wed, Sep 29, 2010 at 8:13 PM, Tuomas Leikola <tuomas.leikola at gmail.com> wrote:

> The endless resilver problem still persists on OI b147. Restarts when it
> should complete.
>
> I see no other solution than to copy the data to safety and recreate the
> array. Any hints would be appreciated as that takes days unless I can
> stop or pause the resilvering.
>
> [earlier messages and zpool status output snipped]
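The final cleanup step described above (returning c9d1 to cold-spare duty once the resilver has finished) would look roughly like the sketch below. It uses the device and pool names from this thread and is not a verified procedure; check your own `zpool status` before running anything.

```shell
#!/bin/sh
# Sketch only: return the hot spare to cold standby after the resilver
# completes. Device names (tank, c9d1) are taken from this thread.

# True if the given 'zpool status' text still reports a resilver running.
resilver_running() {
    grep -q 'resilver in progress'
}

# Intended sequence (commented out; would run against the live pool):
# if ! zpool status tank | resilver_running; then
#     zpool detach tank c9d1        # drop the now-redundant spare copy
#     zpool add tank spare c9d1     # re-add it as a cold spare
# fi
```

This matches George's advice that the spare can be detached; attempting the detach while the resilver is still in progress is what produced the "not enough replicas" complaint earlier in the thread.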