Apologies in advance as this is a Solaris 10 question and not an
OpenSolaris issue (well, OK, it *may* also be an OpenSolaris issue).
System is a T2000 running Solaris 10U9 with the latest ZFS patches (zpool
version 22). Storage is a pile of J4400s (5 of them).

I have run into what appears to be (Sun) Bug ID 6995143, and I opened
a case with Oracle requesting to be added to that bug. I am being told
that the bug has been abandoned and that ZFS is behaving "correctly".
Here is what I am seeing:

1) zpool with multiple vdevs and hot spares
2) multiple drive failures at once
3) multiple hot spares in use (so far, only one in each vdev, but they
   are raidz2 so I suppose it could be up to 2 in each vdev)
4) after repair of the failed drives and the resilver completes, the hot
   spares stay in use

I have NOT seen the issue with only a single drive failure.
I have NOT seen the problem if the failed drive(s) is (are) replaced
BEFORE the resilver of the hot spares completes.

In other words, I have only seen the issue if there is more than one
failed drive at once and the hot spares complete resilvering before
the bad drives are repaired.

This has all been seen in our test environment, where we simulate a
drive failure by either removing a drive or disabling it (via CAM;
these are J4400 drives). This came to light during testing to
determine the resolution of another bug (SATA over SAS multipathing
driver issues).

We do have a workaround. Once the resilver of the repaired drives
completes we can 'zpool detach' the hot spare device from the zpool
(vdev) and it goes back into the AVAIL state.

Is this EXPECTED behavior for multiple drive failures?

Here is some detailed information from my latest test.

Here is the pool in a failed state (2 drives failed)...

  pool: nyc-test-01
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: scrub completed after 0h0m with 0 errors on Thu Mar  3 08:41:04 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  DEGRADED     0     0     0
          raidz2-0                   DEGRADED     0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            spare-3                  DEGRADED     0     0     0
              c5t5000CCA215C28142d0  REMOVED      0     0     0
              c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
          raidz2-1                   DEGRADED     0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c5t5000CCA215C280F8d0  REMOVED      0     0     0
              c5t5000CCA215C83160d0  ONLINE       0     0     0
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      INUSE     currently in use
          c5t5000CCA215C83160d0      INUSE     currently in use

errors: No known data errors

Here is the attempt to bring one of the failed drives back online
using 'zpool replace' (after the drive was enabled), which tosses a
warning (as expected)...

> sudo zpool replace nyc-test-01 c5t5000CCA215C28142d0
Password:
invalid vdev specification
use '-f' to override the following errors:
/dev/dsk/c5t5000CCA215C28142d0s0 is part of active ZFS pool
nyc-test-01. Please see zpool(1M).

Instead of forcing the replace, I did a 'zpool online'.

Here is the pool resilvering after one of the two failed drives is
brought back online via the 'zpool online' command (while the resilver
is still running)...

  pool: nyc-test-01
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: resilver in progress for 0h0m, 6.94% done, 0h4m to go
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  DEGRADED     0     0     0
          raidz2-0                   ONLINE       0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            spare-3                  ONLINE       0     0     0
              c5t5000CCA215C28142d0  ONLINE       0     0     0  104M resilvered
              c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
          raidz2-1                   DEGRADED     0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c5t5000CCA215C280F8d0  REMOVED      0     0     0
              c5t5000CCA215C83160d0  ONLINE       0     0     0
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      INUSE     currently in use
          c5t5000CCA215C83160d0      INUSE     currently in use

errors: No known data errors

Here is the pool after the resilver has completed, but the hot spare
is still in use...

  pool: nyc-test-01
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: resilver completed after 0h0m with 0 errors on Thu Mar  3 08:49:03 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  DEGRADED     0     0     0
          raidz2-0                   ONLINE       0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            spare-3                  ONLINE       0     0     0
              c5t5000CCA215C28142d0  ONLINE       0     0     0  302M resilvered
              c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
          raidz2-1                   DEGRADED     0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            spare-1                  DEGRADED     0     0     0
              c5t5000CCA215C280F8d0  REMOVED      0     0     0
              c5t5000CCA215C83160d0  ONLINE       0     0     0
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      INUSE     currently in use
          c5t5000CCA215C83160d0      INUSE     currently in use

errors: No known data errors

Here is the zpool after the second failed drive has been brought back
online via 'zpool online'...
  pool: nyc-test-01
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Thu Mar  3 09:04:46 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            spare-3                  ONLINE       0     0     0
              c5t5000CCA215C28142d0  ONLINE       0     0     0
              c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
          raidz2-1                   ONLINE       0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            spare-1                  ONLINE       0     0     0
              c5t5000CCA215C280F8d0  ONLINE       0     0     0  46K resilvered
              c5t5000CCA215C83160d0  ONLINE       0     0     0
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      INUSE     currently in use
          c5t5000CCA215C83160d0      INUSE     currently in use

errors: No known data errors

Note that BOTH hot spares are still in use even though the faults have
been cleared.

Now I detach one of the hot spares...

> sudo zpool detach nyc-test-01 c5t5000CCA215C7FD6Ed0

  pool: nyc-test-01
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Thu Mar  3 09:04:46 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            c5t5000CCA215C28142d0    ONLINE       0     0     0
          raidz2-1                   ONLINE       0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            spare-1                  ONLINE       0     0     0
              c5t5000CCA215C280F8d0  ONLINE       0     0     0  46K resilvered
              c5t5000CCA215C83160d0  ONLINE       0     0     0
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      AVAIL
          c5t5000CCA215C83160d0      INUSE     currently in use

errors: No known data errors

and now the other hot spare...

> sudo zpool detach nyc-test-01 c5t5000CCA215C83160d0

  pool: nyc-test-01
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Thu Mar  3 09:04:46 2011
config:

        NAME                         STATE     READ WRITE CKSUM
        nyc-test-01                  ONLINE       0     0     0
          raidz2-0                   ONLINE       0     0     0
            c5t5000CCA215C8A649d0    ONLINE       0     0     0
            c5t5000CCA215C84A65d0    ONLINE       0     0     0
            c5t5000CCA215C34786d0    ONLINE       0     0     0
            c5t5000CCA215C28142d0    ONLINE       0     0     0
          raidz2-1                   ONLINE       0     0     0
            c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
            c5t5000CCA215C280F8d0    ONLINE       0     0     0  46K resilvered
            c5t5000CCA215C34753d0    ONLINE       0     0     0
            c5t5000CCA215C34823d0    ONLINE       0     0     0
        spares
          c5t5000CCA215C7FD6Ed0      AVAIL
          c5t5000CCA215C83160d0      AVAIL

errors: No known data errors

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company
   ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
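The workaround described above can be reduced to a short command sequence. This is a sketch only — the pool and spare names are the ones from this test, and the `zpool detach` commands are collected and printed rather than executed, so the sequence is explicit without touching a live pool:

```shell
# Workaround sketch: once the repaired drives have finished
# resilvering, detach each hot spare that is still INUSE so it
# returns to the AVAIL state. Commands are printed, not executed.
POOL=nyc-test-01

DETACH_CMDS=$(
  for SPARE in c5t5000CCA215C7FD6Ed0 c5t5000CCA215C83160d0; do
    echo "zpool detach $POOL $SPARE"
  done
)
echo "$DETACH_CMDS"
```

Running each printed line by hand (after confirming the resilver has completed in `zpool status`) is exactly the manual procedure described in the post.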
On Mar 3, 2011, at 6:45 AM, Paul Kraus wrote:

> Apologies in advance as this is a Solaris 10 question and not an
> OpenSolaris issue (well, OK, it *may* also be an OpenSolaris issue).
> System is a T2000 running Solaris 10U9 with latest ZFS patches (zpool
> version 22). Storage is a pile of J4400 (5 of them).

No problem, this is the "zfs-discuss" forum. AFAIK, there is no
Solaris-10 specific alias.

> 1) zpool with multiple vdevs and hot spares
> 2) multiple drive failures at once

In my experience, hot spares do not help with the case where the
failures are not explicitly drive failures. In those cases where I see
multiple failures at once, the root cause has never been that all of
the implicated drives are bad.

<snip>

> Is this EXPECTED behavior for multiple drive failures ?

I believe it is the right thing to do.

<snip>

> Instead of forcing the replace, I did a 'zpool online'

In the case of SAS drives, it is rare that replacing a disk with itself
can work -- a replacement disk will have a different WWN. In this
case, your test plan is incorrect and the zpool online is correct.

<snip>

> Here is the pool after the resilver has completed, but the hot spare
> is still in use...

correct

<snip>

>> sudo zpool detach nyc-test-01 c5t5000CCA215C83160d0

sudo is so old skewl :-)

<snip>

This is the expected behaviour, and IMNSHO, the best solution to the
thorny problem of sparing. This procedure allows you to manage the
intent of the replacement. Note that you can run perfectly fine in the
ONLINE state indefinitely, so the only issue is to ensure that the
administrator's intent is implemented.
 -- richard
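The distinction Richard draws can be put in command form: if the same physical disk comes back (same WWN, e.g. re-enabled via CAM), `zpool online` is the right command; only a physically new disk warrants `zpool replace`. In this sketch, `SAME_DISK` is an assumed flag introduced for illustration, and the chosen command is printed rather than executed:

```shell
# Choose between 'zpool online' (same disk returned) and
# 'zpool replace' (physically new disk). Sketch only: SAME_DISK is
# a hypothetical flag, and the command is printed, not run.
POOL=nyc-test-01
DEV=c5t5000CCA215C28142d0
SAME_DISK=yes   # assumption for this sketch: the original disk came back

if [ "$SAME_DISK" = yes ]; then
  CMD="zpool online $POOL $DEV"    # original disk back in its slot
else
  CMD="zpool replace $POOL $DEV"   # a different physical disk went in
fi
echo "$CMD"
```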
Hi Paul,

I've seen some spare stickiness too, and it's generally when I'm trying
to simulate a drive failure (like you are below) without actually
physically replacing the device.

If I actually physically replace the failed drive, the spare is
detached automatically after the new device is resilvered. I haven't
seen a multiple drive failure in our systems here so I can't comment on
that part, but I don't think this part is related to spare behavior.

The issue here is that the drive was only "disabled," you didn't
actually "physically" replace it, so ZFS throws an error:

> sudo zpool replace nyc-test-01 c5t5000CCA215C28142d0
> Password:
> invalid vdev specification
> use '-f' to override the following errors:
> /dev/dsk/c5t5000CCA215C28142d0s0 is part of active ZFS pool
> nyc-test-01. Please see zpool(1M).

If you want to rerun your test, try these steps:

1. Remove a device from the pool
2. Watch for the spare to kick in
3. Replace the "failed" device with a new physical device
4. Run the zpool replace command
5. Observe spare behavior

I don't see the spare as hung, it just needs to be detached as
described here:

http://download.oracle.com/docs/cd/E19253-01/819-5461/6n7ht6qvv/index.html#gjfbs

Thanks,

Cindy

On 03/03/11 07:45, Paul Kraus wrote:
> Apologies in advance as this is a Solaris 10 question and not an
> OpenSolaris issue (well, OK, it *may* also be an OpenSolaris issue).
> System is a T2000 running Solaris 10U9 with latest ZFS patches (zpool
> version 22). Storage is a pile of J4400 (5 of them).
<snip>
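Cindy's five steps above can be sketched as a command sequence. `NEW_DEV` is a hypothetical replacement-disk name (not from the thread), and the zpool commands are printed rather than executed:

```shell
# Cindy's suggested re-test as a command sequence (sketch only).
# NEW_DEV is a hypothetical new physical device name; commands are
# printed, not executed.
POOL=nyc-test-01
OLD_DEV=c5t5000CCA215C28142d0
NEW_DEV=c5t5000CCA215CNEWd0   # hypothetical replacement disk

# Step 1: remove OLD_DEV from the pool (physically or via CAM)
# Step 2: watch 'zpool status' for the spare to kick in
# Step 3: insert the new physical device
REPLACE_CMD="zpool replace $POOL $OLD_DEV $NEW_DEV"   # step 4
STATUS_CMD="zpool status $POOL"                       # step 5: observe
echo "$REPLACE_CMD"
echo "$STATUS_CMD"
```

If the spare behavior matches Cindy's experience, the spare should detach itself automatically after step 4's resilver completes.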
On Thu, Mar 3, 2011 at 11:48 AM, Richard Elling
<richard.elling at gmail.com> wrote:

>> 1) zpool with multiple vdevs and hot spares
>> 2) multiple drive failures at once
>
> In my experience, hot spares do not help with the case where the failures
> are not explicitly drive failures. In those cases where I see multiple failures
> at once, the root cause has never been that all of the implicated drives are bad.

With one exception, my experience agrees with yours. In general,
multiple simultaneous drive failures are almost never real drive
failures, but failures of something else that makes the drives
unavailable (such as a bad cable or a flaky SIM). The one exception
involved two 750 GB SATA drives failing within hours of each other (out
of the 120 drives in five J4400s, all purchased at the same time). This
one exception, which happened in pre-production testing, led us to test
the multiple failure case.

<snip>

>> Is this EXPECTED behavior for multiple drive failures ?
>
> I believe it is the right thing to do.

OK, so it *may* be that ZFS reacts differently to single and multiple
drive failures on purpose. My concern is that this behavior is an
unintended consequence of something.

<snip>

>> Instead of forcing the replace, I did a 'zpool online'
>
> In the case of SAS drives, it is rare that replacing a disk with itself
> can work -- a replacement disk will have a different WWN. In this
> case, your test plan is incorrect and the zpool online is correct.

I have been told by Oracle Support that in the case of SATA drives in a
J4400, the WWN is associated with the slot and NOT the drive, so a
replacement drive will have the same WWN. Empirically, I have seen both
behaviors (and I don't really like that). In some cases replacing a
drive did not cause a WWN change and in others it did. I think it
depended on the manufacturer of the drive. In other words, if a Hitachi
was replaced with a Hitachi the WWN did not change, but when a Hitachi
was replaced with a Seagate then the WWN did change.

<snip>

> This is the expected behaviour, and IMNSHO, the best solution to the thorny
> problem of sparing. This procedure allows you to manage the intent of the
> replacement. Note that you can run perfectly fine in the ONLINE state
> indefinitely, so the only issue is to ensure that the administrator's
> intent is implemented.

The only issue I see is that it keeps a hot spare busy and unavailable
to cover for a different (real?) failure. But, one would hope that the
administrator would notice and take corrective action as necessary ...

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company
   ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
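The concern about a spare quietly staying busy is easy to watch for by parsing `zpool status` output for spares marked INUSE. A minimal sketch — here a captured sample of the spares section from this thread stands in for live output:

```shell
# Report hot spares still marked INUSE. In production you would feed
# 'zpool status <pool>' into the awk; here a captured sample from this
# thread is used instead so the sketch is self-contained.
sample='        spares
          c5t5000CCA215C7FD6Ed0    INUSE     currently in use
          c5t5000CCA215C83160d0    AVAIL'

# Second column of a spare line is its state (INUSE / AVAIL).
inuse=$(printf '%s\n' "$sample" | awk '$2 == "INUSE" { print $1 }')
echo "spares still INUSE: $inuse"
```

A cron job built around this kind of check would flag the stuck-spare condition so the administrator can run the `zpool detach` workaround.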
On Thu, Mar 3, 2011 at 2:08 PM, Cindy Swearingen
<cindy.swearingen at oracle.com> wrote:

> I've seen some spare stickiness too, and it's generally when I'm trying to
> simulate a drive failure (like you are below) without actually
> physically replacing the device.
>
> If I actually physically replace the failed drive, the spare is
> detached automatically after the new device is resilvered. I haven't
> seen a multiple drive failure in our systems here so I can't comment on
> that part, but I don't think this part is related to spare behavior.

I will have to see if I can scare up enough loose drives to do this
(really replace the 'failed' drive). Unfortunately, I am 200 miles away
from the drives in question ...

<snip>

> If you want to rerun your test, try these steps:
>
> 1. Remove a device from the pool
> 2. Watch for the spare to kick in
> 3. Replace the "failed" device with a new physical device
> 4. Run the zpool replace command
> 5. Observe spare behavior

I *may* have enough spare drives to try this.

> I don't see the spare as hung, it just needs to be detached as described
> here:
>
> http://download.oracle.com/docs/cd/E19253-01/819-5461/6n7ht6qvv/index.html#gjfbs

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company
   ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
Just a shot in the dark, but could this possibly be related to my issue as posted with the subject "Nasty zfs issue"? roy ----- Original Message -----> Apologies in advance as this is a Solaris 10 question and not an > OpenSolaris issue (well, OK, it *may* also be an OpenSolaris issue). > System is a T2000 running Solaris 10U9 with latest ZFS patches (zpool > version 22). Storage is a pile of J4400 (5 of them). > > I have run into what appears to be (Sun) Bug ID 6995143, and I opened > a case with Oracle requesting to be added to that bug. I am being told > that bug had been abandoned and that ZFS is behaving "correctly". Here > is what I am seeing: > > 1) zpool with multiple vdevs and hot spares > 2) multiple drive failures at once > 3) multiple hot spares in use (so far, only one in each vdev, but they > are raidz2 so I suppose it could be up to 2 in each vdev) > 4) after repair of the failed drives and resilver completes, the hot > spares stay in use > > I have NOT seen the issue with only a single drive failure. > I have NOT seen the problem if the failed drive(s) is(are) replaced > BEFORE the resilver of the hot spares completes > > In other words, I have only seen the issue if there are more than one > failed drive at once and if the hot spares complete resilvering before > the bad drives are repaired. > > This has all been seen in our test environment, and we simulate a > drive failure by either removing a drive or disabling it (via CAM, > these are J4400 drives). This came to light due to testing to > determine resolution of another bug (SATA over SAS multipathing driver > issues). > > We do have a work around. Once the resilver of the repaired drives > completes we can ''zpool detach'' the hot spare device from the zpool > (vdev) and it goes back into the AVAIL state. > > Is this EXPECTED behavior for multiple drive failures ? > > Here is some detailed information from my latest test. > > Here is the pool in failed state (2 drives failed)... 
> pool: nyc-test-01
> state: DEGRADED
> status: One or more devices has been removed by the administrator.
>         Sufficient replicas exist for the pool to continue functioning
>         in a degraded state.
> action: Online the device using 'zpool online' or replace the device
>         with 'zpool replace'.
> scrub: scrub completed after 0h0m with 0 errors on Thu Mar 3 08:41:04 2011
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  DEGRADED     0     0     0
>           raidz2-0                   DEGRADED     0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             spare-3                  DEGRADED     0     0     0
>               c5t5000CCA215C28142d0  REMOVED      0     0     0
>               c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
>           raidz2-1                   DEGRADED     0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             spare-1                  DEGRADED     0     0     0
>               c5t5000CCA215C280F8d0  REMOVED      0     0     0
>               c5t5000CCA215C83160d0  ONLINE       0     0     0
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      INUSE     currently in use
>           c5t5000CCA215C83160d0      INUSE     currently in use
>
> errors: No known data errors
>
> Here is the attempt to bring one of the failed drives back online
> using 'zpool replace' (after the drive was enabled), which tosses a
> warning (as expected)...
>
> > sudo zpool replace nyc-test-01 c5t5000CCA215C28142d0
> Password:
> invalid vdev specification
> use '-f' to override the following errors:
> /dev/dsk/c5t5000CCA215C28142d0s0 is part of active ZFS pool
> nyc-test-01. Please see zpool(1M).
>
> Instead of forcing the replace, I did a 'zpool online'.
>
> Here is the pool resilvering after one of the two failed drives is
> brought back online via the 'zpool online' command (while the resilver
> is still running)...
>
> pool: nyc-test-01
> state: DEGRADED
> status: One or more devices has been removed by the administrator.
>         Sufficient replicas exist for the pool to continue functioning
>         in a degraded state.
> action: Online the device using 'zpool online' or replace the device
>         with 'zpool replace'.
> scrub: resilver in progress for 0h0m, 6.94% done, 0h4m to go
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  DEGRADED     0     0     0
>           raidz2-0                   ONLINE       0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             spare-3                  ONLINE       0     0     0
>               c5t5000CCA215C28142d0  ONLINE       0     0     0  104M resilvered
>               c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
>           raidz2-1                   DEGRADED     0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             spare-1                  DEGRADED     0     0     0
>               c5t5000CCA215C280F8d0  REMOVED      0     0     0
>               c5t5000CCA215C83160d0  ONLINE       0     0     0
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      INUSE     currently in use
>           c5t5000CCA215C83160d0      INUSE     currently in use
>
> errors: No known data errors
>
> Here is the pool after the resilver has completed, but the hot spare
> is still in use...
>
> pool: nyc-test-01
> state: DEGRADED
> status: One or more devices has been removed by the administrator.
>         Sufficient replicas exist for the pool to continue functioning
>         in a degraded state.
> action: Online the device using 'zpool online' or replace the device
>         with 'zpool replace'.
> scrub: resilver completed after 0h0m with 0 errors on Thu Mar 3 08:49:03 2011
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  DEGRADED     0     0     0
>           raidz2-0                   ONLINE       0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             spare-3                  ONLINE       0     0     0
>               c5t5000CCA215C28142d0  ONLINE       0     0     0  302M resilvered
>               c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
>           raidz2-1                   DEGRADED     0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             spare-1                  DEGRADED     0     0     0
>               c5t5000CCA215C280F8d0  REMOVED      0     0     0
>               c5t5000CCA215C83160d0  ONLINE       0     0     0
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      INUSE     currently in use
>           c5t5000CCA215C83160d0      INUSE     currently in use
>
> errors: No known data errors
>
> Here is the zpool after the second failed drive has been brought back
> online via 'zpool online'...
>
> pool: nyc-test-01
> state: ONLINE
> scrub: resilver completed after 0h0m with 0 errors on Thu Mar 3 09:04:46 2011
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  ONLINE       0     0     0
>           raidz2-0                   ONLINE       0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             spare-3                  ONLINE       0     0     0
>               c5t5000CCA215C28142d0  ONLINE       0     0     0
>               c5t5000CCA215C7FD6Ed0  ONLINE       0     0     0
>           raidz2-1                   ONLINE       0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             spare-1                  ONLINE       0     0     0
>               c5t5000CCA215C280F8d0  ONLINE       0     0     0  46K resilvered
>               c5t5000CCA215C83160d0  ONLINE       0     0     0
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      INUSE     currently in use
>           c5t5000CCA215C83160d0      INUSE     currently in use
>
> errors: No known data errors
>
> Note that BOTH hot spares are still in use even though the faults have
> been cleared.
>
> Now I detach one of the hot spares...
> > sudo zpool detach nyc-test-01 c5t5000CCA215C7FD6Ed0
>
> pool: nyc-test-01
> state: ONLINE
> scrub: resilver completed after 0h0m with 0 errors on Thu Mar 3 09:04:46 2011
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  ONLINE       0     0     0
>           raidz2-0                   ONLINE       0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             c5t5000CCA215C28142d0    ONLINE       0     0     0
>           raidz2-1                   ONLINE       0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             spare-1                  ONLINE       0     0     0
>               c5t5000CCA215C280F8d0  ONLINE       0     0     0  46K resilvered
>               c5t5000CCA215C83160d0  ONLINE       0     0     0
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      AVAIL
>           c5t5000CCA215C83160d0      INUSE     currently in use
>
> errors: No known data errors
>
> and now the other hot spare...
>
> > sudo zpool detach nyc-test-01 c5t5000CCA215C83160d0
>
> pool: nyc-test-01
> state: ONLINE
> scrub: resilver completed after 0h0m with 0 errors on Thu Mar 3 09:04:46 2011
> config:
>
>         NAME                         STATE     READ WRITE CKSUM
>         nyc-test-01                  ONLINE       0     0     0
>           raidz2-0                   ONLINE       0     0     0
>             c5t5000CCA215C8A649d0    ONLINE       0     0     0
>             c5t5000CCA215C84A65d0    ONLINE       0     0     0
>             c5t5000CCA215C34786d0    ONLINE       0     0     0
>             c5t5000CCA215C28142d0    ONLINE       0     0     0
>           raidz2-1                   ONLINE       0     0     0
>             c5t5000CCA215C8A5B5d0    ONLINE       0     0     0
>             c5t5000CCA215C280F8d0    ONLINE       0     0     0  46K resilvered
>             c5t5000CCA215C34753d0    ONLINE       0     0     0
>             c5t5000CCA215C34823d0    ONLINE       0     0     0
>         spares
>           c5t5000CCA215C7FD6Ed0      AVAIL
>           c5t5000CCA215C83160d0      AVAIL
>
> errors: No known data errors
>
> --
> {--------1---------2---------3---------4---------5---------6---------7---------}
> Paul Kraus
> -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
> -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
> -> Technical Advisor, RPI Players
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to avoid
excessive use of idioms of foreign origin. In most cases, adequate and
relevant synonyms exist in Norwegian.
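[Editor's sketch] The manual workaround described in the quoted message (detach every spare still INUSE once the repaired drives have resilvered) can be semi-automated by parsing `zpool status` output. The snippet below prints, rather than runs, the needed `zpool detach` commands; the embedded sample is the spares section from the thread, and in real use you would pipe in live `zpool status nyc-test-01` output instead.

```shell
#!/bin/sh
# Print the 'zpool detach' commands needed to free stuck hot spares.
POOL=nyc-test-01

# Sample spares section from the thread; stands in for live status output.
status_sample() {
cat <<'EOF'
        spares
          c5t5000CCA215C7FD6Ed0    INUSE     currently in use
          c5t5000CCA215C83160d0    INUSE     currently in use
EOF
}

# Field 2 is the state column; INUSE rows name spares still attached.
status_sample |
awk '$2 == "INUSE" {print $1}' |
while read dev; do
    echo "zpool detach $POOL $dev"
done
```

Review the printed commands before running them; only detach spares after confirming the resilver of the repaired drives has completed, as the thread notes.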
On Thu, Mar 3, 2011 at 4:28 PM, Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:

> Just a shot in the dark, but could this possibly be related to my issue
> as posted with the subject "Nasty zfs issue"?

I do not think they are directly related. I have seen some odd behavior
when I replace a failed drive before the resilver completes, but nothing
as dramatic as what you saw. I have also seen at least as many cases of
completely normal behavior when replacing a failed drive before the hot
spare finishes resilvering (the scrub restarts and resilvers both the
hot spare and the replacement drive). My standard practice is to wait
for the hot spare to finish resilvering before replacing the failed
drive, just to be safe.

--
Paul Kraus
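[Editor's sketch] The "wait for the hot spare to finish resilvering" practice above can be expressed as a small polling helper. The function only inspects `zpool status` text; it assumes the Solaris 10 wording seen in this thread ("resilver in progress" vs. "resilver completed"), and the commented-out loop shows intended real-world use.

```shell
#!/bin/sh
# resilver_done STATUS_TEXT
# Returns success (0) once the status text no longer reports an active
# resilver. Pattern matching assumes the wording shown in this thread.
resilver_done() {
    case "$1" in
        *"resilver in progress"*) return 1 ;;
        *) return 0 ;;
    esac
}

# Intended use (commented out so this sketch has no side effects):
# while ! resilver_done "$(zpool status nyc-test-01)"; do
#     sleep 60    # poll once a minute until the spare finishes resilvering
# done
# zpool replace nyc-test-01 c5t5000CCA215C280F8d0
```

The polling interval and the follow-up `zpool replace` are illustrative; the device name is one of the failed drives from the thread.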
> > Just a shot in the dark, but could this possibly be related to my
> > issue as posted with the subject "Nasty zfs issue"?
>
> I do not think they are directly related. I have seen some odd
> behavior when I replace a failed drive before the resilver completes,
> but nothing as dramatic as what you saw. I have also seen at least as
> many cases of completely normal behavior when replacing a failed drive
> before the hot spare finishes resilvering (the scrub restarts and
> resilvers both the hot spare and the replacement drive). My standard
> practice is to wait for the hot spare to finish resilvering before
> replacing the failed drive, just to be safe.

That's my plan for tomorrow :)

I'll be doing some more tests on a test system to see if I can
reproduce this error of mine. If I can, I'll file a bug (for Illumos,
primarily).

Vennlige hilsener / Best regards

roy
Roy Sigurd Karlsbakk
2011-Mar-03 22:14 UTC
[zfs-discuss] Use of spares? (Was: Hung Hot Spare)
> > > Just a shot in the dark, but could this possibly be related to my
> > > issue as posted with the subject "Nasty zfs issue"?
> >
> > I do not think they are directly related. I have seen some odd
> > behavior when I replace a failed drive before the resilver
> > completes, but nothing as dramatic as what you saw. I have also seen
> > at least as many cases of completely normal behavior when replacing
> > a failed drive before the hot spare finishes resilvering (the scrub
> > restarts and resilvers both the hot spare and the replacement
> > drive). My standard practice is to wait for the hot spare to finish
> > resilvering before replacing the failed drive, just to be safe.
>
> That's my plan for tomorrow :)
>
> I'll be doing some more tests on a test system to see if I can
> reproduce this error of mine. If I can, I'll file a bug (for Illumos,
> primarily)

Another thing I've seen in the lab: if I have a raidz2 vdev and two
drives fail and are replaced by hot spares, the vdev fails if a third
drive fails, even though the spares are functional. After the failed
drive is resilvered, the pool operates normally. Wouldn't it be better
for a spare to work as a full member of the pool instead of just being a
legitimate spare?

Vennlige hilsener / Best regards

roy
Fred Liu
2011-Mar-04 00:34 UTC
[zfs-discuss] performance of whole pool suddenly degrades for a while and recovers when one file system tries to exceed the quota
Hi,

Has anyone seen this? I hit it every time; it feels just like somebody
in a car suddenly stepping on the brake pedal and then releasing it.

Thanks.

Fred