Liam McBrien
2006-Sep-21 11:25 UTC
[zfs-discuss] zfs gets confused with multiple faults involving hot spares
Hi there,
Not sure if this is a known bug (or even if it's a bug at all), but ZFS
seems to get confused when several consecutive temporary disk faults occur
involving a hot spare. I couldn't find anything related to this on this
forum, so here goes:
I'm testing this on a Sun Blade 2000 hooked up to a T3 via STMS. The OS
version is snv48.
This is a bit confusing, so bear with me. Basically, the problem occurs when the
following happens:
- a pool is created with a hot spare
- a data disk is faulted (so that the spare steps in)
- the data disk is brought back online
- the hot spare is faulted
- the hot spare is brought back online and detached from the pool (to stop
it acting as a spare for the data disk that faulted)
- the original data disk is faulted again
When the above takes place, the spare ends up replacing the data disk completely
in the pool, but it still shows up as a spare. This occurs with mirror, raidz1
and raidz2 configurations.
On another note, when a disk is faulted the console output says
"AUTO-RESPONSE: No automated response will occur." -
shouldn't this mention that a hot spare action will happen?
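As an aside, the same fault events land in the FMA log, so the details can
be pulled up later by EVENT-ID, e.g. for the first fault shown below:

    fmdump -v -u 3eef63b6-061e-6039-e273-e06c9feb8475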
Here's a walkthrough with a 2-way mirror. (I'm
'faulting' the disks by making them invisible to the host
using the T3's LUN masking, then bringing them back by making them
visible again.)
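Condensed, with placeholders standing in for the long device names (the T3
'lun perm' commands are what hide and expose the LUNs), the sequence is:

    zpool create tank mirror <disk1> <disk2> spare <spare>
    lun perm lun 4 none grp v4u2000a   # on the T3: fault <disk2>; the spare steps in
    lun perm lun 4 rw grp v4u2000a     # bring <disk2> back online
    lun perm lun 1 none grp v4u2000a   # fault the spare
    lun perm lun 1 rw grp v4u2000a     # bring the spare back
    zpool detach tank <spare>          # stop it sparing for <disk2>
    lun perm lun 4 none grp v4u2000a   # fault <disk2> again
    zpool status                       # <spare> now sits in the mirror but is still listed under spares

The full transcript: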
*****************
***Create pool***
*****************
root@v4u-2000a-gmp03$ zpool create tank mirror \
    c5t60020F2000000A78450A91BE00088501d0 \
    c5t60020F2000000A78450A918D0003BA4Ad0 \
    spare c5t60020F2000000A7845098A27000B9ED2d0
root@v4u-2000a-gmp03$
root@v4u-2000a-gmp03$ zpool status
pool: tank
state: ONLINE
scrub: none requested
config:
        NAME                                       STATE     READ WRITE CKSUM
        tank                                       ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c5t60020F2000000A78450A91BE00088501d0  ONLINE       0     0     0
            c5t60020F2000000A78450A918D0003BA4Ad0  ONLINE       0     0     0
        spares
          c5t60020F2000000A7845098A27000B9ED2d0    AVAIL
errors: No known data errors
****************************************
***Fault a data disk (bring spare in)***
****************************************
t3f1:/:<161>lun perm lun 4 none grp v4u2000a
<console output>
SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Sep 21 11:45:13 BST 2006
PLATFORM: SUNW,Sun-Blade-1000, CSN: -, HOSTNAME: v4u-2000a-gmp03
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 3eef63b6-061e-6039-e273-e06c9feb8475
DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more
information.
AUTO-RESPONSE: No automated response will occur.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.
root@v4u-2000a-gmp03$ zpool status
pool: tank
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
scrub: resilver completed with 0 errors on Thu Sep 21 11:45:14 2006
config:
        NAME                                         STATE     READ WRITE CKSUM
        tank                                         DEGRADED     0     0     0
          mirror                                     DEGRADED     0     0     0
            c5t60020F2000000A78450A91BE00088501d0    ONLINE       0     0     0
            spare                                    DEGRADED     0     0     0
              c5t60020F2000000A78450A918D0003BA4Ad0  UNAVAIL      0    62     0  cannot open
              c5t60020F2000000A7845098A27000B9ED2d0  ONLINE       0     0     0
        spares
          c5t60020F2000000A7845098A27000B9ED2d0      INUSE     currently in use
errors: No known data errors
************************************
*** Bring data disk back online ***
************************************
t3f1:/:<162>lun perm lun 4 rw grp v4u2000a
root@v4u-2000a-gmp03$ zpool status
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver completed with 0 errors on Thu Sep 21 11:48:26 2006
config:
        NAME                                         STATE     READ WRITE CKSUM
        tank                                         ONLINE       0     0     0
          mirror                                     ONLINE       0     0     0
            c5t60020F2000000A78450A91BE00088501d0    ONLINE       0     0     0
            spare                                    ONLINE       0     0     0
              c5t60020F2000000A78450A918D0003BA4Ad0  ONLINE       0    62     0
              c5t60020F2000000A7845098A27000B9ED2d0  ONLINE       0     0     0
        spares
          c5t60020F2000000A7845098A27000B9ED2d0      INUSE     currently in use
errors: No known data errors
****************************
*** Fault the spare disk ***
****************************
t3f1:/:<163>lun perm lun 1 none grp v4u2000a
<console output>
SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Sep 21 11:51:24 BST 2006
PLATFORM: SUNW,Sun-Blade-1000, CSN: -, HOSTNAME: v4u-2000a-gmp03
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 9a80c89d-6633-e9ae-8315-d632cdb12406
DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more
information.
AUTO-RESPONSE: No automated response will occur.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.
root@v4u-2000a-gmp03$ zpool status
pool: tank
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
scrub: resilver completed with 0 errors on Thu Sep 21 11:48:26 2006
config:
        NAME                                         STATE     READ WRITE CKSUM
        tank                                         DEGRADED     0     0     0
          mirror                                     DEGRADED     0     0     0
            c5t60020F2000000A78450A91BE00088501d0    ONLINE       0     0     0
            spare                                    DEGRADED     0     0     0
              c5t60020F2000000A78450A918D0003BA4Ad0  ONLINE       0    62     0
              c5t60020F2000000A7845098A27000B9ED2d0  UNAVAIL      0    62     0  cannot open
        spares
          c5t60020F2000000A7845098A27000B9ED2d0      INUSE     currently in use
errors: No known data errors
*******************************************
*** Reconnect and detach the spare disk ***
*******************************************
t3f1:/:<164>lun perm lun 1 rw grp v4u2000a
root@v4u-2000a-gmp03$ zpool detach tank c5t60020F2000000A7845098A27000B9ED2d0
root@v4u-2000a-gmp03$
root@v4u-2000a-gmp03$ zpool status
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver completed with 0 errors on Thu Sep 21 11:48:26 2006
config:
        NAME                                       STATE     READ WRITE CKSUM
        tank                                       ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c5t60020F2000000A78450A91BE00088501d0  ONLINE       0     0     0
            c5t60020F2000000A78450A918D0003BA4Ad0  ONLINE       0    62     0
        spares
          c5t60020F2000000A7845098A27000B9ED2d0    UNAVAIL   cannot open
errors: No known data errors
*************************************
*** Fault the original disk again ***
*************************************
t3f1:/:<165>lun perm lun 4 none grp v4u2000a
<console output>
SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Thu Sep 21 11:59:31 BST 2006
PLATFORM: SUNW,Sun-Blade-1000, CSN: -, HOSTNAME: v4u-2000a-gmp03
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: d7c4ffa3-e7d3-41a8-cfbe-eecccb4bbe72
DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more
information.
AUTO-RESPONSE: No automated response will occur.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.
root@v4u-2000a-gmp03$ zpool status
pool: tank
state: ONLINE
scrub: resilver completed with 0 errors on Thu Sep 21 11:59:32 2006
config:
        NAME                                       STATE     READ WRITE CKSUM
        tank                                       ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c5t60020F2000000A78450A91BE00088501d0  ONLINE       0     0     0
            c5t60020F2000000A7845098A27000B9ED2d0  ONLINE       0     0     0
        spares
          c5t60020F2000000A7845098A27000B9ED2d0    UNAVAIL   cannot open
errors: No known data errors
The faulted data disk disappears from the pool completely and the spare takes
its place, but the spare still shows up under 'spares' as well! Am I just
misunderstanding the intended behaviour, or is something amiss here?
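For anyone who hits the same state, my untested guess at a cleanup, once LUN 4
is visible to the host again, would be to swap the original disk back into the
mirror and then drop the stale spare entry:

    # untested sketch -- assumes LUN 4 (the original data disk) is visible again
    zpool replace tank c5t60020F2000000A7845098A27000B9ED2d0 \
        c5t60020F2000000A78450A918D0003BA4Ad0   # put the original disk back
    zpool remove tank c5t60020F2000000A7845098A27000B9ED2d0  # drop the stale spare listing

I haven't verified that ZFS accepts this while it's in the confused state, though.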
Cheers,
Liam
Eric Schrock
2006-Sep-21 19:26 UTC
[zfs-discuss] zfs gets confused with multiple faults involving hot spares
On Thu, Sep 21, 2006 at 04:25:44AM -0700, Liam McBrien wrote:

> Not sure if this is a known bug (or even if it's a bug at all), but
> zfs seems to get confused when several consecutive temporary disk
> faults occur involving a hot spare.
> [...]
> When the above takes place, the spare ends up replacing the data disc
> completely in the pool but it still shows up as a spare. This occurs
> with mirror, raidz1 and raidz2 volumes.

Yes, this sounds like a variation of a known bug that's on my queue to
look at. Basically, the way we determine if something is a spare or not
is rather broken, and you can confuse ZFS to the point of doing the
wrong thing. I'll take a specific look at this case and see if it's the
same underlying root cause.

> On another note, when a disk is faulted the console output says
> "AUTO-RESPONSE: No automated response will occur." - shouldn't this
> mention that a hot spare action will happen?

Yep. I'll take care of this when I do the next phase of ZFS/FMA
integration.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock