Chris Siebenmann
2009-Sep-04 17:10 UTC
[zfs-discuss] Understanding when (and how) ZFS will use spare disks
We have a number of shared spares configured in our ZFS pools, and we're seeing weird issues where spares don't get used under some circumstances. We're running Solaris 10 U6 using pools made up of mirrored vdevs, and what I've seen is:

* if ZFS detects enough checksum errors on an active disk, it will automatically pull in a spare.
* if the system reboots without some of the disks available (so that half of the mirrored pairs drop out, for example), spares will *not* get used. ZFS recognizes that the disks are not there; they are marked as UNAVAIL and the vdevs (and pools) as DEGRADED, but it doesn't try to use spares.

(This is in a SAN environment where half of all of the mirrors come from one controller and half come from another one.)

All of this makes me think that I don't understand how ZFS spares really work, and under what circumstances they'll get used. Does anyone know if there's a writeup of this somewhere?

(What I've gathered so far from reading zfs-discuss archives is that ZFS spares are not handled automatically in the kernel code but are instead deployed to pools by an fmd ZFS management module[*], doing more or less 'zpool replace <pool> <failing-dev> <spare>' (presumably through an internal code path, since 'zpool history' doesn't seem to show spare deployment). Is this correct?)

Also, searching turns up some old zfs-discuss messages suggesting that not bringing in spares in response to UNAVAIL disks was a bug that's now fixed in at least OpenSolaris. If so, does anyone know if the fix has made it into S10 U7 (or is planned or available as a patch)?

Thanks in advance.

- cks

[*: http://blogs.sun.com/eschrock/entry/zfs_hot_spares suggests that it is 'zfs-retire', which is separate from 'zfs-diagnosis'.]
Scott Meilicke
2009-Sep-04 18:01 UTC
[zfs-discuss] Understanding when (and how) ZFS will use spare disks
This sounds like the same behavior as OpenSolaris 2009.06. I had several disks recently go UNAVAIL, and the spares did not take over. But as soon as I physically removed a disk, the spare started replacing the removed disk.

It seems UNAVAIL is not the same as the disk not being there. I wish the spare *would* take over in these cases, since the pool is degraded.

-Scott

-- 
This message posted from opensolaris.org
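
For the case discussed in this thread, where disks are UNAVAIL but no spare is pulled in, the manual equivalent of what the first message describes is 'zpool replace <pool> <failing-dev> <spare>'. The sketch below is only an illustration of that idea: it scans 'zpool status'-style text for UNAVAIL devices and AVAIL spares and prints the replace commands one might run by hand. The sample status text, the pool name 'tank', the device names, and the function name suggest_spare_replacements are all made up for the example, and the parsing assumes a single pool and the usual column layout, which may differ between releases. It does not run any command itself.

```python
# Hypothetical sample of 'zpool status' output; pool and device names are invented.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c1t0d0  ONLINE       0     0     0
            c2t0d0  UNAVAIL      0     0     0  cannot open
          mirror    DEGRADED     0     0     0
            c1t1d0  ONLINE       0     0     0
            c2t1d0  UNAVAIL      0     0     0  cannot open
        spares
          c3t0d0    AVAIL
          c3t1d0    AVAIL

errors: No known data errors
"""


def suggest_spare_replacements(status_text):
    """Pair each UNAVAIL leaf device with an AVAIL spare.

    Returns a list of 'zpool replace' command strings. Assumes the text
    describes a single pool; nothing is executed.
    """
    pool = None
    in_spares = False
    unavail = []
    spares = []
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Keyword lines such as 'pool:', 'state:', 'config:', 'errors:'.
        if fields[0].endswith(":"):
            if fields[0] == "pool:" and len(fields) > 1:
                pool = fields[1]
            in_spares = False
            continue
        # The 'spares' heading starts the list of spare devices.
        if fields == ["spares"]:
            in_spares = True
            continue
        if len(fields) < 2:
            continue
        name, state = fields[0], fields[1]
        if in_spares:
            if state == "AVAIL":
                spares.append(name)
        elif state == "UNAVAIL":
            # Skip the pool itself and grouping vdevs; keep leaf devices only.
            if name != pool and not name.startswith(("mirror", "raidz")):
                unavail.append(name)
    # zip() stops at the shorter list, so extra failed disks get no spare.
    return [
        "zpool replace {} {} {}".format(pool, dev, spare)
        for dev, spare in zip(unavail, spares)
    ]


if __name__ == "__main__":
    for command in suggest_spare_replacements(SAMPLE_STATUS):
        print(command)
```

Printing the commands rather than executing them is deliberate here: in the SAN scenario above, a whole controller's worth of disks may come back on its own, so a person should decide whether pulling in spares is actually wanted.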