As mentioned last night, we've been reviewing a proposal for hot spare
support in ZFS. Below you can find a current draft of the proposed
interfaces. This has not yet been submitted for ARC review, but comments
are welcome. Note that this does not include any enhanced FMA diagnosis to
determine when a device is "faulted". That will come in a follow-on
project, for which some preliminary designs have been sketched out, but
not enough to draft a coherent proposal.

- Eric

A. DESCRIPTION

ZFS, as an integrated volume manager and filesystem, has the ability to
replace disks within an active pool. This allows administrators to replace
failing or faulted drives to keep the system functioning with the required
level of replication. Most other volume managers also support the ability
to perform this replacement automatically through the use of "hot spares".
This case will add this functionality to ZFS.

This case will increment the on-disk version number in accordance with
PSARC 2006/206, as the resulting labels introduce a new pool state that
older pools will not understand, and exported pools containing hot spares
will not be importable on earlier versions.

B. POOL MANAGEMENT

Hot spares are stored with each pool, although they can be shared between
different pools. This allows administrators to reserve system-wide hot
spares, as well as per-pool hot spares, according to their policies.

1. Creating a pool with hot spares

A pool can be created with hot spares by using the new 'spare' vdev:

    # zpool create test mirror c0d0 c1d0 spare c2d0 c3d0

This will create a pool with a single mirror and two spares. Only a single
'spare' vdev can be specified, though it can appear anywhere within the
command line. The resulting pool looks like the following:

    # zpool status
      pool: test
     state: ONLINE
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            test        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c0d0    ONLINE       0     0     0
                c1d0    ONLINE       0     0     0
            spares
              c2d0      ONLINE
              c3d0      ONLINE

    errors: No known data errors

2. Adding hot spares to a pool

Hot spares can be added to a pool in the same manner by using 'zpool add':

    # zpool add test spare c4d0 c5d0

This will add two disks to the set of available spares in the pool.

3. Removing hot spares from a pool

Hot spares can be removed from a pool with the new 'zpool remove'
subcommand. This subcommand suggests the ability to remove arbitrary
devices, and that is certainly a feature that will be supported in a
future release, but for now it will only allow removing hot spares. For
example:

    # zpool remove test c2d0

If the hot spare is currently spared in, the command will print an error
and exit.

4. Activating a hot spare

Hot spares can be used for replacement just like any other device using
'zpool replace'. If ZFS detects that the device is a hot spare within the
same pool, it will create a 'spare' vdev instead of a 'replacing' vdev:

    # zpool replace test c0d0 c2d0
    # zpool status
    ...
    config:

            NAME          STATE     READ WRITE CKSUM
            test          ONLINE       0     0     0
              mirror      ONLINE       0     0     0
                spare     ONLINE       0     0     0
                  c0d0    ONLINE       0     0     0  35.5K resilvered
                  c2d0    ONLINE       0     0     0  35.5K resilvered
                c1d0      ONLINE       0     0     0
            spares
              c2d0        SPARED    currently in use
              c3d0        ONLINE

The difference between a 'replacing' and a 'spare' vdev is that the former
automatically removes the original drive once the replace completes. With
spares, the vdev remains until the original device is removed from the
system, at which point the hot spare is returned to the pool of available
spares. Note that in this example we have replaced an online device. Under
normal circumstances, the device in question would be faulted, or the
administrator would have proactively offlined the device.

5. Deactivating a hot spare

There are three ways in which a hot spare can be deactivated: cancelling
the hot spare, replacing the original drive, or permanently swapping in
the hot spare.

To cancel a hot spare attempt, the user can simply 'zpool detach' the hot
spare in question, at which point it will be returned to the set of
available spares, and the original drive will remain in its current
position (faulted or not):

    # zpool detach test c2d0
    # zpool status
    ...
    config:

            NAME        STATE     READ WRITE CKSUM
            test        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c0d0    ONLINE       0     0     0  35.5K resilvered
                c1d0    ONLINE       0     0     0
            spares
              c2d0      ONLINE
              c3d0      ONLINE

If the original device is replaced, then the spare is automatically
removed once the replace completes:

    # zpool replace test c0d0 c4d0
    # zpool status
    ...
    config:

            NAME            STATE     READ WRITE CKSUM
            test            ONLINE       0     0     0
              mirror        ONLINE       0     0     0
                spare       ONLINE       0     0     0
                  replacing ONLINE       0     0     0
                    c0d0    ONLINE       0     0     0  38K resilvered
                    c4d0    ONLINE       0     0     0  38K resilvered
                  c2d0      ONLINE       0     0     0  38K resilvered
                c1d0        ONLINE       0     0     0
            spares
              c2d0          SPARED    currently in use
              c3d0          ONLINE

    <wait for replace to complete>

    # zpool status
    ...
    config:

            NAME        STATE     READ WRITE CKSUM
            test        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c4d0    ONLINE       0     0     0  35.5K resilvered
                c1d0    ONLINE       0     0     0
            spares
              c2d0      ONLINE
              c3d0      ONLINE

If the user instead wants the hot spare to permanently assume the place of
the original device, the original device can be removed with 'zpool
detach'. At this point the hot spare becomes a functioning device and is
automatically removed from the list of available hot spares (for all
pools, if it is shared):

    # zpool detach test c0d0
    # zpool status
    ...
    config:

            NAME        STATE     READ WRITE CKSUM
            test        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c2d0    ONLINE       0     0     0  35K resilvered
                c1d0    ONLINE       0     0     0
            spares
              c3d0      ONLINE

6. Determining device usage

A hot spare is considered 'in use' for the purposes of libdiskmgt and
zpool(1M) if it is labelled as a spare and is currently in one or more
pools' lists of active spares. If a spare is part of an exported pool, it
is not considered in use, due largely to the fact that distinguishing this
case from a recently destroyed pool is difficult and not solvable in the
general case.

C. AUTOMATED REPLACEMENT

In order to perform automated replacement, a ZFS FMA agent will be added
that subscribes to 'fault.zfs.vdev.*' faults. When a fault is received,
the agent will examine the pool to see if it has any available hot spares.
If so, it will perform a 'zpool replace' with an available spare. The
initial algorithm for this will be 'first come, first served', which may
not be ideal for all circumstances (such as when not all spares are the
same size). It is anticipated that these circumstances will be rare, and
that the algorithm can be improved in the future.

This is currently limited by the fact that the ZFS diagnosis engine only
emits faults when a device has disappeared from the system. When the DE is
enhanced to proactively fault drives based on error rates, the agent will
automatically leverage this feature.

In addition, note that there is no automated response capable of bringing
the original drive back online. The user must explicitly take one of the
actions described above. A future enhancement will allow ZFS to subscribe
to hotplug events and automatically replace the affected drive when it is
replaced on the system.

D. MANPAGE DIFFS

XXX
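A minimal sketch of the 'first come, first served' selection described in
section C, purely for illustration. The Pool/Spare structures, field names,
and the fault handler below are hypothetical, not the actual FMA agent or
libzfs interfaces; the only grounded piece is that the agent ultimately
issues the documented 'zpool replace' command:

    from dataclasses import dataclass, field
    from typing import List, Optional
    import subprocess

    @dataclass
    class Spare:
        name: str
        state: str = "ONLINE"      # "ONLINE" (available), "SPARED", or "UNAVAIL"

    @dataclass
    class Pool:
        name: str
        spares: List[Spare] = field(default_factory=list)

    def choose_spare(pool: Pool) -> Optional[Spare]:
        """First come, first served: pick the first available spare in config order."""
        for spare in pool.spares:
            if spare.state == "ONLINE":
                return spare
        return None                # no usable spare; the pool stays degraded

    def on_vdev_fault(pool: Pool, faulted_vdev: str) -> None:
        """Hypothetical handler for a fault.zfs.vdev.* event naming faulted_vdev."""
        spare = choose_spare(pool)
        if spare is not None:
            # Equivalent of the documented command: zpool replace <pool> <old> <new>
            subprocess.run(["zpool", "replace", pool.name, faulted_vdev, spare.name],
                           check=True)

A smarter policy (for example, preferring the smallest spare that is still
large enough) is exactly the kind of future improvement the proposal
anticipates for mixed-size spare configurations.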
I didn't catch a mention of RAID-Z in your note.

How would hot spares play in a RAID-Z type configuration? (Especially with
the "auto-return-home" (or predictive replacement) feature you mention.)
[In traditional RAID arrays, hot-spare rebuilds and "go-home" transitions
are handled differently to cut down on exposure windows and resource
utilization; not sure if/how that applies here...]

If I read/interpreted the last part of your note correctly, I think it's
OK to use a max-size LUN to hot-spare any LUN <= HOT_SPARE, returning
"home" once its job is done, ready to spare for any other pool/LUN.
(Obviously not the entire hot spare will be used if it's "sparing" for a
smaller failed LUN.)

Maybe there is no difference between mirrored and RAID-Z configurations
(ZFS masking all this?), but even in this case a note stating that this
works for both mirrored and RAID-Z configurations might make sense?

Thanks,

 -- MikeE
On Thu, Mar 30, 2006 at 01:20:20PM -0500, Ellis, Mike wrote:
> I didn't catch a mention of RAID-Z in your note.
>
> How would hot spares play in a RAID-Z type configuration? (Especially
> with the "auto-return-home" (or predictive replacement) feature you
> mention.) [In traditional RAID arrays, hot-spare rebuilds and
> "go-home" transitions are handled differently to cut down on exposure
> windows and resource utilization; not sure if/how that applies here...]
>
> If I read/interpreted the last part of your note correctly, I think it's
> OK to use a max-size LUN to hot-spare any LUN <= HOT_SPARE, returning
> "home" once its job is done, ready to spare for any other pool/LUN.
> (Obviously not the entire hot spare will be used if it's "sparing" for a
> smaller failed LUN.)

Yep. The initial concern raised was "what if I have a pool with half 36G
disks and half 72G disks?" If you then have both 36G and 72G spares, then
using a 72G spare for a 36G disk could potentially deprive you of a needed
hot spare should a 72G disk fail. In general, this is a misconfigured
system, since it gives you a false sense of security when examining your
available hot spares. Hence not worrying about it in the initial version.

> Maybe there is no difference between mirrored and RAID-Z configurations
> (ZFS masking all this?), but even in this case a note stating that this
> works for both mirrored and RAID-Z configurations might make sense?

Yes, mirror and RAID-Z replacements are handled identically, and use the
same resilvering code. There is no need to do any special-casing or worry
about "exposure windows" or anything like that. I can add statements to
that effect.

Note that it may also be possible to hot-spare unreplicated pools with the
arrival of predictive analysis and pro-active replacement. The usefulness
of this feature is rather questionable, however.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
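As a concrete illustration of that point, the commands from the proposal
apply unchanged to a RAID-Z pool. The device names below are placeholders
and the flow is a sketch rather than captured output:

    # zpool create tank raidz c0d0 c1d0 c2d0 spare c3d0
    <c1d0 faults; sparing is done by hand here, or by the FMA agent>
    # zpool replace tank c1d0 c3d0
    <optionally, make the spare permanent>
    # zpool detach tank c1d0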
On Mar 30, 2006, at 12:03, Eric Schrock wrote:
> A hot spare is considered 'in use' for the purposes of libdiskmgt and
> zpool(1M) if it is labelled as a spare and is currently in one or more
> pools' lists of active spares. If a spare is part of an exported pool,
> it is not considered in use, due largely to the fact that distinguishing
> this case from a recently destroyed pool is difficult and not solvable
> in the general case.

Would it be possible (or useful) to have a 'pool' of spares available to a
couple of ZFS pools?

Instead of associating the disks with a particular pool, you would be able
to say "if a disk fails in ZFS pool X, Y, or Z, grab disk 1, 2, or 3; if a
disk fails in ZFS pool A, B, or C, grab disk 4 or 5; all other ZFS pools
should grab disk 6".
David Magda wrote:
> Would it be possible (or useful) to have a 'pool' of spares available
> to a couple of ZFS pools?
>
> Instead of associating the disks with a particular pool, you would be
> able to say "if a disk fails in ZFS pool X, Y, or Z, grab disk 1, 2,
> or 3; if a disk fails in ZFS pool A, B, or C, grab disk 4 or 5; all
> other ZFS pools should grab disk 6".

From the proposal:

  B. POOL MANAGEMENT

  Hot spares are stored with each pool, although they can be shared
  between different pools. This allows administrators to reserve
  system-wide hot spares, as well as per-pool hot spares, according to
  their policies.

So spares can belong to multiple pools, I take it.
On Thu, Mar 30, 2006 at 08:33:32PM -0500, David Magda wrote:
> Would it be possible (or useful) to have a 'pool' of spares available
> to a couple of ZFS pools?
>
> Instead of associating the disks with a particular pool, you would be
> able to say "if a disk fails in ZFS pool X, Y, or Z, grab disk 1, 2,
> or 3; if a disk fails in ZFS pool A, B, or C, grab disk 4 or 5; all
> other ZFS pools should grab disk 6".

We kicked this idea around for a while, but there are two main reasons for
not doing it:

1. You need to invent a new grammar for describing arbitrary relations
   between spares and pools. We can't leverage any existing ZFS CLI to do
   this for us.

2. The information about which spares are allocated to your pool is no
   longer associated with your disks. With ZFS, we've tried very hard to
   keep all information about your data, including how to mount it, share
   it, and manage redundancy, with the data itself. Having a separate pool
   means that 'zpool export' no longer takes information about my hot
   spares with it, which is not too appealing.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Fri, Mar 31, 2006 at 01:31:49PM +1100, Barry Robison wrote:
> So spares can belong to multiple pools, I take it.

Yep. Here's an example:

    # zpool create test mirror c0t0d0 c0t1d0 spare c1t0d0 c1t1d0
    # zpool create test2 mirror c4t0d0 c4t1d0 spare c1t0d0 c1t1d0
    # zpool status
      pool: test
     state: ONLINE
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            test        ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c0t0d0  ONLINE       0     0     0
                c0t1d0  ONLINE       0     0     0
            spares
              c1t0d0    ONLINE
              c1t1d0    ONLINE

    errors: No known data errors

      pool: test2
     state: ONLINE
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            test2       ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c4t0d0  ONLINE       0     0     0
                c4t1d0  ONLINE       0     0     0
            spares
              c1t0d0    ONLINE
              c1t1d0    ONLINE

    errors: No known data errors

    # zpool replace test c0t0d0 c1t0d0
    # zpool status
      pool: test
     state: ONLINE
     scrub: resilver completed with 0 errors on Thu Mar 30 21:42:37 2006
    config:

            NAME          STATE     READ WRITE CKSUM
            test          ONLINE       0     0     0
              mirror      ONLINE       0     0     0
                spare     ONLINE       0     0     0
                  c0t0d0  ONLINE       0     0     0  35.5K resilvered
                  c1t0d0  ONLINE       0     0     0  35.5K resilvered
                c0t1d0    ONLINE       0     0     0
            spares
              c1t0d0      SPARED    currently in use
              c1t1d0      ONLINE

    errors: No known data errors

      pool: test2
     state: ONLINE
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            test2       ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c4t0d0  ONLINE       0     0     0
                c4t1d0  ONLINE       0     0     0
            spares
              c1t0d0    SPARED    in use by pool 'test'
              c1t1d0    ONLINE

    errors: No known data errors

It's probably a bug that the 'test' pool is reported as ONLINE. By
definition, a 'spare' vdev should probably be treated as DEGRADED. I can
fix that...

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Eric Schrock wrote:
> ...
> # zpool replace test c0t0d0 c1t0d0
> # zpool status
>   pool: test
>  state: ONLINE
>  scrub: resilver completed with 0 errors on Thu Mar 30 21:42:37 2006
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         test          ONLINE       0     0     0
>           mirror      ONLINE       0     0     0
>             spare     ONLINE       0     0     0
>               c0t0d0  ONLINE       0     0     0  35.5K resilvered
>               c1t0d0  ONLINE       0     0     0  35.5K resilvered
>             c0t1d0    ONLINE       0     0     0
>         spares
>           c1t0d0      SPARED    currently in use
>           c1t1d0      ONLINE
> ...
> It's probably a bug that the 'test' pool is reported as ONLINE. By
> definition, a 'spare' vdev should probably be treated as DEGRADED. I
> can fix that...

To me the output here is a little confusing. Shouldn't the status of
c0t0d0 in mirror's spare output say something other than "ONLINE"? Perhaps
also that for c1t0d0? I'd expect c1t0d0 to be ONLINE (in the mirror/spare
output) after the replacement is complete, and at some other state in the
meantime.

Darren
> >         spares
> >           c1t0d0    SPARED    currently in use
> >           c1t1d0    ONLINE

> To me the output here is a little confusing. Shouldn't the status
> of c0t0d0 in mirror's spare output say something other than "ONLINE"?
> Perhaps also that for c1t0d0?

I agree. I'd expect ONLINE to mean in use, and OFFLINE to mean not in use
(and thus available). But that's still somewhat indirect.

How about TAKEN and AVAILABLE?

Jeff
On Thu, Mar 30, 2006 at 09:03:30AM -0800, Eric Schrock wrote:
> 3. Removing hot spares from a pool
>
> Hot spares can be removed from a pool with the new 'zpool remove'
> subcommand. This subcommand suggests the ability to remove arbitrary
> devices, and that is certainly a feature that will be supported in a
> future release, but for now it will only allow removing hot spares.
> For example:
>
> # zpool remove test c2d0
>
> If the hot spare is currently spared in, the command will print an
> error and exit.

I am not sure whether shrinking a pool is being considered for the future,
but if it is, wouldn't it be better to use a different syntax:

[SPARES]
# zpool remove test spare c2d0

[SHRINKING]
# zpool remove test c2d0

This way I can distinguish between removing a spare and _shrinking_ the
pool. Without that I could easily make a mistake.

przemol
Hello Eric,

This is great!

However, it would be really useful if you could specify that some of the
spares are global - so that if I create a new pool, these spares will be
assigned to it automatically.

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
On Mar 31, 2006, at 00:45, Eric Schrock wrote:
> # zpool create test mirror c0t0d0 c0t1d0 spare c1t0d0 c1t1d0
> # zpool create test2 mirror c4t0d0 c4t1d0 spare c1t0d0 c1t1d0

Yes, I must have read over section B too quickly, since this is more or
less what I meant in my question. Thanks for clearing things up.

Regards,
David
On Thu, Mar 30, 2006 at 11:57:35PM -0800, Jeff Bonwick wrote:
> > >         spares
> > >           c1t0d0    SPARED    currently in use
> > >           c1t1d0    ONLINE
>
> > To me the output here is a little confusing. Shouldn't the status
> > of c0t0d0 in mirror's spare output say something other than "ONLINE"?
> > Perhaps also that for c1t0d0?
>
> I agree. I'd expect ONLINE to mean in use, and OFFLINE to mean
> not in use (and thus available). But that's still somewhat indirect.
>
> How about TAKEN and AVAILABLE?

I'm all for AVAILABLE. It's still possible to have UNAVAIL spares as well,
as the kernel verifies that they can be opened and correspond to a known
device.

Of course, this makes me wonder about replacing hot spares. If we validate
the GUID is known, how does one replace a hot spare? If I swap in a
different drive, it'll complain that the disk doesn't match the known
spare. Perhaps 'zpool replace' needs to support hot spares, and the future
hotplug work can replace them automatically. I'll need to think about that
for a bit...

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Fri, Mar 31, 2006 at 12:59:47PM +0200, Robert Milkowski wrote:
> Hello Eric,
>
> This is great!
>
> However, it would be really useful if you could specify that some of
> the spares are global - so that if I create a new pool, these spares
> will be assigned to it automatically.

I'm hesitant to do this for two reasons:

1. We're creating auxiliary ZFS state that is independent of the pool
   data. This means that we need to invent a new syntax for managing
   system-wide global spares, as well as for how to assign them to pools.

2. Creating pools is not a common operation. Most systems will have only
   one or two pools on them. It's easy enough to simply add the same
   spares to both pools, and more configurable.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Fri, Mar 31, 2006 at 11:02:59AM +0200, przemolicc at poczta.fm wrote:
> I am not sure whether shrinking a pool is being considered for the
> future, but if it is, wouldn't it be better to use a different syntax:
>
> [SPARES]
> # zpool remove test spare c2d0
>
> [SHRINKING]
> # zpool remove test c2d0
>
> This way I can distinguish between removing a spare and _shrinking_
> the pool. Without that I could easily make a mistake.

For future pool removal, I anticipate having labelled mirrors and RAID-Z
vdevs, so that you can identify them by name, such as:

    mirror-1    c0d0 c1d0
    mirror-2    c2d0 c3d0

Then you can remove a toplevel vdev by saying 'zpool remove mirror-1'. The
only way this could become confusing is if you have an unreplicated pool
with hot spares, but I don't see that being a useful configuration.

Note that another possibility would be:

    # zpool remove mirror c0d0

which means "remove the mirror containing disk c0d0", but that has other
issues (especially if we support mirrors of RAID-Z and more complicated
configurations). This is definitely a reason not to have 'zpool remove'
behave like 'zpool detach' for the single-drive case.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Jeff Bonwick wrote:
> > >         spares
> > >           c1t0d0    SPARED    currently in use
> > >           c1t1d0    ONLINE
>
> > To me the output here is a little confusing. Shouldn't the status
> > of c0t0d0 in mirror's spare output say something other than "ONLINE"?
> > Perhaps also that for c1t0d0?
>
> I agree. I'd expect ONLINE to mean in use, and OFFLINE to mean
> not in use (and thus available). But that's still somewhat indirect.
>
> How about TAKEN and AVAILABLE?

I agree with those suggestions.

Darren
Hello Eric,

Friday, March 31, 2006, 7:41:57 PM, you wrote:

ES> On Fri, Mar 31, 2006 at 12:59:47PM +0200, Robert Milkowski wrote:
>> However, it would be really useful if you could specify that some of
>> the spares are global - so that if I create a new pool, these spares
>> will be assigned to it automatically.

ES> I'm hesitant to do this for two reasons:

ES> 1. We're creating auxiliary ZFS state that is independent of the pool
ES>    data. This means that we need to invent a new syntax for managing
ES>    system-wide global spares, as well as for how to assign them to pools.

ES> 2. Creating pools is not a common operation. Most systems will have
ES>    only one or two pools on them. It's easy enough to simply add the
ES>    same spares to both pools, and more configurable.

I don't know - ZFS was mainly targeted at large systems (I mean, it is on
those systems that you will see the biggest difference with ZFS), and, for
example, here we add quite a lot of storage on a regular basis (I won't
make just one large pool, rather many small pools), so creating global hot
spares at the beginning would be a welcome improvement - the same way we
have it on HW arrays.

btw: I guess hot spares in ZFS won't make it into U2...?

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
On Sat, Apr 01, 2006 at 01:59:38AM +0200, Robert Milkowski wrote:
> I don't know - ZFS was mainly targeted at large systems (I mean, it is
> on those systems that you will see the biggest difference with ZFS),
> and, for example, here we add quite a lot of storage on a regular basis
> (I won't make just one large pool, rather many small pools), so
> creating global hot spares at the beginning would be a welcome
> improvement - the same way we have it on HW arrays.

Why won't you make just one large pool, rather than many small pools? The
only reasons not to do so are:

    a. Different performance characteristics
or
    b. Different fault tolerance characteristics

I can see a server with just two or three pools (one for the root disk,
one for customer data, etc.), but I don't see why you would create lots of
new pools on a regular basis. Can you explain your use case and reasons in
a little more detail? "Because we can do it on product X" doesn't really
help, especially when a HW array is so fundamentally different from a ZFS
storage pool.

Supposing we were to adopt the idea of "global spares", where would this
information be stored? What would the zpool(1M) interface look like? Could
I still do per-pool spares? What would happen when I exported and imported
a pool? If a spare is swapped in permanently (an asynchronous event in the
kernel), does it then remove it from the global list of spares for
subsequent pools? I'm still having trouble envisioning the details of how
this would actually work...

> btw: I guess hot spares in ZFS won't make it into U2...?

Yes, that is correct.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Hello Eric,

Saturday, April 1, 2006, 2:11:09 AM, you wrote:

ES> Why won't you make just one large pool, rather than many small pools?
ES> The only reasons not to do so are:

ES>     a. Different performance characteristics
ES> or
ES>     b. Different fault tolerance characteristics

ES> I can see a server with just two or three pools (one for the root disk,
ES> one for customer data, etc.), but I don't see why you would create lots
ES> of new pools on a regular basis. Can you explain your use case and
ES> reasons in a little more detail? "Because we can do it on product X"
ES> doesn't really help, especially when a HW array is so fundamentally
ES> different from a ZFS storage pool.

The answers to a) and b) are no and no.

In one solution we're thinking of putting ZFS on, we've got, let's say, 8x
3511 JBODs connected to two hosts in a cluster. Right now we have an
additional head unit (with HW controllers), and we build a RAID-5 group on
every enclosure using 11 disks, leaving the last disk as a global hot
spare. With ZFS I was thinking of doing something similar - RAID-Z for
every JBOD (so in this case I would end up with 8 pools and 8 hot spares).

Now, I could make just one large RAID-Z pool (+ some hot spares), but that
could be risky. Or I could make one large pool which is actually a
concatenation/stripe of many RAID-Z groups - in essence a
stripe/concatenation of RAID-Z groups where each RAID-Z group is built
from 11 disks in one enclosure. That way availability is better than with
one large RAID-Z pool, and probably performance is better too, as Bill
pointed out (though I don't understand why). In that configuration I would
end up with a ~40TB logical data pool.

Now, what happens if two disks in one RAID-Z group fail? I lose the whole
40TB of data.

What happens if there's a problem with one disk (very long I/Os, but it's
still working - it happens) in the entire pool? Instead of having a
problem with one smaller pool, I've now got a performance problem with the
entire 40TB pool.

Also, if I want to serve some data from the other cluster node, I can just
switch some pools to the other node - something I can't do with one pool.

ES> Supposing we were to adopt the idea of "global spares", where would this
ES> information be stored? What would the zpool(1M) interface look like?
ES> Could I still do per-pool spares? What would happen when I exported and
ES> imported a pool? If a spare is swapped in permanently (an asynchronous
ES> event in the kernel), does it then remove it from the global list of
ES> spares for subsequent pools? I'm still having trouble envisioning the
ES> details of how this would actually work...

Maybe just another pool with hot spares? Then by default all new pools
would have a variable use_global_hotspares set to on.

Something like:

    zpool create global_hotspares hotspare c1t0d0 c2t0d0 c3t0d0

If you don't want to use global_hotspares in a given pool, you could do:

    zfs set use_global_hs=off pool

Now, if a (normal) pool is exported and then imported, it just looks for a
pool with a specific ID, name, or other tag marking it as a pool of global
hot spares (only if use_global_hs is set to on for the pool being
imported). If no such pool is available, it can only use local hot spares
directly attached to it (if there are any). And if you import a pool with
global hot spares, all currently active (or later imported) pools which
have use_global_hs set to on will automatically use it.

??
On Sat, Apr 01, 2006 at 02:41:39AM +0200, Robert Milkowski wrote:
> Now, I could make just one large RAID-Z pool (+ some hot spares), but
> that could be risky. Or I could make one large pool which is actually a
> concatenation/stripe of many RAID-Z groups - in essence a
> stripe/concatenation of RAID-Z groups where each RAID-Z group is built
> from 11 disks in one enclosure. That way availability is better than
> with one large RAID-Z pool, and probably performance is better too, as
> Bill pointed out (though I don't understand why). In that configuration
> I would end up with a ~40TB logical data pool.
>
> Now, what happens if two disks in one RAID-Z group fail? I lose the
> whole 40TB of data.

This won't be the case with metadata replication, which should be coming
soon. You will only lose the plain file contents of the objects contained
within that toplevel vdev. Of course, if you're measuring "time to restore
from backup", then it doesn't matter whether we survive the failure, since
you'll still have to restore all your data from backup. Although I could
imagine some creative ways of using zfs send/receive to make this faster.

> What happens if there's a problem with one disk (very long I/Os, but
> it's still working - it happens) in the entire pool? Instead of having
> a problem with one smaller pool, I've now got a performance problem
> with the entire 40TB pool.

This should be handled by the ZFS I/O scheduler automatically. We have
some work to do in this area, but I wouldn't design a feature around a
lack of current performance.

> Also, if I want to serve some data from the other cluster node, I can
> just switch some pools to the other node - something I can't do with
> one pool.

Yes, this is definitely true.

> Maybe just another pool with hot spares? Then by default all new pools
> would have a variable use_global_hotspares set to on.
>
> Something like:
>
>     zpool create global_hotspares hotspare c1t0d0 c2t0d0 c3t0d0
>
> If you don't want to use global_hotspares in a given pool, you could do:
>
>     zfs set use_global_hs=off pool
>
> Now, if a (normal) pool is exported and then imported, it just looks
> for a pool with a specific ID, name, or other tag marking it as a pool
> of global hot spares (only if use_global_hs is set to on for the pool
> being imported). If no such pool is available, it can only use local
> hot spares directly attached to it (if there are any). And if you
> import a pool with global hot spares, all currently active (or later
> imported) pools which have use_global_hs set to on will automatically
> use it.

OK, so this is just a "magic pool" that behaves differently? This starts
to get very nasty very quickly. The name "global_hotspares" is reserved,
and all of a sudden all the operations I can do on it are different. You
can only add individual disks, you can't remove certain disks, the output
of "zpool status" has to be different, importing a hot spare pool has to
be handled specially, renames (when supported) will have to be handled
carefully, I can't create ZFS filesystems in it, and the edge conditions
continue...

Based on my observations, it seems to me that:

1. This introduces an order of magnitude more edge conditions that alter
   normal interaction with the system.

2. It requires work (particularly "zpool set") that we haven't yet done.

3. It does not replace the need for per-pool spares.

4. It is not the common use case.

5. The behavior can be replicated with a small amount of manual work given
   the current proposal.
We can implement this as a future RFE, but right now we should implement
the straightforward solution, and deal with the complexities of this
proposal at a later date.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
The main reason to have different ZFS pools is to implement redundancy
ACROSS JBOD enclosures.

I'm assuming that you can't add new disks to a vdev afterwards - you can
only add new vdevs to a pool. Or is this incorrect?

In Robert's case, the best thing to do is this (assuming he wants maximum
disk space usage while still retaining some redundancy):

(for simplicity's sake, I'm showing a 3-array (3 drives/array) config)

    zpool create tank raidz c0t0d0s2 c1t0d0s2 c2t0d0s2 \
                      raidz c0t1d0s2 c1t1d0s2 c2t1d0s2 \
                      raidz c0t2d0s2 c1t2d0s2 c2t2d0s2

That is, create a stripe of RAID-Z vdevs. This insulates you against the
loss of any one JBOD. You can then add the remaining disks as hot spares
to the pool.

(Of course, using the 3511s, you would probably be best off creating each
RAID-5 subarray using the HW controller, then simply striping them using
ZFS.)

--
Erik Trimble
Java System Support
Mailstop: usca14-102
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Jeff Bonwick wrote:
> > >         spares
> > >           c1t0d0    SPARED    currently in use
> > >           c1t1d0    ONLINE
>
> > To me the output here is a little confusing. Shouldn't the status
> > of c0t0d0 in mirror's spare output say something other than "ONLINE"?
> > Perhaps also that for c1t0d0?
>
> I agree. I'd expect ONLINE to mean in use, and OFFLINE to mean
> not in use (and thus available). But that's still somewhat indirect.
>
> How about TAKEN and AVAILABLE?

I forgot to mention, I think that the "ONLINE" status of the disk being
spared out should be something different. I think this is what is meant by
the "35.5K resilvered"? To me this is the only obscure part of the output.

I'd rather see something like:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            spare     ONLINE       0     0     0
              c0t0d0  RESYNC       0     0     0  35.5K
              c1t0d0  RESYNC       0     0     0  35.5K
            c0t1d0    ONLINE       0     0     0
        spares
          c1t0d0      TAKEN     currently in use
          c1t1d0      AVAILABLE

I'm tempted to suggest that "RESYNC" should be different for the incoming
disk and the outgoing disk, maybe:

        NAME          STATE     READ WRITE CKSUM
        test          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            spare     ONLINE       0     0     0
              c0t0d0  OUTSYNC      0     0     0  35.5K
              c1t0d0  INSYNC       0     0     0  35.5K
            c0t1d0    ONLINE       0     0     0
        spares
          c1t0d0      TAKEN     currently in use
          c1t1d0      AVAILABLE

The idea is that the "spares" section under "mirror" is now
self-explanatory. I'm not too enamoured of "OUTSYNC" or "INSYNC" as useful
words here, but hopefully they convey the idea. "SYNCUP" and "SYNCDOWN"
are some other alternatives I can think of right now.

Darren
On Fri, Mar 31, 2006 at 05:17:25PM -0800, Erik Trimble wrote:
> The main reason to have different ZFS pools is to implement redundancy
> ACROSS JBOD enclosures.

I'm a little confused. To implement redundancy across anything, doesn't
that mean they have to be in the same pool? How do I get redundancy across
multiple pools?

> zpool create tank raidz c0t0d0s2 c1t0d0s2 c2t0d0s2 \
>                   raidz c0t1d0s2 c1t1d0s2 c2t1d0s2 \
>                   raidz c0t2d0s2 c1t2d0s2 c2t2d0s2

But isn't this just one pool?

> (Of course, using the 3511s, you would probably be best off creating
> each RAID-5 subarray using the HW controller, then simply striping them
> using ZFS.)

It depends. If you want better performance, this might be true (though
benchmarks would be in order). If you want better fault tolerance, then
it's better to expose them as JBODs and have ZFS deal with them. Then you
get the self-healing capabilities of ZFS that you simply cannot get from a
hardware RAID solution. For sure, you would want to RAID the subarrays, or
else you're putting all your reliability entirely in the hands of the
hardware...

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Fri, Mar 31, 2006 at 05:23:28PM -0800, Darren Reed wrote:
> I forgot to mention, I think that the "ONLINE" status of the disk being
> spared out should be something different.

Well, the example I gave is pretty contrived. Under normal circumstances,
the device you're sparing out is faulted. It's really important that we
show the actual state of that device, not just some faked-up value. For
example, the following all imply very different capabilities of the pool:

    spare     DEGRADED
      diskA   ONLINE
      diskB   ONLINE

    spare     DEGRADED
      diskA   FAULTED
      diskB   ONLINE

    spare     DEGRADED
      diskA   DEGRADED
      diskB   ONLINE

Note that this is the same as with replacing. If you go to replace an
online device, we don't go and change its state. We kicked around the idea
of trying to fake up something to visually represent which device was
being replaced, but changing the 'state' definitely didn't work, for the
above reasons. Even though a device is being replaced and/or spared, it
has a state that is distinct from its current role in the vdev tree.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
In our case, we are predominantly using iSCSI with multiple raw LUNs being
exposed for RAID-Z. Each backend unit has mostly uniform disk sizes, but
disk sizes differ between units, as disks are purchased over time and
generally targeted to maximize storage space. Thus, we will likely be
seeing a large heterogeneous disk farm that, according to what ZFS best
practices state, should be in separate, uniform RAID-Z zpools. So, a spare
pool that may fit multiple zpools can come in handy there.

On 3/31/06, Eric Schrock <eric.schrock at sun.com> wrote:
> Why won't you make just one large pool, rather than many small pools?
> The only reasons not to do so are:
>
>     a. Different performance characteristics
> or
>     b. Different fault tolerance characteristics
> It depends. If you want better performance, this might be true (though
> benchmarks would be in order). If you want better fault tolerance, then
> it's better to expose them as JBODs and have ZFS deal with them. Then
> you get the self-healing capabilities of ZFS that you simply cannot get
> from a hardware RAID solution.

Another option is to get the best of both worlds by letting the arrays do
RAID-5, and then mirroring or RAID-Z-ing the arrays.

A RAID-Z group of RAID-5 arrays can tolerate at least three whole-disk
failures before losing data. It can also tolerate the failure of an entire
array *plus* one whole-disk failure on each of the remaining arrays. Using
RAID-Z (or mirroring) means that you get self-healing data: if an array
returns bad data, ZFS will detect it and reconstruct good data from the
other arrays.

Jeff
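To make the layering concrete, here is a sketch using the proposal's
command form. The device names are placeholders, with each one assumed to
be a single LUN exported by a hardware RAID-5 group on a different array:

    # zpool create tank raidz c2t0d0 c3t0d0 c4t0d0

Each LUN already survives a single disk failure inside its own array, and
the RAID-Z layer across the arrays adds the whole-array tolerance and
checksum-driven self-healing described above.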
> I don't know - ZFS was mainly targeted at large systems

Actually, our goal is to run the gamut. I want ZFS not just on large disk
farms, but also on my laptop. Eventually I'd also like to get ZFS onto
iPods and Compact Flash cards, so that a power outage doesn't mean losing
your music or your pictures.

Jeff
On Thu, 2006-03-30 at 12:03, Eric Schrock wrote:
> C. AUTOMATED REPLACEMENT
>
> In order to perform automated replacement, a ZFS FMA agent will be
> added that subscribes to 'fault.zfs.vdev.*' faults. When a fault is
> received, the agent will examine the pool to see if it has any
> available hot spares. If so, it will perform a 'zpool replace' with an
> available spare.

I've seen automated replacement go bad...

For a while we had an E420R and its connected A5100 JBOD on a UPS. The UPS
battery went bad. We discovered this the hard way when a series of
brownouts caused the UPS to reach into the battery and find nothing there.

The E420R sailed right through as if nothing had happened (who knows --
maybe proportionally bigger capacitors in the power supply?), but the
A5100 really didn't like this. I believe all the drives took a little
while to reset and spin back up.

In the meantime, SVM concluded that a bunch of drives in the array had
gone bad, and decided to replace as many of them as it had hot spares for.
Once the array came all the way back online, mirroring to the replacements
started.

In reality, all the drives were fine; it just took the better part of a
day to unwind all the premature replacements.

Not quite sure what heuristics you'd use to avoid this sort of thing,
though....

- Bill
On Sat, Apr 01, 2006 at 09:25:55PM -0500, Bill Sommerfeld wrote:
> I've seen automated replacement go bad...

Well, this is certainly what would happen with the current bits. The good
news is that this is all done through FMA by subscribing to the
fault.fs.zfs.vdev.* fault. In the future, as we make the diagnosis engine
smarter, this hot spare support will be able to automatically leverage
whatever we come up with.

I don't know what the "right answer" is in the case you described, but
we'll certainly be gathering lots of data (via FMA error/fault logs) as
well as hooking into SMART and the rest of the I/O subsystem to make more
intelligent diagnoses in the future. I've got some stuff scoped out for
the next phase (SERD on I/O and checksum errors) as well as the next
advancements beyond that (processing SMART data and subscribing to hotplug
events). Expect to see more info soon.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
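For readers unfamiliar with the term, SERD (soft-error-rate discrimination)
is the FMA idea of declaring a fault only when enough errors accumulate
within a time window, which is exactly the kind of threshold that would
have suppressed the false replacements in the UPS story above. A toy
sketch of the idea follows; it is not the fmd SERD engine API, and the
thresholds in the comment are made up:

    import collections
    import time

    class Serd:
        """Toy SERD engine: fire once N or more events fall within T seconds."""
        def __init__(self, n: int, t: float):
            self.n = n
            self.t = t
            self.events = collections.deque()

        def record(self, now: float = None) -> bool:
            """Record one error event; return True if the threshold is crossed."""
            if now is None:
                now = time.time()
            self.events.append(now)
            # Drop events that have aged out of the sliding window.
            while self.events and now - self.events[0] > self.t:
                self.events.popleft()
            return len(self.events) >= self.n

    # e.g. a hypothetical policy of "10 checksum errors within 10 minutes":
    #   cksum_serd = Serd(n=10, t=600)
    #   if cksum_serd.record(): ...fault the vdev; the hot spare agent reacts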
Veritas VM has a flag for this. If you set it on the disk volumes, it
won't try to use them as reallocation targets. I found this out the hard
way. We were mirroring on a 9176 (early precursor to the D178) between
datacenters and between two arrays. When one array went away in a power
failure, it 'mirrored' everything to the same array. Performance on the
box went away in a hurry.

When mirroring, we'll need to have a similar flag, as it gets even more
interesting when you've got high-performance disk (database storage) and
low-performance disk (used for DB exports). Mix up the mirroring there,
and things will get ugly. I would assume that you'd put different tiers of
storage into different pools to reduce the chance of this happening, but
it's still a possibility.

On Sat, 2006-04-01 at 21:25 -0500, Bill Sommerfeld wrote:
> I've seen automated replacement go bad...
>
> In reality, all the drives were fine; it just took the better part of a
> day to unwind all the premature replacements.
>
> Not quite sure what heuristics you'd use to avoid this sort of thing,
> though....
Jeff Bonwick wrote:
> > I don't know - ZFS was mainly targeted at large systems
>
> Actually, our goal is to run the gamut. I want ZFS not just on
> large disk farms, but also on my laptop. Eventually I'd also
> like to get ZFS onto iPods and Compact Flash cards, so that a
> power outage doesn't mean losing your music or your pictures.

And where "power outage" includes the spontaneous popping out of said
devices from their "holder" too :)

I can't remember how many Amiga floppies I burnt because they weren't
always consistent on disk.

Darren
Eric Schrock wrote:
> On Fri, Mar 31, 2006 at 05:23:28PM -0800, Darren Reed wrote:
> > I forgot to mention, I think that the "ONLINE" status of the disk
> > being spared out should be something different.
>
> Well, the example I gave is pretty contrived. Under normal
> circumstances, the device you're sparing out is faulted. It's really
> important that we show the actual state of that device, not just some
> faked-up value. For example, the following all imply very different
> capabilities of the pool:
>
>     spare     DEGRADED
>       diskA   ONLINE
>       diskB   ONLINE
>
>     spare     DEGRADED
>       diskA   FAULTED
>       diskB   ONLINE
>
>     spare     DEGRADED
>       diskA   DEGRADED
>       diskB   ONLINE
>
> Note that this is the same as with replacing.

Looking at those three, the "DEGRADED" for the first spare set seems like
a bug to me. My assumption is that:

    ONLINE(spare) = ONLINE(diskA) + ONLINE(diskB)

and I think this is the intuitive way to read the above output. If that
isn't the story, then something needs to not say "ONLINE".

> If you go to replace an online device, we don't go and change its
> state. We kicked around the idea of trying to fake up something to
> visually represent which device was being replaced, but changing the
> 'state' definitely didn't work, for the above reasons. Even though a
> device is being replaced and/or spared, it has a state that is distinct
> from its current role in the vdev tree.

It would be very worthwhile if something could be faked, visually, to
represent what is going on inside, if only to avoid the first case of
output (above), which seems nonsensical.

Darren
Hello Eric,

Saturday, April 1, 2006, 3:15:10 AM, you wrote:

ES> On Sat, Apr 01, 2006 at 02:41:39AM +0200, Robert Milkowski wrote:
>> Now, what happens if two disks in one RAID-Z group fail? I lose the
>> whole 40TB of data.

ES> This won't be the case with metadata replication, which should be
ES> coming soon. You will only lose the plain file contents of the
ES> objects contained within that toplevel vdev.

Yeah, that would be better. But there's still the problem of how to
correct the situation - as you mentioned, you will probably be forced to
restore the whole 40TB of data instead of 5TB.

>> What happens if there's a problem with one disk (very long I/Os, but
>> it's still working - it happens) in the entire pool? Instead of having
>> a problem with one smaller pool, I've now got a performance problem
>> with the entire 40TB pool.

ES> This should be handled by the ZFS I/O scheduler automatically. We
ES> have some work to do in this area, but I wouldn't design a feature
ES> around a lack of current performance.

That's good to hear.

ES> Based on my observations, it seems to me that:

ES> 1. This introduces an order of magnitude more edge conditions that
ES>    alter normal interaction with the system.
ES> 2. It requires work (particularly "zpool set") that we haven't yet
ES>    done.
ES> 3. It does not replace the need for per-pool spares.
ES> 4. It is not the common use case.

I can't agree with #4. IMHO, in most RAID environments, especially with a
lot of disks, you just create some global hot spares and don't think about
it later when adding new disks, etc.

ES> We can implement this as a future RFE, but right now we should
ES> implement the straightforward solution, and deal with the
ES> complexities of this proposal at a later date.

That's reasonable.

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
Hello Jeff,

Saturday, April 1, 2006, 10:05:29 AM, you wrote:

>> It depends. If you want better performance, this might be true (though
>> benchmarks would be in order). If you want better fault tolerance,
>> then it's better to expose them as JBODs and have ZFS deal with them.
>> Then you get the self-healing capabilities of ZFS that you simply
>> cannot get from a hardware RAID solution.

JB> Another option is to get the best of both worlds by letting the
JB> arrays do RAID-5, and then mirroring or RAID-Z-ing the arrays.

JB> A RAID-Z group of RAID-5 arrays can tolerate at least three
JB> whole-disk failures before losing data. It can also tolerate
JB> the failure of an entire array *plus* one whole-disk failure
JB> on each of the remaining arrays. Using RAID-Z (or mirroring)
JB> means that you get self-healing data: if an array returns bad data,
JB> ZFS will detect it and reconstruct good data from the other arrays.

I hadn't considered this one - sounds interesting. However, less storage
will be available, but it could still be interesting.

Thanks.

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
Hello Eric,

Saturday, April 1, 2006, 3:15:10 AM, you wrote:

ES> This won't be the case with metadata replication, which should be
ES> coming soon. You will only lose the plain file contents of the
ES> objects contained within that toplevel vdev.

It just occurred to me - if there were a zfs command to get a list of
"broken" (data missing) files due to the failure of some disks, then with
such a list one could restore only the bad files and not the whole pool
(assuming that you can overwrite these files).

Most backup software lets you restore only the files listed in a file.

What do you think?

--
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
On Mon, Apr 03, 2006 at 08:24:45PM +0200, Robert Milkowski wrote:
> It just occurred to me - if there were a zfs command to get a list of
> "broken" (data missing) files due to the failure of some disks, then
> with such a list one could restore only the bad files and not the whole
> pool (assuming that you can overwrite these files).
>
> Most backup software lets you restore only the files listed in a file.
>
> What do you think?

Starting in build 36, we get 50% of the way there. If you do a scrub of a
pool and then run 'zpool status -v', you'll get a detailed list of all the
unrecoverable (logical) blocks in the pool found during the scrub. The
problem is that they are currently only identified by dataset name and
object number - not exactly conducive to repair procedures.

There is a future RFE to translate the object number to a filename (when
available), but it's non-trivial when the filesystem is currently mounted.
We can't grok around the internal DMU state without going through the
"front door" of the ZPL. Matt or Mark may be able to shed some light on
how much investigation they've done in this area, if any.

The result, of course, would be _very_ cool. With background scrubbing
(also coming in the future), you will always have an up-to-date list of
damaged data in your pool, or hopefully lack thereof :-)

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Sat, 2006-04-01 at 01:11, Eric Schrock wrote:
> Why won't you make just one large pool, rather than many small pools?
> The only reasons not to do so are:
>
> a. Different performance characteristics
> or
> b. Different fault tolerance characteristics

Or:

c. Different administrative boundaries.

By which I mean that pools are the unit that is imported and exported.

If different projects (groups - possibly with separate funding) buy
storage, I would expect to align the pools with what they purchased. That
way I can split the storage up later without breaking up the data.

Or I allocate storage off a SAN. In that case I would want to import and
export pools to move data around on the SAN - i.e. between machines. Say a
machine becomes busy; I would want to be able to export a pool and import
it on another machine attached to the SAN and run the service there.

I'm not sure what the model for global spares is here. I can see that for
a spare local to a pool, when I export the pool I lose the spare (the
spare is physically associated with the pool and should remain so).

--
-Peter Tribble
L.I.S., University of Hertfordshire - http://www.herts.ac.uk/
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
On Mon, Apr 03, 2006 at 08:10:01PM +0100, Peter Tribble wrote:
> Or:
>
> c. Different administrative boundaries.
>
> By which I mean that pools are the unit that is imported and exported.

Yep. This is the use case that Robert pointed out that I had failed to
consider.

> If different projects (groups - possibly with separate funding) buy
> storage, I would expect to align the pools with what they purchased.
> That way I can split the storage up later without breaking up the data.
>
> Or I allocate storage off a SAN. In that case I would want to import
> and export pools to move data around on the SAN - i.e. between
> machines. Say a machine becomes busy; I would want to be able to export
> a pool and import it on another machine attached to the SAN and run the
> service there.
>
> I'm not sure what the model for global spares is here. I can see that
> for a spare local to a pool, when I export the pool I lose the spare
> (the spare is physically associated with the pool and should remain
> so).

Yep. Global spares are likely per-system, rather than per-pool. For
example, exporting a pool will not touch any globally configured hot
spares. As a result of Robert's suggestion, we'll be examining how to
expose this in an administratively meaningful way in the future.

As usual, the difficulty is all about the administrative interface. The
actual FMA agent that goes off and does the replacement is trivial, and
can get the suggested replacement from anywhere.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
At one point there was talk of implementing "hot space" rather than hot
spares. Is this a precursor to that step? Or is hot space a different
notion?

-Sanjay
On Mon, Apr 10, 2006 at 03:13:17PM -0700, Sanjay G. Nadkarni wrote:
> At one point there was talk of implementing "hot space" rather than
> hot spares. Is this a precursor to that step? Or is hot space a
> different notion?

They serve similar purposes, but are not 100% replacements for each other.
We will still be working on hot space, but it will not be a short-term
project.

--Bill