Eric Schrock
2007-Mar-21 18:16 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
Folks -

I'm preparing to submit the attached PSARC case to provide better support for device removal and insertion within ZFS. Since this is a rather complex issue, with a fair share of corner cases, I thought I'd send the proposal out to the ZFS community at large for further comment before submitting it. The prototype is functional except for the offline device insertion and hot spares functionality. I hope to have this integrated within the next month, along with the next phase of FMA integration.

Please respond with any comments, concerns, or suggestions.

Thanks,

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock

-------------- next part --------------
1. INTRODUCTION

Currently, ZFS supports what is affectionately known as "poor man's hotplug". If a device is removed from the system, then it is assumed that, upon I/O failure, an attempt to reopen the same device will fail. This triggers an FMA fault, substituting a hot spare if available. This is undesirable for two reasons:

    - There is no distinction between device removal and arbitrary failure. If a device is removed from the system, it should be treated as a deliberate action, distinct from a normal failure.

    - There is no support for automatic response to device insertion. For a server configured with a ZFS pool, the administrator should be able to walk up, remove any drive (preferably a faulted one), insert a new drive, and not have to issue any ZFS commands to reconfigure the pool. This is particularly true for the appliance space, where hardware reconfiguration should "just work".

This case enhances ZFS to respond to device removal and provides a mechanism to automatically deal with device insertion. While the framework is generic, the primary target is devices supported by the SATA framework. The only device-specific portion of this proposal is determining whether a device is in the same "physical location" as a previously known device, which involves correlating a transport's enumeration of the device with the device's physical location within the chassis.

2. DEVICE REMOVAL

There are two types of device removal within Solaris. Coordinated device removal involves stopping all consumers of the device, using the appropriate cfgadm(1M) command (PSARC 1996/285), and then physically removing the device. Uncoordinated removal (also known as "surprise removal") is when a device is physically removed while still in active use by the system. The latter is increasingly common as more I/O protocols support hotplug and higher level software (ZFS) becomes more capable.

There are several ways to detect device removal within Solaris. Fibre channel drivers generate the NDI events FCAL_INSERT_EVENT and FCAL_REMOVE_EVENT. USB and 1394 drivers generate the NDI events DDI_DEVI_INSERT_EVENT and DDI_DEVI_REMOVE_EVENT. In addition to these event channels, there is also the DKIOCSTATE ioctl(), which returns (on capable drivers) DKIO_DEV_GONE if the device has been removed. Of these, the ioctl() is the most widely supported, and is the mechanism used as part of this case. Since this is an implementation detail of the current architecture, it does not preclude using alternate mechanisms in the future.

When an I/O to a disk fails, ZFS will query the media state via the DKIOCSTATE ioctl. If the device is in any state other than DKIO_INSERTED, ZFS will transition the device to a new REMOVED state. No FMA fault will be generated, but a hot spare (if available) will still be substituted.
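For illustration only (not part of the case itself), a minimal user-level sketch of the DKIOCSTATE check is shown below. The in-kernel ZFS code would perform the equivalent query through its layered driver handle; error handling here is deliberately minimal.

    /* Minimal sketch: query the media state of a disk via DKIOCSTATE. */
    #include <sys/types.h>
    #include <sys/dkio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stropts.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        enum dkio_state state = DKIO_NONE;  /* differs from any real state */
        int fd;

        if (argc != 2) {
            (void) fprintf(stderr, "usage: %s <raw-device>\n", argv[0]);
            return (2);
        }

        if ((fd = open(argv[1], O_RDONLY | O_NDELAY)) < 0) {
            perror("open");
            return (1);
        }

        /*
         * DKIOCSTATE returns once the media state differs from the state
         * passed in, so starting from DKIO_NONE simply reports the current
         * state.  Drivers that don't implement the ioctl fail the call.
         */
        if (ioctl(fd, DKIOCSTATE, &state) != 0) {
            perror("DKIOCSTATE");
            (void) close(fd);
            return (1);
        }

        if (state != DKIO_INSERTED)
            (void) printf("device absent (state %d): would go REMOVED\n",
                (int)state);
        else
            (void) printf("device present\n");

        (void) close(fd);
        return (0);
    }

Run against the /dev/rdsk node of a pulled disk (or a lofi device detached with the proposed -f flag below), this should report DKIO_DEV_GONE.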
Note that DKIO_DEV_GONE can be returned for a variety of reasons (pulling cables, an external chassis being powered off, etc.). In the absence of additional FMA information, it is assumed that this is intentional administrative action.

As part of this work, lofiadm(1M) will be expanded to include a new force (-f) flag when removing devices. Combined with the upcoming lofi devfs events (PSARC 2006/709), this will provide a much simpler testing framework without the need for physical hardware interaction. When this flag is used, the underlying file will be closed, any further I/O or attempts to open the device will fail, and DKIOCSTATE will return DKIO_DEV_GONE. This flag will remain private for testing only, and will not be documented. An example of this in action:

    # lofiadm -a /disk/a
    /dev/lofi/1
    # lofiadm -a /disk/b
    /dev/lofi/2
    # lofiadm -a /disk/c
    /dev/lofi/3
    # zpool create -f test mirror /dev/lofi/1 /dev/lofi/2 spare /dev/lofi/3
    # while :; do touch /test/foo; sync; sleep 1; done &
    [1] 100662
    # zpool status
      pool: test
     state: ONLINE
     scrub: none requested
    config:

            NAME             STATE     READ WRITE CKSUM
            test             ONLINE       0     0     0
              mirror         ONLINE       0     0     0
                /dev/lofi/1  ONLINE       0     0     0
                /dev/lofi/2  ONLINE       0     0     0
            spares
              /dev/lofi/3    AVAIL

    errors: No known data errors
    # lofiadm -d /disk/a -f
    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
    config:

            NAME               STATE     READ WRITE CKSUM
            test               DEGRADED     0     0     0
              mirror           DEGRADED     0     0     0
                spare          DEGRADED     0     0     0
                  /dev/lofi/1  REMOVED      0     0     0
                  /dev/lofi/3  ONLINE       0     0     0
                /dev/lofi/2    ONLINE       0     0     0
            spares
              /dev/lofi/3      INUSE     currently in use

    errors: No known data errors

This behavior is universal for all pools, and cannot be disabled. If a device doesn't support DKIOCSTATE, then it will be diagnosed as faulty through the standard FMA mechanisms. The 'REMOVED' state is not persistent, so if a machine is rebooted with a device in the REMOVED state, it will appear as FAULTED when the machine comes up.

3. DEVICE INSERTION

When a device is inserted, there are two possible outcomes of interest to ZFS:

    - If a previously known device is inserted, then we want to online the device.

    - If a new device is inserted into a physical location that previously contained a ZFS device, then we want to format the device and replace the original device.

The former is applicable to any pool, and is always enabled. The latter is potentially damaging, as it will automatically overwrite any data present on newly inserted devices. To protect against this, a new pool property (PSARC 2006/577), 'autoreplace', will be defined. This boolean property will be off by default to minimize the impact on existing systems or unknown hardware. If unset, the current behavior remains the same, and any replacement operation must be initiated by the administrator via zpool(1M). When set, it indicates that any new device found in the same physical location as a device previously belonging to the pool will be automatically formatted and replaced. To ensure consistent behavior, ZFS must behave in the same manner when the device is replaced (via hotplug) while the system is running as when the device is replaced while the system is powered off.

4. ONLINE DEVICE INSERTION

A new syseventd module will be introduced that listens for EC_DEV_ADD events of subclass ESC_DISK or ESC_LOFI. This event is triggered when the device node for the disk or lofi device is created, not necessarily when a disk is inserted.
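As a rough sketch of the subscription side (illustrative only, written as a stand-alone libsysevent consumer rather than the syseventd plug-in module this case actually proposes; the payload attribute name DEV_PHYS_PATH is an assumption here, not something specified by this case):

    /* Illustrative stand-alone consumer for EC_DEV_ADD disk/lofi events. */
    #include <libsysevent.h>
    #include <libnvpair.h>
    #include <sys/sysevent/eventdefs.h>
    #include <sys/sysevent/dev.h>
    #include <stdio.h>
    #include <unistd.h>

    static void
    dev_add_handler(sysevent_t *ev)
    {
        nvlist_t *attrs;
        char *path = NULL;

        if (sysevent_get_attr_list(ev, &attrs) != 0)
            return;

        /* The physical path of the newly created device node. */
        if (nvlist_lookup_string(attrs, DEV_PHYS_PATH, &path) == 0)
            (void) printf("device added: %s\n", path);

        /* A real module would now search pools by devid, then by path. */
        nvlist_free(attrs);
    }

    int
    main(void)
    {
        sysevent_handle_t *shp;
        const char *subclasses[] = { ESC_DISK, ESC_LOFI };

        if ((shp = sysevent_bind_handle(dev_add_handler)) == NULL) {
            perror("sysevent_bind_handle");
            return (1);
        }

        if (sysevent_subscribe_event(shp, EC_DEV_ADD, subclasses, 2) != 0) {
            perror("sysevent_subscribe_event");
            sysevent_unbind_handle(shp);
            return (1);
        }

        for (;;)
            (void) pause();     /* events arrive on the handler thread */
    }

Built with -lsysevent and -lnvpair, this would simply print the path of each newly created disk or lofi device node; the pool lookup described below is where the proposed module does its real work.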
Currently, the USB framework auto-configures drives on insertion, while the SATA framework does not. Modifying the SATA framework behavior will be pursued under a separate case and is outside the scope of this one. In the meantime, these SATA events will be triggered only by an explicit 'cfgadm -c configure' by the user.

When one of these events is received, the corresponding device path is derived from the sysevent payload. For disks, this will be the device node, while for lofi it will be a particular minor node. If the device has a devid, then we first search all pools for a vdev with a matching devid. If none is found, or the device does not have a devid, then we search all pools for vdevs with the specified device path. As part of this work, the ZFS configuration will be expanded to store the physical device path as part of the vdev label. This will also have the benefit of allowing ZFS to boot from devices which don't support devids. Currently, ZFS only identifies devices by devid or /dev path, neither of which may be available when mounting the root filesystem.

This simplistic mechanism will only work for devices where the device path identifies a physical location, which may not be true for FC or iSCSI devices, or for devices plumbed under MPxIO. This logic can be expanded in the future if there are protocols or drivers which do not adhere to this behavior.

If no matching vdevs are found, then the event is ignored and nothing is done. Otherwise, the device is onlined to determine if it is a known ZFS device. This online operation will automatically remove any attached spare when the resilver is complete. To continue the above example:

    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
    config:

            NAME               STATE     READ WRITE CKSUM
            test               DEGRADED     0     0     0
              mirror           DEGRADED     0     0     0
                spare          DEGRADED     0     0     0
                  /dev/lofi/1  REMOVED      0     0     0
                  /dev/lofi/3  ONLINE       0     0     0
                /dev/lofi/2    ONLINE       0     0     0
            spares
              /dev/lofi/3      INUSE     currently in use

    errors: No known data errors
    # lofiadm -a /disk/a
    /dev/lofi/1
    # zpool status
      pool: test
     state: ONLINE
     scrub: resilver completed with 0 errors on Mon Mar 12 10:58:22 2007
    config:

            NAME             STATE     READ WRITE CKSUM
            test             ONLINE       0     0     0
              mirror         ONLINE       0     0     0
                /dev/lofi/1  ONLINE       0     0     0
                /dev/lofi/2  ONLINE       0     0     0
            spares
              /dev/lofi/3    AVAIL

    errors: No known data errors

If the online attempt failed, then we are dealing with a new device inserted into the same physical slot. If the 'autoreplace' property is unset, then the event is ignored. If the original event was ESC_DISK and the vdev is not a whole disk, then the event is also ignored. Otherwise, the disk is labeled with an EFI label in the same manner as when the pool is initially created. If that succeeds, then the corresponding 'zpool replace' command is automatically invoked.
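A small, self-contained toy model of that decision order may help; the structures, helper names, and pool contents below are invented for illustration and are not ZFS interfaces.

    /* Toy model of the insertion logic: devid match, then path match,   */
    /* then online vs. autoreplace.  Not ZFS code; names are invented.   */
    #include <stdio.h>
    #include <string.h>

    struct toy_vdev {
        const char *devid;      /* NULL if the device has no devid */
        const char *physpath;   /* physical path stored in the label */
        int wholedisk;          /* vdev was created as a whole disk */
    };

    static struct toy_vdev pool[] = {
        { "id1,sd@SDISK_A001", "/pci@0,0/sata@5/disk@0,0", 1 },
        { "id1,sd@SDISK_B002", "/pci@0,0/sata@5/disk@1,0", 1 },
    };
    static const int autoreplace = 1;   /* proposed property, off by default */

    static void
    handle_dev_add(const char *devid, const char *physpath)
    {
        struct toy_vdev *vd = NULL;
        size_t i, n = sizeof (pool) / sizeof (pool[0]);

        /* 1. A devid match means a previously known device: online it. */
        for (i = 0; devid != NULL && vd == NULL && i < n; i++)
            if (pool[i].devid != NULL && strcmp(pool[i].devid, devid) == 0)
                vd = &pool[i];
        if (vd != NULL) {
            (void) printf("%s: known device, online it\n", physpath);
            return;
        }

        /* 2. Otherwise fall back to the physical path stored in the label. */
        for (i = 0; vd == NULL && i < n; i++)
            if (strcmp(pool[i].physpath, physpath) == 0)
                vd = &pool[i];
        if (vd == NULL) {
            (void) printf("%s: not part of any pool, ignore\n", physpath);
            return;
        }

        /*
         * 3. Same slot, apparently a new disk: only whole disks, only if
         * opted in.  (The real code would first try to online the device
         * and treat only an online failure as a new disk.)
         */
        if (autoreplace && vd->wholedisk)
            (void) printf("%s: write EFI label, then zpool replace\n",
                physpath);
        else
            (void) printf("%s: leave it to the administrator\n", physpath);
    }

    int
    main(void)
    {
        handle_dev_add("id1,sd@SDISK_A001", "/pci@0,0/sata@5/disk@0,0");
        handle_dev_add(NULL, "/pci@0,0/sata@5/disk@1,0");
        handle_dev_add(NULL, "/pci@0,0/scsi@2/disk@3,0");
        return (0);
    }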
Continuing the lofi example from above:

    # lofiadm -d /disk/a -f
    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
    config:

            NAME                 STATE     READ WRITE CKSUM
            test                 DEGRADED     0     0     0
              mirror             DEGRADED     0     0     0
                spare            DEGRADED     0     0     0
                  /dev/lofi/1    REMOVED      0     0     0
                  /dev/lofi/3    ONLINE       0     0     0
                /dev/lofi/2      ONLINE       0     0     0
            spares
              /dev/lofi/3        INUSE     currently in use

    errors: No known data errors
    # lofiadm -a /disk/d
    /dev/lofi/1
    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 17:31:06 2007
    config:

            NAME                   STATE     READ WRITE CKSUM
            test                   DEGRADED     0     0     0
              mirror               DEGRADED     0     0     0
                spare              DEGRADED     0     0     0
                  replacing        DEGRADED     0     0     0
                    /dev/lofi/1/old  FAULTED    0     0     0  corrupted data
                    /dev/lofi/1    ONLINE       0     0     0
                  /dev/lofi/3      ONLINE       0     0     0
                /dev/lofi/2        ONLINE       0     0     0
            spares
              /dev/lofi/3          INUSE     currently in use

    errors: No known data errors
    # zpool status
      pool: test
     state: ONLINE
     scrub: resilver completed with 0 errors on Mon Mar 12 17:31:06 2007
    config:

            NAME             STATE     READ WRITE CKSUM
            test             ONLINE       0     0     0
              mirror         ONLINE       0     0     0
                /dev/lofi/1  ONLINE       0     0     0
                /dev/lofi/2  ONLINE       0     0     0
            spares
              /dev/lofi/3    AVAIL

    errors: No known data errors

In this case, the device was automatically replaced with the newly inserted disk; once the resilver completed, the 'replacing' vdev disappeared and the hot spare returned to the AVAIL state.

5. OFFLINE DEVICE INSERTION

If a device is replaced while the system is powered off, then ZFS should behave in a similar manner. If devices change attachment points (i.e. are swapped) while the system is powered off, ZFS already handles this case for devices which support devids. If a device can be opened but the devid doesn't match, then ZFS will treat this as a disk insertion event. If the 'autoreplace' property is set, then ZFS will label the disk and perform the appropriate 'zpool replace' operation to resilver the device.

6. HOT SPARES

Currently, ZFS does not do any I/O to inactive hot spares, so it is incapable of detecting when a hot spare is removed from the system. This case will modify ZFS to periodically attempt to read from all hot spares and make sure they are online and available. If a hot spare is removed, then when this I/O fails it will trigger the normal remove path. This case will also allow offline hot spares to be replaced.

With these changes, hot spares will be treated as normal devices with respect to hotplug. If an active hot spare is removed, then the hot spare will be detached and marked removed. If another hot spare is available, then it will be substituted in its place. If a hot spare is inserted, and there is a faulted device with no current hot spare, then the insertion will automatically kick in the new hot spare.

7. MANPAGE DIFFS

XXX

8. REFERENCES

PSARC 1996/285 Dynamic Attach/Detach of CPU/Memory Boards
PSARC 2002/240 ZFS
PSARC 2006/223 ZFS Hot Spares
PSARC 2006/577 zpool property to disable delegation
PSARC 2006/709 lofi devfs events
Bill Sommerfeld
2007-Mar-21 18:37 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
This is tangential, but then arc review is all about feature interaction.

1) What happens if the hotplugged replacement device is too small?

2) What's the interaction between autoreplace and automatic vdev growth (when the underlying device gets bigger)?

Since we can't yet shrink a pool, I'm wondering if there aren't some gotchas if someone plugs in a disk that's too big; they might not be able to go back and replace it with one of the intended size.

Consider the following comedy of operational errors. We start with a pool which looks like:

    mirror small1 small2
    mirror large1 large2
    spare large3

"small1" fails, and hot sparing kicks in to replace it with "large3". Resilvering is complete.

case #1: operator mistakenly pulls small2 instead of small1. Does mirror #1 (reduced to a single functioning replica) autogrow to size "large"?

case #2: operator pulls small1, and by mistake plugs in "large4" instead of "small3". Before noticing this error, "small2" fails.

.. or other variants of the above.

						- Bill
Eric Schrock
2007-Mar-21 18:44 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
On Wed, Mar 21, 2007 at 02:37:16PM -0400, Bill Sommerfeld wrote:
>
> 1) What happens if the hotplugged replacement device is too small?
>

The replace will fail, just as if the administrator tried to issue a 'zpool replace' with a smaller drive. In the auto-replace case, the result will be a faulted vdev and an associated FMA alert. I'll add some text to this effect, and make sure that this is actually how it behaves.

> 2) What's the interaction between autoreplace and automatic vdev
> growth (when the underlying device gets bigger)?
>
> Since we can't yet shrink a pool, I'm wondering if there aren't some
> gotchas if someone plugs in a disk that's too big; they might not be
> able to go back and replace it with one of the intended size.
>
> Consider the following comedy of operational errors. We start with a
> pool which looks like:
>
>     mirror small1 small2
>     mirror large1 large2
>     spare large3
>
> "small1" fails, and hot sparing kicks in to replace it with "large3".
> Resilvering is complete.
>
> case #1: operator mistakenly pulls small2 instead of small1. Does
> mirror #1 (reduced to a single functioning replica) autogrow to size
> "large"?

No, it will not autogrow, because a spare doesn't completely replace the device. So you will have something like:

    mirror
        spare             100G
            small1        100G
            large3        200G (100G used)
        small2            100G
    mirror                200G
        large1            200G
        large2            200G

If you then pull 'small2', it will become faulted, but the vdev will not grow in size.

> case #2: operator pulls small1, and by mistake plugs in "large4"
> instead of "small3". Before noticing this error, "small2" fails.

Again, the 'spare' vdev will still reflect the size of 'small2', so the vdev will not grow. It's only when the spare is complete, either through explicit zpool(1M) actions or replacing the underlying drive, that the 'spare' vdev disappears and the new vdev reflects the larger size.

Thanks for the input,

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Robert Milkowski
2007-Mar-22 00:03 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
Hello Eric,

What if I have a failing drive (it still works, but I want it to be replaced) and I have a replacement drive on a shelf? All I want is to remove the failing drive, insert the new one, and resilver. I do not want a hot spare to automatically kick in.

--
Best regards,
 Robert                       mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
Eric Schrock
2007-Mar-22 00:13 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
On Thu, Mar 22, 2007 at 01:03:48AM +0100, Robert Milkowski wrote:
>
> What if I have a failing drive (it still works, but I want it to be
> replaced) and I have a replacement drive on a shelf? All I want is
> to remove the failing drive, insert the new one, and resilver. I do
> not want a hot spare to automatically kick in.
>

Kicking in a hot spare is a harmless activity (the end result will be the same), so why would you want to avoid this? Do you have an idea of how you would want to control this behavior?

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Robert Milkowski
2007-Mar-22 01:44 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
Hello Eric,

Thursday, March 22, 2007, 1:13:19 AM, you wrote:

ES> On Thu, Mar 22, 2007 at 01:03:48AM +0100, Robert Milkowski wrote:
>> What if I have a failing drive (it still works, but I want it to be
>> replaced) and I have a replacement drive on a shelf? All I want is
>> to remove the failing drive, insert the new one, and resilver. I do
>> not want a hot spare to automatically kick in.

ES> Kicking in a hot spare is a harmless activity (the end result will be
ES> the same), so why would you want to avoid this?

With a lot of storage I like to keep the config as consistent as I can across identical boxes. So if I have a replacement drive I would rather use it instead of a hot spare, so I do not have to resilver again. I know this is mostly aesthetics, but it helps in managing storage.

ES> Do you have an idea of how
ES> you would want to control this behavior?

Maybe a simple method of "freezing" hot spares (without removing them), or maybe the automated method should have some reasonable delay - when it sees a disk is gone it waits N seconds before the hot spare kicks in, or if a new drive is present at the same physical location then it uses that rather than a hot spare (or perhaps the admin can issue zpool replace manually before the hot spare kicks in). I'm not sure if it won't complicate things too much, but still, I like to keep similar configs.

Or maybe the ability to stop resilvering of a hot spare and start resilvering a new drive would be sufficient, or it could even be automatic (stop resilvering the hot spare but keep all data already resilvered; resilver the new disk with the data which has not yet been resilvered to the hot spare; then resilver the data which was already resilvered to the hot spare; then release the hot spare). All of this would work only if some kind of hotspare-back property were set.

It's a matter of what people prefer - a moving hot spare, or, once the disk is replaced, the hot spare going back to the hot spare list and being released (after the disk is resilvered). It probably doesn't matter that much on an x4500, but it can matter more on other arrays.

--
Best regards,
 Robert                       mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
Anton B. Rang
2007-Mar-22 04:26 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
A couple of questions/comments --

Why is the REMOVED state not persistent? It seems that, if ZFS knows that an administrator pulled a disk deliberately, that's still useful information after a reboot. Changing the state to FAULTED is non-intuitive, at least to me.

What happens with autoreplace after a system reconfiguration? If controller numbers change, is it possible that autoreplace would grab a drive which was not part of a ZFS pool and try to use it? I recall some posts here where the "old" path was preserved in the pool and it was fairly difficult to get ZFS to recognize the "new" path.

In general, I don't like the idea of autoreplace being tied to the device path. It would be both safer and more general if the underlying frameworks exported a physical location identifier to the node. I suspect that this isn't currently done by Solaris, and I'm sure it's not done for devices in enclosures which support (say) SES; but it seems like the right long-term direction. For Sun-supplied hardware, it would even be possible to use readable device names (e.g. "Slot A connector 2").

Autoreplace probably needs a lot of warnings except in the particular case of appliances and other highly-controlled environments. Consider a server with three drives, A, B, and C, in which A and B are mirrored and C is not. Pull out A, B, and C, and re-insert them as A, C, and B. If B is slow to come up for some reason, ZFS will see "C" in place of "B", and happily reformat it into a mirror of "A". (Or am I reading this incorrectly?)

I hope that there's a way to disable the periodic probing of hot spares. Spinning these drives up often might be highly annoying in some environments (though useful in others, since it could also verify that the disk is responding normally).
Matt B
2007-Mar-22 06:57 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
Autoreplace is currently the biggest advantage that H/W raid controllers have over ZFS and other less advanced forms of S/W raid. I would even go so far as to promote this issue to the forefront as a leading deficiency that is hindering ZFS adoption.

Regarding H/W raid controllers, things are kept pretty simple. The controller does not even "try" to discriminate between an admin pulling a good drive (intentional removal) and a drive that has failed. While some may see this as less sophisticated, in reality it makes things quite simple. There is a certain degree of decadence and freedom that comes with being able to pull a drive out of server A, slap it into server B, and turn on server B with a perfect duplicate. Slap a fresh drive from the factory into the now vacant slot on server A, and the rebuild takes place immediately.

This has to be the standard. Anything beyond this level of effort simply fails against H/W raid. There is no standard S/W raid like this, and with ZFS's already beautiful simplicity, it is the champion of the hour to deliver on this holy grail. I think this proposal is very close to what we need to take ZFS to the next level in the enterprise.

Here are some rough thoughts regarding perceivable drive failure notification (beyond email, pager, console, etc.). Literally, someone should be able to make $7/hr with a stack of drives and the ability to just look at or listen to a server to determine which drive needs to be replaced.

This means ZFS will need to be able to control the HDD status lights on the chassis for "look", but for "listen" ZFS could cause the server to beep, using one beep for the first slot, two beeps in rapid succession for the second slot. A sort of lame Morse code... no device integration on ZFS's part required.
Eric Schrock
2007-Mar-22 15:39 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Wed, Mar 21, 2007 at 09:26:35PM -0700, Anton B. Rang wrote:
> A couple of questions/comments --
>
> Why is the REMOVED state not persistent? It seems that, if ZFS knows
> that an administrator pulled a disk deliberately, that's still useful
> information after a reboot. Changing the state to FAULTED is
> non-intuitive, at least to me.

The reason is that we can imply from watching a device go from the connected -> unconnected state that a remove event occurred. While the system is powered off, we don't necessarily know what reconfiguration has taken place, so implying that a drive is still removed may not be appropriate. It's pretty easy to store this persistently, so this can change if people feel otherwise.

> What happens with autoreplace after a system reconfiguration? If
> controller numbers change, is it possible that autoreplace would grab
> a drive which was not part of a ZFS pool and try to use it? I recall
> some posts here where the "old" path was preserved in the pool and it
> was fairly difficult to get ZFS to recognize the "new" path.

No, because devids trump physical location. If you reconfigured the system, absolutely nothing would happen except the device paths would get updated.

> In general, I don't like the idea of autoreplace being tied to the
> device path. It would be both safer and more general if the underlying
> frameworks exported a physical location identifier to the node. I
> suspect that this isn't currently done by Solaris, and I'm sure it's
> not done for devices in enclosures which support (say) SES; but it
> seems like the right long-term direction. For Sun-supplied hardware,
> it would even be possible to use readable device names (e.g. "Slot A
> connector 2").

This would be nice, but such a thing doesn't exist. We're gradually moving to a libtopo world where we might be able to do this, but it won't exist in the near term, hence having it disabled by default.

> Autoreplace probably needs a lot of warnings except in the particular
> case of appliances and other highly-controlled environments. Consider
> a server with three drives, A, B, and C, in which A and B are mirrored
> and C is not. Pull out A, B, and C, and re-insert them as A, C, and B.
> If B is slow to come up for some reason, ZFS will see "C" in place of
> "B", and happily reformat it into a mirror of "A". (Or am I reading
> this incorrectly?)

Again, thanks to devids, the autoreplace code would not kick in here at all. You would end up with an identical pool.

> I hope that there's a way to disable the periodic probing of hot
> spares. Spinning these drives up often might be highly annoying in
> some environments (though useful in others, since it could also verify
> that the disk is responding normally).

Why is this "highly annoying"? The frequency would be rather low, it would have no effect on performance, and you're gaining the ability to know whether your hot spares are actually working.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Eric Schrock
2007-Mar-22 15:42 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Wed, Mar 21, 2007 at 11:57:42PM -0700, Matt B wrote:
>
> Literally, someone should be able to make $7/hr with a stack of drives
> and the ability to just look at or listen to a server to determine
> which drive needs to be replaced.
>
> This means ZFS will need to be able to control the HDD status lights
> on the chassis for "look", but for "listen" ZFS could cause the server
> to beep, using one beep for the first slot, two beeps in rapid
> succession for the second slot. A sort of lame Morse code... no
> device integration on ZFS's part required.
>

This is part of ongoing work with Solaris platform integration (see my last blog post) and future ZFS/FMA work. We will eventually be leveraging IPMI and SES to manage physical indicators (i.e. LEDs) in response to Solaris events. It will take some time to reach this point, however.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Anton B. Rang
2007-Mar-23 05:37 UTC
[zfs-discuss] Re: Re: Proposal: ZFS hotplug support and autoconfiguration
> > Consider a server [with] three drives, A, B, and C, in which A and B
> > are mirrored and C is not. Pull out A, B, and C, and re-insert them
> > as A, C, and B. If B is slow to come up for some reason, ZFS will see
> > "C" in place of "B", and happily reformat it into a mirror of "A".
> > (Or am I reading this incorrectly?)
>
> Again, thanks to devids, the autoreplace code would not kick in here at
> all. You would end up with an identical pool.

Is this because C would already have a devid? If I insert an unlabeled disk, what happens? What if B takes five minutes to spin up? If it never does?

> > I hope that there's a way to disable the periodic probing of hot
> > spares. Spinning these drives up often might be highly annoying in
> > some environments (though useful in others, since it could also verify
> > that the disk is responding normally).
>
> Why is this "highly annoying"? The frequency would be rather low, it
> would have no effect on performance, and you're gaining the ability to
> know whether your hot spares are actually working.

Well, in my home office it would be highly annoying if I got to hear spin-up/spin-down sounds every half hour. The ability to tune the time interval would probably make this OK, though. I could live with once a day or once a week.

Anton
> > > Consider a server [with] three drives, A, B, and C, in which A and B
> > > are mirrored and C is not. Pull out A, B, and C, and re-insert them
> > > as A, C, and B. If B is slow to come up for some reason, ZFS will see
> > > "C" in place of "B", and happily reformat it into a mirror of "A".
> > > (Or am I reading this incorrectly?)
> >
> > Again, thanks to devids, the autoreplace code would not kick in here at
> > all. You would end up with an identical pool.
>
> Is this because C would already have a devid?

Well, it's because all the members of the ZFS pool have information about the pool and their place in it. The path of a member isn't important.

> If I insert an unlabeled disk, what happens?

Nothing. If ZFS can't find a signature on it, it knows it's not part of a ZFS pool.

> What if B takes five minutes to spin up?

That sounds like something for FMA to deal better with. It might hang for a period of time if the driver doesn't respond quickly.

> If it never does?

At some point the device driver needs to respond. If the device doesn't become ready, it'll have to time out and be noted as a failure.

--
Darren Dunham                                      ddunham at taos.com
Senior Technical Consultant       TAOS             http://www.taos.com/
Got some Dr Pepper?                                San Francisco, CA bay area
          < This line left intentionally blank to confuse you. >
Pawel Jakub Dawidek
2007-Mar-23 10:31 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Thu, Mar 22, 2007 at 08:39:55AM -0700, Eric Schrock wrote:
> Again, thanks to devids, the autoreplace code would not kick in here at
> all. You would end up with an identical pool.

Eric, maybe I'm missing something, but why does ZFS depend on devids at all? As I understand it, a devid is something that never changes for a block device, e.g. a disk serial number, but on the other hand it is optional, so we can rely on the fact it's always there (I mean for all block devices we use).

Why do we simply not forget about devids and just focus on on-disk metadata to detect pool components?

The only reason I see is performance. This is probably why /etc/zfs/zpool.cache is used as well.

In FreeBSD we have the GEOM infrastructure for storage. Each storage device (disk, partition, mirror, etc.) is simply a GEOM provider. If a GEOM provider appears (e.g. a disk is inserted, a partition is configured), all interested parties are informed about this and can 'taste' the provider by reading metadata specific to them. The same when a provider goes away - all interested parties are informed and can react accordingly.

We don't see any performance problems related to the fact that each disk that appears is read by many "GEOM classes".

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Pawel Jakub Dawidek
2007-Mar-23 10:42 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Fri, Mar 23, 2007 at 11:31:03AM +0100, Pawel Jakub Dawidek wrote:
> On Thu, Mar 22, 2007 at 08:39:55AM -0700, Eric Schrock wrote:
> > Again, thanks to devids, the autoreplace code would not kick in here at
> > all. You would end up with an identical pool.
>
> Eric, maybe I'm missing something, but why does ZFS depend on devids at
> all? As I understand it, a devid is something that never changes for a
> block device, e.g. a disk serial number, but on the other hand it is
> optional, so we can rely on the fact it's always there (I mean for all
> block devices we use).

s/can/can't/

> Why do we simply not forget about devids and just focus on on-disk
> metadata to detect pool components?
>
> The only reason I see is performance. This is probably why
> /etc/zfs/zpool.cache is used as well.
>
> In FreeBSD we have the GEOM infrastructure for storage. Each storage
> device (disk, partition, mirror, etc.) is simply a GEOM provider. If a
> GEOM provider appears (e.g. a disk is inserted, a partition is
> configured), all interested parties are informed about this and can
> 'taste' the provider by reading metadata specific to them. The same
> when a provider goes away - all interested parties are informed and can
> react accordingly.
>
> We don't see any performance problems related to the fact that each
> disk that appears is read by many "GEOM classes".

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Eric Schrock
2007-Mar-23 16:51 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Fri, Mar 23, 2007 at 11:31:03AM +0100, Pawel Jakub Dawidek wrote:
>
> Eric, maybe I'm missing something, but why does ZFS depend on devids at
> all? As I understand it, a devid is something that never changes for a
> block device, e.g. a disk serial number, but on the other hand it is
> optional, so we can rely on the fact it's always there (I mean for all
> block devices we use).
>
> Why do we simply not forget about devids and just focus on on-disk
> metadata to detect pool components?
>
> The only reason I see is performance. This is probably why
> /etc/zfs/zpool.cache is used as well.
>
> In FreeBSD we have the GEOM infrastructure for storage. Each storage
> device (disk, partition, mirror, etc.) is simply a GEOM provider. If a
> GEOM provider appears (e.g. a disk is inserted, a partition is
> configured), all interested parties are informed about this and can
> 'taste' the provider by reading metadata specific to them. The same
> when a provider goes away - all interested parties are informed and can
> react accordingly.
>
> We don't see any performance problems related to the fact that each
> disk that appears is read by many "GEOM classes".

We do use the on-disk metadata for verification purposes, but we can't open the device based on the metadata. We don't have a corresponding interface in Solaris, so there is no way to say "open the device with this particular on-disk data". The devid is also unique to the device (it's based on manufacturer/model/serial number), so that we can uniquely identify devices for fault management purposes.

The world of hotplug and device configuration in Solaris is quite complicated. Part of my time spent on this work has been just writing down the existing semantics. A scheme like that in FreeBSD would be nice, but it is unlikely to appear given the existing complexity. As part of the I/O retire work we will likely be introducing device contracts, which is a step in the right direction, but it's a very long road.

Thanks for sharing the details on FreeBSD, it's quite interesting. Since the majority of this work is Solaris-specific, I'll be interested to see how other platforms deal with this type of reconfiguration.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Richard Elling
2007-Mar-23 19:52 UTC
[zfs-discuss] Re: Re: Proposal: ZFS hotplug support and autoconfiguration
Anton B. Rang wrote:
> Is this because C would already have a devid? If I insert an unlabeled
> disk, what happens? What if B takes five minutes to spin up? If it
> never does?

N.B. you get different error messages from the disk. If a disk is not ready, then it will return a not-ready code, and the sd driver will record this and patiently retry. The reason I know this in some detail is scar #523, which was inflicted when we realized that some/many/most RAID arrays don't do this. The difference is that the JBOD disk electronics start very quickly, perhaps a few seconds after power-on. A RAID array can take several minutes (or more) to get to a state where it will reply to any request. So, if you do not perform a full, simultaneous power-on test for your entire (cluster) system, then you may not hit the problem that the slow storage start makes Solaris think that the device doesn't exist -- which can be a bad thing for highly available services. Yes, this is yet another systems engineering problem.

 -- richard
Richard Elling
2007-Mar-23 20:06 UTC
[zfs-discuss] Re: Re: Proposal: ZFS hotplug support and autoconfiguration
workaround below... Richard Elling wrote:> Anton B. Rang wrote: >> Is this because C would already have a devid? If I insert an unlabeled >> disk, what happens? What if B takes five minutes to spin up? If it >> never does? > > N.B. You get different error messages from the disk. If a disk is not > ready > then it will return a not ready code and the sd driver will record this and > patiently retry. The reason I know this in some detail is scar #523, which > was inflicted when we realized that some/many/most RAID arrays don''t do this. > The difference is that the JBOD disk electronics start very quickly, perhaps > a few seconds after power-on. A RAID array can take several minutes (or more) > to get to a state where it will reply to any request. So, if you do not > perform a full, simultaneous power-on test for your entire (cluster) system, > then you may not hit the problem that the slow storage start makes Solaris > think that the device doesn''t exist -- which can be a bad thing for highly > available services. Yes, this is yet another systems engineering problem.Sorry, it was rude of me not to include the workaround. We put a delay in the SPARC OBP to slow down the power-on boot time of the servers to match the attached storage. While this worked, it is butugly. You can do this with GRUB, too. -- richard