Eric Schrock
2007-Mar-21 18:16 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
Folks -

I'm preparing to submit the attached PSARC case to provide better support for device removal and insertion within ZFS. Since this is a rather complex issue, with a fair share of corner cases, I thought I'd send the proposal out to the ZFS community at large for further comment before submitting it. The prototype is functional except for the offline device insertion and hot spares functionality. I hope to have this integrated within the next month, along with the next phase of FMA integration.

Please respond with any comments, concerns, or suggestions.

Thanks,

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock

-------------- next part --------------
1. INTRODUCTION

Currently, ZFS supports what is affectionately known as "poor man's hotplug". If a device is removed from the system, then it is assumed that, upon I/O failure, an attempt to reopen the same device will fail. This triggers an FMA fault, substituting a hot spare if available. This is undesirable for two reasons:

    - There is no distinction between device removal and arbitrary failure. If a device is removed from the system, it should be treated as a deliberate action, distinct from a normal failure.

    - There is no support for automatic response to device insertion. For a server configured with a ZFS pool, the administrator should be able to walk up, remove any drive (preferably a faulted one), insert a new drive, and not have to issue any ZFS commands to reconfigure the pool. This is particularly true for the appliance space, where hardware reconfiguration should "just work".

This case enhances ZFS to respond to device removal and provides a mechanism to automatically deal with device insertion. While the framework is generic, the primary target is devices supported by the SATA framework. The only device-specific portion of this proposal is determining whether a device is in the same "physical location" as a previously known device, which involves correlating a transport's enumeration of the device with the device's physical location within the chassis.

2. DEVICE REMOVAL

There are two types of device removal within Solaris. Coordinated device removal involves stopping all consumers of the device, using the appropriate cfgadm(1M) command (PSARC 1996/285), and then physically removing the device. Uncoordinated removal (also known as "surprise removal") is when a device is physically removed while still in active use by the system. The latter is increasingly common as more I/O protocols support hotplug and higher level software (ZFS) becomes more capable.

There are several ways to detect device removal within Solaris. Fibre channel drivers generate the NDI events FCAL_INSERT_EVENT and FCAL_REMOVE_EVENT. USB and 1394 drivers generate the NDI events DDI_DEVI_INSERT_EVENT and DDI_DEVI_REMOVE_EVENT. In addition to these event channels, there is also the DKIOCSTATE ioctl(), which returns (on capable drivers) DKIO_DEV_GONE if the device has been removed. Of these, the ioctl() is the most widely supported, and is the mechanism used as part of this case. Since this is an implementation detail of the current architecture, it does not preclude using alternate mechanisms in the future.

When an I/O to a disk fails, ZFS will query the media state via the DKIOCSTATE ioctl. If the device is in any state other than DKIO_INSERTED, ZFS will transition the device to a new REMOVED state. No FMA fault will be generated, but a hot spare (if available) will still be substituted.
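For illustration only (not part of the case itself), a minimal user-level sketch of the DKIOCSTATE check is shown below. The in-kernel ZFS code would perform the equivalent query through its layered driver handle; error handling here is deliberately minimal.

    /* Minimal sketch: query the media state of a disk via DKIOCSTATE. */
    #include <sys/types.h>
    #include <sys/dkio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stropts.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        enum dkio_state state = DKIO_NONE;  /* differs from any real state */
        int fd;

        if (argc != 2) {
            (void) fprintf(stderr, "usage: %s <raw-device>\n", argv[0]);
            return (2);
        }

        if ((fd = open(argv[1], O_RDONLY | O_NDELAY)) < 0) {
            perror("open");
            return (1);
        }

        /*
         * DKIOCSTATE returns once the media state differs from the state
         * passed in, so starting from DKIO_NONE simply reports the current
         * state.  Drivers that don't implement the ioctl fail the call.
         */
        if (ioctl(fd, DKIOCSTATE, &state) != 0) {
            perror("DKIOCSTATE");
            (void) close(fd);
            return (1);
        }

        if (state != DKIO_INSERTED)
            (void) printf("device absent (state %d): would go REMOVED\n",
                (int)state);
        else
            (void) printf("device present\n");

        (void) close(fd);
        return (0);
    }

Run against the /dev/rdsk node of a pulled disk (or a lofi device detached with the proposed -f flag below), this should report DKIO_DEV_GONE.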
Note that DKIO_DEV_GONE can be returned for a variety of reasons (pulling cables, an external chassis being powered off, etc.). In the absence of additional FMA information, it is assumed that this is intentional administrative action.

As part of this work, lofiadm(1M) will be expanded to include a new force (-f) flag when removing devices. Combined with the upcoming lofi devfs events (PSARC 2006/709), this will provide a much simpler testing framework without the need for physical hardware interaction. When this flag is used, the underlying file will be closed, any further I/O or attempts to open the device will fail, and DKIOCSTATE will return DKIO_DEV_GONE. This flag will remain private for testing only, and will not be documented. An example of this in action:

    # lofiadm -a /disk/a
    /dev/lofi/1
    # lofiadm -a /disk/b
    /dev/lofi/2
    # lofiadm -a /disk/c
    /dev/lofi/3
    # zpool create -f test mirror /dev/lofi/1 /dev/lofi/2 spare /dev/lofi/3
    # while :; do touch /test/foo; sync; sleep 1; done &
    [1] 100662
    # zpool status
      pool: test
     state: ONLINE
     scrub: none requested
    config:

            NAME             STATE     READ WRITE CKSUM
            test             ONLINE       0     0     0
              mirror         ONLINE       0     0     0
                /dev/lofi/1  ONLINE       0     0     0
                /dev/lofi/2  ONLINE       0     0     0
            spares
              /dev/lofi/3    AVAIL

    errors: No known data errors
    # lofiadm -d /disk/a -f
    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
    config:

            NAME               STATE     READ WRITE CKSUM
            test               DEGRADED     0     0     0
              mirror           DEGRADED     0     0     0
                spare          DEGRADED     0     0     0
                  /dev/lofi/1  REMOVED      0     0     0
                  /dev/lofi/3  ONLINE       0     0     0
                /dev/lofi/2    ONLINE       0     0     0
            spares
              /dev/lofi/3      INUSE     currently in use

    errors: No known data errors

This behavior is universal for all pools, and cannot be disabled. If a device doesn't support DKIOCSTATE, then it will be diagnosed as faulty through the standard FMA mechanisms. The 'REMOVED' state is not persistent, so if a machine is rebooted with a device in the REMOVED state, it will appear as FAULTED when the machine comes up.

3. DEVICE INSERTION

When a device is inserted, there are two possible outcomes of interest to ZFS:

    - If a previously known device is inserted, then we want to online the device.

    - If a new device is inserted into a physical location that previously contained a ZFS device, then we want to format the device and replace the original device.

The former is applicable to any pool, and is always enabled. The latter is potentially damaging, as it will automatically overwrite any data present on newly inserted devices. To protect against this, a new pool property (PSARC 2006/577), 'autoreplace', will be defined. This boolean property will be off by default to minimize the impact on existing systems or unknown hardware. If unset, the current behavior remains the same, and any replacement operation must be initiated by the administrator via zpool(1M). When set, it indicates that any new device found in the same physical location as a device previously belonging to the pool will be automatically formatted and replaced. To ensure consistent behavior, ZFS must behave in the same manner when the device is replaced (via hotplug) while the system is running as when the device is replaced while the system is powered off.

4. ONLINE DEVICE INSERTION

A new syseventd module will be introduced that listens for EC_DEV_ADD events of subclass ESC_DISK or ESC_LOFI. This event is triggered when the device node for the disk or lofi device is created, not necessarily when a disk is inserted.
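As a rough sketch of the subscription side (illustrative only, written as a stand-alone libsysevent consumer rather than the syseventd plug-in module this case actually proposes; the payload attribute name DEV_PHYS_PATH is an assumption here, not something specified by this case):

    /* Illustrative stand-alone consumer for EC_DEV_ADD disk/lofi events. */
    #include <libsysevent.h>
    #include <libnvpair.h>
    #include <sys/sysevent/eventdefs.h>
    #include <sys/sysevent/dev.h>
    #include <stdio.h>
    #include <unistd.h>

    static void
    dev_add_handler(sysevent_t *ev)
    {
        nvlist_t *attrs;
        char *path = NULL;

        if (sysevent_get_attr_list(ev, &attrs) != 0)
            return;

        /* The physical path of the newly created device node. */
        if (nvlist_lookup_string(attrs, DEV_PHYS_PATH, &path) == 0)
            (void) printf("device added: %s\n", path);

        /* A real module would now search pools by devid, then by path. */
        nvlist_free(attrs);
    }

    int
    main(void)
    {
        sysevent_handle_t *shp;
        const char *subclasses[] = { ESC_DISK, ESC_LOFI };

        if ((shp = sysevent_bind_handle(dev_add_handler)) == NULL) {
            perror("sysevent_bind_handle");
            return (1);
        }

        if (sysevent_subscribe_event(shp, EC_DEV_ADD, subclasses, 2) != 0) {
            perror("sysevent_subscribe_event");
            sysevent_unbind_handle(shp);
            return (1);
        }

        for (;;)
            (void) pause();     /* events arrive on the handler thread */
    }

Built with -lsysevent and -lnvpair, this would simply print the path of each newly created disk or lofi device node; the pool lookup described below is where the proposed module does its real work.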
Currently, the USB framework auto-configures drives on insertion, while the SATA framework does not. Modifying the SATA framework behavior will be pursued under a separate case and is outside the scope of this one. In the meantime, these SATA events will be triggered only by an explicit 'cfgadm -c configure' by the user.

When one of these events is received, the corresponding device path is derived from the sysevent payload. For disks, this will be the device node, while for lofi it will be a particular minor node. If the device has a devid, then we first search all pools for a vdev with a matching devid. If none is found, or the device does not have a devid, then we search all pools for vdevs with the specified device path. As part of this work, the ZFS configuration will be expanded to store the physical device path as part of the vdev label. This will also have the benefit of allowing ZFS to boot from devices which don't support devids. Currently, ZFS only identifies devices by devid or /dev path, neither of which may be available when mounting the root filesystem.

This simplistic mechanism will only work for devices where the device path identifies a physical location, which may not be true for FC or iSCSI devices, or for devices plumbed under MPxIO. This logic can be expanded in the future if there are protocols or drivers which do not adhere to this behavior.

If no matching vdevs are found, then the event is ignored and nothing is done. Otherwise, the device is onlined to determine if it is a known ZFS device. This online operation will automatically remove any attached spare when the resilver is complete. To continue the above example:

    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
    config:

            NAME               STATE     READ WRITE CKSUM
            test               DEGRADED     0     0     0
              mirror           DEGRADED     0     0     0
                spare          DEGRADED     0     0     0
                  /dev/lofi/1  REMOVED      0     0     0
                  /dev/lofi/3  ONLINE       0     0     0
                /dev/lofi/2    ONLINE       0     0     0
            spares
              /dev/lofi/3      INUSE     currently in use

    errors: No known data errors
    # lofiadm -a /disk/a
    /dev/lofi/1
    # zpool status
      pool: test
     state: ONLINE
     scrub: resilver completed with 0 errors on Mon Mar 12 10:58:22 2007
    config:

            NAME             STATE     READ WRITE CKSUM
            test             ONLINE       0     0     0
              mirror         ONLINE       0     0     0
                /dev/lofi/1  ONLINE       0     0     0
                /dev/lofi/2  ONLINE       0     0     0
            spares
              /dev/lofi/3    AVAIL

    errors: No known data errors

If the online attempt failed, then we are dealing with a new device inserted into the same physical slot. If the 'autoreplace' property is unset, then the event is ignored. If the original event was ESC_DISK and the vdev is not a whole disk, then the event is also ignored. Otherwise, the disk is labeled with an EFI label in the same manner as when the pool is initially created. If that succeeds, then the corresponding 'zpool replace' command is automatically invoked.
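A small, self-contained toy model of that decision order may help; the structures, helper names, and pool contents below are invented for illustration and are not ZFS interfaces.

    /* Toy model of the insertion logic: devid match, then path match,   */
    /* then online vs. autoreplace.  Not ZFS code; names are invented.   */
    #include <stdio.h>
    #include <string.h>

    struct toy_vdev {
        const char *devid;      /* NULL if the device has no devid */
        const char *physpath;   /* physical path stored in the label */
        int wholedisk;          /* vdev was created as a whole disk */
    };

    static struct toy_vdev pool[] = {
        { "id1,sd@SDISK_A001", "/pci@0,0/sata@5/disk@0,0", 1 },
        { "id1,sd@SDISK_B002", "/pci@0,0/sata@5/disk@1,0", 1 },
    };
    static const int autoreplace = 1;   /* proposed property, off by default */

    static void
    handle_dev_add(const char *devid, const char *physpath)
    {
        struct toy_vdev *vd = NULL;
        size_t i, n = sizeof (pool) / sizeof (pool[0]);

        /* 1. A devid match means a previously known device: online it. */
        for (i = 0; devid != NULL && vd == NULL && i < n; i++)
            if (pool[i].devid != NULL && strcmp(pool[i].devid, devid) == 0)
                vd = &pool[i];
        if (vd != NULL) {
            (void) printf("%s: known device, online it\n", physpath);
            return;
        }

        /* 2. Otherwise fall back to the physical path stored in the label. */
        for (i = 0; vd == NULL && i < n; i++)
            if (strcmp(pool[i].physpath, physpath) == 0)
                vd = &pool[i];
        if (vd == NULL) {
            (void) printf("%s: not part of any pool, ignore\n", physpath);
            return;
        }

        /*
         * 3. Same slot, apparently a new disk: only whole disks, only if
         * opted in.  (The real code would first try to online the device
         * and treat only an online failure as a new disk.)
         */
        if (autoreplace && vd->wholedisk)
            (void) printf("%s: write EFI label, then zpool replace\n",
                physpath);
        else
            (void) printf("%s: leave it to the administrator\n", physpath);
    }

    int
    main(void)
    {
        handle_dev_add("id1,sd@SDISK_A001", "/pci@0,0/sata@5/disk@0,0");
        handle_dev_add(NULL, "/pci@0,0/sata@5/disk@1,0");
        handle_dev_add(NULL, "/pci@0,0/scsi@2/disk@3,0");
        return (0);
    }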
Continuing the lofi example from above:

    # lofiadm -d /disk/a -f
    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 10:57:43 2007
    config:

            NAME                 STATE     READ WRITE CKSUM
            test                 DEGRADED     0     0     0
              mirror             DEGRADED     0     0     0
                spare            DEGRADED     0     0     0
                  /dev/lofi/1    REMOVED      0     0     0
                  /dev/lofi/3    ONLINE       0     0     0
                /dev/lofi/2      ONLINE       0     0     0
            spares
              /dev/lofi/3        INUSE     currently in use

    errors: No known data errors
    # lofiadm -a /disk/d
    /dev/lofi/1
    # zpool status
      pool: test
     state: DEGRADED
     scrub: resilver completed with 0 errors on Mon Mar 12 17:31:06 2007
    config:

            NAME                   STATE     READ WRITE CKSUM
            test                   DEGRADED     0     0     0
              mirror               DEGRADED     0     0     0
                spare              DEGRADED     0     0     0
                  replacing        DEGRADED     0     0     0
                    /dev/lofi/1/old  FAULTED    0     0     0  corrupted data
                    /dev/lofi/1    ONLINE       0     0     0
                  /dev/lofi/3      ONLINE       0     0     0
                /dev/lofi/2        ONLINE       0     0     0
            spares
              /dev/lofi/3          INUSE     currently in use

    errors: No known data errors
    # zpool status
      pool: test
     state: ONLINE
     scrub: resilver completed with 0 errors on Mon Mar 12 17:31:06 2007
    config:

            NAME             STATE     READ WRITE CKSUM
            test             ONLINE       0     0     0
              mirror         ONLINE       0     0     0
                /dev/lofi/1  ONLINE       0     0     0
                /dev/lofi/2  ONLINE       0     0     0
            spares
              /dev/lofi/3    AVAIL

    errors: No known data errors

In this case, the device was automatically replaced with the newly inserted disk; once the resilver completed, the 'replacing' vdev disappeared and the hot spare returned to the AVAIL state.

5. OFFLINE DEVICE INSERTION

If a device is replaced while the system is powered off, then ZFS should behave in a similar manner. If devices change attachment points (i.e. are swapped) while the system is powered off, ZFS already handles this case for devices which support devids. If a device can be opened but the devid doesn't match, then ZFS will treat this as a disk insertion event. If the 'autoreplace' property is set, then ZFS will label the disk and perform the appropriate 'zpool replace' operation to resilver the device.

6. HOT SPARES

Currently, ZFS does not do any I/O to inactive hot spares, so it is incapable of detecting when a hot spare is removed from the system. This case will modify ZFS to periodically attempt to read from all hot spares and make sure they are online and available. If a hot spare is removed, then when this I/O fails it will trigger the normal remove path. This case will also allow offline hot spares to be replaced.

With these changes, hot spares will be treated as normal devices with respect to hotplug. If an active hot spare is removed, then the hot spare will be detached and marked removed. If another hot spare is available, then it will be substituted in its place. If a hot spare is inserted, and there is a faulted device with no current hot spare, then the insertion will automatically kick in the new hot spare.

7. MANPAGE DIFFS

XXX

8. REFERENCES

PSARC 1996/285 Dynamic Attach/Detach of CPU/Memory Boards
PSARC 2002/240 ZFS
PSARC 2006/223 ZFS Hot Spares
PSARC 2006/577 zpool property to disable delegation
PSARC 2006/709 lofi devfs events
Bill Sommerfeld
2007-Mar-21 18:37 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
This is tangential, but then arc review is all about feature interaction.

1) What happens if the hotplugged replacement device is too small?

2) What's the interaction between autoreplace and automatic vdev growth (when the underlying device gets bigger)?

Since we can't yet shrink a pool, I'm wondering if there aren't some gotchas if someone plugs in a disk that's too big; they might not be able to go back and replace it with one of the intended size.

Consider the following comedy of operational errors. We start with a pool which looks like:

    mirror small1 small2
    mirror large1 large2
    spare large3

"small1" fails, and hot sparing kicks in to replace it with "large3". Resilvering is complete.

case #1: operator mistakenly pulls small2 instead of small1. Does mirror #1 (reduced to a single functioning replica) autogrow to size "large"?

case #2: operator pulls small1, and by mistake plugs in "large4" instead of "small3". Before noticing this error, "small2" fails.

.. or other variants of the above.

						- Bill
Eric Schrock
2007-Mar-21 18:44 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
On Wed, Mar 21, 2007 at 02:37:16PM -0400, Bill Sommerfeld wrote:
>
> 1) What happens if the hotplugged replacement device is too small?
>

The replace will fail, just as if the administrator tried to issue a 'zpool replace' with a smaller drive. In the auto-replace case, the result will be a faulted vdev and an associated FMA alert. I'll add some text to this effect, and make sure that this is actually how it behaves.

> 2) What's the interaction between autoreplace and automatic vdev
> growth (when the underlying device gets bigger)?
>
> Since we can't yet shrink a pool, I'm wondering if there aren't some
> gotchas if someone plugs in a disk that's too big; they might not be
> able to go back and replace it with one of the intended size.
>
> Consider the following comedy of operational errors. We start with a
> pool which looks like:
>
>     mirror small1 small2
>     mirror large1 large2
>     spare large3
>
> "small1" fails, and hot sparing kicks in to replace it with "large3".
> Resilvering is complete.
>
> case #1: operator mistakenly pulls small2 instead of small1. Does
> mirror #1 (reduced to a single functioning replica) autogrow to size
> "large"?

No, it will not autogrow, because a spare doesn't completely replace the device. So you will have something like:

    mirror
        spare             100G
            small1        100G
            large3        200G (100G used)
        small2            100G
    mirror                200G
        large1            200G
        large2            200G

If you then pull 'small2', it will become faulted, but the vdev will not grow in size.

> case #2: operator pulls small1, and by mistake plugs in "large4"
> instead of "small3". Before noticing this error, "small2" fails.

Again, the 'spare' vdev will still reflect the size of 'small2', so the vdev will not grow. It's only when the spare is complete, either through explicit zpool(1M) actions or replacing the underlying drive, that the 'spare' vdev disappears and the new vdev reflects the larger size.

Thanks for the input,

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Robert Milkowski
2007-Mar-22 00:03 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
Hello Eric,

What if I have a failing drive (it still works, but I want it to be replaced) and I have a replacement drive on a shelf? All I want is to remove the failing drive, insert the new one, and resilver. I do not want a hot spare to automatically kick in.

--
Best regards,
 Robert                       mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
Eric Schrock
2007-Mar-22 00:13 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
On Thu, Mar 22, 2007 at 01:03:48AM +0100, Robert Milkowski wrote:
>
> What if I have a failing drive (it still works, but I want it to be
> replaced) and I have a replacement drive on a shelf? All I want is
> to remove the failing drive, insert the new one, and resilver. I do
> not want a hot spare to automatically kick in.
>

Kicking in a hot spare is a harmless activity (the end result will be the same), so why would you want to avoid this? Do you have an idea of how you would want to control this behavior?

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Robert Milkowski
2007-Mar-22 01:44 UTC
[zfs-discuss] Proposal: ZFS hotplug support and autoconfiguration
Hello Eric,

Thursday, March 22, 2007, 1:13:19 AM, you wrote:

ES> On Thu, Mar 22, 2007 at 01:03:48AM +0100, Robert Milkowski wrote:
>> What if I have a failing drive (it still works, but I want it to be
>> replaced) and I have a replacement drive on a shelf? All I want is
>> to remove the failing drive, insert the new one, and resilver. I do
>> not want a hot spare to automatically kick in.

ES> Kicking in a hot spare is a harmless activity (the end result will be
ES> the same), so why would you want to avoid this?

With a lot of storage I like to keep the config as consistent as I can across identical boxes. So if I have a replacement drive I would rather use it instead of a hot spare, so I do not have to resilver again. I know this is mostly aesthetics, but it helps in managing storage.

ES> Do you have an idea of how
ES> you would want to control this behavior?

Maybe a simple method of "freezing" hot spares (without removing them), or maybe the automated method should have some reasonable delay - when it sees a disk is gone it waits N seconds before the hot spare kicks in, or if a new drive is present at the same physical location then it uses that rather than a hot spare (or perhaps the admin can issue zpool replace manually before the hot spare kicks in). I'm not sure if it won't complicate things too much, but still, I like to keep similar configs.

Or maybe the ability to stop resilvering of a hot spare and start resilvering a new drive would be sufficient, or it could even be automatic (stop resilvering the hot spare but keep all data already resilvered; resilver the new disk with the data which has not yet been resilvered to the hot spare; then resilver the data which was already resilvered to the hot spare; then release the hot spare). All of this would work only if some kind of hotspare-back property were set.

It's a matter of what people prefer - a moving hot spare, or, once the disk is replaced, the hot spare going back to the hot spare list and being released (after the disk is resilvered). It probably doesn't matter that much on an x4500, but it can matter more on other arrays.

--
Best regards,
 Robert                       mailto:rmilkowski at task.gda.pl
                              http://milek.blogspot.com
Anton B. Rang
2007-Mar-22 04:26 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
A couple of questions/comments --

Why is the REMOVED state not persistent? It seems that, if ZFS knows that an administrator pulled a disk deliberately, that's still useful information after a reboot. Changing the state to FAULTED is non-intuitive, at least to me.

What happens with autoreplace after a system reconfiguration? If controller numbers change, is it possible that autoreplace would grab a drive which was not part of a ZFS pool and try to use it? I recall some posts here where the "old" path was preserved in the pool and it was fairly difficult to get ZFS to recognize the "new" path.

In general, I don't like the idea of autoreplace being tied to the device path. It would be both safer and more general if the underlying frameworks exported a physical location identifier to the node. I suspect that this isn't currently done by Solaris, and I'm sure it's not done for devices in enclosures which support (say) SES; but it seems like the right long-term direction. For Sun-supplied hardware, it would even be possible to use readable device names (e.g. "Slot A connector 2").

Autoreplace probably needs a lot of warnings except in the particular case of appliances and other highly-controlled environments. Consider a server with three drives, A, B, and C, in which A and B are mirrored and C is not. Pull out A, B, and C, and re-insert them as A, C, and B. If B is slow to come up for some reason, ZFS will see "C" in place of "B", and happily reformat it into a mirror of "A". (Or am I reading this incorrectly?)

I hope that there's a way to disable the periodic probing of hot spares. Spinning these drives up often might be highly annoying in some environments (though useful in others, since it could also verify that the disk is responding normally).
Matt B
2007-Mar-22 06:57 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
Autoreplace is currently the biggest advantage that H/W raid controllers have over ZFS and other less advanced forms of S/W raid. I would even go so far as to promote this issue to the forefront as a leading deficiency that is hindering ZFS adoption.

Regarding H/W raid controllers, things are kept pretty simple. The controller does not even "try" to discriminate between an admin pulling a good drive (intentional removal) and a drive that has failed. While some may see this as less sophisticated, in reality it makes things quite simple. There is a certain degree of decadence and freedom that comes with being able to pull a drive out of server A, slap it into server B, and turn on server B with a perfect duplicate. Slap a fresh drive from the factory into the now vacant slot on server A, and the rebuild takes place immediately.

This has to be the standard. Anything beyond this level of effort simply fails against H/W raid. There is no standard S/W raid like this, and with ZFS's already beautiful simplicity, it is the champion of the hour to deliver on this holy grail. I think this proposal is very close to what we need to take ZFS to the next level in the enterprise.

Here are some rough thoughts regarding perceivable drive failure notification (beyond email, pager, console, etc.). Literally, someone should be able to make $7/hr with a stack of drives and the ability to just look at or listen to a server to determine which drive needs to be replaced.

This means ZFS will need to be able to control the HDD status lights on the chassis for "look", but for "listen" ZFS could cause the server to beep, using one beep for the first slot, two beeps in rapid succession for the second slot. A sort of lame Morse code... no device integration on ZFS's part required.
Eric Schrock
2007-Mar-22 15:39 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Wed, Mar 21, 2007 at 09:26:35PM -0700, Anton B. Rang wrote:
> A couple of questions/comments --
>
> Why is the REMOVED state not persistent? It seems that, if ZFS knows
> that an administrator pulled a disk deliberately, that's still useful
> information after a reboot. Changing the state to FAULTED is
> non-intuitive, at least to me.

The reason is that we can imply from watching a device go from the connected -> unconnected state that a remove event occurred. While the system is powered off, we don't necessarily know what reconfiguration has taken place, so implying that a drive is still removed may not be appropriate. It's pretty easy to store this persistently, so this can change if people feel otherwise.

> What happens with autoreplace after a system reconfiguration? If
> controller numbers change, is it possible that autoreplace would grab
> a drive which was not part of a ZFS pool and try to use it? I recall
> some posts here where the "old" path was preserved in the pool and it
> was fairly difficult to get ZFS to recognize the "new" path.

No, because devids trump physical location. If you reconfigured the system, absolutely nothing would happen except the device paths would get updated.

> In general, I don't like the idea of autoreplace being tied to the
> device path. It would be both safer and more general if the underlying
> frameworks exported a physical location identifier to the node. I
> suspect that this isn't currently done by Solaris, and I'm sure it's
> not done for devices in enclosures which support (say) SES; but it
> seems like the right long-term direction. For Sun-supplied hardware,
> it would even be possible to use readable device names (e.g. "Slot A
> connector 2").

This would be nice, but such a thing doesn't exist. We're gradually moving to a libtopo world where we might be able to do this, but it won't exist in the near term, hence having it disabled by default.

> Autoreplace probably needs a lot of warnings except in the particular
> case of appliances and other highly-controlled environments. Consider
> a server with three drives, A, B, and C, in which A and B are mirrored
> and C is not. Pull out A, B, and C, and re-insert them as A, C, and B.
> If B is slow to come up for some reason, ZFS will see "C" in place of
> "B", and happily reformat it into a mirror of "A". (Or am I reading
> this incorrectly?)

Again, thanks to devids, the autoreplace code would not kick in here at all. You would end up with an identical pool.

> I hope that there's a way to disable the periodic probing of hot
> spares. Spinning these drives up often might be highly annoying in
> some environments (though useful in others, since it could also verify
> that the disk is responding normally).

Why is this "highly annoying"? The frequency would be rather low, it would have no effect on performance, and you're gaining the ability to know whether your hot spares are actually working.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Eric Schrock
2007-Mar-22 15:42 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Wed, Mar 21, 2007 at 11:57:42PM -0700, Matt B wrote:
>
> Literally, someone should be able to make $7/hr with a stack of drives
> and the ability to just look at or listen to a server to determine
> which drive needs to be replaced.
>
> This means ZFS will need to be able to control the HDD status lights
> on the chassis for "look", but for "listen" ZFS could cause the server
> to beep, using one beep for the first slot, two beeps in rapid
> succession for the second slot. A sort of lame Morse code... no
> device integration on ZFS's part required.
>

This is part of ongoing work with Solaris platform integration (see my last blog post) and future ZFS/FMA work. We will eventually be leveraging IPMI and SES to manage physical indicators (i.e. LEDs) in response to Solaris events. It will take some time to reach this point, however.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Anton B. Rang
2007-Mar-23 05:37 UTC
[zfs-discuss] Re: Re: Proposal: ZFS hotplug support and autoconfiguration
> > Consider a server [with] three drives, A, B, and C, in which A and B
> > are mirrored and C is not. Pull out A, B, and C, and re-insert them
> > as A, C, and B. If B is slow to come up for some reason, ZFS will see
> > "C" in place of "B", and happily reformat it into a mirror of "A".
> > (Or am I reading this incorrectly?)
>
> Again, thanks to devids, the autoreplace code would not kick in here at
> all. You would end up with an identical pool.

Is this because C would already have a devid? If I insert an unlabeled disk, what happens? What if B takes five minutes to spin up? If it never does?

> > I hope that there's a way to disable the periodic probing of hot
> > spares. Spinning these drives up often might be highly annoying in
> > some environments (though useful in others, since it could also verify
> > that the disk is responding normally).
>
> Why is this "highly annoying"? The frequency would be rather low, it
> would have no effect on performance, and you're gaining the ability to
> know whether your hot spares are actually working.

Well, in my home office it would be highly annoying if I got to hear spin-up/spin-down sounds every half hour. The ability to tune the time interval would probably make this OK, though. I could live with once a day or once a week.

Anton
> > > Consider a server [with] three drives, A, B, and C, in which A and B
> > > are mirrored and C is not. Pull out A, B, and C, and re-insert them
> > > as A, C, and B. If B is slow to come up for some reason, ZFS will see
> > > "C" in place of "B", and happily reformat it into a mirror of "A".
> > > (Or am I reading this incorrectly?)
> >
> > Again, thanks to devids, the autoreplace code would not kick in here at
> > all. You would end up with an identical pool.
>
> Is this because C would already have a devid?

Well, it's because all the members of the ZFS pool have information about the pool and their place in it. The path of a member isn't important.

> If I insert an unlabeled disk, what happens?

Nothing. If ZFS can't find a signature on it, it knows it's not part of a ZFS pool.

> What if B takes five minutes to spin up?

That sounds like something for FMA to deal better with. It might hang for a period of time if the driver doesn't respond quickly.

> If it never does?

At some point the device driver needs to respond. If the device doesn't become ready, it'll have to time out and be noted as a failure.

--
Darren Dunham                                      ddunham at taos.com
Senior Technical Consultant       TAOS             http://www.taos.com/
Got some Dr Pepper?                                San Francisco, CA bay area
          < This line left intentionally blank to confuse you. >
Pawel Jakub Dawidek
2007-Mar-23 10:31 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Thu, Mar 22, 2007 at 08:39:55AM -0700, Eric Schrock wrote:
> Again, thanks to devids, the autoreplace code would not kick in here at
> all. You would end up with an identical pool.

Eric, maybe I'm missing something, but why does ZFS depend on devids at all? As I understand it, a devid is something that never changes for a block device, e.g. a disk serial number, but on the other hand it is optional, so we can rely on the fact it's always there (I mean for all block devices we use).

Why do we simply not forget about devids and just focus on on-disk metadata to detect pool components?

The only reason I see is performance. This is probably why /etc/zfs/zpool.cache is used as well.

In FreeBSD we have the GEOM infrastructure for storage. Each storage device (disk, partition, mirror, etc.) is simply a GEOM provider. If a GEOM provider appears (e.g. a disk is inserted, a partition is configured), all interested parties are informed about this and can 'taste' the provider by reading metadata specific to them. The same when a provider goes away - all interested parties are informed and can react accordingly.

We don't see any performance problems related to the fact that each disk that appears is read by many "GEOM classes".

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Pawel Jakub Dawidek
2007-Mar-23 10:42 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Fri, Mar 23, 2007 at 11:31:03AM +0100, Pawel Jakub Dawidek wrote:
> On Thu, Mar 22, 2007 at 08:39:55AM -0700, Eric Schrock wrote:
> > Again, thanks to devids, the autoreplace code would not kick in here at
> > all. You would end up with an identical pool.
>
> Eric, maybe I'm missing something, but why does ZFS depend on devids at
> all? As I understand it, a devid is something that never changes for a
> block device, e.g. a disk serial number, but on the other hand it is
> optional, so we can rely on the fact it's always there (I mean for all
> block devices we use).

s/can/can't/

> Why do we simply not forget about devids and just focus on on-disk
> metadata to detect pool components?
>
> The only reason I see is performance. This is probably why
> /etc/zfs/zpool.cache is used as well.
>
> In FreeBSD we have the GEOM infrastructure for storage. Each storage
> device (disk, partition, mirror, etc.) is simply a GEOM provider. If a
> GEOM provider appears (e.g. a disk is inserted, a partition is
> configured), all interested parties are informed about this and can
> 'taste' the provider by reading metadata specific to them. The same
> when a provider goes away - all interested parties are informed and can
> react accordingly.
>
> We don't see any performance problems related to the fact that each
> disk that appears is read by many "GEOM classes".

--
Pawel Jakub Dawidek                       http://www.wheel.pl
pjd at FreeBSD.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
Eric Schrock
2007-Mar-23 16:51 UTC
[zfs-discuss] Re: Proposal: ZFS hotplug support and autoconfiguration
On Fri, Mar 23, 2007 at 11:31:03AM +0100, Pawel Jakub Dawidek wrote:
>
> Eric, maybe I'm missing something, but why does ZFS depend on devids at
> all? As I understand it, a devid is something that never changes for a
> block device, e.g. a disk serial number, but on the other hand it is
> optional, so we can rely on the fact it's always there (I mean for all
> block devices we use).
>
> Why do we simply not forget about devids and just focus on on-disk
> metadata to detect pool components?
>
> The only reason I see is performance. This is probably why
> /etc/zfs/zpool.cache is used as well.
>
> In FreeBSD we have the GEOM infrastructure for storage. Each storage
> device (disk, partition, mirror, etc.) is simply a GEOM provider. If a
> GEOM provider appears (e.g. a disk is inserted, a partition is
> configured), all interested parties are informed about this and can
> 'taste' the provider by reading metadata specific to them. The same
> when a provider goes away - all interested parties are informed and can
> react accordingly.
>
> We don't see any performance problems related to the fact that each
> disk that appears is read by many "GEOM classes".

We do use the on-disk metadata for verification purposes, but we can't open the device based on the metadata. We don't have a corresponding interface in Solaris, so there is no way to say "open the device with this particular on-disk data". The devid is also unique to the device (it's based on manufacturer/model/serial number), so that we can uniquely identify devices for fault management purposes.

The world of hotplug and device configuration in Solaris is quite complicated. Part of my time spent on this work has been just writing down the existing semantics. A scheme like that in FreeBSD would be nice, but it is unlikely to appear given the existing complexity. As part of the I/O retire work we will likely be introducing device contracts, which is a step in the right direction, but it's a very long road.

Thanks for sharing the details on FreeBSD, it's quite interesting. Since the majority of this work is Solaris-specific, I'll be interested to see how other platforms deal with this type of reconfiguration.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Richard Elling
2007-Mar-23 19:52 UTC
[zfs-discuss] Re: Re: Proposal: ZFS hotplug support and autoconfiguration
Anton B. Rang wrote:
> Is this because C would already have a devid? If I insert an unlabeled
> disk, what happens? What if B takes five minutes to spin up? If it
> never does?

N.B. you get different error messages from the disk. If a disk is not ready, then it will return a not-ready code, and the sd driver will record this and patiently retry. The reason I know this in some detail is scar #523, which was inflicted when we realized that some/many/most RAID arrays don't do this. The difference is that the JBOD disk electronics start very quickly, perhaps a few seconds after power-on. A RAID array can take several minutes (or more) to get to a state where it will reply to any request. So, if you do not perform a full, simultaneous power-on test for your entire (cluster) system, then you may not hit the problem that the slow storage start makes Solaris think that the device doesn't exist -- which can be a bad thing for highly available services. Yes, this is yet another systems engineering problem.

 -- richard
Richard Elling
2007-Mar-23 20:06 UTC
[zfs-discuss] Re: Re: Proposal: ZFS hotplug support and autoconfiguration
workaround below... Richard Elling wrote:> Anton B. Rang wrote: >> Is this because C would already have a devid? If I insert an unlabeled >> disk, what happens? What if B takes five minutes to spin up? If it >> never does? > > N.B. You get different error messages from the disk. If a disk is not > ready > then it will return a not ready code and the sd driver will record this and > patiently retry. The reason I know this in some detail is scar #523, which > was inflicted when we realized that some/many/most RAID arrays don''t do this. > The difference is that the JBOD disk electronics start very quickly, perhaps > a few seconds after power-on. A RAID array can take several minutes (or more) > to get to a state where it will reply to any request. So, if you do not > perform a full, simultaneous power-on test for your entire (cluster) system, > then you may not hit the problem that the slow storage start makes Solaris > think that the device doesn''t exist -- which can be a bad thing for highly > available services. Yes, this is yet another systems engineering problem.Sorry, it was rude of me not to include the workaround. We put a delay in the SPARC OBP to slow down the power-on boot time of the servers to match the attached storage. While this worked, it is butugly. You can do this with GRUB, too. -- richard