Garrett D'Amore
2010-Jun-17 19:52 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
So I've been working on solving a problem we noticed when using certain hot-pluggable buses (think SAS/SATA hotplug here): removing a drive did not trigger any response from either FMA or ZFS *until* something tried to use that device. (Removing a drive can be thought of as simulating disk, bus, HBA, or cable failure.) This means that if you have an idle pool, you'll not find out about these failures until you do some I/O to the drive. For a hot spare, this may not occur until you actually do a scrub. That's really unfortunate. Note that I think the "disk-monitor.so" FMA module may solve this problem, but it seems to be tied to specific Oracle hardware containing certain IPMI features that are not necessarily general.

So, I've come up with a solution, which involves the creation of a new FMA module and a fix for a bug in the ZFS FMRI scheme module. I'd like thoughts here. (I'm happy to post the code as well; there is no reason this can't be pushed upstream as far as I or my employer are concerned.)

zfs-monitor.so is the module; it runs at a configurable interval (currently 10 seconds for my debugging). What it does is parse the ZFS configuration to identify all physical disks that are associated with ZFS vdevs. For each such device, if ZFS believes that the vdev is healthy (ONLINE, AVAIL, or even DEGRADED in the zpool status output, although it uses libzfs directly to get this), it opens the underlying raw device and attempts to read the first 512 bytes (one block) from the unit. If this works, the disk is presumed to be working, and we're done. For units that fail either the open() or the read(), we use libzfs to mark the vdev FAULTED (which will affect higher-level vdevs appropriately), and we post an FMA ereport (so that the ZFS diagnosis and retire modules can do their thing).

Of course, one side effect of this change is that disks are potentially spun up more frequently than they need to be, so it can have a negative impact on power savings. However, in theory, since we're only exchanging a single block, and always the same block, that data *ought* to be in cache. (This has a drawback as well: it means we might not find errors on the spinning platters themselves. But it is still far better, since it catches the more common problem of a drive that has gone completely off the bus or has been removed or accidentally disconnected.)

The one bug in the ZFS FMRI module that we had to fix was that it was failing to identify hot spare devices associated with a zpool, so nothing was happening for those spares, because of certain logic in the ZFS diagnosis module.

Anyway, I'm happy to share the code, and even go through the request-sponsor process to push this upstream. I would like the opinions of the ZFS and FMA teams though... is the approach I'm using sane, or have I missed some important design principle? Certainly it *seems* to work well on the systems I've tested, and we (Nexenta) think that it fixes what appears to us to be a critical deficiency in the ZFS error detection and handling. But I'd like to hear other thoughts.

- Garrett
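For concreteness, the per-disk liveness check at the heart of zfs-monitor.so boils down to something like the following minimal userland sketch. The device path shown is a placeholder, and the surrounding libzfs vdev iteration, FAULTED marking, and ereport posting are not shown:

    /*
     * Minimal sketch of the zfs-monitor probe described above: open the raw
     * device backing a vdev and read one 512-byte block from the start.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Return 0 if the device answered a one-block read, -1 otherwise.
     * "path" is the raw (character) device node for the vdev, e.g.
     * /dev/rdsk/c0t0d0s0 -- a placeholder, not taken from any real config.
     */
    static int
    probe_disk(const char *path)
    {
        char block[512];
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return (-1);    /* open() failed: treat the disk as missing */

        /* Always read block 0, so repeated probes hit the same cached data. */
        if (pread(fd, block, sizeof (block), 0) != (ssize_t)sizeof (block)) {
            (void) close(fd);
            return (-1);    /* read() failed: treat the disk as missing */
        }

        (void) close(fd);
        return (0);
    }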
Eric Schrock
2010-Jun-17 20:16 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Jun 17, 2010, at 3:52 PM, Garrett D'Amore wrote:
>
> Anyway, I'm happy to share the code, and even go through the
> request-sponsor process to push this upstream. I would like the
> opinions of the ZFS and FMA teams though... is the approach I'm using
> sane, or have I missed some important design principle? Certainly it
> *seems* to work well on the systems I've tested, and we (Nexenta) think
> that it fixes what appears to us to be a critical deficiency in the ZFS
> error detection and handling. But I'd like to hear other thoughts.

I don't think this is the right approach. You'll end up faulting drives that should be marked removed, among other things. The correct answer is for drivers to use the new LDI events (LDI_EV_DEVICE_REMOVE) to indicate device removal. This is already in ON but not hooked up to ZFS. It's easy to do, but just hasn't been pushed yet.

Note that for legacy drivers, the DKIOCGETSTATE ioctl() is supposed to handle this for you. However, there is a race condition: if the vdev probe happens before the driver has transitioned to a state where it returns DEV_GONE, then we can miss this event (because the state is only probed in reaction to I/O failure, and we won't try again). We spent some time looking at ways to eliminate this window, but it ultimately got quite ugly and doesn't support hot spares, so the better answer was to just properly support the LDI events.

If you wanted to expand support for legacy drivers, you should expand use of the DKIOCGETSTATE ioctl(), perhaps with an async task that probes spares, as well as a delayed timer (within the bounds of the zfs-diagnosis resource.removed horizon) to close the associated window for normal vdevs. However, a better solution would be to update the drivers that matter to use LDI_EV_DEVICE_REMOVE, which provides much crisper semantics and will be used in the future to hook into other subsystems.

In order for anything to be accepted upstream, it's key that it be able to distinguish between REMOVED and FAULTED devices. Misdiagnosing a removed drive as faulted is very bad (fault = broken hardware = service call = $$$).

- Eric

P.S. The bug in the ZFS scheme module is legit; we just haven't fixed it yet.

--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
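For reference, hooking a kernel consumer into that event looks roughly like the sketch below. This is an illustration only: the callback and function names are invented, the finalize body is a stub, and the ldi_ev_* signatures and LDI_EV_CB_VERS should be verified against <sys/sunldi.h> before relying on them.

    #include <sys/types.h>
    #include <sys/errno.h>
    #include <sys/cmn_err.h>
    #include <sys/sunldi.h>

    /* Finalize callback: the device behind the LDI handle is gone. */
    static void
    my_dev_remove_finalize(ldi_handle_t lh, ldi_ev_cookie_t cookie,
        int ldi_result, void *arg, void *ev_data)
    {
        /* A consumer such as ZFS would mark the vdev REMOVED here, not FAULTED. */
        cmn_err(CE_NOTE, "device removed behind LDI handle");
    }

    static ldi_ev_callback_t my_remove_callb = {
        LDI_EV_CB_VERS,             /* cb_vers */
        NULL,                       /* cb_notify: nothing to veto */
        my_dev_remove_finalize      /* cb_finalize */
    };

    /* Register for LDI_EV_DEVICE_REMOVE on an already-open LDI handle. */
    static int
    register_remove_event(ldi_handle_t lh, void *arg, ldi_callback_id_t *idp)
    {
        ldi_ev_cookie_t cookie;

        if (ldi_ev_get_cookie(lh, LDI_EV_DEVICE_REMOVE, &cookie) != LDI_EV_SUCCESS)
            return (ENOTSUP);

        return (ldi_ev_register_callbacks(lh, cookie, &my_remove_callb,
            arg, idp));
    }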
Garrett D'Amore
2010-Jun-17 20:35 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Thu, 2010-06-17 at 16:16 -0400, Eric Schrock wrote:
> On Jun 17, 2010, at 3:52 PM, Garrett D'Amore wrote:
> >
> > Anyway, I'm happy to share the code, and even go through the
> > request-sponsor process to push this upstream. I would like the
> > opinions of the ZFS and FMA teams though... is the approach I'm using
> > sane, or have I missed some important design principle? Certainly it
> > *seems* to work well on the systems I've tested, and we (Nexenta) think
> > that it fixes what appears to us to be a critical deficiency in the ZFS
> > error detection and handling. But I'd like to hear other thoughts.
>
> I don't think this is the right approach. You'll end up faulting drives that should be marked removed, among other things. The correct answer is for drivers to use the new LDI events (LDI_EV_DEVICE_REMOVE) to indicate device removal. This is already in ON but not hooked up to ZFS. It's easy to do, but just hasn't been pushed yet.
>
> Note that for legacy drivers, the DKIOCGETSTATE ioctl() is supposed to handle this for you. However, there is a race condition: if the vdev probe happens before the driver has transitioned to a state where it returns DEV_GONE, then we can miss this event (because the state is only probed in reaction to I/O failure, and we won't try again). We spent some time looking at ways to eliminate this window, but it ultimately got quite ugly and doesn't support hot spares, so the better answer was to just properly support the LDI events.

I actually started with DKIOCGSTATE as my first approach, modifying sd.c. But I had problems, because what I found is that nothing was issuing this ioctl properly except for removable/hotpluggable media, and the SAS/SATA controllers/frameworks are not indicating this. I tried overriding that in sd.c, but I still found another bug: the HAL module that does the monitoring does not monitor devices that are present and in use (mounted filesystems) during boot. I think HAL was designed for removable media that would not be automatically mounted by ZFS during boot. I didn't analyze this further.

Furthermore, for the devices that did work, the report was "device administratively removed", which is *far* different from a device that has unexpectedly gone offline. There was no way to distinguish a device removed via cfgadm from a device that went away unexpectedly.

> If you wanted to expand support for legacy drivers, you should expand use of the DKIOCGETSTATE ioctl(), perhaps with an async task that probes spares, as well as a delayed timer (within the bounds of the zfs-diagnosis resource.removed horizon) to close the associated window for normal vdevs. However, a better solution would be to update the drivers that matter to use LDI_EV_DEVICE_REMOVE, which provides much crisper semantics and will be used in the future to hook into other subsystems.

Is "sd.c" considered a legacy driver? It's what is responsible for the vast majority of disks. That said, perhaps the problem is the HBA drivers?

> In order for anything to be accepted upstream, it's key that it be able to distinguish between REMOVED and FAULTED devices. Misdiagnosing a removed drive as faulted is very bad (fault = broken hardware = service call = $$$).

So how do we distinguish "removed on purpose" from "removed by accident, faulted cable, or other non-administrative issue"? I presume that a removal initiated via cfgadm or some other tool could put the ZFS vdev into an offline state, and this would prevent the logic from accidentally marking the device FAULTED. (Ideally it would also mark the device "REMOVED".) Put another way, if a libzfs command is used to offline or remove the device from the pool, then none of the code I've written is engaged. What I've done is supply code to handle "surprise" removal or disconnect.

> - Eric
>
> P.S. the bug in the ZFS scheme module is legit, we just haven't fixed it yet

I can send the diffs for that fix... they're small and obvious enough.

- Garrett

> --
> Eric Schrock, Fishworks http://blogs.sun.com/eschrock
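A sketch of the "don't touch administratively offlined or removed vdevs" filter implied here, for illustration only. The helper name is mine, and the ZPOOL_CONFIG_VDEV_STATS key plus the vdev_stat_t layout are assumptions that should be checked against the sys/fs/zfs.h in use:

    #include <libzfs.h>

    /*
     * Decide whether the monitor should probe a vdev at all.  "nv" is the
     * vdev's nvlist taken from the pool configuration.  Anything already
     * OFFLINE, REMOVED, or FAULTED is left alone.
     */
    static boolean_t
    vdev_worth_probing(nvlist_t *nv)
    {
        vdev_stat_t *vs;
        uint_t c;

        if (nvlist_lookup_uint64_array(nv, ZPOOL_CONFIG_VDEV_STATS,
            (uint64_t **)&vs, &c) != 0)
            return (B_FALSE);

        switch (vs->vs_state) {
        case VDEV_STATE_HEALTHY:     /* shows as ONLINE or AVAIL */
        case VDEV_STATE_DEGRADED:
            return (B_TRUE);
        default:                     /* OFFLINE, REMOVED, FAULTED, ... */
            return (B_FALSE);
        }
    }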
Eric Schrock
2010-Jun-17 21:53 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Jun 17, 2010, at 4:35 PM, Garrett D'Amore wrote:
>
> I actually started with DKIOCGSTATE as my first approach, modifying
> sd.c. But I had problems, because what I found is that nothing was
> issuing this ioctl properly except for removable/hotpluggable media, and
> the SAS/SATA controllers/frameworks are not indicating this. I tried
> overriding that in sd.c, but I still found another bug: the HAL module
> that does the monitoring does not monitor devices that are present and
> in use (mounted filesystems) during boot. I think HAL was designed for
> removable media that would not be automatically mounted by ZFS during
> boot. I didn't analyze this further.

ZFS issues the ioctl() from vdev_disk.c. It is up to the HBA drivers to correctly represent the DEV_GONE state (and this is known to work with a variety of SATA drivers).

> Is "sd.c" considered a legacy driver? It's what is responsible for the
> vast majority of disks. That said, perhaps the problem is the HBA
> drivers?

It's the HBA drivers.

> So how do we distinguish "removed on purpose" from "removed by
> accident, faulted cable, or other non-administrative issue"? I presume
> that a removal initiated via cfgadm or some other tool could put the ZFS
> vdev into an offline state, and this would prevent the logic from
> accidentally marking the device FAULTED. (Ideally it would also mark
> the device "REMOVED".)

If there is no physical connection (detected to the best of the driver's ability), then it is removed (REMOVED is different from OFFLINE). Surprise device removal is not a fault - Solaris is designed to support removal of disks at any time without administrative intervention. A fault is defined as broken hardware, which is not the case for a removed device.

There are projects underway to (a) generate faults for devices that are physically present but unable to attach, and (b) provide topology-based diagnosis to detect bad cables, expanders, etc. This is a complicated problem and not always tractable, but it can be solved reasonably well for modern systems and transports.

A completely orthogonal feature is the ability to represent extended periods of device removal as a defect. While removing a disk is not itself a defect, leaving your pool running minus one disk for hours/days/weeks is clearly broken.

If you have a solution that correctly detects devices as REMOVED for a new class of HBAs/drivers, that'd be more than welcome. If you choose to represent missing devices as faulted in your own third-party system, that's your own prerogative, but it's not the current Solaris FMA model.

Hope that helps,

- Eric

--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
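The check Eric refers to has roughly the following shape (a from-memory sketch modeled on the post-I/O-failure probe in vdev_disk.c; the helper name is mine, the ioctl is spelled DKIOCSTATE in <sys/dkio.h>, and the exact flags and context should be verified against the ON sources):

    #include <sys/types.h>
    #include <sys/file.h>
    #include <sys/cred.h>
    #include <sys/dkio.h>
    #include <sys/sunldi.h>

    /*
     * Ask the driver behind an LDI handle whether the device is still
     * present.  Illustrative only.
     */
    static boolean_t
    disk_is_gone(ldi_handle_t lh)
    {
        int state = DKIO_NONE;

        /*
         * A driver that supports the dkio state interface reports
         * DKIO_INSERTED while the device is present; anything else
         * (e.g. DKIO_DEV_GONE) means the device has disappeared.
         */
        if (ldi_ioctl(lh, DKIOCSTATE, (intptr_t)&state,
            FKIOCTL, kcred, NULL) == 0 && state != DKIO_INSERTED) {
            return (B_TRUE);
        }

        return (B_FALSE);
    }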
Garrett D'Amore
2010-Jun-17 22:13 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Thu, 2010-06-17 at 17:53 -0400, Eric Schrock wrote:
> On Jun 17, 2010, at 4:35 PM, Garrett D'Amore wrote:
> >
> > I actually started with DKIOCGSTATE as my first approach, modifying
> > sd.c. But I had problems, because what I found is that nothing was
> > issuing this ioctl properly except for removable/hotpluggable media, and
> > the SAS/SATA controllers/frameworks are not indicating this. I tried
> > overriding that in sd.c, but I still found another bug: the HAL module
> > that does the monitoring does not monitor devices that are present and
> > in use (mounted filesystems) during boot. I think HAL was designed for
> > removable media that would not be automatically mounted by ZFS during
> > boot. I didn't analyze this further.
>
> ZFS issues the ioctl() from vdev_disk.c. It is up to the HBA drivers to correctly represent the DEV_GONE state (and this is known to work with a variety of SATA drivers).

So maybe the problem is the SAS adapters I'm dealing with (LSI). This is not an imaginary problem -- if there is a better (more correct) solution, then I'd like to use it. Right now it probably is not reasonable for me to fix every HBA driver (I can't, as I don't have source code to a number of them). Actually, the problem *might* be the MPXIO vHCI layer....

> > Is "sd.c" considered a legacy driver? It's what is responsible for the
> > vast majority of disks. That said, perhaps the problem is the HBA
> > drivers?
>
> It's the HBA drivers.

Ah, so you need to see CMD_DEV_GONE at the transport layer. Interesting. I don't think the SAS drivers are doing this.

> > So how do we distinguish "removed on purpose" from "removed by
> > accident, faulted cable, or other non-administrative issue"? I presume
> > that a removal initiated via cfgadm or some other tool could put the ZFS
> > vdev into an offline state, and this would prevent the logic from
> > accidentally marking the device FAULTED. (Ideally it would also mark
> > the device "REMOVED".)
>
> If there is no physical connection (detected to the best of the driver's ability), then it is removed (REMOVED is different from OFFLINE). Surprise device removal is not a fault - Solaris is designed to support removal of disks at any time without administrative intervention. A fault is defined as broken hardware, which is not the case for a removed device.

So how do you diagnose the situation where someone trips over a cable, or where the drive was bumped and detached from the cable? I guess I'm OK with the idea that these are in a REMOVED state, but I'd like the messaging to say something besides "the administrator has removed the device" or somesuch (which is what it says now). Clearly that's not what happened.

It gets more interesting with other kinds of transports. For example, iSCSI or some other transport (I worked with ATA-over-Ethernet at one point) -- in that case, if the remote node goes belly up, or the network is lost, it's clearly not the case that the device was "removed". The situation here is a device that you can't talk to. I'd argue it's a FAULT. For buses like 1394 or USB, where the typical use is at a desktop where folks just plug in/out all the time, I don't see this as such a problem. But for enterprise-grade storage, I have higher expectations.

> There are projects underway to (a) generate faults for devices that are physically present but unable to attach, and (b) provide topology-based diagnosis to detect bad cables, expanders, etc. This is a complicated problem and not always tractable, but it can be solved reasonably well for modern systems and transports.

I think there are a significant number of cases where you can't tell the difference between a unit dying, a bad cable, and a disconnected cable. With some special magnetics, you might be able to use time-domain reflectometry to diagnose things, but this requires unusual hardware and is clearly outside the normal scope of things we're dealing with. (Interestingly, some Ethernet PHYs have this capability.)

> A completely orthogonal feature is the ability to represent extended periods of device removal as a defect. While removing a disk is not itself a defect, leaving your pool running minus one disk for hours/days/weeks is clearly broken.

Agreed that this is orthogonal. I'd say this is best handled via stronger handling of the DEGRADED state.

> If you have a solution that correctly detects devices as REMOVED for a new class of HBAs/drivers, that'd be more than welcome. If you choose to represent missing devices as faulted in your own third-party system, that's your own prerogative, but it's not the current Solaris FMA model.

I can certainly flag the device as REMOVED rather than FAULTED, although that will require some extra changes to libzfs, I think. (A new zpool_vdev_removed or vdev_unreachable function or somesuch.) My point here is that I'm willing to refine the work so that it helps folks. What's important to my mind is two things:

a) when a unit is removed, a spare is recruited to replace it if one is
available (i.e., zfs-retire needs to work);

b) ideally, this should be logged/handled in some manner asynchronously,
so that if such an event has occurred, it does not come as a surprise to
the administrator 2 weeks after the fact when the *2nd* unit dies or is
removed.

It's that last point "b" that makes me feel less good about "REMOVED". The current code seems to assume that removal is always intentional, and therefore no further notification is needed. But when a disk stops answering SCSI commands, it may indicate an unplanned device failure.

One other thought -- I think ZFS should handle this in a manner such that the behavior appears to the administrator to be the same, regardless of whether I/O was occurring on the unit or not. An interesting question is what happens if I yank a drive while there are outstanding commands pending? Those commands should time out at the HBA, but will it report them as CMD_DEV_GONE, or will it report an error causing a fault to be flagged?

- Garrett

> - Eric
>
> --
> Eric Schrock, Fishworks http://blogs.sun.com/eschrock
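To make the libzfs point concrete, the decision might look like the sketch below. zpool_vdev_fault() is an existing libzfs call (the vdev_aux_t argument shown is an assumption about the library version in use), while zpool_vdev_removed() is the hypothetical new entry point discussed above and does not exist today:

    #include <libzfs.h>

    /*
     * Sketch of how the monitor might react to a missing disk.
     * zpool_vdev_removed() is hypothetical; today only faulting is possible
     * from libzfs, which is exactly the REMOVED-vs-FAULTED problem above.
     */
    static void
    mark_vdev_gone(zpool_handle_t *zhp, uint64_t vdev_guid, boolean_t surprise)
    {
        if (!surprise) {
            /*
             * Administrative offline/remove (cfgadm, zpool offline):
             * the monitor stays out of the way entirely.
             */
            return;
        }

        /*
         * Surprise disconnect.  Today this faults the vdev; per the
         * discussion it should arguably become REMOVED instead, via
         * something like zpool_vdev_removed(zhp, vdev_guid).
         */
        (void) zpool_vdev_fault(zhp, vdev_guid, VDEV_AUX_EXTERNAL);
    }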
Eric Schrock
2010-Jun-17 22:38 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Jun 17, 2010, at 6:13 PM, Garrett D'Amore wrote:
>
> So how do you diagnose the situation where someone trips over a cable,
> or where the drive was bumped and detached from the cable? I guess I'm
> OK with the idea that these are in a REMOVED state, but I'd like the
> messaging to say something besides "the administrator has removed the
> device" or somesuch (which is what it says now). Clearly that's not
> what happened.

Are you requesting that we diagnose the difference between tripping over a cable and intentionally unplugging it? That's clearly beyond any software's ability to diagnose.

On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (a generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.

> a) when a unit is removed, a spare is recruited to replace it if one is
> available (i.e., zfs-retire needs to work);

This is handled by the REMOVED state, as zfs-retire subscribes to resource.removed.

> b) ideally, this should be logged/handled in some manner asynchronously,
> so that if such an event has occurred, it does not come as a surprise to
> the administrator 2 weeks after the fact when the *2nd* unit dies or is
> removed.

These are logged as alerts in the SS7000. The first-class notion of a Solaris alert is not new, and has been proposed in the past as part of FMA work. The FMA team is currently working on a project that will introduce some of the underlying infrastructure for formalized alerts in Solaris. These events (the primitives are not called alerts) represent formalized things of interest that are not directly related to a fault or defect. That, along with the ability to diagnose a defect over extended periods of removal, is the correct way to represent this situation.

> It's that last point "b" that makes me feel less good about "REMOVED".
> The current code seems to assume that removal is always intentional, and
> therefore no further notification is needed. But when a disk stops
> answering SCSI commands, it may indicate an unplanned device failure.

There are many, many failure modes that can be distinguished just fine from physical device removal. For example, you can have a PHY up but the attached device completely unresponsive, yet you know there is a device there. Or you can look at the SES data to determine physical presence. Converting all hotplug events into faults is too broad a brush here.

> One other thought -- I think ZFS should handle this in a manner such
> that the behavior appears to the administrator to be the same,
> regardless of whether I/O was occurring on the unit or not.
>
> An interesting question is what happens if I yank a drive while there
> are outstanding commands pending? Those commands should time out at the
> HBA, but will it report them as CMD_DEV_GONE, or will it report an error
> causing a fault to be flagged?

This is detected as device removal. There is a timeout associated with I/O errors in zfs-diagnosis that gives some grace period to detect removal before declaring a disk faulted.

- Eric

--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
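As background on the retire plumbing mentioned here: an fmd module subscribes to these events along the lines of the skeleton below. This is a sketch only; the event class string, the ops-table layout, and the FMD_API_VERSION usage should be checked against <fm/fmd_api.h> and the existing zfs-retire module rather than taken as-is.

    #include <fm/fmd_api.h>

    /* Called by fmd for each event matching our subscriptions. */
    static void
    zfsmon_recv(fmd_hdl_t *hdl, fmd_event_t *ep, nvlist_t *nvl, const char *class)
    {
        fmd_hdl_debug(hdl, "received event of class %s", class);
    }

    static const fmd_hdl_ops_t fmd_ops = {
        zfsmon_recv,    /* fmdo_recv */
        NULL,           /* fmdo_timeout */
        NULL,           /* fmdo_close */
        NULL,           /* fmdo_stats */
        NULL            /* fmdo_gc */
    };

    static const fmd_hdl_info_t fmd_info = {
        "zfs-monitor", "0.1", &fmd_ops, NULL
    };

    void
    _fmd_init(fmd_hdl_t *hdl)
    {
        if (fmd_hdl_register(hdl, FMD_API_VERSION, &fmd_info) != 0)
            return;

        /* Class string is an assumption; check the zfs-retire sources. */
        fmd_hdl_subscribe(hdl, "resource.fs.zfs.removed");
    }

    void
    _fmd_fini(fmd_hdl_t *hdl)
    {
    }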
Garrett D'Amore
2010-Jun-17 23:18 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote:
> On Jun 17, 2010, at 6:13 PM, Garrett D'Amore wrote:
> >
> > So how do you diagnose the situation where someone trips over a cable,
> > or where the drive was bumped and detached from the cable? I guess I'm
> > OK with the idea that these are in a REMOVED state, but I'd like the
> > messaging to say something besides "the administrator has removed the
> > device" or somesuch (which is what it says now). Clearly that's not
> > what happened.
>
> Are you requesting that we diagnose the difference between tripping over a cable and intentionally unplugging it? That's clearly beyond any software's ability to diagnose.

I guess it depends. If you have a way to indicate intent -- such as by issuing a command first -- then you can diagnose it, can't you? (I thought this was what cfgadm was all about.) Perhaps the model here is that nobody ever needs to issue such commands -- that it's reasonable to go around yanking drives from systems in the datacenter willy-nilly. I hope not.

> On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (a generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.
>
> > a) when a unit is removed, a spare is recruited to replace it if one is
> > available (i.e., zfs-retire needs to work);
>
> This is handled by the REMOVED state, as zfs-retire subscribes to resource.removed.

Yes, I saw that.

> > b) ideally, this should be logged/handled in some manner asynchronously,
> > so that if such an event has occurred, it does not come as a surprise to
> > the administrator 2 weeks after the fact when the *2nd* unit dies or is
> > removed.
>
> These are logged as alerts in the SS7000. The first-class notion of a Solaris alert is not new, and has been proposed in the past as part of FMA work. The FMA team is currently working on a project that will introduce some of the underlying infrastructure for formalized alerts in Solaris. These events (the primitives are not called alerts) represent formalized things of interest that are not directly related to a fault or defect. That, along with the ability to diagnose a defect over extended periods of removal, is the correct way to represent this situation.
>
> > It's that last point "b" that makes me feel less good about "REMOVED".
> > The current code seems to assume that removal is always intentional, and
> > therefore no further notification is needed. But when a disk stops
> > answering SCSI commands, it may indicate an unplanned device failure.
>
> There are many, many failure modes that can be distinguished just fine from physical device removal. For example, you can have a PHY up but the attached device completely unresponsive, yet you know there is a device there. Or you can look at the SES data to determine physical presence. Converting all hotplug events into faults is too broad a brush here.

Many of these failure modes depend on having a suitable enclosure. While this may be fine for the SS7000, there are other users of ZFS that don't have that ability.

I guess the fact that the SS7000 code isn't kept up to date in ON means that we may wind up having to do our own thing here... it's a bit unfortunate, but OK.

The point is that, for now, we have a real problem: devices that fail in any of a number of ways don't have *any* indication reported about the failure. So *that* is what we need to fix.

> > One other thought -- I think ZFS should handle this in a manner such
> > that the behavior appears to the administrator to be the same,
> > regardless of whether I/O was occurring on the unit or not.
> >
> > An interesting question is what happens if I yank a drive while there
> > are outstanding commands pending? Those commands should time out at the
> > HBA, but will it report them as CMD_DEV_GONE, or will it report an error
> > causing a fault to be flagged?
>
> This is detected as device removal. There is a timeout associated with I/O errors in zfs-diagnosis that gives some grace period to detect removal before declaring a disk faulted.

Ok.

- Garrett

> - Eric
>
> --
> Eric Schrock, Fishworks http://blogs.sun.com/eschrock
Robert Milkowski
2010-Jun-18 08:56 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On 18/06/2010 00:18, Garrett D'Amore wrote:
> On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote:
>
>> On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (a generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.
>
[...]
> I guess the fact that the SS7000 code isn't kept up to date in ON means
> that we may wind up having to do our own thing here... it's a bit
> unfortunate, but OK.

Eric - is it a business decision that the discussed code is not in ON, or do you actually intend to get it integrated into ON? Because if you do, then I think that getting the Nexenta guys to expand on it would be better for everyone, instead of having them reinvent the wheel...

--
Robert Milkowski
http://milek.blogspot.com
Eric Schrock
2010-Jun-18 13:07 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Jun 18, 2010, at 4:56 AM, Robert Milkowski wrote:
> On 18/06/2010 00:18, Garrett D'Amore wrote:
>> On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote:
>>
>>> On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (a generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.
>>
> [...]
>> I guess the fact that the SS7000 code isn't kept up to date in ON means
>> that we may wind up having to do our own thing here... it's a bit
>> unfortunate, but OK.
>
> Eric - is it a business decision that the discussed code is not in ON, or do you actually intend to get it integrated into ON? Because if you do, then I think that getting the Nexenta guys to expand on it would be better for everyone, instead of having them reinvent the wheel...

Limited bandwidth.

- Eric

--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
Garrett D'Amore
2010-Jun-18 16:26 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Fri, 2010-06-18 at 09:07 -0400, Eric Schrock wrote:
> On Jun 18, 2010, at 4:56 AM, Robert Milkowski wrote:
>
> > On 18/06/2010 00:18, Garrett D'Amore wrote:
> >> On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote:
> >>
> >>> On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (a generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.
> >>
> > [...]
> >> I guess the fact that the SS7000 code isn't kept up to date in ON means
> >> that we may wind up having to do our own thing here... it's a bit
> >> unfortunate, but OK.
> >
> > Eric - is it a business decision that the discussed code is not in ON, or do you actually intend to get it integrated into ON? Because if you do, then I think that getting the Nexenta guys to expand on it would be better for everyone, instead of having them reinvent the wheel...
>
> Limited bandwidth.

Is there anything I can do to help? In my opinion, it's better if we can use solutions in the underlying ON code that everyone agrees with and that are available to everyone.

At the end of the day, though, we'll do whatever is required to make sure that the problems our customers face are solved -- at least in our distro. We'd rather have shared common code for this, but if we have to implement our own bits, we will do so.

-- Garrett

> - Eric
>
> --
> Eric Schrock, Fishworks http://blogs.sun.com/eschrock
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Garrett D'Amore
2010-Jun-19 05:59 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
Btw, I filed a bug (bugs.os.o) for the ZFS FMRI scheme and included a suggested fix in the description. I don't have a CR number for it yet. It's possible that this should go through the request-sponsor process. Once I have the CR number, I'll be happy to follow up with it. (It would be nice if Nexenta were to get credit for the fix.)

- Garrett

On Fri, 2010-06-18 at 09:26 -0700, Garrett D'Amore wrote:
> On Fri, 2010-06-18 at 09:07 -0400, Eric Schrock wrote:
> > On Jun 18, 2010, at 4:56 AM, Robert Milkowski wrote:
> >
> > > On 18/06/2010 00:18, Garrett D'Amore wrote:
> > >> On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote:
> > >>
> > >>> On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (a generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.
> > >>
> > > [...]
> > >> I guess the fact that the SS7000 code isn't kept up to date in ON means
> > >> that we may wind up having to do our own thing here... it's a bit
> > >> unfortunate, but OK.
> > >
> > > Eric - is it a business decision that the discussed code is not in ON, or do you actually intend to get it integrated into ON? Because if you do, then I think that getting the Nexenta guys to expand on it would be better for everyone, instead of having them reinvent the wheel...
> >
> > Limited bandwidth.
>
> Is there anything I can do to help? In my opinion, it's better if we can
> use solutions in the underlying ON code that everyone agrees with and
> that are available to everyone.
>
> At the end of the day, though, we'll do whatever is required to make sure
> that the problems our customers face are solved -- at least in our
> distro. We'd rather have shared common code for this, but if we have to
> implement our own bits, we will do so.
>
> -- Garrett
>
> > - Eric
> >
> > --
> > Eric Schrock, Fishworks http://blogs.sun.com/eschrock
> >
> > _______________________________________________
> > zfs-discuss mailing list
> > zfs-discuss at opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss