Garrett D'Amore
2010-Jun-17  19:52 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
So I've been working on solving a problem we noticed: when using certain hot-pluggable busses (think SAS/SATA hotplug here), removing a drive did not trigger any response from either FMA or ZFS *until* something tried to use that device. (This removal of a drive can be thought of as simulating either disk, bus, HBA, or cable failure.)

This means that if you have an idle pool, you'll not find out about these failures until you do some I/O to the drive. For a hot spare, this may not occur until you actually do a scrub. That's really unfortunate.

Note that I think the "disk-monitor.so" FMA module may solve this problem, but it seems to be tied to specific Oracle hardware containing certain IPMI features that are not necessarily general.

So, I've come up with a solution, which involves the creation of a new FMA module, and a fix for a bug in the ZFS FMRI scheme module. I'd like thoughts here. (I'm happy to post the code as well; there is no reason this can't be pushed upstream as far as I or my employer are concerned.)

zfs-monitor.so is the module; it runs at a configurable interval (currently 10 seconds for my debugging). What it does is parse the ZFS configuration to identify all physical disks that are associated with ZFS vdevs. For each such device, if ZFS believes that the vdev is healthy (ONLINE, AVAIL, or even DEGRADED in the zpool status output, although it uses libzfs directly to get this), it opens the underlying raw device and attempts to read the first 512 bytes (one block) from the unit. If this works, then the disk is presumed to be working, and we're done.

For units that fail either the open() or read(), we use libzfs to mark the vdev FAULTED (which will impact higher level vdevs appropriately), and we post an FMA ereport (so that the ZFS diagnosis and retire modules can do their thing).

Of course, one side effect of this change is that disks are potentially spun up too frequently, even if they need not be, so it can have a negative impact on power savings. However, in theory, since we're only exchanging a single block, and always the same block, that data *ought* to be in cache. (This has a drawback as well though -- it means we might not find errors on the spinning platters themselves. But still it's far better, since it catches the more common problem of a drive that has either gone completely off the bus or has been removed or accidentally disconnected.)

The one bug in the ZFS FMRI module that we had to fix was that it was failing to identify hot spare devices associated with a zpool, so nothing was happening for those spares, because of certain logic in the ZFS diagnosis module.

Anyway, I'm happy to share the code, and even go through the request-sponsor process to push this upstream. I would like the opinions of the ZFS and FMA teams though... is the approach I'm using sane, or have I missed some important design principle? Certainly it *seems* to work well on the systems I've tested, and we (Nexenta) think that it fixes what appears to us to be a critical deficiency in the ZFS error detection and handling. But I'd like to hear other thoughts.

- Garrett
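[Editor's sketch] The probe step described in this message (open the raw device, read the first 512 bytes, treat any failure as a problem) can be illustrated roughly as follows. This is a minimal model in Python, not the zfs-monitor.so code, which was not posted in this thread; the names probe_device and probe_all and the `healthy` mapping are hypothetical, and the real module would get vdev health from libzfs rather than from a dictionary.

```python
import os

PROBE_BYTES = 512  # one block, as described in the message above


def probe_device(path, nbytes=PROBE_BYTES):
    """Try to open `path` and read `nbytes` from its start.

    Returns (ok, reason). Any open or read failure, or a short read,
    counts as a failed probe.
    """
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as err:
        return False, "open failed: %s" % err
    try:
        data = os.read(fd, nbytes)
    except OSError as err:
        return False, "read failed: %s" % err
    finally:
        os.close(fd)
    if len(data) < nbytes:
        return False, "short read: %d of %d bytes" % (len(data), nbytes)
    return True, "ok"


def probe_all(paths, healthy):
    """Probe only the devices the pool still believes are healthy.

    `healthy` is a hypothetical mapping of device path -> bool standing in
    for the vdev state lookup. Returns the paths whose probe failed.
    """
    failed = []
    for path in paths:
        if not healthy.get(path, False):
            continue  # already known bad or offline; nothing to detect
        ok, _reason = probe_device(path)
        if not ok:
            failed.append(path)
    return failed
```

As the message notes, reading the same block every time means the drive may answer from its cache, so a probe like this detects a device that has dropped off the bus rather than media errors.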
Eric Schrock
2010-Jun-17  20:16 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Jun 17, 2010, at 3:52 PM, Garrett D'Amore wrote:

> Anyway, I'm happy to share the code, and even go through the request-sponsor process to push this upstream. I would like the opinions of the ZFS and FMA teams though... is the approach I'm using sane, or have I missed some important design principle? Certainly it *seems* to work well on the systems I've tested, and we (Nexenta) think that it fixes what appears to us to be a critical deficiency in the ZFS error detection and handling. But I'd like to hear other thoughts.

I don't think this is the right approach. You'll end up faulting drives that should be marked removed, among other things. The correct answer is for drivers to use the new LDI events (LDI_EV_DEVICE_REMOVE) to indicate device removal. This is already in ON but not hooked up to ZFS. It's easy to do, but just hasn't been pushed yet.

Note that for legacy drivers, the DKIOCGETSTATE ioctl() is supposed to handle this for you. However, there is a race condition: if the vdev probe happens before the driver has transitioned to a state where it returns DEV_GONE, then we can miss this event (because it is only probed in reaction to I/O failure and we won't try again). We spent some time looking at ways to eliminate this window, but it ultimately got quite ugly and doesn't support hot spares, so the better answer was to just properly support the LDI events.

If you wanted to expand support for legacy drivers, you should expand use of the DKIOCGETSTATE ioctl(), perhaps with an async task that probes spares, as well as a delayed timer (within the bounds of the zfs-diagnosis resource.removed horizon) to close the associated window for normal vdevs. However, a better solution would be to update the drivers that matter to use LDI_EV_DEVICE_REMOVE, which provides much crisper semantics and will be used in the future to hook into other subsystems.

In order for anything to be accepted upstream, it's key that it be able to distinguish between REMOVED and FAULTED devices. Mis-diagnosing a removed drive as faulted is very bad (fault = broken hardware = service call = $$$).

- Eric

P.S. the bug in the ZFS scheme module is legit, we just haven't fixed it yet

--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
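[Editor's sketch] The "delayed timer" idea in this message, re-checking the device state for a bounded period so that a driver which has not yet transitioned to the gone state is not missed, might look roughly like this. It is a toy model only: `query_state` is a hypothetical callable standing in for the state ioctl, the strings "inserted" and "gone" are placeholders, and `horizon_s` stands in for the zfs-diagnosis horizon mentioned above, whose actual value is not given in the thread.

```python
import time


def wait_for_gone(query_state, horizon_s, interval_s, sleep=time.sleep):
    """Re-query a device's state until it reports gone or the horizon expires.

    query_state: hypothetical callable returning "inserted" or "gone".
    horizon_s:   how long to keep re-checking before giving up.
    interval_s:  delay between checks.
    Returns True if the device was seen as gone within the horizon.
    """
    waited = 0.0
    while True:
        if query_state() == "gone":
            return True
        if waited >= horizon_s:
            return False  # still looks present; removal was not observed
        sleep(interval_s)
        waited += interval_s
```

A single check right after an I/O failure corresponds to `horizon_s=0`; raising the horizon trades detection latency for a smaller chance of missing the transition.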
Garrett D'Amore
2010-Jun-17  20:35 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Thu, 2010-06-17 at 16:16 -0400, Eric Schrock wrote:
> On Jun 17, 2010, at 3:52 PM, Garrett D'Amore wrote:
> > Anyway, I'm happy to share the code, and even go through the request-sponsor process to push this upstream. I would like the opinions of the ZFS and FMA teams though... is the approach I'm using sane, or have I missed some important design principle? Certainly it *seems* to work well on the systems I've tested, and we (Nexenta) think that it fixes what appears to us to be a critical deficiency in the ZFS error detection and handling. But I'd like to hear other thoughts.
>
> I don't think this is the right approach. You'll end up faulting drives that should be marked removed, among other things. The correct answer is for drivers to use the new LDI events (LDI_EV_DEVICE_REMOVE) to indicate device removal. This is already in ON but not hooked up to ZFS. It's easy to do, but just hasn't been pushed yet.
>
> Note that for legacy drivers, the DKIOCGETSTATE ioctl() is supposed to handle this for you. However, there is a race condition: if the vdev probe happens before the driver has transitioned to a state where it returns DEV_GONE, then we can miss this event (because it is only probed in reaction to I/O failure and we won't try again). We spent some time looking at ways to eliminate this window, but it ultimately got quite ugly and doesn't support hot spares, so the better answer was to just properly support the LDI events.

I actually started with DKIOCGSTATE as my first approach, modifying sd.c. But I had problems because what I found is that nothing was issuing this ioctl properly except for removable/hotpluggable media (and the SAS/SATA controllers/frameworks are not indicating this). I tried overriding that in sd.c, but I still found that there was another bug where the HAL module that does the monitoring does not monitor devices that are present and in use (mounted filesystems) during boot. I think HAL was designed for removable media that would not be automatically mounted by zfs during boot. I didn't analyze this further.

Furthermore, for the devices that did work, the report was "device administratively removed", which is *far* different from a device that has unexpectedly offlined. There was no way to distinguish a device removed via cfgadm from a device that went away unexpectedly.

> If you wanted to expand support for legacy drivers, you should expand use of the DKIOCGETSTATE ioctl(), perhaps with an async task that probes spares, as well as a delayed timer (within the bounds of the zfs-diagnosis resource.removed horizon) to close the associated window for normal vdevs. However, a better solution would be to update the drivers that matter to use LDI_EV_DEVICE_REMOVE, which provides much crisper semantics and will be used in the future to hook into other subsystems.

Is "sd.c" considered a legacy driver? It's what is responsible for the vast majority of disks. That said, perhaps the problem is the HBA drivers?

> In order for anything to be accepted upstream, it's key that it be able to distinguish between REMOVED and FAULTED devices. Mis-diagnosing a removed drive as faulted is very bad (fault = broken hardware = service call = $$$).

So how do we distinguish "removed on purpose" as opposed to "removed by accident, faulted cable, or other non-administrative issue"? I presume that a removal initiated via cfgadm or some other tool could put the ZFS vdev into an offline state, and this would prevent the logic from accidentally marking the device FAULTED. (Ideally it would also mark the device "REMOVED".)

Put another way, if a libzfs command is used to offline or remove the device from the pool, then none of the code I've written is engaged. What I've done is supply code to handle "surprise" removal or disconnect.

> - Eric
>
> P.S. the bug in the ZFS scheme module is legit, we just haven't fixed it yet

I can send the diffs for that fix... they're small and obvious enough.

- Garrett
Eric Schrock
2010-Jun-17  21:53 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Jun 17, 2010, at 4:35 PM, Garrett D'Amore wrote:

> I actually started with DKIOCGSTATE as my first approach, modifying sd.c. But I had problems because what I found is that nothing was issuing this ioctl properly except for removable/hotpluggable media (and the SAS/SATA controllers/frameworks are not indicating this). I tried overriding that in sd.c, but I still found that there was another bug where the HAL module that does the monitoring does not monitor devices that are present and in use (mounted filesystems) during boot. I think HAL was designed for removable media that would not be automatically mounted by zfs during boot. I didn't analyze this further.

ZFS issues the ioctl() from vdev_disk.c. It is up to the HBA drivers to correctly represent the DEV_GONE state (and this is known to work with a variety of SATA drivers).

> Is "sd.c" considered a legacy driver? It's what is responsible for the vast majority of disks. That said, perhaps the problem is the HBA drivers?

It's the HBA drivers.

> So how do we distinguish "removed on purpose" as opposed to "removed by accident, faulted cable, or other non-administrative issue"? I presume that a removal initiated via cfgadm or some other tool could put the ZFS vdev into an offline state, and this would prevent the logic from accidentally marking the device FAULTED. (Ideally it would also mark the device "REMOVED".)

If there is no physical connection (detected to the best of the driver's ability), then it is removed (REMOVED is different from OFFLINE). Surprise device removal is not a fault - Solaris is designed to support removal of disks at any time without administrative intervention. A fault is defined as broken hardware, which is not the case for a removed device.

There are projects underway to a) represent devices that are physically present but unable to attach as faults and b) do topology-based diagnosis to detect bad cables, expanders, etc. This is a complicated problem and not always tractable, but it can be solved reasonably well for modern systems and transports.

A completely orthogonal feature is the ability to represent extended periods of device removal as a defect. While removing a disk is not itself a defect, leaving your pool running minus one disk for hours/days/weeks is clearly broken.

If you have a solution that correctly detects devices as REMOVED for a new class of HBAs/drivers, that'd be more than welcome. If you choose to represent missing devices as faulted in your own third party system, that's your own prerogative, but it's not the current Solaris FMA model.

Hope that helps,

- Eric

--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
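[Editor's sketch] The state distinctions drawn in this message (OFFLINE is administrative, REMOVED means no physical connection, FAULTED means broken hardware, and prolonged removal is a separate defect) can be modelled as below. This is an illustration under stated assumptions, not Solaris FMA code: the function names, the UNKNOWN fallback for when presence cannot be determined, and the threshold parameter are all hypothetical.

```python
def classify(probe_ok, admin_offline, physically_present):
    """Map what is known about a device to a vdev state name.

    probe_ok:           the device answered I/O.
    admin_offline:      an administrator took the device offline on purpose.
    physically_present: True, False, or None when the driver cannot tell.
    """
    if admin_offline:
        return "OFFLINE"   # deliberate, never a fault
    if probe_ok:
        return "ONLINE"
    if physically_present is False:
        return "REMOVED"   # no physical connection: not broken hardware
    if physically_present is True:
        return "FAULTED"   # present but unresponsive
    return "UNKNOWN"       # assumption: do not guess a fault without evidence


def removal_is_defect(removed_for_s, threshold_s):
    """Removal itself is not a defect, but staying removed too long is.

    threshold_s is a hypothetical policy knob; the thread gives no value.
    """
    return removed_for_s >= threshold_s
```

The UNKNOWN branch reflects the cost argument in the thread: calling a removed drive faulted triggers a service call, so the sketch declines to diagnose when presence is not known.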
Garrett D'Amore
2010-Jun-17  22:13 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Thu, 2010-06-17 at 17:53 -0400, Eric Schrock wrote:
> On Jun 17, 2010, at 4:35 PM, Garrett D'Amore wrote:
> > I actually started with DKIOCGSTATE as my first approach, modifying sd.c. But I had problems because what I found is that nothing was issuing this ioctl properly except for removable/hotpluggable media (and the SAS/SATA controllers/frameworks are not indicating this). I tried overriding that in sd.c, but I still found that there was another bug where the HAL module that does the monitoring does not monitor devices that are present and in use (mounted filesystems) during boot. I think HAL was designed for removable media that would not be automatically mounted by zfs during boot. I didn't analyze this further.
>
> ZFS issues the ioctl() from vdev_disk.c. It is up to the HBA drivers to correctly represent the DEV_GONE state (and this is known to work with a variety of SATA drivers).

So maybe the problem is the SAS adapters I'm dealing with (LSI). This is not an imaginary problem -- if there is a better (more correct) solution, then I'd like to use it. Right now it probably is not reasonable for me to fix every HBA driver (I can't, as I don't have source code to a number of them). Actually, the problem *might* be the MPXIO vHCI layer....

> > Is "sd.c" considered a legacy driver? It's what is responsible for the vast majority of disks. That said, perhaps the problem is the HBA drivers?
>
> It's the HBA drivers.

Ah, so you need to see CMD_DEV_GONE on the transport layer. Interesting. I don't think the SAS drivers are doing this.

> > So how do we distinguish "removed on purpose" as opposed to "removed by accident, faulted cable, or other non-administrative issue"? I presume that a removal initiated via cfgadm or some other tool could put the ZFS vdev into an offline state, and this would prevent the logic from accidentally marking the device FAULTED. (Ideally it would also mark the device "REMOVED".)
>
> If there is no physical connection (detected to the best of the driver's ability), then it is removed (REMOVED is different from OFFLINE). Surprise device removal is not a fault - Solaris is designed to support removal of disks at any time without administrative intervention. A fault is defined as broken hardware, which is not the case for a removed device.

So how do you diagnose the situation where someone trips over a cable, or where the drive was bumped and detached from the cable? I guess I'm OK with the idea that these are in a REMOVED state, but I'd like the messaging to say something besides "the administrator has removed the device" or somesuch (which is what it says now). Clearly that's not what happened.

It gets more interesting with other kinds of transports. For example, iSCSI or some other transport (I worked with ATA over Ethernet at one point) -- in that case, if the remote node goes belly up, or the network is lost, it's clearly not the case that this was "removed". The situation here is a device that you can't talk to. I'd argue it's a FAULT.

For busses like 1394 or USB, where the typical use is at a desktop where folks just plug in/out all the time, I don't see this as such a problem. But for enterprise-grade storage, I have higher expectations.

> There are projects underway to a) represent devices that are physically present but unable to attach as faults and b) do topology-based diagnosis to detect bad cables, expanders, etc. This is a complicated problem and not always tractable, but it can be solved reasonably well for modern systems and transports.

I think there are a significant number of cases where you can't tell the difference between a unit dying, a bad cable, and a disconnected cable. With some special magnetics, you might be able to use time-domain reflectometry to diagnose things, but this requires unusual hardware and is clearly outside of the normal scope of things we're dealing with. (Interestingly, some Ethernet PHYs have this capability.)

> A completely orthogonal feature is the ability to represent extended periods of device removal as a defect. While removing a disk is not itself a defect, leaving your pool running minus one disk for hours/days/weeks is clearly broken.

Agreed that this is orthogonal. I'd say that this is best handled via stronger handling of the DEGRADED state.

> If you have a solution that correctly detects devices as REMOVED for a new class of HBAs/drivers, that'd be more than welcome. If you choose to represent missing devices as faulted in your own third party system, that's your own prerogative, but it's not the current Solaris FMA model.

I can certainly flag the device as REMOVED rather than FAULTED, although that will require some extra changes to libzfs, I think. (A new zpool_vdev_removed or vdev_unreachable function or somesuch.) My point here is that I'm willing to refine the work so that it helps folks.

What's important to my mind is two things:

a) when a unit is removed, a spare is recruited to replace it if one is available. (I.e. zfs-retire needs to work.)

b) ideally, this should be logged/handled in some manner asynchronously, so that if such an event has occurred, it does not come as a surprise to the administrator 2 weeks after the fact when the *2nd* unit dies or is removed.

It's that last point "b" that makes me feel less good about "REMOVED". The current code seems to assume that removal is always intentional, and therefore no further notification is needed. But when a disk stops answering SCSI commands, it may indicate an unplanned device failure.

One other thought -- I think ZFS should handle this in a manner such that the behavior appears to the administrator to be the same, regardless of whether I/O was occurring on the unit or not.

An interesting question is what happens if I yank a drive while there are outstanding commands pending? Those commands should time out at the HBA, but will it report them as CMD_DEV_GONE, or will it report an error causing a fault to be flagged?

- Garrett
Eric Schrock
2010-Jun-17  22:38 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Jun 17, 2010, at 6:13 PM, Garrett D'Amore wrote:

> So how do you diagnose the situation where someone trips over a cable, or where the drive was bumped and detached from the cable? I guess I'm OK with the idea that these are in a REMOVED state, but I'd like the messaging to say something besides "the administrator has removed the device" or somesuch (which is what it says now). Clearly that's not what happened.

Are you requesting that we diagnose the difference between tripping over a cable and intentionally unplugging it? That's clearly beyond any software's ability to diagnose.

On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (a generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.

> a) when a unit is removed, a spare is recruited to replace it if one is available. (I.e. zfs-retire needs to work.)

This is handled by the REMOVED state, as zfs-retire subscribes to resource.removed.

> b) ideally, this should be logged/handled in some manner asynchronously, so that if such an event has occurred, it does not come as a surprise to the administrator 2 weeks after the fact when the *2nd* unit dies or is removed.

These are logged as alerts in the SS7000. The first-class notion of a Solaris alert is not new, and has been proposed in the past as part of FMA work. The FMA team is currently working on a project that will introduce some of the underlying infrastructure to formalize alerts in Solaris. These events (the primitives are not called alerts) represent formalized things of interest that are not directly related to a fault or defect. That, along with the ability to diagnose a defect over extended periods of removal, is the correct way to represent this situation.

> It's that last point "b" that makes me feel less good about "REMOVED". The current code seems to assume that removal is always intentional, and therefore no further notification is needed. But when a disk stops answering SCSI commands, it may indicate an unplanned device failure.

There are many, many failure modes that can be distinguished just fine from physical device removal. For example, you can have a PHY up but the attached device completely unresponsive, in which case you know there is a device there. Or you can look at the SES data to determine physical presence. Converting all hotplug events into faults is too broad a brush here.

> One other thought -- I think ZFS should handle this in a manner such that the behavior appears to the administrator to be the same, regardless of whether I/O was occurring on the unit or not.
>
> An interesting question is what happens if I yank a drive while there are outstanding commands pending? Those commands should time out at the HBA, but will it report them as CMD_DEV_GONE, or will it report an error causing a fault to be flagged?

This is detected as device removal. There is a timeout associated with I/O errors in zfs-diagnosis that gives some grace period to detect removal before declaring a disk faulted.

- Eric

--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
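[Editor's sketch] The grace period mentioned at the end of this message, holding an I/O error briefly so that a removal event can arrive before the disk is declared faulted, can be modelled as a small event correlator. This is not the zfs-diagnosis implementation: the event tuple layout, the `diagnose` name, and the rule that a single error is enough to fault are simplifying assumptions made for illustration.

```python
def diagnose(events, grace_s):
    """Correlate I/O errors with removal events per device.

    events:  iterable of (timestamp_s, device, kind), where kind is
             "io_error" or "removed" (hypothetical event names).
    grace_s: how long after the first I/O error a removal event may
             still arrive and explain that error.
    Returns a dict mapping device -> "REMOVED" or "FAULTED".
    """
    removed_at = {}
    first_error = {}
    for ts, dev, kind in sorted(events):
        if kind == "removed":
            removed_at.setdefault(dev, ts)
        elif kind == "io_error":
            first_error.setdefault(dev, ts)

    # A removal with no preceding error is simply a removal.
    verdict = {dev: "REMOVED" for dev in removed_at}

    for dev, err_ts in first_error.items():
        rm_ts = removed_at.get(dev)
        if rm_ts is not None and rm_ts <= err_ts + grace_s:
            verdict[dev] = "REMOVED"   # removal explains the error
        else:
            verdict[dev] = "FAULTED"   # nothing explained it in time
    return verdict
```

With a 10 second grace period, an error followed by a removal 3 seconds later yields REMOVED, while an error with no removal, or with a removal a minute later, yields FAULTED.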
Garrett D'Amore
2010-Jun-17  23:18 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote:
> On Jun 17, 2010, at 6:13 PM, Garrett D'Amore wrote:
> > So how do you diagnose the situation where someone trips over a cable, or where the drive was bumped and detached from the cable? I guess I'm OK with the idea that these are in a REMOVED state, but I'd like the messaging to say something besides "the administrator has removed the device" or somesuch (which is what it says now). Clearly that's not what happened.
>
> Are you requesting that we diagnose the difference between tripping over a cable and intentionally unplugging it? That's clearly beyond any software's ability to diagnose.

I guess it depends. If you have a way to indicate intent -- such as by issuing a command first -- then you can diagnose it, can't you? (I thought this was what cfgadm was all about.)

Perhaps the model here is that nobody ever needs to issue such commands -- that it's reasonable to go around yanking drives from systems in the datacenter willy-nilly. I hope not.

> On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (a generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.
>
> > a) when a unit is removed, a spare is recruited to replace it if one is available. (I.e. zfs-retire needs to work.)
>
> This is handled by the REMOVED state, as zfs-retire subscribes to resource.removed.

Yes, I saw that.

> > b) ideally, this should be logged/handled in some manner asynchronously, so that if such an event has occurred, it does not come as a surprise to the administrator 2 weeks after the fact when the *2nd* unit dies or is removed.
>
> These are logged as alerts in the SS7000. The first-class notion of a Solaris alert is not new, and has been proposed in the past as part of FMA work. The FMA team is currently working on a project that will introduce some of the underlying infrastructure to formalize alerts in Solaris. These events (the primitives are not called alerts) represent formalized things of interest that are not directly related to a fault or defect. That, along with the ability to diagnose a defect over extended periods of removal, is the correct way to represent this situation.
>
> > It's that last point "b" that makes me feel less good about "REMOVED". The current code seems to assume that removal is always intentional, and therefore no further notification is needed. But when a disk stops answering SCSI commands, it may indicate an unplanned device failure.
>
> There are many, many failure modes that can be distinguished just fine from physical device removal. For example, you can have a PHY up but the attached device completely unresponsive, in which case you know there is a device there. Or you can look at the SES data to determine physical presence. Converting all hotplug events into faults is too broad a brush here.

Many of these failure modes depend on having a suitable enclosure. While this may be fine for the SS7000, there are other users of ZFS that don't have that ability.

I guess the fact that the SS7000 code isn't kept up to date in ON means that we may wind up having to do our own thing here... it's a bit unfortunate, but ok.

The point is that, for now, we have a real problem: devices that fail in any of a number of ways don't have *any* indication reported about the failure. So *that* is what we need to fix.

> > One other thought -- I think ZFS should handle this in a manner such that the behavior appears to the administrator to be the same, regardless of whether I/O was occurring on the unit or not.
> >
> > An interesting question is what happens if I yank a drive while there are outstanding commands pending? Those commands should time out at the HBA, but will it report them as CMD_DEV_GONE, or will it report an error causing a fault to be flagged?
>
> This is detected as device removal. There is a timeout associated with I/O errors in zfs-diagnosis that gives some grace period to detect removal before declaring a disk faulted.

Ok.

- Garrett
Robert Milkowski
2010-Jun-18  08:56 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On 18/06/2010 00:18, Garrett D'Amore wrote:
> On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote:
>> On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (a generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.
> [...]
> I guess the fact that the SS7000 code isn't kept up to date in ON means that we may wind up having to do our own thing here... it's a bit unfortunate, but ok.

Eric - is it a business decision that the discussed code is not in ON, or do you actually intend to get it integrated into ON? Because if you do, then I think that getting the Nexenta guys to expand on it would be better for everyone than having them reinvent the wheel...

--
Robert Milkowski http://milek.blogspot.com
Eric Schrock
2010-Jun-18  13:07 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Jun 18, 2010, at 4:56 AM, Robert Milkowski wrote:

> On 18/06/2010 00:18, Garrett D'Amore wrote:
>> On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote:
>>> On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (a generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.
>> [...]
>> I guess the fact that the SS7000 code isn't kept up to date in ON means that we may wind up having to do our own thing here... it's a bit unfortunate, but ok.
>
> Eric - is it a business decision that the discussed code is not in ON, or do you actually intend to get it integrated into ON? Because if you do, then I think that getting the Nexenta guys to expand on it would be better for everyone than having them reinvent the wheel...

Limited bandwidth.

- Eric

--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
Garrett D'Amore
2010-Jun-18  16:26 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
On Fri, 2010-06-18 at 09:07 -0400, Eric Schrock wrote:
> On Jun 18, 2010, at 4:56 AM, Robert Milkowski wrote:
> > On 18/06/2010 00:18, Garrett D'Amore wrote:
> >> On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote:
> >>> On the SS7000 series, you get an alert that the enclosure has been detached from the system. The fru-monitor code (a generalization of the disk-monitor) that generates this sysevent has not yet been pushed to ON.
> >> [...]
> >> I guess the fact that the SS7000 code isn't kept up to date in ON means that we may wind up having to do our own thing here... it's a bit unfortunate, but ok.
> >
> > Eric - is it a business decision that the discussed code is not in ON, or do you actually intend to get it integrated into ON? Because if you do, then I think that getting the Nexenta guys to expand on it would be better for everyone than having them reinvent the wheel...
>
> Limited bandwidth.

Is there anything I can do to help? In my opinion, it's better if we can use solutions in the underlying ON code that everyone agrees with and that are available to everyone.

At the end of the day though, we'll do whatever is required to make sure that the problems that our customers face are solved -- at least in our distro. We'd rather have shared common code for this, but if we have to implement our own bits, we will do so.

-- Garrett
Garrett D'Amore
2010-Jun-19  05:59 UTC
[zfs-discuss] hot detach of disks, ZFS and FMA integration
Btw, I filed a bug (bugs.os.o) for the ZFS FMRI scheme, and included a suggested fix in the description. I don't have a CR number for it yet. It's possible that this should go through the request-sponsor process. Once I have the CR number I'll be happy to follow up with it. (It would be nice if Nexenta were to get credit for the fix.)

- Garrett