running b37 on amd64. after removing power from a disk configured as
a mirror, 10 minutes have passed and ZFS has still not offlined it.

# zpool status tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4t0d0  ONLINE      14 6.05K     0
            c4t1d0  ONLINE       0     0     0

errors: No known data errors

# grep 'Hardware_Error' /var/adm/messages | wc -l
    7632

only after I manually ran "zpool detach tank c4t0d0" did the SCSI
errors stop. I would have expected it to be offlined automatically,
which is exactly what happened when I did this same test with an SVM
mirror.

is this a bug?

grant.
On Tue, May 16, 2006 at 07:02:37PM +1000, grant beattie wrote:

> running b37 on amd64. after removing power from a disk configured as
> a mirror, 10 minutes have passed and ZFS has still not offlined it.

I should have mentioned, the disks are connected to an Adaptec 2120S
card (aac). not that I think it should make any difference.

grant.
What has happened is that your device has started reporting errors, but
is still available on the system, i.e. ZFS is still able to ldi_open()
the underlying device. This seems like a strange failure mode for the
device (you may want to investigate how that's possible), but ZFS is
functioning as designed. You can verify this by doing 'dtrace -n
vdev_reopen:entry', which should show ZFS attempting to reopen the
device once a minute or so. We currently only detect device failure
when the device "goes away".

A future enhancement is to do predictive analysis based on error rates.
This will leverage the full power of FMA diagnosis, allowing us to
perform SERD analysis and incorporate past history as a mechanism for
predicting future failure. This will also incorporate the SMART
predictive failure bit when available. We haven't started work on this
yet, but we have a plan for doing so.

- Eric

On Tue, May 16, 2006 at 07:02:37PM +1000, grant beattie wrote:

> running b37 on amd64. after removing power from a disk configured as
> a mirror, 10 minutes have passed and ZFS has still not offlined it.
> [...]
> is this a bug?

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
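To watch the cadence Eric describes, his one-liner can be expanded a
little. This is a sketch only: it assumes the fbt provider can
instrument vdev_reopen() and that the function's first argument is the
vdev_t (with its vdev_path field), as in the OpenSolaris sources; the
output format is illustrative.

    # dtrace -qn 'fbt::vdev_reopen:entry
    {
            /* print a timestamped line for every reopen attempt */
            printf("%Y reopen attempt on %s\n", walltimestamp,
                args[0]->vdev_path != NULL ?
                stringof(args[0]->vdev_path) : "<no path>");
    }'

While the flaky disk is still attached, the lines should appear roughly
a minute apart.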
On Tue, May 16, 2006 at 10:13:46AM -0700, Eric Schrock wrote:

> What has happened is that your device has started reporting errors,
> but is still available on the system, i.e. ZFS is still able to
> ldi_open() the underlying device. This seems like a strange failure
> mode for the device (you may want to investigate how that's possible),
> but ZFS is functioning as designed. You can verify this by doing
> 'dtrace -n vdev_reopen:entry', which should show ZFS attempting to
> reopen the device once a minute or so. We currently only detect device
> failure when the device "goes away".

hi Eric,

you're right, the aac card appears to offline the disk but the LUN is
still available (though it's an empty device). I'll capture some more
info when I try this again tomorrow.

what I find interesting is that the SCSI errors were continuous for 10
minutes before I detached it, ZFS wasn't backing off at all. it was
flooding the VGA console quicker than the console could print it all
:) from what you said above, once per minute would have been more
desirable.

I wonder why, given that ZFS knew there was a problem with this disk,
it wasn't marked FAULTED and the pool DEGRADED? I don't know enough
about the internals to know why SVM happily offlined the device after
a short burst of errors - that's certainly more friendly and expected.
is there any way I can get the same failure mode with ZFS?

> A future enhancement is to do predictive analysis based on error
> rates. This will leverage the full power of FMA diagnosis, allowing us
> to perform SERD analysis and incorporate past history as a mechanism
> for predicting future failure. This will also incorporate the SMART
> predictive failure bit when available. We haven't started work on this
> yet, but we have a plan for doing so.

that would be cool, too :)

grant.
On Wed, May 17, 2006 at 03:22:34AM +1000, grant beattie wrote:

> what I find interesting is that the SCSI errors were continuous for 10
> minutes before I detached it, ZFS wasn't backing off at all. it was
> flooding the VGA console quicker than the console could print it all
> :) from what you said above, once per minute would have been more
> desirable.

The "once per minute" is related to the frequency at which ZFS tries to
reopen the device. Regardless, ZFS will try to issue I/O to the device
whenever asked. If you believe the device is completely broken, the
correct procedure (as documented in the ZFS Administration Guide) is to
'zpool offline' the device until you are able to repair it.

> I wonder why, given that ZFS knew there was a problem with this disk,
> it wasn't marked FAULTED and the pool DEGRADED?

This is the future enhancement that I described earlier. We need more
sophisticated analysis than simply 'N errors = FAULTED', and that's
what FMA provides. It will allow us to interact with larger fault
management (such as correlating SCSI errors, identifying controller
failure, and more). ZFS is intentionally dumb. Each subsystem is
responsible for reporting errors, but coordinated fault diagnosis has
to happen at a higher level.

> I don't know enough about the internals to know why SVM happily
> offlined the device after a short burst of errors - that's certainly
> more friendly and expected. is there any way I can get the same
> failure mode with ZFS?

Not currently.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
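Spelled out, the procedure Eric points to is short. The device name is
taken from grant's pool; the status output and the physical repair step
are elided.

    # zpool offline tank c4t0d0    # stop issuing I/O to the suspect disk
    # zpool status -x tank         # pool reports DEGRADED, c4t0d0 OFFLINE
      (replace or repair the disk)
    # zpool online tank c4t0d0     # reattach; ZFS resilvers the mirror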
On Tue, 2006-05-16 at 10:32 -0700, Eric Schrock wrote:

> [...]
> This is the future enhancement that I described earlier. We need more
> sophisticated analysis than simply 'N errors = FAULTED', and that's
> what FMA provides. It will allow us to interact with larger fault
> management (such as correlating SCSI errors, identifying controller
> failure, and more). ZFS is intentionally dumb. Each subsystem is
> responsible for reporting errors, but coordinated fault diagnosis has
> to happen at a higher level.

[reason #8752 why pulling disk drives doesn't simulate real failures]

There are also a number of cases where successful or
unsuccessful-but-retryable error codes carry the recommendation to
replace the drive. There really isn't a clean way to write such
diagnosis engines into the various file systems, LVMs, or databases
which might use disk drives. Putting that intelligence into an FMA DE
and tying that into file systems or LVMs is the best way to do this.

 -- richard
Since it's not exactly clear what you did with SVM, I am assuming the
following:

You had a file system on top of the mirror and there was some I/O
occurring to the mirror. The *only* time SVM puts a device into
maintenance is when we receive an EIO from the underlying device. So if
a write occurred to the mirror, the write to the powered-off side
failed (returned an EIO) and SVM kept going. Since all buffers sent to
sd/ssd are marked with B_FAILFAST, the driver timeouts are low and the
device is put into maintenance.

If I understand Eric correctly, ZFS attempts to see if the device is
really gone. However I am not quite sure what Eric means by:

> We currently only detect device failure when the device "goes away".

Perhaps the issue here is that ldi_open is successful when it shouldn't
be, and is therefore confusing ZFS.

Another way to check is to perform the same test without any I/O
occurring to the file system, then run metastat -i (as root). This is
similar to scrub for the volumes.

-Sanjay

Richard Elling wrote:

> [reason #8752 why pulling disk drives doesn't simulate real failures]
> There are also a number of cases where successful or
> unsuccessful-but-retryable error codes carry the recommendation to
> replace the drive. [...] Putting that intelligence into an FMA DE and
> tying that into file systems or LVMs is the best way to do this.
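Sanjay's metastat -i check would look something like this on a mirror
with one dead side. The metadevice names and output are illustrative,
not from grant's system.

    # metastat -i                  # probe the devices behind each metadevice
    d10: Mirror
        Submirror 0: d11
          State: Okay
        Submirror 1: d12
          State: Needs maintenance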
On Thu, May 18, 2006 at 11:40:53PM -0600, Sanjay Nadkarni wrote:

> Since it's not exactly clear what you did with SVM, I am assuming the
> following:
>
> You had a file system on top of the mirror and there was some I/O
> occurring to the mirror. The *only* time SVM puts a device into
> maintenance is when we receive an EIO from the underlying device. So
> if a write occurred to the mirror, the write to the powered-off side
> failed (returned an EIO) and SVM kept going. Since all buffers sent to
> sd/ssd are marked with B_FAILFAST, the driver timeouts are low and the
> device is put into maintenance.

the test was the same in both the SVM and the ZFS case: constant reads
from the mirror device, and unplugging the power. the read throughput
during this test with ZFS drops to around 20% until the device is
manually removed from the pool, after which it returns to normal.

> If I understand Eric correctly, ZFS attempts to see if the device is
> really gone. However I am not quite sure what Eric means by:
>
> > We currently only detect device failure when the device "goes away".
>
> Perhaps the issue here is that ldi_open is successful when it
> shouldn't be, and is therefore confusing ZFS.

yes, that seems to be the case. it appears to be caused by the way the
aac card deals with the disk going away - it offlines the disk, and the
LUN is still presented, but it now has zero length.

also, after a disk is offlined by the card, there does not seem to be a
way to tell the card to rescan the bus, so it requires a reboot (though
there is nothing that ZFS can do which would fix that). I believe it
can be done with the "aaccli" program provided by Adaptec, but that
doesn't work with the Solaris-provided aac driver.

> Another way to check is to perform the same test without any I/O
> occurring to the file system, then run metastat -i (as root). This is
> similar to scrub for the volumes.

with no IO activity on the mirror, metastat -i does not detect that
anything is wrong. with IO activity, SVM offlines the metadevice when
it gets a fatal error from the device.

grant.
On Thu, 2006-05-18 at 23:40 -0600, Sanjay Nadkarni wrote:

> You had a file system on top of the mirror and there was some I/O
> occurring to the mirror. The *only* time SVM puts a device into
> maintenance is when we receive an EIO from the underlying device. So
> if a write occurred to the mirror, the write to the powered-off side
> failed (returned an EIO) and SVM kept going. Since all buffers sent to
> sd/ssd are marked with B_FAILFAST, the driver timeouts are low and the
> device is put into maintenance.

Sanjay,

#1 on the Pareto chart of disk error messages is the nonrecoverable
read. Does SVM put the mirror in maintenance mode due to an EIO caused
by a nonrecoverable read?

 -- richard