Hi all,

I'd like to raise two points about ZFS that I think are a must before even
trying to use it in production:

1) ZFS must stop forcing kernel panics!
As you know, ZFS causes a kernel panic when a corrupted zpool is found, when
it's unable to reach a device, and so on. We need it to just fail with an
error message; please stop crashing the kernel.

2) We need a way to recover a corrupted ZFS pool by throwing away the last
incomplete transactions. Please give us "zfsck" :)

Waiting for comments,
gino

This message posted from opensolaris.org
There was some discussion of the "always panic for fatal pool failures"
issue in April 2006, but I haven't seen whether an actual RFE was generated.

http://mail.opensolaris.org/pipermail/zfs-discuss/2006-April/017276.html
On Tue, Apr 10, 2007 at 12:48:49AM -0700, Gino wrote:
> Hi All
>
> I'd like to raise two points about ZFS that I think are a must before
> even trying to use it in production:
>
> 1) ZFS must stop forcing kernel panics!
> As you know, ZFS causes a kernel panic when a corrupted zpool is found
> or if it's unable to reach a device and so on...
> We need to have it just fail with an error message, but please stop
> crashing the kernel.

This is:

6322646 ZFS should gracefully handle all devices failing (when writing)

Which is being worked on. Using a redundant configuration prevents this
from happening.

> 2) We need a way to recover a corrupted ZFS pool, trashing the last
> incomplete transactions.
> Please give us "zfsck" :)

Please see the ZFS FAQ at:

http://www.opensolaris.org/os/community/zfs/faq/#whynofsck

Writing such a tool is effectively impossible. For the one known
corruption bug we've encountered (and since fixed), we provided the
'zfs_recover' /etc/system switch, but it only works for this particular
bug. Without understanding the underlying pathology it's impossible to
"fix" a ZFS pool.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
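(For anyone searching the archives later: the switch Eric refers to is set
in /etc/system. Whether it applies, and exactly what it relaxes, depends on
your build, so verify it against your kernel version first. A sketch:)

```
* /etc/system fragment (comment lines in this file start with '*').
* Enables ZFS recovery mode at the next boot; only relevant to the one
* known, already-fixed corruption bug mentioned above.
set zfs:zfs_recover = 1
```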
>> please stop crashing the kernel.
>
> This is:
>
> 6322646 ZFS should gracefully handle all devices failing (when writing)

That's only one cause of panics.

At least two of gino's panics appear due to corrupted space maps, for
instance. I think there may also still be a case where a failure to read
metadata during a transaction commit leads to a panic, too. Maybe that
one's been fixed, or maybe it will be handled by the above bug.

Maybe someone needs to file a bug/RFE to remove all panics from ZFS, at
least in non-debug builds? The QFS approach is to panic when an
inconsistency is found on debug builds, but to return an appropriate error
code on release builds, which seems reasonable.

I/O errors, of course, should never lead to a panic. I think we [you] fixed
all of those cases in UFS, and QFS, long ago.

Anton
> Without understanding the underlying pathology it's impossible to "fix"
> a ZFS pool.

Sorry, but I have to disagree with this.

The goal of fsck is not to bring a file system into the state it "should"
be in had no errors occurred. The goal, rather, is to bring a file system
to a self-consistent state. Ideally, data should be recoverable when it's
believed to be good (ZFS has a big advantage here, since the checksums can
be used to validate block pointers).

The ZFS on-disk data structure is basically a tree. zfsck could fairly
easily walk the tree and ensure that, for instance, pools are at the top
level; space maps match allocated blocks; block pointers from multiple
files don't overlap; file lengths match their allocation; ACLs are not
corrupted; compressed data is not damaged; directories are in the proper
format; etc.

This might be impractical for a large file system, of course. It might be
easier to have a 'zscavenge' that would recover data, where possible, from
a corrupted file system. But there should be at least one of these. Losing
a whole pool due to the corruption of a couple of blocks of metadata is a
Bad Thing.
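To make the idea concrete, here is a toy sketch in Python of the kind of
walk a "zfsck" could do. The block layout, field names, and space-map model
are invented for illustration; real ZFS on-disk structures (blkptr_t, space
maps, the MOS) are far more involved.

```python
import hashlib

# Toy model: a "block" is (stored_checksum, payload, list of child ids).
# All names here are illustrative, not actual ZFS on-disk structures.
def checksum(data):
    return hashlib.sha256(data).hexdigest()

def check_tree(blocks, root_id, allocated):
    """Walk the block tree from root_id, verifying each block's checksum,
    then cross-check the reachable set against the space map (allocated)."""
    errors = []
    reachable = set()
    stack = [root_id]
    while stack:
        bid = stack.pop()
        if bid in reachable:
            errors.append(f"block {bid}: referenced twice")
            continue
        reachable.add(bid)
        blk = blocks.get(bid)
        if blk is None:
            errors.append(f"block {bid}: missing")
            continue
        stored_sum, payload, children = blk
        if checksum(payload) != stored_sum:
            errors.append(f"block {bid}: bad checksum")
        stack.extend(children)
    # Space-map consistency: allocated blocks must equal reachable ones.
    for bid in sorted(allocated - reachable):
        errors.append(f"block {bid}: allocated but unreferenced (leak)")
    for bid in sorted(reachable - allocated):
        errors.append(f"block {bid}: referenced but not allocated")
    return errors
```

Even this toy shows both sides of the argument: the checks themselves are
mechanical, but deciding how to repair a block that fails them is exactly
where the "underlying pathology" question comes back.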
On Tue, Apr 10, 2007 at 09:43:39PM -0700, Anton B. Rang wrote:
>
> That's only one cause of panics.
>
> At least two of gino's panics appear due to corrupted space maps, for
> instance. I think there may also still be a case where a failure to
> read metadata during a transaction commit leads to a panic, too. Maybe
> that one's been fixed, or maybe it will be handled by the above bug.

The space map bugs should have been fixed as part of:

6458218 assertion failed: ss == NULL

Which went into Nevada build 60. There are several different pathologies
that can result from this bug, and I don't know if the panics are from
before or after this fix. I hope folks from the ZFS team are
investigating, but I can't speak for everyone.

> Maybe someone needs to file a bug/RFE to remove all panics from ZFS,
> at least in non-debug builds? The QFS approach is to panic when
> inconsistency is found on debug builds, but return an appropriate
> error code on release builds, which seems reasonable.

In order to do this we need to fix 6322646 first, which addresses the
issue of 'backing out' of a transaction once we're down in the ZIO layer
discovering these problems. It doesn't matter whether it's due to an I/O
error or a space map inconsistency if we can't propagate the error.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
> 6322646 ZFS should gracefully handle all devices failing (when writing)
>
> Which is being worked on. Using a redundant configuration prevents this
> from happening.

What do you mean by "redundant"? All our servers have 2 or 4 HBAs, 2 or 4
FC switches, and storage arrays with redundant controllers. We used only
RAID10 zpools, but we still had them corrupted.

> http://www.opensolaris.org/os/community/zfs/faq/#whynofsck
>
> Writing such a tool is effectively impossible. For the one known
> corruption bug we've encountered (and since fixed), we provided the
> 'zfs_recover' /etc/system switch, but it only works for this
> particular bug. Without understanding the underlying pathology it's
> impossible to "fix" a ZFS pool.

I think this is a very important drawback of ZFS.
> > 1) ZFS must stop forcing kernel panics!
> > As you know, ZFS causes a kernel panic when a corrupted zpool is
> > found or if it's unable to reach a device and so on...
> > We need to have it just fail with an error message, but please stop
> > crashing the kernel.
>
> This is:
>
> 6322646 ZFS should gracefully handle all devices failing (when writing)

With S10U3 we are still getting kernel panics when trying to import a
corrupted zpool (RAID10).

gino
> On Tue, Apr 10, 2007 at 09:43:39PM -0700, Anton B. Rang wrote:
> >
> > At least two of gino's panics appear due to corrupted space maps,
> > for instance. I think there may also still be a case where a failure
> > to read metadata during a transaction commit leads to a panic, too.
> > Maybe that one's been fixed, or maybe it will be handled by the
> > above bug.
>
> The space map bugs should have been fixed as part of:
>
> 6458218 assertion failed: ss == NULL
>
> Which went into Nevada build 60. There are several different
> pathologies that can result from this bug, and I don't know if the
> panics are from before or after this fix.

If it can help you: we are able to corrupt a zpool on snv_60 by doing the
following a few times:

- create a raid10 zpool (dual-path LUNs)
- put a heavy write load on that zpool
- disable FC ports on both of the FC switches

Each time we get a kernel panic, probably because of 6322646, and
sometimes we get a corrupted zpool.

gino
Gino writes:
> > 6322646 ZFS should gracefully handle all devices failing (when writing)
> >
> > Which is being worked on. Using a redundant configuration prevents
> > this from happening.
>
> What do you mean by "redundant"? All our servers have 2 or 4 HBAs,
> 2 or 4 FC switches, and storage arrays with redundant controllers.
> We used only RAID10 zpools, but we still had them corrupted.

"Redundant" from the viewpoint of ZFS, so either a ZFS mirror or raid-z.
The point of the bug is to better handle failures on devices in
non-redundant pools. For redundant pools, ZFS is able to self-heal
problems as they arise. If you maintain redundancy only at the storage
level, then it's harder for ZFS to deal with problems. We should still
behave better than we do now, thus 6322646.

Can you post your zpool status output?

-r
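To illustrate the distinction Roch is drawing (device names are made up;
this is a sketch of the two configurations, not a recommendation):

```
# ZFS-managed redundancy: ZFS holds both copies, so it can detect a bad
# block via its checksum and self-heal it from the other side.
zpool create tank mirror c1t0d0 c2t0d0

# Array-managed redundancy only: the SAN mirrors internally, but ZFS
# sees a single device and has no second copy to repair from.
zpool create tank c3t0d0
```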
Anton B. Rang wrote:
> This might be impractical for a large file system, of course. It might
> be easier to have a 'zscavenge' that would recover data, where
> possible, from a corrupted file system. But there should be at least
> one of these. Losing a whole pool due to the corruption of a couple of
> blocks of metadata is a Bad Thing.

That could be handy for any number of data-transport,
borked-system-recovery, and even some forensic-like tasks:

zscavenge badpool | zfs recv

So, suppose a user has a few hundred GB of data that they'd like copied
directly onto our fileserver. They bring me a ZFS-formatted external USB
drive. Instead of mounting it and messing with their data, I zscavenge
/dev/usbdevice, and then write that into a pool that I'm comfortable
messing with.

It works just as well for someone trying to recover a truly borked
system. One could recover the data without making any changes whatsoever
to the drive, so that I can put everything back the way I found it; in
case I can't fix it, the next person who tries to fix it has a clean
slate.

-Luke
> The space map bugs should have been fixed as part of:
>
> 6458218 assertion failed: ss == NULL
>
> Which went into Nevada build 60. There are several different
> pathologies that can result from this bug, and I don't know if the
> panics are from before or after this fix. I hope folks from the ZFS
> team are investigating, but I can't speak for everyone.

This week we'll start one of our "burn tests" on snv60. I'll let you
know. For the moment we are able to panic ZFS very easily. We'll see if
we are able to corrupt a zpool again.

> In order to do this we need to fix 6322646 first, which addresses the
> issue of 'backing out' of a transaction once we're down in the ZIO
> layer discovering these problems. It doesn't matter whether it's due
> to an I/O error or a space map inconsistency if we can't propagate
> the error.

Any ETA for 6322646 and 6417779?

6417779 ZFS: I/O failure (write on ...)
6322646 ZFS should gracefully handle all devices failing (when writing)

Isn't there a way to increase the "timeout" so that Solaris just hangs
when a LUN is not available, and have it retry the I/O more times?

gino
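On the timeout question: the Solaris disk drivers do expose per-command
timeout tunables, though raising them only postpones the failure; it
cannot make an unreachable LUN writable. As a sketch (tunable names and
defaults should be verified against your driver and release before use):

```
* /etc/system fragment; values are in seconds.
* Raise the sd driver's per-command timeout (default 60s):
set sd:sd_io_time = 120
* For FC disks attached through the ssd driver, the analogous tunable:
set ssd:ssd_io_time = 120
```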
> The space map bugs should have been fixed as part of:
>
> 6458218 assertion failed: ss == NULL
>
> Which went into Nevada build 60. There are several different
> pathologies that can result from this bug, and I don't know if the
> panics are from before or after this fix. I hope folks from the ZFS
> team are investigating, but I can't speak for everyone.
>
> In order to do this we need to fix 6322646 first, which addresses the
> issue of 'backing out' of a transaction once we're down in the ZIO
> layer discovering these problems.

Now this is scary. Judging from the descriptions, it is possible that we
might lose data in ZFS, and/or end up with a corrupted zpool that panics
the kernel, if during a write operation ZFS loses its connection to the
underlying hardware (for example on a power failure)?

I've rarely seen that since we got UFS w/ logging in Solaris 7 or so.
And even with UFS, there is always fsck, allowing us to bring the system
back to a consistent state for recovering from a previous backup.
Is ZFS really supposed to be more reliable than UFS w/ logging, for
example, in a single-disk, root file system scenario?
On Mon, Apr 23, 2007 at 08:49:35AM -0700, Ivan Wang wrote:
>
> Now this is scary. Judging from the descriptions, it is possible that
> we might lose data in ZFS, and/or end up with a corrupted zpool that
> panics the kernel, if during a write operation ZFS loses its
> connection to the underlying hardware (for example a power failure)?

No, you will not lose data. There is a chance we will panic if you fail
a write in an unreplicated pool, but you will not lose data as a result.

> I've rarely seen that since we got UFS w/ logging in Solaris 7 or so.
> And even with UFS, there is always fsck, allowing us to bring the
> system back to a consistent state for recovering from a previous
> backup.
>
> Is ZFS really supposed to be more reliable than UFS w/ logging, for
> example, in a single-disk, root file system scenario?

Yes. The failure to cope with a failed write in an unreplicated pool
affects the availability of the system (because we panic), but not the
underlying reliability of the filesystem.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
> > Is ZFS really supposed to be more reliable than UFS w/ logging, for
> > example, in a single-disk, root file system scenario?
>
> Yes. The failure to cope with a failed write in an unreplicated pool
> affects the availability of the system (because we panic), but not
> the underlying reliability of the filesystem.

Eric,
we had 5 corrupted zpools (on different servers and different SANs)!
With Solaris up to S10U3 and Nevada up to snv59, we are able to corrupt
a zpool easily just by disconnecting one or more of its LUNs a few times
under high I/O load.

We are testing snv60 now.

gino
On Mon, Apr 23, 2007 at 09:38:47AM -0700, Gino wrote:
>
> we had 5 corrupted zpools (on different servers and different SANs)!
> With Solaris up to S10U3 and Nevada up to snv59, we are able to
> corrupt a zpool easily just by disconnecting one or more of its LUNs
> a few times under high I/O load.
>
> We are testing snv60 now.

As I've mentioned before, I believe you were tripping over the space map
bug (6458218), which was fixed in build 60 and will appear in S10u4. Let
us know if you are able to reproduce the problem on build 60 or later.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
> As I've mentioned before, I believe you were tripping over the space
> map bug (6458218), which was fixed in build 60 and will appear in
> S10u4. Let us know if you are able to reproduce the problem on build
> 60 or later.

Eric,
we've done our first test with snv60. We moved over 40TB of data between
4 zpools, and in the meantime we've taken about 100000 snapshots and
forced 50 panics by disabling ports on the FC switches. None of the
pools have been corrupted!

Also, we found snv60 MUCH more stable than S10U3.

gino
Hello Gino,

Wednesday, April 11, 2007, 10:43:17 AM, you wrote:

>> The space map bugs should have been fixed as part of:
>>
>> 6458218 assertion failed: ss == NULL
>>
>> Which went into Nevada build 60. There are several different
>> pathologies that can result from this bug, and I don't know if the
>> panics are from before or after this fix.

G> If it can help you: we are able to corrupt a zpool on snv_60 by doing
G> the following a few times:
G> - create a raid10 zpool (dual-path LUNs)
G> - put a heavy write load on that zpool
G> - disable FC ports on both of the FC switches
G> Each time we get a kernel panic, probably because of 6322646, and
G> sometimes we get a corrupted zpool.

Is this still the case? Has the pool-corruption problem been addressed
and hopefully solved?

--
Best regards,
Robert Milkowski                     mailto:rmilkowski at task.gda.pl
                                     http://milek.blogspot.com
Hello Robert,

now we are using snv60 and snv67, moving many TB of data every day, and
no corruption problems any more.

Unfortunately the following problems force us to stay with UFS for our
production servers:

6417779 ZFS: I/O failure (write on ...)
6322646 ZFS should gracefully handle all devices failing (when writing)

Also, we found that our backup servers using ZFS, after 40-60 days of
uptime, start to show system CPU time > 50%, often with one or two CPUs
at 100%. After a reboot, system CPU time goes back to 7-11%.

gino
Hello Gino,

Monday, August 13, 2007, 8:51:18 AM, you wrote:

G> Hello Robert,
G> now we are using snv60 and snv67, moving many TB of data every day,
G> and no corruption problems any more.

Good, thanks for the info.

G> Unfortunately the following problems force us to stay with UFS for
G> our production servers:
G> 6417779 ZFS: I/O failure (write on ...)
G> 6322646 ZFS should gracefully handle all devices failing (when writing)

So are you mounting UFS with onerr=lock?

--
Best regards,
Robert Milkowski                     mailto:rmilkowski at task.gda.pl
                                     http://milek.blogspot.com
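A small aside for anyone trying this: the UFS mount option is spelled
'onerror' (values: panic, lock, umount; panic is the default). For
example, in /etc/vfstab (device and mount point are illustrative):

```
#device             device to fsck      mount    FS   fsck  mount    mount
#to mount                               point    type pass  at boot  options
/dev/dsk/c0t0d0s5   /dev/rdsk/c0t0d0s5  /export  ufs  2     yes      logging,onerror=lock
```

With onerror=lock, an internal UFS inconsistency locks the filesystem
(applications get errors) instead of panicking the machine, which is
exactly the behavior this thread asks of ZFS.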