In the old days of UFS, one might occasionally create multiple file systems (using multiple partitions) on a large LUN if filesystem corruption was a concern. It didn't happen often, but filesystem corruption has happened. So, if filesystem X was corrupt, filesystem Y would be just fine.

With ZFS, does the same logic hold true for two filesystems coming from the same pool?

Said slightly differently, I'm assuming that if the pool becomes mangled somehow then all filesystems will be toast, but is it possible to have one filesystem be corrupted while the other filesystems are fine?

Hmmm, does the answer depend on whether the filesystems are nested?

ex: 1 /my_fs_1 /my_fs_2
ex: 2 /home_dirs /home_dirs/chris

TIA!

This message posted from opensolaris.org
Richard L. Hamilton
2007-Aug-11  12:23 UTC
[zfs-discuss] do zfs filesystems isolate corruption?
> In the old days of UFS, on occasion one might create
> multiple file systems (using multiple partitions) of
> a large LUN if filesystem corruption was a concern.
> It didn't happen often but filesystem corruption
> has happened. So, if filesystem X was corrupt
> filesystem Y would be just fine.
>
> With ZFS, does the same logic hold true for two
> filesystems coming from the same pool?
>
> Said slightly differently, I'm assuming that if the
> pool becomes mangled some how then all filesystems
> will be toast, but is it possible to have one
> filesystem be corrupted while the other filesystems
> are fine?
>
> Hmmm, does the answer depend on if the filesystems
> are nested
> ex: 1 /my_fs_1 /my_fs_2
> ex: 2 /home_dirs /home_dirs/chris
>
> TIA!

If they're always consistent on-disk, and the checksumming catches storage subsystem errors with almost 100% certainty, then the only corruption can come from bugs in the code, or perhaps from uncaught non-storage (i.e. CPU, memory) errors.

So I suppose the answer would depend on where in the code things went astray, but you probably could not expect any sort of isolation or even sanity at that point. If privileged code is running amok, anything could happen, and that would be true with two distinct ufs filesystems too, I would think. Perhaps one might guess that corruption is less likely to stay isolated to a single zfs filesystem (given how lightweight a zfs filesystem is).

OTOH, since zfs catches errors other filesystems don't, think of how many ufs filesystems may well be corrupt for a very long time before causing a panic and having that get discovered by fsck. Ideally, if zfs code passes its test suites, you're safer with it than with most anything else, even if it isn't perfect.

But I'm way out on a limb here; no doubt the experts will correct and amend what I've said...
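The distinction drawn above, between errors in the storage subsystem and errors that happen in CPU or memory, can be sketched in a few lines of Python. This is only a toy model under stated assumptions: SHA-256 stands in for whichever checksum a pool actually uses, and `write_block`, `read_block`, and `flip_bit` are hypothetical helpers, not ZFS interfaces. The point it tries to show is that a checksum taken at write time catches damage that happens afterwards, but cannot catch damage that happened before the checksum was computed.

```python
import hashlib
from typing import Tuple


def checksum(data: bytes) -> bytes:
    # SHA-256 is a stand-in; the real checksum algorithm is a pool setting.
    return hashlib.sha256(data).digest()


def write_block(buffer: bytes) -> Tuple[bytes, bytes]:
    """Checksum the in-memory buffer, then 'store' it. Returns (stored, checksum)."""
    return bytes(buffer), checksum(buffer)


def read_block(stored: bytes, expected: bytes) -> bytes:
    """Return the block only if it still matches the checksum taken at write time."""
    if checksum(stored) != expected:
        raise IOError("checksum mismatch: block changed after it was written")
    return stored


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with the lowest bit of byte `index` inverted."""
    damaged = bytearray(data)
    damaged[index] ^= 0x01
    return bytes(damaged)


def storage_error_detected(original: bytes) -> bool:
    """Case 1: the storage subsystem damages the block after the write."""
    stored, expected = write_block(original)
    try:
        read_block(flip_bit(stored, 0), expected)
    except IOError:
        return True
    return False


def memory_error_detected(original: bytes) -> bool:
    """Case 2: a memory or CPU fault damages the buffer before it is checksummed."""
    stored, expected = write_block(flip_bit(original, 0))
    try:
        read_block(stored, expected)
    except IOError:
        return True
    return False  # the wrong data verifies cleanly


if __name__ == "__main__":
    print("storage error detected:", storage_error_detected(b"payload"))
    print("memory error detected: ", memory_error_detected(b"payload"))
```

In this model the second case goes unnoticed because the checksum faithfully describes data that was already wrong, which is the class of failure the reply above attributes to code bugs and non-storage faults.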
Is it possible that a faulty disk controller could cause corruption to a zpool? I think I had this experience recently when doing a 'zpool replace' with both the old and new devices attached to a controller that I discovered was faulty (because I got data checksum errors, and had to dig for backups).

Blake

On 8/11/07, Richard L. Hamilton <rlhamil at smart.net> wrote:
> If they're always consistent on-disk, and the checksumming catches storage
> subsystem errors out to almost 100% certainty, then the only corruption can
> come from bugs in the code, or uncaught non-storage (i.e. CPU, memory)
> bugs perhaps.
I did some tests with zfs-fuse where I created a pool with two vdevs (no mirror or raid-z) and filled it up with files. Then I deliberately corrupted bytes on the vdev and scrubbed the pool to see what happened. ZFS was able to pinpoint exactly which files were corrupted and reported their full paths, so you could in principle go recover them from a backup. Part of this robustness is due to the use of ditto blocks for storing filesystem metadata, making it very hard to destroy directory information.

I'm not sure if that answers the question you were asking, but generally I found that damage to a zpool was very well confined.
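The experiment above can be imitated with a small Python model. It is a sketch only, assuming a much simpler layout than a real pool: `ToyPool`, its methods, and the `/tank/...` paths are all hypothetical, CRC32 stands in for the real checksum, and "ditto blocks" are reduced to two copies of a single directory listing. It is meant to show why per-file checksums let a scrub name the damaged files, and why a second metadata copy keeps the paths readable when one copy is hit.

```python
import zlib
from typing import Dict, List, Tuple


class ToyPool:
    """Non-redundant toy pool: one copy of file data, two copies of the directory."""

    def __init__(self) -> None:
        # path -> (file contents, checksum recorded at write time)
        self.files: Dict[str, Tuple[bytearray, int]] = {}
        # Two "ditto" copies of the directory listing, sharing one checksum.
        self.directory: List[bytearray] = [bytearray(), bytearray()]
        self.directory_sum: int = zlib.crc32(b"")

    def write(self, path: str, contents: bytes) -> None:
        self.files[path] = (bytearray(contents), zlib.crc32(contents))
        listing = "\n".join(sorted(self.files)).encode()
        self.directory = [bytearray(listing), bytearray(listing)]
        self.directory_sum = zlib.crc32(listing)

    def corrupt_file(self, path: str, offset: int) -> None:
        """Simulate on-disk damage to one byte of a file's data."""
        self.files[path][0][offset] ^= 0xFF

    def corrupt_directory_copy(self, copy: int, offset: int) -> None:
        """Simulate on-disk damage to one of the two directory copies."""
        self.directory[copy][offset] ^= 0xFF

    def paths(self) -> List[str]:
        """Return the listing from the first directory copy that verifies."""
        for copy in self.directory:
            if zlib.crc32(bytes(copy)) == self.directory_sum:
                return [p for p in bytes(copy).decode().split("\n") if p]
        raise IOError("every directory copy is damaged")

    def scrub(self) -> List[str]:
        """Re-verify every file and return the paths whose data no longer matches."""
        damaged = []
        for path in self.paths():
            contents, expected = self.files[path]
            if zlib.crc32(bytes(contents)) != expected:
                damaged.append(path)
        return damaged


if __name__ == "__main__":
    pool = ToyPool()
    for name in ("a.txt", "b.txt", "c.txt"):
        pool.write("/tank/" + name, name.encode() * 4)
    pool.corrupt_file("/tank/b.txt", 0)
    pool.corrupt_directory_copy(0, 0)
    print("damaged files:", pool.scrub())
```

Without redundancy the model can only report the damaged file, not repair it, which matches the "go recover them from a backup" outcome described above.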
Chris,

> In the old days of UFS, on occasion one might create multiple file
> systems (using multiple partitions) of a large LUN if filesystem
> corruption was a concern. It didn't happen often but filesystem
> corruption has happened. So, if filesystem X was corrupt
> filesystem Y would be just fine.
>
> With ZFS, does the same logic hold true for two filesystems coming
> from the same pool?

For the purposes of isolating corruption, the separation of two or more filesystems coming from the same ZFS storage pool does not help. An entire ZFS storage pool is the unit of I/O consistency, as all ZFS filesystems created within this single storage pool share the same physical storage.

When configuring a ZFS storage pool, the [poor] decision to choose a non-redundant (single disk or concatenation of disks) rather than a redundant (mirror, raidz, raidz2) storage pool leaves ZFS with no means to recover automatically from some forms of corruption.

Even when using a redundant storage pool, there are scenarios in which this is not good enough. This is when the need shifts from filesystem integrity to availability, such as when the loss or inaccessibility of two or more disks renders mirroring or raidz ineffective.

As of Solaris Express build 68, Availability Suite [http://www.opensolaris.org/os/project/avs/] is part of base Solaris, offering both local snapshots and remote mirrors, both of which work with ZFS.

Locally on a single Solaris host, snapshots of the entire ZFS storage pool can be taken at intervals of one's choosing, and with multiple snapshots of a single master, collections of snapshots, say at intervals of one hour, can be retained. Options allow for 100% independent snapshots (much like your UFS analogy above), dependent snapshots where only the Copy-On-Write data is retained, or compact dependent snapshots where the snapshot's physical storage is some percentage of the master.
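The difference between an independent snapshot and a dependent one that retains only Copy-On-Write data can be illustrated with a toy model. This is a generic copy-on-write sketch, not Availability Suite's implementation or API: `Master`, `DependentSnapshot`, and `independent_snapshot` are hypothetical names, and a "volume" is reduced to a list of blocks. It assumes that a dependent snapshot saves a block's old contents the first time the master overwrites it and otherwise reads through to the master.

```python
from typing import Dict, List


class Master:
    """A volume modelled as a list of blocks, with snapshots hanging off it."""

    def __init__(self, blocks: List[bytes]) -> None:
        self.blocks = list(blocks)
        self.snapshots: List["DependentSnapshot"] = []

    def snapshot(self) -> "DependentSnapshot":
        snap = DependentSnapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index: int, data: bytes) -> None:
        # Copy-on-write: hand the old block to every snapshot before overwriting.
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = data


class DependentSnapshot:
    """Stores only the blocks that changed on the master after the snapshot."""

    def __init__(self, master: Master) -> None:
        self.master = master
        self.saved: Dict[int, bytes] = {}

    def preserve(self, index: int, old: bytes) -> None:
        # Keep only the first old version; later overwrites are not our view.
        self.saved.setdefault(index, old)

    def read(self, index: int) -> bytes:
        # Unchanged blocks are read through from the master.
        if index in self.saved:
            return self.saved[index]
        return self.master.blocks[index]


def independent_snapshot(master: Master) -> List[bytes]:
    """A full copy: usable even if the master is lost, at full storage cost."""
    return list(master.blocks)


if __name__ == "__main__":
    master = Master([b"a", b"b", b"c"])
    hour1 = master.snapshot()
    master.write(1, b"B")
    hour2 = master.snapshot()
    master.write(2, b"C")
    print([hour1.read(i) for i in range(3)], hour1.saved)
    print([hour2.read(i) for i in range(3)], hour2.saved)
```

The trade-off the model makes visible is that a dependent snapshot is cheap because it leans on the master for unchanged blocks, so it does not survive the loss of the master the way a full independent copy does.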
Remotely, between two or more Solaris hosts, remote mirrors of the entire ZFS storage pool can be configured, where synchronous replication can offer zero data loss, or asynchronous replication can offer near zero data loss, with both offering write-order, on-disk consistency.

A key aspect of remote replication with Availability Suite is that the replicated ZFS storage pool can be quiesced on the remote node and accessed, or, in a disaster recovery scenario, can take over instantly where the primary left off. When the primary site is restored, the MTTR (Mean Time To Recovery) is essentially zero, since Availability Suite supports on-demand pull: blocks that have yet to be replicated are retrieved synchronously, allowing the ZFS filesystem and applications to be resumed without waiting for a potentially lengthy resynchronization.

Jim Dunham
Solaris, Storage Software Group
Sun Microsystems, Inc.
1617 Southwood Drive
Nashua, NH 03063
Email: James.Dunham at Sun.COM
http://blogs.sun.com/avs
Thanks for the info folks. In addition to the 2 replies shown above I got the following very knowledgeable reply from Jim Dunham (for some reason it has not shown up here yet so I'm going to paste it in).

----
[Jim Dunham's reply, as it appears in his message above]
----

Thanks Jim!
On 8/11/07, Stan Seibert <volsung at mailsnare.net> wrote:
> I'm not sure if that answers the question you were asking, but generally I found that damage to a zpool was very well confined.

But you can't count on it. I currently have an open case where a zpool became corrupt and put the system into a panic loop. As this case has progressed, I found that the panic loop part of it is not present in any released version of S10 tested (S10U3 + 118833-36, 125100-07, 125100-10) but does exist in snv69.

The test mechanism is whether "zpool import" (no pool name) causes the system to panic or not. I'm going on the assumption that if it does panic, having the appropriate zpool.cache in place will cause a panic during every boot.

Oddly enough, I know I can't blame this on the storage subsystem - it is ZFS as well. :) It goes like this:

HDS 99xx
T2000 primary ldom
    S10u3 with a file on zfs presented as a block device for an ldom
T2000 guest ldom
    zpool on slice 3 of block device mentioned above

Depending on the OS running on the guest LDOM, "zpool import" gives different results:

S10U3 118833-36 - 125100-10: "zpool is corrupt" "restore from backups"
S10u4 Beta, snv69 and I think snv59: panic - S10u4 backtrace is very different from snv*

--
Mike Gerdts
http://mgerdts.blogspot.com/