Jim Klimov
2012-Jan-13 01:00 UTC
[zfs-discuss] Injection of ZFS snapshots into existing data, and replacement of older snapshots with zfs recv without truncating newer ones
While reading about ZFS on-disk formats, I wondered once again why it is not possible to create a snapshot of existing data, not of the current TXG but of some older point in time. From what I gathered, the definition of a snapshot requires only a cut-off TXG number and the existence of blocks in the dataset with smaller-or-equal birth TXGs. It seems like just a coincidence that the current TXG is used and older TXGs aren't. Is it deemed inconvenient/impractical/useless/not-thought-of, or are there some fundamental or technological drawbacks to the idea?

Note: this idea is related to my proposal in the October thread "[zfs-discuss] (Incremental) ZFS SEND at sub-snapshot level" and could aid "restartable zfs send" by creating smaller snapshots for incremental sending of existing large datasets.

Today I had a new twist on the idea, though: as I wrote in other posts, my raidz2 did not help protect some of my data. One of the damaged files belongs to a stack of snapshots that are continually replicated from another box, and the inconsistent on-disk block is referenced in an old snapshot (almost at the root of the stack). Resending and re-receiving the whole stack of snapshots is possible, but inconvenient and slow. Rsyncing just the difference (good data instead of the IO-erroring byte range) to repair the file would forfeit further incremental snapshot syncs.

So I thought: it would be nice if it were possible (perhaps not now, but in the future as an RFE) to resend and replace just that snapshot in the middle or even at the root of the stack. Perhaps even better, with ZDB or some other tools I might determine which blocks have rotted and which TXG they belonged to, and I'd "fence" that TXG on the source and destination systems with the proposed "injected snapshots". Older and newer snapshots around this TXG range would provide incremental changes to data, as they normally do, and I'd only quickly replace a small intermediate snapshot.
All this needs is a couple of not-yet-existing features...

PS: I think this idea might even have some "business case" foundation for active-passive clusters with zfs send updating a passive cluster node. Whenever a scrub on one of the systems finds an unrecoverable block in older data, the node might request "just it" from the other head. Likewise for backups to removable media, etc. If we already have ZFS-based storage similar to an out-of-sync mirror, why not use the available knowledge of known-good blocks to repair detected {small} errors in large volumes of "same" data?

What do you think?..

//Jim Klimov
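To illustrate the definitional point at the start of this post, here is a toy model (plain Python, not ZFS code; all names are mine) of the rule that makes a snapshot "just" a cut-off TXG: when a block is freed from the live dataset, it is kept on a deadlist if its birth TXG is at or below the newest snapshot's TXG, because that snapshot still references it.

```python
# Toy model (not ZFS code): a snapshot is essentially a cut-off TXG.
# A block freed in the live dataset is retained on a deadlist if its
# birth TXG is <= the most recent snapshot's TXG (the snapshot still
# references it); otherwise it goes back to the space allocator.

class Dataset:
    def __init__(self):
        self.snapshots = []      # TXG numbers of snapshots, ascending
        self.deadlist = []       # birth TXGs of blocks held only by snapshots
        self.free_pool = []      # birth TXGs of blocks returned for reuse

    def snapshot(self, current_txg):
        self.snapshots.append(current_txg)

    def free_block(self, birth_txg):
        if self.snapshots and birth_txg <= self.snapshots[-1]:
            self.deadlist.append(birth_txg)   # still visible in a snapshot
        else:
            self.free_pool.append(birth_txg)  # no snapshot holds it: reuse

ds = Dataset()
ds.snapshot(current_txg=100)
ds.free_block(birth_txg=90)    # born before the snapshot: kept on deadlist
ds.free_block(birth_txg=150)   # born after the snapshot: truly freed
```

The "injected snapshot" idea amounts to choosing a cut-off TXG in the past; the catch, discussed later in the thread, is whether blocks born before that TXG and freed since then still exist on disk.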
Jim Klimov
2012-Jan-13 05:23 UTC
[zfs-discuss] Injection of ZFS snapshots into existing data, and replacement of older snapshots with zfs recv without truncating newer ones
2012-01-13 7:26, Steve Gonczi wrote:
> Jim,
>
> Any modified block (in absence of a snapshot) gets re-written
> to a new location and the original block is freed.
>
> So the earlier state you want to go back and snapshot is no longer there.
>
> The essence of taking a snapshot is keeping the original blocks
> instead of freeing them.

Perhaps I need to specify some use cases more clearly:

1) A snapshot added in between existing snapshots, or even before the first one currently existing, just to facilitate incremental snapshot sends in small chunks over lousy media (where zfs send is likely to never succeed for huge datasets sent as one initial stream).

2) Cloning and/or rollback of a dataset at some point in time (TXG number) which I forgot to snapshot in a timely manner. Apparently, this would only work to ignore added data, since overwritten blocks would be lost. Exception: there is a "last chance" to reference the last 32-128 TXGs, whose uberblocks still exist in the ring. Say, 128 * 5 sec = 640 sec, roughly 10.7 minutes, of rollback info guaranteed not to be overwritten by ZFS COW. This would compensate for most of those "Oh sh*t, what have I done!?" moments of operator/admin errors, typos, etc. Injecting a snapshot into "3 minutes ago" would help retain data not actually deleted from disk while you go about repairing the damage ;) Perhaps this would even allow for undeletion of datasets which you never intended to destroy (notably, I had LU BE deletion trying to kill off my zone datasets some time around snv_101 or so; they were only saved by being mounted and running at the time).

3) Use along with the proposed replacement of existing snapshots (with degraded unreadable blocks) while maintaining the rest of the snapshot/clone tree. If this "technology" were to be implemented, injected snaps could naturally be used to "fence off" the corrupted area (TXG number range) and replace the resulting smaller corrupt snapshot with good data from another storage.
I hope it is not theoretically impossible to write this replacement snapshot in such a manner that the resulting sequence of block histories would still make sense as valid files. This block reallocation is not much different from autorepairs on resilver or scrub... I think :) Thanks, //Jim
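The uberblock-ring arithmetic in use case 2 above can be sketched as follows (a toy calculation, not ZFS code; the 128-slot ring and 5-second TXG commit interval are the values assumed in the post):

```python
# Toy model of the uberblock ring: each pool label keeps the last
# N uberblocks in a ring buffer, so the oldest still-referenced
# TXG trails the current one by at most N transaction groups.
RING_SLOTS = 128        # uberblock ring size assumed in the post
TXG_INTERVAL_SEC = 5    # assumed TXG commit interval

def rollback_window_seconds(slots=RING_SLOTS, interval=TXG_INTERVAL_SEC):
    # Worst-case wall-clock span covered by the ring.
    return slots * interval

def oldest_ringed_txg(current_txg, slots=RING_SLOTS):
    # The ring holds TXGs (current - slots + 1) .. current.
    return max(0, current_txg - slots + 1)

print(rollback_window_seconds())   # 640 seconds, about 10.7 minutes
print(oldest_ringed_txg(100000))   # 99873
```

As Matt points out later in the thread, the ring only guarantees that the uberblocks themselves survive, not that the blocks they reference do.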
Edward Ned Harvey
2012-Jan-13 13:24 UTC
[zfs-discuss] Injection of ZFS snapshots into existing data, and replacement of older snapshots with zfs recv without truncating newer ones
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> Perhaps I need to specify some use cases more clearly:

Actually, I'm not sure you do need to specify use cases more clearly, because the idea is obviously awesome. The main problem, if you're interested, is getting attention. Maybe it's more work than I know, but I agree with you; at first blush it doesn't sound like much work.

I think the most compelling use case you mentioned was the ability to resume an interrupted zfs send. It's one of those things where it's not super-super useful (most people are content with whatever snapshot and zfs send scheme they already have today), but if it's not much work, then maybe it's worthwhile anyway.

But there's a finite amount of development resource, and there are other features in higher demand (such as BP rewrite, etc). Why would Oracle or Nexenta care about devoting the effort? Maybe it's possible; maybe there just isn't enough motivation...
Matthew Ahrens
2012-Jan-16 19:14 UTC
[zfs-discuss] Injection of ZFS snapshots into existing data, and replacement of older snapshots with zfs recv without truncating newer ones
On Thu, Jan 12, 2012 at 5:00 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> While reading about zfs on-disk formats, I wondered once again
> why is it not possible to create a snapshot on existing data,
> not of the current TXG but of some older point-in-time?

It is not possible because the older data may no longer exist on-disk. For example, you want to take a snapshot from 10 txg's ago. But since then we have created a new file, which modified the containing directory. So we freed the directory block from 10 txg's ago. That freed block is then a candidate for reallocation.

Existence of old uberblocks in the ring buffer does not indicate that the data they reference is still valid. This is the reason that "zpool import -F" does not always work.

--matt
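Matt's point can be demonstrated with a small simulation (a toy model, not ZFS code; all names are mine): once copy-on-write frees the old copy of a block and the allocator reuses that address, the pointer held by an old uberblock silently refers to unrelated data.

```python
# Toy model: COW frees the old copy of a modified block; once the
# allocator reuses that address, an old uberblock's pointer into it
# is stale. This is why "zpool import -F" cannot always rewind.

disk = {}            # block address -> (birth_txg, payload)
free_addrs = []      # addresses available for reallocation

def write(addr, txg, data):
    disk[addr] = (txg, data)

def cow_overwrite(old_addr, new_addr, txg, data):
    write(new_addr, txg, data)
    free_addrs.append(old_addr)   # old copy freed (no snapshot holds it)

def reallocate(txg, data):
    addr = free_addrs.pop(0)      # allocator reuses a freed address
    write(addr, txg, data)
    return addr

write(addr=1, txg=10, data="dir-block-v1")
uberblock_txg10 = {"root": 1}     # old uberblock points at address 1
cow_overwrite(old_addr=1, new_addr=2, txg=11, data="dir-block-v2")
reallocate(txg=12, data="unrelated-file-data")

# Rewinding via the txg-10 uberblock now reads garbage:
birth, payload = disk[uberblock_txg10["root"]]
print(birth, payload)   # 12 unrelated-file-data, not the txg-10 directory
```

In real ZFS the birth-TXG mismatch (12 instead of 10) and the block checksum would expose the staleness, but the original data is gone either way.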
Jim Klimov
2012-Jan-16 19:34 UTC
[zfs-discuss] Injection of ZFS snapshots into existing data, and replacement of older snapshots with zfs recv without truncating newer ones
2012-01-16 23:14, Matthew Ahrens wrote:
> On Thu, Jan 12, 2012 at 5:00 PM, Jim Klimov <jimklimov at cos.ru
> <mailto:jimklimov at cos.ru>> wrote:
>
>     While reading about zfs on-disk formats, I wondered once again
>     why is it not possible to create a snapshot on existing data,
>     not of the current TXG but of some older point-in-time?
>
> It is not possible because the older data may no longer exist on-disk.
> For example, you want to take a snapshot from 10 txg's ago. But since
> then we have created a new file, which modified the containing
> directory. So we freed the directory block from 10 txg's ago. That
> freed block is then a candidate for reallocation.
>
> Existence of old uberblocks in the ring buffer does not indicate that
> the data they reference is still valid. This is the reason that "zpool
> import -F" does not always work.

Hmmm... the way I got it (but again, I have no prooflinks handy) was that ZFS "recently" got a deferred-reuse feature to guarantee just those rollbacks, basically. I am not sure which builds or distros that might be included in. If you authoritatively say it's not there (or not in illumos), I'm going to trust you ;)

What about injecting snapshots into static data, before at least one existing snapshot? Is that possible? I do get your point about missing older directory data and the possible invalidity of the snapshot as a ZPL dataset (and probably a bad basis for a writeable clone)... but let's call them checkpoints then, and limit their use to zfs send and fencing of erred ranges ;) Is that technically possible or logically reasonable?

Thanks,
//Jim
Matthew Ahrens
2012-Jan-16 20:39 UTC
[zfs-discuss] Injection of ZFS snapshots into existing data, and replacement of older snapshots with zfs recv without truncating newer ones
On Mon, Jan 16, 2012 at 11:34 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> 2012-01-16 23:14, Matthew Ahrens wrote:
>> On Thu, Jan 12, 2012 at 5:00 PM, Jim Klimov <jimklimov at cos.ru
>> <mailto:jimklimov at cos.ru>> wrote:
>>
>>     While reading about zfs on-disk formats, I wondered once again
>>     why is it not possible to create a snapshot on existing data,
>>     not of the current TXG but of some older point-in-time?
>>
>> It is not possible because the older data may no longer exist on-disk.
>> For example, you want to take a snapshot from 10 txg's ago. But since
>> then we have created a new file, which modified the containing
>> directory. So we freed the directory block from 10 txg's ago. That
>> freed block is then a candidate for reallocation.
>>
>> Existence of old uberblocks in the ring buffer does not indicate that
>> the data they reference is still valid. This is the reason that "zpool
>> import -F" does not always work.
>
> Hmmm... the way I got it (but again have no prooflinks handy)
> was that ZFS "recently" got a deferred-reuse feature to just
> guarantee those rollbacks, basically. I am not sure which
> builds or distros that might be included in.
>
> If you authoritatively say it's not there (or not in illumos),
> I'm going to trust you ;)

It's definitely not there in Illumos. See TXG_DEFER_SIZE. There was talk of changing it at Oracle; I don't know if that ever happened. If you have an S11 system you could probably use mdb to look at the size of the ms_defermap.

--matt
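For reference, the deferred-free behavior under discussion can be sketched like this (a toy model of the idea behind TXG_DEFER_SIZE, not the actual metaslab code; in illumos TXG_DEFER_SIZE is 2, far short of the 32-128 TXGs the rollback idea would need):

```python
# Toy model of deferred frees: space freed in txg N is not handed
# back to the allocator until txg N + TXG_DEFER_SIZE has synced,
# so only the last couple of TXGs are guaranteed rewindable.
TXG_DEFER_SIZE = 2   # value in illumos (sys/txg.h)

class Metaslab:
    def __init__(self):
        self.defermap = {}        # txg of free -> list of freed addresses
        self.allocatable = set()  # addresses the allocator may hand out

    def free(self, addr, txg):
        self.defermap.setdefault(txg, []).append(addr)

    def sync(self, current_txg):
        # Frees from sufficiently old TXGs graduate to allocatable space.
        for txg in list(self.defermap):
            if txg <= current_txg - TXG_DEFER_SIZE:
                self.allocatable.update(self.defermap.pop(txg))

ms = Metaslab()
ms.free(addr=7, txg=100)
ms.sync(current_txg=101)
print(7 in ms.allocatable)   # False: still within the defer window
ms.sync(current_txg=102)
print(7 in ms.allocatable)   # True: reusable once the window passes
```

So the defer window protects crash rewind across a couple of TXGs, not the minutes-long rollback window the earlier posts hoped the uberblock ring would provide.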