Jim Klimov
2011-Oct-29 17:57 UTC
[zfs-discuss] (Incremental) ZFS SEND at sub-snapshot level
Hello all,

I am catching up with some 500 posts that I skipped this summer, and came up with a new question. In short, is it possible to add "restartability" to ZFS SEND, for example by adding artificial snapshots (of configurable increment size) into already existing datasets [too large to be zfs-sent successfully as one chunk of stream data]? I'll start with the prehistory of this question, and continue with the detailed idea below.

On one hand, there was a post about a T2000 system kernel panicking while trying to import a pool. It was probable that the pool was receiving a large (3Tb) zfs send stream, and the receive was aborted due to some external issues. Afterwards the pool apparently got into a cycle of trying to destroy the received part of the stream during each pool import attempt, exhausted all RAM and hung the server. From my experience reported this spring to the forums (alas, which are now gone - and the forums-to-mail replication did not work at that time) and to the Illumos bugtracker, I hope that the OP's pool did get imported after a few weeks of power cycles. I had different conditions (destroying some snapshots and datasets on a deduped pool) with a similar effect.

On the other hand, there was a discussion (actually, lots of them) about "rsync vs. zfs send". My new question couples these threads. I know it has been discussed a number of times that ZFS SEND is more efficient at finding differences and sending updates than a filesystem crawl that calculates checksums all over again. However, RSYNC has an important benefit of being restartable. As shown by the first post I mentioned, a broken ZFS SEND operation can lead to long downtimes. With sufficiently large increments (i.e. the initial stream of a large dataset), low bandwidths and a high probability of network errors or power glitches, it may even be guaranteed that a single ZFS SEND operation never transfers enough data to complete; for example, when replicating 3Tb over a few-Kbps subscriber-level internet link which is reset every 24 hours for the ISP's traffic accounting reasons. By contrast, it is easy to construct an rsync loop which would transfer all the files after several weeks of hard work (a sketch of such a loop follows at the end of this post). But that would not be a ZFS-snapshot replica, so further updates cannot be made via ZFS SEND either - locking the user into rsync loops forever.

Now, I wondered if it is possible to embed snapshots (or some similar construct) into existing data, for the purpose of keeping tabs during zfs send and zfs recv? For example, the same existing 3Tb dataset could be artificially pre-represented as a horde of snapshots, each utilizing 1Gb of disk space, with valid ZFS incremental sends over whatever network link we have. However, unlike zfs-auto-snap, these snapshots would not really have appeared on-disk while the dataset was being written (historically). Instead, they would be patched on by the admins after the factual data appeared on disk, before the ZFS SEND.

Alternatively, if the ZFS SEND is detected to have been broken, the sending side might set a "tab" at the offset where it last read the sent data. The receiver (upon pool import or whatever other recovery) would also set such a tab, instead of destroying the broken snapshot (which may take weeks and lots of downtime, as proved by several reports on this list, including mine) and restarting from scratch - likely doomed to be broken as well.
In terms of code this would probably be like the normal "zfs snapshot" mixed with the reverse of "zfs destroy @snapshot", meaning that some existing blocks would be reassigned as "owned" by a newly embedded snapshot instead of being "owned" by the live dataset or some more recent snapshot...

//Jim
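A minimal sketch of the rsync loop mentioned above - assuming ssh transport and a GNU rsync with --partial support; hosts and paths are illustrative:

    #!/bin/sh
    # Retry rsync until one full pass completes successfully.
    # --partial keeps partially-transferred files between attempts,
    # so each retry resumes roughly where the previous one died.
    SRC=/tank/bigdataset/
    DST=backuphost:/backup/bigdataset/
    until rsync -a --partial --timeout=300 "$SRC" "$DST"; do
        echo "rsync interrupted, retrying in 60 seconds..." >&2
        sleep 60
    done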
Edward Ned Harvey
2011-Oct-29 22:14 UTC
[zfs-discuss] (Incremental) ZFS SEND at sub-snapshot level
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> summer, and came up with a new question. In short, is it
> possible to add "restartability" to ZFS SEND, for example

Rather than building something new and special into the filesystem, would something like a restartable/continuable mbuffer command do the trick? It seems to be a general issue, not filesystem specific - that you want to tunnel some command or data stream through a buffering (perhaps even checksumming/error-detecting/correcting) system, to make it more resilient crossing a WAN or whatever. There is probably already a utility like that. I quickly checked mbuffer to see if it did, but it didn't seem to. I didn't look very deeply; I could be wrong.
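For reference, the usual way mbuffer is paired with zfs send/recv - a sketch only; the hostname, port, and buffer sizes are illustrative:

    # On the receiving host: listen on a port, buffer up to 1 GB in RAM,
    # and feed the stream into zfs recv.
    mbuffer -s 128k -m 1G -I 9090 | zfs recv tank/copy

    # On the sending host: stream the snapshot through a local buffer
    # out to the receiver.
    zfs send tank/data@snap1 | mbuffer -s 128k -m 1G -O backuphost:9090

This smooths out stalls and bursts on either end, but as noted above, if the connection itself dies the whole stream must be restarted from the beginning - mbuffer keeps no on-disk state to resume from.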
Jim Klimov
2011-Oct-30 19:11 UTC
[zfs-discuss] (Incremental) ZFS SEND at sub-snapshot level
2011-10-30 2:14, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>>
>> summer, and came up with a new question. In short, is it
>> possible to add "restartability" to ZFS SEND, for example
>
> Rather than building something new and special into the filesystem, would
> something like a restartable/continuable mbuffer command do the trick?

Well, it is true that for the purpose of sending a replication stream over a flaky network, some sort of restartable buffer program might suffice. If one or both machines were rebooted in the process, however, this would get us into the situation where all the incomplete-snapshot data was sent in vain, and the receiver has to destroy that data, which may even get it to crash during pool import. Afterwards the send attempt has to be made again, and if the conditions were such that any attempt is likely to fail - it likely will. Not all of our machines live in ivory-tower datacenters ;)

Per Paul Kraus (who recently wrote about similar problems):
> Uhhh, not being able to destroy snapshots that are "too big"
> is a pretty big one for us

Inserting artificial snapshots into existing datasets (perhaps including the inheritance tree of "huge incomplete snapshots" such as we can see now) might also allow us to destroy an unneeded dataset with less strain on the system, piece by piece. Perhaps even without causing a loop of kernel panics, wow! ;)

The way I see it, this feature would help solve (or work around) at least two problems. To me these problems are substantial; perhaps to others, like Paul, too. Because of the highly probable failure of a single unit of ZFS-SEND replication, I am bound to not use it at all. I also have to plan the destruction of datasets on my home rig (which was tainted with dedup) and expect weeks of downtime while the system is reset over and over to crawl through the blocks being released after a large delete...

//Jim
Jim Klimov
2011-Oct-30 19:14 UTC
[zfs-discuss] (Incremental) ZFS SEND at sub-snapshot level
2011-10-29 21:57, Jim Klimov wrote:
> ... In short, is it
> possible to add "restartability" to ZFS SEND, for example
> by adding artificial snapshots (of configurable increment
> size) into already existing datasets [too large to be
> zfs-sent successfully as one chunk of stream data]?

On a side note: would this feature, like any other nice-to-have feature in ZFS, require The Mythical Block Pointer Rewrite (TM)? For no apparent reason yet, I'm already afraid so ;)

If this is the Holy Grail which everybody craves and nobody has seen, what is really the problem with making it happen? Some time ago I skimmed through an overview of "what would have to be done for it". Not being a hardcore ZFS programmer, I did not grasp what is so fundamentally difficult about the quest. So I still wonder whether it is impossible, or whether anyone is already working on it quietly? ;)

//Jim
Paul Kraus
2011-Oct-31 14:38 UTC
[zfs-discuss] (Incremental) ZFS SEND at sub-snapshot level
On Sat, Oct 29, 2011 at 1:57 PM, Jim Klimov <jimklimov at cos.ru> wrote:
> I am catching up with some 500 posts that I skipped this
> summer, and came up with a new question. In short, is it
> possible to add "restartability" to ZFS SEND, for example
> by adding artificial snapshots (of configurable increment
> size) into already existing datasets [too large to be
> zfs-sent successfully as one chunk of stream data]?

We addressed this by decreasing our snapshot interval from 1 day to 1 hour. We rarely have a snapshot bigger than a few GB now. I keep meaning to put together a snapshot script that takes a new snapshot when the amount of changed data reaches a certain point (for example, take a snapshot whenever the snapshot would contain 250 MB of data). Not enough round tuits with all the other broken stuff to fix :-(

-- 
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
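A rough sketch of the kind of script Paul describes - assuming a ZFS version that exposes the 'written' property (space changed since the most recent snapshot; this property postdates some of the platforms discussed here), with the dataset name, threshold, and polling interval as illustrative values:

    #!/bin/sh
    # Snapshot the dataset whenever ~250 MB has changed since the
    # last snapshot, checking once a minute.
    DS=tank/data
    THRESHOLD=$((250 * 1024 * 1024))    # bytes
    while true; do
        WRITTEN=$(zfs get -Hp -o value written "$DS")
        if [ "$WRITTEN" -ge "$THRESHOLD" ]; then
            zfs snapshot "$DS@auto-$(date +%Y%m%d-%H%M%S)"
        fi
        sleep 60
    done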
Matthew Ahrens
2011-Nov-05 01:22 UTC
[zfs-discuss] (Incremental) ZFS SEND at sub-snapshot level
On Sat, Oct 29, 2011 at 10:57 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> In short, is it
> possible to add "restartability" to ZFS SEND

In short, yes. We are working on it here at Delphix, and plan to contribute our changes upstream to Illumos. You can read more about it in the slides I link to in this blog post:

http://blog.delphix.com/matt/2011/11/01/zfs-10-year-anniversary/

--matt
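For the record, the resumable-send work Matt describes eventually shipped in OpenZFS; the interface looks roughly like this (a sketch only; dataset and host names are illustrative):

    # Receive with -s so that, if the stream is interrupted, the partial
    # state is kept along with a resume token instead of being destroyed.
    zfs send tank/data@snap1 | ssh backuphost zfs receive -s tank/copy

    # After an interruption, read the token from the receiving dataset...
    TOKEN=$(ssh backuphost zfs get -H -o value receive_resume_token tank/copy)

    # ...and restart the send from where it left off.
    zfs send -t "$TOKEN" | ssh backuphost zfs receive -s tank/copy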