Kai Krakow
2013-May-05 10:07 UTC
Possible to deduplicate read-only snapshots for space-efficient backups
Hey list,

I wonder if it is possible to deduplicate read-only snapshots.

Background:

I'm using a bash/rsync script[1] to back up my whole system on a nightly
basis to an attached USB3 drive into a scratch area, then take a snapshot
of this area. I'd like to have these snapshots immutable, so they should
be read-only.

Since rsync won't discover moved files but instead places a new copy of
them in the backup, I'm running the wonderful bedup application[2] to
deduplicate my backup drive from time to time, and it almost always gains
back a good pile of gigabytes. The rest of the storage space issue is
taken care of by using rsync's --inplace option (although this won't cover
the case of files moved and changed between backup runs) and by using
compress-force=gzip.

Since bedup sets the immutable attribute while touching the files, I
suspect the process will no longer work once I make the snapshots
read-only.

I've read about ongoing work to integrate offline (and even online)
deduplication into the kernel so that this process can be made atomic (and
even block-based instead of file-based). To my understanding, this would
mean the immutable attribute is no longer needed. So, given the above, and
in case read-only snapshots currently cannot be used for this, will these
patches address the problem so that read-only snapshots can be
deduplicated? Or are read-only snapshots meant to be what the name
suggests: immutable, even for deduplication?

Regards,
Kai

[1]: https://gist.github.com/kakra/5520370
[2]: https://github.com/g2p/bedup
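PS: For anyone who doesn't want to read the gist, the script boils down to
something like this (paths and excludes are illustrative placeholders, not
the exact ones from [1]):

    #!/bin/bash
    # Nightly cycle: rsync the running system into a scratch subvolume on
    # the USB3 drive, then freeze the result as a read-only snapshot.
    SRC=/
    SCRATCH=/mnt/backup/scratch       # btrfs subvolume on the backup drive
    SNAPDIR=/mnt/backup/snapshots

    rsync -aHAX --inplace --delete \
        --exclude=/dev --exclude=/proc --exclude=/sys \
        --exclude=/tmp --exclude=/run --exclude=/mnt \
        "$SRC" "$SCRATCH/"

    # -r makes the snapshot read-only; this is exactly where bedup's
    # immutable-attribute handling might get in the way later.
    btrfs subvolume snapshot -r "$SCRATCH" "$SNAPDIR/$(date +%Y-%m-%d)"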
Gabriel de Perthuis
2013-May-05 12:55 UTC
Re: Possible to deduplicate read-only snapshots for space-efficient backups
On Sun, 05 May 2013 12:07:17 +0200, Kai Krakow wrote:
> I've read about ongoing work to integrate offline (and even online)
> deduplication into the kernel so that this process can be made atomic
> (and even block-based instead of file-based). To my understanding, this
> would mean the immutable attribute is no longer needed. So, given the
> above, and in case read-only snapshots currently cannot be used for
> this, will these patches address the problem so that read-only snapshots
> can be deduplicated? Or are read-only snapshots meant to be what the
> name suggests: immutable, even for deduplication?

There's no deep reason read-only snapshots should keep their storage
immutable; they can be affected by raid rebalancing, for example.

The current bedup restriction comes from the clone call; Mark Fasheh's
dedup ioctl[3] appears to be fine with snapshots. The bedup integration
(in a branch) is a work in progress at the moment. I need to fix a scan
bug, tweak parameters for the latest kernel dedup patch, remove a lot of
logic that is now unnecessary, and figure out the compatibility story.

[3]: http://comments.gmane.org/gmane.comp.file-systems.btrfs/25062
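PS: For reference, running bedup against a volume is a one-liner; the
cutoff flag below is spelled the way I remember it from the help output,
so double-check it against the version you actually have installed:

    # Scan and deduplicate identical files on a mounted btrfs volume.
    bedup dedup /mnt/backup

    # Files smaller than the cutoff are skipped during scanning.
    bedup dedup --size-cutoff=1048576 /mnt/backup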
Kai Krakow
2013-May-05 17:22 UTC
Re: Possible to deduplicate read-only snapshots for space-efficient backups
Gabriel de Perthuis <g2p.code@gmail.com> schrieb:

> There's no deep reason read-only snapshots should keep their storage
> immutable; they can be affected by raid rebalancing, for example.

Sounds logical, and good...

> The current bedup restriction comes from the clone call; Mark Fasheh's
> dedup ioctl[3] appears to be fine with snapshots. The bedup integration
> (in a branch) is a work in progress at the moment. I need to fix a scan
> bug, tweak parameters for the latest kernel dedup patch, remove a lot of
> logic that is now unnecessary, and figure out the compatibility story.

I'd be eager to test as soon as the patches arrive in the mainline kernel.

Do you plan to support deduplication at a finer granularity than the file
level? As an example, it could be interesting to deduplicate 1M blocks of
huge files. Backups of VM images come to mind as a good candidate. While
my current backup script[1] takes care of this by using "rsync --inplace",
it won't consider files moved between two backup cycles. This is the main
purpose I'm using bedup for on my backup drive.

Maybe you could define another cutoff value above which huge files are
considered for block-level deduplication?

Regards,
Kai

[1]: https://gist.github.com/kakra/5520370
Jan Schmidt
2013-May-06 06:15 UTC
Re: Possible to deduplicate read-only snapshots for space-efficient backups
On Sun, May 05, 2013 at 12:07 (+0200), Kai Krakow wrote:
> I'm using a bash/rsync script[1] to back up my whole system on a nightly
> basis to an attached USB3 drive into a scratch area, then take a snapshot
> of this area. I'd like to have these snapshots immutable, so they should
> be read-only.

Have you considered using btrfs send / receive for that purpose? You would
just save the dedup step.

-Jan
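Roughly like this (paths are illustrative; the -p incremental form is what
saves you the dedup step, since unchanged extents stay shared with the
parent snapshot on the receiving side):

    # First run: snapshot the source read-only and send it in full.
    btrfs subvolume snapshot -r / /snapshots/root-a
    btrfs send /snapshots/root-a | btrfs receive /mnt/usb/backups

    # Later runs: send only the difference against the previous snapshot.
    # Received snapshots are read-only on the target by design.
    btrfs subvolume snapshot -r / /snapshots/root-b
    btrfs send -p /snapshots/root-a /snapshots/root-b \
        | btrfs receive /mnt/usb/backups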
Kai Krakow
2013-May-06 07:44 UTC
Re: Possible to deduplicate read-only snapshots for space-efficient backups
Jan Schmidt <list.btrfs@jan-o-sch.net> schrieb:

>> I'm using a bash/rsync script[1] to back up my whole system on a nightly
>> basis to an attached USB3 drive into a scratch area, then take a snapshot
>> of this area. I'd like to have these snapshots immutable, so they should
>> be read-only.
>
> Have you considered using btrfs send / receive for that purpose? You would
> just save the dedup step.

This is planned for later. In the first step I want to stay as
filesystem-agnostic on the source side as possible. But I've put it on my
todo list in the gist.

Regards,
Kai
james northrup
2013-May-06 14:35 UTC
Re: Possible to deduplicate read-only snapshots for space-efficient backups
tried a git based backup? sounds spot-on as a compromise prior to applying
btrfs tweaks. snapshotting the git binaries would have the dedupe
characteristics.

On Mon, May 6, 2013 at 12:44 AM, Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:
> Jan Schmidt <list.btrfs@jan-o-sch.net> schrieb:
>
>>> I'm using a bash/rsync script[1] to back up my whole system on a
>>> nightly basis to an attached USB3 drive into a scratch area, then take
>>> a snapshot of this area. I'd like to have these snapshots immutable,
>>> so they should be read-only.
>>
>> Have you considered using btrfs send / receive for that purpose? You
>> would just save the dedup step.
>
> This is planned for later. In the first step I want to stay as
> filesystem-agnostic on the source side as possible. But I've put it on my
> todo list in the gist.
>
> Regards,
> Kai
Kai Krakow
2013-May-06 20:48 UTC
Re: Possible to deduplicate read-only snapshots for space-efficient backups
james northrup <northrup.james@gmail.com> schrieb:

> tried a git based backup? sounds spot-on as a compromise prior to
> applying btrfs tweaks. snapshotting the git binaries would have the
> dedupe characteristics.

Git is efficient with space, yes. But if you have a lot of binary files,
and a lot of them are big, git becomes really slow really fast. Checking
files out and in can be very slow and resource-intensive then. And I don't
think it would track ownership and permissions correctly. Git is great,
it's an everyday tool for me, but it is just not made for binary files.

Regards,
Kai
Gabriel de Perthuis
2013-May-07 22:07 UTC
Re: Possible to deduplicate read-only snapshots for space-efficient backups
> Do you plan to support deduplication at a finer granularity than the
> file level? As an example, it could be interesting to deduplicate 1M
> blocks of huge files. Backups of VM images come to mind as a good
> candidate. While my current backup script[1] takes care of this by using
> "rsync --inplace", it won't consider files moved between two backup
> cycles. This is the main purpose I'm using bedup for on my backup drive.
>
> Maybe you could define another cutoff value above which huge files are
> considered for block-level deduplication?

I'm considering deduplicating aligned blocks of large files sharing the
same size (VMs with the same baseline; those would ideally come pre-cowed,
but rsync or scp could have broken that). It sounds simple, and was
sort-of prompted by the new syscall taking short ranges, but it is tricky
to figure out a sane heuristic (when to hash, when to bail, when to submit
without comparing, what the source should be in the last case), and it's
not something I have an immediate need for. It is also possible to use 9p
(with standard cow and/or small-file dedup) and trade a bit of
configuration for much more space-efficient VMs.

Finer-grained tracking of which ranges have changed, and maybe some
caching of range hashes, would be a good first step before doing any crazy
large-file heuristics. The hash caching would actually benefit all use
cases.
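Stripped of all heuristics, the aligned-block idea is just this (a sketch
assuming two same-sized files and a fixed 1 MiB block size; a real
implementation would sample and bail early instead of hashing everything):

    #!/bin/bash
    # Hash corresponding 1 MiB blocks of two same-sized files; blocks
    # with matching digests are candidates for the range-dedup call.
    a="$1"; b="$2"
    size=$(stat -c%s "$a")
    [ "$size" -eq "$(stat -c%s "$b")" ] || { echo "sizes differ"; exit 1; }

    blocks=$(( (size + 1048575) / 1048576 ))
    for ((i = 0; i < blocks; i++)); do
        ha=$(dd if="$a" bs=1M skip="$i" count=1 2>/dev/null | sha256sum)
        hb=$(dd if="$b" bs=1M skip="$i" count=1 2>/dev/null | sha256sum)
        [ "$ha" = "$hb" ] && echo "block $i: dedup candidate"
    done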
Kai Krakow
2013-May-07 23:04 UTC
Re: Possible to deduplicate read-only snapshots for space-efficient backups
Gabriel de Perthuis <g2p.code@gmail.com> schrieb:

> It sounds simple, and was sort-of prompted by the new syscall taking
> short ranges, but it is tricky to figure out a sane heuristic (when to
> hash, when to bail, when to submit without comparing, what the source
> should be in the last case), and it's not something I have an immediate
> need for. It is also possible to use 9p (with standard cow and/or
> small-file dedup) and trade a bit of configuration for much more
> space-efficient VMs.
>
> Finer-grained tracking of which ranges have changed, and maybe some
> caching of range hashes, would be a good first step before doing any
> crazy large-file heuristics. The hash caching would actually benefit
> all use cases.

Looking back to the good old peer-to-peer days (I think we all got in
touch with that one way or the other), one term pops back into my mind:
tiger-tree-hash...

I'm not really into it, but would it be possible to use tiger-tree hashes
to find identical blocks? Even across different-sized files...

Regards,
Kai
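PS: As far as I remember, the scheme hashes fixed-size leaf blocks and
then hashes the hashes, so identical leaf digests across files would point
at identical blocks. A toy version (sha256 standing in for Tiger, and a
flat list of leaves instead of a real pairwise tree):

    #!/bin/bash
    # Hash each 1 MiB leaf block of a file, then derive a root digest
    # from the list of leaf hashes. Equal leaf digests across files mark
    # identical blocks; the root digest identifies the whole file.
    f="$1"
    size=$(stat -c%s "$f")
    blocks=$(( (size + 1048575) / 1048576 ))

    leaves=""
    for ((i = 0; i < blocks; i++)); do
        h=$(dd if="$f" bs=1M skip="$i" count=1 2>/dev/null \
            | sha256sum | cut -d' ' -f1)
        leaves+="$h"$'\n'
        echo "leaf $i: $h"
    done
    printf '%s' "$leaves" | sha256sum | cut -d' ' -f1   # root digest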
Kai Krakow
2013-May-07 23:22 UTC
Re: Possible to deduplicate read-only snapshots for space-efficient backups
Kai Krakow <hurikhan77+btrfs@gmail.com> schrieb:

> Gabriel de Perthuis <g2p.code@gmail.com> schrieb:
>
>> It sounds simple, and was sort-of prompted by the new syscall taking
>> short ranges, but it is tricky to figure out a sane heuristic (when to
>> hash, when to bail, when to submit without comparing, what the source
>> should be in the last case), and it's not something I have an immediate
>> need for. It is also possible to use 9p (with standard cow and/or
>> small-file dedup) and trade a bit of configuration for much more
>> space-efficient VMs.
>>
>> Finer-grained tracking of which ranges have changed, and maybe some
>> caching of range hashes, would be a good first step before doing any
>> crazy large-file heuristics. The hash caching would actually benefit
>> all use cases.
>
> Looking back to the good old peer-to-peer days (I think we all got in
> touch with that one way or the other), one term pops back into my mind:
> tiger-tree-hash...
>
> I'm not really into it, but would it be possible to use tiger-tree hashes
> to find identical blocks? Even across different-sized files...

While thinking about it: that hash was probably invented for the purpose
of distributing the same content to multiple peers in deltas as small as
possible. Deduplication is somehow the other way around: coalescing all
those widely distributed copies back into a single source of content. So
some "inverse" of tiger-tree would probably work better / more
efficiently.

Regards,
Kai
Gabriel de Perthuis
2013-May-07 23:35 UTC
Re: Possible to deduplicate read-only snapshots for space-efficient backups
On Wed, 08 May 2013 01:04:38 +0200, Kai Krakow wrote:
> Looking back to the good old peer-to-peer days (I think we all got in
> touch with that one way or the other), one term pops back into my mind:
> tiger-tree-hash...
>
> I'm not really into it, but would it be possible to use tiger-tree hashes
> to find identical blocks? Even across different-sized files...

Possible, but bedup is all about doing as little io as it can get away
with: it does streaming reads only once sampling suggests the files are
likely duplicates, and it doesn't spend a ton of disk space on indexing.
Hashing everything in the hope that there are identical blocks at
unrelated places on the disk is a much more resource-intensive approach;
Liu Bo is working on that, following ZFS's design choices.
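The sampling idea, reduced to a sketch (offsets and sample size are
arbitrary here, not bedup's actual values):

    #!/bin/bash
    # Cheap pre-filter: compare a few short samples of two files before
    # committing to a full streaming hash and compare.
    a="$1"; b="$2"
    for off in 0 4096 1048576; do
        sa=$(dd if="$a" bs=1 skip="$off" count=64 2>/dev/null | od -An -tx1)
        sb=$(dd if="$b" bs=1 skip="$off" count=64 2>/dev/null | od -An -tx1)
        if [ "$sa" != "$sb" ]; then
            echo "samples differ at offset $off, skipping full comparison"
            exit 1
        fi
    done
    echo "samples match, worth a full hash and compare"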