I realise that the hard link limit is in the queue to be fixed, and I have read
the recent thread as well as the older (October, I think) thread. I just wanted
to note that BackupPC *does* in fact run into the hard link limit, and it's due
to the dpkg configuration scripts. BackupPC hard links files with the same
content together by scanning new files and linking them together, whether or
not they started as hard links on the backed-up source PCs. It also builds a
directory structure precisely matching the source machine (basically it rsyncs
across, then hardlinks aggressively).

If you back up a Debian host, /var/lib/dpkg/info contains many identical files,
because debhelper generates the same script in the common case:

  ls /var/lib/dpkg/info/*.postinst | xargs -n1 sha1sum | awk '{ print $1 }' | sort -u | wc -l
  862
  ls /var/lib/dpkg/info/*.postinst | wc -l
  1533

As I say, I realise this is queued to be addressed anyway, but it seems like a
realistic thing for people to do (use BackupPC on btrfs) - even if something
better can still be written to replace the BackupPC store in the future. I will
note, though, that simple snapshots won't achieve the deduplication level that
BackupPC does, because the files don't start out the same: they are identified
as being identical post-backup.

Cheers,
Rob
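For illustration, a minimal shell sketch of the "hash the content, hard link
the duplicates" scheme described above - this is not BackupPC's actual pool
layout, /srv/backups is an invented path, and it glosses over file metadata
and unusual file names:

  # collapse files with identical content under the (example) pool root
  # onto a single inode by hard linking duplicates to the first copy seen
  cd /srv/backups
  find . -type f -print0 | xargs -0 sha1sum | sort | \
  while read hash path; do
      if [ "$hash" = "$prev_hash" ]; then
          ln -f "$prev_path" "$path"   # same content: replace with a hard link
      else
          prev_hash=$hash
          prev_path=$path
      fi
  done

When many identical files sit in the same directory, as with the
debhelper-generated postinst scripts above, a pass like this creates many links
to one inode from a single directory, which is exactly where the per-directory
hard link limit bites.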
Hubert Kario
2010-Mar-02 13:09 UTC
Re: BackupPC, per-dir hard link limit, Debian packaging
On Tuesday 02 March 2010 03:29:05 Robert Collins wrote:
> As I say, I realise this is queued to be addressed anyway, but it seems
> like a realistic thing for people to do (use BackupPC on btrfs) - even
> if something better can still be written to replace the BackupPC store
> in the future. I will note, though, that simple snapshots won't achieve
> the deduplication level that BackupPC does, because the files don't start
> out the same: they are identified as being identical post-backup.

Isn't the main idea behind deduplication to merge identical parts of files
together using COW? That way you could have many very similar images of
virtual machines, run the deduplication process and massively reduce the space
used while maintaining the differences between images.

If memory serves me right, the plan is to do it in userland, after the fact on
an existing filesystem, not while the data is being written. If such a daemon
or program were available, you would run it on the system after rsyncing the
workstations.

The question remains, though, which approach would reduce space usage more in
your use case. In my experience hard links take less space on disk; I don't
know whether it would be possible to optimise the btrfs COW mechanism for
files that are exactly the same.

>
> Cheers,
> Rob
>

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
Quality Management System compliant with ISO 9001:2000
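As a rough illustration of the shared-extents idea (the file names here are
invented), reflinked copies already behave this way on btrfs today:

  # an instant copy that shares all extents with the original; later writes
  # to either file only allocate space for the blocks that actually change
  cp --reflink=always base-image.img clone.img

An after-the-fact dedup pass would do the reverse: find data that is already
identical and make it share extents again, which is what such a userland
daemon would be for.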
Hubert Kario wrote:
> On Tuesday 02 March 2010 03:29:05 Robert Collins wrote:
>> As I say, I realise this is queued to be addressed anyway, but it seems
>> like a realistic thing for people to do (use BackupPC on btrfs) - even
>> if something better can still be written to replace the BackupPC store
>> in the future. I will note, though, that simple snapshots won't achieve
>> the deduplication level that BackupPC does, because the files don't start
>> out the same: they are identified as being identical post-backup.
>
> Isn't the main idea behind deduplication to merge identical parts of files
> together using COW? That way you could have many very similar images of
> virtual machines, run the deduplication process and massively reduce the
> space used while maintaining the differences between images.
>
> If memory serves me right, the plan is to do it in userland, after the fact
> on an existing filesystem, not while the data is being written. If such a
> daemon or program were available, you would run it on the system after
> rsyncing the workstations.
>
> The question remains, though, which approach would reduce space usage more
> in your use case. In my experience hard links take less space on disk; I
> don't know whether it would be possible to optimise the btrfs COW mechanism
> for files that are exactly the same.

Space use is not the key difference between these methods. Btrfs COW makes
data sharing safe. With the hard link method, changing a file changes the
content seen through all of its linked names.

So a BackupPC output should be treated as read-only.

jim
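A throwaway demonstration of the difference jim points out (run on a btrfs
mount; the file names are just examples):

  echo v1 > original
  ln original hardlinked                 # same inode: one copy of the data, two names
  cp --reflink=always original cowcopy   # shared extents, but a separate inode
  echo v2 > original                     # rewrite the file in place
  cat hardlinked                         # prints "v2" - the hard-linked copy changed too
  cat cowcopy                            # prints "v1" - the COW copy kept its own content

Which is why a hard-link-based store only stays correct as long as nothing
writes through any of the linked names.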
Hubert Kario
2010-Mar-03 00:05 UTC
Re: BackupPC, per-dir hard link limit, Debian packaging
On Wednesday 03 March 2010 00:22:31 jim owens wrote:
> Hubert Kario wrote:
> > On Tuesday 02 March 2010 03:29:05 Robert Collins wrote:
> > > As I say, I realise this is queued to be addressed anyway, but it seems
> > > like a realistic thing for people to do (use BackupPC on btrfs) - even
> > > if something better can still be written to replace the BackupPC store
> > > in the future. I will note, though, that simple snapshots won't achieve
> > > the deduplication level that BackupPC does, because the files don't
> > > start out the same: they are identified as being identical post-backup.
> >
> > Isn't the main idea behind deduplication to merge identical parts of
> > files together using COW? That way you could have many very similar
> > images of virtual machines, run the deduplication process and massively
> > reduce the space used while maintaining the differences between images.
> >
> > If memory serves me right, the plan is to do it in userland, after the
> > fact on an existing filesystem, not while the data is being written. If
> > such a daemon or program were available, you would run it on the system
> > after rsyncing the workstations.
> >
> > The question remains, though, which approach would reduce space usage
> > more in your use case. In my experience hard links take less space on
> > disk; I don't know whether it would be possible to optimise the btrfs
> > COW mechanism for files that are exactly the same.
>
> Space use is not the key difference between these methods. Btrfs COW makes
> data sharing safe. With the hard link method, changing a file changes the
> content seen through all of its linked names.
>
> So a BackupPC output should be treated as read-only.

I know that, but if you're using "dumb" tools such as rsync to replicate
systems, you don't want them to overwrite different versions of files, and
you still want to reclaim the disk space used by what is essentially the same
data.

My idea behind using btrfs as backup storage, with COW rather than hard links
for duplicated files, comes from the need to keep archival copies (something
not really possible with hard links) in a way similar to rdiff-backup. For the
first backup I just rsync to the backup server from all workstations. On
subsequent backups I copy the last version to a .snapshot/todays-date
directory using COW, rsync from the workstations, and then run a deduplication
daemon. This way I get both reduced storage and old copies (handy for user
home directories...).

With such a use case, the ability to use COW while needing amounts of space
similar to hard links would be at least useful, if not very desirable. That's
why I asked whether it is possible to optimise the btrfs COW mechanism for
identical files. From my testing (a directory 584MiB in size with 17395 files;
Arch kernel 2.6.32.9, coreutils 8.4, btrfs-progs 0.19, 10GiB partition,
default mkfs and mount options):

  cp -al                   free space decrease: 6176KiB
  cp -a --reflink=always   free space decrease: 23296KiB

and in the second run:

  cp -al                   free space decrease: 6064KiB
  cp -a --reflink=always   free space decrease: 23324KiB

That's nearly 4 times more!

-- 
Hubert Kario
QBS - Quality Business Software
ul. Ksawerów 30/85
02-656 Warszawa
POLAND
tel. +48 (22) 646-61-51, 646-74-24
fax +48 (22) 646-61-50
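For concreteness, a sketch of the backup cycle described above. The host and
path names are invented, the rsync source is only an example, and the dedup
step stands in for the hypothetical daemon discussed earlier:

  DEST=/srv/backups/workstation1
  TODAY=$(date +%F)

  # keep the previous state as a COW snapshot of the working copy
  mkdir -p "$DEST/.snapshot"
  cp -a --reflink=always "$DEST/current" "$DEST/.snapshot/$TODAY"

  # pull the current data from the workstation over the working copy
  rsync -aH --delete workstation1:/home/ "$DEST/current/"

  # an offline dedup pass would then re-share whatever rsync rewrote
  # dedup-daemon "$DEST"    # hypothetical tool, does not exist yet

The free-space numbers above measure exactly the snapshot step: cp -al versus
cp -a --reflink=always over the same tree.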