Robert Bell
2014-Sep-15 06:24 UTC
Backup scripts - recycling old backup directories (Kevin Korb)
Kevin,

Thanks for the reply and interest in this topic.

Comments below.

Regards

Rob.

> I did consider that but rejected it for 2 reasons...
>
> 1. Backup run time. We have a 4 hour window to run backups at night.
> Using recycled directories significantly extended the backup run
> time. The deletion time is eliminated but frankly, we have the other
> 20 hours of the day to do deletions. We had to give up using
> --link-dest when the deletions started to actually take that long even
> though the backups still ran in under 4 hours.

For us, the recycling of old directories significantly shortened the time to do backups, since the recycled backups have typically 95% of the files/directories correct (with daily backups and Tower of Hanoi, half of our recycled backups are only 5 to 6 days old).

I've just done some tests with a fairly pathological case, all on one host.

I set up a source tree 's' with 11111 sub-directories and 10000 files, and then two destinations:

    cp -a s d1
    cp -afl d1 d2

I then did the first test:

    # rsync to a new directory, followed by a remove of an old directory.
    time rsync -a --link-dest=../d2 s/ d3
    time /bin/rm -rf d1

I then scrubbed the lot, set it up again, and did the second test:

    mv d1 d3
    # rsync to a recycled directory
    time rsync -a --link-dest=../d2 --delete s/ d3

I hope I got this right! I've made no effort to circumvent caching.

Anyway, here is a table of the average times (seconds) over 5 runs of each test.

              Real      User      Sys       (User+Sys)
    test 1    2.454s    0.150s    2.196s    2.346s
    test 2    0.392s    0.100s    0.572s    0.672s
    ratio     6.3       1.5       3.8       3.5

(The User+Sys time is pretty much invariant, even though in earlier tests the real time suffered major blowouts owing to contention.)

So, the big difference is that in test 1, the 11111 sub-directories and 10000 files were created in the destination d3, and then the same numbers were deleted from the old directory d1. In test 2, rsync does none of that, but only has to check for differences.
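As a rough check of the figures in this test, the tally below counts the entries that test 1 creates and then removes, and recomputes two of the ratios from the quoted averages. It is only a sketch: it assumes exactly one metadata operation per entry created (mkdir or hard link) and one per entry removed (rmdir or unlink), which is a simplification, and the `ratio` helper is a name introduced here for illustration.

```shell
#!/bin/sh
# Figures taken from the test description.
dirs=11111
files=10000

# Assumption: one metadata operation per entry created (mkdir or hard link)
# and one per entry removed (rmdir or unlink).
created=$((dirs + files))
removed=$((dirs + files))
total=$((created + removed))
echo "created=$created removed=$removed total=$total"

# Recompute ratios from the measured averages (hypothetical helper).
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", a / b }'; }
real_ratio=$(ratio 2.454 0.392)
sys_ratio=$(ratio 2.196 0.572)
echo "real ratio=$real_ratio sys ratio=$sys_ratio"
```

Under that assumption the total comes to 42222 operations, and the recomputed Real and Sys ratios round to 6.3 and 3.8.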
Roughly 40,000 metadata operations are avoided on the filesystem in this case.

> 2. Metadata history. If there is an existing file in the target dir
> that differs only by metadata (permissions, ownership, timestamp) then
> rsync will simply change that metadata. That change affects all
> instances of that file. Of course this is better for storage space as
> the alternative is storing another copy of the file with the different
> metadata but we decided it was better to have that information saved.

Yes.

I would love to see someone make a patched version of rsync to allow callers to select a different behaviour in this case!

So, if a file has identical content on source and destination but different metadata, then if --link-dest is in use and the link count on the destination is > 1, then take a new copy from source rather than just updating the metadata (the file could be copied on the destination and then the copy updated with the new metadata and the old version removed, but this would not be essential - just perhaps an efficiency gain.)

Thanks in anticipation!

Dr Robert C. Bell
HPC National Partnerships | Scientific Computing
Information Management and Technology
CSIRO
T +61 3 9669 8102  Alt +61 3 8601 3810  Mob +61 428 108 333
Robert.Bell at csiro.au | www.csiro.au | wiki.csiro.au/display/ASC/
Street: CSIRO ASC Level 11, 700 Collins Street, Docklands Vic 3008, Australia
Postal: CSIRO ASC Level 11, GPO Box 1289, Melbourne Vic 3001, Australia
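The behaviour proposed in this message (give a file its own copy when it is hard-linked into other backups and only its metadata has changed) can be approximated outside rsync. The sketch below is not an rsync feature or patch: `break_shared_link` is a hypothetical helper, it assumes GNU `stat` (for `-c %h` and `-c %a`), and it copies on the destination rather than fetching from the source, which is the variant described in the parenthesis above.

```shell
#!/bin/sh
# Hypothetical helper, not part of rsync: give "$1" its own inode if it is
# currently shared with other backup directories via hard links.
break_shared_link() {
    file=$1
    # GNU stat: %h prints the hard-link count (platform assumption).
    links=$(stat -c %h -- "$file") || return 1
    if [ "$links" -gt 1 ]; then
        tmp="$file.unshare.$$"
        # Copy content and current metadata, then replace the name atomically.
        cp -p -- "$file" "$tmp" && mv -f -- "$tmp" "$file"
    fi
}

# Demonstration in a scratch directory (remove "$demo" when done).
demo=$(mktemp -d)
mkdir "$demo/d2" "$demo/d3"
echo "payload" > "$demo/d2/f"
chmod 644 "$demo/d2/f"
ln "$demo/d2/f" "$demo/d3/f"        # d3/f shares an inode with d2/f

break_shared_link "$demo/d3/f"      # d3/f now has its own inode
chmod 600 "$demo/d3/f"              # the metadata change stays local to d3

old_mode=$(stat -c %a "$demo/d2/f")
new_mode=$(stat -c %a "$demo/d3/f")
echo "d2/f mode: $old_mode, d3/f mode: $new_mode"
```

Without the call to `break_shared_link`, the `chmod` would change the mode seen through both names, which is the loss of metadata history discussed above. The cost is one extra copy of the file's content per metadata change. How to decide which files need the helper (for example from a dry run that reports metadata-only changes) is left open here.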
Kevin Korb
2014-Sep-15 15:03 UTC
Backup scripts - recycling old backup directories (Kevin Korb)
I would never operate in a manner that only has 5-6 days of old backups. The backups that I am deleting are more than a year old.

On 09/15/2014 02:24 AM, Robert Bell wrote:
> For us, the recycling of old directories significantly shortened
> the time to do backups, since the recycled backups have typically
> 95% of the files/directories correct (with daily backups and Tower
> of Hanoi, half of our recycled backups are only 5 to 6 days old).

-- 
Kevin Korb
Systems Administrator
FutureQuest, Inc., Orlando, Florida
Kevin at FutureQuest.net (work)
kmk at sanitarium.net (personal)
Phone: (407) 252-6853
Web page: http://www.sanitarium.net/
PGP public key available on web site.
Perry Hutchison
2014-Sep-16 07:07 UTC
Backup scripts - recycling old backup directories (Kevin Korb)
Robert Bell <Robert.Bell at csiro.au> wrote:

> > 2. Metadata history. If there is an existing file in the target dir
> > that differs only by metadata (permissions, ownership, timestamp) then
> > rsync will simply change that metadata. That change affects all
> > instances of that file. Of course this is better for storage space as
> > the alternative is storing another copy of the file with the different
> > metadata but we decided it was better to have that information saved.
> Yes.
>
> I would love to see someone make a patched version of rsync to allow
> callers to select a different behaviour in this case!
>
> So, if a file has identical content on source and destination but
> different metadata, then if --link-dest is in use and the link count on
> the destination is > 1, then take a new copy from source rather than
> just updating the metadata ...

Or you could arrange to store the content and the metadata separately, so that all files having common content -- or similar large content with small deltas -- can share a single instance of that content.

There is a rather well-known subsystem (toolkit, really) which already provides this capability: git. Figuring out the details of using it as a back-end for rsync backups is left as an exercise :)
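To make the idea of storing content and metadata separately concrete, here is a minimal sketch of a content-addressed store in plain shell. It is not git and does not use git's on-disk format: the `store_file` function, the `objects/` directory and the `manifest` file are all hypothetical names, and it assumes GNU `sha256sum` and GNU `stat`. It also ignores deltas between similar files and paths containing whitespace. Git itself records only a limited file mode in its trees and does not track ownership or timestamps, so a git back-end would likewise need some separate metadata record.

```shell
#!/bin/sh
# Hypothetical layout, not git's on-disk format:
#   <store>/objects/<sha256>   one copy of each distinct content
#   <store>/manifest           one line per file: snapshot path mode mtime hash
store_file() {
    store=$1 snapshot=$2 file=$3
    mkdir -p "$store/objects"
    hash=$(sha256sum < "$file" | cut -d' ' -f1)
    # Copy the content only if this hash has not been stored before.
    [ -e "$store/objects/$hash" ] || cp -- "$file" "$store/objects/$hash"
    # GNU stat: %a = octal mode, %Y = mtime in seconds since the epoch.
    meta=$(stat -c '%a %Y' -- "$file")
    printf '%s %s %s %s\n' "$snapshot" "$file" "$meta" "$hash" >> "$store/manifest"
}

# Demonstration: the same content recorded twice with different permissions.
work=$(mktemp -d)
echo "same content" > "$work/f"
chmod 644 "$work/f"
store_file "$work/store" day1 "$work/f"
chmod 600 "$work/f"                 # metadata-only change
store_file "$work/store" day2 "$work/f"

objects=$(ls "$work/store/objects" | wc -l)
records=$(wc -l < "$work/store/manifest")
echo "objects: $objects, manifest records: $records"
```

The two snapshots share a single stored object while the manifest keeps both sets of permissions, so the metadata history survives without a second copy of the content.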