samba-bugs at samba.org
2014-May-01 13:57 UTC
[Bug 10581] New: --fuzzy-delay and --fuzzy-limit for fuzzy match tuning
https://bugzilla.samba.org/show_bug.cgi?id=10581

           Summary: --fuzzy-delay and --fuzzy-limit for fuzzy match tuning
           Product: rsync
           Version: 3.1.0
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P5
         Component: core
        AssignedTo: wayned at samba.org
        ReportedBy: samba at haravikk.com
         QAContact: rsync-qa at samba.org

It seems that when backing up folders with a very large number of files, --fuzzy behaves in a sub-optimal fashion: rsync is forced to build a file list for the entire folder if even a single new (sender-only) file is encountered, which can completely halt a transfer until all of the folder's contents are known.

To give you a better idea: I have a backup command that I run, and one of the items included in the backup is a huge OS X sparse bundle disk image made up of some 32,000+ bands, all stored within a single folder inside the image bundle. With --fuzzy disabled, rsync very quickly identifies files that are new or changed and starts sending them in a reasonable amount of time (given how many there are to check). With --fuzzy enabled, however, there is a huge (hours-long) delay before a single file gets transferred.

I assume this is because rsync is waiting for the destination file list to be completed so that it can perform fuzzy matching for similar files. With such large folders, however, this can result in an incredible delay for little gain. Such large folders aren't uncommon for modern disk image formats or for well-used mail folders, to give just two examples.

While I currently just run with --fuzzy disabled, I would rather keep it enabled for other folders, where it can help to improve matching against relocated files. So I'm proposing two new --fuzzy related options, as follows:

--fuzzy-limit sets a limit on the size of a folder for which fuzzy matching is performed. By setting this to, say, 500, fuzzy matching can be disabled for any particularly large folders where the benefits would be far outweighed by the delays. This is the simpler of the two to implement, I think. Giving the value as normal will set a limit on folder size at both ends, while giving the value with a plus (e.g. --fuzzy-limit +500) will test only the sender, and a minus will test only the destination.

--fuzzy-delay changes the behaviour of --fuzzy such that any fuzzy matching is deferred until the file list for the folder is complete. Rather than waiting, updates and deletion checks* continue normally in the meantime; once the file list is complete, any pending fuzzy matches are performed and the updates/deletions continue.

*In the case of --delete-during this may result in even more missed potential matches than normal, which is why --fuzzy-delay may not be suitable as default behaviour.

Either of these features should help to greatly optimise the performance of --fuzzy, so that particularly large directories don't cause a significant drop in performance when fuzzy matching is enabled, especially when there is a difference in speed between devices (e.g. a faster sender and a slower destination such as a NAS or shared remote host).
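To make the proposed --fuzzy-limit semantics concrete, here is a minimal sketch in Python rather than rsync's own C. It is not taken from rsync, and the option does not exist in rsync 3.1.0. The names `FuzzyLimit`, `parse_fuzzy_limit` and `fuzzy_allowed` are hypothetical. It also assumes that "size of a folder" means the number of entries in it, and that a folder is over the limit only when its entry count is strictly greater than the value given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzyLimit:
    """Hypothetical parsed form of the proposed --fuzzy-limit value."""
    max_entries: int      # assumed to be a count of directory entries
    check_sender: bool    # apply the limit to the sending side
    check_receiver: bool  # apply the limit to the destination side


def parse_fuzzy_limit(value: str) -> FuzzyLimit:
    """Parse '500' (both ends), '+500' (sender only) or '-500' (destination only)."""
    text = value.strip()
    sign = text[0] if text[:1] in ("+", "-") else ""
    digits = text[len(sign):]
    if not digits.isdigit():
        raise ValueError(f"invalid --fuzzy-limit value: {value!r}")
    return FuzzyLimit(
        max_entries=int(digits),
        check_sender=sign != "-",    # a minus restricts the test to the destination
        check_receiver=sign != "+",  # a plus restricts the test to the sender
    )


def fuzzy_allowed(limit: FuzzyLimit, sender_count: int, receiver_count: int) -> bool:
    """Return True if fuzzy matching should still be attempted for this folder."""
    if limit.check_sender and sender_count > limit.max_entries:
        return False
    if limit.check_receiver and receiver_count > limit.max_entries:
        return False
    return True


if __name__ == "__main__":
    # Folder with 100 entries on the sender and 32,000 on the destination.
    for raw in ("500", "+500", "-500"):
        print(raw, fuzzy_allowed(parse_fuzzy_limit(raw), 100, 32000))
```

With a folder like the sparse bundle described above on the destination side, a plain `500` or `-500` would skip fuzzy matching for that folder, while `+500` would still allow it because only the sender's count is tested.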
Apparently Analogous Threads
- [Bug 10575] New: Long Delay for Large Folders Even with Incremental File-List
- [Bug 10263] New: Extend Behaviour of the --fuzzy Parameter
- [Bug 14109] New: Support Custom Fuzzy Basis Selection Algorithm
- --fuzzy enhancements: size match in all directories
- I don't find "fuzzy matching"