samba-bugs at samba.org
2019-Sep-01 22:55 UTC
[Bug 14109] New: Support Custom Fuzzy Basis Selection Algorithm
https://bugzilla.samba.org/show_bug.cgi?id=14109

            Bug ID: 14109
           Summary: Support Custom Fuzzy Basis Selection Algorithm
           Product: rsync
           Version: 3.1.3
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P5
         Component: core
          Assignee: wayne at opencoder.net
          Reporter: lonniebiz at yahoo.com
        QA Contact: rsync-qa at samba.org

The --fuzzy argument does an incredible job at syncing large files when it chooses the correct fuzzy basis. However, the default "fuzzy-basis-destination-file-selection algorithm" is not correct for every situation, so I propose the ability to pass an argument to the fuzzy parameter that specifies which "fuzzy-basis-destination-file-selection algorithm" to use.

I've posted a question detailing my needs here:
https://unix.stackexchange.com/questions/538548/

In short, some of the files in my source-folder are 200GB in size. When rsync chooses the correct existing-destination-file for its "fuzzy basis", my synchronization (of these files) seems magical in terms of the data that gets transferred over the wire. However, when it chooses the wrong existing-destination-file as the source file's fuzzy basis, the data transfer can take days.

Look at the filenames in both my source-folder and destination-folder (below):

# Source Folder's new files (from today's on-site backup):
file100-2019_09-01_12am.log
file100-2019_09-01_12am.lzo
file101-2019_09-01_12am.log
file101-2019_09-01_12am.lzo
file102-2019_09-01_12am.log
file102-2019_09-01_12am.lzo

# Destination-Folder's old files (from yesterday's off-site backup):
file100-2019_08-31_12am.log
file100-2019_08-31_12am.lzo
file101-2019_08-31_12am.log
file101-2019_08-31_12am.lzo
file102-2019_08-31_12am.log
file102-2019_08-31_12am.lzo

In my case, the fuzzy-basis-selection-algorithm needs to select the existing destination-file that:

1) Has the same file extension as the source file
2) Begins with the most consecutively identical characters as the source file

The default algorithm does not meet these requirements.
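The two requirements listed above can be sketched in Python as follows. This is only an illustration of the requested selection rule, not rsync's own fuzzy-matching code; the function names common_prefix_len and pick_fuzzy_basis are hypothetical.

```python
import os


def common_prefix_len(a, b):
    """Count how many leading characters two names share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_fuzzy_basis(source_name, dest_names):
    """Return the destination name with the same extension as source_name
    and the longest shared leading run of characters, or None if no
    candidate has the same extension and at least one shared character."""
    src_ext = os.path.splitext(source_name)[1]
    best, best_len = None, 0
    for cand in dest_names:
        # Requirement 1: same file extension.
        if os.path.splitext(cand)[1] != src_ext:
            continue
        # Requirement 2: most consecutively identical leading characters.
        n = common_prefix_len(source_name, cand)
        if n > best_len:
            best, best_len = cand, n
    return best


dest = [
    "file100-2019_08-31_12am.log", "file100-2019_08-31_12am.lzo",
    "file101-2019_08-31_12am.log", "file101-2019_08-31_12am.lzo",
    "file102-2019_08-31_12am.log", "file102-2019_08-31_12am.lzo",
]
print(pick_fuzzy_basis("file101-2019_09-01_12am.lzo", dest))
# file101-2019_08-31_12am.lzo
```

With the file names from this report, the rule pairs each new file with yesterday's file of the same number and extension, because the shared prefix runs through "file101-2019_0" before the dates diverge.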
Therefore, I propose the ability to pass an argument that allows the user to specify non-default fuzzy basis selection algorithms. There should probably be a few common, baked-in ones (as time goes on) that you can choose from by name, and it would be even more flexible if rsync also permitted the user to pass a file into the command that specifies a custom "fuzzy-basis-destination-file-selection algorithm".

Naturally, if these features are granted, the documentation would also need to be updated to give guidance on specifying these things.

If these things are already implemented, and I have somehow overlooked them, would you kindly post an answer to my question here?
https://unix.stackexchange.com/questions/538548/

--
You are receiving this mail because:
You are the QA Contact for the bug.
samba-bugs at samba.org
2019-Sep-01 23:15 UTC
[Bug 14109] Support Custom Fuzzy Basis Selection Algorithm
https://bugzilla.samba.org/show_bug.cgi?id=14109

--- Comment #1 from Kevin Korb <rsync at sanitarium.net> ---
Just a quick thought on a workaround... It would be trivial to figure out the new name and best old file in a script. So, you could hard link the best old file to the new file name. Then rsync wouldn't even need --fuzzy to find it.
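A minimal sketch of that workaround in Python, meant to run on the destination side before the transfer. The function name seed_basis is hypothetical, and the matching rule (same extension, longest shared prefix) is taken from the original report rather than from rsync. It also assumes rsync is run without --inplace, so that the updated file is written to a temporary file and renamed into place; with --inplace the old file sharing the hard link would be overwritten as well.

```python
import os


def seed_basis(dest_dir, new_name):
    """Hard-link the best-matching old file in dest_dir to new_name.

    Returns the name of the old file that was linked, or None if the
    target already exists or no suitable candidate was found.
    """
    target = os.path.join(dest_dir, new_name)
    if os.path.exists(target):
        return None
    ext = os.path.splitext(new_name)[1]
    # Candidates: regular files in dest_dir with the same extension.
    candidates = [
        name for name in os.listdir(dest_dir)
        if os.path.splitext(name)[1] == ext
        and os.path.isfile(os.path.join(dest_dir, name))
    ]

    def shared(name):
        # Length of the leading run of characters shared with new_name.
        return len(os.path.commonprefix([new_name, name]))

    if not candidates:
        return None
    best = max(candidates, key=shared)
    if shared(best) == 0:
        return None
    # The new name now points at the old data, so rsync finds a basis
    # file under the expected name without needing --fuzzy.
    os.link(os.path.join(dest_dir, best), target)
    return best
```

For example, seed_basis("/backups/offsite", "file101-2019_09-01_12am.lzo") would link yesterday's file101 .lzo file to today's name; the directory path here is only a placeholder.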
samba-bugs at samba.org
2019-Sep-01 23:51 UTC
[Bug 14109] Support Custom Fuzzy Basis Selection Algorithm
https://bugzilla.samba.org/show_bug.cgi?id=14109

--- Comment #2 from Lonnie Best <lonniebiz at yahoo.com> ---
Thanks. Yeah, that's probably what I'll do. I may even write the script so that it does some tasks in parallel (running multiple rsync commands at the same time).

The current default "fuzzy-basis-destination-file-selection algorithm" selects the correct file most of the time. Maybe the reason it didn't today is that it is the first day of a new month, and that made the file names too different. I'm not sure.

The --fuzzy argument is really awesome and it is just a hair away from being exactly what I need for handling things with one command at the folder-level. If I could only modify the file-selection algorithm, it would be perfect. Until then, I just have to write a script instead of being able to handle this within the command.
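Running several rsync commands at the same time, as mentioned above, could look roughly like the following Python sketch. The helper names build_rsync_command and run_parallel are hypothetical, as are the host and paths in the usage comment; the only rsync options assumed are -a and --partial.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_rsync_command(src_path, dest):
    """Build one rsync invocation for a single file (argument list form)."""
    return ["rsync", "-a", "--partial", src_path, dest]


def run_parallel(commands, max_workers=2):
    """Run each command as a subprocess, at most max_workers at a time.

    Returns the exit codes in the same order as the commands.
    """
    def run(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


# Usage (placeholder host and paths):
# files = ["/backups/file100-2019_09-01_12am.lzo",
#          "/backups/file101-2019_09-01_12am.lzo"]
# codes = run_parallel(
#     [build_rsync_command(f, "offsite:/backups/") for f in files])
```

Threads are enough here because each worker only waits on an external process; whether parallel transfers actually help depends on the link and the disks on both ends.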
Apparently Analogous Threads
- DO NOT REPLY [Bug 4056] New: Option to look for fuzzy basis files in --*-dest directories
- [Bug 12527] New: Sender waits for timeout when fuzzy basis file found
- [Bug 10581] New: --fuzzy-delay and --fuzzy-limit for fuzzy match tuning
- [Bug 12489] New: --fuzzy --fuzzy does not work with daemon
- [Bug 12498] New: --fuzzy --fuzzy hugely impacts performance even if its' not needed