Back in the spring, we started using rsync for a disk-to-disk backup system that maintains close to 10PB of data. I am not here to debate which tool is the right one, only to describe a problem we found with rsync when using it this way.

We traced the various processes, hoping to find what was slowing things down so much, and determined fairly easily that it was the checksum components in rsync. Once we found that and tested against the --checksum option, it was glaring that this was slowing us down. We next tested the MD5 vs MD4 checksums and found little difference in speed.

So we went in search of a better checksum algorithm and found xxhash, using the one from the CentOS release. Thanks to the way the source is written, it was a fairly easy patch to get this into the src RPM. We have been using this in production for a while now and see about a 3x speedup over the MD5/MD4 checksum algorithms, which brings it pretty close to the --checksum speed.

Attached is the patch we applied. Since xxhash is in the distro, a dependency would be required for this RPM. If nothing else, perhaps the developers should just take a look, as this could benefit many.

Thanks,
Bill

-------------- next part --------------
A non-text attachment was scrubbed...
Name: xxhash.patch
Type: text/x-patch
Size: 4030 bytes
Desc: not available
URL: <http://lists.samba.org/pipermail/rsync/attachments/20191001/9edcb332/xxhash.bin>
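For readers who want to reproduce this kind of MD5 vs MD4 vs xxhash comparison on their own hardware, here is a minimal sketch. It is not the measurement method used above, which the message does not describe. It assumes the third-party Python `xxhash` package for the xxhash timing and skips any algorithm the host does not provide (MD4 is often disabled in recent OpenSSL builds). The function names are made up for illustration, and Python overhead means the absolute numbers will not match rsync's C code.

```python
import hashlib
import time


def throughput_mb_s(constructor, data, rounds=20):
    """Hash `data` `rounds` times and return the observed rate in MB/s."""
    start = time.perf_counter()
    for _ in range(rounds):
        h = constructor()
        h.update(data)
        h.digest()
    elapsed = time.perf_counter() - start
    return (len(data) * rounds) / (1024 * 1024) / max(elapsed, 1e-9)


def available_hashes():
    """Return {name: constructor} for the algorithms present on this host."""
    hashes = {"md5": hashlib.md5}
    try:
        hashlib.new("md4")  # raises ValueError if OpenSSL has MD4 disabled
        hashes["md4"] = lambda: hashlib.new("md4")
    except ValueError:
        pass
    try:
        import xxhash  # assumption: third-party package, may not be installed
        hashes["xxh64"] = xxhash.xxh64
    except ImportError:
        pass
    return hashes


def run_benchmark(size=8 * 1024 * 1024):
    """Time each available algorithm over one in-memory buffer of `size` bytes."""
    data = bytes(size)
    return {name: throughput_mb_s(ctor, data)
            for name, ctor in available_hashes().items()}


if __name__ == "__main__":
    for name, rate in sorted(run_benchmark().items()):
        print(f"{name:6s} {rate:10.1f} MB/s")
```

Because the buffer lives in memory, this measures only hashing cost. It says nothing about the file system or network, which matter just as much for a real transfer.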
On Tue 01 Oct 2019, Bill Wichser via rsync wrote:
>
> Attached is the patch we applied. Since xxhash is in the distro, a
> dependency would be required for this RPM. If nothing else, perhaps the
> developers should just take a look as this could benefit many.

"The distro" is a bit vague for a tool like rsync that runs on many versions of Unix and Linux, and even Windows.

The problem is (AFAIK) that this would need a protocol version bump so that both ends of the transfer can agree on the checksum algorithm to use. It's not as simple as replacing the current algorithm: that would make it impossible to rsync to or from an older version of rsync.

It's an interesting idea, although I wonder how many users would actually profit from this. CPU is generally fast enough to handle what the IO subsystem can read for most people, I imagine.

Paul
Paul,

Thanks. I can see your point for sure. I wasn't suggesting an all-out switch, just an option enabled with a flag. Since we're doing a GPFS to GPFS transfer over a high-speed link, with billions of files at the moment, even a marginal increase in speed helps, which is why we were using MD4 instead of MD5. Because the code is well structured, we can easily maintain the patch ourselves.

The hope is that once all the data is mirrored we can take a smarter approach by using the built-in inotify structure IBM has provided. But this will require a better understanding on our part in order to use it effectively. For now we will continue to use the naive rsync approach with our modifications.

Thanks,
Bill

On 10/3/19 9:43 AM, Paul Slootman via rsync wrote:
> [...]
> The problem is (AFAIK) that this would need a protocol version bump so
> that the checksum algorithm to be used can be decided upon by both ends
> of the transfer, it's not as simple as simply replacing the current
> algorithm: that would make it impossible to rsync to / from an older
> version of rsync.
> [...]
On Tue, Oct 1, 2019 at 8:02 AM Bill Wichser via rsync <rsync at lists.samba.org> wrote:
> Attached is the patch we applied [to add xxhash checksums]

Thanks, Bill! I finally got around to finishing up some checksum improvements and have added support for xxhash in the master branch. The latest version in git now picks the best checksum algorithm that the client and server versions have in common, and will support future checksum algorithms being added without the need for a protocol bump. It also supports a new RSYNC_CHECKSUM_LIST environment variable that lets the user limit which checksum algorithms rsync may use (in addition to the already existing --checksum-choice=FOO option, which forces the checksum choice).

..wayne..
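The negotiation described above can be illustrated with a small sketch. This is not rsync's source code. It assumes each side advertises a list of algorithm names, that the first name in the client's preference list which the server also supports wins, and that RSYNC_CHECKSUM_LIST holds space-separated names that replace the built-in ordering. The function names and the algorithm names in the usage note are illustrative only.

```python
import os


def negotiate(client_prefs, server_supported):
    """Return the first client-preferred algorithm the server also supports."""
    server = set(server_supported)
    for name in client_prefs:
        if name in server:
            return name
    raise ValueError("no checksum algorithm in common")


def client_preferences(built_in, env=None):
    """Limit the built-in preference list via RSYNC_CHECKSUM_LIST, if set.

    Assumption: the variable holds space-separated names, and its order
    replaces the built-in order. Unknown names are ignored.
    """
    env = os.environ if env is None else env
    limit = env.get("RSYNC_CHECKSUM_LIST")
    if not limit:
        return list(built_in)
    return [name for name in limit.split() if name in built_in]
```

With hypothetical lists such as `["xxh64", "md5", "md4"]` for a new build and `["md5", "md4"]` for a peer without xxhash, `negotiate` falls back to `"md5"`, so adding an algorithm to one side does not break transfers with the other. This is the property that removes the need for a protocol bump whenever a new checksum is introduced.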
That's excellent news!

On Sat, 23 May 2020 at 08:11, Wayne Davison via rsync <rsync at lists.samba.org> wrote:
> [...]
> Thanks, Bill! I finally got around to finishing up some checksum
> improvements and have added support for xxhash in the master branch.
> [...]
This is great! However, do you have access to a big-endian CPU? I'm not sure how relevant this still is, but I read at some point that xxhash might produce different (reversed?) hashes on CPUs of different endianness. It may be prudent to actually test whether that is the case with this implementation, or with how rsync uses it.

On Sat, May 23, 2020 at 8:11 AM Wayne Davison via rsync <rsync at lists.samba.org> wrote:
> [...]
> Thanks, Bill! I finally got around to finishing up some checksum
> improvements and have added support for xxhash in the master branch.
> [...]
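The concern can be made concrete with a short sketch. It does not test xxhash or rsync, and whether either is actually affected is not established in this thread. It only shows how one 32-bit hash value turns into two different byte strings when each host writes it in its own native byte order, and that fixing a single wire order avoids the mismatch. CRC32 is used as a stand-in hash purely because it ships with Python, and the function names are made up for illustration.

```python
import struct
import zlib


def hash32(data):
    """Stand-in 32-bit hash (CRC32); NOT xxhash, chosen only for availability."""
    return zlib.crc32(data) & 0xFFFFFFFF


def wire_bytes(value, byteorder):
    """Serialize a 32-bit hash value with an explicit byte order."""
    fmt = {"little": "<I", "big": ">I"}[byteorder]
    return struct.pack(fmt, value)


def digests_match(data, sender_order, receiver_order):
    """Compare the bytes each side would produce for the same input."""
    value = hash32(data)
    return wire_bytes(value, sender_order) == wire_bytes(value, receiver_order)


if __name__ == "__main__":
    block = b"rsync block data"
    # Each host dumping the value in its own native order:
    print("native vs native:", digests_match(block, "little", "big"))
    # Both hosts agreeing on one wire order:
    print("agreed order:    ", digests_match(block, "little", "little"))
```

A test across real little-endian and big-endian machines, as suggested above, would still be the convincing check. The sketch only explains what such a test would be looking for.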