Jorrit Jongma
2020-May-18 15:58 UTC
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
Well, don't get too excited: get_checksum1() (the function optimized here) is not the main performance limiter in this case. That would be get_checksum2() and sum_update(), which will be using MD5. You can force the use of MD4, but on the slower CPUs I've tested, that is in practice slower rather than faster, contrary to what would be expected.

While this patch will improve things a little, to improve things a lot we need to tackle or replace MD5. Unfortunately, single-stream MD5 cannot be effectively optimized with SSE; at least, I've not seen an SSE version faster than pure C, and I've looked into it. What we _can_ do is parallelize multiple streams using SSE, which may double to triple the throughput at the same CPU load under ideal circumstances. However, this cannot be applied to rsync as-is, as it doesn't process multiple files simultaneously (and it is questionable whether that is something we should even want). The single-file stream could still be parallelized this way, but it would require a slight change in checksum generation that would in turn require a protocol change: both ends need to support it. At that point we might as well swap MD5 out completely, though I will still be digging deeper into this case.

The good news is that this parallelization _is_ possible in a drop-in fashion for the case where rsync is comparing the chunks on both ends, the same case where the get_checksum1() patch shows its benefits. I estimate performance improvements could reach about 30% for that specific case (re-transferring large yet slightly modified files), but that does nothing for the performance of whole-file checksumming or the transfer of new files. Depending on your use case, you may never or only rarely see that performance improvement in action. It applies to my use case though, so I am looking into this.
On Mon, May 18, 2020 at 5:18 PM Ben RUBSON via rsync <rsync at lists.samba.org> wrote:
>
> > On 18 May 2020, at 17:06, Jorrit Jongma via rsync <rsync at lists.samba.org> wrote:
> >
> > This drop-in patch increases the performance of the get_checksum1()
> > function on x86-64.
>
> As ref, rather related to this:
> https://bugzilla.samba.org/show_bug.cgi?id=13082
>
> Thank you Jorrit!
> --
> Please use reply-all for most replies to avoid omitting the mailing list.
> To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
> Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
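For readers wondering what kind of data parallelism makes an Adler-style checksum a good fit for SSE, here is a minimal sketch. It is not the code from the patch: it only computes the plain byte sum (the "s1" half of such a checksum), treats bytes as unsigned, and the function name byte_sum() is hypothetical. It uses the SSE2 intrinsic _mm_sad_epu8 to add 16 bytes per step, and falls back to a scalar loop when SSE2 is not available.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: sum of all bytes in buf, modulo 2^32. */
uint32_t byte_sum(const unsigned char *buf, size_t len)
{
    uint32_t s1 = 0;
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = zero; /* two 64-bit partial sums */
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        /* |v - 0| summed per 8-byte half gives the byte sum of each half */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }
    /* fold the two partial sums together (low 32 bits are enough here) */
    s1 = (uint32_t)_mm_cvtsi128_si32(acc)
       + (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#endif
    /* scalar tail, or the whole buffer when SSE2 is unavailable */
    for (; i < len; i++)
        s1 += buf[i];
    return s1;
}
```

The weighted second sum is harder to vectorize than this, and MD5 is harder still, because each MD5 step depends on the result of the previous one within a single stream.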
Ben RUBSON
2020-May-18 16:20 UTC
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
Thank you Jorrit for your detailed answer.

> On 18 May 2020, at 17:58, Jorrit Jongma via rsync <rsync at lists.samba.org> wrote:
>
> Well, don't get too excited, get_checksum1() (the function optimized
> here) is not the great performance limiter in this case, it's
> get_checksum2() and sum_update(), which will be using MD5.

Surely all the other functions using MD5 could be updated to use your SSE-optimized function, so that we have full SSE MD5 support wherever rsync uses it (basis file checksum, rolling checksum, etc.).

I think one nice performance improvement would be when the receiver checksums the (big/huge) basis file, because the sender is then simply waiting...

> Unfortunately, single stream MD5 cannot be effectively optimized with
> SSE, at least I've not seen an SSE version faster than pure C

I was about to tell you that we successfully implemented it in FreeBSD a few years ago, but that was CRC32, not MD5...
https://github.com/freebsd/freebsd/commit/c4b27423f57c30068aff3f234c912ae8d9ff1b6a
https://github.com/freebsd/freebsd/commit/5a798b035b4858923878c014a5faa48b2f9aa6e7
At least it sounds like the algorithm author / inspiration, Mark Adler, is the same :)

Anyway, this is an interesting first bit of SSE MD5 support.
Jorrit Jongma
2020-May-18 17:02 UTC
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
I think you're missing a point here. Two different checksum algorithms are used in concert: the Adler-based one and the MD5 one. I SSE-optimized the Adler-based one.

The Adler-based hash is used to _find_ blocks that might have shifted, while the MD5 hash is a strong cryptographic hash used to _verify_ blocks and files. You wouldn't want to replace the MD5 hash with the Adler-based hash; they are of a different class. If you were to replace the MD5 hash with a different one, you'd replace it with one of the SHAs or even xxHash.

On Mon, May 18, 2020 at 6:21 PM Ben RUBSON via rsync <rsync at lists.samba.org> wrote:
>
> Thank you Jorrit for your detailed answer.
>
> > On 18 May 2020, at 17:58, Jorrit Jongma via rsync <rsync at lists.samba.org> wrote:
> >
> > Well, don't get too excited, get_checksum1() (the function optimized
> > here) is not the great performance limiter in this case, it's
> > get_checksum2() and sum_update(), which will be using MD5.
>
> Certainly that all other functions using MD5 could be updated to use your SSE-optimized function.
> So that we have a full SSE MD5 support, wherever rsync is using it (basis file checksum, rolling checksum etc...).
>
> I think one nice performance improvement could be when the receiver checksums the (big/huge) basis file, because here the sender is then simply waiting...
>
> > Unfortunately, single stream MD5 cannot be effectively optimized with
> > SSE, at least I've not seen an SSE version faster than pure C
>
> I was about to tell you that we successfully implemented it into FreeBSD a few years ago, but it's CRC32, not MD5...
> https://github.com/freebsd/freebsd/commit/c4b27423f57c30068aff3f234c912ae8d9ff1b6a
> https://github.com/freebsd/freebsd/commit/5a798b035b4858923878c014a5faa48b2f9aa6e7
> At least sounds like the algorithm author / inspiration, Mark Adler, is the same :)
>
> Anyway, this is a first interesting SSE MD5 support.
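To make the "find" role of the Adler-based checksum concrete, here is a minimal sketch of a weak rolling checksum in C. It is a simplification and an assumption on my part, not rsync's actual get_checksum1(): it treats bytes as unsigned, uses no character offset, and the names weak_sum() and weak_roll() are hypothetical. The point it illustrates is that the sum for a window shifted by one byte can be derived from the previous sum in constant time, which is what makes it cheap to search for a block at every offset; a strong hash such as MD5 has no such property and is only computed to confirm a candidate match.

```c
#include <stddef.h>
#include <stdint.h>

/* Weak Adler-style checksum of one block: s1 is the byte sum, s2 the sum of
 * the running s1 values; both are kept to 16 bits and packed as s1 | s2<<16. */
uint32_t weak_sum(const unsigned char *buf, size_t len)
{
    uint32_t s1 = 0, s2 = 0;
    for (size_t i = 0; i < len; i++) {
        s1 += buf[i];
        s2 += s1;
    }
    return (s1 & 0xffff) | (s2 << 16);
}

/* Slide a window of length len forward by one byte: drop `out` (the first
 * byte of the old window) and append `in` (the byte just past its end). */
uint32_t weak_roll(uint32_t sum, unsigned char out, unsigned char in, size_t len)
{
    uint32_t s1 = sum & 0xffff;
    uint32_t s2 = sum >> 16;
    s1 = (s1 - out + in) & 0xffff;
    s2 = (s2 - (uint32_t)len * out + s1) & 0xffff;
    return s1 | (s2 << 16);
}
```

Rolling from weak_sum(data, n) with out = data[0] and in = data[n] gives the same value as recomputing weak_sum(data + 1, n) from scratch.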