Jorrit Jongma
2020-May-18 15:06 UTC
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
This drop-in patch increases the performance of the get_checksum1()
function on x86-64.

On the target slow CPU, performance of the function increased by nearly
50% in the x86-64 default SSE2 mode, and by nearly 100% if the
compiler was told to enable SSSE3 support. The increase was over 200%
on the fastest CPU tested in SSSE3 mode.

Transfer time improvement with large files existing on both ends but
with some bits flipped was measured as 5-10%, with the target machine
being CPU limited (still so due to MD5).

This same patch on (my) GitHub for easier reading:
https://github.com/Chainfire/rsync/commit/f5d0b32df869a23a74b8b8295e4983b0943866df


From f5d0b32df869a23a74b8b8295e4983b0943866df Mon Sep 17 00:00:00 2001
From: Jorrit Jongma <git at jongma.org>
Date: Mon, 18 May 2020 00:21:39 +0200
Subject: [PATCH 1/1] SSE2/SSSE3 optimized version of get_checksum1() for x86-64

---
 Makefile.in     |   2 +-
 checksum.c      |   2 +
 checksum_sse2.c | 243 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 246 insertions(+), 1 deletion(-)
 create mode 100644 checksum_sse2.c

diff --git a/Makefile.in b/Makefile.in
index 59649562..e4202336 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -40,7 +40,7 @@ OBJS1=flist.o rsync.o generator.o receiver.o cleanup.o sender.o exclude.o \
 	util.o util2.o main.o checksum.o match.o syscall.o log.o backup.o delete.o
 OBJS2=options.o io.o compat.o hlink.o token.o uidlist.o socket.o hashtable.o \
 	fileio.o batch.o clientname.o chmod.o acls.o xattrs.o
-OBJS3=progress.o pipe.o
+OBJS3=progress.o pipe.o checksum_sse2.o
 DAEMON_OBJ = params.o loadparm.o clientserver.o access.o connection.o authenticate.o
 popt_OBJS=popt/findme.o popt/popt.o popt/poptconfig.o \
 	popt/popthelp.o popt/poptparse.o
diff --git a/checksum.c b/checksum.c
index cd234038..4e696f3d 100644
--- a/checksum.c
+++ b/checksum.c
@@ -99,6 +99,7 @@ int canonical_checksum(int csum_type)
 	return csum_type >= CSUM_MD4 ? 1 : 0;
 }
 
+#ifndef __SSE2__ // see checksum_sse2.c for SSE2/SSSE3 version
 /*
   a simple 32 bit checksum that can be updated from either end
   (inspired by Mark Adler's Adler-32 checksum)
@@ -119,6 +120,7 @@ uint32 get_checksum1(char *buf1, int32 len)
 	}
 	return (s1 & 0xffff) + (s2 << 16);
 }
+#endif
 
 void get_checksum2(char *buf, int32 len, char *sum)
 {
diff --git a/checksum_sse2.c b/checksum_sse2.c
new file mode 100644
index 00000000..51662833
--- /dev/null
+++ b/checksum_sse2.c
@@ -0,0 +1,243 @@
+/*
+ * SSE2/SSSE3-optimized routines to support checksumming of bytes.
+ *
+ * Copyright (C) 1996 Andrew Tridgell
+ * Copyright (C) 1996 Paul Mackerras
+ * Copyright (C) 2004-2020 Wayne Davison
+ * Copyright (C) 2020 Jorrit Jongma
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 3 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, visit the http://fsf.org website.
+ */
+/*
+ * Optimization target for get_checksum1 was the Intel Atom D2700, the
+ * slowest CPU in the test set and the most likely to be CPU limited during
+ * transfers. The combination of intrinsics was chosen specifically for the
+ * most gain on that CPU, other combinations were occasionally slightly
+ * faster on the others.
+ *
+ * While on more modern CPUs transfers are less likely to be CPU limited,
+ * lower CPU usage is always better. Improvements may still be seen when
+ * matching chunks from NVMe storage even on newer CPUs.
+ *
+ * Benchmarks                    C           SSE2        SSSE3
+ * - Intel Atom D2700            550 MB/s    750 MB/s    1000 MB/s
+ * - Intel i7-7700hq             1850 MB/s   2550 MB/s   4050 MB/s
+ * - AMD ThreadRipper 2950x      2900 MB/s   5600 MB/s   8950 MB/s
+ *
+ * This optimization for get_checksum1 is intentionally limited to x86-64 as
+ * no 32-bit CPU was available for testing. As 32-bit CPUs only have half the
+ * available xmm registers, this optimized version may not be faster than the
+ * pure C version anyway.
+ *
+ * GCC automatically enables SSE2 support on x86-64 builds. The SSSE3 code
+ * path must be enabled manually: ./configure CFLAGS="-mssse3 -O2"
+ */
+
+#ifdef __x86_64__
+#ifdef __SSE2__
+
+#include "rsync.h"
+
+#ifdef __SSSE3__
+#include <immintrin.h>
+#else
+#include <tmmintrin.h>
+#endif
+
+/* Compatibility functions to let our SSSE3 algorithm run on SSE2 */
+
+static inline __m128i sse_load_si128(void const* buf) {
+#ifdef __SSSE3__
+    return _mm_lddqu_si128(buf);  // same as loadu on all but the oldest SSSE3 CPUs
+#else
+    return _mm_loadu_si128(buf);
+#endif
+}
+
+#ifndef __SSSE3__
+static inline __m128i sse_interleave_odd_epi16(__m128i a, __m128i b) {
+    return _mm_packs_epi32(
+        _mm_srai_epi32(a, 16),
+        _mm_srai_epi32(b, 16)
+    );
+}
+
+static inline __m128i sse_interleave_even_epi16(__m128i a, __m128i b) {
+    return sse_interleave_odd_epi16(
+        _mm_slli_si128(a, 2),
+        _mm_slli_si128(b, 2)
+    );
+}
+
+static inline __m128i sse_mulu_odd_epi8(__m128i a, __m128i b) {
+    return _mm_mullo_epi16(
+        _mm_srli_epi16(a, 8),
+        _mm_srai_epi16(b, 8)
+    );
+}
+
+static inline __m128i sse_mulu_even_epi8(__m128i a, __m128i b) {
+    return _mm_mullo_epi16(
+        _mm_and_si128(a, _mm_set1_epi16(0xFF)),
+        _mm_srai_epi16(_mm_slli_si128(b, 1), 8)
+    );
+}
+#endif
+
+static inline __m128i sse_hadds_epi16(__m128i a, __m128i b) {
+#ifdef __SSSE3__
+    return _mm_hadds_epi16(a, b);
+#else
+    return _mm_adds_epi16(
+        sse_interleave_even_epi16(a, b),
+        sse_interleave_odd_epi16(a, b)
+    );
+#endif
+}
+
+static inline __m128i sse_maddubs_epi16(__m128i a, __m128i b) {
+#ifdef __SSSE3__
+    return _mm_maddubs_epi16(a, b);
+#else
+    return _mm_adds_epi16(
+        sse_mulu_even_epi8(a, b),
+        sse_mulu_odd_epi8(a, b)
+    );
+#endif
+}
+
+/*
+  a simple 32 bit checksum that can be updated from either end
+  (inspired by Mark Adler's Adler-32 checksum)
+  */
+/*
+  Original loop per 4 bytes:
+    s2 += 4*(s1 + buf[i]) + 3*buf[i+1] + 2*buf[i+2] + buf[i+3] + 10*CHAR_OFFSET;
+    s1 += buf[i] + buf[i+1] + buf[i+2] + buf[i+3] + 4*CHAR_OFFSET;
+
+  SSE2/SSSE3 loop per 32 bytes:
+    int16 t1[8];
+    int16 t2[8];
+    for (int j = 0; j < 8; j++) {
+      t1[j] = buf[j*4 + i] + buf[j*4 + i+1] + buf[j*4 + i+2] + buf[j*4 + i+3];
+      t2[j] = 4*buf[j*4 + i] + 3*buf[j*4 + i+1] + 2*buf[j*4 + i+2] + buf[j*4 + i+3];
+    }
+    s2 += 32*s1 +
+          28*t1[0] + 24*t1[1] + 20*t1[2] + 16*t1[3] + 12*t1[4] + 8*t1[5] + 4*t1[6] +
+          t2[0] + t2[1] + t2[2] + t2[3] + t2[4] + t2[5] + t2[6] + t2[7] +
+          ((16+32+48+64+80+96) + 8)*CHAR_OFFSET;
+    s1 += t1[0] + t1[1] + t1[2] + t1[3] + t1[4] + t1[5] + t1[6] + t1[7] +
+          32*CHAR_OFFSET;
+  */
+uint32 get_checksum1(char *buf1, int32 len)
+{
+    int32 i;
+    uint32 s1, s2;
+    schar *buf = (schar *)buf1;
+
+    i = s1 = s2 = 0;
+    if (len > 32) {
+        const char mul_t1_buf[16] = {28, 0, 24, 0, 20, 0, 16, 0, 12, 0, 8, 0, 4, 0, 0, 0};
+        __m128i mul_t1 = sse_load_si128((void const*)mul_t1_buf);
+        __m128i ss1 = _mm_setzero_si128();
+        __m128i ss2 = _mm_setzero_si128();
+
+        for (i = 0; i < (len-32); i+=32) {
+            // Load ... 2*[int8*16]
+            __m128i in8_1 = sse_load_si128((void const*)&buf[i]);
+            __m128i in8_2 = sse_load_si128((void const*)&buf[i + 16]);
+
+            // (1*buf[i] + 1*buf[i+1]), (1*buf[i+2], 1*buf[i+3]), ... 2*[int16*8]
+            // Fastest, even though multiply by 1
+            __m128i mul_one = _mm_set1_epi8(1);
+            __m128i add16_1 = sse_maddubs_epi16(mul_one, in8_1);
+            __m128i add16_2 = sse_maddubs_epi16(mul_one, in8_2);
+
+            // (4*buf[i] + 3*buf[i+1]), (2*buf[i+2], buf[i+3]), ... 2*[int16*8]
+            __m128i mul_const = _mm_set1_epi32(4 + (3 << 8) + (2 << 16) + (1 << 24));
+            __m128i mul_add16_1 = sse_maddubs_epi16(mul_const, in8_1);
+            __m128i mul_add16_2 = sse_maddubs_epi16(mul_const, in8_2);
+
+            // s2 += 32*s1
+            ss2 = _mm_add_epi32(ss2, _mm_slli_epi32(ss1, 5));
+
+            // [sum(t1[0]..t1[6]), X, X, X] [int32*4]; faster than multiple _mm_hadds_epi16
+            // Shifting left, then shifting right again and shuffling (rather than just
+            // shifting right as with mul32 below) to cheaply end up with the correct sign
+            // extension as we go from int16 to int32.
+            __m128i sum_add32 = _mm_add_epi16(add16_1, add16_2);
+            sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 2));
+            sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 4));
+            sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 8));
+            sum_add32 = _mm_srai_epi32(sum_add32, 16);
+            sum_add32 = _mm_shuffle_epi32(sum_add32, 3);
+
+            // [sum(t2[0]..t2[6]), X, X, X] [int32*4]; faster than multiple _mm_hadds_epi16
+            __m128i sum_mul_add32 = _mm_add_epi16(mul_add16_1, mul_add16_2);
+            sum_mul_add32 = _mm_add_epi16(sum_mul_add32, _mm_slli_si128(sum_mul_add32, 2));
+            sum_mul_add32 = _mm_add_epi16(sum_mul_add32, _mm_slli_si128(sum_mul_add32, 4));
+            sum_mul_add32 = _mm_add_epi16(sum_mul_add32, _mm_slli_si128(sum_mul_add32, 8));
+            sum_mul_add32 = _mm_srai_epi32(sum_mul_add32, 16);
+            sum_mul_add32 = _mm_shuffle_epi32(sum_mul_add32, 3);
+
+            // s1 += t1[0] + t1[1] + t1[2] + t1[3] + t1[4] + t1[5] + t1[6] + t1[7]
+            ss1 = _mm_add_epi32(ss1, sum_add32);
+
+            // s2 += t2[0] + t2[1] + t2[2] + t2[3] + t2[4] + t2[5] + t2[6] + t2[7]
+            ss2 = _mm_add_epi32(ss2, sum_mul_add32);
+
+            // [t1[0], t1[1], ...] [int16*8]
+            // We could've combined this with generating sum_add32 above and save one _mm_add_epi16,
+            // but benchmarking shows that as being slower
+            __m128i add16 = sse_hadds_epi16(add16_1, add16_2);
+
+            // [t1[0], t1[1], ...] -> [t1[0]*28 + t1[1]*24, ...] [int32*4]
+            __m128i mul32 = _mm_madd_epi16(add16, mul_t1);
+
+            // [sum(mul32), X, X, X] [int32*4]; faster than multiple _mm_hadd_epi32
+            mul32 = _mm_add_epi32(mul32, _mm_srli_si128(mul32, 4));
+            mul32 = _mm_add_epi32(mul32, _mm_srli_si128(mul32, 8));
+
+            // s2 += 28*t1[0] + 24*t1[1] + 20*t1[2] + 16*t1[3] + 12*t1[4] + 8*t1[5] + 4*t1[6]
+            ss2 = _mm_add_epi32(ss2, mul32);
+
+#if CHAR_OFFSET != 0
+            // s1 += 32*CHAR_OFFSET
+            __m128i char_offset_multiplier = _mm_set1_epi32(32 * CHAR_OFFSET);
+            ss1 = _mm_add_epi32(ss1, char_offset_multiplier);
+
+            // s2 += 528*CHAR_OFFSET
+            char_offset_multiplier = _mm_set1_epi32(528 * CHAR_OFFSET);
+            ss2 = _mm_add_epi32(ss2, char_offset_multiplier);
+#endif
+        }
+
+        int32 x[4] = {0};
+        _mm_store_si128((void*)x, ss1);
+        s1 = x[0];
+        _mm_store_si128((void*)x, ss2);
+        s2 = x[0];
+    }
+    for (; i < (len-4); i+=4) {
+        s2 += 4*(s1 + buf[i]) + 3*buf[i+1] + 2*buf[i+2] + buf[i+3] + 10*CHAR_OFFSET;
+        s1 += (buf[i] + buf[i+1] + buf[i+2] + buf[i+3] + 4*CHAR_OFFSET);
+    }
+    for (; i < len; i++) {
+        s1 += (buf[i]+CHAR_OFFSET); s2 += s1;
+    }
+    return (s1 & 0xffff) + (s2 << 16);
+}
+
+#endif
+#endif
-- 
2.25.2
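The patch comment above claims that the 32-byte unrolled recurrence produces the same s1/s2 as the original 4-byte loop. A small scalar harness, sketched below, can check that claim; it is not part of the patch, uses plain stdint types rather than rsync's, and assumes CHAR_OFFSET is 0 (the default build).

/* check_unroll.c - compare the original 4-byte get_checksum1() recurrence
 * against the 32-byte unrolled form described in the patch comment.
 * Assumes CHAR_OFFSET == 0; not part of the posted patch. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint32_t csum_simple(const signed char *buf, int len)
{
    uint32_t s1 = 0, s2 = 0;
    int i;
    for (i = 0; i < (len - 4); i += 4) {
        s2 += 4 * (s1 + buf[i]) + 3 * buf[i+1] + 2 * buf[i+2] + buf[i+3];
        s1 += buf[i] + buf[i+1] + buf[i+2] + buf[i+3];
    }
    for (; i < len; i++) {
        s1 += buf[i]; s2 += s1;
    }
    return (s1 & 0xffff) + (s2 << 16);
}

static uint32_t csum_unrolled(const signed char *buf, int len)
{
    uint32_t s1 = 0, s2 = 0;
    int i = 0;
    if (len > 32) {
        for (i = 0; i < (len - 32); i += 32) {
            int32_t t1[8], t2[8];
            for (int j = 0; j < 8; j++) {
                const signed char *p = buf + i + j * 4;
                t1[j] = p[0] + p[1] + p[2] + p[3];
                t2[j] = 4 * p[0] + 3 * p[1] + 2 * p[2] + p[3];
            }
            /* same closed form as the comment in checksum_sse2.c */
            s2 += 32 * s1
                + 28*t1[0] + 24*t1[1] + 20*t1[2] + 16*t1[3]
                + 12*t1[4] +  8*t1[5] +  4*t1[6]
                + t2[0] + t2[1] + t2[2] + t2[3] + t2[4] + t2[5] + t2[6] + t2[7];
            s1 += t1[0] + t1[1] + t1[2] + t1[3] + t1[4] + t1[5] + t1[6] + t1[7];
        }
    }
    for (; i < (len - 4); i += 4) {
        s2 += 4 * (s1 + buf[i]) + 3 * buf[i+1] + 2 * buf[i+2] + buf[i+3];
        s1 += buf[i] + buf[i+1] + buf[i+2] + buf[i+3];
    }
    for (; i < len; i++) {
        s1 += buf[i]; s2 += s1;
    }
    return (s1 & 0xffff) + (s2 << 16);
}

int main(void)
{
    signed char buf[4096];
    for (int len = 1; len <= (int)sizeof buf; len++) {
        for (int i = 0; i < len; i++)
            buf[i] = (signed char)rand();
        if (csum_simple(buf, len) != csum_unrolled(buf, len)) {
            printf("mismatch at len=%d\n", len);
            return 1;
        }
    }
    printf("all lengths match\n");
    return 0;
}

The SIMD code in the patch evaluates exactly this closed form with vector adds and multiply-accumulates, so agreement of the scalar forms is the property the intrinsics rely on.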
Ben RUBSON
2020-May-18 15:17 UTC
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
> On 18 May 2020, at 17:06, Jorrit Jongma via rsync <rsync at lists.samba.org> wrote:
>
> This drop-in patch increases the performance of the get_checksum1()
> function on x86-64.

As a reference, this is related: https://bugzilla.samba.org/show_bug.cgi?id=13082

Thank you Jorrit !
Filipe Maia
2020-May-18 15:42 UTC
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
I think this is a great patch but, in my view, an even better way to tackle
the fundamental problem (the performance limitations) is to use a much faster
checksum like xxhash, as has been suggested before:
https://lists.samba.org/archive/rsync/2019-October/031975.html

Cheers,
Filipe

On Mon, 18 May 2020 at 17:08, Jorrit Jongma via rsync <rsync at lists.samba.org> wrote:

> This drop-in patch increases the performance of the get_checksum1()
> function on x86-64.
>
> [...]
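For context on the xxhash suggestion: its one-shot API is a single call, so swapping it in for a strong per-block hash is mostly a protocol question rather than a coding one. A minimal usage sketch, assuming the libxxhash development headers are installed (build with -lxxhash); this is not a drop-in for the rolling checksum, which is the distinction drawn in the reply below.

/* xxh_demo.c - hash a buffer with XXH64 from libxxhash. */
#include <stdio.h>
#include <string.h>
#include <xxhash.h>

int main(void)
{
    const char buf[] = "example block of file data";
    XXH64_hash_t h = XXH64(buf, strlen(buf), 0);   /* one-shot, seed 0 */
    printf("XXH64 = %016llx\n", (unsigned long long)h);
    return 0;
}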
Jorrit Jongma
2020-May-18 15:55 UTC
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
I don't disagree that MD5 could (or even should) be replaced so it is no
longer the bottleneck in several real-world cases (including mine).

However, this patch is not for MD5 performance; rather, it is for the rolling
checksum rsync uses to match blocks on existing files on both ends to reduce
transfer size.

On Mon, May 18, 2020 at 5:44 PM Filipe Maia via rsync <rsync at lists.samba.org> wrote:
>
> I think this is a great patch but, in my view, an even better way to tackle
> the fundamental problem (the performance limitations) is to use a much
> faster checksum like xxhash, as has been suggested before:
> https://lists.samba.org/archive/rsync/2019-October/031975.html
>
> Cheers,
> Filipe
>
> [...]
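The "rolling" property referred to here is what a one-shot hash like xxhash does not give you: once the weak checksum of a block-sized window is known, sliding the window by one byte is an O(1) update, which is how rsync can test every offset of the destination file cheaply. The sketch below illustrates that property with the same s1/s2 recurrence as get_checksum1(), simplified to CHAR_OFFSET == 0; it is an illustration, not rsync's actual match.c code.

/* roll_demo.c - show that the get_checksum1() sums can be rolled one byte
 * at a time. Assumes CHAR_OFFSET == 0; illustration only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* full recompute over a window of n bytes (same recurrence as get_checksum1) */
static void csum(const signed char *p, int n, uint32_t *s1p, uint32_t *s2p)
{
    uint32_t s1 = 0, s2 = 0;
    for (int i = 0; i < n; i++) {
        s1 += p[i];
        s2 += s1;
    }
    *s1p = s1; *s2p = s2;
}

int main(void)
{
    enum { LEN = 1024, N = 64 };          /* N = block size */
    signed char buf[LEN];
    for (int i = 0; i < LEN; i++)
        buf[i] = (signed char)rand();

    uint32_t s1, s2;
    csum(buf, N, &s1, &s2);               /* checksum of window [0, N) */

    for (int k = 0; k + N < LEN; k++) {
        /* roll window from [k, k+N) to [k+1, k+1+N) in O(1) */
        s1 = s1 - buf[k] + buf[k + N];
        s2 = s2 - (uint32_t)N * buf[k] + s1;

        uint32_t r1, r2;
        csum(buf + k + 1, N, &r1, &r2);   /* recompute from scratch to verify */
        if (s1 != r1 || s2 != r2) {
            printf("mismatch at offset %d\n", k + 1);
            return 1;
        }
    }
    printf("rolling update matches full recompute at every offset\n");
    return 0;
}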
Jorrit Jongma
2020-May-18 15:58 UTC
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
Well, don't get too excited: get_checksum1() (the function optimized here) is
not the main performance limiter in this case, it's get_checksum2() and
sum_update(), which will be using MD5. You can force using MD4, but on the
slower CPUs I've tested, in practice that is slower rather than faster,
contrary to what would be expected.

While this patch will improve things a little, to improve things a lot we
need to tackle or replace MD5. Unfortunately, single-stream MD5 cannot be
effectively optimized with SSE; at least I've not seen an SSE version faster
than pure C, and I've looked into it.

What we _can_ do is parallelize multiple streams using SSE, which may double
to triple the throughput at the same CPU load under ideal circumstances.
However, this cannot be applied to rsync as-is, as it doesn't process
multiple files simultaneously (and it is questionable if that is something we
should even want). The single-file stream could still be parallelized this
way, but it would require a slight change in checksum generation that would
in turn require a protocol change - both ends need to support it. At that
point we might as well swap MD5 out completely, though I will still be
digging deeper into this case.

The good news is that this parallelization _is_ possible in a drop-in fashion
for the case where rsync is comparing the chunks on both ends, the same case
where the get_checksum1() patch shows its benefits. I estimate performance
improvements could reach about 30% for that specific case (re-transferring
large yet slightly modified files), but that does nothing for the performance
of whole-file checksumming or the transfer of new files. Depending on your
use-case you may never or rarely even see that performance improvement in
action. It applies to my use-case though, so I am looking into this.

On Mon, May 18, 2020 at 5:18 PM Ben RUBSON via rsync <rsync at lists.samba.org> wrote:
>
> On 18 May 2020, at 17:06, Jorrit Jongma via rsync <rsync at lists.samba.org> wrote:
>
> This drop-in patch increases the performance of the get_checksum1()
> function on x86-64.
>
> As a reference, this is related: https://bugzilla.samba.org/show_bug.cgi?id=13082
>
> Thank you Jorrit !
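Whether MD5 really is the ceiling on a given machine can be eyeballed with a quick throughput probe and compared against the MB/s figures in the patch header. A rough sketch using OpenSSL's EVP interface follows (build with -lcrypto); rsync's own sum_update() path links its own MD5 code, so the numbers are only an approximation.

/* md5_bench.c - rough MD5 throughput probe; not rsync's own code path. */
#include <openssl/evp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    enum { CHUNK = 1 << 20 };          /* 1 MiB per update */
    const int iters = 1024;            /* hash 1 GiB in total */
    unsigned char *buf = malloc(CHUNK);
    if (!buf)
        return 1;
    memset(buf, 0xa5, CHUNK);

    unsigned char digest[EVP_MAX_MD_SIZE];
    unsigned int dlen = 0;

    EVP_MD_CTX *ctx = EVP_MD_CTX_new();
    clock_t t0 = clock();
    EVP_DigestInit_ex(ctx, EVP_md5(), NULL);
    for (int i = 0; i < iters; i++)
        EVP_DigestUpdate(ctx, buf, CHUNK);
    EVP_DigestFinal_ex(ctx, digest, &dlen);
    clock_t t1 = clock();
    EVP_MD_CTX_free(ctx);

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("MD5: %.0f MB/s\n", (double)iters * CHUNK / (1e6 * secs));
    free(buf);
    return 0;
}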
Sebastian Andrzej Siewior
2020-May-18 19:15 UTC
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
On 2020-05-18 17:06:51 [+0200], Jorrit Jongma via rsync wrote:
> diff --git a/checksum.c b/checksum.c
> index cd234038..4e696f3d 100644
> --- a/checksum.c
> +++ b/checksum.c
> @@ -99,6 +99,7 @@ int canonical_checksum(int csum_type)
>  	return csum_type >= CSUM_MD4 ? 1 : 0;
>  }
>  
> +#ifndef __SSE2__ // see checksum_sse2.c for SSE2/SSSE3 version
>  /*
>    a simple 32 bit checksum that can be updated from either end
>    (inspired by Mark Adler's Adler-32 checksum)
> @@ -119,6 +120,7 @@ uint32 get_checksum1(char *buf1, int32 len)
>  	}
>  	return (s1 & 0xffff) + (s2 << 16);
>  }
> +#endif

You can't replace the code like that with SSE2+. You need runtime detection
for this. Otherwise it can't be enabled by distros because it would fail on
CPUs without SSE2+. Only SSE is part of generic x86-64.

Sebastian
Jorrit Jongma
2020-May-18 19:55 UTC
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
What do you base this on? Per https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html :

"For the x86-32 compiler, you must use -march=cpu-type, -msse or -msse2
switches to enable SSE extensions and make this option effective. For the
x86-64 compiler, these extensions are enabled by default."

That reads to me like we're fine for SSE2.

As stated in my comments, SSSE3 support must be manually enabled at build
time. Your comment would imply that SSSE3 is enabled out of the box on builds
on machines that support it, but this is not the case (it certainly isn't on
my Ubuntu box).

It would be preferred to detect this at runtime, but getting that to work on
GCC is (apparently) a mess, and would probably require modifications to
configure/Makefile/etc that I'm not comfortable doing, as my lack of
expertise on those would probably lead me to break the build for somebody
else. If someone knowledgeable enough in that area wants to fix it, though...

The only reason there's an SSE2 backport (you'll find SSSE3 support on most
CPUs up to nearly a decade old) in the first place is because, by my
understanding, SSE2 is supported on all x86-64 CPUs out of the box.

> You can't replace the code like that with SSE2+. You need runtime
> detection for this. Otherwise it can't be enabled by distros because it
> would fail on CPUs without SSE2+. Only SSE is part of generic x86-64.
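The runtime detection Sebastian asks for does not necessarily require configure changes: GCC and clang expose CPUID through __builtin_cpu_supports(), so a dispatcher along the following lines is one possible shape. This is only a sketch with hypothetical function names (get_checksum1_c/_sse2/_ssse3), each assumed to be built with the matching -m flags or target attribute in its own translation unit; it is not a tested patch against rsync.

/* Hypothetical runtime dispatch between checksum implementations. */
#include <stdint.h>

uint32_t get_checksum1_c(char *buf, int32_t len);      /* plain C fallback */
uint32_t get_checksum1_sse2(char *buf, int32_t len);   /* built with -msse2 */
uint32_t get_checksum1_ssse3(char *buf, int32_t len);  /* built with -mssse3 */

uint32_t get_checksum1(char *buf, int32_t len)
{
    static uint32_t (*impl)(char *, int32_t);

    if (!impl) {                        /* pick an implementation on first call */
#if defined(__x86_64__) && defined(__GNUC__)
        if (__builtin_cpu_supports("ssse3"))
            impl = get_checksum1_ssse3;
        else if (__builtin_cpu_supports("sse2"))
            impl = get_checksum1_sse2;
        else
#endif
            impl = get_checksum1_c;
    }
    return impl(buf, len);
}

GCC's target attribute, target_clones, or an ifunc resolver would be other ways to get the same effect without separate translation units, at the cost of requiring a newer toolchain.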