Displaying 6 results from an estimated 6 matches for "mul_one".
2020 May 19
5
[PATCHv2] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
...__m128i in8_1 = sse_load_si128((__m128i_u*)&buf[i]);
+ __m128i in8_2 = sse_load_si128((__m128i_u*)&buf[i + 16]);
+
+ // (1*buf[i] + 1*buf[i+1]), (1*buf[i+2] + 1*buf[i+3]), ... 2*[int16*8]
+ // Fastest, even though multiply by 1
+ __m128i mul_one = _mm_set1_epi8(1);
+ __m128i add16_1 = sse_maddubs_epi16(mul_one, in8_1);
+ __m128i add16_2 = sse_maddubs_epi16(mul_one, in8_2);
+
+ // (4*buf[i] + 3*buf[i+1]), (2*buf[i+2] + 1*buf[i+3]), ... 2*[int16*8]
+ __m128i mul_const = _mm_set1_epi32(4 + (3 <<...
2020 May 18
6
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
2020 May 18
0
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
2020 May 18
2
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
2020 May 20
0
[PATCHv2] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
2020 May 18
3
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
What do you base this on?
Per https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html :
"For the x86-32 compiler, you must use -march=cpu-type, -msse or
-msse2 switches to enable SSE extensions and make this option
effective. For the x86-64 compiler, these extensions are enabled by
default."
That reads to me like we're fine for SSE2. As stated in my comments,
SSSE3 support must be