Displaying 9 results from an estimated 9 matches for "_mm_add_epi16".
2020 May 18
6
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
...than multiple _mm_hadds_epi16
+ // Shifting left, then shifting right again and shuffling (rather than just
+ // shifting right as with mul32 below) to cheaply end up with the correct sign
+ // extension as we go from int16 to int32.
+ __m128i sum_add32 = _mm_add_epi16(add16_1, add16_2);
+ sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 2));
+ sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 4));
+ sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 8));
+ sum_add32 = _mm_srai_epi...
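A minimal, self-contained sketch of the trick that comment describes (the function name hsum_epi16 and the test values are mine, not from the patch): the eight int16 lanes are combined with a log2 series of byte-shifted adds so the grand total lands in the top 16-bit lane, and an arithmetic right shift by 16 then leaves a correctly sign-extended int32 in the top 32-bit lane.

/* illustrative sketch; hsum_epi16 and the test values are not from the patch */
#include <emmintrin.h>
#include <stdio.h>

static int hsum_epi16(__m128i v)
{
    v = _mm_add_epi16(v, _mm_slli_si128(v, 2));  /* prefix-sum step: +1 lane */
    v = _mm_add_epi16(v, _mm_slli_si128(v, 4));  /* +2 lanes */
    v = _mm_add_epi16(v, _mm_slli_si128(v, 8));  /* +4 lanes: lane 7 holds the total */
    v = _mm_srai_epi32(v, 16);                   /* sign-extend the top 16-bit lane into int32 lane 3 */
    return _mm_cvtsi128_si32(_mm_shuffle_epi32(v, 3)); /* move lane 3 down to lane 0 and extract */
}

int main(void)
{
    __m128i v = _mm_setr_epi16(1, -2, 3, -4, 5, -6, 7, -8);
    printf("%d\n", hsum_epi16(v)); /* prints -4 */
    return 0;
}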
2010 Jun 11
0
[LLVMdev] thinking about timing-test-driven scheduler
On Wed, 2010-06-09 at 17:30 +0200, orthochronous wrote:
> Hi,
>
> I've been thinking about how to implement a framework for attempting
> instruction scheduling of small blocks of code by using (GA/simulated
> annealing/etc) controlled timing-test-evaluations of various
> orderings.
This sounds interesting.
> (I'm particularly interested in small-ish numerical inner
2020 May 18
0
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
...> + // Shifting left, then shifting right again and shuffling (rather than just
> + // shifting right as with mul32 below) to cheaply end up with the correct sign
> + // extension as we go from int16 to int32.
> + __m128i sum_add32 = _mm_add_epi16(add16_1, add16_2);
> + sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 2));
> + sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 4));
> + sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 8));
> +...
2020 May 19
5
[PATCHv2] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
...than multiple _mm_hadds_epi16
+ // Shifting left, then shifting right again and shuffling (rather than just
+ // shifting right as with mul32 below) to cheaply end up with the correct sign
+ // extension as we go from int16 to int32.
+ __m128i sum_add32 = _mm_add_epi16(add16_1, add16_2);
+ sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 2));
+ sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 4));
+ sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 8));
+ sum_add32 = _mm_srai_epi...
2010 Jun 09
2
[LLVMdev] thinking about timing-test-driven scheduler
Hi,
I've been thinking about how to implement a framework for attempting
instruction scheduling of small blocks of code by using (GA/simulated
annealing/etc) controlled timing-test-evaluations of various
orderings. (I'm particularly interested in small-ish numerical inner loop
code on low-power CPUs like Atom and various ARMs where the CPU
doesn't have the ability to
2020 May 18
2
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
...Shifting left, then shifting right again and shuffling (rather than just
>> + // shifting right as with mul32 below) to cheaply end up with the correct sign
>> + // extension as we go from int16 to int32.
>> + __m128i sum_add32 = _mm_add_epi16(add16_1, add16_2);
>> + sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 2));
>> + sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 4));
>> + sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 8));
>> +...
2020 May 20
0
[PATCHv2] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
...> + // Shifting left, then shifting right again and shuffling (rather than just
> + // shifting right as with mul32 below) to cheaply end up with the correct sign
> + // extension as we go from int16 to int32.
> + __m128i sum_add32 = _mm_add_epi16(add16_1, add16_2);
> + sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 2));
> + sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 4));
> + sum_add32 = _mm_add_epi16(sum_add32, _mm_slli_si128(sum_add32, 8));
> + sum_...
2020 May 18
3
[PATCH] SSE2/SSSE3 optimized version of get_checksum1() for x86-64
What do you base this on?
Per https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html :
"For the x86-32 compiler, you must use -march=cpu-type, -msse or
-msse2 switches to enable SSE extensions and make this option
effective. For the x86-64 compiler, these extensions are enabled by
default."
That reads to me like we're fine for SSE2. As stated in my comments,
SSSE3 support must be
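For reference, a minimal sketch of how such guards typically look (my illustration, not part of the patch): GCC defines __SSE2__ and __SSSE3__ when the corresponding extensions are enabled, and on x86-64 SSE2 is on by default, while SSSE3 still needs -mssse3 or a suitable -march.

/* illustrative guard sketch, not from the patch */
#include <emmintrin.h>        /* SSE2 intrinsics, e.g. _mm_add_epi16 */
#ifdef __SSSE3__
#include <tmmintrin.h>        /* SSSE3 intrinsics, e.g. _mm_hadds_epi16 */
#endif

#if defined(__x86_64__) && defined(__SSE2__)
/* SSE2 path: enabled by default for x86-64 targets */
#endif

#ifdef __SSSE3__
/* SSSE3 path: only compiled when -mssse3 (or a matching -march) was given */
#endif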
2009 Oct 13
3
Proposal for replacing asm code with intrinsics
...ns and are much easier to maintain.
For example:
_mm_sad_epu8(__m128i, __m128i) will be compiled into a PSADBW instruction with compiler-allocated registers.
And code like:
psadbw mm4,mm5
paddw mm0,mm4
Can be re-written into
__m64 mm0, mm4, mm5, mm6, mm7; // of course using meaningful names
mm0= _mm_add_epi16(mm0, _mm_sad_pu8(mm4, mm5));
The compiler will replace the variables with actual registers, ensuring better allocation and scheduling.
So, benefits are:
1) Easier to read & understand code, which can use the same variable names as the generic C version
2) Single source code for gcc & msvc &...
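Note that the __m64 snippet above mixes types: with __m64 operands the paddw equivalent is _mm_add_pi16, while _mm_add_epi16 and _mm_sad_epu8 are the 128-bit SSE2 forms. A hedged sketch of the same rewrite in SSE2 terms (the function and variable names are mine, not from the original mail):

#include <emmintrin.h>  /* SSE2: __m128i, _mm_add_epi16, _mm_sad_epu8 */

/* illustrative sketch, not from the original mail:
 *   psadbw xmm4, xmm5   ->  _mm_sad_epu8(a, b)
 *   paddw  xmm0, xmm4   ->  _mm_add_epi16(acc, ...)
 * psadbw yields one 16-bit sum of absolute byte differences per 64-bit half,
 * so accumulating with a 16-bit add is safe for a modest number of blocks. */
static __m128i accumulate_sad(__m128i acc, __m128i a, __m128i b)
{
    return _mm_add_epi16(acc, _mm_sad_epu8(a, b));
}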