search for: _mm_add_pd

Displaying 7 results from an estimated 7 matches for "_mm_add_pd".

2008 Nov 26
1
SSE2 code won't compile in VC
...e versa. While there are intrinsics to do the casts, I thought it would be
simpler to just use an intrinsic that accomplishes the same thing without all
the casting. Thanks, --John

@@ -91,7 +91,7 @@ static inline double inner_product_double(const float *a, const float *b, unsign
     sum = _mm_add_pd(sum, _mm_cvtps_pd(t));
     sum = _mm_add_pd(sum, _mm_cvtps_pd(_mm_movehl_ps(t, t)));
   }
-  sum = _mm_add_sd(sum, (__m128d) _mm_movehl_ps((__m128) sum, (__m128) sum));
+  sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
   _mm_store_sd(&ret, sum);
   return ret;
 }
@@ -120,7 +120,7 @...
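For reference, a minimal sketch of the horizontal sum this diff switches to:
_mm_unpackhi_pd brings the high double down to the low lane, so no
__m128/__m128d casts are needed (MSVC rejects those casts). The helper name
hsum_pd is hypothetical, not from the patch.

   #include <emmintrin.h>

   /* Sum the two doubles held in an __m128d without casting between
    * __m128 and __m128d.  hsum_pd is a hypothetical helper name. */
   static inline double hsum_pd(__m128d sum)
   {
      double ret;
      sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
      _mm_store_sd(&ret, sum);
      return ret;
   }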
2008 May 03
2
Resampler (no api)
...R_PRODUCT_DOUBLE
+
+static inline double inner_product_double(const float *a, const float *b, unsigned int len)
+{
+   int i;
+   double ret;
+   __m128d sum = _mm_setzero_pd();
+   __m128 t;
+   for (i=0;i<len;i+=8)
+   {
+      t = _mm_mul_ps(_mm_loadu_ps(a+i), _mm_loadu_ps(b+i));
+      sum = _mm_add_pd(sum, _mm_cvtps_pd(t));
+      sum = _mm_add_pd(sum, _mm_cvtps_pd(_mm_movehl_ps(t, t)));
+
+      t = _mm_mul_ps(_mm_loadu_ps(a+i+4), _mm_loadu_ps(b+i+4));
+      sum = _mm_add_pd(sum, _mm_cvtps_pd(t));
+      sum = _mm_add_pd(sum, _mm_cvtps_pd(_mm_movehl_ps(t, t)));
+   }
+   sum = _mm_add_sd(sum,...
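A self-contained sketch of this SSE2 inner product, combining the loop above
with the _mm_unpackhi_pd horizontal sum from the later patches. It assumes
len is a multiple of 8, as the original loop does, and uses an unsigned loop
counter to avoid the signed/unsigned comparison in the excerpt.

   #include <xmmintrin.h>
   #include <emmintrin.h>

   /* Multiply four floats at a time, widen each half of the product to
    * double with _mm_cvtps_pd, and accumulate in an __m128d. */
   static inline double inner_product_double(const float *a, const float *b,
                                             unsigned int len)
   {
      unsigned int i;
      double ret;
      __m128d sum = _mm_setzero_pd();
      __m128 t;
      for (i = 0; i < len; i += 8)
      {
         t = _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
         sum = _mm_add_pd(sum, _mm_cvtps_pd(t));
         sum = _mm_add_pd(sum, _mm_cvtps_pd(_mm_movehl_ps(t, t)));

         t = _mm_mul_ps(_mm_loadu_ps(a + i + 4), _mm_loadu_ps(b + i + 4));
         sum = _mm_add_pd(sum, _mm_cvtps_pd(t));
         sum = _mm_add_pd(sum, _mm_cvtps_pd(_mm_movehl_ps(t, t)));
      }
      sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
      _mm_store_sd(&ret, sum);
      return ret;
   }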
2008 May 03
0
Resampler, memory only variant
...R_PRODUCT_DOUBLE
+
+static inline double inner_product_double(const float *a, const float *b, unsigned int len)
+{
+   int i;
+   double ret;
+   __m128d sum = _mm_setzero_pd();
+   __m128 t;
+   for (i=0;i<len;i+=8)
+   {
+      t = _mm_mul_ps(_mm_loadu_ps(a+i), _mm_loadu_ps(b+i));
+      sum = _mm_add_pd(sum, _mm_cvtps_pd(t));
+      sum = _mm_add_pd(sum, _mm_cvtps_pd(_mm_movehl_ps(t, t)));
+
+      t = _mm_mul_ps(_mm_loadu_ps(a+i+4), _mm_loadu_ps(b+i+4));
+      sum = _mm_add_pd(sum, _mm_cvtps_pd(t));
+      sum = _mm_add_pd(sum, _mm_cvtps_pd(_mm_movehl_ps(t, t)));
+   }
+   sum = _mm_add_sd(sum,...
2009 Oct 26
1
[PATCH] Fix miscompile of SSE resampler
...product_double(double *ret, const float *a, const float *b, unsigned int len)
 {
    int i;
-   double ret;
    __m128d sum = _mm_setzero_pd();
    __m128 t;
    for (i=0;i<len;i+=8)
@@ -92,14 +87,12 @@ static inline double inner_product_double(const float *a, const float *b, unsign
       sum = _mm_add_pd(sum, _mm_cvtps_pd(_mm_movehl_ps(t, t)));
    }
    sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
-   _mm_store_sd(&ret, sum);
-   return ret;
+   _mm_store_sd(ret, sum);
 }

 #define OVERRIDE_INTERPOLATE_PRODUCT_DOUBLE

-static inline double interpolate_product_double(const float *a, const...
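To show just the calling-convention change this patch makes (the result is
written through a pointer instead of returned by value), a thin wrapper over
the value-returning sketch given earlier. The name inner_product_double_ptr
is hypothetical; the real patch folds the _mm_store_sd(ret, sum) directly
into the function instead of wrapping it.

   /* Pointer-return shape after the patch; assumes the value-returning
    * inner_product_double sketch above is in scope. */
   static inline void inner_product_double_ptr(double *ret, const float *a,
                                               const float *b,
                                               unsigned int len)
   {
      *ret = inner_product_double(a, b, len);
   }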
2004 Aug 06
2
[PATCH] Make SSE Run Time option.
...// Ci = Ai * Br + Ar * Bi

   __m128d real = _mm_mul_pd( Ar, Br );
   __m128d imag = _mm_mul_pd( Ai, Br );

   Ai = _mm_mul_pd( Ai, Bi );
   Ar = _mm_mul_pd( Ar, Bi );

   real = _mm_sub_pd( real, Ai );
   imag = _mm_add_pd( imag, Ar );

   *Cr = real;
   *Ci = imag;
}

No permute is required. The key thing to note is that I do two/four complex
multiplies at a time in proper SIMD fashion, unlike PNI based methods. Thus,
throughput is 3 vector ALU instructions per element, even though...
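A self-contained sketch of the split-format multiply quoted above, with a
hypothetical wrapper name and parameter passing filled in around the original
intrinsic sequence. It assumes the real and imaginary parts live in separate
__m128d vectors (Ar/Ai and Br/Bi), so two complex products come out per call
with no shuffles.

   #include <emmintrin.h>

   /* Split-format SSE2 complex multiply of two complex pairs:
    *   Cr = Ar*Br - Ai*Bi
    *   Ci = Ai*Br + Ar*Bi
    * cmul2_split is a hypothetical name; the intrinsic sequence is the one
    * from the message. */
   static inline void cmul2_split(__m128d Ar, __m128d Ai,
                                  __m128d Br, __m128d Bi,
                                  __m128d *Cr, __m128d *Ci)
   {
      __m128d real = _mm_mul_pd(Ar, Br);
      __m128d imag = _mm_mul_pd(Ai, Br);

      Ai = _mm_mul_pd(Ai, Bi);
      Ar = _mm_mul_pd(Ar, Bi);

      *Cr = _mm_sub_pd(real, Ai);
      *Ci = _mm_add_pd(imag, Ar);
   }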
2004 Aug 06
0
[PATCH] Make SSE Run Time option.
...on.html
> // Cr = Ar * Br - Ai * Bi
> // Ci = Ai * Br + Ar * Bi
>
> __m128d real = _mm_mul_pd( Ar, Br );
> __m128d imag = _mm_mul_pd( Ai, Br );
>
> Ai = _mm_mul_pd( Ai, Bi );
> Ar = _mm_mul_pd( Ar, Bi );
>
> real = _mm_sub_pd( real, Ai );
> imag = _mm_add_pd( imag, Ar );
>
> *Cr = real;
> *Ci = imag;
> }
>
> No permute is required. The key thing to note is that I do two/four
> complex multiplies at a time in proper SIMD fashion, unlike PNI based
> methods. Thus, throughput is 3 vector ALU instructions per element, even ...
2004 Aug 06
5
[PATCH] Make SSE Run Time option.
> Personally, I don't think much of PNI. The complex arithmetic stuff they
> added sets you up for a lot of permute overhead that is inefficient --
> especially on a processor that is already weak on permute. In my opinion,

Actually, the new instructions make it possible to do complex multiplies
without the need to permute and separate the add and subtract. The really
useful...
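For context, a minimal sketch of the kind of PNI (SSE3) sequence being
discussed here, assuming the interleaved [re, im] double layout those
instructions target; cmul_sse3 is a hypothetical name. One swap of the input
is still needed, but _mm_addsub_pd fuses the subtract and add of the real and
imaginary accumulations into a single instruction.

   #include <pmmintrin.h>

   /* One complex multiply on interleaved [re, im] doubles using the SSE3
    * additions _mm_movedup_pd and _mm_addsub_pd.
    * a = [ar, ai], b = [br, bi], result = [ar*br - ai*bi, ai*br + ar*bi]. */
   static inline __m128d cmul_sse3(__m128d a, __m128d b)
   {
      __m128d br = _mm_movedup_pd(b);         /* [br, br] */
      __m128d bi = _mm_unpackhi_pd(b, b);     /* [bi, bi] */
      __m128d t  = _mm_mul_pd(a, br);         /* [ar*br, ai*br] */
      __m128d s  = _mm_shuffle_pd(a, a, 1);   /* [ai, ar] */
      __m128d u  = _mm_mul_pd(s, bi);         /* [ai*bi, ar*bi] */
      return _mm_addsub_pd(t, u);             /* [ar*br - ai*bi, ai*br + ar*bi] */
   }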