thr3ads.net - Speex dev - [speex-dev] [PATCH] Make SSE Run Time option. [Aug 2004]

Jean-Marc Valin

2004-Aug-06 15:01 UTC

[speex-dev] [PATCH] Make SSE Run Time option.

> Personally, I don't think much of PNI. The complex arithmetic stuff they
> added sets you up for a lot of permute overhead that is inefficient --
> especially on a processor that is already weak on permute. In my opinion,

Actually, the new instructions make it possible to do complex multiplies
without the need to permute and separate the add and subtract. The
really useful instruction here is the "addsubps".

> I find it hard to believe you will never need SSE2.  There are some
> instructions that are legitimately useful to single precision floating
> point work, such as cvtps2dq and cvttps2dq.

There are so few conversions in Speex in the first place that it's not
even worth bothering with that. You get all the gain from just addps and
mulps (and the "glue instructions" that let you use them, like movaps and
shufps).
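
For readers following the argument, here is a minimal sketch of what an addsubps-based multiply on interleaved data might look like. It is not code from this thread: the helper name complex_mul_interleaved, the [re, im, re, im] layout, and the SSE3_TARGET macro are assumptions made for illustration. Note that this particular formulation still uses one shufps to swap real and imaginary parts, plus movsldup/movshdup to duplicate the parts of the second operand.

```c
#include <pmmintrin.h>  /* SSE3 (PNI): _mm_addsub_ps, _mm_moveldup_ps, _mm_movehdup_ps */

/* On GCC/Clang, enable SSE3 for this one function if the build flags did not. */
#if defined(__GNUC__) && !defined(__SSE3__)
#define SSE3_TARGET __attribute__((target("sse3")))
#else
#define SSE3_TARGET
#endif

/* Hypothetical helper: c = a * b for two complex floats stored
   interleaved as [re0, im0, re1, im1]. */
SSE3_TARGET
static void complex_mul_interleaved(float *c, const float *a, const float *b)
{
   __m128 va   = _mm_loadu_ps(a);       /* [ar0, ai0, ar1, ai1] */
   __m128 vb   = _mm_loadu_ps(b);       /* [br0, bi0, br1, bi1] */
   __m128 b_re = _mm_moveldup_ps(vb);   /* [br0, br0, br1, br1] */
   __m128 b_im = _mm_movehdup_ps(vb);   /* [bi0, bi0, bi1, bi1] */

   /* Swap real and imaginary parts of a: [ai0, ar0, ai1, ar1] */
   __m128 a_sw = _mm_shuffle_ps(va, va, _MM_SHUFFLE(2, 3, 0, 1));

   __m128 t1 = _mm_mul_ps(va, b_re);    /* [ar*br, ai*br, ...] */
   __m128 t2 = _mm_mul_ps(a_sw, b_im);  /* [ai*bi, ar*bi, ...] */

   /* addsubps subtracts in even lanes and adds in odd lanes:
      [ar*br - ai*bi, ai*br + ar*bi, ...] */
   _mm_storeu_ps(c, _mm_addsub_ps(t1, t2));
}
```

Each call handles two complex products per vector; whether that beats a split real/imaginary layout depends on the data format the surrounding code already uses.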

        Jean-Marc


-- 
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada



Daniel Vogel

2004-Aug-06 15:01 UTC

[speex-dev] [PATCH] Make SSE Run Time option.

Unrelated, but please use SSE/MMX/... intrinsics on Windows instead of using
inline assembly so you also get the speed benefit on Win64.

-- Daniel, Epic Games Inc.

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to
'speex-dev-request@xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is
needed.
Unsubscribe messages sent to the list will be ignored/filtered.

Ian Ollmann

2004-Aug-06 15:01 UTC

[speex-dev] [PATCH] Make SSE Run Time option.

On Thu, 15 Jan 2004, Ian Ollmann wrote:
> On Thu, 15 Jan 2004, Jean-Marc Valin wrote:
>
> > > Personally, I don't think much of PNI. The complex arithmetic stuff they
> > > added sets you up for a lot of permute overhead that is inefficient --
> > > especially on a processor that is already weak on permute. In my opinion,
> >
> > Actually, the new instructions make it possible to do complex multiplies
> > without the need to permute and separate the add and subtract. The
> > really useful instruction here is the "addsubps".
>
> Would you like to prove it with a code sample?

I suppose that if I make such a demand, it would only be sporting to
provide what I believe to be the more efficient competing method, which
uses only SSE/SSE2.  Double precision is shown. For single precision,
simply replace every "pd" with "ps" and every "__m128d" with "__m128".

        #include <emmintrin.h>  // SSE2 intrinsics (__m128d, _mm_*_pd)

        //For C[] = A[] * B[]
        //The real and imaginary parts of A, B and C are stored in
        //different arrays, not interleaved
        static inline void ComplexMultiply( __m128d *Cr, __m128d *Ci,
                                            __m128d Ar, __m128d Ai,
                                            __m128d Br, __m128d Bi )
        {
                // http://mathworld.wolfram.com/ComplexMultiplication.html
                // Cr = Ar * Br - Ai * Bi
                // Ci = Ai * Br + Ar * Bi

                __m128d real = _mm_mul_pd( Ar, Br );
                __m128d imag = _mm_mul_pd( Ai, Br );

                Ai = _mm_mul_pd( Ai, Bi );
                Ar = _mm_mul_pd( Ar, Bi );

                real = _mm_sub_pd( real, Ai );
                imag = _mm_add_pd( imag, Ar );

                *Cr = real;
                *Ci = imag;
        }

No permute is required. The key thing to note is that I do two/four
complex multiplies at a time in proper SIMD fashion, unlike PNI based
methods.  Thus, throughput is 3 vector ALU instructions per element, even
though I do 6 ALU instructions.  (1.5 insns/element for single precision.)
Stores at the end are merely a formality required by C language
architectures to return more than one result and will be presumably
removed when the function is inlined.
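
As an illustration of the split-array layout described above, a single-precision loop over whole arrays might look like the sketch below. The function name complex_multiply_arrays, its parameter list, and the scalar tail loop are assumptions added for this example and are not part of the original post.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics (__m128, _mm_*_ps) */

/* Hypothetical driver: C[i] = A[i] * B[i] for n complex numbers whose
   real and imaginary parts live in separate arrays. */
static void complex_multiply_arrays(float *cr, float *ci,
                                    const float *ar, const float *ai,
                                    const float *br, const float *bi,
                                    size_t n)
{
   size_t i = 0;

   /* Four complex multiplies per iteration, no shuffles needed. */
   for (; i + 4 <= n; i += 4)
   {
      __m128 Ar = _mm_loadu_ps(ar + i);
      __m128 Ai = _mm_loadu_ps(ai + i);
      __m128 Br = _mm_loadu_ps(br + i);
      __m128 Bi = _mm_loadu_ps(bi + i);

      /* Cr = Ar*Br - Ai*Bi,  Ci = Ai*Br + Ar*Bi */
      __m128 real = _mm_sub_ps(_mm_mul_ps(Ar, Br), _mm_mul_ps(Ai, Bi));
      __m128 imag = _mm_add_ps(_mm_mul_ps(Ai, Br), _mm_mul_ps(Ar, Bi));

      _mm_storeu_ps(cr + i, real);
      _mm_storeu_ps(ci + i, imag);
   }

   /* Scalar tail for lengths that are not a multiple of 4. */
   for (; i < n; i++)
   {
      float re = ar[i] * br[i] - ai[i] * bi[i];
      float im = ai[i] * br[i] + ar[i] * bi[i];
      cr[i] = re;
      ci[i] = im;
   }
}
```

Unaligned loads and stores are used so the sketch works on any buffers; with 16-byte aligned arrays, _mm_load_ps and _mm_store_ps could be substituted.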

Ian

---------------------------------------------------
   Ian Ollmann, Ph.D.       iano@cco.caltech.edu
---------------------------------------------------

Jean-Marc Valin

2004-Aug-06 15:01 UTC

[speex-dev] [PATCH] Make SSE Run Time option.

On Thu, 15/01/2004 at 15:30, Daniel Vogel wrote:
> Unrelated, but please use SSE/MMX/... intrinsics on Windows instead of using
> inline assembly so you also get the speed benefit on Win64.

OK, so here's a first start. I've translated to intrinsics the asm I
sent 1-2 days ago. The result is about 5% slower than the pure asm
approach, so it's not too bad (SSE asm is 2x faster than x87). Note that
unlike the previous version which had a kludge to work with order 8
(required for wideband), this version only works with order 10, so it
will only work for narrowband.

#include <xmmintrin.h>  /* SSE intrinsics (__m128, _mm_*_ps) */

void filter_mem2(float *x, float *_num, float *_den, float *y, int N,
                 int ord, float *_mem)
{
   __m128 num[3], den[3], mem[3];
   int i;

   /* Copy numerator, denominator and memory to aligned xmm */
   for (i=0;i<2;i++)
   {
      mem[i] = _mm_loadu_ps(_mem+4*i);
      num[i] = _mm_loadu_ps(_num+4*i+1);
      den[i] = _mm_loadu_ps(_den+4*i+1);
   }
   mem[2] = _mm_setr_ps(_mem[8], _mem[9], 0, 0);
   num[2] = _mm_setr_ps(_num[9], _num[10], 0, 0);
   den[2] = _mm_setr_ps(_den[9], _den[10], 0, 0);
   
   for (i=0;i<N;i++)
   {
      __m128 xx;
      __m128 yy;
      /* Compute next filter result */
      xx = _mm_load_ps1(x+i);
      yy = _mm_add_ss(xx, mem[0]);
      _mm_store_ss(y+i, yy);
      yy = _mm_shuffle_ps(yy, yy, 0);
      
      /* Update memory */
      mem[0] = _mm_move_ss(mem[0], mem[1]);
      mem[0] = _mm_shuffle_ps(mem[0], mem[0], 0x39);

      mem[0] = _mm_add_ps(mem[0], _mm_mul_ps(xx, num[0]));
      mem[0] = _mm_sub_ps(mem[0], _mm_mul_ps(yy, den[0]));

      mem[1] = _mm_move_ss(mem[1], mem[2]);
      mem[1] = _mm_shuffle_ps(mem[1], mem[1], 0x39);

      mem[1] = _mm_add_ps(mem[1], _mm_mul_ps(xx, num[1]));
      mem[1] = _mm_sub_ps(mem[1], _mm_mul_ps(yy, den[1]));

      mem[2] = _mm_shuffle_ps(mem[2], mem[2], 0xfd);

      mem[2] = _mm_add_ps(mem[2], _mm_mul_ps(xx, num[2]));
      mem[2] = _mm_sub_ps(mem[2], _mm_mul_ps(yy, den[2]));
   }
   /* Put memory back in its place */
   _mm_storeu_ps(_mem, mem[0]);
   _mm_storeu_ps(_mem+4, mem[1]);
   _mm_store_ss(_mem+8, mem[2]);
   mem[2] = _mm_shuffle_ps(mem[2], mem[2], 0x55);
   _mm_store_ss(_mem+9, mem[2]);
}
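
For comparison, the recurrence that the intrinsics above appear to implement can be written as a plain scalar loop. This is a sketch inferred from the SSE code in this message rather than the actual Speex source; the name filter_mem2_ref is hypothetical, and unlike the SSE version it honours the ord argument instead of assuming order 10.

```c
/* Hypothetical scalar reference inferred from the SSE version above.
   num[] and den[] hold ord+1 coefficients (index 0 is not used here),
   mem[] holds ord state values. */
static void filter_mem2_ref(const float *x, const float *num, const float *den,
                            float *y, int N, int ord, float *mem)
{
   int i, j;
   for (i = 0; i < N; i++)
   {
      float xi = x[i];
      float yi = xi + mem[0];   /* next filter output */
      y[i] = yi;

      /* Shift the memory down by one and add each tap's contribution */
      for (j = 0; j < ord - 1; j++)
         mem[j] = mem[j + 1] + num[j + 1] * xi - den[j + 1] * yi;

      /* The last state has nothing to shift in */
      mem[ord - 1] = num[ord] * xi - den[ord] * yi;
   }
}
```

Running both versions on the same input with ord set to 10 is one way to check the vectorised code, keeping in mind that floating-point results may differ in the last bits if the compiler reorders or fuses operations.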

        Jean-Marc


-- 
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada



Jean-Marc Valin

2004-Aug-06 15:01 UTC

[speex-dev] [PATCH] Make SSE Run Time option.

On Thu, 15/01/2004 at 15:30, Daniel Vogel wrote:
> Unrelated, but please use SSE/MMX/... intrinsics on Windows instead of using
> inline assembly so you also get the speed benefit on Win64.

Just curious, but why can't Win64 use Win32 inline assembly? Or do you
mean the benefit of having 16 registers?

        Jean-Marc


-- 
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada



Ian Ollmann

2004-Aug-06 15:01 UTC

[speex-dev] [PATCH] Make SSE Run Time option.

On Thu, 15 Jan 2004, Jean-Marc Valin wrote:
> > Personally, I don't think much of PNI. The complex arithmetic stuff they
> > added sets you up for a lot of permute overhead that is inefficient --
> > especially on a processor that is already weak on permute. In my opinion,
>
> Actually, the new instructions make it possible to do complex multiplies
> without the need to permute and separate the add and subtract. The
> really useful instruction here is the "addsubps".

Would you like to prove it with a code sample?

Ian

