On Thu, 15 Jan 2004, Ian Ollmann wrote:

> On Thu, 15 Jan 2004, Jean-Marc Valin wrote:
>
> > > Personally, I don't think much of PNI. The complex arithmetic stuff they
> > > added sets you up for a lot of permute overhead that is inefficient --
> > > especially on a processor that is already weak on permute. In my opinion,
> >
> > Actually, the new instructions make it possible to do complex multiplies
> > without the need to permute and separate the add and subtract. The
> > really useful instruction here is the "addsubps".
>
> Would you like to prove it with a code sample?

I suppose that if I make such a demand, it is only sporting to provide what I believe to be the more efficient competing method, which uses only SSE/SSE2. Double precision is shown; for single precision, simply replace every "pd" with "ps" and every "__m128d" with "__m128".

#include <emmintrin.h>   // SSE2 intrinsics

// For C[] = A[] * B[]
// The real and imaginary parts of A, B and C are stored in
// separate arrays, not interleaved
inline void ComplexMultiply( __m128d *Cr, __m128d *Ci,
                             __m128d Ar, __m128d Ai,
                             __m128d Br, __m128d Bi )
{
    // http://mathworld.wolfram.com/ComplexMultiplication.html
    // Cr = Ar * Br - Ai * Bi
    // Ci = Ai * Br + Ar * Bi

    __m128d real = _mm_mul_pd( Ar, Br );
    __m128d imag = _mm_mul_pd( Ai, Br );

    Ai = _mm_mul_pd( Ai, Bi );
    Ar = _mm_mul_pd( Ar, Bi );

    real = _mm_sub_pd( real, Ai );
    imag = _mm_add_pd( imag, Ar );

    *Cr = real;
    *Ci = imag;
}

No permute is required. The key point is that I do two (or, in single precision, four) complex multiplies at a time in proper SIMD fashion, unlike PNI-based methods. Throughput is therefore 3 vector ALU instructions per element even though 6 ALU instructions are issued (1.5 instructions per element in single precision). The stores at the end are merely a formality required by the C language to return more than one result, and will presumably disappear when the function is inlined.

Ian

---------------------------------------------------
Ian Ollmann, Ph.D.                 iano@cco.caltech.edu
---------------------------------------------------
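[A sketch for illustration, not part of Ian's mail: one way a caller might drive ComplexMultiply over split (de-interleaved) arrays, two double-precision complex products per iteration. The driver name, the 16-byte alignment of the arrays, and the even element count are all assumptions here.]

/* Hypothetical driver: C[] = A[] * B[] over n complex doubles kept in
   split arrays.  Assumes 16-byte-aligned pointers and n even. */
void ComplexMultiplyArrays( double *Cr, double *Ci,
                            const double *Ar, const double *Ai,
                            const double *Br, const double *Bi,
                            int n )
{
    int i;
    for ( i = 0; i < n; i += 2 )          /* two complex products per pass */
    {
        __m128d cr, ci;
        ComplexMultiply( &cr, &ci,
                         _mm_load_pd( Ar + i ), _mm_load_pd( Ai + i ),
                         _mm_load_pd( Br + i ), _mm_load_pd( Bi + i ) );
        _mm_store_pd( Cr + i, cr );
        _mm_store_pd( Ci + i, ci );
    }
}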
Actually, I'm not denying that you can do pretty fast complex multiplies by separating the real and imaginary parts. What I'm saying is that with addsubps you can do a better job when you have the complex numbers packed than you can with SSE1 only. I still think AMD did it better with its pfpnacc instruction, and Intel should have gone much further.

On Thu 15/01/2004 at 19:28, Ian Ollmann wrote:

> [...]
> No permute is required. The key point is that I do two (or, in single
> precision, four) complex multiplies at a time in proper SIMD fashion,
> unlike PNI-based methods.
> [...]

--
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
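[A sketch for illustration, not from either mail: the sort of addsubps sequence being discussed, applied to interleaved single-precision complex numbers (two per __m128). It assumes a PNI/SSE3-capable compiler with <pmmintrin.h>; the helper's name is invented.]

#include <pmmintrin.h>   /* PNI/SSE3: _mm_addsub_ps, _mm_moveldup_ps, _mm_movehdup_ps */

/* A and B each hold two interleaved complex floats: [r0, i0, r1, i1]. */
static inline __m128 ComplexMultiplyInterleaved( __m128 A, __m128 B )
{
    __m128 Br   = _mm_moveldup_ps( B );                          /* [Br0, Br0, Br1, Br1] */
    __m128 Bi   = _mm_movehdup_ps( B );                          /* [Bi0, Bi0, Bi1, Bi1] */
    __m128 Aswp = _mm_shuffle_ps( A, A, _MM_SHUFFLE(2,3,0,1) );  /* [Ai0, Ar0, Ai1, Ar1] */

    __m128 re = _mm_mul_ps( A,    Br );   /* [Ar*Br, Ai*Br, ...] */
    __m128 im = _mm_mul_ps( Aswp, Bi );   /* [Ai*Bi, Ar*Bi, ...] */

    /* addsubps subtracts in the even lanes and adds in the odd lanes:
       [Ar*Br - Ai*Bi, Ai*Br + Ar*Bi, ...]                            */
    return _mm_addsub_ps( re, im );
}

[Counted this way, the interleaved form costs three permute-class instructions (the two dups and the shuffle) plus three ALU instructions (two multiplies and the addsub) per vector, which matches the three-permutes-plus-three-ALU tally Ian gives below.]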
On Thu, 15 Jan 2004, Jean-Marc Valin wrote:

> Actually, I'm not denying that you can do pretty fast complex multiplies
> by separating the real and imaginary parts. What I'm saying is that with
> addsubps you can do a better job when you have the complex numbers packed
> than you can with SSE1 only. I still think AMD did it better with its
> pfpnacc instruction, and Intel should have gone much further.

I find it amazing that they would spend effort introducing new hardware designed to facilitate programming in inefficient ways. The existence of the instruction encourages people to use that data layout, thereby shooting themselves in the foot.

Furthermore, if they are going to help, they could at least do so intelligently. The addsubps instruction would have saved two permutes if it added across (horizontally) rather than vertically. As PNI stands now, given a strategy of hobbling ourselves with an interleaved data layout, we arrive at the following implementation:

    Cr = Ar * Br - Ai * Bi
    Ci = Ai * Br + Ar * Bi

In vector notation:

    C = A * Br +- swap(A) * Bi

which comes to three permutes and three ALU instructions. Given a dispatch limitation of one SIMD instruction per cycle (everything goes through port 1, as I understand page 1-17 of the Intel Pentium 4 / Xeon Processor Optimization manual to say), it appears to me that you could do equally well without permutes using scalar SSE2 instructions, or maybe even x87, because all we really need to accomplish here is 6 scalar ALU ops!

If addsubps did its thing horizontally, then we could write it this way:

    real   = A * B
    imag   = swap(A) * B
    result = { sub_across( real ), add_across( imag ) }

which is three ALU operations and one permute. There is some chance that the SSE2 implementation would beat double-precision scalar code!*

Ian

* Provided Intel processed more than one double per cycle for packed SSE2 instructions, which it does not.

---------------------------------------------------
Ian Ollmann, Ph.D.                 iano@cco.caltech.edu
---------------------------------------------------
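[A sketch for illustration, not part of Ian's mail: the plain scalar reference behind the "6 scalar ALU ops" remark. A compiler using SSE2 for floating point would emit this as scalar mulsd/subsd/addsd; the function name is invented.]

/* One complex multiply is four multiplies, one subtract and one add:
   6 scalar ALU ops per complex product. */
static inline void ComplexMultiplyScalar( double *Cr, double *Ci,
                                          double Ar, double Ai,
                                          double Br, double Bi )
{
    *Cr = Ar * Br - Ai * Bi;   /* mulsd, mulsd, subsd */
    *Ci = Ai * Br + Ar * Bi;   /* mulsd, mulsd, addsd */
}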