> Personally, I don't think much of PNI. The complex arithmetic stuff they
> added sets you up for a lot of permute overhead that is inefficient --
> especially on a processor that is already weak on permute. In my opinion,

Actually, the new instructions make it possible to do complex multiplies
without the need to permute and separate the add and subtract. The
really useful instruction here is "addsubps".

> I find it hard to believe you will never need SSE2. There are some
> instructions that are legitimately useful to single precision floating
> point work, such as cvtps2dq and cvttps2dq.

There are so few conversions in Speex in the first place that it's not
even worth bothering with that. You get all the gain from just addps and
mulps (and the "glue instructions" that let you use them, like movaps and
shufps).

	Jean-Marc

-- 
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 190 bytes
Desc: This is a digitally signed message part.
Url : http://lists.xiph.org/pipermail/speex-dev/attachments/20040115/dde9b246/signature-0001.pgp
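For readers who have not seen it, the usual addsubps pattern for interleaved complex data looks roughly like the sketch below. It is an illustration, not code from this thread: the function name cmul_interleaved_sse3 and the SSE3_FN macro are hypothetical, and it assumes GCC or Clang on x86 with the SSE3 (PNI) intrinsics from <pmmintrin.h>. Note that this variant still spends one shufps on swapping the real and imaginary parts of b.

```c
#include <pmmintrin.h>  /* SSE3: _mm_addsub_ps, _mm_moveldup_ps, _mm_movehdup_ps */

/* Assumption: GCC/Clang. Lets this one function use SSE3 without -msse3. */
#if defined(__GNUC__) && !defined(__SSE3__)
#define SSE3_FN __attribute__((target("sse3")))
#else
#define SSE3_FN
#endif

/* c[] = a[] * b[] for two interleaved complex floats {re0, im0, re1, im1}. */
static SSE3_FN void cmul_interleaved_sse3(float *c, const float *a, const float *b)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 re = _mm_moveldup_ps(va);            /* {ar0, ar0, ar1, ar1} */
    __m128 im = _mm_movehdup_ps(va);            /* {ai0, ai0, ai1, ai1} */
    __m128 sw = _mm_shuffle_ps(vb, vb, 0xB1);   /* {bi0, br0, bi1, br1} */
    __m128 t1 = _mm_mul_ps(re, vb);             /* {ar*br, ar*bi, ...}  */
    __m128 t2 = _mm_mul_ps(im, sw);             /* {ai*bi, ai*br, ...}  */

    /* addsubps subtracts in even lanes and adds in odd lanes:
       {ar*br - ai*bi, ar*bi + ai*br, ...} */
    _mm_storeu_ps(c, _mm_addsub_ps(t1, t2));
}
```

Each call handles two complex products (four floats) held in the interleaved layout, which is the case addsubps was aimed at.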
Unrelated, but please use SSE/MMX/... intrinsics on Windows instead of using
inline assembly so you also get the speed benefit on Win64.

-- Daniel, Epic Games Inc.

--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'speex-dev-request@xiph.org'
containing only the word 'unsubscribe' in the body. No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.
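As a small illustration of the intrinsics style being requested, here is a sketch of an inner product built from nothing but mulps, addps and a couple of shuffles. The function name inner_prod_sse is made up for this example and is not Speex's code; it assumes len is a multiple of 4 and needs only <xmmintrin.h>. Microsoft's x64 compiler does not accept __asm blocks, so code written this way is what carries over to Win64.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum of a[i]*b[i]. Hypothetical helper; assumes len is a multiple of 4. */
static float inner_prod_sse(const float *a, const float *b, int len)
{
    __m128 sum = _mm_setzero_ps();
    __m128 shuf;
    int i;

    /* Four partial sums, one per lane. */
    for (i = 0; i < len; i += 4)
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));

    /* Horizontal add of the four lanes. */
    shuf = _mm_shuffle_ps(sum, sum, _MM_SHUFFLE(2, 3, 0, 1)); /* swap within pairs */
    sum  = _mm_add_ps(sum, shuf);       /* lanes 0,1 hold s0+s1; lanes 2,3 hold s2+s3 */
    shuf = _mm_movehl_ps(shuf, sum);    /* bring the upper pair down to lane 0 */
    sum  = _mm_add_ss(sum, shuf);       /* lane 0 = s0+s1+s2+s3 */
    return _mm_cvtss_f32(sum);
}
```

Unaligned loads are used so the sketch works on any float array; with 16-byte aligned buffers, _mm_load_ps could be used instead.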
On Thu, 15 Jan 2004, Ian Ollmann wrote:

> On Thu, 15 Jan 2004, Jean-Marc Valin wrote:
>
> > > Personally, I don't think much of PNI. The complex arithmetic stuff they
> > > added sets you up for a lot of permute overhead that is inefficient --
> > > especially on a processor that is already weak on permute. In my opinion,
> >
> > Actually, the new instructions make it possible to do complex multiplies
> > without the need to permute and separate the add and subtract. The
> > really useful instruction here is the "addsubps".
>
> Would you like to prove it with a code sample?

I suppose if I make such a demand, it would only be sporting to provide
what I believe to be the more efficient competing method that uses only
SSE/SSE2. Double precision is shown. For single precision, simply replace
all "pd" with "ps" and "__m128d" with "__m128".

    #include <emmintrin.h>  /* SSE2 intrinsics */

    //For C[] = A[] * B[]
    //The real and imaginary parts of A, B and C are stored in
    //different arrays, not interleaved
    static inline void ComplexMultiply( __m128d *Cr, __m128d *Ci,
                                        __m128d Ar, __m128d Ai,
                                        __m128d Br, __m128d Bi )
    {
        // http://mathworld.wolfram.com/ComplexMultiplication.html
        // Cr = Ar * Br - Ai * Bi
        // Ci = Ai * Br + Ar * Bi
        __m128d real = _mm_mul_pd( Ar, Br );
        __m128d imag = _mm_mul_pd( Ai, Br );

        Ai = _mm_mul_pd( Ai, Bi );
        Ar = _mm_mul_pd( Ar, Bi );

        real = _mm_sub_pd( real, Ai );
        imag = _mm_add_pd( imag, Ar );

        *Cr = real;
        *Ci = imag;
    }

No permute is required. The key thing to note is that I do two/four
complex multiplies at a time in proper SIMD fashion, unlike PNI based
methods. Thus, throughput is 3 vector ALU instructions per element, even
though I do 6 ALU instructions. (1.5 insns/element for single precision.)

Stores at the end are merely a formality required by C language
architectures to return more than one result and will presumably be
removed when the function is inlined.

Ian

---------------------------------------------------
Ian Ollmann, Ph.D.                iano@cco.caltech.edu
---------------------------------------------------
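Applying the substitution described above ("pd" to "ps", "__m128d" to "__m128") gives a single-precision variant that handles four complex products per call. The sketch below wraps it in a loop over split real/imaginary arrays. The names complex_multiply_ps and complex_multiply_arrays are hypothetical, and the loop assumes n is a multiple of 4.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Four complex products at once; real and imaginary parts in separate vectors. */
static inline void complex_multiply_ps(__m128 *Cr, __m128 *Ci,
                                       __m128 Ar, __m128 Ai,
                                       __m128 Br, __m128 Bi)
{
    /* Cr = Ar * Br - Ai * Bi */
    *Cr = _mm_sub_ps(_mm_mul_ps(Ar, Br), _mm_mul_ps(Ai, Bi));
    /* Ci = Ai * Br + Ar * Bi */
    *Ci = _mm_add_ps(_mm_mul_ps(Ai, Br), _mm_mul_ps(Ar, Bi));
}

/* c[i] = a[i] * b[i] for n complex numbers stored as split arrays.
   Assumes n is a multiple of 4 (illustration only). */
static void complex_multiply_arrays(float *cr, float *ci,
                                    const float *ar, const float *ai,
                                    const float *br, const float *bi, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 re, im;
        complex_multiply_ps(&re, &im,
                            _mm_loadu_ps(ar + i), _mm_loadu_ps(ai + i),
                            _mm_loadu_ps(br + i), _mm_loadu_ps(bi + i));
        _mm_storeu_ps(cr + i, re);
        _mm_storeu_ps(ci + i, im);
    }
}
```

That is six vector ALU instructions for four results, which is where the 1.5 instructions per element figure comes from.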
On Thu, 15/01/2004 at 15:30, Daniel Vogel wrote:

> Unrelated, but please use SSE/MMX/... intrinsics on Windows instead of using
> inline assembly so you also get the speed benefit on Win64.

OK, so here's a first start. I've translated to intrinsics the asm I sent
1-2 days ago. The result is about 5% slower than the pure asm approach, so
it's not too bad (SSE asm is 2x faster than x87). Note that unlike the
previous version, which had a kludge to work with order 8 (required for
wideband), this version only works with order 10, so it will only work for
narrowband.

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Note: ord is ignored, order 10 is hard-coded */
    void filter_mem2(float *x, float *_num, float *_den, float *y, int N, int ord, float *_mem)
    {
       __m128 num[3], den[3], mem[3];
       int i;

       /* Copy numerator, denominator and memory to aligned xmm */
       for (i=0;i<2;i++)
       {
          mem[i] = _mm_loadu_ps(_mem+4*i);
          num[i] = _mm_loadu_ps(_num+4*i+1);
          den[i] = _mm_loadu_ps(_den+4*i+1);
       }
       mem[2] = _mm_setr_ps(_mem[8], _mem[9], 0, 0);
       num[2] = _mm_setr_ps(_num[9], _num[10], 0, 0);
       den[2] = _mm_setr_ps(_den[9], _den[10], 0, 0);

       for (i=0;i<N;i++)
       {
          __m128 xx;
          __m128 yy;
          /* Compute next filter result */
          xx = _mm_load_ps1(x+i);
          yy = _mm_add_ss(xx, mem[0]);
          _mm_store_ss(y+i, yy);
          yy = _mm_shuffle_ps(yy, yy, 0);

          /* Update memory */
          /* 0x39 rotates the vector down one lane, so the element pulled in
             from the next vector by _mm_move_ss ends up in the top lane */
          mem[0] = _mm_move_ss(mem[0], mem[1]);
          mem[0] = _mm_shuffle_ps(mem[0], mem[0], 0x39);
          mem[0] = _mm_add_ps(mem[0], _mm_mul_ps(xx, num[0]));
          mem[0] = _mm_sub_ps(mem[0], _mm_mul_ps(yy, den[0]));

          mem[1] = _mm_move_ss(mem[1], mem[2]);
          mem[1] = _mm_shuffle_ps(mem[1], mem[1], 0x39);
          mem[1] = _mm_add_ps(mem[1], _mm_mul_ps(xx, num[1]));
          mem[1] = _mm_sub_ps(mem[1], _mm_mul_ps(yy, den[1]));

          /* 0xfd moves lane 1 down to lane 0 and fills the rest from lane 3 (zero) */
          mem[2] = _mm_shuffle_ps(mem[2], mem[2], 0xfd);
          mem[2] = _mm_add_ps(mem[2], _mm_mul_ps(xx, num[2]));
          mem[2] = _mm_sub_ps(mem[2], _mm_mul_ps(yy, den[2]));
       }

       /* Put memory back in its place */
       _mm_storeu_ps(_mem, mem[0]);
       _mm_storeu_ps(_mem+4, mem[1]);
       _mm_store_ss(_mem+8, mem[2]);
       mem[2] = _mm_shuffle_ps(mem[2], mem[2], 0x55);
       _mm_store_ss(_mem+9, mem[2]);
    }

	Jean-Marc
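For checking the intrinsics version against something simpler, a plain scalar version of the same recurrence can be handy. The sketch below is read off the SSE code above rather than taken from the Speex sources, so the name filter_mem2_ref and the ORD macro are hypothetical. Like the SSE code, it never reads num[0] or den[0] (effectively treating both as 1), and it fixes the order at 10.

```c
#define ORD 10  /* narrowband order hard-coded by the SSE version */

/* Scalar sketch of the same filter: num and den hold ORD+1 coefficients,
   mem holds ORD state values. Illustration only. */
static void filter_mem2_ref(const float *x, const float *num, const float *den,
                            float *y, int N, float *mem)
{
    int i, j;
    for (i = 0; i < N; i++) {
        float xi = x[i];
        float yi = xi + mem[0];          /* next filter output */
        y[i] = yi;

        /* Shift the memories down by one and add the new contributions. */
        for (j = 0; j < ORD - 1; j++)
            mem[j] = mem[j + 1] + num[j + 1] * xi - den[j + 1] * yi;
        mem[ORD - 1] = num[ORD] * xi - den[ORD] * yi;
    }
}
```

Feeding the same x, coefficients and initial memory to both functions and comparing y and the final memory is a quick way to catch a wrong shuffle constant.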
On Thu, 15/01/2004 at 15:30, Daniel Vogel wrote:

> Unrelated, but please use SSE/MMX/... intrinsics on Windows instead of using
> inline assembly so you also get the speed benefit on Win64.

Just curious, but why can't Win64 use Win32 inline assembly? Or do you mean
the benefit of having 16 registers?

	Jean-Marc
On Thu, 15 Jan 2004, Jean-Marc Valin wrote:

> > Personally, I don't think much of PNI. The complex arithmetic stuff they
> > added sets you up for a lot of permute overhead that is inefficient --
> > especially on a processor that is already weak on permute. In my opinion,
>
> Actually, the new instructions make it possible to do complex multiplies
> without the need to permute and separate the add and subtract. The
> really useful instruction here is the "addsubps".

Would you like to prove it with a code sample?

Ian