On Thu, 15/01/2004 at 15:30, Daniel Vogel wrote:
> Unrelated, but please use SSE/MMX/... intrinsics on Windows instead of using
> inline assembly so you also get the speed benefit on Win64.

OK, so here's a first start. I've translated to intrinsics the asm I
sent 1-2 days ago. The result is about 5% slower than the pure asm
approach, so it's not too bad (SSE asm is 2x faster than x87). Note that
unlike the previous version which had a kludge to work with order 8
(required for wideband), this version only works with order 10, so it
will only work for narrowband.

#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_* */

/* Only valid for ord == 10: the argument is accepted but never checked. */
void filter_mem2(float *x, float *_num, float *_den, float *y, int N, int ord, float *_mem)
{
   __m128 num[3], den[3], mem[3];

   int i;

   /* Copy numerator, denominator and memory to aligned xmm */
   /* (_num[0] and _den[0] are skipped; the loop below takes _num[0] as 1) */
   for (i=0;i<2;i++)
   {
      mem[i] = _mm_loadu_ps(_mem+4*i);
      num[i] = _mm_loadu_ps(_num+4*i+1);
      den[i] = _mm_loadu_ps(_den+4*i+1);
   }
   mem[2] = _mm_setr_ps(_mem[8], _mem[9], 0, 0);
   num[2] = _mm_setr_ps(_num[9], _num[10], 0, 0);
   den[2] = _mm_setr_ps(_den[9], _den[10], 0, 0);

   for (i=0;i<N;i++)
   {
      __m128 xx;
      __m128 yy;
      /* Compute next filter result */
      xx = _mm_load_ps1(x+i);
      yy = _mm_add_ss(xx, mem[0]);
      _mm_store_ss(y+i, yy);
      yy = _mm_shuffle_ps(yy, yy, 0);   /* broadcast y[i] to all four lanes */

      /* Update memory */
      /* Shift every lane down by one, pulling lane 0 of the next register in */
      mem[0] = _mm_move_ss(mem[0], mem[1]);
      mem[0] = _mm_shuffle_ps(mem[0], mem[0], 0x39);

      mem[0] = _mm_add_ps(mem[0], _mm_mul_ps(xx, num[0]));
      mem[0] = _mm_sub_ps(mem[0], _mm_mul_ps(yy, den[0]));

      mem[1] = _mm_move_ss(mem[1], mem[2]);
      mem[1] = _mm_shuffle_ps(mem[1], mem[1], 0x39);

      mem[1] = _mm_add_ps(mem[1], _mm_mul_ps(xx, num[1]));
      mem[1] = _mm_sub_ps(mem[1], _mm_mul_ps(yy, den[1]));

      /* Last register only holds two values: {m8, m9, 0, 0} -> {m9, 0, 0, 0} */
      mem[2] = _mm_shuffle_ps(mem[2], mem[2], 0xfd);

      mem[2] = _mm_add_ps(mem[2], _mm_mul_ps(xx, num[2]));
      mem[2] = _mm_sub_ps(mem[2], _mm_mul_ps(yy, den[2]));
   }
   /* Put memory back in its place */
   _mm_storeu_ps(_mem, mem[0]);
   _mm_storeu_ps(_mem+4, mem[1]);
   _mm_store_ss(_mem+8, mem[2]);
   mem[2] = _mm_shuffle_ps(mem[2], mem[2], 0x55);   /* bring lane 1 down to lane 0 */
   _mm_store_ss(_mem+9, mem[2]);
}

Jean-Marc

--
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
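
The shuffle constants in the memory update are easy to misread, so here is a small sketch of what they do in isolation. The helper names shift_in and shift_tail are made up for this illustration and are not part of Speex. It assumes a compiler with SSE code generation enabled (on 32-bit x86 with gcc, something like -march=pentium3 or -msse).

```c
#include <xmmintrin.h>

/* Hypothetical helper: drop lane 0 of lo, slide the remaining lanes down by
   one and bring lane 0 of hi into the top lane. This is the pattern the
   posted code applies to mem[0] and mem[1]. */
void shift_in(const float *lo, const float *hi, float *out)
{
   __m128 a = _mm_loadu_ps(lo);
   __m128 b = _mm_loadu_ps(hi);
   a = _mm_move_ss(a, b);            /* {hi[0], lo[1], lo[2], lo[3]} */
   a = _mm_shuffle_ps(a, a, 0x39);   /* {lo[1], lo[2], lo[3], hi[0]} */
   _mm_storeu_ps(out, a);
}

/* Hypothetical helper: the same shift for the last register, whose upper
   two lanes are zero: {t0, t1, 0, 0} becomes {t1, 0, 0, 0}. */
void shift_tail(float t0, float t1, float *out)
{
   __m128 a = _mm_setr_ps(t0, t1, 0, 0);
   a = _mm_shuffle_ps(a, a, 0xfd);
   _mm_storeu_ps(out, a);
}
```

With lo = {0, 1, 2, 3} and hi = {4, 5, 6, 7}, shift_in writes {1, 2, 3, 4}. In other words, the ten filter states move down one slot per sample, three registers at a time.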
If anyone's interested in doing some testing, I just checked in an
improved SSE implementation for filter_mem2, fir_mem2 and iir_mem2. The
implementation should also work for wideband now. Give it a try. I'm
attaching this new implementation. It's the first time I've tried coding
with intrinsics, so please point out any errors or inefficiencies.

Jean-Marc

Attachment: filters_sse.h (text/x-c-header, 10000 bytes)
http://lists.xiph.org/pipermail/speex-dev/attachments/20040116/1d09d876/filters_sse-0001.bin
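
For anyone taking up the testing request, one possible approach is to run the same input through a plain scalar filter and compare the outputs. The sketch below is not taken from the attached filters_sse.h. The names filter_mem2_ref and max_abs_diff are hypothetical. The recurrence is written to mirror the order-10 intrinsics routine posted earlier in this thread, including its assumption that num[0] is 1.

```c
#include <assert.h>

/* Scalar reference using the same recurrence as the posted intrinsics code.
   Like that code, it assumes num[0] == 1 and never reads den[0]. */
void filter_mem2_ref(const float *x, const float *num, const float *den,
                     float *y, int N, int ord, float *mem)
{
   int i, j;
   assert(ord >= 1);
   for (i = 0; i < N; i++)
   {
      float xi = x[i];
      float yi = xi + mem[0];
      y[i] = yi;
      /* Shift the state down one slot and add the new contributions */
      for (j = 0; j < ord - 1; j++)
         mem[j] = mem[j+1] + num[j+1]*xi - den[j+1]*yi;
      mem[ord-1] = num[ord]*xi - den[ord]*yi;
   }
}

/* Largest absolute difference between two buffers of length n */
float max_abs_diff(const float *a, const float *b, int n)
{
   float worst = 0;
   int i;
   for (i = 0; i < n; i++)
   {
      float d = a[i] - b[i];
      if (d < 0)
         d = -d;
      if (d > worst)
         worst = d;
   }
   return worst;
}
```

A test would give each version its own copy of the filter memory, filter the same frame with both, and check that max_abs_diff stays below a small tolerance. Exact equality is probably too strict, since x87 code may keep intermediate results in extended precision.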
On Fri, Jan 16, 2004 at 01:35:34AM -0500, Jean-Marc Valin wrote:
> On Thu, 15/01/2004 at 15:30, Daniel Vogel wrote:
> > Unrelated, but please use SSE/MMX/... intrinsics on Windows instead of using
> > inline assembly so you also get the speed benefit on Win64.
>
> OK, so here's a first start. I've translated to intrinsics the asm I
> sent 1-2 days ago. The result is about 5% slower than the pure asm
> approach, so it's not too bad (SSE asm is 2x faster than x87). Note that
> unlike the previous version which had a kludge to work with order 8
> (required for wideband), this version only works with order 10, so it
> will only work for narrowband.

Will this work on Linux as well?

--
Petr Tomasek, http://www.etf.cuni.cz/~tomasek/
> > OK, so here's a first start. I've translated to intrinsics the asm I
> > sent 1-2 days ago. The result is about 5% slower than the pure asm
> > approach, so it's not too bad (SSE asm is 2x faster than x87). Note that
> > unlike the previous version which had a kludge to work with order 8
> > (required for wideband), this version only works with order 10, so it
> > will only work for narrowband.
>
> Will this work on linux as well?

I'm developing this on Linux, so that's no problem (you need to compile
with -march=pentium3). It should also work on Windows (now that the
inline asm has been removed), but I haven't tested it. Of course, I'd
like more testing on both Linux and Windows to make sure I haven't
broken anything.

Jean-Marc
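
Since the intrinsics only compile when SSE is enabled, a source file meant to build on both platforms might guard them at compile time. The following is only a sketch of that idea and not how Speex selects its filters. SKETCH_USE_SSE and add4 are hypothetical names. It relies on gcc defining __SSE__ when SSE is enabled (for example with -march=pentium3 or -msse). The Microsoft compiler checks (_M_X64, _M_IX86_FP) are an assumption worth verifying against the compiler documentation.

```c
/* Decide at compile time whether SSE intrinsics are available. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#define SKETCH_USE_SSE 1
#include <xmmintrin.h>
#else
#define SKETCH_USE_SSE 0
#endif

/* Hypothetical example routine: out[k] = a[k] + b[k] for k = 0..3 */
void add4(const float *a, const float *b, float *out)
{
#if SKETCH_USE_SSE
   /* One packed add on four floats, using unaligned loads and stores */
   _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
   /* Plain C fallback when SSE is not enabled */
   int k;
   for (k = 0; k < 4; k++)
      out[k] = a[k] + b[k];
#endif
}
```

Built without SSE enabled, the same file still compiles and takes the scalar path, which keeps non-SSE targets working while the intrinsics version is being tested.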