Aron Rosenberg
2004-Aug-06 15:01 UTC
[speex-dev] Notes on 1.1.4 Windows. Testing of SSE Intrinics Code and others
Here are our notes from testing 1.1.4 on Windows.

1. Compile error in regular mode (FIXED_POINT undefined) at lsp.c line 104:

       static inline spx_word16_t spx_cos(spx_word16_t x)

   VS6 does not like the inline keyword here. Removing it allows compiling. The same applies to cb_search_sse.h line 34.

2. Compile error at quant_lsp.c line 55: M_PI is undefined. Either it needs to be defined in that file or placed in a header.

3. denoise.c doesn't seem to be in the tar.gz, although it is listed in the Visual Studio project file.

Now on to the actual SSE tests.

We ran the SSE intrinsics code through some tests on Windows over here and all I can say is: it sucks. A room filled with monkeys could generate better SSE code. Having stated that, let me describe why.

We use Visual Studio 6, SP5 with the Processor Pack as the main development platform. For some unknown reason, it decides that it only ever wants to use XMM0 for its SSE operations. If it is dealing with a two-parameter SSE call, then it will also use XMM1, but that's it. Between successive calls, it won't keep things in an xmm register, even if the next call uses the same value.

To check this, I converted some of the MMX code in our regular application to intrinsics and it does the same thing: it only uses mm0 and mm1. It actually runs slower than a C version of the same function.

Now, this could be different on Visual Studio .NET and .NET 2003, but that is what happens with Visual Studio 6. Just so you understand, I am pasting below some of the generated SSE code for the fir_mem2_10 function. I got this by compiling speexenc and loading it up in the debugger. I skipped a bit of the initial function setup; the block starts inside the for loop.

For those who don't know, Win32 asm operand order is backwards from GCC's: it is OPERATION DEST, SOURCE.

    for (i=0;i<N;i++)
    {
       __m128 xx;
       __m128 yy;
       /* Compute next filter result */
       xx = _mm_load_ps1(x+i);
        00413483 mov eax,dword ptr [ebp-64h]
        00413486 mov ecx,dword ptr [ebx+8]
        00413489 lea edx,[ecx+eax*4]
        0041348C movss xmm0,dword ptr [edx]
        00413490 shufps xmm0,xmm0,0
        00413494 movaps xmmword ptr [xx],xmm0
       yy = _mm_add_ss(xx, mem[0]);
        00413498 movaps xmm0,xmmword ptr [ebp-60h]
        0041349C movaps xmm1,xmmword ptr [xx]
        004134A0 addss xmm1,xmm0
        004134A4 movaps xmmword ptr [yy],xmm1
       _mm_store_ss(y+i, yy);
        004134AB movaps xmm0,xmmword ptr [yy]
        004134B2 mov eax,dword ptr [ebp-64h]
        004134B5 mov ecx,dword ptr [ebx+10h]
        004134B8 lea edx,[ecx+eax*4]
        004134BB movss dword ptr [edx],xmm0
       yy = _mm_shuffle_ps(yy, yy, 0);
        004134BF movaps xmm0,xmmword ptr [yy]
        004134C6 movaps xmm1,xmmword ptr [yy]
        004134CD shufps xmm1,xmm0,0
        004134D1 movaps xmmword ptr [yy],xmm1

       /* Update memory */
       mem[0] = _mm_move_ss(mem[0], mem[1]);
        004134D8 movaps xmm0,xmmword ptr [ebp-50h]
        004134DC movaps xmm1,xmmword ptr [ebp-60h]
        004134E0 movss xmm1,xmm0
        004134E4 movaps xmmword ptr [ebp-60h],xmm1
       mem[0] = _mm_shuffle_ps(mem[0], mem[0], 0x39);
        004134E8 movaps xmm0,xmmword ptr [ebp-60h]
        004134EC movaps xmm1,xmmword ptr [ebp-60h]
        004134F0 shufps xmm1,xmm0,39h
        004134F4 movaps xmmword ptr [ebp-60h],xmm1

       mem[0] = _mm_add_ps(mem[0], _mm_mul_ps(xx, num[0]));
        004134F8 movaps xmm0,xmmword ptr [ebp-30h]
        004134FC movaps xmm1,xmmword ptr [xx]
        00413500 mulps xmm1,xmm0
        00413503 movaps xmm0,xmmword ptr [ebp-60h]
        00413507 addps xmm0,xmm1
        0041350A movaps xmmword ptr [ebp-60h],xmm0

       mem[1] = _mm_move_ss(mem[1], mem[2]);
        0041350E movaps xmm0,xmmword ptr [ebp-40h]
        00413512 movaps xmm1,xmmword ptr [ebp-50h]
        00413516 movss xmm1,xmm0
        0041351A movaps xmmword ptr [ebp-50h],xmm1
       mem[1] = _mm_shuffle_ps(mem[1], mem[1], 0x39);
        0041351E movaps xmm0,xmmword ptr [ebp-50h]
        00413522 movaps xmm1,xmmword ptr [ebp-50h]
        00413526 shufps xmm1,xmm0,39h
        0041352A movaps xmmword ptr [ebp-50h],xmm1

       mem[1] = _mm_add_ps(mem[1], _mm_mul_ps(xx, num[1]));
        0041352E movaps xmm0,xmmword ptr [ebp-20h]
        00413532 movaps xmm1,xmmword ptr [xx]
        00413536 mulps xmm1,xmm0
        00413539 movaps xmm0,xmmword ptr [ebp-50h]
        0041353D addps xmm0,xmm1
        00413540 movaps xmmword ptr [ebp-50h],xmm0

       mem[2] = _mm_shuffle_ps(mem[2], mem[2], 0xfd);
        00413544 movaps xmm0,xmmword ptr [ebp-40h]
        00413548 movaps xmm1,xmmword ptr [ebp-40h]
        0041354C shufps xmm1,xmm0,0FDh
        00413550 movaps xmmword ptr [ebp-40h],xmm1

       mem[2] = _mm_add_ps(mem[2], _mm_mul_ps(xx, num[2]));
        00413554 movaps xmm0,xmmword ptr [ebp-10h]
        00413558 movaps xmm1,xmmword ptr [xx]
        0041355C mulps xmm1,xmm0
        0041355F movaps xmm0,xmmword ptr [ebp-40h]
        00413563 addps xmm0,xmm1
        00413566 movaps xmmword ptr [ebp-40h],xmm0
    }

--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'speex-dev-request@xiph.org' containing only the word 'unsubscribe' in the body. No subject is needed. Unsubscribe messages sent to the list will be ignored/filtered.
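For readers who want to follow what the intrinsics in the listing above compute, here is a scalar reading of that loop body. It is a sketch derived only from the lines shown, not the Speex source: the function name example_fir_mem10 and the flat 10-element layout of num and mem are assumptions (the listing packs them into three __m128 values each, and the unused padding lanes are assumed to be zero).

```c
#include <stddef.h>

/* Assumed: ten taps, as the name fir_mem2_10 and the three
   __m128 registers (4 + 4 + 2 lanes) in the listing suggest. */
#define EXAMPLE_ORDER 10

/* Hypothetical scalar equivalent of the loop body in the listing.
   x: input, y: output, n: number of samples,
   num: EXAMPLE_ORDER filter taps, mem: EXAMPLE_ORDER state values. */
void example_fir_mem10(const float *x, const float *num, float *y,
                       size_t n, float *mem)
{
    size_t i, j;
    for (i = 0; i < n; i++) {
        const float xi = x[i];

        /* _mm_add_ss + _mm_store_ss: output is input plus oldest state. */
        y[i] = xi + mem[0];

        /* _mm_move_ss + _mm_shuffle_ps shift the state down by one lane;
           _mm_add_ps(_mm_mul_ps(...)) adds the scaled input. */
        for (j = 0; j + 1 < EXAMPLE_ORDER; j++)
            mem[j] = mem[j + 1] + xi * num[j];

        /* The last lane has nothing to shift in, so only the product remains. */
        mem[EXAMPLE_ORDER - 1] = xi * num[EXAMPLE_ORDER - 1];
    }
}
```

Feeding a unit impulse with zeroed state should return 1 followed by the ten taps in order, which is a quick way to compare an intrinsics build against this reference.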
Jean-Marc Valin
2004-Aug-06 15:01 UTC
[speex-dev] Notes on 1.1.4 Windows. Testing of SSE Intrinics Code and others
> 1. Compile Error with regular mode (FIXED_POINT undefined) at lsp.c line 104
> static inline spx_word16_t spx_cos(spx_word16_t x) . VS6 does not like
> the inline keyword here. Removing it allows compiling.
>
> same with cb_search_sse.h line 34.

It seems like your compiler simply doesn't like "inline". I suggest doing a -Dinline= which is what autoconf does when it detects that the compiler doesn't understand the inline keyword.

> 2. Compile Error with quant_lsp.c line 55. M_PI is undefined. Either it
> needs to be included in that file or placed in a header.

I'll fix that.

> 3. denoise.c doesn't seem to be in tar.gz, it is in the visual studio
> project file though.

The project file isn't up-to-date (I've never even compiled Speex in Win32). The file's been renamed to preprocess.h

> We ran the SSE intrinsics code through some tests on windows over here and
> all I can say is - it sucks. A room filled with Monkeys could generate
> better SSE code. Having stated that let me describe why.

You mean a room filled with monkeys could generate a better compiler? :)

> We use Visual Studio 6, SP5 with the processor pack as the main development
> platform. For some unknown reason, it decides that it only ever wants to
> use XMM0 for its SSE operations. If it is dealing with a two parameter SSE
> call, then it will use XMM1, but that's it. Between successive calls, it
> won't keep things in an xmm register, even if the next call is using it.

I just checked with gcc. gcc uses all of the xmm registers available (should check on an Opteron, which has 16 of them). Overall, enabling SSE can give up to 30% improvement (20% is typical).

> To check this, I converted some of the MMX code in our regular application
> to intrinsics and it does the same thing, only uses mm0 and mm1. It actually
> runs slower than a c code version of the same function.

Well, there's always the option to use gcc to generate the assembly for the few SSE functions.

> Now, this could be different on Visual Studio .NET and .NET 2003, but that
> is what happens with Visual Studio 6. Just so you understand, I am pasting
> below some of the generated SSE code for the fir_mem2_10 function. I got
> this by compiling the speexenc and loading it up in the debugger.

Yes, that code sucks. Bad. Actually, I can get the same kind of code by turning the optimizer off in gcc (-O0). Maybe you've got it turned off too (I think VS is unable to optimize in debug mode, is that right?). Otherwise, VS really sucks.

Jean-Marc

--
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
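The two compile errors discussed above (the inline keyword and the missing M_PI) can also be handled in a small compatibility header instead of on the command line. The following is only a sketch of that idea, not the fix that went into Speex: EXAMPLE_INLINE and example_deg_to_rad are hypothetical names. It relies on the Microsoft compiler accepting __inline in C code and on M_PI not being part of ISO C, so not every <math.h> provides it.

```c
#include <math.h>

/* Map a project-local keyword onto whatever the compiler accepts.
   EXAMPLE_INLINE is a hypothetical name, not part of Speex. */
#if defined(_MSC_VER)
#  define EXAMPLE_INLINE __inline
#elif defined(__GNUC__)
#  define EXAMPLE_INLINE __inline__
#else
#  define EXAMPLE_INLINE /* fall back to a plain static function */
#endif

/* M_PI is an extension, so supply a fallback when <math.h> lacks it. */
#ifndef M_PI
#  define M_PI 3.14159265358979323846
#endif

/* Small illustrative helper that needs both the keyword and the constant. */
static EXAMPLE_INLINE double example_deg_to_rad(double degrees)
{
    return degrees * (M_PI / 180.0);
}
```

Compared with -Dinline=, which erases the keyword everywhere, a macro like this keeps the inlining hint on compilers that support some spelling of it.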
Aron Rosenberg
2004-Aug-06 15:01 UTC
[speex-dev] Notes on 1.1.4 Windows. Testing of SSE Intrinics Code and others
Jean-Marc,

Good catch on the debug mode. After compiling the same code in release mode, it does appear to be using all the registers correctly.

Give us a few days to integrate our run-time flags into 1.1.4 and I will let you know how our testing turns out.

Aron Rosenberg
SightSpeed

At 08:54 PM 1/21/2004, you wrote:
>Yes, that code sucks. Bad. Actually, I can get the same kind of code by
>turning the optimizer off in gcc (-O0). Maybe you've got it turned off
>too (I think VS is unable to optimize in debug mode, is that right?).
>Otherwise, VS really sucks.
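The thread does not say what the "run-time flags" are. Assuming they select between the SSE path and the plain C path at run time, a minimal sketch of that pattern could look like the following. Every name here (example_use_sse, example_dot, and so on) is hypothetical, and the dot product is just a stand-in for the real filter routines.

```c
#include <stddef.h>

/* Compile the SSE path only where the compiler advertises SSE support. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#  include <xmmintrin.h>
#  define EXAMPLE_HAVE_SSE 1
#else
#  define EXAMPLE_HAVE_SSE 0
#endif

/* Hypothetical run-time switch; a real application would set it
   from its own CPU feature detection. */
static int example_use_sse = 0;

/* Plain C reference path. */
static float example_dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i;
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

#if EXAMPLE_HAVE_SSE
/* SSE path: four products per iteration, scalar loop for the tail. */
static float example_dot_sse(const float *a, const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    float lanes[4];
    float sum;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    _mm_storeu_ps(lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
#endif

/* Dispatch on the run-time flag, falling back to C when SSE is absent. */
float example_dot(const float *a, const float *b, size_t n)
{
#if EXAMPLE_HAVE_SSE
    if (example_use_sse)
        return example_dot_sse(a, b, n);
#endif
    return example_dot_scalar(a, b, n);
}
```

As the debug-mode finding above shows, any speed comparison between the two paths is only meaningful in an optimized (release) build.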