So we ran the code on a Windows XP based Athlon XP system and the xmm registers work just fine, so it appears that Windows 2000 and below does not support them.

We agree on not supporting the non-FP version; however, the run-time flags need to be settable with a non-FP SSE mode so that exceptions are avoided.

I thus propose a set of defines like this instead of the ones in our initial patch:

#define CPU_MODE_NONE     0
#define CPU_MODE_MMX      1  // Base Intel MMX x86
#define CPU_MODE_3DNOW    2  // Base AMD 3DNow! extensions
#define CPU_MODE_SSE      4  // Intel integer SSE instructions
#define CPU_MODE_3DNOWEXT 8  // AMD 3DNow! extended instructions
#define CPU_MODE_SSEFP    16 // SSE FP modes, mainly support for xmm registers
#define CPU_MODE_SSE2     32 // Intel SSE2 instructions
#define CPU_MODE_ALTIVEC  64 // PowerPC AltiVec support

Potential additions include some of the ASM modes. With the results that we found, the flags relate like this: 3DNOW implies MMX. 3DNOWEXT implies SSE. SSE2 implies SSEFP. SSEFP implies SSE. Either way, all the current Speex SSE code should be flag-checked against SSEFP.

> Do you already have that implemented? I know it's possible, but the code
> will likely be really ugly.

We already have it implemented for the inner_prod function. After it is stable and fully tested, we will send you a patch. If you have never done AltiVec coding, it is quite simple since it is all C macros/functions, not nearly as nasty as inline asm code, although the 16-byte alignment issues can be quite a pain. Our current working code is below:

Aron Rosenberg
SightSpeed Inc.

static float inner_prod(float *a, float *b, int len)
{
   if (global_use_mmx_sse & CPU_MODE_ALTIVEC) {
#ifdef _USE_ALTIVEC
      int i;
      float sum;
      int a_aligned = (((unsigned long)a) & 15) ? 0 : 1;
      int b_aligned = (((unsigned long)b) & 15) ? 0 : 1;
      __vector float MSQa, LSQa, MSQb, LSQb;
      __vector unsigned char maska, maskb;
      __vector float vec_a, vec_b;
      __vector float vec_result;

      vec_result = (__vector float)vec_splat_u8(0);
      if ((!a_aligned) && (!b_aligned)) {
         /* This (unfortunately) is the common case. */
         maska = vec_lvsl(0, a);
         maskb = vec_lvsl(0, b);
         MSQa = vec_ld(0, a);
         MSQb = vec_ld(0, b);
         for (i = 0; i < len; i += 8) {
            a += 4;
            LSQa = vec_ld(0, a);
            vec_a = vec_perm(MSQa, LSQa, maska);
            b += 4;
            LSQb = vec_ld(0, b);
            vec_b = vec_perm(MSQb, LSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            MSQa = vec_ld(0, a);
            vec_a = vec_perm(LSQa, MSQa, maska);
            b += 4;
            MSQb = vec_ld(0, b);
            vec_b = vec_perm(LSQb, MSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
         }
      } else if (a_aligned && b_aligned) {
         for (i = 0; i < len; i += 8) {
            vec_a = vec_ld(0, a);
            vec_b = vec_ld(0, b);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            b += 4;
            vec_a = vec_ld(0, a);
            vec_b = vec_ld(0, b);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            b += 4;
         }
      } else if (a_aligned) {
         maskb = vec_lvsl(0, b);
         MSQb = vec_ld(0, b);
         for (i = 0; i < len; i += 8) {
            vec_a = vec_ld(0, a);
            a += 4;
            b += 4;
            LSQb = vec_ld(0, b);
            vec_b = vec_perm(MSQb, LSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            vec_a = vec_ld(0, a);
            a += 4;
            b += 4;
            MSQb = vec_ld(0, b);
            vec_b = vec_perm(LSQb, MSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
         }
      } else if (b_aligned) {
         maska = vec_lvsl(0, a);
         MSQa = vec_ld(0, a);
         for (i = 0; i < len; i += 8) {
            a += 4;
            LSQa = vec_ld(0, a);
            vec_a = vec_perm(MSQa, LSQa, maska);
            vec_b = vec_ld(0, b);
            b += 4;
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            MSQa = vec_ld(0, a);
            vec_a = vec_perm(LSQa, MSQa, maska);
            vec_b = vec_ld(0, b);
            b += 4;
            vec_result = vec_madd(vec_a, vec_b, vec_result);
         }
      }
      /* Add across the four partial sums and transfer to scalar. */
      vec_result = vec_add(vec_result, vec_sld(vec_result, vec_result, 8));
      vec_result = vec_add(vec_result, vec_sld(vec_result, vec_result, 4));
      vec_ste(vec_result, 0, &sum);
      return sum;
#endif
   }
   /* (scalar fallback truncated in the original message) */
}
--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'speex-dev-request@xiph.org'
containing only the word 'unsubscribe' in the body. No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.
> You may wish to save space for PNI.
>
> http://cedar.intel.com/media/pdf/PNI_LEGAL3.pdf

Seems to be interesting instructions for complex arithmetic there (thus helping FFTs). I'm not sure there's anything useful for Speex, though. We'll see. What I think is much more promising is the x86-64 version of SSE with 16 registers. That could speed up the filters a lot.

> Please note that dot products of simple vector floats are usually faster
> in the scalar units. The add across and transfer to scalar is just too
> expensive. It's generally only worthwhile if the data starts and ends in
> the vector units, and it is inlined so that latencies can be covered with
> other work. e.g.:

Actually, even with a scalar unit, the best code is implicitly vectorized. If you look at the original code I had, there are 4 partial sums that prevent some stalling due to dependencies. From there, it's easy to vectorize by 4 and add at the end. Note that for Speex the vectors are either 40 or 160 samples long. The whole process is also repeated 128 times in a row, so I think a vector unit will do much better.

	Jean-Marc

--
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 190 bytes
Desc: This is a digitally signed message part.
Url: http://lists.xiph.org/pipermail/speex-dev/attachments/20040115/31dba8ec/signature-0001.pgp
On Thu, 15 Jan 2004, Aron Rosenberg wrote:

> So we ran the code on a Windows XP based Athlon XP system and the xmm
> registers work just fine, so it appears that Windows 2000 and below does not
> support them.
>
> We agree on not supporting the non-FP version, however the run time flags
> need to be settable with a non-FP SSE mode so that exceptions are avoided.
>
> I thus propose a set of defines like this instead of the ones in our
> initial patch:
>
> #define CPU_MODE_NONE 0
> #define CPU_MODE_MMX 1 // Base Intel MMX x86
> #define CPU_MODE_3DNOW 2 // Base AMD 3Dnow extensions
> #define CPU_MODE_SSE 4 // Intel Integer SSE instructions
> #define CPU_MODE_3DNOWEXT 8 // AMD 3Dnow extended instructions
> #define CPU_MODE_SSEFP 16 // SSE FP modes, mainly support for xmm registers
> #define CPU_MODE_SSE2 32 // Intel SSE2 instructions
> #define CPU_MODE_ALTIVEC 64 // PowerPC Altivec support.

You may wish to save space for PNI.

http://cedar.intel.com/media/pdf/PNI_LEGAL3.pdf

Likewise, all that branching is probably going to cause more trouble than it saves. Try this:

   vector float a0 = vec_ld( 0, a );
   vector float a1 = vec_ld( 15, a );
   vector float b0 = vec_ld( 0, b );
   vector float b1 = vec_ld( 15, b );
   a0 = vec_perm( a0, a1, vec_lvsl( 0, a ) );
   b0 = vec_perm( b0, b1, vec_lvsl( 0, b ) );
   a0 = vec_madd( a0, b0, (vector float) vec_splat_u32(0) );
   a0 = vec_add( a0, vec_sld( a0, a0, 8 ) );
   a0 = vec_add( a0, vec_sld( a0, a0, 4 ) );
   vec_ste( a0, 0, &sum );
   return sum;

Please note that dot products of simple vector floats are usually faster in the scalar units. The add-across and transfer to scalar is just too expensive. It's generally only worthwhile if the data starts and ends in the vector units, and it is inlined so that latencies can be covered with other work.
E.g.:

inline vector float DotProduct( vector float a, vector float b )
{
    a = vec_madd( a, b, (vector float) vec_splat_u32(0) );
    a = vec_add( a, vec_sld( a, a, 8 ) );
    a = vec_add( a, vec_sld( a, a, 4 ) );
    return a;
}

Ian

---------------------------------------------------
   Ian Ollmann, Ph.D.       iano@cco.caltech.edu
---------------------------------------------------
> We agree on not supporting the non-FP version, however the run time flags
> need to be settable with a non FP SSE mode so that exceptions are avoided.

I think we should keep the more "official" naming and not AMD's, which is more confusing. SSE means SSE1: all the SSE instructions (including the ones using xmm registers). What AMD calls SSE is not SSE at all. Basically, it's a bunch of "extra instructions" borrowed from SSE that are part of the extended 3DNow!.

> I thus propose a set of defines like this instead of the ones in our
> initial patch:
>
> #define CPU_MODE_NONE 0
> #define CPU_MODE_MMX 1 // Base Intel MMX x86
> #define CPU_MODE_3DNOW 2 // Base AMD 3Dnow extensions
> #define CPU_MODE_SSE 4 // Intel Integer SSE instructions
> #define CPU_MODE_3DNOWEXT 8 // AMD 3Dnow extended instructions
> #define CPU_MODE_SSEFP 16 // SSE FP modes, mainly support for xmm registers
> #define CPU_MODE_SSE2 32 // Intel SSE2 instructions
> #define CPU_MODE_ALTIVEC 64 // PowerPC Altivec support.

If you really want to define stuff like that, you could simply have:

NONE
MMX
3DNOW
3DNOWEXT
SSE1
SSE2
ALTIVEC

Even then, MMX is completely useless for Speex IMO, and I doubt it's worth writing 3DNow! non-ext code (or even 3DNow! at all). Same for SSE2: Speex simply doesn't use doubles at all. That's why I think only defining NONE, SSE and ALTIVEC (maybe 3DNow!?) would be enough.

> We already have it implemented for the inner_prod function. After it is
> stable and fully tested, we will send you a patch. If you have never done
> Altivec coding it is quite simple since it is all C macros/functions.
> Not nearly as nasty as inline asm code, although the 16 byte alignment
> issues can be quite a pain. Our current working code is below:

You can do the same with SSE intrinsics. I just got used to writing assembly before they were available for gcc. I had a quick look at your inner_prod implementation.
I think that if you really want to make that fast (there's a big possible gain there), you need to consider the optimization at a higher level: from open_loop_nbest_pitch. The function calls inner_prod for a continuous range of offsets. With that in mind, it would probably be simpler to just take 4 copies (with different offsets) of one of the vectors and then compute everything with simple, aligned loads.

	Jean-Marc

--
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
Hi Jean-Marc,

I think there is just a confusion over terminology going on here. I agree that support for the 3DNow! base version may not necessarily be relevant. However, even though 3DNow! extended is a bastardized version of SSE, it still supports the same instructions, and that is what is important; I don't think we intend to add any AMD-specific code.

The real issue is cross-CPU SSE support, and whether in addition there is access to XMM registers or not, i.e. whether the OS actually supports XMM as well. We have a fair amount of other stuff we do in assembler, much of which requires SSE instruction sets but *not* XMM registers, and some of which is MMX only. In Speex, I can see how you would always want to use the widest register possible for all of the fp ops in longish vectors. However, the more integer stuff you do, and just as time goes on, the more likely it is someone will want to do some type of optimization to accommodate lowest-common-denominator type machines.

So I guess what I am saying is that it would be good to have at minimum:

NONE
MMX (lcd type machines)
SSE (i.e. 3dnow ext with no OS XMM support)
SSE_XMM (i.e. 3dnow ext, sse, with OS XMM support)
ALTIVEC

my 2p,

Tom

At 01:09 AM 1/15/2004, Jean-Marc Valin wrote:

> > We agree on not supporting the non-FP version, however the run time flags
> > need to be settable with a non FP SSE mode so that exceptions are avoided.
>
> I think we should keep the more "official" naming and not AMD's, which
> is more confusing. SSE means SSE1: all the SSE instructions (including
> the ones using xmm registers). What AMD calls SSE is not SSE at all.
> Basically, it's a bunch of "extra instructions" borrowed from SSE and
> that are part of the extended 3DNow!.
>
> > I thus propose a set of defines like this instead of the ones in our
> > initial patch:
> >
> > #define CPU_MODE_NONE 0
> > #define CPU_MODE_MMX 1 // Base Intel MMX x86
> > #define CPU_MODE_3DNOW 2 // Base AMD 3Dnow extensions
> > #define CPU_MODE_SSE 4 // Intel Integer SSE instructions
> > #define CPU_MODE_3DNOWEXT 8 // AMD 3Dnow extended instructions
> > #define CPU_MODE_SSEFP 16 // SSE FP modes, mainly support for xmm registers
> > #define CPU_MODE_SSE2 32 // Intel SSE2 instructions
> > #define CPU_MODE_ALTIVEC 64 // PowerPC Altivec support.
>
> If you really want to define stuff like that, you could simply have:
> NONE
> MMX
> 3DNOW
> 3DNOWEXT
> SSE1
> SSE2
> ALTIVEC
>
> Even then, MMX is completely useless for Speex IMO and I doubt it's
> worth writing 3DNow non-ext code (or even 3DNow! at all). Same for SSE2:
> Speex simply doesn't use doubles at all. That's why I think only
> defining NONE, SSE and ALTIVEC (maybe 3DNow?) would be enough.
>
> > We already have it implemented for the inner_prod function. After it is
> > stable and fully tested, we will send you a patch. If you have never done
> > Altivec coding it is quite simple since it is all C macros/functions.
> > Not nearly as nasty as inline asm code, although the 16 byte alignment
> > issues can be quite a pain. Our current working code is below:
>
> You can do the same with SSE intrinsics. I just got used to writing
> assembly before they were available for gcc. I had a quick look at your
> inner_prod implementation. I think that if you really want to make that
> fast (there's a big possible gain there), you need to consider the
> optimization at a higher level: from open_loop_nbest_pitch. The function
> calls inner_prod for a continuous range of offsets. With that in mind,
> it would probably be simpler to just take 4 copies (with different
> offsets) of one of the vectors and then compute everything with simple,
> aligned loads.
>
> Jean-Marc
>
> --
> Jean-Marc Valin, M.Sc.A., ing. jr.
> LABORIUS (http://www.gel.usherb.ca/laborius)
> Université de Sherbrooke, Québec, Canada

--
Tom Harper - tharper@sightspeed.com
Lead Software Engineer
SightSpeed - A Roda Group Affiliated Company
918 Parker St, Suite A14, Berkeley, CA 94710
Phone: 510.665.2920  Cell: 415.378.3779
http://www.sightspeed.com
> Please note that dot products of simple vector floats are usually faster
> in the scalar units. The add across and transfer to scalar is just too
> expensive.

Or do four at once, with some shuffling (which is basically free); almost the same code as a 4x4 matrix/vector multiply.

Segher
Greetings,

My apologies for putting this trash in the mailing list, but the topic about the SSE run-time option interested me quite a bit. It looks like some people here are really experienced on the topic. I would really appreciate it if somebody could point me to good resources about SSE and AltiVec (not necessarily on the net; I'm ready to invest some money if necessary). I already have the Intel manuals about SSE and I took a look at the Intel document proposed by Ian.

I was somewhat shocked by a mail posted by Jean-Marc (Date: 14 Jan 2004 02:44:59 -0500). I guess I badly misunderstood it by jumping to wrong conclusions, but does this mean the only way to recognize whether (say) SSE is supported by the OS is to check its version? That looks problematic to me, especially for open-source OSs where one can recompile the kernel at will; the kernel version number may not really be representative of the functionality supported.

Thank you,
Massimo