On Thu, 15 Jan 2004, Ian Ollmann wrote:

> On Thu, 15 Jan 2004, Jean-Marc Valin wrote:
>
> > > Personally, I don't think much of PNI. The complex arithmetic stuff they
> > > added sets you up for a lot of permute overhead that is inefficient --
> > > especially on a processor that is already weak on permute. In my opinion,
> >
> > Actually, the new instructions make it possible to do complex multiplies
> > without the need to permute and separate the add and subtract. The
> > really useful instruction here is the "addsubps".
>
> Would you like to prove it with a code sample?

I suppose that if I make such a demand, it is only sporting to provide what I believe to be the more efficient competing method, which uses only SSE/SSE2. Double precision is shown; for single precision, simply replace every "pd" with "ps" and every "__m128d" with "__m128".

#include <emmintrin.h>   // SSE2 intrinsics

// For C[] = A[] * B[]
// The real and imaginary parts of A, B and C are stored in
// separate arrays, not interleaved
inline void ComplexMultiply( __m128d *Cr, __m128d *Ci,
                             __m128d Ar, __m128d Ai,
                             __m128d Br, __m128d Bi )
{
    // http://mathworld.wolfram.com/ComplexMultiplication.html
    // Cr = Ar * Br - Ai * Bi
    // Ci = Ai * Br + Ar * Bi

    __m128d real = _mm_mul_pd( Ar, Br );
    __m128d imag = _mm_mul_pd( Ai, Br );

    Ai = _mm_mul_pd( Ai, Bi );
    Ar = _mm_mul_pd( Ar, Bi );

    real = _mm_sub_pd( real, Ai );
    imag = _mm_add_pd( imag, Ar );

    *Cr = real;
    *Ci = imag;
}

No permute is required. The key point is that I do two (or, in single precision, four) complex multiplies at a time in proper SIMD fashion, unlike PNI-based methods. Throughput is therefore 3 vector ALU instructions per element even though 6 ALU instructions are issued (1.5 instructions per element in single precision). The stores at the end are merely a formality required by the C language to return more than one result, and will presumably disappear when the function is inlined.

Ian

---------------------------------------------------
Ian Ollmann, Ph.D.                 iano@cco.caltech.edu
---------------------------------------------------
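[A sketch for illustration, not part of Ian's mail: one way a caller might drive ComplexMultiply over split (de-interleaved) arrays, two double-precision complex products per iteration. The driver name, the 16-byte alignment of the arrays, and the even element count are all assumptions here.]

/* Hypothetical driver: C[] = A[] * B[] over n complex doubles kept in
   split arrays.  Assumes 16-byte-aligned pointers and n even. */
void ComplexMultiplyArrays( double *Cr, double *Ci,
                            const double *Ar, const double *Ai,
                            const double *Br, const double *Bi,
                            int n )
{
    int i;
    for ( i = 0; i < n; i += 2 )          /* two complex products per pass */
    {
        __m128d cr, ci;
        ComplexMultiply( &cr, &ci,
                         _mm_load_pd( Ar + i ), _mm_load_pd( Ai + i ),
                         _mm_load_pd( Br + i ), _mm_load_pd( Bi + i ) );
        _mm_store_pd( Cr + i, cr );
        _mm_store_pd( Ci + i, ci );
    }
}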
Actually, I'm not denying that you can do pretty fast complex multiplies by separating the real and imaginary parts. What I'm saying is that with addsubps you can do a better job when you have the complex numbers packed than you can with SSE1 only. I still think AMD did it better with its pfpnacc instruction, and Intel should have gone much further.

On Thu 15/01/2004 at 19:28, Ian Ollmann wrote:

> [...]
> No permute is required. The key point is that I do two (or, in single
> precision, four) complex multiplies at a time in proper SIMD fashion,
> unlike PNI-based methods.
> [...]

--
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
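[A sketch for illustration, not from either mail: the sort of addsubps sequence being discussed, applied to interleaved single-precision complex numbers (two per __m128). It assumes a PNI/SSE3-capable compiler with <pmmintrin.h>; the helper's name is invented.]

#include <pmmintrin.h>   /* PNI/SSE3: _mm_addsub_ps, _mm_moveldup_ps, _mm_movehdup_ps */

/* A and B each hold two interleaved complex floats: [r0, i0, r1, i1]. */
static inline __m128 ComplexMultiplyInterleaved( __m128 A, __m128 B )
{
    __m128 Br   = _mm_moveldup_ps( B );                          /* [Br0, Br0, Br1, Br1] */
    __m128 Bi   = _mm_movehdup_ps( B );                          /* [Bi0, Bi0, Bi1, Bi1] */
    __m128 Aswp = _mm_shuffle_ps( A, A, _MM_SHUFFLE(2,3,0,1) );  /* [Ai0, Ar0, Ai1, Ar1] */

    __m128 re = _mm_mul_ps( A,    Br );   /* [Ar*Br, Ai*Br, ...] */
    __m128 im = _mm_mul_ps( Aswp, Bi );   /* [Ai*Bi, Ar*Bi, ...] */

    /* addsubps subtracts in the even lanes and adds in the odd lanes:
       [Ar*Br - Ai*Bi, Ai*Br + Ar*Bi, ...]                            */
    return _mm_addsub_ps( re, im );
}

[Counted this way, the interleaved form costs three permute-class instructions (the two dups and the shuffle) plus three ALU instructions (two multiplies and the addsub) per vector, which matches the three-permutes-plus-three-ALU tally Ian gives below.]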
On Thu, 15 Jan 2004, Jean-Marc Valin wrote:

> Actually, I'm not denying that you can do pretty fast complex multiplies
> by separating the real and imaginary parts. What I'm saying is that with
> addsubps you can do a better job when you have the complex numbers packed
> than you can with SSE1 only. I still think AMD did it better with its
> pfpnacc instruction, and Intel should have gone much further.

I find it amazing that they would spend effort introducing new hardware designed to facilitate programming in inefficient ways. The existence of the instruction encourages people to use that data layout, thereby shooting themselves in the foot.

Furthermore, if they are going to help, they could at least do so intelligently. The addsubps instruction would have saved two permutes if it added across (horizontally) rather than vertically. As PNI stands now, given a strategy of hobbling ourselves with an interleaved data layout, we arrive at the following implementation:

    Cr = Ar * Br - Ai * Bi
    Ci = Ai * Br + Ar * Bi

In vector notation:

    C = A * Br +- swap(A) * Bi

which comes to three permutes and three ALU instructions. Given a dispatch limitation of one SIMD instruction per cycle (everything goes through port 1, as I understand page 1-17 of the Intel Pentium 4 / Xeon Processor Optimization manual to say), it appears to me that you could do equally well without permutes using scalar SSE2 instructions, or maybe even x87, because all we really need to accomplish here is 6 scalar ALU ops!

If addsubps did its thing horizontally, then we could write it this way:

    real   = A * B
    imag   = swap(A) * B
    result = { sub_across( real ), add_across( imag ) }

which is three ALU operations and one permute. There is some chance that the SSE2 implementation would beat double-precision scalar code!*

Ian

* Provided Intel processed more than one double per cycle for packed SSE2 instructions, which it does not.

---------------------------------------------------
Ian Ollmann, Ph.D.                 iano@cco.caltech.edu
---------------------------------------------------
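[A sketch for illustration, not part of Ian's mail: the plain scalar reference behind the "6 scalar ALU ops" remark. A compiler using SSE2 for floating point would emit this as scalar mulsd/subsd/addsd; the function name is invented.]

/* One complex multiply is four multiplies, one subtract and one add:
   6 scalar ALU ops per complex product. */
static inline void ComplexMultiplyScalar( double *Cr, double *Ci,
                                          double Ar, double Ai,
                                          double Br, double Bi )
{
    *Cr = Ar * Br - Ai * Bi;   /* mulsd, mulsd, subsd */
    *Ci = Ai * Br + Ar * Bi;   /* mulsd, mulsd, addsd */
}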