Jean-Marc Valin
2004-Aug-06 15:01 UTC
[speex-dev] [PATCH] Make SSE Run Time option. Add Win32 SSE code
> In the Athlon XP 2400+ that we have in our QA lab (Win2000), if you run
> that code it generates an Illegal Instruction error. In addition, an AMD
> Duron (Windows ME) does the same thing. There are two possible reasons:
> either those processors do not support the xmm registers, or the
> operating system does not support them. In the morning we will check the
> code on Windows XP. This may be a Windows-specific thing; either way you
> still need to support non-FP versions of the SSE set.

Most likely, you have an OS problem. I have yet to find code that runs on
a Pentium III and doesn't run on an Athlon XP.

> If you read through AMD's processor detection guide (PDF)
> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/20734.pdf
> and go to the section that shows the sample code for dealing with CPUID
> support (starts about page 37), it talks about the FEATURE_SSEFP support
> which you have to query for. On the Athlon XP 2400+ that we have here,
> that code does not detect the presence of that feature when run under
> Windows. The same code on a Pentium 4 detects it just fine.

OK, I have gone through the doc and I think I understand. What they call
plain "SSE" (no FP) is actually the (very) incomplete SSE implementation
they had in the Classic and T-Bird Athlon. What they call SSEFP is what
Intel calls "SSE" or "SSE1". Only the Athlon XP (and newer) CPUs implement
all of SSE1.

As for supporting what you call the "non-FP version" (AMD's incomplete
implementation), I say it's not worth it. There's no gain, because all it
provides is prefetch instructions, which are going to be useless for Speex
since everything fits in the L1 cache anyway. If you really want to do
something about AMD processors (mainly pre-XP Athlons), a 3DNow!
implementation would give you a great speedup (probably even better than
SSE).

> Here is an article which describes the K8 (Opteron and Athlon 64) as
> including the XMM registers:
> http://sysopt.earthweb.com/articles/k8/index2.html . All the stuff I
> could google seems to indicate that XMM register support is not included
> in the current Athlon XP series or below.

Believe me, it is. I can even tell you that the floating-point SSE
implementation in the Athlon XP is faster than that of the Pentium III.

> With any machine you are not guaranteed to get support for the XMM
> registers (the 128-bit wide ones), since the OS has to support it as
> well.

True. With Linux, you need at least 2.4. With NT you need a service pack;
I don't know about Win2k and XP.

> Have you or anybody else successfully run the current SSE code on an
> Athlon XP system?

I have, many times.

> Agreed, although the inner_prod isn't that big a deal since you can do
> clever vector swaps in Altivec to reduce the amount of shuffling needed.
> In our current Altivec version we have four blocks, dealing with when
> certain things are aligned and certain things aren't. It's ugly to read,
> but works quite nicely.

Do you already have that implemented? I know it's possible, but the code
will likely be really ugly.

> For the alignment part, my feeling is that the compiler-generated way is
> better than a run-time cast. The compiler-native code will not be
> cross-platform, but should generate much faster code, since you don't
> have to perform the cast at run-time, which is what your ALIGN macros
> appear to be doing in stack-alloc.h.

It's not really a "run-time cast" (at least not like C++ casts). The
compiler will just generate an "add" and an "and", and that's all.
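As an illustration, the whole trick can be sketched like this (ALIGN16 is
a hypothetical name, not the actual macro from stack-alloc.h; assumes a
32-bit platform where a pointer fits in an unsigned long):

   /* Round an address up to the next 16-byte boundary:
      one add, one and -- no run-time "cast" machinery involved. */
   #define ALIGN16(ptr) ((void *)(((unsigned long)(ptr) + 15) & \
                                  ~(unsigned long)15))

   static float *aligned_scratch(char *raw)
   {
      /* raw must point to at least (needed size + 15) bytes so that
         rounding up stays inside the buffer */
      return (float *)ALIGN16(raw);
   }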
> One other thing we noticed is that you tend to do a lot of for-loop
> based copies: [...] Do you not like to use memcpy or memset? Or am I
> missing something like overlapping memory spaces?

I just felt it wasn't worth it. I've been trying to minimize all
dependencies, including on libc. You'll see that the only file that uses
libc is misc.c, so it's pretty easy for someone to remove even that
dependency. Using memcpy/memset in this context would create more trouble
than it would solve. Believe me, the copies change nothing in terms of CPU
time anyway.

Jean-Marc

--
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
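To make that dependency picture concrete, here is a sketch of the kind of
wrapper layer that keeps every libc call in one replaceable file (the
names below are hypothetical, not the actual misc.c interface):

   /* All libc usage sits behind these two wrappers; swapping in custom
      implementations removes the libc dependency entirely. */
   #include <string.h>

   static void speex_copy_bytes(void *dst, const void *src, size_t n)
   {
      memcpy(dst, src, n);   /* replace to drop libc */
   }

   static void speex_zero_bytes(void *dst, size_t n)
   {
      memset(dst, 0, n);     /* replace to drop libc */
   }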
So we ran the code on a Windows XP based Athlon XP system and the xmm
registers work just fine, so it appears that Windows 2000 and below does
not support them. We agree on not supporting the non-FP version; however,
the run-time flags need to be settable with a non-FP SSE mode so that
exceptions are avoided. I thus propose a set of defines like this instead
of the ones in our initial patch:

#define CPU_MODE_NONE      0
#define CPU_MODE_MMX       1   // Base Intel MMX x86
#define CPU_MODE_3DNOW     2   // Base AMD 3DNow! extensions
#define CPU_MODE_SSE       4   // Intel integer SSE instructions
#define CPU_MODE_3DNOWEXT  8   // AMD 3DNow! extended instructions
#define CPU_MODE_SSEFP    16   // SSE FP mode, mainly support for xmm registers
#define CPU_MODE_SSE2     32   // Intel SSE2 instructions
#define CPU_MODE_ALTIVEC  64   // PowerPC Altivec support

Potential additions include some of the ASM modes. With the results that
we found, there is a relationship that looks like this: 3DNOW implies MMX.
3DNOWEXT implies SSE. SSE2 implies SSEFP. SSEFP implies SSE. Either way,
all the current Speex SSE code should be flag-checked against SSEFP.
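A minimal sketch of how those implication rules could be applied when the
flag word is filled in at startup (cpu_mode_normalize is a hypothetical
helper, not part of the patch):

   /* Expand the implications above so that a detected mode also sets
      every mode it implies. Order matters: SSE2 -> SSEFP -> SSE. */
   static int cpu_mode_normalize(int flags)
   {
      if (flags & CPU_MODE_3DNOW)    flags |= CPU_MODE_MMX;
      if (flags & CPU_MODE_3DNOWEXT) flags |= CPU_MODE_SSE;
      if (flags & CPU_MODE_SSE2)     flags |= CPU_MODE_SSEFP;
      if (flags & CPU_MODE_SSEFP)    flags |= CPU_MODE_SSE;
      return flags;
   }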
> Do you already have that implemented? I know it's possible, but the code
> will likely be really ugly.

We already have it implemented for the inner_prod function. After it is
stable and fully tested, we will send you a patch. If you have never done
Altivec coding, it is quite simple since it is all C macros/functions, not
nearly as nasty as inline asm code, although the 16-byte alignment issues
can be quite a pain. Our current working code is below:

Aron Rosenberg
SightSpeed Inc.

static float inner_prod(float *a, float *b, int len)
{
#ifdef _USE_ALTIVEC
   /* Take the Altivec path only when the run-time flag is set.
      Assumes len is a multiple of 8 (two vectors per iteration). */
   if (global_use_mmx_sse & CPU_MODE_ALTIVEC)
   {
      int i;
      float sum;
      int a_aligned = (((unsigned long)a) & 15) ? 0 : 1;
      int b_aligned = (((unsigned long)b) & 15) ? 0 : 1;
      __vector float MSQa, LSQa, MSQb, LSQb;
      __vector unsigned char maska, maskb;
      __vector float vec_a, vec_b;
      __vector float vec_result;

      vec_result = (__vector float)vec_splat_u8(0);

      if ((!a_aligned) && (!b_aligned))
      {
         /* This (unfortunately) is the common case. */
         maska = vec_lvsl(0, a);
         maskb = vec_lvsl(0, b);
         MSQa = vec_ld(0, a);
         MSQb = vec_ld(0, b);
         for (i = 0; i < len; i += 8)
         {
            a += 4;
            LSQa = vec_ld(0, a);
            vec_a = vec_perm(MSQa, LSQa, maska);
            b += 4;
            LSQb = vec_ld(0, b);
            vec_b = vec_perm(MSQb, LSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            MSQa = vec_ld(0, a);
            vec_a = vec_perm(LSQa, MSQa, maska);
            b += 4;
            MSQb = vec_ld(0, b);
            vec_b = vec_perm(LSQb, MSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
         }
      }
      else if (a_aligned && b_aligned)
      {
         for (i = 0; i < len; i += 8)
         {
            vec_a = vec_ld(0, a);
            vec_b = vec_ld(0, b);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            b += 4;
            vec_a = vec_ld(0, a);
            vec_b = vec_ld(0, b);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            b += 4;
         }
      }
      else if (a_aligned)
      {
         maskb = vec_lvsl(0, b);
         MSQb = vec_ld(0, b);
         for (i = 0; i < len; i += 8)
         {
            vec_a = vec_ld(0, a);
            a += 4;
            b += 4;
            LSQb = vec_ld(0, b);
            vec_b = vec_perm(MSQb, LSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            vec_a = vec_ld(0, a);
            a += 4;
            b += 4;
            MSQb = vec_ld(0, b);
            vec_b = vec_perm(LSQb, MSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
         }
      }
      else /* b_aligned */
      {
         maska = vec_lvsl(0, a);
         MSQa = vec_ld(0, a);
         for (i = 0; i < len; i += 8)
         {
            a += 4;
            LSQa = vec_ld(0, a);
            vec_a = vec_perm(MSQa, LSQa, maska);
            vec_b = vec_ld(0, b);
            b += 4;
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            MSQa = vec_ld(0, a);
            vec_a = vec_perm(LSQa, MSQa, maska);
            vec_b = vec_ld(0, b);
            b += 4;
            vec_result = vec_madd(vec_a, vec_b, vec_result);
         }
      }
      /* Horizontal sum of the four partial products. */
      vec_result = vec_add(vec_result, vec_sld(vec_result, vec_result, 8));
      vec_result = vec_add(vec_result, vec_sld(vec_result, vec_result, 4));
      vec_ste(vec_result, 0, &sum);
      return sum;
   }
#endif
   /* Scalar fallback. */
   {
      int i;
      float sum = 0;
      for (i = 0; i < len; i++)
         sum += a[i] * b[i];
      return sum;
   }
}
Aron Rosenberg
2004-Aug-06 15:01 UTC
[speex-dev] [PATCH] Make SSE Run Time option. Add Win32 SSE code
Jean-Marc,

> I'm still not sure I get it. On an Athlon XP, I can do something like
> "mulps xmm0, xmm1", which means that the xmm registers are indeed
> supported. Besides, without the xmm registers, you can't use much of
> SSE.

In the Athlon XP 2400+ that we have in our QA lab (Win2000), if you run
that code it generates an Illegal Instruction error. In addition, an AMD
Duron (Windows ME) does the same thing. There are two possible reasons:
either those processors do not support the xmm registers, or the operating
system does not support them. In the morning we will check the code on
Windows XP. This may be a Windows-specific thing; either way you still
need to support non-FP versions of the SSE set.

If you read through AMD's processor detection guide (PDF)
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/20734.pdf
and go to the section that shows the sample code for dealing with CPUID
support (starts about page 37), it talks about the FEATURE_SSEFP support
which you have to query for. On the Athlon XP 2400+ that we have here,
that code does not detect the presence of that feature when run under
Windows. The same code on a Pentium 4 detects it just fine.

Here is an article which describes the K8 (Opteron and Athlon 64) as
including the XMM registers:
http://sysopt.earthweb.com/articles/k8/index2.html . All the stuff I could
google seems to indicate that XMM register support is not included in the
current Athlon XP series or below.

With any machine you are not guaranteed to get support for the XMM
registers (the 128-bit wide ones), since the OS has to support it as well.

Have you or anybody else successfully run the current SSE code on an
Athlon XP system?
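For reference, the CPUID query that the AMD guide describes boils down to
checking a couple of feature bits. A minimal sketch, assuming GCC inline
asm on 32-bit x86 (helper names are hypothetical, not the guide's code):

   /* CPUID function 1 returns feature flags in EDX: bit 25 is SSE
      ("SSEFP" in AMD's terminology), bit 26 is SSE2. Note this only
      proves the CPU has the instructions; the OS must still save and
      restore the xmm registers on context switches. */
   static void cpuid_edx(unsigned int op, unsigned int *edx)
   {
      unsigned int eax, ebx, ecx;
      __asm__ __volatile__("cpuid"
                           : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(*edx)
                           : "a"(op));
   }

   static int cpu_has_ssefp(void)
   {
      unsigned int edx;
      cpuid_edx(1, &edx);
      return (edx >> 25) & 1;
   }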
> Actually, SSE also requires 16-byte alignment for most instructions
> (except movups, which is slow anyway). That's why I have those kludges
> with the pointer masks in the current code. I think we should find a
> general solution for the problem. Also, there's one place (inner_prod,
> called by the open-loop pitch estimator) where non-16-byte-aligned loads
> are really required. It's probably possible to work around that, but it
> might require 4 copies of the data (with 4-byte offsets).

Agreed, although the inner_prod isn't that big a deal, since you can do
clever vector swaps in Altivec to reduce the amount of shuffling needed.
In our current Altivec version we have four blocks, dealing with when
certain things are aligned and certain things aren't. It's ugly to read,
but works quite nicely.

> I think the ALIGN macros I currently have should do the job. If it's
> possible to use them, the advantage is that they are
> platform-independent.

For the alignment part, my feeling is that the compiler-generated way is
better than a run-time cast. The compiler-native code will not be
cross-platform, but should generate much faster code, since you don't have
to perform the cast at run-time, which is what your ALIGN macros appear to
be doing in stack-alloc.h.

One other thing we noticed is that you tend to do a lot of for-loop based
copies. From your new filters_sse.h, around the asm code:

   for (i=0;i<12;i++)
      num[i]=den[i]=0;
   for (i=0;i<12;i++)
      mem[i]=0;
   for (i=0;i<ord;i++)
   {
      num[i]=_num[i+1];
      den[i]=_den[i+1];
   }
   for (i=0;i<ord;i++)
      mem[i]=_mem[i];

   <<< asm code >>>

   for (i=0;i<ord;i++)
      _mem[i]=mem[i];

could easily be reduced to (memcpy/memset counts are in bytes, so the
float counts need a sizeof(float) factor):

   memset(num, 0, 12*sizeof(float));
   memset(den, 0, 12*sizeof(float));
   memset(mem, 0, 12*sizeof(float));
   memcpy(num, _num+1, ord*sizeof(float));
   memcpy(den, _den+1, ord*sizeof(float));
   memcpy(mem, _mem, ord*sizeof(float));

   <<< asm code >>>

   memcpy(_mem, mem, ord*sizeof(float));

Do you not like to use memcpy or memset? Or am I missing something like
overlapping memory spaces?

Aron Rosenberg
SightSpeed
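On the OS side of the xmm-register question above, one hedge against the
Illegal Instruction crashes is a run-time probe: execute a single SSE
instruction under a SIGILL handler. A sketch assuming GCC on x86 and a
Unix-style signal API (Win32 code would use structured exception handling
instead):

   #include <signal.h>
   #include <setjmp.h>

   static sigjmp_buf sse_env;

   static void sse_sigill(int sig)
   {
      siglongjmp(sse_env, 1);   /* the SSE instruction faulted */
   }

   /* Returns 1 only if both the CPU and the OS accept xmm instructions. */
   static int os_supports_sse(void)
   {
      int ok = 0;
      void (*old_handler)(int) = signal(SIGILL, sse_sigill);
      if (!sigsetjmp(sse_env, 1))
      {
         __asm__ __volatile__("orps %xmm0, %xmm0");
         ok = 1;
      }
      signal(SIGILL, old_handler);
      return ok;
   }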