thr3ads.net - Speex dev - [speex-dev] [PATCH] Make SSE Run Time option. Add Win32 SSE code [Aug 2004]

If this information is useful, please help other people find it:
Share via:

Aron Rosenberg

2004-Aug-06 15:01 UTC

[speex-dev] [PATCH] Make SSE Run Time option. Add Win32 SSE code

Jean-Marc,

         There is a big difference between SSE and SSEFP. The SSEFP means 
that the CPU supports the xmm registers. All Intel chips with SSE support 
do, however no current 32 bit AMD chips support the XMM registers. They 
will support the SSE instructions but not those registers. You are right 
about the SSE2 not being used.

The AMD Opterons are the first AMD CPU's which support xmm registers. They 
will have 16 of them while the current Pentium 3's and above have only 8.

Sorry about the patch having those push pops commented out, they should be 
in there.

If you check your new code into CVS we can do all the converting needed. We 
are working on an Altivec version right now based on the current code, but 
if you have new code that makes it easier for us since we won't have to 
port it twice.

One major thing to note - In Altivec everything needs to be 16 byte aligned 
for it to work efficiently. A number of the starting points right now are 
only 4 byte aligned. If you can add the following macro to the variables 
that get passed in, it will make everything easier. Use it as such:

         ALIGN(16) unsigned int myVar;
or
         static ALIGN(16) float myArray[16];

#ifdef GCC_COMPILER
#define ALIGN(n) __attribute__ ((__aligned__ (n)))
#endif

#ifdef WIN32
#define ALIGN(n) __declspec(align(n))
#endif

<p>Aron Rosenberg
SightSpeed Software

<p>At 11:23 PM 1/8/2004, you wrote:>Hi,
>
>Thanks for the patch. I think it's a good idea, although I can't
apply
>it as is. The reason is that in its current form, the SSE version is not
>tested enough and isn't very clean in some aspects. For example, the
>order 10 filter is hard-coded and patched to work also for order 8 (less
>efficiently). Also, I think this should really go into 1.1.x (to become
>1.2). I have already found a faster implementation, which is not yet in
>CVS BTW.
>
>About your SPEEX_ASM flags, I'm not sure I see the difference between
>SPEEX_ASM_MMX_SSE and SPEEX_ASM_MMX_SSE_FP. Also, you're saying that the
>current code makes use of SSE2, which I don't think is the case, since I
>developed it on a Pentium III, which only supports SSE1. I don't think
>SSE2 is important at all, since most of the SSE2 instructions are for
>double precision (which Speex doesn't use at all).
>
>Last thing, I see in your windows version a bunch of commented pops and
>pushes. Those are definitely needed. You compiler may happen to produce
>code without that, but there's no guarantee you won't run into
problems
>later because suddenly, the compiler assumes that whatever was there
>before is still there.
>
>         Jean-Marc
>
>Le ven 09/01/2004 à 00:18, Aron Rosenberg a écrit :
> > All,
> >
> >          Attached is a patch that does two things. First it makes the
use
> > of the current SSE code a run time option through the use
> > of  speex_decoder_ctl() and speex_encoder_ctl
> > It does this twofold. First there is a modification to the
configure.in
> > script which introduces a check based upon platform. It will compile
in
> the
> > sse assembly if you are on an i?86 based platform by making a special
> > define. Second, it adds a new ctl value called SPEEX_SET_ASM_FLAG 
which
> > takes in an integer. The values are defined as:
> >
> > #define SPEEX_SET_ASM_FLAG              200
> > #define SPEEX_ASM_MMX_NONE              0
> > #define SPEEX_ASM_MMX_BASIC             1
> > #define SPEEX_ASM_MMX_SSE               2
> > #define SPEEX_ASM_MMX_SSE_FP    4
> >
> > The current Speex SSE code requires full SSE2 support which
corresponds to
> > SPEEX_ASM_MMX_SSE_FP. None of the other defines are actively used, but
> they
> > are included since they represent different Intel/AMD processors. For
> > example, an AMD Duran only supports SPEEX_ASM_MMX_BASIC while Pentium
3's
> > and above support full SPEEX_ASM_MMX_SSE_FP
> >
> >
> > The second part of the patch adds the equivalent MS Windows assembler
for
> > the same sections that currently have GCC x86 assembler code.
> >
> > Notes about implementation: We took the easiest route when hacking in
the
> > flag support which was to add a global flag for the entire library at
> > runtime and extern it in all the various files.
> > Jean-Marc: We looked at adding the flag into the state structures,
however
> > they were not passed all the way down into the filters.c files and it 
> would
> > have been a massive change to make it pass all the needed data. The
> > approach we took should be ok since on a given machine you would have
the
> > same settings. The decoder_ctl and encoder_ctl set the same global
flag
> > variable.
> >
> > The way we setup the asm flags var should allow you to add the ARM 
> assembly
> > in the exact same manor. You would add a check in the configure.in for
the
> > platform and define a _USE_ARM and place the code in the same
functions as
> > we did. You would then add a SPEEX_ASM_ARM 8  or something and let the
> > application decide to turn it on.
> >
> >
> > Other Notes: This patch obsoletes ltp_sse.h and filters_sse.h  .
However
> > the patch does not remove them. This is thge updated version of the
patch
> > we sent in November.
> >
> >
> > Comments are welcome.  BTW, we have been shipping our Video
Conferencing
> > product which only uses the Speex codec for 6 months now and have
gotten
> > rave reviews (PC Magazine Editors choice) for the audio and video
quality.
> > We use Speex in Windows, Mac OS-X, and Linux as we have clients for
each
> > platform. Keep up the great work! Check us out at
> > http://www.sightspeed.com and please try our beta version (Mac and
Windows
> > Clients available now) at http://www.sightspeed.com/page.php?page=beta
>
>--
>Jean-Marc Valin, M.Sc.A., ing. jr.
>LABORIUS (http://www.gel.usherb.ca/laborius)
>Université de Sherbrooke, Québec, Canada
<p>--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to
'speex-dev-request@xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is
needed.
Unsubscribe messages sent to the list will be ignored/filtered.

Jean-Marc Valin

2004-Aug-06 15:01 UTC

head link

[speex-dev] [PATCH] Make SSE Run Time option. Add Win32 SSE code

>          There is a big difference between SSE and SSEFP. The SSEFP means 
> that the CPU supports the xmm registers. All Intel chips with SSE support 
> do, however no current 32 bit AMD chips support the XMM registers. They 
> will support the SSE instructions but not those registers. You are right 
> about the SSE2 not being used.
I'm still not sure I get it. On an Athlon XP, I can do something like
"mulps xmm0, xmm1", which means that the xmm registers are indeed
supported. Besides, without the xmm registers, you can't use much of
SSE. You can use the prefetch instructions that were in the Athlon
T-Bird, but that's about it (and I don't think that makes it SSE1).
> The AMD Opterons are the first AMD CPU's which support xmm registers.
They
> will have 16 of them while the current Pentium 3's and above have only
8.
Athlon XP's do. ...unless we have a different idea of what an xmm
register is.
> Sorry about the patch having those push pops commented out, they should be 
> in there.
> 
> If you check your new code into CVS we can do all the converting needed. We
> are working on an Altivec version right now based on the current code, but 
> if you have new code that makes it easier for us since we won't have to
> port it twice.
Fine, I'll put it in the 1.1.x branch though because it's still
experimental and very unclean in some parts (alignment, forced order 10
even when we need 8). In the mean time, I'm attaching my modified
version of filter_mem2. The modifs I made removed the need for unaligned
moves and could also be applied to fir_mem2 and iir_mem2.
> One major thing to note - In Altivec everything needs to be 16 byte aligned
> for it to work efficiently. A number of the starting points right now are 
> only 4 byte aligned. If you can add the following macro to the variables 
> that get passed in, it will make everything easier. Use it as such:
Actually, SSE also requires 16-byte alignment for most instructions
(except movups, which is slow anyway). That's why I have those kludges
with the pointer masks in the current code. I think we should find a
general solution for the problem. Also, there's one place (inner_prod,
called by the open-loop pirch estimator) where non 16-byte-aligned loads
are really required. It's probably possible to work around that, but it
might require 4 copies of the data (with 4-byte offsets).
>          ALIGN(16) unsigned int myVar;
> or
>          static ALIGN(16) float myArray[16];
I think the ALIGN macros I currently have should do the job. If it's
possible to use them, the advantage is that they are
platform-independent.

        Jean-Marc


-- 
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada


-------------- next part --------------
A non-text attachment was scrubbed...
Name: filters_sse.h__charset_ISO-8859-1
Type: text/x-c-header
Size: 3628 bytes
Desc: filters_sse.h__charset_ISO-8859-1
Url :
http://lists.xiph.org/pipermail/speex-dev/attachments/20040114/7a9bfc3e/filters_sse-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 190 bytes
Desc: Ceci est une partie de message numériquement signée.
Url :
http://lists.xiph.org/pipermail/speex-dev/attachments/20040114/7a9bfc3e/signature-0001.pgp

Aron Rosenberg

2004-Aug-06 15:01 UTC

head link

[speex-dev] [PATCH] Make SSE Run Time option. Add Win32 SSE code

Jean-Marc,

 >I'm still not sure I get it. On an Athlon XP, I can do something like
 >"mulps xmm0, xmm1", which means that the xmm registers are indeed
 >supported. Besides, without the xmm registers, you can't use much of
 >SSE.

In the Atholon XP 2400+ that we have in our QA lab (Win2000 ) if you run 
that code it generates an Illegal Instruction Error. In addition, an AMD 
Duron (Windows ME) does the same thing. There are two possible reasons - 
One is that those processors do not support xmm registers or the Operating 
System does not support XMM registers. In the morning we will check the 
code on Windows XP. This may be a Windows specific thing, either way you 
still need to support non FP versions of the SSE set.

If you read through AMD's processor detection guide
         (PDF) 
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/20734.pdf

and go to section that shows the sample code for dealing with CPUID 
support. (Starts about Page 37) It talks about the FEATURE_SSEFP support 
which you have to query for. On the Atholon XP 2400+ that we have here, 
that code does not detect the presence of that when run under Windows. The 
same code on a Pentium 4 detects it just fine.

Here is an article which describes the K8 (Opteron and Atholon64) as 
including the XMM registers: 
http://sysopt.earthweb.com/articles/k8/index2.html . All the stuff I could 
google seems to indicate that XMM register support is not included in the 
current Atholon XP series or below.

With any machine you are not guaranteed to get support for the XMM 
registers (the 128 bit wide ones), since the OS has to support it as well.

Have you or anybody else successfully run the current SSE code on a Atholon 
XP system?

<p>>Actually, SSE also requires 16-byte alignment for most
instructions>(except movups, which is slow anyway). That's why I have those kludges
>with the pointer masks in the current code. I think we should find a
>general solution for the problem. Also, there's one place (inner_prod,
>called by the open-loop pirch estimator) where non 16-byte-aligned loads
>are really required. It's probably possible to work around that, but it
>might require 4 copies of the data (with 4-byte offsets).Agreed, although the inner_prod isn't that big a deal since you can do 
clever vector swaps in Altivec to reduce the amount of shuffling needed. In 
our current Altivec version we have four blocks, dealing with when certain 
things are aligned and certain things aren't. Its ugly to read, but works 
quite nicely.
>I think the ALIGN macros I currently have should do the job. If it's
>possible to use them, the advantage is that they are
>platform-independent.For the alignment part, my feeling is that the compiler generated way is 
better than a run-time cast. The compiler native code will not cross 
platform should generate much faster code since you don't have to perform 
the cast at run-time, which is what your ALIGN macros appear to be doing in 
stack-alloc.h.

One other thing we noticed is that you tend to do a lot of  for loop based 
copies:

from your new filters_sse.h around the asm code

   for (i=0;i<12;i++)
       num[i]=den[i]=0;
    for (i=0;i<12;i++)
       mem[i]=0;

    for (i=0;i<ord;i++)
    {
       num[i]=_num[i+1];
       den[i]=_den[i+1];
    }
    for (i=0;i<ord;i++)
       mem[i]=_mem[i];

<<< asm code>>>

    for (i=0;i<ord;i++)
       _mem[i]=mem[i];

<p>could easily be reduced to

memset(num,0,12);
memset(den,0,12);
memset(mem,0,12);
memcpy(num,_num+1,ord);
memcpy(den,_den+1,ord);
memcpy(mem,_mem+1,ord);

<<<asm code>>>

memcpy(_mem,mem,ord);

<p>Do you not like to use memcpy or memset? Or am I missing something like
overlapping memory spaces?

Aron Rosenberg
SightSpeed 

<p>--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to
'speex-dev-request@xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is
needed.
Unsubscribe messages sent to the list will be ignored/filtered.

Possibly Parallel Threads

Search for more possibly parallel threads

Speex dev - Aug 2004 - [PATCH] Make SSE Run Time option. Add Win32 SSE code

[speex-dev] [PATCH] Make SSE Run Time option. Add Win32 SSE code

[speex-dev] [PATCH] Make SSE Run Time option. Add Win32 SSE code

[speex-dev] [PATCH] Make SSE Run Time option. Add Win32 SSE code

Possibly Parallel Threads