So we ran the code on a Windows XP based Athlon XP system and the xmm registers work just fine, so it appears that Windows 2000 and below does not support them.

We agree on not supporting the non-FP version; however, the run-time flags need to be settable with a non-FP SSE mode so that exceptions are avoided.

I thus propose a set of defines like this instead of the ones in our initial patch:

#define CPU_MODE_NONE     0
#define CPU_MODE_MMX      1  // Base Intel MMX x86
#define CPU_MODE_3DNOW    2  // Base AMD 3DNow! extensions
#define CPU_MODE_SSE      4  // Intel integer SSE instructions
#define CPU_MODE_3DNOWEXT 8  // AMD 3DNow! extended instructions
#define CPU_MODE_SSEFP    16 // SSE FP modes, mainly support for xmm registers
#define CPU_MODE_SSE2     32 // Intel SSE2 instructions
#define CPU_MODE_ALTIVEC  64 // PowerPC AltiVec support

Potential additions include some of the ASM modes. With the results that we found, the flags relate like this: 3DNOW implies MMX. 3DNOWEXT implies SSE. SSE2 implies SSEFP. SSEFP implies SSE. Either way, all the current Speex SSE code should be flag-checked against SSEFP.

> Do you already have that implemented? I know it's possible, but the code
> will likely be really ugly.

We already have it implemented for the inner_prod function. After it is stable and fully tested, we will send you a patch. If you have never done AltiVec coding, it is quite simple since it is all C macros/functions, not nearly as nasty as inline asm code, although the 16-byte alignment issues can be quite a pain. Our current working code is below:

Aron Rosenberg
SightSpeed Inc.

static float inner_prod(float *a, float *b, int len)
{
   if (global_use_mmx_sse & CPU_MODE_ALTIVEC) {
#ifdef _USE_ALTIVEC
      int i;
      float sum;
      int a_aligned = (((unsigned long)a) & 15) ? 0 : 1;
      int b_aligned = (((unsigned long)b) & 15) ? 0 : 1;
      __vector float MSQa, LSQa, MSQb, LSQb;
      __vector unsigned char maska, maskb;
      __vector float vec_a, vec_b;
      __vector float vec_result;

      vec_result = (__vector float)vec_splat_u8(0);
      if ((!a_aligned) && (!b_aligned)) {
         /* This (unfortunately) is the common case. */
         maska = vec_lvsl(0, a);
         maskb = vec_lvsl(0, b);
         MSQa = vec_ld(0, a);
         MSQb = vec_ld(0, b);
         for (i = 0; i < len; i += 8) {
            a += 4;
            LSQa = vec_ld(0, a);
            vec_a = vec_perm(MSQa, LSQa, maska);
            b += 4;
            LSQb = vec_ld(0, b);
            vec_b = vec_perm(MSQb, LSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            MSQa = vec_ld(0, a);
            vec_a = vec_perm(LSQa, MSQa, maska);
            b += 4;
            MSQb = vec_ld(0, b);
            vec_b = vec_perm(LSQb, MSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
         }
      } else if (a_aligned && b_aligned) {
         for (i = 0; i < len; i += 8) {
            vec_a = vec_ld(0, a);
            vec_b = vec_ld(0, b);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            b += 4;
            vec_a = vec_ld(0, a);
            vec_b = vec_ld(0, b);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            b += 4;
         }
      } else if (a_aligned) {
         maskb = vec_lvsl(0, b);
         MSQb = vec_ld(0, b);
         for (i = 0; i < len; i += 8) {
            vec_a = vec_ld(0, a);
            a += 4;
            b += 4;
            LSQb = vec_ld(0, b);
            vec_b = vec_perm(MSQb, LSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            vec_a = vec_ld(0, a);
            a += 4;
            b += 4;
            MSQb = vec_ld(0, b);
            vec_b = vec_perm(LSQb, MSQb, maskb);
            vec_result = vec_madd(vec_a, vec_b, vec_result);
         }
      } else if (b_aligned) {
         maska = vec_lvsl(0, a);
         MSQa = vec_ld(0, a);
         for (i = 0; i < len; i += 8) {
            a += 4;
            LSQa = vec_ld(0, a);
            vec_a = vec_perm(MSQa, LSQa, maska);
            vec_b = vec_ld(0, b);
            b += 4;
            vec_result = vec_madd(vec_a, vec_b, vec_result);
            a += 4;
            MSQa = vec_ld(0, a);
            vec_a = vec_perm(LSQa, MSQa, maska);
            vec_b = vec_ld(0, b);
            b += 4;
            vec_result = vec_madd(vec_a, vec_b, vec_result);
         }
      }
      /* Add across the four partial sums and transfer to scalar. */
      vec_result = vec_add(vec_result, vec_sld(vec_result, vec_result, 8));
      vec_result = vec_add(vec_result, vec_sld(vec_result, vec_result, 4));
      vec_ste(vec_result, 0, &sum);
      return sum;
#endif
   }
   /* (scalar fallback truncated in the original message) */
}
--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'speex-dev-request@xiph.org'
containing only the word 'unsubscribe' in the body. No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.
> You may wish to save space for PNI.
>
> http://cedar.intel.com/media/pdf/PNI_LEGAL3.pdf

Seems to be interesting instructions for complex arithmetic there (thus helping FFTs). I'm not sure there's anything useful for Speex, though. We'll see. What I think is much more promising is the x86-64 version of SSE with 16 registers. That could speed up the filters a lot.

> Please note that dot products of simple vector floats are usually faster
> in the scalar units. The add across and transfer to scalar is just too
> expensive. It's generally only worthwhile if the data starts and ends in
> the vector units, and it is inlined so that latencies can be covered with
> other work. e.g.:

Actually, even with a scalar unit, the best code is implicitly vectorized. If you look at the original code I had, there are 4 partial sums that prevent some stalling due to dependencies. From there, it's easy to vectorize by 4 and add at the end. Note that for Speex the vectors are either 40 or 160 samples long. The whole process is also repeated 128 times in a row, so I think a vector unit will do much better.

	Jean-Marc

--
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 190 bytes
Desc: This is a digitally signed message part.
Url: http://lists.xiph.org/pipermail/speex-dev/attachments/20040115/31dba8ec/signature-0001.pgp
On Thu, 15 Jan 2004, Aron Rosenberg wrote:

> So we ran the code on a Windows XP based Athlon XP system and the xmm
> registers work just fine, so it appears that Windows 2000 and below does not
> support them.
>
> We agree on not supporting the non-FP version, however the run time flags
> need to be settable with a non-FP SSE mode so that exceptions are avoided.
>
> I thus propose a set of defines like this instead of the ones in our
> initial patch:
>
> #define CPU_MODE_NONE 0
> #define CPU_MODE_MMX 1 // Base Intel MMX x86
> #define CPU_MODE_3DNOW 2 // Base AMD 3Dnow extensions
> #define CPU_MODE_SSE 4 // Intel Integer SSE instructions
> #define CPU_MODE_3DNOWEXT 8 // AMD 3Dnow extended instructions
> #define CPU_MODE_SSEFP 16 // SSE FP modes, mainly support for xmm registers
> #define CPU_MODE_SSE2 32 // Intel SSE2 instructions
> #define CPU_MODE_ALTIVEC 64 // PowerPC Altivec support.

You may wish to save space for PNI.

http://cedar.intel.com/media/pdf/PNI_LEGAL3.pdf

Likewise, all that branching is probably going to cause more trouble than it saves. Try this:

   vector float a0 = vec_ld( 0, a );
   vector float a1 = vec_ld( 15, a );
   vector float b0 = vec_ld( 0, b );
   vector float b1 = vec_ld( 15, b );
   a0 = vec_perm( a0, a1, vec_lvsl( 0, a ) );
   b0 = vec_perm( b0, b1, vec_lvsl( 0, b ) );
   a0 = vec_madd( a0, b0, (vector float) vec_splat_u32(0) );
   a0 = vec_add( a0, vec_sld( a0, a0, 8 ) );
   a0 = vec_add( a0, vec_sld( a0, a0, 4 ) );
   vec_ste( a0, 0, &sum );
   return sum;

Please note that dot products of simple vector floats are usually faster in the scalar units. The add-across and transfer to scalar is just too expensive. It's generally only worthwhile if the data starts and ends in the vector units, and it is inlined so that latencies can be covered with other work.
E.g.:

inline vector float DotProduct( vector float a, vector float b )
{
    a = vec_madd( a, b, (vector float) vec_splat_u32(0) );
    a = vec_add( a, vec_sld( a, a, 8 ) );
    a = vec_add( a, vec_sld( a, a, 4 ) );
    return a;
}

Ian

---------------------------------------------------
   Ian Ollmann, Ph.D.       iano@cco.caltech.edu
---------------------------------------------------
> We agree on not supporting the non-FP version, however the run time flags
> need to be settable with a non FP SSE mode so that exceptions are avoided.

I think we should keep the more "official" naming and not AMD's, which is more confusing. SSE means SSE1: all the SSE instructions (including the ones using xmm registers). What AMD calls SSE is not SSE at all. Basically, it's a bunch of "extra instructions" borrowed from SSE that are part of the extended 3DNow!.

> I thus propose a set of defines like this instead of the ones in our
> initial patch:
>
> #define CPU_MODE_NONE 0
> #define CPU_MODE_MMX 1 // Base Intel MMX x86
> #define CPU_MODE_3DNOW 2 // Base AMD 3Dnow extensions
> #define CPU_MODE_SSE 4 // Intel Integer SSE instructions
> #define CPU_MODE_3DNOWEXT 8 // AMD 3Dnow extended instructions
> #define CPU_MODE_SSEFP 16 // SSE FP modes, mainly support for xmm registers
> #define CPU_MODE_SSE2 32 // Intel SSE2 instructions
> #define CPU_MODE_ALTIVEC 64 // PowerPC Altivec support.

If you really want to define stuff like that, you could simply have:

NONE
MMX
3DNOW
3DNOWEXT
SSE1
SSE2
ALTIVEC

Even then, MMX is completely useless for Speex IMO, and I doubt it's worth writing 3DNow! non-ext code (or even 3DNow! at all). Same for SSE2: Speex simply doesn't use doubles at all. That's why I think only defining NONE, SSE and ALTIVEC (maybe 3DNow!?) would be enough.

> We already have it implemented for the inner_prod function. After it is
> stable and fully tested, we will send you a patch. If you have never done
> Altivec coding it is quite simple since it is all C macros/functions.
> Not nearly as nasty as inline asm code, although the 16 byte alignment
> issues can be quite a pain. Our current working code is below:

You can do the same with SSE intrinsics. I just got used to writing assembly before they were available for gcc. I had a quick look at your inner_prod implementation.
I think that if you really want to make that fast (there's a big possible gain there), you need to consider the optimization at a higher level: from open_loop_nbest_pitch. The function calls inner_prod for a continuous range of offsets. With that in mind, it would probably be simpler to just take 4 copies (with different offsets) of one of the vectors and then compute everything with simple, aligned loads.

	Jean-Marc

--
Jean-Marc Valin, M.Sc.A., ing. jr.
LABORIUS (http://www.gel.usherb.ca/laborius)
Université de Sherbrooke, Québec, Canada
Hi Jean-Marc,

I think there is just a confusion over terminology going on here. I agree that support for the 3DNow! base version may not necessarily be relevant. However, even though 3DNow! extended is a bastardized version of SSE, it still supports the same instructions, and that is what is important; I don't think we intend to add any AMD-specific code.

The real issue is cross-CPU SSE support, and whether in addition there is access to XMM registers or not, i.e. whether the OS actually supports XMM as well. We have a fair amount of other stuff we do in assembler, much of which requires SSE instruction sets but *not* XMM registers, and some of which is MMX only. In Speex, I can see how you would always want to use the widest register possible for all of the fp ops in longish vectors. However, the more integer stuff you do, and just as time goes on, the more likely it is someone will want to do some type of optimization to accommodate lowest-common-denominator type machines.

So I guess what I am saying is that it would be good to have at minimum:

NONE
MMX (lcd type machines)
SSE (i.e. 3dnow ext with no OS XMM support)
SSE_XMM (i.e. 3dnow ext, sse, with OS XMM support)
ALTIVEC

my 2p,

Tom

At 01:09 AM 1/15/2004, Jean-Marc Valin wrote:

> > We agree on not supporting the non-FP version, however the run time flags
> > need to be settable with a non FP SSE mode so that exceptions are avoided.
>
> I think we should keep the more "official" naming and not AMD's, which
> is more confusing. SSE means SSE1: all the SSE instructions (including
> the ones using xmm registers). What AMD calls SSE is not SSE at all.
> Basically, it's a bunch of "extra instructions" borrowed from SSE and
> that are part of the extended 3DNow!.
>
> > I thus propose a set of defines like this instead of the ones in our
> > initial patch:
> >
> > #define CPU_MODE_NONE 0
> > #define CPU_MODE_MMX 1 // Base Intel MMX x86
> > #define CPU_MODE_3DNOW 2 // Base AMD 3Dnow extensions
> > #define CPU_MODE_SSE 4 // Intel Integer SSE instructions
> > #define CPU_MODE_3DNOWEXT 8 // AMD 3Dnow extended instructions
> > #define CPU_MODE_SSEFP 16 // SSE FP modes, mainly support for xmm registers
> > #define CPU_MODE_SSE2 32 // Intel SSE2 instructions
> > #define CPU_MODE_ALTIVEC 64 // PowerPC Altivec support.
>
> If you really want to define stuff like that, you could simply have:
> NONE
> MMX
> 3DNOW
> 3DNOWEXT
> SSE1
> SSE2
> ALTIVEC
>
> Even then, MMX is completely useless for Speex IMO and I doubt it's
> worth writing 3DNow non-ext code (or even 3DNow! at all). Same for SSE2:
> Speex simply doesn't use doubles at all. That's why I think only
> defining NONE, SSE and ALTIVEC (maybe 3DNow?) would be enough.
>
> > We already have it implemented for the inner_prod function. After it is
> > stable and fully tested, we will send you a patch. If you have never done
> > Altivec coding it is quite simple since it is all C macros/functions.
> > Not nearly as nasty as inline asm code, although the 16 byte alignment
> > issues can be quite a pain. Our current working code is below:
>
> You can do the same with SSE intrinsics. I just got used to writing
> assembly before they were available for gcc. I had a quick look at your
> inner_prod implementation. I think that if you really want to make that
> fast (there's a big possible gain there), you need to consider the
> optimization at a higher level: from open_loop_nbest_pitch. The function
> calls inner_prod for a continuous range of offsets. With that in mind,
> it would probably be simpler to just take 4 copies (with different
> offsets) of one of the vectors and then compute everything with simple,
> aligned loads.
>
> Jean-Marc
>
> --
> Jean-Marc Valin, M.Sc.A., ing. jr.
> LABORIUS (http://www.gel.usherb.ca/laborius)
> Université de Sherbrooke, Québec, Canada

--
Tom Harper - tharper@sightspeed.com
Lead Software Engineer
SightSpeed - A Roda Group Affiliated Company
918 Parker St, Suite A14, Berkeley, CA 94710
Phone: 510.665.2920  Cell: 415.378.3779
http://www.sightspeed.com
> Please note that dot products of simple vector floats are usually faster
> in the scalar units. The add across and transfer to scalar is just too
> expensive.

Or do four at once, with some shuffling (which is basically free); almost the same code as a 4x4 matrix/vector multiply.

Segher
Greetings,

My apologies for putting this trash in the mailing list, but the topic about the SSE run-time option interested me quite a bit. It looks like some people here are really experienced on the topic. I would really appreciate it if somebody could point me to good resources about SSE and AltiVec (not necessarily on the net; I'm ready to invest some money if necessary). I already have the Intel manuals about SSE and I took a look at the Intel document proposed by Ian.

I was somewhat shocked by a mail posted by Jean-Marc (Date: 14 Jan 2004 02:44:59 -0500). I guess I badly misunderstood it by jumping to wrong conclusions, but does this mean the only way to recognize whether (say) SSE is supported by the OS is to check its version? That looks problematic to me, especially for open-source OSs where one can recompile the kernel at will; the kernel version number may not really be representative of the functionality supported.

Thank you,
Massimo