thr3ads.net - Vorbis dev - [vorbis-dev] Optimisations [Nov 2000]

Jason Hecker

2000-Nov-15 16:33 UTC

[vorbis-dev] Optimisations

Looking through the archives I have seen talk of making CPU-specific
optimisations for Vorbis, a la MMX/3DNow!/SSE.  The feeling I gather is to
wait until something is working well in C before committing to any kind of
specific optimisation.  What if the frequently needed DSP functions were
identified and a standardised set of DSP routines written for Vorbis?  This
would separate the basically non-changing core signal processing functions
(IIR, FIR and DCT/FFT) and allow them to be optimised with MMX and so on
without fear of upsetting the other code that's in a state of flux at the
moment.  Wouldn't this be nice?  ;)

If anyone is interested in heading down this path with me please let me know!

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to
'vorbis-dev-request@xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is
needed.
Unsubscribe messages sent to the list will be ignored/filtered.

Segher Boessenkool

2000-Nov-15 16:43 UTC

[vorbis-dev] Optimisations

Jason Hecker wrote:
> Looking through the archives I have seen talk of making CPU specific
> optimisations for Vorbis, a la MMX/3DNow!/SSE.  The feeling I gather is to
> wait until something is working well in C before committing to any kind of
> specific optimisation.  What if oft used and needed DSP functions were
> identified and standardised DSP functionality be written for Vorbis?  This
> would separate the basically non-changing core signal processing functions
> (IIR, FIR and DCT/FFT) and allow them to be optimised with MMX and so on
> without fear of upsetting the other code that's in a state of flux at the
> moment.  Wouldn't this be nice?  ;)
>
> If anyone is interested in heading down this path with me please let me know!

Sure I'm interested :-) For optimizing the current all-C version, I hand-unrolled
a lot of the critical loops; this will come in handy for doing a vector-op
version. Note that the filters and fast transforms are not the most time-critical,
however.
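
To illustrate what hand-unrolling with a vector version in mind can look like, here is a minimal sketch. It is not taken from the Vorbis sources; the function names and the dot-product kernel are made up for the example. The four independent accumulators mirror the four float lanes of a 128-bit vector register, so each line of the unrolled loop body maps onto one lane.

```c
#include <stddef.h>

/* Straightforward reference loop. */
float dot_plain(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i;
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Hand-unrolled by four, with one accumulator per would-be vector lane. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    /* Leftover elements when n is not a multiple of four. */
    for (; i < n; i++)
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Note that the unrolled version adds the products in a different order, so for general floating-point input the two results can differ in the last bits.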

Ciao,

Segher

Timothy Wayper

2000-Nov-15 18:30 UTC

[vorbis-dev] Optimisations

> (but damned fast!)  Does OggVorbis go off and do everything using single
> precision floating point?  How does this affect the truncation of viable
> bits after various multiplications (i.e. accuracy after a very big number is
> multiplied by a very small number)?  I attended a DSP workshop last year by

Actually, *adding* many small numbers to a large number is what endangers you
most in floating point, while multiplication, in general, preserves your
significant digits very well.

In fact, floating point multiplication accuracy is independent of the relative
order of magnitude of the operands (except in limit cases). This is one of the
advantages of floating point over fixed point (not to say fixed point is never
appropriate).
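
A small sketch of both effects, assuming IEEE-754 single precision with round-to-nearest. The function names are made up for the example.

```c
#include <stddef.h>

/* Adds `small` to `big` `count` times, one addition at a time, in single
 * precision. The volatile forces each partial sum to be rounded to float. */
float add_small_to_big(float big, float small, size_t count)
{
    volatile float acc = big;
    size_t i;
    for (i = 0; i < count; i++)
        acc = acc + small;
    return acc;
}

/* Relative error of a single-precision product, measured against the same
 * product computed in double (which is exact for two float operands as long
 * as it neither overflows nor underflows). */
double product_rel_error(float x, float y)
{
    volatile float p = x * y;
    double exact = (double)x * (double)y;
    double err = ((double)p - exact) / exact;
    return err < 0.0 ? -err : err;
}
```

With big = 16777216 (2^24) and small = 1, every single addition rounds back to 2^24, so a thousand additions change nothing. The product of 1e20 and 3e-20, by contrast, keeps a relative error of roughly 2^-24 or less even though the operands are forty orders of magnitude apart.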

Cheers,
Tim W.

____________________________________________________________
Timothy Wayper                        <timmy@wunderbear.com>
Wunderbear Software              <http://www.wunderbear.com>

Timothy J. Wood

2000-Nov-16 15:30 UTC

[vorbis-dev] Optimisations

> > Why not in assembly?  The GCC extensions won't necessarily work across
> > platforms (i.e. with the Metrowerks compiler) while it's already
> > accepted that assembly doesn't... And (to my mind) it's easier to
> > separate two similar assembly files than C files.  Besides, most PPC
>
> Just use some #ifdef's, no big deal. Or two separate src files, you'll
> need them for asm as well.

  #ifdefs would work if there are bugs in the compilers.  I say 'bugs'
since the C extensions for Altivec are defined by Motorola and should be the
same across all compilers.  I have used MrC (a bit) and gcc and both are the
same.  I haven't used the MW compiler for Altivec, though.

> MPW (MrC) does a great job (yeah, I did only one test, sorry). Btw, is there
> a fused multiply-add in AltiVec? That would make it an absolute ROCKER!

  Altivec has lots of cool instructions:

   vmaddfp      --    result =  a*b + c
   vnmsubfp     --    result =  -(a*b - c) = c - a*b
   vrefp        --    result =~ 1/a
   vrsqrtefp    --    result =~ 1/sqrt(a)
   vperm        --    result =  bytes of a|b permuted by c
   vexptefp     --    result =~ 2^a
   vlogefp      --    result =~ log2(a)
   vcfsx/vcfux  --    result =  (float)i / 2^n    (although, sadly, n >= 0)

  I have used all of these to great effect in other apps.  Some of the estimate
instructions are _very_ useful when you don't need IEEE exact results.  Even
when you do need really accurate results you can often find a refinement
algorithm that will produce better results given a good starting estimate and
still be way faster than a libm call (like Newton-Raphson refinement for 1/sqrt
as shown on page 4-18 of the Altivec PEM).
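
For reference, the Newton-Raphson step for y ~= 1/sqrt(a) is y' = y * (1.5 - 0.5 * a * y * y). The scalar sketch below is not the listing from the PEM; the function names are made up, and the starting estimate is simply passed in by the caller, standing in for whatever a hardware estimate instruction would return.

```c
/* One Newton-Raphson step towards 1/sqrt(a), starting from estimate y. */
float rsqrt_refine(float a, float y)
{
    return y * (1.5f - 0.5f * a * y * y);
}

/* Applies `steps` refinement steps to a rough starting estimate. */
float rsqrt_from_estimate(float a, float estimate, int steps)
{
    float y = estimate;
    int i;
    for (i = 0; i < steps; i++)
        y = rsqrt_refine(a, y);
    return y;
}
```

Each step roughly doubles the number of correct bits, so starting from 0.24 as an estimate of 1/sqrt(16) = 0.25, one step gives about 0.2494 and a second gives about 0.249998.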
> If I understand correctly, the gcc extensions consist mainly of new datatypes
> (like, floats4 or whatever they call it), such that

   'vector float', 'vector unsigned long', 'vector bool', 'vector unsigned char', etc

> floats4 a, b, c;
> c = a + b;

  vector float a, b, c;

  // vec_add is a polymorphic function that will select the right instruction
  // based on the arguments and result type
  c = vec_add(a, b);

> will do a vector addition. This is a quite natural thing to do, and doesn't
> take much effort to program, while the compiler will probably outsmart about
> every asm programmer (if enough work is put into the compiler).

  Probably, but it will probably be hard for the compiler to do some
optimizations.  For example, if your C code has needless conversions back and
forth between ints and floats, the compiler really doesn't know whether you
meant to lose precision or whether you are just being silly.  If you take it
down to the C Altivec bindings then you get some of the best of both worlds.
You know the general instruction flow and can see where you have a lot of
instructions.  But you don't have to worry about exact instruction
selection, register assignment (which can be a real bear when you have 32
floats, 32 ints and 32 vectors to worry about), or instruction ordering for
pipelining.
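
A sketch of what code written against the C bindings can look like, with a plain C fallback so the same file builds on machines without Altivec. The function name add_floats is made up for the example. The Altivec branch assumes a compiler that defines __ALTIVEC__ and provides <altivec.h> (as gcc does with -maltivec), that n is a multiple of 4, and that all three pointers are 16-byte aligned, since vec_ld and vec_st ignore the low four address bits.

```c
#include <stddef.h>

#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* out[i] = a[i] + b[i] for i in [0, n). */
void add_floats(const float *a, const float *b, float *out, size_t n)
{
    size_t i;
#ifdef __ALTIVEC__
    /* Four floats per iteration; the compiler picks registers and
     * schedules the loads, the add and the store. */
    for (i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, a + i);
        vector float vb = vec_ld(0, b + i);
        vec_st(vec_add(va, vb), 0, out + i);
    }
#else
    /* Portable fallback with the same result. */
    for (i = 0; i < n; i++)
        out[i] = a[i] + b[i];
#endif
}
```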

-tim

Timothy J. Wood

2000-Nov-16 15:58 UTC

[vorbis-dev] Optimisations

> This sounds great! Are these extensions well thought out? Where can I
> get 'em? I'll look at mot.com, of course...

DSP Kernels (Complex FIR, Real FIR, Real Delayed LMS FIR)
http://motorola.com/SPS/PowerPC/AltiVec/CodeMain.html

Altivec PIM (Programming Interface Manual -- the C bindings)
http://a1008.g.akamai.net/7/1008/787/66cefa0933a341/www.motorola.com/SPS/PowerPC/teksupport/teklibrary/manuals/altivecpim.pdf

Altivec PEM (Programming Environments Manual -- the assembly level docs)
http://a2016.g.akamai.net/7/2016/787/5087c1b5def3b1/www.motorola.com/SPS/PowerPC/teksupport/teklibrary/manuals/altivec_pem.pdf

> So these will presumably still work when more than 4 floats fit in a reg?
> How do they do this? Or is it fixed at 4? In that case, vector is a
> misnomer, should be vector4 OSLT.

  That's a good point.  I don't know if they ever plan on extending the
number of elements per vector.  They might just decide to extend the number of
vector registers and number of pipelines :)

> According to ANSI C, you want to lose precision.

  Well, yes.  That is what the compiler has to assume.  But the author may have
just been silly when writing the code.  I've seen lots of cases of this when
optimizing for PPC (since it is really bad at doing int/float conversions).
Needless casts between int/float can cost a lot.

-tim

  
Timothy J. Wood

2000-Nov-16 16:44 UTC

[vorbis-dev] Optimisations

> They say they will in the near future support 16 operations at once. I don't
> think they will be able to do four separate operations at once, so they most
> likely will widen the registers

  That'd sure be cool.

> Is just int->float bad, or float->int as well? I ask this, because I was
> pleasantly surprised today, because my G3 was like 10 times faster than my
> Athlon (and that one was _way_ faster than the P-III) in converting an array
> of float to an array of int (in plain stupid C code).

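// cc -O3 -S -static float.c

int floatToInt(float f);
float intToFloat(int i);

int main(void)
{
    // Results are discarded; only the generated assembly is of interest.
    floatToInt(1.0f);
    intToFloat(1);
    return 0;
}

// float -> int (truncates toward zero)
int floatToInt(float f)
{
    return (int)f;
}

// int -> float
float intToFloat(int i)
{
    return (float)i;
}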

  This produces the following assembly for the two functions:

_floatToInt:
        fctiwz f0,f1
        stfd f0,-8(r1)
        lwz r3,-4(r1)
        blr

        .double 0r4.50360177485414400000e15
.text
        .align 2
.globl _intToFloat
_intToFloat:
        lis r0,0x4330
        lis r9,ha16(LC0)
        la r9,lo16(LC0)(r9)
        lfd f0,0(r9)
        xoris r11,r3,0x8000
        stw r11,-4(r1)
        stw r0,-8(r1)
        lfd f1,-8(r1)
        fsub f1,f1,f0
        frsp f1,f1
        blr

  As you can see float->int isn't too bad.  If you need the results in a
register, you are wasting two memory operations due to the fact that RISC
machines don't usually move data directly between functional units.  On the
other hand, int->float is abominable.  The case shown above makes it look a bit
worse than it has to be since a bunch of the operations can be hoisted outside
any potential loop (loading the address of the constant and initializing the
first word of the double temporary on the stack).  Sadly, even in a loop, gcc
doesn't hoist the first store outside the loop so you get three memory
operations plus two float operations per loop instead of two memory ops and two
float ops.
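
Reading the _intToFloat listing, the constant 4.503601774854144e15 is 2^52 + 2^31, and 0x4330 is the top half of the high word of the double 2^52. The following C sketch mirrors what those instructions appear to do; the function name is made up, and it assumes IEEE-754 doubles stored with the same byte order as 64-bit integers.

```c
#include <stdint.h>
#include <string.h>

/* int -> float via the "magic number" trick used in the listing above. */
float int_to_float_magic(int32_t i)
{
    /* High word 0x43300000 encodes 2^52 with an all-zero mantissa. Placing
     * (i XOR 0x80000000), i.e. i + 2^31 as an unsigned value, in the low
     * word yields the double 2^52 + 2^31 + i exactly. */
    uint64_t bits = ((uint64_t)0x43300000u << 32)
                  | (uint64_t)((uint32_t)i ^ 0x80000000u);
    double d;

    memcpy(&d, &bits, sizeof d);   /* the stw/stw/lfd sequence */
    d -= 4503601774854144.0;       /* fsub: remove 2^52 + 2^31 */
    return (float)d;               /* frsp: round to single precision */
}
```

Only the low word depends on i, which is why the high-word store and the constant load are candidates for hoisting out of a loop.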

  This is one nice thing about Altivec -- it has a very fast path for both
int->float and float->int.
> More generally, maybe all of the audience can help: what are the weakest
> points of all the various processors Vorbis will be deployed on?

  Speaking from my experience trying to optimize Quake3 for Mac OS X, I find:

- Memory bandwidth
- Int->Float conversion

  to be the two worst problems on the PPC.  Memory bandwidth probably isn't
as big of an issue for Vorbis as for Quake, but it might still have some effect
for lookup tables that don't fit in cache.  This effect can be made less bad
by using the data cache touch instructions when possible.
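
A sketch of the cache-touch idea for table lookups, using GCC's __builtin_prefetch (which GCC typically turns into a data cache touch instruction on PowerPC). The function name, the TOUCH macro and the look-ahead distance are made up for the example, and the distance would need tuning on real hardware.

```c
#include <stddef.h>

#if defined(__GNUC__)
/* Read prefetch with low temporal locality. */
#define TOUCH(addr) __builtin_prefetch((addr), 0, 1)
#else
#define TOUCH(addr) ((void)0)
#endif

#define LOOKAHEAD 8   /* hypothetical distance, in loop iterations */

/* Sums table[idx[i]] while touching the entry needed LOOKAHEAD
 * iterations later, so it has a chance to arrive in cache in time. */
float sum_lookups(const float *table, const unsigned *idx, size_t n)
{
    float sum = 0.0f;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i + LOOKAHEAD < n)
            TOUCH(&table[idx[i + LOOKAHEAD]]);
        sum += table[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without it.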

  The int->float conversion problems go away if you can use Altivec to do it
(i.e., you have an array of ints and you need an array of floats and they are
all in the right positions, etc.)

-tim

Timothy J. Wood

2000-Nov-16 17:02 UTC

[vorbis-dev] Optimisations

> Plan is, to rewrite critical parts of the C code, so to be more natural
> to rewrite using a (possible assembler) vector implementation.
> Nothing more; it's just simple unrolling && re-rolling. And the usual
> putting-more-subroutines-into-one-and-refactoring-it-completely-different,
> of course. Make the code more natural to the machine, i.s.o. to the programmer.
> I think Monty will hate me ;-)  (If not now, soon he will).

  It seems like we should have a general framework for this.  That is, say that
we have a routine foo() that can be optimized various ways.  It would be good to
have a runtime switch to enable different optimizations for testing.  For
example, you might have foo_ppc() and foo_ppc_altivec() where the 'ppc'
only takes advantage of instructions on all ppc machines while the ppc_altivec
version uses PPC7440 specific instructions.

  Likewise you might have x86, x86_sse, x86_mmx, x86_3dnow, etc.

  So, it would be nice to compile in anything that is compilable on the target
and have a runtime switch to select a particular optimization path (and possibly
one to select the 'best' one for the current platform automatically).

  One benefit of this approach would be to make it easier to compare the results
of the C version with a particular optimization.  Another would be that Monty,
or whoever else, can modify the C version and not worry about the optimized
version so much (leaving that to the maintainer of that function), and people
can still compile stuff and just select the 'use only C' optimization to
always (well, usually :) get valid results.
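
One way such a runtime switch could look, as a minimal sketch. All names here are hypothetical, and the "unrolled" path merely stands in for a CPU-specific version such as foo_ppc_altivec().

```c
#include <stddef.h>
#include <string.h>

typedef void (*scale_fn)(float *data, size_t n, float gain);

/* Reference path: plain C, always compiled in. */
static void scale_c(float *data, size_t n, float gain)
{
    size_t i;
    for (i = 0; i < n; i++)
        data[i] *= gain;
}

/* Stand-in for an optimized, CPU-specific path. */
static void scale_unrolled(float *data, size_t n, float gain)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        data[i]     *= gain;
        data[i + 1] *= gain;
        data[i + 2] *= gain;
        data[i + 3] *= gain;
    }
    for (; i < n; i++)
        data[i] *= gain;
}

struct scale_path {
    const char *name;
    scale_fn    fn;
};

/* Every path that compiles on the target gets an entry here. */
static const struct scale_path scale_paths[] = {
    { "c",        scale_c },
    { "unrolled", scale_unrolled }
};

/* Returns the path registered under `name`; falls back to plain C when
 * the name is NULL or unknown. */
scale_fn select_scale_path(const char *name)
{
    size_t i;
    if (name != NULL) {
        for (i = 0; i < sizeof scale_paths / sizeof scale_paths[0]; i++)
            if (strcmp(scale_paths[i].name, name) == 0)
                return scale_paths[i].fn;
    }
    return scale_c;
}
```

The name could come from a command-line flag or an environment variable. Automatic selection of the 'best' path would replace the string lookup with a CPU feature check, which is left out here.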

-tim

Timothy J. Wood

2000-Nov-16 18:28 UTC

[vorbis-dev] Optimisations

> I don't think this will be too hard at all to set up.  Leave the function
> names the same, no need for blah_mmx() or blah_3dnow(), rather have
> directories for each CPU with the equivalent functions in them and have the
> linker link the right *.o files.
>
> I think this might be easier than farting about with macros, or tables that
> register the functions with pointers at runtime and so on.

  This would work, obviously, and might be a very tiny bit faster, but it would
make it harder to compare results between different versions.  It would also
mean you would have to do a lot more binary releases.  If all the functions are
present at link time and selectable manually or automatically, then you can ship
ONE x86 binary instead of 5 or whatever.

-tim

Timothy J. Wood

2000-Nov-16 18:34 UTC

[vorbis-dev] Optimisations

> >   One benefit of this approach would be to make it easier to compare the
> > results of the C version with a particular optimization.
>
> What would be the difference?

  Well, ideally the output would be the same.  This would allow developers to
have a much easier time testing since they wouldn't have to rebuild to
switch to a different optimization path.
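
A sketch of the kind of check this makes possible when both paths live in the same binary. The names are hypothetical, and the two halving kernels are placeholders for a C reference and an optimized version of the same routine.

```c
#include <stddef.h>

typedef void (*kernel_fn)(const float *in, float *out, size_t n);

/* Largest absolute difference between two output buffers. */
float max_abs_diff(const float *a, const float *b, size_t n)
{
    float worst = 0.0f;
    size_t i;
    for (i = 0; i < n; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Runs both kernels on the same input and reports whether their outputs
 * agree to within `tol`. The caller supplies the two scratch buffers. */
int kernels_agree(kernel_fn ref, kernel_fn opt, const float *in,
                  float *out_ref, float *out_opt, size_t n, float tol)
{
    ref(in, out_ref, n);
    opt(in, out_opt, n);
    return max_abs_diff(out_ref, out_opt, n) <= tol;
}

/* Placeholder reference kernel. */
void halve_ref(const float *in, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i] * 0.5f;
}

/* Placeholder "optimized" kernel computing the same thing differently. */
void halve_alt(const float *in, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        out[i] = in[i] / 2.0f;
}
```

A nonzero tolerance is usually needed in practice, since an optimized path may reorder floating-point operations or use estimate instructions.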

> > (leaving that to the maintainer of that function), and people can still
> > compile stuff and just select the 'use only C'
>
> Leaving stuff to the maintainer of a particular function, will make progress
> of Vorbis as slow as the slowest of the developers (at least for some people).

  No, actually, it would make development faster since Monty or other codec
developers wouldn't have to worry so much (or at all) about their changes
impacting optimized work -- they could just commit them and send out mail
letting the optimized path people know something has changed.  When release time
comes around, any optimized paths that don't work would simply be disabled
(since obviously they aren't getting enough support) rather than being
shipped broken.

-tim
