thr3ads.net - Vorbis dev - [Vorbis-dev] mdct.c optimization [Feb 2005]


petshome@atlas.cz

2005-Feb-01 08:47 UTC

[Vorbis-dev] mdct.c optimization

I took the function mdct_butterfly_8 and wrote out its transformation matrix. Then I
rewrote this matrix as a sequence of additions and subtractions (see the
attachment). As I suspected, I got the same operations as in the original code, but I
swapped some rows to get slightly higher speed. I hope to do the same with the
16-point butterfly function combined with the 8-point butterflies within a month. Who
still believes that this approach will be slower?
-------------- next part --------------
Itemized transcription of a linear-algebra transformation into optimized CPU
operations, which is possible because the transformation matrix is known.

This is the data vector, with auxiliary variables A and B that should be kept in
registers.

[x[0],x[1],x[2],x[3],x[4],x[5],x[6],x[7],A,B]

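Here are the transform matrices with the substituted operations and outputs.
Each row belongs to one input x[i] (the short rows below the 8x8 block belong
to A and B) and each column to one output y[j].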

 0  1  0 -1 -1  0  1  0
-1  0  1  0  0 -1  0  1
-1  0 -1  0  1  0  1  0
 0 -1  0 -1  0  1  0  1
 0 -1  0  1 -1  0  1  0
 1  0 -1  0  0 -1  0  1
 1  0  1  0  1  0  1  0
 0  1  0  1  0  1  0  1

A=x[1]-x[5]

 0  1  0 -1 -1  0  1  0
 0  0  0  0  0 -1  0  1
-1  0 -1  0  1  0  1  0
 0 -1  0 -1  0  1  0  1
 0 -1  0  1 -1  0  1  0
 0  0  0  0  0 -1  0  1
 1  0  1  0  1  0  1  0
 0  1  0  1  0  1  0  1
-1     1

B=x[6]-x[2]

 0  1  0 -1 -1  0  1  0
 0  0  0  0  0 -1  0  1
 0  0  0  0  1  0  1  0
 0 -1  0 -1  0  1  0  1
 0 -1  0  1 -1  0  1  0
 0  0  0  0  0 -1  0  1
 0  0  0  0  1  0  1  0
 0  1  0  1  0  1  0  1
-1     1
 1     1

y[0]=B-A
y[2]=A+B
A=x[0]-x[4]

 0  0  0  0 -1  0  1  0
 0  0  0  0  0 -1  0  1
 0  0  0  0  1  0  1  0
 0 -1  0 -1  0  1  0  1
 0  0  0  0 -1  0  1  0
 0  0  0  0  0 -1  0  1
 0  0  0  0  1  0  1  0
 0  1  0  1  0  1  0  1
    1    -1

B=x[7]-x[3]

 0  0  0  0 -1  0  1  0
 0  0  0  0  0 -1  0  1
 0  0  0  0  1  0  1  0
 0  0  0  0  0  1  0  1
 0  0  0  0 -1  0  1  0
 0  0  0  0  0 -1  0  1
 0  0  0  0  1  0  1  0
 0  0  0  0  0  1  0  1
    1    -1
    1     1

y[1]=A+B
y[3]=B-A
A=x[0]+x[4]

 0  0  0  0  0  0  0  0
 0  0  0  0  0 -1  0  1
 0  0  0  0  1  0  1  0
 0  0  0  0  0  1  0  1
 0  0  0  0  0  0  0  0
 0  0  0  0  0 -1  0  1
 0  0  0  0  1  0  1  0
 0  0  0  0  0  1  0  1
            -1     1

B=x[2]+x[6]

 0  0  0  0  0  0  0  0
 0  0  0  0  0 -1  0  1
 0  0  0  0  0  0  0  0
 0  0  0  0  0  1  0  1
 0  0  0  0  0  0  0  0
 0  0  0  0  0 -1  0  1
 0  0  0  0  0  0  0  0
 0  0  0  0  0  1  0  1
            -1     1
             1     1

y[4]=B-A
y[6]=A+B
A=x[1]+x[5]

 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  1  0  1
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  1  0  1
               -1     1

B=x[3]+x[7]

 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0
               -1     1
                1     1

y[5]=B-A
y[7]=A+B
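
The derivation above can be checked mechanically. The following is a minimal sketch, not code from mdct.c: the names M8, butterfly8_matrix, butterfly8_steps and naive_addsub_count are made up for illustration, and double is an assumed sample type. It multiplies the data vector by the first matrix (rows are inputs, columns are outputs) and compares that with the itemized A/B sequence. It also counts the add/sub operations needed when every output is computed on its own from its column: 3 per output, 24 in total, against 16 for the itemized sequence.

```c
/* Transform matrix as written in the post: row i belongs to input x[i],
   column j to output y[j], so y = x * M (row vector times matrix). */
static const int M8[8][8] = {
    { 0,  1,  0, -1, -1,  0,  1,  0},
    {-1,  0,  1,  0,  0, -1,  0,  1},
    {-1,  0, -1,  0,  1,  0,  1,  0},
    { 0, -1,  0, -1,  0,  1,  0,  1},
    { 0, -1,  0,  1, -1,  0,  1,  0},
    { 1,  0, -1,  0,  0, -1,  0,  1},
    { 1,  0,  1,  0,  1,  0,  1,  0},
    { 0,  1,  0,  1,  0,  1,  0,  1},
};

/* Reference: plain vector-matrix product. */
void butterfly8_matrix(const double x[8], double y[8]) {
    for (int j = 0; j < 8; j++) {
        double acc = 0.0;
        for (int i = 0; i < 8; i++)
            acc += x[i] * M8[i][j];
        y[j] = acc;
    }
}

/* The itemized sequence from the post: 8 pair sums/differences
   plus 8 output combinations, 16 add/sub in total. */
void butterfly8_steps(const double x[8], double y[8]) {
    double A, B;
    A = x[1] - x[5];  B = x[6] - x[2];
    y[0] = B - A;     y[2] = A + B;
    A = x[0] - x[4];  B = x[7] - x[3];
    y[1] = A + B;     y[3] = B - A;
    A = x[0] + x[4];  B = x[2] + x[6];
    y[4] = B - A;     y[6] = A + B;
    A = x[1] + x[5];  B = x[3] + x[7];
    y[5] = B - A;     y[7] = A + B;
}

/* Add/sub count when each output is evaluated independently:
   (number of nonzero entries in the column) - 1. */
int naive_addsub_count(void) {
    int ops = 0;
    for (int j = 0; j < 8; j++) {
        int nnz = 0;
        for (int i = 0; i < 8; i++)
            if (M8[i][j] != 0)
                nnz++;
        if (nnz > 1)
            ops += nnz - 1;
    }
    return ops;
}
```

Calling both functions on the same input and comparing the eight outputs is enough to catch a sign or index slip in the itemization.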


I hope I didn't make a mistake. First I wrote code with the minimum number of
add/sub operations. Below is a second rewrite that minimizes memory use,
including coupling of the data inputs. I hope this is a small improvement on some
CPUs. But this is just the 8x8 matrix. When I have more spare time I'll
itemize the 16x16 transform matrix, including the two 8x8 transform matrices,
which will lead to a significant increase in speed. I welcome all volunteers who
will write out the transform matrix, do the multiplication, and itemize it into
CPU instructions with memory optimization. Even more welcome are those who will
(write C code and) benchmark this small improvement and let everyone know the
result (better to wait for the 16x16 transform). Most welcome of all are
programmers who will write C code for automatic itemization and optimization of
this transform at arbitrary size. This includes matrix multiplication,
substitution of the most frequent operations, memory optimization, and
arbitrary input coupling.

G=x[0]-x[4]
A=x[0]+x[4]
F=x[6]-x[2]
B=x[6]+x[2]
x[6]=B+A
x[4]=B-A
A=x[1]-x[5]
B=x[1]+x[5]
x[2]=F+A
x[0]=F-A
A=x[7]+x[3]
F=x[7]-x[3]
x[7]=A+B
x[5]=A-B
x[3]=F-G
x[1]=F+G
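
For anyone who wants to benchmark it, the sequence above translates directly into an in-place C function. This is a sketch under assumptions: the name butterfly8_inplace and the double sample type are illustrative and do not come from mdct.c. The statement order is kept exactly as listed, because each x[i] must be read before it is overwritten.

```c
/* In-place 8-point butterfly following the second listing in the post.
   Uses four temporaries (A, B, F, G); every input is read before the
   slot it lives in is overwritten. */
void butterfly8_inplace(double x[8]) {
    double A, B, F, G;

    G = x[0] - x[4];
    A = x[0] + x[4];
    F = x[6] - x[2];
    B = x[6] + x[2];
    x[6] = B + A;   /* x0 + x2 + x4 + x6 */
    x[4] = B - A;   /* x2 + x6 - x0 - x4 */

    A = x[1] - x[5];
    B = x[1] + x[5];
    x[2] = F + A;   /* x1 - x5 + x6 - x2 */
    x[0] = F - A;   /* x6 - x2 - x1 + x5 */

    A = x[7] + x[3];
    F = x[7] - x[3];
    x[7] = A + B;   /* x1 + x3 + x5 + x7 */
    x[5] = A - B;   /* x3 + x7 - x1 - x5 */
    x[3] = F - G;   /* x7 - x3 - x0 + x4 */
    x[1] = F + G;   /* x7 - x3 + x0 - x4 */
}
```

The comments give each result in terms of the original inputs, which matches the y[0]..y[7] outputs of the step-by-step derivation.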

Monty

2005-Feb-01 12:28 UTC

[Vorbis-dev] mdct.c optimization

On Tue, Feb 01, 2005 at 05:46:55PM +0100, petshome@atlas.cz wrote:
> I took the function mdct_butterfly_8 and wrote out its transformation
> matrix. Then I rewrote this matrix as a sequence of additions and
> subtractions (see the attachment). As I suspected, I got the same
> operations as in the original code, but I swapped some rows to get
> slightly higher speed. I hope to do the same with the 16-point butterfly
> function combined with the 8-point butterflies within a month. Who still
> believes that this approach will be slower?
Writing the original approach in specialized processor code will
produce a similar speedup; this is what libraries like FFTW do.  I've
done so in the past; it's not in the reference encoder because it's a
bitch to maintain.

For small-point transforms, a generalized approach (like a
processor-specific matrix multiply) may end up being faster; n^2 is
faster than nlgn if n^2 has a very small n and a small coefficient and
nlgn has a large coefficient.  I'm not sure this will generalize to
the other butterflies... but it might.
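
To make the n^2 versus n lg n point concrete, here is a toy cost model. It is only a sketch: the function name and the coefficients are hypothetical, not measurements of any real transform. It returns the largest power-of-two size at which a quadratic method with coefficient c_quad is still no more expensive than an n lg n method with coefficient c_nlogn.

```c
/* Toy cost model: quadratic cost c_quad * n^2 versus c_nlogn * n * lg(n).
   Scans power-of-two sizes from 2 up to max_n and returns the largest n at
   which the quadratic method is not more expensive, or 0 if it never is.
   The coefficients are hypothetical inputs, not measured values. */
int last_n_where_quadratic_wins(double c_quad, double c_nlogn, int max_n) {
    int best = 0;
    int lg = 1;                       /* lg(n) for the current n */
    for (int n = 2; n <= max_n; n *= 2, lg++) {
        double quad = c_quad * n * n;
        double fast = c_nlogn * n * lg;
        if (quad <= fast)
            best = n;
    }
    return best;
}
```

With c_quad = 1 and c_nlogn = 4 the two costs are equal at n = 16 and the quadratic method loses from n = 32 on. With equal coefficients it never wins, which is why the argument only applies to very small n.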

My skepticism mainly stems from the fact that people base entire
careers on making Fourier-domain transformations faster, and to my
knowledge, no one is using this approach.  If you do prove them all
wrong, hey, we all gain.  OTOH, inline MMX/SSE/Altivec vectorization
of the original approach may be faster yet.  I'd be interested in
seeing a comparison.

(BTW, I should alert folks right now: The specific MDCT transform that
Vorbis uses is not considered especially stellar.  It was documented
and used primarily because it was easy to understand and easy to
implement.  I'd hate for someone to spend six months writing an
assembly version that's 50% faster, when simply switching to another
algorithm would be 100% faster).

Regardless of the outcome, it looks like you're having a good time so I
don't mean to discourage you. :-)

Monty
