I took function mdct_butterfly_8 and write out transformation matrix. Then I
rewrote this matrix into sequence of additions and substractions (see
attachement). As I suspected I got the same as in the original code but I swaped
some rows to get little higher speed. I hope I'll do the same with 16 point
butterfly function combined with 8 point butterflies in a month. Who still
believe that this approach will be slower?
-------------- next part --------------
Itemize transcription of linear algebra transformation into optimized CPU
operations because we know the transformation matrix.
This is the data vector with auxiliary variables A and B which should be
registers.
[x[0],x[1],x[2],x[3],x[4],x[5],x[6],x[7],A,B]
Here are the transorm matrixes with substituted operations and outputs.
0 1 0 -1 -1 0 1 0
-1 0 1 0 0 -1 0 1
-1 0 -1 0 1 0 1 0
0 -1 0 -1 0 1 0 1
0 -1 0 1 -1 0 1 0
1 0 -1 0 0 -1 0 1
1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1
A=x[1]-x[5]
0 1 0 -1 -1 0 1 0
0 0 0 0 0 -1 0 1
-1 0 -1 0 1 0 1 0
0 -1 0 -1 0 1 0 1
0 -1 0 1 -1 0 1 0
0 0 0 0 0 -1 0 1
1 0 1 0 1 0 1 0
0 1 0 1 0 1 0 1
-1 1
B=x[6]-x[2]
0 1 0 -1 -1 0 1 0
0 0 0 0 0 -1 0 1
0 0 0 0 1 0 1 0
0 -1 0 -1 0 1 0 1
0 -1 0 1 -1 0 1 0
0 0 0 0 0 -1 0 1
0 0 0 0 1 0 1 0
0 1 0 1 0 1 0 1
-1 1
1 1
y[0]=B-A
y[2]=A+B
A=x[0]-x[4]
0 0 0 0 -1 0 1 0
0 0 0 0 0 -1 0 1
0 0 0 0 1 0 1 0
0 -1 0 -1 0 1 0 1
0 0 0 0 -1 0 1 0
0 0 0 0 0 -1 0 1
0 0 0 0 1 0 1 0
0 1 0 1 0 1 0 1
1 -1
B=x[7]-x[3]
0 0 0 0 -1 0 1 0
0 0 0 0 0 -1 0 1
0 0 0 0 1 0 1 0
0 0 0 0 0 1 0 1
0 0 0 0 -1 0 1 0
0 0 0 0 0 -1 0 1
0 0 0 0 1 0 1 0
0 0 0 0 0 1 0 1
1 -1
1 1
y[1]=A+B
y[3]=B-A
A=x[0]+x[4]
0 0 0 0 0 0 0 0
0 0 0 0 0 -1 0 1
0 0 0 0 1 0 1 0
0 0 0 0 0 1 0 1
0 0 0 0 0 0 0 0
0 0 0 0 0 -1 0 1
0 0 0 0 1 0 1 0
0 0 0 0 0 1 0 1
-1 1
B=x[2]+x[6]
0 0 0 0 0 0 0 0
0 0 0 0 0 -1 0 1
0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1
0 0 0 0 0 0 0 0
0 0 0 0 0 -1 0 1
0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1
-1 1
1 1
y[4]=B-A
y[6]=A+B
A=x[1]+x[5]
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1
-1 1
B=x[3]+x[7]
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
-1 1
1 1
y[5]=B-A
y[7]=A+B
I hope I didn't make a mistake. First I wrote a code for minimum number of
add/sub operations. Below you can see second rewrote which minimize memory space
including coupling data inputs. I hope this is little improvement on some CPU.
But this is just with 8x8 matrix. When I'll have more spare time I'll
itemize 16x16 transform matrix including two 8x8 trans. mat. which will lead to
significant increase of speed. I welcome all volunteers which will write
transform matrix, do the multiplication and itemize it into CPU intructions with
memory optimization. I more welcome those which will (write C code and)
benchmark this little improvement and let all know (better wait for 16x16
transform). The most welcome are programmers which will write C code for
automatic itemization and optimization of this transform of arbitrary size. This
includes matrix multiplication, subtitution of the most occured operations,
memory optimization and arbitrary input coupling.
G=x[0]-x[4]
A=x[0]+x[4]
F=x[6]-x[2]
B=x[6]+x[2]
x[6]=B+A
x[4]=B-A
A=x[1]-x[5]
B=x[1]+x[5]
x[2]=F+A
x[0]=F-A
A=x[7]+x[3]
F=x[7]-x[3]
x[7]=A+B
x[5]=A-B
x[3]=F-G
x[1]=F+G