Hi,

I have been playing with Altivec, and I rewrote a couple of the routines in assembly. Looking at the archives, I noticed that there may already be some effort on this. Anyways...

Right now, I have two routines working. They need to be cleaned up, made relocatable, and documented; otherwise, they seem to work fairly well. I see an overall ~27% speed improvement when encoding with the default settings, and greater at -8.

The ones I have done are:

    FLAC__lpc_compute_residual_from_qlp_coefficients_16_bit()
    FLAC__lpc_compute_autocorrelation()

I did make a change in stream_encoder.c to better align the data passed to FLAC__lpc_compute_residual_from_qlp_coefficients(), I hope this is ok. Most occurrences of residual are replaced with residual+order as in:

    FLAC__fixed_compute_residual(signal+order, residual_samples, order, residual+order);
    ...
    subframe->data.fixed.residual = residual+order;

The vectors in Altivec must be 16 byte aligned, and it complicates things if signal[] and residual[] are not aligned.

Chris
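To illustrate the alignment point, here is a minimal C sketch (not taken from Chris's patch). It assumes 4-byte samples, and the helper names `misalignment_16`, `same_phase_16`, and `alignment_demo` are hypothetical. It shows that when both buffers start on a 16-byte boundary, `signal+order` and plain `residual` generally fall at different offsets within a vector, while `signal+order` and `residual+order` line up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte offset of p past the previous 16-byte
 * boundary (0 means p is 16-byte aligned). */
static unsigned misalignment_16(const void *p)
{
    return (unsigned)((uintptr_t)p & 15u);
}

/* Hypothetical helper: true when a and b sit at the same offset inside
 * a 16-byte vector, so element i of each lands in the same lane. */
static int same_phase_16(const void *a, const void *b)
{
    return misalignment_16(a) == misalignment_16(b);
}

/* Self-check mirroring the residual -> residual+order change.
 * The buffer layout here is an assumption made for illustration. */
static void alignment_demo(void)
{
    unsigned char raw[2 * 64 * sizeof(int32_t) + 16];
    /* Round the start of raw up to a 16-byte boundary. */
    int32_t *signal = (int32_t *)(((uintptr_t)raw + 15u) & ~(uintptr_t)15u);
    int32_t *residual = signal + 64; /* 256 bytes later, still aligned */
    unsigned order = 3;

    assert(misalignment_16(signal) == 0);
    assert(misalignment_16(signal + order) == 12); /* 3 samples * 4 bytes */
    assert(!same_phase_16(signal + order, residual));
    assert(same_phase_16(signal + order, residual + order));
}
```

With matching offsets, a vector routine can handle the leading misaligned samples once for both arrays instead of realigning one against the other on every iteration.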
I was looking at that. I wrote a version of FLAC__lpc_restore_signal(), but all of the stores from the vector unit were 0 (even when I explicitly set a register to non-zero), and gdb always printed the VR contents as 0. The latter is a known gdb bug; I'm having to upgrade to OSX 10.2 to get the new gdb though.

I'm still hoping to get this working. Any tips would be appreciated.

-Brady
Chris-

I upgraded to 10.2 and updated the developer tools, but am seeing the same behavior. Can you help me with this? gdb still always prints my VR contents as all 0, and the store vector instruction still stores a suspicious-looking garbage value. I wrote a simple program to demonstrate:

    .text
    .align 2
    .globl _vec_test
    _vec_test:
        ; r3: buf
        vspltish v1,0
        stvewx v1,0,r3
        lwz r4,0(r3)
        vspltish v2,1
        stvewx v2,0,r3
        lwz r5,0(r3)
        blr

I'm assembling this using:

    % as -static -g -force_cpusubtype_ALL -o test.o test.s

main() just verifies altivec existence using sysctl as described at http://developer.apple.com/hardware/ve/g3_compatibility.html, then calls vec_test() with a stack buffer. Here is the relevant stuff from the debugger:

    10          lwz r4,0(r3)
    (gdb) p /x $r4
    $9 = 0x7fffdead
    (gdb) n
    11          vspltish v2,1
    (gdb) p $v2
    $10 = {
      uint128 = 0x00000000000000000000000000000000,
      v4_float = {0, 0, 0, 0},
      v4_int32 = {0, 0, 0, 0},
      v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0},
      v16_int8 = '\0' <repeats 15 times>
    }
    (gdb) n
    12          stvewx v2,0,r3
    (gdb) p /x $r5
    $11 = 0xbffff948
    (gdb) n
    13          lwz r5,0(r3)
    (gdb) p /x $r5
    $12 = 0x7fffdead

To summarize, even after I splat 1 to v2, it still contains all 0s, and any vector store writes 0x7fffdead, which apparently indicates the corresponding VR is unused.

I've got OS X 10.2.5, gcc 3.1 20020420, and gdb 5.3-20021014 (Apple version gdb-250) on a Powerbook G4. Any idea what's going wrong? Thanks.

-Brady

-- 
Brady Patterson (brady@spaceship.com)
Do you know Old Kentucky Shark?
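For comparison with the debugger output above, here is a small C model of what the two instructions in the test are expected to produce. It is only a sketch based on the Altivec instruction definitions as I understand them: `vspltish` copies a sign-extended 5-bit immediate into each of the eight halfwords, and `stvewx` stores the one word element selected by the effective address. The type `vr_model` and the functions `model_vspltish` and `model_stvewx` are hypothetical names, and this is plain C, not vector code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 128-bit vector register as eight big-endian
 * halfword elements (h[0] is the most significant). */
typedef struct {
    uint16_t h[8];
} vr_model;

/* Model of "vspltish vD,SIMM": splat the sign-extended 5-bit immediate
 * into every halfword element. */
static vr_model model_vspltish(int simm)
{
    vr_model v;
    assert(simm >= -16 && simm <= 15); /* 5-bit signed immediate */
    for (int i = 0; i < 8; i++)
        v.h[i] = (uint16_t)(int16_t)simm;
    return v;
}

/* Model of "stvewx vS,rA,rB": return the 32-bit word that would be
 * stored, i.e. the word element selected by bits 2..3 of the address. */
static uint32_t model_stvewx(const vr_model *v, uintptr_t ea)
{
    unsigned w = (unsigned)((ea >> 2) & 3u);
    return ((uint32_t)v->h[2 * w] << 16) | (uint32_t)v->h[2 * w + 1];
}
```

Under this model the first `lwz` should read back 0x00000000 and the second 0x00010001, so a value of 0x7fffdead in both r4 and r5 suggests the stored data is not coming from the splatted register contents at all.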
On Sun, 27 Apr 2003, Josh Coalson wrote:

> An asm version of FLAC__lpc_restore_signal_asm should also give
> a pretty good bang for the buck on the decoding side.

I've got a working one. It's giving me about a 12% speedup. Considering that it only works on <=16 bps blocks (meaning, from CD audio, blocks that aren't mid-side encoded), and FLAC__lpc_restore_signal() only takes about 25% of the process time, that may be about as good as it gets, but I'm not completely done analyzing it (nor have I done any configuration work for it).

I may try to write a version which works for 17-32 bps, but I don't think it can be nearly as efficient as the 1-16 version.

Also, FLAC__MD5Accumulate() and maybe MD5Transform() look highly parallelizable (more so than FLAC__lpc_restore_signal() in fact), and they combine for about 20% of the process time, so I may work on Altivec versions for them.

The other time-consuming function is FLAC__bitbuffer_read_rice_signed_block(), which I will not be attempting. :)

-- 
Brady Patterson (brady@spaceship.com)
Do you know Old Kentucky Shark?
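As background for the discussion of FLAC__lpc_restore_signal(), here is a scalar C sketch of the LPC recurrence it implements, paired with the matching residual computation. This is an illustration, not libFLAC's code: the names `predict_sketch`, `compute_residual_sketch`, and `restore_signal_sketch` are hypothetical, and a 64-bit accumulator is used here only to sidestep overflow in the example. Note that each restored sample depends on the previous `order` restored samples, which is one plausible reason the decode side is harder to vectorize than the encode side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: LPC prediction for the sample at data[0], using
 * the `order` samples before it (data[-1] .. data[-order]).
 * Right-shifting a negative value is implementation-defined in C; this
 * sketch assumes the usual arithmetic shift. */
static int32_t predict_sketch(const int32_t *data, const int32_t *qlp_coeff,
                              unsigned order, int lp_quantization)
{
    int64_t sum = 0;
    for (unsigned j = 0; j < order; j++)
        sum += (int64_t)qlp_coeff[j] * data[-(ptrdiff_t)j - 1];
    return (int32_t)(sum >> lp_quantization);
}

/* Encoder side: residual[i] = data[i] - prediction. Every iteration
 * reads only known input samples, so iterations are independent. */
static void compute_residual_sketch(const int32_t *data, unsigned data_len,
                                    const int32_t *qlp_coeff, unsigned order,
                                    int lp_quantization, int32_t *residual)
{
    for (unsigned i = 0; i < data_len; i++)
        residual[i] = data[i] - predict_sketch(data + i, qlp_coeff, order,
                                               lp_quantization);
}

/* Decoder side: data[i] = residual[i] + prediction. Each output feeds
 * the next prediction, so the loop carries a serial dependence.
 * data[-order .. -1] must already hold the warm-up samples. */
static void restore_signal_sketch(const int32_t *residual, unsigned data_len,
                                  const int32_t *qlp_coeff, unsigned order,
                                  int lp_quantization, int32_t *data)
{
    for (unsigned i = 0; i < data_len; i++)
        data[i] = residual[i] + predict_sketch(data + i, qlp_coeff, order,
                                               lp_quantization);
}
```

Running the residual computation and then the restore with the same coefficients and warm-up samples reproduces the original signal exactly, which makes a convenient check for any vectorized replacement.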
--- Chris Csanady <cc@137.org> wrote:

> Hi,
>
> I have been playing with Altivec, and I rewrote a couple of the
> routines in assembly. Looking at the archives, I noticed that there
> may already be some effort on this. Anyways...
>
> Right now, I have two routines working. They need to be cleaned up,
> made relocatable, and documented; otherwise, they seem to work fairly
> well. I see an overall ~27% speed improvement when encoding with the
> default settings, and greater at -8.
>
> The ones I have done are:
>
> FLAC__lpc_compute_residual_from_qlp_coefficients_16_bit()
> FLAC__lpc_compute_autocorrelation()
>
> I did make a change in stream_encoder.c to better align the data
> passed to FLAC__lpc_compute_residual_from_qlp_coefficients(), I hope
> this is ok. Most occurrences of residual are replaced with
> residual+order as in:
>
> FLAC__fixed_compute_residual(signal+order, residual_samples, order,
> residual+order);
> ...
> subframe->data.fixed.residual = residual+order;
>
> The vectors in Altivec must be 16 byte aligned, and it complicates
> things if signal[] and residual[] are not aligned.

Cool, I would appreciate any contributions you and/or Brady come up with.

As for alignment, there are routines in memory.c to allocate aligned memory at 32-byte boundaries. It is turned on with a #define FLAC__ALIGN_MALLOC_DATA. Currently this is only turned on in configure.in for x86 cpu's but you can easily do it for powerpc.

An asm version of FLAC__lpc_restore_signal_asm should also give a pretty good bang for the buck on the decoding side.

Josh
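For anyone unfamiliar with the technique Josh refers to, a common way to get 32-byte-aligned memory from plain malloc() is to over-allocate and round the pointer up. The sketch below is a generic illustration under that assumption, not the actual code in memory.c; `alloc_aligned_32` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not libFLAC's routine): allocate `bytes` usable
 * bytes starting at a 32-byte boundary.
 * Returns the raw pointer to pass to free(), and stores the aligned
 * pointer to actually use in *aligned. Returns NULL on failure. */
static void *alloc_aligned_32(size_t bytes, void **aligned)
{
    void *raw;

    assert(aligned != NULL);
    *aligned = NULL;

    /* Guard against overflow when adding the padding. */
    if (bytes > SIZE_MAX - 31)
        return NULL;

    /* 31 extra bytes guarantee a 32-byte boundary within the block. */
    raw = malloc(bytes + 31);
    if (raw == NULL)
        return NULL;

    /* Round up to the next multiple of 32. */
    *aligned = (void *)(((uintptr_t)raw + 31u) & ~(uintptr_t)31u);
    return raw;
}
```

A 32-byte boundary is also a 16-byte boundary, so buffers allocated this way would satisfy the Altivec requirement Chris mentions, provided any later offset such as `+order` is accounted for separately.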