thr3ads.net - flac dev - [Flac-dev] Altivec Optimizations [Sep 2004]

Chris Csanady

2004-Sep-10 16:45 UTC

[Flac-dev] Altivec Optimizations

Hi,

I have been playing with Altivec, and I rewrote a couple of the routines
in assembly.  Looking at the archives, I noticed that there may already
be some effort on this.  Anyways...

Right now, I have two routines working.  They need to be cleaned up, made
relocatable, and documented; otherwise, they seem to work fairly well.  I
see an overall ~27% speed improvement when encoding with the default
settings, and greater at -8.

The ones I have done are:

	FLAC__lpc_compute_residual_from_qlp_coefficients_16_bit()
	FLAC__lpc_compute_autocorrelation()

I did make a change in stream_encoder.c to better align the data passed to
FLAC__lpc_compute_residual_from_qlp_coefficients(); I hope this is ok.  Most
occurrences of residual are replaced with residual+order, as in:

	FLAC__fixed_compute_residual(signal+order, residual_samples, order,
	                             residual+order);
	...
	subframe->data.fixed.residual = residual+order;

The vectors in Altivec must be 16 byte aligned, and it complicates things
if signal[] and residual[] are not aligned.
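
A small C sketch of the alignment point above. The helper names offset16() and same_phase16() are hypothetical, and the reading that offsetting residual by order keeps the input and output pointers at the same position within a 16-byte block is an inference from this message, not something it states.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset of p within its 16-byte block; 0 means vector-aligned. */
static unsigned offset16(const void *p)
{
    return (unsigned)((uintptr_t)p & 15u);
}

/* Nonzero when a and b sit at the same position within a 16-byte block,
 * so a single vector loop can treat both with the same alignment handling. */
static int same_phase16(const void *a, const void *b)
{
    return offset16(a) == offset16(b);
}
```

With two 16-byte-aligned int32_t buffers and order = 3, signal+order and residual+order both sit 12 bytes into a block, while signal+order and plain residual do not line up.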

Chris

Brady Patterson

2004-Sep-10 16:45 UTC

[Flac-dev] Altivec Optimizations

I was looking at that.

I wrote a version of FLAC__lpc_restore_signal(), but all of the stores from
the vector unit were 0 (even when I explicitly set a register to non-zero), and
gdb always printed the VR contents as 0. The latter is a known gdb bug; I'm
having to upgrade to OSX 10.2 to get the new gdb though.

I'm still hoping to get this working. Any tips would be appreciated.

-Brady

Brady Patterson

2004-Sep-10 16:45 UTC

[Flac-dev] Altivec Optimizations

Chris-

I upgraded to 10.2 and updated the developer tools, but am seeing the same
behavior. Can you help me with this?

gdb still always prints my VR contents as all 0, and the store vector
instruction still stores a suspicious-looking garbage value. I wrote a simple
program to demonstrate:

.text
	.align 2
.globl _vec_test
_vec_test:

; r3: buf

	vspltish v1,0
	stvewx v1,0,r3
	lwz r4,0(r3)
	vspltish v2,1
	stvewx v2,0,r3
	lwz r5,0(r3)

	blr

I'm assembling this using:

% as -static -g -force_cpusubtype_ALL -o test.o test.s

main() just verifies altivec existence using sysctl as described at
http://developer.apple.com/hardware/ve/g3_compatibility.html, then calls
vec_test() with a stack buffer.

Here is the relevant stuff from the debugger:

10	lwz r4,0(r3)
(gdb) p /x $r4
$9 = 0x7fffdead
(gdb) n
11	vspltish v2,1
(gdb) p $v2
$10 = {
  uint128 = 0x00000000000000000000000000000000,
  v4_float = {0, 0, 0, 0},
  v4_int32 = {0, 0, 0, 0},
  v8_int16 = {0, 0, 0, 0, 0, 0, 0, 0},
  v16_int8 = '\0' <repeats 15 times>
}
(gdb) n
12	stvewx v2,0,r3
(gdb) p /x $r5
$11 = 0xbffff948
(gdb) n
13	lwz r5,0(r3)
(gdb) p /x $r5
$12 = 0x7fffdead

To summarize, even after I splat 1 to v2, it still contains all 0s, and any
vector store writes 0x7fffdead, which apparently indicates the corresponding
VR is unused.
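
For comparison, a portable C model of what the two instructions in the test are architecturally expected to do. This is only a sketch: the type vreg and the functions emu_vspltish() and emu_stvewx_word() are hypothetical names, and the model returns the stored word instead of writing to memory.

```c
#include <stdint.h>

/* One 128-bit vector register viewed as eight big-endian halfwords. */
typedef struct { uint16_t h[8]; } vreg;

/* vspltish: sign-extend a 5-bit immediate (-16..15) to 16 bits and
 * copy it into all eight halfword elements. */
static vreg emu_vspltish(int simm)
{
    vreg v;
    for (int i = 0; i < 8; i++)
        v.h[i] = (uint16_t)(int16_t)simm;
    return v;
}

/* stvewx: store the single 32-bit word element selected by bits 2..3 of
 * the effective address. Returns the value a following lwz would load. */
static uint32_t emu_stvewx_word(const vreg *v, uintptr_t ea)
{
    unsigned w = (unsigned)((ea >> 2) & 3u);   /* word element index 0..3 */
    return ((uint32_t)v->h[2 * w] << 16) | v->h[2 * w + 1];
}
```

Under this model the first lwz should load 0x00000000 and the second 0x00010001, so 0x7fffdead in both r4 and r5 is not a value the splatted registers could have produced.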

I've got OS X 10.2.5, gcc 3.1 20020420, and gdb 5.3-20021014 (Apple version
gdb-250) on a Powerbook G4.

Any idea what's going wrong?

Thanks.

-Brady

--
Brady Patterson (brady@spaceship.com)
Do you know Old Kentucky Shark?

Brady Patterson

2004-Sep-10 16:45 UTC

[Flac-dev] Altivec Optimizations

On Sun, 27 Apr 2003, Josh Coalson wrote:
> An asm version of FLAC__lpc_restore_signal_asm should also give
> a pretty good bang for the buck on the decoding side.

I've got a working one. It's giving me about a 12% speedup. Considering that
it only works on <=16 bps blocks (meaning, from CD audio, blocks that aren't
mid-side encoded), and FLAC__lpc_restore_signal() only takes about 25% of the
process time, that may be about as good as it gets, but I'm not completely
done analyzing it (nor have I done any configuration work for it).

I may try to write a version which works for 17-32 bps, but I don't think it
can be nearly as efficient as the 1-16 version.
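
For readers unfamiliar with the routine under discussion, here is a scalar sketch of the LPC restore step: each output sample is the residual plus a quantized linear prediction from the previous order samples. The name lpc_restore_sketch() and the 64-bit accumulator are choices made for this sketch; it is not libFLAC's code and not the Altivec version described above.

```c
#include <stddef.h>
#include <stdint.h>

/* data must point just past `order` already-decoded warm-up samples,
 * so data[-1] .. data[-order] are valid reads.
 * Assumes >> on a negative value is an arithmetic shift, which holds on
 * common compilers but is implementation-defined in C. */
static void lpc_restore_sketch(const int32_t *residual, unsigned data_len,
                               const int32_t *qlp_coeff, unsigned order,
                               int shift, int32_t *data)
{
    for (unsigned i = 0; i < data_len; i++) {
        int64_t sum = 0;
        for (unsigned j = 0; j < order; j++) {
            /* coefficient j weights the sample j+1 positions back */
            ptrdiff_t idx = (ptrdiff_t)i - (ptrdiff_t)j - 1;
            sum += (int64_t)qlp_coeff[j] * data[idx];
        }
        data[i] = residual[i] + (int32_t)(sum >> shift);
    }
}
```

The inner sum is a short dot product over adjacent samples, which is the part a vector unit can take over.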

Also, FLAC__MD5Accumulate() and maybe MD5Transform() look highly parallelizable
(more so than FLAC__lpc_restore_signal() in fact), and they combine for about
20% of the process time, so I may work on Altivec versions for them.

The other time-consuming function is FLAC__bitbuffer_read_rice_signed_block(),
which I will not be attempting. :)

--
Brady Patterson (brady@spaceship.com)
Do you know Old Kentucky Shark?

Josh Coalson

2004-Sep-10 16:45 UTC

[Flac-dev] Altivec Optimizations

--- Chris Csanady <cc@137.org> wrote:
> Hi,
>
> I have been playing with Altivec, and I rewrote a couple of the routines
> in assembly.  Looking at the archives, I noticed that there may already
> be some effort on this.  Anyways...
>
> Right now, I have two routines working.  They need to be cleaned up, made
> relocatable, and documented; otherwise, they seem to work fairly well.  I
> see an overall ~27% speed improvement when encoding with the default
> settings, and greater at -8.
>
> The ones I have done are:
>
>  FLAC__lpc_compute_residual_from_qlp_coefficients_16_bit()
>  FLAC__lpc_compute_autocorrelation()
>
> I did make a change in stream_encoder.c to better align the data passed to
> FLAC__lpc_compute_residual_from_qlp_coefficients(), I hope this is ok.  Most
> occurrences of residual are replaced with residual+order as in:
>
>  FLAC__fixed_compute_residual(signal+order, residual_samples, order,
>                               residual+order);
>  ...
>  subframe->data.fixed.residual = residual+order;
>
> The vectors in Altivec must be 16 byte aligned, and it complicates things
> if signal[] and residual[] are not aligned.

Cool, I would appreciate any contributions you and/or Brady
come up with.

As for alignment, there are routines in memory.c to allocate
aligned memory at 32-byte boundaries.  It is turned on with a
#define FLAC__ALIGN_MALLOC_DATA.  Currently this is only turned
on in configure.in for x86 cpu's but you can easily do it for
powerpc.
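
A minimal sketch of the usual over-allocate-and-round-up technique for getting a 32-byte-aligned pointer from malloc(). The function alloc_aligned32() and the ALIGN_BYTES macro are hypothetical names for illustration; this is not a copy of the routines in memory.c.

```c
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_BYTES 32u

/* Returns a 32-byte-aligned pointer with room for `bytes` bytes, or NULL.
 * *raw_out receives the pointer that must later be passed to free(). */
static void *alloc_aligned32(size_t bytes, void **raw_out)
{
    /* Over-allocate so an aligned address always fits inside the block. */
    void *raw = malloc(bytes + ALIGN_BYTES - 1);
    *raw_out = raw;
    if (raw == NULL)
        return NULL;
    /* Round the address up to the next multiple of ALIGN_BYTES. */
    uintptr_t a = ((uintptr_t)raw + ALIGN_BYTES - 1)
                  & ~(uintptr_t)(ALIGN_BYTES - 1);
    return (void *)a;
}
```

A 32-byte boundary is also a 16-byte boundary, so memory obtained this way would satisfy the Altivec alignment requirement mentioned above.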

An asm version of FLAC__lpc_restore_signal_asm should also give
a pretty good bang for the buck on the decoding side.

Josh

