thr3ads.net - Speex dev - [Speex-dev] Blackfin inline assembler and VisualDSP++ toolchain [Jun 2007]


Michael Shatz

2007-Jun-19 11:12 UTC

[Speex-dev] Blackfin inline assembler and VisualDSP++ toolchain

-----Original Message-----
From: Jean-Marc Valin [mailto:jean-marc.valin@usherbrooke.ca]
Sent: Tuesday, June 19, 2007 6:38 PM
To: Michael Shatz
Cc: speex-dev@xiph.org
Subject: Re: [Speex-dev] Blackfin inline assembler and VisualDSP++ toolchain

>> Yes, data footprint in the new version is quite manageable. Still, I
>> wish there were better documentation for speex_alloc_scratch().
>
>I'll be waiting for your patch :-)
I didn't realize that you accept patches for documentation, too.
>
>> It took me time to
>> figure out that in a single-threaded environment I could give the same
>> scratch area to multiple encoders and decoders. It would also be very
>> useful to document the size of the scratch area as a function of mode.
>> By trial and error I found out that in my mode scratch never exceeds
>> 2700 bytes, but finding this data in the documentation would be so much
>> simpler and more reliable.
>
>Unfortunately not possible. The amount of stack (scratch) space required
>depends on the bit-rate you select, the complexity value, whether you
>compile as float or fixed-point, ... 
Well, I guess those using floating point are less interested in the space
required for scratch. And the rest just makes the table in the document a
little bigger.
>But if your compiler is sane (read
>C99-compliant), you don't even need that. All you need is to define
>VAR_ARRAYS and all the temp arrays will be allocated as C99
>variable-size arrays (no memory will be allocated for explicit scratch
>space). The configure script actually detects this by default. Even
>without a C99 compiler, you can still use alloca (by defining
>USE_ALLOCA), which is still better than the scratch space.
No, it's not better, it's the same problem in a different form. With
variable-size automatic arrays, as well as with alloca, the developer of a
small embedded system has to know the size of the run-time stack. And please
don't assume that everybody always has at least 4K.
>> On the code size things are less rosy.
>> Wideband indeed goes away with DISABLE_WIDEBAND, but that's about all.
>> Due to the extensive use of function pointers, very little unused stuff
>> beyond wideband goes away when unused.
>
>Unless you NULL those pointers you don't need. Also, if you only use one
>rate, there are tables you can get rid of as well. All the tables
>represent about 10kB of ROM size, but you can probably reduce that to
>2-3 kB if you only use a single narrowband mode.
Nullifying the pointers means that I don't treat the code as a black box,
which means that if I upgrade to the next version of the library I'd have to
reapply the patches.
>> For starters, I would like DISABLE_VBR, analogous to DISABLE_WIDEBAND.
>> After that, it's probably possible to put the stuff that is used only in
>> vocoder modes under conditional compilation. It seems that modes 3 to 7
>> are too similar to each other to save a significant amount of code by
>> eliminating some of them, but I have a feeling that a generic mechanism
>> for picking only the modes needed (either through conditional compilation
>> or maybe even with a configuration perl script) would be simpler than a
>> specific DISABLE_VOCODER.
>
>The problem is that there are *lots* of things like that and having an
>option for everything would make the code a bit ugly. But they aren't
>that hard to debug. If you don't know if a function is useful, remove it
>and see what happens. If it succeeds in encoding one file, it will work
>all the time.
VBR is by far the biggest thing after WIDEBAND that users are likely to
never need or never want. And taking it out efficiently requires the widest
knowledge of the internal functioning of the library. I think DISABLE_VBR is
a good candidate for the official release.
>> Another potential saving could be achieved by replacing speex_warning,
>> speex_notification and speex_error with user-modifiable defines. The
>> existing DISABLE_WARNING/OVERRIDE_SPEEX_WARNING method is not efficient in
>> reducing the code footprint because the majority of the overhead happens
>> at the points of invocation of speex_warning rather than in the function
>> itself.
>
>How about:
>#define OVERRIDE_SPEEX_WARNING
>#define speex_warning(x) {}
>in user_misc.h? That should do the trick.
Maybe. But once again, why not do it in the official release?
>> With all my suggestions applied there is a chance that a minimized
>> speex would fit in the on-chip code memory of the BF532 (48KB). However,
>> the original goal of fitting in the BF531 (32KB of on-chip code memory)
>> seems impossible even then.
>
>32 kB for Speex appears quite possible to me. Especially considering
>you're only interested in the decoder, right (or was it the encoder)?
No, I need both.
>> Mostly GSM and proprietary codecs. Or G.726. I am starting to feel that
>> I, too, will end up with G.726.
>
>I heard there are very small and very fast G.711 encoders too :-)
>Seriously, you need to compare apples to apples.
I am not in the business of comparing fruits. I am:
A. Whining
B. Thinking out loud.
From a functional perspective I don't see how G.726 is not comparable to
narrowband speex mode 7.
>> Many years ago I worked on a project in which a proprietary codec was
>> compressing to 4400 bps with decent speech quality, all at a code
>> footprint of 16K 24-bit words and about 8-9 ADSP-2111 MIPS. I wasn't
>> involved in speech processing, so by now I don't remember which
>> algorithm they used. IIRC, not CELP.
>
>4.4 kbps is almost certainly some variant of CELP. 
No, not CELP. I googled around a bit and found the site of the company that
made our speech coder. They are still in business:
http://www.dvsinc.com
Seems like they call it MultiBand Excitation (MBE).
>Plus 16k 24-bit words is already 48 kB and I'm sure Speex can fit into
>smaller than that.
First, I am not sure that board had the full 16K words. I said 16K because
that's the maximum size allowed by the ADSP-2111 architecture.
Second, the code density of the Blackfin family is far superior to that of
the ADI 21xx.
Third, I believe you that a 48 KB speex on Blackfin is possible, but right
now my code is bigger.

>> <snip>
>> 
>>> IIRC, gcc alone (no asm) was using something in the order of 100 MIPS
>>> (back when it couldn't do hardware loops, MACs, cond. moves, ...), so as
>>> you can see, there's a fair bit of difference. So yes, with assembly
>>> working, VDSP++ should be able to achieve better than 20 MIPS.
>>>
>>> 	Jean-Marc
>> 
>> Not sure we are talking about the same mode.
>
>>This was with the 15 kbps mode used at complexity 1.
>>
>	Jean-Marc
Yes, that's the mode that I measured, with no VBR. Does the 100 MIPS figure
reflect the situation before or after David Rowe's improvements?

Jim Crichton

2007-Jun-19 13:46 UTC


[Speex-dev] Blackfin inline assembler and VisualDSP++ toolchain

>>> It took me time to
>>> figure out that in a single-threaded environment I could give the same
>>> scratch area to multiple encoders and decoders. It would also be very
>>> useful to document the size of the scratch area as a function of mode.
>>> By trial and error I found out that in my mode scratch never exceeds
>>> 2700 bytes, but finding this data in the documentation would be so much
>>> simpler and more reliable.
>>
>>Unfortunately not possible. The amount of stack (scratch) space required
>>depends on the bit-rate you select, the complexity value, whether you
>>compile as float or fixed-point, ...
>
> Well, I guess those using floating point are less interested in the space
> required for scratch. And the rest just makes the table in the document a
> little bigger.
For TI DSPs, I used a private memory array rather than the C stack, and a 
debug patch in stack_alloc.h to measure the scratch usage:

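#if 1
/* Debug variant: after each allocation, remember the highest scratch
   address reached so far in spxGlobalScratchFree. */
extern char *spxGlobalScratchFree;
#define ALLOC(var, size, type) (var = PUSH(stack, size, type), \
    (spxGlobalScratchFree) = ((stack) > (spxGlobalScratchFree)) ? (stack) : (spxGlobalScratchFree))
#else
/* Normal variant: plain scratch allocation, no tracking. */
#define ALLOC(var, size, type) var = PUSH(stack, size, type)
#endif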

I initialized the global scratch pointer to the beginning of the scratch 
area in the encoder init, and the debug macro keeps track of the max usage. 
It may be too late in your work for this to be of any help.
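
A minimal, self-contained sketch of the same high-water-mark idea, for readers who want to try it outside the library. All names here (scratch_area, scratch_peak, scratch_push, fake_encode_frame) are hypothetical stand-ins and not the stack_alloc.h definitions; the buffer size and the allocation sizes are arbitrary assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the libspeex definitions. */
char scratch_area[4096];              /* private scratch memory          */
char *scratch_peak = scratch_area;    /* highest address reached so far  */

/* Carve `bytes` out of the scratch area and advance the caller's pointer.
   Updating scratch_peak here plays the role of the debug ALLOC macro. */
void *scratch_push(char **stack, size_t bytes)
{
    char *p = *stack;
    assert(p >= scratch_area && p <= scratch_area + sizeof scratch_area);
    assert(bytes <= (size_t)(scratch_area + sizeof scratch_area - p));
    *stack = p + bytes;
    if (*stack > scratch_peak)
        scratch_peak = *stack;
    return p;
}

/* Peak scratch usage in bytes since program start. */
size_t scratch_peak_bytes(void)
{
    return (size_t)(scratch_peak - scratch_area);
}

/* Simulated codec routine. `stack` is a by-value copy of the caller's
   pointer, so everything pushed here is released again on return. */
void fake_encode_frame(char *stack)
{
    char *excitation = scratch_push(&stack, 320);
    char *lpc_tmp = scratch_push(&stack, 160);
    excitation[0] = 0;
    lpc_tmp[0] = 0;
}
```

Calling fake_encode_frame(scratch_area) for a representative set of frames and then reading scratch_peak_bytes() gives the worst case seen in that run. Because each call starts from the same base pointer and releases everything on return, the sketch also shows why one area can serve several encoder and decoder instances in a single-threaded program.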
>>> On the code size things are less rosy.
>>> Wideband indeed goes away with DISABLE_WIDEBAND, but that's about all.
>>> Due to the extensive use of function pointers, very little unused stuff
>>> beyond wideband goes away when unused.
>>
>>Unless you NULL those pointers you don't need. Also, if you only use one
>>rate, there are tables you can get rid of as well. All the tables
>>represent about 10kB of ROM size, but you can probably reduce that to
>>2-3 kB if you only use a single narrowband mode.
>
> Nullifying the pointers means that I don't treat the code as a black box,
> which means that if I upgrade to the next version of the library I'd have
> to reapply the patches.
For those of us working on very memory constrained platforms, I don't think 
that it will ever be a black box, because that would require having ENABLE 
defines for every rate and feature, so one could build up just what is 
needed.  That would be really messy.

You did not respond to the point about single data rate.  If you are doing 
this, then you can get rid of most of the tables if you fix up the 
references in modes.c.  It would be nice to have a README.code-reduction 
file that collected some of the advice that hits the list from time to time.
>>> For starters, I would like DISABLE_VBR, analogous to DISABLE_WIDEBAND.
>>> After that, it's probably possible to put the stuff that is used only in
>>> vocoder modes under conditional compilation. It seems that modes 3 to 7
>>> are too similar to each other to save a significant amount of code by
>>> eliminating some of them, but I have a feeling that a generic mechanism
>>> for picking only the modes needed (either through conditional
>>> compilation or maybe even with a configuration perl script) would be
>>> simpler than a specific DISABLE_VOCODER.
>>
>>The problem is that there are *lots* of things like that and having an
>>option for everything would make the code a bit ugly. But they aren't
>>that hard to debug. If you don't know if a function is useful, remove it
>>and see what happens. If it succeeds in encoding one file, it will work
>>all the time.
>
> VBR is by far the biggest thing after WIDEBAND that users are likely to
> never need or never want. And taking it out efficiently requires the
> widest knowledge of the internal functioning of the library. I think
> DISABLE_VBR is a good candidate for the official release.
I removed vbr.c and ifdefed the references in nb_celp.c (in 8 or so places). 
This is not too messy, and I could send a patch for this if Jean-Marc is 
agreeable.
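
As a rough illustration of what such a guard can look like, here is a sketch under stated assumptions. DISABLE_VBR is the define proposed in this thread, not an existing build option; EncSketch, sketch_vbr_pick_submode and sketch_choose_submode are hypothetical names and do not reflect the actual structure of nb_celp.c or vbr.c.

```c
#include <assert.h>

/* Hypothetical encoder state; the real encoder state is different. */
typedef struct {
    int   vbr_enabled;   /* run-time VBR switch                    */
    float vbr_quality;   /* VBR quality target                     */
    int   submode;       /* fixed submode used when VBR is off     */
} EncSketch;

#ifndef DISABLE_VBR
/* Stand-in for the code that would live in a separate VBR source file.
   With -DDISABLE_VBR this function is never compiled or linked. */
int sketch_vbr_pick_submode(float quality)
{
    int m = (int)quality;
    if (m < 1) m = 1;
    if (m > 7) m = 7;
    return m;
}
#endif

int sketch_choose_submode(const EncSketch *st)
{
#ifndef DISABLE_VBR
    /* Each reference to the VBR code is wrapped, so with DISABLE_VBR
       nothing refers to the VBR function any more. */
    if (st->vbr_enabled)
        return sketch_vbr_pick_submode(st->vbr_quality);
#endif
    return st->submode;
}
```

The point of guarding every reference, rather than only the definition, is that a remaining call site would otherwise keep the VBR code (or an unresolved symbol) in the link.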
>>> Another potential saving could be achieved by replacing speex_warning,
>>> speex_notification and speex_error with user-modifiable defines. The
>>> existing DISABLE_WARNING/OVERRIDE_SPEEX_WARNING method is not efficient
>>> in reducing the code footprint because the majority of the overhead
>>> happens at the points of invocation of speex_warning rather than in the
>>> function itself.
>>
>>How about:
>>#define OVERRIDE_SPEEX_WARNING
>>#define speex_warning(x) {}
>>in user_misc.h? That should do the trick.
>
> Maybe. But once again, why not do it in the official release?
The user_misc.h mechanism was added as part of the TI DSP port, to allow 
memory allocation overrides, as well as message output overrides.  It is a 
compromise, but it does the job.
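
A small stand-alone sketch of the override discussed above, to show why it removes the cost at the call sites. The two defines mimic what a user_misc.h could contain; how a given libspeex version actually consumes them is an assumption here, and sketch_check_frame_size is a hypothetical caller. The macro body is ((void)0) rather than {} so that a call followed by a semicolon stays valid in front of an else.

```c
#include <assert.h>
#include <stdio.h>

/* What a user_misc.h could contain (assumption, see lead-in). */
#define OVERRIDE_SPEEX_WARNING
#define speex_warning(str) ((void)0)

/* Stand-in for a library default, compiled only without the override. */
#ifndef OVERRIDE_SPEEX_WARNING
static void speex_warning(const char *str)
{
    fprintf(stderr, "warning: %s\n", str);
}
#endif

/* Hypothetical caller. With the macro in place the call below expands to
   an empty expression, so neither a function call nor the string literal
   is emitted at this call site. */
int sketch_check_frame_size(int n)
{
    if (n != 160)
        speex_warning("unexpected frame size");
    else
        return 1;
    return 0;
}
```

Removing the two defines restores the printing fallback without touching the caller, which is the property that makes a per-project header a workable compromise.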
>>> Many years ago I worked on a project in which a proprietary codec was
>>> compressing to 4400 bps with decent speech quality, all at a code
>>> footprint of 16K 24-bit words and about 8-9 ADSP-2111 MIPS. I wasn't
>>> involved in speech processing, so by now I don't remember which
>>> algorithm they used. IIRC, not CELP.
>>
>>4.4 kbps is almost certainly some variant of CELP.
>
> No, not CELP. I googled around a bit and found the site of the company
> that made our speech coder. They are still in business:
> http://www.dvsinc.com
> Seems like they call it MultiBand Excitation (MBE).
That codec family has had a lot of success in mobile satellite applications 
(Inmarsat, Iridium, AMSC/MSV).  It was developed when the 2111 was state of 
the art, and it must have been architected for that kind of footprint.  But 
a single bit rate, mostly assembly, 16-bit DSP codec implementation is bound 
to be much different from Speex.
>>Plus 16k 24-bit words is already 48 kB and I'm sure Speex can fit into
>>smaller than that.
>
> First, I am not sure that board had the full 16K words. I said 16K because
> that's the maximum size allowed by the ADSP-2111 architecture.
> Second, the code density of the Blackfin family is far superior to that of
> the ADI 21xx.
> Third, I believe you that a 48 KB speex on Blackfin is possible, but right
> now my code is bigger.
With VBR and all modes but one stripped, my text+const size for the TI C55 
is about 48 KB for a standalone build.  It was about 58 KB before.  The 
remaining source files are:

libspeex\bits.c
libspeex\cb_search.c
libspeex\exc_10_32_table.c
libspeex\filters.c
libspeex\gain_table_lbr.c
libspeex\lpc.c
libspeex\lsp.c
libspeex\lsp_tables_nb.c
libspeex\ltp.c
libspeex\math_approx.c
libspeex\misc.c
libspeex\modes.c
libspeex\nb_celp.c
libspeex\quant_lsp.c
libspeex\speex.c
libspeex\speex_callbacks.c
libspeex\vq.c
libspeex\window.c
ti\testenc-TI-C5x.c

My platform has 256KB of internal RAM, so this was fine for me.  It does 
suggest that it might be very hard for you to squeeze this in.  Maybe some 
Blackfin users can chime in with their memory/MIPs results.
>>>> IIRC, gcc alone (no asm) was using something in the order of 100 MIPS
>>>> (back when it couldn't do hardware loops, MACs, cond. moves, ...), so
>>>> as you can see, there's a fair bit of difference. So yes, with assembly
>>>> working, VDSP++ should be able to achieve better than 20 MIPS.
>>>>
>>>> Jean-Marc
>>>
>>> Not sure we are talking about the same mode.
>>
>>>This was with the 15 kbps mode used at complexity 1.
>>>
>> Jean-Marc
>
> Yes, that's the mode that I measured, with no VBR. Does the 100 MIPS
> figure reflect the situation before or after David Rowe's improvements?
I see around 26 MIPs for a TI C55x DSP for Quality 3 (8kbps), complexity 1, 
and about 33 MIPs on a TI C64xx, with no assembly optimizations, using TI's 
build tools.  That is consistent with your 15kbps result.

- Jim
