thr3ads.net - opus - [opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library [Jan 2015]

If this information is useful, please help other people find it:
Share via:

Timothy B. Terriberry

2015-Jan-29 17:39 UTC

[opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library

Viswanath Puttagunta wrote:>   if OPUS_ARM_NEON_INTR
>   CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \
> -			%test_unit_rotation.o %test_unit_mathops.o
> -$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS)
> +                        $(CELT_SOURCES_ARM_NE10:.c=.lo) \
> +			               %test_unit_rotation.o %test_unit_mathops.o \
> +                        %test_unit_mdct.o %test_unit_dft.o
Crazy indentation.
> +$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS)
$(NE10_CFLAGS)
>   endif
> diff --git a/celt/arm/arm_celt_ne10_fft_map.c
b/celt/arm/arm_celt_ne10_fft_map.c
> new file mode 100644
> index 0000000..5bb7b5f
Please put these tables (and the ones in arm_celt_ne10_mdct_map) in 
arm_celt_map.c... the goal was to have them all in the same place, so 
that if we changed the architectures supported or something, we did not 
have to hunt all over the codebase to update them all.
> +int (*const OPUS_FFT_ALLOC_ARCH[OPUS_ARCHMASK+1])(kiss_fft_state *st) = {
Please follow the naming convention used in arm_celt_map, e.g., 
<all_caps_function_name>_IMPL instead of _ARCH (on this and all the 
other tables).
> +void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg,
> +                                       const kiss_fft_cpx *fin,
> +                                       kiss_fft_cpx *fout) = {
These are mis-aligned.
> +   st->priv = (void *)ne10_fft_alloc_c2c_float32_neon(st->nfft);
> +   if (st->priv == NULL) {
> +      printf("Unable to ne10 alloc\n");
Absolutely no printfs in library code.
> +      return -1;
> +   }
> +   return 0;
> +}
> +
> +void opus_fft_free_arm_float_neon(kiss_fft_state *st)
> +{
> +   ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv;
> +
> +   if (cfg)
> +      free((void *)cfg);
This concerns me for several reasons:

1) We never call free() directly in libopus. It is always wrapped in the 
opus_free() macro to allow ports to override it (and as a debugging tool).

2) We didn't call malloc here, NE10 did, which is in a separate module. 
That will work in some environments (Linux), but in others (say, Windows 
Phone), memory allocated in one module MUST be freed by the same module, 
because different modules do not share libc state (indeed, they can even 
be linked against different implementations of libc). If this is how the 
API of NE10 is designed to work, then that API needs to be fixed.
> +}
> +#endif
> +
> +void opus_fft_float_neon(const kiss_fft_state *st,
> +                        const kiss_fft_cpx *fin,
> +                        kiss_fft_cpx *fout)
Again, these are mis-aligned.
> +{
> +   ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv;
> +   VARDECL(ne10_fft_cpx_float32_t, temp);
> +   VARDECL(ne10_fft_cpx_float32_t, tempin);
Just another note on API design... the _t suffix is reserved by POSIX, 
and should never be used by user code. It's unlikely to cause issues 
with these long type names, but I have certainly seen it cause issues 
elsewhere (and using the style encourages others to do it).
> +   SAVE_STACK;
> +   int N2 = st->nfft >> 1;
> +   float32x4_t inq, outq;
> +   float32x2_t scale;
> +   float *in = (float *)fin;
You're dropping the const qualifier for no reason. Also, vld1q_f32() 
takes a const float32_t *, NOT a float *, and they are not compatible on 
all compiler versions.
> +   float *out;
> +   int i;
> +   ALLOC(temp, st->nfft, ne10_fft_cpx_float32_t);
> +   ALLOC(tempin, st->nfft, ne10_fft_cpx_float32_t);
This seems like a fairly large increase in peak stack usage (7.5 kB). Is 
there any chance of reducing this?

For example, presumably the first thing ne10_fft_c2c_1d_float32_neon() 
does is bit-reverse the input... if it could be modified to do the 
scaling at the same time, that would get rid of one buffer. The other 
option is to make opus_fft's callers do it (clt_mdct_forward already 
does this in the C version). The other other option is to modify the API 
to take a mutable input buffer, but combining the scaling with another 
loop will reduce the number of passes (and thus load/stores), which will 
likely be faster. That requires arch-specific code for the call in 
analysis.c (and the tests), though.

clt_mdct_forward() has a buffer f that is the right size and is not used 
during the opus_fft() call. Somehow being able to re-use that buffer 
would get rid of the other temporary buffer here. You'd still need to 
allocate one for the call in analysis.c, but that's at least near the 
top of the stack (of course, better yet would be to eliminate the need 
for this buffer in NE10 entirely).

The exact approach here really depends on our ability to modify the NE10 
API, but I'm getting the impression that clt_mdct_forward_float_neon() 
should probably not call opus_fft() at all (but directly access the NE10 
API), just as the C version directly accesses opus_fft_impl().
> +
> +   out = (float *)tempin;
These are pretty confusing names (if you have to keep this scaling 
here). Ideally they'd be related since they refer to the same memory 
(e.g., scaled and scaledp or something).

Also, float is _not_ compatible with float32_t (which is what vst1q_f32 
takes) in all compiler versions. Please do not mix and match them.
> +   scale = vld1_dup_f32(&st->scale);
Needs a (const float32_t *) cast.
> +   for (i = 0; i < N2; i++) {
> +      inq = vld1q_f32(in);
> +      in += 4;
> +      outq = vmulq_lane_f32(inq, scale, 0);
> +      vst1q_f32(out, outq);
> +      out += 4;
> +   }
> +
> +   cfg->buffer = (ne10_fft_cpx_float32_t *)&temp[0];
If the struct name is "buffer", probably better to have the temporary 
named "buffer" too, since it's not used anywhere else.
> +   ne10_fft_c2c_1d_float32_neon((ne10_fft_cpx_float32_t *)fout,
> +                                 (ne10_fft_cpx_float32_t *)tempin,
> +                                 cfg, 0);
More mis-alignment.
> +   /* N/4 complex FFT, does not downscale anymore */
> +   opus_fft(st, f2, (kiss_fft_cpx *)f, opus_select_arch());
Because you removed the scaling from the above loop, this comment is 
inaccurate.
> diff --git a/celt/arm/fft_arm.h b/celt/arm/fft_arm.h
> new file mode 100644
> index 0000000..16f008b
> --- /dev/null
> +++ b/celt/arm/fft_arm.h
> @@ -0,0 +1,65 @@
> +/* Copyright (c) 2015-2016 Xiph.Org Foundation
Are you from the future?
> +ifdef HAVE_ARM_NE10
> +CC = gcc
> +CFLAGS += -mfpu=neon
> +INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_NEON_INTR
> +LIBDIR = -l:$(NE10_LIBDIR)/libNE10.so
> +SOURCES += ../arm/celt_neon_intr.c dump_mode_arm_ne10.c
> +endif
It's a bit unfortunate that this depends on having NE10 available, since 
it's used to generate static files which someone may ultimately build on 
a completely different system, but I'm not sure how much effort it's 
worth to try to fix that. Probably not much.
> diff --git a/celt/dump_modes/dump_mode_arm_ne10.c
b/celt/dump_modes/dump_mode_arm_ne10.c
> new file mode 100644
> index 0000000..30c7423
> --- /dev/null
> +++ b/celt/dump_modes/dump_mode_arm_ne10.c
"dump_modes_arm_ne10.c"
> +         fprintf(file, "{%f,%f},%c", cfg->twiddles[j].r,
cfg->twiddles[j].i,(j+4)%3==0?'\n':' ');
Please use the same conversion specification as dump_modes.c for FLOAT, 
e.g., "%#0.8gf".
> +   fprintf(file, "\n#ifdef HAVE_ARM_NE10\n");
> +   fprintf(file, "#define OVERRIDE_FFT 1\n");
> +   fprintf(file, "#include \"%s\"\n",
ARM_NE10_ARCH_FILE_NAME);
> +   fprintf(file, "#endif\n");
At least if you do generate the files on a system without NE10, and then 
build on a system with NE10, this will fail, which is probably the right 
thing.
> @@ -205,7 +220,6 @@ void dump_modes(FILE *file, CELTMode **modes, int
nb_modes)
>         fprintf(file, "#endif\n");
>         fprintf(file, "\n");
>
> -
>         /* Print the actual mode data */
>         fprintf(file, "static const CELTMode mode%d_%d_%d = {\n",
mode->Fs, mdctSize, mode->overlap);
>         fprintf(file, INT32 ",    /* Fs */\n", mode->Fs);
Irrelevant whitespace change.
>      {
> +      opus_fft_free_arch((kiss_fft_state *)cfg, opus_select_arch());
>         opus_free((opus_int16*)cfg->bitrev);
Wrong indentation.
> @@ -59,6 +60,7 @@ extern "C" {
>   #   define kiss_twiddle_scalar float
>   #   define KF_SUFFIX _celt_single
>   # endif
> +
>   #endif
>
>   typedef struct {
Irrelevant whitespace change.
> @@ -87,8 +89,13 @@ typedef struct kiss_fft_state{
>       opus_int16 factors[2*MAXFACTORS];
>       const opus_int16 *bitrev;
>       const kiss_twiddle_cpx *twiddles;
> +    void *priv; /* Used by arch specfic optimizations */
Wrong indentation.
>   } kiss_fft_state;
Can I get a copy of the
/* ARM NE10 library does not support below values */
comment you added below here as well?
> +#ifndef HAVE_ARM_NE10
>           test1d(36,0);
>           test1d(36,1);
>           test1d(50,0);
>           test1d(50,1);
> +#endif
>   celt/static_modes_fixed.h \
> +celt/static_modes_float_arm_ne10.h \
>   celt/arm/armcpu.h \
Wrong indentation.
>   celt/arm/fixed_armv4.h \
>   celt/arm/fixed_armv5e.h \
>   celt/arm/kiss_fft_armv4.h \
>   celt/arm/kiss_fft_armv5e.h \
>   celt/arm/pitch_arm.h \
> +celt/arm/fft_arm.h \
> +celt/arm/mdct_arm.h \
>   celt/x86/pitch_sse.h \
Etc.

Viswanath Puttagunta

2015-Jan-29 23:13 UTC

head link

[opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library

Hi Timothy,

Appreciate the comprehensive code review.

The biggest issue I see is the peak stack usage.... rest looks like
fairly straight forward cleanup.

Is the peak stack usage a complete blocker in current form?
If it is indeed a blocker, would it be acceptable if we can reduce
additional buffer requirement from 2 buffers (current) to 1, possibly
by moving scaling inside ne10_fft_c2c_1d_float32_neon? (I will let
Phil comment on if there is any reason not to do this.. and suggest
any alternatives if necessary)

I will work on fixing up code from opus side.

Phil is one of developers/maintainers of opus. Please correct me if
any of information below is inaccurate from a libNE10 change request
perspective.

Phil,

Below are 2 main things we need to change from NE10 Library side. Can
you please make these changes on NE10 side?

1. Remove usage of _t in ne10_fft_cfg_float32_t and ne10_fft_cpx_float32_t
-------from comment----
ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv;
+   VARDECL(ne10_fft_cpx_float32_t, temp);
+   VARDECL(ne10_fft_cpx_float32_t, tempin);

Just another note on API design... the _t suffix is reserved by POSIX,
and should never be used by user code. It's unlikely to cause issues
with these long type names, but I have certainly seen it cause issues
elsewhere (and using the style encourages others to do it).
-------
It should be fairly trivial to fix by adding something like this into
NE10 source with something like
typedef ne10_fft_cpx_float32 ne10_fft_cpx_float32_t
typedef ne10_fft_cpx_float32 ne10_fft_cpx_float32_t

Then from opus side, I can instead use
ne10_fft_cfg_float32 cfg = (ne10_fft_cfg_float32)st->priv;
VARDECL(ne10_fft_cpx_float32, temp);
VARDECL(ne10_fft_cpx_float32, tempin);

2. opus_fft_float_neon: Remove at least one additional buffer that is
currently needed (on stack)
- Currently, we have to use 2 additional buffers on stack.
- One for scaling (ALLOC(tempin, st->nfft, ne10_fft_cpx_float32_t);)
- Another as temp buffer (cfg->buffer)
- Is there any reason the scaling cannot be inside
ne10_fft_c2c_1d_float32_neon and just re-use cfg->buffer for scaling
as well?
- This is probably the easiest way to get rid of one additional buffer.
Note: Timothy, I remember Phil telling me that NE10 fft algorithm does
not use bit reversal. (Phil, correct if I'm mistaken here)

Regards,
Vish

On 29 January 2015 at 11:39, Timothy B. Terriberry <tterribe at xiph.org>
wrote:> Viswanath Puttagunta wrote:
>>
>>   if OPUS_ARM_NEON_INTR
>>   CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \
>> -                       %test_unit_rotation.o %test_unit_mathops.o
>> -$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS)
>> +                        $(CELT_SOURCES_ARM_NE10:.c=.lo) \
>> +                                      %test_unit_rotation.o
>> %test_unit_mathops.o \
>> +                        %test_unit_mdct.o %test_unit_dft.o
>
>
> Crazy indentation.
>
>> +$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS)
>> $(NE10_CFLAGS)
>>   endif
>> diff --git a/celt/arm/arm_celt_ne10_fft_map.c
>> b/celt/arm/arm_celt_ne10_fft_map.c
>> new file mode 100644
>> index 0000000..5bb7b5f
>
>
> Please put these tables (and the ones in arm_celt_ne10_mdct_map) in
> arm_celt_map.c... the goal was to have them all in the same place, so that
> if we changed the architectures supported or something, we did not have to
> hunt all over the codebase to update them all.
>
>> +int (*const OPUS_FFT_ALLOC_ARCH[OPUS_ARCHMASK+1])(kiss_fft_state *st)
= {
>
>
> Please follow the naming convention used in arm_celt_map, e.g.,
> <all_caps_function_name>_IMPL instead of _ARCH (on this and all the
other
> tables).
>
>> +void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg,
>> +                                       const kiss_fft_cpx *fin,
>> +                                       kiss_fft_cpx *fout) = {
>
>
> These are mis-aligned.
>
>> +   st->priv = (void *)ne10_fft_alloc_c2c_float32_neon(st->nfft);
>> +   if (st->priv == NULL) {
>> +      printf("Unable to ne10 alloc\n");
>
>
> Absolutely no printfs in library code.
>
>> +      return -1;
>> +   }
>> +   return 0;
>> +}
>> +
>> +void opus_fft_free_arm_float_neon(kiss_fft_state *st)
>> +{
>> +   ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv;
>> +
>> +   if (cfg)
>> +      free((void *)cfg);
>
>
> This concerns me for several reasons:
>
> 1) We never call free() directly in libopus. It is always wrapped in the
> opus_free() macro to allow ports to override it (and as a debugging tool).
>
> 2) We didn't call malloc here, NE10 did, which is in a separate module.
That
> will work in some environments (Linux), but in others (say, Windows Phone),
> memory allocated in one module MUST be freed by the same module, because
> different modules do not share libc state (indeed, they can even be linked
> against different implementations of libc). If this is how the API of NE10
> is designed to work, then that API needs to be fixed.
>
>> +}
>> +#endif
>> +
>> +void opus_fft_float_neon(const kiss_fft_state *st,
>> +                        const kiss_fft_cpx *fin,
>> +                        kiss_fft_cpx *fout)
>
>
> Again, these are mis-aligned.
>
>> +{
>> +   ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv;
>> +   VARDECL(ne10_fft_cpx_float32_t, temp);
>> +   VARDECL(ne10_fft_cpx_float32_t, tempin);
>
>
> Just another note on API design... the _t suffix is reserved by POSIX, and
> should never be used by user code. It's unlikely to cause issues with
these
> long type names, but I have certainly seen it cause issues elsewhere (and
> using the style encourages others to do it).
>
>> +   SAVE_STACK;
>> +   int N2 = st->nfft >> 1;
>> +   float32x4_t inq, outq;
>> +   float32x2_t scale;
>> +   float *in = (float *)fin;
>
>
> You're dropping the const qualifier for no reason. Also, vld1q_f32()
takes a
> const float32_t *, NOT a float *, and they are not compatible on all
> compiler versions.
>
>> +   float *out;
>> +   int i;
>> +   ALLOC(temp, st->nfft, ne10_fft_cpx_float32_t);
>> +   ALLOC(tempin, st->nfft, ne10_fft_cpx_float32_t);
>
>
> This seems like a fairly large increase in peak stack usage (7.5 kB). Is
> there any chance of reducing this?
>
> For example, presumably the first thing ne10_fft_c2c_1d_float32_neon() does
> is bit-reverse the input... if it could be modified to do the scaling at
the
> same time, that would get rid of one buffer. The other option is to make
> opus_fft's callers do it (clt_mdct_forward already does this in the C
> version). The other other option is to modify the API to take a mutable
> input buffer, but combining the scaling with another loop will reduce the
> number of passes (and thus load/stores), which will likely be faster. That
> requires arch-specific code for the call in analysis.c (and the tests),
> though.
>
> clt_mdct_forward() has a buffer f that is the right size and is not used
> during the opus_fft() call. Somehow being able to re-use that buffer would
> get rid of the other temporary buffer here. You'd still need to
allocate one
> for the call in analysis.c, but that's at least near the top of the
stack
> (of course, better yet would be to eliminate the need for this buffer in
> NE10 entirely).
>
> The exact approach here really depends on our ability to modify the NE10
> API, but I'm getting the impression that clt_mdct_forward_float_neon()
> should probably not call opus_fft() at all (but directly access the NE10
> API), just as the C version directly accesses opus_fft_impl().
>
>> +
>> +   out = (float *)tempin;
>
>
> These are pretty confusing names (if you have to keep this scaling here).
> Ideally they'd be related since they refer to the same memory (e.g.,
scaled
> and scaledp or something).
>
> Also, float is _not_ compatible with float32_t (which is what vst1q_f32
> takes) in all compiler versions. Please do not mix and match them.
>
>> +   scale = vld1_dup_f32(&st->scale);
>
>
> Needs a (const float32_t *) cast.
>
>> +   for (i = 0; i < N2; i++) {
>> +      inq = vld1q_f32(in);
>> +      in += 4;
>> +      outq = vmulq_lane_f32(inq, scale, 0);
>> +      vst1q_f32(out, outq);
>> +      out += 4;
>> +   }
>> +
>> +   cfg->buffer = (ne10_fft_cpx_float32_t *)&temp[0];
>
>
> If the struct name is "buffer", probably better to have the
temporary named
> "buffer" too, since it's not used anywhere else.
>
>> +   ne10_fft_c2c_1d_float32_neon((ne10_fft_cpx_float32_t *)fout,
>> +                                 (ne10_fft_cpx_float32_t *)tempin,
>> +                                 cfg, 0);
>
>
> More mis-alignment.
>
>> +   /* N/4 complex FFT, does not downscale anymore */
>> +   opus_fft(st, f2, (kiss_fft_cpx *)f, opus_select_arch());
>
>
> Because you removed the scaling from the above loop, this comment is
> inaccurate.
>
>> diff --git a/celt/arm/fft_arm.h b/celt/arm/fft_arm.h
>> new file mode 100644
>> index 0000000..16f008b
>> --- /dev/null
>> +++ b/celt/arm/fft_arm.h
>> @@ -0,0 +1,65 @@
>> +/* Copyright (c) 2015-2016 Xiph.Org Foundation
>
>
> Are you from the future?
>
>> +ifdef HAVE_ARM_NE10
>> +CC = gcc
>> +CFLAGS += -mfpu=neon
>> +INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_NEON_INTR
>> +LIBDIR = -l:$(NE10_LIBDIR)/libNE10.so
>> +SOURCES += ../arm/celt_neon_intr.c dump_mode_arm_ne10.c
>> +endif
>
>
> It's a bit unfortunate that this depends on having NE10 available,
since
> it's used to generate static files which someone may ultimately build
on a
> completely different system, but I'm not sure how much effort it's
worth to
> try to fix that. Probably not much.
>
>> diff --git a/celt/dump_modes/dump_mode_arm_ne10.c
>> b/celt/dump_modes/dump_mode_arm_ne10.c
>> new file mode 100644
>> index 0000000..30c7423
>> --- /dev/null
>> +++ b/celt/dump_modes/dump_mode_arm_ne10.c
>
>
> "dump_modes_arm_ne10.c"
>
>> +         fprintf(file, "{%f,%f},%c", cfg->twiddles[j].r,
>> cfg->twiddles[j].i,(j+4)%3==0?'\n':' ');
>
>
> Please use the same conversion specification as dump_modes.c for FLOAT,
> e.g., "%#0.8gf".
>
>> +   fprintf(file, "\n#ifdef HAVE_ARM_NE10\n");
>> +   fprintf(file, "#define OVERRIDE_FFT 1\n");
>> +   fprintf(file, "#include \"%s\"\n",
ARM_NE10_ARCH_FILE_NAME);
>> +   fprintf(file, "#endif\n");
>
>
> At least if you do generate the files on a system without NE10, and then
> build on a system with NE10, this will fail, which is probably the right
> thing.
>
>> @@ -205,7 +220,6 @@ void dump_modes(FILE *file, CELTMode **modes, int
>> nb_modes)
>>         fprintf(file, "#endif\n");
>>         fprintf(file, "\n");
>>
>> -
>>         /* Print the actual mode data */
>>         fprintf(file, "static const CELTMode mode%d_%d_%d =
{\n",
>> mode->Fs, mdctSize, mode->overlap);
>>         fprintf(file, INT32 ",    /* Fs */\n", mode->Fs);
>
>
> Irrelevant whitespace change.
>
>>      {
>> +      opus_fft_free_arch((kiss_fft_state *)cfg, opus_select_arch());
>>         opus_free((opus_int16*)cfg->bitrev);
>
>
> Wrong indentation.
>
>> @@ -59,6 +60,7 @@ extern "C" {
>>   #   define kiss_twiddle_scalar float
>>   #   define KF_SUFFIX _celt_single
>>   # endif
>> +
>>   #endif
>>
>>   typedef struct {
>
>
> Irrelevant whitespace change.
>
>> @@ -87,8 +89,13 @@ typedef struct kiss_fft_state{
>>       opus_int16 factors[2*MAXFACTORS];
>>       const opus_int16 *bitrev;
>>       const kiss_twiddle_cpx *twiddles;
>> +    void *priv; /* Used by arch specfic optimizations */
>
>
> Wrong indentation.
>
>>   } kiss_fft_state;
>
>
> Can I get a copy of the
> /* ARM NE10 library does not support below values */
> comment you added below here as well?
>
>> +#ifndef HAVE_ARM_NE10
>>           test1d(36,0);
>>           test1d(36,1);
>>           test1d(50,0);
>>           test1d(50,1);
>> +#endif
>
>
>>   celt/static_modes_fixed.h \
>> +celt/static_modes_float_arm_ne10.h \
>>   celt/arm/armcpu.h \
>
>
> Wrong indentation.
>
>>   celt/arm/fixed_armv4.h \
>>   celt/arm/fixed_armv5e.h \
>>   celt/arm/kiss_fft_armv4.h \
>>   celt/arm/kiss_fft_armv5e.h \
>>   celt/arm/pitch_arm.h \
>> +celt/arm/fft_arm.h \
>> +celt/arm/mdct_arm.h \
>>   celt/x86/pitch_sse.h \
>
>
> Etc.

Timothy B. Terriberry

2015-Jan-30 08:38 UTC

head link

[opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library

Viswanath Puttagunta wrote:> Is the peak stack usage a complete blocker in current form?
Since this only affects people who enable NE10, I don't think this is a 
blocker.

Apparently Analagous Threads

Search for more possibly parallel threads

opus - Jan 2015 - [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library

[opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library

[opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library

[opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library

Apparently Analagous Threads