Timothy B. Terriberry
2015-Jan-29 17:39 UTC
[opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library
Viswanath Puttagunta wrote:> if OPUS_ARM_NEON_INTR > CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \ > - %test_unit_rotation.o %test_unit_mathops.o > -$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS) > + $(CELT_SOURCES_ARM_NE10:.c=.lo) \ > + %test_unit_rotation.o %test_unit_mathops.o \ > + %test_unit_mdct.o %test_unit_dft.oCrazy indentation.> +$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS) $(NE10_CFLAGS) > endif > diff --git a/celt/arm/arm_celt_ne10_fft_map.c b/celt/arm/arm_celt_ne10_fft_map.c > new file mode 100644 > index 0000000..5bb7b5fPlease put these tables (and the ones in arm_celt_ne10_mdct_map) in arm_celt_map.c... the goal was to have them all in the same place, so that if we changed the architectures supported or something, we did not have to hunt all over the codebase to update them all.> +int (*const OPUS_FFT_ALLOC_ARCH[OPUS_ARCHMASK+1])(kiss_fft_state *st) = {Please follow the naming convention used in arm_celt_map, e.g., <all_caps_function_name>_IMPL instead of _ARCH (on this and all the other tables).> +void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg, > + const kiss_fft_cpx *fin, > + kiss_fft_cpx *fout) = {These are mis-aligned.> + st->priv = (void *)ne10_fft_alloc_c2c_float32_neon(st->nfft); > + if (st->priv == NULL) { > + printf("Unable to ne10 alloc\n");Absolutely no printfs in library code.> + return -1; > + } > + return 0; > +} > + > +void opus_fft_free_arm_float_neon(kiss_fft_state *st) > +{ > + ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv; > + > + if (cfg) > + free((void *)cfg);This concerns me for several reasons: 1) We never call free() directly in libopus. It is always wrapped in the opus_free() macro to allow ports to override it (and as a debugging tool). 2) We didn't call malloc here, NE10 did, which is in a separate module. That will work in some environments (Linux), but in others (say, Windows Phone), memory allocated in one module MUST be freed by the same module, because different modules do not share libc state (indeed, they can even be linked against different implementations of libc). If this is how the API of NE10 is designed to work, then that API needs to be fixed.> +} > +#endif > + > +void opus_fft_float_neon(const kiss_fft_state *st, > + const kiss_fft_cpx *fin, > + kiss_fft_cpx *fout)Again, these are mis-aligned.> +{ > + ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv; > + VARDECL(ne10_fft_cpx_float32_t, temp); > + VARDECL(ne10_fft_cpx_float32_t, tempin);Just another note on API design... the _t suffix is reserved by POSIX, and should never be used by user code. It's unlikely to cause issues with these long type names, but I have certainly seen it cause issues elsewhere (and using the style encourages others to do it).> + SAVE_STACK; > + int N2 = st->nfft >> 1; > + float32x4_t inq, outq; > + float32x2_t scale; > + float *in = (float *)fin;You're dropping the const qualifier for no reason. Also, vld1q_f32() takes a const float32_t *, NOT a float *, and they are not compatible on all compiler versions.> + float *out; > + int i; > + ALLOC(temp, st->nfft, ne10_fft_cpx_float32_t); > + ALLOC(tempin, st->nfft, ne10_fft_cpx_float32_t);This seems like a fairly large increase in peak stack usage (7.5 kB). Is there any chance of reducing this? For example, presumably the first thing ne10_fft_c2c_1d_float32_neon() does is bit-reverse the input... if it could be modified to do the scaling at the same time, that would get rid of one buffer. The other option is to make opus_fft's callers do it (clt_mdct_forward already does this in the C version). The other other option is to modify the API to take a mutable input buffer, but combining the scaling with another loop will reduce the number of passes (and thus load/stores), which will likely be faster. That requires arch-specific code for the call in analysis.c (and the tests), though. clt_mdct_forward() has a buffer f that is the right size and is not used during the opus_fft() call. Somehow being able to re-use that buffer would get rid of the other temporary buffer here. You'd still need to allocate one for the call in analysis.c, but that's at least near the top of the stack (of course, better yet would be to eliminate the need for this buffer in NE10 entirely). The exact approach here really depends on our ability to modify the NE10 API, but I'm getting the impression that clt_mdct_forward_float_neon() should probably not call opus_fft() at all (but directly access the NE10 API), just as the C version directly accesses opus_fft_impl().> + > + out = (float *)tempin;These are pretty confusing names (if you have to keep this scaling here). Ideally they'd be related since they refer to the same memory (e.g., scaled and scaledp or something). Also, float is _not_ compatible with float32_t (which is what vst1q_f32 takes) in all compiler versions. Please do not mix and match them.> + scale = vld1_dup_f32(&st->scale);Needs a (const float32_t *) cast.> + for (i = 0; i < N2; i++) { > + inq = vld1q_f32(in); > + in += 4; > + outq = vmulq_lane_f32(inq, scale, 0); > + vst1q_f32(out, outq); > + out += 4; > + } > + > + cfg->buffer = (ne10_fft_cpx_float32_t *)&temp[0];If the struct name is "buffer", probably better to have the temporary named "buffer" too, since it's not used anywhere else.> + ne10_fft_c2c_1d_float32_neon((ne10_fft_cpx_float32_t *)fout, > + (ne10_fft_cpx_float32_t *)tempin, > + cfg, 0);More mis-alignment.> + /* N/4 complex FFT, does not downscale anymore */ > + opus_fft(st, f2, (kiss_fft_cpx *)f, opus_select_arch());Because you removed the scaling from the above loop, this comment is inaccurate.> diff --git a/celt/arm/fft_arm.h b/celt/arm/fft_arm.h > new file mode 100644 > index 0000000..16f008b > --- /dev/null > +++ b/celt/arm/fft_arm.h > @@ -0,0 +1,65 @@ > +/* Copyright (c) 2015-2016 Xiph.Org FoundationAre you from the future?> +ifdef HAVE_ARM_NE10 > +CC = gcc > +CFLAGS += -mfpu=neon > +INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_NEON_INTR > +LIBDIR = -l:$(NE10_LIBDIR)/libNE10.so > +SOURCES += ../arm/celt_neon_intr.c dump_mode_arm_ne10.c > +endifIt's a bit unfortunate that this depends on having NE10 available, since it's used to generate static files which someone may ultimately build on a completely different system, but I'm not sure how much effort it's worth to try to fix that. Probably not much.> diff --git a/celt/dump_modes/dump_mode_arm_ne10.c b/celt/dump_modes/dump_mode_arm_ne10.c > new file mode 100644 > index 0000000..30c7423 > --- /dev/null > +++ b/celt/dump_modes/dump_mode_arm_ne10.c"dump_modes_arm_ne10.c"> + fprintf(file, "{%f,%f},%c", cfg->twiddles[j].r, cfg->twiddles[j].i,(j+4)%3==0?'\n':' ');Please use the same conversion specification as dump_modes.c for FLOAT, e.g., "%#0.8gf".> + fprintf(file, "\n#ifdef HAVE_ARM_NE10\n"); > + fprintf(file, "#define OVERRIDE_FFT 1\n"); > + fprintf(file, "#include \"%s\"\n", ARM_NE10_ARCH_FILE_NAME); > + fprintf(file, "#endif\n");At least if you do generate the files on a system without NE10, and then build on a system with NE10, this will fail, which is probably the right thing.> @@ -205,7 +220,6 @@ void dump_modes(FILE *file, CELTMode **modes, int nb_modes) > fprintf(file, "#endif\n"); > fprintf(file, "\n"); > > - > /* Print the actual mode data */ > fprintf(file, "static const CELTMode mode%d_%d_%d = {\n", mode->Fs, mdctSize, mode->overlap); > fprintf(file, INT32 ", /* Fs */\n", mode->Fs);Irrelevant whitespace change.> { > + opus_fft_free_arch((kiss_fft_state *)cfg, opus_select_arch()); > opus_free((opus_int16*)cfg->bitrev);Wrong indentation.> @@ -59,6 +60,7 @@ extern "C" { > # define kiss_twiddle_scalar float > # define KF_SUFFIX _celt_single > # endif > + > #endif > > typedef struct {Irrelevant whitespace change.> @@ -87,8 +89,13 @@ typedef struct kiss_fft_state{ > opus_int16 factors[2*MAXFACTORS]; > const opus_int16 *bitrev; > const kiss_twiddle_cpx *twiddles; > + void *priv; /* Used by arch specfic optimizations */Wrong indentation.> } kiss_fft_state;Can I get a copy of the /* ARM NE10 library does not support below values */ comment you added below here as well?> +#ifndef HAVE_ARM_NE10 > test1d(36,0); > test1d(36,1); > test1d(50,0); > test1d(50,1); > +#endif> celt/static_modes_fixed.h \ > +celt/static_modes_float_arm_ne10.h \ > celt/arm/armcpu.h \Wrong indentation.> celt/arm/fixed_armv4.h \ > celt/arm/fixed_armv5e.h \ > celt/arm/kiss_fft_armv4.h \ > celt/arm/kiss_fft_armv5e.h \ > celt/arm/pitch_arm.h \ > +celt/arm/fft_arm.h \ > +celt/arm/mdct_arm.h \ > celt/x86/pitch_sse.h \Etc.
Viswanath Puttagunta
2015-Jan-29 23:13 UTC
[opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library
Hi Timothy, Appreciate the comprehensive code review. The biggest issue I see is the peak stack usage.... rest looks like fairly straight forward cleanup. Is the peak stack usage a complete blocker in current form? If it is indeed a blocker, would it be acceptable if we can reduce additional buffer requirement from 2 buffers (current) to 1, possibly by moving scaling inside ne10_fft_c2c_1d_float32_neon? (I will let Phil comment on if there is any reason not to do this.. and suggest any alternatives if necessary) I will work on fixing up code from opus side. Phil is one of developers/maintainers of opus. Please correct me if any of information below is inaccurate from a libNE10 change request perspective. Phil, Below are 2 main things we need to change from NE10 Library side. Can you please make these changes on NE10 side? 1. Remove usage of _t in ne10_fft_cfg_float32_t and ne10_fft_cpx_float32_t -------from comment---- ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv; + VARDECL(ne10_fft_cpx_float32_t, temp); + VARDECL(ne10_fft_cpx_float32_t, tempin); Just another note on API design... the _t suffix is reserved by POSIX, and should never be used by user code. It's unlikely to cause issues with these long type names, but I have certainly seen it cause issues elsewhere (and using the style encourages others to do it). ------- It should be fairly trivial to fix by adding something like this into NE10 source with something like typedef ne10_fft_cpx_float32 ne10_fft_cpx_float32_t typedef ne10_fft_cpx_float32 ne10_fft_cpx_float32_t Then from opus side, I can instead use ne10_fft_cfg_float32 cfg = (ne10_fft_cfg_float32)st->priv; VARDECL(ne10_fft_cpx_float32, temp); VARDECL(ne10_fft_cpx_float32, tempin); 2. opus_fft_float_neon: Remove at least one additional buffer that is currently needed (on stack) - Currently, we have to use 2 additional buffers on stack. - One for scaling (ALLOC(tempin, st->nfft, ne10_fft_cpx_float32_t);) - Another as temp buffer (cfg->buffer) - Is there any reason the scaling cannot be inside ne10_fft_c2c_1d_float32_neon and just re-use cfg->buffer for scaling as well? - This is probably the easiest way to get rid of one additional buffer. Note: Timothy, I remember Phil telling me that NE10 fft algorithm does not use bit reversal. (Phil, correct if I'm mistaken here) Regards, Vish On 29 January 2015 at 11:39, Timothy B. Terriberry <tterribe at xiph.org> wrote:> Viswanath Puttagunta wrote: >> >> if OPUS_ARM_NEON_INTR >> CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \ >> - %test_unit_rotation.o %test_unit_mathops.o >> -$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS) >> + $(CELT_SOURCES_ARM_NE10:.c=.lo) \ >> + %test_unit_rotation.o >> %test_unit_mathops.o \ >> + %test_unit_mdct.o %test_unit_dft.o > > > Crazy indentation. > >> +$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS) >> $(NE10_CFLAGS) >> endif >> diff --git a/celt/arm/arm_celt_ne10_fft_map.c >> b/celt/arm/arm_celt_ne10_fft_map.c >> new file mode 100644 >> index 0000000..5bb7b5f > > > Please put these tables (and the ones in arm_celt_ne10_mdct_map) in > arm_celt_map.c... the goal was to have them all in the same place, so that > if we changed the architectures supported or something, we did not have to > hunt all over the codebase to update them all. > >> +int (*const OPUS_FFT_ALLOC_ARCH[OPUS_ARCHMASK+1])(kiss_fft_state *st) = { > > > Please follow the naming convention used in arm_celt_map, e.g., > <all_caps_function_name>_IMPL instead of _ARCH (on this and all the other > tables). > >> +void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg, >> + const kiss_fft_cpx *fin, >> + kiss_fft_cpx *fout) = { > > > These are mis-aligned. > >> + st->priv = (void *)ne10_fft_alloc_c2c_float32_neon(st->nfft); >> + if (st->priv == NULL) { >> + printf("Unable to ne10 alloc\n"); > > > Absolutely no printfs in library code. > >> + return -1; >> + } >> + return 0; >> +} >> + >> +void opus_fft_free_arm_float_neon(kiss_fft_state *st) >> +{ >> + ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv; >> + >> + if (cfg) >> + free((void *)cfg); > > > This concerns me for several reasons: > > 1) We never call free() directly in libopus. It is always wrapped in the > opus_free() macro to allow ports to override it (and as a debugging tool). > > 2) We didn't call malloc here, NE10 did, which is in a separate module. That > will work in some environments (Linux), but in others (say, Windows Phone), > memory allocated in one module MUST be freed by the same module, because > different modules do not share libc state (indeed, they can even be linked > against different implementations of libc). If this is how the API of NE10 > is designed to work, then that API needs to be fixed. > >> +} >> +#endif >> + >> +void opus_fft_float_neon(const kiss_fft_state *st, >> + const kiss_fft_cpx *fin, >> + kiss_fft_cpx *fout) > > > Again, these are mis-aligned. > >> +{ >> + ne10_fft_cfg_float32_t cfg = (ne10_fft_cfg_float32_t)st->priv; >> + VARDECL(ne10_fft_cpx_float32_t, temp); >> + VARDECL(ne10_fft_cpx_float32_t, tempin); > > > Just another note on API design... the _t suffix is reserved by POSIX, and > should never be used by user code. It's unlikely to cause issues with these > long type names, but I have certainly seen it cause issues elsewhere (and > using the style encourages others to do it). > >> + SAVE_STACK; >> + int N2 = st->nfft >> 1; >> + float32x4_t inq, outq; >> + float32x2_t scale; >> + float *in = (float *)fin; > > > You're dropping the const qualifier for no reason. Also, vld1q_f32() takes a > const float32_t *, NOT a float *, and they are not compatible on all > compiler versions. > >> + float *out; >> + int i; >> + ALLOC(temp, st->nfft, ne10_fft_cpx_float32_t); >> + ALLOC(tempin, st->nfft, ne10_fft_cpx_float32_t); > > > This seems like a fairly large increase in peak stack usage (7.5 kB). Is > there any chance of reducing this? > > For example, presumably the first thing ne10_fft_c2c_1d_float32_neon() does > is bit-reverse the input... if it could be modified to do the scaling at the > same time, that would get rid of one buffer. The other option is to make > opus_fft's callers do it (clt_mdct_forward already does this in the C > version). The other other option is to modify the API to take a mutable > input buffer, but combining the scaling with another loop will reduce the > number of passes (and thus load/stores), which will likely be faster. That > requires arch-specific code for the call in analysis.c (and the tests), > though. > > clt_mdct_forward() has a buffer f that is the right size and is not used > during the opus_fft() call. Somehow being able to re-use that buffer would > get rid of the other temporary buffer here. You'd still need to allocate one > for the call in analysis.c, but that's at least near the top of the stack > (of course, better yet would be to eliminate the need for this buffer in > NE10 entirely). > > The exact approach here really depends on our ability to modify the NE10 > API, but I'm getting the impression that clt_mdct_forward_float_neon() > should probably not call opus_fft() at all (but directly access the NE10 > API), just as the C version directly accesses opus_fft_impl(). > >> + >> + out = (float *)tempin; > > > These are pretty confusing names (if you have to keep this scaling here). > Ideally they'd be related since they refer to the same memory (e.g., scaled > and scaledp or something). > > Also, float is _not_ compatible with float32_t (which is what vst1q_f32 > takes) in all compiler versions. Please do not mix and match them. > >> + scale = vld1_dup_f32(&st->scale); > > > Needs a (const float32_t *) cast. > >> + for (i = 0; i < N2; i++) { >> + inq = vld1q_f32(in); >> + in += 4; >> + outq = vmulq_lane_f32(inq, scale, 0); >> + vst1q_f32(out, outq); >> + out += 4; >> + } >> + >> + cfg->buffer = (ne10_fft_cpx_float32_t *)&temp[0]; > > > If the struct name is "buffer", probably better to have the temporary named > "buffer" too, since it's not used anywhere else. > >> + ne10_fft_c2c_1d_float32_neon((ne10_fft_cpx_float32_t *)fout, >> + (ne10_fft_cpx_float32_t *)tempin, >> + cfg, 0); > > > More mis-alignment. > >> + /* N/4 complex FFT, does not downscale anymore */ >> + opus_fft(st, f2, (kiss_fft_cpx *)f, opus_select_arch()); > > > Because you removed the scaling from the above loop, this comment is > inaccurate. > >> diff --git a/celt/arm/fft_arm.h b/celt/arm/fft_arm.h >> new file mode 100644 >> index 0000000..16f008b >> --- /dev/null >> +++ b/celt/arm/fft_arm.h >> @@ -0,0 +1,65 @@ >> +/* Copyright (c) 2015-2016 Xiph.Org Foundation > > > Are you from the future? > >> +ifdef HAVE_ARM_NE10 >> +CC = gcc >> +CFLAGS += -mfpu=neon >> +INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_NEON_INTR >> +LIBDIR = -l:$(NE10_LIBDIR)/libNE10.so >> +SOURCES += ../arm/celt_neon_intr.c dump_mode_arm_ne10.c >> +endif > > > It's a bit unfortunate that this depends on having NE10 available, since > it's used to generate static files which someone may ultimately build on a > completely different system, but I'm not sure how much effort it's worth to > try to fix that. Probably not much. > >> diff --git a/celt/dump_modes/dump_mode_arm_ne10.c >> b/celt/dump_modes/dump_mode_arm_ne10.c >> new file mode 100644 >> index 0000000..30c7423 >> --- /dev/null >> +++ b/celt/dump_modes/dump_mode_arm_ne10.c > > > "dump_modes_arm_ne10.c" > >> + fprintf(file, "{%f,%f},%c", cfg->twiddles[j].r, >> cfg->twiddles[j].i,(j+4)%3==0?'\n':' '); > > > Please use the same conversion specification as dump_modes.c for FLOAT, > e.g., "%#0.8gf". > >> + fprintf(file, "\n#ifdef HAVE_ARM_NE10\n"); >> + fprintf(file, "#define OVERRIDE_FFT 1\n"); >> + fprintf(file, "#include \"%s\"\n", ARM_NE10_ARCH_FILE_NAME); >> + fprintf(file, "#endif\n"); > > > At least if you do generate the files on a system without NE10, and then > build on a system with NE10, this will fail, which is probably the right > thing. > >> @@ -205,7 +220,6 @@ void dump_modes(FILE *file, CELTMode **modes, int >> nb_modes) >> fprintf(file, "#endif\n"); >> fprintf(file, "\n"); >> >> - >> /* Print the actual mode data */ >> fprintf(file, "static const CELTMode mode%d_%d_%d = {\n", >> mode->Fs, mdctSize, mode->overlap); >> fprintf(file, INT32 ", /* Fs */\n", mode->Fs); > > > Irrelevant whitespace change. > >> { >> + opus_fft_free_arch((kiss_fft_state *)cfg, opus_select_arch()); >> opus_free((opus_int16*)cfg->bitrev); > > > Wrong indentation. > >> @@ -59,6 +60,7 @@ extern "C" { >> # define kiss_twiddle_scalar float >> # define KF_SUFFIX _celt_single >> # endif >> + >> #endif >> >> typedef struct { > > > Irrelevant whitespace change. > >> @@ -87,8 +89,13 @@ typedef struct kiss_fft_state{ >> opus_int16 factors[2*MAXFACTORS]; >> const opus_int16 *bitrev; >> const kiss_twiddle_cpx *twiddles; >> + void *priv; /* Used by arch specfic optimizations */ > > > Wrong indentation. > >> } kiss_fft_state; > > > Can I get a copy of the > /* ARM NE10 library does not support below values */ > comment you added below here as well? > >> +#ifndef HAVE_ARM_NE10 >> test1d(36,0); >> test1d(36,1); >> test1d(50,0); >> test1d(50,1); >> +#endif > > >> celt/static_modes_fixed.h \ >> +celt/static_modes_float_arm_ne10.h \ >> celt/arm/armcpu.h \ > > > Wrong indentation. > >> celt/arm/fixed_armv4.h \ >> celt/arm/fixed_armv5e.h \ >> celt/arm/kiss_fft_armv4.h \ >> celt/arm/kiss_fft_armv5e.h \ >> celt/arm/pitch_arm.h \ >> +celt/arm/fft_arm.h \ >> +celt/arm/mdct_arm.h \ >> celt/x86/pitch_sse.h \ > > > Etc.
Timothy B. Terriberry
2015-Jan-30 08:38 UTC
[opus] [RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library
Viswanath Puttagunta wrote:> Is the peak stack usage a complete blocker in current form?Since this only affects people who enable NE10, I don't think this is a blocker.