Viswanath Puttagunta
2015-Mar-31 22:57 UTC
[opus] [RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series
Hi Timothy, As I mentioned earlier [1], I now fixed compile issues with fixed point and resubmitting the patch. I also have new patch that does intrinsics optimizations for celt_pitch_xcorr targetting aarch64. You can find my latest work-in-progress branch at [2] For reference, you can use the Ne10 pre-built libraries at [3] Note that I am working with Phil at ARM to get my patch at [4] upstreamed to Ne10. [1]: http://lists.xiph.org/pipermail/opus/2015-March/002941.html [2]: https://git.linaro.org/people/viswanath.puttagunta/opus.git Branch: rfcv1_final_xcorr_fixed_armv8 [3]: http://people.linaro.org/~viswanath.puttagunta/opus/NE10_root/ [4]: git://git.linaro.org/people/viswanath.puttagunta/Ne10.git Branch: rfcv1_rc1_armv8 Jonathan Lennox (1): Intrinsics/RTCD related fixes. Mostly x86 Viswanath Puttagunta (4): armv7(float): Optimize encode usecase using NE10 library armv7(float): Optimize decode usecase using NE10 library aarch64: Enable intrinsics for aarch64 aarch64: celt_pitch_xcorr: Fixed point intrinsics Makefile.am | 72 ++++-- celt/arm/arm_celt_map.c | 71 +++++- celt/arm/armcpu.c | 6 +- celt/arm/celt_ne10_fft.c | 148 +++++++++++ celt/arm/celt_ne10_mdct.c | 263 ++++++++++++++++++++ celt/arm/celt_neon_intr.c | 275 +++++++++++++++++++++ celt/arm/fft_arm.h | 74 ++++++ celt/arm/mdct_arm.h | 60 +++++ celt/arm/pitch_arm.h | 14 +- celt/bands.c | 6 +- celt/celt.c | 16 +- celt/celt.h | 12 +- celt/celt_decoder.c | 24 +- celt/celt_encoder.c | 20 +- celt/celt_lpc.h | 2 +- celt/cpu_support.h | 15 +- celt/dump_modes/Makefile | 23 +- celt/dump_modes/dump_modes.c | 21 ++ celt/dump_modes/dump_modes_arch.h | 41 ++++ celt/dump_modes/dump_modes_arm_ne10.c | 125 ++++++++++ celt/kiss_fft.c | 31 ++- celt/kiss_fft.h | 69 +++++- celt/mdct.c | 20 +- celt/mdct.h | 61 ++++- celt/mips/celt_mipsr1.h | 2 +- celt/modes.c | 8 +- celt/pitch.c | 4 +- celt/pitch.h | 22 +- celt/static_modes_float.h | 25 ++ celt/static_modes_float_arm_ne10.h | 404 +++++++++++++++++++++++++++++++ celt/tests/test_unit_dft.c | 56 +++-- celt/tests/test_unit_mathops.c | 22 +- celt/tests/test_unit_mdct.c | 88 ++++--- celt/tests/test_unit_rotation.c | 22 +- celt/x86/celt_lpc_sse.c | 4 + celt/x86/celt_lpc_sse.h | 12 +- celt/x86/pitch_sse.c | 334 ++++++++++--------------- celt/x86/pitch_sse.h | 256 ++++++++------------ celt/x86/pitch_sse2.c | 95 ++++++++ celt/x86/pitch_sse4_1.c | 195 +++++++++++++++ celt/x86/x86_celt_map.c | 76 +++++- celt/x86/x86cpu.c | 47 +++- celt/x86/x86cpu.h | 26 +- celt_headers.mk | 3 + celt_sources.mk | 9 +- configure.ac | 391 +++++++++++++++++++++--------- m4/opus-intrinsics.m4 | 29 +++ silk/x86/SigProc_FIX_sse.h | 17 ++ silk/x86/main_sse.h | 48 ++++ silk/x86/x86_silk_map.c | 25 +- src/analysis.c | 8 +- src/analysis.h | 2 +- src/opus_encoder.c | 2 +- src/opus_multistream_encoder.c | 9 +- win32/VS2010/celt.vcxproj | 17 +- win32/VS2010/celt.vcxproj.filters | 27 +++ win32/VS2010/silk_common.vcxproj | 17 +- win32/VS2010/silk_common.vcxproj.filters | 23 +- win32/VS2010/silk_fixed.vcxproj | 13 +- win32/VS2010/silk_fixed.vcxproj.filters | 17 +- win32/config.h | 25 +- 61 files changed, 3150 insertions(+), 699 deletions(-) create mode 100644 celt/arm/celt_ne10_fft.c create mode 100644 celt/arm/celt_ne10_mdct.c create mode 100644 celt/arm/fft_arm.h create mode 100644 celt/arm/mdct_arm.h create mode 100644 celt/dump_modes/dump_modes_arch.h create mode 100644 celt/dump_modes/dump_modes_arm_ne10.c create mode 100644 celt/static_modes_float_arm_ne10.h create mode 100644 celt/x86/pitch_sse2.c create mode 100644 celt/x86/pitch_sse4_1.c create mode 100644 m4/opus-intrinsics.m4 -- 1.9.1
Viswanath Puttagunta
2015-Mar-31 22:57 UTC
[opus] [RFC PATCH v1 1/5] armv7(float): Optimize encode usecase using NE10 library
Optimize opus encode (float only) usecase using ARM NE10 library. Mainly effects opus_fft and ctl_mdct_forward and related functions. This optimization can be used for ARM CPUs that have NEON VFP unit. This patch only enables optimizations for ARMv7. Official ARM NE10 library page available at http://projectne10.github.io/Ne10/ To enable this optimization, use --enable-intrinsics --with-NE10=<install_prefix> or --enable-intrinsics --with-NE10-libraries=<NE10_lib_dir> --with-NE10-includes=<NE10_includes_dir> Compile time checks made during configure process to make sure optimization option available only when compiler supports NEON instrinsics. Runtime checks made to make sure optimized functions only called on appropriate hardware. --- Makefile.am | 34 +-- celt/arm/arm_celt_map.c | 47 +++- celt/arm/celt_ne10_fft.c | 120 ++++++++++ celt/arm/celt_ne10_mdct.c | 158 +++++++++++++ celt/arm/celt_neon_intr.c | 7 + celt/arm/fft_arm.h | 66 ++++++ celt/arm/mdct_arm.h | 53 +++++ celt/celt_encoder.c | 13 +- celt/dump_modes/Makefile | 23 +- celt/dump_modes/dump_modes.c | 21 ++ celt/dump_modes/dump_modes_arch.h | 41 ++++ celt/dump_modes/dump_modes_arm_ne10.c | 125 +++++++++++ celt/kiss_fft.c | 27 ++- celt/kiss_fft.h | 56 ++++- celt/mdct.c | 15 +- celt/mdct.h | 39 +++- celt/modes.c | 8 +- celt/static_modes_float.h | 25 +++ celt/static_modes_float_arm_ne10.h | 404 ++++++++++++++++++++++++++++++++++ celt/tests/test_unit_dft.c | 53 +++-- celt/tests/test_unit_mathops.c | 6 + celt/tests/test_unit_mdct.c | 84 ++++--- celt/tests/test_unit_rotation.c | 6 + celt_headers.mk | 3 + celt_sources.mk | 4 + configure.ac | 81 +++++++ src/analysis.c | 8 +- src/analysis.h | 2 +- src/opus_encoder.c | 2 +- src/opus_multistream_encoder.c | 9 +- 30 files changed, 1435 insertions(+), 105 deletions(-) create mode 100644 celt/arm/celt_ne10_fft.c create mode 100644 celt/arm/celt_ne10_mdct.c create mode 100644 celt/arm/fft_arm.h create mode 100644 celt/arm/mdct_arm.h create mode 100644 celt/dump_modes/dump_modes_arch.h create mode 100644 celt/dump_modes/dump_modes_arm_ne10.c create mode 100644 celt/static_modes_float_arm_ne10.h diff --git a/Makefile.am b/Makefile.am index 2a1ddc8..c5c1562 100644 --- a/Makefile.am +++ b/Makefile.am @@ -10,7 +10,7 @@ lib_LTLIBRARIES = libopus.la DIST_SUBDIRS = doc AM_CPPFLAGS = -I$(top_srcdir)/include -I$(top_srcdir)/celt -I$(top_srcdir)/silk \ - -I$(top_srcdir)/silk/float -I$(top_srcdir)/silk/fixed + -I$(top_srcdir)/silk/float -I$(top_srcdir)/silk/fixed $(NE10_CFLAGS) include celt_sources.mk include silk_sources.mk @@ -47,6 +47,10 @@ CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR) OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon endif +if HAVE_ARM_NE10 +CELT_SOURCES += $(CELT_SOURCES_ARM_NE10) +endif + if OPUS_ARM_EXTERNAL_ASM nodist_libopus_la_SOURCES = $(CELT_SOURCES_ARM_ASM:.s=-gnu.S) BUILT_SOURCES = $(CELT_SOURCES_ARM_ASM:.s=-gnu.S) \ @@ -64,7 +68,7 @@ include opus_headers.mk libopus_la_SOURCES = $(CELT_SOURCES) $(SILK_SOURCES) $(OPUS_SOURCES) libopus_la_LDFLAGS = -no-undefined -version-info @OPUS_LT_CURRENT@:@OPUS_LT_REVISION@:@OPUS_LT_AGE@ -libopus_la_LIBADD = $(LIBM) +libopus_la_LIBADD = $(NE10_LIBS) $(LIBM) pkginclude_HEADERS = include/opus.h include/opus_multistream.h include/opus_types.h include/opus_defines.h @@ -77,32 +81,32 @@ TESTS = celt/tests/test_unit_types celt/tests/test_unit_mathops celt/tests/test_ opus_demo_SOURCES = src/opus_demo.c -opus_demo_LDADD = libopus.la $(LIBM) +opus_demo_LDADD = libopus.la $(NE10_LIBS) $(LIBM) repacketizer_demo_SOURCES = src/repacketizer_demo.c -repacketizer_demo_LDADD = libopus.la $(LIBM) +repacketizer_demo_LDADD = libopus.la $(NE10_LIBS) $(LIBM) opus_compare_SOURCES = src/opus_compare.c opus_compare_LDADD = $(LIBM) tests_test_opus_api_SOURCES = tests/test_opus_api.c tests/test_opus_common.h -tests_test_opus_api_LDADD = libopus.la $(LIBM) +tests_test_opus_api_LDADD = libopus.la $(NE10_LIBS) $(LIBM) tests_test_opus_encode_SOURCES = tests/test_opus_encode.c tests/test_opus_common.h -tests_test_opus_encode_LDADD = libopus.la $(LIBM) +tests_test_opus_encode_LDADD = libopus.la $(NE10_LIBS) $(LIBM) tests_test_opus_decode_SOURCES = tests/test_opus_decode.c tests/test_opus_common.h -tests_test_opus_decode_LDADD = libopus.la $(LIBM) +tests_test_opus_decode_LDADD = libopus.la $(NE10_LIBS) $(LIBM) tests_test_opus_padding_SOURCES = tests/test_opus_padding.c tests/test_opus_common.h -tests_test_opus_padding_LDADD = libopus.la $(LIBM) +tests_test_opus_padding_LDADD = libopus.la $(NE10_LIBS) $(LIBM) celt_tests_test_unit_cwrs32_SOURCES = celt/tests/test_unit_cwrs32.c celt_tests_test_unit_cwrs32_LDADD = $(LIBM) celt_tests_test_unit_dft_SOURCES = celt/tests/test_unit_dft.c -celt_tests_test_unit_dft_LDADD = $(LIBM) +celt_tests_test_unit_dft_LDADD = $(NE10_LIBS) $(LIBM) celt_tests_test_unit_entropy_SOURCES = celt/tests/test_unit_entropy.c celt_tests_test_unit_entropy_LDADD = $(LIBM) @@ -111,7 +115,7 @@ celt_tests_test_unit_laplace_SOURCES = celt/tests/test_unit_laplace.c celt_tests_test_unit_laplace_LDADD = $(LIBM) celt_tests_test_unit_mathops_SOURCES = celt/tests/test_unit_mathops.c -celt_tests_test_unit_mathops_LDADD = $(LIBM) +celt_tests_test_unit_mathops_LDADD = $(NE10_LIBS) $(LIBM) if CPU_ARM if OPUS_ARM_EXTERNAL_ASM celt_tests_test_unit_mathops_LDADD += libopus.la @@ -119,10 +123,10 @@ endif endif celt_tests_test_unit_mdct_SOURCES = celt/tests/test_unit_mdct.c -celt_tests_test_unit_mdct_LDADD = $(LIBM) +celt_tests_test_unit_mdct_LDADD = $(NE10_LIBS) $(LIBM) celt_tests_test_unit_rotation_SOURCES = celt/tests/test_unit_rotation.c -celt_tests_test_unit_rotation_LDADD = $(LIBM) +celt_tests_test_unit_rotation_LDADD = $(NE10_LIBS) $(LIBM) if CPU_ARM if OPUS_ARM_EXTERNAL_ASM celt_tests_test_unit_rotation_LDADD += libopus.la @@ -270,6 +274,8 @@ endif if OPUS_ARM_NEON_INTR CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \ - %test_unit_rotation.o %test_unit_mathops.o -$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS) + $(CELT_SOURCES_ARM_NE10:.c=.lo) \ + %test_unit_rotation.o %test_unit_mathops.o \ + %test_unit_mdct.o %test_unit_dft.o +$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS) $(NE10_CFLAGS) endif diff --git a/celt/arm/arm_celt_map.c b/celt/arm/arm_celt_map.c index 68c224d..3b49f90 100644 --- a/celt/arm/arm_celt_map.c +++ b/celt/arm/arm_celt_map.c @@ -30,6 +30,8 @@ #endif #include "pitch.h" +#include "kiss_fft.h" +#include "mdct.h" #if defined(OPUS_HAVE_RTCD) @@ -50,7 +52,46 @@ void (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *, celt_pitch_xcorr_c, /* Media */ celt_pitch_xcorr_float_neon /* Neon */ }; -# endif -# endif -#endif +#if defined(HAVE_ARM_NE10) +#ifdef CUSTOM_MODES +int (*const OPUS_FFT_ALLOC_ARCH_IMPL[OPUS_ARCHMASK+1])(kiss_fft_state *st) = { + opus_fft_alloc_arch_c, /* ARMv4 */ + opus_fft_alloc_arch_c, /* EDSP */ + opus_fft_alloc_arch_c, /* Media */ + opus_fft_alloc_arm_float_neon /* Neon with NE10 library support */ +}; + +void (*const OPUS_FFT_FREE_ARCH_IMPL[OPUS_ARCHMASK+1])(kiss_fft_state *st) = { + opus_fft_free_arch_c, /* ARMv4 */ + opus_fft_free_arch_c, /* EDSP */ + opus_fft_free_arch_c, /* Media */ + opus_fft_free_arm_float_neon /* Neon with NE10 */ +}; +#endif /* CUSTOM_MODES */ + +void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg, + const kiss_fft_cpx *fin, + kiss_fft_cpx *fout) = { + opus_fft_c, /* ARMv4 */ + opus_fft_c, /* EDSP */ + opus_fft_c, /* Media */ + opus_fft_float_neon /* Neon with NE10 */ +}; + +void (*const CLT_MDCT_FORWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l, + kiss_fft_scalar *in, + kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 *window, + int overlap, int shift, + int stride, int arch) = { + clt_mdct_forward_c, /* ARMv4 */ + clt_mdct_forward_c, /* EDSP */ + clt_mdct_forward_c, /* Media */ + clt_mdct_forward_float_neon /* Neon with NE10 */ +}; +#endif /* HAVE_ARM_NE10 */ +# endif /* OPUS_ARM_NEON_INTR */ +# endif /* FIXED_POINT */ + +#endif /* OPUS_HAVE_RTCD */ diff --git a/celt/arm/celt_ne10_fft.c b/celt/arm/celt_ne10_fft.c new file mode 100644 index 0000000..b592f19 --- /dev/null +++ b/celt/arm/celt_ne10_fft.c @@ -0,0 +1,120 @@ +/* Copyright (c) 2015 Xiph.Org Foundation + Written by Viswanath Puttagunta */ +/** + @file celt_ne10_fft.c + @brief ARM Neon optimizations for fft using NE10 library + */ + +/* + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions + are met: + + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER + OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING + NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + +#ifndef SKIP_CONFIG_H +#ifdef HAVE_CONFIG_H +#include "config.h" +#endif +#endif + +#include <arm_neon.h> +#include <NE10_init.h> +#include <NE10_dsp.h> +#include "../kiss_fft.h" +#include "stack_alloc.h" +#include "os_support.h" +#include "stack_alloc.h" + +#ifdef CUSTOM_MODES + +/* nfft lengths in NE10 that support scaled fft */ +#define NE10_FFTSCALED_SUPPORT_MAX 4 +static const int ne10_fft_scaled_support[NE10_FFTSCALED_SUPPORT_MAX] = { + 480, 240, 120, 60 +}; + +int opus_fft_alloc_arm_float_neon(kiss_fft_state *st) +{ + int i; + size_t memneeded = sizeof(struct arch_fft_state); + + st->arch_fft = (arch_fft_state *)opus_alloc(memneeded); + if (!st->arch_fft) + return -1; + + for (i = 0; i < NE10_FFTSCALED_SUPPORT_MAX; i++) { + if(st->nfft == ne10_fft_scaled_support[i]) + break; + } + if (i == NE10_FFTSCALED_SUPPORT_MAX) { + /* This nfft length (scaled fft) is not supported in NE10 */ + st->arch_fft->is_supported = 0; + st->arch_fft->priv = NULL; + } + else { + st->arch_fft->is_supported = 1; + st->arch_fft->priv = (void *)ne10_fft_alloc_c2c_float32_neon(st->nfft); + if (st->arch_fft->priv == NULL) { + return -1; + } + } + return 0; +} + +void opus_fft_free_arm_float_neon(kiss_fft_state *st) +{ + ne10_fft_cfg_float32_t cfg; + + if (!st->arch_fft) + return; + + cfg = (ne10_fft_cfg_float32_t)st->arch_fft->priv; + if (cfg) + ne10_fft_destroy_c2c_float32(cfg); + opus_free(st->arch_fft); +} +#endif +void opus_fft_float_neon(const kiss_fft_state *st, + const kiss_fft_cpx *fin, + kiss_fft_cpx *fout) +{ + ne10_fft_state_float32_t state; + ne10_fft_cfg_float32_t cfg = &state; + VARDECL(ne10_fft_cpx_float32_t, buffer); + SAVE_STACK; + ALLOC(buffer, st->nfft, ne10_fft_cpx_float32_t); + + if (!st->arch_fft->is_supported) { + /* This nfft length (scaled fft) not supported in NE10 */ + opus_fft_c(st, fin, fout); + } + else { + memcpy((void *)cfg, st->arch_fft->priv, sizeof(ne10_fft_state_float32_t)); + state.buffer = (ne10_fft_cpx_float32_t *)&buffer[0]; + state.is_forward_scaled = 1; + + ne10_fft_c2c_1d_float32_neon((ne10_fft_cpx_float32_t *)fout, + (ne10_fft_cpx_float32_t *)fin, + cfg, 0); + } + RESTORE_STACK; +} diff --git a/celt/arm/celt_ne10_mdct.c b/celt/arm/celt_ne10_mdct.c new file mode 100644 index 0000000..cf175cb --- /dev/null +++ b/celt/arm/celt_ne10_mdct.c @@ -0,0 +1,158 @@ +/* Copyright (c) 2015 Xiph.Org Foundation + Written by Viswanath Puttagunta */ +/** + @file celt_ne10_mdct.c + @brief ARM Neon optimizations for mdct using NE10 library + */ + +/* + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions + are met: + + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER + OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING + NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + +#ifndef SKIP_CONFIG_H +#ifdef HAVE_CONFIG_H +#include "config.h" +#endif +#endif + +#include "../kiss_fft.h" +#include "_kiss_fft_guts.h" +#include "../mdct.h" +#include "stack_alloc.h" +#include "os_support.h" +#include "stack_alloc.h" + +void clt_mdct_forward_float_neon(const mdct_lookup *l, + kiss_fft_scalar *in, + kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 *window, + int overlap, int shift, int stride, int arch) +{ + int i; + int N, N2, N4; + VARDECL(kiss_fft_scalar, f); + VARDECL(kiss_fft_cpx, f2); + const kiss_fft_state *st = l->kfft[shift]; + const kiss_twiddle_scalar *trig; + + SAVE_STACK; + + N = l->n; + trig = l->trig; + for (i=0;i<shift;i++) + { + N >>= 1; + trig += N; + } + N2 = N>>1; + N4 = N>>2; + + ALLOC(f, N2, kiss_fft_scalar); + ALLOC(f2, N4, kiss_fft_cpx); + + /* Consider the input to be composed of four blocks: [a, b, c, d] */ + /* Window, shuffle, fold */ + { + /* Temp pointers to make it really clear to the compiler what we're doing */ + const kiss_fft_scalar * OPUS_RESTRICT xp1 = in+(overlap>>1); + const kiss_fft_scalar * OPUS_RESTRICT xp2 = in+N2-1+(overlap>>1); + kiss_fft_scalar * OPUS_RESTRICT yp = f; + const opus_val16 * OPUS_RESTRICT wp1 = window+(overlap>>1); + const opus_val16 * OPUS_RESTRICT wp2 = window+(overlap>>1)-1; + for(i=0;i<((overlap+3)>>2);i++) + { + /* Real part arranged as -d-cR, Imag part arranged as -b+aR*/ + *yp++ = MULT16_32_Q15(*wp2, xp1[N2]) + MULT16_32_Q15(*wp1,*xp2); + *yp++ = MULT16_32_Q15(*wp1, *xp1) - MULT16_32_Q15(*wp2, xp2[-N2]); + xp1+=2; + xp2-=2; + wp1+=2; + wp2-=2; + } + wp1 = window; + wp2 = window+overlap-1; + for(;i<N4-((overlap+3)>>2);i++) + { + /* Real part arranged as a-bR, Imag part arranged as -c-dR */ + *yp++ = *xp2; + *yp++ = *xp1; + xp1+=2; + xp2-=2; + } + for(;i<N4;i++) + { + /* Real part arranged as a-bR, Imag part arranged as -c-dR */ + *yp++ = -MULT16_32_Q15(*wp1, xp1[-N2]) + MULT16_32_Q15(*wp2, *xp2); + *yp++ = MULT16_32_Q15(*wp2, *xp1) + MULT16_32_Q15(*wp1, xp2[N2]); + xp1+=2; + xp2-=2; + wp1+=2; + wp2-=2; + } + } + /* Pre-rotation */ + { + kiss_fft_scalar * OPUS_RESTRICT yp = f; + const kiss_twiddle_scalar *t = &trig[0]; + for(i=0;i<N4;i++) + { + kiss_fft_cpx yc; + kiss_twiddle_scalar t0, t1; + kiss_fft_scalar re, im, yr, yi; + t0 = t[i]; + t1 = t[N4+i]; + re = *yp++; + im = *yp++; + yr = S_MUL(re,t0) - S_MUL(im,t1); + yi = S_MUL(im,t0) + S_MUL(re,t1); + yc.r = yr; + yc.i = yi; + f2[i] = yc; + } + } + + opus_fft(st, f2, (kiss_fft_cpx *)f, arch); + + /* Post-rotate */ + { + /* Temp pointers to make it really clear to the compiler what we're doing */ + const kiss_fft_cpx * OPUS_RESTRICT fp = (kiss_fft_cpx *)f; + kiss_fft_scalar * OPUS_RESTRICT yp1 = out; + kiss_fft_scalar * OPUS_RESTRICT yp2 = out+stride*(N2-1); + const kiss_twiddle_scalar *t = &trig[0]; + /* Temp pointers to make it really clear to the compiler what we're doing */ + for(i=0;i<N4;i++) + { + kiss_fft_scalar yr, yi; + yr = S_MUL(fp->i,t[N4+i]) - S_MUL(fp->r,t[i]); + yi = S_MUL(fp->r,t[N4+i]) + S_MUL(fp->i,t[i]); + *yp1 = yr; + *yp2 = yi; + fp++; + yp1 += 2*stride; + yp2 -= 2*stride; + } + } + RESTORE_STACK; +} diff --git a/celt/arm/celt_neon_intr.c b/celt/arm/celt_neon_intr.c index 4a67413..47dce15 100644 --- a/celt/arm/celt_neon_intr.c +++ b/celt/arm/celt_neon_intr.c @@ -29,9 +29,15 @@ NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ + +#ifdef HAVE_CONFIG_H +#include "config.h" +#endif + #include <arm_neon.h> #include "../pitch.h" +#if !defined(FIXED_POINT) /* * Function: xcorr_kernel_neon_float * --------------------------------- @@ -243,3 +249,4 @@ void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, (const float32_t *)_y+i, (float32_t *)xcorr+i, len); } } +#endif diff --git a/celt/arm/fft_arm.h b/celt/arm/fft_arm.h new file mode 100644 index 0000000..e7a30d6 --- /dev/null +++ b/celt/arm/fft_arm.h @@ -0,0 +1,66 @@ +/* Copyright (c) 2015 Xiph.Org Foundation + Written by Viswanath Puttagunta */ +/** + @file fft_arm.h + @brief ARM Neon Intrinsic optimizations for fft using NE10 library + */ + +/* + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions + are met: + + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER + OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING + NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + + +#if !defined(FFT_ARM_H) +#define FFT_ARM_H + +#include "config.h" +#include "kiss_fft.h" + +#if !defined(FIXED_POINT) +#if defined(HAVE_ARM_NE10) + +int opus_fft_alloc_arm_float_neon(kiss_fft_state *st); +void opus_fft_free_arm_float_neon(kiss_fft_state *st); + +void opus_fft_float_neon(const kiss_fft_state *st, + const kiss_fft_cpx *fin, + kiss_fft_cpx *fout); +#if !defined(OPUS_HAVE_RTCD) +#define OVERRIDE_OPUS_FFT (1) + +#define opus_fft_alloc_arch(_st, arch) \ + ((void)(arch), opus_fft_alloc_arm_float_neon(_st)) + +#define opus_fft_free_arch(_st, arch) \ + ((void)(arch), opus_fft_free_arm_float_neon(_st)) + +#define opus_fft(_st, _fin, _fout, arch) \ + ((void)(arch), opus_fft_float_neon(_st, _fin, _fout)) + +#endif /* OPUS_HAVE_RTCD */ + +#endif /* HAVE_ARM_NE10 */ +#endif /* FIXED_POINT */ + +#endif diff --git a/celt/arm/mdct_arm.h b/celt/arm/mdct_arm.h new file mode 100644 index 0000000..7d60fed --- /dev/null +++ b/celt/arm/mdct_arm.h @@ -0,0 +1,53 @@ +/* Copyright (c) 2015 Xiph.Org Foundation + Written by Viswanath Puttagunta */ +/** + @file arm_mdct.h + @brief ARM Neon Intrinsic optimizations for mdct using NE10 library + */ + +/* + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions + are met: + + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER + OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING + NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + +#if !defined(MDCT_ARM_H) +#define MDCT_ARM_H + +#include "config.h" +#include "mdct.h" + +#if !defined(FIXED_POINT) && defined(HAVE_ARM_NE10) +/** Compute a forward MDCT and scale by 4/N, trashes the input array */ +void clt_mdct_forward_float_neon(const mdct_lookup *l, kiss_fft_scalar *in, + kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 *window, int overlap, + int shift, int stride, int arch); + +#if !defined(OPUS_HAVE_RTCD) +#define OVERRIDE_OPUS_MDCT (1) +#define clt_mdct_forward(_l, _in, _out, _window, _int, _shift, _stride, _arch) \ + clt_mdct_forward_float_neon(_l, _in, _out, _window, _int, _shift, _stride, _arch) +#endif /* OPUS_HAVE_RTCD */ +#endif /* !defined(FIXED_POINT) && defined(HAVE_ARM_NE10) */ + +#endif diff --git a/celt/celt_encoder.c b/celt/celt_encoder.c index 86a3fbb..7a2c71b 100644 --- a/celt/celt_encoder.c +++ b/celt/celt_encoder.c @@ -414,7 +414,8 @@ int patch_transient_decision(opus_val16 *newE, opus_val16 *oldE, int nbEBands, /** Apply window and compute the MDCT for all sub-frames and all channels in a frame */ static void compute_mdcts(const CELTMode *mode, int shortBlocks, celt_sig * OPUS_RESTRICT in, - celt_sig * OPUS_RESTRICT out, int C, int CC, int LM, int upsample) + celt_sig * OPUS_RESTRICT out, int C, int CC, int LM, int upsample, + int arch) { const int overlap = mode->overlap; int N; @@ -435,7 +436,9 @@ static void compute_mdcts(const CELTMode *mode, int shortBlocks, celt_sig * OPUS for (b=0;b<B;b++) { /* Interleaving the sub-frames while doing the MDCTs */ - clt_mdct_forward(&mode->mdct, in+c*(B*N+overlap)+b*N, &out[b+c*N*B], mode->window, overlap, shift, B); + clt_mdct_forward(&mode->mdct, in+c*(B*N+overlap)+b*N, + &out[b+c*N*B], mode->window, overlap, shift, B, + arch); } } while (++c<CC); if (CC==2&&C==1) @@ -1603,14 +1606,14 @@ int celt_encode_with_ec(CELTEncoder * OPUS_RESTRICT st, const opus_val16 * pcm, ALLOC(bandLogE2, C*nbEBands, opus_val16); if (secondMdct) { - compute_mdcts(mode, 0, in, freq, C, CC, LM, st->upsample); + compute_mdcts(mode, 0, in, freq, C, CC, LM, st->upsample, st->arch); compute_band_energies(mode, freq, bandE, effEnd, C, LM); amp2Log2(mode, effEnd, end, bandE, bandLogE2, C); for (i=0;i<C*nbEBands;i++) bandLogE2[i] += HALF16(SHL16(LM, DB_SHIFT)); } - compute_mdcts(mode, shortBlocks, in, freq, C, CC, LM, st->upsample); + compute_mdcts(mode, shortBlocks, in, freq, C, CC, LM, st->upsample, st->arch); if (CC==2&&C==1) tf_chan = 0; compute_band_energies(mode, freq, bandE, effEnd, C, LM); @@ -1736,7 +1739,7 @@ int celt_encode_with_ec(CELTEncoder * OPUS_RESTRICT st, const opus_val16 * pcm, { isTransient = 1; shortBlocks = M; - compute_mdcts(mode, shortBlocks, in, freq, C, CC, LM, st->upsample); + compute_mdcts(mode, shortBlocks, in, freq, C, CC, LM, st->upsample, st->arch); compute_band_energies(mode, freq, bandE, effEnd, C, LM); amp2Log2(mode, effEnd, end, bandE, bandLogE, C); /* Compensate for the scaling of short vs long mdcts */ diff --git a/celt/dump_modes/Makefile b/celt/dump_modes/Makefile index 74d527e..10c3679 100644 --- a/celt/dump_modes/Makefile +++ b/celt/dump_modes/Makefile @@ -1,10 +1,31 @@ + CFLAGS=-O2 -Wall -Wextra -DHAVE_CONFIG_H INCLUDES=-I. -I../ -I../.. -I../../include +SOURCES = dump_modes.c \ + ../modes.c \ + ../cwrs.c \ + ../rate.c \ + ../entenc.c \ + ../entdec.c \ + ../mathops.c \ + ../mdct.c \ + ../kiss_fft.c + +ifdef HAVE_ARM_NE10 +CC = gcc +CFLAGS += -mfpu=neon +INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_NEON_INTR +LIBDIR = -l:$(NE10_LIBDIR)/libNE10.so +SOURCES += ../arm/celt_ne10_fft.c \ + dump_modes_arm_ne10.c \ + ../arm/armcpu.c +endif + all: dump_modes dump_modes: - $(CC) $(CFLAGS) $(INCLUDES) -DCUSTOM_MODES_ONLY -DCUSTOM_MODES dump_modes.c ../modes.c ../cwrs.c ../rate.c ../entenc.c ../entdec.c ../mathops.c ../mdct.c ../kiss_fft.c -o dump_modes -lm + $(PREFIX)$(CC) $(CFLAGS) $(INCLUDES) -DCUSTOM_MODES_ONLY -DCUSTOM_MODES $(SOURCES) -o $@ $(LIBDIR) -lm clean: rm -f dump_modes diff --git a/celt/dump_modes/dump_modes.c b/celt/dump_modes/dump_modes.c index ae6a8c1..9105a53 100644 --- a/celt/dump_modes/dump_modes.c +++ b/celt/dump_modes/dump_modes.c @@ -35,6 +35,7 @@ #include "modes.h" #include "celt.h" #include "rate.h" +#include "dump_modes_arch.h" #define INT16 "%d" #define INT32 "%d" @@ -62,6 +63,10 @@ void dump_modes(FILE *file, CELTMode **modes, int nb_modes) fprintf(file, "\n It contains static definitions for some pre-defined modes. */\n"); fprintf(file, "#include \"modes.h\"\n"); fprintf(file, "#include \"rate.h\"\n"); + fprintf(file, "\n#ifdef HAVE_ARM_NE10\n"); + fprintf(file, "#define OVERRIDE_FFT 1\n"); + fprintf(file, "#include \"%s\"\n", ARM_NE10_ARCH_FILE_NAME); + fprintf(file, "#endif\n"); fprintf(file, "\n"); @@ -149,6 +154,9 @@ void dump_modes(FILE *file, CELTMode **modes, int nb_modes) fprintf (file, "{" WORD16 ", " WORD16 "},%c", mode->mdct.kfft[0]->twiddles[j].r, mode->mdct.kfft[0]->twiddles[j].i,(j+3)%2==0?'\n':' '); fprintf (file, "};\n"); +#ifdef OVERRIDE_FFT + dump_mode_arch(mode); +#endif /* FFT Bitrev tables */ for (k=0;k<=mode->mdct.maxshift;k++) { @@ -183,6 +191,13 @@ void dump_modes(FILE *file, CELTMode **modes, int nb_modes) fprintf (file, "}, /* factors */\n"); fprintf (file, "fft_bitrev%d, /* bitrev */\n", mode->mdct.kfft[k]->nfft); fprintf (file, "fft_twiddles%d_%d, /* bitrev */\n", mode->Fs, mdctSize); + + fprintf (file, "#ifdef OVERRIDE_FFT\n"); + fprintf (file, "(arch_fft_state *)&cfg_arch_%d,\n", mode->mdct.kfft[k]->nfft); + fprintf (file, "#else\n"); + fprintf (file, "NULL,\n"); + fprintf(file, "#endif\n"); + fprintf (file, "};\n"); fprintf(file, "#endif\n"); @@ -323,8 +338,14 @@ int main(int argc, char **argv) } } file = fopen(BASENAME ".h", "w"); +#ifdef OVERRIDE_FFT + dump_modes_arch_init(m, nb); +#endif dump_modes(file, m, nb); fclose(file); +#ifdef OVERRIDE_FFT + dump_modes_arch_finalize(); +#endif for (i=0;i<nb;i++) opus_custom_mode_destroy(m[i]); free(m); diff --git a/celt/dump_modes/dump_modes_arch.h b/celt/dump_modes/dump_modes_arch.h new file mode 100644 index 0000000..1436926 --- /dev/null +++ b/celt/dump_modes/dump_modes_arch.h @@ -0,0 +1,41 @@ +/* Copyright (c) 2015 Xiph.Org Foundation + Written by Viswanath Puttagunta */ +/* + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions + are met: + + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER + OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING + NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + +#ifndef DUMP_MODE_ARCH_H +#define DUMP_MODE_ARCH_H + +void dump_modes_arch_init(); +void dump_mode_arch(CELTMode *mode); +void dump_modes_arch_finalize(); + +#define ARM_NE10_ARCH_FILE_NAME "static_modes_float_arm_ne10.h" + +#if defined(HAVE_ARM_NE10) +#define OVERRIDE_FFT (1) +#endif + +#endif diff --git a/celt/dump_modes/dump_modes_arm_ne10.c b/celt/dump_modes/dump_modes_arm_ne10.c new file mode 100644 index 0000000..aa53f17 --- /dev/null +++ b/celt/dump_modes/dump_modes_arm_ne10.c @@ -0,0 +1,125 @@ +/* Copyright (c) 2015 Xiph.Org Foundation + Written by Viswanath Puttagunta */ +/* + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions + are met: + + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER + OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING + NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + +#include <stdio.h> +#include <stdlib.h> +#include "modes.h" +#include "dump_modes_arch.h" +#include <NE10_dsp.h> + +static FILE *file; + +void dump_modes_arch_init(CELTMode **modes, int nb_modes) +{ + int i; + + file = fopen(ARM_NE10_ARCH_FILE_NAME, "w"); + fprintf(file, "/* The contents of this file was automatically generated by\n"); + fprintf(file, " * dump_mode_arm_ne10.c with arguments:"); + for (i=0;i<nb_modes;i++) + { + CELTMode *mode = modes[i]; + fprintf(file, " %d %d",mode->Fs,mode->shortMdctSize*mode->nbShortMdcts); + } + fprintf(file, "\n * It contains static definitions for some pre-defined modes. */\n"); + fprintf(file, "#include <NE10_init.h>\n\n"); +} + +void dump_modes_arch_finalize() +{ + fclose(file); +} + +void dump_mode_arch(CELTMode *mode) +{ + int k, j; + int mdctSize; + + mdctSize = mode->shortMdctSize*mode->nbShortMdcts; + + fprintf(file, "#ifndef NE10_FFT_PARAMS%d_%d\n", mode->Fs, mdctSize); + fprintf(file, "#define NE10_FFT_PARAMS%d_%d\n", mode->Fs, mdctSize); + ne10_fft_cfg_float32_t cfg; + /* cfg->factors */ + for(k=0;k<=mode->mdct.maxshift;k++) { + cfg = (ne10_fft_cfg_float32_t)mode->mdct.kfft[k]->arch_fft->priv; + if (!cfg) + continue; + fprintf(file, "static const ne10_int32_t ne10_factors_%d[%d] = {\n", + mode->mdct.kfft[k]->nfft, (NE10_MAXFACTORS * 2)); + for(j=0;j<(NE10_MAXFACTORS * 2);j++) { + fprintf(file, "%d,%c", cfg->factors[j],(j+16)%15==0?'\n':' '); + } + fprintf (file, "};\n"); + } + + /* cfg->twiddles */ + for(k=0;k<=mode->mdct.maxshift;k++) { + cfg = (ne10_fft_cfg_float32_t)mode->mdct.kfft[k]->arch_fft->priv; + if (!cfg) + continue; + fprintf(file, "static const ne10_fft_cpx_float32_t ne10_twiddles_%d[%d] = {\n", + mode->mdct.kfft[k]->nfft, mode->mdct.kfft[k]->nfft); + for(j=0;j<mode->mdct.kfft[k]->nfft;j++) { + fprintf(file, "{%#0.8gf,%#0.8gf},%c", cfg->twiddles[j].r, cfg->twiddles[j].i,(j+4)%3==0?'\n':' '); + } + fprintf (file, "};\n"); + } + + for(k=0;k<=mode->mdct.maxshift;k++) { + cfg = (ne10_fft_cfg_float32_t)mode->mdct.kfft[k]->arch_fft->priv; + if (!cfg) { + fprintf(file, "/* Ne10 does not support scaled FFT for length = %d */\n", + mode->mdct.kfft[k]->nfft); + fprintf(file, "static const arch_fft_state cfg_arch_%d = {\n", mode->mdct.kfft[k]->nfft); + fprintf(file, "0,\n"); + fprintf(file, "NULL\n"); + fprintf(file, "};\n"); + continue; + } + fprintf(file, "static const ne10_fft_state_float32_t ne10_fft_state_float32_%d = {\n", + mode->mdct.kfft[k]->nfft); + fprintf(file, "%d,\n", cfg->nfft); + fprintf(file, "(ne10_int32_t *)ne10_factors_%d,\n", mode->mdct.kfft[k]->nfft); + fprintf(file, "(ne10_fft_cpx_float32_t *)ne10_twiddles_%d,\n", mode->mdct.kfft[k]->nfft); + fprintf(file, "NULL,\n"); /* buffer */ + fprintf(file, "(ne10_fft_cpx_float32_t *)&ne10_twiddles_%d[%d],\n", + mode->mdct.kfft[k]->nfft, cfg->nfft); + fprintf(file, "/* is_forward_scaled = true */\n"); + fprintf(file, "(ne10_int32_t) 1,\n"); + fprintf(file, "/* is_backward_scaled = false */\n"); + fprintf(file, "(ne10_int32_t) 0,\n"); + fprintf(file, "};\n"); + + fprintf(file, "static const arch_fft_state cfg_arch_%d = {\n", + mode->mdct.kfft[k]->nfft); + fprintf(file, "1,\n"); + fprintf(file, "(void *)&ne10_fft_state_float32_%d,\n", mode->mdct.kfft[k]->nfft); + fprintf(file, "};\n\n"); + } + fprintf(file, "#endif /* end NE10_FFT_PARAMS%d_%d */\n", mode->Fs, mdctSize); +} diff --git a/celt/kiss_fft.c b/celt/kiss_fft.c index cc487fc..38fd4fb 100644 --- a/celt/kiss_fft.c +++ b/celt/kiss_fft.c @@ -423,13 +423,19 @@ static void compute_twiddles(kiss_twiddle_cpx *twiddles, int nfft) #endif } +int opus_fft_alloc_arch_c(kiss_fft_state *st) { + (void)st; + return 0; +} + /* * * Allocates all necessary storage space for the fft and ifft. * The return value is a contiguous block of memory. As such, * It can be freed with free(). * */ -kiss_fft_state *opus_fft_alloc_twiddles(int nfft,void * mem,size_t * lenmem, const kiss_fft_state *base) +kiss_fft_state *opus_fft_alloc_twiddles(int nfft,void * mem,size_t * lenmem, + const kiss_fft_state *base, int arch) { kiss_fft_state *st=NULL; size_t memneeded = sizeof(struct kiss_fft_state); /* twiddle factors*/ @@ -478,22 +484,31 @@ kiss_fft_state *opus_fft_alloc_twiddles(int nfft,void * mem,size_t * lenmem, co if (st->bitrev==NULL) goto fail; compute_bitrev_table(0, bitrev, 1,1, st->factors,st); + + /* Initialize architecture specific fft parameters */ + if (opus_fft_alloc_arch(st, arch)) + goto fail; } return st; fail: - opus_fft_free(st); + opus_fft_free(st, arch); return NULL; } -kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t * lenmem ) +kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t * lenmem, int arch) { - return opus_fft_alloc_twiddles(nfft, mem, lenmem, NULL); + return opus_fft_alloc_twiddles(nfft, mem, lenmem, NULL, arch); +} + +void opus_fft_free_arch_c(kiss_fft_state *st) { + (void)st; } -void opus_fft_free(const kiss_fft_state *cfg) +void opus_fft_free(const kiss_fft_state *cfg, int arch) { if (cfg) { + opus_fft_free_arch((kiss_fft_state *)cfg, arch); opus_free((opus_int16*)cfg->bitrev); if (cfg->shift < 0) opus_free((kiss_twiddle_cpx*)cfg->twiddles); @@ -551,7 +566,7 @@ void opus_fft_impl(const kiss_fft_state *st,kiss_fft_cpx *fout) } } -void opus_fft(const kiss_fft_state *st,const kiss_fft_cpx *fin,kiss_fft_cpx *fout) +void opus_fft_c(const kiss_fft_state *st,const kiss_fft_cpx *fin,kiss_fft_cpx *fout) { int i; opus_val16 scale; diff --git a/celt/kiss_fft.h b/celt/kiss_fft.h index 390b54d..bf2f836 100644 --- a/celt/kiss_fft.h +++ b/celt/kiss_fft.h @@ -32,6 +32,7 @@ #include <stdlib.h> #include <math.h> #include "arch.h" +#include "cpu_support.h" #ifdef __cplusplus extern "C" { @@ -77,6 +78,11 @@ typedef struct { 4*4*4*2 */ +typedef struct arch_fft_state{ + int is_supported; + void *priv; +} arch_fft_state; + typedef struct kiss_fft_state{ int nfft; opus_val16 scale; @@ -87,8 +93,15 @@ typedef struct kiss_fft_state{ opus_int16 factors[2*MAXFACTORS]; const opus_int16 *bitrev; const kiss_twiddle_cpx *twiddles; +#ifndef FIXED_POINT + arch_fft_state *arch_fft; +#endif } kiss_fft_state; +#if !defined(FIXED_POINT) && defined(HAVE_ARM_NE10) +#include "arm/fft_arm.h" +#endif + /*typedef struct kiss_fft_state* kiss_fft_cfg;*/ /** @@ -114,9 +127,9 @@ typedef struct kiss_fft_state{ * buffer size in *lenmem. * */ -kiss_fft_state *opus_fft_alloc_twiddles(int nfft,void * mem,size_t * lenmem, const kiss_fft_state *base); +kiss_fft_state *opus_fft_alloc_twiddles(int nfft,void * mem,size_t * lenmem, const kiss_fft_state *base, int arch); -kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t * lenmem); +kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t * lenmem, int arch); /** * opus_fft(cfg,in_out_buf) @@ -128,13 +141,48 @@ kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t * lenmem); * Note that each element is complex and can be accessed like f[k].r and f[k].i * */ -void opus_fft(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx *fout); +void opus_fft_c(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx *fout); void opus_ifft(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx *fout); void opus_fft_impl(const kiss_fft_state *st,kiss_fft_cpx *fout); void opus_ifft_impl(const kiss_fft_state *st,kiss_fft_cpx *fout); -void opus_fft_free(const kiss_fft_state *cfg); +void opus_fft_free(const kiss_fft_state *cfg, int arch); + + +void opus_fft_free_arch_c(kiss_fft_state *st); +int opus_fft_alloc_arch_c(kiss_fft_state *st); + +#if !defined(OVERRIDE_OPUS_FFT) +/* Is run-time CPU detection enabled on this platform? */ +#if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) + +int (*const OPUS_FFT_ALLOC_ARCH_IMPL[OPUS_ARCHMASK+1])(kiss_fft_state *st); + +#define opus_fft_alloc_arch(_st, arch) \ + ((*OPUS_FFT_ALLOC_ARCH_IMPL[(arch)&OPUS_ARCHMASK])(_st)) + +void (*const OPUS_FFT_FREE_ARCH_IMPL[OPUS_ARCHMASK+1])(kiss_fft_state *st); +#define opus_fft_free_arch(_st, arch) \ + ((*OPUS_FFT_FREE_ARCH_IMPL[(arch)&OPUS_ARCHMASK])(_st)) + +void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg, + const kiss_fft_cpx *fin, + kiss_fft_cpx *fout); +#define opus_fft(_cfg, _fin, _fout, arch) \ + ((*OPUS_FFT[(arch)&OPUS_ARCHMASK])(_cfg, _fin, _fout)) +#else /* else for if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */ + +#define opus_fft_alloc_arch(_st, arch) \ + ((void)(arch), opus_fft_alloc_arch_c(_st)) + +#define opus_fft_free_arch(_st, arch) \ + ((void)(arch), opus_fft_free_arch_c(_st)) + +#define opus_fft(_cfg, _fin, _fout, arch) \ + ((void)(arch), opus_fft_c(_cfg, _fin, _fout)) +#endif /* end if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */ +#endif /* end if !defined(OVERRIDE_OPUS_FFT) */ #ifdef __cplusplus } diff --git a/celt/mdct.c b/celt/mdct.c index 2795d90..ee6d80e 100644 --- a/celt/mdct.c +++ b/celt/mdct.c @@ -60,7 +60,7 @@ #ifdef CUSTOM_MODES -int clt_mdct_init(mdct_lookup *l,int N, int maxshift) +int clt_mdct_init(mdct_lookup *l,int N, int maxshift, int arch) { int i; kiss_twiddle_scalar *trig; @@ -71,9 +71,9 @@ int clt_mdct_init(mdct_lookup *l,int N, int maxshift) for (i=0;i<=maxshift;i++) { if (i==0) - l->kfft[i] = opus_fft_alloc(N>>2>>i, 0, 0); + l->kfft[i] = opus_fft_alloc(N>>2>>i, 0, 0, arch); else - l->kfft[i] = opus_fft_alloc_twiddles(N>>2>>i, 0, 0, l->kfft[0]); + l->kfft[i] = opus_fft_alloc_twiddles(N>>2>>i, 0, 0, l->kfft[0], arch); #ifndef ENABLE_TI_DSPLIB55 if (l->kfft[i]==NULL) return 0; @@ -104,11 +104,11 @@ int clt_mdct_init(mdct_lookup *l,int N, int maxshift) return 1; } -void clt_mdct_clear(mdct_lookup *l) +void clt_mdct_clear(mdct_lookup *l, int arch) { int i; for (i=0;i<=l->maxshift;i++) - opus_fft_free(l->kfft[i]); + opus_fft_free(l->kfft[i], arch); opus_free((kiss_twiddle_scalar*)l->trig); } @@ -116,8 +116,8 @@ void clt_mdct_clear(mdct_lookup *l) /* Forward MDCT trashes the input array */ #ifndef OVERRIDE_clt_mdct_forward -void clt_mdct_forward(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scalar * OPUS_RESTRICT out, - const opus_val16 *window, int overlap, int shift, int stride) +void clt_mdct_forward_c(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 *window, int overlap, int shift, int stride, int arch) { int i; int N, N2, N4; @@ -132,6 +132,7 @@ void clt_mdct_forward(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scalar int scale_shift = st->scale_shift-1; #endif SAVE_STACK; + (void)arch; scale = st->scale; N = l->n; diff --git a/celt/mdct.h b/celt/mdct.h index d721821..cbaf679 100644 --- a/celt/mdct.h +++ b/celt/mdct.h @@ -53,13 +53,19 @@ typedef struct { const kiss_twiddle_scalar * OPUS_RESTRICT trig; } mdct_lookup; -int clt_mdct_init(mdct_lookup *l,int N, int maxshift); -void clt_mdct_clear(mdct_lookup *l); +#if !defined(FIXED_POINT) && defined(HAVE_ARM_NE10) +#include "arm/mdct_arm.h" +#endif + + +int clt_mdct_init(mdct_lookup *l,int N, int maxshift, int arch); +void clt_mdct_clear(mdct_lookup *l, int arch); /** Compute a forward MDCT and scale by 4/N, trashes the input array */ -void clt_mdct_forward(const mdct_lookup *l, kiss_fft_scalar *in, - kiss_fft_scalar * OPUS_RESTRICT out, - const opus_val16 *window, int overlap, int shift, int stride); +void clt_mdct_forward_c(const mdct_lookup *l, kiss_fft_scalar *in, + kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 *window, int overlap, + int shift, int stride, int arch); /** Compute a backward MDCT (no scaling) and performs weighted overlap-add (scales implicitly by 1/2) */ @@ -67,4 +73,27 @@ void clt_mdct_backward(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scalar * OPUS_RESTRICT out, const opus_val16 * OPUS_RESTRICT window, int overlap, int shift, int stride); +#if !defined(OVERRIDE_OPUS_MDCT) +/* Is run-time CPU detection enabled on this platform? */ +#if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) + +void (*const CLT_MDCT_FORWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l, + kiss_fft_scalar *in, + kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 *window, + int overlap, int shift, + int stride, int arch); + +#define clt_mdct_forward(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) \ + ((*CLT_MDCT_FORWARD_IMPL[(arch)&OPUS_ARCHMASK])(_l, _in, _out, \ + _window, _overlap, _shift, \ + _stride, _arch)) +#else /* else for if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */ + +#define clt_mdct_forward(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) \ + clt_mdct_forward_c(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) + +#endif /* end if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */ +#endif /* end if !defined(OVERRIDE_OPUS_MDCT) */ + #endif diff --git a/celt/modes.c b/celt/modes.c index 42e68e1..4fe91ff 100644 --- a/celt/modes.c +++ b/celt/modes.c @@ -37,6 +37,7 @@ #include "os_support.h" #include "stack_alloc.h" #include "quant_bands.h" +#include "cpu_support.h" static const opus_int16 eband5ms[] = { /*0 200 400 600 800 1k 1.2 1.4 1.6 2k 2.4 2.8 3.2 4k 4.8 5.6 6.8 8k 9.6 12k 15.6 */ @@ -229,6 +230,7 @@ CELTMode *opus_custom_mode_create(opus_int32 Fs, int frame_size, int *error) opus_val16 *window; opus_int16 *logN; int LM; + int arch = opus_select_arch(); ALLOC_STACK; #if !defined(VAR_ARRAYS) && !defined(USE_ALLOCA) if (global_stack==NULL) @@ -389,7 +391,7 @@ CELTMode *opus_custom_mode_create(opus_int32 Fs, int frame_size, int *error) compute_pulse_cache(mode, mode->maxLM); if (clt_mdct_init(&mode->mdct, 2*mode->shortMdctSize*mode->nbShortMdcts, - mode->maxLM) == 0) + mode->maxLM, arch) == 0) goto failure; if (error) @@ -408,6 +410,8 @@ failure: #ifdef CUSTOM_MODES void opus_custom_mode_destroy(CELTMode *mode) { + int arch = opus_select_arch(); + if (mode == NULL) return; #ifndef CUSTOM_MODES_ONLY @@ -431,7 +435,7 @@ void opus_custom_mode_destroy(CELTMode *mode) opus_free((opus_int16*)mode->cache.index); opus_free((unsigned char*)mode->cache.bits); opus_free((unsigned char*)mode->cache.caps); - clt_mdct_clear(&mode->mdct); + clt_mdct_clear(&mode->mdct, arch); opus_free((CELTMode *)mode); } diff --git a/celt/static_modes_float.h b/celt/static_modes_float.h index 2fadb62..e102a38 100644 --- a/celt/static_modes_float.h +++ b/celt/static_modes_float.h @@ -4,6 +4,11 @@ #include "modes.h" #include "rate.h" +#ifdef HAVE_ARM_NE10 +#define OVERRIDE_FFT 1 +#include "static_modes_float_arm_ne10.h" +#endif + #ifndef DEF_WINDOW120 #define DEF_WINDOW120 static const opus_val16 window120[120] = { @@ -431,6 +436,11 @@ static const kiss_fft_state fft_state48000_960_0 = { {5, 96, 3, 32, 4, 8, 2, 4, 4, 1, 0, 0, 0, 0, 0, 0, }, /* factors */ fft_bitrev480, /* bitrev */ fft_twiddles48000_960, /* bitrev */ +#ifdef OVERRIDE_FFT +(arch_fft_state *)&cfg_arch_480, +#else +NULL, +#endif }; #endif @@ -443,6 +453,11 @@ static const kiss_fft_state fft_state48000_960_1 = { {5, 48, 3, 16, 4, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, }, /* factors */ fft_bitrev240, /* bitrev */ fft_twiddles48000_960, /* bitrev */ +#ifdef OVERRIDE_FFT +(arch_fft_state *)&cfg_arch_240, +#else +NULL, +#endif }; #endif @@ -455,6 +470,11 @@ static const kiss_fft_state fft_state48000_960_2 = { {5, 24, 3, 8, 2, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, }, /* factors */ fft_bitrev120, /* bitrev */ fft_twiddles48000_960, /* bitrev */ +#ifdef OVERRIDE_FFT +(arch_fft_state *)&cfg_arch_120, +#else +NULL, +#endif }; #endif @@ -467,6 +487,11 @@ static const kiss_fft_state fft_state48000_960_3 = { {5, 12, 3, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, }, /* factors */ fft_bitrev60, /* bitrev */ fft_twiddles48000_960, /* bitrev */ +#ifdef OVERRIDE_FFT +(arch_fft_state *)&cfg_arch_60, +#else +NULL, +#endif }; #endif diff --git a/celt/static_modes_float_arm_ne10.h b/celt/static_modes_float_arm_ne10.h new file mode 100644 index 0000000..5bcec70 --- /dev/null +++ b/celt/static_modes_float_arm_ne10.h @@ -0,0 +1,404 @@ +/* The contents of this file was automatically generated by + * dump_mode_arm_ne10.c with arguments: 48000 960 + * It contains static definitions for some pre-defined modes. */ +#include <NE10_init.h> + +#ifndef NE10_FFT_PARAMS48000_960 +#define NE10_FFT_PARAMS48000_960 +static const ne10_int32_t ne10_factors_480[64] = { +4, 40, 4, 30, 2, 15, 5, 3, 3, 1, 1, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, }; +static const ne10_int32_t ne10_factors_240[64] = { +3, 20, 4, 15, 5, 3, 3, 1, 1, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, }; +static const ne10_int32_t ne10_factors_120[64] = { +3, 10, 2, 15, 5, 3, 3, 1, 1, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, }; +static const ne10_int32_t ne10_factors_60[64] = { +2, 5, 5, 3, 3, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, +0, 0, 0, 0, }; +static const ne10_fft_cpx_float32_t ne10_twiddles_480[480] = { +{1.0000000f,0.0000000f}, {1.0000000f,-0.0000000f}, {1.0000000f,-0.0000000f}, +{1.0000000f,-0.0000000f}, {0.91354543f,-0.40673664f}, {0.66913056f,-0.74314487f}, +{1.0000000f,-0.0000000f}, {0.66913056f,-0.74314487f}, {-0.10452851f,-0.99452192f}, +{1.0000000f,-0.0000000f}, {0.30901697f,-0.95105654f}, {-0.80901700f,-0.58778518f}, +{1.0000000f,-0.0000000f}, {-0.10452851f,-0.99452192f}, {-0.97814757f,0.20791179f}, +{1.0000000f,-0.0000000f}, {0.97814763f,-0.20791170f}, {0.91354543f,-0.40673664f}, +{0.80901700f,-0.58778524f}, {0.66913056f,-0.74314487f}, {0.49999997f,-0.86602545f}, +{0.30901697f,-0.95105654f}, {0.10452842f,-0.99452192f}, {-0.10452851f,-0.99452192f}, +{-0.30901703f,-0.95105648f}, {-0.50000006f,-0.86602533f}, {-0.66913068f,-0.74314475f}, +{-0.80901700f,-0.58778518f}, {-0.91354549f,-0.40673658f}, {-0.97814763f,-0.20791161f}, +{1.0000000f,-0.0000000f}, {0.99862951f,-0.052335959f}, {0.99452192f,-0.10452846f}, +{0.98768836f,-0.15643448f}, {0.97814763f,-0.20791170f}, {0.96592581f,-0.25881904f}, +{0.95105648f,-0.30901700f}, {0.93358040f,-0.35836795f}, {0.91354543f,-0.40673664f}, +{0.89100653f,-0.45399052f}, {0.86602545f,-0.50000000f}, {0.83867055f,-0.54463905f}, +{0.80901700f,-0.58778524f}, {0.77714598f,-0.62932038f}, {0.74314475f,-0.66913062f}, +{0.70710677f,-0.70710683f}, {0.66913056f,-0.74314487f}, {0.62932038f,-0.77714598f}, +{0.58778524f,-0.80901700f}, {0.54463899f,-0.83867055f}, {0.49999997f,-0.86602545f}, +{0.45399052f,-0.89100653f}, {0.40673661f,-0.91354549f}, {0.35836786f,-0.93358046f}, +{0.30901697f,-0.95105654f}, {0.25881907f,-0.96592581f}, {0.20791166f,-0.97814763f}, +{0.15643437f,-0.98768836f}, {0.10452842f,-0.99452192f}, {0.052335974f,-0.99862951f}, +{1.0000000f,-0.0000000f}, {0.99452192f,-0.10452846f}, {0.97814763f,-0.20791170f}, +{0.95105648f,-0.30901700f}, {0.91354543f,-0.40673664f}, {0.86602545f,-0.50000000f}, +{0.80901700f,-0.58778524f}, {0.74314475f,-0.66913062f}, {0.66913056f,-0.74314487f}, +{0.58778524f,-0.80901700f}, {0.49999997f,-0.86602545f}, {0.40673661f,-0.91354549f}, +{0.30901697f,-0.95105654f}, {0.20791166f,-0.97814763f}, {0.10452842f,-0.99452192f}, +{-4.3711388e-08f,-1.0000000f}, {-0.10452851f,-0.99452192f}, {-0.20791174f,-0.97814757f}, +{-0.30901703f,-0.95105648f}, {-0.40673670f,-0.91354543f}, {-0.50000006f,-0.86602533f}, +{-0.58778518f,-0.80901700f}, {-0.66913068f,-0.74314475f}, {-0.74314493f,-0.66913044f}, +{-0.80901700f,-0.58778518f}, {-0.86602539f,-0.50000006f}, {-0.91354549f,-0.40673658f}, +{-0.95105654f,-0.30901679f}, {-0.97814763f,-0.20791161f}, {-0.99452192f,-0.10452849f}, +{1.0000000f,-0.0000000f}, {0.98768836f,-0.15643448f}, {0.95105648f,-0.30901700f}, +{0.89100653f,-0.45399052f}, {0.80901700f,-0.58778524f}, {0.70710677f,-0.70710683f}, +{0.58778524f,-0.80901700f}, {0.45399052f,-0.89100653f}, {0.30901697f,-0.95105654f}, +{0.15643437f,-0.98768836f}, {-4.3711388e-08f,-1.0000000f}, {-0.15643445f,-0.98768836f}, +{-0.30901703f,-0.95105648f}, {-0.45399061f,-0.89100647f}, {-0.58778518f,-0.80901700f}, +{-0.70710677f,-0.70710677f}, {-0.80901700f,-0.58778518f}, {-0.89100659f,-0.45399037f}, +{-0.95105654f,-0.30901679f}, {-0.98768836f,-0.15643445f}, {-1.0000000f,8.7422777e-08f}, +{-0.98768830f,0.15643461f}, {-0.95105654f,0.30901697f}, {-0.89100653f,0.45399055f}, +{-0.80901694f,0.58778536f}, {-0.70710665f,0.70710689f}, {-0.58778507f,0.80901712f}, +{-0.45399022f,0.89100665f}, {-0.30901709f,0.95105648f}, {-0.15643452f,0.98768830f}, +{1.0000000f,-0.0000000f}, {0.99991435f,-0.013089596f}, {0.99965733f,-0.026176950f}, +{0.99922901f,-0.039259817f}, {0.99862951f,-0.052335959f}, {0.99785894f,-0.065403134f}, +{0.99691731f,-0.078459099f}, {0.99580491f,-0.091501623f}, {0.99452192f,-0.10452846f}, +{0.99306846f,-0.11753740f}, {0.99144489f,-0.13052620f}, {0.98965138f,-0.14349262f}, +{0.98768836f,-0.15643448f}, {0.98555607f,-0.16934951f}, {0.98325491f,-0.18223552f}, +{0.98078525f,-0.19509032f}, {0.97814763f,-0.20791170f}, {0.97534233f,-0.22069745f}, +{0.97236991f,-0.23344538f}, {0.96923089f,-0.24615330f}, {0.96592581f,-0.25881904f}, +{0.96245521f,-0.27144045f}, {0.95881975f,-0.28401536f}, {0.95501995f,-0.29654160f}, +{0.95105648f,-0.30901700f}, {0.94693011f,-0.32143945f}, {0.94264150f,-0.33380687f}, +{0.93819129f,-0.34611708f}, {0.93358040f,-0.35836795f}, {0.92880952f,-0.37055743f}, +{0.92387956f,-0.38268346f}, {0.91879117f,-0.39474389f}, {0.91354543f,-0.40673664f}, +{0.90814316f,-0.41865975f}, {0.90258527f,-0.43051112f}, {0.89687270f,-0.44228873f}, +{0.89100653f,-0.45399052f}, {0.88498765f,-0.46561453f}, {0.87881708f,-0.47715878f}, +{0.87249601f,-0.48862126f}, {0.86602545f,-0.50000000f}, {0.85940641f,-0.51129311f}, +{0.85264015f,-0.52249855f}, {0.84572786f,-0.53361452f}, {0.83867055f,-0.54463905f}, +{0.83146960f,-0.55557024f}, {0.82412618f,-0.56640625f}, {0.81664151f,-0.57714522f}, +{0.80901700f,-0.58778524f}, {0.80125380f,-0.59832460f}, {0.79335332f,-0.60876143f}, +{0.78531694f,-0.61909395f}, {0.77714598f,-0.62932038f}, {0.76884180f,-0.63943899f}, +{0.76040596f,-0.64944810f}, {0.75183982f,-0.65934587f}, {0.74314475f,-0.66913062f}, +{0.73432249f,-0.67880076f}, {0.72537434f,-0.68835455f}, {0.71630192f,-0.69779050f}, +{0.70710677f,-0.70710683f}, {0.69779044f,-0.71630198f}, {0.68835455f,-0.72537440f}, +{0.67880070f,-0.73432255f}, {0.66913056f,-0.74314487f}, {0.65934581f,-0.75183982f}, +{0.64944804f,-0.76040596f}, {0.63943899f,-0.76884186f}, {0.62932038f,-0.77714598f}, +{0.61909395f,-0.78531694f}, {0.60876137f,-0.79335338f}, {0.59832460f,-0.80125386f}, +{0.58778524f,-0.80901700f}, {0.57714516f,-0.81664151f}, {0.56640625f,-0.82412618f}, +{0.55557019f,-0.83146960f}, {0.54463899f,-0.83867055f}, {0.53361452f,-0.84572786f}, +{0.52249849f,-0.85264015f}, {0.51129311f,-0.85940641f}, {0.49999997f,-0.86602545f}, +{0.48862118f,-0.87249601f}, {0.47715876f,-0.87881708f}, {0.46561447f,-0.88498765f}, +{0.45399052f,-0.89100653f}, {0.44228867f,-0.89687276f}, {0.43051103f,-0.90258533f}, +{0.41865975f,-0.90814316f}, {0.40673661f,-0.91354549f}, {0.39474380f,-0.91879129f}, +{0.38268343f,-0.92387956f}, {0.37055740f,-0.92880958f}, {0.35836786f,-0.93358046f}, +{0.34611705f,-0.93819135f}, {0.33380681f,-0.94264150f}, {0.32143947f,-0.94693011f}, +{0.30901697f,-0.95105654f}, {0.29654151f,-0.95501995f}, {0.28401533f,-0.95881975f}, +{0.27144039f,-0.96245527f}, {0.25881907f,-0.96592581f}, {0.24615327f,-0.96923089f}, +{0.23344530f,-0.97236991f}, {0.22069745f,-0.97534233f}, {0.20791166f,-0.97814763f}, +{0.19509023f,-0.98078531f}, {0.18223552f,-0.98325491f}, {0.16934945f,-0.98555607f}, +{0.15643437f,-0.98768836f}, {0.14349259f,-0.98965138f}, {0.13052613f,-0.99144489f}, +{0.11753740f,-0.99306846f}, {0.10452842f,-0.99452192f}, {0.091501534f,-0.99580491f}, +{0.078459084f,-0.99691731f}, {0.065403074f,-0.99785894f}, {0.052335974f,-0.99862951f}, +{0.039259788f,-0.99922901f}, {0.026176875f,-0.99965733f}, {0.013089597f,-0.99991435f}, +{1.0000000f,-0.0000000f}, {0.99965733f,-0.026176950f}, {0.99862951f,-0.052335959f}, +{0.99691731f,-0.078459099f}, {0.99452192f,-0.10452846f}, {0.99144489f,-0.13052620f}, +{0.98768836f,-0.15643448f}, {0.98325491f,-0.18223552f}, {0.97814763f,-0.20791170f}, +{0.97236991f,-0.23344538f}, {0.96592581f,-0.25881904f}, {0.95881975f,-0.28401536f}, +{0.95105648f,-0.30901700f}, {0.94264150f,-0.33380687f}, {0.93358040f,-0.35836795f}, +{0.92387956f,-0.38268346f}, {0.91354543f,-0.40673664f}, {0.90258527f,-0.43051112f}, +{0.89100653f,-0.45399052f}, {0.87881708f,-0.47715878f}, {0.86602545f,-0.50000000f}, +{0.85264015f,-0.52249855f}, {0.83867055f,-0.54463905f}, {0.82412618f,-0.56640625f}, +{0.80901700f,-0.58778524f}, {0.79335332f,-0.60876143f}, {0.77714598f,-0.62932038f}, +{0.76040596f,-0.64944810f}, {0.74314475f,-0.66913062f}, {0.72537434f,-0.68835455f}, +{0.70710677f,-0.70710683f}, {0.68835455f,-0.72537440f}, {0.66913056f,-0.74314487f}, +{0.64944804f,-0.76040596f}, {0.62932038f,-0.77714598f}, {0.60876137f,-0.79335338f}, +{0.58778524f,-0.80901700f}, {0.56640625f,-0.82412618f}, {0.54463899f,-0.83867055f}, +{0.52249849f,-0.85264015f}, {0.49999997f,-0.86602545f}, {0.47715876f,-0.87881708f}, +{0.45399052f,-0.89100653f}, {0.43051103f,-0.90258533f}, {0.40673661f,-0.91354549f}, +{0.38268343f,-0.92387956f}, {0.35836786f,-0.93358046f}, {0.33380681f,-0.94264150f}, +{0.30901697f,-0.95105654f}, {0.28401533f,-0.95881975f}, {0.25881907f,-0.96592581f}, +{0.23344530f,-0.97236991f}, {0.20791166f,-0.97814763f}, {0.18223552f,-0.98325491f}, +{0.15643437f,-0.98768836f}, {0.13052613f,-0.99144489f}, {0.10452842f,-0.99452192f}, +{0.078459084f,-0.99691731f}, {0.052335974f,-0.99862951f}, {0.026176875f,-0.99965733f}, +{-4.3711388e-08f,-1.0000000f}, {-0.026176963f,-0.99965733f}, {-0.052336060f,-0.99862951f}, +{-0.078459173f,-0.99691731f}, {-0.10452851f,-0.99452192f}, {-0.13052621f,-0.99144489f}, +{-0.15643445f,-0.98768836f}, {-0.18223560f,-0.98325491f}, {-0.20791174f,-0.97814757f}, +{-0.23344538f,-0.97236991f}, {-0.25881916f,-0.96592581f}, {-0.28401542f,-0.95881969f}, +{-0.30901703f,-0.95105648f}, {-0.33380687f,-0.94264150f}, {-0.35836795f,-0.93358040f}, +{-0.38268352f,-0.92387950f}, {-0.40673670f,-0.91354543f}, {-0.43051112f,-0.90258527f}, +{-0.45399061f,-0.89100647f}, {-0.47715873f,-0.87881708f}, {-0.50000006f,-0.86602533f}, +{-0.52249867f,-0.85264009f}, {-0.54463905f,-0.83867055f}, {-0.56640631f,-0.82412612f}, +{-0.58778518f,-0.80901700f}, {-0.60876143f,-0.79335332f}, {-0.62932050f,-0.77714586f}, +{-0.64944804f,-0.76040596f}, {-0.66913068f,-0.74314475f}, {-0.68835467f,-0.72537428f}, +{-0.70710677f,-0.70710677f}, {-0.72537446f,-0.68835449f}, {-0.74314493f,-0.66913044f}, +{-0.76040596f,-0.64944804f}, {-0.77714604f,-0.62932026f}, {-0.79335332f,-0.60876143f}, +{-0.80901700f,-0.58778518f}, {-0.82412624f,-0.56640613f}, {-0.83867055f,-0.54463899f}, +{-0.85264021f,-0.52249849f}, {-0.86602539f,-0.50000006f}, {-0.87881714f,-0.47715873f}, +{-0.89100659f,-0.45399037f}, {-0.90258527f,-0.43051112f}, {-0.91354549f,-0.40673658f}, +{-0.92387956f,-0.38268328f}, {-0.93358040f,-0.35836792f}, {-0.94264150f,-0.33380675f}, +{-0.95105654f,-0.30901679f}, {-0.95881975f,-0.28401530f}, {-0.96592587f,-0.25881892f}, +{-0.97236991f,-0.23344538f}, {-0.97814763f,-0.20791161f}, {-0.98325491f,-0.18223536f}, +{-0.98768836f,-0.15643445f}, {-0.99144489f,-0.13052608f}, {-0.99452192f,-0.10452849f}, +{-0.99691737f,-0.078459039f}, {-0.99862957f,-0.052335810f}, {-0.99965733f,-0.026176952f}, +{1.0000000f,-0.0000000f}, {0.99922901f,-0.039259817f}, {0.99691731f,-0.078459099f}, +{0.99306846f,-0.11753740f}, {0.98768836f,-0.15643448f}, {0.98078525f,-0.19509032f}, +{0.97236991f,-0.23344538f}, {0.96245521f,-0.27144045f}, {0.95105648f,-0.30901700f}, +{0.93819129f,-0.34611708f}, {0.92387956f,-0.38268346f}, {0.90814316f,-0.41865975f}, +{0.89100653f,-0.45399052f}, {0.87249601f,-0.48862126f}, {0.85264015f,-0.52249855f}, +{0.83146960f,-0.55557024f}, {0.80901700f,-0.58778524f}, {0.78531694f,-0.61909395f}, +{0.76040596f,-0.64944810f}, {0.73432249f,-0.67880076f}, {0.70710677f,-0.70710683f}, +{0.67880070f,-0.73432255f}, {0.64944804f,-0.76040596f}, {0.61909395f,-0.78531694f}, +{0.58778524f,-0.80901700f}, {0.55557019f,-0.83146960f}, {0.52249849f,-0.85264015f}, +{0.48862118f,-0.87249601f}, {0.45399052f,-0.89100653f}, {0.41865975f,-0.90814316f}, +{0.38268343f,-0.92387956f}, {0.34611705f,-0.93819135f}, {0.30901697f,-0.95105654f}, +{0.27144039f,-0.96245527f}, {0.23344530f,-0.97236991f}, {0.19509023f,-0.98078531f}, +{0.15643437f,-0.98768836f}, {0.11753740f,-0.99306846f}, {0.078459084f,-0.99691731f}, +{0.039259788f,-0.99922901f}, {-4.3711388e-08f,-1.0000000f}, {-0.039259877f,-0.99922901f}, +{-0.078459173f,-0.99691731f}, {-0.11753749f,-0.99306846f}, {-0.15643445f,-0.98768836f}, +{-0.19509032f,-0.98078525f}, {-0.23344538f,-0.97236991f}, {-0.27144048f,-0.96245521f}, +{-0.30901703f,-0.95105648f}, {-0.34611711f,-0.93819129f}, {-0.38268352f,-0.92387950f}, +{-0.41865984f,-0.90814310f}, {-0.45399061f,-0.89100647f}, {-0.48862135f,-0.87249595f}, +{-0.52249867f,-0.85264009f}, {-0.55557036f,-0.83146954f}, {-0.58778518f,-0.80901700f}, +{-0.61909389f,-0.78531694f}, {-0.64944804f,-0.76040596f}, {-0.67880076f,-0.73432249f}, +{-0.70710677f,-0.70710677f}, {-0.73432249f,-0.67880070f}, {-0.76040596f,-0.64944804f}, +{-0.78531694f,-0.61909389f}, {-0.80901700f,-0.58778518f}, {-0.83146966f,-0.55557019f}, +{-0.85264021f,-0.52249849f}, {-0.87249607f,-0.48862115f}, {-0.89100659f,-0.45399037f}, +{-0.90814322f,-0.41865960f}, {-0.92387956f,-0.38268328f}, {-0.93819135f,-0.34611690f}, +{-0.95105654f,-0.30901679f}, {-0.96245521f,-0.27144048f}, {-0.97236991f,-0.23344538f}, +{-0.98078531f,-0.19509031f}, {-0.98768836f,-0.15643445f}, {-0.99306846f,-0.11753736f}, +{-0.99691737f,-0.078459039f}, {-0.99922901f,-0.039259743f}, {-1.0000000f,8.7422777e-08f}, +{-0.99922901f,0.039259918f}, {-0.99691731f,0.078459218f}, {-0.99306846f,0.11753753f}, +{-0.98768830f,0.15643461f}, {-0.98078525f,0.19509049f}, {-0.97236985f,0.23344554f}, +{-0.96245515f,0.27144065f}, {-0.95105654f,0.30901697f}, {-0.93819135f,0.34611705f}, +{-0.92387956f,0.38268346f}, {-0.90814316f,0.41865975f}, {-0.89100653f,0.45399055f}, +{-0.87249601f,0.48862129f}, {-0.85264015f,0.52249861f}, {-0.83146960f,0.55557030f}, +{-0.80901694f,0.58778536f}, {-0.78531688f,0.61909401f}, {-0.76040590f,0.64944816f}, +{-0.73432243f,0.67880082f}, {-0.70710665f,0.70710689f}, {-0.67880058f,0.73432261f}, +{-0.64944792f,0.76040608f}, {-0.61909378f,0.78531706f}, {-0.58778507f,0.80901712f}, +{-0.55557001f,0.83146977f}, {-0.52249837f,0.85264033f}, {-0.48862100f,0.87249613f}, +{-0.45399022f,0.89100665f}, {-0.41865945f,0.90814328f}, {-0.38268313f,0.92387968f}, +{-0.34611672f,0.93819147f}, {-0.30901709f,0.95105648f}, {-0.27144054f,0.96245521f}, +{-0.23344545f,0.97236991f}, {-0.19509038f,0.98078525f}, {-0.15643452f,0.98768830f}, +{-0.11753743f,0.99306846f}, {-0.078459114f,0.99691731f}, {-0.039259821f,0.99922901f}, +}; +static const ne10_fft_cpx_float32_t ne10_twiddles_240[240] = { +{1.0000000f,0.0000000f}, {1.0000000f,-0.0000000f}, {1.0000000f,-0.0000000f}, +{1.0000000f,-0.0000000f}, {0.91354543f,-0.40673664f}, {0.66913056f,-0.74314487f}, +{1.0000000f,-0.0000000f}, {0.66913056f,-0.74314487f}, {-0.10452851f,-0.99452192f}, +{1.0000000f,-0.0000000f}, {0.30901697f,-0.95105654f}, {-0.80901700f,-0.58778518f}, +{1.0000000f,-0.0000000f}, {-0.10452851f,-0.99452192f}, {-0.97814757f,0.20791179f}, +{1.0000000f,-0.0000000f}, {0.99452192f,-0.10452846f}, {0.97814763f,-0.20791170f}, +{0.95105648f,-0.30901700f}, {0.91354543f,-0.40673664f}, {0.86602545f,-0.50000000f}, +{0.80901700f,-0.58778524f}, {0.74314475f,-0.66913062f}, {0.66913056f,-0.74314487f}, +{0.58778524f,-0.80901700f}, {0.49999997f,-0.86602545f}, {0.40673661f,-0.91354549f}, +{0.30901697f,-0.95105654f}, {0.20791166f,-0.97814763f}, {0.10452842f,-0.99452192f}, +{1.0000000f,-0.0000000f}, {0.97814763f,-0.20791170f}, {0.91354543f,-0.40673664f}, +{0.80901700f,-0.58778524f}, {0.66913056f,-0.74314487f}, {0.49999997f,-0.86602545f}, +{0.30901697f,-0.95105654f}, {0.10452842f,-0.99452192f}, {-0.10452851f,-0.99452192f}, +{-0.30901703f,-0.95105648f}, {-0.50000006f,-0.86602533f}, {-0.66913068f,-0.74314475f}, +{-0.80901700f,-0.58778518f}, {-0.91354549f,-0.40673658f}, {-0.97814763f,-0.20791161f}, +{1.0000000f,-0.0000000f}, {0.95105648f,-0.30901700f}, {0.80901700f,-0.58778524f}, +{0.58778524f,-0.80901700f}, {0.30901697f,-0.95105654f}, {-4.3711388e-08f,-1.0000000f}, +{-0.30901703f,-0.95105648f}, {-0.58778518f,-0.80901700f}, {-0.80901700f,-0.58778518f}, +{-0.95105654f,-0.30901679f}, {-1.0000000f,8.7422777e-08f}, {-0.95105654f,0.30901697f}, +{-0.80901694f,0.58778536f}, {-0.58778507f,0.80901712f}, {-0.30901709f,0.95105648f}, +{1.0000000f,-0.0000000f}, {0.99965733f,-0.026176950f}, {0.99862951f,-0.052335959f}, +{0.99691731f,-0.078459099f}, {0.99452192f,-0.10452846f}, {0.99144489f,-0.13052620f}, +{0.98768836f,-0.15643448f}, {0.98325491f,-0.18223552f}, {0.97814763f,-0.20791170f}, +{0.97236991f,-0.23344538f}, {0.96592581f,-0.25881904f}, {0.95881975f,-0.28401536f}, +{0.95105648f,-0.30901700f}, {0.94264150f,-0.33380687f}, {0.93358040f,-0.35836795f}, +{0.92387956f,-0.38268346f}, {0.91354543f,-0.40673664f}, {0.90258527f,-0.43051112f}, +{0.89100653f,-0.45399052f}, {0.87881708f,-0.47715878f}, {0.86602545f,-0.50000000f}, +{0.85264015f,-0.52249855f}, {0.83867055f,-0.54463905f}, {0.82412618f,-0.56640625f}, +{0.80901700f,-0.58778524f}, {0.79335332f,-0.60876143f}, {0.77714598f,-0.62932038f}, +{0.76040596f,-0.64944810f}, {0.74314475f,-0.66913062f}, {0.72537434f,-0.68835455f}, +{0.70710677f,-0.70710683f}, {0.68835455f,-0.72537440f}, {0.66913056f,-0.74314487f}, +{0.64944804f,-0.76040596f}, {0.62932038f,-0.77714598f}, {0.60876137f,-0.79335338f}, +{0.58778524f,-0.80901700f}, {0.56640625f,-0.82412618f}, {0.54463899f,-0.83867055f}, +{0.52249849f,-0.85264015f}, {0.49999997f,-0.86602545f}, {0.47715876f,-0.87881708f}, +{0.45399052f,-0.89100653f}, {0.43051103f,-0.90258533f}, {0.40673661f,-0.91354549f}, +{0.38268343f,-0.92387956f}, {0.35836786f,-0.93358046f}, {0.33380681f,-0.94264150f}, +{0.30901697f,-0.95105654f}, {0.28401533f,-0.95881975f}, {0.25881907f,-0.96592581f}, +{0.23344530f,-0.97236991f}, {0.20791166f,-0.97814763f}, {0.18223552f,-0.98325491f}, +{0.15643437f,-0.98768836f}, {0.13052613f,-0.99144489f}, {0.10452842f,-0.99452192f}, +{0.078459084f,-0.99691731f}, {0.052335974f,-0.99862951f}, {0.026176875f,-0.99965733f}, +{1.0000000f,-0.0000000f}, {0.99862951f,-0.052335959f}, {0.99452192f,-0.10452846f}, +{0.98768836f,-0.15643448f}, {0.97814763f,-0.20791170f}, {0.96592581f,-0.25881904f}, +{0.95105648f,-0.30901700f}, {0.93358040f,-0.35836795f}, {0.91354543f,-0.40673664f}, +{0.89100653f,-0.45399052f}, {0.86602545f,-0.50000000f}, {0.83867055f,-0.54463905f}, +{0.80901700f,-0.58778524f}, {0.77714598f,-0.62932038f}, {0.74314475f,-0.66913062f}, +{0.70710677f,-0.70710683f}, {0.66913056f,-0.74314487f}, {0.62932038f,-0.77714598f}, +{0.58778524f,-0.80901700f}, {0.54463899f,-0.83867055f}, {0.49999997f,-0.86602545f}, +{0.45399052f,-0.89100653f}, {0.40673661f,-0.91354549f}, {0.35836786f,-0.93358046f}, +{0.30901697f,-0.95105654f}, {0.25881907f,-0.96592581f}, {0.20791166f,-0.97814763f}, +{0.15643437f,-0.98768836f}, {0.10452842f,-0.99452192f}, {0.052335974f,-0.99862951f}, +{-4.3711388e-08f,-1.0000000f}, {-0.052336060f,-0.99862951f}, {-0.10452851f,-0.99452192f}, +{-0.15643445f,-0.98768836f}, {-0.20791174f,-0.97814757f}, {-0.25881916f,-0.96592581f}, +{-0.30901703f,-0.95105648f}, {-0.35836795f,-0.93358040f}, {-0.40673670f,-0.91354543f}, +{-0.45399061f,-0.89100647f}, {-0.50000006f,-0.86602533f}, {-0.54463905f,-0.83867055f}, +{-0.58778518f,-0.80901700f}, {-0.62932050f,-0.77714586f}, {-0.66913068f,-0.74314475f}, +{-0.70710677f,-0.70710677f}, {-0.74314493f,-0.66913044f}, {-0.77714604f,-0.62932026f}, +{-0.80901700f,-0.58778518f}, {-0.83867055f,-0.54463899f}, {-0.86602539f,-0.50000006f}, +{-0.89100659f,-0.45399037f}, {-0.91354549f,-0.40673658f}, {-0.93358040f,-0.35836792f}, +{-0.95105654f,-0.30901679f}, {-0.96592587f,-0.25881892f}, {-0.97814763f,-0.20791161f}, +{-0.98768836f,-0.15643445f}, {-0.99452192f,-0.10452849f}, {-0.99862957f,-0.052335810f}, +{1.0000000f,-0.0000000f}, {0.99691731f,-0.078459099f}, {0.98768836f,-0.15643448f}, +{0.97236991f,-0.23344538f}, {0.95105648f,-0.30901700f}, {0.92387956f,-0.38268346f}, +{0.89100653f,-0.45399052f}, {0.85264015f,-0.52249855f}, {0.80901700f,-0.58778524f}, +{0.76040596f,-0.64944810f}, {0.70710677f,-0.70710683f}, {0.64944804f,-0.76040596f}, +{0.58778524f,-0.80901700f}, {0.52249849f,-0.85264015f}, {0.45399052f,-0.89100653f}, +{0.38268343f,-0.92387956f}, {0.30901697f,-0.95105654f}, {0.23344530f,-0.97236991f}, +{0.15643437f,-0.98768836f}, {0.078459084f,-0.99691731f}, {-4.3711388e-08f,-1.0000000f}, +{-0.078459173f,-0.99691731f}, {-0.15643445f,-0.98768836f}, {-0.23344538f,-0.97236991f}, +{-0.30901703f,-0.95105648f}, {-0.38268352f,-0.92387950f}, {-0.45399061f,-0.89100647f}, +{-0.52249867f,-0.85264009f}, {-0.58778518f,-0.80901700f}, {-0.64944804f,-0.76040596f}, +{-0.70710677f,-0.70710677f}, {-0.76040596f,-0.64944804f}, {-0.80901700f,-0.58778518f}, +{-0.85264021f,-0.52249849f}, {-0.89100659f,-0.45399037f}, {-0.92387956f,-0.38268328f}, +{-0.95105654f,-0.30901679f}, {-0.97236991f,-0.23344538f}, {-0.98768836f,-0.15643445f}, +{-0.99691737f,-0.078459039f}, {-1.0000000f,8.7422777e-08f}, {-0.99691731f,0.078459218f}, +{-0.98768830f,0.15643461f}, {-0.97236985f,0.23344554f}, {-0.95105654f,0.30901697f}, +{-0.92387956f,0.38268346f}, {-0.89100653f,0.45399055f}, {-0.85264015f,0.52249861f}, +{-0.80901694f,0.58778536f}, {-0.76040590f,0.64944816f}, {-0.70710665f,0.70710689f}, +{-0.64944792f,0.76040608f}, {-0.58778507f,0.80901712f}, {-0.52249837f,0.85264033f}, +{-0.45399022f,0.89100665f}, {-0.38268313f,0.92387968f}, {-0.30901709f,0.95105648f}, +{-0.23344545f,0.97236991f}, {-0.15643452f,0.98768830f}, {-0.078459114f,0.99691731f}, +}; +static const ne10_fft_cpx_float32_t ne10_twiddles_120[120] = { +{1.0000000f,0.0000000f}, {1.0000000f,-0.0000000f}, {1.0000000f,-0.0000000f}, +{1.0000000f,-0.0000000f}, {0.91354543f,-0.40673664f}, {0.66913056f,-0.74314487f}, +{1.0000000f,-0.0000000f}, {0.66913056f,-0.74314487f}, {-0.10452851f,-0.99452192f}, +{1.0000000f,-0.0000000f}, {0.30901697f,-0.95105654f}, {-0.80901700f,-0.58778518f}, +{1.0000000f,-0.0000000f}, {-0.10452851f,-0.99452192f}, {-0.97814757f,0.20791179f}, +{1.0000000f,-0.0000000f}, {0.97814763f,-0.20791170f}, {0.91354543f,-0.40673664f}, +{0.80901700f,-0.58778524f}, {0.66913056f,-0.74314487f}, {0.49999997f,-0.86602545f}, +{0.30901697f,-0.95105654f}, {0.10452842f,-0.99452192f}, {-0.10452851f,-0.99452192f}, +{-0.30901703f,-0.95105648f}, {-0.50000006f,-0.86602533f}, {-0.66913068f,-0.74314475f}, +{-0.80901700f,-0.58778518f}, {-0.91354549f,-0.40673658f}, {-0.97814763f,-0.20791161f}, +{1.0000000f,-0.0000000f}, {0.99862951f,-0.052335959f}, {0.99452192f,-0.10452846f}, +{0.98768836f,-0.15643448f}, {0.97814763f,-0.20791170f}, {0.96592581f,-0.25881904f}, +{0.95105648f,-0.30901700f}, {0.93358040f,-0.35836795f}, {0.91354543f,-0.40673664f}, +{0.89100653f,-0.45399052f}, {0.86602545f,-0.50000000f}, {0.83867055f,-0.54463905f}, +{0.80901700f,-0.58778524f}, {0.77714598f,-0.62932038f}, {0.74314475f,-0.66913062f}, +{0.70710677f,-0.70710683f}, {0.66913056f,-0.74314487f}, {0.62932038f,-0.77714598f}, +{0.58778524f,-0.80901700f}, {0.54463899f,-0.83867055f}, {0.49999997f,-0.86602545f}, +{0.45399052f,-0.89100653f}, {0.40673661f,-0.91354549f}, {0.35836786f,-0.93358046f}, +{0.30901697f,-0.95105654f}, {0.25881907f,-0.96592581f}, {0.20791166f,-0.97814763f}, +{0.15643437f,-0.98768836f}, {0.10452842f,-0.99452192f}, {0.052335974f,-0.99862951f}, +{1.0000000f,-0.0000000f}, {0.99452192f,-0.10452846f}, {0.97814763f,-0.20791170f}, +{0.95105648f,-0.30901700f}, {0.91354543f,-0.40673664f}, {0.86602545f,-0.50000000f}, +{0.80901700f,-0.58778524f}, {0.74314475f,-0.66913062f}, {0.66913056f,-0.74314487f}, +{0.58778524f,-0.80901700f}, {0.49999997f,-0.86602545f}, {0.40673661f,-0.91354549f}, +{0.30901697f,-0.95105654f}, {0.20791166f,-0.97814763f}, {0.10452842f,-0.99452192f}, +{-4.3711388e-08f,-1.0000000f}, {-0.10452851f,-0.99452192f}, {-0.20791174f,-0.97814757f}, +{-0.30901703f,-0.95105648f}, {-0.40673670f,-0.91354543f}, {-0.50000006f,-0.86602533f}, +{-0.58778518f,-0.80901700f}, {-0.66913068f,-0.74314475f}, {-0.74314493f,-0.66913044f}, +{-0.80901700f,-0.58778518f}, {-0.86602539f,-0.50000006f}, {-0.91354549f,-0.40673658f}, +{-0.95105654f,-0.30901679f}, {-0.97814763f,-0.20791161f}, {-0.99452192f,-0.10452849f}, +{1.0000000f,-0.0000000f}, {0.98768836f,-0.15643448f}, {0.95105648f,-0.30901700f}, +{0.89100653f,-0.45399052f}, {0.80901700f,-0.58778524f}, {0.70710677f,-0.70710683f}, +{0.58778524f,-0.80901700f}, {0.45399052f,-0.89100653f}, {0.30901697f,-0.95105654f}, +{0.15643437f,-0.98768836f}, {-4.3711388e-08f,-1.0000000f}, {-0.15643445f,-0.98768836f}, +{-0.30901703f,-0.95105648f}, {-0.45399061f,-0.89100647f}, {-0.58778518f,-0.80901700f}, +{-0.70710677f,-0.70710677f}, {-0.80901700f,-0.58778518f}, {-0.89100659f,-0.45399037f}, +{-0.95105654f,-0.30901679f}, {-0.98768836f,-0.15643445f}, {-1.0000000f,8.7422777e-08f}, +{-0.98768830f,0.15643461f}, {-0.95105654f,0.30901697f}, {-0.89100653f,0.45399055f}, +{-0.80901694f,0.58778536f}, {-0.70710665f,0.70710689f}, {-0.58778507f,0.80901712f}, +{-0.45399022f,0.89100665f}, {-0.30901709f,0.95105648f}, {-0.15643452f,0.98768830f}, +}; +static const ne10_fft_cpx_float32_t ne10_twiddles_60[60] = { +{1.0000000f,0.0000000f}, {1.0000000f,-0.0000000f}, {1.0000000f,-0.0000000f}, +{1.0000000f,-0.0000000f}, {0.91354543f,-0.40673664f}, {0.66913056f,-0.74314487f}, +{1.0000000f,-0.0000000f}, {0.66913056f,-0.74314487f}, {-0.10452851f,-0.99452192f}, +{1.0000000f,-0.0000000f}, {0.30901697f,-0.95105654f}, {-0.80901700f,-0.58778518f}, +{1.0000000f,-0.0000000f}, {-0.10452851f,-0.99452192f}, {-0.97814757f,0.20791179f}, +{1.0000000f,-0.0000000f}, {0.99452192f,-0.10452846f}, {0.97814763f,-0.20791170f}, +{0.95105648f,-0.30901700f}, {0.91354543f,-0.40673664f}, {0.86602545f,-0.50000000f}, +{0.80901700f,-0.58778524f}, {0.74314475f,-0.66913062f}, {0.66913056f,-0.74314487f}, +{0.58778524f,-0.80901700f}, {0.49999997f,-0.86602545f}, {0.40673661f,-0.91354549f}, +{0.30901697f,-0.95105654f}, {0.20791166f,-0.97814763f}, {0.10452842f,-0.99452192f}, +{1.0000000f,-0.0000000f}, {0.97814763f,-0.20791170f}, {0.91354543f,-0.40673664f}, +{0.80901700f,-0.58778524f}, {0.66913056f,-0.74314487f}, {0.49999997f,-0.86602545f}, +{0.30901697f,-0.95105654f}, {0.10452842f,-0.99452192f}, {-0.10452851f,-0.99452192f}, +{-0.30901703f,-0.95105648f}, {-0.50000006f,-0.86602533f}, {-0.66913068f,-0.74314475f}, +{-0.80901700f,-0.58778518f}, {-0.91354549f,-0.40673658f}, {-0.97814763f,-0.20791161f}, +{1.0000000f,-0.0000000f}, {0.95105648f,-0.30901700f}, {0.80901700f,-0.58778524f}, +{0.58778524f,-0.80901700f}, {0.30901697f,-0.95105654f}, {-4.3711388e-08f,-1.0000000f}, +{-0.30901703f,-0.95105648f}, {-0.58778518f,-0.80901700f}, {-0.80901700f,-0.58778518f}, +{-0.95105654f,-0.30901679f}, {-1.0000000f,8.7422777e-08f}, {-0.95105654f,0.30901697f}, +{-0.80901694f,0.58778536f}, {-0.58778507f,0.80901712f}, {-0.30901709f,0.95105648f}, +}; +static const ne10_fft_state_float32_t ne10_fft_state_float32_480 = { +120, +(ne10_int32_t *)ne10_factors_480, +(ne10_fft_cpx_float32_t *)ne10_twiddles_480, +NULL, +(ne10_fft_cpx_float32_t *)&ne10_twiddles_480[120], +/* is_forward_scaled = true */ +(ne10_int32_t) 1, +/* is_backward_scaled = false */ +(ne10_int32_t) 0, +}; +static const arch_fft_state cfg_arch_480 = { +1, +(void *)&ne10_fft_state_float32_480, +}; + +static const ne10_fft_state_float32_t ne10_fft_state_float32_240 = { +60, +(ne10_int32_t *)ne10_factors_240, +(ne10_fft_cpx_float32_t *)ne10_twiddles_240, +NULL, +(ne10_fft_cpx_float32_t *)&ne10_twiddles_240[60], +/* is_forward_scaled = true */ +(ne10_int32_t) 1, +/* is_backward_scaled = false */ +(ne10_int32_t) 0, +}; +static const arch_fft_state cfg_arch_240 = { +1, +(void *)&ne10_fft_state_float32_240, +}; + +static const ne10_fft_state_float32_t ne10_fft_state_float32_120 = { +30, +(ne10_int32_t *)ne10_factors_120, +(ne10_fft_cpx_float32_t *)ne10_twiddles_120, +NULL, +(ne10_fft_cpx_float32_t *)&ne10_twiddles_120[30], +/* is_forward_scaled = true */ +(ne10_int32_t) 1, +/* is_backward_scaled = false */ +(ne10_int32_t) 0, +}; +static const arch_fft_state cfg_arch_120 = { +1, +(void *)&ne10_fft_state_float32_120, +}; + +static const ne10_fft_state_float32_t ne10_fft_state_float32_60 = { +15, +(ne10_int32_t *)ne10_factors_60, +(ne10_fft_cpx_float32_t *)ne10_twiddles_60, +NULL, +(ne10_fft_cpx_float32_t *)&ne10_twiddles_60[15], +/* is_forward_scaled = true */ +(ne10_int32_t) 1, +/* is_backward_scaled = false */ +(ne10_int32_t) 0, +}; +static const arch_fft_state cfg_arch_60 = { +1, +(void *)&ne10_fft_state_float32_60, +}; + +#endif /* end NE10_FFT_PARAMS48000_960 */ diff --git a/celt/tests/test_unit_dft.c b/celt/tests/test_unit_dft.c index 57db0e3..991ece3 100644 --- a/celt/tests/test_unit_dft.c +++ b/celt/tests/test_unit_dft.c @@ -45,6 +45,23 @@ #include "mathops.c" #include "entcode.c" +#if defined(OPUS_HAVE_RTCD) && \ + (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR)) +#include "arm/armcpu.c" +#if !defined(FIXED_POINT) +#if defined(HAVE_ARM_NE10) +#include "mdct.c" +#include "arm/celt_ne10_fft.c" +#include "arm/celt_ne10_mdct.c" +#endif +#include "celt_lpc.c" +#include "pitch.c" +#include "arm/celt_neon_intr.c" +#include "arm/arm_celt_map.c" +#endif +#elif defined(OPUS_X86_MAY_HAVE_SSE2) || defined(OPUS_X86_MAY_HAVE_SSE4_1) +#include "x86/x86cpu.c" +#endif #ifndef M_PI #define M_PI 3.141592653 @@ -93,13 +110,13 @@ void check(kiss_fft_cpx * in,kiss_fft_cpx * out,int nfft,int isinverse) } } -void test1d(int nfft,int isinverse) +void test1d(int nfft,int isinverse,int arch) { size_t buflen = sizeof(kiss_fft_cpx)*nfft; kiss_fft_cpx * in = (kiss_fft_cpx*)malloc(buflen); kiss_fft_cpx * out= (kiss_fft_cpx*)malloc(buflen); - kiss_fft_state *cfg = opus_fft_alloc(nfft,0,0); + kiss_fft_state *cfg = opus_fft_alloc(nfft,0,0,arch); int k; for (k=0;k<nfft;++k) { @@ -125,7 +142,7 @@ void test1d(int nfft,int isinverse) if (isinverse) opus_ifft(cfg,in,out); else - opus_fft(cfg,in,out); + opus_fft(cfg,in,out, arch); /*for (k=0;k<nfft;++k) printf("%d %d ", out[k].r, out[k].i);printf("\n");*/ @@ -139,26 +156,28 @@ void test1d(int nfft,int isinverse) int main(int argc,char ** argv) { ALLOC_STACK; + int arch = opus_select_arch(); + if (argc>1) { int k; for (k=1;k<argc;++k) { - test1d(atoi(argv[k]),0); - test1d(atoi(argv[k]),1); + test1d(atoi(argv[k]),0,arch); + test1d(atoi(argv[k]),1,arch); } }else{ - test1d(32,0); - test1d(32,1); - test1d(128,0); - test1d(128,1); - test1d(256,0); - test1d(256,1); + test1d(32,0,arch); + test1d(32,1,arch); + test1d(128,0,arch); + test1d(128,1,arch); + test1d(256,0,arch); + test1d(256,1,arch); #ifndef RADIX_TWO_ONLY - test1d(36,0); - test1d(36,1); - test1d(50,0); - test1d(50,1); - test1d(120,0); - test1d(120,1); + test1d(36,0,arch); + test1d(36,1,arch); + test1d(50,0,arch); + test1d(50,1,arch); + test1d(120,0,arch); + test1d(120,1,arch); #endif } return ret; diff --git a/celt/tests/test_unit_mathops.c b/celt/tests/test_unit_mathops.c index b9b1bcf..5d2e8e4 100644 --- a/celt/tests/test_unit_mathops.c +++ b/celt/tests/test_unit_mathops.c @@ -60,6 +60,12 @@ || defined(OPUS_ARM_NEON_INTR)) #if defined(OPUS_ARM_NEON_INTR) #include "arm/celt_neon_intr.c" +#if defined(HAVE_ARM_NE10) +#include "kiss_fft.c" +#include "mdct.c" +#include "arm/celt_ne10_fft.c" +#include "arm/celt_ne10_mdct.c" +#endif #endif #include "arm/arm_celt_map.c" #endif diff --git a/celt/tests/test_unit_mdct.c b/celt/tests/test_unit_mdct.c index ac8957f..a1c92d0 100644 --- a/celt/tests/test_unit_mdct.c +++ b/celt/tests/test_unit_mdct.c @@ -46,6 +46,24 @@ #include "mathops.c" #include "entcode.c" +#if defined(OPUS_HAVE_RTCD) && \ + (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR)) +#include "arm/armcpu.c" +#if !defined(FIXED_POINT) +#if defined(HAVE_ARM_NE10) +#include "arm/celt_ne10_fft.c" +#include "arm/celt_ne10_mdct.c" +#endif +#include "pitch.c" +#include "celt_lpc.c" +#include "arm/celt_neon_intr.c" +#include "arm/arm_celt_map.c" +#endif + +#elif defined(OPUS_X86_MAY_HAVE_SSE2) || defined(OPUS_X86_MAY_HAVE_SSE4_1) +#include "x86/x86cpu.c" +#endif + #ifndef M_PI #define M_PI 3.141592653 #endif @@ -112,7 +130,7 @@ void check_inv(kiss_fft_scalar * in,kiss_fft_scalar * out,int nfft,int isinver } -void test1d(int nfft,int isinverse) +void test1d(int nfft,int isinverse,int arch) { mdct_lookup cfg; size_t buflen = sizeof(kiss_fft_scalar)*nfft; @@ -123,7 +141,7 @@ void test1d(int nfft,int isinverse) opus_val16 * window= (opus_val16*)malloc(sizeof(opus_val16)*nfft/2); int k; - clt_mdct_init(&cfg, nfft, 0); + clt_mdct_init(&cfg, nfft, 0, arch); for (k=0;k<nfft;++k) { in[k] = (rand() % 32768) - 16384; } @@ -156,7 +174,7 @@ void test1d(int nfft,int isinverse) out[nfft-k-1] = out[nfft/2+k]; check_inv(in,out,nfft,isinverse); } else { - clt_mdct_forward(&cfg,in,out,window, nfft/2, 0, 1); + clt_mdct_forward(&cfg,in,out,window, nfft/2, 0, 1, arch); check(in_copy,out,nfft,isinverse); } /*for (k=0;k<nfft;++k) printf("%d %d ", out[k].r, out[k].i);printf("\n");*/ @@ -164,46 +182,48 @@ void test1d(int nfft,int isinverse) free(in); free(out); - clt_mdct_clear(&cfg); + clt_mdct_clear(&cfg, arch); } int main(int argc,char ** argv) { ALLOC_STACK; + int arch = opus_select_arch(); + if (argc>1) { int k; for (k=1;k<argc;++k) { - test1d(atoi(argv[k]),0); - test1d(atoi(argv[k]),1); + test1d(atoi(argv[k]),0,arch); + test1d(atoi(argv[k]),1,arch); } }else{ - test1d(32,0); - test1d(32,1); - test1d(256,0); - test1d(256,1); - test1d(512,0); - test1d(512,1); - test1d(1024,0); - test1d(1024,1); - test1d(2048,0); - test1d(2048,1); + test1d(32,0,arch); + test1d(32,1,arch); + test1d(256,0,arch); + test1d(256,1,arch); + test1d(512,0,arch); + test1d(512,1,arch); + test1d(1024,0,arch); + test1d(1024,1,arch); + test1d(2048,0,arch); + test1d(2048,1,arch); #ifndef RADIX_TWO_ONLY - test1d(36,0); - test1d(36,1); - test1d(40,0); - test1d(40,1); - test1d(60,0); - test1d(60,1); - test1d(120,0); - test1d(120,1); - test1d(240,0); - test1d(240,1); - test1d(480,0); - test1d(480,1); - test1d(960,0); - test1d(960,1); - test1d(1920,0); - test1d(1920,1); + test1d(36,0,arch); + test1d(36,1,arch); + test1d(40,0,arch); + test1d(40,1,arch); + test1d(60,0,arch); + test1d(60,1,arch); + test1d(120,0,arch); + test1d(120,1,arch); + test1d(240,0,arch); + test1d(240,1,arch); + test1d(480,0,arch); + test1d(480,1,arch); + test1d(960,0,arch); + test1d(960,1,arch); + test1d(1920,0,arch); + test1d(1920,1,arch); #endif } return ret; diff --git a/celt/tests/test_unit_rotation.c b/celt/tests/test_unit_rotation.c index 5507884..fb18df0 100644 --- a/celt/tests/test_unit_rotation.c +++ b/celt/tests/test_unit_rotation.c @@ -59,6 +59,12 @@ #if defined(OPUS_ARM_NEON_INTR) #include "arm/celt_neon_intr.c" #endif +#if defined(HAVE_ARM_NE10) +#include "kiss_fft.c" +#include "mdct.c" +#include "arm/celt_ne10_fft.c" +#include "arm/celt_ne10_mdct.c" +#endif #include "arm/arm_celt_map.c" #endif diff --git a/celt_headers.mk b/celt_headers.mk index d422e09..5dc9e1e 100644 --- a/celt_headers.mk +++ b/celt_headers.mk @@ -31,12 +31,15 @@ celt/stack_alloc.h \ celt/vq.h \ celt/static_modes_float.h \ celt/static_modes_fixed.h \ +celt/static_modes_float_arm_ne10.h \ celt/arm/armcpu.h \ celt/arm/fixed_armv4.h \ celt/arm/fixed_armv5e.h \ celt/arm/kiss_fft_armv4.h \ celt/arm/kiss_fft_armv5e.h \ celt/arm/pitch_arm.h \ +celt/arm/fft_arm.h \ +celt/arm/mdct_arm.h \ celt/mips/celt_mipsr1.h \ celt/mips/fixed_generic_mipsr1.h \ celt/mips/kiss_fft_mipsr1.h \ diff --git a/celt_sources.mk b/celt_sources.mk index 29ec937..7121301 100644 --- a/celt_sources.mk +++ b/celt_sources.mk @@ -35,3 +35,7 @@ celt/arm/armopts.s.in CELT_SOURCES_ARM_NEON_INTR = \ celt/arm/celt_neon_intr.c + +CELT_SOURCES_ARM_NE10= \ +celt/arm/celt_ne10_fft.c \ +celt/arm/celt_ne10_mdct.c diff --git a/configure.ac b/configure.ac index 87cece9..baa3425 100644 --- a/configure.ac +++ b/configure.ac @@ -351,6 +351,80 @@ AM_CONDITIONAL([OPUS_ARM_EXTERNAL_ASM], AM_CONDITIONAL([HAVE_SSE4_1], [false]) AM_CONDITIONAL([HAVE_SSE2], [false]) +AC_DEFUN([OPUS_PATH_NE10], + [ + AC_ARG_WITH(NE10, + AC_HELP_STRING([--with-NE10=PFX],[Prefix where libNE10 is installed (optional)]), + NE10_prefix="$withval", NE10_prefix="") + AC_ARG_WITH(NE10-libraries, + AC_HELP_STRING([--with-NE10-libraries=DIR], + [Directory where libNE10 library is installed (optional)]), + NE10_libraries="$withval", NE10_libraries="") + AC_ARG_WITH(NE10-includes, + AC_HELP_STRING([--with-NE10-includes=DIR], + [Directory where libNE10 header files are installed (optional)]), + NE10_includes="$withval", ogg_includes="") + + if test "x$NE10_libraries" != "x" ; then + NE10_LIBS="-L$NE10_libraries" + elif test "x$NE10_prefix" = "xno" || test "x$NE10_prefix" = "xyes" ; then + NE10_LIBS="" + elif test "x$NE10_prefix" != "x" ; then + NE10_LIBS="-L$NE10_prefix/lib" + elif test "x$prefix" != "xNONE" ; then + NE10_LIBS="-L$prefix/lib" + fi + + if test "x$NE10_prefix" != "xno" ; then + NE10_LIBS="$NE10_LIBS -lNE10" + fi + + if test "x$NE10_includes" != "x" ; then + NE10_CFLAGS="-I$NE10_includes" + elif test "x$NE10_prefix" = "xno" || test "x$NE10_prefix" = "xyes" ; then + NE10_CFLAGS="" + elif test "x$ogg_prefix" != "x" ; then + NE10_CFLAGS="-I$NE10_prefix/include" + elif test "x$prefix" != "xNONE"; then + NE10_CFLAGS="-I$prefix/include" + fi + + AC_MSG_CHECKING(for NE10) + save_CFLAGS="$CFLAGS"; CFLAGS="$NE10_CFLAGS" + save_LIBS="$LIBS"; LIBS="$NE10_LIBS" + AC_LINK_IFELSE( + [ + AC_LANG_PROGRAM( + [[#include <NE10_init.h> + ]], + [[ + ne10_fft_cfg_float32_t cfg; + cfg = ne10_fft_alloc_c2c_float32_neon(480); + ]] + ) + ],[ + HAVE_ARM_NE10=1 + AC_MSG_RESULT([yes]) + ],[ + HAVE_ARM_NE10=0 + AC_MSG_RESULT([no]) + NE10_CFLAGS="" + NE10_LIBS="" + ] + ) + CFLAGS="$save_CFLAGS"; LIBS="$save_LIBS" + #Now we know if libNE10 is installed or not + AS_IF([test x"$HAVE_ARM_NE10" = x"1"], + [ + AC_DEFINE([HAVE_ARM_NE10], 1, [NE10 library is installed on host. Make sure it is on target!]) + AC_SUBST(HAVE_ARM_NE10) + AC_SUBST(NE10_CFLAGS) + AC_SUBST(NE10_LIBS) + ],[] + ) + ] +) + AS_IF([test x"$enable_intrinsics" = x"yes"],[ case $host_cpu in arm*) @@ -391,6 +465,10 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ AC_DEFINE([OPUS_ARM_MAY_HAVE_EDSP], 1, [Define if compiler support EDSP Instructions]) AC_DEFINE([OPUS_ARM_MAY_HAVE_MEDIA], 1, [Define if compiler support MEDIA Instructions]) AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON], 1, [Define if compiler support NEON instructions]) + + OPUS_PATH_NE10() + AS_IF([test x"$NE10_LIBS" != "x"], + [enable_intrinsics="$enable_intrinsics NE10"],[]) ], [ AC_MSG_WARN([Compiler does not support ARM intrinsics]) @@ -516,6 +594,9 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ AM_CONDITIONAL([CPU_ARM], [test "$cpu_arm" = "yes"]) AM_CONDITIONAL([OPUS_ARM_NEON_INTR], [test x"$OPUS_ARM_NEON_INTR" = x"1"]) +AM_CONDITIONAL([HAVE_ARM_NE10], + [test x"$HAVE_ARM_NE10" = x"1"]) + AS_IF([test x"$enable_rtcd" = x"yes"],[ AS_IF([test x"$rtcd_support" != x"no"],[ diff --git a/src/analysis.c b/src/analysis.c index 2ee8533..e04b282 100644 --- a/src/analysis.c +++ b/src/analysis.c @@ -189,7 +189,7 @@ void tonality_get_info(TonalityAnalysisState *tonal, AnalysisInfo *info_out, int info_out->music_prob = psum; } -static void tonality_analysis(TonalityAnalysisState *tonal, const CELTMode *celt_mode, const void *x, int len, int offset, int c1, int c2, int C, int lsb_depth, downmix_func downmix) +static void tonality_analysis(TonalityAnalysisState *tonal, const CELTMode *celt_mode, const void *x, int len, int offset, int c1, int c2, int C, int lsb_depth, downmix_func downmix, int arch) { int i, b; const kiss_fft_state *kfft; @@ -262,7 +262,7 @@ static void tonality_analysis(TonalityAnalysisState *tonal, const CELTMode *celt remaining = len - (ANALYSIS_BUF_SIZE-tonal->mem_fill); downmix(x, &tonal->inmem[240], remaining, offset+ANALYSIS_BUF_SIZE-tonal->mem_fill, c1, c2, C); tonal->mem_fill = 240 + remaining; - opus_fft(kfft, in, out); + opus_fft(kfft, in, out, arch); #ifndef FIXED_POINT /* If there's any NaN on the input, the entire output will be NaN, so we only need to check one value. */ if (celt_isnan(out[0].r)) @@ -635,7 +635,7 @@ static void tonality_analysis(TonalityAnalysisState *tonal, const CELTMode *celt void run_analysis(TonalityAnalysisState *analysis, const CELTMode *celt_mode, const void *analysis_pcm, int analysis_frame_size, int frame_size, int c1, int c2, int C, opus_int32 Fs, - int lsb_depth, downmix_func downmix, AnalysisInfo *analysis_info) + int lsb_depth, downmix_func downmix, AnalysisInfo *analysis_info, int arch) { int offset; int pcm_len; @@ -648,7 +648,7 @@ void run_analysis(TonalityAnalysisState *analysis, const CELTMode *celt_mode, co pcm_len = analysis_frame_size - analysis->analysis_offset; offset = analysis->analysis_offset; do { - tonality_analysis(analysis, celt_mode, analysis_pcm, IMIN(480, pcm_len), offset, c1, c2, C, lsb_depth, downmix); + tonality_analysis(analysis, celt_mode, analysis_pcm, IMIN(480, pcm_len), offset, c1, c2, C, lsb_depth, downmix, arch); offset += 480; pcm_len -= 480; } while (pcm_len>0); diff --git a/src/analysis.h b/src/analysis.h index 85a73d7..9c328e8 100644 --- a/src/analysis.h +++ b/src/analysis.h @@ -82,6 +82,6 @@ void tonality_get_info(TonalityAnalysisState *tonal, AnalysisInfo *info_out, int void run_analysis(TonalityAnalysisState *analysis, const CELTMode *celt_mode, const void *analysis_pcm, int analysis_frame_size, int frame_size, int c1, int c2, int C, opus_int32 Fs, - int lsb_depth, downmix_func downmix, AnalysisInfo *analysis_info); + int lsb_depth, downmix_func downmix, AnalysisInfo *analysis_info, int arch); #endif diff --git a/src/opus_encoder.c b/src/opus_encoder.c index d94163f..4656da5 100644 --- a/src/opus_encoder.c +++ b/src/opus_encoder.c @@ -1006,7 +1006,7 @@ opus_int32 opus_encode_native(OpusEncoder *st, const opus_val16 *pcm, int frame_ analysis_read_subframe_bak = st->analysis.read_subframe; run_analysis(&st->analysis, celt_mode, analysis_pcm, analysis_size, frame_size, c1, c2, analysis_channels, st->Fs, - lsb_depth, downmix, &analysis_info); + lsb_depth, downmix, &analysis_info, st->arch); } #else (void)analysis_pcm; diff --git a/src/opus_multistream_encoder.c b/src/opus_multistream_encoder.c index 6e87337..1281e85 100644 --- a/src/opus_multistream_encoder.c +++ b/src/opus_multistream_encoder.c @@ -71,6 +71,7 @@ typedef void (*opus_copy_channel_in_func)( struct OpusMSEncoder { ChannelLayout layout; + int arch; int lfe_stream; int application; int variable_duration; @@ -218,7 +219,7 @@ opus_val16 logSum(opus_val16 a, opus_val16 b) #endif void surround_analysis(const CELTMode *celt_mode, const void *pcm, opus_val16 *bandLogE, opus_val32 *mem, opus_val32 *preemph_mem, - int len, int overlap, int channels, int rate, opus_copy_channel_in_func copy_channel_in + int len, int overlap, int channels, int rate, opus_copy_channel_in_func copy_channel_in, int arch ) { int c; @@ -257,7 +258,8 @@ void surround_analysis(const CELTMode *celt_mode, const void *pcm, opus_val16 *b OPUS_COPY(in, mem+c*overlap, overlap); (*copy_channel_in)(x, 1, pcm, channels, c, len); celt_preemphasis(x, in+overlap, frame_size, 1, upsample, celt_mode->preemph, preemph_mem+c, 0); - clt_mdct_forward(&celt_mode->mdct, in, freq, celt_mode->window, overlap, celt_mode->maxLM-LM, 1); + clt_mdct_forward(&celt_mode->mdct, in, freq, celt_mode->window, + overlap, celt_mode->maxLM-LM, 1, arch); if (upsample != 1) { int bound = len; @@ -411,6 +413,7 @@ static int opus_multistream_encoder_init_impl( (streams<1) || (coupled_streams<0) || (streams>255-coupled_streams)) return OPUS_BAD_ARG; + st->arch = opus_select_arch(); st->layout.nb_channels = channels; st->layout.nb_streams = streams; st->layout.nb_coupled_streams = coupled_streams; @@ -767,7 +770,7 @@ static int opus_multistream_encode_native ALLOC(bandSMR, 21*st->layout.nb_channels, opus_val16); if (st->surround) { - surround_analysis(celt_mode, pcm, bandSMR, mem, preemph_mem, frame_size, 120, st->layout.nb_channels, Fs, copy_channel_in); + surround_analysis(celt_mode, pcm, bandSMR, mem, preemph_mem, frame_size, 120, st->layout.nb_channels, Fs, copy_channel_in, st->arch); } /* Compute bitrate allocation between streams (this could be a lot better) */ -- 1.9.1
Viswanath Puttagunta
2015-Mar-31 22:57 UTC
[opus] [RFC PATCH v1 2/5] armv7(float): Optimize decode usecase using NE10 library
Optimize opus decode (float only) use case using ARM NE10. Mainly effects opus_ifft and ctl_mdct_backward and related functions. Work based on previous Encode optimization using ARM NE10 library. TBD: Add commit id of upstream Encode NE10 optimization patch so that users have reference of how to enable this optimization Signed-off-by: Viswanath Puttagunta <viswanath.puttagunta at linaro.org> --- celt/arm/arm_celt_map.c | 22 ++++++++++ celt/arm/celt_ne10_fft.c | 26 +++++++++++ celt/arm/celt_ne10_mdct.c | 102 ++++++++++++++++++++++++++++++++++++++++++++ celt/arm/fft_arm.h | 8 ++++ celt/arm/mdct_arm.h | 7 +++ celt/celt_decoder.c | 18 ++++---- celt/celt_encoder.c | 3 +- celt/kiss_fft.c | 4 +- celt/kiss_fft.h | 13 +++++- celt/mdct.c | 5 ++- celt/mdct.h | 22 ++++++++-- celt/tests/test_unit_dft.c | 2 +- celt/tests/test_unit_mdct.c | 2 +- 13 files changed, 214 insertions(+), 20 deletions(-) diff --git a/celt/arm/arm_celt_map.c b/celt/arm/arm_celt_map.c index 3b49f90..f132fe1 100644 --- a/celt/arm/arm_celt_map.c +++ b/celt/arm/arm_celt_map.c @@ -79,6 +79,15 @@ void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg, opus_fft_float_neon /* Neon with NE10 */ }; +void (*const OPUS_IFFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg, + const kiss_fft_cpx *fin, + kiss_fft_cpx *fout) = { + opus_ifft_c, /* ARMv4 */ + opus_ifft_c, /* EDSP */ + opus_ifft_c, /* Media */ + opus_ifft_float_neon /* Neon with NE10 */ +}; + void (*const CLT_MDCT_FORWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scalar * OPUS_RESTRICT out, @@ -90,6 +99,19 @@ void (*const CLT_MDCT_FORWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l, clt_mdct_forward_c, /* Media */ clt_mdct_forward_float_neon /* Neon with NE10 */ }; + +void (*const CLT_MDCT_BACKWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l, + kiss_fft_scalar *in, + kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 *window, + int overlap, int shift, + int stride, int arch) = { + clt_mdct_backward_c, /* ARMv4 */ + clt_mdct_backward_c, /* EDSP */ + clt_mdct_backward_c, /* Media */ + clt_mdct_backward_float_neon /* Neon with NE10 */ +}; + #endif /* HAVE_ARM_NE10 */ # endif /* OPUS_ARM_NEON_INTR */ # endif /* FIXED_POINT */ diff --git a/celt/arm/celt_ne10_fft.c b/celt/arm/celt_ne10_fft.c index b592f19..d354502 100644 --- a/celt/arm/celt_ne10_fft.c +++ b/celt/arm/celt_ne10_fft.c @@ -118,3 +118,29 @@ void opus_fft_float_neon(const kiss_fft_state *st, } RESTORE_STACK; } + +void opus_ifft_float_neon(const kiss_fft_state *st, + const kiss_fft_cpx *fin, + kiss_fft_cpx *fout) +{ + ne10_fft_state_float32_t state; + ne10_fft_cfg_float32_t cfg = &state; + VARDECL(ne10_fft_cpx_float32_t, buffer); + SAVE_STACK; + ALLOC(buffer, st->nfft, ne10_fft_cpx_float32_t); + + if (!st->arch_fft->is_supported) { + /* This nfft length (scaled fft) not supported in NE10 */ + opus_ifft_c(st, fin, fout); + } + else { + memcpy((void *)cfg, st->arch_fft->priv, sizeof(ne10_fft_state_float32_t)); + state.buffer = (ne10_fft_cpx_float32_t *)&buffer[0]; + state.is_backward_scaled = 0; + + ne10_fft_c2c_1d_float32_neon((ne10_fft_cpx_float32_t *)fout, + (ne10_fft_cpx_float32_t *)fin, + cfg, 1); + } + RESTORE_STACK; +} diff --git a/celt/arm/celt_ne10_mdct.c b/celt/arm/celt_ne10_mdct.c index cf175cb..0979cbe 100644 --- a/celt/arm/celt_ne10_mdct.c +++ b/celt/arm/celt_ne10_mdct.c @@ -156,3 +156,105 @@ void clt_mdct_forward_float_neon(const mdct_lookup *l, } RESTORE_STACK; } + +void clt_mdct_backward_float_neon(const mdct_lookup *l, + kiss_fft_scalar *in, + kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 * OPUS_RESTRICT window, + int overlap, int shift, int stride, int arch) +{ + int i; + int N, N2, N4; + VARDECL(kiss_fft_scalar, f); + const kiss_twiddle_scalar *trig; + const kiss_fft_state *st = l->kfft[shift]; + + N = l->n; + trig = l->trig; + for (i=0;i<shift;i++) + { + N >>= 1; + trig += N; + } + N2 = N>>1; + N4 = N>>2; + + ALLOC(f, N2, kiss_fft_scalar); + + /* Pre-rotate */ + { + /* Temp pointers to make it really clear to the compiler what we're doing */ + const kiss_fft_scalar * OPUS_RESTRICT xp1 = in; + const kiss_fft_scalar * OPUS_RESTRICT xp2 = in+stride*(N2-1); + kiss_fft_scalar * OPUS_RESTRICT yp = f; + const kiss_twiddle_scalar * OPUS_RESTRICT t = &trig[0]; + for(i=0;i<N4;i++) + { + kiss_fft_scalar yr, yi; + yr = S_MUL(*xp2, t[i]) + S_MUL(*xp1, t[N4+i]); + yi = S_MUL(*xp1, t[i]) - S_MUL(*xp2, t[N4+i]); + yp[2*i] = yr; + yp[2*i+1] = yi; + xp1+=2*stride; + xp2-=2*stride; + } + } + + opus_ifft(st, (kiss_fft_cpx *)f, (kiss_fft_cpx*)(out+(overlap>>1)), arch); + + /* Post-rotate and de-shuffle from both ends of the buffer at once to make + it in-place. */ + { + kiss_fft_scalar * yp0 = out+(overlap>>1); + kiss_fft_scalar * yp1 = out+(overlap>>1)+N2-2; + const kiss_twiddle_scalar *t = &trig[0]; + /* Loop to (N4+1)>>1 to handle odd N4. When N4 is odd, the + middle pair will be computed twice. */ + for(i=0;i<(N4+1)>>1;i++) + { + kiss_fft_scalar re, im, yr, yi; + kiss_twiddle_scalar t0, t1; + re = yp0[0]; + im = yp0[1]; + t0 = t[i]; + t1 = t[N4+i]; + /* We'd scale up by 2 here, but instead it's done when mixing the windows */ + yr = S_MUL(re,t0) + S_MUL(im,t1); + yi = S_MUL(re,t1) - S_MUL(im,t0); + re = yp1[0]; + im = yp1[1]; + yp0[0] = yr; + yp1[1] = yi; + + t0 = t[(N4-i-1)]; + t1 = t[(N2-i-1)]; + /* We'd scale up by 2 here, but instead it's done when mixing the windows */ + yr = S_MUL(re,t0) + S_MUL(im,t1); + yi = S_MUL(re,t1) - S_MUL(im,t0); + yp1[0] = yr; + yp0[1] = yi; + yp0 += 2; + yp1 -= 2; + } + } + + /* Mirror on both sides for TDAC */ + { + kiss_fft_scalar * OPUS_RESTRICT xp1 = out+overlap-1; + kiss_fft_scalar * OPUS_RESTRICT yp1 = out; + const opus_val16 * OPUS_RESTRICT wp1 = window; + const opus_val16 * OPUS_RESTRICT wp2 = window+overlap-1; + + for(i = 0; i < overlap/2; i++) + { + kiss_fft_scalar x1, x2; + x1 = *xp1; + x2 = *yp1; + *yp1++ = MULT16_32_Q15(*wp2, x2) - MULT16_32_Q15(*wp1, x1); + *xp1-- = MULT16_32_Q15(*wp1, x2) + MULT16_32_Q15(*wp2, x1); + wp1++; + wp2--; + } + } + RESTORE_STACK; +} diff --git a/celt/arm/fft_arm.h b/celt/arm/fft_arm.h index e7a30d6..e57b0aa 100644 --- a/celt/arm/fft_arm.h +++ b/celt/arm/fft_arm.h @@ -46,6 +46,11 @@ void opus_fft_free_arm_float_neon(kiss_fft_state *st); void opus_fft_float_neon(const kiss_fft_state *st, const kiss_fft_cpx *fin, kiss_fft_cpx *fout); + +void opus_ifft_float_neon(const kiss_fft_state *st, + const kiss_fft_cpx *fin, + kiss_fft_cpx *fout); + #if !defined(OPUS_HAVE_RTCD) #define OVERRIDE_OPUS_FFT (1) @@ -58,6 +63,9 @@ void opus_fft_float_neon(const kiss_fft_state *st, #define opus_fft(_st, _fin, _fout, arch) \ ((void)(arch), opus_fft_float_neon(_st, _fin, _fout)) +#define opus_ifft(_st, _fin, _fout, arch) \ + ((void)(arch), opus_ifft_float_neon(_st, _fin, _fout)) + #endif /* OPUS_HAVE_RTCD */ #endif /* HAVE_ARM_NE10 */ diff --git a/celt/arm/mdct_arm.h b/celt/arm/mdct_arm.h index 7d60fed..db32efe 100644 --- a/celt/arm/mdct_arm.h +++ b/celt/arm/mdct_arm.h @@ -43,10 +43,17 @@ void clt_mdct_forward_float_neon(const mdct_lookup *l, kiss_fft_scalar *in, const opus_val16 *window, int overlap, int shift, int stride, int arch); +void clt_mdct_backward_float_neon(const mdct_lookup *l, kiss_fft_scalar *in, + kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 *window, int overlap, + int shift, int stride, int arch); + #if !defined(OPUS_HAVE_RTCD) #define OVERRIDE_OPUS_MDCT (1) #define clt_mdct_forward(_l, _in, _out, _window, _int, _shift, _stride, _arch) \ clt_mdct_forward_float_neon(_l, _in, _out, _window, _int, _shift, _stride, _arch) +#define clt_mdct_backward(_l, _in, _out, _window, _int, _shift, _stride, _arch) \ + clt_mdct_backward_float_neon(_l, _in, _out, _window, _int, _shift, _stride, _arch) #endif /* OPUS_HAVE_RTCD */ #endif /* !defined(FIXED_POINT) && defined(HAVE_ARM_NE10) */ diff --git a/celt/celt_decoder.c b/celt/celt_decoder.c index 4304a3e..304f334 100644 --- a/celt/celt_decoder.c +++ b/celt/celt_decoder.c @@ -278,8 +278,9 @@ void deemphasis(celt_sig *in[], opus_val16 *pcm, int N, int C, int downsample, c static #endif void celt_synthesis(const CELTMode *mode, celt_norm *X, celt_sig * out_syn[], - opus_val16 *oldBandE, int start, int effEnd, int C, int CC, int isTransient, - int LM, int downsample, int silence) + opus_val16 *oldBandE, int start, int effEnd, int C, int CC, + int isTransient, int LM, int downsample, + int silence, int arch) { int c, i; int M; @@ -319,9 +320,9 @@ void celt_synthesis(const CELTMode *mode, celt_norm *X, celt_sig * out_syn[], freq2 = out_syn[1]+overlap/2; OPUS_COPY(freq2, freq, N); for (b=0;b<B;b++) - clt_mdct_backward(&mode->mdct, &freq2[b], out_syn[0]+NB*b, mode->window, overlap, shift, B); + clt_mdct_backward(&mode->mdct, &freq2[b], out_syn[0]+NB*b, mode->window, overlap, shift, B, arch); for (b=0;b<B;b++) - clt_mdct_backward(&mode->mdct, &freq[b], out_syn[1]+NB*b, mode->window, overlap, shift, B); + clt_mdct_backward(&mode->mdct, &freq[b], out_syn[1]+NB*b, mode->window, overlap, shift, B, arch); } else if (CC==1&&C==2) { /* Downmixing a stereo stream to mono */ @@ -335,14 +336,14 @@ void celt_synthesis(const CELTMode *mode, celt_norm *X, celt_sig * out_syn[], for (i=0;i<N;i++) freq[i] = HALF32(ADD32(freq[i],freq2[i])); for (b=0;b<B;b++) - clt_mdct_backward(&mode->mdct, &freq[b], out_syn[0]+NB*b, mode->window, overlap, shift, B); + clt_mdct_backward(&mode->mdct, &freq[b], out_syn[0]+NB*b, mode->window, overlap, shift, B, arch); } else { /* Normal case (mono or stereo) */ c=0; do { denormalise_bands(mode, X+c*N, freq, oldBandE+c*nbEBands, start, effEnd, M, downsample, silence); for (b=0;b<B;b++) - clt_mdct_backward(&mode->mdct, &freq[b], out_syn[c]+NB*b, mode->window, overlap, shift, B); + clt_mdct_backward(&mode->mdct, &freq[b], out_syn[c]+NB*b, mode->window, overlap, shift, B, arch); } while (++c<CC); } RESTORE_STACK; @@ -509,7 +510,7 @@ static void celt_decode_lost(CELTDecoder * OPUS_RESTRICT st, int N, int LM) DECODE_BUFFER_SIZE-N+(overlap>>1)); } while (++c<C); - celt_synthesis(mode, X, out_syn, plcLogE, start, effEnd, C, C, 0, LM, st->downsample, 0); + celt_synthesis(mode, X, out_syn, plcLogE, start, effEnd, C, C, 0, LM, st->downsample, 0, st->arch); } else { /* Pitch-based PLC */ const opus_val16 *window; @@ -1002,7 +1003,8 @@ int celt_decode_with_ec(CELTDecoder * OPUS_RESTRICT st, const unsigned char *dat oldBandE[i] = -QCONST16(28.f,DB_SHIFT); } - celt_synthesis(mode, X, out_syn, oldBandE, start, effEnd, C, CC, isTransient, LM, st->downsample, silence); + celt_synthesis(mode, X, out_syn, oldBandE, start, effEnd, + C, CC, isTransient, LM, st->downsample, silence, st->arch); c=0; do { st->postfilter_period=IMAX(st->postfilter_period, COMBFILTER_MINPERIOD); diff --git a/celt/celt_encoder.c b/celt/celt_encoder.c index 7a2c71b..5f48638 100644 --- a/celt/celt_encoder.c +++ b/celt/celt_encoder.c @@ -2072,7 +2072,8 @@ int celt_encode_with_ec(CELTEncoder * OPUS_RESTRICT st, const opus_val16 * pcm, out_mem[c] = st->syn_mem[c]+2*MAX_PERIOD-N; } while (++c<CC); - celt_synthesis(mode, X, out_mem, oldBandE, start, effEnd, C, CC, isTransient, LM, st->upsample, silence); + celt_synthesis(mode, X, out_mem, oldBandE, start, effEnd, + C, CC, isTransient, LM, st->upsample, silence, st->arch); c=0; do { st->prefilter_period=IMAX(st->prefilter_period, COMBFILTER_MINPERIOD); diff --git a/celt/kiss_fft.c b/celt/kiss_fft.c index 38fd4fb..4ed37d2 100644 --- a/celt/kiss_fft.c +++ b/celt/kiss_fft.c @@ -589,8 +589,7 @@ void opus_fft_c(const kiss_fft_state *st,const kiss_fft_cpx *fin,kiss_fft_cpx *f } -#ifdef TEST_UNIT_DFT_C -void opus_ifft(const kiss_fft_state *st,const kiss_fft_cpx *fin,kiss_fft_cpx *fout) +void opus_ifft_c(const kiss_fft_state *st,const kiss_fft_cpx *fin,kiss_fft_cpx *fout) { int i; celt_assert2 (fin != fout, "In-place FFT not supported"); @@ -603,4 +602,3 @@ void opus_ifft(const kiss_fft_state *st,const kiss_fft_cpx *fin,kiss_fft_cpx *fo for (i=0;i<st->nfft;i++) fout[i].i = -fout[i].i; } -#endif diff --git a/celt/kiss_fft.h b/celt/kiss_fft.h index bf2f836..45017a4 100644 --- a/celt/kiss_fft.h +++ b/celt/kiss_fft.h @@ -142,7 +142,7 @@ kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t * lenmem, int arch); f[k].r and f[k].i * */ void opus_fft_c(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx *fout); -void opus_ifft(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx *fout); +void opus_ifft_c(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx *fout); void opus_fft_impl(const kiss_fft_state *st,kiss_fft_cpx *fout); void opus_ifft_impl(const kiss_fft_state *st,kiss_fft_cpx *fout); @@ -171,6 +171,13 @@ void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg, kiss_fft_cpx *fout); #define opus_fft(_cfg, _fin, _fout, arch) \ ((*OPUS_FFT[(arch)&OPUS_ARCHMASK])(_cfg, _fin, _fout)) + +void (*const OPUS_IFFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg, + const kiss_fft_cpx *fin, + kiss_fft_cpx *fout); +#define opus_ifft(_cfg, _fin, _fout, arch) \ + ((*OPUS_IFFT[(arch)&OPUS_ARCHMASK])(_cfg, _fin, _fout)) + #else /* else for if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */ #define opus_fft_alloc_arch(_st, arch) \ @@ -181,6 +188,10 @@ void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg, #define opus_fft(_cfg, _fin, _fout, arch) \ ((void)(arch), opus_fft_c(_cfg, _fin, _fout)) + +#define opus_ifft(_cfg, _fin, _fout, arch) \ + ((void)(arch), opus_ifft_c(_cfg, _fin, _fout)) + #endif /* end if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */ #endif /* end if !defined(OVERRIDE_OPUS_FFT) */ diff --git a/celt/mdct.c b/celt/mdct.c index ee6d80e..5315ad1 100644 --- a/celt/mdct.c +++ b/celt/mdct.c @@ -239,12 +239,13 @@ void clt_mdct_forward_c(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scal #endif /* OVERRIDE_clt_mdct_forward */ #ifndef OVERRIDE_clt_mdct_backward -void clt_mdct_backward(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scalar * OPUS_RESTRICT out, - const opus_val16 * OPUS_RESTRICT window, int overlap, int shift, int stride) +void clt_mdct_backward_c(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 * OPUS_RESTRICT window, int overlap, int shift, int stride, int arch) { int i; int N, N2, N4; const kiss_twiddle_scalar *trig; + (void) arch; N = l->n; trig = l->trig; diff --git a/celt/mdct.h b/celt/mdct.h index cbaf679..5349ccf 100644 --- a/celt/mdct.h +++ b/celt/mdct.h @@ -69,9 +69,10 @@ void clt_mdct_forward_c(const mdct_lookup *l, kiss_fft_scalar *in, /** Compute a backward MDCT (no scaling) and performs weighted overlap-add (scales implicitly by 1/2) */ -void clt_mdct_backward(const mdct_lookup *l, kiss_fft_scalar *in, - kiss_fft_scalar * OPUS_RESTRICT out, - const opus_val16 * OPUS_RESTRICT window, int overlap, int shift, int stride); +void clt_mdct_backward_c(const mdct_lookup *l, kiss_fft_scalar *in, + kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 * OPUS_RESTRICT window, + int overlap, int shift, int stride, int arch); #if !defined(OVERRIDE_OPUS_MDCT) /* Is run-time CPU detection enabled on this platform? */ @@ -88,11 +89,26 @@ void (*const CLT_MDCT_FORWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l, ((*CLT_MDCT_FORWARD_IMPL[(arch)&OPUS_ARCHMASK])(_l, _in, _out, \ _window, _overlap, _shift, \ _stride, _arch)) + +void (*const CLT_MDCT_BACKWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l, + kiss_fft_scalar *in, + kiss_fft_scalar * OPUS_RESTRICT out, + const opus_val16 *window, + int overlap, int shift, + int stride, int arch); + +#define clt_mdct_backward(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) \ + (*CLT_MDCT_BACKWARD_IMPL[(arch)&OPUS_ARCHMASK])(_l, _in, _out, \ + _window, _overlap, _shift, \ + _stride, _arch) #else /* else for if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */ #define clt_mdct_forward(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) \ clt_mdct_forward_c(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) +#define clt_mdct_backward(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) \ + clt_mdct_backward_c(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) + #endif /* end if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */ #endif /* end if !defined(OVERRIDE_OPUS_MDCT) */ diff --git a/celt/tests/test_unit_dft.c b/celt/tests/test_unit_dft.c index 991ece3..28f0238 100644 --- a/celt/tests/test_unit_dft.c +++ b/celt/tests/test_unit_dft.c @@ -140,7 +140,7 @@ void test1d(int nfft,int isinverse,int arch) /*for (k=0;k<nfft;++k) printf("%d %d ", in[k].r, in[k].i);printf("\n");*/ if (isinverse) - opus_ifft(cfg,in,out); + opus_ifft(cfg,in,out, arch); else opus_fft(cfg,in,out, arch); diff --git a/celt/tests/test_unit_mdct.c b/celt/tests/test_unit_mdct.c index a1c92d0..51e457a 100644 --- a/celt/tests/test_unit_mdct.c +++ b/celt/tests/test_unit_mdct.c @@ -168,7 +168,7 @@ void test1d(int nfft,int isinverse,int arch) { for (k=0;k<nfft;++k) out[k] = 0; - clt_mdct_backward(&cfg,in,out, window, nfft/2, 0, 1); + clt_mdct_backward(&cfg,in,out, window, nfft/2, 0, 1, arch); /* apply TDAC because clt_mdct_backward() no longer does that */ for (k=0;k<nfft/4;++k) out[nfft-k-1] = out[nfft/2+k]; -- 1.9.1
Viswanath Puttagunta
2015-Mar-31 22:57 UTC
[opus] [RFC PATCH v1 3/5] Intrinsics/RTCD related fixes. Mostly x86
From: Jonathan Lennox <jonathan at vidyo.com> * Makes ?enable-intrinsics work with clang and other non-GCC compilers * Enables RTCD for the floating-point-mode SSE code in Celt. * Disables use of RTCD in cases where the compiler targets an instruction set by default. * Enables the SSE4.1 Silk optimizations that apply to the common parts of Silk when Opus is built in floating-point mode, not just in fixed-point mode. * Enables the SSE intrinsics (with RTCD when appropriate) in the Win32 build. * Fixes a case where GCC would compile SSE2 code as SSE4.1, causing a crash on non-SSE4.1 CPUs. * Allows configuration with compilers with non-GCC-flavor flags for enabling architecture options. * Hopefully makes the configuration and ifdef?s easier to follow and understand. Reviewed-by: Viswanath Puttagunta <viswanath.puttagunta at linaro.org> --- Makefile.am | 38 ++-- celt/arm/armcpu.c | 6 +- celt/arm/pitch_arm.h | 4 +- celt/bands.c | 6 +- celt/celt.c | 16 +- celt/celt.h | 12 +- celt/celt_decoder.c | 6 +- celt/celt_encoder.c | 4 +- celt/celt_lpc.h | 2 +- celt/cpu_support.h | 15 +- celt/mips/celt_mipsr1.h | 2 +- celt/pitch.c | 4 +- celt/pitch.h | 19 +- celt/tests/test_unit_dft.c | 4 +- celt/tests/test_unit_mathops.c | 11 +- celt/tests/test_unit_mdct.c | 4 +- celt/tests/test_unit_rotation.c | 11 +- celt/x86/celt_lpc_sse.c | 4 + celt/x86/celt_lpc_sse.h | 12 +- celt/x86/pitch_sse.c | 334 +++++++++++++------------------ celt/x86/pitch_sse.h | 256 ++++++++++------------- celt/x86/pitch_sse2.c | 95 +++++++++ celt/x86/pitch_sse4_1.c | 195 ++++++++++++++++++ celt/x86/x86_celt_map.c | 76 ++++++- celt/x86/x86cpu.c | 47 ++++- celt/x86/x86cpu.h | 26 ++- celt_sources.mk | 5 +- configure.ac | 313 ++++++++++++++++++----------- m4/opus-intrinsics.m4 | 29 +++ silk/x86/SigProc_FIX_sse.h | 17 ++ silk/x86/main_sse.h | 48 +++++ silk/x86/x86_silk_map.c | 25 ++- win32/VS2010/celt.vcxproj | 17 +- win32/VS2010/celt.vcxproj.filters | 27 +++ win32/VS2010/silk_common.vcxproj | 17 +- win32/VS2010/silk_common.vcxproj.filters | 23 ++- win32/VS2010/silk_fixed.vcxproj | 13 +- win32/VS2010/silk_fixed.vcxproj.filters | 17 +- win32/config.h | 25 ++- 39 files changed, 1214 insertions(+), 571 deletions(-) create mode 100644 celt/x86/pitch_sse2.c create mode 100644 celt/x86/pitch_sse4_1.c create mode 100644 m4/opus-intrinsics.m4 diff --git a/Makefile.am b/Makefile.am index c5c1562..3a75740 100644 --- a/Makefile.am +++ b/Makefile.am @@ -23,6 +23,9 @@ SILK_SOURCES += $(SILK_SOURCES_SSE4_1) $(SILK_SOURCES_FIXED_SSE4_1) endif else SILK_SOURCES += $(SILK_SOURCES_FLOAT) +if HAVE_SSE4_1 +SILK_SOURCES += $(SILK_SOURCES_SSE4_1) +endif endif if DISABLE_FLOAT_API @@ -30,12 +33,14 @@ else OPUS_SOURCES += $(OPUS_SOURCES_FLOAT) endif -if HAVE_SSE4_1 -CELT_SOURCES += $(CELT_SOURCES_SSE) $(CELT_SOURCES_SSE4_1) -else -if HAVE_SSE2 +if HAVE_SSE CELT_SOURCES += $(CELT_SOURCES_SSE) endif +if HAVE_SSE2 +CELT_SOURCES += $(CELT_SOURCES_SSE2) +endif +if HAVE_SSE4_1 +CELT_SOURCES += $(CELT_SOURCES_SSE4_1) endif if CPU_ARM @@ -44,7 +49,6 @@ SILK_SOURCES += $(SILK_SOURCES_ARM) if OPUS_ARM_NEON_INTR CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR) -OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon endif if HAVE_ARM_NE10 @@ -262,20 +266,30 @@ $(CELT_SOURCES_ARM_ASM:%.s=%-gnu.S): $(top_srcdir)/celt/arm/arm2gnu.pl %-gnu.S: %.s $(top_srcdir)/celt/arm/arm2gnu.pl @ARM2GNU_PARAMS@ < $< > $@ -SSE_OBJ = %_sse.o %_sse.lo %test_unit_mathops.o %test_unit_rotation.o +OPT_UNIT_TEST_OBJ = $(celt_tests_test_unit_mathops_SOURCES:.c=.o) \ + $(celt_tests_test_unit_rotation_SOURCES:.c=.o) + +if HAVE_SSE +SSE_OBJ = $(CELT_SOURCES_SSE:.c=.lo) +$(SSE_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_X86_SSE_CFLAGS) +endif -if HAVE_SSE4_1 -$(SSE_OBJ): CFLAGS += -msse4.1 -else if HAVE_SSE2 -$(SSE_OBJ): CFLAGS += -msse2 +SSE2_OBJ = $(CELT_SOURCES_SSE2:.c=.lo) +$(SSE2_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_X86_SSE2_CFLAGS) endif + +if HAVE_SSE4_1 +SSE4_1_OBJ = $(CELT_SOURCES_SSE4_1:.c=.lo) \ + $(SILK_SOURCES_SSE4_1:.c=.lo) \ + $(SILK_SOURCES_FIXED_SSE4_1:.c=.lo) +$(SSE4_1_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_X86_SSE4_1_CFLAGS) endif if OPUS_ARM_NEON_INTR CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \ $(CELT_SOURCES_ARM_NE10:.c=.lo) \ - %test_unit_rotation.o %test_unit_mathops.o \ %test_unit_mdct.o %test_unit_dft.o -$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS) $(NE10_CFLAGS) + +$(CELT_ARM_NEON_INTR_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CFLAGS) $(NE10_CFLAGS) endif diff --git a/celt/arm/armcpu.c b/celt/arm/armcpu.c index 1768525..5e5d10c 100644 --- a/celt/arm/armcpu.c +++ b/celt/arm/armcpu.c @@ -73,7 +73,7 @@ static OPUS_INLINE opus_uint32 opus_cpu_capabilities(void){ __except(GetExceptionCode()==EXCEPTION_ILLEGAL_INSTRUCTION){ /*Ignore exception.*/ } -# if defined(OPUS_ARM_MAY_HAVE_NEON) +# if defined(OPUS_ARM_MAY_HAVE_NEON) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR) __try{ /*VORR q0,q0,q0*/ __emit(0xF2200150); @@ -107,7 +107,7 @@ opus_uint32 opus_cpu_capabilities(void) while(fgets(buf, 512, cpuinfo) != NULL) { -# if defined(OPUS_ARM_MAY_HAVE_EDSP) || defined(OPUS_ARM_MAY_HAVE_NEON) +# if defined(OPUS_ARM_MAY_HAVE_EDSP) || defined(OPUS_ARM_MAY_HAVE_NEON) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR) /* Search for edsp and neon flag */ if(memcmp(buf, "Features", 8) == 0) { @@ -118,7 +118,7 @@ opus_uint32 opus_cpu_capabilities(void) flags |= OPUS_CPU_ARM_EDSP; # endif -# if defined(OPUS_ARM_MAY_HAVE_NEON) +# if defined(OPUS_ARM_MAY_HAVE_NEON) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR) p = strstr(buf, " neon"); if(p != NULL && (p[5] == ' ' || p[5] == '\n')) flags |= OPUS_CPU_ARM_NEON; diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h index 125d1bc..8626ed7 100644 --- a/celt/arm/pitch_arm.h +++ b/celt/arm/pitch_arm.h @@ -54,10 +54,10 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y, #else /* Start !FIXED_POINT */ /* Float case */ -#if defined(OPUS_ARM_NEON_INTR) +#if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, opus_val32 *xcorr, int len, int max_pitch); -#if !defined(OPUS_HAVE_RTCD) +#if !defined(OPUS_HAVE_RTCD) || defined(OPUS_ARM_PRESUME_NEON_INTR) #define OVERRIDE_PITCH_XCORR (1) # define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \ ((void)(arch),celt_pitch_xcorr_float_neon(_x, _y, xcorr, len, max_pitch)) diff --git a/celt/bands.c b/celt/bands.c index c643b09..25f229e 100644 --- a/celt/bands.c +++ b/celt/bands.c @@ -398,7 +398,7 @@ static void stereo_split(celt_norm * OPUS_RESTRICT X, celt_norm * OPUS_RESTRICT } } -static void stereo_merge(celt_norm * OPUS_RESTRICT X, celt_norm * OPUS_RESTRICT Y, opus_val16 mid, int N) +static void stereo_merge(celt_norm * OPUS_RESTRICT X, celt_norm * OPUS_RESTRICT Y, opus_val16 mid, int N, int arch) { int j; opus_val32 xp=0, side=0; @@ -410,7 +410,7 @@ static void stereo_merge(celt_norm * OPUS_RESTRICT X, celt_norm * OPUS_RESTRICT opus_val32 t, lgain, rgain; /* Compute the norm of X+Y and X-Y as |X|^2 + |Y|^2 +/- sum(xy) */ - dual_inner_prod(Y, X, Y, N, &xp, &side); + dual_inner_prod(Y, X, Y, N, &xp, &side, arch); /* Compensating for the mid normalization */ xp = MULT16_32_Q15(mid, xp); /* mid and side are in Q15, not Q14 like X and Y */ @@ -1348,7 +1348,7 @@ static unsigned quant_band_stereo(struct band_ctx *ctx, celt_norm *X, celt_norm if (resynth) { if (N!=2) - stereo_merge(X, Y, mid, N); + stereo_merge(X, Y, mid, N, ctx->arch); if (inv) { int j; diff --git a/celt/celt.c b/celt/celt.c index a610de4..40c62ce 100644 --- a/celt/celt.c +++ b/celt/celt.c @@ -89,10 +89,12 @@ int resampling_factor(opus_int32 rate) return ret; } -#ifndef OVERRIDE_COMB_FILTER_CONST /* This version should be faster on ARM */ #ifdef OPUS_ARM_ASM -static void comb_filter_const(opus_val32 *y, opus_val32 *x, int T, int N, +#ifndef NON_STATIC_COMB_FILTER_CONST_C +static +#endif +void comb_filter_const_c(opus_val32 *y, opus_val32 *x, int T, int N, opus_val16 g10, opus_val16 g11, opus_val16 g12) { opus_val32 x0, x1, x2, x3, x4; @@ -147,7 +149,10 @@ static void comb_filter_const(opus_val32 *y, opus_val32 *x, int T, int N, #endif } #else -static void comb_filter_const(opus_val32 *y, opus_val32 *x, int T, int N, +#ifndef NON_STATIC_COMB_FILTER_CONST_C +static +#endif +void comb_filter_const_c(opus_val32 *y, opus_val32 *x, int T, int N, opus_val16 g10, opus_val16 g11, opus_val16 g12) { opus_val32 x0, x1, x2, x3, x4; @@ -171,12 +176,11 @@ static void comb_filter_const(opus_val32 *y, opus_val32 *x, int T, int N, } #endif -#endif #ifndef OVERRIDE_comb_filter void comb_filter(opus_val32 *y, opus_val32 *x, int T0, int T1, int N, opus_val16 g0, opus_val16 g1, int tapset0, int tapset1, - const opus_val16 *window, int overlap) + const opus_val16 *window, int overlap, int arch) { int i; /* printf ("%d %d %f %f\n", T0, T1, g0, g1); */ @@ -234,7 +238,7 @@ void comb_filter(opus_val32 *y, opus_val32 *x, int T0, int T1, int N, } /* Compute the part with the constant filter. */ - comb_filter_const(y+i, x+i, T1, N-i, g10, g11, g12); + comb_filter_const(y+i, x+i, T1, N-i, g10, g11, g12, arch); } #endif /* OVERRIDE_comb_filter */ diff --git a/celt/celt.h b/celt/celt.h index b196751..a423b95 100644 --- a/celt/celt.h +++ b/celt/celt.h @@ -201,7 +201,17 @@ void celt_preemphasis(const opus_val16 * OPUS_RESTRICT pcmp, celt_sig * OPUS_RES void comb_filter(opus_val32 *y, opus_val32 *x, int T0, int T1, int N, opus_val16 g0, opus_val16 g1, int tapset0, int tapset1, - const opus_val16 *window, int overlap); + const opus_val16 *window, int overlap, int arch); + +#ifdef NON_STATIC_COMB_FILTER_CONST_C +void comb_filter_const_c(opus_val32 *y, opus_val32 *x, int T, int N, + opus_val16 g10, opus_val16 g11, opus_val16 g12); +#endif + +#ifndef OVERRIDE_COMB_FILTER_CONST +# define comb_filter_const(y, x, T, N, g10, g11, g12, arch) \ + ((void)(arch),comb_filter_const_c(y, x, T, N, g10, g11, g12)) +#endif void init_caps(const CELTMode *m,int *cap,int LM,int C); diff --git a/celt/celt_decoder.c b/celt/celt_decoder.c index 304f334..505a6ef 100644 --- a/celt/celt_decoder.c +++ b/celt/celt_decoder.c @@ -699,7 +699,7 @@ static void celt_decode_lost(CELTDecoder * OPUS_RESTRICT st, int N, int LM) comb_filter(etmp, buf+DECODE_BUFFER_SIZE, st->postfilter_period, st->postfilter_period, overlap, -st->postfilter_gain, -st->postfilter_gain, - st->postfilter_tapset, st->postfilter_tapset, NULL, 0); + st->postfilter_tapset, st->postfilter_tapset, NULL, 0, st->arch); /* Simulate TDAC on the concealed audio so that it blends with the MDCT of the next frame. */ @@ -1011,11 +1011,11 @@ int celt_decode_with_ec(CELTDecoder * OPUS_RESTRICT st, const unsigned char *dat st->postfilter_period_old=IMAX(st->postfilter_period_old, COMBFILTER_MINPERIOD); comb_filter(out_syn[c], out_syn[c], st->postfilter_period_old, st->postfilter_period, mode->shortMdctSize, st->postfilter_gain_old, st->postfilter_gain, st->postfilter_tapset_old, st->postfilter_tapset, - mode->window, overlap); + mode->window, overlap, st->arch); if (LM!=0) comb_filter(out_syn[c]+mode->shortMdctSize, out_syn[c]+mode->shortMdctSize, st->postfilter_period, postfilter_pitch, N-mode->shortMdctSize, st->postfilter_gain, postfilter_gain, st->postfilter_tapset, postfilter_tapset, - mode->window, overlap); + mode->window, overlap, st->arch); } while (++c<CC); st->postfilter_period_old = st->postfilter_period; diff --git a/celt/celt_encoder.c b/celt/celt_encoder.c index 5f48638..1c9dbcb 100644 --- a/celt/celt_encoder.c +++ b/celt/celt_encoder.c @@ -1166,11 +1166,11 @@ static int run_prefilter(CELTEncoder *st, celt_sig *in, celt_sig *prefilter_mem, if (offset) comb_filter(in+c*(N+overlap)+overlap, pre[c]+COMBFILTER_MAXPERIOD, st->prefilter_period, st->prefilter_period, offset, -st->prefilter_gain, -st->prefilter_gain, - st->prefilter_tapset, st->prefilter_tapset, NULL, 0); + st->prefilter_tapset, st->prefilter_tapset, NULL, 0, st->arch); comb_filter(in+c*(N+overlap)+overlap+offset, pre[c]+COMBFILTER_MAXPERIOD+offset, st->prefilter_period, pitch_index, N-offset, -st->prefilter_gain, -gain1, - st->prefilter_tapset, prefilter_tapset, mode->window, overlap); + st->prefilter_tapset, prefilter_tapset, mode->window, overlap, st->arch); OPUS_COPY(st->in_mem+c*(overlap), in+c*(N+overlap)+N, overlap); if (N>COMBFILTER_MAXPERIOD) diff --git a/celt/celt_lpc.h b/celt/celt_lpc.h index dc8967f..323459e 100644 --- a/celt/celt_lpc.h +++ b/celt/celt_lpc.h @@ -48,7 +48,7 @@ void celt_fir_c( opus_val16 *mem, int arch); -#if !defined(OPUS_X86_MAY_HAVE_SSE4_1) +#if !defined(OVERRIDE_CELT_FIR) #define celt_fir(x, num, y, N, ord, mem, arch) \ (celt_fir_c(x, num, y, N, ord, mem, arch)) #endif diff --git a/celt/cpu_support.h b/celt/cpu_support.h index 1d62e2f..5e99a90 100644 --- a/celt/cpu_support.h +++ b/celt/cpu_support.h @@ -32,7 +32,8 @@ #include "opus_defines.h" #if defined(OPUS_HAVE_RTCD) && \ - (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR)) + (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)) + #include "arm/armcpu.h" /* We currently support 4 ARM variants: @@ -43,14 +44,16 @@ */ #define OPUS_ARCHMASK 3 -#elif defined(OPUS_X86_MAY_HAVE_SSE2) || defined(OPUS_X86_MAY_HAVE_SSE4_1) +#elif (defined(OPUS_X86_MAY_HAVE_SSE) && !defined(OPUS_X86_PRESUME_SSE)) || \ + (defined(OPUS_X86_MAY_HAVE_SSE2) && !defined(OPUS_X86_PRESUME_SSE2)) || \ + (defined(OPUS_X86_MAY_HAVE_SSE4_1) && !defined(OPUS_X86_PRESUME_SSE4_1)) #include "x86/x86cpu.h" -/* We currently support 3 x86 variants: +/* We currently support 4 x86 variants: * arch[0] -> non-sse - * arch[1] -> sse2 - * arch[2] -> sse4.1 - * arch[3] -> NULL + * arch[1] -> sse + * arch[2] -> sse2 + * arch[3] -> sse4.1 */ #define OPUS_ARCHMASK 3 int opus_select_arch(void); diff --git a/celt/mips/celt_mipsr1.h b/celt/mips/celt_mipsr1.h index 03915d8..7915d59 100644 --- a/celt/mips/celt_mipsr1.h +++ b/celt/mips/celt_mipsr1.h @@ -56,7 +56,7 @@ #define OVERRIDE_comb_filter void comb_filter(opus_val32 *y, opus_val32 *x, int T0, int T1, int N, opus_val16 g0, opus_val16 g1, int tapset0, int tapset1, - const opus_val16 *window, int overlap) + const opus_val16 *window, int overlap, int arch) { int i; opus_val32 x0, x1, x2, x3, x4; diff --git a/celt/pitch.c b/celt/pitch.c index 4364703..1d89cb0 100644 --- a/celt/pitch.c +++ b/celt/pitch.c @@ -439,7 +439,7 @@ opus_val16 remove_doubling(opus_val16 *x, int maxperiod, int minperiod, T = T0 = *T0_; ALLOC(yy_lookup, maxperiod+1, opus_val32); - dual_inner_prod(x, x, x-T0, N, &xx, &xy); + dual_inner_prod(x, x, x-T0, N, &xx, &xy, arch); yy_lookup[0] = xx; yy=xx; for (i=1;i<=maxperiod;i++) @@ -483,7 +483,7 @@ opus_val16 remove_doubling(opus_val16 *x, int maxperiod, int minperiod, { T1b = celt_udiv(2*second_check[k]*T0+k, 2*k); } - dual_inner_prod(x, &x[-T1], &x[-T1b], N, &xy, &xy2); + dual_inner_prod(x, &x[-T1], &x[-T1b], N, &xy, &xy2, arch); xy += xy2; yy = yy_lookup[T1] + yy_lookup[T1b]; #ifdef FIXED_POINT diff --git a/celt/pitch.h b/celt/pitch.h index 4368cc5..af745eb 100644 --- a/celt/pitch.h +++ b/celt/pitch.h @@ -37,8 +37,8 @@ #include "modes.h" #include "cpu_support.h" -#if defined(__SSE__) && !defined(FIXED_POINT) \ - || defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2) +#if (defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT)) \ + || ((defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2)) && defined(FIXED_POINT)) #include "x86/pitch_sse.h" #endif @@ -135,8 +135,7 @@ static OPUS_INLINE void xcorr_kernel_c(const opus_val16 * x, const opus_val16 * #endif /* OVERRIDE_XCORR_KERNEL */ -#ifndef OVERRIDE_DUAL_INNER_PROD -static OPUS_INLINE void dual_inner_prod(const opus_val16 *x, const opus_val16 *y01, const opus_val16 *y02, +static OPUS_INLINE void dual_inner_prod_c(const opus_val16 *x, const opus_val16 *y01, const opus_val16 *y02, int N, opus_val32 *xy1, opus_val32 *xy2) { int i; @@ -150,6 +149,10 @@ static OPUS_INLINE void dual_inner_prod(const opus_val16 *x, const opus_val16 *y *xy1 = xy01; *xy2 = xy02; } + +#ifndef OVERRIDE_DUAL_INNER_PROD +# define dual_inner_prod(x, y01, y02, N, xy1, xy2, arch) \ + ((void)(arch),dual_inner_prod_c(x, y01, y02, N, xy1, xy2)) #endif /*We make sure a C version is always available for cases where the overhead of @@ -169,6 +172,12 @@ static OPUS_INLINE opus_val32 celt_inner_prod_c(const opus_val16 *x, ((void)(arch),celt_inner_prod_c(x, y, N)) #endif +#ifdef NON_STATIC_COMB_FILTER_CONST_C +void comb_filter_const_c(opus_val32 *y, opus_val32 *x, int T, int N, + opus_val16 g10, opus_val16 g11, opus_val16 g12); +#endif + + #ifdef FIXED_POINT opus_val32 #else @@ -180,7 +189,7 @@ celt_pitch_xcorr_c(const opus_val16 *_x, const opus_val16 *_y, #if !defined(OVERRIDE_PITCH_XCORR) /*Is run-time CPU detection enabled on this platform?*/ # if defined(OPUS_HAVE_RTCD) && \ - (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR)) + (defined(OPUS_ARM_ASM) || (defined(OPUS_ARM_NEON_INTR) && !defined(OPUS_ARM_PRESUME_NEON_INTR))) extern # if defined(FIXED_POINT) opus_val32 diff --git a/celt/tests/test_unit_dft.c b/celt/tests/test_unit_dft.c index 28f0238..9fbcdc4 100644 --- a/celt/tests/test_unit_dft.c +++ b/celt/tests/test_unit_dft.c @@ -46,7 +46,7 @@ #include "entcode.c" #if defined(OPUS_HAVE_RTCD) && \ - (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR)) + (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)) #include "arm/armcpu.c" #if !defined(FIXED_POINT) #if defined(HAVE_ARM_NE10) @@ -61,6 +61,8 @@ #endif #elif defined(OPUS_X86_MAY_HAVE_SSE2) || defined(OPUS_X86_MAY_HAVE_SSE4_1) #include "x86/x86cpu.c" +#include "celt/x86/pitch_sse.c" +#include "x86/x86_celt_map.c" #endif #ifndef M_PI diff --git a/celt/tests/test_unit_mathops.c b/celt/tests/test_unit_mathops.c index 5d2e8e4..a1cf2f7 100644 --- a/celt/tests/test_unit_mathops.c +++ b/celt/tests/test_unit_mathops.c @@ -49,10 +49,19 @@ #include "cwrs.c" #include "pitch.c" #include "celt_lpc.c" +#include "celt.c" -#if defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2) +#if defined(OPUS_X86_MAY_HAVE_SSE) || \ + defined(OPUS_X86_MAY_HAVE_SSE2) || \ + defined(OPUS_X86_MAY_HAVE_SSE4_1) +#if defined(OPUS_X86_MAY_HAVE_SSE) #include "x86/pitch_sse.c" +#endif +#if defined(OPUS_X86_MAY_HAVE_SSE2) +#include "x86/pitch_sse2.c" +#endif #if defined(OPUS_X86_MAY_HAVE_SSE4_1) +#include "x86/pitch_sse4_1.c" #include "x86/celt_lpc_sse.c" #endif #include "x86/x86_celt_map.c" diff --git a/celt/tests/test_unit_mdct.c b/celt/tests/test_unit_mdct.c index 51e457a..fdee079 100644 --- a/celt/tests/test_unit_mdct.c +++ b/celt/tests/test_unit_mdct.c @@ -47,7 +47,7 @@ #include "entcode.c" #if defined(OPUS_HAVE_RTCD) && \ - (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR)) + (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)) #include "arm/armcpu.c" #if !defined(FIXED_POINT) #if defined(HAVE_ARM_NE10) @@ -62,6 +62,8 @@ #elif defined(OPUS_X86_MAY_HAVE_SSE2) || defined(OPUS_X86_MAY_HAVE_SSE4_1) #include "x86/x86cpu.c" +#include "celt/x86/pitch_sse.c" +#include "x86/x86_celt_map.c" #endif #ifndef M_PI diff --git a/celt/tests/test_unit_rotation.c b/celt/tests/test_unit_rotation.c index fb18df0..4ac838e 100644 --- a/celt/tests/test_unit_rotation.c +++ b/celt/tests/test_unit_rotation.c @@ -46,11 +46,20 @@ #include "bands.h" #include "pitch.c" #include "celt_lpc.c" +#include "celt.c" #include <math.h> -#if defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2) +#if defined(OPUS_X86_MAY_HAVE_SSE) || \ + defined(OPUS_X86_MAY_HAVE_SSE2) || \ + defined(OPUS_X86_MAY_HAVE_SSE4_1) +#if defined(OPUS_X86_MAY_HAVE_SSE) #include "x86/pitch_sse.c" +#endif +#if defined(OPUS_X86_MAY_HAVE_SSE2) +#include "x86/pitch_sse2.c" +#endif #if defined(OPUS_X86_MAY_HAVE_SSE4_1) +#include "x86/pitch_sse4_1.c" #include "x86/celt_lpc_sse.c" #endif #include "x86/x86_celt_map.c" diff --git a/celt/x86/celt_lpc_sse.c b/celt/x86/celt_lpc_sse.c index 9fb9779..67e5592 100644 --- a/celt/x86/celt_lpc_sse.c +++ b/celt/x86/celt_lpc_sse.c @@ -38,6 +38,8 @@ #include "pitch.h" #include "x86cpu.h" +#if defined(FIXED_POINT) + void celt_fir_sse4_1(const opus_val16 *_x, const opus_val16 *num, opus_val16 *_y, @@ -126,3 +128,5 @@ void celt_fir_sse4_1(const opus_val16 *_x, #endif RESTORE_STACK; } + +#endif diff --git a/celt/x86/celt_lpc_sse.h b/celt/x86/celt_lpc_sse.h index f111420..c5ec796 100644 --- a/celt/x86/celt_lpc_sse.h +++ b/celt/x86/celt_lpc_sse.h @@ -32,7 +32,9 @@ #include "config.h" #endif -#if defined(OPUS_X86_MAY_HAVE_SSE4_1) +#if defined(OPUS_X86_MAY_HAVE_SSE4_1) && defined(FIXED_POINT) +#define OVERRIDE_CELT_FIR + void celt_fir_sse4_1( const opus_val16 *x, const opus_val16 *num, @@ -42,6 +44,12 @@ void celt_fir_sse4_1( opus_val16 *mem, int arch); +#if defined(OPUS_X86_PRESUME_SSE4_1) +#define celt_fir(x, num, y, N, ord, mem, arch) \ + ((void)arch, celt_fir_sse4_1(x, num, y, N, ord, mem, arch)) + +#else + extern void (*const CELT_FIR_IMPL[OPUS_ARCHMASK + 1])( const opus_val16 *x, const opus_val16 *num, @@ -56,3 +64,5 @@ extern void (*const CELT_FIR_IMPL[OPUS_ARCHMASK + 1])( #endif #endif + +#endif diff --git a/celt/x86/pitch_sse.c b/celt/x86/pitch_sse.c index e3bc6d7..20e7312 100644 --- a/celt/x86/pitch_sse.c +++ b/celt/x86/pitch_sse.c @@ -29,223 +29,157 @@ #include "config.h" #endif -#include <xmmintrin.h> -#include <emmintrin.h> - #include "macros.h" #include "celt_lpc.h" #include "stack_alloc.h" #include "mathops.h" #include "pitch.h" -#if defined(OPUS_X86_MAY_HAVE_SSE4_1) -#include <smmintrin.h> -#include "x86cpu.h" - -opus_val32 celt_inner_prod_sse4_1(const opus_val16 *x, const opus_val16 *y, - int N) -{ - opus_int i, dataSize16; - opus_int32 sum; - __m128i inVec1_76543210, inVec1_FEDCBA98, acc1; - __m128i inVec2_76543210, inVec2_FEDCBA98, acc2; - __m128i inVec1_3210, inVec2_3210; - - sum = 0; - dataSize16 = N & ~15; - - acc1 = _mm_setzero_si128(); - acc2 = _mm_setzero_si128(); - - for (i=0;i<dataSize16;i+=16) { - inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0])); - inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0])); - - inVec1_FEDCBA98 = _mm_loadu_si128((__m128i *)(&x[i + 8])); - inVec2_FEDCBA98 = _mm_loadu_si128((__m128i *)(&y[i + 8])); - - inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210); - inVec1_FEDCBA98 = _mm_madd_epi16(inVec1_FEDCBA98, inVec2_FEDCBA98); - - acc1 = _mm_add_epi32(acc1, inVec1_76543210); - acc2 = _mm_add_epi32(acc2, inVec1_FEDCBA98); - } +#if defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT) - acc1 = _mm_add_epi32(acc1, acc2); - - if (N - i >= 8) - { - inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0])); - inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0])); - - inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210); - - acc1 = _mm_add_epi32(acc1, inVec1_76543210); - i += 8; - } - - if (N - i >= 4) - { - inVec1_3210 = OP_CVTEPI16_EPI32_M64(&x[i + 0]); - inVec2_3210 = OP_CVTEPI16_EPI32_M64(&y[i + 0]); - - inVec1_3210 = _mm_mullo_epi32(inVec1_3210, inVec2_3210); - - acc1 = _mm_add_epi32(acc1, inVec1_3210); - i += 4; - } - - acc1 = _mm_add_epi32(acc1, _mm_unpackhi_epi64(acc1, acc1)); - acc1 = _mm_add_epi32(acc1, _mm_shufflelo_epi16(acc1, 0x0E)); - - sum += _mm_cvtsi128_si32(acc1); - - for (;i<N;i++) - { - sum = silk_SMLABB(sum, x[i], y[i]); - } +#include <xmmintrin.h> +#include "arch.h" - return sum; +void xcorr_kernel_sse(const opus_val16 *x, const opus_val16 *y, opus_val32 sum[4], int len) +{ + int j; + __m128 xsum1, xsum2; + xsum1 = _mm_loadu_ps(sum); + xsum2 = _mm_setzero_ps(); + + for (j = 0; j < len-3; j += 4) + { + __m128 x0 = _mm_loadu_ps(x+j); + __m128 yj = _mm_loadu_ps(y+j); + __m128 y3 = _mm_loadu_ps(y+j+3); + + xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),yj)); + xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55), + _mm_shuffle_ps(yj,y3,0x49))); + xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa), + _mm_shuffle_ps(yj,y3,0x9e))); + xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3)); + } + if (j < len) + { + xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j))); + if (++j < len) + { + xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j))); + if (++j < len) + { + xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j))); + } + } + } + _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2)); } -void xcorr_kernel_sse4_1(const opus_val16 * x, const opus_val16 * y, opus_val32 sum[ 4 ], int len) + +void dual_inner_prod_sse(const opus_val16 *x, const opus_val16 *y01, const opus_val16 *y02, + int N, opus_val32 *xy1, opus_val32 *xy2) { - int j; - - __m128i vecX, vecX0, vecX1, vecX2, vecX3; - __m128i vecY0, vecY1, vecY2, vecY3; - __m128i sum0, sum1, sum2, sum3, vecSum; - __m128i initSum; - - celt_assert(len >= 3); - - sum0 = _mm_setzero_si128(); - sum1 = _mm_setzero_si128(); - sum2 = _mm_setzero_si128(); - sum3 = _mm_setzero_si128(); - - for (j=0;j<(len-7);j+=8) - { - vecX = _mm_loadu_si128((__m128i *)(&x[j + 0])); - vecY0 = _mm_loadu_si128((__m128i *)(&y[j + 0])); - vecY1 = _mm_loadu_si128((__m128i *)(&y[j + 1])); - vecY2 = _mm_loadu_si128((__m128i *)(&y[j + 2])); - vecY3 = _mm_loadu_si128((__m128i *)(&y[j + 3])); - - sum0 = _mm_add_epi32(sum0, _mm_madd_epi16(vecX, vecY0)); - sum1 = _mm_add_epi32(sum1, _mm_madd_epi16(vecX, vecY1)); - sum2 = _mm_add_epi32(sum2, _mm_madd_epi16(vecX, vecY2)); - sum3 = _mm_add_epi32(sum3, _mm_madd_epi16(vecX, vecY3)); - } - - sum0 = _mm_add_epi32(sum0, _mm_unpackhi_epi64( sum0, sum0)); - sum0 = _mm_add_epi32(sum0, _mm_shufflelo_epi16( sum0, 0x0E)); - - sum1 = _mm_add_epi32(sum1, _mm_unpackhi_epi64( sum1, sum1)); - sum1 = _mm_add_epi32(sum1, _mm_shufflelo_epi16( sum1, 0x0E)); - - sum2 = _mm_add_epi32(sum2, _mm_unpackhi_epi64( sum2, sum2)); - sum2 = _mm_add_epi32(sum2, _mm_shufflelo_epi16( sum2, 0x0E)); - - sum3 = _mm_add_epi32(sum3, _mm_unpackhi_epi64( sum3, sum3)); - sum3 = _mm_add_epi32(sum3, _mm_shufflelo_epi16( sum3, 0x0E)); - - vecSum = _mm_unpacklo_epi64(_mm_unpacklo_epi32(sum0, sum1), - _mm_unpacklo_epi32(sum2, sum3)); - - for (;j<(len-3);j+=4) - { - vecX = OP_CVTEPI16_EPI32_M64(&x[j + 0]); - vecX0 = _mm_shuffle_epi32(vecX, 0x00); - vecX1 = _mm_shuffle_epi32(vecX, 0x55); - vecX2 = _mm_shuffle_epi32(vecX, 0xaa); - vecX3 = _mm_shuffle_epi32(vecX, 0xff); - - vecY0 = OP_CVTEPI16_EPI32_M64(&y[j + 0]); - vecY1 = OP_CVTEPI16_EPI32_M64(&y[j + 1]); - vecY2 = OP_CVTEPI16_EPI32_M64(&y[j + 2]); - vecY3 = OP_CVTEPI16_EPI32_M64(&y[j + 3]); - - sum0 = _mm_mullo_epi32(vecX0, vecY0); - sum1 = _mm_mullo_epi32(vecX1, vecY1); - sum2 = _mm_mullo_epi32(vecX2, vecY2); - sum3 = _mm_mullo_epi32(vecX3, vecY3); - - sum0 = _mm_add_epi32(sum0, sum1); - sum2 = _mm_add_epi32(sum2, sum3); - vecSum = _mm_add_epi32(vecSum, sum0); - vecSum = _mm_add_epi32(vecSum, sum2); - } - - for (;j<len;j++) - { - vecX = OP_CVTEPI16_EPI32_M64(&x[j + 0]); - vecX0 = _mm_shuffle_epi32(vecX, 0x00); - - vecY0 = OP_CVTEPI16_EPI32_M64(&y[j + 0]); - - sum0 = _mm_mullo_epi32(vecX0, vecY0); - vecSum = _mm_add_epi32(vecSum, sum0); - } - - initSum = _mm_loadu_si128((__m128i *)(&sum[0])); - initSum = _mm_add_epi32(initSum, vecSum); - _mm_storeu_si128((__m128i *)sum, initSum); + int i; + __m128 xsum1, xsum2; + xsum1 = _mm_setzero_ps(); + xsum2 = _mm_setzero_ps(); + for (i=0;i<N-3;i+=4) + { + __m128 xi = _mm_loadu_ps(x+i); + __m128 y1i = _mm_loadu_ps(y01+i); + __m128 y2i = _mm_loadu_ps(y02+i); + xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(xi, y1i)); + xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(xi, y2i)); + } + /* Horizontal sum */ + xsum1 = _mm_add_ps(xsum1, _mm_movehl_ps(xsum1, xsum1)); + xsum1 = _mm_add_ss(xsum1, _mm_shuffle_ps(xsum1, xsum1, 0x55)); + _mm_store_ss(xy1, xsum1); + xsum2 = _mm_add_ps(xsum2, _mm_movehl_ps(xsum2, xsum2)); + xsum2 = _mm_add_ss(xsum2, _mm_shuffle_ps(xsum2, xsum2, 0x55)); + _mm_store_ss(xy2, xsum2); + for (;i<N;i++) + { + *xy1 = MAC16_16(*xy1, x[i], y01[i]); + *xy2 = MAC16_16(*xy2, x[i], y02[i]); + } } -#endif -#if defined(OPUS_X86_MAY_HAVE_SSE2) -opus_val32 celt_inner_prod_sse2(const opus_val16 *x, const opus_val16 *y, +opus_val32 celt_inner_prod_sse(const opus_val16 *x, const opus_val16 *y, int N) { - opus_int i, dataSize16; - opus_int32 sum; - - __m128i inVec1_76543210, inVec1_FEDCBA98, acc1; - __m128i inVec2_76543210, inVec2_FEDCBA98, acc2; - - sum = 0; - dataSize16 = N & ~15; - - acc1 = _mm_setzero_si128(); - acc2 = _mm_setzero_si128(); - - for (i=0;i<dataSize16;i+=16) - { - inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0])); - inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0])); - - inVec1_FEDCBA98 = _mm_loadu_si128((__m128i *)(&x[i + 8])); - inVec2_FEDCBA98 = _mm_loadu_si128((__m128i *)(&y[i + 8])); - - inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210); - inVec1_FEDCBA98 = _mm_madd_epi16(inVec1_FEDCBA98, inVec2_FEDCBA98); - - acc1 = _mm_add_epi32(acc1, inVec1_76543210); - acc2 = _mm_add_epi32(acc2, inVec1_FEDCBA98); - } - - acc1 = _mm_add_epi32( acc1, acc2 ); - - if (N - i >= 8) - { - inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0])); - inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0])); - - inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210); + int i; + float xy; + __m128 sum; + sum = _mm_setzero_ps(); + /* FIXME: We should probably go 8-way and use 2 sums. */ + for (i=0;i<N-3;i+=4) + { + __m128 xi = _mm_loadu_ps(x+i); + __m128 yi = _mm_loadu_ps(y+i); + sum = _mm_add_ps(sum,_mm_mul_ps(xi, yi)); + } + /* Horizontal sum */ + sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum)); + sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 0x55)); + _mm_store_ss(&xy, sum); + for (;i<N;i++) + { + xy = MAC16_16(xy, x[i], y[i]); + } + return xy; +} - acc1 = _mm_add_epi32(acc1, inVec1_76543210); - i += 8; - } +void comb_filter_const_sse(opus_val32 *y, opus_val32 *x, int T, int N, + opus_val16 g10, opus_val16 g11, opus_val16 g12) +{ + int i; + __m128 x0v; + __m128 g10v, g11v, g12v; + g10v = _mm_load1_ps(&g10); + g11v = _mm_load1_ps(&g11); + g12v = _mm_load1_ps(&g12); + x0v = _mm_loadu_ps(&x[-T-2]); + for (i=0;i<N-3;i+=4) + { + __m128 yi, yi2, x1v, x2v, x3v, x4v; + const opus_val32 *xp = &x[i-T-2]; + yi = _mm_loadu_ps(x+i); + x4v = _mm_loadu_ps(xp+4); +#if 0 + /* Slower version with all loads */ + x1v = _mm_loadu_ps(xp+1); + x2v = _mm_loadu_ps(xp+2); + x3v = _mm_loadu_ps(xp+3); +#else + x2v = _mm_shuffle_ps(x0v, x4v, 0x4e); + x1v = _mm_shuffle_ps(x0v, x2v, 0x99); + x3v = _mm_shuffle_ps(x2v, x4v, 0x99); +#endif - acc1 = _mm_add_epi32(acc1, _mm_unpackhi_epi64( acc1, acc1)); - acc1 = _mm_add_epi32(acc1, _mm_shufflelo_epi16( acc1, 0x0E)); - sum += _mm_cvtsi128_si32(acc1); + yi = _mm_add_ps(yi, _mm_mul_ps(g10v,x2v)); +#if 0 /* Set to 1 to make it bit-exact with the non-SSE version */ + yi = _mm_add_ps(yi, _mm_mul_ps(g11v,_mm_add_ps(x3v,x1v))); + yi = _mm_add_ps(yi, _mm_mul_ps(g12v,_mm_add_ps(x4v,x0v))); +#else + /* Use partial sums */ + yi2 = _mm_add_ps(_mm_mul_ps(g11v,_mm_add_ps(x3v,x1v)), + _mm_mul_ps(g12v,_mm_add_ps(x4v,x0v))); + yi = _mm_add_ps(yi, yi2); +#endif + x0v=x4v; + _mm_storeu_ps(y+i, yi); + } +#ifdef CUSTOM_MODES + for (;i<N;i++) + { + y[i] = x[i] + + MULT16_32_Q15(g10,x[i-T]) + + MULT16_32_Q15(g11,ADD32(x[i-T+1],x[i-T-1])) + + MULT16_32_Q15(g12,ADD32(x[i-T+2],x[i-T-2])); + } +#endif +} - for (;i<N;i++) { - sum = silk_SMLABB(sum, x[i], y[i]); - } - return sum; -} #endif diff --git a/celt/x86/pitch_sse.h b/celt/x86/pitch_sse.h index 99d1919..cbe722c 100644 --- a/celt/x86/pitch_sse.h +++ b/celt/x86/pitch_sse.h @@ -37,17 +37,37 @@ #include "config.h" #endif -#if defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2) -#if defined(OPUS_X86_MAY_HAVE_SSE4_1) +#if defined(OPUS_X86_MAY_HAVE_SSE4_1) && defined(FIXED_POINT) void xcorr_kernel_sse4_1( const opus_int16 *x, const opus_int16 *y, opus_val32 sum[4], int len); +#endif + +#if defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT) +void xcorr_kernel_sse( + const opus_val16 *x, + const opus_val16 *y, + opus_val32 sum[4], + int len); +#endif + +#if defined(OPUS_X86_PRESUME_SSE4_1) && defined(FIXED_POINT) +#define OVERRIDE_XCORR_KERNEL +#define xcorr_kernel(x, y, sum, len, arch) \ + ((void)arch, xcorr_kernel_sse4_1(x, y, sum, len)) + +#elif defined(OPUS_X86_PRESUME_SSE) && !defined(FIXED_POINT) +#define OVERRIDE_XCORR_KERNEL +#define xcorr_kernel(x, y, sum, len, arch) \ + ((void)arch, xcorr_kernel_sse(x, y, sum, len)) + +#elif (defined(OPUS_X86_MAY_HAVE_SSE4_1) && defined(FIXED_POINT)) || (defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT)) extern void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])( - const opus_int16 *x, - const opus_int16 *y, + const opus_val16 *x, + const opus_val16 *y, opus_val32 sum[4], int len); @@ -55,181 +75,115 @@ extern void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])( #define xcorr_kernel(x, y, sum, len, arch) \ ((*XCORR_KERNEL_IMPL[(arch) & OPUS_ARCHMASK])(x, y, sum, len)) +#endif + +#if defined(OPUS_X86_MAY_HAVE_SSE4_1) && defined(FIXED_POINT) opus_val32 celt_inner_prod_sse4_1( const opus_int16 *x, const opus_int16 *y, int N); #endif -#if defined(OPUS_X86_MAY_HAVE_SSE2) +#if defined(OPUS_X86_MAY_HAVE_SSE2) && defined(FIXED_POINT) opus_val32 celt_inner_prod_sse2( const opus_int16 *x, const opus_int16 *y, int N); #endif +#if defined(OPUS_X86_MAY_HAVE_SSE2) && !defined(FIXED_POINT) +opus_val32 celt_inner_prod_sse( + const opus_val16 *x, + const opus_val16 *y, + int N); +#endif + + +#if defined(OPUS_X86_PRESUME_SSE4_1) && defined(FIXED_POINT) +#define OVERRIDE_CELT_INNER_PROD +#define celt_inner_prod(x, y, N, arch) \ + ((void)arch, celt_inner_prod_sse4_1(x, y, N)) + +#elif defined(OPUS_X86_PRESUME_SSE2) && defined(FIXED_POINT) && !defined(OPUS_X86_MAY_HAVE_SSE4_1) +#define OVERRIDE_CELT_INNER_PROD +#define celt_inner_prod(x, y, N, arch) \ + ((void)arch, celt_inner_prod_sse2(x, y, N)) + +#elif defined(OPUS_X86_PRESUME_SSE) && !defined(FIXED_POINT) +#define OVERRIDE_CELT_INNER_PROD +#define celt_inner_prod(x, y, N, arch) \ + ((void)arch, celt_inner_prod_sse(x, y, N)) + + +#elif ((defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2)) && defined(FIXED_POINT)) || \ + (defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT)) + extern opus_val32 (*const CELT_INNER_PROD_IMPL[OPUS_ARCHMASK + 1])( - const opus_int16 *x, - const opus_int16 *y, + const opus_val16 *x, + const opus_val16 *y, int N); #define OVERRIDE_CELT_INNER_PROD #define celt_inner_prod(x, y, N, arch) \ ((*CELT_INNER_PROD_IMPL[(arch) & OPUS_ARCHMASK])(x, y, N)) -#else -#include <xmmintrin.h> -#include "arch.h" +#endif -#define OVERRIDE_XCORR_KERNEL -static OPUS_INLINE void xcorr_kernel_sse(const opus_val16 *x, const opus_val16 *y, opus_val32 sum[4], int len) -{ - int j; - __m128 xsum1, xsum2; - xsum1 = _mm_loadu_ps(sum); - xsum2 = _mm_setzero_ps(); - - for (j = 0; j < len-3; j += 4) - { - __m128 x0 = _mm_loadu_ps(x+j); - __m128 yj = _mm_loadu_ps(y+j); - __m128 y3 = _mm_loadu_ps(y+j+3); - - xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),yj)); - xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55), - _mm_shuffle_ps(yj,y3,0x49))); - xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa), - _mm_shuffle_ps(yj,y3,0x9e))); - xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3)); - } - if (j < len) - { - xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j))); - if (++j < len) - { - xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j))); - if (++j < len) - { - xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j))); - } - } - } - _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2)); -} - -#define xcorr_kernel(_x, _y, _z, len, arch) \ - ((void)(arch),xcorr_kernel_sse(_x, _y, _z, len)) +#if defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT) #define OVERRIDE_DUAL_INNER_PROD -static OPUS_INLINE void dual_inner_prod(const opus_val16 *x, const opus_val16 *y01, const opus_val16 *y02, - int N, opus_val32 *xy1, opus_val32 *xy2) -{ - int i; - __m128 xsum1, xsum2; - xsum1 = _mm_setzero_ps(); - xsum2 = _mm_setzero_ps(); - for (i=0;i<N-3;i+=4) - { - __m128 xi = _mm_loadu_ps(x+i); - __m128 y1i = _mm_loadu_ps(y01+i); - __m128 y2i = _mm_loadu_ps(y02+i); - xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(xi, y1i)); - xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(xi, y2i)); - } - /* Horizontal sum */ - xsum1 = _mm_add_ps(xsum1, _mm_movehl_ps(xsum1, xsum1)); - xsum1 = _mm_add_ss(xsum1, _mm_shuffle_ps(xsum1, xsum1, 0x55)); - _mm_store_ss(xy1, xsum1); - xsum2 = _mm_add_ps(xsum2, _mm_movehl_ps(xsum2, xsum2)); - xsum2 = _mm_add_ss(xsum2, _mm_shuffle_ps(xsum2, xsum2, 0x55)); - _mm_store_ss(xy2, xsum2); - for (;i<N;i++) - { - *xy1 = MAC16_16(*xy1, x[i], y01[i]); - *xy2 = MAC16_16(*xy2, x[i], y02[i]); - } -} +#define OVERRIDE_COMB_FILTER_CONST -#define OVERRIDE_CELT_INNER_PROD -static OPUS_INLINE opus_val32 celt_inner_prod_sse(const opus_val16 *x, const opus_val16 *y, - int N) -{ - int i; - float xy; - __m128 sum; - sum = _mm_setzero_ps(); - /* FIXME: We should probably go 8-way and use 2 sums. */ - for (i=0;i<N-3;i+=4) - { - __m128 xi = _mm_loadu_ps(x+i); - __m128 yi = _mm_loadu_ps(y+i); - sum = _mm_add_ps(sum,_mm_mul_ps(xi, yi)); - } - /* Horizontal sum */ - sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum)); - sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 0x55)); - _mm_store_ss(&xy, sum); - for (;i<N;i++) - { - xy = MAC16_16(xy, x[i], y[i]); - } - return xy; -} - -# define celt_inner_prod(_x, _y, len, arch) \ - ((void)(arch),celt_inner_prod_sse(_x, _y, len)) +void dual_inner_prod_sse(const opus_val16 *x, + const opus_val16 *y01, + const opus_val16 *y02, + int N, + opus_val32 *xy1, + opus_val32 *xy2); + +void comb_filter_const_sse(opus_val32 *y, + opus_val32 *x, + int T, + int N, + opus_val16 g10, + opus_val16 g11, + opus_val16 g12); + + +#if defined(OPUS_X86_PRESUME_SSE) +# define dual_inner_prod(x, y01, y02, N, xy1, xy2, arch) \ + ((void)(arch),dual_inner_prod_sse(x, y01, y02, N, xy1, xy2)) #define OVERRIDE_COMB_FILTER_CONST -static OPUS_INLINE void comb_filter_const(opus_val32 *y, opus_val32 *x, int T, int N, - opus_val16 g10, opus_val16 g11, opus_val16 g12) -{ - int i; - __m128 x0v; - __m128 g10v, g11v, g12v; - g10v = _mm_load1_ps(&g10); - g11v = _mm_load1_ps(&g11); - g12v = _mm_load1_ps(&g12); - x0v = _mm_loadu_ps(&x[-T-2]); - for (i=0;i<N-3;i+=4) - { - __m128 yi, yi2, x1v, x2v, x3v, x4v; - const opus_val32 *xp = &x[i-T-2]; - yi = _mm_loadu_ps(x+i); - x4v = _mm_loadu_ps(xp+4); -#if 0 - /* Slower version with all loads */ - x1v = _mm_loadu_ps(xp+1); - x2v = _mm_loadu_ps(xp+2); - x3v = _mm_loadu_ps(xp+3); -#else - x2v = _mm_shuffle_ps(x0v, x4v, 0x4e); - x1v = _mm_shuffle_ps(x0v, x2v, 0x99); - x3v = _mm_shuffle_ps(x2v, x4v, 0x99); -#endif - yi = _mm_add_ps(yi, _mm_mul_ps(g10v,x2v)); -#if 0 /* Set to 1 to make it bit-exact with the non-SSE version */ - yi = _mm_add_ps(yi, _mm_mul_ps(g11v,_mm_add_ps(x3v,x1v))); - yi = _mm_add_ps(yi, _mm_mul_ps(g12v,_mm_add_ps(x4v,x0v))); #else - /* Use partial sums */ - yi2 = _mm_add_ps(_mm_mul_ps(g11v,_mm_add_ps(x3v,x1v)), - _mm_mul_ps(g12v,_mm_add_ps(x4v,x0v))); - yi = _mm_add_ps(yi, yi2); + +extern void (*const DUAL_INNER_PROD_IMPL[OPUS_ARCHMASK + 1])( + const opus_val16 *x, + const opus_val16 *y01, + const opus_val16 *y02, + int N, + opus_val32 *xy1, + opus_val32 *xy2); + +#define dual_inner_prod(x, y01, y02, N, xy1, xy2, arch) \ + ((*DUAL_INNER_PROD_IMPL[(arch) & OPUS_ARCHMASK])(x, y01, y02, N, xy1, xy2)) + +extern void (*const COMB_FILTER_CONST_IMPL[OPUS_ARCHMASK + 1])( + opus_val32 *y, + opus_val32 *x, + int T, + int N, + opus_val16 g10, + opus_val16 g11, + opus_val16 g12); + +#define comb_filter_const(y, x, T, N, g10, g11, g12, arch) \ + ((*COMB_FILTER_CONST_IMPL[(arch) & OPUS_ARCHMASK])(y, x, T, N, g10, g11, g12)) + +#define NON_STATIC_COMB_FILTER_CONST_C + #endif - x0v=x4v; - _mm_storeu_ps(y+i, yi); - } -#ifdef CUSTOM_MODES - for (;i<N;i++) - { - y[i] = x[i] - + MULT16_32_Q15(g10,x[i-T]) - + MULT16_32_Q15(g11,ADD32(x[i-T+1],x[i-T-1])) - + MULT16_32_Q15(g12,ADD32(x[i-T+2],x[i-T-2])); - } #endif -} #endif -#endif diff --git a/celt/x86/pitch_sse2.c b/celt/x86/pitch_sse2.c new file mode 100644 index 0000000..a0e7d1b --- /dev/null +++ b/celt/x86/pitch_sse2.c @@ -0,0 +1,95 @@ +/* Copyright (c) 2014, Cisco Systems, INC + Written by XiangMingZhu WeiZhou MinPeng YanWang + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions + are met: + + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER + OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING + NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + +#ifdef HAVE_CONFIG_H +#include "config.h" +#endif + +#include <xmmintrin.h> +#include <emmintrin.h> + +#include "macros.h" +#include "celt_lpc.h" +#include "stack_alloc.h" +#include "mathops.h" +#include "pitch.h" + +#if defined(OPUS_X86_MAY_HAVE_SSE2) && defined(FIXED_POINT) +opus_val32 celt_inner_prod_sse2(const opus_val16 *x, const opus_val16 *y, + int N) +{ + opus_int i, dataSize16; + opus_int32 sum; + + __m128i inVec1_76543210, inVec1_FEDCBA98, acc1; + __m128i inVec2_76543210, inVec2_FEDCBA98, acc2; + + sum = 0; + dataSize16 = N & ~15; + + acc1 = _mm_setzero_si128(); + acc2 = _mm_setzero_si128(); + + for (i=0;i<dataSize16;i+=16) + { + inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0])); + inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0])); + + inVec1_FEDCBA98 = _mm_loadu_si128((__m128i *)(&x[i + 8])); + inVec2_FEDCBA98 = _mm_loadu_si128((__m128i *)(&y[i + 8])); + + inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210); + inVec1_FEDCBA98 = _mm_madd_epi16(inVec1_FEDCBA98, inVec2_FEDCBA98); + + acc1 = _mm_add_epi32(acc1, inVec1_76543210); + acc2 = _mm_add_epi32(acc2, inVec1_FEDCBA98); + } + + acc1 = _mm_add_epi32( acc1, acc2 ); + + if (N - i >= 8) + { + inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0])); + inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0])); + + inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210); + + acc1 = _mm_add_epi32(acc1, inVec1_76543210); + i += 8; + } + + acc1 = _mm_add_epi32(acc1, _mm_unpackhi_epi64( acc1, acc1)); + acc1 = _mm_add_epi32(acc1, _mm_shufflelo_epi16( acc1, 0x0E)); + sum += _mm_cvtsi128_si32(acc1); + + for (;i<N;i++) { + sum = silk_SMLABB(sum, x[i], y[i]); + } + + return sum; +} +#endif diff --git a/celt/x86/pitch_sse4_1.c b/celt/x86/pitch_sse4_1.c new file mode 100644 index 0000000..a092c68 --- /dev/null +++ b/celt/x86/pitch_sse4_1.c @@ -0,0 +1,195 @@ +/* Copyright (c) 2014, Cisco Systems, INC + Written by XiangMingZhu WeiZhou MinPeng YanWang + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions + are met: + + - Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + - Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER + OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING + NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +*/ + +#ifdef HAVE_CONFIG_H +#include "config.h" +#endif + +#include <xmmintrin.h> +#include <emmintrin.h> + +#include "macros.h" +#include "celt_lpc.h" +#include "stack_alloc.h" +#include "mathops.h" +#include "pitch.h" + +#if defined(OPUS_X86_MAY_HAVE_SSE4_1) && defined(FIXED_POINT) +#include <smmintrin.h> +#include "x86cpu.h" + +opus_val32 celt_inner_prod_sse4_1(const opus_val16 *x, const opus_val16 *y, + int N) +{ + opus_int i, dataSize16; + opus_int32 sum; + __m128i inVec1_76543210, inVec1_FEDCBA98, acc1; + __m128i inVec2_76543210, inVec2_FEDCBA98, acc2; + __m128i inVec1_3210, inVec2_3210; + + sum = 0; + dataSize16 = N & ~15; + + acc1 = _mm_setzero_si128(); + acc2 = _mm_setzero_si128(); + + for (i=0;i<dataSize16;i+=16) { + inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0])); + inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0])); + + inVec1_FEDCBA98 = _mm_loadu_si128((__m128i *)(&x[i + 8])); + inVec2_FEDCBA98 = _mm_loadu_si128((__m128i *)(&y[i + 8])); + + inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210); + inVec1_FEDCBA98 = _mm_madd_epi16(inVec1_FEDCBA98, inVec2_FEDCBA98); + + acc1 = _mm_add_epi32(acc1, inVec1_76543210); + acc2 = _mm_add_epi32(acc2, inVec1_FEDCBA98); + } + + acc1 = _mm_add_epi32(acc1, acc2); + + if (N - i >= 8) + { + inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0])); + inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0])); + + inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210); + + acc1 = _mm_add_epi32(acc1, inVec1_76543210); + i += 8; + } + + if (N - i >= 4) + { + inVec1_3210 = OP_CVTEPI16_EPI32_M64(&x[i + 0]); + inVec2_3210 = OP_CVTEPI16_EPI32_M64(&y[i + 0]); + + inVec1_3210 = _mm_mullo_epi32(inVec1_3210, inVec2_3210); + + acc1 = _mm_add_epi32(acc1, inVec1_3210); + i += 4; + } + + acc1 = _mm_add_epi32(acc1, _mm_unpackhi_epi64(acc1, acc1)); + acc1 = _mm_add_epi32(acc1, _mm_shufflelo_epi16(acc1, 0x0E)); + + sum += _mm_cvtsi128_si32(acc1); + + for (;i<N;i++) + { + sum = silk_SMLABB(sum, x[i], y[i]); + } + + return sum; +} + +void xcorr_kernel_sse4_1(const opus_val16 * x, const opus_val16 * y, opus_val32 sum[ 4 ], int len) +{ + int j; + + __m128i vecX, vecX0, vecX1, vecX2, vecX3; + __m128i vecY0, vecY1, vecY2, vecY3; + __m128i sum0, sum1, sum2, sum3, vecSum; + __m128i initSum; + + celt_assert(len >= 3); + + sum0 = _mm_setzero_si128(); + sum1 = _mm_setzero_si128(); + sum2 = _mm_setzero_si128(); + sum3 = _mm_setzero_si128(); + + for (j=0;j<(len-7);j+=8) + { + vecX = _mm_loadu_si128((__m128i *)(&x[j + 0])); + vecY0 = _mm_loadu_si128((__m128i *)(&y[j + 0])); + vecY1 = _mm_loadu_si128((__m128i *)(&y[j + 1])); + vecY2 = _mm_loadu_si128((__m128i *)(&y[j + 2])); + vecY3 = _mm_loadu_si128((__m128i *)(&y[j + 3])); + + sum0 = _mm_add_epi32(sum0, _mm_madd_epi16(vecX, vecY0)); + sum1 = _mm_add_epi32(sum1, _mm_madd_epi16(vecX, vecY1)); + sum2 = _mm_add_epi32(sum2, _mm_madd_epi16(vecX, vecY2)); + sum3 = _mm_add_epi32(sum3, _mm_madd_epi16(vecX, vecY3)); + } + + sum0 = _mm_add_epi32(sum0, _mm_unpackhi_epi64( sum0, sum0)); + sum0 = _mm_add_epi32(sum0, _mm_shufflelo_epi16( sum0, 0x0E)); + + sum1 = _mm_add_epi32(sum1, _mm_unpackhi_epi64( sum1, sum1)); + sum1 = _mm_add_epi32(sum1, _mm_shufflelo_epi16( sum1, 0x0E)); + + sum2 = _mm_add_epi32(sum2, _mm_unpackhi_epi64( sum2, sum2)); + sum2 = _mm_add_epi32(sum2, _mm_shufflelo_epi16( sum2, 0x0E)); + + sum3 = _mm_add_epi32(sum3, _mm_unpackhi_epi64( sum3, sum3)); + sum3 = _mm_add_epi32(sum3, _mm_shufflelo_epi16( sum3, 0x0E)); + + vecSum = _mm_unpacklo_epi64(_mm_unpacklo_epi32(sum0, sum1), + _mm_unpacklo_epi32(sum2, sum3)); + + for (;j<(len-3);j+=4) + { + vecX = OP_CVTEPI16_EPI32_M64(&x[j + 0]); + vecX0 = _mm_shuffle_epi32(vecX, 0x00); + vecX1 = _mm_shuffle_epi32(vecX, 0x55); + vecX2 = _mm_shuffle_epi32(vecX, 0xaa); + vecX3 = _mm_shuffle_epi32(vecX, 0xff); + + vecY0 = OP_CVTEPI16_EPI32_M64(&y[j + 0]); + vecY1 = OP_CVTEPI16_EPI32_M64(&y[j + 1]); + vecY2 = OP_CVTEPI16_EPI32_M64(&y[j + 2]); + vecY3 = OP_CVTEPI16_EPI32_M64(&y[j + 3]); + + sum0 = _mm_mullo_epi32(vecX0, vecY0); + sum1 = _mm_mullo_epi32(vecX1, vecY1); + sum2 = _mm_mullo_epi32(vecX2, vecY2); + sum3 = _mm_mullo_epi32(vecX3, vecY3); + + sum0 = _mm_add_epi32(sum0, sum1); + sum2 = _mm_add_epi32(sum2, sum3); + vecSum = _mm_add_epi32(vecSum, sum0); + vecSum = _mm_add_epi32(vecSum, sum2); + } + + for (;j<len;j++) + { + vecX = OP_CVTEPI16_EPI32_M64(&x[j + 0]); + vecX0 = _mm_shuffle_epi32(vecX, 0x00); + + vecY0 = OP_CVTEPI16_EPI32_M64(&y[j + 0]); + + sum0 = _mm_mullo_epi32(vecX0, vecY0); + vecSum = _mm_add_epi32(vecSum, sum0); + } + + initSum = _mm_loadu_si128((__m128i *)(&sum[0])); + initSum = _mm_add_epi32(initSum, vecSum); + _mm_storeu_si128((__m128i *)sum, initSum); +} +#endif diff --git a/celt/x86/x86_celt_map.c b/celt/x86/x86_celt_map.c index 83410db..1ed2acb 100644 --- a/celt/x86/x86_celt_map.c +++ b/celt/x86/x86_celt_map.c @@ -38,6 +38,8 @@ # if defined(FIXED_POINT) +#if defined(OPUS_X86_MAY_HAVE_SSE4_1) && !defined(OPUS_X86_PRESUME_SSE4_1) + void (*const CELT_FIR_IMPL[OPUS_ARCHMASK + 1])( const opus_val16 *x, const opus_val16 *num, @@ -49,8 +51,8 @@ void (*const CELT_FIR_IMPL[OPUS_ARCHMASK + 1])( ) = { celt_fir_c, /* non-sse */ celt_fir_c, + celt_fir_c, MAY_HAVE_SSE4_1(celt_fir), /* sse4.1 */ - NULL }; void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])( @@ -61,24 +63,86 @@ void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])( ) = { xcorr_kernel_c, /* non-sse */ xcorr_kernel_c, + xcorr_kernel_c, MAY_HAVE_SSE4_1(xcorr_kernel), /* sse4.1 */ - NULL }; +#endif + +#if (defined(OPUS_X86_MAY_HAVE_SSE4_1) && !defined(OPUS_X86_PRESUME_SSE4_1)) || \ + (!defined(OPUS_X86_MAY_HAVE_SSE_4_1) && defined(OPUS_X86_MAY_HAVE_SSE2) && !defined(OPUS_X86_PRESUME_SSE2)) + opus_val32 (*const CELT_INNER_PROD_IMPL[OPUS_ARCHMASK + 1])( const opus_val16 *x, const opus_val16 *y, int N ) = { celt_inner_prod_c, /* non-sse */ + celt_inner_prod_c, MAY_HAVE_SSE2(celt_inner_prod), MAY_HAVE_SSE4_1(celt_inner_prod), /* sse4.1 */ - NULL }; +#endif + # else -# error "Floating-point implementation is not supported by x86 RTCD yet." \ - "Reconfigure with --disable-rtcd or send patches." -# endif +#if defined(OPUS_X86_MAY_HAVE_SSE) && !defined(OPUS_X86_PRESUME_SSE) + +void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])( + const opus_val16 *x, + const opus_val16 *y, + opus_val32 sum[4], + int len +) = { + xcorr_kernel_c, /* non-sse */ + MAY_HAVE_SSE(xcorr_kernel), + MAY_HAVE_SSE(xcorr_kernel), + MAY_HAVE_SSE(xcorr_kernel), +}; + +opus_val32 (*const CELT_INNER_PROD_IMPL[OPUS_ARCHMASK + 1])( + const opus_val16 *x, + const opus_val16 *y, + int N +) = { + celt_inner_prod_c, /* non-sse */ + MAY_HAVE_SSE(celt_inner_prod), + MAY_HAVE_SSE(celt_inner_prod), + MAY_HAVE_SSE(celt_inner_prod), +}; + +void (*const DUAL_INNER_PROD_IMPL[OPUS_ARCHMASK + 1])( + const opus_val16 *x, + const opus_val16 *y01, + const opus_val16 *y02, + int N, + opus_val32 *xy1, + opus_val32 *xy2 +) = { + dual_inner_prod_c, /* non-sse */ + MAY_HAVE_SSE(dual_inner_prod), + MAY_HAVE_SSE(dual_inner_prod), + MAY_HAVE_SSE(dual_inner_prod), +}; + +void (*const COMB_FILTER_CONST_IMPL[OPUS_ARCHMASK + 1])( + opus_val32 *y, + opus_val32 *x, + int T, + int N, + opus_val16 g10, + opus_val16 g11, + opus_val16 g12 +) = { + comb_filter_const_c, /* non-sse */ + MAY_HAVE_SSE(comb_filter_const), + MAY_HAVE_SSE(comb_filter_const), + MAY_HAVE_SSE(comb_filter_const), +}; + + +#endif + +#endif #endif diff --git a/celt/x86/x86cpu.c b/celt/x86/x86cpu.c index c82a4b7..afcdeb6 100644 --- a/celt/x86/x86cpu.c +++ b/celt/x86/x86cpu.c @@ -35,10 +35,19 @@ #include "pitch.h" #include "x86cpu.h" +#if (defined(OPUS_X86_MAY_HAVE_SSE) && !defined(OPUS_X86_PRESUME_SSE)) || \ + (defined(OPUS_X86_MAY_HAVE_SSE2) && !defined(OPUS_X86_PRESUME_SSE2)) || \ + (defined(OPUS_X86_MAY_HAVE_SSE4_1) && !defined(OPUS_X86_PRESUME_SSE4_1)) + + #if defined(_MSC_VER) #include <intrin.h> -#define cpuid(info,x) __cpuid(info,x) +static _inline void cpuid(unsigned int CPUInfo[4], unsigned int InfoType) +{ + __cpuid((int*)CPUInfo, InfoType); +} + #else #if defined(CPU_INFO_BY_C) @@ -48,14 +57,28 @@ static void cpuid(unsigned int CPUInfo[4], unsigned int InfoType) { #if defined(CPU_INFO_BY_ASM) +#if defined(__i386__) && defined(__PIC__) +/* %ebx is PIC register in 32-bit, so mustn't clobber it. */ + __asm__ __volatile__ ( + "xchg %%ebx, %1\n" + "cpuid\n" + "xchg %%ebx, %1\n": + "=a" (CPUInfo[0]), + "=r" (CPUInfo[1]), + "=c" (CPUInfo[2]), + "=d" (CPUInfo[3]) : + "0" (InfoType) + ); +#else __asm__ __volatile__ ( "cpuid": "=a" (CPUInfo[0]), "=b" (CPUInfo[1]), "=c" (CPUInfo[2]), "=d" (CPUInfo[3]) : - "a" (InfoType), "c" (0) + "0" (InfoType) ); +#endif #elif defined(CPU_INFO_BY_C) __get_cpuid(InfoType, &(CPUInfo[0]), &(CPUInfo[1]), &(CPUInfo[2]), &(CPUInfo[3])); #endif @@ -63,11 +86,9 @@ static void cpuid(unsigned int CPUInfo[4], unsigned int InfoType) #endif -#include "SigProc_FIX.h" -#include "celt_lpc.h" - typedef struct CPU_Feature{ /* SIMD: 128-bit */ + int HW_SSE; int HW_SSE2; int HW_SSE41; } CPU_Feature; @@ -82,19 +103,31 @@ static void opus_cpu_feature_check(CPU_Feature *cpu_feature) if (nIds >= 1){ cpuid(info, 1); + cpu_feature->HW_SSE = (info[3] & (1 << 25)) != 0; cpu_feature->HW_SSE2 = (info[3] & (1 << 26)) != 0; cpu_feature->HW_SSE41 = (info[2] & (1 << 19)) != 0; } + else { + cpu_feature->HW_SSE = 0; + cpu_feature->HW_SSE2 = 0; + cpu_feature->HW_SSE41 = 0; + } } int opus_select_arch(void) { - CPU_Feature cpu_feature = {0}; + CPU_Feature cpu_feature; int arch; opus_cpu_feature_check(&cpu_feature); arch = 0; + if (!cpu_feature.HW_SSE) + { + return arch; + } + arch++; + if (!cpu_feature.HW_SSE2) { return arch; @@ -109,3 +142,5 @@ int opus_select_arch(void) return arch; } + +#endif diff --git a/celt/x86/x86cpu.h b/celt/x86/x86cpu.h index ef53f0c..870b15e 100644 --- a/celt/x86/x86cpu.h +++ b/celt/x86/x86cpu.h @@ -28,6 +28,12 @@ #if !defined(X86CPU_H) # define X86CPU_H +# if defined(OPUS_X86_MAY_HAVE_SSE) +# define MAY_HAVE_SSE(name) name ## _sse +# else +# define MAY_HAVE_SSE(name) name ## _c +# endif + # if defined(OPUS_X86_MAY_HAVE_SSE2) # define MAY_HAVE_SSE2(name) name ## _sse2 # else @@ -55,21 +61,25 @@ int opus_select_arch(void); reference in the PMOVSXWD instruction itself, but gcc is not smart enough to optimize this out when optimizations ARE enabled. - It appears clang requires us to do this always (which is fair, since - technically the compiler is always allowed to do the dereference before - invoking the function implementing the intrinsic). I have not investiaged - whether it is any smarter than gcc when it comes to eliminating the extra - load instruction.*/ + Clang, in contrast, requires us to do this always for _mm_cvtepi8_epi32 + (which is fair, since technically the compiler is always allowed to do the + dereference before invoking the function implementing the intrinsic). + However, it is smart enough to eliminate the extra MOVD instruction. + For _mm_cvtepi16_epi32, it does the right thing, though does *not* optimize out + the extra MOVQ if it's specified explicitly */ + # if defined(__clang__) || !defined(__OPTIMIZE__) # define OP_CVTEPI8_EPI32_M32(x) \ (_mm_cvtepi8_epi32(_mm_cvtsi32_si128(*(int *)(x)))) - -# define OP_CVTEPI16_EPI32_M64(x) \ - (_mm_cvtepi16_epi32(_mm_loadl_epi64((__m128i *)(x)))) # else # define OP_CVTEPI8_EPI32_M32(x) \ (_mm_cvtepi8_epi32(*(__m128i *)(x))) +#endif +# if !defined(__OPTIMIZE__) +# define OP_CVTEPI16_EPI32_M64(x) \ + (_mm_cvtepi16_epi32(_mm_loadl_epi64((__m128i *)(x)))) +# else # define OP_CVTEPI16_EPI32_M64(x) \ (_mm_cvtepi16_epi32(*(__m128i *)(x))) # endif diff --git a/celt_sources.mk b/celt_sources.mk index 7121301..2ffe99a 100644 --- a/celt_sources.mk +++ b/celt_sources.mk @@ -21,7 +21,10 @@ CELT_SOURCES_SSE = celt/x86/x86cpu.c \ celt/x86/x86_celt_map.c \ celt/x86/pitch_sse.c -CELT_SOURCES_SSE4_1 = celt/x86/celt_lpc_sse.c +CELT_SOURCES_SSE2 = celt/x86/pitch_sse2.c + +CELT_SOURCES_SSE4_1 = celt/x86/celt_lpc_sse.c \ +celt/x86/pitch_sse4_1.c CELT_SOURCES_ARM = \ celt/arm/armcpu.c \ diff --git a/configure.ac b/configure.ac index baa3425..2380a5c 100644 --- a/configure.ac +++ b/configure.ac @@ -348,8 +348,24 @@ AM_CONDITIONAL([OPUS_ARM_INLINE_ASM], AM_CONDITIONAL([OPUS_ARM_EXTERNAL_ASM], [test x"${asm_optimization%% *}" = x"ARM"]) -AM_CONDITIONAL([HAVE_SSE4_1], [false]) +AM_CONDITIONAL([HAVE_SSE], [false]) AM_CONDITIONAL([HAVE_SSE2], [false]) +AM_CONDITIONAL([HAVE_SSE4_1], [false]) + +m4_define([DEFAULT_X86_SSE_CFLAGS], [-msse]) +m4_define([DEFAULT_X86_SSE2_CFLAGS], [-msse2]) +m4_define([DEFAULT_X86_SSE4_1_CFLAGS], [-msse4.1]) +m4_define([DEFAULT_ARM_NEON_INTR_CFLAGS], [-mfpu=neon]) + +AC_ARG_VAR([X86_SSE_CFLAGS], [C compiler flags to compile SSE intrinsics @<:@default=]DEFAULT_X86_SSE_CFLAGS[@:>@]) +AC_ARG_VAR([X86_SSE2_CFLAGS], [C compiler flags to compile SSE2 intrinsics @<:@default=]DEFAULT_X86_SSE2_CFLAGS[@:>@]) +AC_ARG_VAR([X86_SSE4_1_CFLAGS], [C compiler flags to compile SSE4.1 intrinsics @<:@default=]DEFAULT_X86_SSE4_1_CFLAGS[@:>@]) +AC_ARG_VAR([ARM_NEON_INTR_CFLAGS], [C compiler flags to compile ARM NEON intrinsics @<:@default=]DEFAULT_ARM_NEON_INTR_CFLAGS[@:>@]) + +AS_VAR_SET_IF([X86_SSE_CFLAGS], [], [AS_VAR_SET([X86_SSE_CFLAGS], DEFAULT_X86_SSE_CFLAGS)]) +AS_VAR_SET_IF([X86_SSE2_CFLAGS], [], [AS_VAR_SET([X86_SSE2_CFLAGS], DEFAULT_X86_SSE2_CFLAGS)]) +AS_VAR_SET_IF([X86_SSE4_1_CFLAGS], [], [AS_VAR_SET([X86_SSE4_1_CFLAGS], DEFAULT_X86_SSE4_1_CFLAGS)]) +AS_VAR_SET_IF([ARM_NEON_INTR_CFLAGS], [], [AS_VAR_SET([ARM_NEON_INTR_CFLAGS], DEFAULT_ARM_NEON_INTR_CFLAGS)]) AC_DEFUN([OPUS_PATH_NE10], [ @@ -426,64 +442,183 @@ AC_DEFUN([OPUS_PATH_NE10], ) AS_IF([test x"$enable_intrinsics" = x"yes"],[ - case $host_cpu in - arm*) + intrinsics_support="" + AS_CASE([$host_cpu], + [arm*], + [ cpu_arm=yes - AC_MSG_CHECKING(if compiler supports ARM NEON intrinsics) - save_CFLAGS="$CFLAGS"; CFLAGS="-mfpu=neon $CFLAGS" - AC_LINK_IFELSE( - [ - AC_LANG_PROGRAM( - [[#include <arm_neon.h> - ]], - [[ - static float32x4_t A[2], SUMM; - SUMM = vmlaq_f32(SUMM, A[0], A[1]); - ]] - ) - ],[ - OPUS_ARM_NEON_INTR=1 - AC_MSG_RESULT([yes]) - ],[ - OPUS_ARM_NEON_INTR=0 - AC_MSG_RESULT([no]) - ] + OPUS_CHECK_INTRINSICS( + [ARM Neon], + [$ARM_NEON_INTR_CFLAGS], + [OPUS_ARM_MAY_HAVE_NEON_INTR], + [OPUS_ARM_PRESUME_NEON_INTR], + [[#include <arm_neon.h> + ]], + [[ + static float32x4_t A0, A1, SUMM; + SUMM = vmlaq_f32(SUMM, A0, A1); + ]] + ) + AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1" && test x"$OPUS_ARM_PRESUME_NEON_INTR" != x"1"], + [ + OPUS_ARM_NEON_INTR_CFLAGS="$ARM_NEON_INTR_CFLAGS" + AC_SUBST([OPUS_ARM_NEON_INTR_CFLAGS]) + ] ) - CFLAGS="$save_CFLAGS" - #Now we know if compiler supports ARM neon intrinsics or not - #Currently we only have intrinsic optimization for floating point + #Currently we only have intrinsic optimizations for floating point AS_IF([test x"$enable_float" = x"yes"], [ - AS_IF([test x"$OPUS_ARM_NEON_INTR" = x"1"], + AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"], [ - AC_DEFINE([OPUS_ARM_NEON_INTR], 1, [Compiler supports ARMv7 Neon Intrinsics]) - AS_IF([test x"enable_rtcd" != x""], - [rtcd_support="ARM (ARMv7_Neon_Intrinsics)"],[]) - enable_intrinsics="$enable_intrinsics ARMv7_Neon_Intrinsics" + OPUS_ARM_NEON_INTR=1 + AC_DEFINE([OPUS_ARM_NEON_INTR], 1, + [Support ARMv7 Neon Intrinsics for float]) + AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON_INTR], 1, + [Compiler supports ARMv7 Neon Intrinsics]) + intrinsics_support="$intrinsics_support (Neon_Intrinsics)" + + AS_IF([test x"enable_rtcd" != x"" && test x"$OPUS_ARM_PRESUME_NEON_INTR" != x"1"], + [rtcd_support="$rtcd_support (ARMv7_Neon_Intrinsics)"],[]) + + AS_IF([test x"$OPUS_ARM_PRESUME_NEON_INTR" = x"1"], + [AC_DEFINE([OPUS_ARM_PRESUME_NEON_INTR], 1, + [Define if binary requires NEON intrinsics support])]) + + AS_IF([test x"$rtcd_support" = x""], + [rtcd_support=no]) + + AS_IF([test x"$intrinsics_support" = x""], + [intrinsics_support=no], + [intrinsics_support="arm$intrinsics_support"]) + dnl Don't see why defining these is necessary to check features at runtime AC_DEFINE([OPUS_ARM_MAY_HAVE_EDSP], 1, [Define if compiler support EDSP Instructions]) AC_DEFINE([OPUS_ARM_MAY_HAVE_MEDIA], 1, [Define if compiler support MEDIA Instructions]) AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON], 1, [Define if compiler support NEON instructions]) OPUS_PATH_NE10() - AS_IF([test x"$NE10_LIBS" != "x"], - [enable_intrinsics="$enable_intrinsics NE10"],[]) + AS_IF([test x"$HAVE_ARM_NE10" = x"1"], + [intrinsics_support="$intrinsics_support NE10"],[]) ], [ AC_MSG_WARN([Compiler does not support ARM intrinsics]) - enable_intrinsics=no + intrinsics_support=no ]) ], [ - AC_MSG_WARN([Currently on have ARM intrinsics for float]) - enable_intrinsics=no + AC_MSG_WARN([Currently only have ARM intrinsics for float]) + intrinsics_support=no ]) - ;; - "i386" | "i686" | "x86_64") - AS_IF([test x"$enable_float" = x"no"],[ - AS_IF([test x"$enable_rtcd" = x"yes"],[ + ], + [i?86|x86_64], + [ + OPUS_CHECK_INTRINSICS( + [SSE], + [$X86_SSE_CFLAGS], + [OPUS_X86_MAY_HAVE_SSE], + [OPUS_X86_PRESUME_SSE], + [[#include <xmmintrin.h> + ]], + [[ + static __m128 mtest; + mtest = _mm_setzero_ps(); + ]] + ) + AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE" = x"1" && test x"$OPUS_X86_PRESUME_SSE" != x"1"], + [ + OPUS_X86_SSE_CFLAGS="$X86_SSE_CFLAGS" + AC_SUBST([OPUS_X86_SSE_CFLAGS]) + ] + ) + OPUS_CHECK_INTRINSICS( + [SSE2], + [$X86_SSE2_CFLAGS], + [OPUS_X86_MAY_HAVE_SSE2], + [OPUS_X86_PRESUME_SSE2], + [[#include <emmintrin.h> + ]], + [[ + static __m128i mtest; + mtest = _mm_setzero_si128(); + ]] + ) + AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE2" = x"1" && test x"$OPUS_X86_PRESUME_SSE2" != x"1"], + [ + OPUS_X86_SSE2_CFLAGS="$X86_SSE2_CFLAGS" + AC_SUBST([OPUS_X86_SSE2_CFLAGS]) + ] + ) + OPUS_CHECK_INTRINSICS( + [SSE4.1], + [$X86_SSE4_1_CFLAGS], + [OPUS_X86_MAY_HAVE_SSE4_1], + [OPUS_X86_PRESUME_SSE4_1], + [[#include <smmintrin.h> + ]], + [[ + static __m128i mtest; + mtest = _mm_setzero_si128(); + mtest = _mm_cmpeq_epi64(mtest, mtest); + ]] + ) + AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE4_1" = x"1" && test x"$OPUS_X86_PRESUME_SSE4_1" != x"1"], + [ + OPUS_X86_SSE4_1_CFLAGS="$X86_SSE4_1_CFLAGS" + AC_SUBST([OPUS_X86_SSE4_1_CFLAGS]) + ] + ) + + AS_IF([test x"$rtcd_support" = x"no"], [rtcd_support=""]) + AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE" = x"1"], + [ + AC_DEFINE([OPUS_X86_MAY_HAVE_SSE], 1, [Compiler supports X86 SSE Intrinsics]) + intrinsics_support="$intrinsics_support SSE" + + AS_IF([test x"$OPUS_X86_PRESUME_SSE" = x"1"], + [AC_DEFINE([OPUS_X86_PRESUME_SSE], 1, [Define if binary requires SSE intrinsics support])], + [rtcd_support="$rtcd_support SSE"]) + ], + [ + AC_MSG_WARN([Compiler does not support SSE intrinsics]) + ]) + + AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE2" = x"1"], + [ + AC_DEFINE([OPUS_X86_MAY_HAVE_SSE2], 1, [Compiler supports X86 SSE2 Intrinsics]) + intrinsics_support="$intrinsics_support SSE2" + + AS_IF([test x"$OPUS_X86_PRESUME_SSE2" = x"1"], + [AC_DEFINE([OPUS_X86_PRESUME_SSE2], 1, [Define if binary requires SSE2 intrinsics support])], + [rtcd_support="$rtcd_support SSE2"]) + ], + [ + AC_MSG_WARN([Compiler does not support SSE2 intrinsics]) + ]) + + AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE4_1" = x"1"], + [ + AC_DEFINE([OPUS_X86_MAY_HAVE_SSE4_1], 1, [Compiler supports X86 SSE4.1 Intrinsics]) + intrinsics_support="$intrinsics_support SSE4.1" + + AS_IF([test x"$OPUS_X86_PRESUME_SSE4_1" = x"1"], + [AC_DEFINE([OPUS_X86_PRESUME_SSE4_1], 1, [Define if binary requires SSE4.1 intrinsics support])], + [rtcd_support="$rtcd_support SSE4.1"]) + ], + [ + AC_MSG_WARN([Compiler does not support SSE4.1 intrinsics]) + ]) + AS_IF([test x"$intrinsics_support" = x""], + [intrinsics_support=no], + [intrinsics_support="x86$intrinsics_support"] + ) + AS_IF([test x"$rtcd_support" = x""], + [rtcd_support=no], + [rtcd_support="x86$rtcd_support"], + ) + + AS_IF([test x"$enable_rtcd" = x"yes" && test x"$rtcd_support" != x""],[ get_cpuid_by_asm="no" - AC_MSG_CHECKING([Get CPU Info]) + AC_MSG_CHECKING([How to get X86 CPU Info]) AC_LINK_IFELSE([AC_LANG_PROGRAM([[ #include <stdio.h> ]],[[ @@ -493,7 +628,7 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ unsigned int CPUInfo3; unsigned int InfoType; __asm__ __volatile__ ( - "cpuid11": + "cpuid": "=a" (CPUInfo0), "=b" (CPUInfo1), "=c" (CPUInfo2), @@ -502,7 +637,8 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ ); ]])], [get_cpuid_by_asm="yes" - AC_MSG_RESULT([Inline Assembly])], + AC_MSG_RESULT([Inline Assembly]) + AC_DEFINE([CPU_INFO_BY_ASM], [1], [Get CPU Info by asm method])], [AC_LINK_IFELSE([AC_LANG_PROGRAM([[ #include <cpuid.h> ]],[[ @@ -513,82 +649,17 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ unsigned int InfoType; __get_cpuid(InfoType, &CPUInfo0, &CPUInfo1, &CPUInfo2, &CPUInfo3); ]])], - [AC_MSG_RESULT([C method])], - [AC_MSG_ERROR([not support Get CPU Info, please disable intrinsics ])])]) - - AC_MSG_CHECKING([sse4.1]) - TMP_CFLAGS="$CFLAGS" - gcc -Q --help=target | grep "\-msse4.1 " - AS_IF([test x"$?" = x"0"],[ - CFLAGS="$CFLAGS -msse4.1" - AC_CHECK_HEADER(xmmintrin.h, [], [AC_MSG_ERROR([Couldn't find xmmintrin.h])]) - AC_CHECK_HEADER(emmintrin.h, [], [AC_MSG_ERROR([Couldn't find emmintrin.h])]) - AC_CHECK_HEADER(smmintrin.h, [], [AC_MSG_ERROR([Couldn't find smmintrin.h])],[ - #ifdef HAVE_XMMINSTRIN_H - #include <xmmintrin.h> - #endif - #ifdef HAVE_EMMINSTRIN_H - #include <emmintrin.h> - #endif - ]) - - AC_LINK_IFELSE([AC_LANG_PROGRAM([[ - #include <xmmintrin.h> - #include <emmintrin.h> - #include <smmintrin.h> - ]],[[ - __m128i mtest = _mm_setzero_si128(); - mtest = _mm_cmpeq_epi64(mtest, mtest); - ]])], - [AC_MSG_RESULT([yes])], [AC_MSG_ERROR([Compiler & linker failure for sse4.1, please disable intrinsics])]) - - CFLAGS="$TMP_CFLAGS" - AC_DEFINE([OPUS_X86_MAY_HAVE_SSE4_1], [1], [For x86 sse4.1 instrinsics optimizations]) - AC_DEFINE([OPUS_X86_MAY_HAVE_SSE2], [1], [For x86 sse2 instrinsics optimizations]) - rtcd_support="x86 sse4.1" - AM_CONDITIONAL([HAVE_SSE4_1], [true]) - AM_CONDITIONAL([HAVE_SSE2], [true]) - AS_IF([test x"$get_cpuid_by_asm" = x"yes"],[AC_DEFINE([CPU_INFO_BY_ASM], [1], [Get CPU Info by asm method])], - [AC_DEFINE([CPU_INFO_BY_C], [1], [Get CPU Info by C method])]) - ],[ ##### Else case for AS_IF([test x"$?" = x"0"]) - gcc -Q --help=target | grep "\-msse2 " - AC_MSG_CHECKING([sse2]) - AS_IF([test x"$?" = x"0"],[ - AC_MSG_RESULT([yes]) - CFLAGS="$CFLAGS -msse2" - AC_CHECK_HEADER(xmmintrin.h, [], [AC_MSG_ERROR([Couldn't find xmmintrin.h])]) - AC_CHECK_HEADER(emmintrin.h, [], [AC_MSG_ERROR([Couldn't find emmintrin.h])]) - - AC_LINK_IFELSE([AC_LANG_PROGRAM([[ - #include <xmmintrin.h> - #include <emmintrin.h> - ]],[[ - __m128i mtest = _mm_setzero_si128(); - ]])], - [AC_MSG_RESULT([yes])], [AC_MSG_ERROR([Compiler & linker failure for sse2, please disable intrinsics])]) - - CFLAGS="$TMP_CFLAGS" - AC_DEFINE([OPUS_X86_MAY_HAVE_SSE2], [1], [For x86 sse2 instrinsics optimize]) - rtcd_support="x86 sse2" - AM_CONDITIONAL([HAVE_SSE2], [true]) - AS_IF([test x"$get_cpuid_by_asm" = x"yes"],[AC_DEFINE([CPU_INFO_BY_ASM], [1], [Get CPU Info by asm method])], - [AC_DEFINE([CPU_INFO_BY_C], [1], [Get CPU Info by c method])]) - ],[enable_intrinsics="no"]) #End of AS_IF([test x"$?" = x"0"] - ]) - ], [ - enable_intrinsics="no" - ]) ## End of AS_IF([test x"$enable_rtcd" = x"yes"] -], -[ ## Else case for AS_IF([test x"$enable_float" = x"no"] - AC_MSG_WARN([Disabling intrinsics .. x86 intrinsics only avail for fixed point]) - enable_intrinsics="no" -]) ## End of AS_IF([test x"$enable_float" = x"no"] - ;; - *) + [AC_MSG_RESULT([C method]) + AC_DEFINE([CPU_INFO_BY_C], [1], [Get CPU Info by c method])], + [AC_MSG_ERROR([no supported Get CPU Info method, please disable intrinsics])])])]) + ], + [ AC_MSG_WARN([No intrinsics support for your architecture]) - enable_intrinsics="no" - ;; - esac + intrinsics_support="no" + ]) +], +[ + intrinsics_support="no" ]) AM_CONDITIONAL([CPU_ARM], [test "$cpu_arm" = "yes"]) @@ -597,6 +668,12 @@ AM_CONDITIONAL([OPUS_ARM_NEON_INTR], AM_CONDITIONAL([HAVE_ARM_NE10], [test x"$HAVE_ARM_NE10" = x"1"]) +AM_CONDITIONAL([HAVE_SSE], + [test x"$OPUS_X86_MAY_HAVE_SSE" = x"1"]) +AM_CONDITIONAL([HAVE_SSE2], + [test x"$OPUS_X86_MAY_HAVE_SSE2" = x"1"]) +AM_CONDITIONAL([HAVE_SSE4_1], + [test x"$OPUS_X86_MAY_HAVE_SSE4_1" = x"1"]) AS_IF([test x"$enable_rtcd" = x"yes"],[ AS_IF([test x"$rtcd_support" != x"no"],[ @@ -704,7 +781,7 @@ AC_MSG_NOTICE([ Fixed point debugging: ......... ${enable_fixed_point_debug} Inline Assembly Optimizations: . ${inline_optimization} External Assembly Optimizations: ${asm_optimization} - Intrinsics Optimizations.......: ${enable_intrinsics} + Intrinsics Optimizations.......: ${intrinsics_support} Run-time CPU detection: ........ ${rtcd_support} Custom modes: .................. ${enable_custom_modes} Assertion checking: ............ ${enable_assertions} diff --git a/m4/opus-intrinsics.m4 b/m4/opus-intrinsics.m4 new file mode 100644 index 0000000..c74aecd --- /dev/null +++ b/m4/opus-intrinsics.m4 @@ -0,0 +1,29 @@ +dnl opus-intrinsics.m4 +dnl macro for testing for support for compiler intrinsics, either by default or with a compiler flag + +dnl OPUS_CHECK_INTRINSICS(NAME-OF-INTRINSICS, COMPILER-FLAG-FOR-INTRINSICS, VAR-IF-PRESENT, VAR-IF-DEFAULT, TEST-PROGRAM-HEADER, TEST-PROGRAM-BODY) +AC_DEFUN([OPUS_CHECK_INTRINSICS], +[ + AC_MSG_CHECKING([if compiler supports $1 intrinsics]) + AC_LINK_IFELSE( + [AC_LANG_PROGRAM($5, $6)], + [ + $3=1 + $4=1 + AC_MSG_RESULT([yes]) + ],[ + $4=0 + AC_MSG_RESULT([no]) + AC_MSG_CHECKING([if compiler supports $1 intrinsics with $2]) + save_CFLAGS="$CFLAGS"; CFLAGS="$2 $CFLAGS" + AC_LINK_IFELSE([AC_LANG_PROGRAM($5, $6)], + [ + AC_MSG_RESULT([yes]) + $3=1 + ],[ + AC_MSG_RESULT([no]) + $3=0 + ]) + CFLAGS="$save_CFLAGS" + ]) +]) diff --git a/silk/x86/SigProc_FIX_sse.h b/silk/x86/SigProc_FIX_sse.h index 9a0e096..61efa8d 100644 --- a/silk/x86/SigProc_FIX_sse.h +++ b/silk/x86/SigProc_FIX_sse.h @@ -45,6 +45,12 @@ void silk_burg_modified_sse4_1( int arch /* I Run-time architecture */ ); +#if defined(OPUS_X86_PRESUME_SSE4_1) +#define silk_burg_modified(res_nrg, res_nrg_Q, A_Q16, x, minInvGain_Q30, subfr_length, nb_subfr, D, arch) \ + ((void)(arch), silk_burg_modified_sse4_1(res_nrg, res_nrg_Q, A_Q16, x, minInvGain_Q30, subfr_length, nb_subfr, D, arch)) + +#else + extern void (*const SILK_BURG_MODIFIED_IMPL[OPUS_ARCHMASK + 1])( opus_int32 *res_nrg, /* O Residual energy */ opus_int *res_nrg_Q, /* O Residual energy Q value */ @@ -59,12 +65,22 @@ extern void (*const SILK_BURG_MODIFIED_IMPL[OPUS_ARCHMASK + 1])( # define silk_burg_modified(res_nrg, res_nrg_Q, A_Q16, x, minInvGain_Q30, subfr_length, nb_subfr, D, arch) \ ((*SILK_BURG_MODIFIED_IMPL[(arch) & OPUS_ARCHMASK])(res_nrg, res_nrg_Q, A_Q16, x, minInvGain_Q30, subfr_length, nb_subfr, D, arch)) +#endif + opus_int64 silk_inner_prod16_aligned_64_sse4_1( const opus_int16 *inVec1, const opus_int16 *inVec2, const opus_int len ); + +#if defined(OPUS_X86_PRESUME_SSE4_1) + +#define silk_inner_prod16_aligned_64(inVec1, inVec2, len, arch) \ + ((void)(arch),silk_inner_prod16_aligned_64_sse4_1(inVec1, inVec2, len)) + +#else + extern opus_int64 (*const SILK_INNER_PROD16_ALIGNED_64_IMPL[OPUS_ARCHMASK + 1])( const opus_int16 *inVec1, const opus_int16 *inVec2, @@ -75,3 +91,4 @@ extern opus_int64 (*const SILK_INNER_PROD16_ALIGNED_64_IMPL[OPUS_ARCHMASK + 1])( #endif #endif +#endif diff --git a/silk/x86/main_sse.h b/silk/x86/main_sse.h index f970632..afd5ec2 100644 --- a/silk/x86/main_sse.h +++ b/silk/x86/main_sse.h @@ -50,6 +50,15 @@ void silk_VQ_WMat_EC_sse4_1( opus_int L /* I number of vectors in codebook */ ); +#if defined OPUS_X86_PRESUME_SSE4_1 + +#define silk_VQ_WMat_EC(ind, rate_dist_Q14, gain_Q7, in_Q14, W_Q18, cb_Q7, cb_gain_Q7, cl_Q5, \ + mu_Q9, max_gain_Q7, L, arch) \ + ((void)(arch),silk_VQ_WMat_EC_sse4_1(ind, rate_dist_Q14, gain_Q7, in_Q14, W_Q18, cb_Q7, cb_gain_Q7, cl_Q5, \ + mu_Q9, max_gain_Q7, L)) + +#else + extern void (*const SILK_VQ_WMAT_EC_IMPL[OPUS_ARCHMASK + 1])( opus_int8 *ind, /* O index of best codebook vector */ opus_int32 *rate_dist_Q14, /* O best weighted quant error + mu * rate */ @@ -69,6 +78,8 @@ extern void (*const SILK_VQ_WMAT_EC_IMPL[OPUS_ARCHMASK + 1])( ((*SILK_VQ_WMAT_EC_IMPL[(arch) & OPUS_ARCHMASK])(ind, rate_dist_Q14, gain_Q7, in_Q14, W_Q18, cb_Q7, cb_gain_Q7, cl_Q5, \ mu_Q9, max_gain_Q7, L)) +#endif + # define OVERRIDE_silk_NSQ void silk_NSQ_sse4_1( @@ -89,6 +100,15 @@ void silk_NSQ_sse4_1( const opus_int LTP_scale_Q14 /* I LTP state scaling */ ); +#if defined OPUS_X86_PRESUME_SSE4_1 + +#define silk_NSQ(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \ + HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14, arch) \ + ((void)(arch),silk_NSQ_sse4_1(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \ + HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14)) + +#else + extern void (*const SILK_NSQ_IMPL[OPUS_ARCHMASK + 1])( const silk_encoder_state *psEncC, /* I/O Encoder State */ silk_nsq_state *NSQ, /* I/O NSQ state */ @@ -112,6 +132,8 @@ extern void (*const SILK_NSQ_IMPL[OPUS_ARCHMASK + 1])( ((*SILK_NSQ_IMPL[(arch) & OPUS_ARCHMASK])(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \ HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14)) +#endif + # define OVERRIDE_silk_NSQ_del_dec void silk_NSQ_del_dec_sse4_1( @@ -132,6 +154,15 @@ void silk_NSQ_del_dec_sse4_1( const opus_int LTP_scale_Q14 /* I LTP state scaling */ ); +#if defined OPUS_X86_PRESUME_SSE4_1 + +#define silk_NSQ_del_dec(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \ + HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14, arch) \ + ((void)(arch),silk_NSQ_del_dec_sse4_1(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \ + HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14)) + +#else + extern void (*const SILK_NSQ_DEL_DEC_IMPL[OPUS_ARCHMASK + 1])( const silk_encoder_state *psEncC, /* I/O Encoder State */ silk_nsq_state *NSQ, /* I/O NSQ state */ @@ -155,6 +186,8 @@ extern void (*const SILK_NSQ_DEL_DEC_IMPL[OPUS_ARCHMASK + 1])( ((*SILK_NSQ_DEL_DEC_IMPL[(arch) & OPUS_ARCHMASK])(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \ HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14)) +#endif + void silk_noise_shape_quantizer( silk_nsq_state *NSQ, /* I/O NSQ state */ opus_int signalType, /* I Signal type */ @@ -192,6 +225,11 @@ opus_int silk_VAD_GetSA_Q8_sse4_1( const opus_int16 pIn[] ); +#if defined(OPUS_X86_PRESUME_SSE4_1) +#define silk_VAD_GetSA_Q8(psEnC, pIn, arch) ((void)(arch),silk_VAD_GetSA_Q8_sse4_1(psEnC, pIn)) + +#else + # define silk_VAD_GetSA_Q8(psEnC, pIn, arch) \ ((*SILK_VAD_GETSA_Q8_IMPL[(arch) & OPUS_ARCHMASK])(psEnC, pIn)) @@ -201,6 +239,8 @@ extern opus_int (*const SILK_VAD_GETSA_Q8_IMPL[OPUS_ARCHMASK + 1])( # define OVERRIDE_silk_warped_LPC_analysis_filter_FIX +#endif + void silk_warped_LPC_analysis_filter_FIX_sse4_1( opus_int32 state[], /* I/O State [order + 1] */ opus_int32 res_Q2[], /* O Residual signal [length] */ @@ -211,6 +251,12 @@ void silk_warped_LPC_analysis_filter_FIX_sse4_1( const opus_int order /* I Filter order (even) */ ); +#if defined(OPUS_X86_PRESUME_SSE4_1) +#define silk_warped_LPC_analysis_filter_FIX(state, res_Q2, coef_Q13, input, lambda_Q16, length, order, arch) \ + ((void)(arch),silk_warped_LPC_analysis_filter_FIX_c(state, res_Q2, coef_Q13, input, lambda_Q16, length, order)) + +#else + extern void (*const SILK_WARPED_LPC_ANALYSIS_FILTER_FIX_IMPL[OPUS_ARCHMASK + 1])( opus_int32 state[], /* I/O State [order + 1] */ opus_int32 res_Q2[], /* O Residual signal [length] */ @@ -224,5 +270,7 @@ extern void (*const SILK_WARPED_LPC_ANALYSIS_FILTER_FIX_IMPL[OPUS_ARCHMASK + 1]) # define silk_warped_LPC_analysis_filter_FIX(state, res_Q2, coef_Q13, input, lambda_Q16, length, order, arch) \ ((*SILK_WARPED_LPC_ANALYSIS_FILTER_FIX_IMPL[(arch) & OPUS_ARCHMASK])(state, res_Q2, coef_Q13, input, lambda_Q16, length, order)) +#endif + # endif #endif diff --git a/silk/x86/x86_silk_map.c b/silk/x86/x86_silk_map.c index 6747d10..ad9fef2 100644 --- a/silk/x86/x86_silk_map.c +++ b/silk/x86/x86_silk_map.c @@ -35,6 +35,10 @@ #include "pitch.h" #include "main.h" +#if !defined(OPUS_X86_PRESUME_SSE4_1) + +#if defined(FIXED_POINT) + opus_int64 (*const SILK_INNER_PROD16_ALIGNED_64_IMPL[ OPUS_ARCHMASK + 1 ] )( const opus_int16 *inVec1, const opus_int16 *inVec2, @@ -42,18 +46,20 @@ opus_int64 (*const SILK_INNER_PROD16_ALIGNED_64_IMPL[ OPUS_ARCHMASK + 1 ] )( ) = { silk_inner_prod16_aligned_64_c, /* non-sse */ silk_inner_prod16_aligned_64_c, + silk_inner_prod16_aligned_64_c, MAY_HAVE_SSE4_1( silk_inner_prod16_aligned_64 ), /* sse4.1 */ - NULL }; +#endif + opus_int (*const SILK_VAD_GETSA_Q8_IMPL[ OPUS_ARCHMASK + 1 ] )( silk_encoder_state *psEncC, const opus_int16 pIn[] ) = { silk_VAD_GetSA_Q8_c, /* non-sse */ silk_VAD_GetSA_Q8_c, + silk_VAD_GetSA_Q8_c, MAY_HAVE_SSE4_1( silk_VAD_GetSA_Q8 ), /* sse4.1 */ - NULL }; void (*const SILK_NSQ_IMPL[ OPUS_ARCHMASK + 1 ] )( @@ -75,8 +81,8 @@ void (*const SILK_NSQ_IMPL[ OPUS_ARCHMASK + 1 ] )( ) = { silk_NSQ_c, /* non-sse */ silk_NSQ_c, + silk_NSQ_c, MAY_HAVE_SSE4_1( silk_NSQ ), /* sse4.1 */ - NULL }; void (*const SILK_VQ_WMAT_EC_IMPL[ OPUS_ARCHMASK + 1 ] )( @@ -94,8 +100,8 @@ void (*const SILK_VQ_WMAT_EC_IMPL[ OPUS_ARCHMASK + 1 ] )( ) = { silk_VQ_WMat_EC_c, /* non-sse */ silk_VQ_WMat_EC_c, + silk_VQ_WMat_EC_c, MAY_HAVE_SSE4_1( silk_VQ_WMat_EC ), /* sse4.1 */ - NULL }; void (*const SILK_NSQ_DEL_DEC_IMPL[ OPUS_ARCHMASK + 1 ] )( @@ -117,10 +123,12 @@ void (*const SILK_NSQ_DEL_DEC_IMPL[ OPUS_ARCHMASK + 1 ] )( ) = { silk_NSQ_del_dec_c, /* non-sse */ silk_NSQ_del_dec_c, + silk_NSQ_del_dec_c, MAY_HAVE_SSE4_1( silk_NSQ_del_dec ), /* sse4.1 */ - NULL }; +#if defined(FIXED_POINT) + void (*const SILK_WARPED_LPC_ANALYSIS_FILTER_FIX_IMPL[ OPUS_ARCHMASK + 1 ] )( opus_int32 state[], /* I/O State [order + 1] */ opus_int32 res_Q2[], /* O Residual signal [length] */ @@ -132,8 +140,8 @@ void (*const SILK_WARPED_LPC_ANALYSIS_FILTER_FIX_IMPL[ OPUS_ARCHMASK + 1 ] )( ) = { silk_warped_LPC_analysis_filter_FIX_c, /* non-sse */ silk_warped_LPC_analysis_filter_FIX_c, + silk_warped_LPC_analysis_filter_FIX_c, MAY_HAVE_SSE4_1( silk_warped_LPC_analysis_filter_FIX ), /* sse4.1 */ - NULL }; void (*const SILK_BURG_MODIFIED_IMPL[ OPUS_ARCHMASK + 1 ] )( @@ -149,6 +157,9 @@ void (*const SILK_BURG_MODIFIED_IMPL[ OPUS_ARCHMASK + 1 ] )( ) = { silk_burg_modified_c, /* non-sse */ silk_burg_modified_c, + silk_burg_modified_c, MAY_HAVE_SSE4_1( silk_burg_modified ), /* sse4.1 */ - NULL }; + +#endif +#endif diff --git a/win32/VS2010/celt.vcxproj b/win32/VS2010/celt.vcxproj index f107fec..e068fbe 100644 --- a/win32/VS2010/celt.vcxproj +++ b/win32/VS2010/celt.vcxproj @@ -37,6 +37,12 @@ <ClCompile Include="..\..\celt\quant_bands.c" /> <ClCompile Include="..\..\celt\rate.c" /> <ClCompile Include="..\..\celt\vq.c" /> + <ClCompile Include="..\..\celt\x86\celt_lpc_sse.c" /> + <ClCompile Include="..\..\celt\x86\pitch_sse.c" /> + <ClCompile Include="..\..\celt\x86\pitch_sse2.c" /> + <ClCompile Include="..\..\celt\x86\pitch_sse4_1.c" /> + <ClCompile Include="..\..\celt\x86\x86cpu.c" /> + <ClCompile Include="..\..\celt\x86\x86_celt_map.c" /> </ItemGroup> <ItemGroup> <ClInclude Include="..\..\celt\arch.h" /> @@ -67,6 +73,9 @@ <ClInclude Include="..\..\celt\static_modes_fixed.h" /> <ClInclude Include="..\..\celt\static_modes_float.h" /> <ClInclude Include="..\..\celt\vq.h" /> + <ClInclude Include="..\..\celt\x86\celt_lpc_sse.h" /> + <ClInclude Include="..\..\celt\x86\pitch_sse.h" /> + <ClInclude Include="..\..\celt\x86\x86cpu.h" /> <ClInclude Include="..\..\celt\_kiss_fft_guts.h" /> </ItemGroup> <PropertyGroup Label="Globals"> @@ -141,7 +150,7 @@ <WarningLevel>Level3</WarningLevel> <Optimization>Disabled</Optimization> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>..\;..\..\include;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)\..\;$(ProjectDir)\..\..\include;$(ProjectDir)\..\..\celt;$(ProjectDir)\..\..\silk;$(ProjectDir)\..\..\silk\float;$(ProjectDir)\..\..\silk\fixed;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary> </ClCompile> <Link> @@ -168,7 +177,7 @@ <WarningLevel>Level3</WarningLevel> <Optimization>Disabled</Optimization> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;WIN64;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>..\;..\..\include;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)\..\;$(ProjectDir)\..\..\include;$(ProjectDir)\..\..\celt;$(ProjectDir)\..\..\silk;$(ProjectDir)\..\..\silk\float;$(ProjectDir)\..\..\silk\fixed;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary> </ClCompile> <Link> @@ -196,7 +205,7 @@ <FunctionLevelLinking>true</FunctionLevelLinking> <IntrinsicFunctions>true</IntrinsicFunctions> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>..\;..\..\include;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)\..\;$(ProjectDir)\..\..\include;$(ProjectDir)\..\..\celt;$(ProjectDir)\..\..\silk;$(ProjectDir)\..\..\silk\float;$(ProjectDir)\..\..\silk\fixed;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreaded</RuntimeLibrary> </ClCompile> <Link> @@ -227,7 +236,7 @@ <FunctionLevelLinking>true</FunctionLevelLinking> <IntrinsicFunctions>true</IntrinsicFunctions> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;WIN64;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>..\;..\..\include;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)\..\;$(ProjectDir)\..\..\include;$(ProjectDir)\..\..\celt;$(ProjectDir)\..\..\silk;$(ProjectDir)\..\..\silk\float;$(ProjectDir)\..\..\silk\fixed;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreaded</RuntimeLibrary> </ClCompile> <Link> diff --git a/win32/VS2010/celt.vcxproj.filters b/win32/VS2010/celt.vcxproj.filters index e3a1d97..e9948fa 100644 --- a/win32/VS2010/celt.vcxproj.filters +++ b/win32/VS2010/celt.vcxproj.filters @@ -69,6 +69,24 @@ <ClCompile Include="..\..\celt\celt.c"> <Filter>Source Files</Filter> </ClCompile> + <ClCompile Include="..\..\celt\x86\celt_lpc_sse.c"> + <Filter>Source Files</Filter> + </ClCompile> + <ClCompile Include="..\..\celt\x86\pitch_sse.c"> + <Filter>Source Files</Filter> + </ClCompile> + <ClCompile Include="..\..\celt\x86\pitch_sse2.c"> + <Filter>Source Files</Filter> + </ClCompile> + <ClCompile Include="..\..\celt\x86\pitch_sse4_1.c"> + <Filter>Source Files</Filter> + </ClCompile> + <ClCompile Include="..\..\celt\x86\x86_celt_map.c"> + <Filter>Source Files</Filter> + </ClCompile> + <ClCompile Include="..\..\celt\x86\x86cpu.c"> + <Filter>Source Files</Filter> + </ClCompile> </ItemGroup> <ItemGroup> <ClInclude Include="..\..\celt\cwrs.h"> @@ -158,5 +176,14 @@ <ClInclude Include="..\..\celt\celt_lpc.h"> <Filter>Header Files</Filter> </ClInclude> + <ClInclude Include="..\..\celt\x86\celt_lpc_sse.h"> + <Filter>Header Files</Filter> + </ClInclude> + <ClInclude Include="..\..\celt\x86\pitch_sse.h"> + <Filter>Header Files</Filter> + </ClInclude> + <ClInclude Include="..\..\celt\x86\x86cpu.h"> + <Filter>Header Files</Filter> + </ClInclude> </ItemGroup> </Project> \ No newline at end of file diff --git a/win32/VS2010/silk_common.vcxproj b/win32/VS2010/silk_common.vcxproj index 9cf5f48..d3d077d 100644 --- a/win32/VS2010/silk_common.vcxproj +++ b/win32/VS2010/silk_common.vcxproj @@ -88,7 +88,7 @@ <WarningLevel>Level3</WarningLevel> <Optimization>Disabled</Optimization> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>../../silk/fixed;../../silk/float;../../win32;../../celt;../../include</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk/float;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary> </ClCompile> <Link> @@ -118,7 +118,7 @@ <WarningLevel>Level3</WarningLevel> <Optimization>Disabled</Optimization> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;WIN64;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>../../silk/fixed;../../silk/float;../../win32;../../celt;../../include</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk/float;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary> </ClCompile> <Link> @@ -149,7 +149,7 @@ <FunctionLevelLinking>true</FunctionLevelLinking> <IntrinsicFunctions>true</IntrinsicFunctions> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>../../silk/fixed;../../silk/float;../../win32;../../celt;../../include</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk/float;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreaded</RuntimeLibrary> <FloatingPointModel>Fast</FloatingPointModel> </ClCompile> @@ -184,7 +184,7 @@ <FunctionLevelLinking>true</FunctionLevelLinking> <IntrinsicFunctions>true</IntrinsicFunctions> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;WIN64;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>../../silk/fixed;../../silk/float;../../win32;../../celt;../../include</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk/float;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreaded</RuntimeLibrary> <FloatingPointModel>Fast</FloatingPointModel> </ClCompile> @@ -212,6 +212,8 @@ </ItemDefinitionGroup> <ItemGroup> <ClInclude Include="..\..\include\opus_types.h" /> + <ClInclude Include="..\..\silk\x86\main_sse.h" /> + <ClInclude Include="..\..\silk\x86\SigProc_FIX_sse.h" /> <ClInclude Include="..\..\win32\config.h" /> <ClInclude Include="..\..\silk\control.h" /> <ClInclude Include="..\..\silk\debug.h" /> @@ -311,8 +313,13 @@ <ClCompile Include="..\..\silk\table_LSF_cos.c" /> <ClCompile Include="..\..\silk\VAD.c" /> <ClCompile Include="..\..\silk\VQ_WMat_EC.c" /> + <ClCompile Include="..\..\silk\x86\NSQ_del_dec_sse.c" /> + <ClCompile Include="..\..\silk\x86\NSQ_sse.c" /> + <ClCompile Include="..\..\silk\x86\VAD_sse.c" /> + <ClCompile Include="..\..\silk\x86\VQ_WMat_EC_sse.c" /> + <ClCompile Include="..\..\silk\x86\x86_silk_map.c" /> </ItemGroup> <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" /> <ImportGroup Label="ExtensionTargets"> </ImportGroup> -</Project> +</Project> \ No newline at end of file diff --git a/win32/VS2010/silk_common.vcxproj.filters b/win32/VS2010/silk_common.vcxproj.filters index 30db48e..341180b 100644 --- a/win32/VS2010/silk_common.vcxproj.filters +++ b/win32/VS2010/silk_common.vcxproj.filters @@ -81,6 +81,12 @@ <ClInclude Include="..\..\silk\typedef.h"> <Filter>Header Files</Filter> </ClInclude> + <ClInclude Include="..\..\silk\x86\main_sse.h"> + <Filter>Header Files</Filter> + </ClInclude> + <ClInclude Include="..\..\silk\x86\SigProc_FIX_sse.h"> + <Filter>Header Files</Filter> + </ClInclude> </ItemGroup> <ItemGroup> <ClCompile Include="..\..\silk\VQ_WMat_EC.c"> @@ -311,5 +317,20 @@ <ClCompile Include="..\..\silk\VAD.c"> <Filter>Source Files</Filter> </ClCompile> + <ClCompile Include="..\..\silk\x86\NSQ_del_dec_sse.c"> + <Filter>Source Files</Filter> + </ClCompile> + <ClCompile Include="..\..\silk\x86\NSQ_sse.c"> + <Filter>Source Files</Filter> + </ClCompile> + <ClCompile Include="..\..\silk\x86\VAD_sse.c"> + <Filter>Source Files</Filter> + </ClCompile> + <ClCompile Include="..\..\silk\x86\VQ_WMat_EC_sse.c"> + <Filter>Source Files</Filter> + </ClCompile> + <ClCompile Include="..\..\silk\x86\x86_silk_map.c"> + <Filter>Source Files</Filter> + </ClCompile> </ItemGroup> -</Project> +</Project> \ No newline at end of file diff --git a/win32/VS2010/silk_fixed.vcxproj b/win32/VS2010/silk_fixed.vcxproj index 5ea1a91..522101e 100644 --- a/win32/VS2010/silk_fixed.vcxproj +++ b/win32/VS2010/silk_fixed.vcxproj @@ -86,7 +86,7 @@ <WarningLevel>Level3</WarningLevel> <Optimization>Disabled</Optimization> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>../../silk/fixed;../../silk;../../win32;../../celt;../../include;../win32</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include;$(ProjectDir)/../win32</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary> </ClCompile> <Link> @@ -104,7 +104,7 @@ <WarningLevel>Level3</WarningLevel> <Optimization>Disabled</Optimization> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>../../silk/fixed;../../silk;../../win32;../../celt;../../include;../win32</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include;$(ProjectDir)/../win32</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary> </ClCompile> <Link> @@ -123,7 +123,7 @@ <FunctionLevelLinking>true</FunctionLevelLinking> <IntrinsicFunctions>true</IntrinsicFunctions> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>../../silk/fixed;../../silk;../../win32;../../celt;../../include;../win32</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include;$(ProjectDir)/../win32</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreaded</RuntimeLibrary> </ClCompile> <Link> @@ -145,7 +145,7 @@ <FunctionLevelLinking>true</FunctionLevelLinking> <IntrinsicFunctions>true</IntrinsicFunctions> <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions> - <AdditionalIncludeDirectories>../../silk/fixed;../../silk;../../win32;../../celt;../../include;../win32</AdditionalIncludeDirectories> + <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include;$(ProjectDir)/../win32</AdditionalIncludeDirectories> <RuntimeLibrary>MultiThreaded</RuntimeLibrary> </ClCompile> <Link> @@ -191,8 +191,11 @@ <ClCompile Include="..\..\silk\fixed\solve_LS_FIX.c" /> <ClCompile Include="..\..\silk\fixed\vector_ops_FIX.c" /> <ClCompile Include="..\..\silk\fixed\warped_autocorrelation_FIX.c" /> + <ClCompile Include="..\..\silk\fixed\x86\burg_modified_FIX_sse.c" /> + <ClCompile Include="..\..\silk\fixed\x86\prefilter_FIX_sse.c" /> + <ClCompile Include="..\..\silk\fixed\x86\vector_ops_FIX_sse.c" /> </ItemGroup> <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" /> <ImportGroup Label="ExtensionTargets"> </ImportGroup> -</Project> +</Project> \ No newline at end of file diff --git a/win32/VS2010/silk_fixed.vcxproj.filters b/win32/VS2010/silk_fixed.vcxproj.filters index 6897930..c2327eb 100644 --- a/win32/VS2010/silk_fixed.vcxproj.filters +++ b/win32/VS2010/silk_fixed.vcxproj.filters @@ -18,16 +18,16 @@ <ClInclude Include="..\..\win32\config.h"> <Filter>Header Files</Filter> </ClInclude> - <ClInclude Include="main_FIX.h"> + <ClInclude Include="..\..\include\opus_types.h"> <Filter>Header Files</Filter> </ClInclude> - <ClInclude Include="..\SigProc_FIX.h"> + <ClInclude Include="..\..\silk\SigProc_FIX.h"> <Filter>Header Files</Filter> </ClInclude> - <ClInclude Include="structs_FIX.h"> + <ClInclude Include="..\..\silk\fixed\main_FIX.h"> <Filter>Header Files</Filter> </ClInclude> - <ClInclude Include="..\..\include\opus_types.h"> + <ClInclude Include="..\..\silk\fixed\structs_FIX.h"> <Filter>Header Files</Filter> </ClInclude> </ItemGroup> @@ -107,5 +107,14 @@ <ClCompile Include="..\..\silk\fixed\LTP_analysis_filter_FIX.c"> <Filter>Source Files</Filter> </ClCompile> + <ClCompile Include="..\..\silk\fixed\x86\burg_modified_FIX_sse.c"> + <Filter>Source Files</Filter> + </ClCompile> + <ClCompile Include="..\..\silk\fixed\x86\prefilter_FIX_sse.c"> + <Filter>Source Files</Filter> + </ClCompile> + <ClCompile Include="..\..\silk\fixed\x86\vector_ops_FIX_sse.c"> + <Filter>Source Files</Filter> + </ClCompile> </ItemGroup> </Project> \ No newline at end of file diff --git a/win32/config.h b/win32/config.h index 46ff699..10fbf33 100644 --- a/win32/config.h +++ b/win32/config.h @@ -35,9 +35,28 @@ POSSIBILITY OF SUCH DAMAGE. #define OPUS_BUILD 1 -/* Enable SSE functions, if compiled with SSE/SSE2 (note that AMD64 implies SSE2) */ -#if defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 1)) -#define __SSE__ 1 +#if defined(_M_IX86) || defined(_M_X64) +/* Can always build with SSE intrinsics (no special compiler flags necessary) */ +#define OPUS_X86_MAY_HAVE_SSE +#define OPUS_X86_MAY_HAVE_SSE2 +#define OPUS_X86_MAY_HAVE_SSE4_1 + +/* Presume SSE functions, if compiled with SSE/SSE2/AVX (note that AMD64 implies SSE2, and AVX + implies SSE4.1) */ +#if defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 1)) || defined(__AVX__) +#define OPUS_X86_PRESUME_SSE 1 +#endif +#if defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 2)) || defined(__AVX__) +#define OPUS_X86_PRESUME_SSE2 1 +#endif +#if defined(__AVX__) +#define OPUS_X86_PRESUME_SSE4_1 1 +#endif + +#if !defined(OPUS_X86_PRESUME_SSE4_1) || !defined(OPUS_X86_PRESUME_SSE2) || !defined(OPUS_X86_PRESUME_SSE) +#define OPUS_HAVE_RTCD 1 +#endif + #endif #include "version.h" -- 1.9.1
Viswanath Puttagunta
2015-Mar-31 22:57 UTC
[opus] [RFC PATCH v1 4/5] aarch64: Enable intrinsics for aarch64
Enables existing neon intrinsic optimizations to work on aarch64 target. Signed-off-by: Viswanath Puttagunta <viswanath.puttagunta at linaro.org> --- Makefile.am | 4 +- celt/arm/arm_celt_map.c | 4 +- celt/arm/celt_ne10_fft.c | 2 + celt/arm/celt_ne10_mdct.c | 3 ++ celt/arm/pitch_arm.h | 2 +- celt/dump_modes/Makefile | 2 +- celt/pitch.h | 5 +-- celt/tests/test_unit_dft.c | 3 +- celt/tests/test_unit_mathops.c | 7 ++-- celt/tests/test_unit_mdct.c | 4 +- celt/tests/test_unit_rotation.c | 5 ++- configure.ac | 93 +++++++++++++++++++---------------------- 12 files changed, 67 insertions(+), 67 deletions(-) diff --git a/Makefile.am b/Makefile.am index 3a75740..8bd7447 100644 --- a/Makefile.am +++ b/Makefile.am @@ -47,7 +47,7 @@ if CPU_ARM CELT_SOURCES += $(CELT_SOURCES_ARM) SILK_SOURCES += $(SILK_SOURCES_ARM) -if OPUS_ARM_NEON_INTR +if HAVE_ARM_NEON_INTR CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR) endif @@ -286,7 +286,7 @@ SSE4_1_OBJ = $(CELT_SOURCES_SSE4_1:.c=.lo) \ $(SSE4_1_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_X86_SSE4_1_CFLAGS) endif -if OPUS_ARM_NEON_INTR +if HAVE_ARM_NEON_INTR CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \ $(CELT_SOURCES_ARM_NE10:.c=.lo) \ %test_unit_mdct.o %test_unit_dft.o diff --git a/celt/arm/arm_celt_map.c b/celt/arm/arm_celt_map.c index f132fe1..918e6cf 100644 --- a/celt/arm/arm_celt_map.c +++ b/celt/arm/arm_celt_map.c @@ -44,7 +44,7 @@ opus_val32 (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *, MAY_HAVE_NEON(celt_pitch_xcorr) /* NEON */ }; # else /* !FIXED_POINT */ -# if defined(OPUS_ARM_NEON_INTR) +# if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) void (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *, const opus_val16 *, opus_val32 *, int, int) = { celt_pitch_xcorr_c, /* ARMv4 */ @@ -113,7 +113,7 @@ void (*const CLT_MDCT_BACKWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l, }; #endif /* HAVE_ARM_NE10 */ -# endif /* OPUS_ARM_NEON_INTR */ +# endif /* OPUS_ARM_MAY_HAVE_NEON_INTR */ # endif /* FIXED_POINT */ #endif /* OPUS_HAVE_RTCD */ diff --git a/celt/arm/celt_ne10_fft.c b/celt/arm/celt_ne10_fft.c index d354502..1901024 100644 --- a/celt/arm/celt_ne10_fft.c +++ b/celt/arm/celt_ne10_fft.c @@ -44,6 +44,7 @@ #include "os_support.h" #include "stack_alloc.h" +#if !defined(FIXED_POINT) #ifdef CUSTOM_MODES /* nfft lengths in NE10 that support scaled fft */ @@ -144,3 +145,4 @@ void opus_ifft_float_neon(const kiss_fft_state *st, } RESTORE_STACK; } +#endif /* !defined(FIXED_POINT) */ diff --git a/celt/arm/celt_ne10_mdct.c b/celt/arm/celt_ne10_mdct.c index 0979cbe..938fc93 100644 --- a/celt/arm/celt_ne10_mdct.c +++ b/celt/arm/celt_ne10_mdct.c @@ -43,6 +43,8 @@ #include "os_support.h" #include "stack_alloc.h" +#if !defined(FIXED_POINT) + void clt_mdct_forward_float_neon(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scalar * OPUS_RESTRICT out, @@ -258,3 +260,4 @@ void clt_mdct_backward_float_neon(const mdct_lookup *l, } RESTORE_STACK; } +#endif /* !defined(FIXED_POINT) */ diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h index 8626ed7..344186b 100644 --- a/celt/arm/pitch_arm.h +++ b/celt/arm/pitch_arm.h @@ -57,7 +57,7 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y, #if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, opus_val32 *xcorr, int len, int max_pitch); -#if !defined(OPUS_HAVE_RTCD) || defined(OPUS_ARM_PRESUME_NEON_INTR) +#if defined(OPUS_ARM_PRESUME_NEON_INTR) #define OVERRIDE_PITCH_XCORR (1) # define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \ ((void)(arch),celt_pitch_xcorr_float_neon(_x, _y, xcorr, len, max_pitch)) diff --git a/celt/dump_modes/Makefile b/celt/dump_modes/Makefile index 10c3679..fef8d94 100644 --- a/celt/dump_modes/Makefile +++ b/celt/dump_modes/Makefile @@ -15,7 +15,7 @@ SOURCES = dump_modes.c \ ifdef HAVE_ARM_NE10 CC = gcc CFLAGS += -mfpu=neon -INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_NEON_INTR +INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_PRESUME_NEON_INTR LIBDIR = -l:$(NE10_LIBDIR)/libNE10.so SOURCES += ../arm/celt_ne10_fft.c \ dump_modes_arm_ne10.c \ diff --git a/celt/pitch.h b/celt/pitch.h index af745eb..dde48c8 100644 --- a/celt/pitch.h +++ b/celt/pitch.h @@ -46,8 +46,7 @@ #include "mips/pitch_mipsr1.h" #endif -#if ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \ - || defined(OPUS_ARM_NEON_INTR)) +#if (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)) # include "arm/pitch_arm.h" #endif @@ -189,7 +188,7 @@ celt_pitch_xcorr_c(const opus_val16 *_x, const opus_val16 *_y, #if !defined(OVERRIDE_PITCH_XCORR) /*Is run-time CPU detection enabled on this platform?*/ # if defined(OPUS_HAVE_RTCD) && \ - (defined(OPUS_ARM_ASM) || (defined(OPUS_ARM_NEON_INTR) && !defined(OPUS_ARM_PRESUME_NEON_INTR))) + (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)) extern # if defined(FIXED_POINT) opus_val32 diff --git a/celt/tests/test_unit_dft.c b/celt/tests/test_unit_dft.c index 9fbcdc4..e17e26f 100644 --- a/celt/tests/test_unit_dft.c +++ b/celt/tests/test_unit_dft.c @@ -45,8 +45,7 @@ #include "mathops.c" #include "entcode.c" -#if defined(OPUS_HAVE_RTCD) && \ - (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)) +#if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) || defined(OPUS_ARM_ASM) #include "arm/armcpu.c" #if !defined(FIXED_POINT) #if defined(HAVE_ARM_NE10) diff --git a/celt/tests/test_unit_mathops.c b/celt/tests/test_unit_mathops.c index a1cf2f7..2e43e07 100644 --- a/celt/tests/test_unit_mathops.c +++ b/celt/tests/test_unit_mathops.c @@ -65,17 +65,18 @@ #include "x86/celt_lpc_sse.c" #endif #include "x86/x86_celt_map.c" + #elif ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \ - || defined(OPUS_ARM_NEON_INTR)) -#if defined(OPUS_ARM_NEON_INTR) + || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)) +#if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) #include "arm/celt_neon_intr.c" +#endif #if defined(HAVE_ARM_NE10) #include "kiss_fft.c" #include "mdct.c" #include "arm/celt_ne10_fft.c" #include "arm/celt_ne10_mdct.c" #endif -#endif #include "arm/arm_celt_map.c" #endif diff --git a/celt/tests/test_unit_mdct.c b/celt/tests/test_unit_mdct.c index fdee079..53258fe 100644 --- a/celt/tests/test_unit_mdct.c +++ b/celt/tests/test_unit_mdct.c @@ -46,8 +46,8 @@ #include "mathops.c" #include "entcode.c" -#if defined(OPUS_HAVE_RTCD) && \ - (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)) + +#if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) || defined(OPUS_ARM_ASM) #include "arm/armcpu.c" #if !defined(FIXED_POINT) #if defined(HAVE_ARM_NE10) diff --git a/celt/tests/test_unit_rotation.c b/celt/tests/test_unit_rotation.c index 4ac838e..ecab5cb 100644 --- a/celt/tests/test_unit_rotation.c +++ b/celt/tests/test_unit_rotation.c @@ -63,9 +63,10 @@ #include "x86/celt_lpc_sse.c" #endif #include "x86/x86_celt_map.c" + #elif ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \ - || defined(OPUS_ARM_NEON_INTR)) -#if defined(OPUS_ARM_NEON_INTR) + || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)) +#if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) #include "arm/celt_neon_intr.c" #endif #if defined(HAVE_ARM_NE10) diff --git a/configure.ac b/configure.ac index 2380a5c..a150d87 100644 --- a/configure.ac +++ b/configure.ac @@ -444,7 +444,7 @@ AC_DEFUN([OPUS_PATH_NE10], AS_IF([test x"$enable_intrinsics" = x"yes"],[ intrinsics_support="" AS_CASE([$host_cpu], - [arm*], + [arm*|aarch64], [ cpu_arm=yes OPUS_CHECK_INTRINSICS( @@ -459,55 +459,50 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ SUMM = vmlaq_f32(SUMM, A0, A1); ]] ) - AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1" && test x"$OPUS_ARM_PRESUME_NEON_INTR" != x"1"], - [ - OPUS_ARM_NEON_INTR_CFLAGS="$ARM_NEON_INTR_CFLAGS" - AC_SUBST([OPUS_ARM_NEON_INTR_CFLAGS]) - ] + + AS_CASE([$host_cpu], + [arm*], + [ + AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"], + [ + OPUS_ARM_NEON_INTR_CFLAGS="$ARM_NEON_INTR_CFLAGS" + AC_SUBST([OPUS_ARM_NEON_INTR_CFLAGS]) + dnl Don't see why defining these is necessary to check features at runtime + AC_DEFINE([OPUS_ARM_MAY_HAVE_EDSP], 1, [Define if compiler support EDSP Instructions]) + AC_DEFINE([OPUS_ARM_MAY_HAVE_MEDIA], 1, [Define if compiler support MEDIA Instructions]) + AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON], 1, [Define if compiler support NEON instructions]) + ] + ) + ] ) - #Currently we only have intrinsic optimizations for floating point - AS_IF([test x"$enable_float" = x"yes"], + AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"], [ - AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"], - [ - OPUS_ARM_NEON_INTR=1 - AC_DEFINE([OPUS_ARM_NEON_INTR], 1, - [Support ARMv7 Neon Intrinsics for float]) - AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON_INTR], 1, - [Compiler supports ARMv7 Neon Intrinsics]) - intrinsics_support="$intrinsics_support (Neon_Intrinsics)" - - AS_IF([test x"enable_rtcd" != x"" && test x"$OPUS_ARM_PRESUME_NEON_INTR" != x"1"], - [rtcd_support="$rtcd_support (ARMv7_Neon_Intrinsics)"],[]) - - AS_IF([test x"$OPUS_ARM_PRESUME_NEON_INTR" = x"1"], - [AC_DEFINE([OPUS_ARM_PRESUME_NEON_INTR], 1, - [Define if binary requires NEON intrinsics support])]) - - AS_IF([test x"$rtcd_support" = x""], - [rtcd_support=no]) - - AS_IF([test x"$intrinsics_support" = x""], - [intrinsics_support=no], - [intrinsics_support="arm$intrinsics_support"]) - - dnl Don't see why defining these is necessary to check features at runtime - AC_DEFINE([OPUS_ARM_MAY_HAVE_EDSP], 1, [Define if compiler support EDSP Instructions]) - AC_DEFINE([OPUS_ARM_MAY_HAVE_MEDIA], 1, [Define if compiler support MEDIA Instructions]) - AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON], 1, [Define if compiler support NEON instructions]) - - OPUS_PATH_NE10() - AS_IF([test x"$HAVE_ARM_NE10" = x"1"], - [intrinsics_support="$intrinsics_support NE10"],[]) - ], - [ - AC_MSG_WARN([Compiler does not support ARM intrinsics]) - intrinsics_support=no - ]) - ], [ - AC_MSG_WARN([Currently only have ARM intrinsics for float]) - intrinsics_support=no + AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON_INTR], 1, + [Compiler supports ARMv7 Neon Intrinsics]) + intrinsics_support="$intrinsics_support (Neon_Intrinsics)" + + AS_IF([test x"enable_rtcd" != x"" && test x"$OPUS_ARM_PRESUME_NEON_INTR" != x"1"], + [rtcd_support="$rtcd_support (ARMv7_Neon_Intrinsics)"],[]) + + AS_IF([test x"$OPUS_ARM_PRESUME_NEON_INTR" = x"1"], + [AC_DEFINE([OPUS_ARM_PRESUME_NEON_INTR], 1, + [Define if binary requires NEON intrinsics support])]) + + AS_IF([test x"$rtcd_support" = x""], + [rtcd_support=no]) + + AS_IF([test x"$intrinsics_support" = x""], + [intrinsics_support=no], + [intrinsics_support="arm$intrinsics_support"]) + + OPUS_PATH_NE10() + AS_IF([test x"$HAVE_ARM_NE10" = x"1"], + [intrinsics_support="$intrinsics_support NE10"],[]) + ], + [ + AC_MSG_WARN([Compiler does not support ARM intrinsics]) + intrinsics_support=no ]) ], [i?86|x86_64], @@ -663,8 +658,8 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ ]) AM_CONDITIONAL([CPU_ARM], [test "$cpu_arm" = "yes"]) -AM_CONDITIONAL([OPUS_ARM_NEON_INTR], - [test x"$OPUS_ARM_NEON_INTR" = x"1"]) +AM_CONDITIONAL([HAVE_ARM_NEON_INTR], + [test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"]) AM_CONDITIONAL([HAVE_ARM_NE10], [test x"$HAVE_ARM_NE10" = x"1"]) -- 1.9.1
Viswanath Puttagunta
2015-Mar-31 22:57 UTC
[opus] [RFC PATCH v1 5/5] aarch64: celt_pitch_xcorr: Fixed point intrinsics
Optimize celt_pitch_xcorr function (for fixed point). Even though same code in theory should work for ARMv7 as well, turning this on only for aarch64 at the moment since there is a fixed point asm implementation for ARMv7 neon. Signed-off-by: Viswanath Puttagunta <viswanath.puttagunta at linaro.org> --- celt/arm/celt_neon_intr.c | 268 ++++++++++++++++++++++++++++++++++++++++++++++ celt/arm/pitch_arm.h | 10 ++ configure.ac | 6 ++ 3 files changed, 284 insertions(+) diff --git a/celt/arm/celt_neon_intr.c b/celt/arm/celt_neon_intr.c index 47dce15..be978a0 100644 --- a/celt/arm/celt_neon_intr.c +++ b/celt/arm/celt_neon_intr.c @@ -249,4 +249,272 @@ void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, (const float32_t *)_y+i, (float32_t *)xcorr+i, len); } } +#else /* FIXED POINT */ + +/* + * Function: xcorr_kernel_neon_fixed + * --------------------------------- + * Computes 8 correlation values and stores them in sum[8] + */ +static void xcorr_kernel_neon_fixed(const int16_t *x, const int16_t *y, + int32_t sum[4], int len) { + int16x8_t YY[3]; + int16x4_t YEXT[3]; + int16x8_t XX[2]; + int16x4_t XX_2, YY_2; + int32x4_t SUMM; + const int16_t *xi = x; + const int16_t *yi = y; + + celt_assert(len>4); + + YY[0] = vld1q_s16(yi); + YY_2 = vget_low_s16(YY[0]); + + SUMM = vdupq_n_s32(0); + + /* Consume 16 elements in x vector and 20 elements in y + * vector. However, the y[19] and beyond dont get accessed + * So, if len == 16, then we must only access y[0] to y[18] + * So, make sure len > 19 + */ + while (len > 19) { + yi += 8; + YY[1] = vld1q_s16(yi); + yi += 8; + YY[2] = vld1q_s16(yi); + + XX[0] = vld1q_s16(xi); + xi += 8; + XX[1] = vld1q_s16(xi); + xi += 8; + + /* Consume XX[0][0:3] */ + SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0]), 0); + + YEXT[0] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 1); + SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_low_s16(XX[0]), 1); + + YEXT[1] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 2); + SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_low_s16(XX[0]), 2); + + YEXT[2] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 3); + SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_low_s16(XX[0]), 3); + + /* Consume XX[0][7:4] */ + SUMM = vmlal_lane_s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0]), 0); + + YEXT[0] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 1); + SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_high_s16(XX[0]), 1); + + YEXT[1] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 2); + SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_high_s16(XX[0]), 2); + + YEXT[2] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 3); + SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_high_s16(XX[0]), 3); + + /* Consume XX[1][3:0]*/ + SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[1]), vget_low_s16(XX[1]), 0); + + YEXT[0] = vext_s16(vget_low_s16(YY[1]), vget_high_s16(YY[1]), 1); + SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_low_s16(XX[1]), 1); + + YEXT[1] = vext_s16(vget_low_s16(YY[1]), vget_high_s16(YY[1]), 2); + SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_low_s16(XX[1]), 2); + + YEXT[2] = vext_s16(vget_low_s16(YY[1]), vget_high_s16(YY[1]), 3); + SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_low_s16(XX[1]), 3); + + /* Consume XX[1][7:4] */ + SUMM = vmlal_lane_s16(SUMM, vget_high_s16(YY[1]), vget_high_s16(XX[1]), 0); + + YEXT[0] = vext_s16(vget_high_s16(YY[1]), vget_low_s16(YY[2]), 1); + SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_high_s16(XX[1]), 1); + + YEXT[1] = vext_s16(vget_high_s16(YY[1]), vget_low_s16(YY[2]), 2); + SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_high_s16(XX[1]), 2); + + YEXT[2] = vext_s16(vget_high_s16(YY[1]), vget_low_s16(YY[2]), 3); + SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_high_s16(XX[1]), 3); + + YY[0] = YY[2]; + len -= 16; + } + + /* Consume 8 elements in x vector and 16 elements in y + * vector. However, y[15:11] should not be accessed unless + * len is > 11 + */ + if (len > 11) { + yi += 8; + YY[1] = vld1q_s16(yi); + + XX[0] = vld1q_s16(xi); + xi += 8; + + /* Consume XX[0][0:3] */ + SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0]), 0); + + YEXT[0] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 1); + SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_low_s16(XX[0]), 1); + + YEXT[1] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 2); + SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_low_s16(XX[0]), 2); + + YEXT[2] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 3); + SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_low_s16(XX[0]), 3); + + /* Consume XX[0][7:4] */ + SUMM = vmlal_lane_s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0]), 0); + + YEXT[0] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 1); + SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_high_s16(XX[0]), 1); + + YEXT[1] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 2); + SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_high_s16(XX[0]), 2); + + YEXT[2] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 3); + SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_high_s16(XX[0]), 3); + + YY[0] = YY[1]; + len -= 8; + } + + /* Consume 4 elements in x vector and 8 elements in y vector + * However, y[7] should not be accessed unless len > 4 + */ + if (len > 4) { + XX_2 = vld1_s16(xi); + xi += 4; + /* Consume XX_2[0:3] */ + SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[0]), XX_2, 0); + + YEXT[0] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 1); + SUMM = vmlal_lane_s16(SUMM, YEXT[0], XX_2, 1); + + YEXT[1] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 2); + SUMM = vmlal_lane_s16(SUMM, YEXT[1], XX_2, 2); + + YEXT[2] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 3); + SUMM = vmlal_lane_s16(SUMM, YEXT[2], XX_2, 3); + + YY_2 = vget_high_s16(YY[0]); + len -= 4; + } + + while (--len > 0) { + XX_2 = vld1_dup_s16(xi++); + SUMM = vmlal_lane_s16(SUMM, YY_2, XX_2, 0); + YY_2= vld1_s16(++yi); + } + + XX_2 = vld1_dup_s16(xi); + SUMM = vmlal_lane_s16(SUMM, YY_2, XX_2, 0); + + vst1q_s32(sum, SUMM); +} + +/* + * Function: xcorr_kernel_neon_fixed_process1 + * --------------------------------- + * Computes single correlation values and stores in *sum + */ +static void xcorr_kernel_neon_fixed_process1(const int16_t *x, + const int16_t *y, + int32_t *sum, int len) { + int16x8_t XX[2]; + int16x8_t YY[2]; + + int16x4_t XX_2; + int16x4_t YY_2; + + int32x4_t SUMM; + int32x2_t SUMM_2; + const int16_t *xi = x; + const int16_t *yi = y; + + SUMM = vdupq_n_s32(0); + + /* Work on 16 values per iteration */ + while (len >= 16) { + XX[0] = vld1q_s16(xi); + xi += 8; + XX[1] = vld1q_s16(xi); + xi += 8; + + YY[0] = vld1q_s16(yi); + yi += 8; + YY[1] = vld1q_s16(yi); + yi += 8; + + SUMM = vmlal_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0])); + SUMM = vmlal_s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0])); + SUMM = vmlal_s16(SUMM, vget_low_s16(YY[1]), vget_low_s16(XX[1])); + SUMM = vmlal_s16(SUMM, vget_high_s16(YY[1]), vget_high_s16(XX[1])); + + len -= 16; + } + + /* Work on 8 values */ + if (len >= 8) { + XX[0] = vld1q_s16(xi); + xi += 8; + + YY[0] = vld1q_s16(yi); + yi += 8; + + SUMM = vmlal_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0])); + SUMM = vmlal_s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0])); + len -= 8; + } + + /* Work on 4 values */ + if (len >= 4) { + XX_2 = vld1_s16(xi); + xi += 4; + YY_2 = vld1_s16(yi); + yi += 4; + SUMM = vmlal_s16(SUMM, YY_2, XX_2); + len -= 4; + } + + SUMM_2 = vadd_s32(vget_high_s32(SUMM), vget_low_s32(SUMM)); + SUMM_2 = vpadd_s32(SUMM_2, SUMM_2); + SUMM = vcombine_s32(SUMM_2, SUMM_2); + + while (len > 0) { + XX_2 = vld1_dup_s16(xi++); + YY_2 = vld1_dup_s16(yi++); + SUMM = vmlal_s16(SUMM, XX_2, YY_2); + len--; + } + vst1q_lane_s32(sum, SUMM, 0); +} + +opus_val32 celt_pitch_xcorr_fixed_neon(const opus_val16 *_x, const opus_val16 *_y, + opus_val32 *xcorr, int len, int max_pitch) { + int i; + celt_assert(max_pitch > 0); + celt_assert((((unsigned char *)_x-(unsigned char *)NULL)&3)==0); + int max_corr = 1; + + for (i = 0; i < (max_pitch-3); i += 4) { + xcorr_kernel_neon_fixed((const int16_t *)_x, (const int16_t *)_y+i, + (int32_t *)xcorr+i, len); + } + + /* In case max_pitch isn't multiple of 4 + * compute single correlation value per iteration + */ + for (; i < max_pitch; i++) { + xcorr_kernel_neon_fixed_process1((const int16_t *)_x, + (const int16_t *)_y+i, + (int32_t *)xcorr+i, len); + } + + for (i = 0; i < len; i++) { + max_corr = (max_corr > xcorr[i])? max_corr: xcorr[i]; + } + return max_corr; +} #endif diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h index 344186b..d5c9408 100644 --- a/celt/arm/pitch_arm.h +++ b/celt/arm/pitch_arm.h @@ -32,6 +32,15 @@ # if defined(FIXED_POINT) +#if defined(CPU_AARCH64) +#define OVERRIDE_PITCH_XCORR (1) +opus_val32 celt_pitch_xcorr_fixed_neon(const opus_val16 *_x, const opus_val16 *_y, + opus_val32 *xcorr, int len, int max_pitch); +#define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \ + ((void)(arch), celt_pitch_xcorr_fixed_neon(_x, _y, xcorr, len, max_pitch)) + +#else /* End CPU_AARCH64. Begin CPU_ARM */ + # if defined(OPUS_ARM_MAY_HAVE_NEON) opus_val32 celt_pitch_xcorr_neon(const opus_val16 *_x, const opus_val16 *_y, opus_val32 *xcorr, int len, int max_pitch); @@ -51,6 +60,7 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y, # define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \ ((void)(arch),PRESUME_NEON(celt_pitch_xcorr)(_x, _y, xcorr, len, max_pitch)) # endif +#endif /* End CPU_ARM */ #else /* Start !FIXED_POINT */ /* Float case */ diff --git a/configure.ac b/configure.ac index a150d87..744c9b4 100644 --- a/configure.ac +++ b/configure.ac @@ -473,6 +473,11 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON], 1, [Define if compiler support NEON instructions]) ] ) + ], + [aarch64], + [ + cpu_aarch64=yes + AC_DEFINE([CPU_AARCH64], 1, [Compiling for Aarch64]) ] ) @@ -658,6 +663,7 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ ]) AM_CONDITIONAL([CPU_ARM], [test "$cpu_arm" = "yes"]) +AM_CONDITIONAL([CPU_AARCH64], [test "$cpu_aarch64" = "yes"]) AM_CONDITIONAL([HAVE_ARM_NEON_INTR], [test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"]) AM_CONDITIONAL([HAVE_ARM_NE10], -- 1.9.1
Viswanath Puttagunta
2015-Apr-01 03:35 UTC
[opus] [RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series
Hi Timothy, FYI, I just submitted pull request for my Ne10 patch that enables builds for Aarch64 at https://github.com/projectNe10/Ne10/pull/108 Phil at ARM said he will do more testing on it and will merge it soon. As I mentioned in previous email, in mean time, for convinience, I provided pre-build NE10 library binaries at http://people.linaro.org/~viswanath.puttagunta/opus/NE10_root/ Regards, Vish On 31 March 2015 at 17:57, Viswanath Puttagunta <viswanath.puttagunta at linaro.org> wrote:> Hi Timothy, > > As I mentioned earlier [1], I now fixed compile issues > with fixed point and resubmitting the patch. > > I also have new patch that does intrinsics optimizations > for celt_pitch_xcorr targetting aarch64. > > You can find my latest work-in-progress branch at [2] > > For reference, you can use the Ne10 pre-built libraries > at [3] > > Note that I am working with Phil at ARM to get my patch at [4] > upstreamed to Ne10. > > [1]: http://lists.xiph.org/pipermail/opus/2015-March/002941.html > [2]: https://git.linaro.org/people/viswanath.puttagunta/opus.git > Branch: rfcv1_final_xcorr_fixed_armv8 > [3]: http://people.linaro.org/~viswanath.puttagunta/opus/NE10_root/ > [4]: git://git.linaro.org/people/viswanath.puttagunta/Ne10.git > Branch: rfcv1_rc1_armv8 > > Jonathan Lennox (1): > Intrinsics/RTCD related fixes. Mostly x86 > > Viswanath Puttagunta (4): > armv7(float): Optimize encode usecase using NE10 library > armv7(float): Optimize decode usecase using NE10 library > aarch64: Enable intrinsics for aarch64 > aarch64: celt_pitch_xcorr: Fixed point intrinsics > > Makefile.am | 72 ++++-- > celt/arm/arm_celt_map.c | 71 +++++- > celt/arm/armcpu.c | 6 +- > celt/arm/celt_ne10_fft.c | 148 +++++++++++ > celt/arm/celt_ne10_mdct.c | 263 ++++++++++++++++++++ > celt/arm/celt_neon_intr.c | 275 +++++++++++++++++++++ > celt/arm/fft_arm.h | 74 ++++++ > celt/arm/mdct_arm.h | 60 +++++ > celt/arm/pitch_arm.h | 14 +- > celt/bands.c | 6 +- > celt/celt.c | 16 +- > celt/celt.h | 12 +- > celt/celt_decoder.c | 24 +- > celt/celt_encoder.c | 20 +- > celt/celt_lpc.h | 2 +- > celt/cpu_support.h | 15 +- > celt/dump_modes/Makefile | 23 +- > celt/dump_modes/dump_modes.c | 21 ++ > celt/dump_modes/dump_modes_arch.h | 41 ++++ > celt/dump_modes/dump_modes_arm_ne10.c | 125 ++++++++++ > celt/kiss_fft.c | 31 ++- > celt/kiss_fft.h | 69 +++++- > celt/mdct.c | 20 +- > celt/mdct.h | 61 ++++- > celt/mips/celt_mipsr1.h | 2 +- > celt/modes.c | 8 +- > celt/pitch.c | 4 +- > celt/pitch.h | 22 +- > celt/static_modes_float.h | 25 ++ > celt/static_modes_float_arm_ne10.h | 404 +++++++++++++++++++++++++++++++ > celt/tests/test_unit_dft.c | 56 +++-- > celt/tests/test_unit_mathops.c | 22 +- > celt/tests/test_unit_mdct.c | 88 ++++--- > celt/tests/test_unit_rotation.c | 22 +- > celt/x86/celt_lpc_sse.c | 4 + > celt/x86/celt_lpc_sse.h | 12 +- > celt/x86/pitch_sse.c | 334 ++++++++++--------------- > celt/x86/pitch_sse.h | 256 ++++++++------------ > celt/x86/pitch_sse2.c | 95 ++++++++ > celt/x86/pitch_sse4_1.c | 195 +++++++++++++++ > celt/x86/x86_celt_map.c | 76 +++++- > celt/x86/x86cpu.c | 47 +++- > celt/x86/x86cpu.h | 26 +- > celt_headers.mk | 3 + > celt_sources.mk | 9 +- > configure.ac | 391 +++++++++++++++++++++--------- > m4/opus-intrinsics.m4 | 29 +++ > silk/x86/SigProc_FIX_sse.h | 17 ++ > silk/x86/main_sse.h | 48 ++++ > silk/x86/x86_silk_map.c | 25 +- > src/analysis.c | 8 +- > src/analysis.h | 2 +- > src/opus_encoder.c | 2 +- > src/opus_multistream_encoder.c | 9 +- > win32/VS2010/celt.vcxproj | 17 +- > win32/VS2010/celt.vcxproj.filters | 27 +++ > win32/VS2010/silk_common.vcxproj | 17 +- > win32/VS2010/silk_common.vcxproj.filters | 23 +- > win32/VS2010/silk_fixed.vcxproj | 13 +- > win32/VS2010/silk_fixed.vcxproj.filters | 17 +- > win32/config.h | 25 +- > 61 files changed, 3150 insertions(+), 699 deletions(-) > create mode 100644 celt/arm/celt_ne10_fft.c > create mode 100644 celt/arm/celt_ne10_mdct.c > create mode 100644 celt/arm/fft_arm.h > create mode 100644 celt/arm/mdct_arm.h > create mode 100644 celt/dump_modes/dump_modes_arch.h > create mode 100644 celt/dump_modes/dump_modes_arm_ne10.c > create mode 100644 celt/static_modes_float_arm_ne10.h > create mode 100644 celt/x86/pitch_sse2.c > create mode 100644 celt/x86/pitch_sse4_1.c > create mode 100644 m4/opus-intrinsics.m4 > > -- > 1.9.1 >