Viswanath Puttagunta

2015-Mar-31 22:57 UTC

[opus] [RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series

Hi Timothy,

As I mentioned earlier [1], I have now fixed the compile issues
with fixed point and am resubmitting the patch series.

I also have a new patch that adds intrinsics optimizations
for celt_pitch_xcorr targeting aarch64.

You can find my latest work-in-progress branch at [2].

For reference, you can use the Ne10 pre-built libraries
at [3].

Note that I am working with Phil at ARM to get my patch at [4]
upstreamed to Ne10.

[1]: http://lists.xiph.org/pipermail/opus/2015-March/002941.html
[2]: https://git.linaro.org/people/viswanath.puttagunta/opus.git
     Branch: rfcv1_final_xcorr_fixed_armv8
[3]: http://people.linaro.org/~viswanath.puttagunta/opus/NE10_root/ 
[4]: git://git.linaro.org/people/viswanath.puttagunta/Ne10.git
     Branch: rfcv1_rc1_armv8

Jonathan Lennox (1):
  Intrinsics/RTCD related fixes. Mostly x86

Viswanath Puttagunta (4):
  armv7(float): Optimize encode usecase using NE10 library
  armv7(float): Optimize decode usecase using NE10 library
  aarch64: Enable intrinsics for aarch64
  aarch64: celt_pitch_xcorr: Fixed point intrinsics

 Makefile.am                              |  72 ++++--
 celt/arm/arm_celt_map.c                  |  71 +++++-
 celt/arm/armcpu.c                        |   6 +-
 celt/arm/celt_ne10_fft.c                 | 148 +++++++++++
 celt/arm/celt_ne10_mdct.c                | 263 ++++++++++++++++++++
 celt/arm/celt_neon_intr.c                | 275 +++++++++++++++++++++
 celt/arm/fft_arm.h                       |  74 ++++++
 celt/arm/mdct_arm.h                      |  60 +++++
 celt/arm/pitch_arm.h                     |  14 +-
 celt/bands.c                             |   6 +-
 celt/celt.c                              |  16 +-
 celt/celt.h                              |  12 +-
 celt/celt_decoder.c                      |  24 +-
 celt/celt_encoder.c                      |  20 +-
 celt/celt_lpc.h                          |   2 +-
 celt/cpu_support.h                       |  15 +-
 celt/dump_modes/Makefile                 |  23 +-
 celt/dump_modes/dump_modes.c             |  21 ++
 celt/dump_modes/dump_modes_arch.h        |  41 ++++
 celt/dump_modes/dump_modes_arm_ne10.c    | 125 ++++++++++
 celt/kiss_fft.c                          |  31 ++-
 celt/kiss_fft.h                          |  69 +++++-
 celt/mdct.c                              |  20 +-
 celt/mdct.h                              |  61 ++++-
 celt/mips/celt_mipsr1.h                  |   2 +-
 celt/modes.c                             |   8 +-
 celt/pitch.c                             |   4 +-
 celt/pitch.h                             |  22 +-
 celt/static_modes_float.h                |  25 ++
 celt/static_modes_float_arm_ne10.h       | 404 +++++++++++++++++++++++++++++++
 celt/tests/test_unit_dft.c               |  56 +++--
 celt/tests/test_unit_mathops.c           |  22 +-
 celt/tests/test_unit_mdct.c              |  88 ++++---
 celt/tests/test_unit_rotation.c          |  22 +-
 celt/x86/celt_lpc_sse.c                  |   4 +
 celt/x86/celt_lpc_sse.h                  |  12 +-
 celt/x86/pitch_sse.c                     | 334 ++++++++++---------------
 celt/x86/pitch_sse.h                     | 256 ++++++++------------
 celt/x86/pitch_sse2.c                    |  95 ++++++++
 celt/x86/pitch_sse4_1.c                  | 195 +++++++++++++++
 celt/x86/x86_celt_map.c                  |  76 +++++-
 celt/x86/x86cpu.c                        |  47 +++-
 celt/x86/x86cpu.h                        |  26 +-
 celt_headers.mk                          |   3 +
 celt_sources.mk                          |   9 +-
 configure.ac                             | 391 +++++++++++++++++++++---------
 m4/opus-intrinsics.m4                    |  29 +++
 silk/x86/SigProc_FIX_sse.h               |  17 ++
 silk/x86/main_sse.h                      |  48 ++++
 silk/x86/x86_silk_map.c                  |  25 +-
 src/analysis.c                           |   8 +-
 src/analysis.h                           |   2 +-
 src/opus_encoder.c                       |   2 +-
 src/opus_multistream_encoder.c           |   9 +-
 win32/VS2010/celt.vcxproj                |  17 +-
 win32/VS2010/celt.vcxproj.filters        |  27 +++
 win32/VS2010/silk_common.vcxproj         |  17 +-
 win32/VS2010/silk_common.vcxproj.filters |  23 +-
 win32/VS2010/silk_fixed.vcxproj          |  13 +-
 win32/VS2010/silk_fixed.vcxproj.filters  |  17 +-
 win32/config.h                           |  25 +-
 61 files changed, 3150 insertions(+), 699 deletions(-)
 create mode 100644 celt/arm/celt_ne10_fft.c
 create mode 100644 celt/arm/celt_ne10_mdct.c
 create mode 100644 celt/arm/fft_arm.h
 create mode 100644 celt/arm/mdct_arm.h
 create mode 100644 celt/dump_modes/dump_modes_arch.h
 create mode 100644 celt/dump_modes/dump_modes_arm_ne10.c
 create mode 100644 celt/static_modes_float_arm_ne10.h
 create mode 100644 celt/x86/pitch_sse2.c
 create mode 100644 celt/x86/pitch_sse4_1.c
 create mode 100644 m4/opus-intrinsics.m4

-- 
1.9.1

Viswanath Puttagunta

2015-Mar-31 22:57 UTC

[opus] [RFC PATCH v1 1/5] armv7(float): Optimize encode usecase using NE10 library

Optimize the opus encode (float only) use case using the ARM NE10
library. This mainly affects opus_fft, clt_mdct_forward,
and related functions.
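
As a rough illustration of the fallback in this patch: celt_ne10_fft.c
lists 480, 240, 120 and 60 as the nfft lengths for which NE10 supports
scaled FFT, and opus_fft_float_neon calls opus_fft_c for any other
length. The sketch below reproduces only that length check. The names
demo_scaled_lengths and demo_is_scaled_length are hypothetical and are
not part of the patch.

```c
#include <stddef.h>

/* Lengths copied from ne10_fft_scaled_support[] in celt_ne10_fft.c. */
static const int demo_scaled_lengths[] = { 480, 240, 120, 60 };

/* Hypothetical helper: returns 1 if nfft is one of the listed lengths,
   0 otherwise. A caller would take the portable C path on 0. */
static int demo_is_scaled_length(int nfft)
{
   size_t i;
   size_t count = sizeof(demo_scaled_lengths) / sizeof(demo_scaled_lengths[0]);

   for (i = 0; i < count; i++) {
      if (demo_scaled_lengths[i] == nfft)
         return 1;
   }
   return 0;
}
```

In the patch itself the result of this kind of check is cached in
st->arch_fft->is_supported at allocation time, so the per-call cost is
a single flag test.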

This optimization can be used on ARM CPUs that have a NEON
VFP unit. This patch only enables the optimizations for ARMv7.

Official ARM NE10 library page available at
http://projectne10.github.io/Ne10/

To enable this optimization, use
--enable-intrinsics --with-NE10=<install_prefix>
or
--enable-intrinsics --with-NE10-libraries=<NE10_lib_dir>
--with-NE10-includes=<NE10_includes_dir>

Compile-time checks are made during the configure process to make
sure the optimization option is available only when the compiler
supports NEON intrinsics.

Runtime checks are made to make sure the optimized functions are
only called on appropriate hardware.
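
The runtime check can be pictured as a table of function pointers
indexed by the detected CPU level, in the same spirit as the tables
added to arm_celt_map.c. This is a minimal sketch under assumptions:
every demo_* name, the DEMO_ARCHMASK value and the integer return codes
are hypothetical and only stand in for the real implementations.

```c
/* Hypothetical stand-in for OPUS_ARCHMASK: four CPU levels, 0..3. */
#define DEMO_ARCHMASK 3

typedef int (*demo_fft_fn)(int nfft);

/* Portable path; returns 0 so the caller can tell which path ran. */
static int demo_fft_c(int nfft)
{
   (void)nfft;
   return 0;
}

/* Stand-in for the NEON/NE10 path; returns 1. */
static int demo_fft_neon(int nfft)
{
   (void)nfft;
   return 1;
}

/* One entry per CPU level; only the last level uses the optimized path. */
static const demo_fft_fn DEMO_FFT_IMPL[DEMO_ARCHMASK + 1] = {
   demo_fft_c,    /* ARMv4 */
   demo_fft_c,    /* EDSP */
   demo_fft_c,    /* Media */
   demo_fft_neon  /* Neon */
};

/* Dispatch on the arch value detected at runtime. The mask keeps the
   index inside the table even if arch carries extra bits. */
static int demo_fft(int nfft, int arch)
{
   return DEMO_FFT_IMPL[arch & DEMO_ARCHMASK](nfft);
}
```

With this layout, hardware without NEON never reaches the optimized
function, because its arch value selects one of the first three
entries.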
---
 Makefile.am                           |  34 +--
 celt/arm/arm_celt_map.c               |  47 +++-
 celt/arm/celt_ne10_fft.c              | 120 ++++++++++
 celt/arm/celt_ne10_mdct.c             | 158 +++++++++++++
 celt/arm/celt_neon_intr.c             |   7 +
 celt/arm/fft_arm.h                    |  66 ++++++
 celt/arm/mdct_arm.h                   |  53 +++++
 celt/celt_encoder.c                   |  13 +-
 celt/dump_modes/Makefile              |  23 +-
 celt/dump_modes/dump_modes.c          |  21 ++
 celt/dump_modes/dump_modes_arch.h     |  41 ++++
 celt/dump_modes/dump_modes_arm_ne10.c | 125 +++++++++++
 celt/kiss_fft.c                       |  27 ++-
 celt/kiss_fft.h                       |  56 ++++-
 celt/mdct.c                           |  15 +-
 celt/mdct.h                           |  39 +++-
 celt/modes.c                          |   8 +-
 celt/static_modes_float.h             |  25 +++
 celt/static_modes_float_arm_ne10.h    | 404 ++++++++++++++++++++++++++++++++++
 celt/tests/test_unit_dft.c            |  53 +++--
 celt/tests/test_unit_mathops.c        |   6 +
 celt/tests/test_unit_mdct.c           |  84 ++++---
 celt/tests/test_unit_rotation.c       |   6 +
 celt_headers.mk                       |   3 +
 celt_sources.mk                       |   4 +
 configure.ac                          |  81 +++++++
 src/analysis.c                        |   8 +-
 src/analysis.h                        |   2 +-
 src/opus_encoder.c                    |   2 +-
 src/opus_multistream_encoder.c        |   9 +-
 30 files changed, 1435 insertions(+), 105 deletions(-)
 create mode 100644 celt/arm/celt_ne10_fft.c
 create mode 100644 celt/arm/celt_ne10_mdct.c
 create mode 100644 celt/arm/fft_arm.h
 create mode 100644 celt/arm/mdct_arm.h
 create mode 100644 celt/dump_modes/dump_modes_arch.h
 create mode 100644 celt/dump_modes/dump_modes_arm_ne10.c
 create mode 100644 celt/static_modes_float_arm_ne10.h

diff --git a/Makefile.am b/Makefile.am
index 2a1ddc8..c5c1562 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -10,7 +10,7 @@ lib_LTLIBRARIES = libopus.la
 DIST_SUBDIRS = doc
 
 AM_CPPFLAGS = -I$(top_srcdir)/include -I$(top_srcdir)/celt -I$(top_srcdir)/silk \
-              -I$(top_srcdir)/silk/float -I$(top_srcdir)/silk/fixed
+              -I$(top_srcdir)/silk/float -I$(top_srcdir)/silk/fixed $(NE10_CFLAGS)
 
 include celt_sources.mk
 include silk_sources.mk
@@ -47,6 +47,10 @@ CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR)
 OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon
 endif
 
+if HAVE_ARM_NE10
+CELT_SOURCES += $(CELT_SOURCES_ARM_NE10)
+endif
+
 if OPUS_ARM_EXTERNAL_ASM
 nodist_libopus_la_SOURCES = $(CELT_SOURCES_ARM_ASM:.s=-gnu.S)
 BUILT_SOURCES = $(CELT_SOURCES_ARM_ASM:.s=-gnu.S) \
@@ -64,7 +68,7 @@ include opus_headers.mk
 
 libopus_la_SOURCES = $(CELT_SOURCES) $(SILK_SOURCES) $(OPUS_SOURCES)
 libopus_la_LDFLAGS = -no-undefined -version-info @OPUS_LT_CURRENT@:@OPUS_LT_REVISION@:@OPUS_LT_AGE@
-libopus_la_LIBADD = $(LIBM)
+libopus_la_LIBADD = $(NE10_LIBS) $(LIBM)
 
 pkginclude_HEADERS = include/opus.h include/opus_multistream.h include/opus_types.h include/opus_defines.h
 
@@ -77,32 +81,32 @@ TESTS = celt/tests/test_unit_types celt/tests/test_unit_mathops celt/tests/test_
 
 opus_demo_SOURCES = src/opus_demo.c
 
-opus_demo_LDADD = libopus.la $(LIBM)
+opus_demo_LDADD = libopus.la $(NE10_LIBS) $(LIBM)
 
 repacketizer_demo_SOURCES = src/repacketizer_demo.c
 
-repacketizer_demo_LDADD = libopus.la $(LIBM)
+repacketizer_demo_LDADD = libopus.la $(NE10_LIBS) $(LIBM)
 
 opus_compare_SOURCES = src/opus_compare.c
 opus_compare_LDADD = $(LIBM)
 
 tests_test_opus_api_SOURCES = tests/test_opus_api.c tests/test_opus_common.h
-tests_test_opus_api_LDADD = libopus.la $(LIBM)
+tests_test_opus_api_LDADD = libopus.la $(NE10_LIBS) $(LIBM)
 
 tests_test_opus_encode_SOURCES = tests/test_opus_encode.c tests/test_opus_common.h
-tests_test_opus_encode_LDADD = libopus.la $(LIBM)
+tests_test_opus_encode_LDADD = libopus.la $(NE10_LIBS) $(LIBM)
 
 tests_test_opus_decode_SOURCES = tests/test_opus_decode.c tests/test_opus_common.h
-tests_test_opus_decode_LDADD = libopus.la $(LIBM)
+tests_test_opus_decode_LDADD = libopus.la $(NE10_LIBS) $(LIBM)
 
 tests_test_opus_padding_SOURCES = tests/test_opus_padding.c tests/test_opus_common.h
-tests_test_opus_padding_LDADD = libopus.la $(LIBM)
+tests_test_opus_padding_LDADD = libopus.la $(NE10_LIBS) $(LIBM)
 
 celt_tests_test_unit_cwrs32_SOURCES = celt/tests/test_unit_cwrs32.c
 celt_tests_test_unit_cwrs32_LDADD = $(LIBM)
 
 celt_tests_test_unit_dft_SOURCES = celt/tests/test_unit_dft.c
-celt_tests_test_unit_dft_LDADD = $(LIBM)
+celt_tests_test_unit_dft_LDADD = $(NE10_LIBS) $(LIBM)
 
 celt_tests_test_unit_entropy_SOURCES = celt/tests/test_unit_entropy.c
 celt_tests_test_unit_entropy_LDADD = $(LIBM)
@@ -111,7 +115,7 @@ celt_tests_test_unit_laplace_SOURCES = celt/tests/test_unit_laplace.c
 celt_tests_test_unit_laplace_LDADD = $(LIBM)
 
 celt_tests_test_unit_mathops_SOURCES = celt/tests/test_unit_mathops.c
-celt_tests_test_unit_mathops_LDADD = $(LIBM)
+celt_tests_test_unit_mathops_LDADD = $(NE10_LIBS) $(LIBM)
 if CPU_ARM
 if OPUS_ARM_EXTERNAL_ASM
 celt_tests_test_unit_mathops_LDADD += libopus.la
@@ -119,10 +123,10 @@ endif
 endif
 
 celt_tests_test_unit_mdct_SOURCES = celt/tests/test_unit_mdct.c
-celt_tests_test_unit_mdct_LDADD = $(LIBM)
+celt_tests_test_unit_mdct_LDADD = $(NE10_LIBS) $(LIBM)
 
 celt_tests_test_unit_rotation_SOURCES = celt/tests/test_unit_rotation.c
-celt_tests_test_unit_rotation_LDADD = $(LIBM)
+celt_tests_test_unit_rotation_LDADD = $(NE10_LIBS) $(LIBM)
 if CPU_ARM
 if OPUS_ARM_EXTERNAL_ASM
 celt_tests_test_unit_rotation_LDADD += libopus.la
@@ -270,6 +274,8 @@ endif
 
 if OPUS_ARM_NEON_INTR
 CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \
-			%test_unit_rotation.o %test_unit_mathops.o
-$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS)
+                         $(CELT_SOURCES_ARM_NE10:.c=.lo) \
+                         %test_unit_rotation.o %test_unit_mathops.o \
+                         %test_unit_mdct.o %test_unit_dft.o
+$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS) $(NE10_CFLAGS)
 endif
diff --git a/celt/arm/arm_celt_map.c b/celt/arm/arm_celt_map.c
index 68c224d..3b49f90 100644
--- a/celt/arm/arm_celt_map.c
+++ b/celt/arm/arm_celt_map.c
@@ -30,6 +30,8 @@
 #endif
 
 #include "pitch.h"
+#include "kiss_fft.h"
+#include "mdct.h"
 
 #if defined(OPUS_HAVE_RTCD)
 
@@ -50,7 +52,46 @@ void (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
   celt_pitch_xcorr_c,              /* Media */
   celt_pitch_xcorr_float_neon      /* Neon */
 };
-#  endif
-# endif
 
-#endif
+#if defined(HAVE_ARM_NE10)
+#ifdef CUSTOM_MODES
+int (*const OPUS_FFT_ALLOC_ARCH_IMPL[OPUS_ARCHMASK+1])(kiss_fft_state *st) = {
+   opus_fft_alloc_arch_c,        /* ARMv4 */
+   opus_fft_alloc_arch_c,        /* EDSP */
+   opus_fft_alloc_arch_c,        /* Media */
+   opus_fft_alloc_arm_float_neon /* Neon with NE10 library support */
+};
+
+void (*const OPUS_FFT_FREE_ARCH_IMPL[OPUS_ARCHMASK+1])(kiss_fft_state *st) = {
+   opus_fft_free_arch_c,         /* ARMv4 */
+   opus_fft_free_arch_c,         /* EDSP */
+   opus_fft_free_arch_c,         /* Media */
+   opus_fft_free_arm_float_neon  /* Neon with NE10 */
+};
+#endif /* CUSTOM_MODES */
+
+void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg,
+                                        const kiss_fft_cpx *fin,
+                                        kiss_fft_cpx *fout) = {
+   opus_fft_c,                   /* ARMv4 */
+   opus_fft_c,                   /* EDSP */
+   opus_fft_c,                   /* Media */
+   opus_fft_float_neon           /* Neon with NE10 */
+};
+
+void (*const CLT_MDCT_FORWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l,
+                                                     kiss_fft_scalar *in,
+                                                     kiss_fft_scalar * OPUS_RESTRICT out,
+                                                     const opus_val16 *window,
+                                                     int overlap, int shift,
+                                                     int stride, int arch) = {
+   clt_mdct_forward_c,           /* ARMv4 */
+   clt_mdct_forward_c,           /* EDSP */
+   clt_mdct_forward_c,           /* Media */
+   clt_mdct_forward_float_neon   /* Neon with NE10 */
+};
+#endif /* HAVE_ARM_NE10 */
+#  endif /* OPUS_ARM_NEON_INTR */
+# endif /* FIXED_POINT */
+
+#endif /* OPUS_HAVE_RTCD */
diff --git a/celt/arm/celt_ne10_fft.c b/celt/arm/celt_ne10_fft.c
new file mode 100644
index 0000000..b592f19
--- /dev/null
+++ b/celt/arm/celt_ne10_fft.c
@@ -0,0 +1,120 @@
+/* Copyright (c) 2015 Xiph.Org Foundation
+   Written by Viswanath Puttagunta */
+/**
+   @file celt_ne10_fft.c
+   @brief ARM Neon optimizations for fft using NE10 library
+ */
+
+/*
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions
+   are met:
+
+   - Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+   - Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
+   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifndef SKIP_CONFIG_H
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif
+#endif
+
+#include <arm_neon.h>
+#include <NE10_init.h>
+#include <NE10_dsp.h>
+#include "../kiss_fft.h"
+#include "stack_alloc.h"
+#include "os_support.h"
+#include "stack_alloc.h"
+
+#ifdef CUSTOM_MODES
+
+/* nfft lengths in NE10 that support scaled fft */
+#define NE10_FFTSCALED_SUPPORT_MAX 4
+static const int ne10_fft_scaled_support[NE10_FFTSCALED_SUPPORT_MAX] = {
+   480, 240, 120, 60
+};
+
+int opus_fft_alloc_arm_float_neon(kiss_fft_state *st)
+{
+   int i;
+   size_t memneeded = sizeof(struct arch_fft_state);
+
+   st->arch_fft = (arch_fft_state *)opus_alloc(memneeded);
+   if (!st->arch_fft)
+      return -1;
+
+   for (i = 0; i < NE10_FFTSCALED_SUPPORT_MAX; i++) {
+      if(st->nfft == ne10_fft_scaled_support[i])
+         break;
+   }
+   if (i == NE10_FFTSCALED_SUPPORT_MAX) {
+      /* This nfft length (scaled fft) is not supported in NE10 */
+      st->arch_fft->is_supported = 0;
+      st->arch_fft->priv = NULL;
+   }
+   else {
+      st->arch_fft->is_supported = 1;
+      st->arch_fft->priv = (void *)ne10_fft_alloc_c2c_float32_neon(st->nfft);
+      if (st->arch_fft->priv == NULL) {
+         return -1;
+      }
+   }
+   return 0;
+}
+
+void opus_fft_free_arm_float_neon(kiss_fft_state *st)
+{
+   ne10_fft_cfg_float32_t cfg;
+
+   if (!st->arch_fft)
+      return;
+
+   cfg = (ne10_fft_cfg_float32_t)st->arch_fft->priv;
+   if (cfg)
+      ne10_fft_destroy_c2c_float32(cfg);
+   opus_free(st->arch_fft);
+}
+#endif
+void opus_fft_float_neon(const kiss_fft_state *st,
+                         const kiss_fft_cpx *fin,
+                         kiss_fft_cpx *fout)
+{
+   ne10_fft_state_float32_t state;
+   ne10_fft_cfg_float32_t cfg = &state;
+   VARDECL(ne10_fft_cpx_float32_t, buffer);
+   SAVE_STACK;
+   ALLOC(buffer, st->nfft, ne10_fft_cpx_float32_t);
+
+   if (!st->arch_fft->is_supported) {
+      /* This nfft length (scaled fft) not supported in NE10 */
+      opus_fft_c(st, fin, fout);
+   }
+   else {
+      memcpy((void *)cfg, st->arch_fft->priv, sizeof(ne10_fft_state_float32_t));
+      state.buffer = (ne10_fft_cpx_float32_t *)&buffer[0];
+      state.is_forward_scaled = 1;
+
+      ne10_fft_c2c_1d_float32_neon((ne10_fft_cpx_float32_t *)fout,
+                                   (ne10_fft_cpx_float32_t *)fin,
+                                   cfg, 0);
+   }
+   RESTORE_STACK;
+}
diff --git a/celt/arm/celt_ne10_mdct.c b/celt/arm/celt_ne10_mdct.c
new file mode 100644
index 0000000..cf175cb
--- /dev/null
+++ b/celt/arm/celt_ne10_mdct.c
@@ -0,0 +1,158 @@
+/* Copyright (c) 2015 Xiph.Org Foundation
+   Written by Viswanath Puttagunta */
+/**
+   @file celt_ne10_mdct.c
+   @brief ARM Neon optimizations for mdct using NE10 library
+ */
+
+/*
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions
+   are met:
+
+   - Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+   - Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
+   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifndef SKIP_CONFIG_H
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif
+#endif
+
+#include "../kiss_fft.h"
+#include "_kiss_fft_guts.h"
+#include "../mdct.h"
+#include "stack_alloc.h"
+#include "os_support.h"
+#include "stack_alloc.h"
+
+void clt_mdct_forward_float_neon(const mdct_lookup *l,
+                                 kiss_fft_scalar *in,
+                                 kiss_fft_scalar * OPUS_RESTRICT out,
+                                 const opus_val16 *window,
+                                 int overlap, int shift, int stride, int arch)
+{
+   int i;
+   int N, N2, N4;
+   VARDECL(kiss_fft_scalar, f);
+   VARDECL(kiss_fft_cpx, f2);
+   const kiss_fft_state *st = l->kfft[shift];
+   const kiss_twiddle_scalar *trig;
+
+   SAVE_STACK;
+
+   N = l->n;
+   trig = l->trig;
+   for (i=0;i<shift;i++)
+   {
+      N >>= 1;
+      trig += N;
+   }
+   N2 = N>>1;
+   N4 = N>>2;
+
+   ALLOC(f, N2, kiss_fft_scalar);
+   ALLOC(f2, N4, kiss_fft_cpx);
+
+   /* Consider the input to be composed of four blocks: [a, b, c, d] */
+   /* Window, shuffle, fold */
+   {
+      /* Temp pointers to make it really clear to the compiler what we're doing */
+      const kiss_fft_scalar * OPUS_RESTRICT xp1 = in+(overlap>>1);
+      const kiss_fft_scalar * OPUS_RESTRICT xp2 = in+N2-1+(overlap>>1);
+      kiss_fft_scalar * OPUS_RESTRICT yp = f;
+      const opus_val16 * OPUS_RESTRICT wp1 = window+(overlap>>1);
+      const opus_val16 * OPUS_RESTRICT wp2 = window+(overlap>>1)-1;
+      for(i=0;i<((overlap+3)>>2);i++)
+      {
+         /* Real part arranged as -d-cR, Imag part arranged as -b+aR*/
+         *yp++ = MULT16_32_Q15(*wp2, xp1[N2]) + MULT16_32_Q15(*wp1,*xp2);
+         *yp++ = MULT16_32_Q15(*wp1, *xp1)    - MULT16_32_Q15(*wp2, xp2[-N2]);
+         xp1+=2;
+         xp2-=2;
+         wp1+=2;
+         wp2-=2;
+      }
+      wp1 = window;
+      wp2 = window+overlap-1;
+      for(;i<N4-((overlap+3)>>2);i++)
+      {
+         /* Real part arranged as a-bR, Imag part arranged as -c-dR */
+         *yp++ = *xp2;
+         *yp++ = *xp1;
+         xp1+=2;
+         xp2-=2;
+      }
+      for(;i<N4;i++)
+      {
+         /* Real part arranged as a-bR, Imag part arranged as -c-dR */
+         *yp++ =  -MULT16_32_Q15(*wp1, xp1[-N2]) + MULT16_32_Q15(*wp2, *xp2);
+         *yp++ = MULT16_32_Q15(*wp2, *xp1)     + MULT16_32_Q15(*wp1, xp2[N2]);
+         xp1+=2;
+         xp2-=2;
+         wp1+=2;
+         wp2-=2;
+      }
+   }
+   /* Pre-rotation */
+   {
+      kiss_fft_scalar * OPUS_RESTRICT yp = f;
+      const kiss_twiddle_scalar *t = &trig[0];
+      for(i=0;i<N4;i++)
+      {
+         kiss_fft_cpx yc;
+         kiss_twiddle_scalar t0, t1;
+         kiss_fft_scalar re, im, yr, yi;
+         t0 = t[i];
+         t1 = t[N4+i];
+         re = *yp++;
+         im = *yp++;
+         yr = S_MUL(re,t0)  -  S_MUL(im,t1);
+         yi = S_MUL(im,t0)  +  S_MUL(re,t1);
+         yc.r = yr;
+         yc.i = yi;
+         f2[i] = yc;
+      }
+   }
+
+   opus_fft(st, f2, (kiss_fft_cpx *)f, arch);
+
+   /* Post-rotate */
+   {
+      /* Temp pointers to make it really clear to the compiler what we're doing */
+      const kiss_fft_cpx * OPUS_RESTRICT fp = (kiss_fft_cpx *)f;
+      kiss_fft_scalar * OPUS_RESTRICT yp1 = out;
+      kiss_fft_scalar * OPUS_RESTRICT yp2 = out+stride*(N2-1);
+      const kiss_twiddle_scalar *t = &trig[0];
+      /* Temp pointers to make it really clear to the compiler what we're doing */
+      for(i=0;i<N4;i++)
+      {
+         kiss_fft_scalar yr, yi;
+         yr = S_MUL(fp->i,t[N4+i]) - S_MUL(fp->r,t[i]);
+         yi = S_MUL(fp->r,t[N4+i]) + S_MUL(fp->i,t[i]);
+         *yp1 = yr;
+         *yp2 = yi;
+         fp++;
+         yp1 += 2*stride;
+         yp2 -= 2*stride;
+      }
+   }
+   RESTORE_STACK;
+}
diff --git a/celt/arm/celt_neon_intr.c b/celt/arm/celt_neon_intr.c
index 4a67413..47dce15 100644
--- a/celt/arm/celt_neon_intr.c
+++ b/celt/arm/celt_neon_intr.c
@@ -29,9 +29,15 @@
    NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
    SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 */
+
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif
+
 #include <arm_neon.h>
 #include "../pitch.h"
 
+#if !defined(FIXED_POINT)
 /*
  * Function: xcorr_kernel_neon_float
  * ---------------------------------
@@ -243,3 +249,4 @@ void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
             (const float32_t *)_y+i, (float32_t *)xcorr+i, len);
    }
 }
+#endif
diff --git a/celt/arm/fft_arm.h b/celt/arm/fft_arm.h
new file mode 100644
index 0000000..e7a30d6
--- /dev/null
+++ b/celt/arm/fft_arm.h
@@ -0,0 +1,66 @@
+/* Copyright (c) 2015 Xiph.Org Foundation
+   Written by Viswanath Puttagunta */
+/**
+   @file fft_arm.h
+   @brief ARM Neon Intrinsic optimizations for fft using NE10 library
+ */
+
+/*
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions
+   are met:
+
+   - Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+   - Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
+   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+
+#if !defined(FFT_ARM_H)
+#define FFT_ARM_H
+
+#include "config.h"
+#include "kiss_fft.h"
+
+#if !defined(FIXED_POINT)
+#if defined(HAVE_ARM_NE10)
+
+int opus_fft_alloc_arm_float_neon(kiss_fft_state *st);
+void opus_fft_free_arm_float_neon(kiss_fft_state *st);
+
+void opus_fft_float_neon(const kiss_fft_state *st,
+                         const kiss_fft_cpx *fin,
+                         kiss_fft_cpx *fout);
+#if !defined(OPUS_HAVE_RTCD)
+#define OVERRIDE_OPUS_FFT (1)
+
+#define opus_fft_alloc_arch(_st, arch) \
+   ((void)(arch), opus_fft_alloc_arm_float_neon(_st))
+
+#define opus_fft_free_arch(_st, arch) \
+   ((void)(arch), opus_fft_free_arm_float_neon(_st))
+
+#define opus_fft(_st, _fin, _fout, arch) \
+   ((void)(arch), opus_fft_float_neon(_st, _fin, _fout))
+
+#endif /* OPUS_HAVE_RTCD */
+
+#endif /* HAVE_ARM_NE10 */
+#endif /* FIXED_POINT */
+
+#endif
diff --git a/celt/arm/mdct_arm.h b/celt/arm/mdct_arm.h
new file mode 100644
index 0000000..7d60fed
--- /dev/null
+++ b/celt/arm/mdct_arm.h
@@ -0,0 +1,53 @@
+/* Copyright (c) 2015 Xiph.Org Foundation
+   Written by Viswanath Puttagunta */
+/**
+   @file mdct_arm.h
+   @brief ARM Neon Intrinsic optimizations for mdct using NE10 library
+ */
+
+/*
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions
+   are met:
+
+   - Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+   - Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
+   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#if !defined(MDCT_ARM_H)
+#define MDCT_ARM_H
+
+#include "config.h"
+#include "mdct.h"
+
+#if !defined(FIXED_POINT) && defined(HAVE_ARM_NE10)
+/** Compute a forward MDCT and scale by 4/N, trashes the input array */
+void clt_mdct_forward_float_neon(const mdct_lookup *l, kiss_fft_scalar *in,
+                                 kiss_fft_scalar * OPUS_RESTRICT out,
+                                 const opus_val16 *window, int overlap,
+                                 int shift, int stride, int arch);
+
+#if !defined(OPUS_HAVE_RTCD)
+#define OVERRIDE_OPUS_MDCT (1)
+#define clt_mdct_forward(_l, _in, _out, _window, _int, _shift, _stride, _arch) \
+      clt_mdct_forward_float_neon(_l, _in, _out, _window, _int, _shift, _stride, _arch)
+#endif /* OPUS_HAVE_RTCD */
+#endif /* !defined(FIXED_POINT) && defined(HAVE_ARM_NE10) */
+
+#endif
diff --git a/celt/celt_encoder.c b/celt/celt_encoder.c
index 86a3fbb..7a2c71b 100644
--- a/celt/celt_encoder.c
+++ b/celt/celt_encoder.c
@@ -414,7 +414,8 @@ int patch_transient_decision(opus_val16 *newE, opus_val16 *oldE, int nbEBands,
 /** Apply window and compute the MDCT for all sub-frames and
     all channels in a frame */
 static void compute_mdcts(const CELTMode *mode, int shortBlocks, celt_sig * OPUS_RESTRICT in,
-                          celt_sig * OPUS_RESTRICT out, int C, int CC, int LM, int upsample)
+                          celt_sig * OPUS_RESTRICT out, int C, int CC, int LM, int upsample,
+                          int arch)
 {
    const int overlap = mode->overlap;
    int N;
@@ -435,7 +436,9 @@ static void compute_mdcts(const CELTMode *mode, int shortBlocks, celt_sig * OPUS
       for (b=0;b<B;b++)
       {
          /* Interleaving the sub-frames while doing the MDCTs */
-         clt_mdct_forward(&mode->mdct, in+c*(B*N+overlap)+b*N, &out[b+c*N*B], mode->window, overlap, shift, B);
+         clt_mdct_forward(&mode->mdct, in+c*(B*N+overlap)+b*N,
+                          &out[b+c*N*B], mode->window, overlap, shift, B,
+                          arch);
       }
    } while (++c<CC);
    if (CC==2&&C==1)
@@ -1603,14 +1606,14 @@ int celt_encode_with_ec(CELTEncoder * OPUS_RESTRICT st, const opus_val16 * pcm,
    ALLOC(bandLogE2, C*nbEBands, opus_val16);
    if (secondMdct)
    {
-      compute_mdcts(mode, 0, in, freq, C, CC, LM, st->upsample);
+      compute_mdcts(mode, 0, in, freq, C, CC, LM, st->upsample, st->arch);
       compute_band_energies(mode, freq, bandE, effEnd, C, LM);
       amp2Log2(mode, effEnd, end, bandE, bandLogE2, C);
       for (i=0;i<C*nbEBands;i++)
          bandLogE2[i] += HALF16(SHL16(LM, DB_SHIFT));
    }
 
-   compute_mdcts(mode, shortBlocks, in, freq, C, CC, LM, st->upsample);
+   compute_mdcts(mode, shortBlocks, in, freq, C, CC, LM, st->upsample, st->arch);
    if (CC==2&&C==1)
       tf_chan = 0;
    compute_band_energies(mode, freq, bandE, effEnd, C, LM);
@@ -1736,7 +1739,7 @@ int celt_encode_with_ec(CELTEncoder * OPUS_RESTRICT st, const opus_val16 * pcm,
       {
          isTransient = 1;
          shortBlocks = M;
-         compute_mdcts(mode, shortBlocks, in, freq, C, CC, LM, st->upsample);
+         compute_mdcts(mode, shortBlocks, in, freq, C, CC, LM, st->upsample, st->arch);
          compute_band_energies(mode, freq, bandE, effEnd, C, LM);
          amp2Log2(mode, effEnd, end, bandE, bandLogE, C);
          /* Compensate for the scaling of short vs long mdcts */
diff --git a/celt/dump_modes/Makefile b/celt/dump_modes/Makefile
index 74d527e..10c3679 100644
--- a/celt/dump_modes/Makefile
+++ b/celt/dump_modes/Makefile
@@ -1,10 +1,31 @@
+
 CFLAGS=-O2 -Wall -Wextra -DHAVE_CONFIG_H
 INCLUDES=-I. -I../ -I../.. -I../../include
 
+SOURCES = dump_modes.c \
+          ../modes.c \
+          ../cwrs.c \
+          ../rate.c \
+          ../entenc.c \
+          ../entdec.c \
+          ../mathops.c \
+          ../mdct.c \
+          ../kiss_fft.c
+
+ifdef HAVE_ARM_NE10
+CC = gcc
+CFLAGS += -mfpu=neon
+INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_NEON_INTR
+LIBDIR = -l:$(NE10_LIBDIR)/libNE10.so
+SOURCES += ../arm/celt_ne10_fft.c \
+           dump_modes_arm_ne10.c \
+           ../arm/armcpu.c
+endif
+
 all: dump_modes
 
 dump_modes:
-	$(CC) $(CFLAGS) $(INCLUDES) -DCUSTOM_MODES_ONLY -DCUSTOM_MODES dump_modes.c ../modes.c ../cwrs.c ../rate.c ../entenc.c ../entdec.c ../mathops.c ../mdct.c ../kiss_fft.c -o dump_modes -lm
+	$(PREFIX)$(CC) $(CFLAGS) $(INCLUDES) -DCUSTOM_MODES_ONLY -DCUSTOM_MODES $(SOURCES) -o $@ $(LIBDIR) -lm
 
 clean:
 	rm -f dump_modes
diff --git a/celt/dump_modes/dump_modes.c b/celt/dump_modes/dump_modes.c
index ae6a8c1..9105a53 100644
--- a/celt/dump_modes/dump_modes.c
+++ b/celt/dump_modes/dump_modes.c
@@ -35,6 +35,7 @@
 #include "modes.h"
 #include "celt.h"
 #include "rate.h"
+#include "dump_modes_arch.h"
 
 #define INT16 "%d"
 #define INT32 "%d"
@@ -62,6 +63,10 @@ void dump_modes(FILE *file, CELTMode **modes, int nb_modes)
    fprintf(file, "\n   It contains static definitions for some pre-defined modes. */\n");
    fprintf(file, "#include \"modes.h\"\n");
    fprintf(file, "#include \"rate.h\"\n");
+   fprintf(file, "\n#ifdef HAVE_ARM_NE10\n");
+   fprintf(file, "#define OVERRIDE_FFT 1\n");
+   fprintf(file, "#include \"%s\"\n", ARM_NE10_ARCH_FILE_NAME);
+   fprintf(file, "#endif\n");
 
    fprintf(file, "\n");
 
@@ -149,6 +154,9 @@ void dump_modes(FILE *file, CELTMode **modes, int nb_modes)
          fprintf (file, "{" WORD16 ", " WORD16 "},%c", mode->mdct.kfft[0]->twiddles[j].r, mode->mdct.kfft[0]->twiddles[j].i,(j+3)%2==0?'\n':' ');
       fprintf (file, "};\n");
 
+#ifdef OVERRIDE_FFT
+      dump_mode_arch(mode);
+#endif
       /* FFT Bitrev tables */
       for (k=0;k<=mode->mdct.maxshift;k++)
       {
@@ -183,6 +191,13 @@ void dump_modes(FILE *file, CELTMode **modes, int nb_modes)
          fprintf (file, "},    /* factors */\n");
          fprintf (file, "fft_bitrev%d,    /* bitrev */\n", mode->mdct.kfft[k]->nfft);
          fprintf (file, "fft_twiddles%d_%d,    /* bitrev */\n", mode->Fs, mdctSize);
+
+         fprintf (file, "#ifdef OVERRIDE_FFT\n");
+         fprintf (file, "(arch_fft_state *)&cfg_arch_%d,\n", mode->mdct.kfft[k]->nfft);
+         fprintf (file, "#else\n");
+         fprintf (file, "NULL,\n");
+         fprintf(file, "#endif\n");
+
          fprintf (file, "};\n");
 
          fprintf(file, "#endif\n");
@@ -323,8 +338,14 @@ int main(int argc, char **argv)
       }
    }
    file = fopen(BASENAME ".h", "w");
+#ifdef OVERRIDE_FFT
+   dump_modes_arch_init(m, nb);
+#endif
    dump_modes(file, m, nb);
    fclose(file);
+#ifdef OVERRIDE_FFT
+   dump_modes_arch_finalize();
+#endif
    for (i=0;i<nb;i++)
       opus_custom_mode_destroy(m[i]);
    free(m);
diff --git a/celt/dump_modes/dump_modes_arch.h b/celt/dump_modes/dump_modes_arch.h
new file mode 100644
index 0000000..1436926
--- /dev/null
+++ b/celt/dump_modes/dump_modes_arch.h
@@ -0,0 +1,41 @@
+/* Copyright (c) 2015 Xiph.Org Foundation
+   Written by Viswanath Puttagunta */
+/*
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions
+   are met:
+
+   - Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+   - Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
+   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifndef DUMP_MODE_ARCH_H
+#define DUMP_MODE_ARCH_H
+
+void dump_modes_arch_init(CELTMode **modes, int nb_modes);
+void dump_mode_arch(CELTMode *mode);
+void dump_modes_arch_finalize(void);
+
+#define ARM_NE10_ARCH_FILE_NAME "static_modes_float_arm_ne10.h"
+
+#if defined(HAVE_ARM_NE10)
+#define OVERRIDE_FFT (1)
+#endif
+
+#endif
diff --git a/celt/dump_modes/dump_modes_arm_ne10.c b/celt/dump_modes/dump_modes_arm_ne10.c
new file mode 100644
index 0000000..aa53f17
--- /dev/null
+++ b/celt/dump_modes/dump_modes_arm_ne10.c
@@ -0,0 +1,125 @@
+/* Copyright (c) 2015 Xiph.Org Foundation
+   Written by Viswanath Puttagunta */
+/*
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions
+   are met:
+
+   - Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+   - Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
+   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#include <stdio.h>
+#include <stdlib.h>
+#include "modes.h"
+#include "dump_modes_arch.h"
+#include <NE10_dsp.h>
+
+static FILE *file;
+
+void dump_modes_arch_init(CELTMode **modes, int nb_modes)
+{
+   int i;
+
+   file = fopen(ARM_NE10_ARCH_FILE_NAME, "w");
+   fprintf(file, "/* The contents of this file was automatically generated by\n");
+   fprintf(file, " * dump_mode_arm_ne10.c with arguments:");
+   for (i=0;i<nb_modes;i++)
+   {
+      CELTMode *mode = modes[i];
+      fprintf(file, " %d %d",mode->Fs,mode->shortMdctSize*mode->nbShortMdcts);
+   }
+   fprintf(file, "\n * It contains static definitions for some pre-defined modes. */\n");
+   fprintf(file, "#include <NE10_init.h>\n\n");
+}
+
+void dump_modes_arch_finalize(void)
+{
+   fclose(file);
+}
+
+void dump_mode_arch(CELTMode *mode)
+{
+   int k, j;
+   int mdctSize;
+
+   mdctSize = mode->shortMdctSize*mode->nbShortMdcts;
+
+   fprintf(file, "#ifndef NE10_FFT_PARAMS%d_%d\n", mode->Fs, mdctSize);
+   fprintf(file, "#define NE10_FFT_PARAMS%d_%d\n", mode->Fs, mdctSize);
+   ne10_fft_cfg_float32_t cfg;
+   /* cfg->factors */
+   for(k=0;k<=mode->mdct.maxshift;k++) {
+      cfg = (ne10_fft_cfg_float32_t)mode->mdct.kfft[k]->arch_fft->priv;
+      if (!cfg)
+         continue;
+      fprintf(file, "static const ne10_int32_t ne10_factors_%d[%d] = {\n",
+              mode->mdct.kfft[k]->nfft, (NE10_MAXFACTORS * 2));
+      for(j=0;j<(NE10_MAXFACTORS * 2);j++) {
+         fprintf(file, "%d,%c", cfg->factors[j],(j+16)%15==0?'\n':' ');
+      }
+      fprintf (file, "};\n");
+   }
+
+   /* cfg->twiddles */
+   for(k=0;k<=mode->mdct.maxshift;k++) {
+      cfg = (ne10_fft_cfg_float32_t)mode->mdct.kfft[k]->arch_fft->priv;
+      if (!cfg)
+         continue;
+      fprintf(file, "static const ne10_fft_cpx_float32_t ne10_twiddles_%d[%d] = {\n",
+              mode->mdct.kfft[k]->nfft, mode->mdct.kfft[k]->nfft);
+      for(j=0;j<mode->mdct.kfft[k]->nfft;j++) {
+         fprintf(file, "{%#0.8gf,%#0.8gf},%c", cfg->twiddles[j].r, cfg->twiddles[j].i,(j+4)%3==0?'\n':' ');
+      }
+      fprintf (file, "};\n");
+   }
+
+   for(k=0;k<=mode->mdct.maxshift;k++) {
+      cfg = (ne10_fft_cfg_float32_t)mode->mdct.kfft[k]->arch_fft->priv;
+      if (!cfg) {
+         fprintf(file, "/* Ne10 does not support scaled FFT for length = %d */\n",
+                 mode->mdct.kfft[k]->nfft);
+         fprintf(file, "static const arch_fft_state cfg_arch_%d = {\n", mode->mdct.kfft[k]->nfft);
+         fprintf(file, "0,\n");
+         fprintf(file, "NULL\n");
+         fprintf(file, "};\n");
+         continue;
+      }
+      fprintf(file, "static const ne10_fft_state_float32_t ne10_fft_state_float32_%d = {\n",
+              mode->mdct.kfft[k]->nfft);
+      fprintf(file, "%d,\n", cfg->nfft);
+      fprintf(file, "(ne10_int32_t *)ne10_factors_%d,\n", mode->mdct.kfft[k]->nfft);
+      fprintf(file, "(ne10_fft_cpx_float32_t *)ne10_twiddles_%d,\n", mode->mdct.kfft[k]->nfft);
+      fprintf(file, "NULL,\n");  /* buffer */
+      fprintf(file, "(ne10_fft_cpx_float32_t *)&ne10_twiddles_%d[%d],\n",
+              mode->mdct.kfft[k]->nfft, cfg->nfft);
+      fprintf(file, "/* is_forward_scaled = true */\n");
+      fprintf(file, "(ne10_int32_t) 1,\n");
+      fprintf(file, "/* is_backward_scaled = false */\n");
+      fprintf(file, "(ne10_int32_t) 0,\n");
+      fprintf(file, "};\n");
+
+      fprintf(file, "static const arch_fft_state cfg_arch_%d = {\n",
+              mode->mdct.kfft[k]->nfft);
+      fprintf(file, "1,\n");
+      fprintf(file, "(void *)&ne10_fft_state_float32_%d,\n", mode->mdct.kfft[k]->nfft);
+      fprintf(file, "};\n\n");
+   }
+   fprintf(file, "#endif  /* end NE10_FFT_PARAMS%d_%d */\n", mode->Fs, mdctSize);
+}
diff --git a/celt/kiss_fft.c b/celt/kiss_fft.c
index cc487fc..38fd4fb 100644
--- a/celt/kiss_fft.c
+++ b/celt/kiss_fft.c
@@ -423,13 +423,19 @@ static void compute_twiddles(kiss_twiddle_cpx *twiddles, int nfft)
 #endif
 }
 
+int opus_fft_alloc_arch_c(kiss_fft_state *st) {
+   (void)st;
+   return 0;
+}
+
 /*
  *
  * Allocates all necessary storage space for the fft and ifft.
  * The return value is a contiguous block of memory.  As such,
  * It can be freed with free().
  * */
-kiss_fft_state *opus_fft_alloc_twiddles(int nfft,void * mem,size_t * lenmem,  const kiss_fft_state *base)
+kiss_fft_state *opus_fft_alloc_twiddles(int nfft,void * mem,size_t * lenmem,
+                                        const kiss_fft_state *base, int arch)
 {
     kiss_fft_state *st=NULL;
     size_t memneeded = sizeof(struct kiss_fft_state); /* twiddle factors*/
@@ -478,22 +484,31 @@ kiss_fft_state *opus_fft_alloc_twiddles(int nfft,void * mem,size_t * lenmem,  co
         if (st->bitrev==NULL)
             goto fail;
         compute_bitrev_table(0, bitrev, 1,1, st->factors,st);
+
+        /* Initialize architecture specific fft parameters */
+        if (opus_fft_alloc_arch(st, arch))
+            goto fail;
     }
     return st;
 fail:
-    opus_fft_free(st);
+    opus_fft_free(st, arch);
     return NULL;
 }
 
-kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t * lenmem )
+kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t * lenmem, int arch)
 {
-   return opus_fft_alloc_twiddles(nfft, mem, lenmem, NULL);
+   return opus_fft_alloc_twiddles(nfft, mem, lenmem, NULL, arch);
+}
+
+void opus_fft_free_arch_c(kiss_fft_state *st) {
+   (void)st;
 }
 
-void opus_fft_free(const kiss_fft_state *cfg)
+void opus_fft_free(const kiss_fft_state *cfg, int arch)
 {
    if (cfg)
    {
+      opus_fft_free_arch((kiss_fft_state *)cfg, arch);
       opus_free((opus_int16*)cfg->bitrev);
       if (cfg->shift < 0)
          opus_free((kiss_twiddle_cpx*)cfg->twiddles);
@@ -551,7 +566,7 @@ void opus_fft_impl(const kiss_fft_state *st,kiss_fft_cpx *fout)
     }
 }
 
-void opus_fft(const kiss_fft_state *st,const kiss_fft_cpx *fin,kiss_fft_cpx *fout)
+void opus_fft_c(const kiss_fft_state *st,const kiss_fft_cpx *fin,kiss_fft_cpx *fout)
 {
    int i;
    opus_val16 scale;
diff --git a/celt/kiss_fft.h b/celt/kiss_fft.h
index 390b54d..bf2f836 100644
--- a/celt/kiss_fft.h
+++ b/celt/kiss_fft.h
@@ -32,6 +32,7 @@
 #include <stdlib.h>
 #include <math.h>
 #include "arch.h"
+#include "cpu_support.h"
 
 #ifdef __cplusplus
 extern "C" {
@@ -77,6 +78,11 @@ typedef struct {
  4*4*4*2
  */
 
+typedef struct arch_fft_state{
+   int is_supported;
+   void *priv;
+} arch_fft_state;
+
 typedef struct kiss_fft_state{
     int nfft;
     opus_val16 scale;
@@ -87,8 +93,15 @@ typedef struct kiss_fft_state{
     opus_int16 factors[2*MAXFACTORS];
     const opus_int16 *bitrev;
     const kiss_twiddle_cpx *twiddles;
+#ifndef FIXED_POINT
+    arch_fft_state *arch_fft;
+#endif
 } kiss_fft_state;
 
+#if !defined(FIXED_POINT) && defined(HAVE_ARM_NE10)
+#include "arm/fft_arm.h"
+#endif
+
 /*typedef struct kiss_fft_state* kiss_fft_cfg;*/
 
 /**
@@ -114,9 +127,9 @@ typedef struct kiss_fft_state{
  *      buffer size in *lenmem.
  * */
 
-kiss_fft_state *opus_fft_alloc_twiddles(int nfft,void * mem,size_t * lenmem, const kiss_fft_state *base);
+kiss_fft_state *opus_fft_alloc_twiddles(int nfft,void * mem,size_t * lenmem, const kiss_fft_state *base, int arch);
 
-kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t * lenmem);
+kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t * lenmem, int arch);
 
 /**
  * opus_fft(cfg,in_out_buf)
@@ -128,13 +141,48 @@ kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t * lenmem);
  * Note that each element is complex and can be accessed like
     f[k].r and f[k].i
  * */
-void opus_fft(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx *fout);
+void opus_fft_c(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx *fout);
 void opus_ifft(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx *fout);
 
 void opus_fft_impl(const kiss_fft_state *st,kiss_fft_cpx *fout);
 void opus_ifft_impl(const kiss_fft_state *st,kiss_fft_cpx *fout);
 
-void opus_fft_free(const kiss_fft_state *cfg);
+void opus_fft_free(const kiss_fft_state *cfg, int arch);
+
+
+void opus_fft_free_arch_c(kiss_fft_state *st);
+int opus_fft_alloc_arch_c(kiss_fft_state *st);
+
+#if !defined(OVERRIDE_OPUS_FFT)
+/* Is run-time CPU detection enabled on this platform? */
+#if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10))
+
+extern int (*const OPUS_FFT_ALLOC_ARCH_IMPL[OPUS_ARCHMASK+1])(kiss_fft_state *st);
+
+#define opus_fft_alloc_arch(_st, arch) \
+         ((*OPUS_FFT_ALLOC_ARCH_IMPL[(arch)&OPUS_ARCHMASK])(_st))
+
+extern void (*const OPUS_FFT_FREE_ARCH_IMPL[OPUS_ARCHMASK+1])(kiss_fft_state *st);
+#define opus_fft_free_arch(_st, arch) \
+         ((*OPUS_FFT_FREE_ARCH_IMPL[(arch)&OPUS_ARCHMASK])(_st))
+
+extern void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg,
+                                        const kiss_fft_cpx *fin,
+                                        kiss_fft_cpx *fout);
+#define opus_fft(_cfg, _fin, _fout, arch) \
+   ((*OPUS_FFT[(arch)&OPUS_ARCHMASK])(_cfg, _fin, _fout))
+#else /* else for if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */
+
+#define opus_fft_alloc_arch(_st, arch) \
+         ((void)(arch), opus_fft_alloc_arch_c(_st))
+
+#define opus_fft_free_arch(_st, arch) \
+         ((void)(arch), opus_fft_free_arch_c(_st))
+
+#define opus_fft(_cfg, _fin, _fout, arch) \
+         ((void)(arch), opus_fft_c(_cfg, _fin, _fout))
+#endif /* end if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */
+#endif /* end if !defined(OVERRIDE_OPUS_FFT) */
 
 #ifdef __cplusplus
 }
diff --git a/celt/mdct.c b/celt/mdct.c
index 2795d90..ee6d80e 100644
--- a/celt/mdct.c
+++ b/celt/mdct.c
@@ -60,7 +60,7 @@
 
 #ifdef CUSTOM_MODES
 
-int clt_mdct_init(mdct_lookup *l,int N, int maxshift)
+int clt_mdct_init(mdct_lookup *l,int N, int maxshift, int arch)
 {
    int i;
    kiss_twiddle_scalar *trig;
@@ -71,9 +71,9 @@ int clt_mdct_init(mdct_lookup *l,int N, int maxshift)
    for (i=0;i<=maxshift;i++)
    {
       if (i==0)
-         l->kfft[i] = opus_fft_alloc(N>>2>>i, 0, 0);
+         l->kfft[i] = opus_fft_alloc(N>>2>>i, 0, 0, arch);
       else
-         l->kfft[i] = opus_fft_alloc_twiddles(N>>2>>i, 0, 0, l->kfft[0]);
+         l->kfft[i] = opus_fft_alloc_twiddles(N>>2>>i, 0, 0, l->kfft[0], arch);
 #ifndef ENABLE_TI_DSPLIB55
       if (l->kfft[i]==NULL)
          return 0;
@@ -104,11 +104,11 @@ int clt_mdct_init(mdct_lookup *l,int N, int maxshift)
    return 1;
 }
 
-void clt_mdct_clear(mdct_lookup *l)
+void clt_mdct_clear(mdct_lookup *l, int arch)
 {
    int i;
    for (i=0;i<=l->maxshift;i++)
-      opus_fft_free(l->kfft[i]);
+      opus_fft_free(l->kfft[i], arch);
    opus_free((kiss_twiddle_scalar*)l->trig);
 }
 
@@ -116,8 +116,8 @@ void clt_mdct_clear(mdct_lookup *l)
 
 /* Forward MDCT trashes the input array */
 #ifndef OVERRIDE_clt_mdct_forward
-void clt_mdct_forward(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scalar * OPUS_RESTRICT out,
-      const opus_val16 *window, int overlap, int shift, int stride)
+void clt_mdct_forward_c(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scalar * OPUS_RESTRICT out,
+      const opus_val16 *window, int overlap, int shift, int stride, int arch)
 {
    int i;
    int N, N2, N4;
@@ -132,6 +132,7 @@ void clt_mdct_forward(const mdct_lookup *l, kiss_fft_scalar *in, kiss_fft_scalar
    int scale_shift = st->scale_shift-1;
 #endif
    SAVE_STACK;
+   (void)arch;
    scale = st->scale;
 
    N = l->n;
diff --git a/celt/mdct.h b/celt/mdct.h
index d721821..cbaf679 100644
--- a/celt/mdct.h
+++ b/celt/mdct.h
@@ -53,13 +53,19 @@ typedef struct {
    const kiss_twiddle_scalar * OPUS_RESTRICT trig;
 } mdct_lookup;
 
-int clt_mdct_init(mdct_lookup *l,int N, int maxshift);
-void clt_mdct_clear(mdct_lookup *l);
+#if !defined(FIXED_POINT) && defined(HAVE_ARM_NE10)
+#include "arm/mdct_arm.h"
+#endif
+
+
+int clt_mdct_init(mdct_lookup *l,int N, int maxshift, int arch);
+void clt_mdct_clear(mdct_lookup *l, int arch);
 
 /** Compute a forward MDCT and scale by 4/N, trashes the input array */
-void clt_mdct_forward(const mdct_lookup *l, kiss_fft_scalar *in,
-      kiss_fft_scalar * OPUS_RESTRICT out,
-      const opus_val16 *window, int overlap, int shift, int stride);
+void clt_mdct_forward_c(const mdct_lookup *l, kiss_fft_scalar *in,
+                        kiss_fft_scalar * OPUS_RESTRICT out,
+                        const opus_val16 *window, int overlap,
+                        int shift, int stride, int arch);
 
 /** Compute a backward MDCT (no scaling) and performs weighted overlap-add
     (scales implicitly by 1/2) */
@@ -67,4 +73,27 @@ void clt_mdct_backward(const mdct_lookup *l, kiss_fft_scalar *in,
       kiss_fft_scalar * OPUS_RESTRICT out,
       const opus_val16 * OPUS_RESTRICT window, int overlap, int shift, int stride);
 
+#if !defined(OVERRIDE_OPUS_MDCT)
+/* Is run-time CPU detection enabled on this platform? */
+#if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10))
+
+extern void (*const CLT_MDCT_FORWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l,
+                                                     kiss_fft_scalar *in,
+                                                     kiss_fft_scalar * OPUS_RESTRICT out,
+                                                     const opus_val16 *window,
+                                                     int overlap, int shift,
+                                                     int stride, int arch);
+
+#define clt_mdct_forward(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) \
+   ((*CLT_MDCT_FORWARD_IMPL[(_arch)&OPUS_ARCHMASK])(_l, _in, _out, \
+                                                   _window, _overlap, _shift, \
+                                                   _stride, _arch))
+#else /* else for if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */
+
+#define clt_mdct_forward(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) \
+   clt_mdct_forward_c(_l, _in, _out, _window, _overlap, _shift, _stride, _arch)
+
+#endif /* end if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */
+#endif /* end if !defined(OVERRIDE_OPUS_MDCT) */
+
 #endif
diff --git a/celt/modes.c b/celt/modes.c
index 42e68e1..4fe91ff 100644
--- a/celt/modes.c
+++ b/celt/modes.c
@@ -37,6 +37,7 @@
 #include "os_support.h"
 #include "stack_alloc.h"
 #include "quant_bands.h"
+#include "cpu_support.h"
 
 static const opus_int16 eband5ms[] = {
 /*0  200 400 600 800  1k 1.2 1.4 1.6  2k 2.4 2.8 3.2  4k 4.8 5.6 6.8  8k 9.6 12k 15.6 */
@@ -229,6 +230,7 @@ CELTMode *opus_custom_mode_create(opus_int32 Fs, int frame_size, int *error)
    opus_val16 *window;
    opus_int16 *logN;
    int LM;
+   int arch = opus_select_arch();
    ALLOC_STACK;
 #if !defined(VAR_ARRAYS) && !defined(USE_ALLOCA)
    if (global_stack==NULL)
@@ -389,7 +391,7 @@ CELTMode *opus_custom_mode_create(opus_int32 Fs, int frame_size, int *error)
    compute_pulse_cache(mode, mode->maxLM);
 
    if (clt_mdct_init(&mode->mdct, 2*mode->shortMdctSize*mode->nbShortMdcts,
-           mode->maxLM) == 0)
+                     mode->maxLM, arch) == 0)
       goto failure;
 
    if (error)
@@ -408,6 +410,8 @@ failure:
 #ifdef CUSTOM_MODES
 void opus_custom_mode_destroy(CELTMode *mode)
 {
+   int arch = opus_select_arch();
+
    if (mode == NULL)
       return;
 #ifndef CUSTOM_MODES_ONLY
@@ -431,7 +435,7 @@ void opus_custom_mode_destroy(CELTMode *mode)
    opus_free((opus_int16*)mode->cache.index);
    opus_free((unsigned char*)mode->cache.bits);
    opus_free((unsigned char*)mode->cache.caps);
-   clt_mdct_clear(&mode->mdct);
+   clt_mdct_clear(&mode->mdct, arch);
 
    opus_free((CELTMode *)mode);
 }
diff --git a/celt/static_modes_float.h b/celt/static_modes_float.h
index 2fadb62..e102a38 100644
--- a/celt/static_modes_float.h
+++ b/celt/static_modes_float.h
@@ -4,6 +4,11 @@
 #include "modes.h"
 #include "rate.h"
 
+#ifdef HAVE_ARM_NE10
+#define OVERRIDE_FFT 1
+#include "static_modes_float_arm_ne10.h"
+#endif
+
 #ifndef DEF_WINDOW120
 #define DEF_WINDOW120
 static const opus_val16 window120[120] = {
@@ -431,6 +436,11 @@ static const kiss_fft_state fft_state48000_960_0 = {
 {5, 96, 3, 32, 4, 8, 2, 4, 4, 1, 0, 0, 0, 0, 0, 0, },   /* factors */
 fft_bitrev480,  /* bitrev */
 fft_twiddles48000_960,  /* bitrev */
+#ifdef OVERRIDE_FFT
+(arch_fft_state *)&cfg_arch_480,
+#else
+NULL,
+#endif
 };
 #endif
 
@@ -443,6 +453,11 @@ static const kiss_fft_state fft_state48000_960_1 = {
 {5, 48, 3, 16, 4, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, },    /* factors */
 fft_bitrev240,  /* bitrev */
 fft_twiddles48000_960,  /* bitrev */
+#ifdef OVERRIDE_FFT
+(arch_fft_state *)&cfg_arch_240,
+#else
+NULL,
+#endif
 };
 #endif
 
@@ -455,6 +470,11 @@ static const kiss_fft_state fft_state48000_960_2 = {
 {5, 24, 3, 8, 2, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, },    /* factors */
 fft_bitrev120,  /* bitrev */
 fft_twiddles48000_960,  /* bitrev */
+#ifdef OVERRIDE_FFT
+(arch_fft_state *)&cfg_arch_120,
+#else
+NULL,
+#endif
 };
 #endif
 
@@ -467,6 +487,11 @@ static const kiss_fft_state fft_state48000_960_3 = {
 {5, 12, 3, 4, 4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, },    /* factors */
 fft_bitrev60,   /* bitrev */
 fft_twiddles48000_960,  /* bitrev */
+#ifdef OVERRIDE_FFT
+(arch_fft_state *)&cfg_arch_60,
+#else
+NULL,
+#endif
 };
 #endif
 
diff --git a/celt/static_modes_float_arm_ne10.h b/celt/static_modes_float_arm_ne10.h
new file mode 100644
index 0000000..5bcec70
--- /dev/null
+++ b/celt/static_modes_float_arm_ne10.h
@@ -0,0 +1,404 @@
+/* The contents of this file was automatically generated by
+ * dump_mode_arm_ne10.c with arguments: 48000 960
+ * It contains static definitions for some pre-defined modes. */
+#include <NE10_init.h>
+
+#ifndef NE10_FFT_PARAMS48000_960
+#define NE10_FFT_PARAMS48000_960
+static const ne10_int32_t ne10_factors_480[64] = {
+4, 40, 4, 30, 2, 15, 5, 3, 3, 1, 1, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, };
+static const ne10_int32_t ne10_factors_240[64] = {
+3, 20, 4, 15, 5, 3, 3, 1, 1, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, };
+static const ne10_int32_t ne10_factors_120[64] = {
+3, 10, 2, 15, 5, 3, 3, 1, 1, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, };
+static const ne10_int32_t ne10_factors_60[64] = {
+2, 5, 5, 3, 3, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+0, 0, 0, 0, };
+static const ne10_fft_cpx_float32_t ne10_twiddles_480[480] = {
+{1.0000000f,0.0000000f}, {1.0000000f,-0.0000000f}, {1.0000000f,-0.0000000f},
+{1.0000000f,-0.0000000f}, {0.91354543f,-0.40673664f},
{0.66913056f,-0.74314487f},
+{1.0000000f,-0.0000000f}, {0.66913056f,-0.74314487f},
{-0.10452851f,-0.99452192f},
+{1.0000000f,-0.0000000f}, {0.30901697f,-0.95105654f},
{-0.80901700f,-0.58778518f},
+{1.0000000f,-0.0000000f}, {-0.10452851f,-0.99452192f},
{-0.97814757f,0.20791179f},
+{1.0000000f,-0.0000000f}, {0.97814763f,-0.20791170f},
{0.91354543f,-0.40673664f},
+{0.80901700f,-0.58778524f}, {0.66913056f,-0.74314487f},
{0.49999997f,-0.86602545f},
+{0.30901697f,-0.95105654f}, {0.10452842f,-0.99452192f},
{-0.10452851f,-0.99452192f},
+{-0.30901703f,-0.95105648f}, {-0.50000006f,-0.86602533f},
{-0.66913068f,-0.74314475f},
+{-0.80901700f,-0.58778518f}, {-0.91354549f,-0.40673658f},
{-0.97814763f,-0.20791161f},
+{1.0000000f,-0.0000000f}, {0.99862951f,-0.052335959f},
{0.99452192f,-0.10452846f},
+{0.98768836f,-0.15643448f}, {0.97814763f,-0.20791170f},
{0.96592581f,-0.25881904f},
+{0.95105648f,-0.30901700f}, {0.93358040f,-0.35836795f},
{0.91354543f,-0.40673664f},
+{0.89100653f,-0.45399052f}, {0.86602545f,-0.50000000f},
{0.83867055f,-0.54463905f},
+{0.80901700f,-0.58778524f}, {0.77714598f,-0.62932038f},
{0.74314475f,-0.66913062f},
+{0.70710677f,-0.70710683f}, {0.66913056f,-0.74314487f},
{0.62932038f,-0.77714598f},
+{0.58778524f,-0.80901700f}, {0.54463899f,-0.83867055f},
{0.49999997f,-0.86602545f},
+{0.45399052f,-0.89100653f}, {0.40673661f,-0.91354549f},
{0.35836786f,-0.93358046f},
+{0.30901697f,-0.95105654f}, {0.25881907f,-0.96592581f},
{0.20791166f,-0.97814763f},
+{0.15643437f,-0.98768836f}, {0.10452842f,-0.99452192f},
{0.052335974f,-0.99862951f},
+{1.0000000f,-0.0000000f}, {0.99452192f,-0.10452846f},
{0.97814763f,-0.20791170f},
+{0.95105648f,-0.30901700f}, {0.91354543f,-0.40673664f},
{0.86602545f,-0.50000000f},
+{0.80901700f,-0.58778524f}, {0.74314475f,-0.66913062f},
{0.66913056f,-0.74314487f},
+{0.58778524f,-0.80901700f}, {0.49999997f,-0.86602545f},
{0.40673661f,-0.91354549f},
+{0.30901697f,-0.95105654f}, {0.20791166f,-0.97814763f},
{0.10452842f,-0.99452192f},
+{-4.3711388e-08f,-1.0000000f}, {-0.10452851f,-0.99452192f},
{-0.20791174f,-0.97814757f},
+{-0.30901703f,-0.95105648f}, {-0.40673670f,-0.91354543f},
{-0.50000006f,-0.86602533f},
+{-0.58778518f,-0.80901700f}, {-0.66913068f,-0.74314475f},
{-0.74314493f,-0.66913044f},
+{-0.80901700f,-0.58778518f}, {-0.86602539f,-0.50000006f},
{-0.91354549f,-0.40673658f},
+{-0.95105654f,-0.30901679f}, {-0.97814763f,-0.20791161f},
{-0.99452192f,-0.10452849f},
+{1.0000000f,-0.0000000f}, {0.98768836f,-0.15643448f},
{0.95105648f,-0.30901700f},
+{0.89100653f,-0.45399052f}, {0.80901700f,-0.58778524f},
{0.70710677f,-0.70710683f},
+{0.58778524f,-0.80901700f}, {0.45399052f,-0.89100653f},
{0.30901697f,-0.95105654f},
+{0.15643437f,-0.98768836f}, {-4.3711388e-08f,-1.0000000f},
{-0.15643445f,-0.98768836f},
+{-0.30901703f,-0.95105648f}, {-0.45399061f,-0.89100647f},
{-0.58778518f,-0.80901700f},
+{-0.70710677f,-0.70710677f}, {-0.80901700f,-0.58778518f},
{-0.89100659f,-0.45399037f},
+{-0.95105654f,-0.30901679f}, {-0.98768836f,-0.15643445f},
{-1.0000000f,8.7422777e-08f},
+{-0.98768830f,0.15643461f}, {-0.95105654f,0.30901697f},
{-0.89100653f,0.45399055f},
+{-0.80901694f,0.58778536f}, {-0.70710665f,0.70710689f},
{-0.58778507f,0.80901712f},
+{-0.45399022f,0.89100665f}, {-0.30901709f,0.95105648f},
{-0.15643452f,0.98768830f},
+{1.0000000f,-0.0000000f}, {0.99991435f,-0.013089596f},
{0.99965733f,-0.026176950f},
+{0.99922901f,-0.039259817f}, {0.99862951f,-0.052335959f},
{0.99785894f,-0.065403134f},
+{0.99691731f,-0.078459099f}, {0.99580491f,-0.091501623f},
{0.99452192f,-0.10452846f},
+{0.99306846f,-0.11753740f}, {0.99144489f,-0.13052620f},
{0.98965138f,-0.14349262f},
+{0.98768836f,-0.15643448f}, {0.98555607f,-0.16934951f},
{0.98325491f,-0.18223552f},
+{0.98078525f,-0.19509032f}, {0.97814763f,-0.20791170f},
{0.97534233f,-0.22069745f},
+{0.97236991f,-0.23344538f}, {0.96923089f,-0.24615330f},
{0.96592581f,-0.25881904f},
+{0.96245521f,-0.27144045f}, {0.95881975f,-0.28401536f},
{0.95501995f,-0.29654160f},
+{0.95105648f,-0.30901700f}, {0.94693011f,-0.32143945f},
{0.94264150f,-0.33380687f},
+{0.93819129f,-0.34611708f}, {0.93358040f,-0.35836795f},
{0.92880952f,-0.37055743f},
+{0.92387956f,-0.38268346f}, {0.91879117f,-0.39474389f},
{0.91354543f,-0.40673664f},
+{0.90814316f,-0.41865975f}, {0.90258527f,-0.43051112f},
{0.89687270f,-0.44228873f},
+{0.89100653f,-0.45399052f}, {0.88498765f,-0.46561453f},
{0.87881708f,-0.47715878f},
+{0.87249601f,-0.48862126f}, {0.86602545f,-0.50000000f},
{0.85940641f,-0.51129311f},
+{0.85264015f,-0.52249855f}, {0.84572786f,-0.53361452f},
{0.83867055f,-0.54463905f},
+{0.83146960f,-0.55557024f}, {0.82412618f,-0.56640625f},
{0.81664151f,-0.57714522f},
+{0.80901700f,-0.58778524f}, {0.80125380f,-0.59832460f},
{0.79335332f,-0.60876143f},
+{0.78531694f,-0.61909395f}, {0.77714598f,-0.62932038f},
{0.76884180f,-0.63943899f},
+{0.76040596f,-0.64944810f}, {0.75183982f,-0.65934587f},
{0.74314475f,-0.66913062f},
+{0.73432249f,-0.67880076f}, {0.72537434f,-0.68835455f},
{0.71630192f,-0.69779050f},
+{0.70710677f,-0.70710683f}, {0.69779044f,-0.71630198f},
{0.68835455f,-0.72537440f},
+{0.67880070f,-0.73432255f}, {0.66913056f,-0.74314487f},
{0.65934581f,-0.75183982f},
+{0.64944804f,-0.76040596f}, {0.63943899f,-0.76884186f},
{0.62932038f,-0.77714598f},
+{0.61909395f,-0.78531694f}, {0.60876137f,-0.79335338f},
{0.59832460f,-0.80125386f},
+{0.58778524f,-0.80901700f}, {0.57714516f,-0.81664151f},
{0.56640625f,-0.82412618f},
+{0.55557019f,-0.83146960f}, {0.54463899f,-0.83867055f},
{0.53361452f,-0.84572786f},
+{0.52249849f,-0.85264015f}, {0.51129311f,-0.85940641f},
{0.49999997f,-0.86602545f},
+{0.48862118f,-0.87249601f}, {0.47715876f,-0.87881708f},
{0.46561447f,-0.88498765f},
+{0.45399052f,-0.89100653f}, {0.44228867f,-0.89687276f},
{0.43051103f,-0.90258533f},
+{0.41865975f,-0.90814316f}, {0.40673661f,-0.91354549f},
{0.39474380f,-0.91879129f},
+{0.38268343f,-0.92387956f}, {0.37055740f,-0.92880958f},
{0.35836786f,-0.93358046f},
+{0.34611705f,-0.93819135f}, {0.33380681f,-0.94264150f},
{0.32143947f,-0.94693011f},
+{0.30901697f,-0.95105654f}, {0.29654151f,-0.95501995f},
{0.28401533f,-0.95881975f},
+{0.27144039f,-0.96245527f}, {0.25881907f,-0.96592581f},
{0.24615327f,-0.96923089f},
+{0.23344530f,-0.97236991f}, {0.22069745f,-0.97534233f},
{0.20791166f,-0.97814763f},
+{0.19509023f,-0.98078531f}, {0.18223552f,-0.98325491f},
{0.16934945f,-0.98555607f},
+{0.15643437f,-0.98768836f}, {0.14349259f,-0.98965138f},
{0.13052613f,-0.99144489f},
+{0.11753740f,-0.99306846f}, {0.10452842f,-0.99452192f},
{0.091501534f,-0.99580491f},
+{0.078459084f,-0.99691731f}, {0.065403074f,-0.99785894f},
{0.052335974f,-0.99862951f},
+{0.039259788f,-0.99922901f}, {0.026176875f,-0.99965733f},
{0.013089597f,-0.99991435f},
+{1.0000000f,-0.0000000f}, {0.99965733f,-0.026176950f},
{0.99862951f,-0.052335959f},
+{0.99691731f,-0.078459099f}, {0.99452192f,-0.10452846f},
{0.99144489f,-0.13052620f},
+{0.98768836f,-0.15643448f}, {0.98325491f,-0.18223552f},
{0.97814763f,-0.20791170f},
+{0.97236991f,-0.23344538f}, {0.96592581f,-0.25881904f},
{0.95881975f,-0.28401536f},
+{0.95105648f,-0.30901700f}, {0.94264150f,-0.33380687f},
{0.93358040f,-0.35836795f},
+{0.92387956f,-0.38268346f}, {0.91354543f,-0.40673664f},
{0.90258527f,-0.43051112f},
+{0.89100653f,-0.45399052f}, {0.87881708f,-0.47715878f},
{0.86602545f,-0.50000000f},
+{0.85264015f,-0.52249855f}, {0.83867055f,-0.54463905f},
{0.82412618f,-0.56640625f},
+{0.80901700f,-0.58778524f}, {0.79335332f,-0.60876143f},
{0.77714598f,-0.62932038f},
+{0.76040596f,-0.64944810f}, {0.74314475f,-0.66913062f},
{0.72537434f,-0.68835455f},
+{0.70710677f,-0.70710683f}, {0.68835455f,-0.72537440f},
{0.66913056f,-0.74314487f},
+{0.64944804f,-0.76040596f}, {0.62932038f,-0.77714598f},
{0.60876137f,-0.79335338f},
+{0.58778524f,-0.80901700f}, {0.56640625f,-0.82412618f},
{0.54463899f,-0.83867055f},
+{0.52249849f,-0.85264015f}, {0.49999997f,-0.86602545f},
{0.47715876f,-0.87881708f},
+{0.45399052f,-0.89100653f}, {0.43051103f,-0.90258533f},
{0.40673661f,-0.91354549f},
+{0.38268343f,-0.92387956f}, {0.35836786f,-0.93358046f},
{0.33380681f,-0.94264150f},
+{0.30901697f,-0.95105654f}, {0.28401533f,-0.95881975f},
{0.25881907f,-0.96592581f},
+{0.23344530f,-0.97236991f}, {0.20791166f,-0.97814763f},
{0.18223552f,-0.98325491f},
+{0.15643437f,-0.98768836f}, {0.13052613f,-0.99144489f},
{0.10452842f,-0.99452192f},
+{0.078459084f,-0.99691731f}, {0.052335974f,-0.99862951f},
{0.026176875f,-0.99965733f},
+{-4.3711388e-08f,-1.0000000f}, {-0.026176963f,-0.99965733f},
{-0.052336060f,-0.99862951f},
+{-0.078459173f,-0.99691731f}, {-0.10452851f,-0.99452192f},
{-0.13052621f,-0.99144489f},
+{-0.15643445f,-0.98768836f}, {-0.18223560f,-0.98325491f},
{-0.20791174f,-0.97814757f},
+{-0.23344538f,-0.97236991f}, {-0.25881916f,-0.96592581f},
{-0.28401542f,-0.95881969f},
+{-0.30901703f,-0.95105648f}, {-0.33380687f,-0.94264150f},
{-0.35836795f,-0.93358040f},
+{-0.38268352f,-0.92387950f}, {-0.40673670f,-0.91354543f},
{-0.43051112f,-0.90258527f},
+{-0.45399061f,-0.89100647f}, {-0.47715873f,-0.87881708f},
{-0.50000006f,-0.86602533f},
+{-0.52249867f,-0.85264009f}, {-0.54463905f,-0.83867055f},
{-0.56640631f,-0.82412612f},
+{-0.58778518f,-0.80901700f}, {-0.60876143f,-0.79335332f},
{-0.62932050f,-0.77714586f},
+{-0.64944804f,-0.76040596f}, {-0.66913068f,-0.74314475f},
{-0.68835467f,-0.72537428f},
+{-0.70710677f,-0.70710677f}, {-0.72537446f,-0.68835449f},
{-0.74314493f,-0.66913044f},
+{-0.76040596f,-0.64944804f}, {-0.77714604f,-0.62932026f},
{-0.79335332f,-0.60876143f},
+{-0.80901700f,-0.58778518f}, {-0.82412624f,-0.56640613f},
{-0.83867055f,-0.54463899f},
+{-0.85264021f,-0.52249849f}, {-0.86602539f,-0.50000006f},
{-0.87881714f,-0.47715873f},
+{-0.89100659f,-0.45399037f}, {-0.90258527f,-0.43051112f},
{-0.91354549f,-0.40673658f},
+{-0.92387956f,-0.38268328f}, {-0.93358040f,-0.35836792f},
{-0.94264150f,-0.33380675f},
+{-0.95105654f,-0.30901679f}, {-0.95881975f,-0.28401530f},
{-0.96592587f,-0.25881892f},
+{-0.97236991f,-0.23344538f}, {-0.97814763f,-0.20791161f},
{-0.98325491f,-0.18223536f},
+{-0.98768836f,-0.15643445f}, {-0.99144489f,-0.13052608f},
{-0.99452192f,-0.10452849f},
+{-0.99691737f,-0.078459039f}, {-0.99862957f,-0.052335810f},
{-0.99965733f,-0.026176952f},
+{1.0000000f,-0.0000000f}, {0.99922901f,-0.039259817f},
{0.99691731f,-0.078459099f},
+{0.99306846f,-0.11753740f}, {0.98768836f,-0.15643448f},
{0.98078525f,-0.19509032f},
+{0.97236991f,-0.23344538f}, {0.96245521f,-0.27144045f},
{0.95105648f,-0.30901700f},
+{0.93819129f,-0.34611708f}, {0.92387956f,-0.38268346f},
{0.90814316f,-0.41865975f},
+{0.89100653f,-0.45399052f}, {0.87249601f,-0.48862126f},
{0.85264015f,-0.52249855f},
+{0.83146960f,-0.55557024f}, {0.80901700f,-0.58778524f},
{0.78531694f,-0.61909395f},
+{0.76040596f,-0.64944810f}, {0.73432249f,-0.67880076f},
{0.70710677f,-0.70710683f},
+{0.67880070f,-0.73432255f}, {0.64944804f,-0.76040596f},
{0.61909395f,-0.78531694f},
+{0.58778524f,-0.80901700f}, {0.55557019f,-0.83146960f},
{0.52249849f,-0.85264015f},
+{0.48862118f,-0.87249601f}, {0.45399052f,-0.89100653f},
{0.41865975f,-0.90814316f},
+{0.38268343f,-0.92387956f}, {0.34611705f,-0.93819135f},
{0.30901697f,-0.95105654f},
+{0.27144039f,-0.96245527f}, {0.23344530f,-0.97236991f},
{0.19509023f,-0.98078531f},
+{0.15643437f,-0.98768836f}, {0.11753740f,-0.99306846f},
{0.078459084f,-0.99691731f},
+{0.039259788f,-0.99922901f}, {-4.3711388e-08f,-1.0000000f},
{-0.039259877f,-0.99922901f},
+{-0.078459173f,-0.99691731f}, {-0.11753749f,-0.99306846f},
{-0.15643445f,-0.98768836f},
+{-0.19509032f,-0.98078525f}, {-0.23344538f,-0.97236991f},
{-0.27144048f,-0.96245521f},
+{-0.30901703f,-0.95105648f}, {-0.34611711f,-0.93819129f},
{-0.38268352f,-0.92387950f},
+{-0.41865984f,-0.90814310f}, {-0.45399061f,-0.89100647f},
{-0.48862135f,-0.87249595f},
+{-0.52249867f,-0.85264009f}, {-0.55557036f,-0.83146954f},
{-0.58778518f,-0.80901700f},
+{-0.61909389f,-0.78531694f}, {-0.64944804f,-0.76040596f},
{-0.67880076f,-0.73432249f},
+{-0.70710677f,-0.70710677f}, {-0.73432249f,-0.67880070f},
{-0.76040596f,-0.64944804f},
+{-0.78531694f,-0.61909389f}, {-0.80901700f,-0.58778518f},
{-0.83146966f,-0.55557019f},
+{-0.85264021f,-0.52249849f}, {-0.87249607f,-0.48862115f},
{-0.89100659f,-0.45399037f},
+{-0.90814322f,-0.41865960f}, {-0.92387956f,-0.38268328f},
{-0.93819135f,-0.34611690f},
+{-0.95105654f,-0.30901679f}, {-0.96245521f,-0.27144048f},
{-0.97236991f,-0.23344538f},
+{-0.98078531f,-0.19509031f}, {-0.98768836f,-0.15643445f},
{-0.99306846f,-0.11753736f},
+{-0.99691737f,-0.078459039f}, {-0.99922901f,-0.039259743f},
{-1.0000000f,8.7422777e-08f},
+{-0.99922901f,0.039259918f}, {-0.99691731f,0.078459218f},
{-0.99306846f,0.11753753f},
+{-0.98768830f,0.15643461f}, {-0.98078525f,0.19509049f},
{-0.97236985f,0.23344554f},
+{-0.96245515f,0.27144065f}, {-0.95105654f,0.30901697f},
{-0.93819135f,0.34611705f},
+{-0.92387956f,0.38268346f}, {-0.90814316f,0.41865975f},
{-0.89100653f,0.45399055f},
+{-0.87249601f,0.48862129f}, {-0.85264015f,0.52249861f},
{-0.83146960f,0.55557030f},
+{-0.80901694f,0.58778536f}, {-0.78531688f,0.61909401f},
{-0.76040590f,0.64944816f},
+{-0.73432243f,0.67880082f}, {-0.70710665f,0.70710689f},
{-0.67880058f,0.73432261f},
+{-0.64944792f,0.76040608f}, {-0.61909378f,0.78531706f},
{-0.58778507f,0.80901712f},
+{-0.55557001f,0.83146977f}, {-0.52249837f,0.85264033f},
{-0.48862100f,0.87249613f},
+{-0.45399022f,0.89100665f}, {-0.41865945f,0.90814328f},
{-0.38268313f,0.92387968f},
+{-0.34611672f,0.93819147f}, {-0.30901709f,0.95105648f},
{-0.27144054f,0.96245521f},
+{-0.23344545f,0.97236991f}, {-0.19509038f,0.98078525f},
{-0.15643452f,0.98768830f},
+{-0.11753743f,0.99306846f}, {-0.078459114f,0.99691731f},
{-0.039259821f,0.99922901f},
+};
+static const ne10_fft_cpx_float32_t ne10_twiddles_240[240] = {
+{1.0000000f,0.0000000f}, {1.0000000f,-0.0000000f}, {1.0000000f,-0.0000000f},
+{1.0000000f,-0.0000000f}, {0.91354543f,-0.40673664f},
{0.66913056f,-0.74314487f},
+{1.0000000f,-0.0000000f}, {0.66913056f,-0.74314487f},
{-0.10452851f,-0.99452192f},
+{1.0000000f,-0.0000000f}, {0.30901697f,-0.95105654f},
{-0.80901700f,-0.58778518f},
+{1.0000000f,-0.0000000f}, {-0.10452851f,-0.99452192f},
{-0.97814757f,0.20791179f},
+{1.0000000f,-0.0000000f}, {0.99452192f,-0.10452846f},
{0.97814763f,-0.20791170f},
+{0.95105648f,-0.30901700f}, {0.91354543f,-0.40673664f},
{0.86602545f,-0.50000000f},
+{0.80901700f,-0.58778524f}, {0.74314475f,-0.66913062f},
{0.66913056f,-0.74314487f},
+{0.58778524f,-0.80901700f}, {0.49999997f,-0.86602545f},
{0.40673661f,-0.91354549f},
+{0.30901697f,-0.95105654f}, {0.20791166f,-0.97814763f},
{0.10452842f,-0.99452192f},
+{1.0000000f,-0.0000000f}, {0.97814763f,-0.20791170f},
{0.91354543f,-0.40673664f},
+{0.80901700f,-0.58778524f}, {0.66913056f,-0.74314487f},
{0.49999997f,-0.86602545f},
+{0.30901697f,-0.95105654f}, {0.10452842f,-0.99452192f},
{-0.10452851f,-0.99452192f},
+{-0.30901703f,-0.95105648f}, {-0.50000006f,-0.86602533f},
{-0.66913068f,-0.74314475f},
+{-0.80901700f,-0.58778518f}, {-0.91354549f,-0.40673658f},
{-0.97814763f,-0.20791161f},
+{1.0000000f,-0.0000000f}, {0.95105648f,-0.30901700f},
{0.80901700f,-0.58778524f},
+{0.58778524f,-0.80901700f}, {0.30901697f,-0.95105654f},
{-4.3711388e-08f,-1.0000000f},
+{-0.30901703f,-0.95105648f}, {-0.58778518f,-0.80901700f},
{-0.80901700f,-0.58778518f},
+{-0.95105654f,-0.30901679f}, {-1.0000000f,8.7422777e-08f},
{-0.95105654f,0.30901697f},
+{-0.80901694f,0.58778536f}, {-0.58778507f,0.80901712f},
{-0.30901709f,0.95105648f},
+{1.0000000f,-0.0000000f}, {0.99965733f,-0.026176950f},
{0.99862951f,-0.052335959f},
+{0.99691731f,-0.078459099f}, {0.99452192f,-0.10452846f},
{0.99144489f,-0.13052620f},
+{0.98768836f,-0.15643448f}, {0.98325491f,-0.18223552f},
{0.97814763f,-0.20791170f},
+{0.97236991f,-0.23344538f}, {0.96592581f,-0.25881904f},
{0.95881975f,-0.28401536f},
+{0.95105648f,-0.30901700f}, {0.94264150f,-0.33380687f},
{0.93358040f,-0.35836795f},
+{0.92387956f,-0.38268346f}, {0.91354543f,-0.40673664f},
{0.90258527f,-0.43051112f},
+{0.89100653f,-0.45399052f}, {0.87881708f,-0.47715878f},
{0.86602545f,-0.50000000f},
+{0.85264015f,-0.52249855f}, {0.83867055f,-0.54463905f},
{0.82412618f,-0.56640625f},
+{0.80901700f,-0.58778524f}, {0.79335332f,-0.60876143f},
{0.77714598f,-0.62932038f},
+{0.76040596f,-0.64944810f}, {0.74314475f,-0.66913062f},
{0.72537434f,-0.68835455f},
+{0.70710677f,-0.70710683f}, {0.68835455f,-0.72537440f},
{0.66913056f,-0.74314487f},
+{0.64944804f,-0.76040596f}, {0.62932038f,-0.77714598f},
{0.60876137f,-0.79335338f},
+{0.58778524f,-0.80901700f}, {0.56640625f,-0.82412618f},
{0.54463899f,-0.83867055f},
+{0.52249849f,-0.85264015f}, {0.49999997f,-0.86602545f},
{0.47715876f,-0.87881708f},
+{0.45399052f,-0.89100653f}, {0.43051103f,-0.90258533f},
{0.40673661f,-0.91354549f},
+{0.38268343f,-0.92387956f}, {0.35836786f,-0.93358046f},
{0.33380681f,-0.94264150f},
+{0.30901697f,-0.95105654f}, {0.28401533f,-0.95881975f},
{0.25881907f,-0.96592581f},
+{0.23344530f,-0.97236991f}, {0.20791166f,-0.97814763f},
{0.18223552f,-0.98325491f},
+{0.15643437f,-0.98768836f}, {0.13052613f,-0.99144489f},
{0.10452842f,-0.99452192f},
+{0.078459084f,-0.99691731f}, {0.052335974f,-0.99862951f},
{0.026176875f,-0.99965733f},
+{1.0000000f,-0.0000000f}, {0.99862951f,-0.052335959f},
{0.99452192f,-0.10452846f},
+{0.98768836f,-0.15643448f}, {0.97814763f,-0.20791170f},
{0.96592581f,-0.25881904f},
+{0.95105648f,-0.30901700f}, {0.93358040f,-0.35836795f},
{0.91354543f,-0.40673664f},
+{0.89100653f,-0.45399052f}, {0.86602545f,-0.50000000f},
{0.83867055f,-0.54463905f},
+{0.80901700f,-0.58778524f}, {0.77714598f,-0.62932038f},
{0.74314475f,-0.66913062f},
+{0.70710677f,-0.70710683f}, {0.66913056f,-0.74314487f},
{0.62932038f,-0.77714598f},
+{0.58778524f,-0.80901700f}, {0.54463899f,-0.83867055f},
{0.49999997f,-0.86602545f},
+{0.45399052f,-0.89100653f}, {0.40673661f,-0.91354549f},
{0.35836786f,-0.93358046f},
+{0.30901697f,-0.95105654f}, {0.25881907f,-0.96592581f},
{0.20791166f,-0.97814763f},
+{0.15643437f,-0.98768836f}, {0.10452842f,-0.99452192f},
{0.052335974f,-0.99862951f},
+{-4.3711388e-08f,-1.0000000f}, {-0.052336060f,-0.99862951f},
{-0.10452851f,-0.99452192f},
+{-0.15643445f,-0.98768836f}, {-0.20791174f,-0.97814757f},
{-0.25881916f,-0.96592581f},
+{-0.30901703f,-0.95105648f}, {-0.35836795f,-0.93358040f},
{-0.40673670f,-0.91354543f},
+{-0.45399061f,-0.89100647f}, {-0.50000006f,-0.86602533f},
{-0.54463905f,-0.83867055f},
+{-0.58778518f,-0.80901700f}, {-0.62932050f,-0.77714586f},
{-0.66913068f,-0.74314475f},
+{-0.70710677f,-0.70710677f}, {-0.74314493f,-0.66913044f},
{-0.77714604f,-0.62932026f},
+{-0.80901700f,-0.58778518f}, {-0.83867055f,-0.54463899f},
{-0.86602539f,-0.50000006f},
+{-0.89100659f,-0.45399037f}, {-0.91354549f,-0.40673658f},
{-0.93358040f,-0.35836792f},
+{-0.95105654f,-0.30901679f}, {-0.96592587f,-0.25881892f},
{-0.97814763f,-0.20791161f},
+{-0.98768836f,-0.15643445f}, {-0.99452192f,-0.10452849f},
{-0.99862957f,-0.052335810f},
+{1.0000000f,-0.0000000f}, {0.99691731f,-0.078459099f},
{0.98768836f,-0.15643448f},
+{0.97236991f,-0.23344538f}, {0.95105648f,-0.30901700f},
{0.92387956f,-0.38268346f},
+{0.89100653f,-0.45399052f}, {0.85264015f,-0.52249855f},
{0.80901700f,-0.58778524f},
+{0.76040596f,-0.64944810f}, {0.70710677f,-0.70710683f},
{0.64944804f,-0.76040596f},
+{0.58778524f,-0.80901700f}, {0.52249849f,-0.85264015f},
{0.45399052f,-0.89100653f},
+{0.38268343f,-0.92387956f}, {0.30901697f,-0.95105654f},
{0.23344530f,-0.97236991f},
+{0.15643437f,-0.98768836f}, {0.078459084f,-0.99691731f},
{-4.3711388e-08f,-1.0000000f},
+{-0.078459173f,-0.99691731f}, {-0.15643445f,-0.98768836f},
{-0.23344538f,-0.97236991f},
+{-0.30901703f,-0.95105648f}, {-0.38268352f,-0.92387950f},
{-0.45399061f,-0.89100647f},
+{-0.52249867f,-0.85264009f}, {-0.58778518f,-0.80901700f},
{-0.64944804f,-0.76040596f},
+{-0.70710677f,-0.70710677f}, {-0.76040596f,-0.64944804f},
{-0.80901700f,-0.58778518f},
+{-0.85264021f,-0.52249849f}, {-0.89100659f,-0.45399037f},
{-0.92387956f,-0.38268328f},
+{-0.95105654f,-0.30901679f}, {-0.97236991f,-0.23344538f},
{-0.98768836f,-0.15643445f},
+{-0.99691737f,-0.078459039f}, {-1.0000000f,8.7422777e-08f},
{-0.99691731f,0.078459218f},
+{-0.98768830f,0.15643461f}, {-0.97236985f,0.23344554f},
{-0.95105654f,0.30901697f},
+{-0.92387956f,0.38268346f}, {-0.89100653f,0.45399055f},
{-0.85264015f,0.52249861f},
+{-0.80901694f,0.58778536f}, {-0.76040590f,0.64944816f},
{-0.70710665f,0.70710689f},
+{-0.64944792f,0.76040608f}, {-0.58778507f,0.80901712f},
{-0.52249837f,0.85264033f},
+{-0.45399022f,0.89100665f}, {-0.38268313f,0.92387968f},
{-0.30901709f,0.95105648f},
+{-0.23344545f,0.97236991f}, {-0.15643452f,0.98768830f},
{-0.078459114f,0.99691731f},
+};
+static const ne10_fft_cpx_float32_t ne10_twiddles_120[120] = {
+{1.0000000f,0.0000000f}, {1.0000000f,-0.0000000f}, {1.0000000f,-0.0000000f},
+{1.0000000f,-0.0000000f}, {0.91354543f,-0.40673664f},
{0.66913056f,-0.74314487f},
+{1.0000000f,-0.0000000f}, {0.66913056f,-0.74314487f},
{-0.10452851f,-0.99452192f},
+{1.0000000f,-0.0000000f}, {0.30901697f,-0.95105654f},
{-0.80901700f,-0.58778518f},
+{1.0000000f,-0.0000000f}, {-0.10452851f,-0.99452192f},
{-0.97814757f,0.20791179f},
+{1.0000000f,-0.0000000f}, {0.97814763f,-0.20791170f},
{0.91354543f,-0.40673664f},
+{0.80901700f,-0.58778524f}, {0.66913056f,-0.74314487f},
{0.49999997f,-0.86602545f},
+{0.30901697f,-0.95105654f}, {0.10452842f,-0.99452192f},
{-0.10452851f,-0.99452192f},
+{-0.30901703f,-0.95105648f}, {-0.50000006f,-0.86602533f},
{-0.66913068f,-0.74314475f},
+{-0.80901700f,-0.58778518f}, {-0.91354549f,-0.40673658f},
{-0.97814763f,-0.20791161f},
+{1.0000000f,-0.0000000f}, {0.99862951f,-0.052335959f},
{0.99452192f,-0.10452846f},
+{0.98768836f,-0.15643448f}, {0.97814763f,-0.20791170f},
{0.96592581f,-0.25881904f},
+{0.95105648f,-0.30901700f}, {0.93358040f,-0.35836795f},
{0.91354543f,-0.40673664f},
+{0.89100653f,-0.45399052f}, {0.86602545f,-0.50000000f},
{0.83867055f,-0.54463905f},
+{0.80901700f,-0.58778524f}, {0.77714598f,-0.62932038f},
{0.74314475f,-0.66913062f},
+{0.70710677f,-0.70710683f}, {0.66913056f,-0.74314487f},
{0.62932038f,-0.77714598f},
+{0.58778524f,-0.80901700f}, {0.54463899f,-0.83867055f},
{0.49999997f,-0.86602545f},
+{0.45399052f,-0.89100653f}, {0.40673661f,-0.91354549f},
{0.35836786f,-0.93358046f},
+{0.30901697f,-0.95105654f}, {0.25881907f,-0.96592581f},
{0.20791166f,-0.97814763f},
+{0.15643437f,-0.98768836f}, {0.10452842f,-0.99452192f},
{0.052335974f,-0.99862951f},
+{1.0000000f,-0.0000000f}, {0.99452192f,-0.10452846f},
{0.97814763f,-0.20791170f},
+{0.95105648f,-0.30901700f}, {0.91354543f,-0.40673664f},
{0.86602545f,-0.50000000f},
+{0.80901700f,-0.58778524f}, {0.74314475f,-0.66913062f},
{0.66913056f,-0.74314487f},
+{0.58778524f,-0.80901700f}, {0.49999997f,-0.86602545f},
{0.40673661f,-0.91354549f},
+{0.30901697f,-0.95105654f}, {0.20791166f,-0.97814763f},
{0.10452842f,-0.99452192f},
+{-4.3711388e-08f,-1.0000000f}, {-0.10452851f,-0.99452192f},
{-0.20791174f,-0.97814757f},
+{-0.30901703f,-0.95105648f}, {-0.40673670f,-0.91354543f},
{-0.50000006f,-0.86602533f},
+{-0.58778518f,-0.80901700f}, {-0.66913068f,-0.74314475f},
{-0.74314493f,-0.66913044f},
+{-0.80901700f,-0.58778518f}, {-0.86602539f,-0.50000006f},
{-0.91354549f,-0.40673658f},
+{-0.95105654f,-0.30901679f}, {-0.97814763f,-0.20791161f},
{-0.99452192f,-0.10452849f},
+{1.0000000f,-0.0000000f}, {0.98768836f,-0.15643448f},
{0.95105648f,-0.30901700f},
+{0.89100653f,-0.45399052f}, {0.80901700f,-0.58778524f},
{0.70710677f,-0.70710683f},
+{0.58778524f,-0.80901700f}, {0.45399052f,-0.89100653f},
{0.30901697f,-0.95105654f},
+{0.15643437f,-0.98768836f}, {-4.3711388e-08f,-1.0000000f},
{-0.15643445f,-0.98768836f},
+{-0.30901703f,-0.95105648f}, {-0.45399061f,-0.89100647f},
{-0.58778518f,-0.80901700f},
+{-0.70710677f,-0.70710677f}, {-0.80901700f,-0.58778518f},
{-0.89100659f,-0.45399037f},
+{-0.95105654f,-0.30901679f}, {-0.98768836f,-0.15643445f},
{-1.0000000f,8.7422777e-08f},
+{-0.98768830f,0.15643461f}, {-0.95105654f,0.30901697f},
{-0.89100653f,0.45399055f},
+{-0.80901694f,0.58778536f}, {-0.70710665f,0.70710689f},
{-0.58778507f,0.80901712f},
+{-0.45399022f,0.89100665f}, {-0.30901709f,0.95105648f},
{-0.15643452f,0.98768830f},
+};
+static const ne10_fft_cpx_float32_t ne10_twiddles_60[60] = {
+{1.0000000f,0.0000000f}, {1.0000000f,-0.0000000f}, {1.0000000f,-0.0000000f},
+{1.0000000f,-0.0000000f}, {0.91354543f,-0.40673664f},
{0.66913056f,-0.74314487f},
+{1.0000000f,-0.0000000f}, {0.66913056f,-0.74314487f},
{-0.10452851f,-0.99452192f},
+{1.0000000f,-0.0000000f}, {0.30901697f,-0.95105654f},
{-0.80901700f,-0.58778518f},
+{1.0000000f,-0.0000000f}, {-0.10452851f,-0.99452192f},
{-0.97814757f,0.20791179f},
+{1.0000000f,-0.0000000f}, {0.99452192f,-0.10452846f},
{0.97814763f,-0.20791170f},
+{0.95105648f,-0.30901700f}, {0.91354543f,-0.40673664f},
{0.86602545f,-0.50000000f},
+{0.80901700f,-0.58778524f}, {0.74314475f,-0.66913062f},
{0.66913056f,-0.74314487f},
+{0.58778524f,-0.80901700f}, {0.49999997f,-0.86602545f},
{0.40673661f,-0.91354549f},
+{0.30901697f,-0.95105654f}, {0.20791166f,-0.97814763f},
{0.10452842f,-0.99452192f},
+{1.0000000f,-0.0000000f}, {0.97814763f,-0.20791170f},
{0.91354543f,-0.40673664f},
+{0.80901700f,-0.58778524f}, {0.66913056f,-0.74314487f},
{0.49999997f,-0.86602545f},
+{0.30901697f,-0.95105654f}, {0.10452842f,-0.99452192f},
{-0.10452851f,-0.99452192f},
+{-0.30901703f,-0.95105648f}, {-0.50000006f,-0.86602533f},
{-0.66913068f,-0.74314475f},
+{-0.80901700f,-0.58778518f}, {-0.91354549f,-0.40673658f},
{-0.97814763f,-0.20791161f},
+{1.0000000f,-0.0000000f}, {0.95105648f,-0.30901700f},
{0.80901700f,-0.58778524f},
+{0.58778524f,-0.80901700f}, {0.30901697f,-0.95105654f},
{-4.3711388e-08f,-1.0000000f},
+{-0.30901703f,-0.95105648f}, {-0.58778518f,-0.80901700f},
{-0.80901700f,-0.58778518f},
+{-0.95105654f,-0.30901679f}, {-1.0000000f,8.7422777e-08f},
{-0.95105654f,0.30901697f},
+{-0.80901694f,0.58778536f}, {-0.58778507f,0.80901712f},
{-0.30901709f,0.95105648f},
+};
+static const ne10_fft_state_float32_t ne10_fft_state_float32_480 = {
+120,
+(ne10_int32_t *)ne10_factors_480,
+(ne10_fft_cpx_float32_t *)ne10_twiddles_480,
+NULL,
+(ne10_fft_cpx_float32_t *)&ne10_twiddles_480[120],
+/* is_forward_scaled = true */
+(ne10_int32_t) 1,
+/* is_backward_scaled = false */
+(ne10_int32_t) 0,
+};
+static const arch_fft_state cfg_arch_480 = {
+1,
+(void *)&ne10_fft_state_float32_480,
+};
+
+static const ne10_fft_state_float32_t ne10_fft_state_float32_240 = {
+60,
+(ne10_int32_t *)ne10_factors_240,
+(ne10_fft_cpx_float32_t *)ne10_twiddles_240,
+NULL,
+(ne10_fft_cpx_float32_t *)&ne10_twiddles_240[60],
+/* is_forward_scaled = true */
+(ne10_int32_t) 1,
+/* is_backward_scaled = false */
+(ne10_int32_t) 0,
+};
+static const arch_fft_state cfg_arch_240 = {
+1,
+(void *)&ne10_fft_state_float32_240,
+};
+
+static const ne10_fft_state_float32_t ne10_fft_state_float32_120 = {
+30,
+(ne10_int32_t *)ne10_factors_120,
+(ne10_fft_cpx_float32_t *)ne10_twiddles_120,
+NULL,
+(ne10_fft_cpx_float32_t *)&ne10_twiddles_120[30],
+/* is_forward_scaled = true */
+(ne10_int32_t) 1,
+/* is_backward_scaled = false */
+(ne10_int32_t) 0,
+};
+static const arch_fft_state cfg_arch_120 = {
+1,
+(void *)&ne10_fft_state_float32_120,
+};
+
+static const ne10_fft_state_float32_t ne10_fft_state_float32_60 = {
+15,
+(ne10_int32_t *)ne10_factors_60,
+(ne10_fft_cpx_float32_t *)ne10_twiddles_60,
+NULL,
+(ne10_fft_cpx_float32_t *)&ne10_twiddles_60[15],
+/* is_forward_scaled = true */
+(ne10_int32_t) 1,
+/* is_backward_scaled = false */
+(ne10_int32_t) 0,
+};
+static const arch_fft_state cfg_arch_60 = {
+1,
+(void *)&ne10_fft_state_float32_60,
+};
+
+#endif  /* end NE10_FFT_PARAMS48000_960 */
diff --git a/celt/tests/test_unit_dft.c b/celt/tests/test_unit_dft.c
index 57db0e3..991ece3 100644
--- a/celt/tests/test_unit_dft.c
+++ b/celt/tests/test_unit_dft.c
@@ -45,6 +45,23 @@
 #include "mathops.c"
 #include "entcode.c"
 
+#if defined(OPUS_HAVE_RTCD) && \
+         (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR))
+#include "arm/armcpu.c"
+#if !defined(FIXED_POINT)
+#if defined(HAVE_ARM_NE10)
+#include "mdct.c"
+#include "arm/celt_ne10_fft.c"
+#include "arm/celt_ne10_mdct.c"
+#endif
+#include "celt_lpc.c"
+#include "pitch.c"
+#include "arm/celt_neon_intr.c"
+#include "arm/arm_celt_map.c"
+#endif
+#elif defined(OPUS_X86_MAY_HAVE_SSE2) || defined(OPUS_X86_MAY_HAVE_SSE4_1)
+#include "x86/x86cpu.c"
+#endif
 
 #ifndef M_PI
 #define M_PI 3.141592653
@@ -93,13 +110,13 @@ void check(kiss_fft_cpx  * in,kiss_fft_cpx  * out,int nfft,int isinverse)
     }
 }
 
-void test1d(int nfft,int isinverse)
+void test1d(int nfft,int isinverse,int arch)
 {
     size_t buflen = sizeof(kiss_fft_cpx)*nfft;
 
     kiss_fft_cpx  * in = (kiss_fft_cpx*)malloc(buflen);
     kiss_fft_cpx  * out= (kiss_fft_cpx*)malloc(buflen);
-    kiss_fft_state *cfg = opus_fft_alloc(nfft,0,0);
+    kiss_fft_state *cfg = opus_fft_alloc(nfft,0,0,arch);
     int k;
 
     for (k=0;k<nfft;++k) {
@@ -125,7 +142,7 @@ void test1d(int nfft,int isinverse)
     if (isinverse)
        opus_ifft(cfg,in,out);
     else
-       opus_fft(cfg,in,out);
+       opus_fft(cfg,in,out, arch);
 
     /*for (k=0;k<nfft;++k) printf("%d %d ", out[k].r, out[k].i);printf("\n");*/
 
@@ -139,26 +156,28 @@ void test1d(int nfft,int isinverse)
 int main(int argc,char ** argv)
 {
     ALLOC_STACK;
+    int arch = opus_select_arch();
+
     if (argc>1) {
         int k;
         for (k=1;k<argc;++k) {
-            test1d(atoi(argv[k]),0);
-            test1d(atoi(argv[k]),1);
+            test1d(atoi(argv[k]),0,arch);
+            test1d(atoi(argv[k]),1,arch);
         }
     }else{
-        test1d(32,0);
-        test1d(32,1);
-        test1d(128,0);
-        test1d(128,1);
-        test1d(256,0);
-        test1d(256,1);
+        test1d(32,0,arch);
+        test1d(32,1,arch);
+        test1d(128,0,arch);
+        test1d(128,1,arch);
+        test1d(256,0,arch);
+        test1d(256,1,arch);
 #ifndef RADIX_TWO_ONLY
-        test1d(36,0);
-        test1d(36,1);
-        test1d(50,0);
-        test1d(50,1);
-        test1d(120,0);
-        test1d(120,1);
+        test1d(36,0,arch);
+        test1d(36,1,arch);
+        test1d(50,0,arch);
+        test1d(50,1,arch);
+        test1d(120,0,arch);
+        test1d(120,1,arch);
 #endif
     }
     return ret;
diff --git a/celt/tests/test_unit_mathops.c b/celt/tests/test_unit_mathops.c
index b9b1bcf..5d2e8e4 100644
--- a/celt/tests/test_unit_mathops.c
+++ b/celt/tests/test_unit_mathops.c
@@ -60,6 +60,12 @@
        || defined(OPUS_ARM_NEON_INTR))
 #if defined(OPUS_ARM_NEON_INTR)
 #include "arm/celt_neon_intr.c"
+#if defined(HAVE_ARM_NE10)
+#include "kiss_fft.c"
+#include "mdct.c"
+#include "arm/celt_ne10_fft.c"
+#include "arm/celt_ne10_mdct.c"
+#endif
 #endif
 #include "arm/arm_celt_map.c"
 #endif
diff --git a/celt/tests/test_unit_mdct.c b/celt/tests/test_unit_mdct.c
index ac8957f..a1c92d0 100644
--- a/celt/tests/test_unit_mdct.c
+++ b/celt/tests/test_unit_mdct.c
@@ -46,6 +46,24 @@
 #include "mathops.c"
 #include "entcode.c"
 
+#if defined(OPUS_HAVE_RTCD) && \
+         (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR))
+#include "arm/armcpu.c"
+#if !defined(FIXED_POINT)
+#if defined(HAVE_ARM_NE10)
+#include "arm/celt_ne10_fft.c"
+#include "arm/celt_ne10_mdct.c"
+#endif
+#include "pitch.c"
+#include "celt_lpc.c"
+#include "arm/celt_neon_intr.c"
+#include "arm/arm_celt_map.c"
+#endif
+
+#elif defined(OPUS_X86_MAY_HAVE_SSE2) || defined(OPUS_X86_MAY_HAVE_SSE4_1)
+#include "x86/x86cpu.c"
+#endif
+
 #ifndef M_PI
 #define M_PI 3.141592653
 #endif
@@ -112,7 +130,7 @@ void check_inv(kiss_fft_scalar  * in,kiss_fft_scalar  * out,int nfft,int isinver
 }
 
 
-void test1d(int nfft,int isinverse)
+void test1d(int nfft,int isinverse,int arch)
 {
     mdct_lookup cfg;
     size_t buflen = sizeof(kiss_fft_scalar)*nfft;
@@ -123,7 +141,7 @@ void test1d(int nfft,int isinverse)
     opus_val16  * window= (opus_val16*)malloc(sizeof(opus_val16)*nfft/2);
     int k;
 
-    clt_mdct_init(&cfg, nfft, 0);
+    clt_mdct_init(&cfg, nfft, 0, arch);
     for (k=0;k<nfft;++k) {
         in[k] = (rand() % 32768) - 16384;
     }
@@ -156,7 +174,7 @@ void test1d(int nfft,int isinverse)
           out[nfft-k-1] = out[nfft/2+k];
        check_inv(in,out,nfft,isinverse);
     } else {
-       clt_mdct_forward(&cfg,in,out,window, nfft/2, 0, 1);
+       clt_mdct_forward(&cfg,in,out,window, nfft/2, 0, 1, arch);
        check(in_copy,out,nfft,isinverse);
     }
     /*for (k=0;k<nfft;++k) printf("%d %d ", out[k].r, out[k].i);printf("\n");*/
@@ -164,46 +182,48 @@ void test1d(int nfft,int isinverse)
 
     free(in);
     free(out);
-    clt_mdct_clear(&cfg);
+    clt_mdct_clear(&cfg, arch);
 }
 
 int main(int argc,char ** argv)
 {
     ALLOC_STACK;
+    int arch = opus_select_arch();
+
     if (argc>1) {
         int k;
         for (k=1;k<argc;++k) {
-            test1d(atoi(argv[k]),0);
-            test1d(atoi(argv[k]),1);
+            test1d(atoi(argv[k]),0,arch);
+            test1d(atoi(argv[k]),1,arch);
         }
     }else{
-        test1d(32,0);
-        test1d(32,1);
-        test1d(256,0);
-        test1d(256,1);
-        test1d(512,0);
-        test1d(512,1);
-        test1d(1024,0);
-        test1d(1024,1);
-        test1d(2048,0);
-        test1d(2048,1);
+        test1d(32,0,arch);
+        test1d(32,1,arch);
+        test1d(256,0,arch);
+        test1d(256,1,arch);
+        test1d(512,0,arch);
+        test1d(512,1,arch);
+        test1d(1024,0,arch);
+        test1d(1024,1,arch);
+        test1d(2048,0,arch);
+        test1d(2048,1,arch);
 #ifndef RADIX_TWO_ONLY
-        test1d(36,0);
-        test1d(36,1);
-        test1d(40,0);
-        test1d(40,1);
-        test1d(60,0);
-        test1d(60,1);
-        test1d(120,0);
-        test1d(120,1);
-        test1d(240,0);
-        test1d(240,1);
-        test1d(480,0);
-        test1d(480,1);
-        test1d(960,0);
-        test1d(960,1);
-        test1d(1920,0);
-        test1d(1920,1);
+        test1d(36,0,arch);
+        test1d(36,1,arch);
+        test1d(40,0,arch);
+        test1d(40,1,arch);
+        test1d(60,0,arch);
+        test1d(60,1,arch);
+        test1d(120,0,arch);
+        test1d(120,1,arch);
+        test1d(240,0,arch);
+        test1d(240,1,arch);
+        test1d(480,0,arch);
+        test1d(480,1,arch);
+        test1d(960,0,arch);
+        test1d(960,1,arch);
+        test1d(1920,0,arch);
+        test1d(1920,1,arch);
 #endif
     }
     return ret;
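
The test changes above thread an `arch` value, obtained once from `opus_select_arch()`, through every call that may have an optimized variant. The sketch below illustrates the general idea of table-based run-time dispatch on such an index. It is not the code in `arm_celt_map.c`; the enum values, `sum_fn`, `SUM_IMPL`, and the three functions are hypothetical names chosen for the example.

```c
/* Hypothetical arch indices; a real build derives these from CPU detection. */
enum { ARCH_GENERIC = 0, ARCH_NEON = 1, ARCH_COUNT = 2 };

typedef int (*sum_fn)(const int *x, int n);

/* Portable reference implementation. */
static int sum_c(const int *x, int n)
{
    int i, acc = 0;
    for (i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Stand-in for an optimized variant: four elements per iteration. */
static int sum_unrolled(const int *x, int n)
{
    int i, acc = 0;
    for (i = 0; i + 4 <= n; i += 4)
        acc += x[i] + x[i + 1] + x[i + 2] + x[i + 3];
    for (; i < n; ++i)
        acc += x[i];
    return acc;
}

/* One entry per arch index; every slot must be populated. */
static const sum_fn SUM_IMPL[ARCH_COUNT] = { sum_c, sum_unrolled };

/* Picks the implementation for arch, falling back to the generic one. */
static int sum_dispatch(const int *x, int n, int arch)
{
    if (arch < 0 || arch >= ARCH_COUNT)
        arch = ARCH_GENERIC;
    return SUM_IMPL[arch](x, n);
}
```

Passing `arch` explicitly, as the patched tests do, keeps the selection out of global state, so the same binary can exercise both the generic and the optimized path.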
diff --git a/celt/tests/test_unit_rotation.c b/celt/tests/test_unit_rotation.c
index 5507884..fb18df0 100644
--- a/celt/tests/test_unit_rotation.c
+++ b/celt/tests/test_unit_rotation.c
@@ -59,6 +59,12 @@
 #if defined(OPUS_ARM_NEON_INTR)
 #include "arm/celt_neon_intr.c"
 #endif
+#if defined(HAVE_ARM_NE10)
+#include "kiss_fft.c"
+#include "mdct.c"
+#include "arm/celt_ne10_fft.c"
+#include "arm/celt_ne10_mdct.c"
+#endif
 #include "arm/arm_celt_map.c"
 #endif
 
diff --git a/celt_headers.mk b/celt_headers.mk
index d422e09..5dc9e1e 100644
--- a/celt_headers.mk
+++ b/celt_headers.mk
@@ -31,12 +31,15 @@ celt/stack_alloc.h \
 celt/vq.h \
 celt/static_modes_float.h \
 celt/static_modes_fixed.h \
+celt/static_modes_float_arm_ne10.h \
 celt/arm/armcpu.h \
 celt/arm/fixed_armv4.h \
 celt/arm/fixed_armv5e.h \
 celt/arm/kiss_fft_armv4.h \
 celt/arm/kiss_fft_armv5e.h \
 celt/arm/pitch_arm.h \
+celt/arm/fft_arm.h \
+celt/arm/mdct_arm.h \
 celt/mips/celt_mipsr1.h \
 celt/mips/fixed_generic_mipsr1.h \
 celt/mips/kiss_fft_mipsr1.h \
diff --git a/celt_sources.mk b/celt_sources.mk
index 29ec937..7121301 100644
--- a/celt_sources.mk
+++ b/celt_sources.mk
@@ -35,3 +35,7 @@ celt/arm/armopts.s.in
 
 CELT_SOURCES_ARM_NEON_INTR = \
 celt/arm/celt_neon_intr.c
+
+CELT_SOURCES_ARM_NE10= \
+celt/arm/celt_ne10_fft.c \
+celt/arm/celt_ne10_mdct.c
diff --git a/configure.ac b/configure.ac
index 87cece9..baa3425 100644
--- a/configure.ac
+++ b/configure.ac
@@ -351,6 +351,80 @@ AM_CONDITIONAL([OPUS_ARM_EXTERNAL_ASM],
 AM_CONDITIONAL([HAVE_SSE4_1], [false])
 AM_CONDITIONAL([HAVE_SSE2], [false])
 
+AC_DEFUN([OPUS_PATH_NE10],
+   [
+      AC_ARG_WITH(NE10,
+                  AC_HELP_STRING([--with-NE10=PFX],[Prefix where libNE10 is installed (optional)]),
+                  NE10_prefix="$withval", NE10_prefix="")
+      AC_ARG_WITH(NE10-libraries,
+                  AC_HELP_STRING([--with-NE10-libraries=DIR],
+                        [Directory where libNE10 library is installed (optional)]),
+                  NE10_libraries="$withval", NE10_libraries="")
+      AC_ARG_WITH(NE10-includes,
+                  AC_HELP_STRING([--with-NE10-includes=DIR],
+                                 [Directory where libNE10 header files are installed (optional)]),
+                  NE10_includes="$withval", NE10_includes="")
+
+      if test "x$NE10_libraries" != "x" ; then
+         NE10_LIBS="-L$NE10_libraries"
+      elif test "x$NE10_prefix" = "xno" || test "x$NE10_prefix" = "xyes" ; then
+         NE10_LIBS=""
+      elif test "x$NE10_prefix" != "x" ; then
+         NE10_LIBS="-L$NE10_prefix/lib"
+      elif test "x$prefix" != "xNONE" ; then
+         NE10_LIBS="-L$prefix/lib"
+      fi
+
+      if test "x$NE10_prefix" != "xno" ; then
+         NE10_LIBS="$NE10_LIBS -lNE10"
+      fi
+
+      if test "x$NE10_includes" != "x" ; then
+         NE10_CFLAGS="-I$NE10_includes"
+      elif test "x$NE10_prefix" = "xno" || test "x$NE10_prefix" = "xyes" ; then
+         NE10_CFLAGS=""
+      elif test "x$NE10_prefix" != "x" ; then
+         NE10_CFLAGS="-I$NE10_prefix/include"
+      elif test "x$prefix" != "xNONE"; then
+         NE10_CFLAGS="-I$prefix/include"
+      fi
+
+      AC_MSG_CHECKING(for NE10)
+      save_CFLAGS="$CFLAGS"; CFLAGS="$NE10_CFLAGS"
+      save_LIBS="$LIBS"; LIBS="$NE10_LIBS"
+      AC_LINK_IFELSE(
+         [
+            AC_LANG_PROGRAM(
+               [[#include <NE10_init.h>
+               ]],
+               [[
+                  ne10_fft_cfg_float32_t cfg;
+                  cfg = ne10_fft_alloc_c2c_float32_neon(480);
+               ]]
+            )
+         ],[
+            HAVE_ARM_NE10=1
+            AC_MSG_RESULT([yes])
+         ],[
+            HAVE_ARM_NE10=0
+            AC_MSG_RESULT([no])
+            NE10_CFLAGS=""
+            NE10_LIBS=""
+         ]
+      )
+      CFLAGS="$save_CFLAGS"; LIBS="$save_LIBS"
+      #Now we know if libNE10 is installed or not
+      AS_IF([test x"$HAVE_ARM_NE10" = x"1"],
+         [
+            AC_DEFINE([HAVE_ARM_NE10], 1, [NE10 library is installed on host. Make sure it is on target!])
+            AC_SUBST(HAVE_ARM_NE10)
+            AC_SUBST(NE10_CFLAGS)
+            AC_SUBST(NE10_LIBS)
+         ],[]
+      )
+   ]
+)
+
 AS_IF([test x"$enable_intrinsics" = x"yes"],[
    case $host_cpu in
    arm*)
@@ -391,6 +465,10 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[
             AC_DEFINE([OPUS_ARM_MAY_HAVE_EDSP], 1, [Define if compiler support EDSP Instructions])
             AC_DEFINE([OPUS_ARM_MAY_HAVE_MEDIA], 1, [Define if compiler support MEDIA Instructions])
             AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON], 1, [Define if compiler support NEON instructions])
+
+            OPUS_PATH_NE10()
+            AS_IF([test x"$NE10_LIBS" != "x"],
+                  [enable_intrinsics="$enable_intrinsics NE10"],[])
          ],
          [
             AC_MSG_WARN([Compiler does not support ARM intrinsics])
@@ -516,6 +594,9 @@ AS_IF([test x"$enable_intrinsics" =
x"yes"],[
 AM_CONDITIONAL([CPU_ARM], [test "$cpu_arm" = "yes"])
 AM_CONDITIONAL([OPUS_ARM_NEON_INTR],
     [test x"$OPUS_ARM_NEON_INTR" = x"1"])
+AM_CONDITIONAL([HAVE_ARM_NE10],
+    [test x"$HAVE_ARM_NE10" = x"1"])
+
 
 AS_IF([test x"$enable_rtcd" = x"yes"],[
     AS_IF([test x"$rtcd_support" != x"no"],[
diff --git a/src/analysis.c b/src/analysis.c
index 2ee8533..e04b282 100644
--- a/src/analysis.c
+++ b/src/analysis.c
@@ -189,7 +189,7 @@ void tonality_get_info(TonalityAnalysisState *tonal,
AnalysisInfo *info_out, int
    info_out->music_prob = psum;
 }
 
-static void tonality_analysis(TonalityAnalysisState *tonal, const CELTMode *celt_mode, const void *x, int len, int offset, int c1, int c2, int C, int lsb_depth, downmix_func downmix)
+static void tonality_analysis(TonalityAnalysisState *tonal, const CELTMode *celt_mode, const void *x, int len, int offset, int c1, int c2, int C, int lsb_depth, downmix_func downmix, int arch)
 {
     int i, b;
     const kiss_fft_state *kfft;
@@ -262,7 +262,7 @@ static void tonality_analysis(TonalityAnalysisState *tonal,
const CELTMode *celt
     remaining = len - (ANALYSIS_BUF_SIZE-tonal->mem_fill);
     downmix(x, &tonal->inmem[240], remaining,
offset+ANALYSIS_BUF_SIZE-tonal->mem_fill, c1, c2, C);
     tonal->mem_fill = 240 + remaining;
-    opus_fft(kfft, in, out);
+    opus_fft(kfft, in, out, arch);
 #ifndef FIXED_POINT
     /* If there's any NaN on the input, the entire output will be NaN, so
we only need to check one value. */
     if (celt_isnan(out[0].r))
@@ -635,7 +635,7 @@ static void tonality_analysis(TonalityAnalysisState *tonal,
const CELTMode *celt
 
 void run_analysis(TonalityAnalysisState *analysis, const CELTMode *celt_mode,
const void *analysis_pcm,
                  int analysis_frame_size, int frame_size, int c1, int c2, int
C, opus_int32 Fs,
-                 int lsb_depth, downmix_func downmix, AnalysisInfo
*analysis_info)
+                 int lsb_depth, downmix_func downmix, AnalysisInfo
*analysis_info, int arch)
 {
    int offset;
    int pcm_len;
@@ -648,7 +648,7 @@ void run_analysis(TonalityAnalysisState *analysis, const
CELTMode *celt_mode, co
       pcm_len = analysis_frame_size - analysis->analysis_offset;
       offset = analysis->analysis_offset;
       do {
-         tonality_analysis(analysis, celt_mode, analysis_pcm, IMIN(480,
pcm_len), offset, c1, c2, C, lsb_depth, downmix);
+         tonality_analysis(analysis, celt_mode, analysis_pcm, IMIN(480,
pcm_len), offset, c1, c2, C, lsb_depth, downmix, arch);
          offset += 480;
          pcm_len -= 480;
       } while (pcm_len>0);
diff --git a/src/analysis.h b/src/analysis.h
index 85a73d7..9c328e8 100644
--- a/src/analysis.h
+++ b/src/analysis.h
@@ -82,6 +82,6 @@ void tonality_get_info(TonalityAnalysisState *tonal,
AnalysisInfo *info_out, int
 
 void run_analysis(TonalityAnalysisState *analysis, const CELTMode *celt_mode,
const void *analysis_pcm,
                  int analysis_frame_size, int frame_size, int c1, int c2, int
C, opus_int32 Fs,
-                 int lsb_depth, downmix_func downmix, AnalysisInfo
*analysis_info);
+                 int lsb_depth, downmix_func downmix, AnalysisInfo
*analysis_info, int arch);
 
 #endif
diff --git a/src/opus_encoder.c b/src/opus_encoder.c
index d94163f..4656da5 100644
--- a/src/opus_encoder.c
+++ b/src/opus_encoder.c
@@ -1006,7 +1006,7 @@ opus_int32 opus_encode_native(OpusEncoder *st, const
opus_val16 *pcm, int frame_
        analysis_read_subframe_bak = st->analysis.read_subframe;
        run_analysis(&st->analysis, celt_mode, analysis_pcm,
analysis_size, frame_size,
              c1, c2, analysis_channels, st->Fs,
-             lsb_depth, downmix, &analysis_info);
+             lsb_depth, downmix, &analysis_info, st->arch);
     }
 #else
     (void)analysis_pcm;
diff --git a/src/opus_multistream_encoder.c b/src/opus_multistream_encoder.c
index 6e87337..1281e85 100644
--- a/src/opus_multistream_encoder.c
+++ b/src/opus_multistream_encoder.c
@@ -71,6 +71,7 @@ typedef void (*opus_copy_channel_in_func)(
 
 struct OpusMSEncoder {
    ChannelLayout layout;
+   int arch;
    int lfe_stream;
    int application;
    int variable_duration;
@@ -218,7 +219,7 @@ opus_val16 logSum(opus_val16 a, opus_val16 b)
 #endif
 
 void surround_analysis(const CELTMode *celt_mode, const void *pcm, opus_val16
*bandLogE, opus_val32 *mem, opus_val32 *preemph_mem,
-      int len, int overlap, int channels, int rate, opus_copy_channel_in_func
copy_channel_in
+      int len, int overlap, int channels, int rate, opus_copy_channel_in_func
copy_channel_in, int arch
 )
 {
    int c;
@@ -257,7 +258,8 @@ void surround_analysis(const CELTMode *celt_mode, const void
*pcm, opus_val16 *b
       OPUS_COPY(in, mem+c*overlap, overlap);
       (*copy_channel_in)(x, 1, pcm, channels, c, len);
       celt_preemphasis(x, in+overlap, frame_size, 1, upsample,
celt_mode->preemph, preemph_mem+c, 0);
-      clt_mdct_forward(&celt_mode->mdct, in, freq, celt_mode->window,
overlap, celt_mode->maxLM-LM, 1);
+      clt_mdct_forward(&celt_mode->mdct, in, freq, celt_mode->window,
+                       overlap, celt_mode->maxLM-LM, 1, arch);
       if (upsample != 1)
       {
          int bound = len;
@@ -411,6 +413,7 @@ static int opus_multistream_encoder_init_impl(
        (streams<1) || (coupled_streams<0) ||
(streams>255-coupled_streams))
       return OPUS_BAD_ARG;
 
+   st->arch = opus_select_arch();
    st->layout.nb_channels = channels;
    st->layout.nb_streams = streams;
    st->layout.nb_coupled_streams = coupled_streams;
@@ -767,7 +770,7 @@ static int opus_multistream_encode_native
    ALLOC(bandSMR, 21*st->layout.nb_channels, opus_val16);
    if (st->surround)
    {
-      surround_analysis(celt_mode, pcm, bandSMR, mem, preemph_mem, frame_size,
120, st->layout.nb_channels, Fs, copy_channel_in);
+      surround_analysis(celt_mode, pcm, bandSMR, mem, preemph_mem, frame_size,
120, st->layout.nb_channels, Fs, copy_channel_in, st->arch);
    }
 
    /* Compute bitrate allocation between streams (this could be a lot better)
*/
-- 
1.9.1

Viswanath Puttagunta

2015-Mar-31 22:57 UTC

[opus] [RFC PATCH v1 2/5] armv7(float): Optimize decode usecase using NE10 library

Optimize the Opus decode use case (float only) using the ARM NE10
library. This mainly affects opus_ifft, clt_mdct_backward, and
related functions.
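
The run-time dispatch that this patch extends to opus_ifft and
clt_mdct_backward can be sketched roughly as below. This is only an
illustration of the table-plus-macro pattern; every demo_* and DEMO_*
name is hypothetical, and the function bodies are stand-ins rather
than real FFT code.

```c
#include <stddef.h>

/* Hypothetical stand-in for OPUS_ARCHMASK: four arch levels, 0..3. */
#define DEMO_ARCHMASK 3

/* Each implementation returns an ID so a caller can see which one ran. */
enum { DEMO_IMPL_C = 0, DEMO_IMPL_NEON = 1 };

typedef int (*demo_ifft_fn)(const float *in, float *out, size_t n);

/* Stand-in for the generic C path (copies input to output). */
static int demo_ifft_c(const float *in, float *out, size_t n)
{
   size_t i;
   for (i = 0; i < n; i++)
      out[i] = in[i];
   return DEMO_IMPL_C;
}

/* Stand-in for the NEON/NE10 path (same copy, different ID). */
static int demo_ifft_neon(const float *in, float *out, size_t n)
{
   size_t i;
   for (i = 0; i < n; i++)
      out[i] = in[i];
   return DEMO_IMPL_NEON;
}

/* One slot per arch level; only the highest level gets the fast path. */
static const demo_ifft_fn DEMO_IFFT[DEMO_ARCHMASK + 1] = {
   demo_ifft_c,     /* ARMv4 */
   demo_ifft_c,     /* EDSP */
   demo_ifft_c,     /* Media */
   demo_ifft_neon   /* Neon */
};

/* Callers pass the arch value detected once at init time. */
#define demo_ifft(in, out, n, arch) \
   ((*DEMO_IFFT[(arch) & DEMO_ARCHMASK])((in), (out), (n)))
```

With this shape, a caller that stored arch 3 at init time reaches
demo_ifft_neon, while arch 0 through 2 fall back to demo_ifft_c.
Masking the index keeps an out-of-range arch value inside the table.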

Work based on previous Encode optimization using ARM NE10
library.

TBD: Add the commit ID of the upstream encode NE10 optimization patch
so that users have a reference for how to enable this optimization.
Signed-off-by: Viswanath Puttagunta <viswanath.puttagunta at linaro.org>
---
 celt/arm/arm_celt_map.c     |  22 ++++++++++
 celt/arm/celt_ne10_fft.c    |  26 +++++++++++
 celt/arm/celt_ne10_mdct.c   | 102 ++++++++++++++++++++++++++++++++++++++++++++
 celt/arm/fft_arm.h          |   8 ++++
 celt/arm/mdct_arm.h         |   7 +++
 celt/celt_decoder.c         |  18 ++++----
 celt/celt_encoder.c         |   3 +-
 celt/kiss_fft.c             |   4 +-
 celt/kiss_fft.h             |  13 +++++-
 celt/mdct.c                 |   5 ++-
 celt/mdct.h                 |  22 ++++++++--
 celt/tests/test_unit_dft.c  |   2 +-
 celt/tests/test_unit_mdct.c |   2 +-
 13 files changed, 214 insertions(+), 20 deletions(-)

diff --git a/celt/arm/arm_celt_map.c b/celt/arm/arm_celt_map.c
index 3b49f90..f132fe1 100644
--- a/celt/arm/arm_celt_map.c
+++ b/celt/arm/arm_celt_map.c
@@ -79,6 +79,15 @@ void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const kiss_fft_state
*cfg,
    opus_fft_float_neon           /* Neon with NE10 */
 };
 
+void (*const OPUS_IFFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg,
+                                         const kiss_fft_cpx *fin,
+                                         kiss_fft_cpx *fout) = {
+   opus_ifft_c,                   /* ARMv4 */
+   opus_ifft_c,                   /* EDSP */
+   opus_ifft_c,                   /* Media */
+   opus_ifft_float_neon           /* Neon with NE10 */
+};
+
 void (*const CLT_MDCT_FORWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l,
                                                      kiss_fft_scalar *in,
                                                      kiss_fft_scalar *
OPUS_RESTRICT out,
@@ -90,6 +99,19 @@ void (*const CLT_MDCT_FORWARD_IMPL[OPUS_ARCHMASK+1])(const
mdct_lookup *l,
    clt_mdct_forward_c,           /* Media */
    clt_mdct_forward_float_neon   /* Neon with NE10 */
 };
+
+void (*const CLT_MDCT_BACKWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l,
+                                                      kiss_fft_scalar *in,
+                                                      kiss_fft_scalar *
OPUS_RESTRICT out,
+                                                      const opus_val16 *window,
+                                                      int overlap, int shift,
+                                                      int stride, int arch) = {
+   clt_mdct_backward_c,           /* ARMv4 */
+   clt_mdct_backward_c,           /* EDSP */
+   clt_mdct_backward_c,           /* Media */
+   clt_mdct_backward_float_neon   /* Neon with NE10 */
+};
+
 #endif /* HAVE_ARM_NE10 */
 #  endif /* OPUS_ARM_NEON_INTR */
 # endif /* FIXED_POINT */
diff --git a/celt/arm/celt_ne10_fft.c b/celt/arm/celt_ne10_fft.c
index b592f19..d354502 100644
--- a/celt/arm/celt_ne10_fft.c
+++ b/celt/arm/celt_ne10_fft.c
@@ -118,3 +118,29 @@ void opus_fft_float_neon(const kiss_fft_state *st,
    }
    RESTORE_STACK;
 }
+
+void opus_ifft_float_neon(const kiss_fft_state *st,
+                          const kiss_fft_cpx *fin,
+                          kiss_fft_cpx *fout)
+{
+   ne10_fft_state_float32_t state;
+   ne10_fft_cfg_float32_t cfg = &state;
+   VARDECL(ne10_fft_cpx_float32_t, buffer);
+   SAVE_STACK;
+   ALLOC(buffer, st->nfft, ne10_fft_cpx_float32_t);
+
+   if (!st->arch_fft->is_supported) {
+      /* This nfft length (scaled fft) not supported in NE10 */
+      opus_ifft_c(st, fin, fout);
+   }
+   else {
+      memcpy((void *)cfg, st->arch_fft->priv, sizeof(ne10_fft_state_float32_t));
+      state.buffer = (ne10_fft_cpx_float32_t *)&buffer[0];
+      state.is_backward_scaled = 0;
+
+      ne10_fft_c2c_1d_float32_neon((ne10_fft_cpx_float32_t *)fout,
+                                   (ne10_fft_cpx_float32_t *)fin,
+                                   cfg, 1);
+   }
+   RESTORE_STACK;
+}
diff --git a/celt/arm/celt_ne10_mdct.c b/celt/arm/celt_ne10_mdct.c
index cf175cb..0979cbe 100644
--- a/celt/arm/celt_ne10_mdct.c
+++ b/celt/arm/celt_ne10_mdct.c
@@ -156,3 +156,105 @@ void clt_mdct_forward_float_neon(const mdct_lookup *l,
    }
    RESTORE_STACK;
 }
+
+void clt_mdct_backward_float_neon(const mdct_lookup *l,
+                                  kiss_fft_scalar *in,
+                                  kiss_fft_scalar * OPUS_RESTRICT out,
+                                  const opus_val16 * OPUS_RESTRICT window,
+                                  int overlap, int shift, int stride, int arch)
+{
+   int i;
+   int N, N2, N4;
+   VARDECL(kiss_fft_scalar, f);
+   const kiss_twiddle_scalar *trig;
+   const kiss_fft_state *st = l->kfft[shift];
+
+   N = l->n;
+   trig = l->trig;
+   for (i=0;i<shift;i++)
+   {
+      N >>= 1;
+      trig += N;
+   }
+   N2 = N>>1;
+   N4 = N>>2;
+
+   ALLOC(f, N2, kiss_fft_scalar);
+
+   /* Pre-rotate */
+   {
+      /* Temp pointers to make it really clear to the compiler what we're doing */
+      const kiss_fft_scalar * OPUS_RESTRICT xp1 = in;
+      const kiss_fft_scalar * OPUS_RESTRICT xp2 = in+stride*(N2-1);
+      kiss_fft_scalar * OPUS_RESTRICT yp = f;
+      const kiss_twiddle_scalar * OPUS_RESTRICT t = &trig[0];
+      for(i=0;i<N4;i++)
+      {
+         kiss_fft_scalar yr, yi;
+         yr = S_MUL(*xp2, t[i]) + S_MUL(*xp1, t[N4+i]);
+         yi = S_MUL(*xp1, t[i]) - S_MUL(*xp2, t[N4+i]);
+         yp[2*i] = yr;
+         yp[2*i+1] = yi;
+         xp1+=2*stride;
+         xp2-=2*stride;
+      }
+   }
+
+   opus_ifft(st, (kiss_fft_cpx *)f, (kiss_fft_cpx*)(out+(overlap>>1)), arch);
+
+   /* Post-rotate and de-shuffle from both ends of the buffer at once to make
+      it in-place. */
+   {
+      kiss_fft_scalar * yp0 = out+(overlap>>1);
+      kiss_fft_scalar * yp1 = out+(overlap>>1)+N2-2;
+      const kiss_twiddle_scalar *t = &trig[0];
+      /* Loop to (N4+1)>>1 to handle odd N4. When N4 is odd, the
+         middle pair will be computed twice. */
+      for(i=0;i<(N4+1)>>1;i++)
+      {
+         kiss_fft_scalar re, im, yr, yi;
+         kiss_twiddle_scalar t0, t1;
+         re = yp0[0];
+         im = yp0[1];
+         t0 = t[i];
+         t1 = t[N4+i];
+         /* We'd scale up by 2 here, but instead it's done when mixing the windows */
+         yr = S_MUL(re,t0) + S_MUL(im,t1);
+         yi = S_MUL(re,t1) - S_MUL(im,t0);
+         re = yp1[0];
+         im = yp1[1];
+         yp0[0] = yr;
+         yp1[1] = yi;
+
+         t0 = t[(N4-i-1)];
+         t1 = t[(N2-i-1)];
+         /* We'd scale up by 2 here, but instead it's done when mixing the windows */
+         yr = S_MUL(re,t0) + S_MUL(im,t1);
+         yi = S_MUL(re,t1) - S_MUL(im,t0);
+         yp1[0] = yr;
+         yp0[1] = yi;
+         yp0 += 2;
+         yp1 -= 2;
+      }
+   }
+
+   /* Mirror on both sides for TDAC */
+   {
+      kiss_fft_scalar * OPUS_RESTRICT xp1 = out+overlap-1;
+      kiss_fft_scalar * OPUS_RESTRICT yp1 = out;
+      const opus_val16 * OPUS_RESTRICT wp1 = window;
+      const opus_val16 * OPUS_RESTRICT wp2 = window+overlap-1;
+
+      for(i = 0; i < overlap/2; i++)
+      {
+         kiss_fft_scalar x1, x2;
+         x1 = *xp1;
+         x2 = *yp1;
+         *yp1++ = MULT16_32_Q15(*wp2, x2) - MULT16_32_Q15(*wp1, x1);
+         *xp1-- = MULT16_32_Q15(*wp1, x2) + MULT16_32_Q15(*wp2, x1);
+         wp1++;
+         wp2--;
+      }
+   }
+   RESTORE_STACK;
+}
diff --git a/celt/arm/fft_arm.h b/celt/arm/fft_arm.h
index e7a30d6..e57b0aa 100644
--- a/celt/arm/fft_arm.h
+++ b/celt/arm/fft_arm.h
@@ -46,6 +46,11 @@ void opus_fft_free_arm_float_neon(kiss_fft_state *st);
 void opus_fft_float_neon(const kiss_fft_state *st,
                          const kiss_fft_cpx *fin,
                          kiss_fft_cpx *fout);
+
+void opus_ifft_float_neon(const kiss_fft_state *st,
+                         const kiss_fft_cpx *fin,
+                         kiss_fft_cpx *fout);
+
 #if !defined(OPUS_HAVE_RTCD)
 #define OVERRIDE_OPUS_FFT (1)
 
@@ -58,6 +63,9 @@ void opus_fft_float_neon(const kiss_fft_state *st,
 #define opus_fft(_st, _fin, _fout, arch) \
    ((void)(arch), opus_fft_float_neon(_st, _fin, _fout))
 
+#define opus_ifft(_st, _fin, _fout, arch) \
+   ((void)(arch), opus_ifft_float_neon(_st, _fin, _fout))
+
 #endif /* OPUS_HAVE_RTCD */
 
 #endif /* HAVE_ARM_NE10 */
diff --git a/celt/arm/mdct_arm.h b/celt/arm/mdct_arm.h
index 7d60fed..db32efe 100644
--- a/celt/arm/mdct_arm.h
+++ b/celt/arm/mdct_arm.h
@@ -43,10 +43,17 @@ void clt_mdct_forward_float_neon(const mdct_lookup *l,
kiss_fft_scalar *in,
                                  const opus_val16 *window, int overlap,
                                  int shift, int stride, int arch);
 
+void clt_mdct_backward_float_neon(const mdct_lookup *l, kiss_fft_scalar *in,
+                                  kiss_fft_scalar * OPUS_RESTRICT out,
+                                  const opus_val16 *window, int overlap,
+                                  int shift, int stride, int arch);
+
 #if !defined(OPUS_HAVE_RTCD)
 #define OVERRIDE_OPUS_MDCT (1)
 #define clt_mdct_forward(_l, _in, _out, _window, _int, _shift, _stride, _arch) \
       clt_mdct_forward_float_neon(_l, _in, _out, _window, _int, _shift, _stride, _arch)
+#define clt_mdct_backward(_l, _in, _out, _window, _int, _shift, _stride, _arch) \
+      clt_mdct_backward_float_neon(_l, _in, _out, _window, _int, _shift, _stride, _arch)
 #endif /* OPUS_HAVE_RTCD */
 #endif /* !defined(FIXED_POINT) && defined(HAVE_ARM_NE10) */
 
diff --git a/celt/celt_decoder.c b/celt/celt_decoder.c
index 4304a3e..304f334 100644
--- a/celt/celt_decoder.c
+++ b/celt/celt_decoder.c
@@ -278,8 +278,9 @@ void deemphasis(celt_sig *in[], opus_val16 *pcm, int N, int
C, int downsample, c
 static
 #endif
 void celt_synthesis(const CELTMode *mode, celt_norm *X, celt_sig * out_syn[],
-      opus_val16 *oldBandE, int start, int effEnd, int C, int CC, int
isTransient,
-      int LM, int downsample, int silence)
+                    opus_val16 *oldBandE, int start, int effEnd, int C, int CC,
+                    int isTransient, int LM, int downsample,
+                    int silence, int arch)
 {
    int c, i;
    int M;
@@ -319,9 +320,9 @@ void celt_synthesis(const CELTMode *mode, celt_norm *X,
celt_sig * out_syn[],
       freq2 = out_syn[1]+overlap/2;
       OPUS_COPY(freq2, freq, N);
       for (b=0;b<B;b++)
-         clt_mdct_backward(&mode->mdct, &freq2[b], out_syn[0]+NB*b,
mode->window, overlap, shift, B);
+         clt_mdct_backward(&mode->mdct, &freq2[b], out_syn[0]+NB*b,
mode->window, overlap, shift, B, arch);
       for (b=0;b<B;b++)
-         clt_mdct_backward(&mode->mdct, &freq[b], out_syn[1]+NB*b,
mode->window, overlap, shift, B);
+         clt_mdct_backward(&mode->mdct, &freq[b], out_syn[1]+NB*b,
mode->window, overlap, shift, B, arch);
    } else if (CC==1&&C==2)
    {
       /* Downmixing a stereo stream to mono */
@@ -335,14 +336,14 @@ void celt_synthesis(const CELTMode *mode, celt_norm *X,
celt_sig * out_syn[],
       for (i=0;i<N;i++)
          freq[i] = HALF32(ADD32(freq[i],freq2[i]));
       for (b=0;b<B;b++)
-         clt_mdct_backward(&mode->mdct, &freq[b], out_syn[0]+NB*b,
mode->window, overlap, shift, B);
+         clt_mdct_backward(&mode->mdct, &freq[b], out_syn[0]+NB*b,
mode->window, overlap, shift, B, arch);
    } else {
       /* Normal case (mono or stereo) */
       c=0; do {
          denormalise_bands(mode, X+c*N, freq, oldBandE+c*nbEBands, start,
effEnd, M,
                downsample, silence);
          for (b=0;b<B;b++)
-            clt_mdct_backward(&mode->mdct, &freq[b],
out_syn[c]+NB*b, mode->window, overlap, shift, B);
+            clt_mdct_backward(&mode->mdct, &freq[b],
out_syn[c]+NB*b, mode->window, overlap, shift, B, arch);
       } while (++c<CC);
    }
    RESTORE_STACK;
@@ -509,7 +510,7 @@ static void celt_decode_lost(CELTDecoder * OPUS_RESTRICT st,
int N, int LM)
                DECODE_BUFFER_SIZE-N+(overlap>>1));
       } while (++c<C);
 
-      celt_synthesis(mode, X, out_syn, plcLogE, start, effEnd, C, C, 0, LM,
st->downsample, 0);
+      celt_synthesis(mode, X, out_syn, plcLogE, start, effEnd, C, C, 0, LM,
st->downsample, 0, st->arch);
    } else {
       /* Pitch-based PLC */
       const opus_val16 *window;
@@ -1002,7 +1003,8 @@ int celt_decode_with_ec(CELTDecoder * OPUS_RESTRICT st,
const unsigned char *dat
          oldBandE[i] = -QCONST16(28.f,DB_SHIFT);
    }
 
-   celt_synthesis(mode, X, out_syn, oldBandE, start, effEnd, C, CC,
isTransient, LM, st->downsample, silence);
+   celt_synthesis(mode, X, out_syn, oldBandE, start, effEnd,
+                  C, CC, isTransient, LM, st->downsample, silence,
st->arch);
 
    c=0; do {
       st->postfilter_period=IMAX(st->postfilter_period,
COMBFILTER_MINPERIOD);
diff --git a/celt/celt_encoder.c b/celt/celt_encoder.c
index 7a2c71b..5f48638 100644
--- a/celt/celt_encoder.c
+++ b/celt/celt_encoder.c
@@ -2072,7 +2072,8 @@ int celt_encode_with_ec(CELTEncoder * OPUS_RESTRICT st,
const opus_val16 * pcm,
          out_mem[c] = st->syn_mem[c]+2*MAX_PERIOD-N;
       } while (++c<CC);
 
-      celt_synthesis(mode, X, out_mem, oldBandE, start, effEnd, C, CC,
isTransient, LM, st->upsample, silence);
+      celt_synthesis(mode, X, out_mem, oldBandE, start, effEnd,
+                     C, CC, isTransient, LM, st->upsample, silence,
st->arch);
 
       c=0; do {
          st->prefilter_period=IMAX(st->prefilter_period,
COMBFILTER_MINPERIOD);
diff --git a/celt/kiss_fft.c b/celt/kiss_fft.c
index 38fd4fb..4ed37d2 100644
--- a/celt/kiss_fft.c
+++ b/celt/kiss_fft.c
@@ -589,8 +589,7 @@ void opus_fft_c(const kiss_fft_state *st,const kiss_fft_cpx
*fin,kiss_fft_cpx *f
 }
 
 
-#ifdef TEST_UNIT_DFT_C
-void opus_ifft(const kiss_fft_state *st,const kiss_fft_cpx *fin,kiss_fft_cpx
*fout)
+void opus_ifft_c(const kiss_fft_state *st,const kiss_fft_cpx *fin,kiss_fft_cpx
*fout)
 {
    int i;
    celt_assert2 (fin != fout, "In-place FFT not supported");
@@ -603,4 +602,3 @@ void opus_ifft(const kiss_fft_state *st,const kiss_fft_cpx
*fin,kiss_fft_cpx *fo
    for (i=0;i<st->nfft;i++)
       fout[i].i = -fout[i].i;
 }
-#endif
diff --git a/celt/kiss_fft.h b/celt/kiss_fft.h
index bf2f836..45017a4 100644
--- a/celt/kiss_fft.h
+++ b/celt/kiss_fft.h
@@ -142,7 +142,7 @@ kiss_fft_state *opus_fft_alloc(int nfft,void * mem,size_t *
lenmem, int arch);
     f[k].r and f[k].i
  * */
 void opus_fft_c(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx
*fout);
-void opus_ifft(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx
*fout);
+void opus_ifft_c(const kiss_fft_state *cfg,const kiss_fft_cpx *fin,kiss_fft_cpx
*fout);
 
 void opus_fft_impl(const kiss_fft_state *st,kiss_fft_cpx *fout);
 void opus_ifft_impl(const kiss_fft_state *st,kiss_fft_cpx *fout);
@@ -171,6 +171,13 @@ void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const
kiss_fft_state *cfg,
                                         kiss_fft_cpx *fout);
 #define opus_fft(_cfg, _fin, _fout, arch) \
    ((*OPUS_FFT[(arch)&OPUS_ARCHMASK])(_cfg, _fin, _fout))
+
+void (*const OPUS_IFFT[OPUS_ARCHMASK+1])(const kiss_fft_state *cfg,
+                                         const kiss_fft_cpx *fin,
+                                         kiss_fft_cpx *fout);
+#define opus_ifft(_cfg, _fin, _fout, arch) \
+   ((*OPUS_IFFT[(arch)&OPUS_ARCHMASK])(_cfg, _fin, _fout))
+
 #else /* else for if defined(OPUS_HAVE_RTCD) &&
(defined(HAVE_ARM_NE10)) */
 
 #define opus_fft_alloc_arch(_st, arch) \
@@ -181,6 +188,10 @@ void (*const OPUS_FFT[OPUS_ARCHMASK+1])(const
kiss_fft_state *cfg,
 
 #define opus_fft(_cfg, _fin, _fout, arch) \
          ((void)(arch), opus_fft_c(_cfg, _fin, _fout))
+
+#define opus_ifft(_cfg, _fin, _fout, arch) \
+         ((void)(arch), opus_ifft_c(_cfg, _fin, _fout))
+
 #endif /* end if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */
 #endif /* end if !defined(OVERRIDE_OPUS_FFT) */
 
diff --git a/celt/mdct.c b/celt/mdct.c
index ee6d80e..5315ad1 100644
--- a/celt/mdct.c
+++ b/celt/mdct.c
@@ -239,12 +239,13 @@ void clt_mdct_forward_c(const mdct_lookup *l,
kiss_fft_scalar *in, kiss_fft_scal
 #endif /* OVERRIDE_clt_mdct_forward */
 
 #ifndef OVERRIDE_clt_mdct_backward
-void clt_mdct_backward(const mdct_lookup *l, kiss_fft_scalar *in,
kiss_fft_scalar * OPUS_RESTRICT out,
-      const opus_val16 * OPUS_RESTRICT window, int overlap, int shift, int
stride)
+void clt_mdct_backward_c(const mdct_lookup *l, kiss_fft_scalar *in,
kiss_fft_scalar * OPUS_RESTRICT out,
+      const opus_val16 * OPUS_RESTRICT window, int overlap, int shift, int
stride, int arch)
 {
    int i;
    int N, N2, N4;
    const kiss_twiddle_scalar *trig;
+   (void) arch;
 
    N = l->n;
    trig = l->trig;
diff --git a/celt/mdct.h b/celt/mdct.h
index cbaf679..5349ccf 100644
--- a/celt/mdct.h
+++ b/celt/mdct.h
@@ -69,9 +69,10 @@ void clt_mdct_forward_c(const mdct_lookup *l, kiss_fft_scalar
*in,
 
 /** Compute a backward MDCT (no scaling) and performs weighted overlap-add
     (scales implicitly by 1/2) */
-void clt_mdct_backward(const mdct_lookup *l, kiss_fft_scalar *in,
-      kiss_fft_scalar * OPUS_RESTRICT out,
-      const opus_val16 * OPUS_RESTRICT window, int overlap, int shift, int
stride);
+void clt_mdct_backward_c(const mdct_lookup *l, kiss_fft_scalar *in,
+                         kiss_fft_scalar * OPUS_RESTRICT out,
+                         const opus_val16 * OPUS_RESTRICT window,
+                         int overlap, int shift, int stride, int arch);
 
 #if !defined(OVERRIDE_OPUS_MDCT)
 /* Is run-time CPU detection enabled on this platform? */
@@ -88,11 +89,26 @@ void (*const CLT_MDCT_FORWARD_IMPL[OPUS_ARCHMASK+1])(const
mdct_lookup *l,
    ((*CLT_MDCT_FORWARD_IMPL[(arch)&OPUS_ARCHMASK])(_l, _in, _out, \
                                                    _window, _overlap, _shift, \
                                                    _stride, _arch))
+
+void (*const CLT_MDCT_BACKWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l,
+                                                      kiss_fft_scalar *in,
+                                                      kiss_fft_scalar *
OPUS_RESTRICT out,
+                                                      const opus_val16 *window,
+                                                      int overlap, int shift,
+                                                      int stride, int arch);
+
+#define clt_mdct_backward(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) \
+   ((*CLT_MDCT_BACKWARD_IMPL[(_arch)&OPUS_ARCHMASK])(_l, _in, _out, \
+                                                    _window, _overlap, _shift, \
+                                                    _stride, _arch))
 #else /* else for if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */
 
 #define clt_mdct_forward(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) \
    clt_mdct_forward_c(_l, _in, _out, _window, _overlap, _shift, _stride, _arch)
 
+#define clt_mdct_backward(_l, _in, _out, _window, _overlap, _shift, _stride, _arch) \
+   clt_mdct_backward_c(_l, _in, _out, _window, _overlap, _shift, _stride, _arch)
+
 #endif /* end if defined(OPUS_HAVE_RTCD) && (defined(HAVE_ARM_NE10)) */
 #endif /* end if !defined(OVERRIDE_OPUS_MDCT) */
 
diff --git a/celt/tests/test_unit_dft.c b/celt/tests/test_unit_dft.c
index 991ece3..28f0238 100644
--- a/celt/tests/test_unit_dft.c
+++ b/celt/tests/test_unit_dft.c
@@ -140,7 +140,7 @@ void test1d(int nfft,int isinverse,int arch)
     /*for (k=0;k<nfft;++k) printf("%d %d ", in[k].r,
in[k].i);printf("\n");*/
 
     if (isinverse)
-       opus_ifft(cfg,in,out);
+       opus_ifft(cfg,in,out, arch);
     else
        opus_fft(cfg,in,out, arch);
 
diff --git a/celt/tests/test_unit_mdct.c b/celt/tests/test_unit_mdct.c
index a1c92d0..51e457a 100644
--- a/celt/tests/test_unit_mdct.c
+++ b/celt/tests/test_unit_mdct.c
@@ -168,7 +168,7 @@ void test1d(int nfft,int isinverse,int arch)
     {
        for (k=0;k<nfft;++k)
           out[k] = 0;
-       clt_mdct_backward(&cfg,in,out, window, nfft/2, 0, 1);
+       clt_mdct_backward(&cfg,in,out, window, nfft/2, 0, 1, arch);
        /* apply TDAC because clt_mdct_backward() no longer does that */
        for (k=0;k<nfft/4;++k)
           out[nfft-k-1] = out[nfft/2+k];
-- 
1.9.1

Viswanath Puttagunta

2015-Mar-31 22:57 UTC

[opus] [RFC PATCH v1 3/5] Intrinsics/RTCD related fixes. Mostly x86

From: Jonathan Lennox <jonathan at vidyo.com>

* Makes --enable-intrinsics work with clang and other non-GCC compilers.
* Enables RTCD for the floating-point-mode SSE code in Celt.
* Disables use of RTCD in cases where the compiler targets an instruction set by default.
* Enables the SSE4.1 Silk optimizations that apply to the common parts of Silk when Opus is built in floating-point mode, not just in fixed-point mode.
* Enables the SSE intrinsics (with RTCD when appropriate) in the Win32 build.
* Fixes a case where GCC would compile SSE2 code as SSE4.1, causing a crash on non-SSE4.1 CPUs.
* Allows configuration with compilers with non-GCC-flavor flags for enabling architecture options.
* Hopefully makes the configuration and ifdefs easier to follow and understand.
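
The "disable RTCD when the compiler already targets the instruction
set" item can be pictured with a sketch like the one below. It is not
the code from this patch: the DEMO_* and demo_* names are hypothetical,
the "SSE2" function is a plain C stand-in, and the only real symbol
relied on is __SSE2__, which GCC and clang predefine when SSE2 is
enabled for the target.

```c
/* Hypothetical switch: 1 when the compiler targets SSE2 by default. */
#if defined(__SSE2__)
# define DEMO_PRESUME_SSE2 1
#else
# define DEMO_PRESUME_SSE2 0
#endif

/* Generic C path. */
static float demo_inner_prod_c(const float *x, const float *y, int n)
{
   float sum = 0.0f;
   int i;
   for (i = 0; i < n; i++)
      sum += x[i] * y[i];
   return sum;
}

/* Stand-in for the SSE2 path. A real build would use intrinsics here
   and compile only this file with the SSE2 flags. */
static float demo_inner_prod_sse2(const float *x, const float *y, int n)
{
   return demo_inner_prod_c(x, y, n);
}

#if DEMO_PRESUME_SSE2
/* SSE2 is guaranteed at compile time: bind directly, no table lookup. */
# define demo_inner_prod(x, y, n, arch) \
   ((void)(arch), demo_inner_prod_sse2((x), (y), (n)))
#else
/* SSE2 is only possible: pick the implementation at run time. */
static float (*const DEMO_INNER_PROD[2])(const float *, const float *, int) = {
   demo_inner_prod_c,      /* arch 0: no SSE2 detected */
   demo_inner_prod_sse2    /* arch 1: SSE2 detected */
};
# define demo_inner_prod(x, y, n, arch) \
   ((*DEMO_INNER_PROD[(arch) & 1])((x), (y), (n)))
#endif
```

Either branch gives callers the same demo_inner_prod(x, y, n, arch)
spelling, so call sites do not need their own ifdefs. Keeping each
instruction set in its own source file with its own flags is what
avoids the SSE2-built-as-SSE4.1 crash described above.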

Reviewed-by: Viswanath Puttagunta <viswanath.puttagunta at linaro.org>
---
 Makefile.am                              |  38 ++--
 celt/arm/armcpu.c                        |   6 +-
 celt/arm/pitch_arm.h                     |   4 +-
 celt/bands.c                             |   6 +-
 celt/celt.c                              |  16 +-
 celt/celt.h                              |  12 +-
 celt/celt_decoder.c                      |   6 +-
 celt/celt_encoder.c                      |   4 +-
 celt/celt_lpc.h                          |   2 +-
 celt/cpu_support.h                       |  15 +-
 celt/mips/celt_mipsr1.h                  |   2 +-
 celt/pitch.c                             |   4 +-
 celt/pitch.h                             |  19 +-
 celt/tests/test_unit_dft.c               |   4 +-
 celt/tests/test_unit_mathops.c           |  11 +-
 celt/tests/test_unit_mdct.c              |   4 +-
 celt/tests/test_unit_rotation.c          |  11 +-
 celt/x86/celt_lpc_sse.c                  |   4 +
 celt/x86/celt_lpc_sse.h                  |  12 +-
 celt/x86/pitch_sse.c                     | 334 +++++++++++++------------------
 celt/x86/pitch_sse.h                     | 256 ++++++++++-------------
 celt/x86/pitch_sse2.c                    |  95 +++++++++
 celt/x86/pitch_sse4_1.c                  | 195 ++++++++++++++++++
 celt/x86/x86_celt_map.c                  |  76 ++++++-
 celt/x86/x86cpu.c                        |  47 ++++-
 celt/x86/x86cpu.h                        |  26 ++-
 celt_sources.mk                          |   5 +-
 configure.ac                             | 313 ++++++++++++++++++-----------
 m4/opus-intrinsics.m4                    |  29 +++
 silk/x86/SigProc_FIX_sse.h               |  17 ++
 silk/x86/main_sse.h                      |  48 +++++
 silk/x86/x86_silk_map.c                  |  25 ++-
 win32/VS2010/celt.vcxproj                |  17 +-
 win32/VS2010/celt.vcxproj.filters        |  27 +++
 win32/VS2010/silk_common.vcxproj         |  17 +-
 win32/VS2010/silk_common.vcxproj.filters |  23 ++-
 win32/VS2010/silk_fixed.vcxproj          |  13 +-
 win32/VS2010/silk_fixed.vcxproj.filters  |  17 +-
 win32/config.h                           |  25 ++-
 39 files changed, 1214 insertions(+), 571 deletions(-)
 create mode 100644 celt/x86/pitch_sse2.c
 create mode 100644 celt/x86/pitch_sse4_1.c
 create mode 100644 m4/opus-intrinsics.m4

diff --git a/Makefile.am b/Makefile.am
index c5c1562..3a75740 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -23,6 +23,9 @@ SILK_SOURCES += $(SILK_SOURCES_SSE4_1)
$(SILK_SOURCES_FIXED_SSE4_1)
 endif
 else
 SILK_SOURCES += $(SILK_SOURCES_FLOAT)
+if HAVE_SSE4_1
+SILK_SOURCES += $(SILK_SOURCES_SSE4_1)
+endif
 endif
 
 if DISABLE_FLOAT_API
@@ -30,12 +33,14 @@ else
 OPUS_SOURCES += $(OPUS_SOURCES_FLOAT)
 endif
 
-if HAVE_SSE4_1
-CELT_SOURCES += $(CELT_SOURCES_SSE) $(CELT_SOURCES_SSE4_1)
-else
-if HAVE_SSE2
+if HAVE_SSE
 CELT_SOURCES += $(CELT_SOURCES_SSE)
 endif
+if HAVE_SSE2
+CELT_SOURCES += $(CELT_SOURCES_SSE2)
+endif
+if HAVE_SSE4_1
+CELT_SOURCES += $(CELT_SOURCES_SSE4_1)
 endif
 
 if CPU_ARM
@@ -44,7 +49,6 @@ SILK_SOURCES += $(SILK_SOURCES_ARM)
 
 if OPUS_ARM_NEON_INTR
 CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR)
-OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon
 endif
 
 if HAVE_ARM_NE10
@@ -262,20 +266,30 @@ $(CELT_SOURCES_ARM_ASM:%.s=%-gnu.S): $(top_srcdir)/celt/arm/arm2gnu.pl
 %-gnu.S: %.s
 	$(top_srcdir)/celt/arm/arm2gnu.pl @ARM2GNU_PARAMS@ < $< > $@
 
-SSE_OBJ = %_sse.o %_sse.lo %test_unit_mathops.o %test_unit_rotation.o
+OPT_UNIT_TEST_OBJ = $(celt_tests_test_unit_mathops_SOURCES:.c=.o) \
+                    $(celt_tests_test_unit_rotation_SOURCES:.c=.o)
+
+if HAVE_SSE
+SSE_OBJ = $(CELT_SOURCES_SSE:.c=.lo)
+$(SSE_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_X86_SSE_CFLAGS)
+endif
 
-if HAVE_SSE4_1
-$(SSE_OBJ): CFLAGS += -msse4.1
-else
 if HAVE_SSE2
-$(SSE_OBJ): CFLAGS += -msse2
+SSE2_OBJ = $(CELT_SOURCES_SSE2:.c=.lo)
+$(SSE2_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_X86_SSE2_CFLAGS)
 endif
+
+if HAVE_SSE4_1
+SSE4_1_OBJ = $(CELT_SOURCES_SSE4_1:.c=.lo) \
+             $(SILK_SOURCES_SSE4_1:.c=.lo) \
+             $(SILK_SOURCES_FIXED_SSE4_1:.c=.lo)
+$(SSE4_1_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_X86_SSE4_1_CFLAGS)
 endif
 
 if OPUS_ARM_NEON_INTR
 CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \
                          $(CELT_SOURCES_ARM_NE10:.c=.lo) \
-                         %test_unit_rotation.o %test_unit_mathops.o \
                          %test_unit_mdct.o %test_unit_dft.o
-$(CELT_ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS) $(NE10_CFLAGS)
+
+$(CELT_ARM_NEON_INTR_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CFLAGS) $(NE10_CFLAGS)
 endif
diff --git a/celt/arm/armcpu.c b/celt/arm/armcpu.c
index 1768525..5e5d10c 100644
--- a/celt/arm/armcpu.c
+++ b/celt/arm/armcpu.c
@@ -73,7 +73,7 @@ static OPUS_INLINE opus_uint32 opus_cpu_capabilities(void){
   __except(GetExceptionCode()==EXCEPTION_ILLEGAL_INSTRUCTION){
     /*Ignore exception.*/
   }
-#   if defined(OPUS_ARM_MAY_HAVE_NEON)
+#   if defined(OPUS_ARM_MAY_HAVE_NEON) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)
   __try{
     /*VORR q0,q0,q0*/
     __emit(0xF2200150);
@@ -107,7 +107,7 @@ opus_uint32 opus_cpu_capabilities(void)
 
     while(fgets(buf, 512, cpuinfo) != NULL)
     {
-# if defined(OPUS_ARM_MAY_HAVE_EDSP) || defined(OPUS_ARM_MAY_HAVE_NEON)
+# if defined(OPUS_ARM_MAY_HAVE_EDSP) || defined(OPUS_ARM_MAY_HAVE_NEON) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)
       /* Search for edsp and neon flag */
       if(memcmp(buf, "Features", 8) == 0)
       {
@@ -118,7 +118,7 @@ opus_uint32 opus_cpu_capabilities(void)
           flags |= OPUS_CPU_ARM_EDSP;
 #  endif
 
-#  if defined(OPUS_ARM_MAY_HAVE_NEON)
+#  if defined(OPUS_ARM_MAY_HAVE_NEON) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR)
         p = strstr(buf, " neon");
         if(p != NULL && (p[5] == ' ' || p[5] == '\n'))
           flags |= OPUS_CPU_ARM_NEON;
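
Note on the " neon" test in the hunk above: it only accepts the flag when it
appears as a whole token on the "Features" line of /proc/cpuinfo, i.e. when it
is followed by a space or a newline. A minimal standalone sketch of that check
follows. The helper name features_line_has_neon is hypothetical and is not part
of this patch; strncmp is used here instead of memcmp so that short strings are
safe to pass in.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not in the patch): returns 1 if a /proc/cpuinfo
 * "Features" line lists "neon" as a whole token, 0 otherwise.
 * Like the code in armcpu.c, only the first " neon" match is examined. */
static int features_line_has_neon(const char *buf)
{
    const char *p;
    assert(buf != NULL);
    /* Only the "Features" line carries the capability flags. */
    if (strncmp(buf, "Features", 8) != 0)
        return 0;
    p = strstr(buf, " neon");
    /* Require a delimiter after the match so "neonx" is not accepted. */
    return p != NULL && (p[5] == ' ' || p[5] == '\n');
}
```

With this sketch, a line such as "Features : swp half neon vfpv3" is accepted,
while a line containing only a longer token such as "neonx" is rejected.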
diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h
index 125d1bc..8626ed7 100644
--- a/celt/arm/pitch_arm.h
+++ b/celt/arm/pitch_arm.h
@@ -54,10 +54,10 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y,
 
 #else /* Start !FIXED_POINT */
 /* Float case */
-#if defined(OPUS_ARM_NEON_INTR)
+#if defined(OPUS_ARM_MAY_HAVE_NEON_INTR)
 void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
                                  opus_val32 *xcorr, int len, int max_pitch);
-#if !defined(OPUS_HAVE_RTCD)
+#if !defined(OPUS_HAVE_RTCD) || defined(OPUS_ARM_PRESUME_NEON_INTR)
 #define OVERRIDE_PITCH_XCORR (1)
 #   define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
    ((void)(arch),celt_pitch_xcorr_float_neon(_x, _y, xcorr, len, max_pitch))
diff --git a/celt/bands.c b/celt/bands.c
index c643b09..25f229e 100644
--- a/celt/bands.c
+++ b/celt/bands.c
@@ -398,7 +398,7 @@ static void stereo_split(celt_norm * OPUS_RESTRICT X, celt_norm * OPUS_RESTRICT
    }
 }
 
-static void stereo_merge(celt_norm * OPUS_RESTRICT X, celt_norm * OPUS_RESTRICT Y, opus_val16 mid, int N)
+static void stereo_merge(celt_norm * OPUS_RESTRICT X, celt_norm * OPUS_RESTRICT Y, opus_val16 mid, int N, int arch)
 {
    int j;
    opus_val32 xp=0, side=0;
@@ -410,7 +410,7 @@ static void stereo_merge(celt_norm * OPUS_RESTRICT X, celt_norm * OPUS_RESTRICT
    opus_val32 t, lgain, rgain;
 
    /* Compute the norm of X+Y and X-Y as |X|^2 + |Y|^2 +/- sum(xy) */
-   dual_inner_prod(Y, X, Y, N, &xp, &side);
+   dual_inner_prod(Y, X, Y, N, &xp, &side, arch);
    /* Compensating for the mid normalization */
    xp = MULT16_32_Q15(mid, xp);
    /* mid and side are in Q15, not Q14 like X and Y */
@@ -1348,7 +1348,7 @@ static unsigned quant_band_stereo(struct band_ctx *ctx, celt_norm *X, celt_norm
    if (resynth)
    {
       if (N!=2)
-         stereo_merge(X, Y, mid, N);
+         stereo_merge(X, Y, mid, N, ctx->arch);
       if (inv)
       {
          int j;
diff --git a/celt/celt.c b/celt/celt.c
index a610de4..40c62ce 100644
--- a/celt/celt.c
+++ b/celt/celt.c
@@ -89,10 +89,12 @@ int resampling_factor(opus_int32 rate)
    return ret;
 }
 
-#ifndef OVERRIDE_COMB_FILTER_CONST
 /* This version should be faster on ARM */
 #ifdef OPUS_ARM_ASM
-static void comb_filter_const(opus_val32 *y, opus_val32 *x, int T, int N,
+#ifndef NON_STATIC_COMB_FILTER_CONST_C
+static
+#endif
+void comb_filter_const_c(opus_val32 *y, opus_val32 *x, int T, int N,
       opus_val16 g10, opus_val16 g11, opus_val16 g12)
 {
    opus_val32 x0, x1, x2, x3, x4;
@@ -147,7 +149,10 @@ static void comb_filter_const(opus_val32 *y, opus_val32 *x, int T, int N,
 #endif
 }
 #else
-static void comb_filter_const(opus_val32 *y, opus_val32 *x, int T, int N,
+#ifndef NON_STATIC_COMB_FILTER_CONST_C
+static
+#endif
+void comb_filter_const_c(opus_val32 *y, opus_val32 *x, int T, int N,
       opus_val16 g10, opus_val16 g11, opus_val16 g12)
 {
    opus_val32 x0, x1, x2, x3, x4;
@@ -171,12 +176,11 @@ static void comb_filter_const(opus_val32 *y, opus_val32 *x, int T, int N,
 
 }
 #endif
-#endif
 
 #ifndef OVERRIDE_comb_filter
 void comb_filter(opus_val32 *y, opus_val32 *x, int T0, int T1, int N,
       opus_val16 g0, opus_val16 g1, int tapset0, int tapset1,
-      const opus_val16 *window, int overlap)
+      const opus_val16 *window, int overlap, int arch)
 {
    int i;
    /* printf ("%d %d %f %f\n", T0, T1, g0, g1); */
@@ -234,7 +238,7 @@ void comb_filter(opus_val32 *y, opus_val32 *x, int T0, int T1, int N,
    }
 
    /* Compute the part with the constant filter. */
-   comb_filter_const(y+i, x+i, T1, N-i, g10, g11, g12);
+   comb_filter_const(y+i, x+i, T1, N-i, g10, g11, g12, arch);
 }
 #endif /* OVERRIDE_comb_filter */
 
diff --git a/celt/celt.h b/celt/celt.h
index b196751..a423b95 100644
--- a/celt/celt.h
+++ b/celt/celt.h
@@ -201,7 +201,17 @@ void celt_preemphasis(const opus_val16 * OPUS_RESTRICT pcmp, celt_sig * OPUS_RES
 
 void comb_filter(opus_val32 *y, opus_val32 *x, int T0, int T1, int N,
       opus_val16 g0, opus_val16 g1, int tapset0, int tapset1,
-      const opus_val16 *window, int overlap);
+      const opus_val16 *window, int overlap, int arch);
+
+#ifdef NON_STATIC_COMB_FILTER_CONST_C
+void comb_filter_const_c(opus_val32 *y, opus_val32 *x, int T, int N,
+                         opus_val16 g10, opus_val16 g11, opus_val16 g12);
+#endif
+
+#ifndef OVERRIDE_COMB_FILTER_CONST
+# define comb_filter_const(y, x, T, N, g10, g11, g12, arch)		\
+    ((void)(arch),comb_filter_const_c(y, x, T, N, g10, g11, g12))
+#endif
 
 void init_caps(const CELTMode *m,int *cap,int LM,int C);
 
diff --git a/celt/celt_decoder.c b/celt/celt_decoder.c
index 304f334..505a6ef 100644
--- a/celt/celt_decoder.c
+++ b/celt/celt_decoder.c
@@ -699,7 +699,7 @@ static void celt_decode_lost(CELTDecoder * OPUS_RESTRICT st, int N, int LM)
          comb_filter(etmp, buf+DECODE_BUFFER_SIZE,
               st->postfilter_period, st->postfilter_period, overlap,
               -st->postfilter_gain, -st->postfilter_gain,
-              st->postfilter_tapset, st->postfilter_tapset, NULL, 0);
+              st->postfilter_tapset, st->postfilter_tapset, NULL, 0, st->arch);
 
          /* Simulate TDAC on the concealed audio so that it blends with the
             MDCT of the next frame. */
@@ -1011,11 +1011,11 @@ int celt_decode_with_ec(CELTDecoder * OPUS_RESTRICT st, const unsigned char *dat
       st->postfilter_period_old=IMAX(st->postfilter_period_old, COMBFILTER_MINPERIOD);
       comb_filter(out_syn[c], out_syn[c], st->postfilter_period_old, st->postfilter_period, mode->shortMdctSize,
             st->postfilter_gain_old, st->postfilter_gain, st->postfilter_tapset_old, st->postfilter_tapset,
-            mode->window, overlap);
+            mode->window, overlap, st->arch);
       if (LM!=0)
          comb_filter(out_syn[c]+mode->shortMdctSize, out_syn[c]+mode->shortMdctSize, st->postfilter_period, postfilter_pitch, N-mode->shortMdctSize,
                st->postfilter_gain, postfilter_gain, st->postfilter_tapset, postfilter_tapset,
-               mode->window, overlap);
+               mode->window, overlap, st->arch);
 
    } while (++c<CC);
    st->postfilter_period_old = st->postfilter_period;
diff --git a/celt/celt_encoder.c b/celt/celt_encoder.c
index 5f48638..1c9dbcb 100644
--- a/celt/celt_encoder.c
+++ b/celt/celt_encoder.c
@@ -1166,11 +1166,11 @@ static int run_prefilter(CELTEncoder *st, celt_sig *in, celt_sig *prefilter_mem,
       if (offset)
          comb_filter(in+c*(N+overlap)+overlap, pre[c]+COMBFILTER_MAXPERIOD,
                st->prefilter_period, st->prefilter_period, offset, -st->prefilter_gain, -st->prefilter_gain,
-               st->prefilter_tapset, st->prefilter_tapset, NULL, 0);
+               st->prefilter_tapset, st->prefilter_tapset, NULL, 0, st->arch);
 
       comb_filter(in+c*(N+overlap)+overlap+offset, pre[c]+COMBFILTER_MAXPERIOD+offset,
             st->prefilter_period, pitch_index, N-offset, -st->prefilter_gain, -gain1,
-            st->prefilter_tapset, prefilter_tapset, mode->window, overlap);
+            st->prefilter_tapset, prefilter_tapset, mode->window, overlap, st->arch);
       OPUS_COPY(st->in_mem+c*(overlap), in+c*(N+overlap)+N, overlap);
 
       if (N>COMBFILTER_MAXPERIOD)
diff --git a/celt/celt_lpc.h b/celt/celt_lpc.h
index dc8967f..323459e 100644
--- a/celt/celt_lpc.h
+++ b/celt/celt_lpc.h
@@ -48,7 +48,7 @@ void celt_fir_c(
          opus_val16 *mem,
          int arch);
 
-#if !defined(OPUS_X86_MAY_HAVE_SSE4_1)
+#if !defined(OVERRIDE_CELT_FIR)
 #define celt_fir(x, num, y, N, ord, mem, arch) \
     (celt_fir_c(x, num, y, N, ord, mem, arch))
 #endif
diff --git a/celt/cpu_support.h b/celt/cpu_support.h
index 1d62e2f..5e99a90 100644
--- a/celt/cpu_support.h
+++ b/celt/cpu_support.h
@@ -32,7 +32,8 @@
 #include "opus_defines.h"
 
 #if defined(OPUS_HAVE_RTCD) && \
-  (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR))
+  (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR))
+
 #include "arm/armcpu.h"
 
 /* We currently support 4 ARM variants:
@@ -43,14 +44,16 @@
  */
 #define OPUS_ARCHMASK 3
 
-#elif defined(OPUS_X86_MAY_HAVE_SSE2) || defined(OPUS_X86_MAY_HAVE_SSE4_1)
+#elif (defined(OPUS_X86_MAY_HAVE_SSE) && !defined(OPUS_X86_PRESUME_SSE)) || \
+  (defined(OPUS_X86_MAY_HAVE_SSE2) && !defined(OPUS_X86_PRESUME_SSE2)) || \
+  (defined(OPUS_X86_MAY_HAVE_SSE4_1) && !defined(OPUS_X86_PRESUME_SSE4_1))
 
 #include "x86/x86cpu.h"
-/* We currently support 3 x86 variants:
+/* We currently support 4 x86 variants:
  * arch[0] -> non-sse
- * arch[1] -> sse2
- * arch[2] -> sse4.1
- * arch[3] -> NULL
+ * arch[1] -> sse
+ * arch[2] -> sse2
+ * arch[3] -> sse4.1
  */
 #define OPUS_ARCHMASK 3
 int opus_select_arch(void);
diff --git a/celt/mips/celt_mipsr1.h b/celt/mips/celt_mipsr1.h
index 03915d8..7915d59 100644
--- a/celt/mips/celt_mipsr1.h
+++ b/celt/mips/celt_mipsr1.h
@@ -56,7 +56,7 @@
 #define OVERRIDE_comb_filter
 void comb_filter(opus_val32 *y, opus_val32 *x, int T0, int T1, int N,
       opus_val16 g0, opus_val16 g1, int tapset0, int tapset1,
-      const opus_val16 *window, int overlap)
+      const opus_val16 *window, int overlap, int arch)
 {
    int i;
    opus_val32 x0, x1, x2, x3, x4;
diff --git a/celt/pitch.c b/celt/pitch.c
index 4364703..1d89cb0 100644
--- a/celt/pitch.c
+++ b/celt/pitch.c
@@ -439,7 +439,7 @@ opus_val16 remove_doubling(opus_val16 *x, int maxperiod, int minperiod,
 
    T = T0 = *T0_;
    ALLOC(yy_lookup, maxperiod+1, opus_val32);
-   dual_inner_prod(x, x, x-T0, N, &xx, &xy);
+   dual_inner_prod(x, x, x-T0, N, &xx, &xy, arch);
    yy_lookup[0] = xx;
    yy=xx;
    for (i=1;i<=maxperiod;i++)
@@ -483,7 +483,7 @@ opus_val16 remove_doubling(opus_val16 *x, int maxperiod, int minperiod,
       {
          T1b = celt_udiv(2*second_check[k]*T0+k, 2*k);
       }
-      dual_inner_prod(x, &x[-T1], &x[-T1b], N, &xy, &xy2);
+      dual_inner_prod(x, &x[-T1], &x[-T1b], N, &xy, &xy2, arch);
       xy += xy2;
       yy = yy_lookup[T1] + yy_lookup[T1b];
 #ifdef FIXED_POINT
diff --git a/celt/pitch.h b/celt/pitch.h
index 4368cc5..af745eb 100644
--- a/celt/pitch.h
+++ b/celt/pitch.h
@@ -37,8 +37,8 @@
 #include "modes.h"
 #include "cpu_support.h"
 
-#if defined(__SSE__) && !defined(FIXED_POINT) \
- || defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2)
+#if (defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT)) \
+  || ((defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2)) && defined(FIXED_POINT))
 #include "x86/pitch_sse.h"
 #endif
 
@@ -135,8 +135,7 @@ static OPUS_INLINE void xcorr_kernel_c(const opus_val16 * x, const opus_val16 *
 #endif /* OVERRIDE_XCORR_KERNEL */
 
 
-#ifndef OVERRIDE_DUAL_INNER_PROD
-static OPUS_INLINE void dual_inner_prod(const opus_val16 *x, const opus_val16 *y01, const opus_val16 *y02,
+static OPUS_INLINE void dual_inner_prod_c(const opus_val16 *x, const opus_val16 *y01, const opus_val16 *y02,
       int N, opus_val32 *xy1, opus_val32 *xy2)
 {
    int i;
@@ -150,6 +149,10 @@ static OPUS_INLINE void dual_inner_prod(const opus_val16 *x, const opus_val16 *y
    *xy1 = xy01;
    *xy2 = xy02;
 }
+
+#ifndef OVERRIDE_DUAL_INNER_PROD
+# define dual_inner_prod(x, y01, y02, N, xy1, xy2, arch) \
+    ((void)(arch),dual_inner_prod_c(x, y01, y02, N, xy1, xy2))
 #endif
 
 /*We make sure a C version is always available for cases where the overhead of
@@ -169,6 +172,12 @@ static OPUS_INLINE opus_val32 celt_inner_prod_c(const opus_val16 *x,
     ((void)(arch),celt_inner_prod_c(x, y, N))
 #endif
 
+#ifdef NON_STATIC_COMB_FILTER_CONST_C
+void comb_filter_const_c(opus_val32 *y, opus_val32 *x, int T, int N,
+     opus_val16 g10, opus_val16 g11, opus_val16 g12);
+#endif
+
+
 #ifdef FIXED_POINT
 opus_val32
 #else
@@ -180,7 +189,7 @@ celt_pitch_xcorr_c(const opus_val16 *_x, const opus_val16 *_y,
 #if !defined(OVERRIDE_PITCH_XCORR)
 /*Is run-time CPU detection enabled on this platform?*/
 # if defined(OPUS_HAVE_RTCD) && \
-  (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR))
+  (defined(OPUS_ARM_ASM) || (defined(OPUS_ARM_NEON_INTR) && !defined(OPUS_ARM_PRESUME_NEON_INTR)))
 extern
 #  if defined(FIXED_POINT)
 opus_val32
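
Note on the dispatch pattern used throughout this patch: each kernel gets an
"arch" argument, and the call is either bound at compile time to one
implementation (the PRESUME_* / no-RTCD case, where arch is discarded with a
(void) cast) or routed at run time through a table indexed by
arch & OPUS_ARCHMASK. A rough standalone sketch of both forms follows. All
demo_* and DEMO_* names are hypothetical stand-ins and do not exist in Opus;
the "opt" function is a plain unrolled loop standing in for a SIMD version.

```c
#include <assert.h>

#define DEMO_ARCHMASK 3 /* plays the role of OPUS_ARCHMASK */

/* Plain C reference kernel. */
static int demo_inner_prod_c(const short *x, const short *y, int n)
{
    int i, sum = 0;
    assert(n >= 0);
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Stand-in for an optimized kernel: two accumulators, unrolled by two. */
static int demo_inner_prod_opt(const short *x, const short *y, int n)
{
    int i, s0 = 0, s1 = 0;
    assert(n >= 0);
    for (i = 0; i + 1 < n; i += 2) {
        s0 += x[i] * y[i];
        s1 += x[i + 1] * y[i + 1];
    }
    if (i < n)
        s0 += x[i] * y[i];
    return s0 + s1;
}

/* Compile-time binding: arch is evaluated and discarded. */
#define demo_inner_prod_fixed(x, y, n, arch) \
    ((void)(arch), demo_inner_prod_c(x, y, n))

/* Run-time binding: one table slot per arch level. */
static int (*const DEMO_INNER_PROD_IMPL[DEMO_ARCHMASK + 1])(
    const short *, const short *, int) = {
    demo_inner_prod_c,   /* arch 0: no SIMD */
    demo_inner_prod_c,   /* arch 1 */
    demo_inner_prod_opt, /* arch 2 */
    demo_inner_prod_opt  /* arch 3 */
};

#define demo_inner_prod_rtcd(x, y, n, arch) \
    ((*DEMO_INNER_PROD_IMPL[(arch) & DEMO_ARCHMASK])(x, y, n))

/* Demo input: 1*5 + 2*4 + 3*3 + 4*2 + 5*1 = 35. */
static const short demo_x[5] = {1, 2, 3, 4, 5};
static const short demo_y[5] = {5, 4, 3, 2, 1};
```

Masking the index with DEMO_ARCHMASK keeps an out-of-range arch value from
indexing past the end of the table, which is why the table always has
mask + 1 entries even when some slots repeat an implementation.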
diff --git a/celt/tests/test_unit_dft.c b/celt/tests/test_unit_dft.c
index 28f0238..9fbcdc4 100644
--- a/celt/tests/test_unit_dft.c
+++ b/celt/tests/test_unit_dft.c
@@ -46,7 +46,7 @@
 #include "entcode.c"
 
 #if defined(OPUS_HAVE_RTCD) && \
-         (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR))
+         (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR))
 #include "arm/armcpu.c"
 #if !defined(FIXED_POINT)
 #if defined(HAVE_ARM_NE10)
@@ -61,6 +61,8 @@
 #endif
 #elif defined(OPUS_X86_MAY_HAVE_SSE2) || defined(OPUS_X86_MAY_HAVE_SSE4_1)
 #include "x86/x86cpu.c"
+#include "celt/x86/pitch_sse.c"
+#include "x86/x86_celt_map.c"
 #endif
 
 #ifndef M_PI
diff --git a/celt/tests/test_unit_mathops.c b/celt/tests/test_unit_mathops.c
index 5d2e8e4..a1cf2f7 100644
--- a/celt/tests/test_unit_mathops.c
+++ b/celt/tests/test_unit_mathops.c
@@ -49,10 +49,19 @@
 #include "cwrs.c"
 #include "pitch.c"
 #include "celt_lpc.c"
+#include "celt.c"
 
-#if defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2)
+#if defined(OPUS_X86_MAY_HAVE_SSE) || \
+    defined(OPUS_X86_MAY_HAVE_SSE2) || \
+    defined(OPUS_X86_MAY_HAVE_SSE4_1)
+#if defined(OPUS_X86_MAY_HAVE_SSE)
 #include "x86/pitch_sse.c"
+#endif
+#if defined(OPUS_X86_MAY_HAVE_SSE2)
+#include "x86/pitch_sse2.c"
+#endif
 #if defined(OPUS_X86_MAY_HAVE_SSE4_1)
+#include "x86/pitch_sse4_1.c"
 #include "x86/celt_lpc_sse.c"
 #endif
 #include "x86/x86_celt_map.c"
diff --git a/celt/tests/test_unit_mdct.c b/celt/tests/test_unit_mdct.c
index 51e457a..fdee079 100644
--- a/celt/tests/test_unit_mdct.c
+++ b/celt/tests/test_unit_mdct.c
@@ -47,7 +47,7 @@
 #include "entcode.c"
 
 #if defined(OPUS_HAVE_RTCD) && \
-         (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_NEON_INTR))
+         (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR))
 #include "arm/armcpu.c"
 #if !defined(FIXED_POINT)
 #if defined(HAVE_ARM_NE10)
@@ -62,6 +62,8 @@
 
 #elif defined(OPUS_X86_MAY_HAVE_SSE2) || defined(OPUS_X86_MAY_HAVE_SSE4_1)
 #include "x86/x86cpu.c"
+#include "celt/x86/pitch_sse.c"
+#include "x86/x86_celt_map.c"
 #endif
 
 #ifndef M_PI
diff --git a/celt/tests/test_unit_rotation.c b/celt/tests/test_unit_rotation.c
index fb18df0..4ac838e 100644
--- a/celt/tests/test_unit_rotation.c
+++ b/celt/tests/test_unit_rotation.c
@@ -46,11 +46,20 @@
 #include "bands.h"
 #include "pitch.c"
 #include "celt_lpc.c"
+#include "celt.c"
 #include <math.h>
 
-#if defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2)
+#if defined(OPUS_X86_MAY_HAVE_SSE) || \
+    defined(OPUS_X86_MAY_HAVE_SSE2) || \
+    defined(OPUS_X86_MAY_HAVE_SSE4_1)
+#if defined(OPUS_X86_MAY_HAVE_SSE)
 #include "x86/pitch_sse.c"
+#endif
+#if defined(OPUS_X86_MAY_HAVE_SSE2)
+#include "x86/pitch_sse2.c"
+#endif
 #if defined(OPUS_X86_MAY_HAVE_SSE4_1)
+#include "x86/pitch_sse4_1.c"
 #include "x86/celt_lpc_sse.c"
 #endif
 #include "x86/x86_celt_map.c"
diff --git a/celt/x86/celt_lpc_sse.c b/celt/x86/celt_lpc_sse.c
index 9fb9779..67e5592 100644
--- a/celt/x86/celt_lpc_sse.c
+++ b/celt/x86/celt_lpc_sse.c
@@ -38,6 +38,8 @@
 #include "pitch.h"
 #include "x86cpu.h"
 
+#if defined(FIXED_POINT)
+
 void celt_fir_sse4_1(const opus_val16 *_x,
          const opus_val16 *num,
          opus_val16 *_y,
@@ -126,3 +128,5 @@ void celt_fir_sse4_1(const opus_val16 *_x,
 #endif
    RESTORE_STACK;
 }
+
+#endif
diff --git a/celt/x86/celt_lpc_sse.h b/celt/x86/celt_lpc_sse.h
index f111420..c5ec796 100644
--- a/celt/x86/celt_lpc_sse.h
+++ b/celt/x86/celt_lpc_sse.h
@@ -32,7 +32,9 @@
 #include "config.h"
 #endif
 
-#if defined(OPUS_X86_MAY_HAVE_SSE4_1)
+#if defined(OPUS_X86_MAY_HAVE_SSE4_1) && defined(FIXED_POINT)
+#define OVERRIDE_CELT_FIR
+
 void celt_fir_sse4_1(
          const opus_val16 *x,
          const opus_val16 *num,
@@ -42,6 +44,12 @@ void celt_fir_sse4_1(
          opus_val16 *mem,
          int arch);
 
+#if defined(OPUS_X86_PRESUME_SSE4_1)
+#define celt_fir(x, num, y, N, ord, mem, arch) \
+    ((void)arch, celt_fir_sse4_1(x, num, y, N, ord, mem, arch))
+
+#else
+
 extern void (*const CELT_FIR_IMPL[OPUS_ARCHMASK + 1])(
          const opus_val16 *x,
          const opus_val16 *num,
@@ -56,3 +64,5 @@ extern void (*const CELT_FIR_IMPL[OPUS_ARCHMASK + 1])(
 
 #endif
 #endif
+
+#endif
diff --git a/celt/x86/pitch_sse.c b/celt/x86/pitch_sse.c
index e3bc6d7..20e7312 100644
--- a/celt/x86/pitch_sse.c
+++ b/celt/x86/pitch_sse.c
@@ -29,223 +29,157 @@
 #include "config.h"
 #endif
 
-#include <xmmintrin.h>
-#include <emmintrin.h>
-
 #include "macros.h"
 #include "celt_lpc.h"
 #include "stack_alloc.h"
 #include "mathops.h"
 #include "pitch.h"
 
-#if defined(OPUS_X86_MAY_HAVE_SSE4_1)
-#include <smmintrin.h>
-#include "x86cpu.h"
-
-opus_val32 celt_inner_prod_sse4_1(const opus_val16 *x, const opus_val16 *y,
-      int N)
-{
-    opus_int  i, dataSize16;
-    opus_int32 sum;
-    __m128i inVec1_76543210, inVec1_FEDCBA98, acc1;
-    __m128i inVec2_76543210, inVec2_FEDCBA98, acc2;
-    __m128i inVec1_3210, inVec2_3210;
-
-    sum = 0;
-    dataSize16 = N & ~15;
-
-    acc1 = _mm_setzero_si128();
-    acc2 = _mm_setzero_si128();
-
-    for (i=0;i<dataSize16;i+=16) {
-        inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0]));
-        inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0]));
-
-        inVec1_FEDCBA98 = _mm_loadu_si128((__m128i *)(&x[i + 8]));
-        inVec2_FEDCBA98 = _mm_loadu_si128((__m128i *)(&y[i + 8]));
-
-        inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210);
-        inVec1_FEDCBA98 = _mm_madd_epi16(inVec1_FEDCBA98, inVec2_FEDCBA98);
-
-        acc1 = _mm_add_epi32(acc1, inVec1_76543210);
-        acc2 = _mm_add_epi32(acc2, inVec1_FEDCBA98);
-    }
+#if defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT)
 
-    acc1 = _mm_add_epi32(acc1, acc2);
-
-    if (N - i >= 8)
-    {
-        inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0]));
-        inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0]));
-
-        inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210);
-
-        acc1 = _mm_add_epi32(acc1, inVec1_76543210);
-        i += 8;
-    }
-
-    if (N - i >= 4)
-    {
-        inVec1_3210 = OP_CVTEPI16_EPI32_M64(&x[i + 0]);
-        inVec2_3210 = OP_CVTEPI16_EPI32_M64(&y[i + 0]);
-
-        inVec1_3210 = _mm_mullo_epi32(inVec1_3210, inVec2_3210);
-
-        acc1 = _mm_add_epi32(acc1, inVec1_3210);
-        i += 4;
-    }
-
-    acc1 = _mm_add_epi32(acc1, _mm_unpackhi_epi64(acc1, acc1));
-    acc1 = _mm_add_epi32(acc1, _mm_shufflelo_epi16(acc1, 0x0E));
-
-    sum += _mm_cvtsi128_si32(acc1);
-
-    for (;i<N;i++)
-    {
-        sum = silk_SMLABB(sum, x[i], y[i]);
-    }
+#include <xmmintrin.h>
+#include "arch.h"
 
-    return sum;
+void xcorr_kernel_sse(const opus_val16 *x, const opus_val16 *y, opus_val32 sum[4], int len)
+{
+   int j;
+   __m128 xsum1, xsum2;
+   xsum1 = _mm_loadu_ps(sum);
+   xsum2 = _mm_setzero_ps();
+
+   for (j = 0; j < len-3; j += 4)
+   {
+      __m128 x0 = _mm_loadu_ps(x+j);
+      __m128 yj = _mm_loadu_ps(y+j);
+      __m128 y3 = _mm_loadu_ps(y+j+3);
+
+      xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),yj));
+      xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55),
+                                          _mm_shuffle_ps(yj,y3,0x49)));
+      xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa),
+                                          _mm_shuffle_ps(yj,y3,0x9e)));
+      xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3));
+   }
+   if (j < len)
+   {
+      xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
+      if (++j < len)
+      {
+         xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
+         if (++j < len)
+         {
+            xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
+         }
+      }
+   }
+   _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2));
 }
 
-void xcorr_kernel_sse4_1(const opus_val16 * x, const opus_val16 * y, opus_val32 sum[ 4 ], int len)
+
+void dual_inner_prod_sse(const opus_val16 *x, const opus_val16 *y01, const opus_val16 *y02,
+      int N, opus_val32 *xy1, opus_val32 *xy2)
 {
-    int j;
-
-    __m128i vecX, vecX0, vecX1, vecX2, vecX3;
-    __m128i vecY0, vecY1, vecY2, vecY3;
-    __m128i sum0, sum1, sum2, sum3, vecSum;
-    __m128i initSum;
-
-    celt_assert(len >= 3);
-
-    sum0 = _mm_setzero_si128();
-    sum1 = _mm_setzero_si128();
-    sum2 = _mm_setzero_si128();
-    sum3 = _mm_setzero_si128();
-
-    for (j=0;j<(len-7);j+=8)
-    {
-        vecX = _mm_loadu_si128((__m128i *)(&x[j + 0]));
-        vecY0 = _mm_loadu_si128((__m128i *)(&y[j + 0]));
-        vecY1 = _mm_loadu_si128((__m128i *)(&y[j + 1]));
-        vecY2 = _mm_loadu_si128((__m128i *)(&y[j + 2]));
-        vecY3 = _mm_loadu_si128((__m128i *)(&y[j + 3]));
-
-        sum0 = _mm_add_epi32(sum0, _mm_madd_epi16(vecX, vecY0));
-        sum1 = _mm_add_epi32(sum1, _mm_madd_epi16(vecX, vecY1));
-        sum2 = _mm_add_epi32(sum2, _mm_madd_epi16(vecX, vecY2));
-        sum3 = _mm_add_epi32(sum3, _mm_madd_epi16(vecX, vecY3));
-    }
-
-    sum0 = _mm_add_epi32(sum0, _mm_unpackhi_epi64( sum0, sum0));
-    sum0 = _mm_add_epi32(sum0, _mm_shufflelo_epi16( sum0, 0x0E));
-
-    sum1 = _mm_add_epi32(sum1, _mm_unpackhi_epi64( sum1, sum1));
-    sum1 = _mm_add_epi32(sum1, _mm_shufflelo_epi16( sum1, 0x0E));
-
-    sum2 = _mm_add_epi32(sum2, _mm_unpackhi_epi64( sum2, sum2));
-    sum2 = _mm_add_epi32(sum2, _mm_shufflelo_epi16( sum2, 0x0E));
-
-    sum3 = _mm_add_epi32(sum3, _mm_unpackhi_epi64( sum3, sum3));
-    sum3 = _mm_add_epi32(sum3, _mm_shufflelo_epi16( sum3, 0x0E));
-
-    vecSum = _mm_unpacklo_epi64(_mm_unpacklo_epi32(sum0, sum1),
-          _mm_unpacklo_epi32(sum2, sum3));
-
-    for (;j<(len-3);j+=4)
-    {
-        vecX = OP_CVTEPI16_EPI32_M64(&x[j + 0]);
-        vecX0 = _mm_shuffle_epi32(vecX, 0x00);
-        vecX1 = _mm_shuffle_epi32(vecX, 0x55);
-        vecX2 = _mm_shuffle_epi32(vecX, 0xaa);
-        vecX3 = _mm_shuffle_epi32(vecX, 0xff);
-
-        vecY0 = OP_CVTEPI16_EPI32_M64(&y[j + 0]);
-        vecY1 = OP_CVTEPI16_EPI32_M64(&y[j + 1]);
-        vecY2 = OP_CVTEPI16_EPI32_M64(&y[j + 2]);
-        vecY3 = OP_CVTEPI16_EPI32_M64(&y[j + 3]);
-
-        sum0 = _mm_mullo_epi32(vecX0, vecY0);
-        sum1 = _mm_mullo_epi32(vecX1, vecY1);
-        sum2 = _mm_mullo_epi32(vecX2, vecY2);
-        sum3 = _mm_mullo_epi32(vecX3, vecY3);
-
-        sum0 = _mm_add_epi32(sum0, sum1);
-        sum2 = _mm_add_epi32(sum2, sum3);
-        vecSum = _mm_add_epi32(vecSum, sum0);
-        vecSum = _mm_add_epi32(vecSum, sum2);
-    }
-
-    for (;j<len;j++)
-    {
-        vecX = OP_CVTEPI16_EPI32_M64(&x[j + 0]);
-        vecX0 = _mm_shuffle_epi32(vecX, 0x00);
-
-        vecY0 = OP_CVTEPI16_EPI32_M64(&y[j + 0]);
-
-        sum0 = _mm_mullo_epi32(vecX0, vecY0);
-        vecSum = _mm_add_epi32(vecSum, sum0);
-    }
-
-    initSum = _mm_loadu_si128((__m128i *)(&sum[0]));
-    initSum = _mm_add_epi32(initSum, vecSum);
-    _mm_storeu_si128((__m128i *)sum, initSum);
+   int i;
+   __m128 xsum1, xsum2;
+   xsum1 = _mm_setzero_ps();
+   xsum2 = _mm_setzero_ps();
+   for (i=0;i<N-3;i+=4)
+   {
+      __m128 xi = _mm_loadu_ps(x+i);
+      __m128 y1i = _mm_loadu_ps(y01+i);
+      __m128 y2i = _mm_loadu_ps(y02+i);
+      xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(xi, y1i));
+      xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(xi, y2i));
+   }
+   /* Horizontal sum */
+   xsum1 = _mm_add_ps(xsum1, _mm_movehl_ps(xsum1, xsum1));
+   xsum1 = _mm_add_ss(xsum1, _mm_shuffle_ps(xsum1, xsum1, 0x55));
+   _mm_store_ss(xy1, xsum1);
+   xsum2 = _mm_add_ps(xsum2, _mm_movehl_ps(xsum2, xsum2));
+   xsum2 = _mm_add_ss(xsum2, _mm_shuffle_ps(xsum2, xsum2, 0x55));
+   _mm_store_ss(xy2, xsum2);
+   for (;i<N;i++)
+   {
+      *xy1 = MAC16_16(*xy1, x[i], y01[i]);
+      *xy2 = MAC16_16(*xy2, x[i], y02[i]);
+   }
 }
-#endif
 
-#if defined(OPUS_X86_MAY_HAVE_SSE2)
-opus_val32 celt_inner_prod_sse2(const opus_val16 *x, const opus_val16 *y,
+opus_val32 celt_inner_prod_sse(const opus_val16 *x, const opus_val16 *y,
       int N)
 {
-    opus_int  i, dataSize16;
-    opus_int32 sum;
-
-    __m128i inVec1_76543210, inVec1_FEDCBA98, acc1;
-    __m128i inVec2_76543210, inVec2_FEDCBA98, acc2;
-
-    sum = 0;
-    dataSize16 = N & ~15;
-
-    acc1 = _mm_setzero_si128();
-    acc2 = _mm_setzero_si128();
-
-    for (i=0;i<dataSize16;i+=16)
-    {
-        inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0]));
-        inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0]));
-
-        inVec1_FEDCBA98 = _mm_loadu_si128((__m128i *)(&x[i + 8]));
-        inVec2_FEDCBA98 = _mm_loadu_si128((__m128i *)(&y[i + 8]));
-
-        inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210);
-        inVec1_FEDCBA98 = _mm_madd_epi16(inVec1_FEDCBA98, inVec2_FEDCBA98);
-
-        acc1 = _mm_add_epi32(acc1, inVec1_76543210);
-        acc2 = _mm_add_epi32(acc2, inVec1_FEDCBA98);
-    }
-
-    acc1 = _mm_add_epi32( acc1, acc2 );
-
-    if (N - i >= 8)
-    {
-        inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0]));
-        inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0]));
-
-        inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210);
+   int i;
+   float xy;
+   __m128 sum;
+   sum = _mm_setzero_ps();
+   /* FIXME: We should probably go 8-way and use 2 sums. */
+   for (i=0;i<N-3;i+=4)
+   {
+      __m128 xi = _mm_loadu_ps(x+i);
+      __m128 yi = _mm_loadu_ps(y+i);
+      sum = _mm_add_ps(sum,_mm_mul_ps(xi, yi));
+   }
+   /* Horizontal sum */
+   sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum));
+   sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 0x55));
+   _mm_store_ss(&xy, sum);
+   for (;i<N;i++)
+   {
+      xy = MAC16_16(xy, x[i], y[i]);
+   }
+   return xy;
+}
 
-        acc1 = _mm_add_epi32(acc1, inVec1_76543210);
-        i += 8;
-    }
+void comb_filter_const_sse(opus_val32 *y, opus_val32 *x, int T, int N,
+      opus_val16 g10, opus_val16 g11, opus_val16 g12)
+{
+   int i;
+   __m128 x0v;
+   __m128 g10v, g11v, g12v;
+   g10v = _mm_load1_ps(&g10);
+   g11v = _mm_load1_ps(&g11);
+   g12v = _mm_load1_ps(&g12);
+   x0v = _mm_loadu_ps(&x[-T-2]);
+   for (i=0;i<N-3;i+=4)
+   {
+      __m128 yi, yi2, x1v, x2v, x3v, x4v;
+      const opus_val32 *xp = &x[i-T-2];
+      yi = _mm_loadu_ps(x+i);
+      x4v = _mm_loadu_ps(xp+4);
+#if 0
+      /* Slower version with all loads */
+      x1v = _mm_loadu_ps(xp+1);
+      x2v = _mm_loadu_ps(xp+2);
+      x3v = _mm_loadu_ps(xp+3);
+#else
+      x2v = _mm_shuffle_ps(x0v, x4v, 0x4e);
+      x1v = _mm_shuffle_ps(x0v, x2v, 0x99);
+      x3v = _mm_shuffle_ps(x2v, x4v, 0x99);
+#endif
 
-    acc1 = _mm_add_epi32(acc1, _mm_unpackhi_epi64( acc1, acc1));
-    acc1 = _mm_add_epi32(acc1, _mm_shufflelo_epi16( acc1, 0x0E));
-    sum += _mm_cvtsi128_si32(acc1);
+      yi = _mm_add_ps(yi, _mm_mul_ps(g10v,x2v));
+#if 0 /* Set to 1 to make it bit-exact with the non-SSE version */
+      yi = _mm_add_ps(yi, _mm_mul_ps(g11v,_mm_add_ps(x3v,x1v)));
+      yi = _mm_add_ps(yi, _mm_mul_ps(g12v,_mm_add_ps(x4v,x0v)));
+#else
+      /* Use partial sums */
+      yi2 = _mm_add_ps(_mm_mul_ps(g11v,_mm_add_ps(x3v,x1v)),
+                       _mm_mul_ps(g12v,_mm_add_ps(x4v,x0v)));
+      yi = _mm_add_ps(yi, yi2);
+#endif
+      x0v=x4v;
+      _mm_storeu_ps(y+i, yi);
+   }
+#ifdef CUSTOM_MODES
+   for (;i<N;i++)
+   {
+      y[i] = x[i]
+               + MULT16_32_Q15(g10,x[i-T])
+               + MULT16_32_Q15(g11,ADD32(x[i-T+1],x[i-T-1]))
+               + MULT16_32_Q15(g12,ADD32(x[i-T+2],x[i-T-2]));
+   }
+#endif
+}
 
-    for (;i<N;i++) {
-        sum = silk_SMLABB(sum, x[i], y[i]);
-    }
 
-    return sum;
-}
 #endif
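
Note on comb_filter_const_sse above: per output sample it evaluates the same
5-tap symmetric filter as the scalar tail loop, namely
y[i] = x[i] + g10*x[i-T] + g11*(x[i-T+1]+x[i-T-1]) + g12*(x[i-T+2]+x[i-T-2]).
The default SSE path adds the g11 and g12 terms together first ("partial
sums"), which changes the floating-point rounding order relative to the C
version, as the "#if 0" comment in the code indicates. A minimal float-only
scalar sketch of the formula follows, useful as a reference when checking a
vector version. The demo_* names are hypothetical, and the caller is assumed
to provide at least T+2 samples of history before x.

```c
#include <assert.h>

/* Scalar reference for the constant comb filter (float build only).
 * Assumes x[-T-2] .. x[N-1] are readable. */
static void demo_comb_filter_const(float *y, const float *x, int T, int N,
                                   float g10, float g11, float g12)
{
    int i;
    assert(T >= 2 && N >= 0);
    for (i = 0; i < N; i++) {
        y[i] = x[i]
             + g10 * x[i - T]
             + g11 * (x[i - T + 1] + x[i - T - 1])
             + g12 * (x[i - T + 2] + x[i - T - 2]);
    }
}

/* Runs the filter over a fixed ramp (T = 4, N = 4, gains 1, 0.5, 0.25)
 * and returns output sample i, for i in 0..3. */
static float demo_comb_output(int i)
{
    static const float buf[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    float y[4];
    assert(i >= 0 && i < 4);
    /* x starts at buf+6 so that x[-T-2] == buf[0] is still in bounds. */
    demo_comb_filter_const(y, buf + 6, 4, 4, 1.0f, 0.5f, 0.25f);
    return y[i];
}
```

On a linear ramp the symmetric taps collapse to x[i] + 2.5*x[i-T] with these
gains, so the first output is 7 + 2.5*3 = 14.5.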
diff --git a/celt/x86/pitch_sse.h b/celt/x86/pitch_sse.h
index 99d1919..cbe722c 100644
--- a/celt/x86/pitch_sse.h
+++ b/celt/x86/pitch_sse.h
@@ -37,17 +37,37 @@
 #include "config.h"
 #endif
 
-#if defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2)
-#if defined(OPUS_X86_MAY_HAVE_SSE4_1)
+#if defined(OPUS_X86_MAY_HAVE_SSE4_1) && defined(FIXED_POINT)
 void xcorr_kernel_sse4_1(
                     const opus_int16 *x,
                     const opus_int16 *y,
                     opus_val32       sum[4],
                     int              len);
+#endif
+
+#if defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT)
+void xcorr_kernel_sse(
+                    const opus_val16 *x,
+                    const opus_val16 *y,
+                    opus_val32       sum[4],
+                    int              len);
+#endif
+
+#if defined(OPUS_X86_PRESUME_SSE4_1) && defined(FIXED_POINT)
+#define OVERRIDE_XCORR_KERNEL
+#define xcorr_kernel(x, y, sum, len, arch) \
+    ((void)arch, xcorr_kernel_sse4_1(x, y, sum, len))
+
+#elif defined(OPUS_X86_PRESUME_SSE) && !defined(FIXED_POINT)
+#define OVERRIDE_XCORR_KERNEL
+#define xcorr_kernel(x, y, sum, len, arch) \
+    ((void)arch, xcorr_kernel_sse(x, y, sum, len))
+
+#elif (defined(OPUS_X86_MAY_HAVE_SSE4_1) && defined(FIXED_POINT)) || (defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT))
 
 extern void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])(
-                    const opus_int16 *x,
-                    const opus_int16 *y,
+                    const opus_val16 *x,
+                    const opus_val16 *y,
                     opus_val32       sum[4],
                     int              len);
 
@@ -55,181 +75,115 @@ extern void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])(
 #define xcorr_kernel(x, y, sum, len, arch) \
     ((*XCORR_KERNEL_IMPL[(arch) & OPUS_ARCHMASK])(x, y, sum, len))
 
+#endif
+
+#if defined(OPUS_X86_MAY_HAVE_SSE4_1) && defined(FIXED_POINT)
 opus_val32 celt_inner_prod_sse4_1(
     const opus_int16 *x,
     const opus_int16 *y,
     int               N);
 #endif
 
-#if defined(OPUS_X86_MAY_HAVE_SSE2)
+#if defined(OPUS_X86_MAY_HAVE_SSE2) && defined(FIXED_POINT)
 opus_val32 celt_inner_prod_sse2(
     const opus_int16 *x,
     const opus_int16 *y,
     int               N);
 #endif
 
+#if defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT)
+opus_val32 celt_inner_prod_sse(
+    const opus_val16 *x,
+    const opus_val16 *y,
+    int               N);
+#endif
+
+
+#if defined(OPUS_X86_PRESUME_SSE4_1) && defined(FIXED_POINT)
+#define OVERRIDE_CELT_INNER_PROD
+#define celt_inner_prod(x, y, N, arch) \
+	((void)arch, celt_inner_prod_sse4_1(x, y, N))
+
+#elif defined(OPUS_X86_PRESUME_SSE2) && defined(FIXED_POINT) && !defined(OPUS_X86_MAY_HAVE_SSE4_1)
+#define OVERRIDE_CELT_INNER_PROD
+#define celt_inner_prod(x, y, N, arch) \
+	((void)arch, celt_inner_prod_sse2(x, y, N))
+
+#elif defined(OPUS_X86_PRESUME_SSE) && !defined(FIXED_POINT)
+#define OVERRIDE_CELT_INNER_PROD
+#define celt_inner_prod(x, y, N, arch) \
+	((void)arch, celt_inner_prod_sse(x, y, N))
+
+
+#elif ((defined(OPUS_X86_MAY_HAVE_SSE4_1) || defined(OPUS_X86_MAY_HAVE_SSE2)) && defined(FIXED_POINT)) || \
+	(defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT))
+
 extern opus_val32 (*const CELT_INNER_PROD_IMPL[OPUS_ARCHMASK + 1])(
-                    const opus_int16 *x,
-                    const opus_int16 *y,
+                    const opus_val16 *x,
+                    const opus_val16 *y,
                     int               N);
 
 #define OVERRIDE_CELT_INNER_PROD
 #define celt_inner_prod(x, y, N, arch) \
     ((*CELT_INNER_PROD_IMPL[(arch) & OPUS_ARCHMASK])(x, y, N))
-#else
 
-#include <xmmintrin.h>
-#include "arch.h"
+#endif
 
-#define OVERRIDE_XCORR_KERNEL
-static OPUS_INLINE void xcorr_kernel_sse(const opus_val16 *x, const opus_val16 *y, opus_val32 sum[4], int len)
-{
-   int j;
-   __m128 xsum1, xsum2;
-   xsum1 = _mm_loadu_ps(sum);
-   xsum2 = _mm_setzero_ps();
-
-   for (j = 0; j < len-3; j += 4)
-   {
-      __m128 x0 = _mm_loadu_ps(x+j);
-      __m128 yj = _mm_loadu_ps(y+j);
-      __m128 y3 = _mm_loadu_ps(y+j+3);
-
-      xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x00),yj));
-      xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0x55),
-                                          _mm_shuffle_ps(yj,y3,0x49)));
-      xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xaa),
-                                          _mm_shuffle_ps(yj,y3,0x9e)));
-      xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_shuffle_ps(x0,x0,0xff),y3));
-   }
-   if (j < len)
-   {
-      xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
-      if (++j < len)
-      {
-         xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
-         if (++j < len)
-         {
-            xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(_mm_load1_ps(x+j),_mm_loadu_ps(y+j)));
-         }
-      }
-   }
-   _mm_storeu_ps(sum,_mm_add_ps(xsum1,xsum2));
-}
-
-#define xcorr_kernel(_x, _y, _z, len, arch) \
-    ((void)(arch),xcorr_kernel_sse(_x, _y, _z, len))
+#if defined(OPUS_X86_MAY_HAVE_SSE) && !defined(FIXED_POINT)
 
 #define OVERRIDE_DUAL_INNER_PROD
-static OPUS_INLINE void dual_inner_prod(const opus_val16 *x, const opus_val16 *y01, const opus_val16 *y02,
-      int N, opus_val32 *xy1, opus_val32 *xy2)
-{
-   int i;
-   __m128 xsum1, xsum2;
-   xsum1 = _mm_setzero_ps();
-   xsum2 = _mm_setzero_ps();
-   for (i=0;i<N-3;i+=4)
-   {
-      __m128 xi = _mm_loadu_ps(x+i);
-      __m128 y1i = _mm_loadu_ps(y01+i);
-      __m128 y2i = _mm_loadu_ps(y02+i);
-      xsum1 = _mm_add_ps(xsum1,_mm_mul_ps(xi, y1i));
-      xsum2 = _mm_add_ps(xsum2,_mm_mul_ps(xi, y2i));
-   }
-   /* Horizontal sum */
-   xsum1 = _mm_add_ps(xsum1, _mm_movehl_ps(xsum1, xsum1));
-   xsum1 = _mm_add_ss(xsum1, _mm_shuffle_ps(xsum1, xsum1, 0x55));
-   _mm_store_ss(xy1, xsum1);
-   xsum2 = _mm_add_ps(xsum2, _mm_movehl_ps(xsum2, xsum2));
-   xsum2 = _mm_add_ss(xsum2, _mm_shuffle_ps(xsum2, xsum2, 0x55));
-   _mm_store_ss(xy2, xsum2);
-   for (;i<N;i++)
-   {
-      *xy1 = MAC16_16(*xy1, x[i], y01[i]);
-      *xy2 = MAC16_16(*xy2, x[i], y02[i]);
-   }
-}
+#define OVERRIDE_COMB_FILTER_CONST
 
-#define OVERRIDE_CELT_INNER_PROD
-static OPUS_INLINE opus_val32 celt_inner_prod_sse(const opus_val16 *x, const opus_val16 *y,
-      int N)
-{
-   int i;
-   float xy;
-   __m128 sum;
-   sum = _mm_setzero_ps();
-   /* FIXME: We should probably go 8-way and use 2 sums. */
-   for (i=0;i<N-3;i+=4)
-   {
-      __m128 xi = _mm_loadu_ps(x+i);
-      __m128 yi = _mm_loadu_ps(y+i);
-      sum = _mm_add_ps(sum,_mm_mul_ps(xi, yi));
-   }
-   /* Horizontal sum */
-   sum = _mm_add_ps(sum, _mm_movehl_ps(sum, sum));
-   sum = _mm_add_ss(sum, _mm_shuffle_ps(sum, sum, 0x55));
-   _mm_store_ss(&xy, sum);
-   for (;i<N;i++)
-   {
-      xy = MAC16_16(xy, x[i], y[i]);
-   }
-   return xy;
-}
-
-#  define celt_inner_prod(_x, _y, len, arch) \
-    ((void)(arch),celt_inner_prod_sse(_x, _y, len))
+void dual_inner_prod_sse(const opus_val16 *x,
+	const opus_val16 *y01,
+	const opus_val16 *y02,
+	int               N,
+	opus_val32       *xy1,
+	opus_val32       *xy2);
+
+void comb_filter_const_sse(opus_val32 *y,
+	opus_val32 *x,
+	int         T,
+	int         N,
+	opus_val16  g10,
+	opus_val16  g11,
+	opus_val16  g12);
+
+
+#if defined(OPUS_X86_PRESUME_SSE)
+# define dual_inner_prod(x, y01, y02, N, xy1, xy2, arch) \
+    ((void)(arch),dual_inner_prod_sse(x, y01, y02, N, xy1, xy2))
 
 #define OVERRIDE_COMB_FILTER_CONST
-static OPUS_INLINE void comb_filter_const(opus_val32 *y, opus_val32 *x, int T, int N,
-      opus_val16 g10, opus_val16 g11, opus_val16 g12)
-{
-   int i;
-   __m128 x0v;
-   __m128 g10v, g11v, g12v;
-   g10v = _mm_load1_ps(&g10);
-   g11v = _mm_load1_ps(&g11);
-   g12v = _mm_load1_ps(&g12);
-   x0v = _mm_loadu_ps(&x[-T-2]);
-   for (i=0;i<N-3;i+=4)
-   {
-      __m128 yi, yi2, x1v, x2v, x3v, x4v;
-      const opus_val32 *xp = &x[i-T-2];
-      yi = _mm_loadu_ps(x+i);
-      x4v = _mm_loadu_ps(xp+4);
-#if 0
-      /* Slower version with all loads */
-      x1v = _mm_loadu_ps(xp+1);
-      x2v = _mm_loadu_ps(xp+2);
-      x3v = _mm_loadu_ps(xp+3);
-#else
-      x2v = _mm_shuffle_ps(x0v, x4v, 0x4e);
-      x1v = _mm_shuffle_ps(x0v, x2v, 0x99);
-      x3v = _mm_shuffle_ps(x2v, x4v, 0x99);
-#endif
 
-      yi = _mm_add_ps(yi, _mm_mul_ps(g10v,x2v));
-#if 0 /* Set to 1 to make it bit-exact with the non-SSE version */
-      yi = _mm_add_ps(yi, _mm_mul_ps(g11v,_mm_add_ps(x3v,x1v)));
-      yi = _mm_add_ps(yi, _mm_mul_ps(g12v,_mm_add_ps(x4v,x0v)));
 #else
-      /* Use partial sums */
-      yi2 = _mm_add_ps(_mm_mul_ps(g11v,_mm_add_ps(x3v,x1v)),
-                       _mm_mul_ps(g12v,_mm_add_ps(x4v,x0v)));
-      yi = _mm_add_ps(yi, yi2);
+
+extern void (*const DUAL_INNER_PROD_IMPL[OPUS_ARCHMASK + 1])(
+              const opus_val16 *x,
+              const opus_val16 *y01,
+              const opus_val16 *y02,
+              int               N,
+              opus_val32       *xy1,
+              opus_val32       *xy2);
+
+#define dual_inner_prod(x, y01, y02, N, xy1, xy2, arch)			\
+    ((*DUAL_INNER_PROD_IMPL[(arch) & OPUS_ARCHMASK])(x, y01, y02, N, xy1, xy2))
+
+extern void (*const COMB_FILTER_CONST_IMPL[OPUS_ARCHMASK + 1])(
+              opus_val32 *y,
+              opus_val32 *x,
+              int         T,
+              int         N,
+              opus_val16  g10,
+              opus_val16  g11,
+              opus_val16  g12);
+
+#define comb_filter_const(y, x, T, N, g10, g11, g12, arch)				\
+    ((*COMB_FILTER_CONST_IMPL[(arch) & OPUS_ARCHMASK])(y, x, T, N, g10, g11, g12))
+
+#define NON_STATIC_COMB_FILTER_CONST_C
+
 #endif
-      x0v=x4v;
-      _mm_storeu_ps(y+i, yi);
-   }
-#ifdef CUSTOM_MODES
-   for (;i<N;i++)
-   {
-      y[i] = x[i]
-               + MULT16_32_Q15(g10,x[i-T])
-               + MULT16_32_Q15(g11,ADD32(x[i-T+1],x[i-T-1]))
-               + MULT16_32_Q15(g12,ADD32(x[i-T+2],x[i-T-2]));
-   }
 #endif
-}
 
 #endif
-#endif
diff --git a/celt/x86/pitch_sse2.c b/celt/x86/pitch_sse2.c
new file mode 100644
index 0000000..a0e7d1b
--- /dev/null
+++ b/celt/x86/pitch_sse2.c
@@ -0,0 +1,95 @@
+/* Copyright (c) 2014, Cisco Systems, INC
+   Written by XiangMingZhu WeiZhou MinPeng YanWang
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions
+   are met:
+
+   - Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+   - Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
+   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif
+
+#include <xmmintrin.h>
+#include <emmintrin.h>
+
+#include "macros.h"
+#include "celt_lpc.h"
+#include "stack_alloc.h"
+#include "mathops.h"
+#include "pitch.h"
+
+#if defined(OPUS_X86_MAY_HAVE_SSE2) && defined(FIXED_POINT)
+opus_val32 celt_inner_prod_sse2(const opus_val16 *x, const opus_val16 *y,
+      int N)
+{
+    opus_int  i, dataSize16;
+    opus_int32 sum;
+
+    __m128i inVec1_76543210, inVec1_FEDCBA98, acc1;
+    __m128i inVec2_76543210, inVec2_FEDCBA98, acc2;
+
+    sum = 0;
+    dataSize16 = N & ~15;
+
+    acc1 = _mm_setzero_si128();
+    acc2 = _mm_setzero_si128();
+
+    for (i=0;i<dataSize16;i+=16)
+    {
+        inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0]));
+        inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0]));
+
+        inVec1_FEDCBA98 = _mm_loadu_si128((__m128i *)(&x[i + 8]));
+        inVec2_FEDCBA98 = _mm_loadu_si128((__m128i *)(&y[i + 8]));
+
+        inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210);
+        inVec1_FEDCBA98 = _mm_madd_epi16(inVec1_FEDCBA98, inVec2_FEDCBA98);
+
+        acc1 = _mm_add_epi32(acc1, inVec1_76543210);
+        acc2 = _mm_add_epi32(acc2, inVec1_FEDCBA98);
+    }
+
+    acc1 = _mm_add_epi32( acc1, acc2 );
+
+    if (N - i >= 8)
+    {
+        inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0]));
+        inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0]));
+
+        inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210);
+
+        acc1 = _mm_add_epi32(acc1, inVec1_76543210);
+        i += 8;
+    }
+
+    acc1 = _mm_add_epi32(acc1, _mm_unpackhi_epi64( acc1, acc1));
+    acc1 = _mm_add_epi32(acc1, _mm_shufflelo_epi16( acc1, 0x0E));
+    sum += _mm_cvtsi128_si32(acc1);
+
+    for (;i<N;i++) {
+        sum = silk_SMLABB(sum, x[i], y[i]);
+    }
+
+    return sum;
+}
+#endif
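
Review note on the blocking in celt_inner_prod_sse2() above: the loop takes 16 samples per iteration into two accumulators, then at most one further block of 8, then a scalar tail. Below is a minimal scalar sketch of that control flow, assuming 16-bit inputs and a 32-bit sum that does not overflow. The names inner_prod_blocked and inner_prod_ref are illustrative only and are not part of the patch or of Opus; no intrinsics are used, so this only models the loop structure, not the SIMD arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the blocking used by the SSE2 kernel: 16 samples per
   iteration into two accumulators, one optional block of 8, then a
   sample-by-sample tail. Hypothetical helper, not from the patch. */
static int32_t inner_prod_blocked(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc1 = 0, acc2 = 0;
    int i, k;
    int n16 = n & ~15;              /* largest multiple of 16 that is <= n */

    assert(n >= 0);

    for (i = 0; i < n16; i += 16) {
        for (k = 0; k < 8; k++) {
            acc1 += (int32_t)x[i + k] * y[i + k];          /* samples 0..7  */
            acc2 += (int32_t)x[i + 8 + k] * y[i + 8 + k];  /* samples 8..15 */
        }
    }
    acc1 += acc2;

    if (n - i >= 8) {               /* at most one leftover block of 8 */
        for (k = 0; k < 8; k++)
            acc1 += (int32_t)x[i + k] * y[i + k];
        i += 8;
    }

    for (; i < n; i++)              /* scalar tail, at most 7 samples */
        acc1 += (int32_t)x[i] * y[i];

    return acc1;
}

/* Straightforward reference used for comparison. */
static int32_t inner_prod_ref(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += (int32_t)x[i] * y[i];
    return sum;
}
```

Comparing the two functions for every length from 0 up to a few multiples of 16 is a cheap way to exercise each tail path (no tail, block of 8 only, scalar remainder only, both).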
diff --git a/celt/x86/pitch_sse4_1.c b/celt/x86/pitch_sse4_1.c
new file mode 100644
index 0000000..a092c68
--- /dev/null
+++ b/celt/x86/pitch_sse4_1.c
@@ -0,0 +1,195 @@
+/* Copyright (c) 2014, Cisco Systems, INC
+   Written by XiangMingZhu WeiZhou MinPeng YanWang
+
+   Redistribution and use in source and binary forms, with or without
+   modification, are permitted provided that the following conditions
+   are met:
+
+   - Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+   - Redistributions in binary form must reproduce the above copyright
+   notice, this list of conditions and the following disclaimer in the
+   documentation and/or other materials provided with the distribution.
+
+   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+   ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER
+   OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+   LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+   NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+   SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*/
+
+#ifdef HAVE_CONFIG_H
+#include "config.h"
+#endif
+
+#include <xmmintrin.h>
+#include <emmintrin.h>
+
+#include "macros.h"
+#include "celt_lpc.h"
+#include "stack_alloc.h"
+#include "mathops.h"
+#include "pitch.h"
+
+#if defined(OPUS_X86_MAY_HAVE_SSE4_1) && defined(FIXED_POINT)
+#include <smmintrin.h>
+#include "x86cpu.h"
+
+opus_val32 celt_inner_prod_sse4_1(const opus_val16 *x, const opus_val16 *y,
+      int N)
+{
+    opus_int  i, dataSize16;
+    opus_int32 sum;
+    __m128i inVec1_76543210, inVec1_FEDCBA98, acc1;
+    __m128i inVec2_76543210, inVec2_FEDCBA98, acc2;
+    __m128i inVec1_3210, inVec2_3210;
+
+    sum = 0;
+    dataSize16 = N & ~15;
+
+    acc1 = _mm_setzero_si128();
+    acc2 = _mm_setzero_si128();
+
+    for (i=0;i<dataSize16;i+=16) {
+        inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0]));
+        inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0]));
+
+        inVec1_FEDCBA98 = _mm_loadu_si128((__m128i *)(&x[i + 8]));
+        inVec2_FEDCBA98 = _mm_loadu_si128((__m128i *)(&y[i + 8]));
+
+        inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210);
+        inVec1_FEDCBA98 = _mm_madd_epi16(inVec1_FEDCBA98, inVec2_FEDCBA98);
+
+        acc1 = _mm_add_epi32(acc1, inVec1_76543210);
+        acc2 = _mm_add_epi32(acc2, inVec1_FEDCBA98);
+    }
+
+    acc1 = _mm_add_epi32(acc1, acc2);
+
+    if (N - i >= 8)
+    {
+        inVec1_76543210 = _mm_loadu_si128((__m128i *)(&x[i + 0]));
+        inVec2_76543210 = _mm_loadu_si128((__m128i *)(&y[i + 0]));
+
+        inVec1_76543210 = _mm_madd_epi16(inVec1_76543210, inVec2_76543210);
+
+        acc1 = _mm_add_epi32(acc1, inVec1_76543210);
+        i += 8;
+    }
+
+    if (N - i >= 4)
+    {
+        inVec1_3210 = OP_CVTEPI16_EPI32_M64(&x[i + 0]);
+        inVec2_3210 = OP_CVTEPI16_EPI32_M64(&y[i + 0]);
+
+        inVec1_3210 = _mm_mullo_epi32(inVec1_3210, inVec2_3210);
+
+        acc1 = _mm_add_epi32(acc1, inVec1_3210);
+        i += 4;
+    }
+
+    acc1 = _mm_add_epi32(acc1, _mm_unpackhi_epi64(acc1, acc1));
+    acc1 = _mm_add_epi32(acc1, _mm_shufflelo_epi16(acc1, 0x0E));
+
+    sum += _mm_cvtsi128_si32(acc1);
+
+    for (;i<N;i++)
+    {
+        sum = silk_SMLABB(sum, x[i], y[i]);
+    }
+
+    return sum;
+}
+
+void xcorr_kernel_sse4_1(const opus_val16 * x, const opus_val16 * y, opus_val32 sum[ 4 ], int len)
+{
+    int j;
+
+    __m128i vecX, vecX0, vecX1, vecX2, vecX3;
+    __m128i vecY0, vecY1, vecY2, vecY3;
+    __m128i sum0, sum1, sum2, sum3, vecSum;
+    __m128i initSum;
+
+    celt_assert(len >= 3);
+
+    sum0 = _mm_setzero_si128();
+    sum1 = _mm_setzero_si128();
+    sum2 = _mm_setzero_si128();
+    sum3 = _mm_setzero_si128();
+
+    for (j=0;j<(len-7);j+=8)
+    {
+        vecX = _mm_loadu_si128((__m128i *)(&x[j + 0]));
+        vecY0 = _mm_loadu_si128((__m128i *)(&y[j + 0]));
+        vecY1 = _mm_loadu_si128((__m128i *)(&y[j + 1]));
+        vecY2 = _mm_loadu_si128((__m128i *)(&y[j + 2]));
+        vecY3 = _mm_loadu_si128((__m128i *)(&y[j + 3]));
+
+        sum0 = _mm_add_epi32(sum0, _mm_madd_epi16(vecX, vecY0));
+        sum1 = _mm_add_epi32(sum1, _mm_madd_epi16(vecX, vecY1));
+        sum2 = _mm_add_epi32(sum2, _mm_madd_epi16(vecX, vecY2));
+        sum3 = _mm_add_epi32(sum3, _mm_madd_epi16(vecX, vecY3));
+    }
+
+    sum0 = _mm_add_epi32(sum0, _mm_unpackhi_epi64( sum0, sum0));
+    sum0 = _mm_add_epi32(sum0, _mm_shufflelo_epi16( sum0, 0x0E));
+
+    sum1 = _mm_add_epi32(sum1, _mm_unpackhi_epi64( sum1, sum1));
+    sum1 = _mm_add_epi32(sum1, _mm_shufflelo_epi16( sum1, 0x0E));
+
+    sum2 = _mm_add_epi32(sum2, _mm_unpackhi_epi64( sum2, sum2));
+    sum2 = _mm_add_epi32(sum2, _mm_shufflelo_epi16( sum2, 0x0E));
+
+    sum3 = _mm_add_epi32(sum3, _mm_unpackhi_epi64( sum3, sum3));
+    sum3 = _mm_add_epi32(sum3, _mm_shufflelo_epi16( sum3, 0x0E));
+
+    vecSum = _mm_unpacklo_epi64(_mm_unpacklo_epi32(sum0, sum1),
+          _mm_unpacklo_epi32(sum2, sum3));
+
+    for (;j<(len-3);j+=4)
+    {
+        vecX = OP_CVTEPI16_EPI32_M64(&x[j + 0]);
+        vecX0 = _mm_shuffle_epi32(vecX, 0x00);
+        vecX1 = _mm_shuffle_epi32(vecX, 0x55);
+        vecX2 = _mm_shuffle_epi32(vecX, 0xaa);
+        vecX3 = _mm_shuffle_epi32(vecX, 0xff);
+
+        vecY0 = OP_CVTEPI16_EPI32_M64(&y[j + 0]);
+        vecY1 = OP_CVTEPI16_EPI32_M64(&y[j + 1]);
+        vecY2 = OP_CVTEPI16_EPI32_M64(&y[j + 2]);
+        vecY3 = OP_CVTEPI16_EPI32_M64(&y[j + 3]);
+
+        sum0 = _mm_mullo_epi32(vecX0, vecY0);
+        sum1 = _mm_mullo_epi32(vecX1, vecY1);
+        sum2 = _mm_mullo_epi32(vecX2, vecY2);
+        sum3 = _mm_mullo_epi32(vecX3, vecY3);
+
+        sum0 = _mm_add_epi32(sum0, sum1);
+        sum2 = _mm_add_epi32(sum2, sum3);
+        vecSum = _mm_add_epi32(vecSum, sum0);
+        vecSum = _mm_add_epi32(vecSum, sum2);
+    }
+
+    for (;j<len;j++)
+    {
+        vecX = OP_CVTEPI16_EPI32_M64(&x[j + 0]);
+        vecX0 = _mm_shuffle_epi32(vecX, 0x00);
+
+        vecY0 = OP_CVTEPI16_EPI32_M64(&y[j + 0]);
+
+        sum0 = _mm_mullo_epi32(vecX0, vecY0);
+        vecSum = _mm_add_epi32(vecSum, sum0);
+    }
+
+    initSum = _mm_loadu_si128((__m128i *)(&sum[0]));
+    initSum = _mm_add_epi32(initSum, vecSum);
+    _mm_storeu_si128((__m128i *)sum, initSum);
+}
+#endif
diff --git a/celt/x86/x86_celt_map.c b/celt/x86/x86_celt_map.c
index 83410db..1ed2acb 100644
--- a/celt/x86/x86_celt_map.c
+++ b/celt/x86/x86_celt_map.c
@@ -38,6 +38,8 @@
 
 # if defined(FIXED_POINT)
 
+#if defined(OPUS_X86_MAY_HAVE_SSE4_1) && !defined(OPUS_X86_PRESUME_SSE4_1)
+
 void (*const CELT_FIR_IMPL[OPUS_ARCHMASK + 1])(
          const opus_val16 *x,
          const opus_val16 *num,
@@ -49,8 +51,8 @@ void (*const CELT_FIR_IMPL[OPUS_ARCHMASK + 1])(
 ) = {
   celt_fir_c,                /* non-sse */
   celt_fir_c,
+  celt_fir_c,
   MAY_HAVE_SSE4_1(celt_fir), /* sse4.1  */
-  NULL
 };
 
 void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])(
@@ -61,24 +63,86 @@ void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])(
 ) = {
   xcorr_kernel_c,                /* non-sse */
   xcorr_kernel_c,
+  xcorr_kernel_c,
   MAY_HAVE_SSE4_1(xcorr_kernel), /* sse4.1  */
-  NULL
 };
 
+#endif
+
+#if (defined(OPUS_X86_MAY_HAVE_SSE4_1) && !defined(OPUS_X86_PRESUME_SSE4_1)) ||  \
+	(!defined(OPUS_X86_MAY_HAVE_SSE4_1) && defined(OPUS_X86_MAY_HAVE_SSE2) && !defined(OPUS_X86_PRESUME_SSE2))
+
 opus_val32 (*const CELT_INNER_PROD_IMPL[OPUS_ARCHMASK + 1])(
          const opus_val16 *x,
          const opus_val16 *y,
          int              N
 ) = {
   celt_inner_prod_c,                /* non-sse */
+  celt_inner_prod_c,
   MAY_HAVE_SSE2(celt_inner_prod),
   MAY_HAVE_SSE4_1(celt_inner_prod), /* sse4.1  */
-  NULL
 };
 
+#endif
+
 # else
-#  error "Floating-point implementation is not supported by x86 RTCD yet." \
- "Reconfigure with --disable-rtcd or send patches."
-# endif
 
+#if defined(OPUS_X86_MAY_HAVE_SSE) && !defined(OPUS_X86_PRESUME_SSE)
+
+void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])(
+         const opus_val16 *x,
+         const opus_val16 *y,
+         opus_val32       sum[4],
+         int              len
+) = {
+  xcorr_kernel_c,                /* non-sse */
+  MAY_HAVE_SSE(xcorr_kernel),
+  MAY_HAVE_SSE(xcorr_kernel),
+  MAY_HAVE_SSE(xcorr_kernel),
+};
+
+opus_val32 (*const CELT_INNER_PROD_IMPL[OPUS_ARCHMASK + 1])(
+         const opus_val16 *x,
+         const opus_val16 *y,
+         int              N
+) = {
+  celt_inner_prod_c,                /* non-sse */
+  MAY_HAVE_SSE(celt_inner_prod),
+  MAY_HAVE_SSE(celt_inner_prod),
+  MAY_HAVE_SSE(celt_inner_prod),
+};
+
+void (*const DUAL_INNER_PROD_IMPL[OPUS_ARCHMASK + 1])(
+                    const opus_val16 *x,
+                    const opus_val16 *y01,
+                    const opus_val16 *y02,
+                    int               N,
+                    opus_val32       *xy1,
+                    opus_val32       *xy2
+) = {
+  dual_inner_prod_c,                /* non-sse */
+  MAY_HAVE_SSE(dual_inner_prod),
+  MAY_HAVE_SSE(dual_inner_prod),
+  MAY_HAVE_SSE(dual_inner_prod),
+};
+
+void (*const COMB_FILTER_CONST_IMPL[OPUS_ARCHMASK + 1])(
+              opus_val32 *y,
+              opus_val32 *x,
+              int         T,
+              int         N,
+              opus_val16  g10,
+              opus_val16  g11,
+              opus_val16  g12
+) = {
+  comb_filter_const_c,                /* non-sse */
+  MAY_HAVE_SSE(comb_filter_const),
+  MAY_HAVE_SSE(comb_filter_const),
+  MAY_HAVE_SSE(comb_filter_const),
+};
+
+
+#endif
+
+#endif
 #endif
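
Review note on the RTCD tables in x86_celt_map.c above: each table has OPUS_ARCHMASK + 1 entries, entry 0 is always the C routine, and the MAY_HAVE_*() macros paste a suffix onto the base name so that an entry falls back to the _c symbol when the build cannot use that instruction set. The following is a minimal sketch of that pattern under stated assumptions: ARCHMASK, DEMO_MAY_HAVE_SSE, DEMO_MAY_HAVE_SSE_FN and inner_prod_c are stand-in names invented for this example, and only the C fallback is provided.

```c
#include <assert.h>

#define ARCHMASK 3   /* hypothetical stand-in for OPUS_ARCHMASK */

/* Token pasting in the style of MAY_HAVE_SSE(): pick the SIMD symbol when
   the build may use it, otherwise fall back to the C symbol.
   DEMO_MAY_HAVE_SSE is not defined here, so the _c branch is taken. */
#if defined(DEMO_MAY_HAVE_SSE)
# define DEMO_MAY_HAVE_SSE_FN(name) name ## _sse
#else
# define DEMO_MAY_HAVE_SSE_FN(name) name ## _c
#endif

/* Plain C implementation; always valid, whatever the CPU supports. */
static float inner_prod_c(const float *x, const float *y, int n)
{
    float sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* One entry per arch level 0..ARCHMASK; level 0 is always plain C. */
static float (*const INNER_PROD_IMPL[ARCHMASK + 1])(const float *,
                                                    const float *, int) = {
    inner_prod_c,
    DEMO_MAY_HAVE_SSE_FN(inner_prod),
    DEMO_MAY_HAVE_SSE_FN(inner_prod),
    DEMO_MAY_HAVE_SSE_FN(inner_prod),
};

/* Masking the index keeps any arch value inside the table. */
#define inner_prod(x, y, n, arch) \
    ((*INNER_PROD_IMPL[(arch) & ARCHMASK])(x, y, n))
```

Because every slot is populated, there is no NULL entry to trip over if the detected arch level is higher than the build supports, which appears to be the reason the patch replaces the trailing NULL entries with real function pointers.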
diff --git a/celt/x86/x86cpu.c b/celt/x86/x86cpu.c
index c82a4b7..afcdeb6 100644
--- a/celt/x86/x86cpu.c
+++ b/celt/x86/x86cpu.c
@@ -35,10 +35,19 @@
 #include "pitch.h"
 #include "x86cpu.h"
 
+#if (defined(OPUS_X86_MAY_HAVE_SSE) && !defined(OPUS_X86_PRESUME_SSE)) || \
+  (defined(OPUS_X86_MAY_HAVE_SSE2) && !defined(OPUS_X86_PRESUME_SSE2)) || \
+  (defined(OPUS_X86_MAY_HAVE_SSE4_1) && !defined(OPUS_X86_PRESUME_SSE4_1))
+
+
 #if defined(_MSC_VER)
 
 #include <intrin.h>
-#define cpuid(info,x) __cpuid(info,x)
+static _inline void cpuid(unsigned int CPUInfo[4], unsigned int InfoType)
+{
+	__cpuid((int*)CPUInfo, InfoType);
+}
+
 #else
 
 #if defined(CPU_INFO_BY_C)
@@ -48,14 +57,28 @@
 static void cpuid(unsigned int CPUInfo[4], unsigned int InfoType)
 {
 #if defined(CPU_INFO_BY_ASM)
+#if defined(__i386__) && defined(__PIC__)
+/* %ebx is PIC register in 32-bit, so mustn't clobber it. */
+    __asm__ __volatile__ (
+       "xchg %%ebx, %1\n"
+       "cpuid\n"
+       "xchg %%ebx, %1\n":
+        "=a" (CPUInfo[0]),
+        "=r" (CPUInfo[1]),
+        "=c" (CPUInfo[2]),
+        "=d" (CPUInfo[3]) :
+        "0" (InfoType)
+    );
+#else
     __asm__ __volatile__ (
         "cpuid":
         "=a" (CPUInfo[0]),
         "=b" (CPUInfo[1]),
         "=c" (CPUInfo[2]),
         "=d" (CPUInfo[3]) :
-        "a" (InfoType), "c" (0)
+        "0" (InfoType)
     );
+#endif
 #elif defined(CPU_INFO_BY_C)
     __get_cpuid(InfoType, &(CPUInfo[0]), &(CPUInfo[1]), &(CPUInfo[2]), &(CPUInfo[3]));
 #endif
@@ -63,11 +86,9 @@ static void cpuid(unsigned int CPUInfo[4], unsigned int InfoType)
 
 #endif
 
-#include "SigProc_FIX.h"
-#include "celt_lpc.h"
-
 typedef struct CPU_Feature{
     /*  SIMD: 128-bit */
+    int HW_SSE;
     int HW_SSE2;
     int HW_SSE41;
 } CPU_Feature;
@@ -82,19 +103,31 @@ static void opus_cpu_feature_check(CPU_Feature *cpu_feature)
 
     if (nIds >= 1){
         cpuid(info, 1);
+        cpu_feature->HW_SSE = (info[3] & (1 << 25)) != 0;
         cpu_feature->HW_SSE2 = (info[3] & (1 << 26)) != 0;
         cpu_feature->HW_SSE41 = (info[2] & (1 << 19)) != 0;
     }
+    else {
+        cpu_feature->HW_SSE = 0;
+        cpu_feature->HW_SSE2 = 0;
+        cpu_feature->HW_SSE41 = 0;
+    }
 }
 
 int opus_select_arch(void)
 {
-    CPU_Feature cpu_feature = {0};
+    CPU_Feature cpu_feature;
     int arch;
 
     opus_cpu_feature_check(&cpu_feature);
 
     arch = 0;
+    if (!cpu_feature.HW_SSE)
+    {
+       return arch;
+    }
+    arch++;
+
     if (!cpu_feature.HW_SSE2)
     {
        return arch;
@@ -109,3 +142,5 @@ int opus_select_arch(void)
 
     return arch;
 }
+
+#endif
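
Review note on opus_select_arch() above: the arch level is a count of consecutive features present, in the order SSE, SSE2, SSE4.1, so a missing lower level stops the count even if a higher bit is set. Below is a minimal sketch of that mapping, assuming the caller already holds the EDX and ECX values returned by CPUID leaf 1. The function name select_arch_from_cpuid is hypothetical; the bit positions are the ones used in the patch (EDX bit 25 for SSE, EDX bit 26 for SSE2, ECX bit 19 for SSE4.1).

```c
#include <assert.h>

/* edx and ecx are the register values returned by CPUID leaf 1.
   Returns 0 (plain C), 1 (SSE), 2 (SSE2) or 3 (SSE4.1).
   Hypothetical helper, not part of the patch. */
static int select_arch_from_cpuid(unsigned int edx, unsigned int ecx)
{
    int has_sse   = (edx & (1u << 25)) != 0;
    int has_sse2  = (edx & (1u << 26)) != 0;
    int has_sse41 = (ecx & (1u << 19)) != 0;
    int arch = 0;

    if (!has_sse)
        return arch;    /* level 0: no SIMD assumed */
    arch++;

    if (!has_sse2)
        return arch;    /* level 1: SSE only */
    arch++;

    if (!has_sse41)
        return arch;    /* level 2: SSE and SSE2 */
    arch++;

    return arch;        /* level 3: SSE, SSE2 and SSE4.1 */
}
```

Keeping the decode separate from the cpuid() call makes it possible to test the level mapping on any host, including ones that are not x86.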
diff --git a/celt/x86/x86cpu.h b/celt/x86/x86cpu.h
index ef53f0c..870b15e 100644
--- a/celt/x86/x86cpu.h
+++ b/celt/x86/x86cpu.h
@@ -28,6 +28,12 @@
 #if !defined(X86CPU_H)
 # define X86CPU_H
 
+# if defined(OPUS_X86_MAY_HAVE_SSE)
+#  define MAY_HAVE_SSE(name) name ## _sse
+# else
+#  define MAY_HAVE_SSE(name) name ## _c
+# endif
+
 # if defined(OPUS_X86_MAY_HAVE_SSE2)
 #  define MAY_HAVE_SSE2(name) name ## _sse2
 # else
@@ -55,21 +61,25 @@ int opus_select_arch(void);
   reference in the PMOVSXWD instruction itself, but gcc is not smart enough to
   optimize this out when optimizations ARE enabled.
 
-  It appears clang requires us to do this always (which is fair, since
-  technically the compiler is always allowed to do the dereference before
-  invoking the function implementing the intrinsic). I have not investiaged
-  whether it is any smarter than gcc when it comes to eliminating the extra
-  load instruction.*/
+  Clang, in contrast, requires us to do this always for _mm_cvtepi8_epi32
+  (which is fair, since technically the compiler is always allowed to do the
+  dereference before invoking the function implementing the intrinsic).
+  However, it is smart enough to eliminate the extra MOVD instruction.
+  For _mm_cvtepi16_epi32, it does the right thing, though does *not* optimize out
+  the extra MOVQ if it's specified explicitly */
+
 # if defined(__clang__) || !defined(__OPTIMIZE__)
 #  define OP_CVTEPI8_EPI32_M32(x) \
  (_mm_cvtepi8_epi32(_mm_cvtsi32_si128(*(int *)(x))))
-
-#  define OP_CVTEPI16_EPI32_M64(x) \
- (_mm_cvtepi16_epi32(_mm_loadl_epi64((__m128i *)(x))))
 # else
 #  define OP_CVTEPI8_EPI32_M32(x) \
  (_mm_cvtepi8_epi32(*(__m128i *)(x)))
+#endif
 
+# if !defined(__OPTIMIZE__)
+#  define OP_CVTEPI16_EPI32_M64(x) \
+ (_mm_cvtepi16_epi32(_mm_loadl_epi64((__m128i *)(x))))
+# else
 #  define OP_CVTEPI16_EPI32_M64(x) \
  (_mm_cvtepi16_epi32(*(__m128i *)(x)))
 # endif
diff --git a/celt_sources.mk b/celt_sources.mk
index 7121301..2ffe99a 100644
--- a/celt_sources.mk
+++ b/celt_sources.mk
@@ -21,7 +21,10 @@ CELT_SOURCES_SSE = celt/x86/x86cpu.c \
 celt/x86/x86_celt_map.c \
 celt/x86/pitch_sse.c
 
-CELT_SOURCES_SSE4_1 = celt/x86/celt_lpc_sse.c
+CELT_SOURCES_SSE2 = celt/x86/pitch_sse2.c
+
+CELT_SOURCES_SSE4_1 = celt/x86/celt_lpc_sse.c \
+celt/x86/pitch_sse4_1.c
 
 CELT_SOURCES_ARM = \
 celt/arm/armcpu.c \
diff --git a/configure.ac b/configure.ac
index baa3425..2380a5c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -348,8 +348,24 @@ AM_CONDITIONAL([OPUS_ARM_INLINE_ASM],
 AM_CONDITIONAL([OPUS_ARM_EXTERNAL_ASM],
     [test x"${asm_optimization%% *}" = x"ARM"])
 
-AM_CONDITIONAL([HAVE_SSE4_1], [false])
+AM_CONDITIONAL([HAVE_SSE], [false])
 AM_CONDITIONAL([HAVE_SSE2], [false])
+AM_CONDITIONAL([HAVE_SSE4_1], [false])
+
+m4_define([DEFAULT_X86_SSE_CFLAGS], [-msse])
+m4_define([DEFAULT_X86_SSE2_CFLAGS], [-msse2])
+m4_define([DEFAULT_X86_SSE4_1_CFLAGS], [-msse4.1])
+m4_define([DEFAULT_ARM_NEON_INTR_CFLAGS], [-mfpu=neon])
+
+AC_ARG_VAR([X86_SSE_CFLAGS], [C compiler flags to compile SSE intrinsics @<:@default=]DEFAULT_X86_SSE_CFLAGS[@:>@])
+AC_ARG_VAR([X86_SSE2_CFLAGS], [C compiler flags to compile SSE2 intrinsics @<:@default=]DEFAULT_X86_SSE2_CFLAGS[@:>@])
+AC_ARG_VAR([X86_SSE4_1_CFLAGS], [C compiler flags to compile SSE4.1 intrinsics @<:@default=]DEFAULT_X86_SSE4_1_CFLAGS[@:>@])
+AC_ARG_VAR([ARM_NEON_INTR_CFLAGS], [C compiler flags to compile ARM NEON intrinsics @<:@default=]DEFAULT_ARM_NEON_INTR_CFLAGS[@:>@])
+
+AS_VAR_SET_IF([X86_SSE_CFLAGS], [], [AS_VAR_SET([X86_SSE_CFLAGS], DEFAULT_X86_SSE_CFLAGS)])
+AS_VAR_SET_IF([X86_SSE2_CFLAGS], [], [AS_VAR_SET([X86_SSE2_CFLAGS], DEFAULT_X86_SSE2_CFLAGS)])
+AS_VAR_SET_IF([X86_SSE4_1_CFLAGS], [], [AS_VAR_SET([X86_SSE4_1_CFLAGS], DEFAULT_X86_SSE4_1_CFLAGS)])
+AS_VAR_SET_IF([ARM_NEON_INTR_CFLAGS], [], [AS_VAR_SET([ARM_NEON_INTR_CFLAGS], DEFAULT_ARM_NEON_INTR_CFLAGS)])
 
 AC_DEFUN([OPUS_PATH_NE10],
    [
@@ -426,64 +442,183 @@ AC_DEFUN([OPUS_PATH_NE10],
 )
 
 AS_IF([test x"$enable_intrinsics" = x"yes"],[
-   case $host_cpu in
-   arm*)
+   intrinsics_support=""
+   AS_CASE([$host_cpu],
+   [arm*],
+   [
       cpu_arm=yes
-      AC_MSG_CHECKING(if compiler supports ARM NEON intrinsics)
-      save_CFLAGS="$CFLAGS"; CFLAGS="-mfpu=neon $CFLAGS"
-      AC_LINK_IFELSE(
-         [
-            AC_LANG_PROGRAM(
-               [[#include <arm_neon.h>
-               ]],
-               [[
-                  static float32x4_t A[2], SUMM;
-                  SUMM = vmlaq_f32(SUMM, A[0], A[1]);
-               ]]
-            )
-         ],[
-            OPUS_ARM_NEON_INTR=1
-            AC_MSG_RESULT([yes])
-         ],[
-            OPUS_ARM_NEON_INTR=0
-            AC_MSG_RESULT([no])
-         ]
+      OPUS_CHECK_INTRINSICS(
+         [ARM Neon],
+         [$ARM_NEON_INTR_CFLAGS],
+         [OPUS_ARM_MAY_HAVE_NEON_INTR],
+         [OPUS_ARM_PRESUME_NEON_INTR],
+         [[#include <arm_neon.h>
+         ]],
+         [[
+            static float32x4_t A0, A1, SUMM;
+            SUMM = vmlaq_f32(SUMM, A0, A1);
+         ]]
+      )
+      AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1" && test x"$OPUS_ARM_PRESUME_NEON_INTR" != x"1"],
+          [
+             OPUS_ARM_NEON_INTR_CFLAGS="$ARM_NEON_INTR_CFLAGS"
+             AC_SUBST([OPUS_ARM_NEON_INTR_CFLAGS])
+          ]
       )
-      CFLAGS="$save_CFLAGS"
-      #Now we know if compiler supports ARM neon intrinsics or not
 
-      #Currently we only have intrinsic optimization for floating point
+      #Currently we only have intrinsic optimizations for floating point
       AS_IF([test x"$enable_float" = x"yes"],
       [
-         AS_IF([test x"$OPUS_ARM_NEON_INTR" = x"1"],
+         AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"],
          [
-            AC_DEFINE([OPUS_ARM_NEON_INTR], 1, [Compiler supports ARMv7 Neon Intrinsics])
-            AS_IF([test x"enable_rtcd" != x""],
-               [rtcd_support="ARM (ARMv7_Neon_Intrinsics)"],[])
-            enable_intrinsics="$enable_intrinsics ARMv7_Neon_Intrinsics"
+            OPUS_ARM_NEON_INTR=1
+            AC_DEFINE([OPUS_ARM_NEON_INTR], 1,
+                      [Support ARMv7 Neon Intrinsics for float])
+            AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON_INTR], 1,
+                      [Compiler supports ARMv7 Neon Intrinsics])
+            intrinsics_support="$intrinsics_support (Neon_Intrinsics)"
+
+            AS_IF([test x"$enable_rtcd" != x"" && test x"$OPUS_ARM_PRESUME_NEON_INTR" != x"1"],
+                  [rtcd_support="$rtcd_support (ARMv7_Neon_Intrinsics)"],[])
+
+            AS_IF([test x"$OPUS_ARM_PRESUME_NEON_INTR" = x"1"],
+                  [AC_DEFINE([OPUS_ARM_PRESUME_NEON_INTR], 1,
+                             [Define if binary requires NEON intrinsics support])])
+
+			   AS_IF([test x"$rtcd_support" = x""],
+                  [rtcd_support=no])
+
+            AS_IF([test x"$intrinsics_support" = x""],
+                  [intrinsics_support=no],
+			         [intrinsics_support="arm$intrinsics_support"])
+
             dnl Don't see why defining these is necessary to check features at runtime
             AC_DEFINE([OPUS_ARM_MAY_HAVE_EDSP], 1, [Define if compiler support EDSP Instructions])
             AC_DEFINE([OPUS_ARM_MAY_HAVE_MEDIA], 1, [Define if compiler support MEDIA Instructions])
             AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON], 1, [Define if compiler support NEON instructions])
 
             OPUS_PATH_NE10()
-            AS_IF([test x"$NE10_LIBS" != "x"],
-                  [enable_intrinsics="$enable_intrinsics NE10"],[])
+            AS_IF([test x"$HAVE_ARM_NE10" = x"1"],
+                  [intrinsics_support="$intrinsics_support NE10"],[])
          ],
          [
             AC_MSG_WARN([Compiler does not support ARM intrinsics])
-            enable_intrinsics=no
+            intrinsics_support=no
          ])
       ], [
-            AC_MSG_WARN([Currently on have ARM intrinsics for float])
-            enable_intrinsics=no
+            AC_MSG_WARN([Currently only have ARM intrinsics for float])
+            intrinsics_support=no
       ])
-   ;;
-   "i386" | "i686" | "x86_64")
-    AS_IF([test x"$enable_float" = x"no"],[
-    AS_IF([test x"$enable_rtcd" = x"yes"],[
+   ],
+   [i?86|x86_64],
+   [
+      OPUS_CHECK_INTRINSICS(
+         [SSE],
+         [$X86_SSE_CFLAGS],
+         [OPUS_X86_MAY_HAVE_SSE],
+         [OPUS_X86_PRESUME_SSE],
+         [[#include <xmmintrin.h>
+         ]],
+         [[
+             static __m128 mtest;
+             mtest = _mm_setzero_ps();
+         ]]
+      )
+      AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE" = x"1" && test x"$OPUS_X86_PRESUME_SSE" != x"1"],
+          [
+             OPUS_X86_SSE_CFLAGS="$X86_SSE_CFLAGS"
+             AC_SUBST([OPUS_X86_SSE_CFLAGS])
+          ]
+      )
+      OPUS_CHECK_INTRINSICS(
+         [SSE2],
+         [$X86_SSE2_CFLAGS],
+         [OPUS_X86_MAY_HAVE_SSE2],
+         [OPUS_X86_PRESUME_SSE2],
+         [[#include <emmintrin.h>
+         ]],
+         [[
+             static __m128i mtest;
+             mtest = _mm_setzero_si128();
+         ]]
+      )
+      AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE2" = x"1" && test x"$OPUS_X86_PRESUME_SSE2" != x"1"],
+          [
+             OPUS_X86_SSE2_CFLAGS="$X86_SSE2_CFLAGS"
+             AC_SUBST([OPUS_X86_SSE2_CFLAGS])
+          ]
+      )
+      OPUS_CHECK_INTRINSICS(
+         [SSE4.1],
+         [$X86_SSE4_1_CFLAGS],
+         [OPUS_X86_MAY_HAVE_SSE4_1],
+         [OPUS_X86_PRESUME_SSE4_1],
+         [[#include <smmintrin.h>
+         ]],
+         [[
+            static __m128i mtest;
+            mtest = _mm_setzero_si128();
+            mtest = _mm_cmpeq_epi64(mtest, mtest);
+         ]]
+      )
+      AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE4_1" = x"1" && test x"$OPUS_X86_PRESUME_SSE4_1" != x"1"],
+          [
+             OPUS_X86_SSE4_1_CFLAGS="$X86_SSE4_1_CFLAGS"
+             AC_SUBST([OPUS_X86_SSE4_1_CFLAGS])
+          ]
+      )
+
+         AS_IF([test x"$rtcd_support" = x"no"], [rtcd_support=""])
+         AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE" = x"1"],
+         [
+            AC_DEFINE([OPUS_X86_MAY_HAVE_SSE], 1, [Compiler supports X86 SSE Intrinsics])
+            intrinsics_support="$intrinsics_support SSE"
+
+            AS_IF([test x"$OPUS_X86_PRESUME_SSE" = x"1"],
+               [AC_DEFINE([OPUS_X86_PRESUME_SSE], 1, [Define if binary requires SSE intrinsics support])],
+               [rtcd_support="$rtcd_support SSE"])
+         ],
+         [
+            AC_MSG_WARN([Compiler does not support SSE intrinsics])
+         ])
+
+         AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE2" = x"1"],
+         [
+            AC_DEFINE([OPUS_X86_MAY_HAVE_SSE2], 1, [Compiler supports X86 SSE2 Intrinsics])
+            intrinsics_support="$intrinsics_support SSE2"
+
+            AS_IF([test x"$OPUS_X86_PRESUME_SSE2" = x"1"],
+               [AC_DEFINE([OPUS_X86_PRESUME_SSE2], 1, [Define if binary requires SSE2 intrinsics support])],
+               [rtcd_support="$rtcd_support SSE2"])
+         ],
+         [
+            AC_MSG_WARN([Compiler does not support SSE2 intrinsics])
+         ])
+
+         AS_IF([test x"$OPUS_X86_MAY_HAVE_SSE4_1" = x"1"],
+         [
+            AC_DEFINE([OPUS_X86_MAY_HAVE_SSE4_1], 1, [Compiler supports X86 SSE4.1 Intrinsics])
+            intrinsics_support="$intrinsics_support SSE4.1"
+
+            AS_IF([test x"$OPUS_X86_PRESUME_SSE4_1" = x"1"],
+               [AC_DEFINE([OPUS_X86_PRESUME_SSE4_1], 1, [Define if binary requires SSE4.1 intrinsics support])],
+               [rtcd_support="$rtcd_support SSE4.1"])
+         ],
+         [
+            AC_MSG_WARN([Compiler does not support SSE4.1 intrinsics])
+         ])
+         AS_IF([test x"$intrinsics_support" = x""],
+            [intrinsics_support=no],
+            [intrinsics_support="x86$intrinsics_support"]
+         )
+         AS_IF([test x"$rtcd_support" = x""],
+            [rtcd_support=no],
+            [rtcd_support="x86$rtcd_support"]
+         )
+
+    AS_IF([test x"$enable_rtcd" = x"yes" && test x"$rtcd_support" != x"no"],[
             get_cpuid_by_asm="no"
-            AC_MSG_CHECKING([Get CPU Info])
+            AC_MSG_CHECKING([How to get X86 CPU Info])
             AC_LINK_IFELSE([AC_LANG_PROGRAM([[
                  #include <stdio.h>
             ]],[[
@@ -493,7 +628,7 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[
                  unsigned int CPUInfo3;
                  unsigned int InfoType;
                  __asm__ __volatile__ (
-                 "cpuid11":
+                 "cpuid":
                  "=a" (CPUInfo0),
                  "=b" (CPUInfo1),
                  "=c" (CPUInfo2),
@@ -502,7 +637,8 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[
                 );
             ]])],
             [get_cpuid_by_asm="yes"
-             AC_MSG_RESULT([Inline Assembly])],
+             AC_MSG_RESULT([Inline Assembly])
+             AC_DEFINE([CPU_INFO_BY_ASM], [1], [Get CPU Info by asm method])],
              [AC_LINK_IFELSE([AC_LANG_PROGRAM([[
                  #include <cpuid.h>
             ]],[[
@@ -513,82 +649,17 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[
                  unsigned int InfoType;
                  __get_cpuid(InfoType, &CPUInfo0, &CPUInfo1, &CPUInfo2, &CPUInfo3);
             ]])],
-            [AC_MSG_RESULT([C method])],
-            [AC_MSG_ERROR([not support Get CPU Info, please disable intrinsics ])])])
-
-       AC_MSG_CHECKING([sse4.1])
-       TMP_CFLAGS="$CFLAGS"
-       gcc -Q --help=target | grep "\-msse4.1 "
-       AS_IF([test x"$?" = x"0"],[
-            CFLAGS="$CFLAGS -msse4.1"
-            AC_CHECK_HEADER(xmmintrin.h, [], [AC_MSG_ERROR([Couldn't find xmmintrin.h])])
-            AC_CHECK_HEADER(emmintrin.h, [], [AC_MSG_ERROR([Couldn't find emmintrin.h])])
-            AC_CHECK_HEADER(smmintrin.h, [], [AC_MSG_ERROR([Couldn't find smmintrin.h])],[
-            #ifdef HAVE_XMMINSTRIN_H
-                 #include <xmmintrin.h>
-                 #endif
-                 #ifdef HAVE_EMMINSTRIN_H
-                 #include <emmintrin.h>
-                 #endif
-            ])
-
-            AC_LINK_IFELSE([AC_LANG_PROGRAM([[
-                 #include <xmmintrin.h>
-                 #include <emmintrin.h>
-                 #include <smmintrin.h>
-            ]],[[
-                 __m128i mtest = _mm_setzero_si128();
-                 mtest = _mm_cmpeq_epi64(mtest, mtest);
-            ]])],
-            [AC_MSG_RESULT([yes])], [AC_MSG_ERROR([Compiler & linker failure for sse4.1, please disable intrinsics])])
-
-            CFLAGS="$TMP_CFLAGS"
-            AC_DEFINE([OPUS_X86_MAY_HAVE_SSE4_1], [1], [For x86 sse4.1 instrinsics optimizations])
-            AC_DEFINE([OPUS_X86_MAY_HAVE_SSE2], [1], [For x86 sse2 instrinsics optimizations])
-            rtcd_support="x86 sse4.1"
-            AM_CONDITIONAL([HAVE_SSE4_1], [true])
-            AM_CONDITIONAL([HAVE_SSE2], [true])
-            AS_IF([test x"$get_cpuid_by_asm" = x"yes"],[AC_DEFINE([CPU_INFO_BY_ASM], [1], [Get CPU Info by asm method])],
-            [AC_DEFINE([CPU_INFO_BY_C], [1], [Get CPU Info by C method])])
-             ],[ ##### Else case for AS_IF([test x"$?" = x"0"])
-               gcc -Q --help=target | grep "\-msse2 "
-               AC_MSG_CHECKING([sse2])
-               AS_IF([test x"$?" = x"0"],[
-                   AC_MSG_RESULT([yes])
-                   CFLAGS="$CFLAGS -msse2"
-                   AC_CHECK_HEADER(xmmintrin.h, [], [AC_MSG_ERROR([Couldn't find xmmintrin.h])])
-                   AC_CHECK_HEADER(emmintrin.h, [], [AC_MSG_ERROR([Couldn't find emmintrin.h])])
-
-                   AC_LINK_IFELSE([AC_LANG_PROGRAM([[
-                        #include <xmmintrin.h>
-                        #include <emmintrin.h>
-                   ]],[[
-                        __m128i mtest = _mm_setzero_si128();
-                   ]])],
-                   [AC_MSG_RESULT([yes])], [AC_MSG_ERROR([Compiler & linker failure for sse2, please disable intrinsics])])
-
-                  CFLAGS="$TMP_CFLAGS"
-                  AC_DEFINE([OPUS_X86_MAY_HAVE_SSE2], [1], [For x86 sse2 instrinsics optimize])
-                  rtcd_support="x86 sse2"
-                  AM_CONDITIONAL([HAVE_SSE2], [true])
-                  AS_IF([test x"$get_cpuid_by_asm" = x"yes"],[AC_DEFINE([CPU_INFO_BY_ASM], [1], [Get CPU Info by asm method])],
-                  [AC_DEFINE([CPU_INFO_BY_C], [1], [Get CPU Info by c method])])
-            ],[enable_intrinsics="no"]) #End of AS_IF([test x"$?" = x"0"]
-        ])
-    ], [
-        enable_intrinsics="no"
-    ]) ## End of AS_IF([test x"$enable_rtcd" = x"yes"]
-],
-[  ## Else case for AS_IF([test x"$enable_float" = x"no"]
-   AC_MSG_WARN([Disabling intrinsics .. x86 intrinsics only avail for fixed point])
-   enable_intrinsics="no"
-]) ## End of AS_IF([test x"$enable_float" = x"no"]
-   ;;
-   *)
+            [AC_MSG_RESULT([C method])
+             AC_DEFINE([CPU_INFO_BY_C], [1], [Get CPU Info by c method])],
+            [AC_MSG_ERROR([no supported Get CPU Info method, please disable intrinsics])])])])
+   ],
+   [
       AC_MSG_WARN([No intrinsics support for your architecture])
-      enable_intrinsics="no"
-   ;;
-   esac
+      intrinsics_support="no"
+   ])
+],
+[
+   intrinsics_support="no"
 ])
 
 AM_CONDITIONAL([CPU_ARM], [test "$cpu_arm" = "yes"])
@@ -597,6 +668,12 @@ AM_CONDITIONAL([OPUS_ARM_NEON_INTR],
 AM_CONDITIONAL([HAVE_ARM_NE10],
     [test x"$HAVE_ARM_NE10" = x"1"])
 
+AM_CONDITIONAL([HAVE_SSE],
+    [test x"$OPUS_X86_MAY_HAVE_SSE" = x"1"])
+AM_CONDITIONAL([HAVE_SSE2],
+    [test x"$OPUS_X86_MAY_HAVE_SSE2" = x"1"])
+AM_CONDITIONAL([HAVE_SSE4_1],
+    [test x"$OPUS_X86_MAY_HAVE_SSE4_1" = x"1"])
 
 AS_IF([test x"$enable_rtcd" = x"yes"],[
     AS_IF([test x"$rtcd_support" != x"no"],[
@@ -704,7 +781,7 @@ AC_MSG_NOTICE([
       Fixed point debugging: ......... ${enable_fixed_point_debug}
       Inline Assembly Optimizations: . ${inline_optimization}
       External Assembly Optimizations: ${asm_optimization}
-      Intrinsics Optimizations.......: ${enable_intrinsics}
+      Intrinsics Optimizations.......: ${intrinsics_support}
       Run-time CPU detection: ........ ${rtcd_support}
       Custom modes: .................. ${enable_custom_modes}
       Assertion checking: ............ ${enable_assertions}
diff --git a/m4/opus-intrinsics.m4 b/m4/opus-intrinsics.m4
new file mode 100644
index 0000000..c74aecd
--- /dev/null
+++ b/m4/opus-intrinsics.m4
@@ -0,0 +1,29 @@
+dnl opus-intrinsics.m4
+dnl macro for testing for support for compiler intrinsics, either by default or with a compiler flag
+
+dnl OPUS_CHECK_INTRINSICS(NAME-OF-INTRINSICS, COMPILER-FLAG-FOR-INTRINSICS, VAR-IF-PRESENT, VAR-IF-DEFAULT, TEST-PROGRAM-HEADER, TEST-PROGRAM-BODY)
+AC_DEFUN([OPUS_CHECK_INTRINSICS],
+[
+   AC_MSG_CHECKING([if compiler supports $1 intrinsics])
+   AC_LINK_IFELSE(
+     [AC_LANG_PROGRAM($5, $6)],
+     [
+        $3=1
+        $4=1
+        AC_MSG_RESULT([yes])
+      ],[
+        $4=0
+        AC_MSG_RESULT([no])
+        AC_MSG_CHECKING([if compiler supports $1 intrinsics with $2])
+        save_CFLAGS="$CFLAGS"; CFLAGS="$2 $CFLAGS"
+        AC_LINK_IFELSE([AC_LANG_PROGRAM($5, $6)],
+        [
+           AC_MSG_RESULT([yes])
+           $3=1
+        ],[
+           AC_MSG_RESULT([no])
+           $3=0
+        ])
+        CFLAGS="$save_CFLAGS"
+     ])
+])
diff --git a/silk/x86/SigProc_FIX_sse.h b/silk/x86/SigProc_FIX_sse.h
index 9a0e096..61efa8d 100644
--- a/silk/x86/SigProc_FIX_sse.h
+++ b/silk/x86/SigProc_FIX_sse.h
@@ -45,6 +45,12 @@ void silk_burg_modified_sse4_1(
     int                         arch                /* I    Run-time architecture                                       */
 );
 
+#if defined(OPUS_X86_PRESUME_SSE4_1)
+#define silk_burg_modified(res_nrg, res_nrg_Q, A_Q16, x, minInvGain_Q30, subfr_length, nb_subfr, D, arch) \
+    ((void)(arch), silk_burg_modified_sse4_1(res_nrg, res_nrg_Q, A_Q16, x, minInvGain_Q30, subfr_length, nb_subfr, D, arch))
+
+#else
+
 extern void (*const SILK_BURG_MODIFIED_IMPL[OPUS_ARCHMASK + 1])(
     opus_int32                  *res_nrg,           /* O    Residual energy    */
     opus_int                    *res_nrg_Q,         /* O    Residual energy Q value                                     */
@@ -59,12 +65,22 @@ extern void (*const SILK_BURG_MODIFIED_IMPL[OPUS_ARCHMASK + 1])(
 #  define silk_burg_modified(res_nrg, res_nrg_Q, A_Q16, x, minInvGain_Q30, subfr_length, nb_subfr, D, arch) \
     ((*SILK_BURG_MODIFIED_IMPL[(arch) & OPUS_ARCHMASK])(res_nrg, res_nrg_Q, A_Q16, x, minInvGain_Q30, subfr_length, nb_subfr, D, arch))
 
+#endif
+
 opus_int64 silk_inner_prod16_aligned_64_sse4_1(
     const opus_int16 *inVec1,
     const opus_int16 *inVec2,
     const opus_int   len
 );
 
+
+#if defined(OPUS_X86_PRESUME_SSE4_1)
+
+#define silk_inner_prod16_aligned_64(inVec1, inVec2, len, arch) \
+    ((void)(arch),silk_inner_prod16_aligned_64_sse4_1(inVec1, inVec2, len))
+
+#else
+
 extern opus_int64 (*const SILK_INNER_PROD16_ALIGNED_64_IMPL[OPUS_ARCHMASK + 1])(
                     const opus_int16 *inVec1,
                     const opus_int16 *inVec2,
@@ -75,3 +91,4 @@ extern opus_int64 (*const SILK_INNER_PROD16_ALIGNED_64_IMPL[OPUS_ARCHMASK + 1])(
 
 #endif
 #endif
+#endif
diff --git a/silk/x86/main_sse.h b/silk/x86/main_sse.h
index f970632..afd5ec2 100644
--- a/silk/x86/main_sse.h
+++ b/silk/x86/main_sse.h
@@ -50,6 +50,15 @@ void silk_VQ_WMat_EC_sse4_1(
     opus_int                    L                               /* I    number of vectors in codebook               */
 );
 
+#if defined OPUS_X86_PRESUME_SSE4_1
+
+#define silk_VQ_WMat_EC(ind, rate_dist_Q14, gain_Q7, in_Q14, W_Q18, cb_Q7, cb_gain_Q7, cl_Q5, \
+                          mu_Q9, max_gain_Q7, L, arch) \
+    ((void)(arch),silk_VQ_WMat_EC_sse4_1(ind, rate_dist_Q14, gain_Q7, in_Q14, W_Q18, cb_Q7, cb_gain_Q7, cl_Q5, \
+                          mu_Q9, max_gain_Q7, L))
+
+#else
+
 extern void (*const SILK_VQ_WMAT_EC_IMPL[OPUS_ARCHMASK + 1])(
     opus_int8                   *ind,                           /* O    index of best codebook vector               */
     opus_int32                  *rate_dist_Q14,                 /* O    best weighted quant error + mu * rate       */
@@ -69,6 +78,8 @@ extern void (*const SILK_VQ_WMAT_EC_IMPL[OPUS_ARCHMASK + 1])(
     ((*SILK_VQ_WMAT_EC_IMPL[(arch) & OPUS_ARCHMASK])(ind, rate_dist_Q14, gain_Q7, in_Q14, W_Q18, cb_Q7, cb_gain_Q7, cl_Q5, \
                           mu_Q9, max_gain_Q7, L))
 
+#endif
+
 #  define OVERRIDE_silk_NSQ
 
 void silk_NSQ_sse4_1(
@@ -89,6 +100,15 @@ void silk_NSQ_sse4_1(
     const opus_int              LTP_scale_Q14                               /* I    LTP state scaling               */
 );
 
+#if defined OPUS_X86_PRESUME_SSE4_1
+
+#define silk_NSQ(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \
+                   HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14, arch) \
+    ((void)(arch),silk_NSQ_sse4_1(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \
+                   HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14))
+
+#else
+
 extern void (*const SILK_NSQ_IMPL[OPUS_ARCHMASK + 1])(
     const silk_encoder_state    *psEncC,                                    /* I/O  Encoder State                   */
     silk_nsq_state              *NSQ,                                       /* I/O  NSQ state                       */
@@ -112,6 +132,8 @@ extern void (*const SILK_NSQ_IMPL[OPUS_ARCHMASK + 1])(
     ((*SILK_NSQ_IMPL[(arch) & OPUS_ARCHMASK])(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \
                    HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14))
 
+#endif
+
 #  define OVERRIDE_silk_NSQ_del_dec
 
 void silk_NSQ_del_dec_sse4_1(
@@ -132,6 +154,15 @@ void silk_NSQ_del_dec_sse4_1(
     const opus_int              LTP_scale_Q14                               /* I    LTP state scaling               */
 );
 
+#if defined OPUS_X86_PRESUME_SSE4_1
+
+#define silk_NSQ_del_dec(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \
+                           HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14, arch) \
+    ((void)(arch),silk_NSQ_del_dec_sse4_1(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \
+                           HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14))
+
+#else
+
 extern void (*const SILK_NSQ_DEL_DEC_IMPL[OPUS_ARCHMASK + 1])(
     const silk_encoder_state    *psEncC,                                    /* I/O  Encoder State                   */
     silk_nsq_state              *NSQ,                                       /* I/O  NSQ state                       */
@@ -155,6 +186,8 @@ extern void (*const SILK_NSQ_DEL_DEC_IMPL[OPUS_ARCHMASK + 1])(
     ((*SILK_NSQ_DEL_DEC_IMPL[(arch) & OPUS_ARCHMASK])(psEncC, NSQ, psIndices, x_Q3, pulses, PredCoef_Q12, LTPCoef_Q14, AR2_Q13, \
                            HarmShapeGain_Q14, Tilt_Q14, LF_shp_Q14, Gains_Q16, pitchL, Lambda_Q10, LTP_scale_Q14))
 
+#endif
+
 void silk_noise_shape_quantizer(
     silk_nsq_state      *NSQ,                   /* I/O  NSQ state              */
     opus_int            signalType,             /* I    Signal type            */
@@ -192,6 +225,11 @@ opus_int silk_VAD_GetSA_Q8_sse4_1(
     const opus_int16   pIn[]
 );
 
+#if defined(OPUS_X86_PRESUME_SSE4_1)
+#define silk_VAD_GetSA_Q8(psEnC, pIn, arch) ((void)(arch),silk_VAD_GetSA_Q8_sse4_1(psEnC, pIn))
+
+#else
+
 #  define silk_VAD_GetSA_Q8(psEnC, pIn, arch) \
      ((*SILK_VAD_GETSA_Q8_IMPL[(arch) & OPUS_ARCHMASK])(psEnC, pIn))
 
@@ -201,6 +239,8 @@ extern opus_int (*const SILK_VAD_GETSA_Q8_IMPL[OPUS_ARCHMASK + 1])(
 
 #  define OVERRIDE_silk_warped_LPC_analysis_filter_FIX
 
+#endif
+
 void silk_warped_LPC_analysis_filter_FIX_sse4_1(
           opus_int32            state[],                    /* I/O  State [order + 1]                   */
           opus_int32            res_Q2[],                   /* O    Residual signal [length]            */
@@ -211,6 +251,12 @@ void silk_warped_LPC_analysis_filter_FIX_sse4_1(
     const opus_int              order                       /* I    Filter order (even)                 */
 );
 
+#if defined(OPUS_X86_PRESUME_SSE4_1)
+#define silk_warped_LPC_analysis_filter_FIX(state, res_Q2, coef_Q13, input, lambda_Q16, length, order, arch) \
+    ((void)(arch),silk_warped_LPC_analysis_filter_FIX_c(state, res_Q2, coef_Q13, input, lambda_Q16, length, order))
+
+#else
+
 extern void (*const SILK_WARPED_LPC_ANALYSIS_FILTER_FIX_IMPL[OPUS_ARCHMASK + 1])(
           opus_int32            state[],                    /* I/O  State [order + 1]                   */
           opus_int32            res_Q2[],                   /* O    Residual signal [length]            */
@@ -224,5 +270,7 @@ extern void (*const SILK_WARPED_LPC_ANALYSIS_FILTER_FIX_IMPL[OPUS_ARCHMASK + 1])
 #  define silk_warped_LPC_analysis_filter_FIX(state, res_Q2, coef_Q13, input, lambda_Q16, length, order, arch) \
     ((*SILK_WARPED_LPC_ANALYSIS_FILTER_FIX_IMPL[(arch) & OPUS_ARCHMASK])(state, res_Q2, coef_Q13, input, lambda_Q16, length, order))
 
+#endif
+
 # endif
 #endif
diff --git a/silk/x86/x86_silk_map.c b/silk/x86/x86_silk_map.c
index 6747d10..ad9fef2 100644
--- a/silk/x86/x86_silk_map.c
+++ b/silk/x86/x86_silk_map.c
@@ -35,6 +35,10 @@
 #include "pitch.h"
 #include "main.h"
 
+#if !defined(OPUS_X86_PRESUME_SSE4_1)
+
+#if defined(FIXED_POINT)
+
 opus_int64 (*const SILK_INNER_PROD16_ALIGNED_64_IMPL[ OPUS_ARCHMASK + 1 ] )(
     const opus_int16 *inVec1,
     const opus_int16 *inVec2,
@@ -42,18 +46,20 @@ opus_int64 (*const SILK_INNER_PROD16_ALIGNED_64_IMPL[ OPUS_ARCHMASK + 1 ] )(
 ) = {
   silk_inner_prod16_aligned_64_c,                  /* non-sse */
   silk_inner_prod16_aligned_64_c,
+  silk_inner_prod16_aligned_64_c,
   MAY_HAVE_SSE4_1( silk_inner_prod16_aligned_64 ), /* sse4.1 */
-  NULL
 };
 
+#endif
+
 opus_int (*const SILK_VAD_GETSA_Q8_IMPL[ OPUS_ARCHMASK + 1 ] )(
     silk_encoder_state *psEncC,
     const opus_int16   pIn[]
 ) = {
   silk_VAD_GetSA_Q8_c,                  /* non-sse */
   silk_VAD_GetSA_Q8_c,
+  silk_VAD_GetSA_Q8_c,
   MAY_HAVE_SSE4_1( silk_VAD_GetSA_Q8 ), /* sse4.1 */
-  NULL
 };
 
 void (*const SILK_NSQ_IMPL[ OPUS_ARCHMASK + 1 ] )(
@@ -75,8 +81,8 @@ void (*const SILK_NSQ_IMPL[ OPUS_ARCHMASK + 1 ] )(
 ) = {
   silk_NSQ_c,                  /* non-sse */
   silk_NSQ_c,
+  silk_NSQ_c,
   MAY_HAVE_SSE4_1( silk_NSQ ), /* sse4.1 */
-  NULL
 };
 
 void (*const SILK_VQ_WMAT_EC_IMPL[ OPUS_ARCHMASK + 1 ] )(
@@ -94,8 +100,8 @@ void (*const SILK_VQ_WMAT_EC_IMPL[ OPUS_ARCHMASK + 1 ] )(
 ) = {
   silk_VQ_WMat_EC_c,                  /* non-sse */
   silk_VQ_WMat_EC_c,
+  silk_VQ_WMat_EC_c,
   MAY_HAVE_SSE4_1( silk_VQ_WMat_EC ), /* sse4.1 */
-  NULL
 };
 
 void (*const SILK_NSQ_DEL_DEC_IMPL[ OPUS_ARCHMASK + 1 ] )(
@@ -117,10 +123,12 @@ void (*const SILK_NSQ_DEL_DEC_IMPL[ OPUS_ARCHMASK + 1 ] )(
 ) = {
   silk_NSQ_del_dec_c,                  /* non-sse */
   silk_NSQ_del_dec_c,
+  silk_NSQ_del_dec_c,
   MAY_HAVE_SSE4_1( silk_NSQ_del_dec ), /* sse4.1 */
-  NULL
 };
 
+#if defined(FIXED_POINT)
+
 void (*const SILK_WARPED_LPC_ANALYSIS_FILTER_FIX_IMPL[ OPUS_ARCHMASK + 1 ] )(
     opus_int32                  state[],                    /* I/O  State [order + 1]                   */
     opus_int32                  res_Q2[],                   /* O    Residual signal [length]            */
@@ -132,8 +140,8 @@ void (*const SILK_WARPED_LPC_ANALYSIS_FILTER_FIX_IMPL[ OPUS_ARCHMASK + 1 ] )(
 ) = {
   silk_warped_LPC_analysis_filter_FIX_c,                  /* non-sse */
   silk_warped_LPC_analysis_filter_FIX_c,
+  silk_warped_LPC_analysis_filter_FIX_c,
   MAY_HAVE_SSE4_1( silk_warped_LPC_analysis_filter_FIX ), /* sse4.1 */
-  NULL
 };
 
 void (*const SILK_BURG_MODIFIED_IMPL[ OPUS_ARCHMASK + 1 ] )(
@@ -149,6 +157,9 @@ void (*const SILK_BURG_MODIFIED_IMPL[ OPUS_ARCHMASK + 1 ] )(
 ) = {
   silk_burg_modified_c,                  /* non-sse */
   silk_burg_modified_c,
+  silk_burg_modified_c,
   MAY_HAVE_SSE4_1( silk_burg_modified ), /* sse4.1 */
-  NULL
 };
+
+#endif
+#endif
diff --git a/win32/VS2010/celt.vcxproj b/win32/VS2010/celt.vcxproj
index f107fec..e068fbe 100644
--- a/win32/VS2010/celt.vcxproj
+++ b/win32/VS2010/celt.vcxproj
@@ -37,6 +37,12 @@
     <ClCompile Include="..\..\celt\quant_bands.c" />
     <ClCompile Include="..\..\celt\rate.c" />
     <ClCompile Include="..\..\celt\vq.c" />
+    <ClCompile Include="..\..\celt\x86\celt_lpc_sse.c" />
+    <ClCompile Include="..\..\celt\x86\pitch_sse.c" />
+    <ClCompile Include="..\..\celt\x86\pitch_sse2.c" />
+    <ClCompile Include="..\..\celt\x86\pitch_sse4_1.c" />
+    <ClCompile Include="..\..\celt\x86\x86cpu.c" />
+    <ClCompile Include="..\..\celt\x86\x86_celt_map.c" />
   </ItemGroup>
   <ItemGroup>
     <ClInclude Include="..\..\celt\arch.h" />
@@ -67,6 +73,9 @@
     <ClInclude Include="..\..\celt\static_modes_fixed.h" />
     <ClInclude Include="..\..\celt\static_modes_float.h" />
     <ClInclude Include="..\..\celt\vq.h" />
+    <ClInclude Include="..\..\celt\x86\celt_lpc_sse.h" />
+    <ClInclude Include="..\..\celt\x86\pitch_sse.h" />
+    <ClInclude Include="..\..\celt\x86\x86cpu.h" />
     <ClInclude Include="..\..\celt\_kiss_fft_guts.h" />
   </ItemGroup>
   <PropertyGroup Label="Globals">
@@ -141,7 +150,7 @@
       <WarningLevel>Level3</WarningLevel>
       <Optimization>Disabled</Optimization>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>..\;..\..\include;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)\..\;$(ProjectDir)\..\..\include;$(ProjectDir)\..\..\celt;$(ProjectDir)\..\..\silk;$(ProjectDir)\..\..\silk\float;$(ProjectDir)\..\..\silk\fixed;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>
     </ClCompile>
     <Link>
@@ -168,7 +177,7 @@
       <WarningLevel>Level3</WarningLevel>
       <Optimization>Disabled</Optimization>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;WIN64;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>..\;..\..\include;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)\..\;$(ProjectDir)\..\..\include;$(ProjectDir)\..\..\celt;$(ProjectDir)\..\..\silk;$(ProjectDir)\..\..\silk\float;$(ProjectDir)\..\..\silk\fixed;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>
     </ClCompile>
     <Link>
@@ -196,7 +205,7 @@
       <FunctionLevelLinking>true</FunctionLevelLinking>
       <IntrinsicFunctions>true</IntrinsicFunctions>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>..\;..\..\include;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)\..\;$(ProjectDir)\..\..\include;$(ProjectDir)\..\..\celt;$(ProjectDir)\..\..\silk;$(ProjectDir)\..\..\silk\float;$(ProjectDir)\..\..\silk\fixed;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
     </ClCompile>
     <Link>
@@ -227,7 +236,7 @@
       <FunctionLevelLinking>true</FunctionLevelLinking>
       <IntrinsicFunctions>true</IntrinsicFunctions>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;WIN64;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>..\;..\..\include;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)\..\;$(ProjectDir)\..\..\include;$(ProjectDir)\..\..\celt;$(ProjectDir)\..\..\silk;$(ProjectDir)\..\..\silk\float;$(ProjectDir)\..\..\silk\fixed;%(AdditionalIncludeDirectories)</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
     </ClCompile>
     <Link>
diff --git a/win32/VS2010/celt.vcxproj.filters b/win32/VS2010/celt.vcxproj.filters
index e3a1d97..e9948fa 100644
--- a/win32/VS2010/celt.vcxproj.filters
+++ b/win32/VS2010/celt.vcxproj.filters
@@ -69,6 +69,24 @@
     <ClCompile Include="..\..\celt\celt.c">
       <Filter>Source Files</Filter>
     </ClCompile>
+    <ClCompile Include="..\..\celt\x86\celt_lpc_sse.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="..\..\celt\x86\pitch_sse.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="..\..\celt\x86\pitch_sse2.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="..\..\celt\x86\pitch_sse4_1.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="..\..\celt\x86\x86_celt_map.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="..\..\celt\x86\x86cpu.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
   </ItemGroup>
   <ItemGroup>
     <ClInclude Include="..\..\celt\cwrs.h">
@@ -158,5 +176,14 @@
     <ClInclude Include="..\..\celt\celt_lpc.h">
       <Filter>Header Files</Filter>
     </ClInclude>
+    <ClInclude Include="..\..\celt\x86\celt_lpc_sse.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="..\..\celt\x86\pitch_sse.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="..\..\celt\x86\x86cpu.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
   </ItemGroup>
 </Project>
\ No newline at end of file
diff --git a/win32/VS2010/silk_common.vcxproj b/win32/VS2010/silk_common.vcxproj
index 9cf5f48..d3d077d 100644
--- a/win32/VS2010/silk_common.vcxproj
+++ b/win32/VS2010/silk_common.vcxproj
@@ -88,7 +88,7 @@
       <WarningLevel>Level3</WarningLevel>
       <Optimization>Disabled</Optimization>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>../../silk/fixed;../../silk/float;../../win32;../../celt;../../include</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk/float;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>
     </ClCompile>
     <Link>
@@ -118,7 +118,7 @@
       <WarningLevel>Level3</WarningLevel>
       <Optimization>Disabled</Optimization>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;WIN64;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>../../silk/fixed;../../silk/float;../../win32;../../celt;../../include</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk/float;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>
     </ClCompile>
     <Link>
@@ -149,7 +149,7 @@
       <FunctionLevelLinking>true</FunctionLevelLinking>
       <IntrinsicFunctions>true</IntrinsicFunctions>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>../../silk/fixed;../../silk/float;../../win32;../../celt;../../include</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk/float;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
       <FloatingPointModel>Fast</FloatingPointModel>
     </ClCompile>
@@ -184,7 +184,7 @@
       <FunctionLevelLinking>true</FunctionLevelLinking>
       <IntrinsicFunctions>true</IntrinsicFunctions>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;WIN64;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>../../silk/fixed;../../silk/float;../../win32;../../celt;../../include</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk/float;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
       <FloatingPointModel>Fast</FloatingPointModel>
     </ClCompile>
@@ -212,6 +212,8 @@
   </ItemDefinitionGroup>
   <ItemGroup>
     <ClInclude Include="..\..\include\opus_types.h" />
+    <ClInclude Include="..\..\silk\x86\main_sse.h" />
+    <ClInclude Include="..\..\silk\x86\SigProc_FIX_sse.h" />
     <ClInclude Include="..\..\win32\config.h" />
     <ClInclude Include="..\..\silk\control.h" />
     <ClInclude Include="..\..\silk\debug.h" />
@@ -311,8 +313,13 @@
     <ClCompile Include="..\..\silk\table_LSF_cos.c" />
     <ClCompile Include="..\..\silk\VAD.c" />
     <ClCompile Include="..\..\silk\VQ_WMat_EC.c" />
+    <ClCompile Include="..\..\silk\x86\NSQ_del_dec_sse.c" />
+    <ClCompile Include="..\..\silk\x86\NSQ_sse.c" />
+    <ClCompile Include="..\..\silk\x86\VAD_sse.c" />
+    <ClCompile Include="..\..\silk\x86\VQ_WMat_EC_sse.c" />
+    <ClCompile Include="..\..\silk\x86\x86_silk_map.c" />
   </ItemGroup>
   <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
   <ImportGroup Label="ExtensionTargets">
   </ImportGroup>
-</Project>
+</Project>
\ No newline at end of file
diff --git a/win32/VS2010/silk_common.vcxproj.filters b/win32/VS2010/silk_common.vcxproj.filters
index 30db48e..341180b 100644
--- a/win32/VS2010/silk_common.vcxproj.filters
+++ b/win32/VS2010/silk_common.vcxproj.filters
@@ -81,6 +81,12 @@
     <ClInclude Include="..\..\silk\typedef.h">
       <Filter>Header Files</Filter>
     </ClInclude>
+    <ClInclude Include="..\..\silk\x86\main_sse.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
+    <ClInclude Include="..\..\silk\x86\SigProc_FIX_sse.h">
+      <Filter>Header Files</Filter>
+    </ClInclude>
   </ItemGroup>
   <ItemGroup>
     <ClCompile Include="..\..\silk\VQ_WMat_EC.c">
@@ -311,5 +317,20 @@
     <ClCompile Include="..\..\silk\VAD.c">
       <Filter>Source Files</Filter>
     </ClCompile>
+    <ClCompile Include="..\..\silk\x86\NSQ_del_dec_sse.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="..\..\silk\x86\NSQ_sse.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="..\..\silk\x86\VAD_sse.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="..\..\silk\x86\VQ_WMat_EC_sse.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="..\..\silk\x86\x86_silk_map.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
   </ItemGroup>
-</Project>
+</Project>
\ No newline at end of file
diff --git a/win32/VS2010/silk_fixed.vcxproj b/win32/VS2010/silk_fixed.vcxproj
index 5ea1a91..522101e 100644
--- a/win32/VS2010/silk_fixed.vcxproj
+++ b/win32/VS2010/silk_fixed.vcxproj
@@ -86,7 +86,7 @@
       <WarningLevel>Level3</WarningLevel>
       <Optimization>Disabled</Optimization>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>../../silk/fixed;../../silk;../../win32;../../celt;../../include;../win32</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include;$(ProjectDir)/../win32</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>
     </ClCompile>
     <Link>
@@ -104,7 +104,7 @@
       <WarningLevel>Level3</WarningLevel>
       <Optimization>Disabled</Optimization>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;_DEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>../../silk/fixed;../../silk;../../win32;../../celt;../../include;../win32</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include;$(ProjectDir)/../win32</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreadedDebug</RuntimeLibrary>
     </ClCompile>
     <Link>
@@ -123,7 +123,7 @@
       <FunctionLevelLinking>true</FunctionLevelLinking>
       <IntrinsicFunctions>true</IntrinsicFunctions>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>../../silk/fixed;../../silk;../../win32;../../celt;../../include;../win32</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include;$(ProjectDir)/../win32</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
     </ClCompile>
     <Link>
@@ -145,7 +145,7 @@
       <FunctionLevelLinking>true</FunctionLevelLinking>
       <IntrinsicFunctions>true</IntrinsicFunctions>
       <PreprocessorDefinitions>HAVE_CONFIG_H;WIN32;NDEBUG;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>
-      <AdditionalIncludeDirectories>../../silk/fixed;../../silk;../../win32;../../celt;../../include;../win32</AdditionalIncludeDirectories>
+      <AdditionalIncludeDirectories>$(ProjectDir)/../..;$(ProjectDir)/../../silk/fixed;$(ProjectDir)/../../silk;$(ProjectDir)/../../win32;$(ProjectDir)/../../celt;$(ProjectDir)/../../include;$(ProjectDir)/../win32</AdditionalIncludeDirectories>
       <RuntimeLibrary>MultiThreaded</RuntimeLibrary>
     </ClCompile>
     <Link>
@@ -191,8 +191,11 @@
     <ClCompile Include="..\..\silk\fixed\solve_LS_FIX.c" />
     <ClCompile Include="..\..\silk\fixed\vector_ops_FIX.c" />
     <ClCompile Include="..\..\silk\fixed\warped_autocorrelation_FIX.c" />
+    <ClCompile Include="..\..\silk\fixed\x86\burg_modified_FIX_sse.c" />
+    <ClCompile Include="..\..\silk\fixed\x86\prefilter_FIX_sse.c" />
+    <ClCompile Include="..\..\silk\fixed\x86\vector_ops_FIX_sse.c" />
   </ItemGroup>
   <Import Project="$(VCTargetsPath)\Microsoft.Cpp.targets" />
   <ImportGroup Label="ExtensionTargets">
   </ImportGroup>
-</Project>
+</Project>
\ No newline at end of file
diff --git a/win32/VS2010/silk_fixed.vcxproj.filters b/win32/VS2010/silk_fixed.vcxproj.filters
index 6897930..c2327eb 100644
--- a/win32/VS2010/silk_fixed.vcxproj.filters
+++ b/win32/VS2010/silk_fixed.vcxproj.filters
@@ -18,16 +18,16 @@
     <ClInclude Include="..\..\win32\config.h">
       <Filter>Header Files</Filter>
     </ClInclude>
-    <ClInclude Include="main_FIX.h">
+    <ClInclude Include="..\..\include\opus_types.h">
       <Filter>Header Files</Filter>
     </ClInclude>
-    <ClInclude Include="..\SigProc_FIX.h">
+    <ClInclude Include="..\..\silk\SigProc_FIX.h">
       <Filter>Header Files</Filter>
     </ClInclude>
-    <ClInclude Include="structs_FIX.h">
+    <ClInclude Include="..\..\silk\fixed\main_FIX.h">
       <Filter>Header Files</Filter>
     </ClInclude>
-    <ClInclude Include="..\..\include\opus_types.h">
+    <ClInclude Include="..\..\silk\fixed\structs_FIX.h">
       <Filter>Header Files</Filter>
     </ClInclude>
   </ItemGroup>
@@ -107,5 +107,14 @@
     <ClCompile Include="..\..\silk\fixed\LTP_analysis_filter_FIX.c">
       <Filter>Source Files</Filter>
     </ClCompile>
+    <ClCompile Include="..\..\silk\fixed\x86\burg_modified_FIX_sse.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="..\..\silk\fixed\x86\prefilter_FIX_sse.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
+    <ClCompile Include="..\..\silk\fixed\x86\vector_ops_FIX_sse.c">
+      <Filter>Source Files</Filter>
+    </ClCompile>
   </ItemGroup>
 </Project>
\ No newline at end of file
diff --git a/win32/config.h b/win32/config.h
index 46ff699..10fbf33 100644
--- a/win32/config.h
+++ b/win32/config.h
@@ -35,9 +35,28 @@ POSSIBILITY OF SUCH DAMAGE.
 
 #define OPUS_BUILD            1
 
-/* Enable SSE functions, if compiled with SSE/SSE2 (note that AMD64 implies SSE2) */
-#if defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 1))
-#define __SSE__               1
+#if defined(_M_IX86) || defined(_M_X64)
+/* Can always build with SSE intrinsics (no special compiler flags necessary) */
+#define OPUS_X86_MAY_HAVE_SSE
+#define OPUS_X86_MAY_HAVE_SSE2
+#define OPUS_X86_MAY_HAVE_SSE4_1
+
+/* Presume SSE functions, if compiled with SSE/SSE2/AVX (note that AMD64 implies SSE2, and AVX
+   implies SSE4.1) */
+#if defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 1)) || defined(__AVX__)
+#define OPUS_X86_PRESUME_SSE 1
+#endif
+#if defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 2)) || defined(__AVX__)
+#define OPUS_X86_PRESUME_SSE2 1
+#endif
+#if defined(__AVX__)
+#define OPUS_X86_PRESUME_SSE4_1 1
+#endif
+
+#if !defined(OPUS_X86_PRESUME_SSE4_1) || !defined(OPUS_X86_PRESUME_SSE2) || !defined(OPUS_X86_PRESUME_SSE)
+#define OPUS_HAVE_RTCD 1
+#endif
+
 #endif
 
 #include "version.h"
-- 
1.9.1
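
For readers following the win32/config.h hunk above: the rule it encodes is that run-time CPU detection is only needed when at least one SSE level cannot be presumed at compile time. Below is a minimal standalone sketch of that rule. The DEMO_* macros and demo_needs_rtcd() are made-up names for illustration, not part of Opus, and the hard-coded DEMO_PRESUME_SSE stands in for what a 32-bit MSVC build with /arch:SSE would derive from _M_IX86_FP.

```c
#include <assert.h>

/* Hypothetical stand-ins for the OPUS_X86_* macros in the patch. */
#define DEMO_MAY_HAVE_SSE    1
#define DEMO_MAY_HAVE_SSE2   1
#define DEMO_MAY_HAVE_SSE4_1 1

/* Assumption for this sketch: only plain SSE is guaranteed by the
 * compiler flags, so SSE2 and SSE4.1 are not presumed. */
#define DEMO_PRESUME_SSE 1

/* Same shape as the patch: if any level is not presumed, a run-time
 * check is still required to pick the best implementation. */
#if !defined(DEMO_PRESUME_SSE4_1) || !defined(DEMO_PRESUME_SSE2) || \
    !defined(DEMO_PRESUME_SSE)
# define DEMO_HAVE_RTCD 1
#else
# define DEMO_HAVE_RTCD 0
#endif

/* Expose the compile-time decision so it can be checked from code. */
static int demo_needs_rtcd(void)
{
   assert(DEMO_HAVE_RTCD == 0 || DEMO_HAVE_RTCD == 1);
   return DEMO_HAVE_RTCD;
}
```

Defining all three DEMO_PRESUME_* macros (the analogue of building with AVX enabled) would make demo_needs_rtcd() return 0, i.e. no run-time detection is compiled in.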

Viswanath Puttagunta

2015-Mar-31 22:57 UTC

[opus] [RFC PATCH v1 4/5] aarch64: Enable intrinsics for aarch64

Enable the existing NEON intrinsic optimizations
on the aarch64 target.

Signed-off-by: Viswanath Puttagunta <viswanath.puttagunta at linaro.org>
---
 Makefile.am                     |  4 +-
 celt/arm/arm_celt_map.c         |  4 +-
 celt/arm/celt_ne10_fft.c        |  2 +
 celt/arm/celt_ne10_mdct.c       |  3 ++
 celt/arm/pitch_arm.h            |  2 +-
 celt/dump_modes/Makefile        |  2 +-
 celt/pitch.h                    |  5 +--
 celt/tests/test_unit_dft.c      |  3 +-
 celt/tests/test_unit_mathops.c  |  7 ++--
 celt/tests/test_unit_mdct.c     |  4 +-
 celt/tests/test_unit_rotation.c |  5 ++-
 configure.ac                    | 93 +++++++++++++++++++----------------------
 12 files changed, 67 insertions(+), 67 deletions(-)

diff --git a/Makefile.am b/Makefile.am
index 3a75740..8bd7447 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -47,7 +47,7 @@ if CPU_ARM
 CELT_SOURCES += $(CELT_SOURCES_ARM)
 SILK_SOURCES += $(SILK_SOURCES_ARM)
 
-if OPUS_ARM_NEON_INTR
+if HAVE_ARM_NEON_INTR
 CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR)
 endif
 
@@ -286,7 +286,7 @@ SSE4_1_OBJ = $(CELT_SOURCES_SSE4_1:.c=.lo) \
 $(SSE4_1_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_X86_SSE4_1_CFLAGS)
 endif
 
-if OPUS_ARM_NEON_INTR
+if HAVE_ARM_NEON_INTR
 CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) \
                          $(CELT_SOURCES_ARM_NE10:.c=.lo) \
                          %test_unit_mdct.o %test_unit_dft.o
diff --git a/celt/arm/arm_celt_map.c b/celt/arm/arm_celt_map.c
index f132fe1..918e6cf 100644
--- a/celt/arm/arm_celt_map.c
+++ b/celt/arm/arm_celt_map.c
@@ -44,7 +44,7 @@ opus_val32 (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
   MAY_HAVE_NEON(celt_pitch_xcorr)   /* NEON */
 };
 # else /* !FIXED_POINT */
-#  if defined(OPUS_ARM_NEON_INTR)
+#  if defined(OPUS_ARM_MAY_HAVE_NEON_INTR)
 void (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
     const opus_val16 *, opus_val32 *, int, int) = {
   celt_pitch_xcorr_c,              /* ARMv4 */
@@ -113,7 +113,7 @@ void (*const CLT_MDCT_BACKWARD_IMPL[OPUS_ARCHMASK+1])(const mdct_lookup *l,
 };
 
 #endif /* HAVE_ARM_NE10 */
-#  endif /* OPUS_ARM_NEON_INTR */
+#  endif /* OPUS_ARM_MAY_HAVE_NEON_INTR */
 # endif /* FIXED_POINT */
 
 #endif /* OPUS_HAVE_RTCD */
diff --git a/celt/arm/celt_ne10_fft.c b/celt/arm/celt_ne10_fft.c
index d354502..1901024 100644
--- a/celt/arm/celt_ne10_fft.c
+++ b/celt/arm/celt_ne10_fft.c
@@ -44,6 +44,7 @@
 #include "os_support.h"
 #include "stack_alloc.h"
 
+#if !defined(FIXED_POINT)
 #ifdef CUSTOM_MODES
 
 /* nfft lengths in NE10 that support scaled fft */
@@ -144,3 +145,4 @@ void opus_ifft_float_neon(const kiss_fft_state *st,
    }
    RESTORE_STACK;
 }
+#endif /* !defined(FIXED_POINT) */
diff --git a/celt/arm/celt_ne10_mdct.c b/celt/arm/celt_ne10_mdct.c
index 0979cbe..938fc93 100644
--- a/celt/arm/celt_ne10_mdct.c
+++ b/celt/arm/celt_ne10_mdct.c
@@ -43,6 +43,8 @@
 #include "os_support.h"
 #include "stack_alloc.h"
 
+#if !defined(FIXED_POINT)
+
 void clt_mdct_forward_float_neon(const mdct_lookup *l,
                                  kiss_fft_scalar *in,
                                  kiss_fft_scalar * OPUS_RESTRICT out,
@@ -258,3 +260,4 @@ void clt_mdct_backward_float_neon(const mdct_lookup *l,
    }
    RESTORE_STACK;
 }
+#endif /* !defined(FIXED_POINT) */
diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h
index 8626ed7..344186b 100644
--- a/celt/arm/pitch_arm.h
+++ b/celt/arm/pitch_arm.h
@@ -57,7 +57,7 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y,
 #if defined(OPUS_ARM_MAY_HAVE_NEON_INTR)
 void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
                                  opus_val32 *xcorr, int len, int max_pitch);
-#if !defined(OPUS_HAVE_RTCD) || defined(OPUS_ARM_PRESUME_NEON_INTR)
+#if defined(OPUS_ARM_PRESUME_NEON_INTR)
 #define OVERRIDE_PITCH_XCORR (1)
 #   define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
    ((void)(arch),celt_pitch_xcorr_float_neon(_x, _y, xcorr, len, max_pitch))
diff --git a/celt/dump_modes/Makefile b/celt/dump_modes/Makefile
index 10c3679..fef8d94 100644
--- a/celt/dump_modes/Makefile
+++ b/celt/dump_modes/Makefile
@@ -15,7 +15,7 @@ SOURCES = dump_modes.c \
 ifdef HAVE_ARM_NE10
 CC = gcc
 CFLAGS += -mfpu=neon
-INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_NEON_INTR
+INCLUDES += -I$(NE10_INCDIR) -DHAVE_ARM_NE10 -DOPUS_ARM_PRESUME_NEON_INTR
 LIBDIR = -l:$(NE10_LIBDIR)/libNE10.so
 SOURCES += ../arm/celt_ne10_fft.c \
            dump_modes_arm_ne10.c \
diff --git a/celt/pitch.h b/celt/pitch.h
index af745eb..dde48c8 100644
--- a/celt/pitch.h
+++ b/celt/pitch.h
@@ -46,8 +46,7 @@
 #include "mips/pitch_mipsr1.h"
 #endif
 
-#if ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \
-  || defined(OPUS_ARM_NEON_INTR))
+#if (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR))
 # include "arm/pitch_arm.h"
 #endif
 
@@ -189,7 +188,7 @@ celt_pitch_xcorr_c(const opus_val16 *_x, const opus_val16 *_y,
 #if !defined(OVERRIDE_PITCH_XCORR)
 /*Is run-time CPU detection enabled on this platform?*/
 # if defined(OPUS_HAVE_RTCD) && \
-  (defined(OPUS_ARM_ASM) || (defined(OPUS_ARM_NEON_INTR) && !defined(OPUS_ARM_PRESUME_NEON_INTR)))
+  (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR))
 extern
 #  if defined(FIXED_POINT)
 opus_val32
diff --git a/celt/tests/test_unit_dft.c b/celt/tests/test_unit_dft.c
index 9fbcdc4..e17e26f 100644
--- a/celt/tests/test_unit_dft.c
+++ b/celt/tests/test_unit_dft.c
@@ -45,8 +45,7 @@
 #include "mathops.c"
 #include "entcode.c"
 
-#if defined(OPUS_HAVE_RTCD) && \
-         (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR))
+#if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) || defined(OPUS_ARM_ASM)
 #include "arm/armcpu.c"
 #if !defined(FIXED_POINT)
 #if defined(HAVE_ARM_NE10)
diff --git a/celt/tests/test_unit_mathops.c b/celt/tests/test_unit_mathops.c
index a1cf2f7..2e43e07 100644
--- a/celt/tests/test_unit_mathops.c
+++ b/celt/tests/test_unit_mathops.c
@@ -65,17 +65,18 @@
 #include "x86/celt_lpc_sse.c"
 #endif
 #include "x86/x86_celt_map.c"
+
 #elif ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \
-       || defined(OPUS_ARM_NEON_INTR))
-#if defined(OPUS_ARM_NEON_INTR)
+       || defined(OPUS_ARM_MAY_HAVE_NEON_INTR))
+#if defined(OPUS_ARM_MAY_HAVE_NEON_INTR)
 #include "arm/celt_neon_intr.c"
+#endif
 #if defined(HAVE_ARM_NE10)
 #include "kiss_fft.c"
 #include "mdct.c"
 #include "arm/celt_ne10_fft.c"
 #include "arm/celt_ne10_mdct.c"
 #endif
-#endif
 #include "arm/arm_celt_map.c"
 #endif
 
diff --git a/celt/tests/test_unit_mdct.c b/celt/tests/test_unit_mdct.c
index fdee079..53258fe 100644
--- a/celt/tests/test_unit_mdct.c
+++ b/celt/tests/test_unit_mdct.c
@@ -46,8 +46,8 @@
 #include "mathops.c"
 #include "entcode.c"
 
-#if defined(OPUS_HAVE_RTCD) && \
-         (defined(OPUS_ARM_ASM) || defined(OPUS_ARM_MAY_HAVE_NEON_INTR))
+
+#if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) || defined(OPUS_ARM_ASM)
 #include "arm/armcpu.c"
 #if !defined(FIXED_POINT)
 #if defined(HAVE_ARM_NE10)
diff --git a/celt/tests/test_unit_rotation.c b/celt/tests/test_unit_rotation.c
index 4ac838e..ecab5cb 100644
--- a/celt/tests/test_unit_rotation.c
+++ b/celt/tests/test_unit_rotation.c
@@ -63,9 +63,10 @@
 #include "x86/celt_lpc_sse.c"
 #endif
 #include "x86/x86_celt_map.c"
+
 #elif ((defined(OPUS_ARM_ASM) && defined(FIXED_POINT)) \
-       || defined(OPUS_ARM_NEON_INTR))
-#if defined(OPUS_ARM_NEON_INTR)
+       || defined(OPUS_ARM_MAY_HAVE_NEON_INTR))
+#if defined(OPUS_ARM_MAY_HAVE_NEON_INTR)
 #include "arm/celt_neon_intr.c"
 #endif
 #if defined(HAVE_ARM_NE10)
diff --git a/configure.ac b/configure.ac
index 2380a5c..a150d87 100644
--- a/configure.ac
+++ b/configure.ac
@@ -444,7 +444,7 @@ AC_DEFUN([OPUS_PATH_NE10],
 AS_IF([test x"$enable_intrinsics" = x"yes"],[
    intrinsics_support=""
    AS_CASE([$host_cpu],
-   [arm*],
+   [arm*|aarch64],
    [
       cpu_arm=yes
       OPUS_CHECK_INTRINSICS(
@@ -459,55 +459,50 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[
             SUMM = vmlaq_f32(SUMM, A0, A1);
          ]]
       )
-      AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1" && test x"$OPUS_ARM_PRESUME_NEON_INTR" != x"1"],
-          [
-             OPUS_ARM_NEON_INTR_CFLAGS="$ARM_NEON_INTR_CFLAGS"
-             AC_SUBST([OPUS_ARM_NEON_INTR_CFLAGS])
-          ]
+
+      AS_CASE([$host_cpu],
+         [arm*],
+         [
+            AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"],
+                  [
+                     OPUS_ARM_NEON_INTR_CFLAGS="$ARM_NEON_INTR_CFLAGS"
+                     AC_SUBST([OPUS_ARM_NEON_INTR_CFLAGS])
+                     dnl Don't see why defining these is necessary to check features at runtime
+                     AC_DEFINE([OPUS_ARM_MAY_HAVE_EDSP], 1, [Define if compiler support EDSP Instructions])
+                     AC_DEFINE([OPUS_ARM_MAY_HAVE_MEDIA], 1, [Define if compiler support MEDIA Instructions])
+                     AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON], 1, [Define if compiler support NEON instructions])
+                  ]
+            )
+         ]
       )
 
-      #Currently we only have intrinsic optimizations for floating point
-      AS_IF([test x"$enable_float" = x"yes"],
+      AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"],
       [
-         AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"],
-         [
-            OPUS_ARM_NEON_INTR=1
-            AC_DEFINE([OPUS_ARM_NEON_INTR], 1,
-                      [Support ARMv7 Neon Intrinsics for float])
-            AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON_INTR], 1,
-                      [Compiler supports ARMv7 Neon Intrinsics])
-            intrinsics_support="$intrinsics_support (Neon_Intrinsics)"
-
-            AS_IF([test x"enable_rtcd" != x"" && test x"$OPUS_ARM_PRESUME_NEON_INTR" != x"1"],
-                  [rtcd_support="$rtcd_support (ARMv7_Neon_Intrinsics)"],[])
-
-            AS_IF([test x"$OPUS_ARM_PRESUME_NEON_INTR" = x"1"],
-                  [AC_DEFINE([OPUS_ARM_PRESUME_NEON_INTR], 1,
-                             [Define if binary requires NEON intrinsics support])])
-
-			   AS_IF([test x"$rtcd_support" = x""],
-                  [rtcd_support=no])
-
-            AS_IF([test x"$intrinsics_support" = x""],
-                  [intrinsics_support=no],
-			         [intrinsics_support="arm$intrinsics_support"])
-
-            dnl Don't see why defining these is necessary to check features at runtime
-            AC_DEFINE([OPUS_ARM_MAY_HAVE_EDSP], 1, [Define if compiler support EDSP Instructions])
-            AC_DEFINE([OPUS_ARM_MAY_HAVE_MEDIA], 1, [Define if compiler support MEDIA Instructions])
-            AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON], 1, [Define if compiler support NEON instructions])
-
-            OPUS_PATH_NE10()
-            AS_IF([test x"$HAVE_ARM_NE10" = x"1"],
-                  [intrinsics_support="$intrinsics_support NE10"],[])
-         ],
-         [
-            AC_MSG_WARN([Compiler does not support ARM intrinsics])
-            intrinsics_support=no
-         ])
-      ], [
-            AC_MSG_WARN([Currently only have ARM intrinsics for float])
-            intrinsics_support=no
+         AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON_INTR], 1,
+                   [Compiler supports ARMv7 Neon Intrinsics])
+         intrinsics_support="$intrinsics_support (Neon_Intrinsics)"
+
+         AS_IF([test x"$enable_rtcd" != x"" && test x"$OPUS_ARM_PRESUME_NEON_INTR" != x"1"],
+               [rtcd_support="$rtcd_support (ARMv7_Neon_Intrinsics)"],[])
+
+         AS_IF([test x"$OPUS_ARM_PRESUME_NEON_INTR" = x"1"],
+               [AC_DEFINE([OPUS_ARM_PRESUME_NEON_INTR], 1,
+                          [Define if binary requires NEON intrinsics support])])
+
+         AS_IF([test x"$rtcd_support" = x""],
+               [rtcd_support=no])
+
+         AS_IF([test x"$intrinsics_support" = x""],
+               [intrinsics_support=no],
+               [intrinsics_support="arm$intrinsics_support"])
+
+         OPUS_PATH_NE10()
+         AS_IF([test x"$HAVE_ARM_NE10" = x"1"],
+               [intrinsics_support="$intrinsics_support NE10"],[])
+      ],
+      [
+         AC_MSG_WARN([Compiler does not support ARM intrinsics])
+         intrinsics_support=no
       ])
    ],
    [i?86|x86_64],
@@ -663,8 +658,8 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[
 ])
 
 AM_CONDITIONAL([CPU_ARM], [test "$cpu_arm" = "yes"])
-AM_CONDITIONAL([OPUS_ARM_NEON_INTR],
-    [test x"$OPUS_ARM_NEON_INTR" = x"1"])
+AM_CONDITIONAL([HAVE_ARM_NEON_INTR],
+    [test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"])
 AM_CONDITIONAL([HAVE_ARM_NE10],
     [test x"$HAVE_ARM_NE10" = x"1"])
 
-- 
1.9.1
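
As a rough illustration of the macro scheme this patch moves to (MAY_HAVE means the compiler can build the NEON code, PRESUME means the binary may assume NEON is present, and RTCD picks an implementation from a table at run time), here is a small host-independent sketch. Everything prefixed DEMO_/demo_ is hypothetical, and demo_xcorr_neon is only a stand-in that calls the C version so the example runs on any machine; it is not the code from celt_neon_intr.c.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_ARCHMASK 3
#define DEMO_HAVE_RTCD 1
#define DEMO_MAY_HAVE_NEON_INTR 1
/* #define DEMO_PRESUME_NEON_INTR 1 */   /* uncomment to bypass the table */

/* Generic C implementation: plain dot product of x and y. */
static int32_t demo_xcorr_c(const int16_t *x, const int16_t *y, int len)
{
   int32_t acc = 0;
   int j;
   assert(len >= 0);
   for (j = 0; j < len; j++)
      acc += (int32_t)x[j] * y[j];
   return acc;
}

/* Stand-in for an intrinsics version; must return the same result. */
static int32_t demo_xcorr_neon(const int16_t *x, const int16_t *y, int len)
{
   return demo_xcorr_c(x, y, len);
}

#if defined(DEMO_PRESUME_NEON_INTR)
/* NEON is guaranteed: call it directly and ignore arch. */
# define demo_xcorr(x, y, len, arch) \
   ((void)(arch), demo_xcorr_neon(x, y, len))
#elif defined(DEMO_HAVE_RTCD) && defined(DEMO_MAY_HAVE_NEON_INTR)
/* NEON is only possible: index a table by the detected arch level. */
static int32_t (*const DEMO_XCORR_IMPL[DEMO_ARCHMASK + 1])(
      const int16_t *, const int16_t *, int) = {
   demo_xcorr_c, demo_xcorr_c, demo_xcorr_c, demo_xcorr_neon
};
# define demo_xcorr(x, y, len, arch) \
   ((*DEMO_XCORR_IMPL[(arch) & DEMO_ARCHMASK])(x, y, len))
#else
/* No intrinsics at all: always use the C version. */
# define demo_xcorr(x, y, len, arch) \
   ((void)(arch), demo_xcorr_c(x, y, len))
#endif
```

Callers always write demo_xcorr(x, y, len, arch); which of the three branches is compiled in depends only on the configure-time macros, which is the separation the HAVE_ARM_NEON_INTR / OPUS_ARM_MAY_HAVE_NEON_INTR / OPUS_ARM_PRESUME_NEON_INTR split above appears to aim for.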

Viswanath Puttagunta

2015-Mar-31 22:57 UTC

[opus] [RFC PATCH v1 5/5] aarch64: celt_pitch_xcorr: Fixed point intrinsics

Optimize the celt_pitch_xcorr function for fixed point using
NEON intrinsics. Although the same code should in theory work
on ARMv7 as well, it is enabled only for aarch64 for now, since
ARMv7 NEON already has a fixed point asm implementation.

Signed-off-by: Viswanath Puttagunta <viswanath.puttagunta at linaro.org>
---
 celt/arm/celt_neon_intr.c | 268 ++++++++++++++++++++++++++++++++++++++++++++++
 celt/arm/pitch_arm.h      |  10 ++
 configure.ac              |   6 ++
 3 files changed, 284 insertions(+)

diff --git a/celt/arm/celt_neon_intr.c b/celt/arm/celt_neon_intr.c
index 47dce15..be978a0 100644
--- a/celt/arm/celt_neon_intr.c
+++ b/celt/arm/celt_neon_intr.c
@@ -249,4 +249,272 @@ void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
             (const float32_t *)_y+i, (float32_t *)xcorr+i, len);
    }
 }
+#else /* FIXED POINT */
+
+/*
+ * Function: xcorr_kernel_neon_fixed
+ * ---------------------------------
+ * Computes 4 correlation values and stores them in sum[4]
+ */
+static void xcorr_kernel_neon_fixed(const int16_t *x, const int16_t *y,
+                                    int32_t sum[4], int len) {
+   int16x8_t YY[3];
+   int16x4_t YEXT[3];
+   int16x8_t XX[2];
+   int16x4_t XX_2, YY_2;
+   int32x4_t SUMM;
+   const int16_t *xi = x;
+   const int16_t *yi = y;
+
+   celt_assert(len>4);
+
+   YY[0] = vld1q_s16(yi);
+   YY_2 = vget_low_s16(YY[0]);
+
+   SUMM = vdupq_n_s32(0);
+
+   /* Consume 16 elements in x vector and 20 elements in y
+    * vector. However, y[19] and beyond don't get accessed.
+    * So, if len == 16, then we must only access y[0] to y[18].
+    * So, make sure len > 19
+    */
+   while (len > 19) {
+      yi += 8;
+      YY[1] = vld1q_s16(yi);
+      yi += 8;
+      YY[2] = vld1q_s16(yi);
+
+      XX[0] = vld1q_s16(xi);
+      xi += 8;
+      XX[1] = vld1q_s16(xi);
+      xi += 8;
+
+      /* Consume XX[0][0:3] */
+      SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0]), 0);
+
+      YEXT[0] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 1);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_low_s16(XX[0]), 1);
+
+      YEXT[1] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 2);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_low_s16(XX[0]), 2);
+
+      YEXT[2] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 3);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_low_s16(XX[0]), 3);
+
+      /* Consume XX[0][7:4] */
+      SUMM = vmlal_lane_s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0]), 0);
+
+      YEXT[0] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 1);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_high_s16(XX[0]), 1);
+
+      YEXT[1] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 2);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_high_s16(XX[0]), 2);
+
+      YEXT[2] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 3);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_high_s16(XX[0]), 3);
+
+      /* Consume XX[1][3:0]*/
+      SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[1]), vget_low_s16(XX[1]), 0);
+
+      YEXT[0] = vext_s16(vget_low_s16(YY[1]), vget_high_s16(YY[1]), 1);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_low_s16(XX[1]), 1);
+
+      YEXT[1] = vext_s16(vget_low_s16(YY[1]), vget_high_s16(YY[1]), 2);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_low_s16(XX[1]), 2);
+
+      YEXT[2] = vext_s16(vget_low_s16(YY[1]), vget_high_s16(YY[1]), 3);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_low_s16(XX[1]), 3);
+
+      /* Consume XX[1][7:4] */
+      SUMM = vmlal_lane_s16(SUMM, vget_high_s16(YY[1]), vget_high_s16(XX[1]), 0);
+
+      YEXT[0] = vext_s16(vget_high_s16(YY[1]), vget_low_s16(YY[2]), 1);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_high_s16(XX[1]), 1);
+
+      YEXT[1] = vext_s16(vget_high_s16(YY[1]), vget_low_s16(YY[2]), 2);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_high_s16(XX[1]), 2);
+
+      YEXT[2] = vext_s16(vget_high_s16(YY[1]), vget_low_s16(YY[2]), 3);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_high_s16(XX[1]), 3);
+
+      YY[0] = YY[2];
+      len -= 16;
+   }
+
+   /* Consume 8 elements in x vector and 16 elements in y
+    * vector. However, y[15:11] should not be accessed unless
+    * len is > 11
+    */
+   if (len > 11) {
+      yi += 8;
+      YY[1] = vld1q_s16(yi);
+
+      XX[0] = vld1q_s16(xi);
+      xi += 8;
+
+      /* Consume XX[0][0:3] */
+      SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0]), 0);
+
+      YEXT[0] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 1);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_low_s16(XX[0]), 1);
+
+      YEXT[1] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 2);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_low_s16(XX[0]), 2);
+
+      YEXT[2] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 3);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_low_s16(XX[0]), 3);
+
+      /* Consume XX[0][7:4] */
+      SUMM = vmlal_lane_s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0]), 0);
+
+      YEXT[0] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 1);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[0], vget_high_s16(XX[0]), 1);
+
+      YEXT[1] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 2);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[1], vget_high_s16(XX[0]), 2);
+
+      YEXT[2] = vext_s16(vget_high_s16(YY[0]), vget_low_s16(YY[1]), 3);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[2], vget_high_s16(XX[0]), 3);
+
+      YY[0] = YY[1];
+      len -= 8;
+   }
+
+   /* Consume 4 elements in x vector and 8 elements in y vector
+    * However, y[7] should not be accessed unless len > 4
+    */
+   if (len > 4) {
+      XX_2 = vld1_s16(xi);
+      xi += 4;
+      /* Consume XX_2[0:3] */
+      SUMM = vmlal_lane_s16(SUMM, vget_low_s16(YY[0]), XX_2, 0);
+
+      YEXT[0] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 1);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[0], XX_2, 1);
+
+      YEXT[1] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 2);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[1], XX_2, 2);
+
+      YEXT[2] = vext_s16(vget_low_s16(YY[0]), vget_high_s16(YY[0]), 3);
+      SUMM = vmlal_lane_s16(SUMM, YEXT[2], XX_2, 3);
+
+      YY_2 = vget_high_s16(YY[0]);
+      len -= 4;
+   }
+
+   while (--len > 0) {
+      XX_2 = vld1_dup_s16(xi++);
+      SUMM = vmlal_lane_s16(SUMM, YY_2, XX_2, 0);
+      YY_2= vld1_s16(++yi);
+   }
+
+   XX_2 = vld1_dup_s16(xi);
+   SUMM = vmlal_lane_s16(SUMM, YY_2, XX_2, 0);
+
+   vst1q_s32(sum, SUMM);
+}
+
+/*
+  * Function: xcorr_kernel_neon_fixed_process1
+  * ---------------------------------
+  * Computes a single correlation value and stores it in *sum
+  */
+static void xcorr_kernel_neon_fixed_process1(const int16_t *x,
+                                             const int16_t *y,
+                                             int32_t *sum, int len) {
+   int16x8_t XX[2];
+   int16x8_t YY[2];
+
+   int16x4_t XX_2;
+   int16x4_t YY_2;
+
+   int32x4_t SUMM;
+   int32x2_t SUMM_2;
+   const int16_t *xi = x;
+   const int16_t *yi = y;
+
+   SUMM = vdupq_n_s32(0);
+
+   /* Work on 16 values per iteration */
+   while (len >= 16) {
+      XX[0] = vld1q_s16(xi);
+      xi += 8;
+      XX[1] = vld1q_s16(xi);
+      xi += 8;
+
+      YY[0] = vld1q_s16(yi);
+      yi += 8;
+      YY[1] = vld1q_s16(yi);
+      yi += 8;
+
+      SUMM = vmlal_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0]));
+      SUMM = vmlal_s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0]));
+      SUMM = vmlal_s16(SUMM, vget_low_s16(YY[1]), vget_low_s16(XX[1]));
+      SUMM = vmlal_s16(SUMM, vget_high_s16(YY[1]), vget_high_s16(XX[1]));
+
+      len -= 16;
+   }
+
+   /* Work on 8 values */
+   if (len >= 8) {
+      XX[0] = vld1q_s16(xi);
+      xi += 8;
+
+      YY[0] = vld1q_s16(yi);
+      yi += 8;
+
+      SUMM = vmlal_s16(SUMM, vget_low_s16(YY[0]), vget_low_s16(XX[0]));
+      SUMM = vmlal_s16(SUMM, vget_high_s16(YY[0]), vget_high_s16(XX[0]));
+      len -= 8;
+   }
+
+   /* Work on 4 values */
+   if (len >= 4) {
+      XX_2 = vld1_s16(xi);
+      xi += 4;
+      YY_2 = vld1_s16(yi);
+      yi += 4;
+      SUMM = vmlal_s16(SUMM, YY_2, XX_2);
+      len -= 4;
+   }
+
+   SUMM_2 = vadd_s32(vget_high_s32(SUMM), vget_low_s32(SUMM));
+   SUMM_2 = vpadd_s32(SUMM_2, SUMM_2);
+   SUMM = vcombine_s32(SUMM_2, SUMM_2);
+
+   while (len > 0) {
+      XX_2 = vld1_dup_s16(xi++);
+      YY_2 = vld1_dup_s16(yi++);
+      SUMM = vmlal_s16(SUMM, XX_2, YY_2);
+      len--;
+   }
+   vst1q_lane_s32(sum, SUMM, 0);
+}
+
+opus_val32 celt_pitch_xcorr_fixed_neon(const opus_val16 *_x, const opus_val16 *_y,
+                                       opus_val32 *xcorr, int len, int max_pitch) {
+   int i;
+   opus_val32 max_corr = 1;
+   celt_assert(max_pitch > 0);
+   celt_assert((((unsigned char *)_x-(unsigned char *)NULL)&3)==0);
+
+   for (i = 0; i < (max_pitch-3); i += 4) {
+      xcorr_kernel_neon_fixed((const int16_t *)_x, (const int16_t *)_y+i,
+                              (int32_t *)xcorr+i, len);
+   }
+
+   /* In case max_pitch isn't multiple of 4
+    * compute single correlation value per iteration
+    */
+   for (; i < max_pitch; i++) {
+      xcorr_kernel_neon_fixed_process1((const int16_t *)_x,
+                                       (const int16_t *)_y+i,
+                                       (int32_t *)xcorr+i, len);
+   }
+
+   for (i = 0; i < max_pitch; i++) {
+      max_corr = (max_corr > xcorr[i])? max_corr: xcorr[i];
+   }
+   return max_corr;
+}
 #endif
diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h
index 344186b..d5c9408 100644
--- a/celt/arm/pitch_arm.h
+++ b/celt/arm/pitch_arm.h
@@ -32,6 +32,15 @@
 
 # if defined(FIXED_POINT)
 
+#if defined(CPU_AARCH64)
+#define OVERRIDE_PITCH_XCORR (1)
+opus_val32 celt_pitch_xcorr_fixed_neon(const opus_val16 *_x, const opus_val16 *_y,
+                                       opus_val32 *xcorr, int len, int max_pitch);
+#define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
+   ((void)(arch), celt_pitch_xcorr_fixed_neon(_x, _y, xcorr, len, max_pitch))
+
+#else /* End CPU_AARCH64. Begin CPU_ARM */
+
 #  if defined(OPUS_ARM_MAY_HAVE_NEON)
 opus_val32 celt_pitch_xcorr_neon(const opus_val16 *_x, const opus_val16 *_y,
     opus_val32 *xcorr, int len, int max_pitch);
@@ -51,6 +60,7 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y,
 #   define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
   ((void)(arch),PRESUME_NEON(celt_pitch_xcorr)(_x, _y, xcorr, len, max_pitch))
 #  endif
+#endif /* End CPU_ARM */
 
 #else /* Start !FIXED_POINT */
 /* Float case */
diff --git a/configure.ac b/configure.ac
index a150d87..744c9b4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -473,6 +473,11 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[
                      AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON], 1, [Define if compiler support NEON instructions])
                   ]
             )
+         ],
+         [aarch64],
+         [
+            cpu_aarch64=yes
+            AC_DEFINE([CPU_AARCH64], 1, [Compiling for Aarch64])
          ]
       )
 
@@ -658,6 +663,7 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[
 ])
 
 AM_CONDITIONAL([CPU_ARM], [test "$cpu_arm" = "yes"])
+AM_CONDITIONAL([CPU_AARCH64], [test "$cpu_aarch64" = "yes"])
 AM_CONDITIONAL([HAVE_ARM_NEON_INTR],
     [test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"])
 AM_CONDITIONAL([HAVE_ARM_NE10],
-- 
1.9.1
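
For readers following the control flow of celt_pitch_xcorr_fixed_neon() in the patch above, here is a minimal scalar sketch of the same structure: lags are processed in blocks of four, a one-lag tail handles the case where max_pitch is not a multiple of 4, and the maximum is taken over the max_pitch entries written to xcorr. The names pitch_xcorr_ref, xcorr_kernel4_ref and xcorr_kernel1_ref are hypothetical stand-ins and are not part of the patch or of Opus. The plain C loops only approximate what the NEON kernels are assumed to compute, so treat this as a reference for checking results rather than as the actual implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for xcorr_kernel_neon_fixed():
 * computes four consecutive correlation lags.
 * Reads x[0..len-1] and y[0..len+2]. */
static void xcorr_kernel4_ref(const int16_t *x, const int16_t *y,
                              int32_t sum[4], int len)
{
   int j, k;
   for (k = 0; k < 4; k++) {
      int32_t acc = 0;
      for (j = 0; j < len; j++)
         acc += (int32_t)x[j] * y[j + k];
      sum[k] = acc;
   }
}

/* Hypothetical scalar stand-in for xcorr_kernel_neon_fixed_process1():
 * computes a single correlation lag. */
static void xcorr_kernel1_ref(const int16_t *x, const int16_t *y,
                              int32_t *sum, int len)
{
   int j;
   int32_t acc = 0;
   for (j = 0; j < len; j++)
      acc += (int32_t)x[j] * y[j];
   *sum = acc;
}

/* Assumption: y holds at least len + max_pitch - 1 samples and
 * xcorr holds max_pitch entries. */
static int32_t pitch_xcorr_ref(const int16_t *x, const int16_t *y,
                               int32_t *xcorr, int len, int max_pitch)
{
   int i;
   int32_t max_corr = 1;
   assert(max_pitch > 0);

   /* Main loop: four lags per call. */
   for (i = 0; i < max_pitch - 3; i += 4)
      xcorr_kernel4_ref(x, y + i, xcorr + i, len);

   /* Tail: remaining lags when max_pitch is not a multiple of 4. */
   for (; i < max_pitch; i++)
      xcorr_kernel1_ref(x, y + i, xcorr + i, len);

   /* Only max_pitch entries of xcorr were written, so scan only those. */
   for (i = 0; i < max_pitch; i++)
      if (xcorr[i] > max_corr)
         max_corr = xcorr[i];
   return max_corr;
}
```

Comparing the output of a scalar reference like this against the intrinsics version for values of max_pitch that are and are not multiples of 4 exercises both the block path and the tail path.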

Viswanath Puttagunta

2015-Apr-01 03:35 UTC

[opus] [RFC PATCH v1 0/5] aarch64: celt_pitch_xcorr: Fixed point series

Hi Timothy,

FYI, I just submitted a pull request for my Ne10 patch, which enables
builds for AArch64:
https://github.com/projectNe10/Ne10/pull/108

Phil at ARM said he will do more testing on it and merge it soon.

As I mentioned in my previous email, in the meantime, for convenience, I
have provided pre-built Ne10 library binaries at
http://people.linaro.org/~viswanath.puttagunta/opus/NE10_root/

Regards,
Vish


