Jonathan Lennox
2015-Nov-21 04:02 UTC
[opus] [Aarch64 v2 00/18] Patches to enable Aarch64 (version 2)
As promised, here's a re-send of all my Aarch64 patches, following comments by John Ridges. Note that they actually affect more than just Aarch64 -- other than the ones specifically guarded by AARCH64_NEON defines, the Neon intrinsics all also apply on armv7; and the OPUS_FAST_INT64 patches apply on any 64-bit machine. The patches should largely be independent and independently useful, other than obvious infrastructure setups.

Jonathan Lennox (18):
  Move ARM-specific macro overrides to arm-specific file.
  Reorganize ARM CPU #ifdefs.
  Rename OPUS_ARM_NEON_INTR AM_CONDITIONAL as HAVE_ARM_NEON_INTR, for consistency with x86.
  Enable Neon intrinsics for aarch64.
  Add Neon intrinsics for Silk noise shape quantization.
  Add Neon intrinsics for Silk noise shape feedback loop.
  Apply Neon short prediction optimization to silk_noise_shape_quantizer_del_dec.
  Add Neon fixed-point implementation of xcorr_kernel.
  Enable intrinsics by default.
  Clean up some intrinsics-related wording in configure.
  Move OPUS_FAST_INT64 definition to celt/arch.h.
  Add OPUS_FAST_INT64 flavors of celt/fixed_generic.h macros.
  Explicitly cast results of silk OPUS_FAST_INT64 macros back to opus_int32.
  Add OPUS_FAST_INT64 definition of silk_SMULWT.
  Clean up formatting of configure output for ARM intrinsics detection.
  Add configure check for Aarch64-specific Neon intrinsics.
  Add Aarch64 intrinsics for saturated add/subtract.
  Add Aarch64 intrinsic for SIG2WORD16.

 Makefile.am                    |   9 +--
 celt/arch.h                    |   9 ++-
 celt/arm/arm_celt_map.c        |  22 ++++++-
 celt/arm/celt_neon_intr.c      |  61 ++++++++++++++++++-
 celt/arm/fixed_arm64.h         |  35 +++++++++++
 celt/arm/pitch_arm.h           |  62 +++++++++++++++++++-
 celt/fixed_generic.h           |  16 +++++
 celt/pitch.h                   |  20 -------
 celt_headers.mk                |   1 +
 configure.ac                   |  41 +++++++++----
 silk/NSQ.c                     |  55 +++++------------
 silk/NSQ.h                     |  97 ++++++++++++++++++++++++++++++
 silk/NSQ_del_dec.c             |  37 +++++-------
 silk/arm/NSQ_neon.c            | 130 +++++++++++++++++++++++++++++++++++++++++
 silk/arm/NSQ_neon.h            | 101 ++++++++++++++++++++++++++++++++
 silk/arm/macros_arm64.h        |  39 +++++++++++++
 silk/macros.h                  |  22 ++++---
 silk/mips/NSQ_del_dec_mipsr1.h |   3 +-
 silk/x86/NSQ_sse.c             |   2 +-
 silk/x86/main_sse.h            |   3 +-
 silk_headers.mk                |   3 +
 silk_sources.mk                |   2 +
 22 files changed, 655 insertions(+), 115 deletions(-)
 create mode 100644 celt/arm/fixed_arm64.h
 create mode 100644 silk/NSQ.h
 create mode 100644 silk/arm/NSQ_neon.c
 create mode 100644 silk/arm/NSQ_neon.h
 create mode 100644 silk/arm/macros_arm64.h

-- 
2.4.9 (Apple Git-60)
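Patches 11-18 are not quoted in this digest. For the OPUS_FAST_INT64 ones mentioned above, the idea is that on a target with cheap native 64-bit multiplies, SILK's Q16 fixed-point macros can be computed with a single widening multiply instead of the two 16x32 partial products the portable versions use. A sketch of the shape such a definition takes -- an illustration of the idea only, not the exact macro text from patches 11-14:

/* Illustration only: silk_SMULWT(a32, b32) is (a32 * (b32 >> 16)) >> 16.
   With fast 64-bit arithmetic it becomes one widening multiply; the explicit
   cast back to opus_int32 is what patch 13's subject refers to. */
#if OPUS_FAST_INT64
#define silk_SMULWT(a32, b32) \
    ((opus_int32)(((opus_int64)(a32) * ((b32) >> 16)) >> 16))
#endif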
Jonathan Lennox
2015-Nov-21 04:02 UTC
[opus] [Aarch64 v2 01/18] Move ARM-specific macro overrides to arm-specific file.
--- celt/arm/pitch_arm.h | 20 ++++++++++++++++++++ celt/pitch.h | 20 -------------------- 2 files changed, 20 insertions(+), 20 deletions(-) diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h index 8626ed7..eaf61c9 100644 --- a/celt/arm/pitch_arm.h +++ b/celt/arm/pitch_arm.h @@ -65,4 +65,24 @@ void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, #endif #endif /* end !FIXED_POINT */ + +/*Is run-time CPU detection enabled on this platform?*/ +# if defined(OPUS_HAVE_RTCD) && (defined(OPUS_ARM_ASM) \ + || (defined(OPUS_ARM_MAY_HAVE_NEON_INTR) \ + && !defined(OPUS_ARM_PRESUME_NEON_INTR))) +extern +# if defined(FIXED_POINT) +opus_val32 +# else +void +# endif +(*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *, + const opus_val16 *, opus_val32 *, int, int); + +# define OVERRIDE_PITCH_XCORR +# define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \ + ((*CELT_PITCH_XCORR_IMPL[(arch)&OPUS_ARCHMASK])(_x, _y, \ + xcorr, len, max_pitch)) +# endif + #endif diff --git a/celt/pitch.h b/celt/pitch.h index 65a77a6..d350353 100644 --- a/celt/pitch.h +++ b/celt/pitch.h @@ -187,25 +187,6 @@ celt_pitch_xcorr_c(const opus_val16 *_x, const opus_val16 *_y, opus_val32 *xcorr, int len, int max_pitch); #if !defined(OVERRIDE_PITCH_XCORR) -/*Is run-time CPU detection enabled on this platform?*/ -# if defined(OPUS_HAVE_RTCD) && (defined(OPUS_ARM_ASM) \ - || (defined(OPUS_ARM_MAY_HAVE_NEON_INTR) \ - && !defined(OPUS_ARM_PRESUME_NEON_INTR))) -extern -# if defined(FIXED_POINT) -opus_val32 -# else -void -# endif -(*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *, - const opus_val16 *, opus_val32 *, int, int); - -# define OVERRIDE_PITCH_XCORR -# define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \ - ((*CELT_PITCH_XCORR_IMPL[(arch)&OPUS_ARCHMASK])(_x, _y, \ - xcorr, len, max_pitch)) -# else - #ifdef FIXED_POINT opus_val32 #else @@ -214,7 +195,6 @@ void celt_pitch_xcorr(const opus_val16 *_x, const opus_val16 *_y, opus_val32 *xcorr, int len, int max_pitch, int arch); -# endif #endif #endif -- 2.4.9 (Apple Git-60)
Jonathan Lennox
2015-Nov-21 04:02 UTC
[opus] [Aarch64 v2 02/18] Reorganize ARM CPU #ifdefs.
--- celt/arm/arm_celt_map.c | 5 ++++- celt/arm/pitch_arm.h | 19 ++++++++++++++----- 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/celt/arm/arm_celt_map.c b/celt/arm/arm_celt_map.c index ee6c244..f195315 100644 --- a/celt/arm/arm_celt_map.c +++ b/celt/arm/arm_celt_map.c @@ -35,7 +35,10 @@ #if defined(OPUS_HAVE_RTCD) -# if defined(FIXED_POINT) +# if defined(FIXED_POINT) && \ + ((defined(OPUS_ARM_MAY_HAVE_NEON) && !defined(OPUS_ARM_PRESUME_NEON)) || \ + (defined(OPUS_ARM_MAY_HAVE_MEDIA) && !defined(OPUS_ARM_PRESUME_MEDIA)) || \ + (defined(OPUS_ARM_MAY_HAVE_EDSP) && !defined(OPUS_ARM_PRESUME_EDSP))) opus_val32 (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *, const opus_val16 *, opus_val32 *, int , int) = { celt_pitch_xcorr_c, /* ARMv4 */ diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h index eaf61c9..bd41774 100644 --- a/celt/arm/pitch_arm.h +++ b/celt/arm/pitch_arm.h @@ -46,7 +46,13 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y, opus_val32 *xcorr, int len, int max_pitch); # endif -# if !defined(OPUS_HAVE_RTCD) +# if (defined(OPUS_ARM_MAY_HAVE_NEON) || \ + defined(OPUS_ARM_MAY_HAVE_MEDIA) || \ + defined(OPUS_ARM_MAY_HAVE_EDSP)) && \ + (!defined(OPUS_HAVE_RTCD) || \ + defined(OPUS_ARM_PRESUME_NEON) || \ + (defined(OPUS_ARM_PRESUME_MEDIA) && !defined(OPUS_ARM_MAY_HAVE_NEON)) || \ + (defined(OPUS_ARM_PRESUME_EDSP) && !defined(OPUS_ARM_MAY_HAVE_NEON) && !defined(OPUS_ARM_MAY_HAVE_MEDIA))) # define OVERRIDE_PITCH_XCORR (1) # define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \ ((void)(arch),PRESUME_NEON(celt_pitch_xcorr)(_x, _y, xcorr, len, max_pitch)) @@ -66,10 +72,13 @@ void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, #endif /* end !FIXED_POINT */ -/*Is run-time CPU detection enabled on this platform?*/ -# if defined(OPUS_HAVE_RTCD) && (defined(OPUS_ARM_ASM) \ - || (defined(OPUS_ARM_MAY_HAVE_NEON_INTR) \ - && !defined(OPUS_ARM_PRESUME_NEON_INTR))) +# if defined(OPUS_HAVE_RTCD) && \ + (defined(FIXED_POINT) && \ + ((defined(OPUS_ARM_MAY_HAVE_NEON) && !defined(OPUS_ARM_PRESUME_NEON)) || \ + (defined(OPUS_ARM_MAY_HAVE_MEDIA) && !defined(OPUS_ARM_PRESUME_MEDIA)) || \ + (defined(OPUS_ARM_MAY_HAVE_EDSP) && !defined(OPUS_ARM_PRESUME_EDSP)))) || \ + (!defined(FIXED_POINT) && \ + (defined(OPUS_ARM_MAY_HAVE_NEON_INTR) && !defined(OPUS_ARM_PRESUME_NEON_INTR))) extern # if defined(FIXED_POINT) opus_val32 -- 2.4.9 (Apple Git-60)
Jonathan Lennox
2015-Nov-21 04:02 UTC
[opus] [Aarch64 v2 03/18] Rename OPUS_ARM_NEON_INTR AM_CONDITIONAL as HAVE_ARM_NEON_INTR, for consistency with x86.
--- Makefile.am | 4 ++-- configure.ac | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/Makefile.am b/Makefile.am index 4d3a888..d256b45 100644 --- a/Makefile.am +++ b/Makefile.am @@ -47,7 +47,7 @@ if CPU_ARM CELT_SOURCES += $(CELT_SOURCES_ARM) SILK_SOURCES += $(SILK_SOURCES_ARM) -if OPUS_ARM_NEON_INTR +if HAVE_ARM_NEON_INTR CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR) endif @@ -294,7 +294,7 @@ SSE4_1_OBJ = $(CELT_SOURCES_SSE4_1:.c=.lo) \ $(SSE4_1_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_X86_SSE4_1_CFLAGS) endif -if OPUS_ARM_NEON_INTR +if HAVE_ARM_NEON_INTR CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) $(CELT_ARM_NEON_INTR_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += \ $(OPUS_ARM_NEON_INTR_CFLAGS) $(NE10_CFLAGS) diff --git a/configure.ac b/configure.ac index 74aa2f4..5f6fc71 100644 --- a/configure.ac +++ b/configure.ac @@ -696,7 +696,7 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ ]) AM_CONDITIONAL([CPU_ARM], [test "$cpu_arm" = "yes"]) -AM_CONDITIONAL([OPUS_ARM_NEON_INTR], +AM_CONDITIONAL([HAVE_ARM_NEON_INTR], [test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"]) AM_CONDITIONAL([HAVE_ARM_NE10], [test x"$HAVE_ARM_NE10" = x"1"]) -- 2.4.9 (Apple Git-60)
Jonathan Lennox
2015-Nov-21 04:02 UTC
[opus] [Aarch64 v2 04/18] Enable Neon intrinsics for aarch64.
Enables existing Neon intrinsic optimizations to work on aarch64 targets.
---
 configure.ac | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index 5f6fc71..6f83f82 100644
--- a/configure.ac
+++ b/configure.ac
@@ -459,7 +459,7 @@ AC_DEFUN([OPUS_PATH_NE10],
 AS_IF([test x"$enable_intrinsics" = x"yes"],[
   intrinsics_support=""
   AS_CASE([$host_cpu],
-  [arm*],
+  [arm*|aarch64*],
   [
     cpu_arm=yes
     OPUS_CHECK_INTRINSICS(
-- 
2.4.9 (Apple Git-60)
Jonathan Lennox
2015-Nov-21 04:02 UTC
[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
--- Makefile.am | 5 +-- silk/NSQ.c | 37 ++++++++-------------- silk/NSQ.h | 70 +++++++++++++++++++++++++++++++++++++++++ silk/arm/NSQ_neon.c | 64 +++++++++++++++++++++++++++++++++++++ silk/arm/NSQ_neon.h | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++++ silk/x86/NSQ_sse.c | 2 +- silk/x86/main_sse.h | 3 +- silk_headers.mk | 2 ++ silk_sources.mk | 2 ++ 9 files changed, 248 insertions(+), 28 deletions(-) create mode 100644 silk/NSQ.h create mode 100644 silk/arm/NSQ_neon.c create mode 100644 silk/arm/NSQ_neon.h diff --git a/Makefile.am b/Makefile.am index d256b45..36762c2 100644 --- a/Makefile.am +++ b/Makefile.am @@ -49,6 +49,7 @@ SILK_SOURCES += $(SILK_SOURCES_ARM) if HAVE_ARM_NEON_INTR CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR) +SILK_SOURCES += $(SILK_SOURCES_ARM_NEON_INTR) endif if HAVE_ARM_NE10 @@ -295,7 +296,7 @@ $(SSE4_1_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += $(OPUS_X86_SSE4_1_CFLAGS) endif if HAVE_ARM_NEON_INTR -CELT_ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) -$(CELT_ARM_NEON_INTR_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += \ +ARM_NEON_INTR_OBJ = $(CELT_SOURCES_ARM_NEON_INTR:.c=.lo) $(SILK_SOURCES_ARM_NEON_INTR:.c=.lo) +$(ARM_NEON_INTR_OBJ) $(OPT_UNIT_TEST_OBJ): CFLAGS += \ $(OPUS_ARM_NEON_INTR_CFLAGS) $(NE10_CFLAGS) endif diff --git a/silk/NSQ.c b/silk/NSQ.c index a065884..d8513dc 100644 --- a/silk/NSQ.c +++ b/silk/NSQ.c @@ -31,6 +31,8 @@ POSSIBILITY OF SUCH DAMAGE. #include "main.h" #include "stack_alloc.h" +#include "NSQ.h" + static OPUS_INLINE void silk_nsq_scale_states( const silk_encoder_state *psEncC, /* I Encoder State */ @@ -66,7 +68,8 @@ static OPUS_INLINE void silk_noise_shape_quantizer( opus_int offset_Q10, /* I */ opus_int length, /* I Input length */ opus_int shapingLPCOrder, /* I Noise shaping AR filter order */ - opus_int predictLPCOrder /* I Prediction filter order */ + opus_int predictLPCOrder, /* I Prediction filter order */ + int arch /* I Architecture */ ); #endif @@ -155,7 +158,7 @@ void silk_NSQ_c silk_noise_shape_quantizer( NSQ, psIndices->signalType, x_sc_Q10, pulses, pxq, sLTP_Q15, A_Q12, B_Q14, AR_shp_Q13, lag, HarmShapeFIRPacked_Q14, Tilt_Q14[ k ], LF_shp_Q14[ k ], Gains_Q16[ k ], Lambda_Q10, - offset_Q10, psEncC->subfr_length, psEncC->shapingLPCOrder, psEncC->predictLPCOrder ); + offset_Q10, psEncC->subfr_length, psEncC->shapingLPCOrder, psEncC->predictLPCOrder, psEncC->arch ); x_Q3 += psEncC->subfr_length; pulses += psEncC->subfr_length; @@ -198,7 +201,8 @@ void silk_noise_shape_quantizer( opus_int offset_Q10, /* I */ opus_int length, /* I Input length */ opus_int shapingLPCOrder, /* I Noise shaping AR filter order */ - opus_int predictLPCOrder /* I Prediction filter order */ + opus_int predictLPCOrder, /* I Prediction filter order */ + int arch /* I Architecture */ ) { opus_int i, j; @@ -207,6 +211,9 @@ void silk_noise_shape_quantizer( opus_int32 exc_Q14, LPC_exc_Q14, xq_Q14, Gain_Q10; opus_int32 tmp1, tmp2, sLF_AR_shp_Q14; opus_int32 *psLPC_Q14, *shp_lag_ptr, *pred_lag_ptr; +#ifdef OPUS_ARM_MAY_HAVE_NEON_INTR + opus_int32 a_Q12_rev[16]; +#endif shp_lag_ptr = &NSQ->sLTP_shp_Q14[ NSQ->sLTP_shp_buf_idx - lag + HARM_SHAPE_FIR_TAPS / 2 ]; pred_lag_ptr = &sLTP_Q15[ NSQ->sLTP_buf_idx - lag + LTP_ORDER / 2 ]; @@ -215,32 +222,14 @@ void silk_noise_shape_quantizer( /* Set up short term AR state */ psLPC_Q14 = &NSQ->sLPC_Q14[ NSQ_LPC_BUF_LENGTH - 1 ]; + optional_coef_reversal(a_Q12_rev, a_Q12, predictLPCOrder); + for( i = 0; i < length; i++ ) { /* Generate dither */ NSQ->rand_seed = silk_RAND( NSQ->rand_seed ); /* Short-term prediction */ - 
silk_assert( predictLPCOrder == 10 || predictLPCOrder == 16 ); - /* Avoids introducing a bias because silk_SMLAWB() always rounds to -inf */ - LPC_pred_Q10 = silk_RSHIFT( predictLPCOrder, 1 ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ 0 ], a_Q12[ 0 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -1 ], a_Q12[ 1 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -2 ], a_Q12[ 2 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -3 ], a_Q12[ 3 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -4 ], a_Q12[ 4 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -5 ], a_Q12[ 5 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -6 ], a_Q12[ 6 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -7 ], a_Q12[ 7 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -8 ], a_Q12[ 8 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -9 ], a_Q12[ 9 ] ); - if( predictLPCOrder == 16 ) { - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -10 ], a_Q12[ 10 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -11 ], a_Q12[ 11 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -12 ], a_Q12[ 12 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -13 ], a_Q12[ 13 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -14 ], a_Q12[ 14 ] ); - LPC_pred_Q10 = silk_SMLAWB( LPC_pred_Q10, psLPC_Q14[ -15 ], a_Q12[ 15 ] ); - } + LPC_pred_Q10 = silk_noise_shape_quantizer_short_prediction(psLPC_Q14, a_Q12, a_Q12_rev, predictLPCOrder, arch); /* Long-term prediction */ if( signalType == TYPE_VOICED ) { diff --git a/silk/NSQ.h b/silk/NSQ.h new file mode 100644 index 0000000..a18a951 --- /dev/null +++ b/silk/NSQ.h @@ -0,0 +1,70 @@ +/*********************************************************************** +Copyright (c) 2014 Vidyo. +Copyright (c) 2006-2011, Skype Limited. All rights reserved. +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: +- Redistributions of source code must retain the above copyright notice, +this list of conditions and the following disclaimer. +- Redistributions in binary form must reproduce the above copyright +notice, this list of conditions and the following disclaimer in the +documentation and/or other materials provided with the distribution. +- Neither the name of Internet Society, IETF or IETF Trust, nor the +names of specific contributors, may be used to endorse or promote +products derived from this software without specific prior written +permission. +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. 
+***********************************************************************/ +#ifndef SILK_NSQ_H +#define SILK_NSQ_H + +#define optional_coef_reversal(out, in, order) + +static OPUS_INLINE opus_int32 silk_noise_shape_quantizer_short_prediction_c(const opus_int32 *buf32, const opus_int16 *coef16, opus_int order) +{ + opus_int32 out; + silk_assert( order == 10 || order == 16 ); + + /* Avoids introducing a bias because silk_SMLAWB() always rounds to -inf */ + out = silk_RSHIFT( order, 1 ); + out = silk_SMLAWB( out, buf32[ 0 ], coef16[ 0 ] ); + out = silk_SMLAWB( out, buf32[ -1 ], coef16[ 1 ] ); + out = silk_SMLAWB( out, buf32[ -2 ], coef16[ 2 ] ); + out = silk_SMLAWB( out, buf32[ -3 ], coef16[ 3 ] ); + out = silk_SMLAWB( out, buf32[ -4 ], coef16[ 4 ] ); + out = silk_SMLAWB( out, buf32[ -5 ], coef16[ 5 ] ); + out = silk_SMLAWB( out, buf32[ -6 ], coef16[ 6 ] ); + out = silk_SMLAWB( out, buf32[ -7 ], coef16[ 7 ] ); + out = silk_SMLAWB( out, buf32[ -8 ], coef16[ 8 ] ); + out = silk_SMLAWB( out, buf32[ -9 ], coef16[ 9 ] ); + + if( order == 16 ) + { + out = silk_SMLAWB( out, buf32[ -10 ], coef16[ 10 ] ); + out = silk_SMLAWB( out, buf32[ -11 ], coef16[ 11 ] ); + out = silk_SMLAWB( out, buf32[ -12 ], coef16[ 12 ] ); + out = silk_SMLAWB( out, buf32[ -13 ], coef16[ 13 ] ); + out = silk_SMLAWB( out, buf32[ -14 ], coef16[ 14 ] ); + out = silk_SMLAWB( out, buf32[ -15 ], coef16[ 15 ] ); + } + return out; +} + +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) ((void)arch,silk_noise_shape_quantizer_short_prediction_c(in, coef, order)) + + +#if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) +#include "arm/NSQ_neon.h" +#endif + +#endif /* SILK_NSQ_H */ diff --git a/silk/arm/NSQ_neon.c b/silk/arm/NSQ_neon.c new file mode 100644 index 0000000..96b672d --- /dev/null +++ b/silk/arm/NSQ_neon.c @@ -0,0 +1,64 @@ +/*********************************************************************** +Copyright (C) 2014 Vidyo +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: +- Redistributions of source code must retain the above copyright notice, +this list of conditions and the following disclaimer. +- Redistributions in binary form must reproduce the above copyright +notice, this list of conditions and the following disclaimer in the +documentation and/or other materials provided with the distribution. +- Neither the name of Internet Society, IETF or IETF Trust, nor the +names of specific contributors, may be used to endorse or promote +products derived from this software without specific prior written +permission. +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. 
+***********************************************************************/ +#ifdef HAVE_CONFIG_H +#include "config.h" +#endif + +#include <arm_neon.h> +#include "main.h" +#include "stack_alloc.h" +#include "NSQ.h" +#include "celt/cpu_support.h" +#include "celt/arm/armcpu.h" + +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32) +{ + int32x4_t coef0 = vld1q_s32(coef32); + int32x4_t coef1 = vld1q_s32(coef32 + 4); + int32x4_t coef2 = vld1q_s32(coef32 + 8); + int32x4_t coef3 = vld1q_s32(coef32 + 12); + + int32x4_t a0 = vld1q_s32(buf32 - 15); + int32x4_t a1 = vld1q_s32(buf32 - 11); + int32x4_t a2 = vld1q_s32(buf32 - 7); + int32x4_t a3 = vld1q_s32(buf32 - 3); + + int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0)); + int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0)); + int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1)); + int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1), vget_high_s32(coef1)); + int64x2_t b4 = vmlal_s32(b3, vget_low_s32(a2), vget_low_s32(coef2)); + int64x2_t b5 = vmlal_s32(b4, vget_high_s32(a2), vget_high_s32(coef2)); + int64x2_t b6 = vmlal_s32(b5, vget_low_s32(a3), vget_low_s32(coef3)); + int64x2_t b7 = vmlal_s32(b6, vget_high_s32(a3), vget_high_s32(coef3)); + + int64x1_t c = vadd_s64(vget_low_s64(b7), vget_high_s64(b7)); + int64x1_t cS = vshr_n_s64(c, 16); + int32x2_t d = vreinterpret_s32_s64(cS); + opus_int32 out = vget_lane_s32(d, 0); + return out; +} diff --git a/silk/arm/NSQ_neon.h b/silk/arm/NSQ_neon.h new file mode 100644 index 0000000..8e67cb9 --- /dev/null +++ b/silk/arm/NSQ_neon.h @@ -0,0 +1,91 @@ +/*********************************************************************** +Copyright (C) 2014 Vidyo +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: +- Redistributions of source code must retain the above copyright notice, +this list of conditions and the following disclaimer. +- Redistributions in binary form must reproduce the above copyright +notice, this list of conditions and the following disclaimer in the +documentation and/or other materials provided with the distribution. +- Neither the name of Internet Society, IETF or IETF Trust, nor the +names of specific contributors, may be used to endorse or promote +products derived from this software without specific prior written +permission. +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE +LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR +CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF +SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS +INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN +CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. 
+***********************************************************************/ +#ifndef SILK_NSQ_NEON_H +#define SILK_NSQ_NEON_H + +#undef optional_coef_reversal +// reverse a_Q12 coefs to make calc easier, convert to 32 +static OPUS_INLINE void optional_coef_reversal_neon(opus_int32 *out, const opus_int16 *in, opus_int order) +{ + out[15] = in[0]; + out[14] = in[1]; + out[13] = in[2]; + out[12] = in[3]; + out[11] = in[4]; + out[10] = in[5]; + out[9] = in[6]; + out[8] = in[7]; + out[7] = in[8]; + out[6] = in[9]; + + if (order == 16) + { + out[5] = in[10]; + out[4] = in[11]; + out[3] = in[12]; + out[2] = in[13]; + out[1] = in[14]; + out[0] = in[15]; + } + else + { + out[5] = 0; + out[4] = 0; + out[3] = 0; + out[2] = 0; + out[1] = 0; + out[0] = 0; + } +} + +#if OPUS_ARM_PRESUME_NEON_INTR + +#define optional_coef_reversal(out, in, order) (optional_coef_reversal_neon(out, in, order)) + +#elif OPUS_HAVE_RTCD + +#define optional_coef_reversal(out, in, order) do { if (arch == 3) { optional_coef_reversal_neon(out, in, order); } } while (0) + +#endif + +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32); + +#if OPUS_ARM_PRESUME_NEON_INTR +#undef silk_noise_shape_quantizer_short_prediction +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) ((void)arch,silk_noise_shape_quantizer_short_prediction_neon(in, coefRev)) + +#elif OPUS_HAVE_RTCD + +/* silk_noise_shape_quantizer_short_prediction implementations take different parameters based on arch + (coef vs. coefRev) so can't use the usual IMPL table implementation */ +#undef silk_noise_shape_quantizer_short_prediction +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) (arch == 3 ? silk_noise_shape_quantizer_short_prediction_neon(in, coefRev) : silk_noise_shape_quantizer_short_prediction_c(in, coef, order)) + + +#endif + +#endif /* SILK_NSQ_NEON_H */ diff --git a/silk/x86/NSQ_sse.c b/silk/x86/NSQ_sse.c index 72f34fd..bb3c5f1 100644 --- a/silk/x86/NSQ_sse.c +++ b/silk/x86/NSQ_sse.c @@ -221,7 +221,7 @@ void silk_NSQ_sse4_1( { silk_noise_shape_quantizer( NSQ, psIndices->signalType, x_sc_Q10, pulses, pxq, sLTP_Q15, A_Q12, B_Q14, AR_shp_Q13, lag, HarmShapeFIRPacked_Q14, Tilt_Q14[ k ], LF_shp_Q14[ k ], Gains_Q16[ k ], Lambda_Q10, - offset_Q10, psEncC->subfr_length, psEncC->shapingLPCOrder, psEncC->predictLPCOrder ); + offset_Q10, psEncC->subfr_length, psEncC->shapingLPCOrder, psEncC->predictLPCOrder, psEncC->arch ); } x_Q3 += psEncC->subfr_length; diff --git a/silk/x86/main_sse.h b/silk/x86/main_sse.h index afd5ec2..d8d6131 100644 --- a/silk/x86/main_sse.h +++ b/silk/x86/main_sse.h @@ -207,7 +207,8 @@ void silk_noise_shape_quantizer( opus_int offset_Q10, /* I */ opus_int length, /* I Input length */ opus_int shapingLPCOrder, /* I Noise shaping AR filter order */ - opus_int predictLPCOrder /* I Prediction filter order */ + opus_int predictLPCOrder, /* I Prediction filter order */ + int arch /* I Architecture */ ); /**************************/ diff --git a/silk_headers.mk b/silk_headers.mk index 679ff8f..c74ab81 100644 --- a/silk_headers.mk +++ b/silk_headers.mk @@ -15,6 +15,7 @@ silk/Inlines.h \ silk/MacroCount.h \ silk/MacroDebug.h \ silk/macros.h \ +silk/NSQ.h \ silk/pitch_est_defines.h \ silk/resampler_private.h \ silk/resampler_rom.h \ @@ -25,6 +26,7 @@ silk/arm/macros_armv4.h \ silk/arm/macros_armv5e.h \ silk/arm/SigProc_FIX_armv4.h \ silk/arm/SigProc_FIX_armv5e.h \ +silk/arm/NSQ_neon.h \ silk/fixed/main_FIX.h \ silk/fixed/structs_FIX.h 
\ silk/fixed/mips/noise_shape_analysis_FIX_mipsr1.h \ diff --git a/silk_sources.mk b/silk_sources.mk index 7cfb7d3..79ac6f0 100644 --- a/silk_sources.mk +++ b/silk_sources.mk @@ -82,6 +82,8 @@ silk/x86/x86_silk_map.c \ silk/x86/VAD_sse.c \ silk/x86/VQ_WMat_EC_sse.c +SILK_SOURCES_ARM_NEON_INTR = silk/arm/NSQ_neon.c + SILK_SOURCES_FIXED = \ silk/fixed/LTP_analysis_filter_FIX.c \ silk/fixed/LTP_scale_ctrl_FIX.c \ -- 2.4.9 (Apple Git-60)
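For readers wanting to sanity-check the new path from this patch, a minimal smoke test along the following lines can be used. This is a hypothetical sketch, not part of the patch; it assumes a fixed-point build where OPUS_ARM_MAY_HAVE_NEON_INTR is defined so that silk/NSQ.h pulls in silk/arm/NSQ_neon.h, and that silk/arm/NSQ_neon.c is linked in.

/* Hypothetical smoke test: run the C and Neon short-prediction paths on the
   same data and compare.  Note the Neon version is not bit-exact with
   silk_SMLAWB() (see the review later in this thread), so small differences
   are expected. */
#include <stdio.h>
#include "main.h"
#include "NSQ.h"

int main(void)
{
    opus_int32 buf32[16];
    opus_int16 coef_Q12[16];
    opus_int32 coef_rev[16];
    opus_int32 ref, neon;
    int i;

    for( i = 0; i < 16; i++ ) {
        buf32[ i ]    = i * 123457 - 987654;            /* arbitrary test data */
        coef_Q12[ i ] = (opus_int16)( i * 311 - 2048 );
    }
    optional_coef_reversal_neon( coef_rev, coef_Q12, 16 );

    /* Both functions read the 16 samples ending at the pointer passed in. */
    ref  = silk_noise_shape_quantizer_short_prediction_c( buf32 + 15, coef_Q12, 16 );
    neon = silk_noise_shape_quantizer_short_prediction_neon( buf32 + 15, coef_rev );

    printf( "C: %d  Neon: %d\n", (int)ref, (int)neon );
    return 0;
}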
Jonathan Lennox
2015-Nov-21 04:02 UTC
[opus] [Aarch64 v2 06/18] Add Neon intrinsics for Silk noise shape feedback loop.
--- silk/NSQ.c | 18 ++------------- silk/NSQ.h | 27 ++++++++++++++++++++++ silk/arm/NSQ_neon.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++ silk/arm/NSQ_neon.h | 10 ++++++++ 4 files changed, 105 insertions(+), 16 deletions(-) diff --git a/silk/NSQ.c b/silk/NSQ.c index d8513dc..ec81f3b 100644 --- a/silk/NSQ.c +++ b/silk/NSQ.c @@ -205,7 +205,7 @@ void silk_noise_shape_quantizer( int arch /* I Architecture */ ) { - opus_int i, j; + opus_int i; opus_int32 LTP_pred_Q13, LPC_pred_Q10, n_AR_Q12, n_LTP_Q13; opus_int32 n_LF_Q12, r_Q10, rr_Q10, q1_Q0, q1_Q10, q2_Q10, rd1_Q20, rd2_Q20; opus_int32 exc_Q14, LPC_exc_Q14, xq_Q14, Gain_Q10; @@ -248,21 +248,7 @@ void silk_noise_shape_quantizer( /* Noise shape feedback */ silk_assert( ( shapingLPCOrder & 1 ) == 0 ); /* check that order is even */ - tmp2 = psLPC_Q14[ 0 ]; - tmp1 = NSQ->sAR2_Q14[ 0 ]; - NSQ->sAR2_Q14[ 0 ] = tmp2; - n_AR_Q12 = silk_RSHIFT( shapingLPCOrder, 1 ); - n_AR_Q12 = silk_SMLAWB( n_AR_Q12, tmp2, AR_shp_Q13[ 0 ] ); - for( j = 2; j < shapingLPCOrder; j += 2 ) { - tmp2 = NSQ->sAR2_Q14[ j - 1 ]; - NSQ->sAR2_Q14[ j - 1 ] = tmp1; - n_AR_Q12 = silk_SMLAWB( n_AR_Q12, tmp1, AR_shp_Q13[ j - 1 ] ); - tmp1 = NSQ->sAR2_Q14[ j + 0 ]; - NSQ->sAR2_Q14[ j + 0 ] = tmp2; - n_AR_Q12 = silk_SMLAWB( n_AR_Q12, tmp2, AR_shp_Q13[ j ] ); - } - NSQ->sAR2_Q14[ shapingLPCOrder - 1 ] = tmp1; - n_AR_Q12 = silk_SMLAWB( n_AR_Q12, tmp1, AR_shp_Q13[ shapingLPCOrder - 1 ] ); + n_AR_Q12 = silk_NSQ_noise_shape_feedback_loop(psLPC_Q14, NSQ->sAR2_Q14, AR_shp_Q13, shapingLPCOrder, arch); n_AR_Q12 = silk_LSHIFT32( n_AR_Q12, 1 ); /* Q11 -> Q12 */ n_AR_Q12 = silk_SMLAWB( n_AR_Q12, NSQ->sLF_AR_shp_Q14, Tilt_Q14 ); diff --git a/silk/NSQ.h b/silk/NSQ.h index a18a951..df856e6 100644 --- a/silk/NSQ.h +++ b/silk/NSQ.h @@ -62,6 +62,33 @@ static OPUS_INLINE opus_int32 silk_noise_shape_quantizer_short_prediction_c(cons #define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) ((void)arch,silk_noise_shape_quantizer_short_prediction_c(in, coef, order)) +static OPUS_INLINE opus_int32 silk_NSQ_noise_shape_feedback_loop_c(const opus_int32 *data0, opus_int32 *data1, const opus_int16 *coef, opus_int order) +{ + opus_int32 out; + opus_int32 tmp1, tmp2; + opus_int j; + + tmp2 = data0[0]; + tmp1 = data1[0]; + data1[0] = tmp2; + + out = silk_RSHIFT(order, 1); + out = silk_SMLAWB(out, tmp2, coef[0]); + + for (j = 2; j < order; j += 2) { + tmp2 = data1[j - 1]; + data1[j - 1] = tmp1; + out = silk_SMLAWB(out, tmp1, coef[j - 1]); + tmp1 = data1[j + 0]; + data1[j + 0] = tmp2; + out = silk_SMLAWB(out, tmp2, coef[j]); + } + data1[order - 1] = tmp1; + out = silk_SMLAWB(out, tmp1, coef[order - 1]); + return out; +} + +#define silk_NSQ_noise_shape_feedback_loop(data0, data1, coef, order, arch) ((void)arch,silk_NSQ_noise_shape_feedback_loop_c(data0, data1, coef, order)) #if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) #include "arm/NSQ_neon.h" diff --git a/silk/arm/NSQ_neon.c b/silk/arm/NSQ_neon.c index 96b672d..fb858f3 100644 --- a/silk/arm/NSQ_neon.c +++ b/silk/arm/NSQ_neon.c @@ -62,3 +62,69 @@ opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *bu opus_int32 out = vget_lane_s32(d, 0); return out; } + + +opus_int32 silk_NSQ_noise_shape_feedback_loop_neon(const opus_int32 *data0, opus_int32 *data1, const opus_int16 *coef, opus_int order) +{ + opus_int32 out; + if (order == 8) + { + int32x4_t a00 = vdupq_n_s32(data0[0]); + int32x4_t a01 = vld1q_s32(data1); // data1[0] ... 
[3] + + int32x4_t a0 = vextq_s32 (a00, a01, 3); // data0[0] data1[0] ...[2] + int32x4_t a1 = vld1q_s32(data1 + 3); // data1[3] ... [6] + + int16x8_t coef16 = vld1q_s16(coef); + int32x4_t coef0 = vmovl_s16(vget_low_s16(coef16)); + int32x4_t coef1 = vmovl_s16(vget_high_s16(coef16)); + + int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0)); + int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0)); + int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1)); + int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1), vget_high_s32(coef1)); + + int64x1_t c = vadd_s64(vget_low_s64(b3), vget_high_s64(b3)); + int64x1_t cS = vshr_n_s64(c, 16); + int32x2_t d = vreinterpret_s32_s64(cS); + + out = vget_lane_s32(d, 0); + vst1q_s32(data1, a0); + vst1q_s32(data1 + 4, a1); + } + else + { + opus_int32 tmp1, tmp2; + opus_int j; + + tmp2 = data0[0]; + tmp1 = data1[0]; + data1[0] = tmp2; + + out = silk_RSHIFT(order, 1); + out = silk_SMLAWB(out, tmp2, coef[0]); + + for (j = 2; j < order; j += 2) { + tmp2 = data1[j - 1]; + data1[j - 1] = tmp1; + out = silk_SMLAWB(out, tmp1, coef[j - 1]); + tmp1 = data1[j + 0]; + data1[j + 0] = tmp2; + out = silk_SMLAWB(out, tmp2, coef[j]); + } + data1[order - 1] = tmp1; + out = silk_SMLAWB(out, tmp1, coef[order - 1]); + } + return out; +} + +#if !defined(OPUS_ARM_PRESUME_NEON_INTR) && defined(OPUS_HAVE_RTCD) + +opus_int32 (*const SILK_NSQ_NOISE_SHAPE_FEEDBACK_LOOP_NEON_IMPL[OPUS_ARCHMASK+1])(const opus_int32 *data0, opus_int32 *data1, const opus_int16 *coef, opus_int order) = { + silk_NSQ_noise_shape_feedback_loop_c, + silk_NSQ_noise_shape_feedback_loop_c, + silk_NSQ_noise_shape_feedback_loop_c, + silk_NSQ_noise_shape_feedback_loop_neon, +}; + +#endif diff --git a/silk/arm/NSQ_neon.h b/silk/arm/NSQ_neon.h index 8e67cb9..24db2a6 100644 --- a/silk/arm/NSQ_neon.h +++ b/silk/arm/NSQ_neon.h @@ -74,10 +74,15 @@ static OPUS_INLINE void optional_coef_reversal_neon(opus_int32 *out, const opus_ opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32); +opus_int32 silk_NSQ_noise_shape_feedback_loop_neon(const opus_int32 *data0, opus_int32 *data1, const opus_int16 *coef, opus_int order); + #if OPUS_ARM_PRESUME_NEON_INTR #undef silk_noise_shape_quantizer_short_prediction #define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) ((void)arch,silk_noise_shape_quantizer_short_prediction_neon(in, coefRev)) +#undef silk_NSQ_noise_shape_feedback_loop +#define silk_NSQ_noise_shape_feedback_loop(data0, data1, coef, order, arch) ((void)arch,silk_NSQ_noise_shape_feedback_loop_neon(data0, data1, coef, order)) + #elif OPUS_HAVE_RTCD /* silk_noise_shape_quantizer_short_prediction implementations take different parameters based on arch @@ -85,6 +90,11 @@ opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *bu #undef silk_noise_shape_quantizer_short_prediction #define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) (arch == 3 ? 
silk_noise_shape_quantizer_short_prediction_neon(in, coefRev) : silk_noise_shape_quantizer_short_prediction_c(in, coef, order)) +extern opus_int32 (*const SILK_NSQ_NOISE_SHAPE_FEEDBACK_LOOP_NEON_IMPL[OPUS_ARCHMASK+1])(const opus_int32 *data0, opus_int32 *data1, const opus_int16 *coef, opus_int order); + +#undef silk_NSQ_noise_shape_feedback_loop +#define silk_NSQ_noise_shape_feedback_loop(data0, data1, coef, order, arch) (SILK_NSQ_NOISE_SHAPE_FEEDBACK_LOOP_NEON_IMPL[(arch)&OPUS_ARCHMASK](data0, data1, coef, order)) + #endif -- 2.4.9 (Apple Git-60)
Jonathan Lennox
2015-Nov-21 04:02 UTC
[opus] [Aarch64 v2 07/18] Apply Neon short prediction optimization to silk_noise_shape_quantizer_del_dec.
--- silk/NSQ_del_dec.c | 37 +++++++++++++------------------------ silk/mips/NSQ_del_dec_mipsr1.h | 3 ++- 2 files changed, 15 insertions(+), 25 deletions(-) diff --git a/silk/NSQ_del_dec.c b/silk/NSQ_del_dec.c index aff560c..aaa1fca 100644 --- a/silk/NSQ_del_dec.c +++ b/silk/NSQ_del_dec.c @@ -31,6 +31,8 @@ POSSIBILITY OF SUCH DAMAGE. #include "main.h" #include "stack_alloc.h" +#include "NSQ.h" + typedef struct { opus_int32 sLPC_Q14[ MAX_SUB_FRAME_LENGTH + NSQ_LPC_BUF_LENGTH ]; @@ -106,7 +108,8 @@ static OPUS_INLINE void silk_noise_shape_quantizer_del_dec( opus_int warping_Q16, /* I */ opus_int nStatesDelayedDecision, /* I Number of states in decision tree */ opus_int *smpl_buf_idx, /* I Index to newest samples in buffers */ - opus_int decisionDelay /* I */ + opus_int decisionDelay, /* I */ + int arch /* I */ ); void silk_NSQ_del_dec_c( @@ -260,7 +263,7 @@ void silk_NSQ_del_dec_c( silk_noise_shape_quantizer_del_dec( NSQ, psDelDec, psIndices->signalType, x_sc_Q10, pulses, pxq, sLTP_Q15, delayedGain_Q10, A_Q12, B_Q14, AR_shp_Q13, lag, HarmShapeFIRPacked_Q14, Tilt_Q14[ k ], LF_shp_Q14[ k ], Gains_Q16[ k ], Lambda_Q10, offset_Q10, psEncC->subfr_length, subfr++, psEncC->shapingLPCOrder, - psEncC->predictLPCOrder, psEncC->warping_Q16, psEncC->nStatesDelayedDecision, &smpl_buf_idx, decisionDelay ); + psEncC->predictLPCOrder, psEncC->warping_Q16, psEncC->nStatesDelayedDecision, &smpl_buf_idx, decisionDelay, psEncC->arch ); x_Q3 += psEncC->subfr_length; pulses += psEncC->subfr_length; @@ -333,7 +336,8 @@ static OPUS_INLINE void silk_noise_shape_quantizer_del_dec( opus_int warping_Q16, /* I */ opus_int nStatesDelayedDecision, /* I Number of states in decision tree */ opus_int *smpl_buf_idx, /* I Index to newest samples in buffers */ - opus_int decisionDelay /* I */ + opus_int decisionDelay, /* I */ + int arch /* I */ ) { opus_int i, j, k, Winner_ind, RDmin_ind, RDmax_ind, last_smple_idx; @@ -343,6 +347,9 @@ static OPUS_INLINE void silk_noise_shape_quantizer_del_dec( opus_int32 q1_Q0, q1_Q10, q2_Q10, exc_Q14, LPC_exc_Q14, xq_Q14, Gain_Q10; opus_int32 tmp1, tmp2, sLF_AR_shp_Q14; opus_int32 *pred_lag_ptr, *shp_lag_ptr, *psLPC_Q14; +#ifdef OPUS_ARM_MAY_HAVE_NEON_INTR + opus_int32 a_Q12_rev[16]; +#endif VARDECL( NSQ_sample_pair, psSampleState ); NSQ_del_dec_struct *psDD; NSQ_sample_struct *psSS; @@ -355,6 +362,8 @@ static OPUS_INLINE void silk_noise_shape_quantizer_del_dec( pred_lag_ptr = &sLTP_Q15[ NSQ->sLTP_buf_idx - lag + LTP_ORDER / 2 ]; Gain_Q10 = silk_RSHIFT( Gain_Q16, 6 ); + optional_coef_reversal(a_Q12_rev, a_Q12, predictLPCOrder); + for( i = 0; i < length; i++ ) { /* Perform common calculations used in all states */ @@ -398,27 +407,7 @@ static OPUS_INLINE void silk_noise_shape_quantizer_del_dec( /* Pointer used in short term prediction and shaping */ psLPC_Q14 = &psDD->sLPC_Q14[ NSQ_LPC_BUF_LENGTH - 1 + i ]; /* Short-term prediction */ - silk_assert( predictLPCOrder == 10 || predictLPCOrder == 16 ); - /* Avoids introducing a bias because silk_SMLAWB() always rounds to -inf */ - LPC_pred_Q14 = silk_RSHIFT( predictLPCOrder, 1 ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ 0 ], a_Q12[ 0 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -1 ], a_Q12[ 1 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -2 ], a_Q12[ 2 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -3 ], a_Q12[ 3 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -4 ], a_Q12[ 4 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -5 ], a_Q12[ 5 ] ); - LPC_pred_Q14 = silk_SMLAWB( 
LPC_pred_Q14, psLPC_Q14[ -6 ], a_Q12[ 6 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -7 ], a_Q12[ 7 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -8 ], a_Q12[ 8 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -9 ], a_Q12[ 9 ] ); - if( predictLPCOrder == 16 ) { - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -10 ], a_Q12[ 10 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -11 ], a_Q12[ 11 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -12 ], a_Q12[ 12 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -13 ], a_Q12[ 13 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -14 ], a_Q12[ 14 ] ); - LPC_pred_Q14 = silk_SMLAWB( LPC_pred_Q14, psLPC_Q14[ -15 ], a_Q12[ 15 ] ); - } + LPC_pred_Q14 = silk_noise_shape_quantizer_short_prediction(psLPC_Q14, a_Q12, a_Q12_rev, predictLPCOrder, arch); LPC_pred_Q14 = silk_LSHIFT( LPC_pred_Q14, 4 ); /* Q10 -> Q14 */ /* Noise shape feedback */ diff --git a/silk/mips/NSQ_del_dec_mipsr1.h b/silk/mips/NSQ_del_dec_mipsr1.h index f6afd92..88e281b 100644 --- a/silk/mips/NSQ_del_dec_mipsr1.h +++ b/silk/mips/NSQ_del_dec_mipsr1.h @@ -62,7 +62,8 @@ static inline void silk_noise_shape_quantizer_del_dec( opus_int warping_Q16, /* I */ opus_int nStatesDelayedDecision, /* I Number of states in decision tree */ opus_int *smpl_buf_idx, /* I Index to newest samples in buffers */ - opus_int decisionDelay /* I */ + opus_int decisionDelay, /* I */ + int arch /* I */ ) { opus_int i, j, k, Winner_ind, RDmin_ind, RDmax_ind, last_smple_idx; -- 2.4.9 (Apple Git-60)
Jonathan Lennox
2015-Nov-21 04:03 UTC
[opus] [Aarch64 v2 08/18] Add Neon fixed-point implementation of xcorr_kernel.
Used for celt_pitch_xcorr on aarch64, and celt_fir and celt_iir on both armv7 and aarch64. --- celt/arm/arm_celt_map.c | 17 +++++++++++++ celt/arm/celt_neon_intr.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++- celt/arm/pitch_arm.h | 31 +++++++++++++++++++++++- 3 files changed, 107 insertions(+), 2 deletions(-) diff --git a/celt/arm/arm_celt_map.c b/celt/arm/arm_celt_map.c index f195315..5794e44 100644 --- a/celt/arm/arm_celt_map.c +++ b/celt/arm/arm_celt_map.c @@ -58,6 +58,23 @@ void (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *, # endif # endif /* FIXED_POINT */ +#if defined(FIXED_POINT) && defined(OPUS_HAVE_RTCD) && \ + defined(OPUS_ARM_MAY_HAVE_NEON_INTR) && !defined(OPUS_ARM_PRESUME_NEON_INTR) + +void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])( + const opus_val16 *x, + const opus_val16 *y, + opus_val32 sum[4], + int len +) = { + xcorr_kernel_c, /* ARMv4 */ + xcorr_kernel_c, /* EDSP */ + xcorr_kernel_c, /* Media */ + xcorr_kernel_neon_fixed, /* Neon */ +}; + +#endif + # if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) # if defined(HAVE_ARM_NE10) # if defined(CUSTOM_MODES) diff --git a/celt/arm/celt_neon_intr.c b/celt/arm/celt_neon_intr.c index 47dce15..557c3b7 100644 --- a/celt/arm/celt_neon_intr.c +++ b/celt/arm/celt_neon_intr.c @@ -37,7 +37,66 @@ #include <arm_neon.h> #include "../pitch.h" -#if !defined(FIXED_POINT) +#if defined(FIXED_POINT) +void xcorr_kernel_neon_fixed(const opus_val16 * x, const opus_val16 * y, opus_val32 sum[4], int len) +{ + int j; + int32x4_t a = vld1q_s32(sum); + //Load y[0...3] + //This requires len>0 to always be valid (which we assert in the C code). + int16x4_t y0 = vld1_s16(y); + y += 4; + + for (j = 0; j + 8 <= len; j += 8) + { + // Load x[0...7] + int16x8_t xx = vld1q_s16(x); + int16x4_t x0 = vget_low_s16(xx); + int16x4_t x4 = vget_high_s16(xx); + // Load y[4...11] + int16x8_t yy = vld1q_s16(y); + int16x4_t y4 = vget_low_s16(yy); + int16x4_t y8 = vget_high_s16(yy); + int32x4_t a0 = vmlal_lane_s16(a, y0, x0, 0); + int32x4_t a1 = vmlal_lane_s16(a0, y4, x4, 0); + + int16x4_t y1 = vext_s16(y0, y4, 1); + int16x4_t y5 = vext_s16(y4, y8, 1); + int32x4_t a2 = vmlal_lane_s16(a1, y1, x0, 1); + int32x4_t a3 = vmlal_lane_s16(a2, y5, x4, 1); + + int16x4_t y2 = vext_s16(y0, y4, 2); + int16x4_t y6 = vext_s16(y4, y8, 2); + int32x4_t a4 = vmlal_lane_s16(a3, y2, x0, 2); + int32x4_t a5 = vmlal_lane_s16(a4, y6, x4, 2); + + int16x4_t y3 = vext_s16(y0, y4, 3); + int16x4_t y7 = vext_s16(y4, y8, 3); + int32x4_t a6 = vmlal_lane_s16(a5, y3, x0, 3); + int32x4_t a7 = vmlal_lane_s16(a6, y7, x4, 3); + + y0 = y8; + a = a7; + x += 8; + y += 8; + } + + for (; j < len; j++) + { + int16x4_t x0 = vld1_dup_s16(x); //load next x + int32x4_t a0 = vmlal_s16(a, y0, x0); + + int16x4_t y4 = vld1_dup_s16(y); //load next y + y0 = vext_s16(y0, y4, 1); + a = a0; + x++; + y++; + } + + vst1q_s32(sum, a); +} + +#else /* * Function: xcorr_kernel_neon_float * --------------------------------- diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h index bd41774..545c115 100644 --- a/celt/arm/pitch_arm.h +++ b/celt/arm/pitch_arm.h @@ -56,7 +56,36 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y, # define OVERRIDE_PITCH_XCORR (1) # define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \ ((void)(arch),PRESUME_NEON(celt_pitch_xcorr)(_x, _y, xcorr, len, max_pitch)) -# endif + +#endif + +# if defined(OPUS_ARM_MAY_HAVE_NEON_INTR) + +void xcorr_kernel_neon_fixed( + const opus_val16 *x, + const opus_val16 *y, + opus_val32 sum[4], + int len); + +# define 
OVERRIDE_XCORR_KERNEL (1) + +# if defined(OPUS_ARM_PRESUME_NEON_INTR) || !defined(OPUS_HAVE_RTCD) +#define xcorr_kernel(x, y, sum, len, arch) \ + ((void)arch, xcorr_kernel_neon_fixed(x, y, sum, len)) +# else /* Start !OPUS_ARM_PRESUME_NEON_INTR */ + +extern void (*const XCORR_KERNEL_IMPL[OPUS_ARCHMASK + 1])( + const opus_val16 *x, + const opus_val16 *y, + opus_val32 sum[4], + int len); + +#define xcorr_kernel(x, y, sum, len, arch) \ + ((*XCORR_KERNEL_IMPL[(arch) & OPUS_ARCHMASK])(x, y, sum, len)) + + +# endif /* end !OPUS_ARM_PRESUME_NEON_INTR */ +# endif /* end OPUS_ARM_MAY_HAVE_NEON_INTR */ #else /* Start !FIXED_POINT */ /* Float case */ -- 2.4.9 (Apple Git-60)
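For reference, this is what xcorr_kernel() computes -- a plain-C restatement of the behaviour the Neon code in this patch implements, not the actual unrolled xcorr_kernel_c from celt/pitch.h:

/* Reference behaviour: accumulate the correlation of x against y at lags
   0..3 into sum[0..3].  len > 0 is assumed, matching the comment in the
   Neon version above.  (Sketch only; the real xcorr_kernel_c is unrolled.) */
static void xcorr_kernel_ref(const opus_val16 *x, const opus_val16 *y,
                             opus_val32 sum[4], int len)
{
    int j;
    for (j = 0; j < len; j++) {
        sum[0] += (opus_val32)x[j] * y[j];
        sum[1] += (opus_val32)x[j] * y[j + 1];
        sum[2] += (opus_val32)x[j] * y[j + 2];
        sum[3] += (opus_val32)x[j] * y[j + 3];
    }
}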
Jonathan Lennox
2015-Nov-21 04:03 UTC
[opus] [Aarch64 v2 09/18] Enable intrinsics by default.
---
 configure.ac | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/configure.ac b/configure.ac
index 6f83f82..f52d2c2 100644
--- a/configure.ac
+++ b/configure.ac
@@ -190,8 +190,8 @@ AC_ARG_ENABLE([rtcd],
     [enable_rtcd=yes])
 
 AC_ARG_ENABLE([intrinsics],
-    [AS_HELP_STRING([--enable-intrinsics], [Enable intrinsics optimizations for ARM(float) X86(fixed)])],,
-    [enable_intrinsics=no])
+    [AS_HELP_STRING([--disable-intrinsics], [Disable intrinsics optimizations for ARM(float) X86(fixed)])],,
+    [enable_intrinsics=yes])
 
 rtcd_support=no
 cpu_arm=no
-- 
2.4.9 (Apple Git-60)
Jonathan Lennox
2015-Nov-21 04:03 UTC
[opus] [Aarch64 v2 10/18] Clean up some intrinsics-related wording in configure.
--- configure.ac | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/configure.ac b/configure.ac index f52d2c2..e1a6e9b 100644 --- a/configure.ac +++ b/configure.ac @@ -190,7 +190,7 @@ AC_ARG_ENABLE([rtcd], [enable_rtcd=yes]) AC_ARG_ENABLE([intrinsics], - [AS_HELP_STRING([--disable-intrinsics], [Disable intrinsics optimizations for ARM(float) X86(fixed)])],, + [AS_HELP_STRING([--disable-intrinsics], [Disable intrinsics optimizations])],, [enable_intrinsics=yes]) rtcd_support=no @@ -483,11 +483,11 @@ AS_IF([test x"$enable_intrinsics" = x"yes"],[ AS_IF([test x"$OPUS_ARM_MAY_HAVE_NEON_INTR" = x"1"], [ - AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON_INTR], 1, [Compiler supports ARMv7 Neon Intrinsics]) + AC_DEFINE([OPUS_ARM_MAY_HAVE_NEON_INTR], 1, [Compiler supports ARMv7/Aarch64 Neon Intrinsics]) intrinsics_support="$intrinsics_support (Neon_Intrinsics)" AS_IF([test x"enable_rtcd" != x"" && test x"$OPUS_ARM_PRESUME_NEON_INTR" != x"1"], - [rtcd_support="$rtcd_support (ARMv7_Neon_Intrinsics)"]) + [rtcd_support="$rtcd_support (Neon_Intrinsics)"]) AS_IF([test x"$OPUS_ARM_PRESUME_NEON_INTR" = x"1"], [AC_DEFINE([OPUS_ARM_PRESUME_NEON_INTR], 1, [Define if binary requires NEON intrinsics support])]) -- 2.4.9 (Apple Git-60)
Timothy B. Terriberry
2015-Dec-08 17:13 UTC
[opus] [Aarch64 v2 02/18] Reorganize ARM CPU #ifdefs.
Jonathan Lennox wrote:
> -# if defined(FIXED_POINT)
> +# if defined(FIXED_POINT) && \
> + ((defined(OPUS_ARM_MAY_HAVE_NEON) && !defined(OPUS_ARM_PRESUME_NEON)) || \
> + (defined(OPUS_ARM_MAY_HAVE_MEDIA) && !defined(OPUS_ARM_PRESUME_MEDIA)) || \
> + (defined(OPUS_ARM_MAY_HAVE_EDSP) && !defined(OPUS_ARM_PRESUME_EDSP)))
>  opus_val32 (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
>      const opus_val16 *, opus_val32 *, int , int) = {

Maybe I'm missing something, but...

> -/*Is run-time CPU detection enabled on this platform?*/
> -# if defined(OPUS_HAVE_RTCD) && (defined(OPUS_ARM_ASM) \
> -   || (defined(OPUS_ARM_MAY_HAVE_NEON_INTR) \
> -   && !defined(OPUS_ARM_PRESUME_NEON_INTR)))
> +# if defined(OPUS_HAVE_RTCD) && \
> + (defined(FIXED_POINT) && \
> + ((defined(OPUS_ARM_MAY_HAVE_NEON) && !defined(OPUS_ARM_PRESUME_NEON)) || \
> + (defined(OPUS_ARM_MAY_HAVE_MEDIA) && !defined(OPUS_ARM_PRESUME_MEDIA)) || \
> + (defined(OPUS_ARM_MAY_HAVE_EDSP) && !defined(OPUS_ARM_PRESUME_EDSP)))) || \
> + (!defined(FIXED_POINT) && \
> + (defined(OPUS_ARM_MAY_HAVE_NEON_INTR) && !defined(OPUS_ARM_PRESUME_NEON_INTR)))
>  extern
> # if defined(FIXED_POINT)
>  opus_val32
> # else
>  void
> # endif
>  (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
>      const opus_val16 *, opus_val32 *, int, int);

Shouldn't the first case have a corresponding update in the #else clause for the !defined(FIXED_POINT) case?

>  celt_pitch_xcorr_c, /* ARMv4 */
> diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h
> index eaf61c9..bd41774 100644
> --- a/celt/arm/pitch_arm.h
> +++ b/celt/arm/pitch_arm.h
> @@ -46,7 +46,13 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y,
>      opus_val32 *xcorr, int len, int max_pitch);
> # endif
>
> -# if !defined(OPUS_HAVE_RTCD)
> +# if (defined(OPUS_ARM_MAY_HAVE_NEON) || \
> + defined(OPUS_ARM_MAY_HAVE_MEDIA) || \
> + defined(OPUS_ARM_MAY_HAVE_EDSP)) && \
> + (!defined(OPUS_HAVE_RTCD) || \
> + defined(OPUS_ARM_PRESUME_NEON) || \
> + (defined(OPUS_ARM_PRESUME_MEDIA) && !defined(OPUS_ARM_MAY_HAVE_NEON)) || \
> + (defined(OPUS_ARM_PRESUME_EDSP) && !defined(OPUS_ARM_MAY_HAVE_NEON) && !defined(OPUS_ARM_MAY_HAVE_MEDIA)))
> # define OVERRIDE_PITCH_XCORR (1)
> # define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
>      ((void)(arch),PRESUME_NEON(celt_pitch_xcorr)(_x, _y, xcorr, len, max_pitch))
> @@ -66,10 +72,13 @@ void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
>
> #endif /* end !FIXED_POINT */
>

And shouldn't this be the inverse of the previous two?
Timothy B. Terriberry
2015-Dec-20 03:07 UTC
[opus] [Aarch64 v2 05/18] Add Neon intrinsics for Silk noise shape quantization.
Jonathan Lennox wrote:
> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32)
> +{
> +    int32x4_t coef0 = vld1q_s32(coef32);
> +    int32x4_t coef1 = vld1q_s32(coef32 + 4);
> +    int32x4_t coef2 = vld1q_s32(coef32 + 8);
> +    int32x4_t coef3 = vld1q_s32(coef32 + 12);
> +
> +    int32x4_t a0 = vld1q_s32(buf32 - 15);
> +    int32x4_t a1 = vld1q_s32(buf32 - 11);
> +    int32x4_t a2 = vld1q_s32(buf32 - 7);
> +    int32x4_t a3 = vld1q_s32(buf32 - 3);
> +
> +    int64x2_t b0 = vmull_s32(vget_low_s32(a0), vget_low_s32(coef0));
> +    int64x2_t b1 = vmlal_s32(b0, vget_high_s32(a0), vget_high_s32(coef0));
> +    int64x2_t b2 = vmlal_s32(b1, vget_low_s32(a1), vget_low_s32(coef1));
> +    int64x2_t b3 = vmlal_s32(b2, vget_high_s32(a1), vget_high_s32(coef1));
> +    int64x2_t b4 = vmlal_s32(b3, vget_low_s32(a2), vget_low_s32(coef2));
> +    int64x2_t b5 = vmlal_s32(b4, vget_high_s32(a2), vget_high_s32(coef2));
> +    int64x2_t b6 = vmlal_s32(b5, vget_low_s32(a3), vget_low_s32(coef3));
> +    int64x2_t b7 = vmlal_s32(b6, vget_high_s32(a3), vget_high_s32(coef3));
> +
> +    int64x1_t c = vadd_s64(vget_low_s64(b7), vget_high_s64(b7));
> +    int64x1_t cS = vshr_n_s64(c, 16);
> +    int32x2_t d = vreinterpret_s32_s64(cS);
> +    opus_int32 out = vget_lane_s32(d, 0);
> +    return out;
> +}

So, this is not bit-exact in a portion of the code where I am personally wary of the problems that might cause, since (like most speech codecs) we can use slightly unstable filters. If there was a big speed advantage it might be worth the testing to make sure nothing diverges here significantly (and it's _probably_ fine), but I think you can actually do this faster while remaining bitexact.

If you shift up the contents of coef32 by 15 bits (which you can do, since you are already transforming them specially for this platform), you can use vqdmulhq_s32() to emulate SMULWB. You then have to do the addition in a separate instruction, but because you can keep all of the results in 32-bit, you get double the parallelism and only need half as many multiplies (which have much higher latency than addition). Overall it should be faster, and match the C code exactly.

> +#define optional_coef_reversal(out, in, order) do { if (arch == 3) { optional_coef_reversal_neon(out, in, order); } } while (0)
> +
> +#endif
> +
> +opus_int32 silk_noise_shape_quantizer_short_prediction_neon(const opus_int32 *buf32, const opus_int32 *coef32);
> +
> +#if OPUS_ARM_PRESUME_NEON_INTR
> +#undef silk_noise_shape_quantizer_short_prediction
> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) ((void)arch,silk_noise_shape_quantizer_short_prediction_neon(in, coefRev))
> +
> +#elif OPUS_HAVE_RTCD
> +
> +/* silk_noise_shape_quantizer_short_prediction implementations take different parameters based on arch
> +   (coef vs. coefRev) so can't use the usual IMPL table implementation */
> +#undef silk_noise_shape_quantizer_short_prediction
> +#define silk_noise_shape_quantizer_short_prediction(in, coef, coefRev, order, arch) (arch == 3 ? silk_noise_shape_quantizer_short_prediction_neon(in, coefRev) : silk_noise_shape_quantizer_short_prediction_c(in, coef, order))

I'm also not wild about these hard-coded 3's. Right now what arch maps to what number is confined to arm_celt_map.c, which does not use the indices directly (only sorts its table entries by them). So we never got named constants for them. But if we have to re-organize what arch configurations we support, these might change. Random 3's scattered across the codebase are going to be hard to track down and update.

(also, I realize libopus doesn't have a line-length restriction, but a few newlines in here might be a mercy to those of us who work in 80-column terminals)
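A sketch of the bit-exact variant described above -- hypothetical, not code from the opus tree: if the reversed coefficients are pre-shifted left by 15 bits when they are widened to 32 bits, vqdmulhq_s32() yields (2*a*(coef<<15)) >> 32 == (a*coef) >> 16 per lane, the same truncating Q16 multiply silk_SMLAWB() performs, and the accumulation becomes an ordinary 32-bit vector add.

/* Hypothetical illustration of the suggestion above.  coef_Q27 holds the
   reversed Q12 coefficients pre-shifted left by 15 bits.  vqdmulhq_s32(a, b)
   computes (2*a*b) >> 32, so with b = coef << 15 each lane is
   (a*coef) >> 16, matching silk_SMLAWB()'s truncation; the saturating case
   (both inputs INT32_MIN) cannot occur with 16-bit coefficients. */
#include <arm_neon.h>

static inline int32x4_t silk_SMLAWB_4_neon(int32x4_t acc, int32x4_t buf,
                                           int32x4_t coef_Q27)
{
    return vaddq_s32(acc, vqdmulhq_s32(buf, coef_Q27));
}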