Timothy B. Terriberry
2014-Nov-28 21:52 UTC
[opus] [RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Review comments inline.

> +if OPUS_ARM_NEON_INTR
> +noinst_LTLIBRARIES = libarmneon.la
> +libarmneon_la_SOURCES = $(CELT_SOURCES_ARM_NEON_INTR)
> +libarmneon_la_CPPFLAGS = $(OPUS_ARM_NEON_INTR_CPPFLAGS) -I$(top_srcdir)/include
> +endif

I don't think these should be in a separate library. It brings with it
lots of complications (to name one: wouldn't the .pc files need to be
updated?). Please use the same mechanism that the SSE intrinsics use to
add CFLAGS to the compilation of specific object files, e.g.,

$(SSE_OBJ): CFLAGS += -msse4.1

> +void (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
> +      const opus_val16 *, opus_val32 *, int , int, int) = {
> +   celt_pitch_xcorr_c,          /* ARMv4 */
> +   celt_pitch_xcorr_c,          /* EDSP */
> +   celt_pitch_xcorr_c,          /* Media */
> +#if defined(OPUS_ARM_NEON_INTR)
> +   celt_pitch_xcorr_float_neon  /* Neon */

Please do not use tabs in source code (this applies here and everywhere
below). Even with the tabs expanded in context, the comments here do not
line up properly.

> +static void xcorr_kernel_neon_float(float *x, float *y, float sum[4], int len) {

x and y should be const.

> +   float32x4_t YY[5];
> +   float32x4_t XX[4];
> +   float32x2_t XX_2;
> +   float32x4_t SUMM[4];
> +   float *xi = x;
> +   float *yi = y;
> +   int cd = len/4;
> +   int cr = len%4;

len is signed, so / and % are NOT equivalent to the corresponding >> and
& (they are much slower).
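A minimal sketch of the change being asked for here: for non-negative
operands (which the celt_assert(len>=3) below guarantees), the shift and
mask compute the same values as / and %:

   int cd = len >> 2;  /* same as len/4 for len >= 0 */
   int cr = len & 3;   /* same as len%4 for len >= 0 */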
> +   int j;
> +
> +   celt_assert(len>=3);
> +
> +   /* Initialize sums to 0 */
> +   SUMM[0] = vdupq_n_f32(0);
> +   SUMM[1] = vdupq_n_f32(0);
> +   SUMM[2] = vdupq_n_f32(0);
> +   SUMM[3] = vdupq_n_f32(0);
> +
> +   YY[0] = vld1q_f32(yi);
> +
> +   /* Each loop consumes 8 floats in y vector
> +    * and 4 floats in x vector
> +    */
> +   for (j = 0; j < cd; j++) {
> +      yi += 4;
> +      YY[4] = vld1q_f32(yi);

If len == 4, then in the first iteration you will have loaded 8 y
values, but only 7 are guaranteed to be available (e.g., the C code only
references y[0] up to y[len-1+3]). You need to end this loop early and
fall back to another approach. See comments in celt_pitch_xcorr_arm.s
for details and an example (there are other useful comments there that
could shave another cycle or two from this inner loop).

> +      YY[1] = vextq_f32(YY[0], YY[4], 1);
> +      YY[2] = vextq_f32(YY[0], YY[4], 2);
> +      YY[3] = vextq_f32(YY[0], YY[4], 3);
> +
> +      XX[0] = vld1q_dup_f32(xi++);
> +      XX[1] = vld1q_dup_f32(xi++);
> +      XX[2] = vld1q_dup_f32(xi++);
> +      XX[3] = vld1q_dup_f32(xi++);

Don't do this. Do a single load and use vmlaq_lane_f32() to multiply by
each value. That should cut at least 5 cycles out of this loop.

> +
> +      SUMM[0] = vmlaq_f32(SUMM[0], XX[0], YY[0]);
> +      SUMM[1] = vmlaq_f32(SUMM[1], XX[1], YY[1]);
> +      SUMM[2] = vmlaq_f32(SUMM[2], XX[2], YY[2]);
> +      SUMM[3] = vmlaq_f32(SUMM[3], XX[3], YY[3]);
> +      YY[0] = YY[4];
> +   }
> +
> +   /* Handle remaining values max iterations = 3 */
> +   for (j = 0; j < cr; j++) {
> +      YY[0] = vld1q_f32(yi++);

This load is always redundant in the first iteration, which is a bit
unfortunate.

> +      XX_2 = vld1_lane_f32(xi++, XX_2, 0);

Don't load a single lane when you don't need the value(s) in the other
lane(s). Use vld1_dup_f32() instead. It's faster and breaks dependencies.
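A sketch of what the single-load-plus-vmlaq_lane_f32() suggestion above
could look like (illustrative only, not the final patch):
vmlaq_lane_f32() takes its scalar operand from a lane of a 64-bit
register, so the one 4-float load of x is split into its two halves:

   float32x4_t XX4   = vld1q_f32(xi);       /* x[0..3] in a single load */
   float32x2_t XX_lo = vget_low_f32(XX4);   /* lanes 0 and 1 */
   float32x2_t XX_hi = vget_high_f32(XX4);  /* lanes 2 and 3 */
   xi += 4;

   SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_lo, 0);  /* += x[0]*YY[0] */
   SUMM[1] = vmlaq_lane_f32(SUMM[1], YY[1], XX_lo, 1);  /* += x[1]*YY[1] */
   SUMM[2] = vmlaq_lane_f32(SUMM[2], YY[2], XX_hi, 0);  /* += x[2]*YY[2] */
   SUMM[3] = vmlaq_lane_f32(SUMM[3], YY[3], XX_hi, 1);  /* += x[3]*YY[3] */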
> +      SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0);
> +   }
> +
> +   SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]);
> +   SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]);
> +   SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]);
> +
> +   vst1q_f32(sum, SUMM[0]);
> +}
> +
> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
> +         opus_val32 *xcorr, int len, int max_pitch, int arch) {

arch is unused. There's no reason to pass it here. If we're here, we
know what the arch is.

> +   int i, j;
> +
> +   celt_assert(max_pitch > 0);
> +   celt_assert((((unsigned char *)_x-(unsigned char *)NULL)&3)==0);
> +
> +   for (i = 0; i < (max_pitch-3); i += 4) {
> +      xcorr_kernel_neon_float((float *)_x, (float *)_y+i,
> +            (float *)xcorr+i, len);
> +   }
> +
> +   /* In case max_pitch isn't multiple of 4, do unrolled version */
> +   for (; i < max_pitch; i++) {
> +      float sum = 0;
> +      float *yi = _y+i;
> +      for (j = 0; j < len; j++)
> +         sum += _x[i]*yi[i];
> +      xcorr[i] = sum;
> +   }

This loop can still be largely vectorized. Any reason not to do so?

> +}
> diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h
> index a07f8ac..f5adc48 100644
> --- a/celt/arm/pitch_arm.h
> +++ b/celt/arm/pitch_arm.h
> @@ -52,6 +52,19 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y,
>      ((void)(arch),PRESUME_NEON(celt_pitch_xcorr)(_x, _y, xcorr, len, max_pitch))
>  # endif
>
> -# endif
> +#else /* Start !FIXED_POINT */
> +/* Float case */
> +#if defined(OPUS_ARM_NEON_INTR)
> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
> +         opus_val32 *xcorr, int len, int max_pitch, int arch);
> +#endif
> +
> +#if !defined(OPUS_HAVE_RTCD) && defined(OPUS_PRESUME_ARM_NEON_INTR)
> +#define OVERRIDE_PITCH_XCORR (1)
> +#define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
> +   ((void)(arch),celt_pitch_xcorr_float_neon(_x, _y, xcorr, \
> +         len, max_pitch, arch))

Again, don't pass arch.

> (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
> -      const opus_val16 *, opus_val32 *, int, int);
> +      const opus_val16 *, opus_val32 *, int, int
> +#if !defined(FIXED_POINT)
> +      ,int
> +#endif
> +);

Which gets rid of this ugliness.

> +#if defined(FIXED_POINT)
> +# define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
> +   ((*CELT_PITCH_XCORR_IMPL[(arch)&OPUS_ARCHMASK])(_x, _y, \
> +         xcorr, len, max_pitch)
> +#else
>  # define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
>     ((*CELT_PITCH_XCORR_IMPL[(arch)&OPUS_ARCHMASK])(_x, _y, \
> -         xcorr, len, max_pitch))
> +         xcorr, len, max_pitch, arch))
> +#endif
> +

And this.

> diff --git a/configure.ac b/configure.ac
> index 9b2f51f..09657b6 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -198,12 +198,11 @@ cpu_arm=no
>
>  AS_IF([test x"${enable_asm}" = x"yes"],[
>     inline_optimization="No ASM for your platform, please send patches"
> +   OPUS_ARM_NEON_INTR_CPPFLAGS
>     case $host_cpu in
>        arm*)
> -         dnl Currently we only have asm for fixed-point
> -         AS_IF([test "$enable_float" != "yes"],[
>           cpu_arm=yes
> -         AC_DEFINE([OPUS_ARM_ASM], [], [Make use of ARM asm optimization])
> +         AC_DEFINE([OPUS_ARM_ASM], [], [Make use of ARM asm/intrinsic optimization])

Not sure I'm in love with conflating intrinsics with inline assembly.
For example, are these tests (especially the PRESUME_NEON stuff) going
to do the right thing on aarch64?

>           AS_GCC_INLINE_ASSEMBLY(
>              [inline_optimization="ARM"],
>              [inline_optimization="disabled"]
> @@ -212,6 +211,35 @@ AS_IF([test x"${enable_asm}" = x"yes"],[
>           AS_ASM_ARM_MEDIA([OPUS_ARM_INLINE_MEDIA=1],
>              [OPUS_ARM_INLINE_MEDIA=0])
>           AS_ASM_ARM_NEON([OPUS_ARM_INLINE_NEON=1],[OPUS_ARM_INLINE_NEON=0])
> +
> +         AC_ARG_ENABLE([arm-neon-intrinsics],
> +            AS_HELP_STRING([--enable-arm-neon-intrinsics], [Enable NEON optimisations on ARM CPUs that support it]))

This should specify a default value for enable_arm_neon_intrinsics.
However, I really think this switch should be unified with the
--enable-intrinsics switch currently used by x86.

> +
> +         AS_IF([test x"$enable_arm_neon_intrinsics" = x"yes"],
> +         [
> +            AC_MSG_CHECKING(if compiler supports arm neon intrinsics)
> +            save_CFLAGS="$CFLAGS"
> +            save_CFLAGS="$CFLAGS"; CFLAGS="-mfpu=neon $CFLAGS"
> +            AC_COMPILE_IFELSE(
> +               [AC_LANG_PROGRAM([[#include <arm_neon.h>]], [])],
> +               [
> +                  OPUS_ARM_NEON_INTR=1
> +                  OPUS_ARM_NEON_INTR_CPPFLAGS="-mfpu=neon -O3"
> +                  AC_SUBST(OPUS_ARM_NEON_INTR_CPPFLAGS)
> +               ],
> +               [
> +                  OPUS_ARM_NEON_INTR=0
> +               ])
> +            CFLAGS="$save_CFLAGS"
> +            AS_IF([test x"$OPUS_ARM_NEON_INTR"=x"1"],
> +               [AC_MSG_RESULT([yes])],
> +               [AC_MSG_RESULT([no])])
> +         ],
> +         [
> +            OPUS_ARM_NEON_INTR=0
> +            AC_MSG_WARN([ARMv7 neon intrinsics not enabled])
> +         ])
> +
>        AS_IF([test x"$inline_optimization" = x"ARM"],[
>           AM_CONDITIONAL([OPUS_ARM_INLINE_ASM],[true])
>           AC_DEFINE([OPUS_ARM_INLINE_ASM], 1,
> @@ -220,7 +248,7 @@ AS_IF([test x"${enable_asm}" = x"yes"],[
>              AC_DEFINE([OPUS_ARM_INLINE_EDSP], [1],
>                 [Use ARMv5E inline asm optimizations])
>              inline_optimization="$inline_optimization (EDSP)"
> -         ])
> +         ]n)

Buh?

>        AS_IF([test x"$OPUS_ARM_INLINE_MEDIA" = x"1"],[
>           AC_DEFINE([OPUS_ARM_INLINE_MEDIA], [1],
>              [Use ARMv6 inline asm optimizations])
> @@ -335,13 +363,20 @@ AS_IF([test x"${enable_asm}" = x"yes"],[
>              [*** ARM assembly requires perl -- disabling optimizations])
>              asm_optimization="(missing perl dependency for ARM)"
>           ])
> -      ])
> +      AS_IF([test x"$OPUS_ARM_NEON_INTR" = x"1"], [
> +         AC_DEFINE([OPUS_ARM_NEON_INTR], 1,
> +            [Compiler supports ARMv7 Neon Intrinsics]),
> +         AS_IF([test x"$OPUS_ARM_PRESUME_NEON" = x"1"], [
> +            AC_DEFINE([OPUS_PRESUME_NEON_INTR], 1,
> +               [Compiler support arm Intrinsics and target must support neon])],
> +            [])
> +         AS_IF([test x"enable_rtcd" != x""],
> +            [rtcd_support="$rtcd_support (NEON_INTR)"],
> +            [])
> +      ],[])
>        ;;
>     esac
> -],[
> -   inline_optimization="disabled"
> -   asm_optimization="disabled"
> -])
> +],[])
>
> AM_CONDITIONAL([CPU_ARM], [test "$cpu_arm" = "yes"])
> AM_CONDITIONAL([OPUS_ARM_INLINE_ASM],
> @@ -349,6 +384,9 @@ AM_CONDITIONAL([OPUS_ARM_INLINE_ASM],
> AM_CONDITIONAL([OPUS_ARM_EXTERNAL_ASM],
>    [test x"${asm_optimization%% *}" = x"ARM"])
>
> +AM_CONDITIONAL([OPUS_ARM_NEON_INTR],
> +   [test x"$OPUS_ARM_NEON_INTR" = x"1"])
> +
> AM_CONDITIONAL([HAVE_SSE4_1], [false])
> AM_CONDITIONAL([HAVE_SSE2], [false])
> AS_IF([test x"$enable_intrinsics" = x"yes"],[
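On the default-value comment above: AC_ARG_ENABLE takes an
action-if-not-given as its fourth argument, so a minimal sketch of a
default (which the suggested unification with --enable-intrinsics would
supersede) is:

   AC_ARG_ENABLE([arm-neon-intrinsics],
      AS_HELP_STRING([--enable-arm-neon-intrinsics],
         [Enable NEON optimisations on ARM CPUs that support it]),,
      [enable_arm_neon_intrinsics=no])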
Viswanath Puttagunta
2014-Dec-01 17:13 UTC
[opus] [RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hello Timothy,

Appreciate the thorough review. I have a few questions in-line before I
re-spin the patch.

On 28 November 2014 at 15:52, Timothy B. Terriberry <tterribe at xiph.org> wrote:
> Review comments inline.
>
>> +if OPUS_ARM_NEON_INTR
>> +noinst_LTLIBRARIES = libarmneon.la
>> +libarmneon_la_SOURCES = $(CELT_SOURCES_ARM_NEON_INTR)
>> +libarmneon_la_CPPFLAGS = $(OPUS_ARM_NEON_INTR_CPPFLAGS) -I$(top_srcdir)/include
>> +endif
>
> I don't think these should be in a separate library. It brings with it
> lots of complications (to name one: wouldn't the .pc files need to be
> updated?). Please use the same mechanism that the SSE intrinsics use to
> add CFLAGS to the compilation of specific object files, e.g.,

Sorry, I don't know what a .pc file is. Also, no strong opinions on my
side, but I was merely following guidance from
http://www.gnu.org/software/automake/manual/html_node/Per_002dObject-Flags.html
which specifically advocates against per-object flags. I can follow your
feedback if you still think per-object flags are the right thing to do
for libopus. Please let me know either way.

>
> $(SSE_OBJ): CFLAGS += -msse4.1
>
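For reference, a hedged sketch of the mechanism being pointed at: a GNU
make target-specific variable, as the build already uses for $(SSE_OBJ)
(the object name below is illustrative, assuming the libtool object for
celt_neon_intr.c):

   ARM_NEON_INTR_OBJ = celt/arm/celt_neon_intr.lo
   $(ARM_NEON_INTR_OBJ): CFLAGS += $(OPUS_ARM_NEON_INTR_CPPFLAGS)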
>> +void (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
>> +      const opus_val16 *, opus_val32 *, int , int, int) = {
>> +   celt_pitch_xcorr_c,          /* ARMv4 */
>> +   celt_pitch_xcorr_c,          /* EDSP */
>> +   celt_pitch_xcorr_c,          /* Media */
>> +#if defined(OPUS_ARM_NEON_INTR)
>> +   celt_pitch_xcorr_float_neon  /* Neon */
>
> Please do not use tabs in source code (this applies here and everywhere
> below). Even with the tabs expanded in context, the comments here do not
> line up properly.

Will do. Thanks.

>
>> +static void xcorr_kernel_neon_float(float *x, float *y, float sum[4], int len) {
>
> x and y should be const.
>
>> +   float32x4_t YY[5];
>> +   float32x4_t XX[4];
>> +   float32x2_t XX_2;
>> +   float32x4_t SUMM[4];
>> +   float *xi = x;
>> +   float *yi = y;
>> +   int cd = len/4;
>> +   int cr = len%4;
>
> len is signed, so / and % are NOT equivalent to the corresponding >> and
> & (they are much slower).
>
>> +   int j;
>> +
>> +   celt_assert(len>=3);
>> +
>> +   /* Initialize sums to 0 */
>> +   SUMM[0] = vdupq_n_f32(0);
>> +   SUMM[1] = vdupq_n_f32(0);
>> +   SUMM[2] = vdupq_n_f32(0);
>> +   SUMM[3] = vdupq_n_f32(0);
>> +
>> +   YY[0] = vld1q_f32(yi);
>> +
>> +   /* Each loop consumes 8 floats in y vector
>> +    * and 4 floats in x vector
>> +    */
>> +   for (j = 0; j < cd; j++) {
>> +      yi += 4;
>> +      YY[4] = vld1q_f32(yi);
>
> If len == 4, then in the first iteration you will have loaded 8 y
> values, but only 7 are guaranteed to be available (e.g., the C code only
> references y[0] up to y[len-1+3]). You need to end this loop early and
> fall back to another approach. See comments in celt_pitch_xcorr_arm.s
> for details and an example (there are other useful comments there that
> could shave another cycle or two from this inner loop).

Analyzed the implementation in celt_pitch_xcorr_arm.s. I will re-do my
implementation to follow the same algorithm. It seems more elegant. This
comment applies to the rest of the feedback on celt_neon_intr.c.

>
>> +      YY[1] = vextq_f32(YY[0], YY[4], 1);
>> +      YY[2] = vextq_f32(YY[0], YY[4], 2);
>> +      YY[3] = vextq_f32(YY[0], YY[4], 3);
>> +
>> +      XX[0] = vld1q_dup_f32(xi++);
>> +      XX[1] = vld1q_dup_f32(xi++);
>> +      XX[2] = vld1q_dup_f32(xi++);
>> +      XX[3] = vld1q_dup_f32(xi++);
>
> Don't do this. Do a single load and use vmlaq_lane_f32() to multiply by
> each value. That should cut at least 5 cycles out of this loop.
>
>> +
>> +      SUMM[0] = vmlaq_f32(SUMM[0], XX[0], YY[0]);
>> +      SUMM[1] = vmlaq_f32(SUMM[1], XX[1], YY[1]);
>> +      SUMM[2] = vmlaq_f32(SUMM[2], XX[2], YY[2]);
>> +      SUMM[3] = vmlaq_f32(SUMM[3], XX[3], YY[3]);
>> +      YY[0] = YY[4];
>> +   }
>> +
>> +   /* Handle remaining values max iterations = 3 */
>> +   for (j = 0; j < cr; j++) {
>> +      YY[0] = vld1q_f32(yi++);
>
> This load is always redundant in the first iteration, which is a bit
> unfortunate.
>
>> +      XX_2 = vld1_lane_f32(xi++, XX_2, 0);
>
> Don't load a single lane when you don't need the value(s) in the other
> lane(s). Use vld1_dup_f32() instead. It's faster and breaks dependencies.

Will keep this in mind. Thanks.

>
>> +      SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0);
>> +   }
>> +
>> +   SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]);
>> +   SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]);
>> +   SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]);
>> +
>> +   vst1q_f32(sum, SUMM[0]);
>> +}
>> +
>> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
>> +         opus_val32 *xcorr, int len, int max_pitch, int arch) {
>
> arch is unused. There's no reason to pass it here. If we're here, we
> know what the arch is.
>
>> +   int i, j;
>> +
>> +   celt_assert(max_pitch > 0);
>> +   celt_assert((((unsigned char *)_x-(unsigned char *)NULL)&3)==0);
>> +
>> +   for (i = 0; i < (max_pitch-3); i += 4) {
>> +      xcorr_kernel_neon_float((float *)_x, (float *)_y+i,
>> +            (float *)xcorr+i, len);
>> +   }
>> +
>> +   /* In case max_pitch isn't multiple of 4, do unrolled version */
>> +   for (; i < max_pitch; i++) {
>> +      float sum = 0;
>> +      float *yi = _y+i;
>> +      for (j = 0; j < len; j++)
>> +         sum += _x[i]*yi[i];
>> +      xcorr[i] = sum;
>> +   }
>
> This loop can still be largely vectorized. Any reason not to do so?

Should have put TBD in the comments.. I will try to get this done in the
re-spin.
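A hedged sketch of one way that tail could be vectorized (not the final
patch): accumulate four products per iteration over j, reduce the
vector, then finish any leftover elements in scalar. (Note the quoted
tail loop also indexes x and yi with i where j is clearly intended.)

   for (; i < max_pitch; i++) {
      const float *xj = (const float *)_x;
      const float *yj = (const float *)_y + i;
      float32x4_t acc = vdupq_n_f32(0);
      float32x2_t half;
      float sum;
      for (j = 0; j < len - 3; j += 4)
         acc = vmlaq_f32(acc, vld1q_f32(xj + j), vld1q_f32(yj + j));
      /* Horizontal reduction of the four partial sums. */
      half = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
      sum  = vget_lane_f32(vpadd_f32(half, half), 0);
      for (; j < len; j++)
         sum += xj[j]*yj[j];
      xcorr[i] = sum;
   }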
>
>> +}
>> diff --git a/celt/arm/pitch_arm.h b/celt/arm/pitch_arm.h
>> index a07f8ac..f5adc48 100644
>> --- a/celt/arm/pitch_arm.h
>> +++ b/celt/arm/pitch_arm.h
>> @@ -52,6 +52,19 @@ opus_val32 celt_pitch_xcorr_edsp(const opus_val16 *_x, const opus_val16 *_y,
>>      ((void)(arch),PRESUME_NEON(celt_pitch_xcorr)(_x, _y, xcorr, len, max_pitch))
>>  # endif
>>
>> -# endif
>> +#else /* Start !FIXED_POINT */
>> +/* Float case */
>> +#if defined(OPUS_ARM_NEON_INTR)
>> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
>> +         opus_val32 *xcorr, int len, int max_pitch, int arch);
>> +#endif
>> +
>> +#if !defined(OPUS_HAVE_RTCD) && defined(OPUS_PRESUME_ARM_NEON_INTR)
>> +#define OVERRIDE_PITCH_XCORR (1)
>> +#define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
>> +   ((void)(arch),celt_pitch_xcorr_float_neon(_x, _y, xcorr, \
>> +         len, max_pitch, arch))
>
> Again, don't pass arch.

I really did not want to pass "arch", but couldn't figure out a way to
avoid it, as the signature of celt_pitch_xcorr_c(const opus_val16 *_x,
const opus_val16 *_y, opus_val32 *xcorr, int len, int max_pitch, int
arch) has arch as a parameter. Also, xcorr_kernel_c(), which is called
by celt_pitch_xcorr_c(), could also be optimized in isolation for some
other CPU architecture.

I really don't even understand how the FIXED point code is even
working.. doesn't it face this same problem? Don't you get a compile
error that the function signatures don't match?

I tried as much as possible to be least disruptive here, but my other
thought is that I'm really not sure why arch is being passed for each
call to celt_pitch_xcorr() and we do a table lookup for the appropriate
function every time. Instead, can't we just see what arch we are running
on during (maybe) opus_custom_decoder_init() or some appropriate
function that must be called anyway, and then assign the function
pointers to the optimized or C functions at that time?

>
>> (*const CELT_PITCH_XCORR_IMPL[OPUS_ARCHMASK+1])(const opus_val16 *,
>> -      const opus_val16 *, opus_val32 *, int, int);
>> +      const opus_val16 *, opus_val32 *, int, int
>> +#if !defined(FIXED_POINT)
>> +      ,int
>> +#endif
>> +);
>
> Which gets rid of this ugliness.
>
>> +#if defined(FIXED_POINT)
>> +# define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
>> +   ((*CELT_PITCH_XCORR_IMPL[(arch)&OPUS_ARCHMASK])(_x, _y, \
>> +         xcorr, len, max_pitch)
>> +#else
>>  # define celt_pitch_xcorr(_x, _y, xcorr, len, max_pitch, arch) \
>>     ((*CELT_PITCH_XCORR_IMPL[(arch)&OPUS_ARCHMASK])(_x, _y, \
>> -         xcorr, len, max_pitch))
>> +         xcorr, len, max_pitch, arch))
>> +#endif
>> +
>
> And this.
>
>> diff --git a/configure.ac b/configure.ac
>> index 9b2f51f..09657b6 100644
>> --- a/configure.ac
>> +++ b/configure.ac
>> @@ -198,12 +198,11 @@ cpu_arm=no
>>
>>  AS_IF([test x"${enable_asm}" = x"yes"],[
>>     inline_optimization="No ASM for your platform, please send patches"
>> +   OPUS_ARM_NEON_INTR_CPPFLAGS
>>     case $host_cpu in
>>        arm*)
>> -         dnl Currently we only have asm for fixed-point
>> -         AS_IF([test "$enable_float" != "yes"],[
>>           cpu_arm=yes
>> -         AC_DEFINE([OPUS_ARM_ASM], [], [Make use of ARM asm optimization])
>> +         AC_DEFINE([OPUS_ARM_ASM], [], [Make use of ARM asm/intrinsic optimization])
>
> Not sure I'm in love with conflating intrinsics with inline assembly.
Timothy B. Terriberry
2014-Dec-01 20:19 UTC
[opus] [RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Viswanath Puttagunta wrote:
> Sorry, I don't know what a .pc file is. Also, no strong opinions on my

A pkg-config file (e.g., opus.pc.in)

> side, but I was merely following guidance from
> http://www.gnu.org/software/automake/manual/html_node/Per_002dObject-Flags.html
> which specifically advocates against per-object flags.

Well, I don't claim to know anything about automake, but they seem to be
supported just fine. We're already doing things this way, and I'd like
to keep things consistent.

> I really don't even understand how the FIXED point code is even
> working.. doesn't it face this same problem? Don't you get a compile
> error that the function signatures don't match?

You're right. This got broken by the SSE intrinsics, but it only
generates a warning, not an error (and by accident, the code actually
works). But yeah, that's a mess. I submitted
<https://review.xiph.org/540/> to fix it (waiting on review to land).

> thought is that I'm really not sure why arch is being passed for each
> call to celt_pitch_xcorr() and we do a table lookup for the
> appropriate function every time.

The theory here is that a lookup in a global table is not any slower in
practice than a virtual function call, but we have the advantage that we
can put the table in a read-only data segment and mask off the table
lookup index to guarantee that we never read outside the table, even if
the codec state becomes completely corrupted (e.g., by a security
exploit). If we stored explicit function pointers somewhere writable,
such corruption could be used to redirect control flow arbitrarily. It's
just reducing attack surface (belt-and-suspenders).

> amount of time... could you please suggest an alternative here?..
> Because I'm really out of ideas in this area and could use some
> advise / direction.

Sorry, this is okay for now since I can't think of a good alternative to
suggest (if I could have I would have). Was just throwing that out there
in case you had a good idea.

> For aarch64,
>    case $host_cpu in
>    aarch64*)
> the configure.ac code should be much smaller as we can make valid

Okay, if that's the plan of record then I withdraw my objection.
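A standalone illustration of the masked-table pattern described above
(simplified toy code, not the actual Opus table): the table is const, so
the linker can place it in read-only data, and the masked index can
never select an entry outside the table, no matter how arch was
corrupted:

   #include <stdio.h>

   #define ARCHMASK 3  /* table size must be a power of two */

   static int impl_c(int x)    { return x + 1; }
   static int impl_neon(int x) { return x + 2; }

   /* const table of function pointers: lives in a read-only segment */
   static int (*const IMPL[ARCHMASK + 1])(int) = {
      impl_c, impl_c, impl_c, impl_neon
   };

   #define dispatch(arch, x) ((*IMPL[(arch) & ARCHMASK])(x))

   int main(void) {
      printf("%d\n", dispatch(3, 10));  /* selects impl_neon */
      printf("%d\n", dispatch(7, 10));  /* masked to 3: stays in bounds */
      return 0;
   }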