Anton Blanchard
2018-Jul-10 21:31 UTC
[flac-dev] [PATCH 0/7] PowerPC64 performance improvements
The following series adds initial vector support for PowerPC64. On POWER9, flac --best is about 3.3x faster.

Amitay Isaacs (2):
  Add m4 macro to check for C __attribute__ features
  Check if compiler supports target attribute on ppc64

Anton Blanchard (5):
  configure.ac: Remove SPE detection code
  configure.ac: Add VSX enable/disable
  configure.ac: Fix FLAC__CPU_PPC on little endian, and add FLAC__CPU_PPC64
  Add runtime detection of POWER8 and POWER9
  Add VSX optimised versions of autocorrelation loops

 configure.ac                      |  53 +-
 m4/c_attribute.m4                 |  18 +
 src/libFLAC/Makefile.am           |   1 +
 src/libFLAC/cpu.c                 |  31 +
 src/libFLAC/include/private/cpu.h |   6 +
 src/libFLAC/include/private/lpc.h |  14 +
 src/libFLAC/lpc_intrin_vsx.c      | 942 ++++++++++++++++++++++++++++++
 src/libFLAC/stream_encoder.c      |  30 +
 8 files changed, 1086 insertions(+), 9 deletions(-)
 create mode 100644 m4/c_attribute.m4
 create mode 100644 src/libFLAC/lpc_intrin_vsx.c

-- 
2.17.1
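As background for the last patch in the series, the autocorrelation that the VSX loops accelerate can be written in scalar C roughly as follows. This is a minimal sketch and not code from the series: the name autocorrelation_scalar is hypothetical, and float stands in for FLAC__real on the assumption of a non-integer-only build.

```c
#include <stdint.h>

/*
 * Scalar reference (hypothetical helper, for illustration only):
 * autoc[l] = sum over i of data[i] * data[i + l], for 0 <= l < lag.
 */
static void autocorrelation_scalar(const float data[], uint32_t data_len,
                                   uint32_t lag, float autoc[])
{
	uint32_t l, i;

	for (l = 0; l < lag; l++) {
		float sum = 0.0f;

		/* Stop once the lagged index i + l would run past the end. */
		for (i = 0; i + l < data_len; i++)
			sum += data[i] * data[i + l];
		autoc[l] = sum;
	}
}
```

The vector versions compute the same sums, but keep several lags in each vector register so that one loaded sample feeds four products at a time.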
Anton Blanchard
2018-Jul-10 21:31 UTC
[flac-dev] [PATCH 1/7] configure.ac: Remove SPE detection code
We don't have any SPE code, so there's no need to detect it at configure time.

Signed-off-by: Anton Blanchard <anton at ozlabs.org>
---
 configure.ac | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/configure.ac b/configure.ac
index ffde189a..77e3628e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -174,14 +174,6 @@ case "$host" in
 	*) OBJ_FORMAT=elf ;;
 esac
 AC_SUBST(OBJ_FORMAT)
-case "$host" in
-	*-gnuspe)
-		abi_spe=true
-		AC_DEFINE(FLAC__CPU_PPC_SPE)
-		AH_TEMPLATE(FLAC__CPU_PPC_SPE, [define if building for PowerPC with SPE ABI])
-	;;
-esac
-AM_CONDITIONAL(FLaC__CPU_PPC_SPE, test "x$abi_spe" = xtrue)
 
 os_is_windows=no
 case "$host" in
-- 
2.17.1
Anton Blanchard
2018-Jul-10 21:31 UTC
[flac-dev] [PATCH 2/7] configure.ac: Add VSX enable/disable
We want to create functions with PowerPC VSX instructions, so add a configure check.

Signed-off-by: Anton Blanchard <anton at ozlabs.org>
---
 configure.ac | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/configure.ac b/configure.ac
index 77e3628e..592e7750 100644
--- a/configure.ac
+++ b/configure.ac
@@ -228,6 +228,19 @@ AC_DEFINE(FLAC__USE_ALTIVEC)
 AH_TEMPLATE(FLAC__USE_ALTIVEC, [define to enable use of Altivec instructions])
 fi
 
+AC_ARG_ENABLE(vsx,
+AC_HELP_STRING([--disable-vsx], [Disable VSX optimizations]),
+[case "${enableval}" in
+	yes) use_vsx=true ;;
+	no) use_vsx=false ;;
+	*) AC_MSG_ERROR(bad value ${enableval} for --enable-vsx) ;;
+esac],[use_vsx=true])
+AM_CONDITIONAL(FLaC__USE_VSX, test "x$use_vsx" = xtrue)
+if test "x$use_vsx" = xtrue ; then
+AC_DEFINE(FLAC__USE_VSX)
+AH_TEMPLATE(FLAC__USE_VSX, [define to enable use of VSX instructions])
+fi
+
 AC_ARG_ENABLE(avx,
 AC_HELP_STRING([--disable-avx], [Disable AVX, AVX2 optimizations]),
 [case "${enableval}" in
-- 
2.17.1
Anton Blanchard
2018-Jul-10 21:31 UTC
[flac-dev] [PATCH 3/7] configure.ac: Fix FLAC__CPU_PPC on little endian, and add FLAC__CPU_PPC64
FLAC__CPU_PPC wasn't catching powerpcle or powerpc64le. Fix that and add a new define for FLAC__CPU_PPC64.

Signed-off-by: Anton Blanchard <anton at ozlabs.org>
---
 configure.ac | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/configure.ac b/configure.ac
index 592e7750..55078293 100644
--- a/configure.ac
+++ b/configure.ac
@@ -141,7 +141,16 @@ case "$host_cpu" in
 		AH_TEMPLATE(FLAC__CPU_IA32, [define if building for ia32/i386])
 		asm_optimisation=$asm_opt
 		;;
-	powerpc|powerpc64)
+	powerpc64|powerpc64le)
+		cpu_ppc64=true
+		cpu_ppc=true
+		AC_DEFINE(FLAC__CPU_PPC)
+		AH_TEMPLATE(FLAC__CPU_PPC, [define if building for PowerPC])
+		AC_DEFINE(FLAC__CPU_PPC64)
+		AH_TEMPLATE(FLAC__CPU_PPC64, [define if building for PowerPC64])
+		asm_optimisation=$asm_opt
+		;;
+	powerpc|powerpcle)
 		cpu_ppc=true
 		AC_DEFINE(FLAC__CPU_PPC)
 		AH_TEMPLATE(FLAC__CPU_PPC, [define if building for PowerPC])
@@ -157,6 +166,7 @@ esac
 AM_CONDITIONAL(FLAC__CPU_X86_64, test "x$cpu_x86_64" = xtrue)
 AM_CONDITIONAL(FLaC__CPU_IA32, test "x$cpu_ia32" = xtrue)
 AM_CONDITIONAL(FLaC__CPU_PPC, test "x$cpu_ppc" = xtrue)
+AM_CONDITIONAL(FLaC__CPU_PPC64, test "x$cpu_ppc64" = xtrue)
 AM_CONDITIONAL(FLaC__CPU_SPARC, test "x$cpu_sparc" = xtrue)
 
 if test "x$ac_cv_header_x86intrin_h" = xyes; then
-- 
2.17.1
Anton Blanchard
2018-Jul-10 21:31 UTC
[flac-dev] [PATCH 4/7] Add m4 macro to check for C __attribute__ features
From: Amitay Isaacs <amitay at ozlabs.org>

Signed-off-by: Amitay Isaacs <amitay at ozlabs.org>
---
 m4/c_attribute.m4 | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)
 create mode 100644 m4/c_attribute.m4

diff --git a/m4/c_attribute.m4 b/m4/c_attribute.m4
new file mode 100644
index 00000000..48aa6223
--- /dev/null
+++ b/m4/c_attribute.m4
@@ -0,0 +1,18 @@
+#
+# Check for supported __attribute__ features
+#
+# AC_C_ATTRIBUTE(FEATURE, [ACTION-IF-FOUND], [ACTION-IF-NOT-FOUND])
+#
+AC_DEFUN([AC_C_ATTRIBUTE],
+[AS_VAR_PUSHDEF([CACHEVAR], [ax_cv_c_attribute_$1])dnl
+AC_CACHE_CHECK([for __attribute__ (($1))],
+  CACHEVAR,[
+    AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
+      [[ void foo(void) __attribute__ (($1)); ]])],
+      [AS_VAR_SET(CACHEVAR, [yes])],
+      [AS_VAR_SET(CACHEVAR, [no])])])
+AS_VAR_IF(CACHEVAR,yes,
+  [m4_default([$2], :)],
+  [m4_default([$3], :)])
+AS_VAR_POPDEF([CACHEVAR])dnl
+])dnl
-- 
2.17.1
Anton Blanchard
2018-Jul-10 21:31 UTC
[flac-dev] [PATCH 5/7] Check if compiler supports target attribute on ppc64
From: Amitay Isaacs <amitay at ozlabs.org>

Check if the compiler supports __attribute__((target("cpu=power8"))) and __attribute__((target("cpu=power9")))

Signed-off-by: Amitay Isaacs <amitay at ozlabs.org>
---
 configure.ac | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/configure.ac b/configure.ac
index 55078293..3d18bb91 100644
--- a/configure.ac
+++ b/configure.ac
@@ -175,6 +175,26 @@ else
 	AC_DEFINE([FLAC__HAS_X86INTRIN], 0)
 fi
 
+if test x"$cpu_ppc64" = xtrue ; then
+
+AC_C_ATTRIBUTE([target("cpu=power8")],
+	[have_cpu_power8=yes],
+	[have_cpu_power8=no])
+if test x"$have_cpu_power8" = xyes ; then
+	AC_DEFINE(FLAC__HAS_TARGET_POWER8)
+	AH_TEMPLATE(FLAC__HAS_TARGET_POWER8, [define if compiler has __attribute__((target("cpu=power8"))) support])
+fi
+
+AC_C_ATTRIBUTE([target("cpu=power9")],
+	[have_cpu_power9=yes],
+	[have_cpu_power9=no])
+if test x"$have_cpu_power9" = xyes ; then
+	AC_DEFINE(FLAC__HAS_TARGET_POWER9)
+	AH_TEMPLATE(FLAC__HAS_TARGET_POWER9, [define if compiler has __attribute__((target("cpu=power9"))) support])
+fi
+
+fi
+
 case "$host" in
 	i386-*-openbsd3.[[0-3]]) OBJ_FORMAT=aoutb ;;
 	*-*-cygwin|*mingw*) OBJ_FORMAT=win32 ;;
-- 
2.17.1
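To illustrate why the configure result matters, here is a minimal sketch of how a define such as FLAC__HAS_TARGET_POWER8 can gate a per-function target attribute, so that a file built for a baseline CPU can still contain a POWER8-tuned routine. The TARGET_POWER8 macro and the dot4 function are hypothetical names used only for this example; on other architectures, or when the check fails, the macro expands to nothing.

```c
#include <stddef.h>

/* FLAC__HAS_TARGET_POWER8 would come from config.h when the configure check succeeds. */
#if defined(__powerpc64__) && defined(FLAC__HAS_TARGET_POWER8)
#define TARGET_POWER8 __attribute__((target("cpu=power8")))
#else
#define TARGET_POWER8 /* attribute unavailable: compile as plain C */
#endif

/* Hypothetical example routine: the compiler may use POWER8 instructions here only. */
TARGET_POWER8
static float dot4(const float a[4], const float b[4])
{
	float sum = 0.0f;
	size_t i;

	for (i = 0; i < 4; i++)
		sum += a[i] * b[i];
	return sum;
}
```

A function compiled this way must only be called after a runtime check confirms the CPU supports the selected instruction set.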
Anton Blanchard
2018-Jul-10 21:31 UTC
[flac-dev] [PATCH 6/7] Add runtime detection of POWER8 and POWER9
Use getauxval() to determine if we are on POWER8 or POWER9 or newer. POWER8 is represented by version 2.07 and POWER9 by version 3.00.

Signed-off-by: Anton Blanchard <anton at ozlabs.org>
---
 src/libFLAC/cpu.c                 | 31 +++++++++++++++++++++++++++++++
 src/libFLAC/include/private/cpu.h |  6 ++++++
 2 files changed, 37 insertions(+)

diff --git a/src/libFLAC/cpu.c b/src/libFLAC/cpu.c
index bf0708c8..64da9cbc 100644
--- a/src/libFLAC/cpu.c
+++ b/src/libFLAC/cpu.c
@@ -53,6 +53,9 @@
 #define dfprintf(file, format, ...)
 #endif
 
+#if defined FLAC__CPU_PPC
+#include <sys/auxv.h>
+#endif
 
 #if (defined FLAC__CPU_IA32 || defined FLAC__CPU_X86_64) && (defined FLAC__HAS_NASM || FLAC__HAS_X86INTRIN) && !defined FLAC__NO_ASM
 
@@ -230,6 +233,29 @@ x86_cpu_info (FLAC__CPUInfo *info)
 #endif
 }
 
+static void
+ppc_cpu_info (FLAC__CPUInfo *info)
+{
+#if defined FLAC__CPU_PPC
+#ifndef PPC_FEATURE2_ARCH_3_00
+#define PPC_FEATURE2_ARCH_3_00		0x00800000
+#endif
+
+#ifndef PPC_FEATURE2_ARCH_2_07
+#define PPC_FEATURE2_ARCH_2_07		0x80000000
+#endif
+
+	if (getauxval(AT_HWCAP2) & PPC_FEATURE2_ARCH_3_00) {
+		info->ppc.arch_3_00 = true;
+	} else if (getauxval(AT_HWCAP2) & PPC_FEATURE2_ARCH_2_07) {
+		info->ppc.arch_2_07 = true;
+	}
+#else
+	info->ppc.arch_2_07 = false;
+	info->ppc.arch_3_00 = false;
+#endif
+}
+
 void FLAC__cpu_info (FLAC__CPUInfo *info)
 {
 	memset(info, 0, sizeof(*info));
@@ -238,6 +264,8 @@ void FLAC__cpu_info (FLAC__CPUInfo *info)
 	info->type = FLAC__CPUINFO_TYPE_IA32;
 #elif defined FLAC__CPU_X86_64
 	info->type = FLAC__CPUINFO_TYPE_X86_64;
+#elif defined FLAC__CPU_PPC
+	info->type = FLAC__CPUINFO_TYPE_PPC;
 #else
 	info->type = FLAC__CPUINFO_TYPE_UNKNOWN;
 #endif
@@ -247,6 +275,9 @@ void FLAC__cpu_info (FLAC__CPUInfo *info)
 	case FLAC__CPUINFO_TYPE_X86_64:
 		x86_cpu_info (info);
 		break;
+	case FLAC__CPUINFO_TYPE_PPC:
+		ppc_cpu_info (info);
+		break;
 	default:
 		info->use_asm = false;
 		break;
diff --git a/src/libFLAC/include/private/cpu.h b/src/libFLAC/include/private/cpu.h
index 3fe279b0..e07aa09d 100644
--- a/src/libFLAC/include/private/cpu.h
+++ b/src/libFLAC/include/private/cpu.h
@@ -153,6 +153,7 @@
 typedef enum {
 	FLAC__CPUINFO_TYPE_IA32,
 	FLAC__CPUINFO_TYPE_X86_64,
+	FLAC__CPUINFO_TYPE_PPC,
 	FLAC__CPUINFO_TYPE_UNKNOWN
 } FLAC__CPUInfo_Type;
 
@@ -173,11 +174,16 @@ typedef struct {
 	FLAC__bool fma;
 } FLAC__CPUInfo_x86;
 
+typedef struct {
+	FLAC__bool arch_3_00;
+	FLAC__bool arch_2_07;
+} FLAC__CPUInfo_ppc;
 
 typedef struct {
 	FLAC__bool use_asm;
 	FLAC__CPUInfo_Type type;
 	FLAC__CPUInfo_x86 x86;
+	FLAC__CPUInfo_ppc ppc;
 } FLAC__CPUInfo;
 
 void FLAC__cpu_info(FLAC__CPUInfo *info);
-- 
2.17.1
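The detection logic above can be tried in isolation with a sketch like the one below. It separates the bit test from the getauxval() call so the mapping can be checked on any machine. The function names ppc_isa_level_from_hwcap2 and ppc_isa_level are hypothetical; the two mask values are the fallbacks used in the patch. As in the patch, only the highest level found is reported.

```c
#if defined(__linux__) && defined(__powerpc__)
#include <sys/auxv.h>	/* getauxval(), AT_HWCAP2 */
#endif

/* Fallback values, as in the patch, for headers that lack them. */
#ifndef PPC_FEATURE2_ARCH_3_00
#define PPC_FEATURE2_ARCH_3_00 0x00800000
#endif
#ifndef PPC_FEATURE2_ARCH_2_07
#define PPC_FEATURE2_ARCH_2_07 0x80000000
#endif

/*
 * Hypothetical helper: map an AT_HWCAP2 word to a level.
 * 2 = ISA 3.00 (POWER9 or newer), 1 = ISA 2.07 (POWER8), 0 = neither.
 */
static int ppc_isa_level_from_hwcap2(unsigned long hwcap2)
{
	if (hwcap2 & PPC_FEATURE2_ARCH_3_00)
		return 2;
	if (hwcap2 & PPC_FEATURE2_ARCH_2_07)
		return 1;
	return 0;
}

/* Hypothetical wrapper: query the running kernel, or report 0 elsewhere. */
static int ppc_isa_level(void)
{
#if defined(__linux__) && defined(__powerpc__)
	return ppc_isa_level_from_hwcap2(getauxval(AT_HWCAP2));
#else
	return 0;
#endif
}
```

Keeping the mask test in its own function is a design choice for the sketch only; it makes the POWER8/POWER9 mapping testable without PowerPC hardware.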
Anton Blanchard
2018-Jul-10 21:31 UTC
[flac-dev] [PATCH 7/7] Add VSX optimised versions of autocorrelation loops
Add a POWER8 and POWER9 version of the autocorrelation functions. flac --best is about 3.3x faster on POWER9 with this patch. Signed-off-by: Anton Blanchard <anton at ozlabs.org> --- src/libFLAC/Makefile.am | 1 + src/libFLAC/include/private/lpc.h | 14 + src/libFLAC/lpc_intrin_vsx.c | 942 ++++++++++++++++++++++++++++++ src/libFLAC/stream_encoder.c | 30 + 4 files changed, 987 insertions(+) create mode 100644 src/libFLAC/lpc_intrin_vsx.c diff --git a/src/libFLAC/Makefile.am b/src/libFLAC/Makefile.am index 863f7f95..f0f32f04 100644 --- a/src/libFLAC/Makefile.am +++ b/src/libFLAC/Makefile.am @@ -114,6 +114,7 @@ libFLAC_sources = \ lpc_intrin_sse2.c \ lpc_intrin_sse41.c \ lpc_intrin_avx2.c \ + lpc_intrin_vsx.c \ md5.c \ memory.c \ metadata_iterators.c \ diff --git a/src/libFLAC/include/private/lpc.h b/src/libFLAC/include/private/lpc.h index 63d64324..64dfd1f8 100644 --- a/src/libFLAC/include/private/lpc.h +++ b/src/libFLAC/include/private/lpc.h @@ -91,6 +91,20 @@ void FLAC__lpc_compute_autocorrelation_intrin_sse_lag_12_new(const FLAC__real da void FLAC__lpc_compute_autocorrelation_intrin_sse_lag_16_new(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]); # endif # endif +#if defined(FLAC__CPU_PPC64) && defined(FLAC__USE_VSX) +#ifdef FLAC__HAS_TARGET_POWER9 +void FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_4(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]); +void FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_8(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]); +void FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_12(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]); +void FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_16(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]); +#endif +#ifdef FLAC__HAS_TARGET_POWER8 +void FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_4(const FLAC__real data[], 
uint32_t data_len, uint32_t lag, FLAC__real autoc[]); +void FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_8(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]); +void FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_12(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]); +void FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_16(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]); +#endif +#endif #endif /* diff --git a/src/libFLAC/lpc_intrin_vsx.c b/src/libFLAC/lpc_intrin_vsx.c new file mode 100644 index 00000000..48c82182 --- /dev/null +++ b/src/libFLAC/lpc_intrin_vsx.c @@ -0,0 +1,942 @@ +/* libFLAC - Free Lossless Audio Codec library + * Copyright (C) 2000-2009 Josh Coalson + * Copyright (C) 2011-2016 Xiph.Org Foundation + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * - Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * + * - Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * - Neither the name of the Xiph.org Foundation nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE FOUNDATION OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF + * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING + * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS + * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +#ifdef HAVE_CONFIG_H +# include <config.h> +#endif + +#ifndef FLAC__INTEGER_ONLY_LIBRARY +#ifndef FLAC__NO_ASM +#if defined(FLAC__CPU_PPC64) && defined(FLAC__USE_VSX) + +#include "private/cpu.h" +#include "private/lpc.h" +#include "FLAC/assert.h" +#include "FLAC/format.h" + +#include <altivec.h> + +#ifdef FLAC__HAS_TARGET_POWER8 +__attribute__((target("cpu=power8"))) +void FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_16(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]) +{ + long i; + long limit = (long)data_len - 16; + const FLAC__real *base; + vector float sum0 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum1 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum2 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum3 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum10 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum11 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum12 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum13 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum20 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum21 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum22 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum23 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum30 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum31 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum32 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum33 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float d0, d1, d2, d3, d4; +#if WORDS_BIGENDIAN + vector 
unsigned int vsel1 = { 0x00000000, 0x00000000, 0x00000000, 0xFFFFFFFF }; + vector unsigned int vsel2 = { 0x00000000, 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vsel3 = { 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vperm1 = { 0x04050607, 0x08090A0B, 0x0C0D0E0F, 0x10111213 }; + vector unsigned int vperm2 = { 0x08090A0B, 0x0C0D0E0F, 0x10111213, 0x14151617 }; + vector unsigned int vperm3 = { 0x0C0D0E0F, 0x10111213, 0x14151617, 0x18191A1B }; +#else + vector unsigned int vsel1 = { 0xFFFFFFFF, 0x00000000, 0x00000000, 0x00000000 }; + vector unsigned int vsel2 = { 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000, 0x00000000 }; + vector unsigned int vsel3 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000 }; + vector unsigned int vperm1 = { 0x07060504, 0x0B0A0908, 0x0F0E0D0C, 0x13121110 }; + vector unsigned int vperm2 = { 0x0B0A0908, 0x0F0E0D0C, 0x13121110, 0x17161514 }; + vector unsigned int vperm3 = { 0x0F0E0D0C, 0x13121110, 0x17161514, 0x1B1A1918 }; +#endif + + (void) lag; + FLAC__ASSERT(lag <= 16); + FLAC__ASSERT(lag <= data_len); + + base = data; + + d0 = vec_vsx_ld(0, base); + d1 = vec_vsx_ld(16, base); + d2 = vec_vsx_ld(32, base); + d3 = vec_vsx_ld(48, base); + + base += 16; + + for (i = 0; i <= (limit-4); i += 4) { + vector float d, d0_orig = d0; + + d4 = vec_vsx_ld(0, base); + base += 4; + + d = vec_splat(d0_orig, 0); + sum0 += d0 * d; + sum1 += d1 * d; + sum2 += d2 * d; + sum3 += d3 * d; + + d = vec_splat(d0_orig, 1); + d0 = vec_sel(d0_orig, d4, vsel1); + sum10 += d0 * d; + sum11 += d1 * d; + sum12 += d2 * d; + sum13 += d3 * d; + + d = vec_splat(d0_orig, 2); + d0 = vec_sel(d0_orig, d4, vsel2); + sum20 += d0 * d; + sum21 += d1 * d; + sum22 += d2 * d; + sum23 += d3 * d; + + d = vec_splat(d0_orig, 3); + d0 = vec_sel(d0_orig, d4, vsel3); + sum30 += d0 * d; + sum31 += d1 * d; + sum32 += d2 * d; + sum33 += d3 * d; + + d0 = d1; + d1 = d2; + d2 = d3; + d3 = d4; + } + + sum0 += vec_perm(sum10, sum11, (vector unsigned char)vperm1); + sum1 += 
vec_perm(sum11, sum12, (vector unsigned char)vperm1); + sum2 += vec_perm(sum12, sum13, (vector unsigned char)vperm1); + sum3 += vec_perm(sum13, sum10, (vector unsigned char)vperm1); + + sum0 += vec_perm(sum20, sum21, (vector unsigned char)vperm2); + sum1 += vec_perm(sum21, sum22, (vector unsigned char)vperm2); + sum2 += vec_perm(sum22, sum23, (vector unsigned char)vperm2); + sum3 += vec_perm(sum23, sum20, (vector unsigned char)vperm2); + + sum0 += vec_perm(sum30, sum31, (vector unsigned char)vperm3); + sum1 += vec_perm(sum31, sum32, (vector unsigned char)vperm3); + sum2 += vec_perm(sum32, sum33, (vector unsigned char)vperm3); + sum3 += vec_perm(sum33, sum30, (vector unsigned char)vperm3); + + for (; i <= limit; i++) { + vector float d; + + d0 = vec_vsx_ld(0, data+i); + d1 = vec_vsx_ld(16, data+i); + d2 = vec_vsx_ld(32, data+i); + d3 = vec_vsx_ld(48, data+i); + + d = vec_splat(d0, 0); + sum0 += d0 * d; + sum1 += d1 * d; + sum2 += d2 * d; + sum3 += d3 * d; + } + + vec_vsx_st(sum0, 0, autoc); + vec_vsx_st(sum1, 16, autoc); + vec_vsx_st(sum2, 32, autoc); + vec_vsx_st(sum3, 48, autoc); + + for (; i < (long)data_len; i++) { + uint32_t coeff; + + FLAC__real d = data[i]; + for (coeff = 0; coeff < data_len - i; coeff++) + autoc[coeff] += d * data[i+coeff]; + } +} + +__attribute__((target("cpu=power8"))) +void FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_12(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]) +{ + long i; + long limit = (long)data_len - 12; + const FLAC__real *base; + vector float sum0 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum1 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum2 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum10 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum11 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum12 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum20 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum21 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum22 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float 
sum30 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum31 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum32 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float d0, d1, d2, d3; +#if WORDS_BIGENDIAN + vector unsigned int vsel1 = { 0x00000000, 0x00000000, 0x00000000, 0xFFFFFFFF }; + vector unsigned int vsel2 = { 0x00000000, 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vsel3 = { 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vperm1 = { 0x04050607, 0x08090A0B, 0x0C0D0E0F, 0x10111213 }; + vector unsigned int vperm2 = { 0x08090A0B, 0x0C0D0E0F, 0x10111213, 0x14151617 }; + vector unsigned int vperm3 = { 0x0C0D0E0F, 0x10111213, 0x14151617, 0x18191A1B }; +#else + vector unsigned int vsel1 = { 0xFFFFFFFF, 0x00000000, 0x00000000, 0x00000000 }; + vector unsigned int vsel2 = { 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000, 0x00000000 }; + vector unsigned int vsel3 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000 }; + vector unsigned int vperm1 = { 0x07060504, 0x0B0A0908, 0x0F0E0D0C, 0x13121110 }; + vector unsigned int vperm2 = { 0x0B0A0908, 0x0F0E0D0C, 0x13121110, 0x17161514 }; + vector unsigned int vperm3 = { 0x0F0E0D0C, 0x13121110, 0x17161514, 0x1B1A1918 }; +#endif + + (void) lag; + FLAC__ASSERT(lag <= 12); + FLAC__ASSERT(lag <= data_len); + + base = data; + + d0 = vec_vsx_ld(0, base); + d1 = vec_vsx_ld(16, base); + d2 = vec_vsx_ld(32, base); + + base += 12; + + for (i = 0; i <= (limit-3); i += 4) { + vector float d, d0_orig = d0; + + d3 = vec_vsx_ld(0, base); + base += 4; + + d = vec_splat(d0_orig, 0); + sum0 += d0 * d; + sum1 += d1 * d; + sum2 += d2 * d; + + d = vec_splat(d0_orig, 1); + d0 = vec_sel(d0_orig, d3, vsel1); + sum10 += d0 * d; + sum11 += d1 * d; + sum12 += d2 * d; + + d = vec_splat(d0_orig, 2); + d0 = vec_sel(d0_orig, d3, vsel2); + sum20 += d0 * d; + sum21 += d1 * d; + sum22 += d2 * d; + + d = vec_splat(d0_orig, 3); + d0 = vec_sel(d0_orig, d3, vsel3); + sum30 += d0 * d; + sum31 += d1 * d; + sum32 += d2 * d; + + d0 = d1; + d1 = d2; + d2 = d3; + 
} + + sum0 += vec_perm(sum10, sum11, (vector unsigned char)vperm1); + sum1 += vec_perm(sum11, sum12, (vector unsigned char)vperm1); + sum2 += vec_perm(sum12, sum10, (vector unsigned char)vperm1); + + sum0 += vec_perm(sum20, sum21, (vector unsigned char)vperm2); + sum1 += vec_perm(sum21, sum22, (vector unsigned char)vperm2); + sum2 += vec_perm(sum22, sum20, (vector unsigned char)vperm2); + + sum0 += vec_perm(sum30, sum31, (vector unsigned char)vperm3); + sum1 += vec_perm(sum31, sum32, (vector unsigned char)vperm3); + sum2 += vec_perm(sum32, sum30, (vector unsigned char)vperm3); + + for (; i <= limit; i++) { + vector float d; + + d0 = vec_vsx_ld(0, data+i); + d1 = vec_vsx_ld(16, data+i); + d2 = vec_vsx_ld(32, data+i); + + d = vec_splat(d0, 0); + sum0 += d0 * d; + sum1 += d1 * d; + sum2 += d2 * d; + } + + vec_vsx_st(sum0, 0, autoc); + vec_vsx_st(sum1, 16, autoc); + vec_vsx_st(sum2, 32, autoc); + + for (; i < (long)data_len; i++) { + uint32_t coeff; + + FLAC__real d = data[i]; + for (coeff = 0; coeff < data_len - i; coeff++) + autoc[coeff] += d * data[i+coeff]; + } +} + +__attribute__((target("cpu=power8"))) +void FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_8(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]) +{ + long i; + long limit = (long)data_len - 8; + const FLAC__real *base; + vector float sum0 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum1 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum10 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum11 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum20 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum21 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum30 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum31 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float d0, d1, d2; +#if WORDS_BIGENDIAN + vector unsigned int vsel1 = { 0x00000000, 0x00000000, 0x00000000, 0xFFFFFFFF }; + vector unsigned int vsel2 = { 0x00000000, 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vsel3 = { 0x00000000, 
0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vperm1 = { 0x04050607, 0x08090A0B, 0x0C0D0E0F, 0x10111213 }; + vector unsigned int vperm2 = { 0x08090A0B, 0x0C0D0E0F, 0x10111213, 0x14151617 }; + vector unsigned int vperm3 = { 0x0C0D0E0F, 0x10111213, 0x14151617, 0x18191A1B }; +#else + vector unsigned int vsel1 = { 0xFFFFFFFF, 0x00000000, 0x00000000, 0x00000000 }; + vector unsigned int vsel2 = { 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000, 0x00000000 }; + vector unsigned int vsel3 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000 }; + vector unsigned int vperm1 = { 0x07060504, 0x0B0A0908, 0x0F0E0D0C, 0x13121110 }; + vector unsigned int vperm2 = { 0x0B0A0908, 0x0F0E0D0C, 0x13121110, 0x17161514 }; + vector unsigned int vperm3 = { 0x0F0E0D0C, 0x13121110, 0x17161514, 0x1B1A1918 }; +#endif + + (void) lag; + FLAC__ASSERT(lag <= 8); + FLAC__ASSERT(lag <= data_len); + + base = data; + + d0 = vec_vsx_ld(0, base); + d1 = vec_vsx_ld(16, base); + + base += 8; + + for (i = 0; i <= (limit-2); i += 4) { + vector float d, d0_orig = d0; + + d2 = vec_vsx_ld(0, base); + base += 4; + + d = vec_splat(d0_orig, 0); + sum0 += d0 * d; + sum1 += d1 * d; + + d = vec_splat(d0_orig, 1); + d0 = vec_sel(d0_orig, d2, vsel1); + sum10 += d0 * d; + sum11 += d1 * d; + + d = vec_splat(d0_orig, 2); + d0 = vec_sel(d0_orig, d2, vsel2); + sum20 += d0 * d; + sum21 += d1 * d; + + d = vec_splat(d0_orig, 3); + d0 = vec_sel(d0_orig, d2, vsel3); + sum30 += d0 * d; + sum31 += d1 * d; + + d0 = d1; + d1 = d2; + } + + sum0 += vec_perm(sum10, sum11, (vector unsigned char)vperm1); + sum1 += vec_perm(sum11, sum10, (vector unsigned char)vperm1); + + sum0 += vec_perm(sum20, sum21, (vector unsigned char)vperm2); + sum1 += vec_perm(sum21, sum20, (vector unsigned char)vperm2); + + sum0 += vec_perm(sum30, sum31, (vector unsigned char)vperm3); + sum1 += vec_perm(sum31, sum30, (vector unsigned char)vperm3); + + for (; i <= limit; i++) { + vector float d; + + d0 = vec_vsx_ld(0, data+i); + d1 = vec_vsx_ld(16, data+i); + + d = 
vec_splat(d0, 0); + sum0 += d0 * d; + sum1 += d1 * d; + } + + vec_vsx_st(sum0, 0, autoc); + vec_vsx_st(sum1, 16, autoc); + + for (; i < (long)data_len; i++) { + uint32_t coeff; + + FLAC__real d = data[i]; + for (coeff = 0; coeff < data_len - i; coeff++) + autoc[coeff] += d * data[i+coeff]; + } +} + +__attribute__((target("cpu=power8"))) +void FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_4(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]) +{ + long i; + long limit = (long)data_len - 4; + const FLAC__real *base; + vector float sum0 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum10 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum20 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum30 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float d0, d1; +#if WORDS_BIGENDIAN + vector unsigned int vsel1 = { 0x00000000, 0x00000000, 0x00000000, 0xFFFFFFFF }; + vector unsigned int vsel2 = { 0x00000000, 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vsel3 = { 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vperm1 = { 0x04050607, 0x08090A0B, 0x0C0D0E0F, 0x10111213 }; + vector unsigned int vperm2 = { 0x08090A0B, 0x0C0D0E0F, 0x10111213, 0x14151617 }; + vector unsigned int vperm3 = { 0x0C0D0E0F, 0x10111213, 0x14151617, 0x18191A1B }; +#else + vector unsigned int vsel1 = { 0xFFFFFFFF, 0x00000000, 0x00000000, 0x00000000 }; + vector unsigned int vsel2 = { 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000, 0x00000000 }; + vector unsigned int vsel3 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000 }; + vector unsigned int vperm1 = { 0x07060504, 0x0B0A0908, 0x0F0E0D0C, 0x13121110 }; + vector unsigned int vperm2 = { 0x0B0A0908, 0x0F0E0D0C, 0x13121110, 0x17161514 }; + vector unsigned int vperm3 = { 0x0F0E0D0C, 0x13121110, 0x17161514, 0x1B1A1918 }; +#endif + + (void) lag; + FLAC__ASSERT(lag <= 4); + FLAC__ASSERT(lag <= data_len); + + base = data; + + d0 = vec_vsx_ld(0, base); + + base += 4; + + for (i = 0; i <= (limit-1); i += 4) { + vector float d, 
d0_orig = d0; + + d1 = vec_vsx_ld(0, base); + base += 4; + + d = vec_splat(d0_orig, 0); + sum0 += d0 * d; + + d = vec_splat(d0_orig, 1); + d0 = vec_sel(d0_orig, d1, vsel1); + sum10 += d0 * d; + + d = vec_splat(d0_orig, 2); + d0 = vec_sel(d0_orig, d1, vsel2); + sum20 += d0 * d; + + d = vec_splat(d0_orig, 3); + d0 = vec_sel(d0_orig, d1, vsel3); + sum30 += d0 * d; + + d0 = d1; + } + + sum0 += vec_perm(sum10, sum10, (vector unsigned char)vperm1); + + sum0 += vec_perm(sum20, sum20, (vector unsigned char)vperm2); + + sum0 += vec_perm(sum30, sum30, (vector unsigned char)vperm3); + + for (; i <= limit; i++) { + vector float d; + + d0 = vec_vsx_ld(0, data+i); + + d = vec_splat(d0, 0); + sum0 += d0 * d; + } + + vec_vsx_st(sum0, 0, autoc); + + for (; i < (long)data_len; i++) { + uint32_t coeff; + + FLAC__real d = data[i]; + for (coeff = 0; coeff < data_len - i; coeff++) + autoc[coeff] += d * data[i+coeff]; + } +} +#endif /* FLAC__HAS_TARGET_POWER8 */ + +#ifdef FLAC__HAS_TARGET_POWER9 +__attribute__((target("cpu=power9"))) +void FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_16(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]) +{ + long i; + long limit = (long)data_len - 16; + const FLAC__real *base; + vector float sum0 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum1 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum2 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum3 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum10 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum11 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum12 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum13 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum20 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum21 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum22 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum23 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum30 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum31 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum32 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector 
float sum33 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float d0, d1, d2, d3, d4; +#if WORDS_BIGENDIAN + vector unsigned int vsel1 = { 0x00000000, 0x00000000, 0x00000000, 0xFFFFFFFF }; + vector unsigned int vsel2 = { 0x00000000, 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vsel3 = { 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vperm1 = { 0x04050607, 0x08090A0B, 0x0C0D0E0F, 0x10111213 }; + vector unsigned int vperm2 = { 0x08090A0B, 0x0C0D0E0F, 0x10111213, 0x14151617 }; + vector unsigned int vperm3 = { 0x0C0D0E0F, 0x10111213, 0x14151617, 0x18191A1B }; +#else + vector unsigned int vsel1 = { 0xFFFFFFFF, 0x00000000, 0x00000000, 0x00000000 }; + vector unsigned int vsel2 = { 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000, 0x00000000 }; + vector unsigned int vsel3 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000 }; + vector unsigned int vperm1 = { 0x07060504, 0x0B0A0908, 0x0F0E0D0C, 0x13121110 }; + vector unsigned int vperm2 = { 0x0B0A0908, 0x0F0E0D0C, 0x13121110, 0x17161514 }; + vector unsigned int vperm3 = { 0x0F0E0D0C, 0x13121110, 0x17161514, 0x1B1A1918 }; +#endif + + (void) lag; + FLAC__ASSERT(lag <= 16); + FLAC__ASSERT(lag <= data_len); + + base = data; + + d0 = vec_vsx_ld(0, base); + d1 = vec_vsx_ld(16, base); + d2 = vec_vsx_ld(32, base); + d3 = vec_vsx_ld(48, base); + + base += 16; + + for (i = 0; i <= (limit-4); i += 4) { + vector float d, d0_orig = d0; + + d4 = vec_vsx_ld(0, base); + base += 4; + + d = vec_splat(d0_orig, 0); + sum0 += d0 * d; + sum1 += d1 * d; + sum2 += d2 * d; + sum3 += d3 * d; + + d = vec_splat(d0_orig, 1); + d0 = vec_sel(d0_orig, d4, vsel1); + sum10 += d0 * d; + sum11 += d1 * d; + sum12 += d2 * d; + sum13 += d3 * d; + + d = vec_splat(d0_orig, 2); + d0 = vec_sel(d0_orig, d4, vsel2); + sum20 += d0 * d; + sum21 += d1 * d; + sum22 += d2 * d; + sum23 += d3 * d; + + d = vec_splat(d0_orig, 3); + d0 = vec_sel(d0_orig, d4, vsel3); + sum30 += d0 * d; + sum31 += d1 * d; + sum32 += d2 * d; + sum33 += d3 * d; + + d0 = d1; + d1 = 
d2; + d2 = d3; + d3 = d4; + } + + sum0 += vec_perm(sum10, sum11, (vector unsigned char)vperm1); + sum1 += vec_perm(sum11, sum12, (vector unsigned char)vperm1); + sum2 += vec_perm(sum12, sum13, (vector unsigned char)vperm1); + sum3 += vec_perm(sum13, sum10, (vector unsigned char)vperm1); + + sum0 += vec_perm(sum20, sum21, (vector unsigned char)vperm2); + sum1 += vec_perm(sum21, sum22, (vector unsigned char)vperm2); + sum2 += vec_perm(sum22, sum23, (vector unsigned char)vperm2); + sum3 += vec_perm(sum23, sum20, (vector unsigned char)vperm2); + + sum0 += vec_perm(sum30, sum31, (vector unsigned char)vperm3); + sum1 += vec_perm(sum31, sum32, (vector unsigned char)vperm3); + sum2 += vec_perm(sum32, sum33, (vector unsigned char)vperm3); + sum3 += vec_perm(sum33, sum30, (vector unsigned char)vperm3); + + for (; i <= limit; i++) { + vector float d; + + d0 = vec_vsx_ld(0, data+i); + d1 = vec_vsx_ld(16, data+i); + d2 = vec_vsx_ld(32, data+i); + d3 = vec_vsx_ld(48, data+i); + + d = vec_splat(d0, 0); + sum0 += d0 * d; + sum1 += d1 * d; + sum2 += d2 * d; + sum3 += d3 * d; + } + + vec_vsx_st(sum0, 0, autoc); + vec_vsx_st(sum1, 16, autoc); + vec_vsx_st(sum2, 32, autoc); + vec_vsx_st(sum3, 48, autoc); + + for (; i < (long)data_len; i++) { + uint32_t coeff; + + FLAC__real d = data[i]; + for (coeff = 0; coeff < data_len - i; coeff++) + autoc[coeff] += d * data[i+coeff]; + } +} + +__attribute__((target("cpu=power9"))) +void FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_12(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]) +{ + long i; + long limit = (long)data_len - 12; + const FLAC__real *base; + vector float sum0 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum1 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum2 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum10 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum11 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum12 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum20 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector 
float sum21 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum22 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum30 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum31 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum32 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float d0, d1, d2, d3; +#if WORDS_BIGENDIAN + vector unsigned int vsel1 = { 0x00000000, 0x00000000, 0x00000000, 0xFFFFFFFF }; + vector unsigned int vsel2 = { 0x00000000, 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vsel3 = { 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vperm1 = { 0x04050607, 0x08090A0B, 0x0C0D0E0F, 0x10111213 }; + vector unsigned int vperm2 = { 0x08090A0B, 0x0C0D0E0F, 0x10111213, 0x14151617 }; + vector unsigned int vperm3 = { 0x0C0D0E0F, 0x10111213, 0x14151617, 0x18191A1B }; +#else + vector unsigned int vsel1 = { 0xFFFFFFFF, 0x00000000, 0x00000000, 0x00000000 }; + vector unsigned int vsel2 = { 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000, 0x00000000 }; + vector unsigned int vsel3 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000 }; + vector unsigned int vperm1 = { 0x07060504, 0x0B0A0908, 0x0F0E0D0C, 0x13121110 }; + vector unsigned int vperm2 = { 0x0B0A0908, 0x0F0E0D0C, 0x13121110, 0x17161514 }; + vector unsigned int vperm3 = { 0x0F0E0D0C, 0x13121110, 0x17161514, 0x1B1A1918 }; +#endif + + (void) lag; + FLAC__ASSERT(lag <= 12); + FLAC__ASSERT(lag <= data_len); + + base = data; + + d0 = vec_vsx_ld(0, base); + d1 = vec_vsx_ld(16, base); + d2 = vec_vsx_ld(32, base); + + base += 12; + + for (i = 0; i <= (limit-3); i += 4) { + vector float d, d0_orig = d0; + + d3 = vec_vsx_ld(0, base); + base += 4; + + d = vec_splat(d0_orig, 0); + sum0 += d0 * d; + sum1 += d1 * d; + sum2 += d2 * d; + + d = vec_splat(d0_orig, 1); + d0 = vec_sel(d0_orig, d3, vsel1); + sum10 += d0 * d; + sum11 += d1 * d; + sum12 += d2 * d; + + d = vec_splat(d0_orig, 2); + d0 = vec_sel(d0_orig, d3, vsel2); + sum20 += d0 * d; + sum21 += d1 * d; + sum22 += d2 * d; + + d = vec_splat(d0_orig, 3); + d0 = vec_sel(d0_orig, 
d3, vsel3); + sum30 += d0 * d; + sum31 += d1 * d; + sum32 += d2 * d; + + d0 = d1; + d1 = d2; + d2 = d3; + } + + sum0 += vec_perm(sum10, sum11, (vector unsigned char)vperm1); + sum1 += vec_perm(sum11, sum12, (vector unsigned char)vperm1); + sum2 += vec_perm(sum12, sum10, (vector unsigned char)vperm1); + + sum0 += vec_perm(sum20, sum21, (vector unsigned char)vperm2); + sum1 += vec_perm(sum21, sum22, (vector unsigned char)vperm2); + sum2 += vec_perm(sum22, sum20, (vector unsigned char)vperm2); + + sum0 += vec_perm(sum30, sum31, (vector unsigned char)vperm3); + sum1 += vec_perm(sum31, sum32, (vector unsigned char)vperm3); + sum2 += vec_perm(sum32, sum30, (vector unsigned char)vperm3); + + for (; i <= limit; i++) { + vector float d; + + d0 = vec_vsx_ld(0, data+i); + d1 = vec_vsx_ld(16, data+i); + d2 = vec_vsx_ld(32, data+i); + + d = vec_splat(d0, 0); + sum0 += d0 * d; + sum1 += d1 * d; + sum2 += d2 * d; + } + + vec_vsx_st(sum0, 0, autoc); + vec_vsx_st(sum1, 16, autoc); + vec_vsx_st(sum2, 32, autoc); + + for (; i < (long)data_len; i++) { + uint32_t coeff; + + FLAC__real d = data[i]; + for (coeff = 0; coeff < data_len - i; coeff++) + autoc[coeff] += d * data[i+coeff]; + } +} + +__attribute__((target("cpu=power9"))) +void FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_8(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]) +{ + long i; + long limit = (long)data_len - 8; + const FLAC__real *base; + vector float sum0 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum1 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum10 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum11 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum20 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum21 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum30 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum31 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float d0, d1, d2; +#if WORDS_BIGENDIAN + vector unsigned int vsel1 = { 0x00000000, 0x00000000, 0x00000000, 0xFFFFFFFF }; + vector unsigned int vsel2 
= { 0x00000000, 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vsel3 = { 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vperm1 = { 0x04050607, 0x08090A0B, 0x0C0D0E0F, 0x10111213 }; + vector unsigned int vperm2 = { 0x08090A0B, 0x0C0D0E0F, 0x10111213, 0x14151617 }; + vector unsigned int vperm3 = { 0x0C0D0E0F, 0x10111213, 0x14151617, 0x18191A1B }; +#else + vector unsigned int vsel1 = { 0xFFFFFFFF, 0x00000000, 0x00000000, 0x00000000 }; + vector unsigned int vsel2 = { 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000, 0x00000000 }; + vector unsigned int vsel3 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000 }; + vector unsigned int vperm1 = { 0x07060504, 0x0B0A0908, 0x0F0E0D0C, 0x13121110 }; + vector unsigned int vperm2 = { 0x0B0A0908, 0x0F0E0D0C, 0x13121110, 0x17161514 }; + vector unsigned int vperm3 = { 0x0F0E0D0C, 0x13121110, 0x17161514, 0x1B1A1918 }; +#endif + + (void) lag; + FLAC__ASSERT(lag <= 8); + FLAC__ASSERT(lag <= data_len); + + base = data; + + d0 = vec_vsx_ld(0, base); + d1 = vec_vsx_ld(16, base); + + base += 8; + + for (i = 0; i <= (limit-2); i += 4) { + vector float d, d0_orig = d0; + + d2 = vec_vsx_ld(0, base); + base += 4; + + d = vec_splat(d0_orig, 0); + sum0 += d0 * d; + sum1 += d1 * d; + + d = vec_splat(d0_orig, 1); + d0 = vec_sel(d0_orig, d2, vsel1); + sum10 += d0 * d; + sum11 += d1 * d; + + d = vec_splat(d0_orig, 2); + d0 = vec_sel(d0_orig, d2, vsel2); + sum20 += d0 * d; + sum21 += d1 * d; + + d = vec_splat(d0_orig, 3); + d0 = vec_sel(d0_orig, d2, vsel3); + sum30 += d0 * d; + sum31 += d1 * d; + + d0 = d1; + d1 = d2; + } + + sum0 += vec_perm(sum10, sum11, (vector unsigned char)vperm1); + sum1 += vec_perm(sum11, sum10, (vector unsigned char)vperm1); + + sum0 += vec_perm(sum20, sum21, (vector unsigned char)vperm2); + sum1 += vec_perm(sum21, sum20, (vector unsigned char)vperm2); + + sum0 += vec_perm(sum30, sum31, (vector unsigned char)vperm3); + sum1 += vec_perm(sum31, sum30, (vector unsigned char)vperm3); + + for (; i <= 
limit; i++) { + vector float d; + + d0 = vec_vsx_ld(0, data+i); + d1 = vec_vsx_ld(16, data+i); + + d = vec_splat(d0, 0); + sum0 += d0 * d; + sum1 += d1 * d; + } + + vec_vsx_st(sum0, 0, autoc); + vec_vsx_st(sum1, 16, autoc); + + for (; i < (long)data_len; i++) { + uint32_t coeff; + + FLAC__real d = data[i]; + for (coeff = 0; coeff < data_len - i; coeff++) + autoc[coeff] += d * data[i+coeff]; + } +} + +__attribute__((target("cpu=power9"))) +void FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_4(const FLAC__real data[], uint32_t data_len, uint32_t lag, FLAC__real autoc[]) +{ + long i; + long limit = (long)data_len - 4; + const FLAC__real *base; + vector float sum0 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum10 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum20 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float sum30 = { 0.0f, 0.0f, 0.0f, 0.0f}; + vector float d0, d1; +#if WORDS_BIGENDIAN + vector unsigned int vsel1 = { 0x00000000, 0x00000000, 0x00000000, 0xFFFFFFFF }; + vector unsigned int vsel2 = { 0x00000000, 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vsel3 = { 0x00000000, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF }; + vector unsigned int vperm1 = { 0x04050607, 0x08090A0B, 0x0C0D0E0F, 0x10111213 }; + vector unsigned int vperm2 = { 0x08090A0B, 0x0C0D0E0F, 0x10111213, 0x14151617 }; + vector unsigned int vperm3 = { 0x0C0D0E0F, 0x10111213, 0x14151617, 0x18191A1B }; +#else + vector unsigned int vsel1 = { 0xFFFFFFFF, 0x00000000, 0x00000000, 0x00000000 }; + vector unsigned int vsel2 = { 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000, 0x00000000 }; + vector unsigned int vsel3 = { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0x00000000 }; + vector unsigned int vperm1 = { 0x07060504, 0x0B0A0908, 0x0F0E0D0C, 0x13121110 }; + vector unsigned int vperm2 = { 0x0B0A0908, 0x0F0E0D0C, 0x13121110, 0x17161514 }; + vector unsigned int vperm3 = { 0x0F0E0D0C, 0x13121110, 0x17161514, 0x1B1A1918 }; +#endif + + (void) lag; + FLAC__ASSERT(lag <= 4); + FLAC__ASSERT(lag <= data_len); + + base = data; + 
+	d0 = vec_vsx_ld(0, base);
+
+	base += 4;
+
+	for (i = 0; i <= (limit-1); i += 4) {
+		vector float d, d0_orig = d0;
+
+		d1 = vec_vsx_ld(0, base);
+		base += 4;
+
+		d = vec_splat(d0_orig, 0);
+		sum0 += d0 * d;
+
+		d = vec_splat(d0_orig, 1);
+		d0 = vec_sel(d0_orig, d1, vsel1);
+		sum10 += d0 * d;
+
+		d = vec_splat(d0_orig, 2);
+		d0 = vec_sel(d0_orig, d1, vsel2);
+		sum20 += d0 * d;
+
+		d = vec_splat(d0_orig, 3);
+		d0 = vec_sel(d0_orig, d1, vsel3);
+		sum30 += d0 * d;
+
+		d0 = d1;
+	}
+
+	sum0 += vec_perm(sum10, sum10, (vector unsigned char)vperm1);
+
+	sum0 += vec_perm(sum20, sum20, (vector unsigned char)vperm2);
+
+	sum0 += vec_perm(sum30, sum30, (vector unsigned char)vperm3);
+
+	for (; i <= limit; i++) {
+		vector float d;
+
+		d0 = vec_vsx_ld(0, data+i);
+
+		d = vec_splat(d0, 0);
+		sum0 += d0 * d;
+	}
+
+	vec_vsx_st(sum0, 0, autoc);
+
+	for (; i < (long)data_len; i++) {
+		uint32_t coeff;
+
+		FLAC__real d = data[i];
+		for (coeff = 0; coeff < data_len - i; coeff++)
+			autoc[coeff] += d * data[i+coeff];
+	}
+}
+#endif /* FLAC__HAS_TARGET_POWER9 */
+
+#endif /* FLAC__CPU_PPC64 && FLAC__USE_VSX */
+#endif /* FLAC__NO_ASM */
+#endif /* FLAC__INTEGER_ONLY_LIBRARY */
diff --git a/src/libFLAC/stream_encoder.c b/src/libFLAC/stream_encoder.c
index 87cfb580..74387ec3 100644
--- a/src/libFLAC/stream_encoder.c
+++ b/src/libFLAC/stream_encoder.c
@@ -885,6 +885,36 @@ static FLAC__StreamEncoderInitStatus init_stream_internal_(
 	/* now override with asm where appropriate */
 #ifndef FLAC__INTEGER_ONLY_LIBRARY
 # ifndef FLAC__NO_ASM
+#if defined(FLAC__CPU_PPC64) && defined(FLAC__USE_VSX)
+#ifdef FLAC__HAS_TARGET_POWER8
+#ifdef FLAC__HAS_TARGET_POWER9
+	if (encoder->private_->cpuinfo.ppc.arch_3_00) {
+		if(encoder->protected_->max_lpc_order < 4)
+			encoder->private_->local_lpc_compute_autocorrelation = FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_4;
+		else if(encoder->protected_->max_lpc_order < 8)
+			encoder->private_->local_lpc_compute_autocorrelation = FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_8;
+		else if(encoder->protected_->max_lpc_order < 12)
+			encoder->private_->local_lpc_compute_autocorrelation = FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_12;
+		else if(encoder->protected_->max_lpc_order < 16)
+			encoder->private_->local_lpc_compute_autocorrelation = FLAC__lpc_compute_autocorrelation_intrin_power9_vsx_lag_16;
+		else
+			encoder->private_->local_lpc_compute_autocorrelation = FLAC__lpc_compute_autocorrelation;
+	} else
+#endif
+	if (encoder->private_->cpuinfo.ppc.arch_2_07) {
+		if(encoder->protected_->max_lpc_order < 4)
+			encoder->private_->local_lpc_compute_autocorrelation = FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_4;
+		else if(encoder->protected_->max_lpc_order < 8)
+			encoder->private_->local_lpc_compute_autocorrelation = FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_8;
+		else if(encoder->protected_->max_lpc_order < 12)
+			encoder->private_->local_lpc_compute_autocorrelation = FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_12;
+		else if(encoder->protected_->max_lpc_order < 16)
+			encoder->private_->local_lpc_compute_autocorrelation = FLAC__lpc_compute_autocorrelation_intrin_power8_vsx_lag_16;
+		else
+			encoder->private_->local_lpc_compute_autocorrelation = FLAC__lpc_compute_autocorrelation;
+	}
+#endif
+#endif
 		if(encoder->private_->cpuinfo.use_asm) {
 #  ifdef FLAC__CPU_IA32
 			FLAC__ASSERT(encoder->private_->cpuinfo.type == FLAC__CPUINFO_TYPE_IA32);
-- 
2.17.1
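For readers who want to check a vectorised kernel against something simple, the following is a minimal scalar sketch of the quantity the VSX routines accumulate, together with the bucket selection used in the stream_encoder.c hunk. The names `autocorr_ref` and `vsx_kernel_lag` are hypothetical and do not appear in the patch. The strict `<` comparisons presumably reflect the encoder asking for `max_lpc_order + 1` lags; that is an assumption here, not something this hunk shows.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference (hypothetical name, not part of the patch):
 * autoc[l] = sum over i of data[i] * data[i + l], for l in [0, lag). */
void autocorr_ref(const float *data, uint32_t data_len, uint32_t lag, float *autoc)
{
	uint32_t l, i;

	assert(lag <= data_len); /* mirrors FLAC__ASSERT(lag <= data_len) */

	for (l = 0; l < lag; l++) {
		float sum = 0.0f;
		for (i = 0; i + l < data_len; i++)
			sum += data[i] * data[i + l];
		autoc[l] = sum;
	}
}

/* Hypothetical helper mirroring the if/else chain in the hunk: returns the
 * width of the VSX kernel that would be chosen for a given max_lpc_order,
 * or 0 when the generic C routine is kept. */
uint32_t vsx_kernel_lag(uint32_t max_lpc_order)
{
	if (max_lpc_order < 4)
		return 4;
	if (max_lpc_order < 8)
		return 8;
	if (max_lpc_order < 12)
		return 12;
	if (max_lpc_order < 16)
		return 16;
	return 0;
}
```

A vectorised kernel can accumulate in a different order than this loop, so a comparison against real audio data would normally allow a small floating-point tolerance rather than demand bit-exact equality.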
Brian Willoughby
2018-Jul-11 02:40 UTC
[flac-dev] [PATCH 0/7] PowerPC64 performance improvements
Thank you for this collection of patches.

How can I test them? What platforms (computers) have the ppc64 or POWER9 processor?

Brian

On Jul 10, 2018, at 2:31 PM, Anton Blanchard <anton at ozlabs.org> wrote:
> The following series adds initial vector support for PowerPC64.
> On POWER9, flac --best is about 3.3x faster.
>
> Amitay Isaacs (2):
Anton Blanchard
2018-Jul-12 00:59 UTC
[flac-dev] [PATCH 0/7] PowerPC64 performance improvements
Hi Brian,

> Thank you for this collection of patches.
>
> How can I test them? What platforms (computers) have the ppc64 or
> POWER9 processor?

The IBM Bounty Source page has a list of resources:

https://www.bountysource.com/teams/ibm/bounties

Travis also has ppc64le support, so if you add the linux-ppc64le target then FLAC will be tested on ppc64le.

Thanks,
Anton
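Once a ppc64 machine is available, it can help to confirm which of the two code paths (`arch_2_07` or `arch_3_00`) it would take. The sketch below is one way to query that on Linux through `getauxval(AT_HWCAP2)`. The function name `ppc64_isa_level` is hypothetical, the fallback constant values are assumed to match the Linux `asm/cputable.h` definitions, and this is not necessarily how the cpu.c change in this series does it.

```c
#if defined(__linux__) && defined(__powerpc64__)
#include <sys/auxv.h> /* getauxval, AT_HWCAP2 */
#endif

/* Fallbacks for toolchains whose headers lack these bits. The values are
 * assumed to match Linux's asm/cputable.h. */
#ifndef PPC_FEATURE2_ARCH_2_07
#define PPC_FEATURE2_ARCH_2_07 0x80000000
#endif
#ifndef PPC_FEATURE2_ARCH_3_00
#define PPC_FEATURE2_ARCH_3_00 0x00800000
#endif

/* Hypothetical helper: returns 300 for ISA 3.00 (POWER9), 207 for ISA 2.07
 * (POWER8), and 0 on older CPUs or on anything that is not Linux/ppc64. */
int ppc64_isa_level(void)
{
#if defined(__linux__) && defined(__powerpc64__)
	unsigned long hwcap2 = getauxval(AT_HWCAP2);

	/* Check the newer ISA first: a POWER9 also reports the 2.07 bit. */
	if (hwcap2 & PPC_FEATURE2_ARCH_3_00)
		return 300;
	if (hwcap2 & PPC_FEATURE2_ARCH_2_07)
		return 207;
#endif
	return 0;
}
```

On any other platform the function simply returns 0, so it is safe to compile into a portable test harness.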