thr3ads.net - search: "float32x4_t"

Displaying 20 results from an estimated 67 matches for "float32x4_t".

ARM vectorized fp16 support

2019 Sep 05

...ts, $ clang -O3 -march=armv8.2-a+fp16fml -ffast-math -S -o- vfp32.c test_vfma_lane_f16: // @test_vfma_lane_f16 fmla v2.4s, v1.4s, v0.4s // fp32 is GOOD mov v0.16b, v2.16b ret $ cat vfp32.c #include <arm_neon.h> float32x4_t test_vfma_lane_f16(float32x4_t a, float32x4_t b, float32x4_t c) { c += a * b; return c; } $ clang -O3 -march=armv8.2-a+fp16fml -ffast-math -S -o- vfp16.c test_vfma_lane_f16: // @test_vfma_lane_f16 fmul v0.4h, v1.4h, v0.4h fadd v0....

ARM vectorized fp16 support

2019 Sep 05

...p16fml -ffast-math -S -o- vfp32.c > test_vfma_lane_f16: // @test_vfma_lane_f16 > fmla v2.4s, v1.4s, v0.4s // fp32 is GOOD > mov v0.16b, v2.16b > ret > $ cat vfp32.c > #include <arm_neon.h> > float32x4_t test_vfma_lane_f16(float32x4_t a, float32x4_t b, float32x4_t c) { > c += a * b; > return c; > } > > $ clang -O3 -march=armv8.2-a+fp16fml -ffast-math -S -o- vfp16.c > test_vfma_lane_f16: // @test_vfma_lane_f16 > fmul v0.4h, v1.4h, v0...

[LLVMdev] NEON intrinsics preventing redundant load optimization?

2014 Dec 10

...r the interleaved cases (vld[234].*). It seems the vld1.* and vst1.* do have those direct IR representations though. It’s great news if this is fixed in the current tip, but in the short term (for app store builds using the official toolchain) are there any LLVM-specific extensions to initialise a float32x4_t that will get lowered to the "load <4 x float>* %1” form? Or is that more a question for the clang folks? Simon
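One possible way to approach the question above is sketched here with the generic GCC/clang `vector_size` extension, so that it also compiles on non-ARM hosts. The names `v4f`, `load4` and `hsum4` are hypothetical, and whether the `memcpy` really lowers to a plain `load <4 x float>` on a particular toolchain is an assumption to verify with `-S -emit-llvm`; the thread snippet does not confirm it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for float32x4_t: a 16-byte vector of four floats
 * declared with the GCC/clang vector extension. */
typedef float v4f __attribute__((vector_size(16)));

/* Initialise a vector from four consecutive floats.
 * memcpy avoids alignment and aliasing assumptions; optimising compilers
 * commonly turn it into a single unaligned vector load (not guaranteed). */
static v4f load4(const float *p) {
    assert(p != NULL);
    v4f v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Horizontal sum, used here only to observe the loaded lanes. */
static float hsum4(v4f v) {
    return v[0] + v[1] + v[2] + v[3];
}
```

On an ARM build the same pattern could be tried with `float32x4_t` from `<arm_neon.h>` in place of `v4f`; that substitution is untested here.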

[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

Hi, Optimizes celt_pitch_xcorr for ARM NEON floating point. Changes from RFCv3: - celt_neon_intr.c - removed warnings due to not having constant pointers - Put simpler loop to take care of corner cases. Unrolling using intrinsics was not really mapping well to what was done in celt_pitch_xcorr_arm.s - Makefile.am Removed explicit -O3 optimization - test_unit_mathops.c,

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

2014 Nov 09

...ulq_f32(a, t); \ + tv = vtrnq_f32(m, t); \ + m = vaddq_f32(tv.val[0], tv.val[1]); \ + }while(0) + +#define ONES_MINUS_ONE 0xbf8000003f800000 //{-1.0, 1.0} +#define MINUS_ONE 0xbf800000bf800000 // {-1.0, -1.0} + +static void kf_bfly4_neon_m1(kiss_fft_cpx *Fout, int N) { + float32x4_t Fout_4[2]; + float32x2_t Fout_2[4]; + float32x2_t scratch_2[2]; + float32x2_t ones_2 = vcreate_f32(ONES_MINUS_ONE); + float32x2_t minusones_2 = vcreate_f32(MINUS_ONE); + float *ai = (float *)Fout; + float *bi = (float *)Fout; + int i; + + /* Consume/update 4 complex Fout values per cycle + * just...
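The two 64-bit constants in the snippet above pack a pair of IEEE-754 single-precision values each. A small portable sketch of how such a pattern decodes, assuming lane 0 corresponds to the low 32 bits (as on little-endian targets); `lane_from_bits` is a hypothetical helper and the `ULL` suffixes are added here, they are not in the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ONES_MINUS_ONE 0xbf8000003f800000ULL
#define MINUS_ONE      0xbf800000bf800000ULL

/* Reinterpret one 32-bit half of a 64-bit pattern as a float.
 * lane 0 is the low 32 bits, lane 1 the high 32 bits (assumption:
 * this matches the lane order of the vector built from the pattern). */
static float lane_from_bits(uint64_t bits, int lane) {
    assert(lane == 0 || lane == 1);
    uint32_t word = (uint32_t)(bits >> (32 * lane));
    float f;
    memcpy(&f, &word, sizeof f);   /* bit-exact reinterpretation */
    return f;
}
```

Under that lane-order assumption, `ONES_MINUS_ONE` holds 1.0 in lane 0 and -1.0 in lane 1, so the patch comment `{-1.0, 1.0}` reads from the high half to the low half.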

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

...lude "../arch.h" + +/* + * Function: xcorr_kernel_neon_float + * --------------------------------- + * Computes 4 correlation values and stores them in sum[4] + */ +static void xcorr_kernel_neon_float(const float *x, const float *y, + float sum[4], int len) { + float32x4_t YY[3]; + float32x4_t YEXT[3]; + float32x4_t XX[2]; + float32x2_t XX_2; + float32x4_t SUMM; + const float *xi = x; + const float *yi = y; + + celt_assert(len>0); + + YY[0] = vld1q_f32(yi); + SUMM = vdupq_n_f32(0); + + /* Consume 8 elements in x vector and 12 elements in y +...
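For orientation, a scalar sketch of what "computes 4 correlation values and stores them in sum[4]" could mean. It assumes the four values are the correlations of `x` against `y` at lags 0 to 3; the name `xcorr_kernel_scalar` is hypothetical and this is not the code from the patch.

```c
#include <assert.h>

/* Scalar reference sketch: sum[k] = sum over j of x[j] * y[j + k], k = 0..3.
 * y must provide at least len + 3 readable elements. */
static void xcorr_kernel_scalar(const float *x, const float *y,
                                float sum[4], int len) {
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    assert(len > 0);
    for (int j = 0; j < len; j++) {
        for (int k = 0; k < 4; k++) {
            acc[k] += x[j] * y[j + k];
        }
    }
    /* Store (not accumulate) the results, following the comment in the patch. */
    for (int k = 0; k < 4; k++) {
        sum[k] = acc[k];
    }
}
```

A reference like this can be handy for checking a vectorized kernel, bearing in mind that a different summation order may change the floating-point results slightly.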

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv1: - Rebased on top of commit aad281878: Fix celt_pitch_xcorr_c signature. which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of "arch" parameter. - Unified with --enable-intrinsics used by x86 - Modified algorithm to be more in-line with algorithm in celt_pitch_xcorr_arm.s Viswanath Puttagunta

[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 10

...lude "../arch.h" + +/* + * Function: xcorr_kernel_neon_float + * --------------------------------- + * Computes 4 correlation values and stores them in sum[4] + */ +static void xcorr_kernel_neon_float(const float *x, const float *y, + float sum[4], int len) { + float32x4_t YY[3]; + float32x4_t YEXT[3]; + float32x4_t XX[2]; + float32x2_t XX_2; + float32x4_t SUMM; + float *xi = x; + float *yi = y; + + celt_assert(len>0); + + YY[0] = vld1q_f32(yi); + SUMM = vdupq_n_f32(0); + + /* Consume 8 elements in x vector and 12 elements in y + * vector. H...

[RFC PATCH v3] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 10

Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv2: - Changes recommended by Timothy for celt_neon_intr.c everything except, left the unrolled loop still unrolled - configure.ac - use AC_LINK_IFELSE instead of AC_COMPILE_IFELSE - Moved compile flags into Makefile.am - OPUS_ARM_NEON_INR --> typo --> OPUS_ARM_NEON_INTR Viswanath Puttagunta (1): armv7:

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 09

...{ I had to think quite a bit about what "3to1" meant (since it is describing the context of the caller, not what the actual function does). I'd follow the naming convention in the existing celt_pitch_xcorr_arm.s, and use "process1", personally. > + int i; > + float32x4_t XX[4]; > + float32x4_t YY[4]; > + float32x4_t SUMM; > + float32x2_t ZERO; > + float32x2x2_t tv; > + float sumi; > + float *xi = x; > + float *yi = y; > + > + ZERO = vdup_n_f32(0); > + SUMM = vdupq_n_f32(0); > + > + /* Work on 16 values per cyc...

[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 19

...> + * Function: xcorr_kernel_neon_float > + * --------------------------------- > + * Computes 4 correlation values and stores them in sum[4] > + */ > +static void xcorr_kernel_neon_float(const float *x, const float *y, > + float sum[4], int len) { > + float32x4_t YY[3]; > + float32x4_t YEXT[3]; > + float32x4_t XX[2]; > + float32x2_t XX_2; > + float32x4_t SUMM; > + const float *xi = x; > + const float *yi = y; > + > + celt_assert(len>0); > + > + YY[0] = vld1q_f32(yi); > + SUMM = vdupq_n_f32(0); > + >...

Vectorization with fast-math on irregular ISA sub-sets

2016 Feb 09

----- Original Message ----- > From: "Renato Golin" <renato.golin at linaro.org> > To: "Hal Finkel" <hfinkel at anl.gov> > Cc: "James Molloy" <James.Molloy at arm.com>, "Nadav Rotem" <nrotem at apple.com>, "Arnold Schwaighofer" > <aschwaighofer at apple.com>, "LLVM Dev" <llvm-dev at

[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

...; +#include "../arch.h" + +/* + * Function: xcorr_kernel_neon_float + * --------------------------------- + * Computes 4 correlation values and stores them in sum[4] + */ +void xcorr_kernel_neon_float(const float *x, const float *y, + float sum[4], int len) { + float32x4_t YY[3]; + float32x4_t YEXT[3]; + float32x4_t XX[2]; + float32x2_t XX_2; + float32x4_t SUMM; + float *xi = x; + float *yi = y; + + celt_assert(len>0); + + YY[0] = vld1q_f32(yi); + SUMM = vdupq_n_f32(0); + + /* Consume 8 elements in x vector and 12 elements in y + * vector. H...

[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics

2014 Nov 09

Hello, This patch introduces ARM NEON Intrinsics to optimize kf_bfly4 routine in celt part of libopus. Using NEON optimized kf_bfly4(_neon) routine helped improve performance of opus_fft_impl function by about 21.4%. The end use case was decoding a music opus ogg file. The end use case saw performance improvement of about 4.47%. This patch has 2 components i. Actual neon code to improve

[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Dec 07

From: Viswanath Puttagunta <viswanath.puttagunta at linaro.org> Hi, Optimizes celt_pitch_xcorr for floating point. Changes from RFCv1: - Rebased on top of commit aad281878: Fix celt_pitch_xcorr_c signature. which got rid of ugly code around CELT_PITCH_XCORR_IMPL passing of "arch" parameter. - Unified with --enable-intrinsics used by x86 - Modified algorithm to be more

[PATCH 12/15] Replace call of celt_inner_prod_c() (step 1)

2016 Sep 13

Should call celt_inner_prod(). --- celt/bands.c | 7 ++++--- celt/bands.h | 2 +- celt/celt_encoder.c | 6 +++--- celt/pitch.c | 2 +- src/opus_multistream_encoder.c | 2 +- 5 files changed, 10 insertions(+), 9 deletions(-) diff --git a/celt/bands.c b/celt/bands.c index bbe8a4c..1ab24aa 100644 --- a/celt/bands.c +++ b/celt/bands.c

[LLVMdev] ARM NEON intrinsics in clang

2013 Sep 26

...so, you're probably having problems cross-compiling. Renato's > recently worked on some documentation in this area: > http://clang.llvm.org/docs/CrossCompilation.html. > > But for a quick hack, you could try: > > $ cat > neon.c > #include <arm_neon.h> > > float32x4_t my_func(float32x4_t lhs, float32x4_t rhs) { > return vaddq_f32(lhs, rhs); > } > $ clang --target=arm-linux-gnueabihf -mcpu=cortex-a15 -ffreestanding > -O3 -S -o - neon.c > > ("ffreestanding" will dodge any issues with your supporting toolchain, > but won't work...
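A variant of the quick hack above that can also be built on a non-ARM host is sketched below. The wrapper name `add4` and the scalar fallback are assumptions added for illustration; only `vaddq_f32`, `vld1q_f32` and `vst1q_f32` are real NEON intrinsics, and the NEON path is taken only when the compiler defines `__ARM_NEON` (or the older `__ARM_NEON__`).

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_PATH 1
#else
#define HAVE_NEON_PATH 0
#endif

/* Add four floats element-wise: out[i] = lhs[i] + rhs[i], i = 0..3. */
static void add4(const float *lhs, const float *rhs, float *out) {
    assert(lhs != NULL && rhs != NULL && out != NULL);
#if HAVE_NEON_PATH
    /* One vector add on float32x4_t, as in the snippet above. */
    vst1q_f32(out, vaddq_f32(vld1q_f32(lhs), vld1q_f32(rhs)));
#else
    /* Scalar fallback so the file still compiles without NEON. */
    for (int i = 0; i < 4; i++) {
        out[i] = lhs[i] + rhs[i];
    }
#endif
}
```

Keeping the intrinsic behind a feature-test macro lets the same source be compiled natively for a quick check before sorting out the cross-compilation setup.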

[LLVMdev] ARM NEON intrinsics in clang

2013 Sep 26

...64-bit. I am much happy to compile the latest code and am successfully doing so. I tried to compile release 2.9, as I (wrongly) believed that I need llvm-gcc in order to compile NEON code on LLVM. Tim's minimalist example worked on my clang3.4: $ cat > neon.c #include <arm_neon.h> float32x4_t my_func(float32x4_t lhs, float32x4_t rhs) { return vaddq_f32(lhs, rhs); } $ clang --target=arm-linux-gnueabihf -mcpu=cortex-a15 -ffreestanding -O3 -S -o - neon.c however it doesn't if I remove the -ffreestanding flag. I need to figure this out next. Thank you for your help. Cheers, - Stan...

[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics

2014 Nov 28

.../* Neon */ Please do not use tabs in source code (this applies here and everywhere below). Even with the tabs expanded in context, the comments here do not line up properly. > +static void xcorr_kernel_neon_float(float *x, float *y, float sum[4], int len) { x and y should be const. > + float32x4_t YY[5]; > + float32x4_t XX[4]; > + float32x2_t XX_2; > + float32x4_t SUMM[4]; > + float *xi = x; > + float *yi = y; > + int cd = len/4; > + int cr = len%4; len is signed, so / and % are NOT equivalent to the corresponding >> and & (they are much slower). > + int...
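The review remark about signed `len` can be illustrated with a small sketch. The helper names are hypothetical. C99 division truncates toward zero, while `>>` on a negative `int` is implementation-defined (an arithmetic shift on common compilers), so the two forms only agree for non-negative values and the compiler cannot simply substitute one for the other.

```c
#include <assert.h>

static int quot_div(int len)   { return len / 4; }   /* truncates toward zero */
static int quot_shift(int len) { return len >> 2; }  /* impl.-defined if len < 0 */
static int rem_mod(int len)    { return len % 4; }   /* takes the sign of len */
static int rem_mask(int len)   { return len & 3; }   /* always in 0..3 */

/* For non-negative len the two forms agree, which is why an unsigned
 * (or known non-negative) length lets the cheaper form be used. */
static void check_nonnegative(int len) {
    assert(len >= 0);
    assert(quot_div(len) == quot_shift(len));
    assert(rem_mod(len) == rem_mask(len));
}
```

For example, with `len = -7` the division forms give -1 and -3, whereas the mask form gives 1 on two's-complement machines. Whether the speed difference matters in a given loop is the reviewer's claim and is not measured here.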

[LLVMdev] ARM NEON intrinsics in clang

2013 Sep 26

Hello LLVM Devs, I am starting my PhD on Automatic Parallelization for DSP and want to play with some ARM NEON intrinsics for a start. I spent the last three days trying to compile a version of LLVM that would allow me to compile sources that contain these intrinsics, but with no success. In the process I found out that clang doesn't support NEON (as per
