search for: vaddq_f32

Displaying 12 results from an estimated 12 matches for "vaddq_f32".

2014 Nov 09
0
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
...include "../kiss_fft.h" +#include <arm_neon.h> + +#define C_MUL_NEON(m, a, b, t, ones, tv) \ + do{ \ + t = vrev64q_f32(b); \ + m = vmulq_f32(a, b); \ + m = vmulq_f32(m, ones); \ + t = vmulq_f32(a, t); \ + tv = vtrnq_f32(m, t); \ + m = vaddq_f32(tv.val[0], tv.val[1]); \ + }while(0) + +#define ONES_MINUS_ONE 0xbf8000003f800000 //{-1.0, 1.0} +#define MINUS_ONE 0xbf800000bf800000 // {-1.0, -1.0} + +static void kf_bfly4_neon_m1(kiss_fft_cpx *Fout, int N) { + float32x4_t Fout_4[2]; + float32x2_t Fout_2[4]; + float32x2_t scratch_2[...
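The C_MUL_NEON macro in the snippet above computes a complex product using a reverse, two multiplies, a transpose, and an add. As a rough illustration only, the following plain-C model walks through the same lane operations for two interleaved complex values. The name c_mul_model is hypothetical and does not appear in the patch. The lane order assumed for the ONES_MINUS_ONE constant ({1, -1} in each pair) is also an assumption, since the snippet is truncated before the point where the constant is loaded.

```c
#include <stddef.h>

/* Hypothetical scalar model of the lane operations in C_MUL_NEON.
 * a and b each hold two interleaved complex values: {re0, im0, re1, im1}. */
static void c_mul_model(const float a[4], const float b[4], float out[4]) {
    /* Assumed lane order of the sign constant: {1, -1} in each pair. */
    static const float ones[4] = {1.0f, -1.0f, 1.0f, -1.0f};
    float t[4], m[4];
    size_t i;

    /* Models vrev64q_f32(b): swap the two floats inside each 64-bit half. */
    t[0] = b[1]; t[1] = b[0]; t[2] = b[3]; t[3] = b[2];

    for (i = 0; i < 4; i++) {
        m[i] = a[i] * b[i] * ones[i];   /* { ar*br, -ai*bi, ... } */
        t[i] = a[i] * t[i];             /* { ar*bi,  ai*br, ... } */
    }

    /* Models vtrnq_f32(m, t) followed by vaddq_f32(val[0], val[1]):
     * val[0] = {m0, t0, m2, t2}, val[1] = {m1, t1, m3, t3}. */
    out[0] = m[0] + m[1];   /* real part of first product */
    out[1] = t[0] + t[1];   /* imaginary part of first product */
    out[2] = m[2] + m[3];   /* real part of second product */
    out[3] = t[2] + t[3];   /* imaginary part of second product */
}
```

Under these assumptions, each output pair is (ar*br - ai*bi, ar*bi + ai*br), which is the usual complex product.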
2014 Nov 09
3
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
Hello, This patch introduces ARM NEON Intrinsics to optimize kf_bfly4 routine in celt part of libopus. Using NEON optimized kf_bfly4(_neon) routine helped improve performance of opus_fft_impl function by about 21.4%. The end use case was decoding a music opus ogg file. The end use case saw performance improvement of about 4.47%. This patch has 2 components i. Actual neon code to improve
2014 Nov 28
2
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
.... > + XX_2 = vld1_lane_f32(xi++, XX_2, 0); Don't load a single lane when you don't need the value(s) in the other lane(s). Use vld1_dup_f32() instead. It's faster and breaks dependencies. > + SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0); > + } > + > + SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]); > + SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]); > + SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]); > + > + vst1q_f32(sum, SUMM[0]); > +} > + > +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, > + opus_val32 *xcorr, int len, int max_pitch,...
2016 Feb 09
2
Vectorization with fast-math on irregular ISA sub-sets
----- Original Message ----- > From: "Renato Golin" <renato.golin at linaro.org> > To: "Hal Finkel" <hfinkel at anl.gov> > Cc: "James Molloy" <James.Molloy at arm.com>, "Nadav Rotem" <nrotem at apple.com>, "Arnold Schwaighofer" > <aschwaighofer at apple.com>, "LLVM Dev" <llvm-dev at
2014 Dec 01
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...ne when you don't need the value(s) in the other > lane(s). Use vld1_dup_f32() instead. It's faster and breaks dependencies. Will keep this in mind. Thanks. > >> + SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0); >> + } >> + >> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]); >> + SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]); >> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]); >> + >> + vst1q_f32(sum, SUMM[0]); >> +} >> + >> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, >> +...
2014 Nov 21
4
[RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hello, I received feedback from engineers working on NE10 [1] that it would be better to use NE10 [1] for FFT optimizations for opus use cases. However, these FFT patches are currently in review and haven't been integrated into NE10 yet. While the FFT functions in NE10 are getting baked, I wanted to optimize the celt_pitch_xcorr (floating point only) and use it to introduce ARM NEON
2014 Nov 21
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...[2]); + SUMM[3] = vmlaq_f32(SUMM[3], XX[3], YY[3]); + YY[0] = YY[4]; + } + + /* Handle remaining values max iterations = 3 */ + for (j = 0; j < cr; j++) { + YY[0] = vld1q_f32(yi++); + XX_2 = vld1_lane_f32(xi++, XX_2, 0); + SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0); + } + + SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]); + SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]); + SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]); + + vst1q_f32(sum, SUMM[0]); +} + +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y, + opus_val32 *xcorr, int len, int max_pitch, int arch) { + int i, j; + + celt_assert...
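The closing vaddq_f32 calls in the snippet above fold four vector accumulators into one with a two-level tree of adds before the store. A minimal sketch of that pattern follows. The helper name reduce_accumulators and its flat 16-float input layout are hypothetical and not taken from the patch. A scalar fallback is included, as an assumption, so the sketch also builds on targets without NEON.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: fold four 4-lane accumulators into one and store it.
 * acc holds 16 floats, i.e. four accumulators of four lanes each. */
static void reduce_accumulators(const float acc[16], float sum[4]) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t s0 = vld1q_f32(acc);
    float32x4_t s1 = vld1q_f32(acc + 4);
    float32x4_t s2 = vld1q_f32(acc + 8);
    float32x4_t s3 = vld1q_f32(acc + 12);

    /* Same pairing as the snippet: (0 + 1), (2 + 3), then the two halves. */
    s0 = vaddq_f32(s0, s1);
    s2 = vaddq_f32(s2, s3);
    s0 = vaddq_f32(s0, s2);

    vst1q_f32(sum, s0);
#else
    /* Scalar fallback with the same association order per lane. */
    size_t lane;
    for (lane = 0; lane < 4; lane++) {
        float s01 = acc[lane] + acc[4 + lane];
        float s23 = acc[8 + lane] + acc[12 + lane];
        sum[lane] = s01 + s23;
    }
#endif
}
```

The adds are lane-wise, so the result still has four lanes: each output lane is the sum of the corresponding lanes of the four accumulators. Pairing the adds as a tree keeps the first two vaddq_f32 calls independent of each other.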
2016 Feb 11
4
Vectorization with fast-math on irregular ISA sub-sets
...y-wrong results. > > But we lower a NEON intrinsic into plain IR instructions. > > If I got it right, the current "fast" attribute is "may use non IEEE > compliant", emphasis on the *may*. > > As a user, I'd be really angry if I used "float32x4_t vaddq_f32 > (float32x4_t, float32x4_t)" and the compiler emitted four VADD.f32 > SN. > > Right now, Clang lowers: > vaddq_f32 (a, b); > > to: > %add.i = fadd <4 x float> %a, %b > > which lowers (correctly) to: > vadd.f32 q0, q0, q1 > > If, OTOH, "...
2013 Sep 26
0
[LLVMdev] ARM NEON intrinsics in clang
...> recently worked on some documentation in this area: > http://clang.llvm.org/docs/CrossCompilation.html. > > But for a quick hack, you could try: > > $ cat > neon.c > #include <arm_neon.h> > > float32x4_t my_func(float32x4_t lhs, float32x4_t rhs) { > return vaddq_f32(lhs, rhs); > } > $ clang --target=arm-linux-gnueabihf -mcpu=cortex-a15 -ffreestanding > -O3 -S -o - neon.c > > ("ffreestanding" will dodge any issues with your supporting toolchain, > but won't work for larger tests. You've got to actually solve the > issues b...
2013 Sep 26
2
[LLVMdev] ARM NEON intrinsics in clang
...ssfully doing so. I tried to compile release 2.9, as I (wrongly) believed that I need llvm-gcc in order to compile NEON code on LLVM. Tim's minimalist example worked on my clang3.4: $ cat > neon.c #include <arm_neon.h> float32x4_t my_func(float32x4_t lhs, float32x4_t rhs) { return vaddq_f32(lhs, rhs); } $ clang --target=arm-linux-gnueabihf -mcpu=cortex-a15 -ffreestanding -O3 -S -o - neon.c however it doesn't if I remove the -ffreestanding flag. I need to figure this out next. Thank you for your help. Cheers, - Stan On Thu, Sep 26, 2013 at 4:01 PM, Renato Golin <renato.gol...
2013 Sep 26
2
[LLVMdev] ARM NEON intrinsics in clang
Hello LLVM Devs, I am starting my PhD on Automatic Parallelization for DSP and want to play with some ARM NEON intrinsics for a start. I spent the last three days trying to compile a version of LLVM that would allow me to compile sources that contain these intrinsics, but with no success. In the process I found out that clang doesn't support NEON (as per
2014 Sep 10
4
[RFC PATCH v1 0/3] Introducing ARM SIMD Support
libvorbis does not currently have any simd/vectorization. Following patches add generic framework for simd/vectorization and on top, add ARM-NEON simd vectorization using intrinsics. I was able to get over 34% performance improvement on my Beaglebone Black which is single Cortex-A8 based CPU. You can find more information on metrics and procedure I used to measure at