Displaying 12 results from an estimated 12 matches for "vaddq_f32".
2014 Nov 09
0
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
...include "../kiss_fft.h"
+#include <arm_neon.h>
+
+#define C_MUL_NEON(m, a, b, t, ones, tv) \
+ do{ \
+ t = vrev64q_f32(b); \
+ m = vmulq_f32(a, b); \
+ m = vmulq_f32(m, ones); \
+ t = vmulq_f32(a, t); \
+ tv = vtrnq_f32(m, t); \
+ m = vaddq_f32(tv.val[0], tv.val[1]); \
+ }while(0)
+
+#define ONES_MINUS_ONE 0xbf8000003f800000 // {1.0, -1.0} per lane pair (low word is lane 0)
+#define MINUS_ONE 0xbf800000bf800000 // {-1.0, -1.0}
+
+static void kf_bfly4_neon_m1(kiss_fft_cpx *Fout, int N) {
+ float32x4_t Fout_4[2];
+ float32x2_t Fout_2[4];
+ float32x2_t scratch_2[...
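For reference, the C_MUL_NEON sequence above (reverse within pairs, two multiplies, a sign flip via the {1.0, -1.0} constant, transpose, add) computes a standard complex multiply on two packed complex values per vector. A scalar sketch of the arithmetic each lane pair performs (plain C, no NEON, illustrative names only):

```c
#include <assert.h>

/* Scalar model of what C_MUL_NEON computes for one complex pair:
 *   m = (ar*br, ai*bi) * (1, -1)        -> (ar*br, -ai*bi)
 *   t = (ar, ai) * rev(br, bi)          -> (ar*bi, ai*br)
 *   transpose + add                     -> (ar*br - ai*bi, ar*bi + ai*br)
 * a[0]/b[0] are real parts, a[1]/b[1] imaginary parts. */
static void c_mul_ref(const float a[2], const float b[2], float out[2]) {
    out[0] = a[0] * b[0] - a[1] * b[1]; /* real part */
    out[1] = a[0] * b[1] + a[1] * b[0]; /* imaginary part */
}
```

The NEON version gets two complex multiplies per vector for the price of this sequence, at the cost of the vtrnq_f32 shuffle.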
2014 Nov 09
3
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
Hello,
This patch introduces ARM NEON intrinsics to optimize the
kf_bfly4 routine in the CELT part of libopus.
Using the NEON-optimized kf_bfly4 (kf_bfly4_neon) routine improved
the performance of the opus_fft_impl function by about 21.4%. The
end use case was decoding a music Opus Ogg file, which saw an
overall performance improvement of about 4.47%.
This patch has two components:
i. Actual neon code to improve
2014 Nov 28
2
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
....
> + XX_2 = vld1_lane_f32(xi++, XX_2, 0);
Don't load a single lane when you don't need the value(s) in the other
lane(s). Use vld1_dup_f32() instead. It's faster and breaks dependencies.
> + SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0);
> + }
> +
> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]);
> + SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]);
> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]);
> +
> + vst1q_f32(sum, SUMM[0]);
> +}
> +
> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
> + opus_val32 *xcorr, int len, int max_pitch,...
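The tail of the quoted function folds the four vector accumulators into one with three vaddq_f32 calls, then stores the four lane partials for the caller. A scalar stand-in for that reduction (plain C; the 4x4 array models four float32x4_t accumulators, names are illustrative):

```c
/* Scalar stand-in for the reduction tail:
 *   SUMM[0] += SUMM[1]; SUMM[2] += SUMM[3]; SUMM[0] += SUMM[2];
 *   vst1q_f32(sum, SUMM[0]);
 * Pairing the adds keeps the two halves independent, so a superscalar
 * or SIMD core can overlap them. */
static void fold_accumulators(const float summ[4][4], float sum[4]) {
    for (int lane = 0; lane < 4; lane++) {
        sum[lane] = (summ[0][lane] + summ[1][lane])
                  + (summ[2][lane] + summ[3][lane]);
    }
}
```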
2016 Feb 09
2
Vectorization with fast-math on irregular ISA sub-sets
----- Original Message -----
> From: "Renato Golin" <renato.golin at linaro.org>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "James Molloy" <James.Molloy at arm.com>, "Nadav Rotem" <nrotem at apple.com>, "Arnold Schwaighofer"
> <aschwaighofer at apple.com>, "LLVM Dev" <llvm-dev at
2014 Dec 01
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...ne when you don't need the value(s) in the other
> lane(s). Use vld1_dup_f32() instead. It's faster and breaks dependencies.
Will keep this in mind. Thanks.
>
>> + SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0);
>> + }
>> +
>> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]);
>> + SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]);
>> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]);
>> +
>> + vst1q_f32(sum, SUMM[0]);
>> +}
>> +
>> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
>> +...
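The reviewer's point about vld1_dup_f32 vs vld1_lane_f32: a lane load merges the new value into an existing vector, so it reads (and therefore depends on) the register's previous contents, while a dup load broadcasts to every lane and starts a fresh dependency chain. A scalar model of the two semantics (plain C, two lanes instead of NEON's registers, illustrative only):

```c
/* Scalar model of a 2-lane vector to contrast the two load flavors. */
typedef struct { float lane[2]; } vec2;

/* vld1_lane_f32-like: write *p into one lane, keep the other lane.
 * Note it takes `old` as input -- that read is the dependency the
 * reviewer wants to avoid. */
static vec2 load_lane(const float *p, vec2 old, int lane) {
    old.lane[lane] = *p;
    return old;
}

/* vld1_dup_f32-like: broadcast *p to all lanes. No input vector,
 * so no dependency on any previous register value. */
static vec2 load_dup(const float *p) {
    vec2 v = { { *p, *p } };
    return v;
}
```

When only lane 0 is ever consumed (as with vmlaq_lane_f32(..., XX_2, 0) above), the garbage in the other lane is harmless, so the cheaper dup load wins.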
2014 Nov 21
4
[RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hello,
I received feedback from engineers working on NE10 [1] that
it would be better to use NE10 for FFT optimizations in
opus use cases. However, those FFT patches are currently in review
and haven't been integrated into NE10 yet.
While the FFT functions in NE10 are getting baked, I wanted
to optimize celt_pitch_xcorr (floating point only) and use
it to introduce ARM NEON
2014 Nov 21
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...[2]);
+ SUMM[3] = vmlaq_f32(SUMM[3], XX[3], YY[3]);
+ YY[0] = YY[4];
+ }
+
+ /* Handle remaining values; at most 3 iterations */
+ for (j = 0; j < cr; j++) {
+ YY[0] = vld1q_f32(yi++);
+ XX_2 = vld1_lane_f32(xi++, XX_2, 0);
+ SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0);
+ }
+
+ SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]);
+ SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]);
+ SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]);
+
+ vst1q_f32(sum, SUMM[0]);
+}
+
+void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
+ opus_val32 *xcorr, int len, int max_pitch, int arch) {
+ int i, j;
+
+ celt_assert...
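For context on what the vectorized kernel above is computing: celt_pitch_xcorr correlates x against y at a range of pitch lags, and the NEON code processes four lags' worth of multiply-accumulates per vmlaq_f32. A minimal scalar reference of that four-lag inner kernel (plain C; the function name and loop shape are assumptions for illustration, not the patch's code):

```c
/* Scalar sketch of a 4-lag cross-correlation kernel: correlate
 * x (length len) against y at four consecutive lags, one running
 * sum per lag. Caller must ensure y has len + 3 readable values. */
static void xcorr4_ref(const float *x, const float *y, float sum[4], int len) {
    for (int k = 0; k < 4; k++)
        sum[k] = 0.0f;
    for (int i = 0; i < len; i++)
        for (int k = 0; k < 4; k++)
            sum[k] += x[i] * y[i + k]; /* vmlaq_f32-style multiply-accumulate */
}
```

In the NEON version the k-loop disappears: one x value is broadcast across a vector and multiplied against four shifted y values at once.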
2016 Feb 11
4
Vectorization with fast-math on irregular ISA sub-sets
...y-wrong results.
>
> But we lower a NEON intrinsic into plain IR instructions.
>
> If I got it right, the current "fast" attribute is "may use non IEEE
> compliant", emphasis on the *may*.
>
> As a user, I'd be really angry if I used "float32x4_t vaddq_f32
> (float32x4_t, float32x4_t)" and the compiler emitted four VADD.f32
> SN.
>
> Right now, Clang lowers:
> vaddq_f32 (a, b);
>
> to:
> %add.i = fadd <4 x float> %a, %b
>
> which lowers (correctly) to:
> vadd.f32 q0, q0, q1
>
> If, OTOH, &qu...
2013 Sep 26
0
[LLVMdev] ARM NEON intrinsics in clang
...> recently worked on some documentation in this area:
> http://clang.llvm.org/docs/CrossCompilation.html.
>
> But for a quick hack, you could try:
>
> $ cat > neon.c
> #include <arm_neon.h>
>
> float32x4_t my_func(float32x4_t lhs, float32x4_t rhs) {
> return vaddq_f32(lhs, rhs);
> }
> $ clang --target=arm-linux-gnueabihf -mcpu=cortex-a15 -ffreestanding
> -O3 -S -o - neon.c
>
> ("ffreestanding" will dodge any issues with your supporting toolchain,
> but won't work for larger tests. You've got to actually solve the
> issues b...
2013 Sep 26
2
[LLVMdev] ARM NEON intrinsics in clang
...ssfully doing so. I
tried to compile release 2.9, as I (wrongly) believed that I needed llvm-gcc
in order to compile NEON code with LLVM.
Tim's minimalist example worked with my clang 3.4:
$ cat > neon.c
#include <arm_neon.h>
float32x4_t my_func(float32x4_t lhs, float32x4_t rhs) {
return vaddq_f32(lhs, rhs);
}
$ clang --target=arm-linux-gnueabihf -mcpu=cortex-a15 -ffreestanding
-O3 -S -o - neon.c
However, it doesn't work if I remove the -ffreestanding flag. I need to figure
this out next.
Thank you for your help.
Cheers,
- Stan
On Thu, Sep 26, 2013 at 4:01 PM, Renato Golin <renato.gol...
2013 Sep 26
2
[LLVMdev] ARM NEON intrinsics in clang
Hello LLVM Devs,
I am starting my PhD on Automatic Parallelization for DSP and want to play
with some ARM NEON intrinsics for a start. I spent the last three days
trying to compile a version of LLVM that would allow me to compile sources
that contain these intrinsics, but with no success.
In the process I found out that clang doesn't support NEON (as per
2014 Sep 10
4
[RFC PATCH v1 0/3] Introducing ARM SIMD Support
libvorbis does not currently have any SIMD/vectorization.
The following patches add a generic framework for SIMD/vectorization
and, on top of it, add ARM NEON vectorization using intrinsics.
I was able to get an over 34% performance improvement on my
BeagleBone Black, which has a single Cortex-A8 CPU.
You can find more information on the metrics and procedure I used
to measure at