Displaying 20 results from an estimated 25 matches for "vst1q_f32".
2014 Nov 09
0
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
...t_2[1] = vadd_f32(scratch_2[0], scratch_2[1]);
+
+ /* scratch_2[1] *= (-1, -1) */
+ scratch_2[1] = vmul_f32(scratch_2[1], minusones_2);
+ Fout_2[3] = vadd_f32(scratch_2[0], scratch_2[1]);
+
+ Fout_4[0] = vcombine_f32(Fout_2[0], Fout_2[1]);
+ Fout_4[1] = vcombine_f32(Fout_2[2], Fout_2[3]);
+
+ vst1q_f32(bi, Fout_4[0]);
+ bi += 4;
+ vst1q_f32(bi, Fout_4[1]);
+ bi += 4;
+ }
+}
+
+static void kf_bfly4_neon_m8(kiss_fft_cpx * Fout,
+ const size_t fstride,
+ const kiss_fft_state *st,
+ int m,
+ int N,
+...
2014 Nov 09
3
[RFC PATCH v1] arm: kf_bfly4: Introduce ARM neon intrinsics
Hello,
This patch introduces ARM NEON intrinsics to optimize the
kf_bfly4 routine in the celt part of libopus.
Using the NEON-optimized kf_bfly4(_neon) routine improved
the performance of the opus_fft_impl function by about 21.4%.
The end use case, decoding a music opus ogg file, saw an
overall performance improvement of about 4.47%.
This patch has 2 components
i. Actual neon code to improve
2014 Dec 19
3
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...> + float *yi = y;
These need to be const float32_t (in both xcorr_kernel_neon_float and
xcorr_kernel_neon_float_process1). They're currently causing a ton of
warning spew. float32_t appears to not be considered equivalent to
float, which means you'll also need casts here:
> + vst1q_f32(sum, SUMM);
and here:
> + vst1_lane_f32(sum, SUMM_2[0], 0);
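
The casts requested in this review can be sketched as follows. This is a minimal illustration, not code from the patch: `scale_by` is a hypothetical helper, and the scalar branch exists only so the sketch also builds on targets without NEON. The point is that plain `float` pointers are cast explicitly to `float32_t` pointers at the `vld1q_f32`/`vst1q_f32` call sites, since some compiler versions do not treat the two types as interchangeable.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the patch): out[i] = in[i] * scale.
 * n must be a multiple of 4 so the NEON path needs no tail loop. */
static void scale_by(const float *in, float *out, size_t n, float scale)
{
    size_t i;
    assert(n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    {
        float32x4_t vscale = vdupq_n_f32((float32_t)scale);
        for (i = 0; i < n; i += 4) {
            /* Explicit casts: float * and float32_t * are not
             * guaranteed to be compatible on every compiler. */
            float32x4_t v = vld1q_f32((const float32_t *)(in + i));
            vst1q_f32((float32_t *)(out + i), vmulq_f32(v, vscale));
        }
    }
#else
    /* Scalar fallback, assumed here only for portability of the sketch. */
    for (i = 0; i < n; i++)
        out[i] = in[i] * scale;
#endif
}
```

Calling `scale_by(in, out, 4, 2.0f)` on four floats doubles each element; keeping the casts at the intrinsic boundary lets the rest of the code stay in plain `float`.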
2015 Jan 29
2
[RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library
..._fft_impl().
> +
> + out = (float *)tempin;
These are pretty confusing names (if you have to keep this scaling
here). Ideally they'd be related since they refer to the same memory
(e.g., scaled and scaledp or something).
Also, float is _not_ compatible with float32_t (which is what vst1q_f32
takes) in all compiler versions. Please do not mix and match them.
> + scale = vld1_dup_f32(&st->scale);
Needs a (const float32_t *) cast.
> + for (i = 0; i < N2; i++) {
> + inq = vld1q_f32(in);
> + in += 4;
> + outq = vmulq_lane_f32(inq, scale, 0);
...
2014 Dec 07
3
[LLVMdev] NEON intrinsics preventing redundant load optimization?
...n) then the temporary "result" seems to be kept in the generated code for the test function, and triggers the bad penalty of a load after a NEON store.
vec4 operator* (vec4& a, vec4& b)
{
vec4 result;
float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data));
vst1q_f32(result.data, result_data);
return result;
}
__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
sub sp, #16
vld1.32 {d16, d17}, [r1]
vld1.32 {d18, d19}, [r0]
mov r0, sp
vmul.f32 q8, q9, q8
vst1.32 {d16, d17}, [r0]
vld1.32 {d16, d17}, [r0]
vst1.32 {d16, d17}, [r2]
add sp, #16
bx lr
Is there som...
2015 Jan 29
0
[RFC PATCH v1 2/2] armv7(float): Optimize encode usecase using NE10 library
...ut = (float *)tempin;
>
>
> These are pretty confusing names (if you have to keep this scaling here).
> Ideally they'd be related since they refer to the same memory (e.g., scaled
> and scaledp or something).
>
> Also, float is _not_ compatible with float32_t (which is what vst1q_f32
> takes) in all compiler versions. Please do not mix and match them.
>
>> + scale = vld1_dup_f32(&st->scale);
>
>
> Needs a (const float32_t *) cast.
>
>> + for (i = 0; i < N2; i++) {
>> + inq = vld1q_f32(in);
>> + in += 4;
>>...
2014 Dec 09
1
[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...32(yi++);
> + case 2:
> + XX_2 = vld1_dup_f32(xi++);
> + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> + YY[0] = vld1q_f32(yi++);
> + case 1:
> + XX_2 = vld1_dup_f32(xi++);
> + SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
> + }
> +
> + vst1q_f32(sum, SUMM);
> +}
> +
> +/*
> + * Function: xcorr3to1_kernel_neon_float
> + * ---------------------------------
> + * Computes single correlation values and stores in *sum
> + */
> +void xcorr3to1_kernel_neon_float(const float *x, const float *y,
> + float *s...
2014 Sep 10
4
[RFC PATCH v1 0/3] Introducing ARM SIMD Support
libvorbis does not currently have any SIMD/vectorization.
The following patches add a generic framework for SIMD/vectorization
and, on top of it, ARM NEON vectorization using intrinsics.
I was able to get over 34% performance improvement on my
Beaglebone Black, which has a single Cortex-A8 CPU.
You can find more information on metrics and procedure I used
to measure at
2014 Dec 19
0
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...> These need to be const float32_t (in both xcorr_kernel_neon_float and
> xcorr_kernel_neon_float_process1). They're currently causing a ton of
> warning spew. float32_t appears to not be considered equivalent to
> float, which means you'll also need casts here:
>
>> + vst1q_f32(sum, SUMM);
>
> and here:
>
>> + vst1_lane_f32(sum, SUMM_2[0], 0);
Thanks, will do.
2014 Nov 28
2
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...). Use vld1_dup_f32() instead. It's faster and breaks dependencies.
> + SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0);
> + }
> +
> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]);
> + SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]);
> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]);
> +
> + vst1q_f32(sum, SUMM[0]);
> +}
> +
> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
> + opus_val32 *xcorr, int len, int max_pitch, int arch) {
arch is unused. There's no reason to pass it here. If we're here, we
know what the arch is.
> + int i, j;
...
2014 Dec 18
2
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Almost there... just a few nits left.
Viswanath Puttagunta wrote:
> +if OPUS_ARM_NEON_INTR
> +CELT_SOURCES += $(CELT_SOURCES_ARM_NEON_INTR)
> +OPUS_ARM_NEON_INTR_CPPFLAGS = -mfpu=neon -O3
I'll repeat: I don't think you should change the optimization level here.
> + /* Just unroll the rest of the loop */
I saw you decided to keep this unrolled, but you didn't actually
2014 Dec 07
2
[RFC PATCH v2] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for floating point.
Changes from RFCv1:
- Rebased on top of commit
aad281878: Fix celt_pitch_xcorr_c signature.
which got rid of ugly code around CELT_PITCH_XCORR_IMPL
passing of "arch" parameter.
- Unified with --enable-intrinsics used by x86
- Modified algorithm to be more in-line with algorithm in
celt_pitch_xcorr_arm.s
Viswanath Puttagunta
2014 Dec 01
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...nks.
>
>> + SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0);
>> + }
>> +
>> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]);
>> + SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]);
>> + SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]);
>> +
>> + vst1q_f32(sum, SUMM[0]);
>> +}
>> +
>> +void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
>> + opus_val32 *xcorr, int len, int max_pitch, int arch) {
>
> arch is unused. There's no reason to pass it here. If we're here, we
...
2014 Nov 21
4
[RFC PATCHv1] cover: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hello,
I received feedback from engineers working on NE10 [1] that
it would be better to use NE10 [1] for FFT optimizations for
opus use cases. However, these FFT patches are currently in review
and haven't been integrated into NE10 yet.
While the FFT functions in NE10 are getting baked, I wanted
to optimize the celt_pitch_xcorr (floating point only) and use
it to introduce ARM NEON
2014 Nov 21
0
[RFC PATCHv1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...tions = 3 */
+ for (j = 0; j < cr; j++) {
+ YY[0] = vld1q_f32(yi++);
+ XX_2 = vld1_lane_f32(xi++, XX_2, 0);
+ SUMM[0] = vmlaq_lane_f32(SUMM[0], YY[0], XX_2, 0);
+ }
+
+ SUMM[0] = vaddq_f32(SUMM[0], SUMM[1]);
+ SUMM[2] = vaddq_f32(SUMM[2], SUMM[3]);
+ SUMM[0] = vaddq_f32(SUMM[0], SUMM[2]);
+
+ vst1q_f32(sum, SUMM[0]);
+}
+
+void celt_pitch_xcorr_float_neon(const opus_val16 *_x, const opus_val16 *_y,
+ opus_val32 *xcorr, int len, int max_pitch, int arch) {
+ int i, j;
+
+ celt_assert(max_pitch > 0);
+ celt_assert((((unsigned char *)_x-(unsigned char *)NULL)&3)==0);
+
+ for (i = 0; i < (...
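
The three `vaddq_f32` calls before the store combine four independent accumulators pairwise: (0+1), (2+3), then the two partial results. A scalar sketch of that accumulation pattern is given below; `dot4acc` is a hypothetical name and the dot product is only an illustrative workload, not what the patch computes.

```c
#include <assert.h>

/* Hypothetical illustration (not from the patch): a dot product kept in
 * four independent accumulators, reduced pairwise at the end in the same
 * order as the SUMM[0..3] additions in the snippet. */
static float dot4acc(const float *a, const float *b, int n)
{
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int i;
    assert(n >= 0 && n % 4 == 0);
    for (i = 0; i < n; i += 4) {
        /* Each accumulator depends only on its own previous value. */
        acc[0] += a[i]     * b[i];
        acc[1] += a[i + 1] * b[i + 1];
        acc[2] += a[i + 2] * b[i + 2];
        acc[3] += a[i + 3] * b[i + 3];
    }
    acc[0] += acc[1];          /* SUMM[0] = SUMM[0] + SUMM[1] */
    acc[2] += acc[3];          /* SUMM[2] = SUMM[2] + SUMM[3] */
    return acc[0] + acc[2];    /* SUMM[0] = SUMM[0] + SUMM[2] */
}
```

Keeping the accumulators separate until the end shortens the dependency chain inside the loop; note that the pairwise order can round differently from a single running sum.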
2014 Dec 19
2
[PATCH v1] cover: armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
Hi,
Optimizes celt_pitch_xcorr for ARM NEON floating point.
Changes from RFCv3:
- celt_neon_intr.c
- removed warnings due to not having constant pointers
- Put simpler loop to take care of corner cases. Unrolling using
intrinsics was not really mapping well to what was done
in celt_pitch_xcorr_arm.s
- Makefile.am
Removed explicit -O3 optimization
- test_unit_mathops.c,
2014 Dec 19
0
[PATCH v1] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
...+ }
+
+ yi++;
+ while (len > 1) {
+ XX_2 = vld1_dup_f32(xi++);
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
+ YY[0]= vld1q_f32(yi++);
+ len--;
+ }
+
+ if (len > 0) {
+ XX_2 = vld1_dup_f32(xi);
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
+ }
+
+ vst1q_f32(sum, SUMM);
+}
+
+/*
+ * Function: xcorr_kernel_neon_float_process1
+ * ---------------------------------
+ * Computes single correlation values and stores in *sum
+ */
+static void xcorr_kernel_neon_float_process1(const float *x, const float *y,
+ float *sum, int len) {
+ float32x4...
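
Reading the loop in this snippet, each iteration broadcasts one x value (`vld1_dup_f32`), multiplies it into four consecutive y values (`vld1q_f32`), and accumulates into SUMM, so the kernel appears to produce four correlation lags at once. A scalar reference for that reading is sketched below. `xcorr_kernel_ref` is a hypothetical name, and the sketch assumes the caller zero-initializes `sum` and that `y` has at least `len + 3` readable elements; neither detail is visible in the truncated snippet.

```c
#include <assert.h>

/* Hypothetical scalar reference (not part of the patch):
 * sum[k] += x[j] * y[j + k] for k = 0..3 and j = 0..len-1.
 * Assumes sum[] is zeroed by the caller and y has len + 3 elements. */
static void xcorr_kernel_ref(const float *x, const float *y,
                             float sum[4], int len)
{
    int j, k;
    assert(len >= 0);
    for (j = 0; j < len; j++) {
        /* One x sample against a sliding window of four y samples. */
        for (k = 0; k < 4; k++)
            sum[k] += x[j] * y[j + k];
    }
}
```

A reference like this can be used to check a NEON kernel's output on small inputs, allowing for small floating-point differences if the vector code accumulates in a different order.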
2014 Dec 07
0
[RFC PATCH v2] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
..._f32(SUMM, YY[0], XX_2, 0);
+ YY[0] = vld1q_f32(yi++);
+ case 2:
+ XX_2 = vld1_dup_f32(xi++);
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
+ YY[0] = vld1q_f32(yi++);
+ case 1:
+ XX_2 = vld1_dup_f32(xi++);
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
+ }
+
+ vst1q_f32(sum, SUMM);
+}
+
+/*
+ * Function: xcorr3to1_kernel_neon_float
+ * ---------------------------------
+ * Computes single correlation values and stores in *sum
+ */
+void xcorr3to1_kernel_neon_float(const float *x, const float *y,
+ float *sum, int len) {
+ int i;
+ float32x4_t XX[...
2014 Dec 10
0
[RFC PATCH v3] armv7: celt_pitch_xcorr: Introduce ARM neon intrinsics
..._f32(SUMM, YY[0], XX_2, 0);
+ YY[0] = vld1q_f32(yi++);
+ case 2:
+ XX_2 = vld1_dup_f32(xi++);
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
+ YY[0] = vld1q_f32(yi++);
+ case 1:
+ XX_2 = vld1_dup_f32(xi++);
+ SUMM = vmlaq_lane_f32(SUMM, YY[0], XX_2, 0);
+ }
+
+ vst1q_f32(sum, SUMM);
+}
+
+/*
+ * Function: xcorr_kernel_neon_float_process1
+ * ---------------------------------
+ * Computes single correlation values and stores in *sum
+ */
+static void xcorr_kernel_neon_float_process1(const float *x, const float *y,
+ float *sum, int len) {
+ float32x4...
2015 Jan 20
6
[RFC PATCH v1 0/2] Encode optimize using libNE10
Hello opus-dev,
I've been cooking up this patchset to integrate NE10 library into opus.
The current patchset focuses on the encode use case, mainly affecting the
performance of clt_mdct_forward() and opus_fft() (for float only).
Glad to report the following on Encode use case:
(Measured on my Beaglebone Black Cortex-A8 board)
- Performance improvement for encode use case ~= 12.34% (Based on time -p