search for: __z16testvec4multiplyr4vec4s0_s0_

Displaying 1 result from an estimated 1 matches for "__z16testvec4multiplyr4vec4s0_s0_".

2014 Dec 07
3
[LLVMdev] NEON intrinsics preventing redundant load optimization?
...r* (vec4& a, vec4& b) { vec4 result; for(int i = 0; i < 4; ++i) result.data[i] = a.data[i] * b.data[i]; return result; } void TestVec4Multiply(vec4& a, vec4& b, vec4& result) { result = a * b; } With -O3 the loop gets vectorized and the code generated looks optimal: __Z16TestVec4MultiplyR4vec4S0_S0_: @ BB#0: vld1.32 {d16, d17}, [r1] vld1.32 {d18, d19}, [r0] vmul.f32 q8, q9, q8 vst1.32 {d16, d17}, [r2] bx lr However if I replace the operator* with a NEON intrinsic implementation (I know the vectorizer figured out optimal code in this case anyway, but that wasn't true for my real situa...