2014 Dec 07
[LLVMdev] NEON intrinsics preventing redundant load optimization?
...r* (vec4& a, vec4& b)
{
    vec4 result;
    for (int i = 0; i < 4; ++i)
        result.data[i] = a.data[i] * b.data[i];
    return result;
}

void TestVec4Multiply(vec4& a, vec4& b, vec4& result)
{
    result = a * b;
}
With -O3 the loop gets vectorized and the code generated looks optimal:
__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
vld1.32 {d16, d17}, [r1]
vld1.32 {d18, d19}, [r0]
vmul.f32 q8, q9, q8
vst1.32 {d16, d17}, [r2]
bx lr
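
For anyone wanting to reproduce the listing above, here is a minimal self-contained sketch of the scalar version. The excerpt cuts off the definition of vec4 and the start of the operator* signature, so the struct layout (four floats, guessed from the vmul.f32 in the output) and the vec4 return type are assumptions, and CheckScalarMultiply is a hypothetical helper that is not part of the original post.

```cpp
#include <cassert>

// Assumed layout: the excerpt does not show vec4's definition. Four floats
// is a guess based on the single-precision vmul.f32 in the generated code.
struct vec4 {
    float data[4];
};

// Scalar element-wise multiply, following the loop shown in the post.
// The return type is assumed; the excerpt truncates the signature.
vec4 operator*(vec4& a, vec4& b)
{
    vec4 result;
    for (int i = 0; i < 4; ++i)
        result.data[i] = a.data[i] * b.data[i];
    return result;
}

void TestVec4Multiply(vec4& a, vec4& b, vec4& result)
{
    result = a * b;
}

// Hypothetical helper (not in the post): checks the element-wise products.
void CheckScalarMultiply()
{
    vec4 a = {{1.0f, 2.0f, 3.0f, 4.0f}};
    vec4 b = {{5.0f, 6.0f, 7.0f, 8.0f}};
    vec4 r = {{0.0f, 0.0f, 0.0f, 0.0f}};
    TestVec4Multiply(a, b, r);
    assert(r.data[0] == 5.0f && r.data[1] == 12.0f);
    assert(r.data[2] == 21.0f && r.data[3] == 32.0f);
}
```

Compiling this with -O3 and -S for an ARM target with NEON enabled should let you compare against the listing; the exact target flags used in the post are not shown, so the output may differ.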
However, if I replace the operator* with a NEON intrinsic implementation (I know the vectorizer figured out optimal code in this case anyway, but that wasn't true for my real situa...
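
The intrinsic implementation itself is cut off in this excerpt, so the following is only a sketch of what such a replacement could look like, not the poster's code. It assumes vec4 wraps four floats and uses the standard arm_neon.h intrinsics vld1q_f32, vmulq_f32 and vst1q_f32. The scalar fallback and the CheckIntrinsicMultiply helper are hypothetical additions so the example also builds and runs on non-ARM hosts.

```cpp
#include <cassert>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Assumed layout: the excerpt does not show vec4's definition.
struct vec4 {
    float data[4];
};

// Element-wise multiply written with NEON intrinsics where available.
// This is an illustrative guess at the shape of the replacement; the
// actual implementation from the post is truncated.
vec4 operator*(vec4& a, vec4& b)
{
    vec4 result;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t va = vld1q_f32(a.data);         // load four floats from a
    float32x4_t vb = vld1q_f32(b.data);         // load four floats from b
    vst1q_f32(result.data, vmulq_f32(va, vb));  // multiply lanes and store
#else
    // Hypothetical fallback so the sketch compiles without NEON.
    for (int i = 0; i < 4; ++i)
        result.data[i] = a.data[i] * b.data[i];
#endif
    return result;
}

void TestVec4Multiply(vec4& a, vec4& b, vec4& result)
{
    result = a * b;
}

// Hypothetical helper (not in the post): checks the element-wise products.
void CheckIntrinsicMultiply()
{
    vec4 a = {{1.0f, 2.0f, 3.0f, 4.0f}};
    vec4 b = {{5.0f, 6.0f, 7.0f, 8.0f}};
    vec4 r = {{0.0f, 0.0f, 0.0f, 0.0f}};
    TestVec4Multiply(a, b, r);
    assert(r.data[0] == 5.0f && r.data[1] == 12.0f);
    assert(r.data[2] == 21.0f && r.data[3] == 32.0f);
}
```

Both paths compute the same values, so any difference from the scalar version would show up only in the generated assembly, which is what the thread title is asking about.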