suyog sarda
2014-Nov-28 18:56 UTC
[LLVMdev] Horizontal ADD across single vector not profitable in SLP Vectorization
Hi all,

The following analysis concerns a horizontal add across a single vector.

Test case for AArch64:

    #include <arm_neon.h>

    unsigned hadd(uint32x4_t a) {
      return a[0] + a[1] + a[2] + a[3];
    }

Currently, we emit scalar instructions for the above code. The IR for it involves:

4 'extractelement' - to extract the elements from vector 'a'
3 'add' - to perform the additions
1 return statement

Let's say we somehow vectorize this kind of code. The IR will probably look something like this:

1. Extract a[0] and put it in vec1 <2 x i32>, index 0
2. Extract a[1] and put it in vec1 <2 x i32>, index 1
3. Extract a[2] and put it in vec2 <2 x i32>, index 0
4. Extract a[3] and put it in vec2 <2 x i32>, index 1
5. Add vec1 and vec2, sum in vec3 <2 x i32>
6. Extract vec3[0] into sum1
7. Extract vec3[1] into sum2
8. Add sum1 and sum2 into sum3
9. Return sum3

So overall we get 6 'extractelement', 4 'insertelement', 1 vector add, 1 scalar add and 1 return statement. We have vectorized the add operation, but the code is worse than its scalar form (13 instructions instead of 8), unless I am missing something.

This relates to PR 20035, where it was advised to handle an add across a single vector in the SLP vectorizer.

If my analysis is correct, a horizontal add across a single vector can never be more profitable in vectorized form. (Unless I am missing something; perhaps 'insertelement' and 'extractelement' can be bundled together into a single instruction, but I am not sure about this.)

There is an ARM vector instruction available, ADDV.4S, for addition across a single vector. If such code cannot be made profitable by vectorizing it in SLP, isn't it better to handle it in the SelectionDAG phase?

Please correct me if I am wrong, and suggest a better form of vectorized IR.

Suggestions, comments and corrections are most welcome!

--
With regards,
Suyog Sarda
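
For reference, the two instruction sequences compared in the post can be modeled in portable C. This is only a minimal sketch for counting operations: the types vec4_u32 and vec2_u32 and the function names are hypothetical stand-ins for the vector values, not NEON types or LLVM constructs, and each lane read or write stands for one 'extractelement' or 'insertelement'.

```c
#include <stdint.h>

/* Hypothetical lane-array models of <4 x i32> and <2 x i32>. */
typedef struct { uint32_t lane[4]; } vec4_u32;
typedef struct { uint32_t lane[2]; } vec2_u32;

/* Scalar form: 4 extracts from 'a' and 3 scalar adds. */
static uint32_t hadd_scalar(vec4_u32 a) {
    return a.lane[0] + a.lane[1] + a.lane[2] + a.lane[3];
}

/* 2-wide form from the post: 4 extracts and 4 inserts to build vec1 and
 * vec2, 1 vector add, 2 extracts from vec3, 1 scalar add. */
static uint32_t hadd_two_wide(vec4_u32 a) {
    vec2_u32 vec1, vec2, vec3;

    vec1.lane[0] = a.lane[0];   /* extract a[0], insert into vec1[0] */
    vec1.lane[1] = a.lane[1];   /* extract a[1], insert into vec1[1] */
    vec2.lane[0] = a.lane[2];   /* extract a[2], insert into vec2[0] */
    vec2.lane[1] = a.lane[3];   /* extract a[3], insert into vec2[1] */

    /* Models the single <2 x i32> vector add, written out lane by lane. */
    vec3.lane[0] = vec1.lane[0] + vec2.lane[0];
    vec3.lane[1] = vec1.lane[1] + vec2.lane[1];

    uint32_t sum1 = vec3.lane[0];   /* extract vec3[0] */
    uint32_t sum2 = vec3.lane[1];   /* extract vec3[1] */
    return sum1 + sum2;             /* final scalar add */
}
```

Both functions return the same value for any input (unsigned arithmetic wraps identically in either order), so the comparison is purely about the number of operations each form needs.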