This is almost ideal for SLP vectorization, except for two problems:

1. We have 4 stores to consecutive locations, but the last element is the constant zero, and not an additional SUB. At the moment we don’t have support for idempotence operations, but this is something that we should add.

2. The values that we are subtracting come from 3 loads. We usually load 4 elements from memory, or scalarize the inputs (we don’t support masked loads on AVX512).

Do you know if the GCC SLP Vectorizer vectorizes this, or is it their Loop Vectorizer?

Thanks,
Nadav

On Oct 14, 2013, at 10:09 AM, Renato Golin <renato.golin at linaro.org> wrote:

> On 14 October 2013 18:03, Nadav Rotem <nrotem at apple.com> wrote:
> > This also looks like a form of SLP vectorization.
>
> Yes. Would it be more beneficial to make it a BB-only pass? It seems that, independent of that, it would be beneficial to have pointer reduction variables.
>
> > I assume that you meant to write (*read++). Basically, we have a wide load and a wide store and some operations on ABC.
>
> yes.
>
> > Can you send the IR for this code ?
>
> Unoptimized and optimized version, with the latter being exactly what the vectorizer will see at O3 (I dumped from inside the debugger and it was identical).
>
> cheers,
> --renato
>
> <vect-pointer-test.zip>
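To make the two problems above concrete, here is a hypothetical C kernel with the shape being described: three byte loads, three subtractions, and four consecutive stores where the last one is a constant zero. This is only a sketch; it is not Renato's actual test.c (which is not shown in this thread), and the name sub3_store4 and its parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel, NOT the test case from the thread.
 * Per iteration: 3 loads, 3 SUBs, 4 consecutive stores.
 * The 4th store is a constant, so the store group is 4 wide
 * while the load/SUB group is only 3 wide. */
void sub3_store4(const uint8_t *restrict read, uint8_t *restrict write,
                 uint8_t delta, size_t iters)
{
    assert(read != NULL && write != NULL);
    for (size_t i = 0; i < iters; i++) {
        uint8_t a = *read++;
        uint8_t b = *read++;
        uint8_t c = *read++;
        *write++ = (uint8_t)(a - delta);
        *write++ = (uint8_t)(b - delta);
        *write++ = (uint8_t)(c - delta);
        *write++ = 0;            /* constant lane, no matching SUB */
    }
}
```

An SLP vectorizer looking at one iteration sees a 4-wide store bundle fed by only three isomorphic SUB trees, which is the mismatch described in points 1 and 2.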
On 14 October 2013 18:15, Nadav Rotem <nrotem at apple.com> wrote:

> 1. We have 4 stores to consecutive locations, but the last element is the
> constant zero, and not an additional SUB. At the moment we don’t have
> support for idempotence operations, but this is something that we should
> add.

The fourth write is not necessary for GCC to vectorize it (nor was it in the original code); it was a result of CReduce's attempt to converge when running ARM's GCC and inspecting the right sequence of vector instructions. (btw, CReduce is great!)

In this case, shouldn't the vector operations just add an undef to the fourth lane? Would back-ends recognize it as an AVX/NEON/AltiVec instruction, or just try to re-linearise?

> 2. The values that we are subtracting come from 3 loads. We usually load 4
> elements from memory, or scalarize the inputs (we don’t support masked
> loads on AVX512).

That is a more complicated issue, but we can get away with it if, in a first implementation, we only allow the same number of reads and writes on each loop. In that case, if the operations on the independent variables are identical, then the loop can be simplified by multiplying the induction range by N and reducing the number of load/sub/store lanes to one, in which case loop vectorization becomes trivial.

> Do you know if the GCC SLP Vectorizer vectorizes this, or is it their Loop
> Vectorizer ?

Good question. Which vectorizer does "-ftree-vectorize" turn on? Because if I use "-fno-tree-vectorize", the code remains scalar.

cheers,
--renato
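The simplification described in the reply to point 2 (multiply the induction range by N, keep a single load/sub/store per iteration) can be sketched in C as below. This is an illustration only, assuming N = 3, byte elements, and no fourth constant store; the function names are hypothetical and do not come from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hand-unrolled form: three identical load/sub/store groups per iteration. */
void sub3_unrolled(const uint8_t *restrict read, uint8_t *restrict write,
                   uint8_t delta, size_t iters)
{
    for (size_t i = 0; i < iters; i++) {
        uint8_t a = *read++;
        uint8_t b = *read++;
        uint8_t c = *read++;
        *write++ = (uint8_t)(a - delta);
        *write++ = (uint8_t)(b - delta);
        *write++ = (uint8_t)(c - delta);
    }
}

/* Re-rolled form: the trip count is multiplied by 3 and each iteration
 * has one load, one SUB and one store with unit stride. */
void sub3_rerolled(const uint8_t *restrict read, uint8_t *restrict write,
                   uint8_t delta, size_t iters)
{
    assert(iters <= SIZE_MAX / 3);   /* keep iters * 3 from overflowing */
    size_t n = iters * 3;
    for (size_t i = 0; i < n; i++)
        write[i] = (uint8_t)(read[i] - delta);
}
```

Both functions write the same bytes for the same inputs. The second form is a plain unit-stride loop, which is the easy case for a loop vectorizer; the rewrite is only valid because all three lanes apply the same operation.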
Renato, can you post the C code for the function and the assembly that GCC produces?

Your initial example could be handled well by vectorization of strided loops (and the mention of VLD3(.8?)/VST3(.8?) led me to assume that this is what happened). But the LLVM-IR you sent has a store of 0 in there ;) and strides by 4.

Thanks,
Arnold

Vectorization of strided loops:

I am using float as the example, otherwise it would get too long.

    void f(float * restrict read, float * restrict write) {
      for (int i = 0; i < 256; i++) {
        float a1 = *read++ * 3.0;
        float a2 = *read++ * 4.0;
        float a3 = *read++ * 5.0;
        *write++ = a1;
        *write++ = a2;
        *write++ = a3;
      }
    }

recognized as

    /* 256 iterations x 3 elements = 768 elements */
    for (int i = 0; i < 768; i += 3) {
      float a1 = read[i] * 3.0;
      float a2 = read[i+1] * 4.0;
      float a3 = read[i+2] * 5.0;
      write[i] = a1;
      write[i+1] = a2;
      write[i+2] = a3;
    }

=> loop vectorize with a factor of 4, recognizing that after we vector-unroll the loop by four the scattered accesses from different iterations (read[i]..read[i+9+2]) are consecutive, so we can efficiently vectorize these accesses (3 vector loads plus interleaves, which on ARM we can do with VLD3):

    for (int i = 0; i < 768; i += 12) {
      float a1 = read[i] * 3.0;
      float a1_2 = read[i+3] * 3.0;
      float a1_3 = read[i+6] * 3.0;
      float a1_4 = read[i+9] * 3.0;
      float a2 = read[i+1] * 4.0;
      float a2_2 = read[i+3+1] * 4.0;
      …
      float a3 = read[i+2] * 5.0;
      float a3_2 = read[i+3+2] * 5.0;
      write[i] = a1;
      write[i+3] = a1_2;
      …
      write[i+1] = a2;
      write[i+1+3] = a2_2;
      ...
    }

    VLD3.f32 {a1..a1_4, a2..a2_4, a3..a3_4} [read+i]
    a1..a1_4 = VMUL a1..a1_4, #3.0
    a2..a2_4 = VMUL a2..a2_4, #4.0
    a3..a3_4 = VMUL a3..a3_4, #5.0
    VST3.f32 {a1..a1_4, a2..a2_4, a3..a3_4} [write+i]
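The strided scheme above can be emulated in portable C to check that it computes the same thing as the scalar loop. The sketch below is an assumption-laden illustration, not NEON code: plain arrays stand in for the vector registers, the de-interleaving loop plays the role of VLD3, the interleaving loop plays the role of VST3, and the function names and the LANES/STRIDE constants are invented here.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 4, STRIDE = 3 };

/* Scalar reference: each group of three floats is scaled by 3, 4 and 5. */
void scale3_scalar(const float *restrict read, float *restrict write,
                   size_t groups)
{
    for (size_t g = 0; g < groups; g++) {
        write[STRIDE * g + 0] = read[STRIDE * g + 0] * 3.0f;
        write[STRIDE * g + 1] = read[STRIDE * g + 1] * 4.0f;
        write[STRIDE * g + 2] = read[STRIDE * g + 2] * 5.0f;
    }
}

/* Emulated vector form: process LANES groups (12 floats) per step. */
void scale3_strided(const float *restrict read, float *restrict write,
                    size_t groups)
{
    assert(groups % LANES == 0);   /* this sketch has no remainder loop */
    for (size_t g = 0; g < groups; g += LANES) {
        const float *src = read + STRIDE * g;
        float *dst = write + STRIDE * g;
        float a1[LANES], a2[LANES], a3[LANES];

        /* De-interleaving load: the role VLD3 plays on ARM. */
        for (int l = 0; l < LANES; l++) {
            a1[l] = src[STRIDE * l + 0];
            a2[l] = src[STRIDE * l + 1];
            a3[l] = src[STRIDE * l + 2];
        }
        /* One multiply per "register", all lanes at once (VMUL). */
        for (int l = 0; l < LANES; l++) {
            a1[l] *= 3.0f;
            a2[l] *= 4.0f;
            a3[l] *= 5.0f;
        }
        /* Interleaving store: the role VST3 plays on ARM. */
        for (int l = 0; l < LANES; l++) {
            dst[STRIDE * l + 0] = a1[l];
            dst[STRIDE * l + 1] = a2[l];
            dst[STRIDE * l + 2] = a3[l];
        }
    }
}
```

The point of the emulation is that the 12 floats touched per step are contiguous in memory even though each "register" gathers every third element, which is why three wide loads plus a shuffle are enough.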
On 14 October 2013 19:31, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:

> Renato, can you post the c code for the function and the assembly that gcc
> produces?

Attached.

> Your initial example could be well handled by vectorization of strided
> loops (and the mentioning of VLD3(.8?)/VST3(.8?) lead me to assume that
> this is what happened). But the LLVM-IR you sent has a store of 0 in there
> ;) and strides by 4.

I think so. Ignore the last write, it was bogus. (but don't ignore the fact that GCC vectorized it anyway with vst4!).

By running GCC with -ftree-vectorizer-verbose=1 I got:

test.c:11: note: create runtime check for data references DELTA and *WRITE_30
test.c:11: note: create runtime check for data references *READ_29 and *WRITE_30
test.c:11: note: created 2 versioning for alias checks.
test.c:11: note: === vect_do_peeling_for_loop_bound ===
Setting upper bound of nb iterations for epilogue loop to 14
test.c:11: note: LOOP VECTORIZED.

The result is a very concise and very dense code:

    vld1.8  {d28[], d29[]}, [r5]
    vld3.8  {d16, d18, d20}, [r9]!
    vld3.8  {d17, d19, d21}, [r9]
    vmvn    q3, q8
    vmvn    q15, q9
    vmvn    q8, q10
    vsub.i8 q11, q3, q14
    vsub.i8 q12, q15, q14
    vsub.i8 q13, q8, q14
    vst3.8  {d22, d24, d26}, [r8]!
    vst3.8  {d23, d25, d27}, [r8]

cheers,
--renato

Attachment: test.c (text/x-csrc, 398 bytes)
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131014/a05ed9f0/attachment.c>
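The "create runtime check for data references" and "versioning for alias checks" notes in the GCC output refer to loop versioning: the compiler emits a fast (vectorizable) copy of the loop guarded by an overlap test, plus a conservative scalar copy. A rough C sketch of that idea follows. It is written by hand under assumptions: the subtraction is a placeholder for whatever test.c computes, the two overlap tests only mirror the two pairs named in the log (DELTA vs. WRITE, READ vs. WRITE), and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when [p, p+np) and [q, q+nq) share no byte. */
static bool ranges_disjoint(const void *p, size_t np,
                            const void *q, size_t nq)
{
    uintptr_t a = (uintptr_t)p, b = (uintptr_t)q;
    return a + np <= b || b + nq <= a;
}

/* Returns true when the guarded fast path was taken. */
bool sub_delta_versioned(const uint8_t *read, uint8_t *write,
                         const uint8_t *delta, size_t n)
{
    assert(read != NULL && write != NULL && delta != NULL);

    if (ranges_disjoint(delta, 1, write, n) &&
        ranges_disjoint(read, n, write, n)) {
        /* No overlap: delta can be loaded once and iterations are
         * independent, so this copy is the one worth vectorizing. */
        uint8_t d = *delta;
        for (size_t i = 0; i < n; i++)
            write[i] = (uint8_t)(read[i] - d);
        return true;
    }

    /* Possible overlap: keep strict scalar order and reload delta
     * every iteration, since a store may have changed it. */
    for (size_t i = 0; i < n; i++)
        write[i] = (uint8_t)(read[i] - *delta);
    return false;
}
```

When the pointers are declared restrict in the source (as in Arnold's float example), the compiler is allowed to assume there is no overlap, so such runtime checks would not be needed.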
Hi Renato,

As far as I know, -ftree-vectorize enables both loop vectorization and SLP vectorization. -ftree-slp-vectorize enables SLP vectorization on its own, but it is also turned on automatically by -ftree-vectorize.

-Yi