thr3ads.net - llvm dev - [LLVMdev] Vectorization of pointer PHI nodes [Oct 2013]


Nadav Rotem

2013-Oct-14 17:15 UTC

[LLVMdev] Vectorization of pointer PHI nodes

This is almost ideal for SLP vectorization, except for two problems:

1. We have 4 stores to consecutive locations, but the last element is the
constant zero, and not an additional SUB.   At the moment we don’t have support
for idempotence operations, but this is something that we should add.

2. The values that we are subtracting come from 3 loads.  We usually load 4
elements from memory, or scalarize the inputs (we don’t support masked loads on
AVX512).

Do you know if the GCC SLP Vectorizer vectorizes this, or is it their Loop
Vectorizer ?

Thanks,
Nadav 

On Oct 14, 2013, at 10:09 AM, Renato Golin <renato.golin at linaro.org> wrote:
> On 14 October 2013 18:03, Nadav Rotem <nrotem at apple.com> wrote:
>> This also looks like a form of SLP vectorization.
>
> Yes. Would it be more beneficial to make it a BB-only pass? It seems that,
> independent of that, it would be beneficial to have pointer reduction
> variables.
>
>> I assume that you meant to write (*read++). Basically, we have a wide load
>> and a wide store and some operations on ABC.
>
> yes.
>
>> Can you send the IR for this code ?
>
> Unoptimized and optimized version, with the latter being exactly what the
> vectorizer will see at O3 (I dumped from inside the debugger and it was
> identical).
>
> cheers,
> --renato
>
> <vect-pointer-test.zip>

Renato Golin

2013-Oct-14 18:28 UTC


On 14 October 2013 18:15, Nadav Rotem <nrotem at apple.com> wrote:
> 1. We have 4 stores to consecutive locations, but the last element is the
> constant zero, and not an additional SUB.   At the moment we don’t have
> support for idempotence operations, but this is something that we should
> add.
>
The fourth write is not necessary for GCC to vectorize it (nor was in the
original code), but it was a result of CReduce's attempt to converge when
running ARM's GCC and inspecting the right sequence of vector instructions.
(btw, CReduce is great!).

In this case, shouldn't the vector operation just put an undef in the
fourth lane? Would back-ends recognize it as an AVX/NEON/AltiVec
instruction, or just try to re-linearise?


> 2. The values that we are subtracting come from 3 loads.  We usually load
> 4 elements from memory, or scalarize the inputs (we don’t support masked
> loads on AVX512).
>
That is a more complicated issue, but we can get away with it if, in a
first implementation, we only allow the same number of reads and writes in
each loop. In that case, if the operations on the independent variables are
identical, then the loop can be simplified by multiplying the
induction range by N and reducing the number of load/sub/store lanes to
one, at which point loop vectorization becomes trivial.
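
The simplification described above can be sketched in C. This is only an illustration, not the code from the attachment: the function names, the `delta` parameter and the trip count of 256 are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { N = 3, ITERS = 256 };  /* assumed lane count and trip count */

/* Original shape: N pointer bumps per iteration, same operation on each. */
static void sub_interleaved(const uint8_t *read, uint8_t *write, uint8_t delta) {
    for (int i = 0; i < ITERS; i++) {
        *write++ = (uint8_t)(*read++ - delta);
        *write++ = (uint8_t)(*read++ - delta);
        *write++ = (uint8_t)(*read++ - delta);
    }
}

/* Collapsed shape: one lane, induction range multiplied by N. */
static void sub_collapsed(const uint8_t *read, uint8_t *write, uint8_t delta) {
    for (int i = 0; i < ITERS * N; i++)
        write[i] = (uint8_t)(read[i] - delta);
}

/* Returns 1 when both shapes produce byte-identical output. */
static int shapes_agree(void) {
    uint8_t in[ITERS * N], a[ITERS * N], b[ITERS * N];
    for (int i = 0; i < ITERS * N; i++)
        in[i] = (uint8_t)(i * 7 + 1);
    sub_interleaved(in, a, 5);
    sub_collapsed(in, b, 5);
    return memcmp(a, b, sizeof a) == 0;
}
```

The collapsed loop is a plain unit-stride loop over `ITERS * N` elements. The rewrite is only valid when every lane applies the same operation, which is the restriction proposed above.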


> Do you know if the GCC SLP Vectorizer vectorizes this, or is it their
> Loop Vectorizer ?
>
Good question. Which vectorizer does "-ftree-vectorize" turn on?
Because if I use "-fno-tree-vectorize", the code remains scalar.

cheers,
--renato

Arnold Schwaighofer

2013-Oct-14 18:31 UTC


Renato, can you post the c code for the function and the assembly that gcc
produces?

Your initial example could be handled well by vectorization of strided loops
(and the mention of VLD3(.8?)/VST3(.8?) led me to assume that this is what
happened). But the LLVM-IR you sent has a store of 0 in there ;) and strides by
4.


Thanks,
Arnold


Vectorization of strided loops:

I am using float as the example; otherwise it would get too long.

void f(float * restrict read, float * restrict write) {
  for (int i = 0; i < 256; i++) {
    float a1 = *read++ * 3.0;
    float a2 = *read++ * 4.0;
    float a3 = *read++ * 5.0;

    *write++ = a1;
    *write++ = a2;
    *write++ = a3;
  }
}


recognized as

  /* 256 iterations x 3 elements each, so i runs up to 3 * 256 */
  for (int i = 0; i < 3 * 256; i += 3) {
    float a1 = read[i] * 3.0;
    float a2 = read[i+1] * 4.0;
    float a3 = read[i+2] * 5.0;

    write[i] = a1;
    write[i+1] = a2;
    write[i+2] = a3;
  }

=> loop vectorize with a factor of 4, recognizing that after we vector-unroll
the loop by four, the strided accesses from different iterations
(read[i]..read[i+9+2]) are consecutive and can be vectorized efficiently
(3 vector loads plus interleaves, which on ARM we can do with VLD3.8):

  for (int i = 0; i < 3 * 256; i += 12) {
    float a1 = read[i] * 3.0;
    float a1_2 = read[i+3] * 3.0;
    float a1_3 = read[i+6] * 3.0;
    float a1_4 = read[i+9] * 3.0;

    float a2 = read[i+1] * 4.0;
    float a2_2 = read[i+3+1] * 4.0;
    …

    float a3 = read[i+2] * 5.0;
    float a3_2 = read[i+3+2] * 5.0;

    write[i] = a1;
    write[i+3] = a1_2;
    …

    write[i+1] = a2;
    write[i+1+3] = a2_2;
    ...
  }


 VLD3.f32 {a1..a1_4, a2..a2_4, a3..a3_4} [read+i]
 a1..a1_4 = VMUL a1..a1_4, #3.0
 a2..a2_4 = VMUL a2..a2_4, #4.0
 a3..a3_4 = VMUL a3..a3_4, #5.0
 VST3.f32 {a1..a1_4, a2..a2_4, a3..a3_4} [write+i]




Renato Golin

2013-Oct-14 18:53 UTC


On 14 October 2013 19:31, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
> Renato, can you post the c code for the function and the assembly that gcc
> produces?
>
Attached.


> Your initial example could be handled well by vectorization of strided
> loops (and the mention of VLD3(.8?)/VST3(.8?) led me to assume that
> this is what happened). But the LLVM-IR you sent has a store of 0 in there
> ;) and strides by 4.
>
I think so. Ignore the last write, it was bogus. (but don't ignore the fact
that GCC vectorized it anyway with vst4!).

By running GCC with -ftree-vectorizer-verbose=1 I got:

test.c:11: note: create runtime check for data references DELTA and *WRITE_30
test.c:11: note: create runtime check for data references *READ_29 and *WRITE_30
test.c:11: note: created 2 versioning for alias checks.
test.c:11: note: === vect_do_peeling_for_loop_bound ===Setting upper bound of nb iterations for epilogue loop to 14
test.c:11: note: LOOP VECTORIZED.

The result is very concise, dense code:

vld1.8 {d28[], d29[]}, [r5]
vld3.8 {d16, d18, d20}, [r9]!
vld3.8 {d17, d19, d21}, [r9]
vmvn  q3, q8
vmvn  q15, q9
vmvn  q8, q10
vsub.i8 q11, q3, q14
vsub.i8 q12, q15, q14
vsub.i8 q13, q8, q14
vst3.8 {d22, d24, d26}, [r8]!
vst3.8 {d23, d25, d27}, [r8]
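
For readers who do not read NEON fluently, one way to interpret the sequence above in scalar C is sketched below. This is a reading of the assembly only, not the contents of test.c: the function names are made up, and passing `delta` by value is an assumption. Each lane is a bitwise NOT (vmvn) followed by a byte subtract of a broadcast value (vsub.i8), applied to three interleaved byte streams.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of the NEON sequence: vmvn (bitwise NOT), then vsub.i8. */
static uint8_t lane_op(uint8_t x, uint8_t delta) {
    return (uint8_t)((uint8_t)~x - delta);  /* wraps modulo 256, like .i8 */
}

/* Hypothetical scalar loop over n three-byte groups. */
static void invert_sub(const uint8_t *read, uint8_t *write,
                       uint8_t delta, int n) {
    for (int i = 0; i < n; i++) {
        *write++ = lane_op(*read++, delta);
        *write++ = lane_op(*read++, delta);
        *write++ = lane_op(*read++, delta);
    }
}
```

Since `~x` on a byte equals `255 - x`, a source expression such as `255 - *read++ - delta` would be one plausible origin for the vmvn/vsub pair, but that is a guess.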

cheers,
--renato
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.c
Type: text/x-csrc
Size: 398 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20131014/a05ed9f0/attachment.c>

Yi Jiang

2013-Oct-14 19:02 UTC


Hi Renato,

As far as I know, -ftree-vectorize enables both loop vectorization and SLP
vectorization. -ftree-slp-vectorize controls SLP vectorization on its own, but
it is enabled automatically by -ftree-vectorize.

-Yi 

