thr3ads.net - llvm dev - [LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info [Jan 2014]

If this information is useful, please help other people find it:
Share via:

Diego Novillo

2014-Jan-16 16:16 UTC

[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

On Wed, Jan 15, 2014 at 5:30 PM, Nadav Rotem <nrotem at apple.com> wrote:
> Was the vectorizer successful in unrolling the loop in quantum_sigma_x?  I
> wonder if 'size’ is typically high or low.
No. The vectorizer stated that it wasn't going to bother with the loop
because it wasn't profitable. Specifically:

LV: Checking a loop in "quantum_sigma_x"
LV: Found a loop: for.body
LV: Found an induction variable.
LV: Found a write-only loop!
LV: We can vectorize this loop!
LV: Found trip count: 0
LV: The Widest type: 64 bits.
LV: The Widest register is: 128 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction:
%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
%for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %state
getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
%indvars.iv, i32 1, !dbg !58
LV: Found an estimated cost of 1 for VF 1 For instruction:   %3 = load
i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 1 For instruction:   %xor xor i64 %3,
%shl, !dbg !58
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
%xor, i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 1 For instruction:
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
LV: Found an estimated cost of 0 for VF 1 For instruction:   %4 trunc i64
%indvars.iv.next to i32, !dbg !52
LV: Found an estimated cost of 1 for VF 1 For instruction:   %cmp icmp slt i32
%4, %1, !dbg !52
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1
%cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
LV: Scalar loop costs: 5.
LV: Found an estimated cost of 0 for VF 2 For instruction:
%indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
%for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %state
getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
%indvars.iv, i32 1, !dbg !58
LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load
i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 2 For instruction:   %xor xor i64 %3,
%shl, !dbg !58
LV: Found an estimated cost of 6 for VF 2 For instruction:   store i64
%xor, i64* %state, align 8, !dbg !58, !tbaa !61
LV: Found an estimated cost of 1 for VF 2 For instruction:
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
LV: Found an estimated cost of 0 for VF 2 For instruction:   %4 trunc i64
%indvars.iv.next to i32, !dbg !52
LV: Found an estimated cost of 1 for VF 2 For instruction:   %cmp icmp slt i32
%4, %1, !dbg !52
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1
%cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
LV: Vector loop of width 2 costs: 7.
LV: Selecting VF = : 1.
LV: The target has 16 vector registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #5 Interval # 2
LV(REG): At #6 Interval # 2
LV(REG): At #7 Interval # 2
LV(REG): Found max usage: 3
LV(REG): Found invariant usage: 3
LV(REG): LoopSize: 9
LV: Found a vectorizable loop (1) in gates.ll
LV: Unroll Factor is 1
LV: Vectorization is possible but not beneficial.

I poked briefly at the vectorizer code to see if there is anything
that the profile data could've told it, but this loop did not meet the
requirements for unrolling. And even if it did, the trip count is not
constant and the unroll factor used by the vectorizer is pretty low.
So, even if we vectorized it (or parts of it) I don't think the
speedup would be significant.

What really helps this loop is to peel it a few times and do the
remaining iterations in the loop.

Diego.

Nadav Rotem

2014-Jan-16 17:26 UTC

head link

[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

Hi Diego, 

It looks like the problem is with the code in the vectorizer that tries to
estimate the most profitable vectorization factor:
> LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load
> i64* %state, align 8, !dbg !58, !tbaa !61

It looks like a cost model problem.  The vectorizer thinks that loading %3
(above) is non consecutive and would require scatter/gather.  Is that correct? I
wonder that SCEV is reporting. Is there an index overflow problem that is
preventing us from loading consecutive elements?

Thanks,
Nadav


On Jan 16, 2014, at 8:16 AM, Diego Novillo <dnovillo at google.com> wrote:
> On Wed, Jan 15, 2014 at 5:30 PM, Nadav Rotem <nrotem at apple.com>
wrote:
> 
>> Was the vectorizer successful in unrolling the loop in quantum_sigma_x?
I
>> wonder if 'size’ is typically high or low.
> 
> No. The vectorizer stated that it wasn't going to bother with the loop
> because it wasn't profitable. Specifically:
> 
> LV: Checking a loop in "quantum_sigma_x"
> LV: Found a loop: for.body
> LV: Found an induction variable.
> LV: Found a write-only loop!
> LV: We can vectorize this loop!
> LV: Found trip count: 0
> LV: The Widest type: 64 bits.
> LV: Found an estimated cost of 0 for VF 1 For instruction:
> %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
> %for.body ]
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %state >
getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
> %indvars.iv, i32 1, !dbg !58
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %3 = load
> i64* %state, align 8, !dbg !58, !tbaa !61
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %xor > xor
i64 %3, %shl, !dbg !58
> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
> %xor, i64* %state, align 8, !dbg !58, !tbaa !61
> LV: Found an estimated cost of 1 for VF 1 For instruction:
> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %4 > trunc
i64 %indvars.iv.next to i32, !dbg !52
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %cmp > icmp
slt i32 %4, %1, !dbg !52
> LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1
> %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
> LV: Scalar loop costs: 5.
> LV: Found an estimated cost of 0 for VF 2 For instruction:
> %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
> %for.body ]
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %state >
getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
> %indvars.iv, i32 1, !dbg !58
> LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load
> i64* %state, align 8, !dbg !58, !tbaa !61
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %xor > xor
i64 %3, %shl, !dbg !58
> LV: Found an estimated cost of 6 for VF 2 For instruction:   store i64
> %xor, i64* %state, align 8, !dbg !58, !tbaa !61
> LV: Found an estimated cost of 1 for VF 2 For instruction:
> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %4 > trunc
i64 %indvars.iv.next to i32, !dbg !52
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %cmp > icmp
slt i32 %4, %1, !dbg !52
> LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1
> %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
> LV: Vector loop of width 2 costs: 7.
> LV: Selecting VF = : 1.
> LV: The target has 16 vector registers
> LV(REG): Calculating max register usage:
> LV(REG): At #0 Interval # 0
> LV(REG): At #1 Interval # 1
> LV(REG): At #2 Interval # 2
> LV(REG): At #3 Interval # 3
> LV(REG): At #5 Interval # 2
> LV(REG): At #6 Interval # 2
> LV(REG): At #7 Interval # 2
> LV(REG): Found max usage: 3
> LV(REG): Found invariant usage: 3
> LV(REG): LoopSize: 9
> LV: Found a vectorizable loop (1) in gates.ll
> LV: Unroll Factor is 1
> LV: Vectorization is possible but not beneficial.
> 
> 
> I poked briefly at the vectorizer code to see if there is anything
> that the profile data could've told it, but this loop did not meet the
> requirements for unrolling. And even if it did, the trip count is not
> constant and the unroll factor used by the vectorizer is pretty low.
> So, even if we vectorized it (or parts of it) I don't think the
> speedup would be significant.
> 
> What really helps this loop is to peel it a few times and do the
> remaining iterations in the loop.
> 
> 
> Diego.

Nadav Rotem

2014-Jan-16 17:29 UTC

head link

[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

On Jan 16, 2014, at 9:26 AM, Nadav Rotem <nrotem at apple.com> wrote:
> Hi Diego, 
> 
> It looks like the problem is with the code in the vectorizer that tries to
estimate the most profitable vectorization factor:
> 
>> LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load
>> i64* %state, align 8, !dbg !58, !tbaa !61
> 
> 
> It looks like a cost model problem.  The vectorizer thinks that loading %3
(above) is non consecutive and would require scatter/gather.  Is that correct? I
wonder that SCEV is reporting. Is there an index overflow problem that is
preventing us from loading consecutive elements?
> 
Oh, yes, yes.

reg->node[i].state ^=  ...

We are accessing a struct with multiple members. This would require strided
access because we need to skip the other members of the struct.

Thanks,
Nadav 

> 
> On Jan 16, 2014, at 8:16 AM, Diego Novillo <dnovillo at google.com>
wrote:
> 
>> On Wed, Jan 15, 2014 at 5:30 PM, Nadav Rotem <nrotem at
apple.com> wrote:
>> 
>>> Was the vectorizer successful in unrolling the loop in
quantum_sigma_x?  I
>>> wonder if 'size’ is typically high or low.
>> 
>> No. The vectorizer stated that it wasn't going to bother with the
loop
>> because it wasn't profitable. Specifically:
>> 
>> LV: Checking a loop in "quantum_sigma_x"
>> LV: Found a loop: for.body
>> LV: Found an induction variable.
>> LV: Found a write-only loop!
>> LV: We can vectorize this loop!
>> LV: Found trip count: 0
>> LV: The Widest type: 64 bits.
>> LV: Found an estimated cost of 0 for VF 1 For instruction:
>> %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
>> %for.body ]
>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %state
>> getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
>> %indvars.iv, i32 1, !dbg !58
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %3 = load
>> i64* %state, align 8, !dbg !58, !tbaa !61
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %xor
>> xor i64 %3, %shl, !dbg !58
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
>> %xor, i64* %state, align 8, !dbg !58, !tbaa !61
>> LV: Found an estimated cost of 1 for VF 1 For instruction:
>> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
>> LV: Found an estimated cost of 0 for VF 1 For instruction:   %4
>> trunc i64 %indvars.iv.next to i32, !dbg !52
>> LV: Found an estimated cost of 1 for VF 1 For instruction:   %cmp
>> icmp slt i32 %4, %1, !dbg !52
>> LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1
>> %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
>> LV: Scalar loop costs: 5.
>> LV: Found an estimated cost of 0 for VF 2 For instruction:
>> %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next,
>> %for.body ]
>> LV: Found an estimated cost of 0 for VF 2 For instruction:   %state
>> getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64
>> %indvars.iv, i32 1, !dbg !58
>> LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load
>> i64* %state, align 8, !dbg !58, !tbaa !61
>> LV: Found an estimated cost of 1 for VF 2 For instruction:   %xor
>> xor i64 %3, %shl, !dbg !58
>> LV: Found an estimated cost of 6 for VF 2 For instruction:   store i64
>> %xor, i64* %state, align 8, !dbg !58, !tbaa !61
>> LV: Found an estimated cost of 1 for VF 2 For instruction:
>> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52
>> LV: Found an estimated cost of 0 for VF 2 For instruction:   %4
>> trunc i64 %indvars.iv.next to i32, !dbg !52
>> LV: Found an estimated cost of 1 for VF 2 For instruction:   %cmp
>> icmp slt i32 %4, %1, !dbg !52
>> LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1
>> %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57
>> LV: Vector loop of width 2 costs: 7.
>> LV: Selecting VF = : 1.
>> LV: The target has 16 vector registers
>> LV(REG): Calculating max register usage:
>> LV(REG): At #0 Interval # 0
>> LV(REG): At #1 Interval # 1
>> LV(REG): At #2 Interval # 2
>> LV(REG): At #3 Interval # 3
>> LV(REG): At #5 Interval # 2
>> LV(REG): At #6 Interval # 2
>> LV(REG): At #7 Interval # 2
>> LV(REG): Found max usage: 3
>> LV(REG): Found invariant usage: 3
>> LV(REG): LoopSize: 9
>> LV: Found a vectorizable loop (1) in gates.ll
>> LV: Unroll Factor is 1
>> LV: Vectorization is possible but not beneficial.
>> 
>> 
>> I poked briefly at the vectorizer code to see if there is anything
>> that the profile data could've told it, but this loop did not meet
the
>> requirements for unrolling. And even if it did, the trip count is not
>> constant and the unroll factor used by the vectorizer is pretty low.
>> So, even if we vectorized it (or parts of it) I don't think the
>> speedup would be significant.
>> 
>> What really helps this loop is to peel it a few times and do the
>> remaining iterations in the loop.
>> 
>> 
>> Diego.
>

Diego Novillo

2014-Jan-16 18:02 UTC

head link

[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

On Thu, Jan 16, 2014 at 9:26 AM, Nadav Rotem <nrotem at apple.com>
wrote:> Hi Diego,
>
> It looks like the problem is with the code in the vectorizer that tries to
estimate the most profitable vectorization factor:
>
>> LV: Found an estimated cost of 6 for VF 2 For instruction:   %3 = load
>> i64* %state, align 8, !dbg !58, !tbaa !61
>
>
> It looks like a cost model problem.  The vectorizer thinks that loading %3
(above) is non consecutive and would require scatter/gather.  Is that correct? I
wonder that SCEV is reporting. Is there an index overflow problem that is
preventing us from loading consecutive elements?
Yes, I forgot to mention that. The access is non-consecutive:

      for(i=0; i<reg->size; i++)
         {
             /* Flip the target bit of each basis state */
             reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target);
}

The code is writing to the 'state' field. The data structures look like
this:

typedef struct quantum_matrix_struct quantum_matrix;

struct quantum_reg_node_struct
{
  COMPLEX_FLOAT amplitude; /* alpha_j */
  MAX_UNSIGNED state;      /* j */
};

typedef struct quantum_reg_node_struct quantum_reg_node;

/* The quantum register */

struct quantum_reg_struct
{
  int width;    /* number of qubits in the qureg */
  int size;     /* number of non-zero vectors */
  int hashw;    /* width of the hash array */
  quantum_reg_node *node;
  int *hash;
};

If you do the trick of writing to a separate array, then the loop can
be vectorized.


Diego.

llvm dev - Jan 2014 - [LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info

[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info