Diego Novillo
2014-Jan-16 16:16 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Wed, Jan 15, 2014 at 5:30 PM, Nadav Rotem <nrotem at apple.com> wrote:> Was the vectorizer successful in unrolling the loop in quantum_sigma_x? I > wonder if 'size’ is typically high or low.No. The vectorizer stated that it wasn't going to bother with the loop because it wasn't profitable. Specifically: LV: Checking a loop in "quantum_sigma_x" LV: Found a loop: for.body LV: Found an induction variable. LV: Found a write-only loop! LV: We can vectorize this loop! LV: Found trip count: 0 LV: The Widest type: 64 bits. LV: The Widest register is: 128 bits. LV: Found an estimated cost of 0 for VF 1 For instruction: %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ] LV: Found an estimated cost of 0 for VF 1 For instruction: %state getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64 %indvars.iv, i32 1, !dbg !58 LV: Found an estimated cost of 1 for VF 1 For instruction: %3 = load i64* %state, align 8, !dbg !58, !tbaa !61 LV: Found an estimated cost of 1 for VF 1 For instruction: %xor xor i64 %3, %shl, !dbg !58 LV: Found an estimated cost of 1 for VF 1 For instruction: store i64 %xor, i64* %state, align 8, !dbg !58, !tbaa !61 LV: Found an estimated cost of 1 for VF 1 For instruction: %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52 LV: Found an estimated cost of 0 for VF 1 For instruction: %4 trunc i64 %indvars.iv.next to i32, !dbg !52 LV: Found an estimated cost of 1 for VF 1 For instruction: %cmp icmp slt i32 %4, %1, !dbg !52 LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57 LV: Scalar loop costs: 5. LV: Found an estimated cost of 0 for VF 2 For instruction: %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, %for.body ] LV: Found an estimated cost of 0 for VF 2 For instruction: %state getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64 %indvars.iv, i32 1, !dbg !58 LV: Found an estimated cost of 6 for VF 2 For instruction: %3 = load i64* %state, align 8, !dbg !58, !tbaa !61 LV: Found an estimated cost of 1 for VF 2 For instruction: %xor xor i64 %3, %shl, !dbg !58 LV: Found an estimated cost of 6 for VF 2 For instruction: store i64 %xor, i64* %state, align 8, !dbg !58, !tbaa !61 LV: Found an estimated cost of 1 for VF 2 For instruction: %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52 LV: Found an estimated cost of 0 for VF 2 For instruction: %4 trunc i64 %indvars.iv.next to i32, !dbg !52 LV: Found an estimated cost of 1 for VF 2 For instruction: %cmp icmp slt i32 %4, %1, !dbg !52 LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57 LV: Vector loop of width 2 costs: 7. LV: Selecting VF = : 1. LV: The target has 16 vector registers LV(REG): Calculating max register usage: LV(REG): At #0 Interval # 0 LV(REG): At #1 Interval # 1 LV(REG): At #2 Interval # 2 LV(REG): At #3 Interval # 3 LV(REG): At #5 Interval # 2 LV(REG): At #6 Interval # 2 LV(REG): At #7 Interval # 2 LV(REG): Found max usage: 3 LV(REG): Found invariant usage: 3 LV(REG): LoopSize: 9 LV: Found a vectorizable loop (1) in gates.ll LV: Unroll Factor is 1 LV: Vectorization is possible but not beneficial. I poked briefly at the vectorizer code to see if there is anything that the profile data could've told it, but this loop did not meet the requirements for unrolling. And even if it did, the trip count is not constant and the unroll factor used by the vectorizer is pretty low. So, even if we vectorized it (or parts of it) I don't think the speedup would be significant. What really helps this loop is to peel it a few times and do the remaining iterations in the loop. Diego.
Nadav Rotem
2014-Jan-16 17:26 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
Hi Diego, It looks like the problem is with the code in the vectorizer that tries to estimate the most profitable vectorization factor:> LV: Found an estimated cost of 6 for VF 2 For instruction: %3 = load > i64* %state, align 8, !dbg !58, !tbaa !61It looks like a cost model problem. The vectorizer thinks that loading %3 (above) is non consecutive and would require scatter/gather. Is that correct? I wonder that SCEV is reporting. Is there an index overflow problem that is preventing us from loading consecutive elements? Thanks, Nadav On Jan 16, 2014, at 8:16 AM, Diego Novillo <dnovillo at google.com> wrote:> On Wed, Jan 15, 2014 at 5:30 PM, Nadav Rotem <nrotem at apple.com> wrote: > >> Was the vectorizer successful in unrolling the loop in quantum_sigma_x? I >> wonder if 'size’ is typically high or low. > > No. The vectorizer stated that it wasn't going to bother with the loop > because it wasn't profitable. Specifically: > > LV: Checking a loop in "quantum_sigma_x" > LV: Found a loop: for.body > LV: Found an induction variable. > LV: Found a write-only loop! > LV: We can vectorize this loop! > LV: Found trip count: 0 > LV: The Widest type: 64 bits. > LV: Found an estimated cost of 0 for VF 1 For instruction: > %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, > %for.body ] > LV: Found an estimated cost of 0 for VF 1 For instruction: %state > getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64 > %indvars.iv, i32 1, !dbg !58 > LV: Found an estimated cost of 1 for VF 1 For instruction: %3 = load > i64* %state, align 8, !dbg !58, !tbaa !61 > LV: Found an estimated cost of 1 for VF 1 For instruction: %xor > xor i64 %3, %shl, !dbg !58 > LV: Found an estimated cost of 1 for VF 1 For instruction: store i64 > %xor, i64* %state, align 8, !dbg !58, !tbaa !61 > LV: Found an estimated cost of 1 for VF 1 For instruction: > %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52 > LV: Found an estimated cost of 0 for VF 1 For instruction: %4 > trunc i64 %indvars.iv.next to i32, !dbg !52 > LV: Found an estimated cost of 1 for VF 1 For instruction: %cmp > icmp slt i32 %4, %1, !dbg !52 > LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 > %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57 > LV: Scalar loop costs: 5. > LV: Found an estimated cost of 0 for VF 2 For instruction: > %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, > %for.body ] > LV: Found an estimated cost of 0 for VF 2 For instruction: %state > getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64 > %indvars.iv, i32 1, !dbg !58 > LV: Found an estimated cost of 6 for VF 2 For instruction: %3 = load > i64* %state, align 8, !dbg !58, !tbaa !61 > LV: Found an estimated cost of 1 for VF 2 For instruction: %xor > xor i64 %3, %shl, !dbg !58 > LV: Found an estimated cost of 6 for VF 2 For instruction: store i64 > %xor, i64* %state, align 8, !dbg !58, !tbaa !61 > LV: Found an estimated cost of 1 for VF 2 For instruction: > %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52 > LV: Found an estimated cost of 0 for VF 2 For instruction: %4 > trunc i64 %indvars.iv.next to i32, !dbg !52 > LV: Found an estimated cost of 1 for VF 2 For instruction: %cmp > icmp slt i32 %4, %1, !dbg !52 > LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 > %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57 > LV: Vector loop of width 2 costs: 7. > LV: Selecting VF = : 1. > LV: The target has 16 vector registers > LV(REG): Calculating max register usage: > LV(REG): At #0 Interval # 0 > LV(REG): At #1 Interval # 1 > LV(REG): At #2 Interval # 2 > LV(REG): At #3 Interval # 3 > LV(REG): At #5 Interval # 2 > LV(REG): At #6 Interval # 2 > LV(REG): At #7 Interval # 2 > LV(REG): Found max usage: 3 > LV(REG): Found invariant usage: 3 > LV(REG): LoopSize: 9 > LV: Found a vectorizable loop (1) in gates.ll > LV: Unroll Factor is 1 > LV: Vectorization is possible but not beneficial. > > > I poked briefly at the vectorizer code to see if there is anything > that the profile data could've told it, but this loop did not meet the > requirements for unrolling. And even if it did, the trip count is not > constant and the unroll factor used by the vectorizer is pretty low. > So, even if we vectorized it (or parts of it) I don't think the > speedup would be significant. > > What really helps this loop is to peel it a few times and do the > remaining iterations in the loop. > > > Diego.
Nadav Rotem
2014-Jan-16 17:29 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Jan 16, 2014, at 9:26 AM, Nadav Rotem <nrotem at apple.com> wrote:> Hi Diego, > > It looks like the problem is with the code in the vectorizer that tries to estimate the most profitable vectorization factor: > >> LV: Found an estimated cost of 6 for VF 2 For instruction: %3 = load >> i64* %state, align 8, !dbg !58, !tbaa !61 > > > It looks like a cost model problem. The vectorizer thinks that loading %3 (above) is non consecutive and would require scatter/gather. Is that correct? I wonder that SCEV is reporting. Is there an index overflow problem that is preventing us from loading consecutive elements? >Oh, yes, yes. reg->node[i].state ^= ... We are accessing a struct with multiple members. This would require strided access because we need to skip the other members of the struct. Thanks, Nadav> > On Jan 16, 2014, at 8:16 AM, Diego Novillo <dnovillo at google.com> wrote: > >> On Wed, Jan 15, 2014 at 5:30 PM, Nadav Rotem <nrotem at apple.com> wrote: >> >>> Was the vectorizer successful in unrolling the loop in quantum_sigma_x? I >>> wonder if 'size’ is typically high or low. >> >> No. The vectorizer stated that it wasn't going to bother with the loop >> because it wasn't profitable. Specifically: >> >> LV: Checking a loop in "quantum_sigma_x" >> LV: Found a loop: for.body >> LV: Found an induction variable. >> LV: Found a write-only loop! >> LV: We can vectorize this loop! >> LV: Found trip count: 0 >> LV: The Widest type: 64 bits. >> LV: Found an estimated cost of 0 for VF 1 For instruction: >> %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, >> %for.body ] >> LV: Found an estimated cost of 0 for VF 1 For instruction: %state >> getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64 >> %indvars.iv, i32 1, !dbg !58 >> LV: Found an estimated cost of 1 for VF 1 For instruction: %3 = load >> i64* %state, align 8, !dbg !58, !tbaa !61 >> LV: Found an estimated cost of 1 for VF 1 For instruction: %xor >> xor i64 %3, %shl, !dbg !58 >> LV: Found an estimated cost of 1 for VF 1 For instruction: store i64 >> %xor, i64* %state, align 8, !dbg !58, !tbaa !61 >> LV: Found an estimated cost of 1 for VF 1 For instruction: >> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52 >> LV: Found an estimated cost of 0 for VF 1 For instruction: %4 >> trunc i64 %indvars.iv.next to i32, !dbg !52 >> LV: Found an estimated cost of 1 for VF 1 For instruction: %cmp >> icmp slt i32 %4, %1, !dbg !52 >> LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 >> %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57 >> LV: Scalar loop costs: 5. >> LV: Found an estimated cost of 0 for VF 2 For instruction: >> %indvars.iv = phi i64 [ 0, %for.body.lr.ph ], [ %indvars.iv.next, >> %for.body ] >> LV: Found an estimated cost of 0 for VF 2 For instruction: %state >> getelementptr inbounds %struct.quantum_reg_node_struct* %2, i64 >> %indvars.iv, i32 1, !dbg !58 >> LV: Found an estimated cost of 6 for VF 2 For instruction: %3 = load >> i64* %state, align 8, !dbg !58, !tbaa !61 >> LV: Found an estimated cost of 1 for VF 2 For instruction: %xor >> xor i64 %3, %shl, !dbg !58 >> LV: Found an estimated cost of 6 for VF 2 For instruction: store i64 >> %xor, i64* %state, align 8, !dbg !58, !tbaa !61 >> LV: Found an estimated cost of 1 for VF 2 For instruction: >> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1, !dbg !52 >> LV: Found an estimated cost of 0 for VF 2 For instruction: %4 >> trunc i64 %indvars.iv.next to i32, !dbg !52 >> LV: Found an estimated cost of 1 for VF 2 For instruction: %cmp >> icmp slt i32 %4, %1, !dbg !52 >> LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 >> %cmp, label %for.body, label %for.end.loopexit, !dbg !52, !prof !57 >> LV: Vector loop of width 2 costs: 7. >> LV: Selecting VF = : 1. >> LV: The target has 16 vector registers >> LV(REG): Calculating max register usage: >> LV(REG): At #0 Interval # 0 >> LV(REG): At #1 Interval # 1 >> LV(REG): At #2 Interval # 2 >> LV(REG): At #3 Interval # 3 >> LV(REG): At #5 Interval # 2 >> LV(REG): At #6 Interval # 2 >> LV(REG): At #7 Interval # 2 >> LV(REG): Found max usage: 3 >> LV(REG): Found invariant usage: 3 >> LV(REG): LoopSize: 9 >> LV: Found a vectorizable loop (1) in gates.ll >> LV: Unroll Factor is 1 >> LV: Vectorization is possible but not beneficial. >> >> >> I poked briefly at the vectorizer code to see if there is anything >> that the profile data could've told it, but this loop did not meet the >> requirements for unrolling. And even if it did, the trip count is not >> constant and the unroll factor used by the vectorizer is pretty low. >> So, even if we vectorized it (or parts of it) I don't think the >> speedup would be significant. >> >> What really helps this loop is to peel it a few times and do the >> remaining iterations in the loop. >> >> >> Diego. >
Diego Novillo
2014-Jan-16 18:02 UTC
[LLVMdev] Loop unrolling opportunity in SPEC's libquantum with profile info
On Thu, Jan 16, 2014 at 9:26 AM, Nadav Rotem <nrotem at apple.com> wrote:> Hi Diego, > > It looks like the problem is with the code in the vectorizer that tries to estimate the most profitable vectorization factor: > >> LV: Found an estimated cost of 6 for VF 2 For instruction: %3 = load >> i64* %state, align 8, !dbg !58, !tbaa !61 > > > It looks like a cost model problem. The vectorizer thinks that loading %3 (above) is non consecutive and would require scatter/gather. Is that correct? I wonder that SCEV is reporting. Is there an index overflow problem that is preventing us from loading consecutive elements?Yes, I forgot to mention that. The access is non-consecutive: for(i=0; i<reg->size; i++) { /* Flip the target bit of each basis state */ reg->node[i].state ^= ((MAX_UNSIGNED) 1 << target); } The code is writing to the 'state' field. The data structures look like this: typedef struct quantum_matrix_struct quantum_matrix; struct quantum_reg_node_struct { COMPLEX_FLOAT amplitude; /* alpha_j */ MAX_UNSIGNED state; /* j */ }; typedef struct quantum_reg_node_struct quantum_reg_node; /* The quantum register */ struct quantum_reg_struct { int width; /* number of qubits in the qureg */ int size; /* number of non-zero vectors */ int hashw; /* width of the hash array */ quantum_reg_node *node; int *hash; }; If you do the trick of writing to a separate array, then the loop can be vectorized. Diego.