Jonas Paulsson via llvm-dev
2016-Oct-06 14:30 UTC
[llvm-dev] LoopVectorizer -- generating bad and unhandled shufflevector sequence
Hi,

I have been experimenting with enabling the LoopVectorizer for SystemZ, and I have come across a loop that seems to be poorly generated when vectorized. In short, there is a completely unnecessary sequence of shufflevector instructions that does not get optimized away anywhere. In other words, the shuffling leads back to the original vector:

[0 1 2 3 4 5 6 7]

[0 4] [1 5] [2 6] [3 7]

[0 4 1 5] [2 6 3 7]

[0 1 2 3 4 5 6 7]

Is this something the instruction combiner, or perhaps the InterleavedAccess pass, should handle? Even though I suspect that many SystemZ target hooks currently return bad values, this seems like something that the optimizers should handle regardless. The result is unnecessary target instructions, as can be seen at the bottom.

I would appreciate any input on this, and if needed I can supply a test case.

/Jonas

Loop before vectorize pass:

while.body320: ; preds = %while.body320.preheader, %while.body320
  %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ]
  %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ]
  %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ]
  %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
  %dec = add nsw i32 %len.0288, -1
  %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
  %176 = load i64, i64* %ll.0290, align 8
  %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
  %177 = load i64, i64* %rl.0289, align 8
  %and322 = and i64 %177, %176
  %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
  store i64 %and322, i64* %dl.0291, align 8
  %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
  %178 = load i64, i64* %incdec.ptr, align 8
  %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
  %179 = load i64, i64* %incdec.ptr321, align 8
  %and326 = and i64 %179, %178
  %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
  store i64 %and326, i64* %incdec.ptr323, align 8
  %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
  %180 = load i64, i64* %incdec.ptr324, align 8
  %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
  %181 = load i64, i64* %incdec.ptr325, align 8
  %and330 = and i64 %181, %180
  %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
  store i64 %and330, i64* %incdec.ptr327, align 8
  %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
  %182 = load i64, i64* %incdec.ptr328, align 8
  %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
  %183 = load i64, i64* %incdec.ptr329, align 8
  %and334 = and i64 %183, %182
  %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
  store i64 %and334, i64* %incdec.ptr331, align 8
  %tobool319 = icmp eq i32 %dec, 0
  br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320

Vectorizing:

LV: Checking a loop in "Perl_do_vop" from do_vop.bc
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: while.body320
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found an induction variable.
LV: Did not find one integer induction var.
LV: We can vectorize this loop (with a runtime bound check)!
LV: Analyzing interleaved accesses...
LV: Creating an interleave group with: store i64 %and334, i64* %incdec.ptr331, align 8
LV: Inserted: store i64 %and330, i64* %incdec.ptr327, align 8
    into the interleave group with store i64 %and334, i64* %incdec.ptr331, align 8
LV: Inserted: store i64 %and326, i64* %incdec.ptr323, align 8
    into the interleave group with store i64 %and334, i64* %incdec.ptr331, align 8
LV: Inserted: store i64 %and322, i64* %dl.0291, align 8
    into the interleave group with store i64 %and334, i64* %incdec.ptr331, align 8
LV: Creating an interleave group with: %183 = load i64, i64* %incdec.ptr329, align 8
LV: Inserted: %181 = load i64, i64* %incdec.ptr325, align 8
    into the interleave group with %183 = load i64, i64* %incdec.ptr329, align 8
LV: Inserted: %179 = load i64, i64* %incdec.ptr321, align 8
    into the interleave group with %183 = load i64, i64* %incdec.ptr329, align 8
LV: Inserted: %177 = load i64, i64* %rl.0289, align 8
    into the interleave group with %183 = load i64, i64* %incdec.ptr329, align 8
LV: Creating an interleave group with: %182 = load i64, i64* %incdec.ptr328, align 8
LV: Inserted: %180 = load i64, i64* %incdec.ptr324, align 8
    into the interleave group with %182 = load i64, i64* %incdec.ptr328, align 8
LV: Inserted: %178 = load i64, i64* %incdec.ptr, align 8
    into the interleave group with %182 = load i64, i64* %incdec.ptr328, align 8
LV: Inserted: %176 = load i64, i64* %ll.0290, align 8
    into the interleave group with %182 = load i64, i64* %incdec.ptr328, align 8
LV: Found uniform instruction: %tobool319 = icmp eq i32 %dec, 0
LV: Found uniform instruction: %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
LV: Found uniform instruction: %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
LV: Found uniform instruction: %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
LV: Found uniform instruction: %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
LV: Found uniform instruction: %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
LV: Found uniform instruction: %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
LV: Found uniform instruction: %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
LV: Found uniform instruction: %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
LV: Found uniform instruction: %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
LV: Found uniform instruction: %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ]
LV: Found uniform instruction: %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
LV: Found uniform instruction: %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ]
LV: Found uniform instruction: %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
LV: Found uniform instruction: %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ]
LV: Found uniform instruction: %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
LV: Found uniform instruction: %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
LV: Found uniform instruction: %dec = add nsw i32 %len.0288, -1
LV: Found trip count: 0
LV: The Smallest and Widest types: 64 / 64 bits.
LV: The Widest register is: 128 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction: %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction: %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
LV: Found an estimated cost of 1 for VF 1 For instruction: %dec = add nsw i32 %len.0288, -1
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %176 = load i64, i64* %ll.0290, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
LV: Found an estimated cost of 1 for VF 1 For instruction: %177 = load i64, i64* %rl.0289, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %and322 = and i64 %177, %176
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
LV: Found an estimated cost of 1 for VF 1 For instruction: store i64 %and322, i64* %dl.0291, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
LV: Found an estimated cost of 1 for VF 1 For instruction: %178 = load i64, i64* %incdec.ptr, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
LV: Found an estimated cost of 1 for VF 1 For instruction: %179 = load i64, i64* %incdec.ptr321, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %and326 = and i64 %179, %178
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
LV: Found an estimated cost of 1 for VF 1 For instruction: store i64 %and326, i64* %incdec.ptr323, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
LV: Found an estimated cost of 1 for VF 1 For instruction: %180 = load i64, i64* %incdec.ptr324, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
LV: Found an estimated cost of 1 for VF 1 For instruction: %181 = load i64, i64* %incdec.ptr325, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %and330 = and i64 %181, %180
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
LV: Found an estimated cost of 1 for VF 1 For instruction: store i64 %and330, i64* %incdec.ptr327, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
LV: Found an estimated cost of 1 for VF 1 For instruction: %182 = load i64, i64* %incdec.ptr328, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
LV: Found an estimated cost of 1 for VF 1 For instruction: %183 = load i64, i64* %incdec.ptr329, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %and334 = and i64 %183, %182
LV: Found an estimated cost of 0 for VF 1 For instruction: %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
LV: Found an estimated cost of 1 for VF 1 For instruction: store i64 %and334, i64* %incdec.ptr331, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction: %tobool319 = icmp eq i32 %dec, 0
LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320
LV: Scalar loop costs: 18.
LV: Found an estimated cost of 0 for VF 2 For instruction: %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 2 For instruction: %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 2 For instruction: %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 2 For instruction: %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
LV: Found an estimated cost of 1 for VF 2 For instruction: %dec = add nsw i32 %len.0288, -1
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
LV: Found an estimated cost of 4 for VF 2 For instruction: %176 = load i64, i64* %ll.0290, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
LV: Found an estimated cost of 4 for VF 2 For instruction: %177 = load i64, i64* %rl.0289, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %and322 = and i64 %177, %176
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
LV: Found an estimated cost of 0 for VF 2 For instruction: store i64 %and322, i64* %dl.0291, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
LV: Found an estimated cost of 0 for VF 2 For instruction: %178 = load i64, i64* %incdec.ptr, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
LV: Found an estimated cost of 0 for VF 2 For instruction: %179 = load i64, i64* %incdec.ptr321, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %and326 = and i64 %179, %178
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
LV: Found an estimated cost of 0 for VF 2 For instruction: store i64 %and326, i64* %incdec.ptr323, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
LV: Found an estimated cost of 0 for VF 2 For instruction: %180 = load i64, i64* %incdec.ptr324, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
LV: Found an estimated cost of 0 for VF 2 For instruction: %181 = load i64, i64* %incdec.ptr325, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %and330 = and i64 %181, %180
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
LV: Found an estimated cost of 0 for VF 2 For instruction: store i64 %and330, i64* %incdec.ptr327, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
LV: Found an estimated cost of 0 for VF 2 For instruction: %182 = load i64, i64* %incdec.ptr328, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
LV: Found an estimated cost of 0 for VF 2 For instruction: %183 = load i64, i64* %incdec.ptr329, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %and334 = and i64 %183, %182
LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
LV: Found an estimated cost of 4 for VF 2 For instruction: store i64 %and334, i64* %incdec.ptr331, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction: %tobool319 = icmp eq i32 %dec, 0
LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320
LV: Vector loop of width 2 costs: 9.
LV: Selecting VF: 2.
LV: The target has 32 registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #4 Interval # 4
LV(REG): At #5 Interval # 4
LV(REG): At #6 Interval # 5
LV(REG): At #7 Interval # 6
LV(REG): At #8 Interval # 7
LV(REG): At #9 Interval # 8
LV(REG): At #10 Interval # 7
LV(REG): At #12 Interval # 7
LV(REG): At #13 Interval # 8
LV(REG): At #14 Interval # 8
LV(REG): At #15 Interval # 9
LV(REG): At #16 Interval # 9
LV(REG): At #17 Interval # 8
LV(REG): At #19 Interval # 7
LV(REG): At #20 Interval # 8
LV(REG): At #21 Interval # 8
LV(REG): At #22 Interval # 9
LV(REG): At #23 Interval # 9
LV(REG): At #24 Interval # 8
LV(REG): At #26 Interval # 7
LV(REG): At #27 Interval # 7
LV(REG): At #28 Interval # 7
LV(REG): At #29 Interval # 7
LV(REG): At #30 Interval # 7
LV(REG): At #31 Interval # 6
LV(REG): At #33 Interval # 5
LV(REG): VF = 2
LV(REG): Found max usage: 2
LV(REG): Found invariant usage: 4
LV(REG): LoopSize: 35
LV: Loop cost is 18
LV: Interleaving to reduce branch cost.
LV: Interleaving is not beneficial.
LV: Found a vectorizable loop (2) in do_vop.bc
LV: Interleaving disabled by the pass manager
LV: Scalarizing: %dec = add nsw i32 %len.0288, -1
LV: Scalarizing: %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
LV: Scalarizing: %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
LV: Scalarizing: %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
LV: Scalarizing: %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
LV: Scalarizing: %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
LV: Scalarizing: %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
LV: Scalarizing: %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
LV: Scalarizing: %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
LV: Scalarizing: %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
LV: Scalarizing: %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
LV: Scalarizing: %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
LV: Scalarizing: %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
LV: Scalarizing: %tobool319 = icmp eq i32 %dec, 0
vectorized loop (vectorization width: 2, interleaved count: 1)

Loop after vectorize pass:

vector.body419: ; preds = %vector.body419, %vector.ph440
  %index441 = phi i64 [ 0, %vector.ph440 ], [ %index.next442, %vector.body419 ]
  %184 = add i64 %index441, 0
  %185 = shl i64 %184, 2
  %next.gep453 = getelementptr i64, i64* %73, i64 %185
  %186 = add i64 %index441, 0
  %187 = shl i64 %186, 2
  %next.gep454 = getelementptr i64, i64* %74, i64 %187
  %188 = add i64 %index441, 0
  %189 = shl i64 %188, 2
  %next.gep455 = getelementptr i64, i64* %75, i64 %189
  %190 = trunc i64 %index441 to i32
  %offset.idx456 = sub i32 %conv316, %190
  %broadcast.splatinsert457 = insertelement <2 x i32> undef, i32 %offset.idx456, i32 0
  %broadcast.splat458 = shufflevector <2 x i32> %broadcast.splatinsert457, <2 x i32> undef, <2 x i32> zeroinitializer
  %induction459 = add <2 x i32> %broadcast.splat458, <i32 0, i32 -1>
  %191 = add i32 %offset.idx456, 0
  %192 = add nsw i32 %191, -1
  %193 = getelementptr inbounds i64, i64* %next.gep454, i64 1
  %194 = getelementptr i64, i64* %next.gep454, i32 0
  %195 = bitcast i64* %194 to <8 x i64>*
  %wide.vec460 = load <8 x i64>, <8 x i64>* %195, align 8, !alias.scope !21
  %strided.vec461 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
  %strided.vec462 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
  %strided.vec463 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
  %strided.vec464 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
  %196 = getelementptr inbounds i64, i64* %next.gep455, i64 1
  %197 = getelementptr i64, i64* %next.gep455, i32 0
  %198 = bitcast i64* %197 to <8 x i64>*
  %wide.vec465 = load <8 x i64>, <8 x i64>* %198, align 8, !alias.scope !24
  %strided.vec466 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
  %strided.vec467 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
  %strided.vec468 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
  %strided.vec469 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
  %199 = and <2 x i64> %strided.vec466, %strided.vec461
  %200 = getelementptr inbounds i64, i64* %next.gep453, i64 1
  %201 = getelementptr inbounds i64, i64* %next.gep454, i64 2
  %202 = getelementptr inbounds i64, i64* %next.gep455, i64 2
  %203 = and <2 x i64> %strided.vec467, %strided.vec462
  %204 = getelementptr inbounds i64, i64* %next.gep453, i64 2
  %205 = getelementptr inbounds i64, i64* %next.gep454, i64 3
  %206 = getelementptr inbounds i64, i64* %next.gep455, i64 3
  %207 = and <2 x i64> %strided.vec468, %strided.vec463
  %208 = getelementptr inbounds i64, i64* %next.gep453, i64 3
  %209 = getelementptr inbounds i64, i64* %next.gep454, i64 4
  %210 = getelementptr inbounds i64, i64* %next.gep455, i64 4
  %211 = and <2 x i64> %strided.vec469, %strided.vec464
  %212 = getelementptr inbounds i64, i64* %next.gep453, i64 4
  %213 = getelementptr i64, i64* %208, i32 -3
  %214 = bitcast i64* %213 to <8 x i64>*
  %215 = shufflevector <2 x i64> %199, <2 x i64> %203, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %216 = shufflevector <2 x i64> %207, <2 x i64> %211, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %217 = shufflevector <4 x i64> %215, <4 x i64> %216, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %interleaved.vec470 = shufflevector <8 x i64> %217, <8 x i64> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
  store <8 x i64> %interleaved.vec470, <8 x i64>* %214, align 8, !alias.scope !26, !noalias !28
  %218 = icmp eq i32 %192, 0
  %index.next442 = add i64 %index441, 2
  %219 = icmp eq i64 %index.next442, %n.vec425
  br i1 %219, label %middle.block420, label %vector.body419, !llvm.loop !29

Loop after instruction combining:

vector.body419: ; preds = %vector.body419, %vector.body419.preheader
  %lsr.iv62 = phi i8* [ %scevgep63, %vector.body419 ], [ %dc.2, %vector.body419.preheader ]
  %lsr.iv59 = phi i8* [ %scevgep60, %vector.body419 ], [ %cond, %vector.body419.preheader ]
  %lsr.iv56 = phi i8* [ %scevgep57, %vector.body419 ], [ %cond57, %vector.body419.preheader ]
  %lsr.iv54 = phi i64 [ %lsr.iv.next55, %vector.body419 ], [ %n.vec425, %vector.body419.preheader ]
  %lsr.iv6264 = bitcast i8* %lsr.iv62 to <8 x i64>*
  %lsr.iv5961 = bitcast i8* %lsr.iv59 to <8 x i64>*
  %lsr.iv5658 = bitcast i8* %lsr.iv56 to <8 x i64>*
  %wide.vec460 = load <8 x i64>, <8 x i64>* %lsr.iv5961, align 8, !alias.scope !21
  %wide.vec465 = load <8 x i64>, <8 x i64>* %lsr.iv5658, align 8, !alias.scope !24
  %179 = and <8 x i64> %wide.vec465, %wide.vec460
  %180 = shufflevector <8 x i64> %179, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
  %181 = and <8 x i64> %wide.vec465, %wide.vec460
  %182 = shufflevector <8 x i64> %181, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
  %183 = and <8 x i64> %wide.vec465, %wide.vec460
  %184 = shufflevector <8 x i64> %183, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
  %185 = and <8 x i64> %wide.vec465, %wide.vec460
  %186 = shufflevector <8 x i64> %185, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
  %187 = shufflevector <2 x i64> %180, <2 x i64> %182, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %188 = shufflevector <2 x i64> %184, <2 x i64> %186, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %interleaved.vec470 = shufflevector <4 x i64> %187, <4 x i64> %188, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
  store <8 x i64> %interleaved.vec470, <8 x i64>* %lsr.iv6264, align 8, !alias.scope !26, !noalias !28
  %lsr.iv.next55 = add i64 %lsr.iv54, -2
  %scevgep57 = getelementptr i8, i8* %lsr.iv56, i64 64
  %scevgep60 = getelementptr i8, i8* %lsr.iv59, i64 64
  %scevgep63 = getelementptr i8, i8* %lsr.iv62, i64 64
  %189 = icmp eq i64 %lsr.iv.next55, 0
  br i1 %189, label %middle.block420, label %vector.body419, !llvm.loop !29

Final vectorized loop:

.LBB0_141: # %vector.body419
           # =>This Inner Loop Header: Depth=1
  vl %v0, 48(%r8)
  vl %v1, 48(%r7)
  vn %v0, %v1, %v0
  vl %v1, 16(%r8)
  vl %v2, 16(%r7)
  vn %v1, %v2, %v1
  vmrlg %v2, %v1, %v0
  vmrhg %v0, %v1, %v0
  vmrlg %v1, %v0, %v2
  vst %v1, 48(%r9)
  vl %v1, 32(%r8)
  vl %v3, 32(%r7)
  vn %v1, %v3, %v1
  vl %v3, 0(%r8)
  vl %v4, 0(%r7)
  vn %v3, %v4, %v3
  vmrlg %v4, %v3, %v1
  vmrhg %v1, %v3, %v1
  vmrlg %v3, %v1, %v4
  vst %v3, 32(%r9)
  vmrhg %v0, %v0, %v2
  vst %v0, 16(%r9)
  vmrhg %v0, %v1, %v4
  vst %v0, 0(%r9)
  la %r9, 64(%r9)
  la %r8, 64(%r8)
  la %r7, 64(%r7)
  aghi %r13, -2
  jne .LBB0_141

Final scalar loop:

.LBB0_152: # %while.body320
           # =>This Inner Loop Header: Depth=1
  lg %r13, 0(%r14)
  ng %r13, 0(%r5)
  stg %r13, 0(%r4)
  lg %r13, 8(%r14)
  ng %r13, 8(%r5)
  stg %r13, 8(%r4)
  lg %r13, 16(%r14)
  ng %r13, 16(%r5)
  stg %r13, 16(%r4)
  lg %r13, 24(%r14)
  ng %r13, 24(%r5)
  stg %r13, 24(%r4)
  la %r4, 32(%r4)
  la %r14, 32(%r14)
  la %r5, 32(%r5)
  brct %r0, .LBB0_152
  j .LBB0_155
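The claim that the shuffle sequence leads back to the original vector can be checked with a small simulation. The following is only a sketch in Python, not LLVM code: the helper `shufflevector` and the `UNDEF` placeholder are illustrative names that model the IR instruction on plain lists, and the lanes are labelled 0..7 as in the diagram at the top of this message. Because `and` operates lane by lane, it commutes with the shuffles, so it is enough to track lane positions.

```python
def shufflevector(a, b, mask):
    """Model LLVM's shufflevector: index into the concatenation of a and b."""
    joined = list(a) + list(b)
    return [joined[i] for i in mask]


UNDEF = [None] * 8  # stand-in for an undef second operand

# Lanes of the wide <8 x i64> value, labelled by position.
wide = list(range(8))

# De-interleave with stride 4, as the strided.vec shuffles do.
parts = [shufflevector(wide, UNDEF, [k, k + 4]) for k in range(4)]

# Concatenate pairwise, as %215, %216 and %217 do.
lo = shufflevector(parts[0], parts[1], [0, 1, 2, 3])
hi = shufflevector(parts[2], parts[3], [0, 1, 2, 3])
concat = shufflevector(lo, hi, list(range(8)))

# Re-interleave with the mask used for %interleaved.vec470.
result = shufflevector(concat, UNDEF, [0, 2, 4, 6, 1, 3, 5, 7])

print(parts)   # the four 2-wide pieces
print(concat)  # the concatenated 8-wide vector
print(result)  # lane order after the final shuffle
```

Under these assumptions the final lane order equals the starting order, which is why the whole chain could in principle be replaced by the wide `and` result.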
Matthew Simpson via llvm-dev
2016-Oct-06 18:40 UTC
[llvm-dev] LoopVectorizer -- generating bad and unhandled shufflevector sequence
Hi Jonas,

It does look like we should be able to simplify this. Would you mind filing a bug?

Looking at the code after InstCombine, the vector ands are trivially redundant (I think EarlyCSE should already be able to remove them). I think we could then teach InstructionSimplify to simplify the remaining shuffles, similar to the way it already handles extracts.

-- Matt

On Thu, Oct 6, 2016 at 10:30 AM, Jonas Paulsson via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Hi,
>
> I have experimented with enabling the LoopVectorizer for SystemZ. I have
> come across a loop which, when vectorized, seems to have been poorly
> generated. In short, there seems to be a completely unnecessary sequence of
> shufflevector instructions, that doesn't get optimized away anywhere. In
> other words, there is a shuffling so that leads back to the original vector:
>
> [0 1 2 3 4 5 6 7]
>
> [0 4] [1 5] [2 6] [3 7]
>
> [0 4 1 5] [2 6 3 7]
>
> [0 1 2 3 4 5 6 7]
>
> Is this something the instruction combiner, or perhaps the
> InterleavedAccess pass should handle? Even though I suspect that there are
> currently many target hooks for SystemZ with bad values returned, this
> seems like something that the optimizers should handle regardless. The
> result of this is unnecessary target instruction - as can be seen at the
> bottom.
>
> I would appreciate any input on this, and if needed I can supply a test
> case.
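One way to picture the shuffle simplification suggested in the reply is to trace every lane of the final shuffle back through the chain to a source lane. The sketch below is a hypothetical Python model, not how InstructionSimplify is implemented; `Source`, `Shuffle`, `trace_lane` and `simplify_to_source` are names invented for illustration, and it assumes the four duplicate ands have already been merged into a single value.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Source:
    name: str
    width: int


@dataclass(frozen=True)
class Shuffle:
    a: "Node"
    b: Optional["Node"]  # None models an undef operand
    mask: Tuple[int, ...]


Node = Union[Source, Shuffle]


def width(node):
    return node.width if isinstance(node, Source) else len(node.mask)


def trace_lane(node, lane):
    """Follow one lane back to (source, lane); None if it reads undef."""
    while isinstance(node, Shuffle):
        idx = node.mask[lane]
        wa = width(node.a)
        if idx < wa:
            node, lane = node.a, idx
        elif node.b is None:
            return None
        else:
            node, lane = node.b, idx - wa
    return node, lane


def simplify_to_source(node):
    """Return the source if the shuffle chain is an identity on it, else None."""
    traced = [trace_lane(node, i) for i in range(width(node))]
    if any(t is None for t in traced):
        return None
    src = traced[0][0]
    if width(src) != width(node):
        return None
    if all(t == (src, i) for i, t in enumerate(traced)):
        return src
    return None


# The chain left after InstCombine, with the duplicate ands merged into one.
and_val = Source("and", 8)
parts = [Shuffle(and_val, None, (k, k + 4)) for k in range(4)]
lo = Shuffle(parts[0], parts[1], (0, 1, 2, 3))
hi = Shuffle(parts[2], parts[3], (0, 1, 2, 3))
interleaved = Shuffle(lo, hi, (0, 2, 4, 6, 1, 3, 5, 7))

simplified = simplify_to_source(interleaved)
print(simplified)
```

In this model the interleaving shuffle traces back lane for lane to the single `and` value, so the store could use that value directly. A chain whose lanes come from different sources, or in a different order, is left alone.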
of 0 for VF 2 For instruction: > %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2 > LV: Found an estimated cost of 0 for VF 2 For instruction: %179 = load > i64, i64* %incdec.ptr321, align 8 > LV: Found an estimated cost of 1 for VF 2 For instruction: %and326 = and > i64 %179, %178 > LV: Found an estimated cost of 0 for VF 2 For instruction: > %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2 > LV: Found an estimated cost of 0 for VF 2 For instruction: store i64 > %and326, i64* %incdec.ptr323, align 8 > LV: Found an estimated cost of 0 for VF 2 For instruction: > %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3 > LV: Found an estimated cost of 0 for VF 2 For instruction: %180 = load > i64, i64* %incdec.ptr324, align 8 > LV: Found an estimated cost of 0 for VF 2 For instruction: > %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3 > LV: Found an estimated cost of 0 for VF 2 For instruction: %181 = load > i64, i64* %incdec.ptr325, align 8 > LV: Found an estimated cost of 1 for VF 2 For instruction: %and330 = and > i64 %181, %180 > LV: Found an estimated cost of 0 for VF 2 For instruction: > %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3 > LV: Found an estimated cost of 0 for VF 2 For instruction: store i64 > %and330, i64* %incdec.ptr327, align 8 > LV: Found an estimated cost of 0 for VF 2 For instruction: > %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4 > LV: Found an estimated cost of 0 for VF 2 For instruction: %182 = load > i64, i64* %incdec.ptr328, align 8 > LV: Found an estimated cost of 0 for VF 2 For instruction: > %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4 > LV: Found an estimated cost of 0 for VF 2 For instruction: %183 = load > i64, i64* %incdec.ptr329, align 8 > LV: Found an estimated cost of 1 for VF 2 For instruction: %and334 = and > i64 %183, %182 > LV: Found an estimated cost of 0 for VF 2 For instruction: > %incdec.ptr335 = 
getelementptr inbounds i64, i64* %dl.0291, i64 4 > LV: Found an estimated cost of 4 for VF 2 For instruction: store i64 > %and334, i64* %incdec.ptr331, align 8 > LV: Found an estimated cost of 1 for VF 2 For instruction: %tobool319 > icmp eq i32 %dec, 0 > LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 > %tobool319, label %sw.epilog381.loopexit, label %while.body320 > LV: Vector loop of width 2 costs: 9. > LV: Selecting VF: 2. > LV: The target has 32 registers > LV(REG): Calculating max register usage: > LV(REG): At #0 Interval # 0 > LV(REG): At #1 Interval # 1 > LV(REG): At #2 Interval # 2 > LV(REG): At #3 Interval # 3 > LV(REG): At #4 Interval # 4 > LV(REG): At #5 Interval # 4 > LV(REG): At #6 Interval # 5 > LV(REG): At #7 Interval # 6 > LV(REG): At #8 Interval # 7 > LV(REG): At #9 Interval # 8 > LV(REG): At #10 Interval # 7 > LV(REG): At #12 Interval # 7 > LV(REG): At #13 Interval # 8 > LV(REG): At #14 Interval # 8 > LV(REG): At #15 Interval # 9 > LV(REG): At #16 Interval # 9 > LV(REG): At #17 Interval # 8 > LV(REG): At #19 Interval # 7 > LV(REG): At #20 Interval # 8 > LV(REG): At #21 Interval # 8 > LV(REG): At #22 Interval # 9 > LV(REG): At #23 Interval # 9 > LV(REG): At #24 Interval # 8 > LV(REG): At #26 Interval # 7 > LV(REG): At #27 Interval # 7 > LV(REG): At #28 Interval # 7 > LV(REG): At #29 Interval # 7 > LV(REG): At #30 Interval # 7 > LV(REG): At #31 Interval # 6 > LV(REG): At #33 Interval # 5 > LV(REG): VF = 2 > LV(REG): Found max usage: 2 > LV(REG): Found invariant usage: 4 > LV(REG): LoopSize: 35 > LV: Loop cost is 18 > LV: Interleaving to reduce branch cost. > LV: Interleaving is not beneficial. 
> LV: Found a vectorizable loop (2) in do_vop.bc
> LV: Interleaving disabled by the pass manager
> LV: Scalarizing: %dec = add nsw i32 %len.0288, -1
> LV: Scalarizing: %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
> LV: Scalarizing: %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> LV: Scalarizing: %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> LV: Scalarizing: %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> LV: Scalarizing: %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> LV: Scalarizing: %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> LV: Scalarizing: %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> LV: Scalarizing: %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> LV: Scalarizing: %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> LV: Scalarizing: %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> LV: Scalarizing: %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> LV: Scalarizing: %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> LV: Scalarizing: %tobool319 = icmp eq i32 %dec, 0
>
> vectorized loop (vectorization width: 2, interleaved count: 1)
>
> Loop after vectorize pass:
>
> vector.body419: ; preds = %vector.body419, %vector.ph440
>   %index441 = phi i64 [ 0, %vector.ph440 ], [ %index.next442, %vector.body419 ]
>   %184 = add i64 %index441, 0
>   %185 = shl i64 %184, 2
>   %next.gep453 = getelementptr i64, i64* %73, i64 %185
>   %186 = add i64 %index441, 0
>   %187 = shl i64 %186, 2
>   %next.gep454 = getelementptr i64, i64* %74, i64 %187
>   %188 = add i64 %index441, 0
>   %189 = shl i64 %188, 2
>   %next.gep455 = getelementptr i64, i64* %75, i64 %189
>   %190 = trunc i64 %index441 to i32
>   %offset.idx456 = sub i32 %conv316, %190
>   %broadcast.splatinsert457 = insertelement <2 x i32> undef, i32 %offset.idx456, i32 0
>   %broadcast.splat458 = shufflevector <2 x i32> %broadcast.splatinsert457, <2 x i32> undef, <2 x i32> zeroinitializer
>   %induction459 = add <2 x i32> %broadcast.splat458, <i32 0, i32 -1>
>   %191 = add i32 %offset.idx456, 0
>   %192 = add nsw i32 %191, -1
>   %193 = getelementptr inbounds i64, i64* %next.gep454, i64 1
>   %194 = getelementptr i64, i64* %next.gep454, i32 0
>   %195 = bitcast i64* %194 to <8 x i64>*
>   %wide.vec460 = load <8 x i64>, <8 x i64>* %195, align 8, !alias.scope !21
>   %strided.vec461 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>   %strided.vec462 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>   %strided.vec463 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>   %strided.vec464 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>   %196 = getelementptr inbounds i64, i64* %next.gep455, i64 1
>   %197 = getelementptr i64, i64* %next.gep455, i32 0
>   %198 = bitcast i64* %197 to <8 x i64>*
>   %wide.vec465 = load <8 x i64>, <8 x i64>* %198, align 8, !alias.scope !24
>   %strided.vec466 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>   %strided.vec467 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>   %strided.vec468 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>   %strided.vec469 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>   %199 = and <2 x i64> %strided.vec466, %strided.vec461
>   %200 = getelementptr inbounds i64, i64* %next.gep453, i64 1
>   %201 = getelementptr inbounds i64, i64* %next.gep454, i64 2
>   %202 = getelementptr inbounds i64, i64* %next.gep455, i64 2
>   %203 = and <2 x i64> %strided.vec467, %strided.vec462
>   %204 = getelementptr inbounds i64, i64* %next.gep453, i64 2
>   %205 = getelementptr inbounds i64, i64* %next.gep454, i64 3
>   %206 = getelementptr inbounds i64, i64* %next.gep455, i64 3
>   %207 = and <2 x i64> %strided.vec468, %strided.vec463
>   %208 = getelementptr inbounds i64, i64* %next.gep453, i64 3
>   %209 = getelementptr inbounds i64, i64* %next.gep454, i64 4
>   %210 = getelementptr inbounds i64, i64* %next.gep455, i64 4
>   %211 = and <2 x i64> %strided.vec469, %strided.vec464
>   %212 = getelementptr inbounds i64, i64* %next.gep453, i64 4
>   %213 = getelementptr i64, i64* %208, i32 -3
>   %214 = bitcast i64* %213 to <8 x i64>*
>   %215 = shufflevector <2 x i64> %199, <2 x i64> %203, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %216 = shufflevector <2 x i64> %207, <2 x i64> %211, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %217 = shufflevector <4 x i64> %215, <4 x i64> %216, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
>   %interleaved.vec470 = shufflevector <8 x i64> %217, <8 x i64> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
>   store <8 x i64> %interleaved.vec470, <8 x i64>* %214, align 8, !alias.scope !26, !noalias !28
>   %218 = icmp eq i32 %192, 0
>   %index.next442 = add i64 %index441, 2
>   %219 = icmp eq i64 %index.next442, %n.vec425
>   br i1 %219, label %middle.block420, label %vector.body419, !llvm.loop !29
>
> Loop after instruction combining:
>
> vector.body419: ; preds = %vector.body419, %vector.body419.preheader
>   %lsr.iv62 = phi i8* [ %scevgep63, %vector.body419 ], [ %dc.2, %vector.body419.preheader ]
>   %lsr.iv59 = phi i8* [ %scevgep60, %vector.body419 ], [ %cond, %vector.body419.preheader ]
>   %lsr.iv56 = phi i8* [ %scevgep57, %vector.body419 ], [ %cond57, %vector.body419.preheader ]
>   %lsr.iv54 = phi i64 [ %lsr.iv.next55, %vector.body419 ], [ %n.vec425, %vector.body419.preheader ]
>   %lsr.iv6264 = bitcast i8* %lsr.iv62 to <8 x i64>*
>   %lsr.iv5961 = bitcast i8* %lsr.iv59 to <8 x i64>*
>   %lsr.iv5658 = bitcast i8* %lsr.iv56 to <8 x i64>*
>   %wide.vec460 = load <8 x i64>, <8 x i64>* %lsr.iv5961, align 8, !alias.scope !21
>   %wide.vec465 = load <8 x i64>, <8 x i64>* %lsr.iv5658, align 8, !alias.scope !24
>   %179 = and <8 x i64> %wide.vec465, %wide.vec460
>   %180 = shufflevector <8 x i64> %179, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>   %181 = and <8 x i64> %wide.vec465, %wide.vec460
>   %182 = shufflevector <8 x i64> %181, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>   %183 = and <8 x i64> %wide.vec465, %wide.vec460
>   %184 = shufflevector <8 x i64> %183, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>   %185 = and <8 x i64> %wide.vec465, %wide.vec460
>   %186 = shufflevector <8 x i64> %185, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>   %187 = shufflevector <2 x i64> %180, <2 x i64> %182, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %188 = shufflevector <2 x i64> %184, <2 x i64> %186, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %interleaved.vec470 = shufflevector <4 x i64> %187, <4 x i64> %188, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
>   store <8 x i64> %interleaved.vec470, <8 x i64>* %lsr.iv6264, align 8, !alias.scope !26, !noalias !28
>   %lsr.iv.next55 = add i64 %lsr.iv54, -2
>   %scevgep57 = getelementptr i8, i8* %lsr.iv56, i64 64
>   %scevgep60 = getelementptr i8, i8* %lsr.iv59, i64 64
>   %scevgep63 = getelementptr i8, i8* %lsr.iv62, i64 64
>   %189 = icmp eq i64 %lsr.iv.next55, 0
>   br i1 %189, label %middle.block420, label %vector.body419, !llvm.loop !29
>
> Final vectorized loop
>
> .LBB0_141: # %vector.body419
>   # =>This Inner Loop Header: Depth=1
>   vl %v0, 48(%r8)
>   vl %v1, 48(%r7)
>   vn %v0, %v1, %v0
>   vl %v1, 16(%r8)
>   vl %v2, 16(%r7)
>   vn %v1, %v2, %v1
>   vmrlg %v2, %v1, %v0
>   vmrhg %v0, %v1, %v0
>   vmrlg %v1, %v0, %v2
>   vst %v1, 48(%r9)
>   vl %v1, 32(%r8)
>   vl %v3, 32(%r7)
>   vn %v1, %v3, %v1
>   vl %v3, 0(%r8)
>   vl %v4, 0(%r7)
>   vn %v3, %v4, %v3
>   vmrlg %v4, %v3, %v1
>   vmrhg %v1, %v3, %v1
>   vmrlg %v3, %v1, %v4
>   vst %v3, 32(%r9)
>   vmrhg %v0, %v0, %v2
>   vst %v0, 16(%r9)
>   vmrhg %v0, %v1, %v4
>   vst %v0, 0(%r9)
>   la %r9, 64(%r9)
>   la %r8, 64(%r8)
>   la %r7, 64(%r7)
>   aghi %r13, -2
>   jne .LBB0_141
>
> Final scalar loop:
>
> .LBB0_152: # %while.body320
>   # =>This Inner Loop Header: Depth=1
>   lg %r13, 0(%r14)
>   ng %r13, 0(%r5)
>   stg %r13, 0(%r4)
>   lg %r13, 8(%r14)
>   ng %r13, 8(%r5)
>   stg %r13, 8(%r4)
>   lg %r13, 16(%r14)
>   ng %r13, 16(%r5)
>   stg %r13, 16(%r4)
>   lg %r13, 24(%r14)
>   ng %r13, 24(%r5)
>   stg %r13, 24(%r4)
>   la %r4, 32(%r4)
>   la %r14, 32(%r14)
>   la %r5, 32(%r5)
>   brct %r0, .LBB0_152
>   j .LBB0_155
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
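The claim that the shuffles in the quoted IR lead back to the original vector can be checked with a small simulation. The Python sketch below is not part of the thread: `shufflevector` is a hypothetical helper that models only the mask semantics of the LLVM instruction (indexing into the concatenation of both operands), and `roundtrip` replays the masks from the "Loop after instruction combining" listing. Undef lanes are represented by `None`, which is an assumption made for illustration.

```python
def shufflevector(a, b, mask):
    """Model the mask semantics of LLVM's shufflevector:
    each mask entry indexes into the concatenation of a and b."""
    combined = list(a) + list(b)
    return [combined[i] for i in mask]


def roundtrip(wide):
    """Replay the shuffle sequence from the quoted IR on an 8-lane vector."""
    undef8 = [None] * 8  # stand-in for the <8 x i64> undef operand

    # De-interleave: masks <0,4>, <1,5>, <2,6>, <3,7> (%180, %182, %184, %186).
    parts = [shufflevector(wide, undef8, [k, k + 4]) for k in range(4)]

    # Concatenate pairs of 2-lane vectors into 4-lane vectors (%187, %188).
    lo = shufflevector(parts[0], parts[1], [0, 1, 2, 3])
    hi = shufflevector(parts[2], parts[3], [0, 1, 2, 3])

    # Re-interleave with mask <0,2,4,6,1,3,5,7> (%interleaved.vec470).
    return shufflevector(lo, hi, [0, 2, 4, 6, 1, 3, 5, 7])


if __name__ == "__main__":
    lanes = list(range(8))
    print(roundtrip(lanes))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Under this model the sequence is the identity permutation, and since `and` operates lane by lane, the four `and` plus seven `shufflevector` instructions appear to be equivalent to a single `and <8 x i64>` followed by the store.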
Jonas Paulsson via llvm-dev
2016-Oct-07 09:07 UTC
[llvm-dev] LoopVectorizer -- generating bad and unhandled shufflevector sequence
Hi Matt,

ok - see https://llvm.org/bugs/show_bug.cgi?id=30630.

/Jonas

On 2016-10-06 20:40, Matthew Simpson wrote:
> Hi Jonas,
>
> It does look like we should be able to simplify this. Would you mind filing a bug? Looking at the code after InstCombine, the vector adds are trivially redundant (I think EarlyCSE should already be able to remove them). I think we could then teach InstructionSimplify to simplify the remaining shuffles similar to the way it already handles extracts.
>
> -- Matt
>
> On Thu, Oct 6, 2016 at 10:30 AM, Jonas Paulsson via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Hi,
>
> I have experimented with enabling the LoopVectorizer for SystemZ. I have come across a loop which, when vectorized, seems to have been poorly generated. In short, there seems to be a completely unnecessary sequence of shufflevector instructions, that doesn't get optimized away anywhere. In other words, there is a shuffling so that leads back to the original vector:
>
> [0 1 2 3 4 5 6 7]
>
> [0 4] [1 5] [2 6] [3 7]
>
> [0 4 1 5] [2 6 3 7]
>
> [0 1 2 3 4 5 6 7]
>
> Is this something the instruction combiner, or perhaps the InterleavedAccess pass should handle? Even though I suspect that there are currently many target hooks for SystemZ with bad values returned, this seems like something that the optimizers should handle regardless. The result of this is unnecessary target instruction - as can be seen at the bottom.
>
> I would appreciate any input on this, and if needed I can supply a test case.
> > /Jonas > > > Loop before vectorize pass: > > while.body320: ; preds > %while.body320.preheader, %while.body320 > %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, > %while.body320.preheader ] > %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, > %while.body320.preheader ] > %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, > %while.body320.preheader ] > %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, > %while.body320.preheader ] > %dec = add nsw i32 %len.0288, -1 > %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1 > %176 = load i64, i64* %ll.0290, align 8 > %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1 > %177 = load i64, i64* %rl.0289, align 8 > %and322 = and i64 %177, %176 > %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1 > store i64 %and322, i64* %dl.0291, align 8 > %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2 > %178 = load i64, i64* %incdec.ptr, align 8 > %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2 > %179 = load i64, i64* %incdec.ptr321, align 8 > %and326 = and i64 %179, %178 > %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2 > store i64 %and326, i64* %incdec.ptr323, align 8 > %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3 > %180 = load i64, i64* %incdec.ptr324, align 8 > %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3 > %181 = load i64, i64* %incdec.ptr325, align 8 > %and330 = and i64 %181, %180 > %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3 > store i64 %and330, i64* %incdec.ptr327, align 8 > %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4 > %182 = load i64, i64* %incdec.ptr328, align 8 > %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4 > %183 = load i64, i64* %incdec.ptr329, align 8 > %and334 = and i64 %183, %182 > %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4 > store i64 %and334, i64* 
%incdec.ptr331, align 8 > %tobool319 = icmp eq i32 %dec, 0 > br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320 > > > Vectorizing: > > LV: Checking a loop in "Perl_do_vop" from do_vop.bc > LV: Loop hints: force=? width=0 unroll=0 > LV: Found a loop: while.body320 > LV: Found an induction variable. > LV: Found an induction variable. > LV: Found an induction variable. > LV: Found an induction variable. > LV: Did not find one integer induction var. > LV: We can vectorize this loop (with a runtime bound check)! > LV: Analyzing interleaved accesses... > LV: Creating an interleave group with: store i64 %and334, i64* > %incdec.ptr331, align 8 > LV: Inserted: store i64 %and330, i64* %incdec.ptr327, align 8 > into the interleave group with store i64 %and334, i64* > %incdec.ptr331, align 8 > LV: Inserted: store i64 %and326, i64* %incdec.ptr323, align 8 > into the interleave group with store i64 %and334, i64* > %incdec.ptr331, align 8 > LV: Inserted: store i64 %and322, i64* %dl.0291, align 8 > into the interleave group with store i64 %and334, i64* > %incdec.ptr331, align 8 > LV: Creating an interleave group with: %183 = load i64, i64* > %incdec.ptr329, align 8 > LV: Inserted: %181 = load i64, i64* %incdec.ptr325, align 8 > into the interleave group with %183 = load i64, i64* > %incdec.ptr329, align 8 > LV: Inserted: %179 = load i64, i64* %incdec.ptr321, align 8 > into the interleave group with %183 = load i64, i64* > %incdec.ptr329, align 8 > LV: Inserted: %177 = load i64, i64* %rl.0289, align 8 > into the interleave group with %183 = load i64, i64* > %incdec.ptr329, align 8 > LV: Creating an interleave group with: %182 = load i64, i64* > %incdec.ptr328, align 8 > LV: Inserted: %180 = load i64, i64* %incdec.ptr324, align 8 > into the interleave group with %182 = load i64, i64* > %incdec.ptr328, align 8 > LV: Inserted: %178 = load i64, i64* %incdec.ptr, align 8 > into the interleave group with %182 = load i64, i64* > %incdec.ptr328, align 8 > LV: Inserted: 
%176 = load i64, i64* %ll.0290, align 8 > into the interleave group with %182 = load i64, i64* > %incdec.ptr328, align 8 > LV: Found uniform instruction: %tobool319 = icmp eq i32 %dec, 0 > LV: Found uniform instruction: %incdec.ptr324 = getelementptr > inbounds i64, i64* %ll.0290, i64 2 > LV: Found uniform instruction: %incdec.ptr329 = getelementptr > inbounds i64, i64* %rl.0289, i64 3 > LV: Found uniform instruction: %incdec.ptr323 = getelementptr > inbounds i64, i64* %dl.0291, i64 1 > LV: Found uniform instruction: %incdec.ptr328 = getelementptr > inbounds i64, i64* %ll.0290, i64 3 > LV: Found uniform instruction: %incdec.ptr321 = getelementptr > inbounds i64, i64* %rl.0289, i64 1 > LV: Found uniform instruction: %incdec.ptr327 = getelementptr > inbounds i64, i64* %dl.0291, i64 2 > LV: Found uniform instruction: %incdec.ptr325 = getelementptr > inbounds i64, i64* %rl.0289, i64 2 > LV: Found uniform instruction: %incdec.ptr331 = getelementptr > inbounds i64, i64* %dl.0291, i64 3 > LV: Found uniform instruction: %incdec.ptr = getelementptr > inbounds i64, i64* %ll.0290, i64 1 > LV: Found uniform instruction: %dl.0291 = phi i64* [ > %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ] > LV: Found uniform instruction: %incdec.ptr335 = getelementptr > inbounds i64, i64* %dl.0291, i64 4 > LV: Found uniform instruction: %ll.0290 = phi i64* [ > %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ] > LV: Found uniform instruction: %incdec.ptr332 = getelementptr > inbounds i64, i64* %ll.0290, i64 4 > LV: Found uniform instruction: %rl.0289 = phi i64* [ > %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ] > LV: Found uniform instruction: %incdec.ptr333 = getelementptr > inbounds i64, i64* %rl.0289, i64 4 > LV: Found uniform instruction: %len.0288 = phi i32 [ %dec, > %while.body320 ], [ %conv316, %while.body320.preheader ] > LV: Found uniform instruction: %dec = add nsw i32 %len.0288, -1 > LV: Found trip count: 0 > LV: The 
Smallest and Widest types: 64 / 64 bits. > LV: The Widest register is: 128 bits. > LV: Found an estimated cost of 0 for VF 1 For instruction: > %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, > %while.body320.preheader ] > LV: Found an estimated cost of 0 for VF 1 For instruction: > %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, > %while.body320.preheader ] > LV: Found an estimated cost of 0 for VF 1 For instruction: > %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, > %while.body320.preheader ] > LV: Found an estimated cost of 0 for VF 1 For instruction: > %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, > %while.body320.preheader ] > LV: Found an estimated cost of 1 for VF 1 For instruction: %dec > add nsw i32 %len.0288, -1 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1 > LV: Found an estimated cost of 1 for VF 1 For instruction: %176 > load i64, i64* %ll.0290, align 8 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1 > LV: Found an estimated cost of 1 for VF 1 For instruction: %177 > load i64, i64* %rl.0289, align 8 > LV: Found an estimated cost of 1 for VF 1 For instruction: > %and322 = and i64 %177, %176 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1 > LV: Found an estimated cost of 1 for VF 1 For instruction: store > i64 %and322, i64* %dl.0291, align 8 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2 > LV: Found an estimated cost of 1 for VF 1 For instruction: %178 > load i64, i64* %incdec.ptr, align 8 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2 > LV: Found an estimated cost of 1 for VF 1 For instruction: 
%179 > load i64, i64* %incdec.ptr321, align 8 > LV: Found an estimated cost of 1 for VF 1 For instruction: > %and326 = and i64 %179, %178 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2 > LV: Found an estimated cost of 1 for VF 1 For instruction: store > i64 %and326, i64* %incdec.ptr323, align 8 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3 > LV: Found an estimated cost of 1 for VF 1 For instruction: %180 > load i64, i64* %incdec.ptr324, align 8 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3 > LV: Found an estimated cost of 1 for VF 1 For instruction: %181 > load i64, i64* %incdec.ptr325, align 8 > LV: Found an estimated cost of 1 for VF 1 For instruction: > %and330 = and i64 %181, %180 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3 > LV: Found an estimated cost of 1 for VF 1 For instruction: store > i64 %and330, i64* %incdec.ptr327, align 8 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4 > LV: Found an estimated cost of 1 for VF 1 For instruction: %182 > load i64, i64* %incdec.ptr328, align 8 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4 > LV: Found an estimated cost of 1 for VF 1 For instruction: %183 > load i64, i64* %incdec.ptr329, align 8 > LV: Found an estimated cost of 1 for VF 1 For instruction: > %and334 = and i64 %183, %182 > LV: Found an estimated cost of 0 for VF 1 For instruction: > %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4 > LV: Found an estimated cost of 1 for VF 1 For instruction: store > i64 %and334, i64* %incdec.ptr331, align 8 > LV: Found 
an estimated cost of 1 for VF 1 For instruction: > %tobool319 = icmp eq i32 %dec, 0 > LV: Found an estimated cost of 0 for VF 1 For instruction: br i1 > %tobool319, label %sw.epilog381.loopexit, label %while.body320 > LV: Scalar loop costs: 18. > LV: Found an estimated cost of 0 for VF 2 For instruction: > %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, > %while.body320.preheader ] > LV: Found an estimated cost of 0 for VF 2 For instruction: > %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, > %while.body320.preheader ] > LV: Found an estimated cost of 0 for VF 2 For instruction: > %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, > %while.body320.preheader ] > LV: Found an estimated cost of 0 for VF 2 For instruction: > %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, > %while.body320.preheader ] > LV: Found an estimated cost of 1 for VF 2 For instruction: %dec > add nsw i32 %len.0288, -1 > LV: Found an estimated cost of 0 for VF 2 For instruction: > %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1 > LV: Found an estimated cost of 4 for VF 2 For instruction: %176 > load i64, i64* %ll.0290, align 8 > LV: Found an estimated cost of 0 for VF 2 For instruction: > %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1 > LV: Found an estimated cost of 4 for VF 2 For instruction: %177 > load i64, i64* %rl.0289, align 8 > LV: Found an estimated cost of 1 for VF 2 For instruction: > %and322 = and i64 %177, %176 > LV: Found an estimated cost of 0 for VF 2 For instruction: > %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1 > LV: Found an estimated cost of 0 for VF 2 For instruction: store > i64 %and322, i64* %dl.0291, align 8 > LV: Found an estimated cost of 0 for VF 2 For instruction: > %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2 > LV: Found an estimated cost of 0 for VF 2 For instruction: %178 > load i64, i64* %incdec.ptr, align 8 > LV: Found an estimated cost 
of 0 for VF 2 For instruction: %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction: %179 = load i64, i64* %incdec.ptr321, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction: %and326 = and i64 %179, %178
> LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction: store i64 %and326, i64* %incdec.ptr323, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction: %180 = load i64, i64* %incdec.ptr324, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction: %181 = load i64, i64* %incdec.ptr325, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction: %and330 = and i64 %181, %180
> LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction: store i64 %and330, i64* %incdec.ptr327, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> LV: Found an estimated cost of 0 for VF 2 For instruction: %182 = load i64, i64* %incdec.ptr328, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> LV: Found an estimated cost of 0 for VF 2 For instruction: %183 = load i64, i64* %incdec.ptr329, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction: %and334 = and i64 %183, %182
> LV: Found an estimated cost of 0 for VF 2 For instruction: %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> LV: Found an estimated cost of 4 for VF 2 For instruction: store i64 %and334, i64* %incdec.ptr331, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction: %tobool319 = icmp eq i32 %dec, 0
> LV: Found an estimated cost of 0 for VF 2 For instruction: br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320
> LV: Vector loop of width 2 costs: 9.
> LV: Selecting VF: 2.
> LV: The target has 32 registers
> LV(REG): Calculating max register usage:
> LV(REG): At #0 Interval # 0
> LV(REG): At #1 Interval # 1
> LV(REG): At #2 Interval # 2
> LV(REG): At #3 Interval # 3
> LV(REG): At #4 Interval # 4
> LV(REG): At #5 Interval # 4
> LV(REG): At #6 Interval # 5
> LV(REG): At #7 Interval # 6
> LV(REG): At #8 Interval # 7
> LV(REG): At #9 Interval # 8
> LV(REG): At #10 Interval # 7
> LV(REG): At #12 Interval # 7
> LV(REG): At #13 Interval # 8
> LV(REG): At #14 Interval # 8
> LV(REG): At #15 Interval # 9
> LV(REG): At #16 Interval # 9
> LV(REG): At #17 Interval # 8
> LV(REG): At #19 Interval # 7
> LV(REG): At #20 Interval # 8
> LV(REG): At #21 Interval # 8
> LV(REG): At #22 Interval # 9
> LV(REG): At #23 Interval # 9
> LV(REG): At #24 Interval # 8
> LV(REG): At #26 Interval # 7
> LV(REG): At #27 Interval # 7
> LV(REG): At #28 Interval # 7
> LV(REG): At #29 Interval # 7
> LV(REG): At #30 Interval # 7
> LV(REG): At #31 Interval # 6
> LV(REG): At #33 Interval # 5
> LV(REG): VF = 2
> LV(REG): Found max usage: 2
> LV(REG): Found invariant usage: 4
> LV(REG): LoopSize: 35
> LV: Loop cost is 18
> LV: Interleaving to reduce branch cost.
> LV: Interleaving is not beneficial.
> LV: Found a vectorizable loop (2) in do_vop.bc
> LV: Interleaving disabled by the pass manager
> LV: Scalarizing: %dec = add nsw i32 %len.0288, -1
> LV: Scalarizing: %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
> LV: Scalarizing: %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> LV: Scalarizing: %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> LV: Scalarizing: %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> LV: Scalarizing: %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> LV: Scalarizing: %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> LV: Scalarizing: %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> LV: Scalarizing: %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> LV: Scalarizing: %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> LV: Scalarizing: %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> LV: Scalarizing: %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> LV: Scalarizing: %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> LV: Scalarizing: %tobool319 = icmp eq i32 %dec, 0
>
> vectorized loop (vectorization width: 2, interleaved count: 1)
>
> Loop after vectorize pass:
>
> vector.body419: ; preds = %vector.body419, %vector.ph440
>   %index441 = phi i64 [ 0, %vector.ph440 ], [ %index.next442, %vector.body419 ]
>   %184 = add i64 %index441, 0
>   %185 = shl i64 %184, 2
>   %next.gep453 = getelementptr i64, i64* %73, i64 %185
>   %186 = add i64 %index441, 0
>   %187 = shl i64 %186, 2
>   %next.gep454 = getelementptr i64, i64* %74, i64 %187
>   %188 = add i64 %index441, 0
>   %189 = shl i64 %188, 2
>   %next.gep455 = getelementptr i64, i64* %75, i64 %189
>   %190 = trunc i64 %index441 to i32
>   %offset.idx456 = sub i32 %conv316, %190
>   %broadcast.splatinsert457 = insertelement <2 x i32> undef, i32 %offset.idx456, i32 0
>   %broadcast.splat458 = shufflevector <2 x i32> %broadcast.splatinsert457, <2 x i32> undef, <2 x i32> zeroinitializer
>   %induction459 = add <2 x i32> %broadcast.splat458, <i32 0, i32 -1>
>   %191 = add i32 %offset.idx456, 0
>   %192 = add nsw i32 %191, -1
>   %193 = getelementptr inbounds i64, i64* %next.gep454, i64 1
>   %194 = getelementptr i64, i64* %next.gep454, i32 0
>   %195 = bitcast i64* %194 to <8 x i64>*
>   %wide.vec460 = load <8 x i64>, <8 x i64>* %195, align 8, !alias.scope !21
>   %strided.vec461 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>   %strided.vec462 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>   %strided.vec463 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>   %strided.vec464 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>   %196 = getelementptr inbounds i64, i64* %next.gep455, i64 1
>   %197 = getelementptr i64, i64* %next.gep455, i32 0
>   %198 = bitcast i64* %197 to <8 x i64>*
>   %wide.vec465 = load <8 x i64>, <8 x i64>* %198, align 8, !alias.scope !24
>   %strided.vec466 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>   %strided.vec467 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>   %strided.vec468 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>   %strided.vec469 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>   %199 = and <2 x i64> %strided.vec466, %strided.vec461
>   %200 = getelementptr inbounds i64, i64* %next.gep453, i64 1
>   %201 = getelementptr inbounds i64, i64* %next.gep454, i64 2
>   %202 = getelementptr inbounds i64, i64* %next.gep455, i64 2
>   %203 = and <2 x i64> %strided.vec467, %strided.vec462
>   %204 = getelementptr inbounds i64, i64* %next.gep453, i64 2
>   %205 = getelementptr inbounds i64, i64* %next.gep454, i64 3
>   %206 = getelementptr inbounds i64, i64* %next.gep455, i64 3
>   %207 = and <2 x i64> %strided.vec468, %strided.vec463
>   %208 = getelementptr inbounds i64, i64* %next.gep453, i64 3
>   %209 = getelementptr inbounds i64, i64* %next.gep454, i64 4
>   %210 = getelementptr inbounds i64, i64* %next.gep455, i64 4
>   %211 = and <2 x i64> %strided.vec469, %strided.vec464
>   %212 = getelementptr inbounds i64, i64* %next.gep453, i64 4
>   %213 = getelementptr i64, i64* %208, i32 -3
>   %214 = bitcast i64* %213 to <8 x i64>*
>   %215 = shufflevector <2 x i64> %199, <2 x i64> %203, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %216 = shufflevector <2 x i64> %207, <2 x i64> %211, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %217 = shufflevector <4 x i64> %215, <4 x i64> %216, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
>   %interleaved.vec470 = shufflevector <8 x i64> %217, <8 x i64> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
>   store <8 x i64> %interleaved.vec470, <8 x i64>* %214, align 8, !alias.scope !26, !noalias !28
>   %218 = icmp eq i32 %192, 0
>   %index.next442 = add i64 %index441, 2
>   %219 = icmp eq i64 %index.next442, %n.vec425
>   br i1 %219, label %middle.block420, label %vector.body419, !llvm.loop !29
>
> Loop after instruction combining:
>
> vector.body419: ; preds = %vector.body419, %vector.body419.preheader
>   %lsr.iv62 = phi i8* [ %scevgep63, %vector.body419 ], [ %dc.2, %vector.body419.preheader ]
>   %lsr.iv59 = phi i8* [ %scevgep60, %vector.body419 ], [ %cond, %vector.body419.preheader ]
>   %lsr.iv56 = phi i8* [ %scevgep57, %vector.body419 ], [ %cond57, %vector.body419.preheader ]
>   %lsr.iv54 = phi i64 [ %lsr.iv.next55, %vector.body419 ], [ %n.vec425, %vector.body419.preheader ]
>   %lsr.iv6264 = bitcast i8* %lsr.iv62 to <8 x i64>*
>   %lsr.iv5961 = bitcast i8* %lsr.iv59 to <8 x i64>*
>   %lsr.iv5658 = bitcast i8* %lsr.iv56 to <8 x i64>*
>   %wide.vec460 = load <8 x i64>, <8 x i64>* %lsr.iv5961, align 8, !alias.scope !21
>   %wide.vec465 = load <8 x i64>, <8 x i64>* %lsr.iv5658, align 8, !alias.scope !24
>   %179 = and <8 x i64> %wide.vec465, %wide.vec460
>   %180 = shufflevector <8 x i64> %179, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
>   %181 = and <8 x i64> %wide.vec465, %wide.vec460
>   %182 = shufflevector <8 x i64> %181, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
>   %183 = and <8 x i64> %wide.vec465, %wide.vec460
>   %184 = shufflevector <8 x i64> %183, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
>   %185 = and <8 x i64> %wide.vec465, %wide.vec460
>   %186 = shufflevector <8 x i64> %185, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
>   %187 = shufflevector <2 x i64> %180, <2 x i64> %182, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %188 = shufflevector <2 x i64> %184, <2 x i64> %186, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
>   %interleaved.vec470 = shufflevector <4 x i64> %187, <4 x i64> %188, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
>   store <8 x i64> %interleaved.vec470, <8 x i64>* %lsr.iv6264, align 8, !alias.scope !26, !noalias !28
>   %lsr.iv.next55 = add i64 %lsr.iv54, -2
>   %scevgep57 = getelementptr i8, i8* %lsr.iv56, i64 64
>   %scevgep60 = getelementptr i8, i8* %lsr.iv59, i64 64
>   %scevgep63 = getelementptr i8, i8* %lsr.iv62, i64 64
>   %189 = icmp eq i64 %lsr.iv.next55, 0
>   br i1 %189, label %middle.block420, label %vector.body419, !llvm.loop !29
>
> Final vectorized loop:
>
> .LBB0_141: # %vector.body419
>            # =>This Inner Loop Header: Depth=1
>   vl %v0, 48(%r8)
>   vl %v1, 48(%r7)
>   vn %v0, %v1, %v0
>   vl %v1, 16(%r8)
>   vl %v2, 16(%r7)
>   vn %v1, %v2, %v1
>   vmrlg %v2, %v1, %v0
>   vmrhg %v0, %v1, %v0
>   vmrlg %v1, %v0, %v2
>   vst %v1, 48(%r9)
>   vl %v1, 32(%r8)
>   vl %v3, 32(%r7)
>   vn %v1, %v3, %v1
>   vl %v3, 0(%r8)
>   vl %v4, 0(%r7)
>   vn %v3, %v4, %v3
>   vmrlg %v4, %v3, %v1
>   vmrhg %v1, %v3, %v1
>   vmrlg %v3, %v1, %v4
>   vst %v3, 32(%r9)
>   vmrhg %v0, %v0, %v2
>   vst %v0, 16(%r9)
>   vmrhg %v0, %v1, %v4
>   vst %v0, 0(%r9)
>   la %r9, 64(%r9)
>   la %r8, 64(%r8)
>   la %r7, 64(%r7)
>   aghi %r13, -2
>   jne .LBB0_141
>
> Final scalar loop:
>
> .LBB0_152: # %while.body320
>            # =>This Inner Loop Header: Depth=1
>   lg %r13, 0(%r14)
>   ng %r13, 0(%r5)
>   stg %r13, 0(%r4)
>   lg %r13, 8(%r14)
>   ng %r13, 8(%r5)
>   stg %r13, 8(%r4)
>   lg %r13, 16(%r14)
>   ng %r13, 16(%r5)
>   stg %r13, 16(%r4)
>   lg %r13, 24(%r14)
>   ng %r13, 24(%r5)
>   stg %r13, 24(%r4)
>   la %r4, 32(%r4)
>   la %r14, 32(%r14)
>   la %r5, 32(%r5)
>   brct %r0, .LBB0_152
>   j .LBB0_155
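
The redundancy in the quoted IR can be checked by composing the shuffle masks by hand. What follows is only a minimal Python sketch, assuming a simplified model of `shufflevector` in which lanes are list elements and the mask indexes into the concatenation of both operands. The `shuffle` helper and the variable names are hypothetical and are not part of LLVM. The element-wise `and` is left out because it acts on each lane independently and so does not change which lane ends up where.

```python
def shuffle(a, b, mask):
    """Simplified model of LLVM's shufflevector: pick lanes from concat(a, b)."""
    src = list(a) + list(b)
    return [src[i] for i in mask]

# Lane numbers of the <8 x i64> wide load; undef lanes are modelled as None.
wide = list(range(8))
undef8 = [None] * 8

# De-interleave with stride 4, as in %strided.vec461 .. %strided.vec464.
strided = [shuffle(wide, undef8, [k, k + 4]) for k in range(4)]

# Concatenate pairwise, as in %215 and %216 (or %187 and %188 after instcombine).
lo = shuffle(strided[0], strided[1], [0, 1, 2, 3])
hi = shuffle(strided[2], strided[3], [0, 1, 2, 3])

# Re-interleave with the mask used for %interleaved.vec470.
interleaved = shuffle(lo, hi, [0, 2, 4, 6, 1, 3, 5, 7])

# Every lane returns to its original position, so the sequence is an identity.
assert interleaved == wide
```

Under this model the store operand is lane-for-lane the same as an `and` of the two wide loads, which is the simplification the original message asks the optimizers to find.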