thr3ads.net - llvm dev - [llvm-dev] LoopVectorizer -- generating bad and unhandled shufflevector sequence [Oct 2016]


Jonas Paulsson via llvm-dev

2016-Oct-06 14:30 UTC

[llvm-dev] LoopVectorizer -- generating bad and unhandled shufflevector sequence

Hi,

I have experimented with enabling the LoopVectorizer for SystemZ, and I
have come across a loop which, when vectorized, seems to have been poorly
generated. In short, there is a completely unnecessary sequence of
shufflevector instructions that doesn't get optimized away anywhere.
In other words, the vector is de-interleaved and then re-interleaved,
which leads straight back to the original vector:

        [0 1 2 3 4 5 6 7]

  [0 4]   [1 5]   [2 6]   [3 7]

    [0 4 1 5]       [2 6 3 7]

        [0 1 2 3 4 5 6 7]
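
The round trip in the diagram can be checked with a small model of shufflevector. This is only a sketch in Python, not LLVM code: the helper `shufflevector` and the `UNDEF` placeholder are made up for illustration, and the masks are copied from the vectorized IR further down. Because `and` is lane-wise, the single vector `wide` stands in for the combined result of the two wide loads.

```python
def shufflevector(a, b, mask):
    """Model of LLVM's shufflevector: index into the concatenation of a and b."""
    both = list(a) + list(b)
    return [both[i] for i in mask]

UNDEF = [None] * 8      # stand-in for an undef second operand
wide = list(range(8))   # lane numbers of the <8 x i64> value

# De-interleave with stride 4, as the four strided.vec shuffles do.
strided = [shufflevector(wide, UNDEF, [i, i + 4]) for i in range(4)]

# Concatenate the pieces pairwise, then join the two halves.
lo = shufflevector(strided[0], strided[1], [0, 1, 2, 3])
hi = shufflevector(strided[2], strided[3], [0, 1, 2, 3])
concat = shufflevector(lo, hi, list(range(8)))

# Re-interleave with the mask used for %interleaved.vec470.
result = shufflevector(concat, UNDEF, [0, 2, 4, 6, 1, 3, 5, 7])

print(strided)   # the four two-lane pieces
print(lo, hi)    # the two four-lane halves
print(result)    # the stored vector
```

Under these assumptions `result` comes out equal to `wide`, so the whole sequence is an identity permutation of the input.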

Is this something the instruction combiner, or perhaps the
InterleavedAccess pass, should handle? Even though I suspect that many
SystemZ target hooks currently return bad values, this seems like
something that the optimizers should handle regardless. The result is
a number of unnecessary target instructions, as can be seen at the
bottom.

I would appreciate any input on this, and if needed I can supply a test 
case.

/Jonas


Loop before vectorize pass:

while.body320:                                    ; preds = %while.body320.preheader, %while.body320
   %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ]
   %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ]
   %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ]
   %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
   %dec = add nsw i32 %len.0288, -1
   %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
   %176 = load i64, i64* %ll.0290, align 8
   %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
   %177 = load i64, i64* %rl.0289, align 8
   %and322 = and i64 %177, %176
   %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
   store i64 %and322, i64* %dl.0291, align 8
   %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
   %178 = load i64, i64* %incdec.ptr, align 8
   %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
   %179 = load i64, i64* %incdec.ptr321, align 8
   %and326 = and i64 %179, %178
   %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
   store i64 %and326, i64* %incdec.ptr323, align 8
   %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
   %180 = load i64, i64* %incdec.ptr324, align 8
   %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
   %181 = load i64, i64* %incdec.ptr325, align 8
   %and330 = and i64 %181, %180
   %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
   store i64 %and330, i64* %incdec.ptr327, align 8
   %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
   %182 = load i64, i64* %incdec.ptr328, align 8
   %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
   %183 = load i64, i64* %incdec.ptr329, align 8
   %and334 = and i64 %183, %182
   %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
   store i64 %and334, i64* %incdec.ptr331, align 8
   %tobool319 = icmp eq i32 %dec, 0
   br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320


Vectorizing:

LV: Checking a loop in "Perl_do_vop" from do_vop.bc
LV: Loop hints: force=? width=0 unroll=0
LV: Found a loop: while.body320
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found an induction variable.
LV: Found an induction variable.
LV: Did not find one integer induction var.
LV: We can vectorize this loop (with a runtime bound check)!
LV: Analyzing interleaved accesses...
LV: Creating an interleave group with:  store i64 %and334, i64* 
%incdec.ptr331, align 8
LV: Inserted:  store i64 %and330, i64* %incdec.ptr327, align 8
     into the interleave group with  store i64 %and334, i64* 
%incdec.ptr331, align 8
LV: Inserted:  store i64 %and326, i64* %incdec.ptr323, align 8
     into the interleave group with  store i64 %and334, i64* 
%incdec.ptr331, align 8
LV: Inserted:  store i64 %and322, i64* %dl.0291, align 8
     into the interleave group with  store i64 %and334, i64* 
%incdec.ptr331, align 8
LV: Creating an interleave group with:  %183 = load i64, i64* 
%incdec.ptr329, align 8
LV: Inserted:  %181 = load i64, i64* %incdec.ptr325, align 8
     into the interleave group with  %183 = load i64, i64* 
%incdec.ptr329, align 8
LV: Inserted:  %179 = load i64, i64* %incdec.ptr321, align 8
     into the interleave group with  %183 = load i64, i64* 
%incdec.ptr329, align 8
LV: Inserted:  %177 = load i64, i64* %rl.0289, align 8
     into the interleave group with  %183 = load i64, i64* 
%incdec.ptr329, align 8
LV: Creating an interleave group with:  %182 = load i64, i64* 
%incdec.ptr328, align 8
LV: Inserted:  %180 = load i64, i64* %incdec.ptr324, align 8
     into the interleave group with  %182 = load i64, i64* 
%incdec.ptr328, align 8
LV: Inserted:  %178 = load i64, i64* %incdec.ptr, align 8
     into the interleave group with  %182 = load i64, i64* 
%incdec.ptr328, align 8
LV: Inserted:  %176 = load i64, i64* %ll.0290, align 8
     into the interleave group with  %182 = load i64, i64* 
%incdec.ptr328, align 8
LV: Found uniform instruction:   %tobool319 = icmp eq i32 %dec, 0
LV: Found uniform instruction:   %incdec.ptr324 = getelementptr inbounds 
i64, i64* %ll.0290, i64 2
LV: Found uniform instruction:   %incdec.ptr329 = getelementptr inbounds 
i64, i64* %rl.0289, i64 3
LV: Found uniform instruction:   %incdec.ptr323 = getelementptr inbounds 
i64, i64* %dl.0291, i64 1
LV: Found uniform instruction:   %incdec.ptr328 = getelementptr inbounds 
i64, i64* %ll.0290, i64 3
LV: Found uniform instruction:   %incdec.ptr321 = getelementptr inbounds 
i64, i64* %rl.0289, i64 1
LV: Found uniform instruction:   %incdec.ptr327 = getelementptr inbounds 
i64, i64* %dl.0291, i64 2
LV: Found uniform instruction:   %incdec.ptr325 = getelementptr inbounds 
i64, i64* %rl.0289, i64 2
LV: Found uniform instruction:   %incdec.ptr331 = getelementptr inbounds 
i64, i64* %dl.0291, i64 3
LV: Found uniform instruction:   %incdec.ptr = getelementptr inbounds 
i64, i64* %ll.0290, i64 1
LV: Found uniform instruction:   %dl.0291 = phi i64* [ %incdec.ptr335, 
%while.body320 ], [ %73, %while.body320.preheader ]
LV: Found uniform instruction:   %incdec.ptr335 = getelementptr inbounds 
i64, i64* %dl.0291, i64 4
LV: Found uniform instruction:   %ll.0290 = phi i64* [ %incdec.ptr332, 
%while.body320 ], [ %74, %while.body320.preheader ]
LV: Found uniform instruction:   %incdec.ptr332 = getelementptr inbounds 
i64, i64* %ll.0290, i64 4
LV: Found uniform instruction:   %rl.0289 = phi i64* [ %incdec.ptr333, 
%while.body320 ], [ %75, %while.body320.preheader ]
LV: Found uniform instruction:   %incdec.ptr333 = getelementptr inbounds 
i64, i64* %rl.0289, i64 4
LV: Found uniform instruction:   %len.0288 = phi i32 [ %dec, 
%while.body320 ], [ %conv316, %while.body320.preheader ]
LV: Found uniform instruction:   %dec = add nsw i32 %len.0288, -1
LV: Found trip count: 0
LV: The Smallest and Widest types: 64 / 64 bits.
LV: The Widest register is: 128 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction:   %dl.0291 = 
phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %ll.0290 = 
phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %rl.0289 = 
phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %len.0288 = 
phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
LV: Found an estimated cost of 1 for VF 1 For instruction:   %dec = add 
nsw i32 %len.0288, -1
LV: Found an estimated cost of 0 for VF 1 For instruction:   %incdec.ptr 
= getelementptr inbounds i64, i64* %ll.0290, i64 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %176 = load 
i64, i64* %ll.0290, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %177 = load 
i64, i64* %rl.0289, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %and322 = 
and i64 %177, %176
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64 
%and322, i64* %dl.0291, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
LV: Found an estimated cost of 1 for VF 1 For instruction:   %178 = load 
i64, i64* %incdec.ptr, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
LV: Found an estimated cost of 1 for VF 1 For instruction:   %179 = load 
i64, i64* %incdec.ptr321, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %and326 = 
and i64 %179, %178
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64 
%and326, i64* %incdec.ptr323, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
LV: Found an estimated cost of 1 for VF 1 For instruction:   %180 = load 
i64, i64* %incdec.ptr324, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
LV: Found an estimated cost of 1 for VF 1 For instruction:   %181 = load 
i64, i64* %incdec.ptr325, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %and330 = 
and i64 %181, %180
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64 
%and330, i64* %incdec.ptr327, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
LV: Found an estimated cost of 1 for VF 1 For instruction:   %182 = load 
i64, i64* %incdec.ptr328, align 8
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
LV: Found an estimated cost of 1 for VF 1 For instruction:   %183 = load 
i64, i64* %incdec.ptr329, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %and334 = 
and i64 %183, %182
LV: Found an estimated cost of 0 for VF 1 For instruction:   
%incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64 
%and334, i64* %incdec.ptr331, align 8
LV: Found an estimated cost of 1 for VF 1 For instruction:   %tobool319 
= icmp eq i32 %dec, 0
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 
%tobool319, label %sw.epilog381.loopexit, label %while.body320
LV: Scalar loop costs: 18.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %dl.0291 = 
phi i64* [ %incdec.ptr335, %while.body320 ], [ %73, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %ll.0290 = 
phi i64* [ %incdec.ptr332, %while.body320 ], [ %74, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %rl.0289 = 
phi i64* [ %incdec.ptr333, %while.body320 ], [ %75, 
%while.body320.preheader ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %len.0288 = 
phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
LV: Found an estimated cost of 1 for VF 2 For instruction:   %dec = add 
nsw i32 %len.0288, -1
LV: Found an estimated cost of 0 for VF 2 For instruction:   %incdec.ptr 
= getelementptr inbounds i64, i64* %ll.0290, i64 1
LV: Found an estimated cost of 4 for VF 2 For instruction:   %176 = load 
i64, i64* %ll.0290, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
LV: Found an estimated cost of 4 for VF 2 For instruction:   %177 = load 
i64, i64* %rl.0289, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %and322 = 
and i64 %177, %176
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64 
%and322, i64* %dl.0291, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
LV: Found an estimated cost of 0 for VF 2 For instruction:   %178 = load 
i64, i64* %incdec.ptr, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
LV: Found an estimated cost of 0 for VF 2 For instruction:   %179 = load 
i64, i64* %incdec.ptr321, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %and326 = 
and i64 %179, %178
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64 
%and326, i64* %incdec.ptr323, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
LV: Found an estimated cost of 0 for VF 2 For instruction:   %180 = load 
i64, i64* %incdec.ptr324, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
LV: Found an estimated cost of 0 for VF 2 For instruction:   %181 = load 
i64, i64* %incdec.ptr325, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %and330 = 
and i64 %181, %180
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64 
%and330, i64* %incdec.ptr327, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
LV: Found an estimated cost of 0 for VF 2 For instruction:   %182 = load 
i64, i64* %incdec.ptr328, align 8
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
LV: Found an estimated cost of 0 for VF 2 For instruction:   %183 = load 
i64, i64* %incdec.ptr329, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %and334 = 
and i64 %183, %182
LV: Found an estimated cost of 0 for VF 2 For instruction:   
%incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
LV: Found an estimated cost of 4 for VF 2 For instruction:   store i64 
%and334, i64* %incdec.ptr331, align 8
LV: Found an estimated cost of 1 for VF 2 For instruction:   %tobool319 
= icmp eq i32 %dec, 0
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 
%tobool319, label %sw.epilog381.loopexit, label %while.body320
LV: Vector loop of width 2 costs: 9.
LV: Selecting VF: 2.
LV: The target has 32 registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #3 Interval # 3
LV(REG): At #4 Interval # 4
LV(REG): At #5 Interval # 4
LV(REG): At #6 Interval # 5
LV(REG): At #7 Interval # 6
LV(REG): At #8 Interval # 7
LV(REG): At #9 Interval # 8
LV(REG): At #10 Interval # 7
LV(REG): At #12 Interval # 7
LV(REG): At #13 Interval # 8
LV(REG): At #14 Interval # 8
LV(REG): At #15 Interval # 9
LV(REG): At #16 Interval # 9
LV(REG): At #17 Interval # 8
LV(REG): At #19 Interval # 7
LV(REG): At #20 Interval # 8
LV(REG): At #21 Interval # 8
LV(REG): At #22 Interval # 9
LV(REG): At #23 Interval # 9
LV(REG): At #24 Interval # 8
LV(REG): At #26 Interval # 7
LV(REG): At #27 Interval # 7
LV(REG): At #28 Interval # 7
LV(REG): At #29 Interval # 7
LV(REG): At #30 Interval # 7
LV(REG): At #31 Interval # 6
LV(REG): At #33 Interval # 5
LV(REG): VF = 2
LV(REG): Found max usage: 2
LV(REG): Found invariant usage: 4
LV(REG): LoopSize: 35
LV: Loop cost is 18
LV: Interleaving to reduce branch cost.
LV: Interleaving is not beneficial.
LV: Found a vectorizable loop (2) in do_vop.bc
LV: Interleaving disabled by the pass manager
LV: Scalarizing:  %dec = add nsw i32 %len.0288, -1
LV: Scalarizing:  %incdec.ptr = getelementptr inbounds i64, i64* 
%ll.0290, i64 1
LV: Scalarizing:  %incdec.ptr321 = getelementptr inbounds i64, i64* 
%rl.0289, i64 1
LV: Scalarizing:  %incdec.ptr323 = getelementptr inbounds i64, i64* 
%dl.0291, i64 1
LV: Scalarizing:  %incdec.ptr324 = getelementptr inbounds i64, i64* 
%ll.0290, i64 2
LV: Scalarizing:  %incdec.ptr325 = getelementptr inbounds i64, i64* 
%rl.0289, i64 2
LV: Scalarizing:  %incdec.ptr327 = getelementptr inbounds i64, i64* 
%dl.0291, i64 2
LV: Scalarizing:  %incdec.ptr328 = getelementptr inbounds i64, i64* 
%ll.0290, i64 3
LV: Scalarizing:  %incdec.ptr329 = getelementptr inbounds i64, i64* 
%rl.0289, i64 3
LV: Scalarizing:  %incdec.ptr331 = getelementptr inbounds i64, i64* 
%dl.0291, i64 3
LV: Scalarizing:  %incdec.ptr332 = getelementptr inbounds i64, i64* 
%ll.0290, i64 4
LV: Scalarizing:  %incdec.ptr333 = getelementptr inbounds i64, i64* 
%rl.0289, i64 4
LV: Scalarizing:  %incdec.ptr335 = getelementptr inbounds i64, i64* 
%dl.0291, i64 4
LV: Scalarizing:  %tobool319 = icmp eq i32 %dec, 0

vectorized loop (vectorization width: 2, interleaved count: 1)
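
The two totals in the log ("Scalar loop costs: 18" and "Vector loop of width 2 costs: 9") can be reproduced from the per-instruction estimates. The sketch below is plain arithmetic in Python, not vectorizer code. The dictionaries are a hand-made summary of the nonzero "Found an estimated cost" lines, and dividing the summed cost by the VF is an assumption that happens to match the printed figures.

```python
# Nonzero per-instruction costs summarized from the log above.
# Scalar (VF 1): one add, eight loads, four ands, four stores, one icmp, each cost 1.
scalar_costs = {"add": 1, "load": 8 * 1, "and": 4 * 1, "store": 4 * 1, "icmp": 1}

# VF 2: each interleave group is charged once (cost 4) on one member,
# and the other members of the group are charged 0.
vf2_costs = {"add": 1, "load": 2 * 4, "and": 4 * 1, "store": 1 * 4, "icmp": 1}

def loop_cost(costs, vf):
    """Sum the instruction costs and normalize per scalar iteration (assumed: divide by VF)."""
    return sum(costs.values()) // vf

print(loop_cost(scalar_costs, 1))
print(loop_cost(vf2_costs, 2))
```

Both bodies sum to 18, so the vector loop looks twice as cheap per iteration only because of the division by 2. Note also that the interleave groups are charged a flat 4 each, which does not obviously account for the extra shuffles that survive into the final code.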

Loop after vectorize pass:

vector.body419:                                   ; preds = %vector.body419, %vector.ph440
   %index441 = phi i64 [ 0, %vector.ph440 ], [ %index.next442, %vector.body419 ]
   %184 = add i64 %index441, 0
   %185 = shl i64 %184, 2
   %next.gep453 = getelementptr i64, i64* %73, i64 %185
   %186 = add i64 %index441, 0
   %187 = shl i64 %186, 2
   %next.gep454 = getelementptr i64, i64* %74, i64 %187
   %188 = add i64 %index441, 0
   %189 = shl i64 %188, 2
   %next.gep455 = getelementptr i64, i64* %75, i64 %189
   %190 = trunc i64 %index441 to i32
   %offset.idx456 = sub i32 %conv316, %190
   %broadcast.splatinsert457 = insertelement <2 x i32> undef, i32 %offset.idx456, i32 0
   %broadcast.splat458 = shufflevector <2 x i32> %broadcast.splatinsert457, <2 x i32> undef, <2 x i32> zeroinitializer
   %induction459 = add <2 x i32> %broadcast.splat458, <i32 0, i32 -1>
   %191 = add i32 %offset.idx456, 0
   %192 = add nsw i32 %191, -1
   %193 = getelementptr inbounds i64, i64* %next.gep454, i64 1
   %194 = getelementptr i64, i64* %next.gep454, i32 0
   %195 = bitcast i64* %194 to <8 x i64>*
   %wide.vec460 = load <8 x i64>, <8 x i64>* %195, align 8, !alias.scope !21
   %strided.vec461 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
   %strided.vec462 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
   %strided.vec463 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
   %strided.vec464 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
   %196 = getelementptr inbounds i64, i64* %next.gep455, i64 1
   %197 = getelementptr i64, i64* %next.gep455, i32 0
   %198 = bitcast i64* %197 to <8 x i64>*
   %wide.vec465 = load <8 x i64>, <8 x i64>* %198, align 8, !alias.scope !24
   %strided.vec466 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
   %strided.vec467 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
   %strided.vec468 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
   %strided.vec469 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
   %199 = and <2 x i64> %strided.vec466, %strided.vec461
   %200 = getelementptr inbounds i64, i64* %next.gep453, i64 1
   %201 = getelementptr inbounds i64, i64* %next.gep454, i64 2
   %202 = getelementptr inbounds i64, i64* %next.gep455, i64 2
   %203 = and <2 x i64> %strided.vec467, %strided.vec462
   %204 = getelementptr inbounds i64, i64* %next.gep453, i64 2
   %205 = getelementptr inbounds i64, i64* %next.gep454, i64 3
   %206 = getelementptr inbounds i64, i64* %next.gep455, i64 3
   %207 = and <2 x i64> %strided.vec468, %strided.vec463
   %208 = getelementptr inbounds i64, i64* %next.gep453, i64 3
   %209 = getelementptr inbounds i64, i64* %next.gep454, i64 4
   %210 = getelementptr inbounds i64, i64* %next.gep455, i64 4
   %211 = and <2 x i64> %strided.vec469, %strided.vec464
   %212 = getelementptr inbounds i64, i64* %next.gep453, i64 4
   %213 = getelementptr i64, i64* %208, i32 -3
   %214 = bitcast i64* %213 to <8 x i64>*
   %215 = shufflevector <2 x i64> %199, <2 x i64> %203, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %216 = shufflevector <2 x i64> %207, <2 x i64> %211, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %217 = shufflevector <4 x i64> %215, <4 x i64> %216, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
   %interleaved.vec470 = shufflevector <8 x i64> %217, <8 x i64> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
   store <8 x i64> %interleaved.vec470, <8 x i64>* %214, align 8, !alias.scope !26, !noalias !28
   %218 = icmp eq i32 %192, 0
   %index.next442 = add i64 %index441, 2
   %219 = icmp eq i64 %index.next442, %n.vec425
   br i1 %219, label %middle.block420, label %vector.body419, !llvm.loop !29

Loop after instruction combining:

vector.body419:                                   ; preds = %vector.body419, %vector.body419.preheader
   %lsr.iv62 = phi i8* [ %scevgep63, %vector.body419 ], [ %dc.2, %vector.body419.preheader ]
   %lsr.iv59 = phi i8* [ %scevgep60, %vector.body419 ], [ %cond, %vector.body419.preheader ]
   %lsr.iv56 = phi i8* [ %scevgep57, %vector.body419 ], [ %cond57, %vector.body419.preheader ]
   %lsr.iv54 = phi i64 [ %lsr.iv.next55, %vector.body419 ], [ %n.vec425, %vector.body419.preheader ]
   %lsr.iv6264 = bitcast i8* %lsr.iv62 to <8 x i64>*
   %lsr.iv5961 = bitcast i8* %lsr.iv59 to <8 x i64>*
   %lsr.iv5658 = bitcast i8* %lsr.iv56 to <8 x i64>*
   %wide.vec460 = load <8 x i64>, <8 x i64>* %lsr.iv5961, align 8, !alias.scope !21
   %wide.vec465 = load <8 x i64>, <8 x i64>* %lsr.iv5658, align 8, !alias.scope !24
   %179 = and <8 x i64> %wide.vec465, %wide.vec460
   %180 = shufflevector <8 x i64> %179, <8 x i64> undef, <2 x i32> <i32 0, i32 4>
   %181 = and <8 x i64> %wide.vec465, %wide.vec460
   %182 = shufflevector <8 x i64> %181, <8 x i64> undef, <2 x i32> <i32 1, i32 5>
   %183 = and <8 x i64> %wide.vec465, %wide.vec460
   %184 = shufflevector <8 x i64> %183, <8 x i64> undef, <2 x i32> <i32 2, i32 6>
   %185 = and <8 x i64> %wide.vec465, %wide.vec460
   %186 = shufflevector <8 x i64> %185, <8 x i64> undef, <2 x i32> <i32 3, i32 7>
   %187 = shufflevector <2 x i64> %180, <2 x i64> %182, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %188 = shufflevector <2 x i64> %184, <2 x i64> %186, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
   %interleaved.vec470 = shufflevector <4 x i64> %187, <4 x i64> %188, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
   store <8 x i64> %interleaved.vec470, <8 x i64>* %lsr.iv6264, align 8, !alias.scope !26, !noalias !28
   %lsr.iv.next55 = add i64 %lsr.iv54, -2
   %scevgep57 = getelementptr i8, i8* %lsr.iv56, i64 64
   %scevgep60 = getelementptr i8, i8* %lsr.iv59, i64 64
   %scevgep63 = getelementptr i8, i8* %lsr.iv62, i64 64
   %189 = icmp eq i64 %lsr.iv.next55, 0
   br i1 %189, label %middle.block420, label %vector.body419, !llvm.loop !29

Final vectorized loop:

.LBB0_141:                              # %vector.body419
                                         # =>This Inner Loop Header: Depth=1
         vl      %v0, 48(%r8)
         vl      %v1, 48(%r7)
         vn      %v0, %v1, %v0
         vl      %v1, 16(%r8)
         vl      %v2, 16(%r7)
         vn      %v1, %v2, %v1
         vmrlg   %v2, %v1, %v0
         vmrhg   %v0, %v1, %v0
         vmrlg   %v1, %v0, %v2
         vst     %v1, 48(%r9)
         vl      %v1, 32(%r8)
         vl      %v3, 32(%r7)
         vn      %v1, %v3, %v1
         vl      %v3, 0(%r8)
         vl      %v4, 0(%r7)
         vn      %v3, %v4, %v3
         vmrlg   %v4, %v3, %v1
         vmrhg   %v1, %v3, %v1
         vmrlg   %v3, %v1, %v4
         vst     %v3, 32(%r9)
         vmrhg   %v0, %v0, %v2
         vst     %v0, 16(%r9)
         vmrhg   %v0, %v1, %v4
         vst     %v0, 0(%r9)
         la      %r9, 64(%r9)
         la      %r8, 64(%r8)
         la      %r7, 64(%r7)
         aghi    %r13, -2
         jne     .LBB0_141

Final scalar loop:

.LBB0_152:                              # %while.body320
                                         # =>This Inner Loop Header: Depth=1
         lg      %r13, 0(%r14)
         ng      %r13, 0(%r5)
         stg     %r13, 0(%r4)
         lg      %r13, 8(%r14)
         ng      %r13, 8(%r5)
         stg     %r13, 8(%r4)
         lg      %r13, 16(%r14)
         ng      %r13, 16(%r5)
         stg     %r13, 16(%r4)
         lg      %r13, 24(%r14)
         ng      %r13, 24(%r5)
         stg     %r13, 24(%r4)
         la      %r4, 32(%r4)
         la      %r14, 32(%r14)
         la      %r5, 32(%r5)
         brct    %r0, .LBB0_152
         j       .LBB0_155

Matthew Simpson via llvm-dev

2016-Oct-06 18:40 UTC

Hi Jonas,

It does look like we should be able to simplify this. Would you mind filing
a bug? Looking at the code after InstCombine, the vector ands are trivially
redundant (I think EarlyCSE should already be able to remove them). I think
we could then teach InstructionSimplify to simplify the remaining shuffles,
similar to the way it already handles extracts.

-- Matt
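
The kind of shuffle simplification suggested above can be modeled by tracing every lane of the final shuffle back through the nested shuffles to the lane it originally came from. The following is a hypothetical Python sketch, not InstructionSimplify code: `Source`, `Shuffle`, `trace_lane` and `simplify` are invented names, and it assumes the four redundant `and` results have already been merged into one value (as EarlyCSE would do).

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union

@dataclass(frozen=True)
class Source:
    name: str
    lanes: int

@dataclass(frozen=True)
class Shuffle:
    a: "Node"
    b: Optional["Node"]      # None models an undef second operand
    mask: Tuple[int, ...]

Node = Union[Source, Shuffle]

def width(n):
    """Number of lanes produced by a node."""
    return n.lanes if isinstance(n, Source) else len(n.mask)

def trace_lane(n, lane):
    """Follow one lane through nested shuffles; return (source, lane), or None for undef."""
    while isinstance(n, Shuffle):
        idx = n.mask[lane]
        wa = width(n.a)
        if idx < wa:
            n, lane = n.a, idx
        elif n.b is None:
            return None
        else:
            n, lane = n.b, idx - wa
    return (n, lane)

def simplify(n):
    """Return the underlying source if n is an identity permutation of it, else n itself."""
    if not isinstance(n, Shuffle):
        return n
    traced = [trace_lane(n, i) for i in range(width(n))]
    if None in traced:
        return n
    src = traced[0][0]
    if width(src) == width(n) and all(t == (src, i) for i, t in enumerate(traced)):
        return src
    return n

# The shuffle tree from the loop after InstCombine, with a single 'and' result.
x = Source("and_result", 8)
s = [Shuffle(x, None, (i, i + 4)) for i in range(4)]
lo = Shuffle(s[0], s[1], (0, 1, 2, 3))
hi = Shuffle(s[2], s[3], (0, 1, 2, 3))
final = Shuffle(lo, hi, (0, 2, 4, 6, 1, 3, 5, 7))
print(simplify(final) is x)
```

In this model every lane i of `final` traces back to lane i of `x`, so the store could use the `and` result directly and all seven shuffles become dead.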

On Thu, Oct 6, 2016 at 10:30 AM, Jonas Paulsson via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
> Hi,
>
> I have experimented with enabling the LoopVectorizer for SystemZ. I have
> come across a loop which, when vectorized, seems to have been poorly
> generated. In short, there seems to be a completely unnecessary sequence of
> shufflevector instructions, that doesn't get optimized away anywhere.
In
> other words, there is a shuffling so that leads back to the original
vector:
>
>        [0 1 2 3 4 5 6 7]
>
>  [0 4]   [1 5]   [2 6]   [3 7]
>
>    [0 4 1 5]       [2 6 3 7]
>
>        [0 1 2 3 4 5 6 7]
>
> Is this something the instruction combiner, or perhaps the
> InterleavedAccess pass should handle? Even though I suspect that there are
> currently many target hooks for SystemZ with bad values returned, this
> seems like something that the optimizers should handle regardless. The
> result of this is unnecessary target instruction - as can be seen at the
> bottom.
>
> I would appreciate any input on this, and if needed I can supply a test
> case.
>
> /Jonas
>
>
> Loop before vectorize pass:
>
> while.body320:                                    ; preds >
%while.body320.preheader, %while.body320
>   %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
> %while.body320.preheader ]
>   %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
> %while.body320.preheader ]
>   %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
> %while.body320.preheader ]
>   %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316,
> %while.body320.preheader ]
>   %dec = add nsw i32 %len.0288, -1
>   %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
>   %176 = load i64, i64* %ll.0290, align 8
>   %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
>   %177 = load i64, i64* %rl.0289, align 8
>   %and322 = and i64 %177, %176
>   %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
>   store i64 %and322, i64* %dl.0291, align 8
>   %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
>   %178 = load i64, i64* %incdec.ptr, align 8
>   %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
>   %179 = load i64, i64* %incdec.ptr321, align 8
>   %and326 = and i64 %179, %178
>   %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
>   store i64 %and326, i64* %incdec.ptr323, align 8
>   %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
>   %180 = load i64, i64* %incdec.ptr324, align 8
>   %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
>   %181 = load i64, i64* %incdec.ptr325, align 8
>   %and330 = and i64 %181, %180
>   %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
>   store i64 %and330, i64* %incdec.ptr327, align 8
>   %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
>   %182 = load i64, i64* %incdec.ptr328, align 8
>   %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
>   %183 = load i64, i64* %incdec.ptr329, align 8
>   %and334 = and i64 %183, %182
>   %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
>   store i64 %and334, i64* %incdec.ptr331, align 8
>   %tobool319 = icmp eq i32 %dec, 0
>   br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320
>
>
> Vectorizing:
>
> LV: Checking a loop in "Perl_do_vop" from do_vop.bc
> LV: Loop hints: force=? width=0 unroll=0
> LV: Found a loop: while.body320
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Found an induction variable.
> LV: Did not find one integer induction var.
> LV: We can vectorize this loop (with a runtime bound check)!
> LV: Analyzing interleaved accesses...
> LV: Creating an interleave group with:  store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted:  store i64 %and330, i64* %incdec.ptr327, align 8
>     into the interleave group with  store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted:  store i64 %and326, i64* %incdec.ptr323, align 8
>     into the interleave group with  store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Inserted:  store i64 %and322, i64* %dl.0291, align 8
>     into the interleave group with  store i64 %and334, i64*
> %incdec.ptr331, align 8
> LV: Creating an interleave group with:  %183 = load i64, i64*
> %incdec.ptr329, align 8
> LV: Inserted:  %181 = load i64, i64* %incdec.ptr325, align 8
>     into the interleave group with  %183 = load i64, i64* %incdec.ptr329,
> align 8
> LV: Inserted:  %179 = load i64, i64* %incdec.ptr321, align 8
>     into the interleave group with  %183 = load i64, i64* %incdec.ptr329,
> align 8
> LV: Inserted:  %177 = load i64, i64* %rl.0289, align 8
>     into the interleave group with  %183 = load i64, i64* %incdec.ptr329,
> align 8
> LV: Creating an interleave group with:  %182 = load i64, i64*
> %incdec.ptr328, align 8
> LV: Inserted:  %180 = load i64, i64* %incdec.ptr324, align 8
>     into the interleave group with  %182 = load i64, i64* %incdec.ptr328,
> align 8
> LV: Inserted:  %178 = load i64, i64* %incdec.ptr, align 8
>     into the interleave group with  %182 = load i64, i64* %incdec.ptr328,
> align 8
> LV: Inserted:  %176 = load i64, i64* %ll.0290, align 8
>     into the interleave group with  %182 = load i64, i64* %incdec.ptr328,
> align 8
> LV: Found uniform instruction:   %tobool319 = icmp eq i32 %dec, 0
> LV: Found uniform instruction:   %incdec.ptr324 = getelementptr inbounds
> i64, i64* %ll.0290, i64 2
> LV: Found uniform instruction:   %incdec.ptr329 = getelementptr inbounds
> i64, i64* %rl.0289, i64 3
> LV: Found uniform instruction:   %incdec.ptr323 = getelementptr inbounds
> i64, i64* %dl.0291, i64 1
> LV: Found uniform instruction:   %incdec.ptr328 = getelementptr inbounds
> i64, i64* %ll.0290, i64 3
> LV: Found uniform instruction:   %incdec.ptr321 = getelementptr inbounds
> i64, i64* %rl.0289, i64 1
> LV: Found uniform instruction:   %incdec.ptr327 = getelementptr inbounds
> i64, i64* %dl.0291, i64 2
> LV: Found uniform instruction:   %incdec.ptr325 = getelementptr inbounds
> i64, i64* %rl.0289, i64 2
> LV: Found uniform instruction:   %incdec.ptr331 = getelementptr inbounds
> i64, i64* %dl.0291, i64 3
> LV: Found uniform instruction:   %incdec.ptr = getelementptr inbounds i64,
> i64* %ll.0290, i64 1
> LV: Found uniform instruction:   %dl.0291 = phi i64* [ %incdec.ptr335,
> %while.body320 ], [ %73, %while.body320.preheader ]
> LV: Found uniform instruction:   %incdec.ptr335 = getelementptr inbounds
> i64, i64* %dl.0291, i64 4
> LV: Found uniform instruction:   %ll.0290 = phi i64* [ %incdec.ptr332,
> %while.body320 ], [ %74, %while.body320.preheader ]
> LV: Found uniform instruction:   %incdec.ptr332 = getelementptr inbounds
> i64, i64* %ll.0290, i64 4
> LV: Found uniform instruction:   %rl.0289 = phi i64* [ %incdec.ptr333,
> %while.body320 ], [ %75, %while.body320.preheader ]
> LV: Found uniform instruction:   %incdec.ptr333 = getelementptr inbounds
> i64, i64* %rl.0289, i64 4
> LV: Found uniform instruction:   %len.0288 = phi i32 [ %dec,
> %while.body320 ], [ %conv316, %while.body320.preheader ]
> LV: Found uniform instruction:   %dec = add nsw i32 %len.0288, -1
> LV: Found trip count: 0
> LV: The Smallest and Widest types: 64 / 64 bits.
> LV: The Widest register is: 128 bits.
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %dl.0291 =
> phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %ll.0290 =
> phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %rl.0289 =
> phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %len.0288 =
> phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %dec = add
> nsw i32 %len.0288, -1
> LV: Found an estimated cost of 0 for VF 1 For instruction:   %incdec.ptr =
> getelementptr inbounds i64, i64* %ll.0290, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %176 = load
> i64, i64* %ll.0290, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %177 = load
> i64, i64* %rl.0289, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %and322 = and
> i64 %177, %176
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
> %and322, i64* %dl.0291, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %178 = load
> i64, i64* %incdec.ptr, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %179 = load
> i64, i64* %incdec.ptr321, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %and326 = and
> i64 %179, %178
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
> %and326, i64* %incdec.ptr323, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %180 = load
> i64, i64* %incdec.ptr324, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %181 = load
> i64, i64* %incdec.ptr325, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %and330 = and
> i64 %181, %180
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
> %and330, i64* %incdec.ptr327, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %182 = load
> i64, i64* %incdec.ptr328, align 8
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %183 = load
> i64, i64* %incdec.ptr329, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %and334 = and
> i64 %183, %182
> LV: Found an estimated cost of 0 for VF 1 For instruction:
>  %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> LV: Found an estimated cost of 1 for VF 1 For instruction:   store i64
> %and334, i64* %incdec.ptr331, align 8
> LV: Found an estimated cost of 1 for VF 1 For instruction:   %tobool319 =
> icmp eq i32 %dec, 0
> LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1
> %tobool319, label %sw.epilog381.loopexit, label %while.body320
> LV: Scalar loop costs: 18.
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %dl.0291 =
> phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %ll.0290 =
> phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %rl.0289 =
> phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
> %while.body320.preheader ]
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %len.0288 =
> phi i32 [ %dec, %while.body320 ], [ %conv316, %while.body320.preheader ]
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %dec = add
> nsw i32 %len.0288, -1
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %incdec.ptr =
> getelementptr inbounds i64, i64* %ll.0290, i64 1
> LV: Found an estimated cost of 4 for VF 2 For instruction:   %176 = load
> i64, i64* %ll.0290, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
> LV: Found an estimated cost of 4 for VF 2 For instruction:   %177 = load
> i64, i64* %rl.0289, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %and322 = and
> i64 %177, %176
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
> LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64
> %and322, i64* %dl.0291, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %178 = load
> i64, i64* %incdec.ptr, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %179 = load
> i64, i64* %incdec.ptr321, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %and326 = and
> i64 %179, %178
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
> LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64
> %and326, i64* %incdec.ptr323, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %180 = load
> i64, i64* %incdec.ptr324, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %181 = load
> i64, i64* %incdec.ptr325, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %and330 = and
> i64 %181, %180
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
> LV: Found an estimated cost of 0 for VF 2 For instruction:   store i64
> %and330, i64* %incdec.ptr327, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %182 = load
> i64, i64* %incdec.ptr328, align 8
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
> LV: Found an estimated cost of 0 for VF 2 For instruction:   %183 = load
> i64, i64* %incdec.ptr329, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %and334 = and
> i64 %183, %182
> LV: Found an estimated cost of 0 for VF 2 For instruction:
>  %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
> LV: Found an estimated cost of 4 for VF 2 For instruction:   store i64
> %and334, i64* %incdec.ptr331, align 8
> LV: Found an estimated cost of 1 for VF 2 For instruction:   %tobool319 =
> icmp eq i32 %dec, 0
> LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1
> %tobool319, label %sw.epilog381.loopexit, label %while.body320
> LV: Vector loop of width 2 costs: 9.
> LV: Selecting VF: 2.
> LV: The target has 32 registers
> LV(REG): Calculating max register usage:
> LV(REG): At #0 Interval # 0
> LV(REG): At #1 Interval # 1
> LV(REG): At #2 Interval # 2
> LV(REG): At #3 Interval # 3
> LV(REG): At #4 Interval # 4
> LV(REG): At #5 Interval # 4
> LV(REG): At #6 Interval # 5
> LV(REG): At #7 Interval # 6
> LV(REG): At #8 Interval # 7
> LV(REG): At #9 Interval # 8
> LV(REG): At #10 Interval # 7
> LV(REG): At #12 Interval # 7
> LV(REG): At #13 Interval # 8
> LV(REG): At #14 Interval # 8
> LV(REG): At #15 Interval # 9
> LV(REG): At #16 Interval # 9
> LV(REG): At #17 Interval # 8
> LV(REG): At #19 Interval # 7
> LV(REG): At #20 Interval # 8
> LV(REG): At #21 Interval # 8
> LV(REG): At #22 Interval # 9
> LV(REG): At #23 Interval # 9
> LV(REG): At #24 Interval # 8
> LV(REG): At #26 Interval # 7
> LV(REG): At #27 Interval # 7
> LV(REG): At #28 Interval # 7
> LV(REG): At #29 Interval # 7
> LV(REG): At #30 Interval # 7
> LV(REG): At #31 Interval # 6
> LV(REG): At #33 Interval # 5
> LV(REG): VF = 2
> LV(REG): Found max usage: 2
> LV(REG): Found invariant usage: 4
> LV(REG): LoopSize: 35
> LV: Loop cost is 18
> LV: Interleaving to reduce branch cost.
> LV: Interleaving is not beneficial.
> LV: Found a vectorizable loop (2) in do_vop.bc
> LV: Interleaving disabled by the pass manager
> LV: Scalarizing:  %dec = add nsw i32 %len.0288, -1
> LV: Scalarizing:  %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290,
> i64 1
> LV: Scalarizing:  %incdec.ptr321 = getelementptr inbounds i64, i64*
> %rl.0289, i64 1
> LV: Scalarizing:  %incdec.ptr323 = getelementptr inbounds i64, i64*
> %dl.0291, i64 1
> LV: Scalarizing:  %incdec.ptr324 = getelementptr inbounds i64, i64*
> %ll.0290, i64 2
> LV: Scalarizing:  %incdec.ptr325 = getelementptr inbounds i64, i64*
> %rl.0289, i64 2
> LV: Scalarizing:  %incdec.ptr327 = getelementptr inbounds i64, i64*
> %dl.0291, i64 2
> LV: Scalarizing:  %incdec.ptr328 = getelementptr inbounds i64, i64*
> %ll.0290, i64 3
> LV: Scalarizing:  %incdec.ptr329 = getelementptr inbounds i64, i64*
> %rl.0289, i64 3
> LV: Scalarizing:  %incdec.ptr331 = getelementptr inbounds i64, i64*
> %dl.0291, i64 3
> LV: Scalarizing:  %incdec.ptr332 = getelementptr inbounds i64, i64*
> %ll.0290, i64 4
> LV: Scalarizing:  %incdec.ptr333 = getelementptr inbounds i64, i64*
> %rl.0289, i64 4
> LV: Scalarizing:  %incdec.ptr335 = getelementptr inbounds i64, i64*
> %dl.0291, i64 4
> LV: Scalarizing:  %tobool319 = icmp eq i32 %dec, 0
>
> vectorized loop (vectorization width: 2, interleaved count: 1)
>
> Loop after vectorize pass:
>
> vector.body419:                                   ; preds =
> %vector.body419, %vector.ph440
>   %index441 = phi i64 [ 0, %vector.ph440 ], [ %index.next442,
> %vector.body419 ]
>   %184 = add i64 %index441, 0
>   %185 = shl i64 %184, 2
>   %next.gep453 = getelementptr i64, i64* %73, i64 %185
>   %186 = add i64 %index441, 0
>   %187 = shl i64 %186, 2
>   %next.gep454 = getelementptr i64, i64* %74, i64 %187
>   %188 = add i64 %index441, 0
>   %189 = shl i64 %188, 2
>   %next.gep455 = getelementptr i64, i64* %75, i64 %189
>   %190 = trunc i64 %index441 to i32
>   %offset.idx456 = sub i32 %conv316, %190
>   %broadcast.splatinsert457 = insertelement <2 x i32> undef, i32
> %offset.idx456, i32 0
>   %broadcast.splat458 = shufflevector <2 x i32> %broadcast.splatinsert457,
> <2 x i32> undef, <2 x i32> zeroinitializer
>   %induction459 = add <2 x i32> %broadcast.splat458, <i32 0, i32 -1>
>   %191 = add i32 %offset.idx456, 0
>   %192 = add nsw i32 %191, -1
>   %193 = getelementptr inbounds i64, i64* %next.gep454, i64 1
>   %194 = getelementptr i64, i64* %next.gep454, i32 0
>   %195 = bitcast i64* %194 to <8 x i64>*
>   %wide.vec460 = load <8 x i64>, <8 x i64>* %195, align 8,
> !alias.scope !21
>   %strided.vec461 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef,
> <2 x i32> <i32 0, i32 4>
>   %strided.vec462 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef,
> <2 x i32> <i32 1, i32 5>
>   %strided.vec463 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef,
> <2 x i32> <i32 2, i32 6>
>   %strided.vec464 = shufflevector <8 x i64> %wide.vec460, <8 x i64> undef,
> <2 x i32> <i32 3, i32 7>
>   %196 = getelementptr inbounds i64, i64* %next.gep455, i64 1
>   %197 = getelementptr i64, i64* %next.gep455, i32 0
>   %198 = bitcast i64* %197 to <8 x i64>*
>   %wide.vec465 = load <8 x i64>, <8 x i64>* %198, align 8,
> !alias.scope !24
>   %strided.vec466 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef,
> <2 x i32> <i32 0, i32 4>
>   %strided.vec467 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef,
> <2 x i32> <i32 1, i32 5>
>   %strided.vec468 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef,
> <2 x i32> <i32 2, i32 6>
>   %strided.vec469 = shufflevector <8 x i64> %wide.vec465, <8 x i64> undef,
> <2 x i32> <i32 3, i32 7>
>   %199 = and <2 x i64> %strided.vec466, %strided.vec461
>   %200 = getelementptr inbounds i64, i64* %next.gep453, i64 1
>   %201 = getelementptr inbounds i64, i64* %next.gep454, i64 2
>   %202 = getelementptr inbounds i64, i64* %next.gep455, i64 2
>   %203 = and <2 x i64> %strided.vec467, %strided.vec462
>   %204 = getelementptr inbounds i64, i64* %next.gep453, i64 2
>   %205 = getelementptr inbounds i64, i64* %next.gep454, i64 3
>   %206 = getelementptr inbounds i64, i64* %next.gep455, i64 3
>   %207 = and <2 x i64> %strided.vec468, %strided.vec463
>   %208 = getelementptr inbounds i64, i64* %next.gep453, i64 3
>   %209 = getelementptr inbounds i64, i64* %next.gep454, i64 4
>   %210 = getelementptr inbounds i64, i64* %next.gep455, i64 4
>   %211 = and <2 x i64> %strided.vec469, %strided.vec464
>   %212 = getelementptr inbounds i64, i64* %next.gep453, i64 4
>   %213 = getelementptr i64, i64* %208, i32 -3
>   %214 = bitcast i64* %213 to <8 x i64>*
>   %215 = shufflevector <2 x i64> %199, <2 x i64> %203, <4 x i32> <i32 0,
> i32 1, i32 2, i32 3>
>   %216 = shufflevector <2 x i64> %207, <2 x i64> %211, <4 x i32> <i32 0,
> i32 1, i32 2, i32 3>
>   %217 = shufflevector <4 x i64> %215, <4 x i64> %216, <8 x i32> <i32 0,
> i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
>   %interleaved.vec470 = shufflevector <8 x i64> %217, <8 x i64> undef, <8
> x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
>   store <8 x i64> %interleaved.vec470, <8 x i64>* %214, align 8,
> !alias.scope !26, !noalias !28
>   %218 = icmp eq i32 %192, 0
>   %index.next442 = add i64 %index441, 2
>   %219 = icmp eq i64 %index.next442, %n.vec425
>   br i1 %219, label %middle.block420, label %vector.body419, !llvm.loop !29
>
> Loop after instruction combining:
>
> vector.body419:                                   ; preds =
> %vector.body419, %vector.body419.preheader
>   %lsr.iv62 = phi i8* [ %scevgep63, %vector.body419 ], [ %dc.2,
> %vector.body419.preheader ]
>   %lsr.iv59 = phi i8* [ %scevgep60, %vector.body419 ], [ %cond,
> %vector.body419.preheader ]
>   %lsr.iv56 = phi i8* [ %scevgep57, %vector.body419 ], [ %cond57,
> %vector.body419.preheader ]
>   %lsr.iv54 = phi i64 [ %lsr.iv.next55, %vector.body419 ], [ %n.vec425,
> %vector.body419.preheader ]
>   %lsr.iv6264 = bitcast i8* %lsr.iv62 to <8 x i64>*
>   %lsr.iv5961 = bitcast i8* %lsr.iv59 to <8 x i64>*
>   %lsr.iv5658 = bitcast i8* %lsr.iv56 to <8 x i64>*
>   %wide.vec460 = load <8 x i64>, <8 x i64>* %lsr.iv5961, align 8,
> !alias.scope !21
>   %wide.vec465 = load <8 x i64>, <8 x i64>* %lsr.iv5658, align 8,
> !alias.scope !24
>   %179 = and <8 x i64> %wide.vec465, %wide.vec460
>   %180 = shufflevector <8 x i64> %179, <8 x i64> undef, <2 x i32> <i32 0,
> i32 4>
>   %181 = and <8 x i64> %wide.vec465, %wide.vec460
>   %182 = shufflevector <8 x i64> %181, <8 x i64> undef, <2 x i32> <i32 1,
> i32 5>
>   %183 = and <8 x i64> %wide.vec465, %wide.vec460
>   %184 = shufflevector <8 x i64> %183, <8 x i64> undef, <2 x i32> <i32 2,
> i32 6>
>   %185 = and <8 x i64> %wide.vec465, %wide.vec460
>   %186 = shufflevector <8 x i64> %185, <8 x i64> undef, <2 x i32> <i32 3,
> i32 7>
>   %187 = shufflevector <2 x i64> %180, <2 x i64> %182, <4 x i32> <i32 0,
> i32 1, i32 2, i32 3>
>   %188 = shufflevector <2 x i64> %184, <2 x i64> %186, <4 x i32> <i32 0,
> i32 1, i32 2, i32 3>
>   %interleaved.vec470 = shufflevector <4 x i64> %187, <4 x i64> %188, <8 x
> i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5, i32 7>
>   store <8 x i64> %interleaved.vec470, <8 x i64>* %lsr.iv6264, align 8,
> !alias.scope !26, !noalias !28
>   %lsr.iv.next55 = add i64 %lsr.iv54, -2
>   %scevgep57 = getelementptr i8, i8* %lsr.iv56, i64 64
>   %scevgep60 = getelementptr i8, i8* %lsr.iv59, i64 64
>   %scevgep63 = getelementptr i8, i8* %lsr.iv62, i64 64
>   %189 = icmp eq i64 %lsr.iv.next55, 0
>   br i1 %189, label %middle.block420, label %vector.body419, !llvm.loop !29
>
> Final vectorized loop
>
> .LBB0_141:                              # %vector.body419
>                                         # =>This Inner Loop Header: Depth=1
>         vl      %v0, 48(%r8)
>         vl      %v1, 48(%r7)
>         vn      %v0, %v1, %v0
>         vl      %v1, 16(%r8)
>         vl      %v2, 16(%r7)
>         vn      %v1, %v2, %v1
>         vmrlg   %v2, %v1, %v0
>         vmrhg   %v0, %v1, %v0
>         vmrlg   %v1, %v0, %v2
>         vst     %v1, 48(%r9)
>         vl      %v1, 32(%r8)
>         vl      %v3, 32(%r7)
>         vn      %v1, %v3, %v1
>         vl      %v3, 0(%r8)
>         vl      %v4, 0(%r7)
>         vn      %v3, %v4, %v3
>         vmrlg   %v4, %v3, %v1
>         vmrhg   %v1, %v3, %v1
>         vmrlg   %v3, %v1, %v4
>         vst     %v3, 32(%r9)
>         vmrhg   %v0, %v0, %v2
>         vst     %v0, 16(%r9)
>         vmrhg   %v0, %v1, %v4
>         vst     %v0, 0(%r9)
>         la      %r9, 64(%r9)
>         la      %r8, 64(%r8)
>         la      %r7, 64(%r7)
>         aghi    %r13, -2
>         jne     .LBB0_141
>
> Final scalar loop :
> .LBB0_152:                              # %while.body320
>                                         # =>This Inner Loop Header: Depth=1
>         lg      %r13, 0(%r14)
>         ng      %r13, 0(%r5)
>         stg     %r13, 0(%r4)
>         lg      %r13, 8(%r14)
>         ng      %r13, 8(%r5)
>         stg     %r13, 8(%r4)
>         lg      %r13, 16(%r14)
>         ng      %r13, 16(%r5)
>         stg     %r13, 16(%r4)
>         lg      %r13, 24(%r14)
>         ng      %r13, 24(%r5)
>         stg     %r13, 24(%r4)
>         la      %r4, 32(%r4)
>         la      %r14, 32(%r14)
>         la      %r5, 32(%r5)
>         brct    %r0, .LBB0_152
>         j       .LBB0_155
>
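
As a side note on the cost-model lines in the log above: the two totals can be reproduced by hand. The sketch below is plain Python, not LLVM code. It sums the non-zero per-instruction costs copied from the "LV: Found an estimated cost" lines and divides by the vectorization factor. The division by VF is an assumption about how the vectorizer normalizes the total, chosen because it matches the "Scalar loop costs: 18" and "Vector loop of width 2 costs: 9" lines; the dictionary keys and the `cost_per_lane` helper are made up for illustration.

```python
# Non-zero costs taken from the debug log; every other instruction is cost 0.
# VF 1: each of the 8 loads, 4 ands and 4 stores costs 1.
scalar_costs = {"dec": 1, "loads": 8 * 1, "ands": 4 * 1, "stores": 4 * 1, "icmp": 1}

# VF 2: each interleave group is charged once (cost 4) on a single member,
# and the other members of the group are reported with cost 0.
vf2_costs = {
    "dec": 1,
    "load group (ll)": 4,
    "load group (rl)": 4,
    "ands": 4 * 1,
    "store group (dl)": 4,
    "icmp": 1,
}


def cost_per_lane(costs, vf):
    """Total cost of one loop iteration divided by the number of lanes.

    Assumption: this mirrors how the log's per-VF totals are normalized.
    """
    return sum(costs.values()) / vf


scalar = cost_per_lane(scalar_costs, 1)
vector = cost_per_lane(vf2_costs, 2)
print(scalar, vector)
```

Under that reading, VF 2 wins only because each interleave group is priced at 4, so the estimate says nothing about the extra shuffles that show up in the final code.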

Jonas Paulsson via llvm-dev

2016-Oct-07 09:07 UTC

[llvm-dev] LoopVectorizer -- generating bad and unhandled shufflevector sequence

Hi Matt,

ok - see https://llvm.org/bugs/show_bug.cgi?id=30630.

/Jonas
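
For reference, here is a minimal sketch (plain Python, not LLVM code) of why the shuffle sequence in the quoted message below is redundant. It models `shufflevector` as indexing into the concatenation of the two operands and uses the masks from the vectorized IR; `shufflevector` and `vectorized_body` are illustrative helpers, and `None` stands in for `undef`. Because `and` works lane by lane, the de-interleave, `and`, concatenate, re-interleave chain should give the same result as a single 8-lane `and`.

```python
def shufflevector(a, b, mask):
    """Model of LLVM's shufflevector: pick lanes from the concatenation a ++ b."""
    both = list(a) + list(b)
    return [both[i] for i in mask]


def vectorized_body(ll, rl):
    """Mirror the vector loop body: strided extracts, lane-wise and, re-interleave."""
    undef = [None] * 8
    strided_masks = ([0, 4], [1, 5], [2, 6], [3, 7])
    # De-interleave both wide loads with stride 4.
    l_parts = [shufflevector(ll, undef, m) for m in strided_masks]
    r_parts = [shufflevector(rl, undef, m) for m in strided_masks]
    # One <2 x i64> 'and' per interleave-group member.
    ands = [[r & l for r, l in zip(rv, lv)] for rv, lv in zip(r_parts, l_parts)]
    # Concatenate: 2+2 -> 4, 2+2 -> 4, then 4+4 -> 8.
    lo = shufflevector(ands[0], ands[1], [0, 1, 2, 3])
    hi = shufflevector(ands[2], ands[3], [0, 1, 2, 3])
    cat = shufflevector(lo, hi, list(range(8)))
    # Re-interleave for the wide store.
    return shufflevector(cat, undef, [0, 2, 4, 6, 1, 3, 5, 7])


ll = [0x0F, 0xF0, 0xAA, 0x55, 0xFF, 0x00, 0x3C, 0xC3]
rl = [0xFF, 0x0F, 0x0A, 0x50, 0x81, 0xFF, 0x0C, 0xC0]
wide_and = [r & l for r, l in zip(rl, ll)]
print(vectorized_body(ll, rl) == wide_and)
```

The same composition applied to the lane indices 0..7 returns them in their original order, which is the round trip drawn in the diagram.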

On 2016-10-06 20:40, Matthew Simpson wrote:
> Hi Jonas,
>
> It does look like we should be able to simplify this. Would you mind
> filing a bug? Looking at the code after InstCombine, the vector ands
> are trivially redundant (I think EarlyCSE should already be able to
> remove them). I think we could then teach InstructionSimplify to
> simplify the remaining shuffles similar to the way it already handles
> extracts.
>
> -- Matt
>
> On Thu, Oct 6, 2016 at 10:30 AM, Jonas Paulsson via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
>     Hi,
>
>     I have experimented with enabling the LoopVectorizer for SystemZ.
>     I have come across a loop which, when vectorized, seems to have
>     been poorly generated. In short, there seems to be a completely
>     unnecessary sequence of shufflevector instructions that doesn't
>     get optimized away anywhere. In other words, there is a shuffle
>     sequence that leads back to the original vector:
>
>            [0 1 2 3 4 5 6 7]
>
>      [0 4]   [1 5]   [2 6]   [3 7]
>
>        [0 4 1 5]       [2 6 3 7]
>
>            [0 1 2 3 4 5 6 7]
>
>     Is this something the instruction combiner, or perhaps the
>     InterleavedAccess pass should handle? Even though I suspect that
>     there are currently many target hooks for SystemZ with bad values
>     returned, this seems like something that the optimizers should
>     handle regardless. The result is unnecessary target
>     instructions, as can be seen at the bottom.
>
>     I would appreciate any input on this, and if needed I can supply a
>     test case.
>
>     /Jonas
>
>
>     Loop before vectorize pass:
>
>     while.body320:                                    ; preds =
>     %while.body320.preheader, %while.body320
>       %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
>     %while.body320.preheader ]
>       %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
>     %while.body320.preheader ]
>       %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
>     %while.body320.preheader ]
>       %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316,
>     %while.body320.preheader ]
>       %dec = add nsw i32 %len.0288, -1
>       %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
>       %176 = load i64, i64* %ll.0290, align 8
>       %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
>       %177 = load i64, i64* %rl.0289, align 8
>       %and322 = and i64 %177, %176
>       %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
>       store i64 %and322, i64* %dl.0291, align 8
>       %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
>       %178 = load i64, i64* %incdec.ptr, align 8
>       %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
>       %179 = load i64, i64* %incdec.ptr321, align 8
>       %and326 = and i64 %179, %178
>       %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
>       store i64 %and326, i64* %incdec.ptr323, align 8
>       %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
>       %180 = load i64, i64* %incdec.ptr324, align 8
>       %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
>       %181 = load i64, i64* %incdec.ptr325, align 8
>       %and330 = and i64 %181, %180
>       %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
>       store i64 %and330, i64* %incdec.ptr327, align 8
>       %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
>       %182 = load i64, i64* %incdec.ptr328, align 8
>       %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
>       %183 = load i64, i64* %incdec.ptr329, align 8
>       %and334 = and i64 %183, %182
>       %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
>       store i64 %and334, i64* %incdec.ptr331, align 8
>       %tobool319 = icmp eq i32 %dec, 0
>       br i1 %tobool319, label %sw.epilog381.loopexit, label %while.body320
>
>
>     Vectorizing:
>
>     LV: Checking a loop in "Perl_do_vop" from do_vop.bc
>     LV: Loop hints: force=? width=0 unroll=0
>     LV: Found a loop: while.body320
>     LV: Found an induction variable.
>     LV: Found an induction variable.
>     LV: Found an induction variable.
>     LV: Found an induction variable.
>     LV: Did not find one integer induction var.
>     LV: We can vectorize this loop (with a runtime bound check)!
>     LV: Analyzing interleaved accesses...
>     LV: Creating an interleave group with:  store i64 %and334, i64*
>     %incdec.ptr331, align 8
>     LV: Inserted:  store i64 %and330, i64* %incdec.ptr327, align 8
>         into the interleave group with  store i64 %and334, i64*
>     %incdec.ptr331, align 8
>     LV: Inserted:  store i64 %and326, i64* %incdec.ptr323, align 8
>         into the interleave group with  store i64 %and334, i64*
>     %incdec.ptr331, align 8
>     LV: Inserted:  store i64 %and322, i64* %dl.0291, align 8
>         into the interleave group with  store i64 %and334, i64*
>     %incdec.ptr331, align 8
>     LV: Creating an interleave group with:  %183 = load i64, i64*
>     %incdec.ptr329, align 8
>     LV: Inserted:  %181 = load i64, i64* %incdec.ptr325, align 8
>         into the interleave group with  %183 = load i64, i64*
>     %incdec.ptr329, align 8
>     LV: Inserted:  %179 = load i64, i64* %incdec.ptr321, align 8
>         into the interleave group with  %183 = load i64, i64*
>     %incdec.ptr329, align 8
>     LV: Inserted:  %177 = load i64, i64* %rl.0289, align 8
>         into the interleave group with  %183 = load i64, i64*
>     %incdec.ptr329, align 8
>     LV: Creating an interleave group with:  %182 = load i64, i64*
>     %incdec.ptr328, align 8
>     LV: Inserted:  %180 = load i64, i64* %incdec.ptr324, align 8
>         into the interleave group with  %182 = load i64, i64*
>     %incdec.ptr328, align 8
>     LV: Inserted:  %178 = load i64, i64* %incdec.ptr, align 8
>         into the interleave group with  %182 = load i64, i64*
>     %incdec.ptr328, align 8
>     LV: Inserted:  %176 = load i64, i64* %ll.0290, align 8
>         into the interleave group with  %182 = load i64, i64*
>     %incdec.ptr328, align 8
>     LV: Found uniform instruction:   %tobool319 = icmp eq i32 %dec, 0
>     LV: Found uniform instruction:   %incdec.ptr324 = getelementptr
>     inbounds i64, i64* %ll.0290, i64 2
>     LV: Found uniform instruction:   %incdec.ptr329 = getelementptr
>     inbounds i64, i64* %rl.0289, i64 3
>     LV: Found uniform instruction:   %incdec.ptr323 = getelementptr
>     inbounds i64, i64* %dl.0291, i64 1
>     LV: Found uniform instruction:   %incdec.ptr328 = getelementptr
>     inbounds i64, i64* %ll.0290, i64 3
>     LV: Found uniform instruction:   %incdec.ptr321 = getelementptr
>     inbounds i64, i64* %rl.0289, i64 1
>     LV: Found uniform instruction:   %incdec.ptr327 = getelementptr
>     inbounds i64, i64* %dl.0291, i64 2
>     LV: Found uniform instruction:   %incdec.ptr325 = getelementptr
>     inbounds i64, i64* %rl.0289, i64 2
>     LV: Found uniform instruction:   %incdec.ptr331 = getelementptr
>     inbounds i64, i64* %dl.0291, i64 3
>     LV: Found uniform instruction:   %incdec.ptr = getelementptr
>     inbounds i64, i64* %ll.0290, i64 1
>     LV: Found uniform instruction:   %dl.0291 = phi i64* [
>     %incdec.ptr335, %while.body320 ], [ %73, %while.body320.preheader ]
>     LV: Found uniform instruction:   %incdec.ptr335 = getelementptr
>     inbounds i64, i64* %dl.0291, i64 4
>     LV: Found uniform instruction:   %ll.0290 = phi i64* [
>     %incdec.ptr332, %while.body320 ], [ %74, %while.body320.preheader ]
>     LV: Found uniform instruction:   %incdec.ptr332 = getelementptr
>     inbounds i64, i64* %ll.0290, i64 4
>     LV: Found uniform instruction:   %rl.0289 = phi i64* [
>     %incdec.ptr333, %while.body320 ], [ %75, %while.body320.preheader ]
>     LV: Found uniform instruction:   %incdec.ptr333 = getelementptr
>     inbounds i64, i64* %rl.0289, i64 4
>     LV: Found uniform instruction:   %len.0288 = phi i32 [ %dec,
>     %while.body320 ], [ %conv316, %while.body320.preheader ]
>     LV: Found uniform instruction:   %dec = add nsw i32 %len.0288, -1
>     LV: Found trip count: 0
>     LV: The Smallest and Widest types: 64 / 64 bits.
>     LV: The Widest register is: 128 bits.
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %dec =
>     add nsw i32 %len.0288, -1
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %176 =
>     load i64, i64* %ll.0290, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %177 =
>     load i64, i64* %rl.0289, align 8
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %and322 = and i64 %177, %176
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  store
>     i64 %and322, i64* %dl.0291, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %178 =
>     load i64, i64* %incdec.ptr, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  %179 =
>     load i64, i64* %incdec.ptr321, align 8
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %and326 = and i64 %179, %178
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  store
>     i64 %and326, i64* %incdec.ptr323, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %180 = load i64, i64* %incdec.ptr324, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %181 = load i64, i64* %incdec.ptr325, align 8
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %and330 = and i64 %181, %180
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  store
>     i64 %and330, i64* %incdec.ptr327, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %182 = load i64, i64* %incdec.ptr328, align 8
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %183 = load i64, i64* %incdec.ptr329, align 8
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %and334 = and i64 %183, %182
>     LV: Found an estimated cost of 0 for VF 1 For instruction:
>      %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
>     LV: Found an estimated cost of 1 for VF 1 For instruction:  store
>     i64 %and334, i64* %incdec.ptr331, align 8
>     LV: Found an estimated cost of 1 for VF 1 For instruction:
>      %tobool319 = icmp eq i32 %dec, 0
>     LV: Found an estimated cost of 0 for VF 1 For instruction:  br i1
>     %tobool319, label %sw.epilog381.loopexit, label %while.body320
>     LV: Scalar loop costs: 18.
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %dl.0291 = phi i64* [ %incdec.ptr335, %while.body320 ], [ %73,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %ll.0290 = phi i64* [ %incdec.ptr332, %while.body320 ], [ %74,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %rl.0289 = phi i64* [ %incdec.ptr333, %while.body320 ], [ %75,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %len.0288 = phi i32 [ %dec, %while.body320 ], [ %conv316,
>     %while.body320.preheader ]
>     LV: Found an estimated cost of 1 for VF 2 For instruction:
>      %dec = add nsw i32 %len.0288, -1
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr = getelementptr inbounds i64, i64* %ll.0290, i64 1
>     LV: Found an estimated cost of 4 for VF 2 For instruction:
>      %176 = load i64, i64* %ll.0290, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr321 = getelementptr inbounds i64, i64* %rl.0289, i64 1
>     LV: Found an estimated cost of 4 for VF 2 For instruction:
>      %177 = load i64, i64* %rl.0289, align 8
>     LV: Found an estimated cost of 1 for VF 2 For instruction:
>      %and322 = and i64 %177, %176
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr323 = getelementptr inbounds i64, i64* %dl.0291, i64 1
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  store
>     i64 %and322, i64* %dl.0291, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr324 = getelementptr inbounds i64, i64* %ll.0290, i64 2
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %178 = load i64, i64* %incdec.ptr, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr325 = getelementptr inbounds i64, i64* %rl.0289, i64 2
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %179 = load i64, i64* %incdec.ptr321, align 8
>     LV: Found an estimated cost of 1 for VF 2 For instruction:
>      %and326 = and i64 %179, %178
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr327 = getelementptr inbounds i64, i64* %dl.0291, i64 2
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  store
>     i64 %and326, i64* %incdec.ptr323, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr328 = getelementptr inbounds i64, i64* %ll.0290, i64 3
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %180 = load i64, i64* %incdec.ptr324, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr329 = getelementptr inbounds i64, i64* %rl.0289, i64 3
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %181 = load i64, i64* %incdec.ptr325, align 8
>     LV: Found an estimated cost of 1 for VF 2 For instruction:
>      %and330 = and i64 %181, %180
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr331 = getelementptr inbounds i64, i64* %dl.0291, i64 3
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  store
>     i64 %and330, i64* %incdec.ptr327, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr332 = getelementptr inbounds i64, i64* %ll.0290, i64 4
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %182 = load i64, i64* %incdec.ptr328, align 8
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr333 = getelementptr inbounds i64, i64* %rl.0289, i64 4
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %183 = load i64, i64* %incdec.ptr329, align 8
>     LV: Found an estimated cost of 1 for VF 2 For instruction:
>      %and334 = and i64 %183, %182
>     LV: Found an estimated cost of 0 for VF 2 For instruction:
>      %incdec.ptr335 = getelementptr inbounds i64, i64* %dl.0291, i64 4
>     LV: Found an estimated cost of 4 for VF 2 For instruction:  store
>     i64 %and334, i64* %incdec.ptr331, align 8
>     LV: Found an estimated cost of 1 for VF 2 For instruction:
>      %tobool319 = icmp eq i32 %dec, 0
>     LV: Found an estimated cost of 0 for VF 2 For instruction:  br i1
>     %tobool319, label %sw.epilog381.loopexit, label %while.body320
>     LV: Vector loop of width 2 costs: 9.
>     LV: Selecting VF: 2.
>     LV: The target has 32 registers
>     LV(REG): Calculating max register usage:
>     LV(REG): At #0 Interval # 0
>     LV(REG): At #1 Interval # 1
>     LV(REG): At #2 Interval # 2
>     LV(REG): At #3 Interval # 3
>     LV(REG): At #4 Interval # 4
>     LV(REG): At #5 Interval # 4
>     LV(REG): At #6 Interval # 5
>     LV(REG): At #7 Interval # 6
>     LV(REG): At #8 Interval # 7
>     LV(REG): At #9 Interval # 8
>     LV(REG): At #10 Interval # 7
>     LV(REG): At #12 Interval # 7
>     LV(REG): At #13 Interval # 8
>     LV(REG): At #14 Interval # 8
>     LV(REG): At #15 Interval # 9
>     LV(REG): At #16 Interval # 9
>     LV(REG): At #17 Interval # 8
>     LV(REG): At #19 Interval # 7
>     LV(REG): At #20 Interval # 8
>     LV(REG): At #21 Interval # 8
>     LV(REG): At #22 Interval # 9
>     LV(REG): At #23 Interval # 9
>     LV(REG): At #24 Interval # 8
>     LV(REG): At #26 Interval # 7
>     LV(REG): At #27 Interval # 7
>     LV(REG): At #28 Interval # 7
>     LV(REG): At #29 Interval # 7
>     LV(REG): At #30 Interval # 7
>     LV(REG): At #31 Interval # 6
>     LV(REG): At #33 Interval # 5
>     LV(REG): VF = 2
>     LV(REG): Found max usage: 2
>     LV(REG): Found invariant usage: 4
>     LV(REG): LoopSize: 35
>     LV: Loop cost is 18
>     LV: Interleaving to reduce branch cost.
>     LV: Interleaving is not beneficial.
>     LV: Found a vectorizable loop (2) in do_vop.bc
>     LV: Interleaving disabled by the pass manager
>     LV: Scalarizing:  %dec = add nsw i32 %len.0288, -1
>     LV: Scalarizing:  %incdec.ptr = getelementptr inbounds i64, i64*
>     %ll.0290, i64 1
>     LV: Scalarizing:  %incdec.ptr321 = getelementptr inbounds i64,
>     i64* %rl.0289, i64 1
>     LV: Scalarizing:  %incdec.ptr323 = getelementptr inbounds i64,
>     i64* %dl.0291, i64 1
>     LV: Scalarizing:  %incdec.ptr324 = getelementptr inbounds i64,
>     i64* %ll.0290, i64 2
>     LV: Scalarizing:  %incdec.ptr325 = getelementptr inbounds i64,
>     i64* %rl.0289, i64 2
>     LV: Scalarizing:  %incdec.ptr327 = getelementptr inbounds i64,
>     i64* %dl.0291, i64 2
>     LV: Scalarizing:  %incdec.ptr328 = getelementptr inbounds i64,
>     i64* %ll.0290, i64 3
>     LV: Scalarizing:  %incdec.ptr329 = getelementptr inbounds i64,
>     i64* %rl.0289, i64 3
>     LV: Scalarizing:  %incdec.ptr331 = getelementptr inbounds i64,
>     i64* %dl.0291, i64 3
>     LV: Scalarizing:  %incdec.ptr332 = getelementptr inbounds i64,
>     i64* %ll.0290, i64 4
>     LV: Scalarizing:  %incdec.ptr333 = getelementptr inbounds i64,
>     i64* %rl.0289, i64 4
>     LV: Scalarizing:  %incdec.ptr335 = getelementptr inbounds i64,
>     i64* %dl.0291, i64 4
>     LV: Scalarizing:  %tobool319 = icmp eq i32 %dec, 0
>
>     vectorized loop (vectorization width: 2, interleaved count: 1)
>
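The totals in the cost log above can be re-derived by hand. The following is a minimal Python sketch, with the per-instruction costs transcribed from the "LV: Found an estimated cost" lines (zero-cost phis, GEPs and the branch are left out). Dividing the VF 2 total by the vectorization factor is my reading of how the "width 2 costs: 9" figure comes about; the log itself does not state this, and the dictionary keys and the helper name are made up for illustration.

```python
# Non-zero costs transcribed from the log for the scalar loop (VF 1).
scalar_costs = {
    "add (%dec)": 1,
    "8 loads": 8 * 1,
    "4 ands": 4 * 1,
    "4 stores": 4 * 1,
    "icmp": 1,
}

# Non-zero costs transcribed from the log for VF 2. Each interleave group is
# charged once: on the first load of each pointer and on the last store.
vf2_costs = {
    "add (%dec)": 1,
    "load group via %ll.0290": 4,
    "load group via %rl.0289": 4,
    "4 ands": 4 * 1,
    "store group via %dl.0291": 4,
    "icmp": 1,
}


def per_iteration_cost(costs, vf):
    """Hypothetical helper: total cost divided by the vectorization factor."""
    return sum(costs.values()) / vf


print("scalar:", per_iteration_cost(scalar_costs, 1))
print("VF 2:  ", per_iteration_cost(vf2_costs, 2))
```

Under that assumption both totals are 18, so VF 2 wins only because the total is halved per original iteration. None of the shuffles that the vectorizer later emits appear as separate line items here; whether they are meant to be covered by the interleave-group cost of 4 depends on the target hook.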
>     Loop after vectorize pass:
>
>     vector.body419:                                   ; preds =
>     %vector.body419, %vector.ph440
>       %index441 = phi i64 [ 0, %vector.ph440 ], [ %index.next442,
>     %vector.body419 ]
>       %184 = add i64 %index441, 0
>       %185 = shl i64 %184, 2
>       %next.gep453 = getelementptr i64, i64* %73, i64 %185
>       %186 = add i64 %index441, 0
>       %187 = shl i64 %186, 2
>       %next.gep454 = getelementptr i64, i64* %74, i64 %187
>       %188 = add i64 %index441, 0
>       %189 = shl i64 %188, 2
>       %next.gep455 = getelementptr i64, i64* %75, i64 %189
>       %190 = trunc i64 %index441 to i32
>       %offset.idx456 = sub i32 %conv316, %190
>       %broadcast.splatinsert457 = insertelement <2 x i32> undef, i32
>     %offset.idx456, i32 0
>       %broadcast.splat458 = shufflevector <2 x i32>
>     %broadcast.splatinsert457, <2 x i32> undef, <2 x i32> zeroinitializer
>       %induction459 = add <2 x i32> %broadcast.splat458, <i32 0, i32 -1>
>       %191 = add i32 %offset.idx456, 0
>       %192 = add nsw i32 %191, -1
>       %193 = getelementptr inbounds i64, i64* %next.gep454, i64 1
>       %194 = getelementptr i64, i64* %next.gep454, i32 0
>       %195 = bitcast i64* %194 to <8 x i64>*
>       %wide.vec460 = load <8 x i64>, <8 x i64>* %195, align 8,
>     !alias.scope !21
>       %strided.vec461 = shufflevector <8 x i64> %wide.vec460, <8 x
>     i64> undef, <2 x i32> <i32 0, i32 4>
>       %strided.vec462 = shufflevector <8 x i64> %wide.vec460, <8 x
>     i64> undef, <2 x i32> <i32 1, i32 5>
>       %strided.vec463 = shufflevector <8 x i64> %wide.vec460, <8 x
>     i64> undef, <2 x i32> <i32 2, i32 6>
>       %strided.vec464 = shufflevector <8 x i64> %wide.vec460, <8 x
>     i64> undef, <2 x i32> <i32 3, i32 7>
>       %196 = getelementptr inbounds i64, i64* %next.gep455, i64 1
>       %197 = getelementptr i64, i64* %next.gep455, i32 0
>       %198 = bitcast i64* %197 to <8 x i64>*
>       %wide.vec465 = load <8 x i64>, <8 x i64>* %198, align 8,
>     !alias.scope !24
>       %strided.vec466 = shufflevector <8 x i64> %wide.vec465, <8 x
>     i64> undef, <2 x i32> <i32 0, i32 4>
>       %strided.vec467 = shufflevector <8 x i64> %wide.vec465, <8 x
>     i64> undef, <2 x i32> <i32 1, i32 5>
>       %strided.vec468 = shufflevector <8 x i64> %wide.vec465, <8 x
>     i64> undef, <2 x i32> <i32 2, i32 6>
>       %strided.vec469 = shufflevector <8 x i64> %wide.vec465, <8 x
>     i64> undef, <2 x i32> <i32 3, i32 7>
>       %199 = and <2 x i64> %strided.vec466, %strided.vec461
>       %200 = getelementptr inbounds i64, i64* %next.gep453, i64 1
>       %201 = getelementptr inbounds i64, i64* %next.gep454, i64 2
>       %202 = getelementptr inbounds i64, i64* %next.gep455, i64 2
>       %203 = and <2 x i64> %strided.vec467, %strided.vec462
>       %204 = getelementptr inbounds i64, i64* %next.gep453, i64 2
>       %205 = getelementptr inbounds i64, i64* %next.gep454, i64 3
>       %206 = getelementptr inbounds i64, i64* %next.gep455, i64 3
>       %207 = and <2 x i64> %strided.vec468, %strided.vec463
>       %208 = getelementptr inbounds i64, i64* %next.gep453, i64 3
>       %209 = getelementptr inbounds i64, i64* %next.gep454, i64 4
>       %210 = getelementptr inbounds i64, i64* %next.gep455, i64 4
>       %211 = and <2 x i64> %strided.vec469, %strided.vec464
>       %212 = getelementptr inbounds i64, i64* %next.gep453, i64 4
>       %213 = getelementptr i64, i64* %208, i32 -3
>       %214 = bitcast i64* %213 to <8 x i64>*
>       %215 = shufflevector <2 x i64> %199, <2 x i64> %203, <4 x i32>
>     <i32 0, i32 1, i32 2, i32 3>
>       %216 = shufflevector <2 x i64> %207, <2 x i64> %211, <4 x i32>
>     <i32 0, i32 1, i32 2, i32 3>
>       %217 = shufflevector <4 x i64> %215, <4 x i64> %216, <8 x i32>
>     <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
>       %interleaved.vec470 = shufflevector <8 x i64> %217, <8 x i64>
>     undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5,
>     i32 7>
>       store <8 x i64> %interleaved.vec470, <8 x i64>* %214, align 8,
>     !alias.scope !26, !noalias !28
>       %218 = icmp eq i32 %192, 0
>       %index.next442 = add i64 %index441, 2
>       %219 = icmp eq i64 %index.next442, %n.vec425
>       br i1 %219, label %middle.block420, label %vector.body419,
>     !llvm.loop !29
>
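To check that the shuffles in the vectorized body above really cancel out, the data flow can be modelled on plain lists. This is only a sketch in Python: `shufflevector` below is a toy model of the IR instruction (index into the concatenation of both operands), and `vectorized_body` is a made-up name that follows the loads, the four `and`s and the store of one iteration.

```python
def shufflevector(a, b, mask):
    """Toy model of LLVM's shufflevector: pick lanes from a ++ b."""
    both = list(a) + list(b)
    return [both[i] for i in mask]


UNDEF8 = [None] * 8  # stands in for the <8 x i64> undef operand


def vectorized_body(wide_ll, wide_rl):
    """Return the 8 lanes stored by one iteration of vector.body419."""
    # %strided.vec461..464 and %strided.vec466..469: de-interleave, stride 4.
    s_ll = [shufflevector(wide_ll, UNDEF8, [k, k + 4]) for k in range(4)]
    s_rl = [shufflevector(wide_rl, UNDEF8, [k, k + 4]) for k in range(4)]
    # %199, %203, %207, %211: lane-wise and of matching strided vectors.
    anded = [[x & y for x, y in zip(r, l)] for r, l in zip(s_rl, s_ll)]
    # %215, %216, %217: plain concatenations.
    v215 = shufflevector(anded[0], anded[1], [0, 1, 2, 3])
    v216 = shufflevector(anded[2], anded[3], [0, 1, 2, 3])
    v217 = shufflevector(v215, v216, list(range(8)))
    # %interleaved.vec470: re-interleave before the wide store.
    return shufflevector(v217, UNDEF8, [0, 2, 4, 6, 1, 3, 5, 7])


ll = [0x10 + i for i in range(8)]
rl = [0xF0 | i for i in range(8)]
stored = vectorized_body(ll, rl)
print(stored == [x & y for x, y in zip(rl, ll)])
```

In this model the stored vector equals a direct lane-wise `and` of the two wide loads, so the whole body is equivalent to one `<8 x i64>` `and` with no shuffles at all.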
>     Loop after instruction combining:
>
>     vector.body419:                                   ; preds =
>     %vector.body419, %vector.body419.preheader
>       %lsr.iv62 = phi i8* [ %scevgep63, %vector.body419 ], [ %dc.2,
>     %vector.body419.preheader ]
>       %lsr.iv59 = phi i8* [ %scevgep60, %vector.body419 ], [ %cond,
>     %vector.body419.preheader ]
>       %lsr.iv56 = phi i8* [ %scevgep57, %vector.body419 ], [ %cond57,
>     %vector.body419.preheader ]
>       %lsr.iv54 = phi i64 [ %lsr.iv.next55, %vector.body419 ], [
>     %n.vec425, %vector.body419.preheader ]
>       %lsr.iv6264 = bitcast i8* %lsr.iv62 to <8 x i64>*
>       %lsr.iv5961 = bitcast i8* %lsr.iv59 to <8 x i64>*
>       %lsr.iv5658 = bitcast i8* %lsr.iv56 to <8 x i64>*
>       %wide.vec460 = load <8 x i64>, <8 x i64>* %lsr.iv5961, align 8,
>     !alias.scope !21
>       %wide.vec465 = load <8 x i64>, <8 x i64>* %lsr.iv5658, align 8,
>     !alias.scope !24
>       %179 = and <8 x i64> %wide.vec465, %wide.vec460
>       %180 = shufflevector <8 x i64> %179, <8 x i64> undef, <2 x i32>
>     <i32 0, i32 4>
>       %181 = and <8 x i64> %wide.vec465, %wide.vec460
>       %182 = shufflevector <8 x i64> %181, <8 x i64> undef, <2 x i32>
>     <i32 1, i32 5>
>       %183 = and <8 x i64> %wide.vec465, %wide.vec460
>       %184 = shufflevector <8 x i64> %183, <8 x i64> undef, <2 x i32>
>     <i32 2, i32 6>
>       %185 = and <8 x i64> %wide.vec465, %wide.vec460
>       %186 = shufflevector <8 x i64> %185, <8 x i64> undef, <2 x i32>
>     <i32 3, i32 7>
>       %187 = shufflevector <2 x i64> %180, <2 x i64> %182, <4 x i32>
>     <i32 0, i32 1, i32 2, i32 3>
>       %188 = shufflevector <2 x i64> %184, <2 x i64> %186, <4 x i32>
>     <i32 0, i32 1, i32 2, i32 3>
>       %interleaved.vec470 = shufflevector <4 x i64> %187, <4 x i64>
>     %188, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 1, i32 3, i32 5,
>     i32 7>
>       store <8 x i64> %interleaved.vec470, <8 x i64>* %lsr.iv6264,
>     align 8, !alias.scope !26, !noalias !28
>       %lsr.iv.next55 = add i64 %lsr.iv54, -2
>       %scevgep57 = getelementptr i8, i8* %lsr.iv56, i64 64
>       %scevgep60 = getelementptr i8, i8* %lsr.iv59, i64 64
>       %scevgep63 = getelementptr i8, i8* %lsr.iv62, i64 64
>       %189 = icmp eq i64 %lsr.iv.next55, 0
>       br i1 %189, label %middle.block420, label %vector.body419,
>     !llvm.loop !29
>
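For the IR after instruction combining above, one way to see what a combine would have to recognise is to compose the shuffle masks symbolically instead of simulating values. The Python sketch below rests on one assumption: %179, %181, %183 and %185 are the same value, since they are identical `and`s of the same two operands and a CSE pass would be expected to merge them. The `compose` helper is hypothetical, not an LLVM API.

```python
def compose(mask, sources):
    """Map a shuffle mask over operands that are themselves described as
    lists of lane indices into one common root vector."""
    flat = [idx for src in sources for idx in src]
    return [flat[i] for i in mask]


root = list(range(8))  # lanes of the single <8 x i64> and result

v180 = compose([0, 4], [root])
v182 = compose([1, 5], [root])
v184 = compose([2, 6], [root])
v186 = compose([3, 7], [root])
v187 = compose([0, 1, 2, 3], [v180, v182])
v188 = compose([0, 1, 2, 3], [v184, v186])
final = compose([0, 2, 4, 6, 1, 3, 5, 7], [v187, v188])

print(v187, v188)
print(final == root)
```

Here %187 and %188 come out as the [0 4 1 5] and [2 6 3 7] vectors from the diagram in the original mail, and the composed mask of %interleaved.vec470 is the identity, so in this model the store could take the `and` result directly.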
>     Final vectorized loop
>
>     .LBB0_141:                              # %vector.body419
>                                             # =>This Inner Loop
>     Header: Depth=1
>             vl      %v0, 48(%r8)
>             vl      %v1, 48(%r7)
>             vn      %v0, %v1, %v0
>             vl      %v1, 16(%r8)
>             vl      %v2, 16(%r7)
>             vn      %v1, %v2, %v1
>             vmrlg   %v2, %v1, %v0
>             vmrhg   %v0, %v1, %v0
>             vmrlg   %v1, %v0, %v2
>             vst     %v1, 48(%r9)
>             vl      %v1, 32(%r8)
>             vl      %v3, 32(%r7)
>             vn      %v1, %v3, %v1
>             vl      %v3, 0(%r8)
>             vl      %v4, 0(%r7)
>             vn      %v3, %v4, %v3
>             vmrlg   %v4, %v3, %v1
>             vmrhg   %v1, %v3, %v1
>             vmrlg   %v3, %v1, %v4
>             vst     %v3, 32(%r9)
>             vmrhg   %v0, %v0, %v2
>             vst     %v0, 16(%r9)
>             vmrhg   %v0, %v1, %v4
>             vst     %v0, 0(%r9)
>             la      %r9, 64(%r9)
>             la      %r8, 64(%r8)
>             la      %r7, 64(%r7)
>             aghi    %r13, -2
>             jne     .LBB0_141
>
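The final vectorized loop above can be traced the same way. The Python sketch below models each 128-bit vector register as a list of two doublewords. The merge semantics are my understanding of the instructions, not something stated in the thread: `vmrhg` is taken to produce the high (leftmost) doubleword of each source and `vmrlg` the low one, with operands in the order destination, first source, second source. All function names are made up.

```python
def vn(a, b):
    """Vector AND, lane by lane."""
    return [x & y for x, y in zip(a, b)]


def vmrhg(a, b):
    """Assumed VECTOR MERGE HIGH (doubleword): high lane of each source."""
    return [a[0], b[0]]


def vmrlg(a, b):
    """Assumed VECTOR MERGE LOW (doubleword): low lane of each source."""
    return [a[1], b[1]]


def vector_loop_body(r8, r7):
    """Run one iteration of .LBB0_141 on two 8-doubleword inputs and return
    the 8 doublewords written through %r9."""
    out = [None] * 8

    def vl(mem, off):
        return mem[off // 8: off // 8 + 2]

    def vst(v, off):
        out[off // 8: off // 8 + 2] = v

    v0 = vn(vl(r7, 48), vl(r8, 48))
    v1 = vn(vl(r7, 16), vl(r8, 16))
    v2 = vmrlg(v1, v0)
    v0 = vmrhg(v1, v0)
    v1 = vmrlg(v0, v2)
    vst(v1, 48)
    v1 = vn(vl(r7, 32), vl(r8, 32))
    v3 = vn(vl(r7, 0), vl(r8, 0))
    v4 = vmrlg(v3, v1)
    v1 = vmrhg(v3, v1)
    v3 = vmrlg(v1, v4)
    vst(v3, 32)
    v0 = vmrhg(v0, v2)
    vst(v0, 16)
    v0 = vmrhg(v1, v4)
    vst(v0, 0)
    return out


a = [0x10 + i for i in range(8)]
b = [0xF0 | i for i in range(8)]
print(vector_loop_body(a, b) == [x & y for x, y in zip(b, a)])
```

Under these assumptions every store writes exactly the `and` of the loads from the same offset, so the eight merge instructions per iteration only undo each other and the loop could consist of four `vl`/`vl`/`vn`/`vst` groups.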
>     Final scalar loop :
>     .LBB0_152:                              # %while.body320
>                                             # =>This Inner Loop
>     Header: Depth=1
>             lg      %r13, 0(%r14)
>             ng      %r13, 0(%r5)
>             stg     %r13, 0(%r4)
>             lg      %r13, 8(%r14)
>             ng      %r13, 8(%r5)
>             stg     %r13, 8(%r4)
>             lg      %r13, 16(%r14)
>             ng      %r13, 16(%r5)
>             stg     %r13, 16(%r4)
>             lg      %r13, 24(%r14)
>             ng      %r13, 24(%r5)
>             stg     %r13, 24(%r4)
>             la      %r4, 32(%r4)
>             la      %r14, 32(%r14)
>             la      %r5, 32(%r5)
>             brct    %r0, .LBB0_152
>             j       .LBB0_155
>
>     _______________________________________________
>     LLVM Developers mailing list
>     llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>     <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>
>
