On 4 February 2013 18:25, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:

> For cases where this approach breaks really badly we could consider adding
> a specialized api or parameters (like the type of a user/use). But we
> should do so only as a last resort and backed by actual code that would
> benefit from doing so.

Very sensible, more or less what I had in mind. I think we could go one
step further and just get some high-level decisions like: is this cast
associated with an (any) arithmetic operation? It won't be perfect, but it
adds a bit more information at a very reduced price. Though, this would
require us to pass the Instruction, not the Opcode.

> Do you have an example where this is happening?

The example below is not stopping the vectorizer, but it does add a lot of
costs where there are none...

** C code:

int direct (int k) {
  int i;
  int a[256], b[256], c[256];

  for (i=0; i<256; i++){
    a[i] = b[i] * c[i];
  }
  return a[k];
}

** ASM vectorized result:

	adr	r5, .LCPI0_0
	vdup.32	q9, r1
	vld1.64	{d16, d17}, [r5, :128]
	add	r1, r1, #4
	vadd.i32	q8, q9, q8
	cmp	r3, r1
	vmov.32	r5, d16[0]
	add	r6, lr, r5, lsl #2
	add	r7, r2, r5, lsl #2
	vld1.32	{d16, d17}, [r6]
	add	r5, r4, r5, lsl #2
	vld1.32	{d18, d19}, [r7]
	vmul.i32	q8, q9, q8
	vst1.32	{d16, d17}, [r5]
	bne	.LBB0_2

** Vectorized IR (just the loop):

vector.body:                         ; preds = %vector.body, %vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
  %0 = extractelement <4 x i32> %induction, i32 0
  %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0
  %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
  %3 = extractelement <4 x i32> %induction, i32 1
  %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3
  %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
  %6 = extractelement <4 x i32> %induction, i32 2
  %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6
  %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
  %9 = extractelement <4 x i32> %induction, i32 3
  %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9
  %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
  %12 = extractelement <4 x i32> %induction, i32 0
  %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12
  %14 = getelementptr i32* %13, i32 0
  %15 = bitcast i32* %14 to <4 x i32>*
  %wide.load = load <4 x i32>* %15, align 4
  %16 = extractelement <4 x i32> %induction, i32 0
  %17 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %16
  %18 = insertelement <4 x i32*> undef, i32* %17, i32 0
  %19 = extractelement <4 x i32> %induction, i32 1
  %20 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %19
  %21 = insertelement <4 x i32*> %18, i32* %20, i32 1
  %22 = extractelement <4 x i32> %induction, i32 2
  %23 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %22
  %24 = insertelement <4 x i32*> %21, i32* %23, i32 2
  %25 = extractelement <4 x i32> %induction, i32 3
  %26 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %25
  %27 = insertelement <4 x i32*> %24, i32* %26, i32 3
  %28 = extractelement <4 x i32> %induction, i32 0
  %29 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %28
  %30 = getelementptr i32* %29, i32 0
  %31 = bitcast i32* %30 to <4 x i32>*
  %wide.load3 = load <4 x i32>* %31, align 4
  %32 = mul nsw <4 x i32> %wide.load3, %wide.load
  %33 = extractelement <4 x i32> %induction, i32 0
  %34 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %33
  %35 = insertelement <4 x i32*> undef, i32* %34, i32 0
  %36 = extractelement <4 x i32> %induction, i32 1
  %37 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %36
  %38 = insertelement <4 x i32*> %35, i32* %37, i32 1
  %39 = extractelement <4 x i32> %induction, i32 2
  %40 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %39
  %41 = insertelement <4 x i32*> %38, i32* %40, i32 2
  %42 = extractelement <4 x i32> %induction, i32 3
  %43 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %42
  %44 = insertelement <4 x i32*> %41, i32* %43, i32 3
  %45 = extractelement <4 x i32> %induction, i32 0
  %46 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %45
  %47 = getelementptr i32* %46, i32 0
  %48 = bitcast i32* %47 to <4 x i32>*
  store <4 x i32> %32, <4 x i32>* %48, align 4
  %49 = add nsw <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1>
  %50 = icmp eq <4 x i32> %49, <i32 256, i32 256, i32 256, i32 256>
  %index.next = add i32 %index, 4
  %51 = icmp eq i32 %index.next, %end.idx.rnd.down
  br i1 %51, label %middle.block, label %vector.body

** Cost analysis:

Cost Model: Found an estimated cost of 1 for instruction: %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
Cost Model: Found an estimated cost of 1 for instruction: %0 = extractelement <4 x i32> %induction, i32 0
Cost Model: Unknown cost for instruction: %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0
Cost Model: Found an estimated cost of 1 for instruction: %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
Cost Model: Found an estimated cost of 1 for instruction: %3 = extractelement <4 x i32> %induction, i32 1
Cost Model: Unknown cost for instruction: %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3
Cost Model: Found an estimated cost of 1 for instruction: %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
Cost Model: Found an estimated cost of 1 for instruction: %6 = extractelement <4 x i32> %induction, i32 2
Cost Model: Unknown cost for instruction: %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6
Cost Model: Found an estimated cost of 1 for instruction: %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
Cost Model: Found an estimated cost of 1 for instruction: %9 = extractelement <4 x i32> %induction, i32 3
Cost Model: Unknown cost for instruction: %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9
Cost Model: Found an estimated cost of 1 for instruction: %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
Cost Model: Found an estimated cost of 1 for instruction: %12 = extractelement <4 x i32> %induction, i32 0
Cost Model: Unknown cost for instruction: %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12
Cost Model: Unknown cost for instruction: %14 = getelementptr i32* %13, i32 0

and so on...
----- Original Message -----
> From: "Renato Golin" <renato.golin at linaro.org>
> To: "Arnold Schwaighofer" <aschwaighofer at apple.com>
> Cc: "LLVM Dev" <llvmdev at cs.uiuc.edu>, "Nadav Rotem" <nrotem at apple.com>, "Hal Finkel" <hfinkel at anl.gov>
> Sent: Monday, February 4, 2013 1:38:03 PM
> Subject: Re: Vectorizer using Instruction, not opcodes
>
> On 4 February 2013 18:25, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
>> For cases where this approach breaks really badly we could consider
>> adding a specialized api or parameters (like the type of a user/use).
>> But we should do so only as a last resort and backed by actual code
>> that would benefit from doing so.
>
> Very sensible, more or less what I had in mind. I think we could go
> one step further and just get some high-level decisions like: is
> this cast associated with an (any) arithmetic operation? It won't be
> perfect, but it adds a bit more information at a very reduced price.
> Though, this would require us to pass the Instruction, not the
> Opcode.

We probably don't want to end up in a situation where we're speculatively
creating instructions just to test their cost using the cost model. For
simple things, passing an existing instruction and its widening factor
will work, but I worry that won't really be sufficiently general. Would a
tree of <opcode, return-type, input-type>s work? How deep would it need
to be?

We also need to decide on some self-consistent way of dealing with cost
folding. What I mean is that if an instruction can be folded with its use,
do we register that cost savings for the use or for the operand? We'd also
need to pass hasOneUse() information.

 -Hal

>> Do you have an example where this is happening?
>
> The example below is not stopping the vectorizer, but it does add a
> lot of costs where there are none...
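Hal's proposed tree of <opcode, return-type, input-type> triples might look
roughly like the following toy model (Python, purely for discussion — the
CostQuery name, its fields, and the depth metric are invented here, not any
actual LLVM API):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CostQuery:
    """One node of a hypothetical cost-query tree: an opcode plus its
    result/input types, with children describing operand sub-expressions."""
    opcode: str                       # e.g. "mul", "sext"
    ret_ty: str                       # e.g. "<4 x i32>"
    in_ty: str                        # e.g. "<4 x i16>"
    operands: List["CostQuery"] = field(default_factory=list)

    def depth(self) -> int:
        # How many levels of context the target cost hook gets to see.
        return 1 + max((op.depth() for op in self.operands), default=0)

# A widened multiply whose operand is a sign-extension: the kind of
# folded pattern (vmull-style) a target would want to price as one unit.
sext = CostQuery("sext", "<4 x i32>", "<4 x i16>")
mul = CostQuery("mul", "<4 x i32>", "<4 x i32>", [sext])
print(mul.depth())  # → 2
```

Even depth 2 would be enough to answer the "is this cast fed into an
arithmetic operation?" question raised earlier in the thread.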
Arnold Schwaighofer
2013-Feb-04 20:21 UTC
[LLVMdev] Vectorizer using Instruction, not opcodes
The loop vectorizer does not estimate the cost of vectorization by looking
at the IR you list below. It does not vectorize and then run the
CostAnalysis pass. It estimates the cost itself before it even performs the
vectorization. The way it works is that it looks at all the scalar
instructions and asks: what is the cost if I execute the scalar instruction
as a vector instruction? Therefore, it will not consider any of the
insert/extractelement instructions you see below (note that they will all
go away anyway).

If you run clang++ -O3 -mllvm -debug-only=loop-vectorize you will see which
instructions it considers. E.g:

LV: Found an estimated cost of 0 for VF 1 For instruction:   %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %arrayidx = getelementptr inbounds [256 x i32]* @b, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction:   %0 = load i32* %arrayidx, align 4, !tbaa !0
LV: Found an estimated cost of 0 for VF 1 For instruction:   %arrayidx2 = getelementptr inbounds [256 x i32]* @c, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction:   %1 = load i32* %arrayidx2, align 4, !tbaa !0
LV: Found an estimated cost of 1 for VF 1 For instruction:   %mul = mul nsw i32 %1, %0
LV: Found an estimated cost of 0 for VF 1 For instruction:   %arrayidx4 = getelementptr inbounds [256 x i32]* @a, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i32 %mul, i32* %arrayidx4, align 4, !tbaa !0
LV: Found an estimated cost of 1 for VF 1 For instruction:   %indvars.iv.next = add i64 %indvars.iv, 1
LV: Found an estimated cost of 0 for VF 1 For instruction:   %lftr.wideiv = trunc i64 %indvars.iv.next to i32
LV: Found an estimated cost of 1 for VF 1 For instruction:   %exitcond = icmp ne i32 %lftr.wideiv, 256
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %exitcond, label %for.body, label %for.end
LV: Scalar loop costs: 6.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %arrayidx = getelementptr inbounds [256 x i32]* @b, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 2 For instruction:   %0 = load i32* %arrayidx, align 4, !tbaa !0
LV: Found an estimated cost of 0 for VF 2 For instruction:   %arrayidx2 = getelementptr inbounds [256 x i32]* @c, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 2 For instruction:   %1 = load i32* %arrayidx2, align 4, !tbaa !0
LV: Found an estimated cost of 2 for VF 2 For instruction:   %mul = mul nsw i32 %1, %0
LV: Found an estimated cost of 0 for VF 2 For instruction:   %arrayidx4 = getelementptr inbounds [256 x i32]* @a, i64 0, i64 %indvars.iv
LV: Found an estimated cost of 1 for VF 2 For instruction:   store i32 %mul, i32* %arrayidx4, align 4, !tbaa !0
LV: Found an estimated cost of 1 for VF 2 For instruction:   %indvars.iv.next = add i64 %indvars.iv, 1
LV: Found an estimated cost of 0 for VF 2 For instruction:   %lftr.wideiv = trunc i64 %indvars.iv.next to i32
LV: Found an estimated cost of 1 for VF 2 For instruction:   %exitcond = icmp ne i32 %lftr.wideiv, 256
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %exitcond, label %for.body, label %for.end

The corresponding code is in LoopVectorizationCostModel::getInstructionCost().
Basically, it just takes the scalar instruction, its type, creates a vector
type and asks: how much would an instruction of that vector type cost?
(Slight simplification.)

On Feb 4, 2013, at 1:38 PM, Renato Golin <renato.golin at linaro.org> wrote:

> The example below is not stopping the vectorizer, but it does add a
> lot of costs where there are none...
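The scheme Arnold describes — take the scalar instruction, widen its type
by VF, and ask the target for the cost of that hypothetical vector
instruction — can be caricatured in a few lines. This is a toy model: the
cost table below is invented for illustration (it only loosely mirrors the
VF 1/VF 2 debug output), and the real logic lives in
LoopVectorizationCostModel::getInstructionCost().

```python
# Toy per-(opcode, type) cost table; the numbers are made up for
# illustration and are not a real target's costs.
COSTS = {
    ("load",  "i32"): 1, ("load",  "<2 x i32>"): 1,
    ("mul",   "i32"): 1, ("mul",   "<2 x i32>"): 2,
    ("store", "i32"): 1, ("store", "<2 x i32>"): 1,
}

def widen(scalar_ty: str, vf: int) -> str:
    """Build the vector type the cost model would query for a given VF."""
    return scalar_ty if vf == 1 else f"<{vf} x {scalar_ty}>"

def instruction_cost(opcode: str, scalar_ty: str, vf: int) -> int:
    # The vectorizer never materializes the vector instruction: it just
    # widens the scalar instruction's type and asks what that would cost.
    return COSTS[(opcode, widen(scalar_ty, vf))]

# Body of the a[i] = b[i] * c[i] loop, minus addressing/control overhead:
loop = [("load", "i32"), ("load", "i32"), ("mul", "i32"), ("store", "i32")]
for vf in (1, 2):
    total = sum(instruction_cost(op, ty, vf) for op, ty in loop)
    print(f"VF={vf}: cost {total}")  # → VF=1: cost 4, VF=2: cost 5
```

Note that no insert/extractelement instructions ever enter the picture,
which is why the standalone CostAnalysis dump above is misleading here.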
On 4 February 2013 20:21, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:

> The loop vectorizer does not estimate the cost of vectorization by
> looking at the IR you list below. It does not vectorize and then run
> the CostAnalysis pass. It estimates the cost itself before it even
> performs the vectorization.

Ok, I see my mistake. I thought that they were being taken into account
(even before being created).

> (note that they will all go away anyway)

Thus, my worry. ;)

> The corresponding code is in
> LoopVectorizationCostModel::getInstructionCost(). Basically, it just
> takes the scalar instruction, its type, creates a vector type and asks:
> how much would an instruction of that vector type cost? (Slight
> simplification.)

And then divide by VF in the end. Ok, things are making more sense now. ;)

Thanks!

--renato
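The "divide by VF in the end" comparison reduces to something like the
following sketch: one vector iteration covers VF scalar iterations, so the
vector block cost is normalized per scalar iteration before being compared
against the scalar loop cost (6 in the debug dump above). This ignores the
runtime checks and other overheads the real heuristic also weighs.

```python
def cheaper_to_vectorize(vector_block_cost: float, vf: int,
                         scalar_loop_cost: float) -> bool:
    # One vector iteration does the work of VF scalar iterations, so
    # compare per-scalar-iteration costs.
    return vector_block_cost / vf < scalar_loop_cost

# If the scalar loop costs 6 per iteration and a hypothetical VF=4 body
# costs 8, that is 2 per scalar iteration -- vectorizing wins.
print(cheaper_to_vectorize(8, 4, 6))  # → True
```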
Arnold Schwaighofer
2013-Feb-04 20:44 UTC
[LLVMdev] Vectorizer using Instruction, not opcodes
Hi Hal,

On Feb 4, 2013, at 2:09 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> ----- Original Message -----
>> From: "Renato Golin" <renato.golin at linaro.org>
>> To: "Arnold Schwaighofer" <aschwaighofer at apple.com>
>> Cc: "LLVM Dev" <llvmdev at cs.uiuc.edu>, "Nadav Rotem" <nrotem at apple.com>, "Hal Finkel" <hfinkel at anl.gov>
>> Sent: Monday, February 4, 2013 1:38:03 PM
>> Subject: Re: Vectorizer using Instruction, not opcodes
>>
>> On 4 February 2013 18:25, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>>
>>> For cases where this approach breaks really badly we could consider
>>> adding a specialized api or parameters (like the type of a user/use).
>>> But we should do so only as a last resort and backed by actual code
>>> that would benefit from doing so.
>>
>> Very sensible, more or less what I had in mind. I think we could go
>> one step further and just get some high-level decisions like: is
>> this cast associated with an (any) arithmetic operation? It won't be
>> perfect, but it adds a bit more information at a very reduced price.
>> Though, this would require us to pass the Instruction, not the
>> Opcode.
>
> We probably don't want to end up in a situation where we're speculatively
> creating instructions just to test their cost using the cost model. For
> simple things, passing an existing instruction and its widening factor
> will work, but I worry that won't really be sufficiently general. Would
> a tree of <opcode, return-type, input-type>s work? How deep would it
> need to be?

Agreed, if we decide on more information than just type and opcode we will
need some abstract representation.

> We also need to decide on some self-consistent way of dealing with cost
> folding. What I mean is that if an instruction can be folded with its
> use, do we register that cost savings for the use or for the operand?
> We'd also need to pass hasOneUse() information.

I think this is probably going too far; I don't think we want to model all
of target lowering in the cost model. Any attempt would be a lie anyway ;).
I don't think we should model at such a fine granularity at the IR level,
because any attempt of doing so is bound to fail.

- Arnold
On 5 February 2013 11:03, David Tweed <david.tweed at arm.com> wrote:

> since the more instructions there are the more an out-of-order CPU can
> put them into otherwise unused slots. I can't think of a way of figuring
> out such a mapping other than empirically.

Given the amount of uncertainty on these OOO guesses, I don't think we can
get anything worth trying, even empirically. The noise will always
outweigh the signal.

You can normally save a few cycles on a static micro-benchmark and think
you're in control, but you normally don't evaluate the exact impact that
the same guess had on other benchmarks or real code. Since this is
auto-vectorization, it'll be hard to compare, but more so to evaluate the
overall impact on the rest.

--renato
Hi Renato,

> On 5 February 2013 11:03, David Tweed <david.tweed at arm.com> wrote:
>> since the more instructions there are the more an out-of-order CPU can
>> put them into otherwise unused slots. I can't think of a way of figuring
>> out such a mapping other than empirically.
>
> Given the amount of uncertainty on these OOO guesses, I don't think we
> can get anything worth trying, even empirically. The noise will always
> outweigh the signal.
>
> You can normally save a few cycles on a static micro-benchmark and think
> you're in control, but you normally don't evaluate the exact impact that
> the same guess had on other benchmarks or real code. Since this is
> auto-vectorization, it'll be hard to compare, but more so to evaluate
> the overall impact on the rest.

That may be the case. Just to be clear, I was thinking of something
incredibly simple along the lines of
(cost) ^ (min(1 - alpha*noOfInstructions, floorThreshold)) for some values
of alpha and floorThreshold. I'll probably have a look at this when I next
get some discretionary time, but that won't be for a couple of months.
Having said that, it may turn out that it doesn't really matter, in that
the more vector instructions you've got, the higher the likelihood it will
have a lower cost than a scalar version.

ave
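David's back-of-the-envelope formula can be played with directly. A sketch
under two stated assumptions: the min is read as a max, so that
floorThreshold actually acts as a floor on the exponent rather than a cap,
and the alpha/floorThreshold values are arbitrary placeholders.

```python
def discounted_cost(cost: float, n_instructions: int,
                    alpha: float = 0.01, floor_threshold: float = 0.5) -> float:
    # The exponent shrinks as the block grows, modelling an OOO core
    # hiding latency in otherwise unused issue slots; floor_threshold
    # stops the discount from growing without bound.
    exponent = max(1.0 - alpha * n_instructions, floor_threshold)
    return cost ** exponent

# With alpha=0.01, a 10-instruction block discounts a cost of 8 to 8**0.9,
# while a very large block bottoms out at 8**0.5.
small = discounted_cost(8, 10)
large = discounted_cost(8, 200)
print(small > large)  # → True
```

Whether any such empirical discount survives contact with real benchmarks
is exactly the question Renato raises above.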