thr3ads.net - llvm dev - [LLVMdev] Vectorizer using Instruction, not opcodes [Feb 2013]

If this information is useful, please help other people find it:
Share via:

Renato Golin

2013-Feb-04 17:49 UTC

[LLVMdev] Vectorizer using Instruction, not opcodes

Hi folks,

I've been thinking on how to implement some of the costs and there is a lot
of instructions which cost depend on other instructions around. Casts are
one obvious case, since arithmetic and memory instructions can, sometimes,
cast values for free.

The cost model receives Opcodes, which lose the info on the history of the
values being vectorized, and I thought we could pass the whole Instruction
instead. But this is the easy part... Below are some  ideas which I'd love
to hear from you guys which makes sense and which doesn't...

A. How to find context?

Suppose you have a cast instruction, and you want to find out if there is
an arithmetic operations before (or after) on any of the values it
touches...

%3 = sext %type1 %2 to %type2

You could iterate through all uses of %3 and %2 and see if any of them is a
mul/add/sub, and return 'true', so that the cost would add (or not) an
extra cycle.

Problems:

1. This space is multi-dimensional, ie. there are many types in and many
out, and the cost table is not linear on any of the dimensions (hence, cost
tables). An idea is to create a multi-dimension cost table, but that could
get confusing really quickly.

2. We're essentially reproducing the logic of Inst Combine, which is not
sensible nor appropriate.

Any ideas on direction for this one?


B. How to find resulting code?

Another alternative is to run some lower level passes on the code, to
reduce the chances that we're guessing in the dark, but that'd be
extremely
expensive. Maybe we can only run the vectorizer at the very end, and only
if code changed, that we'd re-ruin again the clean-up passes. I'm not
convinced...


C. Settle for guess in the dark

We can continue guessing in the dark, but do it consistently. This is our
approach, and I'm, wondering if it makes sense to deviate from it.

My main issue is that the costs for ARM instructions are almost random when
seen from an IR perspective, we're bound to make as many mistakes as not.

Another issue is that the number of vector IR instructions (extract/select)
is a lot more than the resulting vector code, and that can count negatively
on the benefits of vectorizing.


It's also quite possible that I'm over-thinking...

Thoughts?

cheers,
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130204/9edd894d/attachment.html>

Arnold Schwaighofer

2013-Feb-04 18:25 UTC

head link

[LLVMdev] Vectorizer using Instruction, not opcodes

Hi all,

My take on this is that, as you state below, at the IR level we are only roughly
estimating cost, at best (or we would have to lower the code and then estimate
cost - something we don't want to do).

I would propose for estimating the "worst case costs" and see how far
we get with this. My rational here is that we don't want vectorization to
decrease performance relative to scalar code. If we miss an opportunity this is
bad, but not as bad as degrading relative to scalar code.

For cases where this approach breaks really badly we could consider adding a
specialized api or parameters (like the type of a user/use). But we should do so
only as a last resort and backed by actual code that would benefit from doing
so.

- Arnold

On Feb 4, 2013, at 11:49 AM, Renato Golin <renato.golin at linaro.org>
wrote:
> C. Settle for guess in the dark
> 
> We can continue guessing in the dark, but do it consistently. This is our
approach, and I'm, wondering if it makes sense to deviate from it.
> 
> My main issue is that the costs for ARM instructions are almost random when
seen from an IR perspective, we're bound to make as many mistakes as not.
> 
> Another issue is that the number of vector IR instructions (extract/select)
is a lot more than the resulting vector code, and that can count negatively on
the benefits of vectorizing.
Do you have an example where this happening? If the cost of extracts/inserts
(could maybe shuffle instructions be used to express a series of
extracts/inserts) is small it should be mostly neutralized by dividing by the
vector length and amortized over other vectorized instructions in the block?

Renato Golin

2013-Feb-04 19:38 UTC

head link

[LLVMdev] Vectorizer using Instruction, not opcodes

On 4 February 2013 18:25, Arnold Schwaighofer <aschwaighofer at
apple.com>wrote:
> For cases where this approach breaks really badly we could consider adding
> a specialized api or parameters (like the type of a user/use). But we
> should do so only as a last resort and backed by actual code that would
> benefit from doing so.
>
Very sensible, more or less what I had in mind. I think we could go one
step further and just get some high-level decisions like: is this cast
associated with an (any) arithmetic operation? It won't be perfect, but it
adds a bit more information at very reduced price. Though, this would
require us to pass the Instruction, not the Opcode.


Do you have an example where this happening?


The example below is not stopping the vectorizer, but it does add a lot of
costs where there are none...

** C code:

 int direct (int k) {
  int i;
  int a[256], b[256], c[256];

  for (i=0; i<256; i++){
    a[i] = b[i] * c[i];
  }
  return a[k];
}

** ASM vectorized result:

        adr     r5, .LCPI0_0
        vdup.32 q9, r1
        vld1.64 {d16, d17}, [r5, :128]
        add     r1, r1, #4
        vadd.i32        q8, q9, q8
        cmp     r3, r1
        vmov.32 r5, d16[0]
        add     r6, lr, r5, lsl #2
        add     r7, r2, r5, lsl #2
        vld1.32 {d16, d17}, [r6]
        add     r5, r4, r5, lsl #2
        vld1.32 {d18, d19}, [r7]
        vmul.i32        q8, q9, q8
        vst1.32 {d16, d17}, [r5]
        bne     .LBB0_2

** Vectorized IR (just the loop):

vector.body:                                      ; preds = %vector.body, %
vector.ph
  %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32
0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4
x
i32> undef, <4 x i32> zeroinitializer
  %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2,
i32 3>
  %0 = extractelement <4 x i32> %induction, i32 0
  %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0
  %2 = insertelement <4 x i32*> undef, i32* %1, i32 0
  %3 = extractelement <4 x i32> %induction, i32 1
  %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3
  %5 = insertelement <4 x i32*> %2, i32* %4, i32 1
  %6 = extractelement <4 x i32> %induction, i32 2
  %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6
  %8 = insertelement <4 x i32*> %5, i32* %7, i32 2
  %9 = extractelement <4 x i32> %induction, i32 3
  %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9
  %11 = insertelement <4 x i32*> %8, i32* %10, i32 3
  %12 = extractelement <4 x i32> %induction, i32 0
  %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12
  %14 = getelementptr i32* %13, i32 0
  %15 = bitcast i32* %14 to <4 x i32>*
  %wide.load = load <4 x i32>* %15, align 4
  %16 = extractelement <4 x i32> %induction, i32 0
  %17 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %16
  %18 = insertelement <4 x i32*> undef, i32* %17, i32 0
  %19 = extractelement <4 x i32> %induction, i32 1
  %20 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %19
  %21 = insertelement <4 x i32*> %18, i32* %20, i32 1
  %22 = extractelement <4 x i32> %induction, i32 2
  %23 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %22
  %24 = insertelement <4 x i32*> %21, i32* %23, i32 2
  %25 = extractelement <4 x i32> %induction, i32 3
  %26 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %25
  %27 = insertelement <4 x i32*> %24, i32* %26, i32 3
  %28 = extractelement <4 x i32> %induction, i32 0
  %29 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %28
  %30 = getelementptr i32* %29, i32 0
  %31 = bitcast i32* %30 to <4 x i32>*
  %wide.load3 = load <4 x i32>* %31, align 4
  %32 = mul nsw <4 x i32> %wide.load3, %wide.load
  %33 = extractelement <4 x i32> %induction, i32 0
  %34 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %33
  %35 = insertelement <4 x i32*> undef, i32* %34, i32 0
  %36 = extractelement <4 x i32> %induction, i32 1
  %37 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %36
  %38 = insertelement <4 x i32*> %35, i32* %37, i32 1
  %39 = extractelement <4 x i32> %induction, i32 2
  %40 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %39
  %41 = insertelement <4 x i32*> %38, i32* %40, i32 2
  %42 = extractelement <4 x i32> %induction, i32 3
  %43 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %42
  %44 = insertelement <4 x i32*> %41, i32* %43, i32 3
  %45 = extractelement <4 x i32> %induction, i32 0
  %46 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %45
  %47 = getelementptr i32* %46, i32 0
  %48 = bitcast i32* %47 to <4 x i32>*
  store <4 x i32> %32, <4 x i32>* %48, align 4
  %49 = add nsw <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1>
  %50 = icmp eq <4 x i32> %49, <i32 256, i32 256, i32 256, i32 256>
  %index.next = add i32 %index, 4
  %51 = icmp eq i32 %index.next, %end.idx.rnd.down
  br i1 %51, label %middle.block, label %vector.body

** Cost analysis:

Cost Model: Found an estimated cost of 1 for instruction:   %induction add <4
x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3>
Cost Model: Found an estimated cost of 1 for instruction:   %0 extractelement
<4 x i32> %induction, i32 0
Cost Model: Unknown cost for instruction:   %1 = getelementptr inbounds
[256 x i32]* %b, i32 0, i32 %0
Cost Model: Found an estimated cost of 1 for instruction:   %2 insertelement
<4 x i32*> undef, i32* %1, i32 0
Cost Model: Found an estimated cost of 1 for instruction:   %3 extractelement
<4 x i32> %induction, i32 1
Cost Model: Unknown cost for instruction:   %4 = getelementptr inbounds
[256 x i32]* %b, i32 0, i32 %3
Cost Model: Found an estimated cost of 1 for instruction:   %5 insertelement
<4 x i32*> %2, i32* %4, i32 1
Cost Model: Found an estimated cost of 1 for instruction:   %6 extractelement
<4 x i32> %induction, i32 2
Cost Model: Unknown cost for instruction:   %7 = getelementptr inbounds
[256 x i32]* %b, i32 0, i32 %6
Cost Model: Found an estimated cost of 1 for instruction:   %8 insertelement
<4 x i32*> %5, i32* %7, i32 2
Cost Model: Found an estimated cost of 1 for instruction:   %9 extractelement
<4 x i32> %induction, i32 3
Cost Model: Unknown cost for instruction:   %10 = getelementptr inbounds
[256 x i32]* %b, i32 0, i32 %9
Cost Model: Found an estimated cost of 1 for instruction:   %11 insertelement
<4 x i32*> %8, i32* %10, i32 3
Cost Model: Found an estimated cost of 1 for instruction:   %12 extractelement
<4 x i32> %induction, i32 0
Cost Model: Unknown cost for instruction:   %13 = getelementptr inbounds
[256 x i32]* %b, i32 0, i32 %12
Cost Model: Unknown cost for instruction:   %14 = getelementptr i32* %13,
i32 0

and so on...
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130204/ca2977c3/attachment.html>

Apparently Analagous Threads

Search for more reasonably related threads

llvm dev - Feb 2013 - [LLVMdev] Vectorizer using Instruction, not opcodes

[LLVMdev] Vectorizer using Instruction, not opcodes

[LLVMdev] Vectorizer using Instruction, not opcodes

[LLVMdev] Vectorizer using Instruction, not opcodes

Apparently Analagous Threads