Hi folks, I've been thinking on how to implement some of the costs and there is a lot of instructions which cost depend on other instructions around. Casts are one obvious case, since arithmetic and memory instructions can, sometimes, cast values for free. The cost model receives Opcodes, which lose the info on the history of the values being vectorized, and I thought we could pass the whole Instruction instead. But this is the easy part... Below are some ideas which I'd love to hear from you guys which makes sense and which doesn't... A. How to find context? Suppose you have a cast instruction, and you want to find out if there is an arithmetic operations before (or after) on any of the values it touches... %3 = sext %type1 %2 to %type2 You could iterate through all uses of %3 and %2 and see if any of them is a mul/add/sub, and return 'true', so that the cost would add (or not) an extra cycle. Problems: 1. This space is multi-dimensional, ie. there are many types in and many out, and the cost table is not linear on any of the dimensions (hence, cost tables). An idea is to create a multi-dimension cost table, but that could get confusing really quickly. 2. We're essentially reproducing the logic of Inst Combine, which is not sensible nor appropriate. Any ideas on direction for this one? B. How to find resulting code? Another alternative is to run some lower level passes on the code, to reduce the chances that we're guessing in the dark, but that'd be extremely expensive. Maybe we can only run the vectorizer at the very end, and only if code changed, that we'd re-ruin again the clean-up passes. I'm not convinced... C. Settle for guess in the dark We can continue guessing in the dark, but do it consistently. This is our approach, and I'm, wondering if it makes sense to deviate from it. My main issue is that the costs for ARM instructions are almost random when seen from an IR perspective, we're bound to make as many mistakes as not. Another issue is that the number of vector IR instructions (extract/select) is a lot more than the resulting vector code, and that can count negatively on the benefits of vectorizing. It's also quite possible that I'm over-thinking... Thoughts? cheers, --renato -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130204/9edd894d/attachment.html>
Arnold Schwaighofer
2013-Feb-04 18:25 UTC
[LLVMdev] Vectorizer using Instruction, not opcodes
Hi all, My take on this is that, as you state below, at the IR level we are only roughly estimating cost, at best (or we would have to lower the code and then estimate cost - something we don't want to do). I would propose for estimating the "worst case costs" and see how far we get with this. My rational here is that we don't want vectorization to decrease performance relative to scalar code. If we miss an opportunity this is bad, but not as bad as degrading relative to scalar code. For cases where this approach breaks really badly we could consider adding a specialized api or parameters (like the type of a user/use). But we should do so only as a last resort and backed by actual code that would benefit from doing so. - Arnold On Feb 4, 2013, at 11:49 AM, Renato Golin <renato.golin at linaro.org> wrote:> C. Settle for guess in the dark > > We can continue guessing in the dark, but do it consistently. This is our approach, and I'm, wondering if it makes sense to deviate from it. > > My main issue is that the costs for ARM instructions are almost random when seen from an IR perspective, we're bound to make as many mistakes as not. > > Another issue is that the number of vector IR instructions (extract/select) is a lot more than the resulting vector code, and that can count negatively on the benefits of vectorizing.Do you have an example where this happening? If the cost of extracts/inserts (could maybe shuffle instructions be used to express a series of extracts/inserts) is small it should be mostly neutralized by dividing by the vector length and amortized over other vectorized instructions in the block?
On 4 February 2013 18:25, Arnold Schwaighofer <aschwaighofer at apple.com>wrote:> For cases where this approach breaks really badly we could consider adding > a specialized api or parameters (like the type of a user/use). But we > should do so only as a last resort and backed by actual code that would > benefit from doing so. >Very sensible, more or less what I had in mind. I think we could go one step further and just get some high-level decisions like: is this cast associated with an (any) arithmetic operation? It won't be perfect, but it adds a bit more information at very reduced price. Though, this would require us to pass the Instruction, not the Opcode. Do you have an example where this happening? The example below is not stopping the vectorizer, but it does add a lot of costs where there are none... ** C code: int direct (int k) { int i; int a[256], b[256], c[256]; for (i=0; i<256; i++){ a[i] = b[i] * c[i]; } return a[k]; } ** ASM vectorized result: adr r5, .LCPI0_0 vdup.32 q9, r1 vld1.64 {d16, d17}, [r5, :128] add r1, r1, #4 vadd.i32 q8, q9, q8 cmp r3, r1 vmov.32 r5, d16[0] add r6, lr, r5, lsl #2 add r7, r2, r5, lsl #2 vld1.32 {d16, d17}, [r6] add r5, r4, r5, lsl #2 vld1.32 {d18, d19}, [r7] vmul.i32 q8, q9, q8 vst1.32 {d16, d17}, [r5] bne .LBB0_2 ** Vectorized IR (just the loop): vector.body: ; preds = %vector.body, % vector.ph %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ] %broadcast.splatinsert = insertelement <4 x i32> undef, i32 %index, i32 0 %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer %induction = add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3> %0 = extractelement <4 x i32> %induction, i32 0 %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0 %2 = insertelement <4 x i32*> undef, i32* %1, i32 0 %3 = extractelement <4 x i32> %induction, i32 1 %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3 %5 = insertelement <4 x i32*> %2, i32* %4, i32 1 %6 = extractelement <4 x i32> %induction, i32 2 %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6 %8 = insertelement <4 x i32*> %5, i32* %7, i32 2 %9 = extractelement <4 x i32> %induction, i32 3 %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9 %11 = insertelement <4 x i32*> %8, i32* %10, i32 3 %12 = extractelement <4 x i32> %induction, i32 0 %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12 %14 = getelementptr i32* %13, i32 0 %15 = bitcast i32* %14 to <4 x i32>* %wide.load = load <4 x i32>* %15, align 4 %16 = extractelement <4 x i32> %induction, i32 0 %17 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %16 %18 = insertelement <4 x i32*> undef, i32* %17, i32 0 %19 = extractelement <4 x i32> %induction, i32 1 %20 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %19 %21 = insertelement <4 x i32*> %18, i32* %20, i32 1 %22 = extractelement <4 x i32> %induction, i32 2 %23 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %22 %24 = insertelement <4 x i32*> %21, i32* %23, i32 2 %25 = extractelement <4 x i32> %induction, i32 3 %26 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %25 %27 = insertelement <4 x i32*> %24, i32* %26, i32 3 %28 = extractelement <4 x i32> %induction, i32 0 %29 = getelementptr inbounds [256 x i32]* %c, i32 0, i32 %28 %30 = getelementptr i32* %29, i32 0 %31 = bitcast i32* %30 to <4 x i32>* %wide.load3 = load <4 x i32>* %31, align 4 %32 = mul nsw <4 x i32> %wide.load3, %wide.load %33 = extractelement <4 x i32> %induction, i32 0 %34 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %33 %35 = insertelement <4 x i32*> undef, i32* %34, i32 0 %36 = extractelement <4 x i32> %induction, i32 1 %37 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %36 %38 = insertelement <4 x i32*> %35, i32* %37, i32 1 %39 = extractelement <4 x i32> %induction, i32 2 %40 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %39 %41 = insertelement <4 x i32*> %38, i32* %40, i32 2 %42 = extractelement <4 x i32> %induction, i32 3 %43 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %42 %44 = insertelement <4 x i32*> %41, i32* %43, i32 3 %45 = extractelement <4 x i32> %induction, i32 0 %46 = getelementptr inbounds [256 x i32]* %a, i32 0, i32 %45 %47 = getelementptr i32* %46, i32 0 %48 = bitcast i32* %47 to <4 x i32>* store <4 x i32> %32, <4 x i32>* %48, align 4 %49 = add nsw <4 x i32> %induction, <i32 1, i32 1, i32 1, i32 1> %50 = icmp eq <4 x i32> %49, <i32 256, i32 256, i32 256, i32 256> %index.next = add i32 %index, 4 %51 = icmp eq i32 %index.next, %end.idx.rnd.down br i1 %51, label %middle.block, label %vector.body ** Cost analysis: Cost Model: Found an estimated cost of 1 for instruction: %induction add <4 x i32> %broadcast.splat, <i32 0, i32 1, i32 2, i32 3> Cost Model: Found an estimated cost of 1 for instruction: %0 extractelement <4 x i32> %induction, i32 0 Cost Model: Unknown cost for instruction: %1 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %0 Cost Model: Found an estimated cost of 1 for instruction: %2 insertelement <4 x i32*> undef, i32* %1, i32 0 Cost Model: Found an estimated cost of 1 for instruction: %3 extractelement <4 x i32> %induction, i32 1 Cost Model: Unknown cost for instruction: %4 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %3 Cost Model: Found an estimated cost of 1 for instruction: %5 insertelement <4 x i32*> %2, i32* %4, i32 1 Cost Model: Found an estimated cost of 1 for instruction: %6 extractelement <4 x i32> %induction, i32 2 Cost Model: Unknown cost for instruction: %7 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %6 Cost Model: Found an estimated cost of 1 for instruction: %8 insertelement <4 x i32*> %5, i32* %7, i32 2 Cost Model: Found an estimated cost of 1 for instruction: %9 extractelement <4 x i32> %induction, i32 3 Cost Model: Unknown cost for instruction: %10 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %9 Cost Model: Found an estimated cost of 1 for instruction: %11 insertelement <4 x i32*> %8, i32* %10, i32 3 Cost Model: Found an estimated cost of 1 for instruction: %12 extractelement <4 x i32> %induction, i32 0 Cost Model: Unknown cost for instruction: %13 = getelementptr inbounds [256 x i32]* %b, i32 0, i32 %12 Cost Model: Unknown cost for instruction: %14 = getelementptr i32* %13, i32 0 and so on... -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130204/ca2977c3/attachment.html>
Possibly Parallel Threads
- [LLVMdev] Vectorizer using Instruction, not opcodes
- [LLVMdev] Vectorizer using Instruction, not opcodes
- [LLVMdev] NEON intrinsics preventing redundant load optimization?
- [LLVMdev] RE : Vector argument passing abi for ARM ?
- [LLVMdev] Vectorizer using Instruction, not opcodes