I'm working on some small improvements to SLPVectorizer.cpp so that it can deal with some tuple operations arising from Julia code. Being fairly new to LLVM, I could use some advice, particular from those familiar with the internals of SLPVectorizer. The motivation can be found in the Julia discussion https://github.com/JuliaLang/julia/issues/5857 . Here is an example of the kind of LLVM code I wish to vectorize. ------------------------------------------------------------- define <4 x float> @julia_foo111(<4 x float>, <4 x float>) { top: %2 = extractelement <4 x float> %0, i32 0 %3 = extractelement <4 x float> %1, i32 0 %4 = fadd float %2, %3 %5 = insertelement <4 x float> undef, float %4, i32 0 %6 = extractelement <4 x float> %0, i32 1 %7 = extractelement <4 x float> %1, i32 1 %8 = fadd float %6, %7 %9 = insertelement <4 x float> %5, float %8, i32 1 %10 = extractelement <4 x float> %0, i32 2 %11 = extractelement <4 x float> %1, i32 2 %12 = fadd float %10, %11 %13 = insertelement <4 x float> %9, float %12, i32 2 %14 = extractelement <4 x float> %0, i32 3 %15 = extractelement <4 x float> %1, i32 3 %16 = fadd float %14, %15 %17 = insertelement <4 x float> %13, float %16, i32 3 ret <4 x float> %17 } ------------------------------------------------------------- I want the fadd instructions to be vectorized. I've been able to implement most of what I need (see attached patch), but with a fatal flaw: the uses of the vectorized result are not moved as necessary. Here is the current (and quite illegal) result: ------------------------------------------------------------- top: %2 = extractelement <4 x float> %8, i32 0 %3 = insertelement <4 x float> undef, float %2, i32 0 %4 = extractelement <4 x float> %8, i32 1 %5 = insertelement <4 x float> %3, float %4, i32 1 %6 = extractelement <4 x float> %8, i32 2 %7 = insertelement <4 x float> %5, float %6, i32 2 %8 = fadd <4 x float> %0, %1 %9 = extractelement <4 x float> %8, i32 3 %10 = insertelement <4 x float> %7, float %9, i32 3 ret <4 x float> %10 ------------------------------------------------------------- Instructions %3, %A5 and %7 need to be moved to after Instructions %8. I'm wondering what is a good way to do this. The relevant place in SLPVectorizer.cpp is around here: ------------------------------------------------------------- if (Cost < -SLPCostThreshold) { DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n"); R.vectorizeTree(); // Move to the next bundle. i += VF - 1; Changed = true; } ------------------------------------------------------------- Should I try to move the instructions before or after the call to R.vectorizeTree()? Or maybe do it even later after all bundles have been vectorized? Are there LLVM utilities for doing this kind of fixup within a basic block? Any pointers/advice appreciated. - Arch -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140317/c4612128/attachment.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: SLPVectorizer.cpp.patch Type: application/octet-stream Size: 4777 bytes Desc: SLPVectorizer.cpp.patch URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140317/c4612128/attachment.obj>
Hi Arch, Thanks for looking at this. The reason the SLPVectorizer bails out on many cases that seem vectorizable is scheduling. It needs to produce a legal schedule. The way it does this is by making sure that it can move all vectorized instructions to the last instruction in a bundle. (Alternatively, you could build a dag, make sure that you don’t create cycles and then produce a topological sort, but this was not done out of compile time concerns). If I understand your patch correctly you are disabling the above mentioned check if the vectorizer starts at an insertelement instruction? What about other users? You still need to detect that you can schedule them correctly. define <4 x float> @julia_foo111(<4 x float>, <4 x float>) { top: %2 = extractelement <4 x float> %0, i32 0 %3 = extractelement <4 x float> %1, i32 0 %4 = fadd float %2, %3 %5 = insertelement <4 x float> undef, float %4, i32 0 %6 = extractelement <4 x float> %0, i32 1 %7 = extractelement <4 x float> %1, i32 1 %8 = fadd float %6, %7 %foo = operation which has a use of %8 that potentially feeds %12 but even if not all of its users now need to be move below %16 and we need to check all their users recursively … %9 = insertelement <4 x float> %5, float %8, i32 1 %10 = extractelement <4 x float> %0, i32 2 %11 = extractelement <4 x float> %1, i32 2 %12 = fadd float %10, %11 %13 = insertelement <4 x float> %9, float %12, i32 2 %14 = extractelement <4 x float> %0, i32 3 %15 = extractelement <4 x float> %1, i32 3 %16 = fadd float %14, %15 %17 = insertelement <4 x float> %13, float %16, i32 3 ret <4 x float> %17 } For your case of insertelements that start a vector tree you would get away keeping a set of “insertelement” instructions of of which trytoVectorizeList below started of. if (InsertElementInst *IE = dyn_cast<InsertElementInst>(it)) { SmallVector<Value *, 8> Ops; if (!findBuildVector(IE, Ops)) continue; // add insert elements to InsertVectorRoot. you would need to make sure that all ‘other’ uses of those insert elements are below the last insert. if (tryToVectorizeList(Ops, R)) Instead of checking “buildsVector”. You could check this set. if (RdxOps && RdxOps->count(UI)) continue; + // This user is part of building a vector + if (buildsVector) // use something like: if (InsertVectorRoot.count(UI)) instead. + continue; + And this set would also contain the instructions that need to be moved. Alternatively, we could teach the slp vectorizer how to ‘vectorize’ insertelements and start the vectorization tree with the insertelements instead of its operands. Then it would naturally work (because in tree users are considered safe). Best, Arnold On Mar 17, 2014, at 2:38 PM, Robison, Arch <arch.robison at intel.com> wrote:> define <4 x float> @julia_foo111(<4 x float>, <4 x float>) { > top: > %2 = extractelement <4 x float> %0, i32 0 > %3 = extractelement <4 x float> %1, i32 0 > %4 = fadd float %2, %3 > %5 = insertelement <4 x float> undef, float %4, i32 0 > %6 = extractelement <4 x float> %0, i32 1 > %7 = extractelement <4 x float> %1, i32 1 > %8 = fadd float %6, %7 > %9 = insertelement <4 x float> %5, float %8, i32 1 > %10 = extractelement <4 x float> %0, i32 2 > %11 = extractelement <4 x float> %1, i32 2 > %12 = fadd float %10, %11 > %13 = insertelement <4 x float> %9, float %12, i32 2 > %14 = extractelement <4 x float> %0, i32 3 > %15 = extractelement <4 x float> %1, i32 3 > %16 = fadd float %14, %15 > %17 = insertelement <4 x float> %13, float %16, i32 3 > ret <4 x float> %17 > }
Possibly Parallel Threads
- [LLVMdev] Extend SLPVectorizer to struct operations that are isomorphic to vector operations?
- [LLVMdev] Vectorizing alloca instructions
- [RFC][SDAG] Convert build_vector of ops on extractelts into ops on input vectors
- [LLVMdev] Vectorizing alloca instructions
- [LLVMdev] Vectorizing alloca instructions