I'm working on some small improvements to SLPVectorizer.cpp so that it can
deal with some tuple operations arising from Julia code. Being fairly new to
LLVM, I could use some advice, particular from those familiar with the internals
of SLPVectorizer.
The motivation can be found in the Julia discussion
https://github.com/JuliaLang/julia/issues/5857 . Here is an example of the kind
of LLVM code I wish to vectorize.
-------------------------------------------------------------
define <4 x float> @julia_foo111(<4 x float>, <4 x float>) {
top:
%2 = extractelement <4 x float> %0, i32 0
%3 = extractelement <4 x float> %1, i32 0
%4 = fadd float %2, %3
%5 = insertelement <4 x float> undef, float %4, i32 0
%6 = extractelement <4 x float> %0, i32 1
%7 = extractelement <4 x float> %1, i32 1
%8 = fadd float %6, %7
%9 = insertelement <4 x float> %5, float %8, i32 1
%10 = extractelement <4 x float> %0, i32 2
%11 = extractelement <4 x float> %1, i32 2
%12 = fadd float %10, %11
%13 = insertelement <4 x float> %9, float %12, i32 2
%14 = extractelement <4 x float> %0, i32 3
%15 = extractelement <4 x float> %1, i32 3
%16 = fadd float %14, %15
%17 = insertelement <4 x float> %13, float %16, i32 3
ret <4 x float> %17
}
-------------------------------------------------------------
I want the fadd instructions to be vectorized. I've been able to implement
most of what I need (see attached patch), but with a fatal flaw: the uses of the
vectorized result are not moved as necessary. Here is the current (and quite
illegal) result:
-------------------------------------------------------------
top:
%2 = extractelement <4 x float> %8, i32 0
%3 = insertelement <4 x float> undef, float %2, i32 0
%4 = extractelement <4 x float> %8, i32 1
%5 = insertelement <4 x float> %3, float %4, i32 1
%6 = extractelement <4 x float> %8, i32 2
%7 = insertelement <4 x float> %5, float %6, i32 2
%8 = fadd <4 x float> %0, %1
%9 = extractelement <4 x float> %8, i32 3
%10 = insertelement <4 x float> %7, float %9, i32 3
ret <4 x float> %10
-------------------------------------------------------------
Instructions %3, %A5 and %7 need to be moved to after Instructions %8. I'm
wondering what is a good way to do this. The relevant place in
SLPVectorizer.cpp is around here:
-------------------------------------------------------------
if (Cost < -SLPCostThreshold) {
DEBUG(dbgs() << "SLP: Vectorizing list at cost:" <<
Cost << ".\n");
R.vectorizeTree();
// Move to the next bundle.
i += VF - 1;
Changed = true;
}
-------------------------------------------------------------
Should I try to move the instructions before or after the call to
R.vectorizeTree()? Or maybe do it even later after all bundles have been
vectorized? Are there LLVM utilities for doing this kind of fixup within a
basic block?
Any pointers/advice appreciated.
- Arch
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140317/c4612128/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SLPVectorizer.cpp.patch
Type: application/octet-stream
Size: 4777 bytes
Desc: SLPVectorizer.cpp.patch
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140317/c4612128/attachment.obj>
Hi Arch,
Thanks for looking at this.
The reason the SLPVectorizer bails out on many cases that seem vectorizable is
scheduling. It needs to produce a legal schedule. The way it does this is by
making sure that it can move all vectorized instructions to the last instruction
in a bundle. (Alternatively, you could build a dag, make sure that you don’t
create cycles and then produce a topological sort, but this was not done out of
compile time concerns).
If I understand your patch correctly you are disabling the above mentioned check
if the vectorizer starts at an insertelement instruction? What about other
users? You still need to detect that you can schedule them correctly.
define <4 x float> @julia_foo111(<4 x float>, <4 x float>) {
top:
%2 = extractelement <4 x float> %0, i32 0
%3 = extractelement <4 x float> %1, i32 0
%4 = fadd float %2, %3
%5 = insertelement <4 x float> undef, float %4, i32 0
%6 = extractelement <4 x float> %0, i32 1
%7 = extractelement <4 x float> %1, i32 1
%8 = fadd float %6, %7
%foo = operation which has a use of %8 that potentially feeds %12 but even if
not all of its users now need to be move below %16 and we need to check all
their users recursively …
%9 = insertelement <4 x float> %5, float %8, i32 1
%10 = extractelement <4 x float> %0, i32 2
%11 = extractelement <4 x float> %1, i32 2
%12 = fadd float %10, %11
%13 = insertelement <4 x float> %9, float %12, i32 2
%14 = extractelement <4 x float> %0, i32 3
%15 = extractelement <4 x float> %1, i32 3
%16 = fadd float %14, %15
%17 = insertelement <4 x float> %13, float %16, i32 3
ret <4 x float> %17
}
For your case of insertelements that start a vector tree you would get away
keeping a set of “insertelement” instructions of of which trytoVectorizeList
below started of.
if (InsertElementInst *IE = dyn_cast<InsertElementInst>(it)) {
SmallVector<Value *, 8> Ops;
if (!findBuildVector(IE, Ops))
continue;
// add insert elements to InsertVectorRoot. you would need to make sure
that all ‘other’ uses of those insert elements are below the last insert.
if (tryToVectorizeList(Ops, R))
Instead of checking “buildsVector”. You could check this set.
if (RdxOps && RdxOps->count(UI))
continue;
+ // This user is part of building a vector
+ if (buildsVector) // use something like: if (InsertVectorRoot.count(UI))
instead.
+ continue;
+
And this set would also contain the instructions that need to be moved.
Alternatively, we could teach the slp vectorizer how to ‘vectorize’
insertelements and start the vectorization tree with the insertelements instead
of its operands. Then it would naturally work (because in tree users are
considered safe).
Best,
Arnold
On Mar 17, 2014, at 2:38 PM, Robison, Arch <arch.robison at intel.com>
wrote:
> define <4 x float> @julia_foo111(<4 x float>, <4 x
float>) {
> top:
> %2 = extractelement <4 x float> %0, i32 0
> %3 = extractelement <4 x float> %1, i32 0
> %4 = fadd float %2, %3
> %5 = insertelement <4 x float> undef, float %4, i32 0
> %6 = extractelement <4 x float> %0, i32 1
> %7 = extractelement <4 x float> %1, i32 1
> %8 = fadd float %6, %7
> %9 = insertelement <4 x float> %5, float %8, i32 1
> %10 = extractelement <4 x float> %0, i32 2
> %11 = extractelement <4 x float> %1, i32 2
> %12 = fadd float %10, %11
> %13 = insertelement <4 x float> %9, float %12, i32 2
> %14 = extractelement <4 x float> %0, i32 3
> %15 = extractelement <4 x float> %1, i32 3
> %16 = fadd float %14, %15
> %17 = insertelement <4 x float> %13, float %16, i32 3
> ret <4 x float> %17
> }
Maybe Matching Threads
- [LLVMdev] Extend SLPVectorizer to struct operations that are isomorphic to vector operations?
- [LLVMdev] Vectorizing alloca instructions
- [RFC][SDAG] Convert build_vector of ops on extractelts into ops on input vectors
- [LLVMdev] Vectorizing alloca instructions
- [LLVMdev] Vectorizing alloca instructions