Tobias Grosser
2011-Dec-02 16:07 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On 11/23/2011 05:52 PM, Hal Finkel wrote:
> On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote:
>> On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote:
>>> Tobias,
>>>
>>> I've attached an updated patch. It contains a few bug fixes and many
>>> (refactoring and coding-convention) changes inspired by your comments.
>>>
>>> I'm currently trying to fix the bug responsible for causing a compile
>>> failure when compiling
>>> test-suite/MultiSource/Applications/obsequi/toggle_move.c; after the
>>> pass begins to fuse instructions in a basic block in this file, the
>>> aliasing analysis starts producing different (more pessimistic) query
>>> answers. I've never before seen it do this, and I'm not sure why it is
>>> happening. Also odd, at the same time, the numeric labels that are
>>> assigned to otherwise unnamed instructions change. I don't think I've
>>> seen this happen either (they generally seem to stay fixed once
>>> assigned). I don't know if these two things are related, but if anyone
>>> can provide some insight, I'd appreciate it.
>>
>> I think that I see what is happening in this case (please someone tell
>> me if this does not make sense). In the problematic basic block, there
>> are some loads and stores that are independent. The default aliasing
>> analysis can tell that these loads and stores don't reference the same
>> memory region. Then, some of the inputs to the getelementptr
>> instructions used for these loads and stores are fused by the
>> vectorization. After this happens, the aliasing analysis loses its
>> ability to tell that the loads and stores that make use of those
>> vector-calculated indices are independent.
>
> The attached patch works around this issue. Please review.

Hi Hal,

thanks for reworking the patch several times. This time the patch looks
a lot better, with nicely structured helper functions and a lot more
consistent style.
It is a lot easier to read and review. Unfortunately I only had two
hours today to look into this patch. I added a lot of remarks, but
especially in the tree-pruning part I ran a little out of time, such
that the remarks are mostly obvious style remarks. I hope to have a bit
more time over the weekend to look into this again, but I wanted to get
the first chunk of comments out today.

Especially with my style remarks, do not feel forced to follow all of my
comments. Style is highly personal, and other people may well suggest
other fixes. In case you believe your style is better, just check the
LLVM coding standards or other parts of the LLVM source code and try to
match them as well as possible.

From an algorithmic point of view I am not yet completely convinced the
algorithm itself is not worse than O(N^2). I really want to understand
this part better, and plan to look into problematic behaviour in
functions not directly visible in the core routines. Also, you chose a
very high bound (4000) to limit the quadratic behaviour, because you
stated that otherwise optimizations will be missed. Can you send an
example where a value smaller than 4000 would miss optimizations?

Cheers,
Tobi

> diff --git a/lib/Transforms/Makefile b/lib/Transforms/Makefile
> index e527be2..8b1df92 100644
> --- a/lib/Transforms/Makefile
> +++ b/lib/Transforms/Makefile
> @@ -8,7 +8,7 @@
> ##===----------------------------------------------------------------------===##
>
> LEVEL = ../..
> -PARALLEL_DIRS = Utils Instrumentation Scalar InstCombine IPO Hello
> +PARALLEL_DIRS = Utils Instrumentation Scalar InstCombine IPO Vectorize Hello
>
> include $(LEVEL)/Makefile.config
>
> diff --git a/lib/Transforms/Vectorize/BBVectorize.cpp b/lib/Transforms/Vectorize/BBVectorize.cpp
> new file mode 100644
> index 0000000..5505502
> --- /dev/null
> +++ b/lib/Transforms/Vectorize/BBVectorize.cpp
> @@ -0,0 +1,1644 @@
> +//===- BBVectorize.cpp - A Basic-Block Vectorizer -------------------------===//
> +//
> +// The LLVM Compiler Infrastructure
> +//
> +// This file is distributed under the University of Illinois Open Source
> +// License. See LICENSE.TXT for details.
> +//
> +//===----------------------------------------------------------------------===//
> +//
> +// This file implements a basic-block vectorization pass. The algorithm was
> +// inspired by that used by the Vienna MAP Vectorizer by Franchetti and Kral,
> +// et al. It works by looking for chains of pairable operations and then
> +// pairing them.
> +//
> +//===----------------------------------------------------------------------===//
> +
> +#define BBV_NAME "bb-vectorize"
> +#define DEBUG_TYPE BBV_NAME
> +#include "llvm/Constants.h"
> +#include "llvm/DerivedTypes.h"
> +#include "llvm/Function.h"
> +#include "llvm/Instructions.h"
> +#include "llvm/IntrinsicInst.h"
> +#include "llvm/Intrinsics.h"
> +#include "llvm/LLVMContext.h"
> +#include "llvm/Pass.h"
> +#include "llvm/Type.h"
> +#include "llvm/ADT/DenseMap.h"
> +#include "llvm/ADT/DenseSet.h"
> +#include "llvm/ADT/SmallVector.h"
> +#include "llvm/ADT/Statistic.h"
> +#include "llvm/ADT/StringExtras.h"
> +#include "llvm/Analysis/AliasAnalysis.h"
> +#include "llvm/Analysis/AliasSetTracker.h"
> +#include "llvm/Analysis/ValueTracking.h"

This comes two lines later, if you want to sort it alphabetically.

> +namespace {
> +  struct BBVectorize : public BasicBlockPass {
> +    static char ID; // Pass identification, replacement for typeid
> +    BBVectorize() : BasicBlockPass(ID) {
> +      initializeBBVectorizePass(*PassRegistry::getPassRegistry());
> +    }
> +
> +    typedef std::pair<Value *, Value *> ValuePair;
> +    typedef std::pair<ValuePair, size_t> ValuePairWithDepth;
> +    typedef std::pair<ValuePair, ValuePair> VPPair; // A ValuePair pair
> +    typedef std::pair<std::multimap<Value *, Value *>::iterator,
> +              std::multimap<Value *, Value *>::iterator> VPIteratorPair;
> +    typedef std::pair<std::multimap<ValuePair, ValuePair>::iterator,
> +              std::multimap<ValuePair, ValuePair>::iterator>
> +                VPPIteratorPair;
> +
> +    // FIXME: const correct?
> +
> +    bool vectorizePairs(BasicBlock &BB);
> +
> +    void getCandidatePairs(BasicBlock &BB,
> +                      std::multimap<Value *, Value *> &CandidatePairs,
> +                      std::vector<Value *> &PairableInsts);
> +
> +    void computeConnectedPairs(std::multimap<Value *, Value *> &CandidatePairs,
> +                      std::vector<Value *> &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> &ConnectedPairs);
> +
> +    void buildDepMap(BasicBlock &BB,
> +                      std::multimap<Value *, Value *> &CandidatePairs,
> +                      std::vector<Value *> &PairableInsts,
> +                      DenseSet<ValuePair> &PairableInstUsers);
> +
> +    void choosePairs(std::multimap<Value *, Value *> &CandidatePairs,
> +                      std::vector<Value *> &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> &ConnectedPairs,
> +                      DenseSet<ValuePair> &PairableInstUsers,
> +                      DenseMap<Value *, Value *> &ChosenPairs);
> +
> +    void fuseChosenPairs(BasicBlock &BB,
> +                      std::vector<Value *> &PairableInsts,
> +                      DenseMap<Value *, Value *> &ChosenPairs);
> +
> +    bool isInstVectorizable(Instruction *I, bool &IsSimpleLoadStore);
> +
> +    bool areInstsCompatible(Instruction *I, Instruction *J,
> +                      bool IsSimpleLoadStore);
> +
> +    void trackUsesOfI(DenseSet<Value *> &Users,
> +                      AliasSetTracker &WriteSet, Instruction *I,
> +                      Instruction *J, bool &UsesI, bool UpdateUsers = true,
> +                      std::multimap<Value *, Value *> *LoadMoveSet = 0);
> +
> +    void computePairsConnectedTo(
> +                      std::multimap<Value *, Value *> &CandidatePairs,
> +                      std::vector<Value *> &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> &ConnectedPairs,
> +                      ValuePair P);
> +
> +    bool pairsConflict(ValuePair P, ValuePair Q,
> +                      DenseSet<ValuePair> &PairableInstUsers);
> +
> +    void pruneTreeFor(
> +                      std::multimap<Value *, Value *> &CandidatePairs,
> +                      std::vector<Value *> &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> &ConnectedPairs,
> +                      DenseSet<ValuePair> &PairableInstUsers,
> +                      DenseMap<Value *, Value *> &ChosenPairs,
> +                      DenseMap<ValuePair, size_t> &Tree,
> +                      DenseSet<ValuePair> &PrunedTree, ValuePair J);
> +
> +    void buildInitialTreeFor(
> +                      std::multimap<Value *, Value *> &CandidatePairs,
> +                      std::vector<Value *> &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> &ConnectedPairs,
> +                      DenseSet<ValuePair> &PairableInstUsers,
> +                      DenseMap<Value *, Value *> &ChosenPairs,
> +                      DenseMap<ValuePair, size_t> &Tree, ValuePair J);
> +
> +    void findBestTreeFor(
> +                      std::multimap<Value *, Value *> &CandidatePairs,
> +                      std::vector<Value *> &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> &ConnectedPairs,
> +                      DenseSet<ValuePair> &PairableInstUsers,
> +                      DenseMap<Value *, Value *> &ChosenPairs,
> +                      DenseSet<ValuePair> &BestTree, size_t &BestMaxDepth,
> +                      size_t &BestEffSize, VPIteratorPair ChoiceRange);
> +
> +    Value *getReplacementPointerInput(LLVMContext& Context, Instruction *I,
> +                      Instruction *J, unsigned o, bool &FlipMemInputs);
> +
> +    Value *getReplacementShuffleMask(LLVMContext& Context, Instruction *I,
> +                      Instruction *J, unsigned o);
> +
> +    Value *getReplacementInput(LLVMContext& Context, Instruction *I,
> +                      Instruction *J, unsigned o, bool FlipMemInputs);
> +
> +    void getReplacementInputsForPair(LLVMContext& Context, Instruction *I,
> +                      Instruction *J, SmallVector<Value *, 3> &ReplacedOperands,
> +                      bool &FlipMemInputs);
> +
> +    void replaceOutputsOfPair(LLVMContext& Context, Instruction *I,
> +                      Instruction *J, Instruction *K,
> +                      Instruction *&InsertionPt, Instruction *&K1,
> +                      Instruction *&K2, bool &FlipMemInputs);
> +
> +    void collectPairLoadMoveSet(BasicBlock &BB,
> +                      DenseMap<Value *, Value *> &ChosenPairs,
> +                      std::multimap<Value *, Value *> &LoadMoveSet,
> +                      Instruction *I, Instruction *J);
> +
> +    void collectLoadMoveSet(BasicBlock &BB,
> +                      std::vector<Value *> &PairableInsts,
> +                      DenseMap<Value *, Value *> &ChosenPairs,
> +                      std::multimap<Value *, Value *> &LoadMoveSet);
> +
> +    void moveUsesOfIAfterJ(BasicBlock &BB,
> +                      DenseMap<Value *, Value *> &ChosenPairs,
> +                      std::multimap<Value *, Value *> &LoadMoveSet,
> +                      Instruction *&InsertionPt,
> +                      Instruction *I, Instruction *J);

Here I am somehow unsatisfied with the indentation, but I can currently
not suggest anything better. The single operands are just too long. ;-)

> +
> +    virtual bool runOnBasicBlock(BasicBlock &BB) {
> +      bool changed = false;
> +      // Iterate a sufficient number of times to merge types of
> +      // size 1, then 2, then 4, etc. up to half of the
> +      // target vector width of the target vector register.

Use all 80 columns. Also, what is a type of size 1? Do you mean 1 bit,
one byte, or something else?

> +      for (unsigned v = 2, n = 1; v <= VectorBits &&
> +             (!MaxIter || n <= MaxIter); v *= 2, ++n) {

The second condition, `(!MaxIter || n <= MaxIter)`, fits in the previous
row. Also, having the whole condition in one row seems easier to read.

> +        DEBUG(dbgs() << "BBV: fusing loop #" << n <<
> +              " for " << BB.getName() << " in " <<
> +              BB.getParent()->getName() << "...\n");
> +        if (vectorizePairs(BB)) {
> +          changed = true;
> +        } else {
> +          break;
> +        }

Braces are not needed here.

> +      }
> +
> +      DEBUG(dbgs() << "BBV: done!\n");
> +      return changed;
> +    }
> +
> +    virtual void getAnalysisUsage(AnalysisUsage &AU) const {
> +      BasicBlockPass::getAnalysisUsage(AU);
> +      AU.addRequired<AliasAnalysis>();
> +      AU.addRequired<ScalarEvolution>();
> +      AU.addPreserved<AliasAnalysis>();
> +      AU.addPreserved<ScalarEvolution>();
> +    }
> +
> +    // This returns the vector type that holds a pair of the provided type.
> +    // If the provided type is already a vector, then its length is doubled.
> +    static inline VectorType *getVecType(Type *VElemType) {

What about naming this getVecTypeForPair()? I think ElemType is sufficient.
No need for the 'V'.

> +      if (VElemType->isVectorTy()) {
> +        unsigned numElem = cast<VectorType>(VElemType)->getNumElements();
> +        return VectorType::get(VElemType->getScalarType(), numElem*2);
> +      } else {
> +        return VectorType::get(VElemType, 2);
> +      }
> +    }

This can be:

  if (VectorType *VectorTy = dyn_cast<VectorType>(ElemType)) {
    unsigned numElem = VectorTy->getNumElements();
    return VectorType::get(ElemType->getScalarType(), numElem*2);
  } else {
    return VectorType::get(ElemType, 2);
  }

> +    // Returns the weight associated with the provided value. A chain of
> +    // candidate pairs has a length given by the sum of the weights of its
> +    // members (one weight per pair; the weight of each member of the pair
> +    // is assumed to be the same). This length is then compared to the
> +    // chain-length threshold to determine if a given chain is significant
> +    // enough to be vectorized. The length is also used in comparing
> +    // candidate chains where longer chains are considered to be better.
> +    // Note: when this function returns 0, the resulting instructions are
> +    // not actually fused.
> +    static inline size_t getDepthFactor(Value *V) {
> +      if (isa<InsertElementInst>(V) || isa<ExtractElementInst>(V)) {

Why do InsertElementInst and ExtractElementInst instructions have a
weight of zero?

> +        return 0;
> +      }

No need for braces.

> +
> +      return 1;
> +    }
> +
> +    // This determines the relative offset of two loads or stores, returning
> +    // true if the offset could be determined to be some constant value.
> +    // For example, if OffsetInElmts == 1, then J accesses the memory directly
> +    // after I; if OffsetInElmts == -1 then I accesses the memory
> +    // directly after J. This function assumes that both instructions
> +    // have the same type.

You may want to mention more explicitly that this function returns parts
of the result in OffsetInElmts. I think 'Offset' is enough, no need for
'OffsetInElmts'.

> +    bool getPairPtrInfo(Instruction *I, Instruction *J, const TargetData &TD,
> +        Value *&IPtr, Value *&JPtr, unsigned &IAlignment, unsigned &JAlignment,
> +        int64_t &OffsetInElmts) {
> +      OffsetInElmts = 0;
> +      if (isa<LoadInst>(I)) {
> +        IPtr = cast<LoadInst>(I)->getPointerOperand();
> +        JPtr = cast<LoadInst>(J)->getPointerOperand();
> +        IAlignment = cast<LoadInst>(I)->getAlignment();
> +        JAlignment = cast<LoadInst>(J)->getAlignment();
> +      } else {
> +        IPtr = cast<StoreInst>(I)->getPointerOperand();
> +        JPtr = cast<StoreInst>(J)->getPointerOperand();
> +        IAlignment = cast<StoreInst>(I)->getAlignment();
> +        JAlignment = cast<StoreInst>(J)->getAlignment();
> +      }
> +
> +      ScalarEvolution &SE = getAnalysis<ScalarEvolution>();
> +      const SCEV *IPtrSCEV = SE.getSCEV(IPtr);
> +      const SCEV *JPtrSCEV = SE.getSCEV(JPtr);
> +
> +      // If this is a trivial offset, then we'll get something like
> +      // 1*sizeof(type). With target data, which we need anyway, this will get
> +      // constant folded into a number.
> +      const SCEV *OffsetSCEV = SE.getMinusSCEV(JPtrSCEV, IPtrSCEV);
> +      if (const SCEVConstant *ConstOffSCEV =
> +            dyn_cast<SCEVConstant>(OffsetSCEV)) {
> +        ConstantInt *IntOff = ConstOffSCEV->getValue();
> +        int64_t Offset = IntOff->getSExtValue();
> +
> +        Type *VTy = cast<PointerType>(IPtr->getType())->getElementType();
> +        int64_t VTyTSS = (int64_t) TD.getTypeStoreSize(VTy);
> +
> +        assert(VTy == cast<PointerType>(JPtr->getType())->getElementType());
> +
> +        OffsetInElmts = Offset/VTyTSS;
> +        return (abs64(Offset) % VTyTSS) == 0;
> +      }
> +
> +      return false;
> +    }
> +
> +    // Returns true if the provided CallInst represents an intrinsic that can
> +    // be vectorized.
> +    bool isVectorizableIntrinsic(CallInst* I) {
> +      bool VIntr = false;
> +      if (Function *F = I->getCalledFunction()) {
> +        if (unsigned IID = F->getIntrinsicID()) {

Use early exit.
  Function *F = I->getCalledFunction();

  if (!F)
    return false;

  unsigned IID = F->getIntrinsicID();

  if (!IID)
    return false;

  switch(IID) {
  case Intrinsic::sqrt:
  case Intrinsic::powi:
  case Intrinsic::sin:
  case Intrinsic::cos:
  case Intrinsic::log:
  case Intrinsic::log2:
  case Intrinsic::log10:
  case Intrinsic::exp:
  case Intrinsic::exp2:
  case Intrinsic::pow:
    return !NoMath;
  case Intrinsic::fma:
    return !NoFMA;
  }

Is this switch exhaustive or are you missing a default case?

> +    // Returns true if J is the second element in some ValuePair referenced by
> +    // some multimap ValuePair iterator pair.
> +    bool isSecondInVPIteratorPair(VPIteratorPair PairRange, Value *J) {
> +      for (std::multimap<Value *, Value *>::iterator K = PairRange.first;
> +           K != PairRange.second; ++K) {
> +        if (K->second == J) return true;
> +      }

No need for braces.

> +
> +      return false;
> +    }
> +  };
> +
> +  // This function implements one vectorization iteration on the provided
> +  // basic block. It returns true if the block is changed.
> +  bool BBVectorize::vectorizePairs(BasicBlock &BB) {
> +    std::vector<Value *> PairableInsts;
> +    std::multimap<Value *, Value *> CandidatePairs;
> +    getCandidatePairs(BB, CandidatePairs, PairableInsts);
> +    if (!PairableInsts.size()) return false;

if (PairableInsts.size() == 0) ...

I think it is more straightforward to compare a size to 0 than to force
the brain to implicitly convert it to bool.

> +
> +    // Now we have a map of all of the pairable instructions and we need to

"we have a map of all pairable instructions"

> +    // select the best possible pairing. A good pairing is one such that the
> +    // Users of the pair are also paired. This defines a (directed) forest

"users", not "Users".

> +    // over the pairs such that two pairs are connected iff the second pair
> +    // uses the first.
> +
> +    // Note that it only matters that both members of the second pair use some
> +    // element of the first pair (to allow for splatting).
> +
> +    std::multimap<ValuePair, ValuePair> ConnectedPairs;
> +    computeConnectedPairs(CandidatePairs, PairableInsts, ConnectedPairs);
> +    if (!ConnectedPairs.size()) return false;

if (ConnectedPairs.size() == 0) ...

> +    // Build the pairable-instruction dependency map
> +    DenseSet<ValuePair> PairableInstUsers;
> +    buildDepMap(BB, CandidatePairs, PairableInsts, PairableInstUsers);
> +
> +    // There is now a graph of the connected pairs. For each variable, pick the
> +    // pairing with the largest tree meeting the depth requirement on at least
> +    // one branch. Then select all pairings that are part of that tree and
> +    // remove them from the list of available parings and pairable variables.

"pairings"

> +
> +    DenseMap<Value *, Value *> ChosenPairs;
> +    choosePairs(CandidatePairs, PairableInsts, ConnectedPairs,
> +                PairableInstUsers, ChosenPairs);
> +
> +    if (!ChosenPairs.size()) return false;
> +    NumFusedOps += ChosenPairs.size();
> +
> +    // A set of chosen pairs has now been selected. It is now necessary to

"A set of pairs has been chosen."

> +    // replace the paired functions with vector functions. For this procedure

You are only talking about functions. I thought the main objective is to
pair LLVM-IR _instructions_.

> +    // each argument much be replaced with a vector argument. This vector

"must". At least for instructions, I would use the term 'operand'
instead of 'argument'.

> +    // is formed by using build_vector on the old arguments. The replaced
> +    // values are then replaced with a vector_extract on the result.
> +    // Subsequent optimization passes should coalesce the build/extract
> +    // combinations.
> +
> +    fuseChosenPairs(BB, PairableInsts, ChosenPairs);
> +
> +    return true;
> +  }
> +
> +  // This function returns true if the provided instruction is capable of being
> +  // fused into a vector instruction. This determination is based only on the
> +  // type and other attributes of the instruction.
> +  bool BBVectorize::isInstVectorizable(Instruction *I,
> +      bool &IsSimpleLoadStore) {
> +    AliasAnalysis &AA = getAnalysis<AliasAnalysis>();

AA can become a member of the BBVectorize class and it can be
initialized once per basic block.

> +    bool VIntr = false;
> +    if (isa<CallInst>(I)) {
> +      VIntr = isVectorizableIntrinsic(cast<CallInst>(I));
> +    }

Remove VIntr and use early exit.

  if (CallInst *C = dyn_cast<CallInst>(I)) {
    if (!isVectorizableIntrinsic(C))
      return false;
  }

Also, no braces if not necessary in at least one branch.

> +    // Vectorize simple loads and stores if possbile:
> +    IsSimpleLoadStore = false;
> +    if (isa<LoadInst>(I)) {
> +      IsSimpleLoadStore = cast<LoadInst>(I)->isSimple();
> +    } else if (isa<StoreInst>(I)) {
> +      IsSimpleLoadStore = cast<StoreInst>(I)->isSimple();
> +    }

Again, no braces needed and use early exit.

  } else if (LoadInst *L = dyn_cast<LoadInst>(I)) {
    if (!L->isSimple() || NoMemOps) {
      IsSimpleLoadStore = false;
      return false;
    }
  } else if (StoreInst *S = dyn_cast<StoreInst>(I)) {
    if (!S->isSimple() || NoMemOps) {
      IsSimpleLoadStore = false;
      return false;
    }
  }

> +    // We can vectorize casts, but not casts of pointer types, etc.
> +    bool IsCast = false;
> +    if (I->isCast()) {
> +      IsCast = true;
> +      if (!cast<CastInst>(I)->getSrcTy()->isSingleValueType()) {
> +        IsCast = false;
> +      } else if (!cast<CastInst>(I)->getDestTy()->isSingleValueType()) {
> +        IsCast = false;
> +      } else if (cast<CastInst>(I)->getSrcTy()->isPointerTy()) {
> +        IsCast = false;
> +      } else if (cast<CastInst>(I)->getDestTy()->isPointerTy()) {
> +        IsCast = false;
> +      }
> +    }

  } else if (CastInst *C = dyn_cast<CastInst>(I)) {
    if (NoCasts)
      return false;

    Type *SrcTy = C->getSrcTy();
    if (!SrcTy->isSingleValueType() || SrcTy->isPointerTy())
      return false;

    Type *DestTy = C->getDestTy();
    if (!DestTy->isSingleValueType() || DestTy->isPointerTy())
      return false;
  }

> +
> +    if (!(I->isBinaryOp() || isa<ShuffleVectorInst>(I) ||
> +        isa<ExtractElementInst>(I) || isa<InsertElementInst>(I) ||
> +        (!NoCasts && IsCast) || VIntr ||
> +        (!NoMemOps && IsSimpleLoadStore))) {
> +      return false;
> +    }

  } else if (I->isBinaryOp() || isa<ShuffleVectorInst>(I) ||
             isa<ExtractElementInst>(I) || isa<InsertElementInst>(I)) {
    // Those instructions are always OK.
  } else {
    // Everything else can not be handled.
    return false;
  }

> +
> +    // We can't vectorize memory operations without target data
> +    if (AA.getTargetData() == 0 && IsSimpleLoadStore) {
> +      return false;

Is this check needed at this place? I had the feeling the whole pass
does not work without target data. At least the offset calculation seems
to require it.

> +    }
> +
> +    Type *T1, *T2;
> +    if (isa<StoreInst>(I)) {
> +      // For stores, it is the value type, not the pointer type that matters
> +      // because the value is what will come from a vector register.
> +
> +      Value *IVal = cast<StoreInst>(I)->getValueOperand();
> +      T1 = IVal->getType();
> +    } else {
> +      T1 = I->getType();
> +    }
> +
> +    if (I->isCast()) {
> +      T2 = cast<CastInst>(I)->getSrcTy();
> +    } else {
> +      T2 = T1;
> +    }

No braces needed.

> +
> +    // Not every type can be vectorized...
> +    if (!(VectorType::isValidElementType(T1) || T1->isVectorTy()) ||
> +        !(VectorType::isValidElementType(T2) || T2->isVectorTy()))
> +      return false;
> +
> +    if (NoInts && (T1->isIntOrIntVectorTy() || T2->isIntOrIntVectorTy()))
> +      return false;
> +
> +    if (NoFloats && (T1->isFPOrFPVectorTy() || T2->isFPOrFPVectorTy()))
> +      return false;
> +
> +    if (T1->getPrimitiveSizeInBits() > VectorBits/2 ||
> +        T2->getPrimitiveSizeInBits() > VectorBits/2)
> +      return false;
> +
> +    return true;
> +  }
> +
> +  // This function returns true if the two provided instructions are compatible
> +  // (meaning that they can be fused into a vector instruction). This assumes
> +  // that I has already been determined to be vectorizable and that J is not
> +  // in the use tree of I.
> +  bool BBVectorize::areInstsCompatible(Instruction *I, Instruction *J,
> +      bool IsSimpleLoadStore) {
> +    AliasAnalysis &AA = getAnalysis<AliasAnalysis>();

AA can become a member of the BBVectorize class and it can be
initialized once per basic block.

> +    bool IsCompat = J->isSameOperationAs(I);
> +    // FIXME: handle addsub-type operations!
> +
> +    // Loads and stores can be merged if they have different alignments,
> +    // but are otherwise the same.
> +    if (!IsCompat && isa<LoadInst>(I) && isa<LoadInst>(J)) {
> +      if (I->getType() == J->getType()) {
> +        if (cast<LoadInst>(I)->getPointerOperand()->getType() ==
> +              cast<LoadInst>(J)->getPointerOperand()->getType() &&
> +            cast<LoadInst>(I)->isVolatile() ==
> +              cast<LoadInst>(J)->isVolatile() &&
> +            cast<LoadInst>(I)->getOrdering() ==
> +              cast<LoadInst>(J)->getOrdering() &&
> +            cast<LoadInst>(I)->getSynchScope() ==
> +              cast<LoadInst>(J)->getSynchScope()
> +           ) {
> +          IsCompat = true;
> +        }
> +      }
> +    } else if (!IsCompat && isa<StoreInst>(I) && isa<StoreInst>(J)) {
> +      if (cast<StoreInst>(I)->getValueOperand()->getType() ==
> +            cast<StoreInst>(J)->getValueOperand()->getType() &&
> +          cast<StoreInst>(I)->getPointerOperand()->getType() ==
> +            cast<StoreInst>(J)->getPointerOperand()->getType() &&
> +          cast<StoreInst>(I)->isVolatile() ==
> +            cast<StoreInst>(J)->isVolatile() &&
> +          cast<StoreInst>(I)->getOrdering() ==
> +            cast<StoreInst>(J)->getOrdering() &&
> +          cast<StoreInst>(I)->getSynchScope() ==
> +            cast<StoreInst>(J)->getSynchScope()
> +         ) {
> +        IsCompat = true;
> +      }
> +    }
> +
> +    if (!IsCompat) return false;
> +
> +    if (IsSimpleLoadStore) {
> +      const TargetData &TD = *AA.getTargetData();
> +
> +      Value *IPtr, *JPtr;
> +      unsigned IAlignment, JAlignment;
> +      int64_t OffsetInElmts = 0;
> +      if (getPairPtrInfo(I, J, TD, IPtr, JPtr, IAlignment, JAlignment,
> +            OffsetInElmts) && abs64(OffsetInElmts) == 1) {
> +        if (AlignedOnly) {
> +          Type *aType = isa<StoreInst>(I) ?
> +            cast<StoreInst>(I)->getValueOperand()->getType() : I->getType();
> +          // An aligned load or store is possible only if the instruction
> +          // with the lower offset has an alignment suitable for the
> +          // vector type.
> +
> +          unsigned BottomAlignment = IAlignment;
> +          if (OffsetInElmts < 0) BottomAlignment = JAlignment;
> +
> +          Type *VType = getVecType(aType);
> +          unsigned VecAlignment = TD.getPrefTypeAlignment(VType);
> +          if (BottomAlignment < VecAlignment) {
> +            IsCompat = false;
> +          }
> +        }
> +      } else {
> +        IsCompat = false;
> +      }
> +    } else if (isa<ShuffleVectorInst>(I)) {
> +      // Only merge two shuffles if they're both constant
> +      // or both not constant.
> +      IsCompat = isa<Constant>(I->getOperand(2)) &&
> +                 isa<Constant>(J->getOperand(2));
> +      // FIXME: We may want to vectorize non-constant shuffles also.
> +    }
> +
> +    return IsCompat;

Here again, for the whole function: can you try to use early exit to
simplify the code?
http://llvm.org/docs/CodingStandards.html#hl_earlyexit
You might be able to remove the use of IsCompat completely.

> +  }
> +
> +  // Figure out whether or not J uses I and update the users and write-set
> +  // structures associated with I.

You may want to give I and J more descriptive names. I remember I asked
you last time to rename One and Two to I and J, but in this case it
seems both have very specific roles which could be used to describe
them. What about 'Inst' and 'User'? Also, what are the arguments of this
function? Can you explain them a little?

> +  void BBVectorize::trackUsesOfI(DenseSet<Value *> &Users,
> +      AliasSetTracker &WriteSet, Instruction *I,
> +      Instruction *J, bool &UsesI, bool UpdateUsers,

You may want to return UsesI as the function return value, not as a
reference parameter, though you may need to update the function name
accordingly.

> +      std::multimap<Value *, Value *> *LoadMoveSet) {
> +    AliasAnalysis &AA = getAnalysis<AliasAnalysis>();
> +
> +    UsesI = false;
> +
> +    // This instruction may already be marked as a user due, for example, to
> +    // being a member of a selected pair.
> +    if (Users.count(J))
> +      UsesI = true;
> +
> +    if (!UsesI) for (User::op_iterator JU = J->op_begin(), e = J->op_end();

Use a new line for this 'for'.

> +         JU != e; ++JU) {
> +      Value *V = *JU;
> +      if (I == V || Users.count(V)) {
> +        UsesI = true;
> +        break;
> +      }
> +    }
> +    if (!UsesI && J->mayReadFromMemory()) {
> +      if (LoadMoveSet) {
> +        VPIteratorPair JPairRange = LoadMoveSet->equal_range(J);
> +        UsesI = isSecondInVPIteratorPair(JPairRange, I);
> +      }
> +      else for (AliasSetTracker::iterator W = WriteSet.begin(),

Use a new line for this 'for'.

> +           WE = WriteSet.end(); W != WE; ++W) {
> +        for (AliasSet::iterator A = W->begin(), AE = W->end();
> +             A != AE; ++A) {
> +          AliasAnalysis::Location ptrLoc(A->getValue(), A->getSize(),
> +              A->getTBAAInfo());

Align this argument with the first argument:

  AliasAnalysis::Location ptrLoc(A->getValue(), A->getSize(),
                                 A->getTBAAInfo());

> +          if (AA.getModRefInfo(J, ptrLoc) != AliasAnalysis::NoModRef) {
> +            UsesI = true;
> +            break;
> +          }
> +        }
> +        if (UsesI) break;
> +      }
> +    }
> +    if (UsesI && UpdateUsers) {
> +      if (J->mayWriteToMemory()) WriteSet.add(J);
> +      Users.insert(J);
> +    }
> +  }
> +
> +  // This function iterates over all instruction pairs in the provided
> +  // basic block and collects all candidate pairs for vectorization.
> +  void BBVectorize::getCandidatePairs(BasicBlock &BB,
> +      std::multimap<Value *, Value *> &CandidatePairs,
> +      std::vector<Value *> &PairableInsts) {
> +    AliasAnalysis &AA = getAnalysis<AliasAnalysis>();
> +    BasicBlock::iterator E = BB.end();
> +    for (BasicBlock::iterator I = BB.getFirstInsertionPt(); I != E; ++I) {
> +      bool IsSimpleLoadStore;
> +      if (!isInstVectorizable(I, IsSimpleLoadStore)) continue;
> +
> +      // Look for an instruction with which to pair instruction *I...
> +      DenseSet<Value *> Users;
> +      AliasSetTracker WriteSet(AA);
> +      BasicBlock::iterator J = I; ++J;
> +      for (unsigned ss = 0; J != E && ss <= SearchLimit; ++J, ++ss) {
> +        // Determine if J uses I, if so, exit the loop.
> +        bool UsesI;
> +        trackUsesOfI(Users, WriteSet, I, J, UsesI, !FastDep);

'UsesI' can be eliminated if the corresponding value is passed as the
function return value.

> +        if (FastDep) {
> +          // Note: For this heuristic to be effective, independent operations
> +          // must tend to be intermixed. This is likely to be true from some
> +          // kinds of grouped loop unrolling (but not the generic LLVM pass),
> +          // but otherwise may require some kind of reordering pass.
> +
> +          // When using fast dependency analysis,
> +          // stop searching after first use:
> +          if (UsesI) break;
> +        } else {
> +          if (UsesI) continue;
> +        }
> +
> +        // J does not use I, and comes before the first use of I, so it can be
> +        // merged with I if the instructions are compatible.
> +        if (!areInstsCompatible(I, J, IsSimpleLoadStore)) continue;
> +
> +        // J is a candidate for merging with I.
> +        if (!PairableInsts.size() ||
> +            PairableInsts[PairableInsts.size()-1] != I) {
> +          PairableInsts.push_back(I);
> +        }
> +        CandidatePairs.insert(ValuePair(I, J));
> +        DEBUG(dbgs() << "BBV: candidate pair " << *I <<
> +              " <-> " << *J << "\n");
> +      }
> +    }
> +
> +    DEBUG(dbgs() << "BBV: found " << PairableInsts.size()
> +          << " instructions with candidate pairs\n");
> +  }
> +
> +  // Finds candidate pairs connected to the pair P = <PI, PJ>. This means that
> +  // it looks for pairs such that both members have an input which is an
> +  // output of PI or PJ.
> +  void BBVectorize::computePairsConnectedTo(
> +      std::multimap<Value *, Value *> &CandidatePairs,
> +      std::vector<Value *> &PairableInsts,
> +      std::multimap<ValuePair, ValuePair> &ConnectedPairs,
> +      ValuePair P) {
> +    // For each possible pairing for this variable, look at the uses of
> +    // the first value...
> +    for (Value::use_iterator I = P.first->use_begin(),
> +         E = P.first->use_end(); I != E; ++I) {
> +      VPIteratorPair IPairRange = CandidatePairs.equal_range(*I);
> +
> +      // For each use of the first variable, look for uses of the second
> +      // variable...
> + for (Value::use_iterator J = P.second->use_begin(), > + E2 = P.second->use_end(); J != E2; ++J) { > + VPIteratorPair JPairRange = CandidatePairs.equal_range(*J); > + > + // Look for<I, J>: > + if (isSecondInVPIteratorPair(IPairRange, *J)) > + ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J))); > + > + // Look for<J, I>: > + if (isSecondInVPIteratorPair(JPairRange, *I)) > + ConnectedPairs.insert(VPPair(P, ValuePair(*J, *I))); > + } > + > + // Look for cases where just the first value in the pair is used by > + // both members of another pair (splatting). > + Value::use_iterator J = P.first->use_begin(); > + if (!SplatBreaksChain) for (; J != E; ++J) {This can become an early continue. Also please do not put an 'if' and a 'for' into the same line.> + if (isSecondInVPIteratorPair(IPairRange, *J)) > + ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J))); > + } > + } > + > + // Look for cases where just the second value in the pair is used by > + // both members of another pair (splatting). > + if (!SplatBreaksChain)This can become an early return.> + for (Value::use_iterator I = P.second->use_begin(), > + E = P.second->use_end(); I != E; ++I) { > + VPIteratorPair IPairRange = CandidatePairs.equal_range(*I); > + > + Value::use_iterator J = P.second->use_begin(); > + for (; J != E; ++J) {Why not move initialization into 'for'?> + if (isSecondInVPIteratorPair(IPairRange, *J)) > + ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J))); > + } > + } > + } > + > + // This function figures out which pairs are connected. Two pairs are > + // connected if some output of the first pair forms an input to both members > + // of the second pair. 
> + void BBVectorize::computeConnectedPairs(
> + std::multimap<Value *, Value *> &CandidatePairs,
> + std::vector<Value *> &PairableInsts,
> + std::multimap<ValuePair, ValuePair> &ConnectedPairs) {
> +
> + for (std::vector<Value *>::iterator PI = PairableInsts.begin(),
> + PE = PairableInsts.end(); PI != PE; ++PI) {
> + VPIteratorPair choiceRange = CandidatePairs.equal_range(*PI);
> +
> + for (std::multimap<Value *, Value *>::iterator P = choiceRange.first;
> + P != choiceRange.second; ++P) {
> + computePairsConnectedTo(CandidatePairs, PairableInsts,
> + ConnectedPairs, *P);

Align CandidatePairs and ConnectedPairs.

> + }

No braces needed here.

> + }
> +
> + DEBUG(dbgs() << "BBV: found " << ConnectedPairs.size()
> + << " pair connections.\n");
> + }
> +
> + // This function builds a set of use tuples such that <A, B> is in the set
> + // if B is in the use tree of A. If B is in the use tree of A, then B
> + // depends on the output of A.
> + void BBVectorize::buildDepMap(
> + BasicBlock &BB,
> + std::multimap<Value *, Value *> &CandidatePairs,
> + std::vector<Value *> &PairableInsts,
> + DenseSet<ValuePair> &PairableInstUsers) {
> + DenseSet<Value *> IsInPair;
> + for (std::multimap<Value *, Value *>::iterator C = CandidatePairs.begin(),
> + E = CandidatePairs.end(); C != E; ++C) {
> + IsInPair.insert(C->first);
> + IsInPair.insert(C->second);
> + }
> +
> + // Iterate through the basic block, recording all Users of each
> + // pairable instruction.
> +
> + AliasAnalysis &AA = getAnalysis<AliasAnalysis>();
> + BasicBlock::iterator E = BB.end();
> + for (BasicBlock::iterator I = BB.getFirstInsertionPt(); I != E; ++I) {
> + if (IsInPair.find(I) == IsInPair.end()) continue;
> +
> + DenseSet<Value *> Users;
> + AliasSetTracker WriteSet(AA);
> + BasicBlock::iterator J = I; ++J;
> + for (; J != E; ++J) {

Move the initialization into the for loop.
for (BasicBlock::iterator J = I + 1; J != E; ++J)

> + bool UsesI;
> + trackUsesOfI(Users, WriteSet, I, J, UsesI);
> + }
> +
> + for (DenseSet<Value *>::iterator U = Users.begin(), E = Users.end();
> + U != E; ++U) {

^^ Check indentation here.

> + PairableInstUsers.insert(ValuePair(I, *U));
> + }

No braces needed.

> + }
> + }
> +
> + // Returns true if an input to pair P is an output of pair Q and also an
> + // input of pair Q is an output of pair P. If this is the case, then these
> + // two pairs cannot be simultaneously fused.
> + bool BBVectorize::pairsConflict(ValuePair P, ValuePair Q,
> + DenseSet<ValuePair> &PairableInstUsers) {
> + // Two pairs are in conflict if they are mutual Users of each other.
> + bool QUsesP = PairableInstUsers.count(ValuePair(P.first, Q.first)) ||
> + PairableInstUsers.count(ValuePair(P.first, Q.second)) ||
> + PairableInstUsers.count(ValuePair(P.second, Q.first)) ||
> + PairableInstUsers.count(ValuePair(P.second, Q.second));
> + bool PUsesQ = PairableInstUsers.count(ValuePair(Q.first, P.first)) ||
> + PairableInstUsers.count(ValuePair(Q.first, P.second)) ||
> + PairableInstUsers.count(ValuePair(Q.second, P.first)) ||
> + PairableInstUsers.count(ValuePair(Q.second, P.second));
> + return (QUsesP && PUsesQ);
> + }
> +
> + // This function builds the initial tree of connected pairs with the
> + // pair J at the root.
> + void BBVectorize::buildInitialTreeFor(
> + std::multimap<Value *, Value *> &CandidatePairs,
> + std::vector<Value *> &PairableInsts,
> + std::multimap<ValuePair, ValuePair> &ConnectedPairs,
> + DenseSet<ValuePair> &PairableInstUsers,
> + DenseMap<Value *, Value *> &ChosenPairs,
> + DenseMap<ValuePair, size_t> &Tree, ValuePair J) {
> + // Each of these pairs is viewed as the root node of a Tree. The Tree
> + // is then walked (depth-first). As this happens, we keep track of
> + // the pairs that compose the Tree and the maximum depth of the Tree.
> + std::vector<ValuePairWithDepth> Q;
> + // General depth-first post-order traversal:
> + Q.push_back(ValuePairWithDepth(J, getDepthFactor(J.first)));
> + while (!Q.empty()) {
> + ValuePairWithDepth QTop = Q.back();
> +
> + // Push each child onto the queue:
> + bool MoreChildren = false;
> + size_t MaxChildDepth = QTop.second;
> + VPPIteratorPair qtRange = ConnectedPairs.equal_range(QTop.first);
> + for (std::map<ValuePair, ValuePair>::iterator k = qtRange.first;
> + k != qtRange.second; ++k) {
> + // Make sure that this child pair is still a candidate:
> + bool IsStillCand = false;
> + VPIteratorPair checkRange =
> + CandidatePairs.equal_range(k->second.first);
> + for (std::multimap<Value *, Value *>::iterator m = checkRange.first;
> + m != checkRange.second; ++m) {
> + if (m->second == k->second.second) {
> + IsStillCand = true;
> + break;
> + }
> + }
> +
> + if (IsStillCand) {
> + DenseMap<ValuePair, size_t>::iterator C = Tree.find(k->second);
> + if (C == Tree.end()) {
> + size_t d = getDepthFactor(k->second.first);
> + Q.push_back(ValuePairWithDepth(k->second, QTop.second+d));
> + MoreChildren = true;
> + } else {
> + MaxChildDepth = std::max(MaxChildDepth, C->second);
> + }
> + }
> + }
> +
> + if (!MoreChildren) {
> + // Record the current pair as part of the Tree:
> + Tree.insert(ValuePairWithDepth(QTop.first, MaxChildDepth));
> + Q.pop_back();
> + }
> + }
> + }
> +
> + // Given some initial tree, prune it by removing conflicting pairs (pairs
> + // that cannot be simultaneously chosen for vectorization).
> + void BBVectorize::pruneTreeFor( > + std::multimap<Value *, Value *> &CandidatePairs, > + std::vector<Value *> &PairableInsts, > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > + DenseSet<ValuePair> &PairableInstUsers, > + DenseMap<Value *, Value *> &ChosenPairs, > + DenseMap<ValuePair, size_t> &Tree, > + DenseSet<ValuePair> &PrunedTree, ValuePair J) { > + std::vector<ValuePairWithDepth> Q; > + // General depth-first post-order traversal: > + Q.push_back(ValuePairWithDepth(J, getDepthFactor(J.first))); > + while (!Q.empty()) { > + ValuePairWithDepth QTop = Q.back(); > + PrunedTree.insert(QTop.first); > + Q.pop_back(); > + > + // Visit each child, pruning as necessary... > + DenseMap<ValuePair, size_t> BestChilden; > + VPPIteratorPair QTopRange = ConnectedPairs.equal_range(QTop.first); > + for (std::map<ValuePair, ValuePair>::iterator K = QTopRange.first; > + K != QTopRange.second; ++K) { > + DenseMap<ValuePair, size_t>::iterator C = Tree.find(K->second); > + if (C == Tree.end()) continue; > + > + // This child is in the Tree, now we need to make sure it is the > + // best of any conflicting children. There could be multiple > + // conflicting children, so first, determine if we're keeping > + // this child, then delete conflicting children as necessary. > + > + // It is also necessary to guard against pairing-induced > + // dependencies. Consider instructions a .. x .. y .. b > + // such that (a,b) are to be fused and (x,y) are to be fused > + // but a is an input to x and b is an output from y. This > + // means that y cannot be moved after b but x must be moved > + // after b for (a,b) to be fused. In other words, after > + // fusing (a,b) we have y .. a/b .. x where y is an input > + // to a/b and x is an output to a/b: x and y can no longer > + // be legally fused. To prevent this condition, we must > + // make sure that a child pair added to the Tree is not > + // both an input and output of an already-selected pair. 
> +
> + bool CanAdd = true;

The use of CanAdd can be improved with early continues.

> + for (DenseMap<ValuePair, size_t>::iterator C2
> + = BestChilden.begin(), E2 = BestChilden.end();
> + C2 != E2; ++C2) {
> + if (C2->first.first == C->first.first ||
> + C2->first.first == C->first.second ||
> + C2->first.second == C->first.first ||
> + C2->first.second == C->first.second ||
> + pairsConflict(C2->first, C->first, PairableInstUsers)) {
> + if (C2->second >= C->second) {
> + CanAdd = false;
> + break;
> + }
> + }

This loop

> + }
> +
> + // Even worse, this child could conflict with another node already
> + // selected for the Tree. If that is the case, ignore this child.
> + if (CanAdd)
> + for (DenseSet<ValuePair>::iterator T = PrunedTree.begin(),
> + E2 = PrunedTree.end(); T != E2; ++T) {
> + if (T->first == C->first.first ||
> + T->first == C->first.second ||
> + T->second == C->first.first ||
> + T->second == C->first.second ||
> + pairsConflict(*T, C->first, PairableInstUsers)) {
> + CanAdd = false;
> + break;
> + }
> + }

and this loop

> +
> + // And check the queue too...
> + if (CanAdd)
> + for (std::vector<ValuePairWithDepth>::iterator C2 = Q.begin(),
> + E2 = Q.end(); C2 != E2; ++C2) {
> + if (C2->first.first == C->first.first ||
> + C2->first.first == C->first.second ||
> + C2->first.second == C->first.first ||
> + C2->first.second == C->first.second ||
> + pairsConflict(C2->first, C->first, PairableInstUsers)) {
> + CanAdd = false;
> + break;
> + }

and this loop

> + }
> +
> + // Last but not least, check for a conflict with any of the
> + // already-chosen pairs.
> + if (CanAdd)
> + for (DenseMap<Value *, Value *>::iterator C2 =
> + ChosenPairs.begin(), E2 = ChosenPairs.end();
> + C2 != E2; ++C2) {
> + if (pairsConflict(*C2, C->first, PairableInstUsers)) {
> + CanAdd = false;
> + break;

(and almost this one)

> + }
> + }
> +
> + if (CanAdd) {
> + // This child can be added, but we may have chosen it in preference
> + // to an already-selected child. Check for this here, and if a
> + // conflict is found, then remove the previously-selected child
> + // before adding this one in its place.
> + for (DenseMap<ValuePair, size_t>::iterator C2
> + = BestChilden.begin(), E2 = BestChilden.end();
> + C2 != E2;) {
> + if (C2->first.first == C->first.first ||
> + C2->first.first == C->first.second ||
> + C2->first.second == C->first.first ||
> + C2->first.second == C->first.second ||
> + pairsConflict(C2->first, C->first, PairableInstUsers)) {
> + BestChilden.erase(C2++);
> + } else {
> + ++C2;

and this loop look entirely the same. Can they be moved into a helper
function?

> + }
> + }
> +
> + BestChilden.insert(ValuePairWithDepth(C->first, C->second));
> + }
> + }
> +
> + for (DenseMap<ValuePair, size_t>::iterator C
> + = BestChilden.begin(), E2 = BestChilden.end();
> + C != E2; ++C) {
> + size_t DepthF = getDepthFactor(C->first.first);
> + Q.push_back(ValuePairWithDepth(C->first, QTop.second+DepthF));
> + }
> + }
> + }
> +
> + // This function finds the best tree of mutually-compatible connected
> + // pairs, given the choice of root pairs as an iterator range.
> + void BBVectorize::findBestTreeFor(
> + std::multimap<Value *, Value *> &CandidatePairs,
> + std::vector<Value *> &PairableInsts,
> + std::multimap<ValuePair, ValuePair> &ConnectedPairs,
> + DenseSet<ValuePair> &PairableInstUsers,
> + DenseMap<Value *, Value *> &ChosenPairs,
> + DenseSet<ValuePair> &BestTree, size_t &BestMaxDepth,
> + size_t &BestEffSize, VPIteratorPair ChoiceRange) {
> + for (std::multimap<Value *, Value *>::iterator J = ChoiceRange.first;
> + J != ChoiceRange.second; ++J) {
> +
> + // Before going any further, make sure that this pair does not
> + // conflict with any already-selected pairs (see comment below
> + // near the Tree pruning for more details).
> + bool DoesConflict = false;
> + for (DenseMap<Value *, Value *>::iterator C = ChosenPairs.begin(),
> + E = ChosenPairs.end(); C != E; ++C) {
> + if (pairsConflict(*C, *J, PairableInstUsers)) {
> + DoesConflict = true;
> + break;
> + }
> + }
> + if (DoesConflict) continue;
> +
> + DenseMap<ValuePair, size_t> Tree;
> + buildInitialTreeFor(CandidatePairs, PairableInsts, ConnectedPairs,
> + PairableInstUsers, ChosenPairs, Tree, *J);

Fix indentation. (Align CandidatePairs and PairableInstUsers.)

> +
> + // Because we'll keep the child with the largest depth, the largest
> + // depth is still the same in the unpruned Tree.
> + size_t MaxDepth = Tree.lookup(*J);
> +
> + DEBUG(dbgs() << "BBV: found Tree for pair {" << *J->first <<
> + " <-> " << *J->second << "} of depth " <<
> + MaxDepth << " and size " << Tree.size() << "\n");
> +
> + // At this point the Tree has been constructed, but it may contain
> + // contradictory children (meaning that different children of
> + // some tree node may be attempting to fuse the same instruction).
> + // So now we walk the tree again, and in the case of a conflict,
> + // keep only the child with the largest depth. To break a tie,
> + // favor the first child.
> +
> + DenseSet<ValuePair> PrunedTree;
> + pruneTreeFor(CandidatePairs, PairableInsts, ConnectedPairs,
> + PairableInstUsers, ChosenPairs, Tree, PrunedTree, *J);
> +
> + size_t EffSize = 0;
> + for (DenseSet<ValuePair>::iterator S = PrunedTree.begin(),
> + E = PrunedTree.end(); S != E; ++S) {
> + EffSize += getDepthFactor(S->first);
> + }

No braces.

> +
> + DEBUG(dbgs() << "BBV: found pruned Tree for pair {" << *J->first <<
> + " <-> " << *J->second << "} of depth " <<
> + MaxDepth << " and size " << PrunedTree.size() <<
> + " (effective size: " << EffSize << ")\n");
> + if (MaxDepth >= ReqChainDepth && EffSize > BestEffSize) {
> + BestMaxDepth = MaxDepth;
> + BestEffSize = EffSize;
> + BestTree = PrunedTree;
> + }
> + }
> + }
> +
> + // Given the list of candidate pairs, this function selects those
> + // that will be fused into vector instructions.
> + void BBVectorize::choosePairs(
> + std::multimap<Value *, Value *> &CandidatePairs,
> + std::vector<Value *> &PairableInsts,
> + std::multimap<ValuePair, ValuePair> &ConnectedPairs,
> + DenseSet<ValuePair> &PairableInstUsers,
> + DenseMap<Value *, Value *> &ChosenPairs) {
> + for (std::vector<Value *>::iterator I = PairableInsts.begin(),
> + E = PairableInsts.end(); I != E; ++I) {
> + // The number of possible pairings for this variable:
> + size_t NumChoices = CandidatePairs.count(*I);
> + if (!NumChoices) continue;
> +
> + VPIteratorPair ChoiceRange = CandidatePairs.equal_range(*I);
> +
> + // The best pair to choose and its tree:
> + size_t BestMaxDepth = 0, BestEffSize = 0;
> + DenseSet<ValuePair> BestTree;
> + findBestTreeFor(CandidatePairs, PairableInsts, ConnectedPairs,
> + PairableInstUsers, ChosenPairs, BestTree, BestMaxDepth,
> + BestEffSize, ChoiceRange);

Fix indentation. Align CandidatePairs, ...

> +
> + // A tree has been chosen (or not) at this point. If no tree was
> + // chosen, then this instruction, I, cannot be paired (and is no longer
> + // considered).
> +
> + for (DenseSet<ValuePair>::iterator S = BestTree.begin(),
> + SE = BestTree.end(); S != SE; ++S) {
> + // Insert the members of this tree into the list of chosen pairs.
> + ChosenPairs.insert(ValuePair(S->first, S->second));
> + DEBUG(dbgs() << "BBV: selected pair: " << *S->first << " <-> " <<
> + *S->second << "\n");
> +
> + // Remove all candidate pairs that have values in the chosen tree.
> + for (std::multimap<Value *, Value *>::iterator K =
> + CandidatePairs.begin(),
> + E2 = CandidatePairs.end(); K != E2;) {
> + if (K->first == S->first || K->second == S->first ||
> + K->second == S->second || K->first == S->second) {
> + // Don't remove the actual pair chosen so that it can be used
> + // in subsequent tree selections.
> + if (!(K->first == S->first && K->second == S->second)) {
> + CandidatePairs.erase(K++);
> + } else {
> + ++K;
> + }

No braces needed.

> + } else {
> + ++K;
> + }

These are needed.

> + }
> + }
> + }
> +
> + DEBUG(dbgs() << "BBV: selected " << ChosenPairs.size() << " pairs.\n");
> + }
> +
> + // Returns the value that is to be used as the pointer input to the vector
> + // instruction that fuses I with J.
> + Value *BBVectorize::getReplacementPointerInput(LLVMContext& Context,
> + Instruction *I, Instruction *J, unsigned o,
> + bool &FlipMemInputs) {
> + AliasAnalysis &AA = getAnalysis<AliasAnalysis>();
> + const TargetData &TD = *AA.getTargetData();
> +
> + Value *IPtr, *JPtr;
> + unsigned IAlignment, JAlignment;
> + int64_t OffsetInElmts;
> + (void) getPairPtrInfo(I, J, TD, IPtr, JPtr, IAlignment, JAlignment,
> + OffsetInElmts);

Fix indentation.

> +
> + // The pointer value is taken to be the one with the lowest offset.
> + Value *VPtr;
> + if (OffsetInElmts > 0) {
> + VPtr = IPtr;
> + } else {
> + FlipMemInputs = true;
> + VPtr = JPtr;
> + }
> +
> + Type *ArgType = cast<PointerType>(IPtr->getType())->getElementType();
> + Type *VArgType = getVecType(ArgType);
> + Type *VArgPtrType = PointerType::get(VArgType,
> + cast<PointerType>(IPtr->getType())->getAddressSpace());
> + return new BitCastInst(VPtr, VArgPtrType,
> + I->hasName() ?
> + I->getName() + ".v.i" + utostr(o) : "",
> + /* insert before */ FlipMemInputs ? J : I);
> + }
> +
> + // Returns the value that is to be used as the vector-shuffle mask to the
> + // vector instruction that fuses I with J.
> + Value *BBVectorize::getReplacementShuffleMask(LLVMContext& Context,
> + Instruction *I, Instruction *J, unsigned o) {
> + // This is the shuffle mask. We need to append the second
> + // mask to the first, and the numbers need to be adjusted.
> +
> + Type *ArgType = I->getType();
> + Type *VArgType = getVecType(ArgType);
> +
> + // Get the total number of elements in the fused vector type.
> + // By definition, this must equal the number of elements in
> + // the final mask.
> + unsigned NumElem = cast<VectorType>(VArgType)->getNumElements();
> + std::vector<Constant*> Mask(NumElem);
> +
> + Type *OpType = I->getOperand(0)->getType();
> + unsigned numInElem = cast<VectorType>(OpType)->getNumElements();
> +
> + // For the mask from the first pair...
> + for (unsigned v = 0; v < NumElem/2; ++v) {
> + int m = cast<ShuffleVectorInst>(I)->getMaskValue(v);
> + if (m < 0) {
> + Mask[v] = UndefValue::get(Type::getInt32Ty(Context));
> + } else {
> + unsigned vv = v;
> + if (m >= (int) numInElem) {
> + vv += (int) numInElem;
> + }

No braces needed.

> +
> + Mask[v] = ConstantInt::get(Type::getInt32Ty(Context), vv);
> + }
> + }
> +
> + // For the mask from the second pair...
> + for (unsigned v = 0; v < NumElem/2; ++v) {
> + int m = cast<ShuffleVectorInst>(J)->getMaskValue(v);
> + if (m < 0) {
> + Mask[v+NumElem/2] = UndefValue::get(Type::getInt32Ty(Context));
> + } else {
> + unsigned vv = v + (int) numInElem;
> + if (m >= (int) numInElem) {
> + vv += (int) numInElem;
> + }

No braces needed.

> +
> + Mask[v+NumElem/2] =
> + ConstantInt::get(Type::getInt32Ty(Context), vv);
> + }
> + }

Those two for loops look very similar. Any chance they can be extracted
into a helper function?

> +
> + return ConstantVector::get(Mask);
> + }
> +
> + // Returns the value to be used as the specified operand of the vector
> + // instruction that fuses I with J.
> + Value *BBVectorize::getReplacementInput(LLVMContext& Context, Instruction *I,
> + Instruction *J, unsigned o, bool FlipMemInputs) {

Indentation.

> + Value *CV0 = ConstantInt::get(Type::getInt32Ty(Context), 0);
> + Value *CV1 = ConstantInt::get(Type::getInt32Ty(Context), 1);
> +
> + // Compute the fused vector type for this operand
> + Type *ArgType = I->getOperand(o)->getType();
> + VectorType *VArgType = getVecType(ArgType);
> +
> + Instruction *L = I, *H = J;
> + if (FlipMemInputs) {
> + L = J;
> + H = I;
> + }
> +
> + if (ArgType->isVectorTy()) {
> + unsigned numElem = cast<VectorType>(VArgType)->getNumElements();
> + std::vector<Constant*> Mask(numElem);
> + for (unsigned v = 0; v < numElem; ++v) {
> + Mask[v] = ConstantInt::get(Type::getInt32Ty(Context), v);
> + }

No braces.

> +
> + Instruction *BV = new ShuffleVectorInst(L->getOperand(o),
> + H->getOperand(o),
> + ConstantVector::get(Mask),
> + I->hasName() ?
> + I->getName() + ".v.i" +
> + utostr(o) : "");

It may be better not to squeeze I->hasName() ? I->getName() + ".v.i" +
utostr(o) : "" into this operand.

> + BV->insertBefore(J);
> + return BV;
> + }
> +
> + // If these two inputs are the output of another vector instruction,
> + // then we should use that output directly. It might be necessary to
> + // permute it first.
> + // When pairings are fused recursively, you can
> + // end up with cases where a large vector is decomposed into scalars
> + // using extractelement instructions, then built into size-2
> + // vectors using insertelement and then into larger vectors using
> + // shuffles. InstCombine does not simplify all of these cases well,
> + // and so we make sure that shuffles are generated here when possible.
> + ExtractElementInst *LEE
> + = dyn_cast<ExtractElementInst>(L->getOperand(o));
> + ExtractElementInst *HEE
> + = dyn_cast<ExtractElementInst>(H->getOperand(o));
> +
> + if (LEE && HEE &&
> + LEE->getOperand(0)->getType() == VArgType) {
> + unsigned LowIndx = cast<ConstantInt>(LEE->getOperand(1))->getZExtValue();
> + unsigned HighIndx = cast<ConstantInt>(HEE->getOperand(1))->getZExtValue();
> + if (LEE->getOperand(0) == HEE->getOperand(0)) {
> + if (LowIndx == 0 && HighIndx == 1)
> + return LEE->getOperand(0);
> +
> + std::vector<Constant*> Mask(2);
> + Mask[0] = ConstantInt::get(Type::getInt32Ty(Context), LowIndx);
> + Mask[1] = ConstantInt::get(Type::getInt32Ty(Context), HighIndx);
> +
> + Instruction *BV = new ShuffleVectorInst(LEE->getOperand(0),
> + UndefValue::get(VArgType),
> + ConstantVector::get(Mask),
> + I->hasName() ?
> + I->getName() + ".v.i" +
> + utostr(o) : "");

Do not squeeze I->hasName() ? I->getName() + ".v.i" + utostr(o) : ""
into the operand.

> + BV->insertBefore(J);
> + return BV;
> + }
> +
> + std::vector<Constant*> Mask(2);
> + HighIndx += VArgType->getNumElements();
> + Mask[0] = ConstantInt::get(Type::getInt32Ty(Context), LowIndx);
> + Mask[1] = ConstantInt::get(Type::getInt32Ty(Context), HighIndx);
> +
> + Instruction *BV = new ShuffleVectorInst(LEE->getOperand(0),
> + HEE->getOperand(0),
> + ConstantVector::get(Mask),
> + I->hasName() ?
> + I->getName() + ".v.i" +
> + utostr(o) : "");

No squeezing please.

> + BV->insertBefore(J);
> + return BV;
> + }
> +
> + Instruction *BV1 = InsertElementInst::Create(
> + UndefValue::get(VArgType),
> + L->getOperand(o), CV0,
> + I->hasName() ?
> + I->getName() + ".v.i" +
> + utostr(o) + ".1"
> + : "");

No squeezing.

> + BV1->insertBefore(I);
> + Instruction *BV2 = InsertElementInst::Create(BV1, H->getOperand(o),
> + CV1, I->hasName() ?
> + I->getName() +
> + ".v.i" + utostr(o)
> + + ".2" : "");

No squeezing. Also, the stuff that you squeeze in looks very similar or
even identical. Do you really write the whole expression four times in
this function? Is there no better option?

> + BV2->insertBefore(J);
> + return BV2;
> + }
> +
> + // This function creates an array of values that will be used as the inputs
> + // to the vector instruction that fuses I with J.
> + void BBVectorize::getReplacementInputsForPair(LLVMContext& Context,
> + Instruction *I, Instruction *J,
> + SmallVector<Value *, 3> &ReplacedOperands,
> + bool &FlipMemInputs) {
> + FlipMemInputs = false;
> + unsigned NumOperands = I->getNumOperands();
> +
> + for (unsigned p = 0, o = NumOperands-1; p < NumOperands; ++p, --o) {
> + // Iterate backward so that we look at the store pointer
> + // first and know whether or not we need to flip the inputs.
> +
> + if (isa<LoadInst>(I) || (o == 1 && isa<StoreInst>(I))) {
> + // This is the pointer for a load/store instruction.
> + ReplacedOperands[o] = getReplacementPointerInput(Context, I, J, o,
> + FlipMemInputs);
> + continue;
> + } else if (isa<CallInst>(I) && o == NumOperands-1) {
> + Function *F = cast<CallInst>(I)->getCalledFunction();
> + unsigned IID = F->getIntrinsicID();
> + BasicBlock &BB = *cast<Instruction>(I)->getParent();
> +
> + Module *M = BB.getParent()->getParent();
> + Type *ArgType = I->getType();
> + Type *VArgType = getVecType(ArgType);
> +
> + // FIXME: is it safe to do this here?
> + ReplacedOperands[o] = Intrinsic::getDeclaration(M,
> + (Intrinsic::ID) IID, VArgType);
> + continue;
> + } else if (isa<ShuffleVectorInst>(I) && o == NumOperands-1) {
> + ReplacedOperands[o] = getReplacementShuffleMask(Context, I, J, o);
> + continue;
> + }
> +
> + ReplacedOperands[o] =
> + getReplacementInput(Context, I, J, o, FlipMemInputs);
> + }
> + }
> +
> + // This function creates two values that represent the outputs of the
> + // original I and J instructions. These are generally vector shuffles
> + // or extracts. In many cases, these will end up being unused and, thus,
> + // eliminated by later passes.
> + void BBVectorize::replaceOutputsOfPair(LLVMContext& Context, Instruction *I,
> + Instruction *J, Instruction *K,
> + Instruction *&InsertionPt,
> + Instruction *&K1, Instruction *&K2,
> + bool &FlipMemInputs) {
> + AliasAnalysis &AA = getAnalysis<AliasAnalysis>();
> +
> + Value *CV0 = ConstantInt::get(Type::getInt32Ty(Context), 0);
> + Value *CV1 = ConstantInt::get(Type::getInt32Ty(Context), 1);
> +
> + if (isa<StoreInst>(I)) {
> + AA.replaceWithNewValue(I, K);
> + AA.replaceWithNewValue(J, K);
> + } else {
> + Type *IType = I->getType();
> + Type *VType = getVecType(IType);
> +
> + if (IType->isVectorTy()) {
> + unsigned numElem = cast<VectorType>(IType)->getNumElements();
> + std::vector<Constant*> Mask1(numElem), Mask2(numElem);
> + for (unsigned v = 0; v < numElem; ++v) {
> + Mask1[v] = ConstantInt::get(Type::getInt32Ty(Context), v);
> + Mask2[v] = ConstantInt::get(Type::getInt32Ty(Context), numElem+v);
> + }
> +
> + K1 = new ShuffleVectorInst(K, UndefValue::get(VType),
> + ConstantVector::get(
> + FlipMemInputs ? Mask2 : Mask1),
> + K->hasName() ?
> + K->getName() + ".v.r1" : "");
> + K2 = new ShuffleVectorInst(K, UndefValue::get(VType),
> + ConstantVector::get(
> + FlipMemInputs ? Mask1 : Mask2),
> + K->hasName() ?
> + K->getName() + ".v.r2" : "");
> + } else {
> + K1 = ExtractElementInst::Create(K, FlipMemInputs ? CV1 : CV0,
> + K->hasName() ?
> + K->getName() + ".v.r1" : "");
> + K2 = ExtractElementInst::Create(K, FlipMemInputs ? CV0 : CV1,
> + K->hasName() ?
> + K->getName() + ".v.r2" : "");
> + }
> +
> + K1->insertAfter(K);
> + K2->insertAfter(K1);
> + InsertionPt = K2;
> + }
> + }
> +
> + // Move all uses of the instruction I (including pairing-induced uses)
> + // after J.
> + void BBVectorize::moveUsesOfIAfterJ(BasicBlock &BB,
> + DenseMap<Value *, Value *> &ChosenPairs,
> + std::multimap<Value *, Value *> &LoadMoveSet,
> + Instruction *&InsertionPt,
> + Instruction *I, Instruction *J) {
> + AliasAnalysis &AA = getAnalysis<AliasAnalysis>();
> +
> + // Skip to the first instruction past I.
> + BasicBlock::iterator L = BB.begin();
> + for (; cast<Instruction>(L) != cast<Instruction>(I); ++L);

Why don't you put the initialization into the 'for'? I do not see any
need for the cast instructions.

> + ++L;
> +
> + DenseSet<Value *> Users;
> + AliasSetTracker WriteSet(AA);
> + for (BasicBlock::iterator E = BB.end();
> + cast<Instruction>(L) != cast<Instruction>(J);) {

Why do you need to cast here?

> + bool UsesI;
> + trackUsesOfI(Users, WriteSet, I, L, UsesI, true, &LoadMoveSet);
> +
> + if (UsesI) {
> + // Make sure that the order of selected pairs is not reversed.
> + // So if we come to a user instruction that has a pair,
> + // then mark the pair's tree as used as well.
> + DenseMap<Value *, Value *>::iterator LP = ChosenPairs.find(L);
> + if (LP != ChosenPairs.end())
> + Users.insert(LP->second);
> +
> + // Move this instruction
> + Instruction *InstToMove = L; ++L;
> +
> + DEBUG(dbgs() << "BBV: moving: " << *InstToMove <<
> + " to after " << *InsertionPt << "\n");
> + InstToMove->removeFromParent();
> + InstToMove->insertAfter(InsertionPt);
> + InsertionPt = InstToMove;
> + } else {
> + ++L;
> + }
> + }
> + }
> +
> + // Collect all load instructions that are in the move set of a given pair.
> + // These loads depend on the first instruction, I, and so need to be moved
> + // after J when the pair is fused.
> + void BBVectorize::collectPairLoadMoveSet(BasicBlock &BB,
> + DenseMap<Value *, Value *> &ChosenPairs,
> + std::multimap<Value *, Value *> &LoadMoveSet,
> + Instruction *I, Instruction *J) {
> + AliasAnalysis &AA = getAnalysis<AliasAnalysis>();
> +
> + // Skip to the first instruction past I.
> + BasicBlock::iterator L = BB.begin();
> + for (; cast<Instruction>(L) != cast<Instruction>(I); ++L);

Why not initialize in the 'for' loop? Why cast here?

> + ++L;
> +
> + DenseSet<Value *> Users;
> + AliasSetTracker WriteSet(AA);
> + for (BasicBlock::iterator E = BB.end();

This 'E' is never used in the loop.

> + cast<Instruction>(L) != cast<Instruction>(J); ++L) {

Why cast here?

> + bool UsesI;
> + trackUsesOfI(Users, WriteSet, I, L, UsesI);
> +
> + if (UsesI) {
> + // Make sure that the order of selected pairs is not reversed.
> + // So if we come to a user instruction that has a pair,
> + // then mark the pair's tree as used as well.
> + DenseMap<Value *, Value *>::iterator LP = ChosenPairs.find(L);
> + if (LP != ChosenPairs.end())
> + Users.insert(LP->second);
> +
> + if (L->mayReadFromMemory())
> + LoadMoveSet.insert(ValuePair(L, I));
> + }
> + }
> + }
> +
> + // In cases where both load/stores and the computation of their pointers
> + // are chosen for vectorization, we can end up in a situation where the
> + // aliasing analysis starts returning different query results as the
> + // process of fusing instruction pairs continues. Because the algorithm
> + // relies on finding the same use trees here as were found earlier, we'll
> + // need to precompute the necessary aliasing information here and then
> + // manually update it during the fusion process.
> + void BBVectorize::collectLoadMoveSet(BasicBlock &BB,
> + std::vector<Value *> &PairableInsts,
> + DenseMap<Value *, Value *> &ChosenPairs,
> + std::multimap<Value *, Value *> &LoadMoveSet) {
> + for (std::vector<Value *>::iterator PI = PairableInsts.begin(),
> + PIE = PairableInsts.end(); PI != PIE; ++PI) {
> + DenseMap<Value *, Value *>::iterator P = ChosenPairs.find(*PI);
> + if (P == ChosenPairs.end()) continue;
> +
> + if (getDepthFactor(P->first) == 0) {
> + continue;
> + }

No braces.

> +
> + Instruction *I = cast<Instruction>(P->first),
> + *J = cast<Instruction>(P->second);
> +
> + collectPairLoadMoveSet(BB, ChosenPairs, LoadMoveSet, I, J);
> + }
> + }
> +
> + // This function fuses the chosen instruction pairs into vector instructions,
> + // taking care to preserve any needed scalar outputs; then it reorders the
> + // remaining instructions as needed (users of the first member of the pair
> + // need to be moved to after the location of the second member of the pair
> + // because the vector instruction is inserted in the location of the pair's
> + // second member).
> + void BBVectorize::fuseChosenPairs(BasicBlock &BB,
> + std::vector<Value *> &PairableInsts,
> + DenseMap<Value *, Value *> &ChosenPairs) {
> + LLVMContext& Context = BB.getContext();
> + AliasAnalysis &AA = getAnalysis<AliasAnalysis>();
> + ScalarEvolution &SE = getAnalysis<ScalarEvolution>();
> +
> + std::multimap<Value *, Value *> LoadMoveSet;
> + collectLoadMoveSet(BB, PairableInsts, ChosenPairs, LoadMoveSet);
> +
> + DEBUG(dbgs() << "BBV: initial: \n" << BB << "\n");
> +
> + for (std::vector<Value *>::iterator PI = PairableInsts.begin(),
> + PIE = PairableInsts.end(); PI != PIE; ++PI) {
> + DenseMap<Value *, Value *>::iterator P = ChosenPairs.find(*PI);
> + if (P == ChosenPairs.end()) continue;
> +
> + if (getDepthFactor(P->first) == 0) {
> + // These instructions are not really fused, but are tracked as though
> + // they are.
> + // Any case in which it would be interesting to fuse them
> + // will be taken care of by InstCombine.
> + --NumFusedOps;
> + continue;
> + }
> +
> + Instruction *I = cast<Instruction>(P->first),
> + *J = cast<Instruction>(P->second);

Why do we need to cast here? It seems ChosenPairs should actually be of
type DenseMap<Instruction *, Instruction *>. Can PairableInsts also be
of type std::vector<Instruction *>?

> +
> + DEBUG(dbgs() << "BBV: fusing: " << *I <<
> + " <-> " << *J << "\n");
> +
> + bool FlipMemInputs;
> + unsigned NumOperands = I->getNumOperands();
> + SmallVector<Value *, 3> ReplacedOperands(NumOperands);
> + getReplacementInputsForPair(Context, I, J, ReplacedOperands,
> + FlipMemInputs);
> +
> + // Make a copy of the original operation, change its type to the vector
> + // type and replace its operands with the vector operands.
> + Instruction *K = I->clone();
> + if (I->hasName()) K->takeName(I);
> +
> + if (!isa<StoreInst>(K)) {
> + K->mutateType(getVecType(I->getType()));
> + }

No braces.

> +
> + for (unsigned o = 0; o < NumOperands; ++o) {
> + K->setOperand(o, ReplacedOperands[o]);
> + }

No braces.

> +
> + // If we've flipped the memory inputs, make sure that we take the correct
> + // alignment.
> + if (FlipMemInputs) {
> + if (isa<StoreInst>(K))
> + cast<StoreInst>(K)->setAlignment(cast<StoreInst>(J)->getAlignment());
> + else
> + cast<LoadInst>(K)->setAlignment(cast<LoadInst>(J)->getAlignment());
> + }
> +
> + K->insertAfter(J);
> +
> + // Instruction insertion point:
> + Instruction *InsertionPt = K;
> + Instruction *K1 = 0, *K2 = 0;
> + replaceOutputsOfPair(Context, I, J, K, InsertionPt, K1, K2,
> + FlipMemInputs);
> +
> + // The use tree of the first original instruction must be moved to after
> + // the location of the second instruction. The entire use tree of the
> + // first instruction is disjoint from the input tree of the second
> + // (by definition), and so commutes with it.
> + > + moveUsesOfIAfterJ(BB, ChosenPairs, LoadMoveSet, InsertionPt, > + cast<Instruction>(I), cast<Instruction>(J)); > + > + if (!isa<StoreInst>(I)) { > + I->replaceAllUsesWith(K1); > + J->replaceAllUsesWith(K2); > + AA.replaceWithNewValue(I, K1); > + AA.replaceWithNewValue(J, K2); > + } > + > + // Instructions that may read from memory may be in the load move set. > + // Once an instruction is fused, we no longer need its move set, and so > + // the values of the map never need to be updated. However, when a load > + // is fused, we need to merge the entries from both instructions in the > + // pair in case those instructions were in the move set of some other > + // yet-to-be-fused pair. The loads in question are the keys of the map. > + if (I->mayReadFromMemory()) { > + std::vector<ValuePair> NewSetMembers; > + VPIteratorPair IPairRange = LoadMoveSet.equal_range(I); > + VPIteratorPair JPairRange = LoadMoveSet.equal_range(J); > + for (std::multimap<Value *, Value *>::iterator N = IPairRange.first; > + N != IPairRange.second; ++N) {Indentation.> + NewSetMembers.push_back(ValuePair(K, N->second)); > + }No braces.> + for (std::multimap<Value *, Value *>::iterator N = JPairRange.first; > + N != JPairRange.second; ++N) {Indentation.> + NewSetMembers.push_back(ValuePair(K, N->second)); > + }No braces.> + for (std::vector<ValuePair>::iterator A = NewSetMembers.begin(), > + AE = NewSetMembers.end(); A != AE; ++A) {Indentation.> + LoadMoveSet.insert(*A); > + }No braces.> + } > + > + SE.forgetValue(I); > + SE.forgetValue(J); > + I->eraseFromParent(); > + J->eraseFromParent(); > + } > + > + DEBUG(dbgs()<< "BBV: final: \n"<< BB<< "\n"); > + } > +}> --- /dev/null > +++ b/test/Transforms/BBVectorize/ld1.ll > @@ -0,0 +1,28 @@ > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s > 
+ > +define double @test1(double* %a, double* %b, double* %c) nounwind uwtable readonly { > +entry: > + %i0 = load double* %a, align 8 > + %i1 = load double* %b, align 8 > + %mul = fmul double %i0, %i1 > + %i2 = load double* %c, align 8 > + %add = fadd double %mul, %i2 > + %arrayidx3 = getelementptr inbounds double* %a, i64 1 > + %i3 = load double* %arrayidx3, align 8 > + %arrayidx4 = getelementptr inbounds double* %b, i64 1 > + %i4 = load double* %arrayidx4, align 8 > + %mul5 = fmul double %i3, %i4 > + %arrayidx6 = getelementptr inbounds double* %c, i64 1 > + %i5 = load double* %arrayidx6, align 8 > + %add7 = fadd double %mul5, %i5 > + %mul9 = fmul double %add, %i1 > + %add11 = fadd double %mul9, %i2 > + %mul13 = fmul double %add7, %i4 > + %add15 = fadd double %mul13, %i5 > + %mul16 = fmul double %add11, %add15 > + ret double %mul16 > +; CHECK: @test1 > +; CHECK:<2 x double> For me this CHECK looks very short. Should all instructions be vectorized, or is it sufficient if one is vectorized? Also, it would be nice if I could understand the code that is generated just by looking at the test case. Any reason not to CHECK: for the entire sequence of expected instructions? (This applies to all test cases.)> +} > + > diff --git a/test/Transforms/BBVectorize/loop1.ll b/test/Transforms/BBVectorize/loop1.ll > new file mode 100644 > index 0000000..4c60b42 > --- /dev/null > +++ b/test/Transforms/BBVectorize/loop1.ll > @@ -0,0 +1,39 @@ > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" > +target triple = "x86_64-unknown-linux-gnu" > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s > +; RUN: opt< %s -loop-unroll -unroll-allow-partial -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s Why are you checking here with -loop-unroll? Do you expect something special?
In case you do it may make sense to use another -check-prefix> --- /dev/null > +++ b/test/Transforms/BBVectorize/req-depth.ll > @@ -0,0 +1,17 @@ > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128" > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth 3 -S | FileCheck %s -check-prefix=CHECK-RD3 > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth 2 -S | FileCheck %s -check-prefix=CHECK-RD2 > + > +define double @test1(double %A1, double %A2, double %B1, double %B2) { > + %X1 = fsub double %A1, %B1 > + %X2 = fsub double %A2, %B2 > + %Y1 = fmul double %X1, %A1 > + %Y2 = fmul double %X2, %A2 > + %R = fmul double %Y1, %Y2 > + ret double %R > +; CHECK-RD3: @test1 > +; CHECK-RD2: @test1 > +; CHECK-RD3-NOT:<2 x double> > +; CHECK-RD2:<2 x double>What about sorting them by check prefix?> +++ b/test/Transforms/BBVectorize/simple-ldstr.ll > @@ -0,0 +1,69 @@ > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s > + > +; Simple 3-pair chain with loads and stores > +define void @test1(double* %a, double* %b, double* %c) nounwind uwtable readonly { > +entry: > + %i0 = load double* %a, align 8 > + %i1 = load double* %b, align 8 > + %mul = fmul double %i0, %i1 > + %arrayidx3 = getelementptr inbounds double* %a, i64 1 > + %i3 = load double* %arrayidx3, align 8 > + %arrayidx4 = getelementptr inbounds double* %b, i64 1 > + %i4 = load double* %arrayidx4, align 8 > + %mul5 = fmul double %i3, %i4 > + store double %mul, double* %c, align 8 > + %arrayidx5 = getelementptr inbounds double* %c, i64 1 > + store double %mul5, double* %arrayidx5, align 8 > + ret void > +; CHECK: @test1 > +; CHECK:<2 x double>Especially here the entire vectorization sequence is interesting as I would 
love to see the alignment you produce for load/store instructions.> +; Basic depth-3 chain (last pair permuted) > +define double @test2(double %A1, double %A2, double %B1, double %B2) { > + %X1 = fsub double %A1, %B1 > + %X2 = fsub double %A2, %B2 > + %Y1 = fmul double %X1, %A1 > + %Y2 = fmul double %X2, %A2 > + %Z1 = fadd double %Y2, %B1 > + %Z2 = fadd double %Y1, %B2 > + %R = fmul double %Z1, %Z2 > + ret double %R > +; CHECK: @test2 > +; CHECK:<2 x double>Again, the entire sequence would show how you express a permutation.
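The LoadMoveSet bookkeeping quoted from fuseChosenPairs earlier in this review can be illustrated with a small standalone sketch of the re-keying step: when a pair (I, J) is fused into a new instruction K, the move-set entries of both pair members are duplicated under K so that later fusions still find them. This sketch uses plain std:: containers with int standing in for Value *; the function name and values are illustrative, not from the patch:

```cpp
#include <map>
#include <utility>
#include <vector>

// Illustrative stand-in for the LoadMoveSet merge in fuseChosenPairs:
// when the pair (I, J) is fused into the new instruction K, entries
// keyed by either I or J are re-inserted under K so that later fusions
// still see them. Keys and values are ints standing in for Value *.
typedef std::multimap<int, int> MoveSet;

void mergeMoveSet(MoveSet &LoadMoveSet, int I, int J, int K) {
  std::vector<std::pair<int, int> > NewSetMembers;

  // Collect the entries of both pair members under the new key K...
  std::pair<MoveSet::iterator, MoveSet::iterator> IRange =
      LoadMoveSet.equal_range(I);
  for (MoveSet::iterator N = IRange.first; N != IRange.second; ++N)
    NewSetMembers.push_back(std::make_pair(K, N->second));

  std::pair<MoveSet::iterator, MoveSet::iterator> JRange =
      LoadMoveSet.equal_range(J);
  for (MoveSet::iterator N = JRange.first; N != JRange.second; ++N)
    NewSetMembers.push_back(std::make_pair(K, N->second));

  // ...then insert them in a second pass, mirroring the two-phase
  // structure of the patch (collect into NewSetMembers, then insert).
  for (std::vector<std::pair<int, int> >::iterator A = NewSetMembers.begin(),
       AE = NewSetMembers.end(); A != AE; ++A)
    LoadMoveSet.insert(*A);
}
```

As the patch comment explains, the stale entries keyed by I and J are deliberately left in place: both instructions are erased from the block immediately after fusion, so their move sets are never consulted again.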
Hal Finkel
2011-Dec-02 17:32 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Fri, 2011-12-02 at 17:07 +0100, Tobias Grosser wrote:> On 11/23/2011 05:52 PM, Hal Finkel wrote: > > On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote: > >> > On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote: > >>> > > Tobias, > >>> > > > >>> > > I've attached an updated patch. It contains a few bug fixes and many > >>> > > (refactoring and coding-convention) changes inspired by your comments. > >>> > > > >>> > > I'm currently trying to fix the bug responsible for causing a compile > >>> > > failure when compiling > >>> > > test-suite/MultiSource/Applications/obsequi/toggle_move.c; after the > >>> > > pass begins to fuse instructions in a basic block in this file, the > >>> > > aliasing analysis starts producing different (more pessimistic) query > >>> > > answers. I've never before seen it do this, and I'm not sure why it is > >>> > > happening. Also odd, at the same time, the numeric labels that are > >>> > > assigned to otherwise unnamed instructions, change. I don't think I've > >>> > > seen this happen either (they generally seem to stay fixed once > >>> > > assigned). I don't know if these two things are related, but if anyone > >>> > > can provide some insight, I'd appreciate it. > >> > > >> > I think that I see what is happening in this case (please someone tell > >> > me if this does not make sense). In the problematic basic block, there > >> > are some loads and stores that are independent. The default aliasing > >> > analysis can tell that these loads and stores don't reference the same > >> > memory region. Then, some of the inputs to the getelementptr > >> > instructions used for these loads and stores are fused by the > >> > vectorization. After this happens, the aliasing analysis loses its > >> > ability to tell that the loads and stores that make use of those > >> > vector-calculated indices are independent. > > The attached patch works around this issue. Please review. > > Hi Hal, > > thanks for reworking the patch several times. 
This time the patch looks > a lot better, with nicely structured helper functions and a lot more > consistent style. It is a lot better to read and review. > > Unfortunately I had today just two hours to look into this patch. I > added a lot of remarks, but especially in the tree pruning part I ran a > little out of time such that the remarks are mostly obvious style > remarks. I hope I have a little bit more time over the weekend to look > into this again, but I wanted to get out the first chunk of comments today. Thanks! I appreciate you putting in that kind of time. [Also, there is a more-recent version than the one you're looking at, llvm_bb_vectorize-20111122.diff; make sure that you look at that one.]> > Especially with my style remarks, do not feel forced to follow all of my > comments. Style stuff is highly personal and often other people may > suggest other fixes. In case you believe your style is better just check > with the LLVM coding reference or other parts of the LLVM source code > and try to match this as well as possible. I have tried to do that, but other eyes are always useful.> > From an algorithmic point of view I am not yet completely convinced the > algorithm itself is not worse than N^2. I really want to understand this > part better and plan to look into problematic behaviour in functions not > directly visible in the core routines. Here are my thoughts: getCandidatePairs - This is O(N^2) in the number of instructions in BB. [This assumes that all of the external helper functions, especially queries to alias and scalar-evolution analysis, don't make the complexity worse.] computeConnectedPairs - For each candidate pair (there are O(N^2) of them in the worst case), this loops over all pairs of uses of the instructions in the candidate pair. So if every instruction has U uses, then this has complexity O(N^2 U^2). In practice, U is small, and so this is just another O(N^2).
Unfortunately, that is only true if determining if a given use pair is also a candidate pair is sub-linear. It is not in this case (it is linear because the multimap only indexes the first key, not both), but that could be changed by using a different (or additional) data structure to hold the candidate pairs. I should probably add a candidate-pair DenseSet to clean this up; then it will be O(N^2) [note that if an instruction has many uses, then it cannot be in many candidate pairs]. buildDepMap - This records all uses of each pairable instruction; as implemented, this is also O(N^2). choosePairs - As you point out, this is the most complicated to figure out. The reason is that it deals with connected pairs and that, as pairs are selected, other pairs are dropped from contention. Fundamentally, it is considering pair-to-pair interactions, and so that seems like O(N^4). However, it considers only the first possible tree that can be built from connected pairs for each remaining candidate pair. This is important for the following reason: If a block of N instructions had the maximum number of candidate pairs, N*(N-1)/2, then it must be the case that all instructions are independent, and, thus, none of them are connected. Thus, in this regime, the algorithm is O(N^2). At the other extreme, if the block has N candidate pairs, then they could all be connected, but then there are only N of them. In this extreme, the algorithm is also O(N^2). Because both extremes are O(N^2), I've generally figured that this is also O(N^2), but I may want to do more math for the paper I'm writing. fuseChosenPairs - This is linear in the number of selected pairs, and that number scales with the number of instructions in the BB. In the worst case, after each fusion, a number of instructions need to be moved (up to almost the total number in the block).
Thus, this is also O(N^2).> > Also, you choose a very high bound (4000) to limit the quadratic > behaviour, because you stated otherwise optimizations will be missed. > Can you send an example where a value smaller than 4000 would miss > optimizations?I may be able to find an example, but I don't have one immediately available. I had run the test suite with various settings, and chose the default value (4000) based on the approximate point where the average performance stopped increasing. Thus, I'm surmising that choosing a smaller value causes useful optimization opportunities to be missed. I go through the syntax comments and provide a revised patch. Thanks again, Hal> > Cheers > Tobi > > > diff --git a/lib/Transforms/Makefile b/lib/Transforms/Makefile > > index e527be2..8b1df92 100644 > > --- a/lib/Transforms/Makefile > > +++ b/lib/Transforms/Makefile > > @@ -8,7 +8,7 @@ > > ##===----------------------------------------------------------------------===## > > > > LEVEL = ../.. > > -PARALLEL_DIRS = Utils Instrumentation Scalar InstCombine IPO Hello > > +PARALLEL_DIRS = Utils Instrumentation Scalar InstCombine IPO Vectorize Hello > > > > include $(LEVEL)/Makefile.config > > > > diff --git a/lib/Transforms/Vectorize/BBVectorize.cpp b/lib/Transforms/Vectorize/BBVectorize.cpp > > new file mode 100644 > > index 0000000..5505502 > > --- /dev/null > > +++ b/lib/Transforms/Vectorize/BBVectorize.cpp > > @@ -0,0 +1,1644 @@ > > +//===- BBVectorize.cpp - A Basic-Block Vectorizer -------------------------===// > > +// > > +// The LLVM Compiler Infrastructure > > +// > > +// This file is distributed under the University of Illinois Open Source > > +// License. See LICENSE.TXT for details. > > +// > > +//===----------------------------------------------------------------------===// > > +// > > +// This file implements a basic-block vectorization pass. The algorithm was > > +// inspired by that used by the Vienna MAP Vectorizor by Franchetti and Kral, > > +// et al. 
It works by looking for chains of pairable operations and then > > +// pairing them. > > +// > > +//===----------------------------------------------------------------------===// > > + > > +#define BBV_NAME "bb-vectorize" > > +#define DEBUG_TYPE BBV_NAME > > +#include "llvm/Constants.h" > > +#include "llvm/DerivedTypes.h" > > +#include "llvm/Function.h" > > +#include "llvm/Instructions.h" > > +#include "llvm/IntrinsicInst.h" > > +#include "llvm/Intrinsics.h" > > +#include "llvm/LLVMContext.h" > > +#include "llvm/Pass.h" > > +#include "llvm/Type.h" > > +#include "llvm/ADT/DenseMap.h" > > +#include "llvm/ADT/DenseSet.h" > > +#include "llvm/ADT/SmallVector.h" > > +#include "llvm/ADT/Statistic.h" > > +#include "llvm/ADT/StringExtras.h" > > +#include "llvm/Analysis/AliasAnalysis.h" > > +#include "llvm/Analysis/AliasSetTracker.h" > > +#include "llvm/Analysis/ValueTracking.h" > This comes two lines later, if you want to sort it alphabetically. What do you mean? I had run the include statements through sort at some point, so I think they're all sorted.> > > +namespace { > > + struct BBVectorize : public BasicBlockPass { > > + static char ID; // Pass identification, replacement for typeid > > + BBVectorize() : BasicBlockPass(ID) { > > + initializeBBVectorizePass(*PassRegistry::getPassRegistry()); > > + } > > + > > + typedef std::pair<Value *, Value *> ValuePair; > > + typedef std::pair<ValuePair, size_t> ValuePairWithDepth; > > + typedef std::pair<ValuePair, ValuePair> VPPair; // A ValuePair pair > > + typedef std::pair<std::multimap<Value *, Value *>::iterator, > > + std::multimap<Value *, Value *>::iterator> VPIteratorPair; > > + typedef std::pair<std::multimap<ValuePair, ValuePair>::iterator, > > + std::multimap<ValuePair, ValuePair>::iterator> > > + VPPIteratorPair; > > + > > + // FIXME: const correct?
> > + > > + bool vectorizePairs(BasicBlock&BB); > > + > > + void getCandidatePairs(BasicBlock&BB, > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts); > > + > > + void computeConnectedPairs(std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs); > > + > > + void buildDepMap(BasicBlock&BB, > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + DenseSet<ValuePair> &PairableInstUsers); > > + > > + void choosePairs(std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *>& ChosenPairs); > > + > > + void fuseChosenPairs(BasicBlock&BB, > > + std::vector<Value *> &PairableInsts, > > + DenseMap<Value *, Value *>& ChosenPairs); > > + > > + bool isInstVectorizable(Instruction *I, bool&IsSimpleLoadStore); > > + > > + bool areInstsCompatible(Instruction *I, Instruction *J, > > + bool IsSimpleLoadStore); > > + > > + void trackUsesOfI(DenseSet<Value *> &Users, > > + AliasSetTracker&WriteSet, Instruction *I, > > + Instruction *J, bool&UsesI, bool UpdateUsers = true, > > + std::multimap<Value *, Value *> *LoadMoveSet = 0); > > + > > + void computePairsConnectedTo( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + ValuePair P); > > + > > + bool pairsConflict(ValuePair P, ValuePair Q, > > + DenseSet<ValuePair> &PairableInstUsers); > > + > > + void pruneTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + 
DenseMap<ValuePair, size_t> &Tree, > > + DenseSet<ValuePair> &PrunedTree, ValuePair J); > > + > > + void buildInitialTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + DenseMap<ValuePair, size_t> &Tree, ValuePair J); > > + > > + void findBestTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + DenseSet<ValuePair> &BestTree, size_t&BestMaxDepth, > > + size_t&BestEffSize, VPIteratorPair ChoiceRange); > > + > > + Value *getReplacementPointerInput(LLVMContext& Context, Instruction *I, > > + Instruction *J, unsigned o, bool&FlipMemInputs); > > + > > + Value *getReplacementShuffleMask(LLVMContext& Context, Instruction *I, > > + Instruction *J, unsigned o); > > + > > + Value *getReplacementInput(LLVMContext& Context, Instruction *I, > > + Instruction *J, unsigned o, bool FlipMemInputs); > > + > > + void getReplacementInputsForPair(LLVMContext& Context, Instruction *I, > > + Instruction *J, SmallVector<Value *, 3> &ReplacedOperands, > > + bool&FlipMemInputs); > > + > > + void replaceOutputsOfPair(LLVMContext& Context, Instruction *I, > > + Instruction *J, Instruction *K, > > + Instruction *&InsertionPt, Instruction *&K1, > > + Instruction *&K2, bool&FlipMemInputs); > > + > > + void collectPairLoadMoveSet(BasicBlock&BB, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet, > > + Instruction *I, Instruction *J); > > + > > + void collectLoadMoveSet(BasicBlock&BB, > > + std::vector<Value *> &PairableInsts, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet); > > + > > 
+ void moveUsesOfIAfterJ(BasicBlock&BB, > > + DenseMap<Value *, Value *>& ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet, > > + Instruction *&InsertionPt, > > + Instruction *I, Instruction *J); > > Here I am somehow unsatisfied with the indentation, but can currently > not suggest anything better. The single operands are just too long. ;-) > > > + > > + virtual bool runOnBasicBlock(BasicBlock&BB) { > > + bool changed = false; > > + // Iterate a sufficient number of times to merge types of > > + // size 1, then 2, then 4, etc. up to half of the > > + // target vector width of the target vector register. > Use all 80 columns. > > Also. What is a type of size 1? Do you mean 1 bit, one byte or what? > > > + for (unsigned v = 2, n = 1; v<= VectorBits&& > > + (!MaxIter || n<= MaxIter); v *= 2, ++n) { > ^^^^^^^^^^^^^^^^^^^^^^^^^^ > This fits in the previous row. > Also, having the whole condition in one row seems easier to read. > > > + DEBUG(dbgs()<< "BBV: fusing loop #"<< n<< > > + " for "<< BB.getName()<< " in "<< > > + BB.getParent()->getName()<< "...\n"); > > + if (vectorizePairs(BB)) { > > + changed = true; > > + } else { > > + break; > > + } > Braces not needed here. > > > + } > > + > > + DEBUG(dbgs()<< "BBV: done!\n"); > > + return changed; > > + } > > + > > + virtual void getAnalysisUsage(AnalysisUsage&AU) const { > > + BasicBlockPass::getAnalysisUsage(AU); > > + AU.addRequired<AliasAnalysis>(); > > + AU.addRequired<ScalarEvolution>(); > > + AU.addPreserved<AliasAnalysis>(); > > + AU.addPreserved<ScalarEvolution>(); > > + } > > + > > + // This returns the vector type that holds a pair of the provided type. > > + // If the provided type is already a vector, then its length is doubled. > > + static inline VectorType *getVecType(Type *VElemType) { > > What about naming this getVecTypeForPair()? > > I think ElemType is sufficient. No need for the 'V'. 
+ if (VElemType->isVectorTy()) { > > + unsigned numElem = cast<VectorType>(VElemType)->getNumElements(); > > + return VectorType::get(VElemType->getScalarType(), numElem*2); > > + } else { > > + return VectorType::get(VElemType, 2); > > + } > > + } > > This can be: > > if (VectorType *VectorTy = dyn_cast<VectorType>(ElemType)) { > unsigned numElem = VectorTy->getNumElements(); > return VectorType::get(ElemType->getScalarType(), numElem*2); > } else { > return VectorType::get(ElemType, 2); > } > > > + // Returns the weight associated with the provided value. A chain of > > + // candidate pairs has a length given by the sum of the weights of its > > + // members (one weight per pair; the weight of each member of the pair > > + // is assumed to be the same). This length is then compared to the > > + // chain-length threshold to determine if a given chain is significant > > + // enough to be vectorized. The length is also used in comparing > > + // candidate chains where longer chains are considered to be better. > > + // Note: when this function returns 0, the resulting instructions are > > + // not actually fused. > > + static inline size_t getDepthFactor(Value *V) { > > + if (isa<InsertElementInst>(V) || isa<ExtractElementInst>(V)) { > > Why do InsertElementInst and ExtractElementInst instructions have a weight of > zero? Two reasons: First, they cannot be usefully fused. Second, because the pass generates a lot of these, they can confuse the simple metric used to compare the trees in the next iteration. Thus, giving them a weight of zero allows the pass to essentially ignore them in subsequent iterations when looking for vectorization opportunities while still tracking dependency chains that flow through those instructions.> > > + return 0; > > + } > No need for braces. > > > + > > + return 1; > > + } > > + > > + // This determines the relative offset of two loads or stores, returning > > + // true if the offset could be determined to be some constant value.
> > + // For example, if OffsetInElmts == 1, then J accesses the memory directly > > + // after I; if OffsetInElmts == -1 then I accesses the memory > > + // directly after J. This function assumes that both instructions > > + // have the same type. > You may want to mention more explicit that this function returns parts > of the result in OffsetInElmts. > > I think 'Offset' is enough, no need for 'OffsetInElmts'. > > > + bool getPairPtrInfo(Instruction *I, Instruction *J, const TargetData&TD, > > + Value *&IPtr, Value *&JPtr, unsigned&IAlignment, unsigned&JAlignment, > > + int64_t&OffsetInElmts) { > > + OffsetInElmts = 0; > > + if (isa<LoadInst>(I)) { > > + IPtr = cast<LoadInst>(I)->getPointerOperand(); > > + JPtr = cast<LoadInst>(J)->getPointerOperand(); > > + IAlignment = cast<LoadInst>(I)->getAlignment(); > > + JAlignment = cast<LoadInst>(J)->getAlignment(); > > + } else { > > + IPtr = cast<StoreInst>(I)->getPointerOperand(); > > + JPtr = cast<StoreInst>(J)->getPointerOperand(); > > + IAlignment = cast<StoreInst>(I)->getAlignment(); > > + JAlignment = cast<StoreInst>(J)->getAlignment(); > > + } > > + > > + ScalarEvolution&SE = getAnalysis<ScalarEvolution>(); > > + const SCEV *IPtrSCEV = SE.getSCEV(IPtr); > > + const SCEV *JPtrSCEV = SE.getSCEV(JPtr); > > + > > + // If this is a trivial offset, then we'll get something like > > + // 1*sizeof(type). With target data, which we need anyway, this will get > > + // constant folded into a number. 
> > + const SCEV *OffsetSCEV = SE.getMinusSCEV(JPtrSCEV, IPtrSCEV); > > + if (const SCEVConstant *ConstOffSCEV = > > + dyn_cast<SCEVConstant>(OffsetSCEV)) { > > + ConstantInt *IntOff = ConstOffSCEV->getValue(); > > + int64_t Offset = IntOff->getSExtValue(); > > + > > + Type *VTy = cast<PointerType>(IPtr->getType())->getElementType(); > > + int64_t VTyTSS = (int64_t) TD.getTypeStoreSize(VTy); > > + > > + assert(VTy == cast<PointerType>(JPtr->getType())->getElementType()); > > + > > + OffsetInElmts = Offset/VTyTSS; > > + return (abs64(Offset) % VTyTSS) == 0; > > + } > > + > > + return false; > > + } > > + > > + // Returns true if the provided CallInst represents an intrinsic that can > > + // be vectorized. > > + bool isVectorizableIntrinsic(CallInst* I) { > > + bool VIntr = false; > > + if (Function *F = I->getCalledFunction()) { > > + if (unsigned IID = F->getIntrinsicID()) { > > Use early exit. > > Function *F = I->getCalledFunction(); > > if (!F) > return false; > > unsigned IID = F->getIntrinsicID(); > > if (!IID) > return false; > > switch(IID) { > case Intrinsic::sqrt: > case Intrinsic::powi: > case Intrinsic::sin: > case Intrinsic::cos: > case Intrinsic::log: > case Intrinsic::log2: > case Intrinsic::log10: > case Intrinsic::exp: > case Intrinsic::exp2: > case Intrinsic::pow: > return !NoMath; > case Intrinsic::fma: > return !NoFMA; > } > > Is this switch exhaustive or are you missing a default case? You're not looking at the most-recent version. You had (correctly) commented on this last time, and I had corrected it.> > > + // Returns true if J is the second element in some ValuePair referenced by > > + // some multimap ValuePair iterator pair. > > + bool isSecondInVPIteratorPair(VPIteratorPair PairRange, Value *J) { > > + for (std::multimap<Value *, Value *>::iterator K = PairRange.first; > > + K != PairRange.second; ++K) { > > + if (K->second == J) return true; > > + } > No need for braces.
> > > + > > + return false; > > + } > > + }; > > + > > + // This function implements one vectorization iteration on the provided > > + // basic block. It returns true if the block is changed. > > + bool BBVectorize::vectorizePairs(BasicBlock&BB) { > > + std::vector<Value *> PairableInsts; > > + std::multimap<Value *, Value *> CandidatePairs; > > + getCandidatePairs(BB, CandidatePairs, PairableInsts); > > + if (!PairableInsts.size()) return false; > > if (PairableInsts.size() == 0) ... > > I think it is more straightforward to compare a size to 0 than to force > the brain to implicitly convert it to bool. > > > + > > + // Now we have a map of all of the pairable instructions and we need to > we have a map of all pairable instructions > > > + // select the best possible pairing. A good pairing is one such that the > > + // Users of the pair are also paired. This defines a (directed) forest > users > ^ > > + // over the pairs such that two pairs are connected iff the second pair > > + // uses the first. > > + > > + // Note that it only matters that both members of the second pair use some > > + // element of the first pair (to allow for splatting). > > + > > + std::multimap<ValuePair, ValuePair> ConnectedPairs; > > + computeConnectedPairs(CandidatePairs, PairableInsts, ConnectedPairs); > > + if (!ConnectedPairs.size()) return false; > > if (ConnectedPairs.size() == 0) ... > > > + // Build the pairable-instruction dependency map > > + DenseSet<ValuePair> PairableInstUsers; > > + buildDepMap(BB, CandidatePairs, PairableInsts, PairableInstUsers); > > + > > + // There is now a graph of the connected pairs. For each variable, pick the > > + // pairing with the largest tree meeting the depth requirement on at least > > + // one branch. Then select all pairings that are part of that tree and > > + // remove them from the list of available parings and pairable variables. 
> pairings > > > + > > + DenseMap<Value *, Value *> ChosenPairs; > > + choosePairs(CandidatePairs, PairableInsts, ConnectedPairs, > > + PairableInstUsers, ChosenPairs); > > + > > + if (!ChosenPairs.size()) return false; > > + NumFusedOps += ChosenPairs.size(); > > + > > + // A set of chosen pairs has now been selected. It is now necessary to > // A set of pairs has been chosen. > > + // replace the paired functions with vector functions. For this procedure > > You are only talking about functions. I thought the main objective is to > pair LLVM-IR _instructions_. Indeed, bad wording. I do mean instructions.> > > + // each argument much be replaced with a vector argument. This vector > must > > At least for instructions, I would use the term 'operand' instead of > 'argument'. > > + // is formed by using build_vector on the old arguments. The replaced > > + // values are then replaced with a vector_extract on the result. > > + // Subsequent optimization passes should coalesce the build/extract > > + // combinations. > > + > > + fuseChosenPairs(BB, PairableInsts, ChosenPairs); > > + > > + return true; > > + } > > + > > + // This function returns true if the provided instruction is capable of being > > + // fused into a vector instruction. This determination is based only on the > > + // type and other attributes of the instruction. > > + bool BBVectorize::isInstVectorizable(Instruction *I, > > + bool&IsSimpleLoadStore) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > AA can become a member of the BBVectorize class and it can be > initialized once per basic block. > > > + bool VIntr = false; > > + if (isa<CallInst>(I)) { > > + VIntr = isVectorizableIntrinsic(cast<CallInst>(I)); > > + } > Remove VIntr and use early exit. > > if (CallInst *C = dyn_cast<CallInst>(I)) { > if (!isVectorizableIntrinsic(C)) > return false; > } > > Also, no braces if not necessary in at least one branch.
> > + // Vectorize simple loads and stores if possible: > > + IsSimpleLoadStore = false; > > + if (isa<LoadInst>(I)) { > > + IsSimpleLoadStore = cast<LoadInst>(I)->isSimple(); > > + } else if (isa<StoreInst>(I)) { > > + IsSimpleLoadStore = cast<StoreInst>(I)->isSimple(); > > + } > Again, no braces needed and use early exit. > > } else if (LoadInst *L = dyn_cast<LoadInst>(I)) { > if (!L->isSimple() || NoMemOps) { > IsSimpleLoadStore = false; > return false; > } > } else if (StoreInst *S = dyn_cast<StoreInst>(I)) { > if (!S->isSimple() || NoMemOps) { > IsSimpleLoadStore = false; > return false; > } > } > > > + // We can vectorize casts, but not casts of pointer types, etc. > > + bool IsCast = false; > > + if (I->isCast()) { > > + IsCast = true; > > + if (!cast<CastInst>(I)->getSrcTy()->isSingleValueType()) { > > + IsCast = false; > > + } else if (!cast<CastInst>(I)->getDestTy()->isSingleValueType()) { > > + IsCast = false; > > + } else if (cast<CastInst>(I)->getSrcTy()->isPointerTy()) { > > + IsCast = false; > > + } else if (cast<CastInst>(I)->getDestTy()->isPointerTy()) { > > + IsCast = false; > > + } > > + } > > } else if (CastInst *C = dyn_cast<CastInst>(I)) { > if (NoCasts) > return false; > > Type *SrcTy = C->getSrcTy(); > > if (!SrcTy->isSingleValueType() || SrcTy->isPointerTy()) > return false; > > Type *DestTy = C->getDestTy(); > > if (!DestTy->isSingleValueType() || DestTy->isPointerTy()) > return false; > } > > > + > > + if (!(I->isBinaryOp() || isa<ShuffleVectorInst>(I) || > > + isa<ExtractElementInst>(I) || isa<InsertElementInst>(I) || > > + (!NoCasts&& IsCast) || VIntr || > > + (!NoMemOps&& IsSimpleLoadStore))) { > > + return false; > > + } > > else if (I->isBinaryOp() > || isa<ShuffleVectorInst>(I) > || isa<ExtractElementInst>(I) > || isa<InsertElementInst>(I)) { > // Those instructions are always OK. > } else { > // Everything else cannot be handled.
> return false; > } > > > > + > > + // We can't vectorize memory operations without target data > > + if (AA.getTargetData() == 0&& IsSimpleLoadStore) { > > + return false; > > Is this check needed at this place? I had the feeling the whole pass > does not work without target data. At least the offset calculation seems > to require it.Yes, it is needed. Without target data, no loads or stores are selected for vectorization, and so no offsets are calculated. The pass should otherwise work, however, without target data. I tested this at some point, if it does not degrade gracefully, then I consider that to be a bug.> > > + } > > + > > + Type *T1, *T2; > > + if (isa<StoreInst>(I)) { > > + // For stores, it is the value type, not the pointer type that matters > > + // because the value is what will come from a vector register. > > + > > + Value *IVal = cast<StoreInst>(I)->getValueOperand(); > > + T1 = IVal->getType(); > > + } else { > > + T1 = I->getType(); > > + } > > > + > > + if (I->isCast()) { > > + T2 = cast<CastInst>(I)->getSrcTy(); > > + } else { > > + T2 = T1; > > + } > No braces needed. > > > + > > + // Not every type can be vectorized... > > + if (!(VectorType::isValidElementType(T1) || T1->isVectorTy()) || > > + !(VectorType::isValidElementType(T2) || T2->isVectorTy())) > > + return false; > > + > > + if (NoInts&& (T1->isIntOrIntVectorTy() || T2->isIntOrIntVectorTy())) > > + return false; > > + > > + if (NoFloats&& (T1->isFPOrFPVectorTy() || T2->isFPOrFPVectorTy())) > > + return false; > > + > > + if (T1->getPrimitiveSizeInBits()> VectorBits/2 || > > + T2->getPrimitiveSizeInBits()> VectorBits/2) > > + return false; > > + > > + return true; > > + } > > + > > + // This function returns true if the two provided instructions are compatible > > + // (meaning that they can be fused into a vector instruction). This assumes > > + // that I has already been determined to be vectorizable and that J is not > > + // in the use tree of I. 
> > + bool BBVectorize::areInstsCompatible(Instruction *I, Instruction *J, > > + bool IsSimpleLoadStore) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > AA can become a member of the BBVectorize class and it can be > initialized once per basic block. > > > + bool IsCompat = J->isSameOperationAs(I); > > + // FIXME: handle addsub-type operations! > > + > > + // Loads and stores can be merged if they have different alignments, > > + // but are otherwise the same. > > + if (!IsCompat&& isa<LoadInst>(I)&& isa<LoadInst>(J)) { > > + if (I->getType() == J->getType()) { > > + if (cast<LoadInst>(I)->getPointerOperand()->getType() => > + cast<LoadInst>(J)->getPointerOperand()->getType()&& > > + cast<LoadInst>(I)->isVolatile() => > + cast<LoadInst>(J)->isVolatile()&& > > + cast<LoadInst>(I)->getOrdering() => > + cast<LoadInst>(J)->getOrdering()&& > > + cast<LoadInst>(I)->getSynchScope() => > + cast<LoadInst>(J)->getSynchScope() > > + ) { > > + IsCompat = true; > > + } > > + } > > + } else if (!IsCompat&& isa<StoreInst>(I)&& isa<StoreInst>(J)) { > > + if (cast<StoreInst>(I)->getValueOperand()->getType() => > + cast<StoreInst>(J)->getValueOperand()->getType()&& > > + cast<StoreInst>(I)->getPointerOperand()->getType() => > + cast<StoreInst>(J)->getPointerOperand()->getType()&& > > + cast<StoreInst>(I)->isVolatile() => > + cast<StoreInst>(J)->isVolatile()&& > > + cast<StoreInst>(I)->getOrdering() => > + cast<StoreInst>(J)->getOrdering()&& > > + cast<StoreInst>(I)->getSynchScope() => > + cast<StoreInst>(J)->getSynchScope() > > + ) { > > + IsCompat = true; > > + } > > + } > > + > > + if (!IsCompat) return false; > > + > > + if (IsSimpleLoadStore) { > > + const TargetData&TD = *AA.getTargetData(); > > + > > + Value *IPtr, *JPtr; > > + unsigned IAlignment, JAlignment; > > + int64_t OffsetInElmts = 0; > > + if (getPairPtrInfo(I, J, TD, IPtr, JPtr, IAlignment, JAlignment, > > + OffsetInElmts)&& abs64(OffsetInElmts) == 1) { > > + if (AlignedOnly) { > > + Type *aType = 
isa<StoreInst>(I) ? > > + cast<StoreInst>(I)->getValueOperand()->getType() : I->getType(); > > + // An aligned load or store is possible only if the instruction > > + // with the lower offset has an alignment suitable for the > > + // vector type. > > + > > + unsigned BottomAlignment = IAlignment; > > + if (OffsetInElmts< 0) BottomAlignment = JAlignment; > > + > > + Type *VType = getVecType(aType); > > + unsigned VecAlignment = TD.getPrefTypeAlignment(VType); > > + if (BottomAlignment< VecAlignment) { > > + IsCompat = false; > > + } > > + } > > + } else { > > + IsCompat = false; > > + } > > + } else if (isa<ShuffleVectorInst>(I)) { > > + // Only merge two shuffles if they're both constant > > + // or both not constant. > > + IsCompat = isa<Constant>(I->getOperand(2))&& > > + isa<Constant>(J->getOperand(2)); > > + // FIXME: We may want to vectorize non-constant shuffles also. > > + } > > + > > + return IsCompat; > > Here again for the whole function. Can you try to use early exit to > simplify the code http://llvm.org/docs/CodingStandards.html#hl_earlyexit > You might be able to remove the use of IsCompat completely. > > > + } > > + > > + // Figure out whether or not J uses I and update the users and write-set > > + // structures associated with I. > > You may want to give I and J more descriptive names. I remember I asked > you the last time to rename One and Two to I and J, but in this case it > seems both have very specific roles which could be used to describe > them. What about 'Inst' and 'User'? > > Also what are the arguments of this function? Can you explain them a > little? > > > + void BBVectorize::trackUsesOfI(DenseSet<Value *> &Users, > > + AliasSetTracker&WriteSet, Instruction *I, > > + Instruction *J, bool&UsesI, bool UpdateUsers, > > You may want to return UsesI as a function return value not as a > reference parameter, though you may need to update the function name > accordingly.
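The suggested signature change, returning the "uses" flag instead of writing through a `bool&`, can be sketched with plain containers standing in for the LLVM structures (the names, the int-valued instructions, and the (user, operand) edge list are all hypothetical):

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Sketch: instructions are plain ints; Users is the set of values already
// known to depend on Inst; UseEdges lists (user, operand) pairs. Returns
// true if Candidate uses Inst, recording Candidate as a user when it does.
bool trackUsesOf(std::set<int> &Users, int Inst, int Candidate,
                 const std::vector<std::pair<int, int> > &UseEdges) {
  for (std::size_t i = 0; i != UseEdges.size(); ++i) {
    int User = UseEdges[i].first, Operand = UseEdges[i].second;
    // Candidate uses Inst if one of its operands is Inst itself or any
    // previously recorded user of Inst.
    if (User == Candidate && (Operand == Inst || Users.count(Operand))) {
      Users.insert(Candidate);
      return true; // callers can now write: if (trackUsesOf(...)) break;
    }
  }
  return false;
}
```

With the flag as the return value, call sites collapse from two statements (declare `UsesI`, call, test) into a single condition.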
> > > + std::multimap<Value *, Value *> *LoadMoveSet) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + > > + UsesI = false; > > + > > + // This instruction may already be marked as a user due, for example, to > > + // being a member of a selected pair. > > + if (Users.count(J)) > > + UsesI = true; > > + > > + if (!UsesI) for (User::op_iterator JU = J->op_begin(), e = J->op_end(); > Use a new line for this 'for'. > > > + JU != e; ++JU) { > > + Value *V = *JU; > > + if (I == V || Users.count(V)) { > > + UsesI = true; > > + break; > > + } > > + } > > + if (!UsesI&& J->mayReadFromMemory()) { > > + if (LoadMoveSet) { > > + VPIteratorPair JPairRange = LoadMoveSet->equal_range(J); > > + UsesI = isSecondInVPIteratorPair(JPairRange, I); > > + } > > + else for (AliasSetTracker::iterator W = WriteSet.begin(), > Use a new line for this 'for'. > > > + WE = WriteSet.end(); W != WE; ++W) { > > + for (AliasSet::iterator A = W->begin(), AE = W->end(); > > + A != AE; ++A) { > > + AliasAnalysis::Location ptrLoc(A->getValue(), A->getSize(), > > + A->getTBAAInfo()); > Align this argument with the first argument. > > AliasAnalysis::Location ptrLoc(A->getValue(), A->getSize(), > A->getTBAAInfo()); > > > + if (AA.getModRefInfo(J, ptrLoc) != AliasAnalysis::NoModRef) { > > + UsesI = true; > > + break; > > + } > > + } > > + if (UsesI) break; > > + } > > + } > > + if (UsesI&& UpdateUsers) { > > + if (J->mayWriteToMemory()) WriteSet.add(J); > > + Users.insert(J); > > + } > > + } > > + > > + // This function iterates over all instruction pairs in the provided > > + // basic block and collects all candidate pairs for vectorization. 
> > + void BBVectorize::getCandidatePairs(BasicBlock&BB, > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + BasicBlock::iterator E = BB.end(); > > + for (BasicBlock::iterator I = BB.getFirstInsertionPt(); I != E; ++I) { > > + bool IsSimpleLoadStore; > > + if (!isInstVectorizable(I, IsSimpleLoadStore)) continue; > > + > > + // Look for an instruction with which to pair instruction *I... > > + DenseSet<Value *> Users; > > + AliasSetTracker WriteSet(AA); > > + BasicBlock::iterator J = I; ++J; > > + for (unsigned ss = 0; J != E&& ss<= SearchLimit; ++J, ++ss) { > > + // Determine if J uses I, if so, exit the loop. > > + bool UsesI; > > + trackUsesOfI(Users, WriteSet, I, J, UsesI, !FastDep); > > 'UsesI' can be eliminated if the corresponding value is passed as > function return value. > > > + if (FastDep) { > > + // Note: For this heuristic to be effective, independent operations > > + // must tend to be intermixed. This is likely to be true from some > > + // kinds of grouped loop unrolling (but not the generic LLVM pass), > > + // but otherwise may require some kind of reordering pass. > > + > > + // When using fast dependency analysis, > > + // stop searching after first use: > > + if (UsesI) break; > > + } else { > > + if (UsesI) continue; > > + } > > + > > + // J does not use I, and comes before the first use of I, so it can be > > + // merged with I if the instructions are compatible. > > + if (!areInstsCompatible(I, J, IsSimpleLoadStore)) continue; > > + > > + // J is a candidate for merging with I. 
> > + if (!PairableInsts.size() || > > + PairableInsts[PairableInsts.size()-1] != I) { > > + PairableInsts.push_back(I); > > + } > > + CandidatePairs.insert(ValuePair(I, J)); > > + DEBUG(dbgs()<< "BBV: candidate pair "<< *I<< > > + "<-> "<< *J<< "\n"); > > + } > > + } > > + > > + DEBUG(dbgs()<< "BBV: found "<< PairableInsts.size() > > +<< " instructions with candidate pairs\n"); > > + } > > + > > + // Finds candidate pairs connected to the pair P =<PI, PJ>. This means that > > + // it looks for pairs such that both members have an input which is an > > + // output of PI or PJ. > > + void BBVectorize::computePairsConnectedTo( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + ValuePair P) { > > + // For each possible pairing for this variable, look at the uses of > > + // the first value... > > + for (Value::use_iterator I = P.first->use_begin(), > > + E = P.first->use_end(); I != E; ++I) { > > + VPIteratorPair IPairRange = CandidatePairs.equal_range(*I); > > + > > + // For each use of the first variable, look for uses of the second > > + // variable... > > + for (Value::use_iterator J = P.second->use_begin(), > > + E2 = P.second->use_end(); J != E2; ++J) { > > + VPIteratorPair JPairRange = CandidatePairs.equal_range(*J); > > + > > + // Look for<I, J>: > > + if (isSecondInVPIteratorPair(IPairRange, *J)) > > + ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J))); > > + > > + // Look for<J, I>: > > + if (isSecondInVPIteratorPair(JPairRange, *I)) > > + ConnectedPairs.insert(VPPair(P, ValuePair(*J, *I))); > > + } > > + > > + // Look for cases where just the first value in the pair is used by > > + // both members of another pair (splatting). > > + Value::use_iterator J = P.first->use_begin(); > > + if (!SplatBreaksChain) for (; J != E; ++J) { > This can become an early continue. Also please do not put an 'if' and a > 'for' into the same line. 
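The restructuring asked for here, no `if` and `for` squeezed onto one line, and an early `continue` inside the loop, can look like this minimal stand-alone sketch (the names and the int-valued uses are hypothetical stand-ins for the pass's use iteration):

```cpp
#include <cassert>
#include <vector>

// Sketch: instead of 'if (!SplatBreaksChain) for (...) { ... }' on one
// line, guard the whole loop with an early exit and use 'continue' for
// the uninteresting iterations.
int countSplatUses(const std::vector<int> &Uses, int First,
                   bool SplatBreaksChain) {
  if (SplatBreaksChain)
    return 0; // early exit replaces the combined if/for line

  int Matches = 0;
  for (std::size_t i = 0; i != Uses.size(); ++i) {
    if (Uses[i] != First)
      continue; // early continue instead of nesting the match logic
    ++Matches;
  }
  return Matches;
}
```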
> > > + if (isSecondInVPIteratorPair(IPairRange, *J)) > > + ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J))); > > + } > > + } > > + > > + // Look for cases where just the second value in the pair is used by > > + // both members of another pair (splatting). > > + if (!SplatBreaksChain) > This can become an early return. > > > + for (Value::use_iterator I = P.second->use_begin(), > > + E = P.second->use_end(); I != E; ++I) { > > + VPIteratorPair IPairRange = CandidatePairs.equal_range(*I); > > + > > + Value::use_iterator J = P.second->use_begin(); > > + for (; J != E; ++J) { > Why not move initialization into 'for'? > > > + if (isSecondInVPIteratorPair(IPairRange, *J)) > > + ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J))); > > + } > > + } > > + } > > + > > + // This function figures out which pairs are connected. Two pairs are > > + // connected if some output of the first pair forms an input to both members > > + // of the second pair. > > + void BBVectorize::computeConnectedPairs( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs) { > > + > > + for (std::vector<Value *>::iterator PI = PairableInsts.begin(), > > + PE = PairableInsts.end(); PI != PE; ++PI) { > > + VPIteratorPair choiceRange = CandidatePairs.equal_range(*PI); > > + > > + for (std::multimap<Value *, Value *>::iterator P = choiceRange.first; > > + P != choiceRange.second; ++P) { > > + computePairsConnectedTo(CandidatePairs, PairableInsts, > > + ConnectedPairs, *P); > Align CandidatePairs and ConnectedPairs. > > > + } > No braces needed here. > > > + } > > + > > + DEBUG(dbgs()<< "BBV: found "<< ConnectedPairs.size() > > +<< " pair connections.\n"); > > + } > > + > > + // This function builds a set of use tuples such that<A, B> is in the set > > + // if B is in the use tree of A. If B is in the use tree of A, then B > > + // depends on the output of A. 
> > + void BBVectorize::buildDepMap( > > + BasicBlock&BB, > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + DenseSet<ValuePair> &PairableInstUsers) { > > + DenseSet<Value *> IsInPair; > > + for (std::multimap<Value *, Value *>::iterator C = CandidatePairs.begin(), > > + E = CandidatePairs.end(); C != E; ++C) { > > + IsInPair.insert(C->first); > > + IsInPair.insert(C->second); > > + } > > + > > + // Iterate through the basic block, recording all Users of each > > + // pairable instruction. > > + > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + BasicBlock::iterator E = BB.end(); > > + for (BasicBlock::iterator I = BB.getFirstInsertionPt(); I != E; ++I) { > > + if (IsInPair.find(I) == IsInPair.end()) continue; > > + > > + DenseSet<Value *> Users; > > + AliasSetTracker WriteSet(AA); > > + BasicBlock::iterator J = I; ++J; > > + for (; J != E; ++J) { > Move the initialization into the for loop (note that BasicBlock::iterator > is not random access, so use llvm::next rather than 'I + 1'): > > for (BasicBlock::iterator J = llvm::next(I); J != E; ++J) > > > + bool UsesI; > > + trackUsesOfI(Users, WriteSet, I, J, UsesI); > > + } > > + > > + for (DenseSet<Value *>::iterator U = Users.begin(), E = Users.end(); > > + U != E; ++U) { > ^^ Check indentation here. > > > + PairableInstUsers.insert(ValuePair(I, *U)); > > + } > No braces needed. > > > + } > > + } > > + > > + // Returns true if an input to pair P is an output of pair Q and also an > > + // input of pair Q is an output of pair P. If this is the case, then these > > + // two pairs cannot be simultaneously fused. > > + bool BBVectorize::pairsConflict(ValuePair P, ValuePair Q, > > + DenseSet<ValuePair> &PairableInstUsers) { > > + // Two pairs are in conflict if they are mutual Users of eachother.
> > + bool QUsesP = PairableInstUsers.count(ValuePair(P.first, Q.first)) || > > + PairableInstUsers.count(ValuePair(P.first, Q.second)) || > > + PairableInstUsers.count(ValuePair(P.second, Q.first)) || > > + PairableInstUsers.count(ValuePair(P.second, Q.second)); > > + bool PUsesQ = PairableInstUsers.count(ValuePair(Q.first, P.first)) || > > + PairableInstUsers.count(ValuePair(Q.first, P.second)) || > > + PairableInstUsers.count(ValuePair(Q.second, P.first)) || > > + PairableInstUsers.count(ValuePair(Q.second, P.second)); > > + return (QUsesP&& PUsesQ); > > + } > > + > > + // This function builds the initial tree of connected pairs with the > > + // pair J at the root. > > + void BBVectorize::buildInitialTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + DenseMap<ValuePair, size_t> &Tree, ValuePair J) { > > + // Each of these pairs is viewed as the root node of a Tree. The Tree > > + // is then walked (depth-first). As this happens, we keep track of > > + // the pairs that compose the Tree and the maximum depth of the Tree. 
> > + std::vector<ValuePairWithDepth> Q; > > + // General depth-first post-order traversal: > > + Q.push_back(ValuePairWithDepth(J, getDepthFactor(J.first))); > > + while (!Q.empty()) { > > + ValuePairWithDepth QTop = Q.back(); > > + > > + // Push each child onto the queue: > > + bool MoreChildren = false; > > + size_t MaxChildDepth = QTop.second; > > + VPPIteratorPair qtRange = ConnectedPairs.equal_range(QTop.first); > > + for (std::map<ValuePair, ValuePair>::iterator k = qtRange.first; > > + k != qtRange.second; ++k) { > > + // Make sure that this child pair is still a candidate: > > + bool IsStillCand = false; > > + VPIteratorPair checkRange > > + CandidatePairs.equal_range(k->second.first); > > + for (std::multimap<Value *, Value *>::iterator m = checkRange.first; > > + m != checkRange.second; ++m) { > > + if (m->second == k->second.second) { > > + IsStillCand = true; > > + break; > > + } > > + } > > + > > + if (IsStillCand) { > > + DenseMap<ValuePair, size_t>::iterator C = Tree.find(k->second); > > + if (C == Tree.end()) { > > + size_t d = getDepthFactor(k->second.first); > > + Q.push_back(ValuePairWithDepth(k->second, QTop.second+d)); > > + MoreChildren = true; > > + } else { > > + MaxChildDepth = std::max(MaxChildDepth, C->second); > > + } > > + } > > + } > > + > > + if (!MoreChildren) { > > + // Record the current pair as part of the Tree: > > + Tree.insert(ValuePairWithDepth(QTop.first, MaxChildDepth)); > > + Q.pop_back(); > > + } > > + } > > + } > > + > > + // Given some initial tree, prune it by removing conflicting pairs (pairs > > + // that cannot be simultaneously chosen for vectorization). 
> > + void BBVectorize::pruneTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + DenseMap<ValuePair, size_t> &Tree, > > + DenseSet<ValuePair> &PrunedTree, ValuePair J) { > > + std::vector<ValuePairWithDepth> Q; > > + // General depth-first post-order traversal: > > + Q.push_back(ValuePairWithDepth(J, getDepthFactor(J.first))); > > + while (!Q.empty()) { > > + ValuePairWithDepth QTop = Q.back(); > > + PrunedTree.insert(QTop.first); > > + Q.pop_back(); > > + > > + // Visit each child, pruning as necessary... > > + DenseMap<ValuePair, size_t> BestChilden; > > + VPPIteratorPair QTopRange = ConnectedPairs.equal_range(QTop.first); > > + for (std::map<ValuePair, ValuePair>::iterator K = QTopRange.first; > > + K != QTopRange.second; ++K) { > > + DenseMap<ValuePair, size_t>::iterator C = Tree.find(K->second); > > + if (C == Tree.end()) continue; > > + > > + // This child is in the Tree, now we need to make sure it is the > > + // best of any conflicting children. There could be multiple > > + // conflicting children, so first, determine if we're keeping > > + // this child, then delete conflicting children as necessary. > > + > > + // It is also necessary to guard against pairing-induced > > + // dependencies. Consider instructions a .. x .. y .. b > > + // such that (a,b) are to be fused and (x,y) are to be fused > > + // but a is an input to x and b is an output from y. This > > + // means that y cannot be moved after b but x must be moved > > + // after b for (a,b) to be fused. In other words, after > > + // fusing (a,b) we have y .. a/b .. x where y is an input > > + // to a/b and x is an output to a/b: x and y can no longer > > + // be legally fused. 
To prevent this condition, we must > > + // make sure that a child pair added to the Tree is not > > + // both an input and output of an already-selected pair. > > + > > + bool CanAdd = true; > > The use of CanAdd can be improved with early continues. > > > + for (DenseMap<ValuePair, size_t>::iterator C2 > > + = BestChilden.begin(), E2 = BestChilden.end(); > > + C2 != E2; ++C2) { > > + if (C2->first.first == C->first.first || > > + C2->first.first == C->first.second || > > + C2->first.second == C->first.first || > > + C2->first.second == C->first.second || > > + pairsConflict(C2->first, C->first, PairableInstUsers)) { > > + if (C2->second>= C->second) { > > + CanAdd = false; > > + break; > > + } > > + } > > This loop > > > + } > > + > > + // Even worse, this child could conflict with another node already > > + // selected for the Tree. If that is the case, ignore this child. > > + if (CanAdd) > > + for (DenseSet<ValuePair>::iterator T = PrunedTree.begin(), > > + E2 = PrunedTree.end(); T != E2; ++T) { > > + if (T->first == C->first.first || > > + T->first == C->first.second || > > + T->second == C->first.first || > > + T->second == C->first.second || > > + pairsConflict(*T, C->first, PairableInstUsers)) { > > + CanAdd = false; > > + break; > > + } > > + } > > and this loop > > > + > > + // And check the queue too... > > + if (CanAdd) > > + for (std::vector<ValuePairWithDepth>::iterator C2 = Q.begin(), > > + E2 = Q.end(); C2 != E2; ++C2) { > > + if (C2->first.first == C->first.first || > > + C2->first.first == C->first.second || > > + C2->first.second == C->first.first || > > + C2->first.second == C->first.second || > > + pairsConflict(C2->first, C->first, PairableInstUsers)) { > > + CanAdd = false; > > + break; > > + } > > and this loop > > > + } > > + > > + // Last but not least, check for a conflict with any of the > > + // already-chosen pairs. 
> > + if (CanAdd) > > + for (DenseMap<Value *, Value *>::iterator C2 > > + ChosenPairs.begin(), E2 = ChosenPairs.end(); > > + C2 != E2; ++C2) { > > + if (pairsConflict(*C2, C->first, PairableInstUsers)) { > > + CanAdd = false; > > + break; > > (and almost this one) > > > + } > > + } > > + > > + if (CanAdd) { > > + // This child can be added, but we may have chosen it in preference > > + // to an already-selected child. Check for this here, and if a > > + // conflict is found, then remove the previously-selected child > > + // before adding this one in its place. > > + for (DenseMap<ValuePair, size_t>::iterator C2 > > + = BestChilden.begin(), E2 = BestChilden.end(); > > + C2 != E2;) { > > + if (C2->first.first == C->first.first || > > + C2->first.first == C->first.second || > > + C2->first.second == C->first.first || > > + C2->first.second == C->first.second || > > + pairsConflict(C2->first, C->first, PairableInstUsers)) { > > + BestChilden.erase(C2++); > > + } else { > > + ++C2; > and this loop look entirely the same. Can they be moved into a helper > function? > > > > + } > > + } > > + > > + BestChilden.insert(ValuePairWithDepth(C->first, C->second)); > > + } > > + } > > + > > + for (DenseMap<ValuePair, size_t>::iterator C > > + = BestChilden.begin(), E2 = BestChilden.end(); > > + C != E2; ++C) { > > + size_t DepthF = getDepthFactor(C->first.first); > > + Q.push_back(ValuePairWithDepth(C->first, QTop.second+DepthF)); > > + } > > + } > > + } > > + > > + // This function finds the best tree of mututally-compatible connected > > + // pairs, given the choice of root pairs as an iterator range. 
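The four near-identical conflict-scanning loops pointed out above could be folded into one templated helper along these lines (a sketch over plain pair types, not the actual pass; in the real code the predicate would also consult pairsConflict(...)):

```cpp
#include <cassert>
#include <utility>
#include <vector>

typedef std::pair<int, int> ValuePair;

// Sketch: the membership/overlap test that each of the four loops repeats.
static bool overlapsOrConflicts(const ValuePair &P, const ValuePair &C) {
  return P.first == C.first || P.first == C.second ||
         P.second == C.first || P.second == C.second;
}

// One templated helper replaces the four copies: scan any container of
// pairs and report whether the candidate C conflicts with an element of it.
template <typename Range>
bool conflictsWithAny(const Range &Pairs, const ValuePair &C) {
  for (typename Range::const_iterator I = Pairs.begin(), E = Pairs.end();
       I != E; ++I)
    if (overlapsOrConflicts(*I, C))
      return true;
  return false;
}
```

Because the element types differ only in how the pair is reached (map key vs. set element), the template parameter absorbs the difference; each call site then reduces to `if (conflictsWithAny(..., C->first)) { CanAdd = false; ... }`.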
> > + void BBVectorize::findBestTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + DenseSet<ValuePair> &BestTree, size_t&BestMaxDepth, > > + size_t&BestEffSize, VPIteratorPair ChoiceRange) { > > + for (std::multimap<Value *, Value *>::iterator J = ChoiceRange.first; > > + J != ChoiceRange.second; ++J) { > > + > > + // Before going any further, make sure that this pair does not > > + // conflict with any already-selected pairs (see comment below > > + // near the Tree pruning for more details). > > + bool DoesConflict = false; > > + for (DenseMap<Value *, Value *>::iterator C = ChosenPairs.begin(), > > + E = ChosenPairs.end(); C != E; ++C) { > > + if (pairsConflict(*C, *J, PairableInstUsers)) { > > + DoesConflict = true; > > + break; > > + } > > + } > > + if (DoesConflict) continue; > > + > > + DenseMap<ValuePair, size_t> Tree; > > + buildInitialTreeFor(CandidatePairs, PairableInsts, ConnectedPairs, > > + PairableInstUsers, ChosenPairs, Tree, *J); > Fix indentation. (Align CandidatePairs and PairableInstUsers) > > > + > > + // Because we'll keep the child with the largest depth, the largest > > + // depth is still the same in the unpruned Tree. > > + size_t MaxDepth = Tree.lookup(*J); > > + > > + DEBUG(dbgs()<< "BBV: found Tree for pair {"<< *J->first<< > > + "<-> "<< *J->second<< "} of depth "<< > > + MaxDepth<< " and size "<< Tree.size()<< "\n"); > > + > > + // At this point the Tree has been constructed, but, may contain > > + // contradictory children (meaning that different children of > > + // some tree node may be attempting to fuse the same instruction). > > + // So now we walk the tree again, in the case of a conflict, > > + // keep only the child with the largest depth. To break a tie, > > + // favor the first child. 
> > + > > + DenseSet<ValuePair> PrunedTree; > > + pruneTreeFor(CandidatePairs, PairableInsts, ConnectedPairs, > > + PairableInstUsers, ChosenPairs, Tree, PrunedTree, *J); > > + > > + size_t EffSize = 0; > > + for (DenseSet<ValuePair>::iterator S = PrunedTree.begin(), > > + E = PrunedTree.end(); S != E; ++S) { > > + EffSize += getDepthFactor(S->first); > > + } > No braces. > > > + > > + DEBUG(dbgs()<< "BBV: found pruned Tree for pair {"<< *J->first<< > > + "<-> "<< *J->second<< "} of depth "<< > > + MaxDepth<< " and size "<< PrunedTree.size()<< > > + " (effective size: "<< EffSize<< ")\n"); > > + if (MaxDepth>= ReqChainDepth&& EffSize> BestEffSize) { > > + BestMaxDepth = MaxDepth; > > + BestEffSize = EffSize; > > + BestTree = PrunedTree; > > + } > > + } > > + } > > + > > + // Given the list of candidate pairs, this function selects those > > + // that will be fused into vector instructions. > > + void BBVectorize::choosePairs( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *>& ChosenPairs) { > > + for (std::vector<Value *>::iterator I = PairableInsts.begin(), > > + E = PairableInsts.end(); I != E; ++I) { > > + // The number of possible pairings for this variable: > > + size_t NumChoices = CandidatePairs.count(*I); > > + if (!NumChoices) continue; > > + > > + VPIteratorPair ChoiceRange = CandidatePairs.equal_range(*I); > > + > > + // The best pair to choose and its tree: > > + size_t BestMaxDepth = 0, BestEffSize = 0; > > + DenseSet<ValuePair> BestTree; > > + findBestTreeFor(CandidatePairs, PairableInsts, ConnectedPairs, > > + PairableInstUsers, ChosenPairs, BestTree, BestMaxDepth, > > + BestEffSize, ChoiceRange); > > Fix indentation. Align CandidatePairs, ... > > > + > > + // A tree has been chosen (or not) at this point. 
If no tree was > > + // chosen, then this instruction, I, cannot be paired (and is no longer > > + // considered). > > + > > + for (DenseSet<ValuePair>::iterator S = BestTree.begin(), > > + SE = BestTree.end(); S != SE; ++S) { > > + // Insert the members of this tree into the list of chosen pairs. > > + ChosenPairs.insert(ValuePair(S->first, S->second)); > > + DEBUG(dbgs()<< "BBV: selected pair: "<< *S->first<< "<-> "<< > > + *S->second<< "\n"); > > + > > + // Remove all candidate pairs that have values in the chosen tree. > > + for (std::multimap<Value *, Value *>::iterator K > > + CandidatePairs.begin(), > > + E2 = CandidatePairs.end(); K != E2;) { > > + if (K->first == S->first || K->second == S->first || > > + K->second == S->second || K->first == S->second) { > > + // Don't remove the actual pair chosen so that it can be used > > + // in subsequent tree selections. > > + if (!(K->first == S->first&& K->second == S->second)) { > > + CandidatePairs.erase(K++); > > + } else { > > + ++K; > > + } > No braces needed. > > > + } else { > > + ++K; > > + } > These are needed. > > > + } > > + } > > + } > > + > > + DEBUG(dbgs()<< "BBV: selected "<< ChosenPairs.size()<< " pairs.\n"); > > + } > > + > > + // Returns the value that is to be used as the pointer input to the vector > > + // instruction that fuses I with J. > > + Value *BBVectorize::getReplacementPointerInput(LLVMContext& Context, > > + Instruction *I, Instruction *J, unsigned o, > > + bool&FlipMemInputs) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + const TargetData&TD = *AA.getTargetData(); > > + > > + Value *IPtr, *JPtr; > > + unsigned IAlignment, JAlignment; > > + int64_t OffsetInElmts; > > + (void) getPairPtrInfo(I, J, TD, IPtr, JPtr, IAlignment, JAlignment, > > + OffsetInElmts); > Fix indentation. > > > + > > + // The pointer value is taken to be the one with the lowest offset. 
> > + Value *VPtr; > > + if (OffsetInElmts> 0) { > > + VPtr = IPtr; > > + } else { > > + FlipMemInputs = true; > > + VPtr = JPtr; > > + } > > + > > + Type *ArgType = cast<PointerType>(IPtr->getType())->getElementType(); > > + Type *VArgType = getVecType(ArgType); > > + Type *VArgPtrType = PointerType::get(VArgType, > > + cast<PointerType>(IPtr->getType())->getAddressSpace()); > > + return new BitCastInst(VPtr, VArgPtrType, > > + I->hasName() ? > > + I->getName() + ".v.i" + utostr(o) : "", > > + /* insert before */ FlipMemInputs ? J : I); > > + } > > + > > + // Returns the value that is to be used as the vector-shuffle mask to the > > + // vector instruction that fuses I with J. > > + Value *BBVectorize::getReplacementShuffleMask(LLVMContext& Context, > > + Instruction *I, Instruction *J, unsigned o) { > > + // This is the shuffle mask. We need to append the second > > + // mask to the first, and the numbers need to be adjusted. > > + > > + Type *ArgType = I->getType(); > > + Type *VArgType = getVecType(ArgType); > > + > > + // Get the total number of elements in the fused vector type. > > + // By definition, this must equal the number of elements in > > + // the final mask. > > + unsigned NumElem = cast<VectorType>(VArgType)->getNumElements(); > > + std::vector<Constant*> Mask(NumElem); > > + > > + Type *OpType = I->getOperand(0)->getType(); > > + unsigned numInElem = cast<VectorType>(OpType)->getNumElements(); > > + > > + // For the mask from the first pair... > > + for (unsigned v = 0; v< NumElem/2; ++v) { > > + int m = cast<ShuffleVectorInst>(I)->getMaskValue(v); > > + if (m< 0) { > > + Mask[v] = UndefValue::get(Type::getInt32Ty(Context)); > > + } else { > > + unsigned vv = v; > > + if (m>= (int) numInElem) { > > + vv += (int) numInElem; > > + } > No braces needed. > > > + > > + Mask[v] = ConstantInt::get(Type::getInt32Ty(Context), vv); > > + } > > + } > > + > > + // For the mask from the second pair... 
> > + for (unsigned v = 0; v< NumElem/2; ++v) { > > + int m = cast<ShuffleVectorInst>(J)->getMaskValue(v); > > + if (m< 0) { > > + Mask[v+NumElem/2] = UndefValue::get(Type::getInt32Ty(Context)); > > + } else { > > + unsigned vv = v + (int) numInElem; > > + if (m>= (int) numInElem) { > > + vv += (int) numInElem; > > + } > No braces needed. > > > + > > + Mask[v+NumElem/2] > > + ConstantInt::get(Type::getInt32Ty(Context), vv); > > + } > > + } > > Those two for loops look very similar. Any chance they can be extracted > into a helper function? > > > + > > + return ConstantVector::get(Mask); > > + } > > + > > + // Returns the value to be used as the specified operand of the vector > > + // instruction that fuses I with J. > > + Value *BBVectorize::getReplacementInput(LLVMContext& Context, Instruction *I, > > + Instruction *J, unsigned o, bool FlipMemInputs) { > > Indentation. > > > + Value *CV0 = ConstantInt::get(Type::getInt32Ty(Context), 0); > > + Value *CV1 = ConstantInt::get(Type::getInt32Ty(Context), 1); > > + > > + // Compute the fused vector type for this operand > > + Type *ArgType = I->getOperand(o)->getType(); > > + VectorType *VArgType = getVecType(ArgType); > > + > > + Instruction *L = I, *H = J; > > + if (FlipMemInputs) { > > + L = J; > > + H = I; > > + } > > + > > + if (ArgType->isVectorTy()) { > > + unsigned numElem = cast<VectorType>(VArgType)->getNumElements(); > > + std::vector<Constant*> Mask(numElem); > > + for (unsigned v = 0; v< numElem; ++v) { > > + Mask[v] = ConstantInt::get(Type::getInt32Ty(Context), v); > > + } > No braces. > > > + > > + Instruction *BV = new ShuffleVectorInst(L->getOperand(o), > > + H->getOperand(o), > > + ConstantVector::get(Mask), > > + I->hasName() ? > > + I->getName() + ".v.i" + > > + utostr(o) : ""); > > It may be better to not squeeze > > I->hasName() ? I->getName() + ".v.i" + utostr(o) : "" > > into this operand. 
> > > + BV->insertBefore(J); > > + return BV; > > + } > > + > > + // If these two inputs are the output of another vector instruction, > > + // then we should use that output directly. It might be necessary to > > + // permute it first. [When pairings are fused recursively, you can > > + // end up with cases where a large vector is decomposed into scalars > > + // using extractelement instructions, then built into size-2 > > + // vectors using insertelement and the into larger vectors using > > + // shuffles. InstCombine does not simplify all of these cases well, > > + // and so we make sure that shuffles are generated here when possible. > > + ExtractElementInst *LEE > > + = dyn_cast<ExtractElementInst>(L->getOperand(o)); > > + ExtractElementInst *HEE > > + = dyn_cast<ExtractElementInst>(H->getOperand(o)); > > + > > + if (LEE&& HEE&& > > + LEE->getOperand(0)->getType() == VArgType) { > > + unsigned LowIndx = cast<ConstantInt>(LEE->getOperand(1))->getZExtValue(); > > + unsigned HighIndx = cast<ConstantInt>(HEE->getOperand(1))->getZExtValue(); > > + if (LEE->getOperand(0) == HEE->getOperand(0)) { > > + if (LowIndx == 0&& HighIndx == 1) > > + return LEE->getOperand(0); > > + > > + std::vector<Constant*> Mask(2); > > + Mask[0] = ConstantInt::get(Type::getInt32Ty(Context), LowIndx); > > + Mask[1] = ConstantInt::get(Type::getInt32Ty(Context), HighIndx); > > + > > + Instruction *BV = new ShuffleVectorInst(LEE->getOperand(0), > > + UndefValue::get(VArgType), > > + ConstantVector::get(Mask), > > + I->hasName() ? > > + I->getName() + ".v.i" + > > + utostr(o) : ""); > > Do not squeeze > > I->hasName() ? I->getName() + ".v.i" + utostr(o) : "" > > into the operand. 
> > > + BV->insertBefore(J); > > + return BV; > > + } > > + > > + std::vector<Constant*> Mask(2); > > + HighIndx += VArgType->getNumElements(); > > + Mask[0] = ConstantInt::get(Type::getInt32Ty(Context), LowIndx); > > + Mask[1] = ConstantInt::get(Type::getInt32Ty(Context), HighIndx); > > + > > + Instruction *BV = new ShuffleVectorInst(LEE->getOperand(0), > > + HEE->getOperand(0), > > + ConstantVector::get(Mask), > > + I->hasName() ? > > + I->getName() + ".v.i" + > > + utostr(o) : ""); > No squeezing please. > > > + BV->insertBefore(J); > > + return BV; > > + } > > + > > + Instruction *BV1 = InsertElementInst::Create( > > + UndefValue::get(VArgType), > > + L->getOperand(o), CV0, > > + I->hasName() ? > > + I->getName() + ".v.i" + > > + utostr(o) + ".1" > > + : ""); > No squeezing. > > > + BV1->insertBefore(I); > > + Instruction *BV2 = InsertElementInst::Create(BV1, H->getOperand(o), > > + CV1, I->hasName() ? > > + I->getName() + > > + ".v.i" + utostr(o) > > + + ".2" : ""); > > No squeezing. > > Also, the stuff that you squeeze in looks very similar or even > identical. Do you really write the whole expression 4 times in this > function? Is there no better option? > > > + BV2->insertBefore(J); > > + return BV2; > > + } > > + > > + // This function creates an array of values that will be used as the inputs > > + // to the vector instruction that fuses I with J. > > + void BBVectorize::getReplacementInputsForPair(LLVMContext& Context, > > + Instruction *I, Instruction *J, > > + SmallVector<Value *, 3> &ReplacedOperands, > > + bool&FlipMemInputs) { > > + FlipMemInputs = false; > > + unsigned NumOperands = I->getNumOperands(); > > + > > + for (unsigned p = 0, o = NumOperands-1; p< NumOperands; ++p, --o) { > > + // Iterate backward so that we look at the store pointer > > + // first and know whether or not we need to flip the inputs. > > + > > + if (isa<LoadInst>(I) || (o == 1&& isa<StoreInst>(I))) { > > + // This is the pointer for a load/store instruction. 
> > +        ReplacedOperands[o] = getReplacementPointerInput(Context, I, J, o,
> > +                                                         FlipMemInputs);
> > +        continue;
> > +      } else if (isa<CallInst>(I) && o == NumOperands-1) {
> > +        Function *F = cast<CallInst>(I)->getCalledFunction();
> > +        unsigned IID = F->getIntrinsicID();
> > +        BasicBlock &BB = *cast<Instruction>(I)->getParent();
> > +
> > +        Module *M = BB.getParent()->getParent();
> > +        Type *ArgType = I->getType();
> > +        Type *VArgType = getVecType(ArgType);
> > +
> > +        // FIXME: is it safe to do this here?
> > +        ReplacedOperands[o] = Intrinsic::getDeclaration(M,
> > +          (Intrinsic::ID) IID, VArgType);
> > +        continue;
> > +      } else if (isa<ShuffleVectorInst>(I) && o == NumOperands-1) {
> > +        ReplacedOperands[o] = getReplacementShuffleMask(Context, I, J, o);
> > +        continue;
> > +      }
> > +
> > +      ReplacedOperands[o]
> > +        = getReplacementInput(Context, I, J, o, FlipMemInputs);
> > +    }
> > +  }
> > +
> > +  // This function creates two values that represent the outputs of the
> > +  // original I and J instructions. These are generally vector shuffles
> > +  // or extracts. In many cases, these will end up being unused and, thus,
> > +  // eliminated by later passes.
> > + void BBVectorize::replaceOutputsOfPair(LLVMContext& Context, Instruction *I, > > + Instruction *J, Instruction *K, > > + Instruction *&InsertionPt, > > + Instruction *&K1, Instruction *&K2, > > + bool&FlipMemInputs) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + > > + Value *CV0 = ConstantInt::get(Type::getInt32Ty(Context), 0); > > + Value *CV1 = ConstantInt::get(Type::getInt32Ty(Context), 1); > > + > > + if (isa<StoreInst>(I)) { > > + AA.replaceWithNewValue(I, K); > > + AA.replaceWithNewValue(J, K); > > + } else { > > + Type *IType = I->getType(); > > + Type *VType = getVecType(IType); > > + > > + if (IType->isVectorTy()) { > > + unsigned numElem = cast<VectorType>(IType)->getNumElements(); > > + std::vector<Constant*> Mask1(numElem), Mask2(numElem); > > + for (unsigned v = 0; v< numElem; ++v) { > > + Mask1[v] = ConstantInt::get(Type::getInt32Ty(Context), v); > > + Mask2[v] = ConstantInt::get(Type::getInt32Ty(Context), numElem+v); > > + } > > + > > + K1 = new ShuffleVectorInst(K, UndefValue::get(VType), > > + ConstantVector::get( > > + FlipMemInputs ? Mask2 : Mask1), > > + K->hasName() ? > > + K->getName() + ".v.r1" : ""); > > + K2 = new ShuffleVectorInst(K, UndefValue::get(VType), > > + ConstantVector::get( > > + FlipMemInputs ? Mask1 : Mask2), > > + K->hasName() ? > > + K->getName() + ".v.r2" : ""); > > + } else { > > + K1 = ExtractElementInst::Create(K, FlipMemInputs ? CV1 : CV0, > > + K->hasName() ? > > + K->getName() + ".v.r1" : ""); > > + K2 = ExtractElementInst::Create(K, FlipMemInputs ? CV0 : CV1, > > + K->hasName() ? > > + K->getName() + ".v.r2" : ""); > > + } > > + > > + K1->insertAfter(K); > > + K2->insertAfter(K1); > > + InsertionPt = K2; > > + } > > + } > > + > > + // Move all uses of the function I (including pairing-induced uses) after J. 
> > + void BBVectorize::moveUsesOfIAfterJ(BasicBlock&BB, > > + DenseMap<Value *, Value *>& ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet, > > + Instruction *&InsertionPt, > > + Instruction *I, Instruction *J) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + > > + // Skip to the first instruction past I. > > + BasicBlock::iterator L = BB.begin(); > > + for (; cast<Instruction>(L) != cast<Instruction>(I); ++L); > Why don't you put the initialization into the 'for'? I do not see any > need for the cast instructions. > > > + ++L; > > + > > + DenseSet<Value *> Users; > > + AliasSetTracker WriteSet(AA); > > + for (BasicBlock::iterator E = BB.end(); > > + cast<Instruction>(L) != cast<Instruction>(J);) { > Why do you need to cast here? > > > + bool UsesI; > > + trackUsesOfI(Users, WriteSet, I, L, UsesI, true,&LoadMoveSet); > > + > > + if (UsesI) { > > + // Make sure that the order of selected pairs is not reversed. > > + // So if we come to a user instruction that has a pair, > > + // then mark the pair's tree as used as well. > > + DenseMap<Value *, Value *>::iterator LP = ChosenPairs.find(L); > > + if (LP != ChosenPairs.end()) > > + Users.insert(LP->second); > > + > > + // Move this instruction > > + Instruction *InstToMove = L; ++L; > > + > > + DEBUG(dbgs()<< "BBV: moving: "<< *InstToMove<< > > + " to after "<< *InsertionPt<< "\n"); > > + InstToMove->removeFromParent(); > > + InstToMove->insertAfter(InsertionPt); > > + InsertionPt = InstToMove; > > + } else { > > + ++L; > > + } > > + } > > + } > > + > > + // Collect all load instruction that are in the move set of a given pair. > > + // These loads depend on the first instruction, I, and so need to be moved > > + // after J when the pair is fused. 
> > + void BBVectorize::collectPairLoadMoveSet(BasicBlock&BB, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet, > > + Instruction *I, Instruction *J) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + > > + // Skip to the first instruction past I. > > + BasicBlock::iterator L = BB.begin(); > > + for (; cast<Instruction>(L) != cast<Instruction>(I); ++L); > Why not initialize in the 'for' loop. Why cast here? > > > + ++L; > > + > > + DenseSet<Value *> Users; > > + AliasSetTracker WriteSet(AA); > > + for (BasicBlock::iterator E = BB.end(); > This 'E' is never used in the loop. > > > + cast<Instruction>(L) != cast<Instruction>(J); ++L) { > Why cast here? > > > + bool UsesI; > > + trackUsesOfI(Users, WriteSet, I, L, UsesI); > > + > > + if (UsesI) { > > + // Make sure that the order of selected pairs is not reversed. > > + // So if we come to a user instruction that has a pair, > > + // then mark the pair's tree as used as well. > > + DenseMap<Value *, Value *>::iterator LP = ChosenPairs.find(L); > > + if (LP != ChosenPairs.end()) > > + Users.insert(LP->second); > > + > > + if (L->mayReadFromMemory()) > > + LoadMoveSet.insert(ValuePair(L, I)); > > + } > > + } > > + } > > + > > + // In cases where both load/stores and the computation of their pointers > > + // are chosen for vectorization, we can end up in a situation where the > > + // aliasing analysis starts returning different query results as the > > + // process of fusing instruction pairs continues. Because the algorithm > > + // relies on finding the same use trees here as were found earlier, we'll > > + // need to precompute the necessary aliasing information here and then > > + // manually update it during the fusion process. 
> > + void BBVectorize::collectLoadMoveSet(BasicBlock&BB, > > + std::vector<Value *> &PairableInsts, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet) { > > + for (std::vector<Value *>::iterator PI = PairableInsts.begin(), > > + PIE = PairableInsts.end(); PI != PIE; ++PI) { > > + DenseMap<Value *, Value *>::iterator P = ChosenPairs.find(*PI); > > + if (P == ChosenPairs.end()) continue; > > + > > + if (getDepthFactor(P->first) == 0) { > > + continue; > > + } > No braces. > > > + > > + Instruction *I = cast<Instruction>(P->first), > > + *J = cast<Instruction>(P->second); > > + > > + collectPairLoadMoveSet(BB, ChosenPairs, LoadMoveSet, I, J); > > + } > > + } > > + > > + // This function fuses the chosen instruction pairs into vector instructions, > > + // taking care preserve any needed scalar outputs and, then, it reorders the > > + // remaining instructions as needed (users of the first member of the pair > > + // need to be moved to after the location of the second member of the pair > > + // because the vector instruction is inserted in the location of the pair's > > + // second member). 
> > +  void BBVectorize::fuseChosenPairs(BasicBlock &BB,
> > +                     std::vector<Value *> &PairableInsts,
> > +                     DenseMap<Value *, Value *> &ChosenPairs) {
> > +    LLVMContext &Context = BB.getContext();
> > +    AliasAnalysis &AA = getAnalysis<AliasAnalysis>();
> > +    ScalarEvolution &SE = getAnalysis<ScalarEvolution>();
> > +
> > +    std::multimap<Value *, Value *> LoadMoveSet;
> > +    collectLoadMoveSet(BB, PairableInsts, ChosenPairs, LoadMoveSet);
> > +
> > +    DEBUG(dbgs() << "BBV: initial: \n" << BB << "\n");
> > +
> > +    for (std::vector<Value *>::iterator PI = PairableInsts.begin(),
> > +         PIE = PairableInsts.end(); PI != PIE; ++PI) {
> > +      DenseMap<Value *, Value *>::iterator P = ChosenPairs.find(*PI);
> > +      if (P == ChosenPairs.end()) continue;
> > +
> > +      if (getDepthFactor(P->first) == 0) {
> > +        // These instructions are not really fused, but are tracked as though
> > +        // they are. Any case in which it would be interesting to fuse them
> > +        // will be taken care of by InstCombine.
> > +        --NumFusedOps;
> > +        continue;
> > +      }
> > +
> > +      Instruction *I = cast<Instruction>(P->first),
> > +        *J = cast<Instruction>(P->second);

Why do we need to cast here? It seems ChosenPairs should actually be of
type DenseMap<Instruction *, Instruction *>. Can PairableInsts also be
of type std::vector<Instruction*>?

> > +
> > +      DEBUG(dbgs() << "BBV: fusing: " << *I <<
> > +            " <-> " << *J << "\n");
> > +
> > +      bool FlipMemInputs;
> > +      unsigned NumOperands = I->getNumOperands();
> > +      SmallVector<Value *, 3> ReplacedOperands(NumOperands);
> > +      getReplacementInputsForPair(Context, I, J, ReplacedOperands,
> > +                                  FlipMemInputs);
> > +
> > +      // Make a copy of the original operation, change its type to the vector
> > +      // type and replace its operands with the vector operands.
> > +      Instruction *K = I->clone();
> > +      if (I->hasName()) K->takeName(I);
> > +
> > +      if (!isa<StoreInst>(K)) {
> > +        K->mutateType(getVecType(I->getType()));
> > +      }

No braces.
> > > + > > + for (unsigned o = 0; o< NumOperands; ++o) { > > + K->setOperand(o, ReplacedOperands[o]); > > + } > No braces. > > > + > > + // If we've flipped the memory inputs, make sure that we take the correct > > + // alignment. > > + if (FlipMemInputs) { > > + if (isa<StoreInst>(K)) > > + cast<StoreInst>(K)->setAlignment(cast<StoreInst>(J)->getAlignment()); > > + else > > + cast<LoadInst>(K)->setAlignment(cast<LoadInst>(J)->getAlignment()); > > + } > > + > > + K->insertAfter(J); > > + > > + // Instruction insertion point: > > + Instruction *InsertionPt = K; > > + Instruction *K1 = 0, *K2 = 0; > > + replaceOutputsOfPair(Context, I, J, K, InsertionPt, K1, K2, > > + FlipMemInputs); > > + > > + // The use tree of the first original instruction must be moved to after > > + // the location of the second instruction. The entire use tree of the > > + // first instruction is disjoint from the input tree of the second > > + // (by definition), and so commutes with it. > > + > > + moveUsesOfIAfterJ(BB, ChosenPairs, LoadMoveSet, InsertionPt, > > + cast<Instruction>(I), cast<Instruction>(J)); > > + > > + if (!isa<StoreInst>(I)) { > > + I->replaceAllUsesWith(K1); > > + J->replaceAllUsesWith(K2); > > + AA.replaceWithNewValue(I, K1); > > + AA.replaceWithNewValue(J, K2); > > + } > > + > > + // Instructions that may read from memory may be in the load move set. > > + // Once an instruction is fused, we no longer need its move set, and so > > + // the values of the map never need to be updated. However, when a load > > + // is fused, we need to merge the entries from both instructions in the > > + // pair in case those instructions were in the move set of some other > > + // yet-to-be-fused pair. The loads in question are the keys of the map. 
> > + if (I->mayReadFromMemory()) { > > + std::vector<ValuePair> NewSetMembers; > > + VPIteratorPair IPairRange = LoadMoveSet.equal_range(I); > > + VPIteratorPair JPairRange = LoadMoveSet.equal_range(J); > > + for (std::multimap<Value *, Value *>::iterator N = IPairRange.first; > > + N != IPairRange.second; ++N) { > Indentation. > > > + NewSetMembers.push_back(ValuePair(K, N->second)); > > + } > No braces. > > > + for (std::multimap<Value *, Value *>::iterator N = JPairRange.first; > > + N != JPairRange.second; ++N) { > Indentation. > > > + NewSetMembers.push_back(ValuePair(K, N->second)); > > + } > No braces. > > > + for (std::vector<ValuePair>::iterator A = NewSetMembers.begin(), > > + AE = NewSetMembers.end(); A != AE; ++A) { > Indentation. > > > + LoadMoveSet.insert(*A); > > + } > No braces. > > > + } > > + > > + SE.forgetValue(I); > > + SE.forgetValue(J); > > + I->eraseFromParent(); > > + J->eraseFromParent(); > > + } > > + > > + DEBUG(dbgs()<< "BBV: final: \n"<< BB<< "\n"); > > + } > > +} > > > --- /dev/null > > +++ b/test/Transforms/BBVectorize/ld1.ll > > @@ -0,0 +1,28 @@ > > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" > > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s > > + > > +define double @test1(double* %a, double* %b, double* %c) nounwind uwtable readonly { > > +entry: > > + %i0 = load double* %a, align 8 > > + %i1 = load double* %b, align 8 > > + %mul = fmul double %i0, %i1 > > + %i2 = load double* %c, align 8 > > + %add = fadd double %mul, %i2 > > + %arrayidx3 = getelementptr inbounds double* %a, i64 1 > > + %i3 = load double* %arrayidx3, align 8 > > + %arrayidx4 = getelementptr inbounds double* %b, i64 1 > > + %i4 = load double* %arrayidx4, align 8 > > + %mul5 = fmul double %i3, %i4 > > + %arrayidx6 = getelementptr inbounds double* %c, i64 1 > > + %i5 = load double* %arrayidx6, align 8 > > + 
%add7 = fadd double %mul5, %i5 > > + %mul9 = fmul double %add, %i1 > > + %add11 = fadd double %mul9, %i2 > > + %mul13 = fmul double %add7, %i4 > > + %add15 = fadd double %mul13, %i5 > > + %mul16 = fmul double %add11, %add15 > > + ret double %mul16 > > +; CHECK: @test1 > > +; CHECK:<2 x double> > > For me this CHECK looks very short. Should all instructions be > vectorized or is it sufficient if one is vectorized. Also, it would be > nice if I could understand the code that is generated just by looking > at the test case. Any reason not to CHECK: for the entire sequence of > expected instructions (This applies to all test cases). > > > > +} > > + > > diff --git a/test/Transforms/BBVectorize/loop1.ll b/test/Transforms/BBVectorize/loop1.ll > > new file mode 100644 > > index 0000000..4c60b42 > > --- /dev/null > > +++ b/test/Transforms/BBVectorize/loop1.ll > > @@ -0,0 +1,39 @@ > > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" > > +target triple = "x86_64-unknown-linux-gnu" > > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s > > +; RUN: opt< %s -loop-unroll -unroll-allow-partial -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s > Why are you checking here with -loop-unroll? Do you expect something > special. 
In case you do it may make sense to use another -check-prefix > > > --- /dev/null > > +++ b/test/Transforms/BBVectorize/req-depth.ll > > @@ -0,0 +1,17 @@ > > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128" > > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth 3 -S | FileCheck %s -check-prefix=CHECK-RD3 > > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth 2 -S | FileCheck %s -check-prefix=CHECK-RD2 > > + > > +define double @test1(double %A1, double %A2, double %B1, double %B2) { > > + %X1 = fsub double %A1, %B1 > > + %X2 = fsub double %A2, %B2 > > + %Y1 = fmul double %X1, %A1 > > + %Y2 = fmul double %X2, %A2 > > + %R = fmul double %Y1, %Y2 > > + ret double %R > > +; CHECK-RD3: @test1 > > +; CHECK-RD2: @test1 > > +; CHECK-RD3-NOT:<2 x double> > > +; CHECK-RD2:<2 x double> > What about sorting them by check prefix? > > > +++ b/test/Transforms/BBVectorize/simple-ldstr.ll > > @@ -0,0 +1,69 @@ > > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" > > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s > > + > > +; Simple 3-pair chain with loads and stores > > +define void @test1(double* %a, double* %b, double* %c) nounwind uwtable readonly { > > +entry: > > + %i0 = load double* %a, align 8 > > + %i1 = load double* %b, align 8 > > + %mul = fmul double %i0, %i1 > > + %arrayidx3 = getelementptr inbounds double* %a, i64 1 > > + %i3 = load double* %arrayidx3, align 8 > > + %arrayidx4 = getelementptr inbounds double* %b, i64 1 > > + %i4 = load double* %arrayidx4, align 8 > > + %mul5 = fmul double %i3, %i4 > > + store double %mul, double* %c, align 8 > > + %arrayidx5 = getelementptr inbounds double* %c, i64 1 > > + store double %mul5, double* %arrayidx5, align 8 > > + ret void > > +; CHECK: @test1 > > +; 
CHECK:<2 x double> > > Especially here the entire vectorization sequence is interesting as I > would love to see the alignment you produce for load/store instructions. > > > +; Basic depth-3 chain (last pair permuted) > > +define double @test2(double %A1, double %A2, double %B1, double %B2) { > > + %X1 = fsub double %A1, %B1 > > + %X2 = fsub double %A2, %B2 > > + %Y1 = fmul double %X1, %A1 > > + %Y2 = fmul double %X2, %A2 > > + %Z1 = fadd double %Y2, %B1 > > + %Z2 = fadd double %Y1, %B2 > > + %R = fmul double %Z1, %Z2 > > + ret double %R > > +; CHECK: @test2 > > +; CHECK:<2 x double> > > Again, the entire sequence would show how you express a permutation. >-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory
Tobias Grosser
2011-Dec-02 18:13 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On 12/02/2011 06:32 PM, Hal Finkel wrote:> On Fri, 2011-12-02 at 17:07 +0100, Tobias Grosser wrote: >> On 11/23/2011 05:52 PM, Hal Finkel wrote: >>> On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote: >>>>> On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote: >>>>>> > Tobias, >>>>>> > >>>>>> > I've attached an updated patch. It contains a few bug fixes and many >>>>>> > (refactoring and coding-convention) changes inspired by your comments. >>>>>> > >>>>>> > I'm currently trying to fix the bug responsible for causing a compile >>>>>> > failure when compiling >>>>>> > test-suite/MultiSource/Applications/obsequi/toggle_move.c; after the >>>>>> > pass begins to fuse instructions in a basic block in this file, the >>>>>> > aliasing analysis starts producing different (more pessimistic) query >>>>>> > answers. I've never before seen it do this, and I'm not sure why it is >>>>>> > happening. Also odd, at the same time, the numeric labels that are >>>>>> > assigned to otherwise unnamed instructions, change. I don't think I've >>>>>> > seen this happen either (they generally seem to stay fixed once >>>>>> > assigned). I don't know if these two things are related, but if anyone >>>>>> > can provide some insight, I'd appreciate it. >>>>> >>>>> I think that I see what is happening in this case (please someone tell >>>>> me if this does not make sense). In the problematic basic block, there >>>>> are some loads and stores that are independent. The default aliasing >>>>> analysis can tell that these loads and stores don't reference the same >>>>> memory region. Then, some of the inputs to the getelementptr >>>>> instructions used for these loads and stores are fused by the >>>>> vectorization. After this happens, the aliasing analysis loses its >>>>> ability to tell that the loads and stores that make use of those >>>>> vector-calculated indices are independent. >>> The attached patch works around this issue. Please review. 
>>
>> Hi Hal,
>>
>> thanks for reworking the patch several times. This time the patch looks
>> a lot better, with nicely structured helper functions and a much more
>> consistent style. It is a lot easier to read and review.
>>
>> Unfortunately I had only two hours today to look into this patch. I
>> added a lot of remarks, but especially in the tree-pruning part I ran a
>> little out of time, so the remarks there are mostly obvious style
>> remarks. I hope to have a little more time over the weekend to look into
>> this again, but I wanted to get the first chunk of comments out today.
>
> Thanks! I appreciate you putting in that kind of time. [Also, there is a
> more-recent version than the one you're looking at,
> llvm_bb_vectorize-20111122.diff; make sure that you look at that one.]

I looked at the one that was in the email I replied to. This was
'llvm_bb_vectorize-20111122.diff'. I could not see any newer one; did I
overlook something?

>> Especially with my style remarks, do not feel forced to follow all of my
>> comments. Style is highly personal, and other people may suggest other
>> fixes. In case you believe your style is better, just check with the
>> LLVM coding standards or other parts of the LLVM source code and try to
>> match them as closely as possible.
>
> I have tried to do that, but other eyes are always useful.

You were very successful. The code already looks very nice.

>> From an algorithmic point of view, I am not yet completely convinced the
>> algorithm itself is not worse than O(N^2). I really want to understand
>> this part better and plan to look into problematic behaviour in
>> functions not directly visible in the core routines.
>
> Here are my thoughts:
>
> getCandidatePairs - This is O(N^2) in the number of instructions in BB.
> [This assumes that all of the external helper functions, especially
> queries to the alias and scalar-evolution analyses, don't make the
> complexity worse.]

Yes.
Alias-analysis and scalar-evolution queries may not be fast enough
(complexity-wise). Dan Gohman proposed to check them, and until today I
did not find the time to verify whether they yield worse complexity.

> computeConnectedPairs - There are O(N^2) candidate pairs in the worst
> case, and this loops over all pairs of uses of the instructions in each
> candidate pair. So if every instruction has U uses, then this has
> complexity O(N^2 U^2). In practice, U is small, and so this is just
> another O(N^2). Unfortunately, that is only true if determining whether
> a given use pair is also a candidate pair is sub-linear. It is not in
> this case (it is linear, because the multimap only indexes the first
> key, not both), but that could be changed by using a different (or
> additional) data structure to hold the candidate pairs. I should
> probably add a candidate-pair DenseSet to clean this up; then it will be
> O(N^2). [Note that if an instruction has many uses, then it cannot be in
> many candidate pairs.]
>
> buildDepMap - This records all uses of each pairable instruction; as
> implemented, this is also O(N^2).
>
> choosePairs - As you point out, this is the most complicated to figure
> out. The reason is that it deals with connected pairs and that, as pairs
> are selected, other pairs are dropped from contention. Fundamentally, it
> is considering pair-to-pair interactions, and so that seems like O(N^4).
> However, it considers only the first possible tree that can be built
> from connected pairs for each remaining candidate pair. This is
> important for the following reason: if a block of N instructions had the
> maximum number of candidate pairs, N*(N-1)/2, then it must be the case
> that all instructions are independent, and, thus, none of them are
> connected. Thus, in this regime, the algorithm is O(N^2). On the other
> extreme, if the block has N candidate pairs, then they could all be
> connected, but then there are only N of them.
> In this extreme, the algorithm is also O(N^2). Because both extremes
> are O(N^2), I've generally figured that this is also O(N^2), but I may
> want to do more math for the paper I'm writing.
>
> fuseChosenPairs - This is linear in the number of selected pairs, and
> that number scales with the number of instructions in the BB. In the
> worst case, after each fusion, a number of instructions need to be moved
> (up to almost the total number in the block). Thus, this is also O(N^2).

Thanks for this more detailed analysis. It definitely helps me
understand your complexity calculation.

>> Also, you chose a very high bound (4000) to limit the quadratic
>> behaviour, because you stated that otherwise optimizations will be
>> missed. Can you send an example where a value smaller than 4000 would
>> miss optimizations?
>
> I may be able to find an example, but I don't have one immediately
> available. I had run the test suite with various settings, and chose the
> default value (4000) based on the approximate point where the average
> performance stopped increasing.
> Thus, I'm surmising that choosing a smaller value causes useful
> optimization opportunities to be missed.

I was just surprised that 4000 is that value, because I don't believe
-unroll-allow-partial would create basic blocks containing so many
instructions.

>>> +#define BBV_NAME "bb-vectorize"
>>> +#define DEBUG_TYPE BBV_NAME
>>> +#include "llvm/Constants.h"
>>> +#include "llvm/DerivedTypes.h"
>>> +#include "llvm/Function.h"
>>> +#include "llvm/Instructions.h"
>>> +#include "llvm/IntrinsicInst.h"
>>> +#include "llvm/Intrinsics.h"
>>> +#include "llvm/LLVMContext.h"
>>> +#include "llvm/Pass.h"
>>> +#include "llvm/Type.h"
>>> +#include "llvm/ADT/DenseMap.h"
>>> +#include "llvm/ADT/DenseSet.h"
>>> +#include "llvm/ADT/SmallVector.h"
>>> +#include "llvm/ADT/Statistic.h"
>>> +#include "llvm/ADT/StringExtras.h"
>>> +#include "llvm/Analysis/AliasAnalysis.h"
>>> +#include "llvm/Analysis/AliasSetTracker.h"
>>> +#include "llvm/Analysis/ValueTracking.h"
>>
>> This comes two lines later, if you want to sort it alphabetically.
>
> What do you mean? I had run the include statements through sort at some
> point, so I think they're all sorted.

#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/AliasSetTracker.h"
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"

I removed the last part of the list, so you don't see the problem. All
except 'ValueTracking' are sorted.

>>> + // Returns the weight associated with the provided value. A chain of
>>> + // candidate pairs has a length given by the sum of the weights of its
>>> + // members (one weight per pair; the weight of each member of the pair
>>> + // is assumed to be the same). This length is then compared to the
>>> + // chain-length threshold to determine if a given chain is significant
>>> + // enough to be vectorized. The length is also used in comparing
>>> + // candidate chains where longer chains are considered to be better.
>>> + // Note: when this function returns 0, the resulting instructions are
>>> + // not actually fused.
>>> + static inline size_t getDepthFactor(Value *V) {
>>> +   if (isa<InsertElementInst>(V) || isa<ExtractElementInst>(V)) {
>>
>> Why do InsertElementInst and ExtractElementInst instructions have a
>> weight of zero?
>
> Two reasons: First, they cannot be usefully fused. Second, because the
> pass generates a lot of these, they can confuse the simple metric used
> to compare the trees in the next iteration. Thus, giving them a weight
> of zero allows the pass to essentially ignore them in subsequent
> iterations when looking for vectorization opportunities while still
> tracking dependency chains that flow through those instructions.

OK. Please put this in a comment. I think it is useful to know.

>>> + // Returns true if the provided CallInst represents an intrinsic that can
>>> + // be vectorized.
>>> + bool isVectorizableIntrinsic(CallInst* I) {
>>> +   bool VIntr = false;
>>> +   if (Function *F = I->getCalledFunction()) {
>>> +     if (unsigned IID = F->getIntrinsicID()) {
>>
>> Use an early exit:
>>
>>   Function *F = I->getCalledFunction();
>>
>>   if (!F)
>>     return false;
>>
>>   unsigned IID = F->getIntrinsicID();
>>
>>   if (!IID)
>>     return false;
>>
>>   switch (IID) {
>>   case Intrinsic::sqrt:
>>   case Intrinsic::powi:
>>   case Intrinsic::sin:
>>   case Intrinsic::cos:
>>   case Intrinsic::log:
>>   case Intrinsic::log2:
>>   case Intrinsic::log10:
>>   case Intrinsic::exp:
>>   case Intrinsic::exp2:
>>   case Intrinsic::pow:
>>     return !NoMath;
>>   case Intrinsic::fma:
>>     return !NoFMA;
>>   }
>>
>> Is this switch exhaustive, or are you missing a default case?
>
> You're not looking at the most-recent version. You had (correctly)
> commented on this last time, and I had corrected it.
In your mail from today you said: " I know that I've sent a large number of revisions, but I think that the version that I posted on 11/23 should be reviewed." Did I get something wrong? Cheers Tobi
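To make the zero-weight discussion above concrete, here is a minimal, self-contained sketch (plain C++ with no LLVM dependencies; `Node`, `NodeKind`, and the function names are hypothetical stand-ins for `llvm::Value` and the pass's `getDepthFactor` logic, not the actual patch code): chain length is the sum of per-node weights, so insert/extract-element nodes are traversed but contribute nothing.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for instruction kinds; the real pass computes
// this over llvm::Value* in getDepthFactor().
enum NodeKind { InsertElement, ExtractElement, BinaryOp };

struct Node {
  NodeKind Kind;
  const Node *Dep; // the single dependency this node flows through
};

// Insert/extract get weight 0: they cannot be usefully fused, and because
// the pass generates many of them, counting them would skew the simple
// chain-length metric used on subsequent iterations.
static std::size_t depthFactor(const Node &N) {
  if (N.Kind == InsertElement || N.Kind == ExtractElement)
    return 0;
  return 1;
}

// Chain length is the sum of weights along the dependency chain, so
// zero-weight nodes are traversed (the chain is not broken) but add
// nothing to its length.
static std::size_t chainLength(const Node *N) {
  std::size_t Len = 0;
  for (; N; N = N->Dep)
    Len += depthFactor(*N);
  return Len;
}
```

With this weighting, a five-node chain that passes through one insert and one extract still has length 3, so the inserts/extracts produced by an earlier fusion iteration do not inflate the metric when the pass looks for new vectorization opportunities.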
Hal Finkel
2011-Dec-14 00:25 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
Tobias, I've attached an updated copy of the patch. I believe that I accounted for all of your suggestions except for: 1. You said that I could make AA a member of the class and initialize it for each basic block. I suppose that I'd need to make it a pointer, but more generally, what is the thread-safety model that I should have in mind for the analysis passes (will multiple threads always use distinct instances of the passes)? Furthermore, if I am going to make AA a class variable that is re-initialized on every call to runOnBasicBlock, then I should probably do the same with all of the top-level maps, etc. Do you agree? 2. I have not (yet) changed the types of the maps from holding Value* to Instruction*. Doing so would eliminate a few casts, but would cause even more casts (and maybe checks) in computePairsConnectedTo. Even so, it might be worthwhile to make the change for clarity (conceptually, those maps do hold only pointers to instructions). Please look it over and let me know what you think. Thanks again, Hal On Fri, 2011-12-02 at 17:07 +0100, Tobias Grosser wrote: > On 11/23/2011 05:52 PM, Hal Finkel wrote: > > On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote: > >> > On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote: > >>> > > Tobias, > >>> > > > >>> > > I've attached an updated patch. It contains a few bug fixes and many > >>> > > (refactoring and coding-convention) changes inspired by your comments. > >>> > > > >>> > > I'm currently trying to fix the bug responsible for causing a compile > >>> > > failure when compiling > >>> > > test-suite/MultiSource/Applications/obsequi/toggle_move.c; after the > >>> > > pass begins to fuse instructions in a basic block in this file, the > >>> > > aliasing analysis starts producing different (more pessimistic) query > >>> > > answers. I've never before seen it do this, and I'm not sure why it is > >>> > > happening.
Also odd, at the same time, the numeric labels that are > >>> > > assigned to otherwise unnamed instructions, change. I don't think I've > >>> > > seen this happen either (they generally seem to stay fixed once > >>> > > assigned). I don't know if these two things are related, but if anyone > >>> > > can provide some insight, I'd appreciate it. > >> > > >> > I think that I see what is happening in this case (please someone tell > >> > me if this does not make sense). In the problematic basic block, there > >> > are some loads and stores that are independent. The default aliasing > >> > analysis can tell that these loads and stores don't reference the same > >> > memory region. Then, some of the inputs to the getelementptr > >> > instructions used for these loads and stores are fused by the > >> > vectorization. After this happens, the aliasing analysis loses its > >> > ability to tell that the loads and stores that make use of those > >> > vector-calculated indices are independent. > > The attached patch works around this issue. Please review. > > Hi Hal, > > thanks for reworking the patch several times. This time the patch looks > a lot better, with nicely structured helper functions and a lot more > consistent style. It is a lot better to read and review. > > Unfortunately I had only two hours today to look into this patch. I > added a lot of remarks, but especially in the tree pruning part I ran a > little short of time, so the remarks are mostly obvious style > remarks. I hope I have a little bit more time over the weekend to look > into this again, but I wanted to get out the first chunk of comments today. > > Especially with my style remarks, do not feel forced to follow all of my > comments. Style stuff is highly personal and often other people may > suggest other fixes. In case you believe your style is better just check > with the LLVM coding reference or other parts of the LLVM source code > and try to match this as well as possible.
> > On the algorithmic point of view I am not yet completely convinced the > algorithm itself is not worse than N^2. I really want to understand this > part better and plan to look into problematic behaviour in functions not > directly visible in the core routines. > > Also, you choose a very high bound (4000) to limit the quadratic > behaviour, because you stated otherwise optimizations will be missed. > Can you send an example where a value smaller than 4000 would miss > optimizations? > > Cheers > Tobi > > > diff --git a/lib/Transforms/Makefile b/lib/Transforms/Makefile > > index e527be2..8b1df92 100644 > > --- a/lib/Transforms/Makefile > > +++ b/lib/Transforms/Makefile > > @@ -8,7 +8,7 @@ > > ##===----------------------------------------------------------------------===## > > > > LEVEL = ../.. > > -PARALLEL_DIRS = Utils Instrumentation Scalar InstCombine IPO Hello > > +PARALLEL_DIRS = Utils Instrumentation Scalar InstCombine IPO Vectorize Hello > > > > include $(LEVEL)/Makefile.config > > > > diff --git a/lib/Transforms/Vectorize/BBVectorize.cpp b/lib/Transforms/Vectorize/BBVectorize.cpp > > new file mode 100644 > > index 0000000..5505502 > > --- /dev/null > > +++ b/lib/Transforms/Vectorize/BBVectorize.cpp > > @@ -0,0 +1,1644 @@ > > +//===- BBVectorize.cpp - A Basic-Block Vectorizer -------------------------===// > > +// > > +// The LLVM Compiler Infrastructure > > +// > > +// This file is distributed under the University of Illinois Open Source > > +// License. See LICENSE.TXT for details. > > +// > > +//===----------------------------------------------------------------------===// > > +// > > +// This file implements a basic-block vectorization pass. The algorithm was > > +// inspired by that used by the Vienna MAP Vectorizor by Franchetti and Kral, > > +// et al. It works by looking for chains of pairable operations and then > > +// pairing them. 
> > +// > > +//===----------------------------------------------------------------------===// > > + > > +#define BBV_NAME "bb-vectorize" > > +#define DEBUG_TYPE BBV_NAME > > +#include "llvm/Constants.h" > > +#include "llvm/DerivedTypes.h" > > +#include "llvm/Function.h" > > +#include "llvm/Instructions.h" > > +#include "llvm/IntrinsicInst.h" > > +#include "llvm/Intrinsics.h" > > +#include "llvm/LLVMContext.h" > > +#include "llvm/Pass.h" > > +#include "llvm/Type.h" > > +#include "llvm/ADT/DenseMap.h" > > +#include "llvm/ADT/DenseSet.h" > > +#include "llvm/ADT/SmallVector.h" > > +#include "llvm/ADT/Statistic.h" > > +#include "llvm/ADT/StringExtras.h" > > +#include "llvm/Analysis/AliasAnalysis.h" > > +#include "llvm/Analysis/AliasSetTracker.h" > > +#include "llvm/Analysis/ValueTracking.h" > This comes two lines later, if you want to sort it alphabetically. > > > +namespace { > > + struct BBVectorize : public BasicBlockPass { > > + static char ID; // Pass identification, replacement for typeid > > + BBVectorize() : BasicBlockPass(ID) { > > + initializeBBVectorizePass(*PassRegistry::getPassRegistry()); > > + } > > + > > + typedef std::pair<Value *, Value *> ValuePair; > > + typedef std::pair<ValuePair, size_t> ValuePairWithDepth; > > + typedef std::pair<ValuePair, ValuePair> VPPair; // A ValuePair pair > > + typedef std::pair<std::multimap<Value *, Value *>::iterator, > > + std::multimap<Value *, Value *>::iterator> VPIteratorPair; > > + typedef std::pair<std::multimap<ValuePair, ValuePair>::iterator, > > + std::multimap<ValuePair, ValuePair>::iterator> > > + VPPIteratorPair; > > + > > + // FIXME: const correct?
> > + > > + bool vectorizePairs(BasicBlock&BB); > > + > > + void getCandidatePairs(BasicBlock&BB, > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts); > > + > > + void computeConnectedPairs(std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs); > > + > > + void buildDepMap(BasicBlock&BB, > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + DenseSet<ValuePair> &PairableInstUsers); > > + > > + void choosePairs(std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *>& ChosenPairs); > > + > > + void fuseChosenPairs(BasicBlock&BB, > > + std::vector<Value *> &PairableInsts, > > + DenseMap<Value *, Value *>& ChosenPairs); > > + > > + bool isInstVectorizable(Instruction *I, bool&IsSimpleLoadStore); > > + > > + bool areInstsCompatible(Instruction *I, Instruction *J, > > + bool IsSimpleLoadStore); > > + > > + void trackUsesOfI(DenseSet<Value *> &Users, > > + AliasSetTracker&WriteSet, Instruction *I, > > + Instruction *J, bool&UsesI, bool UpdateUsers = true, > > + std::multimap<Value *, Value *> *LoadMoveSet = 0); > > + > > + void computePairsConnectedTo( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + ValuePair P); > > + > > + bool pairsConflict(ValuePair P, ValuePair Q, > > + DenseSet<ValuePair> &PairableInstUsers); > > + > > + void pruneTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > +
DenseMap<ValuePair, size_t> &Tree, > > + DenseSet<ValuePair> &PrunedTree, ValuePair J); > > + > > + void buildInitialTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + DenseMap<ValuePair, size_t> &Tree, ValuePair J); > > + > > + void findBestTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + DenseSet<ValuePair> &BestTree, size_t&BestMaxDepth, > > + size_t&BestEffSize, VPIteratorPair ChoiceRange); > > + > > + Value *getReplacementPointerInput(LLVMContext& Context, Instruction *I, > > + Instruction *J, unsigned o, bool&FlipMemInputs); > > + > > + Value *getReplacementShuffleMask(LLVMContext& Context, Instruction *I, > > + Instruction *J, unsigned o); > > + > > + Value *getReplacementInput(LLVMContext& Context, Instruction *I, > > + Instruction *J, unsigned o, bool FlipMemInputs); > > + > > + void getReplacementInputsForPair(LLVMContext& Context, Instruction *I, > > + Instruction *J, SmallVector<Value *, 3> &ReplacedOperands, > > + bool&FlipMemInputs); > > + > > + void replaceOutputsOfPair(LLVMContext& Context, Instruction *I, > > + Instruction *J, Instruction *K, > > + Instruction *&InsertionPt, Instruction *&K1, > > + Instruction *&K2, bool&FlipMemInputs); > > + > > + void collectPairLoadMoveSet(BasicBlock&BB, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet, > > + Instruction *I, Instruction *J); > > + > > + void collectLoadMoveSet(BasicBlock&BB, > > + std::vector<Value *> &PairableInsts, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet); > > + > > 
+ void moveUsesOfIAfterJ(BasicBlock&BB, > > + DenseMap<Value *, Value *>& ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet, > > + Instruction *&InsertionPt, > > + Instruction *I, Instruction *J); > > Here I am somehow unsatisfied with the indentation, but can currently > not suggest anything better. The single operands are just too long. ;-) > > > + > > + virtual bool runOnBasicBlock(BasicBlock&BB) { > > + bool changed = false; > > + // Iterate a sufficient number of times to merge types of > > + // size 1, then 2, then 4, etc. up to half of the > > + // target vector width of the target vector register. > Use all 80 columns. > > Also. What is a type of size 1? Do you mean 1 bit, one byte or what? > > > + for (unsigned v = 2, n = 1; v<= VectorBits&& > > + (!MaxIter || n<= MaxIter); v *= 2, ++n) { > ^^^^^^^^^^^^^^^^^^^^^^^^^^ > This fits in the previous row. > Also, having the whole condition in one row seems easier to read. > > > + DEBUG(dbgs()<< "BBV: fusing loop #"<< n<< > > + " for "<< BB.getName()<< " in "<< > > + BB.getParent()->getName()<< "...\n"); > > + if (vectorizePairs(BB)) { > > + changed = true; > > + } else { > > + break; > > + } > Braces not needed here. > > > + } > > + > > + DEBUG(dbgs()<< "BBV: done!\n"); > > + return changed; > > + } > > + > > + virtual void getAnalysisUsage(AnalysisUsage&AU) const { > > + BasicBlockPass::getAnalysisUsage(AU); > > + AU.addRequired<AliasAnalysis>(); > > + AU.addRequired<ScalarEvolution>(); > > + AU.addPreserved<AliasAnalysis>(); > > + AU.addPreserved<ScalarEvolution>(); > > + } > > + > > + // This returns the vector type that holds a pair of the provided type. > > + // If the provided type is already a vector, then its length is doubled. > > + static inline VectorType *getVecType(Type *VElemType) { > > What about naming this getVecTypeForPair()? > > I think ElemType is sufficient. No need for the 'V'. 
> > > + if (VElemType->isVectorTy()) { > > + unsigned numElem = cast<VectorType>(VElemType)->getNumElements(); > > + return VectorType::get(VElemType->getScalarType(), numElem*2); > > + } else { > > + return VectorType::get(VElemType, 2); > > + } > > + } > > This can be: > > if (VectorType *VectorTy = dyn_cast<VectorType>(ElemType)) { > unsigned numElem = VectorTy->getNumElements(); > return VectorType::get(VElemType->getScalarType(), numElem*2); > } else { > return VectorType::get(VElemType, 2); > } > > > + // Returns the weight associated with the provided value. A chain of > > + // candidate pairs has a length given by the sum of the weights of its > > + // members (one weight per pair; the weight of each member of the pair > > + // is assumed to be the same). This length is then compared to the > > + // chain-length threshold to determine if a given chain is significant > > + // enough to be vectorized. The length is also used in comparing > > + // candidate chains where longer chains are considered to be better. > > + // Note: when this function returns 0, the resulting instructions are > > + // not actually fused. > > + static inline size_t getDepthFactor(Value *V) { > > + if (isa<InsertElementInst>(V) || isa<ExtractElementInst>(V)) { > > Why have InsertElementInst, ExtractElementInst instructions a weight of > zero? > > > + return 0; > > + } > No need for braces. > > > + > > + return 1; > > + } > > + > > + // This determines the relative offset of two loads or stores, returning > > + // true if the offset could be determined to be some constant value. > > + // For example, if OffsetInElmts == 1, then J accesses the memory directly > > + // after I; if OffsetInElmts == -1 then I accesses the memory > > + // directly after J. This function assumes that both instructions > > + // have the same type. > You may want to mention more explicit that this function returns parts > of the result in OffsetInElmts. 
> > I think 'Offset' is enough, no need for 'OffsetInElmts'. > > > + bool getPairPtrInfo(Instruction *I, Instruction *J, const TargetData&TD, > > + Value *&IPtr, Value *&JPtr, unsigned&IAlignment, unsigned&JAlignment, > > + int64_t&OffsetInElmts) { > > + OffsetInElmts = 0; > > + if (isa<LoadInst>(I)) { > > + IPtr = cast<LoadInst>(I)->getPointerOperand(); > > + JPtr = cast<LoadInst>(J)->getPointerOperand(); > > + IAlignment = cast<LoadInst>(I)->getAlignment(); > > + JAlignment = cast<LoadInst>(J)->getAlignment(); > > + } else { > > + IPtr = cast<StoreInst>(I)->getPointerOperand(); > > + JPtr = cast<StoreInst>(J)->getPointerOperand(); > > + IAlignment = cast<StoreInst>(I)->getAlignment(); > > + JAlignment = cast<StoreInst>(J)->getAlignment(); > > + } > > + > > + ScalarEvolution&SE = getAnalysis<ScalarEvolution>(); > > + const SCEV *IPtrSCEV = SE.getSCEV(IPtr); > > + const SCEV *JPtrSCEV = SE.getSCEV(JPtr); > > + > > + // If this is a trivial offset, then we'll get something like > > + // 1*sizeof(type). With target data, which we need anyway, this will get > > + // constant folded into a number. > > + const SCEV *OffsetSCEV = SE.getMinusSCEV(JPtrSCEV, IPtrSCEV); > > + if (const SCEVConstant *ConstOffSCEV = > > + dyn_cast<SCEVConstant>(OffsetSCEV)) { > > + ConstantInt *IntOff = ConstOffSCEV->getValue(); > > + int64_t Offset = IntOff->getSExtValue(); > > + > > + Type *VTy = cast<PointerType>(IPtr->getType())->getElementType(); > > + int64_t VTyTSS = (int64_t) TD.getTypeStoreSize(VTy); > > + > > + assert(VTy == cast<PointerType>(JPtr->getType())->getElementType()); > > + > > + OffsetInElmts = Offset/VTyTSS; > > + return (abs64(Offset) % VTyTSS) == 0; > > + } > > + > > + return false; > > + } > > + > > + // Returns true if the provided CallInst represents an intrinsic that can > > + // be vectorized.
> > + bool isVectorizableIntrinsic(CallInst* I) { > > + bool VIntr = false; > > + if (Function *F = I->getCalledFunction()) { > > + if (unsigned IID = F->getIntrinsicID()) { > > Use early exit. > > Function *F = I->getCalledFunction(); > > if (!F) > return false; > > unsigned IID = F->getIntrinsicID(); > > if (!IID) > return false; > > switch(IID) { > case Intrinsic::sqrt: > case Intrinsic::powi: > case Intrinsic::sin: > case Intrinsic::cos: > case Intrinsic::log: > case Intrinsic::log2: > case Intrinsic::log10: > case Intrinsic::exp: > case Intrinsic::exp2: > case Intrinsic::pow: > return !NoMath; > case Intrinsic::fma: > return !NoFMA; > } > > Is this switch exhaustive or are you missing a default case? > > > + // Returns true if J is the second element in some ValuePair referenced by > > + // some multimap ValuePair iterator pair. > > + bool isSecondInVPIteratorPair(VPIteratorPair PairRange, Value *J) { > > + for (std::multimap<Value *, Value *>::iterator K = PairRange.first; > > + K != PairRange.second; ++K) { > > + if (K->second == J) return true; > > + } > No need for braces. > > > + > > + return false; > > + } > > + }; > > + > > + // This function implements one vectorization iteration on the provided > > + // basic block. It returns true if the block is changed. > > + bool BBVectorize::vectorizePairs(BasicBlock&BB) { > > + std::vector<Value *> PairableInsts; > > + std::multimap<Value *, Value *> CandidatePairs; > > + getCandidatePairs(BB, CandidatePairs, PairableInsts); > > + if (!PairableInsts.size()) return false; > > if (PairableInsts.size() == 0) ... > > I think it is more straightforward to compare a size to 0 than to force > the brain to implicitly convert it to bool. > > > + > > + // Now we have a map of all of the pairable instructions and we need to > we have a map of all pairable instructions > > > + // select the best possible pairing. A good pairing is one such that the > > + // Users of the pair are also paired. 
This defines a (directed) forest > users > ^ > > + // over the pairs such that two pairs are connected iff the second pair > > + // uses the first. > > + > > + // Note that it only matters that both members of the second pair use some > > + // element of the first pair (to allow for splatting). > > + > > + std::multimap<ValuePair, ValuePair> ConnectedPairs; > > + computeConnectedPairs(CandidatePairs, PairableInsts, ConnectedPairs); > > + if (!ConnectedPairs.size()) return false; > > if (ConnectedPairs.size() == 0) ... > > > + // Build the pairable-instruction dependency map > > + DenseSet<ValuePair> PairableInstUsers; > > + buildDepMap(BB, CandidatePairs, PairableInsts, PairableInstUsers); > > + > > + // There is now a graph of the connected pairs. For each variable, pick the > > + // pairing with the largest tree meeting the depth requirement on at least > > + // one branch. Then select all pairings that are part of that tree and > > + // remove them from the list of available parings and pairable variables. > pairings > > > + > > + DenseMap<Value *, Value *> ChosenPairs; > > + choosePairs(CandidatePairs, PairableInsts, ConnectedPairs, > > + PairableInstUsers, ChosenPairs); > > + > > + if (!ChosenPairs.size()) return false; > > + NumFusedOps += ChosenPairs.size(); > > + > > + // A set of chosen pairs has now been selected. It is now necessary to > // A set of pairs has been chosen. > > + // replace the paired functions with vector functions. For this procedure > > You are only talking about functions. I thought the main objective is to > pair LLVM-IR _instructions_. > > > + // each argument much be replaced with a vector argument. This vector > must > > At least for instructions, I would use the term 'operand' instead of > 'argument'. > > > + // is formed by using build_vector on the old arguments. The replaced > > + // values are then replaced with a vector_extract on the result. 
> > + // Subsequent optimization passes should coalesce the build/extract > > + // combinations. > > + > > + fuseChosenPairs(BB, PairableInsts, ChosenPairs); > > + > > + return true; > > + } > > + > > + // This function returns true if the provided instruction is capable of being > > + // fused into a vector instruction. This determination is based only on the > > + // type and other attributes of the instruction. > > + bool BBVectorize::isInstVectorizable(Instruction *I, > > + bool&IsSimpleLoadStore) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > AA can become a member of the BBVectorize class and it can be > initialized once per basic block. > > > + bool VIntr = false; > > + if (isa<CallInst>(I)) { > > + VIntr = isVectorizableIntrinsic(cast<CallInst>(I)); > > + } > Remove VIntr and use early exit. > > if (CallInst *C = dyn_cast<CallInst>(I)) { > if (!isVectorizableIntrinsic(C)) > return false; > } > > Also, no braces if not necessary in at least one branch. > > > + // Vectorize simple loads and stores if possible: > > + IsSimpleLoadStore = false; > > + if (isa<LoadInst>(I)) { > > + IsSimpleLoadStore = cast<LoadInst>(I)->isSimple(); > > + } else if (isa<StoreInst>(I)) { > > + IsSimpleLoadStore = cast<StoreInst>(I)->isSimple(); > > + } > Again, no braces needed and use early exit. > > } else if (LoadInst *L = dyn_cast<LoadInst>(I)) { > if (!L->isSimple() || NoMemOps) { > IsSimpleLoadStore = false; > return false; > } > } else if (StoreInst *S = dyn_cast<StoreInst>(I)) { > if (!S->isSimple() || NoMemOps) { > IsSimpleLoadStore = false; > return false; > } > } > > > + // We can vectorize casts, but not casts of pointer types, etc.
> > + bool IsCast = false; > > + if (I->isCast()) { > > + IsCast = true; > > + if (!cast<CastInst>(I)->getSrcTy()->isSingleValueType()) { > > + IsCast = false; > > + } else if (!cast<CastInst>(I)->getDestTy()->isSingleValueType()) { > > + IsCast = false; > > + } else if (cast<CastInst>(I)->getSrcTy()->isPointerTy()) { > > + IsCast = false; > > + } else if (cast<CastInst>(I)->getDestTy()->isPointerTy()) { > > + IsCast = false; > > + } > > + } > > } else if (CastInst *C = dyn_cast<CastInst>(I)) { > if (NoCasts) > return false; > > Type *SrcTy = C->getSrcTy(); > > if (!SrcTy->isSingleValueType() || SrcTy->isPointerTy()) > return false; > > Type *DestTy = C->getDestTy(); > > if (!DestTy->isSingleValueType() || DestTy->isPointerTy()) > return false; > } > > > + > > + if (!(I->isBinaryOp() || isa<ShuffleVectorInst>(I) || > > + isa<ExtractElementInst>(I) || isa<InsertElementInst>(I) || > > + (!NoCasts&& IsCast) || VIntr || > > + (!NoMemOps&& IsSimpleLoadStore))) { > > + return false; > > + } > > else if (I->isBinaryOp() > || isa<ShuffleVectorInst>(I) > || isa<ExtractElementInst>(I) > || isa<InsertElementInst>(I)) { > // Those instructions are always OK. > } else { > // Everything else cannot be handled. > return false; > } > > > > + > > + // We can't vectorize memory operations without target data > > + if (AA.getTargetData() == 0&& IsSimpleLoadStore) { > > + return false; > > Is this check needed at this place? I had the feeling the whole pass > does not work without target data. At least the offset calculation seems > to require it. > > > + } > > + > > + Type *T1, *T2; > > + if (isa<StoreInst>(I)) { > > + // For stores, it is the value type, not the pointer type that matters > > + // because the value is what will come from a vector register.
> > + > > + Value *IVal = cast<StoreInst>(I)->getValueOperand(); > > + T1 = IVal->getType(); > > + } else { > > + T1 = I->getType(); > > + } > > > + > > + if (I->isCast()) { > > + T2 = cast<CastInst>(I)->getSrcTy(); > > + } else { > > + T2 = T1; > > + } > No braces needed. > > > + > > + // Not every type can be vectorized... > > + if (!(VectorType::isValidElementType(T1) || T1->isVectorTy()) || > > + !(VectorType::isValidElementType(T2) || T2->isVectorTy())) > > + return false; > > + > > + if (NoInts&& (T1->isIntOrIntVectorTy() || T2->isIntOrIntVectorTy())) > > + return false; > > + > > + if (NoFloats&& (T1->isFPOrFPVectorTy() || T2->isFPOrFPVectorTy())) > > + return false; > > + > > + if (T1->getPrimitiveSizeInBits()> VectorBits/2 || > > + T2->getPrimitiveSizeInBits()> VectorBits/2) > > + return false; > > + > > + return true; > > + } > > + > > + // This function returns true if the two provided instructions are compatible > > + // (meaning that they can be fused into a vector instruction). This assumes > > + // that I has already been determined to be vectorizable and that J is not > > + // in the use tree of I. > > + bool BBVectorize::areInstsCompatible(Instruction *I, Instruction *J, > > + bool IsSimpleLoadStore) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > AA can become a member of the BBVectorize class and it can be > initialized once per basic block. > > > + bool IsCompat = J->isSameOperationAs(I); > > + // FIXME: handle addsub-type operations! > > + > > + // Loads and stores can be merged if they have different alignments, > > + // but are otherwise the same. 
> > + if (!IsCompat&& isa<LoadInst>(I)&& isa<LoadInst>(J)) { > > + if (I->getType() == J->getType()) { > > + if (cast<LoadInst>(I)->getPointerOperand()->getType() == > > + cast<LoadInst>(J)->getPointerOperand()->getType()&& > > + cast<LoadInst>(I)->isVolatile() == > > + cast<LoadInst>(J)->isVolatile()&& > > + cast<LoadInst>(I)->getOrdering() == > > + cast<LoadInst>(J)->getOrdering()&& > > + cast<LoadInst>(I)->getSynchScope() == > > + cast<LoadInst>(J)->getSynchScope() > > + ) { > > + IsCompat = true; > > + } > > + } > > + } else if (!IsCompat&& isa<StoreInst>(I)&& isa<StoreInst>(J)) { > > + if (cast<StoreInst>(I)->getValueOperand()->getType() == > > + cast<StoreInst>(J)->getValueOperand()->getType()&& > > + cast<StoreInst>(I)->getPointerOperand()->getType() == > > + cast<StoreInst>(J)->getPointerOperand()->getType()&& > > + cast<StoreInst>(I)->isVolatile() == > > + cast<StoreInst>(J)->isVolatile()&& > > + cast<StoreInst>(I)->getOrdering() == > > + cast<StoreInst>(J)->getOrdering()&& > > + cast<StoreInst>(I)->getSynchScope() == > > + cast<StoreInst>(J)->getSynchScope() > > + ) { > > + IsCompat = true; > > + } > > + } > > + > > + if (!IsCompat) return false; > > + > > + if (IsSimpleLoadStore) { > > + const TargetData&TD = *AA.getTargetData(); > > + > > + Value *IPtr, *JPtr; > > + unsigned IAlignment, JAlignment; > > + int64_t OffsetInElmts = 0; > > + if (getPairPtrInfo(I, J, TD, IPtr, JPtr, IAlignment, JAlignment, > > + OffsetInElmts)&& abs64(OffsetInElmts) == 1) { > > + if (AlignedOnly) { > > + Type *aType = isa<StoreInst>(I) ? > > + cast<StoreInst>(I)->getValueOperand()->getType() : I->getType(); > > + // An aligned load or store is possible only if the instruction > > + // with the lower offset has an alignment suitable for the > > + // vector type.
> > + > > + unsigned BottomAlignment = IAlignment; > > + if (OffsetInElmts< 0) BottomAlignment = JAlignment; > > + > > + Type *VType = getVecType(aType); > > + unsigned VecAlignment = TD.getPrefTypeAlignment(VType); > > + if (BottomAlignment< VecAlignment) { > > + IsCompat = false; > > + } > > + } > > + } else { > > + IsCompat = false; > > + } > > + } else if (isa<ShuffleVectorInst>(I)) { > > + // Only merge two shuffles if they're both constant > > + // or both not constant. > > + IsCompat = isa<Constant>(I->getOperand(2))&& > > + isa<Constant>(J->getOperand(2)); > > + // FIXME: We may want to vectorize non-constant shuffles also. > > + } > > + > > + return IsCompat; > > Here again for the whole function. Can you try to use early exit to > simplify the code http://llvm.org/docs/CodingStandards.html#hl_earlyexit > You might be able to remove the use of IsCompat completely. > > > + } > > + > > + // Figure out whether or not J uses I and update the users and write-set > > + // structures associated with I. > > You may want to give I and J more descriptive names. I remember I asked > you the last time to rename One and Two to I and J, but in this case it > seems both have very specific roles which could be used to describe > them. What about 'Inst' and 'User'? > > Also what are the arguments of this function? Can you explain them a > little? > > > + void BBVectorize::trackUsesOfI(DenseSet<Value *> &Users, > > + AliasSetTracker&WriteSet, Instruction *I, > > + Instruction *J, bool&UsesI, bool UpdateUsers, > > You may want to return UsesI as a function return value not as a > reference parameter, though you may need to updated the function name > accordingly. > > > + std::multimap<Value *, Value *> *LoadMoveSet) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + > > + UsesI = false; > > + > > + // This instruction may already be marked as a user due, for example, to > > + // being a member of a selected pair. 
> > + if (Users.count(J)) > > + UsesI = true; > > + > > + if (!UsesI) for (User::op_iterator JU = J->op_begin(), e = J->op_end(); > Use a new line for this 'for'. > > > + JU != e; ++JU) { > > + Value *V = *JU; > > + if (I == V || Users.count(V)) { > > + UsesI = true; > > + break; > > + } > > + } > > + if (!UsesI&& J->mayReadFromMemory()) { > > + if (LoadMoveSet) { > > + VPIteratorPair JPairRange = LoadMoveSet->equal_range(J); > > + UsesI = isSecondInVPIteratorPair(JPairRange, I); > > + } > > + else for (AliasSetTracker::iterator W = WriteSet.begin(), > Use a new line for this 'for'. > > > + WE = WriteSet.end(); W != WE; ++W) { > > + for (AliasSet::iterator A = W->begin(), AE = W->end(); > > + A != AE; ++A) { > > + AliasAnalysis::Location ptrLoc(A->getValue(), A->getSize(), > > + A->getTBAAInfo()); > Align this argument with the first argument. > > AliasAnalysis::Location ptrLoc(A->getValue(), A->getSize(), > A->getTBAAInfo()); > > > + if (AA.getModRefInfo(J, ptrLoc) != AliasAnalysis::NoModRef) { > > + UsesI = true; > > + break; > > + } > > + } > > + if (UsesI) break; > > + } > > + } > > + if (UsesI&& UpdateUsers) { > > + if (J->mayWriteToMemory()) WriteSet.add(J); > > + Users.insert(J); > > + } > > + } > > + > > + // This function iterates over all instruction pairs in the provided > > + // basic block and collects all candidate pairs for vectorization. > > + void BBVectorize::getCandidatePairs(BasicBlock&BB, > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + BasicBlock::iterator E = BB.end(); > > + for (BasicBlock::iterator I = BB.getFirstInsertionPt(); I != E; ++I) { > > + bool IsSimpleLoadStore; > > + if (!isInstVectorizable(I, IsSimpleLoadStore)) continue; > > + > > + // Look for an instruction with which to pair instruction *I... 
> > + DenseSet<Value *> Users; > > + AliasSetTracker WriteSet(AA); > > + BasicBlock::iterator J = I; ++J; > > + for (unsigned ss = 0; J != E&& ss<= SearchLimit; ++J, ++ss) { > > + // Determine if J uses I, if so, exit the loop. > > + bool UsesI; > > + trackUsesOfI(Users, WriteSet, I, J, UsesI, !FastDep); > > 'UsesI' can be eliminated if the corresponding value is passed as > function return value. > > > + if (FastDep) { > > + // Note: For this heuristic to be effective, independent operations > > + // must tend to be intermixed. This is likely to be true from some > > + // kinds of grouped loop unrolling (but not the generic LLVM pass), > > + // but otherwise may require some kind of reordering pass. > > + > > + // When using fast dependency analysis, > > + // stop searching after first use: > > + if (UsesI) break; > > + } else { > > + if (UsesI) continue; > > + } > > + > > + // J does not use I, and comes before the first use of I, so it can be > > + // merged with I if the instructions are compatible. > > + if (!areInstsCompatible(I, J, IsSimpleLoadStore)) continue; > > + > > + // J is a candidate for merging with I. > > + if (!PairableInsts.size() || > > + PairableInsts[PairableInsts.size()-1] != I) { > > + PairableInsts.push_back(I); > > + } > > + CandidatePairs.insert(ValuePair(I, J)); > > + DEBUG(dbgs()<< "BBV: candidate pair "<< *I<< > > + "<-> "<< *J<< "\n"); > > + } > > + } > > + > > + DEBUG(dbgs()<< "BBV: found "<< PairableInsts.size() > > +<< " instructions with candidate pairs\n"); > > + } > > + > > + // Finds candidate pairs connected to the pair P =<PI, PJ>. This means that > > + // it looks for pairs such that both members have an input which is an > > + // output of PI or PJ. 
> > + void BBVectorize::computePairsConnectedTo( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + ValuePair P) { > > + // For each possible pairing for this variable, look at the uses of > > + // the first value... > > + for (Value::use_iterator I = P.first->use_begin(), > > + E = P.first->use_end(); I != E; ++I) { > > + VPIteratorPair IPairRange = CandidatePairs.equal_range(*I); > > + > > + // For each use of the first variable, look for uses of the second > > + // variable... > > + for (Value::use_iterator J = P.second->use_begin(), > > + E2 = P.second->use_end(); J != E2; ++J) { > > + VPIteratorPair JPairRange = CandidatePairs.equal_range(*J); > > + > > + // Look for<I, J>: > > + if (isSecondInVPIteratorPair(IPairRange, *J)) > > + ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J))); > > + > > + // Look for<J, I>: > > + if (isSecondInVPIteratorPair(JPairRange, *I)) > > + ConnectedPairs.insert(VPPair(P, ValuePair(*J, *I))); > > + } > > + > > + // Look for cases where just the first value in the pair is used by > > + // both members of another pair (splatting). > > + Value::use_iterator J = P.first->use_begin(); > > + if (!SplatBreaksChain) for (; J != E; ++J) { > This can become an early continue. Also please do not put an 'if' and a > 'for' into the same line. > > > + if (isSecondInVPIteratorPair(IPairRange, *J)) > > + ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J))); > > + } > > + } > > + > > + // Look for cases where just the second value in the pair is used by > > + // both members of another pair (splatting). > > + if (!SplatBreaksChain) > This can become an early return. 
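For the `if (!SplatBreaksChain) for (...)` constructs, the early-exit rewrite looks roughly like this (a generic sketch, not the actual pass code):

```cpp
#include <vector>
#include <cassert>

// Generic illustration of the suggested rewrite: instead of squeezing
//   if (!Flag) for (...) { ... }
// onto one line, bail out early so the loop gets its own lines.
int sumUnlessDisabled(const std::vector<int> &V, bool Disabled) {
  if (Disabled)
    return 0; // early exit replaces the "if (...) for (...)" one-liner

  int Sum = 0;
  for (std::vector<int>::const_iterator I = V.begin(), E = V.end();
       I != E; ++I)
    Sum += *I;
  return Sum;
}
```

In computePairsConnectedTo the early `continue`/`return` plays the role of the `return 0` here.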
> > > + for (Value::use_iterator I = P.second->use_begin(), > > + E = P.second->use_end(); I != E; ++I) { > > + VPIteratorPair IPairRange = CandidatePairs.equal_range(*I); > > + > > + Value::use_iterator J = P.second->use_begin(); > > + for (; J != E; ++J) { > Why not move initialization into 'for'? > > > + if (isSecondInVPIteratorPair(IPairRange, *J)) > > + ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J))); > > + } > > + } > > + } > > + > > + // This function figures out which pairs are connected. Two pairs are > > + // connected if some output of the first pair forms an input to both members > > + // of the second pair. > > + void BBVectorize::computeConnectedPairs( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs) { > > + > > + for (std::vector<Value *>::iterator PI = PairableInsts.begin(), > > + PE = PairableInsts.end(); PI != PE; ++PI) { > > + VPIteratorPair choiceRange = CandidatePairs.equal_range(*PI); > > + > > + for (std::multimap<Value *, Value *>::iterator P = choiceRange.first; > > + P != choiceRange.second; ++P) { > > + computePairsConnectedTo(CandidatePairs, PairableInsts, > > + ConnectedPairs, *P); > Align CandidatePairs and ConnectedPairs. > > > + } > No braces needed here. > > > + } > > + > > + DEBUG(dbgs()<< "BBV: found "<< ConnectedPairs.size() > > +<< " pair connections.\n"); > > + } > > + > > + // This function builds a set of use tuples such that<A, B> is in the set > > + // if B is in the use tree of A. If B is in the use tree of A, then B > > + // depends on the output of A. 
> > + void BBVectorize::buildDepMap( > > + BasicBlock&BB, > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + DenseSet<ValuePair> &PairableInstUsers) { > > + DenseSet<Value *> IsInPair; > > + for (std::multimap<Value *, Value *>::iterator C = CandidatePairs.begin(), > > + E = CandidatePairs.end(); C != E; ++C) { > > + IsInPair.insert(C->first); > > + IsInPair.insert(C->second); > > + } > > + > > + // Iterate through the basic block, recording all Users of each > > + // pairable instruction. > > + > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + BasicBlock::iterator E = BB.end(); > > + for (BasicBlock::iterator I = BB.getFirstInsertionPt(); I != E; ++I) { > > + if (IsInPair.find(I) == IsInPair.end()) continue; > > + > > + DenseSet<Value *> Users; > > + AliasSetTracker WriteSet(AA); > > + BasicBlock::iterator J = I; ++J; > > + for (; J != E; ++J) { > Move the initialization into the for loop. > > for (BasicBlock::iterator J = I + 1; J != E; ++J) > > > + bool UsesI; > > + trackUsesOfI(Users, WriteSet, I, J, UsesI); > > + } > > + > > + for (DenseSet<Value *>::iterator U = Users.begin(), E = Users.end(); > > + U != E; ++U) { > ^^ Check indentation here. > > > + PairableInstUsers.insert(ValuePair(I, *U)); > > + } > No braces needed. > > > + } > > + } > > + > > + // Returns true if an input to pair P is an output of pair Q and also an > > + // input of pair Q is an output of pair P. If this is the case, then these > > + // two pairs cannot be simultaneously fused. > > + bool BBVectorize::pairsConflict(ValuePair P, ValuePair Q, > > + DenseSet<ValuePair> &PairableInstUsers) { > > + // Two pairs are in conflict if they are mutual Users of eachother. 
> > + bool QUsesP = PairableInstUsers.count(ValuePair(P.first, Q.first)) || > > + PairableInstUsers.count(ValuePair(P.first, Q.second)) || > > + PairableInstUsers.count(ValuePair(P.second, Q.first)) || > > + PairableInstUsers.count(ValuePair(P.second, Q.second)); > > + bool PUsesQ = PairableInstUsers.count(ValuePair(Q.first, P.first)) || > > + PairableInstUsers.count(ValuePair(Q.first, P.second)) || > > + PairableInstUsers.count(ValuePair(Q.second, P.first)) || > > + PairableInstUsers.count(ValuePair(Q.second, P.second)); > > + return (QUsesP&& PUsesQ); > > + } > > + > > + // This function builds the initial tree of connected pairs with the > > + // pair J at the root. > > + void BBVectorize::buildInitialTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + DenseMap<ValuePair, size_t> &Tree, ValuePair J) { > > + // Each of these pairs is viewed as the root node of a Tree. The Tree > > + // is then walked (depth-first). As this happens, we keep track of > > + // the pairs that compose the Tree and the maximum depth of the Tree. 
> > + std::vector<ValuePairWithDepth> Q; > > + // General depth-first post-order traversal: > > + Q.push_back(ValuePairWithDepth(J, getDepthFactor(J.first))); > > + while (!Q.empty()) { > > + ValuePairWithDepth QTop = Q.back(); > > + > > + // Push each child onto the queue: > > + bool MoreChildren = false; > > + size_t MaxChildDepth = QTop.second; > > + VPPIteratorPair qtRange = ConnectedPairs.equal_range(QTop.first); > > + for (std::map<ValuePair, ValuePair>::iterator k = qtRange.first; > > + k != qtRange.second; ++k) { > > + // Make sure that this child pair is still a candidate: > > + bool IsStillCand = false; > > + VPIteratorPair checkRange = > > + CandidatePairs.equal_range(k->second.first); > > + for (std::multimap<Value *, Value *>::iterator m = checkRange.first; > > + m != checkRange.second; ++m) { > > + if (m->second == k->second.second) { > > + IsStillCand = true; > > + break; > > + } > > + } > > + > > + if (IsStillCand) { > > + DenseMap<ValuePair, size_t>::iterator C = Tree.find(k->second); > > + if (C == Tree.end()) { > > + size_t d = getDepthFactor(k->second.first); > > + Q.push_back(ValuePairWithDepth(k->second, QTop.second+d)); > > + MoreChildren = true; > > + } else { > > + MaxChildDepth = std::max(MaxChildDepth, C->second); > > + } > > + } > > + } > > + > > + if (!MoreChildren) { > > + // Record the current pair as part of the Tree: > > + Tree.insert(ValuePairWithDepth(QTop.first, MaxChildDepth)); > > + Q.pop_back(); > > + } > > + } > > + } > > + > > + // Given some initial tree, prune it by removing conflicting pairs (pairs > > + // that cannot be simultaneously chosen for vectorization). 
> > + void BBVectorize::pruneTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + DenseMap<ValuePair, size_t> &Tree, > > + DenseSet<ValuePair> &PrunedTree, ValuePair J) { > > + std::vector<ValuePairWithDepth> Q; > > + // General depth-first post-order traversal: > > + Q.push_back(ValuePairWithDepth(J, getDepthFactor(J.first))); > > + while (!Q.empty()) { > > + ValuePairWithDepth QTop = Q.back(); > > + PrunedTree.insert(QTop.first); > > + Q.pop_back(); > > + > > + // Visit each child, pruning as necessary... > > + DenseMap<ValuePair, size_t> BestChilden; > > + VPPIteratorPair QTopRange = ConnectedPairs.equal_range(QTop.first); > > + for (std::map<ValuePair, ValuePair>::iterator K = QTopRange.first; > > + K != QTopRange.second; ++K) { > > + DenseMap<ValuePair, size_t>::iterator C = Tree.find(K->second); > > + if (C == Tree.end()) continue; > > + > > + // This child is in the Tree, now we need to make sure it is the > > + // best of any conflicting children. There could be multiple > > + // conflicting children, so first, determine if we're keeping > > + // this child, then delete conflicting children as necessary. > > + > > + // It is also necessary to guard against pairing-induced > > + // dependencies. Consider instructions a .. x .. y .. b > > + // such that (a,b) are to be fused and (x,y) are to be fused > > + // but a is an input to x and b is an output from y. This > > + // means that y cannot be moved after b but x must be moved > > + // after b for (a,b) to be fused. In other words, after > > + // fusing (a,b) we have y .. a/b .. x where y is an input > > + // to a/b and x is an output to a/b: x and y can no longer > > + // be legally fused. 
To prevent this condition, we must > > + // make sure that a child pair added to the Tree is not > > + // both an input and output of an already-selected pair. > > + > > + bool CanAdd = true; > > The use of CanAdd can be improved with early continues. > > > + for (DenseMap<ValuePair, size_t>::iterator C2 > > + = BestChilden.begin(), E2 = BestChilden.end(); > > + C2 != E2; ++C2) { > > + if (C2->first.first == C->first.first || > > + C2->first.first == C->first.second || > > + C2->first.second == C->first.first || > > + C2->first.second == C->first.second || > > + pairsConflict(C2->first, C->first, PairableInstUsers)) { > > + if (C2->second>= C->second) { > > + CanAdd = false; > > + break; > > + } > > + } > > This loop > > > + } > > + > > + // Even worse, this child could conflict with another node already > > + // selected for the Tree. If that is the case, ignore this child. > > + if (CanAdd) > > + for (DenseSet<ValuePair>::iterator T = PrunedTree.begin(), > > + E2 = PrunedTree.end(); T != E2; ++T) { > > + if (T->first == C->first.first || > > + T->first == C->first.second || > > + T->second == C->first.first || > > + T->second == C->first.second || > > + pairsConflict(*T, C->first, PairableInstUsers)) { > > + CanAdd = false; > > + break; > > + } > > + } > > and this loop > > > + > > + // And check the queue too... > > + if (CanAdd) > > + for (std::vector<ValuePairWithDepth>::iterator C2 = Q.begin(), > > + E2 = Q.end(); C2 != E2; ++C2) { > > + if (C2->first.first == C->first.first || > > + C2->first.first == C->first.second || > > + C2->first.second == C->first.first || > > + C2->first.second == C->first.second || > > + pairsConflict(C2->first, C->first, PairableInstUsers)) { > > + CanAdd = false; > > + break; > > + } > > and this loop > > > + } > > + > > + // Last but not least, check for a conflict with any of the > > + // already-chosen pairs. 
> > + if (CanAdd) > > + for (DenseMap<Value *, Value *>::iterator C2 = > > + ChosenPairs.begin(), E2 = ChosenPairs.end(); > > + C2 != E2; ++C2) { > > + if (pairsConflict(*C2, C->first, PairableInstUsers)) { > > + CanAdd = false; > > + break; > > (and almost this one) > > > + } > > + } > > + > > + if (CanAdd) { > > + // This child can be added, but we may have chosen it in preference > > + // to an already-selected child. Check for this here, and if a > > + // conflict is found, then remove the previously-selected child > > + // before adding this one in its place. > > + for (DenseMap<ValuePair, size_t>::iterator C2 > > + = BestChilden.begin(), E2 = BestChilden.end(); > > + C2 != E2;) { > > + if (C2->first.first == C->first.first || > > + C2->first.first == C->first.second || > > + C2->first.second == C->first.first || > > + C2->first.second == C->first.second || > > + pairsConflict(C2->first, C->first, PairableInstUsers)) { > > + BestChilden.erase(C2++); > > + } else { > > + ++C2; > and this loop look entirely the same. Can they be moved into a helper > function? > > > > + } > > + } > > + > > + BestChilden.insert(ValuePairWithDepth(C->first, C->second)); > > + } > > + } > > + > > + for (DenseMap<ValuePair, size_t>::iterator C > > + = BestChilden.begin(), E2 = BestChilden.end(); > > + C != E2; ++C) { > > + size_t DepthF = getDepthFactor(C->first.first); > > + Q.push_back(ValuePairWithDepth(C->first, QTop.second+DepthF)); > > + } > > + } > > + } > > + > > + // This function finds the best tree of mutually-compatible connected > > + // pairs, given the choice of root pairs as an iterator range. 
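The condition those four loops repeat could be factored out along these lines (a hypothetical sketch using a plain std::pair; the real helper would take the pass's ValuePairs and the callers would combine it with pairsConflict()):

```cpp
#include <utility>
#include <cassert>

typedef std::pair<int, int> ValuePair; // stand-in for the pass's ValuePair

// Hypothetical helper for the test repeated in all four conflict scans:
// true if the two candidate pairs share a member. Each caller would then
// use a single condition like
//   if (pairsShareMember(A, B) || pairsConflict(A, B, PairableInstUsers))
bool pairsShareMember(const ValuePair &A, const ValuePair &B) {
  return A.first  == B.first  || A.first  == B.second ||
         A.second == B.first  || A.second == B.second;
}
```

With that in place, each of the loops reduces to iterating its container and calling the helper, which also makes the CanAdd flow easier to follow.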
> > + void BBVectorize::findBestTreeFor( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + DenseSet<ValuePair> &BestTree, size_t&BestMaxDepth, > > + size_t&BestEffSize, VPIteratorPair ChoiceRange) { > > + for (std::multimap<Value *, Value *>::iterator J = ChoiceRange.first; > > + J != ChoiceRange.second; ++J) { > > + > > + // Before going any further, make sure that this pair does not > > + // conflict with any already-selected pairs (see comment below > > + // near the Tree pruning for more details). > > + bool DoesConflict = false; > > + for (DenseMap<Value *, Value *>::iterator C = ChosenPairs.begin(), > > + E = ChosenPairs.end(); C != E; ++C) { > > + if (pairsConflict(*C, *J, PairableInstUsers)) { > > + DoesConflict = true; > > + break; > > + } > > + } > > + if (DoesConflict) continue; > > + > > + DenseMap<ValuePair, size_t> Tree; > > + buildInitialTreeFor(CandidatePairs, PairableInsts, ConnectedPairs, > > + PairableInstUsers, ChosenPairs, Tree, *J); > Fix indentation. (Align CandidatePairs and PairableInstUsers) > > > + > > + // Because we'll keep the child with the largest depth, the largest > > + // depth is still the same in the unpruned Tree. > > + size_t MaxDepth = Tree.lookup(*J); > > + > > + DEBUG(dbgs()<< "BBV: found Tree for pair {"<< *J->first<< > > + "<-> "<< *J->second<< "} of depth "<< > > + MaxDepth<< " and size "<< Tree.size()<< "\n"); > > + > > + // At this point the Tree has been constructed, but, may contain > > + // contradictory children (meaning that different children of > > + // some tree node may be attempting to fuse the same instruction). > > + // So now we walk the tree again, in the case of a conflict, > > + // keep only the child with the largest depth. To break a tie, > > + // favor the first child. 
> > + > > + DenseSet<ValuePair> PrunedTree; > > + pruneTreeFor(CandidatePairs, PairableInsts, ConnectedPairs, > > + PairableInstUsers, ChosenPairs, Tree, PrunedTree, *J); > > + > > + size_t EffSize = 0; > > + for (DenseSet<ValuePair>::iterator S = PrunedTree.begin(), > > + E = PrunedTree.end(); S != E; ++S) { > > + EffSize += getDepthFactor(S->first); > > + } > No braces. > > > + > > + DEBUG(dbgs()<< "BBV: found pruned Tree for pair {"<< *J->first<< > > + "<-> "<< *J->second<< "} of depth "<< > > + MaxDepth<< " and size "<< PrunedTree.size()<< > > + " (effective size: "<< EffSize<< ")\n"); > > + if (MaxDepth>= ReqChainDepth&& EffSize> BestEffSize) { > > + BestMaxDepth = MaxDepth; > > + BestEffSize = EffSize; > > + BestTree = PrunedTree; > > + } > > + } > > + } > > + > > + // Given the list of candidate pairs, this function selects those > > + // that will be fused into vector instructions. > > + void BBVectorize::choosePairs( > > + std::multimap<Value *, Value *> &CandidatePairs, > > + std::vector<Value *> &PairableInsts, > > + std::multimap<ValuePair, ValuePair> &ConnectedPairs, > > + DenseSet<ValuePair> &PairableInstUsers, > > + DenseMap<Value *, Value *>& ChosenPairs) { > > + for (std::vector<Value *>::iterator I = PairableInsts.begin(), > > + E = PairableInsts.end(); I != E; ++I) { > > + // The number of possible pairings for this variable: > > + size_t NumChoices = CandidatePairs.count(*I); > > + if (!NumChoices) continue; > > + > > + VPIteratorPair ChoiceRange = CandidatePairs.equal_range(*I); > > + > > + // The best pair to choose and its tree: > > + size_t BestMaxDepth = 0, BestEffSize = 0; > > + DenseSet<ValuePair> BestTree; > > + findBestTreeFor(CandidatePairs, PairableInsts, ConnectedPairs, > > + PairableInstUsers, ChosenPairs, BestTree, BestMaxDepth, > > + BestEffSize, ChoiceRange); > > Fix indentation. Align CandidatePairs, ... > > > + > > + // A tree has been chosen (or not) at this point. 
If no tree was > > + // chosen, then this instruction, I, cannot be paired (and is no longer > > + // considered). > > + > > + for (DenseSet<ValuePair>::iterator S = BestTree.begin(), > > + SE = BestTree.end(); S != SE; ++S) { > > + // Insert the members of this tree into the list of chosen pairs. > > + ChosenPairs.insert(ValuePair(S->first, S->second)); > > + DEBUG(dbgs()<< "BBV: selected pair: "<< *S->first<< "<-> "<< > > + *S->second<< "\n"); > > + > > + // Remove all candidate pairs that have values in the chosen tree. > > + for (std::multimap<Value *, Value *>::iterator K = > > + CandidatePairs.begin(), > > + E2 = CandidatePairs.end(); K != E2;) { > > + if (K->first == S->first || K->second == S->first || > > + K->second == S->second || K->first == S->second) { > > + // Don't remove the actual pair chosen so that it can be used > > + // in subsequent tree selections. > > + if (!(K->first == S->first&& K->second == S->second)) { > > + CandidatePairs.erase(K++); > > + } else { > > + ++K; > > + } > No braces needed. > > > + } else { > > + ++K; > > + } > These are needed. > > > + } > > + } > > + } > > + > > + DEBUG(dbgs()<< "BBV: selected "<< ChosenPairs.size()<< " pairs.\n"); > > + } > > + > > + // Returns the value that is to be used as the pointer input to the vector > > + // instruction that fuses I with J. > > + Value *BBVectorize::getReplacementPointerInput(LLVMContext& Context, > > + Instruction *I, Instruction *J, unsigned o, > > + bool&FlipMemInputs) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + const TargetData&TD = *AA.getTargetData(); > > + > > + Value *IPtr, *JPtr; > > + unsigned IAlignment, JAlignment; > > + int64_t OffsetInElmts; > > + (void) getPairPtrInfo(I, J, TD, IPtr, JPtr, IAlignment, JAlignment, > > + OffsetInElmts); > Fix indentation. > > > + > > + // The pointer value is taken to be the one with the lowest offset. 
> > + Value *VPtr; > > + if (OffsetInElmts> 0) { > > + VPtr = IPtr; > > + } else { > > + FlipMemInputs = true; > > + VPtr = JPtr; > > + } > > + > > + Type *ArgType = cast<PointerType>(IPtr->getType())->getElementType(); > > + Type *VArgType = getVecType(ArgType); > > + Type *VArgPtrType = PointerType::get(VArgType, > > + cast<PointerType>(IPtr->getType())->getAddressSpace()); > > + return new BitCastInst(VPtr, VArgPtrType, > > + I->hasName() ? > > + I->getName() + ".v.i" + utostr(o) : "", > > + /* insert before */ FlipMemInputs ? J : I); > > + } > > + > > + // Returns the value that is to be used as the vector-shuffle mask to the > > + // vector instruction that fuses I with J. > > + Value *BBVectorize::getReplacementShuffleMask(LLVMContext& Context, > > + Instruction *I, Instruction *J, unsigned o) { > > + // This is the shuffle mask. We need to append the second > > + // mask to the first, and the numbers need to be adjusted. > > + > > + Type *ArgType = I->getType(); > > + Type *VArgType = getVecType(ArgType); > > + > > + // Get the total number of elements in the fused vector type. > > + // By definition, this must equal the number of elements in > > + // the final mask. > > + unsigned NumElem = cast<VectorType>(VArgType)->getNumElements(); > > + std::vector<Constant*> Mask(NumElem); > > + > > + Type *OpType = I->getOperand(0)->getType(); > > + unsigned numInElem = cast<VectorType>(OpType)->getNumElements(); > > + > > + // For the mask from the first pair... > > + for (unsigned v = 0; v< NumElem/2; ++v) { > > + int m = cast<ShuffleVectorInst>(I)->getMaskValue(v); > > + if (m< 0) { > > + Mask[v] = UndefValue::get(Type::getInt32Ty(Context)); > > + } else { > > + unsigned vv = v; > > + if (m>= (int) numInElem) { > > + vv += (int) numInElem; > > + } > No braces needed. > > > + > > + Mask[v] = ConstantInt::get(Type::getInt32Ty(Context), vv); > > + } > > + } > > + > > + // For the mask from the second pair... 
> > + for (unsigned v = 0; v< NumElem/2; ++v) { > > + int m = cast<ShuffleVectorInst>(J)->getMaskValue(v); > > + if (m< 0) { > > + Mask[v+NumElem/2] = UndefValue::get(Type::getInt32Ty(Context)); > > + } else { > > + unsigned vv = v + (int) numInElem; > > + if (m>= (int) numInElem) { > > + vv += (int) numInElem; > > + } > No braces needed. > > > + > > + Mask[v+NumElem/2] = > > + ConstantInt::get(Type::getInt32Ty(Context), vv); > > + } > > + } > > Those two for loops look very similar. Any chance they can be extracted > into a helper function? > > > + > > + return ConstantVector::get(Mask); > > + } > > + > > + // Returns the value to be used as the specified operand of the vector > > + // instruction that fuses I with J. > > + Value *BBVectorize::getReplacementInput(LLVMContext& Context, Instruction *I, > > + Instruction *J, unsigned o, bool FlipMemInputs) { > > Indentation. > > > + Value *CV0 = ConstantInt::get(Type::getInt32Ty(Context), 0); > > + Value *CV1 = ConstantInt::get(Type::getInt32Ty(Context), 1); > > + > > + // Compute the fused vector type for this operand > > + Type *ArgType = I->getOperand(o)->getType(); > > + VectorType *VArgType = getVecType(ArgType); > > + > > + Instruction *L = I, *H = J; > > + if (FlipMemInputs) { > > + L = J; > > + H = I; > > + } > > + > > + if (ArgType->isVectorTy()) { > > + unsigned numElem = cast<VectorType>(VArgType)->getNumElements(); > > + std::vector<Constant*> Mask(numElem); > > + for (unsigned v = 0; v< numElem; ++v) { > > + Mask[v] = ConstantInt::get(Type::getInt32Ty(Context), v); > > + } > No braces. > > > + > > + Instruction *BV = new ShuffleVectorInst(L->getOperand(o), > > + H->getOperand(o), > > + ConstantVector::get(Mask), > > + I->hasName() ? > > + I->getName() + ".v.i" + > > + utostr(o) : ""); > > It may be better to not squeeze > > I->hasName() ? I->getName() + ".v.i" + utostr(o) : "" > > into this operand. 
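On the two mask loops: they differ only in the output offset and the bias added to the index, so one helper should cover both. A sketch with plain ints (the real code would build Constant*s and use UndefValue for the negative entries):

```cpp
#include <vector>
#include <cassert>

// Hypothetical helper unifying the two mask loops in
// getReplacementShuffleMask. Writes one half of the fused shuffle mask:
// Src holds the original mask values (negative means undef), IdxOffset is
// where this half starts in the output mask, and InElemBias is numInElem
// for the second pair's half and 0 for the first.
void fillMaskHalf(std::vector<int> &Mask, const std::vector<int> &Src,
                  unsigned IdxOffset, int InElemBias, int NumInElem) {
  for (unsigned v = 0; v < Src.size(); ++v) {
    int m = Src[v];
    if (m < 0) {
      Mask[IdxOffset + v] = -1; // undef entry
    } else {
      int vv = (int) v + InElemBias;
      if (m >= NumInElem)
        vv += NumInElem; // index referred to the second input vector
      Mask[IdxOffset + v] = vv;
    }
  }
}
```

The first loop becomes `fillMaskHalf(Mask, MaskI, 0, 0, numInElem)` and the second `fillMaskHalf(Mask, MaskJ, NumElem/2, numInElem, numInElem)`, modulo the Constant wrapping.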
> > > + BV->insertBefore(J); > > + return BV; > > + } > > + > > + // If these two inputs are the output of another vector instruction, > > + // then we should use that output directly. It might be necessary to > > + // permute it first. [When pairings are fused recursively, you can > > + // end up with cases where a large vector is decomposed into scalars > > + // using extractelement instructions, then built into size-2 > > + // vectors using insertelement and the into larger vectors using > > + // shuffles. InstCombine does not simplify all of these cases well, > > + // and so we make sure that shuffles are generated here when possible. > > + ExtractElementInst *LEE > > + = dyn_cast<ExtractElementInst>(L->getOperand(o)); > > + ExtractElementInst *HEE > > + = dyn_cast<ExtractElementInst>(H->getOperand(o)); > > + > > + if (LEE&& HEE&& > > + LEE->getOperand(0)->getType() == VArgType) { > > + unsigned LowIndx = cast<ConstantInt>(LEE->getOperand(1))->getZExtValue(); > > + unsigned HighIndx = cast<ConstantInt>(HEE->getOperand(1))->getZExtValue(); > > + if (LEE->getOperand(0) == HEE->getOperand(0)) { > > + if (LowIndx == 0&& HighIndx == 1) > > + return LEE->getOperand(0); > > + > > + std::vector<Constant*> Mask(2); > > + Mask[0] = ConstantInt::get(Type::getInt32Ty(Context), LowIndx); > > + Mask[1] = ConstantInt::get(Type::getInt32Ty(Context), HighIndx); > > + > > + Instruction *BV = new ShuffleVectorInst(LEE->getOperand(0), > > + UndefValue::get(VArgType), > > + ConstantVector::get(Mask), > > + I->hasName() ? > > + I->getName() + ".v.i" + > > + utostr(o) : ""); > > Do not squeeze > > I->hasName() ? I->getName() + ".v.i" + utostr(o) : "" > > into the operand. 
> > > + BV->insertBefore(J); > > + return BV; > > + } > > + > > + std::vector<Constant*> Mask(2); > > + HighIndx += VArgType->getNumElements(); > > + Mask[0] = ConstantInt::get(Type::getInt32Ty(Context), LowIndx); > > + Mask[1] = ConstantInt::get(Type::getInt32Ty(Context), HighIndx); > > + > > + Instruction *BV = new ShuffleVectorInst(LEE->getOperand(0), > > + HEE->getOperand(0), > > + ConstantVector::get(Mask), > > + I->hasName() ? > > + I->getName() + ".v.i" + > > + utostr(o) : ""); > No squeezing please. > > > + BV->insertBefore(J); > > + return BV; > > + } > > + > > + Instruction *BV1 = InsertElementInst::Create( > > + UndefValue::get(VArgType), > > + L->getOperand(o), CV0, > > + I->hasName() ? > > + I->getName() + ".v.i" + > > + utostr(o) + ".1" > > + : ""); > No squeezing. > > > + BV1->insertBefore(I); > > + Instruction *BV2 = InsertElementInst::Create(BV1, H->getOperand(o), > > + CV1, I->hasName() ? > > + I->getName() + > > + ".v.i" + utostr(o) > > + + ".2" : ""); > > No squeezing. > > Also, the stuff that you squeeze in looks very similar or even > identical. Do you really write the whole expression 4 times in this > function? Is there no better option? > > > + BV2->insertBefore(J); > > + return BV2; > > + } > > + > > + // This function creates an array of values that will be used as the inputs > > + // to the vector instruction that fuses I with J. > > + void BBVectorize::getReplacementInputsForPair(LLVMContext& Context, > > + Instruction *I, Instruction *J, > > + SmallVector<Value *, 3> &ReplacedOperands, > > + bool&FlipMemInputs) { > > + FlipMemInputs = false; > > + unsigned NumOperands = I->getNumOperands(); > > + > > + for (unsigned p = 0, o = NumOperands-1; p< NumOperands; ++p, --o) { > > + // Iterate backward so that we look at the store pointer > > + // first and know whether or not we need to flip the inputs. > > + > > + if (isa<LoadInst>(I) || (o == 1&& isa<StoreInst>(I))) { > > + // This is the pointer for a load/store instruction. 
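Since the naming expression really does appear four times, it could live in one small helper. A hypothetical shape (the real one would take the Instruction and use utostr(); `getReplacementName` is my name for it, not something in the patch):

```cpp
#include <string>
#include <sstream>
#include <cassert>

// Hypothetical helper for the naming expression that currently appears
// four times in getReplacementInput: build "<name>.v.i<o>" when the
// instruction has a name, and "" otherwise.
std::string getReplacementName(bool HasName, const std::string &Name,
                               unsigned O) {
  if (!HasName)
    return "";
  std::ostringstream OS;
  OS << Name << ".v.i" << O;
  return OS.str();
}
```

Each constructor call then takes `getReplacementName(I->hasName(), ..., o)` (suitably adapted), which also solves the squeezing problem.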
> > + ReplacedOperands[o] = getReplacementPointerInput(Context, I, J, o, > > + FlipMemInputs); > > + continue; > > + } else if (isa<CallInst>(I)&& o == NumOperands-1) { > > + Function *F = cast<CallInst>(I)->getCalledFunction(); > > + unsigned IID = F->getIntrinsicID(); > > + BasicBlock&BB = *cast<Instruction>(I)->getParent(); > > + > > + Module *M = BB.getParent()->getParent(); > > + Type *ArgType = I->getType(); > > + Type *VArgType = getVecType(ArgType); > > + > > + // FIXME: is it safe to do this here? > > + ReplacedOperands[o] = Intrinsic::getDeclaration(M, > > + (Intrinsic::ID) IID, VArgType); > > + continue; > > + } else if (isa<ShuffleVectorInst>(I)&& o == NumOperands-1) { > > + ReplacedOperands[o] = getReplacementShuffleMask(Context, I, J, o); > > + continue; > > + } > > + > > + ReplacedOperands[o] = > > + getReplacementInput(Context, I, J, o, FlipMemInputs); > > + } > > + } > > + > > + // This function creates two values that represent the outputs of the > > + // original I and J instructions. These are generally vector shuffles > > + // or extracts. In many cases, these will end up being unused and, thus, > > + // eliminated by later passes. 
> > + void BBVectorize::replaceOutputsOfPair(LLVMContext& Context, Instruction *I, > > + Instruction *J, Instruction *K, > > + Instruction *&InsertionPt, > > + Instruction *&K1, Instruction *&K2, > > + bool&FlipMemInputs) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + > > + Value *CV0 = ConstantInt::get(Type::getInt32Ty(Context), 0); > > + Value *CV1 = ConstantInt::get(Type::getInt32Ty(Context), 1); > > + > > + if (isa<StoreInst>(I)) { > > + AA.replaceWithNewValue(I, K); > > + AA.replaceWithNewValue(J, K); > > + } else { > > + Type *IType = I->getType(); > > + Type *VType = getVecType(IType); > > + > > + if (IType->isVectorTy()) { > > + unsigned numElem = cast<VectorType>(IType)->getNumElements(); > > + std::vector<Constant*> Mask1(numElem), Mask2(numElem); > > + for (unsigned v = 0; v< numElem; ++v) { > > + Mask1[v] = ConstantInt::get(Type::getInt32Ty(Context), v); > > + Mask2[v] = ConstantInt::get(Type::getInt32Ty(Context), numElem+v); > > + } > > + > > + K1 = new ShuffleVectorInst(K, UndefValue::get(VType), > > + ConstantVector::get( > > + FlipMemInputs ? Mask2 : Mask1), > > + K->hasName() ? > > + K->getName() + ".v.r1" : ""); > > + K2 = new ShuffleVectorInst(K, UndefValue::get(VType), > > + ConstantVector::get( > > + FlipMemInputs ? Mask1 : Mask2), > > + K->hasName() ? > > + K->getName() + ".v.r2" : ""); > > + } else { > > + K1 = ExtractElementInst::Create(K, FlipMemInputs ? CV1 : CV0, > > + K->hasName() ? > > + K->getName() + ".v.r1" : ""); > > + K2 = ExtractElementInst::Create(K, FlipMemInputs ? CV0 : CV1, > > + K->hasName() ? > > + K->getName() + ".v.r2" : ""); > > + } > > + > > + K1->insertAfter(K); > > + K2->insertAfter(K1); > > + InsertionPt = K2; > > + } > > + } > > + > > + // Move all uses of the function I (including pairing-induced uses) after J. 
> > + void BBVectorize::moveUsesOfIAfterJ(BasicBlock&BB, > > + DenseMap<Value *, Value *>& ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet, > > + Instruction *&InsertionPt, > > + Instruction *I, Instruction *J) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + > > + // Skip to the first instruction past I. > > + BasicBlock::iterator L = BB.begin(); > > + for (; cast<Instruction>(L) != cast<Instruction>(I); ++L); > Why don't you put the initialization into the 'for'? I do not see any > need for the cast instructions. > > > + ++L; > > + > > + DenseSet<Value *> Users; > > + AliasSetTracker WriteSet(AA); > > + for (BasicBlock::iterator E = BB.end(); > > + cast<Instruction>(L) != cast<Instruction>(J);) { > Why do you need to cast here? > > > + bool UsesI; > > + trackUsesOfI(Users, WriteSet, I, L, UsesI, true,&LoadMoveSet); > > + > > + if (UsesI) { > > + // Make sure that the order of selected pairs is not reversed. > > + // So if we come to a user instruction that has a pair, > > + // then mark the pair's tree as used as well. > > + DenseMap<Value *, Value *>::iterator LP = ChosenPairs.find(L); > > + if (LP != ChosenPairs.end()) > > + Users.insert(LP->second); > > + > > + // Move this instruction > > + Instruction *InstToMove = L; ++L; > > + > > + DEBUG(dbgs()<< "BBV: moving: "<< *InstToMove<< > > + " to after "<< *InsertionPt<< "\n"); > > + InstToMove->removeFromParent(); > > + InstToMove->insertAfter(InsertionPt); > > + InsertionPt = InstToMove; > > + } else { > > + ++L; > > + } > > + } > > + } > > + > > + // Collect all load instruction that are in the move set of a given pair. > > + // These loads depend on the first instruction, I, and so need to be moved > > + // after J when the pair is fused. 
> > + void BBVectorize::collectPairLoadMoveSet(BasicBlock&BB, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet, > > + Instruction *I, Instruction *J) { > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + > > + // Skip to the first instruction past I. > > + BasicBlock::iterator L = BB.begin(); > > + for (; cast<Instruction>(L) != cast<Instruction>(I); ++L); > Why not initialize in the 'for' loop. Why cast here? > > > + ++L; > > + > > + DenseSet<Value *> Users; > > + AliasSetTracker WriteSet(AA); > > + for (BasicBlock::iterator E = BB.end(); > This 'E' is never used in the loop. > > > + cast<Instruction>(L) != cast<Instruction>(J); ++L) { > Why cast here? > > > + bool UsesI; > > + trackUsesOfI(Users, WriteSet, I, L, UsesI); > > + > > + if (UsesI) { > > + // Make sure that the order of selected pairs is not reversed. > > + // So if we come to a user instruction that has a pair, > > + // then mark the pair's tree as used as well. > > + DenseMap<Value *, Value *>::iterator LP = ChosenPairs.find(L); > > + if (LP != ChosenPairs.end()) > > + Users.insert(LP->second); > > + > > + if (L->mayReadFromMemory()) > > + LoadMoveSet.insert(ValuePair(L, I)); > > + } > > + } > > + } > > + > > + // In cases where both load/stores and the computation of their pointers > > + // are chosen for vectorization, we can end up in a situation where the > > + // aliasing analysis starts returning different query results as the > > + // process of fusing instruction pairs continues. Because the algorithm > > + // relies on finding the same use trees here as were found earlier, we'll > > + // need to precompute the necessary aliasing information here and then > > + // manually update it during the fusion process. 
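On the cast-free scan from I up to J (both here and in moveUsesOfIAfterJ): the shape I have in mind is roughly this, sketched over a std::list standing in for the basic block's instruction list:

```cpp
#include <list>
#include <iterator>
#include <cassert>

// Generic sketch: visit the elements strictly between I and J, with the
// iterator initialized inside the for statement and no casting at all.
// Iterators compare directly, so the cast<Instruction>(L) != ... dance
// is unnecessary.
int countBetween(std::list<int>::iterator I, std::list<int>::iterator J) {
  int N = 0;
  for (std::list<int>::iterator L = std::next(I); L != J; ++L)
    ++N;
  return N;
}
```

The unused `E` iterator in collectPairLoadMoveSet would disappear with the same rewrite.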
> > + void BBVectorize::collectLoadMoveSet(BasicBlock&BB, > > + std::vector<Value *> &PairableInsts, > > + DenseMap<Value *, Value *> &ChosenPairs, > > + std::multimap<Value *, Value *> &LoadMoveSet) { > > + for (std::vector<Value *>::iterator PI = PairableInsts.begin(), > > + PIE = PairableInsts.end(); PI != PIE; ++PI) { > > + DenseMap<Value *, Value *>::iterator P = ChosenPairs.find(*PI); > > + if (P == ChosenPairs.end()) continue; > > + > > + if (getDepthFactor(P->first) == 0) { > > + continue; > > + } > No braces. > > > + > > + Instruction *I = cast<Instruction>(P->first), > > + *J = cast<Instruction>(P->second); > > + > > + collectPairLoadMoveSet(BB, ChosenPairs, LoadMoveSet, I, J); > > + } > > + } > > + > > + // This function fuses the chosen instruction pairs into vector instructions, > > + // taking care to preserve any needed scalar outputs and, then, it reorders the > > + // remaining instructions as needed (users of the first member of the pair > > + // need to be moved to after the location of the second member of the pair > > + // because the vector instruction is inserted in the location of the pair's > > + // second member).
> > + void BBVectorize::fuseChosenPairs(BasicBlock&BB, > > + std::vector<Value *> &PairableInsts, > > + DenseMap<Value *, Value *> &ChosenPairs) { > > + LLVMContext& Context = BB.getContext(); > > + AliasAnalysis&AA = getAnalysis<AliasAnalysis>(); > > + ScalarEvolution&SE = getAnalysis<ScalarEvolution>(); > > + > > + std::multimap<Value *, Value *> LoadMoveSet; > > + collectLoadMoveSet(BB, PairableInsts, ChosenPairs, LoadMoveSet); > > + > > + DEBUG(dbgs()<< "BBV: initial: \n"<< BB<< "\n"); > > + > > + for (std::vector<Value *>::iterator PI = PairableInsts.begin(), > > + PIE = PairableInsts.end(); PI != PIE; ++PI) { > > + DenseMap<Value *, Value *>::iterator P = ChosenPairs.find(*PI); > > + if (P == ChosenPairs.end()) continue; > > + > > + if (getDepthFactor(P->first) == 0) { > > + // These instructions are not really fused, but are tracked as though > > + // they are. Any case in which it would be interesting to fuse them > > + // will be taken care of by InstCombine. > > + --NumFusedOps; > > + continue; > > + } > > + > > + Instruction *I = cast<Instruction>(P->first), > > + *J = cast<Instruction>(P->second); > > Why do we need to cast here? It seems ChosenPairs should actually be of > type DenseMap<Instruction *, Instruction *>. Can PairableInsts also be > of type std::vector<Instruction*>? > > > + > > + DEBUG(dbgs()<< "BBV: fusing: "<< *I<< > > + "<-> "<< *J<< "\n"); > > + > > + bool FlipMemInputs; > > + unsigned NumOperands = I->getNumOperands(); > > + SmallVector<Value *, 3> ReplacedOperands(NumOperands); > > + getReplacementInputsForPair(Context, I, J, ReplacedOperands, > > + FlipMemInputs); > > + > > + // Make a copy of the original operation, change its type to the vector > > + // type and replace its operands with the vector operands. > > + Instruction *K = I->clone(); > > + if (I->hasName()) K->takeName(I); > > + > > + if (!isa<StoreInst>(K)) { > > + K->mutateType(getVecType(I->getType())); > > + } > No braces.
> > > + > > + for (unsigned o = 0; o< NumOperands; ++o) { > > + K->setOperand(o, ReplacedOperands[o]); > > + } > No braces. > > > + > > + // If we've flipped the memory inputs, make sure that we take the correct > > + // alignment. > > + if (FlipMemInputs) { > > + if (isa<StoreInst>(K)) > > + cast<StoreInst>(K)->setAlignment(cast<StoreInst>(J)->getAlignment()); > > + else > > + cast<LoadInst>(K)->setAlignment(cast<LoadInst>(J)->getAlignment()); > > + } > > + > > + K->insertAfter(J); > > + > > + // Instruction insertion point: > > + Instruction *InsertionPt = K; > > + Instruction *K1 = 0, *K2 = 0; > > + replaceOutputsOfPair(Context, I, J, K, InsertionPt, K1, K2, > > + FlipMemInputs); > > + > > + // The use tree of the first original instruction must be moved to after > > + // the location of the second instruction. The entire use tree of the > > + // first instruction is disjoint from the input tree of the second > > + // (by definition), and so commutes with it. > > + > > + moveUsesOfIAfterJ(BB, ChosenPairs, LoadMoveSet, InsertionPt, > > + cast<Instruction>(I), cast<Instruction>(J)); > > + > > + if (!isa<StoreInst>(I)) { > > + I->replaceAllUsesWith(K1); > > + J->replaceAllUsesWith(K2); > > + AA.replaceWithNewValue(I, K1); > > + AA.replaceWithNewValue(J, K2); > > + } > > + > > + // Instructions that may read from memory may be in the load move set. > > + // Once an instruction is fused, we no longer need its move set, and so > > + // the values of the map never need to be updated. However, when a load > > + // is fused, we need to merge the entries from both instructions in the > > + // pair in case those instructions were in the move set of some other > > + // yet-to-be-fused pair. The loads in question are the keys of the map. 
> > + if (I->mayReadFromMemory()) { > > + std::vector<ValuePair> NewSetMembers; > > + VPIteratorPair IPairRange = LoadMoveSet.equal_range(I); > > + VPIteratorPair JPairRange = LoadMoveSet.equal_range(J); > > + for (std::multimap<Value *, Value *>::iterator N = IPairRange.first; > > + N != IPairRange.second; ++N) { > Indentation. > > > + NewSetMembers.push_back(ValuePair(K, N->second)); > > + } > No braces. > > > + for (std::multimap<Value *, Value *>::iterator N = JPairRange.first; > > + N != JPairRange.second; ++N) { > Indentation. > > > + NewSetMembers.push_back(ValuePair(K, N->second)); > > + } > No braces. > > > + for (std::vector<ValuePair>::iterator A = NewSetMembers.begin(), > > + AE = NewSetMembers.end(); A != AE; ++A) { > Indentation. > > > + LoadMoveSet.insert(*A); > > + } > No braces. > > > + } > > + > > + SE.forgetValue(I); > > + SE.forgetValue(J); > > + I->eraseFromParent(); > > + J->eraseFromParent(); > > + } > > + > > + DEBUG(dbgs()<< "BBV: final: \n"<< BB<< "\n"); > > + } > > +} > > > --- /dev/null > > +++ b/test/Transforms/BBVectorize/ld1.ll > > @@ -0,0 +1,28 @@ > > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" > > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s > > + > > +define double @test1(double* %a, double* %b, double* %c) nounwind uwtable readonly { > > +entry: > > + %i0 = load double* %a, align 8 > > + %i1 = load double* %b, align 8 > > + %mul = fmul double %i0, %i1 > > + %i2 = load double* %c, align 8 > > + %add = fadd double %mul, %i2 > > + %arrayidx3 = getelementptr inbounds double* %a, i64 1 > > + %i3 = load double* %arrayidx3, align 8 > > + %arrayidx4 = getelementptr inbounds double* %b, i64 1 > > + %i4 = load double* %arrayidx4, align 8 > > + %mul5 = fmul double %i3, %i4 > > + %arrayidx6 = getelementptr inbounds double* %c, i64 1 > > + %i5 = load double* %arrayidx6, align 8 > > + 
%add7 = fadd double %mul5, %i5 > > + %mul9 = fmul double %add, %i1 > > + %add11 = fadd double %mul9, %i2 > > + %mul13 = fmul double %add7, %i4 > > + %add15 = fadd double %mul13, %i5 > > + %mul16 = fmul double %add11, %add15 > > + ret double %mul16 > > +; CHECK: @test1 > > +; CHECK:<2 x double> > > For me this CHECK looks very short. Should all instructions be > vectorized or is it sufficient if one is vectorized. Also, it would be > nice if I could understand the code that is generated just by looking > at the test case. Any reason not to CHECK: for the entire sequence of > expected instructions (This applies to all test cases). > > > > +} > > + > > diff --git a/test/Transforms/BBVectorize/loop1.ll b/test/Transforms/BBVectorize/loop1.ll > > new file mode 100644 > > index 0000000..4c60b42 > > --- /dev/null > > +++ b/test/Transforms/BBVectorize/loop1.ll > > @@ -0,0 +1,39 @@ > > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" > > +target triple = "x86_64-unknown-linux-gnu" > > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s > > +; RUN: opt< %s -loop-unroll -unroll-allow-partial -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s > Why are you checking here with -loop-unroll? Do you expect something > special. 
In case you do it may make sense to use another -check-prefix > > > --- /dev/null > > +++ b/test/Transforms/BBVectorize/req-depth.ll > > @@ -0,0 +1,17 @@ > > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128" > > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth 3 -S | FileCheck %s -check-prefix=CHECK-RD3 > > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth 2 -S | FileCheck %s -check-prefix=CHECK-RD2 > > + > > +define double @test1(double %A1, double %A2, double %B1, double %B2) { > > + %X1 = fsub double %A1, %B1 > > + %X2 = fsub double %A2, %B2 > > + %Y1 = fmul double %X1, %A1 > > + %Y2 = fmul double %X2, %A2 > > + %R = fmul double %Y1, %Y2 > > + ret double %R > > +; CHECK-RD3: @test1 > > +; CHECK-RD2: @test1 > > +; CHECK-RD3-NOT:<2 x double> > > +; CHECK-RD2:<2 x double> > What about sorting them by check prefix? > > > +++ b/test/Transforms/BBVectorize/simple-ldstr.ll > > @@ -0,0 +1,69 @@ > > +target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128" > > +; RUN: opt< %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S | FileCheck %s > > + > > +; Simple 3-pair chain with loads and stores > > +define void @test1(double* %a, double* %b, double* %c) nounwind uwtable readonly { > > +entry: > > + %i0 = load double* %a, align 8 > > + %i1 = load double* %b, align 8 > > + %mul = fmul double %i0, %i1 > > + %arrayidx3 = getelementptr inbounds double* %a, i64 1 > > + %i3 = load double* %arrayidx3, align 8 > > + %arrayidx4 = getelementptr inbounds double* %b, i64 1 > > + %i4 = load double* %arrayidx4, align 8 > > + %mul5 = fmul double %i3, %i4 > > + store double %mul, double* %c, align 8 > > + %arrayidx5 = getelementptr inbounds double* %c, i64 1 > > + store double %mul5, double* %arrayidx5, align 8 > > + ret void > > +; CHECK: @test1 > > +; 
CHECK:<2 x double> > > Especially here the entire vectorization sequence is interesting as I > would love to see the alignment you produce for load/store instructions. > > > +; Basic depth-3 chain (last pair permuted) > > +define double @test2(double %A1, double %A2, double %B1, double %B2) { > > + %X1 = fsub double %A1, %B1 > > + %X2 = fsub double %A2, %B2 > > + %Y1 = fmul double %X1, %A1 > > + %Y2 = fmul double %X2, %A2 > > + %Z1 = fadd double %Y2, %B1 > > + %Z2 = fadd double %Y1, %B2 > > + %R = fmul double %Z1, %Z2 > > + ret double %R > > +; CHECK: @test2 > > +; CHECK:<2 x double> > > Again, the entire sequence would show how you express a permutation. >-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: llvm_bb_vectorize-20111213.diff Type: text/x-patch Size: 108949 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111213/2338613f/attachment.bin>
Sebastian Pop
2011-Dec-20 19:57 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
Hi, I see that there are two functions in your code that are O(n^2) in number of instructions of the program: getCandidatePairs and buildDepMap. I think that you could make these two functions faster if you work on some form of factored def-use chains for memory, like the VUSE/VDEFs of GCC. I was trying to find a similar representation in LLVM: isn't there already a virtual SSA representation for memory references in LLVM? Thanks, Sebastian -- Qualcomm Innovation Center, Inc is a member of Code Aurora Forum
Tobias Grosser
2011-Dec-29 14:00 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On 12/14/2011 01:25 AM, Hal Finkel wrote:> Tobias, > > I've attached an updated copy of the patch. I believe that I accounted > for all of your suggestions except for: > > 1. You said that I could make AA a member of the class and initialize it > for each basic block. I suppose that I'd need to make it a pointer, but > more generally, what is the thread-safety model that I should have in > mind for the analysis passes (will multiple threads always use distinct > instances of the passes)? Furthermore, if I am going to make AA a class > variable that is re-initialized on every call to runOnBasicBlock, then I > should probably do the same with all of the top-level maps, etc. Do you > agree? I am actually not sure about the thread safety conventions in LLVM. I personally always assumed that a transformation class will only be instantiated and used within a single thread and in case of multiple threads, multiple class instances would be created. The pattern of assigning analysis results to object variables is common (e.g. lib/Analysis/IVUsers.cpp). So we should be able to use it. I personally do not have any strong opinion about the other top-level maps. Just decide yourself what is easier to read.> 2. I have not (yet) changed the types of the maps from holding Value* to > Instruction*. Doing so would eliminate a few casts, but would cause > even more casts (and maybe checks) in computePairsConnectedTo. Even so, > it might be worthwhile to make the change for clarity (conceptually, > those maps do hold only pointers to instructions). I believe if code relies on the assumption that some data structures hold only objects of a certain type, we should use that type to define the data structures. Though, I do not think this change is required to commit the patch. It can be addressed after the code was committed.
One thing that I would still like to have is a test case where bb-vectorize-search-limit is needed to avoid exponential compile-time growth, and another test case that is not optimized if bb-vectorize-search-limit is set to a value less than 4000. I think those cases are very valuable for understanding the performance behavior of this code, especially as I am not yet sure why we need a value as high as 4000. This is my last issue on this patch that should be addressed before committing it. After we have discussed this, I propose to post a final patch stating that you will commit after three days, to give other people a chance to veto. Tobi
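A skeleton for the kind of search-limit test Tobias is requesting might look like the following. The flag spelling follows this thread and the existing req-depth.ll test's two-prefix pattern; the IR body and the exact limit values are placeholders, and the CHECK lines assume, rather than verify, that a low limit suppresses vectorization:

```llvm
; RUN: opt < %s -bb-vectorize -bb-vectorize-search-limit=4000 -S | FileCheck %s -check-prefix=CHECK-SL4000
; RUN: opt < %s -bb-vectorize -bb-vectorize-search-limit=400 -S | FileCheck %s -check-prefix=CHECK-SL400

define double @test1(double %A1, double %A2, double %B1, double %B2) {
; A long chain of pairable operations would go here -- enough that the
; candidate-pair search exceeds the lower limit but not the higher one.
  %X1 = fsub double %A1, %B1
  %X2 = fsub double %A2, %B2
  %R = fmul double %X1, %X2
  ret double %R
; CHECK-SL4000: @test1
; CHECK-SL4000: <2 x double>
; CHECK-SL400: @test1
; CHECK-SL400-NOT: <2 x double>
}
```

Constructing the chain so that compile time is measurably exponential without the limit would also answer Tobias's question about why 4000 is the right default.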