thr3ads.net - llvm dev - [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass [Dec 2011]

If this information is useful, please help other people find it:
Share via:

Hal Finkel

2011-Nov-22 03:22 UTC

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote:> Tobias,
> 
> I've attached an updated patch. It contains a few bug fixes and many
> (refactoring and coding-convention) changes inspired by your comments.
> 
> I'm currently trying to fix the bug responsible for causing a compile
> failure when compiling
> test-suite/MultiSource/Applications/obsequi/toggle_move.c; after the
> pass begins to fuse instructions in a basic block in this file, the
> aliasing analysis starts producing different (more pessimistic) query
> answers. I've never before seen it do this, and I'm not sure why it
is
> happening. Also odd, at the same time, the numeric labels that are
> assigned to otherwise unnamed instructions, change. I don't think
I've
> seen this happen either (they generally seem to stay fixed once
> assigned). I don't know if these two things are related, but if anyone
> can provide some insight, I'd appreciate it.
I think that I see what is happening in this case (please someone tell
me if this does not make sense). In the problematic basic block, there
are some loads and stores that are independent. The default aliasing
analysis can tell that these loads and stores don't reference the same
memory region. Then, some of the inputs to the getelementptr
instructions used for these loads and stores are fused by the
vectorization. After this happens, the aliasing analysis loses its
ability to tell that the loads and stores that make use of those
vector-calculated indices are independent.

 -Hal
> 
> In any case, this version of the patch should be much more suitable for
> your (or anyone else's) further review.
> 
> Thanks again,
> Hal
> 
> On Thu, 2011-11-17 at 13:57 +0100, Tobias Grosser wrote:
> > On 11/17/2011 12:38 AM, Hal Finkel wrote:
> > > Tobias, et al.,
> > >
> > > Attached is the my autovectorization pass.
> > 
> > Very nice. Will you be at the developer summit? Maybe we could discuss
> > the integration there?
> > 
> > Here a first review of the source code.
> > 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

Hal Finkel

2011-Nov-23 16:52 UTC

head link

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote:> On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote:
> > Tobias,
> > 
> > I've attached an updated patch. It contains a few bug fixes and
many
> > (refactoring and coding-convention) changes inspired by your comments.
> > 
> > I'm currently trying to fix the bug responsible for causing a
compile
> > failure when compiling
> > test-suite/MultiSource/Applications/obsequi/toggle_move.c; after the
> > pass begins to fuse instructions in a basic block in this file, the
> > aliasing analysis starts producing different (more pessimistic) query
> > answers. I've never before seen it do this, and I'm not sure
why it is
> > happening. Also odd, at the same time, the numeric labels that are
> > assigned to otherwise unnamed instructions, change. I don't think
I've
> > seen this happen either (they generally seem to stay fixed once
> > assigned). I don't know if these two things are related, but if
anyone
> > can provide some insight, I'd appreciate it.
> 
> I think that I see what is happening in this case (please someone tell
> me if this does not make sense). In the problematic basic block, there
> are some loads and stores that are independent. The default aliasing
> analysis can tell that these loads and stores don't reference the same
> memory region. Then, some of the inputs to the getelementptr
> instructions used for these loads and stores are fused by the
> vectorization. After this happens, the aliasing analysis loses its
> ability to tell that the loads and stores that make use of those
> vector-calculated indices are independent.
The attached patch works around this issue. Please review.

Thanks again,
Hal
> 
>  -Hal
> 
> > 
> > In any case, this version of the patch should be much more suitable
for
> > your (or anyone else's) further review.
> > 
> > Thanks again,
> > Hal
> > 
> > On Thu, 2011-11-17 at 13:57 +0100, Tobias Grosser wrote:
> > > On 11/17/2011 12:38 AM, Hal Finkel wrote:
> > > > Tobias, et al.,
> > > >
> > > > Attached is the my autovectorization pass.
> > > 
> > > Very nice. Will you be at the developer summit? Maybe we could
discuss
> > > the integration there?
> > > 
> > > Here a first review of the source code.
> > > 
> 
> 
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm_bb_vectorize-20111122.diff
Type: text/x-patch
Size: 99054 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20111123/32476f28/attachment.bin>

Evan Cheng

2011-Nov-28 02:34 UTC

head link

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

This is very interesting work. Thanks. Quick questions / comments.

1. What's the algorithm complexity?
2. Please rename -vectorize to -bb-vectorize since this is a specific kind of
vectorization.
3. Have you tried non-x86 targets? Does it work? If not, what would it take to
make it work?

I am concerned that LLVM currently does not have a cost model for vectorization.
That's necessary for your work to be applicable for general adoption.

Evan

On Nov 23, 2011, at 8:52 AM, Hal Finkel wrote:
> On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote:
>> On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote:
>>> Tobias,
>>> 
>>> I've attached an updated patch. It contains a few bug fixes and
many
>>> (refactoring and coding-convention) changes inspired by your
comments.
>>> 
>>> I'm currently trying to fix the bug responsible for causing a
compile
>>> failure when compiling
>>> test-suite/MultiSource/Applications/obsequi/toggle_move.c; after
the
>>> pass begins to fuse instructions in a basic block in this file, the
>>> aliasing analysis starts producing different (more pessimistic)
query
>>> answers. I've never before seen it do this, and I'm not
sure why it is
>>> happening. Also odd, at the same time, the numeric labels that are
>>> assigned to otherwise unnamed instructions, change. I don't
think I've
>>> seen this happen either (they generally seem to stay fixed once
>>> assigned). I don't know if these two things are related, but if
anyone
>>> can provide some insight, I'd appreciate it.
>> 
>> I think that I see what is happening in this case (please someone tell
>> me if this does not make sense). In the problematic basic block, there
>> are some loads and stores that are independent. The default aliasing
>> analysis can tell that these loads and stores don't reference the
same
>> memory region. Then, some of the inputs to the getelementptr
>> instructions used for these loads and stores are fused by the
>> vectorization. After this happens, the aliasing analysis loses its
>> ability to tell that the loads and stores that make use of those
>> vector-calculated indices are independent.
> 
> The attached patch works around this issue. Please review.
> 
> Thanks again,
> Hal
> 
>> 
>> -Hal
>> 
>>> 
>>> In any case, this version of the patch should be much more suitable
for
>>> your (or anyone else's) further review.
>>> 
>>> Thanks again,
>>> Hal
>>> 
>>> On Thu, 2011-11-17 at 13:57 +0100, Tobias Grosser wrote:
>>>> On 11/17/2011 12:38 AM, Hal Finkel wrote:
>>>>> Tobias, et al.,
>>>>> 
>>>>> Attached is the my autovectorization pass.
>>>> 
>>>> Very nice. Will you be at the developer summit? Maybe we could
discuss
>>>> the integration there?
>>>> 
>>>> Here a first review of the source code.
>>>> 
>> 
>> 
> 
> -- 
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
<llvm_bb_vectorize-20111122.diff>_______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

Evan Cheng

2011-Nov-29 23:00 UTC

head link

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On Nov 27, 2011, at 7:04 PM, Hal Finkel wrote:
> On Sun, 2011-11-27 at 18:34 -0800, Evan Cheng wrote:
>> This is very interesting work. Thanks. Quick questions / comments.
>> 
>> 1. What's the algorithm complexity?
> 
> Roughly, it is O(N^2), where N is the number of instructions in a basic
> block, but:
> - The run time only *really* grows with the number of "pairable"
> instructions; this is *generally* a much smaller number than the total
> number of instructions (except in test cases). Also determining whether
> two loads or stores are pairable is more expensive than for other kinds
> of instructions because scalar-evolution analysis is used.
> - There is a (configurable) cutoff, so N will not grow exceedingly
> large even for large basic blocks (this was a suggestion by Tobias
> Grosser).
N^2 is still very scary. Is there room for improvement? Can you share the
compile time of LLVM test suites?
> 
>> 2. Please rename -vectorize to -bb-vectorize since this is a specific
kind of vectorization.
> 
> I have two objections:
> 1. -vectorize is an option to the pass manager, and as I hope that
> we'll soon have other vectorization passes in addition to this one,
> -vectorize could then enable those in some default configuration as
> well. You can use -bb-vectorize currently to enable just this pass in
> the same way as any other transformation pass.
> 2. I don't think that I can because it would cause the option to
> conflict with the name of the pass. In addition to enabling the pass
> itself, -vectorize adds instcombine and gvn passes to do some necessary
> cleanup.
Ok!
> 
>> 3. Have you tried non-x86 targets? Does it work? If not, what would it
take to make it work?
> 
> I have not yet tried it, but I am planning to do that this week. My
> primary motivation for working on autovectorization is to use it on the
> IBM Blue Gene supercomputers (which are PPC-based). The pass is designed
> to be platform independent, and it *should* work.
> 
>> 
>> I am concerned that LLVM currently does not have a cost model for
vectorization. That's necessary for your work to be applicable for general
adoption.
> 
> In general I agree. That having been said, as-is, the pass produces
> significant speedups in "real" applications (including some in
the test
> suite). Unfortunately, it also produces some real slow-downs, and so I
> think that before vectorization could be enabled by default (at -O3 for
> example), we would indeed need a cost model.
Agreed.

Evan
> 
> Thank you for looking at this,
> Hal
> 
>> 
>> Evan
>> 
>> On Nov 23, 2011, at 8:52 AM, Hal Finkel wrote:
>> 
>>> On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote:
>>>> On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote:
>>>>> Tobias,
>>>>> 
>>>>> I've attached an updated patch. It contains a few bug
fixes and many
>>>>> (refactoring and coding-convention) changes inspired by
your comments.
>>>>> 
>>>>> I'm currently trying to fix the bug responsible for
causing a compile
>>>>> failure when compiling
>>>>> test-suite/MultiSource/Applications/obsequi/toggle_move.c;
after the
>>>>> pass begins to fuse instructions in a basic block in this
file, the
>>>>> aliasing analysis starts producing different (more
pessimistic) query
>>>>> answers. I've never before seen it do this, and I'm
not sure why it is
>>>>> happening. Also odd, at the same time, the numeric labels
that are
>>>>> assigned to otherwise unnamed instructions, change. I
don't think I've
>>>>> seen this happen either (they generally seem to stay fixed
once
>>>>> assigned). I don't know if these two things are
related, but if anyone
>>>>> can provide some insight, I'd appreciate it.
>>>> 
>>>> I think that I see what is happening in this case (please
someone tell
>>>> me if this does not make sense). In the problematic basic
block, there
>>>> are some loads and stores that are independent. The default
aliasing
>>>> analysis can tell that these loads and stores don't
reference the same
>>>> memory region. Then, some of the inputs to the getelementptr
>>>> instructions used for these loads and stores are fused by the
>>>> vectorization. After this happens, the aliasing analysis loses
its
>>>> ability to tell that the loads and stores that make use of
those
>>>> vector-calculated indices are independent.
>>> 
>>> The attached patch works around this issue. Please review.
>>> 
>>> Thanks again,
>>> Hal
>>> 
>>>> 
>>>> -Hal
>>>> 
>>>>> 
>>>>> In any case, this version of the patch should be much more
suitable for
>>>>> your (or anyone else's) further review.
>>>>> 
>>>>> Thanks again,
>>>>> Hal
>>>>> 
>>>>> On Thu, 2011-11-17 at 13:57 +0100, Tobias Grosser wrote:
>>>>>> On 11/17/2011 12:38 AM, Hal Finkel wrote:
>>>>>>> Tobias, et al.,
>>>>>>> 
>>>>>>> Attached is the my autovectorization pass.
>>>>>> 
>>>>>> Very nice. Will you be at the developer summit? Maybe
we could discuss
>>>>>> the integration there?
>>>>>> 
>>>>>> Here a first review of the source code.
>>>>>> 
>>>> 
>>>> 
>>> 
>>> -- 
>>> Hal Finkel
>>> Postdoctoral Appointee
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
<llvm_bb_vectorize-20111122.diff>_______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>> 
> 
> -- 
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>

Hal Finkel

2011-Nov-30 23:14 UTC

head link

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On Tue, 2011-11-29 at 15:00 -0800, Evan Cheng wrote:> On Nov 27, 2011, at 7:04 PM, Hal Finkel wrote:
> 
> > On Sun, 2011-11-27 at 18:34 -0800, Evan Cheng wrote:
> >> This is very interesting work. Thanks. Quick questions / comments.
> >> 
> >> 1. What's the algorithm complexity?
> > 
> > Roughly, it is O(N^2), where N is the number of instructions in a
basic
> > block, but:
> > - The run time only *really* grows with the number of
"pairable"
> > instructions; this is *generally* a much smaller number than the total
> > number of instructions (except in test cases). Also determining
whether
> > two loads or stores are pairable is more expensive than for other
kinds
> > of instructions because scalar-evolution analysis is used.
> > - There is a (configurable) cutoff, so N will not grow exceedingly
> > large even for large basic blocks (this was a suggestion by Tobias
> > Grosser).
> 
> N^2 is still very scary. Is there room for improvement? Can you share the
compile time of LLVM test suites?
Agreed. There certainly might be room for improvement, and I have some
ideas, but I need to spend some quality time with the profiler. Even so,
the current performance is not that bad (sometimes, it is even good).

Over the entire test suite:

default vs. -vectorize:

The mean compile-time slowdown was 23%, the standard deviation was 107%
and the median was 0%. 185 of the tests exhibited a slowdown, 173
exhibited a speedup (the remainder showed no change).

The runtime showed a mean slowdown of 6% with a standard deviation of
68.5% and a median speedup of 3.5%. 226 tests exhibited a speedup, 91
exhibited a slowdown.

default vs. -unroll-allow-partial:

The mean compile-time slowdown was 19.5%, the standard deviation was 60%
and the median was 0%. 199 of the tests exhibited a slowdown, 172
exhibited a speedup.

The runtime showed a mean slowdown of 8.7% with a standard deviation of
64% and a median speedup of 0.9%. 107 tests exhibited a speedup, 213
exhibited a slowdown.

default vs -unroll-allow-partial -vectorize

The mean compile-time slowdown was 31%, the standard deviation was 122%
and the median was 2.6%. 213 of the tests exhibited a slowdown, 138
exhibited a speedup.

The mean runtime showed a slowdown of 4.3% with a standard deviation of
68% and a median speedup of 3.5%. 225 tests exhibited a speedup, 88
tests exhibited a slowdown.

In conclusion, the current vectorization patch has about the same
compile-time/runtime profile as does -unroll-allow-partial.

 -Hal
> 
> > 
> >> 2. Please rename -vectorize to -bb-vectorize since this is a
specific kind of vectorization.
> > 
> > I have two objections:
> > 1. -vectorize is an option to the pass manager, and as I hope that
> > we'll soon have other vectorization passes in addition to this
one,
> > -vectorize could then enable those in some default configuration as
> > well. You can use -bb-vectorize currently to enable just this pass in
> > the same way as any other transformation pass.
> > 2. I don't think that I can because it would cause the option to
> > conflict with the name of the pass. In addition to enabling the pass
> > itself, -vectorize adds instcombine and gvn passes to do some
necessary
> > cleanup.
> 
> Ok!
> 
> > 
> >> 3. Have you tried non-x86 targets? Does it work? If not, what
would it take to make it work?
> > 
> > I have not yet tried it, but I am planning to do that this week. My
> > primary motivation for working on autovectorization is to use it on
the
> > IBM Blue Gene supercomputers (which are PPC-based). The pass is
designed
> > to be platform independent, and it *should* work.
> > 
> >> 
> >> I am concerned that LLVM currently does not have a cost model for
vectorization. That's necessary for your work to be applicable for general
adoption.
> > 
> > In general I agree. That having been said, as-is, the pass produces
> > significant speedups in "real" applications (including some
in the test
> > suite). Unfortunately, it also produces some real slow-downs, and so I
> > think that before vectorization could be enabled by default (at -O3
for
> > example), we would indeed need a cost model.
> 
> Agreed.
> 
> Evan
> 
> > 
> > Thank you for looking at this,
> > Hal
> > 
> >> 
> >> Evan
> >> 
> >> On Nov 23, 2011, at 8:52 AM, Hal Finkel wrote:
> >> 
> >>> On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote:
> >>>> On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote:
> >>>>> Tobias,
> >>>>> 
> >>>>> I've attached an updated patch. It contains a few
bug fixes and many
> >>>>> (refactoring and coding-convention) changes inspired
by your comments.
> >>>>> 
> >>>>> I'm currently trying to fix the bug responsible
for causing a compile
> >>>>> failure when compiling
> >>>>>
test-suite/MultiSource/Applications/obsequi/toggle_move.c; after the
> >>>>> pass begins to fuse instructions in a basic block in
this file, the
> >>>>> aliasing analysis starts producing different (more
pessimistic) query
> >>>>> answers. I've never before seen it do this, and
I'm not sure why it is
> >>>>> happening. Also odd, at the same time, the numeric
labels that are
> >>>>> assigned to otherwise unnamed instructions, change. I
don't think I've
> >>>>> seen this happen either (they generally seem to stay
fixed once
> >>>>> assigned). I don't know if these two things are
related, but if anyone
> >>>>> can provide some insight, I'd appreciate it.
> >>>> 
> >>>> I think that I see what is happening in this case (please
someone tell
> >>>> me if this does not make sense). In the problematic basic
block, there
> >>>> are some loads and stores that are independent. The
default aliasing
> >>>> analysis can tell that these loads and stores don't
reference the same
> >>>> memory region. Then, some of the inputs to the
getelementptr
> >>>> instructions used for these loads and stores are fused by
the
> >>>> vectorization. After this happens, the aliasing analysis
loses its
> >>>> ability to tell that the loads and stores that make use of
those
> >>>> vector-calculated indices are independent.
> >>> 
> >>> The attached patch works around this issue. Please review.
> >>> 
> >>> Thanks again,
> >>> Hal
> >>> 
> >>>> 
> >>>> -Hal
> >>>> 
> >>>>> 
> >>>>> In any case, this version of the patch should be much
more suitable for
> >>>>> your (or anyone else's) further review.
> >>>>> 
> >>>>> Thanks again,
> >>>>> Hal
> >>>>> 
> >>>>> On Thu, 2011-11-17 at 13:57 +0100, Tobias Grosser
wrote:
> >>>>>> On 11/17/2011 12:38 AM, Hal Finkel wrote:
> >>>>>>> Tobias, et al.,
> >>>>>>> 
> >>>>>>> Attached is the my autovectorization pass.
> >>>>>> 
> >>>>>> Very nice. Will you be at the developer summit?
Maybe we could discuss
> >>>>>> the integration there?
> >>>>>> 
> >>>>>> Here a first review of the source code.
> >>>>>> 
> >>>> 
> >>>> 
> >>> 
> >>> -- 
> >>> Hal Finkel
> >>> Postdoctoral Appointee
> >>> Leadership Computing Facility
> >>> Argonne National Laboratory
> >>>
<llvm_bb_vectorize-20111122.diff>_______________________________________________
> >>> llvm-commits mailing list
> >>> llvm-commits at cs.uiuc.edu
> >>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> >> 
> > 
> > -- 
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> > 
> 
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

Tobias Grosser

2011-Dec-02 16:07 UTC

head link

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On 11/23/2011 05:52 PM, Hal Finkel wrote:> On Mon, 2011-11-21 at 21:22 -0600, Hal Finkel wrote:
>> >  On Mon, 2011-11-21 at 11:55 -0600, Hal Finkel wrote:
>>> >  >  Tobias,
>>> >  >
>>> >  >  I've attached an updated patch. It contains a few
bug fixes and many
>>> >  >  (refactoring and coding-convention) changes inspired by
your comments.
>>> >  >
>>> >  >  I'm currently trying to fix the bug responsible for
causing a compile
>>> >  >  failure when compiling
>>> >  > 
test-suite/MultiSource/Applications/obsequi/toggle_move.c; after the
>>> >  >  pass begins to fuse instructions in a basic block in
this file, the
>>> >  >  aliasing analysis starts producing different (more
pessimistic) query
>>> >  >  answers. I've never before seen it do this, and
I'm not sure why it is
>>> >  >  happening. Also odd, at the same time, the numeric
labels that are
>>> >  >  assigned to otherwise unnamed instructions, change. I
don't think I've
>>> >  >  seen this happen either (they generally seem to stay
fixed once
>>> >  >  assigned). I don't know if these two things are
related, but if anyone
>>> >  >  can provide some insight, I'd appreciate it.
>> >
>> >  I think that I see what is happening in this case (please someone
tell
>> >  me if this does not make sense). In the problematic basic block,
there
>> >  are some loads and stores that are independent. The default
aliasing
>> >  analysis can tell that these loads and stores don't reference
the same
>> >  memory region. Then, some of the inputs to the getelementptr
>> >  instructions used for these loads and stores are fused by the
>> >  vectorization. After this happens, the aliasing analysis loses
its
>> >  ability to tell that the loads and stores that make use of those
>> >  vector-calculated indices are independent.
> The attached patch works around this issue. Please review.
Hi Hal,

thanks for reworking the patch several times. This time the patch looks
a lot better, with nicely structured helper functions and a lot more
consistent style. It is a lot better to read and review.

Unfortunately I had today just two hours to look into this patch. I
added a lot of remarks, but especially in the tree pruning part I run a
little out of time such that the remarks are mostly obvious style
remarks. I hope I have a little bit more time over the weekend to look
into this again, but I wanted to get out the first junk of comments today.

Especially with my style remarks, do not feel forced to follow all of my
comments. Style stuff is highly personal and often other people may
suggest other fixes. In case you believe your style is better just check
with the LLVM coding reference or other parts of the LLVM source code
and try to match this as good as possible.

On the algorithmic point of view I am not yet completely convinced the
algorithm itself is not worse than N^2. I really want to understand this
part better and plan to look into problematic behaviour in functions not
directly visible in the core routines.

Also, you choose a very high bound (4000) to limit the quadratic
behaviour, because you stated otherwise optimizations will be missed.
Can you send an example where a value smaller than 4000 would miss
optimizations?

Cheers
Tobi
> diff --git a/lib/Transforms/Makefile b/lib/Transforms/Makefile
> index e527be2..8b1df92 100644
> --- a/lib/Transforms/Makefile
> +++ b/lib/Transforms/Makefile
> @@ -8,7 +8,7 @@
>  
##===----------------------------------------------------------------------===##
>
>   LEVEL = ../..
> -PARALLEL_DIRS = Utils Instrumentation Scalar InstCombine IPO Hello
> +PARALLEL_DIRS = Utils Instrumentation Scalar InstCombine IPO Vectorize
Hello
>
>   include $(LEVEL)/Makefile.config
>
> diff --git a/lib/Transforms/Vectorize/BBVectorize.cpp
b/lib/Transforms/Vectorize/BBVectorize.cpp
> new file mode 100644
> index 0000000..5505502
> --- /dev/null
> +++ b/lib/Transforms/Vectorize/BBVectorize.cpp
> @@ -0,0 +1,1644 @@
> +//===- BBVectorize.cpp - A Basic-Block Vectorizer
-------------------------===//
> +//
> +//                     The LLVM Compiler Infrastructure
> +//
> +// This file is distributed under the University of Illinois Open Source
> +// License. See LICENSE.TXT for details.
> +//
>
+//===----------------------------------------------------------------------===//
> +//
> +// This file implements a basic-block vectorization pass. The algorithm
was
> +// inspired by that used by the Vienna MAP Vectorizor by Franchetti and
Kral,
> +// et al. It works by looking for chains of pairable operations and then
> +// pairing them.
> +//
>
+//===----------------------------------------------------------------------===//
> +
> +#define BBV_NAME "bb-vectorize"
> +#define DEBUG_TYPE BBV_NAME
> +#include "llvm/Constants.h"
> +#include "llvm/DerivedTypes.h"
> +#include "llvm/Function.h"
> +#include "llvm/Instructions.h"
> +#include "llvm/IntrinsicInst.h"
> +#include "llvm/Intrinsics.h"
> +#include "llvm/LLVMContext.h"
> +#include "llvm/Pass.h"
> +#include "llvm/Type.h"
> +#include "llvm/ADT/DenseMap.h"
> +#include "llvm/ADT/DenseSet.h"
> +#include "llvm/ADT/SmallVector.h"
> +#include "llvm/ADT/Statistic.h"
> +#include "llvm/ADT/StringExtras.h"
> +#include "llvm/Analysis/AliasAnalysis.h"
> +#include "llvm/Analysis/AliasSetTracker.h"
> +#include "llvm/Analysis/ValueTracking.h"This comes to lines later, if you want to sort it alphabetically.
> +namespace {
> +  struct BBVectorize : public BasicBlockPass {
> +    static char ID; // Pass identification, replacement for typeid
> +    BBVectorize() : BasicBlockPass(ID) {
> +      initializeBBVectorizePass(*PassRegistry::getPassRegistry());
> +    }
> +
> +    typedef std::pair<Value *, Value *>  ValuePair;
> +    typedef std::pair<ValuePair, size_t>  ValuePairWithDepth;
> +    typedef std::pair<ValuePair, ValuePair>  VPPair; // A ValuePair
pair
> +    typedef std::pair<std::multimap<Value *, Value *>::iterator,
> +              std::multimap<Value *, Value *>::iterator> 
VPIteratorPair;
> +    typedef std::pair<std::multimap<ValuePair,
ValuePair>::iterator,
> +              std::multimap<ValuePair, ValuePair>::iterator>
> +                VPPIteratorPair;
> +
> +    // FIXME: const correct?
> +
> +    bool vectorizePairs(BasicBlock&BB);
> +
> +    void getCandidatePairs(BasicBlock&BB,
> +                       std::multimap<Value *, Value *> 
&CandidatePairs,
> +                       std::vector<Value *>  &PairableInsts);
> +
> +    vojd computeConnectedPairs(std::multimap<Value *, Value *> 
&CandidatePairs,
> +                       std::vector<Value *>  &PairableInsts,
> +                       std::multimap<ValuePair, ValuePair> 
&ConnectedPairs);
> +
> +    void buildDepMap(BasicBlock&BB,
> +                       std::multimap<Value *, Value *> 
&CandidatePairs,
> +                       std::vector<Value *>  &PairableInsts,
> +                       DenseSet<ValuePair>  &PairableInstUsers);
> +
> +    void choosePairs(std::multimap<Value *, Value *> 
&CandidatePairs,
> +                        std::vector<Value *>  &PairableInsts,
> +                        std::multimap<ValuePair, ValuePair> 
&ConnectedPairs,
> +                        DenseSet<ValuePair>  &PairableInstUsers,
> +                        DenseMap<Value *, Value *>& 
ChosenPairs);
> +
> +    void fuseChosenPairs(BasicBlock&BB,
> +                     std::vector<Value *>  &PairableInsts,
> +                     DenseMap<Value *, Value *>&  ChosenPairs);
> +
> +    bool isInstVectorizable(Instruction *I, bool&IsSimpleLoadStore);
> +
> +    bool areInstsCompatible(Instruction *I, Instruction *J,
> +                       bool IsSimpleLoadStore);
> +
> +    void trackUsesOfI(DenseSet<Value *>  &Users,
> +                      AliasSetTracker&WriteSet, Instruction *I,
> +                      Instruction *J, bool&UsesI, bool UpdateUsers =
true,
> +                      std::multimap<Value *, Value *>  *LoadMoveSet
= 0);
> +
> +    void computePairsConnectedTo(
> +                      std::multimap<Value *, Value *> 
&CandidatePairs,
> +                      std::vector<Value *>  &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> 
&ConnectedPairs,
> +                      ValuePair P);
> +
> +    bool pairsConflict(ValuePair P, ValuePair Q,
> +                      DenseSet<ValuePair>  &PairableInstUsers);
> +
> +    void pruneTreeFor(
> +                      std::multimap<Value *, Value *> 
&CandidatePairs,
> +                      std::vector<Value *>  &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> 
&ConnectedPairs,
> +                      DenseSet<ValuePair>  &PairableInstUsers,
> +                      DenseMap<Value *, Value *>  &ChosenPairs,
> +                      DenseMap<ValuePair, size_t>  &Tree,
> +                      DenseSet<ValuePair>  &PrunedTree,
ValuePair J);
> +
> +    void buildInitialTreeFor(
> +                      std::multimap<Value *, Value *> 
&CandidatePairs,
> +                      std::vector<Value *>  &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> 
&ConnectedPairs,
> +                      DenseSet<ValuePair>  &PairableInstUsers,
> +                      DenseMap<Value *, Value *>  &ChosenPairs,
> +                      DenseMap<ValuePair, size_t>  &Tree,
ValuePair J);
> +
> +    void findBestTreeFor(
> +                      std::multimap<Value *, Value *> 
&CandidatePairs,
> +                      std::vector<Value *>  &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> 
&ConnectedPairs,
> +                      DenseSet<ValuePair>  &PairableInstUsers,
> +                      DenseMap<Value *, Value *>  &ChosenPairs,
> +                      DenseSet<ValuePair>  &BestTree,
size_t&BestMaxDepth,
> +                      size_t&BestEffSize, VPIteratorPair ChoiceRange);
> +
> +    Value *getReplacementPointerInput(LLVMContext&  Context,
Instruction *I,
> +                     Instruction *J, unsigned o, bool&FlipMemInputs);
> +
> +    Value *getReplacementShuffleMask(LLVMContext&  Context,
Instruction *I,
> +                     Instruction *J, unsigned o);
> +
> +    Value *getReplacementInput(LLVMContext&  Context, Instruction *I,
> +                     Instruction *J, unsigned o, bool FlipMemInputs);
> +
> +    void getReplacementInputsForPair(LLVMContext&  Context,
Instruction *I,
> +                     Instruction *J, SmallVector<Value *, 3> 
&ReplacedOperands,
> +                     bool&FlipMemInputs);
> +
> +    void replaceOutputsOfPair(LLVMContext&  Context, Instruction *I,
> +                     Instruction *J, Instruction *K,
> +                     Instruction *&InsertionPt, Instruction *&K1,
> +                     Instruction *&K2, bool&FlipMemInputs);
> +
> +    void collectPairLoadMoveSet(BasicBlock&BB,
> +                     DenseMap<Value *, Value *>  &ChosenPairs,
> +                     std::multimap<Value *, Value *> 
&LoadMoveSet,
> +                     Instruction *I, Instruction *J);
> +
> +    void collectLoadMoveSet(BasicBlock&BB,
> +                     std::vector<Value *>  &PairableInsts,
> +                     DenseMap<Value *, Value *>  &ChosenPairs,
> +                     std::multimap<Value *, Value *> 
&LoadMoveSet);
> +
> +    void moveUsesOfIAfterJ(BasicBlock&BB,
> +                     DenseMap<Value *, Value *>&  ChosenPairs,
> +                     std::multimap<Value *, Value *> 
&LoadMoveSet,
> +                     Instruction *&InsertionPt,
> +                     Instruction *I, Instruction *J);
Here I am somehow unsatisfied with the indentation, but can currently
not suggest anything better. The single operands are just too long. ;-)
> +
> +    virtual bool runOnBasicBlock(BasicBlock&BB) {
> +      bool changed = false;
> +      // Iterate a sufficient number of times to merge types of
> +      // size 1, then 2, then 4, etc. up to half of the
> +      // target vector width of the target vector register.Use all 80 columns.

Also. What is a type of size 1? Do you mean 1 bit, one byte or what?
> +      for (unsigned v = 2, n = 1; v<= VectorBits&&
> +             (!MaxIter || n<= MaxIter); v *= 2, ++n) {                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
                 This fits in the previous row.
Also, having the whole condition in one row seems easier to read.
> +        DEBUG(dbgs()<<  "BBV: fusing loop #"<< 
n<<
> +              " for "<<  BB.getName()<<  " in
"<<
> +              BB.getParent()->getName()<<  "...\n");
> +        if (vectorizePairs(BB)) {
> +          changed = true;
> +        } else {
> +          break;
> +        }Braces not needed here.
> +      }
> +
> +      DEBUG(dbgs()<<  "BBV: done!\n");
> +      return changed;
> +    }
> +
> +    virtual void getAnalysisUsage(AnalysisUsage&AU) const {
> +      BasicBlockPass::getAnalysisUsage(AU);
> +      AU.addRequired<AliasAnalysis>();
> +      AU.addRequired<ScalarEvolution>();
> +      AU.addPreserved<AliasAnalysis>();
> +      AU.addPreserved<ScalarEvolution>();
> +    }
> +
> +    // This returns the vector type that holds a pair of the provided
type.
> +    // If the provided type is already a vector, then its length is
doubled.
> +    static inline VectorType *getVecType(Type *VElemType) {
What about naming this getVecTypeForPair()?

I think ElemType is sufficient. No need for the 'V'.
> +      if (VElemType->isVectorTy()) {
> +        unsigned numElem =
cast<VectorType>(VElemType)->getNumElements();
> +        return VectorType::get(VElemType->getScalarType(), numElem*2);
> +      } else {
> +        return VectorType::get(VElemType, 2);
> +      }
> +    }
This can be:

if (VectorType *VectorTy = dyn_cast<VectorType>(ElemType)) {
   unsigned numElem = VectorTy->getNumElements();
   return VectorType::get(VElemType->getScalarType(), numElem*2);
} else {
   return VectorType::get(VElemType, 2);
}
> +    // Returns the weight associated with the provided value. A chain of
> +    // candidate pairs has a length given by the sum of the weights of its
> +    // members (one weight per pair; the weight of each member of the pair
> +    // is assumed to be the same). This length is then compared to the
> +    // chain-length threshold to determine if a given chain is significant
> +    // enough to be vectorized. The length is also used in comparing
> +    // candidate chains where longer chains are considered to be better.
> +    // Note: when this function returns 0, the resulting instructions are
> +    // not actually fused.
> +    static inline size_t getDepthFactor(Value *V) {
> +      if (isa<InsertElementInst>(V) ||
isa<ExtractElementInst>(V)) {
Why have InsertElementInst, ExtractElementInst instructions a weight of
zero?
> +        return 0;
> +      }No need for braces.
> +
> +      return 1;
> +    }
> +
> +    // This determines the relative offset of two loads or stores,
returning
> +    // true if the offset could be determined to be some constant value.
> +    // For example, if OffsetInElmts == 1, then J accesses the memory
directly
> +    // after I; if OffsetInElmts == -1 then I accesses the memory
> +    // directly after J. This function assumes that both instructions
> +    // have the same type.You may want to mention more explicit that this function returns parts
of the result in OffsetInElmts.

I think 'Offset' is enough, no need for 'OffsetInElmts'.
> +    bool getPairPtrInfo(Instruction *I, Instruction *J, const
TargetData&TD,
> +        Value *&IPtr, Value *&JPtr, unsigned&IAlignment,
unsigned&JAlignment,
> +        int64_t&OffsetInElmts) {
> +      OffsetInElmts = 0;
> +      if (isa<LoadInst>(I)) {
> +        IPtr = cast<LoadInst>(I)->getPointerOperand();
> +        JPtr = cast<LoadInst>(J)->getPointerOperand();
> +        IAlignment = cast<LoadInst>(I)->getAlignment();
> +        JAlignment = cast<LoadInst>(J)->getAlignment();
> +      } else {
> +        IPtr = cast<StoreInst>(I)->getPointerOperand();
> +        JPtr = cast<StoreInst>(J)->getPointerOperand();
> +        IAlignment = cast<StoreInst>(I)->getAlignment();
> +        JAlignment = cast<StoreInst>(J)->getAlignment();
> +      }
> +
> +      ScalarEvolution&SE = getAnalysis<ScalarEvolution>();
> +      const SCEV *IPtrSCEV = SE.getSCEV(IPtr);
> +      const SCEV *JPtrSCEV = SE.getSCEV(JPtr);
> +
> +      // If this is a trivial offset, then we'll get something like
> +      // 1*sizeof(type). With target data, which we need anyway, this will
get
> +      // constant folded into a number.
> +      const SCEV *OffsetSCEV = SE.getMinusSCEV(JPtrSCEV, IPtrSCEV);
> +      if (const SCEVConstant *ConstOffSCEV > +           
dyn_cast<SCEVConstant>(OffsetSCEV)) {
> +        ConstantInt *IntOff = ConstOffSCEV->getValue();
> +        int64_t Offset = IntOff->getSExtValue();
> +
> +        Type *VTy =
cast<PointerType>(IPtr->getType())->getElementType();
> +        int64_t VTyTSS = (int64_t) TD.getTypeStoreSize(VTy);
> +
> +        assert(VTy ==
cast<PointerType>(JPtr->getType())->getElementType());
> +
> +        OffsetInElmts = Offset/VTyTSS;
> +        return (abs64(Offset) % VTyTSS) == 0;
> +      }
> +
> +      return false;
> +    }
> +
> +    // Returns true if the provided CallInst represents an intrinsic that
can
> +    // be vectorized.
> +    bool isVectorizableIntrinsic(CallInst* I) {
> +      bool VIntr = false;
> +      if (Function *F = I->getCalledFunction()) {
> +        if (unsigned IID = F->getIntrinsicID()) {
Use early exit.

Function *F = I->getCalledFunction();

if (!F)
   return false;

unsigned IID = F->getIntrinsicID();

if (!IID)
   return false;

switch(IID) {
	case Intrinsic::sqrt:
	case Intrinsic::powi:
	case Intrinsic::sin:
	case Intrinsic::cos:
	case Intrinsic::log:
	case Intrinsic::log2:
	case Intrinsic::log10:
	case Intrinsic::exp:
	case Intrinsic::exp2:
	case Intrinsic::pow:
		return !NoMath;
	case Intrinsic::fma:
		return !NoFMA;
}

Is this switch exhaustive or are you missing a default case?
> +    // Returns true if J is the second element in some ValuePair
referenced by
> +    // some multimap ValuePair iterator pair.
> +    bool isSecondInVPIteratorPair(VPIteratorPair PairRange, Value *J) {
> +      for (std::multimap<Value *, Value *>::iterator K =
PairRange.first;
> +             K != PairRange.second; ++K) {
> +        if (K->second == J) return true;
> +      }No need for braces.
> +
> +      return false;
> +    }
> +  };
> +
> +  // This function implements one vectorization iteration on the provided
> +  // basic block. It returns true if the block is changed.
> +  bool BBVectorize::vectorizePairs(BasicBlock&BB) {
> +    std::vector<Value *>  PairableInsts;
> +    std::multimap<Value *, Value *>  CandidatePairs;
> +    getCandidatePairs(BB, CandidatePairs, PairableInsts);
> +    if (!PairableInsts.size()) return false;
if (PairableInsts.size() == 0) ...

I think it is more straightforward to compare a size to 0 than to force
the brain to implicitly convert it to bool.
> +
> +    // Now we have a map of all of the pairable instructions and we need
to               we have a map of all pairable instructions
> +    // select the best possible pairing. A good pairing is one such that
the
> +    // Users of the pair are also paired. This defines a (directed) forest	  users
           ^> +    // over the pairs such that two pairs are connected iff the second
pair
> +    // uses the first.
> +
> +    // Note that it only matters that both members of the second pair use
some
> +    // element of the first pair (to allow for splatting).
> +
> +    std::multimap<ValuePair, ValuePair>  ConnectedPairs;
> +    computeConnectedPairs(CandidatePairs, PairableInsts, ConnectedPairs);
> +    if (!ConnectedPairs.size()) return false;
if (ConnectedPairs.size() == 0) ...
> +    // Build the pairable-instruction dependency map
> +    DenseSet<ValuePair>  PairableInstUsers;
> +    buildDepMap(BB, CandidatePairs, PairableInsts, PairableInstUsers);
> +
> +    // There is now a graph of the connected pairs. For each variable,
pick the
> +    // pairing with the largest tree meeting the depth requirement on at
least
> +    // one branch. Then select all pairings that are part of that tree and
> +    // remove them from the list of available parings and pairable
variables.						 pairings
> +
> +    DenseMap<Value *, Value *>  ChosenPairs;
> +    choosePairs(CandidatePairs, PairableInsts, ConnectedPairs,
> +      PairableInstUsers, ChosenPairs);
> +
> +    if (!ChosenPairs.size()) return false;
> +    NumFusedOps += ChosenPairs.size();
> +
> +    // A set of chosen pairs has now been selected. It is now necessary to
        // A set of pairs has been chosen.> +    // replace the paired functions with vector functions. For this
procedure
You are only talking about functions. I thought the main objective is to
pair LLVM-IR _instructions_.
> +    // each argument much be replaced with a vector argument. This vector                         must

At least for instructions, I would use the term 'operand' instead of
'argument'.
> +    // is formed by using build_vector on the old arguments. The replaced
> +    // values are then replaced with a vector_extract on the result.
> +    // Subsequent optimization passes should coalesce the build/extract
> +    // combinations.
> +
> +    fuseChosenPairs(BB, PairableInsts, ChosenPairs);
> +
> +    return true;
> +  }
> +
> +  // This function returns true if the provided instruction is capable of
being
> +  // fused into a vector instruction. This determination is based only on
the
> +  // type and other attributes of the instruction.
> +  bool BBVectorize::isInstVectorizable(Instruction *I,
> +                                         bool&IsSimpleLoadStore) {
> +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
AA can become a member of the BBVectorize class and it can be
initialized once per basic block.
> +    bool VIntr = false;
> +    if (isa<CallInst>(I)) {
> +      VIntr = isVectorizableIntrinsic(cast<CallInst>(I));
> +    }Remove VIntr and use early exit.

if (CallInst *C = dyn_cast<CallInst>(I)) {
   if(!isVectorizableIntrinsic(cast<CallInst>(I))
     return false;
}

Also, no braces if not necessary in at least one branch.
> +    // Vectorize simple loads and stores if possbile:
> +    IsSimpleLoadStore = false;
> +    if (isa<LoadInst>(I)) {
> +      IsSimpleLoadStore = cast<LoadInst>(I)->isSimple();
> +    } else if (isa<StoreInst>(I)) {
> +      IsSimpleLoadStore = cast<StoreInst>(I)->isSimple();
> +    }Again, no braces needed and use early exit.

} else if (LoadInst *L = dyn_cast<LoadInst>(I)) {
   if (!L->isSimple() || NoMemOps) {
     IsSimpleLoadStore = false;
     return false;
   }
} else if (StorInst *S = dyn_cast<StoreInst>(S)) {
   if (!S->isSimple() || NoMemOps) {
     IsSimpleLoadStore = false;
     return false;
   }
}
> +    // We can vectorize casts, but not casts of pointer types, etc.
> +    bool IsCast = false;
> +    if (I->isCast()) {
> +      IsCast = true;
> +      if (!cast<CastInst>(I)->getSrcTy()->isSingleValueType())
{
> +        IsCast = false;
> +      } else if
(!cast<CastInst>(I)->getDestTy()->isSingleValueType()) {
> +        IsCast = false;
> +      } else if (cast<CastInst>(I)->getSrcTy()->isPointerTy())
{
> +        IsCast = false;
> +      } else if
(cast<CastInst>(I)->getDestTy()->isPointerTy()) {
> +        IsCast = false;
> +      }
> +    }

} else if (CastInst *C = dyn_cast<CastInst>(I)) {
   if (NoCasts)
     return false;

   Type *SrcTy = C->getSrcTy()

   if (!SrcTy->isSingleValueType() || SrcTy->isPointerTy)
     return false;

   Type *DestTy = C->getDestTy()

   if (!DestTy->isSingleValueType() || DestTy->isPointerTy)
     return false;
}
> +
> +    if (!(I->isBinaryOp() || isa<ShuffleVectorInst>(I) ||
> +            isa<ExtractElementInst>(I) ||
isa<InsertElementInst>(I) ||
> +            (!NoCasts&&  IsCast) || VIntr ||
> +            (!NoMemOps&&  IsSimpleLoadStore))) {
> +      return false;
> +    }
else if (I->isBinaryOp
	 || isa<ShuffleVectorInst>(I)
          || isa<ExtractElementInst>(I)
          || isa<InsertElementInst>(I)) {
   // Those instructions are always OK.
} else {
   // Everything else can not be handled.
   return false;
}

> +
> +    // We can't vectorize memory operations without target data
> +    if (AA.getTargetData() == 0&&  IsSimpleLoadStore) {
> +      return false;
Is this check needed at this place? I had the feeling the whole pass
does not work without target data. At least the offset calculation seems
to require it.
> +    }
> +
> +    Type *T1, *T2;
> +    if (isa<StoreInst>(I)) {
> +      // For stores, it is the value type, not the pointer type that
matters
> +      // because the value is what will come from a vector register.
> +
> +      Value *IVal = cast<StoreInst>(I)->getValueOperand();
> +      T1 = IVal->getType();
> +    } else {
> +      T1 = I->getType();
> +    }
> +
> +    if (I->isCast()) {
> +      T2 = cast<CastInst>(I)->getSrcTy();
> +    } else {
> +      T2 = T1;
> +    }No braces needed.
> +
> +    // Not every type can be vectorized...
> +    if (!(VectorType::isValidElementType(T1) || T1->isVectorTy()) ||
> +        !(VectorType::isValidElementType(T2) || T2->isVectorTy()))
> +      return false;
> +
> +    if (NoInts&&  (T1->isIntOrIntVectorTy() ||
T2->isIntOrIntVectorTy()))
> +      return false;
> +
> +    if (NoFloats&&  (T1->isFPOrFPVectorTy() ||
T2->isFPOrFPVectorTy()))
> +      return false;
> +
> +    if (T1->getPrimitiveSizeInBits()>  VectorBits/2 ||
> +        T2->getPrimitiveSizeInBits()>  VectorBits/2)
> +      return false;
> +
> +    return true;
> +  }
> +
> +  // This function returns true if the two provided instructions are
compatible
> +  // (meaning that they can be fused into a vector instruction). This
assumes
> +  // that I has already been determined to be vectorizable and that J is
not
> +  // in the use tree of I.
> +  bool BBVectorize::areInstsCompatible(Instruction *I, Instruction *J,
> +                       bool IsSimpleLoadStore) {
> +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();AA can become a member of the BBVectorize class and it can be
initialized once per basic block.
> +    bool IsCompat = J->isSameOperationAs(I);
> +    // FIXME: handle addsub-type operations!
> +
> +    // Loads and stores can be merged if they have different alignments,
> +    // but are otherwise the same.
> +    if (!IsCompat&&  isa<LoadInst>(I)&& 
isa<LoadInst>(J)) {
> +      if (I->getType() == J->getType()) {
> +        if (cast<LoadInst>(I)->getPointerOperand()->getType()
=> +             
cast<LoadInst>(J)->getPointerOperand()->getType()&&
> +            cast<LoadInst>(I)->isVolatile() => +             
cast<LoadInst>(J)->isVolatile()&&
> +            cast<LoadInst>(I)->getOrdering() => +             
cast<LoadInst>(J)->getOrdering()&&
> +            cast<LoadInst>(I)->getSynchScope() => +           
cast<LoadInst>(J)->getSynchScope()
> +           ) {
> +             IsCompat = true;
> +         }
> +      }
> +    } else if (!IsCompat&&  isa<StoreInst>(I)&& 
isa<StoreInst>(J)) {
> +      if (cast<StoreInst>(I)->getValueOperand()->getType()
=> +           
cast<StoreInst>(J)->getValueOperand()->getType()&&
> +          cast<StoreInst>(I)->getPointerOperand()->getType()
=> +           
cast<StoreInst>(J)->getPointerOperand()->getType()&&
> +          cast<StoreInst>(I)->isVolatile() => +           
cast<StoreInst>(J)->isVolatile()&&
> +          cast<StoreInst>(I)->getOrdering() => +           
cast<StoreInst>(J)->getOrdering()&&
> +          cast<StoreInst>(I)->getSynchScope() => +           
cast<StoreInst>(J)->getSynchScope()
> +         ) {
> +           IsCompat = true;
> +       }
> +    }
> +
> +    if (!IsCompat) return false;
> +
> +    if (IsSimpleLoadStore) {
> +      const TargetData&TD = *AA.getTargetData();
> +
> +      Value *IPtr, *JPtr;
> +      unsigned IAlignment, JAlignment;
> +      int64_t OffsetInElmts = 0;
> +      if (getPairPtrInfo(I, J, TD, IPtr, JPtr, IAlignment, JAlignment,
> +            OffsetInElmts)&&  abs64(OffsetInElmts) == 1) {
> +        if (AlignedOnly) {
> +          Type *aType = isa<StoreInst>(I) ?
> +            cast<StoreInst>(I)->getValueOperand()->getType() :
I->getType();
> +          // An aligned load or store is possible only if the instruction
> +          // with the lower offset has an alignment suitable for the
> +          // vector type.
> +
> +          unsigned BottomAlignment = IAlignment;
> +          if (OffsetInElmts<  0) BottomAlignment = JAlignment;
> +
> +          Type *VType = getVecType(aType);
> +          unsigned VecAlignment = TD.getPrefTypeAlignment(VType);
> +          if (BottomAlignment<  VecAlignment) {
> +            IsCompat = false;
> +          }
> +        }
> +      } else {
> +        IsCompat = false;
> +      }
> +    } else if (isa<ShuffleVectorInst>(I)) {
> +      // Only merge two shuffles if they're both constant
> +      // or both not constant.
> +      IsCompat = isa<Constant>(I->getOperand(2))&&
> +                 isa<Constant>(J->getOperand(2));
> +      // FIXME: We may want to vectorize non-constant shuffles also.
> +    }
> +
> +    return IsCompat;
Here again for the whole function. Can you try to use early exit to
simplify the code http://llvm.org/docs/CodingStandards.html#hl_earlyexit
You might be able to remove the use of IsCompat completely.
> +  }
> +
> +  // Figure out whether or not J uses I and update the users and write-set
> +  // structures associated with I.
You may want to give I and J more descriptive names. I remember I asked
you the last time to rename One and Two to I and J, but in this case it
seems both have very specific roles which could be used to describe
them. What about 'Inst' and 'User'?

Also what are the arguments of this function? Can you explain them a
little?
> +  void BBVectorize::trackUsesOfI(DenseSet<Value *>  &Users,
> +                       AliasSetTracker&WriteSet, Instruction *I,
> +                       Instruction *J, bool&UsesI, bool UpdateUsers,
You may want to return UsesI as a function return value not as a
reference parameter, though you may need to updated the function name
accordingly.
> +                       std::multimap<Value *, Value *> 
*LoadMoveSet) {
> +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> +
> +    UsesI = false;
> +
> +    // This instruction may already be marked as a user due, for example,
to
> +    // being a member of a selected pair.
> +    if (Users.count(J))
> +      UsesI = true;
> +
> +    if (!UsesI) for (User::op_iterator JU = J->op_begin(), e =
J->op_end();Use a new line for this 'for'.
> +           JU != e; ++JU) {
> +      Value *V = *JU;
> +      if (I == V || Users.count(V)) {
> +        UsesI = true;
> +        break;
> +      }
> +    }
> +    if (!UsesI&&  J->mayReadFromMemory()) {
> +      if (LoadMoveSet) {
> +        VPIteratorPair JPairRange = LoadMoveSet->equal_range(J);
> +        UsesI = isSecondInVPIteratorPair(JPairRange, I);
> +      }
> +      else for (AliasSetTracker::iterator W = WriteSet.begin(),Use a new line for this 'for'.
> +            WE = WriteSet.end(); W != WE; ++W) {
> +        for (AliasSet::iterator A = W->begin(), AE = W->end();
> +               A != AE; ++A) {
> +          AliasAnalysis::Location ptrLoc(A->getValue(),
A->getSize(),
> +            A->getTBAAInfo());Align this argument with the first argument.

           AliasAnalysis::Location ptrLoc(A->getValue(), A->getSize(),
           				 A->getTBAAInfo());
> +          if (AA.getModRefInfo(J, ptrLoc) != AliasAnalysis::NoModRef) {
> +            UsesI = true;
> +            break;
> +          }
> +        }
> +        if (UsesI) break;
> +      }
> +    }
> +    if (UsesI&&  UpdateUsers) {
> +      if (J->mayWriteToMemory()) WriteSet.add(J);
> +      Users.insert(J);
> +    }
> +  }
> +
> +  // This function iterates over all instruction pairs in the provided
> +  // basic block and collects all candidate pairs for vectorization.
> +  void BBVectorize::getCandidatePairs(BasicBlock&BB,
> +                       std::multimap<Value *, Value *> 
&CandidatePairs,
> +                       std::vector<Value *>  &PairableInsts) {
> +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> +    BasicBlock::iterator E = BB.end();
> +    for (BasicBlock::iterator I = BB.getFirstInsertionPt(); I != E; ++I) {
> +      bool IsSimpleLoadStore;
> +      if (!isInstVectorizable(I, IsSimpleLoadStore)) continue;
> +
> +      // Look for an instruction with which to pair instruction *I...
> +      DenseSet<Value *>  Users;
> +      AliasSetTracker WriteSet(AA);
> +      BasicBlock::iterator J = I; ++J;
> +      for (unsigned ss = 0; J != E&&  ss<= SearchLimit; ++J,
++ss) {
> +        // Determine if J uses I, if so, exit the loop.
> +        bool UsesI;
> +        trackUsesOfI(Users, WriteSet, I, J, UsesI, !FastDep);
'UsesI' can be eliminated if the corresponding value is passed as
function return value.
> +        if (FastDep) {
> +          // Note: For this heuristic to be effective, independent
operations
> +          // must tend to be intermixed. This is likely to be true from
some
> +          // kinds of grouped loop unrolling (but not the generic LLVM
pass),
> +          // but otherwise may require some kind of reordering pass.
> +
> +          // When using fast dependency analysis,
> +          // stop searching after first use:
> +          if (UsesI) break;
> +        } else {
> +          if (UsesI) continue;
> +        }
> +
> +        // J does not use I, and comes before the first use of I, so it
can be
> +        // merged with I if the instructions are compatible.
> +        if (!areInstsCompatible(I, J, IsSimpleLoadStore)) continue;
> +
> +        // J is a candidate for merging with I.
> +        if (!PairableInsts.size() ||
> +             PairableInsts[PairableInsts.size()-1] != I) {
> +          PairableInsts.push_back(I);
> +        }
> +        CandidatePairs.insert(ValuePair(I, J));
> +        DEBUG(dbgs()<<  "BBV: candidate pair "<< 
*I<<
> +                     "<->  "<<  *J<< 
"\n");
> +      }
> +    }
> +
> +    DEBUG(dbgs()<<  "BBV: found "<< 
PairableInsts.size()
> +<<  " instructions with candidate pairs\n");
> +  }
> +
> +  // Finds candidate pairs connected to the pair P =<PI, PJ>. This
means that
> +  // it looks for pairs such that both members have an input which is an
> +  // output of PI or PJ.
> +  void BBVectorize::computePairsConnectedTo(
> +                      std::multimap<Value *, Value *> 
&CandidatePairs,
> +                      std::vector<Value *>  &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> 
&ConnectedPairs,
> +                      ValuePair P) {
> +    // For each possible pairing for this variable, look at the uses of
> +    // the first value...
> +    for (Value::use_iterator I = P.first->use_begin(),
> +         E = P.first->use_end(); I != E; ++I) {
> +      VPIteratorPair IPairRange = CandidatePairs.equal_range(*I);
> +
> +      // For each use of the first variable, look for uses of the second
> +      // variable...
> +      for (Value::use_iterator J = P.second->use_begin(),
> +           E2 = P.second->use_end(); J != E2; ++J) {
> +        VPIteratorPair JPairRange = CandidatePairs.equal_range(*J);
> +
> +        // Look for<I, J>:
> +        if (isSecondInVPIteratorPair(IPairRange, *J))
> +          ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J)));
> +
> +        // Look for<J, I>:
> +        if (isSecondInVPIteratorPair(JPairRange, *I))
> +          ConnectedPairs.insert(VPPair(P, ValuePair(*J, *I)));
> +      }
> +
> +      // Look for cases where just the first value in the pair is used by
> +      // both members of another pair (splatting).
> +      Value::use_iterator J = P.first->use_begin();
> +      if (!SplatBreaksChain) for (; J != E; ++J) {This can become an early continue. Also please do not put an 'if' and a
'for' into the same line.
> +        if (isSecondInVPIteratorPair(IPairRange, *J))
> +          ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J)));
> +      }
> +    }
> +
> +    // Look for cases where just the second value in the pair is used by
> +    // both members of another pair (splatting).
> +    if (!SplatBreaksChain)This can become an early return.
> +      for (Value::use_iterator I = P.second->use_begin(),
> +         E = P.second->use_end(); I != E; ++I) {
> +      VPIteratorPair IPairRange = CandidatePairs.equal_range(*I);
> +
> +      Value::use_iterator J = P.second->use_begin();
> +      for (; J != E; ++J) {Why not move initialization into 'for'?
> +        if (isSecondInVPIteratorPair(IPairRange, *J))
> +          ConnectedPairs.insert(VPPair(P, ValuePair(*I, *J)));
> +      }
> +    }
> +  }
> +
> +  // This function figures out which pairs are connected.  Two pairs are
> +  // connected if some output of the first pair forms an input to both
members
> +  // of the second pair.
> +  void BBVectorize::computeConnectedPairs(
> +                      std::multimap<Value *, Value *> 
&CandidatePairs,
> +                      std::vector<Value *>  &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> 
&ConnectedPairs) {
> +
> +    for (std::vector<Value *>::iterator PI = PairableInsts.begin(),
> +           PE = PairableInsts.end(); PI != PE; ++PI) {
> +      VPIteratorPair choiceRange = CandidatePairs.equal_range(*PI);
> +
> +      for (std::multimap<Value *, Value *>::iterator P =
choiceRange.first;
> +           P != choiceRange.second; ++P) {
> +        computePairsConnectedTo(CandidatePairs, PairableInsts,
> +          ConnectedPairs, *P);Align CandidatePairs and ConnectedPairs.
> +      }No braces needed here.
> +    }
> +
> +    DEBUG(dbgs()<<  "BBV: found "<< 
ConnectedPairs.size()
> +<<  " pair connections.\n");
> +  }
> +
> +  // This function builds a set of use tuples such that<A, B>  is in
the set
> +  // if B is in the use tree of A. If B is in the use tree of A, then B
> +  // depends on the output of A.
> +  void BBVectorize::buildDepMap(
> +                      BasicBlock&BB,
> +                      std::multimap<Value *, Value *> 
&CandidatePairs,
> +                      std::vector<Value *>  &PairableInsts,
> +                      DenseSet<ValuePair>  &PairableInstUsers) {
> +    DenseSet<Value *>  IsInPair;
> +    for (std::multimap<Value *, Value *>::iterator C =
CandidatePairs.begin(),
> +           E = CandidatePairs.end(); C != E; ++C) {
> +      IsInPair.insert(C->first);
> +      IsInPair.insert(C->second);
> +    }
> +
> +    // Iterate through the basic block, recording all Users of each
> +    // pairable instruction.
> +
> +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> +    BasicBlock::iterator E = BB.end();
> +    for (BasicBlock::iterator I = BB.getFirstInsertionPt(); I != E; ++I) {
> +      if (IsInPair.find(I) == IsInPair.end()) continue;
> +
> +      DenseSet<Value *>  Users;
> +      AliasSetTracker WriteSet(AA);
> +      BasicBlock::iterator J = I; ++J;
> +      for (; J != E; ++J) {Move the initialization into the for loop.

	for (BasicBlock::iterator J = I + 1; J != E; ++J)
> +        bool UsesI;
> +        trackUsesOfI(Users, WriteSet, I, J, UsesI);
> +      }
> +
> +      for (DenseSet<Value *>::iterator U = Users.begin(), E =
Users.end();
> +             U != E; ++U) {	      ^^ Check indentation here.
> +        PairableInstUsers.insert(ValuePair(I, *U));
> +      }No braces needed.
> +    }
> +  }
> +
> +  // Returns true if an input to pair P is an output of pair Q and also an
> +  // input of pair Q is an output of pair P. If this is the case, then
these
> +  // two pairs cannot be simultaneously fused.
> +  bool BBVectorize::pairsConflict(ValuePair P, ValuePair Q,
> +                      DenseSet<ValuePair>  &PairableInstUsers) {
> +    // Two pairs are in conflict if they are mutual Users of eachother.
> +    bool QUsesP = PairableInstUsers.count(ValuePair(P.first,  Q.first)) 
||
> +                  PairableInstUsers.count(ValuePair(P.first,  Q.second))
||
> +                  PairableInstUsers.count(ValuePair(P.second, Q.first)) 
||
> +                  PairableInstUsers.count(ValuePair(P.second, Q.second));
> +    bool PUsesQ = PairableInstUsers.count(ValuePair(Q.first,  P.first)) 
||
> +                  PairableInstUsers.count(ValuePair(Q.first,  P.second))
||
> +                  PairableInstUsers.count(ValuePair(Q.second, P.first)) 
||
> +                  PairableInstUsers.count(ValuePair(Q.second, P.second));
> +    return (QUsesP&&  PUsesQ);
> +  }
> +
> +  // This function builds the initial tree of connected pairs with the
> +  // pair J at the root.
> +  void BBVectorize::buildInitialTreeFor(
> +                      std::multimap<Value *, Value *> 
&CandidatePairs,
> +                      std::vector<Value *>  &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> 
&ConnectedPairs,
> +                      DenseSet<ValuePair>  &PairableInstUsers,
> +                      DenseMap<Value *, Value *>  &ChosenPairs,
> +                      DenseMap<ValuePair, size_t>  &Tree,
ValuePair J) {
> +    // Each of these pairs is viewed as the root node of a Tree. The Tree
> +    // is then walked (depth-first). As this happens, we keep track of
> +    // the pairs that compose the Tree and the maximum depth of the Tree.
> +    std::vector<ValuePairWithDepth>  Q;
> +    // General depth-first post-order traversal:
> +    Q.push_back(ValuePairWithDepth(J, getDepthFactor(J.first)));
> +    while (!Q.empty()) {
> +      ValuePairWithDepth QTop = Q.back();
> +
> +      // Push each child onto the queue:
> +      bool MoreChildren = false;
> +      size_t MaxChildDepth = QTop.second;
> +      VPPIteratorPair qtRange = ConnectedPairs.equal_range(QTop.first);
> +      for (std::map<ValuePair, ValuePair>::iterator k =
qtRange.first;
> +           k != qtRange.second; ++k) {
> +        // Make sure that this child pair is still a candidate:
> +        bool IsStillCand = false;
> +        VPIteratorPair checkRange > +         
CandidatePairs.equal_range(k->second.first);
> +        for (std::multimap<Value *, Value *>::iterator m =
checkRange.first;
> +             m != checkRange.second; ++m) {
> +          if (m->second == k->second.second) {
> +            IsStillCand = true;
> +            break;
> +          }
> +        }
> +
> +        if (IsStillCand) {
> +          DenseMap<ValuePair, size_t>::iterator C =
Tree.find(k->second);
> +          if (C == Tree.end()) {
> +            size_t d = getDepthFactor(k->second.first);
> +            Q.push_back(ValuePairWithDepth(k->second, QTop.second+d));
> +            MoreChildren = true;
> +          } else {
> +            MaxChildDepth = std::max(MaxChildDepth, C->second);
> +          }
> +        }
> +      }
> +
> +      if (!MoreChildren) {
> +        // Record the current pair as part of the Tree:
> +        Tree.insert(ValuePairWithDepth(QTop.first, MaxChildDepth));
> +        Q.pop_back();
> +      }
> +    }
> +  }
> +
> +  // Given some initial tree, prune it by removing conflicting pairs
(pairs
> +  // that cannot be simultaneously chosen for vectorization).
> +  void BBVectorize::pruneTreeFor(
> +                      std::multimap<Value *, Value *> 
&CandidatePairs,
> +                      std::vector<Value *>  &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> 
&ConnectedPairs,
> +                      DenseSet<ValuePair>  &PairableInstUsers,
> +                      DenseMap<Value *, Value *>  &ChosenPairs,
> +                      DenseMap<ValuePair, size_t>  &Tree,
> +                      DenseSet<ValuePair>  &PrunedTree,
ValuePair J) {
> +    std::vector<ValuePairWithDepth>  Q;
> +    // General depth-first post-order traversal:
> +    Q.push_back(ValuePairWithDepth(J, getDepthFactor(J.first)));
> +    while (!Q.empty()) {
> +      ValuePairWithDepth QTop = Q.back();
> +      PrunedTree.insert(QTop.first);
> +      Q.pop_back();
> +
> +      // Visit each child, pruning as necessary...
> +      DenseMap<ValuePair, size_t>  BestChilden;
> +      VPPIteratorPair QTopRange = ConnectedPairs.equal_range(QTop.first);
> +      for (std::map<ValuePair, ValuePair>::iterator K =
QTopRange.first;
> +           K != QTopRange.second; ++K) {
> +        DenseMap<ValuePair, size_t>::iterator C =
Tree.find(K->second);
> +        if (C == Tree.end()) continue;
> +
> +        // This child is in the Tree, now we need to make sure it is the
> +        // best of any conflicting children. There could be multiple
> +        // conflicting children, so first, determine if we're keeping
> +        // this child, then delete conflicting children as necessary.
> +
> +        // It is also necessary to guard against pairing-induced
> +        // dependencies. Consider instructions a .. x .. y .. b
> +        // such that (a,b) are to be fused and (x,y) are to be fused
> +        // but a is an input to x and b is an output from y. This
> +        // means that y cannot be moved after b but x must be moved
> +        // after b for (a,b) to be fused. In other words, after
> +        // fusing (a,b) we have y .. a/b .. x where y is an input
> +        // to a/b and x is an output to a/b: x and y can no longer
> +        // be legally fused. To prevent this condition, we must
> +        // make sure that a child pair added to the Tree is not
> +        // both an input and output of an already-selected pair.
> +
> +        bool CanAdd = true;
The use of CanAdd can be improved with early continues.
> +        for (DenseMap<ValuePair, size_t>::iterator C2
> +              = BestChilden.begin(), E2 = BestChilden.end();
> +             C2 != E2; ++C2) {
> +          if (C2->first.first == C->first.first ||
> +              C2->first.first == C->first.second ||
> +              C2->first.second == C->first.first ||
> +              C2->first.second == C->first.second ||
> +              pairsConflict(C2->first, C->first, PairableInstUsers))
{
> +            if (C2->second>= C->second) {
> +              CanAdd = false;
> +              break;
> +            }
> +          }
This loop
> +        }
> +
> +        // Even worse, this child could conflict with another node already
> +        // selected for the Tree. If that is the case, ignore this child.
> +        if (CanAdd)
> +          for (DenseSet<ValuePair>::iterator T = PrunedTree.begin(),
> +                 E2 = PrunedTree.end(); T != E2; ++T) {
> +            if (T->first == C->first.first ||
> +                T->first == C->first.second ||
> +                T->second == C->first.first ||
> +                T->second == C->first.second ||
> +                pairsConflict(*T, C->first, PairableInstUsers)) {
> +              CanAdd = false;
> +              break;
> +            }
> +          }
and this loop
> +
> +        // And check the queue too...
> +        if (CanAdd)
> +          for (std::vector<ValuePairWithDepth>::iterator C2 =
Q.begin(),
> +                E2 = Q.end(); C2 != E2; ++C2) {
> +            if (C2->first.first == C->first.first ||
> +                C2->first.first == C->first.second ||
> +                C2->first.second == C->first.first ||
> +                C2->first.second == C->first.second ||
> +                pairsConflict(C2->first, C->first,
PairableInstUsers)) {
> +              CanAdd = false;
> +              break;
> +            }
and this loop
> +          }
> +
> +        // Last but not least, check for a conflict with any of the
> +        // already-chosen pairs.
> +        if (CanAdd)
> +          for (DenseMap<Value *, Value *>::iterator C2 > +       
ChosenPairs.begin(), E2 = ChosenPairs.end();
> +               C2 != E2; ++C2) {
> +            if (pairsConflict(*C2, C->first, PairableInstUsers)) {
> +              CanAdd = false;
> +              break;
(and almost this one)
> +            }
> +          }
> +
> +        if (CanAdd) {
> +          // This child can be added, but we may have chosen it in
preference
> +          // to an already-selected child. Check for this here, and if a
> +          // conflict is found, then remove the previously-selected child
> +          // before adding this one in its place.
> +          for (DenseMap<ValuePair, size_t>::iterator C2
> +                = BestChilden.begin(), E2 = BestChilden.end();
> +               C2 != E2;) {
> +            if (C2->first.first == C->first.first ||
> +                C2->first.first == C->first.second ||
> +                C2->first.second == C->first.first ||
> +                C2->first.second == C->first.second ||
> +                pairsConflict(C2->first, C->first,
PairableInstUsers)) {
> +              BestChilden.erase(C2++);
> +            } else {
> +              ++C2;and this loop look entirely the same. Can they be moved into a helper
function?

> +            }
> +          }
> +
> +          BestChilden.insert(ValuePairWithDepth(C->first,
C->second));
> +        }
> +      }
> +
> +      for (DenseMap<ValuePair, size_t>::iterator C
> +            = BestChilden.begin(), E2 = BestChilden.end();
> +           C != E2; ++C) {
> +        size_t DepthF = getDepthFactor(C->first.first);
> +        Q.push_back(ValuePairWithDepth(C->first, QTop.second+DepthF));
> +      }
> +    }
> +  }
> +
> +  // This function finds the best tree of mututally-compatible connected
> +  // pairs, given the choice of root pairs as an iterator range.
> +  void BBVectorize::findBestTreeFor(
> +                      std::multimap<Value *, Value *> 
&CandidatePairs,
> +                      std::vector<Value *>  &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> 
&ConnectedPairs,
> +                      DenseSet<ValuePair>  &PairableInstUsers,
> +                      DenseMap<Value *, Value *>  &ChosenPairs,
> +                      DenseSet<ValuePair>  &BestTree,
size_t&BestMaxDepth,
> +                      size_t&BestEffSize, VPIteratorPair ChoiceRange)
{
> +    for (std::multimap<Value *, Value *>::iterator J =
ChoiceRange.first;
> +         J != ChoiceRange.second; ++J) {
> +
> +      // Before going any further, make sure that this pair does not
> +      // conflict with any already-selected pairs (see comment below
> +      // near the Tree pruning for more details).
> +      bool DoesConflict = false;
> +      for (DenseMap<Value *, Value *>::iterator C =
ChosenPairs.begin(),
> +            E = ChosenPairs.end(); C != E; ++C) {
> +        if (pairsConflict(*C, *J, PairableInstUsers)) {
> +          DoesConflict = true;
> +          break;
> +        }
> +      }
> +      if (DoesConflict) continue;
> +
> +      DenseMap<ValuePair, size_t>  Tree;
> +      buildInitialTreeFor(CandidatePairs, PairableInsts, ConnectedPairs,
> +        PairableInstUsers, ChosenPairs, Tree, *J);Fix indentation. (Align CandidatePairs and PairableInstUsers)
> +
> +      // Because we'll keep the child with the largest depth, the
largest
> +      // depth is still the same in the unpruned Tree.
> +      size_t MaxDepth = Tree.lookup(*J);
> +
> +      DEBUG(dbgs()<<  "BBV: found Tree for pair {"<<
*J->first<<
> +                   "<->  "<<  *J->second<< 
"} of depth "<<
> +                   MaxDepth<<  " and size "<< 
Tree.size()<<  "\n");
> +
> +      // At this point the Tree has been constructed, but, may contain
> +      // contradictory children (meaning that different children of
> +      // some tree node may be attempting to fuse the same instruction).
> +      // So now we walk the tree again, in the case of a conflict,
> +      // keep only the child with the largest depth. To break a tie,
> +      // favor the first child.
> +
> +      DenseSet<ValuePair>  PrunedTree;
> +      pruneTreeFor(CandidatePairs, PairableInsts, ConnectedPairs,
> +        PairableInstUsers, ChosenPairs, Tree, PrunedTree, *J);
> +
> +      size_t EffSize = 0;
> +      for (DenseSet<ValuePair>::iterator S = PrunedTree.begin(),
> +            E = PrunedTree.end(); S != E; ++S) {
> +        EffSize += getDepthFactor(S->first);
> +      }No braces.
> +
> +      DEBUG(dbgs()<<  "BBV: found pruned Tree for pair
{"<<  *J->first<<
> +             "<->  "<<  *J->second<< 
"} of depth "<<
> +             MaxDepth<<  " and size "<< 
PrunedTree.size()<<
> +            " (effective size: "<<  EffSize<< 
")\n");
> +      if (MaxDepth>= ReqChainDepth&&  EffSize>  BestEffSize)
{
> +        BestMaxDepth = MaxDepth;
> +        BestEffSize = EffSize;
> +        BestTree = PrunedTree;
> +      }
> +    }
> +  }
> +
> +  // Given the list of candidate pairs, this function selects those
> +  // that will be fused into vector instructions.
> +  void BBVectorize::choosePairs(
> +                      std::multimap<Value *, Value *> 
&CandidatePairs,
> +                      std::vector<Value *>  &PairableInsts,
> +                      std::multimap<ValuePair, ValuePair> 
&ConnectedPairs,
> +                      DenseSet<ValuePair>  &PairableInstUsers,
> +                      DenseMap<Value *, Value *>&  ChosenPairs)
{
> +    for (std::vector<Value *>::iterator I = PairableInsts.begin(),
> +           E = PairableInsts.end(); I != E; ++I) {
> +      // The number of possible pairings for this variable:
> +      size_t NumChoices = CandidatePairs.count(*I);
> +      if (!NumChoices) continue;
> +
> +      VPIteratorPair ChoiceRange = CandidatePairs.equal_range(*I);
> +
> +      // The best pair to choose and its tree:
> +      size_t BestMaxDepth = 0, BestEffSize = 0;
> +      DenseSet<ValuePair>  BestTree;
> +      findBestTreeFor(CandidatePairs, PairableInsts, ConnectedPairs,
> +        PairableInstUsers, ChosenPairs, BestTree, BestMaxDepth,
> +        BestEffSize, ChoiceRange);
Fix indentation. Align CandidatePairs, ...
> +
> +      // A tree has been chosen (or not) at this point. If no tree was
> +      // chosen, then this instruction, I, cannot be paired (and is no
longer
> +      // considered).
> +
> +      for (DenseSet<ValuePair>::iterator S = BestTree.begin(),
> +           SE = BestTree.end(); S != SE; ++S) {
> +        // Insert the members of this tree into the list of chosen pairs.
> +        ChosenPairs.insert(ValuePair(S->first, S->second));
> +        DEBUG(dbgs()<<  "BBV: selected pair: "<< 
*S->first<<  "<->  "<<
> +               *S->second<<  "\n");
> +
> +        // Remove all candidate pairs that have values in the chosen tree.
> +        for (std::multimap<Value *, Value *>::iterator K > +     
CandidatePairs.begin(),
> +             E2 = CandidatePairs.end(); K != E2;) {
> +          if (K->first == S->first || K->second == S->first ||
> +              K->second == S->second || K->first == S->second)
{
> +            // Don't remove the actual pair chosen so that it can be
used
> +            // in subsequent tree selections.
> +            if (!(K->first == S->first&&  K->second ==
S->second)) {
> +              CandidatePairs.erase(K++);
> +            } else {
> +              ++K;
> +            }No braces needed.
> +          } else {
> +            ++K;
> +          }These are needed.
> +        }
> +      }
> +    }
> +
> +    DEBUG(dbgs()<<  "BBV: selected "<< 
ChosenPairs.size()<<  " pairs.\n");
> +  }
> +
> +  // Returns the value that is to be used as the pointer input to the
vector
> +  // instruction that fuses I with J.
> +  Value *BBVectorize::getReplacementPointerInput(LLVMContext& 
Context,
> +                     Instruction *I, Instruction *J, unsigned o,
> +                     bool&FlipMemInputs) {
> +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> +    const TargetData&TD = *AA.getTargetData();
> +
> +    Value *IPtr, *JPtr;
> +    unsigned IAlignment, JAlignment;
> +    int64_t OffsetInElmts;
> +    (void) getPairPtrInfo(I, J, TD, IPtr, JPtr, IAlignment, JAlignment,
> +             OffsetInElmts);Fix indentation.
> +
> +    // The pointer value is taken to be the one with the lowest offset.
> +    Value *VPtr;
> +    if (OffsetInElmts>  0) {
> +      VPtr = IPtr;
> +    } else {
> +      FlipMemInputs = true;
> +      VPtr = JPtr;
> +    }
> +
> +    Type *ArgType =
cast<PointerType>(IPtr->getType())->getElementType();
> +    Type *VArgType = getVecType(ArgType);
> +    Type *VArgPtrType = PointerType::get(VArgType,
> +      cast<PointerType>(IPtr->getType())->getAddressSpace());
> +    return new BitCastInst(VPtr, VArgPtrType,
> +                        I->hasName() ?
> +                          I->getName() + ".v.i" + utostr(o) :
"",
> +                        /* insert before */ FlipMemInputs ? J : I);
> +  }
> +
> +  // Returns the value that is to be used as the vector-shuffle mask to
the
> +  // vector instruction that fuses I with J.
> +  Value *BBVectorize::getReplacementShuffleMask(LLVMContext&  Context,
> +                     Instruction *I, Instruction *J, unsigned o) {
> +    // This is the shuffle mask. We need to append the second
> +    // mask to the first, and the numbers need to be adjusted.
> +
> +    Type *ArgType = I->getType();
> +    Type *VArgType = getVecType(ArgType);
> +
> +    // Get the total number of elements in the fused vector type.
> +    // By definition, this must equal the number of elements in
> +    // the final mask.
> +    unsigned NumElem =
cast<VectorType>(VArgType)->getNumElements();
> +    std::vector<Constant*>  Mask(NumElem);
> +
> +    Type *OpType = I->getOperand(0)->getType();
> +    unsigned numInElem =
cast<VectorType>(OpType)->getNumElements();
> +
> +    // For the mask from the first pair...
> +    for (unsigned v = 0; v<  NumElem/2; ++v) {
> +      int m = cast<ShuffleVectorInst>(I)->getMaskValue(v);
> +      if (m<  0) {
> +        Mask[v] = UndefValue::get(Type::getInt32Ty(Context));
> +      } else {
> +        unsigned vv = v;
> +        if (m>= (int) numInElem) {
> +          vv += (int) numInElem;
> +        }No braces needed.
> +
> +        Mask[v] = ConstantInt::get(Type::getInt32Ty(Context), vv);
> +      }
> +    }
> +
> +    // For the mask from the second pair...
> +    for (unsigned v = 0; v<  NumElem/2; ++v) {
> +      int m = cast<ShuffleVectorInst>(J)->getMaskValue(v);
> +      if (m<  0) {
> +        Mask[v+NumElem/2] = UndefValue::get(Type::getInt32Ty(Context));
> +      } else {
> +        unsigned vv = v + (int) numInElem;
> +        if (m>= (int) numInElem) {
> +          vv += (int) numInElem;
> +        }No braces needed.
> +
> +        Mask[v+NumElem/2] > +         
ConstantInt::get(Type::getInt32Ty(Context), vv);
> +      }
> +    }
Those two for loops look very similar. Any chance they can be extracted
into a helper function?
> +
> +    return ConstantVector::get(Mask);
> +  }
> +
> +  // Returns the value to be used as the specified operand of the vector
> +  // instruction that fuses I with J.
> +  Value *BBVectorize::getReplacementInput(LLVMContext&  Context,
Instruction *I,
> +                     Instruction *J, unsigned o, bool FlipMemInputs) {
Indentation.
> +    Value *CV0 = ConstantInt::get(Type::getInt32Ty(Context), 0);
> +    Value *CV1 = ConstantInt::get(Type::getInt32Ty(Context), 1);
> +
> +      // Compute the fused vector type for this operand
> +    Type *ArgType = I->getOperand(o)->getType();
> +    VectorType *VArgType = getVecType(ArgType);
> +
> +    Instruction *L = I, *H = J;
> +    if (FlipMemInputs) {
> +      L = J;
> +      H = I;
> +    }
> +
> +    if (ArgType->isVectorTy()) {
> +      unsigned numElem =
cast<VectorType>(VArgType)->getNumElements();
> +      std::vector<Constant*>  Mask(numElem);
> +      for (unsigned v = 0; v<  numElem; ++v) {
> +        Mask[v] = ConstantInt::get(Type::getInt32Ty(Context), v);
> +      }No braces.
> +
> +      Instruction *BV = new ShuffleVectorInst(L->getOperand(o),
> +                                              H->getOperand(o),
> +                                              ConstantVector::get(Mask),
> +                                              I->hasName() ?
> +                                                I->getName() +
".v.i" +
> +                                                  utostr(o) :
"");
It may be better to not squeeze

I->hasName() ?  I->getName() + ".v.i" + utostr(o) : ""

into this operand.
> +      BV->insertBefore(J);
> +      return BV;
> +    }
> +
> +    // If these two inputs are the output of another vector instruction,
> +    // then we should use that output directly. It might be necessary to
> +    // permute it first. [When pairings are fused recursively, you can
> +    // end up with cases where a large vector is decomposed into scalars
> +    // using extractelement instructions, then built into size-2
> +    // vectors using insertelement and the into larger vectors using
> +    // shuffles. InstCombine does not simplify all of these cases well,
> +    // and so we make sure that shuffles are generated here when possible.
> +    ExtractElementInst *LEE
> +      = dyn_cast<ExtractElementInst>(L->getOperand(o));
> +    ExtractElementInst *HEE
> +      = dyn_cast<ExtractElementInst>(H->getOperand(o));
> +
> +    if (LEE&&  HEE&&
> +        LEE->getOperand(0)->getType() == VArgType) {
> +      unsigned LowIndx =
cast<ConstantInt>(LEE->getOperand(1))->getZExtValue();
> +      unsigned HighIndx =
cast<ConstantInt>(HEE->getOperand(1))->getZExtValue();
> +      if (LEE->getOperand(0) == HEE->getOperand(0)) {
> +        if (LowIndx == 0&&  HighIndx == 1)
> +          return LEE->getOperand(0);
> +
> +        std::vector<Constant*>  Mask(2);
> +        Mask[0] = ConstantInt::get(Type::getInt32Ty(Context), LowIndx);
> +        Mask[1] = ConstantInt::get(Type::getInt32Ty(Context), HighIndx);
> +
> +        Instruction *BV = new ShuffleVectorInst(LEE->getOperand(0),
> +                                            UndefValue::get(VArgType),
> +                                            ConstantVector::get(Mask),
> +                                            I->hasName() ?
> +                                              I->getName() +
".v.i" +
> +                                                utostr(o) : "");
Do not squeeze

I->hasName() ?  I->getName() + ".v.i" + utostr(o) : ""

into the operand.
> +        BV->insertBefore(J);
> +        return BV;
> +      }
> +
> +      std::vector<Constant*>  Mask(2);
> +      HighIndx += VArgType->getNumElements();
> +      Mask[0] = ConstantInt::get(Type::getInt32Ty(Context), LowIndx);
> +      Mask[1] = ConstantInt::get(Type::getInt32Ty(Context), HighIndx);
> +
> +      Instruction *BV = new ShuffleVectorInst(LEE->getOperand(0),
> +                                          HEE->getOperand(0),
> +                                          ConstantVector::get(Mask),
> +                                          I->hasName() ?
> +                                            I->getName() +
".v.i" +
> +                                              utostr(o) : "");No squeezing please.
> +      BV->insertBefore(J);
> +      return BV;
> +    }
> +
> +    Instruction *BV1 = InsertElementInst::Create(
> +                                               UndefValue::get(VArgType),
> +                                               L->getOperand(o), CV0,
> +                                               I->hasName() ?
> +                                                 I->getName() +
".v.i" +
> +                                                   utostr(o) +
".1"
> +                                               : "");No squeezing.
> +    BV1->insertBefore(I);
> +    Instruction *BV2 = InsertElementInst::Create(BV1, H->getOperand(o),
> +                                               CV1, I->hasName() ?
> +                                                 I->getName() +
> +                                                   ".v.i" +
utostr(o)
> +                                                   + ".2" :
"");
No squeezing.

Also, the stuff that you squeeze in looks very similar or even
identical. Do you really write the whole expression 4 times in this
function? Is there no better option?
> +    BV2->insertBefore(J);
> +    return BV2;
> +  }
> +
> +  // This function creates an array of values that will be used as the
inputs
> +  // to the vector instruction that fuses I with J.
> +  void BBVectorize::getReplacementInputsForPair(LLVMContext&  Context,
> +                     Instruction *I, Instruction *J,
> +                     SmallVector<Value *, 3>  &ReplacedOperands,
> +                     bool&FlipMemInputs) {
> +    FlipMemInputs = false;
> +    unsigned NumOperands = I->getNumOperands();
> +
> +    for (unsigned p = 0, o = NumOperands-1; p<  NumOperands; ++p, --o)
{
> +      // Iterate backward so that we look at the store pointer
> +      // first and know whether or not we need to flip the inputs.
> +
> +      if (isa<LoadInst>(I) || (o == 1&& 
isa<StoreInst>(I))) {
> +        // This is the pointer for a load/store instruction.
> +        ReplacedOperands[o] = getReplacementPointerInput(Context, I, J, o,
> +                                FlipMemInputs);
> +        continue;
> +      } else if (isa<CallInst>(I)&&  o == NumOperands-1) {
> +        Function *F = cast<CallInst>(I)->getCalledFunction();
> +        unsigned IID = F->getIntrinsicID();
> +        BasicBlock&BB = *cast<Instruction>(I)->getParent();
> +
> +        Module *M = BB.getParent()->getParent();
> +        Type *ArgType = I->getType();
> +        Type *VArgType = getVecType(ArgType);
> +
> +        // FIXME: is it safe to do this here?
> +        ReplacedOperands[o] = Intrinsic::getDeclaration(M,
> +          (Intrinsic::ID) IID, VArgType);
> +        continue;
> +      } else if (isa<ShuffleVectorInst>(I)&&  o ==
NumOperands-1) {
> +        ReplacedOperands[o] = getReplacementShuffleMask(Context, I, J, o);
> +        continue;
> +      }
> +
> +      ReplacedOperands[o] > +        getReplacementInput(Context, I, J,
o, FlipMemInputs);
> +    }
> +  }
> +
> +  // This function creates two values that represent the outputs of the
> +  // original I and J instructions. These are generally vector shuffles
> +  // or extracts. In many cases, these will end up being unused and, thus,
> +  // eliminated by later passes.
> +  void BBVectorize::replaceOutputsOfPair(LLVMContext&  Context,
Instruction *I,
> +                     Instruction *J, Instruction *K,
> +                     Instruction *&InsertionPt,
> +                     Instruction *&K1, Instruction *&K2,
> +                     bool&FlipMemInputs) {
> +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> +
> +    Value *CV0 = ConstantInt::get(Type::getInt32Ty(Context), 0);
> +    Value *CV1 = ConstantInt::get(Type::getInt32Ty(Context), 1);
> +
> +    if (isa<StoreInst>(I)) {
> +      AA.replaceWithNewValue(I, K);
> +      AA.replaceWithNewValue(J, K);
> +    } else {
> +      Type *IType = I->getType();
> +      Type *VType = getVecType(IType);
> +
> +      if (IType->isVectorTy()) {
> +          unsigned numElem =
cast<VectorType>(IType)->getNumElements();
> +          std::vector<Constant*>  Mask1(numElem), Mask2(numElem);
> +          for (unsigned v = 0; v<  numElem; ++v) {
> +            Mask1[v] = ConstantInt::get(Type::getInt32Ty(Context), v);
> +            Mask2[v] = ConstantInt::get(Type::getInt32Ty(Context),
numElem+v);
> +          }
> +
> +          K1 = new ShuffleVectorInst(K, UndefValue::get(VType),
> +                                       ConstantVector::get(
> +                                         FlipMemInputs ? Mask2 : Mask1),
> +                                       K->hasName() ?
> +                                         K->getName() +
".v.r1" : "");
> +          K2 = new ShuffleVectorInst(K, UndefValue::get(VType),
> +                                       ConstantVector::get(
> +                                         FlipMemInputs ? Mask1 : Mask2),
> +                                       K->hasName() ?
> +                                         K->getName() +
".v.r2" : "");
> +      } else {
> +        K1 = ExtractElementInst::Create(K, FlipMemInputs ? CV1 : CV0,
> +                                          K->hasName() ?
> +                                            K->getName() +
".v.r1" : "");
> +        K2 = ExtractElementInst::Create(K, FlipMemInputs ? CV0 : CV1,
> +                                          K->hasName() ?
> +                                            K->getName() +
".v.r2" : "");
> +      }
> +
> +      K1->insertAfter(K);
> +      K2->insertAfter(K1);
> +      InsertionPt = K2;
> +    }
> +  }
> +
> +  // Move all uses of the function I (including pairing-induced uses)
after J.
> +  void BBVectorize::moveUsesOfIAfterJ(BasicBlock&BB,
> +                     DenseMap<Value *, Value *>&  ChosenPairs,
> +                     std::multimap<Value *, Value *> 
&LoadMoveSet,
> +                     Instruction *&InsertionPt,
> +                     Instruction *I, Instruction *J) {
> +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> +
> +    // Skip to the first instruction past I.
> +    BasicBlock::iterator L = BB.begin();
> +    for (; cast<Instruction>(L) != cast<Instruction>(I); ++L);Why don't you put the initialization into the 'for'? I do not see
any
need for the cast instructions.
> +    ++L;
> +
> +    DenseSet<Value *>  Users;
> +    AliasSetTracker WriteSet(AA);
> +    for (BasicBlock::iterator E = BB.end();
> +          cast<Instruction>(L) != cast<Instruction>(J);) {Why do you need to cast here?
> +      bool UsesI;
> +      trackUsesOfI(Users, WriteSet, I, L, UsesI, true,&LoadMoveSet);
> +
> +      if (UsesI) {
> +        // Make sure that the order of selected pairs is not reversed.
> +        // So if we come to a user instruction that has a pair,
> +        // then mark the pair's tree as used as well.
> +        DenseMap<Value *, Value *>::iterator LP =
ChosenPairs.find(L);
> +        if (LP != ChosenPairs.end())
> +          Users.insert(LP->second);
> +
> +        // Move this instruction
> +        Instruction *InstToMove = L; ++L;
> +
> +        DEBUG(dbgs()<<  "BBV: moving: "<< 
*InstToMove<<
> +                        " to after "<< 
*InsertionPt<<  "\n");
> +        InstToMove->removeFromParent();
> +        InstToMove->insertAfter(InsertionPt);
> +        InsertionPt = InstToMove;
> +      } else {
> +        ++L;
> +      }
> +    }
> +  }
> +
> +  // Collect all load instruction that are in the move set of a given
pair.
> +  // These loads depend on the first instruction, I, and so need to be
moved
> +  // after J when the pair is fused.
> +  void BBVectorize::collectPairLoadMoveSet(BasicBlock&BB,
> +                     DenseMap<Value *, Value *>  &ChosenPairs,
> +                     std::multimap<Value *, Value *> 
&LoadMoveSet,
> +                     Instruction *I, Instruction *J) {
> +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> +
> +    // Skip to the first instruction past I.
> +    BasicBlock::iterator L = BB.begin();
> +    for (; cast<Instruction>(L) != cast<Instruction>(I); ++L);Why not initialize in the 'for' loop. Why cast here?
> +    ++L;
> +
> +    DenseSet<Value *>  Users;
> +    AliasSetTracker WriteSet(AA);
> +    for (BasicBlock::iterator E = BB.end();This 'E' is never used in the loop.
> +          cast<Instruction>(L) != cast<Instruction>(J); ++L) {Why cast here?
> +      bool UsesI;
> +      trackUsesOfI(Users, WriteSet, I, L, UsesI);
> +
> +      if (UsesI) {
> +        // Make sure that the order of selected pairs is not reversed.
> +        // So if we come to a user instruction that has a pair,
> +        // then mark the pair's tree as used as well.
> +        DenseMap<Value *, Value *>::iterator LP =
ChosenPairs.find(L);
> +        if (LP != ChosenPairs.end())
> +          Users.insert(LP->second);
> +
> +        if (L->mayReadFromMemory())
> +          LoadMoveSet.insert(ValuePair(L, I));
> +      }
> +    }
> +  }
> +
> +  // In cases where both load/stores and the computation of their pointers
> +  // are chosen for vectorization, we can end up in a situation where the
> +  // aliasing analysis starts returning different query results as the
> +  // process of fusing instruction pairs continues. Because the algorithm
> +  // relies on finding the same use trees here as were found earlier,
we'll
> +  // need to precompute the necessary aliasing information here and then
> +  // manually update it during the fusion process.
> +  void BBVectorize::collectLoadMoveSet(BasicBlock&BB,
> +                     std::vector<Value *>  &PairableInsts,
> +                     DenseMap<Value *, Value *>  &ChosenPairs,
> +                     std::multimap<Value *, Value *> 
&LoadMoveSet) {
> +    for (std::vector<Value *>::iterator PI = PairableInsts.begin(),
> +         PIE = PairableInsts.end(); PI != PIE; ++PI) {
> +      DenseMap<Value *, Value *>::iterator P =
ChosenPairs.find(*PI);
> +      if (P == ChosenPairs.end()) continue;
> +
> +      if (getDepthFactor(P->first) == 0) {
> +        continue;
> +      }No braces.
> +
> +      Instruction *I = cast<Instruction>(P->first),
> +        *J = cast<Instruction>(P->second);
> +
> +      collectPairLoadMoveSet(BB, ChosenPairs, LoadMoveSet, I, J);
> +    }
> +  }
> +
> +  // This function fuses the chosen instruction pairs into vector
instructions,
> +  // taking care preserve any needed scalar outputs and, then, it reorders
the
> +  // remaining instructions as needed (users of the first member of the
pair
> +  // need to be moved to after the location of the second member of the
pair
> +  // because the vector instruction is inserted in the location of the
pair's
> +  // second member).
> +  void BBVectorize::fuseChosenPairs(BasicBlock&BB,
> +                     std::vector<Value *>  &PairableInsts,
> +                     DenseMap<Value *, Value *>  &ChosenPairs) {
> +    LLVMContext&  Context = BB.getContext();
> +    AliasAnalysis&AA = getAnalysis<AliasAnalysis>();
> +    ScalarEvolution&SE = getAnalysis<ScalarEvolution>();
> +
> +    std::multimap<Value *, Value *>  LoadMoveSet;
> +    collectLoadMoveSet(BB, PairableInsts, ChosenPairs, LoadMoveSet);
> +
> +    DEBUG(dbgs()<<  "BBV: initial: \n"<<  BB<<
"\n");
> +
> +    for (std::vector<Value *>::iterator PI = PairableInsts.begin(),
> +         PIE = PairableInsDenseMap<Vts.end(); PI != PIE; ++PI) {
> +      DenseMap<Value *, Value *>::iterator P =
ChosenPairs.find(*PI);
> +      if (P == ChosenPairs.end()) continue;
> +
> +      if (getDepthFactor(P->first) == 0) {
> +        // These instructions are not really fused, but are tracked as
though
> +        // they are. Any case in which it would be interesting to fuse
them
> +        // will be taken care of by InstCombine.
> +        --NumFusedOps;
> +        continue;
> +      }
> +
> +      Instruction *I = cast<Instruction>(P->first),
> +        *J = cast<Instruction>(P->second);
Why do we need to cast here? It seems ChosenPairs should actually be of
type DenseMap<Instruction *, Instruction *>. Can PariableInsts also be
of type std::vector<Instruction*>?
> +
> +      DEBUG(dbgs()<<  "BBV: fusing: "<<  *I<<
> +             "<->  "<<  *J<<  "\n");
> +
> +      bool FlipMemInputs;
> +      unsigned NumOperands = I->getNumOperands();
> +      SmallVector<Value *, 3>  ReplacedOperands(NumOperands);
> +      getReplacementInputsForPair(Context, I, J, ReplacedOperands,
> +        FlipMemInputs);
> +
> +      // Make a copy of the original operation, change its type to the
vector
> +      // type and replace its operands with the vector operands.
> +      Instruction *K = I->clone();
> +      if (I->hasName()) K->takeName(I);
> +
> +      if (!isa<StoreInst>(K)) {
> +        K->mutateType(getVecType(I->getType()));
> +      }No braces.
> +
> +      for (unsigned o = 0; o<  NumOperands; ++o) {
> +        K->setOperand(o, ReplacedOperands[o]);
> +      }No braces.
> +
> +      // If we've flipped the memory inputs, make sure that we take
the correct
> +      // alignment.
> +      if (FlipMemInputs) {
> +        if (isa<StoreInst>(K))
> +         
cast<StoreInst>(K)->setAlignment(cast<StoreInst>(J)->getAlignment());
> +        else
> +         
cast<LoadInst>(K)->setAlignment(cast<LoadInst>(J)->getAlignment());
> +      }
> +
> +      K->insertAfter(J);
> +
> +      // Instruction insertion point:
> +      Instruction *InsertionPt = K;
> +      Instruction *K1 = 0, *K2 = 0;
> +      replaceOutputsOfPair(Context, I, J, K, InsertionPt, K1, K2,
> +        FlipMemInputs);
> +
> +      // The use tree of the first original instruction must be moved to
after
> +      // the location of the second instruction. The entire use tree of
the
> +      // first instruction is disjoint from the input tree of the second
> +      // (by definition), and so commutes with it.
> +
> +      moveUsesOfIAfterJ(BB, ChosenPairs, LoadMoveSet, InsertionPt,
> +        cast<Instruction>(I), cast<Instruction>(J));
> +
> +      if (!isa<StoreInst>(I)) {
> +        I->replaceAllUsesWith(K1);
> +        J->replaceAllUsesWith(K2);
> +        AA.replaceWithNewValue(I, K1);
> +        AA.replaceWithNewValue(J, K2);
> +      }
> +
> +      // Instructions that may read from memory may be in the load move
set.
> +      // Once an instruction is fused, we no longer need its move set, and
so
> +      // the values of the map never need to be updated. However, when a
load
> +      // is fused, we need to merge the entries from both instructions in
the
> +      // pair in case those instructions were in the move set of some
other
> +      // yet-to-be-fused pair. The loads in question are the keys of the
map.
> +      if (I->mayReadFromMemory()) {
> +        std::vector<ValuePair>  NewSetMembers;
> +        VPIteratorPair IPairRange = LoadMoveSet.equal_range(I);
> +        VPIteratorPair JPairRange = LoadMoveSet.equal_range(J);
> +        for (std::multimap<Value *, Value *>::iterator N =
IPairRange.first;
> +               N != IPairRange.second; ++N) {Indentation.
> +          NewSetMembers.push_back(ValuePair(K, N->second));
> +        }No braces.
> +        for (std::multimap<Value *, Value *>::iterator N =
JPairRange.first;
> +               N != JPairRange.second; ++N) {Indentation.
> +          NewSetMembers.push_back(ValuePair(K, N->second));
> +        }No braces.
> +        for (std::vector<ValuePair>::iterator A =
NewSetMembers.begin(),
> +               AE = NewSetMembers.end(); A != AE; ++A) {Indentation.
> +          LoadMoveSet.insert(*A);
> +        }No braces.
> +      }
> +
> +      SE.forgetValue(I);
> +      SE.forgetValue(J);
> +      I->eraseFromParent();
> +      J->eraseFromParent();
> +    }
> +
> +    DEBUG(dbgs()<<  "BBV: final: \n"<<  BB<< 
"\n");
> +  }
> +}
> --- /dev/null
> +++ b/test/Transforms/BBVectorize/ld1.ll
> @@ -0,0 +1,28 @@
> +target datalayout =
"e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
> +; RUN: opt<  %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S |
FileCheck %s
> +
> +define double @test1(double* %a, double* %b, double* %c) nounwind uwtable
readonly {
> +entry:
> +  %i0 = load double* %a, align 8
> +  %i1 = load double* %b, align 8
> +  %mul = fmul double %i0, %i1
> +  %i2 = load double* %c, align 8
> +  %add = fadd double %mul, %i2
> +  %arrayidx3 = getelementptr inbounds double* %a, i64 1
> +  %i3 = load double* %arrayidx3, align 8
> +  %arrayidx4 = getelementptr inbounds double* %b, i64 1
> +  %i4 = load double* %arrayidx4, align 8
> +  %mul5 = fmul double %i3, %i4
> +  %arrayidx6 = getelementptr inbounds double* %c, i64 1
> +  %i5 = load double* %arrayidx6, align 8
> +  %add7 = fadd double %mul5, %i5
> +  %mul9 = fmul double %add, %i1
> +  %add11 = fadd double %mul9, %i2
> +  %mul13 = fmul double %add7, %i4
> +  %add15 = fadd double %mul13, %i5
> +  %mul16 = fmul double %add11, %add15
> +  ret double %mul16
> +; CHECK: @test1
> +; CHECK:<2 x double>
For me this CHECK looks very short. Should all instructions be
vectorized or is it sufficient if one is vectorized. Also, it would be
nice if I could understand the code that is generated just by looking
at the test case. Any reason not to CHECK: for the entire sequence of
expected instructions (This applies to all test cases).

> +}
> +
> diff --git a/test/Transforms/BBVectorize/loop1.ll
b/test/Transforms/BBVectorize/loop1.ll
> new file mode 100644
> index 0000000..4c60b42
> --- /dev/null
> +++ b/test/Transforms/BBVectorize/loop1.ll
> @@ -0,0 +1,39 @@
> +target datalayout =
"e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
> +target triple = "x86_64-unknown-linux-gnu"
> +; RUN: opt<  %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S |
FileCheck %s
> +; RUN: opt<  %s -loop-unroll -unroll-allow-partial -bb-vectorize
-bb-vectorize-req-chain-depth=3 -S | FileCheck %sWhy are you checking here with -loop-unroll? Do you expect something
special. In case you do it may make sense to use another -check-prefix
> --- /dev/null
> +++ b/test/Transforms/BBVectorize/req-depth.ll
> @@ -0,0 +1,17 @@
> +target datalayout =
"e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128"
> +; RUN: opt<  %s -bb-vectorize -bb-vectorize-req-chain-depth 3 -S |
FileCheck %s -check-prefix=CHECK-RD3
> +; RUN: opt<  %s -bb-vectorize -bb-vectorize-req-chain-depth 2 -S |
FileCheck %s -check-prefix=CHECK-RD2
> +
> +define double @test1(double %A1, double %A2, double %B1, double %B2) {
> +	%X1 = fsub double %A1, %B1
> +	%X2 = fsub double %A2, %B2
> +	%Y1 = fmul double %X1, %A1
> +	%Y2 = fmul double %X2, %A2
> +	%R  = fmul double %Y1, %Y2
> +	ret double %R
> +; CHECK-RD3: @test1
> +; CHECK-RD2: @test1
> +; CHECK-RD3-NOT:<2 x double>
> +; CHECK-RD2:<2 x double>What about sorting them by check prefix?
> +++ b/test/Transforms/BBVectorize/simple-ldstr.ll
> @@ -0,0 +1,69 @@
> +target datalayout =
"e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
> +; RUN: opt<  %s -bb-vectorize -bb-vectorize-req-chain-depth=3 -S |
FileCheck %s
> +
> +; Simple 3-pair chain with loads and stores
> +define void @test1(double* %a, double* %b, double* %c) nounwind uwtable
readonly {
> +entry:
> +  %i0 = load double* %a, align 8
> +  %i1 = load double* %b, align 8
> +  %mul = fmul double %i0, %i1
> +  %arrayidx3 = getelementptr inbounds double* %a, i64 1
> +  %i3 = load double* %arrayidx3, align 8
> +  %arrayidx4 = getelementptr inbounds double* %b, i64 1
> +  %i4 = load double* %arrayidx4, align 8
> +  %mul5 = fmul double %i3, %i4
> +  store double %mul, double* %c, align 8
> +  %arrayidx5 = getelementptr inbounds double* %c, i64 1
> +  store double %mul5, double* %arrayidx5, align 8
> +  ret void
> +; CHECK: @test1
> +; CHECK:<2 x double>
Especially here the entire vectorization sequence is interesting as I
would love to see the alignment you produce for load/store instructions.
> +; Basic depth-3 chain (last pair permuted)
> +define double @test2(double %A1, double %A2, double %B1, double %B2) {
> +	%X1 = fsub double %A1, %B1
> +	%X2 = fsub double %A2, %B2
> +	%Y1 = fmul double %X1, %A1
> +	%Y2 = fmul double %X2, %A2
> +	%Z1 = fadd double %Y2, %B1
> +	%Z2 = fadd double %Y1, %B2
> +	%R  = fmul double %Z1, %Z2
> +	ret double %R
> +; CHECK: @test2
> +; CHECK:<2 x double>
Again, the entire sequence would show how you express a permutation.

Possibly Parallel Threads

Search for more maybe matching threads

llvm dev - Dec 2011 - [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Possibly Parallel Threads