Hi all,

I'd like to have a first stab at a loop-based auto-vectorization pass
as part of 2009's Google Summer of Code program. As far as I can tell
from searching the mailing list archives, no work on such an
auto-vectorizer seems to be currently in progress.

Whereas auto-vectorization is a well-researched topic, complexity seems
to quickly explode once more advanced loop constructs are to be
supported. Therefore my aim is to start with the most minimal
implementation possible, to explore the difficulties encountered in the
specific context of LLVM and to build a foundation from which future
work can progress.

So, initially, I aim at supporting only the simplest loops, such as:

    int a[256], b[256], c[256];
    for (int i = 0; i < 256; ++i)
        c[i] = a[i] + b[i];

My goal is to implement the necessary analyses and transformations to
turn IR corresponding to such code into IR utilizing vector
instructions; i.e. the core of the desired result would look like:

    %va = load <256 x i32>* %a
    %vb = load <256 x i32>* %b
    %vc = add <256 x i32> %va, %vb
    store <256 x i32> %vc, <256 x i32>* %c

After this transformation, the code could be left as is (which would
generate a fully unrolled vector add, as long as the vector length
doesn't exceed the lowering capabilities of the code generator), or
strip-mined into a loop using fixed-size vectors (preferably
corresponding to the SIMD register width of the target architecture).

Auto-vectorization has numerous dependencies on other
analysis/transformation passes (loop normalization, constant/scalar
expansion, dependence analysis, etc.). A significant part of the
proposed project is to evaluate already-available LLVM passes regarding
their applicability to support auto-vectorization and to
extend/implement what's missing.

As for my personal background, I'm a graduate student at Vienna
University of Technology (TU Wien) with a strong background in compiler
theory, acquired in a wide variety of undergraduate- and graduate-level
courses. I recently implemented two simple compilers using LLVM and
conducted various experiments using LLVM's vector support. My general
interest in data-parallel computation stems from a great fondness for
APL (especially for its modern offspring K). While I'm familiar with
the concepts behind auto-vectorization, I haven't implemented an
auto-vectorizer before. I'm the author of and a contributor to several
smaller open source projects.

I appreciate any suggestions and would be very excited if someone is
interested in mentoring this.

--
Best regards,
Andreas Bolka
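[Editorial note, not part of the original mail: the strip-mined form
described above, assuming a hypothetical 4-wide SIMD target, can be
sketched in C using the GCC/Clang vector extension, which stands in for
the <4 x i32> IR an auto-vectorizer would emit:]

```c
/* Sketch of the strip-mined 256-element add for an assumed 4-wide
 * SIMD target (vectorization factor VF = 4). The v4si type is the
 * GCC/Clang vector-extension analogue of LLVM's <4 x i32>. */
typedef int v4si __attribute__((vector_size(16)));

int a[256] __attribute__((aligned(16)));
int b[256] __attribute__((aligned(16)));
int c[256] __attribute__((aligned(16)));

void vadd(void) {
    /* 256 / 4 = 64 iterations, one <4 x i32> add per iteration. */
    for (int i = 0; i < 256; i += 4) {
        v4si va = *(v4si *)&a[i];   /* load  <4 x i32> */
        v4si vb = *(v4si *)&b[i];   /* load  <4 x i32> */
        *(v4si *)&c[i] = va + vb;   /* add + store <4 x i32> */
    }
}
```

[The aligned(16) attributes sidestep the alignment questions raised
later in the thread; a real pass would have to analyze or enforce
alignment.]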
Andreas Bolka wrote:
> Hi all,
>
> I'd like to have a first stab at a loop-based auto-vectorization pass
> as part of 2009's Google Summer of Code program. As far as I can tell
> from searching the mailing list archives, no work on such an
> auto-vectorizer seems to be currently in progress.

Hi Andreas,

Actually, you'd be the third person to try writing one, with myself
being the second. :) Don't let that discourage you; I don't think
anyone's actively working on it now.

> Whereas auto-vectorization is a well-researched topic, complexity
> seems to quickly explode once more advanced loop constructs are to be
> supported. Therefore my aim is to start with the most minimal
> implementation possible, to explore the difficulties encountered in
> the specific context of LLVM and to build a foundation from which
> future work can progress.

There are two types of auto-vectorization, SLP (superword level
parallelism) and ILP (instruction level parallelism). You can do ILP,
which is loop-ignorant auto-vectorization of straight-line code, which
turns out to be the code that runs inside the loop.

> So, initially, I aim at supporting only the simplest loops, such as:
>
>     int a[256], b[256], c[256];
>     for (int i = 0; i < 256; ++i)
>         c[i] = a[i] + b[i];
>
> My goal is to implement the necessary analyses and transformations to
> turn IR corresponding to such code into IR utilizing vector
> instructions; i.e. the core of the desired result would look like:
>
>     %va = load <256 x i32>* %a
>     %vb = load <256 x i32>* %b
>     %vc = add <256 x i32> %va, %vb
>     store <256 x i32> %vc, <256 x i32>* %c
>
> After this transformation, the code could be left as is (which would
> generate a fully unrolled vector add, as long as the vector length
> doesn't exceed the lowering capabilities of the code generator), or
> strip-mined into a loop using fixed-size vectors (preferably
> corresponding to the SIMD register width of the target architecture).
>
> Auto-vectorization has numerous dependencies on other
> analysis/transformation passes (loop normalization, constant/scalar
> expansion, dependence analysis, etc.). A significant part of the
> proposed project is to evaluate already-available LLVM passes
> regarding their applicability to support auto-vectorization and to
> extend/implement what's missing.

There's no loop dependence analysis in LLVM yet. How are you planning
to implement it? Can you point to a paper, or write up a plain English
description?

> As for my personal background, I'm a graduate student at Vienna
> University of Technology (TU Wien) with a strong background in
> compiler theory, acquired in a wide variety of undergraduate- and
> graduate-level courses. I recently implemented two simple compilers
> using LLVM and conducted various experiments using LLVM's vector
> support. My general interest in data-parallel computation stems from
> a great fondness for APL (especially for its modern offspring K).
> While I'm familiar with the concepts behind auto-vectorization, I
> haven't implemented an auto-vectorizer before. I'm the author of and
> a contributor to several smaller open source projects.
>
> I appreciate any suggestions and would be very excited if someone is
> interested in mentoring this.

I'm interested, but I'd like to see more details about exactly how
(i.e., with what algorithm) you intend to find and transform code
first.

Nick
Nick Lewycky wrote:
> Andreas Bolka wrote:
>> Hi all,
>>
>> I'd like to have a first stab at a loop-based auto-vectorization
>> pass as part of 2009's Google Summer of Code program. As far as I
>> can tell from searching the mailing list archives, no work on such
>> an auto-vectorizer seems to be currently in progress.
>
> Hi Andreas,
>
> Actually, you'd be the third person to try writing one, with myself
> being the second. :) Don't let that discourage you; I don't think
> anyone's actively working on it now.
>
>> Whereas auto-vectorization is a well-researched topic, complexity
>> seems to quickly explode once more advanced loop constructs are to
>> be supported. Therefore my aim is to start with the most minimal
>> implementation possible, to explore the difficulties encountered in
>> the specific context of LLVM and to build a foundation from which
>> future work can progress.
>
> There are two types of auto-vectorization, SLP (superword level
> parallelism) and ILP (instruction level parallelism). You can do ILP,
> which is loop-ignorant auto-vectorization of straight-line code,
> which turns out to be the code that runs inside the loop.

Pardon me. I've been disabused by Owen Anderson for writing the above.
SLP is loop-ignorant. ILP is a rather generic term and probably
shouldn't be described as a type of auto-vectorization. Cross-iteration
and intra-iteration better describe the distinction.

In any event, the paper I was thinking of when I wrote the above is:

    http://www.cag.lcs.mit.edu/slp/SLP-PLDI-2000.pdf

My concern is that writing both loop dependence analysis and loop
auto-vectorization would be difficult in the time span allotted for
GSoC, and you may want to consider just doing SLP.
Nick

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu        http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
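[Editorial note, not part of the original thread: the SLP idea from the
paper Nick links can be made concrete with a small sketch, again using
the GCC/Clang vector extension. SLP packs isomorphic scalar statements
that access adjacent memory into one vector operation, without looking
at any enclosing loop:]

```c
typedef int v4si __attribute__((vector_size(16)));

int a[4] __attribute__((aligned(16)));
int b[4] __attribute__((aligned(16)));
int c[4] __attribute__((aligned(16)));

/* Straight-line code: four isomorphic statements touching adjacent
 * memory locations -- the pattern SLP looks for; no loop required. */
void scalar_adds(void) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}

/* What an SLP pass would turn the above into: one packed add. */
void packed_add(void) {
    v4si va = *(v4si *)a;
    v4si vb = *(v4si *)b;
    *(v4si *)c = va + vb;
}
```

[Such straight-line groups often arise from unrolled loop bodies, which
is why SLP can recover much of what a loop vectorizer would find.]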
On Mar 31, 2009, at 5:27 PM, Andreas Bolka wrote:
> Hi all,
>
> I'd like to have a first stab at a loop-based auto-vectorization pass
> as part of 2009's Google Summer of Code program. As far as I can tell
> from searching the mailing list archives, no work on such an
> auto-vectorizer seems to be currently in progress.

Hi Andreas,

This would be a very interesting project! The most important part of
the proposal is to break it down into small pieces.

> Whereas auto-vectorization is a well-researched topic, complexity
> seems to quickly explode once more advanced loop constructs are to be
> supported. Therefore my aim is to start with the most minimal
> implementation possible, to explore the difficulties encountered in
> the specific context of LLVM and to build a foundation from which
> future work can progress.

This is a great approach. I'm very much in favor of solving small
problems first, instead of trying to solve all the world's issues at
once. :)

> So, initially, I aim at supporting only the simplest loops, such as:
>
>     int a[256], b[256], c[256];
>     for (int i = 0; i < 256; ++i)
>         c[i] = a[i] + b[i];
>
> My goal is to implement the necessary analyses and transformations to
> turn IR corresponding to such code into IR utilizing vector
> instructions;

Sounds great. Some important steps:

1. We need an abstract dependence analysis interface (like we have for
   AA) and one implementation (probably handling trivial loops like
   the one above to start with). Many people have talked about this,
   but no code has gone in yet.

2. We need some interface that can be implemented by a target machine
   to describe what the vector capabilities of the machine are.

3. We need the transformation code. Starting with a simple loop
   vectorizer (as opposed to an SLP system) would make sense to me.
   Once the basics are in place, it can be feature-crept to support
   new idioms (reductions, alignment analysis, etc.). The first cut
   doesn't need to provide an immediate speedup, but should have many
   testcases that demonstrate the loop forms it can handle.

> i.e. the core of the desired result would look like:
>
>     %va = load <256 x i32>* %a
>     %vb = load <256 x i32>* %b
>     %vc = add <256 x i32> %va, %vb
>     store <256 x i32> %vc, <256 x i32>* %c

Actually, IMO this is probably not the direction we want to go.
Vectorizing to extremely wide vectors like this isn't along the lines
of what we want: we should get something like the original loop,
unrolled 4 times, doing operations on <4 x i32>'s (for x86/ppc).

> Auto-vectorization has numerous dependencies on other
> analysis/transformation passes (loop normalization, constant/scalar
> expansion, dependence analysis, etc.). A significant part of the
> proposed project is to evaluate already-available LLVM passes
> regarding their applicability to support auto-vectorization and to
> extend/implement what's missing.

Sounds good. I'm also fine with it only working in carefully
controlled circumstances.

Depending on how ambitious you are, you could also do a GSoC project
just around a) building a dependence analysis interface + some
*non*-trivial implementations, and b) building some dep analysis
clients around it (e.g. loop interchange, etc.). This may be a less
risky and higher-impact project in the short term, if you are
interested in it.

> As for my personal background, I'm a graduate student at Vienna
> University of Technology (TU Wien) with a strong background in
> compiler theory, acquired in a wide variety of undergraduate- and
> graduate-level courses. I recently implemented two simple compilers
> using LLVM and conducted various experiments using LLVM's vector
> support. My general interest in data-parallel computation stems from
> a great fondness for APL (especially for its modern offspring K).
> While I'm familiar with the concepts behind auto-vectorization, I
> haven't implemented an auto-vectorizer before. I'm the author of and
> a contributor to several smaller open source projects.

Great! Either approach (building a very simple auto-vec system
end-to-end, or building some non-trivial dep analysis implementations
+ a client or two) sounds like a very promising GSoC project!

-Chris
Chris Lattner wrote:
> Depending on how ambitious you are, you could also do a GSoC project
> just around a) building a dependence analysis interface + some
> *non*-trivial implementations, and b) building some dep analysis
> clients around it (e.g. loop interchange, etc.). This may be a less
> risky and higher-impact project in the short term, if you are
> interested in it.

This would be in great demand, independent of the auto-vectorization.
Given that the amount of work is fixed, I would rather see all the work
go into a brilliant dependence analysis than into an OK dependence
analysis and an OK vectorization. I encourage you to consider it,
Andreas.

--
Pertti
Hi Andreas,

On 31-Mar-09, at 8:27 PM, Andreas Bolka wrote:
> So, initially, I aim at supporting only the simplest loops, such as:
>
>     int a[256], b[256], c[256];
>     for (int i = 0; i < 256; ++i)
>         c[i] = a[i] + b[i];
>
> My goal is to implement the necessary analyses and transformations to
> turn IR corresponding to such code into IR utilizing vector
> instructions; i.e. the core of the desired result would look like:
>
>     %va = load <256 x i32>* %a
>     %vb = load <256 x i32>* %b
>     %vc = add <256 x i32> %va, %vb
>     store <256 x i32> %vc, <256 x i32>* %c
>
> After this transformation, the code could be left as is (which would
> generate a fully unrolled vector add, as long as the vector length
> doesn't exceed the lowering capabilities of the code generator), or
> strip-mined into a loop using fixed-size vectors (preferably
> corresponding to the SIMD register width of the target architecture).

I think the biggest problem with this approach, apart from the fact
that it doesn't mirror how vectors are typically used today in LLVM,
is that vectors in LLVM are of fixed size. This is going to severely
limit the usefulness of this transformation. I think you may be better
off getting information about vector widths for the architecture (e.g.
from TargetData) and vectorizing directly with a particular width in
mind.

Overall, this would be a great GSoC project, and I guarantee you would
get a lot of interest in this :). Good luck!

Stefanus

--
Stefanus Du Toit <stefanus.dutoit at rapidmind.com>
RapidMind Inc.
phone: +1 519 885 5455 x116 -- fax: +1 519 885 1463
Hi Andreas,

On Mar 31, 2009, at 5:27 PM, Andreas Bolka wrote:
> Hi all,
>
> I'd like to have a first stab at a loop-based auto-vectorization pass
> as part of 2009's Google Summer of Code program.

Good idea!

> As far as I can tell from searching the mailing list archives, no
> work on such an auto-vectorizer seems to be currently in progress.

Yes.

> Whereas auto-vectorization is a well-researched topic, complexity
> seems to quickly explode once more advanced loop constructs are to be
> supported. Therefore my aim is to start with the most minimal
> implementation possible, to explore the difficulties encountered in
> the specific context of LLVM and to build a foundation from which
> future work can progress.
>
> So, initially, I aim at supporting only the simplest loops, such as:
>
>     int a[256], b[256], c[256];
>     for (int i = 0; i < 256; ++i)
>         c[i] = a[i] + b[i];
>
> My goal is to implement the necessary analyses and transformations to
> turn IR corresponding to such code into IR utilizing vector
> instructions; i.e. the core of the desired result would look like:
>
>     %va = load <256 x i32>* %a
>     %vb = load <256 x i32>* %b
>     %vc = add <256 x i32> %va, %vb
>     store <256 x i32> %vc, <256 x i32>* %c
>
> After this transformation, the code could be left as is (which would
> generate a fully unrolled vector add, as long as the vector length
> doesn't exceed the lowering capabilities of the code generator), or
> strip-mined into a loop using fixed-size vectors (preferably
> corresponding to the SIMD register width of the target architecture).
>
> Auto-vectorization has numerous dependencies on other
> analysis/transformation passes (loop normalization, constant/scalar
> expansion, dependence analysis, etc.). A significant part of the
> proposed project is to evaluate already-available LLVM passes
> regarding their applicability to support auto-vectorization and to
> extend/implement what's missing.
>
> As for my personal background, I'm a graduate student at Vienna
> University of Technology (TU Wien) with a strong background in
> compiler theory, acquired in a wide variety of undergraduate- and
> graduate-level courses. I recently implemented two simple compilers
> using LLVM and conducted various experiments using LLVM's vector
> support. My general interest in data-parallel computation stems from
> a great fondness for APL (especially for its modern offspring K).
> While I'm familiar with the concepts behind auto-vectorization, I
> haven't implemented an auto-vectorizer before. I'm the author of and
> a contributor to several smaller open source projects.
>
> I appreciate any suggestions and would be very excited if someone is
> interested in mentoring this.

As Chris said, there are many approaches you can take for a GSoC
project whose title includes "auto-vectorization". IMO, building an
initial foundation for data dependence analysis is a good start. I
encourage you to prepare a proposal that includes a list of measurable
and achievable milestones with realistic goals, where each milestone is
a small step in the right direction. I am ready to mentor this project.

- Devang
Hi Stefanus,

On Wed Apr 01 16:08:45 +0200 2009, Stefanus Du Toit wrote:
> On 31-Mar-09, at 8:27 PM, Andreas Bolka wrote:
>> i.e. the core of the desired result would look like:
>>
>>     %va = load <256 x i32>* %a
>>     %vb = load <256 x i32>* %b
>>     %vc = add <256 x i32> %va, %vb
>>     store <256 x i32> %vc, <256 x i32>* %c
>
> I think the biggest problem with this approach, apart from the fact
> that it doesn't mirror how vectors are typically used today in LLVM,
> is that vectors in LLVM are of fixed size. This is going to severely
> limit the usefulness of this transformation. I think you may be
> better off getting information about vector widths for the
> architecture (e.g. from TargetData) and vectorizing directly with a
> particular width in mind.

Thanks for the remark. My initial thinking was that, independent of
auto-vectorization, a general "wide vector strip-mining" transformation
would be worthwhile to have anyway, i.e. a transformation which would
convert wide vector operations such as the above into a loop over
(target-dependent) smaller-width vectors. And as the auto-vectorizer I
am aiming for would initially only be able to vectorize loops with
statically known constant trip counts, I could squash some complexity
by leaving this target-dependent aspect to such a hypothetical
strip-mining pass.

But I understand the limits of this, and based on the feedback I got so
far, I'd rather generate a loop with fixed-size vectors instead. A
command-line option to determine the vectorization factor should do
fine to enable quick bootstrapping, if querying the necessary register
width from TargetData turns out to be problematic.

> Overall, this would be a great GSoC project, and I guarantee you
> would get a lot of interest in this :). Good luck!

Thanks!

--
Andreas
Hi Chris,

On Wed Apr 01 08:18:28 +0200 2009, Chris Lattner wrote:
> On Mar 31, 2009, at 5:27 PM, Andreas Bolka wrote:
>> My goal is to implement the necessary analyses and transformations
>> to turn IR corresponding to such code into IR utilizing vector
>> instructions;
>
> Sounds great. Some important steps:
>
> 1. We need an abstract dependence analysis interface (like we have
>    for AA) and one implementation (probably handling trivial loops
>    like the one above to start with). Many people have talked about
>    this, but no code has gone in yet.

My original plan was to implement a loop dependence analysis only for
non-nested loops, using the "GCD" method of determining interdependent
statements. This limits the array subscript expressions to simple
linear equations of the form x1*i + x0, where i is the induction
variable of the loop. I haven't thought yet about generalising this to
ease future implementation of more complex loop dependence analyses.

> 2. We need some interface that can be implemented by a target machine
>    to describe what the vector capabilities of the machine are.

Initially the vectorization pass would basically only need the SIMD
register width; later, the supported "packing formats" may come in
handy (i.e. the information that 128-bit registers can be used as e.g.
16 x i8, 8 x i16, and 4 x i32).

> 3. We need the transformation code. Starting with a simple loop
>    vectorizer (as opposed to an SLP system) would make sense to me.

I'm also inclined to prefer "classical" Allen & Kennedy-style
auto-vectorization over "vectorization by unrolling"/SLP approaches.
Based on the responses on this list, it seems that this sentiment is
shared by the LLVM community. An SLP pass would certainly be a nice
complement to loop-based auto-vectorization, though.

> Once the basics are in place, it can be feature-crept to support new
> idioms (reductions, alignment analysis, etc.).
> The first cut doesn't need to provide an immediate speedup, but
> should have many testcases that demonstrate the loop forms it can
> handle.

That's exactly my understanding of it. I want to limit this to the most
basic transformation possible, i.e.:

- non-nested loops, having
- a statically known constant trip count, which is
- divisible by the vectorization factor;
- a single basic block as loop body, with
- a single arithmetic instruction, using
- only same-sized vector operands, and
- ignoring alignment issues.

> Vectorizing to extremely wide vectors like this isn't along the lines
> of what we want: we should get something like the original loop,
> unrolled 4 times, doing operations on <4 x i32>'s (for x86/ppc).

Yes, I agree. See also my reply to Stefanus.

>> Auto-vectorization has numerous dependencies on other
>> analysis/transformation passes (loop normalization, constant/scalar
>> expansion, dependence analysis, etc.). A significant part of the
>> proposed project is to evaluate already-available LLVM passes
>> regarding their applicability to support auto-vectorization and to
>> extend/implement what's missing.
>
> Depending on how ambitious you are, you could also do a GSoC project
> just around a) building a dependence analysis interface + some
> *non*-trivial implementations, and b) building some dep analysis
> clients around it (e.g. loop interchange, etc.). This may be a less
> risky and higher-impact project in the short term, if you are
> interested in it.

I understand the importance and necessity of a general loop dependence
interface, but I'm not too familiar with the more complex approaches
(which generally require integer linear programming, if I'm not
mistaken). So I can't really assess whether there's a realistic chance
to build a generally usable non-trivial implementation in the time
available.
But as a simple loop dependence analysis pass will be one of the first
steps in this project anyway, I'll be in a much better position to
re-evaluate that question once this first pass is done. I'd certainly
be willing to modify the project if it turns out that this direction
would be more valuable and/or more realistic to pursue.

> Great! Either approach (building a very simple auto-vec system
> end-to-end, or building some non-trivial dep analysis implementations
> + a client or two) sounds like a very promising GSoC project!

Thanks for the extensive feedback!

--
Andreas
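[Editorial note, not part of the original thread: the GCD dependence
test Andreas proposes for subscripts of the form x1*i + x0 can be
sketched in a few lines of Python; the function name and interface are
illustrative only. For two accesses X[a*i + c1] and X[b*j + c2], the
dependence equation a*i - b*j = c2 - c1 has an integer solution only
if gcd(a, b) divides c2 - c1:]

```python
from math import gcd

def gcd_may_depend(a, c1, b, c2):
    """Conservative GCD test for one-dimensional affine subscripts.

    One access touches X[a*i + c1], the other X[b*j + c2], with i and
    j ranging over loop iterations.  If gcd(a, b) does not divide
    c2 - c1, the equation a*i - b*j = c2 - c1 has no integer solution,
    so the accesses can never overlap and the statement is safe to
    vectorize.  If the gcd does divide, a dependence must be assumed
    conservatively, since loop bounds are ignored.
    """
    g = gcd(a, b)
    if g == 0:               # both strides zero: one fixed element each
        return c1 == c2
    return (c2 - c1) % g == 0

# X[2*i] vs. X[2*i + 1]: even vs. odd elements -- never dependent.
# X[i]   vs. X[i + 1]:   gcd 1 divides everything -- assume dependence.
```

[A vectorizer would run such a test on each write/read pair in the loop
body; for the c[i] = a[i] + b[i] example the three arrays are distinct,
so the loop vectorizes trivially.]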