Zaks, Ayal via llvm-dev
2016-Sep-21 17:50 UTC
[llvm-dev] RFC: Extending LV to vectorize outerloops
Proposal for extending the Loop Vectorizer to handle Outer Loops
================================================================

Goal:
-----
We propose to extend the innermost Loop Vectorizer to also handle outer loops (cf. [1]). Our aim is to best leverage the efforts already invested in the existing innermost Loop Vectorizer rather than introduce a separate pass dedicated to outer-loop vectorization. This proposal will support explicit vector programming of loops and functions [2]. It also facilitates evaluating multiple vectorization candidates, including both outer and innermost loops, to choose the most profitable ones.

Background:
-----------
The current Loop Vectorizer handles only innermost loops. In order to extend it to also handle outer loops, the following design decisions need to be reworked:

1. The resulting vectorized loop is inherently designed to contain a *single* basic block. This is already an issue today, as innermost loops may benefit from retaining some internal branches when vectorized. For outer loops this clearly cannot hold: the resulting vectorized loop will contain more than a single basic block because it will contain inner loops.
2. There is inherently a single vectorization candidate with a single dimension of optimization, namely the Vectorization Factor and/or Unrolling Factor of the innermost loop. When dealing with outer loops it is important to evaluate multiple vectorization candidates, including both outer and inner loops, along with their respective VFs/UFs.

The current Loop Vectorizer works in the following three main steps:

1. Legal Step: check if the loop can be legally vectorized; encode constraints and artifacts if so.
2. Cost Step: compute the relative cost of vectorizing it along possible vectorization and unroll factors.
3. Transform Step: vectorize the loop according to the best vectorization and unroll factors.

This design has some implications:

1. The Cost Step tries to predict what the vectorized loop will look like and how much it will cost, independently of what the Transform Step will eventually do. It's hard to keep the two in sync, hence the comment placed at the beginning of the Transform Step:
   // Notice: any optimization or new instruction that go
   // into the code below should be also be implemented in
   // the cost-model.
2. The Legal Step does more than check for vectorizability; e.g., it records auxiliary artifacts such as collectLoopUniforms() and InterleaveInfo.
3. The Transform Step first populates the single basic block of the vectorized loop and later revisits scalarized instructions to predicate them one by one, as needed.

Proposal: introduce the Vectorization Plan as an explicit model of a vectorization candidate and update the overall flow:

1. Legal Step: check if the loop can be legally vectorized; encode constraints by initiating Vectorization Plan(s) if so.
2. Plan Step:
   a. Build initial vectorization plans complying with all legal constraints.
   b. Optimize vectorization plans.
3. Cost Step: compute the relative cost of each plan. This step can be applied repeatedly by Plan Optimize Step 2.b.
4. Transform Step: materialize the best plan. Note that only this step modifies the IR, as in the current Loop Vectorizer.

The Vectorization Plan is a recipe describing the vectorization candidate, including for instance its internal control flow. It serves both for estimating the cost and for performing the translation, and it facilitates dealing with multiple vectorization candidates. One can compare with LLVM's existing SLP vectorizer, where TSLP [3] adds step 2.b.
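As a rough illustration of the proposed flow, here is a minimal C++ sketch; all names below (VectorizationPlan, buildInitialPlans, expectedCost, executePlan) are placeholders invented for the sketch, not APIs defined by this proposal:

    // Sketch only: hypothetical plan-driven driver for the Loop Vectorizer.
    #include <vector>

    struct Loop;                      // stands in for the IR loop being considered
    struct VectorizationPlan {        // explicit model of one vectorization candidate
      unsigned VF = 1, UF = 1;        // factors of the loop chosen for vectorization
      // ... the planned internal control flow of the vector loop would live here ...
    };

    bool isLegalToVectorize(const Loop &L);                          // 1. Legal Step
    std::vector<VectorizationPlan> buildInitialPlans(const Loop &L); // 2.a Plan Step
    void optimizePlan(VectorizationPlan &P);                         // 2.b Plan Step
    unsigned expectedCost(const VectorizationPlan &P);               // 3. Cost Step
    unsigned scalarCost(const Loop &L);
    void executePlan(Loop &L, const VectorizationPlan &P);           // 4. Transform Step

    bool tryToVectorize(Loop &L) {
      if (!isLegalToVectorize(L))
        return false;
      std::vector<VectorizationPlan> Plans = buildInitialPlans(L);
      for (VectorizationPlan &P : Plans)
        optimizePlan(P);              // may itself consult expectedCost() repeatedly
      const VectorizationPlan *Best = nullptr;
      for (const VectorizationPlan &P : Plans)
        if (!Best || expectedCost(P) < expectedCost(*Best))
          Best = &P;
      if (!Best || expectedCost(*Best) >= scalarCost(L))
        return false;                 // not profitable; IR left untouched
      executePlan(L, *Best);          // only this step modifies the IR
      return true;
    }

Only the last step touches the IR; everything before it operates on plan objects.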
As the scope of vectorization grows from innermost to outer loops, so do the uncertainty and complexity of each step. One way to mitigate the shortcomings of the Legal and Cost steps is to rely on programmers to indicate which loops can and/or should be vectorized. This is implicit for certain loops in data-parallel languages such as OpenCL [4,5] and explicit in others such as OpenMP [6]. This proposal to extend the Loop Vectorizer to outer loops supports explicit vectorization and raises its importance beyond the current capabilities of Clang and LLVM: namely, from forcing the vectorization of innermost loops according to a prescribed width and/or interleaving count, to supporting OpenMP's "#pragma omp simd" construct and its associated clauses, including our earlier proposal for vectorizing across function boundaries [2].

Comments on this overall approach are welcome. The first patches we're working on are designed to have the innermost Loop Vectorizer explicitly model the control flow of its vectorized loop. More will be presented in our technical talk at the upcoming LLVM Developers' Meeting.

References:
-----------
[1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit Nuzman and Ayal Zaks, PACT 2008.
[2] "Proposal for function vectorization and loop vectorization with function calls", Xinmin Tian, [cfe-dev] March 2, 2016 (http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html; see also https://reviews.llvm.org/D22792).
[3] "Throttling Automatic Vectorization: When Less is More", Vasileios Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.
[4] "Intel OpenCL SDK Vectorizer", Nadav Rotem, LLVM Developers' Meeting 2011.
[5] "Automatic SIMD Vectorization of SSA-based Control Flow Graphs", Ralf Karrenberg, Springer 2015. See also "Improving Performance of OpenCL on CPUs", LLVM Developers' Meeting 2012.
[6] "Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on Multicore-SIMD Processors", Xinmin Tian and Hideki Saito et al., IPDPSW 2012.
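For concreteness, here is a small, purely illustrative loop nest (not taken from the RFC) of the kind this proposal targets: the inner loop carries a dependence through prev and is a poor candidate on its own, while the outer iterations are independent, so OpenMP's simd construct on the outer loop asks for exactly the outer-loop vectorization described above.

    // Illustrative only: "#pragma omp simd" on the outer loop asks for its
    // iterations (values of i) to be executed in SIMD lanes; the inner loop
    // stays sequential within each lane because of the dependence through prev.
    void smooth(float *out, const float *in, int n, int m) {
      #pragma omp simd
      for (int i = 0; i < n; ++i) {       // outer loop: the vectorization candidate
        float prev = in[i * m];
        for (int j = 1; j < m; ++j) {     // inner loop: loop-carried dependence
          prev = 0.5f * (prev + in[i * m + j]);
          out[i * m + j] = prev;
        }
      }
    }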
Das, Dibyendu via llvm-dev
2016-Sep-22 06:30 UTC
[llvm-dev] Extending LV to vectorize outerloops
Hi Ayal,

Can you elaborate a bit more on the modelling of the control flow for innermost loop vectorization? In particular, I am interested in whether you are tending towards a full mask-based approach as suggested by Karrenberg et al., or whether you will also evaluate, as a cost/benefit trade-off, models like BOSCC (Shin et al.) or BOSCC-like models such as El Hajj's "Dynamic Loop Vectorization for Executing OpenCL Kernels on CPUs". Different vectorization factors, ISA overheads and control-flow probabilities may dictate which one turns out to be best for a particular situation.

Thanks,
Dibyendu
Zaks, Ayal via llvm-dev
2016-Sep-22 10:55 UTC
[llvm-dev] Extending LV to vectorize outerloops
Sure, Dibyendu. The modelling of the control flow inside the vectorized loop is designed to support both cases where this control flow is mandatory and cases where it is the result of an optimization. The former include predicated scalarized instructions, and inner loops when vectorizing an outer loop. The latter include the various static and dynamic techniques you referred to, and possibly others such as our "Predicate Vectors If You Must". Indeed, as you point out, these are subject to cost/benefit considerations, similar to many other optimizations; they should fit within the proposed Plan Optimize Step 2.b and Cost Step 3.
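To make the two flavours concrete, here is a purely illustrative scalar C++ model (not vectorizer output, and not code from this thread) of full if-conversion versus a BOSCC-style retained branch for the conditional update `if (c[i]) a[i] += b[i];`, processing VF lanes per vector iteration; the scalar remainder loop is omitted for brevity.

    constexpr int VF = 4;  // illustrative vectorization factor

    // (1) Full if-conversion: compute a lane mask and execute the conditional
    //     block unconditionally under that mask (masked/predicated operations).
    void ifConverted(float *a, const float *b, const int *c, int n) {
      for (int i = 0; i + VF <= n; i += VF) {
        bool mask[VF];
        for (int l = 0; l < VF; ++l) mask[l] = (c[i + l] != 0);  // vector compare
        for (int l = 0; l < VF; ++l)                             // masked add + store
          if (mask[l]) a[i + l] += b[i + l];
      }
    }

    // (2) BOSCC-style: retain a branch that skips the whole block whenever no
    //     lane is active, trading a test of the mask against the block's cost.
    void boscc(float *a, const float *b, const int *c, int n) {
      for (int i = 0; i + VF <= n; i += VF) {
        bool any = false;
        for (int l = 0; l < VF; ++l) any |= (c[i + l] != 0);     // "any lane active?"
        if (!any) continue;                                      // retained branch
        for (int l = 0; l < VF; ++l)
          if (c[i + l] != 0) a[i + l] += b[i + l];
      }
    }

Which of the two wins depends on branch probabilities, the VF and the target's support for masking, which is exactly the kind of trade-off that belongs in Plan Optimize Step 2.b and Cost Step 3.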
Hal Finkel via llvm-dev
2016-Sep-22 22:35 UTC
[llvm-dev] RFC: Extending LV to vectorize outerloops
----- Original Message -----
> From: "Zaks, Ayal via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "LLVM Dev" <llvm-dev at lists.llvm.org>
> Sent: Wednesday, September 21, 2016 12:50:34 PM
> Subject: [llvm-dev] RFC: Extending LV to vectorize outerloops

> We propose to extend the innermost Loop Vectorizer to also handle
> outerloops (cf. [1]). Our aim is to best leverage the efforts already
> invested in the existing innermost Loop Vectorizer rather than
> introduce a separate pass dedicated to outerloop vectorization.

Good; I think it is important that we aim for incremental progress here, taking advantage of the existing infrastructure where reasonable.
> 4. Transform Step: materialize the best plan. Note that only this
> step modifies the IR, as in the current Loop Vectorizer.

This is also important: we should not speculatively modify the IR.

> This proposal to extend the Loop Vectorizer to outer loops supports
> and raises the importance of explicit vectorization beyond the
> current capabilities of Clang and LLVM.

Our current metadata scheme for controlling the loop vectorizer should prove easy to extend, if extension is necessary at all, to explicitly provide vectorization factors, etc. for various loop-nest levels. When we have profiling data, or can statically determine the inner-loop trip counts, then it should also be possible to reasonably decide on outer-loop vectorization. We'll need to put some thought into how to collect inner-loop trip counts from profiling data when the code is outer-loop vectorized. There may also be cases where outer-loop vectorization always seems more profitable, but we obviously can't evaluate an exponential number of potential vectorization plans, so we'll need to design some heuristic (perhaps based on reductions, consecutive memory accesses, etc.) to limit the search space.

This is an important capability; thanks for pursuing it!

 -Hal
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
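For reference, the per-loop hints Clang accepts today look as follows; the example is illustrative only, and the point of the RFC is that a hint placed on the outer loop is currently not acted upon, since the Loop Vectorizer only considers the innermost level.

    // Illustrative use of existing Clang loop pragmas on a nest. Today only the
    // innermost loop is a vectorization candidate, so the hint on the outer
    // loop has no effect; with this RFC it would become meaningful.
    void scale(float *out, const float *in, int n, int m) {
      #pragma clang loop vectorize(enable) vectorize_width(4)
      for (int i = 0; i < n; ++i) {      // outer loop: hint not honoured today
        #pragma clang loop vectorize(enable)
        for (int j = 0; j < m; ++j)      // inner loop: the level vectorized today
          out[i * m + j] = 2.0f * in[i * m + j];
      }
    }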
Renato Golin via llvm-dev
2016-Sep-22 23:03 UTC
[llvm-dev] RFC: Extending LV to vectorize outerloops
On 21 September 2016 at 18:50, Zaks, Ayal via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 1. The resulting vectorized loop is inherently designed to contain a
> *single* basic block. This poses an issue today, as innermost loops may
> benefit from retaining some internal branches when vectorized. For
> outerloops this clearly cannot hold - the resulting vectorized loop will
> contain more than a single basic block as it will contain innerloops.

There's some space for if-conversion, but it is true that the loop vectoriser gives up too easily on anything but canonical loops. That's why we even have loop canonicalisation steps.

Though, the alternative would be to expand the solution space beyond anything we can do in low-polynomial time and space. I also don't want us having completely different strategies for simple inner loops, conditionalised inner loops, outer loops with one inner loop, more than one inner loop, multiply-nested loops, etc. That would be a maintenance nightmare.

We haven't focused on outer loops for a reason. Doing one of those strategies is not that complex, but doing one without penalising the other (or all others), is. We need to at least have a plan for a few, even if we implement only one at a time.

> 1. Cost Step tries to predict what the vectorized loop will look like and
> how much it will cost, independently of what the Transform Step will
> eventually do. It's hard to keep the two in sync, enforcing the comment
> placed at the beginning of Transform Step.

This is *particularly* bad. Your proposal was put forward during the first year of the vectoriser and was deemed too complex for our initial goals. I think now is the time to start thinking again about this.

> 1. Legal Step: check if loop can be legally vectorized, encode constraints
> by initiating Vectorization Plan(s) if so.

Not much different than today. We'd need to inspect the evolution of all inner loops to be able to infer anything about the outer-loop accesses, but that shouldn't be too complex.

> 2. Plan Step:
> a. Build initial vectorization plans complying with all legal constraints.
> b. Optimize vectorization plans.

Are you proposing we at least generate some straw-man vectorised loop in IR form? Or is this still just an improved cost-model?

> 3. Cost Step: compute the relative cost of each plan. This step can be
> applied repeatedly by Plan Optimize Step 2.b.

It's cheap to calculate the costs (mostly linear), but it's not cheap to come up with multiple plans on step 2 to feed a voting system.

> 4. Transform Step: materialize the best plan. Note that only this step
> modifies the IR, as in the current Loop Vectorizer.

Modifies is a dubious term; I just want to make sure we're on the same page. You can create as many basic blocks as you want, but unless you replace the old ones with the new ones, the "IR" isn't changed.

So, if I got it right, you want to:

1. check legal, pass.
2. create some structure with an actual implementation of the vectorisation (IR?)
3. min(cost[i])
4. implement (or IR-replace) implementation "i".
If 2 is similar to 4, then this will be very costly. If not, then we have two *different* representations, and we go back to having a bad cost-model. The former can be justified if you pass an option like "-faggressive-vectorizer"; the latter can make our "diverging cost model" problem much worse.

> The Vectorization Plan is a recipe describing the vectorization candidate,
> including for instance its internal control flow. It serves both for
> estimating the cost and for performing the translation, and it facilitates
> dealing with multiple vectorization candidates. One can compare with LLVM's
> existing SLP vectorizer, where TSLP [3] adds step 2.b.

We sort of do that already, since we pick the costs of all VFs and choose the "best". Far from perfect, but I envisioned an improved cost-model having more than just VF/UF, trying skew factors (with information provided by Polly), fiddling with the induction sequence, before/after inner-loop hoisting, etc.

Since the cost model is really cheap, we can compute many of them each time and even be very creative around them. But the more creative we are, the harder it'll be to get them close to the implementation phase.

> This proposal to extend the Loop Vectorizer to outer loops supports
> explicit vectorization and raises its importance beyond the current
> capabilities of Clang and LLVM: namely, from forcing the vectorization of
> innermost loops according to a prescribed width and/or interleaving count,
> to supporting OpenMP's "#pragma omp simd" construct and its associated
> clauses, including our earlier proposal for vectorizing across function
> boundaries [2].

We already have all those pragmas, some Clang-specific, some borrowed from OpenMP SIMD. As Hal said, extending the pragmas to indicate recommended trip counts, guaranteeing no-wrapping, etc. shouldn't be too hard.

cheers,
-renato
Zaks, Ayal via llvm-dev
2016-Sep-23 23:08 UTC
[llvm-dev] RFC: Extending LV to vectorize outerloops
> -----Original Message-----
> From: Renato Golin [mailto:renato.golin at linaro.org]
> Sent: Friday, September 23, 2016 02:04
> To: Zaks, Ayal <ayal.zaks at intel.com>
> Cc: llvm-dev at lists.llvm.org
> Subject: Re: [llvm-dev] RFC: Extending LV to vectorize outerloops
>
> We haven't focused on outer loops for a reason. Doing one of those
> strategies is not that complex, but doing one without penalising the other
> (or all others), is. We need to at least have a plan for a few, even if we
> implement only one at a time.

Our first step is to make LV fully conscious of the control flow that will result from vectorizing an (innermost) loop, allowing this control flow to comprise more than a single basic block. A next step is to consider retaining branches rather than fully if-converting them. This could indeed entail a large solution space, but simple heuristics are usually employed to prune it effectively. When we reach the ability to handle outer loops the solution space grows further, and we fully recognize that early pruning of candidates will be important. As Hal noted, some heuristics can indeed be employed here too.

> Not much different than today. We'd need to inspect the evolution of all
> inner loops to be able to infer anything about the outer-loop accesses, but
> that shouldn't be too complex.

Hopefully, based on SCEV's analysis that considers nested loops.

> Are you proposing we at least generate some straw-man vectorised loop in
> IR form? Or is this still just an improved cost-model?
This step is designed to generate a straw-man vectorized loop in "shadow" form, whose cost can be directly computed (step 3), and which facilitates generating the actual vector IR instructions (step 4).

> It's cheap to calculate the costs (mostly linear), but it's not cheap to come
> up with multiple plans on step 2 to feed a voting system.

("Cheap" in terms of compile time, I take it.) Care must be taken, therefore, when deciding how many plans to draft. Step 2.b can consult the cost of partial plans being considered, for early plan pruning.
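A minimal sketch of what such early pruning might look like (purely hypothetical code, not part of the proposal): the cost of a partially built plan is compared against the best complete plan seen so far, and the candidate is abandoned once it can no longer win.

    #include <cstddef>
    #include <limits>
    #include <vector>

    // All types and functions here are hypothetical, for illustration only.
    struct PlanStep {};
    struct VectorizationPlan { std::vector<PlanStep> Steps; };
    unsigned stepCost(const PlanStep &S);

    // Return the index of the cheapest plan, abandoning a candidate as soon as
    // its partial cost already reaches the best complete cost seen so far.
    int pickPlan(const std::vector<VectorizationPlan> &Plans) {
      unsigned BestCost = std::numeric_limits<unsigned>::max();
      int BestIdx = -1;
      for (std::size_t I = 0; I != Plans.size(); ++I) {
        unsigned Partial = 0;
        bool Pruned = false;
        for (const PlanStep &S : Plans[I].Steps) {
          Partial += stepCost(S);
          if (Partial >= BestCost) { Pruned = true; break; }  // early pruning
        }
        if (!Pruned && Partial < BestCost) {
          BestCost = Partial;
          BestIdx = static_cast<int>(I);
        }
      }
      return BestIdx;
    }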
> Modifies is a dubious term; I just want to make sure we're on the same page.
> You can create as many basic blocks as you want, but unless you replace the
> old ones with the new ones, the "IR" isn't changed.

This relates to the "shadow" instructions and basic blocks discussed recently in http://lists.llvm.org/pipermail/llvm-dev/2016-August/104317.html.

> If 2 is similar to 4, then this will be very costly.

("Very costly" in terms of compile time, I take it; "cost" is overloaded here.)

> If not, then we have two *different* representations, and we go back to
> having a bad cost-model. The former can be justified if you pass an option
> like "-faggressive-vectorizer"; the latter can make our "diverging cost
> model" problem much worse.

Computing cost before having an explicit vectorization plan is indeed more compile-time efficient than first constructing the plan, but it is inherently less accurate and harder to keep consistent. Such a "divergent" predictive cost could be used to prune candidates, when considered decisive, to save compile time. Having both a "-faggressive-vectorizer" path that builds a plan and weighs it before implementing it, and the current path that vectorizes directly without a plan, may be quite a maintenance burden. A "-faggressive-vectorizer" option could instead be used to control early pruning of candidates. Let's first see how much compile time, and how much improved accuracy, are actually involved.

> Since the cost model is really cheap, we can compute many of them each time
> and even be very creative around them. But the more creative we are, the
> harder it'll be to get them close to the implementation phase.

On the other hand, computing the "non-divergent" costs of explicit plans could lead to the development of pruning techniques, or to creative improvements to the cost model, namely when these costs turn out to be very high or very low. Note that some decisions are currently taken without fully consulting the cost model, e.g.:
// FIXME: The interleaved load group with a huge gap could be even more
// expensive than scalar operations. Then we could ignore such group and
// use scalar operations instead.
and the hardwired NumberOfStoresToPredicate threshold.

> We already have all those pragmas, some Clang-specific, some borrowed
> from OpenMP SIMD.
>
> As Hal said, extending the pragmas to indicate recommended trip counts,
> guaranteeing no-wrapping, etc. shouldn't be too hard.

Thanks!
Ayal.