thr3ads.net - llvm dev - [LLVMdev] RFC: Loop distribution/Partial vectorization [Jan 2015]

If this information is useful, please help other people find it:
Share via:

Adam Nemet

2015-Jan-12 18:42 UTC

[LLVMdev] RFC: Loop distribution/Partial vectorization

Hi,

We'd like to propose new Loop Distribution pass.  The main motivation is to
allow partial vectorization of loops.  One such example is the main loop of
456.hmmer in SpecINT_2006.  The current version of the patch improves hmmer by
24% on ARM64 and 18% on X86.

The goal of the pass is to distribute a loop that can't be vectorized
because of
memory dependence cycles.  The pass splits the part with cycles into a new loop
making the remainder of the loop a candidate for vectorization.  E.g.:

        for (k = 0; k < M; k++) {
         S1:   MC[k+1] = …
         // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried dependence
         S2:   DC[k+1] = DC[k], MC[k]                                      
        }
  
      => (Loop Distribute)
  
        for (k = 0; k < M; k++) {
         S1:   MC[k+1] = ...
        }
        for (k = 0; k < M; k++) {
         S2:   DC[k+1] = DC[k], MC[k]
        }
  
      => (Loop Vectorize S1)
  
        for (k = 0; k < M; k += 4) {
         S1:   MC[k+1:k+5] = ...
        }
        for (k = 0; k < M; k++) {
         S2:   DC[k+1] = DC[k], MC[k]
        }

I'd like to collect feedback on the design decisions made so far.  These are
implemented in a proof-of-concept patch (http://reviews.llvm.org/D6930
<http://reviews.llvm.org/D6930>).
Here is the list of design choices:

- Loop Distribution is implemented as a separate pass to be run before the Loop
  Vectorizer.

- The pass reuses the Memory Dependence Checker framework from the Loop
  Vectorizer.  This along with the AccessAnalysis class is split out into a new
  LoopAccessAnalysis class.  We may want to turn this into an analysis pass on
its own.

- It also reuses the Run-time Memory Check code from the Loop Vectorizer.  The
  hmmer loop requires memchecks.  This is again captured by the same
  LoopAccessAnalysis class.

- The actual loop distribution is implemented as follows:

  - The list of unsafe memory dependencies is computed for the loop.  Unsafe
    means that the dependence may be part of a cycle (this is what the current
    framework provides).
  - Partitions are created for each set of unsafe dependences.
  - Partitions are created for each of the remaining stores not yet encountered.
    The order of the partitions preserve the original order of the dependent
    memory accesses.
  - Simple partition merging is performed to minimize the number of new loops.
  - Partitions are populated with the other dependent instructions by following
    the SSA use-def chains and control dependence edges.
  - Finally, the actual distribution is performed by creating a loop for each
    partition.  For each partition we clone the loop and remove all the
    instructions that don't belong to the partition.
  - Also, if run-time memory checks are necessary, these are emitted.  We keep
    an original version of the loop around to branch too if the checks fail.

My plan is to proceed with the following steps:

- Bring the current functionality to trunk by splitting off smaller patches from
  the current patch and completing them.  The final commit will enable loop
  distribution with a command-line flag or a loop hint.

- Explore and fine-tune the proper cost model for loop distribution to allow
  partial vectorization.  This is essentially whether to partition and what
  these partitions should be.  Currently instructions are mapped to partitions
  using a simple heuristics to create a vectorizable partitions.  We may need to
  interact with the vectorizer to make sure the vectorization will actually
  happen and it will be overall profitable.

- Explore other potentials for loop distribution, e.g.:
  - Partial vectorization of loops that can't be if-converted
  - Classic loop distribution to improve spatial locality
  - Compute the Program Dependence Graph rather than the list of unsafe memory
    accesses and allow reordering of memory operations
  - Distribute a loop in order to recognize parts as loop idioms

    Long term, loop distribution could also become a transformation utility
    (Transform/Util).  That way, the loop transformation passes could use it to
    strip the loop from parts that inhibits the given optimization.

Please let me know if you have feedback either on the design or on the next
steps.

Thanks,
Adam

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150112/4dda94ae/attachment.html>

Krzysztof Parzyszek

2015-Jan-13 16:27 UTC

head link

[LLVMdev] RFC: Loop distribution/Partial vectorization

Thanks for doing this. I really like the idea of having loop 
distribution as a separate pass (and having dependence analysis as an 
analysis pass).

A couple of comments that I have are below.

1. Handle situations like this:

     for (k = 0; k < M; k++) {
       for (i = 0; i < N; ++i) {
S1:     MC[i][k+1] = …
       }
S2:   DC[k+1] = DC[k], MC[…][k]
     }

Basically, recognize and handle dependencies between differently nested 
expressions.


2. Make it general so that it can serve any purpose (not only 
vectorization).  Various targets may want to do different things and 
distribute loops for various reasons that don't apply universally.


-Krzysztof



On 1/12/2015 12:42 PM, Adam Nemet wrote:> Hi,
>
> We'd like to propose new Loop Distribution pass.  The main motivation
is to
> allow partial vectorization of loops.  One such example is the main loop of
> 456.hmmer in SpecINT_2006.  The current version of the patch improves
> hmmer by
> 24% on ARM64 and 18% on X86.
>
> The goal of the pass is to distribute a loop that can't be vectorized
> because of
> memory dependence cycles.  The pass splits the part with cycles into a
> new loop
> making the remainder of the loop a candidate for vectorization.  E.g.:
>
>          for (k = 0; k < M; k++) {
>           S1:   MC[k+1] = …
>           // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried dependence
>           S2:   DC[k+1] = DC[k], MC[k]
>          }
>        => (Loop Distribute)
>          for (k = 0; k < M; k++) {
>           S1:   MC[k+1] = ...
>          }
>          for (k = 0; k < M; k++) {
>           S2:   DC[k+1] = DC[k], MC[k]
>          }
>        => (Loop Vectorize S1)
>          for (k = 0; k < M; k += 4) {
>           S1:   MC[k+1:k+5] = ...
>          }
>          for (k = 0; k < M; k++) {
>           S2:   DC[k+1] = DC[k], MC[k]
>          }
>
> I'd like to collect feedback on the design decisions made so far. 
These are
> implemented in a proof-of-concept patch (http://reviews.llvm.org/D6930).
> Here is the list of design choices:
>
> - Loop Distribution is implemented as a separate pass to be run before
> the Loop
>    Vectorizer.
>
> - The pass reuses the Memory Dependence Checker framework from the Loop
>    Vectorizer.  This along with the AccessAnalysis class is split out
> into a new
>    LoopAccessAnalysis class.  We may want to turn this into an analysis
> pass on its own.
>
> - It also reuses the Run-time Memory Check code from the Loop
> Vectorizer.  The
>    hmmer loop requires memchecks.  This is again captured by the same
>    LoopAccessAnalysis class.
>
> - The actual loop distribution is implemented as follows:
>
>    - The list of unsafe memory dependencies is computed for the loop.
>   Unsafe
>      means that the dependence may be part of a cycle (this is what the
> current
>      framework provides).
>    - Partitions are created for each set of unsafe dependences.
>    - Partitions are created for each of the remaining stores not yet
> encountered.
>      The order of the partitions preserve the original order of the
> dependent
>      memory accesses.
>    - Simple partition merging is performed to minimize the number of new
> loops.
>    - Partitions are populated with the other dependent instructions by
> following
>      the SSA use-def chains and control dependence edges.
>    - Finally, the actual distribution is performed by creating a loop
> for each
>      partition.  For each partition we clone the loop and remove all the
>      instructions that don't belong to the partition.
>    - Also, if run-time memory checks are necessary, these are emitted.
>   We keep
>      an original version of the loop around to branch too if the checks
> fail.
>
> My plan is to proceed with the following steps:
>
> - Bring the current functionality to trunk by splitting off smaller
> patches from
>    the current patch and completing them.  The final commit will enable
loop
>    distribution with a command-line flag or a loop hint.
>
> - Explore and fine-tune the proper cost model for loop distribution to
allow
>    partial vectorization.  This is essentially whether to partition and
what
>    these partitions should be.  Currently instructions are mapped to
> partitions
>    using a simple heuristics to create a vectorizable partitions.  We
> may need to
>    interact with the vectorizer to make sure the vectorization will
actually
>    happen and it will be overall profitable.
>
> - Explore other potentials for loop distribution, e.g.:
>    - Partial vectorization of loops that can't be if-converted
>    - Classic loop distribution to improve spatial locality
>    - Compute the Program Dependence Graph rather than the list of unsafe
> memory
>      accesses and allow reordering of memory operations
>    - Distribute a loop in order to recognize parts as loop idioms
>
>      Long term, loop distribution could also become a transformation
utility
>      (Transform/Util).  That way, the loop transformation passes could
> use it to
>      strip the loop from parts that inhibits the given optimization.
>
> Please let me know if you have feedback either on the design or on the next
> steps.
>
> Thanks,
> Adam
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
hosted by The Linux Foundation

Das, Dibyendu

2015-Jan-13 17:29 UTC

head link

[LLVMdev] RFC: Loop distribution/Partial vectorization

As far as 456.hmmer is concerned it may gain a bit more ( at least on x86) by
peeling off the last part of the loop. This will (afaik) create three
vectorizable loops instead of 2 after loop distribution.

-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
Behalf Of Krzysztof Parzyszek
Sent: Tuesday, January 13, 2015 9:57 PM
To: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] RFC: Loop distribution/Partial vectorization

Thanks for doing this. I really like the idea of having loop distribution as a
separate pass (and having dependence analysis as an analysis pass).

A couple of comments that I have are below.

1. Handle situations like this:

     for (k = 0; k < M; k++) {
       for (i = 0; i < N; ++i) {
S1:     MC[i][k+1] = ...
       }
S2:   DC[k+1] = DC[k], MC[...][k]
     }

Basically, recognize and handle dependencies between differently nested
expressions.


2. Make it general so that it can serve any purpose (not only 
vectorization).  Various targets may want to do different things and 
distribute loops for various reasons that don't apply universally.


-Krzysztof



On 1/12/2015 12:42 PM, Adam Nemet wrote:> Hi,
>
> We'd like to propose new Loop Distribution pass.  The main motivation
is to
> allow partial vectorization of loops.  One such example is the main loop of
> 456.hmmer in SpecINT_2006.  The current version of the patch improves
> hmmer by
> 24% on ARM64 and 18% on X86.
>
> The goal of the pass is to distribute a loop that can't be vectorized
> because of
> memory dependence cycles.  The pass splits the part with cycles into a
> new loop
> making the remainder of the loop a candidate for vectorization.  E.g.:
>
>          for (k = 0; k < M; k++) {
>           S1:   MC[k+1] = ...
>           // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried dependence
>           S2:   DC[k+1] = DC[k], MC[k]
>          }
>        => (Loop Distribute)
>          for (k = 0; k < M; k++) {
>           S1:   MC[k+1] = ...
>          }
>          for (k = 0; k < M; k++) {
>           S2:   DC[k+1] = DC[k], MC[k]
>          }
>        => (Loop Vectorize S1)
>          for (k = 0; k < M; k += 4) {
>           S1:   MC[k+1:k+5] = ...
>          }
>          for (k = 0; k < M; k++) {
>           S2:   DC[k+1] = DC[k], MC[k]
>          }
>
> I'd like to collect feedback on the design decisions made so far. 
These are
> implemented in a proof-of-concept patch (http://reviews.llvm.org/D6930).
> Here is the list of design choices:
>
> - Loop Distribution is implemented as a separate pass to be run before
> the Loop
>    Vectorizer.
>
> - The pass reuses the Memory Dependence Checker framework from the Loop
>    Vectorizer.  This along with the AccessAnalysis class is split out
> into a new
>    LoopAccessAnalysis class.  We may want to turn this into an analysis
> pass on its own.
>
> - It also reuses the Run-time Memory Check code from the Loop
> Vectorizer.  The
>    hmmer loop requires memchecks.  This is again captured by the same
>    LoopAccessAnalysis class.
>
> - The actual loop distribution is implemented as follows:
>
>    - The list of unsafe memory dependencies is computed for the loop.
>   Unsafe
>      means that the dependence may be part of a cycle (this is what the
> current
>      framework provides).
>    - Partitions are created for each set of unsafe dependences.
>    - Partitions are created for each of the remaining stores not yet
> encountered.
>      The order of the partitions preserve the original order of the
> dependent
>      memory accesses.
>    - Simple partition merging is performed to minimize the number of new
> loops.
>    - Partitions are populated with the other dependent instructions by
> following
>      the SSA use-def chains and control dependence edges.
>    - Finally, the actual distribution is performed by creating a loop
> for each
>      partition.  For each partition we clone the loop and remove all the
>      instructions that don't belong to the partition.
>    - Also, if run-time memory checks are necessary, these are emitted.
>   We keep
>      an original version of the loop around to branch too if the checks
> fail.
>
> My plan is to proceed with the following steps:
>
> - Bring the current functionality to trunk by splitting off smaller
> patches from
>    the current patch and completing them.  The final commit will enable
loop
>    distribution with a command-line flag or a loop hint.
>
> - Explore and fine-tune the proper cost model for loop distribution to
allow
>    partial vectorization.  This is essentially whether to partition and
what
>    these partitions should be.  Currently instructions are mapped to
> partitions
>    using a simple heuristics to create a vectorizable partitions.  We
> may need to
>    interact with the vectorizer to make sure the vectorization will
actually
>    happen and it will be overall profitable.
>
> - Explore other potentials for loop distribution, e.g.:
>    - Partial vectorization of loops that can't be if-converted
>    - Classic loop distribution to improve spatial locality
>    - Compute the Program Dependence Graph rather than the list of unsafe
> memory
>      accesses and allow reordering of memory operations
>    - Distribute a loop in order to recognize parts as loop idioms
>
>      Long term, loop distribution could also become a transformation
utility
>      (Transform/Util).  That way, the loop transformation passes could
> use it to
>      strip the loop from parts that inhibits the given optimization.
>
> Please let me know if you have feedback either on the design or on the next
> steps.
>
> Thanks,
> Adam
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
hosted by The Linux Foundation

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Adam Nemet

2015-Jan-13 19:05 UTC

head link

[LLVMdev] RFC: Loop distribution/Partial vectorization

> On Jan 13, 2015, at 8:27 AM, Krzysztof Parzyszek <kparzysz at
codeaurora.org> wrote:
> 
> Thanks for doing this. I really like the idea of having loop distribution
as a separate pass (and having dependence analysis as an analysis pass).
Great, thanks for your comments.
> A couple of comments that I have are below.
> 
> 1. Handle situations like this:
> 
>    for (k = 0; k < M; k++) {
>      for (i = 0; i < N; ++i) {
> S1:     MC[i][k+1] = …
>      }
> S2:   DC[k+1] = DC[k], MC[…][k]
>    }
> 
> Basically, recognize and handle dependencies between differently nested
expressions.
This would take some improvements to the current memory dependence checker which
handles a single loop right now.  Do you have some standard algorithms in mind
that would benefit from this?
> 2. Make it general so that it can serve any purpose (not only
vectorization).  Various targets may want to do different things and distribute
loops for various reasons that don't apply universally.
Agreed.  As I alluded to in the original post there could be multiple
optimization benefiting from this transformation.  I’ll make sure that other
heuristics are pluggable besides partial vectorization.

Adam
> -Krzysztof
> 
> 
> 
> On 1/12/2015 12:42 PM, Adam Nemet wrote:
>> Hi,
>> 
>> We'd like to propose new Loop Distribution pass.  The main
motivation is to
>> allow partial vectorization of loops.  One such example is the main
loop of
>> 456.hmmer in SpecINT_2006.  The current version of the patch improves
>> hmmer by
>> 24% on ARM64 and 18% on X86.
>> 
>> The goal of the pass is to distribute a loop that can't be
vectorized
>> because of
>> memory dependence cycles.  The pass splits the part with cycles into a
>> new loop
>> making the remainder of the loop a candidate for vectorization.  E.g.:
>> 
>>         for (k = 0; k < M; k++) {
>>          S1:   MC[k+1] = …
>>          // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried
dependence
>>          S2:   DC[k+1] = DC[k], MC[k]
>>         }
>>       => (Loop Distribute)
>>         for (k = 0; k < M; k++) {
>>          S1:   MC[k+1] = ...
>>         }
>>         for (k = 0; k < M; k++) {
>>          S2:   DC[k+1] = DC[k], MC[k]
>>         }
>>       => (Loop Vectorize S1)
>>         for (k = 0; k < M; k += 4) {
>>          S1:   MC[k+1:k+5] = ...
>>         }
>>         for (k = 0; k < M; k++) {
>>          S2:   DC[k+1] = DC[k], MC[k]
>>         }
>> 
>> I'd like to collect feedback on the design decisions made so far. 
These are
>> implemented in a proof-of-concept patch
(http://reviews.llvm.org/D6930).
>> Here is the list of design choices:
>> 
>> - Loop Distribution is implemented as a separate pass to be run before
>> the Loop
>>   Vectorizer.
>> 
>> - The pass reuses the Memory Dependence Checker framework from the Loop
>>   Vectorizer.  This along with the AccessAnalysis class is split out
>> into a new
>>   LoopAccessAnalysis class.  We may want to turn this into an analysis
>> pass on its own.
>> 
>> - It also reuses the Run-time Memory Check code from the Loop
>> Vectorizer.  The
>>   hmmer loop requires memchecks.  This is again captured by the same
>>   LoopAccessAnalysis class.
>> 
>> - The actual loop distribution is implemented as follows:
>> 
>>   - The list of unsafe memory dependencies is computed for the loop.
>>  Unsafe
>>     means that the dependence may be part of a cycle (this is what the
>> current
>>     framework provides).
>>   - Partitions are created for each set of unsafe dependences.
>>   - Partitions are created for each of the remaining stores not yet
>> encountered.
>>     The order of the partitions preserve the original order of the
>> dependent
>>     memory accesses.
>>   - Simple partition merging is performed to minimize the number of new
>> loops.
>>   - Partitions are populated with the other dependent instructions by
>> following
>>     the SSA use-def chains and control dependence edges.
>>   - Finally, the actual distribution is performed by creating a loop
>> for each
>>     partition.  For each partition we clone the loop and remove all the
>>     instructions that don't belong to the partition.
>>   - Also, if run-time memory checks are necessary, these are emitted.
>>  We keep
>>     an original version of the loop around to branch too if the checks
>> fail.
>> 
>> My plan is to proceed with the following steps:
>> 
>> - Bring the current functionality to trunk by splitting off smaller
>> patches from
>>   the current patch and completing them.  The final commit will enable
loop
>>   distribution with a command-line flag or a loop hint.
>> 
>> - Explore and fine-tune the proper cost model for loop distribution to
allow
>>   partial vectorization.  This is essentially whether to partition and
what
>>   these partitions should be.  Currently instructions are mapped to
>> partitions
>>   using a simple heuristics to create a vectorizable partitions.  We
>> may need to
>>   interact with the vectorizer to make sure the vectorization will
actually
>>   happen and it will be overall profitable.
>> 
>> - Explore other potentials for loop distribution, e.g.:
>>   - Partial vectorization of loops that can't be if-converted
>>   - Classic loop distribution to improve spatial locality
>>   - Compute the Program Dependence Graph rather than the list of unsafe
>> memory
>>     accesses and allow reordering of memory operations
>>   - Distribute a loop in order to recognize parts as loop idioms
>> 
>>     Long term, loop distribution could also become a transformation
utility
>>     (Transform/Util).  That way, the loop transformation passes could
>> use it to
>>     strip the loop from parts that inhibits the given optimization.
>> 
>> Please let me know if you have feedback either on the design or on the
next
>> steps.
>> 
>> Thanks,
>> Adam
>> 
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> 
> 
> 
> -- 
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted
by The Linux Foundation
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

Philip Reames

2015-Jan-14 00:24 UTC

head link

[LLVMdev] RFC: Loop distribution/Partial vectorization

On 01/12/2015 10:42 AM, Adam Nemet wrote:> Hi,
>
> We'd like to propose new Loop Distribution pass.  The main motivation 
> is to
> allow partial vectorization of loops.  One such example is the main 
> loop of
> 456.hmmer in SpecINT_2006.  The current version of the patch improves 
> hmmer by
> 24% on ARM64 and 18% on X86.
>
> The goal of the pass is to distribute a loop that can't be vectorized 
> because of
> memory dependence cycles.  The pass splits the part with cycles into a 
> new loop
> making the remainder of the loop a candidate for vectorization.  E.g.:
>
>         for (k = 0; k < M; k++) {
>          S1:   MC[k+1] = …
>          // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried dependence
>          S2:   DC[k+1] = DC[k], MC[k]
>         }
>       => (Loop Distribute)
>         for (k = 0; k < M; k++) {
>          S1:   MC[k+1] = ...
>         }
>         for (k = 0; k < M; k++) {
>          S2:   DC[k+1] = DC[k], MC[k]
>         }
>       => (Loop Vectorize S1)
>         for (k = 0; k < M; k += 4) {
>          S1:   MC[k+1:k+5] = ...
>         }
>         for (k = 0; k < M; k++) {
>          S2:   DC[k+1] = DC[k], MC[k]
>         }
>
> I'd like to collect feedback on the design decisions made so far. 
>  These are
> implemented in a proof-of-concept patch (http://reviews.llvm.org/D6930).
> Here is the list of design choices:
>
> - Loop Distribution is implemented as a separate pass to be run before 
> the Loop
>   Vectorizer.
>
> - The pass reuses the Memory Dependence Checker framework from the Loop
>   Vectorizer.  This along with the AccessAnalysis class is split out 
> into a new
>   LoopAccessAnalysis class.  We may want to turn this into an analysis 
> pass on its own.
>
> - It also reuses the Run-time Memory Check code from the Loop 
> Vectorizer.  The
>   hmmer loop requires memchecks.  This is again captured by the same
>   LoopAccessAnalysis class.
>
> - The actual loop distribution is implemented as follows:
>
>   - The list of unsafe memory dependencies is computed for the loop. 
>  Unsafe
>     means that the dependence may be part of a cycle (this is what the 
> current
>     framework provides).
>   - Partitions are created for each set of unsafe dependences.
>   - Partitions are created for each of the remaining stores not yet 
> encountered.
>     The order of the partitions preserve the original order of the 
> dependent
>     memory accesses.
>   - Simple partition merging is performed to minimize the number of 
> new loops.
>   - Partitions are populated with the other dependent instructions by 
> following
>     the SSA use-def chains and control dependence edges.
>   - Finally, the actual distribution is performed by creating a loop 
> for each
>     partition.  For each partition we clone the loop and remove all the
>     instructions that don't belong to the partition.
>   - Also, if run-time memory checks are necessary, these are emitted. 
>  We keep
>     an original version of the loop around to branch too if the checks 
> fail.I like the general direction.  One potential concern I have is regards 
to distributing a loop which we turn out not to vectorize and 
potentially creating larger code for no clear benefit.  We'll have to 
see how this works in practice.>
> My plan is to proceed with the following steps:
>
> - Bring the current functionality to trunk by splitting off smaller 
> patches from
>   the current patch and completing them.  The final commit will enable 
> loop
>   distribution with a command-line flag or a loop hint.I look forward to seeing your patches.  Getting this in incrementally 
will take some work on all sides, but is definitely better than trying 
to land one large patch.>
> - Explore and fine-tune the proper cost model for loop distribution to 
> allow
>   partial vectorization.  This is essentially whether to partition and 
> what
>   these partitions should be.  Currently instructions are mapped to 
> partitions
>   using a simple heuristics to create a vectorizable partitions.  We 
> may need to
>   interact with the vectorizer to make sure the vectorization will 
> actually
>   happen and it will be overall profitable.As I said above, this is my biggest area of concern.  It'll be 
interesting to see where you end up.>
> - Explore other potentials for loop distribution, e.g.:
>   - Partial vectorization of loops that can't be if-converted
>   - Classic loop distribution to improve spatial locality
>   - Compute the Program Dependence Graph rather than the list of 
> unsafe memory
>     accesses and allow reordering of memory operations
>   - Distribute a loop in order to recognize parts as loop idioms
>
>     Long term, loop distribution could also become a transformation 
> utility
>     (Transform/Util).  That way, the loop transformation passes could 
> use it to
>     strip the loop from parts that inhibits the given optimization.
>
> Please let me know if you have feedback either on the design or on the 
> next
> steps.
>
> Thanks,
> Adam
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150113/98ee0939/attachment.html>

Hal Finkel

2015-Jan-17 14:29 UTC

head link

[LLVMdev] RFC: Loop distribution/Partial vectorization

----- Original Message -----> From: "Adam Nemet" <anemet at apple.com>
> To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Monday, January 12, 2015 12:42:36 PM
> Subject: [LLVMdev] RFC: Loop distribution/Partial vectorization
> 
> 
> 
> Hi,
> 
> 
> We'd like to propose new Loop Distribution pass. The main motivation
> is to
> allow partial vectorization of loops. One such example is the main
> loop of
> 456.hmmer in SpecINT_2006. The current version of the patch improves
> hmmer by
> 24% on ARM64 and 18% on X86.
Thanks for working on this! We definitely need this capability in LLVM (and for
more than just enabling vectorization).
> 
> 
> The goal of the pass is to distribute a loop that can't be vectorized
> because of
> memory dependence cycles. The pass splits the part with cycles into a
> new loop
> making the remainder of the loop a candidate for vectorization. E.g.:
> 
> 
> for (k = 0; k < M; k++) {
> S1: MC[k+1] = …
> 
> // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried dependence
> S2: DC[k+1] = DC[k], MC[k]
> }
> 
> => (Loop Distribute)
> 
> for (k = 0; k < M; k++) {
> S1: MC[k+1] = ...
> }
> for (k = 0; k < M; k++) {
> S2: DC[k+1] = DC[k], MC[k]
> }
> 
> => (Loop Vectorize S1)
> 
> for (k = 0; k < M; k += 4) {
> S1: MC[k+1:k+5] = ...
> }
> for (k = 0; k < M; k++) {
> S2: DC[k+1] = DC[k], MC[k]
> }
> 
> 
> I'd like to collect feedback on the design decisions made so far.
> These are
> implemented in a proof-of-concept patch (
> http://reviews.llvm.org/D6930 ).
> Here is the list of design choices:
> 
> 
> - Loop Distribution is implemented as a separate pass to be run
> before the Loop
> Vectorizer.
> 
> 
> - The pass reuses the Memory Dependence Checker framework from the
> Loop
> Vectorizer. This along with the AccessAnalysis class is split out
> into a new
> LoopAccessAnalysis class. We may want to turn this into an analysis
> pass on its own.
This is good. It would be nice if this new analysis could have the same
interface as the DependenceAnalysis pass, so that it will be easy to switch
between them. I think that, eventually, we'll want to switch everything to
use something like DependenceAnalysis, at least at higher optimization levels.
> 
> 
> - It also reuses the Run-time Memory Check code from the Loop
> Vectorizer. The
> hmmer loop requires memchecks. This is again captured by the same
> LoopAccessAnalysis class.
I think this is also reasonable; we just want to make sure that we don't end
up with double memory checks. I've seen cases in the past where the
vectorizer has inserted checks that should have been eliminated as duplicates
with other loop guards, SE guard domination checking may need to be improved.
> 
> 
> - The actual loop distribution is implemented as follows:
> 
> 
> - The list of unsafe memory dependencies is computed for the loop.
> Unsafe
> means that the dependence may be part of a cycle (this is what the
> current
> framework provides).
> - Partitions are created for each set of unsafe dependences.
> - Partitions are created for each of the remaining stores not yet
> encountered.
> The order of the partitions preserve the original order of the
> dependent
> memory accesses.
> - Simple partition merging is performed to minimize the number of new
> loops.
> - Partitions are populated with the other dependent instructions by
> following
> the SSA use-def chains and control dependence edges.
> - Finally, the actual distribution is performed by creating a loop
> for each
> partition. For each partition we clone the loop and remove all the
> instructions that don't belong to the partition.
> - Also, if run-time memory checks are necessary, these are emitted.
> We keep
> an original version of the loop around to branch too if the checks
> fail.
This sounds reasonable.
> 
> 
> My plan is to proceed with the following steps:
> 
> 
> - Bring the current functionality to trunk by splitting off smaller
> patches from
> the current patch and completing them. The final commit will enable
> loop
> distribution with a command-line flag or a loop hint.
Okay, please do.
> 
> 
> - Explore and fine-tune the proper cost model for loop distribution
> to allow
> partial vectorization. This is essentially whether to partition and
> what
> these partitions should be. Currently instructions are mapped to
> partitions
> using a simple heuristics to create a vectorizable partitions. We may
> need to
> interact with the vectorizer to make sure the vectorization will
> actually
> happen and it will be overall profitable.
I think this sounds reasonable. Splitting to enable vectorization is important;
one reason to have this process tightly integrated with vectorization is so that
it can properly integrate with the vectorizers register pressure checking (we
might split to reduce register pressure, thus enabling more interleaving, at
least when doing so does not decrease spatial locality).

Independent of vectorization, loop splitting is important to reduce register
pressure within loops (i.e. loops with too many phis, but that are splittable,
could be split to prevent intra-iteration spilling). Also very important is
splitting to reduce the number of hardware prefetching streams used by the loop.
In every system on which I've worked, the hardware prefetchers have a finite
set of resources to sustain prefetching streams (5-10 per thread, depending on
the architecture). When a loop would require more streams than this then
performance will greatly suffer, and splitting it highly profitable. I'd
definitely like us to hit these two use cases too.
> 
> 
> - Explore other potentials for loop distribution, e.g.:
> - Partial vectorization of loops that can't be if-converted
> - Classic loop distribution to improve spatial locality
> - Compute the Program Dependence Graph rather than the list of unsafe
> memory
> accesses and allow reordering of memory operations
This would also be quite nice to have.
> - Distribute a loop in order to recognize parts as loop idioms
Indeed, once you have the partitions, splitting out a memcpy, etc. should not be
hard; this is not always profitable, however.
> 
> 
> Long term, loop distribution could also become a transformation
> utility
> (Transform/Util). That way, the loop transformation passes could use
> it to
> strip the loop from parts that inhibits the given optimization.
This sounds good, but we still may want to schedule the transformation itself
(late in the pipeline). We don't want to limit register-pressure-induced
loop splitting, for example, to vectorizable loops.

Thanks again,
Hal
> 
> 
> Please let me know if you have feedback either on the design or on
> the next
> steps.
> 
> 
> Thanks,
> Adam
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Adam Nemet

2015-Jan-18 09:24 UTC

head link

[LLVMdev] RFC: Loop distribution/Partial vectorization

> On Jan 17, 2015, at 6:29 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> ----- Original Message -----
>> From: "Adam Nemet" <anemet at apple.com>
>> To: "LLVM Developers Mailing List" <llvmdev at
cs.uiuc.edu>
>> Sent: Monday, January 12, 2015 12:42:36 PM
>> Subject: [LLVMdev] RFC: Loop distribution/Partial vectorization
>> 
>> 
>> 
>> Hi,
>> 
>> 
>> We'd like to propose new Loop Distribution pass. The main
motivation
>> is to
>> allow partial vectorization of loops. One such example is the main
>> loop of
>> 456.hmmer in SpecINT_2006. The current version of the patch improves
>> hmmer by
>> 24% on ARM64 and 18% on X86.
> 
> Thanks for working on this! We definitely need this capability in LLVM (and
for more than just enabling vectorization).
> 
>> 
>> 
>> The goal of the pass is to distribute a loop that can't be
vectorized
>> because of
>> memory dependence cycles. The pass splits the part with cycles into a
>> new loop
>> making the remainder of the loop a candidate for vectorization. E.g.:
>> 
>> 
>> for (k = 0; k < M; k++) {
>> S1: MC[k+1] = …
>> 
>> // Cycle in S2 due to DC[k+1] -> DC[k] loop-carried dependence
>> S2: DC[k+1] = DC[k], MC[k]
>> }
>> 
>> => (Loop Distribute)
>> 
>> for (k = 0; k < M; k++) {
>> S1: MC[k+1] = ...
>> }
>> for (k = 0; k < M; k++) {
>> S2: DC[k+1] = DC[k], MC[k]
>> }
>> 
>> => (Loop Vectorize S1)
>> 
>> for (k = 0; k < M; k += 4) {
>> S1: MC[k+1:k+5] = ...
>> }
>> for (k = 0; k < M; k++) {
>> S2: DC[k+1] = DC[k], MC[k]
>> }
>> 
>> 
>> I'd like to collect feedback on the design decisions made so far.
>> These are
>> implemented in a proof-of-concept patch (
>> http://reviews.llvm.org/D6930 ).
>> Here is the list of design choices:
>> 
>> 
>> - Loop Distribution is implemented as a separate pass to be run
>> before the Loop
>> Vectorizer.
>> 
>> 
>> - The pass reuses the Memory Dependence Checker framework from the
>> Loop
>> Vectorizer. This along with the AccessAnalysis class is split out
>> into a new
>> LoopAccessAnalysis class. We may want to turn this into an analysis
>> pass on its own.
> 
> This is good. It would be nice if this new analysis could have the same
interface as the DependenceAnalysis pass, so that it will be easy to switch
between them. I think that, eventually, we'll want to switch everything to
use something like DependenceAnalysis, at least at higher optimization levels.
Yes, that is precisely what Arnold and I discussed a few weeks ago.  We want to
reuse something that's known to work initially to get us off the ground but
then we want to be to swap in the DependenceAnalysis.  The idea was exactly as
you describe it: to try to change the interface of the Memory Dependence Checker
to match the interface of the DependenceAnalysis pass.
>> 
>> 
>> - It also reuses the Run-time Memory Check code from the Loop
>> Vectorizer. The
>> hmmer loop requires memchecks. This is again captured by the same
>> LoopAccessAnalysis class.
> 
> I think this is also reasonable; we just want to make sure that we
don't end up with double memory checks. I've seen cases in the past
where the vectorizer has inserted checks that should have been eliminated as
duplicates with other loop guards, SE guard domination checking may need to be
improved.
Yes, this was also on my list.  (Sorry.  I didn’t include everything in the
original post because it would have been way too long).

I didn’t know we already have code that tries to deal with this.  Can you please
point me to it?

>> 
>> 
>> - The actual loop distribution is implemented as follows:
>> 
>> 
>> - The list of unsafe memory dependencies is computed for the loop.
>> Unsafe
>> means that the dependence may be part of a cycle (this is what the
>> current
>> framework provides).
>> - Partitions are created for each set of unsafe dependences.
>> - Partitions are created for each of the remaining stores not yet
>> encountered.
>> The order of the partitions preserve the original order of the
>> dependent
>> memory accesses.
>> - Simple partition merging is performed to minimize the number of new
>> loops.
>> - Partitions are populated with the other dependent instructions by
>> following
>> the SSA use-def chains and control dependence edges.
>> - Finally, the actual distribution is performed by creating a loop
>> for each
>> partition. For each partition we clone the loop and remove all the
>> instructions that don't belong to the partition.
>> - Also, if run-time memory checks are necessary, these are emitted.
>> We keep
>> an original version of the loop around to branch too if the checks
>> fail.
> 
> This sounds reasonable.
> 
>> 
>> 
>> My plan is to proceed with the following steps:
>> 
>> 
>> - Bring the current functionality to trunk by splitting off smaller
>> patches from
>> the current patch and completing them. The final commit will enable
>> loop
>> distribution with a command-line flag or a loop hint.
> 
> Okay, please do.
> 
>> 
>> 
>> - Explore and fine-tune the proper cost model for loop distribution
>> to allow
>> partial vectorization. This is essentially whether to partition and
>> what
>> these partitions should be. Currently instructions are mapped to
>> partitions
>> using a simple heuristics to create a vectorizable partitions. We may
>> need to
>> interact with the vectorizer to make sure the vectorization will
>> actually
>> happen and it will be overall profitable.
> 
> I think this sounds reasonable. Splitting to enable vectorization is
important; one reason to have this process tightly integrated with vectorization
is so that it can properly integrate with the vectorizers register pressure
checking (we might split to reduce register pressure, thus enabling more
interleaving, at least when doing so does not decrease spatial locality).
OK, I haven’t thought of splitting due to register pressure.  I guess this makes
sense both in vectorizable and non-vectorizable loops.

Do you have an example for the part in parentheses?  Do you mean that spatial
locality would be decreased by interleaving?
> Independent of vectorization, loop splitting is important to reduce
register pressure within loops (i.e. loops with too many phis, but that are
splittable, could be split to prevent intra-iteration spilling). Also very
important is splitting to reduce the number of hardware prefetching streams used
by the loop. In every system on which I've worked, the hardware prefetchers
have a finite set of resources to sustain prefetching streams (5-10 per thread,
depending on the architecture). When a loop would require more streams than this
then performance will greatly suffer, and splitting it highly profitable.
I'd definitely like us to hit these two use cases too.
Sure.  I think that this is essentially what I meant by loop distribution to
improve spatial locality.   Exposing the target’s parameters for the HW
prefetcher sounds like a nice way to model this.
>> 
>> 
>> - Explore other potentials for loop distribution, e.g.:
>> - Partial vectorization of loops that can't be if-converted
>> - Classic loop distribution to improve spatial locality
>> - Compute the Program Dependence Graph rather than the list of unsafe
>> memory
>> accesses and allow reordering of memory operations
> 
> This would also be quite nice to have.
> 
>> - Distribute a loop in order to recognize parts as loop idioms
> 
> Indeed, once you have the partitions, splitting out a memcpy, etc. should
not be hard; this is not always profitable, however.
> 
>> 
>> 
>> Long term, loop distribution could also become a transformation
>> utility
>> (Transform/Util). That way, the loop transformation passes could use
>> it to
>> strip the loop from parts that inhibits the given optimization.
> 
> This sounds good, but we still may want to schedule the transformation
itself (late in the pipeline). We don't want to limit
register-pressure-induced loop splitting, for example, to vectorizable loops.
Sure.  I meant that there would still be a “stand-alone” loop distribution pass
which would be another user of this transformation utility.

Thanks very much for your feedback, Hal!

Adam
> Thanks again,
> Hal
> 
>> 
>> 
>> Please let me know if you have feedback either on the design or on
>> the next
>> steps.
>> 
>> 
>> Thanks,
>> Adam
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu>        
http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/>
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>
>> 
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150118/426cedf6/attachment.html>

Renato Golin

2015-Jan-26 11:26 UTC

head link

[LLVMdev] RFC: Loop distribution/Partial vectorization

On 12 January 2015 at 18:42, Adam Nemet <anemet at apple.com>
wrote:> - Loop Distribution is implemented as a separate pass to be run before the
> Loop Vectorizer.
Agree, this would be the best place to put it. I'm wondering whether
it would be valid (or desirable) to put any metadata stopping the
vectorizers from looking at the loops that you have marked as
non-vectorizable.

cheers,
--renato

Adam Nemet

2015-Jan-26 17:59 UTC

head link

[LLVMdev] RFC: Loop distribution/Partial vectorization

> On Jan 26, 2015, at 3:26 AM, Renato Golin <renato.golin at
linaro.org> wrote:
> 
> On 12 January 2015 at 18:42, Adam Nemet <anemet at apple.com> wrote:
>> - Loop Distribution is implemented as a separate pass to be run before
the
>> Loop Vectorizer.
> 
> Agree, this would be the best place to put it. I'm wondering whether
> it would be valid (or desirable) to put any metadata stopping the
> vectorizers from looking at the loops that you have marked as
> non-vectorizable.
This sounds like a good idea, thank you.  I’ll experiment with it when I get
this far.

Adam
> cheers,
> --renato

llvm dev - Jan 2015 - [LLVMdev] RFC: Loop distribution/Partial vectorization

[LLVMdev] RFC: Loop distribution/Partial vectorization

[LLVMdev] RFC: Loop distribution/Partial vectorization

[LLVMdev] RFC: Loop distribution/Partial vectorization

[LLVMdev] RFC: Loop distribution/Partial vectorization

[LLVMdev] RFC: Loop distribution/Partial vectorization

[LLVMdev] RFC: Loop distribution/Partial vectorization

[LLVMdev] RFC: Loop distribution/Partial vectorization

[LLVMdev] RFC: Loop distribution/Partial vectorization

[LLVMdev] RFC: Loop distribution/Partial vectorization