thr3ads.net - llvm dev - [LLVMdev] proof of concept for a loop fusion pass [Jan 2015]

If this information is useful, please help other people find it:
Share via:

Ramanarayanan, Ramshankar

2015-Jan-16 00:22 UTC

[LLVMdev] proof of concept for a loop fusion pass

Hi,

We are proposing a loop fusion pass that tries to proactive fuse loops across
function call boundaries and arbitrary control flow.

http://reviews.llvm.org/D7008

With this pass, we get 103 loop fusions in SPECCPU INT 2006 462.libquantum with
rate performance improving close to 2.5X in x86 (results from AMD A10-6700).

I took some liberties in patching up some of the code in
ScalarEvolution/DependenceAnalysis/GlobalsModRef/Clang options/ and also
adjusted the IPO/LTO pass managers. I would need to do a better job there but I
would like to put this out as WIP for high level reviews.  At this time standard
loop fusion test cases may not work (cases where control dep are same or loops
are truly adjacent etc). I would be doing that as well.

Appreciate suggestions and other help.

The pass initially attempts to inline calls which have loops in them.  Following
inlining, a separate scalar "main" pass tries to fuse loops. It is not
necessary for loops to be adjacent or have the same control dependence.
Specifically a control dependence closure is computed and loops that share a
control dep closure prefix are taken as candidates for fusion, in case there are
no other loops in a certain path between them.  A loop graph is built with edges
between loops which share the control dependence closure prefix. A recursive
application of fusion follows on the graph from "bottom-up".

During fusion, program slices are taken based on the control flow closures of
the two loops being fused.  Example: Suppose two loops A and B are going to be
fused.  They share the same control dependence prefix, but their suffices vary. 
The suffices are used to duplicate Control flow paths that pass through both
loops. Specifically paths from the first block in the control-dep closure suffix
of the first loop, through the first loop's exit and following into the
suffix of the second loop through the second loop's exit on to the common
post-dominator of either loop's exit. The number of paths grows
exponentially. At this time some heuristics are used when exploring for paths. 
Using profile based heuristics is a better idea.

Also function unswitching helps by cleaning up control flow.

Example:
if (a & b) {
  if (a & c) {
    if (t & 0xF) {
      for (i=0; i < size; i++) {
        if (A[i] & d) {
          A[i] |= (1 << t);
        }
      }
    }
  }
}

if (b & d) {
  if (a & d) {
    if (t & 0xA) {
      for (i=0; i < size; i++) {
        if (A[i] & d) {
          A[i] |= (1 << t);
        }
      }
    }
  }
}

After fusion we will have:

if (a&b && a&c && t&0xF && b&d
&& a&d t&0xA) {
  for (i=0; i < size; i++) {
    if (A[i] & d) {
      A[i] |= (1 << t);
    }
    if (A[i] & d) {
      A[i] |= (1 << t);
    }
  }
} else {
// original code
  if (a & b) {
    if (a & c) {
      if (t & 0xF) {
        for (i=0; i < size; i++) {
          if (A[i] & d) {
            A[i] |= (1 << t);
          }
        }
      }
    }
  }

  if (b & d) {
    if (a & d) {
      if (t & 0xA) {
        for (i=0; i < size; i++) {
          if (A[i] & d) {
            A[i] |= (1 << t);
          }
        }
      }
    }
  }
}

Thanks,
Ramshankar
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150116/fa585cf3/attachment.html>

Kristof Beyls

2015-Jan-16 09:14 UTC

head link

[LLVMdev] proof of concept for a loop fusion pass

Hi Ramshankar,

This looks really interesting.

Interestingly, a patch for loop distribution has been posted for review
recently.
On a conceptual level loop fusion and loop distribution are each other's
inverse, so I would assume that
they should ideally somehow be combined and use a single cost function to
decide, out of all legal
fusions/distributions, which one is most profitable.

Does that make sense, or is there some reason why these fundamentally
couldn't be combined into a single
loop fuse-and-or-distribute transformation?

I haven't looked closely at either the code of the distribution or the
fusion patches, but would it be hard to
combine them into a single transformation - assuming my assumptions above do
make sense?

If it would prove sensible to somehow combine the distribution and fusion
transformations - could this
be the beginning of a more general high-level loop transformation framework
in LLVM?

Thanks,

Kristof

 

 

From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
Behalf Of Ramanarayanan, Ramshankar
Sent: 16 January 2015 00:22
To: llvmdev at cs.uiuc.edu
Subject: [LLVMdev] proof of concept for a loop fusion pass

 

Hi,

 

We are proposing a loop fusion pass that tries to proactive fuse loops
across function call boundaries and arbitrary control flow.

 

http://reviews.llvm.org/D7008

 

With this pass, we get 103 loop fusions in SPECCPU INT 2006 462.libquantum
with rate performance improving close to 2.5X in x86 (results from AMD
A10-6700).

 

I took some liberties in patching up some of the code in
ScalarEvolution/DependenceAnalysis/GlobalsModRef/Clang options/ and also
adjusted the IPO/LTO pass managers. I would need to do a better job there
but I would like to put this out as WIP for high level reviews.  At this
time standard loop fusion test cases may not work (cases where control dep
are same or loops are truly adjacent etc). I would be doing that as well.

 

Appreciate suggestions and other help. 

 

The pass initially attempts to inline calls which have loops in them.
Following inlining, a separate scalar "main" pass tries to fuse loops.
It is
not necessary for loops to be adjacent or have the same control dependence.
Specifically a control dependence closure is computed and loops that share a
control dep closure prefix are taken as candidates for fusion, in case there
are no other loops in a certain path between them.  A loop graph is built
with edges between loops which share the control dependence closure prefix.
A recursive application of fusion follows on the graph from
"bottom-up".

 

During fusion, program slices are taken based on the control flow closures
of the two loops being fused.  Example: Suppose two loops A and B are going
to be fused.  They share the same control dependence prefix, but their
suffices vary.  The suffices are used to duplicate Control flow paths that
pass through both loops. Specifically paths from the first block in the
control-dep closure suffix of the first loop, through the first loop's exit
and following into the suffix of the second loop through the second loop's
exit on to the common post-dominator of either loop's exit. The number of
paths grows exponentially. At this time some heuristics are used when
exploring for paths.  Using profile based heuristics is a better idea.

 

Also function unswitching helps by cleaning up control flow.

 

Example:

if (a & b) {

  if (a & c) {

    if (t & 0xF) {

      for (i=0; i < size; i++) {

        if (A[i] & d) {

          A[i] |= (1 << t);

        }

      }

    }

  }

}

 

if (b & d) {

  if (a & d) {

    if (t & 0xA) {

      for (i=0; i < size; i++) {

        if (A[i] & d) {

          A[i] |= (1 << t);

        }

      }

    }

  }

}

 

After fusion we will have:

 

if (a&b && a&c && t&0xF && b&d
&& a&d t&0xA) {

  for (i=0; i < size; i++) {

    if (A[i] & d) {

      A[i] |= (1 << t);

    }

    if (A[i] & d) {

      A[i] |= (1 << t);

    }

  }

} else {

// original code

  if (a & b) {

    if (a & c) {

      if (t & 0xF) {

        for (i=0; i < size; i++) {

          if (A[i] & d) {

            A[i] |= (1 << t);

          }

        }

      }

    }

  }

 

  if (b & d) {

    if (a & d) {

      if (t & 0xA) {

        for (i=0; i < size; i++) {

          if (A[i] & d) {

            A[i] |= (1 << t);

          }

        }

      }

    }

  }

}

 

Thanks,

Ramshankar
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150116/4222bd56/attachment.html>

C Bergström

2015-Jan-16 09:40 UTC

head link

[LLVMdev] proof of concept for a loop fusion pass

On Fri, Jan 16, 2015 at 4:14 PM, Kristof Beyls <kristof.beyls at arm.com>
wrote:
> Hi Ramshankar,
>
> This looks really interesting.
>
> Interestingly, a patch for loop distribution has been posted for review
> recently.
> On a conceptual level loop fusion and loop distribution are each other’s
> inverse, so I would assume that
> they should ideally somehow be combined and use a single cost function to
> decide, out of all legal
> fusions/distributions, which one is most profitable.
>
> Does that make sense, or is there some reason why these fundamentally
> couldn’t be combined into a single
> loop fuse-and-or-distribute transformation?
>
> I haven’t looked closely at either the code of the distribution or the
> fusion patches, but would it be hard to
> combine them into a single transformation – assuming my assumptions above
> do make sense?
>
> If it would prove sensible to somehow combine the distribution and fusion
> transformations – could this
> be the beginning of a more general high-level loop transformation
> framework in LLVM?
>
At the very high level I don't think the analysis passes should be
combined. Outside of SPEC tuning how would you decide when the cost of
doing one vs the other is appropriate? Without target level information..
how could you estimate the cost or impact? If we think very very long term,
I wonder how something like backtracking could be useful here.. by this I
mean running multiple potentially conflicting or opposing transformations
in parallel, potentially all the way to code generation, and then in the
end pick the path which produces the best output. I realize the
infrastructure to do something like this is almost a dream, but I thought
it's worth mentioning.

If my comments aren't coherent I can potentially dig up the slides from
Fred Chow who originally proposed the idea/concept (for another compiler)
(ping me off list)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150116/ef611b95/attachment.html>

Ramanarayanan, Ramshankar

2015-Jan-16 15:33 UTC

head link

[LLVMdev] proof of concept for a loop fusion pass

Hi Kristof,

Thanks! We have to have a cost function but unfortunately I haven't given it
more than a fleeting thought yet. I think that a reasonable locality of
reference model would be able to identify the candidates we are fusing right
now. In Adam's work, he is pruning out loops with one loop for statements in
a dependence cycle with the result that has one vectorizable loop for the
statements not in a dependence cycle and one scalar loop for the rest of the
statements. That is not a locality of reference based tuning. That is still a
minor detail and one expects a common cost function based on cache model/helps
vectorize etc., would very well fit into both of these works. IMO both works are
standard, text book like. All suggestions/help sought :) as we are far from a
strong loop optimization framework. There are other challenges like phase
ordering (inlining may help fusion if targeted/analyzing at LTO for real life
cases/it looks to be better to delay vectorization after fusion).

Best regards,
Ramshankar

From: Kristof Beyls [mailto:kristof.beyls at arm.com]
Sent: Friday, January 16, 2015 4:15 AM
To: Ramanarayanan, Ramshankar; llvmdev at cs.uiuc.edu
Cc: anemet at apple.com; kparzysz at codeaurora.org; listmail at
philipreames.com; Das, Dibyendu
Subject: RE: proof of concept for a loop fusion pass

Hi Ramshankar,
This looks really interesting.
Interestingly, a patch for loop distribution has been posted for review
recently.
On a conceptual level loop fusion and loop distribution are each other's
inverse, so I would assume that
they should ideally somehow be combined and use a single cost function to
decide, out of all legal
fusions/distributions, which one is most profitable.
Does that make sense, or is there some reason why these fundamentally
couldn't be combined into a single
loop fuse-and-or-distribute transformation?
I haven't looked closely at either the code of the distribution or the
fusion patches, but would it be hard to
combine them into a single transformation - assuming my assumptions above do
make sense?
If it would prove sensible to somehow combine the distribution and fusion
transformations - could this
be the beginning of a more general high-level loop transformation framework in
LLVM?
Thanks,
Kristof


From: llvmdev-bounces at cs.uiuc.edu<mailto:llvmdev-bounces at
cs.uiuc.edu> [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of
Ramanarayanan, Ramshankar
Sent: 16 January 2015 00:22
To: llvmdev at cs.uiuc.edu<mailto:llvmdev at cs.uiuc.edu>
Subject: [LLVMdev] proof of concept for a loop fusion pass

Hi,

We are proposing a loop fusion pass that tries to proactive fuse loops across
function call boundaries and arbitrary control flow.

http://reviews.llvm.org/D7008

With this pass, we get 103 loop fusions in SPECCPU INT 2006 462.libquantum with
rate performance improving close to 2.5X in x86 (results from AMD A10-6700).

I took some liberties in patching up some of the code in
ScalarEvolution/DependenceAnalysis/GlobalsModRef/Clang options/ and also
adjusted the IPO/LTO pass managers. I would need to do a better job there but I
would like to put this out as WIP for high level reviews.  At this time standard
loop fusion test cases may not work (cases where control dep are same or loops
are truly adjacent etc). I would be doing that as well.

Appreciate suggestions and other help.

The pass initially attempts to inline calls which have loops in them.  Following
inlining, a separate scalar "main" pass tries to fuse loops. It is not
necessary for loops to be adjacent or have the same control dependence.
Specifically a control dependence closure is computed and loops that share a
control dep closure prefix are taken as candidates for fusion, in case there are
no other loops in a certain path between them.  A loop graph is built with edges
between loops which share the control dependence closure prefix. A recursive
application of fusion follows on the graph from "bottom-up".

During fusion, program slices are taken based on the control flow closures of
the two loops being fused.  Example: Suppose two loops A and B are going to be
fused.  They share the same control dependence prefix, but their suffices vary. 
The suffices are used to duplicate Control flow paths that pass through both
loops. Specifically paths from the first block in the control-dep closure suffix
of the first loop, through the first loop's exit and following into the
suffix of the second loop through the second loop's exit on to the common
post-dominator of either loop's exit. The number of paths grows
exponentially. At this time some heuristics are used when exploring for paths. 
Using profile based heuristics is a better idea.

Also function unswitching helps by cleaning up control flow.

Example:
if (a & b) {
  if (a & c) {
    if (t & 0xF) {
      for (i=0; i < size; i++) {
        if (A[i] & d) {
          A[i] |= (1 << t);
        }
      }
    }
  }
}

if (b & d) {
  if (a & d) {
    if (t & 0xA) {
      for (i=0; i < size; i++) {
        if (A[i] & d) {
          A[i] |= (1 << t);
        }
      }
    }
  }
}

After fusion we will have:

if (a&b && a&c && t&0xF && b&d
&& a&d t&0xA) {
  for (i=0; i < size; i++) {
    if (A[i] & d) {
      A[i] |= (1 << t);
    }
    if (A[i] & d) {
      A[i] |= (1 << t);
    }
  }
} else {
// original code
  if (a & b) {
    if (a & c) {
      if (t & 0xF) {
        for (i=0; i < size; i++) {
          if (A[i] & d) {
            A[i] |= (1 << t);
          }
        }
      }
    }
  }

  if (b & d) {
    if (a & d) {
      if (t & 0xA) {
        for (i=0; i < size; i++) {
          if (A[i] & d) {
            A[i] |= (1 << t);
          }
        }
      }
    }
  }
}

Thanks,
Ramshankar
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150116/145fa08b/attachment.html>

Philip Reames

2015-Jan-16 18:55 UTC

head link

[LLVMdev] proof of concept for a loop fusion pass

I would strongly suggest we not try to be overly general at this time.  
Let's focus on getting a clearly profitably use case for each, and then 
worry about generalizing once we have more practical experience and code 
has actually landed in tree.  General loop optimization is a *hard* 
problem.

As a near term goal, having the profitability heuristics well separated 
from the actual transforms seems like a reasonable goal. Even this 
should happen after we have tests and working examples in tree though.

Philip

On 01/16/2015 01:14 AM, Kristof Beyls wrote:>
> Hi Ramshankar,
>
> This looks really interesting.
>
> Interestingly, a patch for loop distribution has been posted for 
> review recently.
> On a conceptual level loop fusion and loop distribution are each 
> other’s inverse, so I would assume that
> they should ideally somehow be combined and use a single cost function 
> to decide, out of all legal
> fusions/distributions, which one is most profitable.
>
> Does that make sense, or is there some reason why these fundamentally 
> couldn’t be combined into a single
> loop fuse-and-or-distribute transformation?
>
> I haven’t looked closely at either the code of the distribution or the 
> fusion patches, but would it be hard to
> combine them into a single transformation – assuming my assumptions 
> above do make sense?
>
> If it would prove sensible to somehow combine the distribution and 
> fusion transformations – could this
> be the beginning of a more general high-level loop transformation 
> framework in LLVM?
>
> Thanks,
>
> Kristof
>
> *From:*llvmdev-bounces at cs.uiuc.edu 
> [mailto:llvmdev-bounces at cs.uiuc.edu] *On Behalf Of *Ramanarayanan, 
> Ramshankar
> *Sent:* 16 January 2015 00:22
> *To:* llvmdev at cs.uiuc.edu
> *Subject:* [LLVMdev] proof of concept for a loop fusion pass
>
> Hi,
>
> We are proposing a loop fusion pass that tries to proactive fuse loops 
> across function call boundaries and arbitrary control flow.
>
> http://reviews.llvm.org/D7008
>
> With this pass, we get 103 loop fusions in SPECCPU INT 2006 
> 462.libquantum with rate performance improving close to 2.5X in x86 
> (results from AMD A10-6700).
>
> I took some liberties in patching up some of the code in 
> ScalarEvolution/DependenceAnalysis/GlobalsModRef/Clang options/ and 
> also adjusted the IPO/LTO pass managers. I would need to do a better 
> job there but I would like to put this out as WIP for high level 
> reviews.  At this time standard loop fusion test cases may not work 
> (cases where control dep are same or loops are truly adjacent etc). I 
> would be doing that as well.
>
> Appreciate suggestions and other help.
>
> The pass initially attempts to inline calls which have loops in them.  
> Following inlining, a separate scalar "main" pass tries to fuse
loops.
> It is not necessary for loops to be adjacent or have the same control 
> dependence. Specifically a control dependence closure is computed and 
> loops that share a control dep closure prefix are taken as candidates 
> for fusion, in case there are no other loops in a certain path between 
> them.  A loop graph is built with edges between loops which share the 
> control dependence closure prefix. A recursive application of fusion 
> follows on the graph from "bottom-up".
>
> During fusion, program slices are taken based on the control flow 
> closures of the two loops being fused.  Example: Suppose two loops A 
> and B are going to be fused.  They share the same control dependence 
> prefix, but their suffices vary.  The suffices are used to duplicate 
> Control flow paths that pass through both loops. Specifically paths 
> from the first block in the control-dep closure suffix of the first 
> loop, through the first loop's exit and following into the suffix of 
> the second loop through the second loop's exit on to the common 
> post-dominator of either loop's exit. The number of paths grows 
> exponentially. At this time some heuristics are used when exploring 
> for paths.  Using profile based heuristics is a better idea.
>
> Also function unswitching helps by cleaning up control flow.
>
> Example:
>
> if (a & b) {
>
>   if (a & c) {
>
>     if (t & 0xF) {
>
>       for (i=0; i < size; i++) {
>
>         if (A[i] & d) {
>
>           A[i] |= (1 << t);
>
>         }
>
>       }
>
>     }
>
>   }
>
> }
>
> if (b & d) {
>
>   if (a & d) {
>
>     if (t & 0xA) {
>
>       for (i=0; i < size; i++) {
>
>         if (A[i] & d) {
>
>           A[i] |= (1 << t);
>
>         }
>
>       }
>
>     }
>
>   }
>
> }
>
> After fusion we will have:
>
> if (a&b && a&c && t&0xF && b&d
&& a&d t&0xA) {
>
>   for (i=0; i < size; i++) {
>
>     if (A[i] & d) {
>
>       A[i] |= (1 << t);
>
>     }
>
>     if (A[i] & d) {
>
>       A[i] |= (1 << t);
>
>     }
>
>   }
>
> } else {
>
> // original code
>
>   if (a & b) {
>
>     if (a & c) {
>
>       if (t & 0xF) {
>
>         for (i=0; i < size; i++) {
>
>           if (A[i] & d) {
>
>             A[i] |= (1 << t);
>
>           }
>
>         }
>
>       }
>
>     }
>
>   }
>
>   if (b & d) {
>
>     if (a & d) {
>
>       if (t & 0xA) {
>
>         for (i=0; i < size; i++) {
>
>           if (A[i] & d) {
>
>             A[i] |= (1 << t);
>
>           }
>
>         }
>
>       }
>
>     }
>
>   }
>
> }
>
> Thanks,
>
> Ramshankar
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150116/8d9b6c28/attachment.html>

Adam Nemet

2015-Jan-17 06:20 UTC

head link

[LLVMdev] proof of concept for a loop fusion pass

> On Jan 15, 2015, at 4:22 PM, Ramanarayanan, Ramshankar
<Ramshankar.Ramanarayanan at amd.com> wrote:
> 
> Hi,
>  
> We are proposing a loop fusion pass that tries to proactive fuse loops
across function call boundaries and arbitrary control flow.
>  
> http://reviews.llvm.org/D7008 <http://reviews.llvm.org/D7008>
>  
> With this pass, we get 103 loop fusions in SPECCPU INT 2006 462.libquantum
with rate performance improving close to 2.5X in x86 (results from AMD
A10-6700).
>  
> I took some liberties in patching up some of the code in
ScalarEvolution/DependenceAnalysis/GlobalsModRef/Clang options/ and also
adjusted the IPO/LTO pass managers. I would need to do a better job there but I
would like to put this out as WIP for high level reviews.  At this time standard
loop fusion test cases may not work (cases where control dep are same or loops
are truly adjacent etc). I would be doing that as well.
>  
> Appreciate suggestions and other help. 
>  
> The pass initially attempts to inline calls which have loops in them. 
Following inlining, a separate scalar "main" pass tries to fuse loops.
It is not necessary for loops to be adjacent or have the same control
dependence. Specifically a control dependence closure is computed and loops that
share a control dep closure prefix are taken as candidates for fusion, in case
there are no other loops in a certain path between them.  A loop graph is built
with edges between loops which share the control dependence closure prefix. A
recursive application of fusion follows on the graph from "bottom-up".
>  
> During fusion, program slices are taken based on the control flow closures
of the two loops being fused.  Example: Suppose two loops A and B are going to
be fused.  They share the same control dependence prefix, but their suffices
vary.  The suffices are used to duplicate Control flow paths that pass through
both loops. Specifically paths from the first block in the control-dep closure
suffix of the first loop, through the first loop's exit and following into
the suffix of the second loop through the second loop's exit on to the
common post-dominator of either loop's exit. The number of paths grows
exponentially. At this time some heuristics are used when exploring for paths. 
Using profile based heuristics is a better idea.
Hi Ramshankar,


This is interesting, indeed.  Thanks for soliciting input.

My main comment has to do with the fact that in order to get this gain you need
many pieces to work together.  My suggestion is to proceed in steps and develop
it incrementally on trunk.

For example it would be good to first add a simple loop fusion pass that only
merges adjacent loops and then build it up from there.  Can you break down your
approach into smaller, incremental steps like this?  What would these steps be?

My other comment is about the choice of DependenceAnalysis framework.  No pass
currently uses this framework and I am not sure how complete the implementation
is and whether it has adequate test coverage.  This was my main reason for
reusing the dependence analysis from the LoopVectorizer.  Have you considered
the same?

Adam
>  
> Also function unswitching helps by cleaning up control flow.
>  
> Example:
> if (a & b) {
>   if (a & c) {
>     if (t & 0xF) {
>       for (i=0; i < size; i++) {
>         if (A[i] & d) {
>           A[i] |= (1 << t);
>         }
>       }
>     }
>   }
> }
>  
> if (b & d) {
>   if (a & d) {
>     if (t & 0xA) {
>       for (i=0; i < size; i++) {
>         if (A[i] & d) {
>           A[i] |= (1 << t);
>         }
>       }
>     }
>   }
> }
>  
> After fusion we will have:
>  
> if (a&b && a&c && t&0xF && b&d
&& a&d t&0xA) {
>   for (i=0; i < size; i++) {
>     if (A[i] & d) {
>       A[i] |= (1 << t);
>     }
>     if (A[i] & d) {
>       A[i] |= (1 << t);
>     }
>   }
> } else {
> // original code
>   if (a & b) {
>     if (a & c) {
>       if (t & 0xF) {
>         for (i=0; i < size; i++) {
>           if (A[i] & d) {
>             A[i] |= (1 << t);
>           }
>         }
>       }
>     }
>   }
>  
>   if (b & d) {
>     if (a & d) {
>       if (t & 0xA) {
>         for (i=0; i < size; i++) {
>           if (A[i] & d) {
>             A[i] |= (1 << t);
>           }
>         }
>       }
>     }
>   }
> }
>  
> Thanks,
> Ramshankar
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu>        
http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/>
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150116/bd2a873d/attachment.html>

Hal Finkel

2015-Jan-17 07:29 UTC

head link

[LLVMdev] proof of concept for a loop fusion pass

----- Original Message -----> From: "Adam Nemet" <anemet at apple.com>
> To: "Ramshankar Ramanarayanan" <Ramshankar.Ramanarayanan at
amd.com>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Saturday, January 17, 2015 12:20:55 AM
> Subject: Re: [LLVMdev] proof of concept for a loop fusion pass
> 
> 
> On Jan 15, 2015, at 4:22 PM, Ramanarayanan, Ramshankar <
> Ramshankar.Ramanarayanan at amd.com > wrote:
> 
> 
> Hi,
> 
> 
> We are proposing a loop fusion pass that tries to proactive fuse
> loops across function call boundaries and arbitrary control flow.
> 
> 
> http://reviews.llvm.org/D7008
This link contains the Clang patch. Did you intend to post the LLVM patch as
well?
> 
> 
> With this pass, we get 103 loop fusions in SPECCPU INT 2006
> 462.libquantum with rate performance improving close to 2.5X in x86
> (results from AMD A10-6700).
> 
> 
> I took some liberties in patching up some of the code in
> ScalarEvolution/DependenceAnalysis/GlobalsModRef/Clang options/ and
> also adjusted the IPO/LTO pass managers. I would need to do a better
> job there but I would like to put this out as WIP for high level
> reviews. At this time standard loop fusion test cases may not work
> (cases where control dep are same or loops are truly adjacent etc).
> I would be doing that as well.
> 
> 
> Appreciate suggestions and other help.
> 
> 
> The pass initially attempts to inline calls which have loops in them.
> Following inlining, a separate scalar "main" pass tries to fuse
> loops. It is not necessary for loops to be adjacent or have the same
> control dependence. Specifically a control dependence closure is
> computed and loops that share a control dep closure prefix are taken
> as candidates for fusion, in case there are no other loops in a
> certain path between them. A loop graph is built with edges between
> loops which share the control dependence closure prefix. A recursive
> application of fusion follows on the graph from "bottom-up".
> 
> 
> During fusion, program slices are taken based on the control flow
> closures of the two loops being fused. Example: Suppose two loops A
> and B are going to be fused. They share the same control dependence
> prefix, but their suffices vary. The suffices are used to duplicate
> Control flow paths that pass through both loops. Specifically paths
> from the first block in the control-dep closure suffix of the first
> loop, through the first loop's exit and following into the suffix of
> the second loop through the second loop's exit on to the common
> post-dominator of either loop's exit. The number of paths grows
> exponentially. At this time some heuristics are used when exploring
> for paths. Using profile based heuristics is a better idea.
On this last point, would it be possible to encode your static heuristics as
improvements to our static branch probability heuristics, and then naturally
adapt to profile-based branch probabilities when those are available?
> 
> Hi Ramshankar,
> 
> 
> 
> 
> 
> This is interesting, indeed. Thanks for soliciting input.
> 
> 
> My main comment has to do with the fact that in order to get this
> gain you need many pieces to work together. My suggestion is to
> proceed in steps and develop it incrementally on trunk.
I second this. We definitely should have loop fusion in LLVM, and incremental
in-trunk development is the best way to make that happen.
> 
> 
> For example it would be good to first add a simple loop fusion pass
> that only merges adjacent loops and then build it up from there. Can
> you break down your approach into smaller, incremental steps like
> this? What would these steps be?
> 
> 
> My other comment is about the choice of DependenceAnalysis framework.
> No pass currently uses this framework and I am not sure how complete
> the implementation is
DependenceAnalysis has known issues -- as I recall, the way it pulls apart GEP
indices without verifying the boundary conditions is not really sound, but this
could be replaced by our delinearization functionality now (what Polly is using)
-- and if this work provides the impetus for cleaning that up, so much the
better. That having been said, care is required. DependenceAnalysis is
definitely more powerful than the vectorizer's dependence analysis, and
conceptually cleaner. However, using what the vectorizer uses should provide a
conservatively-correct base.
> and whether it has adequate test coverage.
DependenceAnalysis's test coverage actually isn't bad, at least for loop
bodies in traditional form.
> This was my main reason for reusing the dependence analysis from the
> LoopVectorizer. Have you considered the same?
Finally, it might be a good idea to discuss the overall optimization plan for
fuseable loops. One possibility that I had discussed with Chandler a couple of
months ago was: aggressively fuse early, expose optimization opportunities, and
then split late (after vectorization) based on target-specific modeling
(accounting for register pressure, hardware prefetching resources, etc.). There
are obviously also other possible approaches, and I'd like to know what you
think would be an optimal strategy.

Thanks again,
Hal
> 
> 
> Adam
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Also function unswitching helps by cleaning up control flow.
> 
> 
> Example:
> if (a & b) {
> if (a & c) {
> if (t & 0xF) {
> for (i=0; i < size; i++) {
> if (A[i] & d) {
> A[i] |= (1 << t);
> }
> }
> }
> }
> }
> 
> 
> if (b & d) {
> if (a & d) {
> if (t & 0xA) {
> for (i=0; i < size; i++) {
> if (A[i] & d) {
> A[i] |= (1 << t);
> }
> }
> }
> }
> }
> 
> 
> After fusion we will have:
> 
> 
> if (a&b && a&c && t&0xF && b&d
&& a&d t&0xA) {
> for (i=0; i < size; i++) {
> if (A[i] & d) {
> A[i] |= (1 << t);
> }
> if (A[i] & d) {
> A[i] |= (1 << t);
> }
> }
> } else {
> // original code
> if (a & b) {
> if (a & c) {
> if (t & 0xF) {
> for (i=0; i < size; i++) {
> if (A[i] & d) {
> A[i] |= (1 << t);
> }
> }
> }
> }
> }
> 
> 
> if (b & d) {
> if (a & d) {
> if (t & 0xA) {
> for (i=0; i < size; i++) {
> if (A[i] & d) {
> A[i] |= (1 << t);
> }
> }
> }
> }
> }
> }
> 
> 
> Thanks,
> Ramshankar _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Gerolf Hoflehner

2015-Jan-19 05:13 UTC

head link

[LLVMdev] proof of concept for a loop fusion pass

That looks like the best way to go and will split the general part from the
heroics.

The libquantum optimization an example for an heroic optimization that will fire
for one benchmark only. But the compiler capability that recognizes the
opportunity has value beyond that.

There are also a few patents from AMD and Intel in that space the oversight team
will have to be aware of.

-Gerolf
> On Jan 16, 2015, at 10:20 PM, Adam Nemet <anemet at apple.com> wrote:
> 
> 
>> On Jan 15, 2015, at 4:22 PM, Ramanarayanan, Ramshankar
<Ramshankar.Ramanarayanan at amd.com <mailto:Ramshankar.Ramanarayanan at
amd.com>> wrote:
>> 
>> Hi,
>>  
>> We are proposing a loop fusion pass that tries to proactive fuse loops
across function call boundaries and arbitrary control flow.
>>  
>> http://reviews.llvm.org/D7008 <http://reviews.llvm.org/D7008>
>>  
>> With this pass, we get 103 loop fusions in SPECCPU INT 2006
462.libquantum with rate performance improving close to 2.5X in x86 (results
from AMD A10-6700).
>>  
>> I took some liberties in patching up some of the code in
ScalarEvolution/DependenceAnalysis/GlobalsModRef/Clang options/ and also
adjusted the IPO/LTO pass managers. I would need to do a better job there but I
would like to put this out as WIP for high level reviews.  At this time standard
loop fusion test cases may not work (cases where control dep are same or loops
are truly adjacent etc). I would be doing that as well.
>>  
>> Appreciate suggestions and other help. 
>>  
>> The pass initially attempts to inline calls which have loops in them. 
Following inlining, a separate scalar "main" pass tries to fuse loops.
It is not necessary for loops to be adjacent or have the same control
dependence. Specifically a control dependence closure is computed and loops that
share a control dep closure prefix are taken as candidates for fusion, in case
there are no other loops in a certain path between them.  A loop graph is built
with edges between loops which share the control dependence closure prefix. A
recursive application of fusion follows on the graph from "bottom-up".
>>  
>> During fusion, program slices are taken based on the control flow
closures of the two loops being fused.  Example: Suppose two loops A and B are
going to be fused.  They share the same control dependence prefix, but their
suffices vary.  The suffices are used to duplicate Control flow paths that pass
through both loops. Specifically paths from the first block in the control-dep
closure suffix of the first loop, through the first loop's exit and
following into the suffix of the second loop through the second loop's exit
on to the common post-dominator of either loop's exit. The number of paths
grows exponentially. At this time some heuristics are used when exploring for
paths.  Using profile based heuristics is a better idea.
> 
> Hi Ramshankar,
> 
> 
> This is interesting, indeed.  Thanks for soliciting input.
> 
> My main comment has to do with the fact that in order to get this gain you
need many pieces to work together.  My suggestion is to proceed in steps and
develop it incrementally on trunk.
> 
> For example it would be good to first add a simple loop fusion pass that
only merges adjacent loops and then build it up from there.  Can you break down
your approach into smaller, incremental steps like this?  What would these steps
be?
> 
> My other comment is about the choice of DependenceAnalysis framework.  No
pass currently uses this framework and I am not sure how complete the
implementation is and whether it has adequate test coverage.  This was my main
reason for reusing the dependence analysis from the LoopVectorizer.  Have you
considered the same?
> 
> Adam
> 
>>  
>> Also function unswitching helps by cleaning up control flow.
>>  
>> Example:
>> if (a & b) {
>>   if (a & c) {
>>     if (t & 0xF) {
>>       for (i=0; i < size; i++) {
>>         if (A[i] & d) {
>>           A[i] |= (1 << t);
>>         }
>>       }
>>     }
>>   }
>> }
>>  
>> if (b & d) {
>>   if (a & d) {
>>     if (t & 0xA) {
>>       for (i=0; i < size; i++) {
>>         if (A[i] & d) {
>>           A[i] |= (1 << t);
>>         }
>>       }
>>     }
>>   }
>> }
>>  
>> After fusion we will have:
>>  
>> if (a&b && a&c && t&0xF && b&d
&& a&d t&0xA) {
>>   for (i=0; i < size; i++) {
>>     if (A[i] & d) {
>>       A[i] |= (1 << t);
>>     }
>>     if (A[i] & d) {
>>       A[i] |= (1 << t);
>>     }
>>   }
>> } else {
>> // original code
>>   if (a & b) {
>>     if (a & c) {
>>       if (t & 0xF) {
>>         for (i=0; i < size; i++) {
>>           if (A[i] & d) {
>>             A[i] |= (1 << t);
>>           }
>>         }
>>       }
>>     }
>>   }
>>  
>>   if (b & d) {
>>     if (a & d) {
>>       if (t & 0xA) {
>>         for (i=0; i < size; i++) {
>>           if (A[i] & d) {
>>             A[i] |= (1 << t);
>>           }
>>         }
>>       }
>>     }
>>   }
>> }
>>  
>> Thanks,
>> Ramshankar
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu <mailto:LLVMdev at cs.uiuc.edu>        
http://llvm.cs.uiuc.edu <http://llvm.cs.uiuc.edu/>
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150118/f8ea1757/attachment.html>

Seemingly Similar Threads

Search for more possibly parallel threads

llvm dev - Jan 2015 - [LLVMdev] proof of concept for a loop fusion pass

[LLVMdev] proof of concept for a loop fusion pass

[LLVMdev] proof of concept for a loop fusion pass

[LLVMdev] proof of concept for a loop fusion pass

[LLVMdev] proof of concept for a loop fusion pass

[LLVMdev] proof of concept for a loop fusion pass

[LLVMdev] proof of concept for a loop fusion pass

[LLVMdev] proof of concept for a loop fusion pass

[LLVMdev] proof of concept for a loop fusion pass

Seemingly Similar Threads