----- Original Message -----
> From: "Adam Nemet" <anemet at apple.com>
> To: "Ramshankar Ramanarayanan" <Ramshankar.Ramanarayanan at amd.com>
> Cc: llvmdev at cs.uiuc.edu
> Sent: Saturday, January 17, 2015 12:20:55 AM
> Subject: Re: [LLVMdev] proof of concept for a loop fusion pass
>
> On Jan 15, 2015, at 4:22 PM, Ramanarayanan, Ramshankar
> <Ramshankar.Ramanarayanan at amd.com> wrote:
>
> Hi,
>
> We are proposing a loop fusion pass that tries to proactively fuse
> loops across function call boundaries and arbitrary control flow.
>
> http://reviews.llvm.org/D7008

This link contains the Clang patch. Did you intend to post the LLVM patch
as well?

> With this pass, we get 103 loop fusions in SPEC CPU INT 2006
> 462.libquantum, with rate performance improving close to 2.5X on x86
> (results from an AMD A10-6700).
>
> I took some liberties in patching up some of the code in
> ScalarEvolution/DependenceAnalysis/GlobalsModRef/Clang options, and
> also adjusted the IPO/LTO pass managers. I would need to do a better
> job there, but I would like to put this out as WIP for high-level
> reviews. At this time standard loop fusion test cases may not work
> (cases where control dependences are the same or loops are truly
> adjacent, etc.). I would be doing that as well.
>
> Appreciate suggestions and other help.
>
> The pass initially attempts to inline calls which have loops in them.
> Following inlining, a separate scalar "main" pass tries to fuse
> loops. It is not necessary for loops to be adjacent or have the same
> control dependence. Specifically, a control dependence closure is
> computed, and loops that share a control dependence closure prefix are
> taken as candidates for fusion, provided there are no other loops on a
> path between them. A loop graph is built with edges between loops
> which share the control dependence closure prefix. A recursive
> application of fusion follows on the graph, bottom-up.
>
> During fusion, program slices are taken based on the control flow
> closures of the two loops being fused. Example: Suppose two loops A
> and B are going to be fused. They share the same control dependence
> prefix, but their suffixes vary. The suffixes are used to duplicate
> control flow paths that pass through both loops: specifically, paths
> from the first block in the control-dependence closure suffix of the
> first loop, through the first loop's exit, following into the suffix
> of the second loop, through the second loop's exit, and on to the
> common post-dominator of either loop's exit. The number of paths grows
> exponentially, so at this time some heuristics are used when exploring
> paths. Using profile-based heuristics would be a better idea.

On this last point, would it be possible to encode your static heuristics
as improvements to our static branch probability heuristics, and then
naturally adapt to profile-based branch probabilities when those are
available?

> Hi Ramshankar,
>
> This is interesting, indeed. Thanks for soliciting input.
>
> My main comment has to do with the fact that in order to get this
> gain you need many pieces to work together. My suggestion is to
> proceed in steps and develop it incrementally on trunk.

I second this. We definitely should have loop fusion in LLVM, and
incremental in-trunk development is the best way to make that happen.

> For example, it would be good to first add a simple loop fusion pass
> that only merges adjacent loops and then build it up from there. Can
> you break down your approach into smaller, incremental steps like
> this? What would these steps be?
>
> My other comment is about the choice of the DependenceAnalysis
> framework. No pass currently uses this framework and I am not sure how
> complete the implementation is

DependenceAnalysis has known issues -- as I recall, the way it pulls apart
GEP indices without verifying the boundary conditions is not really sound,
but this could be replaced by our delinearization functionality now (what
Polly is using) -- and if this work provides the impetus for cleaning that
up, so much the better. That having been said, care is required.
DependenceAnalysis is definitely more powerful than the vectorizer's
dependence analysis, and conceptually cleaner. However, using what the
vectorizer uses should provide a conservatively-correct base.

> and whether it has adequate test coverage.

DependenceAnalysis's test coverage actually isn't bad, at least for loop
bodies in traditional form.

> This was my main reason for reusing the dependence analysis from the
> LoopVectorizer. Have you considered the same?

Finally, it might be a good idea to discuss the overall optimization plan
for fuseable loops. One possibility that I had discussed with Chandler a
couple of months ago was: aggressively fuse early, expose optimization
opportunities, and then split late (after vectorization) based on
target-specific modeling (accounting for register pressure, hardware
prefetching resources, etc.). There are obviously also other possible
approaches, and I'd like to know what you think would be an optimal
strategy.

Thanks again,
Hal

> Adam
>
> Also, function unswitching helps by cleaning up control flow.
>
> Example:
>
>   if (a & b) {
>     if (a & c) {
>       if (t & 0xF) {
>         for (i=0; i < size; i++) {
>           if (A[i] & d) {
>             A[i] |= (1 << t);
>           }
>         }
>       }
>     }
>   }
>
>   if (b & d) {
>     if (a & d) {
>       if (t & 0xA) {
>         for (i=0; i < size; i++) {
>           if (A[i] & d) {
>             A[i] |= (1 << t);
>           }
>         }
>       }
>     }
>   }
>
> After fusion we will have:
>
>   if (a&b && a&c && t&0xF && b&d && a&d && t&0xA) {
>     for (i=0; i < size; i++) {
>       if (A[i] & d) {
>         A[i] |= (1 << t);
>       }
>       if (A[i] & d) {
>         A[i] |= (1 << t);
>       }
>     }
>   } else {
>     // original code
>     if (a & b) {
>       if (a & c) {
>         if (t & 0xF) {
>           for (i=0; i < size; i++) {
>             if (A[i] & d) {
>               A[i] |= (1 << t);
>             }
>           }
>         }
>       }
>     }
>
>     if (b & d) {
>       if (a & d) {
>         if (t & 0xA) {
>           for (i=0; i < size; i++) {
>             if (A[i] & d) {
>               A[i] |= (1 << t);
>             }
>           }
>         }
>       }
>     }
>   }
>
> Thanks,
> Ramshankar
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
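For concreteness, here is a minimal sketch of the "adjacent loops only" first
step Adam suggests. This is illustrative C, not code from the proposed patch;
the function names are made up, and the restrict qualifiers stand in for the
aliasing facts such a pass would have to prove.

  void add_then_sum(float * restrict a, const float * restrict b,
                    float * restrict c, int n) {
    /* Two adjacent loops with identical bounds and no intervening code. */
    for (int i = 0; i < n; i++)
      a[i] = b[i] * 2.0f;
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i];
  }

  void add_then_sum_fused(float * restrict a, const float * restrict b,
                          float * restrict c, int n) {
    /* After fusion: one pass over the data. */
    for (int i = 0; i < n; i++) {
      a[i] = b[i] * 2.0f;
      c[i] = a[i] + b[i];
    }
  }

The only cross-loop dependence here is the write of a[i] in the first loop
feeding the read of a[i] in the second; fusion preserves it because the write
still precedes the read within each iteration.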
> On Jan 16, 2015, at 11:29 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> [...]
>
> Finally, it might be a good idea to discuss the overall optimization plan
> for fuseable loops. One possibility that I had discussed with Chandler a
> couple of months ago was: aggressively fuse early, expose optimization
> opportunities, and then split late (after vectorization) based on
> target-specific modeling (accounting for register pressure, hardware
> prefetching resources, etc.). There are obviously also other possible
> approaches, and I'd like to know what you think would be an optimal
> strategy.

We probably want another splitting before the vectorizer to allow partial
vectorization. But other than that, this sounds like a good initial plan to
me.

Adam
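As a hedged illustration of the point Adam raises about splitting again
before the vectorizer (again not code from the patches; names and the
restrict qualifiers are assumptions): a fused loop can mix a trivially
vectorizable statement with a loop-carried recurrence, and distributing it
back into two loops lets the first one vectorize while the second stays
scalar.

  void fused(float * restrict a, const float * restrict b,
             float * restrict c, int n) {
    for (int i = 1; i < n; i++) {
      a[i] = b[i] + 1.0f;      /* no cross-iteration dependence            */
      c[i] = c[i - 1] * a[i];  /* recurrence on c[]: blocks vectorization  */
    }
  }

  void distributed(float * restrict a, const float * restrict b,
                   float * restrict c, int n) {
    for (int i = 1; i < n; i++)
      a[i] = b[i] + 1.0f;      /* this loop can now be vectorized          */
    for (int i = 1; i < n; i++)
      c[i] = c[i - 1] * a[i];  /* this one remains a scalar loop           */
  }

The distribution is legal because the only dependence between the two
statements is the flow of a[i] from the first to the second, and it still
flows in the same direction after splitting.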
Ramanarayanan, Ramshankar
2015-Jan-19 13:57 UTC
[LLVMdev] proof of concept for a loop fusion pass
Hi Hal,

There are two diffs under the revision's update history:

The clang patch is: http://reviews.llvm.org/differential/diff/18250/
The llvm patch is: http://reviews.llvm.org/differential/diff/18249/

The loop fusion pass is in Scalar opts. One of the challenges in legality
checks was aliasing with extern variables. However, I derived some metadata
from GlobalsModRef and used that in the scalar opt to get what I wanted. Do
you have thoughts on how we can propagate GlobalsModRef analysis into the
scalar opt? That could be one of the first sub-patches. Going by the various
suggestions, I will put out a diff just for loop fusion (along with control
dependences/closures) with some synthetic test cases.

This phase ordering question is of interest: will inlining (and other IPO
analysis/opt, including globals mod/ref and function unswitching) facilitate
better loop fusion? I experimented with an IPO analysis that spans call
graphs and checks "adjacent" code for loops that can be fused. However, this
made the compiler very slow, and I ended up inlining all functions that have
a singly nested loop. I could check that there is a path from entry to the
loop header that does not have "bad" basic blocks: blocks that have calls,
for example. Alternatively, I could use profile or static block frequencies.
We need to prune the search space somehow, and we have to do it across call
boundaries.

I think that vectorization should come later than fusion and distribution in
the optimization flow. Vectorization adds loop versioning. That creates
additional loops and probably a more challenging search space for loop
fusion. Interestingly, I tried vectorization by hand for 462.libquantum and
it did not help: there is an interplay with if-converting stores (what I saw
so far). Perhaps it is an architecture thing.

Thanks,
Ram
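A hedged sketch of the legality problem with extern variables described
above (illustrative only, not taken from the patch; the names are made up):
unless something like GlobalsModRef proves that a call between two loops
neither reads nor writes an extern global they use, the loops cannot safely
be fused across that call.

  extern int scale;   /* extern global: other translation units may write it */
  void logger(void);  /* opaque call sitting between the two loops           */

  void run(int * restrict A, int * restrict B, int n) {
    for (int i = 0; i < n; i++)
      A[i] *= scale;

    logger();         /* may write 'scale'; hoisting the second loop's reads
                         of 'scale' above this call (which fusion would do)
                         is illegal without mod/ref information              */

    for (int i = 0; i < n; i++)
      B[i] *= scale;
  }

Here the restrict qualifiers stand in for the array-aliasing facts; the
remaining obstacle is exactly the global, which is why propagating the
GlobalsModRef result into the scalar pass matters.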
----- Original Message -----
> From: "Ramshankar Ramanarayanan" <Ramshankar.Ramanarayanan at amd.com>
> To: "Hal Finkel" <hfinkel at anl.gov>, "Adam Nemet" <anemet at apple.com>
> Cc: llvmdev at cs.uiuc.edu, "chandlerc" <chandlerc at gmail.com>
> Sent: Monday, January 19, 2015 7:57:42 AM
> Subject: RE: [LLVMdev] proof of concept for a loop fusion pass
>
> Hi Hal,
>
> There are two diffs under the revision's update history:
>
> The clang patch is: http://reviews.llvm.org/differential/diff/18250/
> The llvm patch is: http://reviews.llvm.org/differential/diff/18249/

Okay, please don't do that. Please open one differential per patch. It is
not at all obvious that the different updates refer to completely different
patches (plus, it will mess up the incremental patch differencing
capability).

> The loop fusion pass is in Scalar opts.

That seems like the right place.

> One of the challenges in legality checks was aliasing with extern
> variables. However, I derived some metadata from GlobalsModRef and used
> that in the scalar opt to get what I wanted. Do you have thoughts on how
> we can propagate GlobalsModRef analysis into the scalar opt?

So it seems, specifically, you're attempting to propagate whether a global
does not have its address taken. For basic blocks, we stick this information
into their value subclass data; perhaps we could do the same here? We could
also have an AddressTaken member variable, as we do for MachineBasicBlock in
CodeGen. Regardless, the metadata does not seem unreasonable.

> That could be one of the first sub-patches. Going by the various
> suggestions, I will put out a diff just for loop fusion (along with
> control dependences/closures) with some synthetic test cases.

Great!

> This phase ordering question is of interest: will inlining (and other IPO
> analysis/opt, including globals mod/ref and function unswitching)
> facilitate better loop fusion?

I'm sure they will, especially inlining, but I see no reason that loop
fusion should occur outside of the inliner-driven CGSCC pass manager (so
that it occurs iteratively with inlining). We might want to make ICA (the
inliner's cost analysis) smarter about loops that might be fused, but that's
a separable improvement.

> I experimented with an IPO analysis that spans call graphs and checks
> "adjacent" code for loops that can be fused. However, this made the
> compiler very slow, and I ended up inlining all functions that have a
> singly nested loop. I could check that there is a path from entry to the
> loop header that does not have "bad" basic blocks: blocks that have
> calls, for example.

I see no reason to avoid fusing loops with calls (so long, I suppose, as we
can split later based on register pressure considerations). Why are you
doing that?

> Alternatively, I could use profile or static block frequencies. We need
> to prune the search space somehow, and we have to do it across call
> boundaries.

Better use of BPI within ICA is something that the new pass-manager
infrastructure should allow.

> I think that vectorization should come later than fusion and distribution
> in the optimization flow. Vectorization adds loop versioning. That
> creates additional loops and probably a more challenging search space for
> loop fusion.

Good point. As I mentioned, I think that fusing early and splitting late
makes sense. Loop fusion potentially exposes many other simplification
opportunities, and predicting them ahead of time is unlikely to be very
successful.

> Interestingly, I tried vectorization by hand for 462.libquantum and it
> did not help: there is an interplay with if-converting stores (what I saw
> so far). Perhaps it is an architecture thing.

Interesting.

 -Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
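A small, hedged illustration of the "fuse early, let simplification happen,
split late" point above (again illustrative C, not from the patches; the
names and the restrict qualifiers are assumptions): once two loops share a
body, a store in the first and the reload in the second become adjacent, so
later scalar passes can forward the value and possibly eliminate the
temporary traffic entirely.

  void before(const double * restrict a, double * restrict b,
              double * restrict tmp, int n) {
    /* 'tmp' carries values from the first loop to the second. */
    for (int i = 0; i < n; i++)
      tmp[i] = a[i] * 2.0;
    for (int i = 0; i < n; i++)
      b[i] = tmp[i] + 1.0;
  }

  void after_fusion(const double * restrict a, double * restrict b,
                    double * restrict tmp, int n) {
    for (int i = 0; i < n; i++) {
      tmp[i] = a[i] * 2.0;   /* store ...                                   */
      b[i] = tmp[i] + 1.0;   /* ... and reload are now adjacent, so the
                                reload can be forwarded as a[i] * 2.0        */
    }
  }

Opportunities like this are hard to predict before fusing, which is one
argument for fusing aggressively first and splitting back later based on
target-specific costs.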