So, I've spent some considerable time cleaning up and looking at the loop unrolling analysis that you started Michael, but I don't have time to look at it further so I'm writing down the current state and what I think should happen next.

Currently, in tree, we have a reasonably clean framework that visits each instruction for each iteration and tries to simplify it. Currently, the *only* meaningful simplification it does is that of folding loaded constants. There is a patch out by Michael that tries to also do DCE, but I think it is fundamentally the wrong approach.

The primary problem is that the current system is trying to use an optimization technique that relies on *forward simulation* of each iteration, but then applying backwards analyses to fold specific instructions through complex transforms. This fundamentally doesn't scale, and I suspect it is the reason for the caching layers that were added on top of SCEV. I suspect those caching layers are more than just a minor code smell -- they point to the flawed design.

I think the correct design needs to build a reasonably fast way of taking an instruction for which we have a SCEV model and constant folding it for a specific iteration of the loop. Then the visitor needs to be cast in terms of this: it should first try basic constant folding of operands, and then fall back on querying for a SCEV model and constant folding that if it has one for a particular iteration. The result should fold the GEPs that are currently analyzed ahead of time into constantexpr's. Finally, the load folding needs to be expressed in terms of folding the load through a constantexpr GEP into a constant array. If the existing constant folding isn't enough for this, I'd be rather surprised.

Without restructuring in this way, we'll always be patching things up.
For example, I added logic to skip dead control flows when analyzing the loop for unrolling, and it is completely ineffective because we don't analyze 'icmp' instructions against the induction variable. For cases where the dead control flow is predicated on the GEPs themselves (for example, any STL-style algorithm using pointers as iterators!) we can't get an accurate analysis without applying the GEP simplification independently of the load simplification.

Anyways, I'm not a SCEV expert, and so I'm not the right person to re-cast the current visitor to directly build on top of a SCEV model and do substitution of the current iteration. But I think that is the design I'd like to see before any other work goes into the code here. Happy to review when there is a patch to this effect.

-Chandler
Given how far you pushed this analysis pass I’m surprised by some of your concerns. Please see below.

> On May 26, 2015, at 10:47 AM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> Currently, in tree, we have a reasonably clean framework now that visits each instruction for each iteration and tries to simplify it. Currently, the *only* meaningful simplification it does is that of folding loaded constants. There is a patch out by Michael that tries to also do DCE, but I think it is fundamentally the wrong approach.

DCE should not be thought of as part of the framework. And it has to be a backward pass to account for (iteration-specific) dead instructions. Take for example:

    a = <some>;
    b = a * c[i];

When c[i] turns out to be zero, a becomes dead (assuming only one use of a).

> The primary problem is that the current system is trying to use an optimization technique that relies on *forward simulation* of each iteration, but then applying backwards analyses to fold specific instructions through complex transforms. This fundamentally doesn't scale and I suspect is the reason for the caching layers that were added on top of SCEV. I suspect those caching layers are more than just a minor code smell -- they point to the flawed design.

I don’t see the scalability issue with DCE. Why can’t the current design integrate GEP naturally w/o the need for the SCEV cache? I think the code could do away with the extra cache. In that case the cache is a current implementation artifact rather than a signal of a flawed design.

> I think the correct design needs to build a reasonably fast way of taking an instruction for which we have a SCEV model and constant folding it for a specific iteration of the loop.
> Then the visitor needs to be cast in terms of this: it should first try basic constant folding of operands, and then fall back on querying for a SCEV model and constant folding that if it has one for a particular iteration. The result should fold the GEPs that are currently analyzed ahead of time into constantexpr's. Finally, the load folding needs to be expressed in terms of folding the load through a constantexpr GEP into a constant array. If the existing constant folding isn't enough for this, I'd be rather surprised.
>
> Without restructuring in this way, we'll always be patching things up. For example, I added logic to skip dead control flows when analyzing the loop for unrolling and it is completely ineffective because we don't analyze 'icmp' instructions against the induction variable. For cases where the dead control flow is predicated on the GEPs themselves (for example, any STL-style algorithm using pointers as iterators!) we can't get an accurate analysis without applying the GEP simplification independently of the load simplification.

Could you share your test case? I would be surprised if the current design couldn’t handle it by adding a visitICmp, but, yes, it requires additional logic to handle dead conditional control flow.

> Anyways, I'm not a SCEV expert, and so I'm not the right person to re-cast the current visitor to directly build on top of a SCEV model and do substitution of the current iteration. But I think that is the design I'd like to see before any other work goes into the code here. Happy to review when there is a patch to this effect.
>
> -Chandler
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
On May 26, 2015, at 10:47 AM, Chandler Carruth <chandlerc at gmail.com> wrote:

> [...] There is a patch out by Michael that tries to also do DCE, but I think it is fundamentally the wrong approach.

You’re right that it’s the wrong approach, but that doesn’t mean the underlying idea is wrong. The right approach for this (and many other transformations) is to actually do the transformation, do a suite of simplifications, then see if it was worth it. There are many compilers that take that approach (because that is the only way to get an accurate cost model for a transformation like this), but it is extremely expensive in compile time.

> The primary problem is that the current system is trying to use an optimization technique that relies on *forward simulation* of each iteration, but then applying backwards analyses to fold specific instructions through complex transforms. This fundamentally doesn't scale and I suspect is the reason for the caching layers that were added on top of SCEV. I suspect those caching layers are more than just a minor code smell -- they point to the flawed design.

As the guy who added the caching scheme to SCEV, I can assure you that it had nothing to do with loop unrolling. I’m not arguing that caching ended up being a great thing in the end, just that it doesn’t have anything to do with unrolling.

> Anyways, I'm not a SCEV expert, and so I'm not the right person to re-cast the current visitor to directly build on top of a SCEV model and do substitution of the current iteration.
> But I think that is the design I'd like to see before any other work goes into the code here. Happy to review when there is a patch to this effect.

The problem here is that SCEV isn’t enough. SCEV doesn’t handle things like DCE, doesn’t handle store->load forwarding, and doesn’t handle calls. SCEV is only really appropriate for simple chains of integer operations, and real loops get more complicated than that. For example, one of the major benefits of unrolling:

    for (i) { a[i] = a[i-1]+1 }

is that you get trivial store->load forwarding without depending on our crazy memdep “PRE” stuff. Unless we add something like MemorySSA and teach SCEV about it, I don’t see how this can be painted as a SCEV problem.

-Chris