thr3ads.net - llvm dev - [LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its [Jun 2013]

If this information is useful, please help other people find it:
Share via:

Arnold Schwaighofer

2013-Jun-24 18:01 UTC

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

Hi,

I wanted to start a discussion about the following issue since I am not sure
about what to do about it:

The loop-vectorizer  has the potential to make code a quite a bit bigger (esp.
in cases where we don’t know the loop count or whether pointers alias).
Chandler has observed this in snappy where we have a simple memory copying loop
(that may overlap).

We vectorize this loop and then this loop gets inlined into a function and
prevents this function from getting inlined again. Causing a significant(?)
degradation.

https://code.google.com/p/snappy/source/browse/trunk/snappy.cc#99

We have seen some good performance benefits from vectorizing such loops. So not
vectorizing them is not really a good option I think.

In <http://llvm.org/viewvc/llvm-project?view=revision&revision=184698>
Chandler introduced a flag so that we can run the vectorizer after all CG
passes. This would prevent the inline from seeing the vectorized code.

I see some potential issues:

* We run a loop pass later again with the associated compile time cost (?)

* The vectorizer relies on cleanup passes to run afterwards: dce, instsimplify,
simplifycfg, maybe some form of redundancy elimination
  If we run later we have to run those passes again increasing compile time OR
  We have to duplicate them in the loop vectorizer increasing its complexity

* The vectorizer would like SCEV analysis to be as precise as possible: one
reason are dependency checks that want to know that expressions cannot wrap
(AddRec expressions to be more specific):
  At the moment, indvars will remove those flags in some cases which currently
is not a problem because SCEV analysis still has the old values cached (except
in the case that Chandler mention to me on IRC where we inline a function - in
which case that info is lost). My understanding of this is that this is not
really something we can fix easily because of the way that SCEV works
(unique-ifying/commoning expressions and thereby dropping the flags).
  A potential solution would be to move indvars to later. The question is do
other loop passes which simplify IR depend on indvars? Andy what is your take on
this?

The benefit of vectorizing later is that we would have more context at the
inlined call site. And it would solve the problem of the inliner seeing
vectorized code.

What do you all think?

On Jun 24, 2013, at 2:21 AM, Chandler Carruth <chandlerc at gmail.com>
wrote:
> Adding this based on a discussion with Arnold and it seems at least
> worth having this flag for us to both run some experiments to see if
> this strategy is workable. It may solve some of the regressions seen
> with the loop vectorizer.

Hal Finkel

2013-Jun-24 19:32 UTC

head link

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

----- Original Message -----> Hi,
> 
> I wanted to start a discussion about the following issue since I am
> not sure about what to do about it:
> 
> The loop-vectorizer  has the potential to make code a quite a bit
> bigger (esp. in cases where we don’t know the loop count or whether
> pointers alias).
> Chandler has observed this in snappy where we have a simple memory
> copying loop (that may overlap).
> 
> We vectorize this loop and then this loop gets inlined into a
> function and prevents this function from getting inlined again.
> Causing a significant(?) degradation.
> 
> https://code.google.com/p/snappy/source/browse/trunk/snappy.cc#99
> 
> We have seen some good performance benefits from vectorizing such
> loops. So not vectorizing them is not really a good option I think.
> 
> In
>
<http://llvm.org/viewvc/llvm-project?view=revision&revision=184698>
> Chandler introduced a flag so that we can run the vectorizer after
> all CG passes. This would prevent the inline from seeing the
> vectorized code.
There are obviously several issues here, and they seem only loosely related.
Regarding this first one, why is the right answer not to adjust (or improve) the
inlining heuristic? I understand that this is not easy, but the fact remains
that, in the end, having the loop inlined, even with the extra vectorization
checks, is what should be happening (or is the performance still worse than the
non-vectorized code?). If we really feel that we can't adjust the current
heuristic without breaking other things, then we could add some metadata to make
the cost estimator ignore the vector loop preheader, but I'd prefer
adjusting the inlining thresholds, etc. The commit message for r184698 said that
the flag was for experimentation purposes, and I think that's fine, but this
should not be the solution unless it really produces better non-inlined code as
well.
> 
> I see some potential issues:
> 
> * We run a loop pass later again with the associated compile time
> cost (?)
> 
> * The vectorizer relies on cleanup passes to run afterwards: dce,
> instsimplify, simplifycfg, maybe some form of redundancy elimination
>   If we run later we have to run those passes again increasing
>   compile time OR
>   We have to duplicate them in the loop vectorizer increasing its
>   complexity
> 
> * The vectorizer would like SCEV analysis to be as precise as
> possible: one reason are dependency checks that want to know that
> expressions cannot wrap (AddRec expressions to be more specific):
>   At the moment, indvars will remove those flags in some cases which
>   currently is not a problem because SCEV analysis still has the old
>   values cached (except in the case that Chandler mention to me on
>   IRC where we inline a function - in which case that info is lost).
>   My understanding of this is that this is not really something we
>   can fix easily because of the way that SCEV works
>   (unique-ifying/commoning expressions and thereby dropping the
>   flags).
I assume that we're taking about nsw, etc. The fact that SCEV assumes nsw in
some cases has led to problems (PR16130 has some history on this), and I
don't understand why SCEV's unique-ifying/commoning expressions implies
that it needs to drop the flags. Maybe this is just a matter of someone needing
to do the work? Is it clear whether (i + 5 == i +(nsw) 5) should always be true,
always false, or it depends on how the caller wants to use the answer?
>   A potential solution would be to move indvars to later. The
>   question is do other loop passes which simplify IR depend on
>   indvars? Andy what is your take on this?
> 
> The benefit of vectorizing later is that we would have more context
> at the inlined call site. 
Is it clear that this additional context is a win? Some simplifications make
loops easier to understand, and some make them harder to understand. The best
thing (not for compile time) may be to run the vectorizer in both places.
Nevertheless, now that we can experiment with this, it will be interesting to
see some statistics.

Thanks again,
Hal
> And it would solve the problem of the
> inliner seeing vectorized code.
> 
> What do you all think?
> 
> 
> On Jun 24, 2013, at 2:21 AM, Chandler Carruth <chandlerc at
gmail.com>
> wrote:
> 
> > Adding this based on a discussion with Arnold and it seems at least
> > worth having this flag for us to both run some experiments to see
> > if
> > this strategy is workable. It may solve some of the regressions
> > seen
> > with the loop vectorizer.
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Arnold Schwaighofer

2013-Jun-24 19:46 UTC

head link

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

On Jun 24, 2013, at 2:32 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> I assume that we're taking about nsw, etc. The fact that SCEV assumes
nsw in some cases has led to problems (PR16130 has some history on this), and I
don't understand why SCEV's unique-ifying/commoning expressions implies
that it needs to drop the flags.
Yes I am talking about the ability to maintain nsw/nuw on general expressions
between LLVM IR and the SCEV IR and back.

If I recall Andy’s explanation correctly, it is at least an issue if you have
the same expression once with a wrapping flag (because it is guarded) and once
without. Maybe also issues when we simplify expressions in that it is hard to
maintain them?

What holds true is that if you go

  LLVM IR -> SCEV -> SCEV expander -> LLVM IR 

is that you will loose wrapping flags. And IndVars does just that :(.

Chandler Carruth

2013-Jun-24 19:59 UTC

head link

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

On Mon, Jun 24, 2013 at 11:01 AM, Arnold Schwaighofer <
aschwaighofer at apple.com> wrote:
> In
<http://llvm.org/viewvc/llvm-project?view=revision&revision=184698>
> Chandler introduced a flag so that we can run the vectorizer after all CG
> passes. This would prevent the inline from seeing the vectorized code.
>
Just for the record, I have no real expectation that this is a good idea
yet... But it's hard to collect numbers without a flag of some kind, and
it's also really annoying to craft this flag given the current pass
manager, so I figured I would get a skeleton in place that folks could
experiment with, and we could keep or delete based on this discussion and
any numbers.

>
> I see some potential issues:
>
> * We run a loop pass later again with the associated compile time cost (?)
>
This actually worries me the least -- the tradeoff is between locality and
repeatedly analyzing the same loops during the highly iterative CGSCC pass
manager (which looks at each function up to 4 times, and may look at the
body of a function which gets inlined many more times).

> * The vectorizer relies on cleanup passes to run afterwards: dce,
> instsimplify, simplifycfg, maybe some form of redundancy elimination
>   If we run later we have to run those passes again increasing compile
> time OR
>   We have to duplicate them in the loop vectorizer increasing its
> complexity
>
Hopefully, these won't be too expensive. instcombine and simplifycfg
shouldn't be this late in the pipeline, but only numbers will really tell.
If we need redundancy elimination in the form of GVN, then we're in a lot
of trouble, but that doesn't seem likely to be necessary (I hope).

> * The vectorizer would like SCEV analysis to be as precise as possible:
> one reason are dependency checks that want to know that expressions cannot
> wrap (AddRec expressions to be more specific):
>   At the moment, indvars will remove those flags in some cases which
> currently is not a problem because SCEV analysis still has the old values
> cached (except in the case that Chandler mention to me on IRC where we
> inline a function - in which case that info is lost). My understanding of
> this is that this is not really something we can fix easily because of the
> way that SCEV works (unique-ifying/commoning expressions and thereby
> dropping the flags).
>   A potential solution would be to move indvars to later. The question is
> do other loop passes which simplify IR depend on indvars? Andy what is your
> take on this?
>
I think this is far and away the most important question to answer. =] I
think there are lots of things that would be improved by preserving SCEVs
precision throughout the CGSCC pass manager, but I have no idea how
feasible that is. I would also appreciate Dan's insights here.

> The benefit of vectorizing later is that we would have more context at the
> inlined call site. And it would solve the problem of the inliner seeing
> vectorized code.
>
There's more though in theory (although maybe not in practice today, and I
may just be wrong on some of these):
- It would improve the ability of the vectorizer to reason about icache
impact, because it wouldn't think the loop was the only loop in the
function only to have it be inlined in 10 places.
- It may form a vectorized loop before inlining which would be handled
better by loop idiom recognition after inlining.
- It would prevent turning loops which SCEV-based passes can simply compute
and/or delete *after* inlining adds some context into a vectorized loop
that may be significantly harder to analyze at this level.

(The last one is the most speculative to me -- it could be that I'm wrong
and SCEV and other loop analyses will work just as well on the vectorized
loops...)

All of these share a common thread: the vectorizer somewhat inherently
loses information, and thus doing it during the iterative pass manager is
very risky as later iterations may be hampered by it.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130624/cb6811a7/attachment.html>

Chandler Carruth

2013-Jun-24 20:04 UTC

head link

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

On Mon, Jun 24, 2013 at 12:32 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > In
> >
<http://llvm.org/viewvc/llvm-project?view=revision&revision=184698>
> > Chandler introduced a flag so that we can run the vectorizer after
> > all CG passes. This would prevent the inline from seeing the
> > vectorized code.
>
> There are obviously several issues here, and they seem only loosely
> related. Regarding this first one, why is the right answer not to adjust
> (or improve) the inlining heuristic? I understand that this is not easy,
> but the fact remains that, in the end, having the loop inlined, even with
> the extra vectorization checks, is what should be happening (or is the
> performance still worse than the non-vectorized code?). If we really feel
> that we can't adjust the current heuristic without breaking other
things,
> then we could add some metadata to make the cost estimator ignore the
> vector loop preheader, but I'd prefer adjusting the inlining
thresholds,
> etc. The commit message for r184698 said that the flag was for
> experimentation purposes, and I think that's fine, but this should not
be
> the solution unless it really produces better non-inlined code as well.

If all we are doing is inlining first in order to tweak the cost model for
inlining, then I agree with everything you say... but I'm not at all sure
that's the end result.

After inlining, the control flow and even loop structure may look
substantially different than they do before inlining, so the vectorizer may
make a substantially different decision about whether or not to vectorize a
particular loop. As one (somewhat contrived) example, if the loop body is
inlined into a cold region of the enclosing function, the vectorizer might
be able to prioritize code size over performance and skip vectorization.

I think the fundamental problem is that we can't un-vectorize, and the
decision (and strategy) to vectorize is based on a cost model, so the later
we do it the more information we can use in the cost model to make the
correct decision.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130624/47e4376b/attachment.html>

Hal Finkel

2013-Jun-24 21:16 UTC

head link

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

----- Original Message -----> 
> 
> 
> 
> 
> On Mon, Jun 24, 2013 at 11:01 AM, Arnold Schwaighofer <
> aschwaighofer at apple.com > wrote:
> 
> 
> 
> In <
> http://llvm.org/viewvc/llvm-project?view=revision&revision=184698 >
> Chandler introduced a flag so that we can run the vectorizer after
> all CG passes. This would prevent the inline from seeing the
> vectorized code.
> 
> 
> 
> Just for the record, I have no real expectation that this is a good
> idea yet... But it's hard to collect numbers without a flag of some
> kind, and it's also really annoying to craft this flag given the
> current pass manager, so I figured I would get a skeleton in place
> that folks could experiment with, and we could keep or delete based
> on this discussion and any numbers.
> 
> 
> 
> 
> I see some potential issues:
> 
> * We run a loop pass later again with the associated compile time
> cost (?)
> 
> 
> 
> This actually worries me the least -- the tradeoff is between
> locality and repeatedly analyzing the same loops during the highly
> iterative CGSCC pass manager (which looks at each function up to 4
> times, and may look at the body of a function which gets inlined
> many more times).
> 
> 
> 
> * The vectorizer relies on cleanup passes to run afterwards: dce,
> instsimplify, simplifycfg, maybe some form of redundancy elimination
> If we run later we have to run those passes again increasing compile
> time OR
> We have to duplicate them in the loop vectorizer increasing its
> complexity
> 
> 
> 
> Hopefully, these won't be too expensive. instcombine and simplifycfg
> shouldn't be this late in the pipeline, but only numbers will really
> tell. If we need redundancy elimination in the form of GVN, then
> we're in a lot of trouble, but that doesn't seem likely to be
> necessary (I hope).
Obviously there may be relevant differences, but running EarlyCSE after the BB
vectorizer turned out to work almost as well as running GVN, and the improved
compile time seemed worth the trade off.
> 
> 
> 
> * The vectorizer would like SCEV analysis to be as precise as
> possible: one reason are dependency checks that want to know that
> expressions cannot wrap (AddRec expressions to be more specific):
> At the moment, indvars will remove those flags in some cases which
> currently is not a problem because SCEV analysis still has the old
> values cached (except in the case that Chandler mention to me on IRC
> where we inline a function - in which case that info is lost). My
> understanding of this is that this is not really something we can
> fix easily because of the way that SCEV works
> (unique-ifying/commoning expressions and thereby dropping the
> flags).
> A potential solution would be to move indvars to later. The question
> is do other loop passes which simplify IR depend on indvars? Andy
> what is your take on this?
> 
> 
> 
> I think this is far and away the most important question to answer.
> =] I think there are lots of things that would be improved by
> preserving SCEVs precision throughout the CGSCC pass manager, but I
> have no idea how feasible that is. I would also appreciate Dan's
> insights here.
> 
> 
> 
> The benefit of vectorizing later is that we would have more context
> at the inlined call site. And it would solve the problem of the
> inliner seeing vectorized code.
> 
> 
> 
> There's more though in theory (although maybe not in practice today,
> and I may just be wrong on some of these):
> - It would improve the ability of the vectorizer to reason about
> icache impact, because it wouldn't think the loop was the only loop
> in the function only to have it be inlined in 10 places.
Good point (although might only apply in practice to cases where we know that
the trip count is small, and that requires profiling data in general).
> - It may form a vectorized loop before inlining which would be
> handled better by loop idiom recognition after inlining.
I imagine that we could improve idiom recognition to mitigate this particular
issue.
> - It would prevent turning loops which SCEV-based passes can simply
> compute and/or delete *after* inlining adds some context into a
> vectorized loop that may be significantly harder to analyze at this
> level.
In my experience (from working with the BB vectorizer), this can be a
significant problem. Even worse, if you vectorize any of the calculations
feeding addressing, then BasicAA also becomes less precise.
> 
> 
> (The last one is the most speculative to me -- it could be that I'm
> wrong and SCEV and other loop analyses will work just as well on the
> vectorized loops...)
> 
> 
> All of these share a common thread: the vectorizer somewhat
> inherently loses information, and thus doing it during the iterative
> pass manager is very risky as later iterations may be hampered by
> it.
This is an infrastructure problem, but I suspect it will remain true without a
significant effort to teach SCEV, BasicAA, etc. to look though vectorized
computations.

 -Hal
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Nadav Rotem

2013-Jun-24 21:59 UTC

head link

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

> 
> Just for the record, I have no real expectation that this is a good idea
yet... But it's hard to collect numbers without a flag of some kind, and
it's also really annoying to craft this flag given the current pass manager,
so I figured I would get a skeleton in place that folks could experiment with,
and we could keep or delete based on this discussion and any numbers.
I agree. Adding a temporary flag is a good way to allow people to test this
change with minimal effort.  This is what we did when Jeffery Yasskin wanted to
check the vectorizer a few weeks ago.
>  
> 
> I see some potential issues:
> 
> * We run a loop pass later again with the associated compile time cost (?)
> 
> All of these share a common thread: the vectorizer somewhat inherently
loses information, and thus doing it during the iterative pass manager is very
risky as later iterations may be hampered by it.

I agree. The vectorizer is a *lowering* pass, and much like LSR and it loses
information.  A few months ago some of us talked about this and came up with a
general draft for the ideal pass ordering.
If I remember correctly the plan was that the second half of the pipe should
start with GVN (which currently runs after the loop passes). After that come the
loop passes, the Vectorizers (loop vectorization first), and finally LSR,
Lower-switch, CGP, etc.  I think that when we discussed this most people argued
that the inliner should be before GVN and the loop passes. It would be
interesting to see the performance numbers for the new pass order.

Nadav 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130624/549b3c86/attachment.html>

Andrew Trick

2013-Jun-24 22:55 UTC

head link

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

On Jun 24, 2013, at 11:01 AM, Arnold Schwaighofer <aschwaighofer at
apple.com> wrote:
> Hi,
> 
> I wanted to start a discussion about the following issue since I am not
sure about what to do about it:
> 
> The loop-vectorizer  has the potential to make code a quite a bit bigger
(esp. in cases where we don’t know the loop count or whether pointers alias).
> Chandler has observed this in snappy where we have a simple memory copying
loop (that may overlap).
> 
> We vectorize this loop and then this loop gets inlined into a function and
prevents this function from getting inlined again. Causing a significant(?)
degradation.
> 
> https://code.google.com/p/snappy/source/browse/trunk/snappy.cc#99
> 
> We have seen some good performance benefits from vectorizing such loops. So
not vectorizing them is not really a good option I think.
> 
> In
<http://llvm.org/viewvc/llvm-project?view=revision&revision=184698>
Chandler introduced a flag so that we can run the vectorizer after all CG
passes. This would prevent the inline from seeing the vectorized code.
> 
> I see some potential issues:
> 
> * We run a loop pass later again with the associated compile time cost (?)
I want to move loop opts that depend on target info later, outside of CGSCC:
definitely indvars, vectorize/partial unroll. That way we only inline canonical
code and have a clean break between canonical and lowering passes. Hopefully
inlining heuristics will be adequate without first running these passes. For the
most part, I think it's as simple as inlining first with high-level
heuristics, then lowering later.
> * The vectorizer relies on cleanup passes to run afterwards: dce,
instsimplify, simplifycfg, maybe some form of redundancy elimination
>  If we run later we have to run those passes again increasing compile time
OR
>  We have to duplicate them in the loop vectorizer increasing its complexity
We'll have to handle this case-by-case as we gradually move passes around.
But the general idea is that lowering passes like the vectorizer should clean up
after themselves as much as feasible (whereas canonicalization passes should not
need to). We should be developing utilities to cleanup redundancies
incrementally. A value number utility would make sense. Of course, if a very
light-weight pass can simply be rescheduled to run again to do the cleanup then
we don't need a cleanup utility.
> * The vectorizer would like SCEV analysis to be as precise as possible: one
reason are dependency checks that want to know that expressions cannot wrap
(AddRec expressions to be more specific):
>  At the moment, indvars will remove those flags in some cases which
currently is not a problem because SCEV analysis still has the old values cached
(except in the case that Chandler mention to me on IRC where we inline a
function - in which case that info is lost). My understanding of this is that
this is not really something we can fix easily because of the way that SCEV
works (unique-ifying/commoning expressions and thereby dropping the flags).
>  A potential solution would be to move indvars to later. The question is do
other loop passes which simplify IR depend on indvars? Andy what is your take on
this?
Indvars should ideally preserve NSW flags whenever possible. However, we
don't want to rely on SCEV to preserve them. SCEV expressions are implicitly
reassociated and uniqued in a flow-insensitive universe independent of the
def-use chain of values. SCEV simply can't represent the flags in most
cases. I think the only flag that makes sense in SCEV is the no-wrap flag on a
recurrence (that's independent of signed/unsigned overflow).

As long as indvars does not rely on SCEVExpander it should be able to preserve
the flags. Unfortunately, it still uses SCEVExpander in a few places.
LinearFunctionTestReplace is one that should simply be moved into LSR instead.
For the couple other cases, we'll just have to work on alternative
implementations that don't drop flags, but I think it's worth doing.

That said, we should try not to rely on NSW at all unless clearly necessary. It
introduces nasty complexity that needs to be well justified. e.g. in the
vectorized loop preheader we should explicitly check for wrapping and only try
to optimize those checks using NSW if we have data that indicates it's
really necessary.

-Andy
> 
> The benefit of vectorizing later is that we would have more context at the
inlined call site. And it would solve the problem of the inliner seeing
vectorized code.
> 
> What do you all think?
> 
> 
> On Jun 24, 2013, at 2:21 AM, Chandler Carruth <chandlerc at
gmail.com> wrote:
> 
>> Adding this based on a discussion with Arnold and it seems at least
>> worth having this flag for us to both run some experiments to see if
>> this strategy is workable. It may solve some of the regressions seen
>> with the loop vectorizer.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130624/fe65128e/attachment.html>

Hal Finkel

2013-Jun-24 23:24 UTC

head link

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

----- Original Message -----> 
> 
> 
> 
> On Jun 24, 2013, at 11:01 AM, Arnold Schwaighofer <
> aschwaighofer at apple.com > wrote:
> 
> 
> 
> Hi,
> 
> I wanted to start a discussion about the following issue since I am
> not sure about what to do about it:
> 
> The loop-vectorizer has the potential to make code a quite a bit
> bigger (esp. in cases where we don’t know the loop count or whether
> pointers alias).
> Chandler has observed this in snappy where we have a simple memory
> copying loop (that may overlap).
> 
> We vectorize this loop and then this loop gets inlined into a
> function and prevents this function from getting inlined again.
> Causing a significant(?) degradation.
> 
> https://code.google.com/p/snappy/source/browse/trunk/snappy.cc#99
> 
> We have seen some good performance benefits from vectorizing such
> loops. So not vectorizing them is not really a good option I think.
> 
> In <
> http://llvm.org/viewvc/llvm-project?view=revision&revision=184698 >
> Chandler introduced a flag so that we can run the vectorizer after
> all CG passes. This would prevent the inline from seeing the
> vectorized code.
> 
> I see some potential issues:
> 
> * We run a loop pass later again with the associated compile time
> cost (?)
> 
> 
> 
> I want to move loop opts that depend on target info later, outside of
> CGSCC: definitely indvars, vectorize/partial unroll. That way we
> only inline canonical code and have a clean break between canonical
> and lowering passes. Hopefully inlining heuristics will be adequate
> without first running these passes. For the most part, I think it's
> as simple as inlining first with high-level heuristics, then
> lowering later.
> 
> 
> 
> 
> * The vectorizer relies on cleanup passes to run afterwards: dce,
> instsimplify, simplifycfg, maybe some form of redundancy elimination
> If we run later we have to run those passes again increasing compile
> time OR
> We have to duplicate them in the loop vectorizer increasing its
> complexity
> 
> 
> 
> 
> We'll have to handle this case-by-case as we gradually move passes
> around. But the general idea is that lowering passes like the
> vectorizer should clean up after themselves as much as feasible
> (whereas canonicalization passes should not need to). We should be
> developing utilities to cleanup redundancies incrementally. A value
> number utility would make sense. Of course, if a very light-weight
> pass can simply be rescheduled to run again to do the cleanup then
> we don't need a cleanup utility.
> 
> 
> 
> * The vectorizer would like SCEV analysis to be as precise as
> possible: one reason are dependency checks that want to know that
> expressions cannot wrap (AddRec expressions to be more specific):
> At the moment, indvars will remove those flags in some cases which
> currently is not a problem because SCEV analysis still has the old
> values cached (except in the case that Chandler mention to me on IRC
> where we inline a function - in which case that info is lost). My
> understanding of this is that this is not really something we can
> fix easily because of the way that SCEV works
> (unique-ifying/commoning expressions and thereby dropping the
> flags).
> A potential solution would be to move indvars to later. The question
> is do other loop passes which simplify IR depend on indvars? Andy
> what is your take on this?
> 
> 
> 
> Indvars should ideally preserve NSW flags whenever possible. However,
> we don't want to rely on SCEV to preserve them. SCEV expressions are
> implicitly reassociated and uniqued in a flow-insensitive universe
> independent of the def-use chain of values. SCEV simply can't
> represent the flags in most cases. I think the only flag that makes
> sense in SCEV is the no-wrap flag on a recurrence (that's
> independent of signed/unsigned overflow).
Why can't SCEV keep a flow sensitive (effectively per-BB) map of expressions
and their original flags (and perhaps cached deduced flags)? It seems like this
problem is solvable within SCEV.

 -Hal
> 
> 
> As long as indvars does not rely on SCEVExpander it should be able to
> preserve the flags. Unfortunately, it still uses SCEVExpander in a
> few places. LinearFunctionTestReplace is one that should simply be
> moved into LSR instead. For the couple other cases, we'll just have
> to work on alternative implementations that don't drop flags, but I
> think it's worth doing.
> 
> 
> That said, we should try not to rely on NSW at all unless clearly
> necessary. It introduces nasty complexity that needs to be well
> justified. e.g. in the vectorized loop preheader we should
> explicitly check for wrapping and only try to optimize those checks
> using NSW if we have data that indicates it's really necessary.
> 
> 
> -Andy
> 
> 
> 
> 
> 
> The benefit of vectorizing later is that we would have more context
> at the inlined call site. And it would solve the problem of the
> inliner seeing vectorized code.
> 
> What do you all think?
> 
> 
> On Jun 24, 2013, at 2:21 AM, Chandler Carruth < chandlerc at gmail.com
>
> wrote:
> 
> 
> 
> Adding this based on a discussion with Arnold and it seems at least
> worth having this flag for us to both run some experiments to see if
> this strategy is workable. It may solve some of the regressions seen
> with the loop vectorizer.
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Reasonably Related Threads

Search for more apparently analagous threads

llvm dev - Jun 2013 - [LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

[LLVMdev] [llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its

Reasonably Related Threads