thr3ads.net - llvm dev - [LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2? [Oct 2014]

Andrew Trick

2014-Oct-14 17:11 UTC

[LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?

> On Oct 14, 2014, at 8:53 AM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
>> On Oct 13, 2014, at 5:56 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>>
>> I've added a straw-man of some extra optimization passes that help
>> specific benchmarks here or there by either preparing code better on the way
>> into the vectorizer or cleaning up afterward. These are off by default until
>> there is some consensus on the right path forward, but this way we can all test
>> out the same set of flags, and collaborate on any tweaks to them.
>>
>> The primary principle here is that the vectorizer expects the IR input
>> to be in a certain canonical form, and produces IR output that may not yet be in
>> that form. The primary alternative to this is to make the vectorizers both extra
>> powerful (able to recognize many variations on things like loop structure) and
>> extra cautious about their emitted code (so that it is always already
>> optimized). I much prefer the solution of using passes rather than this unless
>> compile time is hurt too drastically. It makes it much easier to test, validate,
>> and compose all of the various components of the core optimizer.
>>
>> Here is the structural diff:
>> 
>> + loop-rotate
>>  loop-vectorize
>> + early-cse
>> + correlated-propagation
>> + instcombine
>> + licm
>> + loop-unswitch
>> + simplifycfg
>> + instcombine
>>  slp-vectorize
>> + early-cse
>> 
> 
> I think a late loop optimization (vectorization) pipeline makes sense. I
> think we just have to carefully evaluate benefit over compile time.
>
> Running loop rotation makes sense. Critical edge splitting can transform
> loops into a form that prevents loop vectorization.
>
> Both the loop vectorizer and the SLPVectorizer perform limited (restricted
> in region) forms of CSE to clean up. EarlyCSE runs across the whole function and
> so might catch more opportunities.
>
> The downside of always running passes is that we pay the cost irrespective
> of benefit. There might not be much to clean up if we don’t vectorize a loop, but
> we still have to pay for running the cleanup passes. This has been the motivator
> to have “pass local” CSE, but this also stems from a time when we ran within the
> inlining pass manager, which meant running over and over again.
>
> I think we will just have to look at compile time and decide what makes
> sense.

It’s great that we’re running the vectorizers late, outside CGSCC. Regarding the
set of passes that we rerun, I completely agree with Arnold. Naturally,
iterating over the pass pipeline produces speedups, and I understand the
engineering advantage. But rerunning several expensive function passes on the
slim chance that a loop was transformed is an awful design for compile time.

>> + loop-rotate

I have no concern about loop-rotate. It should be very fast.

>>  loop-vectorize
>> + early-cse

Passes like loop-vectorize should be able to do their own CSE without much
engineering effort.

>> + correlated-propagation

A little worried about this.

>> + instcombine

I'm *very* concerned about rerunning instcombine, but understand it may help
clean up the vectorized preheader.

>> + licm
>> + loop-unswitch

These should be limited to the relevant loop nest.

>> + simplifycfg

OK if the CFG actually changed.

>> + instcombine

instcombine again! This can’t be good.

>>  slp-vectorize
>> + early-cse

SLP should do its own CSE.

—

I think it’s generally useful to have an “extreme” level of optimization without
much regard for compile time, and in that scenario this pipeline makes sense.
But this is hardly something that should happen at -O2/-Os, unless real data
shows otherwise.

If the pass manager were designed to conditionally invoke late passes triggered
by certain transformation passes, that would solve my immediate concern.

Long term, I think a much better design is for function transformations to be
conditionally rerun within a scope/region. For example, loop-vectorize should be
able to trigger instcombine on the loop preheader, which I think is the real
problem here.
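
A minimal sketch of the "conditionally invoke late passes" idea, written in Python rather than against LLVM's C++ pass manager. Everything in it is hypothetical: `run_pipeline`, the toy `loop_vectorize` and `cleanup` passes, and the dict that stands in for a function are illustrative assumptions. The only convention borrowed from LLVM is that a pass reports whether it changed anything.

```python
from typing import Callable, Dict, List

# A "function" is modeled as a plain dict. A pass mutates it and
# returns True if it changed anything (hypothetical toy model).
Pass = Callable[[Dict], bool]

def loop_vectorize(fn: Dict) -> bool:
    """Toy stand-in: 'vectorizes' only if the function has a vectorizable loop."""
    if fn.get("vectorizable_loop") and not fn.get("vectorized"):
        fn["vectorized"] = True
        fn["needs_cleanup"] = True  # vectorizer output is not yet canonical
        return True
    return False

def cleanup(fn: Dict) -> bool:
    """Toy stand-in for an early-cse / instcombine style cleanup pass."""
    fn["cleanup_runs"] = fn.get("cleanup_runs", 0) + 1
    if fn.get("needs_cleanup"):
        fn["needs_cleanup"] = False
        return True
    return False

def run_pipeline(fn: Dict, transform: Pass, cleanups: List[Pass]) -> None:
    """Run the cleanup passes only when the transform reported a change."""
    if transform(fn):
        for p in cleanups:
            p(fn)

hot = {"vectorizable_loop": True}
cold = {"vectorizable_loop": False}
run_pipeline(hot, loop_vectorize, [cleanup])
run_pipeline(cold, loop_vectorize, [cleanup])
print(hot.get("cleanup_runs", 0), cold.get("cleanup_runs", 0))
```

In this toy model the function that was not vectorized never pays for the cleanup pass, which is the compile-time property being asked for. It says nothing about how hard such gating would be to build into a real pass manager.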

-Andy

>> The rationale I have for this:
>>
>> 1) Zinovy pointed out that the loop vectorizer really needs the input
>> loops to still be rotated. One counter point is that perhaps we should prevent
>> any pass from un-rotating loops?
>>
>> 2) I cherrypicked the core of the scalar optimization pipeline that
>> seems like it would be relevant to code which looks like runtime checks. Things
>> like correlated values for overlap predicates, loop invariant code, or
>> predicates that can be unswitched out of loops. Then I added the canonicalizing
>> passes that might be relevant given those passes.
>>
>> 3) I pulled the EarlyCSE from the BB vectorize stuff. Maybe it
>> isn't relevant for SLP vectorize, no idea. I did say this was a straw man. =D
>>
>> My benchmarking has shown some modest improvements to benchmarks, but
>> nothing huge. However, it shows only a 2% slowdown for building the
>> 'opt' binary, which I'm actually happy with so that we can work to
>> improve the loop vectorizer's overhead *knowing* that these passes will
>> clean up stuff. Thoughts? I'm currently OK with this, but it's pretty
>> borderline so I just wanted to start the discussion and see what other folks
>> observe in their benchmarking.
>>
>> -Chandler

Andrew Trick

2014-Oct-14 17:17 UTC

> On Oct 14, 2014, at 10:11 AM, Andrew Trick <atrick at apple.com> wrote:
> Long term, I think a much better design is for function transformations to
> be conditionally rerun within a scope/region. For example, loop-vectorize should
> be able to trigger instcombine on the loop preheader, which I think is the real
> problem here.

One more thing I forgot to mention. I think it makes a lot of sense to have an
early canonical instcombine mode to expose opportunities for simplifyCFG and
loop passes and a late mode that optimizes for code gen. It’s possible that some
expensive logic does not need to be repeatedly applied, although I don’t have
evidence of that.

Hal Finkel

2014-Oct-14 17:28 UTC

----- Original Message -----
> From: "Andrew Trick" <atrick at apple.com>
> To: "Arnold Schwaighofer" <aschwaighofer at apple.com>
> Cc: "Chandler Carruth" <chandlerc at gmail.com>, "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>, "James Molloy" <james at jamesmolloy.co.uk>, "Zinovy Nis" <zinovy.nis at gmail.com>, "Hal Finkel" <hfinkel at anl.gov>, "Gerolf Hoflehner" <ghoflehner at apple.com>
> Sent: Tuesday, October 14, 2014 12:11:43 PM
> Subject: Re: RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
> 
> >> + instcombine
> 
> I'm *very* concerned about rerunning instcombine,

Why? I understand that it is not cheap (especially because it calls into
ValueTracking a lot), but how expensive is it when it has nothing to do?

> but understand it
> may help cleanup the vectorized preheader.
> 
> >> + licm
> >> + loop-unswitch
> 
> These should be limited to the relevant loop nest.
> 
> >> + simplifycfg
> 
> OK if the CFG actually changed.
> 
> >> + instcombine
> 
> instcombine again! This can’t be good.
> 
> >>  slp-vectorize
> >> + early-cse
> 
> SLP should do its own CSE.

I'm not sure how much of this is reasonable. Obviously, it can do its own
CSE within each vectorization tree. But across trees (where multiple independent
parts of the function are vectorized), finding and reusing gather sequences,
etc. is a general CSE problem, and I'm not sure how much of that we want to
replicate in the SLP vectorizer.

When I switched my internal builds from using the BBVectorizer by default to
using the SLP vectorizer by default, I saw a number of performance regressions
(mostly not from the vectorization, but from the lack of the 'cleanup'
passes, EarlyCSE and InstCombine, that were generally being run afterward). My
general impression is that running these passes late in the pipeline brings
general benefits.

> 
> —
> 
> I think it’s generally useful to have an “extreme” level of
> optimization without much regard for compile time, and in that
> scenario this pipeline makes sense. But this is hardly something
> that should happen at -O2/-Os, unless real data shows otherwise.

Doing all this only at >= -O3 does not seem unreasonable to me.

> 
> If the pass manager were designed to conditionally invoke late passes
> triggered by certain transformation passes, that would solve my
> immediate concern.
> 
> Long term, I think a much better design is for function
> transformations to be conditionally rerun within a scope/region. For
> example, loop-vectorize should be able to trigger instcombine on the
> loop preheader, which I think is the real problem here.

As Chandler might recall ;) -- I've made several requests that the new pass
manager design specifically support this.

 -Hal
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Andrew Trick

2014-Oct-14 17:38 UTC

> On Oct 14, 2014, at 10:28 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
>>>> + instcombine
>> 
>> I'm *very* concerned about rerunning instcombine,
> 
> Why? I understand that it is not cheap (especially because it calls into
> ValueTracking a lot), but how expensive is it when it has nothing to do?

Ok, I’ll reserve judgement until I have that data point.

-Andy

Chandler Carruth

2014-Oct-14 17:41 UTC

On Tue, Oct 14, 2014 at 10:28 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> > I think it’s generally useful to have an “extreme” level of
> > optimization without much regard for compile time, and in that
> > scenario this pipeline makes sense. But this is hardly something
> > that should happen at -O2/-Os, unless real data shows otherwise.
>
> Doing all this only at >= -O3 does not seem unreasonable to me.

FWIW, I think we're being overly conservative if we're relegating these to
-O3 when the total cost is 2%. That doesn't seem like the right tradeoff.

I actually agree that the set I proposed is on the aggressive end -- that
was the point -- but we have more than 2% fluctuations in the optimizers'
runtime from month to month. If we want to rip stuff out it should be
because of a principled reason that it isn't going to help the code in that
phase.

Chandler Carruth

2014-Oct-14 17:56 UTC

On Tue, Oct 14, 2014 at 10:11 AM, Andrew Trick <atrick at apple.com> wrote:
> >> + correlated-propagation
>
> A little worried about this.
>
> >> + instcombine
>
> I'm *very* concerned about rerunning instcombine, but understand it may
> help cleanup the vectorized preheader.
>
Why are you concerned? Is instcombine that slow? I usually don't see huge
overhead from re-running it on nearly-canonical code. (Oh, I see you just
replied to Hal here, fair enough.)

>
> >> + licm
> >> + loop-unswitch
>
> These should be limited to the relevant loop nest.
>
We have no way to do that currently. Do you think they will in practice be
too slow? If so, why? I would naively expect unswitch to be essentially
free unless it can do something, and LICM not much more expensive.

>
> >> + simplifycfg
>
> OK if the CFG actually changed.
>
Again, we have no mechanism to gate this. Frustratingly, the only thing I
want here is to delete dead code formed by earlier passes. We just don't
have anything cheaper (and I don't have any measurements indicating we need
something cheaper).
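
As an illustration of what a cheaper, narrower cleanup could look like, here is a sketch that only drops basic blocks no longer reachable from the entry block. It is written in Python over a toy CFG, not against LLVM. The function name `remove_unreachable_blocks`, the dict-of-successors representation, and the block names are all assumptions made for the example, and reading "dead code" as "unreachable blocks" is one interpretation of the paragraph above.

```python
from typing import Dict, List

def remove_unreachable_blocks(cfg: Dict[str, List[str]], entry: str) -> Dict[str, List[str]]:
    """Return a copy of cfg that keeps only blocks reachable from entry.

    cfg maps a block name to the names of its successor blocks.
    """
    reachable = set()
    worklist = [entry]
    while worklist:
        block = worklist.pop()
        if block in reachable:
            continue
        reachable.add(block)
        worklist.extend(cfg.get(block, []))
    return {b: succs for b, succs in cfg.items() if b in reachable}

# Toy CFG: an earlier pass left "scalar.body" without any predecessor.
cfg = {
    "entry": ["vector.body"],
    "vector.body": ["vector.body", "exit"],
    "scalar.body": ["scalar.body", "exit"],
    "exit": [],
}
print(sorted(remove_unreachable_blocks(cfg, "entry")))
```

The walk visits each reachable block once, so its cost is linear in the size of the CFG. Whether that would be measurably cheaper than simplifycfg in practice is exactly the kind of question the thread says needs data.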

>
> >> + instcombine
>
> instcombine again! This can’t be good.
>
I actually have no specific reason to think we need this other than the
fact that we run instcombine after simplifycfg in a bunch of other places.
If you're looking for one to rip out, this would be the first one I would
rip out because I'm doubtful of its value.


On a separate note:

> >> + early-cse
>
> Passes like loop-vectorize should be able to do their own CSE without much
> engineering effort.
>
> >>  slp-vectorize
> >> + early-cse
>
> SLP should do its own CSE.
>
I actually agree with you in principle, but I would rather run the pass now
(and avoid hacks downstream to essentially do CSE in the backend) than hold
up progress on the hope of advanced on-demand CSE layers being added to the
vectorizers. I don't know of anyone actually working on that, and so I'm
somewhat concerned it will never materialize.

Hal Finkel

2014-Oct-14 18:16 UTC

----- Original Message -----
> From: "Chandler Carruth" <chandlerc at google.com>
> To: "Andrew Trick" <atrick at apple.com>
> Cc: "James Molloy" <james at jamesmolloy.co.uk>, "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Tuesday, October 14, 2014 12:56:46 PM
> Subject: Re: [LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
> 
> >> + early-cse
> 
> Passes like loop-vectorize should be able to do their own CSE without
> much engineering effort.
> 
> >> slp-vectorize
> >> + early-cse
> 
> SLP should do its own CSE.
> 
> I actually agree with you in principle, but I would rather run the
> pass now (and avoid hacks downstream to essentially do CSE in the
> backend) than hold up progress on the hope of advanced on-demand CSE
> layers being added to the vectorizers. I don't know of anyone
> actually working on that, and so I'm somewhat concerned it will
> never materialize.

I mentioned this in another mail, but to be specific, I'm also inclined to
think that, globally, it shouldn't materialize. SLP should do its own
internal cleanup, per tree, but cross-tree CSE should likely be left to an
actual CSE pass (or perhaps GVN at -O3, but that's another matter).

What I mean is that if we have:

entry:
  br i1 %cond, label %block1, label %block2
block1:
  ; stuff in here is SLP vectorized
  ...
block2:
  ; stuff in here is SLP vectorized
  ...


or even just:

entry:
  ...
  ; stuff in here is SLP vectorized
  ...
  ; stuff here is also SLP vectorized (using some of the same inputs)
  ...

there might be some common vector shuffles, insert/extractelement instructions,
etc. that are generated in both blocks that CSE might combine. But this is a
general CSE problem (especially as these things might be memory operations, and
thus need to deal with memory dependency issues), and we should not have new
generalized CSE logic in the vectorizers (although we could certainly think
about factoring some of the current logic out into utility functions).
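
To make the single-block case concrete, here is a small Python sketch of common subexpression elimination over a straight-line instruction list. It is a toy, not LLVM's EarlyCSE: the tuple encoding of instructions, the name `local_cse`, and the example operands are assumptions for illustration. It treats every instruction as pure, so it deliberately ignores the memory dependency issues mentioned above.

```python
from typing import Dict, List, Tuple

# (result name, opcode, operand names); a hypothetical toy encoding.
Inst = Tuple[str, str, Tuple[str, ...]]

def local_cse(block: List[Inst]) -> List[Inst]:
    """Remove repeated pure expressions in one straight-line block.

    Assumes every instruction is side-effect free and each name is
    defined exactly once.
    """
    available: Dict[Tuple[str, Tuple[str, ...]], str] = {}  # expression -> first result
    replaced: Dict[str, str] = {}  # eliminated name -> surviving name
    out: List[Inst] = []
    for dest, op, operands in block:
        # Rewrite operands that referred to an eliminated duplicate.
        operands = tuple(replaced.get(o, o) for o in operands)
        key = (op, operands)
        if key in available:
            replaced[dest] = available[key]
        else:
            available[key] = dest
            out.append((dest, op, operands))
    return out

# Two independently vectorized "trees" that build the same vector from a and b.
block = [
    ("v0", "insertelement", ("undef", "a", "0")),
    ("v1", "insertelement", ("v0", "b", "1")),
    ("s0", "fadd", ("v1", "x")),
    ("w0", "insertelement", ("undef", "a", "0")),
    ("w1", "insertelement", ("w0", "b", "1")),
    ("s1", "fmul", ("w1", "y")),
]
for inst in local_cse(block):
    print(inst)
```

The second tree's two insertelement instructions fold into the first tree's, and the fmul is rewritten to use `v1`. Nothing in the routine knows about vectorization, which is the argument for leaving this to a general CSE pass.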

 -Hal

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Andrew Trick

2014-Oct-14 18:21 UTC

I’ll summarize your responses as: The new pipeline produces better results than
the old, and we currently have no good mechanism for reducing the compile time
overhead.

I’ll summarize my criticism as: In principle, there are better ways to clean up
after the vectorizer without turning it into a complicated megapass, but no one
has done the engineering. I don’t think cleaning up after the vectorizer should
incur any noticeable overhead if the vectorizer never runs, and that overhead
would be avoidable with sensibly designed passes that aren’t locked into the
current pass manager design.

I don’t have the data right now to argue against enabling the new pipeline under
O2. Hopefully others who care about clang compile time will jump in.

As for the long-term plan to improve compile-time, all I can do now is to
advocate for a better approach.

-Andy