thr3ads.net - llvm dev - [LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2? [Oct 2014]

Chandler Carruth

2014-Oct-14 00:56 UTC

[LLVMdev] RFC: Should we have (something like) -extra-vectorizer-passes in -O2?

I've added a straw-man of some extra optimization passes that help specific
benchmarks here or there by either preparing code better on the way into
the vectorizer or cleaning up afterward. These are off by default until
there is some consensus on the right path forward, but this way we can all
test out the same set of flags, and collaborate on any tweaks to them.

The primary principle here is that the vectorizer expects the IR input to
be in a certain canonical form, and produces IR output that may not yet be
in that form. The primary alternative to this is to make the vectorizers
both extra powerful (able to recognize many variations on things like loop
structure) and extra cautious about their emitted code (so that it is
always already optimized). I much prefer the solution of using passes
rather than this unless compile time is hurt too drastically. It makes it
much easier to test, validate, and compose all of the various components of
the core optimizer.

Here is the structural diff:

+ loop-rotate
  loop-vectorize
+ early-cse
+ correlated-propagation
+ instcombine
+ licm
+ loop-unswitch
+ simplifycfg
+ instcombine
  slp-vectorize
+ early-cse
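
A minimal sketch of the structural diff above, modelling the vectorizer part of the pipeline as a plain list of pass names. The pass names are taken from the diff; the function `build_pipeline`, its `extra_vectorizer_passes` parameter, and the two lookup tables are hypothetical names used only for illustration and are not LLVM APIs. The real -O2 pipeline contains many more passes around this region.

```python
# Pass names come from the structural diff; everything else is illustrative.
BASELINE = ["loop-vectorize", "slp-vectorize"]

# Extra passes scheduled immediately before a baseline pass.
EXTRA_BEFORE = {"loop-vectorize": ["loop-rotate"]}

# Extra passes scheduled immediately after a baseline pass.
EXTRA_AFTER = {
    "loop-vectorize": [
        "early-cse",
        "correlated-propagation",
        "instcombine",
        "licm",
        "loop-unswitch",
        "simplifycfg",
        "instcombine",
    ],
    "slp-vectorize": ["early-cse"],
}


def build_pipeline(extra_vectorizer_passes):
    """Return the ordered pass names, with or without the extra passes."""
    pipeline = []
    for name in BASELINE:
        if extra_vectorizer_passes:
            pipeline.extend(EXTRA_BEFORE.get(name, []))
        pipeline.append(name)
        if extra_vectorizer_passes:
            pipeline.extend(EXTRA_AFTER.get(name, []))
    return pipeline


for pass_name in build_pipeline(extra_vectorizer_passes=True):
    marker = " " if pass_name in BASELINE else "+"
    print(marker, pass_name)
```

Run as a script, this prints the same eleven-line listing as the diff, which makes it easy to see that the proposal adds nine pass invocations around the two existing vectorizers.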

The rationale I have for this:

1) Zinovy pointed out that the loop vectorizer really needs the input loops
to still be rotated. One counter point is that perhaps we should prevent
any pass from un-rotating loops?

2) I cherrypicked the core of the scalar optimization pipeline that seems
like it would be relevant to code which looks like runtime checks. Things
like correlated values for overlap predicates, loop invariant code, or
predicates that can be unswitched out of loops. Then I added the
canonicalizing passes that might be relevant given those passes.

3) I pulled the EarlyCSE from the BB vectorize stuff. Maybe it isn't
relevant for SLP vectorize, no idea. I did say this was a straw man. =D


My benchmarking has shown some modest improvements to benchmarks, but
nothing huge. However, it shows only a 2% slowdown for building the 'opt'
binary, which I'm actually happy with so that we can work to improve the
loop vectorizer's overhead *knowing* that these passes will clean up stuff.

Thoughts? I'm currently OK with this, but it's pretty borderline so I just
wanted to start the discussion and see what other folks observe in their
benchmarking.

-Chandler

Arnold Schwaighofer

2014-Oct-14 15:53 UTC

> On Oct 13, 2014, at 5:56 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>
> I've added a straw-man of some extra optimization passes that help specific
> benchmarks here or there by either preparing code better on the way into
> the vectorizer or cleaning up afterward. These are off by default until
> there is some consensus on the right path forward, but this way we can all
> test out the same set of flags, and collaborate on any tweaks to them.
>
> The primary principle here is that the vectorizer expects the IR input to
> be in a certain canonical form, and produces IR output that may not yet be
> in that form. The primary alternative to this is to make the vectorizers
> both extra powerful (able to recognize many variations on things like loop
> structure) and extra cautious about their emitted code (so that it is
> always already optimized). I much prefer the solution of using passes
> rather than this unless compile time is hurt too drastically. It makes it
> much easier to test, validate, and compose all of the various components of
> the core optimizer.
>
> Here is the structural diff:
>
> + loop-rotate
>   loop-vectorize
> + early-cse
> + correlated-propagation
> + instcombine
> + licm
> + loop-unswitch
> + simplifycfg
> + instcombine
>   slp-vectorize
> + early-cse
>

I think a late loop optimization (vectorization) pipeline makes sense. I think
we just have to carefully evaluate benefit over compile time.

Running loop rotation makes sense. Critical edge splitting can transform loops
into a form that prevents loop vectorization.

Both the loop vectorizer and the SLPVectorizer perform limited (restricted in
region) forms of CSE to clean up. EarlyCSE runs across the whole function and so
might catch more opportunities.
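
To make the difference in scope concrete, here is a toy sketch and not how LLVM implements either form of CSE (EarlyCSE walks the dominator tree with scoped hash tables). It assumes pure expressions in SSA form and a straight-line chain of blocks in which each block dominates the next, so an expression computed earlier is still valid later. The function name `count_redundant` and the tuple encoding of expressions are hypothetical.

```python
def count_redundant(blocks, scope):
    """Count expressions a toy CSE could remove.

    blocks: list of blocks, each a list of (opcode, lhs, rhs) tuples.
    scope:  "block" resets the table of available expressions at every
            block boundary (region-restricted CSE); "function" keeps one
            table for the whole chain of blocks.
    """
    removed = 0
    available = set()
    for block in blocks:
        if scope == "block":
            available = set()  # forget everything seen in earlier blocks
        for expr in block:
            if expr in available:
                removed += 1  # same expression already computed
            else:
                available.add(expr)
    return removed


blocks = [
    [("add", "a", "b"), ("mul", "a", "b")],
    [("add", "a", "b"), ("add", "a", "b")],
]
print(count_redundant(blocks, scope="block"))
print(count_redundant(blocks, scope="function"))
```

In this example the block-local variant only finds the duplicate inside the second block, while the function-wide variant also notices that the first `add` in the second block repeats one from the first block.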

The downside of always running passes is that we pay the cost irrespective of
benefit. There might not be much to clean up if we don’t vectorize a loop, but we
still have to pay for running the cleanup passes. This has been the motivator to
have “pass local” CSE, but this also stems from a time when we ran within the
inlining pass manager, which meant running over and over again.

I think we will just have to look at compile time and decide what makes sense.

> The rationale I have for this:
>
> 1) Zinovy pointed out that the loop vectorizer really needs the input loops
> to still be rotated. One counter point is that perhaps we should prevent
> any pass from un-rotating loops?
>
> 2) I cherrypicked the core of the scalar optimization pipeline that seems
> like it would be relevant to code which looks like runtime checks. Things
> like correlated values for overlap predicates, loop invariant code, or
> predicates that can be unswitched out of loops. Then I added the
> canonicalizing passes that might be relevant given those passes.
>
> 3) I pulled the EarlyCSE from the BB vectorize stuff. Maybe it isn't
> relevant for SLP vectorize, no idea. I did say this was a straw man. =D
>
>
> My benchmarking has shown some modest improvements to benchmarks, but
> nothing huge. However, it shows only a 2% slowdown for building the 'opt'
> binary, which I'm actually happy with so that we can work to improve the
> loop vectorizer's overhead *knowing* that these passes will clean up stuff.
> Thoughts? I'm currently OK with this, but it's pretty borderline so I just
> wanted to start the discussion and see what other folks observe in their
> benchmarking.
>
> -Chandler

Hal Finkel

2014-Oct-14 16:00 UTC

----- Original Message -----
> From: "Arnold Schwaighofer" <aschwaighofer at apple.com>
> To: "Chandler Carruth" <chandlerc at gmail.com>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>, "James Molloy" <james at jamesmolloy.co.uk>, "Zinovy Nis" <zinovy.nis at gmail.com>, "Andy Trick" <atrick at apple.com>, "Hal Finkel" <hfinkel at anl.gov>, "Gerolf Hoflehner" <ghoflehner at apple.com>
> Sent: Tuesday, October 14, 2014 10:53:49 AM
> Subject: Re: RFC: Should we have (something like) -extra-vectorizer-passes in -O2?
> 
> 
> > On Oct 13, 2014, at 5:56 PM, Chandler Carruth <chandlerc at gmail.com>
> > wrote:
> > 
> > I've added a straw-man of some extra optimization passes that help
> > specific benchmarks here or there by either preparing code better
> > on the way into the vectorizer or cleaning up afterward. These are
> > off by default until there is some consensus on the right path
> > forward, but this way we can all test out the same set of flags,
> > and collaborate on any tweaks to them.
> > 
> > The primary principle here is that the vectorizer expects the IR
> > input to be in a certain canonical form, and produces IR output
> > that may not yet be in that form. The primary alternative to this
> > is to make the vectorizers both extra powerful (able to recognize
> > many variations on things like loop structure) and extra cautious
> > about their emitted code (so that it is always already optimized).
> > I much prefer the solution of using passes rather than this unless
> > compile time is hurt too drastically. It makes it much easier to
> > test, validate, and compose all of the various components of the
> > core optimizer.
> > 
> > Here is the structural diff:
> > 
> > + loop-rotate
> >   loop-vectorize
> > + early-cse
> > + correlated-propagation
> > + instcombine
> > + licm
> > + loop-unswitch
> > + simplifycfg
> > + instcombine
> >   slp-vectorize
> > + early-cse
> > 
> 
> I think a late loop optimization (vectorization) pipeline makes
> sense. I think we just have to carefully evaluate benefit over
> compile time.
> 
> Running loop rotation makes sense. Critical edge splitting can
> transform loops into a form that prevents loop vectorization.
> 
> Both the loop vectorizer and the SLPVectorizer perform limited
> (restricted in region) forms of CSE to clean up. EarlyCSE runs across
> the whole function and so might catch more opportunities.

In my experience, running a late EarlyCSE produces generic speedups across the
board.

 -Hal

> 
> The downside of always running passes is that we pay the cost
> irrespective of benefit. There might not be much to clean up if we
> don’t vectorize a loop, but we still have to pay for running the
> cleanup passes. This has been the motivator to have “pass local” CSE,
> but this also stems from a time when we ran within the inlining
> pass manager, which meant running over and over again.
> 
> I think we will just have to look at compile time and decide what
> makes sense.
> 
> 
> > The rationale I have for this:
> > 
> > 1) Zinovy pointed out that the loop vectorizer really needs the
> > input loops to still be rotated. One counter point is that perhaps
> > we should prevent any pass from un-rotating loops?
> > 
> > 2) I cherrypicked the core of the scalar optimization pipeline that
> > seems like it would be relevant to code which looks like runtime
> > checks. Things like correlated values for overlap predicates, loop
> > invariant code, or predicates that can be unswitched out of loops.
> > Then I added the canonicalizing passes that might be relevant
> > given those passes.
> > 
> > 3) I pulled the EarlyCSE from the BB vectorize stuff. Maybe it
> > isn't relevant for SLP vectorize, no idea. I did say this was a
> > straw man. =D
> > 
> > 
> > My benchmarking has shown some modest improvements to benchmarks,
> > but nothing huge. However, it shows only a 2% slowdown for
> > building the 'opt' binary, which I'm actually happy with so that
> > we can work to improve the loop vectorizer's overhead *knowing*
> > that these passes will clean up stuff. Thoughts? I'm currently OK
> > with this, but it's pretty borderline so I just wanted to start
> > the discussion and see what other folks observe in their
> > benchmarking.
> > 
> > -Chandler
> 
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Andrew Trick

2014-Oct-14 17:11 UTC

> On Oct 14, 2014, at 8:53 AM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
>> On Oct 13, 2014, at 5:56 PM, Chandler Carruth <chandlerc at gmail.com> wrote:
>>
>> I've added a straw-man of some extra optimization passes that help
>> specific benchmarks here or there by either preparing code better on the
>> way into the vectorizer or cleaning up afterward. These are off by default
>> until there is some consensus on the right path forward, but this way we
>> can all test out the same set of flags, and collaborate on any tweaks to
>> them.
>>
>> The primary principle here is that the vectorizer expects the IR input
>> to be in a certain canonical form, and produces IR output that may not yet
>> be in that form. The primary alternative to this is to make the vectorizers
>> both extra powerful (able to recognize many variations on things like loop
>> structure) and extra cautious about their emitted code (so that it is
>> always already optimized). I much prefer the solution of using passes
>> rather than this unless compile time is hurt too drastically. It makes it
>> much easier to test, validate, and compose all of the various components
>> of the core optimizer.
>>
>> Here is the structural diff:
>>
>> + loop-rotate
>>   loop-vectorize
>> + early-cse
>> + correlated-propagation
>> + instcombine
>> + licm
>> + loop-unswitch
>> + simplifycfg
>> + instcombine
>>   slp-vectorize
>> + early-cse
>>
>
> I think a late loop optimization (vectorization) pipeline makes sense. I
> think we just have to carefully evaluate benefit over compile time.
>
> Running loop rotation makes sense. Critical edge splitting can transform
> loops into a form that prevents loop vectorization.
>
> Both the loop vectorizer and the SLPVectorizer perform limited (restricted
> in region) forms of CSE to clean up. EarlyCSE runs across the whole function
> and so might catch more opportunities.
>
> The downside of always running passes is that we pay the cost irrespective
> of benefit. There might not be much to clean up if we don’t vectorize a loop,
> but we still have to pay for running the cleanup passes. This has been the
> motivator to have “pass local” CSE, but this also stems from a time when we
> ran within the inlining pass manager, which meant running over and over
> again.
>
> I think we will just have to look at compile time and decide what makes
> sense.

It’s great that we’re running the vectorizers late, outside CGSCC. Regarding the
set of passes that we rerun, I completely agree with Arnold. Naturally,
iterating over the pass pipeline produces speedups, and I understand the
engineering advantage. But rerunning several expensive function passes on the
slim chance that a loop was transformed is an awful design for compile time.

>> + loop-rotate

I have no concern about loop-rotate. It should be very fast.

>>   loop-vectorize
>> + early-cse

Passes like loop-vectorize should be able to do their own CSE without much
engineering effort.

>> + correlated-propagation

A little worried about this.

>> + instcombine

I'm *very* concerned about rerunning instcombine, but understand it may help
clean up the vectorized preheader.

>> + licm
>> + loop-unswitch

These should be limited to the relevant loop nest.

>> + simplifycfg

OK if the CFG actually changed.

>> + instcombine

instcombine again! This can’t be good.

>>   slp-vectorize
>> + early-cse

SLP should do its own CSE.
—

I think it’s generally useful to have an “extreme” level of optimization without
much regard for compile time, and in that scenario this pipeline makes sense.
But this is hardly something that should happen at -O2/-Os, unless real data
shows otherwise.

If the pass manager were designed to conditionally invoke late passes triggered
by certain transformation passes, that would solve my immediate concern.
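
As a rough illustration of that idea, the toy driver below runs cleanup passes only when the pass that triggers them reports a change. It is a sketch under stated assumptions, not an LLVM pass manager API: `run_with_triggered_cleanup`, `loop_vectorize`, `no_op_cleanup`, and the `has_vectorizable_loop` flag are all hypothetical stand-ins.

```python
def run_with_triggered_cleanup(function, passes, cleanup_for):
    """Run each pass; run its cleanup passes only if it changed the function.

    passes:      list of (name, callable); callable(function) returns True
                 when it modified the function.
    cleanup_for: dict mapping a trigger pass name to its cleanup passes.
    Returns the names of all passes that actually executed, in order.
    """
    executed = []
    for name, run in passes:
        changed = run(function)
        executed.append(name)
        if changed:
            for cleanup_name, cleanup in cleanup_for.get(name, []):
                cleanup(function)
                executed.append(cleanup_name)
    return executed


def loop_vectorize(function):
    # Toy stand-in: "transforms" only when the function is flagged vectorizable.
    return function["has_vectorizable_loop"]


def no_op_cleanup(function):
    # Toy stand-in for a cleanup pass; it changes nothing.
    return False


PASSES = [("loop-vectorize", loop_vectorize)]
CLEANUP_FOR = {
    "loop-vectorize": [("early-cse", no_op_cleanup), ("instcombine", no_op_cleanup)],
}

print(run_with_triggered_cleanup({"has_vectorizable_loop": False}, PASSES, CLEANUP_FOR))
print(run_with_triggered_cleanup({"has_vectorizable_loop": True}, PASSES, CLEANUP_FOR))
```

With this shape, a function in which nothing was vectorized pays only for the vectorizer itself, which is the compile-time concern raised above.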

Long term, I think a much better design is for function transformations to be
conditionally rerun within a scope/region. For example, loop-vectorize should be
able to trigger instcombine on the loop preheader, which I think is the real
problem here.
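
A toy sketch of the scoped variant: the transformation reports which blocks it touched, and the cleanup is applied to those blocks only. The function is modelled as a dict from block name to a list of instruction strings; `rerun_in_region`, `toy_vectorize`, and `toy_cleanup` are hypothetical names and do not correspond to anything in LLVM.

```python
def rerun_in_region(function, transform, cleanup):
    """Apply `cleanup` only to the blocks that `transform` reports as touched.

    transform(function) mutates the function and returns a set of block names.
    cleanup(instructions) returns the cleaned-up instruction list for a block.
    Returns the sorted names of the blocks that were cleaned.
    """
    touched = transform(function)
    for block_name in sorted(touched):
        function[block_name] = cleanup(function[block_name])
    return sorted(touched)


def toy_vectorize(function):
    # Stand-in for loop-vectorize: emits a runtime check plus some
    # leftover code into the preheader and reports that block as touched.
    function["preheader"] = function["preheader"] + ["runtime-check", "redundant"]
    return {"preheader"}


def toy_cleanup(instructions):
    # Stand-in for instcombine: drops instructions marked as redundant.
    return [inst for inst in instructions if inst != "redundant"]


function = {"entry": ["redundant"], "preheader": [], "loop": ["body"]}
print(rerun_in_region(function, toy_vectorize, toy_cleanup))
print(function)
```

The `entry` block keeps its redundant instruction because it was never visited, which shows both the saving (untouched blocks cost nothing) and the trade-off (cleanup opportunities outside the region are missed).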

-Andy

>> The rationale I have for this:
>>
>> 1) Zinovy pointed out that the loop vectorizer really needs the input
>> loops to still be rotated. One counter point is that perhaps we should
>> prevent any pass from un-rotating loops?
>>
>> 2) I cherrypicked the core of the scalar optimization pipeline that
>> seems like it would be relevant to code which looks like runtime checks.
>> Things like correlated values for overlap predicates, loop invariant code,
>> or predicates that can be unswitched out of loops. Then I added the
>> canonicalizing passes that might be relevant given those passes.
>>
>> 3) I pulled the EarlyCSE from the BB vectorize stuff. Maybe it isn't
>> relevant for SLP vectorize, no idea. I did say this was a straw man. =D
>>
>>
>> My benchmarking has shown some modest improvements to benchmarks, but
>> nothing huge. However, it shows only a 2% slowdown for building the 'opt'
>> binary, which I'm actually happy with so that we can work to improve the
>> loop vectorizer's overhead *knowing* that these passes will clean up stuff.
>> Thoughts? I'm currently OK with this, but it's pretty borderline so I just
>> wanted to start the discussion and see what other folks observe in their
>> benchmarking.
>>
>> -Chandler
