Mehdi Amini via llvm-dev
2016-Mar-31 22:34 UTC
[llvm-dev] RFC: Large, especially super-linear, compile time regressions are bugs.
Hi Renato,

> On Mar 31, 2016, at 2:46 PM, Renato Golin <renato.golin at linaro.org> wrote:
>
> On 31 March 2016 at 21:41, Mehdi Amini via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> TLDR: I totally support considering compile time regressions as bugs.
>
> Me too.
>
> I also agree that reverting fresh and reapplying is *much* easier than
> trying to revert late.
>
> But I'd like to avoid dubious metrics.

I'm not sure how "this commit regresses the compile time by 2%" is a dubious metric. The metric is not dubious IMO; it is what it is: a measurement. You just have to build a good process around it to exploit the measurement in a way that is useful for the project.

>> The closest I could find would be what Chandler wrote in:
>> http://reviews.llvm.org/D12826 ; for instance, for O2 he stated that "if an
>> optimization increases compile time by 5% or increases code size by 5% for a
>> particular benchmark, that benchmark should also be one which sees a 5%
>> runtime improvement".
>
> I think this is a bit limited and can lead to witch hunts, especially
> wrt performance measurements.
>
> Chandler's title is perfect though... "Large" can be vague, but
> "super-linear" is not. We used to have the convention that any large
> super-linear (quadratic+) compile time introduction had to be in O3
> or, for really bad cases, behind additional flags. I think we should
> keep that mindset.
>
>> My hope is that with better tooling for tracking compile time in the future,
>> we'll reach a state where we'll be able to consider "breaking" the
>> compile-time regression test as important as breaking any test: i.e. the
>> offending commit should be reverted unless it has been shown to
>> significantly (hand wavy...) improve the runtime performance.
>
> In order to have any kind of threshold, we'd have to monitor with some
> accuracy the performance of both the compiler and the compiled code for
> the main platforms. We do that to a certain extent with the test-suite
> bots, but that's very far from ideal.

I agree. Did you read the part where I mentioned that we're working on the tooling, and that I was waiting for it to be done before starting this thread?

> So, I'd recommend we steer away from any kind of percentage or ratio
> and keep at least the quadratic changes and beyond behind special flags
> (n.logn is OK for most cases).

How do you suggest we address the long trail of 1-3% slowdowns that led to the current situation (cf. the two links I posted in my previous email)? Because there *is* a problem here, and I'd really like someone to come up with a solution for it.

>> Since you raise the discussion now, I'll take the opportunity to push on the
>> "more aggressive" side: I think the policy should be a balance between the
>> improvement the commit brings and the compile time slowdown it causes.
>
> This is a fallacy.

Not sure why, or what you mean? The fact that an optimization improves only some targets does not invalidate the point.

> Compile time often regresses across all targets, while execution
> improvements are focused on specific targets and can have negative
> effects on those that were not benchmarked.

Yeah, as usual in LLVM: if you care about something on your platform, set up a bot and track trunk closely; otherwise you're less of a priority.

> Overall, though,
> compile time regressions dilute over the improvements, but not on a
> commit-per-commit basis. That's what I meant by witch hunt.

There is no "witch hunt"; at least that's not my objective.
I think everyone is pretty enthusiastic about every new perf improvement (I am), but just as, without bots (and policy), we would break tests all the time unintentionally, I'm talking about chasing and tracking every single commit where a developer regresses compile time *without even being aware of it*. I'd personally love to have a bot (or someone) emailing me about any compile time regression I introduce.

> I think we should keep an eye on those changes, ask for numbers in
> code review, and maybe even do some benchmarking of our own before
> accepting them. Also, we should not commit code that we know hurts
> performance that badly, even if we believe people will replace it in
> the future. It always takes too long. I myself did that last year,
> and I learnt my lesson.

Agreed.

> Metrics are often more dangerous than helpful, as they tend to be used
> as a substitute for thinking.

I can't relate this sentence to anything concrete at stake here. I think this list is full of people who are very good at thinking and won't substitute a metric for it :)

Best,

-- 
Mehdi
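For concreteness, the kind of per-commit check being discussed could look something like the sketch below: time a fixed corpus of translation units under the compiler built at the parent commit and at the commit under test, then report the relative delta. Everything here is illustrative; the compiler paths, the corpus, and the 1% threshold are assumptions, not anything the thread prescribes.

    #!/usr/bin/env python
    # Hypothetical per-commit compile-time check; paths and threshold are
    # illustrative, not from the thread.
    import subprocess, time

    BASELINE = "./clang-parent"     # assumed: clang built at the parent commit
    CANDIDATE = "./clang-new"       # assumed: clang built at the commit under test
    CORPUS = ["a.c", "b.c", "c.c"]  # assumed: a fixed set of translation units
    THRESHOLD = 0.01                # flag anything worse than 1% on average

    def total_time(compiler):
        """Wall-clock time to compile the whole corpus at -O2."""
        start = time.time()
        for tu in CORPUS:
            subprocess.check_call([compiler, "-O2", "-c", tu, "-o", "/dev/null"])
        return time.time() - start

    base = total_time(BASELINE)
    cand = total_time(CANDIDATE)
    delta = (cand - base) / base
    print("compile-time delta: %+.2f%%" % (delta * 100))
    if delta > THRESHOLD:
        print("regression above threshold: open a discussion, don't auto-revert")

A bot built around something like this would file a report for human discussion rather than trigger an automatic revert, in line with the process described above.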
Renato Golin via llvm-dev
2016-Mar-31 23:40 UTC
[llvm-dev] RFC: Large, especially super-linear, compile time regressions are bugs.
On 31 March 2016 at 23:34, Mehdi Amini <mehdi.amini at apple.com> wrote:
> I'm not sure how "this commit regresses the compile time by 2%" is a dubious metric.
> The metric is not dubious IMO; it is what it is: a measurement.

Ignoring for a moment the slippery slope we've recently been on with compile time performance, 2% is an acceptable regression for a change that improves execution time by around 2% on most targets; more acceptable than if only one target were affected.

Different people see performance with different eyes, and companies have different expectations about it, too, so those percentages can have a different impact on different people for the same change.

I guess my point is that no threshold will please everybody, and people are more likely to "abuse" the metric if the results are far from what they see as acceptable, even if everyone else is OK with it.

My point about substituting metrics for thinking is aimed not at lazy programmers (of which there are very few here), but at how far the encoded threshold falls from your own. Bias is a *very* hard thing to remove, even for extremely smart and experienced people.

So, while "witch hunt" is a very strong term for the mild bias we all carry personally, we have seen recently how some discussions end up in rage when a group of people strongly disagrees with the rest, self-reinforcing their bias to levels they would never reach alone. In those cases, the term stops being too strong, and may be fitting... Makes sense?

> I agree. Did you read the part where I mentioned that we're working on the tooling, and that I was waiting for it to be done before starting this thread?

I did, and I should have mentioned it in my reply. I think you guys (and ARM) are doing an amazing job at quality measurement. I wasn't trying to belittle your efforts, but IMHO the relationship between effort and bias removal is not linear, i.e. you have to improve quality exponentially to remove bias linearly. So, the threshold at which we're prepared to stop might not remove all the problems, and metrics could still play a negative role.

I think I'm just asking for us to be aware of that fact, not to stop any attempt to introduce metrics. If they remain relevant to the final objective, and we're allowed to break them with enough arguments, it should work fine.

> How do you suggest we address the long trail of 1-3% slowdowns that led to the current situation (cf. the two links I posted in my previous email)?
> Because there *is* a problem here, and I'd really like someone to come up with a solution for it.

Indeed, we're now slower than GCC, and that's a place that looked impossible two years ago. But I doubt reverting a few patches will help. For this problem, we'll need a task force to hunt for all the dragons and surgically alter them, since at this point all the relevant patches are too far in the past.

For the future, emailing about compile time regressions (as well as run time) is a good thing to have, and I vouch for it. But I don't want it to become a tool that increases stress in the community.

> Not sure why, or what you mean? The fact that an optimization improves only some targets does not invalidate the point.

Sorry, I seem to have misinterpreted your point. The fallacy is in weighing the "benefit" against the regression "effect". The former is very hard to measure, while the latter is very precise.
Comparisons with radically different standard deviations can easily fall into "undefined behaviour" land, and be the seed for rage threads.

> I'm talking about chasing and tracking every single commit where a developer regresses compile time *without even being aware of it*.

That's a goal worth pursuing regardless of any patch's benefit, and I agree wholeheartedly. For that, I'm very grateful for the work you guys are doing.

cheers,
--renato
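To illustrate the standard-deviation point: a single pair of timings says little, so a comparison is only meaningful once the delta clears the measurement noise. A minimal sketch, with invented numbers and a deliberately conservative noise estimate:

    # Compare repeated timings, and only report a regression when the delta
    # clears the measured noise. All numbers are invented for illustration.
    import statistics

    def significant_regression(base_runs, cand_runs, min_rel_delta=0.01):
        """base_runs/cand_runs: wall-clock times from repeated runs."""
        base_mean = statistics.mean(base_runs)
        cand_mean = statistics.mean(cand_runs)
        delta = cand_mean - base_mean
        # Conservative noise estimate: sum of the two standard deviations.
        noise = statistics.stdev(base_runs) + statistics.stdev(cand_runs)
        return delta > noise and delta / base_mean > min_rel_delta

    # A ~2% slowdown that is swamped by run-to-run noise: not reported.
    print(significant_regression([10.0, 10.3, 9.9, 10.2],
                                 [10.2, 10.5, 10.1, 10.4]))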
Mehdi Amini via llvm-dev
2016-Apr-01 00:09 UTC
[llvm-dev] RFC: Large, especially super-linear, compile time regressions are bugs.
> On Mar 31, 2016, at 4:40 PM, Renato Golin <renato.golin at linaro.org> wrote:
>
> On 31 March 2016 at 23:34, Mehdi Amini <mehdi.amini at apple.com> wrote:
>> I'm not sure how "this commit regresses the compile time by 2%" is a dubious metric.
>> The metric is not dubious IMO; it is what it is: a measurement.
>
> Ignoring for a moment the slippery slope we've recently been on with
> compile time performance, 2% is an acceptable regression for a change
> that improves execution time by around 2% on most targets; more
> acceptable than if only one target were affected.

Sure, I don't think I have suggested anything else; if I did, it is because I didn't express myself correctly :) I'm excited about runtime performance, and I'm willing to spend compile-time budget to achieve it. I'd even say my view is that by tracking compile time elsewhere, we'll preserve more of the compile-time budget for the kind of commit you mention above.

> Different people see performance with different eyes, and companies
> have different expectations about it, too, so those percentages can
> have a different impact on different people for the same change.
>
> I guess my point is that no threshold

I don't suggest a threshold that says "a commit can't regress x%" and that would be set in stone. What I have in mind is more: if a commit regresses the build above a threshold (1% on average, for instance), then we should be able to have a discussion about this commit to evaluate whether it belongs in O2 or whether it should go to O3, for instance. Also, if the commit is a refactoring, or introduces a new feature, the regression might not be intended by the author at all!

> will please everybody, and
> people are more likely to "abuse" the metric if the results are far
> from what they see as acceptable, even if everyone else is OK with it.

The metric is "the commit regressed 1%". The natural thing that follows is what usually happens in the community: we look at the data (what is the performance improvement?) and decide case by case whether it is fine as is or not. I feel like you're talking about the "metric" as if it were an automatic threshold that triggers an automatic revert and blocks things; that is not the goal, and that is not what I mean when I use the word metric (but hey, I'm not a native speaker!). As I said before, I'm mostly chasing *untracked* and *unintentional* compile time regressions.

> My point about substituting metrics for thinking is aimed not at lazy
> programmers (of which there are very few here), but at how far the
> encoded threshold falls from your own. Bias is a *very* hard thing
> to remove, even for extremely smart and experienced people.
>
> So, while "witch hunt" is a very strong term for the mild bias we all
> carry personally, we have seen recently how some discussions end up
> in rage when a group of people strongly disagrees with the rest,
> self-reinforcing their bias to levels they would never reach alone.
> In those cases, the term stops being too strong, and may be
> fitting... Makes sense?
>
>> I agree. Did you read the part where I mentioned that we're working on the tooling, and that I was waiting for it to be done before starting this thread?
>
> I did, and I should have mentioned it in my reply. I think you guys (and
> ARM) are doing an amazing job at quality measurement. I wasn't trying
> to belittle your efforts, but IMHO the relationship between effort and
> bias removal is not linear, i.e. you have to improve quality
> exponentially to remove bias linearly.
> So, the threshold at which we're
> prepared to stop might not remove all the problems, and metrics could
> still play a negative role.

I'm not sure I fully understand everything you mean here.

> I think I'm just asking for us to be aware of that fact, not to stop
> any attempt to introduce metrics. If they remain relevant to the final
> objective, and we're allowed to break them with enough arguments, it
> should work fine.
>
>> How do you suggest we address the long trail of 1-3% slowdowns that led to the current situation (cf. the two links I posted in my previous email)?
>> Because there *is* a problem here, and I'd really like someone to come up with a solution for it.
>
> Indeed, we're now slower than GCC, and that's a place that looked
> impossible two years ago. But I doubt reverting a few patches will
> help. For this problem, we'll need a task force to hunt for all the
> dragons and surgically alter them, since at this point all the
> relevant patches are too far in the past.

Obviously, my immediate concern is "what tools and process will make sure it does not get worse", and starting with "community awareness" is not bad. Improving and recovering from the current state is valuable, but orthogonal to what I'm trying to achieve. Another thing is the complaints from multiple people who are trying to JIT with LLVM: we know LLVM is not designed in a way that helps with latency and memory consumption, but getting worse is not nice.

> For the future, emailing about compile time regressions (as well as
> run time) is a good thing to have, and I vouch for it. But I don't
> want it to become a tool that increases stress in the community.

Sure, and I'm glad you're stepping up to make sure that does not happen. So please continue to voice your concerns in the future as we try to roll things out. I hope we're on the same page past the initial misunderstanding we had with each other? What I'd really like is to have a consensus on the goal to pursue (knowing I'm not alone in caring about compile time is a great start!), so that the tooling can be set up to serve that goal in the best way possible (decreasing stress instead of increasing it).

Best,

-- 
Mehdi

>> Not sure why, or what you mean? The fact that an optimization improves only some targets does not invalidate the point.
>
> Sorry, I seem to have misinterpreted your point.
>
> The fallacy is in weighing the "benefit" against the regression
> "effect". The former is very hard to measure, while the latter is
> very precise. Comparisons with radically different standard
> deviations can easily fall into "undefined behaviour" land, and be
> the seed for rage threads.
>
>> I'm talking about chasing and tracking every single commit where a developer regresses compile time *without even being aware of it*.
>
> That's a goal worth pursuing regardless of any patch's benefit, and I
> agree wholeheartedly. For that, I'm very grateful for the work you
> guys are doing.
>
> cheers,
> --renato
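As for the "super-linear" regressions in the subject line, those can be caught without any percentage threshold at all: time the compiler on generated inputs of doubling size and check how compile time grows. A sketch under assumed conditions (clang on PATH; the input generator and the interpretation of the ratios are illustrative):

    # Time the compiler on generated inputs of doubling size; for linear or
    # n*log(n) behaviour, 2x the input should give roughly 2x the time, so
    # ratios well above 2 hint at quadratic-or-worse scaling.
    import os, subprocess, tempfile, time

    def make_input(n):
        """Write a TU with n trivial functions; its size is linear in n."""
        f = tempfile.NamedTemporaryFile(suffix=".c", delete=False, mode="w")
        for i in range(n):
            f.write("int f%d(int x) { return x + %d; }\n" % (i, i))
        f.close()
        return f.name

    def compile_time(path):
        start = time.time()
        subprocess.check_call(["clang", "-O2", "-c", path, "-o", os.devnull])
        return time.time() - start

    times = []
    for n in [1000, 2000, 4000, 8000]:
        path = make_input(n)
        times.append(compile_time(path))
        os.unlink(path)

    for prev, cur in zip(times, times[1:]):
        print("2x input -> %.2fx time" % (cur / prev))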