thr3ads.net - llvm dev - [llvm-dev] Buildbot Noise [Oct 2015]

Renato Golin via llvm-dev

2015-Oct-09 17:14 UTC

[llvm-dev] Buildbot Noise

I think we've hit a record in the number of inline replies, here... :)

Let's start fresh...

    Problem #1: What is flaky?

The types of failures of a buildbot:

1. failures because of bad hardware / bad software / bad admin
(timeout, disk full, crash, bad RAM)
2. failures because of infrastructure problems (svn, lnt, etc)
3. failures due to previous or external commits unrelated to the blame
list (intermittent, timeout)
4. results that you don't know how to act on, but you have to
5. clear error messages, easy to act on

In my view, "flaky" is *only* number 1. Everything else is signal.

I agree that bots that cause 1. should be silent, and that failures in
2. and 3. should be only emailed to the bot admin. But category 4
still needs to email the blame list and cannot be ignored, even if
*you* don't know how to act on it.
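
As a rough illustration of the routing described above, here is a minimal
sketch. The names FailureKind and recipients are made up for this example
and are not part of buildbot; sending category 5 to the blame list is an
assumption, since only category 4 is spelled out above.

```python
from enum import IntEnum


class FailureKind(IntEnum):
    """The five failure categories listed above (names are hypothetical)."""
    BAD_HOST = 1          # bad hardware / OS / admin: timeout, disk full, crash
    INFRASTRUCTURE = 2    # svn, lnt, etc.
    UNRELATED_COMMIT = 3  # previous or external commits, not the blame list
    HARD_TO_ACT_ON = 4    # unclear to the committer, but still their problem
    CLEAR_ERROR = 5       # clear message, easy to act on


def recipients(kind, bot_admin, blame_list):
    """Return the addresses to notify for one failed build."""
    if kind == FailureKind.BAD_HOST:
        return []  # flaky in the strict sense: stay silent
    if kind in (FailureKind.INFRASTRUCTURE, FailureKind.UNRELATED_COMMIT):
        return [bot_admin]  # only the bot admin hears about these
    # Categories 4 and 5 are signal for the committers (5 is an assumption).
    return list(blame_list)
```

The hard part, classifying a failure as 3 or 4 automatically, is not
addressed by this sketch; it assumes the category is already known.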

Type 2. can easily be separated, but I've yet to see how we are going
to encode which category each failure falls into for types 3. and 4. One
way to work around the problem in 4 is to print the bot owner's name
on the email, so that you know who to reply to for more details on
what to do. How to decide whether your change is unrelated or you just
didn't understand the failure is a big problem. Once all bots are
low-noise, people will tend to assume 4; until then, 3 or 1.

In agreement?


    Problem #2: Breakage types

Bots can break for a number of reasons in category 4. Some examples:

A. silly, quickly fixed ones, like bad CHECK lines, a missing explicit
triple, tests that need moving to target-specific directories, or a
missing include file.
B. real problems, like an assert in the code, seg fault, bad test results.
C. hard problems, like bad codegen affecting self-hosting,
intermittent failures in test-suite or self-hosted clang.

Problems of type A. tend to show up by the firehose on ARM, while they're
a lot less common on x86_64 bots just because people develop on
x86_64. Problems B. and C. are equally common on all platforms due to
the complexity of the compiler.

Problems of type B. should have the same behaviour on all platforms. If
the bots are fast enough (either fast hardware, or plenty of hardware), the
blame list should be small and bisect should be quick (<1day). These
are not the problem.

Problems of type C, however, are seriously worse on slow targets. Not
only is it slower to build (sometimes 10x slower than on a decent
server), but the testing is hard to get right (because it's
intermittent), and until you get it right, you're actively working on
that (minus sleep time, etc). Since we're talking about an order of
magnitude slower to debug, sleep time becomes a much bigger issue. If
a hard problem takes about 5 hours on fast hardware, it can take up to
50 hours on slow hardware, and no one can work that long in one go. If
you do 10 hours straight every day, that's still a week gone.
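
The arithmetic above can be checked with a few lines. This is only a sketch
of the estimate; the function name working_days is made up here, and the 10x
slowdown and 10-hour day are the figures from the paragraph above.

```python
import math


def working_days(fast_hours, slowdown, hours_per_day):
    """Return (hours on the slow target, whole working days needed)."""
    slow_hours = fast_hours * slowdown
    return slow_hours, math.ceil(slow_hours / hours_per_day)


slow_hours, days = working_days(fast_hours=5, slowdown=10, hours_per_day=10)
print(slow_hours, days)  # 50 5
```

Five working days of 10 hours each is the "week" in question.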

In agreement?


I'll continue later, once we're in agreement over the base facts.

cheers,
--renato

David Blaikie via llvm-dev

2015-Oct-09 18:02 UTC

[llvm-dev] Buildbot Noise

On Fri, Oct 9, 2015 at 10:14 AM, Renato Golin <renato.golin at linaro.org>
wrote:
> I think we've hit a record in the number of inline replies, here... :)
>
> Let's start fresh...
>
>     Problem #1: What is flaky?
>
> The types of failures of a buildbot:
>
> 1. failures because of bad hardware / bad software / bad admin
> (timeout, disk full, crash, bad RAM)
>
Where "software" here is presumably the OS software, not the software under
test (otherwise all actual failures would be (1)), and not infrastructure
software because you've called that out as (2).

> 2. failures because of infrastructure problems (svn, lnt, etc)
> 3. failures due to previous or external commits unrelated to the blame
> list (intermittent, timeout)
> 4. results that you don't know how to act on, but you have to
> 5. clear error messages, easy to act on
>
> In my view, "flaky" is *only* number 1. Everything else is signal.
>
I think that misses the common usage of the term "flaky test" (or do the
tests themselves end up under (1) or (2)?) or flaky tests due to flaky
product code (hash ordering in the output).

> I agree that bots that cause 1. should be silent, and that failures in
> 2. and 3. should be only emailed to the bot admin. But category 4
> still needs to email the blame list and cannot be ignored, even if
> *you* don't know how to act on.
>
& I disagree here - if most contributors aren't acting on these (for
whatever reasons, basically) we should just stop sending them. If at some
point we find ways to make them actionable (by having common machine access
people can use, documentation on how to proceed, short blame lists, etc -
fixing whatever's getting in the way of people acting on these), we can
start sending them again.

And I don't think it's that people simply don't care about certain
architectures - we see Linux developers fixing Windows and Darwin build
breaks, for example. But, yes, more complicated things do tend to fall to
bot owners/people familiar with that platform/hardware, and I think that's
totally OK/acceptable/the right thing.

I think a large part of the problem is the temporal issue - no matter the
architecture, if the results are substantially delayed (even with a short
blame list) and the steps to reproduce are not quick/easy, it's easy for
people to decide it's not worth the hassle. I think that's something we
likely have to live with (again, lack of familiarity with a
long/complex/inaccessible process means that those developers really aren't
in the best place to do the reproduction/check that it was their patch that
caused the problem).

>
> Type 2. can easily be separated, but I'm yet to see how are we going
> to code in which category each failure lies for types 3. and 4.

Yeah, I don't have any particular insight there either. Ideally I'd hope
we can ensure those issues are rare enough (though I've been seeing some
consistently flaky SVN behavior on my buildbot for the last few months,
admittedly - reached out to Tanya about it, but didn't have much to go on)
that it's probably not worth the engineering effort to filter them out.

> One
> way to work around the problem in 4 is to print the bot owner's name
> on the email, so that you know who to reply to, for more details on
> what to do. How to decide if your change is unrelated or you didn't
> understand is a big problem.

What I'm suggesting is that if most developers, most of the time, aren't
able to determine this easily, it's not valuable email - if most of the
time they have to reach out to the owner for details/clarification, then we
should just invert it. Have the bot owner push to the contributor rather
than the contributor pull from the bot owner.

> Once all bots are low-noise, people will
> tend more to 4, until then, to 3 or 1.
>
> In agreement?
>
>
>     Problem #2: Breakage types
>
> Bots can break for a number of reasons in category 4. Some examples:
>
> A. silly, quick fixed ones, like bad CHECK lines, missing explicit
> triple, move tests to target-specific directories, add an include
> file.
> B. real problems, like an assert in the code, seg fault, bad test results.
> C. hard problems, like bad codegen affecting self-hosting,
> intermittent failures in test-suite or self-hosted clang.
>
> Problems of type A. tend to show by the firehose on ARM, while they're
> a lot less common on x86_64 bots just because people develop on
> x86_64.

They show up often enough cross-OS and build config too (-Asserts, Windows,
Darwin, etc).

> Problems B. and C. and equally common on all platforms due to
> the complexity of the compiler.
>
> Problems of type B. should have same behaviour in all platforms. If
> the bots are fast enough (either fast hardware, or many hardware), the
> blame list should be small and bisect should be quick (<1day).

Patches should still be reverted, or tests XFAILed - bots shouldn't be left
red for hours (especially in the middle of a work day) or a day.

> These are not the problem.
>
> Problems of type C, however, are seriously worse on slow targets.

This can often/mostly be compensated for by having more hardware -
especially for something as mechanical as a bisect. (obviously once you're
in manual iterations, more hardware doesn't help much unless you have a few
different hypotheses you can test simultaneously)

Certainly it takes some more engineering effort and there's overhead for
dealing with multiple machines, etc. But it's not linearly proportional to
machine speed, because some of it can be compensated for.

> Not
> only it's slower to build (sometimes 10x slower than on a decent
> server), but the testing is hard to get right (because it's
> intermittent), and until you get it right, you're actively working on
> that (minus sleep time, etc). Since we're talking about an order of
> magnitude slower to debug, sleep time becomes a much bigger issue. If
> a hard problem takes about 5 hours on fast hardware, it can take up to
> 50 hours, and in that case, no one can work that long. If you do 10hs
> straight every day, it's still a week past.

Sure - some issues take a while to investigate. No doubt - but so long as
the issue is live (be it flaky or consistent) it's unhelpful to have the
bot red and/or sending mail (more so if it's flaky, given the way our
buildbots send mail - though I still don't like a red line on the status
page, that's costly too). The issue is known and being investigated, so
sending other people mail (or having it show up as red in the dashboard)
isn't terribly helpful. It produces redundant work for everyone on the
project (they all investigate these issues - or learn to ignore them & thus
miss true positives later).

>
> In agreement?
>
>
> I'll continue later, once we're in agreement over the base facts.
>
> cheers,
> --renato

Renato Golin via llvm-dev

2015-Oct-10 11:59 UTC

[llvm-dev] Buildbot Noise

On 9 October 2015 at 19:02, David Blaikie <dblaikie at gmail.com> wrote:
> Where "software" here is presumably the OS software
Yes. This is the real noise, one that we cannot accept.

> I think that misses the common usage of the term "flaky test" (or do the
> tests themselves end up other (1) or (2)?) or flaky tests due to flaky
> product code (hash ordering in the output).
Flaky code, whether in the compiler or in the tests, is code that doesn't
fail in the correct blame list. Otherwise, even if it was flaky, we don't
know, because it failed in the right blame list, so it's easy to
revert or XFAIL.

So, in my categorisation, flaky code ends up in either 3 or 4:

3, wrong blame list: if the failure is completely independent from the
blame list, for example, misuse of the C++ ABI.
4, related, but not directly: if the failure is related, but in ways
that the patch didn't touch, for example, changing related debug info for
a non-debug patch.

The cause can be that the original code didn't cope with this future
change, even though the change is semantically valid, or that the test
CHECK lines were poor (like naming explicit registers, etc), and that's
why the tests broke. The former is harder for the blamed developer to fix,
but "git blame" can help find the person who can help. The latter is a lot
easier to spot and fix, but is also helped by "git blame". Both are
actionable, but not immediately obvious.

> & I disagree here - if most contributors aren't acting on these (for
> whatever reasons, basically) we should just stop sending them. If at some
> point we find ways to make them actionable (by having common machine access
> people can use, documentation on how to proceed, short blame lists, etc -
> whatever's getting in the way of people acting on these).
I see, your disagreement is temporal.

You're basically saying that, because people ignore them today,
there's no point in sending them the email today, and it's up to the
bot owners to make people start paying attention to their bots.

My argument is that I cannot make you care, no matter how stable my
bots are. And the evidence for that is that my bots are very stable,
but you're ignoring them, either because you don't understand what a
flaky bot is, or just out of principle.

My bots don't have hardware or OS problems, nor have they timed out or run
out of disk for a good number of years. But I can't stop bad testing,
or bad coding. And, as I've outlined too many times, these affect bots
like mine more heavily than others. It's the nature of the failures
plus the nature of my hardware.

I can't make you care about it, so I don't mind if you ignore them,
but I *do* mind if you want to shut them off.

> And I don't think it's that people simply don't care about certain
> architectures - We see Linux developers fixing Windows and Darwin build
> breaks, for example. But, yes, more complicated things (I think a large part
> of the problem is the temporal issue - no matter the architecture, if the
> results are substantially delayed (even with a short blame list) and the
> steps to reproduce are not quick/easy, it's easy for people to decide it's
> not worth the hassle
I think that's an appalling behaviour for a community.

> - which I think is something we likely have to live
> with (again, lack of familiarity with a long/complex/inaccessible process
> means that those developers really aren't in the best place to do the
> reproduction/check that it was their patch that caused the problem)) do tend
> to fall to bot owners/people familiar with that platform/hardware, and I
> think that's totally OK/acceptable/the right thing.
Hum, ok. There are two sides here.

1. You do care, but can't do anything. In this case, you work with the
owner to resolve the problem, even if the owner does all the work.

2. You don't care, and ignore the failure. Here the bot owner has to
find out on his own and do all the work.

The first is perfectly acceptable, and I'm more than happy to do all
the work. In the second case, I normally just revert the patch without
asking.

> What I'm suggesting is that if most developers, most of the time, aren't
> able to determine this easily, it's not valuable email - if most of the time
> they have to reach out to the owner for details/clarification, then we
> should just invert it. Have the bot owner push to the contributor rather
> than the contributor pull from the bot owner.
The LLVM project has hundreds of committers, while dozens of bots have a
single owner. How does that scale?

I think this proposal is against the very nature of open source
projects in general and a horrible engineering decision. I have
noticed that recently some people have taken the attitude that "if you
can't keep up with my commits, you're not worth noticing", and that's
the attitude that will get us forked.

> They show up often enough cross-OS and build config too (-Asserts, Windows,
> Darwin, etc).
Ok, good.

> Patches should still be reverted, or tests XFAIL - bots shouldn't be left
> red for hours (especially in the middle of a work day) or a day.
How do you XFAIL a Clang miscompilation of Clang?

How do you revert a failure that is unrelated to the blame list
because it comes from previous or external commits?

> This can often/mostly be compensated for by having more hardware -
Throw money at the problem? :D
https://www.youtube.com/watch?v=CZmHDEa0Y20

> especially for something as mechanical as a bisect. (obviously once you're
> in manual iterations, more hardware doesn't help much unless you have a few
> different hypotheses you can test simultaneously)
I don't have infinite hardware, nor infinite space, nor infinite
power, nor infinite time.

Certain things take longer than others, and people that are used to
getting them fast have a lower tolerance for slow(er) processes. Fast
and slow are completely arbitrary and relative to how slow or fast
things are between themselves.

> Certainly it takes some more engineering effort and there's overhead for
> dealing with multiple machines, etc. But it's not linearly proportional to
> machine speed, because some of it can be compensated for.
Right. So, here, I agree with you. It IS possible to improve and make
it much better.

I'm working on making it better, but it takes time. I can't make it
work tomorrow, and that's my original point:

We have to improve and be more strict, but we have to grow to get
there, not to flip the table now. I'm suggesting an exp(x) migration
plan, not a sig(x).

> Sure - some issues take a while to investigate. No doubt - but so long as
> the issue is live (be it flaky or consistent) it's unhelpful (moreso if it's
> flaky, given the way our buildbots send mail - though I still don't like a
> red line on the status page, that's costly too) to have the bot red and/or
> sending mail.
Here, there are two issues:

1. Buildbots should not email on red->except->red. That's settled, and
we must ignore those cases from now on, otherwise we'll keep coming
back to it. So, assume we don't do that any more.

2. If we agree that any flaky bot is turned off, and the master
behaves correctly (as above), we cannot assume that the constant
emailing during the investigation phase is due to flakiness. So, if
you do get an email, it's probably for a meaningful reason.
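
To make the red->except->red rule in point 1 concrete, here is a minimal
sketch of such a filter. It is not how the buildbot master is implemented;
the function name should_email and the "green"/"red"/"except" strings are
made up for this example, and it assumes mail goes out only when a bot
turns red.

```python
def should_email(history, current):
    """Decide whether a new build result deserves an email.

    history: earlier results for this bot, oldest first.
    current: the new result. Each is "green", "red" or "except".
    """
    if current != "red":
        return False
    # Skip over exception builds to find the last real result.
    for previous in reversed(history):
        if previous != "except":
            # Only a change from green to red is news.
            return previous == "green"
    return True  # no earlier real result: the first failure is news


print(should_email(["red", "except"], "red"))    # False
print(should_email(["green", "except"], "red"))  # True
```

With exceptions skipped this way, red->except->red is treated the same as
red->red and stays quiet.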

We're not there yet, but we're discussing at a higher level here,
dissecting the issue and finding the problems.


> The issue is known and being investigated, sending other
> people mail (or having it show up as red in the dashboard) isn't terribly
> helpful. It produces redundant work for everyone (they all investigate these
> issues - or learn to ignore them & thus miss true positives later) on the
> project.
Chris is investigating the Green Bot infrastructure, which is orders
of magnitude better than our current one. In that scenario, we'll have
orders of magnitude less redundant work, even if you get a warning
that you can't act on.

--renato
