On 9 October 2015 at 19:02, David Blaikie <dblaikie at gmail.com> wrote:
> Where "software" here is presumably the OS software

Yes. This is the real noise, the one that we cannot accept.

> I think that misses the common usage of the term "flaky test" (or do the tests themselves end up other (1) or (2)?) or flaky tests due to flaky product code (hash ordering in the output).

Flaky code, whether in the compiler or in the tests, is code that doesn't fail within the correct blame list. Otherwise, even if the code was flaky, we wouldn't know: it failed within the right blame list, so it's easy to revert or XFAIL.

So, in my categorisation, flaky code ends up in either 3 or 4:

3, wrong blame list: the failure is completely independent from the blame list, for example a misuse of the C++ ABI.
4, related, but not directly: the failure is related, but in ways the patch didn't touch, for example changing related debug info for a non-debug patch.

These happen either because the original code didn't cope with this future (but semantically valid) change, or because the test CHECK lines were poor (naming explicit registers, etc.), and that's why the tests broke. The former is harder for the blamed developer to fix, but "git blame" can help find the right person to ask. The latter is a lot easier to spot and fix, and is also helped by "git blame". Both are actionable, but not immediately obvious.

> & I disagree here - if most contributors aren't acting on these (for whatever reasons, basically) we should just stop sending them. If at some point we find ways to make them actionable (by having common machine access people can use, documentation on how to proceed, short blame lists, etc - whatever's getting in the way of people acting on these).

I see, your disagreement is temporal.

You're basically saying that, because people ignore them today, there's no point in sending them the email today, and it's up to the bot owners to make people start paying attention to their bots.

My argument is that I cannot make you care, no matter how stable my bots are. And the evidence for that is that my bots are very stable, but you're ignoring them, either because you don't understand what a flaky bot is, or just out of principle.

My bots don't have hardware or OS problems, nor have they timed out or run out of disk for a good number of years. But I can't stop bad testing, or bad coding. And, as I've outlined too many times, these affect bots like mine more heavily than others. It's the nature of the failures plus the nature of my hardware.

I can't make you care about it, so I don't mind if you ignore them, but I *do* mind if you want to shut them off.

> And I don't think it's that people simply don't care about certain architectures - We see Linux developers fixing Windows and Darwin build breaks, for example.
> But, yes, more complicated things (I think a large part of the problem is the temporal issue - no matter the architecture, if the results are substantially delayed (even with a short blame list) and the steps to reproduce are not quick/easy, it's easy for people to decide it's not worth the hassle

I think that's an appalling behaviour for a community.

> - which I think is something we likely have to live with (again, lack of familiarity with a long/complex/inaccessible process means that those developers really aren't in the best place to do the reproduction/check that it was their patch that caused the problem)) do tend to fall to bot owners/people familiar with that platform/hardware, and I think that's totally OK/acceptable/the right thing.

Hum, ok. There are two sides here.

1. You do care, but can't do anything. In this case, you work with the owner to resolve the problem, even if the owner does all the work.

2. You don't care, and ignore the failure. Here the bot owner has to find out on his own and do all the work.

The first is perfectly acceptable, and I'm more than happy to do all the work. In the second case, I normally just revert the patch without asking.

> What I'm suggesting is that if most developers, most of the time, aren't able to determine this easily, it's not valuable email - if most of the time they have to reach out to the owner for details/clarification, then we should just invert it. Have the bot owner push to the contributor rather than the contributor pull from the bot owner.

The LLVM project has hundreds of committers, while dozens of bots have a single owner. How does that scale?

I think this proposal is against the very nature of open source projects in general, and a horrible engineering decision. I have noticed that recently some people have taken the attitude that "if you can't keep up with my commits, you're not worth noticing", and that's the attitude that will get us forked.

> They show up often enough cross-OS and build config too (-Asserts, Windows, Darwin, etc).

Ok, good.

> Patches should still be reverted, or tests XFAIL - bots shouldn't be left red for hours (especially in the middle of a work day) or a day.

How do you XFAIL a Clang miscompilation of Clang?

How do you revert a failure that is unrelated to the blame list because it comes from previous or external commits?

> This can often/mostly be compensated for by having more hardware -

Throw money at the problem? :D
https://www.youtube.com/watch?v=CZmHDEa0Y20

> especially for something as mechanical as a bisect. (obviously once you're in manual iterations, more hardware doesn't help much unless you have a few different hypotheses you can test simultaneously)

I don't have infinite hardware, nor infinite space, nor infinite power, nor infinite time.

Certain things take longer than others, and people who are used to getting them fast have a lower tolerance for slow(er) processes. Fast and slow are completely arbitrary, relative only to how slow or fast things are compared with each other.

> Certainly it takes some more engineering effort and there's overhead for dealing with multiple machines, etc. But it's not linearly proportional to machine speed, because some of it can be compensated for.

Right. So, here, I agree with you. It IS possible to improve and make it much better.

I'm working on making it better, but it takes time. I can't make it work tomorrow, and that's my original point:

We have to improve and be more strict, but we have to grow to get there, not to flip the table now.
I'm suggesting an exp(x) migration plan, not a sig(x) one.

> Sure - some issues take a while to investigate. No doubt - but so long as the issue is live (be it flaky or consistent) it's unhelpful (moreso if it's flaky, given the way our buildbots send mail - though I still don't like a red line on the status page, that's costly too) to have the bot red and/or sending mail.

Here, there are two issues:

1. Buildbots should not email on red -> exception -> red. That's settled, and we must ignore those cases from now on, otherwise we'll keep coming back to it. So, assume we don't do that any more.

2. If we agree that any flaky bot is turned off, and the master behaves correctly (as above), we cannot assume that the constant emailing during the investigation phase is due to flakiness. So, if you do get an email, there's probably a meaningful reason behind it.

We're not there yet, but we're discussing at a higher level here, dissecting the issue and finding the problems.

> The issue is known and being investigated, sending other people mail (or having it show up as red in the dashboard) isn't terribly helpful. It produces redundant work for everyone (they all investigate these issues - or learn to ignore them & thus miss true positives later) on the project.

Chris is investigating the Green Bot infrastructure, which is orders of magnitude better than our current one. In that scenario, we'll have orders of magnitude less redundant work, even if you get a warning that you can't act on.

--renato

On 9 October 2015 at 19:02, David Blaikie <dblaikie at gmail.com> wrote:
> On Fri, Oct 9, 2015 at 10:14 AM, Renato Golin <renato.golin at linaro.org> wrote:
>> I think we've hit a record in the number of inline replies, here... :)
>>
>> Let's start fresh...
>>
>> Problem #1: What is flaky?
>>
>> The types of failures of a buildbot:
>>
>> 1. failures because of bad hardware / bad software / bad admin (timeout, disk full, crash, bad RAM)
>
> Where "software" here is presumably the OS software, not the software under test (otherwise all actual failures would be (1)), and not infrastructure software because you've called that out as (2).
>
>> 2. failures because of infrastructure problems (svn, lnt, etc)
>> 3. failures due to previous or external commits unrelated to the blame list (intermittent, timeout)
>> 4. results that you don't know how to act on, but you have to
>> 5. clear error messages, easy to act on
>>
>> In my view, "flaky" is *only* number 1. Everything else is signal.
>
> I think that misses the common usage of the term "flaky test" (or do the tests themselves end up other (1) or (2)?) or flaky tests due to flaky product code (hash ordering in the output).
>
>> I agree that bots that cause 1. should be silent, and that failures in 2. and 3. should be only emailed to the bot admin. But category 4 still needs to email the blame list and cannot be ignored, even if *you* don't know how to act on.
>
> & I disagree here - if most contributors aren't acting on these (for whatever reasons, basically) we should just stop sending them. If at some point we find ways to make them actionable (by having common machine access people can use, documentation on how to proceed, short blame lists, etc - whatever's getting in the way of people acting on these).
>
> And I don't think it's that people simply don't care about certain architectures - We see Linux developers fixing Windows and Darwin build breaks, for example.
> But, yes, more complicated things (I think a large part of the problem is the temporal issue - no matter the architecture, if the results are substantially delayed (even with a short blame list) and the steps to reproduce are not quick/easy, it's easy for people to decide it's not worth the hassle - which I think is something we likely have to live with (again, lack of familiarity with a long/complex/inaccessible process means that those developers really aren't in the best place to do the reproduction/check that it was their patch that caused the problem)) do tend to fall to bot owners/people familiar with that platform/hardware, and I think that's totally OK/acceptable/the right thing.
>
>> Type 2. can easily be separated, but I'm yet to see how we are going to code in which category each failure lies for types 3. and 4.
>
> Yeah, I don't have any particular insight there either. Ideally I'd hope we can ensure those issues are rare enough (though I've been seeing some consistently flaky SVN behavior on my buildbot for the last few months, admittedly - reached out to Tanya about it, but didn't have much to go on) that it's probably not worth the engineering effort to filter them out.
>
>> One way to work around the problem in 4 is to print the bot owner's name on the email, so that you know who to reply to, for more details on what to do. How to decide if your change is unrelated or you didn't understand is a big problem.
>
> What I'm suggesting is that if most developers, most of the time, aren't able to determine this easily, it's not valuable email - if most of the time they have to reach out to the owner for details/clarification, then we should just invert it. Have the bot owner push to the contributor rather than the contributor pull from the bot owner.
>
>> Once all bots are low-noise, people will tend more to 4, until then, to 3 or 1.
>>
>> In agreement?
>>
>> Problem #2: Breakage types
>>
>> Bots can break for a number of reasons in category 4. Some examples:
>>
>> A. silly, quickly fixed ones, like bad CHECK lines, missing explicit triple, move tests to target-specific directories, add an include file.
>> B. real problems, like an assert in the code, seg fault, bad test results.
>> C. hard problems, like bad codegen affecting self-hosting, intermittent failures in test-suite or self-hosted clang.
>>
>> Problems of type A. tend to show by the firehose on ARM, while they're a lot less common on x86_64 bots just because people develop on x86_64.
>
> They show up often enough cross-OS and build config too (-Asserts, Windows, Darwin, etc).
>
>> Problems B. and C. are equally common on all platforms due to the complexity of the compiler.
>>
>> Problems of type B. should have the same behaviour on all platforms. If the bots are fast enough (either fast hardware, or many hardware), the blame list should be small and bisect should be quick (<1day).
>
> Patches should still be reverted, or tests XFAIL - bots shouldn't be left red for hours (especially in the middle of a work day) or a day.
>
>> These are not the problem.
>>
>> Problems of type C, however, are seriously worse on slow targets.
>
> This can often/mostly be compensated for by having more hardware - especially for something as mechanical as a bisect. (obviously once you're in manual iterations, more hardware doesn't help much unless you have a few different hypotheses you can test simultaneously)
>
> Certainly it takes some more engineering effort and there's overhead for dealing with multiple machines, etc. But it's not linearly proportional to machine speed, because some of it can be compensated for.
>
>> Not only is it slower to build (sometimes 10x slower than on a decent server), but the testing is hard to get right (because it's intermittent), and until you get it right, you're actively working on that (minus sleep time, etc). Since we're talking about an order of magnitude slower to debug, sleep time becomes a much bigger issue. If a hard problem takes about 5 hours on fast hardware, it can take up to 50 hours, and in that case, no one can work that long. If you do 10hs straight every day, it's still a week past.
>
> Sure - some issues take a while to investigate. No doubt - but so long as the issue is live (be it flaky or consistent) it's unhelpful (moreso if it's flaky, given the way our buildbots send mail - though I still don't like a red line on the status page, that's costly too) to have the bot red and/or sending mail. The issue is known and being investigated, sending other people mail (or having it show up as red in the dashboard) isn't terribly helpful. It produces redundant work for everyone (they all investigate these issues - or learn to ignore them & thus miss true positives later) on the project.
>
>> In agreement?
>>
>> I'll continue later, once we're in agreement over the base facts.
>>
>> cheers,
>> --renato
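[Editorial note: Renato's point 1 above (no blame mail on a red -> exception -> red sequence, e.g. across a master restart) can be written down as a small notification policy. The sketch below is self-contained Python for illustration only; the result constants and function names are assumptions, not the real buildbot API.]

    SUCCESS, WARNINGS, FAILURE, EXCEPTION, RETRY = range(5)

    def last_meaningful_result(history):
        # Skip results produced by the infrastructure itself: master restarts
        # show up as EXCEPTION, interrupted builds as RETRY.
        for result in reversed(history):
            if result not in (EXCEPTION, RETRY):
                return result
        return None

    def should_email_blame_list(history, current):
        # Email the blame list only on a genuine green -> red transition; an
        # ongoing red (even with exceptions in between) goes to the owner only.
        if current != FAILURE:
            return False
        previous = last_meaningful_result(history)
        return previous in (SUCCESS, WARNINGS, None)

    # red, then a master restart (exception), then red again: no new blame mail.
    assert should_email_blame_list([SUCCESS], FAILURE) is True
    assert should_email_blame_list([SUCCESS, FAILURE, EXCEPTION], FAILURE) is False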
Not to distract from the truly worthwhile discussion going on here, but let me bring up one notion that I think buildbot currently doesn't support:

Our internal build/test system can distinguish "has new failure(s)" from "failed but no new failures" and represent those things differently on our dashboard. In public-bot terms this would mean saving the most recent list of test failures, comparing to the new set of test failures, and having a different failure-state if the new set is equal to or a proper subset of the previous set.

This might ameliorate an ongoing-red situation, as a no-new-fails state wouldn't send blame mail. But if there are new fails, the blame mailer can do a set-difference and report only the new ones. That would reduce the noise a bit, hmm?
--paulr
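[Editorial note: Paul's "new failures only" state is essentially a set-difference over saved results. A minimal self-contained sketch of the idea; the state file and status names are invented for illustration, this is not an existing buildbot feature.]

    import json
    from pathlib import Path

    STATE = Path("previous_failures.json")

    def classify_run(current_failures):
        # Load the failure set from the previous run, then save the current one.
        previous = set(json.loads(STATE.read_text())) if STATE.exists() else set()
        current = set(current_failures)
        STATE.write_text(json.dumps(sorted(current)))

        if not current:
            return "green", set()
        new = current - previous       # only failures not seen in the previous run
        if not new:                    # equal to, or a proper subset of, the previous set
            return "still-red", set()  # different dashboard colour, no blame mail
        return "new-red", new          # blame mail reports only these

    # run 1: a.ll and b.ll fail  -> ("new-red", {"a.ll", "b.ll"})
    # run 2: only a.ll fails     -> ("still-red", set())
    # run 3: a.ll and c.ll fail  -> ("new-red", {"c.ll"})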
On 16 October 2015 at 15:17, Robinson, Paul <Paul_Robinson at playstation.sony.com> wrote:
> But if there are new fails, the blame mailer can do a set-difference and report only the new ones. That would reduce the noise a bit, hmm?

Hi Paul,

The danger there is that it'd be easier to "get used" to having some failures, as long as you don't have "new" failures. Every place I worked that supported that philosophy ended up with all bots "orange". It's never the intention, but it's almost always the inevitable consequence. In a small team, or a single company, it may be a lot easier to move them back to green, but in an open community, it's not that easy, nor that quick.

The way we work with the same concept, as David mentioned repeatedly, is to use XFAILs. It is essentially the same thing, except that "it hurts more" to mark an XFAIL than to see a different shade of red, so we're more reluctant to ignore them.

Plus, an orange bot that becomes red (new failures) will itself become orange again as time passes or yet newer failures show up. If we end up with that many shades of red, understanding the difference will become harder, and the value will decrease.

cheers,
--renato
On Sat, Oct 10, 2015 at 4:59 AM, Renato Golin <renato.golin at linaro.org> wrote:
> On 9 October 2015 at 19:02, David Blaikie <dblaikie at gmail.com> wrote:
>> Where "software" here is presumably the OS software
>
> Yes. This is the real noise, one that we cannot accept.
>
>> I think that misses the common usage of the term "flaky test" (or do the tests themselves end up other (1) or (2)?) or flaky tests due to flaky product code (hash ordering in the output).
>
> Flaky code, either compiler or tests, are the ones that don't fail in the correct blame list. Otherwise, even if it was flaky, we don't know, because it failed in the right blame list, so it's easy to revert or XFAIL.
>
> So, in my categorisation, flaky code ends up in either 3 or 4:
>
> 3, wrong blame list: if the failure is completely independent from the blame list, example, misuse of the C++ ABI.
> 4, related, but not directly: if the failure is related, but in ways that the patch didn't touch, example, changing related debug info for a non-debug patch.
>
> These can be that the original code didn't cope with this future change, but the change is semantically valid, or the test CHECK lines were poor (like naming explicit registers, etc), and that's why the tests broke. The former is harder for the blamed developer to fix, but "git blame" can help find the one to help. The latter is a lot easier to spot and fix, but is also helped by "git blame". Both actionable, but not immediately obvious.
>
>> & I disagree here - if most contributors aren't acting on these (for whatever reasons, basically) we should just stop sending them. If at some point we find ways to make them actionable (by having common machine access people can use, documentation on how to proceed, short blame lists, etc - whatever's getting in the way of people acting on these).
>
> I see, your disagreement is temporal.
>
> You're basically saying that, because people ignore them today, there's no point in sending them the email today, and it's up to the bot owners to make people start paying attention to their bots.
>
> My argument is that I cannot make you care, no matter how stable my bots are. And the evidence for that is that my bots are very stable, but you're ignoring them, either because you don't understand what a flaky bot is, or just out of principle.

In the proximal issue - the bot was red for a week. When I see a bot red for a week, I assume no one cares about it (because I assume that if they did they would've at least XFAILed the issue so they could get back to green & catch future issues). That's the question I was asking and the reason I'm inclined to ignore the email I got from that bot.

As you've pointed out, the reason I got email from the bot was because of the master restart (red->purple->red), and addressing that would mean I wouldn't've sent my original email to you (but to other bot masters who had long-red bots - as you can see, I wasn't singling you out, I was looking at any bot that had been red for multiple work days).

I would still, in the abstract, disagree with leaving bots red for long periods because it makes the buildbot status pages hard to read - which things are unknown issues that someone needs to investigate, and which aren't? XFAIL should represent the mechanism by which we acknowledge a known failure, get back to green, and investigate. XFAILing a bootstrap is a bit unknown - perhaps we should have a way to do that?
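[Editorial note: there is no such mechanism today, but one possible shape for "XFAILing" a bootstrap, purely as a sketch: keep a list of acknowledged, tracked failures that the self-host builder consults, and downgrade a red stage-2 build to "known failure" when its log matches one of them. The issue label, the pattern and the function below are all hypothetical.]

    import re

    # issue label -> pattern identifying the acknowledged failure in the stage-2 log
    KNOWN_BOOTSTRAP_FAILURES = {
        "hypothetical-tracked-issue": r"fatal error: error in backend: .*",
    }

    def classify_bootstrap_log(log_text):
        # Returns 'pass', 'known-failure' (stay quiet, owner tracks it),
        # or 'new-failure' (real regression: notify the blame list).
        if "error:" not in log_text and "FAILED:" not in log_text:
            return "pass", None
        for issue, pattern in KNOWN_BOOTSTRAP_FAILURES.items():
            if re.search(pattern, log_text):
                return "known-failure", issue
        return "new-failure", None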
Beyond that, I've been talking about flakey failures in general, but that wasn't my issue with your bot at the time I sent the mail. I have no opinion on the flakiness of your bot(s). I think we got caught down a rathole talking about the abstract problems of flakiness, even though when I sent my last volley of "what's with these bot results" they weren't about flakiness at all, but /specifically/ about long-red bots that appear neglected.

> My bots don't have hardware or OS problems, nor they timeout or run out of disk for a good number of years. But I can't stop bad testing, or bad coding. And, as I've outlined too many times, these affect bots like mine more heavily than others. It's the nature of the failures plus the nature of my hardware.
>
> I can't make you care about it, so I don't mind if you ignore them,

Are there often original contributors, faced with a unique result from these bots, who are addressing the problem themselves? Or do they usually have to defer to you or another expert in this hardware, to do some level of triage/investigation/reproduction first?

> but I *do* mind if you want to shut them off.

As I've said before - I'm suggesting not sending mail. I'm not suggesting turning them off.

It would be little-to-no change to me to do this to my GDB 7.5 bot, for example - I glance at every failure that comes through anyway. All I'd do differently is forward anything that I thought looked like a real, unique failure, to the mailing list/blame list, rather than having it done automatically. This does not seem terribly onerous. Is it?

>> And I don't think it's that people simply don't care about certain architectures - We see Linux developers fixing Windows and Darwin build breaks, for example. But, yes, more complicated things (I think a large part of the problem is the temporal issue - no matter the architecture, if the results are substantially delayed (even with a short blame list) and the steps to reproduce are not quick/easy, it's easy for people to decide it's not worth the hassle
>
> I think that's an appalling behaviour for a community.

I... don't, really. As with my own GDB 7.5 buildbot, I pretty much assume interesting failures will probably involve me helping to triage (especially with the Apple engineers explicitly not having access to the source/test cases run there) the issues. The bot sends me email on every red, and I treat that as pretty much a thing I need to care about until it's green, as much as possible by acting as a facilitator to the original contributor who committed the breakage.

>> - which I think is something we likely have to live with (again, lack of familiarity with a long/complex/inaccessible process means that those developers really aren't in the best place to do the reproduction/check that it was their patch that caused the problem)) do tend to fall to bot owners/people familiar with that platform/hardware, and I think that's totally OK/acceptable/the right thing.
>
> Hum, ok. There are two sides here.
>
> 1. You do care, but can't do anything. In this case, you work with the owner to resolve the problem, even if the owner does all the work.
>
> 2. You don't care, and ignore the failure. Here the bot owner has to find out on his own and do all the work.
>
> The first is perfectly acceptable, and I'm more than happy to do all the work. The second I normally just revert the patch without asking.
It's generally not the community policy to revert a patch without providing actionable reproduction steps, etc. Do you do that? I don't recall seeing that done. (In general, I think it better to get reproduction steps first, then revert - sometimes people revert first and provide reproduction much later (because a reduction takes time, etc) - which I don't think is ideal, but is sometimes the right tradeoff for the community (if it's obviously going to be/is a problem for everyone, we're just not all seeing it yet, etc).)

>> What I'm suggesting is that if most developers, most of the time, aren't able to determine this easily, it's not valuable email - if most of the time they have to reach out to the owner for details/clarification, then we should just invert it. Have the bot owner push to the contributor rather than the contributor pull from the bot owner.
>
> The LLVM project has hundreds of committers, dozens of bots have a single owner. How does that scale?

Most of the bot results are pretty easily actionable - just by reading the diagnostics from the bots, etc. I run a bot - I glance at every fail mail that comes from it. It does not seem to be terribly onerous to me to do this - is it for you? The only time it costs me more than a sub-second per failure is if it's a real issue I need to investigate (OK, if it's actually a GDB test failure that's just flakey, that costs me a few seconds, but still not long).

The point is that doing the opposite - sending mail to large blame lists - is strictly higher cost than having a bot owner do the work. A bot owner is 1 person, a large blame list is multiple. It scales better to have 1 person look at the failure rather than many. Also, non-owners are less familiar with the interesting failures from the bot (or the ongoing state - red or otherwise), so it costs them more than the owner.

A long-red bot is a worse example of this, if it's sending mail even on a few reds - that's multiple developers looking at the bot to see if they broke it, when it's already known broken and being investigated. Every one of those emails is costly/worse scaling than just sending mail to the owner & having the owner triage/escalate to the contributor.

> I think this proposal is against the very nature of open source projects in general and a horrible engineering decision.

Do you believe there's no quality point in a buildbot notification where it is not worth sending mail/notification? Where those notifications hurt the quality (by reducing the signal/noise to the point where we either hurt the throughput of developers by having substantially redundant (& unskilled in the specific kinds of failures a certain platform might see) failure investigation, or hurt the quality of the project by people learning to ignore bot mails in general and thus missing important true positives as well)?

> I have noticed that recently some people have taken the attitude that "if you can't keep up with my commits, you're not worth noticing",

Not quite sure what you're referring to here - we seem to be pretty good about moving fast, but also having important design discussions in the community (llvm-dev mailing list, etc) when there's input required or people need a bit of forewarning about a change in direction, etc.
I think it's not too unreasonable to expect people to check some of the commit history to see what's been going on in an area they're interested in, or around some recent failure they're seeing, etc. (if they're contributors - if they're not contributors, yes, we don't tend to care much).

> and that's the attitude that will get us forked.

I don't really see the concern of that (I don't really understand the chance of this, or what causes projects to be forked, nor the cost if they are).

>> They show up often enough cross-OS and build config too (-Asserts, Windows, Darwin, etc).
>
> Ok, good.
>
>> Patches should still be reverted, or tests XFAIL - bots shouldn't be left red for hours (especially in the middle of a work day) or a day.
>
> How do you XFAIL a Clang miscompilation of Clang?

It's a good question - seems like it'd be something we might want to have some way of doing. Perhaps we could have some stub test cases that are used to describe some of these sorts of tests.

> How do you revert a failure that is unrelated to the blame list because they're from previous or external commits?

External? If they're from previous commits/it's a flakey product issue - that's tricky, for sure. We don't have good infrastructure for that. It would be nice to build some (we could run flake detection in off-peak times - tests that are suspected of being flakey could be run repeatedly to see if they are, etc), but it's non-trivial to do so, for sure. For now, I don't know that that's the long pole - though there are some notable exceptions (Windows filesystem IO caused some ongoing flakes on Windows, which I think should be an issue for those running the Windows buildbots).

>> This can often/mostly be compensated for by having more hardware -
>
> Throw money at the problem? :D

Sure, if that's what it takes - we're already paying for the problem with engineering time. I'm suggesting that maybe that cost shouldn't be distributed across the project, but rather localized to those invested (literally, financially) in the behavior of the platforms in question.

> https://www.youtube.com/watch?v=CZmHDEa0Y20
>
>> especially for something as mechanical as a bisect. (obviously once you're in manual iterations, more hardware doesn't help much unless you have a few different hypotheses you can test simultaneously)
>
> I don't have infinite hardware, nor infinite space, nor infinite power, nor infinite time.

None of these things require infinite anything. There's a "reasonable" level of turnaround that can help quite a bit.

> Certain things take longer than others, and people that are used to getting them fast have a lower tolerance for slow(er) processes. Fast and slow are completely arbitrary and relative to how slow or fast things are between themselves.

I don't think they're entirely arbitrary (there are certain broad cutoffs where the productivity loss is more noticeable as you transition from one way to another way of doing things (eg: once your build takes more than a few seconds, you're likely to context switch away then come back to it, etc)). But even if they are, I don't think it's entirely wrong to strive to have a system that is fast.

>> Certainly it takes some more engineering effort and there's overhead for dealing with multiple machines, etc. But it's not linearly proportional to machine speed, because some of it can be compensated for.
>
> Right. So, here, I agree with you. It IS possible to improve and make it much better.
> I'm working on making it better, but it takes time. I can't make it work tomorrow, and that's my original point:
>
> We have to improve and be more strict, but we have to grow to get there, not to flip the table now. I'm suggesting an exp(x) migration plan, not a sig(x).

I'm not suggesting flipping any tables. I'm suggesting having owners of bots that aren't great/easily actionable do the first-level triage, then forward to the relevant contributors. This does not seem to be an impossibly onerous request - is it? Is there something I'm missing about this request being unreasonable?

>> Sure - some issues take a while to investigate. No doubt - but so long as the issue is live (be it flaky or consistent) it's unhelpful (moreso if it's flaky, given the way our buildbots send mail - though I still don't like a red line on the status page, that's costly too) to have the bot red and/or sending mail.
>
> Here, there are two issues:
>
> 1. Buildbots should not email on red->except->red. That's settled, and we must ignore those cases from now on, otherwise, we'll keep coming back at it. So, assume we don't do that any more.

Until that's fixed, again, I don't think it'd be unreasonable to switch bots that tend to be red for extended periods of time (& are thus more prone to this problem) to be owner-triage-first.

> 2. If we agree that any flaky bot is turned off, and the master behaves correctly (as above), we cannot assume that the constant emailing during the investigation phase is due to flakyness. So, if you do get an email, it's probably a meaningful reason.

Sure - though I have a problem, to a lesser degree, with the buildbot status page having red results for issues that are known & under investigation. It would be better if that were not the case (if those bots were XFAIL'd), but it doesn't relate to email notifications at all, which is my bigger concern.
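[Editorial note: David's aside above about running flake detection in off-peak times (re-running suspected tests to see whether they really are intermittent) could start as small as the sketch below. The llvm-lit invocation and paths are assumptions about one bot's setup, not an existing tool.]

    import subprocess

    def run_test(test, build_dir):
        # Placeholder: run a single lit test and report success/failure.
        return subprocess.call(["./bin/llvm-lit", "-q", test], cwd=build_dir) == 0

    def find_flaky(suspects, build_dir, repeats=10):
        # A test with mixed results across identical runs is flagged as flaky.
        flaky = {}
        for test in suspects:
            passes = sum(run_test(test, build_dir) for _ in range(repeats))
            if 0 < passes < repeats:
                flaky[test] = passes
        return flaky

    # e.g. find_flaky(["../llvm/test/CodeGen/ARM/some-test.ll"], "/path/to/build")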
Huge inline record again... I'll pick the contentious issues...

On 19 October 2015 at 19:38, David Blaikie <dblaikie at gmail.com> wrote:
> at all, but /specifically/ about long-red bots that appear neglected.

"Appear" is the key here. It'd be better if you ask first, then propose to disable later. If I was on holidays, someone (maybe you) could have assumed lack of care and disabled them without the ARM sub-community's knowledge.

Probably no one got your email but me. I don't know how you could have made sure everyone was copied, TBH. We have to think about that one, too. Maybe add sub-owners?

> It would be little-to-no change to me to do this to my GDB 7.5 bot, for example - I glance at every failure that comes through anyway. All I'd do differently is forward anything that I thought looked like a real, unique failure, to the mailing list/blame list, rather than having it done automatically. This does not seem terribly onerous. Is it?

You mind one bot. I mind 11, and the list is growing. Our bots are very different from each other, and the failures that happen to one rarely happen to others. I am solving the contingency issue, but that takes time.

I agree that's largely my responsibility, but we can't go from "it's ok to have some red bots" to "we're doomed, kill them all" overnight. I am working towards the goals we both agree on, but it *will* take some time. I'd appreciate some patience.

> I... don't, really. As with my own GDB 7.5 buildbot, I pretty much assume interesting failures will probably involve me helping to triage (especially with the Apple engineers explicitly not having access to the source/test cases run there) the issues. The bot sends me email on every red, and I treat that as pretty much a thing I need to care about until it's green, as much as possible by acting as a facilitator to the original contributor who committed the breakage.

ARM is one of the main architectures in LLVM. Compatibility with GDB 7.5 is important, but substantially less so. It may look selfish on my part, but I don't think you can compare them as equals. A lot more people, projects and companies will be upset if ARM support regresses than if the GDB 7.5 bot stays red for a few weeks, or even a few months.

Given the importance, I don't think it's feasible (or healthy) for me to own most of the bots, but for now, it is what it is. I'd appreciate it if other companies that do care about ARM could *also* contribute and maintain ARM bots on their own. But even that will take some time.

> Do you believe there's no quality point in a buildbot notification where it is not worth sending mail/notification?

No, I agree with you on almost all technical points. But those changes need to take some time to happen.

>> How do you XFAIL a Clang miscompilation of Clang?
>
> It's a good question - seems like it'd be something we might want to have some way of doing. Perhaps we could have some stub test cases that are used to describe some of these sort of tests.

To answer my own question, I think staged bots are the solution here.

> If they're from previous commits/it's a flakey product issue - that's tricky, for sure.

One critical thing that doesn't get caught: Zorg changes. Maybe we should add a monitor to Zorg on every SVN poller. If we can, make sure that we build every Zorg change isolated from any other.

> None of these things require infinite anything.
> There's a "reasonable" level of turnaround that can help quite a bit.

"Reasonable" depends on how many resources (money, hardware, engineers) you have. You're seeing everyone else through your own glasses, assuming you could fix the problem in X days because you have N engineers, M money and Y hardware availability, whereas all those variables are different for other people / companies.

By saying that "everyone willing to help" should invest as much as Google or Apple does, you're essentially shutting off everyone else *but* Google and Apple from the project. That's where the risk of forking comes from.

cheers,
--renato
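[Editorial note: Renato's suggestion earlier in this message, monitoring Zorg itself so that builder-configuration changes show up alongside code changes, could start as small as the standalone poller sketched below. In a real setup this would be another change source on the buildmaster; here the polling interval, function names and the assumption that the public zorg SVN URL is reachable are all illustrative only.]

    import re
    import subprocess
    import time

    ZORG_URL = "http://llvm.org/svn/llvm-project/zorg/trunk"

    def zorg_revision():
        # Parse the latest revision out of `svn info` for the zorg repository.
        out = subprocess.check_output(["svn", "info", ZORG_URL]).decode()
        return int(re.search(r"^Revision:\s*(\d+)", out, re.MULTILINE).group(1))

    def watch_zorg(poll_seconds=300):
        last = zorg_revision()
        while True:
            time.sleep(poll_seconds)
            current = zorg_revision()
            if current != last:
                # A builder-config change: worth flagging next to the normal blame list.
                print("zorg changed: r%d -> r%d" % (last, current))
                last = current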