On 5 October 2015 at 22:28, David Blaikie <dblaikie at gmail.com> wrote:
>> These buildbots normally finish under one hour, but most of the time
>> under 1/2 hour and should be kept green as much as possible.
>> Therefore, any reasonable noise
>
> Not sure what kind of noise you're referring to here. Flaky fast builders
> would be a bad thing, still - so that sort of noise should still be
> questioned.

Sorry, I meant "noise" as in "sound", not as opposed to "signal".

These bots are assumed stable, otherwise they would be in another
category below.

> I'm not sure if we need extra policy here - but I don't mind documenting the
> common community behavior here to make it more clear.

Some people in the community behave quite differently from others. I
sent this email because I felt we disagree on some fundamental
properties of the buildbots, and before we can agree on a common
strategy, there is no consensus or "common behaviour" to be
documented.

However, I agree, we don't need "policy", just "documented behaviour"
as usual. That was my intention when I said "policy".

> Long bisection is a function of not enough boards (producing large revision
> ranges for each run), generally - no? (or is there some other reason?)

It's not that simple. Some bugs appear after several iterations of
green results. It may sound odd, but I had at least three this year.

These are the hardest bugs to find, and usually standard regression
scripts can't find them automatically, so I have to do most of the
investigation manually. This takes *a lot* of time.

> Generally all bots catch serious bugs.

That's not what I meant. Quick bots catch bad new tests (over-assuming
on CHECK lines, forgetting to specify the triple on RUN lines) as well
as simple code issues (32 vs 64 bits, new vs old compiler errors,
etc), just because they're the first to run on a different environment
than the developer uses.
Slow bots are most of the time buffered
against those, since patches and fixes (or reverts) tend to come in
bundles, while the slow bot is building.

>> like self-hosted Clang
>> mis-compiling a long-running test which sometimes fails. They can
>> produce noise, but when the noise is correct, we really need to listen
>> to it. Writing software to understand that is non-trivial.
>
> Again, not sure which kind of noise you're referring to here - it'd be
> helpful to clarify/disambiguate.

Noise here is less "sound" and more "noisy signal". Some of the
"noise" in these bots is just noise; the rest is signal masquerading
as noise.

Of course, the higher the noise level, the harder it is to interpret
the signal, but as is usual in science, sometimes the only signal we
have is a noisy one.

It's common for mathematicians to scoff at the physicists' lack of
precision, as it is for physicists to do the same to chemists, then
biologists, etc. When you're on top, it seems folly that some people
endure large amounts of noise in their signal, but when you're at the
bottom and your only signal has a lot of noise, you have to work with
it and make do with what you have.

As I said above, it's not uncommon for a failure to "pass" the tests
for a few iterations before failing. So we're not talking *only* about
hardware noise, but also about noise at the code level, where
assumptions based on the host architecture might not be valid on other
architectures. Most of us develop on x86 machines, so it's only
logical that PPC, MIPS and ARM buildbots will fail more often than x86
ones. But that's precisely the point of having those bots in the first
place.
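To make the "forgetting to specify the triple" failure mode concrete: a
test whose RUN line omits an explicit triple is compiled for whatever the
builder's host target is, so it can pass on the author's x86 machine and
break on an ARM or MIPS bot. A sketch (the test itself is hypothetical,
not one from the tree):

```
; Host-dependent: llc targets the builder's default triple, so CHECK
; lines that assume x86 register names will fail on ARM/MIPS bots.
; RUN: llc < %s | FileCheck %s

; Portable: the triple is pinned, so every bot exercises the same codegen.
; RUN: llc -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s
```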
Requesting to disable those bots because they generate noise is like
asking people to give their opinion about a product, showing the
positive reviews, and suing the rest.

> But they are problems that need to be addressed, is the key - and arguably,
> until they are addressed, these bots should only report to the owner, not to
> contributors.

If we didn't have those bots already for many years, and if we had
another way of testing on those architectures, I'd agree with you. But
we don't.

I agree we need to improve. I agree it's the architecture-specific
community's responsibility to do so. I just don't agree that we should
disable all noise (and with it the signal, baby/bath) until we do so.

By the time we get there, all sorts of problems will have crept in,
and we'll enter a vicious cycle. Been there, done that.

> I still question whether these bots provide value to the community as a
> whole when they send email. If the investigation usually falls to the owners
> rather than the contributors, then the emails they send (& their presence on
> a broader dashboard) may not be beneficial.

Benefit is a spectrum. People have different thresholds. Your
threshold is tougher than mine because I'm used to working in an
environment where the noise is almost as loud as the signal.

I don't think we should be bound to either of our thresholds; that's
why I'm opening the discussion to have a migration plan to produce
less noise. But that plan doesn't include killing bots just because
they annoy people.

If you plot value as a function of noise and benefit, you get a
surface with maxima and minima. Your proposal is to set a threshold
and cut all the bots that fall on the minima below that line. My
proposal is to move all those bots as high as we can, and only then
cut the bots that didn't make it past the threshold.

> So to be actionable they need to have small blame lists and be reliable (low
> false positive rate).
> If either of those is compromised, investigation will
> fall to the owner and ideally they should not be present in email/core
> dashboard groups.

Ideally, this is where both of us want to be. Realistically, it'll
take a while to get there.

We need changes in the buildbot area, but there are also inherent
problems that cannot be solved.

Any new architecture (like AArch64) will have only experimental
hardware for years, and later on, experimental kernel, then
experimental tools, etc. When developing a new back-end for a
compiler, those unstable and rapidly evolving environments are the
*only* thing you have to test on.

You normally only have one or two (experimental means either *very*
expensive or priceless), so having multiple boxes per bot is highly
unlikely. It can also mean that the experimental device you got last
month is not supported any more because a new one is coming, so you'll
have to live with those bugs until you get the new one, which will
come with its own bugs.

For older ARM cores (v7), this is less of a problem, but since old ARM
hardware was never designed for production machines, their flakiness
is inherent to their form factor. It is possible to get them into a
stable-enough configuration, but it takes time, resources, excess
hardware and people constantly monitoring the infrastructure. We're
getting there, but we're not there yet.

I agree that this is mostly *my* problem and *I* should fix it, and
believe me, I *want* to fix it; I just need a bit more time. I suspect
that the other platform folks feel the same way, so I'd appreciate a
little more respect when we talk about acceptable levels of noise and
effort.

> I disagree here - if the bots remain red, they should be addressed. This is
> akin to committing a problematic patch before you leave - you should
> expect/hope it is reverted quickly so that you're not interrupting
> everyone's work for a week.

Absolutely not!

Committing a patch and going on holidays is a disrespectful act.
Bot maintainers going on holidays is an inescapable fact.

Silencing a bot while the maintainer is away is a possible workaround,
but disabling it is most disrespectful.

However, I'd like to remind you of the confirmation bias problem,
where people will look at the bot, think it's noise, and silence the
bot when they could have easily fixed it. Later on, when the owner
gets back to work, surprise new bugs that weren't caught will fill the
first weeks. We have to be extra careful when taking actions without
the bot owners' knowledge.

> I'm not quite sure I follow this comment. The less noise we have, the /more/
> problematic any remaining noise will be

Yes, I meant what you said. :)

Less noise, higher bar to meet.

> Depends on the error - if it's transient, then this is flakiness as always &
> should be addressed as such (by trying to remove/address the flakes).
> Though, yes, this sort of failure should, ideally, probably, go to the
> buildbot owner but not to users.

Ideally, SVN errors should go to the site admins, but let's not get
ahead of ourselves. :)

>> Other failures, like timeout, can be either flaky hardware or broken
>> codegen. A way to be conservative and low noise would be to only warn
>> on timeouts IFF it's the *second* in a row.
>
> I don't think this helps - this reduces the incidence, but isn't a real
> solution.

I agree.

> We should reduce the flakiness of hardware. If hardware is this
> unreliable, why would we be building a compiler for it?

Because that's the only hardware that exists.

> No user could rely on it to produce the right answer.

No user is building trunk on every commit (ish). Buildbots are not
meant to be as stable as a user (including distros) would require.
That's why we have extra validation on releases.
Buildbots build potentially unstable compilers; otherwise we wouldn't
need buildbots in the first place.

>> For all these adjustments, we'll need some form of walk-back on the
>> history to find the previous genuine result, and we'll need to mark
>> results with some metadata. This may involve some patches to buildbot.
>
> Yeah, having temporally related buildbot results seems dubious/something I'd
> be really cautious about.

This is not temporal; it's just regarding exception as no-change
instead of success. The only reason why it's success right now is
because, the way we're set up to email on every failure, we don't want
to spam people when the master is reloaded.

That's the wrong meaning for the wrong reason.

> I imagine one of the better options would be some live embedded HTML that
> would just show a green square/some indicator that the bot has been green at
> least once since this commit.

That would be cool! But I suspect at the cost of a big change in the
buildbots. Maybe not...

>> This is a wish-list that I have, for the case where the bots are slow
>> and hard to debug and are still red. Assuming everything above is
>> fixed, they will emit no noise until they go green again; however,
>> while I'm debugging the first problem, others can appear. If that
>> happens, *I* want to know, but not necessarily everyone else.
>
> This seems like the place where XFAIL would help you and everyone else. If
> the original test failure was XFAILed immediately, the bot would go green,
> then red again if a new failure was introduced. Not only would you know, but
> so would the author of the change.

I agree in principle. I just worry that it's a lot easier to add an
XFAIL than to remove it later.

Though, it might be just a matter of documenting the common behaviour
and expecting people to follow through.

cheers,
--renato
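The "walk-back on the history" adjustment described above - regarding an
exception build as no-change rather than success, and comparing against
the previous genuine result - could be sketched roughly like this. The
result strings and helper names are illustrative, not buildbot's actual
API:

```python
# Sketch: notify only on a genuine green->red transition, treating
# EXCEPTION builds (master reloads, infrastructure hiccups) as
# no-change rather than success.

SUCCESS, FAILURE, EXCEPTION = "success", "failure", "exception"

def previous_genuine(history):
    """Walk back through prior results, skipping exceptions,
    and return the most recent genuine result (or None)."""
    for result in reversed(history):
        if result != EXCEPTION:
            return result
    return None

def should_notify(history, current):
    """Email contributors only when a genuinely green bot turns red."""
    if current == EXCEPTION:
        return False  # infrastructure noise: never notify
    return current == FAILURE and previous_genuine(history) == SUCCESS
```

With this, a success->exception->failure sequence still produces one
notification (the exception is transparent), while a reloaded master
sitting between two red builds produces none.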
On Tue, Oct 6, 2015 at 3:49 AM, Renato Golin <renato.golin at linaro.org> wrote:

> On 5 October 2015 at 22:28, David Blaikie <dblaikie at gmail.com> wrote:
> >> These buildbots normally finish under one hour, but most of the time
> >> under 1/2 hour and should be kept green as much as possible.
> >> Therefore, any reasonable noise
> >
> > Not sure what kind of noise you're referring to here. Flaky fast builders
> > would be a bad thing, still - so that sort of noise should still be
> > questioned.
>
> Sorry, I meant "noise" as in "sound", not as opposed to "signal".
>
> These bots are assumed stable, otherwise they would be in another
> category below.
>
> > I'm not sure if we need extra policy here - but I don't mind documenting the
> > common community behavior here to make it more clear.
>
> Some people in the community behave quite differently from others. I
> sent this email because I felt we disagree on some fundamental
> properties of the buildbots, and before we can agree on a common
> strategy, there is no consensus or "common behaviour" to be
> documented.
>
> However, I agree, we don't need "policy", just "documented behaviour"
> as usual. That was my intention when I said "policy".
>
> > Long bisection is a function of not enough boards (producing large revision
> > ranges for each run), generally - no? (or is there some other reason?)
>
> It's not that simple. Some bugs appear after several iterations of
> green results. It may sound odd, but I had at least three this year.
>
> These are the hardest bugs to find, and usually standard regression
> scripts can't find them automatically, so I have to do most of the
> investigation manually. This takes *a lot* of time.

Flakey failures, yes. I'd expect an XFAIL while it's under
investigation, for sure (or notifications forcibly disabled, if that
works better/is necessary).
Because flakey failures produce un-suppressable noise on the bot
(because they're not a continuous run of red).

> > Generally all bots catch serious bugs.
>
> That's not what I meant. Quick bots catch bad new tests (over-assuming
> on CHECK lines, forgetting to specify the triple on RUN lines) as well
> as simple code issues (32 vs 64 bits, new vs old compiler errors,
> etc), just because they're the first to run on a different environment
> than the developer uses. Slow bots are most of the time buffered
> against those, since patches and fixes (or reverts) tend to come in
> bundles, while the slow bot is building.
>
> >> like self-hosted Clang
> >> mis-compiling a long-running test which sometimes fails. They can
> >> produce noise, but when the noise is correct, we really need to listen
> >> to it. Writing software to understand that is non-trivial.
> >
> > Again, not sure which kind of noise you're referring to here - it'd be
> > helpful to clarify/disambiguate.
>
> Noise here is less "sound" and more "noisy signal". Some of the
> "noise" in these bots is just noise; the rest is signal masquerading
> as noise.
>
> Of course, the higher the noise level, the harder it is to interpret
> the signal, but as is usual in science, sometimes the only signal we
> have is a noisy one.
>
> It's common for mathematicians to scoff at the physicists' lack of
> precision, as it is for physicists to do the same to chemists, then
> biologists, etc. When you're on top, it seems folly that some people
> endure large amounts of noise in their signal, but when you're at the
> bottom and your only signal has a lot of noise, you have to work with
> it and make do with what you have.
> As I said above, it's not uncommon for a failure to "pass"
> the tests for a few iterations before failing.

That would be flakey - yes, there are many sources of flakeyness (& if
it passed a few times before it failed, it probably won't fail
regularly now - it'll continue to oscillate back and forth over
passing and failing, producing a lot of notification noise/spam).
Flakeyness should be addressed, for sure - XFAIL or suppress bot
notifications while investigating, etc.

> So we're not talking *only* about
> hardware noise, but also about noise at the code level, where
> assumptions based on the host architecture might not be valid on
> other architectures. Most of us develop on x86 machines, so it's only
> logical that PPC, MIPS and ARM buildbots will fail more often than x86
> ones. But that's precisely the point of having those bots in the first
> place.
>
> Requesting to disable those bots because they generate noise is like
> asking people to give their opinion about a product, showing the
> positive reviews, and suing the rest.

When I suggest someone disable notifications from a bot, it's because
those notifications aren't actionable to those receiving them. It's
not a suggestion that the platform is unsupported, that the bot should
be turned off, or that the issues are not real. It is a suggestion
that the notifications are only relevant to the owner/persons invested
in that platform, and that a level of triage from them may be
necessary or otherwise appropriate.

> > But they are problems that need to be addressed, is the key - and arguably,
> > until they are addressed, these bots should only report to the owner, not to
> > contributors.
>
> If we didn't have those bots already for many years, and if we had
> another way of testing on those architectures, I'd agree with you. But
> we don't.

I'm not suggesting removing the testing - merely placing the onus of
responding to/investigating notifications on the parties with the
context to do so.
Long blame lists and flakey results on inaccessible hardware generally
amount to unactionable results, except for the person who owns/is
invested in the architecture.

> I agree we need to improve. I agree it's the architecture-specific
> community's responsibility to do so. I just don't agree that we should
> disable all noise (and with it the signal, baby/bath) until we do so.

If the results aren't actionable by the people receiving them, that's
a bug & we should fix it pretty much immediately. If the
architecture-specific community can then produce automated actionable
results, great. Until then, I don't think it's a huge cost to say that
that community can do the first level of triage. In cases where the
triage is cheap, this shouldn't be a big deal for the bot owner to
do - and when the triage is expensive, well, that's the point:
imposing that triage on the community at large (especially with large
blame lists) doesn't seem to work.

> By the time we get there, all sorts of problems will have crept in,
> and we'll enter a vicious cycle. Been there, done that.

Why would that happen? All I'd expect is that you/others watch the
negative bot results, and forward on any that look like actionable
true positives. If that's too expensive, then I don't know how you can
expect community members to incur that cost instead of bot owners.

> > I still question whether these bots provide value to the community as a
> > whole when they send email. If the investigation usually falls to the owners
> > rather than the contributors, then the emails they send (& their presence on
> > a broader dashboard) may not be beneficial.
>
> Benefit is a spectrum. People have different thresholds. Your
> threshold is tougher than mine because I'm used to working in an
> environment where the noise is almost as loud as the signal.
>
> I don't think we should be bound to either of our thresholds; that's
> why I'm opening the discussion to have a migration plan to produce
> less noise.
> But that plan doesn't include killing bots just because
> they annoy people.

Again: if the notifications are going to people who can't act on them,
we should disable them; otherwise people will not have confidence in
the positive signals and we lose value overall. Disabling the bot is
not the only solution - simply disable the notifications for
contributors, having them go to the bot owner first for triage; the
owner can then forward the notification on to the contributor if it's
a good true positive.

I assume bot owners are already doing this work - if they care about
their platform, presumably they're watching for platform-specific
failures and often having to follow up with contributors, because so
many of them ignore the mails today due to their unactionable nature.
So I'm not really asking for any more work from the owners of these
bots, if they care about the results & people are already ignoring
them. I'm just asking to remove the un-actioned notifications, to
increase confidence in our notifications.

> If you plot value as a function of noise and benefit, you get a
> surface with maxima and minima. Your proposal is to set a threshold
> and cut all the bots that fall on the minima below that line. My
> proposal is to move all those bots as high as we can, and only then
> cut the bots that didn't make it past the threshold.

The problem with that is that people continue to lose confidence in
the bots (especially new contributors) the longer we maintain the
current state (& I don't have a great deal of confidence in the
timespan it will take to get automated, high-quality results from all
the current bots). Once people lose confidence in the bots, they're
not likely to /gain/ confidence again - they'll start ignoring them
and not have any reason to re-evaluate that situation in the future.
That's my usual approach, but I recently decided to re-evaluate that &
be verbose about it.
Most people aren't verbose (or militant, as you put it) because
they're already ignoring all of this. That's a /bad/ thing.

I would like to set the bar high: bot notifications must be high
quality, and if they aren't, we disable them aggressively. This places
the onus on the owner to improve the quality before turning the
notifications (back) on, rather than incurring a distributed cost over
the whole project (one that may have a long-term effect) while we wait
for the quality to improve.

> > So to be actionable they need to have small blame lists and be reliable (low
> > false positive rate). If either of those is compromised, investigation will
> > fall to the owner and ideally they should not be present in email/core
> > dashboard groups.
>
> Ideally, this is where both of us want to be. Realistically, it'll
> take a while to get there.
>
> We need changes in the buildbot area, but there are also inherent
> problems that cannot be solved.
>
> Any new architecture (like AArch64) will have only experimental
> hardware for years, and later on, experimental kernel, then
> experimental tools, etc. When developing a new back-end for a
> compiler, those unstable and rapidly evolving environments are the
> *only* thing you have to test on.
>
> You normally only have one or two (experimental means either *very*
> expensive or priceless), so having multiple boxes per bot is highly
> unlikely. It can also mean that the experimental device you got last
> month is not supported any more because a new one is coming, so you'll
> have to live with those bugs until you get the new one, which will
> come with its own bugs.
>
> For older ARM cores (v7), this is less of a problem, but since old ARM
> hardware was never designed for production machines, their flakiness
> is inherent to their form factor. It is possible to get them into a
> stable-enough configuration, but it takes time, resources, excess
> hardware and people constantly monitoring the infrastructure.
> We're getting there, but we're not there yet.
>
> I agree that this is mostly *my* problem and *I* should fix it, and
> believe me, I *want* to fix it; I just need a bit more time. I suspect
> that the other platform folks feel the same way, so I'd appreciate a
> little more respect when we talk about acceptable levels of noise and
> effort.

I'm sorry if I've come across as disrespectful. I do appreciate that a
whole bunch of people care about a whole bunch of different things.
Even I fall prey to the same situation - the GDB buildbot is flakey. I
let it go, but I really should fix the flakes - and I wouldn't mind
community pressure to do so. But partly, as I mentioned in my previous
reply, existing noise levels make it less interesting to improve small
amounts of noise (tragedy of the commons, etc). The lower the noise
level, the more pressure we'll have to keep the remaining sources of
noise down.

> > I disagree here - if the bots remain red, they should be addressed. This is
> > akin to committing a problematic patch before you leave - you should
> > expect/hope it is reverted quickly so that you're not interrupting
> > everyone's work for a week.
>
> Absolutely not!
>
> Committing a patch and going on holidays is a disrespectful act. Bot
> maintainers going on holidays is an inescapable fact.
>
> Silencing a bot while the maintainer is away is a possible workaround,
> but disabling it is most disrespectful.
>
> However, I'd like to remind you of the confirmation bias problem,
> where people will look at the bot, think it's noise, and silence the
> bot when they could have easily fixed it. Later on, when the owner
> gets back to work, surprise new bugs that weren't caught will fill the
> first weeks. We have to be extra careful when taking actions without
> the bot owners' knowledge.
I'm looking at the existing behavior of the community - if people are
generally ignoring the result of a bot anyway (& if it's red for weeks
at a time, I think they are), then the notifications are providing no
value. The bot only provides value to the owner - who triages the
failures, then reaches out to the community to provide the
reproduction/assist developers with a fix. All I want to do is remove
notifications people aren't acting on anyway.

> > I'm not quite sure I follow this comment. The less noise we have, the /more/
> > problematic any remaining noise will be
>
> Yes, I meant what you said. :)
>
> Less noise, higher bar to meet.
>
> > Depends on the error - if it's transient, then this is flakiness as always &
> > should be addressed as such (by trying to remove/address the flakes).
> > Though, yes, this sort of failure should, ideally, probably, go to the
> > buildbot owner but not to users.
>
> Ideally, SVN errors should go to the site admins, but let's not get
> ahead of ourselves. :)
>
> >> Other failures, like timeout, can be either flaky hardware or broken
> >> codegen. A way to be conservative and low noise would be to only warn
> >> on timeouts IFF it's the *second* in a row.
> >
> > I don't think this helps - this reduces the incidence, but isn't a real
> > solution.
>
> I agree.
>
> > We should reduce the flakiness of hardware. If hardware is this
> > unreliable, why would we be building a compiler for it?
>
> Because that's the only hardware that exists.
>
> > No user could rely on it to produce the right answer.
>
> No user is building trunk on every commit (ish). Buildbots are not
> meant to be as stable as a user (including distros) would require.

I disagree with this - I think it's a worthy goal to have continuous
validation that is more robust and comprehensive. At some point the
compute resource cost is not worth the bug-finding rate - and we run
those tasks less frequently. But none of that excuses instability.
I'd expect infrequent validation to be acceptably less stable than
frequent validation. Extra validation before release is for work that
takes too long (& has a lower chance of finding bugs) to run on every
change/frequently.

> That's
> why we have extra validation on releases.
>
> Buildbots build potentially unstable compilers,

Potentially - though flakey behavior in the compiler isn't /terribly/
common. It does happen, for sure. More often I see flakey tests/test
infrastructure - which reduces confidence in the quality of the
infrastructure, causing people to ignore true positives due to the
high rate of false positives.

> otherwise we wouldn't
> need buildbots in the first place.
>
> >> For all these adjustments, we'll need some form of walk-back on the
> >> history to find the previous genuine result, and we'll need to mark
> >> results with some metadata. This may involve some patches to buildbot.
> >
> > Yeah, having temporally related buildbot results seems dubious/something I'd
> > be really cautious about.
>
> This is not temporal; it's just regarding exception as no-change
> instead of success.

red->exception->red I don't mind too much - the "timeout->timeout"
example you gave is one I disagree with.

> The only reason why it's success right now is because, the way we're
> set up to email on every failure, we don't want to spam people when the
> master is reloaded.
>
> That's the wrong meaning for the wrong reason.
>
> > I imagine one of the better options would be some live embedded HTML that
> > would just show a green square/some indicator that the bot has been green at
> > least once since this commit.
>
> That would be cool! But I suspect at the cost of a big change in the
> buildbots. Maybe not...

Yeah, not sure how expensive it'd be.

> >> This is a wish-list that I have, for the case where the bots are slow
> >> and hard to debug and are still red.
> >> Assuming everything above is
> >> fixed, they will emit no noise until they go green again; however,
> >> while I'm debugging the first problem, others can appear. If that
> >> happens, *I* want to know, but not necessarily everyone else.
> >
> > This seems like the place where XFAIL would help you and everyone else. If
> > the original test failure was XFAILed immediately, the bot would go green,
> > then red again if a new failure was introduced. Not only would you know, but
> > so would the author of the change.
>
> I agree in principle. I just worry that it's a lot easier to add an
> XFAIL than to remove it later.

How so? If you're actively investigating the issue, and everyone else
is happily ignoring the bot result (& so won't care when it goes
green, or red again), you're owning the issue to get your bot back to
green, and it just means you have to un-XFAIL it as soon as that
happens.

> Though, it might be just a matter of documenting the common behaviour
> and expecting people to follow through.
>
> cheers,
> --renato

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20151006/43b14a91/attachment.html>
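The XFAIL mechanism under discussion is a one-line lit directive in the
failing test: adding it turns the bot green while the investigation
continues, and removing it is the "un-XFAIL" step. Conveniently, lit
reports an XFAILed test that starts passing as XPASS (a failure), so a
stale directive does eventually make noise of its own. A sketch (the
test and triple are illustrative):

```
; Expected to fail on ARM targets only while the miscompile is being
; investigated; remove this line once the underlying bug is fixed,
; or lit will flag the test as XPASS.
; XFAIL: arm
; RUN: llc -mtriple=armv7-linux-gnueabihf < %s | FileCheck %s
```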
Hi David,

I think we're repeating ourselves here, so I'll reduce to the bare
minimum before replying.

On 6 October 2015 at 21:40, David Blaikie <dblaikie at gmail.com> wrote:
> When I suggest someone disable notifications from a bot it's because those
> notifications aren't actionable to those receiving them.

This is a very limited view of the utility of buildbots. I think part
of the problem is that you're expecting to get instant value out of
something that cannot provide that to you. If you can't extract value
from it, it's worthless.

Also, it seems, you're associating community buildbots with company
testing infrastructure. When I worked at big companies, there were
validation teams that would test my stuff and deal with *any* noise on
their own, and only the real signal would come to me: 100% actionable.

However, most of the bot owners in open source communities do this as
a secondary task. This has always been the case, and until someone
(LLVM Foundation?) starts investing in a better infrastructure overall
(multi-master, new slaves, admins), there isn't much we can do to
improve that quickly enough.

The alternative is that the less common architectures will always have
noisier bots, because fewer people use them day-to-day during their
development time. Taking a hard line on those means, in the long run,
that we'll disable most testing on all secondary architectures, and
LLVM becomes an Intel compiler. But many companies use LLVM as their
production compiler on their own targets, so the inevitable result is
that they will *fork* LLVM. I don't think anyone wants that.

> I'm not suggesting removing the testing. Merely placing the onus on
> responding to/investigating notifications on the parties with the context to
> do so.

You still don't get the point. This would make sense in a world where
all parties are equal. Most people develop and test on x86, even ARM
and MIPS engineers. That means x86 is almost always stable, no matter
who's working.
But some bugs that we had to fix this year show up randomly *only* on
ARM. One was a serious misuse of the Itanium C++ ABI, one that took a
long time to be fixed, and we still don't know if we got them all.

Bugs like that normally only show up on self-hosting builds, sometimes
on the test-suite compiled by self-hosted Clang. These bugs have no
hard good/bad line for bisecting, they take hours per cycle, and they
may or may not fail, so automated bisecting won't work. Furthermore,
there is nothing to XFAIL in this case, unless you want to disable
building Clang, which I don't think you do.

While it's taking days, if not weeks, to investigate this bot, the
status may be going from red to green to red. It would be very
simplistic to assume that *any* green->red transition while I'm
bisecting the problem will be due to the current known instability. It
could be anything, and developers still need to be warned if the alarm
goes off.

The result may be that it's still flaky, the developer can't do much,
and life goes on. Or it could be his test, he fixes it immediately,
and I'm eternally grateful, because I still need to investigate *only
one* bug at a time. By silencing the bot, I'd be responsible for
debugging the original hard problem plus any other that came up while
the bot was flaky.

Now, there's the issue of where the responsibility lies... I'm
responsible for the quality of the ARM code, including the buildbots.
What you're suggesting is that, *no matter what* gets committed, it is
*my* responsibility to fix any bug that the original developers can't
*act upon*.

That might seem sensible at first, but the biggest problem here is the
term that you're using over and over again: *acting upon*. It can be a
technical limitation that you can't act upon a bug on an ARM bot, but
it can also be a personal one. I'm not saying *you* would do that, but
we have plenty of people in the community with plenty of their own
problems.
You said it yourself: people tend to ignore problems that they can't understand, but not understanding is *not* the same as not being able to *act upon*. For me, that attitude is what's at the core of the problem here. By raising the bar faster than we can make things better, you're essentially just giving people the right not to care. The bar will be raised even further by peer pressure, and that's the kind of behaviour that leads to a fork. I'm trying to avoid this at all costs.

> All I'd expect is that you/others watch the negative
> bot results, and forward any on that look like actionable true positives. If
> that's too expensive, then I don't know how you can expect community members
> to incur that cost instead of bot owners?

Another example of the assumption that bot owners are validation engineers and that this is their only job. It was never like this in LLVM, and it won't start today just because we want it to. My expectation of the LLVM Foundation is that they would take our validation infrastructure to the next level, but so far I haven't seen much happening. If you want to make it better, instead of forcing your way on the existing scenario, why not work with the Foundation to move this to the next level?

> Once people lose
> confidence in the bots, they're not likely to /gain/ confidence again -

That's not true. Galina's Panda bots were unstable in 2010, people lost confidence, she added more boards, and people regained confidence in 2011. Then they became unstable in 2013, people lost confidence, we fixed the issues, and people regained confidence only a few months later. This year they got unstable again, but because we already have enough ARM bots elsewhere, she disabled them for good. You're exaggerating the effects of unstable bots, as if people expected them to be always perfect.
I'd love it if they could be, but I don't expect them to be.

> I'm looking at the existing behavior of the community - if people are
> generally ignoring the result of a bot anyway (& if it's red for weeks at a
> time, I think they are) then the notifications are providing no value.

I'm not seeing that, myself. So far, you're the only one that is shouting out loud that this or that bot is noisy. Sometimes people ignore bots, but I don't take this as a sign that everything is doomed, just that people focus on different things at different times.

>> No user is building trunk every commit (ish). Buildbots are not meant
>> to be as stable as a user (including distros) would require.
>
> I disagree with this - I think it's a worthy goal to have continuous
> validation that is more robust and comprehensive.

A worthy goal, yes. Doable right now, with the resources that we have, no. And no amount of shouting will get this done. If we want quality, we need top-level management, preferably from the LLVM Foundation, and a bunch of dedicated people working on it, which could be either funded by the Foundation or agreed between the interested parties. If anyone ever gets this conversation going (I tried), please let me know, as I'm very interested in making that happen.

> red->exception->red I don't mind too much - the "timeout->timeout" example
> you gave is one I disagree with.

Ah, yes. I mixed them up.

>> I agree in principle. I just worry that it's a lot easier to add an
>> XFAIL than to remove it later.
>
> How so? If you're actively investigating the issue, and everyone else is
> happily ignoring the bot result (& so won't care when it goes green, or red
> again) - you're owning the issue to get your bot back to green, and it just
> means you have to un-XFAIL it as soon as that happens.

From my experience, companies put people to work on open source projects when they need something done and don't want to bear the costs of maintaining it later.
So, initially, developers are under high pressure to push their patches through, and you see them very excited about addressing review comments, adding tests, and fixing bugs. But once the patch is in, the priority of that task, for that company, is greatly reduced. Most developers consider investigating an XFAIL from their commit to be as important as the commit itself, but not all companies share that passion. Moreover, once developers implement whatever they needed here, it's not uncommon for their parent companies to move them away from the project, in which case they can't even contribute any more due to license issues, etc. But we also have the not-so-responsible developers, who could create a bug, assign it to themselves, and never look back unless someone complains.

That's why, at Linaro, I have the policy of only marking a test XFAIL when I can guarantee that it's either not supposed to work or that the developer will fix it *before* marking the task closed.

cheers,
--renato
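For anyone following along who hasn't used the mechanism: in LLVM's lit test suite, an XFAIL is just an annotation in the test file listing the targets or features on which the failure is expected, which is exactly why it's cheap to add and easy to forget. A made-up illustrative fragment (the test body and bug reference are hypothetical, but the `XFAIL:` line syntax is real lit syntax):

```
; RUN: llc < %s -mtriple=armv7-linux-gnueabihf | FileCheck %s
; Known-bad on ARM while the miscompile is investigated; remember to
; remove this line once the underlying bug is closed.
; XFAIL: arm
```

Nothing forces the annotation to ever be removed, which is the asymmetry being described above: adding it is one line, while un-XFAILing requires someone to still be around and still care.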