Mehdi Amini via llvm-dev
2015-Aug-26 16:01 UTC
[llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9
> On Aug 26, 2015, at 8:21 AM, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On 26 August 2015 at 15:44, Tobias Grosser <tobias at grosser.es> wrote:
>> What time-line do you have in mind for this fix? If you are in charge
>> and can make this happen within a day, giving cmake + ninja a chance
>> seems OK.
>
> It's not my bot. All my bots are CMake+Ninja based and are stable enough.
>
>> However, if the owner of the buildbot is not known or the fix cannot
>> come soon, I am in favor of disabling the noise and (re)enabling it
>> when someone finds time to address the problem and verify the solution.
>
> That's up to Galina. We haven't taken any action against unstable bots
> so far, and this is not the only one. There are lots of Windows and
> sanitizer bots that break randomly and provide little information; are
> we going to disable them all? How about the perf bots that still fail
> occasionally and whose root cause we haven't managed to fix - are we
> going to disable them, too?
>
> You're asking to considerably reduce the quality of testing in some
> areas so that you can reduce the time spent looking at spurious
> failures. I don't agree with that in principle.

That's not how I understand his point. In my opinion, he is asking to increase the quality of testing. You just happen to disagree with his solution :)

The situation does not seem that black and white to me. In the end, it comes down to a threshold: if a bot is crashing 90% of the time, does it really contribute to the quality of testing, or is it just adding noise? The same question applies at 20%, 40%, 60%, ... We may all have a different answer, but I'm pretty sure we could reach an agreement on what seems appropriate.

Another way of assessing a bot's impact on quality in general is: "how many legitimate failures were found by this bot in the last x years that weren't covered by another bot?" Because sometimes you may just be stress-testing a rack in a hardware lab, without providing any increased coverage for the software.

Cheers,

—
Mehdi

> There were other
> threads focusing on how to make them less spurious, more stable, less
> noisy, and some work is being done on the GreenDragon bot structure.
> But killing everything that looks suspicious now will reduce our
> ability to validate LLVM on the range of configurations that we do
> today, and that, for me, is a lot worse than a few minutes' worth of
> some engineers' time.
>
>> The cost of
>> buildbot noise is very high, both in terms of developer time spent,
>> but more importantly due to people starting to ignore them when
>> monitoring them becomes costly.
>
> I think you're overestimating the cost.
>
> When I get bot emails, I click on the link, and if it was a timeout, I
> always ignore it. If I can't make heads or tails of it (like the
> sanitizer ones), I ignore it temporarily, then look again the next day.
>
> My assumption is that the bot owner will make me aware if the reason
> is not obvious, as I do with my bots. I always wait for people to
> realise, and fix. But if they can't, either because the bot was
> already broken, or because the breakage isn't clear, I let people know
> where to search for the information in the bot itself. This is my
> responsibility as a bot owner.
>
> I appreciate the benefit of having green / red bots, but you also have
> to appreciate that hardware is not perfect, and they will invariably
> fail once in a while. I had some Polly bots failing randomly and it
> took me only a couple of seconds to infer so. I'm not asking to remove
> them, even those that fail more than pass throughout the year. I
> assume that, if they're still there, they provide *some* value to
> someone.
>
> cheers,
> --renato
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
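Mehdi's two measures - the spurious-failure rate and the count of legitimate failures no other bot caught - are easy to compute once build results have been triaged. The following Python sketch is purely illustrative (the BuildRecord shape and all names are invented for the example; this is not a real buildbot API):

# Illustrative only: a toy computation of the two bot-quality measures
# discussed above. BuildRecord and all names are invented for this sketch.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class BuildRecord:
    bot: str        # builder name, e.g. "clang-native-arm-cortex-a9"
    revision: str   # revision that was tested
    failed: bool    # did the build or test run fail?
    legit: bool     # was the failure triaged as a real regression?

def noise_ratio(history, bot):
    """Fraction of a bot's failures that were spurious (not legit)."""
    fails = [r for r in history if r.bot == bot and r.failed]
    if not fails:
        return 0.0
    return sum(1 for r in fails if not r.legit) / len(fails)

def unique_legit_failures(history):
    """Per bot: legit failures at revisions where no other bot also failed."""
    caught_by = defaultdict(set)   # revision -> bots that legitimately failed
    for r in history:
        if r.failed and r.legit:
            caught_by[r.revision].add(r.bot)
    counts = defaultdict(int)
    for bots in caught_by.values():
        if len(bots) == 1:         # only one bot caught this regression
            counts[bots.pop()] += 1
    return dict(counts)

On such numbers, the threshold question becomes concrete: a bot whose noise_ratio is 0.9 and whose unique_legit_failures count is zero is pure cost, while the same noise ratio with a handful of unique catches is a judgment call.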
David Blaikie via llvm-dev
2015-Aug-26 16:21 UTC
[llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9
On Wed, Aug 26, 2015 at 9:01 AM, Mehdi Amini via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> [snip]
>
> That's not how I understand his point. In my opinion, he is asking to
> increase the quality of testing. You just happen to disagree with his
> solution :)
>
> The situation does not seem that black and white to me. In the end, it
> comes down to a threshold: if a bot is crashing 90% of the time, does
> it really contribute to the quality of testing, or is it just adding
> noise? The same question applies at 20%, 40%, 60%, ... We may all have
> a different answer, but I'm pretty sure we could reach an agreement on
> what seems appropriate.
>
> Another way of assessing a bot's impact on quality in general is: "how
> many legitimate failures were found by this bot in the last x years
> that weren't covered by another bot?"

Even that doesn't really capture it - if the bot has enough false positives, or spends long periods being red, even those legit failures will be lost in the noise & the cost to the whole project may outweigh the value of those bugs being found: not only does that bot get ignored, but confidence in the bots in general is reduced (and it's already pretty low because of this kind of situation).

If a bot is of low enough quality that most engineers ignore it due to false positives or long periods of brokenness, then it makes sense to me to remove it from the main buildbot view and stop it sending email. The owner can monitor the bot and, once they triage a failure, manually reach out to those who might be to blame.

(Oh, and add long cycle times to the list of issues - people do have a tendency to ignore bots that come back with giant blame lists & no obvious determination as to whose patch caused the problem, if any.)

- David

> Because sometimes you may just be stress-testing a rack in a hardware
> lab, without providing any increased coverage for the software.
>
> Cheers,
>
> —
> Mehdi
>
> [snip]
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Renato Golin via llvm-dev
2015-Aug-26 16:24 UTC
[llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9
On 26 August 2015 at 17:01, Mehdi Amini <mehdi.amini at apple.com> wrote:
> The situation does not seem that black and white to me. In the end, it
> comes down to a threshold: if a bot is crashing 90% of the time, does
> it really contribute to the quality of testing, or is it just adding
> noise?

That question doesn't stand alone; per se, it's meaningless. Your next question, however, is the key.

> Another way of assessing a bot's impact on quality in general is: "how
> many legitimate failures were found by this bot in the last x years
> that weren't covered by another bot?"

By this criterion, which was my point, those bots haven't added much since I added some faster ones. So, if we want to shut them down, let's do so because they don't add value, not because they are unstable. However, that is the *only* bot running on an A9.

As an example, this year I spent two whole weeks during the release bisecting and fixing an issue, because I had disabled, for two months, one bot that I thought was already covered by another. That headache was real. I wasted two whole weeks, maybe more, of my time. I wasted the time of other people waiting to do the release validation. I delayed the release and all that it entails. All because I thought that bot was noisy: its ratio was about 20 passes to 1 failure.

The A9 bots are noisier than that, so on my monitor [1] I currently ignore their results. I still keep them there to see what's going on, and when my bots fail, I look at them too, to see if the problem is the same. Sometimes they do provide useful insight into the other bots' breakages. So, for me, disabling the A9 bots would be a loss.

But as I said before, that's up to Galina, as she's the bot owner. If she's OK with finally putting them to rest, I'll respect the community's decision and remove them from my monitor. But we can't turn this into a witch hunt. It's not about thresholds; it's about cost and value, which may be different for you than it is for me. We have to consider the whole community, not just our own opinions.

For every broken bot that someone wants to get rid of, I propose consulting the bot owner first, and then holding a vote on llvm-dev@ / cfe-dev@ if the owner doesn't agree. After all, you can always have an internal buildmaster (like I have) for unstable bots.

cheers,
--renato

[1] http://llvm.tcwglab.linaro.org/monitor/
Renato Golin via llvm-dev
2015-Aug-26 16:27 UTC
[llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9
On 26 August 2015 at 17:21, David Blaikie <dblaikie at gmail.com> wrote:
> (Oh, and add long cycle times to the list of issues - people do have a
> tendency to ignore bots that come back with giant blame lists & no
> obvious determination as to whose patch caused the problem, if any.)

Yes, but remember, not all hardware is as fast as a multi-core Xeon server. Build times can't always be controlled.

But I agree with you on all counts. The bot owner should bear the responsibility for his/her own unstable bots. If a bot brings less value than the cost it adds to the community, it should be moved to a separate buildmaster that doesn't email people around, but can still be accessed, so the owner can point breakages out to devs.

cheers,
--renato
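The "separate buildmaster that doesn't email people" Renato describes also has a lighter-weight variant within a single master: keep the noisy builder on the waterfall but exclude it from mail notification. A rough sketch against the buildbot 0.8-era master.cfg API of the time (all builder names and addresses are placeholders, not the actual lab config):

# master.cfg fragment (buildbot 0.8-era API): the quiet builder keeps
# building and stays visible on the waterfall, but never mails anyone.
# All builder names and addresses below are placeholders for this sketch.
from buildbot.status.mail import MailNotifier

c = BuildmasterConfig = {}          # standard master.cfg preamble
c['status'] = []

ALL_BUILDERS = ['clang-x86_64-linux', 'clang-native-arm-cortex-a9']
QUIET = {'clang-native-arm-cortex-a9'}   # owner monitors these by hand

c['status'].append(MailNotifier(
    fromaddr='buildbot@example.org',     # placeholder sender address
    mode='problem',                      # mail only on a pass -> fail change
    builders=[b for b in ALL_BUILDERS if b not in QUIET],
))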