thr3ads.net - llvm dev - [llvm-dev] Buildbot Noise [Oct 2015]

If this information is useful, please help other people find it:
Share via:

Chris Matthews via llvm-dev

2015-Oct-07 22:20 UTC

[llvm-dev] Buildbot Noise

One strategy I use for our flaky bots is to have them email me only.  If the
failure is real, then I forward the email to who ever I find on the blame list. 
For a flaky build, this is least you can do. For our flaky builds I know how and
why they are flaky, some person that gets email does not.  This is also a great
motivator to help me know what is wrong, and how to fix it.  By default, all new
builds I create I do this, until I decide the SNR is appropriate for the
community.  Yes I have to triage builds sometimes, but I have an interest in
them working, and people always acting on green dragon emails, so I think it is
worth it.  Beyond that, we have regexes which identify common failures, and
highlight them in the build page and log.  For instance, a build that fails with
a ninja error, will say so, same with a svn failure or a Jenkins exception.

We also have a few polices on email: only email on first failure, don’t email on
exception and abort, and don’t email long blame lists (more than 10 people). 
These require some manual intervention sometimes.  But no point in emailing the
wrong people, or too many people.  We also track the failure span for all of our
builds, if any fail for more than a day, I get an email to go shake things up. 
We also keep a cluster wide health metric, which is the total number of hours of
currently failing builds, I use this as an overall indicator of how the builds
are doing.

In all our CI cluster we use phased builds.  Phase 1 is a fast incremental
builder and a no bootstrap release asserts build.  If those build, we trigger a
release with LTO build, if that works, we trigger all the rest of our compilers
and tests.  It is a waste to queue long builds on revisions that have not been
vetted in some way.  In some places the tree of builds is 4 deep, and  the turn
around time can be upwards of 12 hours after commit, BUT failures in those bots
are rare, because so much other testing has gone on beforehand.

Mechanically, staging works by uploading the build artifacts to a central web
server, then passing a URL to the next set of builds so they can download the
compiler.  This also speeds up builds that would otherwise have to build a
compiler to run a test.  For the lab, I think that won’t work as well because of
the diversity of platforms and configurations, but a known good revision could
be passed around. Some of the fast reliable builds can run first, and publish
all the builds that work.

I do think flaky bots should only email their owners.  I also think we should
nominate some reliable fast builds to produce vetted revision, and trigger most
other builds from those.
> On Oct 7, 2015, at 2:44 PM, Eric Christopher <echristo at gmail.com>
wrote:
> 
> 
> 
> On Wed, Oct 7, 2015 at 2:24 PM Renato Golin <renato.golin at linaro.org
<mailto:renato.golin at linaro.org>> wrote:
> On 7 October 2015 at 22:14, Eric Christopher <echristo at gmail.com
<mailto:echristo at gmail.com>> wrote:
> > As a foreword: I haven't read a lot of the thread here and
it's just a
> > single developer talking here :)
> 
> I recommend you to, then. Most of your arguments are similar to
> David's and they don't take into account the difficulty in
maintaining
> non-x86 buildbots.
> 
> 
> OK. I've now read the rest of the thread and don't find any of the
arguments compelling for keeping flaky bots around for notifications. I also
don't think that the x86-ness of it matters here. The powerpc64 and hexagon
bots are very reliable.
>  
> What you're both saying is basically the same as: We want all the cars
> we care about in our garage, but only as long as they can race in F1.
> However, you care about the whole range, from beetles to McLarens, but
> are only willing to cope with the speed and reliability of the latter.
> You'll end up with only McLarens in your garage. It just doesn't
make
> sense.
> 
> 
> I think this is a poor analogy. You're also ignoring the solution I
gave you in my previous mail for slow bots.
>  
> Also, very briefly, I want the same as both of you: reliability. But I
> alone cannot guarantee that. And even with help, I can only get there
> in months, not days. To get there, we need to *slowly* move towards
> it, not drastically throw away everything that is not a McLaren and
> only put them back when they're as fast as a McLaren. It just won't
> happen, and the risk of a fork becomes non-trivial.
> 
> I think this is a completely ridiculous statement. I mean, feel free if
that's the direction you think you need to go, but I'm not going to
continue down that thread with you.
> 
> Basically what I'm saying is that if you want a bot to be public and
people to pay attention to it then you need to have some basic stability
guarantees. If you can't give some basic stability guarantees then the bot
is only harming the entire testing infrastructure. That said, having your own
internal bots is entirely useful, it just means that it's up to you to
notice failures and provide some sort of test case to the community. We could
even have a "beta" bot site if something is reliable enough for that,
but not reliable enough for general consumption. I believe you mentioned having
a separate bot master before, we have other bot masters as well - see the green
dragon stuff with jenkins.
> 
> -eric
>  
>  ps. Have actually added Chris Matthews to talk about the buildbot staging
work. Or even moving the rest of the bots to something staged, or anything. :)
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20151007/773d9359/attachment.html>

Renato Golin via llvm-dev

2015-Oct-07 22:28 UTC

head link

[llvm-dev] Buildbot Noise

On 7 October 2015 at 23:20, Chris Matthews <chris.matthews at apple.com>
wrote:> For instance, a build that fails with a ninja error, will say so, same with
a
> svn failure or a Jenkins exception.
Will these get mailed to developers? Or admins?

> We also have a few polices on email: only email on first failure, don’t
> email on exception and abort, and don’t email long blame lists (more than
10
> people).
That's the same. The only problem we found is that exception is
treated as "success" because we don't want to email when the
master is
reloaded. But the sequence red->exception->red emails, since
exception->red is treated as good->bad.

> In all our CI cluster we use phased builds.
I'd love to have that! :D

> I do think flaky bots should only email their owners.
I agree. The problem is defining flaky. A lot of flaky behaviour can
be mapped back to the compiler (like Clang abusing of the C++ ABI, or
code assuming 64-bit types in odd ways).

>  I also think we
> should nominate some reliable fast builds to produce vetted revision, and
> trigger most other builds from those.
This would be the perfect world.

When Apple moved to GreenBots, I was expecting that we'd be moving too
not long after. I was also expecting the LLVM Foundation to be driving
this change, and I'd have dived head first to have what you have.

But it makes no sense for me to do that on my own, locally. Nor I have
bandwidth or resources to do that for everyone else.

I don't really care if it's buildbot, Jenkins or orc slaves, as long
as I can spend my time doing something else, I'm happy.

cheers,
--renato

Chris Matthews via llvm-dev

2015-Oct-07 22:58 UTC

head link

[llvm-dev] Buildbot Noise

> On Oct 7, 2015, at 3:28 PM, Renato Golin <renato.golin at linaro.org>
wrote:
> 
> On 7 October 2015 at 23:20, Chris Matthews <chris.matthews at
apple.com> wrote:
>> For instance, a build that fails with a ninja error, will say so, same
with a
>> svn failure or a Jenkins exception.
> 
> Will these get mailed to developers? Or admins?
Unfortunately, Jenkins does not let me determine who to email based on the
failure cause. That would be wonderful! The detected problem is right at the top
of the email though, so at least you don’t have to click the link.  For
infrastructure problems sometimes we add helpful messages, for instance we had
an issue with about 1 in 20 builds failing with a “killed -9” message, when that
happened we could just say sorry an print a link to the bug.  Mostly is is just
a nice fast link right to the test case failure or build failure.
> 
> 
>> We also have a few polices on email: only email on first failure, don’t
>> email on exception and abort, and don’t email long blame lists (more
than 10
>> people).
> 
> That's the same. The only problem we found is that exception is
> treated as "success" because we don't want to email when the
master is
> reloaded. But the sequence red->exception->red emails, since
> exception->red is treated as good->bad.
> 
Yep, that has to be changed.  It is not a useful state change
> 
>> In all our CI cluster we use phased builds.
> 
> I'd love to have that! :D
> 
> 
>> I do think flaky bots should only email their owners.
> 
> I agree. The problem is defining flaky. A lot of flaky behaviour can
> be mapped back to the compiler (like Clang abusing of the C++ ABI, or
> code assuming 64-bit types in odd ways).
I define flaky as a build that fails for a reason unrelated to the code on the
blame list.
> 
> 
>> I also think we
>> should nominate some reliable fast builds to produce vetted revision,
and
>> trigger most other builds from those.
> 
> This would be the perfect world.
We should move towards getting this setup then.  There is some code that needs
to be setup in buildbot, as well as an agreement on what gets attached to what.
> 
> When Apple moved to GreenBots, I was expecting that we'd be moving too
> not long after. I was also expecting the LLVM Foundation to be driving
> this change, and I'd have dived head first to have what you have.
> 
> But it makes no sense for me to do that on my own, locally. Nor I have
> bandwidth or resources to do that for everyone else.
> 
> I don't really care if it's buildbot, Jenkins or orc slaves, as
long
> as I can spend my time doing something else, I'm happy.
> 
> cheers,
> --renato

Reasonably Related Threads

Search for more apparently analagous threads

llvm dev - Oct 2015 - Buildbot Noise

[llvm-dev] Buildbot Noise

[llvm-dev] Buildbot Noise

[llvm-dev] Buildbot Noise

Reasonably Related Threads