thr3ads.net - llvm dev - [LLVMdev] Build bot fatigue [Dec 2013]

Alp Toker

2013-Dec-29 02:03 UTC

[LLVMdev] Build bot fatigue

My inbox has been filled with llvm.buildmaster at lab.llvm.org build 
failure notifications lately.

The two problems appear to be:

  1) Getting notifications for breakage that was introduced by an 
unrelated commit, often in a module I don't work on. Usually the 
original committer is working on or has already landed the necessary fix.

  2) A cascade of dozens of notifications from various build servers 
that continue to flood in over the course of 24 hours after the issue 
was fixed.

These two conflate and produce a high signal-to-noise ratio, and in 
practice you have to filter them out which means you no longer get a 
ping on your phone when you need it.

Presumably a full fix is a non-trivial CI engineering problem, but are
there simple measures to get the situation back under control?

Doesn't have to be perfect as long as it reduces the dozens of mails 
every day to something more manageable. Ideas:

  1) Only send direct mail when the recipient is the single name in the 
blame list.

  2) Set an In-Reply-To header in order to thread all failure 
notifications related to a specific SVN revision. Most email clients 
will let you silence the thread once you've confirmed the issue has been 
resolved.

  3) Or even simpler, don't send failure mail from any builders outside
the "fast" set? Otherwise the important failures blocking everyone's
work get drowned out in the noise.

Sorry to send a feature request without patches but I'm not familiar 
with the CI infrastructure and this looks like a fairly recent 
development (or is it just me?)

Alp.


-- 
http://www.nuanti.com
the browser experts
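
A minimal sketch of idea 2 above, assuming the notifications are built by a Python script. The domain, the Message-ID scheme, and the function names are hypothetical and not taken from the LLVM buildmaster configuration. Every failure mail for one revision points its In-Reply-To and References headers at the same synthetic root ID, so a client that threads on those headers can group the mails and mute them together.

```python
from email.message import EmailMessage

# Hypothetical domain, used only to build Message-IDs for this sketch.
THREAD_DOMAIN = "buildbot.example.org"


def thread_root_id(revision: int) -> str:
    """Return a deterministic ID shared by all failure mails for one revision."""
    return f"<r{revision}.failures@{THREAD_DOMAIN}>"


def failure_mail(revision: int, builder: str, recipient: str) -> EmailMessage:
    """Build a failure notification that threads under its revision."""
    msg = EmailMessage()
    msg["From"] = "buildmaster@" + THREAD_DOMAIN
    msg["To"] = recipient
    # Same subject for every builder, for clients that thread by subject.
    msg["Subject"] = f"Build failure at r{revision}"
    # Each mail still needs its own unique Message-ID.
    msg["Message-ID"] = f"<r{revision}.{builder}@{THREAD_DOMAIN}>"
    root = thread_root_id(revision)
    msg["In-Reply-To"] = root
    msg["References"] = root
    msg.set_content(f"Builder {builder} failed at r{revision}.")
    return msg
```

No mail with the root ID is ever sent in this sketch; whether a given client groups replies to a missing parent is something to check against the clients people actually use.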

Sean Silva

2013-Dec-29 04:25 UTC

[LLVMdev] Build bot fatigue

On Sat, Dec 28, 2013 at 7:03 PM, Alp Toker <alp at nuanti.com> wrote:
> My inbox has been filled with llvm.buildmaster at lab.llvm.org build failure
> notifications lately.
>
> The two problems appear to be:
>
>  1) Getting notifications for breakage that was introduced by an unrelated
> commit, often in a module I don't work on. Usually the original committer
> is working on or has already landed the necessary fix.
>
>  2) A cascade of dozens of notifications from various build servers that
> continue to flood in over the course of 24 hours after the issue was fixed.
>
> These two conflate and produce a high signal-to-noise ratio, and in
> practice you have to filter them out which means you no longer get a ping
> on your phone when you need it.
>
FWIW, this has generally been my experience.

Nit: I think you mean "low" signal-to-noise ratio.

>
> Presumably a full fix is a non-trivial CI engineering problem, but are
> there simple measures get the situation back under control?
>
> Doesn't have to be perfect as long as it reduces the dozens of mails every
> day to something more manageable. Ideas:
>
>  1) Only send direct mail when the recipient is the single name in the
> blame list.
>
>  2) Set an In-Reply-To header in order to thread all failure notifications
> related to a specific SVN revision. Most email clients will let you silence
> the thread once you've confirmed the issue has been resolved.
>
This seems like it might be simple, depending on where these emails are
being generated (in one of our scripts, or deep inside some CI application).

-- Sean Silva

>
>  3) Or even simpler, don't send failure mail from any builders outside the
> "fast" set? Otherwise the important failures blocking everyone's work get
> drowned out in the noise.
>
> Sorry to send a feature request without patches but I'm not familiar with
> the CI infrastructure and this looks like a fairly recent development (or
> is it just me?)
>
> Alp.
>
>
> --
> http://www.nuanti.com
> the browser experts
>

Renato Golin

2013-Dec-29 10:31 UTC

[LLVMdev] Build bot fatigue

On 29 December 2013 04:25, Sean Silva <chisophugis at gmail.com> wrote:
>>  1) Only send direct mail when the recipient is the single name in the
>> blame list.

That would filter out important breakages. I think your option 3 below is
probably the most effective and simplest to implement right now.

An alternative to that would be to filter for the failure's cause (test
failed, file miscompiled, etc) and see if any commit touches that file, and
only send the email to the users that touched any of them. We have all the
info in the page, shouldn't be too hard to grep stuff around...
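
A rough sketch of that filter, assuming the blame list and the failing files have already been scraped into plain Python data. The `Commit` type and `recipients_for_failure` are hypothetical names, and the fallback to the whole blame list when nothing matches is an added assumption so that a breakage is never silently dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """One blame-list entry: who committed and which files they touched."""
    author: str
    files: frozenset


def recipients_for_failure(blamelist, failing_files):
    """Return the authors whose commits touched any failing file.

    Falls back to every author on the blame list when no commit matches
    (an assumption of this sketch, not something the thread specifies).
    """
    failing = set(failing_files)
    touched = {c.author for c in blamelist if c.files & failing}
    return touched or {c.author for c in blamelist}
```

For example, with one commit touching `lib/IR/Value.cpp` and another touching only documentation, a failure reported against `lib/IR/Value.cpp` would mail just the first author.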

>> 3) Or even simpler, don't send failure mail from any builders outside
>> the "fast" set? Otherwise the important failures blocking everyone's work get
>> drowned out in the noise.

An option in the bot configuration to send an email or not would do. I
wouldn't separate "fast" from "slow", but "unique" from the rest.

For instance, we have two "fast" bots, one on A15 and one on A9. Of course,
the A15 is faster, and the A9 repeats a few minutes later. I'd want to
receive it only from one of them.

I also have a test-suite bot that doesn't "check-all", so if that fails,
it's either a compilation failure, or it's a test-suite failure, and I
really want to be warned when it breaks.

A further step would be to manage emails by bot type. Fast-unique bots
report everything (compilation, svn, make, tests), while other unique bots
only report their own stuff, so my test-suite bot would not report
compilation failures. The problem with that would be if the compilation
*only* happens on the test-suite bot, and then we'd need an extra layer to
diff between error reports, and that would be massive. I don't expect this
to happen ever.
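
The per-bot-type policy described above could look roughly like this. It is a sketch only: `BotKind`, `should_mail`, and the step names are hypothetical, and "duplicate" is used here for any bot that repeats what a unique bot already covers.

```python
from enum import Enum


class BotKind(Enum):
    FAST_UNIQUE = "fast-unique"  # reports every kind of failure
    UNIQUE = "unique"            # reports only failures in its own steps
    DUPLICATE = "duplicate"      # repeats another bot; never mails


def should_mail(kind, failed_step, own_steps=frozenset()):
    """Decide whether a failure in `failed_step` should produce an email."""
    if kind is BotKind.FAST_UNIQUE:
        return True
    if kind is BotKind.UNIQUE:
        return failed_step in own_steps
    return False
```

Under this policy a test-suite bot declared as `UNIQUE` with `own_steps={"test-suite"}` would stay quiet on a compilation failure and mail only when the test-suite step itself breaks.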

Finally, I think your comment is also valid for IRC messages, they drive me
crazy... Can we have a separate IRC channel for those messages? Like
llvm-buildbots? Or just stick to email?

cheers,
--renato

dblaikie at gmail.com

2013-Dec-29 16:45 UTC

[LLVMdev] Build bot fatigue

On Saturday, December 28, 2013 6:05:38 PM, Alp Toker <alp at nuanti.com> wrote:

> My inbox has been filled with llvm.buildmaster at lab.llvm.org build
> failure notifications lately.
>
> The two problems appear to be:
>
>   1) Getting notifications for breakage that was introduced by an
> unrelated commit, often in a module I don't work on. Usually the
> original committer is working on or has already landed the necessary fix.
>
>   2) A cascade of dozens of notifications from various build servers
> that continue to flood in over the course of 24 hours after the issue
> was fixed.
>
> These two conflate and produce a high signal-to-noise ratio, and in
> practice you have to filter them out which means you no longer get a
> ping on your phone when you need it.
>
> Presumably a full fix is a non-trivial CI engineering problem, but are
> there simple measures get the situation back under control?
>
> Doesn't have to be perfect as long as it reduces the dozens of mails
> every day to something more manageable. Ideas:
>
>   1) Only send direct mail when the recipient is the single name in the
> blame list.
>
>   2) Set an In-Reply-To header in order to thread all failure
> notifications related to a specific SVN revision. Most email clients
> will let you silence the thread once you've confirmed the issue has been
> resolved.
>
>   3) Or even simpler, don't send failure mail from any builders outside
> the "fast" set? Otherwise the important failures blocking everyone's
> work get drowned out in the noise.
>
> Sorry to send a feature request without patches but I'm not familiar
> with the CI infrastructure and this looks like a fairly recent
> development (or is it just me?)

This isn't new. Just how the bots have always worked.

The biggest thing would be to move bots over to the phased builder
infrastructure pioneered by Apple (they use it internally and I believe
most of it has been upstreamed by Daniel Dunbar and David Tweed) that sets
up dependencies (e.g. testing debug info depends on the compiler passing the
basic check first) and reuse/caching of build products (e.g. use the output
of the basic checks to test the debug info, rather than rebuilding the
compiler on every builder).

This would reduce noise and increase build slave efficiency and granularity
to produce smaller blame lists.
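
To illustrate the dependency part of that idea, here is a small sketch, not the actual phased builder code. The phase names and `run_phases` are hypothetical, and build-product reuse is not modeled. A phase runs only if all of its prerequisites passed; otherwise it is marked as skipped, so one broken basic check yields one failure instead of one per downstream builder.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_phases(deps, run):
    """Run phases in dependency order.

    deps maps each phase name to the set of phases it depends on.
    run(phase) returns True on success. A phase whose prerequisites did
    not all pass is recorded as "skipped" and never executed.
    """
    results = {}
    for phase in TopologicalSorter(deps).static_order():
        prerequisites = deps.get(phase, ())
        if all(results.get(p) == "passed" for p in prerequisites):
            results[phase] = "passed" if run(phase) else "failed"
        else:
            results[phase] = "skipped"
    return results
```

Only phases recorded as "failed" would need to notify anyone; skipped phases stay silent until their prerequisite is fixed.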


Eric Christopher

2013-Dec-29 19:30 UTC

[LLVMdev] Build bot fatigue

Agreed. This is perhaps the best way to deal with the problem and still
have committers catch important failures.

On Sun Dec 29 2013 at 8:47:58 AM, dblaikie at gmail.com <dblaikie at gmail.com>
wrote:
>
> On Saturday, December 28, 2013 6:05:38 PM, Alp Toker <alp at nuanti.com>
> wrote:
>
> My inbox has been filled with llvm.buildmaster at lab.llvm.org build
> failure notifications lately.
>
> The two problems appear to be:
>
>   1) Getting notifications for breakage that was introduced by an
> unrelated commit, often in a module I don't work on. Usually the
> original committer is working on or has already landed the necessary fix.
>
>   2) A cascade of dozens of notifications from various build servers
> that continue to flood in over the course of 24 hours after the issue
> was fixed.
>
> These two conflate and produce a high signal-to-noise ratio, and in
> practice you have to filter them out which means you no longer get a
> ping on your phone when you need it.
>
> Presumably a full fix is a non-trivial CI engineering problem, but are
> there simple measures get the situation back under control?
>
> Doesn't have to be perfect as long as it reduces the dozens of mails
> every day to something more manageable. Ideas:
>
>   1) Only send direct mail when the recipient is the single name in the
> blame list.
>
>   2) Set an In-Reply-To header in order to thread all failure
> notifications related to a specific SVN revision. Most email clients
> will let you silence the thread once you've confirmed the issue has been
> resolved.
>
>   3) Or even simpler, don't send failure mail from any builders outside
> the "fast" set? Otherwise the important failures blocking everyone's
> work get drowned out in the noise.
>
> Sorry to send a feature request without patches but I'm not familiar
> with the CI infrastructure and this looks like a fairly recent
> development (or is it just me?)
>
>
>
> This isn't new. Just how the bots have always worked.
>
> The biggest thing would be to move bots over to the phased builder
> infrastructure pioneered by Apple (they use it internally and I believe
> most of it has been upstreamed by Daniel Dunbar and David Tweed) that sets
> up dependencies (e.g. testing debug info depends on the compiler passing the
> basic check first) and reuse/caching of build products (e.g. use the output
> of the basic checks to test the debug info, rather than rebuilding the
> compiler on every builder).
>
> This would reduce noise and increase build slave efficiency and
> granularity to produce smaller blame lists.

David Tweed

2013-Dec-29 23:59 UTC

[LLVMdev] Build bot fatigue

My personal views (by which I always mean that I'm speaking as one of
the compiler engineers employed by ARM but not officially on behalf of ARM):

On Sun, Dec 29, 2013 at 4:45 PM, dblaikie at gmail.com <dblaikie at gmail.com> wrote:
>
> On Saturday, December 28, 2013 6:05:38 PM, Alp Toker <alp at nuanti.com> wrote:
>
> My inbox has been filled with llvm.buildmaster at lab.llvm.org build
> failure notifications lately.
>
> The two problems appear to be:
>
>   1) Getting notifications for breakage that was introduced by an
> unrelated commit, often in a module I don't work on. Usually the
> original committer is working on or has already landed the necessary fix.
>
>   2) A cascade of dozens of notifications from various build servers
> that continue to flood in over the course of 24 hours after the issue
> was fixed.
>
> These two conflate and produce a high signal-to-noise ratio, and in
> practice you have to filter them out which means you no longer get a
> ping on your phone when you need it.
>
> Presumably a full fix is a non-trivial CI engineering problem, but are
> there simple measures get the situation back under control?
>
> Doesn't have to be perfect as long as it reduces the dozens of mails
> every day to something more manageable. Ideas:
>
>   1) Only send direct mail when the recipient is the single name in the
> blame list.

I think this would mean less-high-performance builders would never
signal their failures, which as explained below would be unfortunate.

>   2) Set an In-Reply-To header in order to thread all failure
> notifications related to a specific SVN revision. Most email clients
> will let you silence the thread once you've confirmed the issue has been
> resolved.

This sounds like a reasonable solution.

>   3) Or even simpler, don't send failure mail from any builders outside
> the "fast" set? Otherwise the important failures blocking everyone's
> work get drowned out in the noise.

I think it would certainly be helpful to separate out the builders into
a set which are sufficiently maintained and reliable to get an email
from when something breaks their build/tests, and a more "advisory"
set of builders (e.g. there are some builders that appear to have
borderline stability, often throwing up errors unrelated to the issues
under test). I think declaring that only fast builders get to send emails
would have unfortunate effects in terms of testing native builds on
low-power architectures, which have a slower turn-around but are otherwise
quite reliable. (ARM, my employer, spent quite a bit of effort fixing the
ARM issues that had crept in, work which for various reasons has
transitioned to Linaro now.) Modified in that sense, this also seems a
reasonable solution.

> This isn't new. Just how the bots have always worked.
>
> The biggest thing would be to move bots over to the phased builder
> infrastructure pioneered by Apple (they use it internally and I believe most
> of it has been upstreamed by Daniel Dunbar and David Tweed) that sets up
> dependencies (e.g. testing debug info depends on the compiler passing the
> basic check first) and reuse/caching of build products (e.g. use the output
> of the basic checks to test the debug info, rather than rebuilding the
> compiler on every builder).

Just to note that I suspect it's someone else you're thinking of
regarding the phased builder. (Although I did quite a bit of work on the
ARM buildbots late last year, I haven't been involved in the phased
builder work.)

-- 
cheers, dave tweed__________________________
high-performance computing and machine vision expert: david.tweed at gmail.com
"while having code so boring anyone can maintain it, use Python." --
attempted insult seen on slashdot
