My inbox has been filled with llvm.buildmaster at lab.llvm.org build failure notifications lately.

The two problems appear to be:

1) Getting notifications for breakage that was introduced by an unrelated commit, often in a module I don't work on. Usually the original committer is working on or has already landed the necessary fix.

2) A cascade of dozens of notifications from various build servers that continue to flood in over the course of 24 hours after the issue was fixed.

These two conflate and produce a high signal-to-noise ratio, and in practice you have to filter them out, which means you no longer get a ping on your phone when you need it.

Presumably a full fix is a non-trivial CI engineering problem, but are there simple measures to get the situation back under control?

Doesn't have to be perfect as long as it reduces the dozens of mails every day to something more manageable. Ideas:

1) Only send direct mail when the recipient is the single name in the blame list.

2) Set an In-Reply-To header in order to thread all failure notifications related to a specific SVN revision. Most email clients will let you silence the thread once you've confirmed the issue has been resolved.

3) Or even simpler, don't send failure mail from any builders outside the "fast" set? Otherwise the important failures blocking everyone's work get drowned out in the noise.

Sorry to send a feature request without patches but I'm not familiar with the CI infrastructure and this looks like a fairly recent development (or is it just me?)

Alp.

--
http://www.nuanti.com
the browser experts
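For illustration only, idea 2 could look roughly like the sketch below. It is not taken from the actual buildmaster scripts: the function names, the synthetic thread-root Message-ID format, and the revision number in the usage note are all assumptions. It relies on mail clients grouping messages that share an In-Reply-To/References value even when no message with that ID was ever sent, which is common but worth verifying against the clients people actually use.

```python
from email.message import EmailMessage


def thread_root_id(revision: int, domain: str = "lab.llvm.org") -> str:
    # Hypothetical, deterministic Message-ID shared by every failure mail
    # for one SVN revision. The format is an assumption, not an existing scheme.
    return f"<build-failure-r{revision}@{domain}>"


def make_failure_mail(revision: int, builder: str, recipient: str,
                      sender: str = "llvm.buildmaster@lab.llvm.org") -> EmailMessage:
    # Build a notification whose threading headers point at the per-revision root.
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Build failure at r{revision} on {builder}"
    root = thread_root_id(revision)
    msg["In-Reply-To"] = root  # direct parent of this message
    msg["References"] = root   # full ancestry; here just the root
    msg.set_content(f"Builder {builder} failed at r{revision}.")
    return msg
```

Two calls with the same revision (say a made-up r198000) and different builder names produce mails that carry the same In-Reply-To value, so a client that threads on that header can mute them together.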
On Sat, Dec 28, 2013 at 7:03 PM, Alp Toker <alp at nuanti.com> wrote:

> My inbox has been filled with llvm.buildmaster at lab.llvm.org build failure
> notifications lately.
>
> The two problems appear to be:
>
> 1) Getting notifications for breakage that was introduced by an unrelated
> commit, often in a module I don't work on. Usually the original committer
> is working on or has already landed the necessary fix.
>
> 2) A cascade of dozens of notifications from various build servers that
> continue to flood in over the course of 24 hours after the issue was fixed.
>
> These two conflate and produce a high signal-to-noise ratio, and in
> practice you have to filter them out which means you no longer get a ping
> on your phone when you need it.

FWIW, this has generally been my experience. Nit: I think you mean "low" signal-to-noise ratio.

> Presumably a full fix is a non-trivial CI engineering problem, but are
> there simple measures to get the situation back under control?
>
> Doesn't have to be perfect as long as it reduces the dozens of mails every
> day to something more manageable. Ideas:
>
> 1) Only send direct mail when the recipient is the single name in the
> blame list.
>
> 2) Set an In-Reply-To header in order to thread all failure notifications
> related to a specific SVN revision. Most email clients will let you silence
> the thread once you've confirmed the issue has been resolved.

This seems like it might be simple, depending on where these emails are being generated (in one of our scripts, or deep inside some CI application).

-- Sean Silva

> 3) Or even simpler, don't send failure mail from any builders outside the
> "fast" set? Otherwise the important failures blocking everyone's work get
> drowned out in the noise.
>
> Sorry to send a feature request without patches but I'm not familiar with
> the CI infrastructure and this looks like a fairly recent development (or
> is it just me?)
>
> Alp.
>
> --
> http://www.nuanti.com
> the browser experts
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
On 29 December 2013 04:25, Sean Silva <chisophugis at gmail.com> wrote:

>> 1) Only send direct mail when the recipient is the single name in the
>> blame list.

That would filter out important breakages. I think your option 3 below is probably the most effective and simplest to implement right now.

An alternative to that would be to filter on the failure's cause (test failed, file miscompiled, etc.), see whether any commit in the blame list touches the files involved, and only send the email to the users who touched any of them. We have all the info in the page, so it shouldn't be too hard to grep stuff around...

>> 3) Or even simpler, don't send failure mail from any builders outside the
>> "fast" set? Otherwise the important failures blocking everyone's work get
>> drowned out in the noise.

An option in the bot configuration to send an email or not would do. I wouldn't separate "fast" from "slow", but "unique" from the rest.

For instance, we have two "fast" bots, one on A15 and one on A9. Of course, the A15 is faster, and the A9 repeats a few minutes later. I'd want to receive it only from one of them. I also have a test-suite bot that doesn't "check-all", so if that fails, it's either a compilation failure or a test-suite failure, and I really want to be warned when it breaks.

A further step would be to manage emails by bot type. Fast-unique bots report everything (compilation, svn, make, tests), while other unique bots only report their own stuff, so my test-suite bot would not report compilation failures.

The problem with that would be if the compilation *only* happens on the test-suite bot, and then we'd need an extra layer to diff between error reports, and that would be massive. I don't expect this to happen ever.

Finally, I think your comment is also valid for IRC messages, they drive me crazy... Can we have a separate IRC channel for those messages? Like llvm-buildbots? Or just stick to email?
cheers,
--renato
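A minimal sketch of the touched-files filter suggested above, under stated assumptions: the blame list is available as (author, touched paths) pairs and the failing files have already been extracted from the build log. Neither the function name nor the fallback behaviour comes from the existing infrastructure. In particular, falling back to the whole blame list when no commit touches a failing file is a design choice made here so that a breakage never goes unreported.

```python
def recipients_for_failure(failing_files, blame_list):
    """Pick who to mail for a failed build.

    failing_files: iterable of paths implicated in the failure.
    blame_list: list of (author, iterable_of_touched_paths), one entry per commit.
    Returns authors whose commits touched a failing file, in blame-list order;
    if nobody did, returns every author in the blame list (assumed fallback).
    """
    failing = set(failing_files)
    # dict.fromkeys de-duplicates while preserving first-seen order.
    everyone = list(dict.fromkeys(author for author, _ in blame_list))
    touched_failing = list(dict.fromkeys(
        author for author, paths in blame_list if failing & set(paths)
    ))
    return touched_failing or everyone
```

The fallback keeps the current behaviour for failures that cannot be pinned on a file, such as a miscompile that only shows up in an unrelated test.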
On Saturday, December 28, 2013 6:05:38 PM, Alp Toker <alp at nuanti.com> wrote:

> Sorry to send a feature request without patches but I'm not familiar with
> the CI infrastructure and this looks like a fairly recent development (or
> is it just me?)

This isn't new. Just how the bots have always worked.
The biggest thing would be to move the bots over to the phased builder infrastructure pioneered by Apple (they use it internally and I believe most of it has been upstreamed by Daniel Dunbar and David Tweed). It sets up dependencies (e.g. testing debug info depends on the compiler passing the basic check first) and reuse/caching of build products (e.g. use the output of the basic checks to test the debug info, rather than rebuilding the compiler on every builder).

This would reduce noise and increase build slave efficiency and granularity to produce smaller blame lists.
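To make the dependency idea concrete, here is a toy sketch, not the actual phased builder code: phases declare their prerequisites, a phase runs only if all of its prerequisites passed, and only phases that actually failed would send mail. The phase names and both function names are made up for illustration. It uses `graphlib.TopologicalSorter`, which needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_phases(deps, run):
    """Run phases in dependency order.

    deps: dict mapping phase name -> set of prerequisite phase names.
    run:  callable taking a phase name, returning True on success.
    Returns dict phase -> "passed" | "failed" | "skipped".
    """
    results = {}
    # static_order() yields every phase after all of its prerequisites.
    for phase in TopologicalSorter(deps).static_order():
        prereqs = deps.get(phase, ())
        if all(results.get(p) == "passed" for p in prereqs):
            results[phase] = "passed" if run(phase) else "failed"
        else:
            # A prerequisite failed or was skipped: do not run, do not mail.
            results[phase] = "skipped"
    return results


def phases_to_mail(results):
    # Only genuine failures notify; skipped downstream phases stay quiet.
    return sorted(p for p, r in results.items() if r == "failed")
```

With `deps = {"basic-check": set(), "debug-info-tests": {"basic-check"}}` and a broken basic check, only `basic-check` ends up in the mail list; the debug info phase is skipped instead of producing a second failure report.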
Agreed. This is perhaps the best way to deal with the problem and still have committers catch important failures.

On Sun Dec 29 2013 at 8:47:58 AM, dblaikie at gmail.com <dblaikie at gmail.com> wrote:
> This isn't new. Just how the bots have always worked.
>
> The biggest thing would be to move the bots over to the phased builder
> infrastructure pioneered by Apple (they use it internally and I believe
> most of it has been upstreamed by Daniel Dunbar and David Tweed) that sets
> up dependencies (e.g. testing debug info depends on the compiler passing the
> basic check first) and reuse/caching of build products (e.g. use the output
> of the basic checks to test the debug info, rather than rebuilding the
> compiler on every builder).
>
> This would reduce noise and increase build slave efficiency and
> granularity to produce smaller blame lists.
My personal views (by which I always mean that I'm speaking as one of the compiler engineers employed by ARM but not officially on behalf of ARM):

On Sun, Dec 29, 2013 at 4:45 PM, dblaikie at gmail.com <dblaikie at gmail.com> wrote:

> On Saturday, December 28, 2013 6:05:38 PM, Alp Toker <alp at nuanti.com> wrote:
>
> My inbox has been filled with llvm.buildmaster at lab.llvm.org build
> failure notifications lately.
>
> The two problems appear to be:
>
> 1) Getting notifications for breakage that was introduced by an
> unrelated commit, often in a module I don't work on. Usually the
> original committer is working on or has already landed the necessary fix.
>
> 2) A cascade of dozens of notifications from various build servers
> that continue to flood in over the course of 24 hours after the issue
> was fixed.
>
> These two conflate and produce a high signal-to-noise ratio, and in
> practice you have to filter them out which means you no longer get a
> ping on your phone when you need it.
>
> Presumably a full fix is a non-trivial CI engineering problem, but are
> there simple measures to get the situation back under control?
>
> Doesn't have to be perfect as long as it reduces the dozens of mails
> every day to something more manageable. Ideas:
>
> 1) Only send direct mail when the recipient is the single name in the
> blame list.

I think this would mean less-high-performance builders would never signal their failures, which as explained below would be unfortunate.

> 2) Set an In-Reply-To header in order to thread all failure
> notifications related to a specific SVN revision. Most email clients
> will let you silence the thread once you've confirmed the issue has been
> resolved.

This sounds like a reasonable solution.

> 3) Or even simpler, don't send failure mail from any builders outside
> the "fast" set? Otherwise the important failures blocking everyone's
> work get drowned out in the noise.

I think it would certainly be helpful to separate the builders into a set that is sufficiently maintained and reliable to send an email when something breaks their build/tests, and a more "advisory" set of builders (e.g. there are some builders that appear to have borderline stability, often throwing up errors unrelated to the issues under test). I think declaring that only fast builders get to send emails would have unfortunate effects in terms of testing native builds on low-power architectures (which have a slower turn-around but are otherwise quite reliable). (ARM, my employer, spent quite a bit of effort fixing the ARM issues that had crept in, work which for various reasons has transitioned to Linaro now.) Modified in that sense, this also seems a reasonable solution.

> This isn't new. Just how the bots have always worked.
>
> The biggest thing would be to move the bots over to the phased builder
> infrastructure pioneered by Apple (they use it internally and I believe most
> of it has been upstreamed by Daniel Dunbar and David Tweed) that sets up
> dependencies (e.g. testing debug info depends on the compiler passing the
> basic check first) and reuse/caching of build products (e.g. use the output
> of the basic checks to test the debug info, rather than rebuilding the
> compiler on every builder).

Just to note that I suspect it's someone else you're thinking of regarding the phased builder. (Although I did quite a bit of work on the ARM buildbots late last year, I haven't been involved in the phased builder work.)

--
cheers, dave tweed
__________________________
high-performance computing and machine vision expert: david.tweed at gmail.com
"while having code so boring anyone can maintain it, use Python." -- attempted insult seen on slashdot
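One way to picture the maintained/advisory split described above is a per-builder flag that decides whether a failure is mailed at all, independent of how fast the builder is. This is only a sketch under assumptions: the `Builder` record, the `mails_on_failure` field, and the builder names in the usage note are invented here and do not correspond to existing configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Builder:
    name: str
    mails_on_failure: bool  # True: maintained and reliable; False: advisory only


def builders_to_mail(failed_names, registry):
    """Return the failed builders that are allowed to send mail.

    failed_names: iterable of builder names that just failed.
    registry: dict mapping builder name -> Builder.
    Builders missing from the registry are treated as advisory (assumption).
    """
    return [
        name for name in failed_names
        if name in registry and registry[name].mails_on_failure
    ]
```

A slow but stable native builder (call it `arm-native-slow`) would be registered with `mails_on_failure=True`, while a builder with borderline stability would be registered with `False` and only show up on the web page.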