On 7 October 2015 at 22:14, Eric Christopher <echristo at gmail.com> wrote:
> As a foreword: I haven't read a lot of the thread here and it's just a
> single developer talking here :)

I recommend you to, then. Most of your arguments are similar to David's and they don't take into account the difficulty in maintaining non-x86 buildbots.

What you're both saying is basically the same as: We want all the cars we care about in our garage, but only as long as they can race in F1. However, you care about the whole range, from beetles to McLarens, but are only willing to cope with the speed and reliability of the latter. You'll end up with only McLarens in your garage. It just doesn't make sense.

Also, very briefly, I want the same as both of you: reliability. But I alone cannot guarantee that. And even with help, I can only get there in months, not days. To get there, we need to *slowly* move towards it, not drastically throw away everything that is not a McLaren and only put them back when they're as fast as a McLaren. It just won't happen, and the risk of a fork becomes non-trivial.

--renato

On Wed, Oct 7, 2015 at 2:24 PM Renato Golin <renato.golin at linaro.org> wrote:
> On 7 October 2015 at 22:14, Eric Christopher <echristo at gmail.com> wrote:
> > As a foreword: I haven't read a lot of the thread here and it's just a
> > single developer talking here :)
>
> I recommend you to, then. Most of your arguments are similar to
> David's and they don't take into account the difficulty in maintaining
> non-x86 buildbots.

OK. I've now read the rest of the thread and don't find any of the arguments compelling for keeping flaky bots around for notifications. I also don't think that the x86-ness of it matters here. The powerpc64 and hexagon bots are very reliable.

> What you're both saying is basically the same as: We want all the cars
> we care about in our garage, but only as long as they can race in F1.
> However, you care about the whole range, from beetles to McLarens, but
> are only willing to cope with the speed and reliability of the latter.
> You'll end up with only McLarens in your garage. It just doesn't make
> sense.

I think this is a poor analogy. You're also ignoring the solution I gave you in my previous mail for slow bots.

> Also, very briefly, I want the same as both of you: reliability. But I
> alone cannot guarantee that. And even with help, I can only get there
> in months, not days. To get there, we need to *slowly* move towards
> it, not drastically throw away everything that is not a McLaren and
> only put them back when they're as fast as a McLaren. It just won't
> happen, and the risk of a fork becomes non-trivial.

I think this is a completely ridiculous statement. I mean, feel free if that's the direction you think you need to go, but I'm not going to continue down that thread with you.

Basically what I'm saying is that if you want a bot to be public and people to pay attention to it then you need to have some basic stability guarantees. If you can't give some basic stability guarantees then the bot is only harming the entire testing infrastructure.

That said, having your own internal bots is entirely useful, it just means that it's up to you to notice failures and provide some sort of test case to the community. We could even have a "beta" bot site if something is reliable enough for that, but not reliable enough for general consumption. I believe you mentioned having a separate bot master before, we have other bot masters as well - see the green dragon stuff with jenkins.

-eric

ps. Have actually added Chris Matthews to talk about the buildbot staging work. Or even moving the rest of the bots to something staged, or anything. :)

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20151007/4ef9c405/attachment.html>

On 7 October 2015 at 22:44, Eric Christopher <echristo at gmail.com> wrote:
> I think this is a poor analogy. You're also ignoring the solution I gave you
> in my previous mail for slow bots.

I'm not ignoring it, I'm acting upon it. But it takes time. I don't have infinite resources.

> If you can't give some basic stability guarantees then the bot
> is only harming the entire testing infrastructure.

Define stability. Daniel was talking about "things I can act upon". That's so vague it means nothing. "Basic stability guarantees" is in a similar vein.

Any universal rule you try to make will either be too lax for fast and reliable bots, or too hard on slow and less used bots. That's what I'm finding hard to understand.

All you guys are saying is that things are bad and need to get better. I agree completely. But your solution is to turn off everything you don't understand or assume is flaky, and that's just wrong.

We had two flaky bots: Pandas and a Juno. Pandas were disabled, the Juno was fixed. Some of our bots, however, are still slow, and we have been asked to disable them because they were red for too long.

Most of the problems we find are bad tests from people that didn't (obviously) test on ARM. The second most common is code that doesn't take into account 32-bit platforms. The third most common breakage is the sanitizer tests, which pop in and out on many platforms. The most common long breakage is due to self-hosted Clang breaking and making it hard to find which commit to revert or even which developer to warn.

None of those are due to instability of my buildbots. But I got shouted at many times to disable the bot because it was "red for too long". I find this behaviour disrespectful.

I'm now trying to get 8 more ARM boards and 3 AArch64, and I plan to put them in as redundant builders. But it takes time. Weeks to make them work reliably, more weeks to make sure they won't fall over under pressure, more weeks to put them in production and stabilise.

Meanwhile, I'd appreciate it if people stopped trying to kill the others.

What else do you want us to do?

cheers,
--renato

One strategy I use for our flaky bots is to have them email me only. If the failure is real, then I forward the email to whoever I find on the blame list. For a flaky build, this is the least you can do. For our flaky builds I know how and why they are flaky; a person who just receives the email does not. This is also a great motivator to help me know what is wrong, and how to fix it. By default, I do this for all new builds I create, until I decide the SNR is appropriate for the community. Yes, I have to triage builds sometimes, but I have an interest in them working, and in people always acting on green dragon emails, so I think it is worth it.

Beyond that, we have regexes which identify common failures, and highlight them in the build page and log. For instance, a build that fails with a ninja error will say so; the same goes for an svn failure or a Jenkins exception.

We also have a few policies on email: only email on first failure, don’t email on exception and abort, and don’t email long blame lists (more than 10 people). These require some manual intervention sometimes. But there is no point in emailing the wrong people, or too many people.

We also track the failure span for all of our builds; if any fail for more than a day, I get an email to go shake things up. We also keep a cluster-wide health metric, which is the total number of hours of currently failing builds. I use this as an overall indicator of how the builds are doing.

Across our CI cluster we use phased builds. Phase 1 is a fast incremental builder and a no-bootstrap release asserts build. If those build, we trigger a release-with-LTO build; if that works, we trigger all the rest of our compilers and tests. It is a waste to queue long builds on revisions that have not been vetted in some way. In some places the tree of builds is 4 deep, and the turnaround time can be upwards of 12 hours after commit, BUT failures in those bots are rare, because so much other testing has gone on beforehand.

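The phased triggering described above could be sketched roughly as follows. This is only an illustration of the gating idea, not the green dragon configuration: the builder names, the `PHASES` layout and the `run_build` callback are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical phase layout, loosely following the description above.
# Builder names are illustrative, not real job names.
PHASES: List[List[str]] = [
    ["incremental", "release-asserts-no-bootstrap"],  # phase 1: fast checks
    ["release-lto"],                                  # phase 2: only if phase 1 passed
    ["other-compilers", "test-suites"],               # phase 3: everything else
]


def run_phased(revision: str,
               run_build: Callable[[str, str], bool],
               phases: List[List[str]] = PHASES) -> Dict[str, bool]:
    """Run phases in order and stop queuing once a phase has a failure.

    run_build(builder, revision) is supplied by the caller and returns
    True on success. Builders in skipped phases are absent from the result.
    """
    results: Dict[str, bool] = {}
    for builders in phases:
        phase_ok = True
        # Run every builder in the phase so one failure does not hide another.
        for builder in builders:
            ok = run_build(builder, revision)
            results[builder] = ok
            phase_ok = phase_ok and ok
        if not phase_ok:
            break  # do not spend slow builders on an unvetted revision
    return results
```

With a callback that fails only the LTO build, the result contains the two phase 1 builders and the LTO builder, and nothing from phase 3 is ever queued.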
Mechanically, staging works by uploading the build artifacts to a central web server, then passing a URL to the next set of builds so they can download the compiler. This also speeds up builds that would otherwise have to build a compiler to run a test. For the lab, I think that won’t work as well because of the diversity of platforms and configurations, but a known good revision could be passed around. Some of the fast reliable builds can run first, and publish all the builds that work.

I do think flaky bots should only email their owners. I also think we should nominate some reliable fast builds to produce vetted revisions, and trigger most other builds from those.

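The email policies and the health metric from Chris's message (first failure only, no email on exception or abort, no long blame lists, flaky bots email only their owner, total hours of currently failing builds) could be expressed along these lines. It is a minimal sketch under assumptions: the `Build` record, the status labels and the choice to route long blame lists to the owner are hypothetical, not taken from any real buildbot or Jenkins API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

MAX_BLAME = 10  # blame lists longer than this are not emailed directly


@dataclass
class Build:
    builder: str
    status: str            # assumed labels: "success", "failure", "exception", "abort"
    previous_status: str   # status of the previous build on the same builder
    blame: List[str]       # committers in the revision range
    owner: str             # bot owner's address
    flaky: bool = False    # the owner's own judgment of the bot
    failing_since: Optional[datetime] = None  # start of the current red span


def recipients(build: Build) -> List[str]:
    """Return who should be emailed for this build, possibly nobody."""
    if build.status != "failure":
        return []              # no email on success, exception or abort
    if build.previous_status == "failure":
        return []              # only email on the first failure
    if build.flaky or len(build.blame) > MAX_BLAME:
        return [build.owner]   # assumption: the owner triages and forwards by hand
    return list(build.blame)


def cluster_health_hours(builds: List[Build], now: datetime) -> float:
    """Total hours of currently failing builds across the cluster."""
    return sum(
        (now - b.failing_since).total_seconds() / 3600.0
        for b in builds
        if b.status == "failure" and b.failing_since is not None
    )


def stale_failures(builds: List[Build], now: datetime,
                   limit: timedelta = timedelta(days=1)) -> List[str]:
    """Builders that have been failing for longer than `limit`."""
    return [
        b.builder for b in builds
        if b.status == "failure"
        and b.failing_since is not None
        and now - b.failing_since > limit
    ]
```

Keeping the routing decision in one small function makes it easy for a bot owner to flip `flaky` off once the signal-to-noise ratio is good enough for the wider community.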
On 10/07/2015 02:44 PM, Eric Christopher via cfe-dev wrote:
> On Wed, Oct 7, 2015 at 2:24 PM Renato Golin <renato.golin at linaro.org> wrote:
> > On 7 October 2015 at 22:14, Eric Christopher <echristo at gmail.com> wrote:
> > > As a foreword: I haven't read a lot of the thread here and it's just a
> > > single developer talking here :)
> >
> > I recommend you to, then. Most of your arguments are similar to
> > David's and they don't take into account the difficulty in maintaining
> > non-x86 buildbots.
>
> OK. I've now read the rest of the thread and don't find any of the
> arguments compelling for keeping flaky bots around for notifications.
> I also don't think that the x86-ness of it matters here. The powerpc64
> and hexagon bots are very reliable.

After reading the thread, this is also my view.

Philip