On 05/20/2015 11:04 AM, Renato Golin wrote:
> On 20 May 2015 at 18:47, Philip Reames <listmail at philipreames.com> wrote:
>> One particular irritant is getting emails 12-24 hours later about
>> someone else's breakage that has *already been fixed*. The long-cycling
>> bots are really irritating in that respect.
>
> That's not that easy to fix, and I think we'll have to cope with that
> forever. Not all machines are fast, and some buildbots do a full
> self-host, with compiler-rt and running all tests. Others do a full
> benchmark run of LNT, running it 5-8 times, which can take several
> hours on an ARM box.

I agree it's not easy, but it's not something we should just live with
either. There are ways to address the problem, and we should consider them.

As a randomly chosen example, one thing we could do is introduce the
notion of a "last good commit". Fast builders would cycle off ToT;
whenever one (or some subset) passes, that advances the last good commit.
Slower builders would cycle off the last good commit, not ToT. We have all
the mechanisms to implement this today: it could be as simple as parsing
the JSON output of buildbot in the script that runs the slower build bots
and syncing to that revision rather than ToT. (A rough sketch of that idea
follows after this message.)

> The benchmark bots should be marked not to spam, since they're not
> there to pick up errors, but the full self-hosting ones do need to
> warn on errors. For example, right now I have a bug only on a thumbv7a
> self-hosting bot, and not on others. I'm now bisecting it to find the
> culprit, but this is not always clear, as the longer it takes for me
> to realise, the harder it will be to fix it.

At this point, you're long past the point I was grousing about. I'm not
arguing that long-running bots shouldn't notify; I'm arguing they shouldn't
report *obvious* false positives. Also, the bisect step really should be
automated... :)

> The only way out of it is for people to look at the fast bots, and if
> they're fixed, check the commit that did it and see if the slow bot
> has been fixed by the same commit later.

You've now wasted 10 minutes or more of my time per slow, noisy bot. When I
routinely get 10+ builder failure emails for changes that are clean, that's
not a worthwhile investment.

> Buildbot owners will eventually pick those problems up, but as I said,
> the longer it takes, the harder it is to get to the bottom of it, and
> the higher the probability of getting more regressions introduced
> because the bot is red and won't warn.

I agree. All I'm suggesting is reducing the noise so that real failures are
likely to be noticed quickly.

> cheers,
> --renato
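For concreteness, here is a minimal sketch of the "parse the JSON output of
buildbot" idea above, as it might sit in the script that drives a slow bot.
The master URL, builder names, endpoint, and JSON field names are all
illustrative assumptions (they vary by buildbot version), not a description
of any existing LLVM buildbot script.

#!/usr/bin/env python3
# Hypothetical sketch: derive a "last good commit" from the fast builders and
# print the revision a slow bot could sync to instead of ToT.
#
# The master URL, builder names, endpoint, and JSON field names are assumed
# for illustration; check your buildbot version's JSON API for the real ones.

import json
from urllib.request import urlopen

MASTER = "http://lab.llvm.org:8011"                          # assumed master URL
FAST_BUILDERS = ["clang-x86_64-linux-fast", "llvm-x86_64-linux-fast"]  # assumed names
SUCCESS = 0  # buildbot's result code for a successful build

def last_good_revision(builder, lookback=20):
    """Return the newest revision this builder passed on, or None."""
    for i in range(-1, -lookback - 1, -1):   # walk back from the latest build
        url = "%s/json/builders/%s/builds/%d" % (MASTER, builder, i)
        try:
            build = json.loads(urlopen(url).read().decode("utf-8"))
        except Exception:
            return None
        # "results" and "sourceStamps" are assumptions about the JSON layout.
        if build.get("results") == SUCCESS and build.get("sourceStamps"):
            return int(build["sourceStamps"][0]["revision"])
    return None

if __name__ == "__main__":
    good = [r for r in (last_good_revision(b) for b in FAST_BUILDERS) if r]
    if good:
        # "If any one passed, that advances the last good commit": take the
        # newest passing revision. The slow bot's run script would then do
        # something like:  svn update -r <revision>
        print(max(good))

Whether "last good" means the newest revision that any fast builder passed
on, or the newest one that all of them passed on, is a policy choice; the
"any one (or some subset)" wording above suggests the former.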
On 21 May 2015 at 01:52, Philip Reames <listmail at philipreames.com> wrote:
> As a randomly chosen example, one thing we could do is introduce the
> notion of a "last good commit". Fast builders would cycle off ToT;
> whenever one (or some subset) passes, that advances the last good commit.
> Slower builders would cycle off the last good commit, not ToT. We have
> all the mechanisms to implement this today: it could be as simple as
> parsing the JSON output of buildbot in the script that runs the slower
> build bots and syncing to that revision rather than ToT.

Not all slow builders have the same sources as the fast builders. For
example, our "full" builders include compiler-rt, while the fast ones don't.

> At this point, you're long past the point I was grousing about. I'm not
> arguing that long-running bots shouldn't notify; I'm arguing they
> shouldn't report *obvious* false positives.

Well, that's yet another fix we need for all builders. I think we're missing:

1. Detection of infrastructure vs. real code problems. There isn't a simple
   way of doing this, so adding patterns for known "infrastructure" problems
   to be ignored, and treating everything else as an error, would be OK.

2. Detection of different failures. If new tests fail, or the build fails
   instead of the tests, the bot should email *again*. This is very
   problematic, and it is why people get so angry at broken bots.

3. Detection of long-running failures that might have been forgotten: no
   emails to the blame list, but an email to the bot owner would help.

(A rough sketch of what 1 and 2 could look like follows after this message.)

> Also, the bisect step really should be automated... :)

It's not always simple, especially when self-hosting. If each step takes 7
hours, guessing what the outcome will be and waiting 7 days to realise the
guess was wrong is not a good use of resources. For those cases I always
bisect manually.

> You've now wasted 10 minutes or more of my time per slow, noisy bot. When
> I routinely get 10+ builder failure emails for changes that are clean,
> that's not a worthwhile investment.

I know. That's why I do that on my own bots; it's my time to spend.

Maybe we should divide the bots into three categories: Fast, Slow and
Experimental. Fast bots are everyone's responsibility. Slow bots are the
bot owners'. Experimental bots can safely be ignored. That's pretty much
what I do now with my NOC page.

As a bot owner, if I want to reduce the time I spend on slow bots, I'll
have to work hard to make them fast, not transfer the burden to the rest
of the community.

cheers,
--renato
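Points 1 and 2 above are mostly bookkeeping around the build logs. Below is
a minimal sketch of what that bookkeeping could look like, assuming the
bot's notifier has access to the failed step's log text and the list of
failing test names; the patterns, function names, and state file are
hypothetical, not an existing buildbot feature.

#!/usr/bin/env python3
# Hypothetical sketch of points 1 and 2: classify a failed build as an
# infrastructure problem vs. a real code problem, and decide whether to email
# the blame list again because the failure *signature* has changed.

import hashlib
import os
import re

# Point 1: log patterns that indicate the bot, not the commit, is at fault.
# These patterns are examples only.
INFRASTRUCTURE_PATTERNS = [
    r"No space left on device",
    r"Connection timed out",
    r"lost remote connection to the buildslave",
    r"clock skew detected",
]

def is_infrastructure_failure(log_text):
    """True if the failure looks like a machine/network problem to ignore."""
    return any(re.search(p, log_text) for p in INFRASTRUCTURE_PATTERNS)

# Point 2: a crude signature of *what* failed, so a different breakage on an
# already-red bot still triggers a fresh email.
def failure_signature(failed_step, failed_tests):
    blob = failed_step + "\n" + "\n".join(sorted(failed_tests))
    return hashlib.sha1(blob.encode("utf-8")).hexdigest()

def should_notify(failed_step, failed_tests, state_file=".last_failure"):
    """Email the blame list only if this failure differs from the last one."""
    sig = failure_signature(failed_step, failed_tests)
    last = ""
    if os.path.exists(state_file):
        with open(state_file) as f:
            last = f.read().strip()
    with open(state_file, "w") as f:
        f.write(sig)
    return sig != last

Point 3 could then fall out of the same state file: if the stored signature
stays unchanged for more than some number of builds, mail the bot owner
instead of the blame list.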
On 05/21/2015 02:05 AM, Renato Golin wrote:
> On 21 May 2015 at 01:52, Philip Reames <listmail at philipreames.com> wrote:
>> As a randomly chosen example, one thing we could do is introduce the
>> notion of a "last good commit". Fast builders would cycle off ToT;
>> whenever one (or some subset) passes, that advances the last good
>> commit. Slower builders would cycle off the last good commit, not ToT.
>> We have all the mechanisms to implement this today: it could be as
>> simple as parsing the JSON output of buildbot in the script that runs
>> the slower build bots and syncing to that revision rather than ToT.
>
> Not all slow builders have the same sources as the fast builders. For
> example, our "full" builders include compiler-rt, while the fast ones
> don't.
>
>> At this point, you're long past the point I was grousing about. I'm not
>> arguing that long-running bots shouldn't notify; I'm arguing they
>> shouldn't report *obvious* false positives.
>
> Well, that's yet another fix we need for all builders. I think we're
> missing:
>
> 1. Detection of infrastructure vs. real code problems. There isn't a
>    simple way of doing this, so adding patterns for known
>    "infrastructure" problems to be ignored, and treating everything else
>    as an error, would be OK.
>
> 2. Detection of different failures. If new tests fail, or the build
>    fails instead of the tests, the bot should email *again*. This is
>    very problematic, and it is why people get so angry at broken bots.
>
> 3. Detection of long-running failures that might have been forgotten: no
>    emails to the blame list, but an email to the bot owner would help.
>
>> Also, the bisect step really should be automated... :)
>
> It's not always simple, especially when self-hosting. If each step takes
> 7 hours, guessing what the outcome will be and waiting 7 days to realise
> the guess was wrong is not a good use of resources. For those cases I
> always bisect manually.
>
>> You've now wasted 10 minutes or more of my time per slow, noisy bot.
>> When I routinely get 10+ builder failure emails for changes that are
>> clean, that's not a worthwhile investment.
>
> I know. That's why I do that on my own bots; it's my time to spend.
>
> Maybe we should divide the bots into three categories: Fast, Slow and
> Experimental. Fast bots are everyone's responsibility. Slow bots are the
> bot owners'. Experimental bots can safely be ignored. That's pretty much
> what I do now with my NOC page.
>
> As a bot owner, if I want to reduce the time I spend on slow bots, I'll
> have to work hard to make them fast, not transfer the burden to the rest
> of the community.

+1. I would be in full support of such a proposal.

> cheers,
> --renato