On 9 October 2015 at 19:02, David Blaikie <dblaikie at gmail.com> wrote:
> Where "software" here is presumably the OS software

Yes. This is the real noise, the one that we cannot accept.

> I think that misses the common usage of the term "flaky test" (or do the tests themselves end up other (1) or (2)?) or flaky tests due to flaky product code (hash ordering in the output).

Flaky code, whether in the compiler or in the tests, is code that doesn't fail within the correct blame list. Otherwise, even if the code was flaky, we wouldn't know: it failed within the right blame list, so it's easy to revert or XFAIL.

So, in my categorisation, flaky code ends up in either 3 or 4:

3, wrong blame list: the failure is completely independent from the blame list, for example a misuse of the C++ ABI.
4, related, but not directly: the failure is related, but in ways the patch didn't touch, for example changing related debug info for a non-debug patch.

These happen either because the original code didn't cope with this future (but semantically valid) change, or because the test CHECK lines were poor (naming explicit registers, etc.), and that's why the tests broke. The former is harder for the blamed developer to fix, but "git blame" can help find the right person to ask. The latter is a lot easier to spot and fix, and is also helped by "git blame". Both are actionable, but not immediately obvious.

> & I disagree here - if most contributors aren't acting on these (for whatever reasons, basically) we should just stop sending them. If at some point we find ways to make them actionable (by having common machine access people can use, documentation on how to proceed, short blame lists, etc - whatever's getting in the way of people acting on these).

I see, your disagreement is temporal.

You're basically saying that, because people ignore them today, there's no point in sending them the email today, and it's up to the bot owners to make people start paying attention to their bots.

My argument is that I cannot make you care, no matter how stable my bots are. And the evidence for that is that my bots are very stable, but you're ignoring them, either because you don't understand what a flaky bot is, or just out of principle.

My bots don't have hardware or OS problems, nor have they timed out or run out of disk for a good number of years. But I can't stop bad testing, or bad coding. And, as I've outlined too many times, these affect bots like mine more heavily than others. It's the nature of the failures plus the nature of my hardware.

I can't make you care about it, so I don't mind if you ignore them, but I *do* mind if you want to shut them off.

> And I don't think it's that people simply don't care about certain architectures - We see Linux developers fixing Windows and Darwin build breaks, for example.
> But, yes, more complicated things (I think a large part of the problem is the temporal issue - no matter the architecture, if the results are substantially delayed (even with a short blame list) and the steps to reproduce are not quick/easy, it's easy for people to decide it's not worth the hassle

I think that's an appalling behaviour for a community.

> - which I think is something we likely have to live with (again, lack of familiarity with a long/complex/inaccessible process means that those developers really aren't in the best place to do the reproduction/check that it was their patch that caused the problem)) do tend to fall to bot owners/people familiar with that platform/hardware, and I think that's totally OK/acceptable/the right thing.

Hum, ok. There are two sides here.

1. You do care, but can't do anything. In this case, you work with the owner to resolve the problem, even if the owner does all the work.

2. You don't care, and ignore the failure. Here the bot owner has to find out on his own and do all the work.

The first is perfectly acceptable, and I'm more than happy to do all the work. In the second case, I normally just revert the patch without asking.

> What I'm suggesting is that if most developers, most of the time, aren't able to determine this easily, it's not valuable email - if most of the time they have to reach out to the owner for details/clarification, then we should just invert it. Have the bot owner push to the contributor rather than the contributor pull from the bot owner.

The LLVM project has hundreds of committers, while dozens of bots have a single owner. How does that scale?

I think this proposal is against the very nature of open source projects in general, and a horrible engineering decision. I have noticed that recently some people have taken the attitude that "if you can't keep up with my commits, you're not worth noticing", and that's the attitude that will get us forked.

> They show up often enough cross-OS and build config too (-Asserts, Windows, Darwin, etc).

Ok, good.

> Patches should still be reverted, or tests XFAIL - bots shouldn't be left red for hours (especially in the middle of a work day) or a day.

How do you XFAIL a Clang miscompilation of Clang?

How do you revert a failure that is unrelated to the blame list because it comes from previous or external commits?

> This can often/mostly be compensated for by having more hardware -

Throw money at the problem? :D
https://www.youtube.com/watch?v=CZmHDEa0Y20

> especially for something as mechanical as a bisect. (obviously once you're in manual iterations, more hardware doesn't help much unless you have a few different hypotheses you can test simultaneously)

I don't have infinite hardware, nor infinite space, nor infinite power, nor infinite time.

Certain things take longer than others, and people who are used to getting them fast have a lower tolerance for slow(er) processes. Fast and slow are completely arbitrary, relative only to how slow or fast things are compared with each other.

> Certainly it takes some more engineering effort and there's overhead for dealing with multiple machines, etc. But it's not linearly proportional to machine speed, because some of it can be compensated for.

Right. So, here, I agree with you. It IS possible to improve and make it much better.

I'm working on making it better, but it takes time. I can't make it work tomorrow, and that's my original point:

We have to improve and be more strict, but we have to grow to get there, not to flip the table now.
I'm suggesting an exp(x) migration plan, not a sig(x) one.

> Sure - some issues take a while to investigate. No doubt - but so long as the issue is live (be it flaky or consistent) it's unhelpful (moreso if it's flaky, given the way our buildbots send mail - though I still don't like a red line on the status page, that's costly too) to have the bot red and/or sending mail.

Here, there are two issues:

1. Buildbots should not email on red -> exception -> red. That's settled, and we must ignore those cases from now on, otherwise we'll keep coming back to it. So, assume we don't do that any more.

2. If we agree that any flaky bot is turned off, and the master behaves correctly (as above), we cannot assume that the constant emailing during the investigation phase is due to flakiness. So, if you do get an email, there's probably a meaningful reason behind it.

We're not there yet, but we're discussing at a higher level here, dissecting the issue and finding the problems.

> The issue is known and being investigated, sending other people mail (or having it show up as red in the dashboard) isn't terribly helpful. It produces redundant work for everyone (they all investigate these issues - or learn to ignore them & thus miss true positives later) on the project.

Chris is investigating the Green Bot infrastructure, which is orders of magnitude better than our current one. In that scenario, we'll have orders of magnitude less redundant work, even if you get a warning that you can't act on.

--renato

On 9 October 2015 at 19:02, David Blaikie <dblaikie at gmail.com> wrote:
> On Fri, Oct 9, 2015 at 10:14 AM, Renato Golin <renato.golin at linaro.org> wrote:
>> I think we've hit a record in the number of inline replies, here... :)
>>
>> Let's start fresh...
>>
>> Problem #1: What is flaky?
>>
>> The types of failures of a buildbot:
>>
>> 1. failures because of bad hardware / bad software / bad admin (timeout, disk full, crash, bad RAM)
>
> Where "software" here is presumably the OS software, not the software under test (otherwise all actual failures would be (1)), and not infrastructure software because you've called that out as (2).
>
>> 2. failures because of infrastructure problems (svn, lnt, etc)
>> 3. failures due to previous or external commits unrelated to the blame list (intermittent, timeout)
>> 4. results that you don't know how to act on, but you have to
>> 5. clear error messages, easy to act on
>>
>> In my view, "flaky" is *only* number 1. Everything else is signal.
>
> I think that misses the common usage of the term "flaky test" (or do the tests themselves end up other (1) or (2)?) or flaky tests due to flaky product code (hash ordering in the output).
>
>> I agree that bots that cause 1. should be silent, and that failures in 2. and 3. should be only emailed to the bot admin. But category 4 still needs to email the blame list and cannot be ignored, even if *you* don't know how to act on.
>
> & I disagree here - if most contributors aren't acting on these (for whatever reasons, basically) we should just stop sending them. If at some point we find ways to make them actionable (by having common machine access people can use, documentation on how to proceed, short blame lists, etc - whatever's getting in the way of people acting on these).
>
> And I don't think it's that people simply don't care about certain architectures - We see Linux developers fixing Windows and Darwin build breaks, for example.
> But, yes, more complicated things (I think a large part of the problem is the temporal issue - no matter the architecture, if the results are substantially delayed (even with a short blame list) and the steps to reproduce are not quick/easy, it's easy for people to decide it's not worth the hassle - which I think is something we likely have to live with (again, lack of familiarity with a long/complex/inaccessible process means that those developers really aren't in the best place to do the reproduction/check that it was their patch that caused the problem)) do tend to fall to bot owners/people familiar with that platform/hardware, and I think that's totally OK/acceptable/the right thing.
>
>> Type 2. can easily be separated, but I'm yet to see how we are going to code in which category each failure lies for types 3. and 4.
>
> Yeah, I don't have any particular insight there either. Ideally I'd hope we can ensure those issues are rare enough (though I've been seeing some consistently flaky SVN behavior on my buildbot for the last few months, admittedly - reached out to Tanya about it, but didn't have much to go on) that it's probably not worth the engineering effort to filter them out.
>
>> One way to work around the problem in 4 is to print the bot owner's name on the email, so that you know who to reply to, for more details on what to do. How to decide if your change is unrelated or you didn't understand is a big problem.
>
> What I'm suggesting is that if most developers, most of the time, aren't able to determine this easily, it's not valuable email - if most of the time they have to reach out to the owner for details/clarification, then we should just invert it. Have the bot owner push to the contributor rather than the contributor pull from the bot owner.
>
>> Once all bots are low-noise, people will tend more to 4, until then, to 3 or 1.
>>
>> In agreement?
>>
>> Problem #2: Breakage types
>>
>> Bots can break for a number of reasons in category 4. Some examples:
>>
>> A. silly, quickly fixed ones, like bad CHECK lines, missing explicit triple, move tests to target-specific directories, add an include file.
>> B. real problems, like an assert in the code, seg fault, bad test results.
>> C. hard problems, like bad codegen affecting self-hosting, intermittent failures in test-suite or self-hosted clang.
>>
>> Problems of type A. tend to show by the firehose on ARM, while they're a lot less common on x86_64 bots just because people develop on x86_64.
>
> They show up often enough cross-OS and build config too (-Asserts, Windows, Darwin, etc).
>
>> Problems B. and C. are equally common on all platforms due to the complexity of the compiler.
>>
>> Problems of type B. should have the same behaviour on all platforms. If the bots are fast enough (either fast hardware, or many hardware), the blame list should be small and bisect should be quick (<1day).
>
> Patches should still be reverted, or tests XFAIL - bots shouldn't be left red for hours (especially in the middle of a work day) or a day.
>
>> These are not the problem.
>>
>> Problems of type C, however, are seriously worse on slow targets.
>
> This can often/mostly be compensated for by having more hardware - especially for something as mechanical as a bisect. (obviously once you're in manual iterations, more hardware doesn't help much unless you have a few different hypotheses you can test simultaneously)
>
> Certainly it takes some more engineering effort and there's overhead for dealing with multiple machines, etc. But it's not linearly proportional to machine speed, because some of it can be compensated for.
>
>> Not only is it slower to build (sometimes 10x slower than on a decent server), but the testing is hard to get right (because it's intermittent), and until you get it right, you're actively working on that (minus sleep time, etc). Since we're talking about an order of magnitude slower to debug, sleep time becomes a much bigger issue. If a hard problem takes about 5 hours on fast hardware, it can take up to 50 hours, and in that case, no one can work that long. If you do 10hs straight every day, it's still a week past.
>
> Sure - some issues take a while to investigate. No doubt - but so long as the issue is live (be it flaky or consistent) it's unhelpful (moreso if it's flaky, given the way our buildbots send mail - though I still don't like a red line on the status page, that's costly too) to have the bot red and/or sending mail. The issue is known and being investigated, sending other people mail (or having it show up as red in the dashboard) isn't terribly helpful. It produces redundant work for everyone (they all investigate these issues - or learn to ignore them & thus miss true positives later) on the project.
>
>> In agreement?
>>
>> I'll continue later, once we're in agreement over the base facts.
>>
>> cheers,
>> --renato
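[Editorial note: Renato's point 1 above (no blame mail on a red -> exception -> red sequence, e.g. across a master restart) can be written down as a small notification policy. The sketch below is self-contained Python for illustration only; the result constants and function names are assumptions, not the real buildbot API.]

    SUCCESS, WARNINGS, FAILURE, EXCEPTION, RETRY = range(5)

    def last_meaningful_result(history):
        # Skip results produced by the infrastructure itself: master restarts
        # show up as EXCEPTION, interrupted builds as RETRY.
        for result in reversed(history):
            if result not in (EXCEPTION, RETRY):
                return result
        return None

    def should_email_blame_list(history, current):
        # Email the blame list only on a genuine green -> red transition; an
        # ongoing red (even with exceptions in between) goes to the owner only.
        if current != FAILURE:
            return False
        previous = last_meaningful_result(history)
        return previous in (SUCCESS, WARNINGS, None)

    # red, then a master restart (exception), then red again: no new blame mail.
    assert should_email_blame_list([SUCCESS], FAILURE) is True
    assert should_email_blame_list([SUCCESS, FAILURE, EXCEPTION], FAILURE) is False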
Not to distract from the truly worthwhile discussion going on here, but let me bring up one notion that I think buildbot currently doesn't support:

Our internal build/test system can distinguish "has new failure(s)" from "failed but no new failures" and represent those things differently on our dashboard. In public-bot terms this would mean saving the most recent list of test failures, comparing to the new set of test failures, and having a different failure-state if the new set is equal to or a proper subset of the previous set.

This might ameliorate an ongoing-red situation, as a no-new-fails state wouldn't send blame mail. But if there are new fails, the blame mailer can do a set-difference and report only the new ones. That would reduce the noise a bit, hmm?
--paulr
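[Editorial note: Paul's "new failures only" state is essentially a set-difference over saved results. A minimal self-contained sketch of the idea; the state file and status names are invented for illustration, this is not an existing buildbot feature.]

    import json
    from pathlib import Path

    STATE = Path("previous_failures.json")

    def classify_run(current_failures):
        # Load the failure set from the previous run, then save the current one.
        previous = set(json.loads(STATE.read_text())) if STATE.exists() else set()
        current = set(current_failures)
        STATE.write_text(json.dumps(sorted(current)))

        if not current:
            return "green", set()
        new = current - previous       # only failures not seen in the previous run
        if not new:                    # equal to, or a proper subset of, the previous set
            return "still-red", set()  # different dashboard colour, no blame mail
        return "new-red", new          # blame mail reports only these

    # run 1: a.ll and b.ll fail  -> ("new-red", {"a.ll", "b.ll"})
    # run 2: only a.ll fails     -> ("still-red", set())
    # run 3: a.ll and c.ll fail  -> ("new-red", {"c.ll"})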
On 16 October 2015 at 15:17, Robinson, Paul <Paul_Robinson at playstation.sony.com> wrote:
> But if there are new fails, the blame mailer can do a set-difference and report only the new ones. That would reduce the noise a bit, hmm?

Hi Paul,

The danger there is that it'd be easier to "get used" to having some failures, as long as you don't have "new" failures. Every place I worked that supported that philosophy ended up with all bots "orange". It's never the intention, but it's almost always the inevitable consequence. In a small team, or a single company, it may be a lot easier to move them back to green, but in an open community, it's not that easy, nor that quick.

The way we work with the same concept, as David mentioned repeatedly, is to use XFAILs. It is essentially the same thing, except that "it hurts more" to mark an XFAIL than to see a different shade of red, so we're more reluctant to ignore them.

Plus, an orange bot that becomes red (new failures) will itself become orange again as time passes or yet newer failures show up. If we end up with that many shades of red, understanding the difference will become harder, and the value will decrease.

cheers,
--renato
On Sat, Oct 10, 2015 at 4:59 AM, Renato Golin <renato.golin at linaro.org> wrote:
> On 9 October 2015 at 19:02, David Blaikie <dblaikie at gmail.com> wrote:
>> Where "software" here is presumably the OS software
>
> Yes. This is the real noise, one that we cannot accept.
>
>> I think that misses the common usage of the term "flaky test" (or do the tests themselves end up other (1) or (2)?) or flaky tests due to flaky product code (hash ordering in the output).
>
> Flaky code, either compiler or tests, are the ones that don't fail in the correct blame list. Otherwise, even if it was flaky, we don't know, because it failed in the right blame list, so it's easy to revert or XFAIL.
>
> So, in my categorisation, flaky code ends up in either 3 or 4:
>
> 3, wrong blame list: if the failure is completely independent from the blame list, example, misuse of the C++ ABI.
> 4, related, but not directly: if the failure is related, but in ways that the patch didn't touch, example, changing related debug info for a non-debug patch.
>
> These can be that the original code didn't cope with this future change, but the change is semantically valid, or the test CHECK lines were poor (like naming explicit registers, etc), and that's why the tests broke. The former is harder for the blamed developer to fix, but "git blame" can help find the one to help. The latter is a lot easier to spot and fix, but is also helped by "git blame". Both actionable, but not immediately obvious.
>
>> & I disagree here - if most contributors aren't acting on these (for whatever reasons, basically) we should just stop sending them. If at some point we find ways to make them actionable (by having common machine access people can use, documentation on how to proceed, short blame lists, etc - whatever's getting in the way of people acting on these).
>
> I see, your disagreement is temporal.
>
> You're basically saying that, because people ignore them today, there's no point in sending them the email today, and it's up to the bot owners to make people start paying attention to their bots.
>
> My argument is that I cannot make you care, no matter how stable my bots are. And the evidence for that is that my bots are very stable, but you're ignoring them, either because you don't understand what a flaky bot is, or just out of principle.

In the proximal issue - the bot was red for a week. When I see a bot red for a week, I assume no one cares about it (because I assume that if they did they would've at least XFAILed the issue so they could get back to green & catch future issues). That's the question I was asking and the reason I'm inclined to ignore the email I got from that bot.

As you've pointed out, the reason I got email from the bot was because of the master restart (red->purple->red), and addressing that would mean I wouldn't've sent my original email to you (but to other bot masters who had long-red bots - as you can see, I wasn't singling you out, I was looking at any bot that had been red for multiple work days).

I would still, in the abstract, disagree with leaving bots red for long periods because it makes the buildbot status pages hard to read - which things are unknown issues that someone needs to investigate, and which aren't? XFAIL should represent the mechanism by which we acknowledge a known failure, get back to green, and investigate. XFAILing a bootstrap is a bit unknown - perhaps we should have a way to do that?
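[Editorial note: there is no such mechanism today, but one possible shape for "XFAILing" a bootstrap, purely as a sketch: keep a list of acknowledged, tracked failures that the self-host builder consults, and downgrade a red stage-2 build to "known failure" when its log matches one of them. The issue label, the pattern and the function below are all hypothetical.]

    import re

    # issue label -> pattern identifying the acknowledged failure in the stage-2 log
    KNOWN_BOOTSTRAP_FAILURES = {
        "hypothetical-tracked-issue": r"fatal error: error in backend: .*",
    }

    def classify_bootstrap_log(log_text):
        # Returns 'pass', 'known-failure' (stay quiet, owner tracks it),
        # or 'new-failure' (real regression: notify the blame list).
        if "error:" not in log_text and "FAILED:" not in log_text:
            return "pass", None
        for issue, pattern in KNOWN_BOOTSTRAP_FAILURES.items():
            if re.search(pattern, log_text):
                return "known-failure", issue
        return "new-failure", None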
Beyond that, I've been talking about flakey failures in general, but that wasn't my issue with your bot at the time I sent the mail. I have no opinion on the flakiness of your bot(s). I think we got caught down a rathole talking about the abstract problems of flakiness, even though when I sent my last volley of "what's with these bot results" they weren't about flakiness at all, but /specifically/ about long-red bots that appear neglected.

> My bots don't have hardware or OS problems, nor they timeout or run out of disk for a good number of years. But I can't stop bad testing, or bad coding. And, as I've outlined too many times, these affect bots like mine more heavily than others. It's the nature of the failures plus the nature of my hardware.
>
> I can't make you care about it, so I don't mind if you ignore them,

Are there often original contributors, faced with a unique result from these bots, who are addressing the problem themselves? Or do they usually have to defer to you or another expert in this hardware, to do some level of triage/investigation/reproduction first?

> but I *do* mind if you want to shut them off.

As I've said before - I'm suggesting not sending mail. I'm not suggesting turning them off.

It would be little-to-no change to me to do this to my GDB 7.5 bot, for example - I glance at every failure that comes through anyway. All I'd do differently is forward anything that I thought looked like a real, unique failure, to the mailing list/blame list, rather than having it done automatically. This does not seem terribly onerous. Is it?

>> And I don't think it's that people simply don't care about certain architectures - We see Linux developers fixing Windows and Darwin build breaks, for example. But, yes, more complicated things (I think a large part of the problem is the temporal issue - no matter the architecture, if the results are substantially delayed (even with a short blame list) and the steps to reproduce are not quick/easy, it's easy for people to decide it's not worth the hassle
>
> I think that's an appalling behaviour for a community.

I... don't, really. As with my own GDB 7.5 buildbot, I pretty much assume interesting failures will probably involve me helping to triage (especially with the Apple engineers explicitly not having access to the source/test cases run there) the issues. The bot sends me email on every red, and I treat that as pretty much a thing I need to care about until it's green, as much as possible by acting as a facilitator to the original contributor who committed the breakage.

>> - which I think is something we likely have to live with (again, lack of familiarity with a long/complex/inaccessible process means that those developers really aren't in the best place to do the reproduction/check that it was their patch that caused the problem)) do tend to fall to bot owners/people familiar with that platform/hardware, and I think that's totally OK/acceptable/the right thing.
>
> Hum, ok. There are two sides here.
>
> 1. You do care, but can't do anything. In this case, you work with the owner to resolve the problem, even if the owner does all the work.
>
> 2. You don't care, and ignore the failure. Here the bot owner has to find out on his own and do all the work.
>
> The first is perfectly acceptable, and I'm more than happy to do all the work. The second I normally just revert the patch without asking.
It's generally not the community policy to revert a patch without providing actionable reproduction steps, etc. Do you do that? I don't recall seeing that done. (In general, I think it better to get reproduction steps first, then revert - sometimes people revert first and provide reproduction much later (because a reduction takes time, etc) - which I don't think is ideal, but is sometimes the right tradeoff for the community (if it's obviously going to be/is a problem for everyone, we're just not all seeing it yet, etc).)

>> What I'm suggesting is that if most developers, most of the time, aren't able to determine this easily, it's not valuable email - if most of the time they have to reach out to the owner for details/clarification, then we should just invert it. Have the bot owner push to the contributor rather than the contributor pull from the bot owner.
>
> The LLVM project has hundreds of committers, dozens of bots have a single owner. How does that scale?

Most of the bot results are pretty easily actionable - just by reading the diagnostics from the bots, etc. I run a bot - I glance at every fail mail that comes from it. It does not seem to be terribly onerous to me to do this - is it for you? The only time it costs me more than a sub-second per failure is if it's a real issue I need to investigate (OK, if it's actually a GDB test failure that's just flakey, that costs me a few seconds, but still not long).

The point is that doing the opposite - sending mail to large blame lists - is strictly higher cost than having a bot owner do the work. A bot owner is 1 person, a large blame list is multiple. It scales better to have 1 person look at the failure rather than many. Also, non-owners are less familiar with the interesting failures from the bot (or the ongoing state - red or otherwise), so it costs them more than the owner.

A long-red bot is a worse example of this, if it's sending mail even on a few reds - that's multiple developers looking at the bot to see if they broke it, when it's already known broken and being investigated. Every one of those emails is costly/worse scaling than just sending mail to the owner & having the owner triage/escalate to the contributor.

> I think this proposal is against the very nature of open source projects in general and a horrible engineering decision.

Do you believe there's no quality point in a buildbot notification where it is not worth sending mail/notification? Where those notifications hurt the quality (by reducing the signal/noise to the point where we either hurt the throughput of developers by having substantially redundant (& unskilled in the specific kinds of failures a certain platform might see) failure investigation, or hurt the quality of the project by people learning to ignore bot mails in general and thus missing important true positives as well)?

> I have noticed that recently some people have taken the attitude that "if you can't keep up with my commits, you're not worth noticing",

Not quite sure what you're referring to here - we seem to be pretty good about moving fast, but also having important design discussions in the community (llvm-dev mailing list, etc) when there's input required or people need a bit of forewarning about a change in direction, etc.
I think it's not too unreasonable to expect people to check some of the commit history to see what's been going on in an area they're interested in, or around some recent failure they're seeing, etc. (if they're contributors - if they're not contributors, yes, we don't tend to care much).

> and that's the attitude that will get us forked.

I don't really see the concern of that (I don't really understand the chance of this, or what causes projects to be forked, nor the cost if they are).

>> They show up often enough cross-OS and build config too (-Asserts, Windows, Darwin, etc).
>
> Ok, good.
>
>> Patches should still be reverted, or tests XFAIL - bots shouldn't be left red for hours (especially in the middle of a work day) or a day.
>
> How do you XFAIL a Clang miscompilation of Clang?

It's a good question - seems like it'd be something we might want to have some way of doing. Perhaps we could have some stub test cases that are used to describe some of these sorts of tests.

> How do you revert a failure that is unrelated to the blame list because they're from previous or external commits?

External? If they're from previous commits/it's a flakey product issue - that's tricky, for sure. We don't have good infrastructure for that. It would be nice to build some (we could run flake detection in off-peak times - tests that are suspected of being flakey could be run repeatedly to see if they are, etc), but it's non-trivial to do so, for sure. For now, I don't know that that's the long pole - though there are some notable exceptions (Windows filesystem IO caused some ongoing flakes on Windows, which I think should be an issue for those running the Windows buildbots).

>> This can often/mostly be compensated for by having more hardware -
>
> Throw money at the problem? :D

Sure, if that's what it takes - we're already paying for the problem with engineering time. I'm suggesting that maybe that cost shouldn't be distributed across the project, but rather localized to those invested (literally, financially) in the behavior of the platforms in question.

> https://www.youtube.com/watch?v=CZmHDEa0Y20
>
>> especially for something as mechanical as a bisect. (obviously once you're in manual iterations, more hardware doesn't help much unless you have a few different hypotheses you can test simultaneously)
>
> I don't have infinite hardware, nor infinite space, nor infinite power, nor infinite time.

None of these things require infinite anything. There's a "reasonable" level of turnaround that can help quite a bit.

> Certain things take longer than others, and people that are used to getting them fast have a lower tolerance for slow(er) processes. Fast and slow are completely arbitrary and relative to how slow or fast things are between themselves.

I don't think they're entirely arbitrary (there are certain broad cutoffs where the productivity loss is more noticeable as you transition from one way to another way of doing things (eg: once your build takes more than a few seconds, you're likely to context switch away then come back to it, etc)). But even if they are, I don't think it's entirely wrong to strive to have a system that is fast.

>> Certainly it takes some more engineering effort and there's overhead for dealing with multiple machines, etc. But it's not linearly proportional to machine speed, because some of it can be compensated for.
>
> Right. So, here, I agree with you. It IS possible to improve and make it much better.
> I'm working on making it better, but it takes time. I can't make it work tomorrow, and that's my original point:
>
> We have to improve and be more strict, but we have to grow to get there, not to flip the table now. I'm suggesting an exp(x) migration plan, not a sig(x).

I'm not suggesting flipping any tables. I'm suggesting having owners of bots that aren't great/easily actionable do the first-level triage, then forward to the relevant contributors. This does not seem to be an impossibly onerous request - is it? Is there something I'm missing about this request being unreasonable?

>> Sure - some issues take a while to investigate. No doubt - but so long as the issue is live (be it flaky or consistent) it's unhelpful (moreso if it's flaky, given the way our buildbots send mail - though I still don't like a red line on the status page, that's costly too) to have the bot red and/or sending mail.
>
> Here, there are two issues:
>
> 1. Buildbots should not email on red->except->red. That's settled, and we must ignore those cases from now on, otherwise, we'll keep coming back at it. So, assume we don't do that any more.

Until that's fixed, again, I don't think it'd be unreasonable to switch bots that tend to be red for extended periods of time (& are thus more prone to this problem) to be owner-triage-first.

> 2. If we agree that any flaky bot is turned off, and the master behaves correctly (as above), we cannot assume that the constant emailing during the investigation phase is due to flakyness. So, if you do get an email, it's probably a meaningful reason.

Sure - though I have a problem, to a lesser degree, with the buildbot status page having red results for issues that are known & under investigation. It would be better if that were not the case (if those bots were XFAIL'd), but it doesn't relate to email notifications at all, which is my bigger concern.
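[Editorial note: David's aside above about running flake detection in off-peak times (re-running suspected tests to see whether they really are intermittent) could start as small as the sketch below. The llvm-lit invocation and paths are assumptions about one bot's setup, not an existing tool.]

    import subprocess

    def run_test(test, build_dir):
        # Placeholder: run a single lit test and report success/failure.
        return subprocess.call(["./bin/llvm-lit", "-q", test], cwd=build_dir) == 0

    def find_flaky(suspects, build_dir, repeats=10):
        # A test with mixed results across identical runs is flagged as flaky.
        flaky = {}
        for test in suspects:
            passes = sum(run_test(test, build_dir) for _ in range(repeats))
            if 0 < passes < repeats:
                flaky[test] = passes
        return flaky

    # e.g. find_flaky(["../llvm/test/CodeGen/ARM/some-test.ll"], "/path/to/build")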
Huge inline record again... I'll pick the contentious issues...

On 19 October 2015 at 19:38, David Blaikie <dblaikie at gmail.com> wrote:
> at all, but /specifically/ about long-red bots that appear neglected.

"Appear" is the key here. It'd be better if you ask first, then propose to disable later. If I was on holidays, someone (maybe you) could have assumed lack of care and disabled them without the ARM sub-community's knowledge.

Probably no one got your email but me. I don't know how you could have made sure everyone was copied, TBH. We have to think about that one, too. Maybe add sub-owners?

> It would be little-to-no change to me to do this to my GDB 7.5 bot, for example - I glance at every failure that comes through anyway. All I'd do differently is forward anything that I thought looked like a real, unique failure, to the mailing list/blame list, rather than having it done automatically. This does not seem terribly onerous. Is it?

You mind one bot. I mind 11, and the list is growing. Our bots are very different from each other, and the failures that happen to one rarely happen to others. I am solving the contingency issue, but that takes time.

I agree that's largely my responsibility, but we can't go from "it's ok to have some red bots" to "we're doomed, kill them all" overnight. I am working towards the goals we both agree on, but it *will* take some time. I'd appreciate some patience.

> I... don't, really. As with my own GDB 7.5 buildbot, I pretty much assume interesting failures will probably involve me helping to triage (especially with the Apple engineers explicitly not having access to the source/test cases run there) the issues. The bot sends me email on every red, and I treat that as pretty much a thing I need to care about until it's green, as much as possible by acting as a facilitator to the original contributor who committed the breakage.

ARM is one of the main architectures in LLVM. Compatibility with GDB 7.5 is important, but substantially less so. It may look selfish on my part, but I don't think you can compare them as equals. A lot more people, projects and companies will be upset if ARM support regresses than if the GDB 7.5 bot stays red for a few weeks, or even a few months.

Given the importance, I don't think it's feasible (or healthy) for me to own most of the bots, but for now, it is what it is. I'd appreciate it if other companies that do care about ARM could *also* contribute and maintain ARM bots on their own. But even that will take some time.

> Do you believe there's no quality point in a buildbot notification where it is not worth sending mail/notification?

No, I agree with you on almost all technical points. But those changes need to take some time to happen.

>> How do you XFAIL a Clang miscompilation of Clang?
>
> It's a good question - seems like it'd be something we might want to have some way of doing. Perhaps we could have some stub test cases that are used to describe some of these sort of tests.

To answer my own question, I think staged bots are the solution here.

> If they're from previous commits/it's a flakey product issue - that's tricky, for sure.

One critical thing that doesn't get caught: Zorg changes. Maybe we should add a monitor to Zorg on every SVN poller. If we can, make sure that we build every Zorg change isolated from any other.

> None of these things require infinite anything.
> There's a "reasonable" level of turnaround that can help quite a bit.

"Reasonable" depends on how many resources (money, hardware, engineers) you have. You're seeing everyone else through your own glasses, assuming you could fix the problem in X days because you have N engineers, M money and Y hardware availability, whereas all those variables are different for other people / companies.

By saying that "everyone willing to help" should invest as much as Google or Apple does, you're essentially shutting off everyone else *but* Google and Apple from the project. That's where the risk of forking comes from.

cheers,
--renato
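[Editorial note: Renato's suggestion earlier in this message, monitoring Zorg itself so that builder-configuration changes show up alongside code changes, could start as small as the standalone poller sketched below. In a real setup this would be another change source on the buildmaster; here the polling interval, function names and the assumption that the public zorg SVN URL is reachable are all illustrative only.]

    import re
    import subprocess
    import time

    ZORG_URL = "http://llvm.org/svn/llvm-project/zorg/trunk"

    def zorg_revision():
        # Parse the latest revision out of `svn info` for the zorg repository.
        out = subprocess.check_output(["svn", "info", ZORG_URL]).decode()
        return int(re.search(r"^Revision:\s*(\d+)", out, re.MULTILINE).group(1))

    def watch_zorg(poll_seconds=300):
        last = zorg_revision()
        while True:
            time.sleep(poll_seconds)
            current = zorg_revision()
            if current != last:
                # A builder-config change: worth flagging next to the normal blame list.
                print("zorg changed: r%d -> r%d" % (last, current))
                last = current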