David Blaikie via llvm-dev
2021-Sep-10 21:02 UTC
[llvm-dev] False positive notifications around commit notifications
The folks on the Infrastructure Working Group: https://foundation.llvm.org/docs/infrastructure-wg/ might have some context on this (I think Christian Kuhnel is part of that & he's on this list). On Fri, Sep 10, 2021 at 11:53 AM Philip Reames via llvm-dev < llvm-dev at lists.llvm.org> wrote:> > On 9/10/21 11:36 AM, Mehdi AMINI wrote: > > > > On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> I've been noticing a trend where there is more and more false positive >> email notifications sent out on valid commits. This is getting really >> problematic as real signal is being lost in the noise. I've had several >> cases in the last few weeks where I did not see a "real" failure notice >> because it was buried in a bunch of false positives. >> >> Let me run through a few sources of what I consider false positives, and >> suggest a couple things we could do to clean these up. Note that the >> recommendations here are entirely independent and we can adopt any subset. >> >> *Slow Try Bots* >> >> ex: "This revision was landed with ongoing or failed builds." on >> https://reviews.llvm.org/D109091 >> >> Someone - I'm not really sure who - enabled builds for all reviews, and >> this notice on landed commits. Given it's utterly routine to make a last >> few style fixes before landing an LGTMed change >> > > I do such "few style fixes", but I don't re-upload a revision before > landing, so I don't see this "false positive" in general. > > I don't explicit upload the final patch either, but something in the close > automation does. > > What I frequently see is that the pre-merge config is broken for some > other reason, and that's quite annoying. One aspect of the issue is that > the is no buildbot tracking the pre-merge configuration so it can be broken > without notification (there is a buildkite job tracking it, but buildkite > does not support blamelist notifications). > > Hm, maybe I misinterpreted the cause of these entirely? Your explanation > sounds plausible as well. > > If your explanation is correct, that would lean strongly to the "just > disable" option. > > Do you know who to contact about this? (i.e. Who owns the automation > here? Or where is the appropriate code to adjust?) >Looks like: https://reviews.llvm.org/harbormaster/plan/5/ indicates it was setup by https://reviews.llvm.org/p/goncharov/ (added them to the "to" line on this email) Also looks like Kuhnel ( https://reviews.llvm.org/H576 ) has sometthing to do with it, so I've added them too.> , I consider this notice complete noise. In practice, almost review gets >> tagged this way. To be clear, there is value in being told about changes >> which don't build. The false positive part is only around the "ongoing" >> builds. >> >> Recommendation: Disable this message for the "ongoing" build case, and if >> we can't, disable them entirely. >> >> *Flaky Builders* >> >> ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250 >> >> We have many build bots which are not entirely stable. It's gotten to >> the point where I *expect* failure notifications on literally every change >> I land. I've been trying to reach out to individual build bot owners to >> get issues resolved, and to their credit, most owners have been very >> responsive. However, we have enough builders that the situation isn't >> getting meaningful better. >> >> Recommendation: Introduce specific "test commits" whose only purpose is >> to run the CI infrastructure. Any builder which notifies of failure on >> such a commit (and only said commit) is disabled without discussion until >> human action is taken by the bot owner to re-enable. The idea here is to >> a) automate the process, and b) shift the responsibility of action to the >> bot owner for any flaky bot. >> >> Note: By "disabled", I specifically mean that *notification* is >> disabled. Leaving it in the waterfall view is fine, as long as we're not >> sending out email about it. >> >> Aside: It's really tempting to attempt to separate builders which are >> "still failing" (e.g. a rare configuration which has been broken for a few >> days) from "flaky" ones. I'd argue any bot notifying on a "still failing" >> case is buggy, and thus it's fine to treat them the same as a "flaky" bot. >> >> >> *Slow Builders and Redundant Notices * >> >> ex: https://lab.llvm.org/buildbot#builders/67/builds/4128 >> >> Occasionally, we have a bad commit land which breaks every (or nearly >> every) builder. That happens. If you happen to land a change just before >> or after it, you then get on the blame list for every slow running builder >> we have (since they tend to have large commit windows) if they happen to >> cycle before the fix is committed. This is particularly annoying since the >> root issue is likely fixed quickly, but due to cycle times on the builders, >> you may be getting emails for 24 hours to come. >> >> Recommendation: Introduce a new requirement for "slow" builders (say >> cycle time of > 30 minutes) either a) have a maximum commit window of ~15 >> commits, or b) use a staged builder model. Personally, I'd prefer the >> staged model, but the max commit window at least helps to limit the >> damage. >> >> By "staged builder model", I mean that slow builders only build points in >> the history which have already been successfully build by one of the fast >> builders. This eliminates redundant build failures, at the cost of >> delaying the slow builder slightly. As long as the slow builder uses the >> "last good commit" as opposed to waiting until the current fast builder >> finishes, the delay should be very minimal for most commits. >> > > Does buildbot support staged builders? That would really be ideal indeed! > If we could also disable notification to the blamelist when it is larger > than 5, that'd be great! > > I'll be honest here and say I don't know what buildbot natively supports. > Even if it doesn't, there are "easy" process workarounds to achieve the > same effect. Just as an example (i.e. definitely not proposing this as > technical solution to be implemented right now), we could introduce a new > branch in git called e.g. "buildbot-tracking-slow" and have a specific fast > builder do a fast forward merge from main into this branch. All "slow" > builders would simply follow this branch and not main. > > If we get consensus that this is the right approach, I am willing to put > some of my own time to figuring out how to implement this. For my own > volunteer time, I'd probably start with the flaky bot test commit piece > just because that's much easier to do manually first and then automate, and > because I find them personally more annoying. >There was some prototype buildbot based (so far as I can tell) staged builder setup years ago, it seems: https://marc.info/?l=cfe-dev&m=136442525121902&w=2 - perhaps there's some commit history in the zorg repo that shows how it was configured.> Your point about disabling notification on a blamelist larger than 5 seems > reasonable to me, but I'd definitely consider "build chunks of no more than > N commits" and "build arbitrary sets, but only notify if less than M > people" as distinct possibilities to be evaluated independently. >"chunks of no more than N commits" risks slower builders getting behind - presumably you'd need some catchup mechanism (run a build with all the available changes, but don't notify) if they just kept falling further and further behind. - Dave -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210910/70a02d83/attachment.html>
Mehdi AMINI via llvm-dev
2021-Sep-11 00:57 UTC
[llvm-dev] False positive notifications around commit notifications
On Fri, Sep 10, 2021 at 2:03 PM David Blaikie <dblaikie at gmail.com> wrote:> The folks on the Infrastructure Working Group: > https://foundation.llvm.org/docs/infrastructure-wg/ might have some > context on this (I think Christian Kuhnel is part of that & he's on this > list). > > On Fri, Sep 10, 2021 at 11:53 AM Philip Reames via llvm-dev < > llvm-dev at lists.llvm.org> wrote: > >> >> On 9/10/21 11:36 AM, Mehdi AMINI wrote: >> >> >> >> On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev < >> llvm-dev at lists.llvm.org> wrote: >> >>> I've been noticing a trend where there is more and more false positive >>> email notifications sent out on valid commits. This is getting really >>> problematic as real signal is being lost in the noise. I've had several >>> cases in the last few weeks where I did not see a "real" failure notice >>> because it was buried in a bunch of false positives. >>> >>> Let me run through a few sources of what I consider false positives, and >>> suggest a couple things we could do to clean these up. Note that the >>> recommendations here are entirely independent and we can adopt any subset. >>> >>> *Slow Try Bots* >>> >>> ex: "This revision was landed with ongoing or failed builds." on >>> https://reviews.llvm.org/D109091 >>> >>> Someone - I'm not really sure who - enabled builds for all reviews, and >>> this notice on landed commits. Given it's utterly routine to make a last >>> few style fixes before landing an LGTMed change >>> >> >> I do such "few style fixes", but I don't re-upload a revision before >> landing, so I don't see this "false positive" in general. >> >> I don't explicit upload the final patch either, but something in the >> close automation does. >> >> What I frequently see is that the pre-merge config is broken for some >> other reason, and that's quite annoying. One aspect of the issue is that >> the is no buildbot tracking the pre-merge configuration so it can be broken >> without notification (there is a buildkite job tracking it, but buildkite >> does not support blamelist notifications). >> >> Hm, maybe I misinterpreted the cause of these entirely? Your explanation >> sounds plausible as well. >> >> If your explanation is correct, that would lean strongly to the "just >> disable" option. >> >> Do you know who to contact about this? (i.e. Who owns the automation >> here? Or where is the appropriate code to adjust?) >> > Looks like: https://reviews.llvm.org/harbormaster/plan/5/ indicates it > was setup by https://reviews.llvm.org/p/goncharov/ (added them to the > "to" line on this email) Also looks like Kuhnel ( > https://reviews.llvm.org/H576 ) has sometthing to do with it, so I've > added them too. > >> , I consider this notice complete noise. In practice, almost review gets >>> tagged this way. To be clear, there is value in being told about changes >>> which don't build. The false positive part is only around the "ongoing" >>> builds. >>> >>> Recommendation: Disable this message for the "ongoing" build case, and >>> if we can't, disable them entirely. >>> >>> *Flaky Builders* >>> >>> ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250 >>> >>> We have many build bots which are not entirely stable. It's gotten to >>> the point where I *expect* failure notifications on literally every change >>> I land. I've been trying to reach out to individual build bot owners to >>> get issues resolved, and to their credit, most owners have been very >>> responsive. However, we have enough builders that the situation isn't >>> getting meaningful better. >>> >>> Recommendation: Introduce specific "test commits" whose only purpose is >>> to run the CI infrastructure. Any builder which notifies of failure on >>> such a commit (and only said commit) is disabled without discussion until >>> human action is taken by the bot owner to re-enable. The idea here is to >>> a) automate the process, and b) shift the responsibility of action to the >>> bot owner for any flaky bot. >>> >>> Note: By "disabled", I specifically mean that *notification* is >>> disabled. Leaving it in the waterfall view is fine, as long as we're not >>> sending out email about it. >>> >>> Aside: It's really tempting to attempt to separate builders which are >>> "still failing" (e.g. a rare configuration which has been broken for a few >>> days) from "flaky" ones. I'd argue any bot notifying on a "still failing" >>> case is buggy, and thus it's fine to treat them the same as a "flaky" bot. >>> >>> >>> *Slow Builders and Redundant Notices * >>> >>> ex: https://lab.llvm.org/buildbot#builders/67/builds/4128 >>> >>> Occasionally, we have a bad commit land which breaks every (or nearly >>> every) builder. That happens. If you happen to land a change just before >>> or after it, you then get on the blame list for every slow running builder >>> we have (since they tend to have large commit windows) if they happen to >>> cycle before the fix is committed. This is particularly annoying since the >>> root issue is likely fixed quickly, but due to cycle times on the builders, >>> you may be getting emails for 24 hours to come. >>> >>> Recommendation: Introduce a new requirement for "slow" builders (say >>> cycle time of > 30 minutes) either a) have a maximum commit window of ~15 >>> commits, or b) use a staged builder model. Personally, I'd prefer the >>> staged model, but the max commit window at least helps to limit the >>> damage. >>> >>> By "staged builder model", I mean that slow builders only build points >>> in the history which have already been successfully build by one of the >>> fast builders. This eliminates redundant build failures, at the cost of >>> delaying the slow builder slightly. As long as the slow builder uses the >>> "last good commit" as opposed to waiting until the current fast builder >>> finishes, the delay should be very minimal for most commits. >>> >> >> Does buildbot support staged builders? That would really be ideal indeed! >> If we could also disable notification to the blamelist when it is larger >> than 5, that'd be great! >> >> I'll be honest here and say I don't know what buildbot natively >> supports. Even if it doesn't, there are "easy" process workarounds to >> achieve the same effect. Just as an example (i.e. definitely not proposing >> this as technical solution to be implemented right now), we could introduce >> a new branch in git called e.g. "buildbot-tracking-slow" and have a >> specific fast builder do a fast forward merge from main into this branch. >> All "slow" builders would simply follow this branch and not main. >> >> If we get consensus that this is the right approach, I am willing to put >> some of my own time to figuring out how to implement this. For my own >> volunteer time, I'd probably start with the flaky bot test commit piece >> just because that's much easier to do manually first and then automate, and >> because I find them personally more annoying. >> > There was some prototype buildbot based (so far as I can tell) staged > builder setup years ago, it seems: > https://marc.info/?l=cfe-dev&m=136442525121902&w=2 - perhaps there's some > commit history in the zorg repo that shows how it was configured. > >> Your point about disabling notification on a blamelist larger than 5 >> seems reasonable to me, but I'd definitely consider "build chunks of no >> more than N commits" and "build arbitrary sets, but only notify if less >> than M people" as distinct possibilities to be evaluated independently. >> > "chunks of no more than N commits" risks slower builders getting behind - > presumably you'd need some catchup mechanism (run a build with all the > available changes, but don't notify) if they just kept falling further and > further behind. >Slightly off-topic, but relevant for slow builders: switching to using `ccache` made a huge difference in for one of our builder ( https://lab.llvm.org/buildbot/#/builders/61 ) ; it'll occasionally spend 25-30 min but otherwise is mostly taking only ~4min. Maybe we could advise this more widely as well? -- Mehdi>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210910/4f7e3eac/attachment.html>