David Blaikie via llvm-dev
2021-Oct-11 17:56 UTC
[llvm-dev] False positive notifications around commit notifications
Here's a fun one: https://lab.llvm.org/buildbot/#/builders/164/builds/3428 - a buildbot failure with a single blame (me) - but I hadn't committed in the last few days, so I was confused. Turns out it's from a change committed 3 months ago - and the failure is a timeout.

Given the number of buildbot timeout false positives, I honestly wouldn't be averse to saying timeouts shouldn't produce fail-mail & are the responsibility of buildbot owners to triage. I realize we can actually submit code that leads to timeouts, but on balance that seems rare compared to the number of times it's a buildbot configuration issue instead. (though open to debate on that for sure)

On Wed, Oct 6, 2021 at 4:08 AM Nemanja Ivanovic via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> I wonder if it would be possible to make some recommendations for improvements based on data rather than our collective anecdotal experience. Much like anyone else, I feel that the vast majority of the failure emails I get are not related, but I would have a lot of trouble quantifying it any better than a "gut feeling".
>
> Would it be possible to somehow acquire historical data from buildbots to help identify things that can improve? Perhaps:
> - Bot failures where none of the commits were reverted before the bot went back to green
> - For those failures, collect the test cases that failed - those might be flaky test cases if they show up frequently and/or on multiple bots
> - For bots that have many such instances (especially with different test cases every time), perhaps the bot itself is somehow flaky
>
> This is definitely an annoying problem that has significant consequences (real failures being missed due to many false failures), but it is a difficult problem to solve.
>
> On Wed, Sep 22, 2021 at 5:50 AM Martin Storsjö via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
>> On Wed, 22 Sep 2021, Florian Hahn via llvm-dev wrote:
>>
>> > Thanks for raising this issue! My experience matches what you are describing. The false positive rate for me seems to be at least 10 false positives due to flakiness to 1 real failure.
>> >
>> > I think it would be good to have some sort of policy spelling out the requirements for having notification enabled for a buildbot, with a process that makes it easy to disable flaky bots until the owners can make them more stable. It would be good if notifications could be disabled without requiring contacting/interventions from individual owners, but I am not sure if that’s possible with buildbot.
>>
>> Another aspect is that some tests can be flaky - they might work seemingly fine in local testing but start showing up as timeouts/spurious failures when run in a CI/buildbot setting. And due to their flakiness, it's not evident when the breakage is introduced, but over time, such flaky tests/setups do add up to the situation we have today.
>>
>> // Martin
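
For what it's worth, the heuristic Nemanja describes in the quoted message (find red streaks that went back to green without any blamed commit being reverted, then count which tests failed in them) could be sketched roughly as below. This is only an illustration: the `Build` record and its fields are hypothetical, not the buildbot data model, and it assumes the build history for one builder has already been exported in oldest-to-newest order.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Build:
    """Hypothetical record for one build of one builder (not buildbot's schema)."""
    passed: bool
    commits: tuple = ()       # hashes first tested in this build
    failed_tests: tuple = ()  # names of tests that failed in this build
    reverted: tuple = ()      # hashes reverted by commits in this build


def suspect_flaky_tests(builds):
    """Count tests that failed in red streaks which ended without a revert.

    `builds` must be ordered oldest to newest for a single builder.
    Each test is counted at most once per red streak.
    """
    counts = Counter()
    blamed, reverted, tests = set(), set(), set()
    in_streak = False
    for build in builds:
        if not build.passed:
            in_streak = True
            blamed.update(build.commits)
            reverted.update(build.reverted)
            tests.update(build.failed_tests)
        elif in_streak:
            # The green build that ends the streak may itself carry the revert.
            reverted.update(build.reverted)
            if not (blamed & reverted):
                counts.update(tests)
            blamed, reverted, tests = set(), set(), set()
            in_streak = False
    return counts
```

Tests with a high count, especially across several bots, would be candidates for flakiness. A bot with many such streaks but different tests each time would point at the bot itself rather than the tests.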
Michael Kruse via llvm-dev
2021-Oct-11 19:06 UTC
[llvm-dev] False positive notifications around commit notifications
On Mon, Oct 11, 2021 at 12:57 PM David Blaikie via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Here's a fun one: https://lab.llvm.org/buildbot/#/builders/164/builds/3428 - a buildbot failure with a single blame (me) - but I hadn't committed in the last few days, so I was confused. Turns out it's from a change committed 3 months ago - and the failure is a timeout.
>
> Given the number of buildbot timeout false positives, I honestly wouldn't be averse to saying timeouts shouldn't produce fail-mail & are the responsibility of buildbot owners to triage. I realize we can actually submit code that leads to timeouts, but on balance that seems rare compared to the number of times it's a buildbot configuration issue instead. (though open to debate on that for sure)

Wow, that bot does not collapse buildrequests and is indeed 3 months behind because it is not fast enough to keep up with LLVM's commit rate. Even if the bot were reliable, getting notified 3 months later isn't useful.

From the wildly varying duration of the test step (5 - 33 minutes; not the build step, which is doing incremental builds), I assume that the worker is running other things in parallel, maybe another worker, such that the build job is sometimes starved, causing the timeout. IMHO buildbots should not run other heavy jobs in parallel.

Michael
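
As a rough illustration of the duration argument above, a check like the following could flag a builder whose test-step time varies enough to suggest contention on the worker. The function names and the 3x threshold are assumptions made for this sketch, not anything buildbot provides, and the durations are assumed to have been collected separately.

```python
def duration_spread(minutes):
    """Return the ratio of the longest to the shortest step duration."""
    return max(minutes) / min(minutes)


def looks_contended(minutes, max_ratio=3.0):
    """Guess whether a step's run time varies too much for a dedicated worker.

    `max_ratio` is an arbitrary threshold chosen for illustration only.
    Durations must be positive; fewer than two samples are never flagged.
    """
    return len(minutes) >= 2 and duration_spread(minutes) > max_ratio
```

With the 5 and 33 minute extremes mentioned above, the ratio is 6.6, so such a builder would be flagged under this threshold.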