thr3ads.net - llvm dev - [llvm-dev] False positive notifications around commit notifications [Sep 2021]

If this information is useful, please help other people find it:
Share via:

Philip Reames via llvm-dev

2021-Sep-10 18:52 UTC

[llvm-dev] False positive notifications around commit notifications

On 9/10/21 11:36 AM, Mehdi AMINI wrote:>
>
> On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev 
> <llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>> wrote:
>
>     I've been noticing a trend where there is more and more false
>     positive email notifications sent out on valid commits.  This is
>     getting really problematic as real signal is being lost in the
>     noise.  I've had several cases in the last few weeks where I did
>     not see a "real" failure notice because it was buried in a
bunch
>     of false positives.
>
>     Let me run through a few sources of what I consider false
>     positives, and suggest a couple things we could do to clean these
>     up.  Note that the recommendations here are entirely independent
>     and we can adopt any subset.
>
>     *Slow Try Bots*
>
>     ex: "This revision was landed with ongoing or failed builds."
on
>     https://reviews.llvm.org/D109091
<https://reviews.llvm.org/D109091>
>
>     Someone - I'm not really sure who - enabled builds for all
>     reviews, and this notice on landed commits.  Given it's utterly
>     routine to make a last few style fixes before landing an LGTMed change
>
>
> I do such "few style fixes", but I don't re-upload a revision
before
> landing, so I don't see this "false positive" in general.I don't explicit upload the final patch either, but something in the 
close automation does.> What I frequently see is that the pre-merge config is broken for some 
> other reason, and that's quite annoying. One aspect of the issue is 
> that the is no buildbot tracking the pre-merge configuration so it can 
> be broken without notification (there is a buildkite job tracking it, 
> but buildkite does not support blamelist notifications).
Hm, maybe I misinterpreted the cause of these entirely?  Your 
explanation sounds plausible as well.

If your explanation is correct, that would lean strongly to the "just 
disable" option.

Do you know who to contact about this?  (i.e. Who owns the automation 
here?  Or where is the appropriate code to adjust?)
>
>     , I consider this notice complete noise.  In practice, almost
>     review gets tagged this way.  To be clear, there is value in being
>     told about changes which don't build. The false positive part is
>     only around the "ongoing" builds.
>
>     Recommendation: Disable this message for the "ongoing" build
case,
>     and if we can't, disable them entirely.
>
>     *Flaky Builders*
>
>     ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250
>     <https://lab.llvm.org/buildbot/#/builders/68/builds/18250>
>
>     We have many build bots which are not entirely stable. It's gotten
>     to the point where I *expect* failure notifications on literally
>     every change I land.  I've been trying to reach out to individual
>     build bot owners to get issues resolved, and to their credit, most
>     owners have been very responsive.  However, we have enough
>     builders that the situation isn't getting meaningful better.
>
>     Recommendation: Introduce specific "test commits" whose only
>     purpose is to run the CI infrastructure.  Any builder which
>     notifies of failure on such a commit (and only said commit) is
>     disabled without discussion until human action is taken by the bot
>     owner to re-enable. The idea here is to a) automate the process,
>     and b) shift the responsibility of action to the bot owner for any
>     flaky bot.
>
>     Note: By "disabled", I specifically mean that *notification*
is
>     disabled.  Leaving it in the waterfall view is fine, as long as
>     we're not sending out email about it.
>
>     Aside: It's really tempting to attempt to separate builders which
>     are "still failing" (e.g. a rare configuration which has been
>     broken for a few days) from "flaky" ones.  I'd argue any
bot
>     notifying on a "still failing" case is buggy, and thus
it's fine
>     to treat them the same as a "flaky" bot.
>
>     *Slow Builders and Redundant Notices
>     *
>
>     ex: https://lab.llvm.org/buildbot#builders/67/builds/4128
>     <https://lab.llvm.org/buildbot#builders/67/builds/4128>
>
>     Occasionally, we have a bad commit land which breaks every (or
>     nearly every) builder.  That happens.  If you happen to land a
>     change just before or after it, you then get on the blame list for
>     every slow running builder we have (since they tend to have large
>     commit windows) if they happen to cycle before the fix is
>     committed.  This is particularly annoying since the root issue is
>     likely fixed quickly, but due to cycle times on the builders, you
>     may be getting emails for 24 hours to come.
>
>     Recommendation: Introduce a new requirement for "slow"
builders
>     (say cycle time of > 30 minutes) either a) have a maximum commit
>     window of ~15 commits, or b) use a staged builder model. 
>     Personally, I'd prefer the staged model, but the max commit window
>     at least helps to limit the damage.
>
>     By "staged builder model", I mean that slow builders only
build
>     points in the history which have already been successfully build
>     by one of the fast builders.  This eliminates redundant build
>     failures, at the cost of delaying the slow builder slightly.  As
>     long as the slow builder uses the "last good commit" as
opposed to
>     waiting until the current fast builder finishes, the delay should
>     be very minimal for most commits.
>
>
> Does buildbot support staged builders? That would really be ideal indeed!
> If we could also disable notification to the blamelist when it is 
> larger than 5, that'd be great!
I'll be honest here and say I don't know what buildbot natively 
supports.  Even if it doesn't, there are "easy" process
workarounds to
achieve the same effect.  Just as an example (i.e. definitely not 
proposing this as technical solution to be implemented right now), we 
could introduce a new branch in git called e.g.
"buildbot-tracking-slow"
and have a specific fast builder do a fast forward merge from main into 
this branch.  All "slow" builders would simply follow this branch and 
not main.

If we get consensus that this is the right approach, I am willing to put 
some of my own time to figuring out how to implement this. For my own 
volunteer time, I'd probably start with the flaky bot test commit piece 
just because that's much easier to do manually first and then automate, 
and because I find them personally more annoying.

Your point about disabling notification on a blamelist larger than 5 
seems reasonable to me, but I'd definitely consider "build chunks of no
more than N commits" and "build arbitrary sets, but only notify if
less
than M people" as distinct possibilities to be evaluated independently.
>
> Cheers,
>
> -- 
> Mehdi
>
>     Philip
>
>     _______________________________________________
>     LLVM Developers mailing list
>     llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>     https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>     <https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210910/a11c3c0c/attachment-0001.html>

David Blaikie via llvm-dev

2021-Sep-10 21:02 UTC

head link

[llvm-dev] False positive notifications around commit notifications

The folks on the Infrastructure Working Group:
https://foundation.llvm.org/docs/infrastructure-wg/ might have some context
on this (I think Christian Kuhnel is part of that & he's on this list).

On Fri, Sep 10, 2021 at 11:53 AM Philip Reames via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
> On 9/10/21 11:36 AM, Mehdi AMINI wrote:
>
>
>
> On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> I've been noticing a trend where there is more and more false
positive
>> email notifications sent out on valid commits.  This is getting really
>> problematic as real signal is being lost in the noise.  I've had
several
>> cases in the last few weeks where I did not see a "real"
failure notice
>> because it was buried in a bunch of false positives.
>>
>> Let me run through a few sources of what I consider false positives,
and
>> suggest a couple things we could do to clean these up.  Note that the
>> recommendations here are entirely independent and we can adopt any
subset.
>>
>> *Slow Try Bots*
>>
>> ex: "This revision was landed with ongoing or failed builds."
on
>> https://reviews.llvm.org/D109091
>>
>> Someone - I'm not really sure who - enabled builds for all reviews,
and
>> this notice on landed commits.  Given it's utterly routine to make
a last
>> few style fixes before landing an LGTMed change
>>
>
> I do such "few style fixes", but I don't re-upload a revision
before
> landing, so I don't see this "false positive" in general.
>
> I don't explicit upload the final patch either, but something in the
close
> automation does.
>
> What I frequently see is that the pre-merge config is broken for some
> other reason, and that's quite annoying. One aspect of the issue is
that
> the is no buildbot tracking the pre-merge configuration so it can be broken
> without notification (there is a buildkite job tracking it, but buildkite
> does not support blamelist notifications).
>
> Hm, maybe I misinterpreted the cause of these entirely?  Your explanation
> sounds plausible as well.
>
> If your explanation is correct, that would lean strongly to the "just
> disable" option.
>
> Do you know who to contact about this?  (i.e. Who owns the automation
> here?  Or where is the appropriate code to adjust?)
>Looks like: https://reviews.llvm.org/harbormaster/plan/5/ indicates it was
setup by https://reviews.llvm.org/p/goncharov/ (added them to the "to"
line
on this email) Also looks like Kuhnel ( https://reviews.llvm.org/H576 ) has
sometthing to do with it, so I've added them too.
> , I consider this notice complete noise.  In practice, almost review gets
>> tagged this way.  To be clear, there is value in being told about
changes
>> which don't build.  The false positive part is only around the
"ongoing"
>> builds.
>>
>> Recommendation: Disable this message for the "ongoing" build
case, and if
>> we can't, disable them entirely.
>>
>> *Flaky Builders*
>>
>> ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250
>>
>> We have many build bots which are not entirely stable.  It's gotten
to
>> the point where I *expect* failure notifications on literally every
change
>> I land.  I've been trying to reach out to individual build bot
owners to
>> get issues resolved, and to their credit, most owners have been very
>> responsive.  However, we have enough builders that the situation
isn't
>> getting meaningful better.
>>
>> Recommendation: Introduce specific "test commits" whose only
purpose is
>> to run the CI infrastructure.  Any builder which notifies of failure on
>> such a commit (and only said commit) is disabled without discussion
until
>> human action is taken by the bot owner to re-enable.  The idea here is
to
>> a) automate the process, and b) shift the responsibility of action to
the
>> bot owner for any flaky bot.
>>
>> Note: By "disabled", I specifically mean that *notification*
is
>> disabled.  Leaving it in the waterfall view is fine, as long as
we're not
>> sending out email about it.
>>
>> Aside: It's really tempting to attempt to separate builders which
are
>> "still failing" (e.g. a rare configuration which has been
broken for a few
>> days) from "flaky" ones.  I'd argue any bot notifying on
a "still failing"
>> case is buggy, and thus it's fine to treat them the same as a
"flaky" bot.
>>
>>
>> *Slow Builders and Redundant Notices *
>>
>> ex: https://lab.llvm.org/buildbot#builders/67/builds/4128
>>
>> Occasionally, we have a bad commit land which breaks every (or nearly
>> every) builder.  That happens.  If you happen to land a change just
before
>> or after it, you then get on the blame list for every slow running
builder
>> we have (since they tend to have large commit windows) if they happen
to
>> cycle before the fix is committed.  This is particularly annoying since
the
>> root issue is likely fixed quickly, but due to cycle times on the
builders,
>> you may be getting emails for 24 hours to come.
>>
>> Recommendation: Introduce a new requirement for "slow"
builders (say
>> cycle time of > 30 minutes) either a) have a maximum commit window
of ~15
>> commits, or b) use a staged builder model.  Personally, I'd prefer
the
>> staged model, but the max commit window at least helps to limit the
>> damage.
>>
>> By "staged builder model", I mean that slow builders only
build points in
>> the history which have already been successfully build by one of the
fast
>> builders.  This eliminates redundant build failures, at the cost of
>> delaying the slow builder slightly.  As long as the slow builder uses
the
>> "last good commit" as opposed to waiting until the current
fast builder
>> finishes, the delay should be very minimal for most commits.
>>
>
> Does buildbot support staged builders? That would really be ideal indeed!
> If we could also disable notification to the blamelist when it is larger
> than 5, that'd be great!
>
> I'll be honest here and say I don't know what buildbot natively
supports.
> Even if it doesn't, there are "easy" process workarounds to
achieve the
> same effect.  Just as an example (i.e. definitely not proposing this as
> technical solution to be implemented right now), we could introduce a new
> branch in git called e.g. "buildbot-tracking-slow" and have a
specific fast
> builder do a fast forward merge from main into this branch.  All
"slow"
> builders would simply follow this branch and not main.
>
> If we get consensus that this is the right approach, I am willing to put
> some of my own time to figuring out how to implement this.  For my own
> volunteer time, I'd probably start with the flaky bot test commit piece
> just because that's much easier to do manually first and then automate,
and
> because I find them personally more annoying.
>There was some prototype buildbot based (so far as I can tell) staged
builder setup years ago, it seems:
https://marc.info/?l=cfe-dev&m=136442525121902&w=2 - perhaps there's
some
commit history in the zorg repo that shows how it was configured.
> Your point about disabling notification on a blamelist larger than 5 seems
> reasonable to me, but I'd definitely consider "build chunks of no
more than
> N commits" and "build arbitrary sets, but only notify if less
than M
> people" as distinct possibilities to be evaluated independently.
>"chunks of no more than N commits" risks slower builders getting
behind -
presumably you'd need some catchup mechanism (run a build with all the
available changes, but don't notify) if they just kept falling further and
further behind.

- Dave
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210910/70a02d83/attachment.html>

llvm dev - Sep 2021 - False positive notifications around commit notifications

[llvm-dev] False positive notifications around commit notifications

[llvm-dev] False positive notifications around commit notifications