thr3ads.net - llvm dev - [llvm-dev] False positive notifications around commit notifications [Sep 2021]

If this information is useful, please help other people find it:
Share via:

David Blaikie via llvm-dev

2021-Sep-10 21:02 UTC

[llvm-dev] False positive notifications around commit notifications

The folks on the Infrastructure Working Group:
https://foundation.llvm.org/docs/infrastructure-wg/ might have some context
on this (I think Christian Kuhnel is part of that & he's on this list).

On Fri, Sep 10, 2021 at 11:53 AM Philip Reames via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
> On 9/10/21 11:36 AM, Mehdi AMINI wrote:
>
>
>
> On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> I've been noticing a trend where there is more and more false
positive
>> email notifications sent out on valid commits.  This is getting really
>> problematic as real signal is being lost in the noise.  I've had
several
>> cases in the last few weeks where I did not see a "real"
failure notice
>> because it was buried in a bunch of false positives.
>>
>> Let me run through a few sources of what I consider false positives,
and
>> suggest a couple things we could do to clean these up.  Note that the
>> recommendations here are entirely independent and we can adopt any
subset.
>>
>> *Slow Try Bots*
>>
>> ex: "This revision was landed with ongoing or failed builds."
on
>> https://reviews.llvm.org/D109091
>>
>> Someone - I'm not really sure who - enabled builds for all reviews,
and
>> this notice on landed commits.  Given it's utterly routine to make
a last
>> few style fixes before landing an LGTMed change
>>
>
> I do such "few style fixes", but I don't re-upload a revision
before
> landing, so I don't see this "false positive" in general.
>
> I don't explicit upload the final patch either, but something in the
close
> automation does.
>
> What I frequently see is that the pre-merge config is broken for some
> other reason, and that's quite annoying. One aspect of the issue is
that
> the is no buildbot tracking the pre-merge configuration so it can be broken
> without notification (there is a buildkite job tracking it, but buildkite
> does not support blamelist notifications).
>
> Hm, maybe I misinterpreted the cause of these entirely?  Your explanation
> sounds plausible as well.
>
> If your explanation is correct, that would lean strongly to the "just
> disable" option.
>
> Do you know who to contact about this?  (i.e. Who owns the automation
> here?  Or where is the appropriate code to adjust?)
>Looks like: https://reviews.llvm.org/harbormaster/plan/5/ indicates it was
setup by https://reviews.llvm.org/p/goncharov/ (added them to the "to"
line
on this email) Also looks like Kuhnel ( https://reviews.llvm.org/H576 ) has
sometthing to do with it, so I've added them too.
> , I consider this notice complete noise.  In practice, almost review gets
>> tagged this way.  To be clear, there is value in being told about
changes
>> which don't build.  The false positive part is only around the
"ongoing"
>> builds.
>>
>> Recommendation: Disable this message for the "ongoing" build
case, and if
>> we can't, disable them entirely.
>>
>> *Flaky Builders*
>>
>> ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250
>>
>> We have many build bots which are not entirely stable.  It's gotten
to
>> the point where I *expect* failure notifications on literally every
change
>> I land.  I've been trying to reach out to individual build bot
owners to
>> get issues resolved, and to their credit, most owners have been very
>> responsive.  However, we have enough builders that the situation
isn't
>> getting meaningful better.
>>
>> Recommendation: Introduce specific "test commits" whose only
purpose is
>> to run the CI infrastructure.  Any builder which notifies of failure on
>> such a commit (and only said commit) is disabled without discussion
until
>> human action is taken by the bot owner to re-enable.  The idea here is
to
>> a) automate the process, and b) shift the responsibility of action to
the
>> bot owner for any flaky bot.
>>
>> Note: By "disabled", I specifically mean that *notification*
is
>> disabled.  Leaving it in the waterfall view is fine, as long as
we're not
>> sending out email about it.
>>
>> Aside: It's really tempting to attempt to separate builders which
are
>> "still failing" (e.g. a rare configuration which has been
broken for a few
>> days) from "flaky" ones.  I'd argue any bot notifying on
a "still failing"
>> case is buggy, and thus it's fine to treat them the same as a
"flaky" bot.
>>
>>
>> *Slow Builders and Redundant Notices *
>>
>> ex: https://lab.llvm.org/buildbot#builders/67/builds/4128
>>
>> Occasionally, we have a bad commit land which breaks every (or nearly
>> every) builder.  That happens.  If you happen to land a change just
before
>> or after it, you then get on the blame list for every slow running
builder
>> we have (since they tend to have large commit windows) if they happen
to
>> cycle before the fix is committed.  This is particularly annoying since
the
>> root issue is likely fixed quickly, but due to cycle times on the
builders,
>> you may be getting emails for 24 hours to come.
>>
>> Recommendation: Introduce a new requirement for "slow"
builders (say
>> cycle time of > 30 minutes) either a) have a maximum commit window
of ~15
>> commits, or b) use a staged builder model.  Personally, I'd prefer
the
>> staged model, but the max commit window at least helps to limit the
>> damage.
>>
>> By "staged builder model", I mean that slow builders only
build points in
>> the history which have already been successfully build by one of the
fast
>> builders.  This eliminates redundant build failures, at the cost of
>> delaying the slow builder slightly.  As long as the slow builder uses
the
>> "last good commit" as opposed to waiting until the current
fast builder
>> finishes, the delay should be very minimal for most commits.
>>
>
> Does buildbot support staged builders? That would really be ideal indeed!
> If we could also disable notification to the blamelist when it is larger
> than 5, that'd be great!
>
> I'll be honest here and say I don't know what buildbot natively
supports.
> Even if it doesn't, there are "easy" process workarounds to
achieve the
> same effect.  Just as an example (i.e. definitely not proposing this as
> technical solution to be implemented right now), we could introduce a new
> branch in git called e.g. "buildbot-tracking-slow" and have a
specific fast
> builder do a fast forward merge from main into this branch.  All
"slow"
> builders would simply follow this branch and not main.
>
> If we get consensus that this is the right approach, I am willing to put
> some of my own time to figuring out how to implement this.  For my own
> volunteer time, I'd probably start with the flaky bot test commit piece
> just because that's much easier to do manually first and then automate,
and
> because I find them personally more annoying.
>There was some prototype buildbot based (so far as I can tell) staged
builder setup years ago, it seems:
https://marc.info/?l=cfe-dev&m=136442525121902&w=2 - perhaps there's
some
commit history in the zorg repo that shows how it was configured.
> Your point about disabling notification on a blamelist larger than 5 seems
> reasonable to me, but I'd definitely consider "build chunks of no
more than
> N commits" and "build arbitrary sets, but only notify if less
than M
> people" as distinct possibilities to be evaluated independently.
>"chunks of no more than N commits" risks slower builders getting
behind -
presumably you'd need some catchup mechanism (run a build with all the
available changes, but don't notify) if they just kept falling further and
further behind.

- Dave
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210910/70a02d83/attachment.html>

Mehdi AMINI via llvm-dev

2021-Sep-11 00:57 UTC

head link

[llvm-dev] False positive notifications around commit notifications

On Fri, Sep 10, 2021 at 2:03 PM David Blaikie <dblaikie at gmail.com>
wrote:
> The folks on the Infrastructure Working Group:
> https://foundation.llvm.org/docs/infrastructure-wg/ might have some
> context on this (I think Christian Kuhnel is part of that & he's on
this
> list).
>
> On Fri, Sep 10, 2021 at 11:53 AM Philip Reames via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>
>> On 9/10/21 11:36 AM, Mehdi AMINI wrote:
>>
>>
>>
>> On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> I've been noticing a trend where there is more and more false
positive
>>> email notifications sent out on valid commits.  This is getting
really
>>> problematic as real signal is being lost in the noise.  I've
had several
>>> cases in the last few weeks where I did not see a "real"
failure notice
>>> because it was buried in a bunch of false positives.
>>>
>>> Let me run through a few sources of what I consider false
positives, and
>>> suggest a couple things we could do to clean these up.  Note that
the
>>> recommendations here are entirely independent and we can adopt any
subset.
>>>
>>> *Slow Try Bots*
>>>
>>> ex: "This revision was landed with ongoing or failed
builds." on
>>> https://reviews.llvm.org/D109091
>>>
>>> Someone - I'm not really sure who - enabled builds for all
reviews, and
>>> this notice on landed commits.  Given it's utterly routine to
make a last
>>> few style fixes before landing an LGTMed change
>>>
>>
>> I do such "few style fixes", but I don't re-upload a
revision before
>> landing, so I don't see this "false positive" in general.
>>
>> I don't explicit upload the final patch either, but something in
the
>> close automation does.
>>
>> What I frequently see is that the pre-merge config is broken for some
>> other reason, and that's quite annoying. One aspect of the issue is
that
>> the is no buildbot tracking the pre-merge configuration so it can be
broken
>> without notification (there is a buildkite job tracking it, but
buildkite
>> does not support blamelist notifications).
>>
>> Hm, maybe I misinterpreted the cause of these entirely?  Your
explanation
>> sounds plausible as well.
>>
>> If your explanation is correct, that would lean strongly to the
"just
>> disable" option.
>>
>> Do you know who to contact about this?  (i.e. Who owns the automation
>> here?  Or where is the appropriate code to adjust?)
>>
> Looks like: https://reviews.llvm.org/harbormaster/plan/5/ indicates it
> was setup by https://reviews.llvm.org/p/goncharov/ (added them to the
> "to" line on this email) Also looks like Kuhnel (
> https://reviews.llvm.org/H576 ) has sometthing to do with it, so I've
> added them too.
>
>> , I consider this notice complete noise.  In practice, almost review
gets
>>> tagged this way.  To be clear, there is value in being told about
changes
>>> which don't build.  The false positive part is only around the
"ongoing"
>>> builds.
>>>
>>> Recommendation: Disable this message for the "ongoing"
build case, and
>>> if we can't, disable them entirely.
>>>
>>> *Flaky Builders*
>>>
>>> ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250
>>>
>>> We have many build bots which are not entirely stable.  It's
gotten to
>>> the point where I *expect* failure notifications on literally every
change
>>> I land.  I've been trying to reach out to individual build bot
owners to
>>> get issues resolved, and to their credit, most owners have been
very
>>> responsive.  However, we have enough builders that the situation
isn't
>>> getting meaningful better.
>>>
>>> Recommendation: Introduce specific "test commits" whose
only purpose is
>>> to run the CI infrastructure.  Any builder which notifies of
failure on
>>> such a commit (and only said commit) is disabled without discussion
until
>>> human action is taken by the bot owner to re-enable.  The idea here
is to
>>> a) automate the process, and b) shift the responsibility of action
to the
>>> bot owner for any flaky bot.
>>>
>>> Note: By "disabled", I specifically mean that
*notification* is
>>> disabled.  Leaving it in the waterfall view is fine, as long as
we're not
>>> sending out email about it.
>>>
>>> Aside: It's really tempting to attempt to separate builders
which are
>>> "still failing" (e.g. a rare configuration which has been
broken for a few
>>> days) from "flaky" ones.  I'd argue any bot notifying
on a "still failing"
>>> case is buggy, and thus it's fine to treat them the same as a
"flaky" bot.
>>>
>>>
>>> *Slow Builders and Redundant Notices *
>>>
>>> ex: https://lab.llvm.org/buildbot#builders/67/builds/4128
>>>
>>> Occasionally, we have a bad commit land which breaks every (or
nearly
>>> every) builder.  That happens.  If you happen to land a change just
before
>>> or after it, you then get on the blame list for every slow running
builder
>>> we have (since they tend to have large commit windows) if they
happen to
>>> cycle before the fix is committed.  This is particularly annoying
since the
>>> root issue is likely fixed quickly, but due to cycle times on the
builders,
>>> you may be getting emails for 24 hours to come.
>>>
>>> Recommendation: Introduce a new requirement for "slow"
builders (say
>>> cycle time of > 30 minutes) either a) have a maximum commit
window of ~15
>>> commits, or b) use a staged builder model.  Personally, I'd
prefer the
>>> staged model, but the max commit window at least helps to limit the
>>> damage.
>>>
>>> By "staged builder model", I mean that slow builders only
build points
>>> in the history which have already been successfully build by one of
the
>>> fast builders.  This eliminates redundant build failures, at the
cost of
>>> delaying the slow builder slightly.  As long as the slow builder
uses the
>>> "last good commit" as opposed to waiting until the
current fast builder
>>> finishes, the delay should be very minimal for most commits.
>>>
>>
>> Does buildbot support staged builders? That would really be ideal
indeed!
>> If we could also disable notification to the blamelist when it is
larger
>> than 5, that'd be great!
>>
>> I'll be honest here and say I don't know what buildbot natively
>> supports.  Even if it doesn't, there are "easy" process
workarounds to
>> achieve the same effect.  Just as an example (i.e. definitely not
proposing
>> this as technical solution to be implemented right now), we could
introduce
>> a new branch in git called e.g. "buildbot-tracking-slow" and
have a
>> specific fast builder do a fast forward merge from main into this
branch.
>> All "slow" builders would simply follow this branch and not
main.
>>
>> If we get consensus that this is the right approach, I am willing to
put
>> some of my own time to figuring out how to implement this.  For my own
>> volunteer time, I'd probably start with the flaky bot test commit
piece
>> just because that's much easier to do manually first and then
automate, and
>> because I find them personally more annoying.
>>
> There was some prototype buildbot based (so far as I can tell) staged
> builder setup years ago, it seems:
> https://marc.info/?l=cfe-dev&m=136442525121902&w=2 - perhaps
there's some
> commit history in the zorg repo that shows how it was configured.
>
>> Your point about disabling notification on a blamelist larger than 5
>> seems reasonable to me, but I'd definitely consider "build
chunks of no
>> more than N commits" and "build arbitrary sets, but only
notify if less
>> than M people" as distinct possibilities to be evaluated
independently.
>>
> "chunks of no more than N commits" risks slower builders getting
behind -
> presumably you'd need some catchup mechanism (run a build with all the
> available changes, but don't notify) if they just kept falling further
and
> further behind.
>
Slightly off-topic, but relevant for slow builders: switching to using
`ccache` made a huge difference in for one of our builder (
https://lab.llvm.org/buildbot/#/builders/61 ) ; it'll occasionally spend
25-30 min but otherwise is mostly taking only ~4min. Maybe we could advise
this more widely as well?

-- 
Mehdi
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210910/4f7e3eac/attachment.html>

llvm dev - Sep 2021 - False positive notifications around commit notifications

[llvm-dev] False positive notifications around commit notifications

[llvm-dev] False positive notifications around commit notifications