thr3ads.net - llvm dev - [llvm-dev] Responsibilities of a buildbot owner [Jan 2022]

Mehdi AMINI via llvm-dev

2022-Jan-09 01:14 UTC

[llvm-dev] Responsibilities of a buildbot owner

Hi,

First: thanks a lot Stella for being a bot owner and providing valuable
resources to the community. The sequence of events is really unfortunate
here, and thank you for bringing it to everyone's attention. Let's try
to improve our processes.

On Sat, Jan 8, 2022 at 1:01 PM Philip Reames via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Stella,
>
> Thank you for raising the question.  This is a great discussion for us to
> have publicly.
>
> So folks know, I am the individual Stella mentioned below.  I'll start
> with a bit of history so that everyone's on the same page, then dive into
> the policy question.
>
> My general take is that buildbots are only useful if failure notifications
> are generally actionable.  A couple months back, I was on the edge of
> setting up mail filter rules to auto-delete a bunch of bots because they
> were regularly broken, and decided I should try to be constructive first.
> In the first wave of that, I emailed a couple of bot owners about things
> which seemed like false positives.
>
> At the time, I thought it was the bot owner's responsibility to not be
> testing a flaky configuration.  I got a bit of push back on that from a
> couple sources - Stella was one - and put that question on hold.  This
> thread is a great opportunity to decide what our policy actually is, and
> document it.
>
> In the meantime, I've been working with Galina to document existing
> practice where we could, and to try to identify best practices on setting
> up bots.  These changes have been posted publicly, and reviewed through the
> normal process.  We've been deliberately trying to stick to
> non-controversial stuff as we got the docs improved.  I've been actively
> reaching out to bot owners to gather feedback in this process, but Stella
> had not, yet, been one.
>
> Separately, this week I noticed a bot which was repeatedly toggling
> between red and green.  I forget the exact ratio, but in the recent build
> history, there were multiple transitions, seemingly unrelated to the
> changes being committed.  I emailed Galina asking her to address, and she
> removed the buildbot until it could be moved to the staging buildmaster,
> addressed, and then restored.  I left Stella off the initial email.  Sorry
> about that, no ill intent, just written in a hurry.
>
> Now, transitioning into a bit of policy discussion...
>
> From my conversations with existing bot owners, there is a general
> agreement that bots should only be notifying the community if they are
> stable enough.  There's honest disagreement on what the bar for stable
> enough is, and disagreement about exactly whose responsibility addressing
> new instability is.  (To be clear, I'd separate instability from a clear
> deterministic breakage caused by a commit - we have a lot more agreement on
> that.)
>
> My personal take is that for a bot to be publicly notifying, "someone"
> needs to take the responsibility to backstop the normal revert to green
> process.  This "someone" can be developers who work in a particular area,
> the bot owner, or some combination thereof.  I view the responsibility of
> the bot config owner as being the person responsible for making sure that
> backstopping is happening.  Not necessarily by doing it themselves, but by
> having the contacts with developers who can, and following up when the
> normal flow is not working.
>
> In this particular example, we appear to have a bunch of flaky lldb
> tests.  I personally know absolutely nothing about lldb.  I have no idea
> whether the tests are badly designed, the system they're being run on isn't
> yet supported by lldb, or if there's some recent code bug introduced which
> causes the failure.  "Someone" needs to take the responsibility of figuring
> that out, and in the meantime spamming developers with inactionable failure
> notices seems undesirable.
>
I generally agree with the overall sentiment. I would add that something
worth differentiating is that the source of flakiness can be coming from
the bot itself (flaky hardware / fragile setup), or from the test/codebase
itself (a flaky bot may just be a deterministic ASAN failure).
Of course from Philip's point of view it does not matter: the effect on the
developer is similar, we get undesirable and unactionable
notifications. From the maintenance flow however, it matters in that the
"someone" who has to take responsibility is often not the same group of folks.
Also when encountering flaky tests, the best action may not be to disable
the bot itself but instead to disable the test itself! (and file a bug
against the test owner...).

One more dimension that seems to surface here may be different practices or
expectations across subprojects, for example here the LLDB folks may be
used to having some flaky tests, but they trigger on changes to LLVM
itself, where we may not expect any flakiness (or so).

> For context, the bot was disabled until it could be moved to the staging
> buildmaster.  Moving to staging is required (currently) to disable
> developer notification.  In the email from Galina, it seems clear that the
> bot would be fine to move back to production once the issue was triaged.
> This seems entirely reasonable to me.
>
Something quite annoying with staging is that it does not have (as far as I
know) a way to continue to notify the buildbot owner. I don't really care
about staging vs prod as much as having a mode to just "not notify the
blame list" / "only notify the owner".

-- 
Mehdi


> Philip
>
> p.s. One thing I'll note as a definite problem with the current system is
> that a lot of this happens in private email, and it's hard to share so that
> everyone has a good picture of what's going on.  It makes miscommunications
> all too easy.  Last time I spoke with Galina, we were tentatively planning to
> start using github issues for bot operation matters to address that, but as
> that was in the middle of the transition from bugzilla, we deferred and
> haven't gotten back to that yet.
>
> p.p.s. The bot in question is https://lab.llvm.org/buildbot/#/builders/83
> if folks want to examine the history themselves.
> On 1/8/22 12:06 PM, Stella Stamenova via llvm-dev wrote:
>
> Hey all,
>
>
>
> I have a couple of questions about what the responsibilities of a buildbot
> owner are. I’ve been maintaining a couple of buildbots for lldb and mlir
> for some time now and I thought I had a pretty good idea of what is
> required based on the documentation here: How To Add Your Build
> Configuration To LLVM Buildbot Infrastructure — LLVM 13 documentation
> <https://www.llvm.org/docs/HowToAddABuilder.html>
>
>
>
> My understanding was that there are some things that are **expected** of
> the owner. Namely:
>
>    1. Make sure that the buildbot is connected and has the right
>    infrastructure (e.g. the right version of Python, or tools, etc.). Update
>    as needed.
>    2. Make sure that the build configuration is one that is supported
>    (e.g. supported flavor or cmake variables). Update as needed.
>
>
>
> There are also a couple of things that are **optional**, but nice to have:
>
>    1. If the buildbot stays red for a while (where “a while” is
>    completely subjective), figure out the patch or patches that are causing an
>    issue and either revert them or notify the authors, so they can take action.
>    2. If someone is having trouble investigating a failure that only
>    happens on the buildbot (or the buildbot is a rare configuration), help
>    them out (e.g. collect logs if possible).
>
>
>
> Up to now, I’ve not had any issues with this and the community has been
> very good at fixing issues with builds and tests when I point them out, or
> more often than not, without me having to do anything but the occasional
> test re-run and software update (like this one, for example, ⚙ D114639
> Raise the minimum Visual Studio version to VS2019 (llvm.org)
> <https://reviews.llvm.org/D114639>). lldb has some tests that are flaky
> because of the nature of the product, so there is some noise, but mostly
> things work well and everyone seems happy.
>
>
>
> I’ve recently run into a situation that makes me wonder whether there are
> other expectations of a buildbot owner that are not explicitly listed in
> the llvm documentation. Someone reached out to me some time ago to let me
> know their unhappiness at the flakiness of some of the lldb tests and
> demanded that I either fix them or disable them. I let them know that there
> are some tests that are known to be flaky, that my expectation is that it
> is not my responsibility to fix all such issues and that the community
> would be very happy to have their contribution in the form of a fix or a
> change to disable the tests. I didn’t get a response from this person, but
> I did disable a couple of particularly flaky tests since it seemed like the
> nice thing to do.
>
>
>
> The real excitement happened yesterday when I received an email that **the
> build bot had been turned off**. This same person reached out to the
> powers that be (without letting me know) and asked them explicitly to
> silence it **without my active involvement** because of the flakiness.
>
>
>
> I have a couple of issues with this approach but perhaps I’ve
> misunderstood what my responsibilities are as the buildbot owner. I know it
> is frustrating to see a bot fail because of flaky tests and it is nice to
> have someone to ask to resolve them all – is that really the expectation of
> a buildbot owner? Where is the line between maintenance of the bot and
> fixing build and test issues for the community?
>
>
>
> I’d like to understand what the general expectations are and if there are
> things missing from the documentation, I propose that we add them, so that
> it is clear for everyone what is required.
>
>
>
> Thanks,
>
> -Stella
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

David Blaikie via llvm-dev

2022-Jan-10 02:06 UTC

[llvm-dev] Responsibilities of a buildbot owner

+1 to most of what Mehdi's said here - I'd love to see improvements in
stability, though probably having some rigid delegation of responsibility
(rather than relying on developers to judge whether it's a flaky test or
flaky bot - that isn't always obvious, maybe it's only flaky on a
particular configuration that that buildbot happens to test and the
developer doesn't have access to - then which is it?) might help (eg: if
it's at all unclear, then the assumption is that it's always the test or
always the buildbot owner - and an expectation that the author or owner
then takes responsibility for working with the other party to address the
issue, etc).

That all said, disabling individual tests may risk no one caring enough to
re-enable them, especially when the flakiness is found long after the
change is made that introduced the test or flakiness (usually the case with
flakiness - it takes a while to become apparent) - I don't really know how
to address that issue. The "convenience" with disabling a buildbot is that
there's other value to the buildbot (other than the flaky test that was
providing negative value), so buildbot owners have more motivation to get
the bot back online - though I don't want to burden buildbot owners unduly
either (because they'd eventually give up on them) :/

- Dave

