thr3ads.net - llvm dev - [llvm-dev] Responsibilities of a buildbot owner [Jan 2022]

Stella Stamenova via llvm-dev

2022-Jan-08 20:06 UTC

[llvm-dev] Responsibilities of a buildbot owner

Hey all,

I have a couple of questions about what the responsibilities of a buildbot owner
are. I’ve been maintaining a couple of buildbots for lldb and mlir for some time
now and I thought I had a pretty good idea of what is required based on the
documentation here: How To Add Your Build Configuration To LLVM Buildbot
Infrastructure (LLVM 13 documentation)
<https://www.llvm.org/docs/HowToAddABuilder.html>

My understanding was that there are some things that are *expected* of the
owner. Namely:

  1.  Make sure that the buildbot is connected and has the right infrastructure
      (e.g. the right version of Python, or tools, etc.). Update as needed.
  2.  Make sure that the build configuration is one that is supported (e.g.
      supported flavor or cmake variables). Update as needed.

There are also a couple of things that are *optional*, but nice to have:

  1.  If the buildbot stays red for a while (where “a while” is completely
      subjective), figure out the patch or patches that are causing an issue
      and either revert them or notify the authors, so they can take action.
  2.  If someone is having trouble investigating a failure that only happens on
      the buildbot (or the buildbot is a rare configuration), help them out
      (e.g. collect logs if possible).

Up to now, I’ve not had any issues with this and the community has been very
good at fixing issues with builds and tests when I point them out, or more often
than not, without me having to do anything but the occasional test re-run and
software update (like this one, for example: D114639, Raise the minimum Visual
Studio version to VS2019 <https://reviews.llvm.org/D114639>).
lldb has some tests that are flaky because of the nature of the product, so
there is some noise, but mostly things work well and everyone seems happy.

I’ve recently run into a situation that makes me wonder whether there are other
expectations of a buildbot owner that are not explicitly listed in the llvm
documentation. Someone reached out to me some time ago to let me know their
unhappiness at the flakiness of some of the lldb tests and demanded that I
either fix them or disable them. I let them know that there are some tests that
are known to be flaky, that my expectation is that it is not my responsibility
to fix all such issues and that the community would be very happy to have their
contribution in the form of a fix or a change to disable the tests. I didn’t get
a response from this person, but I did disable a couple of particularly flaky
tests since it seemed like the nice thing to do.

The real excitement happened yesterday when I received an email that *the build
bot had been turned off*. This same person reached out to the powers that be
(without letting me know) and asked them explicitly to silence it *without my
active involvement* because of the flakiness.

I have a couple of issues with this approach but perhaps I’ve misunderstood what
my responsibilities are as the buildbot owner. I know it is frustrating to see a
bot fail because of flaky tests and it is nice to have someone to ask to resolve
them all – is that really the expectation of a buildbot owner? Where is the line
between maintenance of the bot and fixing build and test issues for the
community?

I’d like to understand what the general expectations are and if there are things
missing from the documentation, I propose that we add them, so that it is clear
for everyone what is required.

Thanks,
-Stella

Philip Reames via llvm-dev

2022-Jan-08 21:01 UTC

[llvm-dev] Responsibilities of a buildbot owner

Stella,

Thank you for raising the question.  This is a great discussion for us 
to have publicly.

So folks know, I am the individual Stella mentioned below.  I'll start 
with a bit of history so that everyone's on the same page, then dive 
into the policy question.

My general take is that buildbots are only useful if failure 
notifications are generally actionable.  A couple months back, I was on 
the edge of setting up mail filter rules to auto-delete a bunch of bots 
because they were regularly broken, and decided I should try to be 
constructive first.  In the first wave of that, I emailed a couple of 
bot owners about things which seemed like false positives.

At the time, I thought it was the bot owner's responsibility to not be 
testing a flaky configuration.  I got a bit of push back on that from a 
couple sources - Stella was one - and put that question on hold.  This 
thread is a great opportunity to decide what our policy actually is, and 
document it.

In the meantime, I've been working with Galina to document existing 
practice where we could, and to try to identify best practices on 
setting up bots.  These changes have been posted publicly, and reviewed 
through the normal process.  We've been deliberately trying to stick to 
non-controversial stuff as we got the docs improved.  I've been actively 
reaching out to bot owners to gather feedback in this process, but 
Stella had not, yet, been one.

Separately, this week I noticed a bot which was repeatedly toggling 
between red and green.  I forget the exact ratio, but in the recent 
build history, there were multiple transitions, seemingly unrelated to 
the changes being committed.  I emailed Galina asking her to address it, 
and she removed the buildbot until it could be moved to the staging 
buildmaster, addressed, and then restored.  I left Stella off the 
initial email.  Sorry about that, no ill intent, just written in a hurry.
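
To make "repeatedly toggling between red and green" concrete, here is a minimal sketch of how such a pattern might be counted. It assumes the recent build history is available as a simple list of pass/fail results; the function names and the threshold are hypothetical and are not part of any LLVM buildbot tooling.

```python
from typing import Sequence


def count_transitions(results: Sequence[bool]) -> int:
    """Count how often consecutive builds flip between green (True) and red (False)."""
    return sum(1 for prev, cur in zip(results, results[1:]) if prev != cur)


def looks_flaky(results: Sequence[bool], max_transitions: int = 4) -> bool:
    """Flag a history as suspicious when it flips more often than the threshold.

    The default threshold is an arbitrary, hypothetical value chosen for
    illustration only.
    """
    return count_transitions(results) > max_transitions


# Hypothetical recent history for one builder: True = green, False = red.
history = [True, False, True, True, False, True, False, False, True]
print(count_transitions(history), looks_flaky(history))
```

A count like this says nothing about whether the flips are related to the commits being tested, so it could only be a first filter before someone looks at the actual failures.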

Now, transitioning into a bit of policy discussion...

From my conversations with existing bot owners, there is a general 
agreement that bots should only be notifying the community if they are 
stable enough.  There's honest disagreement on what the bar for stable 
enough is, and disagreement about exactly whose responsibility 
addressing new instability is.  (To be clear, I'd separate instability 
from a clear deterministic breakage caused by a commit - we have a lot 
more agreement on that.)

My personal take is that for a bot to be publicly notifying, "someone"
needs to take the responsibility to backstop the normal revert to green 
process.  This "someone" can be developers who work in a particular 
area, the bot owner, or some combination thereof.  I view the 
responsibility of the bot config owner as being the person responsible 
for making sure that backstopping is happening.  Not necessarily by 
doing it themselves, but by having the contacts with developers who can, 
and following up when the normal flow is not working.

In this particular example, we appear to have a bunch of flaky lldb 
tests.  I personally know absolutely nothing about lldb.  I have no idea 
whether the tests are badly designed, the system they're being run on 
isn't yet supported by lldb, or if there's some recent code bug 
introduced which causes the failure. "Someone" needs to take the 
responsibility of figuring that out, and in the meantime spamming 
developers with inactionable failure notices seems undesirable.

For context, the bot was disabled until it could be moved to the staging 
buildmaster.  Moving to staging is required (currently) to disable 
developer notification.  In the email from Galina, it seems clear that 
the bot would be fine to move back to production once the issue was 
triaged.  This seems entirely reasonable to me.

Philip

p.s. One thing I'll note as a definite problem with the current system 
is that a lot of this happens in private email, and it's hard to share 
so that everyone has a good picture of what's going on.  It makes 
miscommunications all too easy.  Last time I spoke with Galina, we were 
tentatively planning to start using GitHub issues for bot operation 
matters to address that, but as that was in the middle of the transition 
from Bugzilla, we deferred and haven't gotten back to that yet.

p.p.s. The bot in question is 
https://lab.llvm.org/buildbot/#/builders/83 if folks want to examine the 
history themselves.
