On 7 October 2015 at 22:44, Eric Christopher <echristo at gmail.com> wrote:
> I think this is a poor analogy. You're also ignoring the solution I gave you
> in my previous mail for slow bots.

I'm not ignoring it, I'm acting upon it. But it takes time. I don't have infinite resources.

> If you can't give some basic stability guarantees then the bot
> is only harming the entire testing infrastructure.

Define stability. Daniel was talking about "things I can act upon". That's so vague it means nothing. "Basic stability guarantees" is similarly vague. Any universal rule you try to make will either be too lax for fast and reliable bots, or too hard on slow and less used bots.

That's what I'm finding hard to understand. All you guys are saying is that things are bad and need to get better. I agree completely. But your solution is to turn off everything you don't understand or assume is flaky, and that's just wrong.

We had two kinds of flaky bots: the Pandas and a Juno. The Pandas were disabled; the Juno was fixed. Some of our bots, however, are still slow, and we have been asked to disable them because they were red for too long.

Most of the problems we find are bad tests from people who (obviously) didn't test on ARM. The second most common is code that doesn't take 32-bit platforms into account. The third most common breakage is the sanitizer tests, which pop in and out on many platforms. The most common long breakage is self-hosted Clang breaking, which makes it hard to find the commit to revert, or even to warn the developer.

None of those are due to instability of my buildbots. But I got shouted at many times to disable the bot because it was "red for too long". I find this behaviour disrespectful.

I'm now trying to get 8 more ARM boards and 3 AArch64 ones, and I plan to put them up as redundant builders. But it takes time: weeks to make them work reliably, more weeks to make sure they won't fall over under pressure, more weeks to put them in production and stabilise.
Meanwhile, I'd appreciate it if people stopped trying to kill the others.

What else do you want us to do?

cheers,
--renato
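[Editor's note: a minimal sketch, not from the thread, of the point above about universal rules being too hard on slow bots. All numbers and names are invented for illustration.]

```python
# The same "disable it if it has been red for N hours" rule, applied to
# bots with very different build-cycle times. A red streak that damns a
# fast bot may not even span one full build on a slow self-hosting bot.

def builds_while_red(red_hours, cycle_hours):
    """How many full build cycles a bot completes during a red period."""
    return int(red_hours // cycle_hours)

# A fast bot with 30-minute cycles, red for 6 hours, has had 12 chances
# to pick up a fix and go green; a slow self-hosting ARM bot with 8-hour
# cycles may not have finished a single build since the fix landed.
fast_chances = builds_while_red(6, 0.5)  # 12 chances to recover
slow_chances = builds_while_red(6, 8)    # 0 chances to recover
```

Under such a rule, the slow bot looks "flaky" for reasons that have nothing to do with its stability, which is the asymmetry being described.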
On Wed, Oct 7, 2015 at 3:09 PM Renato Golin <renato.golin at linaro.org> wrote:

> On 7 October 2015 at 22:44, Eric Christopher <echristo at gmail.com> wrote:
> > I think this is a poor analogy. You're also ignoring the solution I gave you
> > in my previous mail for slow bots.
>
> I'm not ignoring it, I'm acting upon it. But it takes time. I don't
> have infinite resources.

Of course. It just seemed like you were ignoring it as a (partial/full) solution.

> > If you can't give some basic stability guarantees then the bot
> > is only harming the entire testing infrastructure.
>
> Define stability. Daniel was talking about "things I can act upon".
> That's so vague it means nothing. "Basic stability guarantees" is
> similarly vague.

Basic stability guarantee: "only returns failure for failures due to the compiler, or the occasional exception".

> Any universal rule you try to make will either be too lax for fast and
> reliable bots, or too hard on slow and less used bots.

I don't know how fast/slow comes into this; see Chris's mail for more comments. I think you're concentrating too hard on this particular axis, to the detriment of the discussion. A better way is to look at it as a "signal to noise" ratio.

If the bot is correctly identifying problems, yet mostly staying green, then it has a good signal and is useful. If it's mostly red due to:

a) instability (exceptions, timeouts, what have you), or
b) no one looking at the failures, or
c) not completing fast enough to deal with transient red at top of tree,

then it isn't providing a lot of signal.

That's my general guideline for how bots should be run. A description of what's going on with your bots, and how they relate to these buckets, would be good to have; other sets of bots may fall into different buckets.

> That's what I'm finding hard to understand. All you guys are saying is
> that things are bad and need to get better. I agree completely.
> But your solution is to turn off everything you don't understand or
> assume is flaky, and that's just wrong.
>
> We had two kinds of flaky bots: the Pandas and a Juno. The Pandas were
> disabled; the Juno was fixed. Some of our bots, however, are still
> slow, and we have been asked to disable them because they were red for
> too long.

Are they red because the tree is red over their run lifetime, or red because there are problems that aren't being fixed? If it's the former, then they might truly be too slow to be enabled right now as public bots; when (I hope it's a when) we move to a staged bot infrastructure, they can be re-enabled as things that send email and bug people when they fail. If it's the latter, then we need to figure out how to get problems identified and fixed more rapidly.

> Most of the problems we find are bad tests from people who (obviously)
> didn't test on ARM. The second most common is code that doesn't take
> 32-bit platforms into account. The third most common breakage is the
> sanitizer tests, which pop in and out on many platforms. The most
> common long breakage is self-hosted Clang breaking, which makes it
> hard to find the commit to revert, or even to warn the developer.
>
> None of those are due to instability of my buildbots. But I got
> shouted at many times to disable the bot because it was "red for too
> long". I find this behaviour disrespectful.

That seems reasonable. If the failures are real, staying red is legitimate. But if no one is trying to get them fixed, by extracting test cases or helping the author reproduce the problem, it can look like even the owner doesn't care, and then no one does :) Again, I'm not saying this is what's going on with your bots in particular, just describing a general case.

> I'm now trying to get 8 more ARM boards and 3 AArch64 ones, and I plan
> to put them up as redundant builders. But it takes time.
> Weeks to make them work reliably, more weeks to make sure they won't
> fall over under pressure, more weeks to put them in production and
> stabilise. Meanwhile, I'd appreciate it if people stopped trying to
> kill the others.

Honestly, I'm not sure redundant builders are the solution here, rather than the phased system. More noise (e.g. all of them failing at once) isn't going to help. That said, if they help you reduce the time to find problems, great.

Hope this explains my position on how the bots should work. I definitely think we need a phased scheme, and I was hoping to hear some sort of scheduling or transition idea from Chris; I have no idea what kind of time he has for this sort of thing. Even if it's just documentation on how to move a set of bots over to the phased builder, that would be an amazing help to the community in general :)

Thanks!

-eric
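[Editor's note: a minimal sketch of the "signal to noise" classification described above. The status strings, thresholds, and bucket names are invented for illustration; this is not part of buildbot.]

```python
# Classify a bot by its recent build results, following the criteria in
# the mail above: mostly green with real failures = useful signal; too
# many exceptions = unstable (criterion a); mostly red = noisy (b or c).

def classify_bot(results, green_threshold=0.8, exception_threshold=0.2):
    """results: list of "green", "red" or "exception" build outcomes."""
    if not results:
        return "unknown"
    n = len(results)
    green = results.count("green")
    exceptions = results.count("exception")
    if exceptions / n > exception_threshold:
        return "unstable"  # criterion (a): exceptions, timeouts, etc.
    if green / n >= green_threshold:
        return "useful"    # mostly green: failures carry real signal
    return "noisy"         # mostly red: criteria (b) or (c)
```

For example, a bot that is green 8 builds out of 10 classifies as "useful", while one red 7 builds out of 10 classifies as "noisy", regardless of how fast either of them cycles.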
On 7 October 2015 at 23:54, Eric Christopher <echristo at gmail.com> wrote:
> Basic stability guarantee: "only returns failure for failures due to
> the compiler, or the occasional exception".

OK, in that sense, my bots are very stable.

> I don't know how fast/slow comes into this; see Chris's mail for more
> comments. I think you're concentrating too hard on this particular
> axis, to the detriment of the discussion. A better way is to look at
> it as a "signal to noise" ratio.

Chris' CI is orders of magnitude better than ours. In his infrastructure, speed matters a lot less while you wait for a fix/revert to make the bot green again.

I agree with almost everything else, except one thing: we had three Pandas precisely to address the speed issue, but more often than I'd like they'd pick up three consecutive commits, and keep dozens of commits waiting. That makes the value of maintaining more bots smaller.

> If it's mostly red due to:
> a) instability (exceptions, timeouts, what have you), or
> b) no one looking at the failures, or
> c) not completing fast enough to deal with transient red at top of tree

Absolutely agree, and none of those apply to our bots. We haven't had instability issues for a long time now: as I said, the Pandas are gone and the Juno is fixed; the rest is very stable. We're *always* looking at failures, but sometimes it takes time to figure out what to revert, and sometimes there's no test to XFAIL. Those take longer to fix.

> Are they red because the tree is red over their run lifetime, or red
> because there are problems that aren't being fixed?

The two occasions when I was asked to disable my bots were similar. In two separate weeks, a self-hosting bot spotted a weird bug that none of the other bots did. Marking the test as XFAIL was not an option, otherwise all the other bots would then fail. So we tried to understand what was going on, but our hardware is mostly remote and shared, so it took days to get to an idea.
Then we needed to mark the bot as unstable, and wait for it to go back to green. All of that took about 5 days, including the weekend, so in reality 3 working days. I don't find that flaky, nor unreasonable, nor unsustainable.

However, during those 5 days the build master was restarted, and the bot status went from red to exception and back to red. Since, as I explained earlier, an exception is treated as "success", David got an email, saw that the bot had been red for "a long time", and assumed no one was looking at it. By coincidence this happened twice in a row, for completely different reasons, so David was emailed twice in two weeks. That's when he assumed the bots were flaky and that no one was trying to fix them.

> That seems reasonable. If the failures are real, staying red is
> legitimate. But if no one is trying to get them fixed, by extracting
> test cases or helping the author reproduce the problem, it can look
> like even the owner doesn't care, and then no one does :)

I'm always trying to fix every bug we find. I've always helped everyone, and I've even provided access to our hardware on multiple occasions when I wasn't able to debug a problem myself. I worked very hard to reduce the noise from our hardware, and I managed to get some pretty stable buildbots. That's why I was so shocked when I was asked to disable my bots, twice!

> Honestly, I'm not sure redundant builders are the solution here,
> rather than the phased system. More noise (e.g. all of them failing at
> once) isn't going to help. That said, if they help you reduce the time
> to find problems, great.

I described some of those problems above, so I agree with you. Moving to something like the GreenBots seems like the best option.

cheers,
--renato
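[Editor's note: a minimal sketch of the reporting problem described above, where a build-master restart injects "exception" results into a red streak. The function and status names are invented for illustration; the fix sketched here is to treat an exception as carrying no information, rather than as success.]

```python
# "How long has this bot been red?" computed correctly: an "exception"
# result (e.g. from a master restart) neither counts as green nor resets
# the red streak. A naive report that maps exception to success would
# show a brief "recovery" during the restart and then a fresh red streak,
# which is exactly the misleading signal described in the mail above.

def last_green_age(history):
    """history: oldest-to-newest build statuses ("green"/"red"/"exception").
    Returns how many red builds ago the last green was, or None if never."""
    age = 0
    for status in reversed(history):
        if status == "green":
            return age
        if status == "red":
            age += 1
        # "exception" is skipped entirely: it says nothing about the tree.
    return None
```

With this accounting, a red bot that passes through an exception during a master restart still reports a continuous red streak, so an observer sees one ongoing breakage instead of two apparently unattended ones.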