thr3ads.net - llvm dev - [llvm-dev] buildbot failure in LLVM on clang-cmake-thumbv7-a15-full-sh [Sep 2015]

David Blaikie via llvm-dev

2015-Sep-29 17:41 UTC

[llvm-dev] buildbot failure in LLVM on clang-cmake-thumbv7-a15-full-sh

On Tue, Sep 29, 2015 at 10:38 AM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 29 September 2015 at 18:22, David Blaikie <dblaikie at gmail.com> wrote:
> > This buildbot looks like it's been failing since Friday - does anyone
> > know/own/care about it?
>
> Yes, we're looking into it.
>
> As you probably noticed, debugging ARM buildbots is neither easy nor
> fast. Reverting commits at random also doesn't help with the problem,
> and bisecting can take days, if not weeks. So the one week rule to
> disable bots is too harsh on those bots.
>
Is it? While it's failing, the buildbot doesn't seem to be any use to the
community at large - it's essentially the buildbot owner's problem at that
point and probably shouldn't be engaging with the community until it's
green again, I think?

Is the buildbot useful to you during this time? Or are you debugging
elsewhere/privately?

If the buildbot is useful to you, but not the community at large - perhaps
we could get in the habit of moving it into a "no email" pool whenever a
failure occurs, until it can be cleared up. (hopefully this pool is clearly
distinguished from the rest of the buildbots in the waterfall/grid view -
because it'd be helpful to be able to look at an easily distinguished
subset of the waterfall/grid and see the bots that are expected to be green
for any developer there)

> Also, please know that I do care a lot about *all* ARM bots (including
> AArch64) and I do check them multiple times a day, so if they're red,
> I'm definitely aware and trying to fix it.
>
> cheers,
> --renato

David Blaikie via llvm-dev

2015-Sep-29 17:44 UTC

[llvm-dev] buildbot failure in LLVM on clang-cmake-thumbv7-a15-full-sh

On Tue, Sep 29, 2015 at 10:41 AM, David Blaikie <dblaikie at gmail.com> wrote:
>
>
> On Tue, Sep 29, 2015 at 10:38 AM, Renato Golin <renato.golin at linaro.org>
> wrote:
>
>> On 29 September 2015 at 18:22, David Blaikie <dblaikie at gmail.com> wrote:
>> > This buildbot looks like it's been failing since Friday - does anyone
>> > know/own/care about it?
>>
>> Yes, we're looking into it.
>>
>> As you probably noticed, debugging ARM buildbots is neither easy nor
>> fast. Reverting commits at random also doesn't help with the problem,
>> and bisecting can take days, if not weeks.
>

Also - if the blame list isn't short enough to provide effective/actionable
blame for the actual developer who caused the regression, sending email
seems noisy and unhelpful. This seems like a buildbot that should just be
emailing you (and anyone else tasked with/interested in investigating these
failures), not a long list of project contributors?

>> So the one week rule to
>> disable bots is too harsh on those bots.
>>
>
> Is it? While it's failing, the buildbot doesn't seem to be any use to the
> community at large - it's essentially the buildbot owner's problem at that
> point and probably shouldn't be engaging with the community until it's
> green again, I think?
>
> Is the buildbot useful to you during this time? Or are you debugging
> elsewhere/privately?
>
> If the buildbot is useful to you, but not the community at large - perhaps
> we could get in the habit of moving it into a "no email" pool whenever a
> failure occurs, until it can be cleared up. (hopefully this pool is clearly
> distinguished from the rest of the buildbots in the waterfall/grid view -
> because it'd be helpful to be able to look at an easily distinguished
> subset of the waterfall/grid and see the bots that are expected to be green
> for any developer there)
>
>
>> Also, please know that I do care a lot about *all* ARM bots (including
>> AArch64) and I do check them multiple times a day, so if they're red,
>> I'm definitely aware and trying to fix it.
>>
>> cheers,
>> --renato
>>
>

Renato Golin via llvm-dev

2015-Sep-29 17:56 UTC

[llvm-dev] buildbot failure in LLVM on clang-cmake-thumbv7-a15-full-sh

On 29 September 2015 at 18:41, David Blaikie <dblaikie at gmail.com> wrote:
> Is it? While it's failing, the buildbot doesn't seem to be any use to the
> community at large - it's essentially the buildbot owner's problem at that
> point and probably shouldn't be engaging with the community until it's green
> again, I think?
The bot is useful as it still shows if there are new bugs since the
initial problem, and can help bisect any further problems when they
come. If we disable that bot, when we fix the issue and bring it back,
there could be a number of new failures that we didn't monitor and
that will need a few more days/weeks to remove, especially if they're
cumulative. This way, it's likely that we'll never have that bot
online ever again. This is bad for the community.

> Is the buildbot useful to you during this time? Or are you debugging
> elsewhere/privately?
Both. As I described above, this bot is useful not just to me, but the
community, as they can cross check if their commits introduced bugs to
all ARM bots, not just one, and the slow bot will show that. I'm also
investigating elsewhere, since if I turn this bot off, what I said
above will happen. I'm also not alone in investigating this, Saleem is
helping me.

> If the buildbot is useful to you, but not the community at large - perhaps
> we could get in the habit of moving it into a "no email" pool whenever a
> failure occurs, until it can be cleared up. (hopefully this pool is clearly
> distinguished from the rest of the buildbots in the waterfall/grid view -
> because it'd be helpful to be able to look at an easily distinguished subset
> of the waterfall/grid and see the bots that are expected to be green for any
> developer there)
Any movement means restarting the buildmaster, which means stopping
all current builds and upsetting all other bots. If we start taking
the stance of moving things up and down the priority list, we'll have
more unstable buildbots and that's worse for the community. Our
agreement, at least from what I understood, was that we should move
unstable bots to offline if: they're broken for a while AND no one is
trying to or can fix it. "A while" is vague because it depends on the
hardware, and I'm definitely trying to fix it.

It's not because the hardware is slow that it has no value to the
community, unless you're arguing that we shouldn't test ARM at all,
which is a whole different story.

Not emailing bugs in this bot when it's green means it's probably
useless, so I wouldn't want to have any bots in there. I already have
a separate buildmaster which doesn't email where I test my prototypes,
but those are work in progress, while my production bots are not.

A neater solution would be to not email *any* buildbot that moves from
exception to failure if the previous non-exceptional status is also
failure. This way, we won't have the kind of email that upset you, but
we still have the value that a red bot provides.
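
The rule proposed above can be sketched roughly as follows. This is only an illustration, not the buildmaster's actual configuration: the `Status` enum and the `should_email` helper are hypothetical names, and the sketch assumes that every failure which is not an exception-to-failure transition is still emailed as it is today.

```python
from enum import Enum
from typing import Optional, Sequence


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    EXCEPTION = "exception"


def last_non_exception(history: Sequence[Status]) -> Optional[Status]:
    """Return the most recent result that was not an exception, if any."""
    for status in reversed(history):
        if status is not Status.EXCEPTION:
            return status
    return None


def should_email(history: Sequence[Status], current: Status) -> bool:
    """Decide whether a new build result should trigger a failure email.

    `history` holds earlier results for the same bot, oldest first.
    """
    if current is not Status.FAILURE:
        return False
    if not history or history[-1] is not Status.EXCEPTION:
        # Assumption: not an exception -> failure transition, so the
        # proposed rule does not apply and the failure is emailed.
        return True
    # Exception -> failure: stay quiet if the bot was already failing
    # before the exception(s).
    return last_non_exception(history) is not Status.FAILURE
```

With this sketch, a bot whose results go failure, exception, failure stays quiet on the last build, while one that goes success, exception, failure still sends mail.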

cheers,
--renato

David Blaikie via llvm-dev

2015-Sep-29 18:04 UTC

[llvm-dev] buildbot failure in LLVM on clang-cmake-thumbv7-a15-full-sh

On Tue, Sep 29, 2015 at 10:56 AM, Renato Golin <renato.golin at linaro.org>
wrote:
> On 29 September 2015 at 18:41, David Blaikie <dblaikie at gmail.com> wrote:
> > Is it? While it's failing, the buildbot doesn't seem to be any use to the
> > community at large - it's essentially the buildbot owner's problem at that
> > point and probably shouldn't be engaging with the community until it's
> > green again, I think?
>
> The bot is useful as it still shows if there are new bugs since the
> initial problem, and can help bisect any further problems when they
> come. If we disable that bot, when we fix the issue and bring it back,
> there could be a number of new failures that we didn't monitor and
> that will need a few more days/weeks to remove, especially if they're
> cumulative. This way, it's likely that we'll never have that bot
> online ever again. This is bad for the community.
>
The community generally doesn't pay attention to the bot once it goes red -
so this seems to be only relevant to the "we didn't monitor" and by "we"
I/you mean you-and-other-people-who-care-about-the-bot, not the community
at large.

I certainly don't look beyond "oh, the bot was already red" and /maybe/ if
you're lucky "oh, a different thing is failing now", but I often don't get
that far owing to the high false positive rate (due to flakes and existing
errors) in the buildbots.

Maybe other people's experiences are different, but I don't have much
evidence to suggest that.

> > Is the buildbot useful to you during this time? Or are you debugging
> > elsewhere/privately?
>
> Both. As I described above, this bot is useful not just to me, but the
> community, as they can cross check if their commits introduced bugs to
> all ARM bots, not just one, and the slow bot will show that.

I don't know about other people, but I don't cross reference bots that
closely. I mostly ignore the low rumble of noise I get back from the
buildbots every time I commit. I have to measure by magnitude (& level of
trust with different bots) - this is really not possible for newer
contributors, as they won't know what to pay attention to or not. I don't
think it's a sustainable way to run the bots.

> I'm also
> investigating elsewhere, since if I turn this bot off, what I said
> above will happen. I'm also not alone in investigating this, Saleem is
> helping me.
>
>
> > If the buildbot is useful to you, but not the community at large - perhaps
> > we could get in the habit of moving it into a "no email" pool whenever a
> > failure occurs, until it can be cleared up. (hopefully this pool is clearly
> > distinguished from the rest of the buildbots in the waterfall/grid view -
> > because it'd be helpful to be able to look at an easily distinguished subset
> > of the waterfall/grid and see the bots that are expected to be green for any
> > developer there)
>
> Any movement means restarting the buildmaster, which means stopping
> all current builds and upsetting all other bots. If we start taking
> the stance of moving things up and down the priority list, we'll have
> more unstable buildbots and that's worse for the community. Our
> agreement, at least from what I understood, was that we should move
> unstable bots to offline if: they're broken for a while AND no one is
> trying to or can fix it. "A while" is vague because it depends on the
> hardware, and I'm definitely trying to fix it.
>
> It's not because the hardware is slow that it has no value to the
> community, unless you're arguing that we shouldn't test ARM at all,
> which is a whole different story.
>
If the failure mails are not actionable, they're not useful to the
community. If the blame list is too long (or too delayed) it's not likely
to be useful.

If a certain platform just takes a long time (though we could reduce that
with a hybrid approach - cross build the compiler on a fast platform, run
the tests on the other) then it's necessary to put more hardware (multiple
slaves) behind it to reduce the blame lists, I think.

> Not emailing bugs in this bot when it's green means it's probably
> useless,

It doesn't seem useless - it's still a signal to you and other developers
who care about the platform and will investigate failures.

> so I wouldn't want to have any bots in there. I already have
> a separate buildmaster which doesn't email where I test my prototypes,
> but those are work in progress, while my production bots are not.
>
> A neater solution would be to not email *any* buildbot that moves from
> exception to failure if the previous non-exceptional status is also
> failure. This way, we won't have the kind of email that upset you, but
> we still have the value that a red bot provides.
>
Sure, I'd be OK-ish with that, though it'd still make looking at the
waterfall/grid problematic as it is today (though I don't do that often, so
I don't personally care about that). It'd be the same as moving the
buildbot to a "no email" group until fixed, but without the need to cycle
the buildmaster (& with the benefit that it'd happen automatically - though
I'm only suggesting moving it off emailing when there's active
investigation, so the small manual task at the beginning and end of that
cycle doesn't seem too detrimental - no need to do it when someone just
checks in a buildbreak by mistake, etc)

- Dave

>
> cheers,
> --renato
