David Blaikie via llvm-dev
2015-Sep-29 17:41 UTC
[llvm-dev] buildbot failure in LLVM on clang-cmake-thumbv7-a15-full-sh
On Tue, Sep 29, 2015 at 10:38 AM, Renato Golin <renato.golin at linaro.org> wrote:

> On 29 September 2015 at 18:22, David Blaikie <dblaikie at gmail.com> wrote:
> > This buildbot looks like it's been failing since Friday - does anyone
> > know/own/care about it?
>
> Yes, we're looking into it.
>
> As you probably noticed, debugging ARM buildbots is not easy, not
> fast. Reverting commits at random also doesn't help with the problem,
> and bisecting can take days, if not weeks. So the one week rule to
> disable bots is too harsh on those bots.

Is it? While it's failing, the buildbot doesn't seem to be any use to the community at large - it's essentially the buildbot owner's problem at that point and probably shouldn't be engaging with the community until it's green again, I think?

Is the buildbot useful to you during this time? Or are you debugging elsewhere/privately?

If the buildbot is useful to you, but not the community at large - perhaps we could get in the habit of moving it into a "no email" pool whenever a failure occurs, until it can be cleared up. (Hopefully this pool is clearly distinguished from the rest of the buildbots in the waterfall/grid view - because it'd be helpful to be able to look at an easily distinguished subset of the waterfall/grid and see the bots that are expected to be green for any developer there.)

> Also, please know that I do care a lot about *all* ARM bots (including
> AArch64) and I do check them multiple times a day, so if they're red,
> I'm definitely aware and trying to fix it.
>
> cheers,
> --renato
David Blaikie via llvm-dev
2015-Sep-29 17:44 UTC
[llvm-dev] buildbot failure in LLVM on clang-cmake-thumbv7-a15-full-sh
On Tue, Sep 29, 2015 at 10:41 AM, David Blaikie <dblaikie at gmail.com> wrote:

> On Tue, Sep 29, 2015 at 10:38 AM, Renato Golin <renato.golin at linaro.org>
> wrote:
>
> > On 29 September 2015 at 18:22, David Blaikie <dblaikie at gmail.com> wrote:
> > > This buildbot looks like it's been failing since Friday - does anyone
> > > know/own/care about it?
> >
> > Yes, we're looking into it.
> >
> > As you probably noticed, debugging ARM buildbots is not easy, not
> > fast. Reverting commits at random also doesn't help with the problem,
> > and bisecting can take days, if not weeks.

Also - if the blame list isn't short enough to provide effective/actionable blame for the actual developer who caused the regression, sending email seems noisy and unhelpful. This seems like a buildbot that should just be emailing you (and anyone else tasked with/interested in investigating these failures), not a long list of project contributors?

> > So the one week rule to
> > disable bots is too harsh on those bots.
>
> Is it? While it's failing, the buildbot doesn't seem to be any use to the
> community at large - it's essentially the buildbot owner's problem at that
> point and probably shouldn't be engaging with the community until it's
> green again, I think?
>
> Is the buildbot useful to you during this time? Or are you debugging
> elsewhere/privately?
>
> If the buildbot is useful to you, but not the community at large - perhaps
> we could get in the habit of moving it into a "no email" pool whenever a
> failure occurs, until it can be cleared up. (Hopefully this pool is clearly
> distinguished from the rest of the buildbots in the waterfall/grid view -
> because it'd be helpful to be able to look at an easily distinguished
> subset of the waterfall/grid and see the bots that are expected to be green
> for any developer there.)
>
> > Also, please know that I do care a lot about *all* ARM bots (including
> > AArch64) and I do check them multiple times a day, so if they're red,
> > I'm definitely aware and trying to fix it.
> >
> > cheers,
> > --renato
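
The routing idea raised in the message above (mail everyone on the blame list only when that list is short enough to be actionable, otherwise mail just the people investigating the bot) could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the LLVM buildmaster's actual configuration: the function name `failure_recipients`, the `max_blame` cut-off of 5, and the example.org addresses are all hypothetical.

```python
from typing import Iterable, List


def failure_recipients(blame_list: Iterable[str],
                       owners: Iterable[str],
                       max_blame: int = 5) -> List[str]:
    """Pick who gets a failure email for one build.

    max_blame is a hypothetical cut-off: above it the blame list is treated
    as too long to be actionable and only the bot owners are notified.
    """
    authors = set(blame_list)   # de-duplicate committers in the blame list
    owner_set = set(owners)
    if len(authors) <= max_blame:
        # Short blame list: the committers can plausibly act on the mail.
        return sorted(authors | owner_set)
    # Long blame list: keep the noise away from the wider contributor list.
    return sorted(owner_set)


if __name__ == "__main__":
    owners = ["owner@example.org"]
    print(failure_recipients(["a@example.org", "b@example.org"], owners))
    print(failure_recipients([f"dev{i}@example.org" for i in range(20)], owners))
```

The right threshold would depend on how quickly a given bot cycles; the thread does not propose a specific number.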
Renato Golin via llvm-dev
2015-Sep-29 17:56 UTC
[llvm-dev] buildbot failure in LLVM on clang-cmake-thumbv7-a15-full-sh
On 29 September 2015 at 18:41, David Blaikie <dblaikie at gmail.com> wrote:

> Is it? While it's failing, the buildbot doesn't seem to be any use to the
> community at large - it's essentially the buildbot owner's problem at that
> point and probably shouldn't be engaging with the community until it's green
> again, I think?

The bot is useful as it still shows if there are new bugs since the initial problem, and can help bisect any further problems when they come. If we disable that bot, when we fix the issue and bring it back, there could be a number of new failures that we didn't monitor and that will need a few more days/weeks to remove, especially if they're cumulative. This way, it's likely that we'll never have that bot online ever again. This is bad for the community.

> Is the buildbot useful to you during this time? Or are you debugging
> elsewhere/privately?

Both. As I described above, this bot is useful not just to me, but to the community, as they can cross check if their commits introduced bugs to all ARM bots, not just one, and the slow bot will show that. I'm also investigating elsewhere, since if I turn this bot off, what I said above will happen. I'm also not alone in investigating this, Saleem is helping me.

> If the buildbot is useful to you, but not the community at large - perhaps
> we could get in the habit of moving it into a "no email" pool whenever a
> failure occurs, until it can be cleared up. (Hopefully this pool is clearly
> distinguished from the rest of the buildbots in the waterfall/grid view -
> because it'd be helpful to be able to look at an easily distinguished subset
> of the waterfall/grid and see the bots that are expected to be green for any
> developer there.)

Any movement means restarting the buildmaster, which means stopping all current builds and upsetting all other bots. If we start taking the stance of moving things up and down the priority list, we'll have more unstable buildbots and that's worse for the community. Our agreement, at least from what I understood, was that we should move unstable bots to offline if: they're broken for a while AND no one is trying to or can fix it. "A while" is vague because it depends on the hardware, and I'm definitely trying to fix it.

It's not because the hardware is slow that it has no value to the community, unless you're arguing that we shouldn't test ARM at all, which is a whole different story.

Not emailing bugs in this bot when it's green means it's probably useless, so I wouldn't want to have any bots in there. I already have a separate buildmaster which doesn't email where I test my prototypes, but those are work in progress, while my production bots are not.

A neater solution would be to not email *any* buildbot that moves from exception to failure if the previous non-exceptional status is also failure. This way, we won't have the kind of email that upset you, but we still have the value that a red bot provides.

cheers,
--renato
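
The "neater solution" proposed at the end of the message above can be made concrete with a small sketch. This is only an illustration in Python, not Buildbot's API or the LLVM buildmaster's configuration: the `Status` enum and the functions `last_non_exception` and `should_email` are hypothetical names. It also assumes a slightly broader rule than the one stated in the message, namely that a failure is mailed only when the most recent non-exception result was not already a failure, so a direct failure-to-failure transition is suppressed as well.

```python
from enum import Enum
from typing import Optional, Sequence


class Status(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    EXCEPTION = "exception"


def last_non_exception(history: Sequence[Status]) -> Optional[Status]:
    """Return the newest result in history that is not an exception."""
    for status in reversed(history):
        if status is not Status.EXCEPTION:
            return status
    return None  # no earlier build, or only exceptions so far


def should_email(history: Sequence[Status], current: Status) -> bool:
    """Decide whether the newest build result warrants a failure email.

    history holds earlier results, oldest first; current is the newest one.
    Exceptions are skipped when looking back, so the sequence
    failure -> exception -> failure does not trigger a second email.
    """
    if current is not Status.FAILURE:
        return False
    return last_non_exception(history) is not Status.FAILURE


if __name__ == "__main__":
    # Bot was already red, hit an infrastructure exception, then failed again.
    print(should_email([Status.FAILURE, Status.EXCEPTION], Status.FAILURE))
    # Bot was green, hit an exception, then failed: a new regression.
    print(should_email([Status.SUCCESS, Status.EXCEPTION], Status.FAILURE))
```

Under this rule a bot that stays red keeps its value as a visible signal on the waterfall, while contributors are mailed only when a green bot turns red.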
David Blaikie via llvm-dev
2015-Sep-29 18:04 UTC
[llvm-dev] buildbot failure in LLVM on clang-cmake-thumbv7-a15-full-sh
On Tue, Sep 29, 2015 at 10:56 AM, Renato Golin <renato.golin at linaro.org> wrote:

> On 29 September 2015 at 18:41, David Blaikie <dblaikie at gmail.com> wrote:
> > Is it? While it's failing, the buildbot doesn't seem to be any use to the
> > community at large - it's essentially the buildbot owner's problem at that
> > point and probably shouldn't be engaging with the community until it's
> > green again, I think?
>
> The bot is useful as it still shows if there are new bugs since the
> initial problem, and can help bisect any further problems when they
> come. If we disable that bot, when we fix the issue and bring it back,
> there could be a number of new failures that we didn't monitor and
> that will need a few more days/weeks to remove, especially if they're
> cumulative. This way, it's likely that we'll never have that bot
> online ever again. This is bad for the community.

The community generally doesn't pay attention to the bot once it goes red - so this seems to be only relevant to the "we didn't monitor" part, and by "we" I/you mean you-and-other-people-who-care-about-the-bot, not the community at large. I certainly don't look beyond "oh, the bot was already red" and /maybe/ if you're lucky "oh, a different thing is failing now", but I often don't get that far owing to the high false positive rate (due to flakes and existing errors) in the buildbots.

Maybe other people's experiences are different, but I don't have much evidence to suggest that.

> > Is the buildbot useful to you during this time? Or are you debugging
> > elsewhere/privately?
>
> Both. As I described above, this bot is useful not just to me, but to the
> community, as they can cross check if their commits introduced bugs to
> all ARM bots, not just one, and the slow bot will show that.

I don't know about other people, but I don't cross reference bots that closely. I mostly ignore the low rumble of noise I get back from the buildbots every time I commit. I have to measure by magnitude (& level of trust with different bots) - this is really not possible for newer contributors - they won't know what to pay attention to or not. I don't think it's a sustainable way to run the bots.

> I'm also
> investigating elsewhere, since if I turn this bot off, what I said
> above will happen. I'm also not alone in investigating this, Saleem is
> helping me.
>
> > If the buildbot is useful to you, but not the community at large -
> > perhaps we could get in the habit of moving it into a "no email" pool
> > whenever a failure occurs, until it can be cleared up. (Hopefully this
> > pool is clearly distinguished from the rest of the buildbots in the
> > waterfall/grid view - because it'd be helpful to be able to look at an
> > easily distinguished subset of the waterfall/grid and see the bots that
> > are expected to be green for any developer there.)
>
> Any movement means restarting the buildmaster, which means stopping
> all current builds and upsetting all other bots. If we start taking
> the stance of moving things up and down the priority list, we'll have
> more unstable buildbots and that's worse for the community. Our
> agreement, at least from what I understood, was that we should move
> unstable bots to offline if: they're broken for a while AND no one is
> trying to or can fix it. "A while" is vague because it depends on the
> hardware, and I'm definitely trying to fix it.
>
> It's not because the hardware is slow that it has no value to the
> community, unless you're arguing that we shouldn't test ARM at all,
> which is a whole different story.

If the failure mails are not actionable, they're not useful to the community. If the blame list is too long (or too delayed) it's not likely to be useful. If a certain platform just takes a long time (though we could reduce that with a hybrid approach - cross build the compiler on a fast platform, run the tests on the other) then it's necessary to put more hardware (multiple slaves) behind it to reduce the blame lists, I think.

> Not emailing bugs in this bot when it's green means it's probably
> useless,

It doesn't seem useless - it's still a signal to you and other developers who care about the platform and will investigate failures.

> so I wouldn't want to have any bots in there. I already have
> a separate buildmaster which doesn't email where I test my prototypes,
> but those are work in progress, while my production bots are not.
>
> A neater solution would be to not email *any* buildbot that moves from
> exception to failure if the previous non-exceptional status is also
> failure. This way, we won't have the kind of email that upset you, but
> we still have the value that a red bot provides.

Sure, I'd be OK-ish with that, though it'd still make looking at the waterfall/grid problematic as it is today (though I don't do that often, so I don't personally care about that). It'd be the same as moving the buildbot to a "no email" group until fixed, but without the need to cycle the buildmaster (& with the benefit that it'd happen automatically - though I'm only suggesting moving it off emailing when there's active investigation, so the small manual task at the beginning and end of that cycle doesn't seem too detrimental - no need to do it when someone just checks in a buildbreak by mistake, etc).

- Dave

> cheers,
> --renato