thr3ads.net - llvm dev - [llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9 [Aug 2015]

Tobias Grosser via llvm-dev

2015-Aug-26 14:44 UTC

[llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

On 08/26/2015 04:38 PM, Renato Golin via llvm-dev wrote:
> On 26 August 2015 at 15:32, Tobias Grosser <tobias at grosser.es> wrote:
>> What's the problem with increasing the timeout? Asking people to ignore
>> buildbot mails does not seem right. If the buildbot is flaky I believe
>> the buildbot owner should ensure it shuts up until the problems have
>> been resolved and the buildbot has a low false positive rate again.
>
> That's the point I make about solving the real issue, not increase the
> timeout.
>
> CMake + Ninja has fixed virtually all our flakiness on all other ARM
> bots, so I think we should give it a try first.

What time-line do you have in mind for this fix? If you are in charge
and can make this happen within a day, giving cmake + ninja a chance seems
OK.

However, if the owner of the buildbot is not known or the fix can not come
soon, I am in favor of disabling the noise and (re)enabling it when someone
found time to address the problem and verify the solution. The cost of
buildbot noise is very high, both in terms of developer time spent, but
more importantly due to people starting to ignore them when monitoring them
becomes costly.

Best,
Tobias

Renato Golin via llvm-dev

2015-Aug-26 15:21 UTC

[llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

On 26 August 2015 at 15:44, Tobias Grosser <tobias at grosser.es> wrote:
> What time-line do you have in mind for this fix? If you are in charge
> and can make this happen within a day, giving cmake + ninja a chance seems
> OK.

It's not my bot. All my bots are CMake+Ninja based and are stable enough.

> However, if the owner of the buildbot is not known or the fix can not come
> soon, I am in favor of disabling the noise and (re)enabling it when someone
> found time to address the problem and verify the solution.

That's up to Galina. We haven't had any action against unstable bots
so far, and this is not the only one. There are lots of Windows and
sanitizer bots that break randomly and provide little information, are
we going to disable them all? How about the perf bots that still fail
occasionally and we haven't managed to fix the root cause, are we
going to disable them, too?

You're asking to reduce considerably the quality of testing on some
areas so that you can reduce the time spent looking at spurious
failures. I don't agree with that in principle. There were other
threads focusing on how to make them less spurious, more stable, less
noisy, and some work is being done on the GreenDragon bot structure.
But killing everything that looks suspicious now will reduce our
ability to validate LLVM on the range of configurations that we do
today, and that, for me, is a lot worse than a few minutes' worth of
some engineers.

> The cost of
> buildbot noise is very high, both in terms of developer time spent, but
> more importantly due to people starting to ignore them when monitoring them
> becomes costly.

I think you're overestimating the cost.

When I get bot emails, I click on the link and if it was timeout, I
always ignore it. If I can't make heads or tails (like the sanitizer
ones), I ignore it temporarily, then look again next day.

My assumption is that the bot owner will make me aware if the reason
is not obvious, as I do with my bots. I always wait for people to
realise, and fix. But if they can't, either because the bot was
already broken, or because the breakage isn't clear, I let people know
where to search for the information in the bot itself. This is my
responsibility as a bot owner.

I appreciate the benefit of having green / red bots, but you also have
to appreciate that hardware is not perfect, and they will invariably
fail once in a while. I had some Polly bots failing randomly and it
took me only a couple of seconds to infer so. I'm not asking to remove
them, even those that fail more than pass throughout the year. I
assume that, if they're still there, it provides *some* value to
someone.

cheers,
--renato

Mehdi Amini via llvm-dev

2015-Aug-26 16:01 UTC

[llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

> On Aug 26, 2015, at 8:21 AM, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On 26 August 2015 at 15:44, Tobias Grosser <tobias at grosser.es> wrote:
>> What time-line do you have in mind for this fix? If you are in charge
>> and can make this happen within a day, giving cmake + ninja a chance seems
>> OK.
>
> It's not my bot. All my bots are CMake+Ninja based and are stable enough.
>
>
>> However, if the owner of the buildbot is not known or the fix can not come
>> soon, I am in favor of disabling the noise and (re)enabling it when someone
>> found time to address the problem and verify the solution.
>
> That's up to Galina. We haven't had any action against unstable bots
> so far, and this is not the only one. There are lots of Windows and
> sanitizer bots that break randomly and provide little information, are
> we going to disable them all? How about the perf bots that still fail
> occasionally and we haven't managed to fix the root cause, are we
> going to disable them, too?
>
> You're asking to reduce considerably the quality of testing on some
> areas so that you can reduce the time spent looking at spurious
> failures. I don't agree with that in principle.

That’s not how I understand his point. In my opinion, he is asking to increase
the quality of testing. You just happen to disagree on his solution :)

The situation does not seem that black and white to me here. In the end, it
seems to me that it is about a threshold: if a bot is crashing 90% of the time,
does it really contribute to increasing the quality of testing, or is it, on the
contrary, just adding noise? Same question with 20%, 40%, 60%, … We may all have a
different answer, but I’m pretty sure we could reach an agreement on what seems
appropriate.

Another way of considering the general impact of a bot on quality is:
“how many legit failures were found by this bot in the last x years that weren’t
covered by another bot”.
Because sometimes you may just be running a HW lab stress rack, without providing
any increased coverage for the software.
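
[Editor's illustration, not part of the original message.] The two measures described above, a failure-rate threshold and the number of real failures only this bot caught, can be sketched in a few lines of Python. Everything here is hypothetical: the `BotHistory` record, its field names, and the 0.10 cut-off are assumptions made for the example, not values agreed on the list and not part of any buildbot API.

```python
from dataclasses import dataclass


@dataclass
class BotHistory:
    """Hypothetical per-bot summary over some review window."""
    name: str
    runs: int                  # total builds in the window
    spurious_failures: int     # failures not caused by a commit (timeouts, hardware)
    unique_real_failures: int  # real regressions that no other bot reported


def false_positive_rate(bot: BotHistory) -> float:
    """Fraction of runs that failed for reasons unrelated to the commit."""
    return bot.spurious_failures / bot.runs if bot.runs else 0.0


def should_notify(bot: BotHistory, max_rate: float = 0.10) -> bool:
    """Send mail only while the bot stays below the agreed noise threshold.

    max_rate is a placeholder; the thread leaves the actual value open.
    """
    return false_positive_rate(bot) < max_rate


def adds_coverage(bot: BotHistory) -> bool:
    """True if the bot caught at least one real failure nobody else caught."""
    return bot.unique_real_failures > 0


if __name__ == "__main__":
    bots = [
        BotHistory("mostly-timeouts", runs=100, spurious_failures=90, unique_real_failures=0),
        BotHistory("stable", runs=200, spurious_failures=4, unique_real_failures=3),
    ]
    for bot in bots:
        print(bot.name, f"{false_positive_rate(bot):.0%}",
              "notify" if should_notify(bot) else "mute",
              "adds coverage" if adds_coverage(bot) else "no unique coverage")
```

A bot that is muted by `should_notify` would still run and record results; only the e-mail to committers is suppressed until its owner brings the rate back down.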

Cheers,

— 
Mehdi


> There were other
> threads focusing on how to make them less spurious, more stable, less
> noisy, and some work is being done on the GreenDragon bot structure.
> But killing everything that looks suspicious now will reduce our
> ability to validate LLVM on the range of configurations that we do
> today, and that, for me, is a lot worse than a few minutes' worth of
> some engineers.
> 
> 
>> The cost of
>> buildbot noise is very high, both in terms of developer time spent, but
>> more importantly due to people starting to ignore them when monitoring them
>> becomes costly.
> 
> I think you're overestimating the cost.
> 
> When I get bot emails, I click on the link and if it was timeout, I
> always ignore it. If I can't make heads or tails (like the sanitizer
> ones), I ignore it temporarily, then look again next day.
> 
> My assumption is that the bot owner will make me aware if the reason
> is not obvious, as I do with my bots. I always wait for people to
> realise, and fix. But if they can't, either because the bot was
> already broken, or because the breakage isn't clear, I let people know
> where to search for the information in the bot itself. This is my
> responsibility as a bot owner.
> 
> I appreciate the benefit of having green / red bots, but you also have
> to appreciate that hardware is not perfect, and they will invariably
> fail once in a while. I had some Polly bots failing randomly and it
> took me only a couple of seconds to infer so. I'm not asking to remove
> them, even those that fail more than pass throughout the year. I
> assume that, if they're still there, it provides *some* value to
> someone.
> 
> cheers,
> --renato
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Tobias Grosser via llvm-dev

2015-Aug-26 16:27 UTC

[llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

@Galina: It seems this bot is now almost permanently running into a compile-time
timeout. Maybe you can fix this by either increasing the timeout or by
switching to a cmake/ninja based build as suggested by Renato.

On 08/26/2015 05:21 PM, Renato Golin via llvm-dev wrote:
> On 26 August 2015 at 15:44, Tobias Grosser <tobias at grosser.es> wrote:
>> What time-line do you have in mind for this fix? If you are in charge
>> and can make this happen within a day, giving cmake + ninja a chance seems
>> OK.
>
> It's not my bot. All my bots are CMake+Ninja based and are stable enough.

I should have looked it up myself. I did not want to finger-point, but
ensure we understand who will address this issue. I just looked up who
owns these builders and to my understanding it is Galina herself.
I CC her such that she can take action.

I also have the feeling I was generally too harsh in my mail, as I seem
to have triggered a rather defensive reply. Sorry for this.

Regarding the discussion about disabling/enabling buildbots: I agree with
Mehdi that there is no black and white. For this bot, it seems important to
address this issue as it seems to start failing very regularly now.

Regarding my own bots: In case you see flaky polly buildbots or any other
of my bots sending emails without reason, please send me a short ping
such that I can fix the issue. None of my LNT bots send emails as they
run too long before starting to report.

Best,
Tobias

Philip Reames via llvm-dev

2015-Aug-26 16:30 UTC

[llvm-dev] buildbot failure in LLVM on clang-native-arm-cortex-a9

On 08/26/2015 08:21 AM, Renato Golin via llvm-dev wrote:
> On 26 August 2015 at 15:44, Tobias Grosser <tobias at grosser.es> wrote:
>> What time-line do you have in mind for this fix? If you are in charge
>> and can make this happen within a day, giving cmake + ninja a chance seems
>> OK.
> It's not my bot. All my bots are CMake+Ninja based and are stable enough.
>
>
>> However, if the owner of the buildbot is not known or the fix can not come
>> soon, I am in favor of disabling the noise and (re)enabling it when someone
>> found time to address the problem and verify the solution.
> That's up to Galina. We haven't had any action against unstable bots
> so far, and this is not the only one. There are lots of Windows and
> sanitizer bots that break randomly and provide little information, are
> we going to disable them all? How about the perf bots that still fail
> occasionally and we haven't managed to fix the root cause, are we
> going to disable them, too?

If the bot fails regularly (say false positive rate 1 in 10 runs), then
yes, it should be disabled until the owner fixes it.  It's perfectly
okay for it to be put into a "known unstable" list and for the bot owner
to report failures after they've been confirmed.

To say this differently, we will revert a *change* which is
problematic.  Why shouldn't we "revert" a bot?

>
> You're asking to reduce considerably the quality of testing on some
> areas so that you can reduce the time spent looking at spurious
> failures. I don't agree with that in principle. There were other
> threads focusing on how to make them less spurious, more stable, less
> noisy, and some work is being done on the GreenDragon bot structure.
> But killing everything that looks suspicious now will reduce our
> ability to validate LLVM on the range of configurations that we do
> today, and that, for me, is a lot worse than a few minutes' worth of
> some engineers.
>
>
>> The cost of
>> buildbot noise is very high, both in terms of developer time spent, but
>> more importantly due to people starting to ignore them when monitoring them
>> becomes costly.
> I think you're overestimating the cost.
>
> When I get bot emails, I click on the link and if it was timeout, I
> always ignore it. If I can't make heads or tails (like the sanitizer
> ones), I ignore it temporarily, then look again next day.

I disagree strongly here.  The cost of having flaky bots is quite high.
When I make a commit, I'm committing to be responsive to problems it
introduces over the next few hours.  Every one of those false positives
is a 5-10 minute high priority interruption to what I'm actually working
on.  In practice, that greatly diminishes my effectiveness.

As an illustrative example, I submitted some documentation changes
earlier this week and got 5 unique build failure notices.  In this case,
I ignored them, but if that had been a small code change, that would
have cost me at least an hour of productivity.

>
> My assumption is that the bot owner will make me aware if the reason
> is not obvious, as I do with my bots. I always wait for people to
> realise, and fix. But if they can't, either because the bot was
> already broken, or because the breakage isn't clear, I let people know
> where to search for the information in the bot itself. This is my
> responsibility as a bot owner.

First, thanks for being a responsible bot owner.  :)

If all bot owners were doing this, having an unstable list which doesn't
actively notify would be completely workable.  If not all bot owners are
doing this, I can't say I really care about the status of those bots.

>
> I appreciate the benefit of having green / red bots, but you also have
> to appreciate that hardware is not perfect, and they will invariably
> fail once in a while. I had some Polly bots failing randomly and it
> took me only a couple of seconds to infer so. I'm not asking to remove
> them, even those that fail more than pass throughout the year. I
> assume that, if they're still there, it provides *some* value to
> someone.
>
> cheers,
> --renato
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
