Folks,

David has been particularly militant with broken buildbots recently, so to make sure we don't throw the baby out with the bathwater, I'd like to propose some changes to how we deal with the emails on our *current* buildmaster, since there are no concrete plans to move it to anything else at the moment.

The main issue is that managing the buildbots is not a simple task. It requires bot owners to disable bots on the slave side, or specific people to do it on the master side. The former can take as long as the owner wants (which is not nice), and the latter refreshes all active bots (triggering exceptions) and is harder to revert.

We need to be pragmatic without re-writing the BuildBot product.

Grab some popcorn...

There are two main fronts on which we need to discuss the noise: bot stability and test stability.


1. Bot stability issues

We need to distinguish between four classes of buildbots:

1.1. Fast && stable && green

These buildbots normally finish in under an hour, most of the time in under half an hour, and should be kept green as much as possible. Therefore, any reasonable noise from these bots is welcome, since we want them to go back to green as soon as possible.

They're normally the front line, and usually catch most of the silly bugs. But we need some kind of policy that allows us to revert patches that break them for more than a few hours. We have an agreement already, and for me that's good enough. People might think differently.

With the items 2.x below taken care of, we should keep the current state of our bots for this group.

1.2. One of: Slow || Unstable || Often Red

These bots are special. They're normally *very* important, but have some issues, such as slow hardware, too few available boards, or long turnaround times to bisect and fix the bugs.

These bots catch the *serious* bugs, like a self-hosted Clang mis-compiling a long-running test so that it sometimes fails. They can produce noise, but when the noise is correct, we really need to listen to it. Writing software to understand that is non-trivial.

So, the idea here is to have a special treatment for each type of problem. For example, slow bots need more hardware to reduce the blame list. Unstable bots need more work to reduce spurious noise to a minimum (see 2.x below), and red bots *must* remain *silent* until they come back to green (see 2.x below).

What we *don't* want is to disable or silence them once they're green. Most of the bugs they find are hard to debug, so the longer we take to fix them, the harder it is to find out what happened. We need to know as soon as possible when they break.

1.3. Two of: Slow || Unstable || Often Red

These bots are normally only important to their owners, and they are on the verge of being disabled. The only way to cope with these bots is to completely disable their emails / IRC messages, so that no one gets flooded with noise from broken bots.

However, some bots in the 1.2 category fall into this one for short periods of time (~1 week), so we need to be careful with what we disable here. That's the key baby/bathwater issue.

Any hard policy here will be wrong for some bots some of the time, so I'd love it if we could all just trust the bot owners a bit when they say they're fixing the issue. However, if a bot falls here for more than a month, or more often than a few times over a few months (I'm being vague on purpose), then we collectively decide to disable it.

What I *don't* want is two or three people deciding to disable someone else's buildbot because they can't stand the noise.
Remember, people do take holidays once in a while, and they may be in the Amazon or the Sahara having a well-deserved rest. Returning to work and learning that all your bots have been disabled for a week is not nice.

So far, we have coped with noise, and the result is that people tend to ignore those bots, which means more work for the bot owner. This is not a good situation, and we want to move away from it, but we shouldn't flip all the switches off by default. We can still be pragmatic about this as long as we improve the overall quality (see 2.x below) over time.

In summary, bots that fall here for too long will have their emails disabled and become candidates for removal in the next spring clean-up, but not immediately.

1.4. Slow && Unstable && Red

These bots don't belong here. They should be moved elsewhere, preferably to a local buildmaster that you control and that will never email people or upset our master when you need changes. I have such a local master myself and it's very easy to set up and maintain.

They *do* have value to *you*, for example to show the progress of your features cleaning up the failures, or to generate some benchmark numbers, but that's something that is very specific to your project and should remain separate.

Any of these bots in the LLVM Lab should be moved away / removed, but only by consensus, including the bot owner if he/she is still reachable on the list.


2. Test stability issues

These issues, as you may have noticed from the links above, apply to *all* bots. The less noise we have overall, the lower our threshold will be for kicking bots out of the critical pool, and the higher the value of the not-so-perfect buildbots to the rest of the community.

2.1 Failed vs Exception

The most critical issue we have to fix is the "red -> exception -> red" issue. Basically, a bot is red (because you're still investigating the problem), then someone restarts the master, so you get an exception. The next build will be a failure, and the buildmaster recognises the status change and emails everyone. That's just wrong.

We need to add an extra check to that logic, so that it walks back to the most recent non-exception status and compares against that, not just the immediately previous result.

This is a no-brainer and I don't think anyone would be against it. I just don't know where this is done; I welcome the knowledge of more experienced folks.

2.2 Failure types

The next obvious thing is to detect what the error is. If it's an SVN error, we *really* don't need to get an email. But this raises the problem that an SVN failure followed by a genuine failure would not be reported. So the reporting mechanism also has to know what the previously *reported* failure was, not just the previous failure.

Other failures, like timeouts, can be either flaky hardware or broken codegen. A way to be conservative and low-noise would be to only warn on a timeout IFF it's the *second* in a row.

For all these adjustments, we'll need some form of walk-back through the history to find the previous genuine result, and we'll need to mark results with some metadata. This may involve some patches to buildbot.
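To make that concrete, here's a minimal sketch of the kind of walk-back I have in mind. It's plain Python, not tied to any particular buildbot version; the result constants and the shape of the history list are made up for illustration, not buildbot's real API:

    # Sketch only: result names mirror buildbot's SUCCESS/FAILURE/EXCEPTION idea,
    # but these constants and the "history" list are illustrative, not buildbot's API.
    SUCCESS, FAILURE, EXCEPTION, SVN_ERROR, TIMEOUT = range(5)

    def previous_meaningful_result(history):
        """Walk back past exceptions and checkout noise to find the last
        result that was (or would have been) reported."""
        for result in reversed(history):
            if result in (EXCEPTION, SVN_ERROR):
                continue  # master restarts and SVN hiccups are not real statuses
            return result
        return SUCCESS  # no meaningful history; assume the bot was green

    def should_email(history, new_result):
        """Notify only on a genuine green -> red (or red -> green) transition."""
        if new_result in (EXCEPTION, SVN_ERROR):
            return False  # infrastructure noise: never email the blame list
        previous = previous_meaningful_result(history)
        if new_result == TIMEOUT:
            return previous == TIMEOUT  # only warn if it's the second timeout in a row
        return (previous == SUCCESS) != (new_result == SUCCESS)

With something like this, "red -> exception -> red" stays quiet because the exception is skipped, and an SVN failure followed by a genuine failure still gets reported, because the walk-back only compares against results we would actually have reported.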
2.3 Detecting fixed bots

Another interesting feature, present in the "GreenBot", is a warning when a bot you broke has been fixed. That, per se, is not a good idea while noise levels are high, since it would probably double them.

So, this feature can only be introduced *after* we've done the clean-ups above. But once things are clean, having a "green" email will put at ease everyone who hasn't seen the "red" email yet, as they will know they don't even need to look at it at all, just delete the email.

For those using fetchmail, I'm sure you could create a rule to do that automatically, but that's optional. :)

2.4 Detecting new failures

This is a wish-list item of mine, for the case where the bots are slow and hard to debug and are still red. Assuming everything above is fixed, they will emit no noise until they go green again. However, while I'm debugging the first problem, others can appear. If that happens, *I* want to know, but not necessarily everyone else.

So, a list of the problems already reported could be added to the failure report, and if the failure is different, the bot owner gets an email. This would have to play nicely with exception statuses, as well as spurious failures like SVN errors or timeouts, so it's not an easy patch.

The community at large would already be happy with all the changes minus this one, but folks that have to maintain slow hardware like me would appreciate this feature. :)


Does anyone have more concerns?

AFAICS, we should figure out where the walk-back code needs to be inserted, and that would get us 90% of the way. The other 10% will be to list all the buildbots, check their statuses and owners, map them into the categories above, and take the appropriate action.

Maybe we should also reduce the noise in the IRC channel further (like only first red, first green), but that's not my primary concern right now. Feel free to look into it if it is for you.

cheers,
--renato
Hi Renato,

Very useful thoughts, thanks. I need to think about what could be done about these. I will add a few comments from my side.

The buildmaster as configured now should send notifications on status change only for 'successToFailure' and 'failureToSuccess' events, so always-red bots should be quiet. We also have a group of builders (experimental_scheduled_builders) in the configuration file builders.py which should also be quiet; this is the place for noisy, unstable bots. If these features are not working properly, please let me know and I will try to watch them as well.

Unfortunately, buildbot currently does not distinguish test and build failures.

I am going to be away on vacation the whole of next week, but will keep an eye on buildbot.

Thanks,
Galina
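P.S. For those not familiar with the master configuration, this is roughly the kind of setup being described. The snippet below is only an illustrative sketch in the style of a buildbot 0.8 master.cfg; the addresses, channel names and exact arguments are placeholders, not the real zorg configuration:

    # Illustrative only: not the actual LLVM zorg master.cfg.
    from buildbot.status.mail import MailNotifier
    from buildbot.status.words import IRC

    # Email only when a builder changes state (green -> red or red -> green),
    # so an always-red builder stays quiet.
    c['status'].append(MailNotifier(
        fromaddr='buildbot@example.org',      # placeholder address
        mode='change',
        sendToInterestedUsers=True))

    # IRC announcements restricted to genuine state changes.
    c['status'].append(IRC(
        host='irc.example.org',               # placeholder network and channel
        nick='bb',
        channels=['#llvm-bots'],
        notify_events={'successToFailure': 1,
                       'failureToSuccess': 1}))

Builders in the experimental group would simply be left out of the categories these notifiers watch, so they build but never email anyone.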
On Thu, Oct 1, 2015 at 1:31 PM, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> the latter refreshes all active bots (triggering exceptions) and is
> harder to revert.

I haven't looked at LLVM's configs or buildbot setup at all, but with the buildbot I ran previously it was possible to have the master reload its configs without restarting it or interrupting any of the unmodified builders.

Might be worth looking into why that doesn't work (if it doesn't)?
For many changes, restarting the master is not necessary, but not for all of them. There is room for improvement here as well.

Thanks,
Galina
I agree with almost everything you said. A couple of comments inline.

On 10/01/2015 10:31 AM, Renato Golin via llvm-dev wrote:
> 1.2. One of: Slow || Unstable || Often Red
> [...]
> So, the idea here is to have a special treatment for each type of
> problem. For example, slow bots need more hardware to reduce the blame
> list. Unstable bots need more work to reduce spurious noise to a
> minimum (see 2.x below), and red bots *must* remain *silent* until
> they come back to green (see 2.x below).

I view the three conditions as warranting somewhat different treatment. Specifically:

"slow" - these are tolerable, if annoying.

"unstable" - these should be removed immediately. If the failure rate is more than 1 in 5 builds of a known clean revision, that's far too much noise to be notifying on. To be clear, I'm specifically referring to spurious *failures*, not environmental factors which are global to all bots.

"often red" - these are extremely valuable (msan, etc.). Assuming we only notify on green->red, the only rule we should likely enforce is that each bot has been green "recently". I'd suggest a threshold of 2 months. If it hasn't been green in 2 months, it's not really a buildbot.
> 1.3. Two of: Slow || Unstable || Often Red
> [...]
> However, some bots in the 1.2 category fall into this one for short
> periods of time (~1 week), so we need to be careful with what we
> disable here. That's the key baby/bathwater issue.

+1. Any reasonable threshold is fine. We just need to have one.

> What I *don't* want is two or three people deciding to disable someone
> else's buildbot because they can't stand the noise. Remember, people do
> take holidays once in a while, and they may be in the Amazon or the
> Sahara having a well-deserved rest. Returning to work and learning that
> all your bots have been disabled for a week is not nice.

So, maybe I'm missing something, but: why is it any harder to bring a silenced bot back to green than an emailing one?
On 1 October 2015 at 23:01, Philip Reames <listmail at philipreames.com> wrote:
> "unstable" - these should be removed immediately. If the failure rate is
> more than 1 in 5 builds of a known clean revision, that's far too much
> noise to be notifying on. To be clear, I'm specifically referring to
> spurious *failures*, not environmental factors which are global to all
> bots.

There are some bugs that introduce intermittent behaviour, and it would be very bad if we just disabled the bots that warned us about them. Some genuine bugs in Clang or the sanitizers can come and go if they depend on where the objects are stored in memory, or on whether a block happens to be aligned or not. One example is Clang's inability to cope with alignment when using its own version of placement new for derived classes. Our ARM bots have been warning about such issues for more than a year, and we have fixed most of them.

If we had disabled the ARM bots the first time they became "unstable", we would still have those problems and we wouldn't be testing on ARM any more. Two very bad outcomes. We have to protect ourselves from assuming too much, too early.

> "often red" - these are extremely valuable (msan, etc.). Assuming we only
> notify on green->red, the only rule we should likely enforce is that each
> bot has been green "recently". I'd suggest a threshold of 2 months. If it
> hasn't been green in 2 months, it's not really a buildbot.

This is the case I make in 1.4. If a bot is assumed to be red because someone is slowly fixing its problems, then it belongs on a separate buildmaster.

However, slow bots tend to be red for longer periods of time, not necessarily for a larger number of builds. OTOH, fast bots can be red for a very large number of builds, but go back to green immediately when a revert is applied. So we need to be careful about timings here.

> So, maybe I'm missing something, but: why is it any harder to bring a
> silenced bot back to green than an emailing one?

It's not. But keeping it green afterwards is, because it takes time to change the buildmaster. For obvious reasons, not all of us have access to the buildmaster, meaning we depend on the few people that work on it directly to move things around. By adding the uncertainty of commits breaking the build to the uncertainty of when the master will be updated, you can easily fall into a deadlock.

I have been in situations where, in the space of two weeks, I had to bring one bot from red to green five times. If someone had silenced that bot in between, it could have taken me longer to realise, and every new failure on top of the original one makes the process non-linearly more complex, especially if someone is committing loads of patches on top to try to fix the mess. Reverting two interleaved sequences of patches independently is more than twice as hard as reverting one sequence, and so on.

I think if we had different public masters, and if the bot owner had the responsibility to move bots between them, that could work well, since moving between masters is in the owner's power, while moving between groups on the master is not. We can then leave disabling the bot on the master as a more radical solution, for when the bot owner is unresponsive or uncooperative.

cheers,
--renato
On 1 October 2015 at 19:38, Galina Kistanova <gkistanova at gmail.com> wrote:
> The buildmaster as configured now should send notifications on status
> change only for 'successToFailure' and 'failureToSuccess' events, so
> always-red bots should be quiet.

Hi Galina,

This is true, but when a bot goes from red to exception (when the master is restarted, for instance) and back to red, we get emails. Maybe the master is treating exception as success, because it doesn't want to warn on its own exceptions, but that creates the problem described above. We need richer logic than just success<->failure.

> We also have a group of builders (experimental_scheduled_builders) in
> the configuration file builders.py which should also be quiet; this is
> the place for noisy, unstable bots.

But moving bots between those groups needs configuration changes. Those not only can take a while to happen, they also depend on the buildbot admins.

My other proposal is to have different buildmasters, with the same configuration for all bots, but one that emails and one that doesn't. As a bot owner, I could easily move between them by just updating one line in my bot's config, without needing anyone else's help.

cheers,
--renato
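P.S. To be concrete, the "one line" is the master address in the slave's own buildbot.tac. A fragment of what that looks like on the slave side (the host name, port and credentials below are placeholders, and the surrounding boilerplate generated by "buildslave create-slave" is omitted):

    # Relevant lines of a slave-side buildbot.tac (buildbot 0.8-style; the
    # values are placeholders). Switching masters is a matter of editing
    # buildmaster_host/port and restarting the slave.
    buildmaster_host = 'quiet-master.example.org'   # or the noisy, emailing master
    port = 9990
    slavename = 'my-arm-board'
    passwd = 'not-the-real-password'

So the bot owner can park a misbehaving bot on a quiet master and bring it back later, without waiting for a master-side reconfig.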
On Thu, Oct 1, 2015 at 10:31 AM, Renato Golin <renato.golin at linaro.org> wrote:
> Therefore, any reasonable noise from these bots is welcome, since we
> want them to go back to green as soon as possible.

Not sure what kind of noise you're referring to here. Flaky fast builders would still be a bad thing - so that sort of noise should still be questioned.

> They're normally the front line, and usually catch most of the silly
> bugs. But we need some kind of policy that allows us to revert patches
> that break them for more than a few hours.

I'm not sure if we need extra policy here - but I don't mind documenting the common community behavior to make it more clear. Essentially: if you've provided a contributor with a way to reproduce the issue, and it seems to clearly be a valid issue, revert to green & let them look at the reproduction when they have time. We do this pretty regularly (especially outside office hours, when we don't expect someone will be around to revert it themselves - but honestly, I don't see that as a requirement - if you've provided the evidence for them to investigate, revert first & they can investigate whenever they get to it, sooner or later).

> These bots are special. They're normally *very* important, but have
> some issues, such as slow hardware, too few available boards, or long
> turnaround times to bisect and fix the bugs.

Long bisection is a function of not enough boards (producing large revision ranges for each run), generally - no? (or is there some other reason?)

> These bots catch the *serious* bugs,

Generally all bots catch serious bugs - it's just a long tail: fast, easy-to-find bugs first, then longer tests find the harder-to-find bugs, and so on and so forth (until we get below the value/bug threshold where it's not worth expending the CPU cycles to find the next bug).

> like a self-hosted Clang mis-compiling a long-running test so that it
> sometimes fails. They can produce noise, but when the noise is correct,
> we really need to listen to it.

Again, not sure which kind of noise you're referring to here - it'd be helpful to clarify/disambiguate. Flaky or often-red results on slow buildbots without enough resources (long blame lists) are pretty easily ignored ("oh, it could be any of those 20 other people's patches, I'll just ignore it - someone else will do the work & tell me if it's my fault").

> So, the idea here is to have a special treatment for each type of
> problem.

But they are problems that need to be addressed, is the key - and arguably, until they are addressed, these bots should only report to the owner, not to contributors. (As above - if people generally ignore them because they're not accurate enough to believe that it's 'your' fault, then they essentially are already leaving the investigation to the owner - they just have extra email to ignore too. Let's remove the email so that we can make the mail we do send more valuable by not getting lost in the noise.)

> For example, slow bots need more hardware to reduce the blame list.

Definitely ^.

> Unstable bots need more work to reduce spurious noise to a minimum (see
> 2.x below), and red bots *must* remain *silent* until they come back to
> green (see 2.x below).

As I mentioned on IRC/other threads - having red bots, even if they don't send email, does come at some cost. It makes dashboards hard to read. So for those trying to get a sense of the overall state of the project (what's on fire / what needs to be investigated), this can be problematic. Having issues XFAILed (with a bug filed, or someone otherwise owning the issue until the XFAIL is removed), or reverted aggressively, or having bots moved into a separate group so that there's a clear "this is the stuff we should expect to be green all the time" group that can be eyeballed quickly, is nice.

> What we *don't* want is to disable or silence them once they're green.
> [...] We need to know as soon as possible when they break.

I still question whether these bots provide value to the community as a whole when they send email. If the investigation usually falls to the owners rather than the contributors, then the emails they send (& their presence on a broader dashboard) may not be beneficial. So to be actionable they need to have small blame lists and be reliable (low false positive rate). If either of those is compromised, investigation will fall to the owner, and ideally they should not be present in the email/core dashboard groups.

> 1.3. Two of: Slow || Unstable || Often Red
>
> These bots are normally only important to their owners, and they are on
> the verge of being disabled.

I don't think they have to be on the verge of being disabled - so long as they don't send email and are in a separate group, I don't see any problem with them being on the main llvm buildbot. (No particular benefit either, I suppose - other than saving the owner the hassle of running their own master, which is fine.)

> The only way to cope with these bots is to completely disable their
> emails / IRC messages, so that no one gets flooded with noise from
> broken bots.

Yep.

> However, some bots in the 1.2 category fall into this one for short
> periods of time (~1 week), so we need to be careful with what we disable
> here. That's the key baby/bathwater issue.
> Any hard policy here will be wrong for some bots some of the time, so
> I'd love it if we could all just trust the bot owners a bit when they
> say they're fixing the issue.

It's not a question of trust, from my perspective - regardless of whether they will address the issue or not, the emails add noise and decrease the overall trust developers have in the signal (via email, dashboards and IRC) from the buildbots. If an issue is being investigated, we have tools to deal with that: XFAIL, revert, and buildbot reconfig (we could/should check if the reconfig for email configuration can be done without a restart - yes, it still relies on a buildbot admin being available (perhaps we should have more people empowered to reconfig the buildmaster to make this cheaper/easier), but without the interruption to all builds). If there's enough hardware that blame lists are small and the bot is reliable, then reverts can happen aggressively. If not, XFAIL is always an option too.

> Remember, people do take holidays once in a while, and they may be in
> the Amazon or the Sahara having a well-deserved rest. Returning to work
> and learning that all your bots have been disabled for a week is not
> nice.

I disagree here - if the bots remain red, they should be addressed. This is akin to committing a problematic patch before you leave - you should expect/hope it is reverted quickly so that you're not interrupting everyone's work for a week. If your bot is not flaky and has short blame lists, I think it's possibly reasonable to expect that people should revert their patches rather than disable the bot or XFAIL the test on that platform. But without access to hardware it may be hard for them to investigate the failure - XFAIL is probably the right tool; then, when the owner is back, they can provide a reproduction, extra logs, help remote-debug it, etc.

> So far, we have coped with noise, and the result is that people tend to
> ignore those bots, which means more work for the bot owner.

The problem is that that work doesn't only fall on the owners of the bots which produce the noise. It falls on all bot owners, because developers become immune/numb to bot failure mail to a large degree.

> 1.4. Slow && Unstable && Red
>
> These bots don't belong here. They should be moved elsewhere, preferably
> to a local buildmaster that you control and that will never email people
> or upset our master when you need changes.

Yep - bots that are only useful to the owner (some of the situations above, I think, constitute this situation, but anyway) shouldn't email or show up in the main buildbot group. But I wouldn't mind if we had a separate grouping in the dashboards for these bots (I think we have an experimental group which is somewhat like this). No big deal either way to me. If they're not sending mail/IRC messages, and they're not in the main group on the dashboard, I'm OK with it.

> These issues, as you may have noticed from the links above, apply to
> *all* bots. The less noise we have overall, the lower our threshold will
> be for kicking bots out of the critical pool, and the higher the value
> of the not-so-perfect buildbots to the rest of the community.

I'm not quite sure I follow this comment. The less noise we have, the /more/ problematic any remaining noise will be (because it'll be costing us more relative to no noise - when we have lots of noise, any one specific source of noise isn't critical, we can remove it but it won't change much - when there's a little noise, removing any one source substantially decreases our false positive rate, etc).

> 2.1 Failed vs Exception
> [...]
> This is a no-brainer and I don't think anyone would be against it. I
> just don't know where this is done; I welcome the knowledge of more
> experienced folks.

Yep, sounds like we might be able to have Galina look into that. I have no context there about where that particular behavior might be (whether it's in the buildbot code itself, or in the user-provided buildbot configuration, etc).

> The next obvious thing is to detect what the error is. If it's an SVN
> error, we *really* don't need to get an email.

Depends on the error - if it's transient, then this is flakiness as always & should be addressed as such (by trying to remove/address the flakes). Though, yes, this sort of failure should, ideally, probably, go to the buildbot owner but not to users.

> Other failures, like timeouts, can be either flaky hardware or broken
> codegen. A way to be conservative and low-noise would be to only warn on
> a timeout IFF it's the *second* in a row.

I don't think this helps - it reduces the incidence, but isn't a real solution. We should reduce the flakiness of the hardware. If hardware is this unreliable, why would we be building a compiler for it? No user could rely on it to produce the right answer. (& again, if the flakiness is bad enough - I think that goes back to an owner-triaged bot, one that doesn't send mail, etc.)

> For all these adjustments, we'll need some form of walk-back through the
> history to find the previous genuine result, and we'll need to mark
> results with some metadata. This may involve some patches to buildbot.

Yeah, having temporally related buildbot results seems dubious/something I'd be really cautious about.

> 2.3 Detecting fixed bots

Yeah, I don't know what the right solution is here at all - but it certainly would be handy if there were an easier way to tell if an issue has been resolved since your commit. I imagine one of the better options would be some live embedded HTML that would just show a green square/some indicator that the bot has been green at least once since this commit. (That doesn't help if you introduced a flaky test, though... that's harder to deal with/convey to users; repeated test execution may be necessary in that case - that's when temporal information may be useful.)

> 2.4 Detecting new failures
>
> This is a wish-list item of mine, for the case where the bots are slow
> and hard to debug and are still red. Assuming everything above is fixed,
> they will emit no noise until they go green again. However, while I'm
> debugging the first problem, others can appear. If that happens, *I*
> want to know, but not necessarily everyone else.

This seems like the place where XFAIL would help you and everyone else. If the original test failure was XFAILed immediately, the bot would go green, then red again if a new failure was introduced. Not only would you know, but so would the author of the change.
On 5 October 2015 at 22:28, David Blaikie <dblaikie at gmail.com> wrote:
> Not sure what kind of noise you're referring to here. Flaky fast
> builders would still be a bad thing - so that sort of noise should
> still be questioned.

Sorry, I meant "noise" as in "sound", not as opposed to "signal". These bots are assumed stable, otherwise they would be in another category below.

> I'm not sure if we need extra policy here - but I don't mind documenting
> the common community behavior to make it more clear.

Some people in the community behave quite differently from others. I sent this email because I felt we disagree on some fundamental properties of the buildbots, and until we can agree on a common strategy, there is no consensus or "common behaviour" to be documented. However, I agree, we don't need "policy", just "documented behaviour" as usual. That was my intention when I said "policy".

> Long bisection is a function of not enough boards (producing large
> revision ranges for each run), generally - no? (or is there some other
> reason?)

It's not that simple. Some bugs appear after several iterations of green results. It may sound odd, but I have had at least three this year. These are the hardest bugs to find, and usually standard regression scripts can't find them automatically, so I have to do most of the investigation manually. This takes *a lot* of time.

> Generally all bots catch serious bugs.

That's not what I meant. Quick bots catch bad new tests (over-assuming on CHECK lines, forgetting to specify the triple on RUN lines) as well as simple code issues (32 vs 64 bits, new vs old compiler errors, etc.), just because they're the first to run in a different environment than the developer uses. Slow bots are most of the time buffered against those, since patches and fixes (or reverts) tend to come in bundles while the slow bot is building.

> Again, not sure which kind of noise you're referring to here - it'd be
> helpful to clarify/disambiguate.

Noise here is less "sound" and more "noisy signal". Some of the "noise" from these bots is just noise; some is signal masquerading as noise. Of course, the higher the noise level, the harder it is to interpret the signal, but as is usual in science, sometimes the only signal we have is a noisy one.

It's common for mathematicians to scoff at the physicists' lack of precision, as it is for them to do the same to chemists, then biologists, etc. When you're at the top, it seems folly that some people endure large amounts of noise in their signal, but when you're at the bottom and your only signal has a lot of noise, you have to work with it and make do with what you have.

As I said above, it's not uncommon for a failure to "pass" the tests for a few iterations before failing. So we're not talking *only* about hardware noise, but also about noise at the code level, where assumptions based on the host architecture might not be valid on other architectures. Most of us develop on x86 machines, so it's only logical that PPC, MIPS and ARM buildbots will fail more often than x86 ones. But that's precisely the point of having those bots in the first place.
Requesting to disable those bots because they generate noise is like asking people for their opinion of a product, publishing the positive reviews, and suing the rest.

> But they are problems that need to be addressed, is the key - and
> arguably, until they are addressed, these bots should only report to the
> owner, not to contributors.

If we hadn't already had those bots for many years, and if we had another way of testing on those architectures, I'd agree with you. But we don't. I agree we need to improve. I agree it's the architecture-specific community's responsibility to do so. I just don't agree that we should disable all noise (with signal, baby/bath) until we do so. By the time we get there, all sorts of problems will have crept in, and we'll enter a vicious cycle. Been there, done that.

> I still question whether these bots provide value to the community as a
> whole when they send email. If the investigation usually falls to the
> owners rather than the contributors, then the emails they send (& their
> presence on a broader dashboard) may not be beneficial.

Benefit is a spectrum. People have different thresholds. Your threshold is tougher than mine because I'm used to working in an environment where the noise is almost as loud as the signal. I don't think we should be bound to either of our thresholds; that's why I'm opening the discussion to build a migration plan towards less noise. But that plan doesn't include killing bots just because they annoy people.

If you plot value as a function of noise and benefit, you get a surface with maxima and minima. Your proposal is to set a threshold and cut all the bots that fall in the minima below that line. My proposal is to move all those bots as high as we can, and only then cut the bots that didn't make it past the threshold.

> So to be actionable they need to have small blame lists and be reliable
> (low false positive rate). If either of those is compromised,
> investigation will fall to the owner, and ideally they should not be
> present in the email/core dashboard groups.

Ideally, this is where both of us want to be. Realistically, it'll take a while to get there. We need changes in the buildbot area, but there are also inherent problems that cannot be solved. Any new architecture (like AArch64) will have only experimental hardware for years, and later on an experimental kernel, then experimental tools, etc. When developing a new back-end for a compiler, those unstable and rapidly evolving environments are the *only* thing you have to test on. You normally only have one or two devices (experimental means either *very* expensive or priceless), so having multiple boxes per bot is highly unlikely. It can also mean that the experimental device you got last month is not supported any more because a new one is coming, so you'll have to live with its bugs until you get the new one, which will come with its own bugs.

For older ARM cores (v7), this is less of a problem, but since old ARM hardware was never designed for production machines, its flakiness is inherent to the form factor. It is possible to get those boards into a stable-enough configuration, but it takes time, resources, excess hardware and people constantly monitoring the infrastructure. We're getting there, but we're not there yet.

I agree that this is mostly *my* problem and *I* should fix it, and believe me, I *want* to fix it; I just need a bit more time.
I suspect that the other platform folks feel the same way, so I'd appreciate a little more respect when we talk about acceptable levels of noise and effort.

> I disagree here - if the bots remain red, they should be addressed. This
> is akin to committing a problematic patch before you leave - you should
> expect/hope it is reverted quickly so that you're not interrupting
> everyone's work for a week.

Absolutely not! Committing a patch and going on holiday is a disrespectful act. Bot maintainers going on holiday is an inescapable fact. Silencing a bot while the maintainer is away is a possible way around it, but disabling it is most disrespectful.

However, I'd like to remind you of the confirmation bias problem, where people will look at the bot, think it's noise, and silence the bot when they could easily have fixed it. Later on, when the owner gets back to work, surprise new bugs that weren't caught will fill the first weeks. We have to be extra careful when taking actions without the bot owner's knowledge.

> I'm not quite sure I follow this comment. The less noise we have, the
> /more/ problematic any remaining noise will be

Yes, I meant what you said. :) Less noise, higher bar to meet.

> Depends on the error - if it's transient, then this is flakiness as
> always & should be addressed as such (by trying to remove/address the
> flakes). Though, yes, this sort of failure should, ideally, probably, go
> to the buildbot owner but not to users.

Ideally, SVN errors should go to the site admins, but let's not get ahead of ourselves. :)

> I don't think this helps - it reduces the incidence, but isn't a real
> solution.

I agree.

> We should reduce the flakiness of the hardware. If hardware is this
> unreliable, why would we be building a compiler for it?

Because that's the only hardware that exists.

> No user could rely on it to produce the right answer.

No user is building trunk on every commit (ish). Buildbots are not meant to be as stable as a user (including distros) would require; that's why we have extra validation for releases. Buildbots build potentially unstable compilers, otherwise we wouldn't need buildbots in the first place.

> Yeah, having temporally related buildbot results seems dubious/something
> I'd be really cautious about.

This is not temporal; it's just treating exception as no-change instead of success. The only reason it's treated as success right now is that, since we're set up to email on every failure, we don't want to spam people when the master is reloaded. That's the wrong meaning for the wrong reason.

> I imagine one of the better options would be some live embedded HTML
> that would just show a green square/some indicator that the bot has been
> green at least once since this commit.

That would be cool! But I suspect it would come at the cost of a big change to the buildbots. Maybe not...
>> This is a wish-list item of mine, for the case where the bots are slow
>> and hard to debug and are still red. [...] If that happens, *I* want to
>> know, but not necessarily everyone else.
>
> This seems like the place where XFAIL would help you and everyone else.
> If the original test failure was XFAILed immediately, the bot would go
> green, then red again if a new failure was introduced. Not only would
> you know, but so would the author of the change.

I agree in principle. I just worry that it's a lot easier to add an XFAIL than to remove it later. Though it might just be a matter of documenting the common behaviour and expecting people to follow through.

cheers,
--renato
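P.S. For anyone not familiar with the mechanism being discussed: an XFAIL is a one-line annotation inside the lit test itself. A made-up example (the triple and the note are purely illustrative, not a real test):

    ; Hypothetical test header showing how a failure can be XFAILed on one
    ; platform while the bot owner investigates (the triple is illustrative).
    ; RUN: llc < %s -mtriple=armv7-unknown-linux-gnueabihf -o /dev/null
    ; XFAIL: arm

The flip side is exactly the one mentioned above: nothing forces the line to be removed later, so it is worth pairing the XFAIL with a bug report or a note saying when it can go away.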