How common is it for servers to go NOT HEALTHY? I feel it is happening much more often than it should be with us. A few times a month.

If this happens, we reboot the servers. Should we do something else? Maybe it depends on what the problem was?

If we should not be getting NOT HEALTHY that often, what information should I collect to report to CFS?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
Brock Palen wrote:
> How common is it for servers to go NOT HEALTHY? I feel it is
> happening much more often than it should be with us. A few times a
> month.

It should not happen at all, in the normal case. It indicates a problem.

> If this happens, we reboot the servers. Should we do something
> else? Maybe it depends on what the problem was?

Well, determining the actual problem that caused the NOT HEALTHY would be quite useful, yes. I would not just reboot.

- Examine consoles of _all_ servers for any error indications
- Examine syslogs of _all_ servers for any LustreErrors or LBUG
- Check network and hardware health. Are your disks happy? Is your network dropping packets?

Try to figure out what was happening on the cluster. Does this relate to a specific user workload or system load condition? Can you reproduce the situation? Does it happen at a specific time of day, time of month?

> If we should not be getting NOT HEALTHY that often, what information
> should I collect to report to CFS?

The lustre-diagnostics package is a good start for general system config. Beyond that, most of what we would need is listed above.

cliffw

> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
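As a rough illustration of the syslog check suggested above, here is a minimal Python sketch that scans syslog lines for the strings "LustreError" and "LBUG" and tallies them per host and per hour of day, which may help answer the "specific time of day" question. It assumes the traditional syslog line layout ("Jan 13 22:41:07 hostname message"); the function name `scan_syslog` and the regular expression are illustrative choices, not part of Lustre or the lustre-diagnostics package, and would need adjusting if your syslog format differs.

```python
import re
from collections import Counter

# Assumed traditional syslog prefix: "Jan 13 22:41:07 hostname rest-of-message".
SYSLOG_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<hour>\d{2}):\d{2}:\d{2}\s+(?P<host>\S+)\s+(?P<msg>.*)$"
)

# Strings named in the thread as worth looking for.
MARKERS = ("LustreError", "LBUG")


def scan_syslog(lines):
    """Collect lines that mention a marker.

    Returns (hits, by_host, by_hour):
      hits    - list of (host, marker, original line)
      by_host - Counter of matching lines per host
      by_hour - Counter of matching lines per hour of day (0-23)
    Lines that do not look like syslog entries are skipped.
    """
    hits = []
    by_host = Counter()
    by_hour = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        m = SYSLOG_RE.match(line)
        if m is None:
            continue
        # First marker found in the message, or None if there is none.
        marker = next((k for k in MARKERS if k in m.group("msg")), None)
        if marker is None:
            continue
        host = m.group("host")
        hits.append((host, marker, line))
        by_host[host] += 1
        by_hour[int(m.group("hour"))] += 1
    return hits, by_host, by_hour
```

One way to use it would be to pass an open log file from each server (for example `scan_syslog(open("/var/log/messages", errors="replace"))`, where the path is an assumption about your syslog setup) and compare the per-host and per-hour counts across all servers before rebooting anything.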
Ok, thanks. It happened again last night, sooner than normal. I will send a new message with the details.

Brock Palen

On Jan 13, 2009, at 11:09 PM, Cliff White wrote:
> It should not happen at all, in the normal case. It indicates a
> problem.
> [...]
> I would not just reboot.