Hi -- I'm running a server which is usually stable, but every once in a while it hangs. The server is used as a file store using NFS and to run VMware machines. I don't see anything in /var/log/messages or elsewhere to indicate any problem or offer any clue why the system was hung. Any suggestions where I might look for a clue? -- Michael Eager eager at eagercon.com 1960 Park Blvd., Palo Alto, CA 94306 650-325-8077
On 3/8/2011 11:24 AM, Michael Eager wrote:
> Hi --
>
> I'm running a server which is usually stable, but every
> once in a while it hangs. The server is used as a file
> store using NFS and to run VMware machines.
>
> I don't see anything in /var/log/messages or elsewhere
> to indicate any problem or offer any clue why the system
> was hung.
>
> Any suggestions where I might look for a clue?

Probably something hardware related. Bad memory, overheating, power supply, etc. I've even seen some rare cases where a BIOS update would fix it, although it didn't make much sense for a machine to run for years, then need a firmware change.

-- 
Les Mikesell
lesmikesell at gmail.com
> I'm running a server which is usually stable, but every
> once in a while it hangs.

There can be many reasons for that. One thing I'm curious about - try looking at the reallocated sector count and the current pending sector count for your drives with smartctl.
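A minimal sketch of that check, assuming the drive prints the usual ATA attribute table for `smartctl -A`. The function name, the attribute names in `WATCHED`, and the sample text are illustrative assumptions; some drives use different attribute names or omit these attributes.

```python
# Attribute names as they commonly appear in `smartctl -A` output.
# Assumption: your drive uses these names; some vendors differ.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def watched_raw_values(smartctl_output):
    """Return {attribute_name: raw_value} for the watched SMART attributes."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in WATCHED:
            found[fields[1]] = int(fields[9])
    return found


# Hypothetical sample shaped like the table smartctl prints; not real data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""

if __name__ == "__main__":
    for name, raw in watched_raw_values(SAMPLE).items():
        print(name, raw)
```

On the server itself you would feed it the real output, e.g. the text captured from `smartctl -A /dev/sda` (the device name is a placeholder). A raw value that is non-zero and growing between checks is the thing to watch for.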
On Tue, Mar 8, 2011 at 12:24 PM, Michael Eager <eager at eagerm.com> wrote:
> Hi --
>
> I'm running a server which is usually stable, but every
> once in a while it hangs. The server is used as a file
> store using NFS and to run VMware machines.
>
> I don't see anything in /var/log/messages or elsewhere
> to indicate any problem or offer any clue why the system
> was hung.
>
> Any suggestions where I might look for a clue?

Please be more specific when you say it "hangs". Does it just pause for a minute and then continue working, or does it freeze completely until you reboot it? Does it respond to a "soft" reboot like Ctrl-Alt-Del, or do you need to hard power it off?

Since this is an NFS server I'm going to guess there might be a lot of IO. Maybe there is some large IO load going on, like all your VMs running an anti-virus scan at the same time, or something like that.

To troubleshoot, I recommend installing the 'sar' utilities (yum install sysstat) and then reviewing the collected data using the 'ksar' utility (http://sourceforge.net/projects/ksar/). sar/ksar are good for tracking down acute problems.
on 09:24 Tue 08 Mar, Michael Eager (eager at eagerm.com) wrote:
> Hi --
>
> I'm running a server which is usually stable, but every
> once in a while it hangs. The server is used as a file
> store using NFS and to run VMware machines.
>
> I don't see anything in /var/log/messages or elsewhere
> to indicate any problem or offer any clue why the system
> was hung.
>
> Any suggestions where I might look for a clue?

I'd very strongly recommend you configure netconsole. Though not entirely clear from the name, it's actually an in-kernel network logging module, which is very useful for kicking out kernel panics which otherwise aren't logged to disk and can't be seen on a (nonresponsive) monitor.

Alternately, a serial console which actually retains all output sent to it (some remote access systems support this, some don't) may help.

Barring that, I'd start looking at individual HW components, starting with RAM.

The trick with netconsole is in passing the appropriate parameters to the module at load time. I found it helpful to have an @reboot cron job to do this. You'll need to pass the local port, local system IP, local network device, remote syslog UDP port, remote syslog IP, and the /gateway/ MAC address, where gateway is the syslogd host (if on a contiguous ethernet segment), or your network gateway host, if not. Some parsing magic can determine these values for you.

Good article describing configuration: http://www.cyberciti.biz/tips/linux-netconsole-log-management-tutorial.html

If you're not already remote-logging all other activity, I'd do that as well. You might catch the start of the hang, if not all of it.

-- 
Dr. Ed Morbius, Chief Scientist /        |
Robot Wrangler / Staff Psychologist      | When you seek unlimited power
Krell Power Systems Unlimited            | Go to Krell!
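For reference, a minimal sketch of how those six values fit together, assuming the parameter layout described in the kernel's netconsole documentation (`src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac`). The helper names and every address below are made-up placeholders, not values from this thread.

```python
def netconsole_param(src_port, src_ip, dev, tgt_port, tgt_ip, tgt_mac):
    """Build the value of the module's `netconsole=` parameter.

    Layout (per the kernel netconsole documentation):
        src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
    """
    return "%d@%s/%s,%d@%s/%s" % (
        src_port, src_ip, dev, tgt_port, tgt_ip, tgt_mac)


def modprobe_command(param):
    """Argument list suitable for subprocess.run(); must be run as root."""
    return ["modprobe", "netconsole", "netconsole=" + param]


if __name__ == "__main__":
    # Hypothetical values: local port/IP/device, then the remote syslog
    # port/IP and the MAC of the syslog host (or of the gateway if the
    # syslog host is on another segment).
    param = netconsole_param(6666, "192.168.1.10", "eth0",
                             514, "192.168.1.20", "00:11:22:33:44:55")
    print(" ".join(modprobe_command(param)))
```

The printed command is what the boot-time job would run. Building the string in one place makes it easier to check that the MAC really belongs to the syslog host or gateway, which is the part that is easiest to get wrong.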
> On Wed, Mar 9, 2011 at 10:24 AM, Leen de Braal <ldb at braha.nl> wrote:
>>> m.roth at 5-cent.us wrote:
>>>> Michael Eager wrote:
>>>
>>>>> House-built, Gigabyte MB, AMD Phenom II X6, 6Gb RAM.
>>>>
>>>> Any chance the problem's with the video card?
>>>
>>> Video is on the MB. It doesn't seem likely that it's
>>> the video, since the system doesn't respond to network
>>> when it crashes.
>>>
>>> It could be anything. That's why I'm looking for
>>> something that would give me a bit of a hint what
>>> to look at. With an infrequent failure, it's not
>>> practical to replace components piecemeal.
>>
>> While you open the case, check for the bulging capacitor problem.
>> Will have the effect you describe, freezing up the system so that even
>> bios routines don't work (your fans).
>> If that's the case, replace mainboard.
>>
>
> Or replace the CAPS if you're not afraid of a soldering iron :)

Very often resulting in a damaged board, because you damage the vias when pulling the caps. But it is worth a try.

> --
> Kind Regards
> Rudi Ahlers
> SoftDux
>
> Website: http://www.SoftDux.com
> Technical Blog: http://Blog.SoftDux.com
> Office: 087 805 9573
> Cell: 082 554 7532
>

-- 
L. de Braal
BraHa Systems
NL - Terneuzen
T +31 115 649333
F +31 115 649444
On Tuesday, March 08, 2011 04:44:54 pm Dr. Ed Morbius wrote:
> I'd very strongly recommend you configure netconsole.

Ok, now this is useful indeed. Thanks for the information, even though I'm not the OP....

While I suspected the facility might be there, I hadn't really dug for it, but if this will catch things after filesystems go r/o (ext3 journal things, ya know) it could be worth its weight in gold for catching kernel errors from VMware guests (serial console not really an option with the hosts I have, although I'm sure some enterprising soul has figured out how to redirect the VM guest serial port to something else....).
Dr. Ed Morbius wrote:
> on 09:24 Tue 08 Mar, Michael Eager (eager at eagerm.com) wrote:
>> Hi --
>>
>> I'm running a server which is usually stable, but every
>> once in a while it hangs. The server is used as a file
>> store using NFS and to run VMware machines.
>>
>> I don't see anything in /var/log/messages or elsewhere
>> to indicate any problem or offer any clue why the system
>> was hung.
>>
>> Any suggestions where I might look for a clue?
>
> I'd very strongly recommend you configure netconsole. Though not entirely
> clear from the name, it's actually an in-kernel network logging module,
> which is very useful for kicking out kernel panics which otherwise
> aren't logged to disk and can't be seen on a (nonresponsive) monitor.

I'll take a look at netconsole.

> Alternately, a serial console which actually retains all output sent to
> it (some remote access systems support this, some don't) may help.
>
> Barring that, I'd start looking at individual HW components, starting
> with RAM.

The problem with randomly replacing various components, other than the downtime and nuisance, is that there's no way to know that the change actually fixed any problem. When the base rate is one unknown system hang every few weeks, how many weeks should I wait without a failure to conclude that the replaced component was the cause?

A failure which happens infrequently isn't really amenable to a random diagnostic approach.

-- 
Michael Eager eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306 650-325-8077
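One rough way to put a number on that question, assuming the hangs arrive independently at a constant average rate (a Poisson model, which is an assumption and not something established in this thread): if the fault were still present, the chance of seeing no hang in t weeks would be exp(-t/m), where m is the mean number of weeks between hangs. Solving for the t at which that chance drops to 1 - confidence gives the waiting time. The function name and the 3-week figure are illustrative only.

```python
import math


def clean_weeks_needed(mean_weeks_between_hangs, confidence=0.95):
    """Weeks of hang-free running after which a clean stretch is unlikely
    to be luck alone, under a Poisson model of the hangs.

    P(no hang in t weeks | fault still present) = exp(-t / m)
    Solve exp(-t / m) = 1 - confidence for t.
    """
    if mean_weeks_between_hangs <= 0:
        raise ValueError("mean_weeks_between_hangs must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -mean_weeks_between_hangs * math.log(1.0 - confidence)


if __name__ == "__main__":
    # Hypothetical reading of "every few weeks": one hang per 3 weeks.
    print(round(clean_weeks_needed(3.0), 1))
```

With a hang every three weeks on average, that works out to roughly nine clean weeks per swapped component at the 95% level, which supports the point above: trial replacement is slow to confirm anything when the failure is this rare.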
On Wednesday, March 09, 2011 03:24:48 am Leen de Braal wrote:
> While you open the case, check for the bulging capacitor problem.
> Will have the effect you describe, freezing up the system so that even
> bios routines don't work (your fans).
> If that's the case, replace mainboard.

I've seen capacitor problems in the past, and they can be rather interesting. What the caps do is open up (electrically speaking), meaning they can no longer smooth out the ripple in the output of the switching regulator; this ripple is very high frequency due to the switching regulator's design. As the CPU draws more current (which happens when it's loaded, of course, since MOS gates by design consume the most power during the switching period (capacitor charging time constants on the gates of the transistors themselves)), the switching regulator has to supply more current, and if the caps are open they can't smooth out the deeper ripple.

I actually had one motherboard blow two caps; the case of one of the blown capacitors was violently ejected off of the 'guts' of the cap, hard enough that it dented the PC's case from the inside. The PC kept running until it was put under load, then it would lock up. When the second cap blew, about an hour later, the PC hung; it would power up and run POST, and even run the BIOS setup's memory check and health check, but as soon as the CPU was shifted into protected mode as the OS booted it would hard hang due to the CPU's increased current draw overwhelming the ripple-absorbing capacity of the remaining good capacitors on the CPU's switching regulator.

There's really only one way to determine this, and that's by putting an oscilloscope on the CPU's power supply output rails and looking for ripple while running a CPU burn-in program. The hard part of that is actually finding a good place to measure the output, thanks to the typical motherboard's multilayer design.

And while with the proper desoldering equipment and training/experience one can re-cap a motherboard, I would not recommend doing so for a critical server, unless you want and can assume personal liability for that server's operation. Better to get a new motherboard with a warranty. For a personal server that isn't going to open you up to personal liability if it breaks, sure, you can re-cap if you'd like and have the patience, time, equipment, and experience necessary to work on 6 to 8 layer PC boards, which may be soldered with RoHS lead-free solder, which requires special techniques. Otherwise, as you said, you can damage the 'vias' (that is, the plated through holes the capacitor leads solder to, which may be used to connect to internal layers that you can't resolder) very easily.
On Wednesday, March 09, 2011 10:16:34 am Brunner, Brian T. wrote:
> This would be far cheaper than the time spent troubleshooting the
> running (sometimes hanging) system.

Let me interject here, that from a budgeting standpoint 'cheaper' has to be interpreted in the context of which budget the costs are coming out of. New hardware is capex, and thus would come out of the capital budget, and admin time is opex, and thus would come out of the operating budget. There may be sufficient funds in the operating budget to pay an admin $x,000 but the funds in the capital budget may be insufficient to buy a server costing $y,000, where y=x. And if this is an educational institution, and there are grants involved, it may be the reverse situation. So 'cheaper' only has meaning when the costs are coming out of the same budget.

So, yes, while it's easy for a single-budget entity to make this decision, it's not so easy when you have multiple budgets involved with different spending parameters and different funding entities.

> Starting with RAM and Power Supply is not random ... They're "The Usual
> Suspects".

This is a very true statement. Heat and airflow are two others.
Rudi Ahlers wrote:
> On Thu, Mar 10, 2011 at 12:31 AM, Michael Eager <eager at eagerm.com> wrote:
>> Dr. Ed Morbius wrote:
>>
>>> If the issue is repeated but rare system failures on one of a set of
>>> similarly configured hosts, I'd RMA the box and get a replacement. End
>>> of story.
>> I'll repeat: this is a house-made system. There's no vendor to RMA to.
>
> I don't know where you are, but in our country we can RMA anything and
> everything. Apart from CPU's. So, even a cheap desktop mobo could be
> RMA'd, as long as I can prove to the suppliers it's faulty, and it's
> within the warranty period

I responded to Dr. Morbius' suggestion that I "RMA the box". There is no vendor to RMA the box to.

If I knew that it was a motherboard problem, I could RMA it. Or disk, or PSU, or network card, or whatever. But, as I've mentioned, there's no indication what causes the system to hang. There is no way at this point to prove that it is a defective motherboard.

-- 
Michael Eager eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306 650-325-8077
Rudi Ahlers wrote:
> As far as I can see you were given a bucket load of advice, which you
> haven't even bothered to follow yet. You're the only one who could
> actually do anything about the problem.

I have followed quite a bit of the advice, which I have appreciated and noted. I've set up the monitor so that it will not be blanked on a crash, installed monitoring software, and checked a number of conditions which people have suggested.

No, I have not responded to the philosophical discussions about vendor management, nor to the suggestions to RMA something to somebody for unknown reasons. No, I'm not going to replace RAM or capacitors here and there on the off chance that something might be bad. (But I will look for capacitors which show signs of bulging or leaking.)

> No amount of suggestions made on this list will fix the problem for
> you. You need to actually take apart the server and see what's going
> on.

I wasn't interested in anyone fixing the server for me. I did ask for suggestions on how to improve the diagnostics for the problem, which several people have responded to. Again, I appreciate their suggestions greatly.

As I've said, I have a list of things to check when the server is next taken down.

-- 
Michael Eager eager at eagercon.com
1960 Park Blvd., Palo Alto, CA 94306 650-325-8077
On Thursday, March 10, 2011 05:35:29 am Rudi Ahlers wrote:
> I prefer to use a dust blower instead. It doesn't risk pulling loose
> components with "dry" or loose "soldering"

I use both: antistatic canned air to blow the dust and a metal-tubed vacuum rested on a part of the case away from any boards to grab the dust that's being blown. Works great, and you don't 'recycle' the dust.....