I have a total of 20 CentOS 4.1 systems running on fairly new hardware. About 6 of them are experiencing strange hangs without any indication -- nothing in /var/log/messages nor on the console -- sometime within 10-30 minutes after a reboot. The systems still respond to ping but you can't ssh to them. At the console, you can type "root" at the login prompt but it hangs immediately after hitting Enter. Memory scans of all systems show no errors. Any idea how to troubleshoot this problem? The system isn't responsive enough to do any troubleshooting and nothing abnormal is in the log. We are running this kernel: kernel-smp-2.6.9-11.EL.i686.rpm. Thanks for any help.
} I have a total of 20 CentOS 4.1 systems running on fairly new
} hardware. About 6 of them are experiencing strange hangs without any
} indication -- nothing in /var/log/messages nor on the console --
} sometime within 10-30 minutes after a reboot. The systems still
} respond to ping but you can't ssh to them. At the console, you can
} type "root" at the login prompt but it hangs immediately after hitting
} Enter.
}
} Memory scans of all systems show no errors.
}
} Any idea how to troubleshoot this problem? The system isn't
} responsive enough to do any troubleshooting and nothing abnormal is
} in the log.
}
} We are running this kernel: kernel-smp-2.6.9-11.EL.i686.rpm.
}
} Thanks for any help.

Greetings,

I'm quite sure you have probably done these things already, but the first two things that come to mind are:

Do you have the latest stable firmware on those machines? Are they all the same hardware, or is there a common denominator besides CentOS 4.1?

Have you tried installing the latest kernels and related updates? New ones were published recently.

If the machines are connected to the Internet, can you unplug them for testing?

 - rh

--
Robert Hanson - Abba Communications
Computer & Internet Services
(509) 624-7159 - www.abbacomm.net
On Wed, Jan 18, 2006 at 11:38:38AM -0800, Fong Vang wrote:
> We are running this kernel: kernel-smp-2.6.9-11.EL.i686.rpm.

I have a fairly good idea what it is: a context switch storm. We have been seeing it for quite some time, always on Intel hardware, usually Xeons but sometimes P4 HTs too. It is a known condition, even though the bug itself is still elusive. It could be related to either the processor or the northbridge.

So far, the only way to stop the problem is to switch to a non-SMP kernel.

Best Regards,

--
Rodrigo Barbosa <rodrigob at suespammers.org>
"Quid quid Latine dictum sit, altum viditur"
"Be excellent to each other ..." - Bill & Ted (Wyld Stallyns)
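One way to test the context-switch theory on a machine that is still responsive is to sample the kernel's cumulative counter twice and turn the difference into a rate. The sketch below reads the "ctxt" line from /proc/stat. The function names are made up for illustration, and nothing in this thread says what rate counts as a "storm", so the number is only useful when compared against one of the healthy boxes.

```shell
#!/bin/sh
# Sketch: estimate context switches per second from /proc/stat.
# Function names are illustrative; they do not come from the thread.

# Print the cumulative context-switch counter from a /proc/stat-style
# file (defaults to /proc/stat itself).
read_ctxt() {
    awk '$1 == "ctxt" { print $2 }' "${1:-/proc/stat}"
}

# Given two counter readings and the seconds between them, print the
# integer number of context switches per second.
ctxt_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Take two readings INTERVAL seconds apart (default 5) and print the rate.
sample_ctxt_rate() {
    interval=${1:-5}
    first=$(read_ctxt)
    sleep "$interval"
    second=$(read_ctxt)
    ctxt_rate "$first" "$second" "$interval"
}
```

Calling "sample_ctxt_rate 5" on an affected and an unaffected machine gives two numbers to compare. The "cs" column of "vmstat 1" reports the same quantity without any scripting.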
On Wed, 2006-01-18 at 13:38, Fong Vang wrote:
> I have a total of 20 CentOS 4.1 systems running on fairly new
> hardware. About 6 of them are experiencing strange hangs without any
> indication -- nothing in /var/log/messages nor on the console --
> sometime within 10-30 minutes after a reboot. The systems still
> respond to ping but you can't ssh to them. At the console, you can
> type "root" at the login prompt but it hangs immediately after hitting
> Enter.
>
> Memory scans of all systems show no errors.
>
> Any idea how to troubleshoot this problem? The system isn't
> responsive enough to do any troubleshooting and nothing abnormal is
> in the log.
>
> We are running this kernel: kernel-smp-2.6.9-11.EL.i686.rpm.

My first guess would be that something is consuming all available memory and pushing everything else into swap. The system may not be completely hung, but it can't respond in a reasonable amount of time.

If the logs for whatever services you run don't show anything, I'd watch with top over a period of time to see if a single program is doing it, and run "ps ax" frequently to see if a large number of small processes are accumulating. You can get a hint about how fast new processes are being started by looking at the process ID of the ps process itself when you run it repeatedly.

I assume from the fact that you have 20 boxes that you are doing something that causes substantial load; perhaps it needs to be distributed better.

--
Les Mikesell
lesmikesell at gmail.com
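The process ID trick described above can be scripted. This is a minimal sketch, assuming the kernel hands out PIDs sequentially and wraps at the value in /proc/sys/kernel/pid_max. The function names are hypothetical, and the wrap-around handling is approximate because the kernel skips a range of low PIDs after wrapping.

```shell
#!/bin/sh
# Sketch: gauge how fast processes are being created by watching PIDs.
# Assumes sequential PID allocation; function names are illustrative.

# Print the PID of a freshly started child process.
current_pid() {
    sh -c 'echo $$'
}

# Print how many PIDs were consumed between two samples OLD and NEW,
# allowing for at most one wrap-around at MAX.
pid_delta() {
    echo $(( ($2 - $1 + $3) % $3 ))
}

# Sample twice, INTERVAL seconds apart (default 10), and print the
# approximate number of PIDs consumed per second.
fork_rate() {
    interval=${1:-10}
    max=$(cat /proc/sys/kernel/pid_max 2>/dev/null || echo 32768)
    before=$(current_pid)
    sleep "$interval"
    after=$(current_pid)
    echo $(( $(pid_delta "$before" "$after" "$max") / interval ))
}
```

Running "fork_rate 10" prints a rough processes-per-second figure. The sampling itself consumes a few PIDs (the child shells and sleep), so a result near zero on an idle machine is expected, and only a large number points to something spawning processes rapidly.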
> We are running this kernel: kernel-smp-2.6.9-11.EL.i686.rpm.

I'd suggest updating to CentOS 4.2 and the newest kernel-smp-2.6.9-22.0.2.EL.i686, and verifying that the firmware/BIOS is up to date.

Do the same machines always crash? Is there a common hardware denominator?

Cheers,
MaZe.
On Wed, 18 Jan 2006, Fong Vang wrote:
> I have a total of 20 CentOS 4.1 systems running on fairly new
> hardware. About 6 of them are experiencing strange hangs without
> any indication -- nothing in /var/log/messages nor on the console --
> sometime within 10-30 minutes after a reboot. The systems still
> respond to ping but you can't ssh to them. At the console, you can
> type "root" at the login prompt but it hangs immediately after
> hitting Enter.
>
> Memory scans of all systems show no errors.
>
> Any idea how to troubleshoot this problem? The system isn't
> responsive enough to do any troubleshooting and nothing abnormal is
> in the log.

Other folks have hit on the best starting points. For diagnosis, however, you might want to cobble together a cron script that can run every minute:

  #!/bin/sh
  #
  # season to taste...
  (
    top -n 1 -b   # also provides a timestamp
    vmstat
    iostat
    ps axf
  ) >> /var/log/troubleshooting.log 2>&1

The resulting log will be verbose and will grow quickly, but it'll likely contain strong hints of any process-related problems. What it won't do, of course, is provide indications of hardware faults.

--
Paul Heinlein <> heinlein at madboa.com <> www.madboa.com
Fong Vang wrote:
> I have a total of 20 CentOS 4.1 systems running on fairly new
> hardware. About 6 of them are experiencing strange hangs without any
> indication -- nothing in /var/log/messages nor on the console --
> sometime within 10-30 minutes after a reboot. The systems still
> respond to ping but you can't ssh to them. At the console, you can
> type "root" at the login prompt but it hangs immediately after hitting
> Enter.
>
> Memory scans of all systems show no errors.
>
> Any idea how to troubleshoot this problem? The system isn't
> responsive enough to do any troubleshooting and nothing abnormal is
> in the log.
>
> We are running this kernel: kernel-smp-2.6.9-11.EL.i686.rpm.

Have you tried disabling hyperthreading? I ran into this problem when we started buying Intel Xeons; disabling HT seemed to fix it for me.

Dean
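Before and after changing the BIOS setting, it can help to confirm what the kernel actually sees. The sketch below is a rough check, assuming /proc/cpuinfo exposes both the "siblings" and "cpu cores" fields. Older kernels such as 2.6.9 may not report "cpu cores", in which case the function prints "unknown" rather than guessing. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: guess whether hyperthreading is active from a cpuinfo-style
# file (defaults to /proc/cpuinfo). The function name is illustrative.
#
# Logic: if a physical package reports more logical siblings than
# cores, the extra logical CPUs are hyperthreads.
ht_status() {
    awk -F': *' '
        $1 ~ /^siblings[ \t]*$/  { siblings = $2 }
        $1 ~ /^cpu cores[ \t]*$/ { cores = $2 }
        END {
            if (siblings == "" || cores == "") print "unknown"
            else if (siblings + 0 > cores + 0) print "enabled"
            else print "disabled"
        }
    ' "${1:-/proc/cpuinfo}"
}
```

Calling "ht_status" with no argument inspects the running system. Passing a saved copy of /proc/cpuinfo from another machine as the first argument makes it easy to compare the boxes that hang with the ones that don't.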
Rodrigo Barbosa wrote:
>>> Have you tried disabling hyperthreading? I ran into this problem
>>> when we started buying Intel Xeons; disabling HT seemed to fix
>>> it for me.
>>
>> Some people suggested that too. I'm asking for two of these systems
>> to be sent back from our remote data center. I'll try it then.
>>
>> Did you just end up running without hyperthreading? Did you ever
>> find a solution? It seems an awful waste if you can't use it.
>
> Disabling HT only solves it on single processor systems.

In my case it fixed our dual-processor systems (Dell PE2650), but to be honest I was never quite sure whether it was an OS patch or disabling HT that did it. I just disable HT, build, and they just work. I have not seen this problem for a while now.

Dean