I have many different CentOS machines that are hanging
regularly. I believe this is due to something our application
is doing - not a CentOS-specific problem.

When the machines hang, there is no access to the console
or remote access (ssh, rsh, etc.).

Any tips on debugging this issue? It is becoming quite a
show stopper as we migrate our product from Solaris to
Linux.

tia,

FYI - The "application" is a collection of programs that
communicate with each other and with a large chip tester via
a proprietary serial bus. The hangs are random but pretty
frequent - in the range of several per day to several per week.

-Mark

--
Mark Belanger
LTX Corporation

On Mon, Dec 18, 2006 at 04:17:59PM -0500, Mark Belanger wrote:
> I have many different CentOS machines that are hanging
> regularly. I believe this is due to something our application
> is doing - not a CentOS-specific problem.
> When the machines hang, there is no access to the console
> or remote access (ssh, rsh, etc.).

Do you mean that there's no *access* to the console, or that it
doesn't *respond* on the console?

--
Matthew Miller  mattdm at mattdm.org  <http://mattdm.org/>
Boston University Linux  ------>  <http://linux.bu.edu/>

On 12/18/06, Mark Belanger <mark_belanger at ltx.com> wrote:
> I have many different CentOS machines that are hanging
> regularly. I believe this is due to something our application
> is doing - not a CentOS-specific problem.
>
> When the machines hang, there is no access to the console
> or remote access (ssh, rsh, etc.).
>
> Any tips on debugging this issue? It is becoming quite a
> show stopper as we migrate our product from Solaris to
> Linux.
>
> tia,
>
> FYI - The "application" is a collection of programs that
> communicate with each other and with a large chip tester via
> a proprietary serial bus. The hangs are random but pretty
> frequent - in the range of several per day to several per week.
>
> -Mark
>

Have you tried the magic SysRq key sequence on the console?

Cheers...james

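The magic SysRq keys only help if they were enabled before the hang. Here is a minimal sketch, assuming a 2.6 kernel built with SysRq support. The function name `enable_sysrq` and the two overridable path variables are made up for illustration; `/proc/sys/kernel/sysrq` and the `kernel.sysrq` sysctl key are the standard interfaces.

```shell
#!/bin/sh
# Paths default to the real kernel/sysctl locations; they are variables
# only so the function can be dry-run against scratch files.
SYSRQ_CTL="${SYSRQ_CTL:-/proc/sys/kernel/sysrq}"
SYSCTL_CONF="${SYSCTL_CONF:-/etc/sysctl.conf}"

enable_sysrq() {
    # Writing 1 enables all SysRq functions on the running kernel.
    echo 1 > "$SYSRQ_CTL"

    # Persist across reboots: rewrite an existing kernel.sysrq line
    # (it may be set to 0), otherwise append a new one.
    if grep -q '^kernel\.sysrq' "$SYSCTL_CONF" 2>/dev/null; then
        sed -i 's/^kernel\.sysrq.*/kernel.sysrq = 1/' "$SYSCTL_CONF"
    else
        echo 'kernel.sysrq = 1' >> "$SYSCTL_CONF"
    fi
}

# Keys to try on the hung console (hold Alt+SysRq, then press the letter):
#   t  dump the task list      m  dump memory info
#   p  dump CPU registers      s  sync disks
#   u  remount read-only       b  reboot immediately
```

Run `enable_sysrq` as root while the machine is still healthy. If the kernel responds to Alt+SysRq+t during a hang, the kernel is still alive and the output goes to the kernel log, so it is most useful when a serial console or remote log capture is already in place.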
On Mon, December 18, 2006 10:17 pm, Mark Belanger wrote:
> I have many different CentOS machines that are hanging
> regularly. I believe this is due to something our application
> is doing - not a CentOS-specific problem.

A normal unprivileged userland application should not be able to bring
down the kernel. Do you use any additional drivers, e.g. for the serial
bus? If so, it is a good idea to enable kernel crash dumps and send
them to your driver developers so they can analyze the bug.

Other than that, you could (in no particular order):

- Let syslog send messages to some remote site.
- As others suggested, use remote X or a serial console to be able to
  track important messages.
- 'strace -f' the X11 program, and redirect the output somewhere safe,
  to see the last actions that were performed by the program.

With kind regards,
Daniel de Kok

On Dec 18, 2006, at 16:17, Mark Belanger wrote:
> I have many different CentOS machines that are hanging
> regularly. I believe this is due to something our application
> is doing - not a CentOS-specific problem.

I have the same problem. I even posted something to this list titled
"Strange system hangs" on 11/27 but didn't get any responses.

> When the machines hang, there is no access to the console
> or remote access (ssh, rsh, etc.).

I have that symptom as well. No way to do any debugging after it gets
into that state. So I added the following two lines to the
/etc/syslog.conf file:

    kern.*                                    @<central server>
    *.info;mail.none;authpriv.none;cron.none  @<central server>

Should I add any other levels to the selector field?

BTW, my systems are running a completely stock CentOS distribution
EXCEPT for the binary nVidia driver, which was the only way I could get
these systems to drive the 20" LCD displays at their native 1600x1200
resolution using the correct refresh rate.

I had another report of a hang this morning, but in this case even
though the machine appears frozen (the screen saver is stuck and I
can't get to the alternate consoles), I can in fact log into the
machine remotely, and top shows me that the X server is using 100% of
the CPU:

    top - 08:44:22 up 10 days, 23:00, 10 users,  load average: 1.04, 1.01, 1.00
    Tasks: 115 total,   2 running, 113 sleeping,   0 stopped,   0 zombie
    Cpu(s): 99.7% us,  0.3% sy,  0.0% ni,  0.0% id,  0.0% wa,  0.0% hi,  0.0% si
    Mem:   3113468k total,  1361240k used,  1752228k free,    87312k buffers
    Swap:  3047416k total,        0k used,  3047416k free,   957756k cached

      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
     4381 root      25   0 67748  42m 7776 R 99.8  1.4 782:53.37 X

I also see the following in /var/log/messages:

    Dec 18 19:56:02 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001
    Dec 18 19:56:03 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
    Dec 18 19:56:09 hepdsw04 Synergy 1.3.1: NOTE: CServerProxy.cpp, 315: server is dead
    Dec 18 19:56:10 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
    Dec 18 19:56:11 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
    Dec 18 19:56:18 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
    Dec 18 19:56:19 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
    Dec 18 19:56:26 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000020
    Dec 18 19:56:27 hepdsw04 kernel: NVRM: Xid (0001:00): 9, Channel 00000020 Instance 00000000 Intr 00100000
    Dec 18 19:56:34 hepdsw04 kernel: NVRM: Xid (0001:00): 8, Channel 00000001

What is the meaning of the NVRM entries? The Synergy entry is from the
keyboard/mouse sharing Synergy utility (great program BTW, I couldn't
live without it).

Anyway, sorry to inject my own problems into this thread, but maybe
these hangs are all related.

Alfred

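One thing worth checking with the forwarding lines above is that the central server actually accepts remote messages. Here is a minimal sketch, assuming the central server runs the stock sysklogd and reads its options from /etc/sysconfig/syslog. If it runs a different syslog daemon, the setting will differ.

```shell
# /etc/sysconfig/syslog on the central server (assumed: stock sysklogd)
#   -r    accept messages from the network (UDP port 514)
#   -m 0  disable the periodic "-- MARK --" lines
SYSLOGD_OPTIONS="-r -m 0"
```

After changing it, restart the daemon on the central server and make sure UDP port 514 is open in its firewall. On the selector question: `kern.*` already matches every priority for the kernel facility, and `*.info` matches info and above for the other facilities, so the only thing left out is debug-level messages.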
Quoting Mark Belanger <mark_belanger at ltx.com>:
> I have many different CentOS machines that are hanging
> regularly. I believe this is due to something our application
> is doing - not a CentOS-specific problem.
>
> When the machines hang, there is no access to the console
> or remote access (ssh, rsh, etc.).
>
> Any tips on debugging this issue? It is becoming quite a
> show stopper as we migrate our product from Solaris to
> Linux.

Consider setting up a serial console (you'll still be able to run X11
on your keyboard/monitor). Alternatively, you might try setting up a
network console. Check out serial-console.txt and
networking/netconsole.txt in the kernel documentation.
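
A rough sketch of both suggestions, assuming a serial port at ttyS0 and the mainline netconsole module with the parameter syntax described in networking/netconsole.txt (vendor kernels may package netconsole differently). The helper name `netconsole_param`, the interface name, and the IP addresses are placeholders.

```shell
#!/bin/sh
# Serial console: append to the kernel line in the boot loader config.
# Kernel messages go to every console listed; the last one listed
# becomes /dev/console.
#   console=tty0 console=ttyS0,9600n8

# Network console: build the module parameter, which has the form
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# 6665 and 6666 are the documented default ports; an empty MAC address
# means the packets are sent to the broadcast address.
netconsole_param() {
    # usage: netconsole_param SRC_IP DEV TGT_IP [TGT_MAC]
    printf '%s@%s/%s,%s@%s/%s\n' 6665 "$1" "$2" 6666 "$3" "${4:-}"
}

# Placeholder addresses; substitute the hanging machine and the log host.
netconsole_param 192.168.0.10 eth0 192.168.0.20
```

As root on the machine that hangs, the string would be passed as `modprobe netconsole netconsole="$(netconsole_param ...)"`. On the receiving host, anything that listens on UDP port 6666 will do, for example `nc -u -l -p 6666` (netcat flags vary between variants).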