Hi,

We have a server which locks up about once a week (for the past 3 weeks now) without any warning, and the only way to recover it is to reset the server. This causes unwanted downtime, and often software loss as well.

How do I debug the server, which runs CentOS 5.2, to see why it locks up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel motherboard.

The last few entries before the server froze are:

Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:59008
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:59008
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:47729
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:47729
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:47890
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:47890
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:50023
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:50023
Nov 15 07:15:20 saturn snmpd[2527]: Connection from UDP: [127.0.0.1]:58459
Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:58459
Nov 15 10:10:10 saturn syslogd 1.4.1: restart.
Nov 15 10:10:11 saturn kernel: klogd 1.4.1, log source = /proc/kmsg started.
Nov 15 10:10:11 saturn kernel: Bootdata ok (command line is ro root=/dev/System/root)
Nov 15 10:10:11 saturn kernel: Linux version 2.6.18-92.1.17.el5xen (mockbuild at builder10.centos.org) (gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Tue Nov 4 14:13:09 EST 2008
Nov 15 10:10:11 saturn kernel: BIOS-provided physical RAM map:
Nov 15 10:10:11 saturn kernel: Xen: 0000000000000000 - 00000001ef958000 (usable)
Nov 15 10:10:11 saturn kernel: DMI 2.4 present.
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x03] lapic_id[0x02] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC_NMI (acpi_id[0x01] dfl dfl lint[0x1])
Nov 15 10:10:11 saturn kernel: ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0x1])
Nov 15 10:10:11 saturn kernel: ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
Nov 15 10:10:11 saturn kernel: IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
Nov 15 10:10:11 saturn kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
Nov 15 10:10:11 saturn kernel: ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)

--
Kind Regards
Rudi Ahlers
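
The excerpt above goes silent at 07:15:20 and resumes with "syslogd 1.4.1: restart." at 10:10:10, so the freeze falls somewhere in that window. To find such windows for the earlier lockups as well, something like the following might help. It is only a rough sketch, assuming Python 3 is available, the classic "Mon DD HH:MM:SS" syslog prefix, and an English locale for month names; find_gaps, the year argument and the 10-minute threshold are made up for illustration.

```python
from datetime import datetime


def find_gaps(lines, year=2008, min_gap_seconds=600):
    """Return (last_seen, next_seen, gap_seconds) for each silent period.

    Classic syslog lines carry no year, so the caller supplies one
    (an assumption of this sketch).
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            # The first 15 characters look like "Nov 15 07:15:20".
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line, skip it
        if previous is not None:
            gap = (stamp - previous).total_seconds()
            if gap >= min_gap_seconds:
                gaps.append((previous, stamp, gap))
        previous = stamp
    return gaps


# Two lines taken from the log excerpt in this message.
sample = [
    "Nov 15 07:15:20 saturn snmpd[2527]: Received SNMP packet(s) from UDP: [127.0.0.1]:58459",
    "Nov 15 10:10:10 saturn syslogd 1.4.1: restart.",
]
for last_seen, next_seen, gap in find_gaps(sample):
    print(f"silent from {last_seen} until {next_seen} ({gap:.0f} s)")
```

To scan a whole log, pass an open file instead of the sample list, e.g. find_gaps(open("/var/log/messages")). A gap only says when logging stopped, not why.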
On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers <rudiahlers at gmail.com> wrote:
> Hi,
>
> We have a server which locks up about once a week (for the past 3
> weeks now), without any warning, and the only way to recover it, is to
> reset the server. This causes unwanted downtime, and often software
> loss as well.
>
> How do I debug the server, which runs CentOS 5.2 to see why it locks
> up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel
> Motherboard
>

Attach a local console to the video port and let us know what it says --> that will (probably) be very insightful, e.g., kernel panic, MCE, ....

Next, run memtest86+ -- at least overnight. [Note: I've had less than stellar results with memtest86 recently, but if it shows errors, you've got a problem big time; if it doesn't show errors, you're still not 100% sure that the memory is good :-) :-).] Is it ECC memory? If not, why not -- particularly given it is a critical server ....

Are all the fans spinning -- particularly the CPU fan? Do you have lm-sensors enabled? Either create a script or use something like munin to track things and see if fans, temperatures, and voltages are all stable and within range right up to death.

Can you easily swap power supplies? (Is the unit dual powered or just one unit?)

Clearly, this is just a start, but you get the idea of elementary, 101 problem solving ....

-rak-
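
For the "create a script" route, here is a minimal sketch of what such a tracker might look like. It assumes Python 3 is installed and that the lm-sensors `sensors` command prints something useful on this board; log_snapshot, monitor, the log path and the 60-second interval are all hypothetical. Each snapshot is fsync'ed so that the last reading before a hard lockup has a better chance of reaching the disk.

```python
import os
import subprocess
import time
from datetime import datetime


def log_snapshot(command, log_path):
    """Append one timestamped snapshot of the command's output to log_path."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=30,
        )
        output = result.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Record the failure instead of dying, so the log keeps growing.
        output = f"could not run {command!r}: {exc}\n"
    with open(log_path, "a") as log:
        log.write(f"=== {datetime.now().isoformat()} ===\n{output}")
        log.flush()
        os.fsync(log.fileno())  # force the write to disk before the next sleep


def monitor(command, log_path, interval_seconds=60):
    """Take a snapshot every interval_seconds until the process is killed."""
    while True:
        log_snapshot(command, log_path)
        time.sleep(interval_seconds)


# Example (runs forever, so start it in the background):
# monitor(["sensors"], "/var/log/sensors-history.log")
```

After the next lockup, the tail of the log file shows the last values recorded before the machine died. The command is just a list of arguments, so any other tool that prints readings to stdout can be substituted.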
On Sat, Nov 15, 2008 at 4:47 PM, Richard Karhuse <rkarhuse at gmail.com> wrote:
>
> On Sat, Nov 15, 2008 at 3:16 AM, Rudi Ahlers <rudiahlers at gmail.com> wrote:
>>
>> Hi,
>>
>> We have a server which locks up about once a week (for the past 3
>> weeks now), without any warning, and the only way to recover it, is to
>> reset the server. This causes unwanted downtime, and often software
>> loss as well.
>>
>> How do I debug the server, which runs CentOS 5.2 to see why it locks
>> up? The CPU is an Intel Q9300 Core 2 Quad, with 8 GB RAM, on an Intel
>> Motherboard
>
> Attach a local console to the video port and let us know what it says -->
> that will (probably) be very insightful. E.G., Kernel panic, MCE, ....
>
> Next, run memtest86+ -- at least overnight. [Note: I've had less than
> stellar results with memtest86 recently, but if it shows errors, you've got
> a problem big time; if it doesn't show errors, you still not 100% sure that
> memory is good:-):-).] Is it ECC memory?? If not, why not -- particularly
> given it is a critical server ....
>
> Are all the fans spinning -- particularly the CPU?? Do you have lm-sensors
> enabled?? Either create a script or using something like munin to track
> things and see if fans, temperature, voltages are all stable & within
> range up to death.
>
> Can you easily swap power supplies?? (Is the unit dual powered or just
> one unit?)
>
> Clearly, just a start, but you get the idea of elementary, 101 problem
> solving ....
>
> -rak-
>

Unfortunately, I can't leave a monitor attached to the server all the time. The server is in a shared cabinet at a 3rd-party ISP, and they lock the cabinets once we're done working on it. The last lockup was about 6 days ago, and the previous one about 8 days before that. There's no consistency. How can I redirect all console output to a file instead?

I have got lm-sensors installed, but it doesn't pick up the motherboard's sensors.

All fans were working when I last checked, but it's a 1U chassis, so it has limited airflow. I don't know whether it gets too hot or not. When I rebooted it, the temperature was about 45 degrees Celsius, but the lockup only happened about 6 days later. So I can't even sit there 24/7 to see what happens.

--
Kind Regards
Rudi Ahlers
Rudi Ahlers wrote:
> We have a server which locks up about once a week (for the past 3
> weeks now), without any warning, and the only way to recover it, is to
> reset the server. This causes unwanted downtime, and often software
> loss as well.
>
> How do I debug the server, which runs CentOS 5.2 to see why it locks
> up?

Are those the only logs you've got? Normally Linux is very chatty, and you get WARNING, PANIC, etc. messages.

What kernel are you using? Does a previous kernel or the CentOSPlus kernel stop the problem?

Regards,
Vandaman.