Alwin Roosen
2008-Jun-20 12:40 UTC
[CentOS] Kernel panic - not syncing: CPU context corrupt
Hi,

Is there someone on this mailing list who could help me figure out this issue? We do not know where to look to solve this.

--- Description ---

This is a brand new server, which has been tested for days with FreeBSD in our office, and for a few days with Windows at the site of our hardware distributor. Now the customer wants CentOS, which we installed, but after a few days we get a kernel panic. Last night at 2:08 it gave the same kernel panic.

Please tell me what information I should give you and, most importantly, how to get it from the system, because we do not have experience with CentOS (only FreeBSD).

I would be very surprised if this is hardware related. We have used the same hardware for several years and run FreeBSD on it very successfully. It is a SuperMicro PDSMI+ motherboard with a 3ware RAID controller (8006-2LP). CPU is Xeon 3040 1.8 GHz EM64 2MB 1066FSB (65W). Memory is DDR 2 Transcend 2048MB ECC Unbuffered 800.

The error message on the console is in "Additional Information". I am hoping that I can switch off some setting in CentOS to fix this, but I cannot find much useful information about this issue on Google.

--- Additional Information ---

CentOS release 5 (Final)
Kernel 2.6.18-53.1.21.el5 on an i686

ws174 login: CPU 1: Machine Check Exception: 0000000000000005
CPU 0: Machine Check Exception: 0000000000000004
Bank 3: f62000020002010a at 0000000032c93500
Bank 5: f20000300c000e0f
Kernel panic - not syncing: CPU context corrupt
Bank 3: f62000020002010a

--- Attachments ---

19-06-2008 16-03-31.png (Screenshot of console)

With kind regards,
Alwin Roosen

Attachment: 19-06-2008_16-03-31.png (image/png, 20007 bytes)
URL: <http://lists.centos.org/pipermail/centos/attachments/20080620/dcfff776/attachment-0002.png>
Phil Schaffner
2008-Jun-20 13:08 UTC
[CentOS] Kernel panic - not syncing: CPU context corrupt
On Fri, 2008-06-20 at 14:40 +0200, Alwin Roosen wrote:
> Hi,
>
>
> Is there someone on this mailing list who could/want help me figure out
> this issue? We do not know where to look to solve this.
...
> I would be very surprised if this is hardware related.

A Google search for "Machine Check Exception" "Kernel panic - not syncing: CPU context corrupt" turns up 50 results (including your CentOS BZ request referring you to this list), many of which point to hardware problems: CPU, MB (bad caps), and chipset are all listed as possible causes. I'd go back to the hardware vendor if the system is still under warranty.

Phil
2008/6/20 Alwin Roosen <alwin.roosen at webline.be>:
> Hi,
>
>
> Is there someone on this mailing list who could/want help me figure out
> this issue? We do not know where to look to solve this.
>

If your installation is standard CentOS with no third-party software or configurations, I would first run the vendor hardware checks several times, as they are usually not good at catching intermittent or hard-to-find problems. Run an extensive memtest as well, if possible.

regards
Walid
Lanny Marcus
2008-Jun-20 15:23 UTC
[CentOS] Kernel panic - not syncing: CPU context corrupt
On 6/20/08, Alwin Roosen <alwin.roosen at webline.be> wrote:
<snip>
> CentOS release 5 (Final)
> Kernel 2.6.18-53.1.21.el5 on an i686
>
> ws174 login: CPU 1: Machine Check Exception: 0000000000000005
> CPU 0: Machine Check Exception: 0000000000000004
> Bank 3: f62000020002010a at 0000000032c93500
> Bank 5: f20000300c000e0f
> Kernel panic - not syncing: CPU context corrupt
> Bank 3: f62000020002010a
>

Phil or someone else: Do the three (3) "Bank" lines above indicate RAM problems? If not, what do they refer to? Alwin wrote that this is brand new HW, so he suspects that it is OK, but it doesn't seem to be OK?

Lanny
Richard Karhuse
2008-Jun-20 19:36 UTC
[CentOS] Kernel panic - not syncing: CPU context corrupt
On 6/20/08, Alwin Roosen <alwin.roosen at webline.be> wrote:
>
> Hi,
>
>
> CentOS release 5 (Final)
> Kernel 2.6.18-53.1.21.el5 on an i686
>
> ws174 login: CPU 1: Machine Check Exception: 0000000000000005
> CPU 0: Machine Check Exception: 0000000000000004
> Bank 3: f62000020002010a at 0000000032c93500
> Bank 5: f20000300c000e0f
> Kernel panic - not syncing: CPU context corrupt
> Bank 3: f62000020002010a
>
>
>

Alwin --> I would be very, very "surprised" *IF* this wasn't hardware related.

Dave Jones wrote a nice little program to help decode this:

$ parsemce -b 3 -s f62000020002010a -e 5 -a 0000000032c93500
Status: (5) Machine Check in progress.
Restart IP valid.
parsebank(3): f62000020002010a @ 32c93500
        External tag parity error
        CPU state corrupt. Restart not possible
        Address in addr register valid
        Error enabled in control register
        Error not corrected.
        Error overflow
        Memory hierarchy error
        Request: Generic error
        Transaction type : Generic
        Memory/IO : I/O

and:

$ parsemce -b 5 -s f20000300c000e0f -e 4 -a 0
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(5): f20000300c000e0f @ 0
        External tag parity error
        CPU state corrupt. Restart not possible
        Error enabled in control register
        Error not corrected.
        Error overflow
        Bus and interconnect error
        Participation: Generic
        Timeout: Request did not timeout
        Request: Generic error
        Transaction type : Invalid
        Memory/IO : Other

Dag's Repo has the new memtest86+ 2.01 RPM. I'd pull it and let it run overnight. While memtest86+ is good, I've recently had cases where it didn't find (obvious) memory errors. I've also seen things like SATA disk drives cause MCEs.

This one looks like you're taking memory parity errors somewhere in the path to the CPU.

In your BIOS, check your event log for any "interesting" entries, too.

Hope this helps ...

-rak-
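Editor's note: the flag lines in the parsemce output above can be cross-checked by hand. The sketch below is not parsemce itself; it is a minimal Python illustration that assumes the standard x86 machine-check register layout (global status bits RIPV/EIPV/MCIP in bits 0-2, per-bank status flags VAL/OVER/UC/EN/MISCV/ADDRV/PCC in bits 63-57). The function names and table names are made up for this example. It also shows that "Bank" here means a machine-check register bank inside the CPU, not a bank of DIMMs.

```python
# Hypothetical helper names; bit positions assume the architectural
# x86 machine-check register layout.

# Flag bits of the per-bank status register (the "Bank N:" values).
BANK_FLAGS = [
    (63, "VAL"),    # register contents valid
    (62, "OVER"),   # error overflow
    (61, "UC"),     # error not corrected
    (60, "EN"),     # error enabled in control register
    (59, "MISCV"),  # misc register valid
    (58, "ADDRV"),  # address register valid
    (57, "PCC"),    # processor context corrupt
]

# Flag bits of the global status register (the "Machine Check Exception:" values).
MCG_FLAGS = [
    (0, "RIPV"),  # restart IP valid
    (1, "EIPV"),  # error IP valid
    (2, "MCIP"),  # machine check in progress
]


def set_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in table if (value >> bit) & 1]


def decode_mcg(status):
    """Decode the global machine-check status value."""
    return set_flags(status, MCG_FLAGS)


def decode_bank(status):
    """Split a bank status value into flags and its two 16-bit error codes."""
    return {
        "flags": set_flags(status, BANK_FLAGS),
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # Values copied from the console output quoted in this thread.
    for cpu, mcg in ((1, 0x5), (0, 0x4)):
        print(f"CPU {cpu}: {decode_mcg(mcg)}")
    for bank, raw in ((3, 0xF62000020002010A), (5, 0xF20000300C000E0F)):
        info = decode_bank(raw)
        print(f"Bank {bank}: flags={info['flags']} "
              f"mca_error_code=0x{info['mca_error_code']:04x}")
```

In both banks the UC and PCC bits are set, which corresponds to the "Error not corrected" and "CPU state corrupt. Restart not possible" lines, and only bank 3 has ADDRV set, which is why only that bank prints an address.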
Richard Karhuse wrote:
> Dag's Repo has the new memtest86+ 2.01 RPM. I'd pull it and
> let it run overnight. While memtest86+ is good, I've recently had
> cases where is didn't find (obvious) memory errors.

My favorite test is cerberus (ctcs). Quite a few OEMs out there use it to burn in their systems. For me it can typically find a problem within a few hours, whereas I've let memtest run for a week without it finding anything useful.

The results of cerberus sometimes won't help you pinpoint the problem, though (often the result is just a machine crash). But at least you know there is an issue and can start swapping hardware until it's fixed (or just replace the whole system).

http://sourceforge.net/projects/va-ctcs/

nate
Lanny Marcus
2008-Jun-20 20:57 UTC
[CentOS] Kernel panic - not syncing: CPU context corrupt
On 6/20/08, Alwin Roosen <alwin.roosen at webline.be> wrote:
<snip>
> This is a brand new server, which has been tested for days with FreeBSD
> in our office, and a few days with Windows on the site of our hardware
> distributor. Now customer wants CentOS, which we installed, but after
> few days we get a kernel panic. Last night at 2:08 it gave the same
> kernel panic.

The fact that it worked OK for the first few days with FreeBSD and Windows may mean that period acted as a burn-in test, and now something in the HW has failed or is failing. Or, possibly, CentOS is exercising the HW much more heavily than the other 2 OSes did? I would suggest that you get a Knoppix Live CD, or, preferably, a CentOS Live CD, and let it roll. And you get a kernel panic _after a few days_ on CentOS. That might indicate a memory problem? Or a cooling problem?

> ws174 login: CPU 1: Machine Check Exception: 0000000000000005
> CPU 0: Machine Check Exception: 0000000000000004
> Bank 3: f62000020002010a at 0000000032c93500
> Bank 5: f20000300c000e0f
> Kernel panic - not syncing: CPU context corrupt
> Bank 3: f62000020002010a

Two banks of memory (3 and 5) have problems? If the RAM tests OK, I suggest you swap the motherboard.
Lanny Marcus
2008-Jun-20 23:08 UTC
[CentOS] Kernel panic - not syncing: CPU context corrupt
On 6/20/08, Alwin Roosen <alwin.roosen at webline.be> wrote:
> This is a brand new server, which has been tested for days with FreeBSD
> in our office, and a few days with Windows on the site of our hardware
> distributor. Now customer wants CentOS, which we installed, but after
> few days we get a kernel panic. Last night at 2:08 it gave the same
> kernel panic.

Have you checked to verify that the fans are spinning? Since it is a new system, I think you should take it back to your HW distributor and have them run cerberus (ctcs) on it, as nate suggested in his reply to Richard Karhuse. If it takes a few days to hit the kernel panic, I doubt that it is related to the OS. Let your HW distributor do the work of troubleshooting and replacing whatever component(s) are faulty. They can get a CentOS Live CD and run that on it.