Hi all,

We are running a new CentOS-4 server, and it has kernel panicked on us 4 times in the last month. After the first kernel panic we hooked up a serial console to the server and captured the output in order to have a record of what happens. I've included the error messages from the last time it locked up... but it doesn't really mean much to me. Anybody have any ideas what might be causing this server lock up?

Server description:
- Dell PE1750
- dual 2.8GHz Xeon (with Hyper Threading on)
- 2GB DDR RAM
- Perc4-DI onboard RAID using 3 SCSI drives in RAID-5 configuration
- ext3 file system
- kernel-smp-2.6.9-5.0.3.EL
- mysql - from distribution
- 2 postfix instances rebuilt with MySQL support
- amavisd-new
- clamav
- spamassassin
- rbldnsd
- bind

Here's the captured output from a serial console connected to the server at time of fault.

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
f8872da8
*pde = 35562001
Oops: 0000 [#1]
SMP
Modules linked in: md5 ipv6 autofs4 sunrpc dm_mod button battery ac ohci_hcd tg3 floppy sg ext3 jbd megaraid_mbox megaraid_mm sd_mod scsi_mod
CPU:    1
EIP:    0060:[<f8872da8>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-5.0.3.ELsmp)
EIP is at __journal_file_buffer+0x1b/0x221 [jbd]
eax: 00000000   ebx: d2fff26c   ecx: 00000008   edx: c2327680
esi: c2327680   edi: 00000008   ebp: 00000000   esp: f7533dd4
ds: 007b   es: 007b   ss: 0068
Process kjournald (pid: 210, threadinfo=f7533000 task=f75825b0)
Stack: 00000000 00000000 f148fad8 f7f66200 d2fff26c c2327680 f887351b 00000286
       00000000 00000000 00000000 00000000 00000000 d2517e6c f7f66200 caa4c50c
       00001f18 00000000 f75825b0 c011e8d2 f7533e44 f7533e44 f750c054 f8836f24
Call Trace:
 [<f887351b>] journal_commit_transaction+0x310/0xfb1 [jbd]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<f8836f24>] megaraid_isr+0x1ad/0x1bf [megaraid_mbox]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<c011bcd5>] finish_task_switch+0x30/0x66
 [<c02c4363>] schedule+0x833/0x869
 [<c0127e62>] del_timer_sync+0x7a/0x9c
 [<f8875e6d>] kjournald+0xc7/0x215 [jbd]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<c011bd1d>] schedule_tail+0x12/0x55
 [<f8875da0>] commit_timeout+0x0/0x5 [jbd]
 [<f8875da6>] kjournald+0x0/0x215 [jbd]
 [<c01041f1>] kernel_thread_helper+0x5/0xb
Code: 14 ba 01 00 00 00 83 c4 10 89 d0 5b 5e 5f 5d c3 55 31 ed 57 89 cf
 56 89 d6 53 53 53 89 c3 c7 44 24 04 00 00 00 00 8b 00 89 04 24 <8b> 00
 a9 00 00 08 00 75 29 68 d4 85 87 f8 68 9b 07 00 00 68 55
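When reading a dump like this one, the two pieces that usually narrow things down first are the `EIP is at` line (the function that faulted) and the modules named in the call trace. Below is a minimal Python sketch that pulls both out of the captured text. The helper name `summarize_oops` and the regular expression are made up for this illustration and are not part of any kernel tooling; the sample is an abridged copy of the trace from the post.

```python
import re

# Hypothetical helper for illustration only: matches "symbol+0xOFF/0xLEN [module]"
# as printed in a 2.6-era oops; the "[module]" part is optional.
SYMBOL_RE = re.compile(r"(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?")


def summarize_oops(text):
    """Return (faulting_symbol, faulting_module, modules_seen_in_call_trace)."""
    fault = re.search(r"EIP is at " + SYMBOL_RE.pattern, text)
    symbol, module = (fault.group(1), fault.group(2)) if fault else (None, None)

    # Only look at what follows "Call Trace:" when collecting modules.
    _, _, trace = text.partition("Call Trace:")
    modules = []
    for match in SYMBOL_RE.finditer(trace):
        mod = match.group(2)
        if mod and mod not in modules:
            modules.append(mod)
    return symbol, module, modules


# Abridged copy of the trace from the original post.
OOPS = """\
EIP is at __journal_file_buffer+0x1b/0x221 [jbd]
Call Trace:
 [<f887351b>] journal_commit_transaction+0x310/0xfb1 [jbd]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<f8836f24>] megaraid_isr+0x1ad/0x1bf [megaraid_mbox]
 [<f8875e6d>] kjournald+0xc7/0x215 [jbd]
"""

if __name__ == "__main__":
    print(summarize_oops(OOPS))
```

For this trace the summary points at `jbd` (the ext3 journalling layer) as the place the fault surfaced, with `megaraid_mbox` also on the stack. That says where the kernel tripped, not necessarily where the underlying fault lies.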
Bob Pierce wrote:
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
> f8872da8
> *pde = 35562001
> Oops: 0000 [#1]
> SMP

No expert here, but I just had this same type of error on a workstation. It wouldn't even boot anymore; it panicked on startup. I personally had never seen this error before. I pulled the RAM modules, cleaned the contacts and reseated them. It has not happened again.

So I'd test or swap out the memory. Just a thought.
Bob Pierce wrote:
> Server description:
> -Dell PE1750 - dual 2.8Ghz Xeon (with Hyper Threading on) - 2GB DDR
> RAM - Perc4-DI onboard RAID using 3 scsi drives in raid-5 configuration
> [...]
> Unable to handle kernel NULL pointer dereference at virtual address
> 00000000
> [...]
> EIP is at __journal_file_buffer+0x1b/0x221 [jbd]
> [...]
> Call Trace:
> [<f887351b>] journal_commit_transaction+0x310/0xfb1 [jbd]
> [<c011e8d2>] autoremove_wake_function+0x0/0x2d
> [<f8836f24>] megaraid_isr+0x1ad/0x1bf [megaraid_mbox]
> [...]

It looks to me as if there is a problem with the RAID. I'm not too familiar with LSI OEM parts for Dell (I'm guessing it's LSI, since the trace mentions megaraid; I'm too lazy to google it), but I'm guessing that the Perc4-DI is a host RAID? I would look into it, and really think about getting a hardware RAID card if it is.

I've had nothing but problems with onboard host RAIDs myself. I gave up on them and just went and used LVM's software RAID, and it actually performs much better now. I've even seen benchmarks saying the same thing. But we are still switching to hardware RAID, for much easier restoring.

-- 
Brian Trudeau, I.T., Q.A. Inspector
Eastek International Corporation
330 Hastings Drive, Buffalo Grove, IL 60089
Tel: (847) 353-8300 Ext. 213
Fax: (847) 353-8900
Web: http://www.eastek-intl.com
Email: btrudeau at eastek-intl.com
On Tue, April 12, 2005 3:08 pm, Bob Pierce said:
> We are running a new Centos-4 server, and it has kernel panicked on us 4
> times in the last month.
> [...]
> Server description:
> -Dell PE1750 - dual 2.8Ghz Xeon (with Hyper Threading on) - 2GB DDR RAM
> - Perc4-DI onboard RAID using 3 scsi drives in raid-5 configuration
> -ext3 file system
> -kernel-smp-2.6.9-5.0.3.EL
> [...]
> EIP is at __journal_file_buffer+0x1b/0x221 [jbd]
> [...]
> Process kjournald (pid: 210, threadinfo=f7533000 task=f75825b0)
> [...]

No idea what is causing this (it looks like a filesystem process to me), but we have a new kernel (that will be included in CentOS-4.1). It is kernel-2.6.9-6.37.EL.src.rpm. I would be glad to give you the new i686-smp kernel to see if it solves your problem.

Are these EM64T Xeons or i686 (32-bit) Xeons?
http://www.intel.com/products/processor/xeon/index.htm

(Looking at the Dell site, I think they are 32-bit. If I am wrong and they are the EM64T Xeons, you should have installed the x86_64 distro instead of the i386 one.)

I also recommend the latest SCSI controller BIOS:
http://support.dell.com/support/downloads/format.aspx?c=us&cs=04&l=en&s=bsd&SystemID=PWE_PNT_XEO_1750&os=LE30&osl=en&deviceid=2608&devlib=35&category=35&releaseid=R85295

and server BIOS:
http://support.dell.com/support/downloads/format.aspx?c=us&cs=04&l=en&s=bsd&SystemID=PWE_PNT_XEO_1750&os=LE30&osl=en&deviceid=159&devlib=1&category=1&releaseid=R87618

-- 
Johnny Hughes
<http://www.HughesJR.com/>
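On the EM64T question: rather than going by the vendor's spec page, the running system can usually answer it. On x86 Linux, a CPU that supports 64-bit long mode lists the `lm` flag in `/proc/cpuinfo`. A small Python sketch of that check follows; the function name `supports_64bit` is invented for this example, and the check assumes the kernel in use reports the flag.

```python
def supports_64bit(cpuinfo_text):
    """True if any 'flags' line in /proc/cpuinfo-style text lists 'lm' (long mode)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Compare whole flag names so that e.g. "lahf_lm" does not count as "lm".
        if key.strip() == "flags" and "lm" in value.split():
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("64-bit capable:", supports_64bit(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

If the flag is absent, the i386 distribution is the right one for the box; if it is present, the x86_64 distribution becomes an option.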
Have a closer look at jbd :-)

On 4/12/05, Bob Pierce <pierceb at westmancom.com> wrote:
> [...]
> EIP is at __journal_file_buffer+0x1b/0x221 [jbd]
> eax: 00000000 ebx: d2fff26c ecx: 00000008 edx: c2327680
> [...]
> Process kjournald (pid: 210, threadinfo=f7533000 task=f75825b0)
> [...]
> Call Trace:
> [<f887351b>] journal_commit_transaction+0x310/0xfb1 [jbd]
> [...]
> [<f8875e6d>] kjournald+0xc7/0x215 [jbd]
> [...]
> Code: 14 ba 01 00 00 00 83 c4 10 89 d0 5b 5e 5f 5d c3 55 31 ed 57 89 cf 56
> 89 d6 53 53 53 89 c3 c7 44 24 04 00 00 00 00 8b 00 89 04 24 <8b> 00 a9 00 00
> 08 00 75 29 68 d4 85 87 f8 68 9b 07 00 00 68 55
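One way to take that closer look is the `Code:` line: the kernel puts angle brackets around the byte at the faulting EIP. A rough Python sketch that extracts the marked bytes and a register value from the dump is below; `faulting_bytes` and `register_value` are hypothetical helper names chosen for this example. As far as I can tell, on 32-bit x86 the bytes `8b 00` encode `mov (%eax),%eax`, which together with `eax: 00000000` is consistent with the reported NULL dereference at address 00000000.

```python
import re


def faulting_bytes(code_line, count=2):
    """Return `count` opcode bytes starting at the one marked <xx> in a 'Code:' line."""
    tokens = code_line.replace("Code:", "").split()
    for i, tok in enumerate(tokens):
        if tok.startswith("<") and tok.endswith(">"):
            return [tok.strip("<>")] + tokens[i + 1:i + count]
    return []


def register_value(dump, name):
    """Return the 8-digit hex value printed for a register such as 'eax', or None."""
    match = re.search(rf"\b{name}: ([0-9a-f]{{8}})", dump)
    return match.group(1) if match else None


# Copied from the original post.
CODE_LINE = (
    "Code: 14 ba 01 00 00 00 83 c4 10 89 d0 5b 5e 5f 5d c3 55 31 ed 57 89 cf "
    "56 89 d6 53 53 53 89 c3 c7 44 24 04 00 00 00 00 8b 00 89 04 24 <8b> 00 "
    "a9 00 00 08 00 75 29 68 d4 85 87 f8 68 9b 07 00 00 68 55"
)
REGISTERS = "eax: 00000000 ebx: d2fff26c ecx: 00000008 edx: c2327680"

if __name__ == "__main__":
    print(faulting_bytes(CODE_LINE), register_value(REGISTERS, "eax"))
```

This only confirms what faulted inside jbd. It does not say why the pointer was NULL, which could still be a driver bug, bad firmware, or bad memory.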
I think we might be interested in trying that new kernel. I will be upgrading the server BIOS and SCSI RAID firmware this morning, then we'll wait and see. If that doesn't help, I think our next steps will be a new kernel and a memory scan.

Thanks for your help,
Bob.

-----Original Message-----
From: centos-bounces at centos.org [mailto:centos-bounces at centos.org] On Behalf Of Johnny Hughes
Sent: Wednesday, April 13, 2005 7:18 AM
To: CentOS ML
Subject: Re: [CentOS] Centos-4 Kernel pannic

On Wed, 2005-04-13 at 07:47 -0400, James Olin Oden wrote:
> This is not necessarily a problem with your hardware but could be a
> bona fide bug in the megaraid device driver.

Looking at the changelog for the new kernel (from 2.6.9-5.0.3.EL up to 2.6.9-6.37.EL), there are several megaraid and/or SCSI device driver changes, so this may be fixed with the new kernel.
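For anyone wanting to repeat the changelog check, `rpm -q --changelog <package>` prints a package's changelog, and the driver-related entries can be filtered out of it. A hedged Python sketch: the helper names are invented here, the default package name `kernel-smp` is an assumption based on the kernel named in the thread, and the keyword list is just the two drivers mentioned above.

```python
import subprocess


def driver_changes(changelog_text, keywords=("megaraid", "scsi")):
    """Return changelog lines that mention any keyword (case-insensitive)."""
    hits = []
    for line in changelog_text.splitlines():
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append(line.strip())
    return hits


def installed_changelog(package="kernel-smp"):
    """Changelog text of an installed package, via `rpm -q --changelog`."""
    result = subprocess.run(
        ["rpm", "-q", "--changelog", package],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    try:
        for entry in driver_changes(installed_changelog()):
            print(entry)
    except (OSError, subprocess.CalledProcessError) as err:
        print("could not read the changelog:", err)
```

Matching on "scsi" is deliberately broad, so expect some unrelated entries in the output.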
Just an update to this...

I upgraded the firmware on the SCSI RAID controller to version 413O-A09, as found at this link:
http://support.dell.com/support/downloads/format.aspx?c=us&cs=04&l=en&s=bsd&SystemID=PWE_PNT_XEO_1750&os=LE30&osl=en&deviceid=2608&devlib=35&category=35&releaseid=R85295

Since upgrading the firmware we have had no more kernel panic problems.

Thanks to everyone for your help, and thanks to Johnny Hughes for providing the easy links to the firmware.

Bob.