thr3ads.net - CentOS - [CentOS] Centos-4 Kernel pannic [Apr 2005]

Bob Pierce

2005-Apr-12 20:08 UTC

[CentOS] Centos-4 Kernel pannic

Hi all,

We are running a new Centos-4 server, and it has kernel panicked on us 4
times in the last month. After the first kernel panic we hooked up a
serial console to the server and captured the output in order to have a
record of what happens.  I've included the error messages from the last
time it locked up... but it doesn't really mean much to me. Anybody have
any ideas what might be causing this server lock up?

Server description:
- Dell PE1750, dual 2.8 GHz Xeon (Hyper-Threading on), 2 GB DDR RAM
- Perc4-DI onboard RAID using 3 SCSI drives in a RAID-5 configuration
- ext3 file system
- kernel-smp-2.6.9-5.0.3.EL
- MySQL (from the distribution)
- 2 Postfix instances rebuilt with MySQL support
- amavisd-new
- ClamAV
- SpamAssassin
- rbldnsd
- BIND


Here's the captured output from a serial console connected to the server
at time of fault.

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
f8872da8
*pde = 35562001
Oops: 0000 [#1]
SMP 
Modules linked in: md5 ipv6 autofs4 sunrpc dm_mod button battery ac ohci_hcd tg3 floppy sg ext3 jbd megaraid_mbox megaraid_mm sd_mod scsi_mod
CPU:    1
EIP:    0060:[<f8872da8>]    Not tainted VLI
EFLAGS: 00010246   (2.6.9-5.0.3.ELsmp) 
EIP is at __journal_file_buffer+0x1b/0x221 [jbd]
eax: 00000000   ebx: d2fff26c   ecx: 00000008   edx: c2327680
esi: c2327680   edi: 00000008   ebp: 00000000   esp: f7533dd4
ds: 007b   es: 007b   ss: 0068
Process kjournald (pid: 210, threadinfo=f7533000 task=f75825b0)
Stack: 00000000 00000000 f148fad8 f7f66200 d2fff26c c2327680 f887351b 00000286
       00000000 00000000 00000000 00000000 00000000 d2517e6c f7f66200 caa4c50c
       00001f18 00000000 f75825b0 c011e8d2 f7533e44 f7533e44 f750c054 f8836f24
Call Trace:
 [<f887351b>] journal_commit_transaction+0x310/0xfb1 [jbd]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<f8836f24>] megaraid_isr+0x1ad/0x1bf [megaraid_mbox]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<c011bcd5>] finish_task_switch+0x30/0x66
 [<c02c4363>] schedule+0x833/0x869
 [<c0127e62>] del_timer_sync+0x7a/0x9c
 [<f8875e6d>] kjournald+0xc7/0x215 [jbd]
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<c011e8d2>] autoremove_wake_function+0x0/0x2d
 [<c011bd1d>] schedule_tail+0x12/0x55
 [<f8875da0>] commit_timeout+0x0/0x5 [jbd]
 [<f8875da6>] kjournald+0x0/0x215 [jbd]
 [<c01041f1>] kernel_thread_helper+0x5/0xb
Code: 14 ba 01 00 00 00 83 c4 10 89 d0 5b 5e 5f 5d c3 55 31 ed 57 89 cf 56 89 d6 53 53 53 89 c3 c7 44 24 04 00 00 00 00 8b 00 89 04 24 <8b> 00 a9 00 00 08 00 75 29 68 d4 85 87 f8 68 9b 07 00 00 68 55
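For anyone reading a capture like this one, the lines that usually matter first are "EIP is at", which names the function and module where the fault happened (here __journal_file_buffer in jbd, 0x1b bytes into a 0x221-byte function), the "Process" line, and the "Call Trace". The Python below is a minimal sketch that pulls those fields out of a saved serial-console log. The regular expressions and the summarize_oops name are my own assumptions, matched only against the 2.6.9-era i386 format shown here; other kernels lay out the oops differently.

```python
import os
import re
import sys

# Field patterns for a 2.6-era i386 oops, matched against the format in this
# thread's capture. Other kernel versions may print these lines differently.
EIP_RE = re.compile(
    r"EIP is at (?P<func>\w+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<mod>\w+)\])?"
)
PROC_RE = re.compile(r"Process (?P<name>\S+) \(pid: (?P<pid>\d+)")
# One Call Trace entry: [<address>] function+0xoffset/0xsize [module]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)\])?"
)


def summarize_oops(text):
    """Extract the faulting function, process and call-trace frames.

    Hypothetical helper: any field whose line is missing comes back as None.
    """
    eip = EIP_RE.search(text)
    proc = PROC_RE.search(text)
    return {
        "func": eip["func"] if eip else None,
        "offset": int(eip["off"], 16) if eip else None,
        "size": int(eip["size"], 16) if eip else None,
        "module": eip["mod"] if eip else None,
        "process": proc["name"] if proc else None,
        "pid": int(proc["pid"]) if proc else None,
        # Each frame is (function, module or None); there is no [module]
        # suffix for functions built into the kernel image itself.
        "frames": [(m["func"], m["mod"]) for m in FRAME_RE.finditer(text)],
    }


if __name__ == "__main__" and len(sys.argv) > 1 and os.path.exists(sys.argv[1]):
    # Usage: python oops_summary.py serial-console.log
    with open(sys.argv[1]) as f:
        info = summarize_oops(f.read())
    if info["func"] is None:
        sys.exit("no 'EIP is at' line found")
    print(f"{info['func']}+{info['offset']:#x}/{info['size']:#x} [{info['module']}]")
    print(f"process {info['process']} (pid {info['pid']})")
    for func, mod in info["frames"]:
        print(f"  {func}" + (f" [{mod}]" if mod else ""))
```

Run against this capture, it would report __journal_file_buffer in jbd, process kjournald (pid 210), with jbd and megaraid_mbox frames in the trace.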



Ken Godee

2005-Apr-12 20:28 UTC

[CentOS] Centos-4 Kernel pannic

Bob Pierce wrote:
> 
> Unable to handle kernel NULL pointer dereference at virtual address 
> 00000000
>  printing eip:
> f8872da8
> *pde = 35562001
> Oops: 0000 [#1]
> SMP
No expert here, but just had this same type of error on a workstation.

Wouldn't even boot anymore, panic on start up. I personally
had never seen this error before.

Pulled ram modules, cleaned contacts and reseated back in place.
Has not happened again. Soooooo, I'd test/change out memory.

Just a thought.

Brian Trudeau

2005-Apr-12 20:36 UTC

[CentOS] Centos-4 Kernel pannic

Looks to me like there is a problem with the RAID. I'm not too familiar
with LSI OEM parts for Dell (I'm guessing it's LSI, since the trace mentions
megaraid; I'm too lazy to google it), but I'm guessing that the Perc4-DI
is a host RAID? I would look into it, and seriously consider getting a
hardware RAID card if it is. I've had nothing but problems with onboard
host RAIDs myself; I gave up on them and just went with LVM's software
RAID, which actually performs much better now. I've even seen benchmarks
saying the same thing. But we are still switching to hardware RAID, for
much easier restoring.

-- 
Brian Trudeau,   I.T., Q.A. Inspector
Eastek International Corporation
330 Hastings Drive,   Buffalo Grove, IL 60089
Tel: (847) 353-8300 Ext. 213   Fax: (847) 353-8900
Web: http://www.eastek-intl.com   Email: btrudeau at eastek-intl.com

Johnny Hughes

2005-Apr-12 21:00 UTC

[CentOS] Centos-4 Kernel pannic

No idea what is causing this (it looks like a filesystem process to me), but
we have a new kernel that will be included in CentOS-4.1:
kernel-2.6.9-6.37.EL.src.rpm.

I would be glad to give you the new i686-smp kernel to see if it solves
your problem.
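Bob's running kernel is 2.6.9-5.0.3.EL and the offered one is 2.6.9-6.37.EL. If you want to script a check for whether a box is still on the older build, the release parts need to be compared run by run, numerically; a plain string comparison happens to work for this pair but gets cases like 6.9 vs 6.37 wrong. The Python below is a simplified sketch loosely modeled on RPM's version comparison. It ignores epochs and several other RPM rules, and the function names are hypothetical, so treat rpm itself as the authority.

```python
import re


def _segments(s):
    """Split into numeric and alphabetic runs; separators like '.' are dropped."""
    return [int(t) if t.isdigit() else t for t in re.findall(r"\d+|[A-Za-z]+", s)]


def compare_segments(a, b):
    """Return -1, 0 or 1. Simplified, RPM-like: numeric runs compare as numbers."""
    sa, sb = _segments(a), _segments(b)
    for x, y in zip(sa, sb):
        if x == y:
            continue
        if isinstance(x, int) != isinstance(y, int):
            # Loosely following RPM: a numeric run is treated as newer than letters.
            return 1 if isinstance(x, int) else -1
        return -1 if x < y else 1
    # All shared runs are equal: the string with extra runs is treated as newer.
    return (len(sa) > len(sb)) - (len(sa) < len(sb))


def compare_kernel(a, b):
    """Compare 'version-release' strings such as '2.6.9-5.0.3.EL'."""
    va, _, ra = a.rpartition("-")
    vb, _, rb = b.rpartition("-")
    return compare_segments(va, vb) or compare_segments(ra, rb)


print(compare_kernel("2.6.9-5.0.3.EL", "2.6.9-6.37.EL"))  # -1: running kernel is older
```

On this server uname -r would include the smp suffix (2.6.9-5.0.3.ELsmp, as in the oops EFLAGS line); the extra letters do not change the result for this pair.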

Are these EM64T Xeons or i686 (32-bit) Xeons? See:
http://www.intel.com/products/processor/xeon/index.htm
(looking at the Dell site, I think they are 32-bit)

(If I am wrong and it is the EM64T Xeons, you should have installed the
x86_64 distro instead of the i386 one)
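One way to answer the EM64T question from the running system rather than from spec sheets: on Linux, CPUs that support x86-64 long mode list the lm flag on the "flags" line of /proc/cpuinfo. The Python below is a minimal sketch that checks every processor entry for that flag. The function names are hypothetical, and note that this reports what the CPU can do, not whether the i386 or x86_64 distribution is installed.

```python
import os


def cpuinfo_flag_sets(cpuinfo_text):
    """Return one set of flags per 'flags' line in /proc/cpuinfo text."""
    sets = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            sets.append(set(value.split()))
    return sets


def supports_x86_64(cpuinfo_text):
    """True if every listed processor advertises 'lm' (long mode, i.e. x86-64).

    Whole tokens are compared, so a flag such as 'lahf_lm' does not count.
    """
    sets = cpuinfo_flag_sets(cpuinfo_text)
    return bool(sets) and all("lm" in flags for flags in sets)


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print("x86-64 capable" if supports_x86_64(f.read()) else "32-bit only")
```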

Also recommend the latest SCSI Controller BIOS:
http://support.dell.com/support/downloads/format.aspx?c=us&cs=04&l=en&s=bsd&SystemID=PWE_PNT_XEO_1750&os=LE30&osl=en&deviceid=2608&devlib=35&category=35&releaseid=R85295

and Server BIOS:
http://support.dell.com/support/downloads/format.aspx?c=us&cs=04&l=en&s=bsd&SystemID=PWE_PNT_XEO_1750&os=LE30&osl=en&deviceid=159&devlib=1&category=1&releaseid=R87618

-- 
Johnny Hughes
<http://www.HughesJR.com/>

tobaccofarm

2005-Apr-13 06:22 UTC

[CentOS] Centos-4 Kernel pannic

Have a closer look at jbd :-)
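Following that hint, the Code: line in the oops can be checked by hand. The byte in angle brackets is the one at EIP. On i386, 8b is MOV r32, r/m32, and the ModRM byte 00 selects eax as both destination and base register, so the faulting instruction is mov eax, [eax]. With eax: 00000000 in the register dump, the CPU read address 0, which matches "NULL pointer dereference at virtual address 00000000". The next instruction, a9 00 00 08 00, is test eax, 0x80000. The Python below is a toy decoder for exactly these two opcodes, not a disassembler (objdump is the usual tool for real work), and all function names in it are hypothetical.

```python
import re

# Toy decoder for the fault site in this oops. It is NOT a general x86
# disassembler: it knows only opcode 0x8b (MOV r32, r/m32) with a mod=00
# ModRM byte, and opcode 0xa9 (TEST EAX, imm32).
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
REG_RE = re.compile(r"\b(e[a-d]x|e[sd]i|e[bs]p): ([0-9a-f]{8})")


def parse_registers(dump):
    """Map register names to values from the 'eax: ... esp: ...' lines."""
    return {name: int(value, 16) for name, value in REG_RE.findall(dump)}


def split_code_line(code):
    """Split a 'Code:' line into (bytes before EIP, bytes starting at EIP)."""
    body = code.split("Code:", 1)[1]
    before, after = body.split("<", 1)
    fault_byte, rest = after.split(">", 1)
    pre = bytes.fromhex(" ".join(before.split()))
    post = bytes.fromhex(" ".join((fault_byte + rest).split()))
    return pre, post


def decode_one(buf, regs):
    """Decode one instruction; return (text, length, address read or None)."""
    op = buf[0]
    if op == 0x8B:
        modrm = buf[1]
        mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
        # rm=4 needs a SIB byte and rm=5 means disp32; neither is handled here.
        if mod != 0 or rm in (4, 5):
            raise NotImplementedError("only simple [reg] operands are handled")
        return f"mov {REG32[reg]}, [{REG32[rm]}]", 2, regs[REG32[rm]]
    if op == 0xA9:
        imm = int.from_bytes(buf[1:5], "little")
        return f"test eax, {imm:#x}", 5, None
    raise NotImplementedError(f"opcode {op:#04x} is not handled")


CODE = ("Code: 14 ba 01 00 00 00 83 c4 10 89 d0 5b 5e 5f 5d c3 55 31 ed 57 89 cf "
        "56 89 d6 53 53 53 89 c3 c7 44 24 04 00 00 00 00 8b 00 89 04 24 <8b> 00 "
        "a9 00 00 08 00 75 29 68 d4 85 87 f8 68 9b 07 00 00 68 55")
REGS = parse_registers("eax: 00000000   ebx: d2fff26c   ecx: 00000008   edx: c2327680\n"
                       "esi: c2327680   edi: 00000008   ebp: 00000000   esp: f7533dd4")

_, at_eip = split_code_line(CODE)
text, length, addr = decode_one(at_eip, REGS)
print(text, "-> reads address", f"{addr:08x}")  # mov eax, [eax] -> reads address 00000000
print(decode_one(at_eip[length:], REGS)[0])     # test eax, 0x80000
```

In other words, the pointer jbd was following had already been zeroed before kjournald dereferenced it; the decoding alone does not say whether bad memory, the RAID driver, or a jbd bug put the zero there.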

Bob Pierce

2005-Apr-13 13:20 UTC

[CentOS] Centos-4 Kernel pannic

I think we might be interested in trying that new kernel.

I will be upgrading the server BIOS and SCSI RAID firmware this morning,
then we'll wait and see. If that doesn't help, I think our next steps
will be a new kernel and a memory scan.

Thanks for your help,
Bob.

-----Original Message-----
From: centos-bounces at centos.org [mailto:centos-bounces at centos.org] On
Behalf Of Johnny Hughes
Sent: Wednesday, April 13, 2005 7:18 AM
To: CentOS ML
Subject: Re: [CentOS] Centos-4 Kernel pannic


On Wed, 2005-04-13 at 07:47 -0400, James Olin Oden wrote:
> This is not necessarily a problem with your hardware but could be a
> bona fide bug in the megaraid device driver.
Looking at the changelog for the new kernel (from 2.6.9-5.0.3.EL up to
2.6.9-6.37.EL), there are several megaraid and/or SCSI device driver
changes, so this may be fixed with the new kernel.

Bob Pierce

2005-Apr-26 13:20 UTC

[CentOS] Centos-4 Kernel pannic

Just an update to this...

I upgraded the firmware on the SCSI RAID controller to version 413O-A09,
as found at this link:
http://support.dell.com/support/downloads/format.aspx?c=us&cs=04&l=en&s=bsd&SystemID=PWE_PNT_XEO_1750&os=LE30&osl=en&deviceid=2608&devlib=35&category=35&releaseid=R85295

Since upgrading the firmware we have had no more kernel panic problems.

Thanks to everyone for your help, and thanks to Johnny Hughes for
providing the easy links to the firmware.

Bob.
