Christian Schlittchen
2006-Oct-27 00:29 UTC
[Ocfs2-users] BUG: unable to handle kernel NULL pointer dereference
Thanks to synchronous writes on the log files I finally managed to get a log of the regular panics we experience.

The setup is as follows: three blades (IBM HS20) accessing shared storage on a fibre-channel-connected storage server (IBM DS4300). The storage is used as central mail storage for about 35,000 users, so it is pretty heavy duty storage-wise.

blade01 crashes every few days with a kernel panic. Unfortunately, all watchdogs we tried fail to reboot the machine, and setting /proc/sys/kernel/panic and /proc/sys/kernel/panic_on_oops to non-zero values doesn't help either. The machine still responds to pings, but to nothing else. Even more unfortunately, the file system on the other blades starts to hang some time after blade01 crashes.

Logging /proc/slabinfo showed a steady increase in the number of size-256 and size-32 objects, and we thought the crashes might have something to do with it. We then did a nightly umount/mount, which reduced the values a bit and which does seem to reduce the frequency of crashes slightly.

Nevertheless, today we had a crash with rather low values of size-256 and size-32.

From /proc/slabinfo, timestamped, a few seconds before the crash:

2006-10-27-06:20:01 size-256 92187 169605 256 15 1 : tunables 120 60 8 : slabdata 11307 11307 0
2006-10-27-06:20:01 size-32 94037 534942 32 113 1 : tunables 120 60 8 : slabdata 4734 4734 0

The kern.log shows:

Oct 27 06:20:11 blade01 kernel: BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
Oct 27 06:20:11 blade01 kernel: printing eip:
Oct 27 06:20:11 blade01 kernel: f92b9431
Oct 27 06:20:11 blade01 kernel: *pde = 00000000
Oct 27 06:20:11 blade01 kernel: Oops: 0002 [#1]
Oct 27 06:20:11 blade01 kernel: SMP
Oct 27 06:20:11 blade01 kernel: Modules linked in: i6300esb ocfs2 xt_state ip_conntrack xt_limit ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager md_mod dm_snapshot dm_mirror dm_mod mptctl qla2xxx i2c_i801 firmware_class i2c_core scsi_transport_fc rtc
Oct 27 06:20:11 blade01 kernel: CPU: 1
Oct 27 06:20:11 blade01 kernel: EIP: 0060:[<f92b9431>] Not tainted VLI
Oct 27 06:20:11 blade01 kernel: EFLAGS: 00010286 (2.6.18 #1)
Oct 27 06:20:11 blade01 kernel: EIP is at dlm_add_migration_mle+0x1f6/0x30a [ocfs2_dlm]
Oct 27 06:20:11 blade01 kernel: eax: 00000000 ebx: d61e4c00 ecx: c4ce5988 edx: 00000000
Oct 27 06:20:11 blade01 kernel: esi: f7531de4 edi: c4ce5980 ebp: e1873080 esp: f7531d6c
Oct 27 06:20:11 blade01 kernel: ds: 007b es: 007b ss: 0068
Oct 27 06:20:11 blade01 kernel: Process o2net (pid: 1698, ti=f7530000 task=c215b560 task.ti=f7530000)
Oct 27 06:20:11 blade01 kernel: Stack: 00000000 c0327a2c f7531d88 e6805a80 f7531e6c 00000048 00000040 d61e4c00
Oct 27 06:20:11 blade01 kernel: d899a020 00000000 00000001 00000000 01020000 00000000 d899a021 0000004d
Oct 27 06:20:11 blade01 kernel: c4ce5980 00000000 d61e4c00 fffffff4 f92bb927 f7531de4 d899a020 0000001f
Oct 27 06:20:11 blade01 kernel: Call Trace:
Oct 27 06:20:11 blade01 kernel: [<c0327a2c>] sock_recvmsg+0xe9/0x10b
Oct 27 06:20:11 blade01 kernel: [<f92bb927>] dlm_migrate_request_handler+0x17b/0x231 [ocfs2_dlm]
Oct 27 06:20:11 blade01 kernel: [<f9256762>] o2net_process_message+0x46e/0x626 [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel: [<c0120312>] __do_softirq+0x73/0xdf
Oct 27 06:20:11 blade01 kernel: [<f9256057>] o2net_recv_tcp_msg+0x6b/0x7e [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel: [<c0114142>] find_busiest_group+0x129/0x4f9
Oct 27 06:20:11 blade01 kernel: [<f925819e>] o2net_rx_until_empty+0x1e6/0x6b9 [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel: [<c011619f>] __wake_up+0x32/0x43
Oct 27 06:20:11 blade01 kernel: [<c012af5b>] run_workqueue+0x73/0xe1
Oct 27 06:20:11 blade01 kernel: [<f9257fb8>] o2net_rx_until_empty+0x0/0x6b9 [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel: [<c012b710>] worker_thread+0x143/0x15f
Oct 27 06:20:11 blade01 kernel: [<c011563d>] default_wake_function+0x0/0x15
Oct 27 06:20:11 blade01 kernel: [<c012b5cd>] worker_thread+0x0/0x15f
Oct 27 06:20:11 blade01 kernel: [<c012e151>] kthread+0xfc/0x100
Oct 27 06:20:11 blade01 kernel: [<c012e055>] kthread+0x0/0x100
Oct 27 06:20:11 blade01 kernel: [<c0100d95>] kernel_thread_helper+0x5/0xb
Oct 27 06:20:11 blade01 kernel: Code: 98 0a 00 00 c7 44 24 0c 62 81 2c f9 89 54 24 08 89 44 24 04 c7 04 24 80 06 2d f9 e8 85 29 e6 c6 e9 57 fe ff ff 8b 57 08 8b 41 04 <89> 42 04 89 10 89 4f 08 89 49 04 eb 9c f7 05 a0 2b 26 f9 00 09
Oct 27 06:20:11 blade01 kernel: EIP: [<f92b9431>] dlm_add_migration_mle+0x1f6/0x30a [ocfs2_dlm] SS:ESP 0068:f7531d6c

This is with a vanilla 2.6.18 kernel from kernel.org. There were no suspicious messages in the logs before the crash.
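A minimal sketch of setting the two /proc/sys/kernel knobs mentioned above from a small C helper; this is only an illustration (equivalent to sysctl -w kernel.panic=30 kernel.panic_on_oops=1), and the 30-second timeout is an assumption, not something taken from the report:

#include <stdio.h>

/* Write a single value into a /proc/sys file (needs root). */
static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", value);
    return fclose(f);
}

int main(void)
{
    /* Reboot 30 seconds after a panic instead of hanging (assumed value). */
    write_sysctl("/proc/sys/kernel/panic", "30");
    /* Escalate an oops (like the one above) to a full panic. */
    write_sysctl("/proc/sys/kernel/panic_on_oops", "1");
    return 0;
}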
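Likewise, a minimal sketch of the kind of timestamped size-256/size-32 logging from /proc/slabinfo quoted above; the actual logging script is not shown in the report, so this is an assumed illustration (run it from cron, output format matches the "2006-10-27-06:20:01 size-256 ..." lines):

#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    char line[512], stamp[32];
    time_t now = time(NULL);
    FILE *f = fopen("/proc/slabinfo", "r");

    if (!f) {
        perror("/proc/slabinfo");
        return 1;
    }
    strftime(stamp, sizeof(stamp), "%Y-%m-%d-%H:%M:%S", localtime(&now));

    while (fgets(line, sizeof(line), f)) {
        /* Only the two caches whose growth was being tracked. */
        if (strncmp(line, "size-256 ", 9) == 0 ||
            strncmp(line, "size-32 ", 8) == 0)
            printf("%s %s", stamp, line);
    }
    fclose(f);
    return 0;
}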
Sunil Mushran
2006-Oct-27 09:32 UTC
[Ocfs2-users] BUG: unable to handle kernel NULL pointer dereference
Please file a bugzilla with the details provided. It is easier to manage bugs that way. Thanks.

Christian Schlittchen wrote:
> Thanks to synchronous writes on the log files I finally managed to get
> a log of the regular panics we experience. [...]