Christian Schlittchen
2006-Oct-27 00:29 UTC
[Ocfs2-users] BUG: unable to handle kernel NULL pointer dereference
Thanks to synchronous writes on the log files I finally managed to get a log of the regular panics we experience.

The setup is as follows: three blades (IBM HS20) accessing shared storage on a fibre-channel-connected storage server (IBM DS4300). The storage is used as central mail storage for about 35,000 users, so it is pretty heavy duty storage-wise.

blade01 crashes every few days with a kernel panic. Unfortunately, all watchdogs we tried fail to reboot the machine, and setting /proc/sys/kernel/panic and /proc/sys/kernel/panic_on_oops to non-zero values doesn't help either. The machine still responds to pings, but to nothing else. Even more unfortunately, the file system on the other blades starts to hang some time after blade01 crashes.

Logging /proc/slabinfo showed a steady increase in the number of size-256 and size-32 objects, and we thought the crashes might have something to do with it. We then did a nightly umount/mount, which reduced the values a bit and which does seem to reduce the frequency of crashes slightly.

Nevertheless, today we had a crash with rather low values of size-256 and size-32.

From /proc/slabinfo, timestamped, a few seconds before the crash:

2006-10-27-06:20:01 size-256 92187 169605 256 15 1 : tunables 120 60 8 : slabdata 11307 11307 0
2006-10-27-06:20:01 size-32 94037 534942 32 113 1 : tunables 120 60 8 : slabdata 4734 4734 0

The kern.log shows:

Oct 27 06:20:11 blade01 kernel: BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
Oct 27 06:20:11 blade01 kernel: printing eip:
Oct 27 06:20:11 blade01 kernel: f92b9431
Oct 27 06:20:11 blade01 kernel: *pde = 00000000
Oct 27 06:20:11 blade01 kernel: Oops: 0002 [#1]
Oct 27 06:20:11 blade01 kernel: SMP
Oct 27 06:20:11 blade01 kernel: Modules linked in: i6300esb ocfs2 xt_state ip_conntrack xt_limit ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager md_mod dm_snapshot dm_mirror dm_mod mptctl qla2xxx i2c_i801 firmware_class i2c_core scsi_transport_fc rtc
Oct 27 06:20:11 blade01 kernel: CPU: 1
Oct 27 06:20:11 blade01 kernel: EIP: 0060:[<f92b9431>] Not tainted VLI
Oct 27 06:20:11 blade01 kernel: EFLAGS: 00010286 (2.6.18 #1)
Oct 27 06:20:11 blade01 kernel: EIP is at dlm_add_migration_mle+0x1f6/0x30a [ocfs2_dlm]
Oct 27 06:20:11 blade01 kernel: eax: 00000000 ebx: d61e4c00 ecx: c4ce5988 edx: 00000000
Oct 27 06:20:11 blade01 kernel: esi: f7531de4 edi: c4ce5980 ebp: e1873080 esp: f7531d6c
Oct 27 06:20:11 blade01 kernel: ds: 007b es: 007b ss: 0068
Oct 27 06:20:11 blade01 kernel: Process o2net (pid: 1698, ti=f7530000 task=c215b560 task.ti=f7530000)
Oct 27 06:20:11 blade01 kernel: Stack: 00000000 c0327a2c f7531d88 e6805a80 f7531e6c 00000048 00000040 d61e4c00
Oct 27 06:20:11 blade01 kernel: d899a020 00000000 00000001 00000000 01020000 00000000 d899a021 0000004d
Oct 27 06:20:11 blade01 kernel: c4ce5980 00000000 d61e4c00 fffffff4 f92bb927 f7531de4 d899a020 0000001f
Oct 27 06:20:11 blade01 kernel: Call Trace:
Oct 27 06:20:11 blade01 kernel: [<c0327a2c>] sock_recvmsg+0xe9/0x10b
Oct 27 06:20:11 blade01 kernel: [<f92bb927>] dlm_migrate_request_handler+0x17b/0x231 [ocfs2_dlm]
Oct 27 06:20:11 blade01 kernel: [<f9256762>] o2net_process_message+0x46e/0x626 [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel: [<c0120312>] __do_softirq+0x73/0xdf
Oct 27 06:20:11 blade01 kernel: [<f9256057>] o2net_recv_tcp_msg+0x6b/0x7e [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel: [<c0114142>] find_busiest_group+0x129/0x4f9
Oct 27 06:20:11 blade01 kernel: [<f925819e>] o2net_rx_until_empty+0x1e6/0x6b9 [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel: [<c011619f>] __wake_up+0x32/0x43
Oct 27 06:20:11 blade01 kernel: [<c012af5b>] run_workqueue+0x73/0xe1
Oct 27 06:20:11 blade01 kernel: [<f9257fb8>] o2net_rx_until_empty+0x0/0x6b9 [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel: [<c012b710>] worker_thread+0x143/0x15f
Oct 27 06:20:11 blade01 kernel: [<c011563d>] default_wake_function+0x0/0x15
Oct 27 06:20:11 blade01 kernel: [<c012b5cd>] worker_thread+0x0/0x15f
Oct 27 06:20:11 blade01 kernel: [<c012e151>] kthread+0xfc/0x100
Oct 27 06:20:11 blade01 kernel: [<c012e055>] kthread+0x0/0x100
Oct 27 06:20:11 blade01 kernel: [<c0100d95>] kernel_thread_helper+0x5/0xb
Oct 27 06:20:11 blade01 kernel: Code: 98 0a 00 00 c7 44 24 0c 62 81 2c f9 89 54 24 08 89 44 24 04 c7 04 24 80 06 2d f9 e8 85 29 e6 c6 e9 57 fe ff ff 8b 57 08 8b 41 04 <89> 42 04 89 10 89 4f 08 89 49 04 eb 9c f7 05 a0 2b 26 f9 00 09
Oct 27 06:20:11 blade01 kernel: EIP: [<f92b9431>] dlm_add_migration_mle+0x1f6/0x30a [ocfs2_dlm] SS:ESP 0068:f7531d6c

This is with a vanilla 2.6.18 kernel from kernel.org. There were no suspicious messages in the logs before the crash.
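A minimal sketch of setting the two /proc/sys/kernel knobs mentioned above from a small C helper; this is only an illustration (equivalent to sysctl -w kernel.panic=30 kernel.panic_on_oops=1), and the 30-second timeout is an assumption, not something taken from the report:

#include <stdio.h>

/* Write a single value into a /proc/sys file (needs root). */
static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", value);
    return fclose(f);
}

int main(void)
{
    /* Reboot 30 seconds after a panic instead of hanging (assumed value). */
    write_sysctl("/proc/sys/kernel/panic", "30");
    /* Escalate an oops (like the one above) to a full panic. */
    write_sysctl("/proc/sys/kernel/panic_on_oops", "1");
    return 0;
}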
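Likewise, a minimal sketch of the kind of timestamped size-256/size-32 logging from /proc/slabinfo quoted above; the actual logging script is not shown in the report, so this is an assumed illustration (run it from cron, output format matches the "2006-10-27-06:20:01 size-256 ..." lines):

#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
    char line[512], stamp[32];
    time_t now = time(NULL);
    FILE *f = fopen("/proc/slabinfo", "r");

    if (!f) {
        perror("/proc/slabinfo");
        return 1;
    }
    strftime(stamp, sizeof(stamp), "%Y-%m-%d-%H:%M:%S", localtime(&now));

    while (fgets(line, sizeof(line), f)) {
        /* Only the two caches whose growth was being tracked. */
        if (strncmp(line, "size-256 ", 9) == 0 ||
            strncmp(line, "size-32 ", 8) == 0)
            printf("%s %s", stamp, line);
    }
    fclose(f);
    return 0;
}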
Sunil Mushran
2006-Oct-27 09:32 UTC
[Ocfs2-users] BUG: unable to handle kernel NULL pointer dereference
Please file a bugzilla with the details provided. It is easier to manage bugs that way. Thanks.

Christian Schlittchen wrote:
> Thanks to synchronous writes on the log files I finally managed to get
> a log of the regular panics we experience. [...]