Hello,

I'm on 2.6.24 with OCFS2 1.3.3 and every couple days this comes up in dmesg. I have to reboot the cluster machines, there's nothing else I can do. Stopping the services or unmounting volumes fails. Perhaps this is a well-known bug, but I couldn't find it. On the other hand, do you think it could be solved by upgrading to 1.5.0?

BUG: unable to handle kernel NULL pointer dereference at virtual address 00000934
printing eip: f8d6f442 *pde = cf03a067
Oops: 0002 [#1] SMP
Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs fuse ipv6 sr_mod cdrom ata_generic pata_acpi ata_piix serio_raw bnx2 libata pcspkr iTCO_wdt iTCO_vendor_support button i5000_edac edac_core dcdbas sg dm_round_robin dm_emc dm_multipath dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc scsi_tgt megaraid_sas sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd

Pid: 575, comm: pop3 Not tainted (2.6.24.7-92.fc8 #1)
EIP: 0060:[<f8d6f442>] EFLAGS: 00210246 CPU: 2
EIP is at ocfs2_free_suballoc_bits+0x41a/0x6d9 [ocfs2]
EAX: f77b6f00 EBX: 00000000 ECX: 00000000 EDX: 000047a8
ESI: 0000000b EDI: 00000000 EBP: f35bf000 ESP: da67bcec
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process pop3 (pid: 575, ti=da67b000 task=f43b4000 task.ti=da67b000)
Stack: 00000002 da67bd2c 00000001 f77b7278 f75a4840 e0f019a0 f884e798 f7724f88
       f77b7278 f75a4840 f3bc9000 f3bc90c0 f772c968 f77b6f00 00000000 00000000
       f772c968 000047a8 00000000 f42b0800 010d81a8 f8d72da9 000047a8 010d3a00
Call Trace:
 [<f884e798>] do_get_write_access+0x329/0x362 [jbd]
 [<f8d72da9>] ocfs2_free_clusters+0x171/0x212 [ocfs2]
 [<f8d3ed5c>] __ocfs2_flush_truncate_log+0x596/0x702 [ocfs2]
 [<f884e463>] journal_stop+0x15d/0x169 [jbd]
 [<f8d483e9>] ocfs2_commit_truncate+0x30f/0x1240 [ocfs2]
 [<f8d4e3c0>] ocfs2_read_blocks+0x45c/0x46d [ocfs2]
 [<f8d6027f>] ocfs2_wipe_inode+0x4f3/0xcc2 [ocfs2]
 [<f8d628e5>] ocfs2_delete_inode+0x409/0x624 [ocfs2]
 [<c062b62b>] mutex_lock+0x1a/0x29
 [<c04ac011>] inotify_inode_is_dead+0x18/0x6c
 [<f8d624dc>] ocfs2_delete_inode+0x0/0x624 [ocfs2]
 [<c04994a7>] generic_delete_inode+0x91/0xf7
 [<c0498d95>] iput+0x60/0x62
 [<c04919aa>] do_unlinkat+0xae/0x119
 [<c04895f3>] vfs_read+0x111/0x14b
 [<c0489ac1>] sys_pread64+0x48/0x5f
 [<c04051da>] syscall_call+0x7/0xb
 ======================
Code: 9c 00 00 00 8b 80 6c 01 00 00 8b b8 c4 00 00 00 8b b0 c0 00 00 00 8b 44 24 34 8b 58 04 8b 08 39 df 75 0e 39 ce 75 0a 8b 4c 24 38 <0f> ab 51 40 19 c0 4a ff 4c 24 3c 83 7c 24 3c ff 75 b7 8b 5c 24
EIP: [<f8d6f442>] ocfs2_free_suballoc_bits+0x41a/0x6d9 [ocfs2] SS:ESP 0068:da67bcec
---[ end trace 2d0b75b98f26e1b8 ]---

Many thanks,
Paulo Rodrigues
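A side note on the numbers in the oops above, offered as a sketch rather than a diagnosis. The bytes starting at the marked `<0f>` in the Code: line are `0f ab 51 40`, which on 32-bit x86 decodes as `bts %edx,0x40(%ecx)`. With a register bit offset, that instruction touches the 32-bit word at base + displacement + 4 * (bit offset / 32). Assuming that decoding, the check below computes the address a NULL base (ECX = 0) would produce from the EDX value in the register dump. The function name `bts_fault_address` is made up for illustration.

```python
def bts_fault_address(base: int, disp: int, bit_offset: int) -> int:
    """Address of the 32-bit word touched by `bts %reg, disp(%base)`.

    With a register bit offset, x86 selects the dword at
    base + disp + 4 * (bit_offset // 32). bit_offset is assumed to be
    non-negative here, which holds for the EDX value in the oops.
    """
    return (base + disp + 4 * (bit_offset // 32)) & 0xFFFFFFFF


# Values copied from the register dump: ECX is the base, EDX the bit number.
ecx = 0x00000000
edx = 0x000047A8

addr = bts_fault_address(base=ecx, disp=0x40, bit_offset=edx)
print(f"{addr:08x}")  # 00000934, the address reported on the BUG: line
```

This only shows that the fault address is consistent with a NULL base pointer in ECX. Which structure that pointer was supposed to reference cannot be told from the oops alone.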
This could suggest an on disk problem. Have you run fsck.ocfs2 recently?

$ fsck.ocfs2 -f /dev/sdX1

Paulo Rodrigues wrote:
> Hello,
>
> I'm on 2.6.24 with OCFS2 1.3.3 and every couple days this comes up in
> dmesg. I have to reboot the cluster machines, there's nothing else I
> can do. Stopping the services or unmounting volumes fails. Perhaps
> this is a well-known bug, but I couldn't find it. On the other hand,
> do you think it could be solved by upgrading to 1.5.0?
Hello Sunil,

fsck says it's clean:

Checking OCFS2 filesystem in /dev/dm-1:
  label:              /var/lib/dovecot/spool
  uuid:               ab 1e ac 82 67 cb 47 58 81 07 2b 00 55 f6 09 36
  number of blocks:   246838717
  bytes per block:    4096
  number of clusters: 246838717
  bytes per cluster:  4096
  max slots:          4

o2fsck_should_replay_journals:564 | slot 0 JOURNAL_DIRTY_FL: 0
o2fsck_should_replay_journals:564 | slot 1 JOURNAL_DIRTY_FL: 0
o2fsck_should_replay_journals:564 | slot 2 JOURNAL_DIRTY_FL: 0
o2fsck_should_replay_journals:564 | slot 3 JOURNAL_DIRTY_FL: 0
/dev/dm-1 is clean. It will be checked after 20 additional mounts.

I expected upgrading to 1.5.0 would fix it... What do you think?

Many thanks,
Paulo

> This could suggest an on disk problem. Have you run fsck.ocfs2 recently?
It does not look like you used the force option, or you ran it with the file system mounted. Unmount the fs on all nodes and run:

$ fsck.ocfs2 -f /dev/dm-1

Paulo Rodrigues wrote:
> /dev/dm-1 is clean. It will be checked after 20 additional mounts.
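The fsck output posted earlier in the thread ends with "/dev/dm-1 is clean. It will be checked after 20 additional mounts.", and that line is what prompted the remark about the force option. A minimal sketch for spotting the same line in saved fsck output follows. The function name `fsck_was_skipped` is hypothetical, and the pattern is based only on the wording quoted in this thread, so other fsck.ocfs2 versions may phrase it differently.

```python
import re

# Pattern taken solely from the line quoted in this thread (assumption:
# other fsck.ocfs2 versions may word the message differently).
_SKIPPED = re.compile(
    r"\bis clean\.\s+It will be checked after \d+ additional mounts\."
)


def fsck_was_skipped(output: str) -> bool:
    """Return True if the output suggests fsck stopped at the 'clean' shortcut."""
    return _SKIPPED.search(output) is not None


sample = "/dev/dm-1 is clean. It will be checked after 20 additional mounts."
print(fsck_was_skipped(sample))
```

If this reports True, the run most likely did not walk the allocator structures, so it says little about on-disk corruption.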
Please can you file a bugzilla and attach this stack trace. Also attach the output of the following:

$ objdump -DSl /lib/modules/`uname -r`/kernel/fs/ocfs2/ocfs2.ko >/tmp/ocfs2.out

Paulo Rodrigues wrote:
> Got the same error again today.
>
> BUG: unable to handle kernel NULL pointer dereference at virtual address 000002d8
> printing eip: f8d8f442 *pde = cecd2067
> Oops: 0002 [#1] SMP
> Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs fuse ipv6 sr_mod cdrom ata_generic pata_acpi bnx2 ata_piix pcspkr iTCO_wdt libata button serio_raw iTCO_vendor_support i5000_edac edac_core dcdbas sg dm_round_robin dm_emc dm_multipath dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc scsi_tgt megaraid_sas sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
>
> Pid: 30807, comm: pop3 Not tainted (2.6.24.7-92.fc8 #1)
> EIP: 0060:[<f8d8f442>] EFLAGS: 00210246 CPU: 4
> EIP is at ocfs2_free_suballoc_bits+0x41a/0x6d9 [ocfs2]
> EAX: cbe6ca00 EBX: 00000000 ECX: 00000000 EDX: 000014c2
> ESI: 0000000b EDI: 00000000 EBP: c8afd000 ESP: dbb1bcec
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process pop3 (pid: 30807, ti=dbb1b000 task=c3472d20 task.ti=dbb1b000)
> Stack: 00000002 dbb1bd2c 00000001 cbe6cd78 f76bb5e8 f10195e8 f884e798 d7709070
>        cbe6cd78 f76bb5e8 c668f000 c668f0c0 f19ee380 cbe6ca00 00000000 00000000
>        f19ee380 000014c2 00000000 e72a0000 0810cec2 f8d92da9 000014c2 0810ba00
> Call Trace:
>  [<f884e798>] do_get_write_access+0x329/0x362 [jbd]
>  [<f8d92da9>] ocfs2_free_clusters+0x171/0x212 [ocfs2]
>  [<f8d5ed5c>] __ocfs2_flush_truncate_log+0x596/0x702 [ocfs2]
>  [<f884e463>] journal_stop+0x15d/0x169 [jbd]
>  [<f8d683e9>] ocfs2_commit_truncate+0x30f/0x1240 [ocfs2]
>  [<f8d6e3c0>] ocfs2_read_blocks+0x45c/0x46d [ocfs2]
>  [<f8d8027f>] ocfs2_wipe_inode+0x4f3/0xcc2 [ocfs2]
>  [<f8d828e5>] ocfs2_delete_inode+0x409/0x624 [ocfs2]
>  [<c062b62b>] mutex_lock+0x1a/0x29
>  [<c04ac011>] inotify_inode_is_dead+0x18/0x6c
>  [<f8d824dc>] ocfs2_delete_inode+0x0/0x624 [ocfs2]
>  [<c04994a7>] generic_delete_inode+0x91/0xf7
>  [<c0498d95>] iput+0x60/0x62
>  [<c04919aa>] do_unlinkat+0xae/0x119
>  [<c04895f3>] vfs_read+0x111/0x14b
>  [<c0489ac1>] sys_pread64+0x48/0x5f
>  [<c04051da>] syscall_call+0x7/0xb
>  ======================
> Code: 9c 00 00 00 8b 80 6c 01 00 00 8b b8 c4 00 00 00 8b b0 c0 00 00 00 8b 44 24 34 8b 58 04 8b 08 39 df 75 0e 39 ce 75 0a 8b 4c 24 38 <0f> ab 51 40 19 c0 4a ff 4c 24 3c 83 7c 24 3c ff 75 b7 8b 5c 24
> EIP: [<f8d8f442>] ocfs2_free_suballoc_bits+0x41a/0x6d9 [ocfs2] SS:ESP 0068:dbb1bcec
> ---[ end trace 4bb65900c779e50c ]---
>
> Am I missing something?
>
> Thanks!
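Once /tmp/ocfs2.out exists, the faulting instruction can be located from the `ocfs2_free_suballoc_bits+0x41a/0x6d9` part of the oops. The sketch below parses that string and adds the offset to the function's start address as found in the objdump listing. It assumes objdump's usual `ADDRESS <symbol>:` label lines. The helper names and the excerpt in `fake_dump` are made up for illustration, and the real start address has to come from the actual file.

```python
import re
from typing import Tuple


def parse_oops_location(line: str) -> Tuple[str, int, int]:
    """Split 'EIP is at sym+0xoff/0xlen [mod]' into (sym, off, len)."""
    m = re.search(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", line)
    if m is None:
        raise ValueError(f"no symbol+offset found in: {line!r}")
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


def faulting_address(objdump_text: str, symbol: str, offset: int) -> int:
    """Find the 'ADDR <symbol>:' label in objdump output and add the offset."""
    label = re.compile(
        r"^([0-9a-f]+) <" + re.escape(symbol) + r">:$", re.MULTILINE
    )
    m = label.search(objdump_text)
    if m is None:
        raise LookupError(f"{symbol} not found in objdump output")
    return int(m.group(1), 16) + offset


sym, off, length = parse_oops_location(
    "EIP is at ocfs2_free_suballoc_bits+0x41a/0x6d9 [ocfs2]"
)

# Made-up excerpt for demonstration; read /tmp/ocfs2.out in real use.
fake_dump = "00031028 <ocfs2_free_suballoc_bits>:\n   31028:\t55\tpush   %ebp\n"
print(sym, hex(off), hex(faulting_address(fake_dump, sym, off)))
```

With the real listing, searching for the resulting hex address followed by a colon should land on the instruction at the `<0f>` marker, with the interleaved source lines from `-S` and `-l` just above it if the module was built with debug info.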