Don't know if this is useful, but we had a crash we had never seen before.

Setup:

2x SuSE SLES 10 SP2 (it's old, I know)

Problem description:

1. We had to reboot the ocfs2 master node.
2. During the reboot, the umount coredumped, leaving the filesystem mounted or maybe still heartbeating (?);
3. The slave node detected that the master was dead;
4. When the slave tried to assume the master role, it rebooted (no crash, no warning, nothing, just like pressing the reset button);
5. The master hung because it could not unmount the ocfs2 filesystem.

We could not capture many messages from the nodes, just these:

master node umount crash (from syslog):

Dec 2 14:22:08 soap02 kernel: (19573,5):dlm_empty_lockres:2783 ERROR: lockres M00000000000000164ad60700000000 still has local locks!
Dec 2 14:22:08 soap02 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Dec 2 14:22:08 soap02 kernel: Kernel BUG at fs/ocfs2/dlm/dlmmaster.c:2784
Dec 2 14:22:08 soap02 kernel: invalid opcode: 0000 [1] SMP
Dec 2 14:22:08 soap02 kernel: last sysfs file: /devices/pci0000:00/0000:00:1c.0/0000:04:00.0/0000:05:00.0/power/state
Dec 2 14:22:08 soap02 kernel: CPU 5
Dec 2 14:22:08 soap02 kernel: Modules linked in: af_packet joydev st ocfs2 jbd ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs nfsd exportfs nfs lockd nfs_acl sunrpc ipv6 button battery ac binfmt_misc netconsole xt_comment xt_tcpudp xt_state iptable_filter iptable_mangle iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables apparmor loop sr_mod usbhid usb_storage ide_cd uhci_hcd ehci_hcd usbcore shpchp hw_random cdrom bnx2 pci_hotplug reiserfs ata_piix ahci libata dm_snapshot qla2xxx firmware_class qla2xxx_conf intermodule edd dm_mod fan thermal processor sg megaraid_sas piix sd_mod scsi_mod ide_disk ide_core
Dec 2 14:22:08 soap02 kernel: Pid: 19573, comm: umount Tainted: G U 2.6.16.60-0.21-smp #1
Dec 2 14:22:08 soap02 kernel: RIP: 0010:[<ffffffff885a9d6d>] <ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255}
Dec 2 14:22:08 soap02 kernel: RSP: 0018:ffff810356f65c88 EFLAGS: 00010292
Dec 2 14:22:08 soap02 kernel: RAX: 000000000000006a RBX: ffff8101f28f7880 RCX: 0000000000000292
Dec 2 14:22:08 soap02 kernel: RDX: ffffffff80359968 RSI: 0000000000000296 RDI: ffffffff80359960
Dec 2 14:22:08 soap02 kernel: RBP: ffff81025eec7e00 R08: ffffffff80359968 R09: ffff810423f77a80
Dec 2 14:22:08 soap02 kernel: R10: ffff810001071600 R11: 0000000000000070 R12: 0000000000000184
Dec 2 14:22:08 soap02 kernel: R13: ffff8104257a5400 R14: 0000000000000184 R15: ffff8101f28f7880
Dec 2 14:22:08 soap02 kernel: FS: 00002ab1a83db6d0(0000) GS:ffff810430654840(0000) knlGS:0000000000000000
Dec 2 14:22:08 soap02 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 2 14:22:08 soap02 kernel: CR2: 00002aaaaac16000 CR3: 00000001a2c2f000 CR4: 00000000000006e0
Dec 2 14:22:08 soap02 kernel: Process umount (pid: 19573, threadinfo ffff810356f64000, task ffff8102e78997e0)
Dec 2 14:22:08 soap02 kernel: Stack: 00000000ffffffd9 0000000000000000 01ff810400000001 ffff8102e78997e0
Dec 2 14:22:08 soap02 kernel: 0100000000000000 0000000100000003 0000000000000000 ffff8102e78997e0
Dec 2 14:22:08 soap02 kernel: ffffffff80147f3e ffff810356f65cd0
Dec 2 14:22:08 soap02 kernel: Call Trace: <ffffffff80147f3e>{autoremove_wake_function+0}
Dec 2 14:22:08 soap02 kernel: <ffffffff885a30e1>{:ocfs2_dlm:dlm_unregister_domain+479}
Dec 2 14:22:08 soap02 kernel: <ffffffff8012c668>{default_wake_function+0} <ffffffff8860bb5e>{:ocfs2:ocfs2_dlm_shutdown+190}
Dec 2 14:22:08 soap02 kernel: <ffffffff8862fe07>{:ocfs2:ocfs2_dismount_volume+559}
Dec 2 14:22:08 soap02 kernel: <ffffffff886302f7>{:ocfs2:ocfs2_put_super+104} <ffffffff8018bc99>{generic_shutdown_super+148}
Dec 2 14:22:08 soap02 kernel: <ffffffff8018bd6a>{kill_block_super+38} <ffffffff8018be40>{deactivate_super+114}
Dec 2 14:22:08 soap02 kernel: <ffffffff801a078e>{sys_umount+623} <ffffffff8018e4e1>{sys_newstat+25}
Dec 2 14:22:08 soap02 kernel: <ffffffff8010ae42>{system_call+126}
Dec 2 14:22:08 soap02 kernel:
Dec 2 14:22:08 soap02 kernel: Code: 0f 0b 68 95 d0 5b 88 c2 e0 0a 48 f7 05 9e 2c fd ff 00 09 00
Dec 2 14:22:08 soap02 kernel: RIP <ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255} RSP <ffff810356f65c88>
Dec 2 14:22:08 soap02 kernel: Badness in do_exit at kernel/exit.c:837
Dec 2 14:22:08 soap02 kernel:
Dec 2 14:22:08 soap02 kernel: Call Trace: <ffffffff80137000>{do_exit+80} <ffffffff802ea8b6>{_spin_unlock_irqrestore+8}
Dec 2 14:22:08 soap02 kernel: <ffffffff8010c820>{kernel_math_error+0} <ffffffff8010cdb5>{do_invalid_op+163}
Dec 2 14:22:09 soap02 kernel: <ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255}
Dec 2 14:22:09 soap02 kernel: <ffffffff8012c10c>{activate_task+204} <ffffffff8012c657>{try_to_wake_up+1106}
Dec 2 14:22:09 soap02 kernel: <ffffffff801349b8>{printk+78} <ffffffff8010bd19>{error_exit+0}
Dec 2 14:22:09 soap02 kernel: <ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255}
Dec 2 14:22:09 soap02 kernel: <ffffffff80147f3e>{autoremove_wake_function+0} <ffffffff885a30e1>{:ocfs2_dlm:dlm_unregister_domain+479}
Dec 2 14:22:09 soap02 kernel: <ffffffff8012c668>{default_wake_function+0} <ffffffff8860bb5e>{:ocfs2:ocfs2_dlm_shutdown+190}
Dec 2 14:22:09 soap02 kernel: <ffffffff8862fe07>{:ocfs2:ocfs2_dismount_volume+559}
Dec 2 14:22:09 soap02 kernel: <ffffffff886302f7>{:ocfs2:ocfs2_put_super+104} <ffffffff8018bc99>{generic_shutdown_super+148}
Dec 2 14:22:09 soap02 kernel: <ffffffff8018bd6a>{kill_block_super+38} <ffffffff8018be40>{deactivate_super+114}
Dec 2 14:22:09 soap02 kernel: <ffffffff801a078e>{sys_umount+623} <ffffffff8018e4e1>{sys_newstat+25}
Dec 2 14:22:09 soap02 kernel: <ffffffff8010ae42>{system_call+126}

slave node detecting master down and rebooted:

Dec 2 14:23:14 soap01 kernel: o2net: connection to node soap02 (num 0) at 192.168.0.10:7777 has been idle for 60.0 seconds, shutting it down.
Dec 2 14:23:14 soap01 kernel: (0,0):o2net_idle_timer:1422 here are some times that might help debug the situation: (tmr 1259770934.129785 now 1259770994.132629 dr 1259770934.129779 adv 1259770934.129789:1259770934.129789 func (300d6acb:505) 1259770933.205787:1259770933.205792)
Dec 2 14:23:14 soap01 kernel: o2net: no longer connected to node soap02 (num 0) at 192.168.0.10:7777
Dec 2 14:23:14 soap01 kernel: (7035,1):dlm_do_master_request:1409 ERROR: link to 0 went down!
Dec 2 14:23:14 soap01 kernel: (7039,0):dlm_do_master_request:1409 ERROR: link to 0 went down!
Dec 2 14:23:14 soap01 kernel: (7039,0):dlm_get_lock_resource:986 ERROR: status = -112
Dec 2 14:23:14 soap01 kernel: (7035,1):dlm_get_lock_resource:986 ERROR: status = -112
Dec 2 14:23:14 soap01 kernel: (7043,0):dlm_do_master_request:1409 ERROR: link to 0 went down!
Dec 2 14:23:14 soap01 kernel: (7043,0):dlm_get_lock_resource:986 ERROR: status = -112
Dec 2 14:23:14 soap01 kernel: (7047,0):dlm_send_remote_convert_request:395 ERROR: status = -112
Dec 2 14:23:14 soap01 kernel: (7047,0):dlm_wait_for_node_death:370 F59B45831EEA41F384BADE6C4B7A932B: waiting 5000ms for notification of death of node 0
Dec 2 14:24:14 soap01 kernel: (5283,0):o2net_connect_expired:1583 ERROR: no connection established with node 0 after 60.0 seconds, giving up and returning errors.
Dec 2 14:24:14 soap01 kernel: (7047,0):dlm_send_remote_convert_request:395 ERROR: status = -107
Dec 2 14:24:14 soap01 kernel: (7047,0):dlm_wait_for_node_death:370 F59B45831EEA41F384BADE6C4B7A932B: waiting 5000ms for notification of death of node 0

Hope this information is useful for something.

Regards,
--
Sérgio Surkamp | Network Manager
sergio at gruposinternet.com.br

*Grupos Internet S.A.*
R. Lauro Linhares, 2123 Torre B - Sala 201
Trindade - Florianópolis - SC
+55 48 3234-4109
http://www.gruposinternet.com.br

Ping Novell.

http://oss.oracle.com/projects/ocfs2/news/article_20.html
* Oracle# 7373369 OOPS on umount saying lockres has local locks (oss bz# 914)

Sérgio Surkamp wrote:
> Dunno if it is useful, but we had a never seen crash.
> [...]
> master node umount crash (from syslog):
> Dec 2 14:22:08 soap02 kernel: (19573,5):dlm_empty_lockres:2783 ERROR:
> lockres M00000000000000164ad60700000000 still has local locks!
> Dec 2 14:22:08 soap02 kernel: Kernel BUG at
> fs/ocfs2/dlm/dlmmaster.c:2784
> [...]
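
For anyone reading the soap01 log in this thread, here is a small Python sketch that decodes the two "status = -N" values and checks the idle interval reported by o2net_idle_timer. It is only an illustration, not part of any ocfs2 tooling. The helper names (decode_status, idle_seconds) are made up for this note, the errno names are the standard Linux values for 107 and 112, and reading "tmr" as the time the idle timer was last armed is an assumption.

```python
import re

# Linux errno names for the two status values seen in the soap01 log.
# Hardcoded so the result does not depend on the platform running this.
LINUX_ERRNO = {107: "ENOTCONN", 112: "EHOSTDOWN"}

# Message parts copied from the soap01 syslog excerpt in this thread.
SAMPLE = [
    "(0,0):o2net_idle_timer:1422 here are some times that might help debug "
    "the situation: (tmr 1259770934.129785 now 1259770994.132629 "
    "dr 1259770934.129779 adv 1259770934.129789:1259770934.129789 "
    "func (300d6acb:505) 1259770933.205787:1259770933.205792)",
    "(7039,0):dlm_get_lock_resource:986 ERROR: status = -112",
    "(7047,0):dlm_send_remote_convert_request:395 ERROR: status = -107",
]


def decode_status(line):
    """Return the errno name for a 'status = -N' message, or None."""
    match = re.search(r"status = -(\d+)", line)
    if match is None:
        return None
    return LINUX_ERRNO.get(int(match.group(1)), "unknown")


def idle_seconds(line):
    """Return 'now' minus 'tmr' from an o2net_idle_timer message, or None.

    Assumption: 'tmr' is when the idle timer was last armed, so the
    difference is how long the connection had been idle.
    """
    match = re.search(r"tmr (\d+\.\d+) now (\d+\.\d+)", line)
    if match is None:
        return None
    return float(match.group(2)) - float(match.group(1))


if __name__ == "__main__":
    for line in SAMPLE:
        status = decode_status(line)
        idle = idle_seconds(line)
        if status is not None:
            print("status decodes to", status)
        if idle is not None:
            print("idle interval: %.3f s" % idle)
```

The interval works out to roughly 60.0 seconds, which agrees with the "has been idle for 60.0 seconds" message, and the status values change from EHOSTDOWN (-112) to ENOTCONN (-107) once the reconnect attempt expires at 14:24:14.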