Hi all,

I'm doing some tests on OCFS2 with a 2.6.32-100 kernel (Oracle) or
RHEL6/fedora and I have a hang in lowcomms.c as you can see below.
I have a crash dump if you need more information. I'm lost and I need
help to know where to search to debug this problem.

Thanks

Regards,

Benoit

Kernel 2.6.32-100.0.19.el5 on an x86_64
chili0 login: ------------[ cut here ]------------
kernel BUG at fs/dlm/lowcomms.c:647!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/dlm/14E8093BB71D447EBEE691622CF86B9C/control
CPU 34
Modules linked in: ocfs2(U) ocfs2_nodemanager(U) nfsd(U) exportfs(U) sctp(U) libcrc32c(U) ocfs2_stack_user(U) ocfs2_stackglue(U) dlm(U) configfs(U) acpi_cpufreq(U) freq_table(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U) sunrpc(U) ipv6(U) scsi_dh_emc(U) dm_round_robin(U) dm_multipath(U) iTCO_wdt(U) iTCO_vendor_support(U) mlx4_core(U) i2c_i801(U) igb(U) pcspkr(U) i2c_core(U) ioatdma(U) dca(U) ahci(U) uhci_hcd(U) ehci_hcd(U) lpfc(U) scsi_transport_fc(U) scsi_tgt(U) [last unloaded: ocfs2_nodemanager]
Pid: 27062, comm: dlm_recv/34 Not tainted 2.6.32-100.0.19.el5 #1 bullx super-node
RIP: 0010:[<ffffffffa02406c3>] [<ffffffffa02406c3>] receive_from_sock+0x554/0x6ed [dlm]
RSP: 0018:ffff880c77c6bc60 EFLAGS: 00010246
RAX: 0000000000000030 RBX: ffff8810774b8d30 RCX: ffff88087c4548f8
RDX: 0000000000000030 RSI: ffff880876dce000 RDI: ffffffff81398045
RBP: ffff880c77c6be50 R08: ffff000000000000 R09: ffff880c77c6b900
R10: ffff880c77c6b8f0 R11: 0000000000000030 R12: 0000000000000030
R13: ffff8810774b8d20 R14: ffff880c7caa00c0 R15: ffffffffa023ecca
FS: 0000000000000000(0000) GS:ffff88048e600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000fcb078 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dlm_recv/34 (pid: 27062, threadinfo ffff880c77c6a000, task ffff880c7caa00c0)
Stack:
ffff880c77c6bc70 ffffffff8122fa24 ffff880c77c6bc90 ffffffff8122faca
<0> ffff88048e414ec0 0000100000000002 0000000000000000 ffffffff00000000
<0> 0000000000000000 0000000000000000 ffffffffa024bb20 0000000000000030
Call Trace:
[<ffffffff8122fa24>] ? cpumask_next+0x19/0x1b
[<ffffffff8122faca>] ? cpumask_next_and+0x20/0x32
[<ffffffffa023ecca>] ? process_recv_sockets+0x0/0x28 [dlm]
[<ffffffffa023ecea>] process_recv_sockets+0x20/0x28 [dlm]
[<ffffffff81071802>] worker_thread+0x14d/0x1ed
[<ffffffff81075a7c>] ? autoremove_wake_function+0x0/0x3d
[<ffffffff810716b5>] ? worker_thread+0x0/0x1ed
[<ffffffff810756d3>] kthread+0x6e/0x76
[<ffffffff81012dea>] child_rip+0xa/0x20
[<ffffffff81075665>] ? kthread+0x0/0x76
[<ffffffff81012de0>] ? child_rip+0x0/0x20
Code: 29 e7 ff ff e9 2d 01 00 00 41 8b 74 24 10 0f b7 d0 48 c7 c7 d1 8c 24 a0 31 c0 e8 ab 71 e1 e0 e9 12 01 00 00 41 83 7d 08 00 75 04 <0f> 0b eb fe 4d 8d 7d 68 49 be 00 00 00 00 00 16 00 00 41 8b 55
RIP [<ffffffffa02406c3>] receive_from_sock+0x554/0x6ed [dlm]
RSP <ffff880c77c6bc60>
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.32-100.0.19.el5 (mockbuild@ca-build9.us.oracle.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Fri Sep 17 17:51:41 EDT 2010
Command line: ro root=/dev/mapper/vg_chili0-lv_root rd_LVM_LV=vg_chili0/lv_root rd_LVM_LV=vg_chili0/lv_swap rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=fr-pc cgroup_disable=memory selinux=0 pcie_aspm=off nmi_watchdog=0 console=ttyS1,115200 maxcpus=1 reset_devices memmap=exactmap memmap=640K@0K memmap=195948K@33408K elfcorehdr=229356K memmap=308K#1993940K memmap=16K#2077704K memmap=4K#2077748K memmap=4K#2077764K memmap=44K#2077768K memmap=72K#2077812K memmap=4K#2077884K memmap=4K#2077888K memmap=4K#2077892K memmap=4K#2078024K memmap=2716K#2078052K memmap=1024K#69204860K memmap=128K#69205884K
KERNEL supported cpus:
  Intel GenuineIntel
  AMD AuthenticAMD
  Centaur CentaurHauls
BIOS-provided physical RAM map:

From the dump :
GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-100.0.19.el5/vmlinux
    DUMPFILE: /var/var/crash/127.0.0.1-2010-10-18-16:42:07/vmcore [PARTIAL DUMP]
        CPUS: 64
        DATE: Mon Oct 18 16:41:48 2010
      UPTIME: 00:15:00
LOAD AVERAGE: 1.06, 1.22, 1.65
       TASKS: 1594
    NODENAME: chili0
     RELEASE: 2.6.32-100.0.19.el5
     VERSION: #1 SMP Fri Sep 17 17:51:41 EDT 2010
     MACHINE: x86_64 (1999 Mhz)
      MEMORY: 64 GB
       PANIC: "kernel BUG at fs/dlm/lowcomms.c:647!"
         PID: 27062
     COMMAND: "dlm_recv/34"
        TASK: ffff880c7caa00c0 [THREAD_INFO: ffff880c77c6a000]
         CPU: 34
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 27062 TASK: ffff880c7caa00c0 CPU: 34 COMMAND: "dlm_recv/34"
 #0 [ffff880c77c6b910] machine_kexec at ffffffff8102cc9b
 #1 [ffff880c77c6b990] crash_kexec at ffffffff810964d4
 #2 [ffff880c77c6ba60] oops_end at ffffffff81439bd9
 #3 [ffff880c77c6ba90] die at ffffffff81015639
 #4 [ffff880c77c6bac0] do_trap at ffffffff8143952c
 #5 [ffff880c77c6bb10] do_invalid_op at ffffffff81013902
 #6 [ffff880c77c6bbb0] invalid_op at ffffffff81012b7b
    [exception RIP: receive_from_sock+1364]
    RIP: ffffffffa02406c3 RSP: ffff880c77c6bc60 RFLAGS: 00010246
    RAX: 0000000000000030 RBX: ffff8810774b8d30 RCX: ffff88087c4548f8
    RDX: 0000000000000030 RSI: ffff880876dce000 RDI: ffffffff81398045
    RBP: ffff880c77c6be50 R8: ffff000000000000 R9: ffff880c77c6b900
    R10: ffff880c77c6b8f0 R11: 0000000000000030 R12: 0000000000000030
    R13: ffff8810774b8d20 R14: ffff880c7caa00c0 R15: ffffffffa023ecca
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #7 [ffff880c77c6be58] process_recv_sockets at ffffffffa023ecea
 #8 [ffff880c77c6be78] worker_thread at ffffffff81071802
 #9 [ffff880c77c6bee8] kthread at ffffffff810756d3
#10 [ffff880c77c6bf48] kernel_thread at ffffffff81012dea
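
For readers following the oops, here is a small Python sketch (not part of the original report) that cross-checks two details of the trace: the hexadecimal offset in the RIP line against the decimal offset printed by crash (`receive_from_sock+1364`), and the opcode bytes at the faulting address. The helper names `parse_rip` and `trapping_bytes` are made up for this illustration. The only outside fact it relies on is that `0f 0b` is the x86 `ud2` instruction, which the kernel uses to implement BUG().

```python
import re

# Two lines copied verbatim from the oops above.
RIP_LINE = ("RIP: 0010:[<ffffffffa02406c3>] [<ffffffffa02406c3>] "
            "receive_from_sock+0x554/0x6ed [dlm]")
CODE_LINE = ("Code: 29 e7 ff ff e9 2d 01 00 00 41 8b 74 24 10 0f b7 d0 48 c7 c7 "
             "d1 8c 24 a0 31 c0 e8 ab 71 e1 e0 e9 12 01 00 00 41 83 7d 08 00 75 "
             "04 <0f> 0b eb fe 4d 8d 7d 68 49 be 00 00 00 00 00 16 00 00 41 8b 55")


def parse_rip(line):
    """Return (address, symbol, offset, size, module) from an oops RIP line."""
    m = re.search(
        r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?",
        line,
    )
    if m is None:
        raise ValueError("not a RIP line")
    addr, sym, off, size, mod = m.groups()
    return int(addr, 16), sym, int(off, 16), int(size, 16), mod


def trapping_bytes(line, count=2):
    """Return `count` opcode bytes starting at the byte marked <xx> in a Code: line."""
    tokens = line.split()[1:]  # drop the leading "Code:" label
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return bytes(int(t.strip("<>"), 16) for t in tokens[start:start + count])


addr, sym, off, size, mod = parse_rip(RIP_LINE)
func_start = addr - off  # where the function begins in the loaded module
print(f"{sym} [{mod}] starts at {func_start:#x}; trap at byte {off} of {size}")
print("faulting instruction is ud2:", trapping_bytes(CODE_LINE) == b"\x0f\x0b")
```

The offset 0x554 is 1364 in decimal, which matches the crash output, and the marked bytes are `0f 0b`, which is what one would expect from a BUG()/BUG_ON() at lowcomms.c:647.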
On Wed, Oct 20, 2010 at 04:15:15PM +0200, Welterlen Benoit wrote:
> I'm doing some tests on OCFS2 with a 2.6.32-100 kernel (Oracle) or
> RHEL6/fedora and I have a hang in lowcomms.c as you can see below.
> I have a crash dump if you need more information. I'm lost and I need
> help to know where to search to debug this problem.

Whee! Userspace stack on the 2.6.32-100 kernel ;-) We haven't
actually tested this configuration yet; it's not supported officially.
However, it "should" work, just as the userspace stack stuff has worked
for a while. I've forwarded this report on to the fs/dlm maintainer for
pointers to see if we can get you any help.

Joel

> Thanks
>
> Regards,
>
> Benoit
>
> [...]

--

"Every new beginning comes from some other beginning's end."

Joel Becker
Senior Development Manager
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
Hi,

Thanks for your answer.
sctp is automatically selected to be used ("dlm: Using SCTP for communications") and I have no option to modify that (/sys/kernel/config/dlm/cluster/protocol is set to 1 and is only available once the service is started; can I modify it at that point?)

But I can give you some details about the problem.
I'm doing HA tests between 2 nodes. The problem occurs when I stop the service on one node: when it restarts, the other system hits the BUG:

M1 : ocfs-pcmk starts
M2 : ocfs-pcmk starts
M2 : ocfs-pcmk stops
Ok till now, but
M2 : ocfs-pcmk restarts : M1 bugs !!

I added traces in the code. From what I understand:
The first connection initializes a dlm connection with a node id and an address.
The second connection tries to recover the first structure. It has the nodeid and tries to find the address, unsuccessfully; then it tries to find the node id from the address, also without success.
But in the data from the address, we can find the node id:

02 00 00 00 0b 01 00 02 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

200010b=node id : 33554699

I'm new to this stuff and it's hard to understand the role of each component.

Regards,

Benoit

details :

Starting DLM/OCFS :

DLM (built Oct 21 2010 11:14:24) installed
ocfs2: Registered cluster interface user
OCFS2 Node Manager 1.6.3
OCFS2 1.6.3
dlm: Using SCTP for communications
SCTP: Hash tables configured (established 65536 bind 65536)
dlm: 77410678764B4782BDAE3E888E0C8C4D: joining the lockspace group...
dlm: 77410678764B4782BDAE3E888E0C8C4D: group event done 0 0
dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 1
dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 16777483
dlm: 77410678764B4782BDAE3E888E0C8C4D: total members 1 error 0
dlm: 77410678764B4782BDAE3E888E0C8C4D: dlm_recover_directory
dlm: 77410678764B4782BDAE3E888E0C8C4D: dlm_recover_directory 0 entries
dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 1 done: 0 ms
dlm: 77410678764B4782BDAE3E888E0C8C4D: join complete
ocfs2: Mounting device (253,6) on (node 1677748, slot 0) with ordered data mode.

> Here, OCFS is available on both nodes

dlm: closing connection to node 33554699

> Here, OCFS is down on M2, then restart :

dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 3
dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 33554699
cm->sockaddr_storage ffff8804796268a0
dlm_nodeid_to_addr -EEXIST;
dlm: no address for nodeid 33554699
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
sctp_packet_config: packet:ffff880c8e6037d0 vtag:0x2b186924
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
process_sctp_notification SCTP_RESTART
get_comm nodeid : 0, sockaddr_storage : ffff88047a357c54
get_comm cm->addr_count : 0000000000000001, cm->addr[0] : ffff880479626740
addr : ffff88047a357c54
dlm: reject connect from unknown addr
02 00 00 00 0b 01 00 02 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
in __nodeid2con , nodeid : 0
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
------------[ cut here ]------------
kernel BUG at fs/dlm/lowcomms.c:661!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/dlm/77410678764B4782BDAE3E888E0C8C4D/control
CPU 35
Modules linked in: sctp(U) libcrc32c(U) ocfs2(U) ocfs2_nodemanager(U) ocfs2_stack_user(U) ocfs2_stackglue(U) dlm(U) configfs(U) acpi_cpufreq(U) freq_table(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U) ext3(U) jbd(U) sunrpc(U) ipv6(U) dm_mirror(U) dm_region_hash(U) dm_log(U) scsi_dh_emc(U) dm_round_robin(U) dm_multipath(U) ioatdma(U) i7core_edac(U) igb(U) edac_core(U) sg(U) i2c_i801(U) i2c_core(U) iTCO_wdt(U) dca(U) iTCO_vendor_support(U) ext4(U) mbcache(U) jbd2(U) usbhid(U) hid(U) sd_mod(U) crc_t10dif(U) lpfc(U) ahci(U) scsi_transport_fc(U) ehci_hcd(U) uhci_hcd(U) scsi_tgt(U) dm_mod(U) [last unloaded: ipmi_msghandler]
Pid: 4306, comm: dlm_recv/35 Not tainted 2.6.32-30.el6.Bull.14.x86_64 #2 bullx super-node
RIP: 0010:[<ffffffffa039edeb>] [<ffffffffa039edeb>] receive_from_sock+0x38b/0x430 [dlm]
RSP: 0018:ffff88047a357d10 EFLAGS: 00010246
RAX: 0000000000000095 RBX: ffff88087c5aad20 RCX: 0000000000001b62
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246
RBP: ffff88047a357e10 R08: 00000000ffffffff R09: 0000000000000000
R10: ffff881079423480 R11: 0000000000000000 R12: ffff88047a357da0
R13: 0000000000000030 R14: ffff88087c5aad30 R15: ffff88047a357d40
FS: 0000000000000000(0000) GS:ffff880c8e600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000359bea6df0 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dlm_recv/35 (pid: 4306, threadinfo ffff88047a356000, task ffff88047aeaf2e0)
Stack:
ffff88047a357db0 0000000000000246 ffff880c7da288c0 0000000000000246
<0> ffff88047a357d70 0000100081049472 0000000000000000 0000000000000000
<0> ffff88047a357d80 0000000000000002 ffff88047a357dd0 0000000000000000
Call Trace:
[<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]
[<ffffffffa039dda6>] process_recv_sockets+0x36/0x50 [dlm]
[<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]
[<ffffffff8107ce9d>] worker_thread+0x16d/0x290
[<ffffffff81082430>] ? autoremove_wake_function+0x0/0x40
[<ffffffff8107cd30>] ? worker_thread+0x0/0x290
[<ffffffff810820d6>] kthread+0x96/0xa0
[<ffffffff8100d1aa>] child_rip+0xa/0x20
[<ffffffff81082040>] ? kthread+0x0/0xa0
[<ffffffff8100d1a0>] ? child_rip+0x0/0x20
Code: f1 4c 8b 8d 70 ff ff ff 48 c1 e6 0c 4c 01 d6 e8 f1 33 0b e1 4c 89 f7 e8 14 4b 0b e1 31 f6 48 89 df e8 6a f4 ff ff e9 cd fd ff ff <0f> 0b 0f 1f 00 eb fb 48 c7 c7 e8 9a 3a a0 31 c0 e8 c5 33 0b e1
RIP [<ffffffffa039edeb>] receive_from_sock+0x38b/0x430 [dlm]
RSP <ffff88047a357d10>
crash>

On 21/10/2010 17:13, David Teigland wrote:
>> kernel BUG at fs/dlm/lowcomms.c:647!
>
> That looks like an interesting one, I haven't seen it before.
> First ensure dlm is not configured to use sctp (that code is
> not widely tested.) Other than that, if you'd like to start
> debugging this before I get around to it, replace the BUG_ON
> with some printk's and return error. The conn with nodeid 0 is
> the listening socket, for which tcp_accept_from_sock() should
> be called rather than receive_from_sock().
>
>> [...]
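
The node id that Benoit reads out of the address dump can be reproduced with a few lines of Python. This is a sketch under two assumptions that the thread does not confirm: the dump is an x86_64 `struct sockaddr_in` stored in a 128-byte `sockaddr_storage` (family in host byte order, port and address in network byte order), and the cluster stack derives the node id from the IPv4 address bytes read as a little-endian 32-bit integer. The function names are made up for the example.

```python
import ipaddress

# First 8 bytes of the 128-byte address dump from the trace; the rest is zero.
DUMP = "02 00 00 00 0b 01 00 02"


def decode_sockaddr_in(hex_dump):
    """Decode the leading bytes of a sockaddr_in dump (assumed layout, see above)."""
    raw = bytes.fromhex(hex_dump)
    addr = raw[4:8]  # sin_addr, network byte order
    return {
        "family": int.from_bytes(raw[0:2], "little"),  # 2 is AF_INET on Linux
        "port": int.from_bytes(raw[2:4], "big"),       # sin_port, network order
        "ip": str(ipaddress.IPv4Address(addr)),
        # Assumption: node id = the same four bytes read as little-endian u32.
        "nodeid": int.from_bytes(addr, "little"),
    }


def nodeid_to_ip(nodeid):
    """Inverse mapping under the same assumption."""
    return str(ipaddress.IPv4Address(nodeid.to_bytes(4, "little")))


info = decode_sockaddr_in(DUMP)
print(info)
print(hex(info["nodeid"]))     # 0x200010b, the value quoted in the message
print(nodeid_to_ip(16777483))  # the other member seen in the dlm log
```

Under these assumptions the rejected address decodes to 11.1.0.2 with node id 33554699 (0x200010b), and the member 16777483 added earlier maps to 11.1.0.1. So the address in the "reject connect from unknown addr" dump does appear to belong to the node that was just re-added.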
Hi,

Thanks for your answer.
I tried to force TCP instead of SCTP.
There is no crash anymore, but OCFS is unavailable when I stop one node.
The OCFS directory is not accessible:

ls /BCM/conf

[root@chili0 ~]# cat /proc/23921/stack
[<ffffffffa02c5d53>] ocfs2_wait_for_recovery+0x77/0x8f [ocfs2]
[<ffffffffa02b08a8>] ocfs2_inode_lock_full_nested+0x160/0xb8d [ocfs2]
[<ffffffffa02c35e2>] ocfs2_inode_revalidate+0x163/0x25c [ocfs2]
[<ffffffffa02bd9f4>] ocfs2_getattr+0x8b/0x19c [ocfs2]
[<ffffffff8111c30f>] vfs_getattr+0x4c/0x69
[<ffffffff8111c37c>] vfs_fstatat+0x50/0x67
[<ffffffff8111c479>] vfs_stat+0x1b/0x1d
[<ffffffff8111c49a>] sys_newstat+0x1f/0x39
[<ffffffff81011db2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Any access to the filesystem hangs.

Regards,

Benoit

On 21/10/2010 19:13, David Teigland wrote:
> On Thu, Oct 21, 2010 at 06:11:36PM +0200, Welterlen Benoit wrote:
>> Hi,
>>
>> Thanks for your answer.
>> sctp is automatically selected to be used ("dlm: Using SCTP for
>> communications") and I have no option to modify that
>> (/sys/kernel/config/dlm/cluster/protocol is set to 1 and is only
>> available once the service is started, I can modify it at this time
>> ?)
>
> Yes, there's a dlm_controld command line option (or cluster.conf which I
> don't suppose you're using with pacemaker.) You can set that to get TCP,
> but that obviates corosync redundant ring also (which is what is used to
> auto-select SCTP).
>
>> dlm: closing connection to node 33554699
>>
>>> Here, OCFS is down on M2, then restart :
>>
>> dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 3
>> dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 33554699
>> cm->sockaddr_storage ffff8804796268a0
>> dlm_nodeid_to_addr -EEXIST;
>> dlm: no address for nodeid 33554699
>
> I suspect the pacemaker version of dlm_controld is doing an unusual
> sequence of node addition/removal. The stuff above eventually leads to
> the oops below:
>
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> sctp_packet_config: packet:ffff880c8e6037d0 vtag:0x2b186924
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> process_sctp_notification SCTP_RESTART
>> get_comm nodeid : 0, sockaddr_storage : ffff88047a357c54
>> get_comm cm->addr_count : 0000000000000001, cm->addr[0] : ffff880479626740
>> addr : ffff88047a357c54
>> dlm: reject connect from unknown addr
>> [...]
>> in __nodeid2con , nodeid : 0
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> ------------[ cut here ]------------
>> kernel BUG at fs/dlm/lowcomms.c:661!
>> [...]
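
To confirm which transport the kernel DLM ended up with after forcing TCP, the configfs value mentioned earlier in the thread can be read back. The sketch below is a hedged illustration, not something posted in the thread: it assumes the usual kernel DLM convention that `protocol` 0 means TCP and 1 means SCTP (the thread itself only shows 1 together with "Using SCTP for communications"), and the helper names are invented for the example.

```python
from pathlib import Path

# Path quoted in the thread; it only exists once the cluster service has
# configured the kernel DLM.
PROTOCOL_FILE = Path("/sys/kernel/config/dlm/cluster/protocol")

# Assumed mapping: 0 = TCP, 1 = SCTP.
PROTOCOL_NAMES = {0: "TCP", 1: "SCTP"}


def protocol_name(value):
    """Map the raw configfs value (int or numeric string) to a transport name."""
    try:
        number = int(value)
    except (TypeError, ValueError):
        return f"unknown ({value!r})"
    return PROTOCOL_NAMES.get(number, f"unknown ({number})")


def current_protocol(path=PROTOCOL_FILE):
    """Return the transport name, or None if the configfs entry is absent."""
    try:
        return protocol_name(path.read_text().strip())
    except OSError:
        return None


print("DLM lowcomms protocol:", current_protocol())
```

On a node where the service is not running this prints `None`, which matches Benoit's remark that the file is only available once the service has started.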