thr3ads.net - Ocfs2 users - [Ocfs2-users] kernel BUG at fs/dlm/lowcomms.c:647! [Oct 2010]

Welterlen Benoit

2010-Oct-20 14:15 UTC

[Ocfs2-users] kernel BUG at fs/dlm/lowcomms.c:647!

Hi all,


I'm testing OCFS2 on a 2.6.32-100 kernel (Oracle), and also on 
RHEL6/Fedora, and I hit a kernel BUG in lowcomms.c, as you can see below.
I have a crash dump if you need more information. I'm lost and need 
help figuring out where to look to debug this problem.

Thanks

Regards,

Benoit



Kernel 2.6.32-100.0.19.el5 on an x86_64
chili0 login: ------------[ cut here ]------------
kernel BUG at fs/dlm/lowcomms.c:647!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/dlm/14E8093BB71D447EBEE691622CF86B9C/control
CPU 34
Modules linked in: ocfs2(U) ocfs2_nodemanager(U) nfsd(U) exportfs(U) 
sctp(U) libcrc32c(U) ocfs2_stack_user(U) ocfs2_stackglue(U) dlm(U) 
configfs(U) acpi_cpufreq(U) freq_table(U) ipmi_devintf(U) ipmi_si(U) 
ipmi_msghandler(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U) 
sunrpc(U) ipv6(U) scsi_dh_emc(U) dm_round_robin(U) dm_multipath(U) 
iTCO_wdt(U) iTCO_vendor_support(U) mlx4_core(U) i2c_i801(U) igb(U) 
pcspkr(U) i2c_core(U) ioatdma(U) dca(U) ahci(U) uhci_hcd(U) ehci_hcd(U) 
lpfc(U) scsi_transport_fc(U) scsi_tgt(U) [last unloaded: ocfs2_nodemanager]
Pid: 27062, comm: dlm_recv/34 Not tainted 2.6.32-100.0.19.el5 #1 bullx 
super-node
RIP: 0010:[<ffffffffa02406c3>]  [<ffffffffa02406c3>] 
receive_from_sock+0x554/0x6ed [dlm]
RSP: 0018:ffff880c77c6bc60  EFLAGS: 00010246
RAX: 0000000000000030 RBX: ffff8810774b8d30 RCX: ffff88087c4548f8
RDX: 0000000000000030 RSI: ffff880876dce000 RDI: ffffffff81398045
RBP: ffff880c77c6be50 R08: ffff000000000000 R09: ffff880c77c6b900
R10: ffff880c77c6b8f0 R11: 0000000000000030 R12: 0000000000000030
R13: ffff8810774b8d20 R14: ffff880c7caa00c0 R15: ffffffffa023ecca
FS:  0000000000000000(0000) GS:ffff88048e600000(0000) 
knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000fcb078 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dlm_recv/34 (pid: 27062, threadinfo ffff880c77c6a000, task 
ffff880c7caa00c0)
Stack:
  ffff880c77c6bc70 ffffffff8122fa24 ffff880c77c6bc90 ffffffff8122faca
<0> ffff88048e414ec0 0000100000000002 0000000000000000 ffffffff00000000
<0> 0000000000000000 0000000000000000 ffffffffa024bb20 0000000000000030
Call Trace:
  [<ffffffff8122fa24>] ? cpumask_next+0x19/0x1b
  [<ffffffff8122faca>] ? cpumask_next_and+0x20/0x32
  [<ffffffffa023ecca>] ? process_recv_sockets+0x0/0x28 [dlm]
  [<ffffffffa023ecea>] process_recv_sockets+0x20/0x28 [dlm]
  [<ffffffff81071802>] worker_thread+0x14d/0x1ed
  [<ffffffff81075a7c>] ? autoremove_wake_function+0x0/0x3d
  [<ffffffff810716b5>] ? worker_thread+0x0/0x1ed
  [<ffffffff810756d3>] kthread+0x6e/0x76
  [<ffffffff81012dea>] child_rip+0xa/0x20
  [<ffffffff81075665>] ? kthread+0x0/0x76
  [<ffffffff81012de0>] ? child_rip+0x0/0x20
Code: 29 e7 ff ff e9 2d 01 00 00 41 8b 74 24 10 0f b7 d0 48 c7 c7 d1 8c 
24 a0 31 c0 e8 ab 71 e1 e0 e9 12 01 00 00 41 83 7d 08 00 75 04 <0f> 0b 
eb fe 4d 8d 7d 68 49 be 00 00 00 00 00 16 00 00 41 8b 55
RIP  [<ffffffffa02406c3>] receive_from_sock+0x554/0x6ed [dlm]
  RSP <ffff880c77c6bc60>
Initializing cgroup subsys cpuset
Initializing cgroup subsys cpu
Linux version 2.6.32-100.0.19.el5 (mockbuild@ca-build9.us.oracle.com) 
(gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) #1 SMP Fri Sep 17 
17:51:41 EDT 2010
Command line: ro root=/dev/mapper/vg_chili0-lv_root 
rd_LVM_LV=vg_chili0/lv_root rd_LVM_LV=vg_chili0/lv_swap rd_NO_LUKS 
rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 
KEYBOARDTYPE=pc KEYTABLE=fr-pc cgroup_disable=memory selinux=0 
pcie_aspm=off nmi_watchdog=0 console=ttyS1,115200 maxcpus=1 
reset_devices memmap=exactmap memmap=640K@0K memmap=195948K@33408K 
elfcorehdr=229356K memmap=308K#1993940K memmap=16K#2077704K 
memmap=4K#2077748K memmap=4K#2077764K memmap=44K#2077768K 
memmap=72K#2077812K memmap=4K#2077884K memmap=4K#2077888K 
memmap=4K#2077892K memmap=4K#2078024K memmap=2716K#2078052K 
memmap=1024K#69204860K memmap=128K#69205884K
KERNEL supported cpus:
   Intel GenuineIntel
   AMD AuthenticAMD
   Centaur CentaurHauls
BIOS-provided physical RAM map:

 From the dump :
GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show
copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

       KERNEL: /usr/lib/debug/lib/modules/2.6.32-100.0.19.el5/vmlinux
     DUMPFILE: /var/var/crash/127.0.0.1-2010-10-18-16:42:07/vmcore  
[PARTIAL DUMP]
         CPUS: 64
         DATE: Mon Oct 18 16:41:48 2010
       UPTIME: 00:15:00
LOAD AVERAGE: 1.06, 1.22, 1.65
        TASKS: 1594
     NODENAME: chili0
      RELEASE: 2.6.32-100.0.19.el5
      VERSION: #1 SMP Fri Sep 17 17:51:41 EDT 2010
      MACHINE: x86_64  (1999 Mhz)
       MEMORY: 64 GB
        PANIC: "kernel BUG at fs/dlm/lowcomms.c:647!"
          PID: 27062
      COMMAND: "dlm_recv/34"
         TASK: ffff880c7caa00c0  [THREAD_INFO: ffff880c77c6a000]
          CPU: 34
        STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 27062  TASK: ffff880c7caa00c0  CPU: 34  COMMAND: "dlm_recv/34"
  #0 [ffff880c77c6b910] machine_kexec at ffffffff8102cc9b
  #1 [ffff880c77c6b990] crash_kexec at ffffffff810964d4
  #2 [ffff880c77c6ba60] oops_end at ffffffff81439bd9
  #3 [ffff880c77c6ba90] die at ffffffff81015639
  #4 [ffff880c77c6bac0] do_trap at ffffffff8143952c
  #5 [ffff880c77c6bb10] do_invalid_op at ffffffff81013902
  #6 [ffff880c77c6bbb0] invalid_op at ffffffff81012b7b
     [exception RIP: receive_from_sock+1364]
     RIP: ffffffffa02406c3  RSP: ffff880c77c6bc60  RFLAGS: 00010246
     RAX: 0000000000000030  RBX: ffff8810774b8d30  RCX: ffff88087c4548f8
     RDX: 0000000000000030  RSI: ffff880876dce000  RDI: ffffffff81398045
     RBP: ffff880c77c6be50   R8: ffff000000000000   R9: ffff880c77c6b900
     R10: ffff880c77c6b8f0  R11: 0000000000000030  R12: 0000000000000030
     R13: ffff8810774b8d20  R14: ffff880c7caa00c0  R15: ffffffffa023ecca
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  #7 [ffff880c77c6be58] process_recv_sockets at ffffffffa023ecea
  #8 [ffff880c77c6be78] worker_thread at ffffffff81071802
  #9 [ffff880c77c6bee8] kthread at ffffffff810756d3
#10 [ffff880c77c6bf48] kernel_thread at ffffffff81012dea

Joel Becker

2010-Oct-21 05:21 UTC

[Ocfs2-users] kernel BUG at fs/dlm/lowcomms.c:647!

On Wed, Oct 20, 2010 at 04:15:15PM +0200, Welterlen Benoit wrote:
> I'm doing some tests on OCFS2 with a 2.6.32-100 kernel (Oracle) or 
> RHEL6/fedora and I have a hang in lowcomms.c as you can see below.
> I have a crash dump if you need more information. I'm lost and I need 
> help to know where to search to debug this problem.
	Whee!  Userspace stack on the 2.6.32-100 kernel ;-)  We haven't
actually tested this configuration yet; it's not supported officially.
However, it "should" work, just as the userspace stack stuff has worked
for a while.  I've forwarded this report on to the fs/dlm maintainer for
pointers to see if we can get you any help.

Joel
-- 

"Every new beginning comes from some other beginning's end."

Joel Becker
Senior Development Manager
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127

Welterlen Benoit

2010-Oct-21 16:11 UTC

[Ocfs2-users] kernel BUG at fs/dlm/lowcomms.c:647!

Hi,

Thanks for your answer.
SCTP is selected automatically ("dlm: Using SCTP for 
communications") and I have no option to change that 
(/sys/kernel/config/dlm/cluster/protocol is set to 1 and is only 
available once the service is started; can I modify it at that point?)

But I can give you some details about the problem.
I'm doing HA tests between 2 nodes. The problem occurs when I stop 
the service on one node: when it restarts, the other system hits the BUG:
M1 : ocfs-pcmk starts
M2 : ocfs-pcmk starts
M2 : ocfs-pcmk stops
OK so far, but
M2 : ocfs-pcmk restarts : M1 hits the BUG !!

I added traces to the code. From what I understand:
The first connection initializes a dlm connection with a node id and an 
address.
The second connection tries to recover the first structure. It has the 
node id and tries to find the address, unsuccessfully; it then tries to 
find the node id from the address, again without success.
But the data at that address does contain the node id:
02 00 00 00 0b 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00
0x0200010b = node id 33554699
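
As a cross-check of the decoding above, here is a minimal Python sketch. It assumes the dump is laid out like a `sockaddr_in` on a little-endian host (2-byte address family, 2-byte port, 4-byte IPv4 address), which fits the x86_64 machines in this thread but is not confirmed by the traces. The helper names `decode_addr` and `nodeid_to_ipv4` are made up for illustration.

```python
import socket

# First 8 bytes of the address dump from the trace; the remaining bytes are zero.
ADDR_DUMP = bytes.fromhex("02 00 00 00 0b 01 00 02")


def decode_addr(raw):
    """Decode raw bytes as a sockaddr_in on a little-endian host (assumed layout)."""
    family = int.from_bytes(raw[0:2], "little")  # address family, host byte order
    port = int.from_bytes(raw[2:4], "big")       # port, network byte order
    addr = raw[4:8]                              # IPv4 address bytes
    nodeid = int.from_bytes(addr, "little")      # same 4 bytes read as a 32-bit integer
    return family, port, socket.inet_ntoa(addr), nodeid


def nodeid_to_ipv4(nodeid):
    """Map a node id back to dotted-quad form under the same assumption."""
    return socket.inet_ntoa(nodeid.to_bytes(4, "little"))


if __name__ == "__main__":
    print(decode_addr(ADDR_DUMP))
    print(nodeid_to_ipv4(16777483))
```

Under that assumption the dump decodes to family 2, port 0 and address 11.1.0.2, and the same four address bytes read as a little-endian integer give 33554699. The other member id in the log, 16777483, maps to 11.1.0.1 in the same way.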

I'm new to this, and it's hard to understand the role of each 
component.

Regards,

Benoit

details :

Starting DLM/OCFS :

DLM (built Oct 21 2010 11:14:24) installed
ocfs2: Registered cluster interface user
OCFS2 Node Manager 1.6.3
OCFS2 1.6.3
dlm: Using SCTP for communications
SCTP: Hash tables configured (established 65536 bind 65536)
dlm: 77410678764B4782BDAE3E888E0C8C4D: joining the lockspace group...
dlm: 77410678764B4782BDAE3E888E0C8C4D: group event done 0 0
dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 1
dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 16777483
dlm: 77410678764B4782BDAE3E888E0C8C4D: total members 1 error 0
dlm: 77410678764B4782BDAE3E888E0C8C4D: dlm_recover_directory
dlm: 77410678764B4782BDAE3E888E0C8C4D: dlm_recover_directory 0 entries
dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 1 done: 0 ms
dlm: 77410678764B4782BDAE3E888E0C8C4D: join complete
ocfs2: Mounting device (253,6) on (node 1677748, slot 0) with ordered 
data mode.

 > Here, OCFS is available on both nodes

dlm: closing connection to node 33554699

 > Here, OCFS is down on M2, then restart :

dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 3
dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 33554699
cm->sockaddr_storage  ffff8804796268a0
dlm_nodeid_to_addr -EEXIST;
dlm: no address for nodeid 33554699
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
sctp_packet_config: packet:ffff880c8e6037d0 vtag:0x2b186924
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
process_sctp_notification SCTP_RESTART
get_comm  nodeid : 0, sockaddr_storage : ffff88047a357c54
get_comm cm->addr_count : 0000000000000001, cm->addr[0] : ffff880479626740
addr : ffff88047a357c54
dlm: reject connect from unknown addr
02 00 00 00 0b 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00
in __nodeid2con , nodeid : 0
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
------------[ cut here ]------------
kernel BUG at fs/dlm/lowcomms.c:661!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/dlm/77410678764B4782BDAE3E888E0C8C4D/control
CPU 35
Modules linked in: sctp(U) libcrc32c(U) ocfs2(U) ocfs2_nodemanager(U) 
ocfs2_stack_user(U) ocfs2_stackglue(U) dlm(U) configfs(U) 
acpi_cpufreq(U) freq_table(U) ipmi_devintf(U) ipmi_si(U) 
ipmi_msghandler(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U) 
ext3(U) jbd(U) sunrpc(U) ipv6(U) dm_mirror(U) dm_region_hash(U) 
dm_log(U) scsi_dh_emc(U) dm_round_robin(U) dm_multipath(U) ioatdma(U) 
i7core_edac(U) igb(U) edac_core(U) sg(U) i2c_i801(U) i2c_core(U) 
iTCO_wdt(U) dca(U) iTCO_vendor_support(U) ext4(U) mbcache(U) jbd2(U) 
usbhid(U) hid(U) sd_mod(U) crc_t10dif(U) lpfc(U) ahci(U) 
scsi_transport_fc(U) ehci_hcd(U) uhci_hcd(U) scsi_tgt(U) dm_mod(U) [last 
unloaded: ipmi_msghandler]

Modules linked in: sctp(U) libcrc32c(U) ocfs2(U) ocfs2_nodemanager(U) 
ocfs2_stack_user(U) ocfs2_stackglue(U) dlm(U) configfs(U) 
acpi_cpufreq(U) freq_table(U) ipmi_devintf(U) ipmi_si(U) 
ipmi_msghandler(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U) 
ext3(U) jbd(U) sunrpc(U) ipv6(U) dm_mirror(U) dm_region_hash(U) 
dm_log(U) scsi_dh_emc(U) dm_round_robin(U) dm_multipath(U) ioatdma(U) 
i7core_edac(U) igb(U) edac_core(U) sg(U) i2c_i801(U) i2c_core(U) 
iTCO_wdt(U) dca(U) iTCO_vendor_support(U) ext4(U) mbcache(U) jbd2(U) 
usbhid(U) hid(U) sd_mod(U) crc_t10dif(U) lpfc(U) ahci(U) 
scsi_transport_fc(U) ehci_hcd(U) uhci_hcd(U) scsi_tgt(U) dm_mod(U) [last 
unloaded: ipmi_msghandler]
Pid: 4306, comm: dlm_recv/35 Not tainted 2.6.32-30.el6.Bull.14.x86_64 #2 
bullx super-node
RIP: 0010:[<ffffffffa039edeb>]  [<ffffffffa039edeb>] 
receive_from_sock+0x38b/0x430 [dlm]
RSP: 0018:ffff88047a357d10  EFLAGS: 00010246
RAX: 0000000000000095 RBX: ffff88087c5aad20 RCX: 0000000000001b62
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246
RBP: ffff88047a357e10 R08: 00000000ffffffff R09: 0000000000000000
R10: ffff881079423480 R11: 0000000000000000 R12: ffff88047a357da0
R13: 0000000000000030 R14: ffff88087c5aad30 R15: ffff88047a357d40
FS:  0000000000000000(0000) GS:ffff880c8e600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000359bea6df0 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dlm_recv/35 (pid: 4306, threadinfo ffff88047a356000, task 
ffff88047aeaf2e0)
Stack:
  ffff88047a357db0 0000000000000246 ffff880c7da288c0 0000000000000246
<0> ffff88047a357d70 0000100081049472 0000000000000000 0000000000000000
<0> ffff88047a357d80 0000000000000002 ffff88047a357dd0 0000000000000000
Call Trace:
  [<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]
  [<ffffffffa039dda6>] process_recv_sockets+0x36/0x50 [dlm]
  [<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]
  [<ffffffff8107ce9d>] worker_thread+0x16d/0x290
  [<ffffffff81082430>] ? autoremove_wake_function+0x0/0x40
  [<ffffffff8107cd30>] ? worker_thread+0x0/0x290
  [<ffffffff810820d6>] kthread+0x96/0xa0
  [<ffffffff8100d1aa>] child_rip+0xa/0x20
  [<ffffffff81082040>] ? kthread+0x0/0xa0
  [<ffffffff8100d1a0>] ? child_rip+0x0/0x20
Code: f1 4c 8b 8d 70 ff ff ff 48 c1 e6 0c 4c 01 d6 e8 f1 33 0b e1 4c 89 
f7 e8 14 4b 0b e1 31 f6 48 89 df e8 6a f4 ff ff e9 cd fd ff ff <0f> 0b 
0f 1f 00 eb fb 48 c7 c7 e8 9a 3a a0 31 c0 e8 c5 33 0b e1
RIP  [<ffffffffa039edeb>] receive_from_sock+0x38b/0x430 [dlm]
  RSP <ffff88047a357d10>
crash>



On 21/10/2010 17:13, David Teigland wrote:
>> kernel BUG at fs/dlm/lowcomms.c:647!
>>
> That looks like an interesting one, I haven't seen it before.
> First ensure dlm is not configured to use sctp (that code is
> not widely tested.)  Other than that, if you'd like to start
> debugging this before I get around to it, replace the BUG_ON
> with some printk's and return error.  The conn with nodeid 0 is
> the listening socket, for which tcp_accept_from_sock() should
> be called rather than receive_from_sock().

Welterlen Benoit

2010-Nov-08 16:06 UTC

[Ocfs2-users] kernel BUG at fs/dlm/lowcomms.c:647!

Hi,

Thanks for your answer.
I tried forcing TCP instead of SCTP.
There is no crash anymore, but OCFS becomes unavailable when I stop one 
node: the OCFS directory is no longer accessible:

ls /BCM/conf
[root@chili0 ~]# cat /proc/23921/stack
[<ffffffffa02c5d53>] ocfs2_wait_for_recovery+0x77/0x8f [ocfs2]
[<ffffffffa02b08a8>] ocfs2_inode_lock_full_nested+0x160/0xb8d [ocfs2]
[<ffffffffa02c35e2>] ocfs2_inode_revalidate+0x163/0x25c [ocfs2]
[<ffffffffa02bd9f4>] ocfs2_getattr+0x8b/0x19c [ocfs2]
[<ffffffff8111c30f>] vfs_getattr+0x4c/0x69
[<ffffffff8111c37c>] vfs_fstatat+0x50/0x67
[<ffffffff8111c479>] vfs_stat+0x1b/0x1d
[<ffffffff8111c49a>] sys_newstat+0x1f/0x39
[<ffffffff81011db2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Any access to the filesystem hangs.
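
To compare the stacks of several hung processes, the `/proc/<pid>/stack` output above can be reduced to a list of symbols. This is a minimal Python sketch, not a tool used in this thread: `parse_stack` and `blocked_in` are hypothetical helpers, and the regular expression only targets the `[<addr>] symbol+0xoff/0xlen [module]` format shown above.

```python
import re

# Matches lines such as:
#   [<ffffffffa02c5d53>] ocfs2_wait_for_recovery+0x77/0x8f [ocfs2]
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<sym>[^+\s]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def parse_stack(text):
    """Return (symbol, module) pairs; module is None for core kernel symbols."""
    frames = []
    for line in text.splitlines():
        match = FRAME.search(line)
        if match:  # lines without symbol+offset (e.g. the end marker) are skipped
            frames.append((match.group("sym"), match.group("mod")))
    return frames


def blocked_in(frames, symbol):
    """True if the innermost (first) frame is the given symbol."""
    return bool(frames) and frames[0][0] == symbol


SAMPLE = """\
[<ffffffffa02c5d53>] ocfs2_wait_for_recovery+0x77/0x8f [ocfs2]
[<ffffffff8111c30f>] vfs_getattr+0x4c/0x69
[<ffffffffffffffff>] 0xffffffffffffffff
"""

if __name__ == "__main__":
    frames = parse_stack(SAMPLE)
    print(frames)
    print(blocked_in(frames, "ocfs2_wait_for_recovery"))
```

On a live system the text would come from reading `/proc/<pid>/stack` as root instead of the `SAMPLE` string.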

Regards,

Benoit

On 21/10/2010 19:13, David Teigland wrote:
> On Thu, Oct 21, 2010 at 06:11:36PM +0200, Welterlen Benoit wrote:
>
>> Hi,
>>
>> Thanks for your answer.
>> sctp is automatically selected to be used ("dlm: Using SCTP for
>> communications") and I have no option to modify that
>> (/sys/kernel/config/dlm/cluster/protocol is set to 1 and is only
>> available once the service is started, I can modify it at this time
>> ?)
>>      
> Yes, there's a dlm_controld command line option (or cluster.conf which I
> don't suppose you're using with pacemaker.)  You can set that to get TCP,
> but that obviates corosync redundant ring also (which is what is used to
> auto-select SCTP).
>
>    
>> dlm: closing connection to node 33554699
>>
>>      
>>> Here, OCFS is down on M2, then restart :
>>>        
>> dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 3
>> dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 33554699
>> cm->sockaddr_storage  ffff8804796268a0
>> dlm_nodeid_to_addr -EEXIST;
>> dlm: no address for nodeid 33554699
>>      
> I suspect the pacemaker version of dlm_controld is doing an unusual
> sequence of node addition/removal.  The stuff above eventually leads to
> the oops below:
>
>    
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> sctp_packet_config: packet:ffff880c8e6037d0 vtag:0x2b186924
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> process_sctp_notification SCTP_RESTART
>> get_comm  nodeid : 0, sockaddr_storage : ffff88047a357c54
>> get_comm cm->addr_count : 0000000000000001, cm->addr[0] : ffff880479626740
>> addr : ffff88047a357c54
>> dlm: reject connect from unknown addr
>> 02 00 00 00 0b 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00
>> in __nodeid2con , nodeid : 0
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> ------------[ cut here ]------------
>> kernel BUG at fs/dlm/lowcomms.c:661!
>> invalid opcode: 0000 [#1] SMP
