Ramappa, Ravi (NSN - IN/Bangalore)
2013-Feb-26 06:07 UTC
[Ocfs2-users] Kernel panic due to ocfs2
Hi,

In a 13-node cluster, the first four nodes went into a kernel panic. /var/log/messages contained messages as below:

Feb 25 22:02:46 prod152 kernel: (o2net,9470,5):dlm_assert_master_handler:1817 ERROR: DIE! Mastery assert from 4, but current owner is 2! (O000000000000000c36706200000000)
Feb 25 22:02:46 prod152 kernel: lockres: O000000000000000c36706200000000, owner=2, state=0
Feb 25 22:02:46 prod152 kernel: last used: 0, refcnt: 3, on purge list: no
Feb 25 22:02:46 prod152 kernel: on dirty list: no, on reco list: no, migrating pending: no
Feb 25 22:02:46 prod152 kernel: inflight locks: 0, asts reserved: 0
Feb 25 22:02:46 prod152 kernel: refmap nodes: [ ], inflight=0
Feb 25 22:02:46 prod152 kernel: granted queue:
Feb 25 22:02:46 prod152 kernel: type=3, conv=-1, node=2, cookie=2:222205, ref=2, ast=(empty=y,pend=n), bast=(empty=y,pend=n), pending=(conv=n,lock=n,cancel=n,unlock=n)
Feb 25 22:02:46 prod152 kernel: converting queue:
Feb 25 22:02:46 prod152 kernel: blocked queue:
Feb 25 22:02:46 prod152 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Feb 25 22:02:46 prod152 kernel: Kernel BUG at .../build/BUILD/ocfs2-1.4.10/fs/ocfs2/dlm/dlmmaster.c:1819
Feb 25 22:02:46 prod152 kernel: invalid opcode: 0000 [1] SMP
Feb 25 22:02:46 prod152 kernel: last sysfs file: /block/cciss!c0d0/cciss!c0d0p1/stat
Feb 25 22:02:46 prod152 kernel: CPU 5
Feb 26 09:50:27 prod152 syslogd 1.4.1: restart.

The OCFS2 rpm versions used are as below:

[root@prod152 ~]# uname -r
2.6.18-308.1.1.el5
[root@prod152 ~]# rpm -qa | grep ocfs
ocfs2-2.6.18-308.1.1.el5xen-1.4.10-1.el5
ocfs2-tools-devel-1.6.3-2.el5
ocfs2-2.6.18-308.1.1.el5-1.4.10-1.el5
ocfs2-tools-debuginfo-1.6.3-2.el5
ocfs2-tools-1.6.3-2.el5
ocfs2console-1.6.3-2.el5
ocfs2-2.6.18-308.1.1.el5debug-1.4.10-1.el5

[root@prod152 ~]# cat /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 10.10.10.150
        number = 0
        name = prod150
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.151
        number = 1
        name = prod151
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.152
        number = 2
        name = prod152
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.153
        number = 3
        name = prod153
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.106
        number = 4
        name = prod106
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.107
        number = 5
        name = prod107
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.155
        number = 6
        name = prod155
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.156
        number = 7
        name = prod156
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.157
        number = 8
        name = prod157
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.158
        number = 9
        name = prod158
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.51
        number = 10
        name = prod51
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.52
        number = 11
        name = prod52
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 10.10.10.154
        number = 12
        name = prod154
        cluster = ocfs2
cluster:
        node_count = 13
        name = ocfs2
[root@prod152 ~]#

Is this a known issue? Any issues in the configuration?

Thanks,
Ravi
This is due to a race in lock mastery/purge. I have recently fixed this problem but haven't yet submitted the patch to mainline. Please file a service request with Oracle to get a one-off fix.

On 02/25/2013 10:07 PM, Ramappa, Ravi (NSN - IN/Bangalore) wrote:
> Is this a known issue? Any issues in the configuration?
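For readers wondering what the check at dlmmaster.c:1819 is guarding: the assert-master handler refuses an incoming mastery assert when the local copy of the lock resource still records a different node as its owner, which is exactly what the "DIE! Mastery assert from 4, but current owner is 2!" line shows. The fragment below is only an illustrative sketch of that check and of where a purge/mastery race could leave a stale owner behind; the struct layout, helper name and constant value are simplified assumptions, not the actual ocfs2 1.4.10 source.

/*
 * Illustrative sketch only -- not the real fs/ocfs2/dlm/dlmmaster.c code.
 * Field names, the helper name and the constant below are simplified
 * assumptions used for explanation.
 */
#include <linux/kernel.h>
#include <linux/bug.h>

#define OWNER_UNKNOWN 255	/* placeholder for "no known master" */

struct lockres_sketch {
	const char *name;	/* lock resource name, e.g. O00000...36706200000000 */
	unsigned int owner;	/* node number currently believed to be master */
};

/* Called when node 'asserting_node' claims mastery of 'res'. */
static void assert_master_check(struct lockres_sketch *res,
				unsigned int asserting_node)
{
	if (res->owner == OWNER_UNKNOWN || res->owner == asserting_node) {
		/* Expected case: accept the assert and record the master. */
		res->owner = asserting_node;
		return;
	}

	/*
	 * Unexpected case, as in the log: node 4 asserts mastery while the
	 * local copy still says node 2 owns the resource.  If purging of a
	 * stale lock resource races with an incoming assert, the old owner
	 * has not been dropped yet, this consistency check fails, and the
	 * node takes itself down with BUG().
	 */
	printk(KERN_ERR "DIE! Mastery assert from %u, but current owner is %u! (%s)\n",
	       asserting_node, res->owner, res->name);
	BUG();
}

The one-off fix mentioned above presumably closes that window by serializing purge against incoming asserts, but since the patch was not posted to the list this is only an educated guess.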