Daniel Keisling
2008-Oct-03 15:20 UTC
[Ocfs2-users] Cannot mount 1 out of 4 OCFS2 filesystems
Greetings,

I have a 4-node Oracle RAC cluster sharing four OCFS2 v1.2 filesystems on RHEL5. Node 3 was taken down for maintenance and was rebooted several times. During this time, the networking stack on the cluster interconnect had issues (after changing to an active-backup bonding mode) and was seeing high packet loss, resulting in timeouts connecting to the cluster. After the networking changes were reverted (putting the bonding mode back to active-active) and the server was rebooted, I can join the cluster but can only mount 3 out of the 4 OCFS2 filesystems:

[root@ausracdb04 /]# mount /dev/mapper/limsp_archp1
mount.ocfs2: Unknown code B 0 while mounting /dev/mapper/limsp_archp1 on /var/opt/oracle/oradata/limsp/arch. Check 'dmesg' for more information on this error.

dmesg reports:

(17909,1):dlm_join_domain:1301 Timed out joining dlm domain 980E9BC11D2C458B9BC8BEACC1365CAC after 90400 msecs
ocfs2: Unmounting device (253,19) on (node 3)

The other nodes do not report anything for this filesystem during the failed join, but I do see successful domain joins for the other OCFS2 filesystems. I can ping the interconnect IPs between all 4 servers. I have rebooted several times and restarted the entire cluster stack, to no avail. The problem has persisted for the last 18 hours.

My initial thought is that there is a DLM resource lock that cannot be released, but I'm not exactly sure how to fix it (rebooting the other nodes is not the best option, as this is a live production environment). I've tried to use the debugfs tools mentioned in the FAQ/User Guides, but it's very confusing and I'm not sure what I need to look for. I can see the disk device just fine on the server and can browse the filesystem using ocfs2console; I just cannot join the domain to mount it.

I would appreciate any advice anyone may have. My details are:

[root@ausracdb04 /]# uname -a
Linux ausracdb04.austin.ppdi.com 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

[root@ausracdb04 /]# rpm -qa | grep -i ocfs2
ocfs2-2.6.18-53.el5-1.2.8-2.el5
ocfs2console-1.2.7-2.el5
ocfs2-tools-1.2.7-2.el5

[root@ausracdb04 /]# cat /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.0.100
        number = 0
        name = ausracdb01
        cluster = racdb

node:
        ip_port = 7777
        ip_address = 192.168.0.101
        number = 1
        name = ausracdb02
        cluster = racdb

node:
        ip_port = 7777
        ip_address = 192.168.0.102
        number = 2
        name = ausracdb03
        cluster = racdb

node:
        ip_port = 7777
        ip_address = 192.168.0.106
        number = 3
        name = ausracdb04
        cluster = racdb

cluster:
        node_count = 4
        name = racdb
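[Editorial note: since a successful ICMP ping does not prove that the o2net TCP port itself is reachable, one quick additional check is to probe port 7777 on each interconnect IP listed in cluster.conf above. The following is a minimal sketch, not part of the original post; it assumes the nc (netcat) utility is installed on the node, and the IP list is copied from the configuration shown.]

#!/bin/bash
# Probe the o2net port on every interconnect IP from cluster.conf.
# Assumes nc (netcat) is available; -z does a connect-only scan and
# -w 2 limits each attempt to 2 seconds.
PORT=7777
for ip in 192.168.0.100 192.168.0.101 192.168.0.102 192.168.0.106; do
    if nc -z -w 2 "$ip" "$PORT" >/dev/null 2>&1; then
        echo "$ip:$PORT reachable"
    else
        echo "$ip:$PORT NOT reachable"
    fi
done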
[root@ausracdb04 /]# cat /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running /etc/init.d/o2cb configure.
# Please use that method to modify this file.
#
# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true
# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=racdb
# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=61
# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=60000
# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=
# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=

[root@ausracdb04 /]# echo "stat " | debugfs.ocfs2 -n /dev/mapper/limsp_archp1
Inode: 5   Mode: 0775   Generation: 1066067688 (0x3f8ae6e8)
FS Generation: 1066067688 (0x3f8ae6e8)
Type: Directory   Attr: 0x0   Flags: Valid System
User: 503 (oracle)   Group: 505 (dba)   Size: 40960
Links: 4   Clusters: 10
ctime: 0x48e635d4 -- Fri Oct  3 10:10:12 2008
atime: 0x48627838 -- Wed Jun 25 11:54:16 2008
mtime: 0x48e635d4 -- Fri Oct  3 10:10:12 2008
dtime: 0x0 -- Wed Dec 31 18:00:00 1969
ctime_nsec: 0x3ad5b3d6 -- 987083734
atime_nsec: 0x00000000 -- 0
mtime_nsec: 0x3ad5b3d6 -- 987083734
Last Extblk: 0
Sub Alloc Slot: Global   Sub Alloc Bit: 1
Tree Depth: 0   Count: 243   Next Free Rec: 10
## Offset Clusters Block#
0  0  1  207
1  1  1  485268
2  2  1  2096789
3  3  1  751454
4  4  1  1782521
5  5  1  2144728
6  6  1  2145932
7  7  1  1784169
8  8  1  1601861
9  9  1  2446400

[root@ausracdb04 /]# echo "slotmap" | debugfs.ocfs2 -n /dev/mapper/limsp_archp1
Slot#  Node#
  0      0
  1      1
  2      2

Slot map for another filesystem that is correctly joined and mounted:

[root@ausracdb04 /]# echo "slotmap" | debugfs.ocfs2 -n /dev/mapper/ph1pp1
Slot#  Node#
  0      0
  1      1
  2      2
  3      3

I don't know if this is the correct command to look for "busy" locks (run from another node):

[root@ausracdb01 ~]# echo "fs_locks" | debugfs.ocfs2 -n /dev/mapper/limsp_archp1 | grep -i busy
[root@ausracdb01 ~]#

TIA,

Daniel
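[Editorial note: the per-volume debugfs.ocfs2 checks shown in this message can also be run in a single pass. The following is a minimal sketch, not from the original post; it uses only the subcommands already demonstrated above, and the device list is simply the pair mentioned in this thread, which would need adjusting for the actual set of shared volumes.]

#!/bin/bash
# Print the slot map and any "busy" file-system locks for each OCFS2 volume.
# Run as root on a node that can see the devices.
DEVICES="/dev/mapper/limsp_archp1 /dev/mapper/ph1pp1"
for dev in $DEVICES; do
    echo "=== $dev ==="
    echo "slotmap" | debugfs.ocfs2 -n "$dev"
    echo "--- busy locks (no output means none were reported) ---"
    echo "fs_locks" | debugfs.ocfs2 -n "$dev" | grep -i busy
done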
Daniel Keisling
2008-Oct-03 21:13 UTC
[Ocfs2-users] [SUMMARY] Cannot mount 1 out of 4 OCFS2 filesystems
This seems to be related to bug 6719988 in OCFS2 v1.2.8-2; it is fixed in v1.2.9-1.
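[Editorial note: before and after upgrading, the installed OCFS2 kernel-module package on each node can be compared against the fixed release. The following is a minimal sketch, not from the original post; it assumes passwordless root ssh between the nodes named in this thread, and the modinfo query is an extra check that the loaded module matches the installed RPM.]

#!/bin/bash
# Show the installed OCFS2 packages and the loaded module version on every
# node, so each can be checked against the 1.2.9-1 release carrying the fix.
for h in ausracdb01 ausracdb02 ausracdb03 ausracdb04; do
    echo "== $h =="
    ssh root@"$h" 'rpm -qa | grep -i ocfs2; modinfo ocfs2 | grep -i "^version"'
done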