Vaidya, Sachin
2006-Mar-31 02:04 UTC
[Ocfs2-users] Getting error when mounting shared OCFS2 device.
Removed VIPs from hosts and restarted the cluster, but nothing changed. Still cannot mount /dev/md0 on both nodes.

Do I need to reboot the servers after changing /etc/hosts? Any other suggestions?

Thanks,
Sachin Vaidya
Infrastructure Management Senior Analyst
Affiliated Computer Services

-----Original Message-----
From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
Sent: Thursday, March 30, 2006 5:34 PM
To: Vaidya, Sachin
Cc: 'ocfs2-users at oss.oracle.com'
Subject: Re: [Ocfs2-users] Getting error when mounting shared OCFS2 device.

Remove vip and mount on both. See if that helps.

Vaidya, Sachin wrote:
> Hi,
> Tried both public and private IP addresses but still not able to mount
> the device on both nodes.
> Here are my configuration details.
>
> hosts file (same on both nodes):
>
> 127.0.0.1 localhost.localdomain localhost
> 172.18.11.12 acspittdw001 acspittdw001.servicemetrics.net
> 172.18.22.1 priv-acspittdw001
> 172.18.11.24 vip-acspittdw001
> 172.18.11.13 acspittdw002 acspittdw002.servicemetrics.net
> 172.18.22.2 priv-acspittdw002
> 172.18.11.25 vip-acspittdw002
>
> The cluster.conf is the same on both nodes:
>
> node:
>     ip_port = 7777
>     ip_address = 172.18.11.12
>     number = 0
>     name = acspittdw001
>     cluster = ocfs2
>
> node:
>     ip_port = 7777
>     ip_address = 172.18.11.13
>     number = 1
>     name = acspittdw002
>     cluster = ocfs2
>
> cluster:
>     node_count = 2
>     name = ocfs2
>
> Both nodes can ping each other on public and private IPs.
> The mount command produces the following error on node 2 when the device
> is already mounted on node 1.
>
> [root@acspittdw002 ~]# mount -t ocfs2 /dev/md0 /crs1
> mount.ocfs2: Transport endpoint is not connected while mounting /dev/md0 on /crs1
> [root@acspittdw002 ~]#
>
> dmesg shows the following messages:
>
> SELinux: initialized (dev debugfs, type debugfs), uses genfs_contexts
> (5027,2):ocfs2_initialize_super:1354 max_slots for this device: 8
> (5027,2):ocfs2_fill_local_node_info:1031 I am node 1
> (4986,2):o2net_connect_expired:1446 ERROR: no connection established with node 0 after 10 seconds, giving up and returning errors.
> (5027,2):dlm_request_join:771 ERROR: status = -107
> (5027,2):dlm_try_to_join_domain:919 ERROR: status = -107
> (5027,2):dlm_join_domain:1164 ERROR: status = -107
> (5027,2):dlm_register_domain:1354 ERROR: status = -107
> (5027,2):ocfs2_dlm_init:1996 ERROR: status = -107
> (5027,2):ocfs2_mount_volume:1063 ERROR: status = -107
> ocfs2: Unmounting device (9,0) on (node 1)
> [root@acspittdw002 ~]#
>
> Any idea why this is happening?
> I can provide more details if needed.
> Any help will be greatly appreciated.
> Thanks in advance.
> - Sachin Vaidya.
>
> -----Original Message-----
> From: Sunil Mushran
> To: Vaidya, Sachin
> Cc: 'ocfs2-users at oss.oracle.com'
> Sent: 3/29/2006 7:16 PM
> Subject: Re: [Ocfs2-users] Getting error when mounting shared OCFS2 device.
>
> Connection failure. Check dmesg.
>
> Mount triggers the heartbeat thread which triggers the o2net
> to make a connection to all heartbeating nodes. If this connection
> fails, the mount fails. (The larger node number initiates the connection
> to the lower node number.)
>
> Obvious error would be incorrect ipaddr specified in cluster.conf.
> Error messages in /var/log/messages on both nodes will
> provide more clues.
>
> Vaidya, Sachin wrote:
> > Hi,
> >
> > I am using RHEL4 2.6.9-34.ELsmp with OCFS2 1.2.
> >
> > The h/w for this 2 node cluster is connected correctly.
> >
> > After loading ocfs2 on both nodes, the shared device could only be
> > mounted on one node. When I try to mount the same shared device on the
> > second node, I get the following error.
> >
> > Mount.ocfs2: Transport endpoint is not connected while mounting
> > /dev/md0 on /crs1
> >
> > Any idea why this is happening?
> >
> > Any help will be highly appreciated.
> >
> > Thanks,
> >
> > Sachin Vaidya
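To make the connection direction concrete: the quoted reply says the node with the larger number initiates the o2net connection to the node with the lower number. The following is a minimal Python sketch, not part of the OCFS2 tools, that parses the cluster.conf stanzas posted in this message and lists who connects to whom. The function names parse_cluster_conf and connection_plan are hypothetical, and the parser assumes only the simple "stanza:" and "key = value" layout shown in this thread.

```python
from typing import Dict, List, Tuple

# cluster.conf contents as posted in this thread.
SAMPLE_CONF = """\
node:
    ip_port = 7777
    ip_address = 172.18.11.12
    number = 0
    name = acspittdw001
    cluster = ocfs2

node:
    ip_port = 7777
    ip_address = 172.18.11.13
    number = 1
    name = acspittdw002
    cluster = ocfs2

cluster:
    node_count = 2
    name = ocfs2
"""

Stanza = Tuple[str, Dict[str, str]]


def parse_cluster_conf(text: str) -> List[Stanza]:
    """Return a list of (stanza_type, {key: value}) pairs (hypothetical helper)."""
    stanzas: List[Stanza] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A new stanza header such as "node:" or "cluster:".
            stanzas.append((line[:-1], {}))
        elif "=" in line and stanzas:
            key, value = line.split("=", 1)
            stanzas[-1][1][key.strip()] = value.strip()
    return stanzas


def connection_plan(stanzas: List[Stanza]) -> List[Tuple[str, str, str, int]]:
    """List (initiator, target, target_ip, target_port) tuples.

    Follows the rule stated in the thread: the higher node number
    connects to the lower node number.
    """
    nodes = sorted(
        (fields for kind, fields in stanzas if kind == "node"),
        key=lambda fields: int(fields["number"]),
    )
    plan = []
    for i, low in enumerate(nodes):
        for high in nodes[i + 1:]:
            plan.append(
                (high["name"], low["name"], low["ip_address"], int(low["ip_port"]))
            )
    return plan


for initiator, target, ip, port in connection_plan(parse_cluster_conf(SAMPLE_CONF)):
    print(f"{initiator} connects to {target} at {ip}:{port}")
```

For this configuration the only connection to check is the one from acspittdw002 to 172.18.11.12 port 7777, which matches the "no connection established with node 0" message in the dmesg output.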
Sunil Mushran
2006-Mar-31 02:14 UTC
[Ocfs2-users] Getting error when mounting shared OCFS2 device.
/etc/hosts is not the problem.

Do:
/sbin/ifconfig

Do you see the vip bound on the same interface as the one used in cluster.conf?

Also, what does the dmesg indicate on both nodes. The lower node number will list the ip which is trying to connect to it.
Sunil Mushran
2006-Mar-31 02:57 UTC
[Ocfs2-users] Getting error when mounting shared OCFS2 device.
So, the connection is expiring. Hmmm...

1. cat /config/cluster/<clustername>/node/*/ipv4_address
   Ensure the IPs are correct. If you change the IPs in cluster.conf, you need to restart the cluster for the new values to take effect.

2. Enable tcpdump on that device and port on both nodes. See if the connection request can be tracked. It should be easy, assuming there is no other traffic on that interface between the nodes.

3. Ensure you are on ocfs2 1.2.0.
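Step 1 can also be scripted. The sketch below is a hedged illustration in Python, not an OCFS2 utility: it reads the per-node attribute files under a cluster directory such as /config/cluster/ocfs2 and reports nodes whose loaded address differs from what cluster.conf says. The function names are hypothetical, and the attribute file names (ipv4_address, ipv4_port, local, num) are the ones that appear in the directory listings posted in this thread.

```python
from pathlib import Path
from typing import Dict, List

NODE_ATTRS = ("ipv4_address", "ipv4_port", "local", "num")


def read_configfs_nodes(cluster_dir: str) -> Dict[str, Dict[str, str]]:
    """Map node name -> {attribute file name: contents} for one cluster directory."""
    nodes: Dict[str, Dict[str, str]] = {}
    for node_dir in sorted(Path(cluster_dir, "node").iterdir()):
        if not node_dir.is_dir():
            continue
        attrs: Dict[str, str] = {}
        for attr in NODE_ATTRS:
            attr_file = node_dir / attr
            if attr_file.is_file():
                attrs[attr] = attr_file.read_text().strip()
        nodes[node_dir.name] = attrs
    return nodes


def mismatches(loaded: Dict[str, Dict[str, str]], expected: Dict[str, str]) -> List[str]:
    """Return names of nodes whose loaded address differs from the expected one."""
    return sorted(
        name
        for name, ip in expected.items()
        if loaded.get(name, {}).get("ipv4_address") != ip
    )


# Example use on a cluster node (run as a user who can read /config):
#   expected = {"acspittdw001": "172.18.11.12", "acspittdw002": "172.18.11.13"}
#   print(mismatches(read_configfs_nodes("/config/cluster/ocfs2"), expected))
```

An empty list would mean the running cluster has the same addresses as cluster.conf, which rules out a stale configuration and points the investigation at the network path instead.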
Vaidya, Sachin
2006-Mar-31 03:19 UTC
[Ocfs2-users] Getting error when mounting shared OCFS2 device.
The response from our storage admin is as follows.

Everything is RAID 5; however, do not confuse this with standard RAID 5. This is a completely cache-centric array, and the volumes are multiple devices spread over up to 24 disks. All our Oracle instances for other clients run the same way, i.e., on multiple LDEVs spread across multiple 7+1 P RAID groups.

Sachin Vaidya

-----Original Message-----
From: Joel Becker [mailto:Joel.Becker at oracle.com]
Sent: Thursday, March 30, 2006 9:04 PM
To: Vaidya, Sachin
Cc: 'ocfs2-users at oss.oracle.com'
Subject: Re: [Ocfs2-users] Getting error when mounting shared OCFS2 device.

On Wed, Mar 29, 2006 at 06:45:08PM -0600, Vaidya, Sachin wrote:
> Mount.ocfs2: Transport endpoint is not connected while mounting /dev/md0 on
> /crs1

md0? What type of raid? multipath, raid0, raid1, etc? Are you using persistent-superblock?

Joel

--
"The nearest approach to immortality on Earth is a government bureau."
- James F. Byrnes

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127
Vaidya, Sachin
2006-Mar-31 03:30 UTC
[Ocfs2-users] Getting error when mounting shared OCFS2 device.
1. cat /config/cluster/<clustername>/node/*/ipv4_address

[root@acspittdw001 ~]# ls -l /config/cluster/ocfs2/node/acspittdw001/
total 0
-rw-r--r-- 1 root root 4096 Mar 31 01:13 ipv4_address
-rw-r--r-- 1 root root 4096 Mar 31 01:13 ipv4_port
-rw-r--r-- 1 root root 4096 Mar 31 01:13 local
-rw-r--r-- 1 root root 4096 Mar 31 01:13 num
[root@acspittdw001 ~]#
[root@acspittdw001 ~]# cat /config/cluster/ocfs2/node/acspittdw001/*
172.18.11.12
7777
1
0
[root@acspittdw001 ~]#
[root@acspittdw001 ~]# ls -l /config/cluster/ocfs2/node/acspittdw002
total 0
-rw-r--r-- 1 root root 4096 Mar 31 01:14 ipv4_address
-rw-r--r-- 1 root root 4096 Mar 31 01:14 ipv4_port
-rw-r--r-- 1 root root 4096 Mar 31 01:14 local
-rw-r--r-- 1 root root 4096 Mar 31 01:14 num
[root@acspittdw001 ~]#
[root@acspittdw001 ~]# cat /config/cluster/ocfs2/node/acspittdw002/*
172.18.11.13
7777
0
1
[root@acspittdw001 ~]#

The same output came on acspittdw002.

2. Enable tcpdump on that device and port on both nodes. See if the connection request can be tracked. It should be easy, assuming there is no other traffic on that interface between the nodes.

How do I start this dump?

3. Ensure you are on ocfs2 1.2.0.

That is the only version loaded on this system, as this is a fresh build. Is there any other way to find the version?

Thanks for your help.
- Sachin.
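On the question of how to start the dump: one possible form, offered as a sketch rather than an official procedure, is to capture on the interconnect interface and filter on the peer host plus the o2net port from cluster.conf. The Python snippet below only builds and prints such a command line. The helper name tcpdump_command is hypothetical, the interface name bond0 is an assumption taken from the capture posted later in this thread, and the filter also includes ICMP so that any rejection messages from the peer are visible.

```python
import shlex
from typing import List


def tcpdump_command(interface: str, peer: str, port: int = 7777) -> List[str]:
    """Build a tcpdump argv for o2net traffic to/from one peer (hypothetical helper).

    -n  skips name resolution so addresses are printed numerically
    -i  selects the capture interface
    The final argument is a pcap filter expression.
    """
    capture_filter = f"host {peer} and (tcp port {port} or icmp)"
    return ["tcpdump", "-n", "-i", interface, capture_filter]


# On acspittdw002 (node 1), watching traffic to and from node 0.
# "bond0" is an assumed interface name; substitute the one carrying 172.18.11.x.
print(shlex.join(tcpdump_command("bond0", "172.18.11.12")))
```

The printed command has to be run as root on the node, and the mount should be attempted while the capture is running so the connection attempts to port 7777 show up.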
Vaidya, Sachin
2006-Mar-31 03:50 UTC
[Ocfs2-users] Getting error when mounting shared OCFS2 device.
Here is the output of tcpdump. This output was taken from dw002 while it was trying to mount the device, which was already mounted on dw001. Please let me know if you want me to run tcpdump in any other way.

[root@acspittdw002 ~]# tcpdump host acspittdw001 -vvv
tcpdump: listening on bond0, link-type EN10MB (Ethernet), capture size 96 bytes
03:02:37.475171 IP (tos 0x0, ttl 64, id 21869, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32885 > acspittdw001.7777: S [tcp sum ok] 552401327:552401327(0) win 5840 <mss 1460,sackOK,timestamp 19901634 0,nop,wscale 3>
03:02:37.475277 IP (tos 0xc0, ttl 255, id 3141, offset 0, flags [none], proto 1, length: 88) acspittdw001 > acspittdw002: icmp 68: host acspittdw001 unreachable - admin prohibited for IP (tos 0x0, ttl 64, id 21869, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32885 > acspittdw001.7777: S 552401327:552401327(0) win 5840 <mss 1460,sackOK,timestamp 19901634[|tcp]>
03:02:39.475072 IP (tos 0x0, ttl 64, id 3835, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32886 > acspittdw001.7777: S [tcp sum ok] 559200872:559200872(0) win 5840 <mss 1460,sackOK,timestamp 19903635 0,nop,wscale 3>
03:02:39.475164 IP (tos 0xc0, ttl 255, id 3142, offset 0, flags [none], proto 1, length: 88) acspittdw001 > acspittdw002: icmp 68: host acspittdw001 unreachable - admin prohibited for IP (tos 0x0, ttl 64, id 3835, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32886 > acspittdw001.7777: S 559200872:559200872(0) win 5840 <mss 1460,sackOK,timestamp 19903635[|tcp]>
03:02:41.473785 IP (tos 0x0, ttl 64, id 23905, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32887 > acspittdw001.7777: S [tcp sum ok] 560783619:560783619(0) win 5840 <mss 1460,sackOK,timestamp 19905634 0,nop,wscale 3>
03:02:41.473874 IP (tos 0xc0, ttl 255, id 3143, offset 0, flags [none], proto 1, length: 88) acspittdw001 > acspittdw002: icmp 68: host acspittdw001 unreachable - admin prohibited for IP (tos 0x0, ttl 64, id 23905, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32887 > acspittdw001.7777: S 560783619:560783619(0) win 5840 <mss 1460,sackOK,timestamp 19905634[|tcp]>
03:02:42.473944 arp who-has acspittdw001 tell acspittdw002
03:02:42.474057 arp reply acspittdw001 is-at 00:15:60:ac:67:1a
03:02:43.473728 IP (tos 0x0, ttl 64, id 15940, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32888 > acspittdw001.7777: S [tcp sum ok] 561773174:561773174(0) win 5840 <mss 1460,sackOK,timestamp 19907634 0,nop,wscale 3>
03:02:43.473813 IP (tos 0xc0, ttl 255, id 3144, offset 0, flags [none], proto 1, length: 88) acspittdw001 > acspittdw002: icmp 68: host acspittdw001 unreachable - admin prohibited for IP (tos 0x0, ttl 64, id 15940, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32888 > acspittdw001.7777: S 561773174:561773174(0) win 5840 <mss 1460,sackOK,timestamp 19907634[|tcp]>
03:02:45.473671 IP (tos 0x0, ttl 64, id 56389, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32889 > acspittdw001.7777: S [tcp sum ok] 555880519:555880519(0) win 5840 <mss 1460,sackOK,timestamp 19909634 0,nop,wscale 3>
03:02:45.473775 IP (tos 0xc0, ttl 255, id 3145, offset 0, flags [none], proto 1, length: 88) acspittdw001 > acspittdw002: icmp 68: host acspittdw001 unreachable - admin prohibited for IP (tos 0x0, ttl 64, id 56389, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32889 > acspittdw001.7777: S 555880519:555880519(0) win 5840 <mss 1460,sackOK,timestamp 19909634[|tcp]>
03:02:47.473617 IP (tos 0x0, ttl 64, id 28231, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32890 > acspittdw001.7777: S [tcp sum ok] 564706709:564706709(0) win 5840 <mss 1460,sackOK,timestamp 19911634 0,nop,wscale 3>
03:02:47.473722 IP (tos 0xc0, ttl 255, id 3146, offset 0, flags [none], proto 1, length: 88) acspittdw001 > acspittdw002: icmp 68: host acspittdw001 unreachable - admin prohibited for IP (tos 0x0, ttl 64, id 28231, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32890 > acspittdw001.7777: S 564706709:564706709(0) win 5840 <mss 1460,sackOK,timestamp 19911634[|tcp]>
03:02:48.472372 arp who-has acspittdw002 tell acspittdw001
03:02:48.472379 arp reply acspittdw002 is-at 00:15:60:ac:08:fe

Thanks,
Sachin.
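The pattern in this capture is a TCP SYN to port 7777 followed each time by an ICMP "unreachable - admin prohibited" reply from the target host, which usually indicates that a packet filter on the target is rejecting the connection. As a rough illustration, the Python sketch below counts both kinds of packet in tcpdump text. It assumes one packet per line in the output style shown in this thread; the function name summarize and the two regular expressions are hypothetical and would need adjusting for other tcpdump versions.

```python
import re
from typing import Dict

# A SYN line looks like "src.sport > dst.dport: S ..." in this tcpdump format.
SYN_RE = re.compile(r"(\S+)\.(\d+) > (\S+)\.(\d+): S ")
# ICMP rejection as printed in the capture posted in this thread.
ICMP_RE = re.compile(r"icmp \d+: host \S+ unreachable - admin prohibited")

# Two packets copied verbatim from the capture in this thread.
SAMPLE = """\
03:02:37.475171 IP (tos 0x0, ttl 64, id 21869, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32885 > acspittdw001.7777: S [tcp sum ok] 552401327:552401327(0) win 5840 <mss 1460,sackOK,timestamp 19901634 0,nop,wscale 3>
03:02:37.475277 IP (tos 0xc0, ttl 255, id 3141, offset 0, flags [none], proto 1, length: 88) acspittdw001 > acspittdw002: icmp 68: host acspittdw001 unreachable - admin prohibited for IP (tos 0x0, ttl 64, id 21869, offset 0, flags [DF], proto 6, length: 60) acspittdw002.32885 > acspittdw001.7777: S 552401327:552401327(0) win 5840 <mss 1460,sackOK,timestamp 19901634[|tcp]>
"""


def summarize(capture: str, port: int = 7777) -> Dict[str, int]:
    """Count SYNs sent to `port` and ICMP admin-prohibited replies."""
    syn_sent = 0
    admin_prohibited = 0
    for line in capture.splitlines():
        if ICMP_RE.search(line):
            # Check ICMP first: these lines also quote the rejected SYN.
            admin_prohibited += 1
            continue
        match = SYN_RE.search(line)
        if match and int(match.group(4)) == port:
            syn_sent += 1
    return {"syn_sent": syn_sent, "admin_prohibited": admin_prohibited}


print(summarize(SAMPLE))
```

When every SYN is matched by an admin-prohibited reply and no SYN-ACK ever appears, the firewall rules on the receiving node are a reasonable next thing to check.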
Vaidya, Sachin
2006-Mar-31 04:52 UTC
[Ocfs2-users] Getting error when mounting shared OCFS2 device.
Guys,

Finally the problem is resolved.

From the tcpdump, my sysadmin could figure out that something was blocking the connection from going through from one node to the other. It was found that there was some kind of firewall process running on both nodes, which was blocking IPTables from being routed correctly. Once that process was shut down, the connection went through and the devices could be mounted as shared on both nodes at the same time.

I will now install clusterware on these nodes to make progress towards a 10g RAC solution.

Thanks to all of you who provided valuable suggestions and clues to resolve this problem. Special thanks to Sunil Mushran for his insight into this product and prompt help!

Thanks and good night!

Sachin Vaidya
Infrastructure Management Senior Analyst
Affiliated Computer Services
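For anyone hitting the same symptom, a quick way to tell a firewall rejection apart from a wrong address is to attempt a plain TCP connection to the peer's o2net port and look at how it fails. The Python sketch below is a generic reachability probe under stated assumptions, not an OCFS2 tool: the function name probe is hypothetical, port 7777 is the ip_port from the cluster.conf in this thread, and on Linux an ICMP admin-prohibited rejection typically surfaces as a "no route to host" error, reported here as "unreachable". Connecting to a live o2net port from user space may produce log messages on the peer, so it is best treated as a one-off diagnostic.

```python
import errno
import socket


def probe(host: str, port: int = 7777, timeout: float = 3.0) -> str:
    """Try one TCP connection and classify the outcome (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are probably being dropped silently.
        return "timeout"
    except ConnectionRefusedError:
        # Host reachable, but nothing is listening on the port.
        return "refused"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            # Commonly what a firewall REJECT with an ICMP unreachable looks like.
            return "unreachable"
        return f"error: {exc}"


# Example, run from acspittdw002 against node 0's address from cluster.conf:
#   print(probe("172.18.11.12"))
```

A result of "open" after the firewall change would be consistent with the mount succeeding on both nodes, while "unreachable" or "timeout" would suggest that filtering is still in place.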