We are testing clustering and I am having issues getting all of my nodes to mount. I have 4 nodes. I am using iSCSI to share 1 target with 2 LUNs. All 4 nodes can access the target; I can run fdisk -l against the block devices. Initially I had all 4 nodes mounting the share, but I brought the cluster down to add an additional NIC. Presently nodes 2 and 3 can mount the shares; nodes 1 and 4 cannot. Previously I had node 1 mounted and nodes 2, 3 and 4 could not.

Any help is appreciated!

Nodes 2 & 3:

# service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 31
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active

Nodes 1 & 4:

# service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 31
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Not active

All nodes:

# mounted.ocfs2 -d
Device     FS     UUID                                  Label
/dev/sda1  ocfs2  fea0a398-a696-414f-bd9f-d7aa84bd6b77  ocu01
/dev/sdb1  ocfs2  26e82fa7-ec91-4a81-a965-571ed4223ab0  oracluster

# mounted.ocfs2 -f
Device     FS     Nodes
/dev/sda1  ocfs2  ocnode2, ocnode3
/dev/sdb1  ocfs2  ocnode2, ocnode3

dmesg snippet from node 4:

o2net: connected to node ocnode2 (num 2) at 192.168.1.2:7777
(4145,0):o2net_connect_expired:1664 ERROR: no connection established with node 3 after 30.0 seconds, giving up and returning errors.
(4176,0):dlm_request_join:1036 ERROR: status = -107
(4176,0):dlm_try_to_join_domain:1210 ERROR: status = -107
(4176,0):dlm_join_domain:1488 ERROR: status = -107
(4176,0):dlm_register_domain:1754 ERROR: status = -107
(4176,0):ocfs2_dlm_init:2723 ERROR: status = -107
(4176,0):ocfs2_mount_volume:1437 ERROR: status = -107
ocfs2: Unmounting device (8,17) on (node 4)
o2net: no longer connected to node ocnode2 (num 2) at 192.168.1.2:7777
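Since the cluster was taken down specifically to add a NIC, a stale or inconsistent /etc/ocfs2/cluster.conf is worth ruling out first: a split where two nodes mount and two do not is what a config mismatch typically looks like. A minimal consistency check, assuming the default config path, root ssh between nodes, and that the two hosts not shown above are named ocnode1 and ocnode4 (only ocnode2 and ocnode3 appear in the output):

# Compare the o2cb config across all four nodes; the checksums must
# match, and every ip_address must point at the interconnect NIC.
for h in ocnode1 ocnode2 ocnode3 ocnode4; do
    echo "== $h =="
    ssh root@$h "md5sum /etc/ocfs2/cluster.conf"
    ssh root@$h "grep -E 'ip_address|name' /etc/ocfs2/cluster.conf"
done

If the checksums differ, propagate one copy to all nodes and restart o2cb before retrying the mounts.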
The network connect is failing. That could be a firewall, a bad IP address, or a switch issue.

Mount the volume on node 2. Then enable tracing and tail the messages file:

# debugfs.ocfs2 -l TCP allow
# tail -f /var/log/messages

Then from node 4, ping node 2 using netcat:

# nc -z 192.168.1.2 7777

If it succeeds, you should see:

Connection to 192.168.1.2 7777 port [tcp/cbt] succeeded!

Additionally, you will see a message on node 2: "attempt to connect from node...".

If not, look at your network setup.

Remember to disable tracing on node 2:

# debugfs.ocfs2 -l TCP off

Sunil

Chris Clonch wrote:
> We are testing clustering and I am having issues getting all of my
> nodes to mount. [...] Presently nodes 2 and 3 can mount the shares;
> nodes 1 and 4 cannot. [...]
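Since the dmesg snippet shows node 4 reaching node 2 but timing out against node 3, it is also worth running the same netcat probe over every node pair rather than a single one; any pair that fails isolates the broken link. A sketch, assuming root ssh between nodes and interconnect addresses 192.168.1.1 through 192.168.1.4 (only .2 is confirmed above):

# Probe o2net (port 7777) from every node to every other node.
for src in 1 2 3 4; do
    for dst in 1 2 3 4; do
        [ "$src" = "$dst" ] && continue
        if ssh root@192.168.1.$src "nc -z -w 5 192.168.1.$dst 7777" >/dev/null 2>&1; then
            echo "node $src -> node $dst: ok"
        else
            echo "node $src -> node $dst: FAILED"
        fi
    done
done

A full matrix matters here because o2net uses a single TCP connection per node pair, initiated by only one side based on node number, so a one-way block can pass a single spot check and still break the DLM join.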
Thanks for the reply Sunil! I thought it might be a network issue, but when I enabled debugging and then ran netcat against it, I could connect. Firewall and SELinux are both disabled on all nodes.

I went ahead and re-ran the Ethernet cabling for both sets of NICs to make sure the old hub I had been using was not a factor; everything now runs through an enterprise-class switch. Stats from netstat and ethtool look normal, though I did not view them prior to the changes.

I also pulled down the latest updates from RHEL (did I mention these are RHEL 5.4?), which included kernel-2.6.18-164.15.1.el5. Now the ocfs2_stackglue and ocfs2_dlmfs modules are failing to load.

-Chris

On Mon, Mar 22, 2010 at 4:52 PM, Sunil Mushran <sunil.mushran at oracle.com> wrote:
> The network connect is failing. That could be a firewall,
> a bad IP address, or a switch issue. [...]
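The timing points at the kernel update: the OCFS2 modules for RHEL 5 ship in per-kernel RPMs, so a package built for the old kernel will not load under 2.6.18-164.15.1.el5. A quick check, assuming the usual oss.oracle.com packaging where the kernel version is embedded in the package name:

# The running kernel and the installed ocfs2 module packages
# must agree; then try loading the failing module by hand.
# uname -r
# rpm -qa | grep -i ocfs2
# modprobe ocfs2_stackglue && echo "stackglue loaded"
# dmesg | tail

If no ocfs2 package matching the running kernel shows up, installing the module RPM built for 2.6.18-164.15.1.el5 (plus the matching ocfs2-tools) and re-running the modprobe should get the stack loading again.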