Hello!

Hey, didn't a setting for the 10 second network timeout get into the 2.6.20 kernel?

If so, how do we set it?

I am getting:

OCFS2 1.3.3
(2201,0):o2net_connect_expired:1547 ERROR: no connection established with node 1 after 10.0 seconds, giving up and returning errors.
(2458,0):dlm_request_join:802 ERROR: status = -107
(2458,0):dlm_try_to_join_domain:950 ERROR: status = -107
(2458,0):dlm_join_domain:1202 ERROR: status = -107
(2458,0):dlm_register_domain:1393 ERROR: status = -107
(2458,0):ocfs2_dlm_init:2215 ERROR: status = -107
(2458,0):ocfs2_mount_volume:1069 ERROR: status = -107
ocfs2: Unmounting device (8,1) on (node 3)
o2net: connected to node cluster1 (num 0) at xxx.xxx.xxx.xxx:7777
o2net: no longer connected to node cluster1 (num 0) at xxx.xxx.xxx.xxx:7777

Hi,

OK, I'll try this again since there seem to be more people reading this list.

I don't quite understand the log messages regarding fencing. Should the other nodes in the cluster that lost network connectivity log something about quorum, fencing, etc.?

Is it true that the network timeout parameter can be set in 1.2.4? If not, can I change the setting myself before compiling?

What will we see in the logs if a node cannot write to the clusterfs but heartbeat still works?

This node panicked last night with this as the only log output.

"Node 1"
Feb 6 20:52:51 atl02010304 kernel: o2net: connection to node atl02010305 (num 1) at 192.168.3.105:7777 has been idle for 10 seconds, shutting it down.
Feb 6 20:52:51 atl02010304 kernel: (15822,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1170813158.337779 now 1170813168.338726 dr 1170813163.339064 adv 1170813158.337780:1170813158.337780 func (ca3835ec:505) 1170813013.339584:1170813013.339601)
Feb 6 20:52:51 atl02010304 kernel: o2net: connection to node atl02010310 (num 0) at 192.168.3.110:7777 has been idle for 10 seconds, shutting it down.
Feb 6 20:52:51 atl02010304 kernel: (15486,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1170813161.826171 now 1170813171.827091 dr 1170813171.826723 adv 1170813161.826171:1170813161.826172 func (ca3835ec:506) 1170812821.832120:1170812821.832128)
Feb 6 20:52:51 atl02010304 kernel: o2net: no longer connected to node atl02010305 (num 1) at 192.168.3.105:7777
Feb 6 20:52:51 atl02010304 kernel: o2net: no longer connected to node atl02010310 (num 0) at 192.168.3.110:7777

"Node 2"
Jan 21 05:25:19 atl02010310 kernel: o2net: no longer connected to node atl02010304 (num 2) at 192.168.3.104:7777
Jan 21 05:25:19 atl02010310 kernel: klogd 1.4.1, ---------- state change ----------
Jan 21 05:25:21 atl02010310 kernel: (3716,1):dlm_get_lock_resource:847 32E007178FA24E87B45ECDDE6F7D5D52:$RECOVERY: at least one node (2) torecover before lock mastery can begin
Jan 21 05:25:21 atl02010310 kernel: (3716,1):dlm_get_lock_resource:874 32E007178FA24E87B45ECDDE6F7D5D52: recovery map is not empty, but must master $RECOVERY lock now
<snip>
Jan 21 05:28:43 atl02010310 kernel: o2net: accepted connection from node atl02010304 (num 2) at 192.168.3.104:7777
Jan 21 05:28:47 atl02010310 kernel: ocfs2_dlm: Node 2 joins domain 32E007178FA24E87B45ECDDE6F7D5D52
Jan 21 05:28:47 atl02010310 kernel: ocfs2_dlm: Nodes in domain ("32E007178FA24E87B45ECDDE6F7D5D52"): 0 2

"Node 3"
Same as above.

Brandon,

You can set it through /sys/kernel/config/cluster/<cluster name>/idle_timeout_ms. The default value is 10000 ms; just echo the new timeout value, in milliseconds, into that file and it will be set.

The way to do it is to start o2cb and set the value before mounting the first volume. If you try to change it after a volume is mounted, it will not work.

Set the same timeout value on all nodes of the cluster. It would not make sense to set a different value for each node.

Brandon Lamb wrote:
> Hello!
>
> Hey didnt a setting for the 10 second network timeout get into the
> 2.6.20 kernel?
>
> if so how do we set this?
>
> I am getting
>
> OCFS2 1.3.3
> (2201,0):o2net_connect_expired:1547 ERROR: no connection established
> with node 1 after 10.0 seconds, giving up and returning errors.
> (2458,0):dlm_request_join:802 ERROR: status = -107
> (2458,0):dlm_try_to_join_domain:950 ERROR: status = -107
> (2458,0):dlm_join_domain:1202 ERROR: status = -107
> (2458,0):dlm_register_domain:1393 ERROR: status = -107
> (2458,0):ocfs2_dlm_init:2215 ERROR: status = -107
> (2458,0):ocfs2_mount_volume:1069 ERROR: status = -107
> ocfs2: Unmounting device (8,1) on (node 3)
> o2net: connected to node cluster1 (num 0) at xxx.xxx.xxx.xxx:7777
> o2net: no longer connected to node cluster1 (num 0) at
> xxx.xxx.xxx.xxx:7777

--
Regards,
Marcos Eduardo Matsunaga
Oracle USA Linux Engineering
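
For anyone who wants to script the step described above instead of running echo by hand, here is a minimal sketch in Python. It assumes configfs is mounted at /sys/kernel/config, that o2cb is already started, and that no OCFS2 volume is mounted yet. The function name set_idle_timeout_ms and the root parameter are hypothetical helpers for illustration, not part of any OCFS2 tool.

```python
from pathlib import Path

# Assumption: configfs is mounted here, as in the path given in the reply above.
CONFIGFS_CLUSTER_ROOT = Path("/sys/kernel/config/cluster")


def set_idle_timeout_ms(cluster_name, timeout_ms, root=CONFIGFS_CLUSTER_ROOT):
    """Write timeout_ms to <root>/<cluster_name>/idle_timeout_ms.

    Returns the value read back from the file so the caller can confirm it.
    Hypothetical helper: it does the same thing as echoing the value by hand.
    """
    if not isinstance(timeout_ms, int) or timeout_ms <= 0:
        raise ValueError("timeout_ms must be a positive integer (milliseconds)")

    attr = Path(root) / cluster_name / "idle_timeout_ms"
    if not attr.parent.is_dir():
        # The cluster directory only exists once o2cb has been started.
        raise FileNotFoundError(f"{attr.parent} not found; is o2cb started?")

    # Same effect as: echo <timeout_ms> > .../idle_timeout_ms
    attr.write_text(f"{timeout_ms}\n")
    return int(attr.read_text().strip())


# Example call (run as root, before the first mount, on every node):
#   set_idle_timeout_ms("mycluster", 30000)
# "mycluster" and 30000 are placeholders, not recommended values.
```

Run it with the same value on every node, after starting o2cb and before mounting the first volume. If the write is attempted too late, expect it to fail with an OSError rather than silently succeed.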