info+ocfs at polnik.de
2011-Oct-27 11:31 UTC
[Ocfs2-users] nodes do not reconnect after network failure
Hi,

I use ocfs2 on an iSCSI device shared by 4 servers (vmhost1 - vmhost4), and I am trying to simulate a network problem with iptables.

uname -a
Linux vmhost3 2.6.39-gentoo-r3 #1 SMP Tue Sep 27 12:07:18 CEST 2011 i686 Intel(R) Xeon(R) CPU X5650 @ 2.67GHz GenuineIntel GNU/Linux

ocfs2-tools 1.6.4

# grep -v '#\|^$' /etc/conf.d/ocfs2
echo 1 > /proc/sys/kernel/panic_on_oops
echo 30 > /proc/sys/kernel/panic
OCFS2_CLUSTER="vmhostfiles"
OCFS2_IDLE_TIMEOUT_MS="30000"
OCFS2_KEEPALIVE_DELAY_MS="2000"
OCFS2_RECONNECT_DELAY_MS="2000"
OCFS2_DEAD_THRESHOLD="61"
OCFS2_FSCK="-fy"
OCFS2_FSCK_SWAPOFF="yes"

Test: What happens if one node cannot communicate with one other node?

1. Step (simulate a network failure)

vmhost1: iptables -A INPUT -p tcp -s vmhost2 -j DROP

=> The mounted ocfs2 device is no longer accessible on any of the 4 nodes.

Syslog messages from vmhost1/2:

Oct 27 12:41:18 vmhost2 kernel: (kworker/u:4,1149,7):o2net_connect_expired:1724 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 12:41:18 vmhost1 kernel: (kworker/u:6,1168,1):o2net_connect_expired:1724 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Oct 27 12:41:48 vmhost2 kernel: (kworker/u:6,1168,7):o2net_connect_expired:1724 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 12:41:48 vmhost1 kernel: (kworker/u:5,1150,1):o2net_connect_expired:1724 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Oct 27 12:42:18 vmhost2 kernel: (kworker/u:4,1149,7):o2net_connect_expired:1724 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 12:42:18 vmhost1 kernel: (kworker/u:6,1168,1):o2net_connect_expired:1724 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Oct 27 12:42:33 vmhost1 kernel: (kworker/u:6,1168,0):dlm_do_assert_master:1661 ERROR: Error -107 when sending message 502 (key 0x6aa537f1) to node 1
Oct 27 12:42:33 vmhost1 kernel: (dlm_thread,3143,6):dlm_send_proxy_ast_msg:484 ERROR: 3EF4047BABBC4DAD9E52FFEAECC8DED8: res P000000000000000000000000000000, error -107 send AST to node 1
Oct 27 12:42:33 vmhost1 kernel: (dlm_thread,3143,6):dlm_flush_asts:605 ERROR: status = -107
Oct 27 12:42:48 vmhost2 kernel: (kworker/u:6,1168,7):o2net_connect_expired:1724 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 12:42:48 vmhost1 kernel: (kworker/u:6,1168,1):o2net_connect_expired:1724 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Oct 27 12:43:18 vmhost2 kernel: (kworker/u:4,1149,7):o2net_connect_expired:1724 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 12:43:18 vmhost1 kernel: (kworker/u:5,1150,1):o2net_connect_expired:1724 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.

2. Step (network failure is resolved)

vmhost1: iptables -F

... but vmhost1 and vmhost2 still do not reconnect to each other:

Oct 27 13:17:54 vmhost2 kernel: (kworker/u:0,3354,0):o2net_connect_expired:1724 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 13:17:54 vmhost1 kernel: (kworker/u:0,3365,1):o2net_connect_expired:1724 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
Oct 27 13:18:24 vmhost2 kernel: (kworker/u:6,1168,0):o2net_connect_expired:1724 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 13:18:24 vmhost1 kernel: (kworker/u:1,4053,1):o2net_connect_expired:1724 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
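
To make the pattern in the log excerpts above easier to see, here is a minimal sketch that counts the o2net_connect_expired messages per reporting host and peer node. The function name summarise_o2net is made up for illustration, and the script assumes the standard syslog layout shown in these messages (hostname in the fourth field, peer number right after the word "node").

```shell
#!/bin/sh
# Hypothetical helper (not part of ocfs2-tools): reads syslog lines on
# stdin and prints "<host> -> node <n>: <count>" for every
# o2net_connect_expired message, sorted by host.
summarise_o2net() {
    awk '
        /o2net_connect_expired/ {
            host = $4                      # syslog field 4 is the hostname
            peer = "?"
            for (i = 1; i < NF; i++)       # peer number follows the word "node"
                if ($i == "node") { peer = $(i + 1); break }
            count[host " -> node " peer]++
        }
        END {
            for (k in count) print k ": " count[k]
        }
    ' | sort
}
```

It could be fed from wherever the kernel messages are logged, for example `summarise_o2net < /var/log/messages` (the log path is an assumption and depends on the syslog configuration).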
I checked this with tcpdump: ping works fine, but ocfs2 on vmhost1 does not send any packets to vmhost2 and vice versa, even though the syslog messages suggest that both nodes try to establish a connection and fail.

What must I do so that all ocfs2 nodes communicate again after a network failure?

Best regards,
thomas polnik.