thr3ads.net - Ocfs2 devel - [Ocfs2-devel] [PATCH 0/3] ocfs2: o2net: fix packets lost issue when reconnect [May 2014]

Junxiao Bi

2014-May-15 04:26 UTC

[Ocfs2-devel] [PATCH 0/3] ocfs2: o2net: fix packets lost issue when reconnect

Hi,

After the TCP connection is established between two ocfs2 nodes, an idle
timer is set to check the connection's state periodically. If no messages
are received during this time, the idle timer fires, shuts down the
connection and tries to rebuild it, so pending messages in the TCP queues
are lost. This may cause the whole ocfs2 cluster to hang.
This is very likely to happen when the network state goes bad. Reconnecting
is useless, since it will fail if the network doesn't recover.
Just waiting for the network to recover may be a better idea: no messages
are lost, and some node will be fenced unless the cluster goes into a
split-brain state. For that case, the TCP user timeout is used to override
the TCP retransmit timeout. It expires after about 25 days; the user should
have noticed the problem through the provided log and fixed the network by
then. If they haven't, ocfs2 falls back to the original reconnect behavior.
The following series of patches fixes the bug. Please help review.

Thanks,
Junxiao.
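
A note on the "25 days" figure: patch 2/3 defines the timeout as 0x7fffffff, and TCP_USER_TIMEOUT is expressed in milliseconds, so the value works out to a little under 25 days. The following is only a sketch of that arithmetic; the struct and function names are made up for illustration and do not appear in the patches.

```c
#include <stdint.h>

/* Hypothetical helper type, used only for this illustration. */
struct duration {
	uint32_t days, hours, minutes, seconds;
};

/*
 * Split a millisecond count into days/hours/minutes/seconds.
 * The sub-second remainder is dropped.
 */
struct duration split_ms(uint32_t ms)
{
	uint32_t total_seconds = ms / 1000;
	struct duration d;

	d.days = total_seconds / 86400;
	d.hours = (total_seconds % 86400) / 3600;
	d.minutes = (total_seconds % 3600) / 60;
	d.seconds = total_seconds % 60;
	return d;
}
```

Calling split_ms(0x7fffffff) yields 24 days, 20 hours, 31 minutes and 23 seconds.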

Junxiao Bi

2014-May-15 04:26 UTC

[Ocfs2-devel] [PATCH 1/3] ocfs2: o2net: don't shutdown connection when idle timeout

Some messages in the TCP queue may be lost if we shut down the connection
and reconnect on idle timeout. If packets are lost and the reconnect
succeeds, the ocfs2 cluster may hang.

To fix this, we can leave the connection in place and make the fence decision
on idle timeout. If the network recovers before the fence decision is made, the
connection survives without losing any messages.
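
The behavior change can be pictured with a small userspace model. This is only a sketch under stated assumptions: the struct, its fields and the function names are hypothetical and merely mirror the roles of nn_timeout and the delayed nn_still_up work in the diff below; it is not the kernel code.

```c
#include <stdbool.h>

/* Hypothetical userspace model; names are illustrative, not the kernel's. */
struct conn_model {
	bool timed_out;      /* plays the role of nn_timeout */
	bool fence_pending;  /* a delayed fence decision has been queued */
	bool connected;      /* the TCP connection is still open */
	int  msgs_queued;    /* messages still sitting in the TCP queues */
};

/* Old behavior: the idle timeout tears the connection down. */
void model_idle_timeout_old(struct conn_model *c)
{
	c->timed_out = true;
	c->connected = false;
	c->msgs_queued = 0;  /* whatever was queued is lost */
}

/* New behavior: keep the connection and only arm the fence decision. */
void model_idle_timeout_new(struct conn_model *c)
{
	c->timed_out = true;
	c->fence_pending = true;
}

/* A message arrived: the connection recovered, so cancel the decision. */
void model_message_received(struct conn_model *c)
{
	if (c->timed_out) {
		c->fence_pending = false;
		c->timed_out = false;
	}
}
```

In this model, a timeout followed by a received message leaves msgs_queued untouched, which is the property the patch is after.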

Reviewed-by: Srinivas Eeda <srinivas.eeda at oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi at oracle.com>
---
 fs/ocfs2/cluster/tcp.c |   25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index c6b90e6..76ef3d8 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -1536,16 +1536,20 @@ static void o2net_idle_timer(unsigned long data)
 #endif
 
 	printk(KERN_NOTICE "o2net: Connection to " SC_NODEF_FMT " has been "
-	       "idle for %lu.%lu secs, shutting it down.\n", SC_NODEF_ARGS(sc),
-	       msecs / 1000, msecs % 1000);
+	       "idle for %lu.%lu secs.\n",
+	       SC_NODEF_ARGS(sc), msecs / 1000, msecs % 1000);
 
-	/*
-	 * Initialize the nn_timeout so that the next connection attempt
-	 * will continue in o2net_start_connect.
+	/* Idle timeout happened. Don't shut down the connection, but
+	 * make the fence decision. The connection may recover before
+	 * the decision is made.
 	 */
 	atomic_set(&nn->nn_timeout, 1);
+	o2quo_conn_err(o2net_num_from_nn(nn));
+	queue_delayed_work(o2net_wq, &nn->nn_still_up,
+			msecs_to_jiffies(O2NET_QUORUM_DELAY_MS));
+
+	o2net_sc_reset_idle_timer(sc);
 
-	o2net_sc_queue_work(sc, &sc->sc_shutdown_work);
 }
 
 static void o2net_sc_reset_idle_timer(struct o2net_sock_container *sc)
@@ -1560,6 +1564,15 @@ static void o2net_sc_reset_idle_timer(struct o2net_sock_container *sc)
 
 static void o2net_sc_postpone_idle(struct o2net_sock_container *sc)
 {
+	struct o2net_node *nn = o2net_nn_from_num(sc->sc_node->nd_num);
+
+	/* clear fence decision since the connection recovered from timeout */
+	if (atomic_read(&nn->nn_timeout)) {
+		o2quo_conn_up(o2net_num_from_nn(nn));
+		cancel_delayed_work(&nn->nn_still_up);
+		atomic_set(&nn->nn_timeout, 0);
+	}
+
 	/* Only push out an existing timer */
 	if (timer_pending(&sc->sc_idle_timeout))
 		o2net_sc_reset_idle_timer(sc);
-- 
1.7.9.5

Junxiao Bi

2014-May-15 04:26 UTC

[Ocfs2-devel] [PATCH 2/3] ocfs2: o2net: set tcp user timeout to max value

When the TCP retransmit timeout (about 15 minutes) expires, the connection
is closed, and pending messages may be lost. So we set the TCP user
timeout to its max value to override the retransmit timeout.
This is OK for ocfs2 since we have the disk heartbeat: if the peer crashes,
its disk heartbeat times out and it is evicted. If the disk heartbeat has
not timed out and the connection has been idle for a long time, the cluster
has entered a split-brain state. Since fencing can't happen then, we'd
better keep the connection and wait for the network to recover.
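
For readers who want to try the option outside the kernel, here is a minimal userspace sketch of the same socket option. It assumes Linux with a libc that defines TCP_USER_TIMEOUT in <netinet/tcp.h>; the two helper names are hypothetical. The patch itself uses kernel_setsockopt() with SOL_TCP, which has the same value as IPPROTO_TCP.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: set TCP_USER_TIMEOUT (in milliseconds) on a TCP
 * socket. Returns 0 on success, -1 on error with errno set.
 */
int set_tcp_user_timeout(int fd, unsigned int timeout_ms)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  &timeout_ms, sizeof(timeout_ms));
}

/*
 * Hypothetical helper: read the current TCP_USER_TIMEOUT back.
 * Returns 0 on success, -1 on error with errno set.
 */
int get_tcp_user_timeout(int fd, unsigned int *timeout_ms)
{
	socklen_t len = sizeof(*timeout_ms);

	return getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
			  timeout_ms, &len);
}
```

A caller would create the socket with socket(AF_INET, SOCK_STREAM, 0), call set_tcp_user_timeout() on it (before or after connecting), and can check the stored value with get_tcp_user_timeout().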

Reviewed-by: Srinivas Eeda <srinivas.eeda at oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi at oracle.com>
---
 fs/ocfs2/cluster/tcp.c |   20 ++++++++++++++++++++
 fs/ocfs2/cluster/tcp.h |    1 +
 2 files changed, 21 insertions(+)

diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 76ef3d8..eae58d8 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -1480,6 +1480,14 @@ static int o2net_set_nodelay(struct socket *sock)
 	return ret;
 }
 
+static int o2net_set_usertimeout(struct socket *sock)
+{
+	int user_timeout = O2NET_TCP_USER_TIMEOUT;
+
+	return kernel_setsockopt(sock, SOL_TCP, TCP_USER_TIMEOUT,
+				(char *)&user_timeout, sizeof(user_timeout));
+}
+
 static void o2net_initialize_handshake(void)
 {
 	o2net_hand->o2hb_heartbeat_timeout_ms = cpu_to_be32(
@@ -1663,6 +1671,12 @@ static void o2net_start_connect(struct work_struct *work)
 		goto out;
 	}
 
+	ret = o2net_set_usertimeout(sock);
+	if (ret) {
+		mlog(ML_ERROR, "set TCP_USER_TIMEOUT failed with %d\n", ret);
+		goto out;
+	}
+
 	o2net_register_callbacks(sc->sc_sock->sk, sc);
 
 	spin_lock(&nn->nn_lock);
@@ -1842,6 +1856,12 @@ static int o2net_accept_one(struct socket *sock)
 		goto out;
 	}
 
+	ret = o2net_set_usertimeout(new_sock);
+	if (ret) {
+		mlog(ML_ERROR, "set TCP_USER_TIMEOUT failed with %d\n", ret);
+		goto out;
+	}
+
 	slen = sizeof(sin);
 	ret = new_sock->ops->getname(new_sock, (struct sockaddr *) &sin,
 				       &slen, 1);
diff --git a/fs/ocfs2/cluster/tcp.h b/fs/ocfs2/cluster/tcp.h
index 5bada2a..c571e84 100644
--- a/fs/ocfs2/cluster/tcp.h
+++ b/fs/ocfs2/cluster/tcp.h
@@ -63,6 +63,7 @@ typedef void (o2net_post_msg_handler_func)(int status, void *data,
 #define O2NET_KEEPALIVE_DELAY_MS_DEFAULT	2000
 #define O2NET_IDLE_TIMEOUT_MS_DEFAULT		30000
 
+#define O2NET_TCP_USER_TIMEOUT			0x7fffffff
 
 /* TODO: figure this out.... */
 static inline int o2net_link_down(int err, struct socket *sock)
-- 
1.7.9.5

Junxiao Bi

2014-May-15 04:26 UTC

[Ocfs2-devel] [PATCH 3/3] ocfs2: quorum: add a log for node not fenced

For debugging: the log now shows whether the fence decision was made and,
if the node was not fenced, why.
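
To help read the new log line (heartbeating, connected, lowest reachable), here is a simplified model of the kind of majority rule those three values feed into. It is a sketch written from my understanding of the o2quo quorum rule, not a copy of quorum.c; the function name and signature are hypothetical, and the code in o2quo_make_decision() remains the authority.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified model of the fence decision.
 *   heartbeating     - number of nodes seen heartbeating on disk
 *   connected        - number of those nodes reachable over the network
 *   lowest_reachable - whether the lowest-numbered heartbeating node
 *                      is among the reachable ones
 */
bool should_fence(int heartbeating, int connected, bool lowest_reachable)
{
	int quorum;

	if (heartbeating & 1) {
		/* Odd node count: fence if a majority cannot be reached. */
		quorum = (heartbeating + 1) / 2;
		return connected < quorum;
	}

	/*
	 * Even node count: the cluster can split into two equal halves.
	 * The half that can reach the lowest-numbered node survives.
	 */
	quorum = heartbeating / 2;
	if (connected < quorum)
		return true;
	return connected == quorum && !lowest_reachable;
}
```

Under this model, a node in a four-node cluster that reaches exactly two nodes is fenced only when the lowest-numbered node is not among them, which is the case the "(un)reachable" part of the log message distinguishes.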

Reviewed-by: Srinivas Eeda <srinivas.eeda at oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi at oracle.com>
---
 fs/ocfs2/cluster/quorum.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/cluster/quorum.c b/fs/ocfs2/cluster/quorum.c
index 1ec141e..62e8ec6 100644
--- a/fs/ocfs2/cluster/quorum.c
+++ b/fs/ocfs2/cluster/quorum.c
@@ -160,9 +160,18 @@ static void o2quo_make_decision(struct work_struct *work)
 	}
 
 out:
-	spin_unlock(&qs->qs_lock);
-	if (fence)
+	if (fence) {
+		spin_unlock(&qs->qs_lock);
 		o2quo_fence_self();
+	} else {
+		mlog(ML_NOTICE, "not fencing this node, heartbeating: %d, "
+			"connected: %d, lowest: %d (%sreachable)\n",
+			qs->qs_heartbeating, qs->qs_connected, lowest_hb,
+			lowest_reachable ? "" : "un");
+		spin_unlock(&qs->qs_lock);
+
+	}
+
 }
 
 static void o2quo_set_hold(struct o2quo_state *qs, u8 node)
-- 
1.7.9.5

Joseph Qi

2014-May-15 08:27 UTC

[Ocfs2-devel] [PATCH 0/3] ocfs2: o2net: fix packets lost issue when reconnect

On 2014/5/15 12:26, Junxiao Bi wrote:
> Hi,
>
> After the TCP connection is established between two ocfs2 nodes, an idle
> timer is set to check the connection's state periodically. If no messages
> are received during this time, the idle timer fires, shuts down the
> connection and tries to rebuild it, so pending messages in the TCP queues
> are lost. This may cause the whole ocfs2 cluster to hang.
> This is very likely to happen when the network state goes bad. Reconnecting
> is useless, since it will fail if the network doesn't recover.
> Just waiting for the network to recover may be a better idea: no messages
> are lost, and some node will be fenced unless the cluster goes into a
> split-brain state. For that case, the TCP user timeout is used to override
> the TCP retransmit timeout. It expires after about 25 days; the user should
> have noticed the problem through the provided log and fixed the network by
> then. If they haven't, ocfs2 falls back to the original reconnect behavior.
> The following series of patches fixes the bug. Please help review.

TCP RTT is auto-regressive, which means the following case may take
place:
Suppose the current retransmission interval is ΔT (somewhat long). If the
network recovers but goes down again before the next retransmission window
arrives (< ΔT), the network recovery won't be detected and the ocfs2
cluster still hangs.

> Thanks,
> Junxiao.
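
The scenario Joseph describes can be sketched with a toy backoff model. The initial interval of 200 ms and the 120 s cap are assumptions chosen to resemble common Linux defaults; a real TCP stack derives its retransmission timeout from measured round-trip times, so the exact instants will differ. The function name is hypothetical.

```c
#include <stdbool.h>

/* Assumed values, for illustration only. */
#define RTO_INITIAL_MS 200LL
#define RTO_MAX_MS     120000LL

/*
 * Toy model: the link fails at t = 0 and the sender retransmits with a
 * doubling interval, capped at RTO_MAX_MS. Returns true if at least one
 * retransmission falls inside the interval [up_start_ms, up_end_ms)
 * during which the network is briefly usable again.
 */
bool recovery_window_probed(long long up_start_ms, long long up_end_ms)
{
	long long rto = RTO_INITIAL_MS;
	long long t = 0;

	while (t < up_end_ms) {
		t += rto;  /* the next retransmission fires at time t */
		if (t >= up_start_ms && t < up_end_ms)
			return true;
		rto *= 2;
		if (rto > RTO_MAX_MS)
			rto = RTO_MAX_MS;
	}
	return false;
}
```

With these assumed values, retransmissions occur at 324.6 s and 444.6 s, so a recovery lasting from 330 s to 440 s goes unnoticed, while one lasting from 320 s to 330 s is detected.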

Junxiao Bi

2014-Jun-06 02:18 UTC

[Ocfs2-devel] [PATCH 0/3] ocfs2: o2net: fix packets lost issue when reconnect

Hi Mark & Andrew,

Could you help review this patch series?
This bug can be seen when the network state goes bad. It may cause ocfs2 to
hang forever if some packets are lost. With this fix, ocfs2 recovers from
the hang once the network becomes good again.

Thanks,
Junxiao.

