Junxiao Bi
2014-May-15 04:26 UTC
[Ocfs2-devel] [PATCH 0/3] ocfs2: o2net: fix packets lost issue when reconnect
Hi,

After a TCP connection is established between two ocfs2 nodes, an idle timer is armed to check the connection's state periodically. If no messages are received within the idle period, the timer fires, the connection is shut down, and o2net tries to re-establish it, so any messages still pending in the TCP queues are lost. This may hang the whole ocfs2 cluster.

This is quite likely to happen when the network goes bad. Reconnecting is useless in that case: it will fail until the network recovers. Simply waiting for the network to recover may be a better idea. No messages are lost, and nodes are still fenced as needed until the cluster reaches a split-brain state, in which fencing can no longer happen. For that case, the TCP user timeout is used to override the TCP retransmit timeout. It expires after about 25 days; by then the user should have noticed the problem through the provided log and fixed the network. If they have not, ocfs2 falls back to the original reconnect behaviour.

The following patch series fixes the bug. Please help review.

Thanks,
Junxiao.
Junxiao Bi
2014-May-15 04:26 UTC
[Ocfs2-devel] [PATCH 1/3] ocfs2: o2net: don't shutdown connection when idle timeout
Some messages in the TCP queue may be lost if we shut down the connection and reconnect on idle timeout. If packets are lost and the reconnect succeeds, the ocfs2 cluster may hang. To fix this, leave the connection in place and make the fence decision when the idle timeout fires. If the network recovers before the fence decision is made, the connection survives without losing any messages.

Reviewed-by: Srinivas Eeda <srinivas.eeda at oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi at oracle.com>
---
 fs/ocfs2/cluster/tcp.c |   25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index c6b90e6..76ef3d8 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -1536,16 +1536,20 @@ static void o2net_idle_timer(unsigned long data)
 #endif

 	printk(KERN_NOTICE "o2net: Connection to " SC_NODEF_FMT " has been "
-	       "idle for %lu.%lu secs, shutting it down.\n", SC_NODEF_ARGS(sc),
-	       msecs / 1000, msecs % 1000);
+	       "idle for %lu.%lu secs.\n",
+	       SC_NODEF_ARGS(sc), msecs / 1000, msecs % 1000);

-	/*
-	 * Initialize the nn_timeout so that the next connection attempt
-	 * will continue in o2net_start_connect.
+	/* idle timerout happen, don't shutdown the connection, but
+	 * make fence decision. Maybe the connection can recover before
+	 * the decision is made.
 	 */
 	atomic_set(&nn->nn_timeout, 1);
+	o2quo_conn_err(o2net_num_from_nn(nn));
+	queue_delayed_work(o2net_wq, &nn->nn_still_up,
+			msecs_to_jiffies(O2NET_QUORUM_DELAY_MS));
+
+	o2net_sc_reset_idle_timer(sc);

-	o2net_sc_queue_work(sc, &sc->sc_shutdown_work);
 }

 static void o2net_sc_reset_idle_timer(struct o2net_sock_container *sc)
@@ -1560,6 +1564,15 @@ static void o2net_sc_reset_idle_timer(struct o2net_sock_container *sc)

 static void o2net_sc_postpone_idle(struct o2net_sock_container *sc)
 {
+	struct o2net_node *nn = o2net_nn_from_num(sc->sc_node->nd_num);
+
+	/* clear fence decision since the connection recover from timeout*/
+	if (atomic_read(&nn->nn_timeout)) {
+		o2quo_conn_up(o2net_num_from_nn(nn));
+		cancel_delayed_work(&nn->nn_still_up);
+		atomic_set(&nn->nn_timeout, 0);
+	}
+
 	/* Only push out an existing timer */
 	if (timer_pending(&sc->sc_idle_timeout))
 		o2net_sc_reset_idle_timer(sc);
--
1.7.9.5
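The behaviour change in patch 1/3 can be modelled in user space as a small state machine. This is only a sketch under stated assumptions: `struct conn_model` and its fields are hypothetical stand-ins for the kernel's per-node state, the quorum and timer machinery is reduced to flags and a counter, and the model re-arms the idle timer on every message, whereas the patch only pushes out a timer that is already pending.

```c
#include <stdbool.h>

/* Hypothetical user-space model of one peer connection. The field names
 * are illustrative; this is not the kernel's struct o2net_node. */
struct conn_model {
	bool timed_out;         /* models nn_timeout */
	bool quorum_pending;    /* models the delayed nn_still_up work */
	bool connected;         /* the socket stays open in the patched scheme */
	int  idle_timer_resets; /* how often the idle timer was re-armed */
};

static void conn_init(struct conn_model *c)
{
	c->timed_out = false;
	c->quorum_pending = false;
	c->connected = true;
	c->idle_timer_resets = 0;
}

/* Idle timer fired: flag the timeout, schedule the quorum decision and
 * re-arm the timer. The connection itself is deliberately left open. */
static void on_idle_timeout(struct conn_model *c)
{
	c->timed_out = true;
	c->quorum_pending = true;
	c->idle_timer_resets++;
}

/* A message arrived: if a timeout was flagged, withdraw the pending
 * quorum decision, then push the idle timer out again. */
static void on_message_received(struct conn_model *c)
{
	if (c->timed_out) {
		c->quorum_pending = false;
		c->timed_out = false;
	}
	c->idle_timer_resets++;
}
```

Calling `on_idle_timeout()` and then `on_message_received()` leaves the model connected with no pending quorum decision, which is the "connection survives" case described in the commit message.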
Junxiao Bi
2014-May-15 04:26 UTC
[Ocfs2-devel] [PATCH 2/3] ocfs2: o2net: set tcp user timeout to max value
When the TCP retransmit timeout (15 minutes) expires, the connection is closed, and pending messages may be lost. So we set the TCP user timeout to its maximum value to override the retransmit timeout. This is OK for ocfs2 because we have the disk heartbeat: if a peer crashes, its disk heartbeat times out and it is evicted. If the disk heartbeat does not time out but the connection stays idle for a long time, the cluster has entered a split-brain state; since fencing cannot happen, it is better to keep the connection and wait for the network to recover.

Reviewed-by: Srinivas Eeda <srinivas.eeda at oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi at oracle.com>
---
 fs/ocfs2/cluster/tcp.c |   20 ++++++++++++++++++++
 fs/ocfs2/cluster/tcp.h |    1 +
 2 files changed, 21 insertions(+)

diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 76ef3d8..eae58d8 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -1480,6 +1480,14 @@ static int o2net_set_nodelay(struct socket *sock)
 	return ret;
 }

+static int o2net_set_usertimeout(struct socket *sock)
+{
+	int user_timeout = O2NET_TCP_USER_TIMEOUT;
+
+	return kernel_setsockopt(sock, SOL_TCP, TCP_USER_TIMEOUT,
+				(char *)&user_timeout, sizeof(user_timeout));
+}
+
 static void o2net_initialize_handshake(void)
 {
 	o2net_hand->o2hb_heartbeat_timeout_ms = cpu_to_be32(
@@ -1663,6 +1671,12 @@ static void o2net_start_connect(struct work_struct *work)
 		goto out;
 	}

+	ret = o2net_set_usertimeout(sock);
+	if (ret) {
+		mlog(ML_ERROR, "set TCP_USER_TIMEOUT failed with %d\n", ret);
+		goto out;
+	}
+
 	o2net_register_callbacks(sc->sc_sock->sk, sc);

 	spin_lock(&nn->nn_lock);
@@ -1842,6 +1856,12 @@ static int o2net_accept_one(struct socket *sock)
 		goto out;
 	}

+	ret = o2net_set_usertimeout(new_sock);
+	if (ret) {
+		mlog(ML_ERROR, "set TCP_USER_TIMEOUT failed with %d\n", ret);
+		goto out;
+	}
+
 	slen = sizeof(sin);
 	ret = new_sock->ops->getname(new_sock, (struct sockaddr *) &sin,
 				       &slen, 1);
diff --git a/fs/ocfs2/cluster/tcp.h b/fs/ocfs2/cluster/tcp.h
index 5bada2a..c571e84 100644
--- a/fs/ocfs2/cluster/tcp.h
+++ b/fs/ocfs2/cluster/tcp.h
@@ -63,6 +63,7 @@ typedef void (o2net_post_msg_handler_func)(int status, void *data,

 #define O2NET_KEEPALIVE_DELAY_MS_DEFAULT	2000
 #define O2NET_IDLE_TIMEOUT_MS_DEFAULT		30000
+#define O2NET_TCP_USER_TIMEOUT			0x7fffffff

 /* TODO: figure this out.... */
 static inline int o2net_link_down(int err, struct socket *sock)
--
1.7.9.5
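For readers who want to try the socket option outside the kernel, the following user-space sketch sets the same value with `setsockopt(2)` and checks where the "25 days" in the cover letter comes from. It assumes a Linux system, where `TCP_USER_TIMEOUT` takes a value in milliseconds; `USER_TIMEOUT_MS`, `timeout_ms_to_days()` and `set_user_timeout()` are names made up for this example and are not part of the patch.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Same value as O2NET_TCP_USER_TIMEOUT in the patch, in milliseconds. */
#define USER_TIMEOUT_MS 0x7fffffffu

/* Convert a millisecond timeout to days, to sanity-check the "25 days". */
static double timeout_ms_to_days(unsigned int ms)
{
	return (double)ms / (1000.0 * 60.0 * 60.0 * 24.0);
}

/* Set TCP_USER_TIMEOUT on a TCP socket. Returns 0 on success, or -1 with
 * errno set, exactly as setsockopt(2) does. */
static int set_user_timeout(int fd, unsigned int ms)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &ms, sizeof(ms));
}
```

`timeout_ms_to_days(USER_TIMEOUT_MS)` evaluates to roughly 24.86, i.e. 2147483647 ms divided by 86400000 ms per day, which the cover letter rounds to 25 days.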
Junxiao Bi
2014-May-15 04:26 UTC
[Ocfs2-devel] [PATCH 3/3] ocfs2: quorum: add a log for node not fenced
For debugging: the log shows whether the fence decision was made and, if the node was not fenced, why.

Reviewed-by: Srinivas Eeda <srinivas.eeda at oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi at oracle.com>
---
 fs/ocfs2/cluster/quorum.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/cluster/quorum.c b/fs/ocfs2/cluster/quorum.c
index 1ec141e..62e8ec6 100644
--- a/fs/ocfs2/cluster/quorum.c
+++ b/fs/ocfs2/cluster/quorum.c
@@ -160,9 +160,18 @@ static void o2quo_make_decision(struct work_struct *work)
 	}

 out:
-	spin_unlock(&qs->qs_lock);
-	if (fence)
+	if (fence) {
+		spin_unlock(&qs->qs_lock);
 		o2quo_fence_self();
+	} else {
+		mlog(ML_NOTICE, "not fencing this node, heartbeating: %d, "
+			"connected: %d, lowest: %d (%sreachable)\n",
+			qs->qs_heartbeating, qs->qs_connected, lowest_hb,
+			lowest_reachable ? "" : "un");
+		spin_unlock(&qs->qs_lock);
+
+	}
+
 }

 static void o2quo_set_hold(struct o2quo_state *qs, u8 node)
--
1.7.9.5
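To show what the new notice looks like, here is a user-space sketch that builds the same message body with `snprintf`. The helper name `format_not_fencing()` and the sample values are assumptions for illustration only; in the kernel the text goes through `mlog()`, which may add its own prefix.

```c
#include <stdbool.h>
#include <stdio.h>

/* Build the body of the "not fencing" notice using the format string from
 * the patch. Returns the number of characters snprintf would have written. */
static int format_not_fencing(char *buf, size_t len, int heartbeating,
			      int connected, int lowest_hb,
			      bool lowest_reachable)
{
	return snprintf(buf, len,
			"not fencing this node, heartbeating: %d, "
			"connected: %d, lowest: %d (%sreachable)\n",
			heartbeating, connected, lowest_hb,
			lowest_reachable ? "" : "un");
}
```

Searching the kernel log for "not fencing this node" should then be enough to find the quorum decisions that left a node running.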
Joseph Qi
2014-May-15 08:27 UTC
[Ocfs2-devel] [PATCH 0/3] ocfs2: o2net: fix packets lost issue when reconnect
On 2014/5/15 12:26, Junxiao Bi wrote:
> Hi,
>
> After the tcp connection is established between two ocfs2 nodes, an idle
> timer will be set to check its state periodically, if no messages are
> received during this time, idle timer will timeout, it will shutdown
> the connection and try to rebuild, so pending message in tcp queues will
> be lost. This may cause the whole ocfs2 cluster hung.
> This is very possible to happen when network state goes bad. Do the
> reconnect is useless, it will fail if network state doesn't recover.
> Just waiting there for network recovering may be a good idea, it will
> not lost messages and some node will be fenced until cluster goes into
> split-brain state, for this case, Tcp user timeout is used to override
> the tcp retransmit timeout. It will timeout after 25 days, user should
> have notice this through the provided log and fix the network, if they
> don't, ocfs2 will fall back to original reconnect way.
> The following is the serial of patches to fix the bug. Please help review.

TCP RTT is auto-regressive, which means the following case may occur: suppose the current retransmission interval is ΔT (somewhat long), and the network recovers but goes down again before the next retransmission window arrives (< ΔT). The recovery then goes undetected and the ocfs2 cluster still hangs.

> Thanks,
> Junxiao.
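The scenario Joseph describes can be illustrated with a toy simulation of retransmission backoff. This is a sketch under assumed numbers, not a model of the real TCP stack: it assumes an initial retransmission timeout of 1 s that doubles after each failed attempt and is capped at 120 s, whereas real TCP derives the timeout from measured round-trip times. `recovery_detected()` and the two constants are hypothetical names.

```c
#include <stdbool.h>

/* Assumed values for illustration: the retransmission timeout starts at
 * 1 s, doubles after every failed attempt, and is capped at 120 s. */
#define RTO_INITIAL_S 1
#define RTO_MAX_S     120

/* Return true if any retransmission fires while the network is up, i.e.
 * at a time t with up_start <= t < up_end. Times are whole seconds since
 * the first lost transmission; horizon bounds the simulation. */
static bool recovery_detected(long up_start, long up_end, long horizon)
{
	long t = 0;
	long rto = RTO_INITIAL_S;

	while (t < horizon) {
		t += rto;		/* time of the next retransmission */
		if (t >= up_start && t < up_end)
			return true;
		rto *= 2;		/* exponential backoff */
		if (rto > RTO_MAX_S)
			rto = RTO_MAX_S;
	}
	return false;
}
```

With these assumed values, retransmissions fall at 1, 3, 7, 15, 31, 63, 127, 247, ... seconds, so a network that comes back at 130 s and fails again before 247 s is never probed, and `recovery_detected(130, 240, 1000)` returns false.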
Junxiao Bi
2014-Jun-06 02:18 UTC
[Ocfs2-devel] [PATCH 0/3] ocfs2: o2net: fix packets lost issue when reconnect
Hi Mark & Andrew,

Could you help review this patch series? The bug can be seen when the network goes bad. It may cause ocfs2 to hang forever if some packets are lost. With this fix, ocfs2 recovers from the hang once the network is healthy again.

Thanks,
Junxiao.