Guozhonghua
2014-Apr-04 09:17 UTC
[Ocfs2-users] One issue report: is there any advice on the code reviewed? Thanks
Hi, everyone,

We set up 8 nodes running OCFS2 as a storage pool providing a storage service. The test environment runs Ubuntu with kernel version 3.13.6. After node 7 rebooted, all of the other nodes blocked. They race to become the master of DLM_RECOVERY_LOCK_NAME ($RECOVERY), but none of them ever succeeds, and all of them keep looping and waiting. So every surviving node runs in an endless loop, printing the messages below. (The debug level was raised at about 18:00:00 on Apr 2, so the detailed debug output begins around 18:02:16.)

Apr 2 15:24:21 node-01 kernel: [64409.487556] o2cb: o2dlm has evicted node 7 from domain 2D2B1913CA08467896AC80B2F1AA80DB
Apr 2 15:24:26 node-01 kernel: [64414.350643] o2dlm: Begin recovery on domain 2D2B1913CA08467896AC80B2F1AA80DB for node 7
Apr 2 18:02:16 node-01 kernel: [73879.177060] (dlm_reco_thread,7871,3):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
Apr 2 18:02:16 node-01 kernel: [73879.177068] (dlm_reco_thread,7871,3):dlm_wait_for_lock_mastery:1046 map not changed and voting not done for 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY
Apr 2 18:02:21 node-01 kernel: [73884.174312] (dlm_reco_thread,7871,4):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
...... same as above.

Apr 2 15:24:09 node-02 kernel: [330500.807738] o2cb: o2dlm has evicted node 7 from domain 2D2B1913CA08467896AC80B2F1AA80DB
Apr 2 15:24:13 node-02 kernel: [330504.679296] o2dlm: Begin recovery on domain 2D2B1913CA08467896AC80B2F1AA80DB for node 7
Apr 2 18:57:38 node-02 kernel: [343302.676048] (dlm_reco_thread,39426,9):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
Apr 2 18:57:38 node-02 kernel: [343302.676055] (dlm_reco_thread,39426,9):dlm_wait_for_lock_mastery:1046 map not changed and voting not done for 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY
Apr 2 18:57:43 node-02 kernel: [343307.673271] (dlm_reco_thread,39426,10):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
...... same as above.

Apr 2 15:24:47 node-03 kernel: [105816.250867] o2cb: o2dlm has evicted node 7 from domain 2D2B1913CA08467896AC80B2F1AA80DB
Apr 2 15:24:49 node-03 kernel: [105818.291396] (dlm_thread,6493,8):dlm_send_proxy_ast_msg:482 ERROR: 2D2B1913CA08467896AC80B2F1AA80DB: res M00000000000000000002084cc0d288, error -107 send AST to node 4
Apr 2 15:24:50 node-03 kernel: [105819.679081] o2dlm: Begin recovery on domain 2D2B1913CA08467896AC80B2F1AA80DB for node 7
Apr 2 18:58:46 node-03 kernel: [118649.084903] (dlm_reco_thread,6494,9):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
Apr 2 18:58:46 node-03 kernel: [118649.084911] (dlm_reco_thread,6494,9):dlm_wait_for_lock_mastery:1046 map not changed and voting not done for 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY
Apr 2 18:58:51 node-03 kernel: [118654.082626] (dlm_reco_thread,6494,9):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
...... same as above.
Apr 2 15:24:50 node-04 kernel: [330501.154090] o2cb: o2dlm has evicted node 7 from domain 2D2B1913CA08467896AC80B2F1AA80DB
Apr 2 15:24:51 node-04 kernel: [330502.229376] o2dlm: Begin recovery on domain 2D2B1913CA08467896AC80B2F1AA80DB for node 7
Apr 2 18:59:09 node-04 kernel: [343353.348762] (dlm_reco_thread,38197,2):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
Apr 2 18:59:14 node-04 kernel: [343358.345980] (dlm_reco_thread,38197,2):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
Apr 2 18:59:19 node-04 kernel: [343363.343207] (dlm_reco_thread,38197,2):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
Apr 2 18:59:24 node-04 kernel: [343368.340485] (dlm_reco_thread,38197,3):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
...... same as above.

Apr 2 15:24:18 node-05 kernel: [330489.928160] o2cb: o2dlm has evicted node 7 from domain 2D2B1913CA08467896AC80B2F1AA80DB
Apr 2 15:24:21 node-05 kernel: [330492.969521] o2dlm: Waiting on the recovery of node 7 in domain 2D2B1913CA08467896AC80B2F1AA80DB
Apr 2 15:24:22 node-05 kernel: [330494.168849] o2dlm: Begin recovery on domain 2D2B1913CA08467896AC80B2F1AA80DB for node 7
Apr 2 18:58:52 node-05 kernel: [343357.161638] (dlm_reco_thread,24064,6):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
Apr 2 18:58:52 node-05 kernel: [343357.161644] (dlm_reco_thread,24064,6):dlm_wait_for_lock_mastery:1046 map not changed and voting not done for 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY
Apr 2 18:58:57 node-05 kernel: [343362.158919] (dlm_reco_thread,24064,7):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
...... same as above.

Apr 2 15:23:24 node-06 kernel: [330529.804363] o2cb: o2dlm has evicted node 7 from domain 2D2B1913CA08467896AC80B2F1AA80DB
Apr 2 15:23:28 node-06 kernel: [330533.733262] o2dlm: Begin recovery on domain 2D2B1913CA08467896AC80B2F1AA80DB for node 7
Apr 2 18:58:08 node-06 kernel: [343408.025502] (dlm_reco_thread,28213,5):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
Apr 2 18:58:08 node-06 kernel: [343408.025509] (dlm_reco_thread,28213,5):dlm_wait_for_lock_mastery:1046 map not changed and voting not done for 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY
Apr 2 18:58:13 node-06 kernel: [343413.023209] (dlm_reco_thread,28213,0):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again

Apr 2 18:24:26 node-07 kernel: [ 4010.492579] (mount.ocfs2,11993,2):dlm_join_domain:1932 Timed out joining dlm domain 2D2B1913CA08467896AC80B2F1AA80DB after 91200 msecs
Apr 2 18:33:04 node-07 kernel: [ 4528.414250] (mount.ocfs2,31116,11):dlm_register_domain:2138 register called for domain "2D2B1913CA08467896AC80B2F1AA80DB"
... ... ... ...
Apr 2 18:33:04 node-07 kernel: [ 4528.415465] (mount.ocfs2,31116,11):sc_put:417 [sc ffff8817f8f98000 refs 4 sock ffff8817fd3b8f00 node 1 page ffffea005fc67240 pg_off 0] put
Apr 2 18:33:04 node-07 kernel: [ 4528.415469] (mount.ocfs2,31116,11):dlm_request_join:1522 status 0, node 1 response is 0

Here node 1 disallows node 7 from joining the domain.
Apr 2 18:33:04 node-07 kernel: [ 4528.415471] (mount.ocfs2,31116,11):dlm_should_restart_join:1598 Latest response of disallow -- should restart
Apr 2 18:33:04 node-07 kernel: [ 4528.415474] (mount.ocfs2,31116,11):dlm_try_to_join_domain:1724 returning -11
Apr 2 18:33:04 node-07 kernel: [ 4528.415476] (mount.ocfs2,31116,11):dlm_join_domain:1946 backoff 600
... ... ... ...
Apr 2 18:33:28 node-07 kernel: [ 4551.869493] (mount.ocfs2,31116,2):dlm_join_domain:1932 Timed out joining dlm domain 2D2B1913CA08467896AC80B2F1AA80DB after 91200 msecs
... ... ... ...
Apr 2 19:00:26 node-07 kernel: [ 6168.900707] (mount.ocfs2,19177,6):dlm_ctxt_release:338 freeing memory from domain 2D2B1913CA08467896AC80B2F1AA80DB

Apr 2 15:23:42 node-08 kernel: [330502.270181] o2cb: o2dlm has evicted node 7 from domain 2D2B1913CA08467896AC80B2F1AA80DB
Apr 2 15:23:43 node-08 kernel: [330503.054257] o2dlm: Begin recovery on domain 2D2B1913CA08467896AC80B2F1AA80DB for node 7
Apr 2 18:58:48 node-08 kernel: [343401.119756] (dlm_reco_thread,61931,10):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
Apr 2 18:58:48 node-08 kernel: [343401.119763] (dlm_reco_thread,61931,10):dlm_wait_for_lock_mastery:1046 map not changed and voting not done for 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY
Apr 2 18:58:53 node-08 kernel: [343406.116991] (dlm_reco_thread,61931,11):dlm_wait_for_lock_mastery:1092 2D2B1913CA08467896AC80B2F1AA80DB:$RECOVERY: waiting again
...... same as above.

Having reviewed the code several times, I think there may be a bug in the following function. If there is a network failure and no TCP message is received (the packet is lost), the node map never changes, so the master race should be re-triggered instead of rechecking forever:

static int dlm_wait_for_lock_mastery(struct dlm_ctxt *dlm,
				     struct dlm_lock_resource *res,
				     struct dlm_master_list_entry *mle,
				     int *blocked)
{
	... ... ...
	if (res->owner == O2NM_MAX_NODES) {
		mlog(0, "%s:%.*s: waiting again\n", dlm->name,
		     res->lockname.len, res->lockname.name);
		/*
		 * If there is a network failure and no TCP message is
		 * received (the packet was lost), the map never changes,
		 * so the master race should be re-triggered.
		 */
-		goto recheck;
+		ret = -EAGAIN;
+		goto leave;
	}
	... ... ...
}
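To make the failure mode easier to see, here is a minimal, self-contained sketch in plain userspace C. This is not the kernel code: wait_for_mastery(), MAX_RECHECKS, OWNER_UNKNOWN and the stub predicates are names made up for this illustration. It contrasts the unbounded recheck loop with a bounded one that returns -EAGAIN so the caller can re-trigger the master race:

#include <stdio.h>
#include <errno.h>
#include <stdbool.h>

#define OWNER_UNKNOWN	(-1)
#define MAX_RECHECKS	10	/* arbitrary retry budget for the sketch */

/* Stubs standing in for cluster state; with messages lost, nothing changes. */
static bool map_changed(void)	{ return false; }
static bool voting_done(void)	{ return false; }
static int  lock_owner(void)	{ return OWNER_UNKNOWN; }

static int wait_for_mastery(void)
{
	int rechecks = 0;

recheck:
	/* The real code waits on a timeout here before rechecking. */
	if (map_changed() || voting_done())
		return 0;		/* mastery was resolved elsewhere */

	if (lock_owner() == OWNER_UNKNOWN) {
		if (++rechecks < MAX_RECHECKS)
			goto recheck;	/* current behaviour: loop again */
		return -EAGAIN;		/* proposed: give up, restart race */
	}
	return 0;
}

int main(void)
{
	if (wait_for_mastery() == -EAGAIN)
		printf("mastery unresolved; re-trigger the $RECOVERY race\n");
	return 0;
}

As far as I can tell, the shape is the same as in dlm_wait_for_lock_mastery(): if the lost message never arrives, neither the node map nor the voting state ever changes, so the unbounded loop never exits, which matches the "waiting again" messages repeating for hours in the logs above.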