James Harrell
2009-Nov-06 21:45 UTC
[Ocfs2-users] iscsi connection drop, comes back in seconds, then deadlock in cluster
Greetings ocfs2 folks, A client is experiencing some random deadlock issues within a cluster, wondering if anyone can point us in the right direction. The iSCSI connection seemed to have dropped on one node briefly, ultimately several hours later landing us in a complete deadlock scenario where multiple nodes (Node 7 and Node 8) had to be panic'd (by hand - they didn't ever panic on their own) in order to get the cluster back. The would not unmount the file system, so we hard-panic'd (insmod panic.ko). Vitals: - Linux mgr01 2.6.18-164.2.1.el5 #1 SMP Wed Sep 30 12:52:46 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux - ocfs2-2.6.18-164.2.1.el5-1.4.4-1.el5 - MD3000i iSCSI - O2CB_HEARTBEAT_THRESHOLD=61 - O2CB_IDLE_TIMEOUT_MS=30000 - O2CB_KEEPALIVE_DELAY_MS=2000 - O2CB_RECONNECT_DELAY_MS=2000 - iSCSI node.session.timeo.replacement_timeout = 120 - iSCSI node.conn[0].timeo.noop_out_interval = 5 - node.conn[0].timeo.noop_out_timeout = 5 Logs at the time of failure on mgr01: Nov 6 01:00:12 mgr01 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4325764337, last ping 4325769337, now 4325774337 Nov 6 01:00:12 mgr01 kernel: connection1:0: detected conn error (1011) Nov 6 01:00:13 mgr01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3) Nov 6 01:00:15 mgr01 iscsid: connection1:0 is operational after recovery (1 attempts) Nov 6 01:00:38 mgr01 kernel: o2net: no longer connected to node rack105 (num 7) at 10.244.1.105:7777 Nov 6 01:00:38 mgr01 kernel: (3270,0):dlm_send_remote_convert_request:395 ERROR: status = -112 Nov 6 01:00:38 mgr01 kernel: (3270,0):dlm_wait_for_node_death:370 4FF4E858AF6E4AEEB2650A543A320C2F: waiting 5000ms for notification of death of node 7 Nov 6 01:00:38 mgr01 kernel: o2net: accepted connection from node rack105 (num 7) at 10.244.1.105:7777 Nov 6 01:00:46 mgr01 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4325798155, last ping 4325803155, now 4325808155 Nov 6 01:00:46 mgr01 kernel: connection1:0: detected conn error (1011) Nov 6 01:00:46 mgr01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3) Nov 6 01:00:52 mgr01 iscsid: connection1:0 is operational after recovery (1 attempts) Nov 6 01:01:22 mgr01 kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4325834382, last ping 4325839382, now 4325844382 Nov 6 01:01:22 mgr01 kernel: connection1:0: detected conn error (1011) Nov 6 01:01:23 mgr01 iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3) Nov 6 01:01:28 mgr01 kernel: connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4325840761, last ping 4325845761, now 4325850761 Nov 6 01:01:28 mgr01 kernel: connection2:0: detected conn error (1011) Nov 6 01:01:29 mgr01 iscsid: Kernel reported iSCSI connection 2:0 error (1011) state (3) Nov 6 01:01:36 mgr01 kernel: o2net: no longer connected to node rack107 (num 9) at 10.244.1.107:7777 Nov 6 01:01:36 mgr01 kernel: o2net: accepted connection from node rack107 (num 9) at 10.244.1.107:7777 Nov 6 01:01:40 mgr01 kernel: o2net: no longer connected to node rack105 (num 7) at 10.244.1.105:7777 Nov 6 01:01:40 mgr01 kernel: o2net: accepted connection from node rack105 (num 7) at 10.244.1.105:7777 Nov 6 01:01:41 mgr01 iscsid: connection2:0 is operational after recovery (1 attempts) Nov 6 01:01:42 mgr01 kernel: o2net: no longer connected to node db01 (num 1) at 10.244.1.120:7777 Nov 6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack104 (num 6) at 10.244.1.104:7777 Nov 6 01:01:42 mgr01 kernel: o2net: accepted connection from node rack104 (num 6) at 10.244.1.104:7777 Nov 6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack103 (num 5) at 10.244.1.103:7777 Nov 6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack101 (num 3) at 10.244.1.101:7777 Nov 6 01:01:42 mgr01 kernel: o2net: accepted connection from node rack101 (num 3) at 10.244.1.101:7777 Nov 6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack106 (num 8) at 10.244.1.106:7777 Nov 6 01:01:42 mgr01 kernel: (3216,0):dlm_send_remote_convert_request:395 ERROR: status = -112 Nov 6 01:01:42 mgr01 kernel: (3216,0):dlm_wait_for_node_death:370 5B4A18B8C0B64B05BAB5F5AA1C93D7E0: waiting 5000ms for notification of death of node 8 Logs at time of failure on Node 7 (rack105): Nov 6 01:00:38 rack105 kernel: o2net: connection to node mgr01 (num 0) at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down. Nov 6 01:00:38 rack105 kernel: (0,1):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1257498008.773099 now 1257498038.772814 dr 1257498008.773068 adv 1257498008.773099:1257498008.773100 func (6055cf71:502) 1257498002.348395:1257498002.348398) Nov 6 01:00:38 rack105 kernel: o2net: no longer connected to node mgr01 (num 0) at 10.244.1.100:7777 Nov 6 01:00:38 rack105 kernel: (4209,6):dlm_send_proxy_ast_msg:458 ERROR: status = -112 Nov 6 01:00:38 rack105 kernel: (4209,6):dlm_flush_asts:600 ERROR: status = -112 Nov 6 01:00:38 rack105 kernel: o2net: connected to node mgr01 (num 0) at 10.244.1.100:7777 Nov 6 01:01:40 rack105 kernel: o2net: connection to node mgr01 (num 0) at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down. Nov 6 01:01:40 rack105 kernel: (0,2):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1257498070.407069 now 1257498100.406902 dr 1257498070.407060 adv 1257498070.407078:1257498070.407079 func (098ec5c8:504) 1257498043.875896:1257498043.875904) Logs at time of failure on Node 8 (rack106): Nov 6 01:01:42 rack106 kernel: o2net: connection to node mgr01 (num 0) at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down. Nov 6 01:01:42 rack106 kernel: (0,0):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1257498072.440067 now 1257498102.440021 dr 1257498072.440044 adv 1257498072.440067:1257498072.440068 func (6055cf71:502) 1257498001.698233:1257498001.698236) Nov 6 01:01:42 rack106 kernel: o2net: no longer connected to node mgr01 (num 0) at 10.244.1.100:7777 Nov 6 01:01:42 rack106 kernel: (4162,2):dlm_send_proxy_ast_msg:458 ERROR: status = -112 Nov 6 01:01:42 rack106 kernel: (4162,2):dlm_flush_asts:600 ERROR: status = -112 Nov 6 01:01:42 rack106 kernel: (4162,2):dlm_send_proxy_ast_msg:458 ERROR: status = -107 Nov 6 01:01:42 rack106 kernel: (4162,2):dlm_flush_asts:600 ERROR: status = -107 Nov 6 01:01:42 rack106 kernel: o2net: connected to node mgr01 (num 0) at 10.244.1.100:7777 Nov 6 01:02:20 rack106 kernel: o2net: connection to node mgr01 (num 0) at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down. Nov 6 01:02:20 rack106 kernel: (0,2):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1257498110.160230 now 1257498140.160050 dr 1257498110.160211 adv 1257498110.160239:1257498110.160240 func (6055cf71:504) 1257498108.160393:1257498108.160401) Nov 6 01:02:20 rack106 kernel: o2net: no longer connected to node mgr01 (num 0) at 10.244.1.100:7777 Nov 6 01:02:20 rack106 kernel: o2net: connected to node mgr01 (num 0) at 10.244.1.100:7777 Nov 6 01:02:55 rack106 kernel: o2net: connection to node mgr01 (num 0) at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down. Nov 6 01:02:55 rack106 kernel: (0,2):o2net_idle_timer:1503 here are some times that might help debug the situation: (tmr 1257498145.231769 now 1257498175.231038 dr 1257498145.231761 adv 1257498145.231781:1257498145.231781 func (00000000:0) 0.0:0.0) Nov 6 01:02:55 rack106 kernel: o2net: no longer connected to node mgr01 (num 0) at 10.244.1.100:7777 Nov 6 01:02:58 rack106 kernel: o2net: connected to node mgr01 (num 0) at 10.244.1.100:7777 Thanks, james