We are running OCFS2 on SLES9 machines using an FC SAN. Without warning, both
nodes become unresponsive. We cannot access either machine via ssh or the
terminal (the login hangs after the username is typed), although the machines
still respond to pings. This continues until one node is rebooted, at which
point the second node resumes normal operation.
I am not entirely sure that this is an OCFS2 problem at all; however, the syslog
shows it had issues. Here is the log from the node that was not rebooted. The
node that was rebooted contained no log information. The system appears to have
gone down at about 3 AM and stayed down until the node was rebooted at around 7:15.
Mar 8 03:06:32 groupwise-1-mht kernel: o2net: connection to node
groupwise-2-mht (num 2) at 192.168.1.3:7777 has been idle for 10 seconds,
shutting it down.
Mar 8 03:06:32 groupwise-1-mht kernel: (0,2):o2net_idle_timer:1310 here are
some times that might help debug the situation: (tmr 1173341182.367220 now
1173341192.367244 dr 1173341182.367213 adv 1173341182.367228:1173341182.367229
func (05ce6220:2) 1173341182.367221:1173341182.367224)
Mar 8 03:06:32 groupwise-1-mht kernel: o2net: no longer connected to node
groupwise-2-mht (num 2) at 192.168.1.3:7777
Mar 8 03:06:32 groupwise-1-mht kernel: (499,0):dlm_do_master_request:1330
ERROR: link to 2 went down!
Mar 8 03:06:32 groupwise-1-mht kernel: (499,0):dlm_get_lock_resource:914 ERROR:
status = -112
Mar 8 03:13:02 groupwise-1-mht kernel: (8476,0):dlm_send_proxy_ast_msg:458
ERROR: status = -107
Mar 8 03:13:02 groupwise-1-mht kernel: (8476,0):dlm_flush_asts:607 ERROR:
status = -107
Mar 8 03:19:54 groupwise-1-mht kernel:
(147,1):dlm_send_remote_unlock_request:356 ERROR: status = -107
Mar 8 03:19:54 groupwise-1-mht last message repeated 127 times
Mar 8 03:19:55 groupwise-1-mht kernel: (873,0):dlm_do_master_request:1330
ERROR: link to 2 went down!
Mar 8 03:19:55 groupwise-1-mht kernel: (873,0):dlm_get_lock_resource:914 ERROR:
status = -107
Mar 8 03:19:55 groupwise-1-mht kernel: (901,0):dlm_do_master_request:1330
ERROR: link to 2 went down!
Mar 8 03:19:55 groupwise-1-mht kernel: (901,0):dlm_get_lock_resource:914 ERROR:
status = -107
Mar 8 03:19:56 groupwise-1-mht kernel: (929,0):dlm_do_master_request:1330
ERROR: link to 2 went down!
Mar 8 03:19:56 groupwise-1-mht kernel: (929,0):dlm_get_lock_resource:914 ERROR:
status = -107
Mar 8 03:45:29 groupwise-1-mht -- MARK --
Mar 8 04:15:02 groupwise-1-mht kernel:
(147,1):dlm_send_remote_unlock_request:356 ERROR: status = -107
Mar 8 04:15:03 groupwise-1-mht last message repeated 383 times
Mar 8 06:27:54 groupwise-1-mht kernel:
(147,1):dlm_send_remote_unlock_request:356 ERROR: status = -107
Mar 8 06:27:54 groupwise-1-mht last message repeated 127 times
Mar 8 06:27:54 groupwise-1-mht kernel:
(147,1):dlm_send_remote_unlock_request:356 ERROR: status = -107
Mar 8 06:27:54 groupwise-1-mht last message repeated 127 times
Mar 8 06:35:48 groupwise-1-mht kernel: (8872,0):dlm_do_master_request:1330
ERROR: link to 2 went down!
Mar 8 06:35:48 groupwise-1-mht kernel: (8872,0):dlm_get_lock_resource:914
ERROR: status = -107
Mar 8 06:52:45 groupwise-1-mht kernel: (8861,0):dlm_do_master_request:1330
ERROR: link to 2 went down!
Mar 8 06:52:45 groupwise-1-mht kernel: (8861,0):dlm_get_lock_resource:914
ERROR: status = -107
Mar 8 06:54:11 groupwise-1-mht kernel: (8854,3):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 06:54:11 groupwise-1-mht kernel: (8854,3):ocfs2_do_request_vote:798
ERROR: status = -107
Mar 8 06:54:11 groupwise-1-mht kernel: (8854,3):ocfs2_unlink:840 ERROR: status
= -107
Mar 8 06:54:18 groupwise-1-mht kernel: (8855,0):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 06:54:18 groupwise-1-mht kernel: (8855,0):ocfs2_do_request_vote:798
ERROR: status = -107
Mar 8 06:54:18 groupwise-1-mht kernel: (8855,0):ocfs2_unlink:840 ERROR: status
= -107
Mar 8 06:54:18 groupwise-1-mht kernel: (8855,0):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 06:54:18 groupwise-1-mht kernel: (8855,0):ocfs2_do_request_vote:798
ERROR: status = -107
Mar 8 06:54:18 groupwise-1-mht kernel: (8855,0):ocfs2_unlink:840 ERROR: status
= -107
Mar 8 06:54:58 groupwise-1-mht kernel: (8853,0):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 06:54:58 groupwise-1-mht kernel: (8853,0):ocfs2_do_request_vote:798
ERROR: status = -107
Mar 8 06:54:58 groupwise-1-mht kernel: (8853,0):ocfs2_unlink:840 ERROR: status
= -107
Mar 8 07:09:41 groupwise-1-mht kernel: (4192,0):dlm_do_master_request:1330
ERROR: link to 2 went down!
Mar 8 07:09:41 groupwise-1-mht kernel: (4192,0):dlm_get_lock_resource:914
ERROR: status = -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_do_request_vote:798
ERROR: status = -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_unlink:840 ERROR: status
= -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_do_request_vote:798
ERROR: status = -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_unlink:840 ERROR: status
= -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_do_request_vote:798
ERROR: status = -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_unlink:840 ERROR: status
= -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_do_request_vote:798
ERROR: status = -107
Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_unlink:840 ERROR: status
= -107
Mar 8 07:15:50 groupwise-1-mht kernel: (4289,0):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 07:15:50 groupwise-1-mht kernel: (4289,0):ocfs2_do_request_vote:798
ERROR: status = -107
Mar 8 07:15:50 groupwise-1-mht kernel: (4289,0):ocfs2_unlink:840 ERROR: status
= -107
Mar 8 07:15:50 groupwise-1-mht kernel: (4289,0):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 07:15:50 groupwise-1-mht kernel: (4289,0):ocfs2_do_request_vote:798
ERROR: status = -107
Mar 8 07:15:50 groupwise-1-mht kernel: (4289,0):ocfs2_unlink:840 ERROR: status
= -107
Mar 8 07:16:13 groupwise-1-mht kernel: (4253,0):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 07:16:13 groupwise-1-mht kernel: (4253,0):ocfs2_do_request_vote:798
ERROR: status = -107
Mar 8 07:16:13 groupwise-1-mht kernel: (4253,0):ocfs2_unlink:840 ERROR: status
= -107
Mar 8 07:18:57 groupwise-1-mht kernel: (4341,0):dlm_do_master_request:1330
ERROR: link to 2 went down!
Mar 8 07:18:57 groupwise-1-mht kernel: (4341,0):dlm_get_lock_resource:914
ERROR: status = -107
Mar 8 07:19:24 groupwise-1-mht kernel: (4356,0):ocfs2_broadcast_vote:725 ERROR:
status = -107
Mar 8 07:19:24 groupwise-1-mht kernel: (4356,0):ocfs2_do_request_vote:798 ERROR: status = -107
Mar 8 07:19:24 groupwise-1-mht kernel: (4356,0):ocfs2_unlink:840 ERROR: status = -107
Mar 8 07:20:49 groupwise-1-mht sshd[4375]: Accepted publickey for root from
10.1.31.27 port 1752 ssh2
Mar 8 07:20:50 groupwise-1-mht kernel: (147,0):dlm_send_remote_unlock_request:356 ERROR: status = -107
Mar 8 07:20:50 groupwise-1-mht last message repeated 255 times
Mar 8 07:20:53 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:20:53 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:20:58 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:20:58 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:03 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:03 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:08 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:08 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:13 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:13 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:19 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:19 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:24 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:24 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:29 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:29 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:34 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:34 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:39 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:39 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:44 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:44 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:49 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:49 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:54 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:54 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:21:59 groupwise-1-mht kernel: (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:21:59 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of node 2
Mar 8 07:22:04 groupwise-1-mht kernel: (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:22:04 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:22:10 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:22:10 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:22:15 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:22:20 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:22:20 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:22:25 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:22:25 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:22:30 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:22:30 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:22:35 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:22:35 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:22:40 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:22:40 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:22:45 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:22:45 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:22:50 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:22:50 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:22:55 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:22:55 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:23:01 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:23:01 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:23:06 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:23:06 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:23:11 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:23:11 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:23:16 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:23:16 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:23:21 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:23:21 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:23:26 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:23:26 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:23:31 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:23:31 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:23:36 groupwise-1-mht kernel:
(4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
Mar 8 07:23:36 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death of
node 2
Mar 8 07:23:40 groupwise-1-mht kernel: (28613,2):dlm_get_lock_resource:847
B6ECAF5A668A4573AF763908F26958DB:$RECOVERY: at least one node (2) torecover
before lock mastery can begin
Mar 8 07:23:40 groupwise-1-mht kernel: (28613,2):dlm_get_lock_resource:874
B6ECAF5A668A4573AF763908F26958DB: recovery map is not empty, but must master
$RECOVERY lock now
Mar 8 07:23:41 groupwise-1-mht kernel: (4432,0):ocfs2_replay_journal:1176
Recovering node 2 from slot 1 on device (253,1)
Mar 8 07:23:41 groupwise-1-mht kernel: (4192,0):dlm_restart_lock_mastery:1214
ERROR: node down! 2
Mar 8 07:23:41 groupwise-1-mht kernel: (4192,0):dlm_wait_for_lock_mastery:1035
ERROR: status = -11
Mar 8 07:23:41 groupwise-1-mht kernel: (929,1):dlm_restart_lock_mastery:1214
ERROR: node down! 2
Mar 8 07:23:41 groupwise-1-mht kernel: (929,1):dlm_wait_for_lock_mastery:1035
ERROR: status = -11
Mar 8 07:23:42 groupwise-1-mht kernel: (4341,1):dlm_restart_lock_mastery:1214
ERROR: node down! 2
Mar 8 07:23:42 groupwise-1-mht kernel: (4341,1):dlm_wait_for_lock_mastery:1035
ERROR: status = -11
Mar 8 07:23:42 groupwise-1-mht kernel: (4341,1):dlm_restart_lock_mastery:1214
ERROR: node down! 2
Mar 8 07:23:42 groupwise-1-mht kernel: (4341,1):dlm_wait_for_lock_mastery:1035
ERROR: status = -11
Mar 8 07:23:42 groupwise-1-mht kernel: (4192,0):dlm_get_lock_resource:895
2062CE05ABA246988E9CCCDAE253F458:D000000000000000037872ff59e2a10: at least one
node (2) torecover before lock mastery can begin
Mar 8 07:23:42 groupwise-1-mht kernel: (499,1):dlm_restart_lock_mastery:1214
ERROR: node down! 2
Mar 8 07:23:42 groupwise-1-mht kernel: (499,1):dlm_wait_for_lock_mastery:1035
ERROR: status = -11
Mar 8 07:23:42 groupwise-1-mht kernel: (929,1):dlm_get_lock_resource:895
2062CE05ABA246988E9CCCDAE253F458:M0000000000000002d2ab960a02ee32: at least one
node (2) torecover before lock mastery can begin
Mar 8 07:23:43 groupwise-1-mht kernel: (4341,1):dlm_get_lock_resource:895
2062CE05ABA246988E9CCCDAE253F458:D00000000000000005ac8f593b44a80: at least one
node (2) torecover before lock mastery can begin
Mar 8 07:23:43 groupwise-1-mht kernel: (8872,1):dlm_restart_lock_mastery:1214
ERROR: node down! 2
Mar 8 07:23:43 groupwise-1-mht kernel: (8872,1):dlm_wait_for_lock_mastery:1035
ERROR: status = -11
Mar 8 07:23:43 groupwise-1-mht kernel: (499,1):dlm_get_lock_resource:895
2062CE05ABA246988E9CCCDAE253F458:D0000000000000000059e0c78635d25: at least one
node (2) torecover before lock mastery can begin
Mar 8 07:23:43 groupwise-1-mht kernel: (8223,2):ocfs2_dlm_eviction_cb:119
device (253,0): dlm has evicted node 2
Mar 8 07:23:43 groupwise-1-mht kernel: (4431,0):dlm_get_lock_resource:847
2062CE05ABA246988E9CCCDAE253F458:M000000000000000000001de83f8b74: at least one
node (2) torecover before lock mastery can begin
Mar 8 07:23:44 groupwise-1-mht kernel: (8872,1):dlm_get_lock_resource:895
2062CE05ABA246988E9CCCDAE253F458:D0000000000000000ce315c7764670d: at least one
node (2) torecover before lock mastery can begin
Mar 8 07:23:44 groupwise-1-mht kernel: (4431,0):dlm_get_lock_resource:895
2062CE05ABA246988E9CCCDAE253F458:M000000000000000000001de83f8b74: at least one
node (2) torecover before lock mastery can begin
Mar 8 07:23:44 groupwise-1-mht kernel: (873,1):dlm_restart_lock_mastery:1214
ERROR: node down! 2
Mar 8 07:23:49 groupwise-1-mht kernel: (873,1):dlm_wait_for_lock_mastery:1035
ERROR: status = -11
Mar 8 07:23:49 groupwise-1-mht kernel: (901,1):dlm_restart_lock_mastery:1214
ERROR: node down! 2
Mar 8 07:23:49 groupwise-1-mht kernel: (901,1):dlm_wait_for_lock_mastery:1035
ERROR: status = -11
Mar 8 07:23:49 groupwise-1-mht kernel: (8861,1):dlm_restart_lock_mastery:1214
ERROR: node down! 2
Mar 8 07:23:49 groupwise-1-mht kernel: (8861,1):dlm_wait_for_lock_mastery:1035
ERROR: status = -11
Mar 8 07:23:49 groupwise-1-mht kernel: (873,1):dlm_get_lock_resource:895
2062CE05ABA246988E9CCCDAE253F458:M0000000000000002fc058c0a084a80: at least one
node (2) torecover before lock mastery can begin
Mar 8 07:23:49 groupwise-1-mht kernel: (901,1):dlm_get_lock_resource:895
2062CE05ABA246988E9CCCDAE253F458:M0000000000000002ff18686a1b86f4: at least one
node (2) torecover before lock mastery can begin
Mar 8 07:23:49 groupwise-1-mht kernel: (8861,1):dlm_get_lock_resource:895
2062CE05ABA246988E9CCCDAE253F458:D0000000000000000b2f76e77647700: at least one
node (2) torecover before lock mastery can begin
Mar 8 07:23:49 groupwise-1-mht kernel: kjournald starting. Commit interval 5
seconds
Mar 8 07:23:49 groupwise-1-mht kernel: (4431,0):ocfs2_replay_journal:1176
Recovering node 2 from slot 1 on device (253,0)
Mar 8 07:23:55 groupwise-1-mht kernel: (fs/jbd/recovery.c, 255):
journal_recover: JBD: recovery, exit status 0, recovered transactions 599034 to
599035
Mar 8 07:23:55 groupwise-1-mht kernel: (fs/jbd/recovery.c, 257):
journal_recover: JBD: Replayed 8 and revoked 0/0 blocks
Mar 8 07:23:55 groupwise-1-mht kernel: kjournald starting. Commit interval 5
seconds
Mar 8 07:25:51 groupwise-1-mht kernel: o2net: accepted connection from node
groupwise-2-mht (num 2) at 192.168.1.3:7777
Mar 8 07:25:55 groupwise-1-mht kernel: ocfs2_dlm: Node 2 joins domain
2062CE05ABA246988E9CCCDAE253F458
Mar 8 07:25:55 groupwise-1-mht kernel: ocfs2_dlm: Nodes in domain
("2062CE05ABA246988E9CCCDAE253F458"): 0 1 2
Mar 8 07:25:59 groupwise-1-mht kernel: ocfs2_dlm: Node 2 joins domain
B6ECAF5A668A4573AF763908F26958DB
Mar 8 07:25:59 groupwise-1-mht kernel: ocfs2_dlm: Nodes in domain
("B6ECAF5A668A4573AF763908F26958DB"): 0 1 2
Andy Kipp
Network Administrator
Velcro USA Inc.
406 Brown Ave.
Manchester, NH 03103
Phone: (603) 222-4844
Email: akipp@velcro.com
CONFIDENTIALITY NOTICE:
This email is intended only for the person or entity to which it is addressed
and may contain confidential and/or privileged material. Any unauthorized
review, use, disclosure or distribution is prohibited. If you are not the
intended recipient, please contact the sender by reply e-mail and destroy all
copies of the original message. If you are the intended recipient but do not
wish to receive communications through this medium, please so advise
immediately.

On 3/9/2007 at 9:39 PM, Sunil Mushran <Sunil.Mushran@oracle.com> wrote:

File a bugzilla with the messages from all three nodes. It appears node 2 went down but kept heartbeating. Strange. The messages from node 2 may shed more light.
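
As an illustration, and assuming the standard ocfs2-tools utilities are installed, something like the following run on the surviving node shows which nodes each shared OCFS2 volume still lists as mounted; the output layout described in the comments is approximate, and the devices on your system will differ:

    # Full detect mode: for every OCFS2 volume found, list the nodes that
    # currently have it mounted (as far as I know this is read from the
    # on-disk slot map, so it can lag slightly behind reality).
    mounted.ocfs2 -f

    # Detect-only mode: list the OCFS2 devices with their UUID and label,
    # handy for matching the domain UUIDs that appear in the syslog above.
    mounted.ocfs2 -d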

Andy Kipp wrote:

I checked bugzilla and what is happening is almost identical to bug #819. However, the "dead" node continues to heartbeat, yet it is unresponsive. No log output at all is generated on the "dead" node. This has been happening for a few months, but the frequency is increasing. Is there any information I can provide to help figure this out?

- Andy
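
Since ssh logins stall while the hang is in progress, one way to collect more data the next time it happens is a task-state dump from each node. This is only a sketch, assuming console or serial access still works and that magic SysRq is enabled on these SLES9 kernels:

    # Enable magic SysRq if it is not already on.
    echo 1 > /proc/sys/kernel/sysrq

    # Dump the state and stack trace of every task into the kernel ring buffer.
    # On a physical console, Alt+SysRq+T does the same without needing a shell.
    echo t > /proc/sysrq-trigger

    # Save the ring buffer; the file name here is just an example.
    dmesg > /tmp/tasks-$(hostname)-$(date +%Y%m%d-%H%M).txt

The resulting task traces, together with /var/log/messages from all three nodes, would be useful material to attach to the bugzilla.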
2 >> Mar 8 07:23:49 groupwise-1-mht kernel: > (8861,1):dlm_wait_for_lock_mastery:1035 ERROR: status = -11 >> Mar 8 07:23:49 groupwise-1-mht kernel: (873,1):dlm_get_lock_resource:895 > 2062CE05ABA246988E9CCCDAE253F458:M0000000000000002fc058c0a084a80: at least > one node (2) torecover before lock mastery can begin >> Mar 8 07:23:49 groupwise-1-mht kernel: (901,1):dlm_get_lock_resource:895 > 2062CE05ABA246988E9CCCDAE253F458:M0000000000000002ff18686a1b86f4: at least > one node (2) torecover before lock mastery can begin >> Mar 8 07:23:49 groupwise-1-mht kernel: (8861,1):dlm_get_lock_resource:895 > 2062CE05ABA246988E9CCCDAE253F458:D0000000000000000b2f76e77647700: at least > one node (2) torecover before lock mastery can begin >> Mar 8 07:23:49 groupwise-1-mht kernel: kjournald starting. Commit interval 5 > seconds >> Mar 8 07:23:49 groupwise-1-mht kernel: (4431,0):ocfs2_replay_journal:1176 > Recovering node 2 from slot 1 on device (253,0) >> Mar 8 07:23:55 groupwise-1-mht kernel: (fs/jbd/recovery.c, 255): > journal_recover: JBD: recovery, exit status 0, recovered transactions 599034 > to 599035 >> Mar 8 07:23:55 groupwise-1-mht kernel: (fs/jbd/recovery.c, 257): > journal_recover: JBD: Replayed 8 and revoked 0/0 blocks >> Mar 8 07:23:55 groupwise-1-mht kernel: kjournald starting. Commit interval 5 > seconds >> Mar 8 07:25:51 groupwise-1-mht kernel: o2net: accepted connection from node > groupwise-2-mht (num 2) at 192.168.1.3:7777 >> Mar 8 07:25:55 groupwise-1-mht kernel: ocfs2_dlm: Node 2 joins domain > 2062CE05ABA246988E9CCCDAE253F458 >> Mar 8 07:25:55 groupwise-1-mht kernel: ocfs2_dlm: Nodes in domain > ("2062CE05ABA246988E9CCCDAE253F458"): 0 1 2 >> Mar 8 07:25:59 groupwise-1-mht kernel: ocfs2_dlm: Node 2 joins domain > B6ECAF5A668A4573AF763908F26958DB >> Mar 8 07:25:59 groupwise-1-mht kernel: ocfs2_dlm: Nodes in domain > ("B6ECAF5A668A4573AF763908F26958DB"): 0 1 2 >> >> >> >> >> Andy Kipp >> Network Administrator >> Velcro USA Inc. >> 406 Brown Ave. >> Manchester, NH 03103 >> Phone: (603) 222-4844 >> Email: akipp@velcro.com >> >> CONFIDENTIALITY NOTICE: >> This email is intended only for the person or entity to which it is > addressed and may contain confidential and/or privileged material. Any > unauthorized review, use, disclosure or distribution is prohibited. If you > are not the intended recipient, please contact the sender by reply e-mail and > destroy all copies of the original message. If you are the intended recipient > but do not wish to receive communications through this medium, please so > advise immediately. >> >> >> _______________________________________________ >> Ocfs2-users mailing list >> Ocfs2-users@oss.oracle.com >> http://oss.oracle.com/mailman/listinfo/ocfs2-users >>
Have you tried to do alt-sysrq-t on the "dead" node? The stack traces will be
of great help. Also, even though this could be the same as #819, I would still
recommend filing a new bug with all the messages files. Even though that will
take some of your time, it will be much easier to keep track of this issue and
ensure some sort of resolution. (Prune out really old info from the message
files.)

Andy Kipp wrote:
> I checked bugzilla and what is happening is almost identical to bug #819.
> However, the "dead" node continues to heartbeat, yet is unresponsive. No log
> output at all is generated on the "dead" node. This has been happening for a
> few months however frequency is increasing. Is there any information I can
> provide to hopefully figure this out?
>
> - Andy
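A minimal sketch of capturing those traces, assuming magic SysRq is available (it normally is on SLES9 kernels): on a local console press Alt+SysRq+T; if a shell is still usable, the same dump can be triggered through /proc and read back from the kernel log rather than the terminal:

# enable magic SysRq if it is not already on
echo 1 > /proc/sys/kernel/sysrq
# dump the stack trace of every task into the kernel ring buffer
echo t > /proc/sysrq-trigger
# read it back (it also reaches /var/log/messages via syslog)
dmesg | less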
Andy,
I have found it helpful for diagnosing this kind of hang to keep a priority 0 shell
open on the server. This shell usually keeps working even during heavy
swapping or other situations where the system becomes unresponsive. You can
start one with this command:
nice -n -20 bash
From this shell you can run top or vmstat to see what is happening when
the server is unresponsive. Just be careful not to run any command that might
generate a lot of output or use much CPU, as you might hang the server
yourself.
Regards,
Luis
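As a rough sketch of that setup (the monitoring commands below are just examples, not a prescription):

# keep a high-priority shell open ahead of time, as root
nice -n -20 bash
# from that shell, once the machine starts to wedge:
vmstat 5        # memory, swap and I/O, one sample every 5 seconds
top -b -n 1     # one-shot snapshot of tasks, including any stuck in D state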
Andy Kipp <AKIPP@velcro.com> wrote:
I checked bugzilla and what is happening is almost identical to bug #819.
However, the "dead" node continues to heartbeat, yet is unresponsive.
No log output at all is generated on the "dead" node. This has been
happening for a few months however frequency is increasing. Is there any
information I can provide to hopefully figure this out?
- Andy
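In the meantime, a couple of quick checks from the surviving node can at least confirm that picture (assuming the standard ocfs2-tools and o2cb init script are installed, as they normally are with OCFS2 on SLES9):

/etc/init.d/o2cb status   # state of the o2cb cluster stack and heartbeat on this node
mounted.ocfs2 -f          # for each OCFS2 device, which nodes are detected as having it mounted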
--
Andrew Kipp
Network Administrator
Velcro USA Inc.
Email: akipp@velcro.com
Work: (603) 222-4844
>>> On 3/9/2007 at 9:39 PM, in message
<45F21A7F.5090802@oracle.com>, Sunil Mushran wrote:
> File a bugzilla with the messages from all three nodes. Appears
> node 2 went down but kept heartbeating. Strange. The messages
> from node 2 may shed more light.
>
> Andy Kipp wrote:
>> We are running OCFS2 on SLES9 machines using a FC SAN. Without warning
both
> nodes will become unresponsive. Can not access either machine via ssh or
> terminal (hangs after typing in username). However the machine still
responds
> to pings. This continues until one node is rebooted, at which time the
second
> node resumes normal operations.
>>
>> I am not entirely sure that this is an OCFS2 problem at all however the
> syslog shows it had issues Here is the log from the node that was not
> rebooted. The node that was rebooted contained no log information. The
system
> appeared to have gone down at about 3AM, until the node was rebooted at
> around 7:15.
>>
>> Mar 8 03:06:32 groupwise-1-mht kernel: o2net: connection to node
> groupwise-2-mht (num 2) at 192.168.1.3:7777 has been idle for 10 seconds,
> shutting it down.
>> Mar 8 03:06:32 groupwise-1-mht kernel: (0,2):o2net_idle_timer:1310 here
are
> some times that might help debug the situation: (tmr 1173341182.367220 now
> 1173341192.367244 dr 1173341182.367213 adv
> 1173341182.367228:1173341182.367229 func (05ce6220:2)
> 1173341182.367221:1173341182.367224)
>> Mar 8 03:06:32 groupwise-1-mht kernel: o2net: no longer connected to
node
> groupwise-2-mht (num 2) at 192.168.1.3:7777
>> Mar 8 03:06:32 groupwise-1-mht kernel:
(499,0):dlm_do_master_request:1330
> ERROR: link to 2 went down!
>> Mar 8 03:06:32 groupwise-1-mht kernel:
(499,0):dlm_get_lock_resource:914
> ERROR: status = -112
>> Mar 8 03:13:02 groupwise-1-mht kernel:
(8476,0):dlm_send_proxy_ast_msg:458
> ERROR: status = -107
>> Mar 8 03:13:02 groupwise-1-mht kernel: (8476,0):dlm_flush_asts:607
ERROR:
> status = -107
>> Mar 8 03:19:54 groupwise-1-mht kernel:
> (147,1):dlm_send_remote_unlock_request:356 ERROR: status = -107
>> Mar 8 03:19:54 groupwise-1-mht last message repeated 127 times
>> Mar 8 03:19:55 groupwise-1-mht kernel:
(873,0):dlm_do_master_request:1330
> ERROR: link to 2 went down!
>> Mar 8 03:19:55 groupwise-1-mht kernel:
(873,0):dlm_get_lock_resource:914
> ERROR: status = -107
>> Mar 8 03:19:55 groupwise-1-mht kernel:
(901,0):dlm_do_master_request:1330
> ERROR: link to 2 went down!
>> Mar 8 03:19:55 groupwise-1-mht kernel:
(901,0):dlm_get_lock_resource:914
> ERROR: status = -107
>> Mar 8 03:19:56 groupwise-1-mht kernel:
(929,0):dlm_do_master_request:1330
> ERROR: link to 2 went down!
>> Mar 8 03:19:56 groupwise-1-mht kernel:
(929,0):dlm_get_lock_resource:914
> ERROR: status = -107
>> Mar 8 03:45:29 groupwise-1-mht -- MARK --
>> Mar 8 04:15:02 groupwise-1-mht kernel:
> (147,1):dlm_send_remote_unlock_request:356 ERROR: status = -107
>> Mar 8 04:15:03 groupwise-1-mht last message repeated 383 times
>> Mar 8 06:27:54 groupwise-1-mht kernel:
> (147,1):dlm_send_remote_unlock_request:356 ERROR: status = -107
>> Mar 8 06:27:54 groupwise-1-mht last message repeated 127 times
>> Mar 8 06:27:54 groupwise-1-mht kernel:
> (147,1):dlm_send_remote_unlock_request:356 ERROR: status = -107
>> Mar 8 06:27:54 groupwise-1-mht last message repeated 127 times
>> Mar 8 06:35:48 groupwise-1-mht kernel:
(8872,0):dlm_do_master_request:1330
> ERROR: link to 2 went down!
>> Mar 8 06:35:48 groupwise-1-mht kernel:
(8872,0):dlm_get_lock_resource:914
> ERROR: status = -107
>> Mar 8 06:52:45 groupwise-1-mht kernel:
(8861,0):dlm_do_master_request:1330
> ERROR: link to 2 went down!
>> Mar 8 06:52:45 groupwise-1-mht kernel:
(8861,0):dlm_get_lock_resource:914
> ERROR: status = -107
>> Mar 8 06:54:11 groupwise-1-mht kernel:
(8854,3):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 06:54:11 groupwise-1-mht kernel:
(8854,3):ocfs2_do_request_vote:798
> ERROR: status = -107
>> Mar 8 06:54:11 groupwise-1-mht kernel: (8854,3):ocfs2_unlink:840 ERROR:
> status = -107
>> Mar 8 06:54:18 groupwise-1-mht kernel:
(8855,0):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 06:54:18 groupwise-1-mht kernel:
(8855,0):ocfs2_do_request_vote:798
> ERROR: status = -107
>> Mar 8 06:54:18 groupwise-1-mht kernel: (8855,0):ocfs2_unlink:840 ERROR:
> status = -107
>> Mar 8 06:54:18 groupwise-1-mht kernel:
(8855,0):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 06:54:18 groupwise-1-mht kernel:
(8855,0):ocfs2_do_request_vote:798
> ERROR: status = -107
>> Mar 8 06:54:18 groupwise-1-mht kernel: (8855,0):ocfs2_unlink:840 ERROR:
> status = -107
>> Mar 8 06:54:58 groupwise-1-mht kernel:
(8853,0):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 06:54:58 groupwise-1-mht kernel:
(8853,0):ocfs2_do_request_vote:798
> ERROR: status = -107
>> Mar 8 06:54:58 groupwise-1-mht kernel: (8853,0):ocfs2_unlink:840 ERROR:
> status = -107
>> Mar 8 07:09:41 groupwise-1-mht kernel:
(4192,0):dlm_do_master_request:1330
> ERROR: link to 2 went down!
>> Mar 8 07:09:41 groupwise-1-mht kernel:
(4192,0):dlm_get_lock_resource:914
> ERROR: status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel:
(4236,0):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel:
(4236,0):ocfs2_do_request_vote:798
> ERROR: status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_unlink:840 ERROR:
> status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel:
(4236,0):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel:
(4236,0):ocfs2_do_request_vote:798
> ERROR: status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_unlink:840 ERROR:
> status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel:
(4236,0):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel:
(4236,0):ocfs2_do_request_vote:798
> ERROR: status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_unlink:840 ERROR:
> status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel:
(4236,0):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel:
(4236,0):ocfs2_do_request_vote:798
> ERROR: status = -107
>> Mar 8 07:14:09 groupwise-1-mht kernel: (4236,0):ocfs2_unlink:840 ERROR:
> status = -107
>> Mar 8 07:15:50 groupwise-1-mht kernel:
(4289,0):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 07:15:50 groupwise-1-mht kernel:
(4289,0):ocfs2_do_request_vote:798
> ERROR: status = -107
>> Mar 8 07:15:50 groupwise-1-mht kernel: (4289,0):ocfs2_unlink:840 ERROR:
> status = -107
>> Mar 8 07:15:50 groupwise-1-mht kernel:
(4289,0):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 07:15:50 groupwise-1-mht kernel:
(4289,0):ocfs2_do_request_vote:798
> ERROR: status = -107
>> Mar 8 07:15:50 groupwise-1-mht kernel: (4289,0):ocfs2_unlink:840 ERROR:
> status = -107
>> Mar 8 07:16:13 groupwise-1-mht kernel:
(4253,0):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 07:16:13 groupwise-1-mht kernel:
(4253,0):ocfs2_do_request_vote:798
> ERROR: status = -107
>> Mar 8 07:16:13 groupwise-1-mht kernel: (4253,0):ocfs2_unlink:840 ERROR:
> status = -107
>> Mar 8 07:18:57 groupwise-1-mht kernel:
(4341,0):dlm_do_master_request:1330
> ERROR: link to 2 went down!
>> Mar 8 07:18:57 groupwise-1-mht kernel:
(4341,0):dlm_get_lock_resource:914
> ERROR: status = -107
>> Mar 8 07:19:24 groupwise-1-mht kernel:
(4356,0):ocfs2_broadcast_vote:725
> ERROR: status = -107
>> Mar 8 07:19:24 groupwise-1-mht kernel:
(4356,0):ocfs2_do_request_vote:798
> ERROR: status = -107 Mar 8 07:19:24 groupwise-1-mht kernel:
> (4356,0):ocfs2_unlink:840 ERROR: status = -107
>> Mar 8 07:20:49 groupwise-1-mht sshd[4375]: Accepted publickey for root
from
> 10.1.31.27 port 1752 ssh2
>> Mar 8 07:20:50 groupwise-1-mht kernel:
> (147,0):dlm_send_remote_unlock_request:356 ERROR: status = -107 Mar 8
> 07:20:50 groupwise-1-mht last message repeated 255 times
>> Mar 8 07:20:53 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:20:53 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:20:58 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:20:58 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:03 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:21:03 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:08 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:21:08 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:13 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:21:13 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:19 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:21:19 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:24 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:21:24 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:29 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:21:29 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:34 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:21:34 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:39 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:21:39 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:44 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:21:44 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:49 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:21:49 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:54 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:21:54 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:21:59 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107 Mar 8
> 07:21:59 groupwise-1-mht kernel: (4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2 Mar 8 07:22:04 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:22:04 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:22:10 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:22:10 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:22:15 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:22:20 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:22:20 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:22:25 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:22:25 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:22:30 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:22:30 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:22:35 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:22:35 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:22:40 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:22:40 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:22:45 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:22:45 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:22:50 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:22:50 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:22:55 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:22:55 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:23:01 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:23:01 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:23:06 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:23:06 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:23:11 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:23:11 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:23:16 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:23:16 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:23:21 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:23:21 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:23:26 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:23:26 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:23:31 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:23:31 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:23:36 groupwise-1-mht kernel:
> (4377,0):dlm_send_remote_convert_request:398 ERROR: status = -107
>> Mar 8 07:23:36 groupwise-1-mht kernel:
(4377,0):dlm_wait_for_node_death:371
> 2062CE05ABA246988E9CCCDAE253F458: waiting 5000ms for notification of death
of
> node 2
>> Mar 8 07:23:40 groupwise-1-mht kernel:
(28613,2):dlm_get_lock_resource:847
> B6ECAF5A668A4573AF763908F26958DB:$RECOVERY: at least one node (2) torecover
> before lock mastery can begin
>> Mar 8 07:23:40 groupwise-1-mht kernel:
(28613,2):dlm_get_lock_resource:874
> B6ECAF5A668A4573AF763908F26958DB: recovery map is not empty, but must
master
> $RECOVERY lock now
>> Mar 8 07:23:41 groupwise-1-mht kernel:
(4432,0):ocfs2_replay_journal:1176
> Recovering node 2 from slot 1 on device (253,1)
>> Mar 8 07:23:41 groupwise-1-mht kernel:
(4192,0):dlm_restart_lock_mastery:1214
> ERROR: node down! 2
>> Mar 8 07:23:41 groupwise-1-mht kernel:
> (4192,0):dlm_wait_for_lock_mastery:1035 ERROR: status = -11
>> Mar 8 07:23:41 groupwise-1-mht kernel:
(929,1):dlm_restart_lock_mastery:1214
> ERROR: node down! 2
>> Mar 8 07:23:41 groupwise-1-mht kernel:
(929,1):dlm_wait_for_lock_mastery:1035
> ERROR: status = -11
>> Mar 8 07:23:42 groupwise-1-mht kernel:
(4341,1):dlm_restart_lock_mastery:1214
> ERROR: node down! 2
>> Mar 8 07:23:42 groupwise-1-mht kernel:
> (4341,1):dlm_wait_for_lock_mastery:1035 ERROR: status = -11
>> Mar 8 07:23:42 groupwise-1-mht kernel:
(4341,1):dlm_restart_lock_mastery:1214
> ERROR: node down! 2
>> Mar 8 07:23:42 groupwise-1-mht kernel:
> (4341,1):dlm_wait_for_lock_mastery:1035 ERROR: status = -11
>> Mar 8 07:23:42 groupwise-1-mht kernel:
(4192,0):dlm_get_lock_resource:895
> 2062CE05ABA246988E9CCCDAE253F458:D000000000000000037872ff59e2a10: at least
> one node (2) torecover before lock mastery can begin
>> Mar 8 07:23:42 groupwise-1-mht kernel:
(499,1):dlm_restart_lock_mastery:1214
> ERROR: node down! 2
>> Mar 8 07:23:42 groupwise-1-mht kernel:
(499,1):dlm_wait_for_lock_mastery:1035
> ERROR: status = -11
>> Mar 8 07:23:42 groupwise-1-mht kernel:
(929,1):dlm_get_lock_resource:895
> 2062CE05ABA246988E9CCCDAE253F458:M0000000000000002d2ab960a02ee32: at least
> one node (2) torecover before lock mastery can begin
>> Mar 8 07:23:43 groupwise-1-mht kernel:
(4341,1):dlm_get_lock_resource:895
=== message truncated ==