Marcus Alves Grando
2007-May-23 14:42 UTC
[Ocfs2-users] Is "node down!" related to SVN rev 3004?
Hi list,

Today I ran into a problem with ocfs2: one server stopped accessing the ocfs2 disks, and the only messages in its /var/log/messages are:

May 23 16:24:26 node3 kernel: (6956,3):dlm_restart_lock_mastery:1301 ERROR: node down! 1
May 23 16:24:26 node3 kernel: (6956,3):dlm_wait_for_lock_mastery:1118 ERROR: status = -11

I don't know what happened. Could this be related to the SVN rev 3004 fix? Has anyone seen this before?

Another strange fact: all nodes mount 13 SAN disks, but the "leaves domain" messages occur only nine times (a quick way to cross-check this is sketched at the end of this message).

Also, node1 has been down for maintenance since 08:30.

The other servers logged the following:

**** node2

May 23 16:48:31 node2 kernel: ocfs2_dlm: Node 3 leaves domain 84407FC4A92E451DADEF260A2FE0E366
May 23 16:48:31 node2 kernel: ocfs2_dlm: Nodes in domain ("84407FC4A92E451DADEF260A2FE0E366"): 2 4
May 23 16:48:37 node2 kernel: ocfs2_dlm: Node 3 leaves domain 1FB62EB34D1F495A9F11F396E707588C
May 23 16:48:37 node2 kernel: ocfs2_dlm: Nodes in domain ("1FB62EB34D1F495A9F11F396E707588C"): 2 4
May 23 16:48:42 node2 kernel: ocfs2_dlm: Node 3 leaves domain 793ACD36E8CA4067AB99F9F4F2229634
May 23 16:48:42 node2 kernel: ocfs2_dlm: Nodes in domain ("793ACD36E8CA4067AB99F9F4F2229634"): 2 4
May 23 16:48:48 node2 kernel: ocfs2_dlm: Node 3 leaves domain ECECF9980CBD44EFA7E8A950EDE40573
May 23 16:48:48 node2 kernel: ocfs2_dlm: Nodes in domain ("ECECF9980CBD44EFA7E8A950EDE40573"): 2 4
May 23 16:48:53 node2 kernel: ocfs2_dlm: Node 3 leaves domain D8AFCBD0CF59404991FAB19916CEE08B
May 23 16:48:53 node2 kernel: ocfs2_dlm: Nodes in domain ("D8AFCBD0CF59404991FAB19916CEE08B"): 2 4
May 23 16:48:58 node2 kernel: ocfs2_dlm: Node 3 leaves domain E8B0A018151943A28674662818529F0F
May 23 16:48:58 node2 kernel: ocfs2_dlm: Nodes in domain ("E8B0A018151943A28674662818529F0F"): 2 4
May 23 16:49:03 node2 kernel: ocfs2_dlm: Node 3 leaves domain 3D227E224D0D4D9F97B84B0BB7DE7E22
May 23 16:49:03 node2 kernel: ocfs2_dlm: Nodes in domain ("3D227E224D0D4D9F97B84B0BB7DE7E22"): 2 4
May 23 16:49:09 node2 kernel: ocfs2_dlm: Node 3 leaves domain C40090C8D14D48C9AC0D1024A228EC59
May 23 16:49:09 node2 kernel: ocfs2_dlm: Nodes in domain ("C40090C8D14D48C9AC0D1024A228EC59"): 2 4
May 23 16:49:14 node2 kernel: ocfs2_dlm: Node 3 leaves domain 9CB224941DC64A39872A5012FBD12354
May 23 16:49:14 node2 kernel: ocfs2_dlm: Nodes in domain ("9CB224941DC64A39872A5012FBD12354"): 2 4
May 23 16:50:04 node2 kernel: o2net: connection to node node3.hst.host (num 3) at 192.168.0.3:7777 has been idle for 30.0 seconds, shutting it down.
May 23 16:50:04 node2 kernel: (0,3):o2net_idle_timer:1418 here are some times that might help debug the situation: (tmr 1179949774.944117 now 1179949804.944585 dr 1179949774.944109 adv 1179949774.944119:1179949774.944121 func (d21ddb4d:513) 1179949754.944260:1179949754.944271)
May 23 16:50:04 node2 kernel: o2net: no longer connected to node node3.hst.host (num 3) at 192.168.0.3:7777
May 23 16:52:31 node2 kernel: (23351,2):dlm_send_remote_convert_request:398 ERROR: status = -107
May 23 16:52:31 node2 kernel: (23351,2):dlm_wait_for_node_death:365 BD2D6C1943FB4771B018EA2A7D056E8A: waiting 5000ms for notification of death of node 3
May 23 16:52:32 node2 kernel: (4379,3):ocfs2_dlm_eviction_cb:119 device (8,49): dlm has evicted node 3
May 23 16:52:32 node2 kernel: (4451,1):dlm_get_lock_resource:921 BD2D6C1943FB4771B018EA2A7D056E8A:$RECOVERY: at least one node (3) to recover before lock mastery can begin
May 23 16:52:32 node2 kernel: (4451,1):dlm_get_lock_resource:955 BD2D6C1943FB4771B018EA2A7D056E8A: recovery map is not empty, but must master $RECOVERY lock now
May 23 16:52:33 node2 kernel: (4441,2):dlm_get_lock_resource:921 36D7DEC36FC44C53A6107B6A9CE863A2:$RECOVERY: at least one node (3) to recover before lock mastery can begin
May 23 16:52:33 node2 kernel: (4441,2):dlm_get_lock_resource:955 36D7DEC36FC44C53A6107B6A9CE863A2: recovery map is not empty, but must master $RECOVERY lock now
May 23 16:52:34 node2 kernel: (4491,2):dlm_get_lock_resource:921 9D0941F9B5B843E0B8F8C9FD7D514C35:$RECOVERY: at least one node (3) to recover before lock mastery can begin
May 23 16:52:34 node2 kernel: (4491,2):dlm_get_lock_resource:955 9D0941F9B5B843E0B8F8C9FD7D514C35: recovery map is not empty, but must master $RECOVERY lock now
May 23 16:52:37 node2 kernel: (23351,2):ocfs2_replay_journal:1167 Recovering node 3 from slot 1 on device (8,97)
May 23 16:52:42 node2 kernel: kjournald starting. Commit interval 5 seconds

**** node4

May 23 16:48:31 node4 kernel: ocfs2_dlm: Node 3 leaves domain 84407FC4A92E451DADEF260A2FE0E366
May 23 16:48:31 node4 kernel: ocfs2_dlm: Nodes in domain ("84407FC4A92E451DADEF260A2FE0E366"): 2 4
May 23 16:48:37 node4 kernel: ocfs2_dlm: Node 3 leaves domain 1FB62EB34D1F495A9F11F396E707588C
May 23 16:48:37 node4 kernel: ocfs2_dlm: Nodes in domain ("1FB62EB34D1F495A9F11F396E707588C"): 2 4
May 23 16:48:42 node4 kernel: ocfs2_dlm: Node 3 leaves domain 793ACD36E8CA4067AB99F9F4F2229634
May 23 16:48:42 node4 kernel: ocfs2_dlm: Nodes in domain ("793ACD36E8CA4067AB99F9F4F2229634"): 2 4
May 23 16:48:48 node4 kernel: ocfs2_dlm: Node 3 leaves domain ECECF9980CBD44EFA7E8A950EDE40573
May 23 16:48:48 node4 kernel: ocfs2_dlm: Nodes in domain ("ECECF9980CBD44EFA7E8A950EDE40573"): 2 4
May 23 16:48:53 node4 kernel: ocfs2_dlm: Node 3 leaves domain D8AFCBD0CF59404991FAB19916CEE08B
May 23 16:48:53 node4 kernel: ocfs2_dlm: Nodes in domain ("D8AFCBD0CF59404991FAB19916CEE08B"): 2 4
May 23 16:48:58 node4 kernel: ocfs2_dlm: Node 3 leaves domain E8B0A018151943A28674662818529F0F
May 23 16:48:58 node4 kernel: ocfs2_dlm: Nodes in domain ("E8B0A018151943A28674662818529F0F"): 2 4
May 23 16:49:03 node4 kernel: ocfs2_dlm: Node 3 leaves domain 3D227E224D0D4D9F97B84B0BB7DE7E22
May 23 16:49:03 node4 kernel: ocfs2_dlm: Nodes in domain ("3D227E224D0D4D9F97B84B0BB7DE7E22"): 2 4
May 23 16:49:09 node4 kernel: ocfs2_dlm: Node 3 leaves domain C40090C8D14D48C9AC0D1024A228EC59
May 23 16:49:09 node4 kernel: ocfs2_dlm: Nodes in domain ("C40090C8D14D48C9AC0D1024A228EC59"): 2 4
May 23 16:49:14 node4 kernel: ocfs2_dlm: Node 3 leaves domain 9CB224941DC64A39872A5012FBD12354
May 23 16:49:14 node4 kernel: ocfs2_dlm: Nodes in domain ("9CB224941DC64A39872A5012FBD12354"): 2 4
May 23 16:50:04 node4 kernel: o2net: connection to node node3.hst.host (num 3) at 192.168.0.3:7777 has been idle for 30.0 seconds, shutting it down.
May 23 16:50:04 node4 kernel: (19355,0):o2net_idle_timer:1418 here are some times that might help debug the situation: (tmr 1179949774.943813 now 1179949804.944242 dr 1179949774.943805 adv 1179949774.943815:1179949774.943817 func (d21ddb4d:513) 1179949754.944088:1179949754.944097)
May 23 16:50:04 node4 kernel: o2net: no longer connected to node node3.hst.host (num 3) at 192.168.0.3:7777
May 23 16:50:04 node4 kernel: (18902,0):dlm_do_master_request:1418 ERROR: link to 3 went down!
May 23 16:50:04 node4 kernel: (18902,0):dlm_get_lock_resource:995 ERROR: status = -112
May 23 16:52:31 node4 kernel: (22785,3):dlm_send_remote_convert_request:398 ERROR: status = -107
May 23 16:52:31 node4 kernel: (22785,3):dlm_wait_for_node_death:365 BD2D6C1943FB4771B018EA2A7D056E8A: waiting 5000ms for notification of death of node 3
May 23 16:52:31 node4 kernel: (22786,3):dlm_get_lock_resource:921 9D0941F9B5B843E0B8F8C9FD7D514C35:M0000000000000000000215b9fa93cd: at least one node (3) to recover before lock mastery can begin
May 23 16:52:31 node4 kernel: (22784,3):dlm_get_lock_resource:921 36D7DEC36FC44C53A6107B6A9CE863A2:M0000000000000000000215b39ab40a: at least one node (3) to recover before lock mastery can begin
May 23 16:52:31 node4 kernel: (22783,3):dlm_get_lock_resource:921 A85D18C01AE747AC905343D919B60525:M000000000000000000021535d8e891: at least one node (3) to recover before lock mastery can begin
May 23 16:52:31 node4 kernel: (4525,3):dlm_get_lock_resource:921 A85D18C01AE747AC905343D919B60525:$RECOVERY: at least one node (3) to recover before lock mastery can begin
May 23 16:52:31 node4 kernel: (4525,3):dlm_get_lock_resource:955 A85D18C01AE747AC905343D919B60525: recovery map is not empty, but must master $RECOVERY lock now
May 23 16:52:32 node4 kernel: (22786,3):dlm_get_lock_resource:976 9D0941F9B5B843E0B8F8C9FD7D514C35:M0000000000000000000215b9fa93cd: at least one node (3) to recover before lock mastery can begin
May 23 16:52:32 node4 kernel: (22784,3):dlm_get_lock_resource:976 36D7DEC36FC44C53A6107B6A9CE863A2:M0000000000000000000215b39ab40a: at least one node (3) to recover before lock mastery can begin
May 23 16:52:32 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 device (8,97): dlm has evicted node 3
May 23 16:52:33 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 device (8,81): dlm has evicted node 3
May 23 16:52:34 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 device (8,161): dlm has evicted node 3
May 23 16:52:35 node4 kernel: (18902,0):dlm_restart_lock_mastery:1301 ERROR: node down! 3
May 23 16:52:35 node4 kernel: (18902,0):dlm_wait_for_lock_mastery:1118 ERROR: status = -11
May 23 16:52:36 node4 kernel: (18902,0):dlm_get_lock_resource:976 9D0941F9B5B843E0B8F8C9FD7D514C35:D0000000000000000030b2be3ea1a0c: at least one node (3) to recover before lock mastery can begin
May 23 16:52:37 node4 kernel: (22783,3):ocfs2_replay_journal:1167 Recovering node 3 from slot 1 on device (8,49)
May 23 16:52:39 node4 kernel: (22784,0):ocfs2_replay_journal:1167 Recovering node 3 from slot 1 on device (8,81)
May 23 16:52:40 node4 kernel: (22786,0):ocfs2_replay_journal:1167 Recovering node 3 from slot 1 on device (8,161)
May 23 16:52:44 node4 kernel: kjournald starting. Commit interval 5 seconds

--
Marcus Alves Grando <marcus.grando [] terra.com.br>
Suporte Engenharia 1
Terra Networks Brasil S/A
Tel: 55 (51) 3284-4238
Qual é a sua Terra?
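A quick way to cross-check the mount count against the "leaves domain" activity on any surviving node, assuming ocfs2-tools is installed and the day's log is still in /var/log/messages (only a sketch):

# List the OCFS2 volumes this node can see, with their labels and UUIDs
mounted.ocfs2 -d
# Count how many OCFS2 filesystems are actually mounted right now
mount -t ocfs2 | wc -l
# Count how many distinct DLM domains (volume UUIDs) logged node 3 leaving
grep 'Node 3 leaves domain' /var/log/messages | awk '{print $NF}' | sort -u | wc -l

If the last two numbers differ, comparing the UUIDs from the grep against the UUIDs from mounted.ocfs2 shows which volumes never saw node 3 leave.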
Marcus Alves Grando
2007-May-23 15:01 UTC
[Ocfs2-users] Is "node down!" related to SVN rev 3004?
I forgot to say: all nodes are RedHat AS 4 update 5 with

# rpm -qa | grep ocfs2
ocfs2-2.6.9-55.ELhugemem-1.2.5-1
ocfs2-tools-1.2.4-1
# uname -r
2.6.9-55.ELhugemem

(A quick way to confirm the versions match on every node is sketched at the end of this message.)

Regards

Marcus Alves Grando wrote:
> Hi list,
>
> Today I ran into a problem with ocfs2: one server stopped accessing the ocfs2 disks, and the only messages in its /var/log/messages are:
>
> May 23 16:24:26 node3 kernel: (6956,3):dlm_restart_lock_mastery:1301 ERROR: node down! 1
> May 23 16:24:26 node3 kernel: (6956,3):dlm_wait_for_lock_mastery:1118 ERROR: status = -11
>
> [...]
--
Marcus Alves Grando <marcus.grando [] terra.com.br>
Suporte Engenharia 1
Terra Networks Brasil S/A
Tel: 55 (51) 3284-4238
Qual é a sua Terra?
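To rule out a version mismatch between the nodes, something along these lines confirms that every node runs the same kernel and ocfs2 packages, and that the cluster stack is up (a sketch; it assumes password-less ssh and the node names used in the logs above):

for n in node2 node3 node4; do
    echo "== $n =="
    ssh $n 'uname -r; rpm -qa | grep ocfs2; /etc/init.d/o2cb status'
done

The o2cb status output also shows whether the O2CB cluster is online on each node, which is worth capturing before filing a report.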
Sunil Mushran
2007-May-24 08:53 UTC
[Ocfs2-users] Is "node down!" related to SVN rev 3004?
Such issues are best handled via bugzilla. File one on oss.oracle.com/bugzilla with all the details. The most important detail would be node3's netdump or netconsole output; the real reason for the outage will be in that dump (a minimal netconsole setup is sketched below).

Marcus Alves Grando wrote:
> Hi list,
>
> Today I ran into a problem with ocfs2: one server stopped accessing the ocfs2 disks, and the only messages in its /var/log/messages are:
>
> May 23 16:24:26 node3 kernel: (6956,3):dlm_restart_lock_mastery:1301 ERROR: node down! 1
> May 23 16:24:26 node3 kernel: (6956,3):dlm_wait_for_lock_mastery:1118 ERROR: status = -11
>
> [...]
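If netdump is not already configured, netconsole is a lightweight way to capture node3's console output the next time this happens. A minimal sketch: the 192.168.0.x interconnect follows the logs above, but node2's exact address, the interface name and the UDP ports are assumptions to adjust:

# On a receiving node (node2 here), log whatever arrives on UDP port 6666
# (netcat option syntax varies between versions)
nc -u -l -p 6666 | tee /var/log/node3-console.log

# On node3, forward kernel console messages to the receiver
# (format: <src-port>@<src-ip>/<dev>,<dst-port>@<dst-ip>/<dst-mac>; empty MAC means broadcast)
modprobe netconsole netconsole=6665@192.168.0.3/eth1,6666@192.168.0.2/

# Raise the console loglevel so enough kernel output is forwarded to be useful
dmesg -n 8

With that in place, whatever node3 prints when it drops off the cluster (an oops, a fencing/reset message, and so on) should land in the capture file even if the node never manages to write it to local disk.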