Hi all,

I'm running OCFS2 on two systems with OpenSUSE 10.2, connected over fibre channel to shared storage (HP MSA1500 + HP PROLIANT MSA20).

The cluster has two nodes (web-ha1 and web-ha2). Once or twice a month, OCFS2 stops working on both systems. On the first node I get no errors in the log files; after a forced shutdown of the first node, I see on the second node the logs at the bottom of this message.

I saw that some other people have run into a similar problem (http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg01135.html), but that thread didn't help me...

Does anyone have any idea?

Thank you in advance.

Maurizio


web-ha1:~ # cat /etc/sysconfig/o2cb
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=451

web-ha1:~ # cat /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.255.1
        number = 0
        name = web-ha1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.255.2
        number = 1
        name = web-ha2
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2


Nov 28 15:28:59 web-ha2 kernel: o2net: connection to node web-ha1 (num 0) at 192.168.255.1:7777 has been idle for 10 seconds, shutting it down.
Nov 28 15:28:59 web-ha2 kernel: (23432,0):o2net_idle_timer:1297 here are some times that might help debug the situation: (tmr 1196260129.36511 now 1196260139.34907 dr 1196260129.36503 adv 1196260129.36514:1196260129.36515 func (95bc84eb:504) 1196260129.36329:1196260129.36337)
Nov 28 15:28:59 web-ha2 kernel: o2net: no longer connected to node web-ha1 (num 0) at 192.168.255.1:7777
Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_do_master_request:1331 ERROR: link to 0 went down!
Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_get_lock_resource:915 ERROR: status = -112
Nov 28 15:29:18 web-ha2 sshd[23503]: pam_unix2(sshd:auth): conversation failed
Nov 28 15:29:18 web-ha2 sshd[23503]: error: ssh_msg_send: write
Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_do_master_request:1331 ERROR: link to 0 went down!
Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_get_lock_resource:915 ERROR: status = -107
Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_do_master_request:1331 ERROR: link to 0 went down!
Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_get_lock_resource:915 ERROR: status = -107
Nov 28 15:29:46 web-ha2 kernel: (23443,0):dlm_do_master_request:1331 ERROR: link to 0 went down!
ERROR: status = -107

[...]

Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_wait_for_lock_mastery:1036 ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_wait_for_lock_mastery:1036 ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_wait_for_lock_mastery:1036 ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17804,1):dlm_get_lock_resource:896 86472C5C33A54FF88030591B1210C560:M0000000000000009e7e54516dd16ec: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17730,1):dlm_get_lock_resource:896 86472C5C33A54FF88030591B1210C560:M0000000000000009e76bf516dd144d: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17634,0):dlm_get_lock_resource:896 86472C5C33A54FF88030591B1210C560:M000000000000000ac0d22b1f78e53c: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_wait_for_lock_mastery:1036 ERROR: status = -11

[...]

Nov 22 18:14:54 web-ha2 kernel: (17702,1):dlm_get_lock_resource:896 86472C5C33A54FF88030591B1210C560:M0000000000000007a6dab9ef6eacbd: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (17701,1):dlm_get_lock_resource:896 86472C5C33A54FF88030591B1210C560:M000000000000000a06a13716de553e: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:849 86472C5C33A54FF88030591B1210C560:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:876 86472C5C33A54FF88030591B1210C560: recovery map is not empty, but must master $RECOVERY lock now
Nov 22 18:14:54 web-ha2 kernel: (17893,0):ocfs2_replay_journal:1184 Recovering node 0 from slot 0 on device (8,17)
Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_wait_for_lock_mastery:1036 ERROR: status = -11
Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_restart_lock_mastery:1215 ERROR: node down! 0
Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_wait_for_lock_mastery:1036 ERROR: status = -11
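For what it's worth, two different timeouts are in play in the logs above. The first o2net message is the network idle timer firing: the cluster interconnect was silent for 10 seconds (the default in the OCFS2 1.2 era), and that timer is independent of the disk heartbeat controlled by O2CB_HEARTBEAT_THRESHOLD (the 451 above works out to (451 - 1) * 2 = 900 seconds). Below is a minimal sketch of raising the network idle timeout, assuming an ocfs2-tools release (roughly 1.2.5 or later) whose o2cb init script honors O2CB_IDLE_TIMEOUT_MS; the 30000 ms value is only illustrative, and every node in the cluster needs identical settings:

web-ha1:~ # cat /etc/sysconfig/o2cb
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=451
# assumption: only honored by ocfs2-tools releases that expose this knob
O2CB_IDLE_TIMEOUT_MS=30000

# restart the cluster stack so the new timeout takes effect
# (unmount all OCFS2 volumes on this node first)
web-ha1:~ # /etc/init.d/o2cb offline ocfs2
web-ha1:~ # /etc/init.d/o2cb unload
web-ha1:~ # /etc/init.d/o2cb load
web-ha1:~ # /etc/init.d/o2cb online ocfs2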
What's the kernel version?

inode wrote:
> Hi all,
>
> I'm running OCFS2 on two systems with OpenSUSE 10.2, connected over
> fibre channel to shared storage (HP MSA1500 + HP PROLIANT MSA20).
>
> The cluster has two nodes (web-ha1 and web-ha2). Once or twice a
> month, OCFS2 stops working on both systems. On the first node I get
> no errors in the log files; after a forced shutdown of the first
> node, I see on the second node the logs at the bottom of this message.
>
> [...]
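If it helps, the version information being asked for is usually collected with something like the following; uname and modinfo are standard, while the rpm query is just a sketch since the exact package names on SUSE may vary:

web-ha2:~ # uname -r
web-ha2:~ # modinfo ocfs2 | grep -i version
web-ha2:~ # rpm -qa | grep -i ocfs2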