Sebastian Reitenbach
2007-Feb-26 12:18 UTC
[Ocfs2-users] dlm timeouts and following errors -112
Hi list,

I am experimenting with ocfs2 (rpm package: 1.2.2-0.2), using linux-ha 2.0.8 (all running on SLES 10 x86-64, rpm packages from linux-ha.org) for the heartbeat. The three nodes are connected to a gigabit switch. From time to time I have problems unmounting a drive, and I have to reboot the whole system to fix the problem. When these lockups occur, I see these messages in /var/log/messages:

Feb 26 21:03:47 ppsbackup101 heartbeat: [5394]: ERROR: Irretrievably lost packet: node ppsdb102 seq 6
Feb 26 21:03:47 ppsbackup101 heartbeat: [5394]: ERROR: Irretrievably lost packet: node ppsdb102 seq 6
Feb 26 21:04:32 ppsbackup101 kernel: o2net: connection to node ppsnfs102 (num 3) at 192.168.102.32:7777 has been idle for 300.0 seconds, shutting it down.
Feb 26 21:04:32 ppsbackup101 kernel: (5394,1):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1172519972.626184 now 1172520272.653263 dr 1172519972.626167 adv 1172519972.626208:1172519972.626210 func (666c6172:510) 1172519972.626186:1172519972.626195)
Feb 26 21:04:32 ppsbackup101 kernel: o2net: no longer connected to node ppsnfs102 (num 3) at 192.168.102.32:7777
Feb 26 21:04:32 ppsbackup101 kernel: (8915,0):dlm_drop_lockres_ref:2283 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (11534,2):dlm_request_join:899 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (11534,2):dlm_try_to_join_domain:1048 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (8915,0):dlm_purge_lockres:189 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (11534,2):dlm_join_domain:1321 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (11534,2):dlm_register_domain:1514 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (11534,2):ocfs2_dlm_init:2007 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (11375,0):dlm_leave_domain:565 Error -112 sending domain exit message to node 3
Feb 26 21:04:32 ppsbackup101 kernel: (11534,2):ocfs2_mount_volume:1093 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: ocfs2: Unmounting device (8,145) on (node 4)
Feb 26 21:04:32 ppsbackup101 kernel: (11449,3):dlm_request_join:899 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (11449,3):dlm_try_to_join_domain:1048 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (11449,3):dlm_join_domain:1321 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (11449,3):dlm_register_domain:1514 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (11449,3):ocfs2_dlm_init:2007 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: (11449,3):ocfs2_mount_volume:1093 ERROR: status = -112
Feb 26 21:04:32 ppsbackup101 kernel: ocfs2: Unmounting device (8,97) on (node 4)
Feb 26 21:04:32 ppsbackup101 kernel: ocfs2: Unmounting device (8,129) on (node 4)
Feb 26 21:04:33 ppsbackup101 kernel: ocfs2: Unmounting device (8,113) on (node 4)

I think it is because of the timeout at the beginning of the logs, but I don't know whether I am right, or what I can do to keep it from happening again. Is there anything I can do to overcome these problems?

kind regards
Sebastian
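As a quick sanity check on the timeout message, the gap between the `tmr` and `now` fields of the o2net_idle_timer line can be computed directly. This is only a sketch: it assumes `tmr` is the time the idle timer was last armed and `now` is the time it fired (both as epoch seconds), which is an interpretation of the log rather than something the log states. The helper name `idle_seconds` is made up for illustration.

```python
import re
from decimal import Decimal

# The o2net_idle_timer line, copied from the log in this message.
LINE = (
    "(5394,1):o2net_idle_timer:1426 here are some times that might help "
    "debug the situation: (tmr 1172519972.626184 now 1172520272.653263 "
    "dr 1172519972.626167 adv 1172519972.626208:1172519972.626210 "
    "func (666c6172:510) 1172519972.626186:1172519972.626195)"
)


def idle_seconds(line):
    """Return now - tmr in seconds, or None if either field is missing.

    Assumption: tmr and now are epoch timestamps in seconds.
    Decimal avoids float rounding on the microsecond digits.
    """
    m = re.search(r"tmr (\d+\.\d+) now (\d+\.\d+)", line)
    if m is None:
        return None
    return Decimal(m.group(2)) - Decimal(m.group(1))


if __name__ == "__main__":
    print(idle_seconds(LINE))  # 300.027079
```

The result, about 300.03 seconds, agrees with the "idle for 300.0 seconds" message logged just before it.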
Yes, the messages are related. -112 is EHOSTDOWN.
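For anyone reading similar logs, here is a small sketch of how the negative status values could be turned into errno names. The table is a hand-copied subset of the Linux errno numbers (from asm-generic/errno.h), hardcoded so the result does not depend on the machine running the script. The function name `decode_status` and the regular expression are illustrative assumptions, not part of ocfs2 or its tools.

```python
import re

# Subset of Linux errno values (asm-generic/errno.h), network-related.
LINUX_ERRNO = {
    107: "ENOTCONN",
    110: "ETIMEDOUT",
    111: "ECONNREFUSED",
    112: "EHOSTDOWN",
    113: "EHOSTUNREACH",
}

# Matches both "ERROR: status = -112" and "Error -112 sending ...".
STATUS_RE = re.compile(r"(?:status\s*=|Error)\s*-(\d+)")


def decode_status(line):
    """Return the errno name for a kernel log line, or None if no status."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    code = int(m.group(1))
    # Fall back to the raw number for codes missing from the subset above.
    return LINUX_ERRNO.get(code, "errno %d" % code)


if __name__ == "__main__":
    sample = "(8915,0):dlm_drop_lockres_ref:2283 ERROR: status = -112"
    print(decode_status(sample))  # EHOSTDOWN
```

On a Linux machine, `errno.errorcode[112]` from the Python standard library gives the same name. The hardcoded table is used here only because errno numbering differs between operating systems.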