Sebastian Reitenbach
2007-Feb-19 02:40 UTC
[Ocfs2-users] problem to get an ocfs2 cluster running
Hi list,

I have been struggling for weeks to get a linux-ha cluster running, managing some
OCFS2 partitions. I think I have isolated the problem to be an OCFS2 problem.

I tried with a disk-based heartbeat, without linux-ha running, to make sure
that OCFS2 is working as expected, but unfortunately it is not.

When I mount the OCFS2 partition on the first host, everything is fine. I can
run "ls /mnt" (where the OCFS2 partition is mounted) and see the directory
listing.

I can mount the partition on the second host. Running the mount command shows
me the partition mounted, and mounted.ocfs2 also shows the partition mounted
on both hosts. But an "ls /mnt" then hangs forever. The ls command is not
killable via kill -9, and an umount is also impossible.

Here is my cluster.conf file:

node:
        ip_port = 7777
        ip_address = 192.168.102.31
        number = 0
        name = ppsnfs101
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.102.32
        number = 1
        name = ppsnfs102
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2

Both hosts are reachable via the network. The OCFS2 partitions are on a SAN,
mounted by both hosts; I also tried to use iSCSI, but with the same result.

I have these OCFS2 RPMs installed, on SLES 10 running on x86_64, with
kernel 2.6.16.27-0.6-smp:

ocfs2-tools-1.2.2-0.2
ocfs2console-1.2.2-0.2

While all this happens, I see the following messages in the logs.

This is the log from the first host mounting this device:

Feb 19 10:53:25 ppsnfs102 kernel: ocfs2: Mounting device (8,1) on (node 1, slot 1)
Feb 19 10:54:45 ppsnfs102 kernel: o2net: connected to node ppsnfs101 (num 0) at 192.168.102.31:7777
Feb 19 10:54:49 ppsnfs102 kernel: ocfs2_dlm: Node 0 joins domain CAD397436504401B86AA79A8BCAE88D4
Feb 19 10:54:49 ppsnfs102 kernel: ocfs2_dlm: Nodes in domain ("CAD397436504401B86AA79A8BCAE88D4"): 0 1
Feb 19 10:54:55 ppsnfs102 kernel: o2net: no longer connected to node ppsnfs101 (num 0) at 192.168.102.31:7777

These are the messages from the second host mounting the device:

Feb 19 10:50:59 ppsnfs101 zmd: Daemon (WARN): Not starting remote web server
Feb 19 10:51:02 ppsnfs101 kernel: eth1: no IPv6 routers present
Feb 19 10:51:03 ppsnfs101 kernel: eth0: no IPv6 routers present
Feb 19 10:54:45 ppsnfs101 kernel: o2net: accepted connection from node ppsnfs102 (num 1) at 192.168.102.32:7777
Feb 19 10:54:49 ppsnfs101 kernel: OCFS2 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
Feb 19 10:54:49 ppsnfs101 kernel: ocfs2_dlm: Nodes in domain ("CAD397436504401B86AA79A8BCAE88D4"): 0 1
Feb 19 10:54:49 ppsnfs101 kernel: kjournald starting. Commit interval 5 seconds
Feb 19 10:54:49 ppsnfs101 kernel: ocfs2: Mounting device (8,1) on (node 0, slot 0)
Feb 19 10:54:55 ppsnfs101 kernel: o2net: connection to node ppsnfs102 (num 1) at 192.168.102.32:7777 has been idle for 10 seconds, shutting it down.
Feb 19 10:54:55 ppsnfs101 kernel: (0,0):o2net_idle_timer:1314 here are some times that might help debug the situation: (tmr 1171878885.425241 now 1171878895.425628 dr 1171878890.425647 adv 1171878890.425655:1171878890.425657 func (573d7565:505) 1171878889.164179:1171878889.164185)
Feb 19 10:54:55 ppsnfs101 kernel: o2net: no longer connected to node ppsnfs102 (num 1) at 192.168.102.32:7777
Feb 19 10:55:14 ppsnfs101 kernel: (7285,2):dlm_do_master_request:1330 ERROR: link to 1 went down!
Feb 19 10:55:14 ppsnfs101 kernel: (7285,2):dlm_get_lock_resource:914 ERROR: status = -107
Feb 19 11:03:04 ppsnfs101 kernel: (7285,2):dlm_restart_lock_mastery:1214 ERROR: node down! 1
Feb 19 11:03:04 ppsnfs101 kernel: (7285,2):dlm_wait_for_lock_mastery:1035 ERROR: status = -11
Feb 19 11:03:05 ppsnfs101 kernel: (7285,2):dlm_get_lock_resource:895 CAD397436504401B86AA79A8BCAE88D4:M00000000000000000a02d1b1f2de9c: at least one node (1) to recover before lock mastery can begin
Feb 19 11:03:09 ppsnfs101 kernel: (7184,0):dlm_get_lock_resource:847 CAD397436504401B86AA79A8BCAE88D4:$RECOVERY: at least one node (1) to recover before lock mastery can begin
Feb 19 11:03:09 ppsnfs101 kernel: (7184,0):dlm_get_lock_resource:874 CAD397436504401B86AA79A8BCAE88D4: recovery map is not empty, but must master $RECOVERY lock now

I am clueless at this point and have no idea why it fails. If there is anybody
who can enlighten me, it would really be appreciated.

Kind regards,
Sebastian
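For readers reproducing this setup: a minimal sanity check of the O2CB stack
on each node looks roughly like the sketch below. It assumes the stock
ocfs2-tools init script; "/dev/sda1" is an assumption that matches the "(8,1)"
device major/minor in the logs above, and the IP is the peer address from the
cluster.conf shown.

  # bring the cluster stack up and the cluster named "ocfs2" online
  /etc/init.d/o2cb load
  /etc/init.d/o2cb online ocfs2

  # confirm the stack and heartbeat status
  /etc/init.d/o2cb status

  # full-detect mode: ask which node slots have the volume mounted
  # (/dev/sda1 is assumed; substitute the real shared device)
  mounted.ocfs2 -f /dev/sda1

  # verify the o2net interconnect peer is reachable at all
  ping -c 3 192.168.102.31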
That kernel is bugged. Use the kernel-of-the-day (KOTD) SP1 branch:

ftp://ftp.suse.com/pub/projects/kernel/kotd/sle10-sp-i386/SLES10_SP1_BRANCH

-------------------------------------------------------------

[full quote of the original report snipped]
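For anyone applying this fix, a quick way to confirm which kernel and which
OCFS2 module a node is actually running (standard commands, nothing assumed
beyond the module name "ocfs2", which ships inside the SLES kernel package):

  # kernel currently booted
  uname -r

  # installed OCFS2 userspace packages
  rpm -qa | grep -i ocfs2

  # version of the in-kernel ocfs2 module
  modinfo ocfs2 | grep -i version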
Sebastian Reitenbach
2007-Feb-19 07:50 UTC
[Ocfs2-users] problem to get an ocfs2 cluster running
Hi Jose,

you are my hero: exchanging the kernel fixed my problem.

Thanks a lot,
Sebastian

"José Costa" <meetra@gmail.com> wrote:
> That kernel is bugged. Use the kernel-of-the-day (KOTD) SP1 branch:
>
> ftp://ftp.suse.com/pub/projects/kernel/kotd/sle10-sp-i386/SLES10_SP1_BRANCH
>
> [full quote of the original report snipped]

--
Sebastian Reitenbach        Tel.: ++49-(0)3381-8904-451
RapidEye AG                 Fax: ++49-(0)3381-8904-101
Molkenmarkt 30              e-mail: reitenbach@rapideye.de
D-14776 Brandenburg         web: http://www.rapideye.de
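A closing note for readers who hit the same "idle for 10 seconds" o2net
disconnects on otherwise healthy nodes: ocfs2-tools releases newer than the
1.2.2 shown above expose the O2CB timeouts as settings in /etc/sysconfig/o2cb.
The excerpt below is a sketch; the variable names come from later 1.2.x tools
and were not yet tunable in the release used in this thread, so check them
against the installed version before relying on them.

  # /etc/sysconfig/o2cb -- excerpt (settings available in ocfs2-tools
  # releases newer than the 1.2.2 used in this thread)
  O2CB_ENABLED=true
  O2CB_BOOTCLUSTER=ocfs2

  # disk heartbeat iterations before a node is declared dead
  O2CB_HEARTBEAT_THRESHOLD=31

  # network idle timeout in ms before o2net drops a connection
  O2CB_IDLE_TIMEOUT_MS=30000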