Sebastian Reitenbach
2007-Feb-19  02:40 UTC
[Ocfs2-users] problem to get an ocfs2 cluster running
Hi list,
I have been struggling for weeks to get a linux-ha cluster running that manages some
ocfs2 partitions, and I think I have isolated the problem to ocfs2 itself.
To make sure that ocfs2 works as expected, I tried it with the disk-based heartbeat
alone, without linux-ha running, but unfortunately it does not.
When I mount the ocfs2 partition on the first host, everything is fine: I can
run ls /mnt (where the ocfs2 partition is mounted) and see the directory
listing.
I can also mount the partition on the second host. The mount command shows
the partition as mounted, and mounted.ocfs2 shows it mounted on both hosts.
But an ls /mnt on the second host hangs forever; the ls process cannot be killed
with kill -9, and an umount is impossible as well.
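For reference, the test sequence boils down to roughly the following; the device
path /dev/sdb1 is only a placeholder for the actual SAN/iSCSI device:

# on the first host: mount and list -- this works
mount -t ocfs2 /dev/sdb1 /mnt        # /dev/sdb1 is a placeholder device path
ls /mnt                              # directory listing appears as expected

# on the second host: the mount succeeds, but any access hangs
mount -t ocfs2 /dev/sdb1 /mnt
mount | grep ocfs2                   # shows the partition as mounted
mounted.ocfs2 -f                     # reports the volume mounted on both nodes
ls /mnt                              # hangs forever, survives kill -9
umount /mnt                          # hangs as well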
Here is my cluster.conf file:
node:
        ip_port = 7777
        ip_address = 192.168.102.31
        number = 0
        name = ppsnfs101
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 192.168.102.32
        number = 1
        name = ppsnfs102
        cluster = ocfs2
cluster:
        node_count = 2
        name = ocfs2
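A sketch of the sanity checks around this config, assuming the stock o2cb init
script shipped with ocfs2-tools (both nodes need an identical
/etc/ocfs2/cluster.conf and the cluster stack online):

# compare the config on both nodes -- it must be identical
md5sum /etc/ocfs2/cluster.conf

# check that the o2cb cluster stack is loaded and the cluster is online
/etc/init.d/o2cb status
/etc/init.d/o2cb online ocfs2    # "ocfs2" is the cluster name from cluster.conf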
Both hosts are reachable over the network. The OCFS2 partitions are on a SAN
and visible to both hosts; I also tried iSCSI instead, with the same result.
I have these ocfs2 RPMs installed on SLES 10, running on x86_64 with
kernel 2.6.16.27-0.6-smp:
ocfs2-tools-1.2.2-0.2
ocfs2console-1.2.2-0.2
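(For completeness, the versions above are simply what the following report:)

rpm -qa | grep -i ocfs2    # installed OCFS2 userspace packages
uname -r                   # running kernel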
While all this happens, I see the following messages in the logs.
This is the log from the first host that mounts the device:
Feb 19 10:53:25 ppsnfs102 kernel: ocfs2: Mounting device (8,1) on (node 1, 
slot 1)
Feb 19 10:54:45 ppsnfs102 kernel: o2net: connected to node ppsnfs101 (num 0) 
at 192.168.102.31:7777
Feb 19 10:54:49 ppsnfs102 kernel: ocfs2_dlm: Node 0 joins domain 
CAD397436504401B86AA79A8BCAE88D4
Feb 19 10:54:49 ppsnfs102 kernel: ocfs2_dlm: Nodes in domain 
("CAD397436504401B86AA79A8BCAE88D4"): 0 1
Feb 19 10:54:55 ppsnfs102 kernel: o2net: no longer connected to node ppsnfs101 
(num 0) at 192.168.102.31:7777
These are the messages from the second host mounting the device:
Feb 19 10:50:59 ppsnfs101 zmd: Daemon (WARN): Not starting remote web server
Feb 19 10:51:02 ppsnfs101 kernel: eth1: no IPv6 routers present
Feb 19 10:51:03 ppsnfs101 kernel: eth0: no IPv6 routers present
Feb 19 10:54:45 ppsnfs101 kernel: o2net: accepted connection from node 
ppsnfs102 (num 1) at 192.168.102.32:7777
Feb 19 10:54:49 ppsnfs101 kernel: OCFS2 1.2.3-SLES Thu Aug 17 11:38:33 PDT 
2006 (build sles)
Feb 19 10:54:49 ppsnfs101 kernel: ocfs2_dlm: Nodes in domain 
("CAD397436504401B86AA79A8BCAE88D4"): 0 1
Feb 19 10:54:49 ppsnfs101 kernel: kjournald starting.  Commit interval 5 
seconds
Feb 19 10:54:49 ppsnfs101 kernel: ocfs2: Mounting device (8,1) on (node 0, 
slot 0)
Feb 19 10:54:55 ppsnfs101 kernel: o2net: connection to node ppsnfs102 (num 1)
at 192.168.102.32:7777 has been idle for 10 seconds, shutting it down.
Feb 19 10:54:55 ppsnfs101 kernel: (0,0):o2net_idle_timer:1314 here are some
times that might help debug the situation: (tmr 1171878885.425241
now 1171878895.425628 dr 1171878890.425647 adv
1171878890.425655:1171878890.425657 func (573d7565:505)
1171878889.164179:1171878889.164185)
Feb 19 10:54:55 ppsnfs101 kernel: o2net: no longer connected to node ppsnfs102 
(num 1) at 192.168.102.32:7777
Feb 19 10:55:14 ppsnfs101 kernel: (7285,2):dlm_do_master_request:1330 ERROR: 
link to 1 went down!
Feb 19 10:55:14 ppsnfs101 kernel: (7285,2):dlm_get_lock_resource:914 ERROR: 
status = -107
Feb 19 11:03:04 ppsnfs101 kernel: (7285,2):dlm_restart_lock_mastery:1214 
ERROR: node down! 1
Feb 19 11:03:04 ppsnfs101 kernel: (7285,2):dlm_wait_for_lock_mastery:1035 
ERROR: status = -11
Feb 19 11:03:05 ppsnfs101 kernel: (7285,2):dlm_get_lock_resource:895
CAD397436504401B86AA79A8BCAE88D4:M00000000000000000a02d1b1f2de9c: at
least one node (1) to recover before lock mastery can begin
Feb 19 11:03:09 ppsnfs101 kernel: (7184,0):dlm_get_lock_resource:847
CAD397436504401B86AA79A8BCAE88D4:$RECOVERY: at least one node (1) to
recover before lock mastery can begin
Feb 19 11:03:09 ppsnfs101 kernel: (7184,0):dlm_get_lock_resource:874
CAD397436504401B86AA79A8BCAE88D4: recovery map is not empty, but must
master $RECOVERY lock now
I am clueless at this point and have no idea why it fails. If anybody can
enlighten me, it would really be appreciated.
Kind regards,
Sebastian
That kernel is bugged. Use the Kernel of the day SP1 branch.
ftp://ftp.suse.com/pub/projects/kernel/kotd/sle10-sp-i386/SLES10_SP1_BRANCH
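Applying that amounts to something like the following on each node; the RPM
file names are placeholders, since the contents of the KOTD directory change
with every daily build (and for an x86_64 system the matching architecture
directory should be used instead of i386):

# fetch the current kernel-of-the-day package (names change daily; placeholder)
wget ftp://ftp.suse.com/pub/projects/kernel/kotd/sle10-sp-i386/SLES10_SP1_BRANCH/kernel-smp-<version>.rpm

# install the new kernel and reboot into it
rpm -Uvh kernel-smp-<version>.rpm
reboot

# afterwards, repeat the two-node mount test on both hosts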
Sebastian Reitenbach
2007-Feb-19  07:50 UTC
[Ocfs2-users] problem to get an ocfs2 cluster running
Hi Jose,

you are my hero, exchanging the kernel fixed my problem.

Thanks a lot,
Sebastian

"José Costa" <meetra@gmail.com> wrote:
> That kernel is bugged. Use the Kernel of the day SP1 branch.
>
> ftp://ftp.suse.com/pub/projects/kernel/kotd/sle10-sp-i386/SLES10_SP1_BRANCH

--
Sebastian Reitenbach          Tel.:   ++49-(0)3381-8904-451
RapidEye AG                   Fax:    ++49-(0)3381-8904-101
Molkenmarkt 30                e-mail: reitenbach@rapideye.de
D-14776 Brandenburg           web:    http://www.rapideye.de