Sebastian Reitenbach
2007-Feb-19 02:40 UTC
[Ocfs2-users] problem to get an ocfs2 cluster running
Hi list,

I have been struggling for weeks to get a linux-ha cluster running, managing some
OCFS2 partitions. I think I have isolated the problem to be an OCFS2 problem.

I tried with a disk-based heartbeat, without linux-ha running, to make sure
that OCFS2 is working as expected, but unfortunately it is not.

When I mount the OCFS2 partition on the first host, everything is fine. I can
run "ls /mnt" (where the OCFS2 partition is mounted) and see the directory
listing.

I can mount the partition on the second host. Running the mount command shows
me the partition mounted, and mounted.ocfs2 also shows the partition mounted
on both hosts. But an "ls /mnt" then hangs forever. The ls command is not
killable via kill -9, and an umount is also impossible.

Here is my cluster.conf file:

node:
        ip_port = 7777
        ip_address = 192.168.102.31
        number = 0
        name = ppsnfs101
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.102.32
        number = 1
        name = ppsnfs102
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2

Both hosts are reachable via the network. The OCFS2 partitions are on a SAN,
mounted by both hosts; I also tried to use iSCSI, but with the same result.

I have these OCFS2 RPMs installed, on SLES 10 running on x86_64, with
kernel 2.6.16.27-0.6-smp:

ocfs2-tools-1.2.2-0.2
ocfs2console-1.2.2-0.2

While all this happens, I see the following messages in the logs.

This is the log from the first host mounting this device:

Feb 19 10:53:25 ppsnfs102 kernel: ocfs2: Mounting device (8,1) on (node 1, slot 1)
Feb 19 10:54:45 ppsnfs102 kernel: o2net: connected to node ppsnfs101 (num 0) at 192.168.102.31:7777
Feb 19 10:54:49 ppsnfs102 kernel: ocfs2_dlm: Node 0 joins domain CAD397436504401B86AA79A8BCAE88D4
Feb 19 10:54:49 ppsnfs102 kernel: ocfs2_dlm: Nodes in domain ("CAD397436504401B86AA79A8BCAE88D4"): 0 1
Feb 19 10:54:55 ppsnfs102 kernel: o2net: no longer connected to node ppsnfs101 (num 0) at 192.168.102.31:7777

These are the messages from the second host mounting the device:

Feb 19 10:50:59 ppsnfs101 zmd: Daemon (WARN): Not starting remote web server
Feb 19 10:51:02 ppsnfs101 kernel: eth1: no IPv6 routers present
Feb 19 10:51:03 ppsnfs101 kernel: eth0: no IPv6 routers present
Feb 19 10:54:45 ppsnfs101 kernel: o2net: accepted connection from node ppsnfs102 (num 1) at 192.168.102.32:7777
Feb 19 10:54:49 ppsnfs101 kernel: OCFS2 1.2.3-SLES Thu Aug 17 11:38:33 PDT 2006 (build sles)
Feb 19 10:54:49 ppsnfs101 kernel: ocfs2_dlm: Nodes in domain ("CAD397436504401B86AA79A8BCAE88D4"): 0 1
Feb 19 10:54:49 ppsnfs101 kernel: kjournald starting. Commit interval 5 seconds
Feb 19 10:54:49 ppsnfs101 kernel: ocfs2: Mounting device (8,1) on (node 0, slot 0)
Feb 19 10:54:55 ppsnfs101 kernel: o2net: connection to node ppsnfs102 (num 1) at 192.168.102.32:7777 has been idle for 10 seconds, shutting it down.
Feb 19 10:54:55 ppsnfs101 kernel: (0,0):o2net_idle_timer:1314 here are some times that might help debug the situation: (tmr 1171878885.425241 now 1171878895.425628 dr 1171878890.425647 adv 1171878890.425655:1171878890.425657 func (573d7565:505) 1171878889.164179:1171878889.164185)
Feb 19 10:54:55 ppsnfs101 kernel: o2net: no longer connected to node ppsnfs102 (num 1) at 192.168.102.32:7777
Feb 19 10:55:14 ppsnfs101 kernel: (7285,2):dlm_do_master_request:1330 ERROR: link to 1 went down!
Feb 19 10:55:14 ppsnfs101 kernel: (7285,2):dlm_get_lock_resource:914 ERROR: status = -107
Feb 19 11:03:04 ppsnfs101 kernel: (7285,2):dlm_restart_lock_mastery:1214 ERROR: node down! 1
Feb 19 11:03:04 ppsnfs101 kernel: (7285,2):dlm_wait_for_lock_mastery:1035 ERROR: status = -11
Feb 19 11:03:05 ppsnfs101 kernel: (7285,2):dlm_get_lock_resource:895 CAD397436504401B86AA79A8BCAE88D4:M00000000000000000a02d1b1f2de9c: at least one node (1) to recover before lock mastery can begin
Feb 19 11:03:09 ppsnfs101 kernel: (7184,0):dlm_get_lock_resource:847 CAD397436504401B86AA79A8BCAE88D4:$RECOVERY: at least one node (1) to recover before lock mastery can begin
Feb 19 11:03:09 ppsnfs101 kernel: (7184,0):dlm_get_lock_resource:874 CAD397436504401B86AA79A8BCAE88D4: recovery map is not empty, but must master $RECOVERY lock now

I am clueless at this point and have no idea why it fails. If there is anybody
who can enlighten me, it would really be appreciated.

Kind regards,
Sebastian
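For readers reproducing this setup: a minimal sanity check of the O2CB stack
on each node looks roughly like the sketch below. It assumes the stock
ocfs2-tools init script; "/dev/sda1" is an assumption that matches the "(8,1)"
device major/minor in the logs above, and the IP is the peer address from the
cluster.conf shown.

  # bring the cluster stack up and the cluster named "ocfs2" online
  /etc/init.d/o2cb load
  /etc/init.d/o2cb online ocfs2

  # confirm the stack and heartbeat status
  /etc/init.d/o2cb status

  # full-detect mode: ask which node slots have the volume mounted
  # (/dev/sda1 is assumed; substitute the real shared device)
  mounted.ocfs2 -f /dev/sda1

  # verify the o2net interconnect peer is reachable at all
  ping -c 3 192.168.102.31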
That kernel is bugged. Use the kernel-of-the-day (KOTD) SP1 branch:

ftp://ftp.suse.com/pub/projects/kernel/kotd/sle10-sp-i386/SLES10_SP1_BRANCH

-------------------------------------------------------------

[full quote of the original report snipped]
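For anyone applying this fix, a quick way to confirm which kernel and which
OCFS2 module a node is actually running (standard commands, nothing assumed
beyond the module name "ocfs2", which ships inside the SLES kernel package):

  # kernel currently booted
  uname -r

  # installed OCFS2 userspace packages
  rpm -qa | grep -i ocfs2

  # version of the in-kernel ocfs2 module
  modinfo ocfs2 | grep -i version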
Sebastian Reitenbach
2007-Feb-19 07:50 UTC
[Ocfs2-users] problem to get an ocfs2 cluster running
Hi Jose,

you are my hero: exchanging the kernel fixed my problem.

Thanks a lot,
Sebastian

"José Costa" <meetra@gmail.com> wrote:
> That kernel is bugged. Use the kernel-of-the-day (KOTD) SP1 branch:
>
> ftp://ftp.suse.com/pub/projects/kernel/kotd/sle10-sp-i386/SLES10_SP1_BRANCH
>
> [full quote of the original report snipped]

--
Sebastian Reitenbach        Tel.: ++49-(0)3381-8904-451
RapidEye AG                 Fax: ++49-(0)3381-8904-101
Molkenmarkt 30              e-mail: reitenbach@rapideye.de
D-14776 Brandenburg         web: http://www.rapideye.de
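A closing note for readers who hit the same "idle for 10 seconds" o2net
disconnects on otherwise healthy nodes: ocfs2-tools releases newer than the
1.2.2 shown above expose the O2CB timeouts as settings in /etc/sysconfig/o2cb.
The excerpt below is a sketch; the variable names come from later 1.2.x tools
and were not yet tunable in the release used in this thread, so check them
against the installed version before relying on them.

  # /etc/sysconfig/o2cb -- excerpt (settings available in ocfs2-tools
  # releases newer than the 1.2.2 used in this thread)
  O2CB_ENABLED=true
  O2CB_BOOTCLUSTER=ocfs2

  # disk heartbeat iterations before a node is declared dead
  O2CB_HEARTBEAT_THRESHOLD=31

  # network idle timeout in ms before o2net drops a connection
  O2CB_IDLE_TIMEOUT_MS=30000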