Gareth Bult
2008-Feb-09 10:22 UTC
[Ocfs2-users] Re: another node is heartbeating in our slot! (when starting 3rd node)
Hi,

I've seen a number of people with this problem (me too!) but nobody seems
to have a solution; any help would be greatly appreciated.

Two nodes work fine with DRBD/OCFS2, but when I add a third using GNBD, I
seem to run into problems...

I'm running an RH 2.6.21 kernel with Xen 3.2 - OCFS2 1.3.3 - Tools 1.2.4.

I have the following config;

node:
        ip_port = 7777
        ip_address = 10.0.0.1
        number = 0
        name = nodea
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.0.0.2
        number = 1
        name = nodeb
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.0.0.20
        number = 3
        name = mgm
        cluster = ocfs2

cluster:
        node_count = 3
        name = ocfs2

nodea is running a 400G filesystem on /dev/drbd1
nodeb is running a 400G filesystem on /dev/drbd2 (mirroring drbd1 using DRBD 8)

I can bring up nodes a and b and everything looks fine and works with no
problems; both systems can mount their respective DRBD devices.

I then run gnbd_serv on both machines and export the drbd devices.

On booting "mgm", I load drbd-client, then /etc/init.d/o2cb; so far so good:

root@mgm:~# /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
  Heartbeat dead threshold = 7
  Network idle timeout: 10000
  Network keepalive delay: 5000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Not active

root@mgm:~# mounted.ocfs2 -f
Device      FS     Nodes
/dev/gnbd0  ocfs2  nodea, nodeb
/dev/gnbd1  ocfs2  nodea, nodeb

root@mgm:~# mounted.ocfs2 -d
Device      FS     UUID                                  Label
/dev/gnbd0  ocfs2  35fff639-0ec2-4a8d-8849-2b9ef078a40a  brick
/dev/gnbd1  ocfs2  35fff639-0ec2-4a8d-8849-2b9ef078a40a  brick

Slots;
Slot#  Node#
    0      0
    1      1

Slot#  Node#
    0      0
    1      1

Now I come to try and mount a device on host "mgm";

mount -t ocfs2 /dev/gnbd0 /cluster

In the kernel log on nodea I see;

Feb  9 17:37:01 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!
Feb  9 17:37:03 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!

On nodeb I see;

Feb  9 17:37:00 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!
Feb  9 17:37:02 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!

And within 10 seconds or so both machines fence themselves off and reboot.

It "seems" as though mgm is not recognising that slots 0 and 1 are already
taken, but everything "looks" OK to me.

Can anyone spot any glaring mistakes, or suggest a way I can debug this or
provide more information to the list?

Many thanks,
Gareth.
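One way to cross-check which node numbers are actually writing heartbeats
on each device is sketched below. This is a hedged sketch, not a verified
recipe: it assumes the debugfs.ocfs2 shipped with these ocfs2-tools
supports the "slotmap" and "hb" commands, and it reuses the device paths
from the post above.

        # On nodea (and likewise nodeb), against the local DRBD device:
        debugfs.ocfs2 -R "slotmap" /dev/drbd1   # which slot maps to which node number
        debugfs.ocfs2 -R "hb" /dev/drbd1        # raw heartbeat blocks, per slot

        # On mgm, against the same filesystem imported via GNBD:
        debugfs.ocfs2 -R "slotmap" /dev/gnbd0

        # On every node, confirm the cluster maps are identical:
        md5sum /etc/ocfs2/cluster.conf          # sums should match across all nodes

If the slot-to-node mapping reported on mgm differs from what nodea and
nodeb report for the same filesystem, the cluster.conf files have likely
drifted apart.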
Saar Maoz
2008-Feb-11 00:20 UTC
[Ocfs2-users] Re: another node is heartbeating in our slot! (when starting 3rd node)
> I've seen a number of people with this problem (me too!) but nobody
> seems to have a solution,

Nobody?? :)

Compare /etc/ocfs2/cluster.conf on all nodes and make sure it is identical.

I would suggest changing:

        number = 3

to:

        number = 2

since you only have 3 nodes and node numbers start from zero (see the full
stanza sketched below).

The error simply says that the node can't join because another node is
using its slot; this can easily happen if these files are out of sync
across the nodes. Also confirm that 'hostname -s' returns the same value
as "name =" in that file.

I'm sure you'll resolve this very quickly.

Good luck,
Saar.

--
Saar Maoz
Consulting Software Engineer
Oracle Corporation
Saar.Maoz@oracle.com

Share your knowledge with others and compete with yourself
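Putting that together, the mgm stanza would read as follows. This is a
sketch only: it assumes the rest of cluster.conf stays exactly as posted,
and the corrected file must be copied verbatim to all three nodes.

        node:
                ip_port = 7777
                ip_address = 10.0.0.20
                number = 2
                name = mgm
                cluster = ocfs2

After syncing the file, taking the cluster offline and back online on mgm
should pick up the new node number; the offline/online arguments of the
stock o2cb init script are assumed here:

        /etc/init.d/o2cb offline ocfs2
        /etc/init.d/o2cb online ocfs2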