Gareth Bult
2008-Feb-09 10:22 UTC
[Ocfs2-users] Re: another node is heartbeating in our slot! (when starting 3rd node)
Hi,

I've seen a number of people with this problem (me too!) but nobody seems
to have a solution; any help would be greatly appreciated.

Two nodes work fine with DRBD/OCFS2, but when I add a third using GNBD, I
seem to run into problems...

I'm running an RH 2.6.21 kernel with Xen 3.2 - OCFS2 1.3.3 - Tools 1.2.4.

I have the following config;

node:
        ip_port = 7777
        ip_address = 10.0.0.1
        number = 0
        name = nodea
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.0.0.2
        number = 1
        name = nodeb
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.0.0.20
        number = 3
        name = mgm
        cluster = ocfs2

cluster:
        node_count = 3
        name = ocfs2

nodea is running a 400G filesystem on /dev/drbd1
nodeb is running a 400G filesystem on /dev/drbd2 (mirroring drbd1 using DRBD 8)

I can bring up nodes a and b and everything looks fine and works with no
problems; both systems can mount their respective DRBD devices.

I then run gnbd_serv on both machines and export the drbd devices.

On booting "mgm", I load drbd-client, then /etc/init.d/o2cb; so far so good:

root@mgm:~# /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
  Heartbeat dead threshold = 7
  Network idle timeout: 10000
  Network keepalive delay: 5000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Not active

root@mgm:~# mounted.ocfs2 -f
Device      FS     Nodes
/dev/gnbd0  ocfs2  nodea, nodeb
/dev/gnbd1  ocfs2  nodea, nodeb

root@mgm:~# mounted.ocfs2 -d
Device      FS     UUID                                  Label
/dev/gnbd0  ocfs2  35fff639-0ec2-4a8d-8849-2b9ef078a40a  brick
/dev/gnbd1  ocfs2  35fff639-0ec2-4a8d-8849-2b9ef078a40a  brick

Slots;
Slot#  Node#
    0      0
    1      1

Slot#  Node#
    0      0
    1      1

Now I come to try and mount a device on host "mgm";

mount -t ocfs2 /dev/gnbd0 /cluster

In the kernel log on nodea I see;

Feb  9 17:37:01 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!
Feb  9 17:37:03 nodea kernel: (3576,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd1": another node is heartbeating in our slot!

On nodeb I see;

Feb  9 17:37:00 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!
Feb  9 17:37:02 nodeb kernel: (3515,0):o2hb_do_disk_heartbeat:767 ERROR: Device "drbd2": another node is heartbeating in our slot!

And within 10 seconds or so both machines fence themselves off and reboot.

It "seems" as though mgm is not recognising that slots 0 and 1 are already
taken, but everything "looks" OK to me.

Can anyone spot any glaring mistakes, or suggest a way I can debug this or
provide more information to the list?

Many thanks,
Gareth.
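One way to cross-check which node numbers are actually writing heartbeats
on each device is sketched below. This is a hedged sketch, not a verified
recipe: it assumes the debugfs.ocfs2 shipped with these ocfs2-tools
supports the "slotmap" and "hb" commands, and it reuses the device paths
from the post above.

        # On nodea (and likewise nodeb), against the local DRBD device:
        debugfs.ocfs2 -R "slotmap" /dev/drbd1   # which slot maps to which node number
        debugfs.ocfs2 -R "hb" /dev/drbd1        # raw heartbeat blocks, per slot

        # On mgm, against the same filesystem imported via GNBD:
        debugfs.ocfs2 -R "slotmap" /dev/gnbd0

        # On every node, confirm the cluster maps are identical:
        md5sum /etc/ocfs2/cluster.conf          # sums should match across all nodes

If the slot-to-node mapping reported on mgm differs from what nodea and
nodeb report for the same filesystem, the cluster.conf files have likely
drifted apart.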
Saar Maoz
2008-Feb-11 00:20 UTC
[Ocfs2-users] Re: another node is heartbeating in our slot! (when starting 3rd node)
> I've seen a number of people with this problem (me too!) but nobody
> seems to have a solution,

Nobody?? :)

Compare /etc/ocfs2/cluster.conf on all nodes and make sure it is identical.

I would suggest changing:

        number = 3

to:

        number = 2

since you only have 3 nodes and node numbers start from zero (see the full
stanza sketched below).

The error simply says that the node can't join because another node is
using its slot; this can easily happen if these files are out of sync
across the nodes. Also confirm that 'hostname -s' returns the same value
as "name =" in that file.

I'm sure you'll resolve this very quickly.

Good luck,
Saar.

--
Saar Maoz
Consulting Software Engineer
Oracle Corporation
Saar.Maoz@oracle.com

Share your knowledge with others and compete with yourself
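Putting that together, the mgm stanza would read as follows. This is a
sketch only: it assumes the rest of cluster.conf stays exactly as posted,
and the corrected file must be copied verbatim to all three nodes.

        node:
                ip_port = 7777
                ip_address = 10.0.0.20
                number = 2
                name = mgm
                cluster = ocfs2

After syncing the file, taking the cluster offline and back online on mgm
should pick up the new node number; the offline/online arguments of the
stock o2cb init script are assumed here:

        /etc/init.d/o2cb offline ocfs2
        /etc/init.d/o2cb online ocfs2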