Florin Andrei
2009-Sep-23 18:31 UTC
[Ocfs2-users] "another node is heartbeating in our slot"
OCFS2 cluster, two nodes, nothing fancy:

#####################################
[root@serv1 ~]# cat /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 10.10.20.64
        number = 0
        name = serv1.foobar
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.10.20.65
        number = 1
        name = serv2.foobar
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2
#####################################

A filesystem shared by these two machines got mounted on a 3rd machine, which is part of another cluster, and the 3rd machine happens to share the same node number with serv2.

Some files were deleted on the 3rd machine, then the fs was unmounted from it (but remained mounted on 1 and 2).

As a result, a bunch of messages like this appeared in the logs:

serv2 kernel: (21146,1):o2hb_do_disk_heartbeat:982 ERROR: Device "dm-3": another node is heartbeating in our slot!

And now there's a discrepancy between the disk usage indicated by df (it's pretty high) and du (it's much lower). Also, ls -l generates weird output for some files (which were supposedly deleted on the 3rd machine):

?--------- ? ? ? ? ? access_log.20090601
?--------- ? ? ? ? ? access_log.20090602
?--------- ? ? ? ? ? access_log.20090603
?--------- ? ? ? ? ? access_log.20090604

I unmounted the fs on serv2 then mounted it back, but that didn't help. Didn't try to unmount serv1 yet.

Any suggestions?

-- 
Florin Andrei
http://florin.myip.org/
Sunil Mushran
2009-Sep-23 18:36 UTC
[Ocfs2-users] "another node is heartbeating in our slot"
You cannot share a device between two different clusters.
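To make the collision described in this thread concrete, here is a rough Python sketch that compares two cluster.conf files and reports node numbers claimed by different machines. It is only an illustration, not an OCFS2 tool: the parser assumes the simple "stanza:" plus "key = value" layout shown in the original post, the second configuration (othercluster, serv3.example, serv4.example and their addresses) is entirely hypothetical because the thread never shows the 3rd machine's config, and the link between node number and heartbeat slot is inferred from the error message.

```python
def parse_cluster_conf(text):
    """Parse cluster.conf-style text into a list of (stanza_type, fields)."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith(":") and "=" not in line:
            # A line such as "node:" or "cluster:" opens a new stanza.
            current = (line[:-1], {})
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[1][key.strip()] = value.strip()
    return stanzas


def node_slots(text):
    """Map node number -> (cluster name, node name) for every node stanza."""
    return {
        int(fields["number"]): (fields["cluster"], fields["name"])
        for kind, fields in parse_cluster_conf(text)
        if kind == "node"
    }


def slot_conflicts(conf_a, conf_b):
    """Return node numbers that two configs assign to different nodes."""
    a, b = node_slots(conf_a), node_slots(conf_b)
    return sorted(
        (num, a[num], b[num])
        for num in a.keys() & b.keys()
        if a[num] != b[num]
    )


# Configuration quoted in the original post.
THREAD_CONF = """
node:
    ip_port = 7777
    ip_address = 10.10.20.64
    number = 0
    name = serv1.foobar
    cluster = ocfs2
node:
    ip_port = 7777
    ip_address = 10.10.20.65
    number = 1
    name = serv2.foobar
    cluster = ocfs2
cluster:
    node_count = 2
    name = ocfs2
"""

# HYPOTHETICAL config for the other cluster; not taken from the thread.
OTHER_CONF = """
node:
    ip_port = 7777
    ip_address = 192.0.2.10
    number = 1
    name = serv3.example
    cluster = othercluster
node:
    ip_port = 7777
    ip_address = 192.0.2.11
    number = 2
    name = serv4.example
    cluster = othercluster
cluster:
    node_count = 2
    name = othercluster
"""

if __name__ == "__main__":
    for num, (cl_a, name_a), (cl_b, name_b) in slot_conflicts(THREAD_CONF, OTHER_CONF):
        print(f"slot {num}: {name_a} ({cl_a}) vs {name_b} ({cl_b})")
```

An empty result would not make sharing the device safe. Nodes from two clusters do not coordinate locking with each other even when their node numbers differ, which is why the answer above rules out sharing entirely.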