thr3ads.net - Ocfs2 users - [Ocfs2-users] server reboots due to heartbeat error

Khosrof Ohanian

2012-Jun-28 19:18 UTC

[Ocfs2-users] server reboots due to heartbeat error - ocfs node

I have been trying to submit this message to the group for 3 days now. I hope
it works this time around. Any help would be appreciated.

I am troubleshooting an issue with my development RAC server where node 1
reboots due to a heartbeat timeout. I am getting multiple errors from ocfs2 and
am wondering whether my configuration or my procedure is wrong. The issue
comes up when I clone the OCFS2 LUN for the production system to a separate
LUN used for development. I take standard precautions, yet node 1 on
the development cluster (2 nodes) still reboots.

System design:
2 ocfs2 clusters (prod and development)
production = 3 nodes, development = 2 nodes
All systems are running Oracle Linux Server release 5.8
kernel: [root@node1 ~]# uname -a
Linux node1.xxxxx.com 2.6.18-308.1.1.0.1.el5 #1 SMP Wed Mar 7 11:39:17 EST 2012
x86_64 x86_64 x86_64 GNU/Linux
OCFS release ocfs2-2.6.18-308.1.1.0.1.el5-1.4.9-1.el5

Cluster node setup on the prod and development clusters:
Unique cluster names for each.
Nodes in the production cluster are numbered 1,2,3
Nodes in the development cluster are numbered 1,2
QUESTION 1: Should the node numbers be unique since I am cloning a LUN between
the 2 clusters? See the error below about another node heartbeating in our slot.

Procedure:
Shut down all processes using the LUNs to be cloned.
Unmount the LUNs to be cloned on both development nodes (devnode1, devnode2).
Synchronize the clone to the production LUN on the SAN (EMC CLARiiON), then
fracture the LUNs.
On devnode2 I run the following commands:
fsck.ocfs2 (fix errors)
tunefs.ocfs2 --label=dev.index /dev/path (set new label)
tunefs.ocfs2 --uuid-reset /dev/path (set random uuid)
On devnode1 I run:
sfdisk -R /dev/path (re-read partitions to grab new label and uuid)
Then I mount the volumes onto both nodes
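
The devnode2 and devnode1 steps above can be collected into a small dry-run script. This is only a sketch under stated assumptions: /dev/path and the label dev.index are the placeholders from the commands above, the DRY_RUN variable and the helper names (run, refresh_clone, reread_partitions) are made up for illustration, and the -f flag passed to fsck.ocfs2 (force a full check) is an assumption, since the procedure only says "fix errors". With DRY_RUN=1, the default, the script only prints the commands it would run.

```shell
#!/bin/sh
# Dry-run sketch of the post-clone steps described above.
# DEV and LABEL are the placeholders from the post; DRY_RUN and the
# helper names are hypothetical.
DEV="${DEV:-/dev/path}"
LABEL="${LABEL:-dev.index}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps run on devnode2 after the clone has been fractured.
refresh_clone() {
    # Check the cloned volume (-f is an assumption, see above).
    run fsck.ocfs2 -f "$DEV" || return 1
    # Set the new label.
    run tunefs.ocfs2 --label="$LABEL" "$DEV" || return 1
    # Assign a random UUID.
    run tunefs.ocfs2 --uuid-reset "$DEV"
}

# Step run on devnode1 so it picks up the new label and UUID.
reread_partitions() {
    run sfdisk -R "$DEV"
}
```

Calling refresh_clone on devnode2 and then reread_partitions on devnode1 prints the four commands in the order given above; setting DRY_RUN=0 would execute them instead, stopping at the first failure.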

Errors:
Both nodes constantly report the same error on the cloned luns.
kernel: (o2hb-D34207AE9F,4086,16):o2hb_do_disk_heartbeat:781 ERROR: Device
"emcpowerj1": another node is heartbeating in our slot!
However, the error above does not cause any instability.
After I unmount the luns and start the clone, I get the following error for a
few minutes:
(MpxTestDaemon ?,14513,8):o2hb_bio_end_io:241 ERROR: IO Error -5
kernel: (o2hb-023EBFE1B5,3945,8):o2hb_do_disk_heartbeat:772 ERROR: status = -5
After 3 minutes the system reboots, and the logs show the following:
kernel: (events/8,70,8):o2hb_write_timeout:176 ERROR: Heartbeat write timeout to
device emcpowerl1 after 150000 milliseconds
(events/8,70,8):o2hb_stop_all_regions:2026 ERROR: stopping heartbeat on all
active regions.
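
For context on the 150000 millisecond figure in the log: the OCFS2 documentation describes the disk heartbeat timeout as (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds. Assuming that relation applies to this release, the sketch below converts between the two; the function names are illustrative and nothing here reads the actual cluster configuration.

```shell
#!/bin/sh
# Assumed relation (taken from the OCFS2 documentation, not from this post):
#   timeout_seconds = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2

# Convert a threshold value to a timeout in milliseconds.
threshold_to_ms() {
    echo $(( ($1 - 1) * 2 * 1000 ))
}

# Convert a timeout in milliseconds to the matching threshold.
ms_to_threshold() {
    echo $(( $1 / 2000 + 1 ))
}

# Threshold that would correspond to the timeout logged above.
ms_to_threshold 150000
```

Under that assumption, the 150000 ms in the log would correspond to a threshold of 76, which suggests the timeout on this node has been raised from a lower default.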

QUESTION 2: Do I need to stop the heartbeat on the unmounted LUNs before the
SAN unpresents them from the server? I found the following command:
ocfs2_hb_ctl -K -d /dev/device
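
A cautious way to try the command above is to confirm the volume is no longer mounted, look at the heartbeat reference information first, and only then stop the heartbeat. The sketch below does this in dry-run form. Assumptions: /dev/device is the placeholder from the command above, DRY_RUN and the function names are hypothetical, and the -I option is used on the understanding that it prints heartbeat reference information for the device. The -K form is copied as quoted; the exact arguments may differ between ocfs2-tools versions, so the man page for the installed version should be checked.

```shell
#!/bin/sh
# Dry-run sketch: inspect, then stop, the heartbeat on an unmounted device.
# HB_DEV is the placeholder from the post; DRY_RUN and the function
# names are hypothetical.
HB_DEV="${HB_DEV:-/dev/device}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

stop_heartbeat() {
    # Refuse to continue if the device is still listed as mounted here.
    if grep -q "^$HB_DEV " /proc/mounts 2>/dev/null; then
        echo "refusing: $HB_DEV is still mounted" >&2
        return 1
    fi
    # Show heartbeat reference information for the device.
    run ocfs2_hb_ctl -I -d "$HB_DEV" || return 1
    # Stop the heartbeat, using the command form quoted in the post.
    run ocfs2_hb_ctl -K -d "$HB_DEV"
}
```

Running stop_heartbeat with the default DRY_RUN=1 only prints the two ocfs2_hb_ctl invocations, which makes it easy to review them before the SAN unpresents the LUN.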

QUESTION 3: Am I doing anything else wrong in my procedure that would be
causing the heartbeat issue and server reboot?

Thanks in advance for your time and replies.
