Jon Norris
2014-Nov-13 02:26 UTC
[Ocfs2-users] OCFS2 v1.8 on VMware VMs global heartbeat woes
Running two VMs on ESXi 5.1.0 and trying to get global heartbeat (HB) working, with no luck (on about my 20th rebuild and redo).

Environment: two VMware-based VMs running:

# cat /etc/oracle-release
Oracle Linux Server release 6.5
# uname -r
2.6.32-400.36.8.el6uek.x86_64
# yum list installed | grep ocfs
ocfs2-tools.x86_64          1.8.0-11.el6             @oel-latest
# yum list installed | grep uek
kernel-uek.x86_64           2.6.32-400.36.8.el6uek   @oel-latest
kernel-uek-firmware.noarch  2.6.32-400.36.8.el6uek   @oel-latest
kernel-uek-headers.x86_64   2.6.32-400.36.8.el6uek   @oel-latest

Configuration: the shared data stores (HB and mounted OCFS2) are set up the same way VMware and Oracle describe for shared RAC VMware-based data stores. All the blogs, wikis and VMware KB docs show a similar setup - shared SCSI settings [multi-writer], shared disk [independent + persistent], etc. - such as:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1034165

The devices can be seen by both VMs in the OS. I have used the same configuration to run an OCFS2 setup with local heartbeat, and that works fine (the cluster starts up and the OCFS2 file system mounts with no issues).

I followed the procedures shown in the Oracle docs and blog:
https://docs.oracle.com/cd/E37670_01/E37355/html/ol_instcfg_ocfs2.html
https://blogs.oracle.com/wim/entry/ocfs2_global_heartbeat
with no luck. The shared SCSI controllers are VMware paravirtual and set to "shared none" as suggested by the VMware RAC shared disk KB mentioned above.

After the shared Linux devices have been added to both VMs and are visible in the OS on each (ls /dev/sd* shows the devices on each), I format the global HB devices from one VM along these lines:

# mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol1 --cluster-name=test --cluster-stack=o2cb --global-heartbeat /dev/sdc
# mkfs.ocfs2 -b 4K -C 4K -J size=4M -N 4 -L ocfs2vol2 --cluster-name=test --cluster-stack=o2cb --global-heartbeat /dev/sdd

From both VMs you can run the following and see:

# mounted.ocfs2 -d
Device    Stack  Cluster  F  UUID                              Label
/dev/sdc  o2cb   test     G  5620F19D43D840C7A46523019AE15A96  ocfs2vol1
/dev/sdd  o2cb   test     G  9B9182279ABD4FD99F695F91488C94C1  ocfs2vol2

I then add the global HB devices to the ocfs2 config file with commands like:

# o2cb add-heartbeat test 5620F19D43D840C7A46523019AE15A96
# o2cb add-heartbeat test 9B9182279ABD4FD99F695F91488C94C1

Thus far looking good (heh, but then all we've done is format ocfs2 with options and update a text file) - then I do the following:

# o2cb heartbeat-mode test global

All of this is done on one node in the cluster; I then copy the following to the other node (hostnames changed here, though the actual names match the output of the hostname command on each node):

# cat /etc/ocfs2/cluster.conf
node:
        name = clusterhost1.mydomain.com
        cluster = test
        number = 0
        ip_address = 10.143.144.12
        ip_port = 7777

node:
        name = clusterhost2.mydomain.com
        cluster = test
        number = 1
        ip_address = 10.143.144.13
        ip_port = 7777

cluster:
        name = test
        heartbeat_mode = global
        node_count = 2

heartbeat:
        cluster = test
        region = 5620F19D43D840C7A46523019AE15A96

heartbeat:
        cluster = test
        region = 9B9182279ABD4FD99F695F91488C94C1

The same config works fine with heartbeat_mode set to local and the global heartbeat devices removed, and I can mount a shared FS. The HB interfaces are IPv4 on a private, non-routed L2 VLAN, are up, and each node can ping the other.
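(In case it helps: the quick sanity check I run on both nodes to make sure they agree on the config and see the same heartbeat regions - nothing fancy, just comparing a checksum of cluster.conf and re-running mounted.ocfs2; the device names are specific to my setup:)

# md5sum /etc/ocfs2/cluster.conf
# mounted.ocfs2 -d
# ls -l /dev/sdc /dev/sdd

Both nodes report the same checksum and the same two regions, for what it's worth.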
Once the config is copied to each node, I have already run:

# service o2cb configure

which completes fine (as it did for local heartbeat mode), so the cluster will start on boot and the parameters are left at the default timeouts, etc. I check that the service on both nodes unloads and loads the modules with no issues:

# service o2cb unload
Clean userdlm domains: OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unloading module "ocfs2_stack_o2cb": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK

# service o2cb load
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK

# mount -v
...
debugfs on /sys/kernel/debug type debugfs (rw)
...
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)

# lsmod | grep ocfs
ocfs2_dlmfs            18026  1
ocfs2_stack_o2cb        3606  0
ocfs2_dlm             196778  1 ocfs2_stack_o2cb
ocfs2_nodemanager     202856  3 ocfs2_dlmfs,ocfs2_stack_o2cb,ocfs2_dlm
ocfs2_stackglue        11283  2 ocfs2_dlmfs,ocfs2_stack_o2cb
configfs               25853  2 ocfs2_nodemanager

Looks good on both nodes... then (sigh):

# service o2cb enable
Writing O2CB configuration: OK
Setting cluster stack "o2cb": OK
Registering O2CB cluster "test": Failed
o2cb: Unable to access cluster service while registering heartbeat mode 'global'
Unregistering O2CB cluster "test": OK

I have searched for the error string and have come up with a huge ZERO on help - and the local OS log messages are equally unhelpful:

# tail /var/log/messages
Nov 12 21:54:53 clusterhost1 o2cb.init: online test
Nov 13 00:58:38 clusterhost1 o2cb.init: online test
Nov 13 01:00:06 clusterhost1 o2cb.init: offline test 0
Nov 13 01:00:06 clusterhost1 kernel: ocfs2: Unregistered cluster interface o2cb
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 Node Manager 1.6.3
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 DLM 1.6.3
Nov 13 01:01:14 clusterhost1 kernel: ocfs2: Registered cluster interface o2cb
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 DLMFS 1.6.3
Nov 13 01:01:14 clusterhost1 kernel: OCFS2 User DLM kernel interface loaded
Nov 13 01:03:32 clusterhost1 o2cb.init: online test

dmesg shows the same:

# dmesg
OCFS2 Node Manager 1.6.3
OCFS2 DLM 1.6.3
ocfs2: Registered cluster interface o2cb
OCFS2 DLMFS 1.6.3
OCFS2 User DLM kernel interface loaded
Slow work thread pool: Starting up
Slow work thread pool: Ready
FS-Cache: Loaded
FS-Cache: Netfs 'nfs' registered for caching
eth0: no IPv6 routers present
eth1: no IPv6 routers present
ocfs2: Unregistered cluster interface o2cb
OCFS2 Node Manager 1.6.3
OCFS2 DLM 1.6.3
ocfs2: Registered cluster interface o2cb
OCFS2 DLMFS 1.6.3
OCFS2 User DLM kernel interface loaded
ocfs2: Unregistered cluster interface o2cb
OCFS2 Node Manager 1.6.3
OCFS2 DLM 1.6.3
ocfs2: Registered cluster interface o2cb
OCFS2 DLMFS 1.6.3
OCFS2 User DLM kernel interface loaded

The filesystems look fine, and this can be run from both hosts in the cluster:

# fsck.ocfs2 -n /dev/sdc
fsck.ocfs2 1.8.0
Checking OCFS2 filesystem in /dev/sdc:
  Label:              ocfs2vol1
  UUID:               5620F19D43D840C7A46523019AE15A96
  Number of blocks:   524288
  Block size:         4096
  Number of clusters: 524288
  Cluster size:       4096
  Number of slots:    4

# fsck.ocfs2 -n /dev/sdd
fsck.ocfs2 1.8.0
Checking OCFS2 filesystem in /dev/sdd:
  Label:              ocfs2vol2
  UUID:               9B9182279ABD4FD99F695F91488C94C1
  Number of blocks:   524288
  Block size:         4096
  Number of clusters: 524288
  Cluster size:       4096
  Number of slots:    4
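Going back to the failed "service o2cb enable": if I understand the o2cb init script correctly, the register step is what populates /sys/kernel/config/cluster/<clustername>/ (with node/ and heartbeat/ underneath) via configfs, so the next thing I plan to poke at right after the failure is what actually lands there. These paths are my assumption based on the configfs mount shown above, so please correct me if I am looking in the wrong place:

# service o2cb status
# ls /sys/kernel/config/cluster/
# ls /sys/kernel/config/cluster/test/
# ls /sys/kernel/config/cluster/test/heartbeat/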
What am I missing? I've re-done this and re-created the devices a few too many times (thinking I may have missed something), but I am mystified. From all outward appearances I have two VMs that can see, and in local heartbeat mode mount and access, a shared OCFS2 filesystem (I have it running in local heartbeat mode for a cluster of rsyslog servers that are load balanced by an F5 LTM virtual server with no issues). I am stumped on how to get the global HB devices set up, though I have read and re-read the user guides, troubleshooting guides and wikis/blogs on how to make it work until my eyes hurt.

I have mounted debugfs and run the debugfs.ocfs2 utility, but I am unfamiliar with what I should be looking for there (or whether that is even where I would look for "cluster not coming online" errors). As the o2cb/ocfs2 modules are all kernel based, I am not 100% sure how to increase the debug information without digging into the source code and mucking around there.

Any guidance or lessons learned (or things to check) would be super :) and, if it works, will warrant a happy scream of joy from my frustrated cube!

Warm regards,
Jon
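PS - for completeness, the debugfs.ocfs2 commands I was planning to try next are below. This is just my reading of the man page (so apologies if these are not the right knobs): dumping the superblock to confirm the cluster stack / global heartbeat flag that mkfs wrote, then listing and enabling the heartbeat trace bit so the kernel logs more detail around the register step:

# debugfs.ocfs2 -R "stats" /dev/sdc
# debugfs.ocfs2 -l
# debugfs.ocfs2 -l HEARTBEAT allow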