Jeff Mahoney
2005-Oct-18 16:52 UTC
[Ocfs2-devel] [RFC] Integration with external clustering
Hey all -

We're interested in using OCFS2 with an external, userspace clustering solution - specifically, the heartbeat2 project from linux-ha.org. Obviously, the internal cluster manager would still be available for users with no interest in deploying and configuring a full cluster manager just to use the file system. I'd like to make the interface between the two as consistent as possible.

The obvious mapping to an external cluster manager is one file system to one cluster resource, managed individually. The userspace cluster manager would take over most of the cluster management infrastructure currently supplied by o2cb, including heartbeat, fencing, etc. The node manager would still be used to coordinate DLM operations, which would remain in-kernel.

The o2cb code is pretty well structured for this kind of integration without a lot of hacking, but there are a few sticking points. The good news is that the infrastructure for fixing most of them is already in place, just waiting to be used.

The existing code has a notion of one global cluster, with each node owning a particular node number and a single IP address/port. This node number is mapped 1:1 to file system slots and DLM domain node numbers, regardless of how many nodes are actually involved in mounting any particular file system. A large cluster may deploy a cluster-global file system, but also many smaller file systems shared by small subsets of nodes. The smaller file systems, even though they are deployed on a small number of nodes, still require slots for every member of the larger cluster. If separate network connectivity is desired for the smaller file systems, separate node numbers must be allocated in order to utilize the alternate network, making the problem worse.

The one-cluster notion appears to be rooted in o2net, where the assumption of a 1:1 IP address:node mapping is made.
The node manager is aware of multiple clusters, and even has to provide an interface to fake the single cluster membership. o2net itself already understands that an internode connection will be used for multiple virtual connections.

One of the larger issues for integration with a userspace cluster manager is how nodes are organized and exported to userspace. Currently, there is only one instance of a node. If a heartbeat down event is triggered for a particular node, all file systems are told about it, even if they don't care. To integrate a userspace cluster manager, we need more fine-grained configuration of node membership.

I'd like to address these issues in my proposal: individual file systems should be represented individually, with resources and connectivity assignable independently to each. I'll start with what I'd like the configfs space to look like, since I think that will illustrate it best:

/config/cluster/ocfs2/<fs uuid>/<node>/
        ip address
        port
        fs slot
        local
        active                  (for userspace)
        heartbeat/              (for kernelspace)
                block_bytes
                blocks
                dev
                start_block

Rather than having one global cluster, each file system would be its own cluster. Nodes would be created and destroyed as needed on a per-file-system basis.

The current o2net concept of a node would be replaced by something specific to connectivity. The current implementation of one connection per ip/port would stay, but rather than assuming a particular connection-node mapping at accept time, it would broker messages later, once the key has been observed in the message.

Since heartbeat and node management would end up having similar trees with different attributes, the node and heartbeat attributes would be unified under a single fs instance.

Obviously, modifications to the o2cb userspace tools would be required to make this work.
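The proposed layout could be exercised from the command line. Here is a hedged sketch, assuming made-up attribute names (ip_address, port, fs_slot, local) and a made-up UUID; a scratch directory stands in for a mounted configfs so the script can be dry-run, whereas on a live system the mkdir would cause the kernel to populate the attribute files itself:

```shell
#!/bin/sh
# Sketch only: the paths and attribute names below follow the proposed
# layout but are assumptions, not a shipped interface.
# CONFIGFS_ROOT defaults to a scratch directory so this can be dry-run
# without the kernel module; on a live system it would be the configfs
# mount point (e.g. /config).
CONFIGFS_ROOT="${CONFIGFS_ROOT:-$(mktemp -d)}"
FS_UUID="0f1e2d3c-0000-0000-0000-hypothetical"   # made-up filesystem UUID
NODE="node0"

node_dir="$CONFIGFS_ROOT/cluster/ocfs2/$FS_UUID/$NODE"
mkdir -p "$node_dir"

# Write the per-node connectivity and slot attributes for this filesystem.
echo 192.168.0.1 > "$node_dir/ip_address"
echo 7777        > "$node_dir/port"
echo 0           > "$node_dir/fs_slot"
echo 1           > "$node_dir/local"

echo "configured $NODE for $FS_UUID"
```

The point is only that per-filesystem node membership becomes a handful of mkdir/echo operations, which any userspace cluster manager can drive.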
I think that the changes required for cluster.conf could be minimal -- just keep the existing format and add overrides for file systems that want to use different slots/networks/etc.

I'm volunteering to code all this up; I just didn't want to post code that nobody wanted.

Opinions?

-Jeff

--
Jeff Mahoney
SUSE Labs
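For illustration, such an override might look something like the fragment below. The cluster: and node: stanzas follow the existing cluster.conf key = value format; the filesystem: stanza is invented here purely to sketch the idea and is not real o2cb syntax:

```
cluster:
	name = mycluster
	node_count = 3

node:
	cluster = mycluster
	name = node0
	number = 0
	ip_address = 192.168.0.1
	ip_port = 7777

# Hypothetical override stanza: a small file system deployed on a
# subset of nodes, with its own slot numbering and an alternate port.
filesystem:
	uuid = <fs uuid>
	cluster = mycluster
	nodes = node0,node1
	ip_port = 7778
```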
Joel Becker
2005-Oct-18 17:18 UTC
[Ocfs2-devel] [RFC] Integration with external clustering
On Tue, Oct 18, 2005 at 05:56:27PM -0400, Jeff Mahoney wrote:
> I'll start with an idea of what I'd like to see the configfs space look
> like, since I think it will probably illustrate it best:
>
> /config/cluster/ocfs2/<fs uuid>/<node>/

If you are treating each mount as a 'cluster', the ocfs2 path element is pretty redundant, and /config/cluster/<fs uuid> would suffice.

Given that heartbeat regions can and should be shared, you need a way to describe this. We don't have userspace doing global heartbeat yet, but there is no reason that all OCFS2 volumes can't share one heartbeat region (see http://oss.oracle.com/projects/ocfs2-tools/src/branches/global-heartbeat/documentation/o2cb/).

Have you also considered what this will or won't do to possible interaction with the CMan stack? We'd love OCFS2 to handle both stacks.

Finally, have you considered the user barriers to this? The absolute bottom-line goal of O2CB is minimum input by the user. For this to work, the user should not have to see the plethora of XML config files that heartbeat has (or at least, used to have). I'm talking about the user-visible part here, not the technical reality. The O2CB frontend or some other piece of software can take the user's name:ip node mapping and turn it into whatever XML it needs, but the user shouldn't have to do anything more than ocfs2console requires of them today.

Joel

--
"If you took all of the grains of sand in the world, and lined them up
 end to end in a row, you'd be working for the government!"
        - Mr. Interesting

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker@oracle.com
Phone: (650) 506-8127
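Joel's point - a frontend turning the user's minimal name:ip mapping into whatever XML the cluster manager needs - could be sketched like this. The element and attribute names are invented placeholders, not heartbeat2's actual configuration schema:

```shell
#!/bin/sh
# Sketch: turn "name ip" pairs read from stdin into an XML fragment.
# The <nodes>/<node> element names are placeholders invented for this
# example; the real frontend would emit whatever schema heartbeat2 wants.
nodes_to_xml() {
    printf '<nodes>\n'
    while read -r name ip; do
        [ -n "$name" ] || continue
        printf '  <node name="%s" ip="%s"/>\n' "$name" "$ip"
    done
    printf '</nodes>\n'
}

nodes_to_xml <<EOF
node0 192.168.0.1
node1 192.168.0.2
EOF
```

The user only ever supplies the name:ip list; everything else is generated, which keeps the user-visible surface as small as O2CB's today.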
Lars Marowsky-Bree
2005-Oct-28 10:11 UTC
[Ocfs2-devel] Re: [RFC] Integration with external clustering
On 2005-10-18T17:56:27, Jeff Mahoney <jeffm@suse.com> wrote:

Hi all,

just want to make sure this doesn't get lost. Where are we currently at?

FYI, I'd like to ask for an additional way of documenting a suggested approach: please show how to set up a, say, 3-node "cluster" (statically) and how to shut it down again - on the command line with shell scripts ;-) Hey, we're only operating on configfs/sysfs-style "text files" and directories, no? That should be possible.

Not only will it be a good basis for a regression test of the API, but it'll also help us understand what the scripts for the Cluster Resource Manager integration will have to look like, and whether that's a workable approach.

Anybody thinking I'm on drugs? ;-)

Sincerely,
    Lars Marowsky-Bree <lmb@suse.de>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business

"Ignorance more frequently begets confidence than does knowledge"
        -- Charles Darwin
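A hedged sketch of the kind of script Lars asks for, against the per-filesystem configfs layout Jeff proposed. All paths, attribute names, and the UUID are assumptions, and a scratch directory stands in for a mounted configfs so the script can be dry-run without the kernel module:

```shell
#!/bin/sh
# Statically bring up a 3-node "cluster" for one filesystem and tear it
# down again, using only mkdir/echo/rm on a configfs-style tree.
# Everything here is a sketch of the *proposed* interface, not a real one.
CONFIGFS_ROOT="${CONFIGFS_ROOT:-$(mktemp -d)}"
FS_UUID="deadbeef-0000-0000-0000-hypothetical"   # made-up filesystem UUID
FS_DIR="$CONFIGFS_ROOT/cluster/ocfs2/$FS_UUID"

setup() {
    slot=0
    for node in node0 node1 node2; do
        d="$FS_DIR/$node"
        mkdir -p "$d"
        echo "192.168.0.$((slot + 1))" > "$d/ip_address"
        echo 7777                      > "$d/port"
        echo "$slot"                   > "$d/fs_slot"
        slot=$((slot + 1))
    done
}

teardown() {
    for node in node0 node1 node2; do
        rm -rf "$FS_DIR/$node"   # on real configfs this would be a plain rmdir
    done
    rmdir "$FS_DIR" 2>/dev/null || true
}

setup
echo "setup complete"
teardown
echo "teardown complete"
```

As Lars suggests, the same script doubles as a crude API regression test and as a template for the Cluster Resource Manager agent's start/stop actions.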