The OCFS2 team is in the preliminary stages of planning major features for our next cycle of development. The goal of this e-mail, then, is to stimulate some discussion as to how features should be prioritized going forward. Some disclaimers apply:

* The following list is very preliminary and is sure to change.

* I've probably missed some things.

* Development priorities within Oracle can be influenced but are ultimately up to management. That's not stopping anyone from contributing though, and patches are always welcome.

So I'll start with changes that can be completely contained within the file system (no cluster stack changes needed):

-Sparse file support: Self explanatory. We need this for various reasons including performance, correctness and space usage.

-Htree support

-Extended attributes: This might be another area where we steal^H^H^H^H^Hcopy some good code from Ext3 :) On top of this one can trivially implement posix acls. We're not likely to support EA block sharing though, as it becomes difficult to manage across the cluster.

-Removal of the vote mechanism: The most trivial dentry-type network votes can go quite easily by replacing them with a cluster lock. This is critical in speeding up unlink and rename operations in the cluster. The remaining votes (mount, unmount, delete_inode) look like they'll require cluster stack adjustments.

-Data in inode blocks: Should speed up local node data operations with small files significantly.

-Shared writeable mmap: This looks like it might require changes to the kernel (outside of OCFS2). We need to investigate further...

Now on to file system features which require cluster stack changes. I'll have a lot more to say about the cluster stack in a bit, but it's worth listing these out here for completeness.

-Cluster consistent Flock / Lockf

-Online file system resize

-Removal of remaining FS votes: If we can get rid of the delete_inode vote, I don't believe we'll need the mount / umount ones anymore (and if we still do, then a proper group services implementation could handle that).

-Allow the file system to go "hard read only" when it loses its connection to the disk, rather than the kernel panic we have today. This allows applications using the file system to gracefully shut down. Other applications on the system continue unharmed. "Hard read only" in the OCFS2 context means that the RO node does not look mounted to the other nodes on that file system. Absolutely no disk writes are allowed. File data and metadata can be stale or otherwise invalid. We never want to return invalid data to userspace, so file reads return -EIO.

As far as the existing cluster stack goes, currently most of the OCFS2 team feels that the code has gone as far as it can and should go. It would therefore be prudent to allow pluggable cluster stacks. Jeff Mahoney at Novell has already done some integration work implementing a userspace clustering interface. We probably want to do more in that area though.

There are several good reasons why we might want to integrate with external cluster stacks. The most obvious is code reuse. The list of cluster stack features we require for our next phase of development is very large (some are listed below). There is no reason to implement those features unless we're certain existing software doesn't provide them and can't be extended. This will also allow a greater amount of choice for the end user. A stack that works well for one environment might not work as well for another. There's also the fact that current resources are limited. It's enough work designing and implementing a file system. If we can get out of the business of maintaining a cluster stack, we should do so.

So the question then becomes, "What is it that we require of our cluster stack going forward?"

- We'd like as much of it to be user space code as is possible and practical.

- The node manager should support dynamic cluster topology updates, including removing nodes from the cluster, propagating new configurations to existing nodes, etc.

- A pluggable fencing mechanism is a priority.

- We'd like some group services implementation to handle things like membership of a mount point, dlm domain/lockspace, etc.

- On the DLM side, we'd like things like directory based mastery, a range locking API, and some extra LVB recovery bits.

So that's it for now. Hopefully this will spur some interesting discussion. Please keep in mind that any of this is subject to change - cluster stack requirements especially are things we've only recently begun discussing.
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com
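[Editor's note: to make the "hard read only" semantics above concrete, here is a minimal sketch of how every I/O path could check such a state. All identifiers (ocfs2_super_sketch, OCFS2_OSB_HARD_RO, etc.) are hypothetical illustrations, not existing OCFS2 code.]

	/* Hypothetical sketch: a per-superblock "hard read only" flag
	 * consulted on every I/O path.  Names are illustrative only. */
	#include <linux/fs.h>
	#include <linux/errno.h>
	#include <linux/spinlock.h>

	#define OCFS2_OSB_HARD_RO  0x0001  /* lost our connection to the disk */

	struct ocfs2_super_sketch {
		unsigned long osb_flags;
		spinlock_t    osb_lock;
	};

	/* Called when a disk I/O error tells us the storage is gone. */
	static void ocfs2_set_hard_ro(struct ocfs2_super_sketch *osb)
	{
		spin_lock(&osb->osb_lock);
		osb->osb_flags |= OCFS2_OSB_HARD_RO;
		spin_unlock(&osb->osb_lock);
	}

	/* Every read/write entry point checks the flag first.  Cached data
	 * may be stale, so reads fail too: we never hand possibly-invalid
	 * data back to userspace. */
	static int ocfs2_check_hard_ro(struct ocfs2_super_sketch *osb)
	{
		if (osb->osb_flags & OCFS2_OSB_HARD_RO)
			return -EIO;  /* no reads, no writes, no journal commits */
		return 0;
	}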
On Tue, Apr 25, 2006 at 11:35:53AM -0700, Mark Fasheh wrote:
> -Htree support

Please not. htree is just the worst possible directory format around. Do some nice hashed or btree directories, but don't try this odd hack again. Especially as the only reason it was developed that way for ext2/3 doesn't apply in a cluster filesystem anyway - to access the new htree all nodes would have to support the format anyway, so the whole easy up/downgrade thing doesn't matter at all.

> -Extended attributes: This might be another area where we
> steal^H^H^H^H^Hcopy some good code from Ext3 :) On top of this one can
> trivially implement posix acls. We're not likely to support EA block
> sharing though as it becomes difficult to manage across the cluster.

Again, the ext3 implementation might not be the best. I'd say look at jfs or xfs (in the latter case of course with a less monstrous btree implementation).
Mark Fasheh <mark.fasheh at oracle.com> writes:
>
> - We'd like as much of it to be user space code as is possible and
> practical.

Won't you get into deadlocks then when the system is low on memory? (Freeing memory might require writeouts on OCFS2, and the user space cluster stack might already be stuck.) Or rather, if you rely on user space you would need to make sure that the basic block writeout path works without such possible deadlocks.

-Andi
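[Editor's note: one partial mitigation, sketched below purely as an illustration and not as a claim about any existing cluster stack, is for the userspace daemon on the writeout path to pin its memory up front and preallocate what it needs, so it never allocates while the kernel is waiting on it. This reduces, but does not eliminate, the deadlock risk Andi describes.]

	/* Sketch: a userspace cluster daemon trying to stay out of the
	 * memory-reclaim path. */
	#include <sys/mman.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define MSG_POOL_SIZE (64 * 1024)

	static char msg_pool[MSG_POOL_SIZE];  /* preallocated, never malloc'd later */

	int main(void)
	{
		/* Pin all current and future pages so the daemon itself can
		 * never be swapped out or need page-ins while the kernel is
		 * trying to free memory by writing to OCFS2. */
		if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
			perror("mlockall");
			return EXIT_FAILURE;
		}

		/* ... event loop: use only msg_pool and stack memory on the
		 * paths the kernel may be waiting on ... */
		(void)msg_pool;
		return EXIT_SUCCESS;
	}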
I've done some experiments with h-trees on ext3 and have found one case where h-trees get confused. If I create several thousand files in a single directory and then try to remove the directory (rm -r), I get an error that one of the files has not been removed, but when I check the directory, the file is not there. I repeat the command and the directory is removed.

I suspect the h-tree code is using the hash as the cookie for readdir and I'm getting a hash collision. ReiserFS solves this problem by having 24 bits of hash and 8 bits of uniqueness to resolve hash collisions.

Paul Taysom
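[Editor's note: as an illustration of the scheme Paul describes - this is a sketch of the idea only, not the actual ReiserFS or ext3 code - the readdir cookie can carry a small generation number alongside the name hash so that colliding names still get distinct cookies.]

	/* Sketch of a collision-tolerant readdir cookie: 24 bits of name
	 * hash plus 8 bits of per-collision "uniqueness".  Illustrative. */
	#include <stdint.h>

	#define HASH_BITS  24
	#define GEN_BITS   8
	#define HASH_MASK  ((1u << HASH_BITS) - 1)
	#define GEN_MASK   ((1u << GEN_BITS) - 1)

	static inline uint32_t make_cookie(uint32_t name_hash, uint32_t gen)
	{
		/* gen is incremented each time a new entry hashes to the same
		 * 24-bit value, so up to 256 colliding names stay distinct. */
		return ((name_hash & HASH_MASK) << GEN_BITS) | (gen & GEN_MASK);
	}

	static inline uint32_t cookie_hash(uint32_t cookie)
	{
		return cookie >> GEN_BITS;
	}

	static inline uint32_t cookie_gen(uint32_t cookie)
	{
		return cookie & GEN_MASK;
	}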
Hello,

I just subscribed to this list because I saw this posting in the archives:

http://oss.oracle.com/pipermail/ocfs2-devel/2006-April/000931.html

Is there any reason you wouldn't ask the ocfs2-users community for feedback on features as well? I hadn't subscribed to -devel because I figured it was solely for folks actually developing the OCFS2 code :)

In my opinion, the proposed "hard read only" feature is the most wanted. My team is in the middle of testing 10gR2 RAC on OCFS2 for production deployments on RHEL 4 (hopefully your x86_64 certification is coming soon). I assume Oracle RAC would like the "hard read only" behavior more than the current panic.

Also, while I saw one end user complain about your ideas of implementing ext3 code inside OCFS2, please remember the rest of us who survive just fine with ext3 in Red Hat's Enterprise Linux. :)

Third, are there any thoughts on integrating LVM support or using something like Red Hat's CLVM to allow OCFS2 to layer on top of LVs instead of just individual disks? The biggest drawback I see in my environment is that my storage team provides 34GB and 68GB metas from the EMC frames. I'd rather not have a ton of 68GB OCFS2 filesystems but rather a larger, host-controlled LV. Trying to get the storage team to provide a 200+GB LUN and grow it on the fly in the future is a tough task. If I could control the LV on the host _and_ grow OCFS2 into larger LVs, that would rock.

Thanks.

/Brian/

--
Brian Long                   |       |          |
IT Data Center Systems       |     .|||.      .|||.
Cisco Linux Developer        |  ..:|||||||:...:|||||||:..
Phone: (919) 392-7363        |  C i s c o  S y s t e m s
Daniel Phillips
2006-May-03 23:04 UTC
[Ocfs2-devel] OCFS2 features RFC - separate journal?
Mark Fasheh wrote:
> The OCFS2 team is in the preliminary stages of planning major features for
> our next cycle of development. The goal of this e-mail then is to stimulate
> some discussion as to how features should be prioritized going forward. Some
> disclaimers apply:

Hi guys,

Sorry about the lag. Here's an easy feature nobody has mentioned so far, and which from my reading isn't supported: a separate journal, like Ext3. The journals stay per-node, but they can be on a different (shared) volume than the filesystem proper. This should be dead simple to do and it can make a huge difference to write latency, by putting the journals on separate spindles or (what I actually have in mind) in NVRAM.

Regards,

Daniel
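[Editor's note: JBD already distinguishes between an inode-backed journal and a journal on its own block device; that is how ext3 implements external journals. Below is a rough sketch of the choice under that assumption. Only journal_init_inode() and journal_init_dev() are real JBD entry points; the ocfs2_journal_create_sketch() wrapper is made up for illustration.]

	/* Sketch: choosing between the usual inode-backed per-node journal
	 * and an external journal device, using the JBD calls ext3 uses. */
	#include <linux/jbd.h>
	#include <linux/fs.h>

	static journal_t *ocfs2_journal_create_sketch(struct inode *journal_inode,
						      struct block_device *journal_bdev,
						      struct block_device *fs_bdev,
						      int start, int len,
						      int blocksize)
	{
		if (journal_bdev)
			/* External journal: lives on its own (possibly
			 * NVRAM-backed) device, separate from the fs volume. */
			return journal_init_dev(journal_bdev, fs_bdev,
						start, len, blocksize);

		/* Default: journal stored in a per-node system-file inode. */
		return journal_init_inode(journal_inode);
	}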
Mark Fasheh wrote:
> The OCFS2 team is in the preliminary stages of planning major features for
> our next cycle of development. The goal of this e-mail then is to stimulate
> some discussion as to how features should be prioritized going forward. Some
> disclaimers apply:
>
> * The following list is very preliminary and is sure to change.
>
> * I've probably missed some things.
>
> * Development priorities within Oracle can be influenced but are ultimately
> up to management. That's not stopping anyone from contributing though, and
> patches are always welcome.

While performance enhancements are always welcome, the two big features we'd like to see in future OCFS2 releases are features that will make using OCFS2 more transparent and more like a "local" file system. The features we want are cluster-wide lockf/flock and shared writable mmap.

From a data integrity perspective, it shouldn't make a difference to an application whether competing readers/writers are on the same node or a different node. If standard locking primitives are already in use by the application, they should "just work" if the competing process is on another node.

> So I'll start with changes that can be completely contained within the file
> system (no cluster stack changes needed):
>
> -Sparse file support: Self explanatory. We need this for various reasons
> including performance, correctness and space usage.

I think we all want this one. Once upon a time, ReiserFS didn't support sparse files and it made doing things that expected sparse files an exercise in torture.

> -Htree support

Hashed directories in some form, but I think the comments against ext3-style h-trees are valid.

> Now on to file system features which require cluster stack changes. I'll
> have a lot more to say about the cluster stack in a bit, but it's worth
> listing these out here for completeness.
>
> -Online file system resize

This would be nice, and I think easily done in the same manner ext3 does it. Anything outside the file system's current view of the block device can be initialized in userspace, and the last block group, bitmaps, and superblock would be adjusted by an ioctl in kernelspace.

> -Allow the file system to go "hard read only" when it loses its connection
> to the disk, rather than the kernel panic we have today. This allows
> applications using the file system to gracefully shut down. Other
> applications on the system continue unharmed. "Hard read only" in the OCFS2
> context means that the RO node does not look mounted to the other nodes on
> that file system. Absolutely no disk writes are allowed. File data and
> meta data can be stale or otherwise invalid. We never want to return
> invalid data to userspace, so file reads return -EIO.

This is a big one as well. If a node knows to fence itself, it can put itself in an error state as well. fence={panic,ro} would be a decent start (a rough sketch follows at the end of this message).

> As far as the existing cluster stack goes, currently most of the OCFS2 team
> feels that the code has gone as far as it can and should go. It would
> therefore be prudent to allow pluggable cluster stacks. Jeff Mahoney at
> Novell has already done some integration work implementing a userspace
> clustering interface. We probably want to do more in that area though.
>
> There are several good reasons why we might want to integrate with external
> cluster stacks. The most obvious is code reuse. The list of cluster stack
> features we require for our next phase of development is very large (some
> are listed below). There is no reason to implement those features unless
> we're certain existing software doesn't provide them and can't be extended.
> This will also allow a greater amount of choice for the end user. What stack
> works well for one environment might not work as well for another. There's
> also the fact that current resources are limited. It's enough work designing
> and implementing a file system. If we can get out of the business of
> maintaining a cluster stack, we should do so.
>
> So the question then becomes, "What is it that we require of our cluster
> stack going forward?"
>
> - We'd like as much of it to be user space code as is possible and
> practical.

The heartbeat project does a pretty good job on the userspace end, but as Andi pointed out, it has the usual shortcomings of anything in userspace involved with writing data inside the kernel. It is prone to deadlocks and we could miss node topology events.

-Jeff

--
Jeff Mahoney
SUSE Labs
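[Editor's note: purely as a hypothetical illustration of the fence={panic,ro} idea above. Neither the mount option nor any of these identifiers exist in OCFS2; this is only a sketch of where such a policy knob would be consulted.]

	/* Sketch of a fence= mount option: when the node decides it must
	 * fence itself (or loses the disk), either panic as today or drop
	 * into the "hard read only" error state.  Illustrative only. */
	#include <linux/kernel.h>
	#include <linux/string.h>
	#include <linux/errno.h>

	enum fence_policy {
		FENCE_PANIC,	/* current behaviour: take the whole node down */
		FENCE_RO,	/* proposed: mark the mount hard read only */
	};

	struct mount_opts_sketch {
		enum fence_policy fence;
	};

	static int parse_fence_opt(struct mount_opts_sketch *opts, const char *arg)
	{
		if (!strcmp(arg, "panic"))
			opts->fence = FENCE_PANIC;
		else if (!strcmp(arg, "ro"))
			opts->fence = FENCE_RO;
		else
			return -EINVAL;
		return 0;
	}

	static void handle_self_fence(const struct mount_opts_sketch *opts)
	{
		if (opts->fence == FENCE_PANIC)
			panic("ocfs2: fencing this node");

		/* FENCE_RO: flip the superblock into the hard read only
		 * state (no further writes; reads return -EIO). */
	}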
(a dialog between Mark and me that inadvertently became private)

Mark Fasheh wrote:
> On Wed, May 17, 2006 at 05:16:53PM -0700, Daniel Phillips wrote:
>> Does clustered NFS count as software we're going to worry about right now?
>> The impact is, if OCFS2 does provide cluster-aware fcntl locking then the
>> cluster locking hacks lockd needs can possibly be smaller. Otherwise,
>> lockd must do the job itself, and as a consequence, any applications running
>> on the (clustered) NFS server nodes will not see locks held by NFS clients.
>
> Clustered NFS is definitely something we care about. We have people using it
> today, with the caveat that file locking won't be cluster aware. It's
> actually pretty interesting how far people get with that. We'd love to
> support the whole thing of course. As far as NFS with file locking though, I
> have to admit that we've heard many more requests from people wanting to do
> things like apache, sendmail, etc on OCFS2.

Ok, I just figured out how to be really lazy and do cluster-consistent NFS locking across clustered NFS servers without doing much work. In the duh category, only one node will actually run lockd and all other NFS server nodes will just port-forward the NLM traffic to/from it. Sure, you can bottleneck this scheme with a little effort, but to be honest we aren't that interested in NFS locking performance, we are more interested in actual file operations. So strike NFS serving off the list of applications that care about cluster fcntl locking.

>> Unless I have missed something major, fcntl locking does not have any
>> overlap with your existing DLM, so you can implement it with a separate
>> mechanism. Does that help?
>
> Eh, unfortunately not that much... It's still a large amount of work :/
> Doing it outside a dlm would just mean one has to reproduce existing
> mechanisms (such as determining lock mastery for instance).

You don't have to distribute the fcntl locking; you can instead manage it with a single server active on just one node at a time. So go ahead and distribute it if you really enjoy futzing with the DLM, but going for the server approach should reduce your stress considerably. As a fringe benefit, you are then forced to consider how to accommodate classic server failover within the cluster manager framework, which should not be very hard and is absolutely necessary.

>> Starting with one obvious requirement, the cluster stack needs to be able
>> to handle different kinds of fencing methods or even mixed fencing methods.
>> If the stack stays in kernel, what is the instancing framework? Modules?
>> I do believe we can make that work.
>
> call_usermodehelper()?

Bad idea, this gets you back into memory deadlock zone. Avoiding memory deadlock is considerably easier in kernel and is nigh on impossible with call_usermodehelper. Sure, it's totally possible to do all that in kernel.

> But we're getting ahead of ourselves - I don't want to implement yet another
> cluster stack - I'd rather fit the file system into an existing framework -
> one which already has all the fencing methods worked out, for instance.

Like the Red Hat framework? Ahem. Maybe not. For one thing, they never even got close to figuring out how to avoid memory deadlock. For another, it's a rambling bloated pig with lots of bogus factoring. Honestly, what you have now is a much better starting point; you should be thinking about how to evolve it in the direction it needs to go rather than cutting over to an existing framework that was designed with the mindset of usermode cluster apps, not the more stringent requirements of a cluster filesystem.

>> Consider this: if we define the fencing interface entirely in terms of
>> messages over sockets then the cluster stack does not need to know or care
>> whether the other end lives in kernel or userland. Comments?
>
> Interesting, and I'll have to think about whether I can poke holes in that
> or not. Of course, I'm not sure the file system ever has to call out to
> fencing directly, so maybe it's something it never has to worry about.

No, the filesystem never calls fencing, only the cluster manager does. As I understand it, what happens is:

  1) Somebody (heartbeat) reports a dead node to the cluster manager
  2) The cluster manager issues a fence request for the dead node
  3) The cluster manager receives confirmation that the node was fenced
  4) The cluster manager sends out dead node messages to the cluster managers on other nodes
  5) Some cluster manager receives a dead node message, notifies its DLM
  6) The DLM receives the dead node message, initiates lock recovery

Step (2) is where we need plugins, where each plugin registers a fencing method and somehow each node becomes associated with a particular fencing method (setting up this association is an excellent example of a component that can and should be in userspace, because this part never executes in the block IO path). The right interface to initiate fencing is probably a direct (kernel-to-kernel) call; there is actually no good reason to use a socket interface here. However, the fencing confirmation is an asynchronous event and might as well come in over a socket. There are alternatives (e.g., a linked list event queue) but the socket is most natural because the cluster manager already needs one to receive events from other sources.

Actually, fencing has no divine right to be a separate subsystem and is properly part of the cluster manager. It's better to think of it that way. As such, the cluster manager <=> fencing API is internal; there is no need to get into interminable discussions of how to standardize it. So let's just do something really minimal that gives us a plugin interface and move on to harder problems. If you do eventually figure out how to move the whole cluster manager to userspace, then you replace the module scheme in favor of a dso scheme.

Anyway, assuming both bits are in-kernel, then initiating fencing should just be a method on the (in-kernel) node object and confirmation of fencing is just an event sent to the node manager's event pipe. Simple, no?

In summary, I retract my point about using the socket to abstract away the question of whether fencing lives in kernel or userspace, and instead assert that the fencing harness should live wherever the cluster manager lives, which is in kernel right now and ought to stay there for the time being. A socket is still the right way to receive messages from a fencing module, but a method call is a better way to initiate fencing.

Regards,

Daniel
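[Editor's note: a minimal sketch of a pluggable fencing harness along the lines Daniel describes: a registration call, a method invoked directly by the cluster manager, and an asynchronous confirmation delivered as an event. Every name here is hypothetical; this is not code from OCFS2 or any existing cluster stack, and locking/error handling are omitted.]

	/* Hypothetical fencing plugin interface: direct call to initiate,
	 * asynchronous event to confirm.  Illustrative only. */
	#include <linux/list.h>
	#include <linux/module.h>

	struct cluster_node;	/* the in-kernel node object */

	struct fence_method {
		const char *name;			  /* e.g. "powerswitch", "san" */
		int (*fence)(struct cluster_node *node);  /* kick off fencing (async) */
		struct list_head list;
		struct module *owner;
	};

	static LIST_HEAD(fence_methods);

	/* Plugins (modules today, DSOs if this ever moves to userspace)
	 * register a method; userspace configuration later associates one
	 * method with each node. */
	int fence_method_register(struct fence_method *fm)
	{
		list_add_tail(&fm->list, &fence_methods);
		return 0;
	}

	void fence_method_unregister(struct fence_method *fm)
	{
		list_del(&fm->list);
	}

	/* Cluster manager side: step (2) of the sequence above is a direct
	 * kernel-to-kernel call into the plugin chosen for that node. */
	static inline int fence_node(struct cluster_node *node,
				     struct fence_method *fm)
	{
		return fm->fence(node);
	}

	/* Step (3): the plugin confirms completion asynchronously by writing
	 * an event (for example { node_num, result }) to the node manager's
	 * event pipe/socket, which the cluster manager then turns into the
	 * dead-node messages of step (4). */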