David Howells
2013-Dec-17 16:53 UTC
What is needed to build an AFS fileserver on top of BTRFS?
It has occurred to me and others that something like BTRFS could be a good fit to build an AFS fileserver directly on top of. The question is what facilities would be needed from BTRFS to make this work?

So I thought I'd kick off a shopping list ;-)

 (1) 64-bit data version numbers that increase monotonically with each write.

     Yes, this is likely to cause some performance degradation as it introduces an ordering over data writes and metadata writes to a file. Maybe writes can be batched to improve performance?

 (2) Storage for ACLs and AFS UIDs. Having shareable ACLs might also be useful.

     Xattrs would likely do for this.

 (3) The ability to snapshot a filesystem to make backups and for pushing to read-only volume servers.

 (4) A 32-bit vnode number and 32-bit vnode uniquifier/generation number.

     These don't necessarily have to be stored by BTRFS directly but could instead be in a separate database file that gets snapshotted also.

 (5) The ability to set the vnode number, vnode uniquifier and data version number to specific values. Necessary to clone volumes and restore volume dumps.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
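Items (2), (4) and (5) suggest keeping the per-file AFS identifiers somewhere they can be both read and set explicitly. As a minimal sketch only, assuming a fixed big-endian layout that is not taken from AFS or BTRFS, the three values could be packed into a 16-byte blob suitable for an xattr value or a record in a separate database file. The function names and the layout are hypothetical.

```python
import struct

# Hypothetical layout (an assumption, not an AFS or BTRFS format):
# 32-bit vnode number, 32-bit uniquifier, 64-bit data version, big-endian.
_AFS_IDS_FORMAT = ">IIQ"


def pack_afs_ids(vnode: int, uniquifier: int, data_version: int) -> bytes:
    """Pack the three identifiers into a 16-byte blob.

    struct.pack raises struct.error if a value does not fit its field,
    which enforces the 32/32/64-bit widths from the shopping list.
    """
    return struct.pack(_AFS_IDS_FORMAT, vnode, uniquifier, data_version)


def unpack_afs_ids(blob: bytes) -> tuple:
    """Return (vnode, uniquifier, data_version) from a packed blob."""
    return struct.unpack(_AFS_IDS_FORMAT, blob)
```

On Linux such a blob could be written with `os.setxattr(path, name, blob)` under an attribute name of your choosing; being able to overwrite it with specific values is what item (5) asks for.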
Chris Mason
2013-Dec-17 17:07 UTC
Re: What is needed to build an AFS fileserver on top of BTRFS?
On Tue, 2013-12-17 at 16:53 +0000, David Howells wrote:
> It has occurred to me and others that something like BTRFS could be a good fit
> to build an AFS fileserver directly on top of. The question is what facilities
> would be needed from BTRFS to make this work?
>
> So I thought I'd kick off a shopping list ;-)
>
> (1) 64-bit data version numbers that increase monotonically with each write.
>
>     Yes, this is likely to cause some performance degradation as it introduces
>     an ordering over data writes and metadata writes to a file. Maybe writes
>     can be batched to improve performance?
>
> (2) Storage for ACLs and AFS UIDs. Having shareable ACLs might also be useful.
>
>     Xattrs would likely do for this.
>
> (3) The ability to snapshot a filesystem to make backups and for pushing to
>     read-only volume servers.
>
> (4) A 32-bit vnode number and 32-bit vnode uniquifier/generation number.
>
>     These don't necessarily have to be stored by BTRFS directly but could
>     instead be in a separate database file that gets snapshotted also.
>
> (5) The ability to set the vnode number, vnode uniquifier and data version
>     number to specific values. Necessary to clone volumes and restore
>     volume dumps.

Hmmm, what exactly are vnodes? Could we put them in xattrs?

-chris
Hugo Mills
2013-Dec-17 17:20 UTC
Re: What is needed to build an AFS fileserver on top of BTRFS?
On Tue, Dec 17, 2013 at 04:53:16PM +0000, David Howells wrote:
> It has occurred to me and others that something like BTRFS could be
> a good fit to build an AFS fileserver directly on top of. The
> question is what facilities would be needed from BTRFS to make this
> work? So I thought I'd kick off a shopping list ;-)

> (1) 64-bit data version numbers that increase monotonically with
> each write. Yes, this is likely to cause some performance
> degradation as it introduces an ordering over data writes and
> metadata writes to a file. Maybe writes can be batched to improve
> performance?

Do these have to be per-file? If not, then you might be able to get away with using the transid, which is a filesystem-global monotonically-increasing number.

btrfs batches disk writes already, and uses the transid to differentiate these -- the writes come at 30 second intervals (by default, although there's an option to change the period). There may be multiple distinct changes to a single file within that transaction (although obviously, only the state of the file after the last one gets written to disk).

I don't know exactly what you need it for, so this may or may not be appropriate here. Ceph uses transids for [something, mumble, wavy-hand] -- I don't know if the use-case for Ceph is equivalent to the use-case for AFS.

> (2) Storage for ACLs and AFS UIDs. Having shareable ACLs might also
> be useful. Xattrs would likely do for this.

This would seem like a reasonable place to put them, given that that's what POSIX ACLs do, and we have POSIX ACL support already.

> (3) The ability to snapshot a filesystem to make backups and for
> pushing to read-only volume servers.

We have snapshots of subvolumes, but not the filesystem as a whole.

> (4) A 32-bit vnode number and 32-bit vnode uniquifier/generation
> number. These don't necessarily have to be stored by BTRFS directly
> but could instead be in a separate database file that gets
> snapshotted also.
>
> (5) The ability to set the vnode number, vnode uniquifier and data
> version number to specific values. Necessary to clone volumes
> and restore volume dumps.

What's a vnode meant to represent? I'm not familiar with the terminology.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ==
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- "Are you the man who rules the Universe?" "Well, I --- try not to."
David Howells
2013-Dec-17 17:40 UTC
Re: What is needed to build an AFS fileserver on top of BTRFS?
Chris Mason <clm@fb.com> wrote:
> Hmmm, what exactly are vnodes? Could we put them in xattrs?

vnode numbers are AFS's equivalent of inode numbers. Since they're one per file, they could be the object filename.

Probably there would have to be a table of {vnode,latest_uniquifier} as the uniquifier must still go up even if the vnode is unused for a while, so there could also be a table of {vnode,btrfs_file}.

David
David Howells
2013-Dec-17 17:47 UTC
Re: What is needed to build an AFS fileserver on top of BTRFS?
Hugo Mills <hugo@carfax.org.uk> wrote:
> > (1) 64-bit data version numbers that increase monotonically with
> > each write. Yes, this is likely to cause some performance
> > degradation as it introduces an ordering over data writes and
> > metadata writes to a file. Maybe writes can be batched to improve
> > performance?
>
> Do these have to be per-file? If not, then you might be able to get
> away with using the transid, which is a filesystem-global
> monotonically-increasing number.

Yes, they have to be per-file. If you send a write RPC op to the server, you get back the new version number. If the new version number is not the old version number + 1, you know there was a collision with a write from another client and you have to flush your cache for that file and request a new "callback" (i.e. a promise to notify you if someone else changes the file).

> > (3) The ability to snapshot a filesystem to make backups and for
> > pushing to read-only volume servers.
>
> We have snapshots of subvolumes, but not the filesystem as a whole.

By "filesystem" I meant the current state of an AFS volume. Very likely this would be represented by a BTRFS subvolume, if I understand it correctly. You might have several AFS volumes represented within a BTRFS filesystem. They would be manipulated independently.

> > (5) The ability to set the vnode number, vnode uniquifier and data
> > version number to specific values. Necessary to clone volumes
> > and restore volume dumps.
>
> What's a vnode meant to represent? I'm not familiar with the
> terminology.

AFS's equivalent of an inode with a 32-bit number representing it. See my reply to Chris's question about the same thing.

David
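The client-side check described in this message can be sketched roughly as follows. This is an illustration only, not code from any AFS client: `CachedFile`, its fields, and `apply_store_reply` are hypothetical names, and re-requesting the callback is reduced to clearing a flag.

```python
from dataclasses import dataclass, field


@dataclass
class CachedFile:
    """Hypothetical client-side cache entry for one file."""
    data_version: int
    chunks: dict = field(default_factory=dict)  # offset -> cached bytes
    has_callback: bool = True


def apply_store_reply(entry: CachedFile, new_dv: int) -> bool:
    """Process the data version returned by a write RPC.

    Returns True if the cached data is still valid. If the new version is
    anything other than old + 1, another client's write intervened, so the
    cached data is dropped and the callback must be requested again.
    """
    if new_dv == entry.data_version + 1:
        entry.data_version = new_dv
        return True
    entry.chunks.clear()        # flush cached data for this file
    entry.has_callback = False  # caller must request a new callback
    entry.data_version = new_dv
    return False
```

The check only works if the version is per-file, which is why a filesystem-global counter such as the transid would not satisfy it: unrelated writes elsewhere would make every reply look like a collision.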
Jeffrey Hutzelman
2013-Dec-17 18:42 UTC
Re: Re: What is needed to build an AFS fileserver on top of BTRFS?
On Tue, 2013-12-17 at 17:40 +0000, David Howells wrote:
> Chris Mason <clm@fb.com> wrote:
>
> > Hmmm, what exactly are vnodes? Could we put them in xattrs?
>
> vnode numbers are AFS's equivalent of inode numbers. Since they're one per
> file, they could be the object filename.

Yes, in fact, the volume, vnode number, uniqifier, and DV are effectively the "name" the fileserver uses for the underlying inode. Note that if the fileserver is maintaining the vnode indices, then you don't actually _need_ to store a uniqifier for normal operation, because at any given time, a volume can contain at most one vnode with a particular vnode number, and that vnode's uniqifier is stored in the index. The uniqifier is used on-the-wire to distinguish different files that existed at different points in time with the same vnode number.

> Probably there would have to be a table of {vnode,latest_uniquifier} as the
> uniquifier must still go up even if the vnode is unused for a while, so there
> could also be a table of {vnode,btrfs_file}.

No, you don't actually have to do this. The OpenAFS fileserver maintains a single uniqifier for an entire volume, and simply increments it every time a vnode is created.

-- Jeff
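The volume-wide counter described above might look something like this. It is a minimal sketch, not OpenAFS code: the `Volume` class, its method names, and the starting value of 1 are assumptions, and 32-bit wraparound is not handled.

```python
class Volume:
    """Hypothetical model of one volume's vnode index."""

    def __init__(self, next_uniqifier: int = 1):
        # One counter for the whole volume (starting value is an assumption).
        self.next_uniqifier = next_uniqifier
        # vnode number -> uniqifier of the vnode currently using that number.
        self.index = {}

    def create_vnode(self, vnode: int) -> tuple:
        """Create a vnode and return its (vnode, uniqifier) pair."""
        if vnode in self.index:
            raise ValueError(f"vnode {vnode} is already in use")
        uniqifier = self.next_uniqifier
        self.next_uniqifier += 1  # bumped on every creation
        self.index[vnode] = uniqifier
        return (vnode, uniqifier)

    def remove_vnode(self, vnode: int) -> None:
        """Free a vnode number; the counter is left untouched."""
        del self.index[vnode]
```

Because the counter never goes backwards, a vnode number that is freed and later reused always gets a uniqifier it has not had before, so no per-vnode {vnode,latest_uniquifier} table is required.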
Jeffrey Hutzelman
2013-Dec-17 18:45 UTC
Re: Re: What is needed to build an AFS fileserver on top of BTRFS?
On Tue, 2013-12-17 at 17:47 +0000, David Howells wrote:
> Hugo Mills <hugo@carfax.org.uk> wrote:
>
> > > (1) 64-bit data version numbers that increase monotonically with
> > > each write. Yes, this is likely to cause some performance
> > > degradation as it introduces an ordering over data writes and
> > > metadata writes to a file. Maybe writes can be batched to improve
> > > performance?
> >
> > Do these have to be per-file? If not, then you might be able to get
> > away with using the transid, which is a filesystem-global
> > monotonically-increasing number.
>
> Yes. If you send a write RPC op to the server, you get back the new version
> number. If the new version number is not the old version number + 1 you know
> there was a collision with a write from another client and you have to flush
> your cache for that file and request a new "callback" (ie. a promise to notify
> you if someone else changes the file).

Right. So, the DV must increment by exactly one for each successful StoreData (and not for other changes). This is important because clients cache data and metadata independently, and cached data is labeled with the file's DV. This means that even if metadata for a file has to be refetched for some reason (for example, an expired callback), the _data_ doesn't have to be refetched unless it has actually changed, or been evicted from the client's cache due to cache pressure.

-- Jeff
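As a rough illustration of why the DV label matters, here is a sketch of a client cache that keeps metadata and data separately. All names (`ClientCache`, `fid`, the `"dv"` key) are hypothetical and the structure is far simpler than a real cache manager.

```python
class ClientCache:
    """Hypothetical client cache with independent metadata and data."""

    def __init__(self):
        self.metadata = {}  # fid -> status dict, including "dv"
        self.data = {}      # fid -> (dv label, cached bytes)

    def refresh_metadata(self, fid, status: dict) -> None:
        """Record freshly fetched status for a file."""
        self.metadata[fid] = status

    def expire_metadata(self, fid) -> None:
        """Drop metadata only, e.g. when a callback expires."""
        self.metadata.pop(fid, None)

    def store_data(self, fid, dv: int, blob: bytes) -> None:
        """Cache file data, labeled with the DV it was fetched at."""
        self.data[fid] = (dv, blob)

    def cached_data(self, fid):
        """Return cached bytes if their DV label matches current metadata."""
        status = self.metadata.get(fid)
        entry = self.data.get(fid)
        if status is None or entry is None:
            return None
        dv, blob = entry
        return blob if dv == status["dv"] else None
```

After an expired callback, the client refetches only the status; if the DV in it matches the label on the cached data, the data is reused without another fetch.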