Daniel Phillips
2008-Jul-13 08:55 UTC
[zfs-discuss] [RFC] Improved versioned pointer algorithms
Greetings, filesystem algorithm fans.

The recent, detailed description of the versioned pointer method for volume versioning is here:

   http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-07/msg02663.html

I apologize humbly for the typo in the first sentence.

Today's revision of the proof of concept code is cleaned up to remove some redundant logic from the exception delete and origin write algorithms, and has a number of small improvements. The fuzzing test has run to 20 million iterations without problems. Some of the more complex logic has become simple to the point where it can possibly be seen to be correct in addition to merely running correctly.

The current efficiency scores for the six basic operations are:

   Origin write:         O(exceptions)
   Snapshot write:       O(versions)
   Snapshot delete:      O(versions)
   Snapshot read:        O(exceptions)
   Snapshot of origin:   O(versions)
   Snapshot of snapshot: O(versions)

This is mainly about CPU time rather than filesystem efficiency. The real time performance will be dominated by disk operations as usual. Snapshot write is a frequent operation and snapshot delete is performed on large amounts of metadata at a time, so it would be nice to improve their performance to be independent of the number of versions created.

O(exceptions) orphan test
-------------------------

It is only the orphan searches that increase snapshot write and snapshot delete running times from O(exceptions) to O(versions); otherwise the main algorithm is already O(exceptions). The orphan test is used in snapshot write and exception delete to identify any new orphans created as a result of a write that creates an exclusive exception for the only heir of the ghost exception, or a delete that removes the only heir and does not promote a new heir. The current method of determining whether a ghost exception is an orphan is to recurse through the subtree below the ghost until a visible version is found that inherits the exception. This traversal runs in O(versions) time.
To perform this test in O(exceptions) time, we proceed as follows. First, identify a subtree of the version tree consisting only of ghost versions in interior nodes and visible versions at terminal nodes, descending from the ghost ancestor of the victim version nearest the root and having no visible versions or present exceptions between itself and the victim. Call each interior node of that subtree a "nexus", which must have more than one child because it is a ghost. This step is done once, prior to a full-tree version delete pass.

The interesting question is whether a ghost exception is inherited by any visible version that does not appear in the same exception list. If not, then the ghost exception is an orphan that must be deleted. This can be computed efficiently using a bottom up approach with a single pass through the exception list. At each nexus keep a count of the children of the nexus that are known not to inherit from that nexus. Call that the blocked count. Set the blocked counts to zero and:

   For each exception in the exception list:
      If the version labeled by the exception is in the ghost tree
         then increment the blocked count of its nexus parent
      If the blocked count is now equal to the number of children of
         the nexus then repeat from the preceding step, treating the
         fully blocked nexus as a blocked child of its own nexus parent

At completion, if the blocked count of the ghost ancestor is equal to its child count then the ghost exception is an orphan, otherwise not.

Example
-------

Below, ghost versions are marked with ~ and members of the ghost tree are marked with *.

   .
   `-- A
       |-- B
       `-- *C~
           |-- *D
           |-- E          <== victim to be deleted
           `-- *F~
               |-- *G
               `-- *H
                   |-- I
                   `-- J

Version C is the single ghost ancestor (there may be more than one) that may contain an orphan exception. E does not belong to the ghost tree because it is to be deleted and has already been removed from the version tree when the ghost tree is constructed. The interior nexus nodes are C and F while D, G and H are terminal, user visible nodes of the ghost tree.
Overlaying the exception list [I, p1] [G, p2] [C, p3] [D, p4] on the example tree:

   .
   `-- A
       |-- B
       `-- *C~ [p3]
           |-- *D [p4]
           |-- E          <== deleted victim
           `-- *F~
               |-- *G [p2]
               `-- *H
                   |-- I [p1]
                   `-- J

When exception [I, p1] is found it is ignored because it is not a member of the ghost tree. H has no exception, so [G, p2] cannot cause the algorithm to iterate past node F. Processing [D, p4] will raise the nexus count at C to one, but C has two children after ignoring E (which is deleted by this time), so [C, p3] is not an orphan. Examining the tree shows that visible version H continues to inherit the ghost exception at C, which could now be relabeled to H if desired, but since it will be detected and deleted in the fullness of time anyway there is little point in doing that extra work here. If any of versions F, H or J had an exception then [C, p3] would be an orphan.

O(log(exceptions)) snapshot lookup
----------------------------------

An O(log(exceptions)) snapshot lookup is known to exist and is left as an exercise for the interested reader.

Updated versioned pointers fuzztest attached
--------------------------------------------

   c99 fuzzversion.c && ./a.out

Request for comment from Btrfs and ZFS developers
-------------------------------------------------

I am considering the wisdom of developing a filesystem based on versioned pointers. The prospect of per-file and per-directory versioning seems attractive, as does the elegant formulation of snapshots-of-snapshots and the compact metadata. I suspect that less updating of metadata will be required for writes versus either Btrfs or ZFS. Mind you, there is no working filesystem code at the moment, only ddsnap the snapshotting block device and the attached proof of concept of versioned pointers, so some of this is speculation. I described earlier how versioned pointers will be extended to versioned extents but have not yet tried it.
This would be necessary in order to operate in the same ballpark of efficiency as ZFS or Btrfs.

Versioned pointers have some advantages over the versioning methods used in ZFS and Btrfs, and also some potential drawbacks.

Advantages versus ZFS and Btrfs:

 * No per-write recursive copy on write of metadata blocks. Creating a
   new copy of a data block normally only requires inserting a new
   versioned pointer into a btree leaf block. Such inserts can be
   logged logically and added to the btree in large batches, which can
   be done atomically using standard methods (physical journal, log or
   copy on write).

Advantages versus Btrfs:

 * No reference count updates, which must be durable and atomic, surely
   a painful overhead. Instead, it can be determined entirely from the
   exception list whether a given block is referenced, so only
   exception lists need to be updated, and only by adding, removing,
   or in rare cases, relabeling physical pointers.

Advantages versus ZFS:

 * No need to maintain a dead list. This information can be entirely
   determined from the combination of the versioned pointers and the
   version tree.

 * It would seem that ZFS is deeply wedded to the concept of a single,
   linear chain of snapshots. No snapshots of snapshots, apparently.
   http://blogs.sun.com/ahrens/entry/is_it_magic

Drawbacks of versioned pointers:

 * All version info is stored together for any given logical address.
   This will require reading more metadata blocks as the number of
   versions increases, and put more pressure on cache. How to fix:
   with extents, metadata is tiny compared to data, even if expanded
   by a large factor, so it may not matter. If it does, an in-memory
   version of accessed btree nodes for a particular snapshot can be
   cached and saved persistently to disk so that the full version
   btree only rarely needs to be accessed.

 * Delete is a full tree pass over the metadata. But this can be done
   in the background for multiple deleted versions per pass.
   A persistent cache can also be kept, per version, of which logical
   address ranges have exceptions for that version, to limit the
   search.

 * Limited number of versions. This can be overcome mindlessly by
   extending the size of a physical pointer, or more elegantly by
   keeping a per-leaf "wide" translation table of versions referenced
   by the leaf, or both, depending on which representation is smaller
   for a particular leaf.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: fuzzversion.c
Type: text/x-csrc
Size: 24080 bytes
Desc: not available
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080713/3df82c49/attachment.bin>
Akhilesh Mritunjai
2008-Jul-14 15:29 UTC
[zfs-discuss] [RFC] Improved versioned pointer algorithms
Still reading, but would like to correct one point.

> * It would seem that ZFS is deeply wedded to the concept of a single,
> linear chain of snapshots. No snapshots of snapshots, apparently.
> http://blogs.sun.com/ahrens/entry/is_it_magic

Writable snapshots are called "clones" in ZFS. So in fact, you have trees of snapshots and clones. Snapshots are read-only, and you can create any number of "writable" clones from a snapshot; these behave like a normal filesystem, and you can again take snapshots of the clones.

This message posted from opensolaris.org
Daniel Phillips
2008-Jul-14 23:01 UTC
[zfs-discuss] [RFC] Improved versioned pointer algorithms
On Monday 14 July 2008 08:29, Akhilesh Mritunjai wrote:
> Still reading, but would like to correct one point.
>
> > * It would seem that ZFS is deeply wedded to the concept of a
> > single, linear chain of snapshots. No snapshots of snapshots,
> > apparently. http://blogs.sun.com/ahrens/entry/is_it_magic
>
> Writable snapshots are called "clones" in zfs. So infact, you have
> trees of snapshots and clones. Snapshots are read-only, and you can
> create any number of "writable" clones from a snapshot, that behave
> like a normal filesystem and you can again take snapshots of the
> clones.

Thanks for clarifying, Akhilesh. That was not clear from the blog link above, which says that any blocks in a deleted snapshot born after the most recent snapshot will be freed. I do not see how that works for clones of snapshots. Are the details of clone deletion covered anywhere, or did I just not read the blog closely enough?

Regards,

Daniel
Daniel Phillips
2008-Jul-20 07:46 UTC
[zfs-discuss] [RFC] Improved versioned pointer algorithms
On Monday 14 July 2008 08:29, Akhilesh Mritunjai wrote:
> Writable snapshots are called "clones" in zfs. So infact, you have
> trees of snapshots and clones. Snapshots are read-only, and you can
> create any number of "writable" clones from a snapshot, that behave
> like a normal filesystem and you can again take snapshots of the
> clones.

So if I snapshot a filesystem, then clone it, then delete a file from both the clone and the original filesystem, the presence of the snapshot will prevent the file blocks from being recovered, and there is no way I can get rid of those blocks short of deleting both the clone and the snapshot. Did I get that right?

Regards,

Daniel
Akhilesh Mritunjai
2008-Jul-21 04:38 UTC
[zfs-discuss] [RFC] Improved versioned pointer algorithms
> On Monday 14 July 2008 08:29, Akhilesh Mritunjai wrote:
> > Writable snapshots are called "clones" in zfs. So infact, you have
> > trees of snapshots and clones. Snapshots are read-only, and you can
> > create any number of "writable" clones from a snapshot, that behave
> > like a normal filesystem and you can again take snapshots of the
> > clones.
>
> So if I snapshot a filesystem, then clone it, then delete a file
> from both the clone and the original filesystem, the presence
> of the snapshot will prevent the file blocks from being recovered,
> and there is no way I can get rid of those blocks short of deleting
> both the clone and the snapshot. Did I get that right?

Right. Snapshots are immutable. Isn't this the whole point of a snapshot?

   FS1(file1) -> Snapshot1(file1)
   delete FS1->file1      : Snapshot1->file1 is still intact

   Snapshot1(file1) -> CloneFS1(file1)
   delete CloneFS1->file1 : Snapshot1->file1 is still intact
                            (snapshot is immutable)

There is a lot of information in the ZFS docs on the ZFS community site. For low-level info, you may refer to the ZFS on-disk format document.

Regards
- Akhilesh
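On a live pool, the sequence above might look like the following. This is a hedged sketch for illustration; the pool and dataset names (`tank`, `fs1`, `clone1`) are hypothetical, not taken from the thread:

```shell
# Create a filesystem, snapshot it, and clone the snapshot.
zfs create tank/fs1
zfs snapshot tank/fs1@snap1           # immutable point-in-time view
zfs clone tank/fs1@snap1 tank/clone1  # writable clone of the snapshot

# Deleting the file from both writable filesystems does not free its
# blocks: the immutable snapshot still references them.
rm /tank/fs1/file1
rm /tank/clone1/file1

# Snapshots of clones are allowed, giving a tree of versions.
zfs snapshot tank/clone1@snap2

# The snapshot cannot be destroyed while a clone depends on it:
zfs destroy tank/fs1@snap1   # refused: snapshot has dependent clones
```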
Daniel Phillips
2008-Jul-21 21:22 UTC
[zfs-discuss] [RFC] Improved versioned pointer algorithms
On Sunday 20 July 2008 21:38, Akhilesh Mritunjai wrote:
> > So if I snapshot a filesystem, then clone it, then delete a file
> > from both the clone and the original filesystem, the presence
> > of the snapshot will prevent the file blocks from being recovered,
> > and there is no way I can get rid of those blocks short of deleting
> > both the clone and the snapshot. Did I get that right?
>
> Right. Snapshots are immutable. Isn't this the whole point of a
> snapshot?

But that is not my point. My point is that there is no way to recover the volume space used by my example file short of deleting both the clone and the snapshot. My example file might be very large.

I can easily imagine a case where this might be an impediment to efficient use of storage. For example, where you have multiple virtual machine instances sharing one root filesystem via clones, and you later want to get rid of some large files from the original root image and all the clones. But because you can't get rid of the snapshot that is required for the clone, no disk space will be recovered.

Btrfs does not suffer from this problem as far as I can see, because it uses reference counting rather than a ZFS-style dead list. I was just wondering if ZFS devs recognize the problem and are working on a solution.

Regards,

Daniel
Will Murnane
2008-Jul-21 21:37 UTC
[zfs-discuss] [RFC] Improved versioned pointer algorithms
On Mon, Jul 21, 2008 at 17:22, Daniel Phillips <phillips at phunq.net> wrote:
> But that is not my point. My point is that there is no way to recover
> the volume space used by my example file short of deleting both the
> clone and the snapshot.

Not so. Use "zfs promote" to make the clone into a full-fledged independent, and then destroy the snapshot.

Will
Daniel Phillips
2008-Jul-21 21:55 UTC
[zfs-discuss] [RFC] Improved versioned pointer algorithms
On Monday 21 July 2008 14:37, Will Murnane wrote:
> On Mon, Jul 21, 2008 at 17:22, Daniel Phillips <phillips at phunq.net> wrote:
> > But that is not my point. My point is that there is no way to
> > recover the volume space used by my example file short of deleting
> > both the clone and the snapshot.
>
> Not so. Use "zfs promote" to make the clone into a full-fledged
> independent, and then destroy the snapshot.

What if you don't want it to be a full-fledged independent, but instead continue to share as much storage as possible with the original filesystem? It seems to me that the promote would likely waste even more disk space.

Regards,

Daniel
Miles Nordin
2008-Jul-21 23:33 UTC
[zfs-discuss] [RFC] Improved versioned pointer algorithms
>>>>> "dp" == Daniel Phillips <phillips at phunq.net> writes:

    dp> What if you don't want it to be a full fledged independent,
    dp> but instead continue to share as much storage as possible with
    dp> the original filesystem?

The Fine Man page says in very clear orange-and-black text that no new space is consumed by promoting a clone.

However, to my reading, the OP is wrong, and you can't free space in the way you want to. You can't delete the snapshot until you delete all writeable clones of it but one. Assuming you accept this rule and still want to delete the snapshot, 'promote' lets you pick which clone you'd like to survive. The man page implies that's its primary intended purpose.
Akhilesh Mritunjai
2008-Jul-22 20:39 UTC
[zfs-discuss] [RFC] Improved versioned pointer algorithms
> Btrfs does not suffer from this problem as far as I can see because
> it uses reference counting rather than a ZFS-style dead list. I was
> just wondering if ZFS devs recognize the problem and are working on
> a solution.

Daniel,

Correct me if I'm wrong, but how does reference counting solve this problem?

The terminology is as follows:

1. Filesystem: a writable "filesystem" with no "references" or a "parent".
2. Snapshot: an immutable point-in-time view of a filesystem.
3. Clone: a writable filesystem whose parent is a given snapshot.

Under this terminology, it is easy to see that the dead list is equivalent to reference counting. The "problem" is rather that to have a clone, you need to have its snapshot around, since by definition it is a child of a snapshot (with the exception that by using "zfs promote" you can make a clone a direct child of the filesystem; it's like turning a grandchild into a child).

So what is the terminology of Btrfs?