thr3ads.net - zfs discuss - [zfs-discuss] Simplifying ZFS via consistent use of COW and snapshot namespaces [Dec 2005]

Andrew

2005-Dec-14 01:21 UTC

[zfs-discuss] Simplifying ZFS via consistent use of COW and snapshot namespaces

I propose three changes to ZFS, all of which are individually beneficial, and
which together move some features of ZFS from features which must be explicitly
invoked by the user/administrator to features which are automatically and
transparently invoked by the system. The changes are:
1. Make snapshots of nested filesystems do nested transactional snapshots (and
rollbacks do nested rollbacks), the lack of which I objected to in my message
"Counterintuitive snapshotting" in this forum (at
http://www.opensolaris.org/jive/thread.jspa?threadID=4244&tstart=0).
2. Do not require that rolling back a filesystem to a snapshot destroy all
intermediate snapshots. Dropping the requirement allows filesystems to be
arbitrarily rolled back without requiring destruction of clones which are
dependent on intermediate snapshots.
3. Change "cp" to use COW by taking a snapshot of the source file and
creating the target file as a clone (yes, this means snapshotting and cloning
individual files; see below), and perform the copies-on-write at the block level
in the same way that ZFS currently performs COW at the block level when a file
is modified. The sharing of blocks among files would be entirely transparent at
the user level in the same way that the sharing of files among cloned
filesystems is entirely transparent.

With these changes, "zfs create" and "zfs clone" can be
eliminated and replaced with "mkdir" and "cp -r",
respectively: "mkdir /foo/bar", if foo is a ZFS filesystem, will
create a ZFS filesystem named "bar", and "cp -r foo bar", if
foo is a ZFS filesystem, will create the snapshot
foo@time_when_bar_was_made-autogenerated and make a clone named "bar"
(snapshots ending with "-autogenerated" could be excluded by default
from snapshot listings, with an option to show them). Both of these are entirely
transparent if the changes mentioned above are made. Thus, filesystems and
directories are the same thing, and there's no difference between
making a copy of a directory and making a clone of it.

Next, suppose you have filesystems /foo and /foo/bar, and the file /foo/bar/baz.
If you do "zfs snapshot foo@first", you can then do "zfs destroy foo@first",
but you can't do "ls foo@first"; you have to do
"ls /foo/.zfs/snapshot/first". But you can't do
"rm -r /foo/.zfs/snapshot/first"; you have to do
"zfs destroy foo@first". Similarly for "mv" vs. "zfs rename".
Why the inconsistency? Why not drop the
"/.zfs/snapshot/" nonsense as the separator between the filesystem
name and version name halves of the snapshot name, and simply always use the
"@" separator and allow the simple snapshot names to be passed
directly to conventional tools like rm and mv?
Also, if I can do "zfs snapshot foo@first" and "zfs snapshot
foo/bar@first", why can't I do "zfs snapshot foo/bar/baz@first"?
It makes no sense to allow me to take snapshots only of
individual directories (see above for equivalence of directories and
filesystems) but not of individual files. If I can do "zfs snapshot
project/code.c@tuesday-added_some_more_code" and "zfs snapshot
project/code.c@wednesday-fixed_some_bugs" and "cat
project/code.c@monday-known_to_compile", and "ls -l --show_snapshots
project/code.c" to see the creation times of all the snapshots along with
the modification time of the file itself, then I can get rid of CVS and just use
ZFS as my version control system.

Filenames and directory names are names of variables; the data which they
identify changes over time. Snapshot names are names of constants; they identify
particular versions of files and directories. "variablename@timestamp"
is an appropriate way to identify the value which a variable had
at some particular time; "variablename/.zfs/snapshot/timestamp" is
not. Not only is the latter more complicated, but it would be a very strange
expression if the variable happened to be a file rather than a directory.
This message posted from opensolaris.org

Eric Schrock

2005-Dec-14 01:59 UTC

[zfs-discuss] Simplifying ZFS via consistent use of COW and snapshot namespaces

On Tue, Dec 13, 2005 at 05:21:38PM -0800, Andrew wrote:
> 1. Make snapshots of nested filesystems do nested transactional
> snapshots (and rollbacks do nested rollbacks), the lack of which I
> objected to in my message "Counterintuitive snapshotting" in this
> forum (at
> http://www.opensolaris.org/jive/thread.jspa?threadID=4244&tstart=0).
I think a reasonable RFE would be to have something like:

# zfs snapshot -r today tank/home

Which would take a snapshot with the name 'today' for all filesystems
under tank/home.  This is slightly different from what you're probably
asking for, because you'll end up with N snapshots, where N is the
number of filesystems.  There is really no way around this, because you
cannot have a snapshot span multiple datasets; it's just fundamentally
not possible.
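
The "N snapshots" point can be sketched in a few lines. This is only an illustration, not ZFS code: the function name recursive_snapshot_names and the dataset list (including tank/home/eschrock) are made up for the example, and a real command would still have to create each snapshot.

```python
def recursive_snapshot_names(datasets, root, snap):
    """Return one snapshot name per dataset at or below `root`.

    Hypothetical helper: it only builds names, it does not talk to ZFS.
    """
    prefix = root + "/"
    # Match the root itself and true descendants, but not siblings such as
    # "tank/homework" that merely share a string prefix with "tank/home".
    selected = [d for d in sorted(datasets) if d == root or d.startswith(prefix)]
    return [f"{d}@{snap}" for d in selected]


# Made-up pool layout for illustration.
datasets = [
    "tank",
    "tank/home",
    "tank/home/ahrens",
    "tank/home/eschrock",
    "tank/mail",
]

names = recursive_snapshot_names(datasets, "tank/home", "today")
for name in names:
    print(name)
```

Each nested filesystem gets its own snapshot with the shared name, which is why the result is N snapshots rather than one snapshot spanning the hierarchy.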

Feel free to file an RFE through OpenSolaris for the above behavior; it
seems entirely reasonable to me.
> 2. Do not require that rolling back a filesystem to a snapshot destroy
> all intermediate snapshots. Dropping the requirement allows
> filesystems to be arbitrarily rolled back without requiring
> destruction of clones which are dependent on intermediate snapshots.
>
> 3. Change "cp" to use COW by taking a snapshot of the source file and
> creating the target file as a clone (yes, this means snapshotting and
> cloning individual files; see below), and perform the copies-on-write
> at the block level in the same way that ZFS currently performs COW at
> the block level when a file is modified. The sharing of blocks among
> files would be entirely transparent at the user level in the same way
> that the sharing of files among cloned filesystems is entirely
> transparent.
The above aren't actually possible given the way ZFS implements
snapshots.  ZFS does not implement snapshots and clones via a simple
reference counting mechanism.  The reasons not to do this are many, not
the least of which is that, with reference counting, taking a snapshot
would no longer be constant time and would overwrite live data.  See
Matt's blog for more information on how it's accomplished:

http://blogs.sun.com/roller/page/ahrens?entry=is_it_magic

The end result is that you can't simply move and COW blocks around as
you see fit.  In particular, you cannot COW blocks within a certain file
without going the full mile and creating a clone of the whole
filesystem.  Therefore snapshots (or clones) of individual files and
directories are not possible.
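
A toy model may help show why a snapshot can be constant time without reference counts. It assumes the birth-time idea (each block records the transaction group in which it was written, and a snapshot records only the transaction group at which it was taken); the class ToyDataset and all of its fields are invented for illustration and are far simpler than the real implementation.

```python
class ToyDataset:
    """Toy model of one filesystem whose snapshots are just txg numbers."""

    def __init__(self):
        self.txg = 1          # current transaction group number
        self.blocks = {}      # block id -> birth txg of the live copy
        self.snapshots = {}   # snapshot name -> txg at which it was taken
        self.deadlist = []    # (block id, birth txg) still needed by a snapshot
        self.freed = []       # (block id, birth txg) returned to the pool

    def write(self, block_id):
        """Copy-on-write: the new copy is born now, the old copy is retired."""
        old_birth = self.blocks.get(block_id)
        if old_birth is not None:
            self._retire(block_id, old_birth)
        self.blocks[block_id] = self.txg
        self.txg += 1

    def snapshot(self, name):
        """Constant time: record a txg and touch no blocks."""
        self.snapshots[name] = self.txg
        self.txg += 1

    def _retire(self, block_id, birth):
        # A block born after the most recent snapshot is referenced by no
        # snapshot and can be freed; an older one must be kept for it.
        latest = max(self.snapshots.values(), default=0)
        if birth > latest:
            self.freed.append((block_id, birth))
        else:
            self.deadlist.append((block_id, birth))


fs = ToyDataset()
fs.write("A")         # A is born at txg 1
fs.snapshot("first")  # snapshot recorded at txg 2
fs.write("A")         # old copy (born 1) predates the snapshot: kept
fs.write("A")         # old copy (born 3) postdates the snapshot: freed
print(fs.deadlist, fs.freed)
```

In this model a snapshot is a single number recorded for the whole dataset, which is consistent with the point above that a snapshot applies to a filesystem rather than to one file inside it.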
> With these changes, "zfs create" and "zfs clone" can be eliminated,
> and replaced with "mkdir" and "cp -r", respectively; "mkdir /foo/bar",
> if foo is a ZFS filesystem, will create a ZFS filesystem named "bar",
> and "cp -r foo bar", if foo is a ZFS filesystem, will create the
> snapshot foo@time_when_bar_was_made-autogenerated and make a clone
> named "bar" (snapshots ending with "-autogenerated" could be excluded
> by default from snapshot listings, with an option to show them). Both
> of these are entirely transparent if the changes mentioned above are
> made. Thus, filesystems and directories are the same thing, and also
> there's no difference between making a copy of a directory and making
> a clone of it.
Besides the inherent fact that you cannot clone an individual file or
directory, if we imagine this scheme, a simple home directory server
would end up with hundreds of millions of snapshots and clones.  
> Next, suppose you have filesystems /foo and /foo/bar, and the file
> /foo/bar/baz. If you do "zfs snapshot foo@first", you can then do "zfs
> destroy foo@first", but you can't do "ls foo@first"; you have to do
> "ls /foo/.zfs/snapshot/first". But you can't do "rm -r
> /foo/.zfs/snapshot/first"; you have to do "zfs destroy foo@first".
> Similarly for "mv" vs. "zfs rename". Why the inconsistency? Why not
> drop the "/.zfs/snapshot/" nonsense as the separator between the
> filesystem name and version name halves of the snapshot name, and
> simply always use the "@" separator and allow the simple snapshot
> names to be passed directly to conventional tools like rm and mv?
We have talked about exposing some level of administrative control
of this sort through .zfs, and have even prototyped it.  I'm still not
sure what you're suggesting by "dropping the nonsense".  Do you mean you
would like to do:

	$ cd /home/eschrock
	$ mv .vimrc@yesterday .vimrc

Instead of:

	$ cd /home/eschrock
	$ cp .zfs/snapshot/yesterday/.vimrc .vimrc

We can't really hijack "@*" from every file in the filesystem.  That's
simply not acceptable for POSIX or any reasonable expectation of a
filesystem.
> Also, if I can do "zfs snapshot foo@first" and "zfs snapshot
> foo/bar@first", why can't I do "zfs snapshot foo/bar/baz@first"? It
> makes no sense to allow me to take snapshots only of individual
> directories (see above for equivalence of directories and filesystems)
> but not of individual files. If I can do "zfs snapshot
> project/code.c@tuesday-added_some_more_code" and "zfs snapshot
> project/code.c@wednesday-fixed_some_bugs" and "cat
> project/code.c@monday-known_to_compile", and "ls -l --show_snapshots
> project/code.c" to see the creation times of all the snapshots along
> with the modification time of the file itself, then I can get rid of
> CVS and just use ZFS as my version control system.
See the discussion of how snapshots work, above.
> Filenames and directory names are names of variables; the data which
> they identify changes over time. Snapshot names are names of
> constants; they identify particular versions of files and directories.
> "variablename@timestamp" is an appropriate way to identify the value
> which a variable had at some particular time;
> "variablename/.zfs/snapshot/timestamp" is not. Not only is the latter
> more complicated, but it would be a very strange expression if the
> variable happened to be a file rather than a directory.
I'm not sure what this is supposed to mean, but I hope the above
responses are some indication of what's possible and what isn't.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock

Richard Elling

2005-Dec-14 05:37 UTC

[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces

> 3. Change "cp" to use COW by taking a snapshot of the
> source file and creating the target file as a clone
> (yes, this means snapshotting and cloning individual
> files; see below), and perform the copies-on-write at
> the block level in the same way that ZFS currently
> performs COW at the block level when a file is
> modified. The sharing of blocks among files would be
> entirely transparent at the user level in the same
> way that the sharing of files among cloned
> filesystems is entirely transparent.
<geezer_mode>
Way back when disks were small, and files were smaller,
VMS did this sort of versioning.  Not "cp" per se, since cp
isn't normally destructive.  But if you modified a file, a new
version would be created.  Very cool, at first glance.
IIRC, it had the same sort of semantics as you describe.
Modifications created a version, and versions had to be
explicitly removed.  Raises hell with quotas (I've got the
scar).

But when you look at the actual implications of this, it
quickly becomes a costly feature and, sometimes, a
system management nightmare.  For example, I get about
200 emails each day.  Suppose the system created 200
versions of my inbox each day... very uncool.

OK, someone will probably point out that VMS still exists
today, but I stopped using it at version 2.21, ha!
</geezer_mode>

There are a large number of applications which modify
or append files.  IMHO it is not in our best interest if the
file system implements versioning policies for each and
every file.  It is much more manageable for a system
administrator to say "friday at happy hour + 20 minutes
we snapshot for the week."

And what about databases which use files for backing
store?  COW run amuck causes some concern for such
environments.

[ok, since I do work for Sun, we'd *love* to sell you a
few petabytes of storage each month... :-) but that is
really a non-starter]

 -- richard

Tao Chen

2005-Dec-14 06:12 UTC

[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces

On 12/13/05, Richard Elling <Richard.Elling at sun.com> wrote:
> > 3. Change "cp" to use COW by taking a snapshot of the
> > source file and creating the target file as a clone

[...]

> There are a large number of applications which modify
> or append files.  IMHO it is not in our best interest if the
> file system implements versioning policies for each and
> every file.  It is much more manageable for a system
> administrator to say "friday at happy hour + 20 minutes
> we snapshot for the week."
>
> And what about databases which use files for backing
> store?  COW run amuck causes some concern for such
> environments.
>
> [ok, since I do work for Sun, we'd *love* to sell you a
> few petabytes of storage each month... :-) but that is
> really a non-starter]
>

Item 3, by itself, is not about file versioning, if I understand correctly.
It sounds similar to memory COW after fork: 'cp' the parent process to a
child process, and memory pages are not copied until modified.

In fact, when I heard ZFS is a "COW filesystem", I thought it does exactly
that, or does it?
It could save users storage space if files are duplicated and then slightly
modified.

Or are you commenting on the rest of the proposal?

Yes, I can see quotas will become very interesting and potential I/Os become
somewhat unpredictable: a write to the 'parent' file can generate I/Os to
all the 'child' copies and their children.

Most definitely you guys have thought about that in the early days :-)

Tao

Andrew

2005-Dec-14 15:15 UTC

[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces

Eric Schrock wrote:
> I think a reasonable RFE would be to have something
> like:
>
> # zfs snapshot -r today tank/home
>
> Which would take a snapshot with the name 'today'
> for all filesystems under tank/home.

Could that be done atomically?

BTW, the recursiveness should be enabled by default. Considering the
schizophrenic insanity of standard cp defaulting to non-recursive copy, mv
defaulting to recursive move, tar defaulting to recursive copy, and chmod being
non-recursive by default (not to mention the switch insanity with "cp
-r" meaning "enable recursiveness and make a copy" but
"chmod -r" meaning "disable readability, and do it
non-recursively"), and that snapshots already default to recursiveness
through subdirectories of a filesystem, it would make more sense for snapshots
to default to recursiveness through subfilesystems too. Otherwise
you'll have administrators accidentally excluding some subparts of
their filesystems while taking snapshots because they forgot that they happened
to create those subparts as subfilesystems rather than as subdirectories. Not
only that, but the mistake would be harder to catch: unlike with cp,
where a cursory glance at the result will reveal if the copy was accidentally
made non-recursively, a cursory glance at a snapshot will show recursive
directories and thus provide a false sense of confidence that recursiveness was
enabled, when actually the subfilesystems are missing from the snapshot but go
unnoticed because they're deeply buried in the name hierarchy.
> This is slightly different from what you're probably
> asking for, because you'll end up with N snapshots,
> where N is the number of filesystems.  There is really
> no way around this, because you cannot have a snapshot
> span multiple datasets, it's just fundamentally
> not possible.

If multiple datasets can be snapshotted atomically, then what I'm
describing can be implemented; the fact that there would be N underlying
snapshots would be irrelevant to the user, since those snapshots could be
integrated into one namespace. E.g. with filesystems foo and foo/bar, and file
foo/bar/baz, the snapshot foo@first could be taken atomically and recursively,
and then using your naming convention baz could be accessed as
foo/.zfs/snapshot/first/bar/baz instead of (or as well as)
foo/bar/.zfs/snapshot/first/baz. (Using the naming convention I'm
recommending, it would be foo@first/bar/baz, but that's irrelevant
to the important issue here of whether the recursive snapshots can be done
atomically.)
> Feel free to file an RFE through OpenSolaris for the
> above behavior; it seems entirely reasonable to me.

If recursive snapshots can't be done atomically, then the RFE would
have to be "please fundamentally rearchitect ZFS"... :)
> The above aren't actually possible given the way ZFS
> implements snapshots.
[snip]
> Therefore snapshots (or clones) of individual files and
> directories are not possible.

I'm still working on a full response to this, but in the meantime, what
disadvantage currently would there be in aliasing mkdir to zfs create? (Except
on non-ZFS filesystems, of course.) I.e. in what circumstances would it be
necessary for a particular directory in a ZFS filesystem to be just a regular
directory rather than a ZFS filesystem? Simply making all directories be
filesystems would allow all directories to be individually snapshottable and
cloneable. As for individual files, have ZFS simply create a filesystem foo
whenever a user process requests creation of a file foo, and automatically
create the file foo/thefile, hide foo/thefile from user processes, report foo as
a file instead of as a filesystem, and automatically direct read/write requests
on foo to foo/thefile. Then the user process can request a snapshot or clone of
foo, thinking that foo is a file, and ZFS can actually create a snapshot of the
filesystem foo, which would include foo/thefile.
> Besides the inherent fact that you cannot clone an
> individual file or directory,

Hopefully addressed by my comments above.

> if we imagine this scheme, a simple home directory server
> would end up with hundreds of millions of snapshots
> and clones.

Why would this be a problem? (Of course the administrative tools could by
default exclude the auto-generated snapshots and clones from listings, and
include them only upon explicit request.)
> I'm still not sure what you're suggesting by "dropping the
> nonsense".  Do you mean you would like to do:
>
> 	$ cd /home/eschrock
> 	$ mv .vimrc@yesterday .vimrc
>
> Instead of:
>
> 	$ cd /home/eschrock
> 	$ cp .zfs/snapshot/yesterday/.vimrc .vimrc

Yes.
> We can't really hijack "@*" from every file in the
> filesystem.

But you can hijack .zfs from every directory (er, from every directory which
happens to be a filesystem) in the filesystem?

> That's simply not acceptable for POSIX

Why not? Consider "cp -r foo bar;chmod -R -w bar". Is that acceptable
for POSIX? Yes.
Now consider "cp -r foo bar;chmod -R -w bar;mv bar
foo/.zfs/snapshot/bar". Is that acceptable for POSIX? Yes. But
isn't this the same as "zfs snapshot foo@bar" (assuming
that either foo has only subdirectories but no subfilesystems, or ZFS by default
does recursive snapshots as mentioned above), besides the fact that the latter
does it atomically? Thus what ZFS currently does is acceptable for POSIX.
Now consider "cp -r foo bar;chmod -R -w bar;mv bar foo@bar". Is
that acceptable for POSIX? Yes. So why not make "zfs snapshot foo@bar"
do this?
This use of "@" in names doesn't conflict with other
arbitrary uses of the symbol. ZFS would know which names containing
"@" are names of snapshots, and which aren't, and which parts
of which snapshot names are the dataset names and which parts are the version
names, because the fact that a name contains the "@" symbol
doesn't imply that the name identifies a snapshot, only vice versa.
Thus snapshotting the filesystem mail/messages-from-eric@sun to produce the
snapshot mail/messages-from-eric@sun@tuesday wouldn't cause any
problem, and I could do "vi mail/messages-from-eric@sun/explanations-of-zfs"
and "cat mail/messages-from-eric@sun@monday/explanations-of-zfs".
I don't see how this would be unacceptable for POSIX.
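
The lookup being argued for here can be sketched as follows. This is a hypothetical model, not anything ZFS provides: the Namespace class and its methods are invented, and it only shows that a name is treated as a snapshot when it is registered as one, not merely because it contains "@".

```python
class Namespace:
    """Hypothetical registry of dataset names and snapshot names."""

    def __init__(self):
        self.datasets = set()
        self.snapshots = {}  # full snapshot name -> (dataset, version)

    def create(self, dataset):
        self.datasets.add(dataset)

    def snapshot(self, dataset, version):
        if dataset not in self.datasets:
            raise KeyError(dataset)
        full = f"{dataset}@{version}"
        # A dataset may itself contain "@", so the full name could already
        # be taken by another dataset or snapshot.
        if full in self.datasets or full in self.snapshots:
            raise ValueError(f"name already in use: {full}")
        self.snapshots[full] = (dataset, version)
        return full

    def classify(self, name):
        """Decide by lookup, never by scanning the name for '@'."""
        if name in self.snapshots:
            dataset, version = self.snapshots[name]
            return ("snapshot", dataset, version)
        if name in self.datasets:
            return ("dataset", name, None)
        return ("unknown", name, None)


ns = Namespace()
ns.create("mail/messages-from-eric@sun")
full = ns.snapshot("mail/messages-from-eric@sun", "tuesday")
print(ns.classify(full))
print(ns.classify("mail/messages-from-eric@sun"))
```

The sketch also rejects a snapshot whose full name would coincide with an existing name, a case for which such a scheme would need some rule.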

I'm also suggesting the ability to do "zfs snapshot mail@wednesday",
and then mail/messages-from-eric@sun@tuesday/explanations-of-zfs would be
the same as
mail@wednesday/messages-from-eric@sun@tuesday/explanations-of-zfs, and that
snapshots would include subfilesystems and their snapshots (and rollbacks would
roll back subfilesystems and their snapshots), but this is dependent on features
which you're saying are not practical to implement.
> > Filenames and directory names are names of variables; the data which
> > they identify changes over time. Snapshot names are names of
> > constants; they identify particular versions of files and directories.
> > "variablename@timestamp" is an appropriate way to identify the value
> > which a variable had at some particular time;
> > "variablename/.zfs/snapshot/timestamp" is not. Not only is the latter
> > more complicated, but it would be a very strange expression if the
> > variable happened to be a file rather than a directory.
>
> I'm not sure what this is supposed to mean

Regarding snapshotting individual files, I was referring to e.g. the file
"code.c" and the strangeness of using
"code.c/.zfs/snapshot/monday" instead of "code.c@monday"
for a snapshot of the file.
> but I hope the above
> responses are some indication of what's possible and
> what isn't.

They help. I'm still studying Matt's blog entry and the source
code to try to figure out whether recursive atomic snapshots are possible. It
appears that the answer is yes.
Yes?

Andrew

2005-Dec-14 16:00 UTC

[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces

Minor clarifications: I wrote:
> Now consider "cp -r foo bar;chmod -R -w bar;mv bar
> foo/.zfs/snapshot/bar". Is that acceptable for POSIX? Yes. But
> isn't this the same as "zfs snapshot foo@bar" (assuming
> that either foo has only subdirectories but no subfilesystems, or ZFS by default
> does recursive snapshots as mentioned above), besides the fact that the latter
> does it atomically?

Of course I understand that it's very different from ZFS's
perspective; my point is the equivalence from the user's perspective.

> I'm also suggesting the ability do "zfs snapshot mail@wednesday",
> and then mail/messages-from-eric@sun@tuesday/explanations-of-zfs would be
> the same as
> mail@wednesday/messages-from-eric@sun@tuesday/explanations-of-zfs, and that
> snapshots would include subfilesystems and their snapshots (and rollbacks would
> roll back subfilesystems and their snapshots)

Oops. I meant "and rollbacks would roll back subfilesystems and remove
those filesystems' more recent snapshots from the filesystems'
namespaces, but not necessarily destroy those snapshots, since there might be
clones dependent on them. Alternatively, the clones could be fully copied and
then the snapshots destroyed." My point is that any particular filesystem
should be exactly the same immediately after a snapshot and immediately after
later rolling back to that snapshot, including the states of all subfilesystems
and the snapshots thereof.

Eric Schrock

2005-Dec-14 16:18 UTC

[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces

On Wed, Dec 14, 2005 at 07:15:19AM -0800, Andrew wrote:
>
> I'm still working on a full response to this, but in the meantime,
> what disadvantage currently would there be in aliasing mkdir to zfs
> create? (Except on non-ZFS filesystems, of course.) I.e. in what
> circumstances would it be necessary for a particular directory in a
> ZFS filesystem to be just a regular directory rather than a ZFS
> filesystem? Simply making all directories be filesystems would allow
> all directories to be individually snapshottable and cloneable. As for
> individual files, have ZFS simply create a filesystem foo whenever a
> user process requests creation of a file foo, and automatically create
> the file foo/thefile, hide foo/thefile from user processes, report foo
> as a file instead of as a filesystem, and automatically direct
> read/write requests on foo to foo/thefile. Then the user process can
> request a snapshot or clone of foo, thinking that foo is a file, and
> ZFS can actually create a snapshot of the filesystem foo, which would
> include foo/thefile.
You're basically asking us to re-architect the way UNIX works.  Besides
the insane cost of doing so, our backwards compatibility guarantee
prevents us from even trying it.  There are many reasons why we have
filesystems that contain files.  The first is just that filesystems have
a great deal more overhead than directories.  Secondly, filesystems
provide a precise namespace.  In particular, the inode number space is
unique per filesystem.  If you had each directory as a filesystem, you
would no longer have unique inodes per 'filesystem'.  Hard links would
be impossible, archivers and backup solutions would break all over the
place, etc.  There are tons more examples.
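
The inode point can be checked from user space. The sketch below uses only Python's standard os and tempfile modules; the file names are arbitrary. It shows that a hard link shares its (st_dev, st_ino) pair with the original, which is the identity that archivers and backup tools typically rely on.

```python
import os
import tempfile


def file_identity(path):
    """Return the (device, inode) pair that identifies a file."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original")
    link = os.path.join(d, "hardlink")
    other = os.path.join(d, "other")

    for path in (original, other):
        with open(path, "w") as f:
            f.write("data")

    # A hard link is a second name for the same inode on the same filesystem.
    os.link(original, link)

    same = file_identity(original) == file_identity(link)
    different = file_identity(original) != file_identity(other)
    nlink = os.stat(original).st_nlink

print(same, different, nlink)
```

Because an inode number is only meaningful together with the filesystem it belongs to, link(2) refuses to create a hard link across filesystems, which is the breakage described above if every directory were its own filesystem.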

Please understand that this is just not possible in ZFS, period.
> But you can hijack .zfs from every directory (er, from every directory
> which happens to be a filesystem) in the filesystem?
Yes.  POSIX rules allow us to reserve a single name in the root of each
filesystem (such as '.zfs' or 'lost+found').
> This use of "@" in names doesn't conflict with other arbitrary uses of
> the symbol. ZFS would know which names containing "@" are names of
> snapshots, and which aren't, and which parts of which snapshot names
> are the dataset names and which parts are the version names, because
> the fact that a name contains the "@" symbol doesn't imply that the
> name identifies a snapshot, only vice versa. Thus snapshotting the
> filesystem mail/messages-from-eric@sun to produce the snapshot
> mail/messages-from-eric@sun@tuesday wouldn't cause any problem, and I
> could do "vi mail/messages-from-eric@sun/explanations-of-zfs" and "cat
> mail/messages-from-eric@sun@monday/explanations-of-zfs". I don't see
> how this would be unacceptable for POSIX.
We could have this as a property available for ZFS filesystems, but it
would definitely be non-POSIX compliant.  You can't just have random
"hidden" files that get triggered whenever you use an '@' symbol.

This is also _incredibly_ tricky under the UNIX filesystem model.  Take
a look at how .zfs is implemented today to see how difficult it is
already.  And personally, I would classify it as "impossible".

While we appreciate the suggestions, these have fallen into one of two
categories:

a. Impossible to implement, due to fundamental restrictions of ZFS,
   UNIX, or POSIX.

b. Difficult or impractical to the point of not being worth attempting.

While we could continue to debate this ad infinitum, our team will stand
firm on the above conclusions.  I'm sorry if this is not the answer you
want to hear, but ZFS can't solve every possible problem in the world.
If you don't believe me, please take the code and implement it
yourself.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock

andrewee2 andrewee2

2005-Dec-14 18:57 UTC

[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces

> We could have this an property available for ZFS
> filesystems, but it would definitely be non-POSIX
> compliant.  You can't just have random "hidden" files
> that get triggered whenever you use an '@' symbol.

I don't understand what you mean by this. In the same
way that I can use "zfs snapshot foo@bar" which will
automatically create foo/.zfs/snapshot/bar, or use "cp
-r foo bar;chmod -R -w bar;mv bar
foo/.zfs/snapshot/bar", and the result, though
different from ZFS's perspective, would be the same
from the perspective of a POSIX-only program, I could
also use "zfs snapshot foo@bar" to automatically
create foo@bar or use "cp -r foo bar;chmod -R -w
bar;mv bar foo@bar", and the result would be the same
from the perspective of a POSIX-only program. I don't
understand the sense in which you say it's hidden, or
what would get triggered by using "@".
> While we appreciate the suggestions, these have
> fallen into one of two categories:
>
> a. Impossible to implement, due to fundamental
>    restrictions of ZFS, UNIX, or POSIX.
>
> b. Difficult or impractical to the point of not
>    being worth attempting.
>
> While we could continue to debate this ad infinitum,
> our team will stand firm on the above conclusions.
> I'm sorry if this is not the answer you want to hear,
> but ZFS can't solve every possible problem in the world.
> If you don't believe me, please take the code and
> implement it yourself.

Fair enough. But since it's not (yet) clear to me
whether recursive snapshots can be done atomically, if
you happen to know the answer off the top of your
head, can you give the answer? If the answer is no,
then it looks like ZFS would be unable to do what I
want even if UNIX/POSIX compatibility weren't a
requirement, but if the answer is yes, then it looks
like ZFS would be useful for accomplishing at least
some of what I'm suggesting in environments in which
compatibility isn't a requirement.

Matthew Ahrens

2005-Dec-14 23:44 UTC

[zfs-discuss] Simplifying ZFS via consistent use of COW and snapshot namespaces

On Tue, Dec 13, 2005 at 05:21:38PM -0800, Andrew wrote:
> I propose three changes to ZFS, all of which are individually
> beneficial, and which together move some features of ZFS from features
> which must be explicitly invoked by the user/administrator to features
> which are automatically and transparently invoked by the system. The
> changes are:
>
> 1. Make snapshots of nested filesystems do nested transactional
> snapshots (and rollbacks do nested rollbacks), the lack of which I
> objected to in my message "Counterintuitive snapshotting" in this
> forum (at
> http://www.opensolaris.org/jive/thread.jspa?threadID=4244&tstart=0).
As Eric mentioned, it would be straightforward to make
'zfs snapshot -r <fs>@<snap>' (or some variation on that syntax)
take a snapshot of all nested filesystems as well.  It would be possible
to create all the snapshots atomically (e.g. by creating them all in the
same txg).

The idea of making recursive snapshots be the default is interesting,
but it brings up a lot more issues.  For example, would a snapshot of a
filesystem *imply* that its descendants are snapshotted as well?  Then
descendant filesystems wouldn't have the snapshot explicitly listed
(e.g. by 'zfs list').  In that case, would non-recursive snapshots be
allowed at all?  If they are, how do you distinguish between recursive and
non-recursive snapshots?  I think it would be exceedingly difficult to
"hide" the recursive snapshots in the face of 'zfs rename'.  For
example, what happens if I have:

tank/home
tank/home@yesterday
tank/home/ahrens (which is implicitly snapshotted by tank/home@yesterday)
tank/mail

And I run 'zfs rename tank/mail tank/home/mail'?
> 2. Do not require that rolling back a filesystem to a snapshot destroy
> all intermediate snapshots. Dropping the requirement allows
> filesystems to be arbitrarily rolled back without requiring
> destruction of clones which are dependent on intermediate snapshots.

That might be possible, by creating a clone of the snapshot you want to
rollback to, and using that as the filesystem.  However, the snapshot
couldn't be deleted until the subsequent snapshots (from the abandoned
branch) are deleted.  Rather than hide this 'magic' and invent a new
rule to explain why the snapshot can't be deleted, I think we should
just expose what's actually going on -- you've cloned the snapshot.  For
example, you would have:

tank/foo@a --- tank/foo@b --- tank/foo@c ------------------- tank/foo
                        \
                         \
                          ------------------ tank/foo-new@d --- tank/foo-new

Since this is kind of clumsy, you really want to 'clone swap' these two
(see bug 6276916), so you can have:

tank/foo@a --- tank/foo@b -------------------- tank/foo@d --- tank/foo
                        \
                         \
                          --- tank/foo-abandoned@c ------------- tank/foo-abandoned

So I think you can accomplish what you want today, with a few steps and
a little ugliness in the naming.  I'd consider adding a
'zfs rollback -c <snap>' to automate the procedure.
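
The clone-and-swap procedure above can be sketched as a toy Python model. This is only an illustration under stated assumptions, not ZFS internals: the dictionary layout, the "-abandoned" naming, and the function names `rollback_by_clone_swap` and `can_destroy_snapshot` are all hypothetical.

```python
def rollback_by_clone_swap(pool, fs, snap):
    """Roll fs back to snap while keeping the later snapshots on a side branch.

    pool maps a filesystem name to {"origin": str or None, "snaps": [oldest..newest]}.
    The later snapshots move to a branch named fs + "-abandoned", whose
    origin is fs@snap (hypothetical naming, following the diagram).
    """
    abandoned = fs + "-abandoned"
    if abandoned in pool:
        raise ValueError(abandoned + " already exists")
    entry = pool[fs]
    keep = entry["snaps"].index(snap) + 1  # raises ValueError if snap is unknown
    pool[abandoned] = {"origin": f"{fs}@{snap}", "snaps": entry["snaps"][keep:]}
    entry["snaps"] = entry["snaps"][:keep]
    return abandoned


def can_destroy_snapshot(pool, fs, snap):
    """A snapshot cannot be destroyed while some branch still originates from it."""
    return all(e["origin"] != f"{fs}@{snap}" for e in pool.values())


pool = {"tank/foo": {"origin": None, "snaps": ["a", "b", "c"]}}
branch = rollback_by_clone_swap(pool, "tank/foo", "b")
print(pool["tank/foo"]["snaps"], pool[branch])
print(can_destroy_snapshot(pool, "tank/foo", "b"))
```

In this model tank/foo@b stays pinned until tank/foo-abandoned is removed, which is the dependency the text says should be exposed rather than hidden.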
> 3. Change "cp" to use COW by taking a snapshot of the source file
> and creating the target file as a clone (yes, this means snapshotting and
> cloning individual files; see below), and perform the copies-on-write
> at the block level in the same way that ZFS currently performs COW at
> the block level when a file is modified. The sharing of blocks among
> files would be entirely transparent at the user level in the same way
> that the sharing of files among cloned filesystems is entirely
> transparent.

This would be nice, but the implementation is non-obvious.  Did you have
any ideas in mind?  We certainly *could* make each file be its own
filesystem, as you suggested.  However, as Eric mentioned, that really
doesn't scale very well, especially with respect to administration.  Not
to mention, creating recursive snapshots in that world would be really
time-consuming since you could have millions of nested filesystems to
traverse through.

Keep in mind that ZFS snapshots are not implemented by simply
reference-counting every block (which would be slow).  See my blog entry
for more details:

http://blogs.sun.com/roller/page/ahrens?entry=is_it_magic

We've tossed around a couple ideas in the past few years for allowing
more flexible references to blocks.  It would be really cool to have
references to blocks from arbitrary places -- we could even coalesce
references to identical blocks that weren't created with 'cp'.
Unfortunately, it's hard to come up with something that performs well
and works in all cases.  But we're certainly open to suggestions (or
code!).

It might be possible to come up with a method that implemented snapshots
of files in a similar way to snapshots of filesystems, but with a
separate mechanism (and different administration).  But I don't see that
being worth the effort just to make 'cp' go a bit faster and use less
space.

Thanks for thinking about how to make snapshots better.  Bug 6343653,
"want to quickly 'copy' a file from a snapshot", may provide some more
food for thought.

--matt

Andrew

2005-Dec-15 16:38 UTC

[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces

Matthew Ahrens wrote:
> It would be possible to create all the snapshots atomically (e.g. by
> creating them all in the same txg).

Excellent. I was hoping this was the case.

> The idea of making recursive snapshots be the default is interesting,
> but it brings up a lot more issues.  For example, would a snapshot of a
> filesystem *imply* that its descendants are snapshotted as well?

Yes.

> Then descendant filesystems wouldn't have the snapshot explicitly
> listed (e.g. by 'zfs list').

True.

> In that case, would non-recursive snapshots be allowed at all?  If they
> are, how do you distinguish between recursive and non-recursive
> snapshots?

Well, if you tar a directory, it operates recursively. If you try to cp a
directory, Solaris makes you use cp -r, and cp -r operates recursively.
Similarly for rm. And mv operates recursively.

> I think it would be exceedingly difficult to "hide" the recursive
> snapshots in the face of 'zfs rename'.  For example, what happens if I
> have:
>
> tank/home
> tank/home@yesterday
> tank/home/ahrens (which is implicitly snapshotted by
> tank/home@yesterday)
> tank/mail
>
> And I run 'zfs rename tank/mail tank/home/mail'?

It should work as follows:

First, .zfs/snapshot directories shouldn't be used; instead,
"@" should be used directly in file and directory names, as
I've already proposed. For example, as Eric wrote:
> Do you mean you
> would like to do:
>
> $ cd /home/eschrock
> $ mv .vimrc@yesterday .vimrc
>
> Instead of:
>
> $ cd /home/eschrock
> $ cp .zfs/snapshot/yesterday .vimrc

(My answer is "yes".) That happened to be in the context of the issue
of snapshotting individual files, but that issue is irrelevant here; regardless
of whether the individual file was snapshotted using "zfs snapshot
home/eschrock/.vimrc@yesterday" or (if individual file snapshots
aren't supported) an entire filesystem was snapshotted using "zfs
snapshot home/eschrock@yesterday" or "zfs snapshot home@yesterday",
the .vimrc file within the snapshot should be accessible as
/home/eschrock/.vimrc@yesterday rather than as
/home/eschrock/.zfs/snapshot/yesterday/.vimrc or
/home/.zfs/snapshot/yesterday/eschrock/.vimrc.

Now, to address the issue of renaming:

Standard path names (e.g. /foo/bar) should work as usual, with a path name p
identifying some directory or some file, the contents of which can change over
time, and p can exist or not exist at various points in time and might even be a
directory at one point in time and a file sometime later. But p@versionname
always identifies the same data which p identified at the time at which the
snapshot named "versionname" was taken. So perhaps /foo/bar is
currently a file containing some data, /foo/bar@third is a file containing
some different data, /foo/bar@second doesn't exist (because at the
time that the snapshot foo@second was taken, /foo/bar didn't exist),
and /foo/bar@first is a directory, and /foo/bar/baz@first is a file but
/foo/bar/baz@third doesn't exist (obviously, because /foo/bar@third
is a file, not a directory).

Now (starting a new example) suppose I have filesystems foo, foo/bar, and
foo/biz, and files /foo/bar/baz and /foo/biz/baz exist at the time that I take
the snapshot foo/bar@first. Then /foo/bar/baz@first will exist, but
/foo/biz/baz@first will not (because foo/biz is not a subfilesystem of
foo/bar, and I only took a snapshot of foo/bar). But if I instead take the
snapshot foo@first, then both /foo/bar/baz@first and /foo/biz/baz@first
will exist (because both foo/bar and foo/biz are subfilesystems of foo, and I
took a snapshot of foo). If I then move (i.e. use zfs rename) foo/bar to
foo/buz, then now /foo/bar/baz will no longer exist but /foo/buz/baz will exist,
yet /foo/bar/baz@first will continue to exist, but /foo/buz/baz@first
still won't exist (because /foo/buz/baz didn't exist when the
snapshot foo@first was taken).

After taking foo@first and then moving foo/bar to foo/buz, if I then create a
new filesystem foo/bar, and make a new file /foo/bar/baz in it, and take the
snapshot foo@second, then both /foo/bar/baz@first and /foo/bar/baz@second
will exist, yet they will be from separate subfilesystems. This
entanglement isn't a problem, because if I do a rollback to foo@first,
everything under foo will get blown away, including all of the
subfilesystems and their snapshots, and foo will be exactly as it was when
foo@first was taken (including the subfilesystems which existed under it at that
time, and their snapshots). But if, before doing the rollback to foo@first, I
move (i.e. zfs rename) foo/biz to biz, then biz will survive when I rollback foo
to foo@first, yet even though /biz/baz will now exist, /biz/baz@first
won't exist, because /biz/baz didn't exist when foo@first
was taken.
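
The naming rules in the rename example above can be checked with a small Python model. This is a sketch of the proposal only, not of how ZFS stores anything: the `Pool` class, its dictionaries, and the method names are hypothetical, and filesystems are represented simply by their mount paths.

```python
class Pool:
    """Toy model: live paths plus named snapshots that freeze path names."""

    def __init__(self, live):
        self.live = dict(live)  # path -> contents of what exists right now
        self.snapshots = {}     # version name -> {path: contents} at snapshot time

    def snapshot(self, fs_root, version):
        # Freeze every path under fs_root, nested filesystems included.
        prefix = fs_root.rstrip("/") + "/"
        self.snapshots[version] = {
            p: c for p, c in self.live.items() if p.startswith(prefix)
        }

    def rename(self, old, new):
        # Only live paths move; frozen snapshot paths keep their old names.
        old_prefix = old.rstrip("/") + "/"
        new_prefix = new.rstrip("/") + "/"
        moved = {}
        for path, contents in self.live.items():
            if path.startswith(old_prefix):
                path = new_prefix + path[len(old_prefix):]
            moved[path] = contents
        self.live = moved

    def exists(self, name):
        # "path@version" consults the snapshot; a bare path consults live data.
        path, sep, version = name.partition("@")
        if not sep:
            return path in self.live
        return path in self.snapshots.get(version, {})


pool = Pool({"/foo/bar/baz": "data1", "/foo/biz/baz": "data2"})
pool.snapshot("/foo", "first")
pool.rename("/foo/bar", "/foo/buz")
for name in ["/foo/bar/baz", "/foo/buz/baz",
             "/foo/bar/baz@first", "/foo/buz/baz@first"]:
    print(name, pool.exists(name))
```

The key design choice is that a rename touches only the live namespace, so a name of the form path@version keeps resolving to whatever was at that path when the snapshot was taken.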

One minor attribute of this system is that snapshot version names (the part of
the snapshot name following the "@") must now be globally unique (per
pool), rather than just unique per subfilesystem, but this isn't a
problem. The version name is thus simply just an alias for the txg
counter's value at the time at which the snapshot was taken (with all
the subfilesystems snapshotted in the same txg, as you mentioned). Note also
that if I have filesystems foo and foo/bar, and I take snapshot foo@first,
then I can rollback to foo@first, which will blow away both foo and foo/bar
and replace them with the versions which existed at the time that foo@first
was taken, but also I could instead rollback to just foo/bar@first, which
will blow away just foo/bar and replace it with what existed when foo@first
was taken but will leave foo itself unaffected. And if instead of foo@first,
I'd taken just the snapshot foo/bar@first, then I couldn't
rollback to foo@first, because foo@first wouldn't exist.

Note that in order to avoid namespace conflicts, the system must refuse to take
the snapshot foo@first if there's any directory or file under foo
(or under any subfilesystem of foo) which happens to have a name ending with
"@first"; in this case, the user must choose some other snapshot
version name besides "first". Similarly, if there was anything under
foo named "x" in existence at the time that foo@first was taken,
then so long as foo@first exists, there also exists x@first; therefore,
regardless of whether x continues to exist, the system must refuse to create any
new thing named "x@first", because x@first already exists.
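
The two refusal rules above can be written down as a pair of checks. This is a minimal Python sketch under assumptions: the function names `can_take_snapshot` and `can_create`, and the representation of snapshots as a mapping from version name to the set of path names frozen in it, are hypothetical.

```python
def can_take_snapshot(live_names, snapshots, version):
    """Return True if a snapshot called `version` would create no name clash."""
    if version in snapshots:
        return False  # version names are unique per pool in this proposal
    suffix = "@" + version
    # Refuse if any live file or directory name already ends with "@<version>".
    return not any(name.endswith(suffix) for name in live_names)


def can_create(name, snapshots):
    """Return True if a new live file or directory may be called `name`."""
    base, sep, version = name.rpartition("@")
    if not sep:
        return True  # no "@" in the name, so no snapshot alias can collide
    # Refuse if base@version already names something inside an existing snapshot.
    return base not in snapshots.get(version, set())


snapshots = {"first": {"/foo/x"}}
live_names = {"/foo/x", "/foo/notes@second"}
print(can_take_snapshot(live_names, snapshots, "second"))
print(can_take_snapshot(live_names, snapshots, "third"))
print(can_create("/foo/x@first", snapshots))
print(can_create("/foo/y@first", snapshots))
```

Both checks only compare names; neither looks at file contents.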

So finally, in your example, you would now have tank/home/mail, but no
tank/home/mail@yesterday, and you'd no longer have tank/mail, and of
course you'd have no tank/mail@yesterday.

However, if, prior to running "zfs rename tank/mail tank/home/mail",
you'd run "zfs snapshot tank@yesterday" instead of
"zfs snapshot tank/home@yesterday", you would now have
tank/home/mail, but no tank/home/mail@yesterday, and you'd no longer
have tank/mail, but you'd still have tank/mail@yesterday.

This also provides the most intuitive solution if you, as an administrator,
moved tank/mail to tank/home/mail without telling your users, and they come to
work the next morning and say "Dude! Where's the mail? It was in
tank/mail yesterday." And so they can access yesterday's mail as
tank/mail@yesterday. (The question of where you put today's mail is
another matter; you have to bother to tell them to look in tank/home/mail.)

Eric has objected that what I'm proposing violates POSIX, but I still
fail to understand why; I'm unable to think of any example of a
POSIX-compliant script which would operate correctly on ZFS as ZFS currently
exists but would operate incorrectly on ZFS as modified by my proposed changes.

Please note that my suggestions for individual-file snapshottability and for
making directories be equivalent to filesystems are entirely independent from my
suggestions for recursive atomic snapshots and for changing the snapshot naming
convention, and the latter would still be useful even if the former are not
practical to implement.

> > 2. Do not require that rolling back a filesystem to a snapshot destroy
> > all intermediate snapshots. Dropping the requirement allows
> > filesystems to be arbitrarily rolled back without requiring
> > destruction of clones which are dependent on intermediate snapshots.
>
> That might be possible, by creating a clone of the snapshot you want to
> rollback to, and using that as the filesystem.  However, the snapshot
> couldn't be deleted until the subsequent snapshots (from the abandoned
> branch) are deleted.  Rather than hide this 'magic' and invent a new
> rule to explain why the snapshot can't be deleted, I think we should
> just expose what's actually going on -- you've cloned the snapshot.  For
> example, you would have:

[diagrams snipped] (incidentally, your diagrams as shown in the discussion forum
on opensolaris.org are mangled to the point of incomprehensibility due to the
annoying fact that web browsers don't render the whitespace
that's present in html code)

> So I think you can accomplish what you want today, with a few steps and
> a little ugliness in the naming.  I'd consider adding a
> 'zfs rollback -c <snap>' to automate the procedure.

I'd suggest hiding the magic, and just displaying the dependency tree upon
request; after all, if the user wants to roll back a filesystem, and there are
clones dependent on intermediate snapshots, in the ordinary case he's
not going to change his mind and say "well, in that case I suppose I
don't really want to roll back after all"; ordinarily
he's rolling back because he wants to restore the filesystem to a
previous state, and the fact that there are intermediate states stored on the
disk (and consuming disk space) is irrelevant. So requiring a "-c"
switch would be pointless. Though ZFS, upon doing the requested rollback, could
output "oh by the way, there are intermediate snapshots, and if you want to
get rid of them then you'll have to do it manually".

In fact, I'd suggest that the intermediate snapshots be kept even if
there are no dependencies, and I'd even suggest that prior to doing the
rollback, ZFS should automatically take a snapshot (and name it something like
"thefilesystem@point_at_which_rollback_was_done-autogenerated"),
thus maintaining a redo log to help the user avoid shooting himself in the foot.

> > 3. Change "cp" to use COW by taking a snapshot of the source file and
> > creating the target file as a clone
[snip]
> This would be nice, but the implementation is non-obvious.  Did you have
> any ideas in mind?

I'm still thinking about this and about the stuff from the remainder of
your message.
This message posted from opensolaris.org
