Andrew
2005-Dec-14 01:21 UTC
[zfs-discuss] Simplifying ZFS via consistent use of COW and snapshot namespaces
I propose three changes to ZFS, all of which are individually beneficial, and which together move some features of ZFS from features which must be explicitly invoked by the user/administrator to features which are automatically and transparently invoked by the system. The changes are:

1. Make snapshots of nested filesystems do nested transactional snapshots (and rollbacks do nested rollbacks), the lack of which I objected to in my message "Counterintuitive snapshotting" in this forum (at http://www.opensolaris.org/jive/thread.jspa?threadID=4244&tstart=0).

2. Do not require that rolling back a filesystem to a snapshot destroy all intermediate snapshots. Dropping the requirement allows filesystems to be arbitrarily rolled back without requiring destruction of clones which are dependent on intermediate snapshots.

3. Change "cp" to use COW by taking a snapshot of the source file and creating the target file as a clone (yes, this means snapshotting and cloning individual files; see below), and perform the copies-on-write at the block level in the same way that ZFS currently performs COW at the block level when a file is modified. The sharing of blocks among files would be entirely transparent at the user level in the same way that the sharing of files among cloned filesystems is entirely transparent.

With these changes, "zfs create" and "zfs clone" can be eliminated, and replaced with "mkdir" and "cp -r", respectively; "mkdir /foo/bar", if foo is a ZFS filesystem, will create a ZFS filesystem named "bar", and "cp -r foo bar", if foo is a ZFS filesystem, will create the snapshot foo@time_when_bar_was_made-autogenerated and make a clone named "bar" (snapshots ending with "-autogenerated" could be excluded by default from snapshot listings, with an option to show them). Both of these are entirely transparent if the changes mentioned above are made. Thus, filesystems and directories are the same thing, and also there's no difference between making a copy of a directory and making a clone of it.

Next, suppose you have filesystems /foo and /foo/bar, and the file /foo/bar/baz. If you do "zfs snapshot foo@first", you can then do "zfs destroy foo@first", but you can't do "ls foo@first"; you have to do "ls /foo/.zfs/snapshot/first". But you can't do "rm -r /foo/.zfs/snapshot/first"; you have to do "zfs destroy foo@first". Similarly for "mv" vs. "zfs rename". Why the inconsistency? Why not drop the "/.zfs/snapshot/" nonsense as the separator between the filesystem name and version name halves of the snapshot name, and simply always use the "@" separator and allow the simple snapshot names to be passed directly to conventional tools like rm and mv?

Also, if I can do "zfs snapshot foo@first" and "zfs snapshot foo/bar@first", why can't I do "zfs snapshot foo/bar/baz@first"? It makes no sense to allow me to take snapshots only of individual directories (see above for equivalence of directories and filesystems) but not of individual files. If I can do "zfs snapshot project/code.c@tuesday-added_some_more_code" and "zfs snapshot project/code.c@wednesday-fixed_some_bugs" and "cat project/code.c@monday-known_to_compile", and "ls -l --show_snapshots project/code.c" to see the creation times of all the snapshots along with the modification time of the file itself, then I can get rid of CVS and just use ZFS as my version control system.

Filenames and directory names are names of variables; the data which they identify changes over time. Snapshot names are names of constants; they identify particular versions of files and directories. "variablename@timestamp" is an appropriate way to identify the value which a variable had at some particular time; "variablename/.zfs/snapshot/timestamp" is not. Not only is the latter more complicated, but it would be a very strange expression if the variable happened to be a file rather than a directory.

This message posted from opensolaris.org
Eric Schrock
2005-Dec-14 01:59 UTC
[zfs-discuss] Simplifying ZFS via consistent use of COW and snapshot namespaces
On Tue, Dec 13, 2005 at 05:21:38PM -0800, Andrew wrote:
>
> 1. Make snapshots of nested filesystems do nested transactional snapshots (and rollbacks do nested rollbacks), the lack of which I objected to in my message "Counterintuitive snapshotting" in this forum (at http://www.opensolaris.org/jive/thread.jspa?threadID=4244&tstart=0).

I think a reasonable RFE would be to have something like:

    # zfs snapshot -r today tank/home

Which would take a snapshot with the name 'today' for all filesystems under tank/home. This is slightly different from what you're probably asking for, because you'll end up with N snapshots, where N is the number of filesystems. There is really no way around this, because you cannot have a snapshot span multiple datasets; it's just fundamentally not possible.

Feel free to file an RFE through OpenSolaris for the above behavior; it seems entirely reasonable to me.

> 2. Do not require that rolling back a filesystem to a snapshot destroy all intermediate snapshots. Dropping the requirement allows filesystems to be arbitrarily rolled back without requiring destruction of clones which are dependent on intermediate snapshots.
>
> 3. Change "cp" to use COW by taking a snapshot of the source file and creating the target file as a clone (yes, this means snapshotting and cloning individual files; see below), and perform the copies-on-write at the block level in the same way that ZFS currently performs COW at the block level when a file is modified. The sharing of blocks among files would be entirely transparent at the user level in the same way that the sharing of files among cloned filesystems is entirely transparent.

The above aren't actually possible given the way ZFS implements snapshots. ZFS does not implement snapshots/clones via a simple reference counting mechanism. The reasons not to do this are many, not the least of which is that taking a snapshot is no longer constant time, and now overwrites live data. See Matt's blog for more information on how it's accomplished:

http://blogs.sun.com/roller/page/ahrens?entry=is_it_magic

The end result is that you can't simply move and COW blocks around as you see fit. In particular, you cannot COW blocks within a certain file without going the full mile and creating a clone of the whole filesystem. Therefore snapshots (or clones) of individual files and directories are not possible.

> With these changes, "zfs create" and "zfs clone" can be eliminated, and replaced with "mkdir" and "cp -r", respectively; "mkdir /foo/bar", if foo is a ZFS filesystem, will create a ZFS filesystem named "bar", and "cp -r foo bar", if foo is a ZFS filesystem, will create the snapshot foo@time_when_bar_was_made-autogenerated and make a clone named "bar" (snapshots ending with "-autogenerated" could be excluded by default from snapshot listings, with an option to show them). Both of these are entirely transparent if the changes mentioned above are made. Thus, filesystems and directories are the same thing, and also there's no difference between making a copy of a directory and making a clone of it.

Besides the inherent fact that you cannot clone an individual file or directory, if we imagine this scheme, a simple home directory server would end up with hundreds of millions of snapshots and clones.

> Next, suppose you have filesystems /foo and /foo/bar, and the file /foo/bar/baz. If you do "zfs snapshot foo@first", you can then do "zfs destroy foo@first", but you can't do "ls foo@first"; you have to do "ls /foo/.zfs/snapshot/first". But you can't do "rm -r /foo/.zfs/snapshot/first"; you have to do "zfs destroy foo@first". Similarly for "mv" vs. "zfs rename". Why the inconsistency? Why not drop the "/.zfs/snapshot/" nonsense as the separator between the filesystem name and version name halves of the snapshot name, and simply always use the "@" separator and allow the simple snapshot names to be passed directly to conventional tools like rm and mv?

We have talked about exposing some level of administrative control of this sort through .zfs, and have even prototyped it. I'm still not sure what you're suggesting by "dropping the nonsense". Do you mean you would like to do:

    $ cd /home/eschrock
    $ mv .vimrc@yesterday .vimrc

Instead of:

    $ cd /home/eschrock
    $ cp .zfs/snapshot/yesterday .vimrc

We can't really hijack "@*" from every file in the filesystem. That's simply not acceptable for POSIX or any reasonable expectation of a filesystem.

> Also, if I can do "zfs snapshot foo@first" and "zfs snapshot foo/bar@first", why can't I do "zfs snapshot foo/bar/baz@first"? It makes no sense to allow me to take snapshots only of individual directories (see above for equivalence of directories and filesystems) but not of individual files. If I can do "zfs snapshot project/code.c@tuesday-added_some_more_code" and "zfs snapshot project/code.c@wednesday-fixed_some_bugs" and "cat project/code.c@monday-known_to_compile", and "ls -l --show_snapshots project/code.c" to see the creation times of all the snapshots along with the modification time of the file itself, then I can get rid of CVS and just use ZFS as my version control system.

See the discussion of how snapshots work, above.

> Filenames and directory names are names of variables; the data which they identify changes over time. Snapshot names are names of constants; they identify particular versions of files and directories. "variablename@timestamp" is an appropriate way to identify the value which a variable had at some particular time; "variablename/.zfs/snapshot/timestamp" is not. Not only is the latter more complicated, but it would be a very strange expression if the variable happened to be a file rather than a directory.

I'm not sure what this is supposed to mean, but I hope the above responses are some indication of what's possible and what isn't.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
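The "N snapshots, where N is the number of filesystems" point can be sketched in a few lines of Python. This is only a naming model, assuming dataset names are plain slash-separated strings; the function `recursive_snapshot_names` and the dataset list are hypothetical, and nothing here talks to ZFS.

```python
def recursive_snapshot_names(datasets, root, snapname):
    """Return one snapshot name per dataset at or below root."""
    prefix = root + "/"
    # Match the root itself or true descendants; "tank/homework" must not
    # be picked up when the root is "tank/home".
    selected = [d for d in datasets if d == root or d.startswith(prefix)]
    return [f"{d}@{snapname}" for d in sorted(selected)]


# Hypothetical pool layout, for illustration only.
datasets = [
    "tank",
    "tank/home",
    "tank/home/ahrens",
    "tank/home/eschrock",
    "tank/homework",
    "tank/mail",
]
names = recursive_snapshot_names(datasets, "tank/home", "today")
print(names)
```

The result is three separate snapshot names rather than a single snapshot spanning three datasets, which is the distinction drawn in the reply above.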
Richard Elling
2005-Dec-14 05:37 UTC
[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces
> 3. Change "cp" to use COW by taking a snapshot of the source file and creating the target file as a clone (yes, this means snapshotting and cloning individual files; see below), and perform the copies-on-write at the block level in the same way that ZFS currently performs COW at the block level when a file is modified. The sharing of blocks among files would be entirely transparent at the user level in the same way that the sharing of files among cloned filesystems is entirely transparent.

<geezer_mode>
Way back when disks were small, and files were smaller, VMS did this sort of versioning. Not "cp" per se, since cp isn't normally destructive. But if you modified a file, a new version would be created. Very cool, at first glance. IIRC, it had the same sort of semantics as you describe. Modifications created a version, and versions had to be explicitly removed. Raises hell with quotas (I've got the scar).

But when you look at the actual implications of this, it quickly becomes a costly feature and, sometimes, a system management nightmare. For example, I get about 200 emails each day. Suppose the system created 200 versions of my inbox each day... very uncool.

OK, someone will probably point out that VMS still exists today, but I stopped using it at version 2.21, ha!
</geezer_mode>

There are a large number of applications which modify or append files. IMHO it is not in our best interest if the file system implements versioning policies for each and every file. It is much more manageable for a system administrator to say "friday at happy hour + 20 minutes we snapshot for the week."

And what about databases which use files for backing store? COW run amuck causes some concern for such environments.

[ok, since I do work for Sun, we'd *love* to sell you a few petabytes of storage each month... :-) but that is really a non-starter]

-- richard
Tao Chen
2005-Dec-14 06:12 UTC
[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces
On 12/13/05, Richard Elling <Richard.Elling at sun.com> wrote:
>
> > 3. Change "cp" to use COW by taking a snapshot of the
> > source file and creating the target file as a clone

[...]

> There are a large number of applications which modify or append files. IMHO it is not in our best interest if the file system implements versioning policies for each and every file. It is much more manageable for a system administrator to say "friday at happy hour + 20 minutes we snapshot for the week."
>
> And what about databases which use files for backing store? COW run amuck causes some concern for such environments.
>
> [ok, since I do work for Sun, we'd *love* to sell you a few petabytes of storage each month... :-) but that is really a non-starter]

Item 3, by itself, is not about file versioning, if I understand correctly. It sounds similar to memory COW after fork: 'cp' the parent process to a child process, and memory pages are not copied until modified. In fact, when I heard ZFS is a "COW filesystem", I thought it did exactly that. Or does it? It could save users storage space if files are duplicated and then slightly modified.

Or are you commenting on the rest of the proposal?

Yes, I can see that quotas will become very interesting and that I/O may become somewhat unpredictable: a write to the 'parent' file can generate I/Os to all the 'child' copies and their children. Most definitely you guys have thought about that in the early days :-)

Tao
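The fork-style analogy can be made concrete with a toy model. The Python sketch below is an assumption-laden illustration: the classes `BlockStore` and `ToyFile` are hypothetical, and it uses plain per-block reference counts, which earlier replies in this thread say is not how ZFS implements snapshots or clones. It shows only the space-saving behaviour being described: a copy shares every block until one side writes.

```python
class BlockStore:
    """Toy block store with reference counts (not how ZFS tracks blocks)."""

    def __init__(self):
        self.blocks = {}   # block id -> bytes
        self.refs = {}     # block id -> number of files referencing it
        self.next_id = 0

    def alloc(self, data):
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        self.refs[bid] = 1
        return bid

    def share(self, bid):
        self.refs[bid] += 1
        return bid

    def release(self, bid):
        self.refs[bid] -= 1
        if self.refs[bid] == 0:
            del self.refs[bid]
            del self.blocks[bid]


class ToyFile:
    """A file is just an ordered list of block ids in a shared store."""

    def __init__(self, store, chunks=()):
        self.store = store
        self.block_ids = [store.alloc(c) for c in chunks]

    def cow_copy(self):
        # Copying shares every block; no data is duplicated yet.
        copy = ToyFile(self.store)
        copy.block_ids = [self.store.share(b) for b in self.block_ids]
        return copy

    def write_block(self, index, data):
        # A write never touches the shared block: allocate a new one,
        # then drop this file's reference to the old one.
        old = self.block_ids[index]
        self.block_ids[index] = self.store.alloc(data)
        self.store.release(old)

    def read(self):
        return b"".join(self.store.blocks[b] for b in self.block_ids)


store = BlockStore()
original = ToyFile(store, [b"aaaa", b"bbbb", b"cccc"])
duplicate = original.cow_copy()
blocks_after_copy = len(store.blocks)    # the copy allocated nothing new
duplicate.write_block(1, b"XXXX")
blocks_after_write = len(store.blocks)   # only the modified block was added
```

In this model the copy costs no extra blocks and the single write adds exactly one, while the original's contents stay unchanged. The bookkeeping cost of maintaining such counts on every block is the kind of overhead the replies in this thread cite as a reason ZFS avoids this design.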
Andrew
2005-Dec-14 15:15 UTC
[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces
Eric Schrock wrote:
> I think a reasonable RFE would be to have something like:
>
>     # zfs snapshot -r today tank/home
>
> Which would take a snapshot with the name 'today' for all filesystems under tank/home.

Could that be done atomically?

BTW, the recursiveness should be enabled by default. Considering the schizophrenic insanity of standard cp defaulting to non-recursive copy, mv defaulting to recursive move, tar defaulting to recursive copy, and chmod being non-recursive by default (not to mention the switch insanity with "cp -r" meaning "enable recursiveness and make a copy" but "chmod -r" meaning "disable readability, and do it non-recursively"), and that snapshots already default to recursiveness through subdirectories of a filesystem, it would make more sense for snapshots to default to recursiveness through subfilesystems too. Otherwise you'll have administrators accidentally excluding some subparts of their filesystems while taking snapshots because they forgot that they happened to create those subparts as subfilesystems rather than as subdirectories. Not only that, but the mistake would be harder to catch: with cp, a cursory glance at the result will reveal whether the copy was accidentally made non-recursively, but a cursory glance at a snapshot will show recursive directories and thus provide a false sense of confidence that recursiveness was enabled, when actually the subfilesystems are missing from the snapshot but not noticed because they're deeply buried in the name hierarchy.

> This is slightly different from what you're probably asking for, because you'll end up with N snapshots, where N is the number of filesystems. There is really no way around this, because you cannot have a snapshot span multiple datasets, it's just fundamentally not possible.

If multiple datasets can be snapshotted atomically, then what I'm describing can be implemented; the fact that there would be N underlying snapshots would be irrelevant to the user, since those snapshots could be integrated into one namespace. E.g. with filesystems foo and foo/bar, and file foo/bar/baz, the snapshot foo@first could be taken atomically recursively, and then using your naming convention baz could be accessed as foo/.zfs/snapshot/first/bar/baz instead of (or as well as) foo/bar/.zfs/snapshot/first/baz. (Using the naming convention I'm recommending, it would be foo@first/bar/baz, but that's irrelevant to the important issue here of whether the recursive snapshots can be done atomically.)

> Feel free to file an RFE through OpenSolaris for the above behavior; it seems entirely reasonable to me.

If recursive snapshots can't be done atomically, then the RFE would have to be "please fundamentally rearchitect ZFS"... :)

> The above aren't actually possible given the way ZFS implements snapshots.

[snip]

> Therefore snapshots (or clones) of individual files and directories are not possible.

I'm still working on a full response to this, but in the meantime, what disadvantage currently would there be in aliasing mkdir to zfs create? (Except on non-ZFS filesystems, of course.) I.e. in what circumstances would it be necessary for a particular directory in a ZFS filesystem to be just a regular directory rather than a ZFS filesystem? Simply making all directories be filesystems would allow all directories to be individually snapshottable and cloneable. As for individual files, have ZFS simply create a filesystem foo whenever a user process requests creation of a file foo, and automatically create the file foo/thefile, hide foo/thefile from user processes, report foo as a file instead of as a filesystem, and automatically direct read/write requests on foo to foo/thefile. Then the user process can request a snapshot or clone of foo, thinking that foo is a file, and ZFS can actually create a snapshot of the filesystem foo, which would include foo/thefile.

> Besides the inherent fact that you cannot clone an individual file or directory,

Hopefully addressed by my comments above.

> if we imagine this scheme, a simple home directory server would end up with hundreds of millions of snapshots and clones.

Why would this be a problem? (Of course the administrative tools could by default exclude the auto-generated snapshots and clones from listings, and include them only upon explicit request.)

> I'm still not sure what you're suggesting by "dropping the nonsense". Do you mean you would like to do:
>
>     $ cd /home/eschrock
>     $ mv .vimrc@yesterday .vimrc
>
> Instead of:
>
>     $ cd /home/eschrock
>     $ cp .zfs/snapshot/yesterday .vimrc

Yes.

> We can't really hijack "@*" from every file in the filesystem.

But you can hijack .zfs from every directory (er, from every directory which happens to be a filesystem) in the filesystem?

> That's simply not acceptable for POSIX

Why not? Consider "cp -r foo bar;chmod -R -w bar". Is that acceptable for POSIX? Yes. Now consider "cp -r foo bar;chmod -R -w bar;mv bar foo/.zfs/snapshot/bar". Is that acceptable for POSIX? Yes. But isn't this the same as "zfs snapshot foo@bar" (assuming that either foo has only subdirectories but no subfilesystems, or ZFS by default does recursive snapshots as mentioned above), besides the fact that the latter does it atomically? Thus what ZFS currently does is acceptable for POSIX. Now consider "cp -r foo bar;chmod -R -w bar;mv bar foo@bar". Is that acceptable for POSIX? Yes. So why not make "zfs snapshot foo@bar" do this?

This use of "@" in names doesn't conflict with other arbitrary uses of the symbol. ZFS would know which names containing "@" are names of snapshots, and which aren't, and which parts of which snapshot names are the dataset names and which parts are the version names, because the fact that a name contains the "@" symbol doesn't imply that the name identifies a snapshot, only vice versa. Thus snapshotting the filesystem mail/messages-from-eric@sun to produce the snapshot mail/messages-from-eric@sun@tuesday wouldn't cause any problem, and I could do "vi mail/messages-from-eric@sun/explanations-of-zfs" and "cat mail/messages-from-eric@sun@monday/explanations-of-zfs". I don't see how this would be unacceptable for POSIX.

I'm also suggesting the ability to do "zfs snapshot mail@wednesday", and then mail/messages-from-eric@sun@tuesday/explanations-of-zfs would be the same as mail@wednesday/messages-from-eric@sun@tuesday/explanations-of-zfs, and that snapshots would include subfilesystems and their snapshots (and rollbacks would roll back subfilesystems and their snapshots), but this is dependent on features which you're saying are not practical to implement.

> > Filenames and directory names are names of variables; the data which they identify changes over time. Snapshot names are names of constants; they identify particular versions of files and directories. "variablename@timestamp" is an appropriate way to identify the value which a variable had at some particular time; "variablename/.zfs/snapshot/timestamp" is not. Not only is the latter more complicated, but it would be a very strange expression if the variable happened to be a file rather than a directory.
>
> I'm not sure what this is supposed to mean

Regarding snapshotting individual files, I was referring to e.g. the file "code.c" and the strangeness of using "code.c/.zfs/snapshot/monday" instead of "code.c@monday" for a snapshot of the file.

> but I hope the above responses are some indication of what's possible and what isn't.

They help. I'm still studying Matt's blog entry and the source code to try to figure out whether recursive atomic snapshots are possible. It appears that the answer is yes. Yes?
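The claim that names containing "@" can be told apart from snapshot names only by consulting what snapshots exist can be sketched as a lookup. The Python below is a hypothetical illustration, not a ZFS interface: `split_snapshot` and the `snapshots` table are invented for the example, and the table stands in for whatever record the filesystem would keep.

```python
def split_snapshot(name, snapshots):
    """Split name into (dataset, version) if it names a known snapshot.

    snapshots maps a dataset name to the set of its version names.
    Returns None when name is an ordinary name that merely contains '@'.
    """
    # Try each '@' as the separator, rightmost first.
    pos = name.rfind("@")
    while pos != -1:
        dataset, version = name[:pos], name[pos + 1:]
        if version in snapshots.get(dataset, set()):
            return dataset, version
        pos = name.rfind("@", 0, pos)
    return None


# Hypothetical snapshot table for the example in the message above.
snapshots = {"mail/messages-from-eric@sun": {"monday", "tuesday"}}

plain = split_snapshot("mail/messages-from-eric@sun", snapshots)
snap = split_snapshot("mail/messages-from-eric@sun@tuesday", snapshots)
```

Here the first name resolves to an ordinary dataset and the second to its "tuesday" snapshot, so syntax alone never decides the question. The sketch also makes one cost visible: once a snapshot named "tuesday" exists, no ordinary file or dataset can use the name "messages-from-eric@sun@tuesday" in that directory, which is the kind of namespace reservation debated elsewhere in this thread.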
Andrew
2005-Dec-14 16:00 UTC
[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces
Minor clarifications:

I wrote:
> Now consider "cp -r foo bar;chmod -R -w bar;mv bar foo/.zfs/snapshot/bar". Is that acceptable for POSIX? Yes. But isn't this the same as "zfs snapshot foo@bar" (assuming that either foo has only subdirectories but no subfilesystems, or ZFS by default does recursive snapshots as mentioned above), besides the fact that the latter does it atomically?

Of course I understand that it's very different from ZFS's perspective; my point is the equivalence from the user's perspective.

> I'm also suggesting the ability do "zfs snapshot mail@wednesday", and then mail/messages-from-eric@sun@tuesday/explanations-of-zfs would be the same as mail@wednesday/messages-from-eric@sun@tuesday/explanations-of-zfs, and that snapshots would include subfilesystems and their snapshots (and rollbacks would roll back subfilesystems and their snapshots)

Oops. I meant "and rollbacks would roll back subfilesystems and remove those filesystems' more recent snapshots from the filesystems' namespaces, but not necessarily destroy those snapshots, since there might be clones dependent on them. Alternatively, the clones could be fully copied and then the snapshots destroyed." My point is that any particular filesystem should be exactly the same immediately after a snapshot and immediately after later rolling back to that snapshot, including the states of all subfilesystems and the snapshots thereof.
Eric Schrock
2005-Dec-14 16:18 UTC
[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces
On Wed, Dec 14, 2005 at 07:15:19AM -0800, Andrew wrote:
>
> I'm still working on a full response to this, but in the meantime, what disadvantage currently would there be in aliasing mkdir to zfs create? (Except on non-ZFS filesystems, of course.) I.e. in what circumstances would it be necessary for a particular directory in a ZFS filesystem to be just a regular directory rather than a ZFS filesystem? Simply making all directories be filesystems would allow all directories to be individually snapshottable and cloneable. As for individual files, have ZFS simply create a filesystem foo whenever a user process requests creation of a file foo, and automatically create the file foo/thefile, hide foo/thefile from user processes, report foo as a file instead of as a filesystem, and automatically direct read/write requests on foo to foo/thefile. Then the user process can request a snapshot or clone of foo, thinking that foo is a file, and ZFS can actually create a snapshot of the filesystem foo, which would include foo/thefile.

You're basically asking us to re-architect the way UNIX works. Besides the insane cost of doing so, our backwards compatibility guarantee prevents us from even trying it.

There are many reasons why we have filesystems that contain files. The first is just that filesystems have a great deal more overhead than directories. Secondly, filesystems provide a precise namespace. In particular, the inode number space is unique per filesystem. If you had each directory as a filesystem, you would no longer have unique inodes per 'filesystem'. Hard links would be impossible, archivers and backup solutions would break all over the place, etc. There are tons more examples. Please understand that this is just not possible in ZFS, period.

> But you can hijack .zfs from every directory (er, from every directory which happens to be a filesystem) in the filesystem?

Yes. POSIX rules allow us to reserve a single name in the root of each filesystem (such as '.zfs' or 'lost+found').

> This use of "@" in names doesn't conflict with other arbitrary uses of the symbol. ZFS would know which names containing "@" are names of snapshots, and which aren't, and which parts of which snapshot names are the dataset names and which parts are the version names, because the fact that a name contains the "@" symbol doesn't imply that the name identifies a snapshot, only vice versa. Thus snapshotting the filesystem mail/messages-from-eric@sun to produce the snapshot mail/messages-from-eric@sun@tuesday wouldn't cause any problem, and I could do "vi mail/messages-from-eric@sun/explanations-of-zfs" and "cat mail/messages-from-eric@sun@monday/explanations-of-zfs". I don't see how this would be unacceptable for POSIX.

We could have this as a property available for ZFS filesystems, but it would definitely be non-POSIX compliant. You can't just have random "hidden" files that get triggered whenever you use an '@' symbol. This is also _incredibly_ tricky under the UNIX filesystem model. Take a look at how .zfs is implemented today to see how difficult it is already. And personally, I would classify it as "impossible".

While we appreciate the suggestions, these have fallen into one of two categories:

a. Impossible to implement, due to fundamental restrictions of ZFS, UNIX, or POSIX.

b. Difficult or impractical to the point of not being worth attempting.

While we could continue to debate this ad infinitum, our team will stand firm on the above conclusions. I'm sorry if this is not the answer you want to hear, but ZFS can't solve every possible problem in the world. If you don't believe me, please take the code and implement it yourself.

- Eric

--
Eric Schrock, Solaris Kernel Development    http://blogs.sun.com/eschrock
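The point about inode numbers and hard links can be observed from user space. The Python sketch below uses only standard-library calls (`os.stat`, `os.link`, `tempfile`); the helper name `file_identity` is hypothetical. It shows the identity that tools such as archivers commonly rely on: two names refer to the same file exactly when they share the same (device, inode) pair, which is only meaningful because inode numbers are unique within one filesystem.

```python
import os
import tempfile


def file_identity(path):
    """Return the (device, inode) pair that identifies a file."""
    st = os.stat(path)
    return st.st_dev, st.st_ino


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original")
    alias = os.path.join(tmp, "alias")
    other = os.path.join(tmp, "other")

    for path in (original, other):
        with open(path, "w") as f:
            f.write("data")

    # A hard link adds a second name for the same inode. It only works
    # when both names live on the same filesystem.
    os.link(original, alias)

    same_file = file_identity(original) == file_identity(alias)
    different_file = file_identity(original) != file_identity(other)
    link_count = os.stat(original).st_nlink
```

If every directory were its own filesystem, the two names in a hard link would usually sit on different devices, so a link like the one above could not be created across directories, and the inode half of the pair would no longer be unique across what users think of as one filesystem.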
andrewee2 andrewee2
2005-Dec-14 18:57 UTC
[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces
> We could have this an property available for ZFS filesystems, but it would definitely be non-POSIX compliant. You can't just have random "hidden" files that get triggered whenever you use an '@' symbol.

I don't understand what you mean by this. I can use "zfs snapshot foo@bar", which will automatically create foo/.zfs/snapshot/bar, or I can use "cp -r foo bar;chmod -R -w bar;mv bar foo/.zfs/snapshot/bar"; the result, though different from ZFS's perspective, would be the same from the perspective of a POSIX-only program. In the same way, I could use "zfs snapshot foo@bar" to automatically create foo@bar, or use "cp -r foo bar;chmod -R -w bar;mv bar foo@bar", and the result would again be the same from the perspective of a POSIX-only program. I don't understand the sense in which you say it's hidden, or what would get triggered by using "@".

> While we appreciate the suggestions, these have fallen into one of two categories:
>
> a. Impossible to implement, due to fundamental restrictions of ZFS, UNIX, or POSIX.
>
> b. Difficult or impractical to the point of not being worth attempting.
>
> While we could continue to debate this ad infinitum, our team will stand firm on the above conclusions. I'm sorry if this is not the answer you want to hear, but ZFS can't solve every possible problem in the world. If you don't believe me, please take the code and implement it yourself.

Fair enough. But since it's not (yet) clear to me whether recursive snapshots can be done atomically, if you happen to know the answer off the top of your head, can you give the answer? If the answer is no, then it looks like ZFS would be unable to do what I want even if UNIX/POSIX compatibility weren't a requirement, but if the answer is yes, then it looks like ZFS would be useful for accomplishing at least some of what I'm suggesting in environments in which compatibility isn't a requirement.
Matthew Ahrens
2005-Dec-14 23:44 UTC
[zfs-discuss] Simplifying ZFS via consistent use of COW and snapshot namespaces
On Tue, Dec 13, 2005 at 05:21:38PM -0800, Andrew wrote:> I propose three changes to ZFS, all of which are individually > beneficial, and which together move some features of ZFS from features > which must be explicitly invoked by the user/administrator to features > which are automatically and transparently invoked by the system. The > changes are: > > 1. Make snapshots of nested filesystems do nested transactional > snapshots (and rollbacks do nested rollbacks), the lack of which I > objected to in my message "Counterintuitive snapshotting" in this > forum (at > http://www.opensolaris.org/jive/thread.jspa?threadID=4244&tstart=0).As Eric mentioned, it would be straightforward to make ''zfs snapshot -r <fs>@<snap>'' (or some variation on that syntax) take a snapshot of all nested filesystems as well. It would be possible to create all the snapshots atomically (eg. by creating them all in the same txg). The idea of making recursive snapshots be the default is interesting, but it brings up a lot more issues. For example, would a snapshot of a filesystem *imply* that its descendents are snapshotted as well? Then descendent filesystems wouldn''t have the snapshot explicitly listed (eg. by ''zfs list''). In that case, would non-recursive snapshots be allowed at all? If they are, how do you distinguish between recursive and non-recursive snapshots? I think it would be exceedingly difficult to "hide" the recursive snapshots in the face of ''zfs rename''. For example, what happens if I have: tank/home tank/home at yesterday tank/home/ahrens (which is implicitly snapshotted by tank/home at yesterday) tank/mail And I run ''zfs rename tank/mail tank/home/mail''?> 2. Do not require that rolling back a filesystem to a snapshot destroy > all intermediate snapshots. 
Dropping the requirement allows > filesystems to be arbitrarily rolled back without requiring > destruction of clones which are dependent on intermediate snapshots.That might be possible, by creating a clone of the snapshot you want to rollback to, and using that as the filesystem. However, the snapshot couldn''t be deleted until the subsequent snapshots (from the abandoned branch) are deleted. Rather than hide this ''magic'' and invent a new rule to explain why the snapshot can''t be deleted, I think we should just expose what''s actually going on -- you''ve cloned the snapshot. For example, you would have: tank/foo at a --- tank/foo at b --- tank/foo at c ------------------- tank/foo \ \ ---------------------- tank/foo-new at d --- tank/foo-new Since this is kind of clumsy, you really want to ''clone swap'' these two (see bug 6276916), so you can have: tank/foo at a --- tank/foo at b -------------------- tank/foo at d --- tank/foo \ \ --- tank/foo-abandoned at c ------------- tank/foo-abandoned So I think you can accomplish what you want today, with a few steps and a little uglyness in the naming. I''d consider adding a ''zfs rollback -c <snap>'' to automate the procedure.> 3. Change "cp" to use COW by taking a snapshot of the source file and > creating the target file as a clone (yes, this means snapshotting and > cloning individual files; see below), and perform the copies-on-write > at the block level in the same way that ZFS currently performs COW at > the block level when a file is modified. The sharing of blocks among > files would be entirely transparent at the user level in the same way > that the sharing of files among cloned filesystems is entirely > transparent.This would be nice, but the implementation is non-obvious. Did you have any ideas in mind? We certainly *could* make each file be its own filesystem, as you suggested. However, as Eric mentioned, that really doesn''t scale very well, especially with respect to administration. 
Not to mention, creating recursive snapshots in that world would be really time-consuming since you could have millions of nested filesystems to traverse through.

Keep in mind that ZFS snapshots are not implemented by simply reference-counting every block (which would be slow). See my blog entry for more details:

http://blogs.sun.com/roller/page/ahrens?entry=is_it_magic

We've tossed around a couple ideas in the past few years for allowing more flexible references to blocks. It would be really cool to have references to blocks from arbitrary places -- we could even coalesce references to identical blocks that weren't created with 'cp'. Unfortunately, it's hard to come up with something that performs well and works in all cases. But we're certainly open to suggestions (or code!).

It might be possible to come up with a method that implemented snapshots of files in a similar way to snapshots of filesystems, but with a separate mechanism (and different administration). But I don't see that being worth the effort just to make 'cp' go a bit faster and use less space.

Thanks for thinking about how to make snapshots better. Bug 6343653, "want to quickly 'copy' a file from a snapshot", may provide some more food for thought.

--matt
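
For readers who find the clone-and-swap description for point 2 hard to follow, here is a minimal Python sketch of the end state it produces. It is a toy model only, not ZFS code: the Dataset class, rollback_by_clone, can_destroy and the way the "-abandoned" name is generated are all assumptions invented for illustration, and the model ignores everything about blocks and space accounting.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class Dataset:
    """Toy filesystem: its own ordered snapshots plus an optional clone origin."""
    name: str
    snapshots: list = field(default_factory=list)
    origin: Optional[Tuple[str, str]] = None  # (dataset name, snapshot name)


def rollback_by_clone(pool: dict, fs: str, snap: str,
                      abandoned_suffix: str = "-abandoned") -> None:
    """Roll `fs` back to `snap` while keeping the later snapshots.

    The later snapshots move to a sibling dataset (hypothetical naming:
    fs + abandoned_suffix) that is recorded as a clone of fs@snap. This
    mirrors the state after the 'clone swap' in the second diagram.
    """
    live = pool[fs]
    cut = live.snapshots.index(snap) + 1  # raises ValueError if snap is unknown
    abandoned = Dataset(fs + abandoned_suffix, live.snapshots[cut:],
                        origin=(fs, snap))
    live.snapshots = live.snapshots[:cut]
    pool[abandoned.name] = abandoned


def can_destroy(pool: dict, fs: str, snap: str) -> bool:
    """A snapshot is pinned while any dataset still names it as its origin."""
    return all(ds.origin != (fs, snap) for ds in pool.values())


if __name__ == "__main__":
    pool = {"tank/foo": Dataset("tank/foo", ["a", "b", "c"])}
    rollback_by_clone(pool, "tank/foo", "b")
    for ds in pool.values():
        print(ds.name, ds.snapshots, ds.origin)
    print("can destroy tank/foo@b:", can_destroy(pool, "tank/foo", "b"))
```

The point of the sketch is the last check: tank/foo@b stays pinned for as long as the abandoned branch exists, which is the rule that would otherwise have to be explained to the user as 'magic'.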
Andrew
2005-Dec-15 16:38 UTC
[zfs-discuss] Re: Simplifying ZFS via consistent use of COW and snapshot namespaces
Matthew Ahrens wrote:
> It would be possible to create all the snapshots atomically (e.g. by
> creating them all in the same txg).

Excellent. I was hoping this was the case.

> The idea of making recursive snapshots be the default is interesting,
> but it brings up a lot more issues. For example, would a snapshot of a
> filesystem *imply* that its descendents are snapshotted as well?

Yes.

> Then descendent filesystems wouldn't have the snapshot explicitly
> listed (e.g. by 'zfs list').

True.

> In that case, would non-recursive snapshots be allowed at all? If they
> are, how do you distinguish between recursive and non-recursive
> snapshots?

Well, if you tar a directory, it operates recursively. If you try to cp a directory, Solaris makes you use cp -r, and cp -r operates recursively. Similarly for rm. And mv operates recursively.

> I think it would be exceedingly difficult to "hide" the recursive
> snapshots in the face of 'zfs rename'. For example, what happens if I
> have:
>
> tank/home
> tank/home@yesterday
> tank/home/ahrens (which is implicitly snapshotted by tank/home@yesterday)
> tank/mail
>
> And I run 'zfs rename tank/mail tank/home/mail'?

It should work as follows:

First, .zfs/snapshot directories shouldn't be used; instead, "@" should be used directly in file and directory names, as I've already proposed. For example, as Eric wrote:

> Do you mean you would like to do:
>
> $ cd /home/eschrock
> $ mv .vimrc@yesterday .vimrc
>
> Instead of:
>
> $ cd /home/eschrock
> $ cp .zfs/snapshot/yesterday .vimrc

(My answer is "yes".)
That happened to be in the context of the issue of snapshotting individual files, but that issue is irrelevant here; regardless of whether the individual file was snapshotted using "zfs snapshot home/eschrock/.vimrc@yesterday" or (if individual file snapshots aren't supported) an entire filesystem was snapshotted using "zfs snapshot home/eschrock@yesterday" or "zfs snapshot home@yesterday", the .vimrc file within the snapshot should be accessible as /home/eschrock/.vimrc@yesterday rather than as /home/eschrock/.zfs/snapshot/yesterday/.vimrc or /home/.zfs/snapshot/yesterday/eschrock/.vimrc.

Now, to address the issue of renaming:

Standard path names (e.g. /foo/bar) should work as usual, with a path name p identifying some directory or some file, the contents of which can change over time, and p can exist or not exist at various points in time and might even be a directory at one point in time and a file sometime later. But p@versionname always identifies the same data which p identified at the time at which the snapshot named "versionname" was taken. So perhaps /foo/bar is currently a file containing some data, /foo/bar@third is a file containing some different data, /foo/bar@second doesn't exist (because at the time that the snapshot foo@second was taken, /foo/bar didn't exist), and /foo/bar@first is a directory, and /foo/bar/baz@first is a file but /foo/bar/baz@third doesn't exist (obviously, because /foo/bar@third is a file, not a directory).

Now (starting a new example) suppose I have filesystems foo, foo/bar, and foo/biz, and files /foo/bar/baz and /foo/biz/baz exist at the time that I take the snapshot foo/bar@first. Then /foo/bar/baz@first will exist, but /foo/biz/baz@first will not (because foo/biz is not a subfilesystem of foo/bar, and I only took a snapshot of foo/bar).
But if I instead take the snapshot foo@first, then both /foo/bar/baz@first and /foo/biz/baz@first will exist (because both foo/bar and foo/biz are subfilesystems of foo, and I took a snapshot of foo). If I then move (i.e. use zfs rename) foo/bar to foo/buz, then now /foo/bar/baz will no longer exist but /foo/buz/baz will exist, yet /foo/bar/baz@first will continue to exist, but /foo/buz/baz@first still won't exist (because /foo/buz/baz didn't exist when the snapshot foo@first was taken).

After taking foo@first and then moving foo/bar to foo/buz, if I then create a new filesystem foo/bar, and make a new file /foo/bar/baz in it, and take the snapshot foo@second, then both /foo/bar/baz@first and /foo/bar/baz@second will exist, yet they will be from separate subfilesystems. This entanglement isn't a problem, because if I do a rollback to foo@first, everything under foo will get blown away, including all of the subfilesystems and their snapshots, and foo will be exactly as it was when foo@first was taken (including the subfilesystems which existed under it at that time, and their snapshots).

But if, before doing the rollback to foo@first, I move (i.e. zfs rename) foo/biz to biz, then biz will survive when I roll back foo to foo@first, yet even though /biz/baz will now exist, /biz/baz@first won't exist, because /biz/baz didn't exist when foo@first was taken.

One minor attribute of this system is that snapshot version names (the part of the snapshot name following the "@") must now be globally unique (per pool), rather than just unique per subfilesystem, but this isn't a problem. The version name is thus simply an alias for the txg counter's value at the time at which the snapshot was taken (with all the subfilesystems snapshotted in the same txg, as you mentioned).
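
The naming rules above can be checked mechanically with a small Python model. This is only a sketch of the proposed semantics, not of anything ZFS implements: the Pool class and its write, snapshot, rename and exists methods are hypothetical names, paths are plain strings, and a snapshot is modeled as a frozen copy of the names under the snapshotted root, tagged with a toy txg counter.

```python
class Pool:
    """Toy pool: a live namespace plus pool-wide, uniquely named versions."""

    def __init__(self):
        self.txg = 0        # toy transaction group counter
        self.live = {}      # path -> contents (None marks a directory/filesystem)
        self.versions = {}  # version name -> (txg, frozen {path: contents})

    def write(self, path, contents=None):
        self.txg += 1
        self.live[path] = contents

    def snapshot(self, root, version):
        """Recursively freeze everything at or under `root` in one step."""
        if version in self.versions:
            raise ValueError(f"version name {version!r} is already used in this pool")
        self.txg += 1
        prefix = root.rstrip("/") + "/"
        frozen = {p: c for p, c in self.live.items()
                  if p == root or p.startswith(prefix)}
        self.versions[version] = (self.txg, frozen)

    def rename(self, old, new):
        """Move a subtree in the live namespace; existing versions are untouched."""
        self.txg += 1
        prefix = old.rstrip("/") + "/"
        for path in [p for p in self.live if p == old or p.startswith(prefix)]:
            self.live[new + path[len(old):]] = self.live.pop(path)

    def exists(self, name):
        """Resolve 'path' against the live tree, or 'path@version' against a version."""
        path, at, version = name.partition("@")
        if not at:
            return path in self.live
        if version not in self.versions:
            return False
        return path in self.versions[version][1]


if __name__ == "__main__":
    pool = Pool()
    for d in ("/foo", "/foo/bar", "/foo/biz"):
        pool.write(d)
    pool.write("/foo/bar/baz", "data")
    pool.write("/foo/biz/baz", "data")
    pool.snapshot("/foo", "first")
    pool.rename("/foo/bar", "/foo/buz")
    for name in ("/foo/bar/baz", "/foo/buz/baz",
                 "/foo/bar/baz@first", "/foo/buz/baz@first"):
        print(name, pool.exists(name))
```

The model deliberately keys versions by the path a name had when the snapshot was taken, which is why a later rename changes what exists in the live tree but not what exists under "@first".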
Note also that if I have filesystems foo and foo/bar, and I take snapshot foo@first, then I can roll back to foo@first, which will blow away both foo and foo/bar and replace them with the versions which existed at the time that foo@first was taken, but also I could instead roll back to just foo/bar@first, which will blow away just foo/bar and replace it with what existed when foo@first was taken but will leave foo itself unaffected. And if instead of foo@first, I'd taken just the snapshot foo/bar@first, then I couldn't roll back to foo@first, because foo@first wouldn't exist.

Note that in order to avoid namespace conflicts, the system must refuse to take the snapshot foo@first if there's any directory or file under foo (or under any subfilesystem of foo) which happens to have a name ending with "@first"; in this case, the user must choose some other snapshot version name besides "first". Similarly, if there was anything under foo named "x" in existence at the time that foo@first was taken, then so long as foo@first exists, there also exists x@first; therefore, regardless of whether x continues to exist, the system must refuse to create any new thing named "x@first", because x@first already exists.

So finally, in your example, you would now have tank/home/mail, but no tank/home/mail@yesterday, and you'd no longer have tank/mail, and of course you'd have no tank/mail@yesterday. However, if, prior to running "zfs rename tank/mail tank/home/mail", you'd run "zfs snapshot tank@yesterday" instead of "zfs snapshot tank/home@yesterday", you would now have tank/home/mail, but no tank/home/mail@yesterday, and you'd no longer have tank/mail, but you'd still have tank/mail@yesterday.

This also provides the most intuitive solution if you, as an administrator, moved tank/mail to tank/home/mail without telling your users, and they come to work the next morning and say "Dude! Where's the mail?
It was in tank/mail yesterday." And so they can access yesterday's mail as tank/mail@yesterday. (The question of where you put today's mail is another matter; you have to bother to tell them to look in tank/home/mail.)

Eric has objected that what I'm proposing violates POSIX, but I still fail to understand why; I'm unable to think of any example of a POSIX-compliant script which would operate correctly on ZFS as ZFS currently exists but would operate incorrectly on ZFS as modified by my proposed changes.

Please note that my suggestions for individual-file snapshottability and for making directories be equivalent to filesystems are entirely independent from my suggestions for recursive atomic snapshots and for changing the snapshot naming convention, and the latter would still be useful even if the former are not practical to implement.

> > 2. Do not require that rolling back a filesystem to a snapshot
> > destroy all intermediate snapshots. Dropping the requirement allows
> > filesystems to be arbitrarily rolled back without requiring
> > destruction of clones which are dependent on intermediate snapshots.
>
> That might be possible, by creating a clone of the snapshot you want
> to roll back to, and using that as the filesystem. However, the
> snapshot couldn't be deleted until the subsequent snapshots (from the
> abandoned branch) are deleted. Rather than hide this 'magic' and
> invent a new rule to explain why the snapshot can't be deleted, I
> think we should just expose what's actually going on -- you've cloned
> the snapshot. For example, you would have:

[diagrams snipped]

(incidentally, your diagrams as shown in the discussion forum on opensolaris.org are mangled to the point of incomprehensibility due to the annoying fact that web browsers don't render the whitespace that's present in html code)

> So I think you can accomplish what you want today, with a few steps
> and a little ugliness in the naming.
> I'd consider adding a 'zfs rollback -c <snap>' to automate the
> procedure.

I'd suggest hiding the magic and just displaying the dependency tree upon request. After all, if the user wants to roll back a filesystem, and there are clones dependent on intermediate snapshots, in the ordinary case he's not going to change his mind and say "well, in that case I suppose I don't really want to roll back after all"; ordinarily he's rolling back because he wants to restore the filesystem to a previous state, and the fact that there are intermediate states stored on the disk (and consuming disk space) is irrelevant. So requiring a "-c" switch would be pointless. ZFS could, however, upon doing the requested rollback, output "oh by the way, there are intermediate snapshots, and if you want to get rid of them then you'll have to do it manually".

In fact, I'd suggest that the intermediate snapshots be kept even if there are no dependencies, and I'd even suggest that prior to doing the rollback, ZFS should automatically take a snapshot (and name it something like "thefilesystem@point_at_which_rollback_was_done-autogenerated"), thus maintaining a redo log to help the user avoid shooting himself in the foot.

> > 3. Change "cp" to use COW by taking a snapshot of the source file
> > and creating the target file as a clone
[snip]
> This would be nice, but the implementation is non-obvious. Did you
> have any ideas in mind?

I'm still thinking about this and about the stuff from the remainder of your message.

This message posted from opensolaris.org
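
As an illustration of the "redo log" rollback suggested in the message above, here is a short Python sketch. It models only the proposed user-visible behavior, under stated assumptions: VersionedFS, its methods, the caller-supplied stamp argument and the exact autogenerated name are all hypothetical, and whole-state copies stand in for whatever block sharing a real implementation would use.

```python
import copy


class VersionedFS:
    """Toy filesystem whose rollback never destroys snapshots."""

    AUTO_SUFFIX = "-autogenerated"

    def __init__(self, name):
        self.name = name
        self.data = {}       # live contents: file name -> contents
        self.snapshots = {}  # snapshot name -> frozen copy, in creation order

    def snapshot(self, snap):
        if snap in self.snapshots:
            raise ValueError(f"{self.name}@{snap} already exists")
        self.snapshots[snap] = copy.deepcopy(self.data)

    def rollback(self, snap, stamp):
        """Restore `snap`, first saving the current state under an automatic name.

        `stamp` is a caller-supplied label (for example a timestamp) that
        keeps the automatic names unique. Returns the automatic name so the
        rollback itself can be undone later.
        """
        if snap not in self.snapshots:
            raise KeyError(f"{self.name}@{snap} does not exist")
        auto = f"before_rollback_to_{snap}_{stamp}{self.AUTO_SUFFIX}"
        self.snapshot(auto)
        self.data = copy.deepcopy(self.snapshots[snap])
        return auto

    def list_snapshots(self, show_autogenerated=False):
        """Hide automatic snapshots unless they are explicitly requested."""
        return [s for s in self.snapshots
                if show_autogenerated or not s.endswith(self.AUTO_SUFFIX)]


if __name__ == "__main__":
    fs = VersionedFS("tank/foo")
    fs.data["code.c"] = "v1"
    fs.snapshot("a")
    fs.data["code.c"] = "v2"
    fs.snapshot("b")
    fs.data["code.c"] = "v3"
    undo = fs.rollback("a", stamp="t1")
    print(fs.data, fs.list_snapshots(), undo)
```

Because the intermediate snapshot "b" and the automatic pre-rollback snapshot both survive, rolling back to the returned name restores the state that the first rollback discarded.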