Chen Zheng
2008-Nov-08 02:59 UTC
[zfs-code] How does zfs COW deal with '..' in brother directory?
Hi Matt,

I have some trouble understanding the ZFS COW implementation. Suppose b and c are both child directories of a. If c changes, there will be new versions of both a and c, namely c' and a'.

  a       a'
b   c     c'

Because '..' in b points to a before this change, should we modify b so that '..' points to a'? If yes, we would perhaps need to modify all sibling directories of c, and so on recursively. That is impossible, so what is the trick here? If no, does that mean the path a'/b/.. really points to a, not a'?

Regards,
Chenz

--
args are passed by ref, but bindings are local, variables are in fact just a symbol referencing an object
Eddie Edwards
2008-Nov-11 18:27 UTC
[zfs-code] How does zfs COW deal with '..' in brother directory?
'..' is a virtual directory node and doesn't exist on disc, so there is no issue with COW here. See zfs_dirlook() for details of the treatment of '..'; the code starts with

    if (name[0] == '.' && name[1] == '.' && name[2] == 0)

HTH

--
This message posted from opensolaris.org
Eddie Edwards
2008-Nov-11 18:45 UTC
[zfs-code] How does zfs COW deal with '..' in brother directory?
Actually, thinking about it, that may not answer your question. There's still the issue of how zp_parent (the znode field that represents '..') is updated in the znode ...
Eddie Edwards
2008-Nov-11 19:03 UTC
[zfs-code] How does zfs COW deal with '..' in brother directory?
I think this is the correct answer:

The parent directory is stored as a 64-bit object id in znode->zp_parent. This is just an index and is relative to the containing objset. The /index/ remains valid even after the /contents/ of directory 'a' have changed, so directory 'b' does not need updating.

In other words, the objset provides the additional level of indirection required to avoid updating all the dependents such as 'b'.
Matthew Ahrens
2008-Nov-11 19:13 UTC
[zfs-code] How does zfs COW deal with '..' in brother directory?
Chen Zheng wrote:
> Hi matt,
>
> I have some problems about understanding zfs COW implemention. Suppose
> b and c are both children dir of a, if c changes, there will be new
> versions of both a and c, namely c' and a'.
>
>   a       a'
> b   c     c'
>

Actually, there will not be a new version of A unless the name of C changes (or it is removed from A). Adding or removing entries from C does not change A.

> Because '..' in b points to a before this change, shall we modify b
> to let '..' point to a'?

If we did write out a new version of A (for example, when adding a new file "A/D"), then B's "pointer" to A will still be valid, because it references A by object number, not by block pointer (in zp_parent, as Eddie mentioned).

--matt
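As an illustration of the point above (a toy model, not code from ZFS): the sketch below treats an objset as a table from object number to the block currently holding that object. ToyObjset, lookup, and the "parent"/"entries" fields are hypothetical names invented for this example; "parent" merely stands in for zp_parent. Rewriting A copy-on-write moves A to a new block, but B is never touched, and B's '..' still resolves to the new contents of A because it goes through the object number.

```python
class ToyObjset:
    """Toy model: maps object numbers to the block currently holding each object."""

    def __init__(self):
        self.slots = {}      # object number -> block address
        self.blocks = {}     # block address -> object contents
        self.next_block = 0

    def write(self, objnum, contents):
        # Copy-on-write: never overwrite in place. Allocate a fresh block,
        # store the new contents there, then repoint the object's slot.
        addr = self.next_block
        self.next_block += 1
        self.blocks[addr] = contents
        self.slots[objnum] = addr

    def read(self, objnum):
        return self.blocks[self.slots[objnum]]


def lookup(oset, dir_objnum, name):
    """Resolve one path component; '..' is answered from the parent object number."""
    d = oset.read(dir_objnum)
    if name == "..":
        return d["parent"]           # an object number, not a block address
    return d["entries"][name]


A, B, C, D = 1, 2, 3, 4
objset = ToyObjset()
objset.write(A, {"parent": A, "entries": {"b": B, "c": C}})
objset.write(B, {"parent": A, "entries": {}})
objset.write(C, {"parent": A, "entries": {}})

old_block_of_a = objset.slots[A]
block_of_b = objset.slots[B]

# Add "A/D": A is rewritten into a new block, B is left alone.
objset.write(A, {"parent": A, "entries": {"b": B, "c": C, "d": D}})

parent_of_b = lookup(objset, B, "..")
```

After the rewrite, objset.slots[A] differs from old_block_of_a while objset.slots[B] is unchanged, and reading parent_of_b yields the directory that now contains "d".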
Chen Zheng
2008-Nov-12 13:33 UTC
[zfs-code] How does zfs COW deal with '..' in brother directory?
2008/11/12 Matthew Ahrens <Matthew.Ahrens at sun.com>:
> Chen Zheng wrote:
>>
>> Hi matt,
>>
>> I have some problems about understanding zfs COW implemention. Suppose
>> b and c are both children dir of a, if c changes, there will be new
>> versions of both a and c, namely c' and a'.
>>
>>   a       a'
>> b   c     c'
>>
>
> Actually, there will not be a new version of A unless the name of C changes
> (or it is removed from A). Adding or removing entries from C does not
> change A.
>>
>> Because '..' in b points to a before this change, shall we modify b
>> to let '..' point to a'?
>
> If we did write out a new version of A (for example, when adding a new file
> "A/D"), then B's "pointer" to A will still be valid, because it references A
> by object number, not by block pointer (in zp_parent, as Eddie mentioned).
>
> --matt
>

Yes, understood: the old A and the new A have the same object number N. First allocate a new block and write the new A, then modify the objset's slot N to point to the new A. Done.

Does that mean the two versions of A can't both exist in the same objset? And must snapshot S, which contains the old A, and the currently active fs, which contains the new A, have independent objsets? If so, the objset of the fs would have to be cloned when creating a snapshot. But in dsl_dataset_snapshot_sync I found

    dsphys->ds_bp = ds->ds_phys->ds_bp;

which shows that a newly created snapshot has the same objset as the fs. Where is the cloning done? And if the objset is large, with millions of slots, that would be a lot of work.

I'm wondering: if backtracking pointers like 'parent' are allowed, is it still possible for different versions of an object to exist in the same object set with different object numbers?

Matt, Eddie, thanks for your kind help.

-chenz
Eddie Edwards
2008-Nov-12 14:59 UTC
[zfs-code] How does zfs COW deal with '..' in brother directory?
Well in the creation of a snapshot, the code now copies the *physical* block pointer for the objset.

Then if a live object changes, the live objset changes, but since this is copy-on-write the snapshot objset still exists unchanged at the same physical location. So the live objset sees the new data and the snapshot objset sees the old data.

But at snapshot creation, the objset does not need to be copied.

In principle, I think you're right, in that an objset could hold both versions of a file. It would contain physical pointers to each copy. They would start out the same and diverge when one or other copy was modified.

For instance, zfs could support a fast "cp" which just cloned the file i.e. copied the objset entry to a new objset entry. This would allow you to copy a 100GB file in milliseconds, and still have the ability to update both copies of the file independently. This is similar in principle to UNIX fork().

But I'm not aware of any user-level interfaces that would allow this to happen.
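To make the "copy one block pointer" idea concrete, here is a minimal sketch using a generic path-copying tree. It is an assumption-laden stand-in, not the ZFS on-disk layout: Node, cow_write, and read are hypothetical names, and a Python object reference plays the role of a physical block pointer. Taking the snapshot is a single pointer copy; a later write to the live tree allocates new nodes only along the modified path and shares everything else with the snapshot.

```python
class Node:
    """Stand-in for an on-disk block; treated as immutable once built."""

    def __init__(self, children=None, data=None):
        self.children = dict(children or {})
        self.data = data


def cow_write(root, path, data):
    """Return a new root with `data` stored at `path`.

    Only nodes on the path are copied; all other subtrees are shared
    by reference with the old root, which is left unchanged.
    """
    if not path:
        return Node(root.children if root else None, data)
    head, rest = path[0], path[1:]
    children = dict(root.children) if root else {}
    children[head] = cow_write(children.get(head), rest, data)
    return Node(children, root.data if root else None)


def read(root, path):
    node = root
    for key in path:
        node = node.children[key]
    return node.data


live = cow_write(None, ("a", "b"), "b-v1")
live = cow_write(live, ("a", "c"), "c-v1")

snapshot = live                              # snapshot: copy the root pointer only

live = cow_write(live, ("a", "c"), "c-v2")   # live diverges, snapshot is untouched
```

The snapshot still reads "c-v1" at a/c while the live tree reads "c-v2", and the unmodified node for a/b is the very same object in both trees, so no per-object cloning happened at snapshot time.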
Matthew Ahrens
2008-Nov-12 16:16 UTC
[zfs-code] How does zfs COW deal with '..' in brother directory?
Eddie Edwards wrote:
> In principle, I think you're right, in that an objset could hold both
> versions of a file. It would contain physical pointers to each copy.
> They would start out the same and diverge when one or other copy was
> modified.
>
> For instance, zfs could support a fast "cp" which just cloned the file
> i.e. copied the objset entry to a new objset entry. This would allow you
> to copy a 100GB file in milliseconds, and still have the ability to update
> both copies of the file independently. This is similar in principle to
> UNIX fork().
>
> But I'm not aware of any user-level interfaces that would allow this to
> happen.

That's because multiple pointers to the same block in the same dataset would make it Extremely Nontrivial to determine when to free a block. For a related problem, see RFE 6343653[*] "want to quickly "copy" a file from a snapshot".

fork() has an additional level of indirection (the page_t) and it takes O(address space) time because it has to bump the reference count on each page.

--matt

[*] oh, you can't see the evaluation. I'll reproduce it here:

In addition to simply copying the bp's, we will need to remove any copied bp's from their dead lists. We don't know exactly which deadlist they are in, so we'll have to search all deadlists after the snapshot we are copying from.

One way to architect this would be for the zpl to set up the destination object, then call a "dmu_splice" routine which would copy a range of one object to another. Note that if the old object still exists, you *must* copy to that one, otherwise the head objset could have the same bp stored in many objects (very bad). The dmu could store an in-core list of bp's to remove from deadlist ("reincarnate"?), and process it in syncing context (by creating an avl tree and traversing the deadlists once).

*** (#1 of 2): 2005-10-30 04:59:00 GMT+00:00 matthew.ahrens at sun.com
*** Last Edit: 2005-10-30 04:59:00 GMT+00:00 matthew.ahrens at sun.com

Some design notes...

sh:  zfs recover <from-.snapshot-path> <to-fs-path>
sh:  zfs recover @<snapname> <to-fs-path>
zpl: remove old version of file[*].
     if not recovering into same <obj,gen>:
         create new file
     set file metadata (perms, size, etc) to snap file's
     add snap's <obj,gen> -> new <obj,gen> to recovered-map
dmu: copy bps from snapshot to head obj (use db_overridden_by?)
dmu: put bps in reincarnate avl-tree (per txg)
dsl: when sync, walk each deadlist, most recent first.
     if find a bp in the reincarnate avl,
         remove bp from deadlist (decrease unique if necessary)
         (move last entry here), and remove bp from avl
     when avl empty, stop.
     assert that avl becomes empty.

[*] to remove the old version of the file:
    set tocheck <obj,gen> = snap's <obj,gen>
    if tocheck <obj,gen> is in recovered-map
        set tocheck <obj,gen> = value <obj,gen>
    if tocheck <obj,gen> exists in head fs {
        if it is the specified destination file, truncate it
        otherwise, fail "you must remove filename"
    }
    remove tocheck <obj,gen> from recovered-map

*** (#2 of 2): 2006-05-25 19:03:05 GMT+00:00 matthew.ahrens at sun.com
*** Last Edit: 2006-05-25 19:03:05 GMT+00:00 matthew.ahrens at sun.com
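The "dsl:" step of the design notes above can be sketched roughly as follows. This is only a toy reading of those notes under stated assumptions: deadlists are plain Python lists ordered oldest to newest, block pointers are small integers, each bp appears on at most one deadlist, and a set stands in for the reincarnate AVL tree. The function name reincarnate is hypothetical, and the "decrease unique" accounting is left out.

```python
def reincarnate(deadlists, copied_bps):
    """Remove spliced-back block pointers from the deadlists.

    deadlists:  list of lists of bps, ordered oldest snapshot to newest
                (assumption of this sketch; each bp is on at most one list).
    copied_bps: bps that were copied from a snapshot into the head object.
    """
    pending = set(copied_bps)              # stands in for the per-txg AVL tree
    for deadlist in reversed(deadlists):   # most recent deadlist first
        if not pending:
            break                          # "when avl empty, stop"
        kept = []
        for bp in deadlist:
            if bp in pending:
                pending.remove(bp)         # bp is live again: drop from deadlist
            else:
                kept.append(bp)
        deadlist[:] = kept
    # "assert that avl becomes empty"
    assert not pending, "a copied bp was not found on any deadlist"
    return deadlists


deadlists = [[10, 11], [12, 13], [14]]
result = reincarnate(deadlists, [13, 14])
```

With these inputs the two newest deadlists lose bps 14 and 13, and the oldest deadlist is never scanned because the pending set is already empty by then.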
Eddie Edwards
2008-Nov-12 17:00 UTC
[zfs-code] How does zfs COW deal with '..' in brother directory?
Interesting. In fact I just found a thread on this exact idea here:

http://www.opensolaris.org/jive/thread.jspa?threadID=41615&tstart=90
Chen Zheng
2008-Nov-12 18:12 UTC
[zfs-code] How does zfs COW deal with '..' in brother directory?
sorry, forgot to send to list again

---------- Forwarded message ----------
From: Chen Zheng <nkchenz at gmail.com>
Date: 2008/11/13
Subject: Re: [zfs-code] How does zfs COW deal with '..' in brother directory?
To: Eddie Edwards <eddie at tinyted.net>

2008/11/12 Eddie Edwards <eddie at tinyted.net>:
> Well in the creation of a snapshot, the code now copies the *physical* block pointer for the objset.
>
> Then if a live object changes, the live objset changes, but since this is copy-on-write the snapshot objset still exists unchanged at the same physical location. So the live objset sees the new data and the snapshot objset sees the old data.
>

Aha, that's it. The blocks used by the objset's 7-level object number tree have no problem with backtracking 'parent' pointers, so it is quite reasonable and feasible to use copy-on-write for them.

> But at snapshot creation, the objset does not need to be copied.
>
> In principle, I think you're right, in that an objset could hold both versions of a file. It would contain physical pointers to each copy. They would start out the same and diverge when one or other copy was modified.

Yes, but only if there are no such backtracking pointers in the file tree. Maybe we can construct a version tree like this:

   R1 <- R2 <- R3 <- R4
   A1 <- A2 <- A3 <- A4
   ^           ^
B  C1    <-    C2

R is the root, which has a dir A, which has dirs B and C. New versions point to old ones, so old ones can be left untouched. The 'parent' field in each version always points to the oldest version of its parent, e.g. C1 is born with A1 and C2 is born with A3. Another super object S points to the latest root, in this case S = R4.

1. Parent '..' works fine:
   from R2/A2/C1/.. we get A2
   from R1/A1/C1/.. we get A1

2. Find the objects which reference a given object directly. In this case, C1 is referenced by A2 and A1. The key point is to find the next version, C2, which points to C1: start from the super object S, find the latest Cn, and from there it is easy to get to C2. The problem may be complicated when some parents of C have been deleted in newer versions, but it is still possible to find it: just search the next older version at the level which does not contain the desired child entry.

3. Delete an object. Suppose R1 and A1 are freed; shall we free C1? It all depends on step 2. If the only object referencing C1 is A1, then just free it; otherwise modify C1.parent to point to A2, which is now the oldest version of its parent. We always try not to modify objects, but here the modification of C1.parent seems unavoidable.

> For instance, zfs could support a fast "cp" which just cloned the file i.e. copied the objset entry to a new objset entry. This would allow you to copy a 100GB file in milliseconds, and still have the ability to update both copies of the file independently. This is similar in principle to UNIX fork().
>
> But I'm not aware of any user-level interfaces that would allow this to happen.