ZFS documentation lists snapshot limits on any single file system in a pool at 2**48 snaps, and that seems to logically imply that a snap on a file system does not require an update to the pool's currently active uberblock. That is to say, if we take a snapshot of a file system in a pool and then make any changes to that file system, the copy-on-write behavior induced by the changes will stop at some synchronization point below the uberblock (presumably at or below the DNODE that is the DSL directory for that file system). In-place updates to a DNODE that has been allocated in a single-sector-sized ZFS block can be considered atomic, since the sector write will either succeed or fail totally, leaving either the old version or the new version, but not a combination of the two.

This seems sensible to me, but the description of object sets beginning on page 26 of the ZFS On-Disk Specification states that the DNODE type DMU_OT_DNODE (the type of the DNODE that's included in the 1KB objset_phys_t structure) will have a data load of an array of DNODES allocated in 128KB blocks, and the picture (Illustration 12 in the spec) shows these blocks as containing 1024 DNODES. Since DNODES are 512 bytes, it would not be possible to fit the 1024 DNODES depicted in the illustration into one 128KB block (only 256 would fit), and if DNODES did live in such an array then they could not be atomically updated in place.

If the blocks in question were actually filled with an array of block pointers pointing to single-sector-sized blocks that each held a DNODE, then this would account for the 1024 entries per 128KB block shown, since block pointers are 128 bytes (not the 512 bytes of a DNODE). But in this case, wouldn't such 128KB blocks be considered indirect blocks, forcing the dn_nlevels field shown in the object set DNODE at the top left of Illustration 12 to be 2, instead of the 1 that's there?
I'm further confused by the illustration's use of dotted lines to project the contents of a structure field (as seen in the projection of the metadnode field of the objset_phys_t structure found at the top of the picture) and arrows to represent pointers (as seen in the projection of the block pointer array of the DMU_OT_DNODE type dnode, also at the top of the picture), but the blocks pointed to by these block pointers seem to actually contain instances of DNODES (as seen from the projection of one of these instances in the lower part of the picture). Should this projection be replaced by a pointer to the lower DNODE?

This message posted from opensolaris.org
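The size arithmetic behind the question above can be checked in a few lines. This is only a sketch using the figures quoted in the post (128KB blocks, 512-byte dnodes, 128-byte block pointers); the variable names are made up for illustration and are not taken from the ZFS source.

```python
# Sizes in bytes, exactly as quoted in the post above.
BLOCK_SIZE = 128 * 1024   # one 128KB block
DNODE_SIZE = 512          # one dnode
BLKPTR_SIZE = 128         # one block pointer

# How many of each structure fit in a single block.
dnodes_per_block = BLOCK_SIZE // DNODE_SIZE
blkptrs_per_block = BLOCK_SIZE // BLKPTR_SIZE

print(dnodes_per_block)    # 256
print(blkptrs_per_block)   # 1024
```

With these figures, the 1024 entries per block in the illustration match an array of block pointers, not an array of dnodes.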
> ZFS documentation lists snapshot limits on any single file system in a
> pool at 2**48 snaps, and that seems to logically imply that a snap on
> a file system does not require an update to the pool's
> currently active uberblock.

All committed changes (including snapshot creation) require a new uberblock to be written.

> That is to say, that if we take a snapshot of a file system in a pool,
> and then make any changes to that file system, the copy on write
> behavior induced by the changes will stop at some synchronization
> point below the uberblock (presumably at or below the DNODE that is
> the DSL directory for that file system). In-place updates to a DNODE
> that has been allocated in a single sector sized ZFS block can be
> considered atomic, since the sector write will either succeed or fail
> totally, leaving either the old version or the new version, but not a
> combination of the two.

There are no in-place updates. Any update to a node would also require updating its parent to make the checksum in that block consistent (and so on back up the tree). So instead a new block is written.

--
Darren Dunham, ddunham at taos.com
Senior Technical Consultant, TAOS, http://www.taos.com/
Got some Dr Pepper? San Francisco, CA bay area
< This line left intentionally blank to confuse you. >
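The point about checksums forcing updates back up the tree can be illustrated with a toy model. This is a hedged sketch, not ZFS code: `Block`, `checksum`, and `cow_update` are hypothetical names, SHA-256 stands in for whatever checksum the pool uses, and the only assumption carried over from the reply is that a parent block records the checksum of each child it points to.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Stand-in checksum function for this toy model only.
    return hashlib.sha256(data).hexdigest()


class Block:
    """Toy block: a payload plus 'block pointers' to child blocks."""

    def __init__(self, payload, children=()):
        self.payload = payload
        self.children = tuple(children)

    def serialize(self):
        # The parent embeds each child's checksum, so any change in a
        # child changes the parent's own on-disk contents as well.
        parts = [self.payload]
        parts += [checksum(c.serialize()).encode() for c in self.children]
        return b"|".join(parts)


def cow_update(block, path, new_payload):
    """Return a new tree with the block at `path` replaced.

    Every block on the path is newly allocated; blocks off the path
    are shared with the old tree, and no old block is modified.
    """
    if not path:
        return Block(new_payload, block.children)
    i = path[0]
    new_child = cow_update(block.children[i], path[1:], new_payload)
    children = block.children[:i] + (new_child,) + block.children[i + 1:]
    return Block(block.payload, children)


file_a = Block(b"file A, old contents")
file_b = Block(b"file B")
fs1 = Block(b"fs1", [file_a, file_b])
fs2 = Block(b"fs2", [Block(b"file C")])
old_root = Block(b"root", [fs1, fs2])

new_root = cow_update(old_root, [0, 0], b"file A, new contents")

# The root's checksum differs, so a new root must be written.
print(checksum(old_root.serialize()) != checksum(new_root.serialize()))
# The untouched subtree is shared, not copied.
print(new_root.children[1] is fs2)
# The old tree still holds the old data.
print(old_root.children[0].children[0].payload)
```

In this model only the blocks on the path from the changed leaf to the root are rewritten; the old tree stays intact and readable.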
Bill Moloney
2007-Feb-07  21:14 UTC
[zfs-discuss] Re: The ZFS MOS and how DNODES are stored
Thanks for the input, Darren, but I'm still confused about DNODE atomicity ... it's difficult to imagine that a change made anywhere in the zpool would require copy operations all the way back up to the uberblock (e.g. if some single file in one of many file systems in a zpool was suddenly changed, making a new copy of all of the intervening objects in the tree back to the uberblock would seem to be an untenable amount of work even though it may all be carried out in memory and not involve any IO, although if the zpool itself was under snapshot control this would have to happen) ... the DNODE implementation appears to include its own checksum field (self-checksumming), and controlling DNODEs (those that lead to descendant collections of DNODEs) are always of the known type DMU_OT_DNODE and so their block pointers do not have to checksum the DNODEs they point to (unlike all other block pointers, which do checksum the data they point to) ... this would allow for in-place updates of a DNODE, without the need to continue further up the tree ... since all objects are controlled by a DNODE, updates to an object's data can stop at its DNODE if that DNODE is not under some snapshot or clone control ... if this is not the case, then 'any' modification in the zpool would require copying up to the uberblock
Darren Dunham
2007-Feb-07  21:47 UTC
[zfs-discuss] Re: The ZFS MOS and how DNODES are stored
> Thanks for the input Darren, but I'm still confused about DNODE
> atomicity ... it's difficult to imagine that a change that is made
> anyplace in the zpool would require copy operations all the way back
> up to the uberblock (e.g. if some single file in one of many file
> systems in a zpool was suddenly changed, making a new copy of all of
> the intervening objects in the tree back to the uberblock would seem
> to be an untenable amount of work even though it may all be carried
> out in memory and not involve any IO, although if the zpool itself was
> under snapshot control this would have to happen)

How many objects need to change? Not that many: only those on the path from the changed block back up to the uberblock.

> ... the DNODE
> implementation appears to include its own checksum field
> (self-checksumming), and controlling DNODEs (those that lead to
> descendant collections of DNODEs) are always of the known type
> DMU_OT_DNODE and so their block pointers do not have to checksum the
> DNODEs they point to (unlike all other block pointers that do checksum
> the data they point to) ... this would allow for in-place updates of a
> DNODE, without the need to continue further up the tree ... since all
> objects are controlled by a DNODE, updates to an object's data can
> stop at its DNODE if that DNODE is not under some snapshot or clone
> control ... if this is not the case, then 'any' modification in the
> zpool would require copying up to the uberblock

I will have to go back and look at the dnode stuff in the specification. But everything I know about it suggests that any committed change to the filesystem structure (snapshots included) will require writing a new uberblock. Certainly the uberblock is updated periodically anyway.

Perhaps someone knows an easy way to display the uberblock generation number so it can be viewed as changes are occurring?

--
Darren Dunham
Matthew Ahrens
2007-Feb-08  22:57 UTC
[zfs-discuss] Re: The ZFS MOS and how DNODES are stored
Bill Moloney wrote:
> Thanks for the input Darren, but I'm still confused about DNODE
> atomicity ... it's difficult to imagine that a change that is made
> anyplace in the zpool would require copy operations all the way back
> up to the uberblock

This is in fact what happens. However, these changes are all batched up (into a transaction group, or "txg"), so the overhead is minimal.

> the DNODE
> implementation appears to include its own checksum field
> (self-checksumming),

That is not the case. Only the uberblock and intent log blocks are self-checksumming.

> if this is not the case, then 'any'
> modification in the zpool would require copying up to the uberblock

That's correct, any modifications require modifying the uberblock (with the exception of intent log writes).

FYI, dnodes are not involved with the snapshot mechanism. Snapshotting happens at the dsl dataset layer, while dnodes are implemented above that in the dmu layer. Check out dsl_dataset.[ch].

--matt
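Why batching into a transaction group keeps the overhead small can be seen with a simple count. The sketch below is a toy model under stated assumptions, not ZFS code: each block is identified by its path of child indices from the root (the empty tuple standing in for the uberblock), and `dirty_blocks` and the example paths are invented for illustration.

```python
def dirty_blocks(paths):
    """Return the set of blocks that must be rewritten for these changes.

    Each changed leaf dirties every block on its path up to the root,
    and the root itself is the empty path ().
    """
    dirty = set()
    for path in paths:
        for depth in range(len(path) + 1):
            dirty.add(path[:depth])
    return dirty


# Four changed leaves, each three levels below the root.
changes = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]

# Committing each change separately rewrites its whole path every time.
one_at_a_time = sum(len(dirty_blocks([p])) for p in changes)

# Committing them together rewrites each shared ancestor only once.
batched = len(dirty_blocks(changes))

print(one_at_a_time)  # 16
print(batched)        # 10
```

In this model the root is written once per batch instead of once per change, and the saving grows as more changes share ancestors.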