Matthew Ahrens
2011-May-25 19:02 UTC
[zfs-discuss] ZFS working group and feature flags proposal
The community of developers working on ZFS continues to grow, as does
the diversity of companies betting big on ZFS. We wanted a forum for
these developers to coordinate their efforts and exchange ideas. The
ZFS working group was formed to coordinate these development efforts.
The working group encourages new membership. In order to maintain the
group's focus on ZFS development, candidates should demonstrate
significant and ongoing contribution to ZFS.

The first product of the working group is the design for a ZFS on-disk
versioning method that will allow for distributed development of ZFS
on-disk format changes without further explicit coordination. This
method eliminates the problem of two developers both allocating
version number 31 to mean their own feature. This "feature flags"
versioning allows unknown versions to be identified, and in many cases
the ZFS pool or filesystem can be accessed read-only even in the
presence of unknown on-disk features.

My proposal covers versioning of the SPA/zpool, ZPL/zfs, send stream,
and allocation of compression and checksum identifiers (enum values).
We plan to implement the feature flags this summer, and aim to
integrate it into Illumos.

I welcome feedback on my proposal, and I'd especially like to hear
from people doing ZFS development -- what are you working on? Does
this meet your needs? If we implement it, will you use it?

Thanks,
--matt

-------------- next part --------------
ZFS Feature Flags proposal, version 1.0, May 25th 2011

============================== ON-DISK FORMAT CHANGES ==============================

for SPA/zpool versioning:
    new pool version = SPA_VERSION_FEATURES = 1000
    ZAP objects in MOS, pointed to by DMU_POOL_DIRECTORY_OBJECT = 1
        "features_for_read"    -> { feature name -> nonzero if in use }
        "features_for_write"   -> { feature name -> nonzero if in use }
        "feature_descriptions" -> { feature name -> description }
    Note that a pool can't be opened "write-only", so the
    features_for_read are always required.
    A given feature should be stored in either features_for_read or
    features_for_write, not both.

    Note that if a feature is "promoted" from a company-private
    feature to part of a larger distribution (eg. illumos), this can
    be handled in a variety of ways, all of which can be handled with
    code added at that time, without changing the on-disk format.

for ZPL/zfs versioning:
    new zpl version = ZPL_VERSION_FEATURES = 1000
    same 3 ZAP objects as above, but pointed to by MASTER_NODE_OBJ = 1
        "features_for_read"    -> { feature name -> nonzero if in use }
        "features_for_write"   -> { feature name -> nonzero if in use }
        "feature_descriptions" -> { feature name -> description }
    Note that the namespace for ZPL features is separate from SPA
    features (like version numbers), so the same feature name can be
    used for both (eg. for related SPA and ZPL features), but
    compatibility-wise this is not treated specially.

for compression:
    must be at pool version SPA_VERSION_FEATURES
    ZAP object in MOS, pointed to by POOL_DIR_OBJ:
        "compression_algos" -> { algo name -> enum value }
    Existing enum values (0-14) must stay the same, but new algorithms
    may have different enum values in different pools. Note that this
    simply defines the enum value. If a new algorithm is in use, there
    must also be a corresponding feature in features_for_read with a
    nonzero value. For simplicity, all algorithms, including legacy
    algorithms with fixed values (lzjb, gzip, etc), should be stored
    here (pending evaluation of prototype code -- this may be more
    trouble than it's worth).

for checksum:
    must be at pool version SPA_VERSION_FEATURES
    ZAP object in MOS, pointed to by POOL_DIR_OBJ:
        "checksum_algos" -> { algo name -> enum value }
    All notes for compression_algos apply here too.
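The per-pool name-to-enum indirection described above can be sketched in a
few lines of C. This is a standalone illustration, not the proposed in-tree
code: a plain array stands in for the "compression_algos" ZAP object, and
the entry "org.example:lz4h" (and its value 15) is a hypothetical new
algorithm.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-memory view of one pool's "compression_algos" ZAP
 * object. In a real pool this mapping is read from the MOS. */
struct algo_entry {
	const char *name;	/* <reverse-dns>:<short-name> */
	int value;		/* per-pool on-disk enum value */
};

static const struct algo_entry compression_algos[] = {
	{ "com.sun:lzjb",	3 },	/* legacy, fixed value */
	{ "com.sun:gzip-9",	13 },	/* legacy, fixed value */
	{ "org.example:lz4h",	15 },	/* new algo: value may differ per pool */
};

/* Look up the on-disk enum value for an algorithm name.
 * Returns -1 if this pool has no entry for the name. */
static int
algo_value(const char *name)
{
	for (size_t i = 0; i < sizeof (compression_algos) /
	    sizeof (compression_algos[0]); i++) {
		if (strcmp(compression_algos[i].name, name) == 0)
			return (compression_algos[i].value);
	}
	return (-1);
}
```

The key property is that legacy values (0-14) are identical in every pool,
while values 15 and up are only meaningful relative to this pool's own
mapping.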
Must also store a copy of what's needed to read the MOS in the label nvlist:
    "features_for_read" -> { feature name -> nonzero if in use }
    "compression_algos" -> { algo name -> enum value }
    "checksum_algos"    -> { algo name -> enum value }
    ZPL information is never needed.

    It's fine to store complete copies of these objects in the label.
    However, space in the label is limited. It's only *required* to
    store information needed to read the MOS, so we can get to the
    definitive version of this information. Eg, if we introduce a new
    compression algo but it is never used in the MOS, we don't need to
    add it to the label. Legacy algos with fixed values may be omitted
    from the label nvlist (eg. lzjb, fletcher4).

    The values in the nvlist features_for_read map may be different
    from the values in the MOS features_for_read. However, they must
    be the same when interpreted as a boolean (ie, the nvlist value
    != 0 iff the MOS value != 0). This is so that the nvlist map need
    not be updated whenever the "reference count" on a feature
    changes, only when it changes to/from zero.

for send stream:
    new feature flag DRR_FLAG_FEATURES = 1<<16
    BEGIN record has nvlist payload
    nvlist has:
        "features" -> { feature name -> unspecified }
        "types"    -> { type name -> enum value }
    types are record types; existing ones are reserved. New types
    should have a corresponding feature, so the presence of an unknown
    type is not an error. If an unknown type is used, records of that
    type can be safely ignored. So if a new record type can not be
    safely ignored, a corresponding new feature must be added.

all name formats (feature name, algo name, type name):
    <reverse-dns>:<short-name>
    eg.
    com.delphix:raidz4

all ALL_CAPS_STRING_DEFINITIONS will be #defined to the lowercase
string, eg:
    #define FEATURES_FOR_READ "features_for_read"

============================== BEHAVIOR CHANGES ==============================

zpool upgrade

zpool upgrade (no arguments)
    If the pool is at SPA_VERSION_FEATURES, but this software supports
    features which are not listed in the features_for_* MOS objects,
    the pool should be listed as available to upgrade. It's
    recommended that the short names of the available features be
    listed.

zpool upgrade -v
    After the list of static versions, each supported feature should
    be listed.

zpool upgrade -a | <pool>
    The pool or pools will have their features_for_* MOS objects
    updated to list all features supported by this software. Ideally,
    the value of the newly-added ZAP entries will be 0, indicating
    that the feature is enabled but not yet in use.

zpool upgrade -V <version> -a | <pool>
    The <version> may specify a feature, rather than a version number,
    if the version is already at SPA_VERSION_FEATURES. The feature may
    be specified by its short or full name. The pool or pools will
    have their features_for_* MOS object updated to list the specified
    feature, and any other features required by the specified one.

pool open ("zpool import" and implicit import from zpool.cache)
    If the pool is at SPA_VERSION_FEATURES, we must check for feature
    compatibility. First we will look through entries in the label
    nvlist's features_for_read. If there is a feature listed there
    which we don't understand, and it has a nonzero value, then we can
    not open the pool.

    Each vendor may decide how much information they want to print
    about the unsupported feature. It may be a catch-all "Pool could
    not be opened because it uses an unsupported feature.", or it may
    be the most verbose message, "Pool could not be opened because it
    uses the following unsupported features: <long feature name>
    <feature description> ...".
    Or features from known vs foreign vendors may be treated
    differently (eg. print this vendor's feature descriptions, but not
    unknown vendors'). Note that if a feature in the label is not
    supported, we can't access the feature description, so at best we
    can print the full feature name.

    After checking the label's features_for_read, we know we can read
    the MOS, so we will continue opening it and then check the MOS's
    features_for_read. Note that we will need to load the label's
    checksum_algos and compression_algos before reading any blocks.

    If the pool is being opened for writing, then features_for_write
    must also be checked. (Note, currently grub and zdb open the pool
    read-only, and the kernel module opens the pool for writing. In
    the future it would be great to allow the kernel module to open
    the pool read-only.)

zfs upgrade
    Treat this similarly to zpool upgrade, using the filesystem's
    MASTER_NODE's features_for_* objects.

filesystem mount
    Treat this similarly to pool open, using the filesystem's
    MASTER_NODE's features_for_* objects.

zfs receive
    If any unknown features are in the stream's BEGIN record's
    nvlist's "features" entry, then the stream can not be received.
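The read-compatibility check at pool open can be sketched as follows. This
is a minimal standalone sketch, not the proposed implementation: plain
arrays stand in for the label nvlist and for this software's supported
feature table, and the feature names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* One entry of a (simplified) features_for_read map. */
struct feature_ref {
	const char *name;		/* <reverse-dns>:<short-name> */
	unsigned long long refcount;	/* nonzero iff the feature is in use */
};

/* Features this software knows how to read (hypothetical names). */
static const char *supported_read_features[] = {
	"com.delphix:raidz4",
	"org.example:blkptr_v2",
};

static int
feature_supported(const char *name)
{
	for (size_t i = 0; i < sizeof (supported_read_features) /
	    sizeof (supported_read_features[0]); i++) {
		if (strcmp(supported_read_features[i], name) == 0)
			return (1);
	}
	return (0);
}

/* Returns 1 if the pool can be opened for reading: every feature we
 * don't understand must have a zero value (enabled but not in use). */
static int
can_read_pool(const struct feature_ref *features, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (features[i].refcount != 0 &&
		    !feature_supported(features[i].name))
			return (0);
	}
	return (1);
}
```

Note how the "nonzero if in use" convention pays off here: a pool may have
an unknown feature *enabled* and still be readable, as long as the
feature's value is zero.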
============================== IMPLEMENTATION NOTES ==============================

Legacy checksum algorithms are to be stored as follows:
    "com.sun:label"       -> 3
    "com.sun:gang_header" -> 4
    "com.sun:zilog"       -> 5
    "com.sun:fletcher2"   -> 6
    "com.sun:fletcher4"   -> 7
    "com.sun:sha256"      -> 8
    "com.sun:zilog2"      -> 9

Legacy compression algorithms are to be stored as follows:
    "com.sun:lzjb"   -> 3
    "com.sun:empty"  -> 4
    "com.sun:gzip-1" -> 5
    "com.sun:gzip-2" -> 6
    "com.sun:gzip-3" -> 7
    "com.sun:gzip-4" -> 8
    "com.sun:gzip-5" -> 9
    "com.sun:gzip-6" -> 10
    "com.sun:gzip-7" -> 11
    "com.sun:gzip-8" -> 12
    "com.sun:gzip-9" -> 13
    "com.sun:zle"    -> 14

Legacy send record types are to be stored as follows:
    "com.sun:begin"       -> 0
    "com.sun:object"      -> 1
    "com.sun:freeobjects" -> 2
    "com.sun:write"       -> 3
    "com.sun:free"        -> 4
    "com.sun:end"         -> 5
    "com.sun:write_byref" -> 6
    "com.sun:spill"       -> 7

The indirection tables for checksum algorithm, compression algorithm,
and send stream record type can be implemented as follows:

enum zio_checksum {
	ZIO_CHECKSUM_INHERIT = 0,
	ZIO_CHECKSUM_ON,
	ZIO_CHECKSUM_OFF,
	ZIO_CHECKSUM_LABEL,
	ZIO_CHECKSUM_GANG_HEADER,
	ZIO_CHECKSUM_ZILOG,
	ZIO_CHECKSUM_FLETCHER_2,
	ZIO_CHECKSUM_FLETCHER_4,
	ZIO_CHECKSUM_SHA256,
	ZIO_CHECKSUM_ZILOG2,
	...
	ZIO_CHECKSUM_FUNCTIONS
};

/* Order must match enum zio_checksum! */
const char *zio_checksum_names[] = {
	"inherit",
	"on",
	"off",
	"com.sun:label",
	"com.sun:gang_header",
	"com.sun:zilog",
	"com.sun:fletcher2",
	"com.sun:fletcher4",
	"com.sun:sha256",
	"com.sun:zilog2",
	...
};

/*
 * inherit, on, and off are not stored on disk, so
 * pre-initialize them here. Note that 8 bits are used for the
 * checksum algorithm in the blkptr_t, so there are 256 possible
 * values.
 */
uint8_t checksum_to_index[ZIO_CHECKSUM_FUNCTIONS] = {0, 1, 2};
enum zio_checksum index_to_checksum[256] = {0, 1, 2};

void
add_checksum_algo(const char *algo_name, uint8_t value)
{
	enum zio_checksum i;

	for (i = 0; i < ZIO_CHECKSUM_FUNCTIONS; i++) {
		if (strcmp(algo_name, zio_checksum_names[i]) == 0) {
			checksum_to_index[i] = value;
			index_to_checksum[value] = i;
		}
	}
	/* Ignore any unknown algorithms. */
}

#define	BP_GET_CHECKSUM(bp)	index_to_checksum[BF64_GET(...)]
#define	BP_SET_CHECKSUM(bp, x)	BF64_SET(..., checksum_to_index[x])
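Filling in the elided pieces with a reduced enum, the two-way mapping above
can be exercised as a standalone sketch. The algorithms and per-pool values
here are illustrative only; "org.example:skein" is a hypothetical new
algorithm, not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reduced version of the indirection tables above, for illustration. */
enum zio_checksum {
	ZIO_CHECKSUM_INHERIT = 0,
	ZIO_CHECKSUM_ON,
	ZIO_CHECKSUM_OFF,
	ZIO_CHECKSUM_FLETCHER_4,
	ZIO_CHECKSUM_SHA256,
	ZIO_CHECKSUM_SKEIN,		/* hypothetical new algorithm */
	ZIO_CHECKSUM_FUNCTIONS
};

/* Order must match enum zio_checksum! */
static const char *zio_checksum_names[] = {
	"inherit",
	"on",
	"off",
	"com.sun:fletcher4",
	"com.sun:sha256",
	"org.example:skein",
};

/* inherit, on, and off are not stored on disk; pre-initialize them. */
static uint8_t checksum_to_index[ZIO_CHECKSUM_FUNCTIONS] = {0, 1, 2};
static enum zio_checksum index_to_checksum[256] = {0, 1, 2};

/* Register the on-disk value for a named algorithm, as read from this
 * pool's checksum_algos ZAP object. Unknown names are ignored. */
static void
add_checksum_algo(const char *algo_name, uint8_t value)
{
	for (int i = 0; i < ZIO_CHECKSUM_FUNCTIONS; i++) {
		if (strcmp(algo_name, zio_checksum_names[i]) == 0) {
			checksum_to_index[i] = value;
			index_to_checksum[value] = i;
		}
	}
}
```

After loading a pool's checksum_algos (say fletcher4 -> 7, sha256 -> 8, and
org.example:skein -> 15 in this pool), index_to_checksum[15] yields
ZIO_CHECKSUM_SKEIN and checksum_to_index[ZIO_CHECKSUM_SKEIN] yields 15, so
block pointers round-trip through the per-pool values.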
Deano
2011-May-25 19:55 UTC
[zfs-discuss] [illumos-Developer] ZFS working group and feature flags proposal
<snip>

Hi Matt,

That looks really good. I've been meaning to implement a ZFS
compressor (using a two-pass LZ4 + arithmetic entropy coder), so it's
nice to see a route by which this can be done.

One question: what about the extensibility of RAID and other similar
systems? My quick perusal makes me think this is handled by simply
asserting a new feature using the extension mechanism, but perhaps
I've missed something? Do you see it being able to handle this
situation? It's of course a slightly tricky one, as it changes not
only data but potentially data layout as well...

Great work ZFS working group :) Nice to see ZFS's future coming
together!

Bye,
Deano
Matthew Ahrens
2011-May-25 20:16 UTC
[zfs-discuss] [illumos-Developer] ZFS working group and feature flags proposal
On Wed, May 25, 2011 at 12:55 PM, Deano <deano at rattie.demon.co.uk> wrote:
> Hi Matt,
>
> That looks really good. I've been meaning to implement a ZFS
> compressor (using a two-pass LZ4 + arithmetic entropy coder), so it's
> nice to see a route by which this can be done.

Cool! New compression algorithms are definitely something we want to
make straightforward to implement. I look forward to seeing your
results.

> One question: what about the extensibility of RAID and other similar
> systems? My quick perusal makes me think this is handled by simply
> asserting a new feature using the extension mechanism, but perhaps
> I've missed something? Do you see it being able to handle this
> situation? It's of course a slightly tricky one, as it changes not
> only data but potentially data layout as well...

Yes, a feature like RAIDZ3 could be implemented as a
"feature_for_read". It would be extra nice if the value was a count of
RAIDZ3 devices. That way you could "zpool upgrade", but if you didn't
actually have any RAIDZ3 devices, systems that don't know about RAIDZ3
would still be able to read the pool.

> Great work ZFS working group :) Nice to see ZFS's future coming
> together!

Thank you!

--matt
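The "value as a count" idea in this reply can be sketched as follows. This
is purely illustrative: the variable stands in for one feature's ZAP entry
in features_for_read, and a hypothetical RAIDZ3-style feature is assumed.

```c
#include <assert.h>

/* Simplified stand-in for one feature's value in the features_for_read
 * ZAP object: a count of on-disk entities (here, vdevs) using it. */
static unsigned long long raidz3_refcount;

/* "zpool upgrade" enables the feature: value 0 = enabled, not in use. */
static void
feature_enable(void)
{
	raidz3_refcount = 0;
}

/* Adding or removing a vdev that uses the feature adjusts the count. */
static void
feature_incr(void)
{
	raidz3_refcount++;
}

static void
feature_decr(void)
{
	raidz3_refcount--;
}

/* Software without RAIDZ3 support can still read the pool while the
 * count is zero, since "nonzero" is what means "in use". */
static int
readable_by_old_software(void)
{
	return (raidz3_refcount == 0);
}
```

The point of keeping a count rather than a plain flag is that the feature
can drop back to "enabled but not in use" when the last user goes away,
restoring compatibility with older software.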
Peter Jeremy
2011-May-25 22:08 UTC
[zfs-discuss] ZFS working group and feature flags proposal
On 2011-May-26 03:02:04 +0800, Matthew Ahrens <mahrens at delphix.com> wrote:
> The first product of the working group is the design for a ZFS on-disk
> versioning method that will allow for distributed development of ZFS
> on-disk format changes without further explicit coordination. This
> method eliminates the problem of two developers both allocating
> version number 31 to mean their own feature.

Looks good.

> pool open ("zpool import" and implicit import from zpool.cache)
>     If the pool is at SPA_VERSION_FEATURES, we must check for feature
>     compatibility. First we will look through entries in the label
>     nvlist's features_for_read. If there is a feature listed there
>     which we don't understand, and it has a nonzero value, then we can
>     not open the pool.

Is it worth splitting the "feature used" value into "optional" and
"mandatory"? (Possibly with the ability to have an "optional" read
feature be linked to a "mandatory" write feature.)

To use an existing example: dedup (AFAIK) does not affect read code
and so could show up as an optional read feature but a mandatory write
feature (though I suspect this could equally be handled by just
listing it in "features_for_write").

As a more theoretical example, consider OS X resource forks. The
presence of a resource fork matters for both read and write on OS X
but nowhere else. A (hypothetical) ZFS port to OS X would want to know
whether the pool contained resource forks even if opened R/O, but this
should not stop a different ZFS port from reading (and maybe even
writing to) the pool.

-- 
Peter Jeremy
Matthew Ahrens
2011-May-25 22:44 UTC
[zfs-discuss] ZFS working group and feature flags proposal
On Wed, May 25, 2011 at 3:08 PM, Peter Jeremy
<peter.jeremy at alcatel-lucent.com> wrote:
> On 2011-May-26 03:02:04 +0800, Matthew Ahrens <mahrens at delphix.com> wrote:
>
> Looks good.

Thanks for taking the time to look at this. More comments inline below.

> Is it worth splitting the "feature used" value into "optional" and
> "mandatory"? (Possibly with the ability to have an "optional" read
> feature be linked to a "mandatory" write feature.)
>
> To use an existing example: dedup (AFAIK) does not affect read code
> and so could show up as an optional read feature but a mandatory write
> feature (though I suspect this could equally be handled by just
> listing it in "features_for_write").

I'm not sure I understand the "optional" idea. How would an "optional
read feature" change the behavior, as opposed to just being listed in
"features_for_write"? If dedup'd pools can be read by old code that
doesn't understand dedup, then the dedup feature should be listed in
"features_for_write", and not "features_for_read".

> As a more theoretical example, consider OS X resource forks. The
> presence of a resource fork matters for both read and write on OS X
> but nowhere else. A (hypothetical) ZFS port to OS X would want to know
> whether the pool contained resource forks even if opened R/O, but this
> should not stop a different ZFS port from reading (and maybe even
> writing to) the pool.

A hypothetical resource fork feature would probably be a ZPL
(filesystem) feature, rather than a pool feature, but that doesn't
really change your question.
If the presence of a resource fork doesn't preclude old code from
reading or writing it, but the MacOS code needs to know "are there any
resource forks in this filesystem", then this can be handled in a way
specific to the resource fork code -- for example, with a new entry in
the MASTER_NODE_OBJ, which only the resource fork code would care
about; other code would ignore it. So this can be handled seamlessly,
outside the scope of the feature flags proposal.

However, it's more likely that old code would not be able to safely
write to a filesystem with resource forks (for example, it would need
to know how to free the resource fork when a file is removed). In this
case, resource forks would be a "feature for write". The MacOS code
could use features_for_write to determine the presence of resource
forks, even if opening the filesystem read-only.

--matt
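The last point above, that a read-only opener can still consult
features_for_write, can be sketched as follows. A plain array stands in for
the features_for_write ZAP object, and "com.apple:resource_forks" is a
hypothetical feature name invented for this illustration.

```c
#include <assert.h>
#include <string.h>

/* One entry of a (simplified) features_for_write map. */
struct feature_ref {
	const char *name;		/* <reverse-dns>:<short-name> */
	unsigned long long refcount;	/* nonzero iff the feature is in use */
};

/* Answer "is this feature present and in use?" from the map. Even a
 * read-only opener can ask this of features_for_write; it just doesn't
 * have to *support* the features listed there. */
static int
feature_in_use(const struct feature_ref *features, size_t n,
    const char *name)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(features[i].name, name) == 0)
			return (features[i].refcount != 0);
	}
	return (0);
}
```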