Hello all,

The idea of dedicated metadata devices (likely SSDs) for ZFS has been discussed generically a number of times on this list, but I don't think I've seen a final proposal that someone would take up for implementation (at least not as public source code). I'd like to take the liberty of summarizing the ideas I've either seen in discussions or proposed myself on this matter, to see whether the overall idea makes sense to the gurus of ZFS architecture.

So, the assumption was that the performance killer in ZFS, at least on smallish deployments (a few HDDs and an SSD accelerator) like Home-NAS types of boxes, is random IO to lots of metadata. This IMHO includes primarily the block pointer tree and the DDT for those who risked using dedup. I am not sure how frequently the other types of metadata (dataset descriptors, etc.) need to be read, such that occasional reading and caching wouldn't suffice.

Another idea was that L2ARC caching might not really cut it for metadata in comparison to dedicated metadata storage, partly because the L2ARC becomes empty upon every export/import (boot) and needs to be re-heated.

So, here are the highlights of the proposal (up for discussion).

In short, the idea is to use today's format of the blkptr_t, which allows up to 3 DVA addresses per block, while many types of metadata use only 2 copies (at least by default). The new feature adds a specially processed TLVDEV in the common DVA address space of the pool, and enforces storage of the added third copies of certain types of metadata blocks on these devices. (Limited) backwards compatibility is quite possible; an on-disk format change may not even be required. The proposal also addresses some questions that arose in previous discussions, especially about proposals where SSDs would be the only storage for the pool's metadata:

* What if the dedicated metadata device overflows?
* What if the dedicated metadata device breaks?
= okay/expected by design, nothing dies.

In more detail:

1) Add a special Top-Level VDEV (TLVDEV below) device type (like "cache" and "log" - say, "metaxel" for "metadata accelerator"?), and allow (even encourage) the use of mirrored devices, plus expansion (raid0, raid10 and/or separate TLVDEVs) with added singlets/mirrors of such devices.

How the device type is defined for the pool is discussable; I'd go with a special attribute (array) or nvlist in the pool descriptor, rather than some special type ID in the ZFS label (backwards compatibility, see point 4 for the detailed rationale).

Discussable: enable pool-wide or per-dataset (i.e. don't waste accelerator space and lifetime on rarely-reused datasets like rolling backups)? Choose what to store on (particular) metaxels - DDT, BP tree, something else? Overall, this availability of choice is similar to the choice of modes for ARC/L2ARC caching or enabling the ZIL per-dataset...

2) These devices should be formally addressable as part of the pool in DVA terms (tlvdev:offset:size), but writes onto them are artificially limited by the ZFS scheduler so as to only allow specific types of metadata blocks (blkptr_t's, DDT entries), and also to enforce writing of the added third copies (for metadata blocks with the usual copies=2) onto these devices.

3) Absence or "FAULTEDness" of this device should not be fatal to the pool, but it may require manual intervention to force the import. In particular, removal, replacement or resilvering onto different storage (i.e. migrating to larger SSDs) should be supported in the design.
Besides experimentation and migration concerns, this approach should also ease replacement of SSDs used for metadata in case of their untimely fatal failures - and this may be a concern for many SSD deployments, which are increasingly susceptible to write wear and ultimate death (at least in the cheaper, bulkier range that is a likely component of Home-NAS solutions).

4) For backwards compatibility, to older versions of ZFS this device should look like a normal single-disk or mirror TLVDEV containing blocks addressed within the common pool DVA address space. This should have no effect on read-only imports. However, other ZFS releases likely won't respect the filtering and alignment limitations normally enforced for the device in this design, and can "contaminate" the device with other types of blocks (and would refuse to import the pool if the device is missing/faulted).

5) ZFS reads should be tweaked to first consult the copy of metadata blocks on the metadata accelerator device, and only use spinning rust (ordinary TLVDEVs) if there are errors (checksum mismatches, missing devices, etc.) or during scrubs and similar tasks which require full reads of the pool's addressed blocks.

Prioritized reads from this metadata accelerator won't need a special bit in the blkptr_t (as is done for the dedup bit) - the TLVDEV number in the DVA already points to a known TLVDEV identifier, which we know is a metaxel.

6) ZFS writes onto this storage should take into account the increased block size (likely 4-8 KB for either current HDDs or SSDs) and the coalescing and pagination required to reduce SSD wear-out. This might be a tweakable component of the scheduler, which could be disabled if different media are used and this scheduling is not needed (small-sectored HDDs, DDR, SSDs of the future), but the default writing mode today should expect SSDs.

7) A special scrub-like tool should be added to walk the pool's block tree and rewrite the existing block pointers (and I am not sure this is as problematic as the generic BP rewrite - if needed, the task can be done once offline, for example). By definition this is a restartable task (initiating a new tree walk), so it should be pausable or abortable as well.

As the result, new copies of metadata blocks would be created on the accelerator device, while the two old copies remain in place. In case the metaxel TLVDEV has been "contaminated" by other block types (by virtue of read-write import and usage by ZFS implementations not supporting this feature), those blocks should be relocated onto the main HDD pool.

One thing to discuss: what should be done for metadata that already has three copies on HDDs? Should one copy be disposed of and recreated on the accelerator?

8) If the metaxel devices fill up, the "overflowing" metadata blocks may simply not get third copies (only the standard HDD-based copies are written). If the metaxel device is later freed up (by deletion of data and release of the block pointers) or expanded (or another one is added), then another run of the "scrub-like" procedure from point 7 can add the missing copies.

9) Also, the solution should allow discarding and recreating the copies of block pointers on the accelerator TLVDEV in case it fails fatally or is replaced by new, empty media. Unlike usual data, where loss of a TLVDEV is considered fatal to the pool, in this case we know we have (redundant) copies of these blocks on other media.
If the new TLVDEV is at least as big as the failed one, the pre-recorded accelerator tlvdevid:offsets in the HDD-based copies of the block pointers can be used to re-instantiate the copies on the metaxel, just as scrub or in-flight repairs happen on usual pools (rewriting corrupt blocks in place without having to change the BP tree at all). In this case the tlvdevid part of the DVA can (should) remain unchanged.

For new metaxels smaller than the replaced one, new DVA allocations might be required. To enforce this and avoid mix-ups, the tlvdevid should change to some new unique number, and the BP tree gets rewritten as in point 7.

10) If these metaxel devices are used (and known to be SSDs?), then (SSD-based) L2ARC caches should not be used for the metadata blocks readily available from the metaxel. Guess: this might in fact reduce the overheads of dedup, where pushing blocks into L2ARC only halves the needed RAM footprint. With an SSD metaxel we can just drop unneeded DDT entries from the RAM ARC and quickly get them back from stable storage when needed.

I hope I covered all or most of what I think on this matter; discussion (and ultimately open-sourced implementations) is most welcome ;)

HTH,
//Jim Klimov
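For readers less familiar with the structures the proposal leans on, here is an abridged, stand-alone sketch of the block pointer and DVA layout (simplified from the real definitions in sys/spa.h, not verbatim; vdev_id_is_metaxel() is invented here purely to illustrate point 5's read bias):

/*
 * Simplified sketch of the on-disk structures the proposal builds on
 * (abridged; not the verbatim headers).  A block pointer already has
 * room for three DVAs, and a DVA already names its top-level vdev, so
 * no new on-disk bit is needed to recognize "this copy lives on the
 * accelerator".
 */
#include <stdint.h>

typedef struct dva {            /* Data Virtual Address */
	uint64_t dva_word[2];   /* packs tlvdev id, offset and asize */
} dva_t;

typedef struct blkptr {
	dva_t    blk_dva[3];    /* room for up to three copies */
	uint64_t blk_prop;      /* size, type, checksum, compression */
	/* ... birth txgs, fill count, 256-bit checksum ... */
} blkptr_t;

/* Hypothetical: the pool would remember which tlvdev ids are metaxels. */
extern int vdev_id_is_metaxel(uint64_t tlvdev_id);

static int
dva_on_metaxel(const dva_t *dva)
{
	/* The real code extracts the vdev id via the DVA_GET_VDEV() macro. */
	uint64_t tlvdev_id = dva->dva_word[0] >> 32;    /* simplified */

	return (vdev_id_is_metaxel(tlvdev_id));
}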
This is something I've been looking into in the code, and my take on your proposed points is this:

1) This requires many and deep changes across much of ZFS's architecture (especially the ability to sustain tlvdev failures).

2) Most of this can be achieved (except for cache persistency) by implementing ARC space reservations for certain types of data.

The latter has the added benefit of spreading load across all ARC and L2ARC resources, so your metaxel device never becomes the sole bottleneck, and it better embraces the ZFS design philosophy of pooled storage.

I plan on having a look at implementing cache management policies (which would allow tuning space reservations for metadata etc. in a fine-grained manner, without the cruft of having to worry about physical cache devices as well).

Cheers,
--
Saso
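For comparison with the metaxel approach, a minimal sketch of what such a per-type reservation ("cache management policy") could look like - everything below is hypothetical, since no structure of this kind exists in the ARC today:

/*
 * Purely hypothetical sketch of a per-data-type ARC reservation policy;
 * the names and field set are illustrative only.
 */
#include <stdint.h>

typedef enum arc_kind {
	ARC_KIND_DATA,          /* ordinary file/zvol data */
	ARC_KIND_METADATA,      /* indirect blocks, dnodes, ... */
	ARC_KIND_DDT            /* dedup table entries */
} arc_kind_t;

typedef struct arc_policy {
	arc_kind_t	ap_kind;
	uint64_t	ap_reserve;     /* bytes guaranteed to this kind */
	uint64_t	ap_limit;       /* cap on bytes this kind may hold */
} arc_policy_t;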
On Aug 24, 2012, at 6:50 AM, Sašo Kiselkov wrote:

> This is something I've been looking into in the code, and my take on
> your proposed points is this:
>
> 1) This requires many and deep changes across much of ZFS's
> architecture (especially the ability to sustain tlvdev failures).
>
> 2) Most of this can be achieved (except for cache persistency) by
> implementing ARC space reservations for certain types of data.

I think the simple solution of increasing the default metadata limit above 1/4 of arc_max will take care of the vast majority of small-system complaints. The limit is arbitrary and was set well before dedup was delivered.

> On 08/24/2012 03:39 PM, Jim Klimov wrote:
>> So, the assumption was that the performance killer in ZFS, at least
>> on smallish deployments (a few HDDs and an SSD accelerator) like
>> Home-NAS types of boxes, is random IO to lots of metadata.

It is a bad idea to make massive investments in development and testing because of an assumption. Build test cases, prove that the benefits of the investment can outweigh other alternatives, and then deliver code.
 -- richard
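For anyone who wants to experiment with that suggestion before any code changes, the limit is already exposed as a tunable on illumos/Solaris-derived systems (names and defaults vary between releases, so treat this as an illustrative fragment; the value shown is an arbitrary 8 GiB):

* /etc/system fragment: raise the ARC metadata limit (the default is
* arc_c_max / 4).  Example value only; takes effect after a reboot.
set zfs:zfs_arc_meta_limit = 0x200000000

The effect can then be watched via the zfs:0:arcstats kstat, which on these systems exposes counters such as arc_meta_used, arc_meta_limit and arc_meta_max.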
First of all, thanks for reading and discussing! :)

2012-08-24 17:50, Sašo Kiselkov wrote:
> This is something I've been looking into in the code, and my take on
> your proposed points is this:
>
> 1) This requires many and deep changes across much of ZFS's
> architecture (especially the ability to sustain tlvdev failures).

I'd trust the expert; from the outside it did not seem like a very deep change - at least if, for the first POC tests, we leave out the rewriting of existing block pointers (to store copies of existing metadata on an SSD) and the resilience to failures and absence of METAXELs.

Basically, for a POC implementation we could just take a regular top-level VDEV, forced to be a single disk or mirror, and add some hint describing it as a METAXEL component of the pool, so that the ZFS kernel restricts what gets written there (for new metadata writes) and prioritizes reads from it (fetch metadata from METAXELs, unless there is no copy on a known METAXEL or the copy is corrupted).

The POC as outlined would be useful to estimate the benefits and impacts of the solution, and as with "BP rewrite", the more advanced features might be delayed by a few years - so even the POC could easily be a useful solution for many of us, especially if applied to new pools from TXG=0. "There is nothing as immortal as a temporary solution" ;)

> 2) Most of this can be achieved (except for cache persistency) by
> implementing ARC space reservations for certain types of data.
>
> The latter has the added benefit of spreading load across all ARC and
> L2ARC resources, so your metaxel device never becomes the sole
> bottleneck, and it better embraces the ZFS design philosophy of
> pooled storage.

Well, we already have somewhat non-pooled ZILs and L2ARCs. Or, rather, they are in sub-pools of their own, reserved for specific tasks to optimize and speed up the ZFS storage subsystem in the face of particular problems. My proposal does indeed add another sub-pool for another such task (and nominally METAXELs are part of the common pool - more so than cache and log devices are today), and it explicitly allows for adding several METAXELs or raid10'ing them (which addresses the bottleneck question). On larger systems, this metadata storage might sit behind a different SAS controller on a separate PCI bus, further boosting performance and reducing bottlenecks. Unlike the L2ARC, METAXELs can be N-way mirrored, so instances are available in parallel from several controllers and lanes - further boosting IO and the reliability of metadata operations.

However, unlike the L2ARC in general, here we know our "target audience" better, so we can optimize for one particular, useful situation: gigabytes worth of data in small portions (sized from 512 bytes to 8 KB, IIRC?), stored quite randomly and read often relative to the amount of writes.

Regarding size in particular: with 128K blocks and 512-byte BP entries, the minimum overhead for a single copy of the BP-tree metadata is 1/256 (not counting the rest of the tree, dataset labels, etc.). So for each 1 TB of written ZFS pool userdata we get at least 4 GB of metadata just for the block pointer tree (likely more in reality). For practical Home-NAS pools of about 10 TB this warrants about 60 GB (give or take an order of magnitude) of SSD dedicated to casual metadata without even a DDT, be it generic L2ARC or an optimized METAXEL.
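As a quick sanity check of that lower bound, here is a trivial stand-alone calculation (it assumes every block is a full 128 KiB with one 512-byte block pointer per copy, so real pools with smaller blocks will need proportionally more):

/*
 * Back-of-the-envelope sizing of the block-pointer-tree footprint,
 * following the 1/256 ratio quoted above.  The pool size is an
 * arbitrary illustration.
 */
#include <stdio.h>

int
main(void)
{
	double pool_tib = 10.0;			/* example: a 10 TiB Home-NAS pool */
	double blocksize = 128.0 * 1024.0;	/* assumed average block size */
	double bp_size = 512.0;			/* one block-pointer copy */
	double meta_gib = pool_tib * 1024.0 * (bp_size / blocksize);

	printf("~%.0f GiB of BP-tree metadata per copy for a %.0f TiB pool\n",
	    meta_gib, pool_tib);
	return (0);
}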
The hoped-for tradeoffs of dedicating a storage device (or several) to this one task are: no need to re-heat the cache every time with gigabytes that are known to be needed again and again, even if only to boost weekly scrubs, some RAM ARC savings, and freeing the L2ARC for the tasks it is more efficient at (generic, larger blocks). By eliminating many small random IOs to spinning rust, we win on HDD performance and arguably on power consumption and longevity (less mechanical overhead and delay per overall amount of transferred gigabytes).

There is also a relatively large RAM pointer overhead for storing small pieces of data (such as metadata blocks of one or a few sectors) in the L2ARC, which I expect to be eliminated by storing and using these blocks directly from the pool (on SSD METAXELs) - giving both SSD-fast access to the blocks and no expiration into the L2ARC and back with inefficiently-sized ARC pointers to remember. I guess a METAXEL might indeed be cheaper and faster than the L2ARC for this particular use case (metadata).

Also, this way the true L2ARC device would be more available to "real" userdata, which is likely to use larger blocks - improving the benefits from your L2ARC compression features as well as reducing the overhead percentage for ARC pointers; and, being a random selection of the pool's blocks, the userdata is too unpredictable to accelerate well by other means (short of a full-SSD pool).

Also, this bulky amount of bytes (BP tree, DDT) is essentially required for fast operation of the overall pool, and it is not some unpredictable random set of blocks as is expected for usual cacheable data - so why keep reheating it into the cache upon every restart (and the greener home-NAS users might power down their boxes when not in use to save on power bills, so reheating the L2ARC is frequent), needlessly wearing it out with writes, and in any case carving this amount of bytes out of the usual L2ARC data and RAM ARC as well?

The DDT also, hopefully, won't have the drastic impacts we see today on the budgeting and/or performance of smaller machines (like your HP Microserver with 8 GB RAM tops) with dedup enabled, because DDT entries can be quickly re-read from SSD and need not consume RAM after they are expired from the ARC as they do today - thus freeing it for more efficient caching of real data or for pointers to bulkier userdata in the L2ARC.

Finally, even scrubs should be faster - besides checking the integrity of the on-HDD copies of metadata blocks, the system won't need to read them with slower access times in order to find the addresses and checksums of the bulk of the userdata. Scrubs are thus likely to become more sequential and fast with little to no special coding for this boost.

What do you think?
//Jim Klimov
2012-08-24 17:39, Jim Klimov wrote:
> Hello all,
>
> The idea of dedicated metadata devices (likely SSDs) for ZFS has been
> discussed generically a number of times on this list, but I don't
> think I've seen a final proposal that someone would take up for
> implementation (at least not as public source code).

Hmmm... now that I think of it, this solution might also eat some of the cake of dedicated ZIL devices: having an SSD with in-pool metadata, we could send sync writes (for metadata blocks) straight to the METAXEL SSD, while the TXG sync would flush out their HDD-based counterparts and other (async) data. In case of a pool import after breakage, we would just need to repair the last uncommitted TXG's worth of recorded metadata on the METAXEL...

Also, I now think that directories, ACLs and such (POSIX fs-layer metadata) should have a copy on the METAXEL SSDs too.

This certainly does not replace the ZIL in general, but unlike the rolling "write a lot, read never" ZIL approach, this would actually write the needed data onto the pool with low latency and (hopefully) won't abuse the flash cells needlessly. Expert opinion and/or tests could confirm my guess that this could provide a "good enough" boost for some modes of sync writes (NFS, maybe?) that are heavier on metadata updates than on userdata, so the tight-on-budget deployments now contemplating costly dedicated ZIL devices might no longer require one, or not need it as heavily.

Is there any sanity to this? ;)

Thanks,
//Jim Klimov
Oh man, that's a million-billion points you made. I'll try to run through each quickly.

On 08/24/2012 05:43 PM, Jim Klimov wrote:
> First of all, thanks for reading and discussing! :)

No problem at all ;)

> 2012-08-24 17:50, Sašo Kiselkov wrote:
>> 1) This requires many and deep changes across much of ZFS's
>> architecture (especially the ability to sustain tlvdev failures).
>
> I'd trust the expert; from the outside it did not seem like a very
> deep change - at least if, for the first POC tests, we leave out the
> rewriting of existing block pointers (to store copies of existing
> metadata on an SSD) and the resilience to failures and absence of
> METAXELs.

The initial set of change areas I can identify, even for the stripped-down version of your proposal, is:

*) implement a new vdev type (mirrored or straight metaxel)
*) integrate all format changes to labels to describe these
*) alter the block allocator strategy so that if there are metaxels present, we utilize those
*) alter the metadata fetch points (of which there are many) to preferably fetch from metaxels when possible, or fall back to main-pool copies
*) make sure that the previous two points play nicely with copies=X

The other points you mentioned, i.e. fault resiliency, block-pointer rewrite and other stuff, are another mountain of work with an even higher mountain of testing to be done on all possible combinations.

> Basically, for a POC implementation we could just take a regular
> top-level VDEV, forced to be a single disk or mirror, and add some
> hint describing it as a METAXEL component of the pool, so that the
> ZFS kernel restricts what gets written there (for new metadata
> writes) and prioritizes reads from it (fetch metadata from METAXELs,
> unless there is no copy on a known METAXEL or the copy is corrupted).

As noted before, you'll have to go through the code looking for paths which fetch metadata (mostly the object layer) and replace those with metaxel-aware calls. That's a lot of work for a POC.

> The POC as outlined would be useful to estimate the benefits and
> impacts of the solution, and as with "BP rewrite", the more advanced
> features might be delayed by a few years - so even the POC could
> easily be a useful solution for many of us, especially if applied to
> new pools from TXG=0.

I wish I had all the time to implement it, but alas, I'm just a zfs n00b and am not doing this for a living :-)

>> 2) Most of this can be achieved (except for cache persistency) by
>> implementing ARC space reservations for certain types of data.
>>
>> The latter has the added benefit of spreading load across all ARC
>> and L2ARC resources, so your metaxel device never becomes the sole
>> bottleneck, and it better embraces the ZFS design philosophy of
>> pooled storage.
>
> Well, we already have somewhat non-pooled ZILs and L2ARCs.

Yes, that's because these have vastly different performance properties from main-pool storage. However, metaxels and cache devices are essentially the same (many small random reads, infrequent large async writes).

> Or, rather, they are in sub-pools of their own, reserved for specific
> tasks to optimize and speed up the ZFS storage subsystem in the face
> of particular problems.

Exactly.
The difference between a metaxel and a cache device, however, is cosmetic.

> My proposal does indeed add another sub-pool for another such task
> (and nominally METAXELs are part of the common pool - more so than
> cache and log devices are today), and it explicitly allows for adding
> several METAXELs or raid10'ing them (which addresses the bottleneck
> question).

The problem regarding bottlenecking is that you're creating a new, separate island of resources which differs very little in performance requirements from cache devices, yet by separating them out artificially, you're creating a potential scalability barrier.

> On larger systems, this metadata storage might sit behind a different
> SAS controller on a separate PCI bus, further boosting performance
> and reducing bottlenecks. Unlike the L2ARC, METAXELs can be N-way
> mirrored, so instances are available in parallel from several
> controllers and lanes - further boosting IO and the reliability of
> metadata operations.

How often do you expect cache devices to fail? I mean, we're talking about a one-off occasional event that doesn't even mean data loss (only a little performance loss, especially if you use multiple cache devices). And since you're proposing mirroring metaxels, you are essentially going to be continuously doing twice the write work for a 50% reduction in read performance from the vdev in case of a device failure. If you just used both devices as cache, you'd get a 100% speedup in read AND write performance (and if you lose one cache device, you've still got 50% of your cache data available). So to sum up, you're applying RAID to something that doesn't need it.

> However, unlike the L2ARC in general, here we know our "target
> audience" better, so we can optimize for one particular, useful
> situation: gigabytes worth of data in small portions (sized from 512
> bytes to 8 KB, IIRC?), stored quite randomly and read often relative
> to the amount of writes.

The L2ARC also knows its target audience well, and it's nearly identical to what you've described. The L2ARC doesn't cache prefetched buffers or streaming workloads (unless you instruct it to). It's there merely to serve as a low-latency random-read accelerator.

> Regarding size in particular: with 128K blocks and 512-byte BP
> entries, the minimum overhead for a single copy of the BP-tree
> metadata is 1/256 (not counting the rest of the tree, dataset
> labels, etc.).

Block pointers are actually much smaller; ZFS groups them into gang blocks if it needs to store multiple of them below SPA_MINBLOCKSIZE (hope I remember that macro's name right).

> So for each 1 TB of written ZFS pool userdata we get at least 4 GB of
> metadata just for the block pointer tree (likely more in reality).
> For practical Home-NAS pools of about 10 TB this warrants about 60 GB
> (give or take an order of magnitude) of SSD dedicated to casual
> metadata without even a DDT, be it generic L2ARC or an optimized
> METAXEL.

And how is that different from having a cache-sizing policy which selects how much each data type gets allocated from a single common cache?

> The hoped-for tradeoffs of dedicating a storage device (or several)
> to this one task are: no need to re-heat the cache every time with
> gigabytes that are known to be needed again and again,

Agreed, the persistency would be nice to have, and in fact it might be a lot easier to implement (I've already thought about how to do this, but that's a topic for another day).

> even if only to boost weekly scrubs, some RAM ARC savings, and
> freeing the L2ARC for the tasks it is more efficient at (generic,
> larger blocks).

Scrubs will populate your cache anyway, so only the first one will be slow; the next one will be much faster. Also, you're wrong if you think the clientele of the L2ARC and the metaxel would be different - it most likely wouldn't.

> By eliminating many small random IOs to spinning rust, we win on HDD
> performance and arguably on power consumption and longevity (less
> mechanical overhead and delay per overall amount of transferred
> gigabytes).

No disagreement there.

> There is also a relatively large RAM pointer overhead for storing
> small pieces of data (such as metadata blocks of one or a few
> sectors) in the L2ARC, which I expect to be eliminated by storing and
> using these blocks directly from the pool (on SSD METAXELs) - giving
> both SSD-fast access to the blocks and no expiration into the L2ARC
> and back with inefficiently-sized ARC pointers to remember.

You'd still need to reference metaxel data from the ARC, so your savings would be very small. ZFS is already pretty efficient there.

> I guess a METAXEL might indeed be cheaper and faster than the L2ARC
> for this particular use case (metadata). Also, this way the true
> L2ARC device would be more available to "real" userdata, which is
> likely to use larger blocks - improving the benefits from your L2ARC
> compression features as well as reducing the overhead percentage for
> ARC pointers; and, being a random selection of the pool's blocks, the
> userdata is too unpredictable to accelerate well by other means
> (short of a full-SSD pool).

While compression indeed works much better on larger blocks, I hardly think the proportion of metadata to regular data is significant enough to warrant taking it out of the compression datastream. At worst it's a few percent of compression overhead - in fact, my current implementation of L2ARC compression already checks the block size and refuses to compress blocks smaller than ~2048 bytes.

> Also, this bulky amount of bytes (BP tree, DDT) is essentially
> required for fast operation of the overall pool, and it is not some
> unpredictable random set of blocks as is expected for usual cacheable
> data - so why keep reheating it into the cache upon every restart
> (and the greener home-NAS users might power down their boxes when not
> in use to save on power bills, so reheating the L2ARC is frequent),
> needlessly wearing it out with writes, and in any case carving this
> amount of bytes out of the usual L2ARC data and RAM ARC as well?
> The DDT also, hopefully, won't have the drastic impacts we see today
> on the budgeting and/or performance of smaller machines (like your HP
> Microserver with 8 GB RAM tops) with dedup enabled, because DDT
> entries can be quickly re-read from SSD and need not consume RAM
> after they are expired from the ARC as they do today - thus freeing
> it for more efficient caching of real data or for pointers to bulkier
> userdata in the L2ARC.

So essentially that's an argument for L2ARC persistency. As I said, it can be done (and more easily than using metaxels).

> Finally, even scrubs should be faster - besides checking the
> integrity of the on-HDD copies of metadata blocks, the system won't
> need to read them with slower access times in order to find the
> addresses and checksums of the bulk of the userdata. Scrubs are thus
> likely to become more sequential and fast with little to no special
> coding for this boost.

See above. All of this can be solved by cache sizing policies and L2ARC persistency.

Cheers,
--
Saso
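To make the allocator-strategy item listed earlier in this message a little more concrete, the write path could conceivably branch on block type when picking an allocation class. A rough, hypothetical sketch (metaslab_class_t, spa_normal_class() and DMU_OT_IS_METADATA() are existing primitives such a change would likely lean on; spa_metaxel_class() does not exist and is invented here for illustration):

/*
 * Hypothetical sketch only (assumes the illumos <sys/spa.h>,
 * <sys/metaslab.h> and <sys/dmu.h> headers).  Route the extra copy of
 * metadata to a metaxel allocation class when one is configured,
 * otherwise fall back to the normal class.
 */
static metaslab_class_t *
pick_alloc_class(spa_t *spa, dmu_object_type_t ot, int copy_index)
{
	/* spa_metaxel_class() is invented for this sketch. */
	if (copy_index == 2 && DMU_OT_IS_METADATA(ot) &&
	    spa_metaxel_class(spa) != NULL)
		return (spa_metaxel_class(spa));

	return (spa_normal_class(spa));
}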
2012-08-25 0:42, Sašo Kiselkov wrote:
> Oh man, that's a million-billion points you made. I'll try to run
> through each quickly.

Thanks... I still do not have the feeling that you've fully got my idea - or, alternately, that I correctly understand the ARC :)

>> There is also a relatively large RAM pointer overhead for storing
>> small pieces of data (such as metadata blocks of one or a few
>> sectors) in the L2ARC, which I expect to be eliminated by storing
>> and using these blocks directly from the pool (on SSD METAXELs) -
>> giving both SSD-fast access to the blocks and no expiration into
>> the L2ARC and back with inefficiently-sized ARC pointers to remember.

...And these counter-arguments are probably THE point of deviation:

> However, metaxels and cache devices are essentially the same
> (many small random reads, infrequent large async writes).
> The difference between a metaxel and a cache device, however, is
> cosmetic.

> You'd still need to reference metaxel data from the ARC, so your
> savings would be very small. ZFS is already pretty efficient there.

No, you don't! "Republic credits WON'T do fine!" ;)

The way I understood the ARC (without/before L2ARC), it either caches pool blocks or it doesn't. More correctly, there is also a cache of ghosts without the bulk block data, so we can account for misses of recently expired blocks of either of the two categories, and thus adjust the cache subdivision towards MRU or MFU. Ultimately, the ghosts which were not requested also expire from the cache, and no reference to a recently-cached block remains.

With the L2ARC, on the other hand, there is a list of pointers in the ARC so that it knows which blocks were cached on the SSD - and the lack of this list upon pool import is, in effect, the perceived emptiness of the L2ARC device. The L2ARC's pointers are of comparable size to the small metadata blocks, and *this* consideration IMHO makes it much more efficient to use the L2ARC for larger cached blocks, especially on systems with limited RAM (which effectively limits the addressable L2ARC size as accounted in the number of blocks), with the added benefit that you can compress larger blocks in the L2ARC.

This way, the *difference* between the L2ARC and a METAXEL is that the latter is an ordinary pool tlvdev with a specially biased read priority and write filter. If a metadata block is read, it goes into the ARC. If it expires - then there's a ghost for a while, and soon there is no memory that this block was ever cached - unlike the L2ARC's list of pointers, which are just a couple of times smaller than a cached block of this type. But re-fetching metadata from an SSD METAXEL is faster when it is needed again.

> Also, you're wrong if you think the clientele of the L2ARC and the
> metaxel would be different - it most likely wouldn't.

This only stresses the problem with the L2ARC's shortcomings for metadata, the way I see them (if they do indeed exist), and in particular that it chews your RAM a lot more than it could or should for a mechanism meant to increase caching efficiency. If their clientele is indeed similar, and if metaxels would be more efficient for metadata storage, then you might not need the L2ARC with its overheads, or not as much of it, and get a clear win in system resource consumption ;)

> How often do you expect cache devices to fail?
From what I hear, the life expectancy of today's consumer-grade devices is small (1-3 years) under heavy writes - and the L2ARC would likely exceed a METAXEL's write rates, due to the need to write the same metadata into the L2ARC time and again, if it were not for the special throttling that limits L2ARC write bandwidth.

> So to sum up, you're applying RAID to something that doesn't
> need it.

Well, metadata is kinda important - though here we do add a third copy where two previously sufficed. And you're not "required" to mirror it. On the other hand, if a METAXEL is a top-level vdev without the special resilience to its failure/absence described in my first post, then its failure would formally be considered a fatal situation and bring down the whole pool - unlike problems with L2ARC or ZIL devices, which can be ignored at the admin's discretion.

> And how is that different from having a cache-sizing policy
> which selects how much each data type gets allocated from
> a single common cache?
...
> All of this can be solved by cache sizing policies and
> L2ARC persistency.

Ultimately, I don't disagree with this point :) But I do think that this might not be the optimal solution in terms of RAM requirements, coding complexity, etc. If you want to store some data long-term - such as my desire to store the metadata - ZFS has mechanisms for that in the form of normal VDEVs (or a subclass of them, the metaxels) ;)

> *) implement a new vdev type (mirrored or straight metaxel)
> *) integrate all format changes to labels to describe these

One idea in the proposal - though I don't insist on sticking to it - is that the metaxel's job is described in the pool metadata (i.e. a read-only attribute which can be set during tlvdev creation/addition - metaxels:list-of-guids). Until the pool is imported, a metaxel looks like a normal single-disk/mirrored tlvdev in a normal pool.

This approach can limit the importability of a pool with failed metaxels, unless we expect that and try to make sense of the other pool devices - essentially until we can decipher the nvlist and see that the absent device is a metaxel, so the error is deemed non-fatal. However, the way I see it, this also requires no label changes or other incompatible on-disk format changes. As long as the metaxel is not faulted, any other ZFS implementation (like grub or an older livecd) can import this pool and read 1/3 of the metadata faster, on average ;)

> As noted before, you'll have to go through the code looking for paths
> which fetch metadata (mostly the object layer) and replace those with
> metaxel-aware calls. That's a lot of work for a POC.

Alas, for some years now I've been a lot less of a programmer and a lot more of a brainstormer ;) Still, judging from whatever experience I have, a working POC with some corners cut might be a matter of a week or two of coding... just to see whether the expected benefits over the L2ARC actually exist. The full-scale thing, yes, might take months or years even from a team of programmers ;)

Thanks,
//Jim
On 08/25/2012 12:22 AM, Jim Klimov wrote:
> Thanks... I still do not have the feeling that you've fully got my
> idea - or, alternately, that I correctly understand the ARC :)

Could be I misunderstood you, it's past midnight here...

> The way I understood the ARC (without/before L2ARC), it either caches
> pool blocks or it doesn't. More correctly, there is also a cache of
> ghosts without the bulk block data, so we can account for misses of
> recently expired blocks of either of the two categories, and thus
> adjust the cache subdivision towards MRU or MFU. Ultimately, the
> ghosts which were not requested also expire from the cache, and no
> reference to a recently-cached block remains.

Correct so far.

> With the L2ARC, on the other hand, there is a list of pointers in the
> ARC so that it knows which blocks were cached on the SSD - and the
> lack of this list upon pool import is, in effect, the perceived
> emptiness of the L2ARC device. The L2ARC's pointers are of comparable
> size to the small metadata blocks,

No, they're not. Here's l2arc_buf_hdr_t, the per-buffer structure held for buffers which were moved to the L2ARC:

typedef struct l2arc_buf_hdr {
	l2arc_dev_t	*b_dev;
	uint64_t	b_daddr;
} l2arc_buf_hdr_t;

That's about 16 bytes of overhead per block, or 3.125% if the block's data is 512 bytes long.

> and *this* consideration IMHO makes it much more efficient to use the
> L2ARC for larger cached blocks, especially on systems with limited
> RAM (which effectively limits the addressable L2ARC size as accounted
> in the number of blocks), with the added benefit that you can
> compress larger blocks in the L2ARC.

The main overhead comes from an arc_buf_hdr_t, which is pretty fat - around 180 bytes by a first-degree approximation - so in all around 200 bytes per ARC + L2ARC entry. At 512 bytes per block this is painfully inefficient (around 39% overhead); however, at a 4k average block size this drops to ~5%, and at a 64k average block size (which is entirely possible on average untuned storage pools) it drops down to ~0.3% overhead.

> This way, the *difference* between the L2ARC and a METAXEL is that
> the latter is an ordinary pool tlvdev with a specially biased read
> priority and write filter. If a metadata block is read, it goes into
> the ARC. If it expires - then there's a ghost for a while, and soon
> there is no memory that this block was ever cached - unlike the
> L2ARC's list of pointers, which are just a couple of times smaller
> than a cached block of this type.
> But re-fetching metadata from an SSD METAXEL is faster when it is
> needed again.

As explained above, the difference would be about 9% at best: sizeof(l2arc_buf_hdr_t) / sizeof(arc_buf_hdr_t) = 0.0888...

>> Also, you're wrong if you think the clientele of the L2ARC and the
>> metaxel would be different - it most likely wouldn't.
>
> This only stresses the problem with the L2ARC's shortcomings for
> metadata, the way I see them (if they do indeed exist), and in
> particular that it chews your RAM a lot more than it could or should
> for a mechanism meant to increase caching efficiency.

And as I demonstrated above, the savings would be negligible.

> If their clientele is indeed similar, and if metaxels would be more
> efficient for metadata storage, then you might not need the L2ARC
> with its overheads, or not as much of it, and get a clear win in
> system resource consumption ;)

Would it be a win? Probably. But the cost-benefit analysis suggests to me that it would probably simply not be worth the added hassle.

>> How often do you expect cache devices to fail?
>
> From what I hear, the life expectancy of today's consumer-grade
> devices is small (1-3 years) under heavy writes - and the L2ARC would
> likely exceed a METAXEL's write rates, due to the need to write the
> same metadata into the L2ARC time and again, if it were not for the
> special throttling that limits L2ARC write bandwidth.

Depending on your workload, L2ARC write throughput tends to get pretty low once you've cached in your working dataset. Remember, the L2ARC only caches random reads, so think stuff like databases, not linear copy operations. Once it's warmed up, it's pretty much read-only (assuming most of your working dataset fits in there).

>> So to sum up, you're applying RAID to something that doesn't
>> need it.
>
> Well, metadata is kinda important - though here we do add a third
> copy where two previously sufficed. And you're not "required" to
> mirror it. On the other hand, if a METAXEL is a top-level vdev
> without the special resilience to its failure/absence described in my
> first post, then its failure would formally be considered a fatal
> situation and bring down the whole pool - unlike problems with L2ARC
> or ZIL devices, which can be ignored at the admin's discretion.

Is doubly-redundant metadata not enough? Remember, if you've lost even a single vdev, your data is essentially toast, and doubly-redundant metadata is there essentially to try and save your ass by letting you copy off what remains readable (by making sure you have another metadata copy available somewhere else). If you've got a double-vdev failure, that's essentially considered a catastrophic pool failure.

>> And how is that different from having a cache-sizing policy
>> which selects how much each data type gets allocated from
>> a single common cache?
> ...
>> All of this can be solved by cache sizing policies and
>> L2ARC persistency.
>
> Ultimately, I don't disagree with this point :) But I do think that
> this might not be the optimal solution in terms of RAM requirements,
> coding complexity, etc. If you want to store some data long-term -
> such as my desire to store the metadata - ZFS has mechanisms for that
> in the form of normal VDEVs (or a subclass of them, the metaxels) ;)

How about we instead implemented L2ARC persistency?
That's a lot easier to do, and it would allow us to make all/most
caches persistent, not just the metadata cache.

>> *) implement a new vdev type (mirrored or straight metaxel)
>> *) integrate all format changes to labels to describe these
>
> One idea in the proposal - though I don't insist on sticking
> to it - is that the metaxel's job is described in the pool
> metadata (i.e. a read-only attribute which can be set during
> tlvdev creation/addition - metaxels:list-of-guids).
> Until the pool is imported, a metaxel looks like a normal
> single-disk/mirrored tlvdev in a normal pool.

Yeah, that would be workable, but the trouble is that when somebody
mounts the pool on an older version, they might allocate non-metadata
blocks there, resulting in an inconsistent metaxel state. That would
make implementing metaxel-failure resilience a lot harder. Plus,
you'll need to propagate information on the data type (metadata/normal
data) down to the spa layer - it might not be that hard, I haven't
looked into that code yet.

> This approach can limit importability of a pool with failed
> metaxels, unless we expect that and try to make sense of the
> other pool devices - essentially until we can decipher the
> nvlist and see that the absent device is a metaxel, so the
> error is deemed not fatal. However, this also requires no
> label changes or other incompatible on-disk format changes,
> the way I see it. As long as the metaxel is not faulted,
> any other ZFS implementation (like grub or an older livecd)
> can import this pool and read 1/3 of the metadata faster, on
> average ;)

Which is why I would propose to use cache sizing policies and possibly
persistent l2arc contents. A persistency-unaware host would simply use
the l2arc device as normal (so backwards compatibility wouldn't be an
issue), while newer hosts could happily coexist there.

>> As noted before, you'll have to go through the code to look for paths
>> which fetch metadata (mostly the object layer) and replace those with
>> metaxel-aware calls. That's a lot of work for a POC.
>
> Alas, for some years now I'm a lot less of a programmer and
> a lot more of a brainstormer ;) Still, judging from whatever
> experience I have, a working POC with some corners cut might
> be a matter of a week or two of coding... Just to see if the
> expected benefits in comparison to L2ARC do exist.
> The full-scale thing, yes, might take months or years from
> even a team of programmers ;)

Code talks ;-)

Cheers,
--
Saso
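As a concrete (if simplified) illustration of the "metaxels:list-of-guids"
attribute discussed above: the sketch below checks whether a given
top-level vdev GUID is listed in the pool config nvlist. The attribute
name ZPOOL_CONFIG_METAXELS and the helper vdev_is_metaxel() are
hypothetical names for this proposal; only the libnvpair lookup call is
existing API. An implementation that does not know the attribute would
simply never look it up, which is the backwards-compatibility behaviour
Jim is after.

#include <sys/types.h>
#include <libnvpair.h>

#define ZPOOL_CONFIG_METAXELS   "metaxels"      /* hypothetical attribute */

/* Is the given top-level vdev GUID listed as a metaxel in the pool config? */
static boolean_t
vdev_is_metaxel(nvlist_t *pool_config, uint64_t tlvdev_guid)
{
        uint64_t *guids;
        uint_t nguids, i;

        /* An absent attribute simply means "this pool has no metaxels". */
        if (nvlist_lookup_uint64_array(pool_config, ZPOOL_CONFIG_METAXELS,
            &guids, &nguids) != 0)
                return (B_FALSE);

        for (i = 0; i < nguids; i++) {
                if (guids[i] == tlvdev_guid)
                        return (B_TRUE);
        }
        return (B_FALSE);
}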
> No they're not. Here's l2arc_buf_hdr_t, the per-buffer structure held
> for buffers which were moved to l2arc:
>
> typedef struct l2arc_buf_hdr {
>         l2arc_dev_t     *b_dev;
>         uint64_t        b_daddr;
> } l2arc_buf_hdr_t;
>
> That's about 16 bytes of overhead per block, or 3.125% if the block's
> data is 512 bytes long.
>
> The main overhead comes from an arc_buf_hdr_t, which is pretty fat -
> around 180 bytes by a first-degree approximation - so in all around
> 200 bytes per ARC + L2ARC entry. At 512 bytes per block this is
> painfully inefficient (around 39% overhead); however, at a 4k average
> block size this drops to ~5%, and at a 64k average block size (which
> is entirely possible on average untuned storage pools) it drops down
> to ~0.3% overhead.

So... unless I miscalculated before drinking a morning coffee, for a
512b block quickly fetchable from SSD in both the L2ARC and METAXEL
cases, we have roughly these numbers?:

1) When the block is in RAM, we consume 512+180 bytes (though some ZFS
slides said that for 1 byte stored we spend 1 byte - I thought this
meant zero overhead, though I couldn't imagine how... or 100% overhead,
also quite unimaginable =) )

2L) When the block is on an L2ARC SSD, we spend 180+16 bytes (though
discussions about DDT on L2ARC, at least, settled on 176 bytes of cache
metainformation per entry moved off to L2ARC, with the DDT entry's size
being around 350 bytes, IIRC).

2M) When the block has expired from ARC and is only stored on the pool,
including the SSD-based copy on a METAXEL, we spend zero RAM to
reference this block from ARC - because we don't remember it anymore.
And when needed, we can access it just as fast (right?) as from L2ARC
on the same media type.

Where am I wrong? We seem to dispute THIS point over several emails,
and I'm ready to accept that you've seen the code and I'm the clueless
one. So I want to learn, then ;)

//Jim
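The figures being traded back and forth here are easy to cross-check
with a trivial userland calculation. The sketch below uses the
approximate structure sizes quoted in the thread (~180 bytes for
arc_buf_hdr_t, 16 bytes for l2arc_buf_hdr_t) - rough figures, not
sizeof() values from an actual build - and prints the per-block RAM
cost for the three cases enumerated above.

#include <stdio.h>

int
main(void)
{
        const double arc_hdr = 180.0;   /* assumed ~sizeof (arc_buf_hdr_t) */
        const double l2_hdr = 16.0;     /* assumed sizeof (l2arc_buf_hdr_t) */
        const double blksz[] = { 512.0, 4096.0, 65536.0 };
        int i;

        for (i = 0; i < 3; i++) {
                double b = blksz[i];
                /* 1)  block resident in ARC: data plus ARC header */
                printf("%6.0fB  in ARC (RAM): %6.0f bytes of RAM\n",
                    b, b + arc_hdr);
                /* 2L) block evicted to L2ARC: both headers stay in RAM */
                printf("%6.0fB  on L2ARC:     %6.0f bytes of RAM (%.1f%%)\n",
                    b, arc_hdr + l2_hdr, 100.0 * (arc_hdr + l2_hdr) / b);
                /* 2M) block only on pool media (e.g. a metaxel copy) */
                printf("%6.0fB  pool only:         0 bytes of RAM\n", b);
        }
        return (0);
}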
On 08/25/2012 11:53 AM, Jim Klimov wrote:
> So... unless I miscalculated before drinking a morning coffee, for a
> 512b block quickly fetchable from SSD in both the L2ARC and METAXEL
> cases, we have roughly these numbers?:
>
> 1) When the block is in RAM, we consume 512+180 bytes (though some ZFS
> slides said that for 1 byte stored we spend 1 byte - I thought this
> meant zero overhead, though I couldn't imagine how... or 100% overhead,
> also quite unimaginable =) )
>
> 2L) When the block is on an L2ARC SSD, we spend 180+16 bytes (though
> discussions about DDT on L2ARC, at least, settled on 176 bytes of cache
> metainformation per entry moved off to L2ARC, with the DDT entry's size
> being around 350 bytes, IIRC).
>
> 2M) When the block has expired from ARC and is only stored on the pool,
> including the SSD-based copy on a METAXEL, we spend zero RAM to
> reference this block from ARC - because we don't remember it anymore.
> And when needed, we can access it just as fast (right?) as from L2ARC
> on the same media type.
>
> Where am I wrong? We seem to dispute THIS point over several emails,
> and I'm ready to accept that you've seen the code and I'm the clueless
> one. So I want to learn, then ;)

The difference is that when you want to go fetch a block from a
metaxel, you still need some way to reference it. Either you use direct
references (i.e. ARC entries as above), or you use an indirect
mechanism, which means that for each read you will need to walk the
metaxel device, which is slow.

Cheers,
--
Saso
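To make the "direct reference" alternative concrete: conceptually, the
ARC is a hash table keyed by a block's DVA and birth txg, so a cached
(or ghost) block can be located without touching any device. The toy
sketch below only illustrates that shape; it is not the actual
buf_hash_find() logic in arc.c, and all toy_-prefixed names are
hypothetical.

#include <stddef.h>
#include <stdint.h>

typedef struct toy_dva {
        uint64_t        d_word[2];      /* encodes vdev, offset, etc. */
} toy_dva_t;

typedef struct toy_hdr {
        toy_dva_t       h_dva;          /* which on-disk copy was cached */
        uint64_t        h_birth;        /* birth txg of the block */
        void            *h_data;        /* cached data, or NULL for a ghost */
        struct toy_hdr  *h_next;        /* hash-chain link */
} toy_hdr_t;

#define TOY_HASH_SIZE   1024
static toy_hdr_t *toy_hash[TOY_HASH_SIZE];

/* Find a cached block purely in RAM, by the address we already know. */
static toy_hdr_t *
toy_lookup(const toy_dva_t *dva, uint64_t birth)
{
        uint64_t idx = (dva->d_word[0] ^ dva->d_word[1] ^ birth) %
            TOY_HASH_SIZE;
        toy_hdr_t *h;

        for (h = toy_hash[idx]; h != NULL; h = h->h_next) {
                if (h->h_dva.d_word[0] == dva->d_word[0] &&
                    h->h_dva.d_word[1] == dva->d_word[1] &&
                    h->h_birth == birth)
                        return (h);     /* hit: header (and maybe data) in RAM */
        }
        return (NULL);                  /* miss: read from disk/SSD via the DVA */
}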
2012-08-25 15:46, Sašo Kiselkov wrote:
> The difference is that when you want to go fetch a block from a
> metaxel, you still need some way to reference it. Either you use direct
> references (i.e. ARC entries as above), or you use an indirect
> mechanism, which means that for each read you will need to walk the
> metaxel device, which is slow.

Um... how does your application (or the filesystem during its metadata
traversal) know that it wants to read a certain block? IMHO, it has the
block's address for that (likely known from a higher-level "parent"
block of metadata), and it requests: "give me L bytes from offset O on
tlvdev T", which is the layman's interpretation of a DVA.

From what I understand, with ARC-cached blocks we traverse the
RAM-based cache and find one with the requested DVA; then we have its
data already in RAM and return it to the caller. If the block is not in
ARC (and there's no L2ARC), we can fetch it from the media using the
DVA address(es?) we already know from the request. In the case of L2ARC
there is probably a non-null pointer to the l2arc_buf_hdr_t, so we can
request the block from the L2ARC.

If true, this is no faster than fetching the block from the same SSD
used as a metadata accelerator instead of as an L2ARC device with a
policy (or even without one, as today), and in comparison it only
wastes RAM on the ARC entries.

BTW, as I see in "struct arc_buf_hdr", it only stores one DVA - so I
guess for blocks with multiple on-disk copies it is possible to have
them cached twice, or does ZFS always enforce storing and searching the
ARC by a particular DVA of the block (likely DVA[0])?

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c#433

If we do go with METAXELs as I described/proposed, and prefer fetching
metadata from SSD unless there are errors, then some care should be
taken to use this instance of the DVA to reference cached metadata
blocks in the ARC.

//Jim Klimov
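In outline, the proposed read bias could look like the sketch below: a
block pointer carries up to three DVAs, and the vdev id embedded in
each DVA identifies the top-level vdev holding that copy, so a reader
can try a metaxel copy first and fall back to the ordinary copies on
error. All toy_-prefixed names are hypothetical; this is a sketch of
the idea, not the actual zio/vdev_mirror code path.

#include <stddef.h>
#include <stdint.h>

#define TOY_DVAS_PER_BP 3

typedef struct toy_dva {
        uint64_t        d_vdev;         /* top-level vdev id */
        uint64_t        d_offset;       /* offset within that vdev */
} toy_dva_t;

typedef struct toy_blkptr {
        toy_dva_t       b_dva[TOY_DVAS_PER_BP];
        int             b_ndvas;        /* how many copies were written */
} toy_blkptr_t;

/* Assumed to exist elsewhere in this sketch. */
extern int toy_vdev_is_metaxel(uint64_t vdev_id);
/* Read one copy; returns 0 if the data arrived and its checksum matched. */
extern int toy_read_dva(const toy_dva_t *dva, void *buf, size_t len);

static int
toy_read_prefer_metaxel(const toy_blkptr_t *bp, void *buf, size_t len)
{
        int i;

        /* First pass: only the copy (if any) that lives on a metaxel. */
        for (i = 0; i < bp->b_ndvas; i++) {
                if (toy_vdev_is_metaxel(bp->b_dva[i].d_vdev) &&
                    toy_read_dva(&bp->b_dva[i], buf, len) == 0)
                        return (0);
        }
        /* Fall back to the ordinary (HDD) copies on error or during scrub. */
        for (i = 0; i < bp->b_ndvas; i++) {
                if (!toy_vdev_is_metaxel(bp->b_dva[i].d_vdev) &&
                    toy_read_dva(&bp->b_dva[i], buf, len) == 0)
                        return (0);
        }
        return (-1);    /* every copy failed; the caller reports an I/O error */
}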
On 2012-08-24 14:39, Jim Klimov wrote:
> Hello all,
>
> The idea of dedicated metadata devices (likely SSDs) for ZFS
> has been generically discussed a number of times on this list,
> but I don't think I've seen a final proposal that someone would
> take up for implementation (as a public source code, at least).

Hi,

OK, I am not a ZFS dev and have barely even looked at the code, but it
seems to me that this could be dealt with in an easier and more
efficient manner by modifying the current L2ARC code to make a
persistent cache device, and adding the preference mechanism somebody
has already suggested (e.g. prefer metadata, or prefer specific types
of metadata).

My reasoning is as follows:

1) As metadata is already available on the main pool devices, there is
no need to make this data redundant. It is there for acceleration. In
the event of a failure it can just be read directly from the pool, and
there is no need to write the data twice (as would happen in a mirrored
'metaxel') or waste the space. This is only my opinion, but it makes
sense to me.

The other option, for me, would be to make it the main storage area for
metadata, with no requirement to store it on the main pool devices
beyond needing enough copies. I.e. if you need 2 metadata copies but
have only one metaxel, store one on there and one in the pool. If you
need 2 copies and there are 2 metaxels, store them on the metaxels - no
pool storage needed.

2) Persistent cache devices and cache policies would bring more
benefits to the system overall than adding this metaxel: no warming of
the cache (besides reading in what is stored there on import/boot, so
let's say accelerated warming) and finer control over what to store in
the cache. The cache devices could then be tuned on a per-dataset (and
possibly per-cache-device, so certain data types prefer the cache
device with the best performance profile for them) basis to provide the
best for your own unique situation. Possibly even a "keep this dataset
in cache at all times" option would be useful for less frequently
accessed but time-critical data (so no more loops cat'ing to /dev/null
to keep data in cache).

3) This would provide, IMHO, the building blocks for a combined
cache/log device. This would basically go as follows: you set up, say,
a pair of persistent cache devices. You then tell ZFS that these can be
used for ZIL blocks, with something like the copies attribute to tell
it to ensure redundancy. So it basically builds a ZIL device from
blocks within the cache as it needs them. It would not be as fast as a
dedicated log device, but would allow greater efficiency.

Point 3 would be for future development, but I believe the benefits of
cache persistence and policies are enough to make them a priority. I
believe this would cover what the metaxel is trying to do and more.

The other, simpler, option I could see is a flag which tells ZFS "keep
metadata in the cache", which ensures all metadata (where possible) is
stored in ARC/L2ARC at all times, and possibly forces it to be read in
on import/boot.
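For reference, ZFS already has coarse per-dataset knobs in this spirit:
the primarycache and secondarycache properties accept all, none or
metadata (e.g. "zfs set secondarycache=metadata pool/dataset"). The
sketch below simply restates that kind of policy check in code and adds
the hypothetical "keep this dataset in cache at all times" option from
the message above; the toy_-prefixed names are not existing ZFS code.

typedef enum toy_cache_policy {
        TOY_CACHE_ALL,          /* cache data and metadata (the default) */
        TOY_CACHE_METADATA,     /* cache metadata only */
        TOY_CACHE_NONE,         /* cache nothing from this dataset */
        TOY_CACHE_PINNED        /* hypothetical: never evict this dataset */
} toy_cache_policy_t;

typedef enum toy_buf_type {
        TOY_BUF_DATA,
        TOY_BUF_METADATA
} toy_buf_type_t;

/* Should a buffer of this type be admitted to (or kept in) the cache? */
static int
toy_cache_admit(toy_cache_policy_t pol, toy_buf_type_t type)
{
        switch (pol) {
        case TOY_CACHE_ALL:
        case TOY_CACHE_PINNED:
                return (1);
        case TOY_CACHE_METADATA:
                return (type == TOY_BUF_METADATA);
        case TOY_CACHE_NONE:
        default:
                return (0);
        }
}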