Hello all,

The idea of dedicated metadata devices (likely SSDs) for ZFS has been discussed generically a number of times on this list, but I don't think I've seen a final proposal that someone would take up for implementation (at least not as public source code). I'd like to take the liberty of summarizing the ideas I've either seen in discussions or proposed myself on this matter, to see whether the overall idea makes sense to the gurus of ZFS architecture.

So, the assumption was that the performance killer in ZFS, at least on smallish deployments (a few HDDs and an SSD accelerator) like Home-NAS types of boxes, is random IO to lots of metadata. This IMHO includes primarily the block pointer tree and the DDT for those who risked using dedup. I am not sure how frequently the other types of metadata (dataset descriptors, etc.) need to be read, such that occasional reading and caching wouldn't suffice.

Another idea was that L2ARC caching might not really cut it for metadata in comparison to dedicated metadata storage, partly because the L2ARC becomes empty upon every export/import (boot) and needs to be re-heated.

So, here are the highlights of the proposal (up for discussion).

In short, the idea is to use today's format of the blkptr_t, which allows up to 3 DVA addresses per block, while many types of metadata use only 2 copies (at least by default). The new feature adds a specially processed TLVDEV in the common DVA address space of the pool, and enforces storage of the added third copies of certain types of metadata blocks on these devices. (Limited) backwards compatibility is quite possible; an on-disk format change may not even be required. The proposal also addresses some questions that arose in previous discussions, especially about proposals where SSDs would be the only storage for the pool's metadata:

* What if the dedicated metadata device overflows?
* What if the dedicated metadata device breaks?
= okay/expected by design, nothing dies.

In more detail:

1) Add a special Top-Level VDEV (TLVDEV below) device type (like "cache" and "log" - say, "metaxel" for "metadata accelerator"?), and allow (even encourage) the use of mirrored devices, plus expansion (raid0, raid10 and/or separate TLVDEVs) with added singlets/mirrors of such devices.

How the device type is defined for the pool is discussable; I'd go with a special attribute (array) or nvlist in the pool descriptor, rather than some special type ID in the ZFS label (backwards compatibility, see point 4 for the detailed rationale).

Discussable: enable pool-wide or per-dataset (i.e. don't waste accelerator space and lifetime on rarely-reused datasets like rolling backups)? Choose what to store on (particular) metaxels - DDT, BP tree, something else? Overall, this availability of choice is similar to the choice of modes for ARC/L2ARC caching or enabling the ZIL per-dataset...

2) These devices should be formally addressable as part of the pool in DVA terms (tlvdev:offset:size), but writes onto them are artificially limited by the ZFS scheduler so as to only allow specific types of metadata blocks (blkptr_t's, DDT entries), and also to enforce writing of the added third copies (for metadata blocks with the usual copies=2) onto these devices.

3) Absence or "FAULTEDness" of this device should not be fatal to the pool, but it may require manual intervention to force the import. In particular, removal, replacement or resilvering onto different storage (i.e. migrating to larger SSDs) should be supported in the design.
Besides experimentation and migration concerns, this approach should also ease replacement of SSDs used for metadata in case of their untimely fatal failures - and this may be a concern for many SSD deployments, which are increasingly susceptible to write wear and ultimate death (at least in the cheaper, bulkier range that is a likely component of Home-NAS solutions).

4) For backwards compatibility, to older versions of ZFS this device should look like a normal single-disk or mirror TLVDEV containing blocks addressed within the common pool DVA address space. This should have no effect on read-only imports. However, other ZFS releases likely won't respect the filtering and alignment limitations normally enforced for the device in this design, and can "contaminate" the device with other types of blocks (and would refuse to import the pool if the device is missing/faulted).

5) ZFS reads should be tweaked to first consult the copy of metadata blocks on the metadata accelerator device, and only use spinning rust (ordinary TLVDEVs) if there are errors (checksum mismatches, missing devices, etc.) or during scrubs and similar tasks which require full reads of the pool's addressed blocks.

Prioritized reads from this metadata accelerator won't need a special bit in the blkptr_t (as is done for the dedup bit) - the TLVDEV number in the DVA already points to a known TLVDEV identifier, which we know is a metaxel.

6) ZFS writes onto this storage should take into account the increased block size (likely 4-8 KB for either current HDDs or SSDs) and the coalescing and pagination required to reduce SSD wear-out. This might be a tweakable component of the scheduler, which could be disabled if different media are used and this scheduling is not needed (small-sectored HDDs, DDR, SSDs of the future), but the default writing mode today should expect SSDs.

7) A special scrub-like tool should be added to walk the pool's block tree and rewrite the existing block pointers (and I am not sure this is as problematic as the generic BP rewrite - if needed, the task can be done once offline, for example). By definition this is a restartable task (initiating a new tree walk), so it should be pausable or abortable as well.

As the result, new copies of metadata blocks would be created on the accelerator device, while the two old copies remain in place. In case the metaxel TLVDEV has been "contaminated" by other block types (by virtue of read-write import and usage by ZFS implementations not supporting this feature), those blocks should be relocated onto the main HDD pool.

One thing to discuss: what should be done for metadata that already has three copies on HDDs? Should one copy be disposed of and recreated on the accelerator?

8) If the metaxel devices fill up, the "overflowing" metadata blocks may simply not get third copies (only the standard HDD-based copies are written). If the metaxel device is later freed up (by deletion of data and release of the block pointers) or expanded (or another one is added), then another run of the "scrub-like" procedure from point 7 can add the missing copies.

9) Also, the solution should allow discarding and recreating the copies of block pointers on the accelerator TLVDEV in case it fails fatally or is replaced by new, empty media. Unlike usual data, where loss of a TLVDEV is considered fatal to the pool, in this case we know we have (redundant) copies of these blocks on other media.
If the new TLVDEV is at least as big as the failed one, the pre-recorded accelerator tlvdevid:offsets in the HDD-based copies of the block pointers can be used to re-instantiate the copies on the metaxel, just as scrub or in-flight repairs happen on usual pools (rewriting corrupt blocks in place without having to change the BP tree at all). In this case the tlvdevid part of the DVA can (should) remain unchanged.

For new metaxels smaller than the replaced one, new DVA allocations might be required. To enforce this and avoid mix-ups, the tlvdevid should change to some new unique number, and the BP tree gets rewritten as in point 7.

10) If these metaxel devices are used (and known to be SSDs?), then (SSD-based) L2ARC caches should not be used for the metadata blocks readily available from the metaxel. Guess: this might in fact reduce the overheads of dedup, where pushing blocks into L2ARC only halves the needed RAM footprint. With an SSD metaxel we can just drop unneeded DDT entries from the RAM ARC and quickly get them back from stable storage when needed.

I hope I covered all or most of what I think on this matter; discussion (and ultimately open-sourced implementations) is most welcome ;)

HTH,
//Jim Klimov
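For readers less familiar with the structures the proposal leans on, here is an abridged, stand-alone sketch of the block pointer and DVA layout (simplified from the real definitions in sys/spa.h, not verbatim; vdev_id_is_metaxel() is invented here purely to illustrate point 5's read bias):

/*
 * Simplified sketch of the on-disk structures the proposal builds on
 * (abridged; not the verbatim headers).  A block pointer already has
 * room for three DVAs, and a DVA already names its top-level vdev, so
 * no new on-disk bit is needed to recognize "this copy lives on the
 * accelerator".
 */
#include <stdint.h>

typedef struct dva {            /* Data Virtual Address */
	uint64_t dva_word[2];   /* packs tlvdev id, offset and asize */
} dva_t;

typedef struct blkptr {
	dva_t    blk_dva[3];    /* room for up to three copies */
	uint64_t blk_prop;      /* size, type, checksum, compression */
	/* ... birth txgs, fill count, 256-bit checksum ... */
} blkptr_t;

/* Hypothetical: the pool would remember which tlvdev ids are metaxels. */
extern int vdev_id_is_metaxel(uint64_t tlvdev_id);

static int
dva_on_metaxel(const dva_t *dva)
{
	/* The real code extracts the vdev id via the DVA_GET_VDEV() macro. */
	uint64_t tlvdev_id = dva->dva_word[0] >> 32;    /* simplified */

	return (vdev_id_is_metaxel(tlvdev_id));
}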
This is something I've been looking into in the code, and my take on your proposed points is this:

1) This requires many and deep changes across much of ZFS's architecture (especially the ability to sustain tlvdev failures).

2) Most of this can be achieved (except for cache persistency) by implementing ARC space reservations for certain types of data.

The latter has the added benefit of spreading load across all ARC and L2ARC resources, so your metaxel device never becomes the sole bottleneck, and it better embraces the ZFS design philosophy of pooled storage.

I plan on having a look at implementing cache management policies (which would allow tuning space reservations for metadata etc. in a fine-grained manner, without the cruft of having to worry about physical cache devices as well).

Cheers,
--
Saso
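For comparison with the metaxel approach, a minimal sketch of what such a per-type reservation ("cache management policy") could look like - everything below is hypothetical, since no structure of this kind exists in the ARC today:

/*
 * Purely hypothetical sketch of a per-data-type ARC reservation policy;
 * the names and field set are illustrative only.
 */
#include <stdint.h>

typedef enum arc_kind {
	ARC_KIND_DATA,          /* ordinary file/zvol data */
	ARC_KIND_METADATA,      /* indirect blocks, dnodes, ... */
	ARC_KIND_DDT            /* dedup table entries */
} arc_kind_t;

typedef struct arc_policy {
	arc_kind_t	ap_kind;
	uint64_t	ap_reserve;     /* bytes guaranteed to this kind */
	uint64_t	ap_limit;       /* cap on bytes this kind may hold */
} arc_policy_t;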
On Aug 24, 2012, at 6:50 AM, Sašo Kiselkov wrote:

> This is something I've been looking into in the code, and my take on
> your proposed points is this:
>
> 1) This requires many and deep changes across much of ZFS's
> architecture (especially the ability to sustain tlvdev failures).
>
> 2) Most of this can be achieved (except for cache persistency) by
> implementing ARC space reservations for certain types of data.

I think the simple solution of increasing the default metadata limit above 1/4 of arc_max will take care of the vast majority of small-system complaints. The limit is arbitrary and was set well before dedup was delivered.

> On 08/24/2012 03:39 PM, Jim Klimov wrote:
>> So, the assumption was that the performance killer in ZFS, at least
>> on smallish deployments (a few HDDs and an SSD accelerator) like
>> Home-NAS types of boxes, is random IO to lots of metadata.

It is a bad idea to make massive investments in development and testing because of an assumption. Build test cases, prove that the benefits of the investment can outweigh other alternatives, and then deliver code.
 -- richard
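For anyone who wants to experiment with that suggestion before any code changes, the limit is already exposed as a tunable on illumos/Solaris-derived systems (names and defaults vary between releases, so treat this as an illustrative fragment; the value shown is an arbitrary 8 GiB):

* /etc/system fragment: raise the ARC metadata limit (the default is
* arc_c_max / 4).  Example value only; takes effect after a reboot.
set zfs:zfs_arc_meta_limit = 0x200000000

The effect can then be watched via the zfs:0:arcstats kstat, which on these systems exposes counters such as arc_meta_used, arc_meta_limit and arc_meta_max.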
First of all, thanks for reading and discussing! :)

2012-08-24 17:50, Sašo Kiselkov wrote:
> This is something I've been looking into in the code, and my take on
> your proposed points is this:
>
> 1) This requires many and deep changes across much of ZFS's
> architecture (especially the ability to sustain tlvdev failures).

I'd trust the expert; from the outside it did not seem like a very deep change - at least if, for the first POC tests, we leave out the rewriting of existing block pointers (to store copies of existing metadata on an SSD) and the resilience to failures and absence of METAXELs.

Basically, for a POC implementation we could just take a regular top-level VDEV, forced to be a single disk or mirror, and add some hint describing it as a METAXEL component of the pool, so that the ZFS kernel restricts what gets written there (for new metadata writes) and prioritizes reads from it (fetch metadata from METAXELs, unless there is no copy on a known METAXEL or the copy is corrupted).

The POC as outlined would be useful to estimate the benefits and impacts of the solution, and as with "BP rewrite", the more advanced features might be delayed by a few years - so even the POC could easily be a useful solution for many of us, especially if applied to new pools from TXG=0. "There is nothing as immortal as a temporary solution" ;)

> 2) Most of this can be achieved (except for cache persistency) by
> implementing ARC space reservations for certain types of data.
>
> The latter has the added benefit of spreading load across all ARC and
> L2ARC resources, so your metaxel device never becomes the sole
> bottleneck, and it better embraces the ZFS design philosophy of
> pooled storage.

Well, we already have somewhat non-pooled ZILs and L2ARCs. Or, rather, they are in sub-pools of their own, reserved for specific tasks to optimize and speed up the ZFS storage subsystem in the face of particular problems. My proposal does indeed add another sub-pool for another such task (and nominally METAXELs are part of the common pool - more so than cache and log devices are today), and it explicitly allows for adding several METAXELs or raid10'ing them (which addresses the bottleneck question). On larger systems, this metadata storage might sit behind a different SAS controller on a separate PCI bus, further boosting performance and reducing bottlenecks. Unlike the L2ARC, METAXELs can be N-way mirrored, so instances are available in parallel from several controllers and lanes - further boosting IO and the reliability of metadata operations.

However, unlike the L2ARC in general, here we know our "target audience" better, so we can optimize for one particular, useful situation: gigabytes worth of data in small portions (sized from 512 bytes to 8 KB, IIRC?), stored quite randomly and read often relative to the amount of writes.

Regarding size in particular: with 128K blocks and 512-byte BP entries, the minimum overhead for a single copy of the BP-tree metadata is 1/256 (not counting the rest of the tree, dataset labels, etc.). So for each 1 TB of written ZFS pool userdata we get at least 4 GB of metadata just for the block pointer tree (likely more in reality). For practical Home-NAS pools of about 10 TB this warrants about 60 GB (give or take an order of magnitude) of SSD dedicated to casual metadata without even a DDT, be it generic L2ARC or an optimized METAXEL.
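As a quick sanity check of that lower bound, here is a trivial stand-alone calculation (it assumes every block is a full 128 KiB with one 512-byte block pointer per copy, so real pools with smaller blocks will need proportionally more):

/*
 * Back-of-the-envelope sizing of the block-pointer-tree footprint,
 * following the 1/256 ratio quoted above.  The pool size is an
 * arbitrary illustration.
 */
#include <stdio.h>

int
main(void)
{
	double pool_tib = 10.0;			/* example: a 10 TiB Home-NAS pool */
	double blocksize = 128.0 * 1024.0;	/* assumed average block size */
	double bp_size = 512.0;			/* one block-pointer copy */
	double meta_gib = pool_tib * 1024.0 * (bp_size / blocksize);

	printf("~%.0f GiB of BP-tree metadata per copy for a %.0f TiB pool\n",
	    meta_gib, pool_tib);
	return (0);
}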
The hoped-for tradeoffs of dedicating a storage device (or several) to this one task are: no need to re-heat the cache every time with gigabytes that are known to be needed again and again, even if only to boost weekly scrubs, some RAM ARC savings, and freeing the L2ARC for the tasks it is more efficient at (generic, larger blocks). By eliminating many small random IOs to spinning rust, we win on HDD performance and arguably on power consumption and longevity (less mechanical overhead and delay per overall amount of transferred gigabytes).

There is also a relatively large RAM pointer overhead for storing small pieces of data (such as metadata blocks of one or a few sectors) in the L2ARC, which I expect to be eliminated by storing and using these blocks directly from the pool (on SSD METAXELs) - giving both SSD-fast access to the blocks and no expiration into the L2ARC and back with inefficiently-sized ARC pointers to remember. I guess a METAXEL might indeed be cheaper and faster than the L2ARC for this particular use case (metadata).

Also, this way the true L2ARC device would be more available to "real" userdata, which is likely to use larger blocks - improving the benefits from your L2ARC compression features as well as reducing the overhead percentage for ARC pointers; and, being a random selection of the pool's blocks, the userdata is too unpredictable to accelerate well by other means (short of a full-SSD pool).

Also, this bulky amount of bytes (BP tree, DDT) is essentially required for fast operation of the overall pool, and it is not some unpredictable random set of blocks as is expected for usual cacheable data - so why keep reheating it into the cache upon every restart (and the greener home-NAS users might power down their boxes when not in use to save on power bills, so reheating the L2ARC is frequent), needlessly wearing it out with writes, and in any case carving this amount of bytes out of the usual L2ARC data and RAM ARC as well?

The DDT also, hopefully, won't have the drastic impacts we see today on the budgeting and/or performance of smaller machines (like your HP Microserver with 8 GB RAM tops) with dedup enabled, because DDT entries can be quickly re-read from SSD and need not consume RAM after they are expired from the ARC as they do today - thus freeing it for more efficient caching of real data or for pointers to bulkier userdata in the L2ARC.

Finally, even scrubs should be faster - besides checking the integrity of the on-HDD copies of metadata blocks, the system won't need to read them with slower access times in order to find the addresses and checksums of the bulk of the userdata. Scrubs are thus likely to become more sequential and fast with little to no special coding for this boost.

What do you think?
//Jim Klimov
2012-08-24 17:39, Jim Klimov wrote:
> Hello all,
>
> The idea of dedicated metadata devices (likely SSDs) for ZFS has been
> discussed generically a number of times on this list, but I don't
> think I've seen a final proposal that someone would take up for
> implementation (at least not as public source code).

Hmmm... now that I think of it, this solution might also eat some of the cake of dedicated ZIL devices: having an SSD with in-pool metadata, we could send sync writes (for metadata blocks) straight to the METAXEL SSD, while the TXG sync would flush out their HDD-based counterparts and other (async) data. In case of a pool import after breakage, we would just need to repair the last uncommitted TXG's worth of recorded metadata on the METAXEL...

Also, I now think that directories, ACLs and such (POSIX fs-layer metadata) should have a copy on the METAXEL SSDs too.

This certainly does not replace the ZIL in general, but unlike the rolling "write a lot, read never" ZIL approach, this would actually write the needed data onto the pool with low latency and (hopefully) won't abuse the flash cells needlessly. Expert opinion and/or tests could confirm my guess that this could provide a "good enough" boost for some modes of sync writes (NFS, maybe?) that are heavier on metadata updates than on userdata, so the tight-on-budget deployments now contemplating costly dedicated ZIL devices might no longer require one, or not need it as heavily.

Is there any sanity to this? ;)

Thanks,
//Jim Klimov
Oh man, that's a million-billion points you made. I'll try to run through each quickly.

On 08/24/2012 05:43 PM, Jim Klimov wrote:
> First of all, thanks for reading and discussing! :)

No problem at all ;)

> 2012-08-24 17:50, Sašo Kiselkov wrote:
>> 1) This requires many and deep changes across much of ZFS's
>> architecture (especially the ability to sustain tlvdev failures).
>
> I'd trust the expert; from the outside it did not seem like a very
> deep change - at least if, for the first POC tests, we leave out the
> rewriting of existing block pointers (to store copies of existing
> metadata on an SSD) and the resilience to failures and absence of
> METAXELs.

The initial set of change areas I can identify, even for the stripped-down version of your proposal, is:

*) implement a new vdev type (mirrored or straight metaxel)
*) integrate all format changes to labels to describe these
*) alter the block allocator strategy so that if there are metaxels present, we utilize those
*) alter the metadata fetch points (of which there are many) to preferably fetch from metaxels when possible, or fall back to main-pool copies
*) make sure that the previous two points play nicely with copies=X

The other points you mentioned, i.e. fault resiliency, block-pointer rewrite and other stuff, are another mountain of work with an even higher mountain of testing to be done on all possible combinations.

> Basically, for a POC implementation we could just take a regular
> top-level VDEV, forced to be a single disk or mirror, and add some
> hint describing it as a METAXEL component of the pool, so that the
> ZFS kernel restricts what gets written there (for new metadata
> writes) and prioritizes reads from it (fetch metadata from METAXELs,
> unless there is no copy on a known METAXEL or the copy is corrupted).

As noted before, you'll have to go through the code looking for paths which fetch metadata (mostly the object layer) and replace those with metaxel-aware calls. That's a lot of work for a POC.

> The POC as outlined would be useful to estimate the benefits and
> impacts of the solution, and as with "BP rewrite", the more advanced
> features might be delayed by a few years - so even the POC could
> easily be a useful solution for many of us, especially if applied to
> new pools from TXG=0.

I wish I had all the time to implement it, but alas, I'm just a zfs n00b and am not doing this for a living :-)

>> 2) Most of this can be achieved (except for cache persistency) by
>> implementing ARC space reservations for certain types of data.
>>
>> The latter has the added benefit of spreading load across all ARC
>> and L2ARC resources, so your metaxel device never becomes the sole
>> bottleneck, and it better embraces the ZFS design philosophy of
>> pooled storage.
>
> Well, we already have somewhat non-pooled ZILs and L2ARCs.

Yes, that's because these have vastly different performance properties from main-pool storage. However, metaxels and cache devices are essentially the same (many small random reads, infrequent large async writes).

> Or, rather, they are in sub-pools of their own, reserved for specific
> tasks to optimize and speed up the ZFS storage subsystem in the face
> of particular problems.

Exactly.
The difference between a metaxel and a cache device, however, is cosmetic.

> My proposal does indeed add another sub-pool for another such task
> (and nominally METAXELs are part of the common pool - more so than
> cache and log devices are today), and it explicitly allows for adding
> several METAXELs or raid10'ing them (which addresses the bottleneck
> question).

The problem regarding bottlenecking is that you're creating a new, separate island of resources which differs very little in performance requirements from cache devices, yet by separating them out artificially, you're creating a potential scalability barrier.

> On larger systems, this metadata storage might sit behind a different
> SAS controller on a separate PCI bus, further boosting performance
> and reducing bottlenecks. Unlike the L2ARC, METAXELs can be N-way
> mirrored, so instances are available in parallel from several
> controllers and lanes - further boosting IO and the reliability of
> metadata operations.

How often do you expect cache devices to fail? I mean, we're talking about a one-off occasional event that doesn't even mean data loss (only a little performance loss, especially if you use multiple cache devices). And since you're proposing mirroring metaxels, you are essentially going to be continuously doing twice the write work for a 50% reduction in read performance from the vdev in case of a device failure. If you just used both devices as cache, you'd get a 100% speedup in read AND write performance (and if you lose one cache device, you've still got 50% of your cache data available). So to sum up, you're applying RAID to something that doesn't need it.

> However, unlike the L2ARC in general, here we know our "target
> audience" better, so we can optimize for one particular, useful
> situation: gigabytes worth of data in small portions (sized from 512
> bytes to 8 KB, IIRC?), stored quite randomly and read often relative
> to the amount of writes.

The L2ARC also knows its target audience well, and it's nearly identical to what you've described. The L2ARC doesn't cache prefetched buffers or streaming workloads (unless you instruct it to). It's there merely to serve as a low-latency random-read accelerator.

> Regarding size in particular: with 128K blocks and 512-byte BP
> entries, the minimum overhead for a single copy of the BP-tree
> metadata is 1/256 (not counting the rest of the tree, dataset
> labels, etc.).

Block pointers are actually much smaller; ZFS groups them into gang blocks if it needs to store multiple of them below SPA_MINBLOCKSIZE (hope I remember that macro's name right).

> So for each 1 TB of written ZFS pool userdata we get at least 4 GB of
> metadata just for the block pointer tree (likely more in reality).
> For practical Home-NAS pools of about 10 TB this warrants about 60 GB
> (give or take an order of magnitude) of SSD dedicated to casual
> metadata without even a DDT, be it generic L2ARC or an optimized
> METAXEL.

And how is that different from having a cache-sizing policy which selects how much each data type gets allocated from a single common cache?

> The hoped-for tradeoffs of dedicating a storage device (or several)
> to this one task are: no need to re-heat the cache every time with
> gigabytes that are known to be needed again and again,

Agreed, the persistency would be nice to have, and in fact it might be a lot easier to implement (I've already thought about how to do this, but that's a topic for another day).

> even if only to boost weekly scrubs, some RAM ARC savings, and
> freeing the L2ARC for the tasks it is more efficient at (generic,
> larger blocks).

Scrubs will populate your cache anyway, so only the first one will be slow; the next one will be much faster. Also, you're wrong if you think the clientele of the L2ARC and the metaxel would be different - it most likely wouldn't.

> By eliminating many small random IOs to spinning rust, we win on HDD
> performance and arguably on power consumption and longevity (less
> mechanical overhead and delay per overall amount of transferred
> gigabytes).

No disagreement there.

> There is also a relatively large RAM pointer overhead for storing
> small pieces of data (such as metadata blocks of one or a few
> sectors) in the L2ARC, which I expect to be eliminated by storing and
> using these blocks directly from the pool (on SSD METAXELs) - giving
> both SSD-fast access to the blocks and no expiration into the L2ARC
> and back with inefficiently-sized ARC pointers to remember.

You'd still need to reference metaxel data from the ARC, so your savings would be very small. ZFS is already pretty efficient there.

> I guess a METAXEL might indeed be cheaper and faster than the L2ARC
> for this particular use case (metadata). Also, this way the true
> L2ARC device would be more available to "real" userdata, which is
> likely to use larger blocks - improving the benefits from your L2ARC
> compression features as well as reducing the overhead percentage for
> ARC pointers; and, being a random selection of the pool's blocks, the
> userdata is too unpredictable to accelerate well by other means
> (short of a full-SSD pool).

While compression indeed works much better on larger blocks, I hardly think the proportion of metadata to regular data is significant enough to warrant taking it out of the compression datastream. At worst it's a few percent of compression overhead - in fact, my current implementation of L2ARC compression already checks the block size and refuses to compress blocks smaller than ~2048 bytes.

> Also, this bulky amount of bytes (BP tree, DDT) is essentially
> required for fast operation of the overall pool, and it is not some
> unpredictable random set of blocks as is expected for usual cacheable
> data - so why keep reheating it into the cache upon every restart
> (and the greener home-NAS users might power down their boxes when not
> in use to save on power bills, so reheating the L2ARC is frequent),
> needlessly wearing it out with writes, and in any case carving this
> amount of bytes out of the usual L2ARC data and RAM ARC as well?
> The DDT also, hopefully, won't have the drastic impacts we see today
> on the budgeting and/or performance of smaller machines (like your HP
> Microserver with 8 GB RAM tops) with dedup enabled, because DDT
> entries can be quickly re-read from SSD and need not consume RAM
> after they are expired from the ARC as they do today - thus freeing
> it for more efficient caching of real data or for pointers to bulkier
> userdata in the L2ARC.

So essentially that's an argument for L2ARC persistency. As I said, it can be done (and more easily than using metaxels).

> Finally, even scrubs should be faster - besides checking the
> integrity of the on-HDD copies of metadata blocks, the system won't
> need to read them with slower access times in order to find the
> addresses and checksums of the bulk of the userdata. Scrubs are thus
> likely to become more sequential and fast with little to no special
> coding for this boost.

See above. All of this can be solved by cache sizing policies and L2ARC persistency.

Cheers,
--
Saso
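To make the allocator-strategy item listed earlier in this message a little more concrete, the write path could conceivably branch on block type when picking an allocation class. A rough, hypothetical sketch (metaslab_class_t, spa_normal_class() and DMU_OT_IS_METADATA() are existing primitives such a change would likely lean on; spa_metaxel_class() does not exist and is invented here for illustration):

/*
 * Hypothetical sketch only (assumes the illumos <sys/spa.h>,
 * <sys/metaslab.h> and <sys/dmu.h> headers).  Route the extra copy of
 * metadata to a metaxel allocation class when one is configured,
 * otherwise fall back to the normal class.
 */
static metaslab_class_t *
pick_alloc_class(spa_t *spa, dmu_object_type_t ot, int copy_index)
{
	/* spa_metaxel_class() is invented for this sketch. */
	if (copy_index == 2 && DMU_OT_IS_METADATA(ot) &&
	    spa_metaxel_class(spa) != NULL)
		return (spa_metaxel_class(spa));

	return (spa_normal_class(spa));
}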
2012-08-25 0:42, Sašo Kiselkov wrote:
> Oh man, that's a million-billion points you made. I'll try to run
> through each quickly.

Thanks... I still do not have the feeling that you've fully got my idea - or, alternately, that I correctly understand the ARC :)

>> There is also a relatively large RAM pointer overhead for storing
>> small pieces of data (such as metadata blocks of one or a few
>> sectors) in the L2ARC, which I expect to be eliminated by storing
>> and using these blocks directly from the pool (on SSD METAXELs) -
>> giving both SSD-fast access to the blocks and no expiration into
>> the L2ARC and back with inefficiently-sized ARC pointers to remember.

...And these counter-arguments are probably THE point of deviation:

> However, metaxels and cache devices are essentially the same
> (many small random reads, infrequent large async writes).
> The difference between a metaxel and a cache device, however, is
> cosmetic.

> You'd still need to reference metaxel data from the ARC, so your
> savings would be very small. ZFS is already pretty efficient there.

No, you don't! "Republic credits WON'T do fine!" ;)

The way I understood the ARC (without/before L2ARC), it either caches pool blocks or it doesn't. More correctly, there is also a cache of ghosts without the bulk block data, so we can account for misses of recently expired blocks of either of the two categories, and thus adjust the cache subdivision towards MRU or MFU. Ultimately, the ghosts which were not requested also expire from the cache, and no reference to a recently-cached block remains.

With the L2ARC, on the other hand, there is a list of pointers in the ARC so that it knows which blocks were cached on the SSD - and the lack of this list upon pool import is, in effect, the perceived emptiness of the L2ARC device. The L2ARC's pointers are of comparable size to the small metadata blocks, and *this* consideration IMHO makes it much more efficient to use the L2ARC for larger cached blocks, especially on systems with limited RAM (which effectively limits the addressable L2ARC size as accounted in the number of blocks), with the added benefit that you can compress larger blocks in the L2ARC.

This way, the *difference* between the L2ARC and a METAXEL is that the latter is an ordinary pool tlvdev with a specially biased read priority and write filter. If a metadata block is read, it goes into the ARC. If it expires - then there's a ghost for a while, and soon there is no memory that this block was ever cached - unlike the L2ARC's list of pointers, which are just a couple of times smaller than a cached block of this type. But re-fetching metadata from an SSD METAXEL is faster when it is needed again.

> Also, you're wrong if you think the clientele of the L2ARC and the
> metaxel would be different - it most likely wouldn't.

This only stresses the problem with the L2ARC's shortcomings for metadata, the way I see them (if they do indeed exist), and in particular that it chews your RAM a lot more than it could or should for a mechanism meant to increase caching efficiency. If their clientele is indeed similar, and if metaxels would be more efficient for metadata storage, then you might not need the L2ARC with its overheads, or not as much of it, and get a clear win in system resource consumption ;)

> How often do you expect cache devices to fail?
From what I hear, the life expectancy of today's consumer-grade devices is small (1-3 years) under heavy writes - and the L2ARC would likely exceed a METAXEL's write rates, due to the need to write the same metadata into the L2ARC time and again, if it were not for the special throttling that limits L2ARC write bandwidth.

> So to sum up, you're applying RAID to something that doesn't
> need it.

Well, metadata is kinda important - though here we do add a third copy where two previously sufficed. And you're not "required" to mirror it. On the other hand, if a METAXEL is a top-level vdev without the special resilience to its failure/absence described in my first post, then its failure would formally be considered a fatal situation and bring down the whole pool - unlike problems with L2ARC or ZIL devices, which can be ignored at the admin's discretion.

> And how is that different from having a cache-sizing policy
> which selects how much each data type gets allocated from
> a single common cache?
...
> All of this can be solved by cache sizing policies and
> L2ARC persistency.

Ultimately, I don't disagree with this point :) But I do think that this might not be the optimal solution in terms of RAM requirements, coding complexity, etc. If you want to store some data long-term - such as my desire to store the metadata - ZFS has mechanisms for that in the form of normal VDEVs (or a subclass of them, the metaxels) ;)

> *) implement a new vdev type (mirrored or straight metaxel)
> *) integrate all format changes to labels to describe these

One idea in the proposal - though I don't insist on sticking to it - is that the metaxel's job is described in the pool metadata (i.e. a read-only attribute which can be set during tlvdev creation/addition - metaxels:list-of-guids). Until the pool is imported, a metaxel looks like a normal single-disk/mirrored tlvdev in a normal pool.

This approach can limit the importability of a pool with failed metaxels, unless we expect that and try to make sense of the other pool devices - essentially until we can decipher the nvlist and see that the absent device is a metaxel, so the error is deemed non-fatal. However, the way I see it, this also requires no label changes or other incompatible on-disk format changes. As long as the metaxel is not faulted, any other ZFS implementation (like grub or an older livecd) can import this pool and read 1/3 of the metadata faster, on average ;)

> As noted before, you'll have to go through the code looking for paths
> which fetch metadata (mostly the object layer) and replace those with
> metaxel-aware calls. That's a lot of work for a POC.

Alas, for some years now I've been a lot less of a programmer and a lot more of a brainstormer ;) Still, judging from whatever experience I have, a working POC with some corners cut might be a matter of a week or two of coding... just to see whether the expected benefits over the L2ARC actually exist. The full-scale thing, yes, might take months or years even from a team of programmers ;)

Thanks,
//Jim
On 08/25/2012 12:22 AM, Jim Klimov wrote:
> Thanks... I still do not have the feeling that you've fully got my
> idea - or, alternately, that I correctly understand the ARC :)

Could be I misunderstood you, it's past midnight here...

> The way I understood the ARC (without/before L2ARC), it either caches
> pool blocks or it doesn't. More correctly, there is also a cache of
> ghosts without the bulk block data, so we can account for misses of
> recently expired blocks of either of the two categories, and thus
> adjust the cache subdivision towards MRU or MFU. Ultimately, the
> ghosts which were not requested also expire from the cache, and no
> reference to a recently-cached block remains.

Correct so far.

> With the L2ARC, on the other hand, there is a list of pointers in the
> ARC so that it knows which blocks were cached on the SSD - and the
> lack of this list upon pool import is, in effect, the perceived
> emptiness of the L2ARC device. The L2ARC's pointers are of comparable
> size to the small metadata blocks,

No, they're not. Here's l2arc_buf_hdr_t, the per-buffer structure held for buffers which were moved to the L2ARC:

typedef struct l2arc_buf_hdr {
	l2arc_dev_t	*b_dev;
	uint64_t	b_daddr;
} l2arc_buf_hdr_t;

That's about 16 bytes of overhead per block, or 3.125% if the block's data is 512 bytes long.

> and *this* consideration IMHO makes it much more efficient to use the
> L2ARC for larger cached blocks, especially on systems with limited
> RAM (which effectively limits the addressable L2ARC size as accounted
> in the number of blocks), with the added benefit that you can
> compress larger blocks in the L2ARC.

The main overhead comes from an arc_buf_hdr_t, which is pretty fat - around 180 bytes by a first-degree approximation - so in all around 200 bytes per ARC + L2ARC entry. At 512 bytes per block this is painfully inefficient (around 39% overhead); however, at a 4k average block size this drops to ~5%, and at a 64k average block size (which is entirely possible on average untuned storage pools) it drops down to ~0.3% overhead.

> This way, the *difference* between the L2ARC and a METAXEL is that
> the latter is an ordinary pool tlvdev with a specially biased read
> priority and write filter. If a metadata block is read, it goes into
> the ARC. If it expires - then there's a ghost for a while, and soon
> there is no memory that this block was ever cached - unlike the
> L2ARC's list of pointers, which are just a couple of times smaller
> than a cached block of this type.
> But re-fetching metadata from an SSD METAXEL is faster when it is
> needed again.

As explained above, the difference would be about 9% at best: sizeof(l2arc_buf_hdr_t) / sizeof(arc_buf_hdr_t) = 0.0888...

>> Also, you're wrong if you think the clientele of the L2ARC and the
>> metaxel would be different - it most likely wouldn't.
>
> This only stresses the problem with the L2ARC's shortcomings for
> metadata, the way I see them (if they do indeed exist), and in
> particular that it chews your RAM a lot more than it could or should
> for a mechanism meant to increase caching efficiency.

And as I demonstrated above, the savings would be negligible.

> If their clientele is indeed similar, and if metaxels would be more
> efficient for metadata storage, then you might not need the L2ARC
> with its overheads, or not as much of it, and get a clear win in
> system resource consumption ;)

Would it be a win? Probably. But the cost-benefit analysis suggests to me that it would probably simply not be worth the added hassle.

>> How often do you expect cache devices to fail?
>
> From what I hear, the life expectancy of today's consumer-grade
> devices is small (1-3 years) under heavy writes - and the L2ARC would
> likely exceed a METAXEL's write rates, due to the need to write the
> same metadata into the L2ARC time and again, if it were not for the
> special throttling that limits L2ARC write bandwidth.

Depending on your workload, L2ARC write throughput tends to get pretty low once you've cached in your working dataset. Remember, the L2ARC only caches random reads, so think stuff like databases, not linear copy operations. Once it's warmed up, it's pretty much read-only (assuming most of your working dataset fits in there).

>> So to sum up, you're applying RAID to something that doesn't
>> need it.
>
> Well, metadata is kinda important - though here we do add a third
> copy where two previously sufficed. And you're not "required" to
> mirror it. On the other hand, if a METAXEL is a top-level vdev
> without the special resilience to its failure/absence described in my
> first post, then its failure would formally be considered a fatal
> situation and bring down the whole pool - unlike problems with L2ARC
> or ZIL devices, which can be ignored at the admin's discretion.

Is doubly-redundant metadata not enough? Remember, if you've lost even a single vdev, your data is essentially toast, and doubly-redundant metadata is there essentially to try and save your ass by letting you copy off what remains readable (by making sure you have another metadata copy available somewhere else). If you've got a double-vdev failure, that's essentially considered a catastrophic pool failure.

>> And how is that different from having a cache-sizing policy
>> which selects how much each data type gets allocated from
>> a single common cache?
> ...
>> All of this can be solved by cache sizing policies and
>> L2ARC persistency.
>
> Ultimately, I don't disagree with this point :) But I do think that
> this might not be the optimal solution in terms of RAM requirements,
> coding complexity, etc. If you want to store some data long-term -
> such as my desire to store the metadata - ZFS has mechanisms for that
> in the form of normal VDEVs (or a subclass of them, the metaxels) ;)

How about we instead implemented L2ARC persistency?
That's a lot easier to do, and it would allow us to make all/most
caches persistent, not just the metadata cache.

>> *) implement a new vdev type (mirrored or straight metaxel)
>> *) integrate all format changes to labels to describe these
>
> One idea in the proposal - though I don't insist on sticking
> to it - is that the metaxel's job is described in the pool
> metadata (i.e. a read-only attribute which can be set during
> tlvdev creation/addition - metaxels:list-of-guids).
> Until the pool is imported, a metaxel looks like a normal
> single-disk/mirrored tlvdev in a normal pool.

Yeah, that would be workable, but the trouble is that when somebody
mounts the pool on an older version, they might allocate non-metadata
blocks there, resulting in an inconsistent metaxel state. That would
make implementing metaxel-failure resilience a lot harder. Plus,
you'll need to propagate information on the data type (metadata/normal
data) down to the spa layer - it might not be that hard, I haven't
looked into that code yet.

> This approach can limit importability of a pool with failed
> metaxels, unless we expect that and try to make sense of the
> other pool devices - essentially until we can decipher the
> nvlist and see that the absent device is a metaxel, so the
> error is deemed not fatal. However, this also requires no
> label changes or other incompatible on-disk format changes,
> the way I see it. As long as the metaxel is not faulted,
> any other ZFS implementation (like grub or an older livecd)
> can import this pool and read 1/3 of the metadata faster, on
> average ;)

Which is why I would propose to use cache sizing policies and possibly
persistent l2arc contents. A persistency-unaware host would simply use
the l2arc device as normal (so backwards compatibility wouldn't be an
issue), while newer hosts could happily coexist there.

>> As noted before, you'll have to go through the code to look for paths
>> which fetch metadata (mostly the object layer) and replace those with
>> metaxel-aware calls. That's a lot of work for a POC.
>
> Alas, for some years now I'm a lot less of a programmer and
> a lot more of a brainstormer ;) Still, judging from whatever
> experience I have, a working POC with some corners cut might
> be a matter of a week or two of coding... Just to see if the
> expected benefits in comparison to L2ARC do exist.
> The full-scale thing, yes, might take months or years from
> even a team of programmers ;)

Code talks ;-)

Cheers,
--
Saso
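As a concrete (if simplified) illustration of the "metaxels:list-of-guids"
attribute discussed above: the sketch below checks whether a given
top-level vdev GUID is listed in the pool config nvlist. The attribute
name ZPOOL_CONFIG_METAXELS and the helper vdev_is_metaxel() are
hypothetical names for this proposal; only the libnvpair lookup call is
existing API. An implementation that does not know the attribute would
simply never look it up, which is the backwards-compatibility behaviour
Jim is after.

#include <sys/types.h>
#include <libnvpair.h>

#define ZPOOL_CONFIG_METAXELS   "metaxels"      /* hypothetical attribute */

/* Is the given top-level vdev GUID listed as a metaxel in the pool config? */
static boolean_t
vdev_is_metaxel(nvlist_t *pool_config, uint64_t tlvdev_guid)
{
        uint64_t *guids;
        uint_t nguids, i;

        /* An absent attribute simply means "this pool has no metaxels". */
        if (nvlist_lookup_uint64_array(pool_config, ZPOOL_CONFIG_METAXELS,
            &guids, &nguids) != 0)
                return (B_FALSE);

        for (i = 0; i < nguids; i++) {
                if (guids[i] == tlvdev_guid)
                        return (B_TRUE);
        }
        return (B_FALSE);
}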
> No they're not. Here's l2arc_buf_hdr_t, the per-buffer structure held
> for buffers which were moved to l2arc:
>
> typedef struct l2arc_buf_hdr {
>         l2arc_dev_t     *b_dev;
>         uint64_t        b_daddr;
> } l2arc_buf_hdr_t;
>
> That's about 16 bytes of overhead per block, or 3.125% if the block's
> data is 512 bytes long.
>
> The main overhead comes from an arc_buf_hdr_t, which is pretty fat -
> around 180 bytes by a first-degree approximation - so in all around
> 200 bytes per ARC + L2ARC entry. At 512 bytes per block this is
> painfully inefficient (around 39% overhead); however, at a 4k average
> block size this drops to ~5%, and at a 64k average block size (which
> is entirely possible on average untuned storage pools) it drops down
> to ~0.3% overhead.

So... unless I miscalculated before drinking a morning coffee, for a
512b block quickly fetchable from SSD in both the L2ARC and METAXEL
cases, we have roughly these numbers?:

1) When the block is in RAM, we consume 512+180 bytes (though some ZFS
slides said that for 1 byte stored we spend 1 byte - I thought this
meant zero overhead, though I couldn't imagine how... or 100% overhead,
also quite unimaginable =) )

2L) When the block is on an L2ARC SSD, we spend 180+16 bytes (though
discussions about DDT on L2ARC, at least, settled on 176 bytes of cache
metainformation per entry moved off to L2ARC, with the DDT entry's size
being around 350 bytes, IIRC).

2M) When the block has expired from ARC and is only stored on the pool,
including the SSD-based copy on a METAXEL, we spend zero RAM to
reference this block from ARC - because we don't remember it anymore.
And when needed, we can access it just as fast (right?) as from L2ARC
on the same media type.

Where am I wrong? We seem to dispute THIS point over several emails,
and I'm ready to accept that you've seen the code and I'm the clueless
one. So I want to learn, then ;)

//Jim
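The figures being traded back and forth here are easy to cross-check
with a trivial userland calculation. The sketch below uses the
approximate structure sizes quoted in the thread (~180 bytes for
arc_buf_hdr_t, 16 bytes for l2arc_buf_hdr_t) - rough figures, not
sizeof() values from an actual build - and prints the per-block RAM
cost for the three cases enumerated above.

#include <stdio.h>

int
main(void)
{
        const double arc_hdr = 180.0;   /* assumed ~sizeof (arc_buf_hdr_t) */
        const double l2_hdr = 16.0;     /* assumed sizeof (l2arc_buf_hdr_t) */
        const double blksz[] = { 512.0, 4096.0, 65536.0 };
        int i;

        for (i = 0; i < 3; i++) {
                double b = blksz[i];
                /* 1)  block resident in ARC: data plus ARC header */
                printf("%6.0fB  in ARC (RAM): %6.0f bytes of RAM\n",
                    b, b + arc_hdr);
                /* 2L) block evicted to L2ARC: both headers stay in RAM */
                printf("%6.0fB  on L2ARC:     %6.0f bytes of RAM (%.1f%%)\n",
                    b, arc_hdr + l2_hdr, 100.0 * (arc_hdr + l2_hdr) / b);
                /* 2M) block only on pool media (e.g. a metaxel copy) */
                printf("%6.0fB  pool only:         0 bytes of RAM\n", b);
        }
        return (0);
}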
On 08/25/2012 11:53 AM, Jim Klimov wrote:
> So... unless I miscalculated before drinking a morning coffee, for a
> 512b block quickly fetchable from SSD in both the L2ARC and METAXEL
> cases, we have roughly these numbers?:
>
> 1) When the block is in RAM, we consume 512+180 bytes (though some ZFS
> slides said that for 1 byte stored we spend 1 byte - I thought this
> meant zero overhead, though I couldn't imagine how... or 100% overhead,
> also quite unimaginable =) )
>
> 2L) When the block is on an L2ARC SSD, we spend 180+16 bytes (though
> discussions about DDT on L2ARC, at least, settled on 176 bytes of cache
> metainformation per entry moved off to L2ARC, with the DDT entry's size
> being around 350 bytes, IIRC).
>
> 2M) When the block has expired from ARC and is only stored on the pool,
> including the SSD-based copy on a METAXEL, we spend zero RAM to
> reference this block from ARC - because we don't remember it anymore.
> And when needed, we can access it just as fast (right?) as from L2ARC
> on the same media type.
>
> Where am I wrong? We seem to dispute THIS point over several emails,
> and I'm ready to accept that you've seen the code and I'm the clueless
> one. So I want to learn, then ;)

The difference is that when you want to go fetch a block from a
metaxel, you still need some way to reference it. Either you use direct
references (i.e. ARC entries as above), or you use an indirect
mechanism, which means that for each read you will need to walk the
metaxel device, which is slow.

Cheers,
--
Saso
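To make the "direct reference" alternative concrete: conceptually, the
ARC is a hash table keyed by a block's DVA and birth txg, so a cached
(or ghost) block can be located without touching any device. The toy
sketch below only illustrates that shape; it is not the actual
buf_hash_find() logic in arc.c, and all toy_-prefixed names are
hypothetical.

#include <stddef.h>
#include <stdint.h>

typedef struct toy_dva {
        uint64_t        d_word[2];      /* encodes vdev, offset, etc. */
} toy_dva_t;

typedef struct toy_hdr {
        toy_dva_t       h_dva;          /* which on-disk copy was cached */
        uint64_t        h_birth;        /* birth txg of the block */
        void            *h_data;        /* cached data, or NULL for a ghost */
        struct toy_hdr  *h_next;        /* hash-chain link */
} toy_hdr_t;

#define TOY_HASH_SIZE   1024
static toy_hdr_t *toy_hash[TOY_HASH_SIZE];

/* Find a cached block purely in RAM, by the address we already know. */
static toy_hdr_t *
toy_lookup(const toy_dva_t *dva, uint64_t birth)
{
        uint64_t idx = (dva->d_word[0] ^ dva->d_word[1] ^ birth) %
            TOY_HASH_SIZE;
        toy_hdr_t *h;

        for (h = toy_hash[idx]; h != NULL; h = h->h_next) {
                if (h->h_dva.d_word[0] == dva->d_word[0] &&
                    h->h_dva.d_word[1] == dva->d_word[1] &&
                    h->h_birth == birth)
                        return (h);     /* hit: header (and maybe data) in RAM */
        }
        return (NULL);                  /* miss: read from disk/SSD via the DVA */
}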
2012-08-25 15:46, Sašo Kiselkov wrote:
> The difference is that when you want to go fetch a block from a
> metaxel, you still need some way to reference it. Either you use direct
> references (i.e. ARC entries as above), or you use an indirect
> mechanism, which means that for each read you will need to walk the
> metaxel device, which is slow.

Um... how does your application (or the filesystem during its metadata
traversal) know that it wants to read a certain block? IMHO, it has the
block's address for that (likely known from a higher-level "parent"
block of metadata), and it requests: "give me L bytes from offset O on
tlvdev T", which is the layman's interpretation of a DVA.

From what I understand, with ARC-cached blocks we traverse the
RAM-based cache and find one with the requested DVA; then we have its
data already in RAM and return it to the caller. If the block is not in
ARC (and there's no L2ARC), we can fetch it from the media using the
DVA address(es?) we already know from the request. In the case of L2ARC
there is probably a non-null pointer to the l2arc_buf_hdr_t, so we can
request the block from the L2ARC.

If true, this is no faster than fetching the block from the same SSD
used as a metadata accelerator instead of as an L2ARC device with a
policy (or even without one, as today), and in comparison it only
wastes RAM on the ARC entries.

BTW, as I see in "struct arc_buf_hdr", it only stores one DVA - so I
guess for blocks with multiple on-disk copies it is possible to have
them cached twice, or does ZFS always enforce storing and searching the
ARC by a particular DVA of the block (likely DVA[0])?

http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/arc.c#433

If we do go with METAXELs as I described/proposed, and prefer fetching
metadata from SSD unless there are errors, then some care should be
taken to use this instance of the DVA to reference cached metadata
blocks in the ARC.

//Jim Klimov
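In outline, the proposed read bias could look like the sketch below: a
block pointer carries up to three DVAs, and the vdev id embedded in
each DVA identifies the top-level vdev holding that copy, so a reader
can try a metaxel copy first and fall back to the ordinary copies on
error. All toy_-prefixed names are hypothetical; this is a sketch of
the idea, not the actual zio/vdev_mirror code path.

#include <stddef.h>
#include <stdint.h>

#define TOY_DVAS_PER_BP 3

typedef struct toy_dva {
        uint64_t        d_vdev;         /* top-level vdev id */
        uint64_t        d_offset;       /* offset within that vdev */
} toy_dva_t;

typedef struct toy_blkptr {
        toy_dva_t       b_dva[TOY_DVAS_PER_BP];
        int             b_ndvas;        /* how many copies were written */
} toy_blkptr_t;

/* Assumed to exist elsewhere in this sketch. */
extern int toy_vdev_is_metaxel(uint64_t vdev_id);
/* Read one copy; returns 0 if the data arrived and its checksum matched. */
extern int toy_read_dva(const toy_dva_t *dva, void *buf, size_t len);

static int
toy_read_prefer_metaxel(const toy_blkptr_t *bp, void *buf, size_t len)
{
        int i;

        /* First pass: only the copy (if any) that lives on a metaxel. */
        for (i = 0; i < bp->b_ndvas; i++) {
                if (toy_vdev_is_metaxel(bp->b_dva[i].d_vdev) &&
                    toy_read_dva(&bp->b_dva[i], buf, len) == 0)
                        return (0);
        }
        /* Fall back to the ordinary (HDD) copies on error or during scrub. */
        for (i = 0; i < bp->b_ndvas; i++) {
                if (!toy_vdev_is_metaxel(bp->b_dva[i].d_vdev) &&
                    toy_read_dva(&bp->b_dva[i], buf, len) == 0)
                        return (0);
        }
        return (-1);    /* every copy failed; the caller reports an I/O error */
}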
On 2012-08-24 14:39, Jim Klimov wrote:
> Hello all,
>
> The idea of dedicated metadata devices (likely SSDs) for ZFS
> has been generically discussed a number of times on this list,
> but I don't think I've seen a final proposal that someone would
> take up for implementation (as a public source code, at least).

Hi,

OK, I am not a ZFS dev and have barely even looked at the code, but it
seems to me that this could be dealt with in an easier and more
efficient manner by modifying the current L2ARC code to make a
persistent cache device, and adding the preference mechanism somebody
has already suggested (e.g. prefer metadata, or prefer specific types
of metadata).

My reasoning is as follows:

1) As metadata is already available on the main pool devices, there is
no need to make this data redundant. It is there for acceleration. In
the event of a failure it can just be read directly from the pool, and
there is no need to write the data twice (as would happen in a mirrored
'metaxel') or waste the space. This is only my opinion, but it makes
sense to me.

The other option, for me, would be to make it the main storage area for
metadata, with no requirement to store it on the main pool devices
beyond needing enough copies. I.e. if you need 2 metadata copies but
have only one metaxel, store one on there and one in the pool. If you
need 2 copies and there are 2 metaxels, store them on the metaxels - no
pool storage needed.

2) Persistent cache devices and cache policies would bring more
benefits to the system overall than adding this metaxel: no warming of
the cache (besides reading in what is stored there on import/boot, so
let's say accelerated warming) and finer control over what to store in
the cache. The cache devices could then be tuned on a per-dataset (and
possibly per-cache-device, so certain data types prefer the cache
device with the best performance profile for them) basis to provide the
best for your own unique situation. Possibly even a "keep this dataset
in cache at all times" option would be useful for less frequently
accessed but time-critical data (so no more loops cat'ing to /dev/null
to keep data in cache).

3) This would provide, IMHO, the building blocks for a combined
cache/log device. This would basically go as follows: you set up, say,
a pair of persistent cache devices. You then tell ZFS that these can be
used for ZIL blocks, with something like the copies attribute to tell
it to ensure redundancy. So it basically builds a ZIL device from
blocks within the cache as it needs them. It would not be as fast as a
dedicated log device, but would allow greater efficiency.

Point 3 would be for future development, but I believe the benefits of
cache persistence and policies are enough to make them a priority. I
believe this would cover what the metaxel is trying to do and more.

The other, simpler, option I could see is a flag which tells ZFS "keep
metadata in the cache", which ensures all metadata (where possible) is
stored in ARC/L2ARC at all times, and possibly forces it to be read in
on import/boot.
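For reference, ZFS already has coarse per-dataset knobs in this spirit:
the primarycache and secondarycache properties accept all, none or
metadata (e.g. "zfs set secondarycache=metadata pool/dataset"). The
sketch below simply restates that kind of policy check in code and adds
the hypothetical "keep this dataset in cache at all times" option from
the message above; the toy_-prefixed names are not existing ZFS code.

typedef enum toy_cache_policy {
        TOY_CACHE_ALL,          /* cache data and metadata (the default) */
        TOY_CACHE_METADATA,     /* cache metadata only */
        TOY_CACHE_NONE,         /* cache nothing from this dataset */
        TOY_CACHE_PINNED        /* hypothetical: never evict this dataset */
} toy_cache_policy_t;

typedef enum toy_buf_type {
        TOY_BUF_DATA,
        TOY_BUF_METADATA
} toy_buf_type_t;

/* Should a buffer of this type be admitted to (or kept in) the cache? */
static int
toy_cache_admit(toy_cache_policy_t pol, toy_buf_type_t type)
{
        switch (pol) {
        case TOY_CACHE_ALL:
        case TOY_CACHE_PINNED:
                return (1);
        case TOY_CACHE_METADATA:
                return (type == TOY_BUF_METADATA);
        case TOY_CACHE_NONE:
        default:
                return (0);
        }
}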