thr3ads.net - zfs discuss - [zfs-discuss] questions about the DDT and other things [Dec 2011]

Ragnar Sundblad

2011-Dec-02 00:59 UTC

[zfs-discuss] questions about the DDT and other things

I am sorry if these are dumb questions. If there are explanations
available somewhere for those questions that I just haven't found,
please let me know! :-)

1. It has been said that when the DDT entries, some 376 bytes or so,
are rolled out to L2ARC, there are still some 170 bytes in the ARC to
reference them (or rather the ZAP objects, I believe). In some places
it sounds like those 170 bytes refer to ZAP objects that contain
several DDT entries. In other cases it sounds like for each DDT entry
in the L2ARC there must be one 170-byte reference in the ARC. What is
the story here really?

2. Deletion with dedup enabled is a lot heavier for some reason that I
don't understand. It is said that the DDT entries have to be updated
for each deleted reference to that block. Since zfs already has a
mechanism for sharing blocks (for example with snapshots), I don't
understand why the DDT has to contain any more block references at
all, or why deletion should be much harder just because there are
checksums (DDT entries) tied to those blocks, and even if they have
to, why it would be much harder than the other block reference
mechanism. If anyone could explain this (or give me a pointer to an
explanation), I'd be very happy!

3. I, as many others, would of course like to be able to have very
large datasets deduped without having to have enormous amounts of RAM.
Since the DDT is an AVL tree, couldn't that entire tree just be cached
on, for example, an SSD and be searched there without necessarily
having to store any of it in RAM? That would probably require some
changes to the DDT lookup code, and some mechanism to gather the tree
to be able to lift it over to the SSD cache, and some other stuff, but
still that sounds - with my very basic (non-)understanding of zfs -
like a not too overwhelming change.

4. Now and then people mention that the problem with bp_rewrite has
been explained, on this very mailing list I believe, but I haven't
found that explanation. Could someone please give me a pointer to that
description (or perhaps explain it again :-) )?

Thanks for any enlightenment!

/ragge

Erik Trimble

2011-Dec-02 01:54 UTC

On 12/1/2011 4:59 PM, Ragnar Sundblad wrote:
> I am sorry if these are dumb questions. If there are explanations
> available somewhere for those questions that I just haven't found,
> please let me know! :-)
>
> 1. It has been said that when the DDT entries, some 376 bytes or so,
> are rolled out to L2ARC, there are still some 170 bytes in the ARC to
> reference them (or rather the ZAP objects, I believe). In some places
> it sounds like those 170 bytes refer to ZAP objects that contain
> several DDT entries. In other cases it sounds like for each DDT entry
> in the L2ARC there must be one 170-byte reference in the ARC. What is
> the story here really?

Yup. Each entry (not just a DDT entry, but any cached reference) in
the L2ARC requires a pointer record in the ARC, so the DDT entries
held in L2ARC also consume ARC space.  It's a bad situation.

> 2. Deletion with dedup enabled is a lot heavier for some reason that I
> don't understand. It is said that the DDT entries have to be updated
> for each deleted reference to that block. Since zfs already has a
> mechanism for sharing blocks (for example with snapshots), I don't
> understand why the DDT has to contain any more block references at
> all, or why deletion should be much harder just because there are
> checksums (DDT entries) tied to those blocks, and even if they have
> to, why it would be much harder than the other block reference
> mechanism. If anyone could explain this (or give me a pointer to an
> explanation), I'd be very happy!

Remember that, when using dedup, each block can potentially be part of
a very large number of files. So, when you delete a file, you have to
go look at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the
appropriate DDT updates.  It's essentially the same problem that
erasing snapshots has - for each block you delete, you have to find
and update the metadata for all the other files that share that block
usage.  Dedup and snapshot deletion share the same problem; it's just
usually worse for dedup, since there's a much larger number of blocks
that have to be updated.

The problem is that you really need to have the entire DDT in some
form of high-speed random-access memory in order for things to be
efficient. If you have to search the entire hard drive to get the
proper DDT entry every time you delete a block, then your IOPS limits
are going to get hammered hard.

> 3. I, as many others, would of course like to be able to have very
> large datasets deduped without having to have enormous amounts of RAM.
> Since the DDT is an AVL tree, couldn't that entire tree just be cached
> on, for example, an SSD and be searched there without necessarily
> having to store any of it in RAM? That would probably require some
> changes to the DDT lookup code, and some mechanism to gather the tree
> to be able to lift it over to the SSD cache, and some other stuff, but
> still that sounds - with my very basic (non-)understanding of zfs -
> like a not too overwhelming change.

L2ARC typically sits on an SSD, and the DDT is usually held there, if
the L2ARC device exists.  There does need to be serious work on
changing how the DDT in the L2ARC is referenced, however; the ARC
memory requirements for DDT-in-L2ARC definitely need to be removed
(which requires a non-trivial rearchitecting of dedup).  There are
some other changes that have to happen for dedup to be really usable.
Unfortunately, I can't see anyone around willing to do those changes,
and my understanding of the code says that it is much more likely that
we will simply remove and replace the entire dedup feature rather than
trying to fix the existing design.

> 4. Now and then people mention that the problem with bp_rewrite has
> been explained, on this very mailing list I believe, but I haven't
> found that explanation. Could someone please give me a pointer to that
> description (or perhaps explain it again :-) )?
>
> Thanks for any enlightenment!
>
> /ragge

bp_rewrite is the name of a feature, and of the (as yet unimplemented)
system call behind it, which does Block Pointer rewriting. It would
allow ZFS to change the physical location on media of an existing ZFS
data slab. That is, bp_rewrite is necessary to allow ZFS to change the
physical layout of data on media without changing the conceptual
arrangement of that data.

It's been the #1 most-wanted feature of ZFS since I can remember,
probably for 10 years now.

-Erik

Ragnar Sundblad

2011-Dec-02 02:44 UTC

Thanks for your answers!

On 2 dec 2011, at 02:54, Erik Trimble wrote:
> On 12/1/2011 4:59 PM, Ragnar Sundblad wrote:
>> I am sorry if these are dumb questions. If there are explanations
>> available somewhere for those questions that I just haven't found,
>> please let me know! :-)
>>
>> 1. It has been said that when the DDT entries, some 376 bytes or so,
>> are rolled out to L2ARC, there are still some 170 bytes in the ARC to
>> reference them (or rather the ZAP objects, I believe). In some places
>> it sounds like those 170 bytes refer to ZAP objects that contain
>> several DDT entries. In other cases it sounds like for each DDT entry
>> in the L2ARC there must be one 170-byte reference in the ARC. What is
>> the story here really?
>
> Yup. Each entry (not just a DDT entry, but any cached reference) in
> the L2ARC requires a pointer record in the ARC, so the DDT entries
> held in L2ARC also consume ARC space.  It's a bad situation.

Yes, it is a bad situation. But how many DDT entries can there be in
each ZAP object? Some have suggested a 1:1 relationship, others have
suggested that it isn't.

>> 2. Deletion with dedup enabled is a lot heavier for some reason that I
>> don't understand. It is said that the DDT entries have to be updated
>> for each deleted reference to that block. Since zfs already has a
>> mechanism for sharing blocks (for example with snapshots), I don't
>> understand why the DDT has to contain any more block references at
>> all, or why deletion should be much harder just because there are
>> checksums (DDT entries) tied to those blocks, and even if they have
>> to, why it would be much harder than the other block reference
>> mechanism. If anyone could explain this (or give me a pointer to an
>> explanation), I'd be very happy!
>
> Remember that, when using dedup, each block can potentially be part of
> a very large number of files. So, when you delete a file, you have to
> go look at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the
> appropriate DDT updates.  It's essentially the same problem that
> erasing snapshots has - for each block you delete, you have to find
> and update the metadata for all the other files that share that block
> usage.  Dedup and snapshot deletion share the same problem; it's just
> usually worse for dedup, since there's a much larger number of blocks
> that have to be updated.

What is it that must be updated in the DDT entries - a ref count?
And how does that differ from the snapshot case, which seems like
a very similar mechanism?

> The problem is that you really need to have the entire DDT in some
> form of high-speed random-access memory in order for things to be
> efficient. If you have to search the entire hard drive to get the
> proper DDT entry every time you delete a block, then your IOPS limits
> are going to get hammered hard.

Indeed!

>> 3. I, as many others, would of course like to be able to have very
>> large datasets deduped without having to have enormous amounts of RAM.
>> Since the DDT is an AVL tree, couldn't that entire tree just be cached
>> on, for example, an SSD and be searched there without necessarily
>> having to store any of it in RAM? That would probably require some
>> changes to the DDT lookup code, and some mechanism to gather the tree
>> to be able to lift it over to the SSD cache, and some other stuff, but
>> still that sounds - with my very basic (non-)understanding of zfs -
>> like a not too overwhelming change.
>
> L2ARC typically sits on an SSD, and the DDT is usually held there, if
> the L2ARC device exists.

Well, it rather seems to be ZAP objects, referenced from the ARC, which
happen to contain DDT entries, that are in the L2ARC.

I mean that you could just move the entire AVL tree onto the SSD,
completely outside of zfs if you will, and have it searched there, not
dependent on what is in RAM at all.
Every DDT lookup would take up to [tree depth] reads, but that could
be OK if you have an SSD which is fast at reading (which many are).

> There does need to be serious work on changing how the DDT in the
> L2ARC is referenced, however; the ARC memory requirements for
> DDT-in-L2ARC definitely need to be removed (which requires a
> non-trivial rearchitecting of dedup).  There are some other changes
> that have to happen for dedup to be really usable. Unfortunately, I
> can't see anyone around willing to do those changes, and my
> understanding of the code says that it is much more likely that we
> will simply remove and replace the entire dedup feature rather than
> trying to fix the existing design.

Yes, replacing it is certainly one possibility.
Is there any work going on for a replacement mechanism?

>> 4. Now and then people mention that the problem with bp_rewrite has
>> been explained, on this very mailing list I believe, but I haven't
>> found that explanation. Could someone please give me a pointer to that
>> description (or perhaps explain it again :-) )?
>>
>> Thanks for any enlightenment!
>>
>> /ragge
>
> bp_rewrite is the name of a feature, and of the (as yet unimplemented)
> system call behind it, which does Block Pointer rewriting. It would
> allow ZFS to change the physical location on media of an existing ZFS
> data slab. That is, bp_rewrite is necessary to allow ZFS to change the
> physical layout of data on media without changing the conceptual
> arrangement of that data.
>
> It's been the #1 most-wanted feature of ZFS since I can remember,
> probably for 10 years now.

Yes, I got that much. :-)
But what is the problem really?
Being naive/ignorant (and completely ignoring any possible dependencies
between the different layers in the zfs stack), it doesn't seem that
magic or esoteric when compared to the rest of the stuff in there.

/ragge

Daniel Carosone

2011-Dec-02 03:44 UTC

On Fri, Dec 02, 2011 at 01:59:37AM +0100, Ragnar Sundblad wrote:
> I am sorry if these are dumb questions. If there are explanations
> available somewhere for those questions that I just haven't found,
> please let me know! :-)

I'll give you a brief summary.

> 1. It has been said that when the DDT entries, some 376 bytes or so,
> are rolled out to L2ARC, there are still some 170 bytes in the ARC to
> reference them (or rather the ZAP objects, I believe). In some places
> it sounds like those 170 bytes refer to ZAP objects that contain
> several DDT entries. In other cases it sounds like for each DDT entry
> in the L2ARC there must be one 170-byte reference in the ARC. What is
> the story here really?

Currently, every object (not just DDT entries) stored in L2ARC is
tracked in memory. This metadata identifies the object and where on
L2ARC it is stored. The L2ARC on-disk doesn't contain metadata and is
not self-describing. This is one reason why the L2ARC starts out
empty/cold after every reboot, and why the usable size of L2ARC is
limited by memory.

DDT entries in core are used directly.  If the relevant DDT node is
not in core, it must be fetched from the pool, which may in turn be
assisted by an L2ARC.  It's my understanding that, yes, several DDT
entries are stored in each on-disk "block", though I'm not certain of
the number.  The on-disk size of the DDT entry is different, too.
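
As a rough worked example of the in-memory tracking cost described above, the sketch below uses the approximate figures quoted in this thread (about 376 bytes per in-core DDT entry, about 170 bytes of ARC header per buffer held in L2ARC). The number of DDT entries per cached buffer is left as a parameter, since nobody in the thread is sure of it. The function name and the example pool are made up for illustration.

```python
# Back-of-envelope estimate only; the byte counts are the approximate
# figures quoted in this thread, not values read from the ZFS source.
DDT_ENTRY_BYTES = 376     # approx. in-core size of one DDT entry
L2ARC_HEADER_BYTES = 170  # approx. ARC header per buffer held in L2ARC


def arc_cost_of_l2arc_ddt(num_entries, entries_per_buffer=1):
    """ARC bytes needed just to track DDT data that lives in L2ARC.

    entries_per_buffer is an assumption: 1 models the "1:1" reading,
    larger values model several DDT entries sharing one cached buffer.
    """
    buffers = -(-num_entries // entries_per_buffer)  # ceiling division
    return buffers * L2ARC_HEADER_BYTES


# Hypothetical pool: 1 TiB of unique data in 128 KiB blocks.
entries = (1 << 40) // (128 << 10)          # 8,388,608 unique blocks
in_core = entries * DDT_ENTRY_BYTES         # whole DDT held in RAM
tracked_1to1 = arc_cost_of_l2arc_ddt(entries, 1)
tracked_8per = arc_cost_of_l2arc_ddt(entries, 8)

print(f"DDT fully in ARC:        {in_core / 2**20:8.0f} MiB")
print(f"headers, 1 entry/buffer: {tracked_1to1 / 2**20:8.0f} MiB")
print(f"headers, 8 entries/buf:  {tracked_8per / 2**20:8.0f} MiB")
```

Under the 1:1 reading, moving the table to L2ARC still costs a bit under half of the RAM it would take to hold the table in core; if several entries share a buffer, the header overhead shrinks in proportion.
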
> 2. Deletion with dedup enabled is a lot heavier for some reason that I
> don't understand. It is said that the DDT entries have to be updated
> for each deleted reference to that block. Since zfs already has a
> mechanism for sharing blocks (for example with snapshots), I don't
> understand why the DDT has to contain any more block references at
> all, or why deletion should be much harder just because there are
> checksums (DDT entries) tied to those blocks, and even if they have
> to, why it would be much harder than the other block reference
> mechanism. If anyone could explain this (or give me a pointer to an
> explanation), I'd be very happy!

DDT entries are reference-counted.  Unlike other things that look like
multiple references, these are truly block-level independent.

Everything else is either tree-structured or highly aggregated (metaslab
free-space tracking).

Snapshots, for example, are references to a certain internal node (the
root of a filesystem tree at a certain txg), and that counts as a
reference to the entire subtree underneath.  Note that any changes to
this subtree later (via writes into the live filesystem) diverge
completely via CoW; an update produces a new CoW block tree all the way
back to the root, above the snapshot node.

When a snapshot is created, it starts out owning (almost) nothing. As
data is overwritten, the ownership of the data that might otherwise be
freed is transferred to the snapshot.

When the oldest snapshot is freed, any data blocks it owns can be
freed. When an intermediate snapshot is freed, data blocks it owns are
either transferred to the previous older snapshot because they were
shared with it (txg < snapshot's) or they're unique to this snapshot
and can be freed.

Either way, these decisions are tree based and can potentially free
large swathes of space with a single decision, whereas the DDT needs
refcount updates individually for each block (in random order, as per
below).

(This is not the same as the ZPL directory tree used for naming,
however; don't get those confused. It's flatter than that.)
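
The per-block refcount bookkeeping described above can be sketched with a toy table. This is only an illustration under simplifying assumptions: the class `ToyDDT` and its methods are hypothetical, the table is a plain in-memory dict keyed by SHA-256 digest, and nothing here reflects the real on-disk DDT format.

```python
# Toy model, not ZFS code: a dedup table as a dict keyed by block checksum.
import hashlib


class ToyDDT:
    def __init__(self):
        self.refcount = {}   # checksum -> number of references
        self.updates = 0     # how many table entries we had to touch

    def write_file(self, blocks):
        """Store a file; return the list of checksums that make it up."""
        keys = []
        for data in blocks:
            key = hashlib.sha256(data).hexdigest()
            self.refcount[key] = self.refcount.get(key, 0) + 1
            self.updates += 1
            keys.append(key)
        return keys

    def delete_file(self, keys):
        """Drop one reference per block; return checksums now free."""
        freed = []
        for key in keys:                  # one lookup + update per block
            self.refcount[key] -= 1
            self.updates += 1
            if self.refcount[key] == 0:
                del self.refcount[key]
                freed.append(key)
        return freed


ddt = ToyDDT()
file_a = ddt.write_file([b"alpha", b"beta", b"gamma"])
file_b = ddt.write_file([b"alpha", b"beta", b"delta"])

freed = ddt.delete_file(file_a)
print(len(freed), "block(s) freed after touching", len(file_a), "entries")
```

Deleting the three-block file touches three table entries even though only one block actually becomes free; there is no single tree-level decision that covers all of them.
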
> 3. I, as many others, would of course like to be able to have very
> large datasets deduped without having to have enormous amounts of RAM.
> Since the DDT is an AVL tree, couldn't that entire tree just be cached
> on, for example, an SSD and be searched there without necessarily
> having to store any of it in RAM? That would probably require some
> changes to the DDT lookup code, and some mechanism to gather the tree
> to be able to lift it over to the SSD cache, and some other stuff, but
> still that sounds - with my very basic (non-)understanding of zfs -
> like a not too overwhelming change.

Think of this the other way round. One could do this, and could
require a dedicated device (SSD) in order to use dedup at all.  Now,
every DDT lookup requires IO to bring the DDT entry into memory.  This
would be slow, so we could add an in-memory cache for the DDT... and
we're back to square one.

The major issue with the DDT is that, being content-hash indexed, it
is random-access, even for sequential-access data.  There's no getting
around that; it's in its job description.
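
A small illustration of why a hash-indexed table turns sequential data into random lookups: the sketch below hashes sixteen logically consecutive blocks of a made-up file with SHA-256 and shows where each would land in a table ordered by checksum. The block contents and the use of a sorted list as the "table" are assumptions for the example only.

```python
import hashlib

# Sixteen logically consecutive blocks of one hypothetical file.
blocks = [f"block {i} of one sequential file".encode() for i in range(16)]

# The table is keyed by checksum, so its order is the order of the digests.
digests = [hashlib.sha256(b).hexdigest() for b in blocks]
table_order = sorted(range(len(blocks)), key=lambda i: digests[i])

for slot, i in enumerate(table_order):
    print(f"table slot {slot:2d}: file block {i:2d}  key {digests[i][:12]}...")
```

Reading the file front to back visits the table slots in an order that has nothing to do with file order, which is the access pattern that makes the DDT so hungry for fast random I/O.
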
> 4. Now and then people mention that the problem with bp_rewrite has
> been explained, on this very mailing list I believe, but I haven't
> found that explanation. Could someone please give me a pointer to that
> description (or perhaps explain it again :-) )?

This relates to the answer for 2; all the pointers in the tree
discussed there are block pointers to device virtual addresses.  If
you're going to move data, you're going to change its address, which
necessitates updating all the trees that reference it with new
hashes. Several things make this tricky:

 - you're trying to follow references the wrong way, so there's a lot
   of tree-searching to be done, even just to find dependencies.
   Resolving those dependencies may be harder still with lots of
   combinatorial complexity and reverse searching for information.
 - you want to retain CoW semantics for safety of update in making the
   changes, yet the rest of the filesystem depends on the semantics of
   these blocks not changing.
 - as a result of the combination of the above, you may wind up with
   races/contention against live filesystem updates, scrubs and other
   errors/recoveries, and the need to add a lot of complex locking or
   other mechanisms that are currently not needed.

It's not impossible, but you will wind up touching lots of code and
making all the tests much more complex.
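
The "following references the wrong way" point can be sketched with a toy block tree. This is a deliberately simplified model with hypothetical names (`find_referrers`, `move_block`): blocks are dict entries keyed by a made-up address, parents are patched in place, and the knock-on effect that a real copy-on-write filesystem would see (each rewritten parent getting a new checksum and address of its own, all the way to the root) is ignored.

```python
# Toy model: each block stores the addresses of its children only;
# nothing records who points *at* a block.
tree = {
    100: [200, 300],     # root
    200: [400, 500],
    300: [500, 600],     # block 500 is shared by two parents
    400: [], 500: [], 600: [],
}


def find_referrers(tree, target):
    """No back-pointers, so every block must be scanned."""
    return sorted(addr for addr, kids in tree.items() if target in kids)


def move_block(tree, old, new):
    """Relocate a block and patch every parent that pointed at it."""
    parents = find_referrers(tree, old)
    tree[new] = tree.pop(old)
    for p in parents:
        tree[p] = [new if k == old else k for k in tree[p]]
    return parents


touched = move_block(tree, 500, 700)
print("parents rewritten:", touched)
```

Even in this tiny model, moving one block means a full scan to find its referrers and a rewrite of each of them; doing that safely against a live pool is where the locking and consistency problems listed above come from.
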

--
Dan.

Erik Trimble

2011-Dec-02 04:21 UTC

On 12/1/2011 6:44 PM, Ragnar Sundblad wrote:
> Thanks for your answers!
>
> On 2 dec 2011, at 02:54, Erik Trimble wrote:
>
>> On 12/1/2011 4:59 PM, Ragnar Sundblad wrote:
>>> I am sorry if these are dumb questions. If there are explanations
>>> available somewhere for those questions that I just haven't found,
>>> please let me know! :-)
>>>
>>> 1. It has been said that when the DDT entries, some 376 bytes or so,
>>> are rolled out to L2ARC, there are still some 170 bytes in the ARC to
>>> reference them (or rather the ZAP objects, I believe). In some places
>>> it sounds like those 170 bytes refer to ZAP objects that contain
>>> several DDT entries. In other cases it sounds like for each DDT entry
>>> in the L2ARC there must be one 170-byte reference in the ARC. What is
>>> the story here really?
>>
>> Yup. Each entry (not just a DDT entry, but any cached reference) in
>> the L2ARC requires a pointer record in the ARC, so the DDT entries
>> held in L2ARC also consume ARC space.  It's a bad situation.
>
> Yes, it is a bad situation. But how many DDT entries can there be in
> each ZAP object? Some have suggested a 1:1 relationship, others have
> suggested that it isn't.

I'm pretty sure it's NOT 1:1, but I'd have to go look at the code. In
any case, it's not a very big number, so you're still looking at the
same O(n) as the number of DDT entries (n).

>>> 2. Deletion with dedup enabled is a lot heavier for some reason that I
>>> don't understand. It is said that the DDT entries have to be updated
>>> for each deleted reference to that block. Since zfs already has a
>>> mechanism for sharing blocks (for example with snapshots), I don't
>>> understand why the DDT has to contain any more block references at
>>> all, or why deletion should be much harder just because there are
>>> checksums (DDT entries) tied to those blocks, and even if they have
>>> to, why it would be much harder than the other block reference
>>> mechanism. If anyone could explain this (or give me a pointer to an
>>> explanation), I'd be very happy!
>>
>> Remember that, when using dedup, each block can potentially be part of
>> a very large number of files. So, when you delete a file, you have to
>> go look at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the
>> appropriate DDT updates.  It's essentially the same problem that
>> erasing snapshots has - for each block you delete, you have to find
>> and update the metadata for all the other files that share that block
>> usage.  Dedup and snapshot deletion share the same problem; it's just
>> usually worse for dedup, since there's a much larger number of blocks
>> that have to be updated.
>
> What is it that must be updated in the DDT entries - a ref count?
> And how does that differ from the snapshot case, which seems like
> a very similar mechanism?

It is similar to the snapshot case, in that the block itself has a
reference count in its structure (for use in both dedup and snapshots)
that would get updated upon "delete", but you also have to consider
that the DDT entry itself, which is a separate structure from the block
structure, also has to be updated. This is a whole additional I/O to get
that additional structure. So, more or less, a dedup delete has to do two
operations for every one that a snapshot delete does.  Plus,

>> The problem is that you really need to have the entire DDT in some
>> form of high-speed random-access memory in order for things to be
>> efficient. If you have to search the entire hard drive to get the
>> proper DDT entry every time you delete a block, then your IOPS limits
>> are going to get hammered hard.
>
> Indeed!
>
>>> 3. I, as many others, would of course like to be able to have very
>>> large datasets deduped without having to have enormous amounts of RAM.
>>> Since the DDT is an AVL tree, couldn't that entire tree just be cached
>>> on, for example, an SSD and be searched there without necessarily
>>> having to store any of it in RAM? That would probably require some
>>> changes to the DDT lookup code, and some mechanism to gather the tree
>>> to be able to lift it over to the SSD cache, and some other stuff, but
>>> still that sounds - with my very basic (non-)understanding of zfs -
>>> like a not too overwhelming change.
>>
>> L2ARC typically sits on an SSD, and the DDT is usually held there, if
>> the L2ARC device exists.
>
> Well, it rather seems to be ZAP objects, referenced from the ARC, which
> happen to contain DDT entries, that are in the L2ARC.
>
> I mean that you could just move the entire AVL tree onto the SSD,
> completely outside of zfs if you will, and have it searched there, not
> dependent on what is in RAM at all.
> Every DDT lookup would take up to [tree depth] reads, but that could
> be OK if you have an SSD which is fast at reading (which many are).

ZFS currently treats all metadata (which includes DDT entries) and data
slabs the same when it comes to choosing to migrate them from ARC to
L2ARC, so the most-frequently-accessed info is in the ARC (regardless of
what that info is), and everything else sits in the L2ARC.  But ALL
entries in the L2ARC require an ARC reference pointer.

Under normal operation, you really should have an L2ARC device capable
of holding the entire DDT, to get the random IOPS benefit from that.
However, using the current design, that still consumes a rather large
amount of ARC space to hold the L2ARC reference pointers. A redesign
effort should definitely reconsider how this is done - probably the most
efficient way would be to delete L2ARC ref pointers completely in ARC,
and just force a search of L2ARC if the data isn't found in the ARC.
But that's just a guess at a new implementation; I'm sure there are
gotchas around that, and, like I said, I suspect that the only way to
save dedup is to kill dedup (then redo it from scratch).

>> There does need to be serious work on changing how the DDT in the
>> L2ARC is referenced, however; the ARC memory requirements for
>> DDT-in-L2ARC definitely need to be removed (which requires a
>> non-trivial rearchitecting of dedup).  There are some other changes
>> that have to happen for dedup to be really usable. Unfortunately, I
>> can't see anyone around willing to do those changes, and my
>> understanding of the code says that it is much more likely that we
>> will simply remove and replace the entire dedup feature rather than
>> trying to fix the existing design.
>
> Yes, replacing it is certainly one possibility.
> Is there any work going on for a replacement mechanism?

Not that I know of, and there hasn't been any talk on any of these
lists about it.

>>> 4. Now and then people mention that the problem with bp_rewrite has
>>> been explained, on this very mailing list I believe, but I haven't
>>> found that explanation. Could someone please give me a pointer to that
>>> description (or perhaps explain it again :-) )?
>>>
>>> Thanks for any enlightenment!
>>>
>>> /ragge
>>
>> bp_rewrite is the name of a feature, and of the (as yet unimplemented)
>> system call behind it, which does Block Pointer rewriting. It would
>> allow ZFS to change the physical location on media of an existing ZFS
>> data slab. That is, bp_rewrite is necessary to allow ZFS to change the
>> physical layout of data on media without changing the conceptual
>> arrangement of that data.
>>
>> It's been the #1 most-wanted feature of ZFS since I can remember,
>> probably for 10 years now.
>
> Yes, I got that much. :-)
> But what is the problem really?
> Being naive/ignorant (and completely ignoring any possible dependencies
> between the different layers in the zfs stack), it doesn't seem that
> magic or esoteric when compared to the rest of the stuff in there.
>
> /ragge

Conceptually, it's not *that* bad.  From an implementation point of
view, it's a major feature add, which touches a big chunk of the code.
As always, the Devil is in the details.  One problem area is how to
guarantee the move has taken place - that is, when I say I'm going to
move Slab A from disk location X to location Y, how can I atomically
guarantee this?  While I'm doing other I/O. When there might be a power
loss (or other pool loss). Plus lots of other non-best-case events
happening....


The major problem with "active" (vs off-line) deduplication is that no
matter what strategy you use, you MUST keep a *complete* record of all
blocks currently in the pool, with their checksums. So, for something
like ZFS, you need a structure that holds the physical block location, a
256-bit checksum, and a reference count, at the minimum, for each and
every block in the entire pool. If you want good performance, this
lookup table has to be on something that has very good random I/O
performance.
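
To give a feel for the size of such a table, here is a minimal sketch that packs just the three fields named above and scales that up to a whole pool. The field widths (64-bit location, 64-bit reference count) and the 10 TiB example pool are assumptions for illustration; real DDT entries carry more than this and are considerably larger.

```python
import struct

# Assumed minimal layout: 256-bit checksum, 64-bit location, 64-bit refcount.
# Real DDT entries are larger (this thread quotes roughly 376 bytes in core).
ENTRY = struct.Struct("<32sQQ")


def min_table_bytes(pool_bytes, block_bytes):
    """Lower bound on table size if every block in the pool is unique."""
    return (pool_bytes // block_bytes) * ENTRY.size


packed = ENTRY.pack(bytes(32), 0x1000, 1)   # dummy checksum, offset, refcount
print("bytes per minimal entry:", ENTRY.size)

for block_kib in (128, 8):
    size = min_table_bytes(10 * 2**40, block_kib * 2**10)
    print(f"10 TiB pool, {block_kib:3d} KiB blocks: {size / 2**30:6.2f} GiB")
```

Even this stripped-down 48-byte entry needs several GiB for a 10 TiB pool of 128 KiB blocks, and sixteen times that with 8 KiB blocks, which is why the table's home needs to be both large and fast at random access.
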

-Erik

Richard Elling

2011-Dec-04 04:06 UTC

more below...

On Dec 1, 2011, at 8:21 PM, Erik Trimble wrote:> On 12/1/2011 6:44 PM, Ragnar Sundblad wrote:
>> Thanks for your answers!
>> 
>> On 2 dec 2011, at 02:54, Erik Trimble wrote:
>> 
>>> On 12/1/2011 4:59 PM, Ragnar Sundblad wrote:
>>>> I am sorry if these are dumb questions. If there are
explanations
>>>> available somewhere for those questions that I just
haven''t found, please
>>>> let me know! :-)
>>>> 
>>>> 1. It has been said that when the DDT entries, some 376 bytes
or so, are
>>>> rolled out on L2ARC, there still is some 170 bytes in the ARC
to reference
>>>> them (or rather the ZAP objects I believe). In some places it
sounds like
>>>>  those 170 bytes refers to ZAP objects that contain several DDT
entries.
>>>> In other cases it sounds like for each DDT entry in the L2ARC
there must
>>>> be one 170 byte reference in the ARC. What is the story here
really?
>>> Yup. Each entry (not just a DDT entry, but any cached reference) in
the L2ARC requires a pointer record in the ARC, so the DDT entries held in L2ARC
also consume ARC space.  It''s a bad situation.
>> Yes, it is a bad situation. But how many DDT entries can there be in
each ZAP
>> object? Some have suggested an 1:1 relationship, others have suggested
that it
>> isn''t.
> I''m pretty sure it''s NOT 1:1, but I''d have to go
look at the code. In any case, it''s not a very big number, so
you''re still looking at the same O(n) as the number of DDT entries (n).
It is not a "bad thing" it is what it is. Almost all non-trivial
caches have a directory (sometimes
called tags in the case of CPU caches). Trivial caches do trivial manipulation
of the address
to find the data in cache, a technique that would not work well for more
sophisticated data
management systems, like databases or file systems. So, to implement the cache,
we need
to put the cache directory somewhere. Again, in the case of CPU caches, the size
of the tags
is not counted as the size of the cache, but can be quite substantially large.

DDT is stored in an AVL tree. It is unlikely that each ZAP object will contain
only one DDT entry.
>>>> 2. Deletion with dedup enabled is a lot heavier for some reason
that I don''t
>>>> understand. It is said that the DDT entries have to be updated
for each
>>>> deleted reference to that block. Since zfs already have a
mechanism for sharing
>>>> blocks (for example with snapshots), I don''t
understand why the DDT has to
>>>> contain any more block references at all, or why deletion
should be much harder
>>>> just because there are checksums (DDT entries) tied to those
blocks, and even
>>>> if they have to, why it would be much harder than the other
block reference
>>>> mechanism. If anyone could explain this (or give me a pointer
to an
>>>> explanation), I''d be very happy!
>>> Remember that, when using Dedup, each block can potentially be part
of a very large number of files. So, when you delete a file, you have to go look
at the DDT entry FOR EACH BLOCK IN THAT FILE, and make the appropriate DDT
updates.  It''s essentially the same problem that erasing snapshots has
- for each block you delete, you have to find and update the metadata for all
the other files that share that block usage.  Dedup and snapshot deletion share
the same problem, it''s just usually worse for dedup, since
there''s a much larger number of blocks that have to be updated.
>> What is it that must be updated in the DDT entries - a ref count?
>> And how does that differ from the snapshot case, which seems like
>> a very similar mechanism?
> 
> It is similar to the snapshot case, in that the block itself has a
> reference count in its structure (for use in both dedup and snapshots)
> that would get updated upon "delete", but you also have to consider
> that the DDT entry itself, which is a separate structure from the block
> structure, also has to be updated. Fetching that additional structure
> costs a whole new I/O. So, more or less, a dedup delete has to do two
> operations for every one that a snapshot delete does.  Plus,

A snapshot does not modify blocks. Each block pointer has a birth txg entry.
The txg number is guaranteed to increase monotonically, so we can tell the
age of a block by its birth txg. When you delete a snapshot, the blocks that
belong exclusively to that snapshot are returned to the free list.
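A rough model of the birth-txg reasoning above, as a sketch only. It assumes each block also carries a free_txg, a hypothetical simplification recording the txg in which the live filesystem stopped referencing the block (ZFS itself tracks this differently); Block, visible_in and exclusive_to are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Block:
    name: str
    birth_txg: int
    free_txg: Optional[int] = None  # hypothetical; None means still live


def visible_in(block: Block, snap_txg: int) -> bool:
    """A snapshot sees blocks born at or before it and not yet freed by then."""
    born = block.birth_txg <= snap_txg
    alive = block.free_txg is None or block.free_txg > snap_txg
    return born and alive


def exclusive_to(blocks, snap_txgs, target_txg):
    """Names of blocks referenced only by the snapshot taken at target_txg."""
    others = [t for t in snap_txgs if t != target_txg]
    result = []
    for block in blocks:
        if block.free_txg is None:
            continue  # still referenced by the live filesystem
        if visible_in(block, target_txg) and not any(
            visible_in(block, t) for t in others
        ):
            result.append(block.name)
    return result


snaps = [10, 20, 30]
blocks = [
    Block("A", birth_txg=5),               # still live
    Block("B", birth_txg=5, free_txg=15),   # seen only by the txg-10 snapshot
    Block("C", birth_txg=12, free_txg=25),  # seen only by the txg-20 snapshot
    Block("D", birth_txg=12, free_txg=35),  # seen by txg-20 and txg-30
]
print(exclusive_to(blocks, snaps, 20))
```

Deleting the txg-20 snapshot in this model would return only block C to the free list; no per-block reference counts are modified.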
>>> The problem is that you really need to have the entire DDT in some form
>>> of high-speed random-access memory in order for things to be efficient.
>>> If you have to search the entire hard drive to get the proper DDT entry
>>> every time you delete a block, then your IOPS limits are going to get
>>> hammered hard.
>> Indeed!
>> 
>>>> 3. I, as many others, would of course like to be able to have very
>>>> large datasets deduped without having to have enormous amounts of RAM.
>>>> Since the DDT is an AVL tree, couldn't just that entire tree be cached
>>>> on for example an SSD and be searched there without necessarily having
>>>> to store anything of it in RAM? That would probably require some
>>>> changes to the DDT lookup code, and some mechanism to gather the tree
>>>> to be able to lift it over to the SSD cache, and some other stuff, but
>>>> still that sounds - with my very basic (non-)understanding of zfs -
>>>> like a not too overwhelming change.
>>> L2ARC typically sits on an SSD, and the DDT is usually held there, if
>>> the L2ARC device exists.
>> Well, it rather seems to be ZAP objects, referenced from the ARC, which
>> happen to contain DDT entries, that are in the L2ARC.
>> 
>> I mean that you could just move the entire AVL tree onto the SSD,
>> completely outside of zfs if you will, and have it searched there, not
>> dependent on what is in RAM at all.
>> Every DDT lookup would take up to [tree depth] number of reads, but that
>> could be OK if you have an SSD which is fast on reading (which many are).
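To put a number on "[tree depth] number of reads", here is a back-of-the-envelope sketch. It takes the AVL-tree framing at face value and assumes one SSD read per tree level with nothing cached in RAM; the entry count and the 100 microsecond read latency are illustrative assumptions, and avl_max_height and worst_case_lookup_seconds are hypothetical helper names.

```python
def avl_max_height(n: int) -> int:
    """Largest number of levels an AVL tree holding n nodes can have.

    Uses the minimum-node recurrence N(h) = N(h-1) + N(h-2) + 1,
    with N(0) = 0 and N(1) = 1.
    """
    if n < 1:
        return 0
    prev, cur, height = 0, 1, 1
    while cur + prev + 1 <= n:
        prev, cur = cur, cur + prev + 1
        height += 1
    return height


def worst_case_lookup_seconds(n_entries: int, read_latency_s: float) -> float:
    """Worst-case lookup time if every tree level costs one device read."""
    return avl_max_height(n_entries) * read_latency_s


entries = 2**28          # e.g. 32 TiB of unique 128 KiB blocks
ssd_read_latency = 1e-4  # assumed 100 microseconds per random read
print(avl_max_height(entries))
print(worst_case_lookup_seconds(entries, ssd_read_latency))
```

Because the depth grows only logarithmically with the entry count, the cost per lookup is a few dozen dependent reads even for very large tables; whether that is acceptable depends on how many lookups per second the workload needs.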
> ZFS currently treats all metadata (which includes DDT entries) and data
> slabs the same when it comes to choosing to migrate them from ARC to
> L2ARC, so the most-frequently-accessed info is in the ARC (regardless of
> what that info is), and everything else sits in the L2ARC.

The ARC has both most-frequently used and most-recently used data (hence the
name Adaptive Replacement Cache). Therefore the L2ARC contains data that is
soon to be evicted from either the most-recent or the most-frequent list.

>  But, ALL entries in the L2ARC require an ARC reference pointer.
Yes
> 
> Under normal operation, you really should have an L2ARC device capable of
> holding the entire DDT, to get the random IOPS benefit from that. However,
> using the current design, that still consumes a rather large amount of ARC
> space to hold the L2ARC reference pointers. A redesign effort should
> definitely reconsider how this is done - probably the most efficient way
> would be to delete L2ARC ref pointers completely in ARC, and just force a
> search of L2ARC if the data isn't found in the ARC. But, that's just a
> guess at a new implementation; I'm sure there are gotchas around that,
> and, like I said, I suspect that the only way to save dedup is to kill
> dedup (then redo it from scratch).
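A rough worked example of why those reference pointers add up, using the approximate sizes quoted at the top of this thread (about 376 bytes per DDT entry, about 170 bytes per ARC-side reference). It assumes the worst case for both open questions: every block is unique, and each DDT entry in the L2ARC needs its own ARC reference. The function name and the pool and block sizes are illustrative only.

```python
DDT_ENTRY_BYTES = 376  # approximate figure quoted earlier in this thread
ARC_REF_BYTES = 170    # approximate figure quoted earlier in this thread
GIB = 2**30


def ddt_footprint(unique_data_bytes: int, avg_block_bytes: int) -> dict:
    """Estimate DDT size in L2ARC and the ARC memory needed to reference it.

    Assumes one DDT entry per unique block and one ARC reference per entry.
    """
    entries = unique_data_bytes // avg_block_bytes
    return {
        "entries": entries,
        "l2arc_bytes": entries * DDT_ENTRY_BYTES,
        "arc_bytes": entries * ARC_REF_BYTES,
    }


# 10 TiB of unique data stored in 128 KiB blocks
est = ddt_footprint(10 * 2**40, 128 * 2**10)
print(est["entries"])            # 83886080
print(est["l2arc_bytes"] / GIB)  # 29.375
print(est["arc_bytes"] / GIB)    # 13.28125
```

Under these assumptions, moving the DDT to L2ARC still leaves roughly 45% of its size resident in RAM, and smaller average block sizes scale both numbers up proportionally.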

All deduplication implementations have a DDT. In some cases, they use a
dedicated device. The problem with a fixed, dedicated device is that when
space runs out, they stop deduping.

It should be noted that the DDT necessarily contains critical data. On disk
there are at least 2 copies of the DDT (and other metadata), spread around
the pool for diversity. This implies that the worst case is a pool
constructed of a single, large, slow HDD.

>>> There does need to be serious work on changing how the DDT in the L2ARC
>>> is referenced, however; the ARC memory requirements for DDT-in-L2ARC
>>> definitely need to be removed (which requires a non-trivial
>>> rearchitecting of dedup). There are some other changes that have to
>>> happen for dedup to be really usable. Unfortunately, I can't see anyone
>>> around willing to do those changes, and my understanding of the code
>>> says that it is much more likely that we will simply remove and replace
>>> the entire dedup feature rather than trying to fix the existing design.
>> Yes, replacing it is certainly one possibility.
>> Is there any work going on for a replacement mechanism?
> Not that I know of, and there hasn't been any talk on any of these lists
> about it.

Greenbytes has an interesting implementation, very different from stock ZFS.

Due to the critical nature of the DDT, it needs to be protected. For those
who are too cheap to buy one fast L2ARC device, buying two fast devices to
be used only for the DDT is a tough sell.

>>>> 4. Now and then people mention that the problem with bp_rewrite has
>>>> been explained, on this very mailing list I believe, but I haven't
>>>> found that explanation. Could someone please give me a pointer to that
>>>> description (or perhaps explain it again :-) )?
>>>> 
>>>> Thanks for any enlightenment!
>>>> 
>>>> /ragge
>>> bp_rewrite is a feature which stands for the (as yet unimplemented)
>>> system call of the same name, which does Block Pointer re-writing. That
>>> is, it would allow ZFS to change the physical location on media of an
>>> existing ZFS data slab. In other words, bp_rewrite is necessary to allow
>>> ZFS to change the physical layout of data on media without changing the
>>> conceptual arrangement of such data.
>>> 
>>> It's been the #1 most-wanted feature of ZFS for as long as I can
>>> remember, probably for 10 years now.
>> Yes, I got that much. :-)
>> But what is the problem really?
>> Being naive/ignorant (and completely ignoring any possible dependencies
>> between the different layers in the zfs stack), it doesn't seem that
>> magic or esoteric when compared to the rest of the stuff in there.
>> 
>> /ragge
> 
> Conceptually, it's not *that* bad. From an implementation point of view,
> it's a major feature add, which touches a big chunk of the code. As
> always, the Devil is in the details. One problem area is how to guarantee
> the move has taken place - that is, when I say I'm going to move Slab A
> from disk location X to location Y, how can I guarantee this atomically?
> While I'm doing other I/O. When there might be a power loss (or other pool
> loss). Plus lots of other non-best-case events happening....
> 
> 
> The major problem with "active" (vs off-line) deduplication is that no
> matter what strategy you use, you MUST keep a *complete* record of all
> blocks currently in the pool, with their checksums. So, for something like
> ZFS, you need a structure that holds the physical block location, a
> 256-bit checksum, and a reference count, at the minimum, for each and
> every block in the entire pool. If you want good performance, this lookup
> table has to be on something that has very good random I/O performance.

As you relocate the blocks, you also have to COW all of the historical
metadata. IMHO, this can be a far worse workload than DDT lookups or
reference count updates. For those who are too cheap to purchase fast disks,
life will be unpleasant; it is likely to be more efficient to just build a
new pool and migrate the data.
 -- richard

-- 

ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9
