First, a disclaimer: I do not know how ZFS dataset destruction
is actually implemented, but I can suggest at least a couple of
plausible explanations for a slow destruction.
2012-06-25 21:55, Philip Brown wrote:
> I ran into something odd today:
>
> zfs destroy -r random/filesystem
>
> is mind-bogglingly slow. But it seems to me it shouldn't be.
> It's slow because the filesystem has two snapshots on it.
> Presumably, it's busy "rolling back" the snapshots.
> But I've already declared, by my command line, that I DON'T CARE
> about the contents of the filesystem!
> Why doesn't zfs simply do:
>
> 1. unmount filesystem, if possible (it was possible)
> (1.5 possibly note "intent to delete" somewhere in the pool records)
> 2. zero out/free the in-kernel memory in one go
> 3. update the pool: "hey, I deleted the filesystem, all these
> blocks are now clear"
Basically, your ideal fast destruction would be the pruning of
the dataset tree (the node under which the snapshots' and the
live dataset's blocks are rooted and accounted for). In that
case "everything not allocated is free", or at least it might
be made to work that way. The slow part is, most likely, a walk
of the block pointer (BP) tree, through all its random on-disk
locations, plus some bookkeeping to actually release the blocks.
So what has to be done at this step (speculation follows)?
* Blocks might have been written as deduplicated; in that case
we have to decrement the reference counters in the DDT (dedup
table) - but first we have to walk the dataset's branch of the
block-pointer tree to see which blocks have the "dedup"
bit-flag set.
* A simpler case is the presence of cloned datasets based on
snapshots of this dataset. Unless you're destroying the whole
family of sibling datasets, the clones have to be promoted
and the referenced blocks reassigned to them, including
reassignment of the snapshot "ownership" (example commands
follow this list).
* Even for your "trivial" step (2), the freeing of memory,
we need to know which ARC-cached blocks to drop. How can we
know that without walking the BP tree first?
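For illustration, here is roughly how you could check in advance
whether the dedup and clone cases apply to a dataset you are
about to destroy (the pool and dataset names "tank", "tank/fs"
and "tank/fsclone" are made up for the example):

  # Is dedup enabled on the dataset or any of its children?
  zfs get -r dedup tank/fs

  # Any clones? A dataset whose "origin" points at a snapshot
  # of tank/fs is a clone and would need promotion first.
  zfs list -r -o name,origin tank
  zfs promote tank/fsclone   # reparent a clone before the destroy

Note that the property only tells you whether dedup is enabled
now; blocks written while it was previously enabled still carry
the dedup flag on disk, which is exactly why the BP-tree walk
cannot be skipped.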
I have listed just a few off-the-top-of-my-head reasons why a
walk of the whole BP-tree branch is required to free the blocks
it references. If any further operations are needed, such as
modifications to the DDT, they delay the result even more.
In particular, this may be why recent versions of zfs/zpool
have worked toward asynchronous destruction and a "deferred
free" capability: the destroyed branch is quickly marked as
deleted, and then the kernel does its processing in the
background. In my (and not only my) problematic cases this
processing could require prodigious amounts of RAM, especially
with dedup in play, and could freeze the computer. However,
sometime after ZFSv22, the deferred freeing in such cases just
takes several hard-resets to complete, instead of taking truly
forever with no progress ;)
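On builds whose pools have the async-destroy feature (this
landed after v28, e.g. in recent illumos builds with feature
flags), I believe the remaining background work is even exposed
as a pool property, something like:

  # space still queued for background reclamation
  zpool get freeing tank

(again, "tank" is just a placeholder pool name).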
Basically, the steps you outlined should already be there,
in some form, at least as of ZFSv28.
So, the practical questions are (a few commands to help answer
them follow the list):
* your version of zpool/zfs; OS version?
* presence of deduplication on this dataset (and dedup
support in the OS version - its absence may mean fewer
code paths to follow and check, and thus a faster destroy
just due to that; e.g. Solaris 10 nominally has ZFSv29(?),
but not all features are implemented as in Solaris 11
or OpenSolaris of similar ZFS versions);
* did you use clones?
* fragmentation (or: how busy is the pool while processing
the deletion, in terms of IOPS)?
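Something along these lines should answer the version and load
questions on Solaris-derived systems (the dedup and clone checks
were sketched earlier; "tank" is again a placeholder):

  uname -a ; cat /etc/release   # OS build
  zpool get version tank        # pool format version
  zfs get version tank          # dataset format version
  zpool iostat -v tank 5        # IOPS on the pool while the destroy runs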
HTH,
//Jim