Hello all,

While waiting for that resilver to complete last week, I caught myself wondering how resilvers (are supposed to) work in ZFS. Based on what I see in practice and read on this list and in some blogs, I've built a picture, and I would be grateful if some experts actually familiar with the code and architecture would say how far my guesses are from the truth ;)

Ultimately I wonder if there are possible optimizations to make the scrub process resemble sequential drive-cloning (bandwidth/throughput-bound), rather than the IOPS-bound random-seek thrashing for hours that we often see now, at least on (over?)saturated pools. This may possibly improve zfs send speeds as well.

First of all, I state (and ask to confirm): I think resilvers are a subset of scrubs, in that:

1) resilvers are limited to a particular top-level VDEV (whose number is a component of each block's DVA address), and

2) when a scrub finds a block mismatching its known checksum, it reallocates the whole block anew using the recovered known-valid data - in essence it is a newly written block with a new path in the BP tree and so on; a resilver expects to have a disk full of known-missing pieces of blocks, and reconstructed pieces are written on the resilvering disk "in place" at an address dictated by the known DVA - this avoids rewriting the other disks and the BP tree as COW would otherwise require.

Other than these points, resilvers and scrubs should work the same, perhaps with nuances like separate tunables for throttling and such - but the generic algorithms should be nearly identical.

Q1: Is this assessment true?

So I'll call them both a "scrub" below - it's shorter :)

Now, as everybody knows, at least by word of mouth on this list, scrub tends to be slow on pools with a rich life (many updates and deletions, causing fragmentation, with "old" and "young" blocks intermixed on disk), more so if the pools are quite full (over about 80% for some reporters). This slowness (on non-SSD disks with non-zero seek latency) is attributed to several reasons I've seen stated and/or thought up while pondering. The reasons may include statements like:

1) "Scrub goes on in TXG order."

If it is indeed so, the system must find older blocks, then newer ones, and so on. IF the block-pointer tree starting from the uberblock is the only reference to the entirety of the on-disk blocks (unlike, say, the DDT), then this tree would have to be read into memory, sorted by TXG age, and then processed. From my system's failures I know that this tree would take about 30GB on my home-NAS box with 8GB RAM, and the kernel crashes the machine by depleting RAM and not going into swap after certain operations (i.e. large deletes on datasets with dedup enabled). That was discussed last year by me, and recently by other posters.

Since scrub does not do that, and does not even press on RAM in a fatal manner, I think this "reason" is wrong. I also fail to see why one would do that processing order in the first place - on a fairly fragmented system, even the blocks from "newer" TXGs do not necessarily follow those from "previous" ones.

What this rumour could reflect, however, is that a scrub (or more importantly, a resilver) is indeed limited to the "interesting" range of TXGs, such as picking only those blocks which were written between the last TXG that a lost-and-reconnected disk knew of (known to the system via that disk's stale uberblock) and the current TXG at the moment of its reconnection. Newer writes would probably land on all disks anyway, so a resilver has only to find and fix those missing TXG numbers. On my problematic system, however, I only saw full resilvers, even after numerous restarts... This may actually support the idea that scrubs are NOT TXG-ordered; otherwise a regularly updated bookkeeping attribute on the disk (in the uberblock?) would note that some TXGs are known to fully exist on the resilvering drive - and this is not happening.

2) "Scrub walks the block-pointer tree."

That seems like a viable reason for lots of random reads (hitting the IOPS barrier). It does not directly explain the reports I think I've seen about L2ARC improving scrub speeds and system responsiveness - although extra caching takes the repetitive load off the HDDs and leaves them some more timeslices to participate in scrubbing (and *that* should incur reads from disks, not caches). On an active system, block-pointer entries are relatively short-lived, with whole branches of the tree being updated and written to a new location upon every file update. This image is bound to look like good cheese after a while, even if the writes were initially coalesced into few IOs.

3) "If there are N top-level VDEVs in a pool, then only the one with the resilvering disk takes the performance hit" - not quite true, because pieces of the BP tree are spread across all VDEVs. The one resilvering would get the most bulk traffic, when DVAs residing on it are found and userdata blocks get transferred, but random read seeks caused by the resilvering process should happen all over the pool.

Q2: Am I correct with the interpretation of statements 1-3?

IDEA1

One optimization that could take place here would be to store some of the BPs' ditto copies in compact locations on disk (not spread all over it evenly), albeit maybe hurting write performance. This way a resilver run, or even a scrub or zfs send, might be like a vdev prefetch - a scooping read of several megabytes' worth of block pointers (this would especially help if the whole tree would fit in RAM/L2ARC/swap), then sorting out the tree or its major branches. The benefit would be little mechanical seeking for lots of BP data. This might possibly require us to invalidate the freed BP slots somehow as well :\

In the case of scrubs, where we have to read all of the allocated blocks from the media to test them, this would let us schedule a sequential read of the drives' userdata while making sense of the sectors we find (as particular zfs blocks).

In the case of resilvering, this would let us find the DVAs of blocks in the interesting TLVDEV and in the TXG range, and also schedule huge sequential reads instead of random seeking.

In the case of zfs send, this would help us pick out the TXG-limited ranges of blocks for a dataset, and again schedule sequential reads for the userdata (if any).

Q3: Does the IDEA above make sense - storing BP entries (one of the ditto blocks) in some common location on disk, so as to minimize mechanical seeks while reading much of the BP tree?

IDEA2

It seems possible to enable defragmentation of the BP tree (those ditto copies that are stored together) by just relocating the valid ones, in correct order, onto a free metaslab. It seems that ZFS keeps some free space for passive defrag purposes anyway - why not use it actively? Live migration of blocks like this seems to be available with scrub's repair of mismatching blocks. However, some care should be taken here to account for the fact that the parent block pointers would also need to be reallocated, since the children's checksums would change - so the whole tree/branch of reallocations would have to be planned and written out in sequential order onto the spare free space.

Overall, if my understanding somewhat resembles how things really are, these ideas may help create and maintain a layout of metadata such that it can be bulk-read, which is IMHO critical for many operations, as well as to shorten recovery windows when resilvering disks.

Q4: I wonder if similar (equivalent) solutions are already in place and did not help much? ;)

Thanks,
//Jim
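To make the TXG-window idea in point (1) above concrete, here is a minimal sketch of the filter a resilver would apply - blocks born while the disk was away, on the affected top-level vdev. The record shape and field names (`birth_txg`, `top_vdev`) are invented for illustration; they are not the actual ZFS structures.

```python
# Hypothetical sketch of the "interesting TXG range" filter described
# above: a resilver only needs blocks born while the disk was offline.
# Field names are illustrative, not real ZFS code.

def blocks_to_resilver(blocks, stale_txg, current_txg, resilvering_vdev):
    """Select blocks written after the disk's last known-good TXG."""
    return [
        bp for bp in blocks
        if stale_txg < bp["birth_txg"] <= current_txg
        and bp["top_vdev"] == resilvering_vdev
    ]

blocks = [
    {"birth_txg": 100, "top_vdev": 0},  # written before the outage
    {"birth_txg": 205, "top_vdev": 0},  # written during the outage
    {"birth_txg": 205, "top_vdev": 1},  # other top-level vdev: skip
    {"birth_txg": 260, "top_vdev": 0},  # written after reconnection
]
todo = blocks_to_resilver(blocks, stale_txg=200, current_txg=250,
                          resilvering_vdev=0)
print(todo)  # only the txg-205 block on vdev 0
```

The sketch glosses over the hard part the thread keeps returning to: enumerating candidate blocks at all requires walking the BP tree, since there is no flat list of blocks to filter.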
On Fri, May 18, 2012 at 03:05:09AM +0400, Jim Klimov wrote:
> While waiting for that resilver to complete last week, I caught
> myself wondering how the resilvers (are supposed to) work in ZFS?

The devil finds work for idle hands... :-)

> Based on what I see in practice and read in this list and some
> blogs, I've built a picture and would be grateful if some experts
> actually familiar with code and architecture would say how far off
> I guessed from the truth ;)

Well, I'm not that - certainly not on the code. It would probably be best (for both of us) to spend idle time looking at the code, before spending too much on speculation. Nonetheless, let's have at it! :)

> Ultimately I wonder if there are possible optimizations to make the
> scrub process more resembling a sequential drive-cloning
> (bandwidth/throughput-bound), than an IOPS-bound random seek
> thrashing for hours that we often see now, at least on
> (over?)saturated pools.

The tradeoff will be code complexity and resulting fragility. Choose wisely what you wish for.

> This may possibly improve zfs send speeds as well.

Less likely; that's pretty much always going to have to go in txg order.

> First of all, I state (and ask to confirm): I think resilvers are a
> subset of scrubs, in that:
> 1) resilvers are limited to a particular top-level VDEV (and its
> number is a component of each block's DVA address), and
> 2) when scrub finds a block mismatching its known checksum, scrub
> reallocates the whole block anew using the recovered known-valid
> data - in essence it is a newly written block with a new path in
> the BP tree and so on; a resilver expects to have a disk full of
> known-missing pieces of blocks, and reconstructed pieces are
> written on the resilvering disk "in-place" at an address dictated
> by the known DVA - this allows to not rewrite the other disks and
> BP tree as COW would otherwise require.

No. Scrub (and any other repair, such as for errors found in the course of normal reads) rewrites the reconstructed blocks in place: to the original DVA as referenced by its parents in the BP tree, even if the device underneath that DVA is actually a new disk. There is no COW. This is not a rewrite, and there is no original data to preserve; this is a repair: making the disk sector contain what the rest of the filesystem tree 'expects' it to contain. More specifically, making it contain data that checksums to the value that block pointers elsewhere say it should, via reconstruction using redundant information (the same DVA on a mirror/RAIDZ reconstruction, or ditto blocks at different DVAs found in the parent BP for copies>1, including metadata).

BTW, if a new BP tree were required to repair blocks, we'd have bp-rewrite already (or we wouldn't have repair yet).

> Other than these points, resilvers and scrubs should work the same,
> perhaps with nuances like separate tunables for throttling and
> such - but generic algorithms should be nearly identical.
>
> Q1: Is this assessment true?

In a sense, yes, despite the correction above. There is less difference between these cases than you expected, so they are nearly identical :-)

> So I'll call them both a "scrub" below - it's shorter :)

Call them all repair. The difference is not in how repair happens, but in how the need for a given sector to be repaired is discovered. Let's go over those, and clarify terminology, before going through the rest of your post:

* Normal reads: a device error or checksum failure triggers a repair.
* Scrub: Devices may be fine, but we want to verify that and fix any errors. In particular, we want to check all redundant copies.
* Resilver: A device has been offline for a while, and needs to be 'caught up', from its last known-good TXG to current.
* Replace: A device has gone, and needs to be completely reconstructed.
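The four discovery paths above differ mainly in which TXG window they cover and whether errors are expected. A toy summary, with invented names and return shape (not ZFS fields):

```python
# Toy summary of the four repair-discovery cases listed above.
# The tuples (start_txg, end_txg, errors_expected) are illustrative.

def repair_window(kind, last_good_txg=None, current_txg=None):
    """Return (start_txg, end_txg, errors_expected) for each case."""
    if kind == "normal_read":
        return (None, None, False)          # repair only on a failed read
    if kind == "scrub":
        return (0, current_txg, False)      # verify everything, all copies
    if kind == "resilver":
        return (last_good_txg, current_txg, True)  # catch up the window
    if kind == "replace":
        return (0, current_txg, True)       # a resilver from txg 0
    raise ValueError(kind)

print(repair_window("replace", current_txg=300))
print(repair_window("resilver", last_good_txg=200, current_txg=300))
```

Dan's point that "Replace is essentially resilver with a starting TXG of 0" falls out directly: the two cases differ only in the window's start.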
Scrub is very similar to normal reads, apart from checking all copies rather than serving the data from whichever copy successfully returns first. Errors are not expected, and are counted and repaired as/if found.

Resilver and Replace are very similar, and the terms are often used interchangeably. Replace is essentially resilver with a starting TXG of 0 (plus some labelling). In both cases, an error is expected or assumed from the device in question, and repair is initiated unconditionally (and without incrementing error counters).

You're suggesting an asymmetry between Resilver and Replace to exploit the possible speedup of sequential access; ok, seems attractive at first blush, let's explore the idea.

> Now, as everybody knows, at least by word-of-mouth on this list,
> the scrub tends to be slow on pools with a rich life (many updates
> and deletions, causing fragmentation, with "old" and "young" blocks
> intermixed on disk), more so if the pools are quite full (over
> about 80% for some reporters). This slowness (on non-SSD disks with
> non-zero seek latency) is attributed to several reasons I've seen
> stated and/or thought up while pondering. The reasons may include
> statements like:
>
> 1) "Scrub goes on in TXG order".

Yes, it does, approximately. More below.

> If it is indeed so - the system must find older blocks, then newer
> ones, and so on. IF the block-pointer tree starting from uberblock
> is the only reference to the entirety of the on-disk blocks (unlike
> say DDT)

(aside: it is. The DDT is not special in this sense, because to find the DDT you have to follow the bp tree too.)

> then this tree would have to be read into memory and sorted by TXG
> age and then processed.
>
> From my system's failures I know that this tree would take about
> 30GB on my home-NAS box with 8GB RAM, and the kernel crashes the
> machine by depleting RAM and not going into swap after certain
> operations (i.e. large deletes on datasets with enabled
> deduplication). That was discussed last year by me, and recently by
> other posters.
>
> Since the scrub does not do that and does not even press on RAM in
> a fatal manner, I think this "reason" is wrong.

Well, your observations and analysis of what scrub is not doing are correct and sound. :-)

> I also fail to see why one would do that processing ordering in the
> first place - on a fairly fragmented system even the blocks from
> "newer" TXGs do not necessarily follow those from the "previous"
> ones.

You're thinking too much about the on-disk ordering of sector numbers. Understandable, since you're trying to find a way to do sequential repair. For now, let's just say that going in TXG order is the easiest way to iterate over the disk and be sure to get all live data, without doing other complicated and memory/IO-intensive sorts. Again, we'll come back to this.

> What this rumour could reflect, however, is that a scrub (or more
> importantly, a resilver) are indeed limited by the "interesting"
> range of TXGs, such as picking only those blocks which were written
> between the last TXG that a lost-and-reconnected disk knew of
> (known to the system via that disk's stale uberblock), and the
> current TXG at the moment of its reconnection. Newer writes would
> probably land onto all disks anyway, so a resilver has only to find
> and fix those missing TXG numbers.

Yes, for resilver this is spot on, as above.

> In my problematic system however I only saw full resilvers even
> after numerous restarts... This may actually support the idea that
> scrubs are NOT txg-ordered, otherwise a regularly updated
> bookkeeping attribute on the disk (in uberblock?) would note that
> some TXGs are known to fully exist on the resilvering drive - and
> this is not happening.

Now you have two problems:

* confusing scrub (as a way of checking and possibly triggering repair) with resilver (known need to repair).
* older code: in newer code there is better bookkeeping, at least for scrub, that allows a resume (after, say, a reboot) from where it left off. I'm not sure about resilver here, though (and note the complexity with the optimisation of 'new writes' past the offline window, above).

> 2) "Scrub walks the block-pointer tree".

Yes, it does. It's essentially the same as the previous point, though: scrub walks the bp tree in txg order.

> That seems like a viable reason for lots of random reads (hitting
> the IOPS barrier).

Yep. We're getting closer to the real reason here, but let's play it out in full as we go.

> It does not directly explain the reports I think I've seen about
> L2ARC improving scrub speeds and system responsiveness - although
> extra caching takes the repetitive load off the HDDs and leaves
> them some more timeslices to participate in scrubbing (and *that*
> should incur reads from disks, not caches).

If L2ARC indeed helps, it will surely be mostly to do with improving responsiveness on other reads and freeing up the disks to do scrubs.

> On an active system, block pointer entries are relatively
> short-lived, with whole branches of a tree being updated and
> written in a new location upon every file update. This image is
> bound to look like good cheese after a while even if the writes
> were initially coalesced into few IOs.

You might be surprised; you probably have more long-lived data than you thought, especially with snapshots in place. The full metadata bp tree path to that old data is also retained. Note also the corollary: whenever data is COW'd, the full metadata path is also COW'd (possibly rolled up together with other updates in the same TXG). What that means is that, to read data for a new TXG as you progress in a resilver, replace or scrub, you have to read all new metadata.

> 3) "If there are N top-level VDEVs in a pool, then only the one
> with the resilvering disk would be hit for performance" - not quite
> true, because pieces of the BPtree are spread across all VDEVs. The
> one resilvering would get the most bulk traffic, when DVAs residing
> on it are found and userdata blocks get transferred, but random
> read seeks caused by the resilvering process should happen all over
> the pool.

Not sure what this one means, and I think it's mostly false, for the reason you state. Either resilvering or replacing, the disk is mostly getting writes - and cacheable writes at that - from this activity. For resilver especially, it might see reads for other concurrent activity. The IOPS limitation is for the seeks necessary to satisfy reads, mostly from other disks, to provide data for reconstruction. As noted above, if a disk is being resilvered for TXG n, it won't have any of the metadata for that TXG either, so it won't really be servicing any reads.

> Q2: Am I correct with the interpretation of statements 1-3?

Not quite, as discussed above. Let's go over the scrub case in detail (resilver being a txg window-limited variant, and both resilver and replace enabling different error-reporting logic).

* Every meta/data block in the disk was written in a given TXG.
* Every meta/data block is reachable by a path through the bp tree, from the root at the close of that TXG, down through however many indirect levels are needed.
* For every later TXG while the data remains current, the new root and top few nodes in the tree will change (due to other writes), but those upper nodes will refer to the same subtree below the point of divergence caused by those later writes. In other words, each TXG assembles a bp tree from a new root, and reuses subtrees from the previous TXG where no changes have been made.
* Snapshots are simply additional references to old (filesystem) root bp's, as a way to keep that subtree live.

And here's the kicker for any attempt at LBA-sequential repair:

* The checksum for a given block, which allows it to be verified, is stored in the bp that refers to it.

So, if reading blocks sequentially, you can't verify them. You don't know what their checksums are supposed to be, or even what they contain or where to look for the checksums, even if you were prepared to seek to find them. This is why scrub walks the bp tree.

When doing a scrub, you start at the root bp and walk the tree, doing reads for everything, verifying checksums, and letting repair happen for any errors. That traversal is either a breadth-first or depth-first traversal of the tree (I'm not sure which), done in TXG order. When you're done with that bp tree, the pool has almost certainly moved on with new TXGs. Get the new root bp, and do the traversal again. This time, any bp with a birth time equal to or older than the TXG you previously finished has already been verified, including the entire subtree below it, and so can be skipped. This is why scrub walks in TXG order. It's also why the disk access is in 'approximate TXG order', as you'll sometimes see the more pedantic commenters state.

Note that there can be a lot of fanout in the tree; don't make the mistake of thinking that the directories and filesystems you see are the tree in question; the ZPL is a layer on top of the ZAP object store.

> IDEA1
>
> One optimization that could take place here would be to store some
> of the BPs' ditto copies in compact locations on disk (not all over
> it evenly), albeit maybe hurting the write performance. This way a
> resilver run, or even a scrub or zfs send, might be like a
> vdev-prefetch - a scooping read of several megabytes worth of
> blockpointers (this would especially help if the whole tree would
> fit in RAM/L2ARC/swap), then sorting out the tree or its major
> branches. The benefit would be little mechanical seeking for lots
> of BP data. This might possibly require us to invalidate the freed
> BP slots somehow as well :\
>
> In case of scrubs, where we would have to read in all of the
> allocated blocks from the media to test it, this would let us
> schedule a sequential read of the drives' userdata while making
> sense of the sectors we find (as particular zfs blocks).
>
> In case of resilvering - this would let us find DVAs of blocks in
> the interesting TLVDEV and in the TXG range and also schedule huge
> sequential reads instead of random seeking.
>
> In case of zfs send, this would help us pick out the TXG-limited
> ranges of the blocks for a dataset, and again schedule the
> sequential reads for userdata (if any).
>
> Q3: Does the IDEA above make sense - storing BP entries (one of the
> ditto blocks) in some common location on disk, so as to minimize
> mechanical seeks while reading much of the BP tree?

It's not going to help a scrub, since that reads all of the ditto block copies, so bunching just one copy isn't useful. It might potentially help metadata-heavy activities that don't touch the data, like find(1), at the expense of several other issues, at least some of which you note.

That said, there are always opportunities for tweaks and improvements to the allocation policy, or even for multiple allocation policies, each more suited/tuned to specific workloads if known in advance.

> IDEA2
>
> It seems possible to enable defragmentation of the BP tree (those
> ditto copies that are stored together) by just relocating the valid
> ones in correct order onto a free metaslab.

"Just" is such a four-letter word.

If you move a bp, you change its DVA. Which means that the parent bp pointing to it needs to be updated and rewritten, and its parents as well. This is new, COW data, with a new TXG attached -- but referring to data that is old and has not been changed. You just broke snapshots and scrub, at least. This is bp rewrite, or rather, why bp-rewrite is hard.

> It seems that ZFS keeps some free space for passive defrag purposes
> anyway - why not use it actively? Live migration of blocks like
> this seems to be available with scrub's repair of the mismatching
> blocks.

This gets back to the misunderstanding (way) above. Repair is not COW; repair is repairing the disk block to the original, correct contents.

> However, here some care should be taken to take into account that
> the parent blockpointers would also need to be reallocated since
> the children's checksums would change - so the whole tree/branch of
> reallocations would have to be planned and written out in
> sequential order onto the spare free space.

And more complexities, since you want this done on a live pool.

> Overall, if my understanding somewhat resembles how things really
> are, these ideas may help create and maintain such layout of
> metadata that it can be bulk-read, which is IMHO critical for many
> operations as well as to shorten recovery windows when resilvering
> disks.
>
> Q4: I wonder if similar (equivalent) solutions are already in place
> and did not help much? ;)

At least scrub does more book-keeping in more recent code and will avoid restarts and rework. I would like to see a replace variant that signals that at least some of the data on the disk may already be valid, so it could potentially be used in reconstruction when multiple disks have errors.

-- Dan.
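The traversal and skip rule described in the reply above - walk from the root bp, verify each child against the checksum stored in its parent, and skip any subtree whose birth TXG is at or below the last fully scrubbed TXG - can be sketched roughly as follows. The node structure and names are invented for illustration; real ZFS block pointers, checksums, and the scan machinery are far more involved.

```python
import hashlib

# Illustrative sketch of the scrub walk described above: checksums live
# in the PARENT bp, and a subtree whose birth txg is at or below the
# last fully scrubbed txg can be skipped without reading it.
# Invented structures, not ZFS code.

class Node:
    def __init__(self, data, birth_txg, children=()):
        self.data = data
        self.birth_txg = birth_txg
        self.children = list(children)
        # Parent-side record of each child's expected checksum,
        # mirroring how a bp stores the checksum of the block it names.
        self.child_sums = [hashlib.sha256(c.data).hexdigest()
                           for c in self.children]

def scrub(node, last_done_txg, errors):
    """Depth-first walk; returns the number of child blocks read."""
    reads = 0
    for child, expected in zip(node.children, node.child_sums):
        if child.birth_txg <= last_done_txg:
            continue                  # subtree verified in a prior pass
        reads += 1                    # read the child block from disk
        if hashlib.sha256(child.data).hexdigest() != expected:
            errors.append(child)      # mismatch: repair would happen here
        reads += scrub(child, last_done_txg, errors)
    return reads

leaf_old = Node(b"old data", birth_txg=90)
leaf_new = Node(b"new data", birth_txg=210)
root = Node(b"root", birth_txg=210, children=[leaf_old, leaf_new])

errors = []
print(scrub(root, last_done_txg=0, errors=errors))    # full pass: 2 reads
print(scrub(root, last_done_txg=200, errors=errors))  # resume: 1 read
```

The second call shows why the resume bookkeeping works: the old leaf's birth TXG proves it was covered by the previous pass, so the walk never touches it. It also shows why a purely LBA-sequential read cannot verify anything - the expected checksum is only known once the parent has been read.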
On Fri, May 18, 2012 at 04:18:12PM +1000, Daniel Carosone wrote:
> When doing a scrub, you start at the root bp and walk the tree,
> doing reads for everything, verifying checksums, and letting repair
> happen for any errors. That traversal is either a breadth-first or
> depth-first traversal of the tree (I'm not sure which) done in TXG
> order.
>
> [..]
>
> Note that there can be a lot of fanout in the tree;

Given the latter point, I'm going to guess depth-first. Yes, I should look at the code instead of posting speculation.

-- Dan.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov

I'm reading the ZFS on-disk spec, and I get the idea that there's an uberblock pointing to a self-balancing tree (some say b-tree, some say avl-tree, some say nv-tree), where data is only contained in the nodes. But I haven't found one particularly important detail yet: on which values does the balancing tree balance?

Is it balancing on the logical block address? This would make sense, as an application requests to read/write some logical block, making it easy and fast to find the corresponding physical blocks... If that is the case, wouldn't scrub/resilver need to work according to logical block order? (Which would also be random-ish, but decidedly NOT the same as TXG temporal order.)
First of all, thank you Daniel for taking the time to post a lengthy reply! I do not get that kind of high-quality feedback very often :) I hope the community and googlers would benefit from that conversation sometime. I did straighten out some thoughts and (mis-)understandings, at least, more on that below :) 2012-05-18 15:30, Daniel Carosone wrote:> On Fri, May 18, 2012 at 03:05:09AM +0400, Jim Klimov wrote:>> While waiting for that resilver to complete last week, >> I caught myself wondering how the resilvers (are supposed >> to) work in ZFS? > The devil finds work for idle hands... :-) Or rather, brains ;) > Well, I''m not that - certainly not on the code. It would probably be > best (for both of us) to spend idle time looking at the code, before > spending too much on speculation. Nonetheless, let''s have at it! :)> ...Yes, I should look at the code instead of posting speculation.Good idea any day, but rather lengthy in time. I have looked at the code, at blogs, at mailing list archives, at the aged ZFS spec, for about a year on-and-off now, and as you could see - understanding remains imperfect ;) Besides, turning the specific C code, even with those good comments that are in place, into a narrative description like we did in this thread, is bulky, time-consuming and likely useless (not conveyed) to other people wanting to understand the same and perhaps hoping to contribute - even if only algorithmic ideas ;) Finally, breaking the head over existing code only, instead of sitting back and doing some educated thinking (speculation), *may* be useless in the sense that if the current algorithms (or their implementation) work unsatisfactorily for at least the use-cases I see them used in. Thus I as a n00b researcher might care a bit less about what exactly is wrong in the system that does not work (the way I want it to, at least), and I''d care a bit more about designing and planning = speculating how (I think) it should work to suit my needs and usage patterns. 
In this regard the existing implementation may be seen as a POC which demostrates what can be done, even if sub-optimally. It works somewhat, and since we see downsides - it might work better. At the very least I can try to understand how it works now and why some particular choices and tradeoffs were mare (perhaps we do use the lesser of evils indeed) - explained in higher-level concepts and natural-language words that correspondents like you or other ZFS experts (and authors) on this list can quickly confirm or deny without wasting their precious time (no sarcasm) on lengthy posts like these, describing it all in detail. This is a useful experience and learning source, and different from what reading the code alone gives me. Anyway, this "speculation" would be done by this n00b reader of the code implicitly and with less (without any?) constructive discussion (thanks again for that!) if I were to look into code trying to fix something without planning ahead, and I know that often does not end very well. Ultimately, I guess I got more understanding by spending a few hours to formulate correct questions (and thankfully getting some answers) than from compiling all the disparate (and often outdated) docs and blogs, and code, into some form of a structure in my head. I also got to confirm that much of this compilation was correct and which parts I missed ;) Perhaps, now I (or someone else) won''t waste months on inventing or implementing something senseless from the start, or would find ways to make a pluggable writing policy for tests of different allocators for different purposes, or something of that kind... - as you propose here: > That said, there are always opportunities for tweaks and improvements > to the allocation policy, or even for multiple allocation policies > each more suited/tuned to specific workloads if known in advance. Now, on to my ZFS questions and your selected responses: >> This may possibly improve zfs send speeds as well. 
> > Less likely, that''s pretty much always going to have to go in txg > order. Would that be really TXG order - i.e. send blocks from TXG(N), then send blocks from TXG(N+1), and so on; OR a BPtree walk of the selected branch (starting from the root of snapshot dataset), perhaps limiting the range of chosen TXG numbers by the snapshot''s creation and completion "TXG timestamps"? Essentially, I don''t want to quote all those pieces of text, but I still doubt that tree walks are done in TXG order - at least the way I understand it (which may be different from your or others'' understanding): I interpreted "TXG order" as I said above - a monotonous incremental walk from older TXG numbers to newer ones. In order to do that you must have the whole tree in RAM and sort it by TXGs (perhaps making an array of all defined TXGs and pointers to individual block pointers that have this TXG), which is lengthy, bulky on RAM and I don''t think I see it happening in real life. If the statement means that "when walking the tree, first walk the child branch with lower TXG" then the statement makes sense somewhat - but it is not strictly "TXG-ordered", I think. At the very least, the walk starts with the most recent TXG being the uberblock (or poolwide root block) ;) Such a walk would indeed reach out to the oldest TXGs in a particular branch first, but starting from (and backtracking to) newer ones. So in order to benefit from sequential reads during the tree walk, the written blocks with the block-pointer tree (at least one copy of them) should be stored on disk in essentially this same order that a tree walk reader expects to find them. Then a read request (with associated vdev prefetch) would find large portions of the BP tree needed "now or in a few steps" in one mechanical IO... > So, if reading blocks sequentially, you can''t verify them. 
> You don't know what their checksums are supposed to be, or even what
> they contain or where to look for the checksums, even if you were
> prepared to seek to find them. This is why scrub walks the bp tree.

...And perhaps to take more advantage of this, the system should not descend into a single child BP and its branch right away, but rather try to see in the rolling prefetch cache (after a read was satisfied by a mechanical IO) whether more of the soon-to-be-needed blkptrs are currently in RAM and should be relocated to the ARC/L2ARC before they roll out of the prefetch cache, even if actual requests for them would come later in the subtree walk, perhaps in a few seconds or minutes. If the subtree is so big that these ARCed entries would be pushed out by then, well, we did all we could to speed up the system for smaller branches and lost little time in the process. And cache misses would be logged so users can know to upgrade their ARCs.

> No. Scrub (and any other repair, such as for errors found in the
> course of normal reads) rewrite the reconstructed blocks in-place: to
> the original DVA as referenced by its parents in the BP tree, even if
> the device underneath that DVA is actually a new disk.
> There is no COW. This is not a rewrite, and there is no original data
> to preserve...

Okay, thanks, I guess this simplifies things - although it somewhat defies the BP-tree defrag approach I proposed.

> BTW, if a new BP tree was required to repair blocks, we'd have
> bp-rewrite already (or we wouldn't have repair yet).

I'm not so sure. I've seen discussed (and proposed) many small tasks that could be done by a BP rewrite in general, but can be done "elsehow". Taking as an example my (mis)understanding of scrub repairs, the recovered block data could just be written into the pool like any other new data block, causing the rewrite of the BP tree branch leading to it. If that is not done (or required) here - well, that's for the better, I guess.
> ...This is bp rewrite, or rather, why bp-rewrite is hard.

The generic BP rewrite should also handle things like reduction of VDEV sizes, removal of TLVDEVs, changes to TLVDEV layouts (i.e. migration of raidz levels) and so on. That is likely hard (especially to do online) indeed. Individual operations like defragmentation, recompression or dedup of existing data - all of which can be done today by zfs-sending data away from the pool, cleaning it up, and zfs-receiving the data back, without all the low-level layout changes that BP rewrite can do - well, they can be done today. Why not in-place? Unlike manual send-away-and-receive cycles incurring downtime, the equivalent in-place manipulations can be done transparently to ZPL/ZVOL users by just invalidating parts of the ARC (by DVA of the reallocated blocks), I think, and do not seem as inherently difficult as complete BP rewrites. Again, this interim solution may be just a POC for later works on BP rewrite to include and improve :)

> "Just" is such a four-letter word.
>
> If you move a bp, you change its DVA. Which means that the parent bp
> pointing to it needs to be updated and rewritten, and its parents as
> well. This is new, COW data, with a new TXG attached -- but referring
> to data that is old and has not been changed.
> This gets back to the misunderstanding (way) above. Repair is not
> COW; repair is repairing the disk block to the original, correct
> contents.

Changes of DVAs causing reallocation of the whole branch of BPs during the defrag - yes, as I also wrote. However I am not sure that it would induce such changes to TXG numbers as must be fatal to snapshots and scrubs: as I've seen in the code (unlike the ZFS on-disk format docs), the current blkptr_t includes two fields for a TXG number - the birth TXG and (IIRC) the write TXG.
I guess one refers to the timestamp of when the data block was initially allocated in the queue, and the other (if non-zero) refers to the timestamp of when the block was optionally reallocated and written into the pool - perhaps upon recovery from ZIL, or (as I thought above) upon generic repair, or in my proposed idea of defrag. So perhaps the system is already prepared to correctly process such reallocations, or can be coaxed into it by "clever" use and/or ignoring of one of these fields...

> You just broke snapshots and scrub, at least.

As for snapshots: you can send a series of incremental snapshots from one system to another, and of course the TXG numbers on a particular pool for blocks of the snapshot dataset will differ. But this does not matter, as long as they are committed on disk in a particular order, with BP-tree branches properly pointing to timestamp-ordered snapshots of the parent dataset. Your concern seems valid indeed, but I think it can be countered by scheduling a BP-tree defrag to involve relocating and updating block pointers for all snapshots of a dataset (and maybe its clones), or at least ensuring that the parent blocks of newer snapshots have higher TXG numbers - if that is required. This may place non-trivial demands on cache or buffer memory size and usage in order to prepare the big transaction in case of large datasets, so perhaps if the system detects it can't properly defrag the BP-tree branch in one operation, it should abort without crashing the OS into scanrate-hell ;)

> It's not going to help a scrub, since that reads all of the ditto
> block copies, so bunching just one copy isn't useful.

I can agree - but only partially.
If the goal of storing the block pointers together and minimizing mechanical reads to get many of them at once is reachable, then it becomes possible to "preread" the "colocated" version of the BP tree or its large portions quickly (if there are no checksum or device errors during such reads - otherwise we fall back to the scattered ditto copies of those corrupted BP tree blocks). Then we can schedule more optimal reads for the scattered data, including the ditto blocks of the BP tree that we've already read in (the other copies of these blocks). It would be the same walk covering the same data objects on disk, but possibly in a different (and hopefully faster) manner than today.

> --
> Dan.

Thanks a lot for the discussion, I really appreciate it :)

//Jim Klimov
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov

I'm reading the ZFS on-disk spec, and I get the idea that there's an uberblock pointing to a self-balancing tree (some say b-tree, some say avl-tree, some say nv-tree), where data is only contained in the nodes. But I haven't found one particularly important detail yet:

On which values does the balancing tree balance? Is it balancing on the logical block address? This would make sense, as an application requests to read/write some logical block, making it easy and fast to find the corresponding physical blocks... If that is the case, wouldn't scrub/resilver need to work according to logical block order? (Which would also be random-ish, but decidedly NOT the same as TXG temporal order.)
2012-05-18 19:08, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> I'm reading the ZFS on-disk spec, and I get the idea that there's an
> uberblock pointing to a self-balancing tree (some say b-tree, some say
> avl-tree, some say nv-tree), where data is only contained in the nodes.
> But I haven't found one particular important detail yet:
>
> On which values does the balancing tree balance? Is it balancing on the
> logical block address? This would make sense, as an application requests
> to read/write some logical block, making it easy and fast to find the
> corresponding physical blocks...

My memory fails me here for a precise answer... I think that the on-disk data within a raidzN top-level VDEV (mirrors are trivial) is laid out as follows, for an arbitrary 6-disk raidz2 TLVDEV:

  D1   D2   D3   D4   D5   D6
  Ar1  Ar2  Ad1  Ad2  Ad3  Ad4
  Br1  Br2  Bd1  Cr1  Cr2  Cd1
  Cd2  Cd3  Cd4  Cr3  Cr4  Cd5
  Cd6  Dr1  Dr2  Dd1  ...

In this example, several blocks are laid out in sectors of different disks, including the redundancy blocks. Sequential accesses on one disk progress in a column from top to bottom; accesses in a row are parallelized between many disks. The "A" block's userdata is 4 sectors long, with 2 redundancy sectors. The "B" block has just one userdata sector, and the "C" block has 6 userdata sectors, with redundancy started for each 4 sectors.

AFAIK each ZFS block fully resides within one TLVDEV (and ditto copies have their own separate life in another TLVDEV if available), and striping over several TLVDEVs occurs at a whole-block level. This, in particular, allows disbalanced pools with TLVDEVs of different size and layout.

IF this picture is correct (confirmation or the reverse is kindly requested), then:

1) DVA to LBA translation should be somewhat trivial, since the DVA is defined as "ID(tlvdev):offset:length" in 512-byte units (regardless of the ashift value on the pool).
I did not test this in practice or infer it from the code, though. I don't know if there are any gaps to take into account (i.e. maybe between "metaslabs", of which there are supposed to be about 200 per vdev (or tlvdev, or pool?) in order to limit seeking between data written at roughly the same time). Even if there are gaps (i.e. to round allocations to on-disk tracks or offsets at multiples of a given number), I'd not complicate things and would just leave the gaps as addressable but unreferenced free space. A poster on the list recently referenced "slabs" - I don't think I had seen this term before, but I guess it stands for the total allocation needed for a userdata block?

2) Addressing of blocks (or, in reverse, saying that these sectors belong to a particular block or are available) is impossible without knowing the (generally whole) blockpointer tree, and depending on (re-)written object sizes, the same sector can at different times in its life belong to blocks (slabs?) of different lengths starting at different DVA offsets... Indeed, we also can not assume that sectors read in from the disks contain a valid part of the blockpointer tree (despite even matching some magic number), not until we find a path through the known tree that leads to this block (I discussed this in my other post regarding vdev prefetch and defrag). However, since reads are free as long as the HDD head is in the right location, and if blkptr_t's leading one to another are colocated on the disk, clever use of the prefetch and timely inspection of the prefetch cache can hopefully boost the BP-tree walking speed.

MAYBE I am wrong in this and there is also an allocation map in the large metaslabs or something? (I know there is some cleverness about finding available locations to write into, but I'm not ready to speak about it off the top of my head.) I am not sure if this gives a clue to whether it's "balancing on the logical block address?"
though :) AFAIK the balancing tries to keep the maximum tree depth shortest, yet there is one root block and no rewriting of existing unchanged stale blocks (tree nodes). I am puzzled too :)

3) The layout is fixed at tlvdev creation time by its total number of disks, since that directly affects the calculation of "on which disk does an offset'ed sector belong" - it would be the offset modulo the number of disks for raidzN regardless of N (because of not-full stripes), and just 0 for single drives and mirrors. This is why resizing a raidz set is indeed hard, while conversion of single disks to mirrors and back is easy. To a lesser extent the layout is limited by vdev size (which can be increased easily, but can not be decreased without reallocation and BP rewrite [*1]), and somewhat by the number of redundancy disks, which influences individual blocks' on-disk representation and required length [*2].

[*1]: This might be doable relatively easily by limiting the top writeable address, and executing a routine similar to zfs-send and zfs-recv to relocate all blocks with a larger DVA offset on this TLVDEV to any accessible location on the pool. When no more referenced blocks remain above the watermark, the TLVDEV can be shrunk. This may involve some magic with the TXG "birth" and "alloc" fields in blkptrs as well.

[*2]: As we know, in Oracle ZFS there is hybrid allocation, which in particular allows mirrored writes for metadata and raidz writes for userdata to coexist on a pool. I can only guess there is some new bit-flag in the blkptr_t for that? Anyhow, the number of redundancy disks and the layout algorithm for a particular block can be variable, so it seems...

Thanks,
//Jim Klimov
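[Editor's note] The "offset modulo the number of disks" rule described above can be sketched as a toy calculation. This is an illustration of the mapping as stated in the post, not ZFS code; the function name is hypothetical, and metaslab gaps, ashift and labels are all ignored.

```shell
#!/bin/sh
# Toy model of the layout rule above: a flat sector offset within a
# raidzN TLVDEV of NDISKS drives maps to (offset % NDISKS) for the
# drive and (offset / NDISKS) for the row on that drive.
NDISKS=6

map_sector() {
    # $1 = flat allocation offset in 512-byte sectors within the TLVDEV
    disk=$(( $1 % NDISKS ))    # which physical drive (0-based)
    row=$(( $1 / NDISKS ))     # which "row" down the column of that drive
    echo "sector $1 -> disk D$(( disk + 1 )), row $row"
}

map_sector 0    # Ar1 in the table above: first drive, first row
map_sector 6    # wraps back around to D1, one row down
map_sector 8    # third drive, second row
```

Per the table in the post, sector 8 would land on D3 in the second row (Cd3 in the example), which is what the modulo rule produces.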
I hope there is some good outcome of this thread after all, below... I wonder if anyone else thinks the following proposal is reasonable? ;)

2012-05-18 10:18, Daniel Carosone wrote:
> Let's go over those, and clarify terminology, before going through the
> rest of your post:
> ...
> * Replace: A device has gone, and needs to be completely
> reconstructed.

As I detail below, I see Replace happening when a device is going to be gone - but is still available and is being proactively replaced.

> Scrub is very similar to normal reads, apart from checking all copies
> rather than serving the data from whichever copy successfully returns
> first. Errors are not expected, are counted and repaired as/if found.
>
> Resilver and Replace are very similar, and the terms are often used
> interchangably. Replace is essentially resilver with a starting TXG of
> 0 (plus some labelling). In both cases, an error is expected or
> assumed from the device in question, and repair initiated
> unconditionally (and without incrementing error counters).
> You're suggesting an assymetry between Resilver and Replace to exploit
> the possibile speedup of sequential access; ok, seems attractive at
> first blush, let's explore the idea.

Well, I've gone to a swimming pool today to swim the half-mile and clear my head (metaphorically at least), and from the depths I emerged with another idea. From what I do see with the pool I'm upgrading (in another thread), there is also a "Replace" mode for hotspare devices, namely:

* I attached the hotspare to the pool:
  zpool add poolname spare c1t2d0
* I asked the pool to migrate a flaky disk's data to the new disk:
  zpool replace poolname c5t6d0 c1t2d0
* I asked the pool to forget the old disk so it can be removed:
  zpool detach poolname c5t6d0
  (cfgadm, removal, plug in the new disk, cfgadm, etc.)

From iostat I see that all of the existing TLVDEV's drives, including the one being replaced, are actively thrashed by reads for many hours, with some writes pouring onto the new disk.

SO THE IDEA IS as follows: the disk being explicitly replaced, as in upgrades of the pool to larger drives, should first be copied onto the new media "DD-style", which would be sequential IO for both devices, bandwidth-bound and rather fast. Then there should be a selective scrub, reading and checking allocated blocks from this TLVDEV only - like resilver does today - and repairing possible discrepancies (since the pool was likely live during the "DD stage", and errors were possible on the source drive as well as any other), and after this selective scrub the process is complete.

BENEFITS:

* The pool quickly gets a more-or-less good copy of the original disk, if it has not died completely and is able to serve reads for DD-style copying. This decreases the window of exposure of the TLVDEV to complete failure due to decreased redundancy, and can already help to salvage much of the data in case of a partly bad source disk.
That is, after the DD-style copy the new disk may be able to serve much of the valid data, and discrepancies might be easy to repair using the normal checksum-mismatch modes - if the old disk kicks the bucket and/or is removed before the selective scrub completes to gracefully finish the replacement procedure. The standard scrubbing approach after the DD-copy ensures that by the end of the procedure the new disk's data is fully valid. This also allows us to not worry about the source disk being updated in locations ahead of or behind the point where we're currently reading - some corrections by the selective scrub are expected anyway. However, arguably, incoming writes may be placed on both the source disk and its syncing-up spare replacement (into the correct sector locations right from the start).

* Instead of scheduling many random writes, which may be slower due to sync requirements, caching priorities, etc., we lean towards many random reads - which would still be used if we were using the original replace/resilver mode. Arguably, the reads can be optimized better by the ZFS pipeline and HDD NCQ/TCQ, and in a safer manner than (random) write optimizations.

* This method should be beneficial to raidz as well as mirrors, although the latter may have more options to cheaply recover bad sectors detected (as HDD IO errors) on the source media of the one disk being replaced, on the fly - during the DD phase.

CAVEATS:

* This mode is of benefit for users whose pools are rather fragmented and full, so that a sequential copy is noticeably faster than BP-tree-walk based resilvering. It is about 30x quicker on the heavily utilized servers and home-NASes that I see. For example, on a Thumper in my other thread, resilvering a 250Gb disk (partition) takes 15-17 hours, while writing files and zfs-sends into a single-disk ZFS pool located on the same 3Tb drive fills it up in 24 hours. A full scrub of the original pool (45*250Gb) takes 24-27 hours. Time matters.
The ZFS "marketing" states that it is quicker to repair because it only tests and copies the allocated data, not the whole disk as other RAID systems do - well, this is only good as long as the pools are kept relatively empty. While I can agree with the benefits of limiting disk seeks via partitioning, i.e. by buying a 100Tb array and using only 10Tb by allocating smaller disk slices, I don't see a good reason to allocate the 100Tb array and consistently keep it used at 10Tb, sorry. Perhaps this mode with a DD-style preamble should be triggered by a separate command-line request (at the admin's discretion), or if it ever becomes a default option - it should be used instead of the original resilver-only method above some watermark value of disk utilization and/or known fragmentation.

* If the original disk (being replaced) is a piece of faulty hardware, it can cause problems during the DD stage, such as:

** Lags - HDD retries on bad sectors can take a considerable amount of time based on firmware settings/capabilities.
** Loss of device visibility from the controller, reset storms, etc.; physical failure of the original disk during the copy - these would lead to an inability to continue reading the disk.
** Erroneous reads - returned garbage will be compensated for by the scrub following the DD phase.

If the DD process detects that the average read speed has dropped to some unacceptable level or has stalled completely, it can try to seek from another original-disk location and/or fall back to original resilvering from other vdevs and abandon the DD phase. This does not mean that retries should be avoided, or that the first encountered error (even a connection error) should be cause for aborting or restarting the DD phase.
It is not yet critical if some sectors were skipped during the DD phase - the following selective scrub will (should) recover them, possibly by retrying the original disk as well, and maybe the disk will have managed to recover and relocate the bad sectors in the background by this time. Even if the DD phase was aborted after a non-trivial amount of copying, the scrub/resilver should, IMHO, also read from the partially filled new disk and only rewrite those sectors that require rewriting (especially important for media that is sensitive to write-wearing).

* The overall (wallclock) length of this replacement is likely to be longer than that of the original method - since about the same amount of time will be needed for the scrub as for the resilver, and some time will be added for DDing, maybe plagued with retries etc. when hitting faulty sectors. However, a milestone of relative data safety will be reached a lot faster (if the source disk is substantially readable).

* Errors (on the target disk) are expected during the selective scrub stage, and should be fixed quietly, not causing CKSUM error counter bumps nor other panicky clutter.

This is so far a relatively raw idea and I've probably missed something. Do you think it is worth pursuing and asking some zfs developers to make a POC? ;)

Thanks,
//Jim Klimov
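[Editor's note] The manual workaround later discussed in this thread ("dd a disk and then scrub rather than replace") can be sketched as a command sequence. This is a hedged illustration only - device names, the map-file path and the use of GNU ddrescue are assumptions, and the commands are echoed rather than executed, since the real procedure is destructive and site-specific.

```shell
#!/bin/sh
# Sketch of the DD-then-scrub replacement flow discussed in this thread.
# All names below are hypothetical examples; nothing here is executed
# against real devices - each step is printed for review instead.
OLD=/dev/dsk/c5t6d0s0      # flaky source disk (example name)
NEW=/dev/dsk/c1t2d0s0      # replacement disk/partition (example name)
POOL=pond

run() { echo "WOULD RUN: $*"; }

# 1. Quiesce the pool so the source does not change under the copy.
run zpool export "$POOL"

# 2. Sequential, bandwidth-bound copy; ddrescue retries/skips bad
#    areas instead of aborting on the first read error like plain dd.
run ddrescue --force "$OLD" "$NEW" /var/tmp/replace.map

# 3. Re-import: the vdev labels came along with the copy, so the pool
#    sees the new disk as the "same" vdev (old disk must stay offline).
run zpool import "$POOL"

# 4. Let a scrub find and repair whatever the copy got wrong or missed.
run zpool scrub "$POOL"
```

Note that with today's code a `zpool replace` after such a copy would still trigger a normal resilver; the proposal above is precisely that a selective scrub should suffice instead.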
On Mon, 21 May 2012, Jim Klimov wrote:
> This is so far a relatively raw idea and I've probably missed
> something. Do you think it is worth pursuing and asking some
> zfs developers to make a POC? ;)

I did read all of your text. :-)

This is an interesting idea and could be of some use, but it would be wise to test it first a few times before suggesting it as a general course. Zfs is still totally not foolproof. I still see postings from time to time regarding pools which panic/crash the system (probably due to memory corruption).

Zfs will try to keep the data compacted at the beginning of the partition, so if you have a way to know how far out it extends, then the initial 'dd' could be much faster when the pool is not close to full.

Zfs scrub does need to do many more reads than a resilver since it reads all data and metadata copies. Triggering a resilver operation for the specific disk would likely hasten progress.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Mon, May 21, 2012 at 09:18:03PM -0500, Bob Friesenhahn wrote:
> On Mon, 21 May 2012, Jim Klimov wrote:
>> This is so far a relatively raw idea and I've probably missed
>> something. Do you think it is worth pursuing and asking some
>> zfs developers to make a POC? ;)
>
> I did read all of your text. :-)
>
> This is an interesting idea and could be of some use but it would be
> wise to test it first a few times before suggesting it as a general
> course.

I've done basically this kind of thing before: dd a disk and then scrub rather than replace, treating errors as expected.

> Zfs will try to keep the data compacted at the beginning of the
> partition so if you have a way to know how far out it extends, then the
> initial 'dd' could be much faster when the pool is not close to full.

zdb will show you usage per metaslab; you could use that and effectively select offset ranges to skip any empty ones. After a while, once the pool has seen usage fill past low percentages, I'd say most metaslabs would have some usage, so you might not save much time. Going to finer detail within a metaslab is not worthwhile - much more involved, and it involves the seeks you're trying to avoid.

--
Dan.
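[Editor's note] Dan's suggestion of using per-metaslab usage to skip empty ranges could be sketched as below. The sample text is a stand-in for real `zdb -m <pool>` output, whose exact columns vary between builds - the field positions and the reading that "spacemap 0" means a never-written metaslab are assumptions here, not verified behavior.

```shell
#!/bin/sh
# Sketch: decide which metaslab offset ranges are worth copying during
# the sequential DD phase, based on zdb-style per-metaslab output.
# The here-doc below imitates "zdb -m" output (format is an assumption).
sample_zdb_m() {
cat <<'EOF'
metaslab 0 offset 0 spacemap 39 free 2.1G
metaslab 1 offset 20000000 spacemap 0 free 8.0G
metaslab 2 offset 40000000 spacemap 41 free 5.5G
EOF
}

# Assumption: a metaslab whose spacemap object is 0 was never written
# to, so its whole offset range can be skipped by the copy.
classify() {
    while read -r _ ms _ off _ sm _ _; do
        if [ "$sm" = 0 ]; then
            echo "metaslab $ms at offset 0x$off: skip"
        else
            echo "metaslab $ms at offset 0x$off: copy"
        fi
    done
}

sample_zdb_m | classify
```

In practice one would pipe real `zdb -m pool` output through such a filter and turn the "copy" ranges into seek/count arguments for dd.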
Thank you for reading and replying :-)

2012-05-22 6:18, Bob Friesenhahn wrote:
> On Mon, 21 May 2012, Jim Klimov wrote:
>> This is so far a relatively raw idea and I've probably missed
>> something. Do you think it is worth pursuing and asking some
>> zfs developers to make a POC? ;)
>
> I did read all of your text. :-)
>
> This is an interesting idea and could be of some use but it would be
> wise to test it first a few times before suggesting it as a general
> course. Zfs is still totally not foolproof. I still see postings from
> time to time regarding pools which panic/crash the system (probably due
> to memory corruption).
>
> Zfs will try to keep the data compacted at the beginning of the
> partition so if you have a way to know how far out it extends, then the
> initial 'dd' could be much faster when the pool is not close to full.

For a not-full, not-fragmented pool it is likely that a run of the original resilvering would be faster and known to be correct ;)

> Zfs scrub does need to do many more reads than a resilver since it reads
> all data and metadata copies. Triggering a resilver operation for the
> specific disk would likely hasten progress.

Well, the point in this case was a "selective scrub", as I called it in the text, which would indeed read all blocks and dittos if their DVA lies on the VDEV where we replaced the disk and expect discrepancies. In this limitation it is like a resilver, but it is indeed a scrub in effect. Besides, from what I've seen, a resilver just rewrites the given range of TXGs (like [0-current]) on the target disk based on reads from other disks in the TLVDEV, and on the assumption that up to some TXG point (zero or above) that disk had valid data matching the other disks in the array.
Here, after the DD stage, we have no guarantee that the source disk was fully valid (just a hope, augmented by the lack of read errors and timeouts during the DD phase), and to some degree we don't know if the writes that came in during the replacement process were properly written onto the target disk as well as onto its counterpart being replaced, which is still a (more) valid part of the pool. If we do that write-cloning, and if the source disk was okay, the scrub shouldn't find any errors, I think.

//Jim Klimov
2012-05-22 7:30, Daniel Carosone wrote:
> On Mon, May 21, 2012 at 09:18:03PM -0500, Bob Friesenhahn wrote:
>> On Mon, 21 May 2012, Jim Klimov wrote:
>>> This is so far a relatively raw idea and I've probably missed
>>> something. Do you think it is worth pursuing and asking some
>>> zfs developers to make a POC? ;)
>>
>> I did read all of your text. :-)
>>
>> This is an interesting idea and could be of some use but it would be
>> wise to test it first a few times before suggesting it as a general
>> course.
>
> I've done basically this kind of thing before: dd a disk and then
> scrub rather than replace, treating errors as expected.

I got into a similar situation last night on that Thumper - it is now migrating a flaky source disk in the array from an original old 250Gb disk onto a same-sized partition on the new 3Tb drive (as I outlined as IDEA7 in another thread). The source disk itself had about 300 CKSUM errors during the process, and for reasons beyond my current understanding, the resilver never completed.

In zpool status it said that the process was done several hours before the time I looked at it, but the TLVDEV still had a "spare" component device comprised of the old disk and new partition, and the (same) hotspare device in the pool was "INUSE".

After a while we just detached the old disk from the pool and ran scrub, which right away found some 178 CKSUM errors on the new partition and degraded the TLVDEV and pool. We cleared the errors, and ran the script below to log the detected errors and clear them, so that the disk gets fixed and is not kicked out of the pool due to mismatches. Overall 1277 errors were logged and apparently fixed, and the pool is now on its second full scrub run - no bugs so far (knocking on wood; certainly none this early in the scrub, as we had last time).

So in effect, this methodology works for two of us :)

Since you did similar stuff already, I have a few questions:

1) How/what did you DD? The whole slice with the zfs vdev?
Did the system complain (much) about the renaming of the device compared to the paths embedded in the pool/vdev headers? Did you do anything manually to remedy that (forcing import, DDing some handcrafted uberblocks, anything)?

2) How did you "treat errors as expected" during scrub? As I've discovered, there were hoops to jump through. Is there a switch to disable "degrading" of pools and TLVDEVs based only on the CKSUM counts?

My raw hoop-jumping script:
-----

#!/bin/bash

# /root/scrubwatch.sh
# Watches 'pond' scrub and resets errors to avoid auto-degrading
# the device, but logs the detected error counts however.
# See also "fmstat|grep zfs-diag" for precise counts.
# See also https://blogs.oracle.com/bobn/entry/zfs_and_fma_two_great
# for details on FMA and fmstat with zfs hotspares

while true; do
    zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
    date
    echo ""

    C1="`zpool status pond | grep c1t2d`"
    C2="`echo "$C1" | grep 'c1t2d0s1 ONLINE 0 0 0'`"
    if [ x"$C2" = x ]; then
        echo "`date`: $C1" >> /var/tmp/zpool-clear_pond.log
        zpool clear pond
        zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
        date
    fi
    echo ""

    sleep 60
done

HTH,
//Jim Klimov
On Tue, May 22, 2012 at 12:42:02PM +0400, Jim Klimov wrote:
> 2012-05-22 7:30, Daniel Carosone wrote:
>> I've done basically this kind of thing before: dd a disk and then
>> scrub rather than replace, treating errors as expected.
>
> I got into similar situation last night on that Thumper -
> it is now migrating a flaky source disk in the array from
> an original old 250Gb disk into a same-sized partition on
> the new 3Tb drive (as I outlined as IDEA7 in another thread).
> The source disk itself had about 300 CKSUM errors during
> the process, and for reasons beyond my current understanding,
> the resilver never completed.
>
> In zpool status it said that the process was done several
> hours before the time I looked at it, but the TLVDEV still
> had a "spare" component device comprised of the old disk
> and new partition, and the (same) hotspare device in the
> pool was "INUSE".

I think this is at least in part an issue with older code. There have been various fixes for hangs/restarts/incomplete replaces and sparings over the time since.

> After a while we just detached the old disk from the pool
> and ran scrub, which first found some 178 CKSUM errors on
> the new partition right away, and degraded the TLVDEV and
> pool.
>
> We cleared the errors, and ran the script below to log
> the detected errors and clear them, so the disk is fixed
> and not kicked out of the pool due to mismatches.
>
> So in effect, this methodology works for two of us :)
>
> Since you did similar stuff already, I have a few questions:
> 1) How/what did you DD? The whole slice with the zfs vdev?
> Did the system complain (much) about the renaming of the
> device compared to paths embedded in pool/vdev headers?
> Did you do anything manually to remedy that (forcing
> import, DDing some handcrafted uberblocks, anything?)

I've done it a couple of times at least:

* a failed disk in a raidz1, where I didn't trust that the other disks didn't also have errors.
Basically did a ddrescue from one disk to the new. I think these days a 'replace' where the original disk is still online will use that content, like a hotspare replace, rather than assume it has gone away and must be recreated, but that wasn't the case at the time.

* Where I had an iscsi mirror of a laptop hard disk, but it was out of date and had been detached when the laptop iscsi initiator refused to start. Later, the disk developed a few bad sectors. I made a new submirror, let it sync (with the error still), then blatted bits of the old image over the new in the areas where the bad sectors were being reported. Scrub again, and they were fixed (as well as some blocks on the new submirror repaired, coming back up to date again).

> 2) How did you "treat errors as expected" during scrub?

Pretty much as you did: declined to panic, and restarted scrubs.

--
Dan.
comments far below...

On May 22, 2012, at 1:42 AM, Jim Klimov wrote:
> 2012-05-22 7:30, Daniel Carosone wrote:
>> On Mon, May 21, 2012 at 09:18:03PM -0500, Bob Friesenhahn wrote:
>>> On Mon, 21 May 2012, Jim Klimov wrote:
>>>> This is so far a relatively raw idea and I've probably missed
>>>> something. Do you think it is worth pursuing and asking some
>>>> zfs developers to make a POC? ;)
>>>
>>> I did read all of your text. :-)
>>>
>>> This is an interesting idea and could be of some use but it would be
>>> wise to test it first a few times before suggesting it as a general
>>> course.
>>
>> I've done basically this kind of thing before: dd a disk and then
>> scrub rather than replace, treating errors as expected.
>
> I got into similar situation last night on that Thumper -
> it is now migrating a flaky source disk in the array from
> an original old 250Gb disk into a same-sized partition on
> the new 3Tb drive (as I outlined as IDEA7 in another thread).
> The source disk itself had about 300 CKSUM errors during
> the process, and for reasons beyond my current understanding,
> the resilver never completed.
>
> In zpool status it said that the process was done several
> hours before the time I looked at it, but the TLVDEV still
> had a "spare" component device comprised of the old disk
> and new partition, and the (same) hotspare device in the
> pool was "INUSE".
>
> After a while we just detached the old disk from the pool
> and ran scrub, which first found some 178 CKSUM errors on
> the new partition right away, and degraded the TLVDEV and
> pool.
>
> We cleared the errors, and ran the script below to log
> the detected errors and clear them, so the disk is fixed
> and not kicked out of the pool due to mismatches.
> Overall 1277 errors were logged and apparently fixed, and
> the pool is now on its second full scrub run - no bugs so
> far (knocking wood; certainly none this early in the scrub
> as we had last time).
> So in effect, this methodology works for two of us :)
>
> Since you did similar stuff already, I have a few questions:
> 1) How/what did you DD? The whole slice with the zfs vdev?

dd, or similar dumb block copiers, should work fine. However, they are inefficient and operationally difficult to manage, which is why they tend to fall in the prefer-to-use-something-else category.

> Did the system complain (much) about the renaming of the device compared to paths embedded in pool/vdev headers?

It shouldn't, unless you did something to confuse it, such as having both the original and the dd copy online at the same time. In that case, you will have two different copies of the same identified device that are independent. This is an operational mistake, hence my comment above.

> Did you do anything manually to remedy that (forcing import, DDing some handcrafted uberblocks, anything?)

Not needed.

> 2) How did you "treat errors as expected" during scrub?
> As I've discovered, there were hoops to jump through.
> Is there a switch to disable "degrading" of pools and TLVDEVs based on only the CKSUM counts?

DEGRADED is the status. You clear degraded states by fixing the problem and running zpool clear. DEGRADED, in and of itself, is not a problem.

> My raw hoop-jumping script:
> -----
>
> #!/bin/bash
>
> # /root/scrubwatch.sh
> # Watches 'pond' scrub and resets errors to avoid auto-degrading
> # the device, but logs the detected error counts however.
> # See also "fmstat|grep zfs-diag" for precise counts.
> # See also https://blogs.oracle.com/bobn/entry/zfs_and_fma_two_great
> # for details on FMA and fmstat with zfs hotspares
>
> while true; do
>   zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
>   date
>   echo ""
>
>   C1="`zpool status pond | grep c1t2d`"
>   C2="`echo "$C1" | grep 'c1t2d0s1 ONLINE 0 0 0'`"
>   if [ x"$C2" = x ]; then
>     echo "`date`: $C1" >> /var/tmp/zpool-clear_pond.log
>     zpool clear pond
>     zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
>     date
>   fi
>   echo ""
>
>   sleep 60
> done

I would never allow such scripts in my site. It is important to track the progress and state changes. This script resets those counters for no good reason.

I post this comment in the hope that future searches will not encourage people to try such things.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
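[Editorial note: the bulk-copy half of the dd-then-scrub approach discussed above can be sketched in miniature with plain files instead of raw disk slices. This is only a toy illustration of the dd invocation and its error-tolerant flags, under the assumption that on a real system SRC/DST would be same-sized raw slices and the copy would be followed by `zpool online` and `zpool scrub`; it is not the full procedure.]

```shell
#!/bin/sh
# Toy simulation of the bulk-copy ("DD") phase using plain files.
# On a real system SRC/DST would be raw slices such as /dev/rdsk/c1t2d0s1
# (hypothetical names), with the pool device offlined first and a scrub
# run afterwards to repair blocks that changed since the copy.
set -e
SRC=$(mktemp); DST=$(mktemp)

# Fill the "source disk" with data.
dd if=/dev/urandom of="$SRC" bs=1024 count=256 2>/dev/null

# Bulk copy; conv=noerror,sync pads unreadable blocks with zeros and
# keeps going instead of aborting, as one would want on a flaky disk.
dd if="$SRC" of="$DST" bs=64k conv=noerror,sync 2>/dev/null

# On a healthy source the copy is bit-identical.
cmp -s "$SRC" "$DST" && echo "copy matches"
rm -f "$SRC" "$DST"
```

On a real flaky disk the zero-padded unreadable blocks would then show up as CKSUM errors during the follow-up scrub, which is exactly the "errors treated as expected" phase discussed in this thread.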
2012-05-23 20:54, Richard Elling wrote:
> comments far below...

Thank you Richard for taking notice of this thread and the definitive answers I needed not quote below, for further questions ;)

>> 2) How did you "treat errors as expected" during scrub?
>> As I've discovered, there were hoops to jump through.
>> Is there a switch to disable "degrading" of pools and TLVDEVs based on only the CKSUM counts?
>
> DEGRADED is the status. You clear degraded states by fixing the problem and running zpool clear. DEGRADED, in and of itself, is not a problem.

Doesn't this status preclude the device with many CKSUM errors from participating in the pool (TLVDEV) and the remainder of the scrub in particular? At least the textual error message infers that if a hotspare were available for the pool, it would kick in and invalidate the device I am scrubbing to update into the pool after the DD-phase (well, it was not DD but a hung-up resilver in this case, but that is not substantial).

Such automatic replacement is definitely not what I needed in this particular case, so if it were to happen - it would be a problem indeed, in and of itself.

> dd, or similar dumb block copiers, should work fine.
> However, they are inefficient...

Define efficient? In terms of transferring the 900Gb payload of a 1Tb HDD used for ZFS for a year - DD would beat resilver anytime, in terms of getting most or (less likely) all of the valid bits with data onto the new device. It is the next phase (getting the rest of the bits into valid state) that needs some attention, manual or automated.

Again, DD is not a good usecase indeed for pools with little data on big disks, and while I see why these could be used (i.e. to never face fragmentation), I haven't seen them in practice around here.

> ... and operationally difficult to manage

Actually, that's why I asked whether it makes sense to automate such a scenario as another legal variant of disk replacement, complete with fast data transfer and verification and simultaneous work of the new and old devices until the data migration is marked complete. In particular, that would take care of accepting the scrub errors as an expected part of the disk replacement and not a fatal fault/degradation, and/or allow new writes to propagate onto the new disk while the replacement is going on, minimizing discrepancies right on the run.

In visible effect this would be similar to the current resilver during replacement of a live disk with a hotspare, but the process would follow a different scenario I suggested earlier in the thread.

>> My raw hoop-jumping script:
...
> I would never allow such scripts in my site. It is important to track the progress and state changes. This script resets those counters for no good reason.
>
> I post this comment in the hope that future searches will not encourage people to try such things.

Understood, point taken; I won't try to promote such a "solution", and I agree that it is certainly not a good general idea. It should be noted however (or I want to be corrected, please, if I am wrong), that:

1) Errors are expected on this run, since the DD'ed copy is expected to deviate from current pool state; if the "degradation" mark of the new disk would force it to be kicked out of the pool just because there are many CKSUM errors - which we know should be there due to the manual DD-phase - then the reason is good IMHO (in this one case);

2) The progress is tracked by logging the error counts into a text file. If the admin fired up the script (manually in his terminal or a vnc/screen session), he can also look into the log file or even tail it.

3) The individual CKSUM errors are summed up in fmstat output, and this script does not zero them out, so even system-side tracking is not disturbed here.
Anyhow, if there is a device with just a few CKSUM errors, then the next scrub clears its error counts anyway (if no new problems are found)...

Thanks,
//Jim Klimov
On May 23, 2012, at 1:32 PM, Jim Klimov wrote:

> 2012-05-23 20:54, Richard Elling wrote:
>> comments far below...
>
> Thank you Richard for taking notice of this thread and the definitive answers I needed not quote below, for further questions ;)
>
>>> 2) How did you "treat errors as expected" during scrub?
>>> As I've discovered, there were hoops to jump through.
>>> Is there a switch to disable "degrading" of pools and TLVDEVs based on only the CKSUM counts?
>>
>> DEGRADED is the status. You clear degraded states by fixing the problem and running zpool clear. DEGRADED, in and of itself, is not a problem.
>
> Doesn't this status preclude the device with many CKSUM errors from participating in the pool (TLVDEV) and the remainder of the scrub in particular?

no

> At least the textual error message infers that if a hotspare were available for the pool, it would kick in and invalidate the device I am scrubbing to update into the pool after the DD-phase (well, it was not DD but a hung-up resilver in this case, but that is not substantial).

The man page is clear on this topic, IMHO:

     DEGRADED   One or more top-level vdevs is in the degraded state
                because one or more component devices are offline.
                Sufficient replicas exist to continue functioning.

                One or more component devices is in the degraded or
                faulted state, but sufficient replicas exist to
                continue functioning. The underlying conditions are
                as follows:

                o  The number of checksum errors exceeds acceptable
                   levels and the device is degraded as an indication
                   that something may be wrong. ZFS continues to use
                   the device as necessary.

> Such automatic replacement is definitely not what I needed in this particular case, so if it were to happen - it would be a problem indeed, in and of itself.
>
>> dd, or similar dumb block copiers, should work fine.
>> However, they are inefficient...
>
> Define efficient? In terms of transferring the 900Gb payload of a 1Tb HDD used for ZFS for a year - DD would beat resilver anytime, in terms of getting most or (less likely) all of the valid bits with data onto the new device. It is the next phase (getting the rest of the bits into valid state) that needs some attention, manual or automated.

speed != efficiency

> Again, DD is not a good usecase indeed for pools with little data on big disks, and while I see why these could be used (i.e. to never face fragmentation), I haven't seen them in practice around here.
>
>> ... and operationally difficult to manage
>
> Actually, that's why I asked whether it makes sense to automate such a scenario as another legal variant of disk replacement, complete with fast data transfer and verification and simultaneous work of the new and old devices until the data migration is marked complete. In particular that would take care of accepting the scrub errors as an expected part of the disk replacement and not a fatal fault/degradation, and/or allowing new writes to propagate onto the new disk while the replacement is going on and minimize discrepancies right on the run.
>
> In visible effect this would be similar to current resilver during replacement of a live disk with a hotspare, but the process would follow a different scenario I suggested earlier in the thread.

IMHO, this is too operationally complex for most folks. KISS wins.

>>> My raw hoop-jumping script:
> ...
>> I would never allow such scripts in my site. It is important to track the progress and state changes. This script resets those counters for no good reason.
>>
>> I post this comment in the hope that future searches will not encourage people to try such things.
>
> Understood, point taken, I won't try to promote such a "solution", and I agree that certainly it is not a good general idea indeed.
> It should be noted however (or I want to be corrected, please, if I am wrong), that:
>
> 1) Errors are expected on this run since the DD'ed copy is expected to deviate from current pool state; if the "degradation" mark of the new disk would force it to be kicked out of the pool just because there are many CKSUM errors - which we know should be there due to the manual DD-phase - then the reason is good IMHO (in this one case);
>
> 2) The progress is tracked by logging the error counts into a text file. If the admin fired up the script (manually in his terminal or a vnc/screen session), he can also look into the log file or even tail it.
>
> 3) The individual CKSUM errors are summed up in fmstat output, and this script does not zero them out, so even system-side tracking is not disturbed here.
>
> Anyhow, if there is a device with just a few CKSUM errors, then the next scrub clears its error counts anyway (if no new problems are found)...

What is it about error counters that frightens you enough to want to clear them often?
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Thanks again,

2012-05-24 1:01, Richard Elling wrote:
>> At least the textual error message infers that if a hotspare were available for the pool, it would kick in and invalidate the device I am scrubbing to update into the pool after the DD-phase (well, it was not DD but a hung-up resilver in this case, but that is not substantial).
>
> The man page is clear on this topic, IMHO

Indeed, even in snv_117 the zpool man page says that. But the console/dmesg message was also quite clear, so go figure whom to trust (or fear) more ;)

fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed May 16 03:27:31 MSK 2012
PLATFORM: Sun Fire X4500, CSN: 0804AMT023, HOSTNAME: thumper
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: cc25a316-4018-4f13-c675-d1d84c6325c3
DESC: The number of checksum errors associated with a ZFS device exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-GH for more information.
AUTO-RESPONSE: The device has been marked as degraded. An attempt will be made to activate a hot spare if available.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.

>>> dd, or similar dumb block copiers, should work fine.
>>> However, they are inefficient...
>>
>> Define efficient? In terms of transferring the 900Gb payload of a 1Tb HDD used for ZFS for a year - DD would beat resilver anytime, in terms of getting most or (less likely) all of the valid bits with data onto the new device. It is the next phase (getting the rest of the bits into valid state) that needs some attention, manual or automated.
>
> speed != efficiency

Ummm... this is likely to start a flame war with other posters, and you did not say what efficiency is to you?
I, for now, choose to stand by a statement that reduction of the timeframe that the old disk needs to be in the system is a good thing, as well as that changing the IO pattern from random writes into (mostly) sequential writes and after that random reads may also be somewhat more efficient, especially under other loads (interfering less with them). Even though the whole replacement process may take more wallclock time, there are cases when I'd likely trust it to do a better job than original resilvering.

I think someone with equipment could stage an experiment and compare the two procedures (existing and proposed) on a nearly full and somewhat fragmented pool. Maybe you can disenchant me (not with vague phrases but either theory or practice) and I would then see that my trust is blind, misdirected and without basement. =)

> IMHO, this is too operationally complex for most folks. KISS wins.

That's why I proposed to tuck this scenario under the zfs hood (DD + selective scrub + ditto writes during the process, as an optional alternative to the current resilver), or to explain coherently why this should not be done - not for any situation. Implementing it as a standard supported command would be KISS ;)

Especially if it is known that with some quirks this procedure works, and may be beneficial in some cases, i.e. by reducing the timeframe that a pool with a flaky disk in place is exposed to potential loss of redundancy and large amounts of data, and in the worst case the loss is constrained to those sectors which couldn't be (correctly) read by DD from the source disk and couldn't be reconstructed by raidz/mirror redundancies due to whatever overlaying problems (i.e. a sector from the same block died on another disk too).

> What is it about error counters that frightens you enough to want to clear
> them often?

In this case, mostly, the fright of having the device kicked out of the pool automatically instead of getting it "synced" ("resilvered" is an improper term here, I guess) to the proper state.

In general - since this is a part of some migration procedure which is, again, expected to have errors, we don't really care for signalling them. Why doesn't the original resilver signal several million CKSUM errors per new empty disk when it does reconstruction of sectors onto it? I'd say this is functionally identical. (At least, it would be - if it were part of a supported procedure as I suggest.)

Thanks,
//Jim Klimov

PS: I pondered for a while if I should make up an argument that on a dying disk's mechanics, lots of random IO (resilver) instead of sequential IO (DD) would cause it to die faster, but that's just FUD not backed by any scientific data or statistics - which you likely have, perhaps opposing this argument indeed.
On May 23, 2012, at 2:56 PM, Jim Klimov wrote:

> Thanks again,
>
> 2012-05-24 1:01, Richard Elling wrote:
>>> At least the textual error message infers that if a hotspare were available for the pool, it would kick in and invalidate the device I am scrubbing to update into the pool after the DD-phase (well, it was not DD but a hung-up resilver in this case, but that is not substantial).
>>
>> The man page is clear on this topic, IMHO
>
> Indeed, even in snv_117 the zpool man page says that. But the console/dmesg message was also quite clear, so go figure whom to trust (or fear) more ;)

The FMA message is consistent with the man page.

> fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
> EVENT-TIME: Wed May 16 03:27:31 MSK 2012
> PLATFORM: Sun Fire X4500, CSN: 0804AMT023, HOSTNAME: thumper
> SOURCE: zfs-diagnosis, REV: 1.0
> EVENT-ID: cc25a316-4018-4f13-c675-d1d84c6325c3
> DESC: The number of checksum errors associated with a ZFS device exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-GH for more information.
> AUTO-RESPONSE: The device has been marked as degraded. An attempt will be made to activate a hot spare if available.
> IMPACT: Fault tolerance of the pool may be compromised.
> REC-ACTION: Run 'zpool status -x' and replace the bad device.
>
>>>> dd, or similar dumb block copiers, should work fine.
>>>> However, they are inefficient...
>>>
>>> Define efficient? In terms of transferring the 900Gb payload of a 1Tb HDD used for ZFS for a year - DD would beat resilver anytime, in terms of getting most or (less likely) all of the valid bits with data onto the new device. It is the next phase (getting the rest of the bits into valid state) that needs some attention, manual or automated.
>>
>> speed != efficiency
>
> Ummm... this is likely to start a flame war with other posters, and you did not say what efficiency is to you? How can we compare apples to meat, not even knowing whether the latter is a steak or a pork knee?

Efficiency allows use of denominators other than time. Speed is restricted to a denominator of time. There is no flame war here, look elsewhere.

> I, for now, choose to stand by a statement that reduction of the timeframe that the old disk needs to be in the system is a good thing, as well as that changing the IO pattern from random writes into (mostly) sequential writes and after that random reads may be also somewhat more efficient, especially under other loads (interfering less with them). Even though the whole replacement process may take more wallclock time, there are cases when I'd likely trust it to do a better job than original resilvering.
>
> I think, someone with equipment could stage an experiment and compare the two procedures (existing and proposed) on a nearly full and somewhat fragmented pool.

Operationally, your method loses every time.

> Maybe you can disenchant me (not with vague phrases but either theory or practice) and I would then see that my trust is blind, misdirected and without basement. =)
>
>> IMHO, this is too operationally complex for most folks. KISS wins.
>
> That's why I proposed to tuck this scenario under the zfs hood (DD + selective scrub + ditto writes during the process, as an optional alternative to current resilver), or explain coherently why this should not be done - not for any situation. Implementing it as a standard supported command would be KISS ;)
>
> Especially if it is known that with some quirks this procedure works, and may be beneficial to some cases, i.e. by reducing the timeframe that a pool with a flaky disk in place is exposed to potential loss of redundancy and large amounts of data, and in the worst case the loss is constrained to those sectors which couldn't be (correctly) read by DD from the source disk and couldn't be reconstructed by raidz/mirror redundancies due to whatever overlaying problems (i.e. a sector from same block died on another disk too).

You have not made a case for why this hybrid and failure-prone procedure is required. What problem are you trying to solve?

>> What is it about error counters that frightens you enough to want to clear
>> them often?
>
> In this case, mostly, the fright of having the device kicked out of the pool automatically instead of getting it "synced" ("resilvered" is an improper term here, I guess) to proper state.

Why not follow the well-designed existing procedure?

> In general - since this is a part of some migration procedure which is, again, expected to have errors, we don't really care for signalling them. Why doesn't the original resilver signal several million CKSUM errors per new empty disk when it does reconstruction of sectors onto it? I'd say this is functionally identical. (At least, would be - if it were part of a supported procedure as I suggest).
>
> Thanks,
> //Jim Klimov
>
> PS: I pondered for a while if I should make up an argument that on a dying disk mechanics, lots of random IO (resilver) instead of sequential IO (DD) would cause it to die faster, but that's just a FUD not backed by any scientific data or statistics - which you likely have, and perhaps opposing this argument indeed.

The failure data does not support your hypothesis.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
SCALE 10x, Los Angeles, Jan 20-22, 2012
Let me try to formulate my idea again... You called a similar process "pushing the rope" some time ago, I think.

I feel like I'm passing some exam and am trying to pick answers for a discipline like philosophy, and I have no idea about the examinator's preferences - is he an ex-Communism teacher or an eager new religion fanatic? The same answer can lead to an A or to an F on a state exam. Ah, that was some fun experience :)

Well, what we know is what remains after we forget everything that we were taught, while the exams are our last chance to learn something at all =)

2012-05-24 10:28, Richard Elling wrote:
> You have not made a case for why this hybrid and failure-prone
> procedure is required. What problem are you trying to solve?

Bigger-better-faster? ;)

The original proposal in this thread was about understanding how resilvers and scrubs work, why they are so dog slow on HDDs in comparison to sequential reads, and thinking aloud about what can be improved in this area.

One of the later posts was about improving the disk replacement (where the original is still responsive, but may be imperfect) for filled-up fragmented pools by including a stage of fast data transfer and a different IO pattern for verification and updating of the new disk image, in comparison with the current resilver's IO patterns.

This may or may not have some benefits in certain (corner?) cases which are of practical interest to some users on this list, and if this discussion leads to a POC made by a competent ZFS programmer, which can be tested on a variety of ZFS pools (without risking one's only pool on a homeNAS) - so much the better. Then we would see if this scenario is viable or utterly useless and bad in every tested case.

The practical numbers I have from the same box and disks are:

* Copy from a 250Gb raidz1 (9*(4+1)) pool to a single-disk 3Tb test pool took 24 hours to fill the new disk - including the ZFS overheads.
* Copying of one raw 250(232)Gb partition takes under 2 hours (if it can sustain about 70Mb/s reads from the source without distractions like other pool IO - then 1 hour).

* Proper resilvering (reading all of the BP-tree from the original pool, reading all blocks from the TLVDEV, writing reconstructed(?) sectors to the target disk) from one partition to another took 17 hours.

* Full scrubbing (reading all blocks from the pool, fixing checksum mismatches) takes 25-27 hours.

* Selective scrubbing - unimplemented, timeframe unknown (reading all of the BP-tree from the original pool, reading all blocks from the TLVDEV including the target disk and the original disk, fixing checksum mismatches without panicky messages and/or hotspares kicking in). I *guess* it would have similar speed to a resilver, but less bound to random write IO patterns, which may be better for latencies of other tasks on the system.

So, in the case of the original resilver, I replace the not-yet-dead disk with a hotspare, and after 17 hours of waiting I see if it was successfully resilvered or not. During this time the disk can die, for example, leaving my pool with lowered protection (or lack thereof in the case of raidz1 or two-way mirrors).

In the case of the new method proposed for a POC implementation, after 1 hour I'd already have a somewhat reliable copy of that vdev (a few blocks may have mismatches, but if the source disk dies or is taken away now - not the whole TLVDEV or pool is degraded with compromised protection). Then after the same +17 hours for scrubs I'd be certain that this copy is good.
If the new writes incoming to this TLVDEV between the start of DD and the end of scrub are directed to be written to both the source disk and its copy, then there are fewer (down to zero) checksum discrepancies for the scrub phase to find.

> Why not follow the well-designed existing procedure?

First it was a theoretical speculation, but a couple of days later the incomplete resilver made me run a practical experiment of the idea.

> The failure data does not support your hypothesis.

Ok, then my made-up and dismissed argument does not stand ;)

Thanks for the discussion,
//Jim Klimov
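[Editorial note: the ~1-hour figure quoted above is consistent with simple arithmetic. A back-of-envelope sketch using the thread's own numbers, under the assumption that the quoted "70Mb/s" means roughly 70 megabytes per second of sustained sequential throughput:]

```shell
#!/bin/sh
# Back-of-envelope check of the timings quoted in this thread.
# Assumption: the "70Mb/s" sustained rate means ~70 megabytes per second.
SIZE_GB=232      # raw partition payload, from the thread
RATE_MBS=70      # assumed sustained sequential rate, MB/s
# minutes = GB * 1024 (MB/GB) / rate (MB/s) / 60 (s/min)
CLONE_MIN=$(( SIZE_GB * 1024 / RATE_MBS / 60 ))
echo "sequential clone: ~${CLONE_MIN} minutes (vs. 17 hours observed for the resilver)"
```

This comes out to just under an hour, matching the "then 1 hour" estimate and underlining the roughly 17x gap between a bandwidth-bound clone and the IOPS-bound resilver on this pool.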
big assumption below...

On May 24, 2012, at 6:06 AM, Jim Klimov wrote:

> Let me try to formulate my idea again... You called a similar process "pushing the rope" some time ago, I think.
>
> I feel like I'm passing some exam and am trying to pick answers for a discipline like philosophy and I have no idea about the examinator's preferences - is he an ex-Communism teacher or an eager new religion fanatic? The same answer can lead to an A or to an F on a state exam. Ah, that was some fun experience :)
>
> Well, what we know is what remains after we forget everything that we were taught, while the exams are our last chance to learn something at all =)
>
> 2012-05-24 10:28, Richard Elling wrote:
>> You have not made a case for why this hybrid and failure-prone procedure is required. What problem are you trying to solve?
>
> Bigger-better-faster? ;)
>
> The original proposal in this thread was about understanding how resilvers and scrubs work, why they are so dog slow on HDDs in comparison to sequential reads, and thinking aloud what can be improved in this area.
>
> One of the later posts was about improving the disk replacement (where the original is still responsive, but may be imperfect) for filled-up fragmented pools by including a stage of fast data transfer and a different IO pattern for verification and updating of the new disk image, in comparison with current resilver's IO patterns.
>
> This may or may not have some benefits in certain (corner?) cases which are of practical interest to some users on this list, and if this discussion leads to a POC made by a competent ZFS programmer, which can be tested on a variety of ZFS pools (without risking one's only pool on a homeNAS) - so much the better. Then we would see if this scenario is viable or utterly useless and bad in every tested case.
> The practical numbers I have from the same box and disks are:
> * Copy from a 250Gb raidz1 (9*(4+1)) pool to a single-disk 3Tb test pool took 24 hours to fill the new disk - including the ZFS overheads.
> * Copying of one raw 250(232)Gb partition takes under 2 hours (if it can sustain about 70Mb/s reads from the source without distractions like other pool IO - then 1 hour).
> * Proper resilvering (reading all BP-tree from the original pool, reading all blocks from the TLVDEV, writing reconstructed(?) sectors to the target disk) from one partition to another took 17 hours.
> * Full scrubbing (reading all blocks from the pool, fixing checksum mismatches) takes 25-27 hours.
> * Selective scrubbing - unimplemented, timeframe unknown (reading all BP-tree from the original pool, reading all blocks from the TLVDEV including the target disk and the original disk, fixing checksum mismatches without panicky messages and/or hotspares kicking in). I *guess* it would have similar speed to a resilver, but less bound to random write IO patterns, which may be better for latencies of other tasks on the system.
>
> So, in case of original resilver, I replace the not-yet-dead disk with a hotspare, and after 17 hours of waiting I see if it was successfully resilvered or not. During this time the disk can die for example, leaving my pool with lowered protection (or lack thereof in case of raidz1 or two-way mirrors).
>
> In case of the new method proposed for a POC implementation, after 1 hour I'd already have a somewhat reliable copy of that vdev (a few blocks may have mismatches,

This is a big assumption -- that the disk will operate normally, even for data it cannot read. In my experience, this assumption is not valid for the majority of HDD failure modes. Also, in the case of consumer-grade disks, a single sector media error could take a very long time to retry/fail.

> but if the source disk dies or is taken away now - not the whole TLVDEV or pool is degraded and has compromised protection). Then after the same +17 hours for scrubs I'd be certain that this copy is good.
>
> If the new writes incoming to this TLVDEV between start of DD and end of scrub are directed to be written on both the source disk and its copy, then there are less (down to zero) checksum discrepancies that the scrub phase would find.
>
>> Why not follow the well-designed existing procedure?
>
> First it was a theoretical speculation, but a couple of days later the incomplete resilver made me a practical experiment of the idea.
>
>> The failure data does not support your hypothesis.
>
> Ok, then my made-up and dismissed argument does not stand ;)
>
> Thanks for the discussion,

np
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
2012-05-24 18:55, Richard Elling wrote:
> This is a big assumption -- that the disk will operate normally, even
> for data it cannot read. In my experience, this assumption is not valid
> for the majority of HDD failure modes. Also, in the case of consumer-grade
> disks, a single sector media error could take a very long time to
> retry/fail.

Indeed it is, and I've covered this in the thread earlier - the bulk copying phase ("DD-phase") should monitor its real progress, and if it detects lags in comparison to the average or expected speeds (expected = some tuning variable, i.e. 50Mb/s), the process should skip over some (arbitrary) range of sectors and go on from another location (such skipped sectors are in danger indeed, until the scrub-phase detects and reconstructs them), or fall back to the original resilver method completely. That was already described in some detail I thought of at the time of the posting, and I can't add much to that yet.

From what I've seen, faulty sectors are usually either single errors or a "scratched" range which can be worked around with i.e. partitioning for legacy FSes (if the SMART relocation doesn't deal with them properly for any reason), while most of the rest of the disk is okay. Retries may be lengthy, ranging from several seconds up to a minute, but they are often constrained to a few locations and *may* add little delay in the overall scheme of things. If the delay is more than acceptable and/or we can't find a "working location" on the source disk, we just fall back to the old method - either the original resilver, or if much data has been copied to the new disk - to the new selective scrub (it being much like the resilver, but taking into account those sectors on the target disk which may have been copied over correctly).

A somewhat worse case is intermittent errors at random times and logical disk locations due to who knows what - overheating, firmware overflow errors, bus resets, or whatever.
Rather, such errors are a reason for scrub-validation of the data after a mass migration, perhaps (as well as a reason for preventive regular scrubs)... //Jim
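The skip-ahead DD-phase proposed above could be sketched roughly as follows. This is only a minimal illustration of the idea, not anything from actual ZFS code: the chunk size, the 1-second/50 MB/s lag test, and the `dd_phase`/`skipped` names are all made up for the example, and it operates on plain files rather than raw devices.

```python
import os
import time

def dd_phase(src_path, dst_path, chunk=64 * 1024 * 1024,
             min_rate=50e6, skip=256 * 1024 * 1024):
    """Bulk-copy src to dst, jumping over regions that read too slowly.

    Returns the list of (start, end) byte ranges that were skipped, so
    that a later scrub-like pass can reconstruct them from redundancy.
    All names and thresholds here are illustrative only.
    """
    skipped = []
    size = os.path.getsize(src_path)
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        pos = 0
        while pos < size:
            src.seek(pos)
            t0 = time.monotonic()
            try:
                data = src.read(min(chunk, size - pos))
            except OSError:
                data = None  # unreadable region: treat it like a slow one
            elapsed = time.monotonic() - t0
            too_slow = (data is not None and elapsed > 1.0
                        and len(data) / elapsed < min_rate)
            if data is None or too_slow:
                # Lagging behind the expected speed: remember the range and
                # continue from a later offset; the skipped sectors stay
                # "in danger" until the scrub phase fixes them.
                end = min(pos + skip, size)
                skipped.append((pos, end))
                pos = end
                continue
            dst.seek(pos)
            dst.write(data)
            pos += len(data)
    return skipped
```

On a healthy source this degenerates into a plain sequential copy; the interesting output is the `skipped` list, which would feed the selective scrub phase.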
On 5/23/12 11:28 PM, Richard Elling wrote:

>>> The man page is clear on this topic, IMHO
>>
>> Indeed, even in snv_117 the zpool man page says that. But the
>> console/dmesg message was also quite clear, so go figure whom
>> to trust (or fear) more ;)
>
> The FMA message is consistent with the man page.

The man page seems not to mention the critical part of the FMA message that the OP is worried about. The OP said that his motivation for clearing the errors and fearing the degraded state was that he feared this:

>> AUTO-RESPONSE: The device has been marked as degraded. An attempt
>> will be made to activate a hot spare if available.

He doesn't want his dd'd new device kicked out of the vdev and replaced by a hot spare (if available) due to the number of errors and the scarlet letter of "degraded" at the device level - I don't think he cares about the pool-level degraded status, since it doesn't "do" anything.

>> fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
>> EVENT-TIME: Wed May 16 03:27:31 MSK 2012
>> PLATFORM: Sun Fire X4500, CSN: 0804AMT023 , HOSTNAME: thumper
>> SOURCE: zfs-diagnosis, REV: 1.0
>> EVENT-ID: cc25a316-4018-4f13-c675-d1d84c6325c3
>> DESC: The number of checksum errors associated with a ZFS device
>> exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-GH for more information.
>> AUTO-RESPONSE: The device has been marked as degraded. An attempt
>> will be made to activate a hot spare if available.
>> IMPACT: Fault tolerance of the pool may be compromised.
>> REC-ACTION: Run 'zpool status -x' and replace the bad device.
On May 25, 2012, at 1:53 PM, zfs user wrote:

> The man page seems not to mention the critical part of the FMA
> message that the OP is worried about. [...] He doesn't want his dd'd
> new device kicked out of the vdev and replaced by a hot spare (if
> available) due to the number of errors and the scarlet letter of
> "degraded" at the device level [...]

By the time you could read such a message, the hot spare would have already kicked in. Obviously, this was not the OP's issue.
-- richard
2012-05-26 1:07, Richard Elling wrote:

> On May 25, 2012, at 1:53 PM, zfs user wrote:
>> he doesn't want his dd'd new device kicked out of the vdev and
>> replaced by a hot spare (if available) due to the number of errors
>> and the scarlet letter of "degraded" at the device level [...]
>
> By the time you could read such a message, the hot spare would have
> already kicked in. Obviously, this was not the OP's issue.
> -- richard

Kind of, it was - the motivation for feeling insecure and, ultimately, for clearing the CKSUM errors every minute (whenever there was a nonzero error count), at least - via the script you said should never be used in "normal" practice, and I agree with that conclusion. (Manual) DDing is not the normal practice sanely covered by the degradation/hot-sparing mechanism.

As I wrote, the first time I saw the message the pool did not have an assigned hot spare, but it got marked degraded. Just in case, I came up with that "cksum-mismatch-cleansing" script and restarted the scrub, since I knew the errors on-disk were due to an unfinished "proper" resilver onto it. I was not convinced that the new disk would still fully operate in the pool while marked as degraded, and I did not want the scrub to continue only to find out that the disk would not be actively used and "repaired".

To say that in other words: I know that sometimes docs can lag behind or hop ahead of implemented features, and the latter can also be buggy or incomplete.
While the theory (the FMA and man page snippets) said the disk should continue being used by the array despite the DEGRADED mark, I had no intention of staging an experiment here to find out whether it actually would, in that aged version of the software. Thanks, //the OP ;)
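For what it's worth, the "cksum-mismatch-cleansing" watchdog discussed above was presumably something of this shape. The actual script was never posted, so this is only a guessed sketch: the function name, the injectable `run` hook, the pool/device names, and the 60-second interval are all my assumptions. It relies only on the real `zpool clear <pool> [device]` subcommand, which resets a device's error counters.

```python
import subprocess
import time

def clear_cksum_errors(pool, device, interval=60, iterations=None,
                       run=subprocess.run):
    """Keep resetting the (known-benign) checksum error counters so that
    FMA never sees a count high enough to fault the device and pull in
    a hot spare.  `run` is injectable for testing; in real use it
    invokes the actual zpool binary.
    """
    cleared = 0
    while iterations is None or cleared < iterations:
        # `zpool clear <pool> <device>` resets the error counters.
        run(['zpool', 'clear', pool, device], check=False)
        cleared += 1
        if iterations is not None and cleared >= iterations:
            break
        time.sleep(interval)
    return cleared
```

With `iterations=None` it runs until killed, which matches the "clear every minute until the scrub finishes" usage described in the thread; as the thread concludes, this is a workaround for an abnormal situation (a manually dd'd disk), not something to run in normal practice.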