Hello all,

While waiting for that resilver to complete last week, I caught myself wondering how resilvers (are supposed to) work in ZFS. Based on what I see in practice and read on this list and in some blogs, I've built a picture, and I would be grateful if some experts actually familiar with the code and architecture would say how far my guesses are from the truth ;)

Ultimately I wonder if there are possible optimizations to make the scrub process resemble sequential drive-cloning (bandwidth/throughput-bound), rather than the IOPS-bound random-seek thrashing for hours that we often see now, at least on (over?)saturated pools. This may possibly improve zfs send speeds as well.

First of all, I state (and ask to confirm): I think resilvers are a subset of scrubs, in that:

1) resilvers are limited to a particular top-level VDEV (whose number is a component of each block's DVA address), and

2) when a scrub finds a block mismatching its known checksum, it reallocates the whole block anew using the recovered known-valid data - in essence it is a newly written block with a new path in the BP tree and so on; a resilver expects to have a disk full of known-missing pieces of blocks, and reconstructed pieces are written on the resilvering disk "in place" at an address dictated by the known DVA - this avoids rewriting the other disks and the BP tree as COW would otherwise require.

Other than these points, resilvers and scrubs should work the same, perhaps with nuances like separate tunables for throttling and such - but the generic algorithms should be nearly identical.

Q1: Is this assessment true?

So I'll call them both a "scrub" below - it's shorter :)

Now, as everybody knows, at least by word of mouth on this list, scrub tends to be slow on pools with a rich life (many updates and deletions, causing fragmentation, with "old" and "young" blocks intermixed on disk), more so if the pools are quite full (over about 80% for some reporters). This slowness (on non-SSD disks with non-zero seek latency) is attributed to several reasons I've seen stated and/or thought up while pondering. The reasons may include statements like:

1) "Scrub goes on in TXG order."

If it is indeed so, the system must find older blocks, then newer ones, and so on. IF the block-pointer tree starting from the uberblock is the only reference to the entirety of the on-disk blocks (unlike, say, the DDT), then this tree would have to be read into memory, sorted by TXG age, and then processed. From my system's failures I know that this tree would take about 30GB on my home-NAS box with 8GB RAM, and the kernel crashes the machine by depleting RAM and not going into swap after certain operations (i.e. large deletes on datasets with dedup enabled). That was discussed last year by me, and recently by other posters.

Since scrub does not do that, and does not even press on RAM in a fatal manner, I think this "reason" is wrong. I also fail to see why one would do that processing order in the first place - on a fairly fragmented system, even the blocks from "newer" TXGs do not necessarily follow those from "previous" ones.

What this rumour could reflect, however, is that a scrub (or more importantly, a resilver) is indeed limited to the "interesting" range of TXGs, such as picking only those blocks which were written between the last TXG that a lost-and-reconnected disk knew of (known to the system via that disk's stale uberblock) and the current TXG at the moment of its reconnection. Newer writes would probably land on all disks anyway, so a resilver has only to find and fix those missing TXG numbers. On my problematic system, however, I only saw full resilvers, even after numerous restarts... This may actually support the idea that scrubs are NOT TXG-ordered; otherwise a regularly updated bookkeeping attribute on the disk (in the uberblock?) would note that some TXGs are known to fully exist on the resilvering drive - and this is not happening.

2) "Scrub walks the block-pointer tree."

That seems like a viable reason for lots of random reads (hitting the IOPS barrier). It does not directly explain the reports I think I've seen about L2ARC improving scrub speeds and system responsiveness - although extra caching takes the repetitive load off the HDDs and leaves them some more timeslices to participate in scrubbing (and *that* should incur reads from disks, not caches). On an active system, block-pointer entries are relatively short-lived, with whole branches of the tree being updated and written to a new location upon every file update. This image is bound to look like good cheese after a while, even if the writes were initially coalesced into few IOs.

3) "If there are N top-level VDEVs in a pool, then only the one with the resilvering disk takes the performance hit" - not quite true, because pieces of the BP tree are spread across all VDEVs. The one resilvering would get the most bulk traffic, when DVAs residing on it are found and userdata blocks get transferred, but random read seeks caused by the resilvering process should happen all over the pool.

Q2: Am I correct with the interpretation of statements 1-3?

IDEA1

One optimization that could take place here would be to store some of the BPs' ditto copies in compact locations on disk (not spread all over it evenly), albeit maybe hurting write performance. This way a resilver run, or even a scrub or zfs send, might be like a vdev prefetch - a scooping read of several megabytes' worth of block pointers (this would especially help if the whole tree would fit in RAM/L2ARC/swap), then sorting out the tree or its major branches. The benefit would be little mechanical seeking for lots of BP data. This might possibly require us to invalidate the freed BP slots somehow as well :\

In the case of scrubs, where we have to read all of the allocated blocks from the media to test them, this would let us schedule a sequential read of the drives' userdata while making sense of the sectors we find (as particular zfs blocks).

In the case of resilvering, this would let us find the DVAs of blocks in the interesting TLVDEV and in the TXG range, and also schedule huge sequential reads instead of random seeking.

In the case of zfs send, this would help us pick out the TXG-limited ranges of blocks for a dataset, and again schedule sequential reads for the userdata (if any).

Q3: Does the IDEA above make sense - storing BP entries (one of the ditto blocks) in some common location on disk, so as to minimize mechanical seeks while reading much of the BP tree?

IDEA2

It seems possible to enable defragmentation of the BP tree (those ditto copies that are stored together) by just relocating the valid ones, in correct order, onto a free metaslab. It seems that ZFS keeps some free space for passive defrag purposes anyway - why not use it actively? Live migration of blocks like this seems to be available with scrub's repair of mismatching blocks. However, some care should be taken here to account for the fact that the parent block pointers would also need to be reallocated, since the children's checksums would change - so the whole tree/branch of reallocations would have to be planned and written out in sequential order onto the spare free space.

Overall, if my understanding somewhat resembles how things really are, these ideas may help create and maintain a layout of metadata such that it can be bulk-read, which is IMHO critical for many operations, as well as to shorten recovery windows when resilvering disks.

Q4: I wonder if similar (equivalent) solutions are already in place and did not help much? ;)

Thanks,
//Jim
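To make the TXG-window idea in point (1) above concrete, here is a minimal sketch of the filter a resilver would apply - blocks born while the disk was away, on the affected top-level vdev. The record shape and field names (`birth_txg`, `top_vdev`) are invented for illustration; they are not the actual ZFS structures.

```python
# Hypothetical sketch of the "interesting TXG range" filter described
# above: a resilver only needs blocks born while the disk was offline.
# Field names are illustrative, not real ZFS code.

def blocks_to_resilver(blocks, stale_txg, current_txg, resilvering_vdev):
    """Select blocks written after the disk's last known-good TXG."""
    return [
        bp for bp in blocks
        if stale_txg < bp["birth_txg"] <= current_txg
        and bp["top_vdev"] == resilvering_vdev
    ]

blocks = [
    {"birth_txg": 100, "top_vdev": 0},  # written before the outage
    {"birth_txg": 205, "top_vdev": 0},  # written during the outage
    {"birth_txg": 205, "top_vdev": 1},  # other top-level vdev: skip
    {"birth_txg": 260, "top_vdev": 0},  # written after reconnection
]
todo = blocks_to_resilver(blocks, stale_txg=200, current_txg=250,
                          resilvering_vdev=0)
print(todo)  # only the txg-205 block on vdev 0
```

The sketch glosses over the hard part the thread keeps returning to: enumerating candidate blocks at all requires walking the BP tree, since there is no flat list of blocks to filter.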
On Fri, May 18, 2012 at 03:05:09AM +0400, Jim Klimov wrote:
> While waiting for that resilver to complete last week, I caught
> myself wondering how the resilvers (are supposed to) work in ZFS?

The devil finds work for idle hands... :-)

> Based on what I see in practice and read in this list and some
> blogs, I've built a picture and would be grateful if some experts
> actually familiar with code and architecture would say how far off
> I guessed from the truth ;)

Well, I'm not that - certainly not on the code. It would probably be best (for both of us) to spend idle time looking at the code, before spending too much on speculation. Nonetheless, let's have at it! :)

> Ultimately I wonder if there are possible optimizations to make the
> scrub process more resembling a sequential drive-cloning
> (bandwidth/throughput-bound), than an IOPS-bound random seek
> thrashing for hours that we often see now, at least on
> (over?)saturated pools.

The tradeoff will be code complexity and resulting fragility. Choose wisely what you wish for.

> This may possibly improve zfs send speeds as well.

Less likely; that's pretty much always going to have to go in txg order.

> First of all, I state (and ask to confirm): I think resilvers are a
> subset of scrubs, in that:
> 1) resilvers are limited to a particular top-level VDEV (and its
> number is a component of each block's DVA address), and
> 2) when scrub finds a block mismatching its known checksum, scrub
> reallocates the whole block anew using the recovered known-valid
> data - in essence it is a newly written block with a new path in
> the BP tree and so on; a resilver expects to have a disk full of
> known-missing pieces of blocks, and reconstructed pieces are
> written on the resilvering disk "in-place" at an address dictated
> by the known DVA - this allows to not rewrite the other disks and
> BP tree as COW would otherwise require.

No. Scrub (and any other repair, such as for errors found in the course of normal reads) rewrites the reconstructed blocks in place: to the original DVA as referenced by its parents in the BP tree, even if the device underneath that DVA is actually a new disk. There is no COW. This is not a rewrite, and there is no original data to preserve; this is a repair: making the disk sector contain what the rest of the filesystem tree 'expects' it to contain. More specifically, making it contain data that checksums to the value that block pointers elsewhere say it should, via reconstruction using redundant information (the same DVA on a mirror/RAIDZ reconstruction, or ditto blocks at different DVAs found in the parent BP for copies>1, including metadata).

BTW, if a new BP tree were required to repair blocks, we'd have bp-rewrite already (or we wouldn't have repair yet).

> Other than these points, resilvers and scrubs should work the same,
> perhaps with nuances like separate tunables for throttling and
> such - but generic algorithms should be nearly identical.
>
> Q1: Is this assessment true?

In a sense, yes, despite the correction above. There is less difference between these cases than you expected, so they are nearly identical :-)

> So I'll call them both a "scrub" below - it's shorter :)

Call them all repair. The difference is not in how repair happens, but in how the need for a given sector to be repaired is discovered. Let's go over those, and clarify terminology, before going through the rest of your post:

* Normal reads: a device error or checksum failure triggers a repair.
* Scrub: Devices may be fine, but we want to verify that and fix any errors. In particular, we want to check all redundant copies.
* Resilver: A device has been offline for a while, and needs to be 'caught up', from its last known-good TXG to current.
* Replace: A device has gone, and needs to be completely reconstructed.
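The four discovery paths above differ mainly in which TXG window they cover and whether errors are expected. A toy summary, with invented names and return shape (not ZFS fields):

```python
# Toy summary of the four repair-discovery cases listed above.
# The tuples (start_txg, end_txg, errors_expected) are illustrative.

def repair_window(kind, last_good_txg=None, current_txg=None):
    """Return (start_txg, end_txg, errors_expected) for each case."""
    if kind == "normal_read":
        return (None, None, False)          # repair only on a failed read
    if kind == "scrub":
        return (0, current_txg, False)      # verify everything, all copies
    if kind == "resilver":
        return (last_good_txg, current_txg, True)  # catch up the window
    if kind == "replace":
        return (0, current_txg, True)       # a resilver from txg 0
    raise ValueError(kind)

print(repair_window("replace", current_txg=300))
print(repair_window("resilver", last_good_txg=200, current_txg=300))
```

Dan's point that "Replace is essentially resilver with a starting TXG of 0" falls out directly: the two cases differ only in the window's start.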
Scrub is very similar to normal reads, apart from checking all copies rather than serving the data from whichever copy successfully returns first. Errors are not expected, and are counted and repaired as/if found.

Resilver and Replace are very similar, and the terms are often used interchangeably. Replace is essentially resilver with a starting TXG of 0 (plus some labelling). In both cases, an error is expected or assumed from the device in question, and repair is initiated unconditionally (and without incrementing error counters).

You're suggesting an asymmetry between Resilver and Replace to exploit the possible speedup of sequential access; ok, seems attractive at first blush, let's explore the idea.

> Now, as everybody knows, at least by word-of-mouth on this list,
> the scrub tends to be slow on pools with a rich life (many updates
> and deletions, causing fragmentation, with "old" and "young" blocks
> intermixed on disk), more so if the pools are quite full (over
> about 80% for some reporters). This slowness (on non-SSD disks with
> non-zero seek latency) is attributed to several reasons I've seen
> stated and/or thought up while pondering. The reasons may include
> statements like:
>
> 1) "Scrub goes on in TXG order".

Yes, it does, approximately. More below.

> If it is indeed so - the system must find older blocks, then newer
> ones, and so on. IF the block-pointer tree starting from uberblock
> is the only reference to the entirety of the on-disk blocks (unlike
> say DDT)

(aside: it is. The DDT is not special in this sense, because to find the DDT you have to follow the bp tree too.)

> then this tree would have to be read into memory and sorted by TXG
> age and then processed.
>
> From my system's failures I know that this tree would take about
> 30GB on my home-NAS box with 8GB RAM, and the kernel crashes the
> machine by depleting RAM and not going into swap after certain
> operations (i.e. large deletes on datasets with enabled
> deduplication). That was discussed last year by me, and recently by
> other posters.
>
> Since the scrub does not do that and does not even press on RAM in
> a fatal manner, I think this "reason" is wrong.

Well, your observations and analysis of what scrub is not doing are correct and sound. :-)

> I also fail to see why one would do that processing ordering in the
> first place - on a fairly fragmented system even the blocks from
> "newer" TXGs do not necessarily follow those from the "previous"
> ones.

You're thinking too much about the on-disk ordering of sector numbers. Understandable, since you're trying to find a way to do sequential repair. For now, let's just say that going in TXG order is the easiest way to iterate over the disk and be sure to get all live data, without doing other complicated and memory/IO-intensive sorts. Again, we'll come back to this.

> What this rumour could reflect, however, is that a scrub (or more
> importantly, a resilver) are indeed limited by the "interesting"
> range of TXGs, such as picking only those blocks which were written
> between the last TXG that a lost-and-reconnected disk knew of
> (known to the system via that disk's stale uberblock), and the
> current TXG at the moment of its reconnection. Newer writes would
> probably land onto all disks anyway, so a resilver has only to find
> and fix those missing TXG numbers.

Yes, for resilver this is spot on, as above.

> In my problematic system however I only saw full resilvers even
> after numerous restarts... This may actually support the idea that
> scrubs are NOT txg-ordered, otherwise a regularly updated
> bookkeeping attribute on the disk (in uberblock?) would note that
> some TXGs are known to fully exist on the resilvering drive - and
> this is not happening.

Now you have two problems:

* confusing scrub (as a way of checking and possibly triggering repair) with resilver (known need to repair).
* older code: in newer code there is better bookkeeping, at least for scrub, that allows a resume (after, say, a reboot) from where it left off. I'm not sure about resilver here, though (and note the complexity with the optimisation of 'new writes' past the offline window, above).

> 2) "Scrub walks the block-pointer tree".

Yes, it does. It's essentially the same as the previous point, though: scrub walks the bp tree in txg order.

> That seems like a viable reason for lots of random reads (hitting
> the IOPS barrier).

Yep. We're getting closer to the real reason here, but let's play it out in full as we go.

> It does not directly explain the reports I think I've seen about
> L2ARC improving scrub speeds and system responsiveness - although
> extra caching takes the repetitive load off the HDDs and leaves
> them some more timeslices to participate in scrubbing (and *that*
> should incur reads from disks, not caches).

If L2ARC indeed helps, it will surely be mostly to do with improving responsiveness on other reads and freeing up the disks to do scrubs.

> On an active system, block pointer entries are relatively
> short-lived, with whole branches of a tree being updated and
> written in a new location upon every file update. This image is
> bound to look like good cheese after a while even if the writes
> were initially coalesced into few IOs.

You might be surprised; you probably have more long-lived data than you thought, especially with snapshots in place. The full metadata bp tree path to that old data is also retained. Note also the corollary: whenever data is COW'd, the full metadata path is also COW'd (possibly rolled up together with other updates in the same TXG). What that means is that, to read data for a new TXG as you progress in a resilver, replace or scrub, you have to read all new metadata.

> 3) "If there are N top-level VDEVs in a pool, then only the one
> with the resilvering disk would be hit for performance" - not quite
> true, because pieces of the BPtree are spread across all VDEVs. The
> one resilvering would get the most bulk traffic, when DVAs residing
> on it are found and userdata blocks get transferred, but random
> read seeks caused by the resilvering process should happen all over
> the pool.

Not sure what this one means, and I think it's mostly false, for the reason you state. Either resilvering or replacing, the disk is mostly getting writes - and cacheable writes at that - from this activity. For resilver especially, it might see reads for other concurrent activity. The IOPS limitation is for the seeks necessary to satisfy reads, mostly from other disks, to provide data for reconstruction. As noted above, if a disk is being resilvered for TXG n, it won't have any of the metadata for that TXG either, so it won't really be servicing any reads.

> Q2: Am I correct with the interpretation of statements 1-3?

Not quite, as discussed above. Let's go over the scrub case in detail (resilver being a txg window-limited variant, and both resilver and replace enabling different error-reporting logic).

* Every meta/data block in the disk was written in a given TXG.
* Every meta/data block is reachable by a path through the bp tree, from the root at the close of that TXG, down through however many indirect levels are needed.
* For every later TXG while the data remains current, the new root and top few nodes in the tree will change (due to other writes), but those upper nodes will refer to the same subtree below the point of divergence caused by those later writes. In other words, each TXG assembles a bp tree from a new root, and reuses subtrees from the previous TXG where no changes have been made.
* Snapshots are simply additional references to old (filesystem) root bp's, as a way to keep that subtree live.

And here's the kicker for any attempt at LBA-sequential repair:

* The checksum for a given block, which allows it to be verified, is stored in the bp that refers to it.

So, if reading blocks sequentially, you can't verify them. You don't know what their checksums are supposed to be, or even what they contain or where to look for the checksums, even if you were prepared to seek to find them. This is why scrub walks the bp tree.

When doing a scrub, you start at the root bp and walk the tree, doing reads for everything, verifying checksums, and letting repair happen for any errors. That traversal is either a breadth-first or depth-first traversal of the tree (I'm not sure which), done in TXG order. When you're done with that bp tree, the pool has almost certainly moved on with new TXGs. Get the new root bp, and do the traversal again. This time, any bp with a birth time equal to or older than the TXG you previously finished has already been verified, including the entire subtree below it, and so can be skipped. This is why scrub walks in TXG order. It's also why the disk access is in 'approximate TXG order', as you'll sometimes see the more pedantic commenters state.

Note that there can be a lot of fanout in the tree; don't make the mistake of thinking that the directories and filesystems you see are the tree in question; the ZPL is a layer on top of the ZAP object store.

> IDEA1
>
> One optimization that could take place here would be to store some
> of the BPs' ditto copies in compact locations on disk (not all over
> it evenly), albeit maybe hurting the write performance. This way a
> resilver run, or even a scrub or zfs send, might be like a
> vdev-prefetch - a scooping read of several megabytes worth of
> blockpointers (this would especially help if the whole tree would
> fit in RAM/L2ARC/swap), then sorting out the tree or its major
> branches. The benefit would be little mechanical seeking for lots
> of BP data. This might possibly require us to invalidate the freed
> BP slots somehow as well :\
>
> In case of scrubs, where we would have to read in all of the
> allocated blocks from the media to test it, this would let us
> schedule a sequential read of the drives' userdata while making
> sense of the sectors we find (as particular zfs blocks).
>
> In case of resilvering - this would let us find DVAs of blocks in
> the interesting TLVDEV and in the TXG range and also schedule huge
> sequential reads instead of random seeking.
>
> In case of zfs send, this would help us pick out the TXG-limited
> ranges of the blocks for a dataset, and again schedule the
> sequential reads for userdata (if any).
>
> Q3: Does the IDEA above make sense - storing BP entries (one of the
> ditto blocks) in some common location on disk, so as to minimize
> mechanical seeks while reading much of the BP tree?

It's not going to help a scrub, since that reads all of the ditto block copies, so bunching just one copy isn't useful. It might potentially help metadata-heavy activities that don't touch the data, like find(1), at the expense of several other issues, at least some of which you note.

That said, there are always opportunities for tweaks and improvements to the allocation policy, or even for multiple allocation policies, each more suited/tuned to specific workloads if known in advance.

> IDEA2
>
> It seems possible to enable defragmentation of the BP tree (those
> ditto copies that are stored together) by just relocating the valid
> ones in correct order onto a free metaslab.

"Just" is such a four-letter word.

If you move a bp, you change its DVA. Which means that the parent bp pointing to it needs to be updated and rewritten, and its parents as well. This is new, COW data, with a new TXG attached -- but referring to data that is old and has not been changed. You just broke snapshots and scrub, at least. This is bp rewrite, or rather, why bp-rewrite is hard.

> It seems that ZFS keeps some free space for passive defrag purposes
> anyway - why not use it actively? Live migration of blocks like
> this seems to be available with scrub's repair of the mismatching
> blocks.

This gets back to the misunderstanding (way) above. Repair is not COW; repair is repairing the disk block to the original, correct contents.

> However, here some care should be taken to take into account that
> the parent blockpointers would also need to be reallocated since
> the children's checksums would change - so the whole tree/branch of
> reallocations would have to be planned and written out in
> sequential order onto the spare free space.

And more complexities, since you want this done on a live pool.

> Overall, if my understanding somewhat resembles how things really
> are, these ideas may help create and maintain such layout of
> metadata that it can be bulk-read, which is IMHO critical for many
> operations as well as to shorten recovery windows when resilvering
> disks.
>
> Q4: I wonder if similar (equivalent) solutions are already in place
> and did not help much? ;)

At least scrub does more book-keeping in more recent code and will avoid restarts and rework. I would like to see a replace variant that signals that at least some of the data on the disk may already be valid, so it could potentially be used in reconstruction when multiple disks have errors.

-- Dan.
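The traversal and skip rule described in the reply above - walk from the root bp, verify each child against the checksum stored in its parent, and skip any subtree whose birth TXG is at or below the last fully scrubbed TXG - can be sketched roughly as follows. The node structure and names are invented for illustration; real ZFS block pointers, checksums, and the scan machinery are far more involved.

```python
import hashlib

# Illustrative sketch of the scrub walk described above: checksums live
# in the PARENT bp, and a subtree whose birth txg is at or below the
# last fully scrubbed txg can be skipped without reading it.
# Invented structures, not ZFS code.

class Node:
    def __init__(self, data, birth_txg, children=()):
        self.data = data
        self.birth_txg = birth_txg
        self.children = list(children)
        # Parent-side record of each child's expected checksum,
        # mirroring how a bp stores the checksum of the block it names.
        self.child_sums = [hashlib.sha256(c.data).hexdigest()
                           for c in self.children]

def scrub(node, last_done_txg, errors):
    """Depth-first walk; returns the number of child blocks read."""
    reads = 0
    for child, expected in zip(node.children, node.child_sums):
        if child.birth_txg <= last_done_txg:
            continue                  # subtree verified in a prior pass
        reads += 1                    # read the child block from disk
        if hashlib.sha256(child.data).hexdigest() != expected:
            errors.append(child)      # mismatch: repair would happen here
        reads += scrub(child, last_done_txg, errors)
    return reads

leaf_old = Node(b"old data", birth_txg=90)
leaf_new = Node(b"new data", birth_txg=210)
root = Node(b"root", birth_txg=210, children=[leaf_old, leaf_new])

errors = []
print(scrub(root, last_done_txg=0, errors=errors))    # full pass: 2 reads
print(scrub(root, last_done_txg=200, errors=errors))  # resume: 1 read
```

The second call shows why the resume bookkeeping works: the old leaf's birth TXG proves it was covered by the previous pass, so the walk never touches it. It also shows why a purely LBA-sequential read cannot verify anything - the expected checksum is only known once the parent has been read.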
On Fri, May 18, 2012 at 04:18:12PM +1000, Daniel Carosone wrote:
> When doing a scrub, you start at the root bp and walk the tree,
> doing reads for everything, verifying checksums, and letting repair
> happen for any errors. That traversal is either a breadth-first or
> depth-first traversal of the tree (I'm not sure which) done in TXG
> order.
>
> [..]
>
> Note that there can be a lot of fanout in the tree;

Given the latter point, I'm going to guess depth-first. Yes, I should look at the code instead of posting speculation.

-- Dan.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov

I'm reading the ZFS on-disk spec, and I get the idea that there's an uberblock pointing to a self-balancing tree (some say b-tree, some say avl-tree, some say nv-tree), where data is only contained in the nodes. But I haven't found one particularly important detail yet: on which values does the balancing tree balance?

Is it balancing on the logical block address? This would make sense, as an application requests to read/write some logical block, making it easy and fast to find the corresponding physical blocks... If that is the case, wouldn't scrub/resilver need to work according to logical block order? (Which would also be random-ish, but decidedly NOT the same as TXG temporal order.)
First of all, thank you Daniel for taking the time to post a lengthy reply! I do not get that kind of high-quality feedback very often :) I hope the community and googlers would benefit from that conversation sometime. I did straighten out some thoughts and (mis-)understandings, at least, more on that below :) 2012-05-18 15:30, Daniel Carosone wrote:> On Fri, May 18, 2012 at 03:05:09AM +0400, Jim Klimov wrote:>> While waiting for that resilver to complete last week, >> I caught myself wondering how the resilvers (are supposed >> to) work in ZFS? > The devil finds work for idle hands... :-) Or rather, brains ;) > Well, I''m not that - certainly not on the code. It would probably be > best (for both of us) to spend idle time looking at the code, before > spending too much on speculation. Nonetheless, let''s have at it! :)> ...Yes, I should look at the code instead of posting speculation.Good idea any day, but rather lengthy in time. I have looked at the code, at blogs, at mailing list archives, at the aged ZFS spec, for about a year on-and-off now, and as you could see - understanding remains imperfect ;) Besides, turning the specific C code, even with those good comments that are in place, into a narrative description like we did in this thread, is bulky, time-consuming and likely useless (not conveyed) to other people wanting to understand the same and perhaps hoping to contribute - even if only algorithmic ideas ;) Finally, breaking the head over existing code only, instead of sitting back and doing some educated thinking (speculation), *may* be useless in the sense that if the current algorithms (or their implementation) work unsatisfactorily for at least the use-cases I see them used in. Thus I as a n00b researcher might care a bit less about what exactly is wrong in the system that does not work (the way I want it to, at least), and I''d care a bit more about designing and planning = speculating how (I think) it should work to suit my needs and usage patterns. 
In this regard the existing implementation may be seen as a POC which demostrates what can be done, even if sub-optimally. It works somewhat, and since we see downsides - it might work better. At the very least I can try to understand how it works now and why some particular choices and tradeoffs were mare (perhaps we do use the lesser of evils indeed) - explained in higher-level concepts and natural-language words that correspondents like you or other ZFS experts (and authors) on this list can quickly confirm or deny without wasting their precious time (no sarcasm) on lengthy posts like these, describing it all in detail. This is a useful experience and learning source, and different from what reading the code alone gives me. Anyway, this "speculation" would be done by this n00b reader of the code implicitly and with less (without any?) constructive discussion (thanks again for that!) if I were to look into code trying to fix something without planning ahead, and I know that often does not end very well. Ultimately, I guess I got more understanding by spending a few hours to formulate correct questions (and thankfully getting some answers) than from compiling all the disparate (and often outdated) docs and blogs, and code, into some form of a structure in my head. I also got to confirm that much of this compilation was correct and which parts I missed ;) Perhaps, now I (or someone else) won''t waste months on inventing or implementing something senseless from the start, or would find ways to make a pluggable writing policy for tests of different allocators for different purposes, or something of that kind... - as you propose here: > That said, there are always opportunities for tweaks and improvements > to the allocation policy, or even for multiple allocation policies > each more suited/tuned to specific workloads if known in advance. Now, on to my ZFS questions and your selected responses: >> This may possibly improve zfs send speeds as well. 
> > Less likely, that''s pretty much always going to have to go in txg > order. Would that be really TXG order - i.e. send blocks from TXG(N), then send blocks from TXG(N+1), and so on; OR a BPtree walk of the selected branch (starting from the root of snapshot dataset), perhaps limiting the range of chosen TXG numbers by the snapshot''s creation and completion "TXG timestamps"? Essentially, I don''t want to quote all those pieces of text, but I still doubt that tree walks are done in TXG order - at least the way I understand it (which may be different from your or others'' understanding): I interpreted "TXG order" as I said above - a monotonous incremental walk from older TXG numbers to newer ones. In order to do that you must have the whole tree in RAM and sort it by TXGs (perhaps making an array of all defined TXGs and pointers to individual block pointers that have this TXG), which is lengthy, bulky on RAM and I don''t think I see it happening in real life. If the statement means that "when walking the tree, first walk the child branch with lower TXG" then the statement makes sense somewhat - but it is not strictly "TXG-ordered", I think. At the very least, the walk starts with the most recent TXG being the uberblock (or poolwide root block) ;) Such a walk would indeed reach out to the oldest TXGs in a particular branch first, but starting from (and backtracking to) newer ones. So in order to benefit from sequential reads during the tree walk, the written blocks with the block-pointer tree (at least one copy of them) should be stored on disk in essentially this same order that a tree walk reader expects to find them. Then a read request (with associated vdev prefetch) would find large portions of the BP tree needed "now or in a few steps" in one mechanical IO... > So, if reading blocks sequentially, you can''t verify them. 
> You don't know what their checksums are supposed to be, or even what
> they contain or where to look for the checksums, even if you were
> prepared to seek to find them. This is why scrub walks the bp tree.

...And perhaps to take more advantage of this, the system should not descend into a single child BP and its branch right away, but rather try to see in the rolling prefetch cache (after a read was satisfied by a mechanical IO) whether more of the soon-to-be-needed blkptrs are currently in RAM and should be relocated to the ARC/L2ARC before they roll out of the prefetch cache, even if actual requests for them would come later in the subtree walk, perhaps in a few seconds or minutes. If the subtree is so big that these ARCed entries would be pushed out by then, well, we did all we could to speed up the system for smaller branches and lost little time in the process. And cache misses would be logged so users can know to upgrade their ARCs.

> No. Scrub (and any other repair, such as for errors found in the
> course of normal reads) rewrite the reconstructed blocks in-place: to
> the original DVA as referenced by its parents in the BP tree, even if
> the device underneath that DVA is actually a new disk.
> There is no COW. This is not a rewrite, and there is no original data
> to preserve...

Okay, thanks, I guess this simplifies things - although it somewhat defies the BP-tree defrag approach I proposed.

> BTW, if a new BP tree was required to repair blocks, we'd have
> bp-rewrite already (or we wouldn't have repair yet).

I'm not so sure. I've seen discussed (and proposed) many small tasks that could be done by a BP rewrite in general, but can be done "elsehow". Taking as an example my (mis)understanding of scrub repairs, the recovered block data could just be written into the pool like any other new data block, causing the rewrite of the BP tree branch leading to it. If that is not done (or required) here - well, that's for the better, I guess.
> ...This is bp rewrite, or rather, why bp-rewrite is hard.

The generic BP rewrite should also handle things like reduction of VDEV sizes, removal of TLVDEVs, changes to TLVDEV layouts (i.e. migration of raidz levels) and so on. That is likely hard (especially to do online) indeed. Individual operations like defragmentation, recompression or dedup of existing data - all of which can be done today by zfs-sending data away from the pool, cleaning it up, and zfs-receiving the data back, without all the low-level layout changes that BP rewrite can do - well, they can be done today. Why not in-place? Unlike manual send-away-and-receive cycles incurring downtime, the equivalent in-place manipulations can be done transparently to ZPL/ZVOL users by just invalidating parts of the ARC (by DVA of the reallocated blocks), I think, and do not seem as inherently difficult as complete BP rewrites. Again, this interim solution may be just a POC for later works on BP rewrite to include and improve :)

> "Just" is such a four-letter word.
>
> If you move a bp, you change its DVA. Which means that the parent bp
> pointing to it needs to be updated and rewritten, and its parents as
> well. This is new, COW data, with a new TXG attached -- but referring
> to data that is old and has not been changed.
> This gets back to the misunderstanding (way) above. Repair is not
> COW; repair is repairing the disk block to the original, correct
> contents.

Changes of DVAs causing reallocation of the whole branch of BPs during the defrag - yes, as I also wrote. However I am not sure that it would induce such changes to TXG numbers as must be fatal to snapshots and scrubs: as I've seen in the code (unlike the ZFS on-disk format docs), the current blkptr_t includes two fields for a TXG number - the birth TXG and (IIRC) the write TXG.
I guess one refers to the timestamp of when the data block was initially allocated in the queue, and the other (if non-zero) refers to the timestamp of when the block was optionally reallocated and written into the pool - perhaps upon recovery from ZIL, or (as I thought above) upon generic repair, or in my proposed idea of defrag. So perhaps the system is already prepared to correctly process such reallocations, or can be coaxed into it by "clever" use and/or ignoring of one of these fields...

> You just broke snapshots and scrub, at least.

As for snapshots: you can send a series of incremental snapshots from one system to another, and of course the TXG numbers on a particular pool for blocks of the snapshot dataset will differ. But this does not matter, as long as they are committed on disk in a particular order, with BP-tree branches properly pointing to timestamp-ordered snapshots of the parent dataset. Your concern seems valid indeed, but I think it can be countered by scheduling a BP-tree defrag to involve relocating and updating block pointers for all snapshots of a dataset (and maybe its clones), or at least ensuring that the parent blocks of newer snapshots have higher TXG numbers - if that is required. This may place non-trivial demands on cache or buffer memory size and usage in order to prepare the big transaction in case of large datasets, so perhaps if the system detects it can't properly defrag the BP-tree branch in one operation, it should abort without crashing the OS into scanrate-hell ;)

> It's not going to help a scrub, since that reads all of the ditto
> block copies, so bunching just one copy isn't useful.

I can agree - but only partially.
If the goal of storing the block pointers together and minimizing mechanical reads to get many of them at once is reachable, then it becomes possible to "preread" the "colocated" version of the BP tree or its large portions quickly (if there are no checksum or device errors during such reads - otherwise we fall back to the scattered ditto copies of those corrupted BP tree blocks). Then we can schedule more optimal reads for the scattered data, including the ditto blocks of the BP tree that we've already read in (the other copies of these blocks). It would be the same walk covering the same data objects on disk, but possibly in a different (and hopefully faster) manner than today.

> --
> Dan.

Thanks a lot for the discussion, I really appreciate it :)

//Jim Klimov
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov

I'm reading the ZFS on-disk spec, and I get the idea that there's an uberblock pointing to a self-balancing tree (some say b-tree, some say avl-tree, some say nv-tree), where data is only contained in the nodes. But I haven't found one particularly important detail yet:

On which values does the balancing tree balance? Is it balancing on the logical block address? This would make sense, as an application requests to read/write some logical block, making it easy and fast to find the corresponding physical blocks... If that is the case, wouldn't scrub/resilver need to work according to logical block order? (Which would also be random-ish, but decidedly NOT the same as TXG temporal order.)
2012-05-18 19:08, Edward Ned Harvey wrote:
>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> I'm reading the ZFS on-disk spec, and I get the idea that there's an
> uberblock pointing to a self-balancing tree (some say b-tree, some say
> avl-tree, some say nv-tree), where data is only contained in the nodes.
> But I haven't found one particular important detail yet:
>
> On which values does the balancing tree balance? Is it balancing on the
> logical block address? This would make sense, as an application requests
> to read/write some logical block, making it easy and fast to find the
> corresponding physical blocks...

My memory fails me here for a precise answer... I think that the on-disk data within a raidzN top-level VDEV (mirrors are trivial) is laid out as follows, for an arbitrary 6-disk raidz2 TLVDEV:

  D1   D2   D3   D4   D5   D6
  Ar1  Ar2  Ad1  Ad2  Ad3  Ad4
  Br1  Br2  Bd1  Cr1  Cr2  Cd1
  Cd2  Cd3  Cd4  Cr3  Cr4  Cd5
  Cd6  Dr1  Dr2  Dd1  ...

In this example, several blocks are laid out in sectors of different disks, including the redundancy blocks. Sequential accesses on one disk progress in a column from top to bottom; accesses in a row are parallelized between many disks. The "A" block's userdata is 4 sectors long, with 2 redundancy sectors. The "B" block has just one userdata sector, and the "C" block has 6 userdata sectors, with redundancy started for each 4 sectors.

AFAIK each ZFS block fully resides within one TLVDEV (and ditto copies have their own separate life in another TLVDEV if available), and striping over several TLVDEVs occurs at a whole-block level. This, in particular, allows disbalanced pools with TLVDEVs of different size and layout.

IF this picture is correct (confirmation or the reverse is kindly requested), then:

1) DVA to LBA translation should be somewhat trivial, since the DVA is defined as "ID(tlvdev):offset:length" in 512-byte units (regardless of the ashift value on the pool).
I did not test this in practice or infer it from the code, though. I don't know if there are any gaps to take into account (i.e. maybe between "metaslabs", of which there are supposed to be about 200 per vdev (or tlvdev, or pool?) in order to limit seeking between data written at roughly the same time). Even if there are gaps (i.e. to round allocations to on-disk tracks or offsets at multiples of a given number), I'd not complicate things and would just leave the gaps as addressable but unreferenced free space. A poster on the list recently referenced "slabs" - I don't think I had seen this term before, but I guess it stands for the total allocation needed for a userdata block?

2) Addressing of blocks (or, in reverse, saying that these sectors belong to a particular block or are available) is impossible without knowing the (generally whole) blockpointer tree, and depending on (re-)written object sizes, the same sector can at different times in its life belong to blocks (slabs?) of different lengths starting at different DVA offsets... Indeed, we also can not assume that sectors read in from the disks contain a valid part of the blockpointer tree (despite even matching some magic number), not until we find a path through the known tree that leads to this block (I discussed this in my other post regarding vdev prefetch and defrag). However, since reads are free as long as the HDD head is in the right location, and if blkptr_t's leading one to another are colocated on the disk, clever use of the prefetch and timely inspection of the prefetch cache can hopefully boost the BP-tree walking speed.

MAYBE I am wrong in this and there is also an allocation map in the large metaslabs or something? (I know there is some cleverness about finding available locations to write into, but I'm not ready to speak about it off the top of my head.) I am not sure if this gives a clue to whether it's "balancing on the logical block address?"
though :) AFAIK the balancing tries to keep the maximum tree depth shortest, yet there is one root block and no rewriting of existing unchanged stale blocks (tree nodes). I am puzzled too :)

3) The layout is fixed at tlvdev creation time by its total number of disks, since that directly affects the calculation of "on which disk does an offset'ed sector belong" - it would be the offset modulo the number of disks for raidzN regardless of N (because of not-full stripes), and just 0 for single drives and mirrors. This is why resizing a raidz set is indeed hard, while conversion of single disks to mirrors and back is easy. To a lesser extent the layout is limited by vdev size (which can be increased easily, but can not be decreased without reallocation and BP rewrite [*1]), and somewhat by the number of redundancy disks, which influences individual blocks' on-disk representation and required length [*2].

[*1]: This might be doable relatively easily by limiting the top writeable address, and executing a routine similar to zfs-send and zfs-recv to relocate all blocks with a larger DVA offset on this TLVDEV to any accessible location on the pool. When no more referenced blocks remain above the watermark, the TLVDEV can be shrunk. This may involve some magic with the TXG "birth" and "alloc" fields in blkptrs as well.

[*2]: As we know, in Oracle ZFS there is hybrid allocation, which in particular allows mirrored writes for metadata and raidz writes for userdata to coexist on a pool. I can only guess there is some new bit-flag in the blkptr_t for that? Anyhow, the number of redundancy disks and the layout algorithm for a particular block can be variable, so it seems...

Thanks,
//Jim Klimov
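[Editor's note] The "offset modulo the number of disks" rule described above can be sketched as a toy calculation. This is an illustration of the mapping as stated in the post, not ZFS code; the function name is hypothetical, and metaslab gaps, ashift and labels are all ignored.

```shell
#!/bin/sh
# Toy model of the layout rule above: a flat sector offset within a
# raidzN TLVDEV of NDISKS drives maps to (offset % NDISKS) for the
# drive and (offset / NDISKS) for the row on that drive.
NDISKS=6

map_sector() {
    # $1 = flat allocation offset in 512-byte sectors within the TLVDEV
    disk=$(( $1 % NDISKS ))    # which physical drive (0-based)
    row=$(( $1 / NDISKS ))     # which "row" down the column of that drive
    echo "sector $1 -> disk D$(( disk + 1 )), row $row"
}

map_sector 0    # Ar1 in the table above: first drive, first row
map_sector 6    # wraps back around to D1, one row down
map_sector 8    # third drive, second row
```

Per the table in the post, sector 8 would land on D3 in the second row (Cd3 in the example), which is what the modulo rule produces.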
I hope there is some good outcome of this thread after all, below... I wonder if anyone else thinks the following proposal is reasonable? ;)

2012-05-18 10:18, Daniel Carosone wrote:
> Let's go over those, and clarify terminology, before going through the
> rest of your post:
> ...
> * Replace: A device has gone, and needs to be completely
> reconstructed.

As I detail below, I see Replace happening when a device is going to be gone - but is still available and is being proactively replaced.

> Scrub is very similar to normal reads, apart from checking all copies
> rather than serving the data from whichever copy successfully returns
> first. Errors are not expected, are counted and repaired as/if found.
>
> Resilver and Replace are very similar, and the terms are often used
> interchangably. Replace is essentially resilver with a starting TXG of
> 0 (plus some labelling). In both cases, an error is expected or
> assumed from the device in question, and repair initiated
> unconditionally (and without incrementing error counters).
> You're suggesting an assymetry between Resilver and Replace to exploit
> the possibile speedup of sequential access; ok, seems attractive at
> first blush, let's explore the idea.

Well, I've gone to a swimming pool today to swim the half-mile and clear my head (metaphorically at least), and from the depths I emerged with another idea. From what I do see with the pool I'm upgrading (in another thread), there is also a "Replace" mode for hotspare devices, namely:

* I attached the hotspare to the pool:
  zpool add poolname spare c1t2d0
* I asked the pool to migrate a flaky disk's data to the new disk:
  zpool replace poolname c5t6d0 c1t2d0
* I asked the pool to forget the old disk so it can be removed:
  zpool detach poolname c5t6d0
  (cfgadm, removal, plug in the new disk, cfgadm, etc.)

From iostat I see that all of the existing TLVDEV's drives, including the one being replaced, are actively thrashed by reads for many hours, with some writes pouring onto the new disk.

SO THE IDEA IS as follows: the disk being explicitly replaced, as in upgrades of the pool to larger drives, should first be copied onto the new media "DD-style", which would be sequential IO for both devices, bandwidth-bound and rather fast. Then there should be a selective scrub, reading and checking allocated blocks from this TLVDEV only - like resilver does today - and repairing possible discrepancies (since the pool was likely live during the "DD stage", and errors were possible on the source drive as well as any other), and after this selective scrub the process is complete.

BENEFITS:

* The pool quickly gets a more-or-less good copy of the original disk, if it has not died completely and is able to serve reads for DD-style copying. This decreases the window of exposure of the TLVDEV to complete failure due to decreased redundancy, and can already help to salvage much of the data in case of a partly bad source disk.
That is, after the DD-style copy the new disk may be able to serve much of the valid data, and discrepancies might be easy to repair using the normal checksum-mismatch modes - if the old disk kicks the bucket and/or is removed before the selective scrub completes to gracefully finish the replacement procedure. The standard scrubbing approach after the DD-copy ensures that by the end of the procedure the new disk's data is fully valid. This also allows us to not worry about the source disk being updated in locations ahead of or behind the point where we're currently reading - some corrections by the selective scrub are expected anyway. However, arguably, incoming writes may be placed on both the source disk and its syncing-up spare replacement (into the correct sector locations right from the start).

* Instead of scheduling many random writes, which may be slower due to sync requirements, caching priorities, etc., we lean towards many random reads - which would still be used if we were using the original replace/resilver mode. Arguably, the reads can be optimized better by the ZFS pipeline and HDD NCQ/TCQ, and in a safer manner than (random) write optimizations.

* This method should be beneficial to raidz as well as mirrors, although the latter may have more options to cheaply recover bad sectors detected (as HDD IO errors) on the source media of the one disk being replaced, on the fly - during the DD phase.

CAVEATS:

* This mode is of benefit for users whose pools are rather fragmented and full, so that a sequential copy is noticeably faster than BP-tree-walk based resilvering. It is about 30x quicker on the heavily utilized servers and home-NASes that I see. For example, on a Thumper in my other thread, resilvering a 250Gb disk (partition) takes 15-17 hours, while writing files and zfs-sends into a single-disk ZFS pool located on the same 3Tb drive fills it up in 24 hours. A full scrub of the original pool (45*250Gb) takes 24-27 hours. Time matters.
The ZFS "marketing" states that it is quicker to repair because it only tests and copies the allocated data, not the whole disk as other RAID systems do - well, this is only good as long as the pools are kept relatively empty. While I can agree with the benefits of limiting disk seeks via partitioning, i.e. by buying a 100Tb array and using only 10Tb by allocating smaller disk slices, I don't see a good reason to allocate the 100Tb array and consistently keep it used at 10Tb, sorry. Perhaps this mode with a DD-style preamble should be triggered by a separate command-line request (at the admin's discretion), or if it ever becomes a default option - it should be used instead of the original resilver-only method above some watermark value of disk utilization and/or known fragmentation.

* If the original disk (being replaced) is a piece of faulty hardware, it can cause problems during the DD stage, such as:

** Lags - HDD retries on bad sectors can take a considerable amount of time based on firmware settings/capabilities.
** Loss of device visibility from the controller, reset storms, etc.; physical failure of the original disk during the copy - these would lead to an inability to continue reading the disk.
** Erroneous reads - returned garbage will be compensated for by the scrub following the DD phase.

If the DD process detects that the average read speed has dropped to some unacceptable level or has stalled completely, it can try to seek from another original-disk location and/or fall back to original resilvering from other vdevs and abandon the DD phase. This does not mean that retries should be avoided, or that the first encountered error (even a connection error) should be cause for aborting or restarting the DD phase.
It is not yet critical if some sectors were skipped during the DD phase - the following selective scrub will (should) recover them, possibly by retrying the original disk as well, and maybe the disk will have managed to recover and relocate the bad sectors in the background by this time. Even if the DD phase was aborted after a non-trivial amount of copying, the scrub/resilver should, IMHO, also read from the partially filled new disk and only rewrite those sectors that require rewriting (especially important for media that is sensitive to write-wearing).

* The overall (wallclock) length of this replacement is likely to be longer than that of the original method - since about the same amount of time will be needed for the scrub as for the resilver, and some time will be added for DDing, maybe plagued with retries etc. when hitting faulty sectors. However, a milestone of relative data safety will be reached a lot faster (if the source disk is substantially readable).

* Errors (on the target disk) are expected during the selective scrub stage, and should be fixed quietly, not causing CKSUM error counter bumps nor other panicky clutter.

This is so far a relatively raw idea and I've probably missed something. Do you think it is worth pursuing and asking some zfs developers to make a POC? ;)

Thanks,
//Jim Klimov
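[Editor's note] The manual workaround later discussed in this thread ("dd a disk and then scrub rather than replace") can be sketched as a command sequence. This is a hedged illustration only - device names, the map-file path and the use of GNU ddrescue are assumptions, and the commands are echoed rather than executed, since the real procedure is destructive and site-specific.

```shell
#!/bin/sh
# Sketch of the DD-then-scrub replacement flow discussed in this thread.
# All names below are hypothetical examples; nothing here is executed
# against real devices - each step is printed for review instead.
OLD=/dev/dsk/c5t6d0s0      # flaky source disk (example name)
NEW=/dev/dsk/c1t2d0s0      # replacement disk/partition (example name)
POOL=pond

run() { echo "WOULD RUN: $*"; }

# 1. Quiesce the pool so the source does not change under the copy.
run zpool export "$POOL"

# 2. Sequential, bandwidth-bound copy; ddrescue retries/skips bad
#    areas instead of aborting on the first read error like plain dd.
run ddrescue --force "$OLD" "$NEW" /var/tmp/replace.map

# 3. Re-import: the vdev labels came along with the copy, so the pool
#    sees the new disk as the "same" vdev (old disk must stay offline).
run zpool import "$POOL"

# 4. Let a scrub find and repair whatever the copy got wrong or missed.
run zpool scrub "$POOL"
```

Note that with today's code a `zpool replace` after such a copy would still trigger a normal resilver; the proposal above is precisely that a selective scrub should suffice instead.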
On Mon, 21 May 2012, Jim Klimov wrote:
> This is so far a relatively raw idea and I've probably missed
> something. Do you think it is worth pursuing and asking some
> zfs developers to make a POC? ;)

I did read all of your text. :-)

This is an interesting idea and could be of some use, but it would be wise to test it first a few times before suggesting it as a general course. Zfs is still totally not foolproof. I still see postings from time to time regarding pools which panic/crash the system (probably due to memory corruption).

Zfs will try to keep the data compacted at the beginning of the partition, so if you have a way to know how far out it extends, then the initial 'dd' could be much faster when the pool is not close to full.

Zfs scrub does need to do many more reads than a resilver since it reads all data and metadata copies. Triggering a resilver operation for the specific disk would likely hasten progress.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Mon, May 21, 2012 at 09:18:03PM -0500, Bob Friesenhahn wrote:
> On Mon, 21 May 2012, Jim Klimov wrote:
>> This is so far a relatively raw idea and I've probably missed
>> something. Do you think it is worth pursuing and asking some
>> zfs developers to make a POC? ;)
>
> I did read all of your text. :-)
>
> This is an interesting idea and could be of some use but it would be
> wise to test it first a few times before suggesting it as a general
> course.

I've done basically this kind of thing before: dd a disk and then scrub rather than replace, treating errors as expected.

> Zfs will try to keep the data compacted at the beginning of the
> partition so if you have a way to know how far out it extends, then the
> initial 'dd' could be much faster when the pool is not close to full.

zdb will show you usage per metaslab; you could use that and effectively select offset ranges to skip any empty ones. After a while, once the pool has seen usage fill past low percentages, I'd say most metaslabs would have some usage, so you might not save much time. Going to finer detail within a metaslab is not worthwhile - much more involved, and it involves the seeks you're trying to avoid.

--
Dan.
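[Editor's note] Dan's suggestion of using per-metaslab usage to skip empty ranges could be sketched as below. The sample text is a stand-in for real `zdb -m <pool>` output, whose exact columns vary between builds - the field positions and the reading that "spacemap 0" means a never-written metaslab are assumptions here, not verified behavior.

```shell
#!/bin/sh
# Sketch: decide which metaslab offset ranges are worth copying during
# the sequential DD phase, based on zdb-style per-metaslab output.
# The here-doc below imitates "zdb -m" output (format is an assumption).
sample_zdb_m() {
cat <<'EOF'
metaslab 0 offset 0 spacemap 39 free 2.1G
metaslab 1 offset 20000000 spacemap 0 free 8.0G
metaslab 2 offset 40000000 spacemap 41 free 5.5G
EOF
}

# Assumption: a metaslab whose spacemap object is 0 was never written
# to, so its whole offset range can be skipped by the copy.
classify() {
    while read -r _ ms _ off _ sm _ _; do
        if [ "$sm" = 0 ]; then
            echo "metaslab $ms at offset 0x$off: skip"
        else
            echo "metaslab $ms at offset 0x$off: copy"
        fi
    done
}

sample_zdb_m | classify
```

In practice one would pipe real `zdb -m pool` output through such a filter and turn the "copy" ranges into seek/count arguments for dd.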
Thank you for reading and replying :-)

2012-05-22 6:18, Bob Friesenhahn wrote:
> On Mon, 21 May 2012, Jim Klimov wrote:
>> This is so far a relatively raw idea and I've probably missed
>> something. Do you think it is worth pursuing and asking some
>> zfs developers to make a POC? ;)
>
> I did read all of your text. :-)
>
> This is an interesting idea and could be of some use but it would be
> wise to test it first a few times before suggesting it as a general
> course. Zfs is still totally not foolproof. I still see postings from
> time to time regarding pools which panic/crash the system (probably due
> to memory corruption).
>
> Zfs will try to keep the data compacted at the beginning of the
> partition so if you have a way to know how far out it extends, then the
> initial 'dd' could be much faster when the pool is not close to full.

For a not-full, not-fragmented pool it is likely that a run of the original resilvering would be faster and known to be correct ;)

> Zfs scrub does need to do many more reads than a resilver since it reads
> all data and metadata copies. Triggering a resilver operation for the
> specific disk would likely hasten progress.

Well, the point in this case was a "selective scrub", as I called it in the text, which would indeed read all blocks and dittos if their DVA lies on the VDEV where we replaced the disk and expect discrepancies. In this limitation it is like a resilver, but it is indeed a scrub in effect. Besides, from what I've seen, a resilver just rewrites the given range of TXGs (like [0-current]) on the target disk based on reads from other disks in the TLVDEV, and on the assumption that up to some TXG point (zero or above) that disk had valid data matching the other disks in the array.
Here, after the DD stage, we have no guarantee that the source disk was fully valid (just a hope, augmented by the lack of read errors and timeouts during the DD phase), and to some degree we don't know if the writes that came in during the replacement process were properly written onto the target disk as well as onto its counterpart being replaced, which is still a (more) valid part of the pool. If we do that write-cloning, and if the source disk was okay, the scrub shouldn't find any errors, I think.

//Jim Klimov
2012-05-22 7:30, Daniel Carosone wrote:
> On Mon, May 21, 2012 at 09:18:03PM -0500, Bob Friesenhahn wrote:
>> On Mon, 21 May 2012, Jim Klimov wrote:
>>> This is so far a relatively raw idea and I've probably missed
>>> something. Do you think it is worth pursuing and asking some
>>> zfs developers to make a POC? ;)
>>
>> I did read all of your text. :-)
>>
>> This is an interesting idea and could be of some use but it would be
>> wise to test it first a few times before suggesting it as a general
>> course.
>
> I've done basically this kind of thing before: dd a disk and then
> scrub rather than replace, treating errors as expected.

I got into a similar situation last night on that Thumper - it is now migrating a flaky source disk in the array from an original old 250Gb disk onto a same-sized partition on the new 3Tb drive (as I outlined as IDEA7 in another thread). The source disk itself had about 300 CKSUM errors during the process, and for reasons beyond my current understanding, the resilver never completed.

In zpool status it said that the process was done several hours before the time I looked at it, but the TLVDEV still had a "spare" component device comprised of the old disk and new partition, and the (same) hotspare device in the pool was "INUSE".

After a while we just detached the old disk from the pool and ran scrub, which right away found some 178 CKSUM errors on the new partition and degraded the TLVDEV and pool. We cleared the errors, and ran the script below to log the detected errors and clear them, so that the disk gets fixed and is not kicked out of the pool due to mismatches. Overall 1277 errors were logged and apparently fixed, and the pool is now on its second full scrub run - no bugs so far (knocking on wood; certainly none this early in the scrub, as we had last time).

So in effect, this methodology works for two of us :)

Since you did similar stuff already, I have a few questions:

1) How/what did you DD? The whole slice with the zfs vdev?
Did the system complain (much) about the renaming of the device compared to the paths embedded in the pool/vdev headers? Did you do anything manually to remedy that (forcing import, DDing some handcrafted uberblocks, anything)?

2) How did you "treat errors as expected" during scrub? As I've discovered, there were hoops to jump through. Is there a switch to disable "degrading" of pools and TLVDEVs based only on the CKSUM counts?

My raw hoop-jumping script:
-----

#!/bin/bash

# /root/scrubwatch.sh
# Watches 'pond' scrub and resets errors to avoid auto-degrading
# the device, but logs the detected error counts however.
# See also "fmstat|grep zfs-diag" for precise counts.
# See also https://blogs.oracle.com/bobn/entry/zfs_and_fma_two_great
# for details on FMA and fmstat with zfs hotspares

while true; do
    zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
    date
    echo ""

    C1="`zpool status pond | grep c1t2d`"
    C2="`echo "$C1" | grep 'c1t2d0s1 ONLINE 0 0 0'`"
    if [ x"$C2" = x ]; then
        echo "`date`: $C1" >> /var/tmp/zpool-clear_pond.log
        zpool clear pond
        zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
        date
    fi
    echo ""

    sleep 60
done

HTH,
//Jim Klimov
On Tue, May 22, 2012 at 12:42:02PM +0400, Jim Klimov wrote:
> 2012-05-22 7:30, Daniel Carosone wrote:
>> I've done basically this kind of thing before: dd a disk and then
>> scrub rather than replace, treating errors as expected.
>
> I got into similar situation last night on that Thumper -
> it is now migrating a flaky source disk in the array from
> an original old 250Gb disk into a same-sized partition on
> the new 3Tb drive (as I outlined as IDEA7 in another thread).
> The source disk itself had about 300 CKSUM errors during
> the process, and for reasons beyond my current understanding,
> the resilver never completed.
>
> In zpool status it said that the process was done several
> hours before the time I looked at it, but the TLVDEV still
> had a "spare" component device comprised of the old disk
> and new partition, and the (same) hotspare device in the
> pool was "INUSE".

I think this is at least in part an issue with older code. There have been various fixes for hangs/restarts/incomplete replaces and sparings over the time since.

> After a while we just detached the old disk from the pool
> and ran scrub, which first found some 178 CKSUM errors on
> the new partition right away, and degraded the TLVDEV and
> pool.
>
> We cleared the errors, and ran the script below to log
> the detected errors and clear them, so the disk is fixed
> and not kicked out of the pool due to mismatches.
>
> So in effect, this methodology works for two of us :)
>
> Since you did similar stuff already, I have a few questions:
> 1) How/what did you DD? The whole slice with the zfs vdev?
> Did the system complain (much) about the renaming of the
> device compared to paths embedded in pool/vdev headers?
> Did you do anything manually to remedy that (forcing
> import, DDing some handcrafted uberblocks, anything?)

I've done it a couple of times at least:

* a failed disk in a raidz1, where I didn't trust that the other disks didn't also have errors.
Basically did a ddrescue from one disk to the new. I think these days a 'replace' where the original disk is still online will use that content, like a hotspare replace, rather than assume it has gone away and must be recreated, but that wasn't the case at the time.

* Where I had an iscsi mirror of a laptop hard disk, but it was out of date and had been detached when the laptop iscsi initiator refused to start. Later, the disk developed a few bad sectors. I made a new submirror, let it sync (with the error still), then blatted bits of the old image over the new in the areas where the bad sectors were being reported. Scrub again, and they were fixed (as well as some blocks on the new submirror repaired, coming back up to date again).

> 2) How did you "treat errors as expected" during scrub?

Pretty much as you did: declined to panic, and restarted scrubs.

--
Dan.
comments far below...

On May 22, 2012, at 1:42 AM, Jim Klimov wrote:
> 2012-05-22 7:30, Daniel Carosone wrote:
>> On Mon, May 21, 2012 at 09:18:03PM -0500, Bob Friesenhahn wrote:
>>> On Mon, 21 May 2012, Jim Klimov wrote:
>>>> This is so far a relatively raw idea and I've probably missed
>>>> something. Do you think it is worth pursuing and asking some
>>>> zfs developers to make a POC? ;)
>>>
>>> I did read all of your text. :-)
>>>
>>> This is an interesting idea and could be of some use but it would be
>>> wise to test it first a few times before suggesting it as a general
>>> course.
>>
>> I've done basically this kind of thing before: dd a disk and then
>> scrub rather than replace, treating errors as expected.
>
> I got into similar situation last night on that Thumper -
> it is now migrating a flaky source disk in the array from
> an original old 250Gb disk into a same-sized partition on
> the new 3Tb drive (as I outlined as IDEA7 in another thread).
> The source disk itself had about 300 CKSUM errors during
> the process, and for reasons beyond my current understanding,
> the resilver never completed.
>
> In zpool status it said that the process was done several
> hours before the time I looked at it, but the TLVDEV still
> had a "spare" component device comprised of the old disk
> and new partition, and the (same) hotspare device in the
> pool was "INUSE".
>
> After a while we just detached the old disk from the pool
> and ran scrub, which first found some 178 CKSUM errors on
> the new partition right away, and degraded the TLVDEV and
> pool.
>
> We cleared the errors, and ran the script below to log
> the detected errors and clear them, so the disk is fixed
> and not kicked out of the pool due to mismatches.
> Overall 1277 errors were logged and apparently fixed, and
> the pool is now on its second full scrub run - no bugs so
> far (knocking wood; certainly none this early in the scrub
> as we had last time).
> So in effect, this methodology works for two of us :)
>
> Since you did similar stuff already, I have a few questions:
> 1) How/what did you DD? The whole slice with the zfs vdev?

dd, or similar dumb block copiers, should work fine. However, they are inefficient and operationally difficult to manage, which is why they tend to fall in the prefer-to-use-something-else category.

> Did the system complain (much) about the renaming of the device compared to paths embedded in pool/vdev headers?

It shouldn't, unless you did something to confuse it, such as having both the original and the dd copy online at the same time. In that case, you will have two different copies of the same identified device that are independent. This is an operational mistake, hence my comment above.

> Did you do anything manually to remedy that (forcing import, DDing some handcrafted uberblocks, anything?)

Not needed.

> 2) How did you "treat errors as expected" during scrub?
> As I've discovered, there were hoops to jump through.
> Is there a switch to disable "degrading" of pools and TLVDEVs based on only the CKSUM counts?

DEGRADED is the status. You clear degraded states by fixing the problem and running zpool clear. DEGRADED, in and of itself, is not a problem.

> My raw hoop-jumping script:
> -----
>
> #!/bin/bash
>
> # /root/scrubwatch.sh
> # Watches 'pond' scrub and resets errors to avoid auto-degrading
> # the device, but logs the detected error counts however.
> # See also "fmstat|grep zfs-diag" for precise counts.
> # See also https://blogs.oracle.com/bobn/entry/zfs_and_fma_two_great
> # for details on FMA and fmstat with zfs hotspares
>
> while true; do
>   zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
>   date
>   echo ""
>
>   C1="`zpool status pond | grep c1t2d`"
>   C2="`echo "$C1" | grep 'c1t2d0s1 ONLINE 0 0 0'`"
>   if [ x"$C2" = x ]; then
>     echo "`date`: $C1" >> /var/tmp/zpool-clear_pond.log
>     zpool clear pond
>     zpool status pond | gegrep -A4 -B3 'resilv|error|c1t2d|c5t6d|%'
>     date
>   fi
>   echo ""
>
>   sleep 60
> done

I would never allow such scripts in my site. It is important to track the progress and state changes. This script resets those counters for no good reason.

I post this comment in the hope that future searches will not encourage people to try such things.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
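[Editorial note: the bulk-copy half of the dd-then-scrub approach discussed above can be sketched in miniature with plain files instead of raw disk slices. This is only a toy illustration of the dd invocation and its error-tolerant flags, under the assumption that on a real system SRC/DST would be same-sized raw slices and the copy would be followed by `zpool online` and `zpool scrub`; it is not the full procedure.]

```shell
#!/bin/sh
# Toy simulation of the bulk-copy ("DD") phase using plain files.
# On a real system SRC/DST would be raw slices such as /dev/rdsk/c1t2d0s1
# (hypothetical names), with the pool device offlined first and a scrub
# run afterwards to repair blocks that changed since the copy.
set -e
SRC=$(mktemp); DST=$(mktemp)

# Fill the "source disk" with data.
dd if=/dev/urandom of="$SRC" bs=1024 count=256 2>/dev/null

# Bulk copy; conv=noerror,sync pads unreadable blocks with zeros and
# keeps going instead of aborting, as one would want on a flaky disk.
dd if="$SRC" of="$DST" bs=64k conv=noerror,sync 2>/dev/null

# On a healthy source the copy is bit-identical.
cmp -s "$SRC" "$DST" && echo "copy matches"
rm -f "$SRC" "$DST"
```

On a real flaky disk the zero-padded unreadable blocks would then show up as CKSUM errors during the follow-up scrub, which is exactly the "errors treated as expected" phase discussed in this thread.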
2012-05-23 20:54, Richard Elling wrote:
> comments far below...

Thank you Richard for taking notice of this thread and the definitive answers I needed not quote below, for further questions ;)

>> 2) How did you "treat errors as expected" during scrub?
>> As I've discovered, there were hoops to jump through.
>> Is there a switch to disable "degrading" of pools and TLVDEVs based on only the CKSUM counts?
>
> DEGRADED is the status. You clear degraded states by fixing the problem and running zpool clear. DEGRADED, in and of itself, is not a problem.

Doesn't this status preclude the device with many CKSUM errors from participating in the pool (TLVDEV) and the remainder of the scrub in particular? At least the textual error message infers that if a hotspare were available for the pool, it would kick in and invalidate the device I am scrubbing to update into the pool after the DD-phase (well, it was not DD but a hung-up resilver in this case, but that is not substantial).

Such automatic replacement is definitely not what I needed in this particular case, so if it were to happen - it would be a problem indeed, in and of itself.

> dd, or similar dumb block copiers, should work fine.
> However, they are inefficient...

Define efficient? In terms of transferring the 900Gb payload of a 1Tb HDD used for ZFS for a year - DD would beat resilver anytime, in terms of getting most or (less likely) all of the valid bits with data onto the new device. It is the next phase (getting the rest of the bits into valid state) that needs some attention, manual or automated.

Again, DD is not a good usecase indeed for pools with little data on big disks, and while I see why these could be used (i.e. to never face fragmentation), I haven't seen them in practice around here.

> ... and operationally difficult to manage

Actually, that's why I asked whether it makes sense to automate such a scenario as another legal variant of disk replacement, complete with fast data transfer and verification and simultaneous work of the new and old devices until the data migration is marked complete. In particular, that would take care of accepting the scrub errors as an expected part of the disk replacement and not a fatal fault/degradation, and/or allow new writes to propagate onto the new disk while the replacement is going on, minimizing discrepancies right on the run.

In visible effect this would be similar to the current resilver during replacement of a live disk with a hotspare, but the process would follow a different scenario I suggested earlier in the thread.

>> My raw hoop-jumping script:
...
> I would never allow such scripts in my site. It is important to track the progress and state changes. This script resets those counters for no good reason.
>
> I post this comment in the hope that future searches will not encourage people to try such things.

Understood, point taken; I won't try to promote such a "solution", and I agree that it is certainly not a good general idea. It should be noted however (or I want to be corrected, please, if I am wrong), that:

1) Errors are expected on this run, since the DD'ed copy is expected to deviate from current pool state; if the "degradation" mark of the new disk would force it to be kicked out of the pool just because there are many CKSUM errors - which we know should be there due to the manual DD-phase - then the reason is good IMHO (in this one case);

2) The progress is tracked by logging the error counts into a text file. If the admin fired up the script (manually in his terminal or a vnc/screen session), he can also look into the log file or even tail it.

3) The individual CKSUM errors are summed up in fmstat output, and this script does not zero them out, so even system-side tracking is not disturbed here.
Anyhow, if there is a device with just a few CKSUM errors, then the next scrub clears its error counts anyway (if no new problems are found)...

Thanks,
//Jim Klimov
On May 23, 2012, at 1:32 PM, Jim Klimov wrote:

> 2012-05-23 20:54, Richard Elling wrote:
>> comments far below...
>
> Thank you Richard for taking notice of this thread and the definitive answers I needed not quote below, for further questions ;)
>
>>> 2) How did you "treat errors as expected" during scrub?
>>> As I've discovered, there were hoops to jump through.
>>> Is there a switch to disable "degrading" of pools and TLVDEVs based on only the CKSUM counts?
>>
>> DEGRADED is the status. You clear degraded states by fixing the problem and running zpool clear. DEGRADED, in and of itself, is not a problem.
>
> Doesn't this status preclude the device with many CKSUM errors from participating in the pool (TLVDEV) and the remainder of the scrub in particular?

no

> At least the textual error message infers that if a hotspare were available for the pool, it would kick in and invalidate the device I am scrubbing to update into the pool after the DD-phase (well, it was not DD but a hung-up resilver in this case, but that is not substantial).

The man page is clear on this topic, IMHO:

     DEGRADED   One or more top-level vdevs is in the degraded state
                because one or more component devices are offline.
                Sufficient replicas exist to continue functioning.

                One or more component devices is in the degraded or
                faulted state, but sufficient replicas exist to
                continue functioning. The underlying conditions are
                as follows:

                o  The number of checksum errors exceeds acceptable
                   levels and the device is degraded as an indication
                   that something may be wrong. ZFS continues to use
                   the device as necessary.

> Such automatic replacement is definitely not what I needed in this particular case, so if it were to happen - it would be a problem indeed, in and of itself.
>
>> dd, or similar dumb block copiers, should work fine.
>> However, they are inefficient...
>
> Define efficient? In terms of transferring the 900Gb payload of a 1Tb HDD used for ZFS for a year - DD would beat resilver anytime, in terms of getting most or (less likely) all of the valid bits with data onto the new device. It is the next phase (getting the rest of the bits into valid state) that needs some attention, manual or automated.

speed != efficiency

> Again, DD is not a good usecase indeed for pools with little data on big disks, and while I see why these could be used (i.e. to never face fragmentation), I haven't seen them in practice around here.
>
>> ... and operationally difficult to manage
>
> Actually, that's why I asked whether it makes sense to automate such a scenario as another legal variant of disk replacement, complete with fast data transfer and verification and simultaneous work of the new and old devices until the data migration is marked complete. In particular that would take care of accepting the scrub errors as an expected part of the disk replacement and not a fatal fault/degradation, and/or allowing new writes to propagate onto the new disk while the replacement is going on and minimize discrepancies right on the run.
>
> In visible effect this would be similar to current resilver during replacement of a live disk with a hotspare, but the process would follow a different scenario I suggested earlier in the thread.

IMHO, this is too operationally complex for most folks. KISS wins.

>>> My raw hoop-jumping script:
> ...
>> I would never allow such scripts in my site. It is important to track the progress and state changes. This script resets those counters for no good reason.
>>
>> I post this comment in the hope that future searches will not encourage people to try such things.
>
> Understood, point taken, I won't try to promote such a "solution", and I agree that certainly it is not a good general idea indeed.
> It should be noted however (or I want to be corrected, please, if I am wrong), that:
>
> 1) Errors are expected on this run since the DD'ed copy is expected to deviate from current pool state; if the "degradation" mark of the new disk would force it to be kicked out of the pool just because there are many CKSUM errors - which we know should be there due to the manual DD-phase - then the reason is good IMHO (in this one case);
>
> 2) The progress is tracked by logging the error counts into a text file. If the admin fired up the script (manually in his terminal or a vnc/screen session), he can also look into the log file or even tail it.
>
> 3) The individual CKSUM errors are summed up in fmstat output, and this script does not zero them out, so even system-side tracking is not disturbed here.
>
> Anyhow, if there is a device with just a few CKSUM errors, then the next scrub clears its error counts anyway (if no new problems are found)...

What is it about error counters that frightens you enough to want to clear them often?
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Thanks again,

2012-05-24 1:01, Richard Elling wrote:
>> At least the textual error message infers that if a hotspare were available for the pool, it would kick in and invalidate the device I am scrubbing to update into the pool after the DD-phase (well, it was not DD but a hung-up resilver in this case, but that is not substantial).
>
> The man page is clear on this topic, IMHO

Indeed, even in snv_117 the zpool man page says that. But the console/dmesg message was also quite clear, so go figure whom to trust (or fear) more ;)

fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed May 16 03:27:31 MSK 2012
PLATFORM: Sun Fire X4500, CSN: 0804AMT023, HOSTNAME: thumper
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: cc25a316-4018-4f13-c675-d1d84c6325c3
DESC: The number of checksum errors associated with a ZFS device exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-GH for more information.
AUTO-RESPONSE: The device has been marked as degraded. An attempt will be made to activate a hot spare if available.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.

>>> dd, or similar dumb block copiers, should work fine.
>>> However, they are inefficient...
>>
>> Define efficient? In terms of transferring the 900Gb payload of a 1Tb HDD used for ZFS for a year - DD would beat resilver anytime, in terms of getting most or (less likely) all of the valid bits with data onto the new device. It is the next phase (getting the rest of the bits into valid state) that needs some attention, manual or automated.
>
> speed != efficiency

Ummm... this is likely to start a flame war with other posters, and you did not say what efficiency is to you?
I, for now, choose to stand by a statement that reduction of the timeframe that the old disk needs to be in the system is a good thing, as well as that changing the IO pattern from random writes into (mostly) sequential writes and after that random reads may also be somewhat more efficient, especially under other loads (interfering less with them). Even though the whole replacement process may take more wallclock time, there are cases when I'd likely trust it to do a better job than original resilvering.

I think someone with equipment could stage an experiment and compare the two procedures (existing and proposed) on a nearly full and somewhat fragmented pool. Maybe you can disenchant me (not with vague phrases but either theory or practice) and I would then see that my trust is blind, misdirected and without basement. =)

> IMHO, this is too operationally complex for most folks. KISS wins.

That's why I proposed to tuck this scenario under the zfs hood (DD + selective scrub + ditto writes during the process, as an optional alternative to the current resilver), or to explain coherently why this should not be done - not for any situation. Implementing it as a standard supported command would be KISS ;)

Especially if it is known that with some quirks this procedure works, and may be beneficial in some cases, i.e. by reducing the timeframe that a pool with a flaky disk in place is exposed to potential loss of redundancy and large amounts of data, and in the worst case the loss is constrained to those sectors which couldn't be (correctly) read by DD from the source disk and couldn't be reconstructed by raidz/mirror redundancies due to whatever overlaying problems (i.e. a sector from the same block died on another disk too).

> What is it about error counters that frightens you enough to want to clear
> them often?

In this case, mostly, the fright of having the device kicked out of the pool automatically instead of getting it "synced" ("resilvered" is an improper term here, I guess) to the proper state.

In general - since this is a part of some migration procedure which is, again, expected to have errors, we don't really care for signalling them. Why doesn't the original resilver signal several million CKSUM errors per new empty disk when it does reconstruction of sectors onto it? I'd say this is functionally identical. (At least, it would be - if it were part of a supported procedure as I suggest.)

Thanks,
//Jim Klimov

PS: I pondered for a while if I should make up an argument that on a dying disk's mechanics, lots of random IO (resilver) instead of sequential IO (DD) would cause it to die faster, but that's just FUD not backed by any scientific data or statistics - which you likely have, perhaps opposing this argument indeed.
On May 23, 2012, at 2:56 PM, Jim Klimov wrote:

> Thanks again,
>
> 2012-05-24 1:01, Richard Elling wrote:
>>> At least the textual error message infers that if a hotspare were available for the pool, it would kick in and invalidate the device I am scrubbing to update into the pool after the DD-phase (well, it was not DD but a hung-up resilver in this case, but that is not substantial).
>>
>> The man page is clear on this topic, IMHO
>
> Indeed, even in snv_117 the zpool man page says that. But the console/dmesg message was also quite clear, so go figure whom to trust (or fear) more ;)

The FMA message is consistent with the man page.

> fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
> EVENT-TIME: Wed May 16 03:27:31 MSK 2012
> PLATFORM: Sun Fire X4500, CSN: 0804AMT023, HOSTNAME: thumper
> SOURCE: zfs-diagnosis, REV: 1.0
> EVENT-ID: cc25a316-4018-4f13-c675-d1d84c6325c3
> DESC: The number of checksum errors associated with a ZFS device exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-GH for more information.
> AUTO-RESPONSE: The device has been marked as degraded. An attempt will be made to activate a hot spare if available.
> IMPACT: Fault tolerance of the pool may be compromised.
> REC-ACTION: Run 'zpool status -x' and replace the bad device.
>
>>>> dd, or similar dumb block copiers, should work fine.
>>>> However, they are inefficient...
>>>
>>> Define efficient? In terms of transferring the 900Gb payload of a 1Tb HDD used for ZFS for a year - DD would beat resilver anytime, in terms of getting most or (less likely) all of the valid bits with data onto the new device. It is the next phase (getting the rest of the bits into valid state) that needs some attention, manual or automated.
>>
>> speed != efficiency
>
> Ummm... this is likely to start a flame war with other posters, and you did not say what efficiency is to you? How can we compare apples to meat, not even knowing whether the latter is a steak or a pork knee?

Efficiency allows use of denominators other than time. Speed is restricted to a denominator of time. There is no flame war here, look elsewhere.

> I, for now, choose to stand by a statement that reduction of the timeframe that the old disk needs to be in the system is a good thing, as well as that changing the IO pattern from random writes into (mostly) sequential writes and after that random reads may be also somewhat more efficient, especially under other loads (interfering less with them). Even though the whole replacement process may take more wallclock time, there are cases when I'd likely trust it to do a better job than original resilvering.
>
> I think, someone with equipment could stage an experiment and compare the two procedures (existing and proposed) on a nearly full and somewhat fragmented pool.

Operationally, your method loses every time.

> Maybe you can disenchant me (not with vague phrases but either theory or practice) and I would then see that my trust is blind, misdirected and without basement. =)
>
>> IMHO, this is too operationally complex for most folks. KISS wins.
>
> That's why I proposed to tuck this scenario under the zfs hood (DD + selective scrub + ditto writes during the process, as an optional alternative to current resilver), or explain coherently why this should not be done - not for any situation. Implementing it as a standard supported command would be KISS ;)
>
> Especially if it is known that with some quirks this procedure works, and may be beneficial to some cases, i.e. by reducing the timeframe that a pool with a flaky disk in place is exposed to potential loss of redundancy and large amounts of data, and in the worst case the loss is constrained to those sectors which couldn't be (correctly) read by DD from the source disk and couldn't be reconstructed by raidz/mirror redundancies due to whatever overlaying problems (i.e. a sector from same block died on another disk too).

You have not made a case for why this hybrid and failure-prone procedure is required. What problem are you trying to solve?

>> What is it about error counters that frightens you enough to want to clear
>> them often?
>
> In this case, mostly, the fright of having the device kicked out of the pool automatically instead of getting it "synced" ("resilvered" is an improper term here, I guess) to proper state.

Why not follow the well-designed existing procedure?

> In general - since this is a part of some migration procedure which is, again, expected to have errors, we don't really care for signalling them. Why doesn't the original resilver signal several million CKSUM errors per new empty disk when it does reconstruction of sectors onto it? I'd say this is functionally identical. (At least, would be - if it were part of a supported procedure as I suggest).
>
> Thanks,
> //Jim Klimov
>
> PS: I pondered for a while if I should make up an argument that on a dying disk mechanics, lots of random IO (resilver) instead of sequential IO (DD) would cause it to die faster, but that's just a FUD not backed by any scientific data or statistics - which you likely have, and perhaps opposing this argument indeed.

The failure data does not support your hypothesis.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
SCALE 10x, Los Angeles, Jan 20-22, 2012
Let me try to formulate my idea again... You called a similar process "pushing the rope" some time ago, I think.

I feel like I'm passing some exam and am trying to pick answers for a discipline like philosophy, and I have no idea about the examinator's preferences - is he an ex-Communism teacher or an eager new religion fanatic? The same answer can lead to an A or to an F on a state exam. Ah, that was some fun experience :)

Well, what we know is what remains after we forget everything that we were taught, while the exams are our last chance to learn something at all =)

2012-05-24 10:28, Richard Elling wrote:
> You have not made a case for why this hybrid and failure-prone
> procedure is required. What problem are you trying to solve?

Bigger-better-faster? ;)

The original proposal in this thread was about understanding how resilvers and scrubs work, why they are so dog slow on HDDs in comparison to sequential reads, and thinking aloud about what can be improved in this area.

One of the later posts was about improving the disk replacement (where the original is still responsive, but may be imperfect) for filled-up fragmented pools by including a stage of fast data transfer and a different IO pattern for verification and updating of the new disk image, in comparison with the current resilver's IO patterns.

This may or may not have some benefits in certain (corner?) cases which are of practical interest to some users on this list, and if this discussion leads to a POC made by a competent ZFS programmer, which can be tested on a variety of ZFS pools (without risking one's only pool on a homeNAS) - so much the better. Then we would see if this scenario is viable or utterly useless and bad in every tested case.

The practical numbers I have from the same box and disks are:

* Copy from a 250Gb raidz1 (9*(4+1)) pool to a single-disk 3Tb test pool took 24 hours to fill the new disk - including the ZFS overheads.
* Copying of one raw 250(232)Gb partition takes under 2 hours (if it can sustain about 70Mb/s reads from the source without distractions like other pool IO - then 1 hour).

* Proper resilvering (reading all of the BP-tree from the original pool, reading all blocks from the TLVDEV, writing reconstructed(?) sectors to the target disk) from one partition to another took 17 hours.

* Full scrubbing (reading all blocks from the pool, fixing checksum mismatches) takes 25-27 hours.

* Selective scrubbing - unimplemented, timeframe unknown (reading all of the BP-tree from the original pool, reading all blocks from the TLVDEV including the target disk and the original disk, fixing checksum mismatches without panicky messages and/or hotspares kicking in). I *guess* it would have similar speed to a resilver, but less bound to random write IO patterns, which may be better for latencies of other tasks on the system.

So, in the case of the original resilver, I replace the not-yet-dead disk with a hotspare, and after 17 hours of waiting I see if it was successfully resilvered or not. During this time the disk can die, for example, leaving my pool with lowered protection (or lack thereof in the case of raidz1 or two-way mirrors).

In the case of the new method proposed for a POC implementation, after 1 hour I'd already have a somewhat reliable copy of that vdev (a few blocks may have mismatches, but if the source disk dies or is taken away now - not the whole TLVDEV or pool is degraded with compromised protection). Then after the same +17 hours for scrubs I'd be certain that this copy is good.
If the new writes incoming to this TLVDEV between the start of DD and the end of scrub are directed to be written to both the source disk and its copy, then there are fewer (down to zero) checksum discrepancies for the scrub phase to find.

> Why not follow the well-designed existing procedure?

First it was a theoretical speculation, but a couple of days later the incomplete resilver made me run a practical experiment of the idea.

> The failure data does not support your hypothesis.

Ok, then my made-up and dismissed argument does not stand ;)

Thanks for the discussion,
//Jim Klimov
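[Editorial note: the ~1-hour figure quoted above is consistent with simple arithmetic. A back-of-envelope sketch using the thread's own numbers, under the assumption that the quoted "70Mb/s" means roughly 70 megabytes per second of sustained sequential throughput:]

```shell
#!/bin/sh
# Back-of-envelope check of the timings quoted in this thread.
# Assumption: the "70Mb/s" sustained rate means ~70 megabytes per second.
SIZE_GB=232      # raw partition payload, from the thread
RATE_MBS=70      # assumed sustained sequential rate, MB/s
# minutes = GB * 1024 (MB/GB) / rate (MB/s) / 60 (s/min)
CLONE_MIN=$(( SIZE_GB * 1024 / RATE_MBS / 60 ))
echo "sequential clone: ~${CLONE_MIN} minutes (vs. 17 hours observed for the resilver)"
```

This comes out to just under an hour, matching the "then 1 hour" estimate and underlining the roughly 17x gap between a bandwidth-bound clone and the IOPS-bound resilver on this pool.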
big assumption below...

On May 24, 2012, at 6:06 AM, Jim Klimov wrote:

> Let me try to formulate my idea again... You called a similar process "pushing the rope" some time ago, I think.
>
> I feel like I'm passing some exam and am trying to pick answers for a discipline like philosophy and I have no idea about the examinator's preferences - is he an ex-Communism teacher or an eager new religion fanatic? The same answer can lead to an A or to an F on a state exam. Ah, that was some fun experience :)
>
> Well, what we know is what remains after we forget everything that we were taught, while the exams are our last chance to learn something at all =)
>
> 2012-05-24 10:28, Richard Elling wrote:
>> You have not made a case for why this hybrid and failure-prone procedure is required. What problem are you trying to solve?
>
> Bigger-better-faster? ;)
>
> The original proposal in this thread was about understanding how resilvers and scrubs work, why they are so dog slow on HDDs in comparison to sequential reads, and thinking aloud what can be improved in this area.
>
> One of the later posts was about improving the disk replacement (where the original is still responsive, but may be imperfect) for filled-up fragmented pools by including a stage of fast data transfer and a different IO pattern for verification and updating of the new disk image, in comparison with current resilver's IO patterns.
>
> This may or may not have some benefits in certain (corner?) cases which are of practical interest to some users on this list, and if this discussion leads to a POC made by a competent ZFS programmer, which can be tested on a variety of ZFS pools (without risking one's only pool on a homeNAS) - so much the better. Then we would see if this scenario is viable or utterly useless and bad in every tested case.
> The practical numbers I have from the same box and disks are:
> * Copy from a 250Gb raidz1 (9*(4+1)) pool to a single-disk 3Tb test pool took 24 hours to fill the new disk - including the ZFS overheads.
> * Copying of one raw 250(232)Gb partition takes under 2 hours (if it can sustain about 70Mb/s reads from the source without distractions like other pool IO - then 1 hour).
> * Proper resilvering (reading all BP-tree from the original pool, reading all blocks from the TLVDEV, writing reconstructed(?) sectors to the target disk) from one partition to another took 17 hours.
> * Full scrubbing (reading all blocks from the pool, fixing checksum mismatches) takes 25-27 hours.
> * Selective scrubbing - unimplemented, timeframe unknown (reading all BP-tree from the original pool, reading all blocks from the TLVDEV including the target disk and the original disk, fixing checksum mismatches without panicky messages and/or hotspares kicking in). I *guess* it would have similar speed to a resilver, but less bound to random write IO patterns, which may be better for latencies of other tasks on the system.
>
> So, in case of original resilver, I replace the not-yet-dead disk with a hotspare, and after 17 hours of waiting I see if it was successfully resilvered or not. During this time the disk can die for example, leaving my pool with lowered protection (or lack thereof in case of raidz1 or two-way mirrors).
>
> In case of the new method proposed for a POC implementation, after 1 hour I'd already have a somewhat reliable copy of that vdev (a few blocks may have mismatches,

This is a big assumption -- that the disk will operate normally, even for data it cannot read. In my experience, this assumption is not valid for the majority of HDD failure modes. Also, in the case of consumer-grade disks, a single sector media error could take a very long time to retry/fail.

> but if the source disk dies or is taken away now - not the whole TLVDEV or pool is degraded and has compromised protection). Then after the same +17 hours for scrubs I'd be certain that this copy is good.
>
> If the new writes incoming to this TLVDEV between start of DD and end of scrub are directed to be written on both the source disk and its copy, then there are less (down to zero) checksum discrepancies that the scrub phase would find.
>
>> Why not follow the well-designed existing procedure?
>
> First it was a theoretical speculation, but a couple of days later the incomplete resilver made me a practical experiment of the idea.
>
>> The failure data does not support your hypothesis.
>
> Ok, then my made-up and dismissed argument does not stand ;)
>
> Thanks for the discussion,

np
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
2012-05-24 18:55, Richard Elling wrote:
> This is a big assumption -- that the disk will operate normally, even
> for data it cannot read. In my experience, this assumption is not valid
> for the majority of HDD failure modes. Also, in the case of consumer-grade
> disks, a single sector media error could take a very long time to
> retry/fail.

Indeed it is, and I've covered this in the thread earlier - the bulk copying phase ("DD-phase") should monitor its real progress, and if it detects lags in comparison to the average or expected speeds (expected = some tuning variable, i.e. 50Mb/s), the process should skip over some (arbitrary) range of sectors and go on from another location (such skipped sectors are in danger indeed, until the scrub-phase detects and reconstructs them), or fall back to the original resilver method completely. That was already described in some detail I thought of at the time of the posting, and I can't add much to that yet.

From what I've seen, faulty sectors are usually either single errors or a "scratched" range which can be worked around with i.e. partitioning for legacy FSes (if the SMART relocation doesn't deal with them properly for any reason), while most of the rest of the disk is okay. Retries may be lengthy, ranging from several seconds up to a minute, but they are often constrained to a few locations and *may* add little delay in the overall scheme of things. If the delay is more than acceptable and/or we can't find a "working location" on the source disk, we just fall back to the old method - either the original resilver, or if much data has been copied to the new disk - to the new selective scrub (it being much like the resilver, but taking into account those sectors on the target disk which may have been copied over correctly).

A somewhat worse case is intermittent errors at random times and logical disk locations due to who knows what - overheating, firmware overflow errors, bus resets, or whatever.
Rather, such errors are a reason for scrub-validation of the data after a mass migration, perhaps (as well as a reason for preventive regular scrubs)... //Jim
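The skip-ahead DD-phase proposed above could be sketched roughly as follows. This is only a minimal illustration of the idea, not anything from actual ZFS code: the chunk size, the 1-second/50 MB/s lag test, and the `dd_phase`/`skipped` names are all made up for the example, and it operates on plain files rather than raw devices.

```python
import os
import time

def dd_phase(src_path, dst_path, chunk=64 * 1024 * 1024,
             min_rate=50e6, skip=256 * 1024 * 1024):
    """Bulk-copy src to dst, jumping over regions that read too slowly.

    Returns the list of (start, end) byte ranges that were skipped, so
    that a later scrub-like pass can reconstruct them from redundancy.
    All names and thresholds here are illustrative only.
    """
    skipped = []
    size = os.path.getsize(src_path)
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        pos = 0
        while pos < size:
            src.seek(pos)
            t0 = time.monotonic()
            try:
                data = src.read(min(chunk, size - pos))
            except OSError:
                data = None  # unreadable region: treat it like a slow one
            elapsed = time.monotonic() - t0
            too_slow = (data is not None and elapsed > 1.0
                        and len(data) / elapsed < min_rate)
            if data is None or too_slow:
                # Lagging behind the expected speed: remember the range and
                # continue from a later offset; the skipped sectors stay
                # "in danger" until the scrub phase fixes them.
                end = min(pos + skip, size)
                skipped.append((pos, end))
                pos = end
                continue
            dst.seek(pos)
            dst.write(data)
            pos += len(data)
    return skipped
```

On a healthy source this degenerates into a plain sequential copy; the interesting output is the `skipped` list, which would feed the selective scrub phase.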
On 5/23/12 11:28 PM, Richard Elling wrote:

>>> The man page is clear on this topic, IMHO
>>
>> Indeed, even in snv_117 the zpool man page says that. But the
>> console/dmesg message was also quite clear, so go figure whom
>> to trust (or fear) more ;)
>
> The FMA message is consistent with the man page.

The man page seems not to mention the critical part of the FMA message that the OP is worried about. The OP said that his motivation for clearing the errors and fearing the degraded state was that he feared this:

>> AUTO-RESPONSE: The device has been marked as degraded. An attempt
>> will be made to activate a hot spare if available.

He doesn't want his dd'd new device kicked out of the vdev and replaced by a hot spare (if available) due to the number of errors and the scarlet letter of "degraded" at the device level - I don't think he cares about the pool-level degraded status, since it doesn't "do" anything.

>> fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
>> EVENT-TIME: Wed May 16 03:27:31 MSK 2012
>> PLATFORM: Sun Fire X4500, CSN: 0804AMT023 , HOSTNAME: thumper
>> SOURCE: zfs-diagnosis, REV: 1.0
>> EVENT-ID: cc25a316-4018-4f13-c675-d1d84c6325c3
>> DESC: The number of checksum errors associated with a ZFS device
>> exceeded acceptable levels. Refer to http://sun.com/msg/ZFS-8000-GH for more information.
>> AUTO-RESPONSE: The device has been marked as degraded. An attempt
>> will be made to activate a hot spare if available.
>> IMPACT: Fault tolerance of the pool may be compromised.
>> REC-ACTION: Run 'zpool status -x' and replace the bad device.
On May 25, 2012, at 1:53 PM, zfs user wrote:

> The man page seems not to mention the critical part of the FMA
> message that the OP is worried about. [...] He doesn't want his dd'd
> new device kicked out of the vdev and replaced by a hot spare (if
> available) due to the number of errors and the scarlet letter of
> "degraded" at the device level [...]

By the time you could read such a message, the hot spare would have already kicked in. Obviously, this was not the OP's issue.
-- richard
2012-05-26 1:07, Richard Elling wrote:

> On May 25, 2012, at 1:53 PM, zfs user wrote:
>> he doesn't want his dd'd new device kicked out of the vdev and
>> replaced by a hot spare (if available) due to the number of errors
>> and the scarlet letter of "degraded" at the device level [...]
>
> By the time you could read such a message, the hot spare would have
> already kicked in. Obviously, this was not the OP's issue.
> -- richard

Kind of, it was - the motivation for feeling insecure and, ultimately, for clearing the CKSUM errors every minute (whenever there was a nonzero error count), at least - via the script you said should never be used in "normal" practice, and I agree with that conclusion. (Manual) DDing is not the normal practice sanely covered by the degradation/hot-sparing mechanism.

As I wrote, the first time I saw the message the pool did not have an assigned hot spare, but it got marked degraded. Just in case, I came up with that "cksum-mismatch-cleansing" script and restarted the scrub, since I knew the errors on-disk were due to an unfinished "proper" resilver onto it. I was not convinced that the new disk would still fully operate in the pool while marked as degraded, and I did not want the scrub to continue only to find out that the disk would not be actively used and "repaired".

To say that in other words: I know that sometimes docs can lag behind or hop ahead of implemented features, and the latter can also be buggy or incomplete.
While the theory (the FMA and man page snippets) said the disk should continue being used by the array despite the DEGRADED mark, I had no intention of staging an experiment here to find out whether it actually would, in that aged version of the software. Thanks, //the OP ;)
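For what it's worth, the "cksum-mismatch-cleansing" watchdog discussed above was presumably something of this shape. The actual script was never posted, so this is only a guessed sketch: the function name, the injectable `run` hook, the pool/device names, and the 60-second interval are all my assumptions. It relies only on the real `zpool clear <pool> [device]` subcommand, which resets a device's error counters.

```python
import subprocess
import time

def clear_cksum_errors(pool, device, interval=60, iterations=None,
                       run=subprocess.run):
    """Keep resetting the (known-benign) checksum error counters so that
    FMA never sees a count high enough to fault the device and pull in
    a hot spare.  `run` is injectable for testing; in real use it
    invokes the actual zpool binary.
    """
    cleared = 0
    while iterations is None or cleared < iterations:
        # `zpool clear <pool> <device>` resets the error counters.
        run(['zpool', 'clear', pool, device], check=False)
        cleared += 1
        if iterations is not None and cleared >= iterations:
            break
        time.sleep(interval)
    return cleared
```

With `iterations=None` it runs until killed, which matches the "clear every minute until the scrub finishes" usage described in the thread; as the thread concludes, this is a workaround for an abnormal situation (a manually dd'd disk), not something to run in normal practice.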