Jim Klimov
2012-Jan-15 15:04 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
"Does raidzN actually protect against bitrot?" That''s a kind of radical, possibly offensive, question formula that I have lately. Reading up on theory of RAID5, I grasped the idea of the write hole (where one of the sectors of the stripe, such as the parity data, doesn''t get written - leading to invalid data upon read). In general, I think the same applies to bitrot of data that was written successfully and corrupted later - either way, upon reading all sectors of the stripe, we don''t have a valid result (for the XOR-parity example, XORing all bytes does not produce a zero). The way I get it, RAID5/6 generally has no mechanism to detect *WHICH* sector was faulty, if all of them got read without error reports from the disk. Perhaps it won''t even test whether parity matches and bytes zero out, as long as there were no read errors reported. In this case having a dead drive is better than having one with a silent corruption, because when one sector is known to be invalid or absent, its contents can be reconstructed thanks to other sectors and parity data. I''ve seen statements (do I have to scavenge for prooflinks?) that raidzN {sometimes or always?} has no means to detect which drive produced bad data either. In this case in output of "zpool status" we see zero CKSUM error-counts on leaf disk levels, and non-zero counts on raidzN levels. Opposed to that, on mirrors (which are used in examples of ZFS''s on-the-fly data repairs in all presentations), we do always know the faulty source of data and can repair it with a verifiable good source, if present. In a real-life example, on my 6-disk raidz2 pool I see some irrepairable corruptions as well as several "repaired" detected errors. So I have a set of questions here, outlined below... (DISCLAIMER: I haven''t finished reading through on-disk format spec in detail, but that PDF document is 5 years old anyway and I''ve heard some things have changed). 1) How does raidzN protect agaist bit-rot without known full death of a component disk, if it at all does? Or does it only help against "loud corruption" where the disk reports a sector-access error or dies completely? 2) Do the "leaf blocks" (on-disk sectors or ranges of sectors that belong to a raidzN stripe) have any ZFS checksums of their own? That is, can ZFS determine which of the disks produced invalid data and reconstruct the whole stripe? 2*) How are the sector-ranges on-physical-disk addressed by ZFS? Are there special block pointers with some sort of physical LBA addresses in place of DVAs and with checksums? I think there should be (claimed end-to-end checksumming) but wanted to confirm. 2**) Alternatively, how does raidzN get into situation like "I know there is an error somewhere, but don''t know where"? Does this signal simultaneous failures in different disks of one stripe? How *do* some things get fixed then - can only dittoed data or metadata be salvaged from second good copies on raidZ? 3) Is it true that in recent ZFS the metadata is stored in a mirrored layout, even for raidzN pools? That is, does the raidzN layout only apply to userdata blocks now? If "yes": 3*) Is such mirroring applied over physical VDEVs or over top-level VDEVs? For certain 512/4096 bytes of a metadata block, are there two (ditto-mirror) or more (ditto over raidz) physical sectors of storage directly involved? 3**) If small blocks, sized 1-or-few sectors, are fanned out in incomplete raidz stripes (i.e. 
512b parity + 512b data) does this actually lead to +100% overhead for small data, double that (200%) for dittoed data/copies=2? Does this apply to metadata in particular? ;) Does this large factor apply to ZVOLs with fixed block size being defined "small" (i.e. down to the minimum 512b/4k available for these disks)? 3***) In fact, for the considerations above, what is metadata? :) Is it only the tree of blockpointers, or is it all the two or three dozen block types except userdata (ZPL file, ZVOL block) and unallocated blocks? AND ON A SIDE NOTE: I do hope to see answers from the gurus on the list to these and other questions I posed recently. One frequently announced weakness in ZFS is the relatively small pool of engineering talent knowledgeable enough to hack ZFS and develop new features (i.e. the ex-Sunnites and very few determined other individuals): "We might do this, but we have few resources and already have other more pressing priorities". I think there is a lot more programming talent in the greater user/hacker community around ZFS, including active askers on this list, Linux/BSD porters, and probably many more people who just occasionally hit upon our discussions here by googling up their questions. I mean programmers ready to dedicate some time to ZFS, which are held back by not fully understanding the architecture, and just do not start their developing (so as not to make matters worse). And the knowledge barrier to start coding is quite high. I do hope that instead of spending weeks to make a new feature, development gurus could spend a day writing replies to questions like mine (and many others'') and then someone in the community would come up with a reasonable POC or finished code for new features and improvements. It is like education. Say, math: many talented mathematicians have spent thousands of man-years developing and refining the theory which now we learn over 3 or 6 years in a university. Maybe we''re skimming overheads on lectures, but we gain enough understanding to deepen into any more specific subject ourselves. Likewise with opensource: yes, the code is there. A developer might read into it and possibly comprehend some in a year or so. Or he could spend a few days midway (when he knows enough to pose hard questions not googlable in some FAQ yet) in yes-no question sessions with the more knowledgeable people, and become ready to work in just a few weeks from start. Wouldn''t that be wonderful for ZFS in general? :) Thanks in advance, //Jim Klimov
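The toy illustration promised above - hypothetical Python, not ZFS or any real RAID implementation - of how plain XOR parity detects a silently rotten stripe but cannot locate the rotten sector:

    # Toy RAID5-style XOR parity: detection without location.
    # Hypothetical illustration only -- not ZFS or any real RAID code.

    def xor_parity(sectors):
        # Byte-wise XOR of all the sectors given.
        parity = bytearray(len(sectors[0]))
        for sector in sectors:
            for i, b in enumerate(sector):
                parity[i] ^= b
        return bytes(parity)

    # Four data sectors plus one parity sector, as on a 5-disk RAID5 stripe.
    data = [bytes([d] * 8) for d in (1, 2, 3, 4)]
    parity = xor_parity(data)

    # Bitrot silently corrupts one data sector; no read error is reported.
    data[2] = bytes([0xFF] * 8)

    # XORing data and parity together no longer yields zeros, so the stripe
    # is detectably bad...
    print("stripe consistent?", xor_parity(data + [parity]) == bytes(8))  # False

    # ...but nothing here says WHICH sector rotted: rebuilding "the missing
    # sector" from the others yields a plausible-looking sector no matter
    # which one we assume is bad, and plain RAID5 has no independent
    # checksum to test the candidates against.

Locating the bad sector needs either an explicit error report from the disk or an independent checksum over the data - which is exactly where ZFS differs from plain RAID5.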
Edward Ned Harvey
2012-Jan-15 15:38 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> 1) How does raidzN protect against bit-rot without known full
> death of a component disk, if it at all does?
> Or does it only help against "loud corruption" where the
> disk reports a sector-access error or dies completely?

Whenever raidzN encounters a cksum error, it will read the redundant copies until it finds one that passes the cksum. The only ways you get an unrecoverable cksum error are (a) more than N disks are failing, or (b) the storage subsystem - i.e. HBA, bus, memory, CPU, etc. - is failing.

Let's suppose one disk in a raidz (1) returns corrupt data silently. Recall that there is enough parity redundancy to recover from *any* complete disk failure. That means zfs can read disks 1,2,3,4... then read disks 1,2,3,5... then read disks 1,2,4,5... ZFS can figure out which disk returned the faulty data, UNLESS the disk actually returns correct data upon subsequent retries. (A sketch of this loop follows at the end of this message.)

> How *do* some things get fixed then - can only dittoed data
> or metadata be salvaged from second good copies on raidZ?

Dittoed data is a layer of redundancy over and above your sysadmin-chosen level of redundancy (raidz/mirror level).

> One frequently announced weakness in ZFS is the relatively small
> pool of engineering talent knowledgeable enough to hack ZFS and
> develop new features (i.e. the ex-Sunnites and very few determined
> other individuals): "We might do this, but we have few resources
> and already have other more pressing priorities".

While that may be true, compare it to everything else. I prefer ZFS over OnTap any day of the week. And although btrfs will be good someday, it's just barely suitable now for *any* production purposes. I don't see the competition developing faster than ZFS.

> Likewise with open source: yes, the code is there. A developer
> might read into it and possibly comprehend some of it in a year or so.
> Or he could spend a few days midway (when he knows enough to
> pose hard questions not yet googlable in some FAQ) in yes-no
> question sessions with the more knowledgeable people, and become
> ready to work in just a few weeks from the start. Wouldn't that be
> wonderful for ZFS in general? :)

You know the open-source question in regards to ZFS is pretty much concluded, right? What Oracle called zpool version 28 was the last open-source version, currently in use on Nexenta, FreeBSD, and some others. The illumos project has continued development, minimally. If you think the development effort is resource-limited with Oracle working on zfs, just try the open-source illumos community... Since close-sourcing a little over a year ago, Oracle has continued developing and releasing new features... They're now at... what, zpool version 37 or something? Illumos has continued developing too... much less.

Yes, it would be nice to see more open-source development, but I think the main obstacle is the COW patent issue. Oracle is now immune from a NetApp lawsuit over it, but if you want somebody... for example perhaps Apple... to contribute development resources to the open-source branch, they'll have to duke it out with NetApp using their own legal resources. So far, the obstacle is just large enough that we don't see any other organizations contributing significantly. Linux is going with btrfs. MS has their own thing. Oracle continues with ZFS closed source. Apple needs a filesystem that doesn't suck, but they're not showing inclinations toward ZFS or anything else that I know of.
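A minimal sketch of the combinatorial loop Edward describes - a toy raidz1-style layout (XOR parity plus a strong block checksum), illustrative only and not the actual ZFS code path:

    # Toy model of raidz1-style combinatorial reconstruction.
    # Hypothetical illustration of the idea only -- not the real ZFS algorithm.
    import hashlib

    def xor(*sectors):
        out = bytearray(len(sectors[0]))
        for s in sectors:
            for i, b in enumerate(s):
                out[i] ^= b
        return bytes(out)

    # A 4-column "stripe": three data sectors plus one XOR parity sector.
    data = [b"AAAAAAAA", b"BBBBBBBB", b"CCCCCCCC"]
    columns = data + [xor(*data)]

    # The block checksum stored in the parent block pointer (end-to-end).
    checksum = hashlib.sha256(b"".join(data)).digest()

    # One disk silently rots; every disk still "reads" without error.
    columns[1] = b"XXXXXXXX"

    # Assume each data column is bad in turn, rebuild it from the others,
    # and test the candidate block against the stored checksum.
    for bad in range(len(columns) - 1):
        others = [c for i, c in enumerate(columns) if i != bad]
        candidate = columns[:bad] + [xor(*others)] + columns[bad + 1:]
        if hashlib.sha256(b"".join(candidate[:-1])).digest() == checksum:
            print("column", bad, "was bad; data repaired")
            break
    else:
        print("no single-column repair matches; error is uncorrectable")

With two rotten columns the loop falls through to its else branch: the checksum still detects the mismatch, but single-parity reconstruction cannot repair it - the "error somewhere, but don't know where" state Jim asks about.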
Peter Tribble
2012-Jan-15 16:06 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov <jimklimov at cos.ru> wrote:

> "Does raidzN actually protect against bitrot?"
> That's a kind of radical, possibly offensive, question that
> I have been asking lately.

Yup, it does. That's why many of us use it.

> The way I get it, RAID5/6 generally has no mechanism to detect
> *WHICH* sector was faulty, if all of them got read without
> error reports from the disk.

ZFS validates the checksum on every read. If it doesn't match, then one of the devices (at least) is returning incorrect data. So it simply tries reconstruction assuming each device in turn is bad until it gets the right answer. That gives you the correct data, tells you which device was wrong, and then you write the correct data back to the errant device.

> Perhaps it won't even test whether
> parity matches and bytes zero out, as long as there were no read
> errors reported.

Absolutely not. It always checks, regardless. (Try writing over one half of a zfs mirror with dd and watch it cheerfully repair your data without an actual error in sight.)

> 1) How does raidzN protect against bit-rot without known full
>    death of a component disk, if it at all does?
>    Or does it only help against "loud corruption" where the
>    disk reports a sector-access error or dies completely?
>
> 2) Do the "leaf blocks" (on-disk sectors or ranges of sectors
>    that belong to a raidzN stripe) have any ZFS checksums of
>    their own? That is, can ZFS determine which of the disks
>    produced invalid data and reconstruct the whole stripe?

No, the checksum is against the whole stripe. And you do the combinatorial reconstruction to work out which is bad.

> 2**) Alternatively, how does raidzN get into a situation like
>    "I know there is an error somewhere, but don't know where"?
>    Does this signal simultaneous failures in different disks
>    of one stripe?

If you have raidz1, and two devices give bad data, then you don't have enough redundancy to do the reconstruction. I've not seen this myself for random bitrot, but it's the sort of thing that can give you trouble if you lose a whole disk and then hit a bad block on another device during resilver.

(Regular scrubs to identify and fix bad individual blocks before you have to do a resilver are therefore a good thing.)

>    How *do* some things get fixed then - can only dittoed data
>    or metadata be salvaged from second good copies on raidZ?

You can recover anything you have enough redundancy for. Which means everything, up to the redundancy of the vdev. Beyond that, you may be able to recover dittoed data (of which metadata is just one example) even if you've lost an entire vdev.

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
Jim Klimov
2012-Jan-15 16:20 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
2012-01-15 19:38, Edward Ned Harvey wrote:

>> 1) How does raidzN protect against bit-rot without known full
>> death of a component disk, if it at all does?
>
> zfs can read disks 1,2,3,4... Then read disks 1,2,3,5...
> Then read disks 1,2,4,5... ZFS can figure out which disk
> returned the faulty data, UNLESS the disk actually returns
> correct data upon subsequent retries.

Makes sense, if ZFS does actually do that ;)

Counter-examples:

1) For several scrubs in a row, my pool consistently found two vdev errors and one pool error with zero per-disk errors (further leading to an error in some object <metadata>:<0x0>). If the disk-read errors were transient, sometimes returning correct data (i.e. bad-sector relocation was successful in the background), ZFS would receive good blocks on further scrubs - shouldn't it?

2) Even with one bad sector consistently in place, if ZFS can deduce the correct original block data, why report the error at all (especially many times over) as uncorrectable?

This leaves me thinking of two on-disk errors, and/or a lack of checksums for leaf blocks, as the possible reasons for such detected raidz errors with undetected faulty individual disks. Any other options I overlooked?

> You know the open-source question in regards to ZFS is pretty much
> concluded, right? What Oracle called zpool version 28 was the last open-
> source version, currently in use on Nexenta, FreeBSD, and some others. The
> illumos project has continued development, minimally. If you think the
> development effort is resource-limited with Oracle working on zfs, just try
> the open-source illumos community...

I do try it. I do also see some companies like Nexenta or Joyent having discussed the NetApp problem and having moved on, betting on their work with open-sourced ZFS.

Also, Oracle's closed ZFS is actually of little relevance to me or other SOHO users (laptops, home NASes, etc.). As Oracle doesn't deal with small customers, and people still have problems buying or getting support for small-volume stuff, or find Oracle's offerings prohibitively expensive, it is hard to get Oracle to notice a bug/RFE report not backed by money. There is nothing inherently bad with the business model - Sun also had it (while being more open to suggestions). It's just that in this model SOHO users have no influence on ZFS and it becomes a closed proprietary gadget like any other FS, without engineering interest to enhance it. And this couples with limited understanding of whether you have a right to use it at all and not get sued by Oracle (i.e. for trying to put Solaris 11 into your production without paying the tax).

Over the past year I have proposed or discussed a number of features for ZFS, and while there is little chance that illumos developers would implement any of that soon, there is near-zero chance that Oracle ever will. And there is a greater chance that I or some other developer would dig into such RFEs and publish a solution - especially if such a developer is helped with the theory.

//Jim
Bob Friesenhahn
2012-Jan-15 16:30 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
On Sun, 15 Jan 2012, Jim Klimov wrote:

> 1) How does raidzN protect against bit-rot without known full
>    death of a component disk, if it at all does?
>    Or does it only help against "loud corruption" where the
>    disk reports a sector-access error or dies completely?

Here is a layman's answer, since I am not a zfs developer:

ZFS maintains a checksum for a full block, which is comprised of multiple chunks stored on different disks. If one of these chunks proves to be faulty (i.e. the drive reports an error while reading that chunk, or the overall block checksum is faulty) then zfs is able to reconstruct the data using the 'N' redundancy chunks based on distributed parity.

If the drive reports success but the data it returned is wrong, then the zfs block checksum will fail. Given a sufficiently strong checksum algorithm, there is only one permutation of reconstructed data which will satisfy the zfs block checksum, so zfs can try the various reconstruction permutations (assuming that each returned chunk is defective in turn) until it finds a permutation which re-creates the correct data. At this point, it knows which chunk was bad and can re-write it or take other recovery action. If the number of bad chunks exceeds the available redundancy level, then the bad data can be detected, but can't be corrected.

The fly in the ointment is that system memory needs to operate reliably, which is why ECC memory is recommended/required in order for zfs to offer full data-reliability assurance.

> One frequently announced weakness in ZFS is the relatively small
> pool of engineering talent knowledgeable enough to hack ZFS and
> develop new features (i.e. the ex-Sunnites and very few determined
> other individuals): "We might do this, but we have few resources
> and already have other more pressing priorities".

One might assume that the pool of ZFS knowledge is smaller than for other popular filesystems, but it seems to be currently larger than for UFS, FFS, EXT4, and XFS. For example, Kirk McKusick is still the one fixing reported bugs in BSD FFS. As filesystems become more mature, the number of people fluent in their implementation grows smaller due to a diminishing need to fix problems. Filesystems which are the brainchild of just one person (e.g. EXT4, Reiserfs) become subject to the abilities of that one person.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
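For a sense of how small that permutation search is, a hedged back-of-the-envelope count (assuming the reconstructor tries every combination of up to N presumed-bad chunks, per Bob's description):

    # How many "assume these chunks are bad" combinations a raidzN stripe
    # could require, at most. Illustration only.
    from math import comb   # Python 3.8+

    for width, nparity in ((6, 2), (9, 3)):   # e.g. 6-disk raidz2, 9-disk raidz3
        tries = sum(comb(width, k) for k in range(1, nparity + 1))
        print(f"{width}-wide raidz{nparity}: at most {tries} reconstruction attempts")

That prints 21 attempts for a 6-wide raidz2 and 129 for a 9-wide raidz3 - cheap enough to do on every checksum mismatch.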
Jim Klimov
2012-Jan-15 16:42 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
2012-01-15 20:06, Peter Tribble wrote:

> (Try writing over one
> half of a zfs mirror with dd and watch it cheerfully repair your data
> without an actual error in sight.)

Are you certain it always works? AFAIK, mirror reads are round-robined (which leads to parallel read-performance boosts). Only if your read hits the mismatch is the block reconstructed from another copy. And scrubs are one mechanism to force such reads of all copies of all blocks and trigger reconstructions as needed.

>> 1) How does raidzN protect against bit-rot without known full
>> death of a component disk, if it at all does?
>> Or does it only help against "loud corruption" where the
>> disk reports a sector-access error or dies completely?
>>
>> 2) Do the "leaf blocks" (on-disk sectors or ranges of sectors
>> that belong to a raidzN stripe) have any ZFS checksums of
>> their own? That is, can ZFS determine which of the disks
>> produced invalid data and reconstruct the whole stripe?
>
> No, the checksum is against the whole stripe. And you do the
> combinatorial reconstruction to work out which is bad.

Hmmm, in this case, how precisely does ZFS know which disks contain which sector ranges of variable-width stripes? In Max Bruning's weblog I saw a reference to the kernel routine vdev_raidz_map_alloc(). Without a layer of pointers to sectors with data on each physical vdev (and, as I hoped, such a layer might contain checksums or ECCs), it seems like a fundamental, unchangeable part of ZFS raidz. Is that true?

>> 2**) Alternatively, how does raidzN get into a situation like
>> "I know there is an error somewhere, but don't know where"?
>> Does this signal simultaneous failures in different disks
>> of one stripe?
>
> If you have raidz1, and two devices give bad data, then you don't
> have enough redundancy to do the reconstruction. I've not seen this
> myself for random bitrot, but it's the sort of thing that can give you
> trouble if you lose a whole disk and then hit a bad block on another
> device during resilver.
>
> (Regular scrubs to identify and fix bad individual blocks before you have
> to do a resilver are therefore a good thing.)

That's what I did more or less regularly. Then one nice scrub gave me such a condition... :(

>> How *do* some things get fixed then - can only dittoed data
>> or metadata be salvaged from second good copies on raidZ?
>
> You can recover anything you have enough redundancy for. Which
> means everything, up to the redundancy of the vdev. Beyond that,
> you may be able to recover dittoed data (of which metadata is just
> one example) even if you've lost an entire vdev.

And now, with my one pool-level error and two raidz-level errors, is it correct to interpret that attempts to read both dittoed copies of pool:<metadata>:<0x0>, whatever that is, have failed? In particular, shouldn't the metadata redundancy (mirroring and/or copies=2 over raidz or over its component disks) point to the specific disks that contained the block and failed to produce it correctly?..

Thanks all for the replies,
//Jim Klimov
Gary Mills
2012-Jan-15 16:43 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
On Sun, Jan 15, 2012 at 04:06:33PM +0000, Peter Tribble wrote:

> On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov <jimklimov at cos.ru> wrote:
>> "Does raidzN actually protect against bitrot?"
>> That's a kind of radical, possibly offensive, question that
>> I have been asking lately.
>
> Yup, it does. That's why many of us use it.

There's actually no such thing as bitrot on a disk. Each sector on the disk is accompanied by a CRC that's verified by the disk controller on each read. It will either return correct data or report an unreadable sector. There's nothing in between.

Of course, if something outside of ZFS writes to the disk, then data belonging to ZFS will be modified. I've heard of RAID controllers or SAN devices doing this when they modify the disk geometry or reserved areas of the disk.

-- 
-Gary Mills-        -refurb-        -Winnipeg, Manitoba, Canada-
Andrew Gabriel
2012-Jan-15 16:49 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
Gary Mills wrote:

> On Sun, Jan 15, 2012 at 04:06:33PM +0000, Peter Tribble wrote:
>> On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov <jimklimov at cos.ru> wrote:
>>> "Does raidzN actually protect against bitrot?"
>>> That's a kind of radical, possibly offensive, question that
>>> I have been asking lately.
>>
>> Yup, it does. That's why many of us use it.
>
> There's actually no such thing as bitrot on a disk. Each sector on
> the disk is accompanied by a CRC that's verified by the disk
> controller on each read. It will either return correct data or report
> an unreadable sector. There's nothing in between.

Actually, there are a number of disk firmware and cache faults in between, which zfs has picked up over the years.

-- 
Andrew Gabriel
Jim Klimov
2012-Jan-15 16:53 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
2012-01-15 20:43, Gary Mills wrote:

> On Sun, Jan 15, 2012 at 04:06:33PM +0000, Peter Tribble wrote:
>> On Sun, Jan 15, 2012 at 3:04 PM, Jim Klimov <jimklimov at cos.ru> wrote:
>>> "Does raidzN actually protect against bitrot?"
>>> That's a kind of radical, possibly offensive, question that
>>> I have been asking lately.
>>
>> Yup, it does. That's why many of us use it.
>
> There's actually no such thing as bitrot on a disk. Each sector on
> the disk is accompanied by a CRC that's verified by the disk
> controller on each read. It will either return correct data or report
> an unreadable sector.

What about UBER (uncorrectable bit-error rates)? For example, the non-zero small chance of other block contents matching the CRC code (circa 10^-14 to 10^-16)? If hashes were perfect with zero collisions, they could be used instead of the original data and be much more compact, and lossless compression algorithms would always return smaller data than *any* random original stream ;) Even ZFS dedup, with its circa 10^-77 collision chance, offers a verify-on-write mode.

> There's nothing in between.

Also "in between" there's cabling, contacts and dialog protocols. AFAIK some protocols and/or implementations don't bother with on-wire CRC/ECC - perhaps the IDE (and maybe consumer SATA) protocols?

Thanks for replies,
//Jim Klimov
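For scale, a back-of-the-envelope reading of those UBER figures, treating them as the drive datasheets do - errors per bit read (my interpretation, not stated in this thread):

    # Rough scale of typical quoted UBER values: expected data read
    # per unrecoverable bit error. Illustration only.
    for uber in (1e-14, 1e-15, 1e-16):
        tib = (1 / uber) / 8 / 2**40   # expected bits per error -> TiB read
        print(f"UBER {uber:.0e}: roughly one bad bit per {tib:,.0f} TiB read")

At 10^-14 that works out to roughly one bad bit per 11 TiB read - a few full passes over a large consumer pool.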
Toby Thain
2012-Jan-16 01:56 UTC
[zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?
On 15/01/12 10:38 AM, Edward Ned Harvey wrote:

> ...
> Linux is going with btrfs. MS has their own thing. Oracle continues with
> ZFS closed source. Apple needs a filesystem that doesn't suck, but they're
> not showing inclinations toward ZFS or anything else that I know of.

Rumours have long circulated, even before the brief public debacle of ZFS in OS X - "is it in Leopard... yes it's in... no it's not... yes it's in... oh damn, it's really not" - that Apple is building their own clone of ZFS.

--Toby
Edward Ned Harvey
2012-Jan-16 02:40 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> 2012-01-15 19:38, Edward Ned Harvey wrote:
> >> 1) How does raidzN protect against bit-rot without known full
> >> death of a component disk, if it at all does?
> > zfs can read disks 1,2,3,4... Then read disks 1,2,3,5...
> > Then read disks 1,2,4,5... ZFS can figure out which disk
> > returned the faulty data, UNLESS the disk actually returns
> > correct data upon subsequent retries.
>
> Makes sense, if ZFS does actually do that ;)
>
> Counter-examples:
> 1) For several scrubs in a row, my pool consistently found two
> vdev errors and one pool error with zero per-disk errors
> (further leading to an error in some object <metadata>:<0x0>).
> If the disk-read errors were transient, sometimes returning
> correct data (i.e. bad-sector relocation was successful in
> the background), ZFS would receive good blocks on further
> scrubs - shouldn't it?

I can't say this is the explanation for your situation, but I can offer it as one possible explanation: suppose your system is in operation, and you get corruption in your CPU or RAM, so it calculates the wrong cksum for the data that is about to be written. The data gets written, along with the wrong cksum. Later, you come along and read that data. You discover the cksum error, it's unrecoverable, but there are no disk errors.

I have certainly experienced CPUs that perform incorrect calculations, and I have certainly encountered errant memory. Usually when a component starts failing like that, it progressively gets worse (or at least you can usually run some diag utils) and you can identify the failing component. But not always.

Such failures can happen undetected with or without ECC memory. It's simply less likely with ECC. The whole thing about ECC memory... it's just doing parity. It's a very weak checksum. If corruption happens in memory, it's FAR more likely that the error will go undetected by ECC as compared to the Fletcher or SHA checksum being used by ZFS.

Even when you get down to the actual disk... all disks store parity/checksum information, using their FEC chip. All disks will attempt to detect and correct errors they encounter (this is even stronger than ECC memory). But nothing's perfect, not even SHA... Still, the accuracy of Fletcher or SHA is far, far greater than that of the ECC or FEC being used by your memory and disks.
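A toy sketch of the parity-versus-checksum gap Edward describes (note Bob's follow-up below: real ECC DIMMs use SECDED codes, stronger than the bare per-byte parity modeled here):

    # Toy comparison: one parity bit per byte vs. a strong hash.
    # Illustration only; real ECC DIMMs use stronger (SECDED) codes.
    import hashlib

    def parity_bits(data):
        # One even-parity bit per byte, as a list of 0/1.
        return [bin(b).count("1") % 2 for b in data]

    original = b"important zfs block"
    checks = parity_bits(original)
    digest = hashlib.sha256(original).digest()

    # Flip TWO bits within the same byte: per-byte parity is unchanged.
    corrupt = bytearray(original)
    corrupt[0] ^= 0b00000011

    print("parity still matches:", parity_bits(corrupt) == checks)             # True
    print("sha256 still matches:", hashlib.sha256(corrupt).digest() == digest) # False

Any even number of flipped bits within one parity-covered unit sails past a bare parity check, while a 256-bit hash makes an undetected change astronomically unlikely.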
Edward Ned Harvey
2012-Jan-16 02:48 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Gary Mills
>
> There's actually no such thing as bitrot on a disk. Each sector on
> the disk is accompanied by a CRC that's verified by the disk
> controller on each read. It will either return correct data or report
> an unreadable sector. There's nothing in between.

There is something in between: corrupt data that happens to pass the hardware checksum. You said CRC, but CRC is a specific algorithm. The various disk manufacturers can implement various different algorithms, including parity, CRC, LDPC, or whatever. Of the algorithms they use inside the actual disk, CRC would be a relatively strong one.

And if there were "nothing in between" absolute accuracy and absolute error... then there would be no point in all the stronger checksum algorithms, such as SHA256. There would be no point in ZFS bothering with checksumming if disks could never silently return corrupted data.
Bob Friesenhahn
2012-Jan-16 04:49 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
On Sun, 15 Jan 2012, Edward Ned Harvey wrote:

> Such failures can happen undetected with or without ECC memory. It's simply
> less likely with ECC. The whole thing about ECC memory... it's just doing
> parity. It's a very weak checksum. If corruption happens in memory, it's

I am beginning to become worried now. ECC is more than "just doing parity".

http://en.wikipedia.org/wiki/Error-correcting_code
http://en.wikipedia.org/wiki/ECC_memory

There have been enough sequential errors now (a form of corruption) that I think you should start doing research prior to posting.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Richard Elling
2012-Jan-16 06:08 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
On Jan 15, 2012, at 7:04 AM, Jim Klimov wrote:

> "Does raidzN actually protect against bitrot?"
> That's a kind of radical, possibly offensive, question that
> I have been asking lately.

Simple answer: no. raidz provides data protection. Checksums verify data is correct. Two different parts of the storage solution.

> Reading up on the theory of RAID5, I grasped the idea of the write
> hole (where one of the sectors of the stripe, such as the parity
> data, doesn't get written - leading to invalid data upon read).
> In general, I think the same applies to bitrot of data that was
> written successfully and corrupted later - either way, upon
> reading all sectors of the stripe, we don't have a valid result
> (for the XOR-parity example, XORing all bytes does not produce
> a zero).
>
> The way I get it, RAID5/6 generally has no mechanism to detect
> *WHICH* sector was faulty, if all of them got read without
> error reports from the disk. Perhaps it won't even test whether
> parity matches and bytes zero out, as long as there were no read
> errors reported. In this case having a dead drive is better than
> having one with a silent corruption, because when one sector is
> known to be invalid or absent, its contents can be reconstructed
> thanks to the other sectors and parity data.
>
> I've seen statements (do I have to scavenge for prooflinks?)
> that raidzN {sometimes or always?} has no means to detect
> which drive produced bad data either. In this case the output
> of "zpool status" shows zero CKSUM error counts at the leaf-disk
> level and non-zero counts at the raidzN level.

raidz uses an algorithm to try permutations of data and parity to verify against the checksum. Once the checksum matches, repair can begin.

> Opposed to that, on mirrors (which are used in examples of
> ZFS's on-the-fly data repairs in all presentations), we
> always know the faulty source of data and can repair it
> from a verifiable good source, if present.

Mirrors are no different: ZFS tries each side of the mirror until it finds data that matches the checksum.

> In a real-life example, on my 6-disk raidz2 pool I see some
> irreparable corruptions as well as several "repaired" detected
> errors. So I have a set of questions here, outlined below...
>
> (DISCLAIMER: I haven't finished reading through the on-disk
> format spec in detail, but that PDF document is 5 years
> old anyway and I've heard some things have changed.)
>
> 1) How does raidzN protect against bit-rot without known full
> death of a component disk, if it at all does?
> Or does it only help against "loud corruption" where the
> disk reports a sector-access error or dies completely?

raidz cannot be separated from the ZFS checksum verification in this answer.

> 2) Do the "leaf blocks" (on-disk sectors or ranges of sectors
> that belong to a raidzN stripe) have any ZFS checksums of
> their own? That is, can ZFS determine which of the disks
> produced invalid data and reconstruct the whole stripe?

No. Yes.

> 2*) How are the sector ranges on-physical-disk addressed by
> ZFS? Are there special block pointers with some sort of
> physical LBA addresses in place of DVAs, and with checksums?
> I think there should be (claimed end-to-end checksumming)
> but wanted to confirm.

No.

> 2**) Alternatively, how does raidzN get into a situation like
> "I know there is an error somewhere, but don't know where"?
> Does this signal simultaneous failures in different disks
> of one stripe?
> How *do* some things get fixed then - can only dittoed data
> or metadata be salvaged from second good copies on raidz?

No. See the seminal blog on raidz:
http://blogs.oracle.com/bonwick/entry/raid_z

> 3) Is it true that in recent ZFS the metadata is stored in
> a mirrored layout, even for raidzN pools? That is, does
> the raidzN layout only apply to userdata blocks now?
> If "yes":

Yes, for Solaris 11. No, for all other implementations, at this time.

> 3*) Is such mirroring applied over physical VDEVs or over
> top-level VDEVs? For certain 512/4096 bytes of a metadata
> block, are there two (ditto-mirror) or more (ditto over
> raidz) physical sectors of storage directly involved?

It is done in the top-level vdev. For more information see the manual, "What's New in ZFS?" in the Oracle Solaris ZFS Administration Guide:
http://docs.oracle.com/cd/E19963-01/html/821-1448/gbscy.html

> 3**) If small blocks, sized one-or-few sectors, are fanned out
> in incomplete raidz stripes (i.e. 512b parity + 512b data),
> does this actually lead to +100% overhead for small data,
> and double that (200%) for dittoed data/copies=2?

The term "incomplete" does not apply here. The stripe written is complete: data + parity.

> Does this apply to metadata in particular? ;)

Lost context here; for non-Solaris 11 implementations, metadata is no different than data with copies=[23].

> Does this large factor apply to ZVOLs with a fixed block
> size defined as "small" (i.e. down to the minimum 512b/4k
> available for these disks)?

NB, there are a few slides in my ZFS tutorials where we talk about this:
http://www.slideshare.net/relling/usenix-lisa11-tutorial-zfs-a

> 3***) In fact, for the considerations above, what is metadata? :)
> Is it only the tree of blockpointers, or is it all the two
> or three dozen block types except userdata (ZPL file, ZVOL
> block) and unallocated blocks?

It is metadata, and there is quite a variety. For example, there is the MOS, zpool history, DSL configuration, etc.

> AND ON A SIDE NOTE:
>
> I do hope to see answers from the gurus on the list to these
> and other questions I posed recently.
>
> One frequently announced weakness in ZFS is the relatively small
> pool of engineering talent knowledgeable enough to hack ZFS and
> develop new features (i.e. the ex-Sunnites and very few determined
> other individuals): "We might do this, but we have few resources
> and already have other more pressing priorities".

There is quite a bit of activity going on under the illumos umbrella. In fact, at the illumos meetup last week, there were several presentations about upcoming changes and additions (awesome stuff!). See Deirdre's videos at http://smartos.org/tag/zfs/

> I think there is a lot more programming talent in the greater
> user/hacker community around ZFS, including active askers on this
> list, Linux/BSD porters, and probably many more people who just
> occasionally hit upon our discussions here by googling up their
> questions. I mean programmers ready to dedicate some time to ZFS,
> who are held back by not fully understanding the architecture,
> and so do not start developing (so as not to make matters
> worse). The knowledge barrier to start coding is quite high.
>
> I do hope that instead of spending weeks to make a new feature,
> development gurus could spend a day writing replies to questions
> like mine (and many others') and then someone in the community
> would come up with a reasonable POC or finished code for new
> features and improvements.
>
> It is like education. Say, math: many talented mathematicians
> have spent thousands of man-years developing and refining the
> theory which we now learn over 3 or 6 years at a university.
> Maybe we're just skimming the overheads in lectures, but we gain
> enough understanding to dig deeper into any more specific subject
> ourselves.
>
> Likewise with open source: yes, the code is there. A developer
> might read into it and possibly comprehend some of it in a year or so.
> Or he could spend a few days midway (when he knows enough to
> pose hard questions not yet googlable in some FAQ) in yes-no
> question sessions with the more knowledgeable people, and become
> ready to work in just a few weeks from the start. Wouldn't that be
> wonderful for ZFS in general? :)

Agree 110%
-- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
Richard Elling
2012-Jan-16 06:31 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
On Jan 15, 2012, at 8:49 PM, Bob Friesenhahn wrote:

> On Sun, 15 Jan 2012, Edward Ned Harvey wrote:
>>
>> Such failures can happen undetected with or without ECC memory. It's simply
>> less likely with ECC. The whole thing about ECC memory... it's just doing
>> parity. It's a very weak checksum. If corruption happens in memory, it's
>
> I am beginning to become worried now. ECC is more than "just doing parity".

It depends. ECC is a very generic term. Most "ECC memory" is SECDED, except for the high-end servers and mainframes.

> http://en.wikipedia.org/wiki/Error-correcting_code
> http://en.wikipedia.org/wiki/ECC_memory
>
> There have been enough sequential errors now (a form of corruption) that I
> think you should start doing research prior to posting.

I've been collecting a number of ZFS bit-error reports (courtesy of fmdump -eV) and I have never seen a single-bit error in a block. The errors appear to be of the overwrite or stuck-at variety that impact multiple bits. This makes sense because most disks already correct up to 8 bytes (or so) per sector.
-- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
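A toy sketch of the SECDED behaviour Richard refers to - Hamming(7,4) plus an overall parity bit. Real DIMMs use wider codewords (typically 64 data + 8 check bits), but the correct-one-error/detect-two-errors behaviour is the same:

    # Minimal SECDED (single-error-correct, double-error-detect) sketch.
    # Hamming(7,4) plus an overall parity bit; illustration only.

    def encode(d):                       # d = 4 data bits
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4                # covers codeword positions 1,3,5,7
        p2 = d1 ^ d3 ^ d4                # covers positions 2,3,6,7
        p3 = d2 ^ d3 ^ d4                # covers positions 4,5,6,7
        code = [p1, p2, d1, p3, d2, d3, d4]
        return code + [sum(code) % 2]    # overall parity bit for DED

    def decode(c):
        *code, overall = c
        s = 0                            # syndrome = position of a single error
        for p, covered in ((1, (1, 3, 5, 7)), (2, (2, 3, 6, 7)), (4, (4, 5, 6, 7))):
            if sum(code[i - 1] for i in covered) % 2:
                s += p
        overall_ok = sum(code) % 2 == overall
        if s == 0:
            return "ok", [code[2], code[4], code[5], code[6]]
        if not overall_ok:               # single error: locate and flip it back
            code[s - 1] ^= 1
            return "corrected", [code[2], code[4], code[5], code[6]]
        return "double error: detected, uncorrectable", None

    word = encode([1, 0, 1, 1])
    word[5] ^= 1                         # flip one bit -> corrected
    print(decode(word))
    word[2] ^= 1                         # flip a second bit -> detected only
    print(decode(word))

So SECDED memory locates and repairs any single flipped bit per codeword and flags (but cannot fix) any two - stronger than bare parity, but still far weaker than a 256-bit block checksum.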
Jim Klimov
2012-Jan-16 12:02 UTC
[zfs-discuss] Does raidzN actually protect against bitrot? If yes - how?
Thanks again for answering! :)

2012-01-16 10:08, Richard Elling wrote:

> On Jan 15, 2012, at 7:04 AM, Jim Klimov wrote:
>
>> "Does raidzN actually protect against bitrot?"
>> That's a kind of radical, possibly offensive, question that
>> I have been asking lately.
>
> Simple answer: no. raidz provides data protection. Checksums verify
> data is correct. Two different parts of the storage solution.

Meaning: a data-block checksum mismatch allows us to detect an error; afterwards, raidz permutations matching the checksum allow us to fix it (if enough redundancy is available)? Right?

> raidz uses an algorithm to try permutations of data and parity to
> verify against the checksum. Once the checksum matches, repair
> can begin.

Ok, nice to have this statement confirmed so many times now ;)

How do per-disk cksum errors get counted then for raidz - thanks to the permutations, for fixable errors we can detect which disk:sector returned mismatching data? Likewise, for unfixable errors we can't know the faulty disk - unless one had explicitly erred? So, if my 6-disk raidz2 couldn't fix the error, it either occurred on 3 disks' parts of one stripe, or in RAM/CPU (a SPOF) before the data and checksum were written to disk? In the latter case there is definitely no single disk at fault for returning bad data, so the per-disk cksum counters are zero? ;)

>> 2*) How are the sector ranges on-physical-disk addressed by
>> ZFS? Are there special block pointers with some sort of
>> physical LBA addresses in place of DVAs, and with checksums?
>> I think there should be (claimed end-to-end checksumming)
>> but wanted to confirm.
>
> No.

Ok, so basically there is the vdev_raidz_map_alloc() algorithm to convert DVAs into leaf addresses, and it is always going to be the same for all raidz's? For example, such lack of explicit addressing would not let ZFS reallocate one disk's bad media sector to another location - the disk is always expected to do that reliably and successfully?

>> 2**) Alternatively, how does raidzN get into a situation like
>> "I know there is an error somewhere, but don't know where"?
>> Does this signal simultaneous failures in different disks
>> of one stripe?
>> How *do* some things get fixed then - can only dittoed data
>> or metadata be salvaged from second good copies on raidz?
>
> No. See the seminal blog on raidz
> http://blogs.oracle.com/bonwick/entry/raid_z
>
>> 3) Is it true that in recent ZFS the metadata is stored in
>> a mirrored layout, even for raidzN pools? That is, does
>> the raidzN layout only apply to userdata blocks now?
>> If "yes":
>
> Yes, for Solaris 11. No, for all other implementations, at this time.

Are there plans to do this for illumos, etc.? I thought that my oi_148a's disks' IO patterns matched the idea of mirroring metadata; now I'll have to explain that data with some secondary ideas ;)

>> 3*) Is such mirroring applied over physical VDEVs or over
>> top-level VDEVs? For certain 512/4096 bytes of a metadata
>> block, are there two (ditto-mirror) or more (ditto over
>> raidz) physical sectors of storage directly involved?
>
> It is done in the top-level vdev. For more information see the manual,
> "What's New in ZFS?" in the Oracle Solaris ZFS Administration Guide:
> http://docs.oracle.com/cd/E19963-01/html/821-1448/gbscy.html
>
>> 3**) If small blocks, sized one-or-few sectors, are fanned out
>> in incomplete raidz stripes (i.e. 512b parity + 512b data),
>> does this actually lead to +100% overhead for small data,
>> and double that (200%) for dittoed data/copies=2?
>
> The term "incomplete" does not apply here. The stripe written is
> complete: data + parity.

Just to clear this up: I meant variable-width stripes as opposed to the "full-width stripe" writes of other RAIDs. That is, to update one sector of data on a 6-disk RAID6 I'd need to write a 6-sector full stripe, while on raidz2 I need to write only three sectors (one data plus two parity). No extra reply solicited here ;)

>> Does this apply to metadata in particular? ;)
>
> Lost context here; for non-Solaris 11 implementations, metadata is
> no different than data with copies=[23].

The question here was whether writes of metadata (assumed to be a small number of sectors, down to one per block) incur writes of parity, of ditto copies, or of parity and copies, increasing storage requirements several-fold. One background thought was that I wanted to make sense of my last year's experience with a zvol whose blocksize was 1 sector (4kb), where the metadata overhead (consumption of free space) was about the same as the userdata size. At the time I thought it was because I have a 1-sector metadata block addressing each 1-sector data block of the volume; but now I think the overhead would be closer to 400% of userdata size... (See the allocation-overhead sketch at the end of this message.)

>> Does this large factor apply to ZVOLs with a fixed block
>> size defined as "small" (i.e. down to the minimum 512b/4k
>> available for these disks)?
>
> NB, there are a few slides in my ZFS tutorials where we talk about this:
> http://www.slideshare.net/relling/usenix-lisa11-tutorial-zfs-a
>
>> 3***) In fact, for the considerations above, what is metadata? :)
>> Is it only the tree of blockpointers, or is it all the two
>> or three dozen block types except userdata (ZPL file, ZVOL
>> block) and unallocated blocks?
>
> It is metadata, and there is quite a variety. For example, there is the MOS,
> zpool history, DSL configuration, etc.

Yes, there's a big table on the DMU page... So metadata is indeed everything except userdata and empty space? ;)

>> AND ON A SIDE NOTE:
>>
>> I do hope to see answers from the gurus on the list to these
>> and other questions I posed recently.
>> I think there is a lot more programming talent in the greater...
>
> Agree 110%
> -- richard

Thanks for the support; I'll look into the videos and the slides+blogs you referenced.

Do you have a chance to comment on-list about the ZFS patent/licensing FUD - how many of the fears have real-life foundations, and what can be dismissed? I.e. after the community makes ZFS even greater, can Oracle or NetApp pull the carpet and claim it's all theirs? :)

//Jim
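The allocation-overhead sketch promised above - a hedged Python transcription of the raidz allocation arithmetic, modeled on vdev_raidz_asize() as found in the OpenSolaris-era sources (treat the details as approximate, not authoritative):

    # Back-of-the-envelope raidz allocation overhead for small blocks.
    # Modeled on vdev_raidz_asize(); approximate, illustration only.

    def raidz_sectors(psize, cols, nparity, ashift=9):
        # Sectors actually allocated for a psize-byte block.
        sectors = ((psize - 1) >> ashift) + 1                    # data sectors
        sectors += nparity * -(-sectors // (cols - nparity))     # parity, per row
        # Round up to a multiple of (nparity + 1) so no unusable
        # sub-stripe holes are left behind on the vdev.
        q = nparity + 1
        return (sectors + q - 1) // q * q

    for psize in (512, 4096, 131072):
        for cols, nparity in ((6, 1), (6, 2)):
            alloc = raidz_sectors(psize, cols, nparity) * 512
            print(f"{psize:>7}B block on {cols}-disk raidz{nparity}: "
                  f"{alloc:>7}B on disk ({alloc * 100 // psize}% of payload)")

On this model a one-sector block on raidz1 allocates two sectors (+100%) and on raidz2 three sectors (+200%), while a 128KB block pays only 20-50% - and dittoed copies double whichever figure applies. That matches the factors Jim suspects for small-blocksize zvols.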
Rich Teer
2012-Jan-16 16:05 UTC
[zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?
On Sun, 15 Jan 2012, Toby Thain wrote:

> Rumours have long circulated, even before the brief public debacle of ZFS in
> OS X - "is it in Leopard... yes it's in... no it's not... yes it's in... oh damn,
> it's really not" - that Apple is building their own clone of ZFS.

I don't know why Apple don't just get off the pot and officially adopt ZFS. I mean, they've embraced DTrace, so what's stopping them from using ZFS too?

-- 
Rich Teer, Publisher
Vinylphile Magazine
www.vinylphilemag.com
David Magda
2012-Jan-16 16:13 UTC
[zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?
On Mon, January 16, 2012 11:05, Rich Teer wrote:

> On Sun, 15 Jan 2012, Toby Thain wrote:
>
>> Rumours have long circulated, even before the brief public debacle of
>> ZFS in OS X - "is it in Leopard... yes it's in... no it's not... yes it's
>> in... oh damn, it's really not" - that Apple is building their own clone
>> of ZFS.
>
> I don't know why Apple don't just get off the pot and officially adopt
> ZFS. I mean, they've embraced DTrace, so what's stopping them from using
> ZFS too?

This was discussed already:

>> [On Sat Oct 24 14:14:19 UTC 2009, David Magda wrote:]
>>
>> Apple can currently just take the ZFS CDDL code and incorporate it
>> (like they did with DTrace), but it may be that they wanted a "private
>> license" from Sun (with appropriate technical support and
>> indemnification), and the two entities couldn't come to mutually
>> agreeable terms.
>
> I cannot disclose details, but that is the essence of it.

http://mail.opensolaris.org/pipermail/zfs-discuss/2009-October/033125.html

Perhaps Apple can come to an agreement with Oracle when they couldn't with Sun.
Bob Friesenhahn
2012-Jan-16 16:22 UTC
[zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?
On Mon, 16 Jan 2012, David Magda wrote:

> http://mail.opensolaris.org/pipermail/zfs-discuss/2009-October/033125.html
>
> Perhaps Apple can come to an agreement with Oracle when they couldn't with
> Sun.

This seems very unlikely, since the future needs of Apple show little requirement for zfs. Apple only offers one computer model which provides ECC and a disk-drive configuration that is marginally useful for zfs. This computer model has a very limited user base, primarily people in the video and desktop imaging/publishing world. Apple already exited the server market, for which they only ever offered a single limited-use model (Xserve).

There would likely be a market if someone were to sell pre-packaged zfs for Apple OS X at a much higher price than the operating system itself.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Freddie Cash
2012-Jan-16 16:40 UTC
[zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?
On Mon, Jan 16, 2012 at 8:22 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> On Mon, 16 Jan 2012, David Magda wrote:
>> http://mail.opensolaris.org/pipermail/zfs-discuss/2009-October/033125.html
>>
>> Perhaps Apple can come to an agreement with Oracle when they couldn't with
>> Sun.
>
> This seems very unlikely, since the future needs of Apple show little
> requirement for zfs. Apple only offers one computer model which provides
> ECC and a disk-drive configuration that is marginally useful for zfs. This
> computer model has a very limited user base, primarily people in the
> video and desktop imaging/publishing world. Apple already exited the server
> market, for which they only ever offered a single limited-use model (Xserve).

As an FS for their TimeMachine NAS boxes (Time Capsule, I think), though, ZFS would be a good fit. Similar to how Time Slider works in Sun/Oracle's version of Nautilus/GNOME2. Especially if they expanded the boxes to use 4 drives (2x mirror) and had the pool pre-configured.

As a desktop/laptop FS, though, ZFS (in its current incarnation) is overkill and unwieldy. Especially since most of these machines only have room for a single HD.

> There would likely be a market if someone were to sell pre-packaged zfs for
> Apple OS X at a much higher price than the operating system itself.

-- 
Freddie Cash
fjwcash at gmail.com
Rich Teer
2012-Jan-16 16:56 UTC
[zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?
On Mon, 16 Jan 2012, Freddie Cash wrote:

> As an FS for their TimeMachine NAS boxes (Time Capsule, I think),
> though, ZFS would be a good fit. Similar to how Time Slider works
> in Sun/Oracle's version of Nautilus/GNOME2. Especially if they expanded
> the boxes to use 4 drives (2x mirror) and had the pool
> pre-configured.

Agreed.

> As a desktop/laptop FS, though, ZFS (in its current incarnation) is
> overkill and unwieldy. Especially since most of these machines only
> have room for a single HD.

I respectfully disagree: end-to-end checksums are always a good thing, and a simple single-drive {desk,lap}top could use a single pool and gain all the benefits of ZFS with none of the "unwieldiness" - although, again, I disagree that ZFS is unwieldy.

>> There would likely be a market if someone were to sell pre-packaged zfs for
>> Apple OS X at a much higher price than the operating system itself.

10's Complement (?) are planning such a thing, although I have no idea of their pricing. The software is still in development.

-- 
Rich Teer, Publisher
Vinylphile Magazine
www.vinylphilemag.com
Chris Ridd
2012-Jan-16 17:25 UTC
[zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?
On 16 Jan 2012, at 16:56, Rich Teer wrote:

> On Mon, 16 Jan 2012, Freddie Cash wrote:
>
>>> There would likely be a market if someone were to sell pre-packaged zfs for
>>> Apple OS X at a much higher price than the operating system itself.
>
> 10's Complement (?) are planning such a thing, although I have no idea
> of their pricing. The software is still in development.

They have announced pricing for 2 of their 4 ZFS products: see <http://tenscomplement.com/our-products>.

Chris
David Magda
2012-Jan-16 17:43 UTC
[zfs-discuss] Apple's ZFS-alike - Re: Does raidzN actually protect against bitrot? If yes - how?
On Mon, January 16, 2012 11:22, Bob Friesenhahn wrote:

> This seems very unlikely, since the future needs of Apple show little
> requirement for zfs. Apple only offers one computer model which
> provides ECC and a disk-drive configuration that is marginally useful
> for zfs. This computer model has a very limited user base, primarily
> people in the video and desktop imaging/publishing world.
> Apple already exited the server market, for which they only ever
> offered a single limited-use model (Xserve).

Having "real" snapshots would certainly help Time Machine. That said, Apple has managed to add on-disk Time Machine snapshots and better encryption in 10.7 (Lion) via a logical file manager:

http://arstechnica.com/apple/reviews/2011/07/mac-os-x-10-7.ars/13

Zpools aren't the only feature in ZFS, after all. Oh well.

> There would likely be a market if someone were to sell pre-packaged zfs
> for Apple OS X at a much higher price than the operating system
> itself.

Already a product:

http://tenscomplement.com/
http://arstechnica.com/apple/news/2011/03/how-zfs-is-slowly-making-its-way-to-mac-os-x.ars