Christo Kutrovsky
2010-Feb-17 00:47 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
Just finished reading the following excellent post:

http://queue.acm.org/detail.cfm?id=1670144

It started me thinking about the best long-term setup for a home server, given a limited number of disk slots (say 10).

I considered simply doing 2-way mirrors. What are the chances of losing both drives of one specific 2-way mirror? What if I do not want to take that chance?

I could always set "copies=2" (or more) on my important datasets, accept some risk, and tolerate such a failure. But chances are, everything that is not copies=2 will have some data on the failed devices, and will be lost.

So I was thinking about how to limit the damage, how to inject some kind of "damage control".

One of the ideas that sparked is a "max devices" property for each dataset, limiting how many mirrored devices a given dataset can be spread across. I mean, if you don't need the performance, you can limit (minimize) the devices, should your capacity allow this.

Imagine this scenario: you lost 2 disks, and unfortunately you lost the 2 sides of a mirror. You have 2 choices to pick from:
- loose entirely Mary's, Gary's and Kelly's "documents", or
- loose a small piece of Everyone's "documents".

This could be implemented via something similar to:
- a read/write property "target device spread"
- a read-only property "achieved device spread", as this will be size-dependent.

Opinions?

Remember, the goal is damage control. I know 2x raidz2 offers better protection for more capacity (although less performance, but that's not the point).
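A minimal Python sketch of the trade-off being described, assuming a 10-slot box built as 5 two-way mirrors, one dataset per user, and a double failure that happens to take out both halves of exactly one mirror (all of these numbers are illustrative assumptions, not part of the proposal itself):

    # Toy model: 5 two-way mirrors lose exactly one mirror pair.
    # Compare a "confined" allocation (each dataset lives on one mirror)
    # with a "spread" allocation (every dataset has blocks on every mirror).
    N_MIRRORS = 5          # 10 slots as 2-way mirrors (assumed)
    N_DATASETS = 5         # one user's "documents" per dataset (assumed)

    # Confined: a dataset dies only if its one mirror is the failed one.
    confined_lost = N_DATASETS * (1 / N_MIRRORS)

    # Spread: every dataset loses roughly the fraction of its blocks
    # that lived on the failed mirror.
    spread_fraction = 1 / N_MIRRORS

    print(f"confined: expect {confined_lost:.0f} dataset lost outright, "
          f"{N_DATASETS - confined_lost:.0f} untouched")
    print(f"spread:   all {N_DATASETS} datasets lose about "
          f"{spread_fraction:.0%} of their blocks")

Whether one whole loss or five partial losses is the better outcome is exactly the policy question being raised here.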
James Dickens
2010-Feb-17 01:10 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Tue, Feb 16, 2010 at 6:47 PM, Christo Kutrovsky <kutrovsky at pythian.com> wrote:

> Just finished reading the following excellent post:
>
> http://queue.acm.org/detail.cfm?id=1670144
>
> [...]
>
> This could be implemented via something similar to:
> - a read/write property "target device spread"
> - a read-only property "achieved device spread"
>
> Opinions?

RAID is not designed to protect data; it is designed to ensure uptime. If you can't afford to lose the data, then you should back it up daily and store more than one copy, with at least one copy being off site. If your site burns to the ground, your data is gone no matter how many disks you have in the system.

After that, you should allocate a number of hot spares to the system in case a drive fails. If you are truly paranoid, a 3-way mirror can be used; then you can lose 2 disks without losing data. Spread disks across multiple controllers, and get disks from different companies and different lots to lessen the likelihood of a bad batch taking out your pool. Replace disks early, as soon as you see disk errors. And above all, back up all data you can't afford to lose.

James Dickens
http://uadmin.blogspot.com
Christo Kutrovsky
2010-Feb-17 01:37 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
Thanks for your feedback, James, but that's not the direction I wanted this discussion to go. The goal was not to create a better solution for an enterprise. The goal was "damage control" in a disk-failure scenario involving data loss.

Back to the original question/idea: which would you prefer, loose a couple of datasets, or loose a little bit of every file in every dataset?
Bob Friesenhahn
2010-Feb-17 02:19 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Tue, 16 Feb 2010, Christo Kutrovsky wrote:

> Just finished reading the following excellent post:
>
> http://queue.acm.org/detail.cfm?id=1670144

A nice article, even if I don't agree with all of its surmises and conclusions. :-) In fact, I would reach a different conclusion.

> I considered simply doing 2-way mirrors. What are the chances of
> losing both drives of one specific 2-way mirror? What if I do not
> want to take that chance?

The probability of whole-drive failure has not increased over the years, and the probability of individual sector failure has diminished substantially. The probability of losing a whole mirror pair has therefore gone down, since the probability of individual drive failure has gone down.

> I could always set "copies=2" (or more) on my important datasets,
> accept some risk, and tolerate such a failure.

I don't believe that "copies=2" buys much at all when using mirrored disks (or raidz). It assumes a concurrence of simultaneous media failures, which is actually quite rare indeed. The "copies=2" setting only buys something when there is no other redundancy available.

> One of the ideas that sparked is a "max devices" property for each
> dataset, limiting how many mirrored devices a given dataset can be
> spread across. I mean, if you don't need the performance, you can
> limit (minimize) the devices, should your capacity allow this.

What you seem to be suggesting is a sort of targeted hierarchical vdev without extra RAID.

> Remember, the goal is damage control. I know 2x raidz2 offers better
> protection for more capacity (although less performance, but that's
> not the point).

It seems that Adam Leventhal's excellent paper reaches the wrong conclusions because it assumes that history is a predictor of the future. However, history is a rather poor predictor in this case. Imagine if 9" floppies had increased their density to support 20GB each (up from 160KB); that did not happen, and now we don't use floppies at all. We have already seen many cases where history was no longer a good predictor of the future, and (as an example) increased integration has brought us multi-core CPUs rather than 20GHz CPUs.

My own conclusions (supported by Adam Leventhal's excellent paper) are that

- maximum device size should be constrained based on its time to resilver, and
- devices are growing too large, and it is about time to transition to the next smaller physical size.

It is unreasonable to spend more than 24 hours resilvering a single drive. It is unreasonable to spend more than 6 days resilvering all of the devices in a RAID group (the 7th day is reserved for the system administrator). It is unreasonable to spend very much time at all on resilvering (using current rotating media), since the resilvering process kills performance.

When looking at the possibility of data failure, it is wise to consider physical issues such as

- a shared power supply,
- a shared chassis,
- a shared physical location, and
- a shared OS kernel or firmware instance,

all of which are very bad for data reliability, since a problem with anything shared can lead to destruction of all copies of the data. In New York City, all of the apartment doors seem to be fitted with three deadbolts, all of which lock into the same flimsy, splintered door frame. It is important to consider each significant system weakness in turn in order to achieve the least chance of loss while providing the best service.

Bob

P.S. NASA is tracking large asteroids and meteors with the hope that they will eventually be able to deflect any which would strike our planet, in an effort to save your precious data.

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
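To put rough numbers behind the resilver-time argument above, a back-of-the-envelope Python sketch; the 2 TB capacity and the two throughput figures are assumptions chosen only to show the arithmetic, not measurements:

    # Rough resilver-time arithmetic (all figures assumed for illustration).
    # A resilver must rewrite roughly the used capacity of the replaced drive.
    capacity_bytes = 2e12     # assumed 2 TB drive
    rates_mb_s = {
        "sequential best case": 100,   # assumed sustained streaming rate
        "busy pool, random I/O": 10,   # assumed rate under production load
    }
    for label, rate in rates_mb_s.items():
        hours = capacity_bytes / (rate * 1e6) / 3600
        print(f"{label}: ~{hours:.1f} hours")
    # sequential best case: ~5.6 hours
    # busy pool, random I/O: ~55.6 hours  (well past a 24-hour bound)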
Bob Friesenhahn
2010-Feb-17 02:24 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Tue, 16 Feb 2010, Christo Kutrovsky wrote:

> The goal was "damage control" in a disk-failure scenario involving
> data loss. Back to the original question/idea:
>
> Which would you prefer, loose a couple of datasets, or loose a
> little bit of every file in every dataset?

This ignores the fact that zfs is based on complex hierarchical data structures which support the user data. When a pool breaks, it is usually because one of these complex data structures has failed, not because user data has failed.

It seems easiest to support your requirement by simply creating another pool. The vast majority of complaints to this list are about pool-wide problems, not files lost to media/disk failure.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Richard Elling
2010-Feb-17 02:28 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Feb 16, 2010, at 4:47 PM, Christo Kutrovsky wrote:

> Just finished reading the following excellent post:
>
> http://queue.acm.org/detail.cfm?id=1670144
>
> It started me thinking about the best long-term setup for a home
> server, given a limited number of disk slots (say 10).
>
> I considered simply doing 2-way mirrors. What are the chances of
> losing both drives of one specific 2-way mirror? What if I do not
> want to take that chance?

The probability of a device failing in a time interval T, given its MTBF (or AFR, but be careful about how the vendors publish such specs [*]), is:

    Pfailure = 1 - e^(-T/MTBF)

So if you have a consumer-grade disk with a 700,000-hour rated MTBF, then over a time period of 1 year (8760 hours) you get:

    Pfailure = 1 - e^(-8760/700000) = 1.24%

> I could always set "copies=2" (or more) on my important datasets,
> accept some risk, and tolerate such a failure.

+1

> But chances are, everything that is not copies=2 will have some data
> on the failed devices, and will be lost.
>
> So I was thinking about how to limit the damage, how to inject some
> kind of "damage control".

The problem is that MTBF measurements are only one part of the picture. Murphy's Law says something will go wrong, so also plan on backups.

> One of the ideas that sparked is a "max devices" property for each
> dataset, limiting how many mirrored devices a given dataset can be
> spread across. [...]
>
> Imagine this scenario: you lost 2 disks, and unfortunately you lost
> the 2 sides of a mirror.

Doing some simple math, and using the simple MTTDL[1] model [**], you can figure the probability of that happening in one year, for a pair of 700k-hour disks and a 24-hour MTTR, as:

    Pfailure = 0.000086%

(trust me, I've got a spreadsheet :-)

> You have 2 choices to pick from:
> - loose entirely Mary's, Gary's and Kelly's "documents", or
> - loose a small piece of Everyone's "documents".
>
> This could be implemented via something similar to:
> - a read/write property "target device spread"
> - a read-only property "achieved device spread", as this will be
>   size-dependent.
>
> Opinions?

I use mirrors. For the important stuff, like my wife's photos and articles, I set copies=2. And I take regular backups via snapshots to multiple disks, some of which are offsite. With an appliance, like NexentaStor, it is trivial to set up a replication scheme between multiple ZFS sites.

> Remember, the goal is damage control. I know 2x raidz2 offers better
> protection for more capacity (although less performance, but that's
> not the point).

Notes:
[*] http://blogs.sun.com/relling/entry/awesome_disk_afr_or_is
[**] http://blogs.sun.com/relling/entry/a_story_of_two_mttdl

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
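The two probabilities quoted above can be reproduced with a few lines of Python; this sketch assumes the same 700,000-hour MTBF and 24-hour MTTR, and uses the simple MTTDL[1] approximation for a two-disk mirror:

    import math

    MTBF = 700_000        # rated drive MTBF in hours (from the post)
    T = 8760              # one year in hours
    MTTR = 24             # assumed mean time to repair/resilver, in hours

    # Probability that a single drive fails within one year:
    p_single = 1 - math.exp(-T / MTBF)
    print(f"single drive, 1 year: {p_single:.2%}")      # ~1.24%

    # MTTDL[1] approximation for a 2-way mirror: data is lost only if the
    # surviving drive fails during the repair window.
    mttdl = MTBF**2 / (2 * MTTR)
    p_mirror = 1 - math.exp(-T / mttdl)
    print(f"2-way mirror, 1 year: {p_mirror:.6%}")      # ~0.000086%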
Daniel Carosone
2010-Feb-17 03:24 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Tue, Feb 16, 2010 at 06:28:05PM -0800, Richard Elling wrote:

> The problem is that MTBF measurements are only one part of the picture.
> Murphy's Law says something will go wrong, so also plan on backups.

+n

> > Imagine this scenario: you lost 2 disks, and unfortunately you lost
> > the 2 sides of a mirror.
>
> Doing some simple math, and using the simple MTTDL[1] model, you can
> figure the probability of that happening in one year, for a pair of
> 700k-hour disks and a 24-hour MTTR, as:
>
>     Pfailure = 0.000086% (trust me, I've got a spreadsheet :-)

Which is close enough to zero, but doesn't consider all the other things that can go wrong: power surges, fire, typing destructive commands in the wrong window, animals and small children, capricious deities, forgetting to run backups, etc. These small numbers just tell you to be more worried about defending against the other stuff.

> > You have 2 choices to pick from:
> > - loose entirely Mary's, Gary's and Kelly's "documents", or
> > - loose a small piece of Everyone's "documents".

Back to the OP's question, it's worth making the distinction here between "lose" as in not-able-to-recover-because-there-are-no-backups, and some data being out of service and inaccessible for a while, until restored. Perhaps this is what "loose" means? :)

If the goal is partitioning "service disruption" rather than "data loss", then splitting things into multiple pools, and even multiple servers, is a valid tactic -- as well as then allowing further opportunities via failover. That's well-covered ground, and one reason it's done at pool level is that it allows concrete reasoning about what will and won't be affected in each scenario. Setting preferences, such as copies or the suggested similar alternatives, will never be able to provide the same concrete assurance.

-- Dan.
Daniel Carosone
2010-Feb-17 03:49 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Tue, Feb 16, 2010 at 04:47:11PM -0800, Christo Kutrovsky wrote:

> One of the ideas that sparked is a "max devices" property for each
> dataset, limiting how many mirrored devices a given dataset can be
> spread across. I mean, if you don't need the performance, you can
> limit (minimize) the devices, should your capacity allow this.

There have been some good responses about better ways to do damage control. I thought I'd respond separately, with a different use case for essentially the same facility.

If your suggestion were to be implemented, it would take the form of a different allocation policy when selecting vdevs and metaslabs for writes. There is scope for several alternate policies addressing different requirements in future development, and some nice XXX comments about "cool stuff could go here" accordingly.

One of these is power saving, with MAID-style pools, whereby the majority of disks (vdevs) in a pool would be idle and spun down most of the time. This requires expressing very similar kinds of preferences for what data goes where (and when).

AIX's LVM (not the nasty Linux knock-off) had similar layout preferences, for different purposes: you could mark LVs with allocation preferences toward the centre of the spindles for performance, or other options, and then re-lay-out the data accordingly. I say "had"; it presumably still does, but I haven't touched it in 15 years or more.

-- Dan.
Christo Kutrovsky
2010-Feb-17 07:57 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
Bob,

Using a separate pool would dictate other limitations, such as not being able to use more space than what's allocated to the pool. You could add space as needed, but you can't freely remove (move) devices.

By using a shared pool with a hint of the desired vdev/space allocation policy, you could have the best of both worlds.
Christo Kutrovsky
2010-Feb-17 08:03 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
Dan,

"loose" was a typo; I meant "lose". Interesting how a typo (write error) can cause a lot of confusion about what exactly I mean :) Resulting in corrupted interpretation.

Note that my idea/proposal is targeted at a growing number of home users. To them, value for money is usually a much more significant factor than it is for other users.
Bob Friesenhahn
2010-Feb-17 14:34 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Wed, 17 Feb 2010, Daniel Carosone wrote:

> These small numbers just tell you to be more worried about defending
> against the other stuff.

Let's not forget that the most common cause of data loss is human error!

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Marty Scholes
2010-Feb-17 17:09 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
Bob Friesenhahn wrote:

> It is unreasonable to spend more than 24 hours resilvering a single
> drive. It is unreasonable to spend more than 6 days resilvering all
> of the devices in a RAID group (the 7th day is reserved for the
> system administrator). It is unreasonable to spend very much time at
> all on resilvering (using current rotating media), since the
> resilvering process kills performance.

Bob, I agree with the vast majority of your post. At the same time, I might disagree with a couple of things.

I don't really care how long a resilver takes (hours, days, months) given a couple of things:
* Sufficient protection exists on the degraded array during the rebuild
** Put another way, the array is NEVER in danger
* The rebuild takes a back seat to production demands

Since I am on a rant, I suspect there is also room for improvement in the scrub. Why would I re-scrub a stripe that was read (and presumably validated) 30 seconds ago by a production application? Wouldn't it make more sense for scrub to "play nice" with production, moving at a leisurely pace and only scrubbing stripes not read in the past X hours/days/weeks/whatever?

I also agree that an ideal pool would lower the device capacity and radically increase the device count. In my perfect world, I would have a RAID set of 200+ cheap, low-latency, low-capacity flash drives backed by an additional N% parity, e.g. 40-ish flash drives. A setup like this would give massive throughput (200x each flash drive), amazing IOPS, and incredible resiliency. Rebuild times would be low due to the lower capacity. One could probably build such a beast in 1U using MicroSDHC cards or some such thing.

End rant.

Cheers,
Marty
Richard Elling
2010-Feb-17 17:34 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Feb 17, 2010, at 9:09 AM, Marty Scholes wrote:

> Bob Friesenhahn wrote:
>> It is unreasonable to spend more than 24 hours resilvering a single
>> drive. [...]
>
> Bob, I agree with the vast majority of your post. At the same time, I
> might disagree with a couple of things.
>
> I don't really care how long a resilver takes (hours, days, months)
> given a couple of things:
> * Sufficient protection exists on the degraded array during the rebuild
> ** Put another way, the array is NEVER in danger
> * The rebuild takes a back seat to production demands
>
> Since I am on a rant, I suspect there is also room for improvement in
> the scrub. Why would I re-scrub a stripe that was read (and presumably
> validated) 30 seconds ago by a production application?

Because scrub reads all copies of the data, not just one.

> Wouldn't it make more sense for scrub to "play nice" with production,
> moving at a leisurely pace and only scrubbing stripes not read in the
> past X hours/days/weeks/whatever?

Scrubs are done at the lowest priority and are throttled. I suppose there could be a knob to adjust the throttle, but then you get into the problem we had with SVM: large devices could take weeks to (resilver) scrub.

From a reliability perspective, there is a weak argument for not scrubbing recent writes, and scrubs are done in txg order (oldest first), so perhaps there is some merit to a scrub limit. However, it is not clear to me that this really buys you anything. Scrubs are rare events, so the impact of shaving a few minutes or hours off the scrub time is low.

> I also agree that an ideal pool would lower the device capacity and
> radically increase the device count.

The ideal pool has one inexpensive, fast, and reliable device :-)

> In my perfect world, I would have a RAID set of 200+ cheap,
> low-latency, low-capacity flash drives backed by an additional N%
> parity, e.g. 40-ish flash drives. A setup like this would give massive
> throughput (200x each flash drive), amazing IOPS, and incredible
> resiliency. Rebuild times would be low due to the lower capacity. One
> could probably build such a beast in 1U using MicroSDHC cards or some
> such thing.

I'm not sure how to connect those into the system (USB 3?), but when you build it, let us know how it works out.

-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
Marty Scholes
2010-Feb-17 17:59 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
I can't stop myself; I have to respond. :-)

Richard wrote:
> The ideal pool has one inexpensive, fast, and reliable device :-)

My ideal pool has become one inexpensive, fast, and reliable "device" built on whatever I choose.

> I'm not sure how to connect those into the system (USB 3?)

Me neither, but if I had to start guessing about host connections, I would probably think FC.

> but when you build it, let us know how it works out.

While it would be a fun project, a toy like that would certainly exceed my feeble hardware experience and even more feeble budget. At the same time, I could make a compelling argument that this sort of arrangement, stripes of flash, is the future of tier-one storage. We are already seeing SSD devices which internally are stripes of flash, and more and more disk farms are taking on the older roles of tape.
Bob Friesenhahn
2010-Feb-17 18:09 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Wed, 17 Feb 2010, Marty Scholes wrote:

> Bob, I agree with the vast majority of your post. At the same time, I
> might disagree with a couple of things.
>
> I don't really care how long a resilver takes (hours, days, months)
> given a couple of things:
> * Sufficient protection exists on the degraded array during the rebuild
> ** Put another way, the array is NEVER in danger
> * The rebuild takes a back seat to production demands

Most data loss is due to human error. To me it seems like disks which take a week to resilver introduce more opportunity for human error. The maintenance operation fades from human memory while it is still underway. If an impeccable log book is not kept and understood, then it is up to (potentially) multiple administrators with varying levels of experience to correctly understand and interpret the output of 'zpool status'.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
martyscholes at yahoo.com
2010-Feb-17 18:57 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
Well said.

-original message-
Subject: Re: [zfs-discuss] Proposed idea for enhancement - damage control
From: Bob Friesenhahn <bfriesen at simple.dallas.tx.us>
Date: 02/17/2010 11:10

> Most data loss is due to human error. To me it seems like disks which
> take a week to resilver introduce more opportunity for human error.
> [...]
Miles Nordin
2010-Feb-17 19:38 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
>>>>> "ck" == Christo Kutrovsky <kutrovsky at pythian.com> writes:

    ck> I could always put "copies=2" (or more) to my important
    ck> datasets and take some risk and tolerate such a failure.

copies=2 has proven to be mostly useless in practice.

If there were a real-world device that tended to randomly flip bits, or randomly replace swaths of LBAs with zeroes, but otherwise behave normally (not return any errors, not slow down retrying reads, not fail to attach), then copies=2 would be really valuable, but so far it seems no such device exists. If you actually explore the errors that really happen, I venture there are few to no cases copies=2 would save you.

One case where such a device appears to exist, but doesn't really, is what I often end up doing for family/friend laptops and external USB drives: wait for drives to start going bad, then rescue them with 'dd conv=noerror,sync', fsck, and hope for ``most of'' the data. copies=2 would help get more out of the rescued drive for some, but not all, of the times I've done this, but there is not much point: Time Machine or rsync backups, or NFS/iSCSI-booting, or zfs send | zfs recv replication to a backup pool, are smarter. I've never been recovering a stupidly-vulnerable drive like that in a situation where I had ZFS on it, so I'm not sure copies=2 will get used here much either.

One particular case of doom: a lot of people want to make two unredundant vdevs and then set 'copies=2' and rely on ZFS's promise to spread the two copies out as much as possible. Then they expect to import the pool with only one of the two vdevs and read ``some but not all'' of the data: ``I understand I won't get all of it, but I just want ZFS to try its best and we'll see.'' Maybe you want to do this instead of a mirror so you can have scratch datasets that consume space at 1/2 the rate they would on a mirror. Nope, nice try, but it won't happen. ZFS is full of all sorts of webbed assertions to ratchet you safely through sane pool states that are regression-testable and supportable, so it will refuse to import a pool that isn't vdev-complete, and no negotiation is possible on this. The dream is a FAQ and the answer is a clear ``No'', followed by ``you'd better test with file vdevs next time you have such a dream.''

    ck> What are the chances for a very specific drive to fail in 2
    ck> way mirror?

This may not be what you mean, but in general single-device redundancy isn't ideal, for two reasons:

 * speculation (although maybe not operational experience?) that modern
   drives are so huge that even a good drive will occasionally develop
   an unreadable spot and still be within its BER spec. So, without
   redundancy, you cannot read anything for sure, even if all the
   drives are ``good''.

 * drives do not immediately turn red and start brrk-brrking when they
   go bad. In the real world, they develop latent sector errors, which
   you will not discover (and mark the drive bad) until you scrub or
   coincidentally happen to read the file that accumulated the error.
   It's possible for a second drive to go bad in the interval you're
   waiting to discover the first. This usually gets retold as ``a drive
   went bad while I was resilvering! What bad luck. If only I could've
   resilvered faster to close this window of vulnerability, I'd not be
   in such a terrible situation'', but the retelling is wrong: what's
   really happening is that a resilver implies a scrub, so it uncovers
   the second bad drive you didn't yet know was bad at the time you
   discovered the first.
Frank Middleton
2010-Feb-17 20:52 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On 02/17/10 02:38 PM, Miles Nordin wrote:

> copies=2 has proven to be mostly useless in practice.

Not true. Take an ancient PC with a mirrored root pool, no bus error checking, and non-ECC memory, that flawlessly passes every known diagnostic (SMC included). Reboot with copies=1 and the same files in /usr/lib will get trashed every time, and you'll have to reboot from some other media to repair it. Set copies=2 (copying all of /usr/lib, of course) and it will reboot every time with no problem, albeit with a varying number of repaired checksum errors, almost always on the same set of files.

Without copies=2 this hardware would be useless (well, it ran Linux just fine), but with it, it has a new lease of life. There is an ancient CR about this, but AFAIK no one has any idea what the problem is or how to fix it. IMO it proves that copies=2 can help avoid data loss in the face of flaky buses and perhaps memory.

I don't think you should be able to lose data on mirrored drives unless both drives fail simultaneously, but with ZFS you can. Certainly, on any machine without ECC memory, or buses without ECC (is parity good enough?), my suggestion would be to set copies=2, and I have it set for critical datasets even on machines with ECC on both. Just waiting for the bus that those SAS controllers are on to burp at the wrong moment...

Is one counter-example enough?

Cheers -- Frank
David Magda
2010-Feb-17 23:53 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Feb 17, 2010, at 12:34, Richard Elling wrote:

> I'm not sure how to connect those into the system (USB 3?), but when
> you build it, let us know how it works out.

FireWire 3200, preferably. Anyone know if USB 3 sucks as much CPU as previous versions? If I'm burning CPU on I/O, I'd rather have it going to checksums than polling cheap-ass USB controllers that need to be babysat.
Casper.Dik at Sun.COM
2010-Feb-18 00:17 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
> If there were a real-world device that tended to randomly flip bits,
> or randomly replace swaths of LBAs with zeroes, but otherwise behave
> normally (not return any errors, not slow down retrying reads, not
> fail to attach), then copies=2 would be really valuable, but so far it
> seems no such device exists. If you actually explore the errors that
> really happen, I venture there are few to no cases copies=2 would save
> you.

I had a device which had 256 bytes of the 32MB broken (some were "1", some were always "0"). But I never put it online because it was so broken.

Casper
Casper.Dik at Sun.COM
2010-Feb-18 00:26 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
>> If there were a real-world device that tended to randomly flip bits,
>> or randomly replace swaths of LBAs with zeroes, but otherwise behave
>> normally [...]
>
> I had a device which had 256 bytes of the 32MB broken (some were "1",
> some were always "0"). But I never put it online because it was so
> broken.

Of the 32MB cache, sorry.

Casper
Daniel Carosone
2010-Feb-18 00:55 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Wed, Feb 17, 2010 at 02:38:04PM -0500, Miles Nordin wrote:

> copies=2 has proven to be mostly useless in practice.

I disagree. Perhaps my cases fit under the weasel word "mostly", but single-disk laptops are a pretty common use case.

> If there were a real-world device that tended to randomly flip bits,
> or randomly replace swaths of LBAs with zeroes

As an aside, there can be non-device causes of this, especially when sharing disks with other operating systems, booting live CDs, and so on.

> * drives do not immediately turn red and start brrk-brrking when they
>   go bad. In the real world, they develop latent sector errors, which
>   you will not discover (and mark the drive bad) until you scrub or
>   coincidentally happen to read the file that accumulated the error.

Yes, exactly - at this point, with copies=1, you get a signal that your drive is about to go bad, and that data has been lost. With copies=2, you get a signal that your drive is about to go bad, but with less disruption and data loss to go with it. Note that pool metadata inherently uses ditto blocks for precisely this reason.

I dunno about the BER spec, but I have seen sectors go unreadable many times. Sometimes, as you note, in combination with other problems or further deterioration, sometimes not. Regardless of what you do in response, and how soon you replace the drive, copies > 1 can cover that interval.

I agree fully that copies=2 is not a substitute for backup replication of whatever flavour you prefer. It is a useful supplement. Misunderstanding this is dangerous.

-- Dan.
Christo Kutrovsky
2010-Feb-18 06:28 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
Dan,

Exactly what I meant: an allocation policy that helps distribute the data in such a way that when one disk (an entire mirror) is lost, some data remains fully accessible, as opposed to not being able to access pieces all over the storage pool.
Miles Nordin
2010-Feb-18 19:55 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
>>>>> "dc" == Daniel Carosone <dan at geek.com.au> writes:

    dc> single-disk laptops are a pretty common use case.

It does not help this case. It helps the case where a single laptop disk fails and you recover it with dd conv=noerror,sync. This case is uncommon because few people know how to do it, or bother. This should not ever be part of your plan, even if you know how to do it, because it will only help maybe half the time: it's silly to invest in this case.

    dc> As an aside, there can be non-device causes of this,
    dc> especially when sharing disks with other operating systems,
    dc> booting live CDs, and so on.

A solution in search of a problem, as opposed to operational experience. The copies= feature is not so new that we need to imagine so optimistically, and in practice the advice ``it's generally not useful'' is the best you can give, because it seems to be misunderstood more often than it's used in a realistically helpful way.

    >> * drives do not immediately turn red and start brrk-brrking
    >> when they go bad. In the real world, they develop latent
    >> sector errors,

    dc> Yes, exactly - at this point, with copies=1, you get a signal
    dc> that your drive is about to go bad, and that data has been
    dc> lost. With copies=2, you get a signal that your drive is
    dc> about to go bad, but with less disruption and data loss to go
    dc> with it.

No. To repeat myself: with copies=2 you get a system that freezes and crashes oddly, sometimes runs for a while, but cannot ever complete a 'zfs send' of the filesystems. With copies=1 you get the exact same thing. Imagination does not match experience. This is what you get even on an x4500: many posters here report that when a disk starts going bad, you need to find it and entirely remove it before you can bother with any kind of recovery.

    dc> I dunno about the BER spec, but I have seen sectors go
    dc> unreadable many times.

Yes. Obviously.

    dc> Regardless of what you do in response, and how soon you
    dc> replace the drive, copies > 1 can cover that interval.

No, you are caught in taxonomic obsession again, because the exposure is not that parts of the disk gradually go bad in a predictable/controllable way, with gradually rising probability and a bit of clumpiness you can avoid by spraying your copies randomly LBA-wise. It's that the disk slowly accumulates software landmines that prevent it from responding to commands in a reasonable way (increasing the response time of each individual command from 30ms to 30 seconds), and that confuse the storage stack above it into seemingly arbitrary and highly controller-dependent odd behavior (causing crashes, or multiplying the 30 seconds to somewhere between 180 seconds and a couple of hours).

Once the disk starts going bad, anything you can recover from it is luck, and aside from disks with maybe one bad sector, where you can note which file you were reading when the system froze, reboot, and not read that file any more, I just don't think it matches experience to believe you will get a chance to read the second copy your copies=2 wrote. Remember, if the machine is still functioning but its performance is reduced 1000-fold, it's functionally equivalent to frozen for all but the most pedantic purposes.
Adam Leventhal
2010-Feb-19 01:22 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
Hey Bob,

> My own conclusions (supported by Adam Leventhal's excellent paper) are
> that
>
> - maximum device size should be constrained based on its time to
>   resilver, and
> - devices are growing too large, and it is about time to transition to
>   the next smaller physical size.

I don't disagree with those conclusions necessarily, but the HDD vendors have significant momentum built up in their efforts to improve density -- that's not going to change in the next 5 years. If they indeed transitioned to reducing physical size while improving density, that would imply many more endpoints to deal with, bigger switches, etc. All reasonable, but there are some significant implications.

> It is unreasonable to spend more than 24 hours resilvering a single
> drive.

Why?

> It is unreasonable to spend very much time at all on resilvering
> (using current rotating media), since the resilvering process kills
> performance.

Maybe, but that depends on how much you rely on your disks for performance.

Adam

--
Adam Leventhal, Fishworks
http://blogs.sun.com/ahl
Bob Friesenhahn
2010-Feb-19 02:39 UTC
[zfs-discuss] Proposed idea for enhancement - damage control
On Thu, 18 Feb 2010, Adam Leventhal wrote:

>> It is unreasonable to spend more than 24 hours resilvering a single
>> drive.
>
> Why?

Human factors. People usually go to work once per day, so it makes sense that they should be able to perform at least one maintenance action per day. Ideally the resilver should complete within an 8-hour work day, so that one maintenance action can be performed in the morning and another in the evening. Computers should be there to serve the attendant humans, not the other way around. :-)

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/