Hello,

There have been comparisons posted here (and in general out there on the
net) for various RAID levels and the chances of e.g. double failures. One
problem that is rarely addressed, though, is the various edge cases that
significantly impact the probability of data loss.

In particular, I am concerned about the relative likelihood of bad sectors
on a drive vs. entire-drive failure. On a raidz where uptime is not
important, I would not want a dead drive + a single bad sector on another
drive to cause loss of data, yet dead drive + bad sector is going to be a
lot more likely than two dead drives within the same time window (a rough
back-of-the-envelope comparison follows this message). In many situations
it may not feel worth it to move to raidz2 just to avoid this particular
case.

I would like a pool policy that lets one specify that at the moment a disk
fails (where "fails" = considered faulty), all mutable I/O is immediately
stopped (returning I/O errors to userspace, I presume), and any transaction
in the process of being committed is rolled back. The result is that the
drive that just failed will not immediately go out of date.

If one then triggers a bad block on another drive while resilvering onto a
replacement drive, you know that you have the failed drive as a last resort
(given that a full-drive failure is unlikely to mean the drive was
physically obliterated; perhaps the controller circuitry can be replaced,
or certain physical components can be replaced). In the case of raidz2, you
effectively have another "half" level of redundancy.

Also, with either raidz or raidz2 one can imagine cases where a machine is
booted with one or two drives missing (due to cabling issues, for example);
guaranteeing that no pool is ever online for writable operations (thus
making absent drives out of date) until the administrator explicitly asks
for it would greatly reduce the probability of data loss due to a bad block
in this case as well.

In short, if true irrevocable data loss is limited (assuming no software
issues) to the complete obliteration of all data on n drives (for n levels
of redundancy), or alternatively to the unlikely event of bad blocks
coinciding on multiple drives, wouldn't reliability be significantly
increased in cases where this is an acceptable practice?

Opinions?

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
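A rough back-of-the-envelope version of the comparison above, as a sketch:
given one dead drive in a four-disk raidz, compare the chance of a second
whole-drive death during the resilver window against the chance of hitting a
single unrecoverable sector while reading the survivors in full. All numbers
below (MTTF, window length, error rate, capacity) are illustrative,
datasheet-style assumptions, not measurements.

# Rough comparison, given one dead drive in a 4-disk raidz: during the
# resilver, which is more likely -- a second whole-drive death, or a
# single unrecoverable (bad) sector on one of the survivors?
# All numbers are illustrative assumptions, not measurements.

mttf_hours = 500000.0   # assumed per-drive MTTF, datasheet-style
resilver_hours = 48.0   # assumed replace-and-resilver window
uer = 1e-14             # assumed unrecoverable errors per bit read
drive_bytes = 500e9     # assumed drive capacity (500 GB)
survivors = 3           # remaining drives in a 4-disk raidz

# P(some survivor dies during the resilver window), first-order.
p_second_death = 1 - (1 - resilver_hours / mttf_hours) ** survivors

# P(at least one unrecoverable read while reading every survivor in
# full to reconstruct the failed drive).
bits_read = survivors * drive_bytes * 8
p_bad_sector = 1 - (1 - uer) ** bits_read

print("P(second drive death during resilver): %.1e" % p_second_death)
print("P(bad sector during resilver):         %.1e" % p_bad_sector)

With these assumptions the bad-sector case comes out roughly 0.1 versus
roughly 3e-4 for the second drive death -- two to three orders of magnitude
more likely, which is exactly the asymmetry the proposal targets.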
Peter Schuller wrote:
> Hello,
>
> There have been comparisons posted here (and in general out there on the
> net) for various RAID levels and the chances of e.g. double failures. One
> problem that is rarely addressed, though, is the various edge cases that
> significantly impact the probability of data loss.

I agree 110%.

> In particular, I am concerned about the relative likelihood of bad sectors
> on a drive vs. entire-drive failure. On a raidz where uptime is not
> important, I would not want a dead drive + a single bad sector on another
> drive to cause loss of data, yet dead drive + bad sector is going to be a
> lot more likely than two dead drives within the same time window.

I covered that topic in
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl
where I described a model which does take into account the probability of an
unrecoverable read during reconstruction (a first-order sketch of that kind
of model follows this message). In general, the problem is that the average
person can't get some of the detailed information that a more complex model
would require. The models I describe use data that you can glean from
datasheets, or that follows a reasonable extrapolation from historical data.

> In many situations it may not feel worth it to move to raidz2 just to
> avoid this particular case.

I can't think of any, but then again, I get paid to worry about failures :-)

> I would like a pool policy that lets one specify that at the moment a disk
> fails (where "fails" = considered faulty), all mutable I/O is immediately
> stopped (returning I/O errors to userspace, I presume), and any
> transaction in the process of being committed is rolled back. The result
> is that the drive that just failed will not immediately go out of date.
>
> If one then triggers a bad block on another drive while resilvering onto a
> replacement drive, you know that you have the failed drive as a last
> resort (given that a full-drive failure is unlikely to mean the drive was
> physically obliterated; perhaps the controller circuitry can be replaced,
> or certain physical components can be replaced). In the case of raidz2,
> you effectively have another "half" level of redundancy.

Please correct me if I misunderstand your reasoning: are you saying that a
broken disk should not be replaced? If so, then that is contrary to the
accepted methods used in most mission-critical systems. There may be other
methods which meet your requirements and are accepted. For example, one
procedure we see for those sites who are very interested in data retention
is to power off a system when it is degraded to a point (as specified) where
data retention is put at unacceptable risk. The theory is that a powered-down
system will stop wearing out. When the system is serviced, it can be brought
back online. Obviously, this is not the case where data availability is a
primary requirement -- data retention has higher priority.

> Also, with either raidz or raidz2 one can imagine cases where a machine is
> booted with one or two drives missing (due to cabling issues, for
> example); guaranteeing that no pool is ever online for writable operations
> (thus making absent drives out of date) until the administrator explicitly
> asks for it would greatly reduce the probability of data loss due to a bad
> block in this case as well.
>
> In short, if true irrevocable data loss is limited (assuming no software
> issues) to the complete obliteration of all data on n drives (for n levels
> of redundancy), or alternatively to the unlikely event of bad blocks
> coinciding on multiple drives, wouldn't reliability be significantly
> increased in cases where this is an acceptable practice?
>
> Opinions?

We can already set a pool (actually the file systems in a pool) to be
read-only. I think that what may be lurking here is the fact that all of the
RAS features are not yet implemented, hence the issue of a corrupted zpool
may cause a panic. Clearly, such issues are short-term and will be fixed
anyway.

There may be something else lurking here that we might be able to take
advantage of, at least for some cases. Since ZFS is COW, it doesn't have the
same "data loss" profile as other file systems, like UFS, which can
overwrite the data, making reconstruction difficult or impossible. But while
this might be useful for forensics, the general case is perhaps largely
covered by the existing snapshot features.

N.B. I do have a lot of field data on failures and failure rates. It is
often difficult to grok without having a clear objective in mind. We may be
able to agree on a set of questions which would quantify the need for your
ideas. Feel free to contact me directly.

-- richard
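For reference, a first-order sketch of the two kinds of models contrasted
above; these are generic textbook forms under the stated symbols, not
necessarily the exact formulation in the blog post. Assume N disks, per-drive
mean time to failure MTTF, mean time to repair MTTR, drive capacity B bits,
and an unrecoverable error rate UER in errors per bit read. A model counting
only overlapping whole-drive failures gives

\[
\mathrm{MTTDL}_{1} \;=\; \frac{\mathrm{MTTF}^{2}}{N\,(N-1)\,\mathrm{MTTR}}
\]

while adding the chance of an unrecoverable read during reconstruction gives

\[
P_{\mathrm{recon}} \;\approx\; 1 - (1 - \mathrm{UER})^{(N-1)B},
\qquad
\mathrm{MTTDL}_{2} \;\approx\; \frac{\mathrm{MTTF}}{N \cdot P_{\mathrm{recon}}}
\]

For current commodity drives, (N-1)B x UER is of order 0.1 or more, so
MTTDL_2 is usually the smaller (more pessimistic) figure, matching the
intuition in the original post.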
>> In many situations it may not feel worth it to move to raidz2 just to
>> avoid this particular case.
>
> I can't think of any, but then again, I get paid to worry about failures
> :-)

Given that one of the touted features of ZFS is data integrity, including in
the case of cheap drives, that implies it is of interest to get maximum
integrity with any given amount of resources.

In your typical home-use situation, for example, buying 4 drives of decent
size is pretty expensive considering that it *is* home use. Getting 4 drives
for the disk space of 3 is a lot more attractive than 5 drives for the disk
space of 3. But given that you do get 4 drives and put them in a raidz, you
want as much safety as possible, and often you don't care that much about
availability.

That said, the argument scales. If you're not in a situation like the above,
you may easily warrant "wasting" an extra drive on raidz2. But raidz2
without this feature is still less safe than raidz2 with the feature. So,
moving back to the idea of getting as much redundancy as possible given a
certain set of hardware resources, you're still not optimal given your
hardware.

> Please correct me if I misunderstand your reasoning: are you saying that a
> broken disk should not be replaced?

Sorry, no. However, I realize my desire actually requires an additional
feature. The situation I envision is this:

* One disk goes down in a raidz, because the controller suddenly broke
(platters/heads are fine).

* You replace the disk and start a resilvering.

* You trigger a bad block. At this point, you are now pretty screwed,
unless:

* The pool did not change after the original drive failed, AND a "broken
drive assisted" resilvering is supported. You go to whatever effort is
required to fix the disk (say, buy another one of the same model and replace
the controller, or hire some company that does this stuff) and re-insert it
into the machine.

* At this point you have a drive you can read data off of, but that you
certainly don't trust in general. So you want to start replacing the drive
with the new drive; if ZFS were then able to resilver to the new drive by
using both parity data on other healthy drives in the pool and the disk
being replaced, you're happy.

Or let's do a more likely scenario. A disk starts dying because of bad
sectors (the disk has run out of remapping possibilities). You cannot fix
this anymore by re-writing the bad sectors; trying to re-write the sector
ends up failing with an I/O error and ZFS kicks the disk out of the pool.

Standard procedure at this point is to replace the drive and resilver. But
once again - you might end up with a bad sector on another drive. Without
utilizing the existing broken drive, you're screwed. If however you were
able to take advantage of sectors that *ARE* readable off of the drive, and
the drive has *NOT* gone out of date since it was kicked out due to
additional transaction commits, you are once again happy.

(Once again assuming you don't happen to have bad luck and the set of bad
sectors on the two drives overlaps.)

> If so, then that is contrary to the accepted methods used in most
> mission-critical systems. There may be other methods which meet your
> requirements and are accepted. For example, one procedure we see for those
> sites who are very interested in data retention is to power off a system
> when it is degraded to a point (as specified) where data retention is put
> at unacceptable risk.

This is kind of what I am after, except that I want to guarantee that not a
single transaction gets committed once a pool is degraded. Even if an admin
goes and turns the machine off, the failed disk will already be out of date.

> The theory is that a powered-down system will stop wearing out. When the
> system is serviced, it can be brought back online. Obviously, this is not
> the case where data availability is a primary requirement -- data
> retention has higher priority.

On the other hand, hardware has a nasty tendency to break in relation to
power cycles...

> We can already set a pool (actually the file systems in a pool) to be
> read-only.

Automatically and *immediately* on a drive failure?

> There may be something else lurking here that we might be able to take
> advantage of, at least for some cases. Since ZFS is COW, it doesn't have
> the same "data loss" profile as other file systems, like UFS, which can
> overwrite the data, making reconstruction difficult or impossible. But
> while this might be useful for forensics, the general case is perhaps
> largely covered by the existing snapshot features.

Heh, in an ideal world: have ZFS automatically create a snapshot when a pool
degrades. The normal case is then continued read/write operation. But if you
DO end up with a bad situation and a double bad sector, you could then
resilver based on the snapshot (from the perspective of which the partially
failed drive is up to date) and at least get back an older version of the
data, rather than no data at all. This would be a compromise for when you
cannot afford immediate read-only mode for availability reasons (a rough
userland sketch of such a policy follows this message).

I suppose that if a resilvering can be performed relative to any arbitrary
node considered the root node, it might even be realistic to implement?

> N.B. I do have a lot of field data on failures and failure rates. It is
> often difficult to grok without having a clear objective in mind. We may
> be able to agree on a set of questions which would quantify the need for
> your ideas. Feel free to contact me directly.

Thanks. It's not that I have any particular situation where this becomes
more important than usual. It is just a general observation of a behavior
which, in cases where availability is not important, is sub-optimal from a
data-safety perspective. The only reason I even brought it up was the focus
on data integrity that we see with ZFS.

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
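A minimal userland sketch of the degrade-triggered policy described above,
assuming a pool named tank (a hypothetical name) and only stock zpool/zfs
commands. Because it polls rather than receiving events, it can only
approximate "immediately": transactions can still commit inside the polling
window, so the real guarantee would have to live in the pool itself.

#!/usr/bin/env python
# Minimal polling sketch: on the first transition away from ONLINE,
# take a recursive snapshot and set the pool's file systems read-only.
# The pool name and poll interval are assumptions.
import subprocess
import time

POOL = "tank"     # hypothetical pool name
INTERVAL = 5      # seconds between polls; this is the exposure window

def pool_health(pool):
    # 'zpool list -H -o health <pool>' prints e.g. ONLINE or DEGRADED.
    out = subprocess.check_output(
        ["zpool", "list", "-H", "-o", "health", pool])
    return out.decode().strip()

def freeze(pool):
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Recursive snapshot: a rollback point from whose perspective the
    # failed drive is still up to date.
    subprocess.check_call(
        ["zfs", "snapshot", "-r", "%s@degraded-%s" % (pool, stamp)])
    # Stop further mutation; readonly=on on the top-level file system
    # is inherited by its descendants.
    subprocess.check_call(["zfs", "set", "readonly=on", pool])

if __name__ == "__main__":
    while pool_health(POOL) == "ONLINE":
        time.sleep(INTERVAL)
    freeze(POOL)

Setting readonly=on on the top-level file system is about as close as
userland gets to freezing the pool without support inside ZFS itself.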
comment below...

Peter Schuller wrote:
>>> In many situations it may not feel worth it to move to raidz2 just to
>>> avoid this particular case.
>> I can't think of any, but then again, I get paid to worry about failures
>> :-)
>
> Given that one of the touted features of ZFS is data integrity, including
> in the case of cheap drives, that implies it is of interest to get maximum
> integrity with any given amount of resources.
>
> In your typical home-use situation, for example, buying 4 drives of decent
> size is pretty expensive considering that it *is* home use. Getting 4
> drives for the disk space of 3 is a lot more attractive than 5 drives for
> the disk space of 3. But given that you do get 4 drives and put them in a
> raidz, you want as much safety as possible, and often you don't care that
> much about availability.
>
> That said, the argument scales. If you're not in a situation like the
> above, you may easily warrant "wasting" an extra drive on raidz2. But
> raidz2 without this feature is still less safe than raidz2 with the
> feature. So, moving back to the idea of getting as much redundancy as
> possible given a certain set of hardware resources, you're still not
> optimal given your hardware.
>
>> Please correct me if I misunderstand your reasoning: are you saying that
>> a broken disk should not be replaced?
>
> Sorry, no. However, I realize my desire actually requires an additional
> feature. The situation I envision is this:
>
> * One disk goes down in a raidz, because the controller suddenly broke
> (platters/heads are fine).
>
> * You replace the disk and start a resilvering.
>
> * You trigger a bad block. At this point, you are now pretty screwed,
> unless:
>
> * The pool did not change after the original drive failed, AND a "broken
> drive assisted" resilvering is supported. You go to whatever effort is
> required to fix the disk (say, buy another one of the same model and
> replace the controller, or hire some company that does this stuff) and
> re-insert it into the machine.
>
> * At this point you have a drive you can read data off of, but that you
> certainly don't trust in general. So you want to start replacing the drive
> with the new drive; if ZFS were then able to resilver to the new drive by
> using both parity data on other healthy drives in the pool and the disk
> being replaced, you're happy.

It is my understanding that zpool replace already does this. Just don't
remove the failing disk (see the sketch after this message)...

> Or let's do a more likely scenario. A disk starts dying because of bad
> sectors (the disk has run out of remapping possibilities). You cannot fix
> this anymore by re-writing the bad sectors; trying to re-write the sector
> ends up failing with an I/O error and ZFS kicks the disk out of the pool.
>
> Standard procedure at this point is to replace the drive and resilver. But
> once again - you might end up with a bad sector on another drive. Without
> utilizing the existing broken drive, you're screwed. If however you were
> able to take advantage of sectors that *ARE* readable off of the drive,
> and the drive has *NOT* gone out of date since it was kicked out due to
> additional transaction commits, you are once again happy.
>
> (Once again assuming you don't happen to have bad luck and the set of bad
> sectors on the two drives overlaps.)

... I think I was off base previously. It seems to me that you are really
after the policy for failing/failed disks. Currently, the only way a drive
gets "kicked out" is if ZFS cannot open it. Obviously, if ZFS cannot open
the drive, then you won't be able to read anything from it. Looking forward,
I think that there are several policies which may be desired...

>> If so, then that is contrary to the accepted methods used in most
>> mission-critical systems. There may be other methods which meet your
>> requirements and are accepted. For example, one procedure we see for
>> those sites who are very interested in data retention is to power off a
>> system when it is degraded to a point (as specified) where data retention
>> is put at unacceptable risk.
>
> This is kind of what I am after, except that I want to guarantee that not
> a single transaction gets committed once a pool is degraded. Even if an
> admin goes and turns the machine off, the failed disk will already be out
> of date.

... such as a policy that says "if a disk is going bad, go read-only." I'm
quite sure that most applications won't respond well to such a policy,
though.

>> The theory is that a powered-down system will stop wearing out. When the
>> system is serviced, it can be brought back online. Obviously, this is not
>> the case where data availability is a primary requirement -- data
>> retention has higher priority.
>
> On the other hand, hardware has a nasty tendency to break in relation to
> power cycles...
>
>> We can already set a pool (actually the file systems in a pool) to be
>> read-only.
>
> Automatically and *immediately* on a drive failure?

You can listen to sysevents and implement policies.

>> There may be something else lurking here that we might be able to take
>> advantage of, at least for some cases. Since ZFS is COW, it doesn't have
>> the same "data loss" profile as other file systems, like UFS, which can
>> overwrite the data, making reconstruction difficult or impossible. But
>> while this might be useful for forensics, the general case is perhaps
>> largely covered by the existing snapshot features.
>
> Heh, in an ideal world: have ZFS automatically create a snapshot when a
> pool degrades. The normal case is then continued read/write operation. But
> if you DO end up with a bad situation and a double bad sector, you could
> then resilver based on the snapshot (from the perspective of which the
> partially failed drive is up to date) and at least get back an older
> version of the data, rather than no data at all. This would be a
> compromise for when you cannot afford immediate read-only mode for
> availability reasons.
>
> I suppose that if a resilvering can be performed relative to any arbitrary
> node considered the root node, it might even be realistic to implement?

If I understand correctly, resilvering occurs at the zpool, not the file
system, level. I think a better policy may be to initiate a scrub when a
failed read occurs (along with a SERD policy). The scrub will remap bad
blocks that it can recover.

>> N.B. I do have a lot of field data on failures and failure rates. It is
>> often difficult to grok without having a clear objective in mind. We may
>> be able to agree on a set of questions which would quantify the need for
>> your ideas. Feel free to contact me directly.
>
> Thanks. It's not that I have any particular situation where this becomes
> more important than usual. It is just a general observation of a behavior
> which, in cases where availability is not important, is sub-optimal from a
> data-safety perspective. The only reason I even brought it up was the
> focus on data integrity that we see with ZFS.

In any case, this is a job for ZFS+FMA integration.

-- richard
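A sketch of the in-place replacement mentioned above ("just don't remove the
failing disk"), followed by a verifying scrub. The pool and device names
(tank, c1t2d0, c1t3d0) are hypothetical, and the wait loop string-matches
zpool status output, which is fragile but adequate for illustration.

# Sketch of an in-place replacement: leave the failing disk attached so
# that any sectors it can still serve remain available to the resilver,
# then scrub once the resilver completes. Names are hypothetical.
import subprocess
import time

POOL = "tank"                    # hypothetical pool name
OLD, NEW = "c1t2d0", "c1t3d0"    # hypothetical failing disk and successor

# Start the replacement *without* detaching the old disk first.
subprocess.check_call(["zpool", "replace", POOL, OLD, NEW])

# Crude wait for the resilver to finish before scrubbing.
while b"resilver in progress" in subprocess.check_output(
        ["zpool", "status", POOL]):
    time.sleep(60)

# Verify the result, per the scrub-on-failed-read idea above.
subprocess.check_call(["zpool", "scrub", POOL])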