Hello,

There have been comparisons posted here (and in general out there on the
net) for various RAID levels and the chances of e.g. double failures. One
problem that is rarely addressed, though, is the various edge cases that
significantly impact the probability of data loss.

In particular, I am concerned about the relative likelihood of bad sectors
on a drive vs. entire-drive failure. On a raidz where uptime is not
important, I would not want a dead drive + a single bad sector on another
drive to cause loss of data, yet dead drive + bad sector is going to be a
lot more likely than two dead drives within the same time window (a rough
back-of-the-envelope comparison follows this message). In many situations
it may not feel worth it to move to raidz2 just to avoid this particular
case.

I would like a pool policy that lets one specify that at the moment a disk
fails (where "fails" = considered faulty), all mutable I/O is immediately
stopped (returning I/O errors to userspace, I presume), and any transaction
in the process of being committed is rolled back. The result is that the
drive that just failed will not immediately go out of date.

If one then triggers a bad block on another drive while resilvering onto a
replacement drive, you know that you have the failed drive as a last resort
(given that a full-drive failure is unlikely to mean the drive was
physically obliterated; perhaps the controller circuitry can be replaced,
or certain physical components can be replaced). In the case of raidz2, you
effectively have another "half" level of redundancy.

Also, with either raidz or raidz2 one can imagine cases where a machine is
booted with one or two drives missing (due to cabling issues, for example);
guaranteeing that no pool is ever online for writable operations (thus
making absent drives out of date) until the administrator explicitly asks
for it would greatly reduce the probability of data loss due to a bad block
in this case as well.

In short, if true irrevocable data loss is limited (assuming no software
issues) to the complete obliteration of all data on n drives (for n levels
of redundancy), or alternatively to the unlikely event of bad blocks
coinciding on multiple drives, wouldn't reliability be significantly
increased in cases where this is an acceptable practice?

Opinions?

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
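A rough back-of-the-envelope version of the comparison above, as a sketch:
given one dead drive in a four-disk raidz, compare the chance of a second
whole-drive death during the resilver window against the chance of hitting a
single unrecoverable sector while reading the survivors in full. All numbers
below (MTTF, window length, error rate, capacity) are illustrative,
datasheet-style assumptions, not measurements.

# Rough comparison, given one dead drive in a 4-disk raidz: during the
# resilver, which is more likely -- a second whole-drive death, or a
# single unrecoverable (bad) sector on one of the survivors?
# All numbers are illustrative assumptions, not measurements.

mttf_hours = 500000.0   # assumed per-drive MTTF, datasheet-style
resilver_hours = 48.0   # assumed replace-and-resilver window
uer = 1e-14             # assumed unrecoverable errors per bit read
drive_bytes = 500e9     # assumed drive capacity (500 GB)
survivors = 3           # remaining drives in a 4-disk raidz

# P(some survivor dies during the resilver window), first-order.
p_second_death = 1 - (1 - resilver_hours / mttf_hours) ** survivors

# P(at least one unrecoverable read while reading every survivor in
# full to reconstruct the failed drive).
bits_read = survivors * drive_bytes * 8
p_bad_sector = 1 - (1 - uer) ** bits_read

print("P(second drive death during resilver): %.1e" % p_second_death)
print("P(bad sector during resilver):         %.1e" % p_bad_sector)

With these assumptions the bad-sector case comes out roughly 0.1 versus
roughly 3e-4 for the second drive death -- two to three orders of magnitude
more likely, which is exactly the asymmetry the proposal targets.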
Peter Schuller wrote:
> Hello,
>
> There have been comparisons posted here (and in general out there on the
> net) for various RAID levels and the chances of e.g. double failures. One
> problem that is rarely addressed, though, is the various edge cases that
> significantly impact the probability of data loss.

I agree 110%.

> In particular, I am concerned about the relative likelihood of bad sectors
> on a drive vs. entire-drive failure. On a raidz where uptime is not
> important, I would not want a dead drive + a single bad sector on another
> drive to cause loss of data, yet dead drive + bad sector is going to be a
> lot more likely than two dead drives within the same time window.

I covered that topic in
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl
where I described a model which does take into account the probability of an
unrecoverable read during reconstruction (a first-order sketch of that kind
of model follows this message). In general, the problem is that the average
person can't get some of the detailed information that a more complex model
would require. The models I describe use data that you can glean from
datasheets, or that follows a reasonable extrapolation from historical data.

> In many situations it may not feel worth it to move to raidz2 just to
> avoid this particular case.

I can't think of any, but then again, I get paid to worry about failures :-)

> I would like a pool policy that lets one specify that at the moment a disk
> fails (where "fails" = considered faulty), all mutable I/O is immediately
> stopped (returning I/O errors to userspace, I presume), and any
> transaction in the process of being committed is rolled back. The result
> is that the drive that just failed will not immediately go out of date.
>
> If one then triggers a bad block on another drive while resilvering onto a
> replacement drive, you know that you have the failed drive as a last
> resort (given that a full-drive failure is unlikely to mean the drive was
> physically obliterated; perhaps the controller circuitry can be replaced,
> or certain physical components can be replaced). In the case of raidz2,
> you effectively have another "half" level of redundancy.

Please correct me if I misunderstand your reasoning: are you saying that a
broken disk should not be replaced? If so, then that is contrary to the
accepted methods used in most mission-critical systems. There may be other
methods which meet your requirements and are accepted. For example, one
procedure we see for those sites who are very interested in data retention
is to power off a system when it is degraded to a point (as specified) where
data retention is put at unacceptable risk. The theory is that a powered-down
system will stop wearing out. When the system is serviced, it can be brought
back online. Obviously, this is not the case where data availability is a
primary requirement -- data retention has higher priority.

> Also, with either raidz or raidz2 one can imagine cases where a machine is
> booted with one or two drives missing (due to cabling issues, for
> example); guaranteeing that no pool is ever online for writable operations
> (thus making absent drives out of date) until the administrator explicitly
> asks for it would greatly reduce the probability of data loss due to a bad
> block in this case as well.
>
> In short, if true irrevocable data loss is limited (assuming no software
> issues) to the complete obliteration of all data on n drives (for n levels
> of redundancy), or alternatively to the unlikely event of bad blocks
> coinciding on multiple drives, wouldn't reliability be significantly
> increased in cases where this is an acceptable practice?
>
> Opinions?

We can already set a pool (actually the file systems in a pool) to be
read-only. I think that what may be lurking here is the fact that all of the
RAS features are not yet implemented, hence the issue of a corrupted zpool
may cause a panic. Clearly, such issues are short-term and will be fixed
anyway.

There may be something else lurking here that we might be able to take
advantage of, at least for some cases. Since ZFS is COW, it doesn't have the
same "data loss" profile as other file systems, like UFS, which can
overwrite the data, making reconstruction difficult or impossible. But while
this might be useful for forensics, the general case is perhaps largely
covered by the existing snapshot features.

N.B. I do have a lot of field data on failures and failure rates. It is
often difficult to grok without having a clear objective in mind. We may be
able to agree on a set of questions which would quantify the need for your
ideas. Feel free to contact me directly.

-- richard
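For reference, a first-order sketch of the two kinds of models contrasted
above; these are generic textbook forms under the stated symbols, not
necessarily the exact formulation in the blog post. Assume N disks, per-drive
mean time to failure MTTF, mean time to repair MTTR, drive capacity B bits,
and an unrecoverable error rate UER in errors per bit read. A model counting
only overlapping whole-drive failures gives

\[
\mathrm{MTTDL}_{1} \;=\; \frac{\mathrm{MTTF}^{2}}{N\,(N-1)\,\mathrm{MTTR}}
\]

while adding the chance of an unrecoverable read during reconstruction gives

\[
P_{\mathrm{recon}} \;\approx\; 1 - (1 - \mathrm{UER})^{(N-1)B},
\qquad
\mathrm{MTTDL}_{2} \;\approx\; \frac{\mathrm{MTTF}}{N \cdot P_{\mathrm{recon}}}
\]

For current commodity drives, (N-1)B x UER is of order 0.1 or more, so
MTTDL_2 is usually the smaller (more pessimistic) figure, matching the
intuition in the original post.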
>> In many situations it may not feel worth it to move to raidz2 just to
>> avoid this particular case.
>
> I can't think of any, but then again, I get paid to worry about failures
> :-)

Given that one of the touted features of ZFS is data integrity, including in
the case of cheap drives, that implies it is of interest to get maximum
integrity with any given amount of resources.

In your typical home-use situation, for example, buying 4 drives of decent
size is pretty expensive considering that it *is* home use. Getting 4 drives
for the disk space of 3 is a lot more attractive than 5 drives for the disk
space of 3. But given that you do get 4 drives and put them in a raidz, you
want as much safety as possible, and often you don't care that much about
availability.

That said, the argument scales. If you're not in a situation like the above,
you may easily warrant "wasting" an extra drive on raidz2. But raidz2
without this feature is still less safe than raidz2 with the feature. So,
moving back to the idea of getting as much redundancy as possible given a
certain set of hardware resources, you're still not optimal given your
hardware.

> Please correct me if I misunderstand your reasoning: are you saying that a
> broken disk should not be replaced?

Sorry, no. However, I realize my desire actually requires an additional
feature. The situation I envision is this:

* One disk goes down in a raidz, because the controller suddenly broke
(platters/heads are fine).

* You replace the disk and start a resilvering.

* You trigger a bad block. At this point, you are now pretty screwed,
unless:

* The pool did not change after the original drive failed, AND a "broken
drive assisted" resilvering is supported. You go to whatever effort is
required to fix the disk (say, buy another one of the same model and replace
the controller, or hire some company that does this stuff) and re-insert it
into the machine.

* At this point you have a drive you can read data off of, but that you
certainly don't trust in general. So you want to start replacing the drive
with the new drive; if ZFS were then able to resilver to the new drive by
using both parity data on other healthy drives in the pool and the disk
being replaced, you're happy.

Or let's do a more likely scenario. A disk starts dying because of bad
sectors (the disk has run out of remapping possibilities). You cannot fix
this anymore by re-writing the bad sectors; trying to re-write the sector
ends up failing with an I/O error and ZFS kicks the disk out of the pool.

Standard procedure at this point is to replace the drive and resilver. But
once again - you might end up with a bad sector on another drive. Without
utilizing the existing broken drive, you're screwed. If however you were
able to take advantage of sectors that *ARE* readable off of the drive, and
the drive has *NOT* gone out of date since it was kicked out due to
additional transaction commits, you are once again happy.

(Once again assuming you don't happen to have bad luck and the set of bad
sectors on the two drives overlaps.)

> If so, then that is contrary to the accepted methods used in most
> mission-critical systems. There may be other methods which meet your
> requirements and are accepted. For example, one procedure we see for those
> sites who are very interested in data retention is to power off a system
> when it is degraded to a point (as specified) where data retention is put
> at unacceptable risk.

This is kind of what I am after, except that I want to guarantee that not a
single transaction gets committed once a pool is degraded. Even if an admin
goes and turns the machine off, the failed disk will already be out of date.

> The theory is that a powered-down system will stop wearing out. When the
> system is serviced, it can be brought back online. Obviously, this is not
> the case where data availability is a primary requirement -- data
> retention has higher priority.

On the other hand, hardware has a nasty tendency to break in relation to
power cycles...

> We can already set a pool (actually the file systems in a pool) to be
> read-only.

Automatically and *immediately* on a drive failure?

> There may be something else lurking here that we might be able to take
> advantage of, at least for some cases. Since ZFS is COW, it doesn't have
> the same "data loss" profile as other file systems, like UFS, which can
> overwrite the data, making reconstruction difficult or impossible. But
> while this might be useful for forensics, the general case is perhaps
> largely covered by the existing snapshot features.

Heh, in an ideal world: have ZFS automatically create a snapshot when a pool
degrades. The normal case is then continued read/write operation. But if you
DO end up with a bad situation and a double bad sector, you could then
resilver based on the snapshot (from the perspective of which the partially
failed drive is up to date) and at least get back an older version of the
data, rather than no data at all. This would be a compromise for when you
cannot afford immediate read-only mode for availability reasons (a rough
userland sketch of such a policy follows this message).

I suppose that if a resilvering can be performed relative to any arbitrary
node considered the root node, it might even be realistic to implement?

> N.B. I do have a lot of field data on failures and failure rates. It is
> often difficult to grok without having a clear objective in mind. We may
> be able to agree on a set of questions which would quantify the need for
> your ideas. Feel free to contact me directly.

Thanks. It's not that I have any particular situation where this becomes
more important than usual. It is just a general observation of a behavior
which, in cases where availability is not important, is sub-optimal from a
data-safety perspective. The only reason I even brought it up was the focus
on data integrity that we see with ZFS.

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
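A minimal userland sketch of the degrade-triggered policy described above,
assuming a pool named tank (a hypothetical name) and only stock zpool/zfs
commands. Because it polls rather than receiving events, it can only
approximate "immediately": transactions can still commit inside the polling
window, so the real guarantee would have to live in the pool itself.

#!/usr/bin/env python
# Minimal polling sketch: on the first transition away from ONLINE,
# take a recursive snapshot and set the pool's file systems read-only.
# The pool name and poll interval are assumptions.
import subprocess
import time

POOL = "tank"     # hypothetical pool name
INTERVAL = 5      # seconds between polls; this is the exposure window

def pool_health(pool):
    # 'zpool list -H -o health <pool>' prints e.g. ONLINE or DEGRADED.
    out = subprocess.check_output(
        ["zpool", "list", "-H", "-o", "health", pool])
    return out.decode().strip()

def freeze(pool):
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Recursive snapshot: a rollback point from whose perspective the
    # failed drive is still up to date.
    subprocess.check_call(
        ["zfs", "snapshot", "-r", "%s@degraded-%s" % (pool, stamp)])
    # Stop further mutation; readonly=on on the top-level file system
    # is inherited by its descendants.
    subprocess.check_call(["zfs", "set", "readonly=on", pool])

if __name__ == "__main__":
    while pool_health(POOL) == "ONLINE":
        time.sleep(INTERVAL)
    freeze(POOL)

Setting readonly=on on the top-level file system is about as close as
userland gets to freezing the pool without support inside ZFS itself.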
comment below...

Peter Schuller wrote:
>>> In many situations it may not feel worth it to move to raidz2 just to
>>> avoid this particular case.
>> I can't think of any, but then again, I get paid to worry about failures
>> :-)
>
> Given that one of the touted features of ZFS is data integrity, including
> in the case of cheap drives, that implies it is of interest to get maximum
> integrity with any given amount of resources.
>
> In your typical home-use situation, for example, buying 4 drives of decent
> size is pretty expensive considering that it *is* home use. Getting 4
> drives for the disk space of 3 is a lot more attractive than 5 drives for
> the disk space of 3. But given that you do get 4 drives and put them in a
> raidz, you want as much safety as possible, and often you don't care that
> much about availability.
>
> That said, the argument scales. If you're not in a situation like the
> above, you may easily warrant "wasting" an extra drive on raidz2. But
> raidz2 without this feature is still less safe than raidz2 with the
> feature. So, moving back to the idea of getting as much redundancy as
> possible given a certain set of hardware resources, you're still not
> optimal given your hardware.
>
>> Please correct me if I misunderstand your reasoning: are you saying that
>> a broken disk should not be replaced?
>
> Sorry, no. However, I realize my desire actually requires an additional
> feature. The situation I envision is this:
>
> * One disk goes down in a raidz, because the controller suddenly broke
> (platters/heads are fine).
>
> * You replace the disk and start a resilvering.
>
> * You trigger a bad block. At this point, you are now pretty screwed,
> unless:
>
> * The pool did not change after the original drive failed, AND a "broken
> drive assisted" resilvering is supported. You go to whatever effort is
> required to fix the disk (say, buy another one of the same model and
> replace the controller, or hire some company that does this stuff) and
> re-insert it into the machine.
>
> * At this point you have a drive you can read data off of, but that you
> certainly don't trust in general. So you want to start replacing the drive
> with the new drive; if ZFS were then able to resilver to the new drive by
> using both parity data on other healthy drives in the pool and the disk
> being replaced, you're happy.

It is my understanding that zpool replace already does this. Just don't
remove the failing disk (see the sketch after this message)...

> Or let's do a more likely scenario. A disk starts dying because of bad
> sectors (the disk has run out of remapping possibilities). You cannot fix
> this anymore by re-writing the bad sectors; trying to re-write the sector
> ends up failing with an I/O error and ZFS kicks the disk out of the pool.
>
> Standard procedure at this point is to replace the drive and resilver. But
> once again - you might end up with a bad sector on another drive. Without
> utilizing the existing broken drive, you're screwed. If however you were
> able to take advantage of sectors that *ARE* readable off of the drive,
> and the drive has *NOT* gone out of date since it was kicked out due to
> additional transaction commits, you are once again happy.
>
> (Once again assuming you don't happen to have bad luck and the set of bad
> sectors on the two drives overlaps.)

... I think I was off base previously. It seems to me that you are really
after the policy for failing/failed disks. Currently, the only way a drive
gets "kicked out" is if ZFS cannot open it. Obviously, if ZFS cannot open
the drive, then you won't be able to read anything from it. Looking forward,
I think that there are several policies which may be desired...

>> If so, then that is contrary to the accepted methods used in most
>> mission-critical systems. There may be other methods which meet your
>> requirements and are accepted. For example, one procedure we see for
>> those sites who are very interested in data retention is to power off a
>> system when it is degraded to a point (as specified) where data retention
>> is put at unacceptable risk.
>
> This is kind of what I am after, except that I want to guarantee that not
> a single transaction gets committed once a pool is degraded. Even if an
> admin goes and turns the machine off, the failed disk will already be out
> of date.

... such as a policy that says "if a disk is going bad, go read-only." I'm
quite sure that most applications won't respond well to such a policy,
though.

>> The theory is that a powered-down system will stop wearing out. When the
>> system is serviced, it can be brought back online. Obviously, this is not
>> the case where data availability is a primary requirement -- data
>> retention has higher priority.
>
> On the other hand, hardware has a nasty tendency to break in relation to
> power cycles...
>
>> We can already set a pool (actually the file systems in a pool) to be
>> read-only.
>
> Automatically and *immediately* on a drive failure?

You can listen to sysevents and implement policies.

>> There may be something else lurking here that we might be able to take
>> advantage of, at least for some cases. Since ZFS is COW, it doesn't have
>> the same "data loss" profile as other file systems, like UFS, which can
>> overwrite the data, making reconstruction difficult or impossible. But
>> while this might be useful for forensics, the general case is perhaps
>> largely covered by the existing snapshot features.
>
> Heh, in an ideal world: have ZFS automatically create a snapshot when a
> pool degrades. The normal case is then continued read/write operation. But
> if you DO end up with a bad situation and a double bad sector, you could
> then resilver based on the snapshot (from the perspective of which the
> partially failed drive is up to date) and at least get back an older
> version of the data, rather than no data at all. This would be a
> compromise for when you cannot afford immediate read-only mode for
> availability reasons.
>
> I suppose that if a resilvering can be performed relative to any arbitrary
> node considered the root node, it might even be realistic to implement?

If I understand correctly, resilvering occurs at the zpool, not the file
system, level. I think a better policy may be to initiate a scrub when a
failed read occurs (along with a SERD policy). The scrub will remap bad
blocks that it can recover.

>> N.B. I do have a lot of field data on failures and failure rates. It is
>> often difficult to grok without having a clear objective in mind. We may
>> be able to agree on a set of questions which would quantify the need for
>> your ideas. Feel free to contact me directly.
>
> Thanks. It's not that I have any particular situation where this becomes
> more important than usual. It is just a general observation of a behavior
> which, in cases where availability is not important, is sub-optimal from a
> data-safety perspective. The only reason I even brought it up was the
> focus on data integrity that we see with ZFS.

In any case, this is a job for ZFS+FMA integration.

-- richard
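A sketch of the in-place replacement mentioned above ("just don't remove the
failing disk"), followed by a verifying scrub. The pool and device names
(tank, c1t2d0, c1t3d0) are hypothetical, and the wait loop string-matches
zpool status output, which is fragile but adequate for illustration.

# Sketch of an in-place replacement: leave the failing disk attached so
# that any sectors it can still serve remain available to the resilver,
# then scrub once the resilver completes. Names are hypothetical.
import subprocess
import time

POOL = "tank"                    # hypothetical pool name
OLD, NEW = "c1t2d0", "c1t3d0"    # hypothetical failing disk and successor

# Start the replacement *without* detaching the old disk first.
subprocess.check_call(["zpool", "replace", POOL, OLD, NEW])

# Crude wait for the resilver to finish before scrubbing.
while b"resilver in progress" in subprocess.check_output(
        ["zpool", "status", POOL]):
    time.sleep(60)

# Verify the result, per the scrub-on-failed-read idea above.
subprocess.check_call(["zpool", "scrub", POOL])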