After having read this mailing list for a little while, I get the impression that there are at least some people who regularly experience on-disk corruption that ZFS should be able to report and handle. I've been running a raidz1 on three 1TB consumer disks for approx. 2 years now (about 90% full), and I scrub the pool every 3-4 weeks and have never had a single error. From the oft-quoted 10^14 error rate that consumer disks are rated at, I should have seen an error by now -- the scrubbing process is not the only activity on the disks, after all, and the data transfer volume from that alone clocks in at almost exactly 10^14 by now.

Not that I'm worried, of course, but it comes as a slight surprise to me. Or does the 10^14 rating just reflect the strength of the on-disk ECC algorithm?
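A minimal back-of-the-envelope sketch of where that figure comes from; the per-disk read volume per scrub and the scrub count are rounded assumptions, not measured values:

    # Rough estimate of the scrub read volume per disk over two years.
    # Assumptions (not reported figures): ~0.9 TB of allocated data read
    # from each disk per scrub, one scrub every 3.5 weeks.
    TB = 1e12                              # decimal terabytes, as drive vendors count
    bytes_per_scrub_per_disk = 0.9 * TB
    scrubs = 2 * 52 / 3.5                  # about 30 scrubs in two years
    bits_read = bytes_per_scrub_per_disk * scrubs * 8
    print(f"bits read per disk from scrubbing alone: {bits_read:.1e}")
    # -> about 2e14 bits, i.e. on the order of the 1-per-10^14 consumer URE spec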
2012-01-24 19:50, Stefan Ring wrote:
> After having read this mailing list for a little while, I get the
> impression that there are at least some people who regularly
> experience on-disk corruption that ZFS should be able to report and
> handle. I've been running a raidz1 on three 1TB consumer disks for
> approx. 2 years now (about 90% full), and I scrub the pool every 3-4
> weeks and have never had a single error. From the oft-quoted 10^14
> error rate that consumer disks are rated at, I should have seen an
> error by now -- the scrubbing process is not the only activity on the
> disks, after all, and the data transfer volume from that alone clocks
> in at almost exactly 10^14 by now.
>
> Not that I'm worried, of course, but it comes as a slight surprise to
> me. Or does the 10^14 rating just reflect the strength of the on-disk
> ECC algorithm?

I maintained several dozen storage servers for about 12 years, and I've seen quite a few drive deaths as well as automatically triggered RAID array rebuilds. But usually these were "infant deaths" in the first year, and the drives that passed the age test often gave no noticeable problems for the next decade. Several 2-4 disk systems have been working as OpenSolaris SXCE servers with ZFS pools for root and data for years now, and also show no problems. However, most of these are branded systems and disks from Sun. I think we've only had one or two drives die, but we happened to have cold-spares due to over-ordering ;)

I do have a suspiciously high error rate on my home NAS, which was thrown together from whatever pieces I had at home at the time I left for an overseas trip. The box has been nearly unmaintained since then and can suffer from physical problems known and unknown, such as the SATA cabling (varied and quite possibly bad), non-ECC memory, dust and overheating, etc. It is also possible that aging components such as the CPU and motherboard, which have about 5 years of active lifetime behind them (including an overclocked past), contribute to the error rates.

The old 80GB root drive has had some bad sectors (READ errors during scrubs and data access), and rpool has been recreated with copies=2 a few times now, thanks to a LiveUSB, but the main data pool had no substantial errors until the CKSUM errors reported this winter (metadata:0x0 and then the dozen or so in-file checksum mismatches). Since one of the drives got itself lost soon after, and only reappeared after all the cables were replugged, I still tend to blame this on SATA cabling as the most probable root cause. I do not have an up-to-date SMART error report, and the box is not accessible at the moment, so I can't comment on lower-level errors in the main pool drives. They were new at the time I put the box together (almost a year ago now).

However, so far I am bothered much more by the tendency of this box to lock up and/or reboot after somewhat repeatable actions (such as destroying large snapshots of deduped datasets) than by the discovered on-disk CKSUM errors, however they appeared. I tend to write this off as shortcomings of the OS (i.e. memory hunger and lockup in scanrate hell as the most frequent cause), and this really bothers me more now - causing lots of downtime until some friend comes to that apartment to reboot the box.

> Or does the 10^14 rating just reflect the strength
> of the on-disk ECC algorithm?

I am not sure how much the algorithms differ between "enterprise" and "consumer" disks, while the UBER is said to differ by about a factor of 100.
It might also have to do with the quality of materials (better steel in ball bearings, etc.) as well as better firmware/processors which optimize mechanical workloads and reduce mechanical wear. Maybe so, at least...

Finally, this is statistics. It does not "guarantee" that for some 90 Tbits of transferred data you will certainly see an error (and just one, for that matter). The drives that died young hopefully also count in the overall stats, moving the bar a bit higher for their better-made brethren.

Also, disk UBER covers media failures and the ability of the disk's cache, firmware and ECC to deal with them. After the disk sends the "correct" sector on the wire, many things can still happen: noise in bad connectors, electromagnetic interference from all the motors in your computer onto the data cable, the ability (or lack thereof) of the data protocol (IDE, ATA, SCSI) to detect and/or recover from such random bits flipped between disk and HBA, errors in HBA chips and code, noise in old rusty PCI* connector slots, bitflips in non-ECC RAM or overheated CPUs, power surges from the PSU... There is a lot of stuff that can break :)

//Jim Klimov
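To put rough numbers on that statistical point: a minimal sketch, assuming unrecoverable errors arrive independently at the rated per-bit rate (real failures are often correlated, so this is only illustrative) and taking the per-disk read volume from the earlier estimate.

    import math

    uber = 1e-14            # consumer-class spec: one error per 1e14 bits read
    bits_read = 2e14        # rough per-disk read volume from the earlier estimate
    expected_errors = uber * bits_read
    p_no_errors = math.exp(-expected_errors)   # Poisson probability of zero errors
    print(f"expected errors: {expected_errors:.1f}")
    print(f"chance of a completely clean run anyway: {p_no_errors:.0%}")
    # -> about 2 expected errors, yet still roughly a 14% chance of seeing none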
On Tue, 24 Jan 2012, Jim Klimov wrote:

>> Or does the 10^14 rating just reflect the strength
>> of the on-disk ECC algorithm?
>
> I am not sure how much the algorithms differ between "enterprise" and
> "consumer" disks, while the UBER is said to differ by about a factor
> of 100. It might also have to do with the quality of materials (better
> steel in ball bearings, etc.) as well as better firmware/processors
> which optimize mechanical workloads and reduce mechanical wear.
> Maybe so, at least...

In addition to the above, an important factor is that enterprise disks with 10^16 ratings also offer considerably less storage density. Instead of 3TB storage per drive, you get 400GB storage per drive. So-called "nearline" enterprise storage drives fit in somewhere in the middle, with higher storage densities, but also higher error rates.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
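A hedged illustration of that capacity-versus-UBER trade-off, using the nominal figures from this thread plus an assumed 10^-15 rating for the nearline class (not vendor measurements): expected unrecoverable errors for a single full read of each drive.

    def expected_ures(capacity_bytes, uber_per_bit):
        """Expected unrecoverable read errors when reading the whole drive once."""
        return capacity_bytes * 8 * uber_per_bit

    print("3 TB consumer     (1e-14):", expected_ures(3e12, 1e-14))    # ~0.24
    print("3 TB nearline     (1e-15):", expected_ures(3e12, 1e-15))    # ~0.024
    print("400 GB enterprise (1e-16):", expected_ures(400e9, 1e-16))   # ~0.0003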
What I've noticed is that when I have my drives in a situation of small airflow, and hence hotter operating temperatures, my disks will drop quite quickly. I've now moved my systems into large cases with large amounts of airflow, using the IcyDock brand of removable drive enclosures:

http://www.newegg.com/Product/Product.aspx?Item=N82E16817994097
http://www.newegg.com/Product/Product.aspx?Item=N82E16817994113

I use the SASUC8I SATA/SAS controller to access 8 drives:

http://www.newegg.com/Product/Product.aspx?Item=N82E16816117157

I put it in PCI-e x16 slots on "graphics heavy" motherboards which might have as many as 4x PCI-e x16 slots. I am replacing an old motherboard with this one:

http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=1124780

The case that I found to be a good match for my needs is the Raven:

http://www.newegg.com/Product/Product.aspx?Item=N82E16811163180

It has enough slots (7) to put 2x 3-in-2 and 1x 4-in-3 IcyDock bays in to provide 10 drives in hot-swap bays. I really think that the big issue is that you must move the air. The drives really need to stay cool or else you will see degraded performance and/or data loss much more often.

Gregg Wonderly

On 1/24/2012 9:50 AM, Stefan Ring wrote:
> After having read this mailing list for a little while, I get the
> impression that there are at least some people who regularly
> experience on-disk corruption that ZFS should be able to report and
> handle. I've been running a raidz1 on three 1TB consumer disks for
> approx. 2 years now (about 90% full), and I scrub the pool every 3-4
> weeks and have never had a single error. From the oft-quoted 10^14
> error rate that consumer disks are rated at, I should have seen an
> error by now -- the scrubbing process is not the only activity on the
> disks, after all, and the data transfer volume from that alone clocks
> in at almost exactly 10^14 by now.
>
> Not that I'm worried, of course, but it comes as a slight surprise to
> me. Or does the 10^14 rating just reflect the strength of the on-disk
> ECC algorithm?
On 01/24/12 17:06, Gregg Wonderly wrote:
> What I've noticed is that when I have my drives in a situation of small
> airflow, and hence hotter operating temperatures, my disks will drop
> quite quickly.

While I *believe* the same thing, and thus have over-provisioned airflow in my cases (for both drives and memory), there are studies which failed to find a strong correlation between drive temperature and failure rates:

http://research.google.com/archive/disk_failures.pdf
http://www.usenix.org/events/fast07/tech/schroeder.html
From: Anonymous Remailer (austria)
Date: 2012-Jan-25 09:08 UTC
Subject: [zfs-discuss] What is your data error rate?
I've been watching the heat control issue carefully since I had to take a job offshore (cough reverse H1B cough) in a place without adequate AC, and I was able to get them to ship my servers and some other gear. Then I read that Intel is guaranteeing their servers will work at up to 100 degrees F ambient temperature. In the pricing wars to sell servers, he who goes green and saves on data center cooling budget will win big, since now everyone realizes AC costs more than hardware for server farms. And this is not on new special heat-tolerant gear; I heard they will put this in writing even for their older units. From that I would conclude that at least commercial server gear can take a lot more abuse than it gets and still not be affected enough to make components fail, because if it could, Intel could not afford to make this guarantee. YMMV of course. I still feel nervous running equipment in this kind of environment, but after 3 years of doing that, including commodity desktops, I haven't seen any abnormal failures.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Stefan Ring
>
> I've been running a raidz1 on three 1TB consumer disks for
> approx. 2 years now (about 90% full), and I scrub the pool every 3-4
> weeks and have never had a single error.

Well... You're probably not 100% active 100% of the time... And... Assuming the failure rate of drives is not linear, but skewed toward a higher failure rate after some period of time (say, 3 yrs), then you're more likely to experience no errors for the first year or two, and more likely to experience multiple simultaneous failures after 3 yrs or so.
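A minimal sketch of that skewed-failure-rate idea, using a Weibull hazard whose shape parameter models wear-out; the parameter values are purely illustrative, not fitted to any real drive population.

    def weibull_hazard(age_years, beta=3.0, eta=6.0):
        """Instantaneous failure rate at a given age, shape beta, scale eta (years)."""
        return (beta / eta) * (age_years / eta) ** (beta - 1)

    for age in (0.5, 1, 2, 3, 4, 5):
        print(f"age {age:>3} yr: hazard {weibull_hazard(age):.3f} per year")
    # The hazard around year 4-5 is more than an order of magnitude above year 1,
    # which is why a clean first year or two says little about what comes later.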
On 01/25/12 09:08, Edward Ned Harvey wrote:
> Assuming the failure rate of drives is not linear, but skewed toward
> a higher failure rate after some period of time (say, 3 yrs) ...

See section 3.1 of the Google study:

http://research.google.com/archive/disk_failures.pdf

although section 4.2 of the Carnegie Mellon study is much more supportive of the assumption:

http://www.usenix.org/events/fast07/tech/schroeder/schroeder.pdf
On Wed, 25 Jan 2012, Anonymous Remailer (austria) wrote:
> I've been watching the heat control issue carefully since I had to take
> a job offshore (cough reverse H1B cough) in a place without adequate AC,
> and I was able to get them to ship my servers and some other gear. Then
> I read that Intel is guaranteeing their servers will work at up to 100
> degrees F ambient temperature.

Most servers seem to be specified to run at up to 95 degrees, with some particularly dense ones specified to handle only 90. Network switching gear is usually specified to handle 105. My own equipment typically experiences up to 83 degrees during the peak of summer (but quite a lot more if the AC fails).

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Tue, Jan 24, 2012 at 10:50 AM, Stefan Ring <stefanrin at gmail.com> wrote:
> After having read this mailing list for a little while, I get the
> impression that there are at least some people who regularly
> experience on-disk corruption that ZFS should be able to report and
> handle. I've been running a raidz1 on three 1TB consumer disks for
> approx. 2 years now (about 90% full), and I scrub the pool every 3-4
> weeks and have never had a single error. From the oft-quoted 10^14
> error rate that consumer disks are rated at, I should have seen an
> error by now -- the scrubbing process is not the only activity on the
> disks, after all, and the data transfer volume from that alone clocks
> in at almost exactly 10^14 by now.

The 10^-14 (or 10^-15 or 10^-16) number is a statistical average. So if you have a big enough pool of drives, for every drive that moves more than 10^14 bits with no uncorrectable errors, there will be a drive that moves less than 10^14 bits before hitting an uncorrectable error. The three 1 TB consumer drives you have must have been manufactured on a "good day" and not a "bad day" :-)

Note the error rate is 10^-14 (or 10^-15 or 10^-16), which translates into one error per 10^14 bits (bytes?) transferred to / from the drive. Note the sign change on the exponent :-)

--
{--------1---------2---------3---------4---------5---------6---------7---------}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, Troy Civic Theatre Company
-> Technical Advisor, RPI Players
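As a closing unit check on the rates discussed above (a sketch; the 10^16 figure is the enterprise-class rating mentioned earlier in the thread):

    # One error per 1e14 *bits* works out to one expected error per 12.5 TB
    # transferred (decimal TB, as drive vendors count).
    bits_per_error = 1e14
    tb_per_error = bits_per_error / 8 / 1e12
    print(f"consumer   (1e14 bits): one expected error per {tb_per_error:.1f} TB")
    print(f"enterprise (1e16 bits): one expected error per {tb_per_error * 100:.0f} TB")
    # -> 12.5 TB vs 1250 TB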