Experimenting with OpenSolaris on an elderly PC with equally elderly drives, zpool status shows errors after a pkg image-update followed by a scrub. It is entirely possible that one of these drives is flaky, but surely the whole point of a zfs mirror is to avoid this? It seems unlikely that both drives failed at the same time. Could someone explain how this can happen? Another question (perhaps for the indiana folks) is how to restore these files?

# zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h24m with 2 errors on Wed Apr 15 09:15:40 2009
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0    69
          mirror    ONLINE       0     0   144
            c3d0s0  ONLINE       0     0   145  128K repaired
            c3d1s0  ONLINE       0     0   151  168K repaired

errors: Permanent errors have been detected in the following files:

        //lib/amd64/libsec.so.1
        //lib/libdlpi.so.1
On Wed, 15 Apr 2009, Frank Middleton wrote:

> Experimenting with OpenSolaris on an elderly PC with equally
> elderly drives, zpool status shows errors after a pkg image-update
> followed by a scrub. It is entirely possible that one of these
> drives is flaky, but surely the whole point of a zfs mirror is
> to avoid this? It seems unlikely that both drives failed at the
> same time. Could someone explain how this can happen? Another
> question (perhaps for the indiana folks) is how to restore these
> files?

If a corruption occurred in the main memory, the backplane, or the disk controller during the writes to these files, then the original data written could be corrupted, even though you are using mirrors. If the system experienced a physical shock, or power supply glitch, while the data was written, then it could impact both drives.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On 04/15/09 14:30, Bob Friesenhahn wrote:
> On Wed, 15 Apr 2009, Frank Middleton wrote:
>> zpool status shows errors after a pkg image-update
>> followed by a scrub.
>
> If a corruption occurred in the main memory, the backplane, or the disk
> controller during the writes to these files, then the original data
> written could be corrupted, even though you are using mirrors. If the
> system experienced a physical shock, or power supply glitch, while the
> data was written, then it could impact both drives.

Quite. Sounds like an architectural problem. This old machine probably doesn't have ECC memory (AFAIK still rare on most PCs), but it is on a serial UPS and isolated from shocks, and this has happened more than once. These drives on this machine recently passed both the purge and verify cycles (format/analyze) several times. Unless the data is written to both drives from the same buffer and checksum (surely not!), it is still unclear how it could get written to *both* drives with a bad checksum.

It looks like the files really are bad - neither of them can be read - unless ZFS sensibly refuses to allow possibly good files with bad checksums to be read (cannot read: I/O error). BTW, fmdump -ev doesn't seem to report any disk errors at all.

So my question remains - even with the grottiest hardware, how can several files get written with bad checksums to mirrored drives? ZFS has so many cool features that this would be easy to live with if there were a reasonably simple way to get copies of these files to restore them, short of getting the source and recompiling, or pkg uninstall followed by install (if you can figure out which pkg(s) the bad files are in), but it seems to defeat the purpose of software mirroring...
On 15-Apr-09, at 8:31 PM, Frank Middleton wrote:

> Quite. Sounds like an architectural problem. This old machine probably
> doesn't have ECC memory (AFAIK still rare on most PCs), but it is on
> a serial UPS and isolated from shocks, and this has happened more
> than once. These drives on this machine recently passed both the purge
> and verify cycles (format/analyze) several times. Unless the data is
> written to both drives from the same buffer and checksum (surely not!),

Doesn't seem that far-fetched...

> it is still unclear how it could get written to *both* drives with a
> bad checksum. It looks like the files really are bad - neither of
> them can be read - unless ZFS sensibly refuses to allow possibly good
> files with bad checksums to be read (cannot read: I/O error).
>
> BTW, fmdump -ev doesn't seem to report any disk errors at all.
>
> So my question remains - even with the grottiest hardware, how can
> several files get written with bad checksums to mirrored drives?

Bad RAM would seem a possible cause, wouldn't it?

--Toby
> Quite. Sounds like an architectural problem. This old machine probably
> doesn't have ECC memory (AFAIK still rare on most PCs), but it is on
> a serial UPS and isolated from shocks, and this has happened more
> than once. These drives on this machine recently passed both the purge
> and verify cycles (format/analyze) several times. Unless the data is
> written to both drives from the same buffer and checksum (surely not!),

You really believe that the copy was copied and checksummed twice before writing to the disk? Of course not. Copying the data doesn't help; both pieces of memory need to be good. It's checksummed once. The checksum fails to verify, so one of the following happened:

 - the memory was corrupted after the checksum was computed
 - the data was damaged en route to the disk
 - the data was damaged on disk
 - the data was damaged on the way back from the disk
 - the data was damaged in memory

> it is still unclear how it could get written to *both* drives with a
> bad checksum. It looks like the files really are bad - neither of
> them can be read - unless ZFS sensibly refuses to allow possibly good
> files with bad checksums to be read (cannot read: I/O error).

That can happen when the memory is corrupted before the first write to disk.

> So my question remains - even with the grottiest hardware, how can
> several files get written with bad checksums to mirrored drives?

Bad memory.

Casper
Frank Middleton wrote:
> Experimenting with OpenSolaris on an elderly PC with equally
> elderly drives, zpool status shows errors after a pkg image-update
> followed by a scrub. It is entirely possible that one of these
> drives is flaky, but surely the whole point of a zfs mirror is
> to avoid this? It seems unlikely that both drives failed at the
> same time.

Possible causes:
 + bad CPU
 + bad memory, or memory which does not self-correct transient errors
 + faulty cabling
 + electrically noisy environment
 + multiply all of the above by 3 or more (main CPU, HBA, disk logic)

Alas, today we don't know if bad data was written to both sides of the mirror. Rather, ZFS does not report whether both sides of the mirror agree when the checksum fails. I filed an RFE on this, because it would help diagnose where the corruption occurred: memory/data path vs. medium.

> Could someone explain how this can happen? Another
> question (perhaps for the indiana folks) is how to restore these
> files?

These files look like they would have been delivered in an OS. If so, just copy a good version over the bad.
 -- richard
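(By way of illustration only: on an IPS-based OpenSolaris image, the damaged files should be traceable to the package that delivers them and repairable from the repository. The package name below is just a guess; use whatever pkg search actually reports.)

  # pkg search -l libdlpi.so.1      (shows which installed package delivers the file)
  # pkg verify SUNWcsl              (checks the installed files against the manifest)
  # pkg fix SUNWcsl                 (re-fetches and repairs anything that fails verification)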
On 04/16/09 04:39, Casper.Dik at Sun.COM wrote:
> You really believe that the copy was copied and checksummed twice before
> writing to the disk? Of course not. Copying the data doesn't help;
> both pieces of memory need to be good. It's checksummed once.

If OpenSolaris succeeds in being significantly adopted as a desktop O/S, it is going to be running on some pretty grotty hardware: no ECC memory, cheap PCI controllers, etc. Clearly this computer has hardware problems; my guess is the PCI itself, although it seems to run Linux and OpenSolaris quite happily. If the memory were so bad that two separately computed checksums failed, then I doubt it would run anything reliably. FWIW it passes every diagnostic I've run, but that doesn't prove anything...

ZFS can't catch the case where the data is bad before it is checksummed, so we can ignore that one for this discussion. This scenario seems to have bad checksums or bad data (or both) being written to both disks. So why not copy and store the data + checksum twice? In the grand scheme of things, it is hard to believe that this would add significant overhead (it might even speed things up if both disks can be written in parallel?), and it would help in diagnosing what is a novel problem.

Let CSA and CSB be the stored checksums, and CRA and CRB be the recomputed checksums after the data is read back from each half of the mirror. Presumably a scrub always reads both sides of a mirror, so all permutations are possible. One interesting case is where CSA == CSB and CRA == CRB but CSA != CRA, vs. the case where all four checksums are different. It seems improbable that two disks would fail in the same way at the same moment, so the first scenario would point at some other source of error. It would be helpful to know which scenario is happening.

Good old reliable Sun products with an ECC bus and memory simply don't have this kind of problem. The hardware detects it long before it becomes a software issue. Not so with el-cheapo PCs, whose owners will likely be frustrated (see the "[zfs-discuss] How recoverable is an 'unrecoverable error'?" thread) when their previously seemingly reliable disks start to apparently fail in mysterious ways.

I'd like to submit an RFE suggesting that data + checksum be copied for mirrored writes, but I won't waste anyone's time doing so unless you think there is a point. One might argue that a machine this flaky should be retired, but it is actually working quite well, and perhaps represents not even the extreme of bad hardware that ZFS might encounter.

Cheers -- Frank
> I'd like to submit an RFE suggesting that data + checksum be copied for
> mirrored writes, but I won't waste anyone's time doing so unless you
> think there is a point. One might argue that a machine this flaky should
> be retired, but it is actually working quite well, and perhaps represents
> not even the extreme of bad hardware that ZFS might encounter.

I think it's a stupid idea. If you get two checksums, what can you do? The second copy is most likely suspect, and you double your chance of using bad memory.

Casper
On 17-Apr-09, at 11:49 AM, Frank Middleton wrote:

> ... One might argue that a machine this flaky should
> be retired, but it is actually working quite well,

If it has bad memory, you won't get much useful work done on it until the memory is replaced - unless you want to risk your data with random failures, and potentially waste large amounts of time. You should do a comprehensive memory test ASAP and replace what's not working.

ZFS's job isn't to test your memory, so I think the proposed patch is pointless. It also doesn't address the case where the application buffer is corrupt.

--T
On 04/17/09 12:37, Casper.Dik at Sun.COM wrote:
>> I'd like to submit an RFE suggesting that data + checksum be copied for
>> mirrored writes, but I won't waste anyone's time doing so unless you
>> think there is a point.
>
> I think it's a stupid idea. If you get two checksums, what can you do?
> The second copy is most likely suspect, and you double your chance of
> using bad memory.

If there were permanently bad memory locations, surely the diagnostics would reveal them. Here's an interesting paper on memory errors:

http://www.ece.rochester.edu/~mihuang/PAPERS/hotdep07.pdf

Given the inevitability of relatively frequent transient memory errors, I would think it behooves the file system to minimize the effects of such errors. But I won't belabor the point except to suggest that the cost of adding the suggested step would not be very expensive (either to implement or run).

Memory diagnostics ran for a full 12 hours with no errors. Same goes for both disks, using Solaris format/analyze/verify. So far, after creating 400,000 files, two files had permanent, apparently truly unrecoverable errors and could not be read by anything.

Now it gets really funky. I detached one of the disks, and then found it couldn't be reattached. It turns out there is a rounding problem with Solaris fdisk (run from format) that can cause identical partitions on identical disks to have different sizes. I used the Linux sfdisk utility to repair the MBR and fix the Solaris2 partition sizes. Then it was possible to reattach the disk. Unfortunately it wasn't possible to boot from the result, but a reinstall went perfectly with no ZFS errors being reported at all. So it appears that the problem may be with the OpenSolaris fdisk. Is this worth reporting as a bug? It is likely to be quite hard to reproduce...
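(A sketch of the kind of comparison involved before re-attaching, using the device names from the original post; exact fdisk option letters may vary by release.)

  # prtvtoc /dev/rdsk/c3d0s0          (slice sizes on the half still in the pool)
  # prtvtoc /dev/rdsk/c3d1s0          (slice sizes on the disk being re-attached)
  # fdisk -W - /dev/rdsk/c3d1p0       (dump the fdisk table to compare the Solaris2 partition sizes)
  # zpool attach rpool c3d0s0 c3d1s0  (re-attach once the sizes agree)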
> If there were permanently bad memory locations, surely the diagnostics
> would reveal them. Here's an interesting paper on memory errors:
> http://www.ece.rochester.edu/~mihuang/PAPERS/hotdep07.pdf
> Given the inevitability of relatively frequent transient memory
> errors, I would think it behooves the file system to minimize the
> effects of such errors. But I won't belabor the point except to
> suggest that the cost of adding the suggested step would not be
> very expensive (either to implement or run).

I'm still not clear what you win. You copy the data (which isn't actually that cheap, especially when running a load which uses a lot of memory bandwidth). And now what? You can't write two different checksums; I mean, we're mirroring the data so it MUST BE THE SAME. (A different checksum would be wrong: I don't think ZFS will allow different checksums for different sides of a mirror.)

You are assuming that the error is the memory being modified after computing the checksum; I would say that that is unlikely; I think it's a bit more likely that the data gets corrupted when it's handled by the disk controller or the disk itself. (The data is continuously re-written by the DRAM controller.)

> Memory diagnostics ran for a full 12 hours with no errors. Same goes
> for both disks, using Solaris format/analyze/verify. So far, after
> creating 400,000 files, two files had permanent, apparently truly
> unrecoverable errors and could not be read by anything.

It would have been nice if we were able to recover the contents of the files; if you also know what was supposed to be there, you can diff and then we can find out what was wrong.

> Now it gets really funky. I detached one of the disks, and then found
> it couldn't be reattached. It turns out there is a rounding problem with
> Solaris fdisk (run from format) that can cause identical partitions on
> identical disks to have different sizes.

There might be some skeletons buried in the IDE device drivers; I once had a disk which broke (well, one sector or more was broken), so I added "bad sectors" in format. But the disk still seemed to be bad, even after running the "check disk" tool from Western Digital. The disk would hang when I read certain bits. Then I copied the disk to an identical disk; it hung in the same way. Then I "zapped" the copy, relabeled it, copied the data per slice (not the whole disk, but slice by slice), and then the new disk worked. So while the first disk was broken (the Western Digital tool moved some sectors somewhere else), adding "bad sectors" in Solaris broke "something else".

Casper
There have been a number of threads here on the reliability of ZFS in the face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC) hardware, but isn't it reasonable to expect it to run well on something less well engineered? I am a real ZFS fan, and I'd hate to see folks trash it because it appears to be unreliable.

In an attempt to bolster the proposition that there should at least be an option to buffer the data before checksumming and writing, we've been doing a lot of testing on presumed flaky (cheap) hardware, with a peculiar result - see below.

On 04/21/09 12:16, Casper.Dik at Sun.COM wrote:
> And now what? You can't write two different checksums; I mean, we're
> mirroring the data so it MUST BE THE SAME. (A different checksum would be
> wrong: I don't think ZFS will allow different checksums for different
> sides of a mirror.)

Unless it does a read after write on each disk, how would it know that the checksums are the same? If the data is damaged before the checksum is calculated, then it is no worse than the UFS/ext3 case. If data + checksum is damaged while the (single) checksum is being calculated, or after, then the file is already lost before it is even written! There is a significant probability that this could occur on a machine with no ECC. Evidently memory concerns /are/ an issue - this thread http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests including a memory diagnostic with the distribution CD (Fedora already does so).

Memory diagnostics just test memory. Disk diagnostics just test disks. ZFS keeps disks pretty busy, so perhaps it loads the power supply to the point where it heats up and memory glitches are more likely. That might also explain why errors don't really begin until ~15 minutes after the busy time starts.

You might argue that this problem could only affect systems doing a lot of disk I/O, and such systems probably have ECC memory. But doing an O/S install is the one time when a consumer-grade computer does a *lot* of disk I/O for quite a long time and is hence vulnerable. Ironically, the OpenSolaris installer does not allow for ZFS mirroring at install time, one time when it might be really important! Now that sounds like a more useful RFE, especially since it would be relatively easy to implement. Anaconda does it...

A Solaris install writes almost 4*10^10 bits. Quoting Wikipedia, look at Cypress on ECC, see http://www.edn.com/article/CA454636.html. Possibly, statistically likely random memory glitches could actually explain the error rate that is occurring.

> You are assuming that the error is the memory being modified after
> computing the checksums; I would say that that is unlikely; I think it's a
> bit more likely that the data gets corrupted when it's handled by the disk
> controller or the disk itself. (The data is continuously re-written by
> the DRAM controller.)

See below for an example where a checksum error occurs without the disk subsystem being involved. There seems to be no plausible explanation other than an improbable bug in X86 ZFS itself.

> It would have been nice if we were able to recover the contents of the
> file; if you also know what was supposed to be there, you can diff and
> then we can find out what was wrong.

"file" on those files resulted in "bus error". Is there a way to actually read a file reported by ZFS as unrecoverable to do just that (and to separately retrieve the copy from each half of the mirror)?
Maybe this should be a new thread, but I suspect the following proves that the problem must be memory, and that begs the question of how memory glitches can cause fatal ZFS checksum errors. Here is the peculiar result (same machine):

After several attempts, I succeeded in doing a zfs send to a file on an NFS-mounted ZFS file system on another machine (SPARC), followed by a zfs recv of that file there. But every attempt to do a zfs recv of the same snapshot (i.e., from NFS) on the local machine (X86) has failed with a checksum mismatch. Obviously the file is good, since it was possible to do a zfs recv from it. You can't blame the IDE drivers (or the bus, or the disks) for this. Similarly, piping the snapshot through SSH fails, so you can't blame NFS either. Something is happening to cause checksum failures between the time the data is received by the PC and the time ZFS verifies its checksums. Surely this is either a highly repeatable memory glitch, or (most unlikely) a bug in X86 ZFS. A zfs recv to another SPARC over SSH, to the same physical disk (accessed via a SATA/PATA adapter), was also successful.

Does this prove that the data + checksum is being corrupted by memory glitches? Both NFS and SSH over TCP/IP provide reliable transport (via checksums), so the data is presumably received correctly. ZFS then calculates its own checksum and it fails. Oddly, it /always/ fails, but not at the same point, and far into the stream when both disks have been very busy for a while.

It would be interesting to see if the checksumming still fails if the writes were somehow skipped or sent to /dev/null. If it still fails, it should be possible to pinpoint the failure. If not, then it would seem that the only recourse is to replace the machine or not use ZFS, even though it is otherwise quite reliable (it has been running an XDMCP session for 2 weeks now with no apparent glitches; even zpool status shows no errors at all after a couple of scrubs). It would be even more interesting to hear speculation as to why another machine can recv the datastream but not the one that originated it.

If memory that can pass diagnostics for 24 hours at a stretch can cause glitches in huge datastreams, then IMO it behooves ZFS to defend itself against them. Buffering disk I/O on machines with no ECC seems like reasonably cheap insurance against a whole class of errors, and could make ZFS usable on PCs that, although they work fine with ext3, fail annoyingly with ZFS. Ironically this wouldn't fix the peculiar recv problem, which nonetheless seems to point to memory glitches as a source of errors.

-- Frank
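(For readers trying to follow the test sequence described above, it amounts to something like the following; host names, dataset names, and paths are placeholders, and the exact error text from zfs recv will vary.)

  x86#   zfs snapshot rpool/export@test
  x86#   zfs send rpool/export@test > /net/sparchost/tank/dump/test.zfs    (stream written over NFS)
  sparc# zfs recv tank/restored < /tank/dump/test.zfs                      (succeeds on the SPARC box)
  x86#   zfs recv rpool/restored < /net/sparchost/tank/dump/test.zfs       (fails with a checksum mismatch)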
On 22-May-09, at 5:24 PM, Frank Middleton wrote:

> Unless it does a read after write on each disk, how would it know that
> the checksums are the same? If the data is damaged before the checksum
> is calculated, then it is no worse than the UFS/ext3 case. If data +
> checksum is damaged while the (single) checksum is being calculated,
> or after, then the file is already lost before it is even written!
> There is a significant probability that this could occur on a machine
> with no ECC. Evidently memory concerns /are/ an issue

Yes, the important thing is to *detect* them; no system can run reliably with bad memory, and that includes any system with ZFS. Doing nutty things like calculating the checksum twice does not buy anything of value here. If the memory is this bad then applications will be dying all over the place, compilers will be segfaulting, and databases will be writing bad data even before it reaches ZFS.

> - this thread
> http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests
> including a memory diagnostic with the distribution CD (Fedora already
> does so).

Absolutely, memory diags are essential. And you certainly run them if you see unexpected behaviour that has no other obvious cause.

> Memory diagnostics just test memory. Disk diagnostics just test disks.
> ZFS keeps disks pretty busy, so perhaps it loads the power supply
> to the point where it heats up and memory glitches are more likely.

Your logic is rather tortuous. If the hardware is that crappy then there's not much ZFS can do about it.
> Maybe this should be a new thread, but I suspect the following
> proves that the problem must be memory, and that begs the question
> of how memory glitches can cause fatal ZFS checksum errors.

Of course they can; but they will also break anything else on the machine.

...

> If memory that can pass diagnostics for 24 hours at a
> stretch can cause glitches in huge datastreams, then IMO it
> behooves ZFS to defend itself against them. Buffering disk
> I/O on machines with no ECC seems like reasonably cheap
> insurance against a whole class of errors, and could make
> ZFS usable on PCs that, although they work fine with ext3,
> fail annoyingly with ZFS.

How can a machine with bad memory "work fine with ext3"?

--Toby
>> If memory that can pass diagnostics for 24 hours at a
>> stretch can cause glitches in huge datastreams, then IMO it
>> behooves ZFS to defend itself against them. Buffering disk
>> I/O on machines with no ECC seems like reasonably cheap
>> insurance against a whole class of errors, and could make
>> ZFS usable on PCs that, although they work fine with ext3,
>
> How can a machine with bad memory "work fine with ext3"?

"It appears to work."

A long time ago I bought a new PC; it ran Windows, it installed Solaris (pre-ZFS), but when I tried to build on-net, something would die because of a SIGBUS or a SIGSEGV. When I finally ran memtest86 (which did require a BIOS that supported a USB keyboard properly), I found one broken 512MB DIMM and replaced it.

Similarly, when someone upgraded and started to use ZFS he continuously got bad checksums; and in the end it turned out the power supply was broken (not a "bad brand" but one which was actually broken, delivering out-of-spec voltages).

Casper
Casper.Dik at Sun.COM wrote:
>>> If memory that can pass diagnostics for 24 hours at a
>>> stretch can cause glitches in huge datastreams, then IMO it
>>> behooves ZFS to defend itself against them.
>>
>> How can a machine with bad memory "work fine with ext3"?
>
> "It appears to work."
...
> When I finally ran memtest86 (which did require a BIOS that supported a USB
> keyboard properly), I found one broken 512MB DIMM and replaced it.

Another important fact is that Linux starts using physical memory from low addresses, while Solaris takes care about the CPU cache and in addition allocates memory for DMA from the top physical pages. A PC with bad memory in the top parts may appear OK with another OS, but this is a fallacious impression.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home)  Jörg Schilling  D-13353 Berlin
       js at cs.tu-berlin.de                  (uni)
       joerg.schilling at fokus.fraunhofer.de  (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
<preface>
This forum is littered with claims of "zfs checksums are broken" where the root cause turned out to be faulty hardware or firmware in the data path.
</preface>

I think that before speculating on a redesign, we should get to the root cause.

Frank Middleton wrote:
> There have been a number of threads here on the reliability of ZFS in the
> face of flaky hardware. ZFS certainly runs well on decent (e.g., SPARC)
> hardware, but isn't it reasonable to expect it to run well on something
> less well engineered? I am a real ZFS fan, and I'd hate to see folks
> trash it because it appears to be unreliable.

It depends on what you consider to be flaky. If a CPU has a stuck bit in the carry lookahead (can't add properly for some pattern of operands), then it is flaky and will probably create bogus checksums, no?

> Unless it does a read after write on each disk, how would it know that
> the checksums are the same? If the data is damaged before the checksum
> is calculated, then it is no worse than the UFS/ext3 case.

Even if you do a read after write, there is no guarantee that you will read from the medium instead of a cache. There is some concern here, in general, because some mobo RAID controllers and (I believe) some disk drives have caches which are not protected. These are generally not too much of a problem because the data is not resident for a significant period of time, and the probability of a bit flip caused by radiation, for instance, is a function of time.

> If data + checksum is damaged while the (single) checksum is being
> calculated, or after, then the file is already lost before it is even
> written!

The checksum occurs in the pipeline prior to the write to disk. So if the data is damaged prior to the checksum, then ZFS will never know. Nor will UFS. Neither will be able to detect this. In Solaris, if the damage is greater than the ability of the memory system and CPU to detect or correct, then even Solaris won't know. If the memory system or CPU detects a problem, then Solaris fault management will kick in and do something, preempting ZFS.

> There is a significant probability that this could occur on a machine
> with no ECC. Evidently memory concerns /are/ an issue - this thread
> http://opensolaris.org/jive/thread.jspa?messageID=338148 even suggests
> including a memory diagnostic with the distribution CD (Fedora already
> does so).

SunVTS ships with SXCE and Solaris 2.2-10. SunVTS replaced SunDiag which, IIRC, started shipping in SunOS 3. I believe SunVTS is available via the OpenSolaris repository for those with support contracts. VTS is an acronym for Verification Test Suite and includes many tests, including memory tests. VTS is used to verify systems in the factory prior to shipping to customers. Look for /usr/sunvts on your system, or search for the SUNWvts* packages and check out the docs online.

> Memory diagnostics just test memory. Disk diagnostics just test disks.

This is not completely accurate.
Disk diagnostics also test the data path. Memory tests also test the CPU. The difference is the amount of test coverage for the subsystem.

> ZFS keeps disks pretty busy, so perhaps it loads the power supply
> to the point where it heats up and memory glitches are more likely.

In general, for like configurations, ZFS won't keep a disk any more busy than other file systems. In fact, because ZFS groups transactions, it may create less activity than other file systems, such as UFS.

> Ironically, the OpenSolaris installer does not allow for ZFS
> mirroring at install time, one time when it might be really important!
> Now that sounds like a more useful RFE, especially since it would be
> relatively easy to implement. Anaconda does it...

This is not an accurate statement. The OpenSolaris installer does support mirrored boot disks via the Automated Installer method. See http://dlc.sun.com/osol/docs/content/2008.11/AIinstall/index.html. You can also install Solaris 10 to mirrored root pools via JumpStart.

> See below for an example where a checksum error occurs without the
> disk subsystem being involved. There seems to be no plausible
> explanation other than an improbable bug in X86 ZFS itself.

I think a better test would be to md5 the file from all systems and see if the md5 hashes are the same. If they are, then yes, the finger would point more in the direction of ZFS. The send/recv protocol hasn't changed in quite some time, but it is arguably not as robust as it could be.

ZFS send/recv uses fletcher4 for the checksums. ZFS uses fletcher2 for data (by default) and fletcher4 for metadata. The same fletcher code is used. So if you believe fletcher4 is broken for send/recv, how do you explain that it works for the metadata? Or does it? There may be another failure mode at work here... (see comment on scrubs at the end of this extended post)

> "file" on those files resulted in "bus error". Is there a way to actually
> read a file reported by ZFS as unrecoverable to do just that (and to
> separately retrieve the copy from each half of the mirror)?

ZFS corrects automatically, when it can. But if the source data is bad, then ZFS couldn't possibly detect it.
For files that ZFS can detect are corrupted and cannot automatically correct, you can get the list from "zpool status -xv". The behaviour as seen by applications is determined by the zpool failmode property.

In any event, if file core dumps consistently in the same part of the code, then please log a bug against file -- it should not core dump, no matter what input it receives.

> Does this prove that the data + checksum is being corrupted by
> memory glitches? Both NFS and SSH over TCP/IP provide reliable
> transport (via checksums), so the data is presumably received correctly.
> ZFS then calculates its own checksum and it fails. Oddly, it /always/
> fails, but not at the same point, and far into the stream when both
> disks have been very busy for a while.

Uhmm, if it were a software bug, one would expect it to fail at exactly the same place, no?

> It would be interesting to see if the checksumming still fails
> if the writes were somehow skipped or sent to /dev/null. If it
> still fails, it should be possible to pinpoint the failure. If
> not, then it would seem that the only recourse is to replace
> the machine or not use ZFS, even though it is otherwise quite
> reliable (it has been running an XDMCP session for 2 weeks
> now with no apparent glitches; even zpool status shows no
> errors at all after a couple of scrubs). It would be even
> more interesting to hear speculation as to why another machine
> can recv the datastream but not the one that originated it.

Yep, interesting question. But since you say "even zpool status shows no errors at all after a couple of scrubs", that makes me think you've had errors in the past?

> If memory that can pass diagnostics for 24 hours at a
> stretch can cause glitches in huge datastreams, then IMO it
> behooves ZFS to defend itself against them. Buffering disk
> I/O on machines with no ECC seems like reasonably cheap
> insurance against a whole class of errors, and could make
> ZFS usable on PCs that, although they work fine with ext3,
> fail annoyingly with ZFS. Ironically this wouldn't fix the
> peculiar recv problem, which nonetheless seems to point
> to memory glitches as a source of errors.

I'm still a little confused. If ext3 can't detect data errors, what verification have you used to back your claim that it is unaffected? Please check the image views with md5 digests and get back to us.
If you get a chance, run SunVTS to verify the memory and CPU, too. If the CPU is b0rken, the fletcher4 checksum for the recv may be tickling it.

<sidebar>
Microsoft got so tired of defending its software against memory errors that it requires Windows Server platforms to use ECC. But even Microsoft doesn't have the power to force the vendors to use ECC for all PCs.
</sidebar>
 -- richard
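(The md5 comparison asked for above amounts to nothing more than hashing the same stream file from each host that can see it and comparing the digests; host names and paths below are placeholders.)

  sparc$ md5sum /tank/dump/test.zfs
  x86$   md5sum /net/sparchost/tank/dump/test.zfs

If the digests match, the stream reached the X86 box intact over NFS, and whatever corrupts it happens later, on the receiving host.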
On 05/22/09 21:08, Toby Thain wrote:
> Yes, the important thing is to *detect* them; no system can run reliably
> with bad memory, and that includes any system with ZFS. Doing nutty
> things like calculating the checksum twice does not buy anything of
> value here.

All memory is "bad" if it doesn't have ECC. There are only varying degrees of badness. Calculating the checksum twice on its own would be nutty, as you say, but doing so on a separate copy of the data might prevent unrecoverable errors after writes to mirrored drives. You can't detect memory errors if you don't have ECC. But you can try to mitigate them. Not doing so makes ZFS less reliable than the memory it is running on. The problem is that ZFS makes any file with a bad checksum inaccessible, even if one really doesn't care whether the data has been corrupted. A workaround might be a way to allow such files to be readable despite the bad checksum... In hindsight I probably should have merely reported the problem and left those with more knowledge to propose a solution. Oh well.

> If the memory is this bad then applications will be dying all over the
> place, compilers will be segfaulting, and databases will be writing bad
> data even before it reaches ZFS.

But it isn't. Applications aren't dying, compilers are not segfaulting (it was even possible to compile GCC 4.3.2 with the supplied gcc); gdm is staying up for weeks at a time... And I wouldn't consider running a non-trivial database application on a machine without ECC.

> Absolutely, memory diags are essential. And you certainly run them if
> you see unexpected behaviour that has no other obvious cause.

Runs for days, as noted.

> Your logic is rather tortuous. If the hardware is that crappy then
> there's not much ZFS can do about it.

Well, it could. For example, it could make copies of the data before checksumming so that one memory hit doesn't result in an unrecoverable file on a mirrored drive. Either that or there's a bug in ZFS. I am more inclined to blame the memory, especially since the failure rate isn't much higher than the expected rate as reported elsewhere.

>> Maybe this should be a new thread, but I suspect the following
>> proves that the problem must be memory, and that begs the question
>> of how memory glitches can cause fatal ZFS checksum errors.
>
> Of course they can; but they will also break anything else on the machine.

But they don't. Checksum errors are reasonable, but not unrecoverable ones on mirrors.

> How can a machine with bad memory "work fine with ext3"?

It does. It works fine with ZFS too - just really annoying unrecoverable files every now and then on mirrored drives. This shouldn't happen even with lousy memory, and wouldn't (doesn't) with ECC. If there were a way to examine the files and their checksums, I would be surprised if they were different (if they were, it would almost certainly be the controller or the PCI bus itself causing the problem). But I speculate that it is predictable memory hits.

-- Frank
> All memory is "bad" if it doesn't have ECC. There are only varying
> degrees of badness. Calculating the checksum twice on its own would
> be nutty, as you say, but doing so on a separate copy of the data
> might prevent unrecoverable errors after writes to mirrored drives.
> You can't detect memory errors if you don't have ECC.

And where exactly do you get the second good copy of the data? If you copy the data you've just doubled your chance of using bad memory. The original copy can be good or bad; the second copy cannot be better than the first copy.

> But you can try to mitigate them. Not doing so makes ZFS less reliable
> than the memory it is running on. The problem is that ZFS makes any file
> with a bad checksum inaccessible, even if one really doesn't care
> whether the data has been corrupted. A workaround might be a way to allow
> such files to be readable despite the bad checksum...

You can disable the checksums if you don't care.

> But it isn't. Applications aren't dying, compilers are not segfaulting
> (it was even possible to compile GCC 4.3.2 with the supplied gcc); gdm
> is staying up for weeks at a time... And I wouldn't consider running a
> non-trivial database application on a machine without ECC.

One broken bit may not have caused serious damage: "most things work".

>> Absolutely, memory diags are essential. And you certainly run them if
>> you see unexpected behaviour that has no other obvious cause.
>
> Runs for days, as noted.

Doesn't prove anything.

Casper
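(For the record, turning checksums off is a per-dataset property change; the dataset name here is just an example. It only affects blocks written after the change, so existing blocks keep whatever checksum they were written with.)

  # zfs set checksum=off rpool/export
  # zfs get checksum rpool/export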
On 05/23/09 10:21, Richard Elling wrote:
> <preface>
> This forum is littered with claims of "zfs checksums are broken" where
> the root cause turned out to be faulty hardware or firmware in the data
> path.
> </preface>
>
> I think that before speculating on a redesign, we should get to
> the root cause.

The hardware is clearly misbehaving. No argument. The question is - how far out of reasonable behavior is it? Redesign? I'm not sure I can conceive of an architecture that would make double buffering difficult to do. It is unclear how faulty hardware or firmware could be responsible for such a low error rate (<1 in 4*10^10). I am just asking whether an option for machines with no ECC and their inevitable memory errors is a reasonable thing to suggest in an RFE.

> The checksum occurs in the pipeline prior to the write to disk.
> So if the data is damaged prior to the checksum, then ZFS will
> never know. Nor will UFS. Neither will be able to detect this.
> In Solaris, if the damage is greater than the ability of the memory
> system and CPU to detect or correct, then even Solaris won't know.
> If the memory system or CPU detects a problem, then Solaris fault
> management will kick in and do something, preempting ZFS.

Exactly. My whole point. And without ECC there's no way of knowing. But if the data is damaged /after/ checksum but /before/ write, then you have a real problem...

>> Memory diagnostics just test memory. Disk diagnostics just test disks.
>
> This is not completely accurate. Disk diagnostics also test the
> data path. Memory tests also test the CPU. The difference is the
> amount of test coverage for the subsystem.

Quite. But the disk diagnostic doesn't really test memory beyond what it uses to run itself. Likewise it may not test the FPU, for example.

>> ZFS keeps disks pretty busy, so perhaps it loads the power supply
>> to the point where it heats up and memory glitches are more likely.
>
> In general, for like configurations, ZFS won't keep a disk any more
> busy than other file systems. In fact, because ZFS groups transactions,
> it may create less activity than other file systems, such as UFS.

That's a point in its favor, although not really relevant. If the disks are really busy they will load the PSU more, and that could drag the supply down, which in turn might make errors occur that otherwise wouldn't.

>> Ironically, the OpenSolaris installer does not allow for ZFS
>> mirroring at install time, one time when it might be really important!
>
> This is not an accurate statement. The OpenSolaris installer does
> support mirrored boot disks via the Automated Installer method.
> See http://dlc.sun.com/osol/docs/content/2008.11/AIinstall/index.html.
> You can also install Solaris 10 to mirrored root pools via JumpStart.

I was talking about the live CD here. I prefer to install via JumpStart, but AFAIK OpenSolaris (Indiana) isn't available as an installable DVD. But most consumers are going to be installing from the live CD, and they are the ones with the low-end hardware without ECC. There was recently a suggestion on another thread about an RFE to add mirroring as an install option.

> I think a better test would be to md5 the file from all systems
> and see if the md5 hashes are the same. If they are, then yes,
> the finger would point more in the direction of ZFS.
> The send/recv protocol hasn't changed in quite some time, but it
> is arguably not as robust as it could be.

Thanks! An md5 hash is exactly the kind of test I was looking for.

md5sum on SPARC: 9ec4f7da41741b469fcd7cb8c5040564 (local ZFS)
md5sum on X86:   9ec4f7da41741b469fcd7cb8c5040564 (remote NFS)

> ZFS send/recv uses fletcher4 for the checksums. ZFS uses fletcher2
> for data (by default) and fletcher4 for metadata. The same fletcher
> code is used. So if you believe fletcher4 is broken for send/recv,
> how do you explain that it works for the metadata? Or does it?
> There may be another failure mode at work here...
> (see comment on scrubs at the end of this extended post)

[Did you forget the scrubs comment?] I never said it was broken. I assume the same code is used for both SPARC and X86, and it works fine on SPARC. It would seem that this machine gets memory errors so often (even though it passes the Linux memory diagnostic) that it can never get to the end of a 4GB recv stream. Odd that it can do the md5sum, but as mentioned, perhaps doing the I/O puts more strain on the machine and stresses it to where more memory faults occur. I can't quite picture a software bug that would cause random failures on specific hardware, and I am happy to give ZFS the benefit of the doubt.

> ZFS corrects automatically, when it can. But if the source data is
> bad, then ZFS couldn't possibly detect it.
> For files that ZFS can detect are corrupted and cannot automatically
> correct, you can get the list from "zpool status -xv". The behaviour
> as seen by applications is determined by the zpool failmode property.

Exactly. And "file" on such a file will repeatably segfault. So will pkg fix (there is a bug reported for this). Fortunately rm doesn't segfault, or there would be no way to repair such files. Is there a way to actually get copies of files with bad checksums so they may be examined to see where the fault actually lies?

Quoting the ZFS admin guide: "The failmode property ... provides the failmode property for determining the behavior of a catastrophic pool failure due to a loss of device connectivity or the failure of all devices in the pool." Has this changed since the ZFS admin guide was last updated? If not, it doesn't seem relevant.

> In any event, if file core dumps consistently in the same part of the
> code, then please log a bug against file -- it should not core dump,
> no matter what input it receives.

Ironically, all such files have long since been scrubbed away. I suppose one could deliberately damage a file to reproduce this. It could also be that a library required to /run/ file was the one that was damaged...

> Uhmm, if it were a software bug, one would expect it to fail
> at exactly the same place, no?

Exactly. Not a bug. If it were, it would have been fixed a long time ago on such a critical path. How about an RFE along the lines of "Improved support for machines without ECC memory"? How about one to recover files with bad checksums (a bit like getting fragments out of lost+found in the bad old days)?

> Yep, interesting question.
> But since you say "even zpool status shows no errors at all after a
> couple of scrubs", that makes me think you've had errors in the past?

You bet! 5 unrecoverable errors, and maybe 10 or so recoverable ones. About once a month, zpool status shows an error (note this machine is being used as an X terminal, so it hardly does any I/O) and a scrub gets rid of it.

> I'm still a little confused. If ext3 can't detect data errors, what
> verification have you used to back your claim that it is unaffected?

None at all. But in a read-mostly environment this isn't an issue. Other, known, bugs (in Fedora) account for almost every crash, and Solaris hasn't failed once since it was (finally) installed a few weeks ago with the screensaver disabled :-).

> Please check the image views with md5 digests and get back to us.
> If you get a chance, run SunVTS to verify the memory and CPU,
> too. If the CPU is b0rken, the fletcher4 checksum for the recv may
> be tickling it.

If the CPU were broken, wouldn't it always fail at the same point in the stream? It definitely doesn't. Could you expand a little on what it means to do md5sums on the image views? I'm not sure what an image view is in this context. AFAIK SUNWvts is available only in SXCE, not in OpenSolaris. Oddly, you can load SUNWvts via pkg, but evidently not smcwebserver - please correct me if I am wrong. FWIW we are running SXCE on SPARC (installed via JumpStart) and Indiana on X86 (installed via live CD and updated to snv_111a via pkg).

> <sidebar>
> Microsoft got so tired of defending its software against memory
> errors that it requires Windows Server platforms to use ECC. But
> even Microsoft doesn't have the power to force the vendors to use
> ECC for all PCs.
> </sidebar>

Quite. My point exactly! My only issue is that I have experienced what is IMO an unreasonably large number of unrecoverable errors on mirrored drives. I was merely speculating on reasons for this and possible solutions. Ironically, my applications are running beautifully, and the users are quite happy with the performance and stability. ZFS is wonderful because updates are so easy to roll back and painless to install, snapshots are so useful, and all the other reasons that make every other fs seem so antiquated...

In a sense, the proposal is merely to replicate in software what ECC does in hardware. There may be much better solutions than double buffering the data, and doing it at the level of ZFS is not a complete solution. But doing nothing exposes ZFS users of mirrored drives to the likelihood of unnecessarily unrecoverable failures due to statistically probable memory glitches on machines with no ECC.

Cheers -- Frank
On 05/26/09 03:23, Casper.Dik at Sun.COM wrote:

> And where exactly do you get the second good copy of the data?

From the first. And if it is already bad, as noted previously, this is no worse than the UFS/ext3 case. If you want total freedom from this class of errors, use ECC.

> If you copy the code you've just doubled your chance of using bad memory.
> The original copy can be good or bad; the second copy cannot be better
> than the first copy.

The whole point is that the memory isn't bad. About once a month, 4GB of memory of any quality can experience 1 bit being flipped, perhaps more or less often. If that bit happens to be in the checksummed buffer, then you'll get an unrecoverable error on a mirrored drive. And if I understand correctly, ZFS keeps data in memory for a lot longer than other file systems and uses more memory doing so. Good features, but they make it more vulnerable to random bit flips. This is why decent machines have ECC. To argue that ZFS should work reliably on machines without ECC flies in the face of statistical reality and the reason for ECC in the first place.

> You can disable the checksums if you don't care.

But I do care. I'd like to know if my files have been corrupted, or at least as much as possible. But there are huge classes of files for which the odd flipped bit doesn't matter and the loss of which would be very painful. Email archives and videos come to mind. An easy workaround is to simply store all important stuff on a machine with ECC. Problem solved...

> One broken bit may not have caused serious damage; "most things work".

Exactly.

>>> Absolutely, memory diags are essential. And you certainly run them if
>>> you see unexpected behaviour that has no other obvious cause.
>>
>> Runs for days, as noted.
>
> Doesn't prove anything.

Quite. But nonetheless, the unrecoverable errors did occur on mirrored drives, and that seems to defeat the whole purpose of mirroring, which is, AFAIK, keeping two independent copies of every file in case one gets lost. Writing both images from one buffer appears to violate the premise.

I can think of two RFEs:

1) Add an option to buffer writes on machines without ECC memory to avoid the possibility of random memory flips causing unrecoverable errors with mirrored drives.

2) An option to read files even if they have failed checksums.

1) could be fixed in the documentation - "ZFS should be used with caution on machines with no ECC since random bit flips can cause unrecoverable checksum failures on mirrored drives". Or "ZFS isn't supported on machines with memory that has no ECC".

Disabling checksums is one way of working around 2). But it also disables a cool feature. I suppose you could optionally change checksum failure from an error to a warning, but ideally it would be file by file...

Ironically, I wonder if this is even a problem with raidz? But grotty machines like these can't really support 3 or more internal drives...

Cheers -- Frank
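On point 2), there is at least a crude workaround today: dd can be told to keep going past the EIO that ZFS returns for a record whose checksum cannot be repaired, substituting zeros for the unreadable stretch. A rough sketch under obvious assumptions - the file name is a placeholder, the block size is only an example, and this recovers the readable records, not the damaged ones:

  # Copy what can still be read; failed records are replaced with zeros
  # (conv=noerror,sync) so that, with luck, the offsets of surviving data
  # are preserved in the output file.
  dd if=/path/to/damaged-file of=/var/tmp/salvaged.out \
     bs=128k conv=noerror,sync

  # Compare the salvaged copy against a known-good copy from another
  # system to see exactly which bytes were lost (paths are illustrative).
  cmp -l /var/tmp/salvaged.out /net/goodhost/path/to/good-copy | head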
On Tue, 26 May 2009, Frank Middleton wrote:

> Just asking if an option for machines with no ecc and their inevitable
> memory errors is a reasonable thing to suggest in an RFE.

Machines lacking ECC do not suffer from "inevitable memory errors". Memory errors are not like death and taxes.

> Exactly. My whole point. And without ECC there's no way of knowing.
> But if the data is damaged /after/ checksum but /before/ write, then
> you have a real problem...

If memory does not work, then you do have a real problem. The ZFS ARC consumes a large amount of memory. Note that the problem of corruption around the time of the checksum/write is minor compared to corruption in the ZFS ARC, since data is continually read from the ZFS ARC and so bad data may be returned to the user even though it is (was?) fine on disk. This is as close as ZFS comes to having an Achilles' heel. Solving this problem would require crippling the system performance.

> Never said it was broken. I assume the same code is used for both SPARC
> and X86, and it works fine on SPARC. It would seem that this machine
> gets memory errors so often (even though it passes the Linux memory
> diagnostic) that it can never get to the end of a 4GB recv stream. Odd

Maybe you need a new computer, or need to fix your broken one.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:

> On Tue, 26 May 2009, Frank Middleton wrote:
>> Just asking if an option for machines with no ecc and their inevitable
>> memory errors is a reasonable thing to suggest in an RFE.
>
> Machines lacking ECC do not suffer from "inevitable memory errors".
> Memory errors are not like death and taxes.
>
>> Exactly. My whole point. And without ECC there's no way of knowing.
>> But if the data is damaged /after/ checksum but /before/ write, then
>> you have a real problem...
>
> If memory does not work, then you do have a real problem. The ZFS ARC
> consumes a large amount of memory. Note that the problem of corruption
> around the time of the checksum/write is minor compared to corruption in
> the ZFS ARC since data is continually read from the ZFS ARC and so bad
> data may be returned to the user even though it is (was?) fine on disk.
> This is as close as ZFS comes to having an Achilles' heel. Solving this
> problem would require crippling the system performance.

When running a DEBUG kernel (not something most people would do on a "production" system) ZFS does actually checksum and verify the buffers in the ARC - not on every access, but certain operations cause it to happen.

-- 
Darren J Moffat
On Tue, 26 May 2009, Frank Middleton wrote:

> 1) could be fixed in the documentation - "ZFS should be used with caution
> on machines with no ECC since random bit flips can cause unrecoverable
> checksum failures on mirrored drives". Or "ZFS isn't supported on
> machines with memory that has no ECC".

What problem are you looking to solve? Data is written by application software which includes none of the extra safeguards you are insisting should be in ZFS. This means that the data may be undetectably corrupted.

I strongly recommend that you purchase a system with ECC in order to operate reliably in the (apparent) radium mine where you live. It is time to wake up, smell the radon, and do something about the problem. Check this map to see if there is cause for concern in your area: http://upload.wikimedia.org/wikipedia/en/8/8b/US_homes_over_recommended_radon_levels.gif

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 25-May-09, at 11:16 PM, Frank Middleton wrote:

> On 05/22/09 21:08, Toby Thain wrote:
>> Yes, the important thing is to *detect* them; no system can run reliably
>> with bad memory, and that includes any system with ZFS. Doing nutty
>> things like calculating the checksum twice does not buy anything of
>> value here.
>
> All memory is "bad" if it doesn't have ECC. There are only varying
> degrees of badness. Calculating the checksum twice on its own would
> be nutty, as you say, but doing so on a separate copy of the data
> might prevent unrecoverable errors

I don't see this at all. The kernel reads the application buffer. How does reading it twice buy you anything?? It sounds like you are assuming 1) the buffer includes faulty RAM; and 2) the faulty RAM reads differently each time. Doesn't that seem statistically unlikely to you? And even if you really are chasing this improbable scenario, why make ZFS do the job of a memory tester?

> after writes to mirrored drives.
> You can't detect memory errors if you don't have ECC. But you can
> try to mitigate them. Without doing so makes ZFS less reliable than
> the memory it is running on. The problem is that ZFS makes any file
> with a bad checksum inaccessible, even if one really doesn't care
> if the data has been corrupted. A workaround might be a way to allow
> such files to be readable despite the bad checksum...

I am not sure what you are trying to say here.

> ...
>
>> How can a machine with bad memory "work fine with ext3"?
>
> It does. It works fine with ZFS too. Just really annoying unrecoverable
> files every now and then on mirrored drives. This shouldn't happen even
> with lousy memory and wouldn't (doesn't) with ECC. If there was a way
> to examine the files and their checksums, I would be surprised if they
> were different (if they were, it would almost certainly be the controller
> or the PCI bus itself causing the problem). But I speculate that it is
> predictable memory hits.

You're making this harder than it really is. Run a memory test. If it fails, take the machine out of service until it's fixed. There's no reasonable way to keep running faulty hardware.

--Toby

> -- Frank
On 26-May-09, at 10:21 AM, Frank Middleton wrote:

> On 05/26/09 03:23, Casper.Dik at Sun.COM wrote:
>
>> And where exactly do you get the second good copy of the data?
>
> From the first. And if it is already bad, as noted previously, this
> is no worse than the UFS/ext3 case. If you want total freedom from
> this class of errors, use ECC.
>
>> If you copy the code you've just doubled your chance of using bad memory.
>> The original copy can be good or bad; the second copy cannot be better
>> than the first copy.
>
> The whole point is that the memory isn't bad. About once a month, 4GB
> of memory of any quality can experience 1 bit being flipped, perhaps
> more or less often.

What you are proposing does practically nothing to mitigate "random bit flips". Think about the probabilities involved. You're testing one tiny buffer, very occasionally, for an extremely improbable event. It is also nothing to do with ZFS, and leaves every other byte of your RAM untested. See the reasoning?

--Toby

> ...
>
> Cheers -- Frank
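To put rough numbers on "one tiny buffer": the figures below are only illustrative (a single 128 KB record buffer out of 4 GB of RAM), but they show how small that window is compared with the rest of memory:

  # Fraction of 4 GB of RAM occupied by one 128 KB record buffer
  echo 'scale=8; (128*1024) / (4*1024^3)' | bc -l
  # .00003051  (about 0.003%)

  # So even if a bit does flip somewhere during the month, the chance that
  # it lands in that particular buffer while it is in flight is tiny.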
Frank brings up some interesting ideas, some of which might need some additional thought...

Frank Middleton wrote:

> On 05/23/09 10:21, Richard Elling wrote:
>> <preface>
>> This forum is littered with claims of "zfs checksums are broken" where
>> the root cause turned out to be faulty hardware or firmware in the data
>> path.
>> </preface>
>>
>> I think that before you should speculate on a redesign, we should get to
>> the root cause.
>
> The hardware is clearly misbehaving. No argument. The question is - how
> far out of reasonable behavior is it?

Hardware is much less expensive than software, even free software. Your system has a negative ROI, kinda like trading credit default swaps. The best thing you can do is junk it :-)

> Redesign? I'm not sure I can conceive an architecture that would make
> double buffering difficult to do. It is unclear how faulty hardware or
> firmware could be responsible for such a low error rate (<1 in 4*10^10).
> Just asking if an option for machines with no ecc and their inevitable
> memory errors is a reasonable thing to suggest in an RFE.

It is a good RFE, but it isn't an RFE for the software folks.

>> The checksum occurs in the pipeline prior to write to disk.
>> So if the data is damaged prior to checksum, then ZFS will
>> never know. Nor will UFS. Neither will be able to detect
>> this. In Solaris, if the damage is greater than the ability
>> of the memory system and CPU to detect or correct, then
>> even Solaris won't know. If the memory system or CPU
>> detects a problem, then Solaris fault management will kick
>> in and do something, preempting ZFS.
>
> Exactly. My whole point. And without ECC there's no way of knowing.
> But if the data is damaged /after/ checksum but /before/ write, then
> you have a real problem...

To put this in perspective, ECC is a broad category. When we think of ECC for memory, it is usually Single Error (bit) Correction, Double Error (bit) Detection (SECDED). A well designed system will also do Single Device Data Correction (aka Chipkill or Extended ECC, since Chipkill is trademarked). What this means is that faults of more than 2 bits per word are not detected, unless all of the faults occur in the same chip for SDDC cases. Clearly, this wouldn't scale well to large data streams, which is why they use checksums like Fletcher or hash functions like SHA-256.

>>> ZFS keeps disks pretty busy, so perhaps it loads the power supply
>>> to the point where it heats up and memory glitches are more likely.
>>
>> In general, for like configurations, ZFS won't keep a disk any more
>> busy than other file systems. In fact, because ZFS groups transactions,
>> it may create less activity than other file systems, such as UFS.
>
> That's a point in its favor, although not really relevant. If the disks
> are really busy they will load the PSU more, and that could drag the
> supply down, which in turn might make errors occur that otherwise wouldn't.

The dynamic loads of modern disk drives are not very great. I don't believe your argument is very strong here. Also, the solution is, once again, fix the hardware.

>> I think a better test would be to md5 the file from all systems
>> and see if the md5 hashes are the same. If they are, then yes,
>> the finger would point more in the direction of ZFS. The
>> send/recv protocol hasn't changed in quite some time, but it
>> is arguably not as robust as it could be.
>
> Thanks! md5 hash is exactly the kind of test I was looking for.
> md5sum on SPARC  9ec4f7da41741b469fcd7cb8c5040564 (local ZFS)
> md5sum on X86    9ec4f7da41741b469fcd7cb8c5040564 (remote NFS)

Good.

>> ZFS send/recv use fletcher4 for the checksums. ZFS uses fletcher2
>> for data (by default) and fletcher4 for metadata. The same fletcher
>> code is used. So if you believe fletcher4 is broken for send/recv,
>> how do you explain that it works for the metadata? Or does it?
>> There may be another failure mode at work here...
>> (see comment on scrubs at the end of this extended post)
>
> [Did you forget the scrubs comment?]

No, you responded that you had been seeing scrubs fix errors.

> Never said it was broken. I assume the same code is used for both SPARC
> and X86, and it works fine on SPARC. It would seem that this machine
> gets memory errors so often (even though it passes the Linux memory
> diagnostic) that it can never get to the end of a 4GB recv stream. Odd
> that it can do the md5sum, but as mentioned, perhaps doing the i/o
> puts more strain on the machine and stresses it to where more memory
> faults occur. I can't quite picture a software bug that would cause
> random failures on specific hardware and I am happy to give ZFS the
> benefit of the doubt.

Yes, software can trigger memory failures. More below...

>>>> It would have been nice if we were able to recover the contents of the
>>>> file; if you also know what was supposed to be there, you can diff and
>>>> then we can find out what was wrong.
>>>
>>> "file" on those files resulted in "bus error". Is there a way to actually
>>> read a file reported by ZFS as unrecoverable to do just that (and to
>>> separately retrieve the copy from each half of the mirror)?
>>
>> ZFS corrects automatically, when it can. But if the source data is
>> bad, then ZFS couldn't possibly detect it.
>>
>> For files that ZFS can detect are corrupted and cannot automatically
>> correct, you can get the list from "zpool status -xv". The behaviour
>> as seen by applications is determined by the zpool failmode property.
>
> Exactly. And "file" on such a file will repeatably segfault. So will
> pkg fix (there is a bug reported for this). Fortunately rm doesn't
> segfault or there would be no way to repair such files. Is there
> a way to actually get copies of files with bad checksums so they may be
> examined to see where the fault actually lies?

Yes, to some degree. See a few of the blogs in this collection: http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file

> Quoting the ZFS admin guide: "The failmode property ... provides the
> failmode property for determining the behavior of a catastrophic pool
> failure due to a loss of device connectivity or the failure of all
> devices in the pool." Has this changed since the ZFS admin guide
> was last updated? If not, it doesn't seem relevant.

It is relevant in those cases where you want a process to continue though the hardware has failed. Rather than panic, you can get an EIO.

>> Uhmm, if it were a software bug, one would expect it to fail
>> at exactly the same place, no?
>
> Exactly. Not a bug. If it were, it would have been fixed a long time
> ago on such a critical path. How about an RFE along the lines of
> "Improved support for machines without ecc memory"? How about one
> to recover files with bad checksums (a bit like getting fragments
> out of lost+found in the bad old days)?

argv! Why does this keep coming up? UFS fsck does not recover data! It only recovers metadata, sometimes.

>> Yep, interesting question.
>> But since you say "even zpool status shows no error at all after a
>> couple of scrubs" makes me think that you've had errors in the past?
>
> You bet! 5 unrecoverable errors, and maybe 10 or so recoverable
> ones. About once a month, zpool status shows an error (note this
> machine is being used as an X-terminal, so it hardly does any i/o)
> and a scrub gets rid of it.

heh, if the fault is in memory, then the scrub will be correcting correct data :-)

>> Please check the image views with md5 digests and get back to us.
>> If you get a chance, run SunVTS to verify the memory and CPU,
>> too. If the CPU is b0rken, the fletcher4 checksum for the recv may
>> be tickling it.
>
> If the CPU was broken, wouldn't it always fail at the same point in
> the stream?

Not necessarily. All failure modes are mechanical. There is a class of failure modes in semiconductors which are due to changes in the speed of transistors as a function of temperature. Temperature increases as a function of the frequency of input changes in a CMOS gate. So, if your software causes a specific change in the temperature of a portion of a device, then it could trip on a temperature-induced fault. These tend to be rare because of the margins, but if the hardware is flaky, it is already arguably beyond the margins. These sorts of codes might be humorously classified as halt-and-catch-fire. But they do exist, and there are some cool thermographs which show how the heat is distributed for various workloads. http://en.wikipedia.org/wiki/Halt_and_Catch_Fire

> It definitely doesn't. Could you expand a little on what
> it means to do md5sums on the image views? I'm not sure what an image
> view is in this context. AFAIK SUNWvts is available only in SXCE, not
> in Open Solaris. Oddly, you can load SUNWvts via pkg, but evidently
> not smcwebserver - please correct me if I am wrong. FWIW we are running
> SXCE on SPARC (installed via jumpstart) and indiana on X86 (installed
> via live CD and updated to snv111a via pkg).
>
>> <sidebar>
>> Microsoft got so tired of defending its software against memory
>> errors, that it requires Windows Server platforms to use ECC. But
>> even Microsoft doesn't have the power to force the vendors to use
>> ECC for all PCs.
>> </sidebar>
>
> Quite. My point exactly! My only issue is that I have experienced
> what is IMO an unreasonably large number of unrecoverable errors on
> mirrored drives. I was merely speculating on reasons for this and
> possible solutions. Ironically, my applications are running beautifully,
> and the users are quite happy with the performance and stability. ZFS
> is wonderful because updates are so easy to roll back and painless
> to install, snapshots are so useful, and all the other reasons that
> make every other fs seem so antiquated...

There may be an opportunity here. Let's assume that your disks were fine and the bad checksums were caused by transient memory faults. In such cases, a re-read of the data would effectively clear the transient fault. In a sense, this is where mirroring works against us -- ZFS will attempt to repair. This brings up a lot of much more complex system issues, which makes me glad that FMA exists ;-)
 -- richard
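Since FMA came up: the checksum and I/O error reports that ZFS feeds into FMA can be inspected directly, which is one way to see whether the errors correlate with a particular device or arrive with no disk activity at all. A sketch only - the exact output fields vary by build, and the grep pattern is just an example:

  # Dump the raw error reports FMA has collected; ZFS checksum events
  # normally carry a class of ereport.fs.zfs.checksum.
  fmdump -eV | grep -i 'class = ereport.fs.zfs'

  # Summarise any diagnosed faults and the suspect devices.
  fmadm faulty

  # Per-vdev read/write/checksum counters since the last clear.
  zpool status -v rpool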
Frank Middleton <f.middleton at apogeect.com> writes:

> Exactly. My whole point. And without ECC there's no way of knowing.
> But if the data is damaged /after/ checksum but /before/ write, then
> you have a real problem...

we can't do much to protect ourselves from damage to the data itself (an extra copy in RAM will help little and ruin performance). damage to the bits holding the computed checksum before it is written can be alleviated by doing the calculation independently for each written copy. in particular, this will help if the bit error is transient.

since the number of octets in RAM holding the checksum is dwarfed by the number of octets occupied by data (256 bits vs. one mebibit for a full default-sized record), such a paranoia mode will most likely tell you that the *data* is corrupt, not the checksum. but today you don't know, so it's an improvement in my book.

> Quoting the ZFS admin guide: "The failmode property ... provides the
> failmode property for determining the behavior of a catastrophic
> pool failure due to a loss of device connectivity or the failure of
> all devices in the pool." Has this changed since the ZFS admin
> guide was last updated? If not, it doesn't seem relevant.

I guess checksum error handling is orthogonal to this and should have its own property. it sure would be nice if the admin could ask the OS to deliver the bits contained in a file, no matter what, and just log the problem.

> Cheers -- Frank

thank you for pointing out this potential weakness in ZFS' consistency checking, I didn't realise it was there. also thank you, all ZFS developers, for your great job :-)
-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
On 05/26/09 13:07, Kjetil Torgrim Homme wrote:

> also thank you, all ZFS developers, for your great job :-)

I'll second that! A great achievement - puts Solaris in a league of its own, so much so that you'd want to run it on all your hardware, however crappy the hardware might be ;-)

There are too many branches in this thread now. Going to summarize here without responding to some of the less than helpful comments, although death and taxes seems an ironic metaphor in the current climate :-)

In some ways this isn't a technical issue. This much maligned machine and its ilk are running Solaris and ZFS quite happily, and the users are pleased with the stability and performance. But their applications are running on machines (via xdmcp) with ECC, and ZFS mirror/raidz doesn't have a problem there.

Picture a new convert with enthusiasm for ZFS, but who has a less than perfect PC which has otherwise been apparently quite reliable. Perhaps it already has mirrored drives. He/she installs Solaris from the live CD (and finds that the installer doesn't support mirroring). The install fails, or worse, afterwards he/she loses that movie of Aunt Minnie playing golf, because a checksum error makes the file unrecoverable. This could be very frustrating and make the blogosphere go crazy, especially if the PC passes every diagnostic. It would be even worse if a file were lost on a mirror.

Unrecoverable files on mirrored drives simply shouldn't happen. What kind of hardware error (other than a rare bit flip) could conceivably cause 5 out of 15 checksum errors to be unrecoverable when mirrored during the write of around 20*10^10 bits? ZFS has both a larger spatial and temporal footprint than other file systems, so it is slightly more vulnerable to the once-a-month on average bit flip that will afflict many a PC with 4GB of memory.

Perhaps someone with a statistical bent could step in and actually calculate the probability of random errors, perhaps assuming that half of available memory is used to queue writes, that there is a 95% chance of one bit flip per month per 4GB, and that there is a (say) 10% duty cycle over a period of a year. Alternatively, the chance of a 1-bit flip over a period of 6 hours at a 100% duty cycle, repeated 1461 times (1461 installs per year at 100%). Seems to me intuitively that 6 out of 1461 installs will fail due to an unrecoverable checksum failure, but I'm not a statistician. Multiply that failure rate by the number of Live CD installs you expect over the next year (noting that *all* checksum failures are unrecoverable without mirroring) and you'll count quite a few frustrated would-be installers.

Maybe ZFS without ECC and no mirroring should disable checksumming by default - it would be a little worse than UFS and ext3 (due to its larger spatial and temporal footprints) but still provide all the other great features.

Proposed RFE #1

Add an option to make files with unrecoverable checksum failures readable and to pass the best image possible back to the application. [How much do you bet most folks would select this option?] If both sides of the mirror could be read, it might help to diagnose the problem, which obviously must be in the hardware somewhere. If both images are identical, then it surely must be memory. If they differ, then what could it be?

Proposed RFE #2

Add an option for machines with mirrored drives but without ECC to double buffer and only then calculate the checksums (for those who are reasonably paranoid about cosmic rays).

Proposed RFE #3 (or is this a bug report?)
Add diagnostics to the ZFS recv to help understand why a perfectly good ZFS send can't be received when the same machine can successfully compute an md5sum over the same stream. Even something like "recv failed at block nnnnnnn" would help. For example, it seems to fail suspiciously close to 2GB on a 32-bit machine.

Proposed RFE #4

Disable checksumming by default if no mirroring and no ECC is detected. (Of course this assumes an install-to-mirror option.) If it could still checksum, but make a failure a warning instead of an error, this could turn into a great feature for cheapskates with machines that have no ECC.

---

#1 and #2 above could be fixed in the documentation: "Random memory bit flips can theoretically cause unrecoverable checksum failures, even if the data is mirrored. Either disable the checksum feature or only run ZFS on systems with ECC memory if you have any data you don't want to risk losing [even with a 1 bit error]".

None of this is meant as a criticism of ZFS, just suggestions to help make a merely superb file system into the unbeatable one it should be. (I suppose it really is a system of file systems, but ZFS it is...)

Regards -- Frank
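On the statistics Frank asks for, a rough back-of-envelope can be done with bc under his own stated assumptions (a 95% chance of at least one flip per month in 4GB, a 6-hour install window, flips treated as Poisson-distributed). Note this only bounds the chance of a flip *somewhere* in RAM during the window; whether a given flip lands in data that is about to be checksummed and written is a further, much smaller factor:

  # P(>=1 flip per ~730-hour month) = 0.95  =>  monthly rate r = -ln(0.05)
  # P(>=1 flip during a 6-hour install)     =  1 - exp(-r * 6/730)
  echo 'scale=6; r = -l(0.05); 1 - e(-r * 6/730)' | bc -l
  # ~ .0243  (roughly 2 or 3 installs in 100 see a flip somewhere in RAM)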