hi folks

I've been running my fileserver at home on Linux for a couple of years, and last week I finally reinstalled it with Solaris 10 u4. I borrowed a bunch of disks from a friend, copied all the files over, reinstalled the fileserver, and copied the data back.

Everything went fine, but now, a few days later, quite a lot of files have been corrupted. Here's the output:

# zpool status data
  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed with 422 errors on Mon Feb 25 00:32:18 2008
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0 5.52K
          raidz1    ONLINE       0     0 5.52K
            c0t0d0  ONLINE       0     0 10.72
            c0t1d0  ONLINE       0     0 4.59K
            c0t2d0  ONLINE       0     0 5.18K
            c0t3d0  ONLINE       0     0 9.10K
            c1t0d0  ONLINE       0     0 7.64K
            c1t1d0  ONLINE       0     0 3.75K
            c1t2d0  ONLINE       0     0 4.39K
            c1t3d0  ONLINE       0     0 6.04K

errors: 388 data errors, use '-v' for a list

Last night when I found out about this, it told me there were errors in about 50 files. So I scrubbed the whole pool, and it found a lot more corrupted files.

The temporary system I used to hold the data while I installed Solaris on my fileserver is running nv build 80, and there are no errors on there.

What could be the cause of these errors? I don't see any hardware errors on my disks:

# iostat -En | grep -i error
c3d0    Soft Errors: 0   Hard Errors: 0  Transport Errors: 0
        Media Error: 0   Device Not Ready: 0  No Device: 0  Recoverable: 0
c4d0    Soft Errors: 0   Hard Errors: 0  Transport Errors: 0
        Media Error: 0   Device Not Ready: 0  No Device: 0  Recoverable: 0
c0t0d0  Soft Errors: 574 Hard Errors: 0  Transport Errors: 0
        Media Error: 0   Device Not Ready: 0  No Device: 0  Recoverable: 0
c1t0d0  Soft Errors: 549 Hard Errors: 0  Transport Errors: 0
        Media Error: 0   Device Not Ready: 0  No Device: 0  Recoverable: 0
c0t1d0  Soft Errors: 14  Hard Errors: 0  Transport Errors: 0
        Media Error: 0   Device Not Ready: 0  No Device: 0  Recoverable: 0
c0t2d0  Soft Errors: 549 Hard Errors: 0  Transport Errors: 0
        Media Error: 0   Device Not Ready: 0  No Device: 0  Recoverable: 0
c0t3d0  Soft Errors: 549 Hard Errors: 0  Transport Errors: 0
        Media Error: 0   Device Not Ready: 0  No Device: 0  Recoverable: 0
c1t1d0  Soft Errors: 548 Hard Errors: 0  Transport Errors: 0
        Media Error: 0   Device Not Ready: 0  No Device: 0  Recoverable: 0
c1t2d0  Soft Errors: 14  Hard Errors: 0  Transport Errors: 0
        Media Error: 0   Device Not Ready: 0  No Device: 0  Recoverable: 0
c1t3d0  Soft Errors: 548 Hard Errors: 0  Transport Errors: 0
        Media Error: 0   Device Not Ready: 0  No Device: 0  Recoverable: 0

There are a lot of soft errors, though. Linux said one disk had gone bad, but I figured the SATA cable was somehow broken, so I replaced it before installing Solaris. And Solaris didn't and doesn't see any actual hardware errors on the disks, does it?
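By the way, the per-file list that the output mentions comes from:

# zpool status -v data

which repeats the status above and then names each file with a detected error.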
My guess is that you have some defective hardware in the system that's causing bit flips in the checksum or the data payload. I'd suggest running some sort of system diagnostics for a few hours to see if you can locate the bad piece of hardware. My suspicion would be your memory or CPU, but that's just a wild guess, based on the number of errors you have and the number of devices they're spread over. Could it be that you have been corrupting data for some time and only now know it?

Oh - and I'd also look around based on your disk controller and ensure that there are no newer patches for it, just in case it's one for which there was a known problem (which was worked around in the driver). I *think* there was an issue with at least one or two...

Cheers!

Nathan.
On Monday, 25 February 2008 at 11:05 -0800, Sandro wrote:
> [original message quoted]

Hi,

I had the same symptoms recently. I also thought the disks were dying, but I was wrong. I suspected the RAM: no. In the end it was because I had mixed RAID cards across different PCI buses: two 64-bit buses (no problem with those) and one 32-bit PCI bus, which caused *all* the checksum errors. I kicked out the card on the 32-bit PCI bus and everything worked fine.
Hope it helps,

--
Nicolas Szalay
Systems & network administrator
Hey,

Thanks for your answers, guys. I'll run VTS to stress-test the CPU and memory.

And I just checked the block diagram of my motherboard (Gigabyte M61P-S3). It doesn't even have 64-bit PCI slots: just standard old 33 MHz 32-bit PCI, plus a couple of newer PCIe slots. But my two controllers are both the same vendor/version and are both connected to the same PCI bus.
On Tuesday, 26 February 2008 at 05:59 -0800, Sandro wrote:
> [previous message quoted]

Looks like 32-bit PCI & ZFS definitely hurts :D

--
Nicolas Szalay
Systems & network administrator
Haha, very funny :D

Just the controllers are on a 32-bit PCI bus; Solaris itself is running 64-bit:

[root@ragnaros] /var/tmp/ # isainfo
amd64 i386

And besides, a lot of our customers are having serious problems with their Thumpers and ZFS and stuff...
> So I scrubbed the whole pool and it found a lot more corrupted files.

My condolences :)

General questions and comments about ZFS and data corruption:

I thought RAID-Z would correct data errors automatically using the parity data. How wrong am I on that? Perhaps a parity correction was already tried, and there was too much corruption for it to succeed, implying a very significant amount of data corruption?

Assuming the errors are being generated by bad hardware somewhere between the disk and the CPU (inclusive), how could ZFS be configured to handle these errors automatically? Set the copies property to 2, I think. Anything else?
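(I assume that would look something like:

# zfs set copies=2 data
# zfs get copies data

though as far as I know the copies property only applies to data written after it is set, so existing files wouldn't gain the extra protection.)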
Thanks for your reassuring post, loomy :)

I'm pretty sure the reason for all this is some bad hardware, but I can't get VTS to work; it looks like it's not supported on this kind of hardware. And in order to run some other stress-test software I would have to connect a monitor, keyboard and DVD-ROM, which I'm just so sick of doing :) Hopefully I can motivate myself on the weekend. I'll keep you all updated when I find something.
> I thought RAIDZ would correct data errors automatically with the parity data.

Right. However, if the data is corrupted while in memory (e.g. on a PC with non-parity memory), there's nothing ZFS can do to detect that. I mean, not even theoretically. The best we could do would be to narrow the window of vulnerability by recomputing the checksum every time we accessed an in-memory object, which would be terribly expensive.

Jeff
Nathan Kroenert
2008-Mar-02 22:49 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Say, Jeff -

Speaking of expensive but interesting things we could do -

From the little I know of ZFS's checksum, it's NOT like the ECC we use in memory, in that it's not something we can use to determine which bit flipped in the event of a single bit flip in the data. (I could be completely wrong here... but...)

What is the chance we could put a little more resilience into ZFS such that if we do get a checksum error, we systematically flip each bit in sequence and recheck the checksum to see if we could in fact proceed (including writing the data back correctly)? Or build into the checksum something analogous to ECC, so we can choose to use non-ZFS-protected disks and paths but still have single-bit-flip protection...

Considering the pain that users of non-ZFS-protected systems suffer when there is minor corruption, it would be fantastic if we could attempt to work through the simple case of a single flipped bit for the user, and, if we find that flipping said bit gets us to a consistent checksum, proceed. I know that at the default 128K block size that's a lot of bits, and a lot of operations to arrive at an answer either way, but if we could log an error, and spend the cycles to try to recover and proceed without user intervention, that would have to be a huge win for ZFS, even if the recalculation took a few seconds.

What do others on the list think? Do we have enough folks using ZFS on HDS / EMC / other hardware RAID(X) environments that might find this useful?

Thoughts? And of course, sorry if we already do this... :)

Nathan.
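P.S. To make the idea concrete, here's the kind of loop I'm imagining -- purely illustrative C, not actual ZFS code, with checksum() standing in for whatever checksum function the block was written with:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { uint64_t word[4]; } cksum_t;

    /* Placeholder for the block's real checksum routine. */
    extern void checksum(const void *buf, size_t len, cksum_t *out);

    /*
     * Flip each bit of buf in turn and see whether any single flip
     * makes the checksum match.  Returns the flipped bit's index
     * (leaving the repaired data in buf), or -1 if no single-bit
     * flip explains the error.
     */
    long
    try_single_bit_repair(uint8_t *buf, size_t len, const cksum_t *expected)
    {
            cksum_t actual;

            for (size_t bit = 0; bit < len * 8; bit++) {
                    buf[bit / 8] ^= (uint8_t)(1 << (bit % 8));  /* flip */
                    checksum(buf, len, &actual);
                    if (memcmp(&actual, expected, sizeof (actual)) == 0)
                            return ((long)bit);
                    buf[bit / 8] ^= (uint8_t)(1 << (bit % 8));  /* undo */
            }
            return (-1);
    }

For a 128K block that's about a million checksum passes, so it's obviously not cheap -- but it would only have to run once every other recovery path has already failed.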
Bob Friesenhahn
2008-Mar-02 23:28 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Mon, 3 Mar 2008, Nathan Kroenert wrote:

> Speaking of expensive, but interesting things we could do -
>
> From the little I know of ZFS's checksum, it's NOT like the ECC
> checksum we use in memory in that it's not something we can use to
> determine which bit flipped in the event that there was a single bit
> flip in the data. (I could be completely wrong here... but...)

It seems that the emphasis on single-bit errors may be misplaced. Is there evidence which suggests that single-bit errors are much more common than multiple-bit errors?

> What is the chance we could put a little more resilience into ZFS such
> that if we do get a checksum error, we systematically flip each bit in
> sequence and check the checksum to see if we could in fact proceed
> (including writing the data back correctly).

It is easier to retry the disk read another 100 times, or to store the data in multiple places.

> Or build into the checksum something analogous to ECC so we can choose
> to use NON-ZFS protected disks and paths, but still have single bit flip
> protection...

Disk drives commonly use an algorithm like Reed-Solomon (http://en.wikipedia.org/wiki/Reed-Solomon_error_correction) which provides forward error correction. This is done in hardware. Doing the same in software is likely to be very slow.

> What do others on the list think? Do we have enough folks using ZFS on
> HDS / EMC / other hardware RAID(X) environments that might find this useful?

It seems that since ZFS is intended to support extremely large storage pools, available energy should be spent ensuring that the storage pool remains healthy or can be repaired. Loss of individual file blocks is annoying, but loss of entire storage pools is devastating.

Since raw disk is cheap (and backups are expensive), it makes sense to write more redundant data rather than to minimize loss through exotic algorithms. Even if RAID is not used, redundant copies may be used on the same disk to help protect against block read errors.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jeff Bonwick
2008-Mar-03 00:28 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Nathan: yes. Flipping each bit and recomputing the checksum is not only possible, we actually did it in early versions of the code. The problem is that it's really expensive. For a 128K block, that's a million bits, so you have to re-run the checksum a million times, on 128K of data. That's 128GB of data to churn through.

So Bob: you're right too. It's generally much cheaper to retry the I/O, try another disk, try a ditto block, etc. That said, when all else fails, a 128GB computation is a lot cheaper than a restore from tape.

At some point it becomes a bit philosophical. Suppose the block in question is a single user data block. How much of the machine should you be willing to dedicate to getting that block back? I mean, suppose you knew that it was theoretically possible, but would consume 500 hours of CPU time during which everything else would be slower -- and the affected app's read() system call would hang for 500 hours. What is the right policy? There's no one right answer. If we were to introduce a feature like this, we'd need some admin-settable limit on how much time to dedicate to it.

For some checksum functions like fletcher2 and fletcher4, it is possible to do much better than brute force because you can compute an incremental update -- that is, you can compute the effect of changing the nth bit without rerunning the entire checksum. This is, however, not possible with SHA-256 or any other secure hash.

We ended up taking that code out because single-bit errors didn't seem to arise in practice, and in testing, the error correction had a rather surprising unintended side effect: it masked bugs in the code! The nastiest kind of bug in ZFS is something we call a future leak, which is when some change from txg (transaction group) 37 ends up going out as part of txg 36. It normally wouldn't matter, except if you lost power before txg 37 was committed to disk. On reboot you'd have inconsistent on-disk state (all of 36 plus random bits of 37). We developed coding practices and stress tests to catch future leaks, and as far as I know we've never actually shipped one. But they are scary.

If you *do* have a future leak, it's not uncommon for it to be a very small change -- perhaps incrementing a counter in some on-disk structure. The thing is, if the counter is going from even to odd, that's exactly a one-bit change. The single-bit error correction logic would happily detect these and fix them up -- not at all what you want when testing! (Of course, we could turn it off during testing -- but then we wouldn't be testing it.)

All that said, I'm still occasionally tempted to bring it back. It may become more relevant with flash memory as a storage medium.

Jeff
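P.S. For the curious, the incremental property looks like this on a simplified one-stream Fletcher (my sketch only -- ZFS's actual fletcher2/fletcher4 interleave several streams, but the linearity is the same):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint64_t a, b; } fletcher_t;

    void
    fletcher_compute(const uint64_t *buf, size_t nwords, fletcher_t *f)
    {
            f->a = f->b = 0;
            for (size_t i = 0; i < nwords; i++) {
                    f->a += buf[i];  /* plain sum of the words */
                    f->b += f->a;    /* word i ends up weighted (nwords - i) */
            }
    }

    /* Predict the checksum after adding delta to word j -- O(1). */
    void
    fletcher_update(const fletcher_t *old, size_t nwords, size_t j,
        uint64_t delta, fletcher_t *out)
    {
            out->a = old->a + delta;
            out->b = old->b + (nwords - j) * delta;
    }

Because both sums are linear in the data, you can also go the other way: given the stored and actual checksums, solve for the (word, bit) pair whose flip explains the difference, with no brute force at all. A secure hash deliberately destroys exactly this structure.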
Darren J Moffat
2008-Mar-03 10:37 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Jeff Bonwick wrote:
> All that said, I'm still occasionally tempted to bring it back.
> It may become more relevant with flash memory as a storage medium.

Would it be worth considering bringing it back as part of zdb rather than as part of the core zio layer?

--
Darren J Moffat
me
2008-Mar-03 11:03 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
> All that said, I'm still occasionally tempted to bring it back.
> It may become more relevant with flash memory as a storage medium.

How common would single on-disk bit flips be in 128K blocks? Disk manufacturers quantify it as 1 in 10 to the power of god-knows-what, which practically means every few years or so. If this is just optimistic marketing crap, wouldn't it be viable to have a bit-flip checker as an option to the scrub mode (with tons of warnings, a yes/no confirmation, and a recommendation to do this in single-user mode)? I'm sure people using no redundancy (e.g. future OSX users) would appreciate it, saving some grief if the bad blocks are indeed just single bit flips.

-mg
Bob Friesenhahn
2008-Mar-03 16:10 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Mon, 3 Mar 2008, me wrote:

> I'm sure people using no redundancy (e.g. future OSX users) would
> appreciate it, saving some grief if the bad blocks are indeed just
> single bit flips.

In case people have somehow forgotten, most other filesystems in common use do not checksum data blocks. In spite of this, we rarely hear users wailing about single bit flips in their files. Instead we usually hear about people who find whole chunks of their file missing or overwritten, or find that the hard disk does not spin up at all any more. As we move toward solid-state storage, the typical error cases will surely differ.

Since ZFS is smart and is able to perform tasks in the background, one possibility to consider is to use otherwise unused storage space to store "weak" ditto copies or even forward-error-correction data. However, rather than explicitly writing these blocks during normal I/O, they could be created by a background task, and reused for other purposes when required. In this way, otherwise unused disk blocks would be taken advantage of in a similar way that otherwise unused memory is used to cache filesystem data. If the filesystem becomes very full, then there would be less protection, but if the filesystem has plenty of free space then there would be lots of protection.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Richard Elling
2008-Mar-03 16:19 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Darren J Moffat wrote:
> Jeff Bonwick wrote:
>
>> All that said, I'm still occasionally tempted to bring it back.
>> It may become more relevant with flash memory as a storage medium.
>
> Would it be worth considering bringing it back as part of zdb rather
> than part of the core zio layer ?

I'm not convinced that single bit flips are the common failure mode for disks. Most enterprise-class disks already have enough ECC to correct at least 8 bytes per block. By the time the disk sends something back that it couldn't correct, there is no telling how many bits have been flipped, but I'll bet a steak dinner it is more than one.

There may be some benefit for path failures, but I've not seen any measured data on those failure modes. For paths which have framing checksums, we would expect them to be detected there.

-- richard
Richard Elling
2008-Mar-03 16:27 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
me wrote:
>> All that said, I'm still occasionally tempted to bring it back.
>> It may become more relevant with flash memory as a storage medium.
>
> How common would single on-disk bit flips be in 128K blocks? [...]

Most enterprise-class disks are rated at 1 uncorrectable read error per 10^15 bits(!) read. For a 1 TByte disk, that means you can expect an uncorrectable read error roughly once for every 125 times you read the entire disk. Contrast this with consumer-class disks, which are rated at a UER of 1 in 10^14, or roughly once every 12 full reads of a 1 TByte disk.

I posted some of our measured field data a while back:
http://blogs.sun.com/relling/entry/zfs_copies_and_data_protection

-- richard
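(Back-of-envelope, taking 1 TByte = 10^12 bytes:

    one full read  = 8 x 10^12 bits
    reads per UER  = 10^15 / (8 x 10^12) ~ 125   (enterprise, 1 in 10^15)
                   = 10^14 / (8 x 10^12) ~ 12.5  (consumer, 1 in 10^14)

The exact figures shift a little depending on how the vendor counts a terabyte.)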
Darren J Moffat
2008-Mar-03 16:35 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Richard Elling wrote:
> Darren J Moffat wrote:
>> Would it be worth considering bringing it back as part of zdb rather
>> than part of the core zio layer ?
>
> I'm not convinced that single bit flips are the common
> failure mode for disks. Most enterprise class disks already
> have enough ECC to correct at least 8 bytes per block.

And for consumer rather than enterprise-class disks? Which, after all, are the people most likely to be hit hardest, because:

a) their disk is of cheaper quality
b) they are less likely to have a redundant pool config, e.g. on a laptop which can physically only have one disk
c) they are less likely to have an off-pool backup
d) they can't recover easily if the filesystem doesn't help them, and are used to filesystems that give them their data even if it is corrupt.

For example, a few bit flips in an MP3 or MPEG-4 file probably don't matter too much to many people on a consumer system, and they would rather have that than have ZFS tell them they can't have the pool or some files in it.

--
Darren J Moffat
> I'm not convinced that single bit flips are the common failure mode for disks.

I think the original suggestion might be aimed at bad RAM more than bad disks. Just about every home computer does not have ECC RAM, so as ZFS transitions from the enterprise to the home, this (optional) feature sounds very worthwhile.

I've experienced some bad RAM in my day, and I've only noticed when applications started acting weird and crashing. When I've run memtest86+ on such sticks of RAM, I've found that very few errors (maybe 2-8) are usually reported. I'm not sure if those errors are bad bits or something more granular.

The original suggestion sounds like a useful one for the body of users outside of Sun's usual ECC-RAM-using clientele.
Gary Mills
2008-Mar-03 16:59 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Mon, Mar 03, 2008 at 08:27:08AM -0800, Richard Elling wrote:
> Most enterprise class disks are rated at 1 uncorrectable read error
> per 10^15 bits(!) read. [...]

I take it that that would mean the block would be unreadable, rather than readable with incorrect data. That would be based on the CRC included with each disk block. So the granularity is really at the block level: you probably can't even read a bad block from the disk.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Bob Friesenhahn
2008-Mar-03 17:47 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Mon, 3 Mar 2008, Darren J Moffat wrote:

>> I'm not convinced that single bit flips are the common
>> failure mode for disks. Most enterprise class disks already
>> have enough ECC to correct at least 8 bytes per block.
>
> and for consumer rather than enterprise class disks ?

You are assuming that the ECC used for "consumer" disks is substantially different from that used for "enterprise" disks. That is likely not the case, since ECC is provided by a chip which costs a few dollars. The only reason to use a lesser-grade algorithm would be to save a small bit of storage space.

Consumer disks use essentially the same media as enterprise disks.

Consumer disks store a higher bit density on similar media.

Consumer disks have less precise/consistent head controllers than enterprise disks.

Consumer disks are less well-specified than enterprise disks.

Due to the higher bit density we can expect more wrong bits to be read, since we are pushing the media harder. Due to less consistent head controllers we can expect more incidences of reading or writing the wrong track, or of writing something which can't be read. Consumer disks are often used in environments where they may be physically disturbed while writing or reading data. Enterprise disks are usually used in very stable environments.

The upshot of this is that we can expect more unrecoverable errors, but it seems unlikely that there will be more "single bit" errors recoverable at the ZFS level.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Richard Elling
2008-Mar-03 18:50 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Bob Friesenhahn wrote:
> [previous message quoted]

I agree, and am waiting to get the proceedings from FAST '08, which has some interesting papers on the list. A while back I blogged about an Adaptec online seminar which addressed this topic. Rather than repeating what they said, I left a pointer and a recommendation:
http://blogs.sun.com/relling/entry/adaptec_webinar_on_disks_and

Also, note that the published reliability data from disk vendors is constantly changing. For laptop drives, we're seeing less MTBF or UER and more head-landing specs. It seems that an important failure mode for laptop disks is wear-out at the landing site. This is due to power management powering down or spinning down the disk. We don't tend to see this failure mode in servers or RAID arrays.

-- richard
Nathan Kroenert
2008-Mar-03 23:01 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Hey, Bob,

Though I have already got the answer I was looking for here, I thought I'd at least take the time to provide my point of view as to my *why*...

First: I don't think any of us have forgotten the goodness that ZFS's checksum *can* bring.

I'm also keenly aware that we have some customers running HDS / EMC boxes who disable the ZFS checksum by default because they "don't want to have files break due to a single bit flip", and they really don't care where the flip happens, and they don't want to "waste" disks or bandwidth letting ZFS do its own protection when they already pay for it inside their zillion-dollar disk box. (Some say waste, some call it insurance... ;) Oracle users in particular seem to have this mindset, though that's another thread entirely. :)

I'd suspect we don't hear people whining about single bit flips because they would not know it's happening unless the app sitting on top had its own protection, or the error is obvious, or it crashes their system. Or they were running ZFS - but at this stage we cannot delineate between single-bit and massively crapped-out errors, so what's to say we are NOT seeing it?

Also - don't assume bit rot on disk is the only way we can get single bit errors.

Considering that until very recently (and quite likely even now, to a reasonable extent), most CPUs did not have data protection in *every* place data transits through, single bit flips are still a very real possibility, and they become more likely as process shrinks continue. Granted, on CPUs with register parity protection, undetected doubles are more likely to slip under the radar, as registers are typically protected with parity at best, if at all: a single flipped bit in a parity-protected register will be detected, a double won't.

It does seem that some of us are getting a little caught up in disks and their magnificence in what they write to the platter and read back, and overlooking the potential value of a simple (though potentially computationally expensive) circus trick, which might, just might, make your broken 1TB archive useful again...

I don't think it's a good idea for us to assume that it's OK to leave out potential goodness for the masses who want to use ZFS in non-enterprise environments like laptops and home PCs, or who use commodity components in conjunction with the Big Stuff (like white-box PCs connected to an EMC or HDS box).

Anyhoo - I'm glad we have pretty much already done this work once before. It gives me hope that we'll see it make a comeback. ;)

(And I look forward to Jeff & Co developing a hyper-cool way of generating 128000000 checksums using all 64 threads of a Niagara 2, using the same source data in cache so we don't need to hit memory, so that it happens in the blink of an eye. Or two. OK - maybe three... ;) Maybe we could also use the SPUs as well... OK - so I'm possibly dreaming here, but hell, if I'm dreaming, why not dream big. :)

Nathan.
Bob Friesenhahn
2008-Mar-04 00:05 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Tue, 4 Mar 2008, Nathan Kroenert wrote:

> It does seem that some of us are getting a little caught up in disks and
> their magnificence in what they write to the platter and read back, and
> overlooking the potential value of a simple (though potentially
> computationally expensive) circus trick, which might, just might, make
> your broken 1TB archive useful again...

The circus trick can be handled via a user-contributed utility. In fact, people can compete with their various repair utilities. There are only 1048576 1-bit permutations to try, and then the various two-bit permutations can be tried.

> I don't think it's a good idea for us to assume that it's OK to 'leave
> out' potential goodness for the masses that want to use ZFS in
> non-enterprise environments like laptops / home PC's, or use commodity
> components in conjunction with the Big Stuff... (Like white box PC's
> connected to an EMC or HDS box... )

It seems that "goodness for the masses" has not been left out. The forthcoming ability to request duplicate ZFS blocks is very good news indeed. We are entering an age where the entry-level SATA disk is 1TB and users have more space than they know what to do with. A little replication gives these users something useful to do with their new disk while avoiding the need for unreliable "circus tricks" to recover data. ZFS goes far beyond MS-DOS's "recover" command (which should have been called "destroy").

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Nathan Kroenert
2008-Mar-04 00:25 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Bob Friesenhahn wrote:
> On Tue, 4 Mar 2008, Nathan Kroenert wrote:
>> It does seem that some of us are getting a little caught up in disks
>> and their magnificence... [snip]
>
> The circus trick can be handled via a user-contributed utility. In
> fact, people can compete with their various repair utilities. There are
> only 1048576 1-bit permutations to try, and then the various two-bit
> permutations can be tried.

That does not sound 'easy', and I consider that ZFS should be... :) And IMO it's something that should really be built in, not attacked with an add-on.

I had (as did Jeff in his initial response) considered that we only need to actually try flipping 128KB worth of bits once... That many flips means we are in a way 'processing' some 128GB in the worst case when regenerating checksums. Internal to a CPU, depending on cache aliasing, competing workloads, threadedness, etc., this could be dramatically variable... something I guess the ZFS team would want to keep out of 'standard' filesystem operation... hm. :\

>> I don't think it's a good idea for us to assume that it's OK to 'leave
>> out' potential goodness for the masses... [snip]
>
> It seems that "goodness for the masses" has not been left out. The
> forthcoming ability to request duplicate ZFS blocks is very good news
> indeed. [snip]

I never have enough space on my laptop... I guess I'm a freak. But I am sure that we are *both* right for some subsets of ZFS users, and that the more choice we have built into the filesystem, the better.

Thanks again for the comments!

Nathan.
Bob Friesenhahn
2008-Mar-04 00:42 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Tue, 4 Mar 2008, Nathan Kroenert wrote:

>> The circus trick can be handled via a user-contributed utility. [snip]
>
> That does not sound 'easy', and I consider that ZFS should be... :) and
> IMO it's something that should really be built in, not attacked with an
> add-on.

There are several reasons why this sort of thing should not be in ZFS itself. A big reason is that if it is in ZFS itself, it can only be updated via an OS patch or upgrade, along with a required reboot. If it is in a utility, it can be downloaded and used as the user sees fit without any additional disruption to the system. While some errors are random, others follow well-defined patterns, so it may be that one utility is better than another, or that user-provided options can help achieve success faster.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Boyd Adamson
2008-Mar-04 01:38 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Nathan Kroenert <Nathan.Kroenert at Sun.COM> writes:
> [previous message quoted]

Maybe an option to scrub... something that says "work on bit flips for bad blocks", or "work on bit flips for bad blocks in this file".

Boyd
Nathan Kroenert
2008-Mar-04 04:05 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
Hey, Bob -

My perspective on the big reasons for it *to* be integrated would be:

- It's tested - by the folks charged with making ZFS good
- It's kept in sync with the differing zpool versions
- It's documented
- When the system *is* patched, any changes the patch brings are synced with the recovery mechanism
- Being integrated, it has options that can be persistently set if required
- It's there when you actually need it
- It could be integrated with Solaris FMA to take some funky actions based on the nature of the failure, including cool messages telling you what you need to run to attempt a repair, etc.
- It's integrated (recursive, self-fulfilling benefit... ;)

As for separate utilities for different failure modes, I agree, *development* of these might be faster if everyone chases their own pet failure mode and contributes it, but I still think getting them integrated, either as optional actions on error or as part of zdb or another tool, would be far better than having to go looking for the utility and 'give it a whirl'. But I'm sure that's a personal preference, and I'm sure there are those who would love the opportunity to roll their own.

OK - I'm going to shut up now. I think I have done this to death, and I don't want to end up in everyone's kill filter.

Cheers!

Nathan.
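P.S. On the FMA point, some of the plumbing is already visible today. I believe the checksum ereports land in the FMA error log, so something like

# fmdump -eV | grep -i checksum

should show the raw ereport.fs.zfs.checksum events, though the exact class names may vary by release.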
Mario Goebbels (Webmail)
2008-Mar-04 10:56 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
> Maybe an option to scrub... something that says "work on bitflips for
> bad blocks", or "work on bitflips for bad blocks in this file"

I've suggested this too. But in retrospect, there's no way to detect whether a bad block is in fact due to a bit flip or not, so ZFS might spend several hours on each checksum error. You could work some idiot detection into ZFS by having it sum the values of each byte in the filesystem block and store the result in a 32-bit value; when scrubbing with bit-flip correction, it would check whether the difference in sums deviates by more than about +/-128, since that's the most a single flipped bit can change the sum. But that would require more computing power on writes, and the availability of a 32-bit field in the metadata block referring to the FS block.
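A sketch of the idea, purely illustrative (nothing like this exists in ZFS, and the names are made up):

    #include <stddef.h>
    #include <stdint.h>

    /* 32-bit sum of all byte values in the block, stored at write time. */
    uint32_t
    byte_sum(const uint8_t *buf, size_t len)
    {
            uint32_t sum = 0;

            for (size_t i = 0; i < len; i++)
                    sum += buf[i];
            return (sum);
    }

    /*
     * A single bit flip changes one byte by +/- 2^k for some k in
     * 0..7, so it can move the sum by at most 128.  Any larger
     * deviation means brute-force single-bit repair is pointless.
     */
    int
    maybe_single_bit_flip(uint32_t stored_sum, uint32_t current_sum)
    {
            int64_t delta = (int64_t)current_sum - (int64_t)stored_sum;

            if (delta < 0)
                    delta = -delta;
            return (delta <= 128);
    }

-mg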
Richard Elling
2008-Mar-04 18:30 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
[slightly different angle below...]

Nathan Kroenert wrote:
> I'm also keenly aware that we have some customers running HDS / EMC
> boxes who disable the ZFS checksum by default because they 'don't want
> to have files break due to a single bit flip...' [snip]

If you look at the zfs-discuss archives, you will find anecdotes of failing RAID arrays (yes, even expensive ones) and SAN switches causing corruption which was detected by ZFS. A telltale sign of borken hardware is someone complaining that ZFS checksums are borken, only to find out their hardware is at fault.

As for Oracle, modern releases of the Oracle database also have checksumming enabled by default, so there is some merit to the argument that ZFS checksums are redundant. IMNSHO, ZFS is not being designed to replace ASM.

> Considering that until very recently (and quite likely even now to a
> reasonable extent), most CPUs did not have data protection in *every*
> place data transits through, single bit flips are still a very real
> possibility, and becoming more likely as process shrinks continue. [snip]

It depends on the processor. Most of the modern SPARC processors have extensive error detection and correction inside. But processors are still different from memories, in that the time a datum resides in a single location is quite short. We worry more about random data losses when the datum is stored in one place for a long time, which is why you see different sorts of data protection at the different layers of a system design. To put this in more mathematical terms: there is a failure rate for each failure mode, but your exposure to the failure mode is time-bounded.
> [remainder of Nathan's message snipped]

I sense that the requested behaviour here is to be able to get at the corrupted contents of a file, even if we know it is corrupted. I think this is a good idea because:

1. The block is what is corrupted, not necessarily my file. A single block may contain several files which are grouped together, checksummed, and written to disk.

2. The current behaviour of returning EIO when read()ing a file up to the (possible) corruption point is rather irritating, but probably the right thing to do.

Since we know the files affected, we could write a savior, providing we can get some reasonable response other than EIO. As Jeff points out, I'm not sure that automatic repair is the right answer, but a manual savior might work better than a restore from backup. Note: some apps can handle partially missing files. Others do things like zip everything together (e.g. StarOffice), which makes manual recovery difficult.

Also note: the checksums don't have enough information to recreate the data for very many bit changes. Hashes might, but I don't know anyone using sha256.

now, where was that intern hiding? ... :-)
-- richard
Bob Friesenhahn
2008-Mar-04 19:00 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
On Tue, 4 Mar 2008, Richard Elling wrote:

> Also note: the checksums don't have enough information to
> recreate the data for very many bit changes. Hashes might,
> but I don't know anyone using sha256.

It is indeed important to recognize that the checksums are a way to detect that the data is incorrect, rather than a way to tell that the data is correct. There may be several permutations of wrong data which can result in the same checksum, but the probability of encountering those permutations due to natural causes is quite small.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Mario Goebbels (Webmail)
2008-Mar-04 21:58 UTC
[zfs-discuss] Dealing with Single Bit Flips - WAS: Cause for data corruption?
> Also note: the checksums don't have enough information to
> recreate the data for very many bit changes. Hashes might,
> but I don't know anyone using sha256.

My ~/Documents uses sha256 checksums, but then again, it also uses copies=2 :)

-mg