What is the "Best" way to convert the checksums of an existing ZFS file system from one checksum to another? To me "Best" means safest and most complete. My zpool is 39% used, so there is plenty of space available. Thanks. -- This message posted from opensolaris.org
I didn't want my question to lead to an answer, but perhaps I should have given more information. My idea is to copy the file system with one of the following: cp -rp, zfs send | zfs receive, tar, or cpio. But I don't know which would be best. Then I would do a "diff -r" on the two copies before deleting the old one. I don't know the "obscure" (for me) secondary things like attributes, links, extended modes, etc. Thanks again. -- This message posted from opensolaris.org
I had this same question. I was advised to use rsync or zfs send, and I used both just to be safe. With zfs send, you create a snapshot and then send the snapshot. After deleting the snapshot on the target, you have identical copies. rsync is commonly used for this task as well. -- This message posted from opensolaris.org
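[A minimal sketch of the snapshot/send/receive workflow described above; the pool and file system names (tank/home, tank/home.new) are placeholders, not from this thread:]

zfs snapshot tank/home@copy
zfs send tank/home@copy | zfs receive tank/home.new
zfs destroy tank/home.new@copy    # drop the snapshot on the target once the copy is verified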
When using zfs send/receive to do the conversion, the receive creates a new file system:

zfs snapshot zfs01/home@before
zfs send zfs01/home@before | zfs receive afx01/home.sha256

Where do I get the chance to "zfs set checksum=sha256" on the new file system before all of the files are written ??? The new file system is created automatically by the receive command! Although it does not say so in the man page or the ZFS admin guide, it certainly seems reasonable that I don't get a chance - the idea is that send/receive recreates the file system exactly. This still leaves an ambiguity: are the new blocks copied with the checksum algorithm they had in the source file system (which would not result in the conversion I am trying to accomplish), or are they created and checksummed with the algorithm specified by the checksum PROPERTY set in the source file system at the time of the send/receive (which WOULD do the conversion I am trying to accomplish)? Is there a way to use send/receive to duplicate a file system with a different checksum, or do I use cpio or tar? (I pick on cpio and tar because they are specifically called out in the ZFS admin manual as saving and restoring ZFS file attributes and ACLs.) Thanks. --Ray -- This message posted from opensolaris.org
Ray Clark wrote:
> When using zfs send/receive to do the conversion, the receive creates a new file system:
>
> zfs snapshot zfs01/home@before
> zfs send zfs01/home@before | zfs receive afx01/home.sha256
>
> Where do I get the chance to "zfs set checksum=sha256" on the new file system before all of the files are written ???

Set it on the afx01 dataset before you do the receive and it will be inherited.

-- Darren J Moffat
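[A sketch of that suggestion, reusing the pool names from the posts above (zfs01 as the source pool, afx01 as the destination pool); the final property check is only there to confirm the inheritance:]

zfs set checksum=sha256 afx01
zfs send zfs01/home@before | zfs receive afx01/home.sha256
zfs get checksum afx01/home.sha256    # should report sha256, inherited from afx01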
I made a typo... I only have one pool. I should have typed:

zfs snapshot zfs01/home@before
zfs send zfs01/home@before | zfs receive zfs01/home.sha256

Does that change the answer? And independently of whether it does, zfs01 is a pool, and the property is on the home ZFS file system. I cannot change it on the file system before doing the receive because the file system does not exist - it is created by the receive. This raises a related question: is the file system on the receiving end created entirely using the checksum property from the source file system, or are the blocks and their present mix of checksums faithfully recreated in the received file system? Finally, is there any way to verify the behavior after it is done? Thanks for helping on this. -- This message posted from opensolaris.org
Ray Clark wrote:
> I made a typo... I only have one pool. I should have typed:
>
> zfs snapshot zfs01/home@before
> zfs send zfs01/home@before | zfs receive zfs01/home.sha256
>
> Does that change the answer?

No, it doesn't change my answer.

> And independently if it does or not, zfs01 is a pool, and the property is on the home zfs file system.

It doesn't matter if zfs01 is the top-level dataset or not. Before you do the receive, do this:

zfs set checksum=sha256 zfs01

-- Darren J Moffat
Dynamite! I don't feel comfortable leaving things implicit - that is how misunderstandings happen. Would you please acknowledge that zfs send | zfs receive uses the checksum setting on the receiving pool instead of preserving the checksum algorithm used by the sending block? Thanks a million! --Ray -- This message posted from opensolaris.org
Sinking feeling... zfs01 was originally created with fletcher2. Doesn't this mean that the sort of "root level" stuff in the ZFS pool exists with fletcher2 and so is not well protected? If so, is there a way to fix this short of a backup and restore? -- This message posted from opensolaris.org
Ray Clark wrote:
> Dynamite!
>
> I don't feel comfortable leaving things implicit. That is how misunderstandings happen.

It isn't implicit, it is explicitly inherited - that is how ZFS is designed to (and does) work.

> Would you please acknowledge that zfs send | zfs receive uses the checksum setting on the receiving pool instead of preserving the checksum algorithm used by the sending block?

For now it depends on whether or not you pass -R to 'zfs send'. Without the -R argument the send stream does not have any properties in it, so the receive will (by design) use those that would be used if the dataset were created by 'zfs create'. In the future there will be a distinction between the local and the received values; see the recently (yesterday) approved case PSARC/2009/510:

http://arc.opensolaris.org/caselog/PSARC/2009/510/20090924_tom.erickson

Let's look at how it works just now:

portellen:pts/2# zpool create dummy c7t3d0
portellen:pts/2# zfs create dummy/home
portellen:pts/2# cp /etc/profile /dummy/home
portellen:pts/2# zfs get checksum dummy/home
NAME        PROPERTY  VALUE  SOURCE
dummy/home  checksum  on     default
portellen:pts/2# zfs snapshot dummy/home@1
portellen:pts/2# zfs set checksum=sha256 dummy
portellen:pts/2# zfs send dummy/home@1 | zfs recv -F dummy/home.sha256
portellen:pts/2# zfs get checksum dummy/home.sha256
NAME               PROPERTY  VALUE   SOURCE
dummy/home.sha256  checksum  sha256  inherited from dummy

Now let's verify using zdb. We should have two plain file blocks (/etc/profile fits in a single ZFS block), one from the original dummy/home and one from the newly received home.sha256:

portellen:pts/2# zdb -vvv -S user:all dummy
0  2048  1  ZFS plain file  fletcher4  uncompressed  8040e8f120:a2c635bc0556:73b5ba539e9699:3b4d66984ac9d6b4
0  2048  1  ZFS plain file  SHA256     uncompressed  57f1e8168c58e8cf:3b20be148f57852e:f72ee8e66663358f:1bfae4ae0599577c

-- Darren J Moffat
On 10/01/09 05:08 AM, Darren J Moffat wrote:
> In the future there will be a distinction between the local and the
> received values; see the recently (yesterday) approved case PSARC/2009/510:
>
> http://arc.opensolaris.org/caselog/PSARC/2009/510/20090924_tom.erickson

Currently non-recursive incremental streams send properties and full streams don't. Will the "p" flag reverse its meaning for incremental streams? For my purposes the current behavior is the exact opposite of what I need, and it isn't obvious that the case addresses this peculiar inconsistency without going through a lot of hoops. I suppose the new properties can be sent initially so that subsequent incremental streams won't override the possibly changed local properties, but that seems so complicated :-). If I understand the case correctly, we can now set a flag that says "ignore properties sent by any future incremental non-recursive stream", instead of having a flag for incremental streams that says "don't send properties". What happens if sometimes we do and sometimes we don't? That sounds like a static property when a dynamic flag is really what is wanted, and a complicated way of working around a design inconsistency. But maybe I missed something :-) So what would the semantics of the new "p" flag be for non-recursive incremental streams? Thanks -- Frank
Darren, thank you very much! Not only have you answered my question, you have made me aware of a tool to verify, and probably do a lot more (zdb).

Can you comment on my concern regarding what checksum is used in the base zpool before anything is created in it? (No doubt my terminology is wrong, but you get the idea I am sure.)

The single critical feature of ZFS is debatably that every block on ZFS is checksummed to enable detection of corruption, but it appears that the user does not have the ability to choose the checksum for the highest levels of the pool itself. Given the issue with fletcher2, this is of concern! Since this "activity" was kicked off by a "Corrupt Metadata" ZFS-8000-CS, I am trying to move away from fletcher2. I don't know if that was the cause, but my goal is to restore the "safety" that we went to ZFS for.

Is my understanding correct? Are there ways to control the checksum algorithm on the empty zpool? Thanks again. --Ray -- This message posted from opensolaris.org
On Oct 1, 2009, at 7:10 AM, Ray Clark wrote:
> Darren, thank you very much! Not only have you answered my
> question, you have made me aware of a tool to verify, and probably
> do a lot more (zdb).
>
> Can you comment on my concern regarding what checksum is used in the
> base zpool before anything is created in it? (No doubt my
> terminology is wrong, but you get the idea I am sure.)
>
> The single critical feature of ZFS is debatably that every block on
> ZFS is checksummed to enable detection of corruption, but it appears
> that the user does not have the ability to choose the checksum for
> the highest levels of the pool itself. Given the issue with
> fletcher2, this is of concern! Since this "activity" was kicked off
> by a "Corrupt Metadata" ZFS-8000-CS, I am trying to move away from
> fletcher2. I don't know if that was the cause, but my goal is to
> restore the "safety" that we went to ZFS for.
>
> Is my understanding correct?
> Are there ways to control the checksum algorithm on the empty zpool?

You can set both zpool (-o option) and zfs (-O option) options when you create the zpool. See zpool(1m).
-- richard
Ray, if you don't mind me asking, what was the original problem you had on your system that makes you think the checksum type is the problem? -- This message posted from opensolaris.org
U4 zpool does not appear to support the -o option... Reading a current zpool man page online, it lists the valid properties for the current zpool -o, and checksum is not one of them. Are you mistaken, or am I missing something?

Another thought: *perhaps* all of the blocks that comprise an empty zpool are rewritten sooner or later, and once the checksum is changed with "zfs set checksum=sha256 zfs01" (the pool name), they will be rewritten with the new checksum very soon anyway. Is this true? Answering would require an understanding of the on-disk structure and of when what is rewritten. --Ray -- This message posted from opensolaris.org
Ray, if you use -o it sets properties for the pool. If you use -O (capital), it sets the file system properties for the default file system created with the pool. zpool create -O can use any valid ZFS file system option. But I agree, it's not very clearly documented. -- This message posted from opensolaris.org
You are correct. The zpool create -O option isn't available in a Solaris 10 release but will be soon. This will allow you to set the file system checksum property when the pool is created:

# zpool create -O checksum=sha256 pool c1t1d0
# zfs get checksum pool
NAME  PROPERTY  VALUE   SOURCE
pool  checksum  sha256  local

Otherwise, you would have to set it like this:

# zpool create pool c1t1d0
# zfs set checksum=sha256 pool
# zfs get checksum pool
NAME  PROPERTY  VALUE   SOURCE
pool  checksum  sha256  local

I'm not sure I understand the second part of your comments but will add: if *you* rewrite your data then the new data will contain the new checksum. I believe an upcoming project will provide the ability to revise file system properties on the fly.

On 10/01/09 12:21, Ray Clark wrote:
> U4 zpool does not appear to support the -o option... Reading a current zpool manpage online lists the valid properties for the current zpool -o, and checksum is not one of them. Are you mistaken or am I missing something?
>
> Another thought is that *perhaps* all of the blocks that comprise an empty zpool are re-written sooner or later, and once the checksum is changed with "zfs set checksum=sha256 zfs01" (the pool name) they will be re-written with the new checksum very soon anyway. Is this true? This would require an understanding of the on-disk structure and when what is rewritten.
>
> --Ray
Also, when a pool is created, there is only metadata, which uses fletcher4[*]. So it is not a crime if you set the checksum after the pool is created and before data is written :-)

* note: the uberblock uses SHA-256
-- richard

On Oct 1, 2009, at 12:34 PM, Cindy Swearingen wrote:
> You are correct. The zpool create -O option isn't available in a
> Solaris 10 release but will be soon. This will allow you to set the
> file system checksum property when the pool is created:
>
> # zpool create -O checksum=sha256 pool c1t1d0
> # zfs get checksum pool
> NAME  PROPERTY  VALUE   SOURCE
> pool  checksum  sha256  local
>
> Otherwise, you would have to set it like this:
>
> # zpool create pool c1t1d0
> # zfs set checksum=sha256 pool
> # zfs get checksum pool
> NAME  PROPERTY  VALUE   SOURCE
> pool  checksum  sha256  local
>
> I'm not sure I understand the second part of your comments but will add:
>
> If *you* rewrite your data then the new data will contain the new
> checksum. I believe an upcoming project will provide the ability to
> revise file system properties on the fly.
>
> On 10/01/09 12:21, Ray Clark wrote:
>> U4 zpool does not appear to support the -o option... Reading a
>> current zpool manpage online lists the valid properties for the
>> current zpool -o, and checksum is not one of them. Are you
>> mistaken or am I missing something?
>> Another thought is that *perhaps* all of the blocks that comprise
>> an empty zpool are re-written sooner or later, and once the
>> checksum is changed with "zfs set checksum=sha256 zfs01" (the pool
>> name) they will be re-written with the new checksum very soon
>> anyway. Is this true? This would require an understanding of the
>> on-disk structure and when what is rewritten.
>> --Ray
Data security. I migrated my organization from Linux to Solaris, driven away from Linux by the shortfalls of fsck on TB-size file systems, and towards Solaris by the features of ZFS. At the time I tried to dig up information concerning the tradeoffs associated with fletcher2 vs. fletcher4 vs. SHA256 and found nothing. Studying the algorithms, I decided that fletcher2 would tend to be weak for periodic data, which characterizes my data. I ran throughput tests and got 67MB/sec for fletcher2 and fletcher4 and 48MB/sec for SHA256. I projected (perhaps without basis) SHA256's cryptographic strength to also mean strength as a hash, and chose it since 48MB/sec is more than I need.

21 months later (9/15/09) I lost everything to a "corrupt metadata" ZFS-8000-CS (not sure where this was printed). No clue why to date; I will never know. The person who restored from tape was not informed to set checksum=sha256, so it all went in with the default, fletcher2.

Before taking rather disruptive actions to correct this, I decided to question my original decision and found schlie's post stating that a bug in fletcher2 makes it essentially a one bit parity on the entire block:
http://opensolaris.org/jive/thread.jspa?threadID=69655&tstart=30
While this is twice as good as any other file system in the world that has NO such checksum, this does not provide the security I migrated for. Especially given that I did not know what caused the original data loss, it is all I have to lean on.

Convinced that I need to convert all of the checksums to sha256 to have the data security ZFS purports to deliver, and in the absence of a checksum conversion capability, I need to copy the data. It appears that all of the implementations of the various means of copying data, from tar and cpio to cp to rsync to pax, have ghosts in their closets, each living in glass houses and each throwing stones at the others with respect to various issues with file size, filename lengths, pathname lengths, ACLs, extended attributes, sparse files, etc. It seems like zfs send/receive *should* be safe from all such issues as part of the ZFS family, but the questions raised here are ambiguous once one starts to think about it. If the file system is faithfully duplicated, it should also duplicate all properties, including the checksum used on each block. It appears (to my advantage) that this is not what is done. This enables the file system spontaneously created by zfs receive to inherit from the pool, which evidently can be set to sha256 even though it is a pool, not a file system in the pool.

The present question is protection on the base pool. This can be set when the pool is created, though not with U4, which I am running. It is not clear (yet) if this is simply not documented in the current release or if the version that supports this has not been released yet. If I were to upgrade (which I cannot do in a timely fashion), it would only be to U7. I cannot run a "weekly build" type of OS on my production server. Any way it goes I am hosed.

In short, there is surely some structure, some blocks with stuff written in them, when a pool is created but before anything else is done - else it would be a blank disk, not a ZFS pool. Are these "protected" by fletcher2 as the default? I have learned that the uberblock is protected by SHA256, other parts by fletcher4. Is this everything? In U4 was it fletcher4, or was this a recent change stemming from schlie's report?
In short, what is the situation with regard to the data security I switched to Solaris/ZFS for, and what can I do to achieve it? What *do* the tools do? Are there tools for what needs to be done - to convert things, to copy things, to verify things - and to do so completely and correctly?

So here is where I am: I should use zfs send/receive, but I cannot have confidence that there are not fletcher2-protected blocks (1 bit parity) at the most fundamental levels of the zpool. To verify data, I cannot depend on existing tools, since diff is not large-file aware. My best idea at this point is to calculate and compare MD5 sums of every file and spot check other properties as best I can.

Given this rather full perspective, help or comments are very much appreciated. I still think ZFS is the way to go, but the road is a little bumpy at the moment. -- This message posted from opensolaris.org
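[One way to do that MD5 comparison on Solaris is with digest(1); a sketch, assuming the original and the copy are mounted at /zfs01/home and /zfs01/home.sha256 - both paths are placeholders:]

cd /zfs01/home && find . -type f -exec digest -v -a md5 {} + | sort > /tmp/orig.md5
cd /zfs01/home.sha256 && find . -type f -exec digest -v -a md5 {} + | sort > /tmp/copy.md5
diff /tmp/orig.md5 /tmp/copy.md5    # no output means the file contents match

[This compares regular file contents only; ACLs, sparse-file holes, and other attributes still need separate spot checks.]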
I apologize that the preceding post appears out of context. I expected it to "indent" when I pushed the reply button on myxiplx's Oct 1, 2009 1:47 post; it was in response to his question. I will try to remember to provide links internal to my messages. -- This message posted from opensolaris.org
On 02 October, 2009 - Ray Clark sent me these 4,4K bytes:

> Data security. I migrated my organization from Linux to Solaris
> driven away from Linux by the shortfalls of fsck on TB size file
> systems, and towards Solaris by the features of ZFS.
[...]
> Before taking rather disruptive actions to correct this, I decided to
> question my original decision and found schlie's post stating that a
> bug in fletcher2 makes it essentially a one bit parity on the entire
> block:
> http://opensolaris.org/jive/thread.jspa?threadID=69655&tstart=30
> While this is twice as good as any other file system in the world that
> has NO such checksum, this does not provide the security I migrated
> for. Especially given that I did not know what caused the original
> data loss, it is all I have to lean on.

That post refers to bug 6740597
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6740597
which also refers to
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=2178540

So it seems like it's fixed in snv_114 and s10u8, which won't help your s10u4 unless you update.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Replying to Cindy's Oct 1, 2009 3:34 PM post:

Thank you. The second part was my attempt to guess my way out of this. If the fundamental structure of the pool (that which was created before I set the checksum=sha256 property) is using fletcher2, perhaps as I use the pool all of this structure will be updated, and will therefore automatically migrate to the new checksum. It would be very difficult for me to recreate the pool, but I have space to duplicate the "user" files (and so get the new checksum). Perhaps this will also result in the underlying "structure" of the pool being converted in the course of normal use. Comments for or against? -- This message posted from opensolaris.org
Replying to relling's October 1, 2009 3:34 post:

Richard, regarding "when a pool is created, there is only metadata which uses fletcher4": was this true in U4, or is this a recent change, with U4 using fletcher2? Similarly, did the uberblock use sha256 in U4? I am running U4. --Ray -- This message posted from opensolaris.org
Interesting answer, thanks :) I'd like to dig a little deeper if you don't mind, just to further my own understanding (which is usually rudimentary compared to a lot of the guys on here). My belief is that ZFS stores two copies of the metadata for any block, so corrupt metadata really shouldn't happen often. Could I ask what the structure of your pool is - what level of redundancy do you have there? The very fact that you had a 'corrupt metadata' error implies to me that the checksums have done their job in finding an error, and I'm wondering if the true cause could be further down the line. I'm still taking all this in though - we'll be using sha256 on our secondary system, just in case :) -- This message posted from opensolaris.org
My pool was the default, with checksum=sha256. The default has two copies of all metadata (as I understand it), and one copy of user data. It was a raidz2 with eight 750GB drives, yielding just over 4TB of usable space. I am not happy with the situation, but I recognize that I am 2x better off (1 bit parity) than I would be with any other file system. -- This message posted from opensolaris.org
webclark at rochester.rr.com said:
> To verify data, I cannot depend on existing tools since diff is not large
> file aware. My best idea at this point is to calculate and compare MD5 sums
> of every file and spot check other properties as best I can.

Ray, I recommend that you use rsync's "-c" to compare copies. It reads all the source files, computes a checksum for them, then does the same for the destination and compares checksums. As far as I know, the only thing that rsync can't do in your situation is the ZFS/NFSv4 ACLs. I've used it to migrate many TBs of data.

Regards, Marion
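[For example, a dry run that only reports files whose contents differ; the paths are placeholders:]

rsync -n -a -c -v /zfs01/home/ /zfs01/home.sha256/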
Ray,

The checksums are set on the file systems not the pool. If a new checksum is set and *you* rewrite the data, then the rewritten data will contain the new checksum. If your pool has the space for you to duplicate the user data and the new checksum is set, then the duplicated data will have the new checksum. ZFS doesn't rewrite data as part of normal operations.

I confirmed with a simple test (like Darren's) that even if you have a single-disk pool and the disk is replaced and all the data is resilvered and a new checksum is set, you'll see data with the previous checksum and the new checksum.

Cindy

On 10/02/09 08:44, Ray Clark wrote:
> Replying to Cindy's Oct 1, 2009 3:34 PM post:
>
> Thank you. The second part was my attempt to guess my way out of this. If the fundamental structure of the pool (that which was created before I set the checksum=sha256 property) is using fletcher2, perhaps as I use the pool all of this structure will be updated, and therefore automatically migrate to the new checksum. It would be very difficult for me to recreate the pool, but I have space to duplicate the "user" files (and so get the new checksum). Perhaps this will also result in the underlying "structure" of the pool being converted in the course of normal use.
>
> Comments for or against?
On Oct 2, 2009, at 7:46 AM, Ray Clark wrote:
> Replying to relling's October 1, 2009 3:34 post:
>
> Richard, regarding "when a pool is created, there is only metadata
> which uses fletcher4". Was this true in U4, or is this a new change
> of default with U4 using fletcher2? Similarly, did the uberblock
> use sha256 in U4? I am running U4.

ZFS uses different checksums for different things. Briefly,

use          checksum
---------------------------------------------------------
uberblock    SHA-256, self-checksummed
labels       SHA-256
metadata     fletcher4
data         fletcher2 (default), set with checksum parameter
ZIL log      fletcher2, self-checksummed
gang block   SHA-256, self-checksummed

The parent holds the checksum for any entity that is not self-checksummed.

The big question, which is currently unanswered, is: do we see single bit faults in disk-based storage systems? The answer to this question must be known before the effectiveness of a checksum can be evaluated. The overwhelming empirical evidence suggests that fletcher2 catches many storage system corruptions.
-- richard
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes: >>>>> "r" == Ross <myxiplx at googlemail.com> writes:re> The answer to this question must be known before the re> effectiveness of a checksum can be evaluated. ...well...we can use math to know that a checksum is effective. What you are really suggesting we evaluate ``empirically'''' is the degree of INeffectiveness of the broken checksum. r> ZFS stores two copies of the metadata for any block, so r> corrupt metadata really shouldn''t happen often. the other copy probably won''t be read if the first copy read has a valid checksum. I think it''ll more likely just lazy-panic instead. If that''s the case, the two copies won''t help cover up the broken checksum bug. but Richard''s table says metadata has fletcher4 which the OP said is as good as the correct algorithm would have been, even in its broken implementation, so long as it''s only used up to 128kByte. It''s only data and ZIL that has the relevantly-broken checksum, according to his math. re> The overwhelming empirical evidence suggests that fletcher2 re> catches many storage system corruptions. What do you mean by the word ``many''''? It''s a weasel-word. It basically means, AFAICT, ``the broken checksum still trips sometimes.'''' But have you any empirical evidence about the fraction of real world errors which are still caught by the broken checksum vs. those that are not? I don''t see how you could. How about cases where checksums are not used to correct bit-flip gremlins but relied upon to determine whether a data structure is fully present (committed) yet, like in the ZIL, or to determine which half of a mirror is stale---these are cases where checksums could be wrong even if the storage subsystem is functioning in an ideal way. Checksum weakness on ZFS where checksums are presumed good by other parts of the design could potentially be worse overall than a checksumless design. That''s not my impression, but it''s the right place to put the bar. Ray''s ``well at least it''s better than no checksums'''' is wrong because it presumes ZFS could function as well as another filesystem if ZFS were using a hypothetical null checksum. It couldn''t. Anyway I''m glad the problem is both fixed and also avoidable on the broken systems. I just think the doublespeak after the fact is, once again, not helping anyone. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091002/370a1de4/attachment.bin>
Hi Miles, good to hear from you again.

On Oct 2, 2009, at 1:20 PM, Miles Nordin wrote:
>>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:
>>>>>> "r" == Ross <myxiplx at googlemail.com> writes:
>
> re> The answer to this question must be known before the
> re> effectiveness of a checksum can be evaluated.
>
> ...well...we can use math to know that a checksum is effective. What
> you are really suggesting we evaluate ``empirically'' is the degree of
> INeffectiveness of the broken checksum.

By your logic, SECDED ECC for memory is broken because it only corrects 1 bit per symbol and only detects brokenness of 2 bits per symbol. However, the empirical evidence suggests that ECC provides a useful function for many people. Do we know how many triple bit errors occur in memories? I can compute the probability, but have never seen a field failure analysis. So, if ECC is "good enough" for DRAM, is fletcher2 "good enough" for storage? NB, for DRAM the symbol size is usually 64 bits. For the ZFS case, the symbol size is 4,096 to 1,048,576 bits. AFAIK, no collisions have been found in SHA-256 digests for symbols of size 1,048,576, but it has not been proven that they do not exist.

> r> ZFS stores two copies of the metadata for any block, so
> r> corrupt metadata really shouldn't happen often.
>
> the other copy probably won't be read if the first copy read has a
> valid checksum. I think it'll more likely just lazy-panic instead.
> If that's the case, the two copies won't help cover up the broken
> checksum bug. but Richard's table says metadata has fletcher4 which
> the OP said is as good as the correct algorithm would have been, even
> in its broken implementation, so long as it's only used up to
> 128kByte. It's only data and ZIL that has the relevantly-broken
> checksum, according to his math.
>
> re> The overwhelming empirical evidence suggests that fletcher2
> re> catches many storage system corruptions.
>
> What do you mean by the word ``many''? It's a weasel-word.

I'll blame the lawyers. They are causing me to remove certain words from my vocabulary :-(

> It basically means, AFAICT, ``the broken checksum still trips
> sometimes.'' But have you any empirical evidence about the fraction
> of real world errors which are still caught by the broken checksum
> vs. those that are not? I don't see how you could.

Question for the zfs-discuss participants: have you seen a data corruption that was not detected when using fletcher2? Personally, I've seen many corruptions of data stored on file systems lacking checksums.

> How about cases where checksums are not used to correct bit-flip
> gremlins but relied upon to determine whether a data structure is
> fully present (committed) yet, like in the ZIL, or to determine which
> half of a mirror is stale---these are cases where checksums could be
> wrong even if the storage subsystem is functioning in an ideal way.
>
> Checksum weakness on ZFS where checksums are presumed good by other
> parts of the design could potentially be worse overall than a
> checksumless design. That's not my impression, but it's the right
> place to put the bar. Ray's ``well at least it's better than no
> checksums'' is wrong because it presumes ZFS could function as well as
> another filesystem if ZFS were using a hypothetical null checksum. It
> couldn't.

I'm in Ray's camp. I've got far too many scars from data corruption and I'd rather not add more.
-- richard

> Anyway I'm glad the problem is both fixed and also avoidable on the
> broken systems. I just think the doublespeak after the fact is, once
> again, not helping anyone.
Replying to hakanson's Oct 2, 2009 2:01 post:

Thanks. I suppose it is true that I am not even trying to compare the peripheral stuff, and the simple presence of a file plus matching data covers some of it. Using it for moving data, one encounters a longer list: sparse files, ACL handling, extended attributes, length of filenames, length of pathnames, large files, and probably other "interesting" things that can be handled incorrectly. Most information on misbehavior of the various archive / backup / data movement utilities is very old; one wonders how they behave today. This would be a useful compilation, but I can't do it. -- This message posted from opensolaris.org
Cindy's Oct 2, 2009 2:59 post - thanks for staying with me.

Re: "The checksums are set on the file systems not the pool.": But previous responses seem to indicate that I can set them, for files stored in the file system that appears to be the pool, at the pool level, before I create any new ones. One post seems to indicate that there is a checksum property for this file system, and independently for the pool. (This topic needs a picture.)

Re: "If a new checksum is set and *you* rewrite the data ... then the duplicated data will have the new checksum." Understood. Now I am on to being concerned about the blocks that comprise the zpool that *contains* the file system.

Re: "ZFS doesn't rewrite data as part of normal operations. I confirmed with a simple test (like Darren's) that even if you have a single-disk pool and the disk is replaced and all the data is resilvered and a new checksum is set, you'll see data with the previous checksum and the new checksum." Yes, ... a resilver duplicates exactly.

Darren's example showed that without the -R, no properties are sent and the zfs receive has no choice but to use the pool default for the ZFS file system that it creates. This also implies that there is a property associated with the pool. So my previous comment about zfs send/receive not duplicating exactly was not fair. The man page / admin guide should be clear as to what is sent without -R. I would have guessed everything, just not descendent file systems.

It is a shame that zdb is totally undocumented. I thought I had discovered a gold mine when I first read Darren's note! --Ray -- This message posted from opensolaris.org
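[Darren's earlier zdb invocation can be reused to spot-check which checksum newly written blocks carry; zdb is undocumented and its options vary by release, so treat this as a sketch:]

zdb -vvv -S user:all zfs01

[Each reported block line includes the checksum algorithm (fletcher2, fletcher4, or SHA256) it was written with.]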
Re: relling's Oct 2, 2009 3:26 post:

(1) Is this list everything?

(2) Is this the same for U4?

(3) If I change the zpool checksum property on creation as you indicated in your Oct 1, 12:51 post (evidently very recent versions only), does this change the checksums used for this list? Why would not the strongest checksum be used for the most fundamental data, rather than fooling around, allowing the user to compromise only where the tradeoff pays back on the 99% bulk of the data?

Re: "The big question, that is currently unanswered, is do we see single bit faults in disk-based storage systems?"

I don't think this is the question. I believe the implication of schlie's post is not that single bit faults will get through, but that the current fletcher2 is equivalent to a single bit checksum. You could have 1,000 bits in error, or 4095, and still have only a 50-50 chance of detecting it. A single bit error would be certain to be detected (I think) even with the current code. -- This message posted from opensolaris.org
Re: Miles Nordin Oct 2, 2009 4:20:

Re: "Anyway, I'm glad the problem is both fixed..." I want to know HOW it can be fixed. If they fixed it, this will invalidate every pool that has not been changed from the default (probably almost all of them!). This can't be! So what WAS done? In the interest of honesty in advertising and enabling people to evaluate their own risks, I think we should know how it was fixed. Something either ingenious or potentially misleading must have been done. I am not suggesting that it was not the best way to handle a difficult situation, but I don't see how it can be transparent. If the string "fletcher2" does the same thing, it is not fixed. If it does something different, it is misleading.

"... and avoidable on the broken systems." Please tell me how! Without destroying and recreating my zpool, I can only fix the ZFS file system blocks, not the underlying zpool blocks. WITH destroying and recreating my zpool, I can only control the checksum on the underlying zpool using a version of Solaris that is not yet available. And then (pending relling's response) it may or may not *still* affect the blocks I am concerned about. So how is this avoidable? It is partially avoidable (so far) IF I have the luxury of doing significant rebuilding. No? -- This message posted from opensolaris.org
On Oct 2, 2009, at 3:05 PM, Ray Clark wrote:
> Re: relling's Oct 2, 2009 3:26 post:
>
> (1) Is this list everything?

AFAIK.

> (2) Is this the same for U4?

Yes. This hasn't changed in a very long time.

> (3) If I change the zpool checksum property on creation as you
> indicated in your Oct 1, 12:51 post (evidently very recent versions
> only), does this change the checksums used for this list? Why would
> not the strongest checksum be used for the most fundamental data
> rather than fool around, allowing the user to compromise only when
> the tradeoff pays back on the 99% bulk of the data?

Performance. Many people value performance over dependability.

> Re: "The big question, that is currently unanswered, is do we see
> single bit faults in disk-based storage systems?"
>
> I don't think this is the question. I believe the implication of
> schlie's post is not that single bit faults will get through, but
> that the current fletcher2 is equivalent to a single bit checksum.
> You could have 1,000 bits in error, or 4095, and still have a 50-50
> chance of detecting it. A single bit error would be certain to be
> detected (I think) even with the current code.

I don't believe schlie posted the number of fletcher2 collisions for the symbol size used by ZFS. I do not believe it will be anywhere near 50% collisions.
-- richard
Re: relling's Oct 2 5:06 post:

Re: the analogy to ECC memory... I appreciate the support, but the ECC memory analogy does not hold water. ECC memory is designed to correct for multiple independent events, such as electrical noise, bits flipped due to alpha particles from the DRAM package, or cosmic rays. The probability of these independent events coinciding in time and space is very small indeed. It works well.

ZFS does purport to cover errors such as these on the crummy double-layer boards without sufficient decoupling, microcontrollers and memories without parity or ECC, etc., found in the cost-reduced-to-the-razor's-edge hardware most of us run on, but it also covers system-level errors such as entire blocks being replaced, or large fractions of them being corrupted by high-level bugs. With the current fletcher2 we have only a 50-50 chance of catching these multi-bit errors. The probability of multiple bits being changed is not small, because the probabilities of the error mechanism affecting the 4096~1048576 bits in the block are not independent. Indeed, in many of the show-cased mechanisms, it is a sure bet - the entire disk sector is written with the wrong data, for sure! Although there is a good chance that many of the bits in the sector happen to match, there is an excellent chance that many are different. And the mechanisms that caused these differences were not independent.

Re: "AFAIK, no collisions have been found in SHA-256 digests for symbols of size 1,048,576, but it has not been proven that they do not exist": for sure they exist - there are vastly more possible 1,048,576-bit blocks than 256-bit digests, so an enormous number of blocks must map to each SHA256 digest. One hopes that the same properties that make SHA256 a good cryptographic hash also make it a good hash, period. This, I admit, is a leap of ignorance (at least I know what cliff I am leaping off of).

Regarding the question of what people have seen: I have seen lots of unexplained things happen, and by definition one never knows why. I am not interested in seeing any more. I see the potential for disaster, and my time, and the time of my group, is better spent doing other things. That is why I moved to ZFS. -- This message posted from opensolaris.org
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:re> By your logic, SECDED ECC for memory is broken because it only re> corrects ECC is not a checksum. Go ahead, get out your dictionary, enter severe-pedantry-mode. but it is relevantly different. In for example data transmission scenarios, FEC''s like ECC are often used along with a strong noncorrecting checksum over a larger block. The OP further described scenarios plausible for storage, like ``long string of zeroes with 1 bit flipped'''', that produce collisions with the misimplemented fletcher2 (but, obviously, not with any strong checksum like correct-fletcher2). re> is fletcher2 "good enough" for storage? yes, it probably is good enough, but ZFS implements some other broken algorithm and calls it fletcher2. so, please stop saying fletcher2. re> I''ll blame the lawyers. They are causing me to remove certain re> words from my vocabulary :-( yeah, well, allow me to add a word back to the vocabulary: BROKEN. If you are not legally allowed to use words like broken and working, then find another identity from which to talk, please. re> Question for the zfs-discuss participants, have you seen a re> data corruption that was not detected when using fletcher2? This is ridiculous. It''s not fletcher2, it''s brokenfletcher2. It''s avoidably extremely weak. It''s reasonable to want to use a real checksum, and this PR game you are playing is frustrating and confidence-harming for people who want that. This does not have to become a big deal, unless you try to spin it with a 7200rpm PR machine like IBM did with their broken Deathstar drives before they became HGST. Please, what we need to do is admit that the checksum is relevantly broken in a way that compromises the integrity guarantees with which ZFS was sold to many customers, fix the checksum, and learn how to conveniently migrate our data. Based on the table you posted, I guess file data can be set to fletcher4 or sha256 using filesystem properties to work around the bug on Solaris versions with the broken implementation. 1. What''s needed to avoid fletcher2 on the ZIL on broken Solaris versions? 2. I understand the workaround, but not the fix. How does the fix included S10u8 and snv_114 work? Is there a ZFS version bump? Does the fix work by implementing fletcher2 correctly? or does it just disable fletcher2 and force everything to use brokenfletcher4 which is good enough? If the former, how are the broken and correct versions of fletcher2 distinguished---do they show up with different names in the pool properties? Once you have the fixed software, how do you make sure fixed checksums are actually covering data blocks originally written by old broken software? I assume you have to use rsync or zfs send/recv to rewrite all the data with the new checksum? If yes, what do you have to do before rewriting---upgrade solaris and then ''zfs upgrade'' each filesystem one by one? Will zfs send/recv work across the filesystem versions, or does the copying have to be done with rsync? 3. speaking of which, what about the checksum in zfs send streams? is it also fletcher2, and if so was it also fixed in s10u8/snv_114, and how does this affect compatibility for people who have ignored my advice and stored streams instead of zpools? Will a newer ''zfs recv'' always work with an older ''zfs send'' but not the other way around? there is basically no informaiton about implementing the fix in the bug, and we can''t write to the bug from outside Sun. 
Whatever sysadmins need to do to get their data under the strength of checksum they thought it was under, it might be nice to describe it in the bug, for whoever gets referred to the bug and has an affected version.
Let me try to refocus. Given that I have a U4 system with a zpool created with fletcher2:

What blocks in the system are protected by fletcher2, or even fletcher4 (although that does not worry me so much)?

Given that I only have 1.6TB of data in a 4TB pool, what can I do to change those blocks to sha256 or fletcher4:

(1) Without destroying and recreating the zpool under U4?
(2) With destroying and recreating the zpool under U4 (which I don't really have the resources to pull off)?
(3) With upgrading to U7 (perhaps in a few months)?
(4) With upgrading to U8?

Thanks. -- This message posted from opensolaris.org
On Oct 2, 2009, at 3:44 PM, Ray Clark wrote:
> Let me try to refocus:
>
> Given that I have a U4 system with a zpool created with fletcher2:
>
> What blocks in the system are protected by fletcher2, or even
> fletcher4, although that does not worry me so much.
>
> Given that I only have 1.6TB of data in a 4TB pool, what can I do to
> change those blocks to sha256 or fletcher4:
>
> (1) Without destroying and recreating the zpool under U4
>
> (2) With destroying and recreating the zpool under U4 (which I don't
> really have the resources to pull off)
>
> (3) With upgrading to U7 (perhaps in a few months)
>
> (4) With upgrading to U8

This has been answered several times in this thread already.

set checksum=sha256 filesystem
copy your files -- all newly written data will have the sha256 checksums.
-- richard
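[A concrete version of those two steps, reusing commands already shown earlier in the thread and the pool/file system names zfs01 and zfs01/home; the target name home.new is a placeholder:]

zfs set checksum=sha256 zfs01
zfs snapshot zfs01/home@convert
zfs send zfs01/home@convert | zfs receive zfs01/home.new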
On Oct 2, 2009, at 3:36 PM, Miles Nordin wrote:
>>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:
>
> re> By your logic, SECDED ECC for memory is broken because it only
> re> corrects
>
> ECC is not a checksum.

SHA-256 is not a checksum, either, but that isn't the point. The concern is that corruption can be detected. ECC has very, very limited detection capabilities, yet it is "good enough" for many people. We know that MOS memories have certain failure modes that cause bit flips, and by using ECC and interleaving, the dependability is improved. The big question is, what does the corrupted data look like in storage? Random bit flips? Big chunks of zeros? 55aa patterns? Since the concern with the broken fletcher2 is restricted to the most significant bits, we are most concerned with failures where the most significant bits are set to ones. But as I said, we have no real idea what the corrupted data should look like, and if it is zero-filled, then fletcher2 will catch it.

> Go ahead, get out your dictionary, enter severe-pedantry-mode. but it
> is relevantly different. In for example data transmission scenarios,
> FECs like ECC are often used along with a strong noncorrecting
> checksum over a larger block.
>
> The OP further described scenarios plausible for storage, like ``long
> string of zeroes with 1 bit flipped'', that produce collisions with
> the misimplemented fletcher2 (but, obviously, not with any strong
> checksum like correct-fletcher2).
>
> re> is fletcher2 "good enough" for storage?
>
> yes, it probably is good enough, but ZFS implements some other broken
> algorithm and calls it fletcher2. so, please stop saying fletcher2.

If I was to refer to Fletcher's algorithm, I would use Fletcher. When I am referring to the ZFS checksum setting of "fletcher2", I will continue to use "fletcher2".

> re> I'll blame the lawyers. They are causing me to remove certain
> re> words from my vocabulary :-(
>
> yeah, well, allow me to add a word back to the vocabulary: BROKEN.
>
> If you are not legally allowed to use words like broken and working,
> then find another identity from which to talk, please.
>
> re> Question for the zfs-discuss participants, have you seen a
> re> data corruption that was not detected when using fletcher2?
>
> This is ridiculous. It's not fletcher2, it's brokenfletcher2. It's
> avoidably extremely weak. It's reasonable to want to use a real
> checksum, and this PR game you are playing is frustrating and
> confidence-harming for people who want that.

There is no PR campaign. It is what it is. What is done is done.

> This does not have to become a big deal, unless you try to spin it
> with a 7200rpm PR machine like IBM did with their broken Deathstar
> drives before they became HGST.
>
> Please, what we need to do is admit that the checksum is relevantly
> broken in a way that compromises the integrity guarantees with which
> ZFS was sold to many customers, fix the checksum, and learn how to
> conveniently migrate our data.

Unfortunately, there is a backwards compatibility issue that requires the current fletcher2 to live for a very long time. The only question for debate is whether it should be the default. To date, I see no field data that suggests it is not detecting corruption.

> Based on the table you posted, I guess file data can be set to
> fletcher4 or sha256 using filesystem properties to work around the
> bug on Solaris versions with the broken implementation.
>
> 1. What's needed to avoid fletcher2 on the ZIL on broken Solaris
>    versions?

Please file RFEs at bugs.opensolaris.org

> 2. I understand the workaround, but not the fix.
>
>    How does the fix included in S10u8 and snv_114 work? Is there a ZFS
>    version bump? Does the fix work by implementing fletcher2
>    correctly? or does it just disable fletcher2 and force everything
>    to use brokenfletcher4 which is good enough? If the former, how
>    are the broken and correct versions of fletcher2
>    distinguished---do they show up with different names in the pool
>    properties?

The best I can tell, the comments are changed to indicate fletcher2 is deprecated. However, it must live on (forever) because of backwards compatibility. I presume one day the default will change to fletcher4 or something else. This is implied by zfs(1m):

     checksum=on | off | fletcher2 | fletcher4 | sha256

         Controls the checksum used to verify data integrity. The
         default value is on, which automatically selects an
         appropriate algorithm (currently, fletcher2, but this may
         change in future releases). The value off disables integrity
         checking on user data. Disabling checksums is NOT a
         recommended practice.

>    Once you have the fixed software, how do you make sure fixed
>    checksums are actually covering data blocks originally written by
>    old broken software? I assume you have to use rsync or zfs
>    send/recv to rewrite all the data with the new checksum? If yes,
>    what do you have to do before rewriting---upgrade solaris and then
>    'zfs upgrade' each filesystem one by one? Will zfs send/recv work
>    across the filesystem versions, or does the copying have to be
>    done with rsync?

I believe such a requirement would have a half-life of less than a nanosecond.

> 3. speaking of which, what about the checksum in zfs send streams?
>    is it also fletcher2, and if so was it also fixed in
>    s10u8/snv_114, and how does this affect compatibility for people
>    who have ignored my advice and stored streams instead of zpools?
>    Will a newer 'zfs recv' always work with an older 'zfs send' but
>    not the other way around?

fletcher4. Thanks for reminding me... I'll update my slides :-)

> there is basically no information about implementing the fix in the
> bug, and we can't write to the bug from outside Sun. Whatever
> sysadmins need to do to get their data under the strength of checksum
> they thought it was under, it might be nice to describe it in the bug
> for whoever gets referred to the bug and has an affected version.

UTSL

Bottom line: the checksum match does not guarantee correctness, but a checksum mismatch does indicate differences. In general, this is how checksums work, no?
-- richard
Richard, with respect to:

"This has been answered several times in this thread already.
set checksum=sha256 filesystem
copy your files -- all newly written data will have the sha256 checksums."

I understand that. I understood it before the thread started. I did not ask this. It is a fact that there is no feature to convert checksums as part of a resilver or some such. I started by asking what utility to use, but quickly zeroed in on zfs send/receive as being the native and presumably best method, but had questions as to how to get the property set correctly when the target file system is automatically created, etc. Note that my focus in recent portions of the thread has changed to the underlying zpool.

Simply changing checksum=sha256 and copying my data is analogous to hanging my data from a hierarchy of 0.256" welded steel chain, with the top of the hierarchy hanging it all from a 0.001" steel thread. Well, that is not quite fair, because there are probabilities involved. Someone is going to pick a link randomly and go after it with a fingernail clipper. If they pick a thick one, I have very little to worry about, to say the least. If they pick one of the few dozen? hundred? thousand? (I don't know how many) that contain the structure and services of the underlying zpool, then the nail clipper will not be stopped by the 0.001" thread. I do have 8,000,000,000 links in the chain, and only a very small fraction are 0.001" thick, and that is strongly in my favor, but I would expect the heads to also spend a disproportionate amount of time over the intent log. It is hard to know how it comes out. I just don't want 0.001" steel threads protecting my data from the gremlins. I moved to ZFS to avoid gambles. If I wanted gambles I would use Linux raid and lvm2. They work well enough if there are no errors.

I should have enumerated the knowns and unknowns in my list last night; then I would not have annoyed you with my apparent deafness. (Hopefully I am not still being deaf.) I will clarify below, as I should have last night:

Given that I only have 1.6TB of data in a 4TB pool, what can I do to change those blocks to sha256 or fletcher4:

(1) Without destroying and recreating the zpool under U4

I know how to fix the user data (just change the checksum property on the pool, using zfs and specifying the pool vs. a ZFS file system, then copy the data). I don't know (am ignorant of) the blocks comprising the underlying zpool, and how to fix them without recreating the pool. It makes sense to me that at least some would be rewritten in the course of using the system, but (1) I have had no confirmation or denial that this is the case, (2) I don't know if this is all of them or some of them, (3) I don't know if the checksum= parameter would affect these (relling's Oct 2 at 3:26 post implies that it does not, by lack of reference to the checksum property). So I don't know yet how much exposure will remain. I would think that if the user specified a stronger checksum for their data, the system would abandon its use of weaker ones in the underlying structure, but Richard's list seems to imply otherwise.

(2) With destroying and recreating the zpool under U4 (which I don't really have the resources to pull off)

Due to some of the non-technical factors in the situation, I cannot actually execute an experimental valid zpool command, but "zpool create -o garbage" gives me a usage that does not include any -o or -O. So it appears that under U4 I cannot do this.
I wish there were someone who could confirm that I can or cannot do this before I arrange for, and propose, that we dive into this massive undertaking. Also, from Richard's Oct 2 3:26 note, I infer that this will not change the checksum used by the underlying zpool anyway, so this might be fruitless. But I am inferring... Richard gave a quick list; he was not aiming to provide every level of precise detail, so I really don't know. Many of the answers I have received have turned out to recommend features that are not available in U4 but in later versions, even unreleased versions. I have no way of sorting this out without the information being qualified with a version.

(3) With upgrading to U7 (perhaps in a few months)

Not clear what this will support on zpool, or if it would be effective (similar to U4 above).

(4) With upgrading to U8

Not sure when it will come out, what it will support, or if it will be effective (similar to U7 and U4 above).

So I can enable robust protection on my user data, but perhaps not the infrastructure needed to get at that user data, and perhaps not the intent log. The answer may be that I cannot be helped. That is not the desired answer, but if that is the case, so be it. Let's lay out the facts and the best way to move on from here, for me and everybody else. Why leave us thrashing in the dark? Am I a Mac user? I personally still believe ZFS is the way to go - in the short term because it is still a vastly better gamble, and in the long term because this too will pass as file systems are rebuilt. I would question why anything but the best would be used for the underlying zpool, and why there is absolutely zero presentation of the tradeoffs between the three algorithms in the admin guide, but that is another story.

I know that someone out there, and probably people reading this thread, knows the answers to these questions. I hate to stop without that simple knowledge being communicated. I do greatly appreciate the attempts to work with me; I don't understand how I could be clearer. -- This message posted from opensolaris.org
On Fri, 2 Oct 2009, Ray Clark wrote:

> With the current fletcher2 we have only a 50-50 chance of catching
> these multi-bit errors. Probability of multiple bits being changed
> is not

What is the current fletcher2? A while back I seem to recall reading a discussion in the zfs-code forum about how the original zfs fletcher2 was found to be unexpectedly weak and broken, so they updated the "fletcher2" algorithm and assigned it a new enumeration so that fresh blocks use the corrected algorithm. I could be just imagining all of this, but that is what I remember today. Since you are using Solaris 10 U4, maybe you are using the dinosaur version of fletcher2?

Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Oct 3, 2009, at 7:46 AM, Ray Clark wrote:

> Richard, with respect to: "This has been answered several times in this
> thread already. set checksum=sha256 filesystem, copy your files -- all
> newly written data will have the sha256 checksums."
>
> I understand that. I understood it before the thread started. I did not
> ask this. It is a fact that there is no feature to convert checksums as
> a part of resilver or some such.

There is no such feature. There is a long-awaited RFE for block pointer rewriting (checksums are stored in the block pointer) which would add the underlying capabilities for this.

> I started by asking what utility to use, but quickly zeroed in on zfs
> send/receive as being the native and presumably best method, but had
> questions as to how to get the property set when the file system was
> automatically created, etc.

Say for example I have a pool called zwimming with some stuff in it and checksum=sha256 set on it. To create a copy of the data using send/recv in the same pool, written with sha256 checksums, do:

  zfs snapshot zwimming@now
  zfs send zwimming@now | zfs receive zwimming/new

You will now have a new file system called "zwimming/new" with the same data as zwimming, but with checksum=sha256. If you then want to get back to the original directory structure you can set the mountpoint properties, as desired. There are dozens of other ways to accomplish the copy.

> Note that my focus in recent portions of the thread has changed to the
> underlying zpool.
>
> Simply changing checksum=sha256 and copying my data is analogous to
> hanging my data from a hierarchy of 0.256" welded steel chain, with the
> top of the hierarchy hanging it all from a 0.001" steel thread. [...]
> I just don't want any 0.001" steel threads protecting my data from the
> gremlins. I moved to ZFS to avoid gambles. If I wanted gambles I would
> use Linux raid and lvm2. They work well enough if there are no errors.

I think you are missing the concept of pools. Pools contain datasets. One form of dataset is a file system. Pools do not contain data per se; datasets contain data. Reviewing the checksums used with this hierarchy in mind:

  Pool
    Label [SHA-256]
    Uberblock [SHA-256]
    Metadata [fletcher4]
    Gang block [SHA-256]
    ZIL log [fletcher2]
  Dataset (file system or volume)
    Metadata [fletcher4]
    Data [fletcher2 (default, today), fletcher4, or SHA-256]
  Send stream [fletcher4]

With this in mind, I don't understand your steel analogy.

wrt the ZIL: it is rarely used for normal file system access. ZIL blocks are allocated from the pool as needed and freed no more than 30 seconds later, unless there is a sudden halt. If the system is halted, then the ZIL is used to roll forward transactions. The heads do not "spend a disproportionate amount of time over the intent log."
-- richard
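To make the "get back to the original directory structure" remark above a little more concrete: since zwimming/new cannot be renamed over the pool's root dataset, swapping mountpoints is probably the simplest route. This is only a sketch, and the paths are illustrative assumptions:

  # Sketch; assumes zwimming was mounted at /zwimming (the default) -- adjust to taste.
  zfs set mountpoint=none zwimming            # take the original file system out of the namespace
  zfs set mountpoint=/zwimming zwimming/new   # present the sha256 copy at the old location
  zfs get -r checksum,mountpoint zwimming     # sanity-check what ended up where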
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:re> If I was to refer to Fletcher''s algorithm, I would use re> Fletcher. When I am referring to the ZFS checksum setting of re> "fletcher2" I will continue to use "fletcher2" haha okay, so to clarify, when reading a Richard Elling post: fletcher2 = ZFS''s broken attempt to implement a 32-bit Fletcher checksum Fletcher = hypothetical correct implementation of a Fletcher checksum In that case, for clarity I think I''ll have to use the word ``broken'''' a lot more often. >> How does the fix included S10u8 and snv_114 work? re> The best I can tell, the comments are changed to indicate re> fletcher2 is deprecated. You are saying the ``fix'''' was a change in documentation, nothing else? The default is still fletcher2, and there is no correct implementation of the Fletcher checksum only the good-enough-but-broken fletcher4, which is not the default? Also, there is no way to use a non-broken checksum on the ZIL? doesn''t sound fixed to me. At least there is some transparency, though, and a partial workaround. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091003/ba1888cd/attachment.bin>
On Sat, 3 Oct 2009, Miles Nordin wrote:

> re> The best I can tell, the comments are changed to indicate
> re> fletcher2 is deprecated.
>
> You are saying the "fix" was a change in documentation, nothing
> else? The default is still fletcher2, and there is no correct
> implementation of the Fletcher checksum, only the
> good-enough-but-broken fletcher4, which is not the default?

It seems that my memory is kind of crappy (like fletcher2). There were discussions of the fletcher2 issue on the zfs-code list starting in March and ending in May:

http://mail.opensolaris.org/pipermail/zfs-code/2009-March/thread.html
http://mail.opensolaris.org/pipermail/zfs-code/2009-April/thread.html
http://mail.opensolaris.org/pipermail/zfs-code/2009-May/thread.html

Unless someone has a legal requirement to prove data integrity, the fletcher2 woes do not seem like something most people need to worry much about. After all, before zfs, this level of validation did not exist at all. Fletcher2 will still catch most instances of data corruption. One thing I did learn from this discussion is that when accessing uncached memory, the performance of fletcher2 and fletcher4 is roughly equivalent, so there is usually no penalty for enabling fletcher4. It does seem like there could be some CPU impact for synchronous writes with fletcher4, since it is more likely that the data is in cache for a synchronous write.

Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
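If that cost trade-off holds, switching a pool over to fletcher4 for newly written data is a one-liner. A sketch, with "tank" standing in for whatever the pool is actually called; as discussed throughout this thread, existing blocks keep whatever checksum they were written with:

  # Sketch; the property only affects blocks written after the change.
  zfs set checksum=fletcher4 tank     # set at the pool's top-level dataset so descendants inherit it
  zfs get -r checksum tank            # see which datasets now show fletcher4 versus a local override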
On Oct 3, 2009, at 12:22 PM, Miles Nordin wrote:

> You are saying the "fix" was a change in documentation, nothing
> else? The default is still fletcher2, and there is no correct
> implementation of the Fletcher checksum, only the
> good-enough-but-broken fletcher4, which is not the default?
>
> Also, there is no way to use a non-broken checksum on the ZIL?

The ZIL is a slightly different beast. If there is a checksum mismatch while processing the log, it signals the effective end of the log. This is why log entries are self-checksummed. In other words, if you reach garbage, then you've reached the end of the log. The probability of the garbage having both a valid fletcher2 checksum at the proper offset and having the proper sequence number and having the right log chain link and having the right block size is considerably lower than the weakness of fletcher2. Unfortunately, the ZIL is also latency sensitive, so the performance case gets stronger while the additional error checking already boosts the dependability case. -- richard
With respect to relling's Oct 3 2009 7:46 AM post:

> I think you are missing the concept of pools. Pools contain datasets.
> One form of dataset is a file system. Pools do not contain data per se;
> datasets contain data. Reviewing the checksums used with this
> hierarchy in mind:
>
>   Pool
>     Label [SHA-256]
>     Uberblock [SHA-256]
>     Metadata [fletcher4]
>     Gang block [SHA-256]
>     ZIL log [fletcher2]
>   Dataset (file system or volume)
>     Metadata [fletcher4]
>     Data [fletcher2 (default, today), fletcher4, or SHA-256]
>   Send stream [fletcher4]
>
> With this in mind, I don't understand your steel analogy.

I am assuming, based on the context of your presentation, that the above list of "pool stuff" is exhaustive -- that this is everything not in a dataset. My "steel analogy" is based on the assumption that the pool-level stuff you list above is needed to gain access to the dataset. If the dataset can be accessed with all of the pool stuff trashed, then my steel thread does not exist. But that would also mean the pool stuff is extraneous, so I doubt that this is the case.

Given that all of the pool stuff is either sha256 or fletcher4 except for the ZIL, I have a new understanding which suggests (though I don't understand the details of the system) that I am not depending on fletcher2-protected data, and my steel thread is actually pretty thick, not 0.001". Based on your comments regarding the ZIL, I am inferring that stuff is written there and never used, except for a restart after a messy shutdown. I might be exposed to whatever weakness fletcher2 has as implemented, but only in these rare circumstances. Normal transactions and data would not be impacted by corruption in the ZIL blocks, since those blocks would never be read. So a large layer of probability protects me: I would have to have a crash coinciding with corruption in the ZIL that hits on a fletcher2 weakness.

Based on all of this I believe I am relatively happy simply copying my data, not recreating my zpool. As Darren Moffat taught me, I can "zfs set checksum=sha256 zfs01", where zfs01 is the zpool, then "zfs send zfs01/home@snapshot | zfs receive zfs01/home.new", and the new file system will be all sha256 as long as I don't specify the -R option on the zfs send. All of this is supported in U4; I believe it has to be supported because of the presence of files with properties in the (odd?) zfs file system that exists at the zfs01 zpool level before any zfs file systems are created.

So assuming the above process works, this thread is done as far as I am concerned right now. Thank you all for your help; not to snub anyone, but Darren, Richard, and Cindy especially come to mind. Thanks for sparring with me until we understood each other. --Ray -- This message posted from opensolaris.org
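One way to get some "verify behavior after it is done" confidence: check the property on the received file system, and spot-check what the on-disk block pointers actually carry with zdb. This is only a sketch -- the dataset name and object number are illustrative, and the exact zdb output format varies between releases:

  # Confirm the property the new file system will use for future writes:
  zfs get checksum zfs01/home.new

  # Spot-check blocks already written: dump a file's block pointers and look
  # for the checksum name (e.g. "sha256" rather than "fletcher2") in the output.
  zdb -dddd zfs01/home.new            # list objects to find a file's object number
  zdb -ddddd zfs01/home.new 12345     # 12345 is a hypothetical object number; with enough -d's,
                                      # block pointers and their checksum algorithm are printed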
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:re> The probability of the garbage having both a valid fletcher2 re> checksum at the proper offset and having the proper sequence re> number and having the right log chain link and having the re> right block size is considerably lower than the weakness of re> fletcher2. I''m having trouble parsing this. I think you''re confusing a few different failure modes: * ZIL entry is written, but corrupted by the storage, so that, for example, an entry should be read from the mirrored ZIL instead. + broken fletcher2 detects the storage corruption CASE A: Good! + broken fletcher2 misses the error, so that corrupted data is replayed from ZIL into the proper pool, possibly adding a stronger checksum to the corrupt data while writing it. CASE B: Bad! + broken fletcher2 misinterprets storage corruption as signalling the end of the ZIL, and any data in the ZIL after the corrupt entry is truncated without even attempting to read the mirror. (does this happen?) CASE C: Bad! * ZIL entry is intentional garbage, either a partially-written entry or an old entry, and should be treated as the end of the ZIL + broken fletcher2 identifies the partially written entry by a checksum mismatch, or the sequence number identifies it as old CASE D: Good! + broken fletcher2 misidentifies a partially-written entry as complete because of a hash collision CASE E: Bad! + (hypothetical, only applies to non-existent fixed system) working fletcher2 or broken-good-enough fletcher4 misidentifies a partially-written entry as complete because of a hash collision CASE F: Bad! If I read your sentence carefully and try to match it with this chart, it seems like you''re saying P(CASE F) << P(CASE E), which seems like an argument for fixing the checksum. While you don''t say so, I presume from your other posts you''re trying to make a case for doing nothing, so I''m confused. I was mostly thinking about CASE B though. It seems like the special way the ZIL works has nothing to do with CASE B: if you send data through the ZIL to a sha256 pool, it can be written to ZIL under broken-fletcher2, corrupted by the storage, and then read in and played back corrupt but covered with a sha256 checksum to the pool proper. AFAICT your relative-probability sentence has nothing to do with CASE B. re> Unfortunately, the ZIL is also latency sensitive, so the re> performance case gets stronger The performance case advocating what? not fixing the broken checksum? re> while the additional error checking already boosts the re> dependability case. what additional error checking? Isn''t the whole specialness of the ZIL that the checksum is needed in normal operation, absent storage subsystem corruption, as I originally said? It seems like the checksum''s strength is more important here, not less. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091004/2057fad4/attachment.bin>
On Oct 4, 2009, at 11:51 AM, Miles Nordin wrote:

> I'm having trouble parsing this. I think you're confusing a few
> different failure modes:
>
> * ZIL entry is written, but corrupted by the storage, so that, for
>   example, an entry should be read from the mirrored ZIL instead.

This is attempted, if you have a mirrored slog.

> + broken fletcher2 detects the storage corruption
>   CASE A: Good!
>
> + broken fletcher2 misses the error, so that corrupted data is
>   replayed from ZIL into the proper pool, possibly adding a
>   stronger checksum to the corrupt data while writing it.
>   CASE B: Bad!
>
> + broken fletcher2 misinterprets storage corruption as signalling
>   the end of the ZIL, and any data in the ZIL after the corrupt
>   entry is truncated without even attempting to read the mirror.
>   (does this happen?)
>   CASE C: Bad!
>
> * ZIL entry is intentional garbage, either a partially-written entry
>   or an old entry, and should be treated as the end of the ZIL
>
> + broken fletcher2 identifies the partially written entry by a
>   checksum mismatch, or the sequence number identifies it as old
>   CASE D: Good!

If the checksum mismatches, you can't go any further because the pointer to the next ZIL log entry cannot be trusted. So the roll forward stops. This is how such logs work -- there is no end-of-log record.

> + broken fletcher2 misidentifies a partially-written entry as
>   complete because of a hash collision
>   CASE E: Bad!
>
> + (hypothetical, only applies to non-existent fixed system) working
>   fletcher2 or broken-good-enough fletcher4 misidentifies a
>   partially-written entry as complete because of a hash collision
>   CASE F: Bad!

As I said before, if the checksum matches, then the data is checked for sequence number = previous + 1, blk_birth == 0, and the size being correct. Since this data lives inside the block, it is unlikely that a collision would also result in a valid block. In other words, ZFS doesn't just trust the checksum for slog entries. -- richard
On Sat, October 3, 2009 17:18, Ray Clark wrote:

> Thank you all for your help, not to snub anyone, but Darren, Richard, and
> Cindy especially come to mind. Thanks for sparring with me until we
> understood each other.

I'd like to echo this (and extend the thanks to include Ray). I'm now starting to feel that I understand this issue, and I didn't for quite a while. And that I understand the risks better, and have a clearer idea of what the possible fixes are. And I didn't before. That I do now is due to Ray's persistence, and to the rest of your patience. Thank you!

-- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
On Mon, Oct 5, 2009 at 10:27 AM, David Dyer-Bennet <dd-b at dd-b.net> wrote:

> On Sat, October 3, 2009 17:18, Ray Clark wrote:
>
>> Thank you all for your help, not to snub anyone, but Darren, Richard, and
>> Cindy especially come to mind. Thanks for sparring with me until we
>> understood each other.
>
> I'd like to echo this (and extend the thanks to include Ray). I'm now
> starting to feel that I understand this issue, and I didn't for quite a
> while. And that I understand the risks better, and have a clearer idea of
> what the possible fixes are. And I didn't before. That I do now is due
> to Ray's persistence, and to the rest of your patience. Thank you!

Excellent, can this thread die now? :P
Question (for Richard E): Is there a write-up on the fix for the broken ZFS fletcher implementation? Is the default checksum for new pool creation changed in U8? Is the default checksum for new pool creation changed in OpenSolaris or SXCE (which versions)? Is there a case open to allow the user to select the checksum to be used when a ZIL is being created?

Interesting thread - and commiserations to the ZFS team on the broken fletcher implementation - we (developers) all have bad days!!

Regards, -- Al Hopper Logical Approach Inc, Plano, TX al at logical-approach.com Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:re> As I said before, if the checksum matches, then the data is re> checked for sequence number = previous + 1, the blk_birth = re> 0, and the size is correct. Since this data lives inside the re> block, it is unlikely that a collision would also result in a re> valid block. That''s just a description of how the zil works, not an additional layer of protection for user data in the ZIL beyond the checksum. The point of all this is to avoid needing to write a synchronous commit sector to mark the block valid. Instead, the block becomes valid once it''s entirely written. Yes, the checksum has an additional, critical, use in the ZIL compared to its use in the bulk pool, but checking these header fields for sanity does nothing to mitigate broken fletcher2''s weakness in detecting corruption of the user data stored inside the zil records. It''s completely orthogonal. If anything, the additional use of broken fletcher2 in the ZIL is a reason it''s even more important to fix the checksum in the ZIL: checksum mismatches occur in the ZIL even during normal operation, even when the storage is not misbehaving, because sometimes blocks are incompletely written. This is the normal case, not the exception, because the ZIL is only read after unclean shutdown. and AIUI you are saying fletcher2 is still the default for bulk pool data, too? even on newly created pools with the latest code? The fix was just to add the word ``deprecated'''' to some documentation somewhere, without actually performing the deprecation? I feel like FreeBSD/NetBSD would probably have left this bug open until it''s fixed. :/ Ubuntu or Gentoo would probably keep closing and reopening it though while people haggled in the comments section. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091005/43c3913f/attachment.bin>
On 05.10.09 23:07, Miles Nordin wrote:

> And AIUI you are saying fletcher2 is still the default for bulk pool
> data, too? Even on newly created pools with the latest code?

Here's essentially the fix:

http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/zio.h?r2=%252Fonnv%252Fonnv-gate%252Fusr%252Fsrc%252Futs%252Fcommon%252Ffs%252Fzfs%252Fsys%252Fzio.h%409454%3A02e1ddcc9be7&r1=%252Fonnv%252Fonnv-gate%252Fusr%252Fsrc%252Futs%252Fcommon%252Ffs%252Fzfs%252Fsys%252Fzio.h%409443%3A2a96d8478e95

It changes the setting checksum=on to mean "fletcher4", so fletcher4 is used by default for all user data and metadata. You can still set it to "fletcher2" explicitly.

victor
>>>>> "bm" == Brandon Mercer <yourcomputerpal at gmail.com> writes:>> I''m now starting to feel that I understand this issue, >> and I didn''t for quite a while. ?And that I understand the >> risks better, and have a clearer idea of what the possible >> fixes are. ?And I didn''t before. haha, yes, I think I can explain it to people when advocating ZFS, but the story goes something like ``ZFS is serious business and pretty useful, but it has some pretty hilarious problems that you wouldn''t expect from some of the blog hype you read. Let me give you a couple examples of things that still aren''t fixed and how the discussion went...'''' bm> Excellent, can this thread die now? :P If no one is going to fix the problem, I guess so. I''m not intending to submit a patch myself---I''ll just use fletcher4 or sha256 for the bulk pool, and cross my fingers for the ZIL. I''m not even sure there is any point in submitting a patch because it sounds like the problem is political, not code. Fixing the math mistake would be trivial for the person who originally wrote broken-fletcher2, but if you break pool compatibility, you widen the discussion about the original broken checksum to include all the ZFS-loving the hype-blogs. I''m just surprised the bug is closed without fixing the problem, and that any ZFS user who didn''t participate in this thread will almost certainly still end up creating pools with broken checksums. That doesn''t seem right at all, especially when ZFS has so many simple convenient paths for eventually fixing the problem. Why not simply change the default for new filesystems to fletcher4? This is backward-compatible. Because of the way opensolaris is livecd-install-then-upgrade, new users will continue getting broken checksums for several months even with this fix until the next livecd comes out, but at least it''s an eventual resolution. As for the ZIL, why not change it to broken-fletcher4 the next time the ZFS ''update'' version is incremented? The ZIL is less urgent to fix on a scale of months because users don''t have to migrate all their data to get the new checksum, so sites won''t be stuck with broken ZIL checksums after upgrading their software like they are for the bulk data in the pool with livecd-then-upgrade. If fletcher4 is some ``performance issue'''' (is it?), then implement nfletcher2 (correct implementation of Fletcher''s checksum) and include it in the new ZFS version as the default. The only argument I can think of for doing nothing is, it''s like mercury in vaccines or broken autoclaves---if you respond, it''s admitting there was a problem in the first place, while until you respond you can balance the effort you spend fuzzing the issue against your liability. However in this case I don''t think anyone needs a 60 minutes exposee. It''s impossible to argue the problem''s imaginary, especially after so much ZFS advocacy was based on drumming up FUD about how naked you supposedly are without these checksums. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091005/825bfdc6/attachment.bin>
>>>>> "vl" == Victor Latushkin <Victor.Latushkin at Sun.COM> writes:vl> It changes setting of checksum=on to mean "fletcher4" oh, good. so it is only the ZIL that''s unfixed? At least that fix could come from a simple upgrade, if it ever gets fixed. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091005/12f139c3/attachment.bin>
Richard, The sub-threads being woven with a couple of other people are very important, though they are not my immediate issue. I really don't think you need us debating with *you* about this - I think you could argue our point also. What we need to get across is a perspective.

I am pretty sure that the current fletcher2 algorithm as implemented does not provide the level of security intended by the guy who originally extrapolated fletcher2 from Fletcher for ZFS. There are no rocks to throw; every single one of us, and every system, has goofed once or twice. The point is that once you experience a few issues of flakey hardware and data corruption, then read the ZFS propaganda, you never want to go back. Yes, everything you said was true. But having seen the vision, we just are not interested in being convinced that it is relatively alright. We have been there. We are interested in a strategy, a roadmap, to move on and get back to the vision.

Just remember that we are *here* primarily because we see ZFS as being many orders of magnitude more reliable, both in terms of not losing data and in terms of telling us when it does. To dilute this capability is to dilute ZFS' differentiation. To not be transparent is to invite uncertainty and distrust.

Perhaps in hindsight, and given the extreme aversion to risk exhibited by your users, the ZFS team might review the checksums used on the zpool-level structures. I certainly would, but I am willing to let the ZFS team, who understand the mathematics, probabilities, and implications better than I do, make this decision. Given that we are very technical customers and don't have our Mac hat on, I also believe it would be appropriate to document some of the rationale. There has certainly been a lot of material explaining other technical issues - why not here?

At any rate, keep on going - we are all behind you 100%. Please give us an open technical solution that we can have 100% confidence in. --Ray -- This message posted from opensolaris.org
On 5-Oct-09, at 3:32 PM, Miles Nordin wrote:

>>>>>> "bm" == Brandon Mercer <yourcomputerpal at gmail.com> writes:
>
>>> I'm now starting to feel that I understand this issue,
>>> and I didn't for quite a while. And that I understand the
>>> risks better, and have a clearer idea of what the possible
>>> fixes are. And I didn't before.
>
> haha, yes, I think I can explain it to people when advocating ZFS, but
> the story goes something like "ZFS is serious business and pretty
> useful, but it has some pretty hilarious problems that you wouldn't
> expect

Let's talk about the "hilarious problems" that a naive RAID stack has, and that most users "don't expect". For a start, no crash-safe behaviour, and no way to self-heal from unexpected mirror desync. Then we could compare always-consistent COW with conventionally fragile metadata needing regular consistency checks...

> from some of the blog hype you read. Let me give you a couple
> examples of things that still aren't fixed

...and can't be fixed, in RAID, or conventional filesystems.

--Toby

> and how the discussion went..."