What is the "Best" way to convert the checksums of an existing ZFS file system from one checksum to another? To me "Best" means safest and most complete. My zpool is 39% used, so there is plenty of space available. Thanks. -- This message posted from opensolaris.org
I didn't want my question to lead to an answer, but perhaps I should have given more information. My idea is to copy the file system with one of the following: cp -rp, zfs send | zfs receive, tar, or cpio. But I don't know which would be best. Then I would do a "diff -r" on the two copies before deleting the old one. I don't know the "obscure" (for me) secondary things like attributes, links, extended modes, etc. Thanks again. -- This message posted from opensolaris.org
I had this same question. I was advised to use rsync or zfs send, and I used both just to be safe. With zfs send, you create a snapshot and then send the snapshot. After deleting the snapshot on the target, you have identical copies. rsync is commonly used for this task as well. -- This message posted from opensolaris.org
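[A minimal sketch of the snapshot/send/receive workflow described above; the pool and file system names (tank/home, tank/home.new) are placeholders, not from this thread:]

zfs snapshot tank/home@copy
zfs send tank/home@copy | zfs receive tank/home.new
zfs destroy tank/home.new@copy    # drop the snapshot on the target once the copy is verified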
When using zfs send/receive to do the conversion, the receive creates a new file system:

zfs snapshot zfs01/home@before
zfs send zfs01/home@before | zfs receive afx01/home.sha256

Where do I get the chance to "zfs set checksum=sha256" on the new file system before all of the files are written ??? The new file system is created automatically by the receive command! Although it does not say so in the man page or the ZFS admin guide, it certainly seems reasonable that I don't get a chance - the idea is that send/receive recreates the file system exactly. This still leaves an ambiguity: are the new blocks copied with the checksum algorithm they had in the source file system (which would not result in the conversion I am trying to accomplish), or are they created and checksummed with the algorithm specified by the checksum PROPERTY set in the source file system at the time of the send/receive (which WOULD do the conversion I am trying to accomplish)? Is there a way to use send/receive to duplicate a file system with a different checksum, or do I use cpio or tar? (I pick on cpio and tar because they are specifically called out in the ZFS admin manual as saving and restoring ZFS file attributes and ACLs.) Thanks. --Ray -- This message posted from opensolaris.org
Ray Clark wrote:
> When using zfs send/receive to do the conversion, the receive creates a new file system:
>
> zfs snapshot zfs01/home@before
> zfs send zfs01/home@before | zfs receive afx01/home.sha256
>
> Where do I get the chance to "zfs set checksum=sha256" on the new file system before all of the files are written ???

Set it on the afx01 dataset before you do the receive and it will be inherited.

-- Darren J Moffat
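[A sketch of that suggestion, reusing the pool names from the posts above (zfs01 as the source pool, afx01 as the destination pool); the final property check is only there to confirm the inheritance:]

zfs set checksum=sha256 afx01
zfs send zfs01/home@before | zfs receive afx01/home.sha256
zfs get checksum afx01/home.sha256    # should report sha256, inherited from afx01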
I made a typo... I only have one pool. I should have typed:

zfs snapshot zfs01/home@before
zfs send zfs01/home@before | zfs receive zfs01/home.sha256

Does that change the answer? And independently of whether it does, zfs01 is a pool, and the property is on the home ZFS file system. I cannot change it on the file system before doing the receive because the file system does not exist - it is created by the receive. This raises a related question: is the file system on the receiving end created entirely using the checksum property from the source file system, or are the blocks and their present mix of checksums faithfully recreated in the received file system? Finally, is there any way to verify the behavior after it is done? Thanks for helping on this. -- This message posted from opensolaris.org
Ray Clark wrote:
> I made a typo... I only have one pool. I should have typed:
>
> zfs snapshot zfs01/home@before
> zfs send zfs01/home@before | zfs receive zfs01/home.sha256
>
> Does that change the answer?

No, it doesn't change my answer.

> And independently if it does or not, zfs01 is a pool, and the property is on the home zfs file system.

It doesn't matter if zfs01 is the top-level dataset or not. Before you do the receive, do this:

zfs set checksum=sha256 zfs01

-- Darren J Moffat
Dynamite! I don't feel comfortable leaving things implicit - that is how misunderstandings happen. Would you please acknowledge that zfs send | zfs receive uses the checksum setting on the receiving pool instead of preserving the checksum algorithm used by the sending block? Thanks a million! --Ray -- This message posted from opensolaris.org
Sinking feeling... zfs01 was originally created with fletcher2. Doesn't this mean that the sort of "root level" stuff in the ZFS pool exists with fletcher2 and so is not well protected? If so, is there a way to fix this short of a backup and restore? -- This message posted from opensolaris.org
Ray Clark wrote:
> Dynamite!
>
> I don't feel comfortable leaving things implicit. That is how misunderstandings happen.

It isn't implicit, it is explicitly inherited - that is how ZFS is designed to (and does) work.

> Would you please acknowledge that zfs send | zfs receive uses the checksum setting on the receiving pool instead of preserving the checksum algorithm used by the sending block?

For now it depends on whether or not you pass -R to 'zfs send'. Without the -R argument the send stream does not have any properties in it, so the receive will (by design) use those that would be used if the dataset were created by 'zfs create'. In the future there will be a distinction between the local and the received values; see the recently (yesterday) approved case PSARC/2009/510:

http://arc.opensolaris.org/caselog/PSARC/2009/510/20090924_tom.erickson

Let's look at how it works just now:

portellen:pts/2# zpool create dummy c7t3d0
portellen:pts/2# zfs create dummy/home
portellen:pts/2# cp /etc/profile /dummy/home
portellen:pts/2# zfs get checksum dummy/home
NAME        PROPERTY  VALUE  SOURCE
dummy/home  checksum  on     default
portellen:pts/2# zfs snapshot dummy/home@1
portellen:pts/2# zfs set checksum=sha256 dummy
portellen:pts/2# zfs send dummy/home@1 | zfs recv -F dummy/home.sha256
portellen:pts/2# zfs get checksum dummy/home.sha256
NAME               PROPERTY  VALUE   SOURCE
dummy/home.sha256  checksum  sha256  inherited from dummy

Now let's verify using zdb. We should have two plain file blocks (/etc/profile fits in a single ZFS block), one from the original dummy/home and one from the newly received home.sha256:

portellen:pts/2# zdb -vvv -S user:all dummy
0  2048  1  ZFS plain file  fletcher4  uncompressed  8040e8f120:a2c635bc0556:73b5ba539e9699:3b4d66984ac9d6b4
0  2048  1  ZFS plain file  SHA256     uncompressed  57f1e8168c58e8cf:3b20be148f57852e:f72ee8e66663358f:1bfae4ae0599577c

-- Darren J Moffat
On 10/01/09 05:08 AM, Darren J Moffat wrote:
> In the future there will be a distinction between the local and the
> received values; see the recently (yesterday) approved case PSARC/2009/510:
>
> http://arc.opensolaris.org/caselog/PSARC/2009/510/20090924_tom.erickson

Currently non-recursive incremental streams send properties and full streams don't. Will the "p" flag reverse its meaning for incremental streams? For my purposes the current behavior is the exact opposite of what I need, and it isn't obvious that the case addresses this peculiar inconsistency without going through a lot of hoops. I suppose the new properties can be sent initially so that subsequent incremental streams won't override the possibly changed local properties, but that seems so complicated :-). If I understand the case correctly, we can now set a flag that says "ignore properties sent by any future incremental non-recursive stream", instead of having a flag for incremental streams that says "don't send properties". What happens if sometimes we do and sometimes we don't? That sounds like a static property when a dynamic flag is really what is wanted, and a complicated way of working around a design inconsistency. But maybe I missed something :-) So what would the semantics of the new "p" flag be for non-recursive incremental streams? Thanks -- Frank
Darren, thank you very much! Not only have you answered my question, you have made me aware of a tool to verify, and probably do a lot more (zdb).

Can you comment on my concern regarding what checksum is used in the base zpool before anything is created in it? (No doubt my terminology is wrong, but you get the idea I am sure.)

The single critical feature of ZFS is debatably that every block on ZFS is checksummed to enable detection of corruption, but it appears that the user does not have the ability to choose the checksum for the highest levels of the pool itself. Given the issue with fletcher2, this is of concern! Since this "activity" was kicked off by a "Corrupt Metadata" ZFS-8000-CS, I am trying to move away from fletcher2. I don't know if that was the cause, but my goal is to restore the "safety" that we went to ZFS for.

Is my understanding correct? Are there ways to control the checksum algorithm on the empty zpool? Thanks again. --Ray -- This message posted from opensolaris.org
On Oct 1, 2009, at 7:10 AM, Ray Clark wrote:
> Darren, thank you very much! Not only have you answered my
> question, you have made me aware of a tool to verify, and probably
> do a lot more (zdb).
>
> Can you comment on my concern regarding what checksum is used in the
> base zpool before anything is created in it? (No doubt my
> terminology is wrong, but you get the idea I am sure.)
>
> The single critical feature of ZFS is debatably that every block on
> ZFS is checksummed to enable detection of corruption, but it appears
> that the user does not have the ability to choose the checksum for
> the highest levels of the pool itself. Given the issue with
> fletcher2, this is of concern! Since this "activity" was kicked off
> by a "Corrupt Metadata" ZFS-8000-CS, I am trying to move away from
> fletcher2. I don't know if that was the cause, but my goal is to
> restore the "safety" that we went to ZFS for.
>
> Is my understanding correct?
> Are there ways to control the checksum algorithm on the empty zpool?

You can set both zpool (-o option) and zfs (-O option) options when you create the zpool. See zpool(1m).
-- richard
Ray, if you don't mind me asking, what was the original problem you had on your system that makes you think the checksum type is the problem? -- This message posted from opensolaris.org
U4 zpool does not appear to support the -o option... Reading a current zpool man page online, it lists the valid properties for the current zpool -o, and checksum is not one of them. Are you mistaken, or am I missing something?

Another thought: *perhaps* all of the blocks that comprise an empty zpool are rewritten sooner or later, and once the checksum is changed with "zfs set checksum=sha256 zfs01" (the pool name), they will be rewritten with the new checksum very soon anyway. Is this true? Answering would require an understanding of the on-disk structure and of when what is rewritten. --Ray -- This message posted from opensolaris.org
Ray, if you use -o it sets properties for the pool. If you use -O (capital), it sets the file system properties for the default file system created with the pool. zpool create -O can use any valid ZFS file system option. But I agree, it's not very clearly documented. -- This message posted from opensolaris.org
You are correct. The zpool create -O option isn't available in a Solaris 10 release but will be soon. This will allow you to set the file system checksum property when the pool is created:

# zpool create -O checksum=sha256 pool c1t1d0
# zfs get checksum pool
NAME  PROPERTY  VALUE   SOURCE
pool  checksum  sha256  local

Otherwise, you would have to set it like this:

# zpool create pool c1t1d0
# zfs set checksum=sha256 pool
# zfs get checksum pool
NAME  PROPERTY  VALUE   SOURCE
pool  checksum  sha256  local

I'm not sure I understand the second part of your comments but will add: if *you* rewrite your data then the new data will contain the new checksum. I believe an upcoming project will provide the ability to revise file system properties on the fly.

On 10/01/09 12:21, Ray Clark wrote:
> U4 zpool does not appear to support the -o option... Reading a current zpool manpage online lists the valid properties for the current zpool -o, and checksum is not one of them. Are you mistaken or am I missing something?
>
> Another thought is that *perhaps* all of the blocks that comprise an empty zpool are re-written sooner or later, and once the checksum is changed with "zfs set checksum=sha256 zfs01" (the pool name) they will be re-written with the new checksum very soon anyway. Is this true? This would require an understanding of the on-disk structure and when what is rewritten.
>
> --Ray
Also, when a pool is created, there is only metadata, which uses fletcher4[*]. So it is not a crime if you set the checksum after the pool is created and before data is written :-)

* note: the uberblock uses SHA-256
-- richard

On Oct 1, 2009, at 12:34 PM, Cindy Swearingen wrote:
> You are correct. The zpool create -O option isn't available in a
> Solaris 10 release but will be soon. This will allow you to set the
> file system checksum property when the pool is created:
>
> # zpool create -O checksum=sha256 pool c1t1d0
> # zfs get checksum pool
> NAME  PROPERTY  VALUE   SOURCE
> pool  checksum  sha256  local
>
> Otherwise, you would have to set it like this:
>
> # zpool create pool c1t1d0
> # zfs set checksum=sha256 pool
> # zfs get checksum pool
> NAME  PROPERTY  VALUE   SOURCE
> pool  checksum  sha256  local
>
> I'm not sure I understand the second part of your comments but will add:
>
> If *you* rewrite your data then the new data will contain the new
> checksum. I believe an upcoming project will provide the ability to
> revise file system properties on the fly.
>
> On 10/01/09 12:21, Ray Clark wrote:
>> U4 zpool does not appear to support the -o option... Reading a
>> current zpool manpage online lists the valid properties for the
>> current zpool -o, and checksum is not one of them. Are you
>> mistaken or am I missing something?
>> Another thought is that *perhaps* all of the blocks that comprise
>> an empty zpool are re-written sooner or later, and once the
>> checksum is changed with "zfs set checksum=sha256 zfs01" (the pool
>> name) they will be re-written with the new checksum very soon
>> anyway. Is this true? This would require an understanding of the
>> on-disk structure and when what is rewritten.
>> --Ray
Data security. I migrated my organization from Linux to Solaris, driven away from Linux by the shortfalls of fsck on TB-size file systems, and towards Solaris by the features of ZFS. At the time I tried to dig up information concerning the tradeoffs associated with fletcher2 vs. fletcher4 vs. SHA256 and found nothing. Studying the algorithms, I decided that fletcher2 would tend to be weak for periodic data, which characterizes my data. I ran throughput tests and got 67MB/sec for fletcher2 and fletcher4 and 48MB/sec for SHA256. I projected (perhaps without basis) SHA256's cryptographic strength to also mean strength as a hash, and chose it since 48MB/sec is more than I need.

21 months later (9/15/09) I lost everything to a "corrupt metadata" ZFS-8000-CS (not sure where this was printed). No clue why to date; I will never know. The person who restored from tape was not informed to set checksum=sha256, so it all went in with the default, fletcher2.

Before taking rather disruptive actions to correct this, I decided to question my original decision and found schlie's post stating that a bug in fletcher2 makes it essentially a one bit parity on the entire block:
http://opensolaris.org/jive/thread.jspa?threadID=69655&tstart=30
While this is twice as good as any other file system in the world that has NO such checksum, this does not provide the security I migrated for. Especially given that I did not know what caused the original data loss, it is all I have to lean on.

Convinced that I need to convert all of the checksums to sha256 to have the data security ZFS purports to deliver, and in the absence of a checksum conversion capability, I need to copy the data. It appears that all of the implementations of the various means of copying data, from tar and cpio to cp to rsync to pax, have ghosts in their closets, each living in glass houses and each throwing stones at the others with respect to various issues with file size, filename lengths, pathname lengths, ACLs, extended attributes, sparse files, etc. It seems like zfs send/receive *should* be safe from all such issues as part of the ZFS family, but the questions raised here are ambiguous once one starts to think about it. If the file system is faithfully duplicated, it should also duplicate all properties, including the checksum used on each block. It appears (to my advantage) that this is not what is done. This enables the file system spontaneously created by zfs receive to inherit from the pool, which evidently can be set to sha256 even though it is a pool, not a file system in the pool.

The present question is protection on the base pool. This can be set when the pool is created, though not with U4, which I am running. It is not clear (yet) if this is simply not documented in the current release or if the version that supports this has not been released yet. If I were to upgrade (which I cannot do in a timely fashion), it would only be to U7. I cannot run a "weekly build" type of OS on my production server. Any way it goes I am hosed.

In short, there is surely some structure, some blocks with stuff written in them, when a pool is created but before anything else is done - else it would be a blank disk, not a ZFS pool. Are these "protected" by fletcher2 as the default? I have learned that the uberblock is protected by SHA256, other parts by fletcher4. Is this everything? In U4 was it fletcher4, or was this a recent change stemming from schlie's report?
In short, what is the situation with regard to the data security I switched to Solaris/ZFS for, and what can I do to achieve it? What *do* the tools do? Are there tools for what needs to be done - to convert things, to copy things, to verify things - and to do so completely and correctly?

So here is where I am: I should use zfs send/receive, but I cannot have confidence that there are not fletcher2-protected blocks (1 bit parity) at the most fundamental levels of the zpool. To verify data, I cannot depend on existing tools, since diff is not large-file aware. My best idea at this point is to calculate and compare MD5 sums of every file and spot check other properties as best I can.

Given this rather full perspective, help or comments are very much appreciated. I still think ZFS is the way to go, but the road is a little bumpy at the moment. -- This message posted from opensolaris.org
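[One way to do that MD5 comparison on Solaris is with digest(1); a sketch, assuming the original and the copy are mounted at /zfs01/home and /zfs01/home.sha256 - both paths are placeholders:]

cd /zfs01/home && find . -type f -exec digest -v -a md5 {} + | sort > /tmp/orig.md5
cd /zfs01/home.sha256 && find . -type f -exec digest -v -a md5 {} + | sort > /tmp/copy.md5
diff /tmp/orig.md5 /tmp/copy.md5    # no output means the file contents match

[This compares regular file contents only; ACLs, sparse-file holes, and other attributes still need separate spot checks.]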
I apologize that the preceding post appears out of context. I expected it to "indent" when I pushed the reply button on myxiplx's Oct 1, 2009 1:47 post; it was in response to his question. I will try to remember to provide links internal to my messages. -- This message posted from opensolaris.org
On 02 October, 2009 - Ray Clark sent me these 4,4K bytes:

> Data security. I migrated my organization from Linux to Solaris
> driven away from Linux by the shortfalls of fsck on TB size file
> systems, and towards Solaris by the features of ZFS.
[...]
> Before taking rather disruptive actions to correct this, I decided to
> question my original decision and found schlie's post stating that a
> bug in fletcher2 makes it essentially a one bit parity on the entire
> block:
> http://opensolaris.org/jive/thread.jspa?threadID=69655&tstart=30
> While this is twice as good as any other file system in the world that
> has NO such checksum, this does not provide the security I migrated
> for. Especially given that I did not know what caused the original
> data loss, it is all I have to lean on.

That post refers to bug 6740597
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6740597
which also refers to
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=2178540

So it seems like it's fixed in snv_114 and s10u8, which won't help your s10u4 unless you update.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Replying to Cindy's Oct 1, 2009 3:34 PM post:

Thank you. The second part was my attempt to guess my way out of this. If the fundamental structure of the pool (that which was created before I set the checksum=sha256 property) is using fletcher2, perhaps as I use the pool all of this structure will be updated, and will therefore automatically migrate to the new checksum. It would be very difficult for me to recreate the pool, but I have space to duplicate the "user" files (and so get the new checksum). Perhaps this will also result in the underlying "structure" of the pool being converted in the course of normal use. Comments for or against? -- This message posted from opensolaris.org
Replying to relling's October 1, 2009 3:34 post:

Richard, regarding "when a pool is created, there is only metadata which uses fletcher4": was this true in U4, or is this a recent change, with U4 using fletcher2? Similarly, did the uberblock use sha256 in U4? I am running U4. --Ray -- This message posted from opensolaris.org
Interesting answer, thanks :) I'd like to dig a little deeper if you don't mind, just to further my own understanding (which is usually rudimentary compared to a lot of the guys on here). My belief is that ZFS stores two copies of the metadata for any block, so corrupt metadata really shouldn't happen often. Could I ask what the structure of your pool is - what level of redundancy do you have there? The very fact that you had a 'corrupt metadata' error implies to me that the checksums have done their job in finding an error, and I'm wondering if the true cause could be further down the line. I'm still taking all this in though - we'll be using sha256 on our secondary system, just in case :) -- This message posted from opensolaris.org
My pool was the default, with checksum=sha256. The default has two copies of all metadata (as I understand it), and one copy of user data. It was a raidz2 with eight 750GB drives, yielding just over 4TB of usable space. I am not happy with the situation, but I recognize that I am 2x better off (1 bit parity) than I would be with any other file system. -- This message posted from opensolaris.org
webclark at rochester.rr.com said:
> To verify data, I cannot depend on existing tools since diff is not large
> file aware. My best idea at this point is to calculate and compare MD5 sums
> of every file and spot check other properties as best I can.

Ray, I recommend that you use rsync's "-c" to compare copies. It reads all the source files, computes a checksum for them, then does the same for the destination and compares checksums. As far as I know, the only thing that rsync can't do in your situation is the ZFS/NFSv4 ACLs. I've used it to migrate many TBs of data.

Regards, Marion
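[For example, a dry run that only reports files whose contents differ; the paths are placeholders:]

rsync -n -a -c -v /zfs01/home/ /zfs01/home.sha256/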
Ray,

The checksums are set on the file systems not the pool. If a new checksum is set and *you* rewrite the data, then the rewritten data will contain the new checksum. If your pool has the space for you to duplicate the user data and the new checksum is set, then the duplicated data will have the new checksum. ZFS doesn't rewrite data as part of normal operations.

I confirmed with a simple test (like Darren's) that even if you have a single-disk pool and the disk is replaced and all the data is resilvered and a new checksum is set, you'll see data with the previous checksum and the new checksum.

Cindy

On 10/02/09 08:44, Ray Clark wrote:
> Replying to Cindy's Oct 1, 2009 3:34 PM post:
>
> Thank you. The second part was my attempt to guess my way out of this. If the fundamental structure of the pool (that which was created before I set the checksum=sha256 property) is using fletcher2, perhaps as I use the pool all of this structure will be updated, and therefore automatically migrate to the new checksum. It would be very difficult for me to recreate the pool, but I have space to duplicate the "user" files (and so get the new checksum). Perhaps this will also result in the underlying "structure" of the pool being converted in the course of normal use.
>
> Comments for or against?
On Oct 2, 2009, at 7:46 AM, Ray Clark wrote:
> Replying to relling's October 1, 2009 3:34 post:
>
> Richard, regarding "when a pool is created, there is only metadata
> which uses fletcher4". Was this true in U4, or is this a new change
> of default with U4 using fletcher2? Similarly, did the uberblock
> use sha256 in U4? I am running U4.

ZFS uses different checksums for different things. Briefly,

use          checksum
---------------------------------------------------------
uberblock    SHA-256, self-checksummed
labels       SHA-256
metadata     fletcher4
data         fletcher2 (default), set with checksum parameter
ZIL log      fletcher2, self-checksummed
gang block   SHA-256, self-checksummed

The parent holds the checksum for any entity that is not self-checksummed.

The big question, which is currently unanswered, is: do we see single bit faults in disk-based storage systems? The answer to this question must be known before the effectiveness of a checksum can be evaluated. The overwhelming empirical evidence suggests that fletcher2 catches many storage system corruptions.
-- richard
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes: >>>>> "r" == Ross <myxiplx at googlemail.com> writes:re> The answer to this question must be known before the re> effectiveness of a checksum can be evaluated. ...well...we can use math to know that a checksum is effective. What you are really suggesting we evaluate ``empirically'''' is the degree of INeffectiveness of the broken checksum. r> ZFS stores two copies of the metadata for any block, so r> corrupt metadata really shouldn''t happen often. the other copy probably won''t be read if the first copy read has a valid checksum. I think it''ll more likely just lazy-panic instead. If that''s the case, the two copies won''t help cover up the broken checksum bug. but Richard''s table says metadata has fletcher4 which the OP said is as good as the correct algorithm would have been, even in its broken implementation, so long as it''s only used up to 128kByte. It''s only data and ZIL that has the relevantly-broken checksum, according to his math. re> The overwhelming empirical evidence suggests that fletcher2 re> catches many storage system corruptions. What do you mean by the word ``many''''? It''s a weasel-word. It basically means, AFAICT, ``the broken checksum still trips sometimes.'''' But have you any empirical evidence about the fraction of real world errors which are still caught by the broken checksum vs. those that are not? I don''t see how you could. How about cases where checksums are not used to correct bit-flip gremlins but relied upon to determine whether a data structure is fully present (committed) yet, like in the ZIL, or to determine which half of a mirror is stale---these are cases where checksums could be wrong even if the storage subsystem is functioning in an ideal way. Checksum weakness on ZFS where checksums are presumed good by other parts of the design could potentially be worse overall than a checksumless design. That''s not my impression, but it''s the right place to put the bar. Ray''s ``well at least it''s better than no checksums'''' is wrong because it presumes ZFS could function as well as another filesystem if ZFS were using a hypothetical null checksum. It couldn''t. Anyway I''m glad the problem is both fixed and also avoidable on the broken systems. I just think the doublespeak after the fact is, once again, not helping anyone. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091002/370a1de4/attachment.bin>
Hi Miles, good to hear from you again.

On Oct 2, 2009, at 1:20 PM, Miles Nordin wrote:
>>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:
>>>>>> "r" == Ross <myxiplx at googlemail.com> writes:
>
> re> The answer to this question must be known before the
> re> effectiveness of a checksum can be evaluated.
>
> ...well...we can use math to know that a checksum is effective. What
> you are really suggesting we evaluate ``empirically'' is the degree of
> INeffectiveness of the broken checksum.

By your logic, SECDED ECC for memory is broken because it only corrects 1 bit per symbol and only detects brokenness of 2 bits per symbol. However, the empirical evidence suggests that ECC provides a useful function for many people. Do we know how many triple bit errors occur in memories? I can compute the probability, but have never seen a field failure analysis. So, if ECC is "good enough" for DRAM, is fletcher2 "good enough" for storage? NB, for DRAM the symbol size is usually 64 bits. For the ZFS case, the symbol size is 4,096 to 1,048,576 bits. AFAIK, no collisions have been found in SHA-256 digests for symbols of size 1,048,576, but it has not been proven that they do not exist.

> r> ZFS stores two copies of the metadata for any block, so
> r> corrupt metadata really shouldn't happen often.
>
> the other copy probably won't be read if the first copy read has a
> valid checksum. I think it'll more likely just lazy-panic instead.
> If that's the case, the two copies won't help cover up the broken
> checksum bug. but Richard's table says metadata has fletcher4 which
> the OP said is as good as the correct algorithm would have been, even
> in its broken implementation, so long as it's only used up to
> 128kByte. It's only data and ZIL that has the relevantly-broken
> checksum, according to his math.
>
> re> The overwhelming empirical evidence suggests that fletcher2
> re> catches many storage system corruptions.
>
> What do you mean by the word ``many''? It's a weasel-word.

I'll blame the lawyers. They are causing me to remove certain words from my vocabulary :-(

> It basically means, AFAICT, ``the broken checksum still trips
> sometimes.'' But have you any empirical evidence about the fraction
> of real world errors which are still caught by the broken checksum
> vs. those that are not? I don't see how you could.

Question for the zfs-discuss participants: have you seen a data corruption that was not detected when using fletcher2? Personally, I've seen many corruptions of data stored on file systems lacking checksums.

> How about cases where checksums are not used to correct bit-flip
> gremlins but relied upon to determine whether a data structure is
> fully present (committed) yet, like in the ZIL, or to determine which
> half of a mirror is stale---these are cases where checksums could be
> wrong even if the storage subsystem is functioning in an ideal way.
>
> Checksum weakness on ZFS where checksums are presumed good by other
> parts of the design could potentially be worse overall than a
> checksumless design. That's not my impression, but it's the right
> place to put the bar. Ray's ``well at least it's better than no
> checksums'' is wrong because it presumes ZFS could function as well as
> another filesystem if ZFS were using a hypothetical null checksum. It
> couldn't.

I'm in Ray's camp. I've got far too many scars from data corruption and I'd rather not add more.
-- richard

> Anyway I'm glad the problem is both fixed and also avoidable on the
> broken systems. I just think the doublespeak after the fact is, once
> again, not helping anyone.
Replying to hakanson's Oct 2, 2009 2:01 post:

Thanks. I suppose it is true that I am not even trying to compare the peripheral stuff, and the simple presence of a file plus matching data covers some of it. Using it for moving data, one encounters a longer list: sparse files, ACL handling, extended attributes, length of filenames, length of pathnames, large files, and probably other "interesting" things that can be handled incorrectly. Most information on misbehavior of the various archive / backup / data movement utilities is very old; one wonders how they behave today. This would be a useful compilation, but I can't do it. -- This message posted from opensolaris.org
Cindy's Oct 2, 2009 2:59 post - thanks for staying with me.

Re: "The checksums are set on the file systems not the pool.": But previous responses seem to indicate that I can set them, for files stored in the file system that appears to be the pool, at the pool level, before I create any new ones. One post seems to indicate that there is a checksum property for this file system, and independently for the pool. (This topic needs a picture.)

Re: "If a new checksum is set and *you* rewrite the data ... then the duplicated data will have the new checksum." Understood. Now I am on to being concerned about the blocks that comprise the zpool that *contains* the file system.

Re: "ZFS doesn't rewrite data as part of normal operations. I confirmed with a simple test (like Darren's) that even if you have a single-disk pool and the disk is replaced and all the data is resilvered and a new checksum is set, you'll see data with the previous checksum and the new checksum." Yes, ... a resilver duplicates exactly.

Darren's example showed that without the -R, no properties are sent and the zfs receive has no choice but to use the pool default for the ZFS file system that it creates. This also implies that there is a property associated with the pool. So my previous comment about zfs send/receive not duplicating exactly was not fair. The man page / admin guide should be clear as to what is sent without -R. I would have guessed everything, just not descendent file systems.

It is a shame that zdb is totally undocumented. I thought I had discovered a gold mine when I first read Darren's note! --Ray -- This message posted from opensolaris.org
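[Darren's earlier zdb invocation can be reused to spot-check which checksum newly written blocks carry; zdb is undocumented and its options vary by release, so treat this as a sketch:]

zdb -vvv -S user:all zfs01

[Each reported block line includes the checksum algorithm (fletcher2, fletcher4, or SHA256) it was written with.]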
Re: relling's Oct 2, 2009 3:26 post:

(1) Is this list everything?

(2) Is this the same for U4?

(3) If I change the zpool checksum property on creation as you indicated in your Oct 1, 12:51 post (evidently very recent versions only), does this change the checksums used for this list? Why would not the strongest checksum be used for the most fundamental data, rather than fooling around, allowing the user to compromise only where the tradeoff pays back on the 99% bulk of the data?

Re: "The big question, that is currently unanswered, is do we see single bit faults in disk-based storage systems?"

I don't think this is the question. I believe the implication of schlie's post is not that single bit faults will get through, but that the current fletcher2 is equivalent to a single bit checksum. You could have 1,000 bits in error, or 4095, and still have only a 50-50 chance of detecting it. A single bit error would be certain to be detected (I think) even with the current code. -- This message posted from opensolaris.org
Re: Miles Nordin Oct 2, 2009 4:20:

Re: "Anyway, I'm glad the problem is both fixed..." I want to know HOW it can be fixed. If they fixed it, this will invalidate every pool that has not been changed from the default (probably almost all of them!). This can't be! So what WAS done? In the interest of honesty in advertising and enabling people to evaluate their own risks, I think we should know how it was fixed. Something either ingenious or potentially misleading must have been done. I am not suggesting that it was not the best way to handle a difficult situation, but I don't see how it can be transparent. If the string "fletcher2" does the same thing, it is not fixed. If it does something different, it is misleading.

"... and avoidable on the broken systems." Please tell me how! Without destroying and recreating my zpool, I can only fix the ZFS file system blocks, not the underlying zpool blocks. WITH destroying and recreating my zpool, I can only control the checksum on the underlying zpool using a version of Solaris that is not yet available. And then (pending relling's response) it may or may not *still* affect the blocks I am concerned about. So how is this avoidable? It is partially avoidable (so far) IF I have the luxury of doing significant rebuilding. No? -- This message posted from opensolaris.org
On Oct 2, 2009, at 3:05 PM, Ray Clark wrote:
> Re: relling's Oct 2, 2009 3:26 post:
>
> (1) Is this list everything?

AFAIK.

> (2) Is this the same for U4?

Yes. This hasn't changed in a very long time.

> (3) If I change the zpool checksum property on creation as you
> indicated in your Oct 1, 12:51 post (evidently very recent versions
> only), does this change the checksums used for this list? Why would
> not the strongest checksum be used for the most fundamental data
> rather than fool around, allowing the user to compromise only when
> the tradeoff pays back on the 99% bulk of the data?

Performance. Many people value performance over dependability.

> Re: "The big question, that is currently unanswered, is do we see
> single bit faults in disk-based storage systems?"
>
> I don't think this is the question. I believe the implication of
> schlie's post is not that single bit faults will get through, but
> that the current fletcher2 is equivalent to a single bit checksum.
> You could have 1,000 bits in error, or 4095, and still have a 50-50
> chance of detecting it. A single bit error would be certain to be
> detected (I think) even with the current code.

I don't believe schlie posted the number of fletcher2 collisions for the symbol size used by ZFS. I do not believe it will be anywhere near 50% collisions.
-- richard
Re: relling's Oct 2 5:06 post:

Re: the analogy to ECC memory... I appreciate the support, but the ECC memory analogy does not hold water. ECC memory is designed to correct for multiple independent events, such as electrical noise, bits flipped due to alpha particles from the DRAM package, or cosmic rays. The probability of these independent events coinciding in time and space is very small indeed. It works well.

ZFS does purport to cover errors such as these on the crummy double-layer boards without sufficient decoupling, microcontrollers and memories without parity or ECC, etc., found in the cost-reduced-to-the-razor's-edge hardware most of us run on, but it also covers system-level errors such as entire blocks being replaced, or large fractions of them being corrupted by high-level bugs. With the current fletcher2 we have only a 50-50 chance of catching these multi-bit errors. The probability of multiple bits being changed is not small, because the probabilities of the error mechanism affecting the 4096~1048576 bits in the block are not independent. Indeed, in many of the show-cased mechanisms, it is a sure bet - the entire disk sector is written with the wrong data, for sure! Although there is a good chance that many of the bits in the sector happen to match, there is an excellent chance that many are different. And the mechanisms that caused these differences were not independent.

Re: "AFAIK, no collisions have been found in SHA-256 digests for symbols of size 1,048,576, but it has not been proven that they do not exist": for sure they exist - there are vastly more possible 1,048,576-bit blocks than 256-bit digests, so an enormous number of blocks must map to each SHA256 digest. One hopes that the same properties that make SHA256 a good cryptographic hash also make it a good hash, period. This, I admit, is a leap of ignorance (at least I know what cliff I am leaping off of).

Regarding the question of what people have seen: I have seen lots of unexplained things happen, and by definition one never knows why. I am not interested in seeing any more. I see the potential for disaster, and my time, and the time of my group, is better spent doing other things. That is why I moved to ZFS. -- This message posted from opensolaris.org
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:re> By your logic, SECDED ECC for memory is broken because it only re> corrects ECC is not a checksum. Go ahead, get out your dictionary, enter severe-pedantry-mode. but it is relevantly different. In for example data transmission scenarios, FEC''s like ECC are often used along with a strong noncorrecting checksum over a larger block. The OP further described scenarios plausible for storage, like ``long string of zeroes with 1 bit flipped'''', that produce collisions with the misimplemented fletcher2 (but, obviously, not with any strong checksum like correct-fletcher2). re> is fletcher2 "good enough" for storage? yes, it probably is good enough, but ZFS implements some other broken algorithm and calls it fletcher2. so, please stop saying fletcher2. re> I''ll blame the lawyers. They are causing me to remove certain re> words from my vocabulary :-( yeah, well, allow me to add a word back to the vocabulary: BROKEN. If you are not legally allowed to use words like broken and working, then find another identity from which to talk, please. re> Question for the zfs-discuss participants, have you seen a re> data corruption that was not detected when using fletcher2? This is ridiculous. It''s not fletcher2, it''s brokenfletcher2. It''s avoidably extremely weak. It''s reasonable to want to use a real checksum, and this PR game you are playing is frustrating and confidence-harming for people who want that. This does not have to become a big deal, unless you try to spin it with a 7200rpm PR machine like IBM did with their broken Deathstar drives before they became HGST. Please, what we need to do is admit that the checksum is relevantly broken in a way that compromises the integrity guarantees with which ZFS was sold to many customers, fix the checksum, and learn how to conveniently migrate our data. Based on the table you posted, I guess file data can be set to fletcher4 or sha256 using filesystem properties to work around the bug on Solaris versions with the broken implementation. 1. What''s needed to avoid fletcher2 on the ZIL on broken Solaris versions? 2. I understand the workaround, but not the fix. How does the fix included S10u8 and snv_114 work? Is there a ZFS version bump? Does the fix work by implementing fletcher2 correctly? or does it just disable fletcher2 and force everything to use brokenfletcher4 which is good enough? If the former, how are the broken and correct versions of fletcher2 distinguished---do they show up with different names in the pool properties? Once you have the fixed software, how do you make sure fixed checksums are actually covering data blocks originally written by old broken software? I assume you have to use rsync or zfs send/recv to rewrite all the data with the new checksum? If yes, what do you have to do before rewriting---upgrade solaris and then ''zfs upgrade'' each filesystem one by one? Will zfs send/recv work across the filesystem versions, or does the copying have to be done with rsync? 3. speaking of which, what about the checksum in zfs send streams? is it also fletcher2, and if so was it also fixed in s10u8/snv_114, and how does this affect compatibility for people who have ignored my advice and stored streams instead of zpools? Will a newer ''zfs recv'' always work with an older ''zfs send'' but not the other way around? there is basically no informaiton about implementing the fix in the bug, and we can''t write to the bug from outside Sun. 
Whatever sysadmins need to do to get their data under the strength of checksum they thought it was under, it might be nice to describe it in the bug, for whoever gets referred to the bug and has an affected version.
Let me try to refocus. Given that I have a U4 system with a zpool created with fletcher2:

What blocks in the system are protected by fletcher2, or even fletcher4 (although that does not worry me so much)?

Given that I only have 1.6TB of data in a 4TB pool, what can I do to change those blocks to sha256 or fletcher4:

(1) Without destroying and recreating the zpool under U4?
(2) With destroying and recreating the zpool under U4 (which I don't really have the resources to pull off)?
(3) With upgrading to U7 (perhaps in a few months)?
(4) With upgrading to U8?

Thanks. -- This message posted from opensolaris.org
On Oct 2, 2009, at 3:44 PM, Ray Clark wrote:
> Let me try to refocus:
>
> Given that I have a U4 system with a zpool created with fletcher2:
>
> What blocks in the system are protected by fletcher2, or even
> fletcher4, although that does not worry me so much.
>
> Given that I only have 1.6TB of data in a 4TB pool, what can I do to
> change those blocks to sha256 or fletcher4:
>
> (1) Without destroying and recreating the zpool under U4
>
> (2) With destroying and recreating the zpool under U4 (which I don't
> really have the resources to pull off)
>
> (3) With upgrading to U7 (perhaps in a few months)
>
> (4) With upgrading to U8

This has been answered several times in this thread already.

set checksum=sha256 filesystem
copy your files -- all newly written data will have the sha256 checksums.
-- richard
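[A concrete version of those two steps, reusing commands already shown earlier in the thread and the pool/file system names zfs01 and zfs01/home; the target name home.new is a placeholder:]

zfs set checksum=sha256 zfs01
zfs snapshot zfs01/home@convert
zfs send zfs01/home@convert | zfs receive zfs01/home.new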
On Oct 2, 2009, at 3:36 PM, Miles Nordin wrote:
>>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:
>
> re> By your logic, SECDED ECC for memory is broken because it only
> re> corrects
>
> ECC is not a checksum.

SHA-256 is not a checksum, either, but that isn't the point. The concern is that corruption can be detected. ECC has very, very limited detection capabilities, yet it is "good enough" for many people. We know that MOS memories have certain failure modes that cause bit flips, and by using ECC and interleaving, the dependability is improved. The big question is, what does the corrupted data look like in storage? Random bit flips? Big chunks of zeros? 55aa patterns? Since the concern with the broken fletcher2 is restricted to the most significant bits, we are most concerned with failures where the most significant bits are set to ones. But as I said, we have no real idea what the corrupted data should look like, and if it is zero-filled, then fletcher2 will catch it.

> Go ahead, get out your dictionary, enter severe-pedantry-mode. but it
> is relevantly different. In for example data transmission scenarios,
> FECs like ECC are often used along with a strong noncorrecting
> checksum over a larger block.
>
> The OP further described scenarios plausible for storage, like ``long
> string of zeroes with 1 bit flipped'', that produce collisions with
> the misimplemented fletcher2 (but, obviously, not with any strong
> checksum like correct-fletcher2).
>
> re> is fletcher2 "good enough" for storage?
>
> yes, it probably is good enough, but ZFS implements some other broken
> algorithm and calls it fletcher2. so, please stop saying fletcher2.

If I was to refer to Fletcher's algorithm, I would use Fletcher. When I am referring to the ZFS checksum setting of "fletcher2", I will continue to use "fletcher2".

> re> I'll blame the lawyers. They are causing me to remove certain
> re> words from my vocabulary :-(
>
> yeah, well, allow me to add a word back to the vocabulary: BROKEN.
>
> If you are not legally allowed to use words like broken and working,
> then find another identity from which to talk, please.
>
> re> Question for the zfs-discuss participants, have you seen a
> re> data corruption that was not detected when using fletcher2?
>
> This is ridiculous. It's not fletcher2, it's brokenfletcher2. It's
> avoidably extremely weak. It's reasonable to want to use a real
> checksum, and this PR game you are playing is frustrating and
> confidence-harming for people who want that.

There is no PR campaign. It is what it is. What is done is done.

> This does not have to become a big deal, unless you try to spin it
> with a 7200rpm PR machine like IBM did with their broken Deathstar
> drives before they became HGST.
>
> Please, what we need to do is admit that the checksum is relevantly
> broken in a way that compromises the integrity guarantees with which
> ZFS was sold to many customers, fix the checksum, and learn how to
> conveniently migrate our data.

Unfortunately, there is a backwards compatibility issue that requires the current fletcher2 to live for a very long time. The only question for debate is whether it should be the default. To date, I see no field data that suggests it is not detecting corruption.

> Based on the table you posted, I guess file data can be set to
> fletcher4 or sha256 using filesystem properties to work around the
> bug on Solaris versions with the broken implementation.
>
> 1. What's needed to avoid fletcher2 on the ZIL on broken Solaris
>    versions?

Please file RFEs at bugs.opensolaris.org

> 2. I understand the workaround, but not the fix.
>
>    How does the fix included in S10u8 and snv_114 work? Is there a ZFS
>    version bump? Does the fix work by implementing fletcher2
>    correctly? or does it just disable fletcher2 and force everything
>    to use brokenfletcher4 which is good enough? If the former, how
>    are the broken and correct versions of fletcher2
>    distinguished---do they show up with different names in the pool
>    properties?

The best I can tell, the comments are changed to indicate fletcher2 is deprecated. However, it must live on (forever) because of backwards compatibility. I presume one day the default will change to fletcher4 or something else. This is implied by zfs(1m):

     checksum=on | off | fletcher2 | fletcher4 | sha256

         Controls the checksum used to verify data integrity. The
         default value is on, which automatically selects an
         appropriate algorithm (currently, fletcher2, but this may
         change in future releases). The value off disables integrity
         checking on user data. Disabling checksums is NOT a
         recommended practice.

>    Once you have the fixed software, how do you make sure fixed
>    checksums are actually covering data blocks originally written by
>    old broken software? I assume you have to use rsync or zfs
>    send/recv to rewrite all the data with the new checksum? If yes,
>    what do you have to do before rewriting---upgrade solaris and then
>    'zfs upgrade' each filesystem one by one? Will zfs send/recv work
>    across the filesystem versions, or does the copying have to be
>    done with rsync?

I believe such a requirement would have a half-life of less than a nanosecond.

> 3. speaking of which, what about the checksum in zfs send streams?
>    is it also fletcher2, and if so was it also fixed in
>    s10u8/snv_114, and how does this affect compatibility for people
>    who have ignored my advice and stored streams instead of zpools?
>    Will a newer 'zfs recv' always work with an older 'zfs send' but
>    not the other way around?

fletcher4. Thanks for reminding me... I'll update my slides :-)

> there is basically no information about implementing the fix in the
> bug, and we can't write to the bug from outside Sun. Whatever
> sysadmins need to do to get their data under the strength of checksum
> they thought it was under, it might be nice to describe it in the bug
> for whoever gets referred to the bug and has an affected version.

UTSL

Bottom line: the checksum match does not guarantee correctness, but a checksum mismatch does indicate differences. In general, this is how checksums work, no?
-- richard
Richard, with respect to:

"This has been answered several times in this thread already.
set checksum=sha256 filesystem
copy your files -- all newly written data will have the sha256 checksums."

I understand that. I understood it before the thread started. I did not ask this. It is a fact that there is no feature to convert checksums as part of a resilver or some such. I started by asking what utility to use, but quickly zeroed in on zfs send/receive as being the native and presumably best method, but had questions as to how to get the property set correctly when the target file system is automatically created, etc. Note that my focus in recent portions of the thread has changed to the underlying zpool.

Simply changing checksum=sha256 and copying my data is analogous to hanging my data from a hierarchy of 0.256" welded steel chain, with the top of the hierarchy hanging it all from a 0.001" steel thread. Well, that is not quite fair, because there are probabilities involved. Someone is going to pick a link randomly and go after it with a fingernail clipper. If they pick a thick one, I have very little to worry about, to say the least. If they pick one of the few dozen? hundred? thousand? (I don't know how many) that contain the structure and services of the underlying zpool, then the nail clipper will not be stopped by the 0.001" thread. I do have 8,000,000,000 links in the chain, and only a very small fraction are 0.001" thick, and that is strongly in my favor, but I would expect the heads to also spend a disproportionate amount of time over the intent log. It is hard to know how it comes out. I just don't want 0.001" steel threads protecting my data from the gremlins. I moved to ZFS to avoid gambles. If I wanted gambles I would use Linux raid and lvm2. They work well enough if there are no errors.

I should have enumerated the knowns and unknowns in my list last night; then I would not have annoyed you with my apparent deafness. (Hopefully I am not still being deaf.) I will clarify below, as I should have last night:

Given that I only have 1.6TB of data in a 4TB pool, what can I do to change those blocks to sha256 or fletcher4:

(1) Without destroying and recreating the zpool under U4

I know how to fix the user data (just change the checksum property on the pool, using zfs and specifying the pool vs. a ZFS file system, then copy the data). I don't know (am ignorant of) the blocks comprising the underlying zpool, and how to fix them without recreating the pool. It makes sense to me that at least some would be rewritten in the course of using the system, but (1) I have had no confirmation or denial that this is the case, (2) I don't know if this is all of them or some of them, (3) I don't know if the checksum= parameter would affect these (relling's Oct 2 at 3:26 post implies that it does not, by lack of reference to the checksum property). So I don't know yet how much exposure will remain. I would think that if the user specified a stronger checksum for their data, the system would abandon its use of weaker ones in the underlying structure, but Richard's list seems to imply otherwise.

(2) With destroying and recreating the zpool under U4 (which I don't really have the resources to pull off)

Due to some of the non-technical factors in the situation, I cannot actually execute an experimental valid zpool command, but "zpool create -o garbage" gives me a usage that does not include any -o or -O. So it appears that under U4 I cannot do this.
I wish there were someone who could confirm that I can or cannot do this before I arrange for, and propose, that we dive into this massive undertaking. Also, from Richard's Oct 2 3:26 note, I infer that this will not change the checksum used by the underlying zpool anyway, so this might be fruitless. But I am inferring... Richard gave a quick list; he was not aiming to provide every level of precise detail, so I really don't know. Many of the answers I have received have turned out to recommend features that are not available in U4 but in later versions, even unreleased versions. I have no way of sorting this out without the information being qualified with a version.

(3) With upgrading to U7 (perhaps in a few months)

Not clear what this will support on zpool, or if it would be effective (similar to U4 above).

(4) With upgrading to U8

Not sure when it will come out, what it will support, or if it will be effective (similar to U7 and U4 above).

So I can enable robust protection on my user data, but perhaps not the infrastructure needed to get at that user data, and perhaps not the intent log. The answer may be that I cannot be helped. That is not the desired answer, but if that is the case, so be it. Let's lay out the facts and the best way to move on from here, for me and everybody else. Why leave us thrashing in the dark? Am I a Mac user? I personally still believe ZFS is the way to go - in the short term because it is still a vastly better gamble, and in the long term because this too will pass as file systems are rebuilt. I would question why anything but the best would be used for the underlying zpool, and why there is absolutely zero presentation of the tradeoffs between the three algorithms in the admin guide, but that is another story.

I know that someone out there, and probably people reading this thread, knows the answers to these questions. I hate to stop without that simple knowledge being communicated. I do greatly appreciate the attempts to work with me; I don't understand how I could be clearer. -- This message posted from opensolaris.org
On Fri, 2 Oct 2009, Ray Clark wrote:

> With the current fletcher2 we have only a 50-50 chance of catching
> these multi-bit errors. Probability of multiple bits being changed
> is not

What is the current fletcher2? A while back I seem to recall reading a discussion in the zfs-code forum about how the original zfs fletcher2 was found to be unexpectedly weak and broken, so they updated the "fletcher2" algorithm and assigned it a new enumeration so that fresh blocks use the corrected algorithm. I could be just imagining all of this, but that is what I remember today. Since you are using Solaris 10 U4, maybe you are using the dinosaur version of fletcher2?

Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Oct 3, 2009, at 7:46 AM, Ray Clark wrote:

> Richard, with respect to: "This has been answered several times in this
> thread already. set checksum=sha256 filesystem, copy your files -- all
> newly written data will have the sha256 checksums."
>
> I understand that. I understood it before the thread started. I did not
> ask this. It is a fact that there is no feature to convert checksums as
> a part of resilver or some such.

There is no such feature. There is a long-awaited RFE for block pointer rewriting (checksums are stored in the block pointer) which would add the underlying capabilities for this.

> I started by asking what utility to use, but quickly zeroed in on zfs
> send/receive as being the native and presumably best method, but had
> questions as to how to get the property set when the file system was
> automatically created, etc.

Say for example I have a pool called zwimming with some stuff in it and checksum=sha256 set on it. To create a copy of the data using send/recv in the same pool, written with sha256 checksums, do:

  zfs snapshot zwimming@now
  zfs send zwimming@now | zfs receive zwimming/new

You will now have a new file system called "zwimming/new" with the same data as zwimming, but with checksum=sha256. If you then want to get back to the original directory structure you can set the mountpoint properties, as desired. There are dozens of other ways to accomplish the copy.

> Note that my focus in recent portions of the thread has changed to the
> underlying zpool.
>
> Simply changing checksum=sha256 and copying my data is analogous to
> hanging my data from a hierarchy of 0.256" welded steel chain, with the
> top of the hierarchy hanging it all from a 0.001" steel thread. [...]
> I just don't want any 0.001" steel threads protecting my data from the
> gremlins. I moved to ZFS to avoid gambles. If I wanted gambles I would
> use Linux raid and lvm2. They work well enough if there are no errors.

I think you are missing the concept of pools. Pools contain datasets. One form of dataset is a file system. Pools do not contain data per se; datasets contain data. Reviewing the checksums used with this hierarchy in mind:

  Pool
    Label [SHA-256]
    Uberblock [SHA-256]
    Metadata [fletcher4]
    Gang block [SHA-256]
    ZIL log [fletcher2]
  Dataset (file system or volume)
    Metadata [fletcher4]
    Data [fletcher2 (default, today), fletcher4, or SHA-256]
  Send stream [fletcher4]

With this in mind, I don't understand your steel analogy.

wrt the ZIL: it is rarely used for normal file system access. ZIL blocks are allocated from the pool as needed and freed no more than 30 seconds later, unless there is a sudden halt. If the system is halted, then the ZIL is used to roll forward transactions. The heads do not "spend a disproportionate amount of time over the intent log."
-- richard
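To make the "get back to the original directory structure" remark above a little more concrete: since zwimming/new cannot be renamed over the pool's root dataset, swapping mountpoints is probably the simplest route. This is only a sketch, and the paths are illustrative assumptions:

  # Sketch; assumes zwimming was mounted at /zwimming (the default) -- adjust to taste.
  zfs set mountpoint=none zwimming            # take the original file system out of the namespace
  zfs set mountpoint=/zwimming zwimming/new   # present the sha256 copy at the old location
  zfs get -r checksum,mountpoint zwimming     # sanity-check what ended up where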
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:re> If I was to refer to Fletcher''s algorithm, I would use re> Fletcher. When I am referring to the ZFS checksum setting of re> "fletcher2" I will continue to use "fletcher2" haha okay, so to clarify, when reading a Richard Elling post: fletcher2 = ZFS''s broken attempt to implement a 32-bit Fletcher checksum Fletcher = hypothetical correct implementation of a Fletcher checksum In that case, for clarity I think I''ll have to use the word ``broken'''' a lot more often. >> How does the fix included S10u8 and snv_114 work? re> The best I can tell, the comments are changed to indicate re> fletcher2 is deprecated. You are saying the ``fix'''' was a change in documentation, nothing else? The default is still fletcher2, and there is no correct implementation of the Fletcher checksum only the good-enough-but-broken fletcher4, which is not the default? Also, there is no way to use a non-broken checksum on the ZIL? doesn''t sound fixed to me. At least there is some transparency, though, and a partial workaround. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091003/ba1888cd/attachment.bin>
On Sat, 3 Oct 2009, Miles Nordin wrote:

> re> The best I can tell, the comments are changed to indicate
> re> fletcher2 is deprecated.
>
> You are saying the "fix" was a change in documentation, nothing
> else? The default is still fletcher2, and there is no correct
> implementation of the Fletcher checksum, only the
> good-enough-but-broken fletcher4, which is not the default?

It seems that my memory is kind of crappy (like fletcher2). There were discussions of the fletcher2 issue on the zfs-code list starting in March and ending in May:

http://mail.opensolaris.org/pipermail/zfs-code/2009-March/thread.html
http://mail.opensolaris.org/pipermail/zfs-code/2009-April/thread.html
http://mail.opensolaris.org/pipermail/zfs-code/2009-May/thread.html

Unless someone has a legal requirement to prove data integrity, the fletcher2 woes do not seem like something most people need to worry much about. After all, before zfs, this level of validation did not exist at all. Fletcher2 will still catch most instances of data corruption. One thing I did learn from this discussion is that when accessing uncached memory, the performance of fletcher2 and fletcher4 is roughly equivalent, so there is usually no penalty for enabling fletcher4. It does seem like there could be some CPU impact for synchronous writes with fletcher4, since it is more likely that the data is in cache for a synchronous write.

Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
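If that cost trade-off holds, switching a pool over to fletcher4 for newly written data is a one-liner. A sketch, with "tank" standing in for whatever the pool is actually called; as discussed throughout this thread, existing blocks keep whatever checksum they were written with:

  # Sketch; the property only affects blocks written after the change.
  zfs set checksum=fletcher4 tank     # set at the pool's top-level dataset so descendants inherit it
  zfs get -r checksum tank            # see which datasets now show fletcher4 versus a local override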
On Oct 3, 2009, at 12:22 PM, Miles Nordin wrote:

> You are saying the "fix" was a change in documentation, nothing
> else? The default is still fletcher2, and there is no correct
> implementation of the Fletcher checksum, only the
> good-enough-but-broken fletcher4, which is not the default?
>
> Also, there is no way to use a non-broken checksum on the ZIL?

The ZIL is a slightly different beast. If there is a checksum mismatch while processing the log, it signals the effective end of the log. This is why log entries are self-checksummed. In other words, if you reach garbage, then you've reached the end of the log. The probability of the garbage having both a valid fletcher2 checksum at the proper offset and having the proper sequence number and having the right log chain link and having the right block size is considerably lower than the weakness of fletcher2. Unfortunately, the ZIL is also latency sensitive, so the performance case gets stronger while the additional error checking already boosts the dependability case. -- richard
With respect to relling's Oct 3 2009 7:46 AM post:

> I think you are missing the concept of pools. Pools contain datasets.
> One form of dataset is a file system. Pools do not contain data per se;
> datasets contain data. Reviewing the checksums used with this
> hierarchy in mind:
>
>   Pool
>     Label [SHA-256]
>     Uberblock [SHA-256]
>     Metadata [fletcher4]
>     Gang block [SHA-256]
>     ZIL log [fletcher2]
>   Dataset (file system or volume)
>     Metadata [fletcher4]
>     Data [fletcher2 (default, today), fletcher4, or SHA-256]
>   Send stream [fletcher4]
>
> With this in mind, I don't understand your steel analogy.

I am assuming, based on the context of your presentation, that the above list of "pool stuff" is exhaustive -- that this is everything not in a dataset. My "steel analogy" is based on the assumption that the pool-level stuff you list above is needed to gain access to the dataset. If the dataset can be accessed with all of the pool stuff trashed, then my steel thread does not exist. But that would also mean the pool stuff is extraneous, so I doubt that this is the case.

Given that all of the pool stuff is either sha256 or fletcher4 except for the ZIL, I have a new understanding which suggests (though I don't understand the details of the system) that I am not depending on fletcher2-protected data, and my steel thread is actually pretty thick, not 0.001". Based on your comments regarding the ZIL, I am inferring that stuff is written there and never used, except for a restart after a messy shutdown. I might be exposed to whatever weakness fletcher2 has as implemented, but only in these rare circumstances. Normal transactions and data would not be impacted by corruption in the ZIL blocks, since those blocks would never be read. So a large layer of probability protects me: I would have to have a crash coinciding with corruption in the ZIL that hits on a fletcher2 weakness.

Based on all of this I believe I am relatively happy simply copying my data, not recreating my zpool. As Darren Moffat taught me, I can "zfs set checksum=sha256 zfs01", where zfs01 is the zpool, then "zfs send zfs01/home@snapshot | zfs receive zfs01/home.new", and the new file system will be all sha256 as long as I don't specify the -R option on the zfs send. All of this is supported in U4; I believe it has to be supported because of the presence of files with properties in the (odd?) zfs file system that exists at the zfs01 zpool level before any zfs file systems are created.

So assuming the above process works, this thread is done as far as I am concerned right now. Thank you all for your help; not to snub anyone, but Darren, Richard, and Cindy especially come to mind. Thanks for sparring with me until we understood each other. --Ray -- This message posted from opensolaris.org
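One way to get some "verify behavior after it is done" confidence: check the property on the received file system, and spot-check what the on-disk block pointers actually carry with zdb. This is only a sketch -- the dataset name and object number are illustrative, and the exact zdb output format varies between releases:

  # Confirm the property the new file system will use for future writes:
  zfs get checksum zfs01/home.new

  # Spot-check blocks already written: dump a file's block pointers and look
  # for the checksum name (e.g. "sha256" rather than "fletcher2") in the output.
  zdb -dddd zfs01/home.new            # list objects to find a file's object number
  zdb -ddddd zfs01/home.new 12345     # 12345 is a hypothetical object number; with enough -d's,
                                      # block pointers and their checksum algorithm are printed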
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:re> The probability of the garbage having both a valid fletcher2 re> checksum at the proper offset and having the proper sequence re> number and having the right log chain link and having the re> right block size is considerably lower than the weakness of re> fletcher2. I''m having trouble parsing this. I think you''re confusing a few different failure modes: * ZIL entry is written, but corrupted by the storage, so that, for example, an entry should be read from the mirrored ZIL instead. + broken fletcher2 detects the storage corruption CASE A: Good! + broken fletcher2 misses the error, so that corrupted data is replayed from ZIL into the proper pool, possibly adding a stronger checksum to the corrupt data while writing it. CASE B: Bad! + broken fletcher2 misinterprets storage corruption as signalling the end of the ZIL, and any data in the ZIL after the corrupt entry is truncated without even attempting to read the mirror. (does this happen?) CASE C: Bad! * ZIL entry is intentional garbage, either a partially-written entry or an old entry, and should be treated as the end of the ZIL + broken fletcher2 identifies the partially written entry by a checksum mismatch, or the sequence number identifies it as old CASE D: Good! + broken fletcher2 misidentifies a partially-written entry as complete because of a hash collision CASE E: Bad! + (hypothetical, only applies to non-existent fixed system) working fletcher2 or broken-good-enough fletcher4 misidentifies a partially-written entry as complete because of a hash collision CASE F: Bad! If I read your sentence carefully and try to match it with this chart, it seems like you''re saying P(CASE F) << P(CASE E), which seems like an argument for fixing the checksum. While you don''t say so, I presume from your other posts you''re trying to make a case for doing nothing, so I''m confused. I was mostly thinking about CASE B though. It seems like the special way the ZIL works has nothing to do with CASE B: if you send data through the ZIL to a sha256 pool, it can be written to ZIL under broken-fletcher2, corrupted by the storage, and then read in and played back corrupt but covered with a sha256 checksum to the pool proper. AFAICT your relative-probability sentence has nothing to do with CASE B. re> Unfortunately, the ZIL is also latency sensitive, so the re> performance case gets stronger The performance case advocating what? not fixing the broken checksum? re> while the additional error checking already boosts the re> dependability case. what additional error checking? Isn''t the whole specialness of the ZIL that the checksum is needed in normal operation, absent storage subsystem corruption, as I originally said? It seems like the checksum''s strength is more important here, not less. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091004/2057fad4/attachment.bin>
On Oct 4, 2009, at 11:51 AM, Miles Nordin wrote:

> I'm having trouble parsing this. I think you're confusing a few
> different failure modes:
>
> * ZIL entry is written, but corrupted by the storage, so that, for
>   example, an entry should be read from the mirrored ZIL instead.

This is attempted, if you have a mirrored slog.

> + broken fletcher2 detects the storage corruption
>   CASE A: Good!
>
> + broken fletcher2 misses the error, so that corrupted data is
>   replayed from ZIL into the proper pool, possibly adding a
>   stronger checksum to the corrupt data while writing it.
>   CASE B: Bad!
>
> + broken fletcher2 misinterprets storage corruption as signalling
>   the end of the ZIL, and any data in the ZIL after the corrupt
>   entry is truncated without even attempting to read the mirror.
>   (does this happen?)
>   CASE C: Bad!
>
> * ZIL entry is intentional garbage, either a partially-written entry
>   or an old entry, and should be treated as the end of the ZIL
>
> + broken fletcher2 identifies the partially written entry by a
>   checksum mismatch, or the sequence number identifies it as old
>   CASE D: Good!

If the checksum mismatches, you can't go any further because the pointer to the next ZIL log entry cannot be trusted. So the roll forward stops. This is how such logs work -- there is no end-of-log record.

> + broken fletcher2 misidentifies a partially-written entry as
>   complete because of a hash collision
>   CASE E: Bad!
>
> + (hypothetical, only applies to non-existent fixed system) working
>   fletcher2 or broken-good-enough fletcher4 misidentifies a
>   partially-written entry as complete because of a hash collision
>   CASE F: Bad!

As I said before, if the checksum matches, then the data is checked for sequence number = previous + 1, blk_birth == 0, and the size being correct. Since this data lives inside the block, it is unlikely that a collision would also result in a valid block. In other words, ZFS doesn't just trust the checksum for slog entries. -- richard
On Sat, October 3, 2009 17:18, Ray Clark wrote:

> Thank you all for your help, not to snub anyone, but Darren, Richard, and
> Cindy especially come to mind. Thanks for sparring with me until we
> understood each other.

I'd like to echo this (and extend the thanks to include Ray). I'm now starting to feel that I understand this issue, and I didn't for quite a while. And that I understand the risks better, and have a clearer idea of what the possible fixes are. And I didn't before. That I do now is due to Ray's persistence, and to the rest of your patience. Thank you!

-- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
On Mon, Oct 5, 2009 at 10:27 AM, David Dyer-Bennet <dd-b at dd-b.net> wrote:

> On Sat, October 3, 2009 17:18, Ray Clark wrote:
>
>> Thank you all for your help, not to snub anyone, but Darren, Richard, and
>> Cindy especially come to mind. Thanks for sparring with me until we
>> understood each other.
>
> I'd like to echo this (and extend the thanks to include Ray). I'm now
> starting to feel that I understand this issue, and I didn't for quite a
> while. And that I understand the risks better, and have a clearer idea of
> what the possible fixes are. And I didn't before. That I do now is due
> to Ray's persistence, and to the rest of your patience. Thank you!

Excellent, can this thread die now? :P
Question (for Richard E): Is there a write-up on the fix for the broken ZFS fletcher implementation? Is the default checksum for new pool creation changed in U8? Is the default checksum for new pool creation changed in OpenSolaris or SXCE (which versions)? Is there a case open to allow the user to select the checksum to be used when a ZIL is being created?

Interesting thread - and commiserations to the ZFS team on the broken fletcher implementation - we (developers) all have bad days!!

Regards, -- Al Hopper Logical Approach Inc, Plano, TX al at logical-approach.com Voice: 972.379.2133 Timezone: US CDT OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007 http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:re> As I said before, if the checksum matches, then the data is re> checked for sequence number = previous + 1, the blk_birth = re> 0, and the size is correct. Since this data lives inside the re> block, it is unlikely that a collision would also result in a re> valid block. That''s just a description of how the zil works, not an additional layer of protection for user data in the ZIL beyond the checksum. The point of all this is to avoid needing to write a synchronous commit sector to mark the block valid. Instead, the block becomes valid once it''s entirely written. Yes, the checksum has an additional, critical, use in the ZIL compared to its use in the bulk pool, but checking these header fields for sanity does nothing to mitigate broken fletcher2''s weakness in detecting corruption of the user data stored inside the zil records. It''s completely orthogonal. If anything, the additional use of broken fletcher2 in the ZIL is a reason it''s even more important to fix the checksum in the ZIL: checksum mismatches occur in the ZIL even during normal operation, even when the storage is not misbehaving, because sometimes blocks are incompletely written. This is the normal case, not the exception, because the ZIL is only read after unclean shutdown. and AIUI you are saying fletcher2 is still the default for bulk pool data, too? even on newly created pools with the latest code? The fix was just to add the word ``deprecated'''' to some documentation somewhere, without actually performing the deprecation? I feel like FreeBSD/NetBSD would probably have left this bug open until it''s fixed. :/ Ubuntu or Gentoo would probably keep closing and reopening it though while people haggled in the comments section. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091005/43c3913f/attachment.bin>
On 05.10.09 23:07, Miles Nordin wrote:

> And AIUI you are saying fletcher2 is still the default for bulk pool
> data, too? Even on newly created pools with the latest code?

Here's essentially the fix:

http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/zio.h?r2=%252Fonnv%252Fonnv-gate%252Fusr%252Fsrc%252Futs%252Fcommon%252Ffs%252Fzfs%252Fsys%252Fzio.h%409454%3A02e1ddcc9be7&r1=%252Fonnv%252Fonnv-gate%252Fusr%252Fsrc%252Futs%252Fcommon%252Ffs%252Fzfs%252Fsys%252Fzio.h%409443%3A2a96d8478e95

It changes the setting checksum=on to mean "fletcher4", so fletcher4 is used by default for all user data and metadata. You can still set it to "fletcher2" explicitly.

victor
>>>>> "bm" == Brandon Mercer <yourcomputerpal at gmail.com> writes:>> I''m now starting to feel that I understand this issue, >> and I didn''t for quite a while. ?And that I understand the >> risks better, and have a clearer idea of what the possible >> fixes are. ?And I didn''t before. haha, yes, I think I can explain it to people when advocating ZFS, but the story goes something like ``ZFS is serious business and pretty useful, but it has some pretty hilarious problems that you wouldn''t expect from some of the blog hype you read. Let me give you a couple examples of things that still aren''t fixed and how the discussion went...'''' bm> Excellent, can this thread die now? :P If no one is going to fix the problem, I guess so. I''m not intending to submit a patch myself---I''ll just use fletcher4 or sha256 for the bulk pool, and cross my fingers for the ZIL. I''m not even sure there is any point in submitting a patch because it sounds like the problem is political, not code. Fixing the math mistake would be trivial for the person who originally wrote broken-fletcher2, but if you break pool compatibility, you widen the discussion about the original broken checksum to include all the ZFS-loving the hype-blogs. I''m just surprised the bug is closed without fixing the problem, and that any ZFS user who didn''t participate in this thread will almost certainly still end up creating pools with broken checksums. That doesn''t seem right at all, especially when ZFS has so many simple convenient paths for eventually fixing the problem. Why not simply change the default for new filesystems to fletcher4? This is backward-compatible. Because of the way opensolaris is livecd-install-then-upgrade, new users will continue getting broken checksums for several months even with this fix until the next livecd comes out, but at least it''s an eventual resolution. As for the ZIL, why not change it to broken-fletcher4 the next time the ZFS ''update'' version is incremented? The ZIL is less urgent to fix on a scale of months because users don''t have to migrate all their data to get the new checksum, so sites won''t be stuck with broken ZIL checksums after upgrading their software like they are for the bulk data in the pool with livecd-then-upgrade. If fletcher4 is some ``performance issue'''' (is it?), then implement nfletcher2 (correct implementation of Fletcher''s checksum) and include it in the new ZFS version as the default. The only argument I can think of for doing nothing is, it''s like mercury in vaccines or broken autoclaves---if you respond, it''s admitting there was a problem in the first place, while until you respond you can balance the effort you spend fuzzing the issue against your liability. However in this case I don''t think anyone needs a 60 minutes exposee. It''s impossible to argue the problem''s imaginary, especially after so much ZFS advocacy was based on drumming up FUD about how naked you supposedly are without these checksums. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091005/825bfdc6/attachment.bin>
>>>>> "vl" == Victor Latushkin <Victor.Latushkin at Sun.COM> writes:vl> It changes setting of checksum=on to mean "fletcher4" oh, good. so it is only the ZIL that''s unfixed? At least that fix could come from a simple upgrade, if it ever gets fixed. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091005/12f139c3/attachment.bin>
Richard, The sub-threads being woven with a couple of other people are very important, though they are not my immediate issue. I really don't think you need us debating with *you* about this - I think you could argue our point also. What we need to get across is a perspective.

I am pretty sure that the current fletcher2 algorithm as implemented does not provide the level of security intended by the guy who originally extrapolated fletcher2 from Fletcher for ZFS. There are no rocks to throw; every single one of us, and every system, has goofed once or twice. The point is that once you experience a few issues of flakey hardware and data corruption, then read the ZFS propaganda, you never want to go back. Yes, everything you said was true. But having seen the vision, we just are not interested in being convinced that it is relatively alright. We have been there. We are interested in a strategy, a roadmap, to move on and get back to the vision.

Just remember that we are *here* primarily because we see ZFS as being many orders of magnitude more reliable, both in terms of not losing data and in terms of telling us when it does. To dilute this capability is to dilute ZFS' differentiation. To not be transparent is to invite uncertainty and distrust.

Perhaps in hindsight, and given the extreme aversion to risk exhibited by your users, the ZFS team might review the checksums used on the zpool-level structures. I certainly would, but I am willing to let the ZFS team, who understand the mathematics, probabilities, and implications better than I do, make this decision. Given that we are very technical customers and don't have our Mac hat on, I also believe it would be appropriate to document some of the rationale. There has certainly been a lot of material explaining other technical issues - why not here?

At any rate, keep on going - we are all behind you 100%. Please give us an open technical solution that we can have 100% confidence in. --Ray -- This message posted from opensolaris.org
On 5-Oct-09, at 3:32 PM, Miles Nordin wrote:

>>>>>> "bm" == Brandon Mercer <yourcomputerpal at gmail.com> writes:
>
>>> I'm now starting to feel that I understand this issue,
>>> and I didn't for quite a while. And that I understand the
>>> risks better, and have a clearer idea of what the possible
>>> fixes are. And I didn't before.
>
> haha, yes, I think I can explain it to people when advocating ZFS, but
> the story goes something like "ZFS is serious business and pretty
> useful, but it has some pretty hilarious problems that you wouldn't
> expect

Let's talk about the "hilarious problems" that a naive RAID stack has, and that most users "don't expect". For a start, no crash-safe behaviour, and no way to self-heal from unexpected mirror desync. Then we could compare always-consistent COW with conventionally fragile metadata needing regular consistency checks...

> from some of the blog hype you read. Let me give you a couple
> examples of things that still aren't fixed

...and can't be fixed, in RAID, or conventional filesystems.

--Toby

> and how the discussion went..."