Hi all, Just subscribed to the list after a debate on our helpdesk led me to the posting about ZFS corruption and the need for a fsck repair tool of some kind... Has there been any update on this? Kind regards, Kevin Walker Coreix Limited DDI: (+44) 0207 183 1725 ext 90 Mobile: (+44) 07960 967818 Fax: (+44) 0208 53 44 111 ********************************************************************* This message is intended solely for the use of the individual or organisation to whom it is addressed. It may contain privileged or confidential information. If you are not the intended recipient, you should not use, copy, alter, or disclose the contents of this message *********************************************************************
ZFS scrub will detect many types of error in your data or the filesystem metadata. If you have sufficient redundancy in your pool and the errors were not due to dropped or misordered writes, then they can often be automatically corrected during the scrub. If ZFS detects an error from which it cannot automatically recover, it will often instantly lock your entire pool to prevent any read or write access, informing you only that you must destroy it and "restore from backups" to get your data back. Your only recourse in such situations is to do exactly that, or enlist the help of Victor Latushkin to attempt to recover your pool using painstaking manual manipulation. Recent putbacks seem to indicate that future releases will provide a mechanism to allow mere mortals to recover from some of the errors caused by dropped writes. cheers, Rob -- This message posted from opensolaris.org
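For anyone new to the thread who wants to see the "online fsck" Rob describes, a scrub is just a couple of commands. A minimal sketch (the pool name "tank" is only an example):

  zpool scrub tank        # walk every block in the pool, verifying it against its checksum and all redundant copies
  zpool status -v tank    # show scrub progress/result and list any files with unrecoverable errors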
This functionality is in the ZFS code now; it will be available to the rest of us in a later build. http://c0t0d0s0.org/archives/6067-PSARC-2009479-zpool-recovery-support.html -- This message posted from opensolaris.org
Also, read this: http://c0t0d0s0.org/archives/6067-PSARC-2009479-zpool-recovery-support.html -- This message posted from opensolaris.org
Joerg just posted a lengthy answer to the fsck question: http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html Good stuff. I see two answers to "nobody complained about lying hardware before ZFS". One: The user has never tried another filesystem that tests for end-to-end data integrity, so ZFS notices more problems, and sooner. Two: If you lost data with another filesystem, you may have overlooked it and blamed the OS or the application, instead of the inexpensive hardware. -- This message posted from opensolaris.org
Kevin Walker wrote:
> Hi all,
>
> Just subscribed to the list after a debate on our helpdesk led me to the posting about ZFS corruption and the need for a fsck repair tool of some kind...
>
> Has there been any update on this?
>

I guess the discussion started after someone read an article on OSNEWS.

The way zfs works is that you basically get an fsck equivalent while using a pool. ZFS verifies checksums for all metadata and user data as it is read. All metadata also uses ditto blocks to provide two or three copies (entirely independent of any pool redundancy), depending on the type of metadata. If a copy is corrupted, a second (or third) copy is used, so correct data is returned and the corrupted block is automatically repaired. The ability to repair a block containing user data depends on whether the pool is configured with redundancy. But even if the pool is non-redundant (let's say a single disk drive), zfs will still detect the corruption and tell you which files are affected, while the metadata will remain correct in most cases (unless the corruption is so large and so spread out that it affects all copies of a block in the pool). You will still be able to read all other files and the other parts of the affected file.

So fsck effectively happens while you are accessing your data, and it is even better than fsck on most other filesystems: thanks to checksumming of all data and metadata, zfs knows exactly when something is wrong and in most cases can fix it on the fly. If you want to scan the entire pool, including all redundant copies, and have anything that doesn't checksum repaired, you can schedule a pool scrub (while your applications are still using the pool!). This forces every block of every copy to be read and its checksum verified; data is corrected where possible and the result is reported to the user. Legacy fsck is not even close.

I think the perceived need for an fsck for ZFS comes partly from a lack of understanding of how ZFS works and partly from some frustrated users who, under very unlikely and rare circumstances, ended up unable to import a pool at all due to data corruption, and therefore unable to access any data, even though the corruption may have affected only a relatively small amount of data. Most other filesystems will let you access most of the data after fsck in such a situation (probably with some data loss), while zfs left the user with no access to data at all. In such a case the problem lies with the zfs uberblock, and the remedy is to revert the pool to its previous uberblock (or an even earlier one). In almost all cases this renders the pool importable, and then the mechanisms described in the first paragraph above kick in. The problem is (was) that the procedure to revert a pool to one of its previous uberblocks was neither documented nor automatic, and was definitely far from sysadmin-friendly. But thanks to some community members (most notably Victor, I think) some users affected by the issue were given a hand and were able to recover most or all of their data. Others were probably assisted by Sun's support service, I guess.

Fortunately a much more user-friendly mechanism has finally been implemented and integrated into Open Solaris build 126, which allows a user to import a pool and force it back to one of the previous versions of its uberblock if necessary. See http://c0t0d0s0.org/archives/6067-PSARC-2009479-zpool-recovery-support.html for more details.
There is another CR (I don't have its number at hand) about implementing delayed re-use of just-freed blocks, which should allow more data to be recovered in a case like the above. Although I'm not sure it has been implemented yet.

IMHO, with the above CR implemented, in most cases ZFS currently provides a *much* better answer to random data corruption than any other filesystem+fsck on the market. Personally I don't blame Sun that implementing the CR took so long, as it mostly affected home users with cheap hardware from BestBuy-like sources, and even then it was relatively rare. So-called enterprise customers were affected even less, and they either had enough expertise or called Sun's support organization to get a pool manually reverted to its previous uberblock. So from Sun's perspective the issue was far from top priority, and resources are limited as usual. Still, IIRC it was some vocal users here complaining about the issue who convinced the ZFS developers to get it expedited... :)

ps. sorry for a chaotic email, but lack of time is my friend as usual :)

-- Robert Milkowski http://milek.blogspot.com
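As a footnote to Robert's point about ditto blocks: the same protection can be extended to user data on a non-redundant pool via the copies property, at the cost of space. A small illustration (the dataset name tank/home is just an example, and copies only applies to data written after the property is set):

  zfs set copies=2 tank/home         # keep two on-disk copies of every user-data block written from now on
  zfs get copies,checksum tank/home  # confirm the setting; checksum shows the algorithm in use
  zfs set checksum=sha256 tank/home  # optional: switch to a stronger checksum, as Robert mentions later in the thread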
Robert Milkowski wrote: [full quote of the previous message trimmed]
> > Fortunately a much more user-friendly mechanism has finally been
> implemented and integrated into Open Solaris build 126, which allows a
> user to import a pool and force it back to one of the previous versions of its
> uberblock if necessary. See
> http://c0t0d0s0.org/archives/6067-PSARC-2009479-zpool-recovery-support.html
> for more details.
>
> There is another CR (I don't have its number at hand) about
> implementing delayed re-use of just-freed blocks, which should allow
> more data to be recovered in a case like the above. Although I'm not
> sure it has been implemented yet.
>
> IMHO, with the above CR implemented, in most cases ZFS currently provides
> a *much* better answer to random data corruption than any other
> filesystem+fsck on the market.
>

The code for the putback of 2009/479 allows reverting to an earlier uberblock AND defers the re-use of blocks for a short time to make this "rewind" safer.

-tim
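For reference, my reading of the 2009/479 putback is that the recovery is exposed through new options on zpool import and zpool clear; something like the following (the pool name is an example, and the exact flags are worth verifying against the zpool man page on build 128 or later):

  zpool import -nF tank    # dry run: report whether a rewind to an earlier txg would succeed and what would be discarded
  zpool import -F tank     # import, rewinding to the last good uberblock/txg if the current one is damaged
  zpool clear -F tank      # the same rewind logic for a pool that is already imported but faulted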
Tim Haley wrote:
> Robert Milkowski wrote:
>>
>> There is another CR (I don't have its number at hand) about
>> implementing delayed re-use of just-freed blocks, which should allow
>> more data to be recovered in a case like the above. Although I'm
>> not sure it has been implemented yet.
>>
>> IMHO, with the above CR implemented, in most cases ZFS currently
>> provides a *much* better answer to random data corruption than any
>> other filesystem+fsck on the market.
>>
> The code for the putback of 2009/479 allows reverting to an earlier
> uberblock AND defers the re-use of blocks for a short time to make
> this "rewind" safer.
>

Excellent! Thank you for the information.

-- Robert Milkowski http://milek.blogspot.com
Does this putback mean that I have to upgrade my zpool, or is it a zfs tool? If I missed upgrading my zpool, am I smoked? -- This message posted from opensolaris.org
Orvar Korvar wrote:
> Does this putback mean that I have to upgrade my zpool, or is it a zfs tool? If I missed upgrading my zpool, am I smoked?

The putback did not bump zpool or zfs versions. You shouldn't have to upgrade your pool.

-tim
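If you want to double-check that nothing needs upgrading, something along these lines should do it (output format varies a little by build, and "tank" is just an example name):

  zpool upgrade            # with no arguments, lists any pools not at the current on-disk version
  zpool get version tank   # show the on-disk version of a specific pool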
>>>>> "csb" == Craig S Bell <cbell at standard.com> writes:csb> Two: If you lost data with another filesystem, you may have csb> overlooked it and blamed the OS or the application, yeah, but with ZFS you often lose the whole pool in certain classes of repeatable real-world failures, like hotswap disks with flakey power or SAN''s without NVRAM where the target reboots and the initiator does not. Losing the whole pool is relevantly different to corrupting the insides of a few files. Yes, I know, the red-eyed screaming ZFS rats will come out of the walls screaming ``that 1 bit could have been critical Banking Data on which millions of lives depend and nuclear reactors and spaceships too! Wouldn''t you rather KNOW, even if ZFS desides to inform with zpool_self-destruct_condescending-error()?'''' Maybe, sometimes, yes, but USUALLY, **NO**! I''ve no objection to deciding how much recovery tools are needed based on experience rather than wide-eyed kool-aid ranting or presumptions from earlier filesystems, but so far experience says the recovery work was really needed, so I can''t agree with the bloggers rehashing each other''s zealotry. It would be nice to isolate and fix the underlying problems, though. That is the spirit in all these ``we don''t need no fsck because we are perfect'''' blogs with which I do agree. Their overoptimism isn''t as honest as I''d like about the way ZFS''s error messages do not enough to lead us toward the real cause in the case of SAN problems because they are all designed presuming spatially-clustered, temporally-spread, disk-based failures rather than temporally-clustered interconnect failures, so rather the error detection becomes no more than ``printf("simon sez u will not blame me, blame someone else. these aren''t the droids you''re looking for. move along.");'''' ....but, yeah, the blogger''s point of banging on the whole stack until it works rather than concealing errors, is a good one. Unfortunately I don''t think that''s what will actually happen with these dropped-write SAN failures. I think people will just use the new recovery bits, which conceal errors just like earlier filesystems and fsck tools, and shrug. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091105/ac1cfc07/attachment.bin>
>>>>> "rm" == Robert Milkowski <milek at task.gda.pl> writes:rm> Personally I don''t blame Sun that implementing the CR took so rm> long as it mostly affected home users with cheap hardware from rm> BestBuy like sources no, many of the reports were FC SAN''s. rm> and even then it was relatively rare. no, they said they were losing way more zpools than they ever lost vxfs''s in the same environment. rm> called enterprise customers were affected even less and then rm> either they had enough expertise or called Sun''s support rm> organization to get a pool manually reverted to its previous rm> uberblock. which is probably why the tool exists. but, great! -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091105/b8ffba8f/attachment.bin>
Miles Nordin wrote:>>>>>> "csb" == Craig S Bell <cbell at standard.com> writes: >>>>>> csb> Two: If you lost data with another filesystem, you may have >>>>>> csb> overlooked it and blamed the OS or the application, >>>>>> >>>>>> yeah, but with ZFS you often lose the whole pool in certain classes of >>>>>> repeatable real-world failures, like hotswap disks with flakey power >>>>>> or SAN''s without NVRAM where the target reboots and the initiator does >>>>>> not. Losing the whole pool is relevantly different to corrupting the >>>>>> insides of a few files.I think that most people including ZFS developers agree with you that losing an access to entire pool is not acceptable. And this has been fixed in snv_126 so now in those rare circumstances you should be able to import a pool. And generally you will end-up in a much better situation than with legacy filesystems + fsck. -- Robert Milkowski http://milek.blogspot.com
Miles Nordin wrote:>>>>>> "rm" == Robert Milkowski <milek at task.gda.pl> writes: >>>>>> > > rm> Personally I don''t blame Sun that implementing the CR took so > rm> long as it mostly affected home users with cheap hardware from > rm> BestBuy like sources > > no, many of the reports were FC SAN''s. > > rm> and even then it was relatively rare. > > no, they said they were losing way more zpools than they ever lost > vxfs''s in the same environment. > >Well, who''s they? I''ve been depolying ZFS for years on many different platforms from low-end, jbods, thru midrange, SAN, and high-end disk arrays and I have yet to loose a pool (hopefully not). It doesn''t mean that some other people did not have problems or did not loose they pools - in most if not in all such cases almost all data could probably be recovered by following manual and "hackish" procedure to rollback to a previous uberblock. Now it is integrated into ZFS and no special knowledge is required to be able to do so in such circumstances. Then there might have been other bugs... life, no software is without them.> rm> called enterprise customers were affected even less and then > rm> either they had enough expertise or called Sun''s support > rm> organization to get a pool manually reverted to its previous > rm> uberblock. > > which is probably why the tool exists. but, great! >The point is that you don''t need the tool now as it is built-in in zfs starting with snv_126.
Hi Robert I think you mean snv_128 not 126 :-) 6667683 need a way to rollback to an uberblock from a previous txg http://bugs.opensolaris.org/view_bug.do?bug_id=6667683 http://hg.genunix.org/onnv-gate.hg/rev/8aac17999e4d Regards Nigel Smith -- This message posted from opensolaris.org
On Thu, 5 Nov 2009, Miles Nordin wrote:
>>>>>> "rm" == Robert Milkowski <milek at task.gda.pl> writes:
>
> rm> Personally I don't blame Sun that implementing the CR took so
> rm> long as it mostly affected home users with cheap hardware from
> rm> BestBuy-like sources
>
> no, many of the reports were FC SANs.

Do you have a secret back-channel to receive these many reports? Are the reports from trolls or gnomes?

> rm> and even then it was relatively rare.
>
> no, they said they were losing way more zpools than they ever lost
> vxfs's in the same environment.

Who are 'they'? Are they the little gnomes that come out at night and lurk in your computer room?

Bob
-- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Robert Milkowski wrote:
> Miles Nordin wrote:
>>>>>>> "csb" == Craig S Bell <cbell at standard.com> writes:
>>>>>>> csb> Two: If you lost data with another filesystem, you may have
>>>>>>> csb> overlooked it and blamed the OS or the application,
>>>>>>>
>>>>>>> yeah, but with ZFS you often lose the whole pool in certain
>>>>>>> classes of repeatable real-world failures, like hotswap disks with flakey power
>>>>>>> or SANs without NVRAM where the target reboots and the initiator
>>>>>>> does not. Losing the whole pool is relevantly different from corrupting
>>>>>>> the insides of a few files.
> I think that most people, including the ZFS developers, agree with you that
> losing access to an entire pool is not acceptable. And this has been
> fixed in snv_126, so now in those rare circumstances you should be able
> to import the pool. And generally you will end up in a much better
> situation than with legacy filesystems + fsck.
>

Just a slight correction. The current build in-process is 128 and that's the build into which the changes were pushed.

-tim
On Thu, Nov 05, 2009 at 03:04:05PM -0700, Tim Haley wrote:
> Robert Milkowski wrote:
> >I think that most people, including the ZFS developers, agree with you that
> >losing access to an entire pool is not acceptable. And this has been
> >fixed in snv_126, so now in those rare circumstances you should be able
> >to import the pool. And generally you will end up in a much better
> >situation than with legacy filesystems + fsck.
> >
> Just a slight correction. The current build in-process is 128 and that's
> the build into which the changes were pushed.

It would be nice to see this information at: http://hub.opensolaris.org/bin/view/Community+Group+on/126-130 but it hasn't changed since 23 October.

-- -Gary Mills- -Unix Group- -Computer and Network Services-
Hi Gary I will let 'website-discuss' know about this problem. They normally fix issues like that. Those pages always seemed to just update automatically. I guess it's related to the website transition. Thanks Nigel Smith -- This message posted from opensolaris.org
Thanks for taking the time to write this - very useful info :) -- This message posted from opensolaris.org
I like it. Any idea what rev of zfs has the PSARC 2009/479 zpool recovery support <http://c0t0d0s0.org/codenews/fragments/c3570a6dcfb6712c7307e758e58550ee7b9c32b8.txt> ? cheers, Brian Craig S. Bell wrote:> Joerg just posted a lengthy answer to the fsck question: > > http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html > > Good stuff. I see two answers to "nobody complained about lying hardware before ZFS". > > One: The user has never tried another filesystem that tests for end-to-end data integrity, so ZFS notices more problems, and sooner. > > Two: If you lost data with another filesystem, you may have overlooked it and blamed the OS or the application, instead of the inexpensive hardware. >
>>>>> "rm" == Robert Milkowski <milek at task.gda.pl> writes:rm> who''s they? posters to this list. not interested in going in endless circles and spending half an hour hunting for citations because Someone is Wrong on the Internet. posts are there, go find them, or agree to disagree. rm> I''ve been depolying ZFS for years on many different platforms rm> from low-end, jbods, thru midrange, SAN, and high-end disk rm> arrays and I have yet to loose a pool well, have you ever lost a vxfs? This is another case of ``I can''t tell you how close to zero the number of problems *I''ve* had with it is. It''s so close, it is zero, so this means by extrapolation that no one is having problems anywhere, and I don''t need to bother reading any `lists'' where people report problems.'''' Sorry, but no. I''m less interested in ``I installed a zpool on something big and expensive and it worked,'''' more interested in ``losing more zpools than vxfs''s in the same clustered environment. we just restore from backup but are annoyed by the lost time,'''' which is the post I remember. rm> in all such cases almost all data could probably be recovered rm> by following manual and "hackish" procedure to rollback to a rm> previous uberblock. this is often not timely, cost-effective, acceptable, or even within reach. rm> Now it is integrated into ZFS and no special knowledge is rm> required this, of course, is. It''s also good that AIUI it works without a pool version bump, so you can boot an exotic new livedvd and recover a pool for an older stable release that lacks the fix. Unfortunately it will take some time to get the new builds and then more time to gain experience and know if the ueberblock rollback fixes bring ZFS resiliency on SAN''s in line with vxfs, especially with the amount of fuzzing around the old issue, of people blaming the lost pools on bitflip gremlins and telling people they need zpool-layer redundancy and citing various papers about UNC''s and CRC errors on 520-byte-sector netapp disks that have nothing to do with the SAN problems, and even now having selective memory of list posts as in ``yes I know the class of problems the ueberblock rollback fixes. But nobody ever had any of those problems, in spite of the fact the problems exist as a CLASS and we can draw a fucking box aroudn it.'''' It was always a hazy box, and I don''t know that anyone really root-caused the SAN problems, we just ended up with a bunch of broken pools that were all recovered using the same technique and imagined backwards from there. I''m optimistic about the fix though. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20091106/5be81144/attachment.bin>
On Nov 6, 2009, at 2:32 PM, Miles Nordin wrote:
> rm> I've been deploying ZFS for years on many different platforms
> rm> from low-end, jbods, thru midrange, SAN, and high-end disk
> rm> arrays and I have yet to lose a pool

Few people have encountered the problem where rollback is the solution. Few people need heart bypass surgery, either.

> well, have you ever lost a vxfs? This is another case of ``I can't
> tell you how close to zero the number of problems *I've* had with it
> is.

[Richard raises his hand, showing the scar from a lost vxfs file system, just below the wrist, next to the vxvm scar :-)]

> It's so close, it is zero, so this means by extrapolation that no
> one is having problems anywhere, and I don't need to bother reading
> any `lists' where people report problems.'' Sorry, but no. I'm less
> interested in ``I installed a zpool on something big and expensive and
> it worked,'' more interested in ``losing more zpools than vxfs's in
> the same clustered environment. we just restore from backup but are
> annoyed by the lost time,'' which is the post I remember.
>
> rm> in all such cases almost all data could probably be recovered
> rm> by following manual and "hackish" procedure to rollback to a
> rm> previous uberblock.
>
> this is often not timely, cost-effective, acceptable, or even within
> reach.
>
> rm> Now it is integrated into ZFS and no special knowledge is
> rm> required
>
> this, of course, is.
>
> It's also good that AIUI it works without a pool version bump, so you
> can boot an exotic new livedvd and recover a pool for an older stable
> release that lacks the fix.

I understand it to work this way. I think the integrators and service folks will manage the expectation setting properly for those folks who have service contracts. For the rest, you might just hear a "boot the latest LiveCD and fix," which also seems reasonable. In a few years, it will be a distant memory.

> Unfortunately it will take some time to get the new builds and then
> more time to gain experience and know if the ueberblock rollback fixes
> bring ZFS resiliency on SANs in line with vxfs, especially with the
> amount of fuzzing around the old issue, of people blaming the lost
> pools on bitflip gremlins and telling people they need zpool-layer
> redundancy and citing various papers about UNCs and CRC errors on
> 520-byte-sector netapp disks that have nothing to do with the SAN
> problems, and even now having selective memory of list posts as in
> ``yes I know the class of problems the ueberblock rollback fixes.

Actually, I like NetApp's response better, in some ways. They are now using a parity block (512 bytes) for every 8 blocks. This can work well for PC-like clients (8 512-byte blocks = 4 KB). This has the beneficial effect of using a code which can contain enough information to support correction, rather than just a digest. Digests, by design, are intended for verification and are completely useless for correction.

Now that ZFS can report the bitwise extent of errors (b125), we can finally get a real sense of the sorts of corruption we're dealing with. One potential problem with using a whole block for checksum/ECC is that there are failure modes which affect multiple blocks, either spatially or temporally. But once you know the failure modes, you can create better solutions...
-- richard
Richard Elling wrote:
> Now that ZFS can report the bitwise extent of errors (b125), we can

Richard, I had not noticed that feature being added. Do you have the bug number for that feature to hand? Thanks Nigel -- This message posted from opensolaris.org
On Fri, Nov 06, 2009 at 03:48:24PM -0800, Richard Elling wrote:
> Actually, I like NetApp's response better, in some ways. They are now
> using a parity block (512 bytes) for every 8 blocks. This can work well
> for PC-like clients (8 512-byte blocks = 4 KB). This has the beneficial
> effect of using a code which can contain enough info to support
> correction, rather than just a digest. Digests, by design, are
> intended for verification and completely useless for correction.

I'm not sure I follow. I thought the 8/9ths thing was just a bundled netapp-style checksum with no parity involved (only used on ATA drives). And that it would go to RAID-4 or RAID-DP for any correction.

-- Darren
On Nov 6, 2009, at 5:02 PM, A Darren Dunham wrote:
> On Fri, Nov 06, 2009 at 03:48:24PM -0800, Richard Elling wrote:
>> Actually, I like NetApp's response better, in some ways. They are now
>> using a parity block (512 bytes) for every 8 blocks. This can work
>> well for PC-like clients (8 512-byte blocks = 4 KB). This has the
>> beneficial effect of using a code which can contain enough info to support
>> correction, rather than just a digest. Digests, by design, are
>> intended for verification and completely useless for correction.
>
> I'm not sure I follow. I thought the 8/9ths thing was just a bundled
> netapp-style checksum with no parity involved (only used on ATA
> drives). And that it would go to RAID-4 or RAID-DP for any correction.

Perhaps, if they would open source it, we could examine it in detail :-) The key is that 512 bytes is enough space that you could implement some, albeit limited, error correction for 4 KB (think RAID-5, 8+1). OTOH, 256 bits is nowhere near enough space to implement an interesting correction code for 128 KB (1M bits).
-- richard
On Nov 6, 2009, at 4:35 PM, Nigel Smith wrote:> Richard Elling wrote: >> Now that ZFS can report the bitwise extent of errors (b125), we can > > Richard, I had not noticed that feature being added. > Do you have the bug number for that feature to hand?PSARC 2009/497 zfs checksum ereport payload additions http://arc.opensolaris.org/caselog/PSARC/2009/497/ CR 6867188 zfs checksum ereports could be more informative http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6867188 Good stuff! -- richard
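For anyone who wants to look at those ereports, they land in the FMA error log; something like the following should show them, including the new payload members described in the PSARC case (I have not listed the member names from memory, check the case log for those):

  fmdump -eV | less                      # dump every ereport in the error log with its full payload
  fmdump -e -c ereport.fs.zfs.checksum   # just the ZFS checksum ereports, filtered by event class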
fyi

Robert Milkowski wrote:
> XXX wrote:
>> | Have you actually tried to roll back to previous uberblocks when you hit the issue? I'm asking as I haven't yet heard about any case of the issue which was not solved by rolling back to a previous uberblock. The problem though was that the way to do it was "hackish".
>>
>> Until recently I didn't even know that this was possible or a likely solution to 'pool panics system on import' and similar pool destruction, and I don't have any tools to do it. (Since we run Solaris 10, we won't have official support for it for quite some time.)
>>
> I wouldn't be that surprised if this particular feature were actually backported to S10 soon. At least you may raise a CR asking for it - maybe you will get access to an IDR first (I'm not saying there is or isn't already one).
>
>> If there are (public) tools for doing this, I will give them a try the next time I get a test pool into this situation.
>>
> IIRC someone sent one to the zfs-discuss list some time ago. Then usually you will also need to poke around with zdb. A sketchy and unsupported procedure was discussed on the list as well. Look at the archives.
>
>> | The bugs which prevented importing a pool in some circumstances were really "annoying" but let's face it - it was bound to happen and they are just bugs which are getting fixed. ZFS is still young after all. And when you google for data loss on other filesystems I'm sure you will find lots of user testimonies - be it ufs, ext3, reiserfs or your favourite one.
>>
>> The difference between ZFS and those other filesystems is that with a few exceptions (XFS, ReiserFS), which sysadmins in the field didn't like either, those filesystems didn't generally lose *all* your data when something went wrong. Their official repair tools could usually put things back together to at least some extent.
>>
> Generally they didn't, although I've seen situations where entire ext2 and ufs filesystems were lost and fsck was not able to get them even mounted (kernel panics right after mounting them). On another occasion fsck was crashing the box; in yet another, fsck claimed everything was ok but then the system kept crashing while doing a backup (fsck can't really properly fix filesystem state - it is largely guessing, and sometimes it goes terribly wrong).
>
> But I agree that generally with other file systems you can recover most or all data just fine. And generally that is the case with zfs too - there were probably more bugs in ZFS as it is a much younger filesystem, but most of them were very quickly fixed. And the uberblock one - I 100% agree that when you hit the issue and didn't know about the manual recovery method it was very bad - but it has finally been fixed.
>
>> (Just as importantly, when they couldn't put things back together you could honestly tell management and the users 'we ran the recovery tools and this is all they could get back'. At the moment, we would have to tell users and management 'well, there are no (official) recovery tools...', unless Sun Support came through for once.)
>>
> But these tools are built into zfs and run automatically, with virtually 100% confidence that if something can be fixed it is fixed correctly, and if something is wrong it will be detected - thanks to end-to-end checksumming of data and metadata. The problem *was* that the one scenario where rolling back to a previous uberblock is required was not handled automatically and required a complicated and undocumented procedure. It wasn't high priority for Sun as it was very rare, wasn't affecting enterprise customers much, and although complicated, the procedure exists and was successfully used on many occasions, even for non-paying customers, thanks to guys like Victor on the zfs mailing list who helped some people in such situations.
>
> But you didn't know about it and it seems like Sun's support service was no use for you - which is really a shame. In your case I would point that out to them and at least get a good deal as compensation or something...
>
> But what is most important is that a fully supported, built-in and easy-to-use procedure is finally available to recover from such situations. As time progresses and more bugs are fixed, ZFS will behave much better in many corner cases, as it already does in Open Solaris - the last 6 months or so have been really productive in fixing many bugs like that.
>
>> | However the whole point of the discussion is that zfs really doesn't need a fsck tool. All the problems encountered so far were bugs and most of them are already fixed. One missing feature was built-in support for rolling back the uberblock, which has just been integrated. But I'm sure there are more bugs to be found..
>>
>> I disagree strongly. Fsck tools have multiple purposes; ZFS obsoletes some of them but not all. One thing fsck is there for is to recover as much as possible after things happen that are supposed to be impossible, like operating system bugs or crazy corruption. ZFS's current attitude is more or less that impossible things won't happen so it doesn't have to do anything (except, perhaps, panic with assert failures).
>>
> This is not true - I will try to explain why. Generally, if you want to recover some data from a filesystem you need to get it into a state where you can mount it (at least read-only). Most legacy filesystems, when they hit a problem where the metadata does not make sense to them, won't allow you to mount the filesystem and will ask you to run fsck. As there are no checksums in these filesystems, there is generally no accurate way of telling how the bad metadata should be fixed. Fsck looks for obvious problems and tries to "guess" in many cases; sometimes it is right and sometimes it is not. Sometimes it won't even detect that there was corruption. Also keep in mind that fsck in most filesystems does not even try to check user data - just metadata. The main reason is that it can't really do it. Because running fsck could potentially be disastrous to a filesystem and lead to even more damage if started automatically (for example during system boot), it runs in an interactive mode, and if some less obvious fixes are required it asks a human to confirm its actions. But even then it is still just guessing what it is supposed to do. And it happens that the situation gets even worse.
>
> Then sometimes there were bugs both in filesystems and fsck, and the user was left with no access to data at all until these bugs were fixed (or the user was skilled enough to fix or work around them on his/her own). I came across such problems on EMC IP4700, EMC Celerra and a couple of other systems in my life. For example, fsck was running for well over 10h, consuming more and more memory, until finally the server ran out of memory and fsck died... and it all started over again, failed again.... In another case fsck kept crashing during repair at the same location, and the filesystem crashed the OS a couple of minutes after being mounted..
>
> The other problem with fsck is that even if it thinks the filesystem is ok, it actually might not be - even its metadata state. Then all sorts of things might happen - like a system panic when accessing a given file or directory, or more data getting corrupted... I was in such a situation a couple of times and it took days to copy files from such a filesystem to another one, with many panics in between, when we had to skip such files or directories, etc. fsck didn't help and reported everything was fine.
>
> Now with ZFS it is a completely different world. ZFS is able, in virtually all cases, to detect whether its metadata and data on disk are corrupted in any way, thanks to its end-to-end checksumming. If someone is concerned about how strong the default checksum (fletcher4) is, one can currently switch zfs to use sha256 and sleep well. So here is the first big difference compared to most filesystems on the market - if some data is corrupted, ZFS does not have to *guess* whether that is the case but can actually detect it with almost 100% confidence. Once such a case is detected, ZFS will try to automatically fix the issue if a redundant copy of the corrupted block is available - if there is, it will all happen transparently to applications without any need to unmount filesystems or run external tools like fsck. Because ZFS checksums both metadata and user data, it can detect and possibly fix data corruption in both (which fsck can't, even if it is lucky). Even if you have no redundancy at the pool level, ZFS metadata blocks are always kept in at least two copies, physically separated on disk if possible. What this means is that even in a single-disk configuration (or stripe), if some data is corrupted zfs will be able to detect it, and if it is a metadata block it will not only detect it but also automatically and transparently fix it and preserve filesystem consistency. There is a simple test you may run - create a pool on top of one disk drive, put some files in it, then overwrite let's say 20% of the disk drive with some random data or zeros while zfs is running. Then flush caches (export/import the pool) and try to access all metadata by doing a full ls -lra on the filesystem. You should be able to get a full listing with proper attributes, etc., but if you check zpool status it will probably report many checksum errors which were corrected. (When overwriting, overwrite a portion of the beginning of the disk, as zfs will usually start writing to a disk from the beginning.) Now, if you actually try to read a file's contents it should be fine if you are lucky enough to read one which was not overwritten; if you are unlucky you won't be able to read the blocks which are corrupted (since you don't have any redundancy at the zfs level it can't fix its user data, but it can detect the damage), but you will be able to read all the other blocks from the file. Now try to do something like this with any other file system - you will probably end up with an OS panic, and in many cases fsck won't be able to recover the file system to a point where you can recover some data.... and when fixing it, it will only be guessing what to do and will skip user data entirely...
>
> Now there is a specific case of the above where the metadata describing the pool itself, or its root block, is corrupted and can't be fixed because all copies are wrong. ZFS can also detect this, but extra functionality to actually try to use the N-1 rootblock in such a case was not implemented until very recently. This was very unfortunate, but because it was very rare in the field and resources are limited as usual, it wasn't implemented - instead there was an undocumented, unsupported and hard-to-follow procedure for doing it manually, and some people did use it successfully (check the zfs-discuss archives). But of course it shouldn't be like that, and the ZFS developers did recognize it by accepting a bug report on it. But limited resources...... Fortunately a built-in mechanism to deal with such a case has finally been implemented. So now when it happens, a user will have the choice of importing a pool with an extra option to roll back to a previous txg so the pool can be imported. From then on, all the mechanisms described above kick in. And again - no guessing here, but a guarantee of detecting corruption and fixing it if possible. And you don't even have to run any check and wait hours, sometimes days, on large filesystems with millions of files before you can access your data (and still not be sure what exactly you're accessing and whether it will cause further issues). Of course it would probably be wise to run zpool scrub, to force all data and metadata to be read, checksummed and fixed where possible, at a time convenient for you, but in the meantime you may run your applications and any corruption will be detected and fixed while data is being accessed.
>
> So from the practical point of view you may think of the mechanisms in ZFS as a built-in fsck with the ability to actually detect when corruption happens (instead of just guessing at it, and for user data as well as metadata), and to get it fixed if a redundant copy is available (transparently to applications). Having a separate tool doesn't really make sense here. Of course you can always write a script called fsck.zfs which will import a pool and run zpool scrub if you want. And sometimes people will do exactly that before going back into production. But having a genuine extra tool like fsck doesn't really make sense - what exactly should such a tool do (keeping in mind all the above)?
>
> Then there were a couple of bugs which prevented ZFS from importing a pool with some specific corruptions which were entirely fixable (AFAIK all known ones were fixed in Open Solaris). When you think about it - we are talking about bugs here - if you put all the recovery mechanisms into a separate tool called fsck with the same bugs, it wouldn't be able to repair such a pool anyway, would it? So you would need to fix these bugs first - but once you fixed them, zfs would be able to mount such a pool and an external tool would still not be needed (or, after applying a patch/fix, do 'alias fsck="zpool import"' and then 'fsck pool' will get your pool fixed... :)
>
> You might ask what you are supposed to do until such a bug is fixed? Well, what would you do if you weren't able to mount an ext2 filesystem (or any other) and there was a bug in its fsck which prevented it from getting the fs into a mountable state.... you would have to wait for a fix, or get it fixed yourself, or play with its on-disk format with tools like e2fs, fsdb, ... and try to fix the filesystem manually. Well, on zfs you also have zdb... or you would probably be forced to recover data from backup.
>
> The point here is that most filesystems and their tools had such bugs, and zfs is one of the youngest filesystems on the market, so it is no wonder in a way that such bugs are getting fixed now and not 5-7 years ago. Then there is a critical mass of users required for a given filesystem, so that it gets deployed in many different environments, different workloads, hardware, drivers, usage cases, ... so all these corner cases can surface, users hopefully report them and they get fixed. ZFS has become widely deployed only in the last couple of years or so, so it is no wonder that most of these bugs were spotted (and fixed) during the same period.
>
> But then, thanks to the fundamentally different architecture of ZFS, once most (all? :)) bugs like these are fixed, ZFS offers something MUCH better than legacy filesystems + fsck. It offers a guarantee of detecting data corruption and fixing it properly when possible, while reporting what can't be fixed and still providing access to all the other data in your pool.
>
> btw: the email exchange is private so I don't want to include zfs-discuss without your consent, but if you want to forward this email to zfs-discuss for other users' benefit feel free to do so.
>
>> ) As the evolution of ZFS has demonstrated, impossible things *do* happen and you *do* need the ability to recover as much as possible. ZFS is busy slapping bandaids over specific problems instead of dealing with the general issue.
>>
> Just a quick "google" and:
>
> 1. fsck fails and causes panic of Linux kernel
> https://bugzilla.redhat.com/show_bug.cgi?id=126238
>
> 2. btrfs - filesystem got corrupted, running btrfsck causes even more damage and the entire filesystem is nuked due to a bug. BTRFS is not the best example as it is far from being production ready, but still...
> https://bugzilla.redhat.com/show_bug.cgi?id=497821
>
> 3. linux gfs2 - fsck has a bug (or lacks a feature) and is not able to fix the filesystem with a specific corruption, and the filesystem is unmountable. The only option is to manually fix data on-disk with help from a support service on a case-by-case basis...
> https://bugzilla.redhat.com/show_bug.cgi?id=457557
>
> 4. e2fsck segfaults + dumps core when trying to check a filesystem
> https://bugzilla.redhat.com/show_bug.cgi?id=108075
>
> 5. ext3 filesystem crashes - fsck can't repair it and goes into an infinite loop.... fixed in a development version of fsck
> https://bugzilla.redhat.com/show_bug.cgi?id=467677
>
> 6. gfs2 corruption is causing a linux kernel to panic.... fsck says it fixes the issue but it doesn't, and the system crashes all over again under load...
> https://bugzilla.redhat.com/show_bug.cgi?id=519049
>
> 7. ext3 filesystem can't be mounted and fsck won't finish after 10 days of running (probably some kind of infinite-looping bug again)
> http://ubuntuforums.org/archive/index.php/t-394744.html
>
> 8. AIX JFS2 filesystem corruption - due to a bug in fsck it can't fix the fs, data had to be recovered from backup
> http://unix.ittoolbox.com/groups/technical-functional/ibm-aix-l/error-518-file-system-corruption-366503
>
> 9.
> https://bugzilla.redhat.com/show_bug.cgi?id=514511
> https://bugzilla.redhat.com/show_bug.cgi?id=477856
>
> And there are many more...
>
> The point again is that bugs happen even in fsck, and until they are fixed a common user or sysadmin quite often won't be able to recover on their own. ZFS is no exception here when it comes to bugs. But thanks to its different approach (mostly end-to-end checksumming + COW), its ability to detect data corruption and deal with it exceeds most generally available solutions on the market. The fixes for some of the bugs mentioned above only make it more robust and reliable, even for those previously unlucky users... :)
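The single-disk corruption test Robert describes in the forwarded message can be sketched roughly as follows. This is deliberately destructive and every name in it is an assumption (a scratch slice c1t1d0s0 and a throwaway pool called tank); do not point it at a disk holding anything you care about. It skips the first 1 MB so the front ZFS labels survive, and it exports before the overwrite as a simplification of Robert's "corrupt it while running, then export/import to flush caches":

  zpool create tank c1t1d0s0                # single-slice pool, no redundancy
  cp -r /usr/share/man /tank                # put some data in it
  zpool export tank                         # quiesce the pool so nothing is cached
  dd if=/dev/zero of=/dev/rdsk/c1t1d0s0 bs=1024k seek=1 count=200   # clobber ~200 MB behind the labels
  zpool import tank
  ls -lRa /tank > /dev/null                 # walk all metadata; ditto blocks repair damaged copies on the fly
  zpool status -v tank                      # checksum error counts, plus any files with unrecoverable user data
  zpool scrub tank                          # force a full pass over every remaining block

The end state is what Robert describes: on import, zpool status shows repaired metadata, while user data that had no surviving copy is flagged rather than silently returned.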
On Sun, Nov 8, 2009 at 7:55 AM, Robert Milkowski <milek at task.gda.pl> wrote:> > fyi > > Robert Milkowski wrote: >> >> XXX wrote: >>> >>> | Have you actually tried to roll-back to previous uberblocks when you >>> | hit the issue? ?I''m asking as I haven''t yet heard about any case >>> | of the issue witch was not solved by rolling back to a previous >>> | uberblock. The problem though was that the way to do it was "hackish". >>> >>> ?Until recently I didn''t even know that this was possible or a likely >>> solution to ''pool panics system on import'' and similar pool destruction, >>> and I don''t have any tools to do it. (Since we run Solaris 10, we won''t >>> have official support for it for quite some time.) >>> >> >> I wouldn''t be that surprised if this particular feature would actually be >> backported to S10 soon. At least you may raise a CR asking for it - maybe >> you will get an access to IDR first (I''m not saying there is or isn''t >> already one). >> >>> ?If there are (public) tools for doing this, I will give them a try >>> the next time I get a test pool into this situation. >>> >> >> IIRC someone send one to the zfs-discuss list some time ago. >> Then usually you will also need to poke with zdb. >> A sketchy and unsupported procedure was discussed on the list as well. >> Look at the archives. >> >>> | The bugs which prevented importing a pool in some circumstances were >>> | really "annoying" but lets face it - it was bound to happen and they >>> | are just bugs which are getting fixed. ZFS is still young after all. >>> | And when you google for data loss on other filesystems I''m sure you >>> | will find lots of user testimonies - be it ufs, ext3, raiserfs or your >>> | favourite one. >>> >>> ?The difference between ZFS and those other filesystems is that with >>> a few exceptions (XFS, ReiserFS), which sysadmins in the field didn''t >>> like either, those filesystems didn''t generally lose *all* your data >>> when something went wrong. Their official repair tools could usually >>> put things back together to at least some extent. >>> >> >> Generally they didn''t although I''ve seen situation when entire ext2 and >> ufs were lost and fsck was not able to get them even mounted (kernel panics >> right after mounting them). In other occassion fsck was crashing the box in >> yet another one fsck claimed everything was ok but then when doing backup >> system was crashing (fsck can''t really properly fix filesystem state - it is >> more of guessing and sometimes it goes terribly wrong). >> >> But I agrre that generally with other file systems you can recover most or >> all data just fine. >> And generally it is the case with zfs - there were probably more bugs in >> ZFS as it is much younger filesystem, but most of them were very quickly >> fixed. And the uberblock one - I 100% agree then when you hit the issue and >> didn''t know about manual method to recover it was very bad - but it has >> finally been fixed. >> >>> (Just as importantly, when they couldn''t put things back together you >>> could honestly tell management and the users ''we ran the recovery tools >>> and this is all they could get back''. At the moment, we would have >>> to tell users and management ''well, there are no (official) recovery >>> tools...'', unless Sun Support came through for once.) 
>>> >> >> But these tools are built-in into zfs and are happening automatically and >> with virtually 100% confidence that if something can be fixed it is fixed >> correctly and if something is wrong it will be detected - thanks to >> end-to-end checksumming of data and meta-data. The problem *was* that one >> case scenario when rolling back to previous uberblock is required was not >> implemented and required a complicated and undocumented procedure to follow. >> It wasn''t high priority for Sun as it was very rare , wasn''t affecting much >> enterprise customers and although complicated the procedure is there is one >> and was successfully used on many occasions even for non paying customers >> thanks to guys like Victor on the zfs mailing list who helped some people in >> such a situations. >> >> But you didn''t know about it and it seems like Sun''s support service was >> no use for you - which is really a shame. >> In your case I would probably point that out to them and at least get some >> good deal as a compensation or something... >> >> But what is most important is that finally fully supported, built in and >> easy to use procedure is available to recover from such situations. As time >> will progress and more bugs will be fixed ZFS will behave much better under >> many corner cases as it does already in Open Solaris - last 6 months or so >> were really very productive in fixing many bugs like that. >> >>> | However the whole point of the discussion is that zfs really doesn''t | >>> need a fsck tool. >>> | All the problems encountered so far were bugs and most of them are | >>> already fixed. One missing feature was a built-in support for | rolling-back >>> uberblock which just has been integrated. But I''m sure | there are more bugs >>> to be found.. >>> >>> ?I disagree strongly. Fsck tools have multiple purposes; ZFS obsoletes >>> some of them but not all. One thing fsck is there for is to recover as >>> much as possible after things happen that are supposed to be impossible, >>> like operating system bugs or crazy corruption. ZFS''s current attitude >>> is more or less that impossible things won''t happen so it doesn''t have >>> to do anything (except, perhaps, panic with assert failures). >>> >> >> This is not true - I will try to explain why. >> Generally if you want to recover some data from a filesystem you need to >> get it into a state you can mount it (at least RO). Most legacy filesystems >> when ?hitting with the problem that metadata do not make sense to them and >> they think it is wrong ?won''t allow you to mount the filesystem and will ask >> you to run fsck. Now as there are not checksum in these filesystems >> generally there is no accurate way of telling how the bad metadata should be >> fixed. Fsck is looking for obvious things and is trying to "guess" in many >> cases and sometimes it is right and sometimes it is not. Then sometimes it >> won''t even detect then there was corruption. Also keep in mind that fsck in >> most filesystems does not even try to check for user data - just metadata. >> The main reason is that it can''t really do it. >> Now because running fsck could potentially be disastrous ?to a filesystem >> and lead to even more damage if it is started automatically (for example >> during system boot) it is started in an interactive-mode and if some less >> obvious fixes are required it will require a human to confirm its action. >> But even then it is still just guessing what it is supposed to do. And it >> happens that situation gets even worse. 
>> >> Then sometimes there were bugs both in filesystems and fsck and user was >> left with no access to data at all until these bugs were fixed (or user was >> skilled enough to fix/workaround them on his/her own). I came across such >> problems on EMC IP4700, EMC Celerra and couple of other systems in my life. >> For example fsck was running for well over 10h consuming more and more >> memory and finally server was running out of memory and fsck died... and it >> all started over again, failed again.... in other case fsck was just >> crashing during repair in the same location and file system was crashing the >> os after couple of minutes after mounting it.. >> >> The other problem with fsck is that even if it thinks that filesystem is >> ok it actually might not be - even its metadata state. Then all different >> things might happen - like when accessing a given file or directory a system >> will panic or more data will get corrupted... I was in such a situation >> couple of times and it took days to copy files from such a filesystem to >> another one with many panics in-between when we had to skip such files or >> directories, etc. fsck didn''t help and reported everything is fine. >> >> Now with ZFS it is completely different world. ZFS is able in virtually >> all cases to detect if its meta-data and data on-disk is corrupted in anyway >> or not thanks to its end-to-end checksumming. If someone is concern with how >> strong default checksumming is (fletcher4) then currently one cas switch zfs >> to use sha256 to have a good sleep. So here is first big difference compared >> to most filesystems in a market - ZFS if some data is corrupted does not >> have to *guess* if it is the case or not but can actually detect it with >> almost 100% confidence when it is the case. >> Once such a case is detected ZFS will try to automatically fix the issue >> if there is redundant copy of corrupted block available - if there is it >> will all happen transparently to applications without any need to unmount >> filesystems or run external tools like fsck. Then because ZFS checksums both >> metadata and user data it will be able to detect and possibly fix data >> corruptions in both cases (which fsck can''t even if it is lucky). Now even >> if you are not doing any redundancy at pool level by using ZFS its metadata >> blocks are always kept in at least two copies physically separated on disk >> if possible. What it means is that even in a single disk configuration (or >> stirpe) if some data is corrupted zfs will be able to detect it and if it is >> meta-data block it will be able not only to detect it but also automatically >> and transparently fix it and preserve filesystem consistency. There is a >> simple test you may run - create a pool on top of one disk drive, put some >> files in it then overwrite lets say 20% of the disk drives with some random >> data or zeros while zfs is running. Then flush caches (export/import pool) >> and try to access allmetadata by doing a full ls -lra on a filesystem. You >> should be able to get a full listing with proper attributes, etc. but if you >> check zpool status it will probably report many checksum errors which were >> corrected. (when overwriting overwrite so portion of the beginning of the >> disk as zfs will usually start writing to a disk from the beginning). 
>> Now if you actually try to read file contents, a read will succeed if you are lucky enough to hit a block that was not overwritten; if you are unlucky you won't be able to read the corrupted blocks (since there is no redundancy at the ZFS level it can't fix user data, only detect the damage), but you will still be able to read all the other blocks of the file. Try to do something like this with any other filesystem - you will probably end up with an OS panic, and in many cases fsck won't be able to bring the filesystem back to a point where you can recover any data... and when "fixing" things it will only be guessing what to do and will skip user data entirely.
>>
>> Now there is one specific variant of the above: when the corrupted metadata describes the pool itself or its root block, and it can't be fixed because all copies are bad. ZFS can detect this too, but until very recently the extra functionality to fall back to the N-1 root block in that case was not implemented. This was very unfortunate, but because it was very rare in the field, and resources are limited as usual, it wasn't implemented - instead there was an undocumented, unsupported and hard-to-follow procedure for doing it manually, and some people did use it successfully (check the zfs-discuss archives). Of course it shouldn't be like that, and the ZFS developers recognized this by accepting a bug report on it. But limited resources... Fortunately a built-in mechanism to deal with this case has finally been implemented. So now, when it happens, a user can import the pool with an extra option that rolls back to a previous txg so the pool can be imported, and from then on all the mechanisms described above kick in. Again, there is no guessing here, but a guarantee of detecting corruption and fixing it where possible. And you don't even have to run a check and wait hours, sometimes days, on large filesystems with millions of files before you can access your data (still not being sure exactly what you are accessing and whether it will cause further issues). Of course it would be wise to run a zpool scrub at a convenient time to force all data and metadata to be read, checksummed and fixed where possible, but in the meantime you can run your applications, and any corruption will be detected and fixed as the data is accessed.
>>
>> So from a practical point of view you may think of these ZFS mechanisms as a built-in fsck with the ability to actually detect when corruption happens (instead of just guessing, and for user data as well as metadata) and to fix it if a redundant copy is available, transparently to applications. Having a separate tool doesn't really make sense here. Of course you can always write a script called fsck.zfs which imports a pool and runs zpool scrub if you want (a sketch follows below), and sometimes people will do exactly that before going back into production. But a genuine extra tool like fsck doesn't really make sense - what exactly should such a tool do, keeping all of the above in mind?
>>
>> Then there were a couple of bugs which prevented ZFS from importing a pool with certain specific corruptions that were entirely fixable (AFAIK all known cases are fixed in OpenSolaris).
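[For concreteness, a rough sketch of what that recovery path looks like from the command line. The pool name "tank" is a placeholder, and the -F recovery option is the one added by PSARC 2009/479 in recent OpenSolaris builds - check zpool(1M) on your build before relying on the exact flags.]

    zpool import tank        # a badly corrupted pool may refuse a normal import
    zpool import -F tank     # recovery mode: discard the last few transaction
                             # groups and fall back to an older valid uberblock

    zpool scrub tank         # re-read and repair everything repairable
    zpool status -v tank     # scrub runs in the background; status shows its
                             # progress and lists anything unrepairable

[And the half-joking fsck.zfs wrapper mentioned above really is just a few lines - again a sketch, not a supported tool:]

    #!/bin/ksh
    pool=$1
    zpool import "$pool" 2>/dev/null || zpool import -F "$pool" || exit 1
    zpool scrub "$pool"
    zpool status -v "$pool"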
>> When you think about it, we are talking about bugs here - if you put all the recovery mechanisms into a separate tool called fsck, with the same bugs, it wouldn't be able to repair such a pool either, would it? So you would need to fix those bugs first - but once they are fixed, ZFS itself can mount such a pool and an external tool is still not needed (or, after applying the patch/fix, do alias fsck='zpool import' and then "fsck pool" will get your pool fixed... :)
>>
>> You might ask what you are supposed to do until such a bug is fixed. Well, what would you do if you couldn't mount an ext2 filesystem (or any other) and there was a bug in its fsck which prevented it from getting the fs into a mountable state? You would have to wait for a fix, or fix it yourself, or play with the on-disk format with tools like debugfs, fsdb, ... and try to repair the filesystem manually. Well, on ZFS you also have zdb... or you would probably be forced to recover the data from backup.
>>
>> The point here is that most filesystems and their tools have had such bugs, and ZFS is one of the youngest filesystems on the market, so it is no wonder that such bugs are being fixed now and not 5-7 years ago. A filesystem also needs a critical mass of users, deployed in many different environments, workloads, hardware, drivers, usage patterns, ... before all these corner cases surface, get reported and get fixed. ZFS has only been widely deployed for the last couple of years or so, so it is no surprise that most of these bugs were spotted (and fixed) during that period.
>>
>> But then, thanks to a fundamentally different architecture, once most (all? :)) of these bugs are fixed, ZFS offers something MUCH better than legacy filesystems + fsck. It offers a guarantee of detecting data corruption and fixing it properly when possible, while reporting what can't be fixed and still providing access to all the other data in your pool.
>>
>> btw: this email exchange is private so I don't want to include zfs-discuss without your consent, but if you want to forward this email to zfs-discuss for other users' benefit, feel free to do so.
>>
>>> As the evolution of ZFS has demonstrated, impossible things *do* happen and you *do* need the ability to recover as much as possible. ZFS is busy slapping bandaids over specific problems instead of dealing with the general issue.
>>
>> Just a quick "google" and:
>>
>> 1. fsck fails and causes a panic of the Linux kernel
>> https://bugzilla.redhat.com/show_bug.cgi?id=126238
>>
>> 2. btrfs - the filesystem gets corrupted, running btrfsck causes even more damage and the entire filesystem is nuked due to a bug. btrfs is not the best example as it is far from production ready, but still...
>> https://bugzilla.redhat.com/show_bug.cgi?id=497821
>>
>> 3. Linux gfs2 - fsck has a bug (or missing feature) and cannot fix a filesystem with a specific corruption, but the filesystem is unmountable. The only option is to manually fix the data on disk with help from a support service on a case-by-case basis...
>> https://bugzilla.redhat.com/show_bug.cgi?id=457557
>>
>> 4. e2fsck segfaults and dumps core when trying to check a filesystem
>> https://bugzilla.redhat.com/show_bug.cgi?id=108075
>>
>> 5. ext3 filesystem crashes - fsck can't repair it and goes into an infinite loop...
>> (fixed in a development version of fsck)
>> https://bugzilla.redhat.com/show_bug.cgi?id=467677
>>
>> 6. gfs2 corruption causes a Linux kernel panic... fsck says it fixes the issue, but it doesn't, and the system crashes all over again under load...
>> https://bugzilla.redhat.com/show_bug.cgi?id=519049
>>
>> 7. ext3 filesystem can't be mounted and fsck won't finish after 10 days of running (probably some kind of infinite-loop bug again)
>> http://ubuntuforums.org/archive/index.php/t-394744.html
>>
>> 8. AIX JFS2 filesystem corruption - due to a bug in fsck it can't fix the fs; data had to be recovered from backup
>> http://unix.ittoolbox.com/groups/technical-functional/ibm-aix-l/error-518-file-system-corruption-366503
>>
>> 9.
>> https://bugzilla.redhat.com/show_bug.cgi?id=514511
>> https://bugzilla.redhat.com/show_bug.cgi?id=477856
>>
>> And there are many more...

You missed the fun vxfs ones where a full fs can corrupt itself so badly that your only option is to restore from backup (fsck won't help you). Then there was the vxfs memory leak on Solaris 10 (it didn't cause corruption, but at some point you had to take outages to work around the problem). Or the 'feature' that was there for a long time, where unclean shutdowns could (not always, but often enough to be annoying) mess up vxvm so badly that you had to run vxprivutil on your LUNs, send the output to Veritas, and they would create a custom file for vxmake to repair the private area, just so you could import the disk group again.

>> The point again is that bugs happen even in fsck, and until they are fixed a common user or sysadmin quite often won't be able to recover on their own. ZFS is no exception here when it comes to bugs. But thanks to its different approach (mostly end-to-end checksumming + COW), its ability to detect data corruption and deal with it exceeds most generally available solutions on the market. The fixes for the bugs mentioned above only make it more robust and reliable, even for those previously unlucky users... :)

And they even happen in 'mature' and 'proven' filesystems too...

> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/pipermail/zfs-discuss
This new PSARC putback that allows rolling back to an earlier valid uberblock is good.

This immediately raises a question: could we use this PSARC functionality to recover deleted files? Or some variation? I don't need that functionality now, but I am just curious...
-- 
This message posted from opensolaris.org
Orvar Korvar wrote:
> This new PSARC putback that allows rolling back to an earlier valid uberblock is good.
>
> This immediately raises a question: could we use this PSARC functionality to recover deleted files? Or some variation? I don't need that functionality now, but I am just curious...

Not really. Uberblocks are associated with transaction groups. A transaction group can contain writes to many different files, and those writes aren't necessarily everything that's in a particular file. You might luck out and restore a particular file in a rollback like this, but you'd probably lose a lot of other data at the same time. We don't offer the ability to roll back if the pool can be opened/imported successfully anyway.

-tim
frequent snapshots offer outstanding "oops" protection. Rob
Maybe to create snapshots "after the fact" as part of some larger disaster recovery effort. (What did my pool/filesystem look like at 10am?... Say, 30 minutes before the database barfed on itself...)

With some enhancements, might this functionality be extendable into a "poor man's CDP" offering? It won't protect against (non-redundant) hardware failures, but it could provide some relief against app/human creativity. Seems like one of those things you never really need... until you need it that one time, at which point nothing else will do.

One would think that using zdb and friends it might be possible to "walk the chain" of txgs backwards, and each good/whole one could be a valid recovery/reset point.

This raises a more fundamental question that perhaps someone can comment on. Does ZFS's COW follow a fairly strict last-released-block, last-overwritten model (keeping a maximum buffer of intact data), or do previously used blocks get overwritten largely based on block/physical location, fragmentation/best-fit, etc.? In the case of blank disks/LUNs, does a 1TB drive, for instance, get completely COW-ed onto its blank space, or does ZFS re-use previously used (and freed) space before burning through the entire disk?

Thanks,

 -- MikeE

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Orvar Korvar
Sent: Monday, November 09, 2009 8:36 AM
To: zfs-discuss at opensolaris.org
Subject: [zfs-discuss] PSARC recover files?

This new PSARC putback that allows rolling back to an earlier valid uberblock is good.

This immediately raises a question: could we use this PSARC functionality to recover deleted files? Or some variation? I don't need that functionality now, but I am just curious...
-- 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/pipermail/zfs-discuss
> Maybe to create snapshots "after the fact"

How does one quiesce a drive "after the fact"?
+------------------------------------------------------------------------------
| On 2009-11-09 12:18:04, Ellis, Mike wrote:
|
| Maybe to create snapshots "after the fact" as part of some larger disaster recovery effort.
| (What did my pool/filesystem look like at 10am?... Say, 30 minutes before the database barfed on itself...)
|
| With some enhancements, might this functionality be extendable into a "poor man's CDP" offering? It won't protect against (non-redundant) hardware failures, but it could provide some relief against app/human creativity.

Alternatively, you can write a cron job/service that takes snapshots of your important filesystems. I take hourly snaps of all of our homedirs, and five-minute snaps of our database volumes (InnoDB and Postgres both recover adequately; I have used these snaps to build recovery zones to pull accidentally deleted data from before - good times).

Look at OpenSolaris' Time Slider service, although writing something that does this yourself is pretty trivial (we use a Perl program with YAML configs launched by cron every minute; a bare-bones shell version is sketched below). My one suggestion would be to make sure the automatically taken snaps have a unique name (@auto, or whatever), so you can do bulk expiry tomorrow or next week without worry.

Cheers.

-- 
bda
cyberpunk is dead. long live cyberpunk.
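[Not the Perl/YAML tool mentioned above - just a minimal shell sketch of the same idea so the shape of it is clear. The dataset name, the "auto" prefix and the retention count are placeholders; schedule it from cron at whatever interval suits you.]

    #!/bin/ksh
    # Take a snapshot with a unique, greppable prefix and expire old ones.
    fs=tank/home          # placeholder dataset
    prefix=auto           # unique prefix so bulk expiry is safe
    keep=24               # how many of these snapshots to retain

    zfs snapshot "${fs}@${prefix}-$(date '+%Y%m%d-%H%M')"

    # List our snapshots oldest-first and destroy all but the newest $keep.
    snaps=$(zfs list -H -t snapshot -o name -s creation -r "$fs" | grep "@${prefix}-")
    total=$(echo "$snaps" | wc -l)
    drop=$((total - keep))
    if [ "$drop" -gt 0 ]; then
        echo "$snaps" | head -"$drop" | while read snap; do
            zfs destroy "$snap"
        done
    fi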
On Thu Nov 5 14:38:13 PST 2009, Gary Mills wrote:
> It would be nice to see this information at:
> http://hub.opensolaris.org/bin/view/Community+Group+on/126-130
> but it hasn't changed since 23 October.

Well, it seems we have an answer:

http://mail.opensolaris.org/pipermail/zfs-discuss/2009-November/033672.html

On Mon Nov 9 14:26:54 PST 2009, James C. McPherson wrote:
> The flag days page has not been updated since the switch
> to XWiki, it's on my todo list but I don't have an ETA
> for when it'll be done.

Perhaps anyone interested in seeing the flag days page resurrected can petition James to raise the priority on his todo list.

Thanks
Nigel Smith
-- 
This message posted from opensolaris.org
Nigel Smith wrote:
> On Thu Nov 5 14:38:13 PST 2009, Gary Mills wrote:
>> It would be nice to see this information at:
>> http://hub.opensolaris.org/bin/view/Community+Group+on/126-130
>> but it hasn't changed since 23 October.
>
> Well, it seems we have an answer:
>
> http://mail.opensolaris.org/pipermail/zfs-discuss/2009-November/033672.html
>
> On Mon Nov 9 14:26:54 PST 2009, James C. McPherson wrote:
>> The flag days page has not been updated since the switch
>> to XWiki, it's on my todo list but I don't have an ETA
>> for when it'll be done.
>
> Perhaps anyone interested in seeing the flag days page
> resurrected can petition James to raise the priority on
> his todo list.

Nigel,
*everybody* is interested in the flag days page. Including me.
Asking me to "raise the priority" is not helpful.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp    http://www.jmcp.homeunix.com/blog
Hi James

James C. McPherson wrote:
> *everybody* is interested in the flag days page. Including me.
> Asking me to "raise the priority" is not helpful.

From my perspective, it's a surprise that 'everybody' is interested, as I'm not seeing a lot of people complaining that the flag days page is not updating. Only a couple of people on this list, and one of those is me! Perhaps I'm looking in the wrong places.

I'm prepared to admit that I may well have misjudged the situation, due to my lack of a full overview. I'm sorry if my forum posts regarding this have not been helpful, as my only intention was to try to be helpful.

Best Regards
Nigel Smith
-- 
This message posted from opensolaris.org
Hi,

>> *everybody* is interested in the flag days page. Including me.
>> Asking me to "raise the priority" is not helpful.
>
> From my perspective, it's a surprise that 'everybody' is interested, as I'm
> not seeing a lot of people complaining that the flag days page is not updating.
> Only a couple of people on this list, and one of those is me!
> Perhaps I'm looking in the wrong places.

I used this page frequently, too. But now I'm just using the twitter account fed by onnv-notify. You can find it at http://twitter.com/codenews ...

Regards
Joerg
Say I end up with a handful of unrecoverable bad blocks that just so happen to be referenced by ALL of my snapshots (in some file that's been around forever). Say I don't care about the file or two in which the bad blocks exist. Is there any way to purge those blocks from the pool (and all snapshots) without having to restore the whole pool from backup?
-- 
This message posted from opensolaris.org
On Tue, Nov 10, 2009 at 2:40 PM, BJ Quinn <bjquinn at seidal.com> wrote:
> Say I end up with a handful of unrecoverable bad blocks that just so happen
> to be referenced by ALL of my snapshots (in some file that's been around
> forever). Say I don't care about the file or two in which the bad blocks
> exist. Is there any way to purge those blocks from the pool (and all
> snapshots) without having to restore the whole pool from backup?

No. The whole point of a snapshot is to keep a consistent on-disk state from a certain point in time. I'm not entirely sure how you managed to corrupt blocks that are part of an existing snapshot though, as they'd be read-only. The only way that should even be able to happen is if you took a snapshot after the blocks were already corrupted. Any new writes would be allocated from new blocks.

--Tim
On Tue, Nov 10, 2009 at 03:04:24PM -0600, Tim Cook wrote:
> No. The whole point of a snapshot is to keep a consistent on-disk state
> from a certain point in time. I'm not entirely sure how you managed to
> corrupt blocks that are part of an existing snapshot though, as they'd be
> read-only.

Physical corruption of the media.
Something outside of ZFS diddling bits on storage.

> The only way that should even be able to happen is if you took a
> snapshot after the blocks were already corrupted. Any new writes would be
> allocated from new blocks.

It can be corrupted while it sits on disk. Since it's read-only, you can't force it to allocate anything and clean itself up.

-- 
Darren
On Tue, Nov 10, 2009 at 3:19 PM, A Darren Dunham <ddunham at taos.com> wrote:
> On Tue, Nov 10, 2009 at 03:04:24PM -0600, Tim Cook wrote:
>> No. The whole point of a snapshot is to keep a consistent on-disk state
>> from a certain point in time. I'm not entirely sure how you managed to
>> corrupt blocks that are part of an existing snapshot though, as they'd be
>> read-only.
>
> Physical corruption of the media.
> Something outside of ZFS diddling bits on storage.
>
>> The only way that should even be able to happen is if you took a
>> snapshot after the blocks were already corrupted. Any new writes would be
>> allocated from new blocks.
>
> It can be corrupted while it sits on disk. Since it's read-only, you
> can't force it to allocate anything and clean itself up.

You're telling me a scrub won't actively clean up corruption in snapshots? That sounds absolutely absurd to me.

--Tim
On Tue, Nov 10, 2009 at 03:33:22PM -0600, Tim Cook wrote:
> You're telling me a scrub won't actively clean up corruption in snapshots?
> That sounds absolutely absurd to me.

Depends on how much redundancy you have in your pool. If you have no mirrors, no RAID-Z, and no ditto blocks for data, well, you have no redundancy, and ZFS won't be able to recover affected files.

Nico
--
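[To make that concrete, a short sketch of the redundancy knobs being referred to. Pool, device and dataset names are placeholders; note that copies=2 only affects blocks written after it is set, and does not protect against losing the whole disk.]

    # mirror (or raidz) redundancy is chosen when the vdev is created:
    zpool create tank mirror c1t2d0 c1t3d0

    # ditto blocks for user data can be requested per filesystem
    # (metadata already gets extra copies automatically):
    zfs set copies=2 tank/important

    # with either in place, a scrub can rewrite damaged blocks from a
    # good copy; without redundancy it can only detect and report them:
    zpool scrub tank
    zpool status -v tank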
I believe it was physical corruption of the media. The strange thing is that the last time it happened to me, it also managed to replicate the bad blocks over to my backup server, replicated with SNDR...

And yes, it IS read-only, and a scrub will NOT actively clean up corruption in snapshots. It will DETECT corruption, but if it's unrecoverable, that's that. It's unrecoverable.

If there's not enough redundancy in the pool, I'm OK with the data not being recoverable. But wouldn't there be a way to purge out the bad blocks if, for example, they were only in a single bad file out of millions of files, and I didn't care about the file in question? I don't want to recover the file; I want a working version of my pool+snapshots minus the tiny bit that was obviously corrupt.

Barring another solution, I'd have to take the pool in question, delete the bad file, and delete ALL the snapshots. Then restore the old snapshots from backup to another pool, and copy the current data from the pool that had the problem over to the new pool (roughly the procedure sketched below). That way I can get most of my snapshots back, with the best known current data sitting on top as the active data set. The problem is that with hundreds of snapshots plus compression, zfs send/recv takes over 24 hours to restore a full backup like that to a new storage device. Last time this happened to me, I just had to say goodbye to all my snapshots and deal with it, all over a couple of kilobytes of temp files.
-- 
This message posted from opensolaris.org
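[Roughly what that rebuild looks like, sketched with placeholder names: "tank" is the damaged pool, "newtank" the replacement, and "backuphost"/"backup" the replicated copy. The expensive step is the full send/recv of the snapshot history, exactly as described above.]

    # 1. On the damaged pool: remove the corrupt file and, unavoidably,
    #    every snapshot that still references its bad blocks.
    rm /tank/data/corrupt.tmp
    zfs list -H -t snapshot -o name -r tank/data | while read snap; do
        zfs destroy "$snap"
    done

    # 2. Rebuild the snapshot history on a fresh pool from the backup
    #    (-R sends the dataset together with all of its snapshots).
    #    This is the step that can take a day or more.
    ssh backuphost "zfs send -R backup/data@latest" | zfs recv -F newtank/data

    # 3. Copy the current, cleaned-up data on top as the new active set.
    rsync -a /tank/data/ /newtank/data/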